Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle issue with inconsistent zxid on reconnection #28

Merged
merged 5 commits into from
Mar 25, 2019

Conversation

nikitagromov
Copy link
Contributor

@nikitagromov nikitagromov commented Mar 24, 2019

We can get an infinite loop of reconnections when zxid was changed on the server and we attempt to connect with old zxid

@Go0dYear
Copy link

Have the same issue

@cybergrind
Copy link
Member

hey @nikitagromov! thank you for your contribution, but I believe that this particular case requires some additional tests.
As far as I can see, this case is responsible for network errors that usually shouldn't be related to txid change on the server, it looks like your fix will abruptly close session even without self.state.transition_to(States.LOST), probably in cases where the session is still on the server.

So I prefer to see some tests for that case and then we can decide where it should be fixed.

@nikitagromov
Copy link
Contributor Author

nikitagromov commented Mar 25, 2019

@cybergrind if you attempt to connect to zookeeper server with wrong zxid server will close your connection without any response, so you can't detect if session still present on server or not.
Log from server

2019-03-25 09:04:58,863 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.1.1:33253
2019-03-25 09:04:58,864 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@901] - Refusing session request for client /192.168.1.1:33253 as it has seen zxid 0x756b5b2 our last zxid is 0x7007e client must try another server
2019-03-25 09:04:58,864 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /192.168.1.1:33253 (no session established for client)

I will add tests for reproducing this issue

@cybergrind
Copy link
Member

In that case probably we cannot reach state when: if response.session_id == 0: # invalid session, probably expired
And probably self.state.transition_to(States.LOST) still makes sense for case when we have network errors

@nikitagromov
Copy link
Contributor Author

nikitagromov commented Mar 25, 2019

Yep, because we don't have response from server :)

I agree with you that I should change state to self.state.transition_to(States.LOST)

@nikitagromov
Copy link
Contributor Author

nikitagromov commented Mar 25, 2019

@cybergrind I've updated PR. Also I've added additional error log for case when repair_loop task failed on some unexpectable exception

aiozk/session.py Outdated
await self.set_existing_watches()
self.conn.start_read_loop()
await self.set_existing_watches()
except Exception as error:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which errors do we expect here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the solely purpose of all this try/catch block is to log error, probably self.repair_loop_task.add_done_callback with logging error will look cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep 👍

await zk.start()
# simulate failed connection
await zk.session.close()
zk.session.last_zxid = 1231231241312312
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I comment out this string and run the test on codebase without any fixes - it doesn't work.
Probably we need a test that doesn't work if we override zxid and works if we don't override it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should fail with timeout exception, I will check it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean code without this line should pass the test.

# zk.session.last_zxid = 1231231241312312

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On codebase without fixes it will not work even if we comment zk.session.last_zxid. We can't reconnect to zk because of session holds in closing state. I will add test which checks only session reconnect

@cybergrind
Copy link
Member

cybergrind commented Mar 25, 2019 via email

@nikitagromov
Copy link
Contributor Author

nikitagromov commented Mar 25, 2019

@cybergrind

  1. comment line https://github.com/tipsi/aiozk/blob/master/aiozk/session.py#L143
  2. add
        self.closing = False
        self.started = False

to end of def close method

and test:

@pytest.mark.asyncio
async def test_correct_reconnection():
    async def coro():
        zk = get_client()
        await zk.start()
        # simulate failed connection
        await zk.session.close()
       # zk.session.last_zxid = 1231231241312312
        await zk.session.start()
        await zk.session.close()
    try:
        await asyncio.wait_for(coro(), timeout=10)
    except asyncio.TimeoutError as exc:
        pytest.fail("Failed with timeout on session reconnection attemt")

will pass on codebase without fix

@cybergrind cybergrind merged commit dab98e3 into micro-fan:master Mar 25, 2019
@cybergrind
Copy link
Member

@nikitagromov thank you for your contribution. it was a pleasure to work with you =)
Uploaded it as 0.12.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants