New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential growth of watchers when ZK client disconnects/reconnects #10

Open
jonpither opened this Issue Jan 9, 2014 · 2 comments

Comments

Projects
None yet
2 participants
@jonpither

jonpither commented Jan 9, 2014

Hi,

If for whatever reason the ZK client drops a connnection and reconnects before the SESSION_EXPIRY is raised, the number of watchers managed by the client will double each time.

During some network issues we had a few processes grind down to a halt because of this.

This is because Avout raises a watcher to manage the internal state of atoms and refs. I.e. see invalidateCache in avout.refs.

What should happen here is that each time the watcher is triggered, it should filter out events like {:event-type :None, :keeper-state :Disconnected}, perhaps by only ever being interested in event-type NodeDataChanged.

When these events get raised Avout currently registers brand new watchers even though the original watcher will still carry on firing. In ZK watchers are "fire-once" but not in the case of some events like the disconnect/connect.

This is the same for addWatch in refs and atoms.

holguinj added a commit to holguinj/avout that referenced this issue Jun 10, 2014

Corrects implementation of removeWatch (issue #11)
As @jafingerhut pointed out, there are two apparently incorrect uses
of swap!, one in atoms.clj and one in refs.clj. In both cases, the
error is in the removeWatch implementation. This resolves #11, and
possibly issues #3 and #10.

holguinj added a commit to holguinj/avout that referenced this issue Jun 11, 2014

Corrects implementation of removeWatch (issue #11)
As @jafingerhut pointed out, there are two apparently incorrect uses
of swap!, one in atoms.clj and one in refs.clj. In both cases, the
error is in the removeWatch implementation. This resolves #11, and
possibly issues #3 and #10.

holguinj added a commit to holguinj/avout that referenced this issue Jun 11, 2014

Corrects implementation of removeWatch (issue #11)
As @jafingerhut pointed out, there are two apparently incorrect uses
of swap!, one in atoms.clj and one in refs.clj. In both cases, the
error is in the removeWatch implementation (shown to be broken by the
new tests in my previous commit). This commit fixes resolves #11 and
resolves #3. I can't tell for sure yet, but I suspect that it might
help with #10.
@holguinj

This comment has been minimized.

Show comment
Hide comment
@holguinj

holguinj Jun 30, 2014

Contributor

Hi @jonpither,

I know it's been a while and I'm not sure if this is something you're still running into, but I've got some time and I'd like to see if I can fix this issue.

I think I've got the cause nailed down, but I could use some help reproducing the issue so I can be sure. Can you point me in the right direction?

Contributor

holguinj commented Jun 30, 2014

Hi @jonpither,

I know it's been a while and I'm not sure if this is something you're still running into, but I've got some time and I'd like to see if I can fix this issue.

I think I've got the cause nailed down, but I could use some help reproducing the issue so I can be sure. Can you point me in the right direction?

@jonpither

This comment has been minimized.

Show comment
Hide comment
@jonpither

jonpither Jul 3, 2014

Hi @holguinj

It's a tough one to reproduce. You want to cause a session disconnect but NOT a session timeout. I can't recall how I programmatically nailed it down, but you can set a long-time out and effect some disconnects. I've since moved project so I don't have the sort of test infrastructural code available.

Once you've got the disconnects happening you can then query the amount of watchers registered before/after. What should happen is that the watcher gets fired on a disconnect (and removes itself as per usual), then re-registers.

What currently happens is that the watcher fires, doesn't remove itself, and re-registers itself. Hence an exponential growth.

Good luck.

jonpither commented Jul 3, 2014

Hi @holguinj

It's a tough one to reproduce. You want to cause a session disconnect but NOT a session timeout. I can't recall how I programmatically nailed it down, but you can set a long-time out and effect some disconnects. I've since moved project so I don't have the sort of test infrastructural code available.

Once you've got the disconnects happening you can then query the amount of watchers registered before/after. What should happen is that the watcher gets fired on a disconnect (and removes itself as per usual), then re-registers.

What currently happens is that the watcher fires, doesn't remove itself, and re-registers itself. Hence an exponential growth.

Good luck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment