A bad agent should not bring the console down #262

Closed
ypujante opened this Issue Apr 9, 2014 · 6 comments

@ypujante
Member

ypujante commented Apr 9, 2014

@ypujante ypujante added the bug label Apr 9, 2014

@ykorabelnikov

ykorabelnikov commented Apr 9, 2014

What would be helpful is the following behavior on the console:

  1. Handle the exception from the bad agent(s) and still continue to work otherwise.
  2. Log the exception (this already happens) but also log the agent that sent it. Then it’s trivial to go and dump the agent’s cache, look at its logs, etc.
  3. In the UI highlight the bad agent somehow. Maybe in the agents tab, maybe in the main view. This will help with inspecting the overall fabric for bad agents.

Agents can go bad every once in a while for different reasons – in software development bugs happen :) But if the console handles it gracefully, then very little harm is done. Thanks again for your help!

@ypujante

Member

ypujante commented Apr 10, 2014

I agree. Note that the fix will go in the next version of glu (5.5.x).

@ypujante ypujante added the critical label Apr 19, 2014

@ypujante

Member

ypujante commented Apr 19, 2014

I was able to reproduce the issue on my machine. I do not yet know what the problem is, but I am investigating.

@ypujante

Member

ypujante commented Apr 21, 2014

Fixed in 4.7.3 and 5.5.1

@ypujante ypujante closed this Apr 21, 2014

@ykorabelnikov

ykorabelnikov commented Apr 21, 2014

Thank you! And thanks for patching the 4.x line.

@ypujante

Member

ypujante commented Apr 21, 2014

@ykorabelnikov you are welcome.

My understanding of the bug makes me believe that it happened because during the upgrade:

  1. the agent stops and restarts.
  2. When it restarts it needs to re-instantiate the glu script.
  3. Prior to 4.6.2, the glu script was not stored locally and was being fetched from its original location (as defined in the glu model).
  4. If the original location is not accessible, then glu cannot re-instantiate the glu script and simply ignores this entry (the agent itself is fine).
  5. after booting, the agent synchronizes the filesystem with ZooKeeper (Syncing filesystem <=> ZooKeeper message in the agent log)
  6. this step was blindly loading all the states from the filesystem (which are Java serialized objects) and storing them in ZooKeeper as JSON objects
  7. the issue is that the file format changed in 4.7.1, so because of 4) and 6) the states that were ignored in 4) end up in ZooKeeper in the wrong format
  8. the console then receives this invalid data and fails

What I did to fix the issue:

  1. during the boot process, the agent upgrades the old format to the new format
  2. if a state cannot be restored, it is moved to a separate location and a "dummy" InvalidStateScript is instantiated in its place; it appears in the console with the proper stack trace so that you can identify the problem
  3. on the console side, it no longer fails if one entry cannot be read; it instantiates a dummy entry instead so that the failure is not silent
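
For readers who want a picture of the agent-side part of the fix (items 1 and 2), here is a minimal sketch in Python. It is not glu's code: glu stores states as Java serialized objects, while this sketch assumes JSON files, and `restore_states`, `upgrade`, the `version` field, and the `script`/`scriptDefinition` key names are all hypothetical. Only the name InvalidStateScript comes from the comment above.

```python
import json
import shutil
import traceback
from pathlib import Path

CURRENT_VERSION = 2  # hypothetical version number for the "new" format


def upgrade(state: dict) -> dict:
    """Convert a hypothetical old-format state (version 1) to the new one."""
    if state.get("version", 1) == CURRENT_VERSION:
        return state
    # Assumed difference: the old format kept the script under "script",
    # the new one keeps it under "scriptDefinition".
    return {
        "version": CURRENT_VERSION,
        "mountPoint": state["mountPoint"],
        "scriptDefinition": state["script"],
    }


def restore_states(state_dir: Path, invalid_dir: Path) -> list:
    """Restore every state file; quarantine the ones that cannot be read."""
    invalid_dir.mkdir(parents=True, exist_ok=True)
    restored = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(state_dir.glob("*.json")):
        try:
            restored.append(upgrade(json.loads(path.read_text())))
        except Exception:
            # Move the bad file aside and keep a visible placeholder entry
            # that carries the stack trace.
            shutil.move(str(path), str(invalid_dir / path.name))
            restored.append({
                "version": CURRENT_VERSION,
                "mountPoint": path.stem,
                "scriptDefinition": "InvalidStateScript",
                "error": traceback.format_exc(),
            })
    return restored
```

The point of the placeholder is that a state which cannot be restored stays visible with its error attached, instead of being dropped silently or written out in the wrong format.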

Technically the console should not see this problem because of the fix in the agent, but if you use the new console with an old agent, then at least the console will "survive" and continue to be operable.
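
The console-side behavior described here can be sketched roughly as follows. Again this is an illustration under assumptions, not the console's actual code: `read_agent_entries`, the `(agent, mount_point)` keying, and the `scriptState` field are hypothetical names chosen for the example.

```python
import json
import logging

log = logging.getLogger("console")


def read_agent_entries(raw_entries: dict) -> dict:
    """Parse {(agent, mount_point): json_text} without failing on a bad entry."""
    parsed = {}
    for (agent, mount_point), text in raw_entries.items():
        try:
            entry = json.loads(text)
            # Assumed validity check: a usable entry is an object that
            # carries a "scriptState" field.
            if not isinstance(entry, dict) or "scriptState" not in entry:
                raise ValueError("missing 'scriptState' field")
        except (ValueError, TypeError) as e:
            # json.JSONDecodeError is a subclass of ValueError.
            # Log which agent sent the bad data, then substitute a
            # placeholder so the entry is flagged rather than lost.
            log.error("invalid entry from agent %s at %s: %s",
                      agent, mount_point, e)
            entry = {"invalid": True, "agent": agent, "error": str(e)}
        parsed[(agent, mount_point)] = entry
    return parsed
```

Catching the error per entry rather than around the whole loop is what lets one bad agent be reported while the rest of the fabric still loads.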

Both 4.7.3 and 5.5.1 have those fixes.
