x/build/cmd/coordinator: active gomote sessions and coordinator deploys are mutually exclusive #39280

Open
dmitshur opened this issue May 27, 2020 · 2 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

@dmitshur
Contributor

dmitshur commented May 27, 2020

There are times when people need to investigate issues that require a configuration or environment that is hard to reproduce locally, and they use gomote instances for such debugging sessions.

A limitation of the current implementation of this system is that all remote buildlets are lost when cmd/coordinator restarts. This limitation is documented in doc/remote-buildlet.txt:

Currently, if the coordinator dies or restarts, all buildlets are lost.

An unfortunate consequence is that this can reduce the window of time when cmd/coordinator can be re-deployed without disrupting investigative work done by others.

This can generally be worked around by coordinating with people who are using gomotes to find a good time for a deploy, but it scales poorly when there are more concurrent investigations being done, especially during business hours.

This issue tracks how much of a problem this is, and is a place to investigate ways to improve the situation if it ends up becoming a bottleneck.

/cc @cagedmantis @toothrot @andybons

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 27, 2020
@dmitshur dmitshur added this to the Unreleased milestone May 27, 2020
@dmitshur
Contributor Author

An ultimate solution would be a mechanism that makes it possible to redeploy coordinator without losing gomote sessions, or at least makes it possible to re-connect to them without their state getting lost.

Short of that solution, if finding a deploy window becomes difficult and starts to block other work, then perhaps some sort of queue can be arranged so that a future deploy can be scheduled in advance, and people don't start new gomote sessions just as older ones wrap up.

In my experience this isn't a huge problem that warrants spending a lot of time on yet, but I just wanted to share some initial thoughts.

@toothrot
Contributor

An interim solution could be to break out the gomote ssh proxy endpoint into its own service, as it likely needs to be deployed less frequently than the coordinator itself. This would still require some persistent state to be managed across coordinator deploys, though.
