App Engine deployment of Locate with --promote causes brief outage #114

Open
stephen-soltesz opened this issue Feb 13, 2023 · 2 comments

Comments

@stephen-soltesz
Contributor

When Locate is deployed to App Engine with the --promote flag, all of the heartbeat services are briefly disconnected and must re-register with the Locate server before the set of healthy servers is repopulated. This takes a surprisingly long time.

Screen Shot 2023-02-13 at 10 29 08 AM

Unfortunately, the App Engine Flexible environment does not support automatic migration (as the Standard environment does).

However, we should still be able to create a tool that performs a gradual migration using Traffic Splitting.

It may also be possible to improve the shutdown / warmup mechanism for Locate to "hand off" from one version to the next more gracefully.

@stephen-soltesz
Contributor Author

An incremental split appears to work as intended. From 19:35 to 20:10, a 10% increase every 5 min resulted in no visible decrease in test rates or Locate connections. The increase in traffic over that period was due to hourly client traffic, not the deployment.

Screen Shot 2023-02-13 at 3 26 04 PM
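
For reference, the stepped split described above could be scripted roughly as follows. This is only a sketch, not the tool proposed in this issue: the service name `locate`, the version names, and the choice of `--split-by=random` are assumptions for illustration. It builds the sequence of splits and the matching `gcloud app services set-traffic` commands, and only prints them.

```python
from typing import Dict, List


def split_schedule(old: str, new: str, step_percent: int = 10) -> List[Dict[str, float]]:
    """Return the sequence of traffic splits that shifts traffic from old to new."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    schedule = []
    pct = step_percent
    while pct < 100:
        schedule.append({old: (100 - pct) / 100, new: pct / 100})
        pct += step_percent
    # Final step: all traffic goes to the new version.
    schedule.append({new: 1.0})
    return schedule


def set_traffic_command(service: str, split: Dict[str, float]) -> List[str]:
    """Build the gcloud argument list for one step of the split."""
    splits = ",".join(f"{version}={share:.2f}" for version, share in split.items())
    # --split-by=random is an assumption; ip and cookie are the other options.
    return ["gcloud", "app", "services", "set-traffic", service,
            f"--splits={splits}", "--split-by=random", "--quiet"]


if __name__ == "__main__":
    # Hypothetical service and version names.
    for step in split_schedule("v-old", "v-new", step_percent=10):
        print(" ".join(set_traffic_command("locate", step)))
        # A real tool would run the command (e.g. with subprocess.run) and
        # then wait about 5 minutes before applying the next step.
```

Printing rather than executing keeps the sketch safe to run anywhere. The new version would need to be deployed with `--no-promote` first so that it receives no traffic until the first step.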

@stephen-soltesz
Contributor Author

During the original event on 2023-02-09, it took over 1 hr for all servers to re-register (see image below). We see the same slow recovery in staging. @cristinaleonr suspects this is due to the heartbeat service's exponential backoff and plans to add metrics to the heartbeat service (hbs) so that we can see both the node-side and Locate-side metrics.

Screen Shot 2023-02-21 at 3 44 44 PM
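
To illustrate why exponential backoff could stretch out recovery, if that suspicion holds, here is a minimal sketch of how retry delays grow. The initial delay, factor, and cap are made-up values for illustration, not the heartbeat service's actual settings.

```python
from typing import List


def backoff_delays(initial: float, factor: float, max_delay: float, attempts: int) -> List[float]:
    """Return the wait before each retry under capped exponential backoff."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays


if __name__ == "__main__":
    # Hypothetical parameters: start at 1 s, double each time, cap at 60 s.
    delays = backoff_delays(initial=1.0, factor=2.0, max_delay=60.0, attempts=8)
    print(delays)
    print(f"total wait after {len(delays)} failed attempts: {sum(delays)} s")
```

The point of the sketch is that a client whose delay has already grown during the outage will not retry until its current wait expires, even if Locate came back much earlier. A large or missing cap would make that gap correspondingly longer.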

Also notable: during the manually split deployment to production on the 13th, we see no disruption to the available healthy server counts.

Screen Shot 2023-02-21 at 3 43 39 PM
