x/pkgsite: select healthy DB per request #40444

Open
jba opened this issue Jul 28, 2020 · 3 comments

@jba jba commented Jul 28, 2020

Currently, our service processes connect to a single DB on startup. If that DB enters a bad state while running, the process continues to run, serving 500s.

Instead, we should pick from a set of healthy DBs on each request.

@jba jba self-assigned this Jul 28, 2020
@gopherbot gopherbot added this to the Unreleased milestone Jul 28, 2020

@gopherbot gopherbot commented Jul 28, 2020

Change https://golang.org/cl/244603 mentions this issue: internal/frontend: get a DataSource on each request

gopherbot pushed a commit to golang/pkgsite that referenced this issue Jul 28, 2020
Reorganize the server so that each request gets its own DataSource,
instead of using a single DataSource for every request.

Currently, the behavior doesn't change because we do in fact use
the same DataSource for every request. But this paves the way
to having a pool of health-checked DB connections, while still
having each request work with a single connection.

For golang/go#40444.

Change-Id: I717450593a8dcfd5689a8d28f634324776305042
Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/244603
Reviewed-by: Julie Qiu <julie@golang.org>

@jba jba commented Sep 11, 2020

Alternative to picking a healthy DB connection per request:

  1. On startup, we connect to one of the configured DBs. All requests use the same connection. (This is how it works now.)
  2. We implement a health check that pings the DB.

This will cause the health check watcher (AppEngine or the GKE load balancer) to quickly kill processes that are connected to a bad DB, and start new ones.


@heschik heschik commented Sep 14, 2020

That approach will work well assuming a failure mode where one database becomes completely unavailable. If instead all the databases become 5% unavailable, e.g. because of a bug that overloads them all, the instances won't find a database that works all the time. Depending on the exact characteristics of the failure, how fast the instances get killed, and how fast they restart, that could end up turning a minor outage into a very large one.

There's also some risk in pointing each instance at a single database: after an outage on one replica, all the frontend instances will be pointed at the other replica.

Since the failures we've seen so far are (AFAIK) total outages of one replica, this might be fine. But an approach that doesn't crash would be better in all respects, except perhaps for development ease.

@gopherbot gopherbot added the go.dev label Sep 18, 2020