
Djanicek/core 1481/debug fix #8639

Merged: 12 commits into master on Mar 8, 2023

Conversation

djanicekpach (Contributor):

There are two issues with the upgrade load tests (there are more, but the scope here is these two EOF errors):

  1. The auth interceptor fails without a registered enterprise server.
  2. pachw can connect to the old postgres if the tests start running before the helm upgrade is complete; any DB connections from pachw then fail.

To solve this, we:

  1. Register the enterprise server when we buildAndRun pachw.
  2. Wait for postgres to be ready before running the load test (see the sketch after this list). This follows what we do with other pods on startup and may also help stabilize other upgrade tests.
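
For context, a minimal sketch of what "wait for a labeled pod" can look like with client-go. Illustrative only: the PR's actual helper (waitForLabeledPod, shown in the diff below) retries via the repo's require.NoErrorWithinTRetry, and the interval/timeout values here are assumptions.

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLabeledPod blocks until at least one pod matching label is Running
// and Ready, or the timeout elapses. Sketch only; not the PR's exact code.
func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kubernetes.Clientset, namespace, label string) {
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: label})
			if err != nil {
				return false, nil // treat transient API errors as retryable
			}
			for _, p := range pods.Items {
				if p.Status.Phase != corev1.PodRunning {
					continue
				}
				for _, c := range p.Status.Conditions {
					if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
						return true, nil
					}
				}
			}
			return false, nil
		})
	if err != nil {
		t.Fatalf("pod with label %q in namespace %q never became ready: %v", label, namespace, err)
	}
}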

Considered but not done:

I looked into adding retries to the load tests or upgrade tests to address issue 2:

  • It is difficult to retry in the upgrade_test itself because the load tests don't fail atomically (they start a commit and potentially create repos), so a retry from the same state never works.
  • Retrying in the load test means we have to restart the worker process to get a new DB client, which is slower; I'm also worried about the effects of adding retries to a test that is designed to measure duration and was not built primarily for the upgrade test. I did get this working by calling panic() in the pfsload worker on error, but simply waiting for postgres seems like the cleanest fix for now.

Idiosyncrasies in the PR

  • WaitSeconds is now 0. It was 10, probably because someone noticed the DB errors but didn't know what was up. Now that we wait correctly, we can toss the artificial wait.
  • CleanupAfter is now false. This lets the CI pipeline actually dump the k8s logs and pods on failure, which makes debugging easier, and it does not appear to have any functional impact (and it shouldn't, based on code flow). Better debugging is nice.
  • I synced registerAuthServer a bit with pachd, but stopped short of setting public=true since it wasn't necessary to fix issue 1.

jrockway (Member) left a comment:

Looks good. Some comments.

waitForLabeledPod(t, ctx, kubeClient, namespace, "app=pg-bouncer")
}

func waitForPostgres(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string) {
jrockway (Member):

Isn't this as race-y as pachw starting to use postgres mid-upgrade? I think the upgrade will have to apply new labels to pods, and you'll have to wait for the new labels (app=postgres + upgrade=this-is-one). Otherwise, this wait could find the un-upgraded postgres that is about to restart, I think.

djanicekpach (Contributor, author):

Yeah, I was aware of this potential, but I think:

  • All of the other waitFor functions are affected by this as well, so it's not a new issue. I made a ticket to address this holistically in the future: https://pachyderm.atlassian.net/browse/CORE-1536
  • postgres goes into Terminating very quickly; it just takes a while to spin down and back up. Since we check postgres last (after loki, pachd, and pg-bouncer), the chance that postgres isn't at least going down is really small. Watching locally, at least, we aren't even close to hitting this race condition.

I could attempt to find a way to get the controller-revision-hash and confirm it's good, but I think it's pragmatic to get this in and handle that in bulk with CORE-1536.
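
For the record, a hypothetical sketch of that controller-revision-hash check (not part of this PR; the StatefulSet name "postgres" is an assumption, and imports are as in the earlier sketch):

// Hypothetical: report whether any postgres pod was created by the
// StatefulSet's current (post-upgrade) revision.
func podMatchesCurrentRevision(ctx context.Context, kubeClient *kubernetes.Clientset, namespace string) (bool, error) {
	sts, err := kubeClient.AppsV1().StatefulSets(namespace).Get(ctx, "postgres", metav1.GetOptions{}) // name assumed
	if err != nil {
		return false, err
	}
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=postgresql",
	})
	if err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		// StatefulSet pods carry the revision that created them in this label.
		if p.Labels["controller-revision-hash"] == sts.Status.UpdateRevision {
			return true, nil
		}
	}
	return false, nil
}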

waitForLabeledPod(t, ctx, kubeClient, namespace, "app.kubernetes.io/name=postgresql")
}

func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string, label string) {
jrockway (Member):

btw, it would be nice if this t.Log'd every retry: which label selector it's waiting on, and the status of the matches (pod phase if != running, which container is not ready, etc.).
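
Perhaps something like this sketch of the suggested logging (hypothetical helper, imports as in the earlier sketch; as the thread below notes, require.NoErrorWithinTRetry turned out to cover this already):

// Sketch only: log why a label selector is still being waited on.
func logWaitStatus(t testing.TB, label string, pods []corev1.Pod) {
	for _, p := range pods {
		if p.Status.Phase != corev1.PodRunning {
			t.Logf("label %q: pod %s is in phase %s", label, p.Name, p.Status.Phase)
		}
		for _, c := range p.Status.ContainerStatuses {
			if !c.Ready {
				t.Logf("label %q: pod %s container %s is not ready", label, p.Name, c.Name)
			}
		}
	}
}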

djanicekpach (Contributor, author):

good idea, will do.

djanicekpach (Contributor, author):

Edit: it looks like all this info is already logged by require.NoErrorWithinTRetry, so I will refrain from logging it a second time.

@@ -59,6 +85,7 @@ func (pachwb *pachwBuilder) buildAndRun(ctx context.Context) error {
pachwb.initInternalServer,
pachwb.registerAuthServer,
pachwb.registerPFSServer, //PFS seems to need a non-nil auth server.
pachwb.registerEnterpriseServer,
jrockway (Member):

The other pachs register enterprise before auth and I think we should do that here. auth also requires an identity server, but I guess we don't call methods that actually dereference the identity server? dunno.

djanicekpach (Contributor, author):

Changed. Yeah, with auth on I think I did have to add an identity server at some point, but this configuration hasn't needed it. I'll change the ordering (and add back auth registration to enterprise, which I believe is the reason for the order).
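
The reordered steps would presumably look like this sketch, based on the diff above (not the final committed code):

pachwb.initInternalServer,
pachwb.registerEnterpriseServer, // register enterprise before auth, as pachd does
pachwb.registerAuthServer,
pachwb.registerPFSServer, // PFS seems to need a non-nil auth server.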

}))
t.Logf("preUpgrade done; starting postUpgrade")
postUpgrade(t, minikubetestenv.UpgradeRelease(t,
context.Background(),
ns,
k,
&minikubetestenv.DeployOpts{
WaitSeconds: 10,
jrockway (Member):

I have a feeling that WaitSeconds was added only for this. It has no other users and we should remove it. (Also, it should have been Wait as a time.Duration.)
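
That is, something along these lines (a hypothetical shape, with DeployOpts' other fields elided):

// Hypothetical shape for the option, were it kept:
type DeployOpts struct {
	// ...
	Wait time.Duration // e.g. Wait: 10 * time.Second, instead of WaitSeconds: 10
}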

-	WaitSeconds: 10,
-	CleanupAfter: true,
+	WaitSeconds: 0,
+	CleanupAfter: false,
jrockway (Member):

In general I would omit the zero value in cases like this. CleanupAfter is false even if you don't have this line of code.
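
In other words, the literal could drop both fields entirely (illustrative):

&minikubetestenv.DeployOpts{
	// WaitSeconds and CleanupAfter omitted: they default to 0 and false.
}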

djanicekpach merged commit 6abe2be into master on Mar 8, 2023
bbonenfant pushed a commit that referenced this pull request on Mar 9, 2023: "fix postgres timing and grpc calls in pachw for upgrade tests"
djanicekpach added a commit that referenced this pull request on Mar 10, 2023: "fix postgres timing and grpc calls in pachw for upgrade tests"