Djanicek/core 1481/debug fix #8639
Conversation
Looks good. Some comments.
	waitForLabeledPod(t, ctx, kubeClient, namespace, "app=pg-bouncer")
}

func waitForPostgres(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string) {
Isn't this as race-y as pachw starting to use postgres mid-upgrade? I think the upgrade will have to apply new labels to pods, and you'll have to wait for the new labels (app=postgres + upgrade=this-is-one). Otherwise, this wait could find the un-upgraded postgres that is about to restart, I think.
Yeah, I was aware of this potential, but I think:
- all of the other waitFor functions are affected by this as well, so it's not a new issue. I made a ticket to address this holistically in the future: https://pachyderm.atlassian.net/browse/CORE-1536
- postgres goes into terminating very quickly; it just takes a while to spin down and back up. Since we check postgres last (after loki, pachd, and pg-bouncer), the chance that postgres isn't at least going down is really small. Watching locally, at least, we aren't even close to hitting this race condition.

I could attempt to find a way to get the controller-revision-hash and confirm it's good (roughly the shape sketched below), but I think it's pragmatic to get this in and handle that in bulk with CORE-1536.
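For reference, a rough sketch of that controller-revision-hash check, assuming postgres runs as a StatefulSet named "postgres" and talking to client-go directly. The helper name, the StatefulSet name, and the fact that this would run inside the test's existing retry loop are assumptions, not code from this PR:

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kube "k8s.io/client-go/kubernetes"
)

// postgresOnLatestRevision returns nil only once every postgres pod carries the
// StatefulSet's current update revision, so a wait built on it cannot be
// satisfied by a pre-upgrade pod that is about to restart.
func postgresOnLatestRevision(ctx context.Context, kubeClient *kube.Clientset, namespace string) error {
	sts, err := kubeClient.AppsV1().StatefulSets(namespace).Get(ctx, "postgres", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get postgres statefulset: %w", err)
	}
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=postgresql",
	})
	if err != nil {
		return fmt.Errorf("list postgres pods: %w", err)
	}
	if len(pods.Items) == 0 {
		return fmt.Errorf("no postgres pods yet")
	}
	for _, pod := range pods.Items {
		// StatefulSet pods are stamped with controller-revision-hash; a pod still
		// running the old revision won't match Status.UpdateRevision.
		if got := pod.Labels["controller-revision-hash"]; got != sts.Status.UpdateRevision {
			return fmt.Errorf("pod %s on revision %q, want %q", pod.Name, got, sts.Status.UpdateRevision)
		}
	}
	return nil
}
```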
	waitForLabeledPod(t, ctx, kubeClient, namespace, "app.kubernetes.io/name=postgresql")
}

func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string, label string) {
btw, it would be nice if this t.Log'd every retry; which label selector it's waiting on, and the status of the matches (pod phase if != running, which container is not ready, etc.).
good idea, will do.
Edit: it looks like all this info is already logged by require.NoErrorWithinTRetry, so I will refrain from doing it a second time.
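For illustration only (not added in this PR, since the retry wrapper already logs this), per-retry logging along the lines of the suggestion above could look roughly like the following; the polling loop, the 5-second interval, and the readiness logic are assumptions about the helper's shape, not its actual implementation:

```go
import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kube "k8s.io/client-go/kubernetes"
)

// Hypothetical shape only; the merged helper leans on require.NoErrorWithinTRetry
// for this logging instead.
func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace, label string) {
	for {
		ready := false
		pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: label})
		if err != nil {
			t.Logf("waiting on %q: list error: %v", label, err)
		} else {
			ready = len(pods.Items) > 0
			for _, p := range pods.Items {
				if p.Status.Phase != corev1.PodRunning {
					t.Logf("waiting on %q: pod %s phase=%s", label, p.Name, p.Status.Phase)
					ready = false
				}
				for _, cs := range p.Status.ContainerStatuses {
					if !cs.Ready {
						t.Logf("waiting on %q: pod %s container %s not ready", label, p.Name, cs.Name)
						ready = false
					}
				}
			}
		}
		if ready {
			return
		}
		select {
		case <-ctx.Done():
			t.Fatalf("timed out waiting for %q: %v", label, ctx.Err())
		case <-time.After(5 * time.Second):
		}
	}
}
```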
src/internal/pachd/pachw.go (outdated)
@@ -59,6 +85,7 @@ func (pachwb *pachwBuilder) buildAndRun(ctx context.Context) error {
 	pachwb.initInternalServer,
 	pachwb.registerAuthServer,
 	pachwb.registerPFSServer, //PFS seems to need a non-nil auth server.
+	pachwb.registerEnterpriseServer,
The other pachds register enterprise before auth, and I think we should do that here. auth also requires an identity server, but I guess we don't call methods that actually dereference the identity server? Dunno.
Changed. Yeah, with auth on I think I did have to add an identity server at some point, but this configuration hasn't needed it. I'll change the ordering (and add back auth registration to enterprise, which I believe is the reason for the order); see the sketch below.
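A minimal sketch of the reordering being discussed, assuming the builder steps are plain func(context.Context) error values collected into a slice; only the relative position of registerEnterpriseServer is the point here, the rest mirrors the diff above and the merged list may differ:

```go
// Illustrative ordering only; not the exact merged step list.
steps := []func(context.Context) error{
	pachwb.initInternalServer,
	pachwb.registerEnterpriseServer, // enterprise first, matching the other pachds
	pachwb.registerAuthServer,       // so auth can be registered with enterprise
	pachwb.registerPFSServer,        // PFS seems to need a non-nil auth server.
}
```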
	}))
	t.Logf("preUpgrade done; starting postUpgrade")
	postUpgrade(t, minikubetestenv.UpgradeRelease(t,
		context.Background(),
		ns,
		k,
		&minikubetestenv.DeployOpts{
			WaitSeconds: 10,
I have a feeling that WaitSeconds was added only for this. It has no other users and we should remove it. (Also, it should have been Wait as a time.Duration.)
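For what it's worth, a sketch of the Wait-as-a-Duration alternative mentioned above; the field set on DeployOpts and how the helper would consume Wait are assumptions, not the actual minikubetestenv API:

```go
import "time"

// Sketch only: not the real DeployOpts definition.
type DeployOpts struct {
	Wait         time.Duration // how long to wait after deploying; zero means don't wait
	CleanupAfter bool
	// ...whatever other options the helper already takes...
}

// The call site would then read:
//   &DeployOpts{Wait: 10 * time.Second}
// instead of:
//   &DeployOpts{WaitSeconds: 10}
```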
src/testing/deploy/upgrade_test.go (outdated)
-	WaitSeconds:  10,
-	CleanupAfter: true,
+	WaitSeconds:  0,
+	CleanupAfter: false,
In general I would omit the zero value in cases like this. CleanupAfter is false even if you don't have this line of code.
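That is, inside the test these two literals produce identical values (reusing the names from the diff above purely for illustration):

```go
// The zero-valued fields add nothing but noise.
verbose := &minikubetestenv.DeployOpts{WaitSeconds: 0, CleanupAfter: false}
concise := &minikubetestenv.DeployOpts{}
fmt.Println(reflect.DeepEqual(verbose, concise)) // prints: true
```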
Force-pushed from 513834a to e7391d7.
* fix postgres timing and grpc calls in pachw for upgrade tests
There are two issues with the upgrade load tests (there are more, but the scope here is kept to these two EOF errors).
To solve this we
Considered but not done:
I looked into adding retries into the load tests or upgrade tests somewhere to address issue 2; a rough sketch of what that could have looked like is below.
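For context only, the rough shape such a retry wrapper might have taken (not added in this PR; how the transient EOF surfaces from the client and the retry budget are assumptions):

```go
import (
	"context"
	"strings"
	"time"
)

// withEOFRetry re-runs f up to attempts times, backing off briefly, as long as
// the failure looks like a transient EOF. Detecting the EOF by substring is a
// guess at how the error would surface, not how the tests actually behave today.
func withEOFRetry(ctx context.Context, attempts int, f func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil || !strings.Contains(err.Error(), "EOF") {
			return err
		}
		select {
		case <-ctx.Done():
			return err
		case <-time.After(time.Second):
		}
	}
	return err
}
```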
Idiosyncrasies in the PR