
Djanicek/core 1481/debug fix #8639

Merged: 12 commits into master on Mar 8, 2023

Conversation

djanicekpach (Contributor):

There are two issues with the upgrade load tests (there are more, but the scope here is these two EOF errors):

  1. The auth interceptor fails without a registered enterprise server.
  2. pachw can connect to the old postgres if the tests start running before the helm upgrade is complete; any DB connections from pachw then fail.

To solve this, we:

  1. Register the enterprise server when we buildAndRun pachw.
  2. Wait for postgres to be ready before running the load test (see the sketch after this list). This follows what we do with other pods on startup and may also help stabilize other upgrade tests.
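
For context, a minimal sketch of what "wait for a labeled pod" can look like with client-go. Illustrative only: the PR's actual helper (waitForLabeledPod, shown in the diff below) retries via the repo's require.NoErrorWithinTRetry, and the interval/timeout values here are assumptions.

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLabeledPod blocks until at least one pod matching label is Running
// and Ready, or the timeout elapses. Sketch only; not the PR's exact code.
func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kubernetes.Clientset, namespace, label string) {
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: label})
			if err != nil {
				return false, nil // treat transient API errors as retryable
			}
			for _, p := range pods.Items {
				if p.Status.Phase != corev1.PodRunning {
					continue
				}
				for _, c := range p.Status.Conditions {
					if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
						return true, nil
					}
				}
			}
			return false, nil
		})
	if err != nil {
		t.Fatalf("pod with label %q in namespace %q never became ready: %v", label, namespace, err)
	}
}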

Considered but not done:

I looked into adding retries to the load tests or upgrade tests to address issue 2:

  • It is difficult to retry in the upgrade_test itself because the load tests don't fail atomically (they start a commit and potentially create repos), so a retry from the same state never works.
  • Retrying in the load test means we have to restart the worker process to get a new DB client, which is slower; I'm also worried about the effects of adding retries to a test that is designed to measure duration and was not built primarily for the upgrade test. I did get this working by calling panic() in the pfsload worker on error, but simply waiting for postgres seems like the cleanest fix for now.

Idiosyncrasies in the PR

  • WaitSeconds is now 0. It was 10, probably because someone noticed the DB errors but didn't know what was up. Now that we wait correctly, we can toss the artificial wait.
  • CleanupAfter is now false. This lets the CI pipeline actually dump the k8s logs and pods on failure, which makes debugging easier, and it does not appear to have any functional impact (and it shouldn't, based on code flow). Better debugging is nice.
  • I synced registerAuthServer a bit with pachd, but stopped short of setting public=true since it wasn't necessary to fix issue 1.

jrockway (Member) left a comment:

Looks good. Some comments.

waitForLabeledPod(t, ctx, kubeClient, namespace, "app=pg-bouncer")
}

func waitForPostgres(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string) {
jrockway (Member):

Isn't this as race-y as pachw starting to use postgres mid-upgrade? I think the upgrade will have to apply new labels to pods, and you'll have to wait for the new labels (app=postgres + upgrade=this-is-one). Otherwise, this wait could find the un-upgraded postgres that is about to restart, I think.

djanicekpach (Contributor, author):

Yeah, I was aware of this potential, but I think:

  • All of the other waitFor functions are affected by this as well, so it's not a new issue. I made a ticket to address this holistically in the future: https://pachyderm.atlassian.net/browse/CORE-1536
  • postgres goes into Terminating very quickly; it just takes a while to spin down and back up. Since we check postgres last (after loki, pachd, and pg-bouncer), the chance that postgres isn't at least going down is really small. Watching locally, at least, we aren't even close to hitting this race condition.

I could attempt to find a way to get the controller-revision-hash and confirm it's good, but I think it's pragmatic to get this in and handle that in bulk with CORE-1536.
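
For the record, a hypothetical sketch of that controller-revision-hash check (not part of this PR; the StatefulSet name "postgres" is an assumption, and imports are as in the earlier sketch):

// Hypothetical: report whether any postgres pod was created by the
// StatefulSet's current (post-upgrade) revision.
func podMatchesCurrentRevision(ctx context.Context, kubeClient *kubernetes.Clientset, namespace string) (bool, error) {
	sts, err := kubeClient.AppsV1().StatefulSets(namespace).Get(ctx, "postgres", metav1.GetOptions{}) // name assumed
	if err != nil {
		return false, err
	}
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=postgresql",
	})
	if err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		// StatefulSet pods carry the revision that created them in this label.
		if p.Labels["controller-revision-hash"] == sts.Status.UpdateRevision {
			return true, nil
		}
	}
	return false, nil
}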

waitForLabeledPod(t, ctx, kubeClient, namespace, "app.kubernetes.io/name=postgresql")
}

func waitForLabeledPod(t testing.TB, ctx context.Context, kubeClient *kube.Clientset, namespace string, label string) {
jrockway (Member):

btw, it would be nice if this t.Log'd every retry: which label selector it's waiting on, and the status of the matches (pod phase if != running, which container is not ready, etc.).
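
Perhaps something like this sketch of the suggested logging (hypothetical helper, imports as in the earlier sketch; as the thread below notes, require.NoErrorWithinTRetry turned out to cover this already):

// Sketch only: log why a label selector is still being waited on.
func logWaitStatus(t testing.TB, label string, pods []corev1.Pod) {
	for _, p := range pods {
		if p.Status.Phase != corev1.PodRunning {
			t.Logf("label %q: pod %s is in phase %s", label, p.Name, p.Status.Phase)
		}
		for _, c := range p.Status.ContainerStatuses {
			if !c.Ready {
				t.Logf("label %q: pod %s container %s is not ready", label, p.Name, c.Name)
			}
		}
	}
}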

djanicekpach (Contributor, author):

good idea, will do.

djanicekpach (Contributor, author):

Edit: it looks like all this info is already logged by require.NoErrorWithinTRetry, so I will refrain from logging it a second time.

@@ -59,6 +85,7 @@ func (pachwb *pachwBuilder) buildAndRun(ctx context.Context) error {
pachwb.initInternalServer,
pachwb.registerAuthServer,
pachwb.registerPFSServer, //PFS seems to need a non-nil auth server.
pachwb.registerEnterpriseServer,
jrockway (Member):

The other pachs register enterprise before auth and I think we should do that here. auth also requires an identity server, but I guess we don't call methods that actually dereference the identity server? dunno.

djanicekpach (Contributor, author):

Changed. Yeah, with auth on I think I did have to add an identity server at some point, but this configuration hasn't needed it. I'll change the ordering (and add back auth registration to enterprise, which I believe is the reason for the order).
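
The reordered steps would presumably look like this sketch, based on the diff above (not the final committed code):

pachwb.initInternalServer,
pachwb.registerEnterpriseServer, // register enterprise before auth, as pachd does
pachwb.registerAuthServer,
pachwb.registerPFSServer, // PFS seems to need a non-nil auth server.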

}))
t.Logf("preUpgrade done; starting postUpgrade")
postUpgrade(t, minikubetestenv.UpgradeRelease(t,
context.Background(),
ns,
k,
&minikubetestenv.DeployOpts{
WaitSeconds: 10,
jrockway (Member):

I have a feeling that WaitSeconds was added only for this. It has no other users and we should remove it. (Also, it should have been Wait as a time.Duration.)
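
That is, something along these lines (a hypothetical shape, with DeployOpts' other fields elided):

// Hypothetical shape for the option, were it kept:
type DeployOpts struct {
	// ...
	Wait time.Duration // e.g. Wait: 10 * time.Second, instead of WaitSeconds: 10
}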

-	WaitSeconds: 10,
-	CleanupAfter: true,
+	WaitSeconds: 0,
+	CleanupAfter: false,
jrockway (Member):

In general I would omit the zero value in cases like this. CleanupAfter is false even if you don't have this line of code.
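
In other words, the literal could drop both fields entirely (illustrative):

&minikubetestenv.DeployOpts{
	// WaitSeconds and CleanupAfter omitted: they default to 0 and false.
}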

djanicekpach merged commit 6abe2be into master on Mar 8, 2023
bbonenfant pushed a commit that referenced this pull request on Mar 9, 2023: "fix postgres timing and grpc calls in pachw for upgrade tests"
djanicekpach added a commit that referenced this pull request on Mar 10, 2023: "fix postgres timing and grpc calls in pachw for upgrade tests"