Fix and improve migrate-nodegroup test #214

metral · 2019-08-07T05:03:07Z

Fixes #194.

The changes in this fix resolve a leaked ENI that lead to a dependency violation on tear down. It was resolved by bumping up aws-cni to v1.5.2 that was recently released [1]. We now also:

Wait 5 min after the migration step to ensure it completes,
Delete the workload Namespace, and aws-cni DaemonSet to ensure these get explicitly torn down.

This fix was tested across 70+ runs of the migrate-nodegroup test, and no run hit issue #194.

Additionally, logLevel, logFile, and image have been added as options to VpcCniOptions, and defaults for logging have been adjusted to be verbose, and be produced for the aws-cni Pods.

[1] - The specific fixes in aws-cni v1.5.2 were:

Detach ENI before deleting.
The addition of a healthz endpoint: Adding healthz endpoint to IPamD aws/amazon-vpc-cni-k8s#548 and Update start script to wait for ipamd health check aws/amazon-vpc-cni-k8s#553. The healthz endpoint is used in the readiness & liveness probes of the DaemonSet.

nodejs/eks/cni.ts

lukehoban · 2019-08-08T02:41:28Z

nodejs/eks/cni/aws-k8s-cni.yaml

@@ -81,15 +81,20 @@ spec:
      tolerations:
        - operator: Exists
      containers:
-        - image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.0


Does this now march any released version of this yaml from the source? Would love to keep this identical to some released version if possible.

The upstream YAML manifest and ours match. Below is the diff between upstream and ours.
The main differences lie in:

Removing the image, and AWS_VPC_K8S_CNI_LOGLEVEL as we set those above in cni.ts

Enabling the use of the readiness & liveness probes since these are omitted upstream as this healthz endpoint is only available on CNI images >= v1.5.2 which we use now.

$ diff -u upstream-aws-k8s-cni.yaml aws-k8s-cni.yaml --- upstream-aws-k8s-cni.yaml 2019-08-08 11:28:15.838079451 -0700 +++ aws-k8s-cni.yaml 2019-08-07 19:12:21.888614038 -0700 @@ -81,23 +81,20 @@ tolerations: - operator: Exists containers: - - image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.2 - imagePullPolicy: Always + - imagePullPolicy: Always ports: - containerPort: 61678 name: metrics name: aws-node - #readinessProbe: - # exec: - # command: ["/app/grpc_health_probe", "-addr=:50051"] - # initialDelaySeconds: 5 - #livenessProbe: - # exec: - # command: ["/app/grpc_health_probe", "-addr=:50051"] - # initialDelaySeconds: 5 + readinessProbe: + exec: + command: ["/app/grpc_health_probe", "-addr=:50051"] + initialDelaySeconds: 5 + livenessProbe: + exec: + command: ["/app/grpc_health_probe", "-addr=:50051"] + initialDelaySeconds: 5 env: - - name: AWS_VPC_K8S_CNI_LOGLEVEL - value: DEBUG - name: MY_NODE_NAME valueFrom: fieldRef:

metral · 2019-08-08T18:34:14Z

Shouldn’t we just update the image in the yaml to the right default and use similar overrides as above?

I'll make changes to adopt the override approach instead of the current one, and not remove bits from the YAML manifest directly to maintain that with upstream as best we can. Though note that logFile is not a default in the YAML manifest that we have to set to get any sort of meaningful logs out of aws-cni

metral · 2019-08-08T18:52:24Z

Feedback has been addressed. PTAL @lukehoban.

Updated diff in the YAML manifest. The diffs are:

Enabling the use of the probes since pulumi-eks has been updated to using image = v1.5.2
Adding in a missing envvar for AWS_VPC_K8S_CNI_LOG_FILE in the YAML manifest to have meaningful logs, and to allow to be user-configurable via override approach.

$ diff -u upstream-aws-k8s-cni.yaml aws-k8s-cni.yaml
--- upstream-aws-k8s-cni.yaml   2019-08-08 11:52:46.430206543 -0700
+++ aws-k8s-cni.yaml    2019-08-08 11:51:38.085461619 -0700
@@ -87,17 +87,19 @@
             - containerPort: 61678
               name: metrics
           name: aws-node
-          #readinessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
-          #livenessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
+          readinessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
+          livenessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
           env:
             - name: AWS_VPC_K8S_CNI_LOGLEVEL
               value: DEBUG
+            - name: AWS_VPC_K8S_CNI_LOG_FILE
+              value: stdout
             - name: MY_NODE_NAME
               valueFrom:
                 fieldRef:

lukehoban

LGTM

lukehoban · 2019-08-08T19:09:10Z

Great that this is ultimately addressed by fixes in 1.5.2 CNI!

metral · 2019-08-08T19:48:14Z

Great that this is ultimately addressed by fixes in 1.5.2 CNI!

Agreed!

Also, the explicit removal of the NGINX workload and aws-cni also helped to achieve successful cluster tear downs. #215 could help here, and would remove our need to do these manual deletions in the migrate-nodegroup test.

metral force-pushed the metral/fix-migrate-ng-test branch from ac38a22 to b549c9b Compare August 7, 2019 05:33

metral marked this pull request as ready for review August 7, 2019 16:54

metral requested review from lukehoban and lblackstone August 7, 2019 17:03

lblackstone approved these changes Aug 7, 2019

View reviewed changes

lukehoban reviewed Aug 8, 2019

View reviewed changes

metral force-pushed the metral/fix-migrate-ng-test branch from 64c053e to 922a509 Compare August 8, 2019 02:48

metral force-pushed the metral/fix-migrate-ng-test branch from 922a509 to 44151b3 Compare August 8, 2019 18:52

metral force-pushed the metral/fix-migrate-ng-test branch from 44151b3 to 47c5033 Compare August 8, 2019 18:56

lukehoban approved these changes Aug 8, 2019

View reviewed changes

metral added 4 commits August 8, 2019 12:52

fix(test): sleep after migration, and remove the workload & CNI

f7eab1f

feat(cni): add opts for VPC CNI container image, logLevel, and logFile

f35a3e4

chore(cni): update from v1.5.0 -> v1.5.2

5ec0986

Update CHANGELOG to include 'fix and improve migrate-nodegroup test'

b145e46

metral force-pushed the metral/fix-migrate-ng-test branch from 47c5033 to b145e46 Compare August 8, 2019 19:53

metral merged commit 4e5ab24 into master Aug 8, 2019

pulumi-bot deleted the metral/fix-migrate-ng-test branch August 8, 2019 23:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and improve migrate-nodegroup test #214

Fix and improve migrate-nodegroup test #214

metral commented Aug 7, 2019 •

edited

Loading

lukehoban Aug 8, 2019

metral Aug 8, 2019

metral commented Aug 8, 2019 •

edited

Loading

metral commented Aug 8, 2019 •

edited

Loading

lukehoban left a comment

lukehoban commented Aug 8, 2019

metral commented Aug 8, 2019 •

edited

Loading

Fix and improve migrate-nodegroup test #214

Fix and improve migrate-nodegroup test #214

Conversation

metral commented Aug 7, 2019 • edited Loading

lukehoban Aug 8, 2019

Choose a reason for hiding this comment

metral Aug 8, 2019

Choose a reason for hiding this comment

metral commented Aug 8, 2019 • edited Loading

metral commented Aug 8, 2019 • edited Loading

lukehoban left a comment

Choose a reason for hiding this comment

lukehoban commented Aug 8, 2019

metral commented Aug 8, 2019 • edited Loading

metral commented Aug 7, 2019 •

edited

Loading

metral commented Aug 8, 2019 •

edited

Loading

metral commented Aug 8, 2019 •

edited

Loading

metral commented Aug 8, 2019 •

edited

Loading