Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and improve migrate-nodegroup test #214

Merged
merged 4 commits into from
Aug 8, 2019
Merged

Conversation

metral
Copy link
Contributor

@metral metral commented Aug 7, 2019

Fixes #194.

The changes in this fix resolve a leaked ENI that lead to a dependency violation on tear down. It was resolved by bumping up aws-cni to v1.5.2 that was recently released [1]. We now also:

  • Wait 5 min after the migration step to ensure it completes,
  • Delete the workload Namespace, and aws-cni DaemonSet to ensure these get explicitly torn down.

This fix was tested across 70+ runs of the migrate-nodegroup test, and no run hit issue #194.

Additionally, logLevel, logFile, and image have been added as options to VpcCniOptions, and defaults for logging have been adjusted to be verbose, and be produced for the aws-cni Pods.


[1] - The specific fixes in aws-cni v1.5.2 were:

@metral metral force-pushed the metral/fix-migrate-ng-test branch from ac38a22 to b549c9b Compare August 7, 2019 05:33
@metral metral marked this pull request as ready for review August 7, 2019 16:54
nodejs/eks/cni.ts Outdated Show resolved Hide resolved
nodejs/eks/cni.ts Outdated Show resolved Hide resolved
@@ -81,15 +81,20 @@ spec:
tolerations:
- operator: Exists
containers:
- image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this now march any released version of this yaml from the source? Would love to keep this identical to some released version if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream YAML manifest and ours match. Below is the diff between upstream and ours.
The main differences lie in:

  • Removing the image, and AWS_VPC_K8S_CNI_LOGLEVEL as we set those above in cni.ts
  • Enabling the use of the readiness & liveness probes since these are omitted upstream as this healthz endpoint is only available on CNI images >= v1.5.2 which we use now.
$ diff -u upstream-aws-k8s-cni.yaml aws-k8s-cni.yaml
--- upstream-aws-k8s-cni.yaml   2019-08-08 11:28:15.838079451 -0700
+++ aws-k8s-cni.yaml    2019-08-07 19:12:21.888614038 -0700
@@ -81,23 +81,20 @@
       tolerations:
         - operator: Exists
       containers:
-        - image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.2
-          imagePullPolicy: Always
+        - imagePullPolicy: Always
           ports:
             - containerPort: 61678
               name: metrics
           name: aws-node
-          #readinessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
-          #livenessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
+          readinessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
+          livenessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
           env:
-            - name: AWS_VPC_K8S_CNI_LOGLEVEL
-              value: DEBUG
             - name: MY_NODE_NAME
               valueFrom:
                 fieldRef:

@metral metral force-pushed the metral/fix-migrate-ng-test branch from 64c053e to 922a509 Compare August 8, 2019 02:48
@metral
Copy link
Contributor Author

metral commented Aug 8, 2019

Shouldn’t we just update the image in the yaml to the right default and use similar overrides as above?

I'll make changes to adopt the override approach instead of the current one, and not remove bits from the YAML manifest directly to maintain that with upstream as best we can. Though note that logFile is not a default in the YAML manifest that we have to set to get any sort of meaningful logs out of aws-cni

@metral metral force-pushed the metral/fix-migrate-ng-test branch from 922a509 to 44151b3 Compare August 8, 2019 18:52
@metral
Copy link
Contributor Author

metral commented Aug 8, 2019

Feedback has been addressed. PTAL @lukehoban.

Updated diff in the YAML manifest. The diffs are:

  • Enabling the use of the probes since pulumi-eks has been updated to using image = v1.5.2
  • Adding in a missing envvar for AWS_VPC_K8S_CNI_LOG_FILE in the YAML manifest to have meaningful logs, and to allow to be user-configurable via override approach.
$ diff -u upstream-aws-k8s-cni.yaml aws-k8s-cni.yaml
--- upstream-aws-k8s-cni.yaml   2019-08-08 11:52:46.430206543 -0700
+++ aws-k8s-cni.yaml    2019-08-08 11:51:38.085461619 -0700
@@ -87,17 +87,19 @@
             - containerPort: 61678
               name: metrics
           name: aws-node
-          #readinessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
-          #livenessProbe:
-          #  exec:
-          #    command: ["/app/grpc_health_probe", "-addr=:50051"]
-          #  initialDelaySeconds: 5
+          readinessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
+          livenessProbe:
+            exec:
+              command: ["/app/grpc_health_probe", "-addr=:50051"]
+            initialDelaySeconds: 5
           env:
             - name: AWS_VPC_K8S_CNI_LOGLEVEL
               value: DEBUG
+            - name: AWS_VPC_K8S_CNI_LOG_FILE
+              value: stdout
             - name: MY_NODE_NAME
               valueFrom:
                 fieldRef:

@metral metral force-pushed the metral/fix-migrate-ng-test branch from 44151b3 to 47c5033 Compare August 8, 2019 18:56
Copy link
Member

@lukehoban lukehoban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lukehoban
Copy link
Member

Great that this is ultimately addressed by fixes in 1.5.2 CNI!

@metral
Copy link
Contributor Author

metral commented Aug 8, 2019

Great that this is ultimately addressed by fixes in 1.5.2 CNI!

Agreed!

Also, the explicit removal of the NGINX workload and aws-cni also helped to achieve successful cluster tear downs. #215 could help here, and would remove our need to do these manual deletions in the migrate-nodegroup test.

@metral metral force-pushed the metral/fix-migrate-ng-test branch from 47c5033 to b145e46 Compare August 8, 2019 19:53
@metral metral merged commit 4e5ab24 into master Aug 8, 2019
@pulumi-bot pulumi-bot deleted the metral/fix-migrate-ng-test branch August 8, 2019 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DependencyViolation: resource sg has a dependent object
3 participants