Integration test for cni-repair-controller #316

Merged
alpeb merged 2 commits into main from alpeb/cni-repair-controller-tests on Jan 25, 2024

@alpeb commented Jan 2, 2024

(Note this will fail until linkerd/linkerd2#11699 lands)

The integration-cni-plugin.yml workflow (formerly known as cni-plugin-integration.yml) has been expanded to run the new recipe cni-repair-controller-integration, which performs the following steps:

  • Rebuilds the linkerd-cni-repair-controller crate and cni-plugin
  • Creates a new cluster at version v1.27.6-k3s1 (version required for Calico to work)
  • Triggers a new ./cni-repair-controller/integration/run.sh script which:
    • Installs Calico
    • Installs the latest linkerd-edge CLI
    • Installs linkerd-cni and waits for it to become ready
    • Installs the linkerd control plane in CNI mode
    • Installs a pause DaemonSet
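
As a rough sketch of the sequence above (not the actual `run.sh`), the steps might map onto commands like the following. The cluster name `cni-test` and the `calico.yaml` and `pause-daemonset.yaml` file names are assumptions for illustration, and the extra initContainer configuration is left out. The function only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical outline of the integration steps; not the real run.sh.
# With DRY_RUN=1 (the default) commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    sh -c "$*"
  fi
}

integration_steps() {
  # Cluster pinned to the k3s version the description says Calico requires.
  # "cni-test" is an assumed cluster name.
  run "k3d cluster create cni-test --image rancher/k3s:v1.27.6-k3s1"
  # calico.yaml is a placeholder for whichever Calico manifest is used.
  run "kubectl apply -f calico.yaml"
  # Latest linkerd-edge CLI.
  run "curl -sL https://run.linkerd.io/install-edge | sh"
  # Install linkerd-cni and block until its DaemonSet is ready.
  run "linkerd install-cni | kubectl apply -f -"
  run "kubectl -n linkerd-cni rollout status daemonset/linkerd-cni"
  # Control plane in CNI mode.
  run "linkerd install --crds | kubectl apply -f -"
  run "linkerd install --linkerd-cni-enabled | kubectl apply -f -"
  # pause-daemonset.yaml is a placeholder manifest for the pause DaemonSet.
  run "kubectl apply -f pause-daemonset.yaml"
}

integration_steps
```

A real Calico setup on k3s would likely also need the default flannel backend disabled when the cluster is created; that detail is omitted here.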

The linkerd-cni instance is configured with an extra initContainer that delays its start by 15s. Because the script waits for linkerd-cni to become ready, this delay doesn't affect the initial install. A new node is then added to the cluster, and the delay lets the new pause DaemonSet replica start before the full CNI config is ready, so we can observe its failure to come up. Once the new linkerd-cni replica becomes ready, we observe the failed pause replica being replaced by a new healthy one.
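
The "wait, then observe" checks described above could be built on a small polling helper. This is a minimal sketch, not code from the PR: `wait_until` is a hypothetical name, and the DaemonSet name `pause` in the commented usage is an assumption.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY CMD...
# Re-runs CMD until it succeeds, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if all ATTEMPTS tries fail.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Hypothetical usage once the new node has joined (assumes a DaemonSet
# named "pause" in the current namespace):
#   pause_ready() { kubectl rollout status daemonset/pause --timeout=5s; }
#   wait_until 30 2 pause_ready
```

Polling with a bounded number of attempts keeps the test from hanging indefinitely if the failed replica is never replaced.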

@alpeb requested a review from a team as a code owner on January 2, 2024 20:29
@alpeb force-pushed the alpeb/cni-repair-controller-tests branch from 9fbaa2e to e4fb23b on January 2, 2024 21:41
@alpeb force-pushed the alpeb/cni-repair-controller-tests branch from e4fb23b to 89c3415 on January 22, 2024 14:12
.dockerignore Outdated
@@ -1,2 +1 @@
rust-toolchain
target/
Member

Was this change accidental? Or do we need the target dir in the context when building an image?

Member Author

Good catch. This was a leftover from an iteration where the binary wasn't built inside the same Dockerfile.

# the full CNI config is ready and enter a failure mode
extraInitContainers:
- name: sleep
  image: busybox
Member

On the nitpicky side: the CNI plugin runs alpine. Can we reuse the same image so we don't pull busybox in tests?

Member Author

Good thinking 👍

@alpeb merged commit fb9c51e into main on Jan 25, 2024
18 checks passed
@alpeb deleted the alpeb/cni-repair-controller-tests branch on January 25, 2024 16:29