Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Fix self-hosted flakes in E2E tests #3639

Merged

Conversation

fabriziopandini
Copy link
Member

What this PR does / why we need it:
This PR aims to fix self-hosted flakes by ensuring that mhc is not run during this test and by adding two delay in order to avoid doing move too aggressively.

In order to make this happen the entire management of cluster templates was re-architected so:

  • cluster templates for tests are now generated from a set composable bases
  • MHC-remediation has a new dedicated template
  • all the other templates are without MHC
  • also, in order to simplify things, this PR drops the distinction between CI and DEV test config

On top of that, this PR includes, a set of nits for improving test logs (mostly uppercase)

Which issue(s) this PR fixes:
Fixes #3589

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 15, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 15, 2020
@fabriziopandini
Copy link
Member Author

/milestone v0.3.10

@k8s-ci-robot k8s-ci-robot added this to the v0.3.10 milestone Sep 15, 2020
@fabriziopandini
Copy link
Member Author

/retest

@fabriziopandini fabriziopandini force-pushed the fix-self-hosted-flakes branch 3 times, most recently from 343f90f to 9121d59 Compare September 15, 2020 12:10
@fabriziopandini
Copy link
Member Author

/hold
investigating an error on ci-e2e.sh that shows up un CI only, while locally everything works just fine 🤔

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 15, 2020
scripts/ci-e2e.sh Outdated Show resolved Hide resolved
test/e2e/Makefile Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

From the PR's description, seems we're now splitting manifests in order to separate MHC components? Would this be something that we would expect users to do as well?

During move, we should be able to pause controllers and look at the state of the cluster if it's safe to move, is there anything stopping us to do so?

Can we split this PR in multiple ones? In particular:

also, in order to simplify things, this PR drops the distinction between CI and DEV test config
On top of that, this PR includes, a set of nits for improving test logs (mostly uppercase)

These changes seems outside of the PR's scope

Copy link

@sedefsavas sedefsavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

This PR is also addressing this issue: #3461
I am fine with adding it in the title and the issues fixed by this PR or making this separate PRs.

test/e2e/config/docker.yaml Outdated Show resolved Hide resolved
test/e2e/Makefile Outdated Show resolved Hide resolved
test/framework/machine_helpers.go Outdated Show resolved Hide resolved
@sedefsavas
Copy link

@fabriziopandini Locally, works for me as well.

/test pull-cluster-api-e2e

@sedefsavas
Copy link

In test/e2e/Makefile

cd $(TOOLS_DIR); go build -tags=tools -o $(BIN_DIR)/kustomize sigs.k8s.io/kustomize/kustomize/v3

should be

cd $(TOOLS_DIR) && go build -tags=tools -o $(BIN_DIR)/kustomize sigs.k8s.io/kustomize/kustomize/v3

@fabriziopandini
Copy link
Member Author

@vincepri

From the PR's description, seems we're now splitting manifests in order to separate MHC components? Would this be something that we would expect users to do as well?

No. The current MHC object in E2E tests is specifically configured to always trigger remediation after 30s the node is started, and I removed it from the other e2e tests in order to avoid flakes due to remediation kicking in case of slow execution

During move, we should be able to pause controllers and look at the state of the cluster if it's safe to move, is there anything stopping us to do so?

This is not how the move logics works. Currently we are pausing reconciliation on clusters included in the Move scope, but the controller will continue to run.

Can we split this PR in multiple ones?

Done! rif
#3649
#3650
#3651

These changes seems outside of the PR's scope

Some of them are nits, but refactoring templates for MHC is potentially related to self-hosted flakes, see the note above about MHC and #3589 (comment).
However, no problem to move to another PR

@k8s-ci-robot k8s-ci-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 16, 2020
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 16, 2020
@fabriziopandini
Copy link
Member Author

/hold cancel
/test pull-cluster-api-e2e-full

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 16, 2020
@vincepri
Copy link
Member

No. The current MHC object in E2E tests is specifically configured to always trigger remediation after 30s the node is started, and I removed it from the other e2e tests in order to avoid flakes due to remediation kicking in case of slow execution

I assume this change has moved to the other PR, I'll take a look there before commenting

This is not how the move logics works. Currently we are pausing reconciliation on clusters included in the Move scope, but the controller will continue to run.

We should probably consider stopping all controllers before move, for safety purposes and to avoid any running slow worker to perform actions while move is running.

@vincepri
Copy link
Member

/test pull-cluster-api-e2e-full

@randomvariable
Copy link
Member

/lgtm

pending e2e pass

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 16, 2020
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit 0cf9f80 into kubernetes-sigs:master Sep 16, 2020
@fabriziopandini fabriziopandini deleted the fix-self-hosted-flakes branch September 17, 2020 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix self hosting E2E test flakes
5 participants