Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional v1alpha4 and self-hosted management cluster tests #2833

Merged
merged 3 commits into from Oct 8, 2021

Conversation

randomvariable
Copy link
Member

@randomvariable randomvariable commented Oct 6, 2021

Signed-off-by: Naadir Jeewa jeewan@vmware.com

What type of PR is this?

/kind failing-test
/area testing

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #2788
Part of #2793

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. area/testing Issues or PRs related to testing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 6, 2021
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 6, 2021
@randomvariable
Copy link
Member Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@randomvariable: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@randomvariable
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@randomvariable
Copy link
Member Author

/milestone v1.0.0
/triage accepted
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added this to the v1.0.0 milestone Oct 6, 2021
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Oct 6, 2021
@randomvariable
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 6, 2021
@randomvariable
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@randomvariable
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

1 similar comment
@randomvariable
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 7, 2021
@sedefsavas
Copy link
Contributor

I am able to replicate locally. After the upgrade, all deployment machines come up but MachineDeployment is not being reconciled hence never detects scaling up is successful.

@randomvariable
Copy link
Member Author

randomvariable commented Oct 8, 2021

I am able to replicate locally. After the upgrade, all deployment machines come up but MachineDeployment is not being reconciled hence never detects scaling up is successful.

Is that a separate issue to the one that keeps showing up for all the ones using a remote management cluster?

"reason": "ProgressDeadlineExceeded",
[1] "message": "ReplicaSet "capa-controller-manager-7bd56cd58" has timed out progressing."

@sedefsavas
Copy link
Contributor

sedefsavas commented Oct 8, 2021

They are the same, that deadline is exceeded because CAPI controllers don't reconcile anything.
KCP show only one error:
KubeadmControlPlane" "cluster"="clusterctl-upgrade-vexwe7" "name"="clusterctl-upgrade-vexwe7-control-plane" "namespace"="clusterctl-upgrade" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane" E1008 06:28:50.096627 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.2/tools/cache/reflector.go:167: Failed to watch *v1.PartialObjectMetadata: unknown

And CAPI logs do not show any error but also it doesn't reconcile when I make changes to MachineDeployment.

I will check with Stefan/Fabrizio when they are online.

[UPDATE]: Failures I saw was probably due to using an unsupported version of K8s.

@randomvariable
Copy link
Member Author

randomvariable commented Oct 8, 2021

If you trace the lines back, then it's waiting for the deployment of CAPA controllers to succeed with the appv1.Deployment of the controller coming up. Following the trace to

[1]     /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.0.0/framework/deployment_helpers.go:72
[1] 
[1]     Full Stack Trace
[1]     sigs.k8s.io/cluster-api/test/framework.WaitForDeploymentsAvailable(0x3096840, 0xc000054140, 0x7fe5f0056d40, 0xc0006e40e0, 0xc0003b1400, 0xc001215d80, 0x2, 0x2)
[1]     	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.0.0/framework/deployment_helpers.go:72 +0x2cf
[1]     sigs.k8s.io/cluster-api/test/framework/clusterctl.UpgradeManagementClusterAndWait(0x3096840, 0xc000054140, 0x30b47a0, 0xc0011220c0, 0xc000e43800, 0x31, 0x2c27eda, 0x7, 0xc000658ec0, 0x32, ...)
[1]     	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.0.0/framework/clusterctl/clusterctl_helpers.go:155 +0x8ab
[1]     sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()

gets https://github.com/kubernetes-sigs/cluster-api/blob/main/test/framework/clusterctl/clusterctl_helpers.go#L153-L166
and https://github.com/kubernetes-sigs/cluster-api/blob/main/test/framework/deployment_helpers.go#L52-L73

@randomvariable
Copy link
Member Author

Just ran in us-west-2, and it's working fine locally. :(

@randomvariable
Copy link
Member Author

One thing to try might be to see if the e2e image upload is actually working. Will need to to look at the logs mid run, but there will be a log line along the lines of:

[1] STEP: Uploading the e2e image to s3:///capi-images.us-west-2.921092657406.oci-images/sha256:545ee55243f44f063048ea53fc99e13fc487bcd88c9b794fcf0d3e72d2b36fde
[1] STEP: Making the image public
[1] STEP: Searching for AMI: name=capa-ami-ubuntu-18.04-1.21.2*

at that point , run

aws s3 cp  s3://capi-images.us-west-2.921092657406.oci-images/sha256:545ee55243f44f063048ea53fc99e13fc487bcd88c9b794fcf0d3e72d2b36fde local.tar

Note that you'll need to change s3:/// to s3:// due to a typo in my log line.

@sedefsavas
Copy link
Contributor

Thanks, will check out that.

/test pull-cluster-api-provider-aws-e2e

@randomvariable
Copy link
Member Author

Yeah, the bucket is empty during the test run. I think something is deleting the S3 image. Maybe it's a lambda or some security thing deleting any object made publicly readable.

Will probably need to create an S3 bucket policy that allows members of the AWS account to read, but not set publicly readable. Might have to check if this is what is happening with k8s infra / sig testing folk.

@sedefsavas
Copy link
Contributor

sedefsavas commented Oct 8, 2021

It is now in mid run and looks like the image is missing:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "sha256:2395e936830e46b0138729b29c253f97404d382edecf4e098d268f0c7694b144" does not exist

@randomvariable
Copy link
Member Author

I think I'm confident to say that the upgrades are working based on local testing, and would suggest reverting the timeout changes and the skip, and then changing Describe to PDescribe for the v1alpha3 upgrade test (rather than comment out), merge in and do a release.

Thoughts @richardcase ?. PS EKS upgrade test currently isn't possible because of #2828 dependencies.

@randomvariable randomvariable changed the title Enable v1alpha3 upgrade e2e test and add optional v1alpha4 and self-hosted management cluster tests Add optional v1alpha4 and self-hosted management cluster tests Oct 8, 2021
@sedefsavas
Copy link
Contributor

/test pull-cluster-api-provider-aws-e2e
Last one time before making the release.

@richardcase
Copy link
Member

Yay @sedefsavas the e2e passed.

@sedefsavas
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2021
@richardcase
Copy link
Member

I think I'm confident to say that the upgrades are working based on local testing, and would suggest reverting the timeout changes and the skip, and then changing Describe to PDescribe for the v1alpha3 upgrade test (rather than comment out), merge in and do a release.

Sounds good to me, especially as the e2e is passing. The timeout changes remain changed but all good:

/lgtm

@sedefsavas
Copy link
Contributor

sedefsavas commented Oct 8, 2021

@richardcase These tests were already passing. The issue is with the upgrade test which I disabled in this PR.
I also confirm that it passes locally. It is not passing here because of the reason randomvariable found: the e2e image pushed to s3 is getting deleted in the prow.

@sedefsavas
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit 82f3c98 into kubernetes-sigs:main Oct 8, 2021
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.0.0, v1.x Oct 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Issues or PRs related to testing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants