Improve the visibility into individual e2e test case failures #68
Conversation
Holding for CI feedback in #66. /hold

These changes only affect the dev testing suite. Adding the required no-ff labels. /label px-approved
tylerslaton left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: timflannagan, tylerslaton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.

Removing the hold now that #66 had e2e runs that expressed the desired behavior. /hold cancel
```go
	return err
}

cmd := exec.Command("/bin/bash", "-c", "./collect-ci-artifacts.sh")
```
Follow-up: avoid hardcoding the path to the script location.
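For reference, a minimal sketch of how that follow-up could look, resolving the script next to the test source file instead of relying on the working directory; the helper names here are illustrative, not part of this PR:

```go
package e2e

import (
	"os/exec"
	"path/filepath"
	"runtime"
)

// collectScriptPath is a hypothetical helper: it locates collect-ci-artifacts.sh
// relative to this source file (via runtime.Caller) so the exec.Command call no
// longer depends on the process working directory.
func collectScriptPath() string {
	_, thisFile, _, _ := runtime.Caller(0)
	return filepath.Join(filepath.Dir(thisFile), "collect-ci-artifacts.sh")
}

// runCollectScript shells out to the artifact-collection script using the
// resolved path rather than a hardcoded relative one.
func runCollectScript() error {
	return exec.Command("/bin/bash", collectScriptPath()).Run()
}
```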
```go
// current test case failed. attempt to collect CI artifacts if the
// $ARTIFACT_DIR environment variable has been set. This variable is
// always present in downstream CI environments.
artifactDir := os.Getenv("ARTIFACT_DIR")
```
Potential follow-up: There's overlap between how we handle the $ARTIFACT_DIR environment variable in the Makefile invocation vs. how we're handling the value of that shared variable here. It could be worth consolidating this logic into a shared CLI flag, but it's unclear whether that's the correct implementation going forward.
The main problem with this approach is the potential for skew between how ginkgo handles relative paths vs. how we handle relative paths within this internal testing code. The expectation is that downstream CI sets $ARTIFACT_DIR to an absolute path, so we're in the clear there, but there's no guarantee that we're correctly configuring (or enforcing) an absolute path when running the e2e suite locally during dev workflows.
I don't think anything I mentioned is blocking, but it's still something worth noting and potentially revisiting in future phase implementations.
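As a rough illustration of the absolute-path concern, here's a sketch under the assumption that normalization happens on the Go side; this is not code from this PR:

```go
package e2e

import (
	"os"
	"path/filepath"
)

// resolveArtifactDir is a hypothetical helper: if $ARTIFACT_DIR was set to a
// relative path (e.g. during a local dev run), resolve it against the current
// working directory so the Go code and the collection script agree on a single
// location.
func resolveArtifactDir() (string, error) {
	artifactDir := os.Getenv("ARTIFACT_DIR")
	if artifactDir == "" || filepath.IsAbs(artifactDir) {
		return artifactDir, nil
	}
	return filepath.Abs(artifactDir)
}
```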
| # "oc" binary is located at "/cli/oc" path. This is problematic as the /cli directory | ||
| # doesn't exist in the $PATH environment variable, which causes issues when running | ||
| # this script via the exec.Command Golang function. | ||
| if [[ "$OPENSHIFT_CI" == "true" ]]; then |
Note: We could also perform a -d check on whether the /cli directory exists. The OPENSHIFT_CI variable will always be set when running the e2e suite in downstream CI environments, so it feels like we can make a really solid assumption on where the oc binary will live going forward.
This isn't completely future-proof in case the directory location changes, but we should be fine given that we stat the configured kubectl binary immediately after performing this conditional check. In the future, updating this script to also validate that the directory exists could produce a more robust implementation.
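A hedged sketch of that more defensive variant; the PATH handling below is an assumption about the script's shape, not a quote from it:

```bash
#!/usr/bin/env bash
# Only trust the downstream CI binary location when both signals agree:
# the OPENSHIFT_CI flag is set and the /cli directory actually exists.
if [[ "${OPENSHIFT_CI:-}" == "true" && -d /cli ]]; then
    export PATH="/cli:${PATH}"
fi
```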
@timflannagan: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
```bash
set -o nounset
set -o errexit

: "${KUBECONFIG:?}"
```
Do we need/use the KUBECONFIG variable?
Nope, we're not explicitly using that variable here, but it's a nice sanity check: it verifies the variable has been set and that we'll be able to use a kubectl/oc binary further into the control flow.
Nit: comment to that effect then? Otherwise someone might decide to delete it in the future.
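For example, the check could carry its own explanation (wording illustrative):

```bash
#!/usr/bin/env bash
# KUBECONFIG isn't referenced directly in this script, but failing fast here
# guarantees the kubectl/oc calls further down can reach the cluster.
: "${KUBECONFIG:?KUBECONFIG must be set so kubectl/oc can reach the cluster}"
```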
Any idea as to what's going on in the CI failure here? I took a look and it seems like things failed pretty quickly, so I'm not sure whether we know it's a flake or not.
There are some flakes in the e2e testing suite, which is why I wanted to introduce these changes. If you dive into the test case directory that failed, you can see the relevant YAML output of the resources that get reconciled. I had a hunch that our consistent usage of cert-manager throughout the e2e suite was problematic: we were constantly installing those contents, performing a cascading deletion of them, and then installing the same contents again without giving that deletion any buffer to complete. That hunch looks validated by the following status on a failed BundleDeployment resource:

```yaml
- lastTransitionTime: "2022-09-28T03:37:23Z"
  message: 'rendered manifests contain a resource that already exists. Unable
    to continue with install: Namespace "openshift-cert-manager-operator" in namespace
    "" exists and cannot be imported into the current release: invalid ownership
    metadata; annotation validation error: key "meta.helm.sh/release-name" must
    equal "cert-managerzv79l": current value is "cert-managerlwcq9"'
  reason: InstallFailed
  status: "False"
  type: Installed
```

Not too sure whether this is a rukpak bug, or whether we need to add some buffer to ensure we're not re-installing the same contents that are being deleted. The potential problem with the latter is that we're hiding controller failures.
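To make the "buffer" idea concrete, here's a minimal sketch of how the suite could wait for the previous install's namespace to finish its cascading deletion before re-installing. This is an assumption about one possible approach, reusing the gomega and controller-runtime client setup the e2e suite already has, not code from this PR:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForNamespaceGone is a hypothetical helper: it polls until the namespace
// left behind by the previous install is fully deleted, so the next install
// never races the cascading deletion and trips Helm's ownership validation.
func waitForNamespaceGone(ctx context.Context, c client.Client, name string) {
	Eventually(func() bool {
		err := c.Get(ctx, client.ObjectKey{Name: name}, &corev1.Namespace{})
		return apierrors.IsNotFound(err)
	}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
```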
/lgtm
Related to #66, which has the debug information present.