ci-artifacts maintenance overview #291
If we stop always building, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they would have to start patching things around to get the repo to use their revision, which might be annoying.
Regarding the 8 minutes it saves, and CI performance in general: we do a lot of things in the CI unnecessarily sequentially, e.g. "start the image helper build, wait for it to build".
I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after the fact, we would get much more significant performance improvements, which would make optimizations such as getting rid of the helper image pretty insignificant. For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything, we wouldn't even have to think about it: things that can happen in parallel will happen in parallel, and things that block on other things will simply wait for them to complete.
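As a rough illustration of that "submit everything, then wait" flow — a minimal sketch only; the manifest paths, build name, namespace and node label below are hypothetical, not the repo's actual files:

```bash
# 1. Submit everything that can progress on its own, without waiting:
oc apply -f manifests/image-helper-buildconfig.yaml -n ci-artifacts
oc start-build image-helper -n ci-artifacts    # returns immediately, build runs in the background
oc apply -f manifests/gpu-machineset.yaml      # machineset scales up in parallel

# 2. Only once everything has been submitted, block on what the test actually needs:
oc wait --for=condition=Ready node -l node-role.kubernetes.io/worker --timeout=30m
until oc get istag image-helper:latest -n ci-artifacts >/dev/null 2>&1; do
    sleep 15                                   # the tag appears once the build completes
done
```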
For that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised.
Good point, although I'm afraid it might complicate the troubleshooting when things go wrong, so maybe it could be coded differently.
Maybe even move it to subprojects then so it's more clearly a separate thing
Yep, that's definitely what I had in mind.
I added it to the list, to be addressed once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346
It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. What these statements suggest, though, is that these images should be defined in a separate repository, to avoid building them on every test. In that case, the test would simply import the image at its start from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than it does to build and reconcile it, so this would be the best approach to reducing the test run time.
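As a rough sketch of that import-at-test-start approach — the registry path, namespace and tag names below are illustrative assumptions, not the actual integration stream:

```bash
# Import a pre-built helper image into the test cluster instead of building it.
# The quay.io path and the "ci-artifacts" namespace are illustrative only.
oc import-image ci-artifacts-helper:latest \
   --from=quay.io/example-org/ci-artifacts-helper:latest \
   --confirm -n ci-artifacts
```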
One thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow. Looking further, this also enables seamless portability to any other CI infrastructure.
Some things I have in mind for improving/fixing `ci-artifacts`:

- Fix the GPU Operator `deploy_from_operatorhub` to work with `v1.9.0-beta` and `v1.9.0` when released
  - done for `master` branch testing, but I couldn't do it for the OperatorHub deployment until released by NVIDIA (the beta was released last week)
- Update `gpu_operator_set_namespace` to use `ClusterPolicy.status.namespace` (see PR)
  - it will be simpler than the code I wrote before this PR was merged
  - WONT FIX: `oc get pod -l app.kubernetes.io/component=gpu-operator -A -ojsonpath={.items[].metadata.namespace}` is simple enough (see the sketch after this list)
- Enable testing the GPU Operator v1.9 (when released, ie > 2021-12-03)
- Call the `hack/must-gather.sh` script instead of custom scripts
- Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge" (`master` GPU Operator test)
- Double check the alert-testing of the GPU Operator (`master` branch)
- Refresh the versions used for the GPU Operator upgrade testing (currently only 4.6 --> 4.7)
  - `master` only does `4.6 --> 4.7` upgrade testing, and both versions are not supported anymore
- Confirm the fate of testing the GPU Operator on OCP 4.6 clusters
- Enable testing of the GPU Operator on OCP 4.10
- Improve the GPU Operator and rewrite `gpu_operator_get_csv_version`
  - it currently reports the `master-branch` bundle version (eg, `21.11.25-git.57914a2`), but that's not enough, as this information isn't part of the operator image (recently the CI was using the same outdated image for a week, and we failed to notice it until we had to test custom GPU Operator PRs)
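For the namespace lookup mentioned in the list above, here is a minimal sketch of the two options; the `gpu-cluster-policy` ClusterPolicy name is an assumption (the usual default), not something stated in this issue:

```bash
# Option kept (WONT FIX): find the operator pod and read its namespace.
oc get pod -l app.kubernetes.io/component=gpu-operator -A \
   -ojsonpath='{.items[].metadata.namespace}'

# Alternative considered: read it from the ClusterPolicy status, once the
# linked PR exposes it. "gpu-cluster-policy" is the usual default name (assumption).
oc get clusterpolicy gpu-cluster-policy -ojsonpath='{.status.namespace}'
```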