This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

ci-artifacts maintenance overview #291

Closed
4 of 10 tasks
kpouget opened this issue Nov 29, 2021 · 6 comments

Comments

@kpouget
Collaborator

kpouget commented Nov 29, 2021

Some things I have in mind for improving/fixing ci-artifacts:

  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in the NFD master test
    • takes 8 minutes to build:

2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 *******
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

  • Double check the alert-testing of the GPU Operator master branch

    • I'm not sure what happens with the driver-cannot-be-built alert with respect to entitlement-free deployments
  • Refresh the versions used for the GPU Operator upgrade testing (currently only 4.6 --> 4.7)

  • Confirm the fate of testing the GPU Operator on OCP 4.6 clusters

    • EUS release
  • Enable testing of the GPU Operator on OCP 4.10

  • Improve the GPU Operator's version reporting and rewrite gpu_operator_get_csv_version

    • I think it's becoming critical that the GPU Operator image exposes the GPU Operator version (v1.x.y) and the git commit used to build it
    • in the CI, I already include the commit hash + commit date in the master-branch bundle version (e.g., 21.11.25-git.57914a2), but that's not enough, as this information isn't part of the operator image (recently, the CI was using the same outdated image for a week, and we failed to notice it until we had to test custom GPU Operator PRs)
    • once this is done, update this role to avoid fetching the information from the CSV whenever possible (won't be backported)
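
A minimal sketch of what that lookup could become, assuming the operator image eventually exposes the version and commit as OCI image labels; the label names (version, vcs-ref) and the gpu_operator_image variable are illustration-only assumptions, not something the operator provides today:

# Hypothetical sketch, not the current role: read the version/commit from
# image labels instead of the CSV. The label names below are assumptions.
- name: Inspect the GPU Operator image
  command: skopeo inspect docker://{{ gpu_operator_image }}
  register: operator_image_info

- name: Collect the image labels (empty dict if the image has none)
  set_fact:
    operator_image_labels: "{{ (operator_image_info.stdout | from_json).Labels | default({}, true) }}"

- name: Read the version and commit from the labels
  set_fact:
    gpu_operator_version: "{{ operator_image_labels.get('version', '') }}"
    gpu_operator_commit: "{{ operator_image_labels.get('vcs-ref', '') }}"

- name: Fall back to the existing CSV lookup when the labels are missing
  include_role:
    name: gpu_operator_get_csv_version
  when: not gpu_operator_version

With such labels in place, the CSV query would only remain as a fallback for older operator images.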
@kpouget kpouget pinned this issue Nov 29, 2021
@omertuc
Contributor

omertuc commented Nov 29, 2021

this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they would have to start patching things around to get the repo to use their revision, which might be annoying.

takes 8 minutes to build

With regard to the 8 minutes it saves, and to CI performance in general: we do a lot of things in the CI unnecessarily sequentially, e.g. "start the helper image build, then wait for it to be built":

- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything, we wouldn't even have to think about it: things that can happen in parallel will happen in parallel, and things that block on other things will simply wait for them to complete.
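
A rough illustration of that "apply everything, wait afterwards" pattern; the manifest file names and the expected_worker_count variable are made up for the example:

# Sketch only: kick everything off first, without any waiting in between.
- name: Apply all the manifests up front
  command: oc apply -f "{{ item }}"
  loop:
    - manifests/machineset-scale-up.yaml       # hypothetical file names
    - manifests/helper-image-buildconfig.yaml
    - manifests/nfd-operator.yaml

# Only afterwards, block on what the next steps actually depend on.
- name: Wait for the new worker nodes to join
  command: oc get nodes -lnode-role.kubernetes.io/worker --no-headers
  register: worker_nodes
  until: worker_nodes.stdout_lines | length >= expected_worker_count | int
  retries: 60
  delay: 30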

@kpouget
Collaborator Author

kpouget commented Nov 29, 2021

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they would have to start patching things around to get the repo to use their revision, which might be annoying.

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised;
then first merge the PR to update the Dockerfile, and then keep working on the new behavior

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

good point, although I'm afraid it might complicate the troubleshooting when things go wrong

or maybe it could be coded differently: instead of "scale up the cluster; wait for the end of the scale-up; build the GPU Operator",
it could be more focused on the actual needs:

deploy NFD (without waiting for it)
scale up cluster (without waiting for it)

build the GPU Operator

wait for the end of the scale up # 1
wait for the GPU label to appear  #2
wait for the GPU Operator

(#1 and #2 are somewhat redundant, but they help with troubleshooting)
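
A rough sketch of what those decoupled waits could look like as separate Ansible tasks; the machineset_name variable is made up for the example, and the NFD PCI label (10de is the NVIDIA vendor id) and the ClusterPolicy status.state field are best-effort assumptions:

# Sketch of the decoupled waits; field and label names are best-effort assumptions.

# 1 - wait for the end of the scale-up
- name: Wait for the machineset to report its new replica as ready
  command: oc get machineset {{ machineset_name }} -n openshift-machine-api
           -ojsonpath={.status.readyReplicas}
  register: ready_replicas
  until: ready_replicas.stdout | int >= 1
  retries: 60
  delay: 30

# 2 - wait for the GPU label to appear (NFD labels NVIDIA nodes via their PCI vendor id)
- name: Wait for a node carrying the NVIDIA PCI label
  command: oc get nodes -lfeature.node.kubernetes.io/pci-10de.present=true --no-headers
  register: gpu_nodes
  until: gpu_nodes.stdout | trim | length > 0
  retries: 60
  delay: 30

# 3 - wait for the GPU Operator (assumes a ClusterPolicy that reports status.state)
- name: Wait for the ClusterPolicy to become ready
  command: oc get clusterpolicy -ojsonpath={.items[0].status.state}
  register: clusterpolicy_state
  until: clusterpolicy_state.stdout == 'ready'
  retries: 60
  delay: 30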

@omertuc
Contributor

omertuc commented Nov 29, 2021

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised;
then first merge the PR to update the Dockerfile, and then keep working on the new behavior

Maybe even move it to subprojects then so it's more clearly a separate thing

wait for the end of the scale up # 1
wait for the GPU label to appear #2
wait for the GPU Operator

Yep that's definitely what I had in mind

@kpouget
Collaborator Author

kpouget commented Nov 30, 2021

I added

  • Call the hack/must-gather.sh script instead of custom scripts

to the list, to be done once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346

@lranjbar

lranjbar commented Dec 1, 2021

  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in the NFD master test
    • takes 8 minutes to build:
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. The thing about these statements, though, is that they suggest these images should be defined in a separate repository, to avoid building them on every test. In that case, the test would just import the image at its beginning from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than it does to build and reconcile it, so this would be the best approach for reducing the test run time.
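
A minimal sketch of that import step, assuming the helper image were published to an external registry; the quay.io path and the ci-utils ImageStream name are made up for illustration:

# Sketch only: import a pre-built helper image instead of building it in every run.
# The source repository below is a hypothetical example, not an existing image.
- name: Import the pre-built helper image into the test namespace
  command: oc import-image ci-utils:latest
           --from=quay.io/openshift-psap/ci-artifacts-helper:latest
           --confirm
           -n gpu-operator-ci-utils

Importing a tag is a matter of seconds, compared to the ~8 minutes the BuildConfig currently takes.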

@kpouget
Collaborator Author

kpouget commented Dec 1, 2021

one thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow (hence the notion of a toolbox: you can reuse all of these tools to set up your cluster and/or test your ongoing development locally).

Looking further, this also enables seamless portability to any other CI infrastructure. Only /testing/prow/ is lightly Prow-specific, as it expects a few environment variables and some secret files at the right locations; and of course the test definitions live in openshift/release.

@kpouget kpouget closed this as completed Apr 3, 2022
@kpouget kpouget unpinned this issue Apr 3, 2022