This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

ci-artifacts maintenance overview #291

Closed
4 of 10 tasks
kpouget opened this issue Nov 29, 2021 · 6 comments

Comments

@kpouget
Collaborator

kpouget commented Nov 29, 2021

Some things I have in mind for improving/fixing ci-artifacts:

  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in the NFD master test
    • takes 8 minutes to build:

2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 *******
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

  • Double check the alert-testing of the GPU Operator master branch

    • I'm not sure what happens with the driver-cannot-be-built alert with respect to entitlement-free deployments
  • Refresh the versions used for the GPU Operator upgrade testing (currently only 4.6 --> 4.7)

  • Confirm the fate of testing the GPU Operator on OCP 4.6 clusters

    • EUS release
  • Enable testing of the GPU Operator on OCP 4.10

  • Improve the GPU Operator's version reporting and rewrite gpu_operator_get_csv_version

    • I think it's becoming critical that the GPU Operator image exposes the GPU Operator version (v1.x.y) and the git commit used to build it
    • in the CI, I already include the commit hash + commit date in the master-branch bundle version (e.g., 21.11.25-git.57914a2), but that's not enough, as this information isn't part of the operator image (recently, the CI was using the same outdated image for a week, and we failed to notice it until we had to test custom GPU Operator PRs)
    • once this is done, update this role to avoid fetching the information from the CSV whenever possible (won't be backported)
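
A minimal sketch of what that lookup could become, assuming the operator image eventually exposes the version and commit as OCI image labels; the label names (version, vcs-ref) and the gpu_operator_image variable are illustration-only assumptions, not something the operator provides today:

# Hypothetical sketch, not the current role: read the version/commit from
# image labels instead of the CSV. The label names below are assumptions.
- name: Inspect the GPU Operator image
  command: skopeo inspect docker://{{ gpu_operator_image }}
  register: operator_image_info

- name: Collect the image labels (empty dict if the image has none)
  set_fact:
    operator_image_labels: "{{ (operator_image_info.stdout | from_json).Labels | default({}, true) }}"

- name: Read the version and commit from the labels
  set_fact:
    gpu_operator_version: "{{ operator_image_labels.get('version', '') }}"
    gpu_operator_commit: "{{ operator_image_labels.get('vcs-ref', '') }}"

- name: Fall back to the existing CSV lookup when the labels are missing
  include_role:
    name: gpu_operator_get_csv_version
  when: not gpu_operator_version

With such labels in place, the CSV query would only remain as a fallback for older operator images.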
@kpouget kpouget pinned this issue Nov 29, 2021
@omertuc
Contributor

omertuc commented Nov 29, 2021

this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they would have to start patching things around to get the repo to use their revision, which might be annoying.

takes 8 minutes to build

With regard to the 8 minutes it saves, and to CI performance in general: we do a lot of things in the CI unnecessarily sequentially, e.g. "start the helper image build, then wait for it to be built":

- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything, we wouldn't even have to think about it: things that can happen in parallel will happen in parallel, and things that block on other things will simply wait for them to complete.
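
A rough illustration of that "apply everything, wait afterwards" pattern; the manifest file names and the expected_worker_count variable are made up for the example:

# Sketch only: kick everything off first, without any waiting in between.
- name: Apply all the manifests up front
  command: oc apply -f "{{ item }}"
  loop:
    - manifests/machineset-scale-up.yaml       # hypothetical file names
    - manifests/helper-image-buildconfig.yaml
    - manifests/nfd-operator.yaml

# Only afterwards, block on what the next steps actually depend on.
- name: Wait for the new worker nodes to join
  command: oc get nodes -lnode-role.kubernetes.io/worker --no-headers
  register: worker_nodes
  until: worker_nodes.stdout_lines | length >= expected_worker_count | int
  retries: 60
  delay: 30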

@kpouget
Collaborator Author

kpouget commented Nov 29, 2021

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they would have to start patching things around to get the repo to use their revision, which might be annoying.

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised;
then first merge the PR to update the Dockerfile, and then keep working on the new behavior

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

good point, although I'm afraid it might complicate the troubleshooting when things go wrong

or maybe it could be coded differently: instead of "scale up the cluster; wait for the end of the scale-up; build the GPU Operator",
it could be more focused on the actual needs:

deploy NFD (without waiting for it)
scale up cluster (without waiting for it)

build the GPU Operator

wait for the end of the scale up # 1
wait for the GPU label to appear  #2
wait for the GPU Operator

(#1 and #2 are somewhat redundant, but they help with troubleshooting)
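
A rough sketch of what those decoupled waits could look like as separate Ansible tasks; the machineset_name variable is made up for the example, and the NFD PCI label (10de is the NVIDIA vendor id) and the ClusterPolicy status.state field are best-effort assumptions:

# Sketch of the decoupled waits; field and label names are best-effort assumptions.

# 1 - wait for the end of the scale-up
- name: Wait for the machineset to report its new replica as ready
  command: oc get machineset {{ machineset_name }} -n openshift-machine-api
           -ojsonpath={.status.readyReplicas}
  register: ready_replicas
  until: ready_replicas.stdout | int >= 1
  retries: 60
  delay: 30

# 2 - wait for the GPU label to appear (NFD labels NVIDIA nodes via their PCI vendor id)
- name: Wait for a node carrying the NVIDIA PCI label
  command: oc get nodes -lfeature.node.kubernetes.io/pci-10de.present=true --no-headers
  register: gpu_nodes
  until: gpu_nodes.stdout | trim | length > 0
  retries: 60
  delay: 30

# 3 - wait for the GPU Operator (assumes a ClusterPolicy that reports status.state)
- name: Wait for the ClusterPolicy to become ready
  command: oc get clusterpolicy -ojsonpath={.items[0].status.state}
  register: clusterpolicy_state
  until: clusterpolicy_state.stdout == 'ready'
  retries: 60
  delay: 30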

@omertuc
Contributor

omertuc commented Nov 29, 2021

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised;
then first merge the PR to update the Dockerfile, and then keep working on the new behavior

Maybe even move it to subprojects then so it's more clearly a separate thing

wait for the end of the scale up # 1
wait for the GPU label to appear #2
wait for the GPU Operator

Yep that's definitely what I had in mind

@kpouget
Collaborator Author

kpouget commented Nov 30, 2021

I added

  • Call the hack/must-gather.sh script instead of custom scripts

to the list, to be done once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346

@lranjbar

lranjbar commented Dec 1, 2021

  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in the NFD master test
    • takes 8 minutes to build:
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. The thing about these statements, though, is that they suggest these images should be defined in a separate repository, to avoid building them on every test. In that case, the test would just import the image at its beginning from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than it does to build and reconcile it, so this would be the best approach for reducing the test run time.
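
A minimal sketch of that import step, assuming the helper image were published to an external registry; the quay.io path and the ci-utils ImageStream name are made up for illustration:

# Sketch only: import a pre-built helper image instead of building it in every run.
# The source repository below is a hypothetical example, not an existing image.
- name: Import the pre-built helper image into the test namespace
  command: oc import-image ci-utils:latest
           --from=quay.io/openshift-psap/ci-artifacts-helper:latest
           --confirm
           -n gpu-operator-ci-utils

Importing a tag is a matter of seconds, compared to the ~8 minutes the BuildConfig currently takes.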

@kpouget
Collaborator Author

kpouget commented Dec 1, 2021

one thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow (hence the notion of a toolbox: you can reuse all of these tools to set up your cluster and/or test your ongoing development locally).

Looking further, this also enables seamless portability to any other CI infrastructure. Only /testing/prow/ is lightly Prow-specific, as it expects a few environment variables and some secret files at the right locations; and of course the test definitions live in openshift/release.

@kpouget kpouget closed this as completed Apr 3, 2022
@kpouget kpouget unpinned this issue Apr 3, 2022