Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: install nvidia driver using runfile #482

Merged
merged 38 commits into from
Sep 14, 2022
Merged

feat: install nvidia driver using runfile #482

merged 38 commits into from
Sep 14, 2022

Conversation

faiq
Copy link
Collaborator

@faiq faiq commented Aug 29, 2022

What problem does this PR solve?:

Starting work on installing drivers with runfiles

Which issue(s) does this PR fix?:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@github-actions
Copy link
Contributor

github-actions bot commented Aug 29, 2022

File Coverage
All files 14%
pkg/ansible/runner.go 0%
pkg/app/artifacts.go 0%
pkg/app/build.go 0%
pkg/app/build_azure.go 2%
pkg/app/build_gcp.go 0%
pkg/app/config.go 48%
pkg/app/errors.go 0%
pkg/app/provision.go 0%
pkg/app/root.go 0%
pkg/app/utils.go 7%
pkg/app/validate.go 0%
pkg/appansible/io.go 0%
pkg/appansible/playbook.go 0%
pkg/azure/azure.go 0%
pkg/logging/logger.go 0%
pkg/packer/manifest.go 0%
pkg/packer/packer.go 0%
pkg/stringutil/rand.go 0%
pkg/version/info.go 8%

Minimum allowed coverage is 75%

Generated by 🐒 cobertura-action against cefc034

@faiq faiq force-pushed the faiq/nvidia-runfile branch 2 times, most recently from e4b3f1f to 92adde0 Compare August 30, 2022 23:30
$(MAKE) devkit.run WHAT="make build-$* \
BUILD_DRY_RUN=${BUILD_DRY_RUN} \
VERBOSITY=$(VERBOSITY) \
ADDITIONAL_ARGS="--instance-type=g4dn.2xlarge$(if $(ADDITIONAL_ARGS),$(SPACE)$(ADDITIONAL_ARGS))" \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this instance is what kaptain uses for their testing. we should do also test on this regularly. we've seen issues with this in the past

https://jira.d2iq.com/browse/D2IQ-88483


# to pick up the changes to unload nouveau
# some hardware like aws g4dn.2xlarge require this
- name: unconditionally reboot the machine with all defaults
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the machine also reboot for preprovisioned, or is sysprep==false in that case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's false.

however, they might need to reboot it on their own depending on OS if they were previously using nouveu drivers

- name: run nvidia-smi
shell:
executable: /bin/bash
cmd: nvidia-smi
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should add it to the path on install.

Copy link
Collaborator

@supershal supershal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. 🎉

Copy link
Contributor

@alejandroEsc alejandroEsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand all of it but looks good to me. One comment I would make is the naming for certain places has inconsistent nvidia naming, not a huge issue but something to consider in the future.

overrides/offline-nvidia.yaml Show resolved Hide resolved

# to pick up the changes to unload nouveau
# some hardware like aws g4dn.2xlarge require this
- name: unconditionally reboot the machine with all defaults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the machine also reboot for preprovisioned, or is sysprep==false in that case?

apt:
name: linux-headers-{{ ansible_kernel }}
name:
- linux-headers-{{ ansible_kernel }}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ apt-cache depends build-essential
build-essential
 |Depends: libc6-dev
  Depends: <libc-dev>
    libc6-dev
  Depends: gcc
  Depends: g++
  Depends: make
    make-guile
  Depends: dpkg-dev

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes it a little cleaner :)

@faiq faiq changed the title feat: install nvidia driver with runfile feat: install nvidia driver USING runfile Sep 14, 2022
@faiq faiq changed the title feat: install nvidia driver USING runfile feat: install nvidia driver usin runfile Sep 14, 2022
@faiq faiq changed the title feat: install nvidia driver usin runfile feat: install nvidia driver using runfile Sep 14, 2022
@github-actions github-actions bot added feature and removed feature labels Sep 14, 2022
@faiq faiq merged commit 22f89b8 into main Sep 14, 2022
@faiq faiq deleted the faiq/nvidia-runfile branch September 14, 2022 16:48
@fatz
Copy link
Contributor

fatz commented Oct 26, 2022

just want to make clear that this patch deleted the vault functionality to make sure staying with the centos minor version kernel.

That means whenever RedHat releases a 7.10 release we'll either make a half 7.9 half 7.10 image as we'll install 7.10 patch in a 7.9 image or we'll simply have to immediately drop support for 7.9 to follow 7.10.

But the minor version is hard coded into every KIB build. So @faiq I really really suggest keeping this backward compatibility as we render previous KIB versions unusable as soon as there is a new RHEL7/CentOS7 release

CentOS moves almost instant the kernel packages of 7.9 into vault once a new release is there.

This is the reason for the vault code as it drops the dependency on CentOS releases and timeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants