Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix latest GPU container image tags #667

Merged
merged 3 commits into from
Jul 26, 2021
Merged

Fix latest GPU container image tags #667

merged 3 commits into from
Jul 26, 2021

Conversation

0x2b3bfa0
Copy link
Member

@0x2b3bfa0 0x2b3bfa0 commented Jul 22, 2021

We were pushing the latest tag twice[1, 2] with the latest GPU and non-GPU images, in parallel. Whichever image got pushed first ended up overriding the other one. 馃檲

@0x2b3bfa0 0x2b3bfa0 added bug Something isn't working p0-critical Max priority (ASAP) cml-runner Subcommand cml-image Subcommand labels Jul 22, 2021
@0x2b3bfa0 0x2b3bfa0 self-assigned this Jul 22, 2021
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to internal July 22, 2021 11:16 Inactive
@0x2b3bfa0 0x2b3bfa0 requested a review from casperdcl July 22, 2021 11:18
Copy link
Contributor

@casperdcl casperdcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh! though... do we need a separate latest and latest-gpu?

.github/workflows/test-deploy.yml Show resolved Hide resolved
@casperdcl
Copy link
Contributor

Should fix #666

I don't see how - #666 clearly uses dvcorg/cml:0-dvc2-base1-gpu, no?

@0x2b3bfa0
Copy link
Member Author

Please notice that the failing step on #666 is the test_runner one, not the test_container one.

With out current configuration, not specifiying an image option in a GitLab CI/CD job will cause that job to be executed in the default container image specified with the --docker-image when invoking the runner.

@casperdcl
Copy link
Contributor

btw do you know if we support OSX? I presume yes?

@0x2b3bfa0
Copy link
Member Author

0x2b3bfa0 commented Jul 22, 2021

Partially: we seem to support macOS1 on GitHub; on GitLab, runner platforms are hardcoded, and the Mach-O loader would have a hard time trying to read ELF files.

cml/src/drivers/github.js

Lines 209 to 213 in a2eedb6

const arch = process.platform === 'darwin' ? 'osx-x64' : 'linux-x64';
const ver = '2.278.0';
const destination = resolve(workdir, 'actions-runner.tar.gz');
const url = `https://github.com/actions/runner/releases/download/v${ver}/actions-runner-${arch}-${ver}.tar.gz`;
await download({ url, path: destination });

cml/src/drivers/gitlab.js

Lines 149 to 151 in a2eedb6

const url =
'https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64';
await download({ url, path: bin });

However, who cares? Apple doesn't support CUDA on modern macOS systems, despite having NVIDIA GPU devices.


1 OS X is a thing from the past. 馃槃

@casperdcl casperdcl temporarily deployed to internal July 22, 2021 11:53 Inactive
Comment on lines +143 to +147
try {
await exec('cuda-smi');
} catch (err) {
gpu = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of this for 馃崕 support?

could go back to

Suggested change
try {
await exec('cuda-smi');
} catch (err) {
gpu = false;
}
gpu = false;

if you prefer...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is cuda-smi supposed to do? 馃

Copy link
Contributor

@casperdcl casperdcl Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's effectively what nvidia-smi is called on (some?) 馃崕 systems tmux-plugins/tmux-cpu#24

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's a third party tool (?)

Should we rely on that to detect CUDA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk, the impression I got was that CUDA can be installed on a mac without an nvidia-smi binary but with a cuda-smi binary available. Would be nice to get confirmation though.

Copy link
Contributor

@DavidGOrtega DavidGOrtega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this change fixes #666

The real issue is that our CML AMI seems not to be having nvidia/cuda...

If you see the tests are the same.

  • One running inside our CML docker image
  • One running directly in the baremetal cloud runner. Its that OS the one which does not know nothing about nvidia-smi

@0x2b3bfa0
Copy link
Member Author

@DavidGOrtega, are you sure that test_runner runs on the bare machine?

@DavidGOrtega
Copy link
Contributor

DavidGOrtega commented Jul 22, 2021

Ah! True Gitlab runs using the docker executor no?

--executor "${IN_DOCKER ? 'shell' : 'docker'}" \

@0x2b3bfa0
Copy link
Member Author

Exactly!

Copy link
Contributor

@DavidGOrtega DavidGOrtega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fine.
Baremetal will check nvidia-smi and determine gpu and use the proper image

@DavidGOrtega
Copy link
Contributor

@0x2b3bfa0 before I do the merge one thing that we have to take in mind. Was gpu image still working with non GPU instances? I had some issues before

@0x2b3bfa0
Copy link
Member Author

@DavidGOrtega, there won't be any issues with GPU images on non-GPU machines unless we apply iterative/terraform-provider-iterative#151

@casperdcl casperdcl merged commit 20d5645 into master Jul 26, 2021
@casperdcl casperdcl deleted the add-gpu-to-latest branch July 26, 2021 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cml-image Subcommand cml-runner Subcommand p0-critical Max priority (ASAP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nvidia smi not accesible in CML AMI
3 participants