Fix latest GPU container image tags #667

0x2b3bfa0 · 2021-07-22T11:16:21Z

We were pushing the latest tag twice [1, 2], once with the latest GPU image and once with the latest non-GPU image, in parallel. Whichever image got pushed last ended up overriding the other one. 🙈
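
To illustrate the fix, here is a minimal sketch of giving each image variant its own floating tag so the two parallel pushes can no longer collide. The `imageTags` helper is hypothetical and is not part of the workflow; the version string is borrowed from the `0-dvc2-base1-gpu` tag mentioned in this thread.

```javascript
// Hypothetical helper: returns the tags to push for one image variant.
// GPU images get a "-gpu" suffix on both the versioned and the floating tag,
// so the GPU and non-GPU pushes never write to the same tag.
function imageTags({ version, gpu }) {
  const suffix = gpu ? '-gpu' : '';
  return [`${version}${suffix}`, `latest${suffix}`];
}

const cpuTags = imageTags({ version: '0-dvc2-base1', gpu: false });
const gpuTags = imageTags({ version: '0-dvc2-base1', gpu: true });

console.log(cpuTags.join(' '));
console.log(gpuTags.join(' '));
```

With disjoint tag sets, the order in which the parallel jobs finish no longer matters.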

Closes nvidia smi not accesible in CML AMI #666. Bare-metal runners will check nvidia-smi to determine whether a GPU is present and use the proper image.

casperdcl

doh! though... do we need a separate latest and latest-gpu?

.github/workflows/test-deploy.yml

casperdcl · 2021-07-22T11:34:15Z

Should fix #666

I don't see how - #666 clearly uses dvcorg/cml:0-dvc2-base1-gpu, no?

0x2b3bfa0 · 2021-07-22T11:38:36Z

Please notice that the failing step on #666 is the test_runner one, not the test_container one.

With our current configuration, not specifying an image option in a GitLab CI/CD job will cause that job to be executed in the default container image specified with the --docker-image option when invoking the runner.
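
A rough sketch of what that implies for the runner: the default image passed through `--docker-image` has to match the hardware. The `defaultDockerImage` and `runnerArgs` helpers are hypothetical, and the `dvcorg/cml:latest` and `dvcorg/cml:latest-gpu` names are assumed from the tags discussed in this pull request rather than copied from the driver.

```javascript
// Hypothetical helper: choose the fallback image for jobs without an `image` option.
// Image names are assumptions based on the tags discussed in this pull request.
function defaultDockerImage({ gpu }) {
  return gpu ? 'dvcorg/cml:latest-gpu' : 'dvcorg/cml:latest';
}

// Hypothetical helper: build the runner arguments that set the default image.
function runnerArgs({ gpu }) {
  return ['--docker-image', defaultDockerImage({ gpu })];
}

console.log(runnerArgs({ gpu: true }).join(' '));
console.log(runnerArgs({ gpu: false }).join(' '));
```

Jobs that set their own image are unaffected; only jobs relying on the default would see the difference.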

casperdcl · 2021-07-22T11:41:35Z

btw do you know if we support OSX? I presume yes?

0x2b3bfa0 · 2021-07-22T11:47:57Z

Partially: we seem to support macOS¹ on GitHub; on GitLab, runner platforms are hardcoded, and the Mach-O loader would have a hard time trying to read ELF files.

cml/src/drivers/github.js

Lines 209 to 213 in a2eedb6

    
    const arch = process.platform === 'darwin' ? 'osx-x64' : 'linux-x64';
    const ver = '2.278.0';
    const destination = resolve(workdir, 'actions-runner.tar.gz');
    const url = `https://github.com/actions/runner/releases/download/v${ver}/actions-runner-${arch}-${ver}.tar.gz`;
    await download({ url, path: destination });

cml/src/drivers/gitlab.js

Lines 149 to 151 in a2eedb6

    
    const url =
      'https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64';
    await download({ url, path: bin });

However, who cares? Apple doesn't support CUDA on modern macOS systems, despite having NVIDIA GPU devices.

¹ OS X is a thing from the past. 😄

casperdcl · 2021-07-22T11:54:47Z

src/drivers/gitlab.js

+      try {
+        await exec('cuda-smi');
+      } catch (err) {
+        gpu = false;
+      }


what do you think of this for 🍎 support?

could go back to

Suggested change

-      try {
-        await exec('cuda-smi');
-      } catch (err) {
-        gpu = false;
-      }
+      gpu = false;

if you prefer...

What is cuda-smi supposed to do? 🤔

it's effectively what nvidia-smi is called on (some?) 🍎 systems tmux-plugins/tmux-cpu#24

But it's a third party tool (?)

Should we rely on that to detect CUDA?

idk, the impression I got was that CUDA can be installed on a mac without an nvidia-smi binary but with a cuda-smi binary available. Would be nice to get confirmation though.
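
For reference, a minimal sketch of the fallback idea under discussion: try `nvidia-smi` first, then `cuda-smi`, and report a GPU only if one of them runs successfully. Whether `cuda-smi` is a reliable indicator of CUDA on macOS is unconfirmed, as noted above. The `detectGpu` name and its injectable `exec` parameter are hypothetical and are not the driver's actual code.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

const run = promisify(execFile);

// Hypothetical helper: tries each candidate binary in order and resolves to
// true as soon as one of them runs successfully, false if none does.
// `exec` is injectable so the probing logic can be exercised without a GPU.
async function detectGpu(
  candidates = ['nvidia-smi', 'cuda-smi'],
  exec = (bin) => run(bin)
) {
  for (const bin of candidates) {
    try {
      await exec(bin);
      return true;
    } catch (err) {
      // Binary missing or exited with an error: try the next candidate.
    }
  }
  return false;
}
```

Calling `await detectGpu()` on a machine with neither binary on the PATH resolves to `false`, which matches the `gpu = false` fallback in the diff above.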

DavidGOrtega

I do not think that this change fixes #666

The real issue is that our CML AMI does not seem to have nvidia/cuda...

If you look, the tests are the same:

One running inside our CML Docker image
One running directly on the bare-metal cloud runner. That OS is the one that knows nothing about nvidia-smi

0x2b3bfa0 · 2021-07-22T12:15:02Z

@DavidGOrtega, are you sure that test_runner runs on the bare machine?

DavidGOrtega · 2021-07-22T12:18:09Z

Ah! True, GitLab runs using the Docker executor, no?

--executor "${IN_DOCKER ? 'shell' : 'docker'}" \

0x2b3bfa0 · 2021-07-22T12:20:03Z

Exactly!

DavidGOrtega

I think it's fine.
Bare metal will check nvidia-smi, determine whether there is a GPU, and use the proper image

.github/workflows/test-deploy.yml

DavidGOrtega · 2021-07-23T08:39:22Z

@0x2b3bfa0, before I merge, there is one thing we have to keep in mind: was the GPU image still working with non-GPU instances? I had some issues before.

0x2b3bfa0 · 2021-07-23T16:00:16Z

@DavidGOrtega, there won't be any issues with GPU images on non-GPU machines unless we apply iterative/terraform-provider-iterative#151

0x2b3bfa0 added 2 commits July 22, 2021 13:07

Add latest-gpu tag to container images

5d70c42

Use the latest-gpu tag for GitLab Docker with GPU

7940615

0x2b3bfa0 added the bug (Something isn't working), p0-critical (Max priority, ASAP), cml-runner (Subcommand), and cml-image (Subcommand) labels Jul 22, 2021

0x2b3bfa0 requested a review from DavidGOrtega July 22, 2021 11:16

0x2b3bfa0 self-assigned this Jul 22, 2021

0x2b3bfa0 temporarily deployed to internal July 22, 2021 11:16 Inactive

0x2b3bfa0 requested a review from casperdcl July 22, 2021 11:18

casperdcl reviewed Jul 22, 2021

.github/workflows/test-deploy.yml

add cuda-smi detection

0f5bfc6

casperdcl temporarily deployed to internal July 22, 2021 11:53 Inactive

casperdcl reviewed Jul 22, 2021

DavidGOrtega suggested changes Jul 22, 2021

DavidGOrtega approved these changes Jul 22, 2021

DavidGOrtega reviewed Jul 22, 2021

.github/workflows/test-deploy.yml

DavidGOrtega mentioned this pull request Jul 26, 2021

nvidia smi not accesible in CML AMI #666

Closed

casperdcl approved these changes Jul 26, 2021

casperdcl merged commit 20d5645 into master Jul 26, 2021

casperdcl deleted the add-gpu-to-latest branch July 26, 2021 09:03
