cml-cloud-run with gpus #117

@elleobrien

I'm testing an EC2 GPU instance with the cloud container cml-gpu-py3-cloud-runner, and I want to make sure I'm on the right track:

name: train-my-model

on: [push]

jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-gpu-cloud-runner

    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }} 
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."
          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
              --driver amazonec2 \
              --amazonec2-instance-type g3s.xlarge \
              --amazonec2-region us-east-2 \
              --amazonec2-zone a \
              --amazonec2-vpc-id vpc-76f1f01e \
              --amazonec2-ssh-user ubuntu \
              $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          ( 
          docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
          docker run --name runner -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e RUNNER_IDLE_TIMEOUT=120 \
            -e DOCKER_MACHINE=${MACHINE} \
            -e RUNNER_LABELS=cml \
            -e repo_token=$repo_token \
            -e NVIDIA_VISIBLE_DEVICES=all \
            -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
           dvcorg/cml-gpu-py3-cloud-runner && \
               sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted,cml]
    
    steps:
      - uses: actions/checkout@v2

      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }} 
        run: |
          nvidia-smi

This isn't working yet; there seem to be problems getting the drivers set up on the self-hosted runner. I'm betting I have a flag wrong somewhere in the deploy job. I tried adding the flag --gpus all to docker run, but that didn't work. Any ideas?
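
To narrow down where the driver setup fails, a minimal diagnostic sketch like the one below could be run on the EC2 host itself (not inside the runner container). It only reports whether a few commands are on the PATH. The helper names `have_cmd` and `gpu_host_report` are hypothetical, and the sketch assumes that `--gpus all` depends on both an NVIDIA driver (which provides `nvidia-smi`) and the NVIDIA container tooling (which provides `nvidia-container-cli`) being installed on the host. Whether the AMI that docker-machine picks by default ships either of these is not something the workflow above confirms.

```shell
#!/bin/sh
# Hypothetical helper: print "yes" if the given command is on PATH, else "no".
have_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo yes
  else
    echo no
  fi
}

# Hypothetical helper: report the host-side pieces a GPU container is
# assumed to need. The tool list is an assumption, not a CML requirement.
gpu_host_report() {
  for tool in nvidia-smi nvidia-container-cli docker; do
    echo "$tool: $(have_cmd "$tool")"
  done
}

gpu_host_report
```

If `nvidia-smi` is reported as missing on the host, the container cannot see a GPU regardless of the flags passed to docker run, which would point at the instance image rather than at the deploy job's flags.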
