Closed
Description
I'm testing an EC2 GPU instance with the cloud runner container cml-gpu-py3-cloud-runner, and I want to make sure I'm on the right track:
name: train-my-model
on: [push]
jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-gpu-cloud-runner
    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."
          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-region us-east-2 \
            --amazonec2-zone a \
            --amazonec2-vpc-id vpc-76f1f01e \
            --amazonec2-ssh-user ubuntu \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
          docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
          docker run --name runner -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e RUNNER_IDLE_TIMEOUT=120 \
            -e DOCKER_MACHINE=${MACHINE} \
            -e RUNNER_LABELS=cml \
            -e repo_token=$repo_token \
            -e NVIDIA_VISIBLE_DEVICES=all \
            -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
            dvcorg/cml-gpu-py3-cloud-runner && \
          sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted,cml]
    steps:
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
        run: |
          nvidia-smi
This isn't working yet; it looks like the NVIDIA drivers aren't getting set up on the self-hosted runner. I suspect I have a flag wrong somewhere in the deploy job. I tried adding the flag --gpus all to docker run, but that didn't help. Any ideas?
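One way to narrow this down is to check the GPU stack one layer at a time: first the EC2 host, then the running container. The sketch below is only a diagnostic aid, not a fix. The function name `check_gpu_layers` is hypothetical, and it assumes the container is still named `runner` (as in the deploy job), that `docker-machine` is on the PATH, and that `sudo docker` works over SSH on the instance.

```shell
#!/bin/sh
# Hypothetical helper (not part of CML): report which layer of the GPU stack
# fails on a docker-machine host. Usage: check_gpu_layers "$MACHINE"
check_gpu_layers() {
  machine="$1"

  # Layer 1: the EC2 host itself. If nvidia-smi fails here, the NVIDIA driver
  # is probably not installed on the instance, and no docker flag can help.
  if ! docker-machine ssh "$machine" "nvidia-smi" >/dev/null 2>&1; then
    echo "host: nvidia-smi failed on the instance"
    return 1
  fi

  # Layer 2: the running container named "runner" (assumed from the deploy job).
  # If the host passes but this fails, the GPU is not being exposed to the
  # container (for example, the NVIDIA container runtime may be missing).
  if ! docker-machine ssh "$machine" "sudo docker exec runner nvidia-smi" >/dev/null 2>&1; then
    echo "container: nvidia-smi failed inside 'runner'"
    return 2
  fi

  echo "ok"
}
```

Calling `check_gpu_layers "$MACHINE"` right after the `sleep 20` in the deploy step would show whether the problem sits on the instance or in how the container is started. A `host:` result would point at the machine image rather than at the `docker run` flags.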