CML cloud runner #108
Conversation
dmpetrov
left a comment
@DavidGOrtega I didn't get it: when is the report generated?
dmpetrov
left a comment
a few more comments
.github/workflows/publish.yml (outdated diff context):
  tag_names: true
  ...
  - name: Publish CML runner docker image
CML runner is a bit too general a name. Any workflow runs in a runner. It should probably be cloud runner or something like that.
I have changed it to CML self-hosted runner.
.github/workflows/publish.yml (same outdated diff context as above)
I didn't get the idea of this Docker image. It looks very similar to CML. Why do we need it?
It's creating a Docker image under the runner tag that includes the runner piece of code.
For every current CML Docker image that we have (cml, cml-python3, cml-gpu, cml-gpu-python3) we have latest and runner tags; the latter contains the docker, runner and docker-machine code in the Dockerfile needed to use the self-hosted runner on the cloud.
You mean in the cml job? I just used a wait job to control the time and be able to test. But in that job you can actually do whatever you want.
I mean the report that we send as a comment. The major part of CML :)
I understand you. We have the deploy job and then the cml job. The cml job is going to do the comments, report, etc. It's just doing a sleep because this PR is more about how to deploy the self-hosted cloud runner, but as I said the cml job should be doing the report, comment, tensorboard, etc. like always. I'm going to set up a minimal example.
Yep, the example is needed.
Should I use CML's example number 1?
Any report is fine for now.
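A minimal sketch of that deploy-then-report structure, in case it helps to picture it. The runner labels, training script, image and token plumbing are assumptions for illustration; only the split into a deploy job and a cml job that posts the report follows the discussion above.

```yaml
# Hedged sketch of the cml job (assumed labels, script and secrets; not the final example)
cml:
  needs: deploy                    # wait for the self-hosted cloud runner started by the deploy job
  runs-on: [self-hosted, cml]      # assumed labels of the deployed runner
  steps:
    - uses: actions/checkout@v2
    - name: Train and send report
      env:
        repo_token: ${{ secrets.GITHUB_TOKEN }}   # token CML uses to post the comment
      run: |
        python train.py                  # assumed training script that writes metrics.txt
        cat metrics.txt >> report.md
        cml-send-comment report.md       # posts report.md as a comment via the CML CLI
```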
dmpetrov
left a comment
✨ Looks great!
One question is inline. Also, please make sure basic scenarios are tested like:
- No success in resource allocation: what should happen?
- The job never finishes: how can we handle this properly?
- Is it possible that the training is done but the machine is still working?
Diff context (publish workflow):
  if: github.event_name == 'push' && (contains(github.ref, 'tags') || github.ref == 'refs/heads/master')
  uses: elgohr/Publish-Docker-Github-Action@master
  env:
    DOCKER_FROM: cml
Why do we need DOCKER_FROM? Why don't we use that in other images like cml-gpu-py3?
The same question about buildargs.
We build the images using Dockerfile-cloud-runner, so to avoid repeating four Dockerfiles we just parameterize the FROM with args coming from buildargs in the plugin. According to the specs, they have to come from the action's ENV:
ARG DOCKER_FROM=cml
FROM dvcorg/${DOCKER_FROM}:latest as base
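For context, a publish step wired this way could look roughly like the sketch below. The if condition, the elgohr action, the DOCKER_FROM env var and the buildargs mechanism come from this PR; the image name, secret names and exact inputs are assumptions.

```yaml
- name: Publish CML cloud runner docker image
  if: github.event_name == 'push' && (contains(github.ref, 'tags') || github.ref == 'refs/heads/master')
  uses: elgohr/Publish-Docker-Github-Action@master
  env:
    DOCKER_FROM: cml               # consumed by "ARG DOCKER_FROM" in Dockerfile-cloud-runner
  with:
    name: dvcorg/cml               # assumed target image
    username: ${{ secrets.DOCKER_USERNAME }}    # assumed secret names
    password: ${{ secrets.DOCKER_PASSWORD }}
    dockerfile: Dockerfile-cloud-runner
    buildargs: DOCKER_FROM         # the action forwards this env var as a docker --build-arg
    tags: runner                   # publishes under the "runner" tag
```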
- No success in resource allocation: what should happen?
  The job will fail and the workflow won't run.
- The job never finishes: how can we handle this properly?
  Very interesting question. Are we speaking about the deploy job or the train job? In both cases the workflow will time out, since the whole workflow has a limited run time. However, a never-ending training process will end up with a machine working forever. In any case the user should see that the workflow did not succeed properly. This makes me think of a next iteration where we could actually add a check that the machine has been cleaned up.
- Is it possible that the training is done but the machine is still working?
  No, they have an idle mechanism: if no jobs are handled within RUNNER_IDLE_TIMEOUT seconds they kill themselves.
👍
Introduces CML-runner, a wrapper over GH and GL runners.
Using it within a deploy job in the workflow will automatically deploy a self-managed runner on the cloud that waits for jobs for the time given by RUNNER_IDLE_TIMEOUT in seconds. If a self-hosted runner is idle for that long, it unregisters itself from GH/GL and shuts down the tied machine. Below there is a sample in bash of all the logic that could be wrapped in CML-cloud-deploy @dmpetrov
CML example 1 running on AWS self-hosted runners:
- GitHub
- GitLab
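For completeness, here is a hedged sketch of the deploy job side of this flow, assuming it uses docker-machine to provision an AWS machine and then starts the dvcorg/cml:runner image on it. This is not the bash sample mentioned above; the machine name, instance type and every environment variable except RUNNER_IDLE_TIMEOUT are illustrative placeholders.

```yaml
# Hedged sketch of a deploy job provisioning the cloud runner (illustrative only)
deploy:
  runs-on: ubuntu-latest
  container: docker://dvcorg/cml:runner     # image carrying docker, docker-machine and the runner code
  steps:
    - name: Deploy self-hosted runner on AWS
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        REPO_TOKEN: ${{ secrets.REPO_TOKEN }}     # assumed token used to register the runner with GH/GL
      run: |
        # Provision an EC2 machine (name and instance type are placeholders)
        docker-machine create --driver amazonec2 \
          --amazonec2-instance-type t2.micro cml-cloud-runner
        eval "$(docker-machine env cml-cloud-runner)"
        # Start the runner container on that machine; per this PR, after
        # RUNNER_IDLE_TIMEOUT seconds without jobs it unregisters itself and
        # the tied machine is shut down. The env var names passed to the
        # runner below are assumptions.
        docker run -d \
          -e repo_token="$REPO_TOKEN" \
          -e RUNNER_IDLE_TIMEOUT=300 \
          dvcorg/cml:runner
```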