Contributor

@DavidGOrtega DavidGOrtega commented May 28, 2020

Introduces CML-runner, a wrapper over the GitHub (GH) and GitLab (GL) runners.
Using it within a deploy job in the workflow automatically deploys a self-managed runner in the cloud, which waits for jobs for the time given by RUNNER_IDLE_TIMEOUT (in seconds). If the self-hosted runner stays idle for that long, it unregisters itself from GH/GL and shuts down the machine it is tied to.

Below is a sample in bash of all the logic that could be wrapped in CML-cloud-deploy @dmpetrov

CML example 1 running on AWS self-hosted runners

GitHub

⚠️ On GH, a new token needs to be generated, since the runners API seems to be unreachable with the default GITHUB_TOKEN.

name: train-my-model

on: [push]

jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-cloud-runner

    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }} 
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."

          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
              --driver amazonec2 \
              --amazonec2-instance-type t2.micro \
              --amazonec2-region us-east-1 \
              --amazonec2-zone f \
              --amazonec2-vpc-id vpc-06bc773d85a0a04f7 \
              --amazonec2-ssh-user ubuntu \
              $MACHINE

          eval "$(docker-machine env --shell sh $MACHINE)"

          ( 
          docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \

          docker run --name runner -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e RUNNER_IDLE_TIMEOUT=120 \
            -e DOCKER_MACHINE=${MACHINE} \
            -e RUNNER_LABELS=cml \
            -e repo_token=$repo_token \
            -e RUNNER_REPO=https://github.com/DavidGOrtega/3_tensorboard \
           dvcorg/cml-cloud-runner && \

          sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)

  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted,cml]

    steps:
      - uses: actions/checkout@v2

      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }} 
        run: |
          pip install -r requirements.txt
          python train.py
        
          cat metrics.txt >> report.md
          cml-publish confusion_matrix.png --md >> report.md
          cml-send-github-check report.md

GitLab

deploy-cloud-runner:
  image: dvcorg/cml-cloud-runner
  script:
    - MACHINE="CML-$(openssl rand -hex 12)"
    - docker-machine create
        --driver amazonec2
        --amazonec2-instance-type t2.micro
        --amazonec2-region us-east-1
        --amazonec2-zone f
        --amazonec2-vpc-id vpc-06bc773d85a0a04f7
        --amazonec2-ssh-user ubuntu
        $MACHINE

    - eval "$(docker-machine env --shell sh $MACHINE)"

    - ( 
      docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" &&
      docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine &&

      docker run --name runner -d
        -v /docker_machine/machine:/root/.docker/machine
        -e RUNNER_IDLE_TIMEOUT=120
        -e DOCKER_MACHINE=${MACHINE}
        -e RUNNER_LABELS=cml
        -e repo_token=$repo_token
        -e RUNNER_REPO=https://gitlab.com/DavidGOrtega/3_tensorboard
        dvcorg/cml-cloud-runner &&
        
      sleep 15 && echo "Deployed $MACHINE"
      ) || (echo y | docker-machine rm $MACHINE && exit 1)


train:
  tags:
    - cml

  script:
    - pip install -r requirements.txt
    - python train.py
        
    - cat metrics.txt >> report.md
    - cml-publish confusion_matrix.png --md >> report.md
    - cml-send-github-check report.md
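
Both samples share the same deploy-or-roll-back shape: if any provisioning step fails, the machine is removed and the job exits non-zero. A minimal sketch of that pattern in isolation follows. The function name run_or_rollback and its two arguments are hypothetical, introduced only for illustration, and are not part of CML:

```sh
#!/bin/sh
# Sketch only: run a deploy step and, if it fails, run a cleanup step
# and report failure. Both arguments are names of commands or shell
# functions; nothing here is taken from the CML code base.
run_or_rollback() {
  deploy_cmd="$1"
  cleanup_cmd="$2"
  if "$deploy_cmd"; then
    return 0
  fi
  # Deploy failed: tear down whatever was created, then fail the job.
  "$cleanup_cmd"
  return 1
}
```

In the workflows above, the deploy step would correspond to the docker-machine ssh/scp and docker run chain, and the cleanup step to docker-machine rm.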

@DavidGOrtega DavidGOrtega requested a review from dmpetrov May 28, 2020 14:59
Member

@dmpetrov dmpetrov left a comment

@DavidGOrtega I didn't get when the report is generated?

Member

@dmpetrov dmpetrov left a comment

a few more comments

tag_names: true


- name: Publish CML runner docker image
Member

CML runner is too general a name: any workflow runs in a runner. It should probably be cloud runner or something like that.

Contributor Author

I have changed it to CML self-hosted runner.

tag_names: true


- name: Publish CML runner docker image
Member

I don't understand the purpose of this Docker image. It looks very similar to CML. Why do we need it?

Contributor Author

It's creating a Docker image under the runner tag that includes the runner code.

Contributor Author

For every CML Docker image that we currently have (cml, cml-python3, cml-gpu, cml-gpu-python3) we have latest and runner tags. The latter contains the docker, runner and docker-machine code in the Dockerfile needed to use the self-hosted runner in the cloud.

Contributor Author

DavidGOrtega commented Jun 3, 2020

> @DavidGOrtega I didn't get when the report is generated?

You mean in the cml job? I just used a wait job to control the time and be able to test. But in that job you can actually do whatever you want.

Member

dmpetrov commented Jun 3, 2020

> You mean in the cml job? I just used a wait job to control the time and be able to test. But in that job you can actually do whatever you want.

I mean the report that we send as a comment. It's the major part of CML :)

@DavidGOrtega
Contributor Author

I understand. We have the deploy job and then the cml job. The cml job is the one that produces comments, reports, etc. Here it is just doing a sleep because this PR is more about how to deploy the self-hosted cloud runner, but as I said, the cml job should do the report, comment, tensorboard, etc. as always. I'm going to set up a minimal example.

Member

dmpetrov commented Jun 3, 2020

> I'm going to set up a minimal example.

yep. the example is needed.

@DavidGOrtega
Contributor Author

Should I use CML's example number 1?

Member

dmpetrov commented Jun 3, 2020

> Should I use CML's example number 1?

any report is fine for now

Member

@dmpetrov dmpetrov left a comment

✨ Looks great!
One question is inline. Also, please make sure basic scenarios are tested like:

  1. no success in resource allocation. what should happen?
  2. job never finishes? how we can handle this properly?
  3. is it possible that the training is done but the machine is still working?

if: github.event_name == 'push' && (contains(github.ref, 'tags') || github.ref == 'refs/heads/master')
uses: elgohr/Publish-Docker-Github-Action@master
env:
  DOCKER_FROM: cml
Member

Why do we need DOCKER_FROM? Why don't we use that in other images like cml-gpu-py3?

Member

the same question about buildargs

Contributor Author

@DavidGOrtega DavidGOrtega Jun 4, 2020

We build the images using Dockerfile-cloud-runner, so to avoid repeating four Dockerfiles we just use FROM with args coming from buildargs in the plugin. According to the specs, they have to come from the action's env.

ARG DOCKER_FROM=cml

FROM dvcorg/${DOCKER_FROM}:latest as base
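
As a rough local equivalent of what the publish action does with buildargs, the same Dockerfile could be built once per base image. This is a sketch, not the project's actual build script: the dvcorg/<base>:runner naming follows the latest/runner tag scheme mentioned in this thread and is an assumption, print_runner_builds is a hypothetical helper name, and the function only prints the commands instead of running them.

```sh
#!/bin/sh
# Sketch only: print one `docker build` command per base image, passing the
# base name through the DOCKER_FROM build argument consumed by
# Dockerfile-cloud-runner. The dvcorg/<base>:runner naming is an assumption.
print_runner_builds() {
  for base in cml cml-python3 cml-gpu cml-gpu-python3; do
    echo "docker build -f Dockerfile-cloud-runner --build-arg DOCKER_FROM=${base} -t dvcorg/${base}:runner ."
  done
}

print_runner_builds
```

Because ARG DOCKER_FROM is declared before FROM, each build picks a different base image while the rest of the Dockerfile stays shared.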

Contributor Author

@DavidGOrtega DavidGOrtega Jun 4, 2020

  1. no success in resource allocation. what should happen?
    The job will fail and the rest of the workflow won't run.

  2. job never finishes? how we can handle this properly?
    Very interesting question. Are we speaking about the deploy job or the train job?
    In both cases the workflow will time out, since the whole workflow has a limited run time. However, a training process that never ends will leave a machine running forever. In any case, the user should see that the workflow did not succeed.
    This makes me think of a next iteration where we could add a check that the machine has been cleaned up.

  3. is it possible that the training is done but the machine is still working?
    No, the runners have an idle mechanism: if no jobs are handled within RUNNER_IDLE_TIMEOUT seconds, they shut themselves down.
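
The idle mechanism in answer 3 can be pictured as a polling loop that resets its clock whenever a job is running. The following is a minimal sketch under stated assumptions, not the runner's actual code: runner_is_busy, wait_until_idle, and the RUNNER_BUSY_FILE marker file are hypothetical names used only for illustration.

```sh
#!/bin/sh
# Sketch only: block until the runner has been idle for a given time.
# runner_is_busy is a hypothetical probe; a real runner would ask the
# GH/GL agent whether a job is in progress. The marker file is an assumption.
runner_is_busy() {
  [ -e "${RUNNER_BUSY_FILE:-/tmp/runner.busy}" ]
}

# $1: idle timeout in seconds (e.g. RUNNER_IDLE_TIMEOUT)
# $2: poll interval in seconds (a positive integer)
wait_until_idle() {
  idle_limit="$1"
  poll_every="$2"
  idle_for=0
  while [ "$idle_for" -lt "$idle_limit" ]; do
    sleep "$poll_every"
    if runner_is_busy; then
      idle_for=0                          # a job is running: reset the idle clock
    else
      idle_for=$((idle_for + poll_every)) # nothing running: accumulate idle time
    fi
  done
}
```

A caller would follow something like wait_until_idle "$RUNNER_IDLE_TIMEOUT" 5 with the unregister step and then the machine shutdown. Note that a job that never finishes keeps resetting the clock, which matches the caveat in answer 2.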

Member

👍

@DavidGOrtega DavidGOrtega changed the title CML-runner && CML-cloud-deploy CML clooud runner Jun 4, 2020
@DavidGOrtega DavidGOrtega changed the title CML clooud runner CML cloud runner Jun 4, 2020
@DavidGOrtega DavidGOrtega merged commit 38c9df6 into master Jun 4, 2020
@DavidGOrtega DavidGOrtega deleted the cml-runner branch June 4, 2020 19:29
@DavidGOrtega DavidGOrtega restored the cml-runner branch June 4, 2020 19:42
@DavidGOrtega DavidGOrtega mentioned this pull request Jun 4, 2020
@elleobrien elleobrien mentioned this pull request Jun 10, 2020
@DavidGOrtega DavidGOrtega deleted the cml-runner branch June 26, 2020 20:23