CML cloud runner #108
Conversation
dmpetrov
left a comment
@DavidGOrtega I didn't get it: when is the report generated?
dmpetrov
left a comment
a few more comments
.github/workflows/publish.yml (outdated diff context):
  tag_names: true
  ...
  - name: Publish CML runner docker image
CML runner is a bit too general a name. Any workflow runs in a runner. It should probably be cloud runner or something like that.
I have changed it to CML self-hosted runner.
.github/workflows/publish.yml (same outdated diff context as above)
I didn't get the idea of this Docker image. It looks very similar to CML. Why do we need it?
It's creating a Docker image under the runner tag that includes the runner piece of code.
For every current CML Docker image that we have (cml, cml-python3, cml-gpu, cml-gpu-python3) we have latest and runner tags; the latter contains the docker, runner and docker-machine code in the Dockerfile needed to use the self-hosted runner on the cloud.
You mean in the cml job? I just used a wait job to control the time and be able to test. But in that job you can actually do whatever you want.
I mean the report that we send as a comment. The major part of CML :)
I understand you. We have the deploy job and then the cml job. The cml job is going to do the comments, report, etc. It's just doing a sleep because this PR is more about how to deploy the self-hosted cloud runner, but as I said the cml job should be doing the report, comment, tensorboard, etc. like always. I'm going to set up a minimal example.
Yep, the example is needed.
Should I use CML's example number 1?
Any report is fine for now.
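A minimal sketch of that deploy-then-report structure, in case it helps to picture it. The runner labels, training script, image and token plumbing are assumptions for illustration; only the split into a deploy job and a cml job that posts the report follows the discussion above.

```yaml
# Hedged sketch of the cml job (assumed labels, script and secrets; not the final example)
cml:
  needs: deploy                    # wait for the self-hosted cloud runner started by the deploy job
  runs-on: [self-hosted, cml]      # assumed labels of the deployed runner
  steps:
    - uses: actions/checkout@v2
    - name: Train and send report
      env:
        repo_token: ${{ secrets.GITHUB_TOKEN }}   # token CML uses to post the comment
      run: |
        python train.py                  # assumed training script that writes metrics.txt
        cat metrics.txt >> report.md
        cml-send-comment report.md       # posts report.md as a comment via the CML CLI
```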
dmpetrov
left a comment
✨ Looks great!
One question is inline. Also, please make sure basic scenarios are tested like:
- No success in resource allocation: what should happen?
- The job never finishes: how can we handle this properly?
- Is it possible that the training is done but the machine is still working?
Diff context (publish workflow):
  if: github.event_name == 'push' && (contains(github.ref, 'tags') || github.ref == 'refs/heads/master')
  uses: elgohr/Publish-Docker-Github-Action@master
  env:
    DOCKER_FROM: cml
Why do we need DOCKER_FROM? Why don't we use that in other images like cml-gpu-py3?
The same question about buildargs.
We build the images using Dockerfile-cloud-runner, so to avoid repeating four Dockerfiles we just parameterize the FROM with args coming from buildargs in the plugin. According to the specs, they have to come from the action's ENV:
ARG DOCKER_FROM=cml
FROM dvcorg/${DOCKER_FROM}:latest as base
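For context, a publish step wired this way could look roughly like the sketch below. The if condition, the elgohr action, the DOCKER_FROM env var and the buildargs mechanism come from this PR; the image name, secret names and exact inputs are assumptions.

```yaml
- name: Publish CML cloud runner docker image
  if: github.event_name == 'push' && (contains(github.ref, 'tags') || github.ref == 'refs/heads/master')
  uses: elgohr/Publish-Docker-Github-Action@master
  env:
    DOCKER_FROM: cml               # consumed by "ARG DOCKER_FROM" in Dockerfile-cloud-runner
  with:
    name: dvcorg/cml               # assumed target image
    username: ${{ secrets.DOCKER_USERNAME }}    # assumed secret names
    password: ${{ secrets.DOCKER_PASSWORD }}
    dockerfile: Dockerfile-cloud-runner
    buildargs: DOCKER_FROM         # the action forwards this env var as a docker --build-arg
    tags: runner                   # publishes under the "runner" tag
```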
- No success in resource allocation: what should happen?
  The job will fail and the workflow won't run.
- The job never finishes: how can we handle this properly?
  Very interesting question. Are we speaking about the deploy job or the train job? In both cases the workflow will time out, since the whole workflow has a limited run time. However, a never-ending training process will end up with a machine working forever. In any case the user should see that the workflow did not succeed properly. This makes me think of a next iteration where we could actually add a check that the machine has been cleaned up.
- Is it possible that the training is done but the machine is still working?
  No, they have an idle mechanism: if no jobs are handled within RUNNER_IDLE_TIMEOUT seconds they kill themselves.
👍
Introduces CML-runner, a wrapper over GH and GL runners.
Using it within a deploy job in the workflow will automatically deploy a self-managed runner on the cloud that waits for jobs for the time given by RUNNER_IDLE_TIMEOUT in seconds. If a self-hosted runner is idle for that long, it unregisters itself from GH/GL and shuts down the tied machine. Below there is a sample in bash of all the logic that could be wrapped in CML-cloud-deploy @dmpetrov
CML example 1 running on AWS self-hosted runners:
- GitHub
- GitLab
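For completeness, here is a hedged sketch of the deploy job side of this flow, assuming it uses docker-machine to provision an AWS machine and then starts the dvcorg/cml:runner image on it. This is not the bash sample mentioned above; the machine name, instance type and every environment variable except RUNNER_IDLE_TIMEOUT are illustrative placeholders.

```yaml
# Hedged sketch of a deploy job provisioning the cloud runner (illustrative only)
deploy:
  runs-on: ubuntu-latest
  container: docker://dvcorg/cml:runner     # image carrying docker, docker-machine and the runner code
  steps:
    - name: Deploy self-hosted runner on AWS
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        REPO_TOKEN: ${{ secrets.REPO_TOKEN }}     # assumed token used to register the runner with GH/GL
      run: |
        # Provision an EC2 machine (name and instance type are placeholders)
        docker-machine create --driver amazonec2 \
          --amazonec2-instance-type t2.micro cml-cloud-runner
        eval "$(docker-machine env cml-cloud-runner)"
        # Start the runner container on that machine; per this PR, after
        # RUNNER_IDLE_TIMEOUT seconds without jobs it unregisters itself and
        # the tied machine is shut down. The env var names passed to the
        # runner below are assumptions.
        docker run -d \
          -e repo_token="$REPO_TOKEN" \
          -e RUNNER_IDLE_TIMEOUT=300 \
          dvcorg/cml:runner
```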