Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitor runs from server #467

Merged
merged 111 commits into from Oct 15, 2020

Conversation

Hedingber
Copy link
Contributor

@Hedingber Hedingber commented Oct 6, 2020

Don't be horified by the number of lines changed it's mostly test code moved from one place to another.

Terminology for this description:

  • Monitoring - keeping the Run state updated.
  • Log collection - Collecting the logs of a Run and saving them in MLRun's persistency.

What we had before this PR:
Monitoring:
There was basically 2 mechanisms:

  • Client - the MLRun SDK code running inside the user's Jobs (wrapping its code) updated the Run state according to progress - e.g. to running when it started/to completed when it finished/to error when an error was raised
  • Client + Server - When a user was running a job using the SDK (calling .run()) by default watch=True which means that the SDK will poll the API's GET /log/{project}/{uid} endpoint to get the run logs. In the API's code of this endpoint there was a hack - since it anyways gets the Job's pods, it looked on their state, and updated the Run's state accordingly.
    So practically get logs endpoints was abused to monitor Run status, and Client log polling was abused to trigger the monitoring periodically.

Log collection:
The get logs endpoint has two possible sources, MLRun's persistency and K8s, when there is something in the persistency it reads from there, otherwise it fall back to K8s. Practically nothing was collecting the logs from K8s to persistency, so when clients did get logs, the API basically just proxied them to K8s.
The exception is the cleanup mechanism, which before it cleans up job's resources, it collects the logs and saves them to persistency, so eventually the logs were getting to the persistency.

What this PR does:
Monitoring:

  • Client - didn't touch, this will keep working the same way, I think that the SDK running in the Job has the most accurate information about the real state, and therefore it should continue being part of the monitoring mechanism, also since it's working in kind of PUSHing way (and not polling), it gives almost zero latency between real state and the time the Run state is updated.
  • Server - removed any logic to update Run state from get logs endpoint, this endpoint now do what it should - give logs.
    Added monitoring logic that runs periodically (by default every 5 seconds, configurable) in background and updates the Runs states.

Log collection:
Not much changed, I think that trying to move logs from K8s to our persistency while the job is running is a waste of time, either we'll poll K8s too often and do a lot of un-needed pressure on it, or we'll poll it too rarely so the logs in our persistency (which will be the ones served to the user) won't be updated/relevant.
The only thing I did change, is that when the monitoring identifies a run reached stable state (completed/failed) it collects the logs from K8s and push to our persistency.

Other changes:

  • Changed pipelines job to use the API to run jobs (otherwise it won't monitor them) instead of running them indepndently.
  • Moved runtime handlers tests from tests/runtimes/test_runtime_handlers to tests/api/runtime_handlers and splitted them to test file per handler, did it cause I needed some of the API's testing facilities.
    Note - those handlers are running only in context of the API, therefore the right thing to do is to also move the runtime handlers themselves out of the mlrun/runtimes package to somewhere under mlrun/api as part of the effort to distinct SDK code from API code. this PR is already too big so I didn't do it here, will probably do it in some followup PR.

mlrun/api/api/utils.py Outdated Show resolved Hide resolved
mlrun/api/api/utils.py Show resolved Hide resolved
mlrun/api/main.py Outdated Show resolved Hide resolved
mlrun/api/main.py Outdated Show resolved Hide resolved
mlrun/runtimes/base.py Outdated Show resolved Hide resolved
mlrun/runtimes/base.py Outdated Show resolved Hide resolved
mlrun/runtimes/constants.py Outdated Show resolved Hide resolved
mlrun/runtimes/constants.py Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
@Hedingber Hedingber merged commit b1697b6 into mlrun:development Oct 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants