Monitor runs from server #467
Merged
Hedingber
merged 111 commits into
mlrun:development
from
Hedingber:monitor_runs_from_server
Oct 15, 2020
omesser
suggested changes
Oct 13, 2020
transient -> non-terminal
transient -> non-terminal
omesser
approved these changes
Oct 15, 2020
This was referenced Oct 22, 2020
Don't be horrified by the number of lines changed; it's mostly test code moved from one place to another.
Terminology for this description:
What we had before this PR:
Monitoring:
There were basically two mechanisms:
1. The Run's state was set to running when it started, to completed when it finished, and to error when an error was raised.
2. run() by default uses watch=True, which means that the SDK polls the API's GET /log/{project}/{uid} endpoint to get the run logs. In the API's code for this endpoint there was a hack: since it gets the Job's pods anyway, it looked at their state and updated the Run's state accordingly.
So in practice the get logs endpoint was abused to monitor Run status, and client log polling was abused to trigger the monitoring periodically.
Log collection:
The get logs endpoint has two possible sources, MLRun's persistency and K8s. When there is something in the persistency it reads from there, otherwise it falls back to K8s. In practice nothing was collecting the logs from K8s into the persistency, so when clients requested logs, the API basically just proxied the request to K8s.
The exception is the cleanup mechanism, which collects a job's logs and saves them to the persistency before cleaning up the job's resources, so eventually the logs did reach the persistency.
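The fallback described above can be sketched roughly as follows. This is only an illustration, not MLRun's actual code: get_logs, persisted_logs and read_from_k8s are hypothetical names, and plain dictionaries stand in for the persistency and for K8s.

```python
from typing import Callable, Dict, Optional


def get_logs(
    uid: str,
    persisted_logs: Dict[str, bytes],
    read_from_k8s: Callable[[str], Optional[bytes]],
) -> bytes:
    """Return a run's logs, preferring the persistency over K8s (hypothetical helper)."""
    stored = persisted_logs.get(uid)
    if stored:
        # Something was already saved to the persistency: serve it
        return stored
    # Nothing persisted yet: effectively proxy the request to K8s
    return read_from_k8s(uid) or b""


# Dictionaries standing in for the persistency and for the pods' logs
persisted = {"run-1": b"stored log\n"}
k8s_pods = {"run-2": b"live pod log\n"}

from_store = get_logs("run-1", persisted, k8s_pods.get)
from_k8s = get_logs("run-2", persisted, k8s_pods.get)
missing = get_logs("run-3", persisted, k8s_pods.get)
```

With nothing writing to the persistency while a job runs, the second branch is the one taken for almost every request.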
What this PR does:
Monitoring:
Added monitoring logic that runs periodically in the background (every 5 seconds by default, configurable) and updates the Runs' states.
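A minimal sketch of such a periodic monitoring pass, under stated assumptions: monitor_runs, run_periodically and the phase-to-state mapping are hypothetical names chosen for illustration, and a dictionary stands in for the K8s pod lookup. The PR's real implementation may be structured quite differently.

```python
import threading
from typing import Callable, Dict

# Assumed mapping from K8s pod phases to run states (illustrative only)
POD_PHASE_TO_RUN_STATE = {
    "Running": "running",
    "Succeeded": "completed",
    "Failed": "error",
}
TERMINAL_STATES = {"completed", "error"}


def monitor_runs(runs: Dict[str, str], get_pod_phase: Callable[[str], str]) -> None:
    """One monitoring pass: update every non-terminal run from its pod's phase."""
    for uid, state in list(runs.items()):
        if state in TERMINAL_STATES:
            continue
        new_state = POD_PHASE_TO_RUN_STATE.get(get_pod_phase(uid))
        if new_state:
            runs[uid] = new_state


def run_periodically(
    task: Callable[[], None], interval_seconds: float, stop_event: threading.Event
) -> None:
    """Call task every interval_seconds until stop_event is set."""
    # Event.wait returns False on timeout and True once the event is set
    while not stop_event.wait(interval_seconds):
        task()


# A single pass over two runs, with a dict standing in for the pod lookup
runs = {"run-1": "running", "run-2": "running"}
pod_phases = {"run-1": "Succeeded", "run-2": "Running"}
monitor_runs(runs, pod_phases.__getitem__)
```

In a server, run_periodically would typically be started in a daemon thread with the configured interval, for example threading.Thread(target=run_periodically, args=(task, 5, stop_event), daemon=True).start().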
Log collection:
Not much changed. I think that trying to move logs from K8s to our persistency while the job is running is a waste of time: either we poll K8s too often and put a lot of unneeded pressure on it, or we poll it too rarely and the logs in our persistency (which would be the ones served to the user) are out of date.
The only thing I did change is that when the monitoring identifies that a run has reached a stable state (completed/failed), it collects the logs from K8s and pushes them to our persistency.
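The change above could look roughly like the sketch below. The function name, the state names (completed and error are assumed to be the stable states) and the dictionaries standing in for K8s and the persistency are all hypothetical.

```python
from typing import Callable, Dict

# Assumed names for the stable (terminal) run states
STABLE_STATES = {"completed", "error"}


def handle_state_update(
    uid: str,
    new_state: str,
    read_logs_from_k8s: Callable[[str], bytes],
    persisted_logs: Dict[str, bytes],
) -> bool:
    """Persist a run's K8s logs once it reaches a stable state.

    Returns True if the logs were collected, False otherwise.
    """
    if new_state not in STABLE_STATES:
        # Still running: leave the logs in K8s and serve them from there
        return False
    persisted_logs[uid] = read_logs_from_k8s(uid)
    return True


k8s_logs = {"run-1": b"done\n", "run-2": b"still going\n"}
persisted: Dict[str, bytes] = {}

collected_finished = handle_state_update(
    "run-1", "completed", k8s_logs.__getitem__, persisted
)
collected_running = handle_state_update(
    "run-2", "running", k8s_logs.__getitem__, persisted
)
```

Collecting only on the transition to a stable state means K8s is read once per run for this purpose, which matches the concern about not polling it too often.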
Other changes:
Moved tests/runtimes/test_runtime_handlers to tests/api/runtime_handlers and split them into a test file per handler. I did it because I needed some of the API's testing facilities.
Note: those handlers run only in the context of the API, so the right thing to do is to also move the runtime handlers themselves out of the mlrun/runtimes package to somewhere under mlrun/api, as part of the effort to separate SDK code from API code. This PR is already too big so I didn't do it here; I will probably do it in a follow-up PR.