Monitor runs from server #467

Hedingber · 2020-10-06T21:03:46Z

Don't be horrified by the number of lines changed; it's mostly test code moved from one place to another.

Terminology for this description:

Monitoring - keeping the Run state updated.
Log collection - Collecting the logs of a Run and saving them in MLRun's persistency.

What we had before this PR:
Monitoring:
There were basically two mechanisms:

Client - the MLRun SDK code running inside the user's Jobs (wrapping their code) updated the Run state according to progress - e.g. to running when it started, to completed when it finished, or to error when an error was raised.
Client + Server - When a user ran a job using the SDK (calling .run()), watch=True was the default, which means the SDK polled the API's GET /log/{project}/{uid} endpoint to get the run logs. In the API's code for this endpoint there was a hack - since it gets the Job's pods anyway, it looked at their state and updated the Run's state accordingly.
So in practice the get logs endpoint was abused to monitor Run status, and the client's log polling was abused to trigger the monitoring periodically.

Log collection:
The get logs endpoint has two possible sources, MLRun's persistency and K8s. When there is something in the persistency it reads from there; otherwise it falls back to K8s. In practice nothing was collecting the logs from K8s into the persistency, so when clients called get logs, the API basically just proxied the request to K8s.
The exception is the cleanup mechanism: before it cleans up a job's resources, it collects the logs and saves them to the persistency, so eventually the logs did get to the persistency.
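
To make the fallback concrete, here is a minimal sketch of the read path described above. It is not the actual endpoint code: `get_run_logs`, `persisted_logs` and `read_from_k8s` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


def get_run_logs(
    project: str,
    uid: str,
    persisted_logs: Dict[Tuple[str, str], bytes],
    read_from_k8s: Callable[[str, str], Optional[bytes]],
) -> bytes:
    """Return the logs of a run, preferring the persistency over K8s.

    Hypothetical sketch: `persisted_logs` stands in for MLRun's persistency
    and `read_from_k8s` for reading the logs of the run's pod.
    """
    stored = persisted_logs.get((project, uid))
    if stored:
        # Something was already collected, serve it from the persistency.
        return stored
    # Nothing persisted yet: fall back to K8s (effectively a proxy).
    live = read_from_k8s(project, uid)
    return live or b""
```

With nothing in the persistency, every call ends up in `read_from_k8s`, which is the "API just proxies to K8s" behavior described above.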

What this PR does:
Monitoring:

Client - didn't touch it; this will keep working the same way. I think the SDK running in the Job has the most accurate information about the real state, and therefore it should continue to be part of the monitoring mechanism. Also, since it works by pushing (rather than polling), there is almost zero latency between the real state change and the time the Run state is updated.
Server - removed any logic that updates the Run state from the get logs endpoint; this endpoint now does what it should - give logs.
Added monitoring logic that runs periodically in the background (by default every 5 seconds, configurable) and updates the Runs' states.
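
The periodic part can be pictured roughly as follows. This is only a sketch under assumptions, not the code in this PR: `run_periodically`, `monitor_once` and `max_cycles` are hypothetical names, and `max_cycles` exists only so the loop can be stopped in an example or a test.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_periodically(
    interval_seconds: float,
    monitor_once: Callable[[], Awaitable[None]],
    max_cycles: Optional[int] = None,
) -> None:
    """Call `monitor_once` every `interval_seconds` in the background.

    Hypothetical sketch: `monitor_once` stands in for the logic that lists
    the runtime resources and updates the Run states.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            await monitor_once()
        except Exception as exc:
            # One failed cycle should not kill the whole monitoring loop.
            print(f"monitoring cycle failed: {exc}")
        cycles += 1
        await asyncio.sleep(interval_seconds)
```

In a server this would typically be started once at startup, e.g. `asyncio.create_task(run_periodically(5, monitor_once))`, with the interval read from configuration.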

Log collection:
Not much changed. I think that trying to move logs from K8s to our persistency while the job is running is a waste of time: either we'll poll K8s too often and put a lot of unneeded pressure on it, or we'll poll it too rarely, so the logs in our persistency (which will be the ones served to the user) won't be up to date.
The only thing I did change is that when the monitoring identifies that a run has reached a stable state (completed/failed), it collects the logs from K8s and pushes them to our persistency.
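
A rough sketch of that rule, assuming hypothetical names (`update_run`, `fetch_k8s_logs`, `persisted_logs`) and assuming the stable states are represented as "completed" and "error"; the real handlers in this PR are structured differently.

```python
from typing import Callable, Dict

# Assumption: state names used to represent the stable (completed/failed) states.
TERMINAL_STATES = {"completed", "error"}


def update_run(
    run: Dict[str, str],
    new_state: str,
    fetch_k8s_logs: Callable[[str], bytes],
    persisted_logs: Dict[str, bytes],
) -> None:
    """Update the run state and collect logs once, on reaching a stable state."""
    was_terminal = run["state"] in TERMINAL_STATES
    run["state"] = new_state
    if new_state in TERMINAL_STATES and not was_terminal:
        # Collect only on the transition, so K8s is not polled while the job runs.
        logs = fetch_k8s_logs(run["uid"])
        if logs:
            # Skip empty logs (see the "don't store empty logs" commit).
            persisted_logs[run["uid"]] = logs
```

Collecting only on the transition keeps the pressure on K8s at one log read per run, which is the trade-off argued for above.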

Other changes:

Changed the pipelines job to use the API to run jobs (otherwise it won't monitor them) instead of running them independently.
Moved the runtime handlers tests from tests/runtimes/test_runtime_handlers to tests/api/runtime_handlers and split them into one test file per handler; I did it because I needed some of the API's testing facilities.
Note - those handlers run only in the context of the API, so the right thing to do is to also move the runtime handlers themselves out of the mlrun/runtimes package to somewhere under mlrun/api, as part of the effort to separate SDK code from API code. This PR is already too big so I didn't do it here; I will probably do it in a follow-up PR.

mlrun/api/api/utils.py

mlrun/api/main.py

mlrun/runtimes/base.py

mlrun/runtimes/constants.py

tests/conftest.py

Hedingber added 30 commits October 3, 2020 19:52

Create the place to monitor the run

cfc4141

more specific

c14082d

Merge branch 'development' of github.com:mlrun/mlrun into monitor_run…

2bbab78

…s_from_server

structure

631c42c

linting

22772b3

tons of states

1ff7890

status monitor might work

d37932b

linting

354b5cc

logging

5fa35b6

linting

d6ba6d4

weird

793f0b5

ok

3275d29

logging

812015a

enable timeout

6e4a683

state

9e412cf

don't store empty logs

e799d77

resume monitoring on startup

2eb9b4f

fix

1f562ce

long timeout

d383ced

temp

a1d5680

runtime handlers tests to different files

b399c50

linting

61815f8

new dict to object class + mocks to mock class

ac30bd9

mock_list_pods support several calls

ef1a3d8

naming

380ed5e

some progress

ed9ebf2

order

0370fdb

Simplify k8s singeltone and remove k8s helper mock

c35cd31

Basic monitor logs kubejob test passing

2b2f2c0

Don't update run state on get_log - simply get log

ececc89

omesser suggested changes Oct 13, 2020

Hedingber added 23 commits October 14, 2020 14:53

Merge branch 'development' of github.com:mlrun/mlrun into monitor_run…

2bb4e88

…s_from_server

log interval

e4fd496

remove log

6295151

remove continue

3b3390c

stable -> terminal

66c9e6c

transient -> non-terminal

stable -> terminal

f64e982

transient -> non-terminal

simply collect logs

5fed94d

comment

46b8e46

comment

e78332e

linting

a1449d5

removing unused

7466ab7

one lint

4ad0a51

mock

f73db16

task -> run

1d209b9

remove unneeded

0ac0601

Use k8s client package classes

72ad99b

simlifiedddddd

94d9c1c

linting

0c085d2

remove dict to object wrapper

99b75a2

Don't trigger periodics on tests

b2fc59b

Add debouncing + ensure logs collected bug

ccc05b4

Add debouncing test

0292b5a

linting

643c4ea

omesser approved these changes Oct 15, 2020

Hedingber added 2 commits October 15, 2020 05:27

remove noisy log

87757b7

Add PENDING_SUBMISSION (not in docs but saw it live)

3ef7a42

Hedingber merged commit b1697b6 into mlrun:development Oct 15, 2020

This was referenced Oct 22, 2020

Add old pod_status header for BC with <0.5.3 clients #489

Merged

Fix scheduler not running anything #490

Merged
