Enable system metrics logging for resuming an existing run #10312

chenmoneygithub · 2023-11-07T00:13:51Z

🛠 DevTools 🛠

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/10312/merge

Checkout with GitHub CLI

gh pr checkout 10312

Related Issues/PRs

Resolve #10253

What changes are proposed in this pull request?

Enable system metrics logging for resuming an existing run.

How is this PR tested?

Existing unit/integration tests
New unit/integration tests
Manual tests

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

github-actions · 2023-11-07T00:14:15Z

Documentation preview for a77b849 will be available here when this CircleCI job completes successfully.

More info

Ignore this comment if this PR does not change the documentation.
It takes a few minutes for the preview to be available.
The preview is updated when a new commit is pushed to this PR.
This comment was created by https://github.com/mlflow/mlflow/actions/runs/6859977045.


chenmoneygithub · 2023-11-13T19:36:24Z

@danielyxyang Thanks for your help! But please don't start contributing by doing code review. We usually assign code review tasks to OSS contributors after they have nailed several solid PRs; otherwise the review can be distracting and increase our overhead.


danielyxyang · 2023-11-13T20:38:13Z

Oops, sorry about that! I'll let you do your work then :)


harupy · 2023-11-14T00:27:25Z

tests/system_metrics/test_system_metrics_logging.py

+    # Unset the environment variables to avoid affecting other test cases.
+    mlflow.disable_system_metrics_logging()
+    mlflow.set_system_metrics_sampling_interval(None)
+    mlflow.set_system_metrics_samples_before_logging(None)


You might want to move this into an autouse fixture, in case an error occurs before we reach these lines.
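One way to follow this suggestion, as a rough sketch: put the reset in the teardown half of an autouse fixture so it runs even when a test fails partway through. The environment variable names and the helper `reset_system_metrics_env` below are assumptions for illustration only (the thread says just that the setters are backed by environment variables); the real test would more likely call the `mlflow` functions quoted above.

```python
import os

import pytest

# Assumed variable names, used here only to illustrate the pattern.
SYSTEM_METRICS_ENV_VARS = (
    "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING",
    "MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL",
    "MLFLOW_SYSTEM_METRICS_SAMPLES_BEFORE_LOGGING",
)


def reset_system_metrics_env():
    """Remove every system-metrics variable, ignoring ones that are unset."""
    for name in SYSTEM_METRICS_ENV_VARS:
        os.environ.pop(name, None)


@pytest.fixture(autouse=True)
def clean_system_metrics_env():
    # Nothing to set up; the test body runs at the `yield`.
    yield
    # pytest runs this teardown even if the test body raised an exception.
    reset_system_metrics_env()
```

Because the fixture is `autouse`, every test in the module picks it up without naming it as an argument.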


harupy · 2023-11-14T00:33:55Z

tests/system_metrics/test_system_metrics_logging.py

+    expected_metrics_name = [f"system/{name}" for name in expected_metrics_name]
+    for name in expected_metrics_name:
+        assert name in metrics
+


Suggested change

-    expected_metrics_name = [f"system/{name}" for name in expected_metrics_name]
-    for name in expected_metrics_name:
-        assert name in metrics
+    expected_metrics_name = [f"system/{name}" for name in expected_metrics_name]
+    assert sorted(metrics) == expected_metrics_name

It's not guaranteed to be equivalent. The expected set does not contain GPU metrics, which are logged when a GPU is available.

assert metrics.keys() >= set(expected_metrics_name)

might work.
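To make the difference between the two assertions concrete, here is a small sketch with made-up data. Only `system/cpu_utilization_percentage` appears in this PR; the other metric names and all values are assumptions standing in for what a GPU machine might log.

```python
# Assumed metric names; the real list lives in the test file.
expected_metrics_name = [
    "system/cpu_utilization_percentage",
    "system/system_memory_usage_megabytes",
]

# Metrics as they might look on a machine with a GPU: one extra key.
metrics = {
    "system/cpu_utilization_percentage": 12.5,
    "system/system_memory_usage_megabytes": 2048.0,
    "system/gpu_0_utilization_percentage": 40.0,
}

# Exact comparison breaks as soon as an extra (GPU) metric is present.
exact_match = sorted(metrics) == sorted(expected_metrics_name)

# Superset comparison only requires every expected name to be logged.
superset_match = metrics.keys() >= set(expected_metrics_name)

print(exact_match, superset_match)  # False True
```

Dictionary key views support set comparison operators, so `>=` reads as "contains at least these names".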

harupy · 2023-11-14T00:37:21Z

tests/system_metrics/test_system_metrics_logging.py

+
+    # Pause for a bit to allow the system metrics monitoring to exit.
+    time.sleep(1)
+    thread_names = [thread.name for thread in threading.enumerate()]


Can we call join on the threads collected by threading.enumerate?

I kinda prefer to align the testing code with users' code, where they won't explicitly join threads returned by threading.enumerate.
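For reference, the join-based alternative raised in the question could look roughly like the sketch below. It is not what the PR does. The thread name `SystemMetricsMonitor` and both helpers are assumptions, and a stand-in thread replaces the real monitor.

```python
import threading

MONITOR_THREAD_NAME = "SystemMetricsMonitor"  # assumed name, for illustration


def start_fake_monitor(stop_event):
    """Start a stand-in background thread that runs until stop_event is set."""
    thread = threading.Thread(
        target=stop_event.wait, name=MONITOR_THREAD_NAME, daemon=True
    )
    thread.start()
    return thread


def join_threads_named(name, timeout=5.0):
    """Join every live thread with the given name; return True if all exited."""
    threads = [t for t in threading.enumerate() if t.name == name]
    for thread in threads:
        thread.join(timeout)
    return not any(t.is_alive() for t in threads)


stop_event = threading.Event()
start_fake_monitor(stop_event)
stop_event.set()  # ask the monitor to stop, as ending a run would

all_exited = join_threads_named(MONITOR_THREAD_NAME)
thread_names = [t.name for t in threading.enumerate()]
```

Joining removes the dependence on a fixed `time.sleep(1)`, at the cost of doing something in the test that user code normally would not, which is the trade-off raised in the reply.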


harupy · 2023-11-14T00:41:02Z

tests/system_metrics/test_system_metrics_logging.py

+    metrics_history = mlflow.tracking.MlflowClient().get_metric_history(
+        run.info.run_id, "system/cpu_utilization_percentage"
+    )
+    assert metrics_history[-1].step > last_step


Can we make this assertion stricter? For example:

assert [m.step for m in metrics_history] == expected_steps

The behavior here is not deterministic because of Python thread management.

what's not deterministic? The number of steps?

yes, the number of steps is not guaranteed.
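A possible middle ground, sketched here as an assumption rather than what the PR adopted, is to assert properties that hold for any number of steps: the steps strictly increase and continue past `last_step`. The `Metric` tuple and the helper `steps_resume_after` are hypothetical stand-ins, and the sketch assumes the history is returned in logging order.

```python
from collections import namedtuple

# Stand-in for the metric entities returned by get_metric_history.
Metric = namedtuple("Metric", ["step", "value"])


def steps_resume_after(metrics_history, last_step):
    """Check that steps strictly increase and extend beyond last_step."""
    steps = [m.step for m in metrics_history]
    strictly_increasing = all(a < b for a, b in zip(steps, steps[1:]))
    return strictly_increasing and bool(steps) and steps[-1] > last_step


# First session logged steps 0..2; the resumed session added an unknown number.
history = [Metric(step=s, value=10.0 + s) for s in range(5)]
resumed_ok = steps_resume_after(history, last_step=2)

# A resumed session that restarted from step 0 would be rejected.
restarted = [Metric(0, 1.0), Metric(1, 1.0), Metric(0, 1.0)]
restarted_ok = steps_resume_after(restarted, last_step=1)
```

This is stricter than checking only the final step, yet it does not pin down how many samples the monitoring thread managed to log.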


harupy · 2023-11-14T02:44:21Z

tests/system_metrics/test_system_metrics_logging.py

 import mlflow
 from mlflow.system_metrics.system_metrics_monitor import SystemMetricsMonitor


+@pytest.fixture(scope="module", autouse=True)
+def clean_mlruns_dir():


can we rename this fixture?

should the scope be module or function? should we run this after each test?

sorry my bad.

harupy

LGTM


github-actions bot added rn/bug-fix Mention under Bug Fixes in Changelogs. rn/feature Mention under Features in Changelogs. and removed rn/bug-fix Mention under Bug Fixes in Changelogs. rn/feature Mention under Features in Changelogs. labels Nov 7, 2023

chenmoneygithub force-pushed the system-metrics-existing-run branch from 7e46d80 to d7a8fa4 Compare November 7, 2023 00:15

chenmoneygithub requested a review from harupy November 7, 2023 00:16

danielyxyang reviewed Nov 8, 2023

mlflow/system_metrics/system_metrics_monitor.py Outdated

danielyxyang reviewed Nov 8, 2023

mlflow/system_metrics/system_metrics_monitor.py Outdated

chenmoneygithub requested review from danielyxyang and removed request for danielyxyang November 9, 2023 21:05

chenmoneygithub added 2 commits November 9, 2023 13:08

enable system metrics logging for resuming an existing run

9e0f66a

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

remove the hard-coded string

97f14d4

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

chenmoneygithub force-pushed the system-metrics-existing-run branch from 3638859 to 97f14d4 Compare November 9, 2023 21:09

harupy reviewed Nov 9, 2023

mlflow/system_metrics/system_metrics_monitor.py

harupy reviewed Nov 9, 2023

mlflow/tracking/fluent.py Outdated

chenmoneygithub and others added 2 commits November 9, 2023 15:50

better

1b9d2cb

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

Update mlflow/tracking/fluent.py

06984a8

Co-authored-by: Harutaka Kawamura <hkawamura0130@gmail.com> Signed-off-by: Chen Qian <chenmoney@google.com>

chenmoneygithub requested a review from harupy November 10, 2023 00:25

small fix

f7ca09c

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

danielyxyang suggested changes Nov 11, 2023


chenmoneygithub and others added 2 commits November 13, 2023 11:41

Merge branch 'mlflow:master' into master

9fe82bd

merge master

e56a193

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

github-actions bot removed the rn/feature Mention under Features in Changelogs. label Nov 13, 2023

harupy reviewed Nov 13, 2023

docs/source/getting-started/quickstart-2/index.rst Outdated

conflict

412756e

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

harupy reviewed Nov 14, 2023

tests/system_metrics/test_system_metrics_logging.py Outdated

harupy reviewed Nov 14, 2023

tests/system_metrics/test_system_metrics_logging.py Outdated

harupy reviewed Nov 14, 2023

mlflow/system_metrics/system_metrics_monitor.py Outdated

harupy reviewed Nov 14, 2023

mlflow/system_metrics/system_metrics_monitor.py

harupy reviewed Nov 14, 2023

mlflow/tracking/fluent.py Outdated

harupy reviewed Nov 14, 2023

address comment

08d8888

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

harupy reviewed Nov 14, 2023

harupy approved these changes Nov 14, 2023

fix stuff

a77b849

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

chenmoneygithub merged commit 373c654 into mlflow:master Nov 14, 2023
34 checks passed

chenmoneygithub deleted the system-metrics-existing-run branch January 2, 2024 22:53
