@daavoo
Contributor

@daavoo daavoo commented Aug 25, 2021

dvc.org PR: iterative/dvc.org#2765

Closes #113
Closes #128

This PR introduces a new public function, dvclive.get_step(), and removes the step argument from dvclive.init().

The main use cases are driven by (but not limited to) using dvclive alongside dvc checkpoints and resuming training:

  • Custom control flow
while dvclive.get_step() < X:
    train()
    metrics = eval()
    for m, v in metrics.items():
        dvclive.log(m, v)
    dvclive.next_step()
  • ML Framework
model.fit(
    . . .
    epochs=params["epochs"],
    initial_epoch=dvclive.get_step(),
)
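The resume behaviour that both use cases rely on can be sketched without dvclive at all. The block below is a minimal illustration under stated assumptions: ToyLogger, its JSON state file, and train_until are hypothetical names invented for this example, and the persistence format is not dvclive's. It only shows why a loop bounded by get_step() picks up where a previous run stopped.

```python
import json
import tempfile
from pathlib import Path


class ToyLogger:
    """Hypothetical stand-in for a step-tracking logger (not dvclive itself)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Resume from the last persisted step if a state file exists.
        if path.exists():
            self._step = json.loads(path.read_text())["step"]
        else:
            self._step = 0

    def get_step(self) -> int:
        return self._step

    def next_step(self) -> None:
        self._step += 1
        self.path.write_text(json.dumps({"step": self._step}))


def train_until(logger: ToyLogger, total_steps: int) -> int:
    """Run the loop from the current step; return how many iterations ran."""
    ran = 0
    while logger.get_step() < total_steps:
        ran += 1  # placeholder for the train / evaluate / log calls
        logger.next_step()
    return ran


state_file = Path(tempfile.mkdtemp()) / "state.json"
first_run = train_until(ToyLogger(state_file), total_steps=3)
# A new logger instance reads the persisted step and continues from it.
resumed_run = train_until(ToyLogger(state_file), total_steps=5)
```

In this sketch the second call only runs the remaining iterations, because the loop condition depends on the restored step rather than on a counter that restarts at zero.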

@daavoo daavoo requested review from dberenbaum and pared August 25, 2021 11:37
@codecov-commenter
codecov-commenter commented Aug 25, 2021

Codecov Report

Merging #142 (f42f778) into master (1ffce08) will increase coverage by 0.24%.
The diff coverage is 100.00%.

❗ Current head f42f778 differs from pull request most recent head bbab7c8. Consider uploading reports for the commit bbab7c8 to get more accurate results

@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
+ Coverage   90.88%   91.12%   +0.24%     
==========================================
  Files          14       14              
  Lines         340      338       -2     
==========================================
- Hits          309      308       -1     
+ Misses         31       30       -1     
Impacted Files         Coverage Δ
dvclive/__init__.py    100.00% <100.00%> (ø)
dvclive/metrics.py     96.93% <100.00%> (+0.78%) ⬆️
dvclive/mmcv.py        100.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ffce08...bbab7c8.


@dberenbaum dberenbaum left a comment

LGTM.

Reviewing raised some questions that are outside the scope of this PR (I can extract to separate issues):

Custom steps

  • What is the intended use case for a custom step?
  • Should it overwrite existing values for the same step (it doesn't)?
  • Should results be ordered by write time or step (they are ordered by write time)?
  • Why set custom steps at the metric level with dvclive.log(step=n) since the step value should probably apply to all metrics?
  • If I log one metric and then set a different step for a second metric, which step number should be used for the first metric (it will have a different step in the tsv and the summary json)?

tsv -> summary workflow

Would summary -> tsv be more helpful (this would obviously require summary to always exist)? It's more intuitive to me (and follows the internal logic of MetricLogger._metrics) to gather all metrics for a step and then append to metrics logs. It also enables no-step scenarios like classical ML algorithms by logging the summary without ever creating the tsv files.
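The summary -> tsv idea can be sketched as follows. This is a hypothetical illustration, not dvclive's implementation: SummaryFirstLogger, the summary.json file name, and the per-metric .tsv layout are all assumptions made for the example. It buffers every metric for the current step in memory, writes the summary, and only then appends to the per-metric history files.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict


class SummaryFirstLogger:
    """Hypothetical sketch; not dvclive's MetricLogger."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self.step = 0
        self._metrics: Dict[str, float] = {}

    def log(self, name: str, value: float) -> None:
        # Only buffer in memory; nothing is written until the step is closed.
        self._metrics[name] = value

    def write_summary(self) -> None:
        summary = {"step": self.step, **self._metrics}
        (self.directory / "summary.json").write_text(json.dumps(summary))

    def next_step(self) -> None:
        # Summary first, then append one row per metric to its TSV history.
        self.write_summary()
        for name, value in self._metrics.items():
            tsv_path = self.directory / f"{name}.tsv"
            is_new = not tsv_path.exists()
            with tsv_path.open("a", newline="") as handle:
                writer = csv.writer(handle, delimiter="\t")
                if is_new:
                    writer.writerow(["step", name])
                writer.writerow([self.step, value])
        self._metrics = {}
        self.step += 1


out_dir = Path(tempfile.mkdtemp())

# No-step scenario: only the summary is written, no TSV files are created.
classic = SummaryFirstLogger(out_dir / "classic")
classic.log("accuracy", 0.9)
classic.write_summary()

# Step-based scenario: all metrics of a step share one step number.
stepped = SummaryFirstLogger(out_dir / "stepped")
for loss in (0.5, 0.25):
    stepped.log("loss", loss)
    stepped.next_step()
```

Because the step number is attached when the step is closed rather than per log call, the summary and the history files in this sketch cannot disagree about which step a metric belongs to.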

This was referenced Aug 26, 2021
Comment on lines +40 to +43
def get_step() -> None:
    global _metric_logger  # pylint: disable=global-statement
    _metric_logger = _lazy_init(_metric_logger)
    return _metric_logger.step
Contributor

This method returns int.
Also, I think we should change MetricLogger's step property into a get_step() method to maintain consistency with the API. @daavoo what do you think?

Contributor Author

@daavoo daavoo Sep 6, 2021

Not sure, tbh. A get_X method instead of a @property feels kind of strange, but step doesn't sound good for a public method either.

Contributor

Well, my POV is that at some point one might want to parallelize their code and do something like dvclive = MetricsLogger(); in that case, dvclive.get_step stops working.
Now that I mention it, it would probably be good to note that dvclive is not thread-safe and that one needs to initialize separate loggers for parallel jobs.
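A minimal sketch of the "one logger per parallel job" idea, under stated assumptions: WorkerLogger and run_job are hypothetical names, and the class is a toy that only tracks a step counter. The instance exposes get_step() as a method, so a call site written against a module-level get_step() would keep working when the name is rebound to an instance.

```python
import threading
from typing import Dict


class WorkerLogger:
    """Hypothetical per-job logger; each instance keeps its own step."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._step = 0

    def get_step(self) -> int:
        return self._step

    def next_step(self) -> None:
        self._step += 1


def run_job(name: str, n_steps: int, results: Dict[str, int]) -> None:
    logger = WorkerLogger(name)  # one logger per job, no shared state
    for _ in range(n_steps):
        logger.next_step()
    results[name] = logger.get_step()


results: Dict[str, int] = {}
threads = [
    threading.Thread(target=run_job, args=(f"job-{i}", i + 1, results))
    for i in range(3)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each job ends with its own step count because no counter is shared between threads; a single module-level logger would not give that guarantee.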

Contributor Author

I see your point.

However, assuming we are focusing on "integrations first" (iterative/example-repos-dev#77 (comment)), the parallelization would happen at the ML Framework level and most ML Frameworks already take care of properly calling the callbacks/loggers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the downside to having the public API match MetricsLogger?

I don't think I understand the point about ML framework integrations. Even if ML frameworks spawn a separate process for each model training, dvclive would try to read/write using the same file by default, right? Users might need to specify a different path for each one, which isn't supported yet in the callbacks.

Contributor Author

I would need to investigate each case, but at least in the deep learning frameworks I'm familiar with, parallelism usually occurs:

  • At the data loader level,
    which doesn't affect DVCLive callbacks.

  • In the distributed training strategy,
    where ML frameworks usually provide a decorator like rank_zero_only / master_only, which is (or should be) used in the DVCLive callback.
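The decorator pattern mentioned for distributed training can be sketched in plain Python. This is a hypothetical stand-in, not the implementation shipped by any framework: rank_zero_only_sketch and log_metric are illustrative names, and reading the process rank from a RANK environment variable is an assumption made for the example.

```python
import functools
import os
from typing import Any, Callable, List, Optional


def rank_zero_only_sketch(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Hypothetical decorator: run fn only when RANK is 0 or unset."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        # Assumption: the launcher exports the process rank as RANK.
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None

    return wrapper


logged: List[str] = []


@rank_zero_only_sketch
def log_metric(name: str, value: float) -> None:
    logged.append(f"{name}={value}")


os.environ["RANK"] = "1"
log_metric("loss", 0.5)  # skipped: not the main process
os.environ["RANK"] = "0"
log_metric("loss", 0.4)  # executed on the main process
```

If a logging callback is guarded this way, only one process writes the metrics files, which is why the shared default path is less of a concern in that setting.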

@pared pared mentioned this pull request Sep 8, 2021
@daavoo daavoo mentioned this pull request Sep 10, 2021
2 tasks
Development

Successfully merging this pull request may close these issues.

dvclive.init: Future of arguments
checkpoints: num(ber)/epoch awareness

5 participants