Added dvclive.get_step
#142
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
+ Coverage   90.88%   91.12%   +0.24%
==========================================
  Files          14       14
  Lines         340      338       -2
==========================================
- Hits          309      308       -1
+ Misses         31       30       -1
==========================================
```
Continue to review full report at Codecov.
dberenbaum left a comment
LGTM.
Reviewing raised some questions that are outside the scope of this PR (I can extract them into separate issues):
Custom steps
- What is the intended use case for a custom step?
- Should it overwrite existing values for the same step (it doesn't)?
- Should results be ordered by write time or step (they are ordered by write time)?
- Why set custom steps at the metric level with `dvclive.log(step=n)`, since the step value should probably apply to all metrics?
- If I log one metric and then set a different step for a second metric, which step number should be used for the first metric (it will have a different step in the tsv and the summary json)? See the sketch after this list.
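To make the last two questions concrete, here is a hedged sketch of the ambiguous sequence (the metric names and values are hypothetical; the `step` keyword comes from the discussion above, but the exact positional signature of `dvclive.log` is an assumption):

```python
import dvclive

dvclive.log("acc", 0.9)           # logged at the current step
dvclive.log("loss", 0.1, step=5)  # custom step for a second metric:
                                  # which step does "acc" end up with in
                                  # the tsv vs. the summary json?
```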
tsv -> summary workflow
Would summary -> tsv be more helpful (this would obviously require summary to always exist)? It's more intuitive to me (and follows the internal logic of MetricLogger._metrics) to gather all metrics for a step and then append them to the metrics logs. It also enables no-step scenarios like classical ML algorithms by logging the summary without ever creating the tsv files.
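A minimal, self-contained sketch of the summary -> tsv flow being proposed (all names and file layouts here are hypothetical, not the actual MetricLogger internals):

```python
import json
import os

class MetricLogger:
    """Hypothetical sketch of a summary -> tsv flow; not dvclive's real code."""

    def __init__(self, path="dvclive"):
        self._path = path
        self._step = 0
        self._metrics = {}  # in-memory summary for the current step
        os.makedirs(path, exist_ok=True)

    def log(self, name, value):
        self._metrics[name] = value  # gather first, write later

    def next_step(self):
        # The summary always exists, even for no-step scenarios.
        with open(self._path + ".json", "w") as fobj:
            json.dump({"step": self._step, **self._metrics}, fobj)
        # Only stepped runs then append one tsv row per metric.
        for name, value in self._metrics.items():
            with open(os.path.join(self._path, name + ".tsv"), "a") as fobj:
                fobj.write(f"{self._step}\t{value}\n")
        self._metrics.clear()
        self._step += 1
```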
```python
def get_step() -> None:
    global _metric_logger  # pylint: disable=global-statement
    _metric_logger = _lazy_init(_metric_logger)
    return _metric_logger.step
```
This method returns `int`, so the annotation should be `-> int`, not `-> None`.
Also, I think we should change MetricLogger's `step` property into a `get_step()` method to maintain consistency with the API. @daavoo what do you think?
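A hedged sketch of both suggestions together (the `_step` attribute on MetricLogger is an assumption):

```python
def get_step() -> int:
    # Annotation corrected: the function returns an int, not None.
    global _metric_logger  # pylint: disable=global-statement
    _metric_logger = _lazy_init(_metric_logger)
    return _metric_logger.get_step()

class MetricLogger:
    def get_step(self) -> int:
        # A method instead of the current `step` property, for consistency
        # with the module-level API (`self._step` is an assumed attribute).
        return self._step
```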
Not sure, tbh. A `get_X` method instead of a `@property` feels kind of strange, but `step` doesn't sound good for a public method either.
Well, my POV is that at some point one might want to parallelize their code and do something like `dvclive = MetricsLogger()`; at that point, the module-level `dvclive.get_step` stops working.
Now that I mention it, it would probably be good to document that dvclive is not thread-safe, and that one needs to initialize their own loggers in case of parallel jobs.
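A hedged sketch of the per-job pattern being described (the class name comes from the comment above, but the import path, constructor arguments, and method names are all assumptions):

```python
from multiprocessing import Process

from dvclive import MetricsLogger  # assumed import path

def train_job(job_id):
    # Each process gets its own logger and its own output path,
    # since the module-level singleton is not safe to share.
    live = MetricsLogger(path=f"dvclive_job_{job_id}")  # `path` kwarg assumed
    for epoch in range(10):
        live.log("loss", 1.0 / (epoch + 1))
        live.next_step()

if __name__ == "__main__":
    jobs = [Process(target=train_job, args=(i,)) for i in range(4)]
    for proc in jobs:
        proc.start()
    for proc in jobs:
        proc.join()
```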
I see your point.
However, assuming we are focusing on "integrations first" (iterative/example-repos-dev#77 (comment)), the parallelization would happen at the ML Framework level and most ML Frameworks already take care of properly calling the callbacks/loggers.
What's the downside to having the public API match MetricsLogger?
I don't think I understand the point about ML framework integrations. Even if ML frameworks spawn a separate process for each model training, dvclive would try to read/write using the same file by default, right? Users might need to specify a different path for each one, which isn't supported yet in the callbacks.
I would need to investigate each case, but at least in the Deep Learning Frameworks I'm familiar with, parallelism usually occurs:

- At the Data Loader level, which doesn't affect DVCLive callbacks.
- In the distributed training strategy, where ML Frameworks usually provide some decorator like `rank_zero_only` / `master_only`, which is (or should be) used in the DVCLive callback; see the sketch after this list.
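A minimal, self-contained sketch of the rank-zero pattern those decorators implement (this stand-in decorator is illustrative, not the frameworks' actual implementation; `dvclive.log`/`dvclive.next_step` usage is assumed from this PR's API):

```python
import os
from functools import wraps

def rank_zero_only(fn):
    # Run fn only on the main process; distributed launchers typically
    # expose the process rank via an environment variable such as RANK.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapper

@rank_zero_only
def log_epoch_metrics(metrics):
    import dvclive
    for name, value in metrics.items():
        dvclive.log(name, value)
    dvclive.next_step()
```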
- I have followed the Contributing to DVCLive guide.
- If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

dvc.org PR: iterative/dvc.org#2765

Closes #113
Closes #128
This PR introduces a new public function, `dvclive.get_step()`, and removes `step` from `dvclive.init()`.

The main use cases are driven by (but not limited to) using dvclive alongside `dvc checkpoints` and resuming training:
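For example, a hedged sketch of the resume scenario (`train_one_epoch` is a hypothetical stand-in, and the `dvclive.init` path argument is an assumption):

```python
import dvclive

def train_one_epoch():
    # Hypothetical stand-in for the real training step.
    return 0.1

num_epochs = 10
dvclive.init("training_metrics")  # no `step` argument anymore
start = dvclive.get_step()        # presumably 0 on a fresh run,
                                  # the last logged step when resuming
for _ in range(start, num_epochs):
    loss = train_one_epoch()
    dvclive.log("loss", loss)
    dvclive.next_step()
```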