Add internal counters to send only the latest datapoints to studio #788

Merged
merged 9 commits into from
Feb 28, 2024

AlexandreKempf
Contributor

@AlexandreKempf AlexandreKempf commented Feb 15, 2024

This follows up on a bug detected while developing this feature.

Using the step value to decide which data to send to Studio can lead to weird behavior because the step is poorly defined in some loggers (e.g. the PyTorch Lightning logger). Because of that, we used a hack to ensure the log_metrics calls made by the Lightning trainer were correct. But calling log_metrics from outside the Lightning trainer (from a separate thread, for instance) leads to data either not being sent to Studio or being sent more than once.

This PR introduces a counter for each metric that increments when Studio receives the data points. Instead of using the step property as a proxy for which data has been sent to Studio, we now literally count the data points. This way, when we want to send data points to Studio, we send only the points it hasn't received yet.
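The counter idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `DatapointCounter` and its methods `log`, `pending`, and `mark_received` are hypothetical, and the payload layout is an assumption.

```python
from collections import defaultdict


class DatapointCounter:
    """Hypothetical helper (not dvclive's actual class)."""

    def __init__(self):
        # metric path -> every datapoint logged so far
        self.history = defaultdict(list)
        # metric path -> number of datapoints the receiver already has
        self.sent_count = defaultdict(int)

    def log(self, metric, step, value):
        self.history[metric].append({"step": step, "value": value})

    def pending(self):
        """Return only the datapoints that were not received yet."""
        result = {}
        for metric, points in self.history.items():
            unsent = points[self.sent_count[metric]:]
            if unsent:
                result[metric] = unsent
        return result

    def mark_received(self, payload):
        """Advance the counters only after a send succeeded."""
        for metric, points in payload.items():
            self.sent_count[metric] += len(points)


counter = DatapointCounter()
counter.log("test", 0, 0.5)
counter.log("test", 1, 0.5)

failed_payload = counter.pending()    # pretend this send failed: counters untouched
retry_payload = counter.pending()     # the same two points are offered again
counter.mark_received(retry_payload)  # this send succeeded
counter.log("test", 1, 0.5)           # repeated step, but still a new datapoint
after_repeat = counter.pending()
```

Because the selection depends on how many points were received rather than on their step, a datapoint that reuses an earlier step value is still sent exactly once.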

The test added in the PR fails in the main branch because logs[test_metric] is

[
    {'step': '0', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'}, 
    {'step': '2', 'test': '0.5'}, 
    {'step': '3', 'test': '0.5'}
]

which is expected. But test_calls is

[
    {'step': '0', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'},
    {'step': '1', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'},
    {'step': '2', 'test': '0.5'}, 
    {'step': '3', 'test': '0.5'},
    {'step': '3', 'test': '0.5'}
]

With this PR, logs[test_metric] and test_calls are identical and contain no duplicated elements.
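To make the difference concrete, here is a toy simulation. It is a simplified model under stated assumptions, not dvclive's actual logic: `sync_by_step` assumes the step-based approach resends every point at the current step, and the sequence of `log` and `sync` events is made up for illustration.

```python
def sync_by_step(points, current_step):
    # Assumed step-based selection: everything at (or after) the current step.
    return [p for p in points if p["step"] >= current_step]


def sync_by_count(points, sent_count):
    # Counter-based selection: everything after the first `sent_count` points.
    return points[sent_count:]


# Made-up event sequence: some syncs happen without any new datapoint.
events = ["log", "sync", "log", "sync", "sync", "sync",
          "log", "sync", "log", "sync", "sync"]

points = []
received_by_step = []
received_by_count = []
sent_count = 0

for event in events:
    if event == "log":
        points.append({"step": len(points), "test": 0.5})
    else:
        current_step = points[-1]["step"]
        received_by_step += sync_by_step(points, current_step)
        new_points = sync_by_count(points, sent_count)
        received_by_count += new_points
        sent_count += len(new_points)

steps_by_step = [p["step"] for p in received_by_step]
steps_by_count = [p["step"] for p in received_by_count]
```

In this model, every extra sync at an unchanged step resends that step's point under the step-based rule, while the counter-based rule sends each point once.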

@codecov-commenter

codecov-commenter commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9eb04c2) 95.53% compared to head (bcf6c04) 95.55%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
+ Coverage   95.53%   95.55%   +0.02%     
==========================================
  Files          55       55              
  Lines        3559     3580      +21     
  Branches      319      319              
==========================================
+ Hits         3400     3421      +21     
  Misses        111      111              
  Partials       48       48              


Review thread on src/dvclive/studio.py (outdated, resolved)
Contributor

@dberenbaum dberenbaum left a comment

LGTM! Do you think there are any additional tests that would be helpful to add?

@AlexandreKempf
Contributor Author

> LGTM! Do you think there are any additional tests that would be helpful to add?

Sure! Over the last few years I got into the bad habit of not writing tests. It needs to become a habit once again. Sorry for that, I'll add them :)

Member

@shcheklein shcheklein left a comment

We need to have a proper test. From the brief description I don't quite understand in which situations the bug occurs; a test that makes it obvious would be helpful.

@AlexandreKempf
Contributor Author

AlexandreKempf commented Feb 19, 2024

@shcheklein
I updated the description to explain the problem a bit better.
I added a new test for the Lightning integration, since that is where the problem showed up.
The new test uses a thread and a sleep function, but I tried to apply what you suggested in the CPU monitoring PR about time.sleep in tests. I also used @pytest.mark.timeout(3) so that the test fails after 3s instead of hanging forever.

Let me know what you think :)
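For readers unfamiliar with this kind of test, a rough standard-library sketch of the pattern (a background thread logging while the main thread also syncs, with bounded waits so nothing hangs) could look like this. It is not the test from this PR: `CountingSender` and `run_background_logging` are hypothetical names, and the bounded `Event.wait` stands in for the `@pytest.mark.timeout(3)` guard mentioned above.

```python
import threading


class CountingSender:
    """Hypothetical thread-safe sender that tracks how many points were sent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = []
        self._sent = 0
        self.received = []

    def log(self, value):
        with self._lock:
            self._points.append({"step": len(self._points), "test": value})

    def sync(self):
        # Send only the points beyond the counter, then advance it.
        with self._lock:
            new_points = self._points[self._sent:]
            self.received.extend(new_points)
            self._sent += len(new_points)


def run_background_logging(n_points=4, timeout=3.0):
    sender = CountingSender()
    done = threading.Event()

    def worker():
        # Logs from outside the "trainer", as a separate thread would.
        for _ in range(n_points):
            sender.log(0.5)
            sender.sync()
        done.set()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    sender.sync()                   # concurrent sync from the main thread
    finished = done.wait(timeout)   # bounded wait instead of a fixed sleep
    thread.join(timeout)
    sender.sync()                   # final sync must not resend anything
    return finished, [p["step"] for p in sender.received]


finished, steps = run_background_logging()
```

Waiting on an event with a timeout keeps the check fast when the worker finishes early and still bounds the run time if it never does.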

Member

@shcheklein shcheklein left a comment

Thanks for updating the description, I think I understand it better now.

Just a few small questions to clarify re the test. Otherwise I think it should be fine (though I'm not an expert in the details re the data points management)

@shcheklein
Member

@daavoo it would be great if you can take a look :)

@AlexandreKempf AlexandreKempf mentioned this pull request Feb 20, 2024
@dberenbaum dberenbaum mentioned this pull request Feb 20, 2024
dberenbaum and others added 2 commits February 21, 2024 08:11
* add test for repeated step in studio

* drop outdated lightning test comments
@shcheklein
Member

Thanks for the update @AlexandreKempf! Can we add a test (or change this test a bit) so it has more datapoints with a different cadence of updates? Correct me if I'm wrong, folks, but we keep a counter per metric path, right? Some metrics can be updated a few times before the next step (and even before a sync), some only once. In all cases we need to make sure that on sync we send the "delta" properly.
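A check along those lines might look like the sketch below. It assumes one counter per metric path, as asked above; the metric names `train/loss` and `val/acc`, the helpers `log` and `sync`, and the payload layout are all made up for illustration and are not taken from the PR.

```python
from collections import defaultdict

history = defaultdict(list)  # metric path -> all logged datapoints
sent = defaultdict(int)      # metric path -> number of datapoints already sent


def log(metric, step, value):
    history[metric].append({"step": step, "value": value})


def sync():
    """Return the per-metric delta and advance each counter."""
    delta = {}
    for metric, points in history.items():
        new_points = points[sent[metric]:]
        if new_points:
            delta[metric] = new_points
            sent[metric] += len(new_points)
    return delta


# "train/loss" is updated three times within step 0, "val/acc" only once.
log("train/loss", 0, 0.9)
log("train/loss", 0, 0.8)
log("train/loss", 0, 0.7)
log("val/acc", 0, 0.5)
first = sync()

# Only one metric changes before the next sync.
log("train/loss", 1, 0.6)
second = sync()

# Nothing new: the delta should be empty.
third = sync()
```

Each sync carries only what changed for each metric since the previous sync, regardless of how often that metric was updated in between.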

@dberenbaum
Contributor

@AlexandreKempf Are you ready to merge this one?

@AlexandreKempf AlexandreKempf merged commit bc0a5b5 into main Feb 28, 2024
14 checks passed
@AlexandreKempf AlexandreKempf deleted the send-to-studio-counter branch February 28, 2024 17:59