
Dropping NaNs #4

Closed
djmcgregor opened this issue Jan 22, 2021 · 2 comments
djmcgregor commented Jan 22, 2021

In analyze.py, I see you intentionally drop all rows with NaNs. For low activity repos, you can actually lose a lot of data this way. For example, you may have visitors, but no clones (let's ignore that this action counts as a unique clone each time it runs).
I suggest the NaNs be replaced with zeros. Perhaps you have already tried this out?

Raw fetch data: [screenshot]

Exported data to views_clones_aggregate.csv: [screenshot]


Since you use df.groupby().max(), replacing NaNs with zeros should not be a problem for the edge-case NaNs, since a previously logged non-zero value will win out over a zero filled in later.
https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L774

I could make a PR for this, but it's a simple enough change.
Line 670: `df = df.fillna(0)`
https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L670
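To illustrate the suggestion (a minimal sketch with made-up numbers, not the actual GHRS data layout): dropping NaN rows discards whole days, while filling with zero keeps every day, and groupby().max() still prefers any previously logged non-zero value over a filled-in zero.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the views/clones time series: two fragments
# overlap on 2021-01-20, and 2021-01-21 has visitors but no clone data.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2021-01-20", "2021-01-20", "2021-01-21"]),
        "views": [5, 5, 3],
        "clones": [np.nan, 2.0, np.nan],
    }
)

# Dropping NaN rows loses the 2021-01-21 sample entirely.
dropped = df.dropna()

# Filling with zero keeps the day; groupby().max() then picks the
# previously logged non-zero clone count (2.0) over the filled-in zero.
filled = df.fillna(0).groupby("time").max()
```

Here `dropped` keeps only the one fully populated row, while `filled` retains 2021-01-21 with views=3 and clones=0.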

jgehrcke (Owner) commented:

@djmcgregor thanks for the review and for the write-up. Super valuable!

I was curious about which bad assumption exactly I codified, and found it (gladly) in a code comment:

NaN are expected only at the boundaries of each fragment (first and maybe last sample).

Which you've shown to be a wrong assumption. Thanks again.

Since you use df.groupby().max(), replacing NaNs with zeros should not be a problem for the edge-case NaNs,

Agree!

Reassuringly, fetch.py itself is not lossy; the culprit is the aggregation in analyze.py. The individual snapshot files obtained by fetch.py are still in the git history, i.e. the data loss you mention is conceptually reversible :-).

I could make a PR for this, but it's a simple enough change

I'd actually be super happy about a PR -- it's great to get contributions! Given the importance, I might indeed want to take care of this quickly now. Well. If I have not submitted a patch in the next 24 hours please feel free to drop a PR! :)

I've looked at your research interests. Cool stuff! :)

jgehrcke added a commit that referenced this issue Jan 23, 2021
djmcgregor (Author) commented:

Yes, your comments are quite detailed, and very helpful for following your logic. Closing, as #6 correctly addresses this issue.

The individual snapshot files obtained by fetch.py are still in the git history

This is interesting; I hadn't noticed that the lost data is inherently preserved. Good to know.

I'd actually be super happy about a PR

Glad to hear. As I continue to take a look at this repo, I'll share any enhancement ideas or issues I find. Thanks for making this and sharing open source!
