Dropping NaNs #4
Comments
@djmcgregor thanks for the review and for the write-up. Super valuable! I was curious about which bad assumption exactly I codified, and found it (gladly) in a code comment:
Which you've shown to be a wrong assumption. Thanks again.
Agree! What's not surprising is that
I'd actually be super happy about a PR -- it's great to get contributions! Given the importance, I might indeed want to take care of this quickly now. Well. If I have not submitted a patch in the next 24 hours please feel free to drop a PR! :) I've looked at your research interests. Cool stuff! :)
Yes, your comments are quite detailed, and very helpful for following your logic. Closing, as #6 correctly addresses this issue.
This is interesting, I hadn't noticed the lost data is inherently stored. Good to know.
Glad to hear. As I continue to take a look at this repo, I'll share any enhancement ideas or issues I find. Thanks for making this and sharing open source!
In analyze.py, I see you intentionally drop all rows with NaNs. For low activity repos, you can actually lose a lot of data this way. For example, you may have visitors, but no clones (let's ignore that this action counts as a unique clone each time it runs).
I suggest the NaNs be replaced with zeros. Perhaps you have already tried this out?
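To illustrate the difference, here is a minimal sketch with made-up numbers. The column names and values are assumptions for illustration, not the actual layout used in analyze.py. It compares dropping rows that contain NaNs with filling them with zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical daily stats for a low-activity repo:
# visitors on every day, but clones reported on only one day.
df = pd.DataFrame(
    {
        "views_unique": [3.0, 1.0, 2.0],
        "clones_unique": [np.nan, 1.0, np.nan],
    },
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

dropped = df.dropna()  # keeps only rows where every column has a value
filled = df.fillna(0)  # keeps all rows, treating missing counts as zero

print(len(dropped), "row(s) survive dropna()")
print(len(filled), "row(s) survive fillna(0)")
```

With dropna(), the two days that have views but no clone count disappear entirely, so their view counts are lost as well.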
Raw fetch data
Exported data to views_clones_aggregate.csv
Edit
Since you use df.groupby().max(), replacing NaNs with zeros should not be a problem for edge-case NaNs, since a previously logged non-zero value will override subsequent NaNs. https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L774
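A small sketch of that reasoning, again with invented data (the "time" and "clones_unique" names are assumptions, not necessarily what analyze.py uses). Two overlapping fetches report the same day, one with a real count and one with a NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical overlapping fetches: 2021-01-01 appears twice,
# once with a logged value and once with a missing one.
raw = pd.DataFrame(
    {
        "time": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
        "clones_unique": [2.0, np.nan, np.nan],
    }
)

# Fill missing counts with zero, then keep the largest value seen per day.
agg = raw.fillna(0).groupby("time").max()
print(agg)
```

The zero that replaces the NaN on 2021-01-01 loses against the logged value of 2, so the aggregate for that day is unchanged. The day with no logged value at all ends up as 0 instead of being dropped.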
I could make a PR for this, but it's a simple enough change
Line 670:
df = df.fillna(0)
https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L670