
Dropping NaNs #4

Closed
djmcgregor opened this issue Jan 22, 2021 · 2 comments
djmcgregor commented Jan 22, 2021

In analyze.py, I see you intentionally drop all rows with NaNs. For low activity repos, you can actually lose a lot of data this way. For example, you may have visitors, but no clones (let's ignore that this action counts as a unique clone each time it runs).
I suggest the NaNs be replaced with zeros. Perhaps you have already tried this out?

Raw fetch data: [screenshot]

Exported data to views_clones_aggregate.csv: [screenshot]


Since you use df.groupby().max(), replacing NaNs with zeros should not be a problem for the edge-case NaNs, since a previously logged non-zero value will win out over a zero filled in later.
https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L774

I could make a PR for this, but it's a simple enough change.
Line 670: `df = df.fillna(0)`
https://github.com/jgehrcke/github-repo-stats/blob/main/analyze.py#L670
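To illustrate the suggestion (a minimal sketch with made-up numbers, not the actual GHRS data layout): dropping NaN rows discards whole days, while filling with zero keeps every day, and groupby().max() still prefers any previously logged non-zero value over a filled-in zero.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the views/clones time series: two fragments
# overlap on 2021-01-20, and 2021-01-21 has visitors but no clone data.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2021-01-20", "2021-01-20", "2021-01-21"]),
        "views": [5, 5, 3],
        "clones": [np.nan, 2.0, np.nan],
    }
)

# Dropping NaN rows loses the 2021-01-21 sample entirely.
dropped = df.dropna()

# Filling with zero keeps the day; groupby().max() then picks the
# previously logged non-zero clone count (2.0) over the filled-in zero.
filled = df.fillna(0).groupby("time").max()
```

Here `dropped` keeps only the one fully populated row, while `filled` retains 2021-01-21 with views=3 and clones=0.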

jgehrcke (Owner) commented:

@djmcgregor thanks for the review and for the write-up. Super valuable!

I was curious about which bad assumption exactly I codified, and found it (gladly) in a code comment:

NaN are expected only at the boundaries of each fragment (first and maybe last sample).

Which you've shown to be a wrong assumption. Thanks again.

Since you use df.groupby().max(), replacing NaNs with zeros should not be a problem for the edge-case NaNs,

Agree!

Reassuringly, fetch.py itself is not lossy; the culprit is the aggregation in analyze.py. The individual snapshot files obtained by fetch.py are still in the git history, i.e. the data loss you mention is conceptually reversible :-).

I could make a PR for this, but it's a simple enough change

I'd actually be super happy about a PR -- it's great to get contributions! Given the importance, I might indeed want to take care of this quickly now. Well. If I have not submitted a patch in the next 24 hours please feel free to drop a PR! :)

I've looked at your research interests. Cool stuff! :)

jgehrcke added a commit that referenced this issue Jan 23, 2021
djmcgregor (Author) commented:

Yes, your comments are quite detailed, and very helpful for following your logic. Closing, as #6 correctly addresses this issue.

The individual snapshot files obtained by fetch.py are still in the git history

This is interesting; I hadn't noticed that the lost data is inherently preserved. Good to know.

I'd actually be super happy about a PR

Glad to hear. As I continue to take a look at this repo, I'll share any enhancement ideas or issues I find. Thanks for making this and sharing open source!
