Now I'm extending/porting that tool to Go, which makes it really easy to read single JSON documents from a stream. In doing so, I noticed that these hourly dumps sometimes have lots of NULL characters inserted between documents. One example is 2015-07-21-12.json.gz -- steps to reproduce:
$ wget http://data.githubarchive.org/2015-07-21-12.json.gz
$ gzip -d 2015-07-21-12.json.gz
$ grep 2990966171 2015-07-21-12.json > 2990966171.json
# Note that the line *begins* with 2990966171; it contains multiple events
$ jq . < 2990966171.json > /dev/null
parse error: Invalid numeric literal at line 1, column 14056
# hmmmm.....
$ hd -s 14040 -n 2500 2990966171.json
For githubcontributions.io, my original ETL tool simply broke the file into lines and parsed the first entry in each, basically because I was lazy and that worked most of the time (except, e.g., see #18). Here's that code:
https://github.com/tenex/github-contributions/blob/b89b314451c52921084af5a67528044ebf7963bb/util/archive-processor#L605
💥 💥 💥 💥 💥 💥
Is this some sort of coverup? 😉 Any ideas about how it happened?