Strange JSON data I found #135

Closed
hut8 opened this issue Mar 8, 2016 · 4 comments

Comments

@hut8
Contributor

hut8 commented Mar 8, 2016

For githubcontributions.io, my ETL tool originally just broke the file into lines and parsed only the first entry in each, basically because I was lazy and that worked most of the time (except, e.g., see #18). Here's that code:
https://github.com/tenex/github-contributions/blob/b89b314451c52921084af5a67528044ebf7963bb/util/archive-processor#L605

Now I'm extending/porting that tool to Go, which makes it really easy to read single JSON documents from a stream. In doing so, I noticed that these hourly dumps sometimes have long runs of NUL (0x00) characters inserted between documents. One example is in 2015-07-21-12.json.gz -- steps to reproduce:

$ wget http://data.githubarchive.org/2015-07-21-12.json.gz
$ gzip -d 2015-07-21-12.json.gz
$ grep 2990966171 2015-07-21-12.json > 2990966171.json
# Note that the line *begins* with 2990966171; it contains multiple events
$ jq . < 2990966171.json > /dev/null
parse error: Invalid numeric literal at line 1, column 14056
# hmmmm.....
$ hd -s 14040 -n 2500 2990966171.json

💥 💥 💥 💥 💥 💥

000036d8  75 2f 33 38 36 39 37 35  32 3f 22 7d 7d 00 00 00  |u/3869752?"}}...|
000036e8  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003ff8  00 00 00 00 00 00 00 00  00 00 00 00 00 00 7b 22  |..............{"|
00004008  69 64 22 3a 22 32 39 39  30 39 38 32 38 39 31 22  |id":"2990982891"|
00004018  2c 22 74 79 70 65 22 3a  22 50 75 6c 6c 52 65 71  |,"type":"PullReq|
00004028  75 65 73 74 52 65 76 69  65 77 43 6f 6d 6d 65 6e  |uestReviewCommen|
00004038  74 45 76 65 6e 74 22 2c  22 61 63 74 6f 72 22 3a  |tEvent","actor":|
00004048  7b 22 69 64 22 3a 38 37  35 32 2c 22 6c 6f 67 69  |{"id":8752,"logi|
00004058  6e 22 3a 22 6d 69 78 6f  6e 69 63 22 2c 22 67 72  |n":"mixonic","gr|
00004068  61 76 61 74 61 72 5f 69  64 22 3a 22 22 2c 22 75  |avatar_id":"","u|
00004078  72 6c 22 3a 22 68 74 74  70 73 3a 2f 2f 61 70 69  |rl":"https://api|
00004088  2e 67 69 74 68 75 62 2e  63 6f 6d 2f 75 73 65 72  |.github.com/user|
00004098  73 2f 6d 69                                       |s/mi|
0000409c

Is this some sort of coverup? 😉 Any ideas about how it happened?
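
For anyone who wants to check other hourly dumps for the same padding without eyeballing a hexdump, here is a minimal sketch in Python. The helper name `find_nul_runs` and the sample bytes are made up for illustration; nothing here comes from the archive's own tooling. It simply reports the offset and length of every run of NUL bytes in a byte string.

```python
import re


def find_nul_runs(data: bytes, min_len: int = 1):
    """Return (offset, length) for each run of NUL bytes at least min_len long."""
    runs = []
    for match in re.finditer(rb"\x00+", data):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


# Hypothetical sample: two tiny JSON documents separated by five NUL bytes.
sample = b'{"id":"1"}' + b"\x00" * 5 + b'{"id":"2"}'
for offset, length in find_nul_runs(sample):
    print(f"NUL run at offset {offset:#x}, {length} bytes")
```

To run it against a real file, read the decompressed dump in binary mode (`open(path, "rb").read()`) and pass the bytes in. That loads the whole file into memory, which should be acceptable for a single hourly file but is an assumption about file size.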

@igrigorik
Owner

Hmm, interesting. No, not sure what might be causing that. We parse each event and use YAJL to encode them before writing them out: https://github.com/igrigorik/githubarchive.org/blob/master/crawler/crawler.rb#L69

Checking YAJL repo, I don't see any issues that could be causing this. Hmmmm.

@notslang

ran into this same problem... piping it through tr -s '\0' '\n' is a decent workaround
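
The same workaround can be sketched inside a parser rather than in the shell. The version below is an illustration in Python, not code from any of the tools mentioned in this thread; `iter_events` and the sample input are hypothetical names. Like the `tr` pipeline, it treats any run of NUL bytes as a document separator, alongside ordinary newlines. It assumes NUL bytes only ever appear between documents, which should hold for valid JSON since raw control characters are not allowed inside strings.

```python
import json
import re

# Runs of NUL bytes and line breaks are both treated as separators.
SEPARATORS = re.compile(rb"[\x00\r\n]+")


def iter_events(raw: bytes):
    """Yield one parsed JSON document per separator-delimited chunk."""
    for chunk in SEPARATORS.split(raw):
        if chunk.strip():  # skip empty chunks at the start or end of the input
            yield json.loads(chunk)


# Hypothetical input: two events glued together by NUL padding, then a normal line.
raw = b'{"id":"1"}' + b"\x00" * 10 + b'{"id":"2"}\n{"id":"3"}\n'
for event in iter_events(raw):
    print(event["id"])
```

This shares the limitation of the `tr` approach: two documents that sit back to back with no separator at all would still fail to parse as a single chunk.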

@igrigorik
Owner

Not sure what the issue here is. I'm inclined to close this as a "wontfix".. at least, until we have a working theory for what's causing it.

@igrigorik
Owner

We've reprocessed past archives as part of #112. This — I think — should be fixed. Let me know if otherwise.
