Strange JSON data I found #135

hut8 · 2016-03-08T00:36:59Z

For githubcontributions.io, my ETL tool originally just split each file into lines and parsed only the first entry on each line, basically because I was lazy and that worked most of the time (except, e.g., see #18). Here's that code:
https://github.com/tenex/github-contributions/blob/b89b314451c52921084af5a67528044ebf7963bb/util/archive-processor#L605

Now I'm extending/porting that tool to Go, which makes it really easy to read single JSON documents from a stream. In doing so, I noticed that these hourly dumps sometimes have long runs of NUL characters (0x00 bytes) inserted between documents. One example is in 2015-07-21-12.json.gz. Steps to reproduce:

$ wget http://data.githubarchive.org/2015-07-21-12.json.gz
$ gzip -d 2015-07-21-12.json.gz
$ grep 2990966171 2015-07-21-12.json > 2990966171.json
# Note that the line *begins* with 2990966171; it contains multiple events
$ jq . < 2990966171.json > /dev/null
parse error: Invalid numeric literal at line 1, column 14056
# hmmmm.....
$ hd -s 14040 -n 2500 2990966171.json

💥 💥 💥 💥 💥 💥

000036d8  75 2f 33 38 36 39 37 35  32 3f 22 7d 7d 00 00 00  |u/3869752?"}}...|
000036e8  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003ff8  00 00 00 00 00 00 00 00  00 00 00 00 00 00 7b 22  |..............{"|
00004008  69 64 22 3a 22 32 39 39  30 39 38 32 38 39 31 22  |id":"2990982891"|
00004018  2c 22 74 79 70 65 22 3a  22 50 75 6c 6c 52 65 71  |,"type":"PullReq|
00004028  75 65 73 74 52 65 76 69  65 77 43 6f 6d 6d 65 6e  |uestReviewCommen|
00004038  74 45 76 65 6e 74 22 2c  22 61 63 74 6f 72 22 3a  |tEvent","actor":|
00004048  7b 22 69 64 22 3a 38 37  35 32 2c 22 6c 6f 67 69  |{"id":8752,"logi|
00004058  6e 22 3a 22 6d 69 78 6f  6e 69 63 22 2c 22 67 72  |n":"mixonic","gr|
00004068  61 76 61 74 61 72 5f 69  64 22 3a 22 22 2c 22 75  |avatar_id":"","u|
00004078  72 6c 22 3a 22 68 74 74  70 73 3a 2f 2f 61 70 69  |rl":"https://api|
00004088  2e 67 69 74 68 75 62 2e  63 6f 6d 2f 75 73 65 72  |.github.com/user|
00004098  73 2f 6d 69                                       |s/mi|
0000409c

Is this some sort of coverup? 😉 Any ideas about how it happened?


igrigorik · 2016-03-27T16:43:12Z

Hmm, interesting. No, not sure what might be causing that. We parse each event and use YAJL to encode them before writing them out: https://github.com/igrigorik/githubarchive.org/blob/master/crawler/crawler.rb#L69

Checking the YAJL repo, I don't see any issues that could be causing this. Hmmmm.

notslang · 2016-05-26T03:59:29Z

Ran into this same problem... piping it through tr -s '\0' '\n' (which turns each run of NUL bytes into a single newline) is a decent workaround.

igrigorik · 2016-05-29T18:58:16Z

Not sure what the issue here is. I'm inclined to close this as a "wontfix", at least until we have a working theory for what's causing it.

igrigorik · 2016-06-24T20:49:16Z

We've reprocessed past archives as part of #112, so I think this should be fixed. Let me know if it isn't.

hut8 mentioned this issue Mar 8, 2016

Lb/fix nulls in githubarchive tenex/opensourcecontributors#78

Closed

igrigorik added the need-feedback label May 29, 2016

igrigorik closed this as completed Jun 24, 2016
