Currently the crawler uses EventMachine's timer functions to rotate the output files. Because the clock of the machine that runs the crawler might not be perfectly synced with the clock at GitHub, events can end up in the "wrong" file.
For example, the earliest event in the file "2012-03-20-20.json" happened at "2012/03/20 19:59:55 -0700" and should actually be in the file "2012-03-20-19.json". The latest event in the same file happened at "2012/03/20 20:59:52 -0700", but an event that happened on "2012/03/20 20:59:58 -0700" is wrongly written into the file "2012-03-20-21.json".
Instead of relying on the crawler's clock, the crawler could parse the "created_at" field that is part of each event's dictionary and use that information to select the correct file to write to.
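A minimal sketch of the idea, assuming the file naming follows the examples above (the helper name `filename_for` is made up, and the zero-padded hour format is an assumption based on the two-digit hours shown):

```ruby
require "time"

# Hypothetical helper: derive the hourly archive filename from an
# event's "created_at" timestamp instead of the crawler's clock.
# Time.parse keeps the timestamp's own UTC offset, so the hour
# matches the event's local time as in the examples above.
def filename_for(event)
  Time.parse(event["created_at"]).strftime("%Y-%m-%d-%H.json")
end
```

With this, the two problem events from the example land in the right files: the "19:59:55" event maps to "2012-03-20-19.json" and the "20:59:58" event to "2012-03-20-20.json".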
Hmm, yeah, that's a great idea. One assumption here: GitHub's timeline is delivered "in order", but that seems like a sane thing to assume (although for sanity's sake it can and should be verified).
Instead of having only one file open at a time, you could use a hash mapping "current date and hour" to a file handle. When a handle hasn't been written to for a minute or so, close it and rename the file from .json.current to .json.
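That could look something like the sketch below. All names (`HourlyWriter`, `flush_idle!`, the 60-second idle threshold) are made up for illustration; `flush_idle!` would be driven by a periodic timer in the actual crawler:

```ruby
# Hypothetical sketch: one open handle per hourly bucket, closed and
# renamed after a minute of write inactivity.
class HourlyWriter
  IDLE_SECONDS = 60 # threshold is an assumption, "a minute or so"

  def initialize
    # Maps a bucket like "2012-03-20-19" to [file_handle, last_write_time].
    @handles = {}
  end

  def write(bucket, line)
    entry = (@handles[bucket] ||= [File.open("#{bucket}.json.current", "a"), Time.now])
    entry[0].puts(line)
    entry[1] = Time.now
  end

  # Close and rename buckets that have been quiet for IDLE_SECONDS.
  # Intended to be called periodically, e.g. from an EventMachine timer.
  def flush_idle!(now = Time.now)
    idle = @handles.select { |_, (_, last)| now - last >= IDLE_SECONDS }.keys
    idle.each do |bucket|
      file, _ = @handles.delete(bucket)
      file.close
      File.rename("#{bucket}.json.current", "#{bucket}.json")
    end
  end
end
```

The rename-on-close keeps downstream consumers from reading a file that is still being appended to, while the per-bucket handles tolerate events that straddle an hour boundary in either direction.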
And, talking about API sanity: of course, a check that an event's timestamp is within a couple of minutes of the crawler's clock wouldn't be a bad idea.
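For example, something along these lines (the helper name and the 5-minute tolerance are both made up):

```ruby
require "time"

MAX_SKEW_SECONDS = 5 * 60 # "a couple of minutes"; exact value is an assumption

# Hypothetical check: does this event's timestamp differ from the
# crawler's clock by more than the allowed skew? A true result would
# suggest logging a warning rather than dropping the event.
def suspicious_timestamp?(event, now = Time.now)
  (Time.parse(event["created_at"]) - now).abs > MAX_SKEW_SECONDS
end
```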
Fixed in b297069