Use time of event instead of local time #8

Closed
soult opened this Issue Apr 20, 2012 · 3 comments

2 participants

@soult

Currently the crawler uses EventMachine's timer functions to rotate the output files. Because the clock of the machine that runs the crawler might not be perfectly synced with GitHub's clock, events near an hour boundary end up in the "wrong" file.

For example, the earliest event in the file "2012-03-20-20.json" happened at "2012/03/20 19:59:55 -0700" and should actually be in the file "2012-03-20-19.json". The latest event in the same file happened at "2012/03/20 20:59:52 -0700", but an event that happened at "2012/03/20 20:59:58 -0700" is wrongly written into the file "2012-03-20-21.json".

Instead of relying on the crawler's clock, the crawler could parse the "created_at" field that is part of each event's dictionary and use that information to select the correct file to write to.
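
A rough sketch of what I mean, not tested against the actual crawler. The method name `file_for_event` is made up, and the timestamp format is an assumption taken from the examples above:

```ruby
require "date"

# Assumed format of "created_at", based on the sample timestamps in this issue.
CREATED_AT_FORMAT = "%Y/%m/%d %H:%M:%S %z"

# Hypothetical helper: picks the hourly output file from the event's own
# timestamp instead of the crawler's clock. The hour is taken in the
# timestamp's own UTC offset, matching the existing file names.
def file_for_event(event)
  created = DateTime.strptime(event["created_at"], CREATED_AT_FORMAT)
  created.strftime("%Y-%m-%d-%H") + ".json"
end
```

With this, the event from "2012/03/20 19:59:55 -0700" maps to "2012-03-20-19.json" no matter when the crawler receives it.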

@igrigorik
Owner

Hmm, yeah, that's a great idea. One assumption here: GitHub's timeline is delivered "in order". That seems like a sane thing to assume, although for sanity's sake it can and should be verified.
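
One way to verify the ordering on an already-collected batch, as a sketch only. The helper name `out_of_order_count` is hypothetical, and the timestamp format is assumed from the examples in this issue:

```ruby
require "date"

# Assumed "created_at" format, based on the sample timestamps in this issue.
TIMELINE_FORMAT = "%Y/%m/%d %H:%M:%S %z"

# Hypothetical check: counts adjacent pairs of events where the later
# entry in the list carries an earlier timestamp than its predecessor.
def out_of_order_count(events)
  times = events.map { |e| DateTime.strptime(e["created_at"], TIMELINE_FORMAT) }
  times.each_cons(2).count { |previous, current| current < previous }
end
```

A result of zero over a large sample would support the in-order assumption; anything else means the writer has to tolerate late events.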

@soult

Instead of having only one file open at a time, you could use a hash to map from "current date and hour" to file handle. When a file handle hasn't been written to for a minute or so, close it and rename the file from .json.current to .json.
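
Roughly like this. It is only a sketch: the class name `HourlyWriter`, its methods, and the 60-second default are my own placeholders, and `sweep` would have to be called from a periodic timer:

```ruby
# Hypothetical writer that keeps one open handle per "date and hour" key.
class HourlyWriter
  Entry = Struct.new(:io, :last_write)

  def initialize(dir, idle_seconds = 60)
    @dir = dir
    @idle_seconds = idle_seconds
    @open = {} # hour key (e.g. "2012-03-20-19") => Entry
  end

  # Appends one line to the .json.current file for the given hour key.
  def write(hour_key, line, now = Time.now)
    entry = (@open[hour_key] ||= Entry.new(File.open(current_path(hour_key), "a"), now))
    entry.io.puts(line)
    entry.last_write = now
  end

  # Closes handles idle for at least idle_seconds and renames
  # .json.current to .json.
  def sweep(now = Time.now)
    @open.delete_if do |hour_key, entry|
      next false if now - entry.last_write < @idle_seconds
      entry.io.close
      File.rename(current_path(hour_key), final_path(hour_key))
      true
    end
  end

  def open_keys
    @open.keys
  end

  private

  def current_path(hour_key)
    File.join(@dir, "#{hour_key}.json.current")
  end

  def final_path(hour_key)
    File.join(@dir, "#{hour_key}.json")
  end
end
```

One caveat with this sketch: an event that arrives after its file has already been renamed would start a fresh .json.current, and the next rename would replace the finished file. That case needs separate handling.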

And, speaking of API sanity: of course, checking that the time of an event is within a couple of minutes of the crawler's clock might not be a bad idea.
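
Something along these lines, as a sketch. The name `plausible_timestamp?` and the 120-second threshold are placeholders, and the timestamp format is again assumed from the examples above:

```ruby
require "date"

# Placeholder threshold for "a couple of minutes".
MAX_SKEW_SECONDS = 120

# Hypothetical check: true when the event's "created_at" value is within
# max_skew seconds of the crawler's clock, in either direction.
def plausible_timestamp?(created_at, now = Time.now, max_skew = MAX_SKEW_SECONDS)
  event_time = DateTime.strptime(created_at, "%Y/%m/%d %H:%M:%S %z").to_time
  (now - event_time).abs <= max_skew
end
```

Events that fail the check could be logged rather than dropped, so that a badly skewed crawler clock shows up quickly.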

@igrigorik
Owner

Fixed in b297069

@igrigorik igrigorik closed this Aug 3, 2014