adding my own shitty gtfs csv parser + schedule files #10

jakswa · 2015-02-24T07:21:40Z

Well, it's been a year and marta.io still suffers from the same API edge cases for terminating stations (and the stations near them). Here we have a calendar file that lays out which service ID to use each day of the week, and then a file for each service. This schedule data is intended for use whenever a train isn't in the API for a given direction. I hope to visually distinguish schedule data from realtime data, so people know when an arrival isn't based on realtime data.
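For illustration, here is a minimal sketch of the day-of-week lookup described above. The JSON shape, the service IDs, and the `servicesFor` name are assumptions made for the example, not necessarily what this PR produces. It also ignores `start_date`/`end_date` ranges.

```javascript
// Hypothetical shape for the reduced calendar file: one row per service,
// with a 0/1 flag per weekday (same column names as GTFS calendar.txt).
// The service IDs below are made up for the example.
var calendar = [
  { service_id: 'weekday', monday: 1, tuesday: 1, wednesday: 1, thursday: 1, friday: 1, saturday: 0, sunday: 0 },
  { service_id: 'saturday', monday: 0, tuesday: 0, wednesday: 0, thursday: 0, friday: 0, saturday: 1, sunday: 0 },
  { service_id: 'sunday', monday: 0, tuesday: 0, wednesday: 0, thursday: 0, friday: 0, saturday: 0, sunday: 1 }
];

// Date#getDay() returns 0 for Sunday through 6 for Saturday.
var DAY_COLUMNS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];

// Returns the service_ids that run on the given date's day of the week.
function servicesFor(date, calendarRows) {
  var column = DAY_COLUMNS[date.getDay()];
  return calendarRows
    .filter(function (row) { return Number(row[column]) === 1; })
    .map(function (row) { return row.service_id; });
}
```

The returned IDs would then pick which per-service schedule file to load.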

There should also exist exceptions for holidays/etc, but none exist in the GTFS data right now, and I'm going to add those (and the logic for them) when they appear.

An alternative I considered was scraping MARTA's website for each station's schedule. I chose GTFS because it's a standard spec, and because I got scared while looking at the source code on MARTA's website.

parser reduces 50MB or so of GTFS CSVs down to around 400kB worth of train-related json files
this was as painful as I thought it'd be
i'll need to re-run (and probably fix) this task as the gtfs data changes
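As a rough sketch of the kind of reduction involved (not the actual task in this PR): keep only rail routes, then the trips on those routes, then the stop times for those trips, grouped per service. The function name and output shape are hypothetical, and treating `route_type` 1 as "rail" is an assumption about the feed. Rows are assumed to be already parsed into objects keyed by the standard GTFS column names.

```javascript
// Assumption: rail routes are tagged route_type 1 (subway/metro) in the feed.
var RAIL_ROUTE_TYPE = '1';

// routes, trips, stopTimes: arrays of row objects from routes.txt,
// trips.txt and stop_times.txt. Returns { service_id: [stop time, ...] }.
function railStopTimesByService(routes, trips, stopTimes) {
  var railRoutes = {};
  routes.forEach(function (r) {
    if (String(r.route_type) === RAIL_ROUTE_TYPE) railRoutes[r.route_id] = true;
  });

  // trip_id -> service_id, for rail trips only
  var tripService = {};
  trips.forEach(function (t) {
    if (railRoutes[t.route_id]) tripService[t.trip_id] = t.service_id;
  });

  var byService = {};
  stopTimes.forEach(function (st) {
    var service = tripService[st.trip_id];
    if (service === undefined) return; // bus trip, drop it
    (byService[service] = byService[service] || []).push({
      stop_id: st.stop_id,
      arrival_time: st.arrival_time
    });
  });
  return byService;
}
```

Each key of the result could be written out as its own JSON file, one per service.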

- parser reduces 50MB or so of CSVs down to 600kB worth of train-related json files
- this was as painful as I thought it'd be
- i'll need to re-run (and probably fix) this task as the gtfs data changes

bblack · 2015-02-24T21:21:02Z

lib/tasks/parse_gtfs.js

+
+// of course they couldn't use the same station names
+// as in the realtime API
+var station_name_mapping = {


jakswa · 2015-02-26T18:39:33Z

In my testing over the past couple of days, it turns out the schedule data is kind of useless in predicting when a train will depart from one of the terminal stations. It might still be better than nothing, but I'm leaning towards only using it on terminal stations like Doraville.

Maybe 1 out of every 4 or 5 predictions is accurate enough, in my guesstimation, to be useful. When things go well, you see these gray numbers that indicate schedule data. Then, as the countdown nears zero for Doraville or Airport, you'd see a train suddenly appear:

More often than not, though, either the schedule time will pass with no train leaving (and we just roll over to predict the next scheduled arrival, which is depressing), or a train departs too early and we get confusing states going on. Below, the schedule still had 6min and 3min to go for the terminal stations, but a train left early, so we're stuck in some kind of useless state that I'm sure will confuse users:

bblack · 2015-02-26T18:55:05Z

@itsmarta definitely has the realtime data for those end stations somewhere, since it's visible on the boards (at least at Chamblee). Have you tried contacting them to see if they can get that stuff reflected in the API?

http://www.itsmarta.com/developers/contact-us.aspx

I see you started this thread - you may want to push harder on it:
https://groups.google.com/forum/?fromgroups=#!topic/atl-transit-developers/JB0YWwgTHdM

Also the page at http://itsmarta.com/developers/default.aspx gives the email address martadevreq@itsmarta.com

jakswa · 2015-02-26T19:35:41Z

Back when I was gathering feedback from reddit, someone did PM me with an undocumented API route that is station-specific (think next_train/:station) and outputs realtime data mixed with schedule data. The guess was that they use this endpoint for all the signs at each station, because it contains arrivals that are marked with (scheduled). Here's some example output for Doraville.

However, I'd have to hit that endpoint once for each station we care about, and then repeat that every 10s or so whenever someone is looking at the website. Not ideal... but I could start doing it every 10s for the terminating stations, at least...

P.S. after examining the undocumented endpoint's (scheduled) arrivals and my GTFS data, it looks like I'm doing what MARTA is doing... I'd like to monitor it a bit more, and see if they deviate from GTFS (say if there are delays), but it looks like this work is going to go out if I'm doing the exact same thing as MARTA

- mixing in schedule data alongside realtime data, whenever a station only has realtime predictions for a single direction (hopefully only near-terminating stations)
- this should sidestep the cases close to after-hours when there are many stations with no predictions
- marked as `scheduled: true` to differentiate them
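A minimal sketch of that rule, assuming each arrival is an object with a `direction` field (the field names other than `scheduled` and the `mixInSchedule` name are hypothetical, not taken from this PR). Schedule data is only added when exactly one direction has realtime predictions, so a station with no realtime data at all stays empty.

```javascript
// realtime, scheduled: arrays of arrivals for ONE station, e.g.
//   { direction: 'N', waiting_seconds: 300 }
// Returns realtime arrivals, plus scheduled arrivals for the missing
// direction when realtime only covers a single direction.
function mixInSchedule(realtime, scheduled) {
  var seen = {};
  realtime.forEach(function (a) { seen[a.direction] = true; });
  var directions = Object.keys(seen);

  // No realtime data (e.g. after hours) or both directions covered:
  // leave the realtime list alone.
  if (directions.length !== 1) return realtime.slice();

  var extras = scheduled
    .filter(function (a) { return a.direction !== directions[0]; })
    .map(function (a) {
      // copy the arrival and flag it so the UI can render it differently
      var copy = {};
      Object.keys(a).forEach(function (k) { copy[k] = a[k]; });
      copy.scheduled = true;
      return copy;
    });
  return realtime.concat(extras);
}
```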

jakswa · 2015-02-26T20:09:43Z

I'll deploy this branch to marta.io when I get a free moment later today.

- http://replygif.net/i/216.gif

- also solve case where now = 11:55 and arrival = past midnight
- first time wrapping my head around timestamps that go past 24:00:00 ಠ_ಠ
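For reference, GTFS lets `arrival_time` exceed 24:00:00 for trips that run past midnight of their service day. One way to handle that, sketched here with hypothetical helper names (not the code from this PR), is to convert everything to seconds and shift by a day when the arrival belongs to the previous service day. DST transitions are ignored.

```javascript
var DAY_SECONDS = 24 * 60 * 60;

// '24:10:00' -> 87000. Hours may be 24 or more; that is valid GTFS.
function gtfsSeconds(hms) {
  var parts = hms.split(':').map(Number);
  return parts[0] * 3600 + parts[1] * 60 + parts[2];
}

// arrival: GTFS time string from the schedule.
// nowSeconds: wall-clock seconds since local midnight.
// fromPreviousServiceDay: true when the arrival comes from yesterday's
// service (i.e. we are already past midnight on the clock).
function secondsUntil(arrival, nowSeconds, fromPreviousServiceDay) {
  var arrivalSeconds = gtfsSeconds(arrival);
  if (fromPreviousServiceDay) arrivalSeconds -= DAY_SECONDS;
  return arrivalSeconds - nowSeconds;
}
```

With this, an arrival at `24:10:00` is 15 minutes away when the clock reads 23:55, and 5 minutes away when the clock reads 00:05 the next morning and the arrival is looked up in the previous day's service.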

- task time jumped from ~0.8s to 5+ seconds. still doable.
- due to parsing every line of the CSV now, instead of only looking at the start of the vast majority of lines
- could add stream listeners on both, and send only valid lines to the CSV parser... maybe later
- verified this by running the task again and seeing if my JSON changed
- also deleted my json and made sure the task added it back without anything showing up in `git diff`
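The "send only valid lines to the CSV parser" idea could look roughly like this: a cheap check on the start of each raw line, so that only matching lines get a full CSV parse. This is a sketch under assumptions: `makeLineFilter` is a hypothetical helper, and it assumes the ID of interest is the first, unquoted column of the file (GTFS does not fix column order, so that would need checking against the header row).

```javascript
// Builds a predicate that keeps a raw CSV line only if its first column
// is one of the wanted IDs. No quoting support: first column is assumed
// to be a plain, unquoted value.
function makeLineFilter(wantedIds) {
  var wanted = Object.create(null);
  wantedIds.forEach(function (id) { wanted[id] = true; });

  return function (line) {
    var comma = line.indexOf(',');
    var firstColumn = comma === -1 ? line : line.slice(0, comma);
    return wanted[firstColumn] === true;
  };
}
```

In a streaming setup, the predicate would run on each line before it is handed to the real parser, for example `lines.filter(makeLineFilter(railTripIds))` on a chunk of lines.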

- might need a changelog if these features keep up

jakswa · 2015-02-28T07:25:37Z

This has been deployed for a couple days now and is working pretty well. I've cleaned up some things, fixed some edge cases, added csv-parse, and updated the README. Clickin' the button!

jakswa force-pushed the gtfs_fml branch from 616e2fc to a8d9a4e on February 24, 2015 07:35

adding own shitty gtfs csv parser + schedule files

0e95100

jakswa force-pushed the gtfs_fml branch from a8d9a4e to 0e95100 on February 24, 2015 08:40

adding API route for scheduled arrival times by station

6c17ea8

bblack reviewed Feb 24, 2015

jakswa added 6 commits February 27, 2015 00:18

don't use schedule data >60m away

7d2887e

handle GTFS spec allowing > 24hr timestamps

42c17f6

- http://replygif.net/i/216.gif

5s timeout instead of 30, c'mon express

7f27a19

clean up GTFS timestamp search, add comments

30e68bb

- also solve case where now = 11:55 and arrival = past midnight - first time wrapping my head around timestamps that go past 24:00:00 ಠ_ಠ

adding quick blurb in README about GTFS

243ae87

- might need a changelog if these features keep up

jakswa added a commit that referenced this pull request Feb 28, 2015

Merge pull request #10 from jakswa/gtfs_fml

27cd8fd

jakswa merged commit 27cd8fd into master Feb 28, 2015

jakswa deleted the gtfs_fml branch February 28, 2015 07:25
