adding my own shitty gtfs csv parser + schedule files #10

Merged
merged 9 commits into from
Feb 28, 2015

jakswa
Owner

@jakswa jakswa commented Feb 24, 2015

Well, it's been a year and marta.io still suffers from the same API edge cases for terminating stations (and those stations near them). Here we have a calendar file that lays out the service # to use each day of the week, and then a file for each service. This schedule data is intended for use whenever a train isn't in the API for a given direction. I hope to highlight schedule data vs realtime data, so people know that it's not based on realtime data.
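For readers unfamiliar with the calendar file, here is a minimal sketch of the day-of-week lookup described above. It assumes rows shaped like GTFS `calendar.txt` (a `service_id` plus one `0`/`1` column per weekday); the function name `serviceIdsFor` and the sample service ids are made up for illustration and are not taken from this PR. It also ignores the `start_date`/`end_date` columns.

```javascript
// Weekday column names, indexed to match Date.prototype.getDay() (0 = Sunday).
var DAY_COLUMNS = ['sunday', 'monday', 'tuesday', 'wednesday',
                   'thursday', 'friday', 'saturday'];

// rows: objects keyed by calendar.txt column names, with string values.
// Returns the service ids that run on the weekday of `date`.
function serviceIdsFor(rows, date) {
  var column = DAY_COLUMNS[date.getDay()];
  return rows
    .filter(function (row) { return row[column] === '1'; })
    .map(function (row) { return row.service_id; });
}

// Made-up sample data, NOT MARTA's real service ids.
var sampleCalendar = [
  { service_id: 'WEEKDAY', sunday: '0', monday: '1', tuesday: '1',
    wednesday: '1', thursday: '1', friday: '1', saturday: '0' },
  { service_id: 'SATURDAY', sunday: '0', monday: '0', tuesday: '0',
    wednesday: '0', thursday: '0', friday: '0', saturday: '1' },
  { service_id: 'SUNDAY', sunday: '1', monday: '0', tuesday: '0',
    wednesday: '0', thursday: '0', friday: '0', saturday: '0' }
];

// Feb 24, 2015 was a Tuesday, so this picks the weekday service.
var todaysServices = serviceIdsFor(sampleCalendar, new Date(2015, 1, 24));
```

Once a service id is known, the matching per-service schedule file can be loaded for that day.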

There should also exist exceptions for holidays/etc, but none exist in the GTFS spec right now, and I'm going to add those (and the logic for them) when they appear.

An alternative I considered was scraping MARTA's website for each station's schedule. I chose GTFS because it's a standard spec, and because I got scared while looking at the source code on MARTA's website.

  • parser reduces 50MB or so of GTFS CSVs down to around 400kB worth of train-related json files
  • this was as painful as I thought it'd be
  • i'll need to re-run (and probably fix) this task as the gtfs data changes
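A rough sketch of the kind of reduction the parser performs, for illustration only. It assumes the rail routes are marked with GTFS `route_type` `1` (subway/metro) in `routes.txt`, and that no CSV field contains a quoted comma. The helpers `parseCsv` and `railStopTimes` are hypothetical names, not the ones in this branch.

```javascript
// Naive CSV parsing: header row plus comma-separated lines.
// Assumes no quoted fields containing commas.
function parseCsv(text) {
  var lines = text.trim().split(/\r?\n/);
  var header = lines[0].split(',');
  return lines.slice(1).map(function (line) {
    var cells = line.split(',');
    var row = {};
    header.forEach(function (name, i) { row[name] = cells[i]; });
    return row;
  });
}

// Keep only stop times that belong to rail trips.
function railStopTimes(routesCsv, tripsCsv, stopTimesCsv) {
  // 1. Collect route ids whose route_type is '1' (assumed to mean rail here).
  var railRoutes = {};
  parseCsv(routesCsv).forEach(function (r) {
    if (r.route_type === '1') { railRoutes[r.route_id] = true; }
  });
  // 2. Map each rail trip id to its service id.
  var railTrips = {};
  parseCsv(tripsCsv).forEach(function (t) {
    if (railRoutes[t.route_id]) { railTrips[t.trip_id] = t.service_id; }
  });
  // 3. Drop every stop time that is not on a rail trip, and keep few fields.
  return parseCsv(stopTimesCsv)
    .filter(function (s) { return railTrips.hasOwnProperty(s.trip_id); })
    .map(function (s) {
      return {
        service_id: railTrips[s.trip_id],
        stop_id: s.stop_id,
        arrival_time: s.arrival_time
      };
    });
}
```

Most of the size reduction would come from step 3, since bus trips make up the bulk of `stop_times.txt` in a typical transit feed.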

- parser reduces 50MB or so of CSVs down to 600kB worth of
  train-related json files
- this was as painful as I thought it'd be
- i'll need to re-run (and probably fix) this task as the gtfs data
  changes

// of course they couldn't use the same station names
// as in the realtime API
var station_name_mapping = {
Collaborator

😏

@jakswa
Owner Author

jakswa commented Feb 26, 2015

In my testing over the past couple of days, it turns out the schedule data is kind of useless in predicting when a train will depart from one of the terminal stations. It might still be better than nothing, but I'm leaning towards only using it on terminal stations like Doraville.

Maybe 1 out of every 4 or 5 predictions is accurate enough, in my guesstimation, to be useful. When things go well, you see these gray numbers that indicate schedule data. Then, as the countdown nears zero for Doraville or Airport, you'd see a train suddenly appear:

[screenshot: m3]

More often than not, though, either the schedule time will pass with no train leaving (and we just roll over to predict the next scheduled arrival, which is depressing), or a train departs too early and we get confusing states going on. Below, the schedule still had 6min and 3min to go for the terminal stations, but a train left early, so we're stuck in some kind of useless state that I'm sure will confuse users:

[screenshot: mart1]
[screenshot: marta2]

@bblack
Collaborator

bblack commented Feb 26, 2015

@itsmarta definitely has the realtime data for those end stations somewhere, since it's visible on the boards (at least at Chamblee). Have you tried contacting them to see if they can get that stuff reflected in the API?

http://www.itsmarta.com/developers/contact-us.aspx

I see you started this thread - you may want to push harder on it:
https://groups.google.com/forum/?fromgroups=#!topic/atl-transit-developers/JB0YWwgTHdM

Also the page at http://itsmarta.com/developers/default.aspx gives the email address martadevreq@itsmarta.com

@jakswa
Owner Author

jakswa commented Feb 26, 2015

Back when I was gathering feedback from reddit, someone did PM me with an undocumented API route that is station specific (think next_train/:station) that outputs realtime data mixed with schedule data. The guess was that they used this endpoint for all the signs at each station, because it contains arrivals that are marked with (scheduled). Here's some example output for doraville.

However, I'd have to hit that endpoint once for each station we care about, and then repeat that every 10s or so whenever someone is looking at the website. Not ideal... but I could start doing it every 10s for the terminating stations, at least...

P.S. after examining the undocumented endpoint's (scheduled) arrivals and my GTFS data, it looks like I'm doing what marta is doing... I'd like to monitor it a bit more, and see if they deviate from GTFS (say if there are delays), but it looks like this work is going to go out if I'm doing the exact same thing as marta

- mixing in schedule data alongside realtime data, whenever a station
  only has realtime predictions for a single direction (hopefully only
  near-terminating stations)
- this should sidestep the cases close to after-hours when there are
  many stations with no predictions
- marked as `scheduled: true` to differentiate them
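A minimal sketch of the mixing step this commit message describes, under assumed data shapes: each arrival is an object with `station`, `direction`, and `seconds` fields, and `mixInSchedule` is a hypothetical name. The actual code in this branch may be structured differently.

```javascript
// realtime:  arrivals for one station from the realtime API.
// scheduled: arrivals for the same station derived from the schedule files.
// Scheduled arrivals are only added for directions with no realtime data,
// and are flagged so the UI can render them differently.
function mixInSchedule(realtime, scheduled) {
  var covered = {};
  realtime.forEach(function (a) { covered[a.direction] = true; });

  var fill = scheduled
    .filter(function (a) { return !covered[a.direction]; })
    .map(function (a) {
      return {
        station: a.station,
        direction: a.direction,
        seconds: a.seconds,
        scheduled: true
      };
    });

  return realtime.concat(fill);
}

// Example: a station with only a northbound realtime prediction.
var mixed = mixInSchedule(
  [{ station: 'doraville', direction: 'N', seconds: 120 }],
  [{ station: 'doraville', direction: 'N', seconds: 300 },
   { station: 'doraville', direction: 'S', seconds: 360 }]
);
```

In the example only the southbound scheduled arrival is kept, because the northbound direction already has a realtime prediction.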
@jakswa
Owner Author

jakswa commented Feb 26, 2015

I'll deploy this branch to marta.io when I get a free moment later today.

- also solve case where now = 11:55 and arrival = past midnight
  - first time wrapping my head around timestamps that go
    past 24:00:00 ಠ_ಠ
- task time jumped from ~0.8s to 5+ seconds. still doable.
  - due to parsing every line of the CSV now, instead of only looking at
    the start of the vast majority of lines
  - could add stream listeners on both, and send only valid lines to the
    CSV parser... maybe later
- verified this by running the task again and seeing if my JSON changed
  - also deleted my json and made sure the task added it back without
    anything showing up in `git diff`
- might need a changelog if these features keep up
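The past-midnight case mentioned in this commit message can be sketched as follows. GTFS allows times such as `24:05:00` or `25:10:00` for trips that run past midnight of their service day. The function names here are hypothetical, and the wrap-around arithmetic is one possible approach, not necessarily what this commit does.

```javascript
var SECONDS_PER_DAY = 24 * 3600;

// Convert a GTFS "HH:MM:SS" string to seconds. HH may be 24 or more.
function gtfsTimeToSeconds(time) {
  var parts = time.split(':').map(Number);
  return parts[0] * 3600 + parts[1] * 60 + parts[2];
}

// nowSeconds: seconds since local midnight (0 to 86399).
// Returns the wait until the arrival, wrapped into a single day so that
// 23:55 -> "24:05:00" and 23:55 -> "00:05:00" both give ten minutes.
function secondsUntil(arrivalTime, nowSeconds) {
  var diff = gtfsTimeToSeconds(arrivalTime) - nowSeconds;
  return ((diff % SECONDS_PER_DAY) + SECONDS_PER_DAY) % SECONDS_PER_DAY;
}

// now = 23:55, arrival just past midnight.
var wait = secondsUntil('24:05:00', 23 * 3600 + 55 * 60);
```

One caveat of this wrap-around: an arrival that passed a minute ago comes back as almost 24 hours away, so a caller would likely want to discard very large values. It also does not check which service day the arrival belongs to.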
@jakswa
Owner Author

jakswa commented Feb 28, 2015

This has been deployed for a couple days now and is working pretty well. I've cleaned up some things, fixed some edge cases, added csv-parse, and updated the README. Clickin' the button!

[image]

jakswa added a commit that referenced this pull request Feb 28, 2015
adding my own shitty gtfs csv parser + schedule files
@jakswa jakswa merged commit 27cd8fd into master Feb 28, 2015
@jakswa jakswa deleted the gtfs_fml branch February 28, 2015 07:25