Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MTA buses config #97

Merged
merged 1 commit into from
Apr 2, 2023
Merged

Conversation

cedarbaum
Copy link
Collaborator

About

This change adds static and realtime trip data for MTA buses.

Issue with trip loops

In my testing so far, this appears to work mostly out of the box except for one error that sometime occurs for routes:

service map cannot be topologically sorted: the provided graph is not topologically sortable because it contains a cycle

I looked into this a bit, and it seems there are loops being created in some GTFS trips. For example, on the M96 route, I observed this:

image

My guess would be this is a bug with the feed, but it's hard to say for sure without a repro outside of transiter. I did try writing a small Python script to poll the feed and look for loops within trips, but I haven't yet seen the same behavior there (this could just be an issue of timing though).

It does seem transiter is resilient to this sort of error, though I am curious if you think it's something we need to understand better before using this config. Also, I am curious if you have any thoughts about where the bug most likely is and what could be done to further investigate.

Other notes:

  1. The GTFS realtime feeds appear to work even without an API key specified in the URL. The official documentation still says to use one, so it is probably still best to follow that.
  2. The documented GTFS realtime feed URLs use HTTP for transport, but HTTPS does work so I've used that instead.

As always, I appreciate any time you're able to spend looking over this and please let me know if you have any feedback when you have a chance.

@cedarbaum
Copy link
Collaborator Author

Quick follow-up - I believe I now understand the trip loop issue. Apparently trip ids are not unique for MTA bus trips. This causes transiter to try and reconcile the stop sequences of multiple trips running on the same route, which creates the cycles.

I believe creating an id from a combination of TripId and VehicleId will be unique, so this could be done as an extension in the gtfs library. I'll look into this next.

@jamespfennell
Copy link
Owner

jamespfennell commented Mar 23, 2023 via email

@cedarbaum
Copy link
Collaborator Author

Thanks for the initial feedback, and it would be great to have you try it on your end when you are back and have time! In the meantime, I've done some more experimentation and have some additional notes below:

  1. As mentioned previously, TripIds are not unique. I believe this can be fixed (at least in the majority of cases) with this PR: Add extension for MTA buses gtfs#2
  2. In addition to non-unique TripIds, stop sequences also don't appear to be stable across updates (though I want to double-check my work here, since this seems very strange, even for GTFS 🙂). For example, consider the following 2 updates for a trip:

Update 1

[{'stopSequence': 0, 'arrival': {'time': '1679544559'}, 'departure': {'time': '1679544559'}, 'stopId': '305733'},
 {'stopSequence': 1, 'arrival': {'time': '1679544574'}, 'departure': {'time': '1679544574'}, 'stopId': '305734'},
 {'stopSequence': 2, 'arrival': {'time': '1679544607'}, 'departure': {'time': '1679544607'}, 'stopId': '305735'}]

Update 2

[{'stopSequence': 2, 'arrival': {'time': '1679544303'}, 'departure': {'time': '1679544303'}, 'stopId': '305734'},
 {'stopSequence': 3, 'arrival': {'time': '1679544336'}, 'departure': {'time': '1679544336'}, 'stopId': '305735'},
 {'stopSequence': 4, 'arrival': {'time': '1679544376'}, 'departure': {'time': '1679544376'}, 'stopId': '305736'}],

This causes some issues when transiter tries to merge the trip timeliness across updates, most noticeably the loop issue mentioned previously.

I was able to fix this by simply deleting all previous stop times on trip updates (i.e. MTA bus trips will only return current and futures stops now). Something more clever could probably be done to merge the timelines.

@jamespfennell
Copy link
Owner

I just tried this out and I hit the GTFS parsing error you fixed in jamespfennell/gtfs#1! I also noticed that when this parsing error is hit the client command (go run . install ....) actually hangs which is not great, so I fixed that in ffab925.

Also trying to fix the CI failure right now, which is unrelated to your change.

@jamespfennell
Copy link
Owner

Wow this system has over 12 thousand stops!

@jamespfennell
Copy link
Owner

Using your fix in the GTFS package I was able to play around with this. It looks awesome! I sanity checked that the data from one of my local bus stops matches the MTA website and it does!

To merge this in, I think we're blocked on the GTFS Go package PR you created? And also, if we want to use that, I think we'll need to change the Transiter Go code to wire in that extension as a parser option? Just like how the subway does it. Unfortunately this is a big tedious...


Regarding service maps, I just realized that they're more complex than the NYC subway in a way that may need changes to the service maps feature itself (not in this PR, to be clear). For the subway, a northbound trip calls at the exact same stations as a southbound trip, just in the reverse order. For the buses this is not true at all. For example, my local bus the B57 goes down Court St on the way south and up Smith St on the way north. The Transiter generated service map isn't useful because it's basically all the northbound stops followed separately by all the southbound stops.

I wonder if Transiter should support creating northbound-only and southbound-only service maps? This could be specified in the service maps config. So for the buses there wouldn't just be a "realtime" service map; there would be a "realtime_northbound" map and a "realtime_southbound" map.

@jamespfennell
Copy link
Owner

Hmmm just as an FYI, I'm getting lots of "service map cannot be topologically sorted" errors but no duplicate trip IDs. So perhaps there is another reason these maps can't be sorted, in addition to the trip ID duplication issue? I'm going to look further.

@cedarbaum
Copy link
Collaborator Author

cedarbaum commented Mar 28, 2023

Hey @jamespfennell - did you see my theory about unstable stop sequence numbers in my last comment? It may explain why you're still seeing that error. When I completely replace all stop sequences with each new update (i.e. make no attempt to preserve trip history), I no longer get the error. I also validated this theory with a python script that compares stop sequences between updates for the same trip id. Let me know if that makes sense based on what you're seeing on your end.

EDIT: For reference, this is the hack I use to clear previous trip data in updateStopTimesInDB to test this theory:

	var stopTimePksToDelete []int64
	for _, stopTime := range stopSequenceToStopTimePk {
		stopTimePksToDelete = append(stopTimePksToDelete, stopTime.Pk)
	}
	if err := updateCtx.Querier.DeleteTripStopTimes(ctx, stopTimePksToDelete); err != nil {
		return false, err
	}
	stopSequenceToStopTimePk = map[int32]db.ListTripStopTimesForUpdateRow{}

@jamespfennell
Copy link
Owner

Ah yes, sorry I did read that comment last week but had forgotten it. Do you have any opinion on the best way to fix this? Should Transiter ignore the stop sequences numbers and generate its own? This is what it does for the subway.

Though I do wonder - is the MTA using stop sequences because some trips call at the same stop twice? In this case the GTFS realtime spec does require stop sequences be provided to disambiguate the stop times for the same stop.

@jamespfennell
Copy link
Owner

By the way, I'm happy to merge in any version of this work that you're happy with. Things like improving the stop sequences situation can always be done in subsequent PRs (or not at all!). From my perspective, even just the new config file (along with the transfers.txt fix) is a great addition. Totally up to you.

@jamespfennell
Copy link
Owner

jamespfennell commented Mar 28, 2023

I think there is a quick fix for the service maps issue alone. When calculating the service map, we could skip all stops times that are in the past. It's a one line fix to this one SQL query - it just needs an additional WHERE trip_stop_time.past IS FALSE clause. I should probably do this anyway it seems? Like the realtime service maps should show only current/future service

This wouldn't fix the trip data itself, and as in your comment above, we would still see duplicate past stops.

jamespfennell added a commit that referenced this pull request Mar 28, 2023
Realtime service maps should only use future stop times. Using past stop
times can result in inaccurate service maps when (for example) a route is
finishing for the night. After the last train passes a station, the station
should disappear from the map because there is no longer service there.
With the current code, the station remains in the map until the trip is
finished entirely.

Previously this was the behavior of Transiter. However in the large refactor
in be23046
I introduced this bug.

Discovered by @cedarbaum in #97
@cedarbaum
Copy link
Collaborator Author

cedarbaum commented Mar 29, 2023

Thanks for the service maps fix! I agree that this can be merged safely with just that (after bumping gtfs version and also addressing other comments). However, I also think there is some value in one or both of the below configuration options (which could be added to the YAML definitions):

1.) removeTripHistory: this option, when true, would make no attempt to keep past trips. This would probably be fine for many use cases (e.g., arrival boards).
2.) reassignStopSequences: this option (based on your suggestion) would ignore the stop sequences from GTFS and assign sequentially (assuming no duplicate stop ids). I think this would be OK in most cases but, as you mentioned above, it would fail for routes that do indeed revisit stops.

So going forward, I can cleanup this PR and potentially also work on (1) or (2) if you think this an OK approach. Let me know your thoughts!

@jamespfennell
Copy link
Owner

Awesome, that sounds great!

I personally think (2) is the way to go because I think it will end up making the MTA buses work correctly with trip history. The implementation shouldn't be too bad either? After adding the reassignStopSequences field to the feed config, I think it would just be a code change in realtime.go to clear the stop sequence numbers if the new config field is true. This clearing just needs to happen before the stop sequences are automatically assigned by Transiter.

If you wanted to implement that that would be awesome! Could do it in this PR, or a follow up PR.


By the way, I think the downside of the new configuration option only really applies if you're trying to link the realtime data with the static data. My understanding is that the stop sequences give you a way of linking a stop time prediction in the realtime data to the associated scheduled time in GTFS static. For the MTA buses this realtime-static linking will presumably always be broken anyway because of the bug you discovered. But it's definitely a limitation of the new option that we should document. (To be clear Transiter currently doesn't attempt to do this realtime-static linking, but it's something I want to do in the future because it's needed to get some transit systems working, like the SF BART.)


Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.

@cedarbaum
Copy link
Collaborator Author

Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.

Interesting! I was sort of thinking about this a bit when I was looking at that code, but figured Postgres would do some magic to optimize these operations within the same transaction. Do you think a multi-row UPSERT would help? Or, in the case of my hack to forget trip history, a full delete + multi-row INSERT.

@jamespfennell
Copy link
Owner

Yeah I was thinking a multi-row upsert (or even just update because inserts are relatively rare) would befaster. The main motivation would be to avoid the overhead of making a SQL call to Postgres for each stop time update update. Not sure how hard it is though! I think Postgres supports it, but not sure about sqlc.

@jamespfennell
Copy link
Owner

I created a new issue for the optimizing the feed updater, just so that I don't derail your PR :) #100

@cedarbaum cedarbaum force-pushed the us-ny-buses branch 2 times, most recently from 04b6c67 to 7ad7276 Compare April 1, 2023 18:54
@jamespfennell
Copy link
Owner

Awesome, thank you!

@jamespfennell jamespfennell merged commit b228cde into jamespfennell:master Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants