feat: filter to fill in stop times from schedules #243
Conversation
Move `Concentrate.Filter.GTFS.*` to `Concentrate.GTFS.*`. We are introducing a `Concentrate.GroupFilter` that will use `GTFS` data, so organizing this part of the app under the `Filter` namespace doesn't make as much sense as it did before.
- Is it fine to hard-code `status` as the field that triggers this behavior, or do we want the key to be configurable as well as the values?
Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.
- Is the slight metaprogramming/testing awkwardness worth it to allow the values to be known at compile time? Perhaps it would also be fine to use a runtime MapSet membership test.
I think a runtime test would be better, so that these can be configured at runtime with the JSON file. Otherwise, we'd need to do a code update if the status values change, instead of only a configuration change.
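As a rough sketch, a runtime membership test could look like the following. The config key name and the status values shown are placeholders I've invented for illustration, not Concentrate's actual configuration:

```elixir
defmodule ScheduledStopTimesSketch do
  # Placeholder values; in the real app these would be loaded at runtime
  # from the JSON configuration file rather than compiled in.
  @default_statuses ["PLACEHOLDER_STATUS_A", "PLACEHOLDER_STATUS_B"]

  # Returns true when the update's status is one of the configured trigger values.
  def trigger?(status, configured_statuses \\ @default_statuses) do
    configured_statuses
    |> MapSet.new()
    |> MapSet.member?(status)
  end
end
```

Building the `MapSet` once when the configuration is loaded (rather than per call, as in this sketch) would keep the per-update cost to a single hash lookup.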
> Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.
I meant only fields that are already part of `StopTimeUpdate`; for example, someone could have this behavior trigger on specific values of `uncertainty` or `schedule_relationship`. (I don't know why they would do that, but they could.)
Hmm, `uncertainty` might be interesting. I know Swiftly uses specific values in their output: https://swiftly-inc.stoplight.io/docs/realtime-standalone/b3A6Mjg0MzYyMTk-gtfs-rt-trip-updates
However, it's probably YAGNI, so I wouldn't worry about it.
> I think a runtime test would be better, so that these can be configured at runtime with the JSON file.
Any suggestions for what this should look like, structurally? Currently we don't seem to have anything like a "filter configuration" in the runtime config. Should there be a new top-level key?
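One possible shape, purely as a sketch: a new top-level key holding per-filter options. The `group_filters` key and the option names here are invented for illustration, not existing Concentrate configuration:

```json
{
  "group_filters": {
    "Concentrate.GroupFilter.ScheduledStopTimes": {
      "trigger_statuses": ["PLACEHOLDER_STATUS_A", "PLACEHOLDER_STATUS_B"]
    }
  }
}
```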
On second thought: I think making the filter configuration run-time configurable in general would be nice, but I don't think it needs to be part of this PR. Can you create a new Asana task for that (if one doesn't already exist)?
New question: were you able to configure Concentrate locally and see the correct behavior in its output?
```elixir
defmodule Concentrate.GroupFilter.ScheduledStopTimes do
  @moduledoc """
  Uses the static GTFS schedule to fill in missing arrival/departure times on stop time updates
  that have specific `status` values.
  """
```
question: should this note that it needs to come before `RemoveUnneededTimes` in the filter list, or not include arrival/departure times in cases when we'd drop them?
I don't think it needs to, strictly speaking. It does need to if you want `RemoveUnneededTimes` to remove times it generates, but I think that's a logical consequence of its position in the filter list. If it's generally understood that filters are run in the order they're configured, I think we don't have to specifically call it out here. (Maybe we do want to call that out somewhere in the general documentation?)
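To illustrate the ordering point, here is a toy sketch. The anonymous functions are stand-ins I've invented, not Concentrate's actual `GroupFilter` implementations:

```elixir
# Toy illustration only: filters run in configured order, so a time filled in
# by an earlier filter is visible to every later filter.
fill_times = fn updates ->
  Enum.map(updates, fn u ->
    if u[:arrival_time], do: u, else: %{u | arrival_time: 0}
  end)
end

drop_timeless = fn updates -> Enum.reject(updates, &is_nil(&1[:arrival_time])) end

updates = [%{arrival_time: nil}, %{arrival_time: 100}]

# With fill_times first, nothing is dropped; with the order reversed,
# the first update would be discarded before its time could be filled in.
Enum.reduce([fill_times, drop_timeless], updates, fn filter, acc -> filter.(acc) end)
```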
I've just done this, and the good news is that the filter works as intended. The bad news is that this still leaves a few-second gap between the delete and the insert, so unfortunately an incorrect feed might be produced during that time. I wonder if it would be worth a refactor to either use an "upsert" approach, or use Mnesia and perform the update in a transaction.
Building the records to be inserted into ETS can take a while (over a minute), and since all records are deleted from the table before this work begins, the module will return `:unknown` for all lookups during this time. This issue would occur once per hour, as per the static GTFS refresh interval, and would result in `RemoveUnneededTimes` temporarily not working. This refactor applies the same approach used in `StopIDs`: build all records up-front, then clear the table and immediately insert all the records in one batch. This still takes a second or two, but the time gap is much shorter, reducing the chance that an incorrect feed would be produced and the time until a corrected feed would be produced. Eliminating the issue entirely will require a more complex approach, such as writing some "upsert" logic which also deletes records that shouldn't be in the table anymore, or using Mnesia for its transaction support.
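A minimal sketch of that build-then-swap pattern on a bare ETS table. The table name and record shape are invented for illustration; Concentrate's real tables differ:

```elixir
# Hypothetical illustration of the build-then-swap pattern described above.
table = :ets.new(:stop_times_sketch, [:set, :public])
:ets.insert(table, [{{"old_trip", 1}, :old_record}])

# The slow part: build all new records up-front, while the old data
# continues to serve lookups.
new_records = for seq <- 1..3, do: {{"trip", seq}, :record}

# The swap itself is quick, so the window in which lookups miss is short,
# though not zero -- hence the upsert/Mnesia ideas for eliminating it entirely.
:ets.delete_all_objects(table)
:ets.insert(table, new_records)
```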
Force-pushed from 1036a6b to c3ca297.
I've incorporated the above-mentioned changes. The unfortunate thing about this approach is that it uses a lot of memory while ingesting GTFS: ~3.3GB on this branch compared to ~2.2GB before. It still works on our 4GB ECS instances, though it takes 9 minutes to populate the tables... and I wonder what might happen when two HASTUS ratings are in the feed at once.
It looks like I could definitely see combining the data with
If you mean for the new
Ah, I was looking at L91, which doesn't. I don't suppose you tried using eflame or similar to see where the time is going?
I haven't yet done so, since I thought the memory usage would be more of an issue than the time usage, though I guess the time is also an issue for the initial warm-up of the app (given it's single-instance). I think what I'll do is try a quick experiment with combining all the stop-times servers, and see what that does for both metrics. If the warm-up time is no worse than it is currently, I'd be comfortable declaring the issue out-of-scope for this task.
Good news on the experiment: memory usage is ~2.0GB at max, ~880MB steady state, and the update takes ~1.5 minutes, the same as the pre-`StopTimes` baseline.
Having three different servers loading `stop_times.txt` and maintaining their own subset of it was resulting in very high CPU and memory usage. Combining these, similar to `GTFS.Trips`, saves on system resources and significantly reduces the "warm-up time" in which filters are not fully functional. Approximate numbers on a developer machine:

Metric | pre `StopTimes` | post `StopTimes` | consolidated
------------ | --------------- | ---------------- | ------------
Update Time | 1.5 min | 4 min | 1.5 min
Peak Memory | 2.2GB | 3.3GB | 2.0GB
Steady-State | 1.3GB | 1.7GB | 880MB

The first column is from when there were only two `stop_times.txt` servers, `PickupDropOff` and `StopIDs`. We can see that consolidation is also an improvement over these numbers.

There are some incidental changes that should not have any effect on the behavior of the app:

* `RemoveUnneededTimes` no longer works with stop time updates that are missing a `stop_sequence`. This should not have happened anyway since ac72186, which formalized the requirement that all stop time updates should have a stop sequence for merging to work correctly.
* `RemoveUnneededTimes` and `IncludeStopID` no longer work with GTFS feeds that have _only_ a `stop_times.txt`. Like `ScheduledStopTimes`, they now require the full chain of `stop_times.txt`, `trips.txt`, `routes.txt`, and `agencies.txt` to be present and valid, though the files are unused in these filters. Both the old and new behaviors are accidents of the implementation details, and the new behavior could be changed if we wanted to explicitly support invalid "partial" GTFS feeds, but currently we have no need for this.
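A very rough sketch of the consolidated shape: one ETS table derived from `stop_times.txt`, serving the lookups the separate servers used to provide. All names and the record layout here are invented, not Concentrate's actual API:

```elixir
# Invented names and record layout; a sketch of "one table, several lookups".
defmodule StopTimesSketch do
  @table :stop_times_sketch

  def start, do: :ets.new(@table, [:named_table, :set, :public])

  def insert(trip_id, stop_sequence, record),
    do: :ets.insert(@table, {{trip_id, stop_sequence}, record})

  # All former per-server lookups now read from the same table,
  # returning :unknown when nothing has been loaded for the key yet.
  def lookup(trip_id, stop_sequence) do
    case :ets.lookup(@table, {trip_id, stop_sequence}) do
      [{_key, record}] -> record
      [] -> :unknown
    end
  end
end
```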
Refactor done! There are some incidental changes here that should not have any effect on the behavior of the app for our purposes; I documented them in the commit message above.
The code looks good!
I'm a little wary of deploying this without being able to test it on `dev-green`, but given that you tested it locally and saw the correct behavior, testing it on `dev` seems fine.
Unfortunately, Concentrate
I pushed up a small commit that resolved an issue, I think from a last-minute refactor. Here's dotcom-dev-green with all its boarding glory (screenshot), compared to prod, which is all "Scheduled" (screenshot). The metrics in ECS seem fine, so I think this branch is a go!
🎉
Asana: 🧠 Provide our own predictions for some CR trains
Best reviewed per-commit, as there's a noisy refactor beforehand.
The `dev` environment has not actually updated its feed since 5/25 (Asana). Since this means all of its updates are very outdated, they get filtered out, and we can't see the effect of the new filter.

Configurability is in a separate WIP commit because I was not sure whether this was the best way to go about it. Hard-coding MBTA-specific status values into the filter didn't seem ideal, but there are some open questions in my mind:

- Is it fine to hard-code `status` as the field that triggers this behavior, or do we want the key to be configurable as well as the values?