Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: filter to fill in stop times from schedules #243

Merged
merged 6 commits into from Jul 20, 2022

Conversation

digitalcora
Copy link
Contributor

Asana: 🧠 Provide our own predictions for some CR trains

Best reviewed per-commit, as there's a noisy refactor beforehand.

⚠️ Testing this in a deployed environment is currently blocked on an issue where the CRB dev environment has not actually updated its feed since 5/25 (Asana). Since this means all of its updates are very outdated, they get filtered out, and we can't see the effect of the new filter.

Configurability is in a separate WIP commit because I was not sure whether this was the best way to go about it. Hard-coding MBTA-specific status values into the filter didn't seem ideal, but there are some open questions in my mind:

  • Is it fine to hard-code status as the field that triggers this behavior, or do we want the key to be configurable as well as the values?
  • Is the slight metaprogramming/testing awkwardness worth it to allow the values to be known at compile time? Perhaps it would also be fine to use a runtime MapSet membership test.

Move `Concentrate.Filter.GTFS.*` to `Concentrate.GTFS.*`.

We are introducing a `Concentrate.GroupFilter` that will use `GTFS`
data, so organizing this part of the app under the `Filter` namespace
doesn't make as much sense as it did before.
Copy link
Member

@paulswartz paulswartz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is it fine to hard-code status as the field that triggers this behavior, or do we want the key to be configurable as well as the values?

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

  • Is the slight metaprogramming/testing awkwardness worth it to allow the values to be known at compile time? Perhaps it would also be fine to use a runtime MapSet membership test.

I think a runtime test would be better, so that these can be configured at runtime with the JSON file. Otherwise, we'd need to do a code update if the status values change, instead of only a configuration change.

lib/concentrate/gtfs/stop_times.ex Show resolved Hide resolved
lib/concentrate/group_filter/scheduled_stop_times.ex Outdated Show resolved Hide resolved
@digitalcora
Copy link
Contributor Author

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

I meant only fields that are already part of StopTimeUpdate; for example someone could have this behavior trigger on specific values of uncertainty or schedule_relationship. (I don't know why they would do that, but they could)

I think a runtime test would be better, so that these can be configured at runtime with the JSON file.

Any suggestions for what this should look like, structurally? Currently we don't seem to have anything like a "filter configuration" in the runtime config. Should there be a new top-level key?

@mbta mbta deleted a comment from github-actions bot Jun 23, 2022
Copy link
Member

@paulswartz paulswartz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

I meant only fields that are already part of StopTimeUpdate; for example someone could have this behavior trigger on specific values of uncertainty or schedule_relationship. (I don't know why they would do that, but they could)

Hmm. uncertainty might be interesting. I know Swiftly uses specific values in their output: https://swiftly-inc.stoplight.io/docs/realtime-standalone/b3A6Mjg0MzYyMTk-gtfs-rt-trip-updates

However, it's probably YAGNI, so I wouldn't worry about it.

I think a runtime test would be better, so that these can be configured at runtime with the JSON file.

Any suggestions for what this should look like, structurally? Currently we don't seem to have anything like a "filter configuration" in the runtime config. Should there be a new top-level key?

On second thought: I think making the filter configuration run-time configurable in general would be nice, but I don't think it needs to be part of this PR. Can you create a new Asana task for that (if one doesn't already exist)?

New question: were you able to configure Concentrate locally and see the correct behavior in its output?

config/config.exs Outdated Show resolved Hide resolved
lib/concentrate/group_filter/scheduled_stop_times.ex Outdated Show resolved Hide resolved
defmodule Concentrate.GroupFilter.ScheduledStopTimes do
@moduledoc """
Uses the static GTFS schedule to fill in missing arrival/departure times on stop time updates
that have specific `status` values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: should this note that it needs to come before RemoveUnneededTimes in the filter list, or not include arrival/departure times in cases when we'd drop them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it needs to, strictly speaking. It does need to if you want RemoveUnneededTimes to remove times it generates, but I think that's a logical consequence of its position in the filter list. If it's generally understood that filters are run in the order they're configured, I think we don't have to specifically call it out here. (Maybe we do want to call that out somewhere in the general documentation?)

@mbta mbta deleted a comment from github-actions bot Jun 27, 2022
@digitalcora
Copy link
Contributor Author

digitalcora commented Jun 27, 2022

New question: were you able to configure Concentrate locally and see the correct behavior in its output?

I've just done this, and the good news is that the filter works as intended. The bad news is that StopTimes takes 4 minutes (at least on my machine) to populate the ETS table, and during most of that time the filter is inoperative since StopTimes deletes all records before starting to insert the replacements. The approach here was mostly copied from PickupDropOff, which I notice also has this problem, taking ~1.5 minutes to populate even when StopTimes is disabled. However, looking at StopIDs, I can at least improve the approach so the inserts are done all at once (and maybe apply that same approach to PickupDropOff as a refactor).

This still leaves a few-second gap between the delete and the insert, so unfortunately an incorrect feed might be produced during that time. I wonder if it would be worth a refactor to either use an "upsert" approach, or use Mnesia and perform the update in a transaction as api does. Maybe not as part of this task, since the problem is not new to this task.

Building the records to be inserted into ETS can take a while (over a
minute), and since all records are deleted from the table before this
work begins, the module will return `:unknown` for all lookups during
this time. This issue would occur once per hour, as per the static GTFS
refresh interval, and would result in `RemoveUnneededTimes` temporarily
not working.

This refactor applies the same approach used in `StopIDs`: build all
records up-front, then clear the table and immediately insert all the
records in one batch. This still takes a second or two, but the time
gap is much shorter, reducing the chance that an incorrect feed would
be produced and the time until a corrected feed would be produced.

Eliminating the issue entirely will require a more complex approach,
such as writing some "upsert" logic which also deletes records that
shouldn't be in the table anymore, or using Mnesia for its transaction
support.
@digitalcora digitalcora force-pushed the cfg-predictions-from-schedules branch from 1036a6b to c3ca297 Compare June 27, 2022 18:19
@mbta mbta deleted a comment from github-actions bot Jun 27, 2022
@digitalcora digitalcora temporarily deployed to dev-green June 27, 2022 19:07 Inactive
@digitalcora
Copy link
Contributor Author

I've incorporated the above-mentioned changes. The unfortunate thing about this approach is that it uses a lot of memory while ingesting GTFS: ~3.3GB on this branch compared to ~2.2GB on master (there's also a ~400MB increase in memory used once everything has settled down, to ~1.7GB on this branch).

It still works on our 4GB ECS instances, though it takes 9 minutes to populate the tables... and I wonder what might happen when two HASTUS ratings are in the feed at once. Considering StopTimes, PickupDropOff, and StopIDs are each creating their own table based on stop_times.txt and have basically the same arguments to their lookups, maybe we should combine them for some memory savings?

@paulswartz
Copy link
Member

It looks like PickupDropOff parses the CSV in parallel: does using that speed up the ingest for Schedule?

I could definitely see combining the data with PickupDropOff: Trips does something similar where it supports two different queries with a single table.

@digitalcora
Copy link
Contributor Author

It looks like PickupDropOff parses the CSV in parallel: does using that speed up the ingest for Schedule?

If you mean for the new StopTimes here, it does already do that (see L102).

@paulswartz
Copy link
Member

Ah, I was looking at L91 which doesn't.

I don't suppose you tried using eflame or similar to see where the time is going?

@digitalcora
Copy link
Contributor Author

I don't suppose you tried using eflame or similar to see where the time is going?

I haven't yet done so, since I thought the memory usage would be more of an issue than the time usage, though I guess the time is also an issue for the initial warm-up of the app (given it's single-instance). I think what I'll do is try a quick experiment with combining all the stop-times servers, and see what that does for both metrics. If the warm-up time is no worse than it is currently, I'd be comfortable declaring the issue out-of-scope for this task.

@digitalcora
Copy link
Contributor Author

Good news on the experiment: Memory usage is ~2.0GB at max, ~880MB steady state, and the update takes ~1.5 minutes, the same time as PickupDropOff previously took alone. I'll clean up this approach and add it as a refactor commit, then we can think about further optimizations later since the performance is not made any worse by this PR.

Having three different servers loading `stop_times.txt` and maintaining
their own subset of it was resulting in very high CPU and memory usage.
Combining these, similar to `GTFS.Trips`, saves on system resources and
significantly reduces the "warm-up time" in which filters are not fully
functional.

Approximate numbers on a developer machine:

Metric       | pre `StopTimes` | post `StopTimes` | consolidated
------------ | --------------- | ---------------- | ------------
Update Time  | 1.5 min         | 4 min            | 1.5 min
Peak Memory  | 2.2GB           | 3.3GB            | 2.0GB
Steady-State | 1.3GB           | 1.7GB            | 880MB

The first column is from when there were only two `stop_times.txt`
servers, `PickupDropOff` and `StopIDs`. We can see that consolidation
is also an improvement over these numbers.

There are some incidental changes that should not have any effect on the
behavior of the app:

* `RemoveUnneededTimes` no longer works with stop time updates that are
  missing a `stop_sequence`. This should not have happened anyway since
  ac72186, which formalized the requirement that all stop time updates
  should have a stop sequence for merging to work correctly.

* `RemoveUnneededTimes` and `IncludeStopID` no longer work with GTFS
  feeds that have _only_ a `stop_times.txt`. Like `ScheduledStopTimes`,
  they now require the full chain of `stop_times.txt`, `trips.txt`,
  `routes.txt`, and `agencies.txt` to be present and valid, though the
  files are unused in these filters. Both the old and new behaviors are
  accidents of the implementation details, and the new behaviour could
  be changed if we wanted to explicitly support invalid "partial" GTFS
  feeds, but currently we have no need for this.
@mbta mbta deleted a comment from github-actions bot Jun 29, 2022
@github-actions
Copy link

Coverage of commit edbc809

Summary coverage rate:
  lines......: 95.0% (1024 of 1078 lines)
  functions..: 77.8% (339 of 436 functions)
  branches...: no data found

Files changed coverage rate:
                                                            |Lines       |Functions  |Branches    
  Filename                                                  |Rate     Num|Rate    Num|Rate     Num
  ================================================================================================
  lib/concentrate/filter/include_route_direction.ex         | 100%     10| 100%     3|    -      0
  lib/concentrate/filter/include_stop_id.ex                 |60.0%      5|66.7%     3|    -      0
  lib/concentrate/group_filter/remove_unneeded_times.ex     |91.7%     36|87.5%     8|    -      0
  lib/concentrate/group_filter/scheduled_stop_times.ex      |91.7%     12|75.0%     4|    -      0
  lib/concentrate/group_filter/vehicle_stop_match.ex        | 100%     13| 100%     3|    -      0
  lib/concentrate/gtfs/helpers.ex                           | 100%      5| 100%     1|    -      0
  lib/concentrate/gtfs/stop_times.ex                        | 100%     34|86.7%    15|    -      0
  lib/concentrate/gtfs/stops.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/supervisor.ex                        |33.3%      3|50.0%     2|    -      0
  lib/concentrate/gtfs/trips.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/unzip.ex                             | 100%      6| 100%     2|    -      0
  lib/concentrate/merge_filter.ex                           |91.0%     67|71.4%    14|    -      0
  lib/concentrate/supervisor.ex                             | 100%      8| 100%     7|    -      0

Download coverage report

@digitalcora
Copy link
Contributor Author

Refactor done! There are some incidental changes here that should not have any effect on the behavior of the app for our purposes (I also documented them in the commit message):

  • RemoveUnneededTimes no longer works with stop time updates that are missing a stop_sequence. This should not have happened anyway since ac72186, which formalized the requirement that all stop time updates should have a stop sequence for merging to work correctly.

  • RemoveUnneededTimes and IncludeStopID no longer work with GTFS feeds that have only a stop_times.txt. Like ScheduledStopTimes, they now require the full chain of stop_times.txt, trips.txt, routes.txt, and agencies.txt to be present and valid, though the files are unused in these filters. Both the old and new behaviors are accidents of the implementation details, and the new behaviour could be changed if we wanted to explicitly support invalid "partial" GTFS feeds, but as far as I know we have no need for this.

Copy link
Member

@paulswartz paulswartz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good!

I'm a little wary of deploying this without being able to test it on dev-green, but given that you tested it locally and saw the correct behavior, testing it on dev seems fine.

@digitalcora
Copy link
Contributor Author

Unfortunately, Concentrate dev is also currently hooked up to the (broken) CRB dev feed, so without changing that, the only way to test this branch in a deployed environment would be... to deploy it to production 😓 Because of that, I'm inclined to consider this blocked on fixing the CRB dev feed, unless we (and by "we" I mean probably you or @losvedir) decide that Concentrate dev should use prod data sources.

@losvedir losvedir temporarily deployed to dev-green July 7, 2022 14:43 Inactive
@losvedir losvedir temporarily deployed to dev-green July 19, 2022 18:16 Inactive
@github-actions
Copy link

Coverage of commit 3b1f93e

Summary coverage rate:
  lines......: 95.0% (1024 of 1078 lines)
  functions..: 77.8% (339 of 436 functions)
  branches...: no data found

Files changed coverage rate:
                                                            |Lines       |Functions  |Branches    
  Filename                                                  |Rate     Num|Rate    Num|Rate     Num
  ================================================================================================
  lib/concentrate/filter/include_route_direction.ex         | 100%     10| 100%     3|    -      0
  lib/concentrate/filter/include_stop_id.ex                 |60.0%      5|66.7%     3|    -      0
  lib/concentrate/group_filter/remove_unneeded_times.ex     |91.7%     36|87.5%     8|    -      0
  lib/concentrate/group_filter/scheduled_stop_times.ex      |91.7%     12|75.0%     4|    -      0
  lib/concentrate/group_filter/vehicle_stop_match.ex        | 100%     13| 100%     3|    -      0
  lib/concentrate/gtfs/helpers.ex                           | 100%      5| 100%     1|    -      0
  lib/concentrate/gtfs/stop_times.ex                        | 100%     34|86.7%    15|    -      0
  lib/concentrate/gtfs/stops.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/supervisor.ex                        |33.3%      3|50.0%     2|    -      0
  lib/concentrate/gtfs/trips.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/unzip.ex                             | 100%      6| 100%     2|    -      0
  lib/concentrate/merge_filter.ex                           |91.0%     67|71.4%    14|    -      0
  lib/concentrate/supervisor.ex                             | 100%      8| 100%     7|    -      0

Download coverage report

@losvedir losvedir temporarily deployed to dev-green July 20, 2022 20:46 Inactive
@losvedir
Copy link
Contributor

I pushed up a small commit that resolved an issue I think from a last minute refactor where StopTimes.get no longer exists (now it's StopTimes.arrival_departure). After that, it appears to work!

Here's dotcom-dev-green with all its boarding glory:

Screen Shot 2022-07-20 at 3 53 23 PM

Compared to prod which is all "Scheduled":

Screen Shot 2022-07-20 at 3 53 36 PM

The metrics in ECS seem fine, so I think this branch is a go!

Copy link
Contributor

@losvedir losvedir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@losvedir losvedir merged commit 6ef6552 into master Jul 20, 2022
@losvedir losvedir deleted the cfg-predictions-from-schedules branch July 21, 2022 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants