feat: filter to fill in stop times from schedules #243

digitalcora · 2022-06-22T18:45:10Z

Asana: 🧠 Provide our own predictions for some CR trains

Best reviewed per-commit, as there's a noisy refactor beforehand.

⚠️ Testing this in a deployed environment is currently blocked on an issue where the CRB dev environment has not actually updated its feed since 5/25 (Asana). Since this means all of its updates are very outdated, they get filtered out, and we can't see the effect of the new filter.

Configurability is in a separate WIP commit because I was not sure whether this was the best way to go about it. Hard-coding MBTA-specific status values into the filter didn't seem ideal, but there are some open questions in my mind:

Is it fine to hard-code status as the field that triggers this behavior, or do we want the key to be configurable as well as the values?
Is the slight metaprogramming/testing awkwardness worth it to allow the values to be known at compile time? Perhaps it would also be fine to use a runtime MapSet membership test.

Move `Concentrate.Filter.GTFS.*` to `Concentrate.GTFS.*`. We are introducing a `Concentrate.GroupFilter` that will use `GTFS` data, so organizing this part of the app under the `Filter` namespace doesn't make as much sense as it did before.

paulswartz

Is it fine to hard-code status as the field that triggers this behavior, or do we want the key to be configurable as well as the values?

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

Is the slight metaprogramming/testing awkwardness worth it to allow the values to be known at compile time? Perhaps it would also be fine to use a runtime MapSet membership test.

I think a runtime test would be better, so that these can be configured at runtime with the JSON file. Otherwise, we'd need to do a code update if the status values change, instead of only a configuration change.

lib/concentrate/gtfs/stop_times.ex

lib/concentrate/group_filter/scheduled_stop_times.ex

test/concentrate/group_filter/scheduled_stop_times_test.exs

digitalcora · 2022-06-23T17:52:23Z

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

I meant only fields that are already part of StopTimeUpdate; for example someone could have this behavior trigger on specific values of uncertainty or schedule_relationship. (I don't know why they would do that, but they could)

I think a runtime test would be better, so that these can be configured at runtime with the JSON file.

Any suggestions for what this should look like, structurally? Currently we don't seem to have anything like a "filter configuration" in the runtime config. Should there be a new top-level key?

paulswartz

Given that adding additional fields would require updating the GTFS-RT parsers, I don't think it's worth having that be programmable.

I meant only fields that are already part of StopTimeUpdate; for example someone could have this behavior trigger on specific values of uncertainty or schedule_relationship. (I don't know why they would do that, but they could)

Hmm. uncertainty might be interesting. I know Swiftly uses specific values in their output: https://swiftly-inc.stoplight.io/docs/realtime-standalone/b3A6Mjg0MzYyMTk-gtfs-rt-trip-updates

However, it's probably YAGNI, so I wouldn't worry about it.

I think a runtime test would be better, so that these can be configured at runtime with the JSON file.

Any suggestions for what this should look like, structurally? Currently we don't seem to have anything like a "filter configuration" in the runtime config. Should there be a new top-level key?

On second thought: I think making the filter configuration run-time configurable in general would be nice, but I don't think it needs to be part of this PR. Can you create a new Asana task for that (if one doesn't already exist)?

New question: were you able to configure Concentrate locally and see the correct behavior in its output?

config/config.exs

lib/concentrate/group_filter/scheduled_stop_times.ex

paulswartz · 2022-06-23T20:53:28Z

lib/concentrate/group_filter/scheduled_stop_times.ex

+defmodule Concentrate.GroupFilter.ScheduledStopTimes do
+  @moduledoc """
+  Uses the static GTFS schedule to fill in missing arrival/departure times on stop time updates
+  that have specific `status` values.


question: should this note that it needs to come before RemoveUnneededTimes in the filter list, or not include arrival/departure times in cases when we'd drop them?

I don't think it needs to, strictly speaking. It does need to if you want RemoveUnneededTimes to remove times it generates, but I think that's a logical consequence of its position in the filter list. If it's generally understood that filters are run in the order they're configured, I think we don't have to specifically call it out here. (Maybe we do want to call that out somewhere in the general documentation?)

digitalcora · 2022-06-27T16:23:05Z

New question: were you able to configure Concentrate locally and see the correct behavior in its output?

I've just done this, and the good news is that the filter works as intended. The bad news is that StopTimes takes 4 minutes (at least on my machine) to populate the ETS table, and during most of that time the filter is inoperative since StopTimes deletes all records before starting to insert the replacements. The approach here was mostly copied from PickupDropOff, which I notice also has this problem, taking ~1.5 minutes to populate even when StopTimes is disabled. However, looking at StopIDs, I can at least improve the approach so the inserts are done all at once (and maybe apply that same approach to PickupDropOff as a refactor).

This still leaves a few-second gap between the delete and the insert, so unfortunately an incorrect feed might be produced during that time. I wonder if it would be worth a refactor to either use an "upsert" approach, or use Mnesia and perform the update in a transaction as api does. Maybe not as part of this task, since the problem is not new to this task.

Building the records to be inserted into ETS can take a while (over a minute), and since all records are deleted from the table before this work begins, the module will return `:unknown` for all lookups during this time. This issue would occur once per hour, as per the static GTFS refresh interval, and would result in `RemoveUnneededTimes` temporarily not working. This refactor applies the same approach used in `StopIDs`: build all records up-front, then clear the table and immediately insert all the records in one batch. This still takes a second or two, but the time gap is much shorter, reducing the chance that an incorrect feed would be produced and the time until a corrected feed would be produced. Eliminating the issue entirely will require a more complex approach, such as writing some "upsert" logic which also deletes records that shouldn't be in the table anymore, or using Mnesia for its transaction support.

digitalcora · 2022-06-27T19:21:31Z

I've incorporated the above-mentioned changes. The unfortunate thing about this approach is that it uses a lot of memory while ingesting GTFS: ~3.3GB on this branch compared to ~2.2GB on master (there's also a ~400MB increase in memory used once everything has settled down, to ~1.7GB on this branch).

It still works on our 4GB ECS instances, though it takes 9 minutes to populate the tables... and I wonder what might happen when two HASTUS ratings are in the feed at once. Considering StopTimes, PickupDropOff, and StopIDs are each creating their own table based on stop_times.txt and have basically the same arguments to their lookups, maybe we should combine them for some memory savings?

paulswartz · 2022-06-27T21:09:51Z

It looks like PickupDropOff parses the CSV in parallel: does using that speed up the ingest for Schedule?

I could definitely see combining the data with PickupDropOff: Trips does something similar where it supports two different queries with a single table.

digitalcora · 2022-06-27T21:15:20Z

It looks like PickupDropOff parses the CSV in parallel: does using that speed up the ingest for Schedule?

If you mean for the new StopTimes here, it does already do that (see L102).

paulswartz · 2022-06-28T13:34:24Z

Ah, I was looking at L91 which doesn't.

I don't suppose you tried using eflame or similar to see where the time is going?

digitalcora · 2022-06-28T15:00:08Z

I don't suppose you tried using eflame or similar to see where the time is going?

I haven't yet done so, since I thought the memory usage would be more of an issue than the time usage, though I guess the time is also an issue for the initial warm-up of the app (given it's single-instance). I think what I'll do is try a quick experiment with combining all the stop-times servers, and see what that does for both metrics. If the warm-up time is no worse than it is currently, I'd be comfortable declaring the issue out-of-scope for this task.

digitalcora · 2022-06-29T18:07:53Z

Good news on the experiment: Memory usage is ~2.0GB at max, ~880MB steady state, and the update takes ~1.5 minutes, the same time as PickupDropOff previously took alone. I'll clean up this approach and add it as a refactor commit, then we can think about further optimizations later since the performance is not made any worse by this PR.

Having three different servers loading `stop_times.txt` and maintaining their own subset of it was resulting in very high CPU and memory usage. Combining these, similar to `GTFS.Trips`, saves on system resources and significantly reduces the "warm-up time" in which filters are not fully functional. Approximate numbers on a developer machine: Metric | pre `StopTimes` | post `StopTimes` | consolidated ------------ | --------------- | ---------------- | ------------ Update Time | 1.5 min | 4 min | 1.5 min Peak Memory | 2.2GB | 3.3GB | 2.0GB Steady-State | 1.3GB | 1.7GB | 880MB The first column is from when there were only two `stop_times.txt` servers, `PickupDropOff` and `StopIDs`. We can see that consolidation is also an improvement over these numbers. There are some incidental changes that should not have any effect on the behavior of the app: * `RemoveUnneededTimes` no longer works with stop time updates that are missing a `stop_sequence`. This should not have happened anyway since ac72186, which formalized the requirement that all stop time updates should have a stop sequence for merging to work correctly. * `RemoveUnneededTimes` and `IncludeStopID` no longer work with GTFS feeds that have _only_ a `stop_times.txt`. Like `ScheduledStopTimes`, they now require the full chain of `stop_times.txt`, `trips.txt`, `routes.txt`, and `agencies.txt` to be present and valid, though the files are unused in these filters. Both the old and new behaviors are accidents of the implementation details, and the new behaviour could be changed if we wanted to explicitly support invalid "partial" GTFS feeds, but currently we have no need for this.

github-actions · 2022-06-29T19:44:49Z

Coverage of commit `edbc809`

Summary coverage rate:
  lines......: 95.0% (1024 of 1078 lines)
  functions..: 77.8% (339 of 436 functions)
  branches...: no data found

Files changed coverage rate:
                                                            |Lines       |Functions  |Branches    
  Filename                                                  |Rate     Num|Rate    Num|Rate     Num
  ================================================================================================
  lib/concentrate/filter/include_route_direction.ex         | 100%     10| 100%     3|    -      0
  lib/concentrate/filter/include_stop_id.ex                 |60.0%      5|66.7%     3|    -      0
  lib/concentrate/group_filter/remove_unneeded_times.ex     |91.7%     36|87.5%     8|    -      0
  lib/concentrate/group_filter/scheduled_stop_times.ex      |91.7%     12|75.0%     4|    -      0
  lib/concentrate/group_filter/vehicle_stop_match.ex        | 100%     13| 100%     3|    -      0
  lib/concentrate/gtfs/helpers.ex                           | 100%      5| 100%     1|    -      0
  lib/concentrate/gtfs/stop_times.ex                        | 100%     34|86.7%    15|    -      0
  lib/concentrate/gtfs/stops.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/supervisor.ex                        |33.3%      3|50.0%     2|    -      0
  lib/concentrate/gtfs/trips.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/unzip.ex                             | 100%      6| 100%     2|    -      0
  lib/concentrate/merge_filter.ex                           |91.0%     67|71.4%    14|    -      0
  lib/concentrate/supervisor.ex                             | 100%      8| 100%     7|    -      0

Download coverage report

digitalcora · 2022-06-29T19:47:14Z

Refactor done! There are some incidental changes here that should not have any effect on the behavior of the app for our purposes (I also documented them in the commit message):

RemoveUnneededTimes no longer works with stop time updates that are missing a stop_sequence. This should not have happened anyway since ac72186, which formalized the requirement that all stop time updates should have a stop sequence for merging to work correctly.
RemoveUnneededTimes and IncludeStopID no longer work with GTFS feeds that have only a stop_times.txt. Like ScheduledStopTimes, they now require the full chain of stop_times.txt, trips.txt, routes.txt, and agencies.txt to be present and valid, though the files are unused in these filters. Both the old and new behaviors are accidents of the implementation details, and the new behaviour could be changed if we wanted to explicitly support invalid "partial" GTFS feeds, but as far as I know we have no need for this.

paulswartz

The code looks good!

I'm a little wary of deploying this without being able to test it on dev-green, but given that you tested it locally and saw the correct behavior, testing it on dev seems fine.

digitalcora · 2022-06-30T13:14:49Z

Unfortunately, Concentrate dev is also currently hooked up to the (broken) CRB dev feed, so without changing that, the only way to test this branch in a deployed environment would be... to deploy it to production 😓 Because of that, I'm inclined to consider this blocked on fixing the CRB dev feed, unless we (and by "we" I mean probably you or @losvedir) decide that Concentrate dev should use prod data sources.

…ture

github-actions · 2022-07-19T18:17:54Z

Coverage of commit `3b1f93e`

Summary coverage rate:
  lines......: 95.0% (1024 of 1078 lines)
  functions..: 77.8% (339 of 436 functions)
  branches...: no data found

Files changed coverage rate:
                                                            |Lines       |Functions  |Branches    
  Filename                                                  |Rate     Num|Rate    Num|Rate     Num
  ================================================================================================
  lib/concentrate/filter/include_route_direction.ex         | 100%     10| 100%     3|    -      0
  lib/concentrate/filter/include_stop_id.ex                 |60.0%      5|66.7%     3|    -      0
  lib/concentrate/group_filter/remove_unneeded_times.ex     |91.7%     36|87.5%     8|    -      0
  lib/concentrate/group_filter/scheduled_stop_times.ex      |91.7%     12|75.0%     4|    -      0
  lib/concentrate/group_filter/vehicle_stop_match.ex        | 100%     13| 100%     3|    -      0
  lib/concentrate/gtfs/helpers.ex                           | 100%      5| 100%     1|    -      0
  lib/concentrate/gtfs/stop_times.ex                        | 100%     34|86.7%    15|    -      0
  lib/concentrate/gtfs/stops.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/supervisor.ex                        |33.3%      3|50.0%     2|    -      0
  lib/concentrate/gtfs/trips.ex                             | 100%     17|80.0%     5|    -      0
  lib/concentrate/gtfs/unzip.ex                             | 100%      6| 100%     2|    -      0
  lib/concentrate/merge_filter.ex                           |91.0%     67|71.4%    14|    -      0
  lib/concentrate/supervisor.ex                             | 100%      8| 100%     7|    -      0

Download coverage report

losvedir · 2022-07-20T20:57:19Z

I pushed up a small commit that resolved an issue I think from a last minute refactor where StopTimes.get no longer exists (now it's StopTimes.arrival_departure). After that, it appears to work!

Here's dotcom-dev-green with all its boarding glory:

Compared to prod which is all "Scheduled":

The metrics in ECS seem fine, so I think this branch is a go!

losvedir

🎉

digitalcora added 2 commits June 13, 2022 15:07

doc: fix broken guide link in README

37f5c23

refactor: move GTFS out of Filter namespace

b61c59f

Move `Concentrate.Filter.GTFS.*` to `Concentrate.GTFS.*`. We are introducing a `Concentrate.GroupFilter` that will use `GTFS` data, so organizing this part of the app under the `Filter` namespace doesn't make as much sense as it did before.

digitalcora requested a review from paulswartz June 23, 2022 15:36

paulswartz reviewed Jun 23, 2022

View reviewed changes

lib/concentrate/gtfs/stop_times.ex Show resolved Hide resolved

lib/concentrate/group_filter/scheduled_stop_times.ex Outdated Show resolved Hide resolved

test/concentrate/group_filter/scheduled_stop_times_test.exs Outdated Show resolved Hide resolved

mbta deleted a comment from github-actions bot Jun 23, 2022

paulswartz reviewed Jun 23, 2022

View reviewed changes

mbta deleted a comment from github-actions bot Jun 27, 2022

digitalcora added 2 commits June 27, 2022 14:19

feat: filter to fill in stop times from schedules

c3ca297

digitalcora force-pushed the cfg-predictions-from-schedules branch from 1036a6b to c3ca297 Compare June 27, 2022 18:19

mbta deleted a comment from github-actions bot Jun 27, 2022

digitalcora temporarily deployed to dev-green June 27, 2022 19:07 Inactive

mbta deleted a comment from github-actions bot Jun 29, 2022

digitalcora requested a review from paulswartz June 29, 2022 19:47

paulswartz approved these changes Jun 30, 2022

View reviewed changes

losvedir temporarily deployed to dev-green July 7, 2022 14:43 Inactive

fix: StopTimes.get is probably supposed to be StopTimes.arrival_depar…

3b1f93e

…ture

losvedir temporarily deployed to dev-green July 19, 2022 18:16 Inactive

losvedir temporarily deployed to dev-green July 20, 2022 20:46 Inactive

losvedir approved these changes Jul 20, 2022

View reviewed changes

losvedir merged commit 6ef6552 into master Jul 20, 2022

losvedir deleted the cfg-predictions-from-schedules branch July 21, 2022 11:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: filter to fill in stop times from schedules #243

feat: filter to fill in stop times from schedules #243

digitalcora commented Jun 22, 2022

paulswartz left a comment

digitalcora commented Jun 23, 2022

paulswartz left a comment

paulswartz Jun 23, 2022

digitalcora Jun 23, 2022 •

edited

digitalcora commented Jun 27, 2022 •

edited

digitalcora commented Jun 27, 2022

paulswartz commented Jun 27, 2022

digitalcora commented Jun 27, 2022

paulswartz commented Jun 28, 2022

digitalcora commented Jun 28, 2022

digitalcora commented Jun 29, 2022

github-actions bot commented Jun 29, 2022

digitalcora commented Jun 29, 2022

paulswartz left a comment

digitalcora commented Jun 30, 2022

github-actions bot commented Jul 19, 2022

losvedir commented Jul 20, 2022

losvedir left a comment

feat: filter to fill in stop times from schedules #243

feat: filter to fill in stop times from schedules #243

Conversation

digitalcora commented Jun 22, 2022

paulswartz left a comment

Choose a reason for hiding this comment

digitalcora commented Jun 23, 2022

paulswartz left a comment

Choose a reason for hiding this comment

paulswartz Jun 23, 2022

Choose a reason for hiding this comment

digitalcora Jun 23, 2022 • edited

Choose a reason for hiding this comment

digitalcora commented Jun 27, 2022 • edited

digitalcora commented Jun 27, 2022

paulswartz commented Jun 27, 2022

digitalcora commented Jun 27, 2022

paulswartz commented Jun 28, 2022

digitalcora commented Jun 28, 2022

digitalcora commented Jun 29, 2022

github-actions bot commented Jun 29, 2022

Coverage of commit edbc809

digitalcora commented Jun 29, 2022

paulswartz left a comment

Choose a reason for hiding this comment

digitalcora commented Jun 30, 2022

github-actions bot commented Jul 19, 2022

Coverage of commit 3b1f93e

losvedir commented Jul 20, 2022

losvedir left a comment

Choose a reason for hiding this comment

digitalcora Jun 23, 2022 •

edited

digitalcora commented Jun 27, 2022 •

edited

Coverage of commit `edbc809`

Coverage of commit `3b1f93e`