Optimize the realtime trips updater #100

Closed
jamespfennell opened this issue Apr 1, 2023 · 11 comments

@jamespfennell
Owner

For the new NYC buses transit system (#97), I'm seeing GTFS realtime feed updates taking up to 7 seconds. To be clear, this is probably not a huge deal from a features perspective, because the accuracy of transit data is O(30 seconds), so having slightly stale data is fine. And the update process runs in the background, so it doesn't impact HTTP latency. The main problem is that for those 7 seconds Transiter is probably using a non-trivial amount of CPU/memory/IO, and thus requires a larger machine for deployment. By analogy, when the Python-to-Go migration was merged, the minimum requirements to run Transiter with the NYC subway system dropped from ~2.5 CPUs to ~0.5 CPUs because the update process became 5 times more efficient. It's literally cheaper to run Transiter now.

As a first pass I ran pprof on a Transiter instance with the new system:

go tool pprof "http://localhost:8082/debug/pprof/profile?seconds=300"  

which dumped a new profile in my ~/pprof directory. (Note that you can always take a pprof profile of a running Transiter instance using the admin HTTP endpoint /debug/pprof/profile.) This is the flamegraph, limited to the UpdateFeed function, which drives the feed update:

[image: flamegraph of UpdateFeed]

There are three big consumers of CPU time, in decreasing order:

  1. db.(*Queries).UpdateTripStopTime - this function sends an update SQL query to Postgres and is invoked once for every single stop time update in the feed. That is a lot of stop time updates! I suspect the reason this is taking so long isn't that the SQL query itself is slow; it's the overhead of calling Postgres from the Go code for each row. It would be awesome to batch these SQL queries somehow. This is non-trivial, though, because it's an update query and Postgres/sqlc doesn't seem to offer a simple solution here.
  2. db.(*Queries).InsertServiceMapStop - this inserts a single node in a service map. Again, batching this would probably help dramatically. This is easier than the previous case because it's an insert query, so we can probably use sqlc's support for CopyFrom (see the sketch after this list).
  3. common.Hash - this takes a hash of the GTFS data for a single trip and uses it to decide if the trip needs to be updated. If not, the trip update is skipped. This was introduced in b1bc7e1 and overall is a pretty good optimization - it halves the time needed to update the NYC subway. However, there may be a way to make the hashing faster.
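
For item 2, pgx's CopyFrom API is the natural batching primitive: it streams many rows to Postgres in a single COPY operation instead of issuing one INSERT per row. The following is only a minimal sketch of the idea; the table name, column names, and argument shapes are illustrative assumptions, not Transiter's actual schema or code:

```go
package example

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4"
)

// insertServiceMapStops sketches batching the service map inserts with pgx's
// CopyFrom, which sends every row in a single COPY operation. The table name,
// column names, and argument shape are illustrative assumptions only.
func insertServiceMapStops(ctx context.Context, conn *pgx.Conn, mapPk int64, stopPks []int64) error {
	rows := make([][]interface{}, 0, len(stopPks))
	for position, stopPk := range stopPks {
		rows = append(rows, []interface{}{mapPk, stopPk, int32(position)})
	}
	if _, err := conn.CopyFrom(
		ctx,
		pgx.Identifier{"service_map_vertex"},
		[]string{"map_pk", "stop_pk", "position"},
		pgx.CopyFromRows(rows),
	); err != nil {
		return fmt.Errorf("batch inserting service map stops: %w", err)
	}
	return nil
}
```

sqlc's CopyFrom support essentially generates a typed wrapper around this same pgx call.
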
@jamespfennell
Owner Author

Optimization 2 seems really easy: e947f72!

@jamespfennell
Owner Author

With this commit the InsertServiceMapStop portion of the flamegraph becomes tiny. Going to merge that commit.

[image: flamegraph after the service map batching commit]

jamespfennell added a commit that referenced this issue Apr 1, 2023
Per the report in the bug, this improves performance of the update task, and the
code is as simple or even simpler.
@jamespfennell
Owner Author

jamespfennell commented Apr 1, 2023

I did a first pass at the stop times too in 119c651. The results here look good, though not amazing. The update times seem to be halved compared to master before this ticket, and in the flamegraph updateStopTimesInDB is much smaller:

[image: flamegraph with updateStopTimesInDB much smaller]

At this point the code is a little simpler, too.

This code, though, operates at the level of a single trip. I wonder whether, to get a really great improvement, we need to batch stop time updates across all trips? That would make the code potentially more complex, but the savings could be awesome.
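
To make that idea concrete, here is a rough sketch of what cross-trip batching could look like under a delete-then-reinsert scheme: delete every affected trip's stop times in one query, then re-insert all the new rows in one COPY. The schema, types, and names are illustrative assumptions, not Transiter's actual code:

```go
package example

import (
	"context"

	"github.com/jackc/pgx/v4"
)

// stopTimeRow is an illustrative row shape for a trip stop time.
type stopTimeRow struct {
	tripPk       int64
	stopPk       int64
	stopSequence int32
}

// replaceStopTimes deletes the stop times of all updated trips in a single
// query and then re-inserts the new rows in a single COPY, instead of issuing
// per-trip (or per-row) statements.
func replaceStopTimes(ctx context.Context, tx pgx.Tx, tripPks []int64, rows []stopTimeRow) error {
	if _, err := tx.Exec(ctx,
		`DELETE FROM trip_stop_time WHERE trip_pk = ANY($1)`, tripPks); err != nil {
		return err
	}
	copyRows := make([][]interface{}, len(rows))
	for i, r := range rows {
		copyRows[i] = []interface{}{r.tripPk, r.stopPk, r.stopSequence}
	}
	_, err := tx.CopyFrom(
		ctx,
		pgx.Identifier{"trip_stop_time"},
		[]string{"trip_pk", "stop_pk", "stop_sequence"},
		pgx.CopyFromRows(copyRows),
	)
	return err
}
```
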

By the way, a nice thing about playing around with these optimizations is that the testing coverage is quite good, so there is not much concern about introducing regressions.

@jamespfennell
Owner Author

The last flamegraph isn't quite right - it turns out there's another goroutine that handles "copy from" operations, and it's reported as taking a lot of time. The really weird thing is that all of the time seems to be spent constructing errors, even though no errors are reported:

[image: flamegraph of the "copy from" goroutine, dominated by error construction]

@jamespfennell
Owner Author

Looking through the source code of the pgx driver, it seems the function encodePreparedStatementArgument is no longer in mainline. In fact, it seems that the "copy from" part of pgx changed a lot between v4 (which Transiter is using) and v5 (the most recent pgx version). I think all of Transiter's dependencies should be updated before doing more performance work.

jamespfennell added a commit that referenced this issue Apr 2, 2023
All of the diffs are from updating pgx to v5, which was the point of this
anyway.
@jamespfennell
Owner Author

With the dependencies updated and using the previous commit mentioned above (not yet merged), the results here look pretty nice. I think I'm seeing about a 60% performance improvement overall, both from the time taken for each update (it has gone from ~8 seconds to ~3 seconds) and from the total time spent in the update function in pprof (from ~16 seconds to ~6 seconds, over a 300-second profile).

@jamespfennell
Owner Author

The current state of play on this branch:

[image: flamegraph of the current state of the branch]

The three main things we can speed up here are:

  • Making the trip hashing algorithm better (common.HashValue). Right now it marshals the trip to JSON and takes a hash of the bytes; presumably this would be much faster if we just hashed the struct directly somehow.
  • Batching UpdateTrip (second column under realtime.(*stagedUpdates).run).
  • Batching MarkStopTimesPast (first column under realtime.(*stagedUpdates).run).

The two batching items are a little trickier because we can't simply delete the old rows and reinsert them. We may need to write some custom pgx driving code around an UPDATE ... FROM SQL query.
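
A minimal sketch of what that custom driving code might look like, using unnest-ed array parameters so that a single UPDATE ... FROM statement touches many rows. Everything here (the table, the updated column, the pgx v5 usage) is an illustrative assumption rather than Transiter's actual queries:

```go
package example

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
)

// updateTripsSQL updates many trips in one statement by zipping two array
// parameters into rows with unnest. The column being updated is illustrative.
const updateTripsSQL = `
UPDATE trip
SET started_at = args.started_at
FROM (
    SELECT unnest($1::bigint[]) AS pk,
           unnest($2::timestamptz[]) AS started_at
) AS args
WHERE trip.pk = args.pk`

func updateTrips(ctx context.Context, conn *pgx.Conn, pks []int64, startedAts []time.Time) error {
	// pgx encodes the Go slices as Postgres arrays, so this is one round trip
	// regardless of how many trips are being updated.
	_, err := conn.Exec(ctx, updateTripsSQL, pks, startedAts)
	return err
}
```
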

Other than these three, I think we're at the limit of what we can do, at least in idiomatic Go. Looking through other parts of the flamegraph, non-trivial time is being spent in e.g. map accesses. I think the fact that this is now the bottleneck is a good sign that we've optimized the code well.

jamespfennell added a commit that referenced this issue Apr 2, 2023
From the investigation on the bug (#100), the main slowness in the realtime
updater comes from sending stop time updates one-at-a-time to Postgres. With
this commit the updates are changed to delete+insert, and both the delete and
insert queries are batched across the whole update. For the NYC buses, this
commit seems to make the updater at least 50% faster.
@jamespfennell
Owner Author

So it turns out sqlc/pgx have a really simple way to do batch updates, using :batchexec instead of :exec. Using this makes the two SQL queries mentioned in the last comment negligible in the flamegraph. The last remaining piece is optimizing the hash calculation - it is taking up 25% of the update time right now!
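
For reference, with a query annotated `-- name: UpdateTripStopTime :batchexec`, sqlc generates a method that takes a slice of parameter structs and sends every update in one pgx batch. The sketch below shows how such a method is typically driven; the generated identifiers follow sqlc's conventions but are assumptions rather than Transiter's actual generated code, and imports are omitted:

```go
// q is the sqlc-generated *db.Queries value; params holds one entry per stop
// time update. The generated method queues all of them into a single pgx batch.
func updateStopTimes(ctx context.Context, q *db.Queries, params []db.UpdateTripStopTimeParams) error {
	results := q.UpdateTripStopTime(ctx, params)
	defer results.Close()
	var firstErr error
	results.Exec(func(i int, err error) {
		if err != nil && firstErr == nil {
			firstErr = err
		}
	})
	return firstErr
}
```
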

jamespfennell added a commit that referenced this issue Apr 2, 2023
This improves performance by the same mechanism as the other batching commits.
@jamespfennell
Owner Author

I did a proof-of-concept of a custom trip hasher in bfb910c and, as expected, performance is much better - another 25% reduction in the time taken to do updates. The code is not crazy bad, so I'm going to commit it to the GTFS library and use it. Then this bug will be finished.
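
The gist of the direct-hashing approach is to feed each field of the trip into the hash rather than going through encoding/json. This is a simplified sketch with a made-up trip struct, not the code in bfb910c or in the gtfs library:

```go
package example

import (
	"encoding/binary"
	"hash/fnv"
	"time"
)

// trip and stopTime are simplified stand-ins for the GTFS realtime types.
type trip struct {
	ID        string
	RouteID   string
	StopTimes []stopTime
}

type stopTime struct {
	StopID  string
	Arrival time.Time
}

// hashTrip writes each field directly into the hash, avoiding the JSON
// marshal and the allocations that come with it.
func hashTrip(t *trip) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	h.Write([]byte(t.ID))
	h.Write([]byte{0}) // separator so adjacent fields can't collide
	h.Write([]byte(t.RouteID))
	h.Write([]byte{0})
	for _, st := range t.StopTimes {
		h.Write([]byte(st.StopID))
		h.Write([]byte{0})
		binary.LittleEndian.PutUint64(buf[:], uint64(st.Arrival.Unix()))
		h.Write(buf[:])
	}
	return h.Sum64()
}
```
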

One other thing I've noticed in all of the profiles is that a lot of time is being spent in garbage collection. The GTFS library creates a lot of pointers, and maybe this is why the GC is so busy? Now that generics have landed, all of these pointers could be replaced with an optional type, which would be much more GC-friendly. This would be a big breaking change, though.
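
For illustration, an optional type along these lines keeps values inline instead of behind pointers, giving the GC far fewer objects to trace. This is only a sketch of the idea, not an API the gtfs library actually exposes:

```go
package example

// Optional stores the value inline, so a struct full of Optional fields is a
// single allocation, unlike a struct full of *T pointer fields.
type Optional[T any] struct {
	value T
	valid bool
}

func Some[T any](v T) Optional[T] { return Optional[T]{value: v, valid: true} }
func None[T any]() Optional[T]    { return Optional[T]{} }

// Get returns the value and whether it is present.
func (o Optional[T]) Get() (T, bool) { return o.value, o.valid }
```
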

@jamespfennell
Owner Author

jamespfennell commented Apr 3, 2023

I just took a "heap profile" (first time ever doing that!) and according to the inuse_space view, 70% of the memory is being allocated in the GTFS realtime parser. The profile seems a little janky though and is only showing a few functions:

[image: heap profile, inuse_space view]

There is also an alloc_space view, in which the distribution follows the CPU distribution.

Edit: my understanding of the profile is incorrect here. The profile is just showing live memory, and so doesn't tell us about the allocation profile of the program.
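
For reference, a heap profile can also be written from inside the process with the standard library's runtime/pprof (independent of the admin HTTP endpoint); go tool pprof's -sample_index flag then switches between the inuse_space and alloc_space views:

```go
package example

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the heap profile to a file that go tool pprof can open.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // refresh the statistics behind the live-memory (inuse) view
	return pprof.WriteHeapProfile(f)
}
```
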

@jamespfennell
Owner Author

I just committed an update to the GTFS library, which incorporates a much faster hash function for trips (jamespfennell/gtfs@02cdcf2). As expected, CPU usage in the update process decreased again.

I rolled out all of the changes in this issue to demo.transiter.dev, and am seeing an awesome reduction in the amount of time taken to do updates for the NYC subway numbered trains:

[image: graph of NYC subway numbered-trains update times on demo.transiter.dev]

The graph uses a 5-minute sliding window, which is why the decrease is smooth. Note that the bottom of the axis is 0.2 seconds. The average update time thus dropped from ~800ms to ~350ms! According to my CPU monitoring, this seems to have reduced Transiter's CPU usage by 10% of a vCPU (maybe about a 50% reduction, but I'm not sure because the node is multi-tenanted with other projects I run).

Finally, the last profile I took shows maybe 45% of the CPU time being spent GCing memory. I think this is potentially expected: with these optimizations Transiter is highly IO bound, so the absolute CPU time is small and GC makes up a larger share of it. But we can revisit this later, of course.
