
bus-eta

This repository is an exploration and visualization of the Entur Real-Time Dataset. I found this data set while looking for something to use to get to know DuckDB.

It is a companion repository to a blog post that's going to be published to arktekk.no/blog at some point. If you're reading this, maybe that's where you came from?

I meant for this to just be a quick and fun session to kill some time one evening, but it escalated a bit, because it was very fun. In all, I spent every free evening I had for some 10-12 days working on this. I'm thankful to have such an understanding family. It is definitely code at the prototype stage, though. There are tons and tons of usability/UX problems.

Setup

This repository uses uv; you can get it here. It creates a virtualenv in .venv, which is good to know if you want to configure an editor or IDE to find the appropriate sources.

A short note on dependencies: for ease of setup, pyproject.toml is set up with one huge set of dependencies. If you want to use this for anything professional, I would recommend splitting it into 2 or 3 projects with different sets of dependencies.
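For illustration, a split along those lines could use optional dependency groups in pyproject.toml. This is a hypothetical sketch; the group names and package lists here are made up and do not reflect the repository's actual dependency list:

```toml
# Hypothetical pyproject.toml fragment, not the real one.
[project]
name = "bus-eta"
dependencies = ["flask", "duckdb"]  # what the webapp alone needs

[project.optional-dependencies]
scripts = ["requests"]  # extra dependencies for the ETL scripts
jupyter = ["jupyter"]   # extra dependencies for notebook work
```

With a layout like this, `uv sync` installs only the webapp dependencies, while `uv sync --extra=scripts` and `uv sync --all-extras` pull in the optional groups, matching the commands below.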

Install all dependencies, including jupyter (required for IDE integration with notebooks):

uv sync --all-extras

Install all dependencies for the scripts but skip jupyter:

uv sync --extra=scripts

Install dependencies for the webapp only (also requires data files, see below):

uv sync

Add a dependency:

uv add plotly

Run jupyter:

uv run --with jupyter jupyter

Run the scripts to build a data repository:

uv run python -m kollektivkart.etl -h

Run the webapp in development mode (requires extra steps, see Scripts and Dashboard app below):

uv run python -m kollektivkart

License

MIT -- see LICENSE.md. You can use this code for any purpose, and you do not have to attribute it to me.

You also do not get to blame me if something in this repository becomes a black hole that consumes absolutely all of your RAM.

Data license

All data that is fetched and used by this repository is owned by Entur, see data.entur.no. It is available under the NLOD license.

What's here

Notebooks

EnturRealtimeEDA.ipynb is an analysis I did on the real time data set to get familiar with it. This will consume a lot of memory with PyCharm or the IntelliJ notebook-plugin due to its size, I had to increase heap size to 4096m.

This notebook produces a few files:

  • arrivals.parquet which contains "raw" data from the BigQuery table
  • entur.db which is a DuckDB instance with several tables, including arrivals
  • leg_stats.parquet contains aggregated statistics for public transit stop-to-stop legs
  • stop_stats.parquet contains aggregated statistics for public transit legs that arrived at each stop.

Dashboard app

There's a Flask API app that serves some queries for the frontend React app. You can download a stats.db (~400 MB) to make it work locally:

curl -o data/stats.db https://kaaveland-bus-eta-data.hel1.your-objectstorage.com/devdata/stats.db

Run it with uv run python -m kollektivkart, or build it with docker and run it that way. The dash webapp needs these files at runtime to work. If running with docker, mount a volume and set the PARQUET_LOCATION environment variable to the stats.db location. If you want to load the data directly from S3, you can provide an environment file like so:

AWS_REGION=hel1.your-objectstorage.com
DUCKDB_S3_ENDPOINT=hel1.your-objectstorage.com
AWS_ACCESS_KEY_ID=YOUR_OWN_AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_OWN_AWS_SECRET_ACCESS_KEY
PARQUET_LOCATION=s3://kaaveland-bus-eta-data
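For the curious, such an environment file roughly corresponds to DuckDB's httpfs S3 settings. The sketch below shows how the variables could be turned into SET statements; the exact wiring inside the app may differ, and the values are the placeholders from the file above:

```python
import os

# Pretend the environment file above has been loaded (placeholder values).
os.environ.update({
    "DUCKDB_S3_ENDPOINT": "hel1.your-objectstorage.com",
    "AWS_ACCESS_KEY_ID": "YOUR_OWN_AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "YOUR_OWN_AWS_SECRET_ACCESS_KEY",
})

# DuckDB's httpfs extension understands S3 settings like these;
# you would run each statement on a DuckDB connection before querying S3.
statements = [
    f"SET s3_endpoint='{os.environ['DUCKDB_S3_ENDPOINT']}'",
    f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'",
    f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'",
]
for stmt in statements:
    print(stmt)
```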

API

In api.py there's an API that is mounted at /api on the dash app.

Frontend

There's a SPA frontend under frontend that I'm writing to practice TypeScript and React a little bit. It's what's deployed to kollektivkart.arktekk.no.

Deployment

I run this using podman and a caddy reverse proxy at kollektivkart.arktekk.no.

You can use the docker image from ghcr; as discussed in the Dashboard app section, it requires access to the data files.

Feel free to find inspiration in DEPLOY.md.

Note that uv run python -m kollektivkart is not a suitable way to run this application for any sort of load. Put it behind gunicorn or something else suitable. The docker image takes care of this already.

NB! This webapp puts a lot of data (> 500 MB) in memory once it loads, so use --preload with gunicorn. This ensures faster startup, and since this memory is written only once, it can be shared between the workers with copy-on-write. That lets you run many workers without using a lot of RAM. gunicorn usually recommends 1-2 workers per CPU, but since DuckDB is also threaded at the C level, 2 workers per CPU may be a little high.
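As a sketch, a gunicorn.conf.py along these lines captures that advice. The bind address and worker formula are assumptions for illustration, not the repository's actual configuration:

```python
# Hypothetical gunicorn.conf.py, not the one shipped in the docker image.
import multiprocessing

# Load the app (and its > 500 MB of data) once in the master process,
# then fork workers that share those pages via copy-on-write.
preload_app = True

# Roughly one worker per CPU: DuckDB runs its own threads at the C level,
# so gunicorn's usual 1-2 workers per CPU could oversubscribe the cores.
workers = max(2, multiprocessing.cpu_count())

bind = "0.0.0.0:8000"
```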

Contributions & tickets

You're welcome to file tickets & issues. You're welcome to contribute patches. Just be aware that this was a hobby/passion project with a clear goal (learn DuckDB), so unless there's significant interest in making it into something more than a proof of concept, it is likely that I will stop working on it.

Where to next?

As alluded to in the previous section, I've not decided whether to make something more of this yet. If I did, here are some things that should be done:

  • Tons of usability bugs and annoyances to fix in the webapp.
  • Fix structural issues in the project, separate the scripts, notebooks and the webapp into different packages. ✅
  • Add automated tests.
  • Set up a nightly job to fetch new data and update the app. This also requires introducing partitioning by date to avoid aggregating through all the old data again. ✅
