This repository is an exploration and visualization of the Entur real-time dataset. I found this dataset when I was looking for something to use to get to know DuckDB.
It is a companion repository to a blogpost that's going to be published to arktekk.no/blog at some point. If you're reading this, maybe that's where you came from?
I meant for this to be just a quick, fun session to kill some time one evening, but it escalated a little, because it was very fun. Overall, I spent every free evening I had for some 10-12 days working on this. I'm thankful to have such an understanding family. It is definitely code at the prototype stage, though; there are tons and tons of usability/UX problems.
This repository uses uv; you can get it from here. It will create a virtualenv in `.venv`, which is good to know if you want to configure an editor or IDE to find the appropriate sources.

A short note on dependencies: for a friendly setup, `pyproject.toml` declares one large set of dependencies. If you want to use this for anything professional, I would recommend setting up 2 or 3 projects with different sets of dependencies.
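For illustration, a split like that could look roughly like this in `pyproject.toml` (the package names below are illustrative, not copied from the actual file):

```toml
[project]
name = "kollektivkart"
dependencies = ["dash", "duckdb"]   # baseline needed by the webapp (illustrative)

[project.optional-dependencies]
scripts = ["pandas", "pyarrow"]     # ETL scripts (illustrative)
jupyter = ["jupyter", "plotly"]     # notebook work (illustrative)
```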
Install dependencies, including jupyter (required for IDE integration with notebooks):

```shell
uv sync --all-extras
```

Install all dependencies for the scripts, but skip jupyter:

```shell
uv sync --extra=scripts
```

Install dependencies for the webapp only (also requires data files, see below):

```shell
uv sync
```

Add a dependency:

```shell
uv add plotly
```

Run jupyter:

```shell
uv run --with jupyter jupyter
```

Run the scripts to build a data repository:

```shell
uv run python -m kollektivkart.etl -h
```

Run the webapp in development mode (requires extra steps, see Scripts and Dashboard app below):

```shell
uv run python -m kollektivkart
```

MIT -- see LICENSE.md. You can use this code for any purpose, and you do not have to attribute it to me.
You also do not get to blame me if something in this repository becomes a black hole that consumes absolutely all of your RAM.
All data that is fetched and used by this repository is owned by Entur, see data.entur.no. It is available under the NLOD license.
EnturRealtimeEDA.ipynb is an analysis I did on the real-time dataset to get familiar with it. It will consume a lot of memory in PyCharm or the IntelliJ notebook plugin due to its size; I had to increase the heap size to 4096m.
This notebook produces a few files:
- `arrivals.parquet`, which contains "raw" data from the BigQuery table
- `entur.db`, which is a DuckDB instance with several tables, including `arrivals`
- `leg_stats.parquet`, which contains aggregated statistics for public transit stop-to-stop legs
- `stop_stats.parquet`, which contains aggregated statistics for public transit legs that arrived at each stop
There's a Flask API app that serves some queries for the frontend React app. You can download a stats.db (~400 MB) to make it work locally:

```shell
curl -o data/stats.db https://kaaveland-bus-eta-data.hel1.your-objectstorage.com/devdata/stats.db
```

Run it with `uv run python -m kollektivkart`, or build it with Docker and run that. The Dash webapp needs these files at runtime to work. If running with Docker, use a volume and set the `PARQUET_LOCATION` environment variable to the location of stats.db. If you want to load the files directly from S3, you can provide an environment file like so:
```
AWS_REGION=hel1.your-objectstorage.com
DUCKDB_S3_ENDPOINT=hel1.your-objectstorage.com
AWS_ACCESS_KEY_ID=YOUR_OWN_AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_OWN_AWS_SECRET_ACCESS_KEY
PARQUET_LOCATION=s3://kaaveland-bus-eta-data
```

In `api.py` there's an API that is mounted at `/api` on the Dash app.
There's a SPA frontend under `frontend` that I'm writing to practice TypeScript and React a little. It's what's deployed to kollektivkart.arktekk.no.
I run this using podman and a caddy reverse proxy at kollektivkart.arktekk.no.
You can use the docker image at ghcr, as discussed in the Dashboard app section, it requires access to data files.
Feel free to find inspiration in DEPLOY.md.
Note that `uv run python -m kollektivkart` is not a suitable way to run this application under any sort of load. Put it behind gunicorn or something else suitable. The Docker image takes care of this already.

NB! This webapp puts a lot of data (> 500 MB) in memory once it loads, so use `--preload` with gunicorn. This ensures a faster startup, and since this memory is only written once, it can be shared between the workers with copy-on-write. This lets you run many workers without using a lot of RAM. gunicorn usually recommends 1-2 workers per CPU, but since DuckDB is also threaded at the C level, 2 workers per CPU may be a little high.
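A gunicorn invocation along those lines might look like this (the module path and app variable name are assumptions; check the project's actual entrypoint, and see the Docker image for the real command):

```shell
# --preload imports the app once in the master process, so the workers
# share its memory copy-on-write instead of each loading > 500 MB.
uv run gunicorn --preload --workers 4 --bind 0.0.0.0:8000 'kollektivkart:server'
```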
You're welcome to file tickets & issues. You're welcome to contribute patches. Just be aware that this was a hobby/passion project with a clear goal (learn DuckDB), so unless there's significant interest in making it into something more than a proof of concept, it is likely that I will stop working on it.
As alluded to in the previous section, I've not decided whether to make something more of this yet. If I do, here are some things that should be done:
- Tons of usability bugs and annoyances to fix in the webapp.
- Fix structural issues in the project, separate the scripts, notebooks and the webapp into different packages. ✅
- Add automated tests.
- Set up a nightly job to fetch new data and update the app. This also requires introducing partitioning by date to avoid aggregating through all the old data again. ✅