Create ML models to forecast the severity of accidents based on the UK Road Safety Dataset.
This project aims to demonstrate an end-to-end machine learning use case, including:
- Data ingestion: Load (partitioned) data from the web and store it in a data lake that mimics the AWS S3 service.
- Model training: Provide a training environment for running training manually, i.e. in a Jupyter environment, as well as via the workflow orchestrator. Also account for experiment tracking and model registration.
- Model serving: Create Docker images and provision an inference web server that provides prediction functionality.
- Model monitoring: Continuously track model performance and present it in dashboards.
The UK Road Safety Dataset is maintained by the Department for Transport. It provides detailed data on personal injury road accidents in Great Britain from 1979 onwards, the vehicles involved, and the resulting casualties. The dataset is divided into Accident, Casualty, and Vehicle tables, offering insights into accident severity, environmental conditions, individual demographics, and vehicle details. This makes it useful for road safety studies, accident analysis, prevention measures, and machine learning model development.
For our project, we utilize data starting from 2016, which contains accident-specific characteristics. This data is suitable for training a machine learning model that could potentially be used within a live service to flag dangerous traffic situations. For this POC project, for the sake of simplicity, we stick to batch mode for deployment and monitoring.
The project is decomposed into different scenarios that, in real-world applications, would be handled by different processes:
- ingestion: Using dagster as a generic workflow orchestrator to download the raw data, merge the different data tables, and apply the preprocessing required to train a machine learning model on this data. Since the raw data is partitioned into yearly intervals, we make use of dagster's partitioned data assets.
- training-manual: After successful data ingestion, we can use the data to train a machine learning model. For this, a JupyterLab environment is created in which a training notebook can be run manually. An MLflow instance is also created for logging model metrics and registering trained models. The training-manual step is not strictly required as part of the workflow pipeline, but it is a good way to familiarize yourself with the model and its properties.
- training-workflow: Since we want training to be run automatically by the workflow orchestrator, we make use of dagstermill, a dagster integration of papermill that runs notebooks in a process while passing parameters to them. This way, you do not have to transfer the code created inside the training notebooks to a dedicated module; instead, you simply connect the existing notebooks to the pipeline.
- simulation: The simulation mode represents a real-world machine-learning-in-production environment. Of course, for our use case the data is only available on a yearly basis, so no data streaming process is applicable. Instead, we simulate continuously produced data from the already ingested data and run the processing workflows on a minute-wise schedule. In a real-world scenario, new events, i.e. accidents, occur on a daily basis, but you probably do not want to wait for days while testing the simulation mode, do you? 😉
To overcome this, we provide a time warp mode in which a day is represented by a few real-world seconds. The simulation start date is also synthetic. Both parameters, `SECONDS_PER_DAY` and `SIMULATION_START_DATE`, can be configured via environment variables.
The simulation workflows are orchestrated by dagster and include inference, monitoring report creation, and dashboard visualization using Grafana.
The simulation mode requires data ingestion and training to be completed.
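The time warp mapping can be sketched in a few lines of Python. The helper function and the default values below are illustrative, but the two environment variables are the ones described above:

```python
import os
from datetime import date, timedelta
from time import time

# Illustrative defaults; the project may configure different values.
SECONDS_PER_DAY = float(os.environ.get("SECONDS_PER_DAY", "5"))
SIMULATION_START_DATE = date.fromisoformat(
    os.environ.get("SIMULATION_START_DATE", "2020-01-01")
)


def simulated_date(wall_clock_start: float, now: float) -> date:
    """Map elapsed real-world seconds to a simulated calendar date."""
    elapsed_days = int((now - wall_clock_start) / SECONDS_PER_DAY)
    return SIMULATION_START_DATE + timedelta(days=elapsed_days)


# With SECONDS_PER_DAY=5, one real minute covers 12 simulated days.
start = time()
print(simulated_date(start, start + 60))
```

With such a mapping, the minute-wise dagster schedule advances the simulated calendar far faster than real time.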
In order to use the project, the following requirements must be met:
- make (for Windows check this)
- docker
- docker-compose
- If you want to run the unit tests and use pre-commit, Poetry alongside Python 3.11 is also required
```shell
make unit-tests
```
The output shows the test results, including code coverage, logged to the console.
```shell
make ingestion
```
Several containers are (built and) started. To manually start the data ingestion, open the dagster UI in your browser. On the right-hand side, use "Materialize all" and "Launch backfill" to materialize the data assets to the MinIO environment that locally mimics an AWS S3 environment. It is also worth navigating to the dagster runs section, which provides detailed information on the run properties, e.g. logs and run performance.
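The merge of the accident, vehicle, and casualty tables performed during ingestion can be sketched as a join on the accident index. The column names and the plain-dict representation below are simplifications for illustration; the real pipeline operates on the full yearly tables:

```python
# Toy records standing in for rows of the three data tables.
accidents = [{"accident_index": "A1", "accident_severity": 2}]
vehicles = [
    {"accident_index": "A1", "vehicle_type": 9},
    {"accident_index": "A1", "vehicle_type": 1},
]
casualties = [{"accident_index": "A1", "casualty_class": 1}]


def merge_tables(accidents, vehicles, casualties):
    """Left-join vehicles and casualties onto accidents by accident_index."""
    merged = []
    for acc in accidents:
        key = acc["accident_index"]
        merged.append({
            **acc,
            "vehicles": [v for v in vehicles if v["accident_index"] == key],
            "casualties": [c for c in casualties if c["accident_index"] == key],
        })
    return merged


rows = merge_tables(accidents, vehicles, casualties)
print(len(rows[0]["vehicles"]))  # 2 vehicles involved in accident A1
```

In the actual pipeline this join runs per yearly partition, so each partitioned asset materialization processes only one year of data.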
When running, find the following instances:
dagster UI: http://0.0.0.0:3000/asset-groups
dagster runs: http://0.0.0.0:3000/runs
Requires data ingestion completed.
```shell
make training-manual
```
Starts a Jupyter server and MLflow in the background.
Run the training notebook and verify that experiments and model artifacts are logged to MLflow. For training, all data available up to `SIMULATION_START_DATE` is used; all events beyond this date are considered to occur in the future.
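The temporal split used for training can be sketched as follows; the record fields and the cutoff value are assumptions for illustration:

```python
from datetime import date

SIMULATION_START_DATE = date(2020, 1, 1)  # illustrative cutoff

# Toy event records; the real data carries a date per accident row.
events = [
    {"accident_index": "A1", "date": date(2018, 5, 3)},
    {"accident_index": "A2", "date": date(2019, 11, 20)},
    {"accident_index": "A3", "date": date(2021, 2, 14)},
]

# Training uses everything before the simulation start date;
# later events are held back as "future" data for the simulation.
train = [e for e in events if e["date"] < SIMULATION_START_DATE]
future = [e for e in events if e["date"] >= SIMULATION_START_DATE]
print(len(train), len(future))  # 2 1
```

This guarantees the model never sees events that the simulation will later replay as new data.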
When running, find the following instances:
Jupyter server: http://0.0.0.0:8888
MLFlow UI: http://0.0.0.0:5000
Requires data ingestion completed.
```shell
make training-workflow
```
To manually start a training job, open the dagster UI in your browser. Again, use "Materialize all" and "Launch backfill" to materialize the data assets, i.e. run an experiment and log the model to MLflow.
When running, find the following instances:
dagster UI: http://0.0.0.0:3000/locations/model-training-using-workflow/asset-groups/model_training
MLFlow UI: http://0.0.0.0:5000
Requires model training (manual or via workflow) completed. Given a model `accident_severity` registered in MLflow, specify an existing `<your-trained-model-version>`.
```shell
make simulation MODEL_VERSION=<your-trained-model-version>
```
The model simulation workflows are available in the dagster UI. The workflow is scheduled to run every `EVAL_SCHEDULER_INCREMENT` minutes, refreshing the relevant assets continuously. However, when first starting the simulation, you need to flip the toggle to "Auto-materialize on" in the dagster UI. Assets in simulation mode are materialized to the PostgreSQL database `inference_db`, which can be accessed using Adminer.
dagster UI: http://0.0.0.0:3000/locations/model-application-simulation/asset-groups/recent
Adminer: http://0.0.0.0:8080 (username: postgres_user, password: postgres_password)
As part of the simulation, a model inference web server is exposed using FastAPI.
It is requested directly in one workflow step, but you can also play around with it. It exposes a Swagger UI, making it easy to issue some requests. Both exposed endpoints, `/predict` and `/info`, provide complete IO examples for you to experiment with, without any further knowledge of the model's JSON schema.
Model inference App: http://0.0.0.0:8000/docs
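A request against `/predict` might look like the following sketch. The feature names and the response shape are hypothetical; the authoritative IO examples are the ones served by the Swagger UI at `/docs`:

```python
import json

# Hypothetical request body for the /predict endpoint; the actual
# feature names and encodings are shown by the Swagger examples.
request_body = {
    "records": [
        {
            "speed_limit": 30,
            "light_conditions": 1,
            "weather_conditions": 2,
            "road_surface_conditions": 1,
        }
    ]
}
payload = json.dumps(request_body)

# To send it against a running server, something like:
# import urllib.request
# req = urllib.request.Request(
#     "http://0.0.0.0:8000/predict", data=payload.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())

# A response might map each input record to a predicted severity class.
response_body = {"predictions": [2]}
assert len(request_body["records"]) == len(response_body["predictions"])
```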
For monitoring the model performance, we utilize Evidently AI, which runs as the last workflow step in the pipeline. Note that Evidently requires reference data and evaluation data. For the reference data, we use a subset of the training data. The report output, which is created continuously, is persisted in the `inference_db`. A [Grafana](https://grafana.com/) instance is configured to read the reports from `inference_db` and create dashboards that are also continuously updated.
Grafana: http://0.0.0.0:3030 (username: admin, password: admin)
Monitoring Dashboards: http://0.0.0.0:3030/d/efbeb80f-e16a-4e1c-ba98-8f5ff862dfcc/accidents-severity-predictions
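How the continuously created reports end up as dashboard rows can be illustrated with a small persistence sketch. Here, sqlite3 stands in for the PostgreSQL `inference_db`, and the table name, columns, and metric names are assumptions:

```python
import json
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for the PostgreSQL inference_db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monitoring_reports (created_at TEXT, report TEXT)")

# A minimal Evidently-style report payload (metric names are illustrative).
report = {"accuracy": 0.81, "prediction_drift": 0.04}
conn.execute(
    "INSERT INTO monitoring_reports VALUES (?, ?)",
    (datetime.now(timezone.utc).isoformat(), json.dumps(report)),
)

# Grafana-style read: fetch the latest report for the dashboard panels.
row = conn.execute(
    "SELECT report FROM monitoring_reports ORDER BY created_at DESC LIMIT 1"
).fetchone()
print(json.loads(row[0])["accuracy"])  # 0.81
```

Grafana simply polls such a table on its refresh interval, which is why the dashboards update continuously as the simulation appends new reports.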
Take note of further documentation material: