Create ML models to forecast the severity of accidents based on the UK Road Safety Dataset.
This project aims to demonstrate an end-to-end machine learning use case, including:
- Data ingestion: Load (partitioned) data from the web and store it in a data lake that mimics the AWS S3 service.
- Model training: Provide a training environment for running training manually, i.e. in a Jupyter environment, as well as via the workflow orchestrator. Also account for experiment tracking and model registration.
- Model serving: Create Docker images and provision an inference web server that provides prediction functionality.
- Model monitoring: Continuously track model performance and present it in dashboards.
The UK Road Safety Dataset is maintained by the Department for Transport. It provides detailed data on personal injury road accidents in Great Britain from 1979 onwards, the vehicles involved, and the resulting casualties. The dataset is divided into Accident, Casualty, and Vehicle tables, offering insights into accident severity, environmental conditions, individual demographics, and vehicle details. This makes it useful for road safety studies, accident analysis, prevention measures, and machine learning model development.
For our project, we utilize data starting from 2016, which contains accident-specific characteristics. This data is suitable for training a machine learning model that could potentially be used within a live service to flag dangerous traffic situations. For this POC project, for the sake of simplicity, we stick to batch mode for deployment and monitoring.
The project is decomposed into different scenarios that, in real-world applications, would be handled by different processes:
- ingestion: Using dagster as a generic workflow orchestrator to download the raw data, merge the different data tables, and apply the preprocessing required to train a machine learning model on this data. Since the raw data is partitioned into yearly intervals, we make use of dagster's partitioned data assets.
- training-manual: After successful data ingestion, we can use the data to train a machine learning model. For this, a JupyterLab environment is created in which a training notebook can be run manually. An MLflow instance is also created for logging model metrics and registering trained models. The training-manual step is not strictly required as part of the workflow pipeline, but it is a good way to familiarize yourself with the model and its properties.
- training-workflow: Since we want training to be run automatically by the workflow orchestrator, we make use of dagstermill, a dagster integration of papermill that runs notebooks in a process while passing parameters to them. This way, you do not have to transfer the code created inside the training notebooks to a dedicated module; instead, you simply connect the existing notebooks to the pipeline.
- simulation: The simulation mode represents a real-world machine-learning-in-production environment. Of course, for our use case the data is only available on a yearly basis, so no data streaming process is applicable. Instead, we simulate continuously produced data from the already ingested data and run the processing workflows on a minute-wise schedule. In a real-world scenario, new events, i.e. accidents, occur on a daily basis, but you probably do not want to wait for days while testing the simulation mode, do you? 😉
To overcome this, we provide a time warp mode in which a day is represented by a few real-world seconds. The simulation start date is also synthetic. Both parameters, `SECONDS_PER_DAY` and `SIMULATION_START_DATE`, can be configured via environment variables.
The simulation workflows are orchestrated by dagster and include inference, monitoring report creation, and dashboard visualization using Grafana.
The simulation mode requires data ingestion and training to be completed.
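The time warp mapping can be sketched in a few lines of Python. The helper function and the default values below are illustrative, but the two environment variables are the ones described above:

```python
import os
from datetime import date, timedelta
from time import time

# Illustrative defaults; the project may configure different values.
SECONDS_PER_DAY = float(os.environ.get("SECONDS_PER_DAY", "5"))
SIMULATION_START_DATE = date.fromisoformat(
    os.environ.get("SIMULATION_START_DATE", "2020-01-01")
)


def simulated_date(wall_clock_start: float, now: float) -> date:
    """Map elapsed real-world seconds to a simulated calendar date."""
    elapsed_days = int((now - wall_clock_start) / SECONDS_PER_DAY)
    return SIMULATION_START_DATE + timedelta(days=elapsed_days)


# With SECONDS_PER_DAY=5, one real minute covers 12 simulated days.
start = time()
print(simulated_date(start, start + 60))
```

With such a mapping, the minute-wise dagster schedule advances the simulated calendar far faster than real time.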
In order to use the project, the following requirements must be met:
- make (for Windows check this)
- docker
- docker-compose
- If you want to run the unit tests and use pre-commit, Poetry alongside Python 3.11 is also required
```shell
make unit-tests
```
The output shows the test results, including code coverage, logged to the console.
```shell
make ingestion
```
Several containers are (built and) started. To manually start the data ingestion, open the dagster UI in your browser. On the right-hand side, use "Materialize all" and "Launch backfill" to materialize the data assets to the MinIO environment that locally mimics an AWS S3 environment. It is also worth navigating to the dagster runs section, which provides detailed information on the run properties, e.g. logs and run performance.
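The merge of the accident, vehicle, and casualty tables performed during ingestion can be sketched as a join on the accident index. The column names and the plain-dict representation below are simplifications for illustration; the real pipeline operates on the full yearly tables:

```python
# Toy records standing in for rows of the three data tables.
accidents = [{"accident_index": "A1", "accident_severity": 2}]
vehicles = [
    {"accident_index": "A1", "vehicle_type": 9},
    {"accident_index": "A1", "vehicle_type": 1},
]
casualties = [{"accident_index": "A1", "casualty_class": 1}]


def merge_tables(accidents, vehicles, casualties):
    """Left-join vehicles and casualties onto accidents by accident_index."""
    merged = []
    for acc in accidents:
        key = acc["accident_index"]
        merged.append({
            **acc,
            "vehicles": [v for v in vehicles if v["accident_index"] == key],
            "casualties": [c for c in casualties if c["accident_index"] == key],
        })
    return merged


rows = merge_tables(accidents, vehicles, casualties)
print(len(rows[0]["vehicles"]))  # 2 vehicles involved in accident A1
```

In the actual pipeline this join runs per yearly partition, so each partitioned asset materialization processes only one year of data.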
When running, find the following instances:
dagster UI: http://0.0.0.0:3000/asset-groups
dagster runs: http://0.0.0.0:3000/runs
Requires data ingestion completed.
```shell
make training-manual
```
Starts a Jupyter server and MLflow in the background.
Run the training notebook and verify that experiments and model artifacts are logged to MLflow. For training, all data available up to `SIMULATION_START_DATE` is used; all events beyond this date are considered to occur in the future.
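The temporal split used for training can be sketched as follows; the record fields and the cutoff value are assumptions for illustration:

```python
from datetime import date

SIMULATION_START_DATE = date(2020, 1, 1)  # illustrative cutoff

# Toy event records; the real data carries a date per accident row.
events = [
    {"accident_index": "A1", "date": date(2018, 5, 3)},
    {"accident_index": "A2", "date": date(2019, 11, 20)},
    {"accident_index": "A3", "date": date(2021, 2, 14)},
]

# Training uses everything before the simulation start date;
# later events are held back as "future" data for the simulation.
train = [e for e in events if e["date"] < SIMULATION_START_DATE]
future = [e for e in events if e["date"] >= SIMULATION_START_DATE]
print(len(train), len(future))  # 2 1
```

This guarantees the model never sees events that the simulation will later replay as new data.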
When running, find the following instances:
Jupyter server: http://0.0.0.0:8888
MLFlow UI: http://0.0.0.0:5000
Requires data ingestion completed.
```shell
make training-workflow
```
To manually start a training job, open the dagster UI in your browser. Again, use "Materialize all" and "Launch backfill" to materialize the data assets, i.e. run an experiment and log the model to MLflow.
When running, find the following instances:
dagster UI: http://0.0.0.0:3000/locations/model-training-using-workflow/asset-groups/model_training
MLFlow UI: http://0.0.0.0:5000
Requires model training (manual or via workflow) completed. Given a model `accident_severity` registered in MLflow, specify an existing `<your-trained-model-version>`.
```shell
make simulation MODEL_VERSION=<your-trained-model-version>
```
The model simulation workflows are available in the dagster UI. The workflow is scheduled to run every `EVAL_SCHEDULER_INCREMENT` minutes, refreshing the relevant assets continuously. However, when first starting the simulation, you need to flip the toggle to "Auto-materialize on" in the dagster UI. Assets in simulation mode are materialized to the PostgreSQL database `inference_db`, which can be accessed using Adminer.
dagster UI: http://0.0.0.0:3000/locations/model-application-simulation/asset-groups/recent
Adminer: http://0.0.0.0:8080 (username: postgres_user, password: postgres_password)
As part of the simulation, a model inference web server is exposed using FastAPI.
It is requested directly in one workflow step, but you can also play around with it. It exposes a Swagger UI, making it easy to issue some requests. Both exposed endpoints, `/predict` and `/info`, provide complete IO examples for you to experiment with, without any further knowledge of the model's JSON schema.
Model inference App: http://0.0.0.0:8000/docs
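A request against `/predict` might look like the following sketch. The feature names and the response shape are hypothetical; the authoritative IO examples are the ones served by the Swagger UI at `/docs`:

```python
import json

# Hypothetical request body for the /predict endpoint; the actual
# feature names and encodings are shown by the Swagger examples.
request_body = {
    "records": [
        {
            "speed_limit": 30,
            "light_conditions": 1,
            "weather_conditions": 2,
            "road_surface_conditions": 1,
        }
    ]
}
payload = json.dumps(request_body)

# To send it against a running server, something like:
# import urllib.request
# req = urllib.request.Request(
#     "http://0.0.0.0:8000/predict", data=payload.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())

# A response might map each input record to a predicted severity class.
response_body = {"predictions": [2]}
assert len(request_body["records"]) == len(response_body["predictions"])
```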
For monitoring the model performance, we utilize Evidently AI, which runs as the last workflow step in the pipeline. Note that Evidently requires reference data and evaluation data. For the reference data, we use a subset of the training data. The report output, which is created continuously, is persisted in the `inference_db`. A [Grafana](https://grafana.com/) instance is configured to read the reports from `inference_db` and create dashboards that are also continuously updated.
Grafana: http://0.0.0.0:3030 (username: admin, password: admin)
Monitoring Dashboards: http://0.0.0.0:3030/d/efbeb80f-e16a-4e1c-ba98-8f5ff862dfcc/accidents-severity-predictions
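How the continuously created reports end up as dashboard rows can be illustrated with a small persistence sketch. Here, sqlite3 stands in for the PostgreSQL `inference_db`, and the table name, columns, and metric names are assumptions:

```python
import json
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for the PostgreSQL inference_db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monitoring_reports (created_at TEXT, report TEXT)")

# A minimal Evidently-style report payload (metric names are illustrative).
report = {"accuracy": 0.81, "prediction_drift": 0.04}
conn.execute(
    "INSERT INTO monitoring_reports VALUES (?, ?)",
    (datetime.now(timezone.utc).isoformat(), json.dumps(report)),
)

# Grafana-style read: fetch the latest report for the dashboard panels.
row = conn.execute(
    "SELECT report FROM monitoring_reports ORDER BY created_at DESC LIMIT 1"
).fetchone()
print(json.loads(row[0])["accuracy"])  # 0.81
```

Grafana simply polls such a table on its refresh interval, which is why the dashboards update continuously as the simulation appends new reports.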
Take note of further documentation material: