# Predictive model: Business Use Case

## Goal

- Using the available [Road Safety Data](https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data), create a predictive-modeling business use case
- Desired use case: **On-premise forecast of *dangerous* traffic situations**
- Train a model capable of predicting an accident severity "score": overall accident severity

## Considerations

- Take all available information into account, including spatiotemporal, environmental, and vehicle data
- Important when defining features:
    - Only features available *before* the accident can be used
    - Do not use features that cannot be used in practice, e.g.
        - due to GDPR (driver properties)
        - for business-strategic reasons (car model name)
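
These constraints can be enforced with a simple column filter before training. The sketch below is a minimal illustration, not the notebook's actual feature selection: the helper `select_usable_features` is hypothetical, and apart from the `accident.*` columns every column name and prefix is a made-up example of how such fields might be named.

```python
def select_usable_features(columns, blocked_prefixes=(), blocked_names=()):
    """Return the column names that are allowed as model features.

    A column is dropped if its name is listed in `blocked_names` or
    starts with one of `blocked_prefixes`.
    """
    prefixes = tuple(blocked_prefixes)
    names = set(blocked_names)
    return [
        col for col in columns
        if col not in names and not col.startswith(prefixes)
    ]


# Hypothetical column names, for illustration only
columns = [
    "accident.year",
    "accident.month",
    "driver.age",          # personal data, blocked due to GDPR
    "vehicle.model",       # blocked for business-strategic reasons
    "weather.conditions",
    "casualty.severity",   # only known after the accident
]

usable = select_usable_features(
    columns,
    blocked_prefixes=["driver.", "casualty."],
    blocked_names=["vehicle.model"],
)
print(usable)
```

Applied to a DataFrame, the same helper could be used as `X = X[select_usable_features(X.columns, ...)]`.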

In [None]:
import os
import pandas
import mlflow
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from workflows.helpers.utils import (
    infer_catboost_feature_types,
    read_partitioned_pandas_asset
)

from sklearn import set_config
from sklearn.metrics import mean_absolute_error, mean_squared_error, confusion_matrix
from sklearn.model_selection import ShuffleSplit

from catboost import CatBoostRegressor, Pool

set_config(transform_output="pandas")

## Load dataset

When running this notebook manually, the data is loaded with the `read_partitioned_pandas_asset` helper.

When the notebook runs under workflow orchestration, the variable `X` is instead overwritten automatically by the respective source asset, `accidents_vehicles_casualties_dataset`.
See this [tutorial](https://docs.dagster.io/integrations/dagstermill/using-notebooks-with-dagster) for details.

In [None]:
X = read_partitioned_pandas_asset("accidents_vehicles_casualties_dataset")

Extract the target to train on

In [None]:
y = X.pop("target")

## Properties Visualization

In [None]:
date = pandas.to_datetime(
    X[["accident.year", "accident.month", "accident.day"]]
    .rename(
        columns={
            "accident.year": "year",
            "accident.month": "month",
            "accident.day": "day"
        }
    )
)
date.name = "date"
date = date.to_frame()
date["count"] = 1

# Accident counts
f_count, ax = plt.subplots(figsize=(10, 3))
ax.set_title("Accident count by dates")
sns.histplot(date, x="date", ax=ax)

# Severity score counts
f_score, ax = plt.subplots(figsize=(10, 3))
ax.set_title("Severity score counts")
# Bin edges run from 0 to max + 1 so that the highest score gets its own bin
sns.histplot(y, bins=range(0, int(max(y)) + 2), ax=ax)

## Training and logging

- Define model parameters
- Run training and log parameters and metrics to the tracking server
- Log the trained model as a model artifact
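
One detail matters when carving a validation set out of the training rows: a second `ShuffleSplit` applied to the training subset returns positions within that subset, so they have to be mapped back through the outer training indices (or applied to the subset itself) to keep the three sets disjoint. A minimal sketch, assuming the same split sizes as used in this notebook; the helper name `three_way_split` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit


def three_way_split(n_rows, test_size=0.2, val_size=0.05, seed=42):
    """Return disjoint (fit, validation, test) row positions for n_rows rows."""
    positions = np.arange(n_rows)

    # Outer split: hold out the test rows
    outer = ShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(outer.split(positions))

    # Inner split: the returned indices are relative to train_idx
    inner = ShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    fit_rel, val_rel = next(inner.split(train_idx))

    # Map the relative positions back to positions in the full dataset
    return train_idx[fit_rel], train_idx[val_rel], test_idx


fit_idx, val_idx, test_idx = three_way_split(1000)
print(len(fit_idx), len(val_idx), len(test_idx))
```

With a DataFrame, `X.iloc[fit_idx]`, `X.iloc[val_idx]` and `X.iloc[test_idx]` then select non-overlapping rows.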

In [None]:
feat = infer_catboost_feature_types(X)

catboost_init_params = {
    "cat_features": feat["categorical"], 
    "text_features": feat["text"], 
    "od_type": "Iter", 
    "iterations": 400,
    "train_dir": "/tmp/catboost"
}

catboost_fit_params = {
    "early_stopping_rounds": 101,
    "verbose": 100
}

model = CatBoostRegressor(**catboost_init_params)

In [None]:
mlflow.set_registry_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment("casualty-regression")

In [None]:
with mlflow.start_run() as run:
    
    run_id = run.info.run_id

    mlflow.log_figure(f_count, "count_by_date.png")
    mlflow.log_figure(f_score, "score_count.png")
    mlflow.log_params(catboost_init_params)
    mlflow.log_params(catboost_fit_params)
    mlflow.log_param("n_accidents", len(X))
    mlflow.log_param("min_date", date["date"].min())
    mlflow.log_param("max_date", date["date"].max())

    splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    
    for train, test in splitter.split(X):
        Xtrain = X.iloc[train]
        ytrain = y.iloc[train]

        data_test = Pool(
            X.iloc[test], 
            y.iloc[test], 
            text_features=feat["text"],
            cat_features=feat["categorical"]
        )

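        # Indices from this inner split are relative to Xtrain, not X,
        # so they must be applied to Xtrain/ytrain to avoid test leakage
        fit_idx, val_idx = next(
            ShuffleSplit(n_splits=1, test_size=0.05, random_state=42)
            .split(Xtrain)
        )

        data_val = Pool(
            Xtrain.iloc[val_idx],
            ytrain.iloc[val_idx],
            text_features=feat["text"],
            cat_features=feat["categorical"]
        )

        data_train = Pool(
            Xtrain.iloc[fit_idx],
            ytrain.iloc[fit_idx],
            text_features=feat["text"],
            cat_features=feat["categorical"]
        )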

        model.fit(data_train, eval_set=data_val, **catboost_fit_params)

        ypred = model.predict(data_test)
        
        mae = mean_absolute_error(data_test.get_label(), ypred)
        mse = mean_squared_error(data_test.get_label(), ypred)
        
        mlflow.log_metrics(
            {
                "mae": mae,
                "mse": mse
            }
        )

        # Confusion matrix
        confusion = pandas.DataFrame(
            confusion_matrix(
                data_test.get_label(), 
                np.round(ypred, 0).astype(int)
            )
        )
        confusion.to_csv("/tmp/confusion.csv")
        mlflow.log_artifact("/tmp/confusion.csv")
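
The confusion matrix treats the regression output as a class label by rounding it. A minimal sketch of that step, assuming the severity scores are integer-valued; clipping to the observed label range and passing explicit `labels` are additions made here so the matrix keeps a fixed shape, and the helper `score_confusion` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def score_confusion(y_true, y_pred):
    """Tabulate rounded regression predictions against integer scores."""
    y_true = np.asarray(y_true).astype(int)

    # One row/column per integer score between the observed min and max
    labels = np.arange(y_true.min(), y_true.max() + 1)

    # Round to the nearest score, then clip predictions that fall
    # outside the observed range
    y_hat = np.clip(np.rint(y_pred), labels[0], labels[-1]).astype(int)

    return labels, confusion_matrix(y_true, y_hat, labels=labels)


labels, cm = score_confusion([1, 2, 3, 3], [0.6, 2.4, 3.9, -1.0])
print(labels)
print(cm)
```

Rows correspond to true scores and columns to rounded predictions, both in the order given by `labels`.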

Log the model and register a new version in the registry

In [None]:
example_input = X.sample(n=100)

with mlflow.start_run(run_id=run_id):
    # Fit on complete data and log model artifact
    model.fit(X, y, **catboost_fit_params)
    mlflow.catboost.log_model(
        model, 
        artifact_path="model",
        input_example=example_input
    )

In [None]:
mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="accident-severity"
)