# Delay cause prediction

Using the data collected since 2022.10.05, we tried to build a model that predicts the cause of a delay from information that is available before the journey.
Because "no delay" is one of the possible predictions, the model also predicts whether a train will be late at all. MAV only reports causes for delays longer than
about 5-6 minutes, so a "no delay" prediction should be read as "no delay above that threshold", not as "exactly on time".

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from imblearn.over_sampling import RandomOverSampler

import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

random_state = 42
np.random.seed(random_state)

Reading the prepared training data from disk. The data was created by running the query in the "scraped-data-eda" notebook.

In [None]:
# read the dataset
df = pd.read_csv('data/training_data.csv')
df

## Preprocessing the data
The model uses the time of day (hour only), the route name, and the train number as input. The target variable is the delay cause.
Although the train number is categorical data without an inherent order, we decided not to use one-hot encoding, because sparsely encoding the huge number of trains drastically enlarged our dataset.
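
One caveat when integer-encoding with Python's built-in `hash`: for strings, its value is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the resulting codes can differ between sessions. Below is a minimal sketch of a session-independent alternative, assuming that codes assigned in order of first appearance are acceptable. `encode_labels` is a hypothetical helper that is not part of the notebook, and the route names are placeholders for illustration.

```python
def encode_labels(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    mapping = {}
    # setdefault stores len(mapping) only when the value has not been seen before
    codes = [mapping.setdefault(value, len(mapping)) for value in values]
    return codes, mapping


# placeholder route names, for illustration only
routes = ["route A", "route B", "route A", "route C"]
codes, mapping = encode_labels(routes)
print(codes)    # the same input always yields the same codes, in any session
print(mapping)
```

pandas offers the same first-appearance encoding through `pd.factorize`, which returns the codes together with the unique values.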

In [None]:
# drop the delay column
df = df.drop(columns=['delay'])

# transform timestamp to hour
df['timestamp'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).hour)

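# encode the non-numeric values as integers
# note: hash() of a str is salted per Python process unless PYTHONHASHSEED is set,
# so these codes are only stable within a single session
df['relation'] = df['relation'].apply(hash)
df['train_number'] = df['train_number'].apply(hash)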
df['delay_cause'] = np.where(df['delay_cause'].isna(), 'no delay', df['delay_cause'])
#df = df[~df['delay_cause'].isna()]
categories = pd.Categorical(df['delay_cause'])
df['delay_cause'] = categories.codes

df

Because data collection only started on 2022.10.05, the relatively infrequent causes did not appear often enough in the training dataset for the model to learn the patterns needed to predict them accurately. To compensate for this, we oversampled the less frequent causes.
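
To illustrate the idea, here is a minimal plain-Python sketch of random oversampling: minority-class rows are duplicated at random until every class is as large as the largest one. It is not the imblearn implementation the notebook relies on; `oversample` is a hypothetical helper and the toy rows are made up.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=42):
    """Duplicate randomly chosen rows of each class until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # draw the missing rows with replacement from the existing rows of this class
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# toy data: [hour, encoded route]; values are illustrative only
samples = [[8, 1], [9, 1], [17, 2], [18, 2], [7, 3]]
labels = ["no delay", "no delay", "no delay", "no delay", "technical fault"]
X_bal, y_bal = oversample(samples, labels)
print(Counter(y_bal))
```

Oversampling adds no new information, only repeated rows. The notebook resamples the training split only, after the train/test split, so duplicated rows cannot leak into the test set.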

In [None]:
# split the dataset into an 80% training and a 20% test set
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(columns=['delay_cause']), df.delay_cause, test_size=0.2, random_state=random_state)
#X_test, X_valid, Y_test, Y_valid = train_test_split(X_test, Y_test, test_size=0.5, random_state=random_state)

# normalize the values
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
#X_valid = scaler.transform(X_valid)
# the dataset is imbalanced, so we oversample the training set to try to correct this
values, counts = np.unique(Y_train, return_counts=True)
print(pd.DataFrame({'values':values, 'counts':counts}))
# fix the seed so the resampling is reproducible
ros = RandomOverSampler(random_state=random_state)
X_train, Y_train = ros.fit_resample(X_train, Y_train)
values, counts = np.unique(Y_train, return_counts=True)
print(pd.DataFrame({'values':values, 'counts':counts}))

## Model building
We tried many methods (KNN, Neural Network, Naive Bayes, SVM), but RandomForest gave the best results.

In [None]:
# create a random forest classifier model
model = RandomForestClassifier(n_estimators=10, verbose=2, random_state=random_state, n_jobs=10)

# train the model
history = model.fit(X_train, Y_train)

## Testing
The model is tested on 20% of the initial dataset.

In [None]:
# mean accuracy on the test set (between 0.0 and 1.0)
score = model.score(X_test, Y_test)
print(f'score: {score}')
# predict delay causes using the trained model
predictions = model.predict(X_test)

## Displaying the results

We checked the performance of the model by creating a confusion matrix and by computing its precision, recall and f1-score for each class.
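
As a reminder of what these metrics measure, here is a minimal plain-Python sketch that computes them per class from true and predicted labels. `per_class_metrics` is a hypothetical helper, the labels are toy values, and undefined ratios fall back to 0.0, which mirrors the `zero_division=0` setting used in the report.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
        fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
        fn = sum(t == label and p != label for t, p in pairs)  # missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# toy labels, for illustration only
y_true = ["no delay", "no delay", "no delay", "fault", "fault"]
y_pred = ["no delay", "no delay", "fault", "no delay", "fault"]
metrics = per_class_metrics(y_true, y_pred)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a confusion matrix normalized by row (by true label), the diagonal entry of each row is that class's recall.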

In [None]:
# map the integer codes back to the delay cause names
rownames = categories.from_codes(predictions, categories=categories.categories)  # predicted labels
colnames = categories.from_codes(Y_test, categories=categories.categories)  # true labels
# create a row-normalized confusion matrix (rows: true labels, columns: predicted labels)
conf = pd.crosstab(colnames, rownames, margins=True, normalize='index')
# conf = conf.apply(lambda column: column / column.iloc[len(column)-1], axis=1)

# drop the margin
#conf = conf.drop(columns=['All'])
conf = conf.drop(conf.tail(1).index)

# show the confusion matrix
conf

**Key observations**

Overall, the model appears to have a significant bias towards predicting "no delay", even with oversampling applied.
While it performs reasonably well on static causes such as delays related to railway conditions and construction work, it struggles with the other causes,
particularly "Delay due to train's technical fault", even though that is the second most frequent cause of delays. This is probably due to the large number of distinct trains.
The model would likely benefit from more data for the infrequent causes.


In [None]:
sns.heatmap(conf)
plt.ylabel('true label')
plt.xlabel('predicted')
plt.title('Confusion matrix (normalized)')
plt.show()

In [None]:
print(classification_report(colnames, rownames, zero_division=0))