## Longer term delay analysis

The data used in this analysis ranges from 2021.01.01 to 2022.09.30.
All trains, their positions and their potential delays are
sampled every minute, resulting in ~10 GB of data. This dataset does not
contain the causes of the delays, but it is better suited for analysing
trends in delays over a "long" period of time.
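
The loading loops in this notebook address days by their number of days since 1970-01-01 (see `epoch_to_date`) rather than by calendar date. A minimal sketch of the inverse conversion, assuming the `epoch` column holds exactly this day count; the helper name `date_to_epoch_day` is hypothetical and not part of the notebook:

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)

def date_to_epoch_day(d: date) -> int:
    """Return the number of whole days between 1970-01-01 and d."""
    return (d - UNIX_EPOCH).days

# First day of the dataset expressed as a day number
start_day = date_to_epoch_day(date(2021, 1, 1))
print(start_day)  # 18628
```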

In [None]:
import pandas as pd
import numpy as np

from datetime import datetime, timedelta
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
import dask
#dask.config.set({"optimization.fuse.active": True})

from custom_loader import Loader
from tqdm import tqdm


import re

#import bamboolib
import plotly.express as px

In [None]:
def immutable_sort(list_to_sort:list) -> list:
    res = list_to_sort.copy()
    res.sort()
    return res

def epoch_to_date(day_since_epoch:int) ->  datetime:
    return datetime(1970,1,1) + timedelta(days = day_since_epoch)

## Setting up the connection

The data is stored in a Cassandra database, which is well suited to storing large amounts of data.
The data was scraped by u/gaborath on Reddit, who graciously gave us this sample. He also runs
a cool [website](https://mav-stat.info) on the same topic.

In [None]:
with open('cassandra-credentials.txt','r') as f:
    user = f.readline().strip()
    pw = f.readline().strip()

In [None]:
dask_cassandra_loader = Loader()
keyspace = 'mav'
cluster = ['vm.niif.cloud.bme.hu']

dask_cassandra_loader.connect_to_cassandra(cluster,
                                           keyspace,
                                           username=user,
                                           password=pw, port=11352)
dask_cassandra_loader.connect_to_local_dask()

## Distribution of delays

The mean delay of each train is sorted into 5-minute bins, starting at 0 (exclusive) and ending at 995 minutes.
The resulting distribution can be seen below (only 0 to 250 minutes are displayed for clarity).
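
As a small illustration of the binning step, the sketch below applies `pd.cut` to a handful of made-up mean delays. The sample values are assumptions for illustration only; the bin edges match the `bins` list used in this notebook.

```python
import pandas as pd

# Hypothetical mean delays (minutes) for a handful of trains
mean_delays = pd.Series([0.0, 2.5, 5.0, 7.0, 12.0, 12.5])

# Bin edges 0, 5, 10, ...; pd.cut builds right-closed intervals (0, 5], (5, 10], ...
bins = list(range(0, 1000, 5))
binned = pd.cut(mean_delays, bins)

# Count trains per interval, keeping empty intervals so that counts from
# different chunks can be added together. A delay of exactly 0 falls outside
# (0, 5] and is therefore not counted.
counts = binned.groupby(binned, observed=False).size()
print(counts.head(3))
```

Because the first interval is open on the left, trains with a mean delay of exactly 0 do not appear in the distribution.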

In [None]:
bins = list(range(0,1000,5))
delays_binned = None
#epoch range: 18628-19296 
for i in tqdm(range(18628,19296,5)):
    try:
        table = dask_cassandra_loader.load_cassandra_table('train_data',
                                                 ['elviraid', 'delay',],
                                                           [],
                                                 #[('epoch', 'equal', [19221])],
                                                 [('epoch', [i,i+1,i+2,i+3,i+4])],
                                                 force=False)
        if table.data is None:
            continue
        # Mean delay per train (elviraid) within this 5-day window
        df = table.data.groupby('elviraid').agg({'delay':'mean'}).reset_index()
        # Assign each mean delay to a 5-minute bin
        df = df['delay'].map_partitions(pd.cut, bins)
        tmp = df.compute()
        # Count trains per bin; observed=False keeps empty bins so the sums align
        tmp = tmp.groupby(tmp, observed=False).size()
        delays_binned = tmp if delays_binned is None else delays_binned + tmp
    except Exception as e:
        print(e)

In [None]:
plot_df = pd.DataFrame({'x':delays_binned.index,'y':delays_binned})
plot_df['x'] = plot_df['x'].astype(str)
plot_df.to_csv('data/delays_binned.csv')

In [None]:
plot_df = pd.read_csv('data/delays_binned.csv').head(50)
fig = px.histogram(plot_df,x='x', y='y', title  = 'distribution of mean train delays')
fig.update_yaxes(type='log', title='count, logarithmic')
fig.update_xaxes(title='delay group (minutes)')
fig

## The mean delays for each route

Finding the mean delay for each route is useful for diagnostic purposes.
It can help diagnose problems with:

- infrastructure
- management
- failures in collaboration (with other railway companies)

We suggest rescheduling the routes that have a high average delay or fixing
the underlying problems.
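
Because the data is loaded in windows of a few days, per-route means have to be combined across windows. One way to do this without letting any single window dominate is to accumulate sums and counts and divide once at the end. The sketch below shows the idea on made-up in-memory chunks. The column names `relation` and `delay` follow the table used in this notebook, while the helper `combine_route_means` and the sample values are assumptions for illustration; this is not necessarily how the notebook's own cell computes its result.

```python
import pandas as pd

def combine_route_means(chunks):
    """Exact per-route mean delay over all chunks via running sums and counts."""
    totals = None
    for chunk in chunks:
        part = chunk.groupby('relation')['delay'].agg(['sum', 'count'])
        # add() with fill_value=0 keeps routes that appear in only one chunk
        totals = part if totals is None else totals.add(part, fill_value=0)
    return (totals['sum'] / totals['count']).rename('delay')

# Two hypothetical chunks of raw observations
chunks = [
    pd.DataFrame({'relation': ['A - B', 'A - B', 'C - D'], 'delay': [2.0, 4.0, 10.0]}),
    pd.DataFrame({'relation': ['A - B', 'E - F'], 'delay': [9.0, 1.0]}),
]
route_means = combine_route_means(chunks)
print(route_means.sort_values(ascending=False))
```

Averaging the per-window means instead would weight a window with few observations the same as one with many, which is why the sketch keeps the counts.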

In [None]:
cumul = None
for i in tqdm(range(18628,19296,5)):
    success = False
    # Retry the window until it loads without raising an exception
    while not success:
        try:
            table = dask_cassandra_loader.load_cassandra_table('train_data',
                                                     ['relation', 'delay',],
                                                               [],
                                                     #[('epoch', 'equal', [19221])],
                                                     [('epoch', [i,i+1,i+2,i+3,i+4])],
                                                     force=False)
            if table.data is None:
                # No data for this window: skip it instead of retrying forever
                break
            # Mean delay per route in this window, indexed by 'relation'
            tmp = table.data.groupby('relation').agg({'delay':'mean'}).compute()
            tmp['delay'] = tmp['delay'].fillna(0)
            if cumul is None:
                cumul = tmp
            else:
                # Keep 'relation' as the index so every later window merges by route
                cumul = pd.concat([cumul, tmp]).groupby(level='relation').mean()
            success = True
        except Exception as e:
            print(e)

In [None]:
mean_delay_route = cumul.reset_index()
mean_delay_route['relation'] = mean_delay_route['relation'].apply(lambda x: x.split(' - '))
mean_delay_route['relation'] = mean_delay_route['relation'].apply(immutable_sort)
mean_delay_route['relation'] = mean_delay_route['relation'].astype(str)
mean_delay_route = mean_delay_route.groupby('relation').mean().reset_index()
mean_delay_route = mean_delay_route.sort_values(by=['delay'], ascending=[False])
mean_delay_route.to_csv('data/mean_delay_route.csv')

In [None]:
plot_df = pd.read_csv('data/mean_delay_route.csv').head(10)
print(plot_df)
fig = px.bar(plot_df, x='relation', y='delay', title='Mean delays for each route (Top 10)')
fig.update_yaxes(title = 'mean delay (min)')
fig.update_xaxes(title = 'route')
fig

## Observing seasonality in delays

By creating a time series based on the mean delays, we might be able to observe
seasonality in delays, which can help diagnose the shortcomings of the current
system when it comes to weather conditions.
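
The sketch below illustrates the idea on made-up daily counts: it computes the percentage of delayed trains per day, converts the day number to a calendar date and smooths the series with a simple moving average. The sample values and the 3-day window are assumptions chosen to keep the example short; the column names mirror the ones used in this notebook.

```python
import pandas as pd
from datetime import datetime, timedelta

# Made-up daily counts of delayed / not delayed trains, keyed by days since 1970-01-01
daily = pd.DataFrame({
    'epoch': [18628, 18629, 18630, 18631],
    'delayed_count': [10, 30, 20, 40],
    'not_delayed_count': [90, 70, 80, 60],
})

# Share of delayed trains per day, in percent
total = daily['delayed_count'] + daily['not_delayed_count']
daily['delayed_percentage'] = daily['delayed_count'] / total * 100

# Convert the day number to a calendar date for plotting
daily['date'] = daily['epoch'].apply(
    lambda d: datetime(1970, 1, 1) + timedelta(days=int(d)))

# Simple moving average; the first (window - 1) rows are NaN by construction
daily['sma3'] = daily['delayed_percentage'].rolling(3).mean()
print(daily[['date', 'delayed_percentage', 'sma3']])
```

A longer window smooths out weekly effects, which makes seasonal patterns easier to see, at the cost of losing the first days of the series.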



In [None]:
cumul = None
for i in tqdm(range(18628,19296,5)):
    table = dask_cassandra_loader.load_cassandra_table('train_data',
                                             ['epoch', 'elviraid', 'delay',],
                                                       [],
                                             [('epoch', [i,i+1,i+2,i+3,i+4])],
                                             force=False)
    if table.data is None:
        continue
    df = table.data.groupby(['epoch','elviraid']).agg({'delay':'mean'})
    df['is_delayed'] = df['delay'].map_partitions(lambda x: x > 1)
    df = df.reset_index(0)
    df = df.groupby(['epoch','is_delayed']).size().compute().reset_index(0).rename(columns={0:'count'})
    if cumul is None:
        cumul = df
    else:
        cumul = pd.concat([cumul,df])

In [None]:
delay_percentage = cumul.reset_index().pivot(index='epoch',columns=['is_delayed'])
delay_percentage = delay_percentage['count']
delay_percentage.columns = delay_percentage.columns.ravel()
delay_percentage = delay_percentage.rename(columns={False:'not_delayed_count',True:'delayed_count'})
delay_percentage['delayed_percentage'] = (delay_percentage['delayed_count'] / (delay_percentage['delayed_count']+delay_percentage['not_delayed_count']))*100
delay_percentage.to_csv('data/delay_percentage.csv')

In [None]:
plot_df = pd.read_csv('data/delay_percentage.csv')
plot_df['epoch'] = plot_df['epoch'].apply(epoch_to_date)
plot_df['sma30'] = plot_df['delayed_percentage'].rolling(30).mean()
fig = px.line(plot_df, x='epoch', y = ['delayed_percentage','sma30'],
              title = 'Percentage of trains with mean delays longer than 1 minute')
fig.update_yaxes(title = 'percentage of delayed trains')
fig.update_xaxes(title = 'date')
fig

## Trains with high average delays

Trains with high average delays might be in bad condition, suggesting they need to be
serviced or retired entirely. However, high average delays might be caused by factors
outside the trains' conditions, which is why we suggest that this data should not be taken
out of context and should be examined in conjunction with the routes that have high delays.

A short investigation into these trains' conditions could reveal the real causes of the delays.
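
A single heavily delayed run can push a train to the top of such a ranking, so it helps to rank only trains with enough recorded runs. The following is a minimal sketch of that filter on made-up data; the threshold `MIN_RUNS`, the output column names and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up per-run mean delays: one row per (trainnumber, elviraid) pair
runs = pd.DataFrame({
    'trainnumber': ['100', '100', '100', '200', '300', '300'],
    'elviraid':    ['a1', 'a2', 'a3', 'b1', 'c1', 'c2'],
    'delay':       [4.0, 6.0, 5.0, 60.0, 1.0, 3.0],
})

# Number of runs and mean delay per train
per_train = runs.groupby('trainnumber').agg(
    run_count=('elviraid', 'count'),
    mean_delay=('delay', 'mean'),
)

MIN_RUNS = 2  # hypothetical threshold; trains with fewer runs are ignored
ranked = (per_train[per_train['run_count'] >= MIN_RUNS]
          .sort_values('mean_delay', ascending=False))
print(ranked)
```

In this example train `200` has the highest mean delay but only one run, so it is excluded from the ranking.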

In [None]:
cumul = None
#19296
for i in tqdm(range(18628,19296,5)):
    table = dask_cassandra_loader.load_cassandra_table('train_data',
                                             ['trainnumber', 'delay','elviraid'],
                                                       [],
                                             [('epoch', [i,i+1,i+2,i+3,i+4])],
                                             force=False)
    if table.data is None:
        continue
    df = table.data.groupby(['trainnumber','elviraid']).agg({'delay':'mean'})
    df = df.reset_index(0).compute()
    if cumul is None:
        cumul = df
    else:
        cumul = pd.concat([cumul,df])

In [None]:
delays_per_train = cumul.groupby('trainnumber').agg({'elviraid':'count','delay':'mean'})
delays_per_train = delays_per_train.sort_values(by=['delay'],ascending=[False])
delays_per_train.to_csv('data/delays_per_train.csv')

In [None]:
plot_df = pd.read_csv('data/delays_per_train.csv')
plot_df = plot_df[plot_df['elviraid']>10].head(10).reset_index()
print(plot_df)
fig = px.bar(plot_df, x='trainnumber', y='delay', title='Mean delays for each train')
fig.update_yaxes(title = 'mean delay (min)')
fig.update_xaxes(title = 'train number')
fig