## Longer-term delay analysis

The data used in this analysis ranges from 2021-01-01 to 2022-09-30.
The positions and potential delays of all trains are sampled every
minute, resulting in ~10 GB of data. This dataset does not contain the
causes of the delays, but it is better suited for analysing trends in
delays over a long period of time.
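The loading cells in this notebook address days by an integer `epoch` value. A minimal sketch of the conversion, assuming (as the `epoch_to_date` helper in this notebook suggests) that `epoch` counts whole days since 1970-01-01; the function names here are illustrative only:

```python
from datetime import date, timedelta

EPOCH_START = date(1970, 1, 1)

def date_to_epoch_day(d: date) -> int:
    """Number of whole days between 1970-01-01 and d."""
    return (d - EPOCH_START).days

def epoch_day_to_date(day: int) -> date:
    """Inverse of date_to_epoch_day."""
    return EPOCH_START + timedelta(days=day)

# Epoch days for the two dates named in the text above.
first_day = date_to_epoch_day(date(2021, 1, 1))
last_day = date_to_epoch_day(date(2022, 9, 30))
print(first_day, last_day)
```

Working with plain integers makes it easy to request several consecutive days in one query, for example `[i, i+1, i+2, i+3, i+4]`.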

In [1]:
import pandas as pd
import numpy as np

from datetime import datetime, timedelta
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
import dask
#dask.config.set({"optimization.fuse.active": True})

from custom_loader import Loader
from tqdm import tqdm


import re

#import bamboolib
import plotly.express as px

In [2]:
def immutable_sort(list_to_sort: list) -> list:
    # Return a sorted copy, leaving the input list untouched.
    return sorted(list_to_sort)

def epoch_to_date(day_since_epoch: int) -> datetime:
    # Convert a day count since 1970-01-01 into a datetime.
    return datetime(1970, 1, 1) + timedelta(days=day_since_epoch)

## Setting up the connection

The data is stored in a Cassandra database, which is well suited to storing large amounts of data.
It was scraped by u/gaborath on Reddit, who graciously gave us this sample. He also runs
a [website](https://mav-stat.info) on the same topic.

In [3]:
with open('cassandra-credentials.txt','r') as f:
    user = f.readline().strip()
    pw = f.readline().strip()

In [4]:
dask_cassandra_loader = Loader()
keyspace = 'mav'
cluster = ['vm.niif.cloud.bme.hu']

dask_cassandra_loader.connect_to_cassandra(cluster,
                                           keyspace,
                                           username=user,
                                           password=pw, port=11352)
dask_cassandra_loader.connect_to_local_dask()

2022-12-12 22:26:13,640 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-12-12 22:26:13,704 - distributed.scheduler - INFO - State start
2022-12-12 22:26:13,711 - distributed.scheduler - INFO - Clear task state
2022-12-12 22:26:13,714 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:46635
2022-12-12 22:26:13,717 - distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
2022-12-12 22:26:13,769 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:45551'
2022-12-12 22:26:13,799 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:40489'
2022-12-12 22:26:13,832 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:43275'
2022-12-12 22:26:13,871 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:40087'
2022-12-12 22:26:15,788 - distributed.worker - INFO -       Start worker a

## Distribution of delays

The mean delay of each train is assigned to a 5-minute bin. The bin edges are produced by
`range(0, 1000, 5)`, so the bins run from 0 (exclusive) up to 995 minutes.
The resulting distribution can be seen below (only 0 to 250 minutes are displayed for clarity).
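As a small illustration of the binning step, the sketch below applies `pd.cut` to a handful of made-up mean delays (the values are hypothetical, not taken from the dataset). With the default settings the intervals are closed on the right, so a delay of exactly 0 falls outside the first bin and becomes `NaN`.

```python
import pandas as pd

# Hypothetical per-train mean delays in minutes.
mean_delays = pd.Series([0.0, 0.5, 3.2, 5.0, 5.1, 12.7, 250.0])

bins = list(range(0, 1000, 5))      # edges 0, 5, ..., 995
binned = pd.cut(mean_delays, bins)  # right-closed intervals: (0, 5], (5, 10], ...

# Count how many trains fall into each bin; the NaN entry (delay == 0) is not counted.
counts = binned.value_counts(sort=False)
print(counts[counts > 0])
```

Because every batch is cut with the same edges, the per-bin counts from different batches can simply be added together.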

In [None]:
bins = list(range(0,1000,5))
delays_binned = None
# epoch range: 18628-19296, loaded in batches of 5 days
for i in tqdm(range(18628,19296,5)):
    try:
        table = dask_cassandra_loader.load_cassandra_table('train_data',
                                                 ['elviraid', 'delay',],
                                                           [],
                                                 #[('epoch', 'equal', [19221])],
                                                 [('epoch', [i,i+1,i+2,i+3,i+4])],
                                                 force=False)
        if table.data is None:
            continue
        # mean delay per train (elviraid), then assign each mean to a 5-minute bin
        df = table.data.groupby('elviraid').agg({'delay':'mean'}).reset_index()
        df = df['delay'].map_partitions(pd.cut, bins)
        tmp = df.compute()
        # observed=False keeps empty bins, so the counts of all batches share one index
        tmp = tmp.groupby(tmp, observed=False).size()
        # accumulate the per-bin counts across batches
        if delays_binned is None:
            delays_binned = tmp
        else:
            delays_binned = delays_binned + tmp
    except Exception as e:
        print(e)

In [None]:
plot_df = pd.DataFrame({'x':delays_binned.index,'y':delays_binned})
plot_df['x'] = plot_df['x'].astype(str)
plot_df.to_csv('data/delays_binned.csv')

In [None]:
plot_df = pd.read_csv('data/delays_binned.csv').head(50)
fig = px.histogram(plot_df,x='x', y='y', title  = 'distribution of mean train delays')
fig.update_yaxes(type='log', title='count, logarithmic')
fig.update_xaxes(title='delay group (minutes)')
fig

## The mean delays for each route

Finding the mean delays for each route is useful for diagnostic purposes.
It can help diagnose problems with:

- infrastructure
- management
- failures in collaboration (with other railway companies)

We suggest rescheduling the routes that have a high average delay or fixing
the underlying problems.
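When the data is processed in batches, one way to obtain a per-route mean in which every sample has equal weight is to accumulate sums and counts per route and divide only at the end. The sketch below shows this on two tiny in-memory batches; the route names, the delays and the `combine_batches` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

def combine_batches(batches):
    """Combine per-batch (sum, count) tables into one overall mean per route."""
    totals = None
    for batch in batches:
        stats = batch.groupby('relation')['delay'].agg(['sum', 'count'])
        # fill_value=0 handles routes that appear in only one of the two tables
        totals = stats if totals is None else totals.add(stats, fill_value=0)
    return (totals['sum'] / totals['count']).rename('delay')

# Hypothetical batches of (route, delay in minutes) samples.
batch_1 = pd.DataFrame({'relation': ['A - B', 'A - B', 'C - D'],
                        'delay': [10.0, 20.0, 4.0]})
batch_2 = pd.DataFrame({'relation': ['A - B', 'E - F'],
                        'delay': [0.0, 6.0]})

overall = combine_batches([batch_1, batch_2])
print(overall)
```

For route `A - B` this gives 30 / 3 = 10 minutes, whereas averaging the two batch means (15 and 0) would give 7.5, because the single sample in the second batch would count as much as the two samples in the first.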

In [None]:
cumul = None
for i in tqdm(range(18628,19296,5)):
    success = False
    # The Cassandra server times out intermittently, so retry each batch until it loads.
    while not success:
        try:
            table = dask_cassandra_loader.load_cassandra_table('train_data',
                                                     ['relation', 'delay',],
                                                               [],
                                                     #[('epoch', 'equal', [19221])],
                                                     [('epoch', [i,i+1,i+2,i+3,i+4])],
                                                     force=False)
            if table.data is None:
                continue
            df = table.data.groupby('relation').agg({'delay':'mean'})
            # Keep 'relation' as the index so successive batches align in the concat below.
            tmp = df.compute()
            tmp['delay'] = tmp['delay'].fillna(0)
            if cumul is None:
                cumul = tmp
            else:
                # Running mean of batch means: batches are not weighted by sample count.
                cumul = pd.concat([cumul, tmp]).groupby(level=0).mean()
            success = True
        except Exception as e:
            print(e)

Key:       ('read_data-aggregate-chunk-0a56a82bda73b2a9407e99e76a3f93d3-35eefbe93ded249e48756b112b3bbf9a', 2)
Function:  execute_task
args:      ((subgraph_callable-2b6f4a28-33dc-468c-acb9-a63eff8a72d7, 'relation', (<function Table.__read_data at 0x7f5209c8a3a0>, 'SELECT relation, delay \nFROM train_data \nWHERE epoch=18667 ALLOW FILTERING', ['vm.niif.cloud.bme.hu'], 'mav', '<user>', '<password>', 11352)))
kwargs:    {}
Exception: 'NoHostAvailable(\'Unable to connect to any servers\', {\'193.225.24.253:11352\': OSError(None, "Tried connecting to [(\'193.225.24.253\', 11352)]. Last error: timed out")})'

('Unable to connect to any servers', {'193.225.24.253:11352': OSError(None, "Tried connecting to [('193.225.24.253', 11352)]. Last error: timed out")})

load_cassandra_table failed: 

...
 49%|████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                      | 65/134 [14:16<13:46, 11.98s/it]

In [None]:
mean_delay_route = cumul.reset_index()
mean_delay_route['relation'] = mean_delay_route['relation'].apply(lambda x: x.split(' - '))
mean_delay_route['relation'] = mean_delay_route['relation'].apply(immutable_sort)
mean_delay_route['relation'] = mean_delay_route['relation'].astype(str)
mean_delay_route = mean_delay_route.groupby('relation').mean().reset_index()
mean_delay_route = mean_delay_route.sort_values(by=['delay'], ascending=[False])
mean_delay_route.to_csv('data/mean_delay_route.csv')

In [None]:
plot_df = pd.read_csv('data/mean_delay_route.csv').head(10)
print(plot_df)
fig = px.bar(plot_df, x='relation', y='delay', title='Mean delays for each route (Top 10)')
fig.update_yaxes(title = 'mean delay (min)')
fig.update_xaxes(title = 'route')
fig

## Observing seasonality in delays

By creating a time series of the daily share of delayed trains (trains with a mean
delay above 1 minute), we might be able to observe seasonality in delays, which can
help diagnose the shortcomings of the current system when it comes to weather conditions.
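The shape of this computation can be sketched on synthetic data: flag each train as delayed when its mean delay exceeds 1 minute, take the daily share of delayed trains, and smooth it with a 30-day simple moving average. The records below are randomly generated and purely illustrative; the column names are assumptions chosen to mirror the cells that follow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range('2021-01-01', periods=120, freq='D')

# Hypothetical records: 50 trains per day, each with a mean delay in minutes.
records = pd.DataFrame({
    'date': days.repeat(50),
    'delay': rng.exponential(scale=2.0, size=len(days) * 50),
})
records['is_delayed'] = records['delay'] > 1

# Daily percentage of delayed trains.
daily = (records.groupby('date')['is_delayed'].mean() * 100).rename('delayed_percentage').to_frame()

# 30-day simple moving average; the first 29 days have no full window and stay NaN.
daily['sma30'] = daily['delayed_percentage'].rolling(30).mean()
print(daily.tail())
```

The moving average suppresses day-to-day noise, which makes slower seasonal movements easier to see in a line plot.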



In [None]:
cumul = None
for i in tqdm(range(18628,19296,5)):
    table = dask_cassandra_loader.load_cassandra_table('train_data',
                                             ['epoch', 'elviraid', 'delay',],
                                                       [],
                                             [('epoch', [i,i+1,i+2,i+3,i+4])],
                                             force=False)
    if table.data is None:
        continue
    df = table.data.groupby(['epoch','elviraid']).agg({'delay':'mean'})
    df['is_delayed'] = df['delay'].map_partitions(lambda x: x > 1)
    df = df.reset_index(0)
    df = df.groupby(['epoch','is_delayed']).size().compute().reset_index(0).rename(columns={0:'count'})
    if cumul is None:
        cumul = df
    else:
        cumul = pd.concat([cumul,df])

In [None]:
delay_percentage = cumul.reset_index().pivot(index='epoch',columns=['is_delayed'])
delay_percentage = delay_percentage['count']
delay_percentage.columns = delay_percentage.columns.ravel()
delay_percentage = delay_percentage.rename(columns={False:'not_delayed_count',True:'delayed_count'})
delay_percentage['delayed_percentage'] = (delay_percentage['delayed_count'] / (delay_percentage['delayed_count']+delay_percentage['not_delayed_count']))*100
delay_percentage.to_csv('data/delay_percentage.csv')

In [None]:
plot_df = pd.read_csv('data/delay_percentage.csv')
plot_df['epoch'] = plot_df['epoch'].apply(epoch_to_date)
plot_df['sma30'] = plot_df['delayed_percentage'].rolling(30).mean()
fig = px.line(plot_df, x='epoch', y = ['delayed_percentage','sma30'],
              title = 'Percentage of trains with mean delays longer than 1 minute')
fig.update_yaxes(title = 'percentage of delayed trains')
fig.update_xaxes(title = 'date')

## Trains with high average delays

Trains with high average delays might be in bad condition, suggesting they need to be
serviced or retired entirely. However, high average delays might also be caused by factors
unrelated to the trains' condition, which is why we suggest that this data should not be taken
out of context and should be examined in conjunction with the routes that have high delays.

A short investigation into the condition of these trains could reveal the real causes of the delays.
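A train that ran only a handful of times can top such a ranking because of a single bad day, so it helps to require a minimum number of observed runs before ranking. A minimal sketch on made-up data; the `top_delayed_trains` helper, its thresholds and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical per-run mean delays: one row per (train number, run id).
runs = pd.DataFrame({
    'trainnumber': ['100', '100', '100', '200', '300', '300'],
    'elviraid': ['a', 'b', 'c', 'd', 'e', 'f'],
    'delay': [2.0, 4.0, 6.0, 90.0, 10.0, 20.0],
})

def top_delayed_trains(runs: pd.DataFrame, min_runs: int, n: int) -> pd.DataFrame:
    """Rank trains by mean delay, ignoring trains with fewer than min_runs runs."""
    per_train = runs.groupby('trainnumber').agg(
        runs=('elviraid', 'count'),
        delay=('delay', 'mean'),
    )
    reliable = per_train[per_train['runs'] >= min_runs]
    return reliable.sort_values('delay', ascending=False).head(n)

top = top_delayed_trains(runs, min_runs=2, n=2)
print(top)
```

Here train `200` has the largest mean delay but is excluded because it was observed only once.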

In [None]:
cumul = None
#19296
for i in tqdm(range(18628,19296,5)):
    table = dask_cassandra_loader.load_cassandra_table('train_data',
                                             ['trainnumber', 'delay','elviraid'],
                                                       [],
                                             [('epoch', [i,i+1,i+2,i+3,i+4])],
                                             force=False)
    if table.data is None:
        continue
    df = table.data.groupby(['trainnumber','elviraid']).agg({'delay':'mean'})
    df = df.reset_index(0).compute()
    if cumul is None:
        cumul = df
    else:
        cumul = pd.concat([cumul,df])

In [None]:
delays_per_train = cumul.groupby('trainnumber').agg({'elviraid':'count','delay':'mean'})
delays_per_train = delays_per_train.sort_values(by=['delay'],ascending=[False])
delays_per_train.to_csv('data/delays_per_train.csv')

In [None]:
plot_df = pd.read_csv('data/delays_per_train.csv')
plot_df = plot_df[plot_df['elviraid']>10].head(10).reset_index()
print(plot_df)
fig = px.bar(plot_df, x='trainnumber', y='delay', title='Mean delays for each train')
fig.update_yaxes(title = 'mean delay (min)')
fig.update_xaxes(title = 'train number')
fig