Name
Date

# Summary

Public transport companies publish live data about their vehicles, such as when a vehicle arrived at a stop and which time was planned.

In this project, we explore this data.

SBB provides its data live. (Almost) every time a train enters (arrival) or leaves (departure) a railway station, this data, including the expected times (which you can also find in the SBB app), is sent and collected in the "ist-daten" dataset. SBB also provides a dataset with all of that data from the previous day via an API, where you can also download the data as CSV:

Dataset: [Soll/Ist Vergleich Abfahrts-/Ankunftszeiten SBB](https://data.sbb.ch/explore/dataset/ist-daten-sbb/table/?flg=de&dataChart=eyJxdWVyaWVzIjpbeyJjaGFydHMiOlt7InR5cGUiOiJjb2x1bW4iLCJmdW5jIjoiQ09VTlQiLCJ5QXhpcyI6Imxpbmllbl9pZCIsInNjaWVudGlmaWNEaXNwbGF5Ijp0cnVlLCJjb2xvciI6InJhbmdlLUFjY2VudCJ9XSwieEF4aXMiOiJ2ZXJrZWhyc21pdHRlbF90ZXh0IiwibWF4cG9pbnRzIjoyMCwidGltZXNjYWxlIjoiIiwic29ydCI6IiIsInNlcmllc0JyZWFrZG93biI6ImFua3VuZnRzdmVyc3BhdHVuZyIsInN0YWNrZWQiOiJwZXJjZW50IiwiY29uZmlnIjp7ImRhdGFzZXQiOiJpc3QtZGF0ZW4tc2JiIiwib3B0aW9ucyI6eyJmbGciOiJkZSJ9fX1dLCJ0aW1lc2NhbGUiOiIiLCJkaXNwbGF5TGVnZW5kIjp0cnVlLCJhbGlnbk1vbnRoIjp0cnVlfQ%3D%3D)

Of course, this is only a subset of the whole picture in Switzerland: there are other companies besides SBB. We can also travel back in time by using the data from Opentransportdata.

Here are some older datasets with the same kind of data: https://opentransportdata.swiss/de/dataset/istdaten

And here are the files stored as zip files for the years 2016 to 2023 (2016 and 2017 are incomplete): https://opentransportdata.swiss/de/ist-daten-archiv/

# Ideas
* Are early arrivals or departures related to delays?
* Where does a train pick up a lot of delay, and where does it make up a lot of that time again?
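
The second idea could be approached roughly as follows. This is only a minimal sketch on made-up data: the columns `stop_sequence` and `delay_minutes` are hypothetical and do not exist in the dataset as such, and "time made up" is assumed to mean the largest delay of a journey minus its delay at the last stop.

```python
import pandas as pd

# Made-up example data: journey "A" builds up 8 minutes and ends with 2,
# journey "B" keeps a constant delay of 3 minutes.
toy = pd.DataFrame({
    "Fahrt Bezeichner": ["A", "A", "A", "B", "B"],
    "stop_sequence": [1, 2, 3, 1, 2],          # hypothetical stop order
    "delay_minutes": [0.0, 8.0, 2.0, 3.0, 3.0],  # hypothetical delay in minutes
})

# Make sure the stops of each journey are in travel order.
toy = toy.sort_values(["Fahrt Bezeichner", "stop_sequence"])

# Largest delay and delay at the last stop, per journey.
per_journey = toy.groupby("Fahrt Bezeichner")["delay_minutes"].agg(
    max_delay="max", final_delay="last")

# Time the journey made up between its worst point and its last stop.
per_journey["recovered"] = per_journey["max_delay"] - per_journey["final_delay"]
print(per_journey)
```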

# Imports

In [2]:
# Import numpy and pandas:
import numpy as np
import pandas as pd

# Show all of our columns in this Jupyter notebook:
pd.set_option('display.max_columns', None)

We define variables holding the column names in order to avoid hard-coded column names within the code and to be able to rename the columns if we want to:

In [8]:
CSV_FILE_PATH = '../data/sbb_data/2023-03-21_ist-daten-sbb.csv'

COLUMN_ARRIVAL_TIME = 'Ankunftszeit'
COLUMN_ARRIVAL_PROGNOSE = 'An Prognose'
COLUMN_DEPARTURE_TIME = 'Abfahrtszeit'
COLUMN_DEPARTURE_PROGNOSE = 'Ab Prognose'
COLUMN_DELAY_ARRIVAL = 'delay_arrival'
COLUMN_DELAY_DEPARTUE = 'delay_departure'

COLUMN_VEHICLE_TYPE = 'Verkehrsmittel Text'
COLUMN_JOURNEY_ID = "Fahrt Bezeichner"



# Data acquisition

In [3]:
date_format = "%Y-%m-%d %H:%M:%S"


COLUMNS_WITH_DATE = [COLUMN_ARRIVAL_TIME,
                     COLUMN_ARRIVAL_PROGNOSE,
                     COLUMN_DEPARTURE_TIME,
                     COLUMN_DEPARTURE_PROGNOSE]

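# Note: `date_parser` is deprecated as of pandas 2.0; there, `date_format=date_format`
# can be passed to `read_csv` instead.
df = pd.read_csv(CSV_FILE_PATH,
                 sep=';',
                 parse_dates=COLUMNS_WITH_DATE,
                 date_parser=lambda x: pd.to_datetime(x, format=date_format))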


# EDA
(exploratory data analysis)

In [None]:
df.shape

As we can see, we have 25 columns and 63065 rows.

In [None]:
df.info()

According to `info`, "Umlauf ID" is the only column with no valid entries.

Most columns have the dtype `object` rather than a dedicated string dtype.

The following columns have missing values:
* Produkt ID
* Umlauf ID
* Ankunftszeit
* An Prognose
* An Prognose Status
* Abfahrtszeit
* Ab Prognose
* Ab Prognose Status
* Geoposition
* lod
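
A list like the one above can also be derived programmatically instead of being read off the `info` output. The following is a minimal sketch on a made-up frame; the values are invented and the frame `toy` only stands in for the real `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset (values are invented).
toy = pd.DataFrame({
    "Ankunftszeit": pd.to_datetime(
        ["2023-03-21 08:00:00", None, "2023-03-21 09:30:00"]),
    "Umlauf ID": [np.nan, np.nan, np.nan],
    "Linien Text": ["S1", "IC5", "S1"],
})

# Count the missing entries per column and keep only the affected columns.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]

# Columns that do not contain a single valid entry.
empty_columns = toy.columns[toy.isna().all()].tolist()

print(columns_with_missing)
print(empty_columns)
```

Applied to the real `df`, the same two expressions would list the columns with missing values and the completely empty ones.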

In [None]:
df.dtypes

## Nominal and ordinal data:
Here are some examples of nominal and ordinal data.

In the following output we see that
* for "Verkehrsmittel Text" we have 14 different categories. The most frequent one seems to be "S", as in "S-Bahn". However, these categories aren't that consistent.
* we have many distinct timestamps for Abfahrtszeit (1301) and Ankunftszeit (1301). A day has 1,440 minutes, so we have entries for most minutes of the day. "An Prognose" has far more distinct values, because this data seems to be accurate to the second.

In [None]:
# nominal data:
df[[COLUMN_VEHICLE_TYPE]].value_counts()
# df[['Verkehrsmittel Text']].value_counts()

# # ordinal data:
# df[['Abfahrtszeit']].value_counts()
# df[['Ankunftszeit']].value_counts()
# df[['An Prognose']].value_counts()

# We'd get discrete data if we added a column "Verspätung Ankunft" (delay), computed as
# the difference between "Ankunftszeit" and "An Prognose".
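
The number of distinct values per column, as discussed above, can be obtained in one call with `nunique`. This is a minimal sketch on invented data that mimics the pattern of minute-precise and second-precise timestamps; `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Invented data: "Ankunftszeit" is precise to the minute,
# "An Prognose" is precise to the second.
toy = pd.DataFrame({
    "Verkehrsmittel Text": ["S", "S", "IC", "IR"],
    "Ankunftszeit": pd.to_datetime([
        "2023-03-21 08:00:00", "2023-03-21 08:00:00",
        "2023-03-21 08:01:00", "2023-03-21 08:01:00"]),
    "An Prognose": pd.to_datetime([
        "2023-03-21 08:00:12", "2023-03-21 08:00:47",
        "2023-03-21 08:01:05", "2023-03-21 08:01:30"]),
})

# Number of distinct values in every column.
distinct_counts = toy.nunique()
print(distinct_counts)
```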


In [None]:
df.sample(10)

# Data cleansing & data transformation

In [6]:
# Rename the columns to English names. The walrus operator (:=) updates each
# constant to the new name at the same time, so later cells keep working.
df = df.rename(columns={
    COLUMN_ARRIVAL_TIME: (COLUMN_ARRIVAL_TIME:='arrival'),
    COLUMN_ARRIVAL_PROGNOSE: (COLUMN_ARRIVAL_PROGNOSE:='arrival_prognose'),
    COLUMN_DEPARTURE_TIME: (COLUMN_DEPARTURE_TIME:='departure'),
    COLUMN_DEPARTURE_PROGNOSE: (COLUMN_DEPARTURE_PROGNOSE:='departure_prognose'),
})



In [5]:
df.sort_values(by=COLUMN_JOURNEY_ID, inplace=True)

We first create the columns delay_arrival and delay_departure so that we can do calculations on the delays.

In [None]:
def create_column_for_arrival_and_departure(
        column_arrival_time,
        column_arrival_prognose,
        column_departure_time,
        column_departure_prognose,
        new_column_name_arrival_diff,
        new_column_name_departure_diff):
    """Add the arrival and departure delay columns to the global `df`."""
    no_delay = pd.Timedelta(0)

    # Calculate the delays for arrival and departure.
    # Say there's no delay if we don't have time information (NaT):
    df[new_column_name_arrival_diff] = (
        df[column_arrival_time] - df[column_arrival_prognose]).fillna(no_delay)
    df[new_column_name_departure_diff] = (
        df[column_departure_time] - df[column_departure_prognose]).fillna(no_delay)

    # Since the real times are not more precise than 1 minute, we only want to have a delay
    # when the system registered that the vehicle is delayed (at least 1 minute):
    df.loc[df[new_column_name_arrival_diff] < pd.Timedelta(minutes=1),
           new_column_name_arrival_diff] = no_delay
    df.loc[df[new_column_name_departure_diff] < pd.Timedelta(minutes=1),
           new_column_name_departure_diff] = no_delay


create_column_for_arrival_and_departure(column_arrival_time=COLUMN_ARRIVAL_TIME,
                                        column_arrival_prognose=COLUMN_ARRIVAL_PROGNOSE,
                                        column_departure_time=COLUMN_DEPARTURE_TIME,
                                        column_departure_prognose=COLUMN_DEPARTURE_PROGNOSE,
                                        new_column_name_arrival_diff=COLUMN_DELAY_ARRIVAL,
                                        new_column_name_departure_diff=COLUMN_DELAY_DEPARTUE)
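
To make the effect of the calculation above easier to follow, here is a minimal sketch on three invented rows. It uses the same direction of subtraction as the notebook (arrival minus prognosis); the frame `toy` and the extra column `delay_arrival_minutes` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Three invented rows: a real delay, a difference below one minute,
# and a row without an arrival time.
toy = pd.DataFrame({
    "arrival": pd.to_datetime(
        ["2023-03-21 08:05:00", "2023-03-21 08:10:00", None]),
    "arrival_prognose": pd.to_datetime(
        ["2023-03-21 08:02:10", "2023-03-21 08:09:30", "2023-03-21 08:20:00"]),
})

no_delay = pd.Timedelta(0)

# Difference of the two timestamps; missing values count as "no delay".
delay = (toy["arrival"] - toy["arrival_prognose"]).fillna(no_delay)

# Keep only differences of at least one minute, set everything else to zero.
delay = delay.where(delay >= pd.Timedelta(minutes=1), no_delay)

toy["delay_arrival"] = delay
# A plain number of minutes is often easier to aggregate and plot.
toy["delay_arrival_minutes"] = delay.dt.total_seconds() / 60
print(toy)
```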


And something similar if the vehicle arrives or departs ahead of schedule.

In [None]:
# Calculate how early the arrival and the departure were.
# The column constants are used because the columns may have been renamed above.
# Say the vehicle is not early if we don't have time information (NaT):
DEFAULT_TIME = pd.Timedelta(hours=0)
df['early_arrival'] = (
    df[COLUMN_ARRIVAL_PROGNOSE] - df[COLUMN_ARRIVAL_TIME]).fillna(DEFAULT_TIME)
df['early_departure'] = (
    df[COLUMN_DEPARTURE_PROGNOSE] - df[COLUMN_DEPARTURE_TIME]).fillna(DEFAULT_TIME)

# Since the real times are not more precise than 1 minute, we only count a vehicle as early
# when the difference is at least 1 minute:
df.loc[df['early_arrival'] < pd.Timedelta(minutes=1), 'early_arrival'] = DEFAULT_TIME
df.loc[df['early_departure'] < pd.Timedelta(minutes=1), 'early_departure'] = DEFAULT_TIME

df[['early_arrival',
    'early_departure']]

# Visualization

In [None]:
# Mean arrival delay per journey:
df.groupby(COLUMN_JOURNEY_ID).agg({COLUMN_DELAY_ARRIVAL: 'mean'})

Possible data visualizations could be:

* A map showing at which locations (we have the GPS coordinates) we measured the most delay, concentrating on the high delays only.
* How many minutes of delay does each company (like SBB) have per day on average? (For this, we'd use multiple CSVs for multiple days.)
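
The aggregation behind the second idea could look roughly like this. It is a minimal sketch on invented numbers: the two small frames stand in for two daily CSV files, and the column names "Betriebstag" and "Betreiber Abk" are assumptions about how the day and the company are labelled.

```python
import pandas as pd

# Two invented days of data, standing in for two daily CSV files.
day_1 = pd.DataFrame({
    "Betriebstag": ["2023-03-21"] * 3,        # assumed name of the day column
    "Betreiber Abk": ["SBB", "SBB", "BLS"],   # assumed name of the company column
    "delay_arrival": pd.to_timedelta([2, 4, 1], unit="min"),
})
day_2 = pd.DataFrame({
    "Betriebstag": ["2023-03-22"] * 3,
    "Betreiber Abk": ["SBB", "BLS", "BLS"],
    "delay_arrival": pd.to_timedelta([6, 3, 5], unit="min"),
})

# Combine all days into one frame and express the delay in minutes.
all_days = pd.concat([day_1, day_2], ignore_index=True)
all_days["delay_minutes"] = all_days["delay_arrival"].dt.total_seconds() / 60

# Total delay minutes per company and day ...
total_per_day = all_days.groupby(
    ["Betreiber Abk", "Betriebstag"])["delay_minutes"].sum()
# ... and from that the average delay minutes per day for each company.
mean_per_day = total_per_day.groupby(level="Betreiber Abk").mean()
print(mean_per_day)
```

The resulting series has one value per company and could be passed directly to a bar chart.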

# Data storage (optional)

# References