Name
Date

# Summary

Public transport companies provide live data about their vehicle like where they arrived on which time and what time was planned.

In this project, we are exploring this data.

SBB provides it's data live. (Almost) always when a train enters (arrival) and leaves (departure) a railway station, this data including the expected data (that you can also find in the app by SBB) is sent and collected in the "ist-daten" dataset. SBB provides also an dataset with all that data from the last day via an API where you can also download the data as CSV:

Dataset: [Soll/Ist Vergleich Abfahrts-/Ankunftszeiten SBB](https://data.sbb.ch/explore/dataset/ist-daten-sbb/table/?flg=de&dataChart=eyJxdWVyaWVzIjpbeyJjaGFydHMiOlt7InR5cGUiOiJjb2x1bW4iLCJmdW5jIjoiQ09VTlQiLCJ5QXhpcyI6Imxpbmllbl9pZCIsInNjaWVudGlmaWNEaXNwbGF5Ijp0cnVlLCJjb2xvciI6InJhbmdlLUFjY2VudCJ9XSwieEF4aXMiOiJ2ZXJrZWhyc21pdHRlbF90ZXh0IiwibWF4cG9pbnRzIjoyMCwidGltZXNjYWxlIjoiIiwic29ydCI6IiIsInNlcmllc0JyZWFrZG93biI6ImFua3VuZnRzdmVyc3BhdHVuZyIsInN0YWNrZWQiOiJwZXJjZW50IiwiY29uZmlnIjp7ImRhdGFzZXQiOiJpc3QtZGF0ZW4tc2JiIiwib3B0aW9ucyI6eyJmbGciOiJkZSJ9fX1dLCJ0aW1lc2NhbGUiOiIiLCJkaXNwbGF5TGVnZW5kIjp0cnVlLCJhbGlnbk1vbnRoIjp0cnVlfQ%3D%3D)

But sure, this is only a subset of the whole picture in Switzerland. There are other companies than SBB as well! We can also travel back in time. To do so, we can use Opentransportdata's data.

Here are some some older dataset with the same kind of data: https://opentransportdata.swiss/de/dataset/istdaten

And here are the files stored as zip file for the years 2016 - 2023 (2016 and 2017 are incomplete): https://opentransportdata.swiss/de/ist-daten-archiv/

# Ideas
* Is early related to delay?
* Where has a train a lot of delay and makes a lot of time good again?

# Imports

In [27]:
# 1. import of numpy and pandas:
import numpy as np
import pandas as pd

# Show all of our columns in this Jupyter notebook:
pd.set_option('display.max_columns', None)

# Data acquisition

In [28]:
date_format = "%Y-%m-%d %H:%M:%S"


COLUMNS_WITH_DATE = ['Ankunftszeit',
                     'An Prognose',
                     'Abfahrtszeit',
                     'Ab Prognose']

df = pd.read_csv('../data/sbb_data/2023-03-21_ist-daten-sbb.csv',
                 sep=';', parse_dates=COLUMNS_WITH_DATE, date_parser=lambda x: pd.to_datetime(x, format=date_format),)


# EDI
(exploratory data analysis)

In [29]:
df.shape

(63065, 25)

As we can see, we have 25 columns and 63065 rows.

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63065 entries, 0 to 63064
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Betriebstag          63065 non-null  object        
 1   Fahrt Bezeichner     63065 non-null  object        
 2   Betreiber ID         63065 non-null  object        
 3   Betreiber Abkürzung  63065 non-null  object        
 4   Betreiber Name       63065 non-null  object        
 5   Produkt ID           62808 non-null  object        
 6   Linie                63065 non-null  int64         
 7   Linien Text          63065 non-null  object        
 8   Umlauf ID            0 non-null      float64       
 9   Verkehrsmittel Text  63065 non-null  object        
 10  Zusatzfahrt TF       63065 non-null  bool          
 11  Fällt aus            63065 non-null  bool          
 12  BPUIC                63065 non-null  int64         
 13  Haltestellen Name    63065 non-

Accordingly to `info`, "Umlauf ID" is the only column with no valid entries.

The dtypes are in most cases `object`s instead of strings.

The following columns have missing values:
* Produkt ID
* Umlauf ID
* Ankunftszeit
* An Prognose
* An Prognose Status
* Abfahrtszeit
* Ab Prognose
* Ab Prognsoe Status
* Geoposition
* lod

In [31]:
df.dtypes

Betriebstag                    object
Fahrt Bezeichner               object
Betreiber ID                   object
Betreiber Abkürzung            object
Betreiber Name                 object
Produkt ID                     object
Linie                           int64
Linien Text                    object
Umlauf ID                     float64
Verkehrsmittel Text            object
Zusatzfahrt TF                   bool
Fällt aus                        bool
BPUIC                           int64
Haltestellen Name              object
Ankunftszeit           datetime64[ns]
An Prognose            datetime64[ns]
An Prognose Status             object
Abfahrtszeit           datetime64[ns]
Ab Prognose            datetime64[ns]
Ab Prognsoe Status             object
Durchfahrt TF                    bool
Ankunftsverspätung               bool
Abfahrtsverspätung               bool
Geoposition                    object
lod                            object
dtype: object

## Nominal and Ordinal data:
Here are some examples for Nominal and Ordinal data.

In the following output we see that
* Vor the "Verkehrsmittel Text" we have 14 different categories or companies. The most seems to be an "S" like "S-Bahn". But these categories aren't that consistent.
* we have a lot of different dates for Abfahrszeit (1301) and Ankunftszeit (1301) -> A day has 1’440 minutes. We see that for the most of the minutes of the day, we have entries. But compared to "An Prognose", we have much less because this data seems to be accurate in seconds.

In [32]:
# nominal data:
df[['Verkehrsmittel Text']].value_counts()
# df[['Verkehrsmittel Text']].value_counts()

# # ordinal data:
# df[['Abfahrtszeit']].value_counts()
# df[['Ankunftszeit']].value_counts()
# df[['An Prognose']].value_counts()

# We'd get discrete date if we add a column "Verspätung Ankunft" (delay) by subtracting "Ankunft" of "An Prognose".


Verkehrsmittel Text
S                      48780
IR                      4171
RE                      3711
IC                      2988
R                       2331
TER                      487
EC                       358
ICE                      122
TGV                       55
RJX                       25
NJ                        20
SN                         8
EXT                        7
RB                         2
dtype: int64

In [33]:
df.sample(10)

Unnamed: 0,Betriebstag,Fahrt Bezeichner,Betreiber ID,Betreiber Abkürzung,Betreiber Name,Produkt ID,Linie,Linien Text,Umlauf ID,Verkehrsmittel Text,Zusatzfahrt TF,Fällt aus,BPUIC,Haltestellen Name,Ankunftszeit,An Prognose,An Prognose Status,Abfahrtszeit,Ab Prognose,Ab Prognsoe Status,Durchfahrt TF,Ankunftsverspätung,Abfahrtsverspätung,Geoposition,lod
36500,2023-03-20,85:11:25787:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,25787,S10,,S,False,False,8505212,Castione-Arbedo,2023-03-20 20:35:00,2023-03-20 20:35:52,REAL,2023-03-20 20:35:00,2023-03-20 20:35:59,REAL,False,False,False,"46.2229216531378, 9.041461806217393",http://lod.opentransportdata.swiss/didok/8505212
7344,2023-03-20,85:11:26028:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,26028,S22,,S,False,False,8500293,Thalbrücke,2023-03-20 08:04:00,2023-03-20 08:04:00,PROGNOSE,2023-03-20 08:04:00,2023-03-20 08:04:30,PROGNOSE,False,False,False,"47.30842205269395, 7.688271815963863",http://lod.opentransportdata.swiss/didok/8500293
33244,2023-03-20,85:11:18261:002,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,18261,S2,,S,False,False,8503221,Siebnen-Wangen,2023-03-20 17:01:00,2023-03-20 17:00:59,REAL,2023-03-20 17:01:00,2023-03-20 17:01:25,REAL,False,False,False,"47.182670697408795, 8.900713603936849",http://lod.opentransportdata.swiss/didok/8503221
24797,2023-03-20,85:11:18103:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,18103,TER,,TER,False,False,8504314,La Chaux-de-Fonds,2023-03-20 06:53:00,2023-03-20 06:53:25,REAL,NaT,NaT,,False,False,False,"47.09837435548107, 6.825982721046246",http://lod.opentransportdata.swiss/didok/8504314
57741,2023-03-20,85:11:20417:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,20417,S24,,S,False,False,8502206,Baar,2023-03-20 05:52:00,2023-03-20 05:52:50,REAL,2023-03-20 05:52:00,2023-03-20 05:53:35,REAL,False,False,False,"47.19536403756796, 8.523271537389256",http://lod.opentransportdata.swiss/didok/8502206
18557,2023-03-20,85:11:24522:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,24522,S5,,S,False,False,8518452,Prilly-Malley,2023-03-20 08:32:00,2023-03-20 08:33:06,REAL,2023-03-20 08:33:00,2023-03-20 08:33:52,REAL,False,False,False,"46.52669197652302, 6.602628939250655",http://lod.opentransportdata.swiss/didok/8518452
50870,2023-03-20,85:11:18624:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,18624,S6,,S,False,False,8503528,Otelfingen,2023-03-20 07:26:00,2023-03-20 07:27:33,REAL,2023-03-20 07:26:00,2023-03-20 07:27:59,REAL,False,False,False,"47.45493620253028, 8.387532432091916",http://lod.opentransportdata.swiss/didok/8503528
42812,2023-03-20,85:11:96756:002,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,96756,SL6,,S,False,False,8501003,Satigny,2023-03-20 18:14:00,2023-03-20 18:13:20,REAL,2023-03-20 18:14:00,2023-03-20 18:14:21,REAL,False,False,False,"46.21422722553378, 6.037665211701825",http://lod.opentransportdata.swiss/didok/8501003
8584,2023-03-20,85:11:24647:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,24647,S6,,S,False,False,8504014,Palézieux,2023-03-20 14:23:00,2023-03-20 14:23:33,REAL,NaT,NaT,,False,False,False,"46.542763593581064, 6.8378752768139",http://lod.opentransportdata.swiss/didok/8504014
24902,2023-03-20,85:11:18199:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,18199,RE,,RE,False,False,8500121,Courfaivre,2023-03-21 01:04:00,2023-03-21 01:03:50,REAL,2023-03-21 01:04:00,2023-03-21 01:04:06,REAL,False,False,False,"47.335083125319805, 7.291165742215083",http://lod.opentransportdata.swiss/didok/8500121


# Data cleansing & data transformation

In [36]:
df.sort_values(by="Fahrt Bezeichner", inplace=True)

We are first creating the columns delay_arrival and delay_departure in order to be able to do some calculation on the delays.

In [40]:
# Calculate the delays for arrival and and departure:
df['delay_arrival'] = (df['Ankunftszeit'] - df['An Prognose']).fillna(pd.Timedelta(seconds=0))
df['delay_departure'] = (df['Abfahrtszeit'] - df['Ab Prognose']).fillna(pd.Timedelta(seconds=0))

# Since the times real times are not more precise than 1 minute, we want to only have a delay
# when the System registered that it is delayed:
DEFAULT_TIME = pd.Timedelta(hours=0)
df.loc[df['delay_arrival'] < pd.Timedelta(minutes=1), 'delay_arrival'] = DEFAULT_TIME
df.loc[df['delay_departure'] < pd.Timedelta(minutes=1), 'delay_departure'] = DEFAULT_TIME

# Say there's no delay if we don't have time information:
df['delay_arrival'].fillna(pd.Timedelta(hours=0))
df['delay_departure'].fillna(pd.Timedelta(hours=0))

df[['delay_arrival',
    'delay_departure']]

df.loc[df['delay_arrival'] > pd.Timedelta(minutes=0)]  # [['delay_arrival', 'delay_departure']]
df.loc[df['delay_departure'] > pd.Timedelta(minutes=0)].sort_values("Fahrt Bezeichner")  # [['delay_arrival', 'delay_departure']]
df.loc[df['delay_departure'] > pd.Timedelta(minutes=0)].value_counts("Fahrt Bezeichner")

Fahrt Bezeichner
85:11:1704:001     4
85:11:25111:001    4
85:11:314:001      4
85:11:7241:001     4
85:11:23137:001    3
                  ..
85:11:21159:001    1
85:11:21158:001    1
85:11:21150:001    1
85:11:21145:001    1
85:11:992:001      1
Length: 1341, dtype: int64

And something similar if the vehicle arrives or departures ahead of schedule.

In [None]:
# Calculate the early for arrival and and departure:
df['early_arrival'] = (df['An Prognose'] - df['Ankunftszeit']).fillna(pd.Timedelta(seconds=0))
df['early_departure'] = (df['Ab Prognose'] - df['Abfahrtszeit']).fillna(pd.Timedelta(seconds=0))

# Since the times real times are not more precise than 1 minute, we want to only have a early
# when the difference is bigger than 1 minute:
DEFAULT_TIME = pd.Timedelta(hours=0)
df.loc[df['early_arrival'] < pd.Timedelta(minutes=1), 'early_arrival'] = DEFAULT_TIME
df.loc[df['early_departure'] < pd.Timedelta(minutes=1), 'early_departure'] = DEFAULT_TIME

# Say there's no early if we don't have time information:
df['early_arrival'].fillna(DEFAULT_TIME)
df['early_departure'].fillna(DEFAULT_TIME)

df[['early_arrival',
    'early_departure']]

# Visualization

In [None]:
df.groupby('Fahrt Bezeichner').agg({'delay_an': 'mean'})

A possible data visualization could be:

* A map where one can see on which location (we have the GPS coordinates) we measured the most delay (we only concentrate on the high delays).
* Which company (like SBB) has how many minutes delays per day in average? (Therefore, we'd use multiple CSVs for multiple days)

# Data storage (optional)

# References