Name
Date

# Summary

Public transport companies publish live data about their vehicles: where a vehicle stopped, at what time it actually arrived or departed, and what time was planned.

In this project, we are exploring this data.

SBB provides its data live. Almost every time a train arrives at or departs from a railway station, a record with the actual time and the expected time (which you can also find in the SBB app) is sent and collected in the "ist-daten" dataset. SBB also provides a dataset with all of that data from the previous day via an API, where the data can also be downloaded as CSV:

Dataset: [Soll/Ist Vergleich Abfahrts-/Ankunftszeiten SBB](https://data.sbb.ch/explore/dataset/ist-daten-sbb/table/?flg=de&dataChart=eyJxdWVyaWVzIjpbeyJjaGFydHMiOlt7InR5cGUiOiJjb2x1bW4iLCJmdW5jIjoiQ09VTlQiLCJ5QXhpcyI6Imxpbmllbl9pZCIsInNjaWVudGlmaWNEaXNwbGF5Ijp0cnVlLCJjb2xvciI6InJhbmdlLUFjY2VudCJ9XSwieEF4aXMiOiJ2ZXJrZWhyc21pdHRlbF90ZXh0IiwibWF4cG9pbnRzIjoyMCwidGltZXNjYWxlIjoiIiwic29ydCI6IiIsInNlcmllc0JyZWFrZG93biI6ImFua3VuZnRzdmVyc3BhdHVuZyIsInN0YWNrZWQiOiJwZXJjZW50IiwiY29uZmlnIjp7ImRhdGFzZXQiOiJpc3QtZGF0ZW4tc2JiIiwib3B0aW9ucyI6eyJmbGciOiJkZSJ9fX1dLCJ0aW1lc2NhbGUiOiIiLCJkaXNwbGF5TGVnZW5kIjp0cnVlLCJhbGlnbk1vbnRoIjp0cnVlfQ%3D%3D)

Of course, this is only part of the whole picture in Switzerland: there are other companies besides SBB. We can also go back in time. For both, we can use the data from opentransportdata.swiss.

Here are some older datasets with the same kind of data: https://opentransportdata.swiss/de/dataset/istdaten

And here are the files, stored as zip files, for the years 2016 to 2023 (2016 and 2017 are incomplete): https://opentransportdata.swiss/de/ist-daten-archiv/
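
A downloaded archive could be loaded roughly as sketched below. This is only an illustration under assumptions: the helper `read_istdaten_zip` is hypothetical, and it assumes that each archive holds semicolon-separated CSV files with identical columns. The tiny in-memory archive and its rows are made up so that the example runs without a download.

```python
import io
import zipfile

import pandas as pd


def read_istdaten_zip(zip_source):
    """Read every CSV file inside a zip archive into one DataFrame.

    Assumption: all members are semicolon-separated CSVs with the same columns.
    """
    frames = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in sorted(archive.namelist()):
            if not member.lower().endswith(".csv"):
                continue  # skip folders and non-CSV files
            with archive.open(member) as handle:
                frames.append(pd.read_csv(handle, sep=";"))
    return pd.concat(frames, ignore_index=True)


# Made-up in-memory archive standing in for a downloaded zip file:
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("day_1.csv", "Betriebstag;Linien Text\n2023-03-20;S5\n")
    archive.writestr("day_2.csv", "Betriebstag;Linien Text\n2023-03-21;IC6\n")
buffer.seek(0)

archive_df = read_istdaten_zip(buffer)
print(archive_df)
```

For a real archive, `zip_source` would simply be the path of the downloaded zip file.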

# Imports

In [1]:
# 1. import of numpy and pandas:
import numpy as np
import pandas as pd

# Data acquisition

In [7]:
date_format = "%Y-%m-%d %H:%M:%S"


COLUMNS_WITH_DATE = ['Ankunftszeit',
                     'An Prognose',
                     'Abfahrtszeit',
                     'Ab Prognose']

df = pd.read_csv('../data/sbb_data/2023-03-21_ist-daten-sbb.csv',
                 sep=';', parse_dates=COLUMNS_WITH_DATE, date_parser=lambda x: pd.to_datetime(x, format=date_format))
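
Newer pandas versions deprecate the `date_parser` argument. One alternative that should work across versions is to read the CSV first and convert the date columns afterwards with `pd.to_datetime`. The sketch below uses two made-up rows in place of the real file; the names `sample_csv` and `sample_df` are only for illustration.

```python
import io

import pandas as pd

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
COLUMNS_WITH_DATE = ["Ankunftszeit", "An Prognose", "Abfahrtszeit", "Ab Prognose"]

# Two made-up rows standing in for the real CSV file:
sample_csv = io.StringIO(
    "Linien Text;Ankunftszeit;An Prognose;Abfahrtszeit;Ab Prognose\n"
    "S5;;;2023-03-20 17:06:00;2023-03-20 17:06:17\n"
    "S1;2023-03-20 19:12:00;2023-03-20 19:12:59;2023-03-20 19:12:00;2023-03-20 19:13:43\n"
)

sample_df = pd.read_csv(sample_csv, sep=";")
for column in COLUMNS_WITH_DATE:
    # Empty cells become NaT; a value in a different format raises an error.
    sample_df[column] = pd.to_datetime(sample_df[column], format=DATE_FORMAT)

print(sample_df.dtypes)
```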


# EDA
(exploratory data analysis)

In [8]:
df.shape

(63065, 25)

As we can see, we have 25 columns and 63065 rows.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63065 entries, 0 to 63064
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Betriebstag          63065 non-null  object        
 1   Fahrt Bezeichner     63065 non-null  object        
 2   Betreiber ID         63065 non-null  object        
 3   Betreiber Abkürzung  63065 non-null  object        
 4   Betreiber Name       63065 non-null  object        
 5   Produkt ID           62808 non-null  object        
 6   Linie                63065 non-null  int64         
 7   Linien Text          63065 non-null  object        
 8   Umlauf ID            0 non-null      float64       
 9   Verkehrsmittel Text  63065 non-null  object        
 10  Zusatzfahrt TF       63065 non-null  bool          
 11  Fällt aus            63065 non-null  bool          
 12  BPUIC                63065 non-null  int64         
 13  Haltestellen Name    63065 non-

According to `info`, "Umlauf ID" is the only column with no valid entries.

Most columns have the dtype `object`, which is how pandas stores strings by default.
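
If more specific dtypes are wanted, the `object` columns can be converted explicitly. The following is a minimal sketch on made-up rows (the frame `dtype_df` and the station names are only illustrative); which target dtype suits which column is a judgment call, not something the dataset prescribes.

```python
import pandas as pd

# Made-up rows that reuse column names from the dataset:
dtype_df = pd.DataFrame({
    "Betriebstag": ["2023-03-20", "2023-03-20", "2023-03-20"],
    "Verkehrsmittel Text": ["S", "IC", "S"],
    "Haltestellen Name": ["Station A", "Station B", "Station C"],
})

# Day of operation as a real date, assuming the ISO format seen in the sample rows:
dtype_df["Betriebstag"] = pd.to_datetime(dtype_df["Betriebstag"], format="%Y-%m-%d")
# Few distinct values: a categorical column saves memory and documents intent.
dtype_df["Verkehrsmittel Text"] = dtype_df["Verkehrsmittel Text"].astype("category")
# Free text: the dedicated pandas string dtype.
dtype_df["Haltestellen Name"] = dtype_df["Haltestellen Name"].astype("string")

print(dtype_df.dtypes)
```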

The following columns have missing values:
* Produkt ID
* Umlauf ID
* Ankunftszeit
* An Prognose
* An Prognose Status
* Abfahrtszeit
* Ab Prognose
* Ab Prognsoe Status
* Geoposition
* lod
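
A list like the one above can be derived programmatically instead of being read off the `info` output. The sketch below shows the idea on a small made-up frame (`missing_df` is illustrative); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows: one column partly missing, one entirely missing, one complete.
missing_df = pd.DataFrame({
    "Produkt ID": ["Zug", None, "Zug"],
    "Umlauf ID": [np.nan, np.nan, np.nan],
    "Linie": [24566, 24171, 964],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = missing_df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

print(columns_with_missing)
```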

In [10]:
df.dtypes

Betriebstag                    object
Fahrt Bezeichner               object
Betreiber ID                   object
Betreiber Abkürzung            object
Betreiber Name                 object
Produkt ID                     object
Linie                           int64
Linien Text                    object
Umlauf ID                     float64
Verkehrsmittel Text            object
Zusatzfahrt TF                   bool
Fällt aus                        bool
BPUIC                           int64
Haltestellen Name              object
Ankunftszeit           datetime64[ns]
An Prognose            datetime64[ns]
An Prognose Status             object
Abfahrtszeit           datetime64[ns]
Ab Prognose            datetime64[ns]
Ab Prognsoe Status             object
Durchfahrt TF                    bool
Ankunftsverspätung               bool
Abfahrtsverspätung               bool
Geoposition                    object
lod                            object
dtype: object

## Nominal and ordinal data
Here are some examples of nominal and ordinal data.

In the following output we see that
* "Verkehrsmittel Text" has 14 different categories. The most frequent one is "S" (as in "S-Bahn"). The categories are not fully consistent, though.
* "Abfahrtszeit" and "Ankunftszeit" each have 1301 distinct values. A day has 1,440 minutes, so we have entries for most minutes of the day. "An Prognose" has far more distinct values, apparently because it is recorded to the second.
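
The distinct-value counts mentioned above can be obtained with `nunique`, and a quick check of the seconds component shows which columns are recorded to the second. This is a sketch on made-up timestamps (`unique_df` is illustrative), not the notebook's actual computation.

```python
import pandas as pd

# Made-up timestamps: scheduled times on the minute, forecasts to the second.
unique_df = pd.DataFrame({
    "Ankunftszeit": pd.to_datetime(
        ["2023-03-20 13:07:00", "2023-03-20 13:07:00", "2023-03-20 15:43:00"]
    ),
    "An Prognose": pd.to_datetime(
        ["2023-03-20 13:07:09", "2023-03-20 13:07:48", "2023-03-20 15:43:41"]
    ),
})

# Number of distinct values per column:
distinct_counts = unique_df.nunique()
# True for columns where at least one value has a non-zero seconds part:
has_seconds = unique_df.apply(lambda col: (col.dt.second != 0).any())

print(distinct_counts)
print(has_seconds)
```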

In [23]:
# nominal data:
df[['Verkehrsmittel Text']].value_counts()

# # ordinal data:
# df[['Abfahrtszeit']].value_counts()
# df[['Ankunftszeit']].value_counts()
# df[['An Prognose']].value_counts()

# We'd get discrete data if we added a column "Verspätung Ankunft" (arrival delay)
# by subtracting "Ankunftszeit" from "An Prognose".


Verkehrsmittel Text
S                      48780
IR                      4171
RE                      3711
IC                      2988
R                       2331
TER                      487
EC                       358
ICE                      122
TGV                       55
RJX                       25
NJ                        20
SN                         8
EXT                        7
RB                         2
dtype: int64

In [11]:
df.sample(10)

Unnamed: 0,Betriebstag,Fahrt Bezeichner,Betreiber ID,Betreiber Abkürzung,Betreiber Name,Produkt ID,Linie,Linien Text,Umlauf ID,Verkehrsmittel Text,...,An Prognose,An Prognose Status,Abfahrtszeit,Ab Prognose,Ab Prognsoe Status,Durchfahrt TF,Ankunftsverspätung,Abfahrtsverspätung,Geoposition,lod
8492,2023-03-20,85:11:24566:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,24566,S5,,S,...,NaT,,2023-03-20 17:06:00,2023-03-20 17:06:17,REAL,False,False,False,"46.542763593581064, 6.8378752768139",http://lod.opentransportdata.swiss/didok/8504014
2302,2023-03-20,85:11:24171:002,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,24171,S1,,S,...,2023-03-20 19:12:59,REAL,2023-03-20 19:12:00,2023-03-20 19:13:43,REAL,False,False,False,"46.503786768358125, 6.690609436558637",http://lod.opentransportdata.swiss/didok/8501122
22365,2023-03-20,85:11:19581:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,19581,S15,,S,...,2023-03-20 21:09:18,REAL,2023-03-20 21:09:00,2023-03-20 21:10:16,REAL,False,False,False,"47.378176674223226, 8.540212349099065",http://lod.opentransportdata.swiss/didok/8503000
3326,2023-03-20,85:11:24245:002,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,24245,S2,,S,...,2023-03-20 13:07:09,REAL,2023-03-20 13:07:00,2023-03-20 13:07:48,REAL,False,False,False,"46.71008452939167, 6.56816324702275",http://lod.opentransportdata.swiss/didok/8501112
28472,2023-03-20,85:11:25831:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,25831,RE80,,RE,...,2023-03-20 17:02:35,REAL,2023-03-20 17:01:00,2023-03-20 17:03:08,REAL,False,False,False,"46.179040272391426, 8.865773290075774",http://lod.opentransportdata.swiss/didok/8505402
49060,2023-03-20,85:11:18660:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,18660,S6,,S,...,2023-03-20 15:43:41,REAL,2023-03-20 15:43:00,2023-03-20 15:44:24,REAL,False,False,False,"47.30579195922035, 8.591522357247918",http://lod.opentransportdata.swiss/didok/8503102
42787,2023-03-20,85:11:964:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,964,IC6,,IC,...,2023-03-20 11:01:06,REAL,NaT,NaT,,False,False,False,"47.5474120550501, 7.589562790156525",http://lod.opentransportdata.swiss/didok/8500010
50242,2023-03-20,85:11:19157:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,19157,S11,,S,...,2023-03-20 15:37:21,REAL,NaT,NaT,,False,False,False,"47.53576697592631, 8.738883430510043",http://lod.opentransportdata.swiss/didok/8506020
33538,2023-03-20,85:11:8846:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,8846,S28,,S,...,NaT,UNBEKANNT,2023-03-20 13:01:00,NaT,UNBEKANNT,False,False,False,"47.31992356869366, 7.963203602001299",http://lod.opentransportdata.swiss/didok/8502121
27987,2023-03-20,85:11:25335:001,85:11,SBB,Schweizerische Bundesbahnen SBB,Zug,25335,S30,,S,...,2023-03-20 14:17:13,REAL,2023-03-20 14:18:00,2023-03-20 14:18:15,REAL,False,False,False,"46.135157690407, 8.807058681014864",http://lod.opentransportdata.swiss/didok/8505407


# Data cleansing & data transformation

We first create the columns `delay_arrival` and `delay_departure` so that we can do calculations on the delays.

In [19]:
# Calculate the delays for arrival and departure as the absolute difference
# between scheduled and forecast time (early and late count alike).
# Say there's no delay if we don't have time information (NaT -> 0 seconds):
df['delay_arrival'] = abs(df['Ankunftszeit'] - df['An Prognose']).fillna(pd.Timedelta(seconds=0))
df['delay_departure'] = abs(df['Abfahrtszeit'] - df['Ab Prognose']).fillna(pd.Timedelta(seconds=0))

df[['delay_arrival',
    'delay_departure']]

Unnamed: 0,delay_arrival,delay_departure
0,0 days 00:00:09,0 days 00:00:30
1,0 days 00:00:00,0 days 00:00:00
2,0 days 00:00:00,0 days 00:00:00
3,0 days 00:00:00,0 days 00:00:00
4,0 days 00:00:44,0 days 00:01:04
...,...,...
63060,0 days 00:00:46,0 days 00:04:27
63061,0 days 00:01:26,0 days 00:01:36
63062,0 days 00:00:00,0 days 00:00:51
63063,0 days 00:10:29,0 days 00:11:37
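
Because the cell above takes the absolute value, a train that is 47 seconds early looks the same as one that is 47 seconds late, and missing times look like punctual stops. A signed variant that keeps missing values could look like the sketch below. The column name `delay_arrival_s` and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: one late arrival, one early arrival, one without time information.
delay_df = pd.DataFrame({
    "Ankunftszeit": pd.to_datetime(["2023-03-20 19:12:00", "2023-03-20 14:18:00", None]),
    "An Prognose": pd.to_datetime(["2023-03-20 19:12:59", "2023-03-20 14:17:13", None]),
})

# Forecast minus schedule, in seconds:
# positive = late, negative = early, NaN = no time information.
delay_df["delay_arrival_s"] = (
    delay_df["An Prognose"] - delay_df["Ankunftszeit"]
).dt.total_seconds()

print(delay_df["delay_arrival_s"])
```

Keeping `NaN` instead of filling with zero means that averages are computed only over stops that actually have both timestamps.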


# Visualization

In [46]:
df.groupby('Fahrt Bezeichner').agg({'delay_an': 'mean'})

Unnamed: 0_level_0,delay_an
Fahrt Bezeichner,Unnamed: 1_level_1
85:11:1007:001,0 days 00:00:00
85:11:1009:001,-1 days +23:59:45
85:11:1056:001,-1 days +23:59:42.333333334
85:11:1057:001,-1 days +23:59:37.125000
85:11:1058:001,-1 days +23:59:50.333333334
...,...
85:11:990:001,-1 days +23:52:52.500000
85:11:991:001,-1 days +23:59:24.200000
85:11:992:001,0 days 00:00:44.750000
85:11:99:002,-1 days +23:57:07
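
The column `delay_an` is not created in the cells shown here. Its negative values suggest a signed difference: `-1 days +23:59:45` is how pandas prints a timedelta of minus 15 seconds. Such averages are easier to read when the delay is stored as a number of seconds. The sketch below illustrates this with made-up values; the column name `delay_s` is an assumption.

```python
import pandas as pd

# pandas prints negative timedeltas as "-1 days +HH:MM:SS":
print(pd.Timedelta(seconds=-15))

# Made-up signed delays in seconds for two trips:
trip_df = pd.DataFrame({
    "Fahrt Bezeichner": ["85:11:1009:001", "85:11:1009:001", "85:11:992:001"],
    "delay_s": [-20.0, -10.0, 44.75],
})

# Mean delay per trip, largest delay first:
mean_delay = (
    trip_df.groupby("Fahrt Bezeichner")["delay_s"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_delay)
```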


Possible data visualizations:

* A map showing at which locations (we have the GPS coordinates) the largest delays were measured (concentrating only on high delays).
* How many minutes of delay per day does each company (such as SBB) have on average? (For this, we would use multiple CSVs covering multiple days.)
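
Both ideas need an aggregation step before anything can be plotted. The sketch below prepares the two tables on made-up rows. The signed delay column `delay_arrival_s`, the threshold `HIGH_DELAY_S`, and the station names are assumptions for illustration, and the plotting itself is left out.

```python
import pandas as pd

# Made-up rows that reuse column names from the dataset:
viz_df = pd.DataFrame({
    "Betriebstag": ["2023-03-20", "2023-03-20", "2023-03-20", "2023-03-21"],
    "Betreiber Abkürzung": ["SBB", "SBB", "SBB", "SBB"],
    "Haltestellen Name": ["A", "A", "B", "B"],
    "Geoposition": ["46.5, 6.6", "46.5, 6.6", "47.3, 8.5", "47.3, 8.5"],
    "delay_arrival_s": [600.0, 300.0, 30.0, 90.0],
})

# Split "lat, lon" strings into two numeric columns for a map.
coords = viz_df["Geoposition"].str.split(",", expand=True)
viz_df["lat"] = pd.to_numeric(coords[0].str.strip())
viz_df["lon"] = pd.to_numeric(coords[1].str.strip())

# Idea 1: mean delay per station, restricted to high delays.
HIGH_DELAY_S = 180  # hypothetical threshold: 3 minutes
high = viz_df[viz_df["delay_arrival_s"] >= HIGH_DELAY_S]
per_station = (
    high.groupby(["Haltestellen Name", "lat", "lon"])["delay_arrival_s"]
    .mean()
    .reset_index()
)

# Idea 2: mean delay in minutes per company and day.
per_operator_day = (
    viz_df.groupby(["Betreiber Abkürzung", "Betriebstag"])["delay_arrival_s"]
    .mean()
    .div(60)
    .reset_index(name="mean_delay_min")
)

print(per_station)
print(per_operator_day)
```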

# Data storage (optional)

# References