We first import pandas to read, parse, and store the data as a DataFrame, then NumPy for arrays and math functions.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1920px-Pandas_logo.svg.png" width="512" height="207">

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/1920px-NumPy_logo_2020.svg.png" width="512" height="230">



In [20]:
import pandas as pd
import numpy as np

We start by reading the "dataset.csv" file into a DataFrame. <br>
The delimiter in the dataset is ';'.

In [21]:
csv = pd.read_csv("dataset.csv", sep=';')
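
To see why the separator has to be given explicitly, here is a small sketch on an in-memory sample instead of the real file. The sample rows are made up for illustration and only mimic the ';'-separated layout.

```python
import io
import pandas as pd

# Hypothetical sample in the same ';'-separated layout (values are made up)
sample = (
    "Date;Service;Departure station\n"
    "2018-01;National;PARIS EST\n"
    "2018-02;National;PARIS LYON\n"
)

# With the right separator, each field becomes its own column
parsed = pd.read_csv(io.StringIO(sample), sep=";")

# With the default ',' separator, every line collapses into a single column
collapsed = pd.read_csv(io.StringIO(sample))

print(parsed.shape)     # (2, 3)
print(collapsed.shape)  # (2, 1)
```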

Remove all comment columns since they're not data

In [22]:
# Drop the three free-text comment columns in a single call
csv = csv.drop(columns=[
    "Cancellation comments",
    "Departure delay comments",
    "Arrival delay comments",
])

Remove all duplicates

In [23]:
csv = csv.drop_duplicates()

# Clean Date column
The date format must be %Y-%m (a year and a month). <br>
We replace any wrong delimiter with a '-'. <br>
We convert all the strings to datetimes in that format; values that cannot be parsed become missing (NaT). <br>
We also set to missing any date before 2000 or after today.

In [24]:
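# \D matches any single non-digit, so "2018/01" or "2018_01" becomes "2018-01"
csv["Date"] = csv["Date"].astype(str).str.replace(r"(\d{4})\D(\d{2})", r"\1-\2", regex=True)
# Strings that still do not match %Y-%m become NaT instead of raising
csv["Date"] = pd.to_datetime(csv["Date"], errors="coerce", format="%Y-%m")
today = pd.to_datetime("today").normalize()
# Dates outside the accepted window are treated as missing
csv.loc[(csv["Date"] < "2000-01-01") | (csv["Date"] > today), "Date"] = pd.NaT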
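
The effect of these steps can be checked on a handful of made-up values. This is only a sketch: it assumes a "wrong delimiter" is any single non-digit character between the year and the month (`\D`), and the sample strings are invented.

```python
import pandas as pd

# Made-up sample: a valid date, a wrong delimiter, an out-of-range year, and garbage
dates = pd.Series(["2018-01", "2019/07", "1850-03", "not a date"])

# Normalise whatever single non-digit separates year and month to '-'
dates = dates.astype(str).str.replace(r"(\d{4})\D(\d{2})", r"\1-\2", regex=True)

# Anything that still does not match %Y-%m becomes NaT instead of raising
dates = pd.to_datetime(dates, errors="coerce", format="%Y-%m")

# Dates outside the accepted window are treated as missing
today = pd.to_datetime("today").normalize()
dates[(dates < "2000-01-01") | (dates > today)] = pd.NaT

print(dates)
```

Only the first two values survive; the 1850 date and the unparseable string both end up as NaT.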

Clean Service column

In [25]:
# convert_dtypes takes no target dtype; it infers the nullable string dtype for text columns
csv["Service"] = csv["Service"].convert_dtypes()

Clean Departure station

In [26]:
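# Infer the nullable string dtype, so missing values show up as <NA>
csv["Departure station"] = csv["Departure station"].convert_dtypes()
# Flag names that contain a digit with at least one character on each side
mask = csv["Departure station"].str.contains(r".+\d.+", na=False)
csv.loc[mask, "Departure station"] = np.nan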

Clean Arrival station

In [27]:
# Same treatment as the departure station column
csv["Arrival station"] = csv["Arrival station"].convert_dtypes()
mask = csv["Arrival station"].str.contains(r".+\d.+", na=False)
csv.loc[mask, "Arrival station"] = np.nan
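
As a quick illustration of what the station pattern flags, here is a sketch on made-up names ("PAR1S EST" stands in for a corrupted entry; none of these rows come from the dataset).

```python
import pandas as pd

# Made-up station names, including one corrupted entry and one missing value
stations = pd.Series(["PARIS EST", "PAR1S EST", None, "LYON PART DIEU"], dtype="string")

# True where a digit appears with at least one character on each side
mask = stations.str.contains(r".+\d.+", na=False)

# Flagged names become missing values (<NA> in a string column)
stations[mask] = pd.NA

print(stations.tolist())

# A digit at the very start of a name is not matched by this pattern
leading_digit = pd.Series(["1PARIS"], dtype="string").str.contains(r".+\d.+", na=False)
```

Because the pattern requires a character before and after the digit, a name that starts or ends with a digit is not flagged; a plain `\d` would catch those cases as well.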

Clean Average journey time

In [28]:
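# Blank out values that contain letters
mask = csv["Average journey time"].astype(str).str.contains(r"[a-zA-Z]", na=False)
csv.loc[mask, "Average journey time"] = np.nan
# Convert what remains to numbers; anything unparseable becomes NaN
csv["Average journey time"] = pd.to_numeric(csv["Average journey time"], errors="coerce")
# A journey time cannot be negative
csv.loc[csv["Average journey time"] < 0, "Average journey time"] = np.nan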

            Date   Service   Departure station       Arrival station  \
0     2018-01-01  National    BORDEAUX ST JEAN    PARIS MONTPARNASSE   
1     2018-01-01  National   LA ROCHELLE VILLE    PARIS MONTPARNASSE   
2     2018-01-01  National  PARIS MONTPARNASSE               QUIMPER   
3     2018-01-01  National  PARIS MONTPARNASSE               ST MALO   
4     2018-01-01  National  PARIS MONTPARNASSE   ST PIERRE DES CORPS   
...          ...       ...                 ...                   ...   
10835 2020-04-01  National           PARIS EST            STRASBOURG   
10836 2020-05-01  National                <NA>        LYON PART DIEU   
10837 2021-03-01  National          PARIS LYON    VALENCE ALIXAN TGV   
10838 2019-07-01  National     MARNE LA VALLEE  MARSEILLE ST CHARLES   
10839 2019-01-01  Nataonal               bIMES            PARIS LYON   

       Average journey time  Number of scheduled trains  \
0                     141.0                         NaN   

Clean Number of scheduled trains

In [29]:
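# Blank out values that contain letters
mask = csv["Number of scheduled trains"].astype(str).str.contains(r"[a-zA-Z]", na=False)
csv.loc[mask, "Number of scheduled trains"] = np.nan
csv["Number of scheduled trains"] = pd.to_numeric(csv["Number of scheduled trains"], errors="coerce")
# Keep the values themselves; only entries that are not whole numbers become NaN
csv.loc[csv["Number of scheduled trains"] % 1 != 0, "Number of scheduled trains"] = np.nan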
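
A whole-number check should blank out invalid entries while keeping the valid counts as numbers. One way to do that is sketched below on made-up values; it relies on `pd.to_numeric` with `errors="coerce"`, which is an assumption about how the column is best converted rather than something the dataset dictates.

```python
import numpy as np
import pandas as pd

# Made-up raw values: a clean count, text, a fractional count, and a missing entry
counts = pd.Series(["870", "abc", "12.5", None], dtype=object)

# Non-numeric entries become NaN rather than raising
counts = pd.to_numeric(counts, errors="coerce")

# A count of trains should be a whole number; anything else is treated as missing
counts[counts % 1 != 0] = np.nan

print(counts.tolist())  # [870.0, nan, nan, nan]
```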

Clean Number of cancelled trains

Clean Number of trains delayed at departure

Clean Average delay of late trains at departure

Clean Average delay of all trains at departure

Clean Number of trains delayed at arrival

Clean Average delay of late trains at arrival

Clean Average delay of all trains at arrival

Clean Number of trains delayed > 15 min

Clean Average delay of trains > 15min (if competing with flights)

Clean Number of trains delayed > 30min

Clean Number of trains delayed > 60min

Clean Pct delay due to external causes

Clean Pct delay due to infrastructure

Clean Pct delay due to traffic management

Clean Pct delay due to rolling stock

Clean Pct delay due to station management and equipment reuse

Clean Pct delay due to passenger handling (crowding, disabled persons, connections)
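
The count and percentage columns listed above share the same kind of rules, so they could be handled by two small helpers. The sketch below is only one possible approach: `clean_count` and `clean_percentage` are hypothetical names, the sample values are made up, and the 0 to 100 range is an assumption since the scale of the Pct columns is not stated here. The average-delay columns may need different rules and are not covered.

```python
import pandas as pd


def clean_count(series: pd.Series) -> pd.Series:
    """Hypothetical helper: keep non-negative whole numbers, blank out the rest."""
    values = pd.to_numeric(series, errors="coerce")
    invalid = (values < 0) | (values % 1 != 0)
    return values.mask(invalid)


def clean_percentage(series: pd.Series, upper: float = 100.0) -> pd.Series:
    """Hypothetical helper: keep values in [0, upper], blank out the rest.

    `upper` is an assumption; use 1.0 if the column stores fractions.
    """
    values = pd.to_numeric(series, errors="coerce")
    return values.where(values.between(0, upper))


# Made-up samples to show the behaviour
counts = clean_count(pd.Series(["5", "-1", "2.5", "x"], dtype=object))
pcts = clean_percentage(pd.Series(["25.5", "120", "-3", "n/a"], dtype=object))

print(counts.tolist())
print(pcts.tolist())
```

Each helper could then be applied per column, for example `csv[col] = clean_count(csv[col])` for every column whose name starts with "Number of".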