# Choosing, exploring and wrangling data sets - Content list

1. Introduction
2. Import libraries and data
3. Explore and clean TGV data set
4. Explore and clean Number of travelers data set

# 1. Introduction

For this project, I decided to work with open data from SNCF, France's state-owned national railway company. All the data used in this project is open and was found on their website, https://ressources.data.sncf.com/. I will combine several data sets to look for relationships between variables such as the delays experienced at a train station, the number of passengers per year at a given station, and the region from which the train left, among others.

# 2. Import libraries and data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Create path
path = r'C:\Users\Mathilde\Documents\DATA ANALYSIS CAREERFOUNDRY\Aug 2024 - SNCF project'
# Data set on the delays on the TGV (high speed trains)
df_delay_tgv = pd.read_csv(os.path.join(path, '02 Data', 'Original data', 'regularite-mensuelle-tgv-aqst.csv'), index_col = False, delimiter=';')
#  Data set on the number of passengers in the train stations
df_nb_travelers = pd.read_csv(os.path.join(path, '02 Data', 'Original data', 'frequentation-gares.csv'), index_col = False, delimiter=';')

# 3. Explore and clean TGV data set

In [3]:
pd.set_option('display.max_columns', None)
df_delay_tgv.head()

Unnamed: 0,Date,Service,Gare de départ,Gare d'arrivée,Durée moyenne du trajet,Nombre de circulations prévues,Nombre de trains annulés,Commentaire annulations,Nombre de trains en retard au départ,Retard moyen des trains en retard au départ,Retard moyen de tous les trains au départ,Commentaire retards au départ,Nombre de trains en retard à l'arrivée,Retard moyen des trains en retard à l'arrivée,Retard moyen de tous les trains à l'arrivée,Commentaire retards à l'arrivée,Nombre trains en retard > 15min,Retard moyen trains en retard > 15 (si liaison concurrencée par vol),Nombre trains en retard > 30min,Nombre trains en retard > 60min,Prct retard pour causes externes,Prct retard pour cause infrastructure,Prct retard pour cause gestion trafic,Prct retard pour cause matériel roulant,Prct retard pour cause gestion en gare et réutilisation de matériel,"Prct retard pour cause prise en compte voyageurs (affluence, gestions PSH, correspondances)"
0,2018-01,National,BORDEAUX ST JEAN,PARIS MONTPARNASSE,141,870,5,,289,11.247809,3.693179,,147,28.436735,6.511118,,110,6.511118,44,8,36.134454,31.092437,10.92437,15.966387,5.042017,0.840336
1,2018-01,National,LA ROCHELLE VILLE,PARIS MONTPARNASSE,165,222,0,,8,2.875,0.095796,,34,21.52402,5.696096,,22,5.696096,5,0,15.384615,30.769231,38.461538,11.538462,3.846154,0.0
2,2018-01,National,PARIS MONTPARNASSE,QUIMPER,220,248,1,,37,9.501351,1.003981,,26,55.692308,7.578947,"Ce mois-ci, l'OD a été touchée par les inciden...",26,7.548387,17,7,26.923077,38.461538,15.384615,19.230769,0.0,0.0
3,2018-01,National,PARIS MONTPARNASSE,ST MALO,156,102,0,,12,19.9125,1.966667,,13,48.623077,6.790686,"Ce mois-ci, l'OD a été touchée par les inciden...",8,6.724757,6,4,23.076923,46.153846,7.692308,15.384615,7.692308,0.0
4,2018-01,National,PARIS MONTPARNASSE,ST PIERRE DES CORPS,61,391,2,,61,7.796995,0.886889,,71,12.405164,3.346487,,17,3.346487,6,0,21.212121,42.424242,9.090909,21.212121,6.060606,0.0


In [4]:
df_delay_tgv.shape

(9598, 26)

In [5]:
df_delay_tgv.describe()

Unnamed: 0,Durée moyenne du trajet,Nombre de circulations prévues,Nombre de trains annulés,Commentaire annulations,Nombre de trains en retard au départ,Retard moyen des trains en retard au départ,Retard moyen de tous les trains au départ,Commentaire retards au départ,Nombre de trains en retard à l'arrivée,Retard moyen des trains en retard à l'arrivée,Retard moyen de tous les trains à l'arrivée,Nombre trains en retard > 15min,Retard moyen trains en retard > 15 (si liaison concurrencée par vol),Nombre trains en retard > 30min,Nombre trains en retard > 60min,Prct retard pour causes externes,Prct retard pour cause infrastructure,Prct retard pour cause gestion trafic,Prct retard pour cause matériel roulant,Prct retard pour cause gestion en gare et réutilisation de matériel,"Prct retard pour cause prise en compte voyageurs (affluence, gestions PSH, correspondances)"
count,9598.0,9598.0,9598.0,0.0,9598.0,9598.0,9598.0,0.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0
mean,170.01094,265.281413,10.070744,,87.775891,11.597082,3.056784,,36.074495,34.17,5.78538,25.588248,33.754032,12.169098,4.459158,22.074921,21.992758,19.73023,19.048353,7.0722,7.445573
std,87.300663,178.879185,24.784164,,89.711554,11.918225,5.097969,,30.557789,15.491757,7.46944,21.927697,19.720504,11.440176,5.079933,16.423143,15.346118,14.965066,13.903218,8.094415,9.960441
min,0.0,0.0,0.0,,0.0,0.0,-229.269444,,0.0,-40.109259,-472.638889,0.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,147.0,0.0,,22.0,5.588875,1.165647,,14.0,24.967716,3.245905,10.0,22.725868,4.0,1.0,10.714286,11.764706,9.52381,10.0,0.0,0.0
50%,163.0,227.0,2.0,,54.0,9.565817,2.260175,,28.0,32.599802,5.103441,20.0,35.971216,9.0,3.0,20.0,20.0,17.857143,17.274536,5.555556,4.477612
75%,222.0,346.0,8.0,,128.0,14.724359,3.897177,,49.0,41.365968,7.783922,35.0,44.931399,17.0,6.0,30.769231,30.0,27.586207,26.0,10.569227,11.111111
max,786.0,1100.0,297.0,,596.0,316.188095,84.516667,,376.0,299.6,92.0,312.0,299.6,202.0,71.0,100.0,100.0,100.0,100.0,100.0,100.0


The columns "Commentaire annulations" and "Commentaire retards au départ" are empty: their count is 0 in the summary above. Even if they had contained anything, it would have been unstructured data (free-text comments about the cancellations and delays). Let's remove them.
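Rather than reading the empty columns off the summary, they can also be found programmatically. The sketch below uses a hypothetical three-row frame (`toy`) standing in for `df_delay_tgv`; the values in it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for df_delay_tgv (values are made up)
toy = pd.DataFrame({
    "Date": ["2018-01", "2018-02", "2018-03"],
    "Commentaire annulations": [np.nan, np.nan, np.nan],
    "Commentaire retards à l'arrivée": [np.nan, "Ce mois-ci...", np.nan],
})

# Boolean Series indexed by column name: True where every value is missing
all_missing = toy.isna().all()
empty_cols = all_missing[all_missing].index.tolist()
print(empty_cols)

# Drop the empty columns; toy.dropna(axis="columns", how="all") is an equivalent shortcut
cleaned = toy.drop(columns=empty_cols)
print(cleaned.columns.tolist())
```

A column with even a single comment is kept, which matches the decision here to keep the arrival comments column.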

I will also rename all the columns to translate them to English before going further.

## 3.1 Drop the empty columns

In [6]:
df_delay_tgv = df_delay_tgv.drop(columns=['Commentaire annulations', 'Commentaire retards au départ'])

## 3.2 Rename columns to English

In [7]:
df_delay_tgv.rename(columns={
    "Gare de départ": "Departure station",
    "Gare d'arrivée": "Arrival station",
    "Durée moyenne du trajet": "Avg trip length",
    "Nombre de circulations prévues": "Number of trips scheduled",
    "Nombre de trains annulés": "Number of trains cancelled",
    "Nombre de trains en retard au départ": "Number of trains delayed on departure",
    "Retard moyen des trains en retard au départ": "Avg delay of trains delayed on departure",
    "Retard moyen de tous les trains au départ": "Avg delay of all trains on departure",
    "Nombre de trains en retard à l'arrivée": "Number of trains delayed on arrival",
    "Retard moyen des trains en retard à l'arrivée": "Avg delay of trains delayed on arrival",
    "Retard moyen de tous les trains à l'arrivée": "Avg delay of all trains on arrival",
    "Commentaire retards à l'arrivée" : "Comments delay on arrival",
    "Nombre trains en retard > 15min": "Number of trains >15 min delay",
    "Retard moyen trains en retard > 15 (si liaison concurrencée par vol)": "Avg delay of trains >15 (if trip has a flight concurrence)",
    "Nombre trains en retard > 30min": "Number of trains >30 min delay",
    "Nombre trains en retard > 60min": "Number of trains >60 min delay",
    "Prct retard pour causes externes": "% delay from external causes",
    "Prct retard pour cause infrastructure": "% delay infrastructure cause",
    "Prct retard pour cause gestion trafic": "% delay traffic management cause",
    "Prct retard pour cause matériel roulant": "% delay rolling stock cause",
    "Prct retard pour cause gestion en gare et réutilisation de matériel": "% delay station management and reutilization of stock",
    "Prct retard pour cause prise en compte voyageurs (affluence, gestions PSH, correspondances)": "% delay because of accommodation of passengers (crowd, disability, connections)"
}, inplace = True)
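One thing to watch with a long mapping like this: by default, `rename` silently skips any key that matches no column, so a typo in a French name would go unnoticed. A minimal sketch of two ways to catch that, using a hypothetical two-column frame and a mapping with a deliberate typo:

```python
import pandas as pd

# Hypothetical two-column frame; the real frame has many more columns
toy = pd.DataFrame({"Gare de départ": ["PARIS LYON"], "Gare d'arrivée": ["ANNECY"]})

mapping = {
    "Gare de départ": "Departure station",
    "Gare d'arrivee": "Arrival station",  # deliberate typo: missing accent
}

# Option 1: list the mapping keys that match no column before renaming
unmatched = sorted(set(mapping) - set(toy.columns))
print(unmatched)

# Option 2: errors="raise" makes rename fail loudly instead of skipping the key
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```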

In [8]:
# Explore again
df_delay_tgv.describe()

Unnamed: 0,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
count,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0
mean,170.01094,265.281413,10.070744,87.775891,11.597082,3.056784,36.074495,34.17,5.78538,25.588248,33.754032,12.169098,4.459158,22.074921,21.992758,19.73023,19.048353,7.0722,7.445573
std,87.300663,178.879185,24.784164,89.711554,11.918225,5.097969,30.557789,15.491757,7.46944,21.927697,19.720504,11.440176,5.079933,16.423143,15.346118,14.965066,13.903218,8.094415,9.960441
min,0.0,0.0,0.0,0.0,0.0,-229.269444,0.0,-40.109259,-472.638889,0.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,147.0,0.0,22.0,5.588875,1.165647,14.0,24.967716,3.245905,10.0,22.725868,4.0,1.0,10.714286,11.764706,9.52381,10.0,0.0,0.0
50%,163.0,227.0,2.0,54.0,9.565817,2.260175,28.0,32.599802,5.103441,20.0,35.971216,9.0,3.0,20.0,20.0,17.857143,17.274536,5.555556,4.477612
75%,222.0,346.0,8.0,128.0,14.724359,3.897177,49.0,41.365968,7.783922,35.0,44.931399,17.0,6.0,30.769231,30.0,27.586207,26.0,10.569227,11.111111
max,786.0,1100.0,297.0,596.0,316.188095,84.516667,376.0,299.6,92.0,312.0,299.6,202.0,71.0,100.0,100.0,100.0,100.0,100.0,100.0


The "min" row shows negative values in several of the delay columns, which shouldn't be possible. Let's check it out.
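To list every affected column at once instead of scanning the "min" row by eye, the negative values can be counted per numeric column. This is a sketch on a hypothetical miniature frame (`toy`), not on the real data:

```python
import pandas as pd

# Hypothetical miniature frame with a few of the renamed columns (values are illustrative)
toy = pd.DataFrame({
    "Service": ["National", "National", "International"],
    "Avg delay of all trains on departure": [3.69, -1.07, 0.38],
    "Avg delay of all trains on arrival": [6.51, 3.25, -2.71],
    "Number of trains cancelled": [5, 18, 16],
})

# Keep numeric columns only, then count the negative entries in each
numeric = toy.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Show only the columns that contain at least one negative value
cols_with_negatives = negative_counts[negative_counts > 0]
print(cols_with_negatives)
```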

## 3.3 Deal with outliers/impossible values

In [9]:
# Identify rows with negative values
df_delay_tgv.loc[df_delay_tgv['Avg delay of all trains on departure'] < 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
416,2018-04,National,SAINT ETIENNE CHATEAUCREUX,PARIS LYON,170,81,18,8,4.945833,-1.066667,5,26.280000,3.248677,,4,3.248677,2,0,33.333333,33.333333,0.000000,0.000000,33.333333,0.000000
478,2018-04,National,ST MALO,PARIS MONTPARNASSE,159,81,24,11,3.584848,-2.158187,17,19.013725,7.208480,"Ce mois-ci, l'OD a été touchée par la grève (1...",9,7.208480,1,0,17.647059,64.705882,5.882353,11.764706,0.000000,0.000000
487,2018-04,National,MARSEILLE ST CHARLES,TOURCOING,301,25,7,0,0.000000,-0.370370,0,0.000000,2.500000,,0,2.500000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
617,2018-05,International,GENEVE,PARIS LYON,180,217,59,8,14.875000,-1.759494,43,41.544186,11.452426,,40,11.452426,29,5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
654,2018-06,National,PARIS MONTPARNASSE,ST PIERRE DES CORPS,61,392,69,19,22.894737,-0.069247,64,15.905208,3.992931,,14,3.992931,9,3,20.930233,27.906977,20.930233,16.279070,11.627907,2.325581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8098,2023-06,National,TOURCOING,BORDEAUX ST JEAN,305,30,0,0,0.000000,-0.004598,2,0.000000,1.117816,,2,100.025000,2,2,0.000000,0.000000,0.000000,100.000000,0.000000,0.000000
8492,2023-09,International,PARIS LYON,ITALIE,423,30,16,0,0.000000,-0.029762,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9306,2024-04,National,ST MALO,PARIS MONTPARNASSE,153,95,0,1,1.533333,-0.179298,7,26.600000,3.495789,,4,37.483333,3,0,14.285714,0.000000,14.285714,57.142857,0.000000,14.285714
9360,2024-05,International,PARIS LYON,BARCELONA,409,64,0,6,1.494444,-0.166667,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


### Deal with "Avg delay of all trains on departure" column

The column "Avg delay of trains delayed on departure" doesn't have negative values, so we can use it to recalculate the column "Avg delay of all trains on departure", using these formulas:

* Avg delay of trains delayed on departure = Sum of delays on departure / Number of trains delayed on departure

* Avg delay of all trains on departure = Sum of delays on departure / Number of trips scheduled

* => Avg delay of all trains on departure = Avg delay of trains delayed on departure * Number of trains delayed on departure / Number of trips scheduled

The exception is when the Number of trips scheduled is 0: then the avg delay has to be 0.
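As a sanity check, the formula above can be tried on rows whose recorded value is not negative. The sketch below reuses the first two rows shown in the `head()` output, plus one made-up row with no scheduled trips. It also uses `np.divide` with `where=` so that the division is never evaluated on the zero row; this is an alternative to `np.where`, not the approach used in the cells that follow. On these two rows the estimate comes out near, but not equal to, the recorded value (about 3.74 against 3.69, and about 0.104 against 0.096), which suggests the formula is an approximation, possibly because trains that are not counted as delayed do not contribute exactly zero.

```python
import numpy as np
import pandas as pd

# First two rows from the head() output, plus one made-up row with 0 scheduled trips
toy = pd.DataFrame({
    "Number of trips scheduled": [870, 222, 0],
    "Number of trains delayed on departure": [289, 8, 0],
    "Avg delay of trains delayed on departure": [11.247809, 2.875, 0.0],
    "Avg delay of all trains on departure": [3.693179, 0.095796, 0.0],
})

scheduled = toy["Number of trips scheduled"].to_numpy(dtype=float)
total_delay = (
    toy["Avg delay of trains delayed on departure"]
    * toy["Number of trains delayed on departure"]
).to_numpy(dtype=float)

# Divide only where scheduled != 0; the other positions keep the 0 pre-filled in `out`
estimate = np.divide(
    total_delay, scheduled, out=np.zeros_like(total_delay), where=scheduled != 0
)

toy["Estimated avg delay"] = estimate
toy["Difference"] = toy["Estimated avg delay"] - toy["Avg delay of all trains on departure"]
print(toy[["Avg delay of all trains on departure", "Estimated avg delay", "Difference"]])
```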

In [10]:
# Calculate the replacement values
replacement_values = np.where(
    df_delay_tgv["Number of trips scheduled"] != 0,
    df_delay_tgv["Avg delay of trains delayed on departure"] *
    df_delay_tgv["Number of trains delayed on departure"] /
    df_delay_tgv["Number of trips scheduled"],
    0
)

In [11]:
# Replace the negative values in "Avg delay of all trains on departure"
negative_departure = df_delay_tgv["Avg delay of all trains on departure"] < 0
df_delay_tgv.loc[negative_departure, "Avg delay of all trains on departure"] = replacement_values[negative_departure]

In [12]:
# Check result
df_delay_tgv.loc[df_delay_tgv['Avg delay of all trains on departure'] < 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"


In [13]:
# Check result
df_delay_tgv.describe()

Unnamed: 0,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
count,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0
mean,170.01094,265.281413,10.070744,87.775891,11.597082,3.136249,36.074495,34.17,5.78538,25.588248,33.754032,12.169098,4.459158,22.074921,21.992758,19.73023,19.048353,7.0722,7.445573
std,87.300663,178.879185,24.784164,89.711554,11.918225,4.101583,30.557789,15.491757,7.46944,21.927697,19.720504,11.440176,5.079933,16.423143,15.346118,14.965066,13.903218,8.094415,9.960441
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-40.109259,-472.638889,0.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,147.0,0.0,22.0,5.588875,1.174019,14.0,24.967716,3.245905,10.0,22.725868,4.0,1.0,10.714286,11.764706,9.52381,10.0,0.0,0.0
50%,163.0,227.0,2.0,54.0,9.565817,2.263914,28.0,32.599802,5.103441,20.0,35.971216,9.0,3.0,20.0,20.0,17.857143,17.274536,5.555556,4.477612
75%,222.0,346.0,8.0,128.0,14.724359,3.904016,49.0,41.365968,7.783922,35.0,44.931399,17.0,6.0,30.769231,30.0,27.586207,26.0,10.569227,11.111111
max,786.0,1100.0,297.0,596.0,316.188095,84.516667,376.0,299.6,92.0,312.0,299.6,202.0,71.0,100.0,100.0,100.0,100.0,100.0,100.0


All good, the column doesn't have negative values anymore.

### Deal with "Avg delay of trains delayed on arrival" column

In [14]:
df_delay_tgv.loc[df_delay_tgv['Avg delay of trains delayed on arrival'] < 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
2886,2019-11,National,MONTPELLIER,PARIS LYON,380,227,11,189,4.910406,4.277469,44,-30.5125,-150.562114,,44,34.677381,18,3,52.272727,13.636364,15.909091,15.909091,2.272727,0.0
2889,2019-11,National,NIMES,PARIS LYON,224,226,11,190,8.765614,7.369322,63,-40.109259,-151.291008,,44,34.677381,18,3,46.774194,17.741935,14.516129,12.903226,3.225806,4.83871


There are only 2 negative values - these are two lines with Paris Lyon as arrival station, in Nov 2019. The absolute values of those negative numbers are close enough to the mean of the column (34.17), which suggests that the error might only be the minus sign in front of the number. Let's replace them with their absolute values.
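The "close enough" judgement can be made slightly more concrete by checking whether the flipped values fall inside the range of the valid ones. The sketch below uses a hypothetical five-row column: the two negative values are the ones shown above, the three positive ones come from the `head()` output. Falling inside the range is only a plausibility check; it does not prove that the sign is the only error.

```python
import pandas as pd

col = "Avg delay of trains delayed on arrival"
# Three valid values from head() and the two negative values found above
toy = pd.DataFrame({col: [28.436735, 21.52402, 55.692308, -30.5125, -40.109259]})

negative = toy[col] < 0

# Valid values serve as the reference range
valid = toy.loc[~negative, col]
flipped = toy.loc[negative, col].abs()

# True where the absolute value lies within the range of the valid values
plausible = flipped.between(valid.min(), valid.max())
print(plausible)

# Flip the sign on the flagged rows only
toy.loc[negative, col] = flipped
print(toy)
```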

In [15]:
df_delay_tgv.loc[df_delay_tgv["Avg delay of trains delayed on arrival"] < 0, "Avg delay of trains delayed on arrival"] = df_delay_tgv['Avg delay of trains delayed on arrival'].abs()

In [16]:
# Check result
df_delay_tgv.loc[df_delay_tgv['Avg delay of trains delayed on arrival'] < 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"


All good, the column doesn't have negative values anymore.

### Deal with "Avg delay of all trains on arrival" column

In [17]:
# Observe the negative values
df_delay_tgv.loc[df_delay_tgv['Avg delay of all trains on arrival'] < 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
419,2018-04,International,MARSEILLE ST CHARLES,MADRID,464,23,16,2,1.208333,0.383333,0,0.000000,-2.714286,,0,-2.714286,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
505,2018-04,National,PARIS VAUGIRARD,BORDEAUX ST JEAN,165,57,15,4,6.733333,0.099603,2,17.300000,-0.160714,,1,-0.160714,0,0,0.000000,0.000000,0.000000,100.000000,0.000000,0.0
1074,2018-09,National,PARIS MONTPARNASSE,ST MALO,171,97,0,4,4.279167,0.176460,2,18.875000,-2.404296,,1,-2.404296,0,0,0.000000,50.000000,50.000000,0.000000,0.000000,0.0
1099,2018-09,National,RENNES,PARIS VAUGIRARD,116,27,0,2,2.275000,0.174074,0,0.000000,-0.326543,,0,-0.326543,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
1739,2019-02,International,MARSEILLE ST CHARLES,MADRID,464,28,0,10,2.031667,0.916667,1,37.000000,-1.571429,,1,-1.571429,1,0,0.000000,0.000000,100.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8255,2023-07,National,TOURCOING,BORDEAUX ST JEAN,304,29,0,4,2.750000,0.393103,1,35.783333,-2.761494,,1,35.783333,1,0,100.000000,0.000000,0.000000,0.000000,0.000000,0.0
8318,2023-08,National,TOURCOING,BORDEAUX ST JEAN,304,28,0,1,3.000000,0.107143,1,31.750000,-0.491667,,1,31.750000,1,0,0.000000,0.000000,0.000000,100.000000,0.000000,0.0
8973,2024-01,International,PARIS EST,FRANCFORT,258,120,58,5,3.743333,0.302778,8,26.937500,-0.014722,,8,26.937500,3,0,0.000000,42.857143,42.857143,14.285714,0.000000,0.0
9013,2024-02,National,PARIS MONTPARNASSE,TOULOUSE MATABIAU,271,207,11,23,10.257246,0.753061,15,50.136667,-0.566667,,15,50.136667,7,4,26.666667,33.333333,0.000000,13.333333,6.666667,20.0


Let's proceed in the same way as for the column "Avg delay of all trains on departure". Based on the same logic:

Avg delay of all trains on arrival = Avg delay of trains delayed on arrival * Number of trains delayed on arrival / Number of trips scheduled

Unless the Number of trips scheduled = 0; then, the value is 0.

In [18]:
# Calculate the replacement values
replacement_values2 = np.where(
    df_delay_tgv["Number of trips scheduled"] != 0,
    df_delay_tgv["Avg delay of trains delayed on arrival"] *
    df_delay_tgv["Number of trains delayed on arrival"] /
    df_delay_tgv["Number of trips scheduled"],
    0
)

In [19]:
# Replace the negative values in "Avg delay of all trains on arrival"
df_delay_tgv.loc[df_delay_tgv["Avg delay of all trains on arrival"] < 0, "Avg delay of all trains on arrival"] = replacement_values2[df_delay_tgv["Avg delay of all trains on arrival"] < 0]

In [20]:
# Check result
df_delay_tgv.describe()

Unnamed: 0,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
count,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0
mean,170.01094,265.281413,10.070744,87.775891,11.597082,3.136249,36.074495,34.184716,5.961176,25.588248,33.754032,12.169098,4.459158,22.074921,21.992758,19.73023,19.048353,7.0722,7.445573
std,87.300663,178.879185,24.784164,89.711554,11.918225,4.101583,30.557789,15.459253,4.151375,21.927697,19.720504,11.440176,5.079933,16.423143,15.346118,14.965066,13.903218,8.094415,9.960441
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,147.0,0.0,22.0,5.588875,1.174019,14.0,24.984962,3.257813,10.0,22.725868,4.0,1.0,10.714286,11.764706,9.52381,10.0,0.0,0.0
50%,163.0,227.0,2.0,54.0,9.565817,2.263914,28.0,32.601505,5.110889,20.0,35.971216,9.0,3.0,20.0,20.0,17.857143,17.274536,5.555556,4.477612
75%,222.0,346.0,8.0,128.0,14.724359,3.904016,49.0,41.365968,7.791706,35.0,44.931399,17.0,6.0,30.769231,30.0,27.586207,26.0,10.569227,11.111111
max,786.0,1100.0,297.0,596.0,316.188095,84.516667,376.0,299.6,92.0,312.0,299.6,202.0,71.0,100.0,100.0,100.0,100.0,100.0,100.0


All good, the column doesn't have negative values anymore.

### Deal with "Avg delay of trains >15 (if trip has a flight concurrence)" column

As its name indicates, this column covers trains delayed by more than 15 minutes, so it is not supposed to have any value under 15.

In [21]:
# Locate values under 15
df_delay_tgv.loc[df_delay_tgv['Avg delay of trains >15 (if trip has a flight concurrence)'] < 15]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Avg delay of trains >15 (if trip has a flight concurrence),Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
0,2018-01,National,BORDEAUX ST JEAN,PARIS MONTPARNASSE,141,870,5,289,11.247809,3.693179,147,28.436735,6.511118,,110,6.511118,44,8,36.134454,31.092437,10.924370,15.966387,5.042017,0.840336
1,2018-01,National,LA ROCHELLE VILLE,PARIS MONTPARNASSE,165,222,0,8,2.875000,0.095796,34,21.524020,5.696096,,22,5.696096,5,0,15.384615,30.769231,38.461538,11.538462,3.846154,0.000000
2,2018-01,National,PARIS MONTPARNASSE,QUIMPER,220,248,1,37,9.501351,1.003981,26,55.692308,7.578947,"Ce mois-ci, l'OD a été touchée par les inciden...",26,7.548387,17,7,26.923077,38.461538,15.384615,19.230769,0.000000,0.000000
3,2018-01,National,PARIS MONTPARNASSE,ST MALO,156,102,0,12,19.912500,1.966667,13,48.623077,6.790686,"Ce mois-ci, l'OD a été touchée par les inciden...",8,6.724757,6,4,23.076923,46.153846,7.692308,15.384615,7.692308,0.000000
4,2018-01,National,PARIS MONTPARNASSE,ST PIERRE DES CORPS,61,391,2,61,7.796995,0.886889,71,12.405164,3.346487,,17,3.346487,6,0,21.212121,42.424242,9.090909,21.212121,6.060606,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9333,2024-04,International,PARIS LYON,ITALIE,502,28,0,1,14.766667,0.459524,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9360,2024-05,International,PARIS LYON,BARCELONA,409,64,0,6,1.494444,0.140104,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9426,2024-05,National,STRASBOURG,NANTES,335,57,0,7,4.700000,0.640351,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9432,2024-05,International,PARIS LYON,ITALIE,502,18,0,2,10.741667,1.194444,0,0.000000,0.000000,,0,0.000000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


2157 rows out of the 9598 in the data set have an incoherent value: that's 22%. This is far too many to impute values or delete the rows without affecting the analysis. The variable is also outside the scope of our topic, so let's simply delete the column.
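The count and the share quoted above (2157 / 9598 ≈ 0.225) can be computed directly from a boolean mask. A minimal sketch on a hypothetical five-row column, so the numbers it prints are not those of the real data set:

```python
import pandas as pd

col = "Avg delay of trains >15 (if trip has a flight concurrence)"
# Hypothetical five-row column mixing values below and above the 15-minute threshold
toy = pd.DataFrame({col: [6.511118, 5.696096, 34.677381, 100.025, 0.0]})

below = toy[col] < 15

# Sum of a boolean mask = number of True values; mean = share of True values
n_below = int(below.sum())
share = below.mean()
print(f"{n_below} of {len(toy)} rows ({share:.0%}) are below 15")
```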

In [22]:
df_delay_tgv = df_delay_tgv.drop(columns = ['Avg delay of trains >15 (if trip has a flight concurrence)'])

In [23]:
# Check result
df_delay_tgv.describe()

Unnamed: 0,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Number of trains >15 min delay,Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
count,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0,9598.0
mean,170.01094,265.281413,10.070744,87.775891,11.597082,3.136249,36.074495,34.184716,5.961176,25.588248,12.169098,4.459158,22.074921,21.992758,19.73023,19.048353,7.0722,7.445573
std,87.300663,178.879185,24.784164,89.711554,11.918225,4.101583,30.557789,15.459253,4.151375,21.927697,11.440176,5.079933,16.423143,15.346118,14.965066,13.903218,8.094415,9.960441
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,100.0,147.0,0.0,22.0,5.588875,1.174019,14.0,24.984962,3.257813,10.0,4.0,1.0,10.714286,11.764706,9.52381,10.0,0.0,0.0
50%,163.0,227.0,2.0,54.0,9.565817,2.263914,28.0,32.601505,5.110889,20.0,9.0,3.0,20.0,20.0,17.857143,17.274536,5.555556,4.477612
75%,222.0,346.0,8.0,128.0,14.724359,3.904016,49.0,41.365968,7.791706,35.0,17.0,6.0,30.769231,30.0,27.586207,26.0,10.569227,11.111111
max,786.0,1100.0,297.0,596.0,316.188095,84.516667,376.0,299.6,92.0,312.0,202.0,71.0,100.0,100.0,100.0,100.0,100.0,100.0


The column was successfully dropped.

### Deal with "Avg trip length" column

The column "Avg trip length" should never be 0, yet 0 is its minimum value. Let's have a look:

In [24]:
pd.set_option('display.max_rows', None)
df_delay_tgv.loc[df_delay_tgv['Avg trip length'] == 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
3406,2020-04,National,PARIS MONTPARNASSE,ST MALO,0,0,8,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3413,2020-04,National,MARNE LA VALLEE,LYON PART DIEU,0,0,21,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3414,2020-04,National,MARNE LA VALLEE,MARSEILLE ST CHARLES,0,0,14,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3415,2020-04,National,MARSEILLE ST CHARLES,MARNE LA VALLEE,0,0,13,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3416,2020-04,National,PARIS LYON,ANNECY,0,0,7,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3417,2020-04,National,ANNECY,PARIS LYON,0,0,7,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3418,2020-04,National,PARIS LYON,BELLEGARDE (AIN),0,0,15,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3423,2020-04,National,PARIS LYON,PERPIGNAN,0,0,11,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3424,2020-04,National,TOULON,PARIS LYON,0,0,14,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3427,2020-04,International,FRANCFORT,PARIS EST,0,0,12,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


All of those rows are from 2020, mostly April and May, so they probably reflect the disruption caused by the Covid-19 pandemic. In every one of these rows, "Number of trips scheduled" is 0, but "Number of trains cancelled" is mostly not 0. Let's look at the rows where "Number of trips scheduled" is 0.
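
To back up the "mostly April and May" impression, the zero-length rows can be counted per month. The sketch below runs on a tiny stand-in frame (named `sample` here, not the real `df_delay_tgv`); the last row with a non-zero trip length is a hypothetical value added only so the filter has something to exclude.

```python
import pandas as pd

# Tiny stand-in for df_delay_tgv with only the columns needed here.
# The 2019-01 row is hypothetical, added so that the filter drops something.
sample = pd.DataFrame({
    "Date": ["2020-04", "2020-04", "2020-10", "2020-11", "2019-01"],
    "Avg trip length": [0, 0, 0, 0, 135],
})

# Keep the rows with a trip length of 0, then count them per month
zero_length = sample.loc[sample["Avg trip length"] == 0]
rows_per_month = zero_length["Date"].value_counts().sort_index()
print(rows_per_month)
```

On the full data set, the same two lines applied to `df_delay_tgv` would show how the affected rows are spread over 2020.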

In [25]:
pd.set_option('display.max_rows', 10)
df_delay_tgv.loc[df_delay_tgv['Number of trips scheduled'] == 0]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"
3406,2020-04,National,PARIS MONTPARNASSE,ST MALO,0,0,8,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3413,2020-04,National,MARNE LA VALLEE,LYON PART DIEU,0,0,21,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3414,2020-04,National,MARNE LA VALLEE,MARSEILLE ST CHARLES,0,0,14,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3415,2020-04,National,MARSEILLE ST CHARLES,MARNE LA VALLEE,0,0,13,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3416,2020-04,National,PARIS LYON,ANNECY,0,0,7,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4235,2020-10,International,MARSEILLE ST CHARLES,MADRID,0,0,0,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4320,2020-11,International,MADRID,MARSEILLE ST CHARLES,0,0,0,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4321,2020-11,International,MARSEILLE ST CHARLES,MADRID,0,0,0,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4432,2020-12,International,MADRID,MARSEILLE ST CHARLES,0,0,0,0,0.0,0.0,0,0.0,0.0,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


These are the very same rows. The issue, then, is that "Number of trains cancelled" is not always 0 in these rows, so it is higher than the number of trips scheduled, which doesn't make sense. Let's change the value of "Number of trains cancelled" to 0.

In [26]:
df_delay_tgv.loc[df_delay_tgv['Number of trips scheduled'] == 0, 'Number of trains cancelled'] = 0
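
Both statements above (same rows, and more cancellations than scheduled trips) can be checked with boolean masks rather than by eye. This is a minimal sketch on a stand-in frame called `sample`; the last row is a hypothetical "normal" row added for contrast.

```python
import pandas as pd

# Stand-in for df_delay_tgv; the last row is a hypothetical normal row.
sample = pd.DataFrame({
    "Avg trip length": [0, 0, 0, 135],
    "Number of trips scheduled": [0, 0, 0, 200],
    "Number of trains cancelled": [8, 21, 0, 3],
})

# Do the two filters select exactly the same rows?
zero_length = sample["Avg trip length"] == 0
zero_scheduled = sample["Number of trips scheduled"] == 0
same_rows = bool((zero_length == zero_scheduled).all())

# How many rows have more cancelled trains than scheduled trips?
impossible = sample["Number of trains cancelled"] > sample["Number of trips scheduled"]
n_before = int(impossible.sum())

# Apply the same fix as in the cell above, then count again
sample.loc[zero_scheduled, "Number of trains cancelled"] = 0
n_after = int(
    (sample["Number of trains cancelled"] > sample["Number of trips scheduled"]).sum()
)
print(same_rows, n_before, n_after)
```

Counting before and after also reveals whether any impossible rows remain outside the zero-scheduled case.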

## 3.4 Missing values

In [27]:
pd.set_option('display.max_rows', None)
df_delay_tgv.isnull().sum()

Date                                                                                  0
Service                                                                               0
Departure station                                                                     0
Arrival station                                                                       0
Avg trip length                                                                       0
Number of trips scheduled                                                             0
Number of trains cancelled                                                            0
Number of trains delayed on departure                                                 0
Avg delay of trains delayed on departure                                              0
Avg delay of all trains on departure                                                  0
Number of trains delayed on arrival                                                   0
Avg delay of trains delayed on a

The only column with missing values is "Comments delay on arrival", which is fine: it's a column of unstructured data anyway. I prefer not to get rid of the column because it might give us some insights later, if I get some surprising results or outliers in my analysis.
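
With this many columns, it can be easier to print only the columns that actually have missing values, along with their share of rows. A small sketch on a stand-in frame (`sample`, with a hypothetical comment text):

```python
import numpy as np
import pandas as pd

# Stand-in for df_delay_tgv; the comment text is hypothetical.
sample = pd.DataFrame({
    "Date": ["2020-04", "2020-04", "2020-10"],
    "Number of trains cancelled": [8, 21, 0],
    "Comments delay on arrival": [np.nan, np.nan, "hypothetical comment"],
})

missing = sample.isnull().sum()
missing_only = missing[missing > 0]            # keep columns with at least one gap
missing_share = missing_only / len(sample)     # fraction of rows affected
print(missing_only)
print(missing_share.round(2))
```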

## 3.5 Duplicates

In [28]:
# View duplicates
df_delay_tgv[df_delay_tgv.duplicated()]

Unnamed: 0,Date,Service,Departure station,Arrival station,Avg trip length,Number of trips scheduled,Number of trains cancelled,Number of trains delayed on departure,Avg delay of trains delayed on departure,Avg delay of all trains on departure,Number of trains delayed on arrival,Avg delay of trains delayed on arrival,Avg delay of all trains on arrival,Comments delay on arrival,Number of trains >15 min delay,Number of trains >30 min delay,Number of trains >60 min delay,% delay from external causes,% delay infrastructure cause,% delay traffic management cause,% delay rolling stock cause,% delay station management and reutilization of stock,"% delay because of accommodation of passengers (crowd, disability, connections)"


No duplicates.
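
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, assuming that one row should exist per month and per route (my assumption, not something stated in the SNCF documentation), looks for repeated combinations of date, departure and arrival station. Sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in for df_delay_tgv, using three rows shown earlier in this notebook.
sample = pd.DataFrame({
    "Date": ["2020-04", "2020-04", "2020-04"],
    "Departure station": ["PARIS LYON", "ANNECY", "PARIS LYON"],
    "Arrival station": ["ANNECY", "PARIS LYON", "PERPIGNAN"],
    "Number of trains cancelled": [7, 7, 11],
})

# Assumed key: one row per month and per directed route
key = ["Date", "Departure station", "Arrival station"]
key_duplicates = sample[sample.duplicated(subset=key, keep=False)]
print(len(key_duplicates))
```

Note that PARIS LYON to ANNECY and ANNECY to PARIS LYON count as two different routes here, since the direction matters.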

## 3.6 Mixed-types columns

In [29]:
# Flag columns in which some values are not of the same type as the first row
for col in df_delay_tgv.columns.tolist():
  weird = (df_delay_tgv[[col]].applymap(type) != df_delay_tgv[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_delay_tgv[weird]) > 0:
    print(col)

Comments delay on arrival


The column "Comments delay on arrival" is mixed-type: let's change it all to string.

In [30]:
df_delay_tgv['Comments delay on arrival'] = df_delay_tgv['Comments delay on arrival'].astype('str')
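
One thing to keep in mind with `astype('str')`: on many pandas versions it turns missing values into the literal text `nan`, so they no longer count as missing (the behavior may differ in newer releases). A possible alternative, shown here only as a sketch on a hypothetical stand-in Series, is the nullable `"string"` dtype, which keeps missing values missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Comments delay on arrival" column
comments = pd.Series(
    [np.nan, "hypothetical comment", np.nan],
    name="Comments delay on arrival",
    dtype=object,
)

as_str = comments.astype("str")        # same conversion as in the cell above
as_string = comments.astype("string")  # nullable string dtype

# Compare how many values each version still reports as missing
print(as_str.isnull().sum())
print(as_string.isnull().sum())
```

Which one is preferable depends on whether the empty comments should stay recognisable as missing after the export.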

## 3.7 Export cleaned TGV data set

In [31]:
df_delay_tgv.to_csv(os.path.join(path, '02 Data','Prepared Data', 'delay_tgv_clean.csv'))
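
By default, `to_csv` also writes the row index as a first, unnamed column, which comes back as "Unnamed: 0" when the file is read again with default settings. The sketch below shows the difference with `index=False`, using an in-memory buffer instead of a real file so that nothing is written to disk; `sample` is a stand-in frame.

```python
import io

import pandas as pd

# Stand-in frame with two of the columns of df_delay_tgv
sample = pd.DataFrame({"Date": ["2020-04"], "Service": ["National"]})

# Default export: the index is written as an extra column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Export without the index
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```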

# 4. Explore and clean Number of travelers data set

In [32]:
df_nb_travelers.head()

Unnamed: 0,Nom de la gare,Code UIC,Code postal,Segmentation DRG,Total Voyageurs 2023,Total Voyageurs + Non voyageurs 2023,Total Voyageurs 2022,Total Voyageurs + Non voyageurs 2022,Total Voyageurs 2021,Total Voyageurs + Non voyageurs 2021,Total Voyageurs 2020,Total Voyageurs + Non voyageurs 2020,Total Voyageurs 2019,Total Voyageurs + Non voyageurs 2019,Total Voyageurs 2018,Total Voyageurs + Non voyageurs 2018,Total Voyageurs 2017,Total Voyageurs + Non voyageurs 2017,Total Voyageurs 2016,Total Voyageurs + Non voyageurs 2016,Total Voyageurs 2015,Total Voyageurs + Non voyageurs 2015
0,Abancourt,87313759,60220,C,77974,77974,71517,71517,51811,51811,32396,32396,42685,42685,40228,40228,43760,43760,41096,41096,39720,39720
1,Agde,87781278,34300,B,757491,946864,689202,861503,561160,701450,394380,492975,542288,677860,588297,735372,697091,871364,660656,825820,662516,828146
2,Agen,87586008,47000,A,1640325,2050407,1540511,1925639,1184007,1480009,860964,1076205,1211323,1514154,1109199,1386499,1194455,1493068,1141620,1427026,1183150,1478938
3,Agonac,87595157,24460,C,9537,9537,6468,6468,4119,4119,3271,3271,2538,2538,1492,1492,1583,1583,1134,1134,1127,1127
4,Aguilcourt - Variscourt,87171702,2190,C,13464,13464,13197,13197,12294,12294,10436,10436,8796,8796,6155,6155,7368,7368,8979,8979,9821,9821


## 4.1 Drop useless columns

For each year, we have one column for the number of travelers and one for the number of travelers + visitors who didn't travel. We don't need the latter. We don't need the postal code either, nor the Segmentation DRG. Let's drop all of that. I'll keep "Code UIC", which is a unique identifier for train stations, because some train stations might have the same name and we'll need to distinguish them.

In [33]:
df_nb_travelers = df_nb_travelers.drop(columns=['Code postal', 'Segmentation DRG', 'Total Voyageurs + Non voyageurs 2023', 'Total Voyageurs + Non voyageurs 2022', 'Total Voyageurs + Non voyageurs 2021', 'Total Voyageurs + Non voyageurs 2020', 'Total Voyageurs + Non voyageurs 2019', 'Total Voyageurs + Non voyageurs 2018', 'Total Voyageurs + Non voyageurs 2017', 'Total Voyageurs + Non voyageurs 2016', 'Total Voyageurs + Non voyageurs 2015'])
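
The list of "Total Voyageurs + Non voyageurs" columns can also be built from the column names instead of being typed out year by year, which would keep working if a new year is added to the file. A minimal sketch on a stand-in frame (`sample`, one station and two years only):

```python
import pandas as pd

# Stand-in for df_nb_travelers: the Abancourt row, limited to 2023 and 2022
sample = pd.DataFrame({
    "Nom de la gare": ["Abancourt"],
    "Code UIC": [87313759],
    "Code postal": [60220],
    "Segmentation DRG": ["C"],
    "Total Voyageurs 2023": [77974],
    "Total Voyageurs + Non voyageurs 2023": [77974],
    "Total Voyageurs 2022": [71517],
    "Total Voyageurs + Non voyageurs 2022": [71517],
})

# Every column that also counts non-travelers contains "Non voyageurs"
non_traveler_cols = [c for c in sample.columns if "Non voyageurs" in c]
trimmed = sample.drop(columns=["Code postal", "Segmentation DRG"] + non_traveler_cols)
print(trimmed.columns.tolist())
```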

In [34]:
df_nb_travelers.head(1)

Unnamed: 0,Nom de la gare,Code UIC,Total Voyageurs 2023,Total Voyageurs 2022,Total Voyageurs 2021,Total Voyageurs 2020,Total Voyageurs 2019,Total Voyageurs 2018,Total Voyageurs 2017,Total Voyageurs 2016,Total Voyageurs 2015
0,Abancourt,87313759,77974,71517,51811,32396,42685,40228,43760,41096,39720


## 4.2 Rename columns to English

In [35]:
df_nb_travelers.rename(columns={
    "Nom de la gare": "Station name",
    "Code UIC": "UIC code",
    "Total Voyageurs 2023": "Total travelers 2023",
    "Total Voyageurs 2022": "Total travelers 2022",
    "Total Voyageurs 2021": "Total travelers 2021",
    "Total Voyageurs 2020": "Total travelers 2020",
    "Total Voyageurs 2019": "Total travelers 2019",
    "Total Voyageurs 2018": "Total travelers 2018",
    "Total Voyageurs 2017": "Total travelers 2017",
    "Total Voyageurs 2016": "Total travelers 2016",
    "Total Voyageurs 2015": "Total travelers 2015",
}, inplace = True)
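
Since the yearly columns all follow the same pattern, the renaming dictionary could also be generated rather than written by hand. This is just a sketch of that idea on a stand-in frame (`sample`); the names `fixed_names` and `year_names` are mine.

```python
import pandas as pd

# Stand-in for df_nb_travelers: the Abancourt row, limited to 2023 and 2015
sample = pd.DataFrame({
    "Nom de la gare": ["Abancourt"],
    "Code UIC": [87313759],
    "Total Voyageurs 2023": [77974],
    "Total Voyageurs 2015": [39720],
})

# Columns that need a hand-written translation
fixed_names = {"Nom de la gare": "Station name", "Code UIC": "UIC code"}

# Yearly columns: swap the French prefix for the English one, keep the year
year_names = {
    col: col.replace("Total Voyageurs", "Total travelers")
    for col in sample.columns
    if col.startswith("Total Voyageurs")
}

renamed = sample.rename(columns={**fixed_names, **year_names})
print(renamed.columns.tolist())
```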

## 4.3 Deal with outliers/impossible values

In [36]:
df_nb_travelers.describe()

Unnamed: 0,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015
count,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0,3010.0
mean,87489350.0,928100.0,865659.2,690866.8,476099.4,910497.5,893420.9,909789.5,882647.4,872881.9
std,206750.3,5795150.0,5438984.0,3982991.0,3227785.0,6096153.0,6078375.0,6158213.0,6018658.0,5938435.0
min,87009700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,87313250.0,11278.5,10248.0,7463.0,5528.25,7721.0,6801.75,7503.75,7645.0,7789.75
50%,87487260.0,58628.5,52202.5,38576.5,30616.5,44365.5,40249.5,43387.5,41707.0,43345.0
75%,87684110.0,292112.8,263005.8,205473.0,156267.5,238454.8,219903.0,233455.8,223584.5,230358.2
max,87988720.0,226768500.0,211698500.0,126621000.0,114468400.0,244744400.0,244465600.0,247963200.0,242273100.0,238356100.0


The numerical values seem to make sense, except for the minimum of 0, which means that at least one station had no travelers at all in a given year. Let's check it out.

In [37]:
pd.set_option('display.max_rows', None)
df_nb_travelers.loc[df_nb_travelers['Total travelers 2023']==0]

Unnamed: 0,Station name,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015
185,Exideuil,87592881,0,0,0,0,0,0,79,288,834
221,Hymont - Mattaincourt,87144220,0,0,354,761,1058,928,695,1496,1537
346,Merrey,87142406,0,0,0,0,1,4,75,172,251
364,Montbré,87171256,0,0,0,1355,2655,2609,3307,3668,3103
388,Nonant-le-Pin,87444570,0,1,0,0,0,2,0,36,28
400,Pargny-sur-Saulx,87174334,0,0,0,0,0,0,0,0,0
418,Pont-Hébert,87447185,0,0,0,0,1,286,455,340,326
420,Pont-du-Casse,87586404,0,0,156,360,711,573,451,292,237
480,Saint-Martin-Bellevue,87746222,0,0,0,0,0,2,312,857,447
491,Saint-Sever,87447714,0,0,2,7,48,76,75,78,341


In [38]:
df_nb_travelers.loc[df_nb_travelers['Total travelers 2015']==0]

Unnamed: 0,Station name,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015
5,Aigrefeuille le Thou,87485193,61127,51200,32989,24987,31170,18670,14513,266,0
50,Base Aérienne,87699223,2714,2079,659,95,0,0,0,0,0
168,Delle,87184440,32269,31341,19250,17348,25996,1250,0,0,0
231,L'Etang les Sablons,87726109,148199,70338,0,0,0,0,0,0,0
242,La Couronne,87583617,364,0,0,0,0,0,0,0,0
276,Le Mans Hôpital-Université,87743872,10061,0,0,0,0,0,0,0,0
387,Noisy-le-Roi,87733659,354395,168105,0,0,0,0,0,0,0
393,Nîmes Pont du Gard,87703975,1054535,745834,590063,405086,15135,0,0,0,0
400,Pargny-sur-Saulx,87174334,0,0,0,0,0,0,0,0,0
525,Allée Royale,87710913,46010,21783,0,0,0,0,0,0,0


73 stations had 0 travelers in 2023: most of them still had travelers in 2015 and seem to have closed in the years since.

46 stations had no travelers in 2015: most of them start recording travelers in the following years, so they must be new stations.

A few stations have 0 travelers recorded in every year. They will not disturb our analysis.

No action taken.
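
The counts above can be obtained for all years at once, and the stations that never recorded a single traveler can be listed explicitly. A sketch on a stand-in frame (`sample`) with four stations from the tables above and only three of the yearly columns:

```python
import pandas as pd

# Stand-in for df_nb_travelers: four stations, three of the yearly columns
sample = pd.DataFrame({
    "Station name": ["Exideuil", "Pargny-sur-Saulx", "Base Aérienne", "Agde"],
    "Total travelers 2023": [0, 0, 2714, 757491],
    "Total travelers 2019": [0, 0, 0, 542288],
    "Total travelers 2015": [834, 0, 0, 662516],
})

year_cols = [c for c in sample.columns if c.startswith("Total travelers")]
is_zero = sample[year_cols] == 0

# Number of stations with 0 travelers, per year
zeros_per_year = is_zero.sum()

# Stations with 0 travelers in every year
never_any = sample.loc[is_zero.all(axis=1), "Station name"].tolist()

print(zeros_per_year)
print(never_any)
```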

## 4.4 Missing values

In [39]:
df_nb_travelers.isnull().sum()

Station name            0
UIC code                0
Total travelers 2023    0
Total travelers 2022    0
Total travelers 2021    0
Total travelers 2020    0
Total travelers 2019    0
Total travelers 2018    0
Total travelers 2017    0
Total travelers 2016    0
Total travelers 2015    0
dtype: int64

No missing values.

## 4.5 Duplicates

In [40]:
# View duplicates in data set
df_nb_travelers[df_nb_travelers.duplicated()]

Unnamed: 0,Station name,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015


No duplicates.

In [41]:
# View duplicates in Station name column
df_nb_travelers[df_nb_travelers['Station name'].duplicated()]

Unnamed: 0,Station name,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015
1813,Belleville,87141804,31805,26977,20628,17494,25006,23040,28306,28504,28800


In [42]:
df_nb_travelers.loc[df_nb_travelers['Station name'] == 'Belleville']

Unnamed: 0,Station name,UIC code,Total travelers 2023,Total travelers 2022,Total travelers 2021,Total travelers 2020,Total travelers 2019,Total travelers 2018,Total travelers 2017,Total travelers 2016,Total travelers 2015
1254,Belleville,87486142,134484,113138,71832,57289,85566,73772,67909,58233,49298
1813,Belleville,87141804,31805,26977,20628,17494,25006,23040,28306,28504,28800


Indeed, there are two stations with the same name, but as we can distinguish them with the UIC code, it's alright.
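
Two small additions could make this check more direct: `duplicated(keep=False)` returns every row that shares a name (not only the second occurrence), and `is_unique` confirms that the UIC code really identifies each station. Sketch on a stand-in frame (`sample`) built from rows shown above:

```python
import pandas as pd

# Stand-in for df_nb_travelers: both Belleville stations plus Abancourt
sample = pd.DataFrame({
    "Station name": ["Belleville", "Belleville", "Abancourt"],
    "UIC code": [87486142, 87141804, 87313759],
})

# keep=False marks all occurrences of a repeated name
same_name = sample[sample["Station name"].duplicated(keep=False)]

# True only if no UIC code appears twice
uic_is_unique = sample["UIC code"].is_unique

print(same_name)
print(uic_is_unique)
```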

## 4.6 Mixed-types columns

In [43]:
# Flag columns in which some values are not of the same type as the first row
for col in df_nb_travelers.columns.tolist():
  weird = (df_nb_travelers[[col]].applymap(type) != df_nb_travelers[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_nb_travelers[weird]) > 0:
    print(col)

No mixed-types columns.
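
`DataFrame.applymap` is deprecated in recent pandas releases in favour of `DataFrame.map`, so the check above may raise a warning (hidden here by the warnings filter) or stop working on newer versions. A possible alternative that relies only on `Series.map` counts the distinct Python types per column. In this sketch, `sample` is a stand-in frame and the "Comment" column is hypothetical, added only to show a positive detection.

```python
import numpy as np
import pandas as pd

# Stand-in frame; "Comment" is a hypothetical column mixing text and NaN
sample = pd.DataFrame({
    "Station name": ["Abancourt", "Agde"],
    "UIC code": [87313759, 87781278],
    "Comment": pd.Series(["hypothetical text", np.nan], dtype=object),
})

# A column is mixed-type if its values map to more than one Python type
mixed_cols = [
    col for col in sample.columns
    if sample[col].map(type).nunique() > 1
]
print(mixed_cols)
```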

## 4.7 Export Number of travelers data set

In [44]:
df_nb_travelers.to_csv(os.path.join(path, '02 Data','Prepared Data', 'df_nb_travelers_clean.csv'))