# CanYouCatchIt?
A web application that estimates the probability that your bus, tram, or metro will be late. 💻🤖🎲🚌 🚎🚇🔮

_Built with the STIB API (available [here](https://opendata.stib-mivb.be/store/))_

# Notes: Making some models 💻🤖🚌 🚎🚇
This notebook explores the delay data.

## Load the data

Define a function that loads the CSV files.

In [None]:
# import
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
import os

# Set the path to the directory holding CSV files
DELAY_PATH = '/home/haeresis/Documents/Github/CanYouCatchIt/machine_learning/data'

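def load_delay_data(delay_path=DELAY_PATH):
    """
    Load every delay*.csv file found in delay_path into a single pandas DataFrame
    """
    # Build the search pattern from delay_path so the argument is actually used
    csv_files = glob.glob(os.path.join(delay_path, 'delay*.csv'))
    return pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)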
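
The glob-and-concatenate pattern used by the loader can be tried on throwaway files before pointing it at the real data. The sketch below is only an illustration: `load_csv_files` is a hypothetical helper (not part of the project), and the two CSV files contain made-up values, not STIB data.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_files(directory, pattern="delay*.csv"):
    """Concatenate every CSV in `directory` whose name matches `pattern` (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        # pd.concat fails on an empty list, so report the problem explicitly
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory!r}")
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


# Demo on two temporary files holding made-up values
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"line": [39, 39], "delay": [12, -3]}).to_csv(
        os.path.join(tmp, "delay_a.csv"), index=False)
    pd.DataFrame({"line": [39], "delay": [40]}).to_csv(
        os.path.join(tmp, "delay_b.csv"), index=False)
    demo = load_csv_files(tmp)

print(demo)
```

Sorting the paths makes the row order reproducible, and `ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own index.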

In [None]:
# load the csv file
delay = load_delay_data()
delay.dropna(inplace=True)
delay.reset_index(drop=True, inplace=True)

# Get the index labels of the rows whose line is not 39
index_to_remove = delay[ delay['line'] != 39].index
# Drop these rows from the DataFrame
delay.drop(index_to_remove , inplace=True)

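# Drop the columns that hold a single unique value: they carry no information
nunique = delay.apply(pd.Series.nunique)
cols_to_drop = nunique[nunique == 1].index
delay = delay.drop(cols_to_drop, axis=1)

# Drop the columns that are not used in this exploration
delay = delay.drop(['trip', 'theoretical_time', 'expectedArrivalTime', 'date'], axis=1)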

# Reset the labels
delay.reset_index(drop=True, inplace=True)

# Stratify the data by hour
# This ensures that each hour is represented in the train set in the same proportion as in the overall dataset
# This stratification is not necessary if you have enough data
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(delay, delay["hour"]):
    strat_train_set = delay.loc[train_index]
    strat_test_set = delay.loc[test_index]
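
To check that a stratified split really preserves the share of each hour, the proportions can be compared side by side. This is a minimal sketch on synthetic data: the `toy` frame, its three hours, and its delay values are assumptions made for illustration, not the STIB dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the delay table: 1000 rows with unbalanced hours
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "hour": np.repeat([7, 8, 17], [500, 300, 200]),
    "delay": rng.normal(60, 30, size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["hour"]))
# split() yields positional indices, so iloc is used here
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Share of each hour in the full data, the train set and the test set
shares = pd.DataFrame({
    "overall": toy["hour"].value_counts(normalize=True),
    "train": train["hour"].value_counts(normalize=True),
    "test": test["hour"].value_counts(normalize=True),
}).sort_index()
print(shares)
```

If the three columns agree closely, the split is representative with respect to the hour. A purely random split would let these shares drift, especially for hours with few samples.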

### Create a copy of the train set

In [None]:
# Create a copy so we can play with it without harming the training set
delay = strat_train_set.copy()
delay.head()

## Looking For Correlations

In [None]:
corr_matrix = delay.corr()

Look at how much each attribute correlates with the delay value

In [None]:
corr_matrix["delay"].sort_values(ascending=False) # warning: this only measures linear correlation
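
The warning about linear correlation can be made concrete with a toy example. In the sketch below (made-up columns, unrelated to the delay data), one column is a linear function of `x` and another is fully determined by `x` through a square, yet only the first shows up in the correlation coefficient.

```python
import numpy as np
import pandas as pd

x = np.arange(-50, 51)          # values symmetric around zero
toy = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,        # perfectly linear in x
    "quadratic": x ** 2,        # fully determined by x, but not linearly
})

# Pearson correlation (the pandas default) of every column with x
corr = toy.corr()["x"]
print(corr)
```

The coefficient for `quadratic` is essentially zero even though the dependence is exact, so a low value in the correlation matrix does not rule out a relationship with the delay.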

In [None]:
plt.figure(figsize= (10,10), dpi=100)
sns.heatmap(corr_matrix)

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["delay", "humidity", "hour", "wind", "temp"]
scatter_matrix(delay[attributes], figsize=(12, 8))

In [None]:
delay.plot(kind="scatter", x="hour", y="delay", alpha=0.1)

## Try adding a new feature combining two others (hour, minute)

In [None]:
delay["hour_and_minute"] = delay["hour"]*3600 + delay["minute"]*60
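
Assuming `hour` is the hour of the day and `minute` the minute within that hour, the combined feature is the time of day expressed in seconds since midnight. A small worked example on made-up rows:

```python
import pandas as pd

# Three hypothetical times of day: 00:00, 08:30 and 23:59
toy = pd.DataFrame({"hour": [0, 8, 23], "minute": [0, 30, 59]})

# Same formula as above: 3600 seconds per hour, 60 seconds per minute
toy["hour_and_minute"] = toy["hour"] * 3600 + toy["minute"] * 60
print(toy)
```

For 08:30 this gives 8 × 3600 + 30 × 60 = 30600 seconds.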

In [None]:
delay.head(3)

In [None]:
corr_matrix = delay.corr()
corr_matrix["delay"].sort_values(ascending=False)

In [None]:
delay.plot(kind="scatter", x="hour_and_minute", y="delay", alpha=0.2)

This exploration shows that the attributes most strongly (linearly) correlated with the delay are the temperature, the wind, and the humidity.
It also shows that the hour and minute attributes are not linearly correlated with the delay.
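
A lack of linear correlation does not by itself mean that the hour is uninformative. One hedged way to look for a non-linear pattern is to average the delay per hour. The sketch below uses synthetic data in which the delay is assumed to peak at a hypothetical hour (14); the shape and the numbers are illustrative only and say nothing about the real STIB delays.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 100 observations for each hour from 6 to 22
hours = np.tile(np.arange(6, 23), 100)
# Assumed pattern for illustration: delay peaks at hour 14, plus some noise
delays = 120 - 3 * (hours - 14) ** 2 + rng.normal(0, 5, size=hours.size)
toy = pd.DataFrame({"hour": hours, "delay": delays})

# Linear (Pearson) correlation between hour and delay
pearson = toy["hour"].corr(toy["delay"])

# Mean delay per hour reveals the pattern that the coefficient misses
by_hour = toy.groupby("hour")["delay"].mean()

print(f"Pearson correlation: {pearson:.3f}")
print(by_hour.round(1))
```

Here the coefficient stays close to zero while the per-hour means vary widely, so running the same `groupby` on the real training set is a cheap check before discarding the time features.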