<div align="center">

# <u> Flight delay forecasting </u>
## Machine Learning Project by:
#### Tamara Pallien, Jan Bohlman & Frederic Baumeister

</div>

<img src="res/title.jpg">


 Photo by D. C. Cavalleri: https://www.pexels.com/de-de/foto/flughafen-2421196/

## Introduction:



In [1]:
# Necessary imports

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import plotly.express as px


# Machine learning
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error

# Time
from datetime import datetime, timedelta

RSEED = 42


In [2]:
# Read the dataset

df = pd.read_csv('data/Train.csv')
df.head()

| | ID | DATOP | FLTID | DEPSTN | ARRSTN | STD | STA | STATUS | AC | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_id_0 | 2016-01-03 | TU 0712 | CMN | TUN | 2016-01-03 10:30:00 | 2016-01-03 12.55.00 | ATA | TU 32AIMN | 260.0 |
| 1 | train_id_1 | 2016-01-13 | TU 0757 | MXP | TUN | 2016-01-13 15:05:00 | 2016-01-13 16.55.00 | ATA | TU 31BIMO | 20.0 |
| 2 | train_id_2 | 2016-01-16 | TU 0214 | TUN | IST | 2016-01-16 04:10:00 | 2016-01-16 06.45.00 | ATA | TU 32AIMN | 0.0 |
| 3 | train_id_3 | 2016-01-17 | TU 0480 | DJE | NTE | 2016-01-17 14:10:00 | 2016-01-17 17.00.00 | ATA | TU 736IOK | 0.0 |
| 4 | train_id_4 | 2016-01-17 | TU 0338 | TUN | ALG | 2016-01-17 14:30:00 | 2016-01-17 15.50.00 | ATA | TU 320IMU | 22.0 |


In [3]:
# Drop columns that are not used as features

df = df.drop(columns=['ID', 'DATOP', 'AC', 'FLTID'])
df.head()

| | DEPSTN | ARRSTN | STD | STA | STATUS | target |
|---|---|---|---|---|---|---|
| 0 | CMN | TUN | 2016-01-03 10:30:00 | 2016-01-03 12.55.00 | ATA | 260.0 |
| 1 | MXP | TUN | 2016-01-13 15:05:00 | 2016-01-13 16.55.00 | ATA | 20.0 |
| 2 | TUN | IST | 2016-01-16 04:10:00 | 2016-01-16 06.45.00 | ATA | 0.0 |
| 3 | DJE | NTE | 2016-01-17 14:10:00 | 2016-01-17 17.00.00 | ATA | 0.0 |
| 4 | TUN | ALG | 2016-01-17 14:30:00 | 2016-01-17 15.50.00 | ATA | 22.0 |


### Data cleaning 


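In [4]:
# Convert the schedule columns to Unix timestamps (seconds since the epoch, as floats)

# STA uses '.' as the time separator (e.g. 12.55.00), so normalise it to ':' before parsing
df['STA'] = df['STA'].str.replace('.', ':', regex=False)
df['STA'] = pd.to_datetime(df['STA']).map(pd.Timestamp.timestamp)

df['STD'] = pd.to_datetime(df['STD']).map(pd.Timestamp.timestamp)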

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   DEPSTN  107833 non-null  object 
 1   ARRSTN  107833 non-null  object 
 2   STD     107833 non-null  float64
 3   STA     107833 non-null  float64
 4   STATUS  107833 non-null  object 
 5   target  107833 non-null  float64
dtypes: float64(3), object(3)
memory usage: 4.9+ MB
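
The conversion above can be checked by hand on the first row of the dataset. The following is a minimal sketch, not part of the original notebook: the variable names are hypothetical, and the scheduled duration is computed only to illustrate what the float timestamps represent.

```python
import pandas as pd

# Raw values copied from row 0 of the table shown earlier.
std_raw = "2016-01-03 10:30:00"
sta_raw = "2016-01-03 12.55.00"

# STA separates hours, minutes and seconds with dots, so fix that first.
sta_fixed = sta_raw.replace(".", ":")

# Same conversion as in the notebook: datetime -> seconds since the epoch.
std_seconds = pd.Timestamp(std_raw).timestamp()
sta_seconds = pd.Timestamp(sta_fixed).timestamp()

# Because both values are in seconds, their difference is the scheduled
# flight time (hypothetical helper value, not a feature used in the notebook).
scheduled_minutes = (sta_seconds - std_seconds) / 60
print(scheduled_minutes)  # 145.0
```

Storing the timestamps as floats lets the tree split on them directly, at the cost of losing calendar structure such as weekday or hour of day.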


In [5]:
# Separate features and target

X = df.drop('target', axis=1)
y = df['target']

# Dataset dimensions
print(f"We have {X.shape[0]} observations in our dataset and {X.shape[1]} features")
print(f"Our target vector also has {y.shape[0]} values")


We have 107833 observations in our dataset and 5 features
Our target vector also has 107833 values


## Preprocessing

In [6]:
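# Categorical features to one-hot encode
cat_feats = ['STATUS','DEPSTN', 'ARRSTN']

# drop_first=True drops one dummy column per feature to avoid redundant columns
X = pd.get_dummies(X, columns=cat_feats, drop_first=True)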
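
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The frame `toy` is an assumption built from airport codes seen in the data preview; it is not the real dataset.

```python
import pandas as pd

# Toy frame with three distinct departure airports (illustrative only).
toy = pd.DataFrame({"DEPSTN": ["CMN", "MXP", "TUN", "TUN"]})

# One dummy column per category, minus the first one in sorted order (CMN).
encoded = pd.get_dummies(toy, columns=["DEPSTN"], drop_first=True)

print(list(encoded.columns))  # ['DEPSTN_MXP', 'DEPSTN_TUN']

# A CMN row is represented by all dummy columns being zero/False.
print(int(encoded.iloc[0].sum()))  # 0
```

With three categorical features and many airports, the real feature matrix becomes wide and sparse, which is worth keeping in mind when moving beyond a shallow tree.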

In [7]:
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)
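
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and seed to a plain index array of the dataset's length. The array `indices` is a stand-in for the real feature matrix, used only to count rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RSEED = 42

# Stand-in for the 107833 rows of the dataset (row indices only).
indices = np.arange(107833)

train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=RSEED)

# The test share is rounded up, the remainder goes to training.
print(len(train_idx), len(test_idx))  # 86266 21567
```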

## Very simple baseline model

In [10]:
# Shallow decision tree (max_depth=3) as a first baseline
dec_reg = DecisionTreeRegressor(max_depth=3)
dec_reg.fit(X_train, y_train)

In [11]:
y_test_predicted = dec_reg.predict(X_test)

print("R2: {:.2f}".format(r2_score(y_test, y_test_predicted)))
print("Mean Squared Error: {:.2f}".format(mean_squared_error(y_test, y_test_predicted)))


R2: 0.03
Mean Squared Error: 13362.84
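
To put these two numbers in context, the sketch below takes the square root of the reported MSE, which gives an error in the same units as `target` (presumably minutes, although the notebook does not state the unit). It also shows that always predicting the mean yields an R2 of 0, so a score of 0.03 is only marginally better than that. The toy target values are the five shown in the data preview, used here purely for illustration.

```python
import math

import numpy as np
from sklearn.metrics import r2_score

# Test-set MSE reported by the baseline above.
reported_mse = 13362.84

# RMSE is in the same units as the target.
rmse = math.sqrt(reported_mse)
print(f"RMSE: {rmse:.1f}")  # RMSE: 115.6

# Toy targets: the five values from the data preview (illustrative only).
y_toy = np.array([260.0, 20.0, 0.0, 0.0, 22.0])

# A "model" that always predicts the mean of the targets.
mean_prediction = np.full_like(y_toy, y_toy.mean())

# By definition, this reference predictor scores an R2 of zero.
r2_mean = r2_score(y_toy, mean_prediction)
print(abs(r2_mean) < 1e-9)  # True
```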


## Nice to have:

Plots:

- Airport with the most delays

- Delay clustered by time of year

- Routes that are late most often

AIRPORT | COUNTRY 
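
The airport and route ideas above could start from a simple aggregation. This is a rough sketch on a toy frame built from the five preview rows, not on the full dataset; the `route` column and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame using the five rows from the data preview (illustrative only).
toy = pd.DataFrame({
    "DEPSTN": ["CMN", "MXP", "TUN", "DJE", "TUN"],
    "ARRSTN": ["TUN", "TUN", "IST", "NTE", "ALG"],
    "target": [260.0, 20.0, 0.0, 0.0, 22.0],
})

# Mean delay per departure airport, largest first.
mean_delay_by_airport = (
    toy.groupby("DEPSTN")["target"].mean().sort_values(ascending=False)
)

# Hypothetical route label: departure and arrival airport joined by a dash.
toy["route"] = toy["DEPSTN"] + "-" + toy["ARRSTN"]
mean_delay_by_route = (
    toy.groupby("route")["target"].mean().sort_values(ascending=False)
)

print(mean_delay_by_airport.index[0])  # CMN
print(mean_delay_by_route.index[0])    # CMN-TUN
```

Either series can then be passed to `Series.plot(kind="bar")` to get a first version of the plots listed above.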

# Baseline Model

## Conclusion


____
This project was part of the Data Science Bootcamp at NeueFische. For more information, visit:

[NeueFische](https://www.neuefische.de)