# Supervised Learning - Regressions

**Exercise 1**

Create at least three different regression models to predict the arrival delay of flights (ArrDelay) in DelayedFlights.csv as accurately as possible.

**Exercise 2**

Compare them based on MSE and R2.

**Exercise 3**

Train them using the different parameters they accept.

**Exercise 4**

Compare their performance using the train/test approach or using all the data (internal validation).

**Exercise 5**

Carry out some feature engineering to improve the prediction.

**Exercise 6**

Do not use the DepDelay variable when making predictions.

In [1]:
import pandas as pd
pd.set_option("display.max_columns", None)

import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split

from scipy import stats 

# Data

[Airlines Delay: Airline on-time statistics and delay causes](https://www.kaggle.com/giovamata/airlinedelaycauses) 

- Year: 1987-2008
- Month: 1-12
- DayofMonth: 1-31
- DayOfWeek: 1 (Monday) - 7 (Sunday)
- DepTime: departure time (local, hhmm)
- CRSDepTime: scheduled departure time (local, hhmm)
- ArrTime: arrival time (local, hhmm)
- CRSArrTime: scheduled arrival time (local, hhmm)
- UniqueCarrier: unique carrier code
- FlightNum: flight number
- TailNum: plane tail number 
- ActualElapsedTime: total flight time in minutes
- CRSElapsedTime: scheduled total flight time in minutes
- AirTime: time in the air in minutes
- ArrDelay: arrival delay in minutes
- DepDelay: departure delay in minutes
- Origin: origin IATA airport code
- Dest: destination IATA airport code
- Distance: distance in miles
- TaxiIn: taxi in time, in minutes (movement on ground)
- TaxiOut: taxi out time, in minutes (movement on ground)
- Cancelled: was the flight cancelled?
- CancellationCode: [reason for cancellation](https://aspmhelp.faa.gov/index/Types_of_Delay.html) (A = carrier, B = weather, C = NAS, D = security)
- Diverted: 1 = yes, 0 = no
- CarrierDelay: delay due to the carrier, in minutes
- WeatherDelay: delay due to weather, in minutes
- NASDelay: delay due to NAS, in minutes
- SecurityDelay: delay due to security, in minutes
- LateAircraftDelay: delayed time due to late aircraft in minutes

In [2]:
# Read csv
df_raw = pd.read_csv("./archive/DelayedFlights.csv", index_col = [0])



In [3]:
## Columns and Data types
df_raw.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1936758 entries, 0 to 7009727
Data columns (total 29 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Year               1936758 non-null  int64  
 1   Month              1936758 non-null  int64  
 2   DayofMonth         1936758 non-null  int64  
 3   DayOfWeek          1936758 non-null  int64  
 4   DepTime            1936758 non-null  float64
 5   CRSDepTime         1936758 non-null  int64  
 6   ArrTime            1929648 non-null  float64
 7   CRSArrTime         1936758 non-null  int64  
 8   UniqueCarrier      1936758 non-null  object 
 9   FlightNum          1936758 non-null  int64  
 10  TailNum            1936753 non-null  object 
 11  ActualElapsedTime  1928371 non-null  float64
 12  CRSElapsedTime     1936560 non-null  float64
 13  AirTime            1928371 non-null  float64
 14  ArrDelay           1928371 non-null  float64
 15  DepDelay           1936758 non-n

In [4]:
## Sample
df_raw.sample(10)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
5155099,2008,9,9,2,1645.0,1635,2213.0,2155,F9,720,N201FR,208.0,200.0,180.0,18.0,10.0,DEN,DCA,1476,16.0,12.0,0,N,0,6.0,0.0,12.0,0.0,0.0
1392501,2008,3,29,6,1406.0,1400,1600.0,1555,OO,2932,N709BR,114.0,115.0,100.0,5.0,6.0,MCI,AUS,650,5.0,9.0,0,N,0,,,,,
2637248,2008,5,29,4,1544.0,1456,1703.0,1634,UA,1217,N381UA,139.0,158.0,126.0,29.0,48.0,DEN,SJC,948,4.0,9.0,0,N,0,0.0,0.0,0.0,0.0,29.0
5060224,2008,9,3,3,1011.0,954,1126.0,1129,UA,817,N845UA,75.0,95.0,53.0,-3.0,17.0,SFO,LAX,337,4.0,18.0,0,N,0,,,,,
5457593,2008,10,25,6,938.0,930,1136.0,1135,WN,1655,N658SW,118.0,125.0,106.0,1.0,8.0,MCO,BWI,787,3.0,9.0,0,N,0,,,,,
6916211,2008,12,6,6,1309.0,1250,1736.0,1715,AA,1122,N5BUAA,207.0,205.0,186.0,21.0,19.0,DFW,BOS,1562,4.0,17.0,0,N,0,19.0,0.0,2.0,0.0,0.0
3933506,2008,7,31,4,1220.0,1105,1512.0,1346,DL,1484,N933DL,172.0,161.0,132.0,86.0,75.0,MCO,LGA,950,20.0,20.0,0,N,0,3.0,0.0,16.0,0.0,67.0
3113275,2008,6,20,5,1208.0,1150,1445.0,1447,XE,2879,N21154,97.0,117.0,83.0,-2.0,18.0,MSP,CLE,622,6.0,8.0,0,N,0,,,,,
545730,2008,1,30,3,1923.0,1855,2242.0,2227,AS,578,N799AS,139.0,152.0,124.0,15.0,28.0,SEA,DEN,1024,5.0,10.0,0,N,0,15.0,0.0,0.0,0.0,0.0
5351150,2008,9,18,4,1714.0,1610,57.0,2358,CO,1069,N41135,283.0,288.0,254.0,59.0,64.0,LAS,EWR,2227,10.0,19.0,0,N,0,14.0,0.0,0.0,0.0,45.0


In [5]:
## Drop Duplicates
df_raw.drop_duplicates(inplace = True)

In [6]:
## Null Values %
df_raw.isnull().mean()*100

Year                  0.000000
Month                 0.000000
DayofMonth            0.000000
DayOfWeek             0.000000
DepTime               0.000000
CRSDepTime            0.000000
ArrTime               0.367109
CRSArrTime            0.000000
UniqueCarrier         0.000000
FlightNum             0.000000
TailNum               0.000258
ActualElapsedTime     0.433044
CRSElapsedTime        0.010223
AirTime               0.433044
ArrDelay              0.433044
DepDelay              0.000000
Origin                0.000000
Dest                  0.000000
Distance              0.000000
TaxiIn                0.367109
TaxiOut               0.023493
Cancelled             0.000000
CancellationCode      0.000000
Diverted              0.000000
CarrierDelay         35.588892
WeatherDelay         35.588892
NASDelay             35.588892
SecurityDelay        35.588892
LateAircraftDelay    35.588892
dtype: float64

In [7]:
## Columns with low percentage of nulls (less than 2% in total)
subset = ["ArrTime", "TailNum", "ActualElapsedTime", "CRSElapsedTime", 
          "AirTime", "ArrDelay", "TaxiIn", "TaxiOut"]
## Drop rows with low percentage of nulls
df_raw = df_raw.dropna(subset=subset)

In [8]:
## Transform DepTime, CRSDepTime, ArrTime and CRSArrTime to a more consistent notation (hh:mm)
for col in ["DepTime", "CRSDepTime", "ArrTime", "CRSArrTime"]:
    # Zero-pad to four digits (e.g. 57 -> "0057"), then insert the colon
    hhmm = df_raw[col].astype(int).astype(str).str.zfill(4)
    df_raw[col] = hhmm.str[:2] + ":" + hhmm.str[2:]
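
The hhmm conversion can be checked on a few values taken from the sample output (1645.0, 57.0 and 938.0). The sketch below also derives minutes since midnight, a hypothetical numeric feature that is not part of the original notebook and is shown only as one possible use of these columns.

```python
import pandas as pd

# Toy values in the raw hhmm encoding (e.g. 57.0 means 00:57).
raw = pd.Series([1645.0, 57.0, 938.0])

# Zero-pad to four digits, then split into hours and minutes.
hhmm = raw.astype(int).astype(str).str.zfill(4)
formatted = hhmm.str[:2] + ":" + hhmm.str[2:]

# Hypothetical extra feature: minutes elapsed since midnight.
minutes = hhmm.str[:2].astype(int) * 60 + hhmm.str[2:].astype(int)

print(formatted.tolist())  # ['16:45', '00:57', '09:38']
print(minutes.tolist())    # [1005, 57, 578]
```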

In [9]:
## Change dtypes
df_raw["FlightNum"] = df_raw["FlightNum"].astype(str)
df_raw["Cancelled"] = df_raw["Cancelled"].astype(str)
df_raw["Diverted"] = df_raw["Diverted"].astype(str)

In [10]:
## Divide into numerical and categorical
df_num = df_raw.select_dtypes(include = ["int64", "float64"])
df_cat = df_raw.select_dtypes(exclude = ["int64", "float64"])

In [11]:
## Describe num
df_num.describe().round(2)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1928366.0,1247484.0,1247484.0,1247484.0,1247484.0,1247484.0
mean,2008.0,6.11,15.75,3.98,133.31,134.2,108.28,42.2,43.09,764.95,6.81,18.22,19.18,3.7,15.02,0.09,25.3
std,0.0,3.48,8.78,2.0,72.06,71.23,68.64,56.78,53.27,573.89,5.27,14.31,43.55,21.49,33.83,2.02,42.05
min,2008.0,1.0,1.0,1.0,14.0,-21.0,0.0,-109.0,6.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2008.0,3.0,8.0,2.0,80.0,82.0,58.0,9.0,12.0,338.0,4.0,10.0,0.0,0.0,0.0,0.0,0.0
50%,2008.0,6.0,16.0,4.0,116.0,116.0,90.0,24.0,24.0,606.0,6.0,14.0,2.0,0.0,2.0,0.0,8.0
75%,2008.0,9.0,23.0,6.0,165.0,165.0,137.0,56.0,53.0,997.0,8.0,21.0,21.0,0.0,15.0,0.0,33.0
max,2008.0,12.0,31.0,7.0,1114.0,660.0,1091.0,2461.0,2467.0,4962.0,240.0,422.0,2436.0,1352.0,1357.0,392.0,1316.0


In [12]:
## Drop Year
df_raw.drop(columns = "Year", inplace = True)

In [13]:
## Describe cat
df_cat.describe()

Unnamed: 0,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,Origin,Dest,Cancelled,CancellationCode,Diverted
count,1928366,1928366,1928366,1928366,1928366,1928366,1928366,1928366,1928366,1928366,1928366,1928366
unique,1438,1193,1440,1361,20,7498,5360,303,302,1,1,1
top,18:00,18:00,21:00,19:30,WN,16,N325SW,ATL,ORD,0,N,0
freq,3176,13867,2981,9148,376201,1575,961,131213,108265,1928366,1928366,1928366


In [14]:
## Drop Cancelled, CancellationCode and Diverted
df_raw.drop(columns = ["Cancelled", "CancellationCode", "Diverted"], inplace = True)

In [15]:
## Categorical column with delay > 15 min (1 = Yes, 0 = No)
df_raw["DelayCat"] = df_raw["ArrDelay"].apply(lambda x: 1 if x > 15 else 0)

In [16]:
## Mean Velocity columns in miles/min
df_raw = df_raw[df_raw["AirTime"] != 0]
df_raw["Velocity"] = df_raw["Distance"] / df_raw["AirTime"] 

In [17]:
## Origin-Destination Columns
df_raw["Fligth"] = df_raw["Origin"] + "-" + df_raw["Dest"]

In [18]:
## Save Final Dataframe
df_raw.to_csv("df_clean.csv")

# Exercise 1

In [19]:
# Read csv
df_clean = pd.read_csv("df_clean.csv", index_col = [0])



In [20]:
# Sample
df_clean.sample(10)

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,DelayCat,Velocity,Fligth
4354510,8,11,1,23:23,21:15,00:45,22:49,XE,2268,N14542,82.0,94.0,47.0,116.0,128.0,EWR,BUF,282,5.0,30.0,0.0,0.0,23.0,0.0,93.0,1,6.0,EWR-BUF
2953331,5,14,3,11:37,11:30,13:23,12:18,CO,1620,N37420,226.0,168.0,139.0,65.0,7.0,IAH,PHX,1009,4.0,83.0,0.0,7.0,58.0,0.0,0.0,1,7.258993,IAH-PHX
2591111,5,22,4,11:40,11:25,13:23,13:15,OO,6159,N941SW,43.0,50.0,28.0,8.0,15.0,ORD,GRR,137,4.0,11.0,,,,,,0,4.892857,ORD-GRR
3196342,6,20,5,21:42,20:50,23:32,22:25,OO,2879,N710BR,170.0,155.0,118.0,67.0,52.0,EWR,MKE,725,5.0,47.0,0.0,0.0,67.0,0.0,0.0,1,6.144068,EWR-MKE
1117017,2,28,4,21:10,20:40,23:44,23:13,AS,85,N317AS,214.0,213.0,191.0,31.0,30.0,SEA,ANC,1449,3.0,20.0,0.0,0.0,9.0,0.0,22.0,1,7.586387,SEA-ANC
4812894,8,1,5,05:56,05:50,08:30,08:25,CO,633,N11641,214.0,215.0,189.0,5.0,6.0,LGA,IAH,1416,5.0,20.0,,,,,,0,7.492063,LGA-IAH
560621,1,27,7,16:27,16:00,18:57,17:50,B6,1106,N216JB,150.0,110.0,130.0,67.0,27.0,RDU,JFK,426,10.0,10.0,9.0,0.0,40.0,0.0,18.0,1,3.276923,RDU-JFK
1785491,3,13,4,15:09,15:00,18:24,18:13,DL,1064,N922DL,195.0,193.0,149.0,11.0,9.0,MIA,JFK,1090,32.0,14.0,,,,,,0,7.315436,MIA-JFK
2541349,5,4,7,16:35,16:20,19:25,19:18,OH,5292,N653CA,110.0,118.0,86.0,7.0,15.0,MDW,ATL,590,9.0,15.0,,,,,,0,6.860465,MDW-ATL
791211,2,17,7,20:45,19:48,22:08,21:08,OO,6370,N986SW,83.0,80.0,60.0,60.0,57.0,ONT,SFO,363,13.0,10.0,0.0,0.0,0.0,0.0,60.0,1,6.05,ONT-SFO


In [21]:
# Independent variables
x = df_clean[["DepDelay", "Distance", "AirTime"]]
# Dependent variable
y = df_clean["ArrDelay"]

In [22]:
# Linear regression
model_1 = LinearRegression().fit(x,y)

# Regression tree
model_2 = DecisionTreeRegressor(random_state=1).fit(x,y)

# Lasso regression (alpha chosen by cross-validation)
model_3 = LassoCV().fit(x,y)

# Exercise 2

In [23]:
# Fitted values
y_1 = model_1.predict(x)
y_2 = model_2.predict(x)
y_3 = model_3.predict(x)

In [24]:
# MSE (mean_squared_error returns the mean squared error, not its root)
MSE_1 = mean_squared_error(y, y_1)
MSE_2 = mean_squared_error(y, y_2)
MSE_3 = mean_squared_error(y, y_3)

print(
    """
    MSE:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(MSE_1, MSE_2, MSE_3)
)


    MSE:
    Model 1: 245.7145687570543
    Model 2: 65.09371839944818
    Model 3: 245.77375226934674
    


In [25]:
# R2
R2_1 = r2_score(y, y_1)
R2_2 = r2_score(y, y_2)
R2_3 = r2_score(y, y_3)

print(
    """
    R2 score:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(R2_1, R2_2, R2_3)
)


    R2 score:
    Model 1: 0.9237969223346831
    Model 2: 0.9798125861897027
    Model 3: 0.9237785678439155
    


The second model (the regression tree) explains about 98% of the variability of ArrDelay, while the other two explain only about 92%. Consistently, the mean squared error is larger for the first and third models. Note that these scores are computed on the same data the models were fitted on.
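
The two metrics are tied together: R2 = 1 - MSE / Var(y), so on a fixed dataset a lower MSE always means a higher R2. The following is a minimal sketch on synthetic data (not the flight data) that checks this relation, and then takes the square root of the value printed above by `mean_squared_error` for model 1 to express the error in minutes.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(40, 55, size=1000)          # synthetic "delays"
y_pred = y_true + rng.normal(0, 15, size=1000)  # synthetic predictions

mse = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# R2 can be recovered from the MSE and the population variance of y.
relation_holds = bool(np.isclose(r2, 1 - mse / np.var(y_true)))
print(relation_holds)

# RMSE puts the error back in minutes; 245.71... is the MSE printed for model 1.
rmse_model_1 = float(np.sqrt(245.7145687570543))
print(round(rmse_model_1, 2))  # about 15.68
```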

# Exercise 3

In [26]:
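# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.

# Linear regression with normalization
model_1 = LinearRegression(normalize=True).fit(x,y)

# Regression tree with custom parameters
model_2 = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.01, random_state=1).fit(x,y)

# Lasso regression with normalization
model_3 = LassoCV(normalize=True).fit(x,y)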

In [27]:
# Fitted values
y_1 = model_1.predict(x)
y_2 = model_2.predict(x)
y_3 = model_3.predict(x)

In [28]:
# R2
R2_1 = r2_score(y, y_1)
R2_2 = r2_score(y, y_2)
R2_3 = r2_score(y, y_3)

print(
    """
    R2 score:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(R2_1, R2_2, R2_3)
)


    R2 score:
    Model 1: 0.9237969223346831
    Model 2: 0.8302023989333205
    Model 3: 0.92370083084097
    


- Normalizing the data makes no difference for linear regression in this case: the R2 is identical, which is expected because ordinary least squares predictions do not depend on the scale of the features.
- Changing some parameters of the regression tree, such as the maximum depth or the minimum number of samples per leaf, changes the explained variability, in this case for the worse. By default the tree grows without these limits and fits the training data as closely as it can, so restricting it lowers the fit on the training data. We can adjust these parameters in specific situations to suit our needs.
- Normalizing the data for LASSO regression brings no improvement either.
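
The first point can be illustrated on current scikit-learn versions, where scaling is done in a pipeline. The sketch below uses synthetic data and `StandardScaler`, which is an assumption on my part: it is not numerically identical to the old `normalize=True` option, so Lasso results in particular may differ from the ones printed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (stand-ins for delay, distance, air time).
X = np.column_stack([
    rng.exponential(40, 500),
    rng.uniform(100, 3000, 500),
    rng.uniform(20, 400, 500),
])
y = X[:, 0] + 0.01 * X[:, 1] - 0.05 * X[:, 2] + rng.normal(0, 10, 500)

plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Ordinary least squares with an intercept gives the same predictions either way.
same_predictions = bool(np.allclose(plain.predict(X), scaled.predict(X)))
print(same_predictions)
```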

# Exercise 4

In [29]:
# Split into Test and Train
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=1)

In [30]:
# Linear regression (train set)
model_1 = LinearRegression().fit(X_train,Y_train)

# Regression tree (train set)
model_2 = DecisionTreeRegressor(random_state=1).fit(X_train,Y_train)

# Lasso regression (train set)
model_3 = LassoCV().fit(X_train,Y_train)

In [31]:
# Predicted values
y_1 = model_1.predict(X_test)
y_2 = model_2.predict(X_test)
y_3 = model_3.predict(X_test)

In [32]:
# R2
R2_1 = r2_score(Y_test, y_1)
R2_2 = r2_score(Y_test, y_2)
R2_3 = r2_score(Y_test, y_3)

print(
    """
    R2 score:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(R2_1, R2_2, R2_3)
)


    R2 score:
    Model 1: 0.9235715044294373
    Model 2: 0.8615492890668902
    Model 3: 0.9235596392381868
    


The regression tree works much better than the other two models when predicting observations that were used in training, because of the way it is built. When predicting new observations, in this case the test set, its explanatory power drops (R2 falls from about 0.98 to about 0.86). The other two models, in contrast, remain quite stable (R2 of about 0.92 in both settings).
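
The same effect can be seen with k-fold cross-validation instead of a single split. This is a minimal sketch on synthetic data, assuming 5 folds; `cross_validate` with `return_train_score=True` reports both the in-sample and the held-out R2 for each fold.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(2000, 3))
y = X[:, 0] + rng.normal(0, 20, size=2000)  # noisy synthetic target

scores = cross_validate(
    DecisionTreeRegressor(random_state=1),
    X, y, cv=5, scoring="r2", return_train_score=True,
)
train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()

# An unrestricted tree memorises the training folds, so the train R2 is close to 1
# while the held-out R2 is clearly lower.
print(round(train_r2, 3), round(test_r2, 3))
```

Limiting `max_depth` or `min_samples_leaf` usually narrows this gap, at the cost of a lower training score.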

# Exercise 5

In [33]:
# Remove outliers from y
z_scores = np.abs(stats.zscore(y))
filtered_entries = z_scores < 3

x = x[filtered_entries]
y = y[filtered_entries]

# Remove outliers from x
z_scores = np.abs(stats.zscore(x))
filtered_entries = (z_scores < 3).all(axis=1)

x = x[filtered_entries]
y = y[filtered_entries]
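
The filter above keeps rows whose absolute z-score is below 3. As a small illustration on made-up numbers, the sketch below computes the z-score with NumPy, using the same population standard deviation (`ddof=0`) that `scipy.stats.zscore` uses by default.

```python
import numpy as np

# 19 ordinary values and one extreme value.
values = np.append(np.arange(19, dtype=float), 500.0)

# Population z-score: (value - mean) / std with ddof=0.
z = np.abs((values - values.mean()) / values.std())
kept = values[z < 3]

print(kept.size)  # 19: only the extreme value is dropped
```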

In [35]:
# Split into Test and Train
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=1)

In [36]:
# Linear regression (train set)
model_1 = LinearRegression().fit(X_train,Y_train)

# Regression tree (train set)
model_2 = DecisionTreeRegressor(random_state=1).fit(X_train,Y_train)

# Lasso regression (train set)
model_3 = LassoCV().fit(X_train,Y_train)

In [37]:
# Predicted values
y_1 = model_1.predict(X_test)
y_2 = model_2.predict(X_test)
y_3 = model_3.predict(X_test)

In [38]:
# R2
R2_1 = r2_score(Y_test, y_1)
R2_2 = r2_score(Y_test, y_2)
R2_3 = r2_score(Y_test, y_3)

print(
    """
    R2 score:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(R2_1, R2_2, R2_3)
)


    R2 score:
    Model 1: 0.8428952233133932
    Model 2: 0.7176031367228972
    Model 3: 0.8428851266835601
    


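After removing the outliers from the dataset, R2 decreases instead of increasing: the models explain a smaller share of the variability of the dependent variable. This may be because the outliers are strongly correlated with the explanatory variables. A very late arrival usually comes with a very late departure, so those points were plausibly well predicted and accounted for a large part of the explained variance.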
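
One way to see why R2 can fall is that R2 = 1 - MSE / Var(y): trimming extreme targets shrinks Var(y), while the noise around the fitted line stays roughly the same. The following sketch reproduces this on synthetic data (a skewed predictor with constant noise, both of which are assumptions), scoring in-sample for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.exponential(50, size=20000).reshape(-1, 1)  # skewed predictor, like a delay
y = x.ravel() + rng.normal(0, 15, size=20000)       # same noise level everywhere

def fit_r2(features, target):
    """Fit a linear regression and return its in-sample R2."""
    model = LinearRegression().fit(features, target)
    return r2_score(target, model.predict(features))

# Keep only rows whose target lies within 3 standard deviations of the mean.
mask = np.abs((y - y.mean()) / y.std()) < 3

r2_full = fit_r2(x, y)
r2_trimmed = fit_r2(x[mask], y[mask])
print(r2_full > r2_trimmed)
```

A lower R2 after trimming does not by itself mean worse predictions in minutes, so comparing MSE on a common test set is a useful complement.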

# Exercise 6

In [40]:
# Remove DepDelay
X_train = X_train[["Distance", "AirTime"]]
X_test = X_test[["Distance", "AirTime"]]

In [42]:
# Linear regression (train set)
model_1 = LinearRegression().fit(X_train,Y_train)

# Regression tree (train set)
model_2 = DecisionTreeRegressor(random_state=1).fit(X_train,Y_train)

# Lasso regression (train set)
model_3 = LassoCV().fit(X_train,Y_train)

In [43]:
# Predicted values
y_1 = model_1.predict(X_test)
y_2 = model_2.predict(X_test)
y_3 = model_3.predict(X_test)

In [44]:
# R2
R2_1 = r2_score(Y_test, y_1)
R2_2 = r2_score(Y_test, y_2)
R2_3 = r2_score(Y_test, y_3)

print(
    """
    R2 score:
    Model 1: {}
    Model 2: {}
    Model 3: {}
    """.format(R2_1, R2_2, R2_3)
)


    R2 score:
    Model 1: 0.0470133520506234
    Model 2: 0.022220708257650768
    Model 3: 0.046999746197615044
    


After removing the independent variable DepDelay, the predictive capacity of the models practically disappears. The remaining independent variables (Distance and AirTime) explain less than 5% of the variability of the dependent variable, which means that changes in these variables are barely reflected in the explained variable.
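
The strong dependence on DepDelay has a simple arithmetic reason that shows up in the sample rows printed earlier: there, ArrDelay equals DepDelay plus the difference between actual and scheduled elapsed time. The sketch below checks this on four of those rows. The column name `ElapsedDiff` is hypothetical, and the feature is only known after landing, so it helps to explain the delay rather than to forecast it in advance.

```python
import pandas as pd

# Four rows copied from the df_raw.sample(10) output shown earlier.
rows = pd.DataFrame({
    "DepDelay": [10.0, 48.0, 75.0, 64.0],
    "ActualElapsedTime": [208.0, 139.0, 172.0, 283.0],
    "CRSElapsedTime": [200.0, 158.0, 161.0, 288.0],
    "ArrDelay": [18.0, 29.0, 86.0, 59.0],
})

# Hypothetical engineered feature: time gained or lost relative to the schedule.
rows["ElapsedDiff"] = rows["ActualElapsedTime"] - rows["CRSElapsedTime"]

# In these rows the arrival delay decomposes exactly.
identity_holds = bool((rows["DepDelay"] + rows["ElapsedDiff"] == rows["ArrDelay"]).all())
print(identity_holds)
```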