# GBDT - Flight Delays

Let's predict whether a flight will be delayed by more than 15 minutes:

1. Use the training and test sets from the "Flight delays" challenge on [Kaggle](https://www.kaggle.com/c/flight-delays-spring-2018/overview).
2. Apply Gradient Boosting and reach the highest possible score on the **test set**, following the process we have practiced so far on the **training set**.
    - Use `roc_auc_score(y_test, y_pred)` to evaluate your model.
3. Work in pairs; the grade will be normalized:
    - **1** point: implementation and evaluation on the **training set**.
    - **1** point: the **test set** result, normalized between `0.5` and `0.72998` (Kaggle Public Leaderboard).
    - **1** extra point: a **test set** result above `0.72998` (Kaggle Public Leaderboard).

To get the test-set result, you must submit your predictions to the competition.
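
For intuition about the metric, ROC AUC can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes it by brute force over all positive/negative pairs. It is an illustration, not scikit-learn's implementation, and `pairwise_auc` and `normalized_grade` are hypothetical helper names. The linear scaling in `normalized_grade` is only an assumption about how the range `0.5` to `0.72998` maps onto one point.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def normalized_grade(score, low=0.5, high=0.72998):
    """Assumed linear mapping of a leaderboard score onto [0, 1], clipped at both ends."""
    return min(1.0, max(0.0, (score - low) / (high - low)))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
print(normalized_grade(0.5), normalized_grade(0.72998))
```

Because only the ordering of the scores matters, the model's predicted probabilities can be passed to the metric directly, without thresholding them into `0`/`1` labels.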

**Install dependencies**

In [1]:
import sys
!{sys.executable} -m pip install xgboost



**Data Understanding**

- **Month**, **DayofMonth**, **DayOfWeek**: date components, stored as strings such as `c-8`
- **DepTime**: departure time
- **UniqueCarrier**: carrier (airline) code
- **Origin**: flight origin
- **Dest**: flight destination
- **Distance**: distance between Origin and Dest airports
- **dep_delayed_15min**: target (`Y` or `N`)

**Data Exploration**

In [2]:
import pandas as pd

df = pd.read_csv('../data/flight_delays/flight_delays_train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Month              100000 non-null  object
 1   DayofMonth         100000 non-null  object
 2   DayOfWeek          100000 non-null  object
 3   DepTime            100000 non-null  int64 
 4   UniqueCarrier      100000 non-null  object
 5   Origin             100000 non-null  object
 6   Dest               100000 non-null  object
 7   Distance           100000 non-null  int64 
 8   dep_delayed_15min  100000 non-null  object
dtypes: int64(2), object(7)
memory usage: 6.9+ MB


In [3]:
df.head()

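|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | c-4 | c-20 | c-3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | c-9 | c-2 | c-5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | c-11 | c-25 | c-6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | c-10 | c-7 | c-6 | 1828 | WN | MDW | OMA | 423 | Y |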


In [4]:
df.describe()

|   | DepTime | Distance |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 1341.52388 | 729.39716 |
| std | 476.378445 | 574.61686 |
| min | 1.0 | 30.0 |
| 25% | 931.0 | 317.0 |
| 50% | 1330.0 | 575.0 |
| 75% | 1733.0 | 957.0 |
| max | 2534.0 | 4962.0 |


**Data Handling**

In [5]:
def converter(dataframe, column_list):
    """Strip the 'c-' prefix (e.g. 'c-8' -> 8) and cast the columns to int."""
    df = dataframe.copy()
    for column in column_list:
        df[column] = df[column].map(lambda x: str(x)[2:]).astype(int)
    return df

def lister(dataframe, column_list, entries):
    """Return, for each column, the list of its `entries` most frequent values."""
    entries_list = []
    for column in column_list:
        column_entries_list = dataframe[column].value_counts().head(entries).index.tolist()
        entries_list.append(column_entries_list)
    return entries_list

def mapper(dataframe, column_list, column_valid_values_list, invalid_value_substitute):
    """Replace values outside each column's valid list with a substitute label."""
    df = dataframe.copy()
    # Pair each column with its own list of valid values
    for column, valid_values in zip(column_list, column_valid_values_list):
        df[column] = df[column].apply(lambda x: x if x in valid_values else invalid_value_substitute)
    return df
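
As a quick illustration of what these helpers do to a categorical column, the sketch below keeps the most frequent values and collapses the rest into a single `'OTHER'` bucket. The toy data and the names `top_values` and `collapse_rare` are made up for the example; they mirror `lister` and `mapper` for a single column rather than reproduce them.

```python
import pandas as pd


def top_values(series, n):
    """Return the n most frequent values of a Series (like `lister` for one column)."""
    return series.value_counts().head(n).index.tolist()


def collapse_rare(series, keep, other="OTHER"):
    """Replace every value outside `keep` with `other` (like `mapper` for one column)."""
    return series.where(series.isin(keep), other)


# Toy data: ATL appears 3 times, DEN twice, PIT and RDU once each.
origin = pd.Series(["ATL", "ATL", "ATL", "DEN", "DEN", "PIT", "RDU"])

keep = top_values(origin, 2)
collapsed = collapse_rare(origin, keep)
print(keep)
print(collapsed.tolist())
```

The list of kept values is computed once, on the training data, and reused for any later dataset so that both end up with the same set of categories.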

In [6]:
df_2 = converter(df, ['Month','DayofMonth','DayOfWeek'])
df_2.head()

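|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | 4 | 20 | 3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | 9 | 2 | 5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | 11 | 25 | 6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | 10 | 7 | 6 | 1828 | WN | MDW | OMA | 423 | Y |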


In [7]:
valid_columns_entries_list = lister(df_2, ['UniqueCarrier','Origin','Dest'], 15)

In [8]:
df_3 = mapper(df_2, ['UniqueCarrier','Origin','Dest'], valid_columns_entries_list, 'OTHER')
df_3.head()

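|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | 4 | 20 | 3 | 1548 | US | OTHER | OTHER | 834 | N |
| 2 | 9 | 2 | 5 | 1422 | XE | OTHER | OTHER | 416 | N |
| 3 | 11 | 25 | 6 | 1015 | OO | DEN | OTHER | 872 | N |
| 4 | 10 | 7 | 6 | 1828 | WN | OTHER | OTHER | 423 | Y |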


In [9]:
df_4 = df_3.copy()
df_4["dep_delayed_15min"] = df_4["dep_delayed_15min"].map({"Y": 1, "N": 0})
df_4.head()

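|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | 1934 | AA | ATL | DFW | 732 | 0 |
| 1 | 4 | 20 | 3 | 1548 | US | OTHER | OTHER | 834 | 0 |
| 2 | 9 | 2 | 5 | 1422 | XE | OTHER | OTHER | 416 | 0 |
| 3 | 11 | 25 | 6 | 1015 | OO | DEN | OTHER | 872 | 0 |
| 4 | 10 | 7 | 6 | 1828 | WN | OTHER | OTHER | 423 | 1 |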


In [10]:
# One-hot encode the remaining categorical columns (UniqueCarrier, Origin, Dest)
df_5 = pd.get_dummies(df_4)
df_5.head()

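|   | Month | DayofMonth | DayOfWeek | DepTime | Distance | dep_delayed_15min | UniqueCarrier_AA | UniqueCarrier_AS | UniqueCarrier_CO | UniqueCarrier_DL | ... | Dest_EWR | Dest_IAH | Dest_LAS | Dest_LAX | Dest_MSP | Dest_ORD | Dest_OTHER | Dest_PHX | Dest_SFO | Dest_SLC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | 1934 | 732 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4 | 20 | 3 | 1548 | 834 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 9 | 2 | 5 | 1422 | 416 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 11 | 25 | 6 | 1015 | 872 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 10 | 7 | 6 | 1828 | 423 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |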


**Dataset Split**

In [11]:
from sklearn.model_selection import train_test_split

y = df_5['dep_delayed_15min'].values
X = df_5.drop(['dep_delayed_15min'], axis=1).values

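# shuffle=False keeps the row order: the last 30% of the rows form the hold-out set,
# so random_state has no effect on this split.
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    shuffle=False)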
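
With `shuffle=False`, `train_test_split` does not sample at random: it keeps the first rows for training and the last rows for the hold-out set. A small sketch on toy arrays (not the flight data) shows this, next to a shuffled, stratified split as one possible alternative that preserves the class ratio. The names `X_toy`, `y_toy`, `X_hold`, and `y_hold_strat` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with a balanced binary target.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0, 1] * 10)

# Ordered split: the last 30% of the rows become the hold-out set.
_, X_hold, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=False)
print(X_hold.ravel())  # rows 14 to 19

# Alternative (an assumption, not what the notebook does): shuffled and stratified.
_, _, _, y_hold_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_hold_strat.mean())  # share of positives in the hold-out set
```

An ordered split is a reasonable choice when the rows have a meaningful order; if they do not, a stratified shuffle usually gives a hold-out set that is more representative of the whole.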

**Model Application**

In [12]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost.sklearn import XGBClassifier
import json

# https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [13]:
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
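
`StratifiedKFold` keeps the class ratio roughly constant across folds, which is useful when one class is rarer than the other. A minimal sketch on a made-up imbalanced target (`X_toy`, `y_toy`, and `skf_toy` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up target: 80 negatives and 20 positives (20% positive rate).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [
    int(y_toy[val_idx].sum()) for _, val_idx in skf_toy.split(X_toy, y_toy)
]
print(positives_per_fold)  # each validation fold receives 4 of the 20 positives
```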

In [81]:
pipeline = Pipeline([#('scaler', StandardScaler()),
                    ('XGB', XGBClassifier(random_state=42))])

In [86]:
gscv_hyperparameters = {'XGB__learning_rate': [0.01],
                        'XGB__max_depth': [100],
                        'XGB__subsample': [0.3, 0.7],
                        'XGB__booster': ['gbtree'],
                        'XGB__colsample_bytree': [0.3, 0.7],
                        'XGB__objective': ['binary:logistic'],
                        'XGB__gamma':[0, 1],
                        'XGB__reg_alpha':[0, 1],
                        'XGB__reg_lambda': [0, 1]}

gscv = GridSearchCV(pipeline,
                    param_grid=gscv_hyperparameters,
                    cv=skf,
                    scoring='roc_auc',
                    verbose=True,
                    n_jobs=-2)

In [87]:
gscv_result = gscv.fit(X_train, y_train)

print("Best Score: ", gscv_result.best_score_)
print("Best Parameters:\n", json.dumps(gscv_result.best_params_, indent=2))

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 23 concurrent workers.
[Parallel(n_jobs=-2)]: Done   4 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-2)]: Done 160 out of 160 | elapsed:  3.8min finished


Best Score:  0.7438005141629636
Best Parameters:
 {
  "XGB__booster": "gbtree",
  "XGB__colsample_bytree": 0.7,
  "XGB__gamma": 0,
  "XGB__learning_rate": 0.01,
  "XGB__max_depth": 100,
  "XGB__objective": "binary:logistic",
  "XGB__reg_alpha": 0,
  "XGB__reg_lambda": 1,
  "XGB__subsample": 0.7
}
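
The counts reported in the log (32 candidates, 160 fits) follow directly from the grid: `GridSearchCV` tries every combination of the listed values, and each combination is fitted once per fold. A sketch of that arithmetic, with the grid copied from the search cell:

```python
from itertools import product

grid = {
    "XGB__learning_rate": [0.01],
    "XGB__max_depth": [100],
    "XGB__subsample": [0.3, 0.7],
    "XGB__booster": ["gbtree"],
    "XGB__colsample_bytree": [0.3, 0.7],
    "XGB__objective": ["binary:logistic"],
    "XGB__gamma": [0, 1],
    "XGB__reg_alpha": [0, 1],
    "XGB__reg_lambda": [0, 1],
}
n_folds = 5

# Cartesian product of all value lists: one tuple per candidate.
candidates = list(product(*grid.values()))
print(len(candidates), len(candidates) * n_folds)  # 32 160
```

Only the five parameters with two values contribute to the count (2^5 = 32); single-value entries fix a setting without enlarging the search. With the default `refit=True`, `GridSearchCV` then fits the best combination once more on all of `X_train`.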


In [88]:
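best_estimator = gscv_result.best_estimator_
# With the default refit=True, GridSearchCV has already refit this estimator on X_train;
# fitting it again here is redundant but harmless.
best_estimator.fit(X_train, y_train)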

Pipeline(memory=None,
         steps=[('XGB',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=0.7, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.01,
                               max_delta_step=0, max_depth=100,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=0, num_parallel_tree=1,
                               objective='binary:logistic', random_state=501,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               subsample=0.7, tree_method='exact',
                               validate_parameters=1, verbosity=None))],
         verbose=False)

In [89]:
y_pred = best_estimator.predict_proba(X_test)[:, 1]
y_pred

array([0.2523174 , 0.2801052 , 0.40619928, ..., 0.31178242, 0.31136003,
       0.34821117], dtype=float32)

**Model Evaluation**

In [90]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, roc_auc_score
import numpy as np

In [91]:
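# y_pred holds probabilities, so the error metrics below compare probabilities with 0/1 labels;
# ROC AUC is the metric required by the assignment.
mae     = mean_absolute_error(y_test, y_pred)
mse     = mean_squared_error(y_test, y_pred)
rmse    = np.sqrt(mse)
r_sqrd  = r2_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)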

print(f"Mean Absolute Error   : {mae:.2f}")
print(f"Mean Squared Error    : {mse:.2f}")
print(f"Root Mean Square Error: {rmse:.2f}")
print(f"R-Squared             : {r_sqrd:.2f}")
print(f"ROC AUC               : {roc_auc:.2f}")

Mean Absolute Error   : 0.36
Mean Squared Error    : 0.15
Root Mean Square Error: 0.39
R-Squared             : 0.01
ROC AUC               : 0.75
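
The regression-style metrics above are computed on predicted probabilities, so they reflect how close the probabilities are to the 0/1 labels rather than how well the model ranks flights. The mean squared error of probabilities against binary labels is the Brier score. ROC AUC, in contrast, depends only on the ordering of the scores. The toy example below (made-up labels and scores) illustrates both points.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, mean_squared_error, roc_auc_score

y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.1, 0.4, 0.35, 0.8])

# MSE on probabilities and the Brier score are the same quantity.
mse_toy = mean_squared_error(y_toy, p_toy)
brier_toy = brier_score_loss(y_toy, p_toy)
print(mse_toy, brier_toy)

# Squaring the scores keeps their order, so AUC is unchanged while MSE is not.
p_squared = p_toy ** 2
auc_before = roc_auc_score(y_toy, p_toy)
auc_after = roc_auc_score(y_toy, p_squared)
mse_after = mean_squared_error(y_toy, p_squared)
print(auc_before, auc_after)
print(mse_after)
```

This is why a model can show a near-zero R-squared and still reach a useful ROC AUC: the two numbers answer different questions.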


**Fit model on the entire `Train` dataset**

In [92]:
best_estimator.fit(X, y)

Pipeline(memory=None,
         steps=[('XGB',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=0.7, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.01,
                               max_delta_step=0, max_depth=100,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=0, num_parallel_tree=1,
                               objective='binary:logistic', random_state=501,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               subsample=0.7, tree_method='exact',
                               validate_parameters=1, verbosity=None))],
         verbose=False)

**Predict on `Test` dataset**

In [93]:
test = pd.read_csv('../data/flight_delays/flight_delays_test.csv')

In [94]:
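test = converter(test, ['Month','DayofMonth','DayOfWeek'])
test = mapper(test, ['UniqueCarrier','Origin','Dest'], valid_columns_entries_list, 'OTHER')
# Align the dummy columns with the training features (same columns, same order)
feature_columns = df_5.drop(columns=['dep_delayed_15min']).columns
test = pd.get_dummies(test).reindex(columns=feature_columns, fill_value=0)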
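
One-hot encoding the test set separately only lines up with the training matrix if both produce the same dummy columns in the same order. A hedged sketch of one way to enforce this, on toy frames with hypothetical names (`train_toy`, `test_toy`): reindex the encoded test frame against the training columns and fill missing dummies with 0.

```python
import pandas as pd

train_toy = pd.DataFrame({"Distance": [732, 834, 416], "Origin": ["ATL", "DEN", "OTHER"]})
test_toy = pd.DataFrame({"Distance": [500, 900], "Origin": ["OTHER", "ATL"]})  # no DEN rows

train_enc = pd.get_dummies(train_toy, dtype=int)
test_enc = pd.get_dummies(test_toy, dtype=int)
print(list(test_enc.columns))  # Origin_DEN is missing here

# Align to the training layout: same columns, same order, absent dummies set to 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
print(test_aligned.values.tolist())
```

This matters most when the model receives plain NumPy arrays, as it does here via `.values`, because a shifted column would then go unnoticed.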

In [95]:
y_pred_test = best_estimator.predict_proba(test.values)
y_pred_test

array([[0.76727617, 0.23272382],
       [0.7395996 , 0.26040044],
       [0.73140454, 0.26859543],
       ...,
       [0.65786386, 0.3421361 ],
       [0.73639894, 0.26360103],
       [0.75647175, 0.24352822]], dtype=float32)
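
Each row of the `predict_proba` output holds one probability per class, in the order given by the classifier's `classes_` attribute (here `[0, 1]`), so column 1 is the predicted probability of a delay. A small check using the first three rows printed above:

```python
import numpy as np

proba = np.array([
    [0.76727617, 0.23272382],
    [0.7395996, 0.26040044],
    [0.73140454, 0.26859543],
])

# Rows sum to 1 up to float32 rounding in the printed values.
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0, atol=1e-6)
print(rows_sum_to_one)

# Probability of class 1 (delayed): the column written to the submission file.
p_delay = proba[:, 1]
print(p_delay)
```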

In [97]:
submission = pd.DataFrame({'id': test.index, 'dep_delayed_15min': y_pred_test[:, 1]})
submission.to_csv('data/submission_fon.csv', index=False)

[submission_result.png](submission_result.png)