<center>
<img src="../../img/ods_stickers.jpg" />
</center> 
     
## <center>  [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

#### <center> Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko) 

# <center>Assignment #3. Spring 2019
## <center> Part 3. Gradient boosting

**In this assignment, you're asked to beat a baseline in the ["Flight delays" competition](https://www.kaggle.com/c/flight-delays-fall-2018).**

Before working on the assignment, review the corresponding course material:
 1. [Classification, Decision Trees and k Nearest Neighbors](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), the same as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn) 
 2. Ensembles:
  - [Bagging](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part1_bagging.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-1-bagging)
  - [Random Forest](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part2_random_forest.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-2-random-forest)
  - [Feature Importance](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part3_feature_importance.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-3-feature-importance)
 3. Gradient boosting:
  - [Gradient boosting](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic10_boosting/topic10_gradient_boosting.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-10-gradient-boosting)
  - Logistic regression, Random Forest, and LightGBM in the "Kaggle Forest Cover Type Prediction" competition: [Kernel](https://www.kaggle.com/kashnitsky/topic-10-practice-with-logit-rf-and-lightgbm)
 4. You can also practice with demo assignments, which are simpler and already shared with solutions:
  - "Decision trees with a toy task and the UCI Adult dataset": [assignment](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees) + [solution](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees-solution)
  - "Logistic Regression and Random Forest in the credit scoring problem": [assignment](https://www.kaggle.com/kashnitsky/assignment-5-logit-and-rf-for-credit-scoring) + [solution](https://www.kaggle.com/kashnitsky/a5-demo-logit-and-rf-for-credit-scoring-sol)
 5. There are also 7 video lectures on trees, forests, boosting and their applications: [mlcourse.ai/video](https://mlcourse.ai/video) 

### Your task is to:
 1. beat **"A3 baseline (8 credits)"** on Public LB (**0.73449** LB score)
 2. rename your [team](https://www.kaggle.com/c/flight-delays-fall-2018/team) in full accordance with the course rating
 
 This task is intended to be relatively easy. You are not required to upload a reproducible solution.
 
### <center> Deadline for A3: 2019 March 31, 20:59 GMT (London time)

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

Download data from the [competition page](https://www.kaggle.com/c/flight-delays-fall-2018/data) and change paths if needed.

In [None]:
train_df = pd.read_csv('flight_delays_train.csv')
test_df = pd.read_csv('flight_delays_test.csv')

In [None]:
train_df.head()

In [None]:
test_df.head()

Given the flight departure time, carrier code, departure airport, destination, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take logistic regression with the two features that are easiest to use: DepTime and Distance. This corresponds to the **"simple logit baseline"** on Public LB.

In [None]:
X_train = train_df[['Distance', 'DepTime']].values 
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test_df[['Distance', 'DepTime']].values

X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(X_train, y_train, 
                     test_size=0.3, random_state=17)

In [None]:
logit_pipe = Pipeline([('scaler', StandardScaler()),
                       ('logit', LogisticRegression(C=1, random_state=17, solver='liblinear'))])

In [None]:
logit_pipe.fit(X_train_part, y_train_part)
logit_valid_pred = logit_pipe.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, logit_valid_pred)

In [None]:
logit_pipe.fit(X_train, y_train)
logit_test_pred = logit_pipe.predict_proba(X_test)[:, 1]

pd.Series(logit_test_pred, 
          name='dep_delayed_15min').to_csv('logit_2feat.csv', 
                                           index_label='id', header=True)

Now you have to beat **"A3 baseline (8 credits)"** on Public LB. This is not hard. Try LightGBM, and perhaps other models or ensembling as well. Include categorical features and do some simple feature engineering. Good luck!

## FREERIDE.

In [None]:
from lightgbm import LGBMClassifier
from scipy.sparse import csr_matrix
from matplotlib import pyplot as plt

In [None]:
train_df.shape, test_df.shape

In [None]:
train_df['dep_delayed_15min'] = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

## EDA.

#### Find out which carriers have the highest and lowest shares of 15-minute delays.

In [None]:
# Count flights per (target, carrier) pair; the shape gives the number of pairs.
train_df.groupby('dep_delayed_15min')['UniqueCarrier']\
    .value_counts()\
    .sort_index(level=0).shape

In [None]:
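def get_distribution_by_target(df, target, feature, split_idx, level=0):
    '''Group the set by target and get the feature distribution.

    `split_idx` is the number of distinct values of `feature`; it assumes
    that every value occurs in both target groups.
    '''
    # Count feature values within each target group, sorted by (target, feature).
    grouped = df.groupby(target)[feature]\
        .value_counts()\
        .sort_index(level=level)

    # Split the counts: the first `split_idx` rows are on-time flights (target 0),
    # the rest are delayed flights (target 1).
    tmp_index = grouped.index.levels[1]
    intime_term = grouped[:split_idx].values
    delay_term = grouped[split_idx:].values

    # Build a dataframe of counts indexed by feature value.
    distr_df = pd.DataFrame([delay_term]).T
    distr_df.set_index(tmp_index, inplace=True)
    distr_df.columns = ['delay']
    distr_df['intime'] = intime_term

    # Delayed flights per 100 on-time flights.
    distr_df['ratio'] = round(distr_df['delay'] / distr_df['intime'] * 100, 2)

    return distr_df.sort_values(by='ratio', ascending=True)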


def plot_distribution(df, title):
    '''Plot the distribution of a feature by ratio.
    '''
    plt.figure(figsize=(11,6))
    plt.barh(y=df.index, width=df.ratio)
    plt.box(False)
    plt.grid(False)
    plt.title('Distribution of {}'.format(title))
    plt.show()
    
    return 0
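
`get_distribution_by_target` needs a hand-picked `split_idx` and relies on every feature value appearing in both target groups. One possible alternative, sketched below on toy data, builds the same counts with `pd.crosstab`, so no split index is needed. The function name `distribution_by_target_crosstab` and the `toy` frame are hypothetical, and the sketch assumes the target is already encoded as 0/1.

```python
import pandas as pd

# Toy data standing in for train_df (illustration only).
toy = pd.DataFrame({
    'UniqueCarrier': ['AA', 'AA', 'AA', 'DL', 'DL', 'WN', 'WN', 'WN', 'WN'],
    'dep_delayed_15min': [0, 1, 1, 0, 0, 0, 0, 0, 1],
})

def distribution_by_target_crosstab(df, target, feature):
    '''Hypothetical variant: count feature values per target class with crosstab.'''
    counts = pd.crosstab(df[feature], df[target])
    # Make sure both classes are present as columns, filling gaps with 0.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ['intime', 'delay']
    # Delayed flights per 100 on-time flights, as in the function above.
    counts['ratio'] = (counts['delay'] / counts['intime'] * 100).round(2)
    return counts.sort_values(by='ratio', ascending=True)

toy_distr = distribution_by_target_crosstab(toy, 'dep_delayed_15min', 'UniqueCarrier')
print(toy_distr)
```

A value that never occurs in one of the classes gets a count of 0 here instead of shifting the rows of the other class.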

In [None]:
unique_carrier_distr = get_distribution_by_target(train_df, 'dep_delayed_15min', 'UniqueCarrier', 22, level=0)

In [None]:
unique_carrier_distr

In [None]:
plot_distribution(unique_carrier_distr, 'UniqueCarrier');

In [None]:
from typing import List

def make_dummy(df, feature: str, values: List[str]):
    '''Make dummy feature matching extracted values from the list.
    '''
    return [1 if val in values else 0 for val in df[feature]]
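
The same membership flag can also be computed without a Python-level loop. The sketch below uses `Series.isin`; the name `make_dummy_isin` and the toy frame are hypothetical, and the result is a pandas Series rather than a list.

```python
import pandas as pd

# Toy data standing in for train_df (illustration only).
toy = pd.DataFrame({'UniqueCarrier': ['AA', 'DL', 'WN', 'AA', 'UA']})

def make_dummy_isin(df, feature, values):
    '''Hypothetical vectorized variant of make_dummy based on Series.isin.'''
    return df[feature].isin(list(values)).astype(int)

toy['is_aa_or_wn'] = make_dummy_isin(toy, 'UniqueCarrier', ['AA', 'WN'])
print(toy)
```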

In [None]:
top_delay_carriers = unique_carrier_distr.index[-6:]
top_intime_carriers = unique_carrier_distr.index[:2]

top_delay_carriers, top_intime_carriers

In [None]:
train_df['top_delay_carriers'] = make_dummy(train_df, 'UniqueCarrier', top_delay_carriers)
test_df['top_delay_carriers'] = make_dummy(test_df, 'UniqueCarrier', top_delay_carriers)

train_df['top_intime_carriers'] = make_dummy(train_df, 'UniqueCarrier', top_intime_carriers)
test_df['top_intime_carriers'] = make_dummy(test_df, 'UniqueCarrier', top_intime_carriers)

In [None]:
train_df.shape, test_df.shape

#### Distribution of target by Month.

In [None]:
# Replace month codes with numbers 1, 2, ..., 12 in the train set.
train_df.Month = train_df.Month.map(lambda name: float(name[2:]))

In [None]:
# Replace month codes with numbers 1, 2, ..., 12 in the test set.
test_df.Month = test_df.Month.map(lambda name: float(name[2:]))

In [None]:
month_distr = get_distribution_by_target(train_df, 'dep_delayed_15min', 'Month', 12, level=0) 

In [None]:
month_distr

In [None]:
plot_distribution(month_distr, 'Month');

In [None]:
top_delay_month = month_distr.index[-3:]
top_intime_month = month_distr.index[:3]

top_delay_month, top_intime_month

In [None]:
train_df['top_delay_month'] = make_dummy(train_df, 'Month', top_delay_month)
test_df['top_delay_month'] = make_dummy(test_df, 'Month', top_delay_month)

train_df['top_intime_month'] = make_dummy(train_df, 'Month', top_intime_month)
test_df['top_intime_month'] = make_dummy(test_df, 'Month', top_intime_month)

In [None]:
train_df.shape, test_df.shape

#### Day of Week.

In [None]:
# Replace day codes with numbers 1, 2, ..., 7 in the train and test sets.
train_df.DayOfWeek = train_df.DayOfWeek.map(lambda name: float(name[2:]))
test_df.DayOfWeek = test_df.DayOfWeek.map(lambda name: float(name[2:]))

In [None]:
day_distr = get_distribution_by_target(train_df, 'dep_delayed_15min', 'DayOfWeek', 7)
day_distr

In [None]:
plot_distribution(day_distr, 'Day of the Week');

In [None]:
top_delay_day = day_distr.index[-2:]
top_intime_day = day_distr.index[:2]

top_delay_day, top_intime_day

In [None]:
train_df['top_delay_day'] = make_dummy(train_df, 'DayOfWeek', top_delay_day)
test_df['top_delay_day'] = make_dummy(test_df, 'DayOfWeek', top_delay_day)

train_df['top_intime_day'] = make_dummy(train_df, 'DayOfWeek', top_intime_day)
test_df['top_intime_day'] = make_dummy(test_df, 'DayOfWeek', top_intime_day)

In [None]:
train_df.shape, test_df.shape

#### Convert DepTime to hour.

In [None]:
train_df['dep_hour'] = train_df.DepTime.map(lambda hour: float(hour // 100))
test_df['dep_hour'] = test_df.DepTime.map(lambda hour: float(hour // 100))

In [None]:
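# Hour 25 is past midnight (e.g. DepTime 2530); map it to hour 1 in both sets.
train_df.loc[train_df.dep_hour == 25, 'dep_hour'] = 1.0
test_df.loc[test_df.dep_hour == 25, 'dep_hour'] = 1.0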
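
A more general way to handle past-midnight values is to wrap the hour with a modulo, sketched here on toy `DepTime` values. This differs from the cell above: it also maps hour 24 to 0, which changes the set of distinct hours, so the `split_idx` passed to `get_distribution_by_target` would need to be adjusted accordingly.

```python
import pandas as pd

# Toy departure times in HHMM form; values of 2400 and above
# are past midnight (illustration only).
toy = pd.DataFrame({'DepTime': [5, 730, 1545, 2359, 2400, 2530]})

# Integer-divide by 100 to get the hour, then wrap hours 24 and 25 to 0 and 1.
toy['dep_hour'] = (toy['DepTime'] // 100 % 24).astype(float)
print(toy)
```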

In [None]:
hour_distr = get_distribution_by_target(train_df, 'dep_delayed_15min', 'dep_hour', 25)
hour_distr

In [None]:
plot_distribution(hour_distr, 'Hour');

In [None]:
most_delay_hour = hour_distr.index[-1:]
top_delay_hour = hour_distr.index[-4:-1]
most_intime_hour = hour_distr.index[:2]
top_intime_hour = hour_distr.index[2:6]

In [None]:
FEATURES = ['most_delay_hour', 'top_delay_hour', 'most_intime_hour', 'top_intime_hour']
VALUES = (most_delay_hour, top_delay_hour, most_intime_hour, top_intime_hour)

In [None]:
for feature, values in zip(FEATURES, VALUES):
    train_df[feature] = make_dummy(train_df, 'dep_hour', values)
    test_df[feature] = make_dummy(test_df, 'dep_hour', values)

#### Distance.

In [None]:
import seaborn as sns

In [None]:
train_df.dep_delayed_15min.isna().value_counts()

In [None]:
sns.catplot(data=train_df, x='Distance', hue='dep_delayed_15min', kind='swarm')
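
A swarm plot draws every observation, which can be slow on a large set. A numeric summary is one alternative way to look at the same relationship: the sketch below bins `Distance` into quantiles and computes the share of delayed flights per bin. It runs on synthetic data with a random target, so the numbers carry no meaning; the bin count of 5 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df (illustration only).
rng = np.random.RandomState(17)
toy = pd.DataFrame({
    'Distance': rng.randint(50, 3000, size=1000),
    'dep_delayed_15min': rng.randint(0, 2, size=1000),
})

# Split Distance into 5 equal-sized bins and compute the share of delays in each.
toy['distance_bin'] = pd.qcut(toy['Distance'], q=5)
delay_rate_by_bin = toy.groupby('distance_bin', observed=True)['dep_delayed_15min'].mean()
print(delay_rate_by_bin)
```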

#### Concatenate Train and Test sets.

In [None]:
df_full = pd.concat([train_df.drop('dep_delayed_15min', axis=1), 
                     test_df])

df_full.shape

In [None]:
df_full.columns
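
One common reason to concatenate the train and test sets is to encode categorical features consistently across both and then split them back. The sketch below shows that pattern on toy frames. One-hot encoding with `pd.get_dummies` and `ignore_index=True` are assumptions chosen for illustration, not necessarily the approach taken in this notebook.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (illustration only).
toy_train = pd.DataFrame({'UniqueCarrier': ['AA', 'DL', 'WN'], 'Distance': [100, 200, 300]})
toy_test = pd.DataFrame({'UniqueCarrier': ['DL', 'UA'], 'Distance': [150, 250]})

# Stack the sets so that encoding sees every category from both of them.
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)
toy_full_ohe = pd.get_dummies(toy_full, columns=['UniqueCarrier'])

# Split back by position: the first len(toy_train) rows are the train part.
n_train = len(toy_train)
toy_train_ohe = toy_full_ohe.iloc[:n_train]
toy_test_ohe = toy_full_ohe.iloc[n_train:]
print(toy_train_ohe.shape, toy_test_ohe.shape)
```

Both parts end up with the same columns, including the one for carrier `UA`, which occurs only in the toy test set.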