# <a href="http://www.datascience-paris-saclay.fr">Paris Saclay Center for Data Science</a>
# <a href=https://www.ramp.studio/problems/air_passengers>RAMP</a> on predicting the number of air passengers

<i> Balázs Kégl (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS) </i>

## Introduction
The data set was donated to us by an unnamed company handling flight ticket reservations. The data is thin; it contains
<ul>
<li> the date of departure
<li> the departure airport
<li> the arrival airport
<li> the mean and standard deviation of the number of weeks of the reservations made before the departure date
<li> a field called <code>log_PAX</code> which is related to the number of passengers (the actual numbers were changed for privacy reasons)
</ul>

The goal is to predict the <code>log_PAX</code> column. The prediction quality is measured by RMSE. 
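
As a quick illustration of the metric, the sketch below computes the RMSE by hand on three made-up values. The helper name `rmse` and the numbers are illustrative only and are not part of the challenge kit.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1D sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up log_PAX values and predictions, only to show the computation
example_rmse = rmse([10.0, 11.0, 12.0], [10.5, 11.0, 11.5])
print(f"RMSE: {example_rmse:.4f}")
```

Because the target is a log-transformed quantity, the error is expressed in the same log units as `log_PAX`.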

The data is obviously limited, but since date and location information are available, it can be joined to external data sets. <b>The challenge in this RAMP is to find good data that can be correlated to flight traffic</b>.

In [3]:
%matplotlib inline
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

## Load the dataset using pandas

The training and testing data are located in the folder `data`. They are compressed `csv` files (i.e. `csv.bz2`). We can load the dataset using pandas.

In [4]:
data = pd.read_csv(
    os.path.join('data', 'train.csv.bz2')
)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DateOfDeparture   8902 non-null   object 
 1   Departure         8902 non-null   object 
 2   Arrival           8902 non-null   object 
 3   WeeksToDeparture  8902 non-null   float64
 4   log_PAX           8902 non-null   float64
 5   std_wtd           8902 non-null   float64
dtypes: float64(3), object(3)
memory usage: 417.4+ KB


In [6]:
data.head()

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159


While it makes sense for `Departure` and `Arrival` to be strings, since they hold airport codes, `DateOfDeparture` should be a date rather than a string. We can use pandas to convert this column.

In [7]:
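# Assign the column directly: with recent pandas versions, `.loc[:, col] = ...`
# sets the values in place and may keep the original object dtype
data['DateOfDeparture'] = pd.to_datetime(data['DateOfDeparture'])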

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   DateOfDeparture   8902 non-null   datetime64[ns]
 1   Departure         8902 non-null   object        
 2   Arrival           8902 non-null   object        
 3   WeeksToDeparture  8902 non-null   float64       
 4   log_PAX           8902 non-null   float64       
 5   std_wtd           8902 non-null   float64       
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 417.4+ KB


When you create a submission, `ramp-workflow` will load the data for you and split it into a data matrix `X` and a target vector `y`. It will also take care of splitting the data into a training and a testing set. These utilities are available in the module `problem.py`, which we will load.

In [9]:
import problem

The function `get_train_data()` loads the training data and returns a pandas dataframe `X` and a numpy vector `y`.

In [10]:
X, y = problem.get_train_data()
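
`problem.py` is not reproduced here. As a rough sketch of the idea, and assuming the target is simply the `log_PAX` column of the training file, the split into `X` and `y` could look like the following. The helper `split_features_target` and the toy rows are hypothetical and only mimic the columns shown above.

```python
import pandas as pd


def split_features_target(data, target_name="log_PAX"):
    """Hypothetical helper: separate the target column from the features."""
    y = data[target_name].to_numpy()
    X = data.drop(columns=[target_name])
    return X, y


# Two toy rows with the same columns as the training file
toy = pd.DataFrame({
    "DateOfDeparture": ["2012-06-19", "2012-09-10"],
    "Departure": ["ORD", "LAS"],
    "Arrival": ["DFW", "DEN"],
    "WeeksToDeparture": [12.875, 14.285714],
    "log_PAX": [12.331296, 10.775182],
    "std_wtd": [9.812647, 9.466734],
})
X_toy, y_toy = split_features_target(toy)
print(X_toy.shape, y_toy.shape)
```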

We can inspect the data matrix `X`:

In [11]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DateOfDeparture   8902 non-null   object 
 1   Departure         8902 non-null   object 
 2   Arrival           8902 non-null   object 
 3   WeeksToDeparture  8902 non-null   float64
 4   std_wtd           8902 non-null   float64
dtypes: float64(2), object(3)
memory usage: 347.9+ KB


In [12]:
X.shape

(8902, 5)

## Preprocessing dates

Getting dates into numerical columns is a common operation when time series data is analyzed with non-parametric predictors. The code below makes the following transformations:

- numerical columns for year (2011-2013), month (1-12), day of the month (1-31), day of the week (0-6), and week of the year (1-52)
- number of days since 1970-01-01

In [13]:
# Make a copy of the original data to avoid writing on the original data
X_encoded = X.copy()

# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
X_encoded['year'] = X_encoded['DateOfDeparture'].dt.year
X_encoded['month'] = X_encoded['DateOfDeparture'].dt.month
X_encoded['day'] = X_encoded['DateOfDeparture'].dt.day
X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
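# `.dt.week` was removed in pandas 2.0; `.dt.isocalendar().week`
# (pandas >= 1.1) gives the same ISO week number
X_encoded['week'] = X_encoded['DateOfDeparture'].dt.isocalendar().week.astype(int)
# Number of days since the Unix epoch, computed on the whole column at once
X_encoded['n_days'] = (X_encoded['DateOfDeparture'] - pd.Timestamp("1970-01-01")).dt.days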

In [14]:
X_encoded.tail(5)

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd,year,month,day,weekday,week,n_days
8897,2011-10-02,DTW,ATL,9.263158,7.316967,2011,10,2,6,39,15249
8898,2012-09-25,DFW,ORD,12.772727,10.641034,2012,9,25,1,39,15608
8899,2012-01-19,SFO,LAS,11.047619,7.908705,2012,1,19,3,3,15358
8900,2013-02-03,ORD,PHL,6.076923,4.030334,2013,2,3,6,5,15739
8901,2011-11-26,DTW,ATL,9.526316,6.167733,2011,11,26,5,47,15304


We will perform all preprocessing steps within a scikit-learn [pipeline](https://scikit-learn.org/stable/modules/compose.html), which chains together transformation and estimator steps. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with `cross_val_score`.

To perform the above encoding within a scikit-learn [pipeline](https://scikit-learn.org/stable/modules/compose.html), we will write a function and wrap it in a `FunctionTransformer` to make it compatible with the scikit-learn API.

In [15]:
from sklearn.preprocessing import FunctionTransformer

def _encode_dates(X):
    # With pandas < 1.0, we will get a SettingWithCopyWarning
    # In our case, we will avoid this warning by triggering a copy
    # More information can be found at:
    # https://github.com/scikit-learn/scikit-learn/issues/16191
    X_encoded = X.copy()

    # Make sure that DateOfDeparture is of datetime format
    # (plain column assignment, so that the dtype is actually replaced)
    X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
    dates = X_encoded['DateOfDeparture']
    # Encode the DateOfDeparture
    X_encoded['year'] = dates.dt.year
    X_encoded['month'] = dates.dt.month
    X_encoded['day'] = dates.dt.day
    X_encoded['weekday'] = dates.dt.weekday
    # ISO week number; `.dt.week` was removed in pandas 2.0
    X_encoded['week'] = dates.dt.isocalendar().week.astype(int)
    # Number of days since 1970-01-01
    X_encoded['n_days'] = (dates - pd.Timestamp("1970-01-01")).dt.days
    # Once we did the encoding, we will not need DateOfDeparture
    return X_encoded.drop(columns=["DateOfDeparture"])

date_encoder = FunctionTransformer(_encode_dates)

In [16]:
date_encoder.fit_transform(X).head()

Unnamed: 0,Departure,Arrival,WeeksToDeparture,std_wtd,year,month,day,weekday,week,n_days
0,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510
1,LAS,DEN,14.285714,9.466734,2012,9,10,0,37,15593
2,DEN,LAX,10.863636,9.035883,2012,10,5,4,40,15618
3,ATL,ORD,11.48,7.990202,2011,10,9,6,40,15256
4,DEN,SFO,11.45,9.517159,2012,2,21,1,8,15391


## Linear regressor

With a linear model, categorical variables should be one-hot encoded rather than ordinal encoded, and numerical variables should be standardized. Thus we will:

- encode the date;
- then one-hot encode all categorical columns, including the encoded date columns;
- standardize the numerical columns.

In [17]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = [
    "Arrival", "Departure", "year", "month", "day",
    "weekday", "week", "n_days"
]

numerical_scaler = StandardScaler()
numerical_cols = ["WeeksToDeparture", "std_wtd"]

preprocessor = make_column_transformer(
    (categorical_encoder, categorical_cols),
    (numerical_scaler, numerical_cols)
)

We can now combine our `preprocessor` with the `LinearRegression` estimator in a `Pipeline`:

In [18]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

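# First the date is expanded into numerical columns, then the columns are
# one-hot encoded or scaled, and finally the regression is fitted
pipeline = make_pipeline(date_encoder, preprocessor, regressor)
pipeline.fit(X, y)
# Predict from the raw `X`: the pipeline applies the date encoding itself
y_pred = pipeline.predict(X)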

And we can evaluate our linear-model pipeline:

In [19]:
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

RMSE: 0.6117 +/- 0.0149
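
As a side note, scikit-learn 0.22 and later also provide a `neg_root_mean_squared_error` scorer, which removes the need for the manual square root. The sketch below checks on synthetic data (generated with `make_regression`, not the flight data) that both routes give the same per-fold RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only for illustration
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
model = LinearRegression()

# Route 1: negated MSE per fold, then square root (as done above)
mse_scores = cross_val_score(
    model, X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)
rmse_from_mse = np.sqrt(-mse_scores)

# Route 2: negated RMSE per fold, directly from the scorer
rmse_direct = -cross_val_score(
    model, X_toy, y_toy, cv=5, scoring="neg_root_mean_squared_error"
)

print(np.allclose(rmse_from_mse, rmse_direct))
```

In both cases the reported figure is the mean of the per-fold RMSE values, which is not exactly the RMSE computed over all folds pooled together.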


**Tests Martha**

Gradient boosting builds an ensemble of decision trees. Trees are added one at a time, and each new tree is fitted to correct the prediction errors made by the trees already in the ensemble. The learning rate scales the contribution of each tree, so a lower learning rate generally requires more trees (`n_estimators`). Fitting each tree on a random fraction of the samples (`subsample` below 1) introduces more variance for each tree, although it can improve the overall performance of the model.
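
To make the sequential nature of boosting concrete, here is a minimal sketch on synthetic data (a noisy sine curve, not the flight data). It uses `staged_predict` to recover the prediction after each added tree and tracks the training error. All names and settings below are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic 1D regression problem: noisy sine curve
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

booster = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2,
    subsample=1.0, random_state=0,
)
booster.fit(X_toy, y_toy)

# One training MSE per boosting stage (i.e. per added tree)
train_mse = [
    mean_squared_error(y_toy, y_stage)
    for y_stage in booster.staged_predict(X_toy)
]
print(f"after 1 tree: {train_mse[0]:.4f}, after 100 trees: {train_mse[-1]:.4f}")
```

The training error keeps decreasing as trees are added; whether the validation error follows is what cross-validation has to tell us.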

In [20]:
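params = {'n_estimators': 2000,  # number of decision trees
          'subsample': 1.0,
          'max_depth': 8,
          'min_samples_split': 5,
          'learning_rate': 0.01}
# The squared-error loss is the default, so `loss` is left unset
# ('ls' was renamed 'squared_error' in scikit-learn 1.0)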

In [54]:
from sklearn import ensemble 

regressor2 = ensemble.GradientBoostingRegressor(**params)

pipeline2 = make_pipeline(date_encoder, preprocessor, regressor2)
pipeline2.fit(X, y)

In [None]:
scores = cross_val_score(
    pipeline2, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
pipeline2 = make_pipeline(date_encoder, preprocessor, regressor2)

space = dict()
space['gradientboostingregressor__n_estimators'] = [10, 500, 1000, 2000]
space['gradientboostingregressor__subsample'] = [0.5, 0.7, 1.0]
space['gradientboostingregressor__max_depth'] = [2, 5, 8]
space['gradientboostingregressor__learning_rate'] = [0.001, 0.01, 0.1, 1.0]

In [None]:
search = RandomizedSearchCV(pipeline2, param_distributions=space, verbose=8)
search.fit(X, y)
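
The keys of the search space follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. The small sketch below, built on a toy pipeline rather than the one above, shows how to list the step names and the tunable parameter names.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline, used only to illustrate the naming convention
toy_pipeline = make_pipeline(StandardScaler(), GradientBoostingRegressor())

# Step names generated by make_pipeline
step_names = list(toy_pipeline.named_steps)
print(step_names)

# Parameter names that can be used as keys in a search space
tunable = sorted(
    name for name in toy_pipeline.get_params()
    if name.startswith("gradientboostingregressor__")
)
print(tunable[:5])
```

Note that `RandomizedSearchCV` samples `n_iter=10` candidates by default, so passing `random_state` is needed to make a search reproducible.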

In [None]:
scores_df = pd.DataFrame(search.cv_results_)
# scores_df = scores_df.sort_values('rank_test_score').reset_index(drop=True)
scores_df

In [None]:
space2 = dict()
# Fill `space2` (not `space`), since it is the dictionary passed to the search below
space2['gradientboostingregressor__min_samples_split'] = [0.1, 0.3, 0.8, 1.0]

In [None]:
search2 = RandomizedSearchCV(pipeline2, param_distributions=space2, verbose=8)
search2.fit(X, y)

In [None]:
scores_df2 = pd.DataFrame(search2.cv_results_)
# scores_df2 = scores_df2.sort_values('rank_test_score').reset_index(drop=True)
scores_df2

**Test Martha, adding data**

source of airport info: https://openflights.org/data.html

In [19]:
# When submitting a kit, the `__file__` variable will correspond to the
# path to `estimator.py`. However, this variable is not defined in the
# notebook, so we must define `__file__` ourselves to imitate
# how a submission `.py` file would work.
__file__ = os.path.join('submissions', 'starting_kit', 'estimator.py')
filepath = os.path.join(os.path.dirname(__file__), 'airport_geolocation.csv')
filepath

'submissions/starting_kit/airport_geolocation.csv'

In [20]:
pd.read_csv('airport_geolocation.csv').head()

Unnamed: 0,Airport ID,Name,City,Country,IATA,ICAO,Latitude,Longitude,Altitude,Timezone,DST
0,3411,Barter Island LRRS Airport,Barter Island,United States,BTI,PABA,70.134003,-143.582001,2,-9,A
1,3412,Wainwright Air Station,Fort Wainwright,United States,\N,PAWT,70.613403,-159.860001,35,-9,A
2,3413,Cape Lisburne LRRS Airport,Cape Lisburne,United States,LUR,PALU,68.875099,-166.110001,16,-9,A
3,3414,Point Lay LRRS Airport,Point Lay,United States,PIZ,PPIZ,69.732903,-163.005005,22,-9,A
4,3415,Hilo International Airport,Hilo,United States,ITO,PHTO,19.721399,-155.048004,38,-10,N


In [59]:
def _merge_external_data(X):
    X = X.copy()  # to avoid raising SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime
    X['DateOfDeparture'] = pd.to_datetime(X['DateOfDeparture'])
    data_geolocation = pd.read_csv('airport_geolocation.csv')
    data_geolocation = data_geolocation[['IATA', 'Latitude', 'Longitude']]
    # Rename the IATA code so that it matches the merge key in X
    data_geolocation = data_geolocation.rename(columns={'IATA': 'Departure'})
    X_merged = pd.merge(
        X, data_geolocation, how='left', on=['Departure'], sort=False
    )
    return X_merged

data_merger = FunctionTransformer(_merge_external_data)


In [60]:
data_merger.fit_transform(X)

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd,Latitude,Longitude
0,2012-06-19,ORD,DFW,12.875000,9.812647,41.978600,-87.904800
1,2012-09-10,LAS,DEN,14.285714,9.466734,36.080101,-115.152000
2,2012-10-05,DEN,LAX,10.863636,9.035883,39.861698,-104.672996
3,2011-10-09,ATL,ORD,11.480000,7.990202,33.636700,-84.428101
4,2012-02-21,DEN,SFO,11.450000,9.517159,39.861698,-104.672996
...,...,...,...,...,...,...,...
8897,2011-10-02,DTW,ATL,9.263158,7.316967,42.212399,-83.353401
8898,2012-09-25,DFW,ORD,12.772727,10.641034,32.896801,-97.038002
8899,2012-01-19,SFO,LAS,11.047619,7.908705,37.618999,-122.375000
8900,2013-02-03,ORD,PHL,6.076923,4.030334,41.978600,-87.904800
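
The merge above only attaches the coordinates of the departure airport. As a hedged sketch of how the same left join could be repeated for both ends of a flight, the toy example below uses made-up frames and hypothetical column names (`Departure_lat`, `Arrival_lon`, ...). `validate="many_to_one"` makes pandas raise an error if an airport code appears more than once in the lookup table, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy flights and a toy airport lookup table (coordinates as shown above)
flights = pd.DataFrame({
    "Departure": ["ORD", "LAS", "ORD"],
    "Arrival": ["DFW", "DEN", "LAS"],
})
airports = pd.DataFrame({
    "IATA": ["ORD", "LAS"],
    "Latitude": [41.9786, 36.080101],
    "Longitude": [-87.9048, -115.152],
})

merged = flights
for side in ["Departure", "Arrival"]:
    # Rename the lookup columns so that each side gets its own coordinates
    coords = airports.rename(columns={
        "IATA": side,
        "Latitude": f"{side}_lat",
        "Longitude": f"{side}_lon",
    })
    merged = merged.merge(coords, how="left", on=side, validate="many_to_one")

# Airports missing from the lookup table end up with NaN coordinates
n_missing = int(merged["Arrival_lat"].isna().sum())
print(merged)
print("arrivals without coordinates:", n_missing)
```

Counting the missing values after a left join is a cheap way to spot airport codes that the external file does not cover.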


In [61]:
pipeline3 = make_pipeline(data_merger, date_encoder, preprocessor, regressor2)


In [None]:
scores = cross_val_score(
    pipeline3, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

**XGBoost**

In [20]:
dat_nicolas = pd.read_csv('testemoica.csv')
dat_nicolas.shape

(8902, 12)

In [21]:
def _merge_external_data_N(X):
    X = X.copy()  # to avoid raising SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime
    X['DateOfDeparture'] = pd.to_datetime(X['DateOfDeparture'])
    data_nicolas = pd.read_csv('testemoica.csv')
    data_nicolas = data_nicolas[['Departure', 'Arrival', 'WeeksToDeparture', 'std_wtd', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    # Merge row by row on the index; columns present in both frames get
    # the `_x` (from X) and `_y` (from the file) suffixes
    X_merged = pd.merge(
        X, data_nicolas, how='left', left_index=True, right_index=True
    )
    X_merged = X_merged[['DateOfDeparture', 'Departure_x', 'Arrival_x', 'WeeksToDeparture_x', 'std_wtd_x', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    # Only the columns listed here are renamed; `std_wtd_x` keeps its
    # suffix and `passsengers_load` its spelling, as in the output below
    X_merged = X_merged.rename(columns={
        'Departure_x': 'Departure',
        'Arrival_x': 'Arrival',
        'WeeksToDeparture_x': 'WeeksToDeparture',
        'passengers load': 'passsengers_load',
        'distance km': 'distance_km',
    })
    return X_merged

data_merger_N = FunctionTransformer(_merge_external_data_N)
data_merger_N.fit_transform(X).head()



Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd_x,year,month,day,weekday,week,n_days,passsengers_load,distance_km
0,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371
1,2012-09-10,LAS,DEN,14.285714,9.466734,2012,9,10,0,37,15593,19959651.0,991.064058
2,2012-10-05,DEN,LAX,10.863636,9.035883,2012,10,5,4,40,15618,25799841.0,1366.918847
3,2011-10-09,ATL,ORD,11.48,7.990202,2011,10,9,6,40,15256,44414121.0,974.522557
4,2012-02-21,DEN,SFO,11.45,9.517159,2012,2,21,1,8,15391,25799841.0,1538.142783


In [22]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]

categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]

preprocessor = make_column_transformer(
    (date_encoder, date_cols),
    (categorical_encoder, categorical_cols),
    remainder='passthrough',  # passthrough numerical columns as they are
)
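
One caveat with `OrdinalEncoder()` in cross-validation: by default it raises an error when a category seen at prediction time was absent from the training fold. A possible safeguard, available in scikit-learn 0.24 and later, is sketched below on toy airport codes; the `-1` placeholder is an arbitrary choice for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "SFO" appears at prediction time but not during fitting
train = pd.DataFrame({"Departure": ["ATL", "DEN", "ORD"]})
test = pd.DataFrame({"Departure": ["ORD", "SFO"]})

# Unknown categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded = encoder.transform(test)
print(encoded.ravel())
```

Tree-based models can usually cope with such a placeholder value, whereas a linear model would treat it as a meaningful number, which is one reason one-hot encoding was used for the linear pipeline.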

In [104]:
import xgboost as xgb
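# 'reg:squarederror' replaces the deprecated 'reg:linear' alias, and
# `random_state` replaces the deprecated `seed` argument
xgb_r = xgb.XGBRegressor(objective='reg:squarederror',
                         n_estimators=1000, random_state=123, max_depth=8)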
pipeline4 = make_pipeline(data_merger_N, preprocessor, xgb_r)
pipeline4.fit(X, y)



Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _merge_external_data_N at 0x7fa780a06af0>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('functiontransformer',
                                                  FunctionTransformer(func=<function _encode_dates at 0x7fa77d2e5ca0>),
                                                  ['DateOfDeparture']),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  ['Arrival', 'D...
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth=8, min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=1000,
           

In [105]:
scores = cross_val_score(
    pipeline4, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

RMSE: 0.3903 +/- 0.0225


In [45]:
import xgboost as xgb
xgb_r = xgb.XGBRegressor(n_estimators = 500, max_depth = 4)
pipeline5 = make_pipeline(data_merger_N, preprocessor, xgb_r)
pipeline5.fit(X, y)

Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _merge_external_data_N at 0x7ffe806464c0>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('functiontransformer',
                                                  FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                  ['DateOfDeparture']),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  ['Arrival', 'D...
                              colsample_bytree=1, gamma=0, gpu_id=-1,
                              importance_type='gain',
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth

In [46]:
scores = cross_val_score(
    pipeline5, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

RMSE: 0.3816 +/- 0.0224


In [37]:
params_xgb = {'n_estimators': 2000, #number of decision trees 
          }

In [41]:
space_xgb = dict()
space_xgb['xgbregressor__n_estimators'] = [500, 1000, 2000, 5000]
space_xgb['xgbregressor__max_depth'] = [2, 4, 6, 8]

In [42]:
search = RandomizedSearchCV(pipeline5, param_distributions=space_xgb, verbose=8)
search.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=2 ......


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=2, score=0.834, total=   2.2s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=2 ......


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.2s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=2, score=0.838, total=   2.1s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=2 ......


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.3s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=2, score=0.833, total=   2.1s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=2 ......


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.5s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=2, score=0.813, total=   2.1s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=2 ......


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    8.6s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=2, score=0.834, total=   2.2s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=4 ......


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.8s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=4, score=0.856, total=   4.0s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=4 ......


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   14.8s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=4, score=0.839, total=   4.3s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=4 ......


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   19.0s remaining:    0.0s


[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=4, score=0.863, total=   4.4s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=4 ......
[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=4, score=0.838, total=   4.1s
[CV] xgbregressor__n_estimators=1000, xgbregressor__max_depth=4 ......
[CV]  xgbregressor__n_estimators=1000, xgbregressor__max_depth=4, score=0.858, total=   4.2s
[CV] xgbregressor__n_estimators=2000, xgbregressor__max_depth=8 ......
[CV]  xgbregressor__n_estimators=2000, xgbregressor__max_depth=8, score=0.855, total=   5.5s
[CV] xgbregressor__n_estimators=2000, xgbregressor__max_depth=8 ......
[CV]  xgbregressor__n_estimators=2000, xgbregressor__max_depth=8, score=0.845, total=   5.1s
[CV] xgbregressor__n_estimators=2000, xgbregressor__max_depth=8 ......
[CV]  xgbregressor__n_estimators=2000, xgbregressor__max_depth=8, score=0.855, total=   5.1s
[CV] xgbregressor__n_estimators=2000, xgbregressor__max_depth=8 ......
[CV]  xgbregress

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  4.8min finished


RandomizedSearchCV(estimator=Pipeline(steps=[('functiontransformer',
                                              FunctionTransformer(func=<function _merge_external_data_N at 0x7ffe806464c0>)),
                                             ('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('functiontransformer',
                                                                               FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                                               ['DateOfDeparture']),
                                                                              ('ordinalencoder',
                                                                               O...
                                                           min_child_weight=1,
                                     

In [44]:
scores_df = pd.DataFrame(search.cv_results_)
# scores_df = scores_df.sort_values('rank_test_score').reset_index(drop=True)
scores_df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgbregressor__n_estimators,param_xgbregressor__max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.894366,0.018658,0.261792,0.015491,1000,2,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.834215,0.83811,0.833492,0.812999,0.833544,0.830472,0.008902,9
1,3.875738,0.134038,0.297964,0.019988,1000,4,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.856041,0.838577,0.863176,0.837994,0.857625,0.850683,0.010398,5
2,5.164105,0.337951,0.299661,0.010206,2000,8,"{'xgbregressor__n_estimators': 2000, 'xgbregre...",0.855428,0.844949,0.854719,0.828442,0.844625,0.845633,0.009752,7
3,2.554008,0.21388,0.327158,0.008927,500,4,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.859004,0.841616,0.863867,0.837594,0.859589,0.852334,0.010605,1
4,1.842755,0.166217,0.325981,0.01137,500,2,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.826916,0.821482,0.825802,0.804028,0.826654,0.820976,0.008698,10
5,3.731842,0.511455,0.346164,0.032921,500,6,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.866295,0.854334,0.858057,0.828751,0.849479,0.851383,0.012582,2
6,11.157346,1.348288,0.342296,0.014839,5000,6,"{'xgbregressor__n_estimators': 5000, 'xgbregre...",0.865567,0.853514,0.85771,0.828576,0.849134,0.8509,0.012408,4
7,8.000274,1.283662,0.367252,0.024799,2000,4,"{'xgbregressor__n_estimators': 2000, 'xgbregre...",0.852668,0.835383,0.861434,0.837861,0.85491,0.848451,0.010109,6
8,9.543667,0.828303,0.355981,0.032957,5000,8,"{'xgbregressor__n_estimators': 5000, 'xgbregre...",0.855428,0.844949,0.854719,0.828442,0.844625,0.845633,0.009752,7
9,6.493992,0.475016,0.362459,0.034016,1000,6,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.865623,0.853529,0.857731,0.828575,0.849153,0.850922,0.012424,3
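
The scores in this table are R² values, because `RandomizedSearchCV` falls back on the estimator's own `score` method when `scoring` is not set. They are therefore not directly comparable with the RMSE figures printed earlier. The sketch below, on synthetic data and with a deliberately tiny search space, shows one way to run the search on RMSE and rank the candidates; all names and settings are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a tiny search space, only to keep the example fast
X_toy, y_toy = make_regression(
    n_samples=150, n_features=4, noise=5.0, random_state=0
)
toy_space = {"n_estimators": [10, 20, 30], "max_depth": [2, 3]}

toy_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=toy_space,
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# Rank the candidates and flip the sign to read the scores as RMSE
results = (
    pd.DataFrame(toy_search.cv_results_)
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
results["mean_rmse"] = -results["mean_test_score"]
print(results[["params", "mean_rmse", "rank_test_score"]])
```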


**Gradient Boost**

In [34]:
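params = {'n_estimators': 2000,  # number of decision trees
          'subsample': 0.5,
          'max_depth': 8,
          'min_samples_split': 5,
          'learning_rate': 0.01}
# The squared-error loss is the default, so `loss` is left unset
# ('ls' was renamed 'squared_error' in scikit-learn 1.0)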

In [35]:
from sklearn import ensemble 

GradBoost = ensemble.GradientBoostingRegressor(**params)

pipeline6 = make_pipeline(data_merger_N, preprocessor, GradBoost)
pipeline6.fit(X, y)

Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _merge_external_data_N at 0x7ffe806464c0>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('functiontransformer',
                                                  FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                  ['DateOfDeparture']),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  ['Arrival', 'Departure'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(learning_rate=0.01, max_depth=8,
                                           min_samples_split=5,
                                           n_estimators=2000, subsample=0.5))])

In [36]:
scores = cross_val_score(
    pipeline6, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

RMSE: 0.3605 +/- 0.0264


In [59]:
from sklearn.model_selection import RandomizedSearchCV

space = dict()
space['gradientboostingregressor__n_estimators'] = [500, 1000, 2000]
space['gradientboostingregressor__subsample'] = [0.5, 0.7, 1.0]
space['gradientboostingregressor__max_depth'] = [2, 5, 8]
space['gradientboostingregressor__learning_rate'] = [0.001, 0.01, 0.1, 1.0]

In [32]:
search = RandomizedSearchCV(pipeline6, param_distributions=space, verbose=8)
search.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001, score=0.619, total=  37.5s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   37.5s remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001, score=0.585, total=  38.0s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001, score=0.616, total=  36.0s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.9min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001, score=0.606, total=  36.1s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.5min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.001, score=0.620, total=  35.2s
[CV] gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.0min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01, score=0.881, total=  40.0s
[CV] gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  3.7min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01, score=0.863, total=  39.9s
[CV] gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  4.4min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01, score=0.875, total=  40.1s
[CV] gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01 
[CV]  gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01, score=0.850, total=  42.5s
[CV] gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01 
[CV]  gradientboostingregressor__subsample=0.5, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=8, gradientboostingregressor__learning_rate=0.01, score=0.877, total=  50.1s
[CV] gradientboo

[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.872, total=   9.8s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=1.0 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=1.0, score=0.708, total=  12.4s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=1.0 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=1.0, score=0.674, total=  12.5s
[CV] gradientboostingregre

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 16.1min finished


RandomizedSearchCV(estimator=Pipeline(steps=[('functiontransformer',
                                              FunctionTransformer(func=<function _merge_external_data_N at 0x7ffe806464c0>)),
                                             ('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('functiontransformer',
                                                                               FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                                               ['DateOfDeparture']),
                                                                              ('ordinalencoder',
                                                                               O...
                                              GradientBoostingRegressor(learning_rate=0.01,
                        

In [43]:
scores_df = pd.DataFrame(search.cv_results_)
# scores_df = scores_df.sort_values('rank_test_score').reset_index(drop=True)
scores_df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgbregressor__n_estimators,param_xgbregressor__max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.894366,0.018658,0.261792,0.015491,1000,2,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.834215,0.83811,0.833492,0.812999,0.833544,0.830472,0.008902,9
1,3.875738,0.134038,0.297964,0.019988,1000,4,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.856041,0.838577,0.863176,0.837994,0.857625,0.850683,0.010398,5
2,5.164105,0.337951,0.299661,0.010206,2000,8,"{'xgbregressor__n_estimators': 2000, 'xgbregre...",0.855428,0.844949,0.854719,0.828442,0.844625,0.845633,0.009752,7
3,2.554008,0.21388,0.327158,0.008927,500,4,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.859004,0.841616,0.863867,0.837594,0.859589,0.852334,0.010605,1
4,1.842755,0.166217,0.325981,0.01137,500,2,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.826916,0.821482,0.825802,0.804028,0.826654,0.820976,0.008698,10
5,3.731842,0.511455,0.346164,0.032921,500,6,"{'xgbregressor__n_estimators': 500, 'xgbregres...",0.866295,0.854334,0.858057,0.828751,0.849479,0.851383,0.012582,2
6,11.157346,1.348288,0.342296,0.014839,5000,6,"{'xgbregressor__n_estimators': 5000, 'xgbregre...",0.865567,0.853514,0.85771,0.828576,0.849134,0.8509,0.012408,4
7,8.000274,1.283662,0.367252,0.024799,2000,4,"{'xgbregressor__n_estimators': 2000, 'xgbregre...",0.852668,0.835383,0.861434,0.837861,0.85491,0.848451,0.010109,6
8,9.543667,0.828303,0.355981,0.032957,5000,8,"{'xgbregressor__n_estimators': 5000, 'xgbregre...",0.855428,0.844949,0.854719,0.828442,0.844625,0.845633,0.009752,7
9,6.493992,0.475016,0.362459,0.034016,1000,6,"{'xgbregressor__n_estimators': 1000, 'xgbregre...",0.865623,0.853529,0.857731,0.828575,0.849153,0.850922,0.012424,3
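
The table above is easier to read once it is ordered by `rank_test_score`. A minimal sketch of that step follows, assuming a hand-made `cv_results` dictionary that stands in for `search.cv_results_`. The three mean scores are copied from rows 0, 3 and 5 above, and the ranks are recomputed for this three-row subset only.

```python
import pandas as pd

# Hypothetical miniature of search.cv_results_; ranks are recomputed
# for this three-row subset and are not the ranks shown in the full table.
cv_results = {
    "param_max_depth": [2, 4, 6],
    "param_n_estimators": [1000, 500, 500],
    "mean_test_score": [0.830472, 0.852334, 0.851383],
    "std_test_score": [0.008902, 0.010605, 0.012582],
    "rank_test_score": [3, 1, 2],
}
scores_df = pd.DataFrame(cv_results)

# sort_values replaces the long-removed DataFrame.sort(columns=...) call;
# reset_index(drop=True) renumbers the rows after sorting.
ranked = scores_df.sort_values("rank_test_score").reset_index(drop=True)
print(ranked[["param_max_depth", "param_n_estimators", "mean_test_score"]])
```

With the best candidate in the first row, `ranked.head()` is enough to compare the top configurations.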


In [None]:
# TODO: pass n_jobs=-1 to RandomizedSearchCV to run the fits in parallel

**Merge 2 datasets**

In [50]:
def _merge_external_data_2(X):
    X = X.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime. Plain column
    # assignment is used because, in recent pandas versions,
    # `X.loc[:, col] = ...` may keep the old object dtype.
    X["DateOfDeparture"] = pd.to_datetime(X["DateOfDeparture"])

    # Merging dataset 1: enplanement and route distance.
    # NOTE: the rows are aligned on the index, which assumes that
    # 'testemoica.csv' has the same row order as the original data.
    data_nicolas = pd.read_csv('testemoica.csv')
    data_nicolas = data_nicolas[['Departure', 'Arrival', 'WeeksToDeparture', 'std_wtd', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    X_merged = pd.merge(
        X, data_nicolas, how='left', left_index=True, right_index=True
    )
    X_merged = X_merged[['DateOfDeparture', 'Departure_x', 'Arrival_x', 'WeeksToDeparture_x', 'std_wtd_x', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    # Drop the '_x' suffixes added by the merge (std_wtd_x is left unchanged)
    # and give the external columns identifier-friendly names.
    X_merged = X_merged.rename(
        columns={'Departure_x': 'Departure', 'Arrival_x': 'Arrival', 'WeeksToDeparture_x': 'WeeksToDeparture', 'passengers load': 'enplanement', 'distance km': 'distance_km'}
    )

    # Merging dataset 2: daily maximum temperature at the arrival airport.
    data_temp = pd.read_csv('external_data.csv', parse_dates=["Date"])
    data_temp = data_temp[['Date', 'AirPort', 'Max TemperatureC']]
    data_temp = data_temp.rename(
        columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival'})
    X_merged2 = pd.merge(
        X_merged, data_temp, how='left', on=['DateOfDeparture', 'Arrival'], sort=False
    )
    return X_merged2


data_merger_2 = FunctionTransformer(_merge_external_data_2)
data_merger_2.fit_transform(X).head()


Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd_x,year,month,day,weekday,week,n_days,passsengers_load,distance_km,Max TemperatureC
0,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,34
1,2012-09-10,LAS,DEN,14.285714,9.466734,2012,9,10,0,37,15593,19959651.0,991.064058,33
2,2012-10-05,DEN,LAX,10.863636,9.035883,2012,10,5,4,40,15618,25799841.0,1366.918847,22
3,2011-10-09,ATL,ORD,11.48,7.990202,2011,10,9,6,40,15256,44414121.0,974.522557,27
4,2012-02-21,DEN,SFO,11.45,9.517159,2012,2,21,1,8,15391,25799841.0,1538.142783,16
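
The first merge above aligns rows by index, which only works while the external file keeps exactly the same row order as `X`. A possible alternative is to merge on explicit keys and let pandas check the key relationship. The sketch below uses two small hypothetical frames (`flights` and `routes`, with distances rounded from the output above), not the actual CSV files.

```python
import pandas as pd

# Hypothetical stand-in for X (three flights from the output above).
flights = pd.DataFrame({
    "DateOfDeparture": pd.to_datetime(["2012-06-19", "2012-09-10", "2012-10-05"]),
    "Departure": ["ORD", "LAS", "DEN"],
    "Arrival": ["DFW", "DEN", "LAX"],
})

# Hypothetical per-route lookup table (one row per Departure/Arrival pair).
routes = pd.DataFrame({
    "Departure": ["ORD", "LAS", "DEN"],
    "Arrival": ["DFW", "DEN", "LAX"],
    "distance_km": [1290.18, 991.06, 1366.92],
})

# Shuffle the flights so that their index no longer matches the lookup table.
shuffled = flights.sample(frac=1.0, random_state=0)

# Merging on keys is independent of row order; validate="many_to_one"
# raises a MergeError if a route appears more than once in `routes`.
merged = shuffled.merge(
    routes, how="left", on=["Departure", "Arrival"], validate="many_to_one"
)
print(merged)
```

Because the join no longer depends on positions, the same function can be applied to any subset of the rows, for instance the folds created during cross-validation.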


**GradientBoost with 2 datasets**

In [63]:
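params_GB_2 = {'n_estimators': 1000,  # number of boosting stages (trees)
               'subsample': 0.7,  # fraction of samples drawn for each tree
               'max_depth': 8,
               'min_samples_split': 5,
               'learning_rate': 0.01,
               # least squares; renamed 'squared_error' in scikit-learn 1.0
               'loss': 'ls'}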

In [64]:
GradBoost_GB_2 = ensemble.GradientBoostingRegressor(**params_GB_2)

pipeline7 = make_pipeline(data_merger_2, preprocessor, GradBoost_GB_2)
pipeline7.fit(X, y)

Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _merge_external_data_2 at 0x7ffe66481b80>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('functiontransformer',
                                                  FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                  ['DateOfDeparture']),
                                                 ('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  ['Arrival', 'Departure'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(learning_rate=0.01, max_depth=8,
                                           min_samples_split=5,
                                           n_estimators=1000, subsample=0.7))])

In [65]:
scores = cross_val_score(
    pipeline7, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)

RMSE: 0.3699 +/- 0.0258
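
The RMSE above is obtained by taking the square root of the negated `neg_mean_squared_error` scores. As far as I know, scikit-learn 0.22 and later also accept `scoring='neg_root_mean_squared_error'`, which returns the same per-fold values directly. The sketch below checks this on synthetic data; `X_toy`, `y_toy` and the small model are assumptions for illustration, not the pipeline of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the flight data (assumption).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Route used in the notebook: negated MSE, then square root per fold.
mse_scores = cross_val_score(
    model, X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)
rmse_from_mse = np.sqrt(-mse_scores)

# Direct route: the scorer already returns the negated RMSE per fold.
rmse_scores = -cross_val_score(
    model, X_toy, y_toy, cv=5, scoring="neg_root_mean_squared_error"
)

print(f"RMSE: {rmse_scores.mean():.4f} +/- {rmse_scores.std():.4f}")
```

Both routes use the same folds and the same seeded model, so the two arrays should agree up to floating-point error.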


In [60]:
search = RandomizedSearchCV(pipeline7, param_distributions=space, verbose=8)
search.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01, score=0.841, total=  17.1s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.1s remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01, score=0.818, total=  18.5s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   35.5s remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01, score=0.835, total=  18.4s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   54.0s remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01, score=0.818, total=  18.3s
[CV] gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.2min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=0.7, gradientboostingregressor__n_estimators=1000, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.01, score=0.838, total=  18.9s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.5min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.874, total=  13.4s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  1.7min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.854, total=  13.3s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  2.0min remaining:    0.0s


[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.870, total=  13.8s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.843, total=  14.1s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=5, gradientboostingregressor__learning_rate=0.1, score=0.869, total=  13.3s
[CV] gradientboostingregre

[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=500, gradientboostingregressor__max_depth=2, gradientboostingregressor__learning_rate=0.001, score=0.156, total=   6.3s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=2, gradientboostingregressor__learning_rate=0.1 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=2, gradientboostingregressor__learning_rate=0.1, score=0.834, total=  21.1s
[CV] gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=2, gradientboostingregressor__learning_rate=0.1 
[CV]  gradientboostingregressor__subsample=1.0, gradientboostingregressor__n_estimators=2000, gradientboostingregressor__max_depth=2, gradientboostingregressor__learning_rate=0.1, score=0.825, total=  20.3s
[CV] gradientboostin

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 20.2min finished


RandomizedSearchCV(estimator=Pipeline(steps=[('functiontransformer',
                                              FunctionTransformer(func=<function _merge_external_data_2 at 0x7ffe66481b80>)),
                                             ('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('functiontransformer',
                                                                               FunctionTransformer(func=<function _encode_dates at 0x7ffe7be18550>),
                                                                               ['DateOfDeparture']),
                                                                              ('ordinalencoder',
                                                                               O...
                                              GradientBoostingRegressor(learning_rate=0.01,
                        

In [61]:
scores_df = pd.DataFrame(search.cv_results_)
# scores_df = scores_df.sort_values('rank_test_score').reset_index(drop=True)
scores_df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gradientboostingregressor__subsample,param_gradientboostingregressor__n_estimators,param_gradientboostingregressor__max_depth,param_gradientboostingregressor__learning_rate,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,17.913896,0.62578,0.309667,0.01004,0.7,1000,5,0.01,"{'gradientboostingregressor__subsample': 0.7, ...",0.84113,0.817569,0.835098,0.817912,0.837957,0.829933,0.010137,4
1,13.289786,0.304292,0.302924,0.036834,1.0,500,5,0.1,"{'gradientboostingregressor__subsample': 1.0, ...",0.874333,0.854305,0.870341,0.842521,0.869075,0.862115,0.011919,2
2,84.854046,10.226326,0.399925,0.059378,1.0,2000,8,0.1,"{'gradientboostingregressor__subsample': 1.0, ...",0.867839,0.848315,0.87114,0.840818,0.863153,0.858253,0.011702,3
3,40.836347,0.47499,0.46215,0.019253,0.5,2000,8,0.001,"{'gradientboostingregressor__subsample': 0.5, ...",0.746784,0.718503,0.744882,0.724376,0.750279,0.736965,0.012928,8
4,9.321905,0.137984,0.26668,0.005739,0.7,500,5,0.01,"{'gradientboostingregressor__subsample': 0.7, ...",0.784133,0.756271,0.778317,0.764948,0.783981,0.77353,0.0111,6
5,5.844905,0.189687,0.27063,0.020624,1.0,500,2,0.001,"{'gradientboostingregressor__subsample': 1.0, ...",0.15493,0.149048,0.156867,0.150811,0.156012,0.153534,0.003057,10
6,20.202624,0.677988,0.321341,0.066372,1.0,2000,2,0.1,"{'gradientboostingregressor__subsample': 1.0, ...",0.834318,0.825241,0.826871,0.814659,0.836355,0.827489,0.007685,5
7,29.848711,0.920852,0.363366,0.037182,0.7,1000,8,0.01,"{'gradientboostingregressor__subsample': 0.7, ...",0.875295,0.852529,0.86991,0.842837,0.87062,0.862238,0.012412,1
8,4.029249,0.258289,0.271449,0.030301,0.5,500,2,0.01,"{'gradientboostingregressor__subsample': 0.5, ...",0.471588,0.456919,0.474424,0.467243,0.468933,0.467821,0.00597,9
9,13.318954,0.386995,0.275624,0.01184,1.0,500,5,0.01,"{'gradientboostingregressor__subsample': 1.0, ...",0.782952,0.758361,0.772332,0.763302,0.776974,0.770784,0.008934,7
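
No `scoring` argument was passed to `RandomizedSearchCV`, so the scores in this table come from the regressor's default scorer (R², higher is better) and cannot be compared directly with the RMSE reported earlier. A minimal sketch of a search that reports RMSE instead is given below. The data, the pipeline and the deliberately tiny `toy_space` are assumptions for illustration and are not the `space` used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight data (assumption, illustration only).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# make_pipeline names each step after its lowercased class name, hence the
# 'gradientboostingregressor__' prefix on every searched parameter.
toy_pipeline = make_pipeline(
    StandardScaler(), GradientBoostingRegressor(random_state=0)
)

# Hypothetical search space, kept small so that the example runs quickly.
toy_space = {
    "gradientboostingregressor__n_estimators": [20, 50],
    "gradientboostingregressor__max_depth": [2, 3],
    "gradientboostingregressor__learning_rate": [0.05, 0.1],
}

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=toy_space,
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

best_rmse = -toy_search.best_score_
print(toy_search.best_params_)
print(f"best cross-validated RMSE: {best_rmse:.2f}")
```

On the real pipeline, passing `n_jobs=-1` as well would run the fits on all available cores instead of sequentially.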


**Merge enplanement, distance, GDP**

In [29]:
def _merge_external_data_3(X):
    X = X.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime. Plain column
    # assignment is used because, in recent pandas versions,
    # `X.loc[:, col] = ...` may keep the old object dtype.
    X["DateOfDeparture"] = pd.to_datetime(X["DateOfDeparture"])

    # Merging dataset 1: enplanement and route distance.
    # NOTE: the rows are aligned on the index, which assumes that
    # 'testemoica.csv' has the same row order as the original data.
    data_nicolas = pd.read_csv('testemoica.csv')
    data_nicolas = data_nicolas[['Departure', 'Arrival', 'WeeksToDeparture', 'std_wtd', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    X_merged = pd.merge(
        X, data_nicolas, how='left', left_index=True, right_index=True
    )
    X_merged = X_merged[['DateOfDeparture', 'Departure_x', 'Arrival_x', 'WeeksToDeparture_x', 'std_wtd_x', 'year', 'month', 'day', 'weekday', 'week', 'n_days', 'passengers load', 'distance km']]
    # Drop the '_x' suffixes added by the merge (std_wtd_x is left unchanged)
    # and give the external columns identifier-friendly names.
    X_merged = X_merged.rename(
        columns={'Departure_x': 'Departure', 'Arrival_x': 'Arrival', 'WeeksToDeparture_x': 'WeeksToDeparture', 'passengers load': 'enplanement', 'distance km': 'distance_km'}
    )

    # Merging dataset 2: GDP per departure airport and year.
    data_gdp = pd.read_csv('gdp_data_processed.csv', parse_dates=["year"])
    data_gdp = data_gdp.rename(
        columns={'year': 'DateOfDeparture', 'airport': 'Departure'})
    data_gdp = date_encoder.fit_transform(data_gdp)
    data_gdp = data_gdp[['Departure', 'LineCode', 'gdp', 'year']]

    # NOTE: the GDP file holds one row per (airport, year, LineCode), so this
    # left merge repeats every flight once per LineCode (see the output below)
    # and the result no longer lines up row by row with y.
    X_merged2 = pd.merge(
        X_merged, data_gdp, how='left', on=['year', 'Departure'], sort=False
    )
    return X_merged2


data_merger_3 = FunctionTransformer(_merge_external_data_3)
data_merger_3.fit_transform(X).head()

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd_x,year,month,day,weekday,week,n_days,enplanement,distance_km,LineCode,gdp
0,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,1.0,561588192.0
1,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,2.0,508763428.0
2,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,3.0,550397.0
3,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,6.0,272006.0
4,2012-06-19,ORD,DFW,12.875,9.812647,2012,6,19,1,25,15510,32171795.0,1290.179371,10.0,
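
The output above repeats the same ORD to DFW flight once per `LineCode`, because the GDP table has several rows for each (airport, year) pair. Two possible ways to get back to one row per flight are sketched below on small hypothetical frames, using GDP values copied from the outputs in this section. Which `LineCode` carries the aggregate figure is an assumption here (1.0 is used purely for illustration) and should be checked against the GDP source.

```python
import pandas as pd

# Hypothetical stand-ins for the merged flights and the long-format GDP table.
flights = pd.DataFrame({"Departure": ["ORD", "ATL"], "year": [2012, 2011]})
gdp_long = pd.DataFrame({
    "Departure": ["ORD", "ORD", "ATL", "ATL"],
    "year": [2012, 2012, 2011, 2011],
    "LineCode": [1.0, 2.0, 1.0, 2.0],
    "gdp": [561588192.0, 508763428.0, 277405509.0, 250779823.0],
})

# Naive merge: every flight is repeated once per LineCode.
naive = flights.merge(gdp_long, how="left", on=["year", "Departure"])

# Option 1: keep a single LineCode (assumed here to be the aggregate one).
gdp_total = gdp_long.loc[gdp_long["LineCode"] == 1.0, ["Departure", "year", "gdp"]]
fixed = flights.merge(
    gdp_total, how="left", on=["year", "Departure"], validate="many_to_one"
)

# Option 2: pivot to wide format, one GDP column per LineCode.
gdp_wide = (
    gdp_long.pivot_table(index=["Departure", "year"], columns="LineCode", values="gdp")
    .add_prefix("gdp_line_")
    .reset_index()
)
wide_merged = flights.merge(
    gdp_wide, how="left", on=["year", "Departure"], validate="many_to_one"
)

print(len(naive), len(fixed), len(wide_merged))
```

In both options `validate="many_to_one"` makes pandas raise an error if the GDP side still contains duplicate keys, so the row count of the result matches that of the input.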


In [27]:
data_gdp = pd.read_csv('gdp_data_processed.csv', parse_dates=["year"])
data_gdp = data_gdp.rename(
    columns={'year': 'DateOfDeparture', 'airport': 'Departure'})
data_gdp = date_encoder.fit_transform(data_gdp)
data_gdp

Unnamed: 0,LineCode,Departure,gdp,year,month,day,weekday,week,n_days
0,1.0,ATL,277405509.0,2011,1,1,5,52,14975
1,2.0,ATL,250779823.0,2011,1,1,5,52,14975
2,3.0,ATL,,2011,1,1,5,52,14975
3,6.0,ATL,267771.0,2011,1,1,5,52,14975
4,10.0,ATL,3704264.0,2011,1,1,5,52,14975
...,...,...,...,...,...,...,...,...,...
2095,88.0,SEA,36704724.0,2013,1,1,1,1,15706
2096,89.0,SEA,8691437.0,2013,1,1,1,1,15706
2097,90.0,SEA,87187960.0,2013,1,1,1,1,15706
2098,91.0,SEA,56342381.0,2013,1,1,1,1,15706
