<a href="https://colab.research.google.com/github/MathMachado/eDreams/blob/master/eDreams2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install & Load Main Python libraries



In [0]:
!pip install bamboolib

In [0]:
import pandas as pd
import numpy as np

import matplotlib
import bamboolib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load dataframes: training & test sample

In [0]:
url_train= "https://raw.githubusercontent.com/MathMachado/eDreams/master/Dataframes/train.csv?token=AGDJQ62XLVU7JXF5SI6OC625WBWJE"
url_test= "https://raw.githubusercontent.com/MathMachado/eDreams/master/Dataframes/test.csv?token=AGDJQ64SG2DNKAW4RWBFGZS5WBWHS"

# Stacking training and validation samples for a single treatment
df_train= pd.read_csv(url_train, sep= ";", index_col='ID', parse_dates = ['DEPARTURE', 'ARRIVAL'])
df_test= pd.read_csv(url_test, sep= ";", index_col='ID', parse_dates = ['DEPARTURE', 'ARRIVAL'])

# Resetting the test sample indices
df_test.index= range(50000, 80000)

# Merge train and test (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df_train, df_test], sort= True)

# Records training and test dataframe indexes to separate these dataframes later
train_index = df_train.index
test_index = df_test.index

In [0]:
df.shape

In [0]:
df.head()

In [0]:
df.tail()

In [0]:
df_test.head()

In [0]:
df_test.tail()

In [0]:
df.info()

# Data Preparation

In [0]:
df_T= df.copy()
# Capturing the Company: First 2 positions of WEBSITE.
df_T['COMPANY']= df_T['WEBSITE'].str[0:2].astype(str)

# Capturing the Country: rest of the string of WEBSITE.
df_T['COUNTRY']= df_T['WEBSITE'].str[2:].astype(str)

df_T.head()

In [0]:
df_T['COMPANY'].value_counts() 

'TL' does not appear to be a valid company code, so I'll replace 'TL' with 'MV' (missing value).

In [0]:
df_T['COMPANY']= df_T['COMPANY'].replace('TL', 'MV')
df_T['COMPANY'].value_counts() 

In [0]:
df_T['COUNTRY'].value_counts() 

I'll treat country codes with 3 characters as missing values ('MV').

In [0]:
df_T['COUNTRY']= df_T['COUNTRY'].replace(['PLC', 'DEC', 'DKC', 'FRC'], 'MV')
df_T['COUNTRY'].value_counts() 

In [0]:
df_T['COUNTRY']= df_T['COUNTRY'].replace(['UK'], 'GB')
df_T['COUNTRY'].value_counts() 
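
The WEBSITE clean-up above can be sketched on toy data as follows. The three WEBSITE values are hypothetical, and the rule "anything that is not a 2-character country code becomes 'MV'" is an assumed generalisation of the explicit list of 3-character codes replaced above.

```python
import pandas as pd

# Hypothetical WEBSITE values: 2-character company code + country code
website = pd.Series(["EDES", "OPUK", "GOFRC"])

company = website.str[:2]   # first two characters
country = website.str[2:]   # everything after the company code

# Treat country codes that are not exactly 2 characters long as missing ('MV'),
# then harmonise 'UK' to 'GB'
country = country.where(country.str.len() == 2, "MV").replace("UK", "GB")

print(company.tolist())
print(country.tolist())
```

Slicing with `.str[2:]` takes the rest of each string without needing an explicit end position.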

## Treating date variables
> Since there is no information regarding the year of the transaction, I will assume that the transactions are from 2018 or 2019. I will assign the year conveniently from the analysis of the variables DEPARTURE and ARRIVAL.

In [0]:
df2= df_T.copy()
df2['DEPARTURE_WITH_YEAR']= df2['DEPARTURE'] +'/2018'
df2['ARRIVAL_WITH_YEAR']= df2['ARRIVAL'] +'/2018'
df2['ARRIVAL_WITH_YEAR_FIXED']= df2['ARRIVAL'] +'/2019'

df2['DEPARTURE_WITH_YEAR']= pd.to_datetime(df2['DEPARTURE_WITH_YEAR'])
df2['ARRIVAL_WITH_YEAR']= pd.to_datetime(df2['ARRIVAL_WITH_YEAR'])
df2['ARRIVAL_WITH_YEAR_FIXED']= pd.to_datetime(df2['ARRIVAL_WITH_YEAR_FIXED'])
df2.head()

Because we do not have year information, some rows have ARRIVAL < DEPARTURE. Let's look at some of these cases:

In [0]:
df3= df2.copy()

# I created the variable IS_ARRIVAL_BEFORE_DEPARTURE to help us identify when ARRIVAL < DEPARTURE:
df3['IS_ARRIVAL_BEFORE_DEPARTURE']= df3['ARRIVAL_WITH_YEAR'] < df3['DEPARTURE_WITH_YEAR']

# Fixing cases when ARRIVAL < DEPARTURE
df3.loc[df3['IS_ARRIVAL_BEFORE_DEPARTURE']== True, 'ARRIVAL_WITH_YEAR']= df3['ARRIVAL_WITH_YEAR_FIXED']
df3[['ARRIVAL', 'DEPARTURE', 'DEPARTURE_WITH_YEAR', 'ARRIVAL_WITH_YEAR']][df3['IS_ARRIVAL_BEFORE_DEPARTURE']== True].head()

> Take for example line 15 (output above):
* DEPARTURE= December 15th;
* ARRIVAL= January 29th.

> Without information for the year, then ARRIVAL < DEPARTURE. However, look at the variables DEPARTURE_WITH_YEAR and ARRIVAL_WITH_YEAR above:
* DEPARTURE_WITH_YEAR= December 15th of 2018;
* ARRIVAL_WITH_YEAR= January 29th of 2019.

In this case, we fixed the problem.
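
The year-rollover fix described above can be sketched on toy data. The day/month string format (`'15/12'`) and the column name `DAYS` are assumptions for illustration only; the real columns may use a different date format.

```python
import pandas as pd

# Toy rows: (December 15 -> January 29) and (July 29 -> August 19)
toy = pd.DataFrame({"DEPARTURE": ["15/12", "29/07"],
                    "ARRIVAL": ["29/01", "19/08"]})

fmt = "%d/%m/%Y"  # assumed format after appending the year
dep = pd.to_datetime(toy["DEPARTURE"] + "/2018", format=fmt)
arr_2018 = pd.to_datetime(toy["ARRIVAL"] + "/2018", format=fmt)
arr_2019 = pd.to_datetime(toy["ARRIVAL"] + "/2019", format=fmt)

# Keep the 2018 arrival unless it falls before the departure;
# in that case the trip crosses New Year, so use the 2019 arrival
arr = arr_2018.where(arr_2018 >= dep, arr_2019)

toy["DAYS"] = (arr - dep).dt.days
print(toy)
```

The first row crosses the year boundary and gets a 2019 arrival, while the second row stays within 2018.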



In [0]:
# Drop the unnecessary variables:
df3= df3.drop(columns= ['DEPARTURE', 'ARRIVAL', 'ARRIVAL_WITH_YEAR_FIXED', 'WEBSITE'])

Next, we calculate the variable ARRIVAL_DEPARTURE = ARRIVAL_WITH_YEAR - DEPARTURE_WITH_YEAR:

In [0]:
# Calculate the variable ARRIVAL_DEPARTURE:
df3['ARRIVAL_DEPARTURE']= (df3['ARRIVAL_WITH_YEAR']-df3['DEPARTURE_WITH_YEAR']).dt.days.astype(int)

# Show some cases:
df3[['DEPARTURE_WITH_YEAR', 'ARRIVAL_WITH_YEAR', 'ARRIVAL_DEPARTURE']].head() #[df3['IS_ARRIVAL_BEFORE_DEPARTURE']== True].head()

> Something strange with the variable ARRIVAL_DEPARTURE. Take a look at the line 2 (above). We have:
* DEPARTURE_WITH_YEAR= 2018-07-29
* ARRIVAL_WITH_YEAR= 2018-08-19

It is 21 days between DEPARTURE and ARRIVAL!

Let's look at some statistics below. For example, let's count the cases where ARRIVAL_DEPARTURE > 5. The threshold of 5 is only an example, but I consider 5 days a long time between departure and arrival.

In [0]:
df3_Zoom= df3[df3['ARRIVAL_DEPARTURE'] > 5]
df3_Zoom.shape[0]

In [0]:
df3.head()

In [0]:
df3.groupby('TRIP_TYPE').agg({'ARRIVAL_DEPARTURE': ['min', 'median','max']})

In [0]:
df3.groupby('HAUL_TYPE').agg({'ARRIVAL_DEPARTURE': ['min', 'median','max']})

In [0]:
df3.groupby(['TRIP_TYPE','HAUL_TYPE']).agg({'ARRIVAL_DEPARTURE': ['min', 'mean', 'median','max']})

Strangely, there are many cases over 5 days.

Below, I present the distribution of the ARRIVAL_DEPARTURE variable. As we can see, there are cases longer than 50 days!

In [0]:
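# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(df3['ARRIVAL_DEPARTURE'], kde=True)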

Below I present some descriptive statistics for ARRIVAL_DEPARTURE.

In [0]:
df3.groupby('HAUL_TYPE').agg({'ARRIVAL_DEPARTURE': ['min', 'median', 'mean', 'max', 'count']})

Let's take a look at the Boxplot:

In [0]:
plt.rcdefaults()
sns.catplot(y='ARRIVAL_DEPARTURE', kind="box", data=df3, height=4, aspect=1.5)
plt.show()

In [0]:
df3['TRIP_TYPE'].value_counts() 

Function to detect Outliers based on IQR-Score:

In [0]:
# Function that caps outliers using the IQR-Score:
def IQR_Score_Outlier_Detect(column):
    global df_T

    Q1 = df_T[column].quantile(0.25)
    Q3 = df_T[column].quantile(0.75)
    IQR = Q3 - Q1
    Lim_Inf= Q1-1.5*IQR
    Lim_Sup= Q3+1.5*IQR
    print(Lim_Inf, Lim_Sup)

    # Replace outliers by Lim_Inf and Lim_Sup in a single step.
    # (Two consecutive np.where calls on df_T[column] would let the second
    # overwrite the first, losing the lower-limit capping.)
    df_T[column+'_IQR'] = df_T[column].clip(lower= Lim_Inf, upper= Lim_Sup)

    # Identify the outlier
    #df_T[column+'_IS_OUTLIER_IQR']= (df_T[column] < Lim_Inf) | (df_T[column] > Lim_Sup)

In [0]:
df_T= df3.copy()
df_T['ARRIVAL_DEPARTURE_2']= df_T['ARRIVAL_DEPARTURE']
IQR_Score_Outlier_Detect('ARRIVAL_DEPARTURE_2')

In [0]:
df_T.head()

In [0]:
# Deleting Unneeded Variables
df4= df_T.copy()
df4= df_T.drop(columns= ['TIMESTAMP','DEPARTURE_WITH_YEAR','ARRIVAL_WITH_YEAR', 'IS_ARRIVAL_BEFORE_DEPARTURE'], axis= 1)

## Handling Missing Values

In [0]:
df4.info()

> DEVICE appears to have some missing values. The missing values in EXTRA_BAGGAGE are expected: it is our response variable, and its 30,000 missing values come from the test sample, so they are exactly the values we want to predict.

In [0]:
# Converting column DISTANCE to numeric. For this purpose, I'll split the string at the "," and keep the part before it
df6= df4.copy()
df6[['DISTANCE_2','DISTANCE_REST']] = df6['DISTANCE'].str.split(",",expand=True)
df6['DISTANCE_2']= pd.to_numeric(df6['DISTANCE_2'])
df6[['HAUL_TYPE','DISTANCE','DISTANCE_2','DISTANCE_REST']].head(10)

In [0]:
df6.groupby('HAUL_TYPE').agg({'DISTANCE_2': ['min', 'median', 'max', 'count']})

Something is strange with the minimum of DISTANCE_2. A DOMESTIC distance of 0 makes no sense, much less an INTERCONTINENTAL distance of 0. Let's investigate this a little further. From here on, I will work with DISTANCE_2 (which I rename to DISTANCE below) and disregard DISTANCE_REST.

In [0]:
df6= df6.drop(columns= ['DISTANCE_REST','DISTANCE'], axis= 1)
df6= df6.rename({'DISTANCE_2': 'DISTANCE'}, axis=1)
df6.head()

In [0]:
# How many cases where DISTANCE = 0?
df6[['DISTANCE']][df6['DISTANCE']==0].count()

There are 288 records where DISTANCE = 0. I consider these records to be Missing Values.

In [0]:
median_by__HAUL_TYPE= df6.groupby('HAUL_TYPE')['DISTANCE'].median()
median_by__TRIP_TYPE= df6.groupby('TRIP_TYPE')['DISTANCE'].median()

In [0]:
median_by__HAUL_TYPE

In [0]:
median_by__TRIP_TYPE

In [0]:
median_by__HAUL_TYPE['CONTINENTAL']

In [0]:
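# Index by label rather than by position, so the mapping does not depend on sort order
median_DISTANCE_C= median_by__HAUL_TYPE['CONTINENTAL']
median_DISTANCE_D= median_by__HAUL_TYPE['DOMESTIC']
median_DISTANCE_I= median_by__HAUL_TYPE['INTERCONTINENTAL']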

In [0]:
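# Index by label rather than by position, so the mapping does not depend on sort order
median_DISTANCE_RT= median_by__TRIP_TYPE['ROUND_TRIP']
median_DISTANCE_OW= median_by__TRIP_TYPE['ONE_WAY']
median_DISTANCE_MD= median_by__TRIP_TYPE['MULTI_DESTINATION']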

In [0]:
# Identifying Missing Values in DISTANCE. In this case, zeros.
df6.loc[df6['DISTANCE'] == 0, 'DISTANCE']= np.nan

# Checking Missing Values
df6.isna().sum()

Let's treat Missing Values in DISTANCE and DEVICE below:

In [0]:
df6['DISTANCE_HT']= df6['DISTANCE']
df6['DISTANCE_TT']= df6['DISTANCE']

# Missing Value imputation for DOMESTIC
df6['DISTANCE_HT'] = np.where(((df6['DISTANCE_HT'].isnull()) & (df6['HAUL_TYPE'] =="DOMESTIC")), median_DISTANCE_D, df6['DISTANCE_HT'])

# Missing Value imputation for INTERCONTINENTAL
df6['DISTANCE_HT'] = np.where(((df6['DISTANCE_HT'].isnull()) & (df6['HAUL_TYPE'] =="INTERCONTINENTAL")), median_DISTANCE_I, df6['DISTANCE_HT'])

# Missing Value imputation for CONTINENTAL
df6['DISTANCE_HT'] = np.where(((df6['DISTANCE_HT'].isnull()) & (df6['HAUL_TYPE'] =="CONTINENTAL")), median_DISTANCE_C, df6['DISTANCE_HT'])

##############
# Missing Value imputation for MULTI_DESTINATION
df6['DISTANCE_TT'] = np.where(((df6['DISTANCE_TT'].isnull()) & (df6['TRIP_TYPE'] =="MULTI_DESTINATION")), median_DISTANCE_MD, df6['DISTANCE_TT'])

# Missing Value imputation for ONE_WAY
df6['DISTANCE_TT'] = np.where(((df6['DISTANCE_TT'].isnull()) & (df6['TRIP_TYPE'] =="ONE_WAY")), median_DISTANCE_OW, df6['DISTANCE_TT'])

# Missing Value imputation for ROUND_TRIP
df6['DISTANCE_TT'] = np.where(((df6['DISTANCE_TT'].isnull()) & (df6['TRIP_TYPE'] =="ROUND_TRIP")), median_DISTANCE_RT, df6['DISTANCE_TT'])

df6= df6.drop(columns= ['DISTANCE'])

In [0]:
# Checking Missing Values
df6.isna().sum()

In [0]:
# Treating Missing Values of DEVICE
df6['DEVICE'].value_counts() 

In [0]:
# Replacing NaN's of DEVICE with 'NO_DEVICE'
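# Assign the result back: an inplace fillna on a selected column may not update df6 in recent pandas versions
df6["DEVICE"]= df6["DEVICE"].fillna("NO_DEVICE")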

# Checking Missing Values
df6.isna().sum()

As we can see above, missing values have been addressed.
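
As a more compact alternative to the per-category `np.where` calls above, the same group-median imputation can be sketched with `groupby(...).transform`. This is an illustration on toy data, not the notebook's own code. One deliberate difference: the zeros are marked as missing before the medians are computed, so they do not pull the medians down.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HAUL_TYPE": ["DOMESTIC", "DOMESTIC", "DOMESTIC",
                  "CONTINENTAL", "CONTINENTAL", "CONTINENTAL"],
    "DISTANCE": [300.0, 0.0, 500.0, 1200.0, 1800.0, 0.0],
})

# A distance of 0 is treated as a missing value
toy["DISTANCE"] = toy["DISTANCE"].replace(0, np.nan)

# Median per HAUL_TYPE, broadcast back to every row (NaN values are skipped)
group_median = toy.groupby("HAUL_TYPE")["DISTANCE"].transform("median")

# Fill each missing distance with the median of its own group
toy["DISTANCE_HT"] = toy["DISTANCE"].fillna(group_median)
print(toy)
```

The same pattern with `"TRIP_TYPE"` as the grouping key would produce the DISTANCE_TT variant.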

In [0]:
# Checking...
df6['DEVICE'].value_counts() 

In [0]:
df7= df6.copy()
df7.head()

# Handling Outliers in DISTANCE using IQR-Score
> Consider the following output:

In [0]:
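# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(df7['DISTANCE_HT'], kde=True)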

As we can see above, we have some outliers in the DISTANCE variable.
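
To make the IQR rule concrete, here is a minimal sketch on a toy series. The helper name `iqr_clip` and the multiplier argument `k` are hypothetical and used only for this illustration.

```python
import pandas as pd

def iqr_clip(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those limits."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
clipped = iqr_clip(toy)
print(clipped.tolist())
```

For this series Q1 = 3 and Q3 = 7, so the limits are -3 and 13, and only the value 100 is capped.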

In [0]:
df_T= df7.copy()
df_T['DISTANCE_HT_2']= df_T['DISTANCE_HT']
df_T['DISTANCE_TT_2']= df_T['DISTANCE_TT']

IQR_Score_Outlier_Detect('DISTANCE_HT_2')
IQR_Score_Outlier_Detect('DISTANCE_TT_2')

df_T.head()

Response-variable distribution (the IQR treatment only adds capped columns and removes no rows, so this distribution is unchanged):

In [0]:
df8= df_T.copy()

# Distribution of the response variable EXTRA_BAGGAGE
df8['EXTRA_BAGGAGE'].value_counts() 

# Binning numeric features

In [0]:
df9= df8.copy()
df9['EXTRA_BAGGAGE'].value_counts() 

In [0]:
d_Var_Target= {True: 1, False: 0}
df9['EXTRA_BAGGAGE']= df9['EXTRA_BAGGAGE'].map(d_Var_Target)
df9.head()

In [0]:
df9.info()

## Treating numerical variables

In [0]:
l_Vars_Num= ['DISTANCE_HT', 'DISTANCE_TT', 'DISTANCE_HT_2', 'DISTANCE_TT_2', 'DISTANCE_HT_2_IQR', 'DISTANCE_TT_2_IQR', 'ARRIVAL_DEPARTURE', 'ARRIVAL_DEPARTURE_2', 'ARRIVAL_DEPARTURE_2_IQR']
for var in l_Vars_Num:
    df9[var+'_CAT']= pd.cut(df9[var], 10)
    df9= df9.drop(columns= [var], axis= 1)

df9.head()

## Treating categorical variables

In [0]:
df10= df9.copy()

In [0]:
df10 = pd.get_dummies(df10, columns=['DISTANCE_HT_CAT','DISTANCE_HT_2_CAT', 'DISTANCE_TT_CAT', 'DISTANCE_TT_2_CAT', 'DISTANCE_HT_2_IQR_CAT', 'DISTANCE_TT_2_IQR_CAT', 'ADULTS','ARRIVAL_DEPARTURE_CAT','ARRIVAL_DEPARTURE_2_CAT', 'ARRIVAL_DEPARTURE_2_IQR_CAT','CHILDREN','GDS','INFANTS','NO_GDS','DEVICE', 'HAUL_TYPE', 'PRODUCT', 'SMS', 'TRAIN', 'TRIP_TYPE', 'COMPANY', 'COUNTRY'], drop_first=True)
df10.head()
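
The two steps above (equal-width binning with `pd.cut`, then one-hot encoding with `pd.get_dummies`) can be sketched on a toy series. The values, the 4 bins, and the `DISTANCE_CAT` prefix are assumptions chosen to keep the example small.

```python
import pandas as pd

distance = pd.Series([0, 10, 20, 30, 40], name="DISTANCE")

# Four equal-width, right-closed intervals spanning the observed range
binned = pd.cut(distance, 4)
codes = binned.cat.codes.tolist()

# One indicator column per interval; drop_first removes the first interval
# so the remaining columns are not perfectly collinear
dummies = pd.get_dummies(binned, prefix="DISTANCE_CAT", drop_first=True)

print(binned.cat.categories)
print(dummies.columns.tolist())
```

Because the bin edges are derived from the data, the resulting column names embed those edges, which is why the columns dropped later in the notebook have names such as `DISTANCE_TT_CAT_(2006.3, 3979.6]`.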

# Modeling

## Train/Test Split

### Balancing the training sample

In [0]:
df10['EXTRA_BAGGAGE'].value_counts()

In [0]:
sns.countplot(x= 'EXTRA_BAGGAGE', data= df10, palette= 'hls')

So, I'll select 10,000 EXTRA_BAGGAGE= 0 and all EXTRA_BAGGAGE= 1.
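
A minimal sketch of this undersampling on toy data follows. The column `x`, the sample size `n_negatives`, and the fixed `random_state` are assumptions for illustration; fixing the seed simply makes the sample reproducible.

```python
import pandas as pd

# Toy data: 3 positives and 9 negatives
toy = pd.DataFrame({"EXTRA_BAGGAGE": [1] * 3 + [0] * 9, "x": range(12)})

n_negatives = 4  # hypothetical; the notebook keeps 10,000 negatives

positives = toy[toy["EXTRA_BAGGAGE"] == 1]                      # keep all
negatives = toy[toy["EXTRA_BAGGAGE"] == 0].sample(n=n_negatives,
                                                  random_state=0)

# Stack both parts into the reduced training sample
balanced = pd.concat([positives, negatives])
counts = balanced["EXTRA_BAGGAGE"].value_counts().to_dict()
print(counts)
```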

In [0]:
# Selecting all EXTRA_BAGGAGE= 1...
df11_1= df10[df10['EXTRA_BAGGAGE']== 1]
df11_1.shape

In [0]:
# Sampling 10,000 rows with EXTRA_BAGGAGE= 0...
df11_temp= df10[df10['EXTRA_BAGGAGE']== 0]
df11_0= df11_temp.sample(n= 10000)
df11_0.shape

In [0]:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df11= pd.concat([df11_1, df11_0])
df11.shape

The following is the training sample after balancing.

In [0]:
df12= df11.copy()
X= df12[df12['EXTRA_BAGGAGE'].notna()]
X= X.drop(columns= ['EXTRA_BAGGAGE'], axis= 1)
y= df12[['EXTRA_BAGGAGE']][df12['EXTRA_BAGGAGE'].notna()]

print(X.shape, y.shape)

Feature importance through 'Random Forest':

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators = 500, max_depth=15)
rf_clf.fit(X, y)
rf_y_pred = rf_clf.predict(X)

# Plot the 30 most important features, largest at the top
importances = pd.Series(rf_clf.feature_importances_, index = X.columns).nlargest(30)
ax = importances.plot(kind = 'barh', figsize = (9, 9), title = 'Feature importance from RandomForest')
ax.invert_yaxis();

# Dropping variables

In [0]:
df13= df12.copy()
df13.head()

In [0]:
df13= df13.drop(columns= ['DISTANCE_HT_2_CAT_(17792.7, 19766.0]', 'DISTANCE_TT_CAT_(2006.3, 3979.6]', 'DISTANCE_TT_CAT_(3979.6, 5952.9]', 'DISTANCE_TT_CAT_(5952.9, 7926.2]', 'DISTANCE_TT_CAT_(7926.2, 9899.5]', 'DISTANCE_TT_CAT_(9899.5, 11872.8]', 'DISTANCE_TT_CAT_(11872.8, 13846.1]', 'DISTANCE_TT_CAT_(13846.1, 15819.4]', 'DISTANCE_TT_CAT_(15819.4, 17792.7]', 'DISTANCE_TT_CAT_(17792.7, 19766.0]', 'DISTANCE_TT_2_CAT_(2006.3, 3979.6]', 'DISTANCE_TT_2_CAT_(3979.6, 5952.9]', 'DISTANCE_TT_2_CAT_(5952.9, 7926.2]', 'DISTANCE_TT_2_CAT_(7926.2, 9899.5]', 'DISTANCE_TT_2_CAT_(9899.5, 11872.8]', 'DISTANCE_TT_2_CAT_(11872.8, 13846.1]', 'DISTANCE_TT_2_CAT_(13846.1, 15819.4]', 'DISTANCE_TT_2_CAT_(15819.4, 17792.7]', 'DISTANCE_TT_2_CAT_(17792.7, 19766.0]'])

In [0]:
X= df13[df13['EXTRA_BAGGAGE'].notna()]
X= X.drop(columns= ['EXTRA_BAGGAGE'], axis= 1)
y= df13[['EXTRA_BAGGAGE']][df13['EXTRA_BAGGAGE'].notna()]

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators = 500, max_depth=15)
rf_clf.fit(X, y)
rf_y_pred = rf_clf.predict(X)

# Plot the 30 most important features, largest at the top
importances = pd.Series(rf_clf.feature_importances_, index = X.columns).nlargest(30)
ax = importances.plot(kind = 'barh', figsize = (9, 9), title = 'Feature importance from RandomForest')
ax.invert_yaxis();

In [0]:
from sklearn.feature_selection import SelectFromModel

# Create a list of feature names
feat_labels = X.columns

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.005
sfm = SelectFromModel(rf_clf, threshold=0.005)

# Train the selector
sfm.fit(X, y)

In [0]:
# Print the names of the most important features
l_Vars= []
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
    l_Vars.append(feat_labels[feature_list_index])

Note below that we have reduced the number of columns/variables. Instead of 112, we now work with only 28.

In [0]:
X2= df13[l_Vars][df13['EXTRA_BAGGAGE'].notna()]
print(X2.shape, y.shape)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X2,y, test_size= 0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Next, we will fit the following estimators / classifiers to the training sample: Decision Tree, Random Forest, AdaBoost and Extra Trees.

# Evaluation
> Since I'll submit a binary output, I need to use the F1-Score, as suggested in the challenge. First, let's understand what the F1-Score metric is:

* **Precision**: When the model predicts positive, how often is it correct? A low precision can also indicate a large number of False Positives.

    $Precision= \frac{TruePositives}{TruePositives + FalsePositives}$

* **Recall**: Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate. A low recall indicates many False Negatives.

    $Recall= \frac{TruePositives}{TruePositives + FalseNegatives}$

* **F1 Score**: F1 score conveys the balance between the precision and the recall.

    $F1= 2*\frac{Precision*Recall}{Precision+Recall}$

Source: [Classification Accuracy is Not Enough: More Performance Measures You Can Use](https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/)
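
The three formulas above can be checked by hand with a small pure-Python sketch. The function name `precision_recall_f1` and the toy labels are hypothetical; the positive class is assumed to be encoded as 1.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all equal 2/3.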

## Interpretation
> A good F1 score means that you have few false positives and few false negatives, so the model correctly identifies the positive cases (here, bookings with extra baggage) without raising many false alarms. 
>> An F1 score is considered perfect when it’s 1, while the model is a total failure when it’s 0.

In [0]:
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import f1_score, accuracy_score

### Decision Tree Classifier

In [0]:
from sklearn.tree import DecisionTreeClassifier

DT= DecisionTreeClassifier(max_depth= 5)
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)
f1_score(y_test, y_pred)

In [0]:
accuracy_score(y_test, y_pred)

### Random Forest Classifier

In [0]:
from sklearn.ensemble import RandomForestClassifier

RF= RandomForestClassifier(max_depth= 5, n_estimators= 1000, n_jobs= -1)
RF.fit(X_train,y_train)
y_pred = RF.predict(X_test)
f1_score(y_test, y_pred, average= 'weighted')

In [0]:
accuracy_score(y_test, y_pred)

### AdaBoost Classifier

In [0]:
from sklearn.ensemble import AdaBoostClassifier

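# The base learner is passed positionally: its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
AB = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8), n_estimators=600)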
AB.fit(X_train,y_train)
y_pred = AB.predict(X_test)
f1_score(y_test, y_pred, average= 'weighted')

In [0]:
accuracy_score(y_test, y_pred)

### Extra Trees Classifier

In [0]:
from sklearn.ensemble import ExtraTreesClassifier

ET= ExtraTreesClassifier(n_estimators = 750, max_features = 'sqrt', max_depth = 35,  criterion = 'entropy', random_state = 20111974)
ET.fit(X_train,y_train)
y_pred = ET.predict(X_test)
f1_score(y_test, y_pred, average= 'weighted')

In [0]:
accuracy_score(y_test, y_pred)

# Conclusion

DecisionTreeClassifier achieved the best F1-Score. Finally, I will fine-tune its parameters to try to improve the F1-Score.

In [0]:
tree_parameter = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150], 'min_samples_split': [2, 10, 20], 'min_samples_leaf': [1, 5, 10], 'max_leaf_nodes': [None, 5, 10, 20]}
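# Select parameters by weighted F1 (the metric reported below); without scoring, GridSearchCV uses accuracy
clf = GridSearchCV(DecisionTreeClassifier(), tree_parameter, scoring= 'f1_weighted', cv= 5)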
clf.fit(X_train, y_train)

In [0]:
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average= 'weighted')

As we can see, after fine tuning, I had a slight improvement in the value of F1-Score. So, my final F1-Score= 0.6666.
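
When comparing the scores above, note that `f1_score` was called with its default (binary) averaging for the Decision Tree and with `average= 'weighted'` for the other models, and these two variants can differ on the same predictions. The pure-Python sketch below illustrates the difference on toy labels. The helper names are hypothetical, and the weighted variant follows the usual definition (per-class F1 averaged with class supports as weights).

```python
def f1_for_label(y_true, y_pred, label):
    """F1 of one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class supports as weights."""
    labels = sorted(set(y_true))
    weighted_sum = sum(f1_for_label(y_true, y_pred, label) * y_true.count(label)
                       for label in labels)
    return weighted_sum / len(y_true)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

binary = f1_for_label(y_true, y_pred, 1)   # F1 of the positive class only
weighted = weighted_f1(y_true, y_pred)     # support-weighted over both classes
print(binary, weighted)
```

On these labels the binary F1 is 2/3 while the weighted F1 is 0.75, so scores are only comparable when computed with the same averaging.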

# Submitting my model

In [0]:
X_Val= df10[l_Vars][df10['EXTRA_BAGGAGE'].isna()]
y_Val= df10[['EXTRA_BAGGAGE']][df10['EXTRA_BAGGAGE'].isna()]

print(X_Val.shape, y_Val.shape)

In [0]:
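# Build the submission frame directly instead of adding a column to X_Val
# (this avoids SettingWithCopyWarning and keeps X_Val reusable for predict)
y_pred_submission= pd.DataFrame({'EXTRA_BAGGAGE': clf.predict(X_Val)}, index= X_Val.index)
y_pred_submission.head()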

In [0]:
y_pred_submission['EXTRA_BAGGAGE']= y_pred_submission['EXTRA_BAGGAGE'].map({0.0: False, 1.0: True})
y_pred_submission.head()

In [0]:
y_pred_submission.to_csv(r'eDreams_Submission.csv')