<h1>Importing All Necessary Modules</h1>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV

## For cloning
from sklearn.base import clone

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

The dataset is provided, and the data preparation steps are already written in the next cells; only the reading step is left to you. After uploading the dataset to the notebook, run all of these cells to save time.

In [2]:
## Read
df = pd.read_csv('../data/airplane satisfaction.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [3]:
df = pd.get_dummies(df, drop_first = True)

## In my case, get_dummies returns boolean columns, so I cast them to integers with the line below.
## If your pandas version also returns booleans (recent versions do), you can uncomment it.
# df.iloc[:, -6:] = df.iloc[:, -6:].astype(np.int64)

df.rename({'satisfaction_satisfied': 'satisfaction'}, axis = 1, inplace = True)

cols = ['Online boarding', 'Inflight entertainment', 'Seat comfort', 'On-board service', 'Leg room service', 'Cleanliness', 'Flight Distance',  'Inflight wifi service', 'Baggage handling', 'Inflight service', 'Checkin service', 'Food and drink', 'Ease of Online booking', 'Age', 'Class_Eco Plus', 'Customer Type_disloyal Customer', 'Type of Travel_Personal Travel', 'Class_Eco', 'satisfaction']
cols_for_scaling = ['Age', 'Flight Distance']

df = df[cols]
X = df.drop('satisfaction', axis = 1)
y = df['satisfaction']

ss = StandardScaler().fit(X[cols_for_scaling])
X[cols_for_scaling] = ss.transform(X[cols_for_scaling])
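
The dtype remark in the cell above can be checked on a toy frame. This is a minimal sketch with made-up column names (`color`, `size`), not the airline data. It relies on the `dtype` argument of `pd.get_dummies`, which asks for integer columns directly and so avoids a separate cast.

```python
import pandas as pd

# Toy frame with hypothetical columns, standing in for the real dataset
toy = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# Depending on the pandas version, the dummy column comes back as bool or as a small integer type
raw = pd.get_dummies(toy, drop_first = True)
print(raw.dtypes)

# Requesting integers up front makes the result independent of the pandas version
encoded = pd.get_dummies(toy, drop_first = True, dtype = int)
print(encoded)
```

With `drop_first = True` the `color_blue` column is dropped, so only `color_red` remains next to the untouched `size` column.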

In [4]:
y.value_counts()

satisfaction
0    58879
1    45025
Name: count, dtype: int64

<h1>Modeling</h1>

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
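
The scaler in the preparation cell was fit on every row before this split, so the test rows influenced its mean and standard deviation. A stricter variant fits the scaler on the training rows only and reuses it on the test rows. The sketch below shows this on synthetic data; the names `X_demo`, `y_demo` and `scaler` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric columns such as Age and Flight Distance
rng = np.random.default_rng(42)
X_demo = rng.normal(loc = 40, scale = 15, size = (1000, 2))
y_demo = rng.integers(0, 2, size = 1000)

# Split first (the default test share is 25%), then fit the scaler on the training part only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 42)
scaler = StandardScaler().fit(X_tr)

X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuses the training mean and standard deviation

print(X_tr_scaled.mean(axis = 0))      # zero up to floating point error
print(X_te_scaled.mean(axis = 0))      # close to zero, but not exactly
```

On a dataset of this size the effect on the scores is probably small, so treat this as a habit worth having rather than a correction of the results.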

Create three models: LogisticRegression, LinearSVC with default parameters, and LinearSVC with a different parameter value, such as C = 10. Train them separately and check their scores on the test dataset. Note that LinearSVC is time-consuming to train, though faster than SVC.

In [6]:
print('fitting...')
lscv = LinearSVC().fit(X_train, y_train)
print('scoring...')
lscv.score(X_test, y_test)

fitting...
scoring...




0.8746535263319988

In [7]:
print('fitting...')
lscv = LinearSVC(C = 10).fit(X_train, y_train)
print('scoring...')
lscv.score(X_test, y_test)

fitting...
scoring...




0.8611410532799507

In [8]:
log = LogisticRegression().fit(X_train, y_train)
log.score(X_test, y_test)

0.8735756082537727

Your next task is to combine them into a list of models and fit a voting classifier. Use hard voting, since LinearSVC does not provide probabilities (it has no predict_proba method).
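
Hard voting only needs the predicted labels: for each sample, the label chosen by most models wins. Here is a minimal NumPy sketch of that rule on made-up predictions. The helper `hard_vote` is hypothetical and only illustrates the idea; it is not the scikit-learn implementation.

```python
import numpy as np

def hard_vote(predictions):
    """Return the most frequent label per sample for an (n_models, n_samples) array of integer labels."""
    predictions = np.asarray(predictions)
    # bincount counts the votes per label; argmax picks the label with the most votes
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, predictions)

# Three hypothetical models (rows) voting on four samples (columns)
toy_predictions = np.array([[0, 1, 1, 0],
                            [1, 1, 0, 0],
                            [0, 1, 1, 1]])

majority = hard_vote(toy_predictions)
print(majority)
```

With an odd number of models and two classes there can be no tie, which is one reason three estimators are a convenient choice here.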

In [9]:
clfs = [('lscv', LinearSVC(C = 1)),
        ('lscv_C_10', LinearSVC(C = 10)),
        ('log', LogisticRegression())]

In [10]:
vc = VotingClassifier(estimators = clfs, voting = 'hard')
vc.fit(X_train, y_train)



In [11]:
vc.score(X_test, y_test)

0.8754619648906683

Combine n different KNeighborsClassifier models and fit them with VotingClassifier. This time you can benefit from the soft voting approach. Try to use a small number of models, since KNeighborsClassifier is time-consuming. Note that a small number of neighbors will mostly produce probabilities of exactly 0 or 1 on this dataset rather than intermediate values, so use a two-digit value for n_neighbors.
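
The reason behind the n_neighbors advice is that, with uniform weights, a KNN probability is just the share of the k nearest neighbors that belong to a class, so it can only take the values 0, 1/k, ..., 1. The following sketch shows this on synthetic data. The names `X_demo`, `y_demo` and `distinct_values` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data with a simple linear class boundary
rng = np.random.default_rng(0)
X_demo = rng.normal(size = (300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

distinct_values = {}
for k in (1, 5, 20):
    knn = KNeighborsClassifier(n_neighbors = k).fit(X_demo[:200], y_demo[:200])
    proba = knn.predict_proba(X_demo[200:])[:, 1]
    # Collect the distinct probabilities the model produced for class 1
    distinct_values[k] = np.unique(proba)
    print(k, distinct_values[k])
```

With k = 1 the only possible outputs are 0 and 1, so averaging such probabilities behaves almost like counting votes. Larger k gives finer-grained probabilities for soft voting to average.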

In [12]:
def get_ensemble_neighbors(init, n_estimators, step):
    models = []
    for i in range(init, init + n_estimators*step, step):
        models += [('knc_' + str(i),KNeighborsClassifier(n_neighbors = i))]
    return models

In [13]:
models = get_ensemble_neighbors(10, 5, 10)

In [14]:
models

[('knc_10', KNeighborsClassifier(n_neighbors=10)),
 ('knc_20', KNeighborsClassifier(n_neighbors=20)),
 ('knc_30', KNeighborsClassifier(n_neighbors=30)),
 ('knc_40', KNeighborsClassifier(n_neighbors=40)),
 ('knc_50', KNeighborsClassifier(n_neighbors=50))]

In [15]:
vc2 = VotingClassifier(estimators = models, voting = 'soft').fit(X_train, y_train)

In [16]:
vc2.score(X_test, y_test)

0.9275100092392978

Now train each KNN model individually and get its predicted probabilities. Then average the probabilities across models, use the averages to make predictions, and compute the accuracy of those predictions.

In [17]:
probas = np.zeros((len(models), y_test.size, 2))

for i, model in enumerate(models):
    ## Fitting and scoring
    score = model[1].fit(X_train, y_train).score(X_test, y_test)
    print('model '+ str(i) + ': ' + str(score))
    
    ## Getting probabilities
    ## Note that both score and predict_proba compute the neighbor distances, so calling both roughly doubles
    ## the time needed per iteration. To save computation, you can call only predict_proba and derive the
    ## score from its output.
    model_probas = model[1].predict_proba(X_test)
    probas[i] = model_probas

model 0: 0.9292808746535264
model 1: 0.9270095472744071
model 2: 0.924853711117955
model 3: 0.922928857406837
model 4: 0.9220049276255005
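
The comment in the loop suggests calling only predict_proba and deriving the score from it. A minimal sketch of that shortcut follows, on synthetic data with hypothetical names (`demo_probas`, `demo_preds`, `manual_score`). An odd n_neighbors is used so that no tie can occur between the two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples = 500, n_features = 5, random_state = 0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 0)

knn = KNeighborsClassifier(n_neighbors = 15).fit(X_tr, y_tr)

# A single neighbor search: probabilities first, then labels and accuracy derived from them
demo_probas = knn.predict_proba(X_te)
demo_preds = knn.classes_[demo_probas.argmax(axis = 1)]
manual_score = (demo_preds == y_te).mean()

print(manual_score)
```

The value should match `knn.score(X_te, y_te)`, which would otherwise repeat the neighbor search.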


In [18]:
probas.shape

(5, 25976, 2)

In [19]:
avg_probas = probas.mean(axis = 0)

In [20]:
preds = np.zeros((avg_probas.shape[0]))
preds[avg_probas[:, 1] > 0.5] = 1
preds[avg_probas[:, 1] <= 0.5] = 0
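
For two classes, the thresholding above is equivalent to taking the column with the larger average probability. This is a small sketch on made-up averages (`toy_avg` is hypothetical). As far as I know, soft voting in VotingClassifier also takes an argmax over the averaged probabilities, which would explain the identical score below.

```python
import numpy as np

# Hypothetical averaged probabilities for three samples: columns are class 0 and class 1
toy_avg = np.array([[0.7, 0.3],
                    [0.4, 0.6],
                    [0.5, 0.5]])   # the last row is an exact tie

by_threshold = (toy_avg[:, 1] > 0.5).astype(int)
by_argmax = toy_avg.argmax(axis = 1)   # on a tie, argmax returns the first column, i.e. class 0

print(by_threshold, by_argmax)
```

The argmax form also yields integer labels directly and extends unchanged to more than two classes.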

In [21]:
preds

array([0., 0., 0., ..., 1., 0., 0.])

In [22]:
## Exactly the same as the VotingClassifier output.
accuracy_score(y_test, preds)

0.9275100092392978

Make your forest! Bagging or pasting is the sampling technique behind the Random Forest approach. Your task now is to create a BaggingClassifier with DecisionTreeClassifier as its estimator and train it. Set bootstrap to True first, then rerun with False to compare the results. With True, each model is trained on samples drawn with replacement (bagging); with False, samples are drawn without replacement (pasting). Please set random_state so that the scores can also be compared with RandomForestClassifier later.
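
The difference between the two sampling schemes is easy to see on row indices alone. The sketch below uses NumPy only, with hypothetical names (`row_ids`, `bagged`, `pasted`). It draws one bagging sample and one pasting sample from 1000 rows.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000
row_ids = np.arange(n_rows)

# Bagging: draw as many rows as the dataset has, with replacement, so some rows repeat
bagged = rng.choice(row_ids, size = n_rows, replace = True)

# Pasting: draw a subset without replacement, so every drawn row is distinct
pasted = rng.choice(row_ids, size = n_rows // 2, replace = False)

bagged_unique_share = np.unique(bagged).size / n_rows
print(bagged_unique_share)                     # typically around 0.63
print(np.unique(pasted).size == pasted.size)   # True
```

One caveat, to the best of my understanding of BaggingClassifier: max_samples defaults to 1.0, so with bootstrap = False every tree is given the full training set and the ensemble behaves much like a single tree. Setting max_samples below 1.0 is what makes pasting produce different trees.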

In [23]:
forest = BaggingClassifier(estimator = DecisionTreeClassifier(max_depth = 10),
                 n_estimators = 100,
                 random_state = 42,
                 bootstrap = True, # False case is called Pasting,
                 n_jobs = -1,
                 )

In [24]:
forest.fit(X_train, y_train)

In [25]:
forest.score(X_test, y_test)

0.9476439790575916

Additionally, train a single DecisionTreeClassifier to see how the results change.

In [26]:
DecisionTreeClassifier(max_depth = 10).fit(X_train, y_train).score(X_test, y_test)

0.9409454881429011

Train a RandomForestClassifier with the same parameters used for the bagging ensemble and its decision trees.

In [27]:
rfc = RandomForestClassifier(max_depth = 10, random_state = 42).fit(X_train, y_train)
rfc.score(X_test, y_test)

0.9448721897135818
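
The random forest score is close to the bagging score but not identical, even with the same max_depth and random_state. One likely reason is that RandomForestClassifier also samples a subset of the features at every split, which plain bagged trees do not. This minimal sketch on synthetic data, with hypothetical names, inspects the relevant defaults and switches feature sampling off with max_features = None.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples = 600, n_features = 12, random_state = 0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 0)

# Defaults: 100 bootstrapped trees, each split considering only a subset of the features
default_rf = RandomForestClassifier(max_depth = 10, random_state = 42)
rf_params = default_rf.get_params()
print(rf_params['n_estimators'], rf_params['bootstrap'], rf_params['max_features'])

# max_features = None lets every split consider all features, as bagged decision trees do
all_features_rf = RandomForestClassifier(max_depth = 10, max_features = None, random_state = 42)

demo_scores = {'feature sampling': default_rf.fit(X_tr, y_tr).score(X_te, y_te),
               'all features': all_features_rf.fit(X_tr, y_tr).score(X_te, y_te)}
print(demo_scores)
```

Even with feature sampling switched off, the two ensembles need not match exactly, because they consume their random state differently.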

XGBoost is an advanced and fairly complex model, and one of the favorites of the ML community, thanks to its core technique: boosting. The following code sets a few parameters. Get familiar with the xgboost documentation and explore other parameters.

In [28]:
from xgboost import XGBClassifier

In [29]:
from sklearn.metrics import log_loss
xgbc = XGBClassifier(max_depth = 10,
                     eval_metric = log_loss, ## logloss is already the default metric here, so the log shows it twice
                     early_stopping_rounds = None, 
                     n_jobs = -1)
xgbc.fit(X_train, y_train, 
         eval_set = [(X_test, y_test)])

[0]	validation_0-logloss:0.48036	validation_0-log_loss:0.48036
[1]	validation_0-logloss:0.36181	validation_0-log_loss:0.36181
[2]	validation_0-logloss:0.28667	validation_0-log_loss:0.28667
[3]	validation_0-logloss:0.23411	validation_0-log_loss:0.23411
[4]	validation_0-logloss:0.19791	validation_0-log_loss:0.19791
[5]	validation_0-logloss:0.17216	validation_0-log_loss:0.17216
[6]	validation_0-logloss:0.15390	validation_0-log_loss:0.15390
[7]	validation_0-logloss:0.14043	validation_0-log_loss:0.14043
[8]	validation_0-logloss:0.13065	validation_0-log_loss:0.13065
[9]	validation_0-logloss:0.12294	validation_0-log_loss:0.12294
[10]	validation_0-logloss:0.11650	validation_0-log_loss:0.11650
[11]	validation_0-logloss:0.11143	validation_0-log_loss:0.11143
[12]	validation_0-logloss:0.10819	validation_0-log_loss:0.10819
[13]	validation_0-logloss:0.10558	validation_0-log_loss:0.10558
[14]	validation_0-logloss:0.10378	validation_0-log_loss:0.10379
[15]	validation_0-logloss:0.10234	validation_0-log

In [30]:
xgbc.score(X_test, y_test)

0.9594240837696335
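
The per-round log above shows what boosting does: each round adds a tree that corrects the current ensemble, and the validation loss falls quickly at first, then flattens. The same behavior can be sketched with scikit-learn's GradientBoostingClassifier, which is swapped in here only so that the example does not depend on xgboost. It is a different implementation, and the data and names (`val_losses`, `best_round`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples = 1000, n_features = 10, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state = 0)

gbc = GradientBoostingClassifier(n_estimators = 30, max_depth = 3, random_state = 0).fit(X_tr, y_tr)

# staged_predict_proba yields the ensemble's probabilities after each boosting round
val_losses = [log_loss(y_val, stage_probas) for stage_probas in gbc.staged_predict_proba(X_val)]

for round_id, loss in enumerate(val_losses):
    print(f'[{round_id}] validation log loss: {loss:.5f}')

# The round with the lowest validation loss is where early stopping would aim to stop
best_round = int(np.argmin(val_losses))
print('lowest validation loss at round', best_round)
```

If early stopping is enabled, the set used to pick the stopping round should ideally be a separate validation split rather than the test set, otherwise the reported test score is optimistic.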

In [31]:
help(xgbc)

Help on XGBClassifier in module xgboost.sklearn object:

class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
 |  XGBClassifier(*, objective: Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType] = 'binary:logistic', use_label_encoder: Optional[bool] = None, **kwargs: Any) -> None
 |  
 |  Implementation of the scikit-learn API for XGBoost classification.
 |  
 |  
 |  Parameters
 |  ----------
 |  
 |      n_estimators : int
 |          Number of boosting rounds.
 |  
 |      max_depth :  Optional[int]
 |          Maximum tree depth for base learners.
 |      max_leaves :
 |          Maximum number of leaves; 0 indicates no limit.
 |      max_bin :
 |          If using histogram-based algorithm, maximum number of bins per feature
 |      grow_policy :
 |          Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow
 |          depth-wise. 1: favor splitting at nodes with highest loss change.
 |      learni