Now that the data is prepared, we can focus on building models on it.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

import joblib

In [16]:
data = pd.read_csv('data.csv')
data.head()

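| Unnamed: 0 | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 13 | 1 | 2 | 3.408408 | 3 | 4 | 3 | 1 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | 0 |
| 1 | 1 | 1 | 25 | 0 | 0 | 2.979982 | 3 | 2 | 3 | 3 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | 0 |
| 2 | 0 | 0 | 26 | 0 | 0 | 4.08821 | 2 | 2 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | 1 |
| 3 | 0 | 0 | 25 | 0 | 0 | 3.547703 | 2 | 5 | 5 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | 0 |
| 4 | 1 | 0 | 61 | 0 | 0 | 2.92471 | 3 | 3 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | 1 |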


## Splitting the data into training, validation and test sets

In [17]:
labels = data['satisfaction']
features = data.drop('satisfaction', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

In [4]:
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))

0.6
0.2
0.2
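
The split above does not stratify, so the class balance can drift slightly between the three sets. As a minimal sketch, assuming synthetic stand-in data (the `features` and `labels` built here are hypothetical, not the airline data), passing `stratify` to `train_test_split` keeps the share of each class roughly equal across the splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 1000 rows, roughly 30% positive labels.
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
labels = pd.Series((rng.random(1000) < 0.3).astype(int), name="satisfaction")

# First split: 60% training, 40% held out, preserving the class ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    features, labels, test_size=0.4, random_state=42, stratify=labels
)
# Second split: halve the held-out part into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold
)

# Print the size and the positive-class share of each split.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

Whether stratification matters depends on how imbalanced `satisfaction` is; with a large, fairly balanced dataset the plain split is usually close enough.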


In [18]:
X_val.to_csv('val_features.csv', index=False)
X_test.to_csv('test_features.csv', index=False)

y_val.to_csv('val_labels.csv', index=False)
y_test.to_csv('test_labels.csv', index=False)

## Cross Validation

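Cross-validation is used to choose the hyperparameters of each algorithm: `GridSearchCV` scores every candidate value on 5 folds of the training set and keeps the one with the best mean score.

The following helper prints the best parameters, followed by the mean score and spread (two standard deviations) for each candidate: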

In [11]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
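
When the printed list gets long, the same `cv_results_` dictionary can be viewed as a sorted table instead. The helper `results_table` below is a hypothetical addition, shown on toy data from `make_classification` rather than on the airline data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def results_table(results):
    """Return the grid-search scores as a DataFrame, best candidate first."""
    table = pd.DataFrame(results.cv_results_)
    columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    # A stable sort keeps tied candidates in their original grid order.
    return (
        table[columns]
        .sort_values("rank_test_score", kind="stable")
        .reset_index(drop=True)
    )


# Toy data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 1, 100]}, cv=5)
search.fit(X, y)

table = results_table(search)
print(table)
```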

## Logistic Regression

The `C` hyperparameter is the inverse of the regularization strength. A larger `C` means weaker regularization, which lets the model fit the training data more closely and can cause overfitting.

In [6]:
lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())

print_results(cv)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

BEST PARAMS: {'C': 0.01}

0.864 (+/-0.005) for {'C': 0.001}
0.868 (+/-0.005) for {'C': 0.01}
0.867 (+/-0.009) for {'C': 0.1}
0.867 (+/-0.007) for {'C': 1}
0.868 (+/-0.006) for {'C': 10}
0.867 (+/-0.007) for {'C': 100}
0.867 (+/-0.006) for {'C': 1000}
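
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` messages mean the solver stopped before it converged. Following the hint in the message itself, one option is to standardize the features and raise `max_iter`. A minimal sketch on toy data (the step names `scale` and `lr`, the value `max_iter=1000`, and the artificially inflated column are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one column on a much larger scale than the others.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1000.0

# Scaling happens inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
# Parameters of a pipeline step are addressed as "<step>__<parameter>".
parameters = {"lr__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipeline, parameters, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler in the pipeline, rather than scaling `X_train` once up front, avoids leaking information from the held-out fold into the fit.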


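With `C = 0.01`, Logistic Regression reaches a mean cross-validated accuracy of 86.8% on the training set. The scores for all seven values of `C` lie between 86.4% and 86.8%, so the regularization strength has little effect here.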
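
Because the accuracy barely moves with `C` here, the effect of regularization is easier to see on the coefficients themselves. In this sketch on toy data (not the airline data), the norm of the coefficient vector grows as `C` increases, that is, as the penalty weakens:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data, used only to make the sketch runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in [0.001, 0.1, 10, 1000]:
    # A generous max_iter so the solver converges for every C.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<7} ||coef|| = {norms[C]:.3f}")
```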

We save the best model with `joblib` so that we can reuse it later.

In [7]:
joblib.dump(cv.best_estimator_, 'models/LR_model.pkl')

['models/LR_model.pkl']
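
In a later session, the saved model can be loaded with `joblib.load` and scored on the validation files written earlier. The sketch below is self-contained, so it writes toy stand-ins for those files to a temporary directory; the file names mirror the ones above, but the data and the `f0` to `f4` column names are made up:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data standing in for the training and validation sets.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y, name="satisfaction")
X_train, y_train = X.iloc[:200], y.iloc[:200]
X_val, y_val = X.iloc[200:], y.iloc[200:]

with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "LR_model.pkl")
    features_path = os.path.join(workdir, "val_features.csv")
    labels_path = os.path.join(workdir, "val_labels.csv")

    # "Training session": persist the fitted model and the validation split.
    joblib.dump(LogisticRegression(max_iter=1000).fit(X_train, y_train), model_path)
    X_val.to_csv(features_path, index=False)
    y_val.to_csv(labels_path, index=False)

    # "Later session": reload everything and score the model.
    model = joblib.load(model_path)
    val_features = pd.read_csv(features_path)
    # The labels file has a single column; select it to get a 1-D Series.
    val_labels = pd.read_csv(labels_path)["satisfaction"]
    val_accuracy = accuracy_score(val_labels, model.predict(val_features))

print(round(val_accuracy, 3))
```

Note that `joblib.dump` does not create missing directories, so a `models/` folder has to exist before the call above it runs.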

## Support Vector Machine

Cross-validation on an SVM takes a long time because the dataset is large, so we train a single model with the default hyperparameters instead of tuning it.

In [7]:
svc = SVC()
SVC_model = svc.fit(X_train,y_train)
joblib.dump(SVC_model, 'models/SVC_model.pkl')

['models/SVC_model.pkl']
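
If the full grid search is too slow, one possible compromise is to search on a stratified subsample and then refit the winning configuration on all of the training data. This is a sketch of that idea on toy data, not what the notebook does; the 20% fraction, the grid, and the scaling step are assumptions:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the training set.
X_train, y_train = make_classification(n_samples=2000, n_features=10, random_state=0)

# Keep a stratified 20% subsample for the hyperparameter search.
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=0.2, stratify=y_train, random_state=42
)

pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
parameters = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, parameters, cv=3)
search.fit(X_small, y_small)
print(search.best_params_)

# Refit an unfitted copy of the best configuration on the full training set.
final_model = clone(search.best_estimator_).fit(X_train, y_train)
```

The best setting on a subsample is not guaranteed to be the best on the full data, so this trades some accuracy in the search for a large cut in running time.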

## Multilayer Perceptron

As with the SVM, cross-validation takes too long, so we train with the default hyperparameters.

In [8]:
mlp = MLPClassifier()
MLP_model = mlp.fit(X_train, y_train)
# Save the fitted MLP model
joblib.dump(MLP_model, 'models/MLP_model.pkl')

['models/MLP_model.pkl']
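
A cheaper alternative to 5-fold cross-validation is to compare a few candidates on a single held-out validation set, at the cost of a noisier estimate. A rough sketch on toy data, with hypothetical candidate layer sizes and an added scaling step:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data split into a training part and a validation part.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit each candidate once on the training part and score it on the validation part.
scores = {}
for hidden_layer_sizes in [(10,), (50,), (50, 50)]:
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, max_iter=500, random_state=42),
    )
    model.fit(X_train, y_train)
    scores[hidden_layer_sizes] = model.score(X_val, y_val)

best = max(scores, key=scores.get)
print(scores, "best:", best)
```

If the validation set is used for tuning in this way, only the test set remains an unbiased check of the final model.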

## Random Forest

`n_estimators` is the number of decision trees in the forest. `max_depth` is the maximum depth of each tree; with `None`, the trees grow until their leaves are pure or too small to split further.

In [10]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())

In [12]:
print_results(cv)

BEST PARAMS: {'max_depth': None, 'n_estimators': 250}

0.843 (+/-0.026) for {'max_depth': 2, 'n_estimators': 5}
0.873 (+/-0.01) for {'max_depth': 2, 'n_estimators': 50}
0.877 (+/-0.007) for {'max_depth': 2, 'n_estimators': 250}
0.898 (+/-0.014) for {'max_depth': 4, 'n_estimators': 5}
0.906 (+/-0.005) for {'max_depth': 4, 'n_estimators': 50}
0.909 (+/-0.005) for {'max_depth': 4, 'n_estimators': 250}
0.935 (+/-0.005) for {'max_depth': 8, 'n_estimators': 5}
0.938 (+/-0.001) for {'max_depth': 8, 'n_estimators': 50}
0.938 (+/-0.002) for {'max_depth': 8, 'n_estimators': 250}
0.951 (+/-0.004) for {'max_depth': 16, 'n_estimators': 5}
0.959 (+/-0.003) for {'max_depth': 16, 'n_estimators': 50}
0.959 (+/-0.003) for {'max_depth': 16, 'n_estimators': 250}
0.95 (+/-0.005) for {'max_depth': 32, 'n_estimators': 5}
0.961 (+/-0.002) for {'max_depth': 32, 'n_estimators': 50}
0.962 (+/-0.003) for {'max_depth': 32, 'n_estimators': 250}
0.95 (+/-0.004) for {'max_depth': None, 'n_estimators': 5}
0.96 (+/-0.0

Random Forest reaches a mean cross-validated accuracy of about 96.2% on the training set, well above the 86.8% of Logistic Regression.

In [13]:
joblib.dump(cv.best_estimator_, 'models/RF_model.pkl')

['models/RF_model.pkl']
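
To make the two hyperparameters concrete: `n_estimators` shows up as the length of `estimators_`, and `max_depth` bounds what each tree's `get_depth()` returns. A small sketch on toy data (not the airline data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, used only to make the sketch runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Five trees, each capped at depth 4.
rf = RandomForestClassifier(n_estimators=5, max_depth=4, random_state=42).fit(X, y)
depths = [tree.get_depth() for tree in rf.estimators_]
print(len(rf.estimators_), depths)

# Same forest size without a depth cap: the trees are free to grow deeper.
unbounded = RandomForestClassifier(n_estimators=5, max_depth=None, random_state=42).fit(X, y)
deepest = max(tree.get_depth() for tree in unbounded.estimators_)
print(deepest)
```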

## Gradient Boosting

As with SVM and MLP, cross-validation would take too long, so we use the default hyperparameters.

In [14]:
gb = GradientBoostingClassifier()
GB_model = gb.fit(X_train, y_train)
joblib.dump(GB_model, 'models/GB_model.pkl')

['models/GB_model.pkl']
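
`RandomizedSearchCV` is one way to keep tuning affordable, since it evaluates only `n_iter` sampled combinations rather than the full grid. A sketch on toy data; the value lists, `cv=3`, and `n_iter=5` are illustrative assumptions, not tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data, used only to make the sketch runnable.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 27 combinations in total, of which only 5 are sampled and evaluated.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [1, 3, 5],
    "learning_rate": [0.01, 0.1, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```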