<div style="background-color:#f0f0f0; padding:10px; border-radius:5px;">
    <h2 style="color:#333333; text-align:center;">Background</h2>
    <p style="color:#555555; font-size:16px;">
        At Invistico Airlines, improving customer satisfaction is a top priority. To understand the key drivers of satisfaction in a dataset of 129,880 passenger survey responses, the senior analyst team has moved beyond its initial Decision Tree models to a <strong>Random Forest</strong> approach, an ensemble method intended to reduce overfitting and improve predictive accuracy and reliability. They have asked you to build a model that addresses the overfitting problem. The senior team has also shared the results of the Decision Tree model.
    </p>
    <img src="airline.png" alt="Invistico Airlines" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">
</div>

<div style="background-color:#f0f0f0; padding:10px; border-radius:5px; margin-top:10px;">
    <h2 style="color:#333333; text-align:center;">Objective</h2>
    <p style="color:#555555; font-size:16px;">
        The objective is to enhance the predictive accuracy and reliability of customer satisfaction models by transitioning from Decision Trees to a Random Forest approach, leveraging the strength of ensemble learning.
    </p>
</div>

<div style="background-color:#f0f0f0; padding:10px; border-radius:5px; margin-top:10px;">
    <h2 style="color:#333333; text-align:center;">Goals</h2>
    <ul style="color:#555555; font-size:16px;">
        <li>Utilize the Random Forest model to obtain deeper insights into the key factors affecting customer satisfaction, helping tailor services to better meet passenger needs.</li>
        <li>Ensure model robustness and accuracy, evaluating performance through metrics such as accuracy, precision, recall, and F1-score.</li>
    </ul>
</div>


In [1]:
# Imports:
 
import numpy as np
import pandas as pd

import pickle as pkl
 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

import warnings
warnings.filterwarnings('ignore')

import time

In [2]:
# Read the file:
air_data = pd.read_csv("Invistico_Airline.csv")

## EDA:

In [3]:
air_data.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


In [4]:
# Check data types
air_data.dtypes

satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

In [5]:
# DataFrame of null percentages and null counts per column
nulls = pd.DataFrame(air_data.isnull().sum()/air_data.shape[0]*100, columns = ['percentage_nulls'])
nulls['total_nulls'] = air_data.isnull().sum()
nulls

Unnamed: 0,percentage_nulls,total_nulls
satisfaction,0.0,0
Customer Type,0.0,0
Age,0.0,0
Type of Travel,0.0,0
Class,0.0,0
Flight Distance,0.0,0
Seat comfort,0.0,0
Departure/Arrival time convenient,0.0,0
Food and drink,0.0,0
Gate location,0.0,0


In [6]:
# Drop rows with nulls since they are a minuscule share of the data
air_data_subset = air_data.dropna(axis=0)

In [7]:
# Check nulls again:
air_data_subset.isna().sum()

satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

In [8]:
# Convert categorical features to one-hot encoded columns:
air_data_subset_dummies = pd.get_dummies(air_data_subset, 
                                         columns=['Customer Type','Type of Travel','Class'])

air_data_subset_dummies['satisfaction'] = air_data_subset_dummies['satisfaction'].map({'satisfied': 1, 'dissatisfied':0})
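To make the encoding step concrete, here is a minimal sketch on a toy frame. The three rows are hypothetical values chosen for illustration; only the column names mirror the survey data. It shows that `pd.get_dummies` replaces each listed categorical column with one indicator column per level, while the target is mapped to 1/0 separately.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names mirror the survey data.
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Class": ["Eco", "Business", "Eco"],
    "Age": [65, 47, 15],
})

# One indicator column per category level; other columns pass through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["Class"])

# Encode the target as 1 (satisfied) / 0 (dissatisfied).
toy_dummies["satisfaction"] = toy_dummies["satisfaction"].map(
    {"satisfied": 1, "dissatisfied": 0}
)

print(toy_dummies.columns.tolist())
```

The untouched columns keep their original order and the new indicator columns are appended at the end, which is the same layout seen in the dtypes listing of the encoded airline data.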

In [9]:
air_data_subset_dummies['satisfaction'].value_counts()

1    70882
0    58605
Name: satisfaction, dtype: int64

In [10]:
# Check data types
air_data_subset_dummies.dtypes

satisfaction                           int64
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Travel_Business travel         uint8
Type of Tr

## Modelling:

We will first tune the model with Grid Search, fitting each candidate on a training subset and scoring it on a held-out validation set. Using the best parameters found, we will then retrain the model on the full training set (X_train and y_train) and evaluate its performance on the test set.
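The single train/validation split used during tuning can be sketched with `PredefinedSplit` on its own. The fold labels below are hypothetical and cover only six rows: a label of -1 keeps a row in the training portion for every split, and a label of 0 assigns it to the one validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels for six rows: -1 = always train, 0 = validation fold.
test_fold = np.array([-1, -1, 0, -1, 0, -1])
split = PredefinedSplit(test_fold)

# Only one fold label (0) exists, so exactly one split is produced.
n_splits = split.get_n_splits()

for train_idx, val_idx in split.split():
    print("train rows:", train_idx)
    print("validation rows:", val_idx)
```

Because there is only one split, each grid-search candidate is fit once, which is cheaper than k-fold cross-validation but gives a noisier estimate of each candidate's score.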

In [11]:
# Segregate our data in X and y 
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)

In [12]:
# training and test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [13]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fit GridSearchCV model object.
  
    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean F1 score across all validation folds.  
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean f1 score)
    # (idxmax returns an index label, so select it with .loc)
    best_estimator_results = cv_results.loc[cv_results['mean_test_f1'].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
  
    # Create table of results
    # (built directly, because DataFrame.append was removed in pandas 2.0)
    table = pd.DataFrame({'Model': [model_name],
                          'F1': [f1],
                          'Recall': [recall],
                          'Precision': [precision],
                          'Accuracy': [accuracy]
                         }
                        )
  
    return table
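The `mean_test_f1`, `mean_test_recall`, `mean_test_precision`, and `mean_test_accuracy` columns read by `make_results` only exist when the grid search is run with several scoring metrics. The sketch below illustrates this on synthetic data; the data, the tiny grid, and the `toy_` names are assumptions for illustration, not the airline model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the airline dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="f1",  # with several metrics, refit names the one used to pick the best model
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# cv_results_ holds one row per candidate and one mean_test_<metric> column per scorer.
toy_cv = pd.DataFrame(toy_search.cv_results_)
metric_cols = ["mean_test_f1", "mean_test_recall",
               "mean_test_precision", "mean_test_accuracy"]
best_row = toy_cv.loc[toy_cv["mean_test_f1"].idxmax(), metric_cols]
print(best_row)
```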

### Model building using Validation Set


In [14]:
%%time 
# Validation set split:
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)

# Instantiate the Model:
rf_2 = RandomForestClassifier(random_state=0)

# Define the hyperparameter grid to search:
cv_para = {
    'max_depth' : [10,50],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2,3,4],
    'max_features': ['sqrt'], # Random subset of features whose size is the square root of the total number of features will be considered for splitting a node
    'n_estimators': [50, 100],
    'max_samples': [0.5, 0.9]
}

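# Scoring metrics to capture (a set here; scikit-learn documents a list, tuple, or dict for multiple metrics)
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# Build the fold labels: 0 marks validation rows, -1 marks rows always kept for training
split_index = [0 if x in X_val.index else -1 for x in X_train.index]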

# Predefined split:
custom_split = PredefinedSplit(split_index)


# Find the best parameters using Grid Search:
rf_val = GridSearchCV(
    estimator=rf_2, 
    param_grid=cv_para,
    refit='f1', 
    cv=custom_split,
    scoring = scoring,
    verbose = 1,
    n_jobs = -1
)


CPU times: user 118 ms, sys: 2.69 ms, total: 120 ms
Wall time: 120 ms


In [18]:
%%time
# Training our model
#rf_val.fit(X_train, y_train)

Fitting 1 folds for each of 72 candidates, totalling 72 fits
CPU times: user 8.42 s, sys: 389 ms, total: 8.81 s
Wall time: 1min 21s


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'],
                         'max_samples': [0.5, 0.9],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [50, 100]},
             refit='f1', scoring={'f1', 'accuracy', 'recall', 'precision'},
             verbose=1)

### Do not run this code directly 

In [20]:
# Save the model locally:
import os
import pickle

path = '/Users/dawny/Documents/PortFolio_projects/Python/InvisticoAirlines_RandForest' # Change it to your path 

# Pickle the model (os.path.join adds the path separator that plain concatenation omits)
with open(os.path.join(path, 'rf_val_model.pickle'), 'wb') as to_write:
    pickle.dump(rf_val, to_write)

### Load the model saved locally

In [17]:
import os
import pickle

path = '/Users/dawny/Documents/PortFolio_projects/Python/InvisticoAirlines_RandForest'

# Re-load the model:
with open(os.path.join(path, 'rf_val_model.pickle'), 'rb') as to_read:
    rf_val = pickle.load(to_read)
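The save-and-reload pattern can be checked end to end without touching a personal directory. The sketch below is an illustration under assumptions: it uses a small synthetic model and a temporary folder (`toy_model.pickle` is a hypothetical file name) and confirms that the restored model predicts the same labels as the original.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model (an assumption; stands in for the fitted grid search).
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pickle")

    # Write the fitted model to disk ...
    with open(model_path, "wb") as to_write:
        pickle.dump(model, to_write)

    # ... and read it back.
    with open(model_path, "rb") as to_read:
        restored = pickle.load(to_read)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, since pickles are not guaranteed to be compatible across versions.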

In [18]:
rf_val

GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'],
                         'max_samples': [0.5, 0.9],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [50, 100]},
             refit='f1', scoring={'f1', 'recall', 'accuracy', 'precision'},
             verbose=1)

In [40]:
table = make_results("Random Forest Validated",rf_val)
table

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest Validated,0.957007,0.94515,0.969166,0.953705


In [20]:
# Find the best parameters:
rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 100}

The grid search has identified the best parameters using the training and validation sets. We now build a new model (rf_opt) with exactly these parameters and train it on the full X_train and y_train sets.
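As an alternative to retyping the winning values by hand, the `best_params_` dictionary can be unpacked straight into a new estimator. This is a minimal sketch on synthetic data with a deliberately tiny grid; the `toy_` names are hypothetical and the data is not the airline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "n_estimators": [5, 10]},
    scoring="f1",
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# Unpack the winning parameters so the new model cannot drift from the search result.
toy_opt = RandomForestClassifier(random_state=0, **toy_search.best_params_)
toy_opt.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

Unpacking avoids transcription mistakes if the grid is later changed and the search is rerun.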

In [21]:
# Using the best parameters:
rf_opt = RandomForestClassifier(
    max_depth=50, 
    max_features='sqrt', 
    max_samples=0.9,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100,
    random_state=0)

In [22]:
%%time 
# Fit/Train the optimised model:
rf_opt.fit(X_train, y_train)

CPU times: user 7.03 s, sys: 30.4 ms, total: 7.06 s
Wall time: 7.07 s


RandomForestClassifier(max_depth=50, max_features='sqrt', max_samples=0.9,
                       random_state=0)

Note: We will now use this model to make predictions on the test set

In [23]:
y_pred = rf_opt.predict(X_test)

In [32]:
# Precision Score
pc_test = precision_score(y_test, y_pred)
print("The precision score is {pc:.5f}".format(pc = pc_test))

The precision score is 0.96946


In [33]:
# Get recall score:

rc_test = recall_score(y_test, y_pred)
print("The recall score is {rc:.5f}".format(rc = rc_test))

The recall score is 0.94715


In [34]:
# Get accuracy score:

ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.5f}".format(ac = ac_test))

The accuracy score is 0.95471


In [35]:
# Get F1 score.

f1_test = f1_score(y_test, y_pred)
print("The F1 score is {f1:.5f}".format(f1 = f1_test))

The F1 score is 0.95818


In [37]:
print(f"\nThe precision score is: {pc_test:.3f} for the test set,\nwhich means that of all positive predictions, {pc_test * 100:.1f}% are true positives.")
print('')
print(f"\nThe recall score is: {rc_test:.3f} for the test set,\nwhich means that of all actual positive cases in the test set, {rc_test * 100:.1f}% are predicted positive.")
print('')
print(f"\nThe accuracy score is: {ac_test:.3f} for the test set,\nwhich means that of all cases in the test set, {ac_test * 100:.1f}% are classified correctly (true positive or true negative).")
print('')
print(f"\nThe F1 score is: {f1_test:.3f} for the test set,\nwhich means the harmonic mean of precision and recall on the test set is {f1_test * 100:.1f}%.")


The precision score is: 0.969 for the test set,
which means that of all positive predictions, 96.9% are true positives.


The recall score is: 0.947 for the test set,
which means that of all actual positive cases in the test set, 94.7% are predicted positive.


The accuracy score is: 0.955 for the test set,
which means that of all cases in the test set, 95.5% are classified correctly (true positive or true negative).


The F1 score is: 0.958 for the test set,
which means the harmonic mean of precision and recall on the test set is 95.8%.
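To tie the four scores back to their definitions, the sketch below recomputes them from confusion-matrix counts. The eight labels are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the airline test set.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, chosen only to illustrate the formulas.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                   # share of positive predictions that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all cases classified correctly
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, accuracy, f1)
```

The hand-computed values agree with `precision_score`, `recall_score`, `accuracy_score`, and `f1_score` applied to the same labels.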


In [45]:
table_2 = pd.DataFrame({'Model': ["Tuned Random Optimised[best_params]"],
                        'F1':  [f1_test],
                        'Recall': [rc_test],
                        'Precision': [pc_test],
                        'Accuracy': [ac_test]
                      }
                    )
table_2

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Random Optimised[best_params],0.958176,0.947152,0.969461,0.954714


In [46]:
table_final = pd.concat([table, table_2])
table_final

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest Validated,0.957007,0.94515,0.969166,0.953705
0,Tuned Random Optimised[best_params],0.958176,0.947152,0.969461,0.954714
