<a href="https://colab.research.google.com/github/promiseeselojor/British-Airways-Virtual-Internship-Program/blob/main/British_Airways_Task_2(Virtual_Internship).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Install PyCaret**

In [None]:
!pip install git+https://github.com/pycaret/pycaret.git#egg=pycaret

# **Import Libraries**

In [2]:
import pandas as pd
import numpy as np

# **Load Dataset**

In [2]:
#mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#import the data and load the dataframe
data_path = '/content/drive/MyDrive/Data Science Projects/British Airways Data Science/customer_booking.csv'
df = pd.read_csv(data_path, encoding = "ISO-8859-1")

customer_booking = df.copy()

In [4]:
#sample 90% of the dataset for modeling
data = customer_booking.sample(frac=0.9, random_state=786)

#hold out the remaining 10% as unseen data
data_unseen = customer_booking.drop(data.index)

data.reset_index(drop=True, inplace =True)
data_unseen.reset_index(drop=True, inplace = True)

print('Data for modeling: ' + str(data.shape))
print('Unseen data for predictions: ' + str(data_unseen.shape))

Data for modeling: (45000, 14)
Unseen data for predictions: (5000, 14)
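
The hold-out logic above (sample 90% of the rows, then drop those rows to get the rest) can be sketched without pandas. This is only an illustration on row indices using Python's `random` module. `split_indices` is a hypothetical helper, and it will not select the same rows that `DataFrame.sample(random_state=786)` does.

```python
import random

def split_indices(n_rows, frac, seed):
    """Return (modeling, unseen): a random `frac` of the row indices and the rest."""
    rng = random.Random(seed)
    n_model = round(n_rows * frac)
    modeling = rng.sample(range(n_rows), n_model)  # sampled without replacement
    chosen = set(modeling)
    unseen = [i for i in range(n_rows) if i not in chosen]  # every row not sampled
    return modeling, unseen

# 50,000 rows in total, matching the 45,000 + 5,000 shapes printed above
modeling_idx, unseen_idx = split_indices(50_000, 0.9, seed=786)
print(len(modeling_idx), len(unseen_idx))
```

The two index sets are disjoint by construction, which is the property that makes the 10% usable as unseen data later on.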


# **Data Preparation**

In [5]:
#create a machine learning transformation pipeline to preprocess the data before feeding it into the models
from pycaret.classification import *
clf1 = setup(data=data, target='booking_complete', session_id=123, normalize=True,
             transformation=False, fix_imbalance=True)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,booking_complete
2,Target type,Binary
3,Original data shape,"(45000, 14)"
4,Transformed data shape,"(67091, 22)"
5,Transformed train set shape,"(53590, 22)"
6,Transformed test set shape,"(13501, 22)"
7,Ordinal features,1
8,Numeric features,8
9,Categorical features,5
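
The jump from 45,000 original rows to 67,091 transformed rows is consistent with `fix_imbalance` oversampling the minority class of the training split until both classes are the same size (PyCaret's default method is SMOTE, as far as I know). The arithmetic below is a sketch under that assumption. The class counts it derives are inferred from the shapes in the table, not read from the data.

```python
# Shapes copied from the setup summary above
original_rows = 45_000
transformed_train_rows = 53_590
transformed_test_rows = 13_501

# Assumption: the test split is left untouched, so the rest is the raw training split
train_rows = original_rows - transformed_test_rows

# Assumption: after oversampling, both classes have equal size in the training split
majority_rows = transformed_train_rows // 2
minority_rows = train_rows - majority_rows
synthetic_rows = transformed_train_rows - train_rows  # rows added by oversampling

majority_share = majority_rows / train_rows
print(train_rows, majority_rows, minority_rows, synthetic_rows)
print(round(majority_share, 4))
```

An implied majority share of about 0.85 lines up with the 0.8507 accuracy of the Dummy Classifier in the model comparison table, which is roughly what a model scores by always predicting the majority class.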


# **Model Training and Selection**

In [6]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
dummy,Dummy Classifier,0.8507,0.5,0.0,0.0,0.0,0.0,0.0,0.271
et,Extra Trees Classifier,0.8445,0.766,0.1348,0.4346,0.2055,0.1451,0.1761,1.737
rf,Random Forest Classifier,0.8435,0.657,0.065,0.3668,0.1102,0.0682,0.1002,1.006
knn,K Neighbors Classifier,0.7612,0.6945,0.4494,0.3002,0.3599,0.2203,0.2271,16.155
ada,Ada Boost Classifier,0.745,0.5932,0.332,0.2625,0.2804,0.1388,0.1416,0.734
lightgbm,Light Gradient Boosting Machine,0.7376,0.5711,0.2632,0.2085,0.2256,0.0755,0.0775,0.394
ridge,Ridge Classifier,0.7285,0.0,0.7045,0.3164,0.4366,0.2904,0.3308,0.523
lda,Linear Discriminant Analysis,0.7285,0.7782,0.7045,0.3164,0.4366,0.2904,0.3308,0.27
lr,Logistic Regression,0.7273,0.7782,0.7064,0.3155,0.4362,0.2895,0.3304,1.004
svm,SVM - Linear Kernel,0.7071,0.0,0.7379,0.3031,0.4295,0.2763,0.3259,0.34



I'm going to select the LDA (Linear Discriminant Analysis) model because it combines a high recall (0.7045) with the joint-highest AUC (0.7782) at an accuracy of 72.85%. Recall and AUC are the key metrics for this model. I won't rely on accuracy because the target variable is imbalanced: the Dummy Classifier reaches 85.07% accuracy with a recall of 0.

The Extra Trees Classifier has the best accuracy of the trained models (84.45%) but a very low recall (0.1348). It produces a lot of false negatives (predicting that a customer will not book a holiday with the airline when in fact they did). The goal of this model is to maintain a good enough accuracy while minimizing false negatives as much as possible.

For this scenario and business case, it's much more beneficial to have a model with a low number of false negatives. Predicting that a customer will book a holiday when they end up not booking is much preferable to predicting that a customer will not book when they end up booking. The latter may be very costly to the business.
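
To make the accuracy argument concrete, here is a small sketch on made-up labels: 850 non-bookers and 150 bookers, chosen only to mimic the roughly 85/15 imbalance suggested by the Dummy Classifier row. `accuracy` and `recall` are hypothetical helpers written for this illustration, not PyCaret functions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative labels only, not the booking data
y_true = [0] * 850 + [1] * 150
always_no = [0] * 1000  # a "model" that never predicts a booking

print(accuracy(y_true, always_no), recall(y_true, always_no))
```

The never-book predictor looks accurate but finds none of the bookers, which is why recall carries more weight than accuracy in the model choice above.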




In [7]:
lda = create_model('lda')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7343,0.7932,0.7447,0.328,0.4554,0.3131,0.3592
1,0.7295,0.7655,0.6723,0.3116,0.4259,0.2788,0.3141
2,0.726,0.7762,0.7106,0.3148,0.4363,0.2894,0.3312
3,0.7302,0.7705,0.7021,0.3173,0.4371,0.2915,0.3312
4,0.7324,0.7794,0.7191,0.3222,0.445,0.301,0.3431
5,0.721,0.7651,0.6603,0.3019,0.4144,0.2632,0.2979
6,0.7251,0.7749,0.6879,0.3106,0.428,0.2796,0.3179
7,0.7416,0.8053,0.7558,0.3374,0.4666,0.3276,0.374
8,0.7114,0.768,0.6921,0.2991,0.4177,0.264,0.3051
9,0.7339,0.7835,0.7,0.3207,0.4398,0.2956,0.3344



In [8]:
print(lda)

LinearDiscriminantAnalysis(covariance_estimator=None, n_components=None,
                           priors=None, shrinkage=None, solver='svd',
                           store_covariance=False, tol=0.0001)


# **Hyperparameter Tuning**

In [9]:
#hyperparameter optimizations
tuned_lda = tune_model(lda)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.734,0.7932,0.7468,0.328,0.4558,0.3135,0.36
1,0.7292,0.7655,0.6723,0.3113,0.4256,0.2784,0.3137
2,0.7257,0.7762,0.7106,0.3145,0.436,0.2889,0.3309
3,0.7302,0.7705,0.7021,0.3173,0.4371,0.2915,0.3312
4,0.7321,0.7795,0.7191,0.3219,0.4447,0.3006,0.3427
5,0.721,0.765,0.6603,0.3019,0.4144,0.2632,0.2979
6,0.7254,0.7749,0.69,0.3113,0.429,0.2808,0.3194
7,0.7419,0.8054,0.7558,0.3378,0.4669,0.328,0.3743
8,0.7124,0.7681,0.6943,0.3003,0.4192,0.266,0.3073
9,0.7342,0.7835,0.7,0.321,0.4401,0.2961,0.3348



Fitting 10 folds for each of 10 candidates, totalling 100 fits


In [10]:
print(tuned_lda) 

LinearDiscriminantAnalysis(covariance_estimator=None, n_components=None,
                           priors=None, shrinkage=0.01, solver='eigen',
                           store_covariance=False, tol=0.0001)
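
The fold tables above are shown without a summary row, so the effect of tuning is easier to judge by averaging a metric over the ten folds. The sketch below does this for recall, with the values copied from the `create_model` and `tune_model` outputs. The variable names are mine.

```python
from statistics import fmean

# Per-fold recall, copied from the create_model('lda') table
base_recall = [0.7447, 0.6723, 0.7106, 0.7021, 0.7191,
               0.6603, 0.6879, 0.7558, 0.6921, 0.7]

# Per-fold recall, copied from the tune_model(lda) table
tuned_recall = [0.7468, 0.6723, 0.7106, 0.7021, 0.7191,
                0.6603, 0.69, 0.7558, 0.6943, 0.7]

base_mean = fmean(base_recall)
tuned_mean = fmean(tuned_recall)
print(round(base_mean, 4), round(tuned_mean, 4), round(tuned_mean - base_mean, 4))
```

The difference in mean recall is well under 0.001, which suggests that tuning changed the model very little.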


# **Model Analysis and Evaluation**

In [11]:
evaluate_model(tuned_lda)

[Interactive evaluate_model output: a "Plot Type" selector; the options include Pipeline Plot]

This confusion matrix is computed on the test set, which holds 30% of our data. We have 1,420 True Positives (11%): these are the customers whose lifetime value we can extend if they end up booking a holiday with British Airways. Had we not predicted them, there would have been no opportunity for intervention.

We also have 3,142 False Positives (23%), where we might lose money because the promotion offered to these customers may just be an extra cost.

8,343 (62%) are True Negatives and 596 (4.41%) are False Negatives (these are missed opportunities).
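
As a sanity check, the four counts quoted above can be turned back into the usual metrics. The sketch below uses the standard definitions of accuracy, recall, precision and F1. The variable names are mine, and the counts are the ones read off the confusion matrix.

```python
# Counts quoted from the confusion matrix discussion above
tp, fp, tn, fn = 1420, 3142, 8343, 596
total = tp + fp + tn + fn  # should equal the size of the test set

accuracy = (tp + tn) / total
recall = tp / (tp + fn)       # share of actual bookers the model catches
precision = tp / (tp + fp)    # share of predicted bookers who really book
f1 = 2 * precision * recall / (precision + recall)

print(total)
for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("precision", precision), ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```

The total of 13,501 matches the transformed test set shape reported by `setup`, and the four metrics agree with the hold-out row that `predict_model` reports for the tuned model (0.7231, 0.7044, 0.3113, 0.4317).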

# **Prediction on Test Holdout/Sample**

In [18]:
#evaluate the tuned model on the hold-out test set
test_results = predict_model(tuned_lda)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.7231,0.7786,0.7044,0.3113,0.4317,0.2833,0.3246


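There's no significant difference between the hold-out test results (accuracy 0.7231, AUC 0.7786, recall 0.7044) and the cross-validated training results (fold accuracies between roughly 0.71 and 0.74), so the model does not appear to be overfitting.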

# **Finalize Model for Deployment**

In [13]:
#finalize lda_model

final_lda = finalize_model(tuned_lda)

In [14]:
print(final_lda)

Pipeline(memory=FastMemory(location=/tmp/joblib),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['num_passengers', 'purchase_lead',
                                             'length_of_stay', 'flight_hour',
                                             'wants_extra_baggage',
                                             'wants_preferred_seat',
                                             'wants_in_flight_meals',
                                             'flight_duration'],
                                    transformer=SimpleImputer(add_indicator=False,
                                                              copy=True,
                                                              fill_value=None,
                                                              missing_...
                                                                              random_state=None,
                          

# **Predict Unseen Data**

In [16]:
#run the finalized model on the unseen data we set aside initially
unseen_predictions = predict_model(final_lda, data=data_unseen)
unseen_predictions.head()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.7188,0.7496,0.6781,0.3067,0.4224,0.2699,0.3071


Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete,prediction_label,prediction_score
0,1,Internet,RoundTrip,3,48,20,Thu,AKLDEL,New Zealand,1,0,1,5.52,0,0,0.8369
1,4,Internet,RoundTrip,265,24,19,Mon,AKLDEL,New Zealand,1,0,1,5.52,0,0,0.8978
2,1,Internet,RoundTrip,245,34,4,Tue,AKLDEL,New Zealand,1,1,1,5.52,0,0,0.8607
3,1,Internet,RoundTrip,65,17,9,Wed,AKLICN,New Zealand,1,0,0,6.62,0,0,0.855
4,1,Internet,RoundTrip,22,89,14,Tue,AKLICN,South Korea,1,0,1,6.62,0,0,0.8267


# **Saving The Model**

In [17]:
#saving the final model
save_model(final_lda, 'Final_LDA_British_Airways_Prediction_Model')


Transformation Pipeline and Model Successfully Saved

