<a href="https://colab.research.google.com/github/gshreya5/colab/blob/main/predict_BA_holiday_bookings_randomforest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting customer ✈️ buying behaviour


Customers are more empowered than ever because they have a wealth of information at their fingertips. This is one of the reasons the buying cycle is very different from what it used to be. Today, if we wait until customers arrive at the airport and hope they purchase flights or holidays there, the game is already lost! Being reactive in this situation is not ideal; airlines must be proactive in order to acquire customers before they embark on their holiday.

**GOAL**: To manipulate and prepare the provided customer booking data to build a high-quality predictive model.

With this predictive model, it is important to interpret the results in order to understand how “predictive” the data really is and whether we can feasibly use it to predict the target outcome (customers buying holidays). Therefore, we should also evaluate the model's performance and output how each variable contributes to the predictive model's power.

# Load Libraries

In [161]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics  
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import tree
import pydot
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

from yellowbrick.classifier import ConfusionMatrix


%matplotlib inline

# Load Dataset

In [106]:
url = 'https://cdn.theforage.com/vinternships/companyassets/tMjbs76F526fF5v3G/L3MQ8f6cYSkfoukmz/1667814300249/customer_booking.csv'

df = pd.read_csv(url, encoding='latin-1')

In [12]:
df.head(2)

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,2,Internet,RoundTrip,262,19,7,Sat,AKLDEL,New Zealand,1,0,0,5.52,0
1,1,Internet,RoundTrip,112,20,3,Sat,AKLDEL,New Zealand,0,0,0,5.52,0


# Explore Data

To provide more context, below is a more detailed data description, explaining exactly what each column means:

* num_passengers = number of passengers travelling
* sales_channel = sales channel booking was made on
* trip_type = trip Type (Round Trip, One Way, Circle Trip)
* purchase_lead = number of days between travel date and booking date
* length_of_stay = number of days spent at destination
* flight_hour = hour of flight departure
* flight_day = day of week of flight departure
* route = origin -> destination flight route
* booking_origin = country from where booking was made
* wants_extra_baggage = if the customer wanted extra baggage in the booking
* wants_preferred_seat = if the customer wanted a preferred seat in the booking
* wants_in_flight_meals = if the customer wanted in-flight meals in the booking
* flight_duration = total duration of flight (in hours)
* booking_complete = flag indicating if the customer completed the booking

In [13]:
df.shape

(50000, 14)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

No null values - a small victory!

In [16]:
df.describe()

Unnamed: 0,num_passengers,purchase_lead,length_of_stay,flight_hour,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,1.59124,84.94048,23.04456,9.06634,0.66878,0.29696,0.42714,7.277561,0.14956
std,1.020165,90.451378,33.88767,5.41266,0.470657,0.456923,0.494668,1.496863,0.356643
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4.67,0.0
25%,1.0,21.0,5.0,5.0,0.0,0.0,0.0,5.62,0.0
50%,1.0,51.0,17.0,9.0,1.0,0.0,0.0,7.57,0.0
75%,2.0,115.0,28.0,13.0,1.0,1.0,1.0,8.83,0.0
max,9.0,867.0,778.0,23.0,1.0,1.0,1.0,9.5,1.0


Only about 15% of bookings were completed (the mean of `booking_complete` is 0.14956).

About 30% of customers wanted a preferred seat.

# Correlation

In [17]:
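# Restrict to numeric columns first; recent pandas versions raise an error
# when corr() is called on a frame that still contains object columns.
df.select_dtypes(include="number").corr().booking_complete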

num_passengers           0.024116
purchase_lead           -0.022131
length_of_stay          -0.042408
flight_hour              0.007127
wants_extra_baggage      0.068139
wants_preferred_seat     0.050116
wants_in_flight_meals    0.026511
flight_duration         -0.106266
booking_complete         1.000000
Name: booking_complete, dtype: float64
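
To make the correlation output easier to read, the short sketch below ranks the features by the absolute value of their correlation with `booking_complete`. The numbers are copied by hand from the output above rather than recomputed, and the dictionary name `correlations` is only illustrative.

```python
# Correlations with booking_complete, copied from the output above.
correlations = {
    "num_passengers": 0.024116,
    "purchase_lead": -0.022131,
    "length_of_stay": -0.042408,
    "flight_hour": 0.007127,
    "wants_extra_baggage": 0.068139,
    "wants_preferred_seat": 0.050116,
    "wants_in_flight_meals": 0.026511,
    "flight_duration": -0.106266,
}

# Rank features by the strength of the linear relationship, ignoring sign.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<22} {value:+.3f}")

strongest_name, strongest_value = ranked[0]
```

Every coefficient is small in magnitude (the largest, for `flight_duration`, is about -0.11), so no single numeric feature has a strong linear relationship with the target. That is one reason a non-linear model such as a random forest is a reasonable choice here.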

# Visualization

## Trip Type

In [29]:
px.histogram(df,
             x='trip_type',
             color='booking_complete',
             title="Distribution of Trip types")

Almost all bookings that were started were for round trips (RoundTrip), followed by one-way trips (OneWay) and, lastly, circle trips (CircleTrip).

The same ordering holds for completed bookings.

## Num of passengers

In [35]:
px.histogram(df,
             x='num_passengers',
             color='booking_complete',
             title="Distribution of Num of passengers per trip")

Most bookings, both started and completed, are for a single passenger.

## Sales Channel

In [37]:
px.histogram(df,
             x='sales_channel',
             color='booking_complete',
             title="Visualization of Channels used for bookings")

The preferred booking channel is the Internet.

## Flight Day, Flight Hour and Trip Duration

In [45]:
px.scatter(df,
           x='flight_hour',
           y='length_of_stay',
           facet_col="flight_day",
           color='booking_complete',
           title="Visualization of flight_day vs length_of_stay")

Length of stay is usually less than 200 days (`length_of_stay` is measured in days).

Departures fall mostly on Monday, Tuesday and Wednesday, possibly because fares are cheaper on weekdays.

Mid-morning to midday seems to have the highest traffic.

## Booking Origin

In [56]:
px.histogram(df,
             x='booking_origin',
             color='booking_complete',
             title="Visualization of booking_origin")

Australia seems to be where most bookings were started.

Malaysia seems to be where most bookings were completed.

## Amenities

In [82]:
fig = make_subplots(rows=1, cols=3,subplot_titles=("wants_extra_baggage", "wants_preferred_seat", "wants_in_flight_meals"))

fig.add_trace(go.Histogram(
             x=df[df.booking_complete==1]['wants_extra_baggage'],
             name = 'completed'
             ),
              row=1, col=1)
fig.add_trace(go.Histogram(
             x=df[df.booking_complete==0]['wants_extra_baggage'],
             name = 'not completed'
             ),
              row=1, col=1)

fig.add_trace(go.Histogram(
             x=df[df.booking_complete==1]['wants_preferred_seat'],
             name = 'completed'
             ),
              row=1, col=2)
fig.add_trace(go.Histogram(
             x=df[df.booking_complete==0]['wants_preferred_seat'],
             name = 'not completed'
             ),
              row=1, col=2)

fig.add_trace(go.Histogram(
             x=df[df.booking_complete==1]['wants_in_flight_meals'],
             name = 'completed'
             ),
              row=1, col=3)
fig.add_trace(go.Histogram(
             x=df[df.booking_complete==0]['wants_in_flight_meals'],
             name = 'not completed'
             ),
              row=1, col=3)

fig.update_layout(xaxis_title='Amenities', yaxis_title='Count', legend_title='booking status')

fig.show()

The majority of customers chose extra baggage over in-flight meals and preferred seats.

## Route

In [89]:
px.histogram(df,
             x='route',
             color='booking_complete',
             title="Visualization of Flight Routes")

The routes AKL (Auckland, New Zealand) to KUL (Kuala Lumpur, Malaysia) and PEN (Penang, Malaysia) to TPE (Taipei, Taiwan) get the most traffic.

## Purchase Lead

In [124]:
px.histogram(df,
             x='purchase_lead',
             color='booking_complete',
             title="Visualization of purchase_lead")

Most bookings were made within about 100 days of the departure date, although some were made more than 400 days in advance.

# Modelling

Encode the object-type (categorical) columns as numeric values.

In [109]:
objList = df.select_dtypes(include = "object").columns
print(objList)

Index(['sales_channel', 'trip_type', 'flight_day', 'route', 'booking_origin'], dtype='object')


In [107]:
le = LabelEncoder()

for col in objList:
    col_n = col+"_n"
    df[col_n] = le.fit_transform(df[col])
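
As a rough illustration of what label encoding does, the sketch below assigns each distinct label an integer code in sorted order, which is also how `LabelEncoder` orders its classes. The helper `label_encode` and the sample values are hypothetical and only stand in for the real `trip_type` column.

```python
def label_encode(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Illustrative trip_type values; the real column has many more rows.
trip_types = ["RoundTrip", "OneWay", "RoundTrip", "CircleTrip"]
codes, mapping = label_encode(trip_types)

print(mapping)  # {'CircleTrip': 0, 'OneWay': 1, 'RoundTrip': 2}
print(codes)    # [2, 1, 2, 0]
```

The codes follow alphabetical order, so they impose an arbitrary ordering on the categories. Tree-based models usually tolerate this, but any later step that treats the codes as real numbers (scaling, distance-based resampling) will see `RoundTrip` as "further" from `CircleTrip` than `OneWay` is, which has no real meaning.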

Split data into features X and target Y

In [110]:
X = df[['num_passengers', 'sales_channel_n', 'trip_type_n', 'purchase_lead',
       'length_of_stay', 'flight_hour', 'flight_day_n', 'route_n',
       'booking_origin_n', 'wants_extra_baggage', 'wants_preferred_seat',
       'wants_in_flight_meals', 'flight_duration']]
Y = df.booking_complete

In [125]:
# Using StandardScaler 
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
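
For intuition, standardization replaces each value with its z-score: subtract the column mean and divide by the column's population standard deviation. The sketch below does this for one column using only the standard library; the function `standardize` and the sample values are illustrative, not part of the notebook.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(column)
    std = pstdev(column)
    return [(x - mean) / std for x in column]

# Illustrative purchase_lead values in days (a handful of made-up rows).
purchase_lead = [21, 51, 115, 262, 112]
scaled = standardize(purchase_lead)

print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1. A random forest splits on thresholds and is largely insensitive to this kind of rescaling, so here the scaling probably matters most for SMOTE, which relies on distances between samples.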

Let's check if data is imbalanced or not

In [130]:
Y.value_counts()

0    42522
1     7478
Name: booking_complete, dtype: int64

Our dataset is imbalanced. Let's fix that using SMOTE

In [134]:
print(X.shape,Y.shape)

(50000, 13) (50000,)


In [133]:
# Use imblearn's SMOTE to remove the class imbalance in our dataset
smote = SMOTE()
x_smote, y_smote = smote.fit_resample(X_scaled, Y)
print(x_smote.shape,y_smote.shape)

(85044, 13) (85044,)
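
The resampled shape can be checked by hand: by default SMOTE adds synthetic minority samples until both classes have the same count. The sketch below reproduces that arithmetic from the class counts shown earlier and illustrates the interpolation idea behind a single synthetic sample. The function `interpolate` and the two sample points are hypothetical, and the real algorithm also chooses the neighbour through a k-nearest-neighbours search that is omitted here.

```python
import random

# Class counts from Y.value_counts()
majority, minority = 42522, 7478

synthetic_needed = majority - minority   # samples SMOTE has to create
balanced_total = majority * 2            # rows after resampling
print(synthetic_needed, balanced_total)

def interpolate(sample, neighbour, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)
# Two made-up minority samples with two features each.
synthetic = interpolate([1.0, 2.0], [3.0, 6.0], rng)
print(synthetic)
```

Because new points are interpolated, label-encoded columns such as `route_n` can end up with fractional codes that correspond to no real category. imbalanced-learn also provides `SMOTENC` for mixed numeric and categorical data, which may be worth trying.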


Splitting the dataset into train and test

In [141]:
X_train, X_test, Y_train, Y_test = train_test_split(x_smote, y_smote, test_size = 0.2,random_state=42)
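
The notebook resamples first and splits afterwards, so synthetic samples can land in the test set. An alternative order, sketched below under simplifying assumptions, holds out the test set first and balances only the training part. Random duplication is used here as a plain stand-in for SMOTE so the example needs only the standard library; `split_then_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, test_size=0.2, seed=42):
    """Hold out a test set first, then balance only the training part."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group training indices by class so the minority can be topped up.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())

    balanced_idx = []
    for members in by_class.values():
        balanced_idx.extend(members)
        # Random duplication stands in for SMOTE's synthetic samples.
        balanced_idx.extend(rng.choices(members, k=target - len(members)))

    train = [(rows[i], labels[i]) for i in balanced_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test

# Toy data with the notebook's rough 85/15 class ratio.
rows = [[i] for i in range(1000)]
labels = [1 if i % 20 < 3 else 0 for i in range(1000)]

train, test = split_then_oversample(rows, labels)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts, test_counts)
```

With this order the training set is balanced while the test set keeps the original class ratio, so test metrics reflect the imbalance the model would meet in practice.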


Build the model using a random forest classifier and predict on the test data obtained from the split.


In [142]:
clf = RandomForestClassifier(n_estimators = 100)  
clf.fit(X_train, Y_train)
y_pred = clf.predict(X_test)


# Performance metrics

In [143]:
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(Y_test, y_pred))

ACCURACY OF THE MODEL:  0.9178670115821036


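Scaling and balancing the data improved the accuracy from an initial 84.4% to 85.5%, and then to 91.7%. Because SMOTE was applied before the train/test split, the test set also contains synthetic samples, so this figure may be optimistic for real, imbalanced bookings.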

In [165]:
print(metrics.confusion_matrix(Y_train, clf.predict(X_train)))
print("-"*30)
print(metrics.confusion_matrix(Y_test, clf.predict(X_test)))

[[34074     4]
 [    2 33955]]
------------------------------
[[7971  473]
 [ 924 7641]]
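
The headline metrics can be derived directly from the second (test-set) confusion matrix above. The sketch below does the arithmetic by hand, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
# Test-set confusion matrix from the output above: rows are true classes,
# columns are predicted classes (scikit-learn's convention).
tn, fp = 7971, 473
fn, tp = 924, 7641

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted completions, how many were real
recall = tp / (tp + fn)      # of real completions, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy matches the value printed earlier. The near-perfect training matrix next to a test accuracy of about 92% also suggests that the forest fits the training data much more closely than it generalises.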


In [164]:
print(metrics.classification_report(Y_train, clf.predict(X_train)))
print("-"*60)
print(metrics.classification_report(Y_test, clf.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34078
           1       1.00      1.00      1.00     33957

    accuracy                           1.00     68035
   macro avg       1.00      1.00      1.00     68035
weighted avg       1.00      1.00      1.00     68035

------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.90      0.94      0.92      8444
           1       0.94      0.89      0.92      8565

    accuracy                           0.92     17009
   macro avg       0.92      0.92      0.92     17009
weighted avg       0.92      0.92      0.92     17009



## Feature Importance

The following shows how much each variable contributed to the model.

In [146]:
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp

booking_origin_n         0.185551
length_of_stay           0.170899
flight_duration          0.119206
route_n                  0.118797
purchase_lead            0.109606
flight_hour              0.100211
flight_day_n             0.090018
num_passengers           0.045036
wants_in_flight_meals    0.016569
wants_extra_baggage      0.016081
wants_preferred_seat     0.014989
sales_channel_n          0.011811
trip_type_n              0.001228
dtype: float64
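
One way to read these importances is as shares of a whole, since impurity-based importances from a random forest sum to 1. The sketch below accumulates them to show how much of the total the top features account for. The values are copied by hand from the output above, and the dictionary simply mirrors the `feature_imp` series.

```python
# Importances copied from the output above, already in descending order.
feature_imp = {
    "booking_origin_n": 0.185551,
    "length_of_stay": 0.170899,
    "flight_duration": 0.119206,
    "route_n": 0.118797,
    "purchase_lead": 0.109606,
    "flight_hour": 0.100211,
    "flight_day_n": 0.090018,
    "num_passengers": 0.045036,
    "wants_in_flight_meals": 0.016569,
    "wants_extra_baggage": 0.016081,
    "wants_preferred_seat": 0.014989,
    "sales_channel_n": 0.011811,
    "trip_type_n": 0.001228,
}

total = sum(feature_imp.values())

running = 0.0
for name, importance in feature_imp.items():
    running += importance
    print(f"{name:<22} {importance:.3f}  cumulative {running / total:.1%}")

top_five = sum(list(feature_imp.values())[:5])
```

The top five features account for roughly 70% of the total importance. Impurity-based importances are known to favour features with many distinct values, such as the label-encoded `route_n` and `booking_origin_n`, so the ranking is best treated as indicative rather than exact.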

In [159]:
px.bar(feature_imp[::-1], orientation='h', title = 'Features by their Importance in Modelling')

## Cross Validation

In [152]:
# Using K-FOLD method by using cross_val_score
accuracy = cross_val_score(clf, X_train, Y_train, cv=5)
accuracy

array([0.91166311, 0.91210406, 0.90982582, 0.91563166, 0.90864996])

In [153]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (accuracy.mean(), accuracy.std()))

0.91 accuracy with a standard deviation of 0.00
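
The reported standard deviation of 0.00 is an effect of rounding to two decimals. The sketch below recomputes the mean and the population standard deviation (numpy's default for `.std()`) from the five fold scores above and prints them with four decimals; the variable names are illustrative.

```python
from statistics import fmean, pstdev

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.91166311, 0.91210406, 0.90982582, 0.91563166, 0.90864996]

mean_accuracy = fmean(fold_scores)
std_accuracy = pstdev(fold_scores)  # population std, matching numpy's default

print(f"{mean_accuracy:.4f} accuracy with a standard deviation of {std_accuracy:.4f}")
```

The spread across folds is small (a few tenths of a percentage point), so the score is stable across folds of the resampled training data.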


**Final Thoughts**

Although we achieved about 91% accuracy using RandomForestClassifier, the model could be improved further by working with **more balanced data**, since the current data contains far more incomplete bookings than completed ones, and by adding demographic and psychological features to better predict user behaviour.
