<a href="https://colab.research.google.com/github/janinerottmann/EBS/blob/main/SurgePricing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Surge Pricing with Machine Learning: The Example of Sigma Cabs

**Data and Problem**

The data is provided by Sigma Cabs, an Indian cab aggregator service. Customers can download the app and book a cab from anywhere in the cities in which Sigma Cabs operates. Sigma Cabs, in turn, searches for cabs from various service providers and offers the client the best of the available options. The company has been in operation for a little less than a year. During this period, it has captured the Surge_Pricing_Type from the service providers.

The main objective is to build a predictive model that predicts the Surge_Pricing_Type proactively. This, in turn, would help match the right cabs with the right customers quickly and efficiently.


**Features**
* Trip_ID: ID of the trip
* Trip_Distance: Distance of the trip requested by the customer
* Type_of_Cab: Category of the cab requested by the customer
* Customer_Since_Months: Number of months the customer has been using cab services; 0 means the current month
* Life_Style_Index: Proprietary index created by Sigma Cabs showing the lifestyle of the customer based on their behaviour
* Confidence_Life_Style_Index: Category showing the confidence in the index mentioned above
* Destination_Type: Sigma Cabs assigns each destination to one of 14 categories
* Customer_Rating: Average of the customer's lifetime ratings to date
* Cancellation_Last_1Month: Number of trips cancelled by the customer in the last month
* Var1, Var2 and Var3: Continuous variables masked by the company; can be used for modelling purposes
* Gender: Gender of the customer
* Surge_Pricing_Type: Target (one of 3 types)

# Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MaxAbsScaler
from xgboost import XGBClassifier

# Import Data

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/janinerottmann/EBS/main/sigma_cabs.csv")
df.head()

Unnamed: 0,Trip_ID,Trip_Distance,Type_of_Cab,Customer_Since_Months,Life_Style_Index,Confidence_Life_Style_Index,Destination_Type,Customer_Rating,Cancellation_Last_1Month,Var1,Var2,Var3,Gender,Surge_Pricing_Type
0,T0005689460,6.77,B,1.0,2.42769,A,A,3.905,0,40.0,46,60,Female,2
1,T0005689461,29.47,B,10.0,2.78245,B,A,3.45,0,38.0,56,78,Male,2
2,T0005689464,41.58,,10.0,,,E,3.50125,2,,56,77,Male,2
3,T0005689465,61.56,C,10.0,,,A,3.45375,0,,52,74,Male,3
4,T0005689467,54.95,C,10.0,3.03453,B,A,3.4025,4,51.0,49,102,Male,2


In [3]:
# take a look at the data
df.describe()

Unnamed: 0,Trip_Distance,Customer_Since_Months,Life_Style_Index,Customer_Rating,Cancellation_Last_1Month,Var1,Var2,Var3,Surge_Pricing_Type
count,131662.0,125742.0,111469.0,131662.0,131662.0,60632.0,131662.0,131662.0,131662.0
mean,44.200909,6.016661,2.802064,2.849458,0.782838,64.202698,51.2028,75.099019,2.155747
std,25.522882,3.626887,0.225796,0.980675,1.037559,21.820447,4.986142,11.578278,0.738164
min,0.31,0.0,1.59638,0.00125,0.0,30.0,40.0,52.0,1.0
25%,24.58,3.0,2.65473,2.1525,0.0,46.0,48.0,67.0,2.0
50%,38.2,6.0,2.79805,2.895,0.0,61.0,50.0,74.0,2.0
75%,60.73,10.0,2.94678,3.5825,1.0,80.0,54.0,82.0,3.0
max,109.23,10.0,4.87511,5.0,8.0,210.0,124.0,206.0,3.0


# Handle missing data

In [4]:
# check missing data
df.isnull().sum()

Trip_ID                            0
Trip_Distance                      0
Type_of_Cab                    20210
Customer_Since_Months           5920
Life_Style_Index               20193
Confidence_Life_Style_Index    20193
Destination_Type                   0
Customer_Rating                    0
Cancellation_Last_1Month           0
Var1                           71030
Var2                               0
Var3                               0
Gender                             0
Surge_Pricing_Type                 0
dtype: int64

In [5]:
df["Type_of_Cab"].value_counts()

B    31136
C    28122
A    21569
D    18991
E    11634
Name: Type_of_Cab, dtype: int64

In [6]:
# fill missing cab types with a new category "F"
df["Type_of_Cab"] = df["Type_of_Cab"].fillna("F")

# impute missing numeric values with the column mean
df["Customer_Since_Months"] = df["Customer_Since_Months"].fillna(df["Customer_Since_Months"].mean())
df["Var1"] = df["Var1"].fillna(df["Var1"].mean())

# drop rows without a Life_Style_Index; the same rows also lack Confidence_Life_Style_Index
df = df.dropna(subset=["Life_Style_Index"])

In [7]:
df.isnull().sum()

Trip_ID                        0
Trip_Distance                  0
Type_of_Cab                    0
Customer_Since_Months          0
Life_Style_Index               0
Confidence_Life_Style_Index    0
Destination_Type               0
Customer_Rating                0
Cancellation_Last_1Month       0
Var1                           0
Var2                           0
Var3                           0
Gender                         0
Surge_Pricing_Type             0
dtype: int64
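
The three strategies used above (a new category, mean imputation, dropping rows) are easier to follow on a tiny example. The sketch below uses a hypothetical frame named `toy` with made-up values; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Type_of_Cab": ["A", None, "B", "C"],
    "Var1": [40.0, np.nan, 60.0, 80.0],
    "Life_Style_Index": [2.4, 2.8, np.nan, 3.0],
})

# 1) Categorical gap: make "missing" its own category "F"
toy["Type_of_Cab"] = toy["Type_of_Cab"].fillna("F")

# 2) Numeric gap: fill with the mean of the observed values
toy["Var1"] = toy["Var1"].fillna(toy["Var1"].mean())

# 3) Remaining gap: drop every row that has no Life_Style_Index
toy = toy.dropna(subset=["Life_Style_Index"])

print(toy)
```

Note that Var1 is missing in 71030 of the 131662 rows, so after mean imputation more than half of its values are identical. Whether that is acceptable depends on the model; it is a design choice of this notebook rather than a general recommendation.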

# One Hot Encoding

In [8]:
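def one_hot_encoding(column):
    # drop_first=True drops the first category, which the remaining dummy columns already imply;
    # dtype=int keeps the 0/1 columns on pandas 2.x, where get_dummies returns booleans by default
    return pd.get_dummies(column, drop_first=True, dtype=int)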

# one hot encode categorical variables
Type_Of_Cab = one_hot_encoding(df["Type_of_Cab"])
Confidence_Life_Style_Index = one_hot_encoding(df["Confidence_Life_Style_Index"])
Destination_Type = one_hot_encoding(df["Destination_Type"])
Gender = one_hot_encoding(df["Gender"])

# rename columns
Type_Of_Cab = Type_Of_Cab.rename(columns={'B': 'Type_Of_Cab_B','C': 'Type_Of_Cab_C','D': 'Type_Of_Cab_D','E': 'Type_Of_Cab_E','F': 'Type_Of_Cab_F'})
Confidence_Life_Style_Index = Confidence_Life_Style_Index.rename(columns = {"B":"Confidence_Life_Style_Index_B","C":"Confidence_Life_Style_Index_C"})
Destination_Type = Destination_Type.rename(columns = {'B':'Destination_Type_B','C':'Destination_Type_C','D':'Destination_Type_D','E':'Destination_Type_E','F':'Destination_Type_F','G':'Destination_Type_G','H':'Destination_Type_H','I':'Destination_Type_I','J':'Destination_Type_J','K':'Destination_Type_K','L':'Destination_Type_L','M':'Destination_Type_M','N':'Destination_Type_N'})

# merge data
df_one_hot_encoded = pd.concat([df,Type_Of_Cab,Confidence_Life_Style_Index,Destination_Type,Gender], axis=1)

# drop columns
cols_to_drop = ["Trip_ID","Type_of_Cab","Confidence_Life_Style_Index","Destination_Type","Gender"]
df_final = df_one_hot_encoded.drop(cols_to_drop,axis = 1)

In [9]:
df_final.head()

Unnamed: 0,Trip_Distance,Customer_Since_Months,Life_Style_Index,Customer_Rating,Cancellation_Last_1Month,Var1,Var2,Var3,Surge_Pricing_Type,Type_Of_Cab_B,...,Destination_Type_F,Destination_Type_G,Destination_Type_H,Destination_Type_I,Destination_Type_J,Destination_Type_K,Destination_Type_L,Destination_Type_M,Destination_Type_N,Male
0,6.77,1.0,2.42769,3.905,0,40.0,46,60,2,1,...,0,0,0,0,0,0,0,0,0,0
1,29.47,10.0,2.78245,3.45,0,38.0,56,78,2,1,...,0,0,0,0,0,0,0,0,0,1
4,54.95,10.0,3.03453,3.4025,4,51.0,49,102,2,0,...,0,0,0,0,0,0,0,0,0,1
6,29.72,10.0,2.83958,2.975,1,83.0,50,75,2,0,...,0,0,0,0,0,0,0,0,0,1
7,18.44,2.0,2.81871,3.5825,0,103.0,46,63,2,1,...,0,0,0,0,0,0,0,0,0,1
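
As an alternative sketch, `pd.get_dummies` can encode several columns in one call and prefix the new columns automatically, which makes the manual rename and concat steps unnecessary. The frame `toy` below is hypothetical, and the resulting names differ slightly from the ones used in this notebook (`Type_of_Cab_B` instead of `Type_Of_Cab_B`, `Gender_Male` instead of `Male`).

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Type_of_Cab": ["A", "B", "C", "B"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Trip_Distance": [6.77, 29.47, 41.58, 61.56],
})

# Encode both categorical columns at once; each dummy column is named
# "<original column>_<category>" and the original columns are removed.
encoded = pd.get_dummies(
    toy,
    columns=["Type_of_Cab", "Gender"],
    drop_first=True,   # drop the first category of each column
    dtype=int,         # 0/1 integers instead of booleans
)

print(encoded.columns.tolist())
```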


# Normalization

In [10]:
cols_to_be_normalized = ['Trip_Distance', 'Customer_Since_Months', 'Life_Style_Index','Customer_Rating', 'Cancellation_Last_1Month', 'Var1', 'Var2', 'Var3']

cols_not_to_be_normalized = ['Type_Of_Cab_B', 'Type_Of_Cab_C', 'Type_Of_Cab_D', 'Type_Of_Cab_E','Type_Of_Cab_F', 
                            'Confidence_Life_Style_Index_B','Confidence_Life_Style_Index_C', 'Destination_Type_B',
                            'Destination_Type_C', 'Destination_Type_D', 'Destination_Type_E',
                            'Destination_Type_F', 'Destination_Type_G', 'Destination_Type_H',
                            'Destination_Type_I', 'Destination_Type_J', 'Destination_Type_K',
                            'Destination_Type_L', 'Destination_Type_M', 'Destination_Type_N',
                            'Male','Surge_Pricing_Type']

# create an abs_scaler object
abs_scaler = MaxAbsScaler()

# calculate the maximum absolute value for scaling the data using the fit method
abs_scaler.fit(df_final[cols_to_be_normalized])

# transform the data using the parameters calculated by the fit method (the maximum absolute values)
scaled_data = abs_scaler.transform(df_final[cols_to_be_normalized])

# store the results in a data frame
normalize = pd.DataFrame(scaled_data, columns=cols_to_be_normalized)

binarized = df_final[cols_not_to_be_normalized].reset_index(drop = True)
df_final = pd.concat([normalize,binarized], axis=1)

# take a look at our data
df_final.head()

Unnamed: 0,Trip_Distance,Customer_Since_Months,Life_Style_Index,Customer_Rating,Cancellation_Last_1Month,Var1,Var2,Var3,Type_Of_Cab_B,Type_Of_Cab_C,...,Destination_Type_G,Destination_Type_H,Destination_Type_I,Destination_Type_J,Destination_Type_K,Destination_Type_L,Destination_Type_M,Destination_Type_N,Male,Surge_Pricing_Type
0,0.061979,0.1,0.497976,0.781,0.0,0.190476,0.370968,0.291262,1,0,...,0,0,0,0,0,0,0,0,0,2
1,0.269798,1.0,0.570746,0.69,0.0,0.180952,0.451613,0.378641,1,0,...,0,0,0,0,0,0,0,0,1,2
2,0.503067,1.0,0.622454,0.6805,0.5,0.242857,0.395161,0.495146,0,1,...,0,0,0,0,0,0,0,0,1,2
3,0.272086,1.0,0.582465,0.595,0.125,0.395238,0.403226,0.364078,0,0,...,0,0,0,0,0,0,0,0,1,2
4,0.168818,0.2,0.578184,0.7165,0.0,0.490476,0.370968,0.305825,1,0,...,0,0,0,0,0,0,0,0,1,2
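
`MaxAbsScaler` divides each column by its largest absolute value, so non-negative features end up in the range [0, 1]. The sketch below checks this by hand. The first two rows reuse the Trip_Distance and Customer_Rating values of the first two trips; the third row is a hypothetical trip built from the column maxima (109.23 and 5.0) so that the small array scales the same way as the full data.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Columns: Trip_Distance, Customer_Rating
X_small = np.array([
    [6.77, 3.905],
    [29.47, 3.45],
    [109.23, 5.0],   # hypothetical row carrying the column maxima
])

scaled = MaxAbsScaler().fit_transform(X_small)

# The same result computed manually: x / max(|x|) per column
manual = X_small / np.abs(X_small).max(axis=0)

print(np.allclose(scaled, manual))
print(scaled.round(6))
```

The first row reproduces the values 0.061979 and 0.781 from the scaled table above.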


# Build the XGBoost Classifier

In [11]:
# remove Y from data
X = df_final.drop("Surge_Pricing_Type",axis = 1)
# get Y
Y = df_final["Surge_Pricing_Type"] 

In [12]:
# split train test sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,random_state=8,test_size=0.3,stratify=Y)
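
The `stratify=Y` argument keeps the share of each surge pricing type (almost) the same in the training and test sets. A minimal sketch on synthetic labels illustrates this; the class shares and the helper `shares` are assumptions made for the example, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels 1, 2, 3 with made-up class shares
rng = np.random.default_rng(8)
y_demo = rng.choice([1, 2, 3], size=1000, p=[0.2, 0.45, 0.35])
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=8, stratify=y_demo
)

def shares(labels):
    """Return the relative frequency of each class label."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

print("train:", shares(y_tr))
print("test: ", shares(y_te))
```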

In [13]:
# instantiate the XGBoost classifier with default parameters
XGB = XGBClassifier()

In [14]:
# fit data
XGB.fit(X_train, Y_train)

XGBClassifier(objective='multi:softprob')

In [15]:
# make predictions for test data
Y_pred = XGB.predict(X_test)
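
A note for readers on a recent XGBoost release: newer versions of `XGBClassifier` expect the class labels to be 0, 1, ..., n-1 and may reject the raw labels 1, 2, 3 used here. If `XGB.fit` complains about invalid classes, one possible workaround is to encode the labels before fitting and decode the predictions afterwards. The sketch below shows only the encoding round trip with scikit-learn's `LabelEncoder`; the arrays `y_raw` and `y_pred_encoded` are hypothetical stand-ins for `Y_train` and the model output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical Surge_Pricing_Type labels standing in for Y_train
y_raw = np.array([2, 2, 3, 1, 2])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)   # maps 1, 2, 3 to 0, 1, 2

# The model would be fitted on y_encoded and would predict encoded labels.
# Hypothetical predictions standing in for the model output:
y_pred_encoded = np.array([1, 0, 2])

# Map the predictions back to the original label values
y_pred = encoder.inverse_transform(y_pred_encoded)

print(y_encoded, y_pred)
```

Decoding the predictions keeps the evaluation in terms of the original classes 1, 2 and 3.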

# Evaluate Results

In [16]:
print(classification_report(Y_test,Y_pred))

              precision    recall  f1-score   support

           1       0.77      0.52      0.62      6912
           2       0.66      0.81      0.73     14427
           3       0.72      0.66      0.69     12102

    accuracy                           0.69     33441
   macro avg       0.72      0.66      0.68     33441
weighted avg       0.70      0.69      0.69     33441
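
To read the report: the f1-score is the harmonic mean of precision and recall, the macro average is the plain mean over the three classes, and the weighted average weights each class by its support. The sketch below recomputes these figures from the rounded values printed above, so the results agree with the report only up to rounding.

```python
import numpy as np

# Per-class values copied from the report above (classes 1, 2, 3)
precision = np.array([0.77, 0.66, 0.72])
recall = np.array([0.52, 0.81, 0.66])
support = np.array([6912, 14427, 12102])

# f1 = harmonic mean of precision and recall, per class
f1 = 2 * precision * recall / (precision + recall)

# macro average: unweighted mean over the classes
macro_precision = precision.mean()

# weighted average: mean weighted by the number of test samples per class
weighted_precision = np.average(precision, weights=support)

print(f1.round(2), round(macro_precision, 2), round(weighted_precision, 2))
```

Class 1 has the highest precision but the lowest recall: when the model predicts type 1 it is usually right, but it misses almost half of the actual type 1 trips.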

