#ECOMMERCE PROJECT: CUSTOMER PURCHASE INTENTION MODEL

**Objective**: This is a classification model intended to predict whether a customer will make a purchase. The use case is to examine real-time web analytics data and predict the probability of each customer making a purchase, which enables the retailer to aim targeted promotions and other marketing strategies at customers who are less likely to buy, in order to persuade them.

***The goal of any ecommerce application is to convert browsers into buyers.***

In [1]:
#The ecommerce purchase intention model predicts the probability that each customer makes a purchase.
#Once we can identify the potential buyers, we can target those customers.
#We will build a model that helps us understand how customer intent leads to a purchase.
#The model analyses consumer behaviour data from web analytics platforms.
#After the analysis, it can predict whether a customer will make a purchase during their visit.
#The things monitored using web analytics data are:
    #Number of times URLs were visited,
    #Information about the product,
    #Time spent on pages,
    #Number of pages of each page type visited (page type examples: homepage, product page, review page, etc.).

#Importing required Libraries

In [2]:
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from imblearn.over_sampling import SMOTE

from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier



***To ignore warnings***

In [3]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore",category=ConvergenceWarning)

In [4]:
import warnings
warnings.filterwarnings("ignore")

#Loading Dataset

In [5]:
df=pd.read_csv("online_shoppers_intention.csv")

***Checking first five rows of the dataset***

In [6]:
df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


***Checking no. of rows and columns***

In [7]:
df.shape

(12330, 18)

***Checking Dataset Information***

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

#Feature Engineering

***Converting the boolean columns Weekend and Revenue into binary values using the replace() method***

In [9]:
df["Weekend"].unique() #checking unique values in Weekend column.

array([False,  True])

In [10]:
df["Weekend"]=df["Weekend"].replace((True,False),(1,0))

In [11]:
df["Revenue"].unique() #checking unique values in Revenue column.

array([False,  True])

In [12]:
df["Revenue"]=df["Revenue"].replace((True,False),(1,0))
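An alternative to the replace() calls above is to cast the boolean columns directly. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not rows from the dataset) to show that `astype(int)` produces the same 0/1 encoding.

```python
import pandas as pd

# Toy frame standing in for the Weekend and Revenue columns (values are made up).
toy = pd.DataFrame({"Weekend": [False, True, False],
                    "Revenue": [True, False, False]})

# astype(int) maps False -> 0 and True -> 1 for every listed column in one call.
bool_cols = ["Weekend", "Revenue"]
toy[bool_cols] = toy[bool_cols].astype(int)

print(toy)
```

Casting avoids relying on value replacement and makes the intended dtype explicit.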

***Adding a new binary column "Returning_Visitor" derived from the VisitorType object column***

In [13]:
df.VisitorType.unique()

array(['Returning_Visitor', 'New_Visitor', 'Other'], dtype=object)

In [14]:
#VisitorType takes three values: Returning_Visitor, New_Visitor and Other.
#We keep a single binary Returning_Visitor column (1 for returning visitors, 0 for New_Visitor and Other).

In [15]:
condition=df["VisitorType"]=="Returning_Visitor"

In [16]:
df["Returning_Visitor"]=np.where(condition,1,0)

In [17]:
df=df.drop(columns="VisitorType") #Dropping the VisitorType column since it is not needed anymore.

In [18]:
df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,Weekend,Revenue,Returning_Visitor
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,0,0,1
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,0,0,1
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,0,0,1
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,0,0,1
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,1,0,1


In [19]:
df.dtypes  #checking all columns datatypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
Weekend                      int64
Revenue                      int64
Returning_Visitor            int32
dtype: object

***Handling the Month column (object dtype) by converting it into numeric codes using OrdinalEncoder***

In [20]:
df.Month.unique()  #checking unique values in Month column

array(['Feb', 'Mar', 'May', 'Oct', 'June', 'Jul', 'Aug', 'Nov', 'Sep',
       'Dec'], dtype=object)

In [21]:
ordinal_encoder=OrdinalEncoder()

In [22]:
df["Month"]=ordinal_encoder.fit_transform(df[["Month"]])
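Note that OrdinalEncoder sorts string categories alphabetically by default, so the codes assigned to Month do not follow calendar order. If calendar order matters for a given model, one option is to pass the category order explicitly. The following is a minimal sketch on a made-up frame; `toy` and `calendar_order` are illustrative names, and the month labels are the ten values listed by df.Month.unique() above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Month labels as they appear in the dataset (note "June" rather than "Jun").
calendar_order = ["Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy column with a few months (illustrative only).
toy = pd.DataFrame({"Month": ["Feb", "Dec", "Aug", "June"]})

# Default behaviour: categories are sorted alphabetically before being numbered.
default_codes = OrdinalEncoder().fit_transform(toy[["Month"]]).ravel()

# Explicit category list: codes follow the order we supply.
calendar_codes = OrdinalEncoder(categories=[calendar_order]).fit_transform(toy[["Month"]]).ravel()

print("alphabetical:", default_codes)
print("calendar:    ", calendar_codes)
```

Tree-based models are largely insensitive to this choice, while distance-based and linear models may not be.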

***Quick look at the target variable, i.e. the Revenue column***

In [23]:
df.Revenue.value_counts()  #Checking the total counts of each unique class in Revenue column

0    10422
1     1908
Name: Revenue, dtype: int64
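The counts above show a clear class imbalance. As a quick check, the sketch below turns the two counts into class shares; a classifier that always predicts "no purchase" would already reach the majority share as its accuracy, which is why ROC/AUC is used for model selection later on. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 10422, 1: 1908}, name="Revenue")

# Share of each class in the full dataset.
share = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline_accuracy = share.max()

print(share.round(3))
print("majority-class baseline accuracy:", round(majority_baseline_accuracy, 3))
```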

***Checking Correlation of features with Revenue column***

In [24]:
result=df.corr()["Revenue"]

In [25]:
result1=result.sort_values(ascending=False)

In [26]:
result1

Revenue                    1.000000
PageValues                 0.492569
ProductRelated             0.158538
ProductRelated_Duration    0.152373
Administrative             0.138917
Informational              0.095200
Administrative_Duration    0.093587
Month                      0.080150
Informational_Duration     0.070345
Weekend                    0.029295
Browser                    0.023984
TrafficType               -0.005113
Region                    -0.011595
OperatingSystems          -0.014668
SpecialDay                -0.082305
Returning_Visitor         -0.103843
BounceRates               -0.150673
ExitRates                 -0.207071
Name: Revenue, dtype: float64

In [27]:
#The strongest predictor of conversion is the PageValues column (correlation of 0.49 with Revenue).
#This column contains the page value metric,
#which is higher for customers who viewed product, basket and checkout pages,
#so it plays a significant role.

In [28]:
#Customers who viewed more product pages and spent longer looking at them were also more likely to have purchased,
#although these correlations (about 0.16 and 0.15) are much weaker than that of PageValues.

#Preparing features and target

In [29]:
X=df.drop(["Revenue"],axis=1)  #creating features

In [30]:
y=df["Revenue"]  #creates target label

#Preparing Train & Test Dataset

In [31]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
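Because the target is imbalanced, a plain random split can leave the train and test sets with slightly different purchase rates. A possible refinement, not used in this notebook, is to pass `stratify` to train_test_split. The sketch below uses synthetic data (`X_toy`, `y_toy` are made-up stand-ins with roughly the same 15% positive rate).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 15% positives (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 150 + [0] * 850)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```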

#Model Pipeline

In [32]:
#A machine learning pipeline automates the workflow of a complete machine learning task.
#It chains sequential steps that cover everything from data extraction and preprocessing
    #to model training and deployment.
#Each step in the pipeline is designed as an independent module, and all these modules are tied together
    #to produce the final result.
    
#We will create a model pipeline that routes columns through ColumnTransformer(), imputes missing values,
    #and scales the data before passing it all to our final model.
#This avoids generating features by hand and makes feeding data into the model fully automated.

In [33]:
def model_pipeline(X,model):
    #Split the columns by dtype so that each group gets its own preprocessing step.
    numeric_columns=X.select_dtypes(exclude=["object"]).columns.values.tolist()
    categorical_columns=X.select_dtypes(include=["object"]).columns.values.tolist()
    
    #strategy="constant" with the default fill_value replaces missing numeric values with 0.
    numeric_pipeline=SimpleImputer(strategy="constant")
    categorical_pipeline=OneHotEncoder(handle_unknown="ignore")
    
    preprocessor=ColumnTransformer(transformers=[("numeric",numeric_pipeline,numeric_columns),
                                                 ("categorical",categorical_pipeline,categorical_columns)],
                                   remainder="passthrough")
    
    steps=[("preprocessor",preprocessor),
           ("smote",SMOTE(random_state=1)), #for handling class imbalance; imblearn applies it during fit only.
           ("scaler",MinMaxScaler()), #scales features to [0,1]; chi2 below requires non-negative inputs.
           ("feature_selection",SelectKBest(score_func=chi2,k=7)), #keeps the 7 highest-scoring features (k=7 as per correlation).
           ("model",model)]
    
    bundled_pipeline=imbpipeline(steps=steps)
    
    return bundled_pipeline
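One detail of the step order in model_pipeline() is worth illustrating: the chi2 score function only accepts non-negative features, so the MinMaxScaler has to run before SelectKBest. The sketch below reproduces just those two steps on synthetic data; `X_toy`, `y_toy` and the choice of k=1 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: column 0 depends on the label, columns 1-3 are pure noise.
rng = np.random.default_rng(0)
n = 400
y_toy = rng.integers(0, 2, size=n)
informative = y_toy * 2.0 + rng.normal(scale=0.5, size=n)  # contains negative values
noise = rng.normal(size=(n, 3))
X_toy = np.column_stack([informative, noise])

# chi2 applied to the raw data fails because some values are negative.
try:
    chi2(X_toy, y_toy)
    raised = False
except ValueError:
    raised = True

# Scaling to [0, 1] first makes every feature valid input for chi2.
selector = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("feature_selection", SelectKBest(score_func=chi2, k=1)),
])
X_selected = selector.fit_transform(X_toy, y_toy)
kept = selector.named_steps["feature_selection"].get_support()

print("chi2 on raw data raised ValueError:", raised)
print("columns kept after scaling:", kept)
```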

#Model Selection

In [34]:
#We create a dictionary with all our selected classification algorithms,
    #then loop over it, building the pipeline for each model so that they run one by one automatically.
    #Each model is evaluated with 10-fold cross-validation to get a reliable estimate of its performance,
    #and the results are stored in a DataFrame.
    #Finally, we select the model with the highest ROC/AUC score.

In [35]:
def select_model(X,y,pipeline=None):
    
    #All candidate classifiers, keyed by the name reported in the results table.
    classifiers={
        "DummyClassifier":DummyClassifier(strategy="most_frequent"),
        "XGBClassifier":XGBClassifier(verbosity=0, use_label_encoder=False,
                                      eval_metric="logloss", objective="binary:logistic"),
        "LGBMClassifier":LGBMClassifier(),
        "RandomForestClassifier":RandomForestClassifier(),
        "DecisionTreeClassifier":DecisionTreeClassifier(),
        "ExtraTreeClassifier":ExtraTreeClassifier(),
        "ExtraTreesClassifier":ExtraTreesClassifier(),
        "AdaBoostClassifier":AdaBoostClassifier(),
        "KNeighborsClassifier":KNeighborsClassifier(),
        "RidgeClassifier":RidgeClassifier(),
        "SGDClassifier":SGDClassifier(),
        "BaggingClassifier":BaggingClassifier(),
        "BernoulliNB":BernoulliNB(),
        "SVC":SVC(),
        "MLPClassifier":MLPClassifier(),
        "MLPClassifier (paper)":MLPClassifier(hidden_layer_sizes=(27,50),
                                              max_iter=300, activation="relu",
                                              solver="adam", random_state=1),
    }
    
    cols=["model", "run_time", "roc_auc"]
    df_models=pd.DataFrame(columns=cols)
    
    for key in classifiers:
        start_time=time.time()
        print()
        print("model_pipeline run successfully on",key)
        
        pipeline=model_pipeline(X,classifiers[key]) #use the X passed to the function, not the global X_train.
        
        cv=cross_val_score(pipeline, X, y, cv=10, scoring="roc_auc")
        
        row={"model":key,
            "run_time":format(round((time.time()-start_time)/60,2)),
            "roc_auc":cv.mean()}
        
        df_models=pd.concat([df_models,pd.DataFrame([row])],ignore_index=True)
        
        df_models=df_models.sort_values(by="roc_auc",ascending=False)
        
    return df_models
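select_model() records a single metric per model. If several metrics are of interest, scikit-learn's cross_validate can score them in one pass. The sketch below is an illustration on synthetic data with two stand-in classifiers (LogisticRegression is not part of this notebook's model list); it also shows that the most-frequent DummyClassifier lands at a ROC/AUC of 0.5, which is the baseline the other models should beat.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# The dummy model never predicts class 1, so F1 triggers an undefined-metric warning.
warnings.filterwarnings("ignore")

# Synthetic imbalanced data (about 15% positives), illustrative only.
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.85, 0.15], random_state=0)

candidates = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    # With an integer cv and a classifier, scikit-learn uses stratified folds.
    scores = cross_validate(clf, X_toy, y_toy, cv=5, scoring=["roc_auc", "f1"])
    results[name] = {"roc_auc": scores["test_roc_auc"].mean(),
                     "f1": scores["test_f1"].mean()}

for name, metrics in results.items():
    print(name, metrics)
```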

**Calling the *select_model* function**

In [36]:
models=select_model(X_train,y_train)


model_pipeline run successfully on DummyClassifier

model_pipeline run successfully on XGBClassifier

model_pipeline run successfully on LGBMClassifier

model_pipeline run successfully on RandomForestClassifier

model_pipeline run successfully on DecisionTreeClassifier

model_pipeline run successfully on ExtraTreeClassifier

model_pipeline run successfully on ExtraTreesClassifier

model_pipeline run successfully on AdaBoostClassifier

model_pipeline run successfully on KNeighborsClassifier

model_pipeline run successfully on RidgeClassifier

model_pipeline run successfully on SGDClassifier

model_pipeline run successfully on BaggingClassifier

model_pipeline run successfully on BernoulliNB

model_pipeline run successfully on SVC

model_pipeline run successfully on MLPClassifier

model_pipeline run successfully on MLPClassifier (paper)


#Model Scores

In [37]:
models

Unnamed: 0,model,run_time,roc_auc
0,MLPClassifier,4.27,0.910004
15,MLPClassifier (paper),3.0,0.908575
1,LGBMClassifier,0.06,0.906965
2,RandomForestClassifier,0.56,0.903191
3,XGBClassifier,0.28,0.899905
4,ExtraTreesClassifier,0.28,0.89976
5,SGDClassifier,0.02,0.895751
6,AdaBoostClassifier,0.18,0.892649
7,SVC,0.98,0.89013
8,BaggingClassifier,0.14,0.878739


In [38]:
#The best-performing model is MLPClassifier(), with a cross-validated ROC/AUC score of 0.910.
#We select it as our best model and examine its results in more detail to see how well it works.

#Examining the performance of our best model

In [39]:
#By re-running the model_pipeline() function with MLPClassifier() as the selected model,
#we can generate predictions and assess their accuracy on the test data.

#Accessing best model and Training

In [40]:
selected_model=MLPClassifier()

In [41]:
bundled_pipeline=model_pipeline(X_train,selected_model)

In [42]:
#training
bundled_pipeline.fit(X_train,y_train)

#Prediction with best fitted model

In [43]:
#prediction
y_pred=bundled_pipeline.predict(X_test)

In [44]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

#ROC/AUC score

In [45]:
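#Note: roc_auc_score receives hard class labels here, so the value equals the balanced accuracy;
#it is not directly comparable to the score-based roc_auc from cross-validation.
roc_auc=roc_auc_score(y_test,y_pred)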
accuracy=accuracy_score(y_test,y_pred)
f1=f1_score(y_test,y_pred)

In [46]:
print("ROC/AUC:",roc_auc,"\nAccuracy:",accuracy,"\nF1 score:",f1)

ROC/AUC: 0.835597739477735 
Accuracy: 0.8726682887266829 
F1 score: 0.6731436502428868


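***The MLPClassifier model achieved a cross-validated ROC/AUC score of 0.91 on the training data, which is good. On the test data it shows a ROC/AUC score of 0.83 with an overall accuracy of 0.87. The test ROC/AUC is computed from hard class predictions rather than predicted probabilities, so it equals the balanced accuracy (the mean of the two class recalls, 0.89 and 0.78) and is not directly comparable to the cross-validated figure.***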
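The difference between scoring hard labels and scoring probabilities can be seen on a tiny hand-made example. In the sketch below, `y_true` and `y_score` are made-up values; in this notebook the scores would presumably come from `bundled_pipeline.predict_proba(X_test)[:, 1]`, which is an assumption about how one might extend the cells above rather than something that was run here.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.45, 0.70, 0.20, 0.60, 0.55])

# Hard predictions at the usual 0.5 threshold, as predict() would return.
y_label = (y_score >= 0.5).astype(int)

# ROC/AUC from continuous scores uses the full ranking of the samples.
auc_from_scores = roc_auc_score(y_true, y_score)

# ROC/AUC from 0/1 labels collapses to a single operating point.
auc_from_labels = roc_auc_score(y_true, y_label)
bal_acc = balanced_accuracy_score(y_true, y_label)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
print("balanced accuracy:     ", bal_acc)
```

With binary hard labels the two last values coincide, which is why a label-based ROC/AUC tends to look lower than a probability-based one for the same model.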

#Classification Report

In [47]:
classify_report=classification_report(y_test,y_pred)

In [48]:
print("\n                   CLASSIFICATION REPORT\n\n\n",classify_report)


                   CLASSIFICATION REPORT


               precision    recall  f1-score   support

           0       0.95      0.89      0.92      3077
           1       0.59      0.78      0.67       622

    accuracy                           0.87      3699
   macro avg       0.77      0.84      0.80      3699
weighted avg       0.89      0.87      0.88      3699
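The per-class figures in the report can be traced back to the four cells of a confusion matrix (confusion_matrix is imported at the top of the notebook but not used). The sketch below works through the arithmetic on made-up labels; the arrays are illustrative and are not the notebook's test predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (6 negatives, 4 positives).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall_0 = tn / (tn + fp)      # share of actual non-buyers predicted as non-buyers
recall_1 = tp / (tp + fn)      # share of actual buyers predicted as buyers
precision_1 = tp / (tp + fp)   # share of predicted buyers who actually bought

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall class 0:", recall_0)
print("recall class 1:", recall_1)
print("precision class 1:", precision_1)
```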



#Conclusion

In [49]:
#We examined our selected best-performing model, MLPClassifier, using the classification report.

#It showed that:
#We correctly identified 89% of the customers who did not purchase during their session (recall on class 0).
#We correctly identified 78% of the customers who made a purchase during their session (recall on class 1).
#Precision on class 1 is 0.59, so roughly 4 in 10 customers predicted to purchase did not do so.
#Overall accuracy is 87%.

#The result shows that it is possible to predict purchase intention from customer behavior data with a good degree of accuracy.