# Online Purchasing Intention Prediction

#### Import the libraries used in this project

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#For Random Forest - Prediction
from sklearn.ensemble import RandomForestClassifier

#Confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

#Decision Tree - Interpretation
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score

%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier

from numpy.random import seed
seed(862)
from tensorflow.random import set_seed
set_seed(862)

In [None]:
#Importing the dataset
df = pd.read_csv("data.csv")
df.head()

In [None]:
df.shape
# The original data has 12,330 observations and 18 features.

**Ultimate Goal - Productionizing the Project**: The marketing team can use the model to build a strategy around which visitor features to target with future advertisements.

## Part 1: Data Pre-processing

In [None]:
# Checking if the Month column includes all 12 months. It's missing Jan and Apr.
df.Month.unique()

In [None]:
# Check the population of each Visitor Type. 
df.groupby('VisitorType')['Revenue'].count()

In [None]:
# 85 visitors are not categorized. We decided to remove all rows with the "Other" visitor type.
df.drop(df[df['VisitorType'] == 'Other'].index, inplace = True)

## Part 2: EDA

In [None]:
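# Checking for correlations between features
# numeric_only=True skips the string columns (Month, VisitorType), which pandas >= 2.0 would otherwise reject
plt.figure(figsize = (20,10))
corr = df.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')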

**Notes:** 
* Administrative, Informational, ProductRelated, ExitRates and PageValues show **high correlation** values.
* We drop ExitRates and keep only BounceRates because the bounce rate of a page **reflects the probability** that a visitor exits the browsing session after viewing that page. It indicates how effective the page is at convincing the visitor to stay longer. Since our project focuses on retail shopping and converting visits into purchases, the **first impression** a page makes matters most.
* Our team also decided **not to use** PageValues in our models because it is a feature that is generated only after a visitor has completed a purchase.
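
Reading strongly correlated pairs off a large heatmap by eye is error-prone. The sketch below shows one way to list them programmatically. The helper `strong_pairs`, the 0.8 threshold and the toy columns are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) never pass this comparison.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Toy data: 'pages' and 'duration' move together, 'noise' does not.
rng = np.random.default_rng(862)
pages = rng.integers(1, 50, size=200)
toy = pd.DataFrame({
    "pages": pages,
    "duration": pages * 30 + rng.normal(0, 20, size=200),
    "noise": rng.normal(size=200),
})
print(strong_pairs(toy))
```

On the real data, the same helper could be called as `strong_pairs(df)` to get the candidate features to drop.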


In [None]:
# For the scope of our project, we will only use the following features in our models.
df_2 = df[['Administrative_Duration', 'Informational_Duration',
       'ProductRelated_Duration', 'BounceRates', 'SpecialDay', 'Month', 'VisitorType', 'Weekend',
        'Revenue']]

In [None]:
# Checking for null values
df_2.isnull().sum()

In [None]:
# Visualize the data using seaborn Pairplots
g = sns.pairplot(df_2, hue = 'Revenue')

**Notes:**
* How did we choose our methods? This is a supervised learning, classification problem with overlapping observations. We want one highly interpretable model to answer our ultimate managerial question, and we also want to give the marketing team a model they can use for direct prediction.


* Looking at the pairplot above, there is clearly a lot of overlap between the classes in many variables, so logistic regression is probably not a good choice. That leaves Decision Tree, Random Forest, XGBoost, SVM, an ensemble combining these, or KNN, although KNN is known for slow execution at prediction time. Our team will use a Decision Tree (for interpretation), a Random Forest for comparison, and Ensemble Learning with XGBoost and SVM as base learners and KNN as the blender.

In [None]:
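# Investigate all the numerical features by our y. Density plots show the distribution of each numerical feature, split by Revenue.
num_features = ['Administrative_Duration', 'Informational_Duration',
       'ProductRelated_Duration', 'BounceRates', 'SpecialDay']

for f in num_features:
    plt.figure(figsize=(10,5))
    # common_norm=False scales each Revenue class separately so the smaller class stays visible
    sns.kdeplot(data=df_2, x=f, hue='Revenue', fill=True, common_norm=False)
    plt.xlabel(f)
    plt.ylabel('Density')  # the y-axis of a KDE plot is a density, not Revenue
    plt.show()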

In [None]:
# Investigate all the categorical features by our y.
cat_features = ['Month', 'VisitorType', 'Weekend']

for f in cat_features:
    plt.figure()
    ax = sns.countplot(x=f, data=df_2, hue = 'Revenue')

In [None]:
# Source: https://stackoverflow.com/questions/31749448/how-to-add-percentages-on-top-of-grouped-bars

def with_hue(ax, feature, Number_of_categories, hue_categories):
    a = [p.get_height() for p in ax.patches]
    patch = [p for p in ax.patches]
    for i in range(Number_of_categories):
        total = feature.value_counts().values[i]
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
            x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
            y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height() 
            ax.annotate(percentage, (x, y), size = 12)

def without_hue(ax, feature):
    total = len(feature)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), size = 12)

In [None]:
# Plot percentage of purchase by visitor types.

plt.figure (figsize =(7,5))
ax = sns.countplot(x='VisitorType', data=df_2, hue = 'Revenue')
plt.xticks(size=12)
plt.xlabel ('VisitorType', size=12)
plt.yticks(size=12)
plt.ylabel ('count',size=12)

with_hue(ax,df_2.VisitorType,2,2)

**Notes from graph: percentage of purchase by visitor types:** From our EDA, only about 14% of returning visitors made a purchase, while almost 25% of new visitors did. This leads us to our main hypothesis test.
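
The purchase rates quoted above can also be computed directly instead of being read off the chart. The following is a minimal sketch on made-up sessions (the `toy` frame is hypothetical); on the notebook's data the same `groupby` would be applied to `df_2` before the dummy encoding.

```python
import pandas as pd

# Hypothetical sessions; Revenue is True when the visit ended in a purchase.
toy = pd.DataFrame({
    "VisitorType": ["New_Visitor"] * 4 + ["Returning_Visitor"] * 10,
    "Revenue": [True, False, False, False] + [True] + [False] * 9,
})

# The mean of a boolean column is the share of True values,
# so this gives the purchase rate per visitor type.
purchase_rate = toy.groupby("VisitorType")["Revenue"].mean()
print(purchase_rate)
```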

In [None]:
# Check for Outliers

fig, ([ax1, ax2], [ax3, ax4]) = plt.subplots(2, 2) 
ax1.boxplot(df_2.Administrative_Duration)
ax2.boxplot(df_2.Informational_Duration)
ax3.boxplot(df_2.ProductRelated_Duration)
ax4.boxplot(df_2.BounceRates)

ax1.title.set_text('Administrative_Duration')
ax2.title.set_text('Informational_Duration')
ax3.title.set_text('ProductRelated_Duration')
ax4.title.set_text('BounceRates')

plt.suptitle('Data Outliers')

In [None]:
# Checking for the range of the variable and the count of each value. Outliers start from value over 2200.
df_2.groupby('Administrative_Duration')['Revenue'].count()

In [None]:
# Removing outliers with Administrative_Duration of 2200 or more.
# A boolean mask avoids the SettingWithCopyWarning that an in-place drop on a slice of df can trigger.
df_2 = df_2[~(df_2['Administrative_Duration'] >= 2200)].copy()

In [None]:
# Checking for the range of the variable and the count of each value. Outliers start from value over 20000.
df_2.groupby('ProductRelated_Duration')['Revenue'].count()

In [None]:
# Removing outliers with ProductRelated_Duration of 20000 or more.
# A boolean mask avoids the SettingWithCopyWarning that an in-place drop on a slice of df can trigger.
df_2 = df_2[~(df_2['ProductRelated_Duration'] >= 20000)].copy()
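
The cut-offs of 2200 and 20000 above were chosen by inspecting the boxplots and value counts. As a point of comparison only, the sketch below shows a rule-based alternative, Tukey's upper fence (Q3 + k * IQR). The helper `upper_fence` and the sample values are assumptions for illustration. On heavily right-skewed duration data the default k = 1.5 would likely remove many more rows than the manual thresholds, so a larger k may be more appropriate.

```python
import pandas as pd

def upper_fence(values, k=1.5):
    """Tukey's upper fence: Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Made-up durations with one extreme value.
durations = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 5000])

fence = upper_fence(durations)
kept = durations[durations <= fence]  # rows at or below the fence are kept
print(fence, len(kept))
```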

In [None]:
df_2.shape

In [None]:
# Create dummy variables for Month and Visitor features
df_2 = pd.get_dummies(data = df_2, columns = ['VisitorType','Month'], drop_first = True)
df_2.head()

In [None]:
df_2.describe()

## Part 3: Hypothesis Testing

### 1) Hypothesis Testing Using Chi-Squared Test

During our EDA, new visitors appeared to have a higher purchasing intention than returning visitors. We would like to test whether this finding is statistically significant. First, we check whether there is a relationship between the type of visitor and their purchasing intention.

* **H0a:** there is **no** relationship between type of visitor and purchasing intention
* **H1a:** there is **statistically significant relationship** between type of visitor and purchasing intention

Source Code: https://github.com/yug95/MachineLearning/blob/master/Hypothesis%20testing/Paired%20T-test.ipynb
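
As a cross-check for a manual chi-squared calculation, `scipy.stats.chi2_contingency` returns the statistic, p-value, degrees of freedom and expected counts in one call. The sketch below uses made-up 2x2 counts, not the project's data. Note that SciPy applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is needed for the result to match the plain sum of (O - E)^2 / E.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 counts: rows = visitor type, columns = (no purchase, purchase).
observed = np.array([[75, 25],
                     [430, 70]])

# correction=False turns off Yates' continuity correction.
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Manual calculation of the same statistic and p-value.
manual_stat = ((observed - expected) ** 2 / expected).sum()
manual_p = chi2.sf(manual_stat, dof)  # survival function = 1 - CDF

print(stat, manual_stat, dof)
```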

In [None]:
from scipy.stats import chi2_contingency
from scipy import stats

In [None]:
df_3 = df[['VisitorType','Revenue']]
df_3

In [None]:
# Create contingency table
contingency_table=pd.crosstab(df_3["VisitorType"],df_3["Revenue"])
print('Contingency_table :\n',contingency_table)

In [None]:
# Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :\n",Observed_Values)

In [None]:
# Expected Values
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :\n",Expected_Values)

In [None]:
# Degrees of freedom
no_of_rows, no_of_columns = contingency_table.shape
# Note: df_3 is reused here for the degrees of freedom and no longer refers to the DataFrame above
df_3=(no_of_rows-1)*(no_of_columns-1)
print("Degrees of Freedom:-",df_3)

# Set significant level alpha at 0.05.
alpha = 0.05

In [None]:
# Getting the p-value of the Chi-Squared Test
from scipy.stats import chi2
# Sum (O - E)^2 / E over the rows, then over the columns
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)
# chi2.sf is the survival function (1 - CDF); it stays accurate for very small p-values
p_value= chi2.sf(chi_square_statistic, df_3)
print('P-value of the Chi-Squared Test is:',p_value)

In [None]:
# Getting the Critical Value
critical_value=chi2.ppf(q=1-alpha,df=df_3)
print('critical_value:',critical_value)

In [None]:
# p-value (survival function = 1 - CDF)
p_value=chi2.sf(chi_square_statistic, df_3)
print('p-value:',p_value)

In [None]:
if p_value<=alpha:
    print("We reject Null Hypothesis H0 as the p-value is smaller than or equal to alpha."
          " There is a relationship between the type of visitor and purchasing intention.")
else:
    print("We fail to reject Null Hypothesis H0 as the p-value is greater than alpha."
          " There is no evidence of a relationship between the type of visitor and purchasing intention.")

### 2) Hypothesis Testing Using T-test

We have validated that there is a statistically significant relationship between the type of visitor and purchasing intention. We would now like to test whether returning visitors have a different purchase rate than new visitors.

* **H0b:** The purchase rate of Returning Visitors is the **same** as the purchase rate of New Visitors
* **H1b:** The purchase rate of Returning Visitors is **different** from the purchase rate of New Visitors

Source Code: https://www.pythonfordatascience.org/independent-samples-t-test-python/
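
For a 0/1 outcome such as Revenue, each group's mean is its purchase rate, so a two-sample t-test compares the two rates. The sketch below runs Welch's variant (`equal_var=False`), which does not assume equal variances. This is an illustrative alternative on made-up outcomes, not the exact test configuration used in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 purchase outcomes for two visitor groups.
returning = np.array([1] * 14 + [0] * 86)
new = np.array([1] * 25 + [0] * 75)

# equal_var=False requests Welch's t-test.
t_stat, p_value = stats.ttest_ind(returning, new, equal_var=False)

# The difference in means is the difference in purchase rates.
rate_diff = returning.mean() - new.mean()
print(t_stat, p_value, rate_diff)
```

A negative t statistic here means the first group (returning visitors) has the lower purchase rate.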

#### Using researchpy

In [None]:
import researchpy as rp

In [None]:
df_4 = df[['VisitorType','Revenue']]

In [None]:
summary, results = rp.ttest(group1= df_4['Revenue'][df_4['VisitorType'] == 'Returning_Visitor'], group1_name= "Returning_Visitors",
                            group2= df_4['Revenue'][df_4['VisitorType'] == 'New_Visitor'], group2_name= "New_Visitors")
summary

In [None]:
results

#### Using scipy.stats

In [None]:
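from scipy import stats
# Cast the boolean Revenue column to 0/1 so the group means are purchase rates
stats.ttest_ind(df_4['Revenue'][df_4['VisitorType'] == 'Returning_Visitor'].astype(int),
                df_4['Revenue'][df_4['VisitorType'] == 'New_Visitor'].astype(int))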

**Conclusion:** Since the p-value is below our significance level, we reject H0. The average purchasing probability of Returning Visitors is statistically significantly different from, and lower than, that of New Visitors (0.1393 < 0.2491); t(12243) = -11.67, p < 0.001.

## Part 4: Predictive Models 

### 1) Decision Tree

Recall that we are building a model to answer the question:
What strategies should the marketing team use for advertisement targeting to increase the purchase rate?

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [None]:
df_2.shape

In [None]:
# Define X and y for Decision Tree Model
X = df_2.drop('Revenue', axis = 1)
y = df_2.Revenue

In [None]:
# Train test split on dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=862)

#### Step 1: Tune Decision Tree Hyperparameters

In [None]:
# Set up GridSearchCV
dt = DecisionTreeClassifier(random_state = 862)  # fixed seed so the grid search is reproducible
param = {'criterion':['entropy'],
         'max_depth':range(2,17),
         'min_samples_split':[3, 5, 10, 15],
         'min_samples_leaf':[3, 5, 10, 15]}


grid= GridSearchCV(dt, param, cv = 5, n_jobs = -1)

In [None]:
# Fitting decision tree
grid.fit(X_train, y_train)

In [None]:
print("Best parameters:",grid.best_params_)
print("Accuracy of model:",np.mean(grid.predict(X_test) == y_test))
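
Besides the held-out accuracy printed above, a fitted `GridSearchCV` also exposes the mean cross-validated score of the best setting as `best_score_`. Comparing the two can hint at overfitting to the training folds. The sketch below illustrates this on synthetic, imbalanced data from `make_classification`. The data, the reduced grid and the `*_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly an 85/15 class split.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.85, 0.15], random_state=862)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=862)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=862),
    {"max_depth": [2, 3, 4], "min_samples_leaf": [3, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_          # mean cross-validated accuracy
test_accuracy = search.score(X_te, y_te)  # accuracy on the held-out split
print(search.best_params_, cv_accuracy, test_accuracy)
```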

#### Step 2: Apply tuned hyperparameters to model

In [None]:
from sklearn.tree import plot_tree

In [None]:
# The best max depth is 3, which gives a tree too shallow to interpret, so we increase it
# to 4 and re-fit the tree.
dt_tuned = DecisionTreeClassifier(criterion = "entropy",
                                  max_depth = 4,
                                  min_samples_leaf = 3,
                                  min_samples_split = 3,
                                  random_state = 862)

dt_tuned.fit(X_train,y_train)

In [None]:
# Checking the important features
Feature_Importance = pd.DataFrame({'feature':X.columns.values, 'importance':dt_tuned.feature_importances_})
Feature_Importance.sort_values(by = ['importance'], ascending = False)

#### Step 3: Visualize Decision Tree for Segmentation

In [None]:
plt.figure("Decision Tree", figsize = [25,8])
plot_tree(dt_tuned,fontsize=10, filled=True, feature_names=list(X.columns), class_names = True)
plt.tight_layout()
plt.title("Decision Tree", size=18)
plt.show()

In [None]:
#Confusion matrix
from sklearn.metrics import confusion_matrix

# Predict the labels of the test set with the tuned tree
y_hat = dt_tuned.predict(X_test)
print(y_hat)

#Prediction confusion matrix:
print("Results from the Decision Tree Classifier:\n")
cf_matrix = confusion_matrix(y_test, y_hat)
cf_matrix

TN = cf_matrix[0,0] #True Negative
FN = cf_matrix[1,0] #False Negative
FP = cf_matrix[0,1] #False Positive
TP = cf_matrix[1,1] #True Positive

Err = float(FP + FN)/(FP + FN + TP + TN) #Prediction Error
Acc = float(TP + TN)/(FP + FN + TP + TN) #Prediction Accuracy
FPR = float(FP)/(FP + TN)  #False Positive Rate
TNR = float(TN)/(FP + TN)  #True Negative Rate
TPR = float(TP)/(FN + TP)  #True Positive Rate
FNR = float(FN)/(FN + TP)  #False Negative Rate
print("False Positive Rate = %f " %FPR)
print("False Negative Rate = %f " %FNR)
print("True Positive Rate = %f " %TPR)
print("True Negative Rate = %f " %TNR)
print("Misclassification Error = %f " %Err)
print("Accuracy = %f " %Acc)

group_names = ['True Neg','False Pos','False Neg','True Pos']
cf_matrix_value = [TNR,FPR,FNR,TPR]

group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix_value]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]

labels = np.asarray(labels).reshape(2,2)

ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

ax.set_title('Decision Tree Classifier Confusion Matrix\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - list must follow the sorted label order (False, then True)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()

### 2) Random Forest 

In [None]:
# Tuning RFC
RFC = RandomForestClassifier(random_state = 862)  # fixed seed so the results are reproducible
parames_RFC = [{ 'n_estimators': [50,100,150, 200, 250, 300]}]
RFC_tuned = GridSearchCV(RFC, parames_RFC, n_jobs = -1, cv = 5)

In [None]:
# Fit the model
RFC_tuned.fit(X_train, y_train)
print("Accuracy of RFC model:",accuracy_score(y_test, RFC_tuned.predict(X_test)))

**Notes:** The tuned Random Forest Classifier gives a lower accuracy score than the tuned Decision Tree.

### 3) Ensemble

In [None]:
from xgboost import XGBClassifier
from sklearn.svm import SVC
scale = StandardScaler()

In [None]:
# Train test split on dataset for Ensemble
X_trainEN, X_testEN, y_trainEN, y_testEN = train_test_split(X, y, test_size=0.3, random_state=862)

In [None]:
# Scale data for Ensemble model
X_trainEN_s = scale.fit_transform(X_trainEN)
X_testEN_s = scale.transform(X_testEN)

In [None]:
# Tuning SVC
SVM = SVC(kernel = 'rbf', random_state = 862)
params_SVM = {'C': [0.01, 0.1, 1, 10],
             'gamma': [0.1, 0.5, 1, 2, 3, 4]}
SVM_tuned = GridSearchCV(SVM, params_SVM, cv = 5, n_jobs = -1)

In [None]:
# Tuning XGB
XGB = XGBClassifier(n_estimators = 5, random_state = 862)
params_XGB = {'n_estimators': [10,20,30,40,50]}
XGB_tuned = GridSearchCV(XGB, params_XGB, cv = 5, n_jobs = -1)

In [None]:
# Defining the base learners
models = {'SVM':SVM_tuned, 'XGB':XGB_tuned}

In [None]:
# Defining the blender
blender = KNeighborsClassifier()

In [None]:
# Splitting the training data into two parts, one to train the base learners, another to train the blender
X_train_s1, X_train_s2, y_train1, y_train2 = train_test_split(X_trainEN_s, y_trainEN, test_size = 0.5, random_state = 862)

In [None]:
# Training the base learners
for name, model in models.items():
    model.fit(X_train_s1, y_train1)

In [None]:
# Training the blender
# Get the prediction
ENpredictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    ENpredictions[name] = model.predict(X_train_s2)

# Get the blender
scaler_blend = StandardScaler() # Scale the predictions 
predictions_scale = scaler_blend.fit_transform(ENpredictions)
blender.fit(predictions_scale, y_train2)

In [None]:
# Perform evaluation
# First send the data through the weak learners
ENpredictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    ENpredictions[name] = model.predict(X_testEN_s)
    
# Prediction through the blender, and evaluate
predictions_scale = scaler_blend.transform(ENpredictions)
print("Accuracy of Stacking model:",(accuracy_score(y_testEN, blender.predict(predictions_scale))))
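
The manual blending above can also be expressed with scikit-learn's `StackingClassifier`, which trains the final estimator on cross-validated predictions of the base learners instead of a single hold-out half. The sketch below is a rough equivalent on synthetic data. `GradientBoostingClassifier` stands in for XGBoost so that the example depends only on scikit-learn, and all `*_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=862)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=862)

stack = StackingClassifier(
    estimators=[
        # The SVM gets its own scaler; tree-based boosting does not need one.
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=862))),
        ("gb", GradientBoostingClassifier(n_estimators=30, random_state=862)),
    ],
    final_estimator=KNeighborsClassifier(),  # plays the role of the blender
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_accuracy = stack.score(X_te, y_te)
print("Accuracy of StackingClassifier sketch:", stack_accuracy)
```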

## Part 5: Results

In [None]:
accuracy_scores = [np.mean(grid.predict(X_test) == y_test),
                   accuracy_score(y_test, RFC_tuned.predict(X_test)),
                   accuracy_score(y_testEN, blender.predict(predictions_scale))]
clfs = ['Decision Tree','Random Forest','Ensemble Stacking']

In [None]:
# Put the results into a DataFrame
comparison = pd.DataFrame(clfs, columns = ['Classifiers'])
comparison['Accuracy Scores'] = accuracy_scores

In [None]:
# Accuracies of all three predictive models
comparison

**Why are the accuracies of the Decision Tree and Ensemble Stacking the same?**
* One reason is the proportion of the majority class. Our majority class, visitors who did not purchase, makes up around 85% of the data, so it is possible that some classifiers simply predict the majority class almost all the time.


* This would be visible in precision and recall. The true positive rate (TPR) is the recall: the share of actual purchasers that the model correctly identifies. The Decision Tree confusion matrix above shows a TPR of only 7.6%, which is very low. This suggests that most of our models predict "no purchase" almost constantly and rarely identify the visitors who do purchase.


* Despite this flaw, we were still able to draw several insights, which lead to the recommendations below.


*Source: https://datascience.stackexchange.com/questions/101881/all-machine-learning-models-are-giving-the-same-accuracy*
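
To make the majority-class effect concrete, the sketch below scores a "model" that always predicts no purchase on made-up labels with an 85/15 split. The labels are hypothetical and only mirror the approximate class balance mentioned above. Accuracy looks respectable while recall is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 85 non-purchases (0) and 15 purchases (1).
y_true = np.array([0] * 85 + [1] * 15)

# A "model" that always predicts the majority class (no purchase).
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_recall = recall_score(y_true, y_majority)  # TP / (TP + FN)

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_majority).ravel()
print(baseline_accuracy, baseline_recall, (tn, fp, fn, tp))
```

Any classifier whose accuracy sits near this baseline should therefore also be judged on recall and precision for the purchase class.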

**Conclusions and Recommendations**: 
* First, the marketing team should focus on webpage improvement. New visitors have a higher purchase rate than returning visitors, so the team should make the webpage more attractive and inviting to create a very good first impression.


* The Decision Tree Classifier visualization suggests that the most important features for predicting whether a visitor purchases are the time the visitor spends on the webpage, the bounce rate, and whether the visit takes place in November. The branches and nodes of the decision tree above help us identify our ideal visitor segment. According to the tree, the visitors who purchased in the past are those who visited in November, had a total ProductRelated_Duration of 4384 or more, and had a bounce rate under 0.011.


* Lastly, the EDA showed that November has a very high purchase rate, and the feature importances reinforce that. Our advice is that the marketing team should use November for as many promotions and advertising campaigns as possible to encourage people to buy.


**References:**
* https://towardsdatascience.com/are-your-models-using-the-correct-significance-levels-c88367ee0544
* https://github.com/krishnaik06/T-test-an-Correlation-using-python/blob/master/Hypothesis_Testing.ipynb
* https://towardsdatascience.com/how-to-know-which-statistical-test-to-use-for-hypothesis-testing-744c91685a5d
* https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/
* https://support.google.com/analytics/answer/2695658?hl=en#:~:text=Page%20Value%20is%20the%20average,more%20to%20your%20site's%20revenue
* https://support.google.com/analytics/answer/2525491?hl=en#:~:text=For%20all%20pageviews%20to%20the,that%20start%20with%20that%20page