# E-commerce Customer Purchase Intentions

## Overview

The `online_shoppers_intention.csv` dataset includes 12,330 sessions of online traffic to an unnamed website over the period of a year. The `Revenue` column contains a True or False value indicating whether the visitor made a purchase during the session. This serves as the target variable for the classification problem. Using the other columns provided, we can build a classification model that predicts whether a site visitor will make a purchase.

The scoring metrics will be F1 Score and Recall because the classes are imbalanced and false negatives are costly.

## Primary Business Problem
How can the company increase purchase rate among customers who visit the website?

## Model Data Import

In [5]:
# import libraries
import pandas as pd
pd.options.display.max_columns = 50
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn
from sklearn.preprocessing import StandardScaler, binarize
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# add import for Decision Trees
from sklearn.feature_selection import RFECV, SelectKBest, f_regression
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, plot_confusion_matrix, roc_curve, auc, classification_report
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV
import pickle

In [6]:
# loading in cleaned dataset
model_df = pd.read_csv('model_data.csv', index_col = 0)

In [7]:
model_df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,2,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,2,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,2,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,2,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,2,3,3,1,4,Returning_Visitor,True,False


### Creating Dummy Columns from Categorical Variables

In [8]:
model_df = pd.get_dummies(model_df, columns=['SpecialDay', 'Month', 'OperatingSystems','Browser','Region','TrafficType','VisitorType','Weekend'])

In [9]:
model_df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,Revenue,SpecialDay_0.0,SpecialDay_0.2,SpecialDay_0.4,SpecialDay_0.6,SpecialDay_0.8,SpecialDay_1.0,Month_2,Month_3,Month_5,Month_6,Month_7,Month_8,Month_9,Month_10,Month_11,...,TrafficType_1,TrafficType_2,TrafficType_3,TrafficType_4,TrafficType_5,TrafficType_6,TrafficType_7,TrafficType_8,TrafficType_9,TrafficType_10,TrafficType_11,TrafficType_12,TrafficType_13,TrafficType_14,TrafficType_15,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Weekend_False,Weekend_True
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
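
To make the effect of `pd.get_dummies` easier to see, here is a minimal sketch on a three-row toy frame. The values are hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "PageValues": [0.0, 12.5, 0.0],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "Weekend": [False, True, False],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed (PageValues) pass through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["VisitorType", "Weekend"])
print(toy_dummies.columns.tolist())
```

Every category gets its own column, so each encoded variable adds one redundant column. Passing `drop_first=True` is one way to avoid that if a linear model is used later.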


## 1. Train-Test Split

In [10]:
# Split data to be used in the models
X = model_df.drop('Revenue', axis = 1)

# Create target variable
y = model_df['Revenue'] # y is the column we're trying to predict

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=11)
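
Because purchases are the minority class, a plain random split can leave the train and test sets with slightly different purchase rates. One option, not used above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with a hypothetical 15% positive rate rather than the real `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 15% positives (assumed rate).
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([True] * 150 + [False] * 850)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=11, stratify=y_demo
)

print("overall:", y_demo.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```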

## 2. Creating Baseline Models
KNN, Logistic Regression, Decision Tree or Random Forest?

### Baseline Decision Tree
Creating a Decision Tree model with default hyperparameters (no tuning)

In [12]:
baseline_tree = DecisionTreeClassifier()
baseline_tree = baseline_tree.fit(X_train, y_train)

In [13]:
# Predict on the training set
tree_y_pred_train = baseline_tree.predict(X_train)

# Predict on the test set
tree_y_pred_test = baseline_tree.predict(X_test)

In [16]:
print('Baseline Decision Tree Model')
print("Training F1 Score: ", metrics.f1_score(y_train, tree_y_pred_train))
print("Training Recall: ", metrics.recall_score(y_train, tree_y_pred_train))

print("Testing F1 Score: ", metrics.f1_score(y_test, tree_y_pred_test))
print("Testing Recall: ", metrics.recall_score(y_test, tree_y_pred_test))

Baseline Decision Tree Model
Training F1 Score:  1.0
Training Recall:  1.0
Testing F1 Score:  0.5813008130081301
Testing Recall:  0.5801217038539553
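
Perfect training scores next to a test F1 of about 0.58 suggest the unrestricted tree is memorizing the training data. A common way to probe this is to limit `max_depth` and compare training F1 with cross-validated F1. The sketch below does that on synthetic data from `make_classification` (an assumption for illustration, not the notebook's data), so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15],
    flip_y=0.05, random_state=11,
)

results = {}
for depth in (None, 3, 5):  # None reproduces an unrestricted tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=11)
    # Mean F1 across 5 folds: each fold is scored on rows the tree never saw.
    cv_f1 = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="f1").mean()
    # F1 on the same rows the tree was fit on.
    train_f1 = f1_score(y_demo, tree.fit(X_demo, y_demo).predict(X_demo))
    results[depth] = {"train_f1": train_f1, "cv_f1": cv_f1}
    print(f"max_depth={depth}: train F1={train_f1:.3f}, CV F1={cv_f1:.3f}")
```

A large gap between the two columns for `max_depth=None` is the same pattern seen in the baseline output above.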


## 3. Feature Engineering

## Evaluation Metric(s)
Before starting the final modeling process, we must decide on the evaluation metrics. The main metrics we will focus on are F1 Score and Recall.

This is because:
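- F1 Score is useful when there is a large class imbalance, because it balances precision and recall instead of rewarding a model for simply predicting the majority class. In this business context, it's important because...?
- Recall matters in this context because false negatives are costly: each one is a session that ended in a purchase but that the model failed to flag. A high true positive rate means the model captures most of the visitors who actually purchased the product.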
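
As a small worked example of these two metrics, the sketch below computes precision, recall, and F1 by hand from a confusion matrix and checks them against scikit-learn. The ten labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Hypothetical labels: 4 actual purchasers (1) and 6 non-purchasers (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted purchasers who really purchased
recall = tp / (tp + fn)      # share of real purchasers the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("sklearn:", recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here the model finds 2 of the 4 purchasers (recall 0.5) while being right on 2 of its 3 positive predictions, which gives an F1 of 4/7.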

### Scaling data before making 'Real Models'

In [14]:
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)
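
The scaler is fit on the training set only and then applied to both sets, so no information from the test rows leaks into the preprocessing. The sketch below illustrates the consequence on random data (hypothetical values): the scaled training columns have mean 0 and standard deviation 1 exactly, while the test columns are only close to that.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data with a non-zero mean and non-unit spread.
rng = np.random.default_rng(11)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Learn the column means and standard deviations from the training rows only.
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)  # reuses the training statistics

print("train:", train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print("test: ", test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```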

## 3. Creating 'Real' Models

### at least 2 iterations are required
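
One possible way to structure an iteration is a grid search scored on F1, with the scaler and the model wrapped in a pipeline so that scaling is refit inside each cross-validation fold. This is only a sketch under assumptions: the data is synthetic, and the choice of logistic regression and the grid of `C` values are placeholders, not decisions made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=11, stratify=y_demo
)

# make_pipeline names each step after its lowercased class name,
# hence the "logisticregression__C" key in the grid.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # placeholder grid

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best CV F1: ", round(search.best_score_, 3))
```

A second iteration could swap in another estimator from the candidate list or change `scoring` to `"recall"`.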

## 4. Feature Selection
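
As a starting point, here is a hedged sketch of univariate selection with `SelectKBest`. It uses `f_classif`, the classification counterpart of the imported `f_regression`, because the target here is a class label. The data is synthetic and `k=3` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with 3 informative features out of 10 (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=11,
)

# Score each feature with an ANOVA F-test against the class label, keep the top k.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

kept = selector.get_support(indices=True)  # column positions that were kept
print("shape after selection:", X_selected.shape)
print("kept column indices:  ", kept)
```

`RFECV`, which is also imported above, is an alternative that selects features by repeatedly refitting a model rather than scoring each feature on its own.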

## 5. Evaluating all the different models
(Scoring all the different models)
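
A simple way to score several models side by side is to loop over them and collect test F1 and recall in one table. The sketch below does this for the four candidate model types named earlier, with default or near-default settings and synthetic data. These are assumptions for illustration, so the ranking it prints does not carry over to the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=11, stratify=y_demo
)

candidates = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=11),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=11),
}

rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "test_f1": f1_score(y_te, y_hat),
        "test_recall": recall_score(y_te, y_hat),
    })

# One row per model, best test F1 first.
scores_df = pd.DataFrame(rows).set_index("model").sort_values("test_f1", ascending=False)
print(scores_df)
```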

## MVP Analysis
- Must decide which metrics to use to measure this

## Final Model 

## 6. Refitting the final model on the full dataset
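
Once a final model is chosen, the refit step could look roughly like the sketch below: fit on every available row, then serialize with `pickle` (already imported above). The depth-limited decision tree is a hypothetical placeholder, since the final model has not been selected yet, and the data is synthetic.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the full dataset (illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=11)

# Hypothetical "final" model; the real choice and hyperparameters are still open.
final_model = DecisionTreeClassifier(max_depth=5, random_state=11)
final_model.fit(X_demo, y_demo)  # fit on all rows, with no hold-out set

# Round-trip through pickle in memory to confirm the model survives serialization.
payload = pickle.dumps(final_model)
restored = pickle.loads(payload)

print((restored.predict(X_demo) == final_model.predict(X_demo)).all())
```

To persist the model to disk, `pickle.dump(final_model, f)` on a file opened in `"wb"` mode does the same thing. If the final model expects scaled inputs, the scaler would need to be refit on the full dataset and saved alongside it.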