<a href="https://colab.research.google.com/github/mihnguyen/udemy-customer-analytics/blob/main/purchase-prediction/notebook_Purchase_Prediction_Pt2_Predictive_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part 2, a continuation of the [[Udemy CAiP] Purchase Prediction Pt1 - Descriptive Analysis notebook](https://colab.research.google.com/drive/1Acmi4BOuO1V7LCxbnMCVzu46vFSpftA6#scrollTo=97I0svM77BjK&uniqifier=1).
Part 1 covered descriptive analysis; part 2 covers predictive modeling.

###Import Libraries

In [7]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import pickle

from sklearn.linear_model import LogisticRegression

###Import Purchase Data

In [3]:
from google.colab import files
uploaded = files.upload()

Saving purchase data.csv to purchase data (2).csv


In [8]:
import io
df_purchase = pd.read_csv(io.BytesIO(uploaded['purchase data.csv']))

###Import Pickle Objects

In [14]:
# Mount Google Drive, where the pickled objects are stored
from google.colab import drive
drive.mount('/content/drive')

DATA_PATH = "/content/drive/My Drive/Colab Notebooks"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
# Load the fitted PCA object; the with-block closes the file afterwards
with open(DATA_PATH + '/pca.pickle', 'rb') as infile:
    pca = pickle.load(infile)

pca

PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [16]:
# Load the fitted scaler; the with-block closes the file afterwards
with open(DATA_PATH + '/scaler.pickle', 'rb') as infile:
    scaler = pickle.load(infile)

scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [17]:
# Load the fitted K-means model; the with-block closes the file afterwards
with open(DATA_PATH + '/kmeans_pca.pickle', 'rb') as infile:
    kmeans_pca = pickle.load(infile)

kmeans_pca

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

In [18]:
# Standardize data
features = df_purchase[['Sex', 'Marital status', 'Age', 'Education', 'Income', 'Occupation', 'Settlement size']]
df_purchase_segm_std = scaler.transform(features)

# Apply PCA
df_purchase_segm_pca = pca.transform(df_purchase_segm_std)

# Segment data
purchase_segm_kmeans_pca = kmeans_pca.predict(df_purchase_segm_pca)

# Create copy of dataframe
df_purchase_predictors = df_purchase.copy()

# Add segment labels
df_purchase_predictors['Segment'] = purchase_segm_kmeans_pca
segment_dummies = pd.get_dummies(purchase_segm_kmeans_pca, prefix = 'Segment', prefix_sep = '_')
df_purchase_predictors = pd.concat([df_purchase_predictors, segment_dummies], axis = 1)

df_pa = df_purchase_predictors
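
One caveat with the concat step above, sketched on toy data: pd.get_dummies applied to a NumPy array returns a frame with a fresh 0..n-1 index, and pd.concat with axis = 1 aligns rows by index. That matches df_purchase here, since pd.read_csv assigns a default index, but it could misalign if the dataframe were filtered or re-indexed first. The names df_toy and labels below are hypothetical stand-ins, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a non-default index (not the course data).
df_toy = pd.DataFrame({'Incidence': [0, 1, 1]}, index=[10, 11, 12])

# Stand-in for the array returned by kmeans_pca.predict(...).
labels = np.array([2, 0, 2])

# Wrapping the labels in a Series that reuses the parent index keeps
# the dummy rows aligned with the original rows during concat.
dummies = pd.get_dummies(pd.Series(labels, index=df_toy.index),
                         prefix='Segment', prefix_sep='_')
df_out = pd.concat([df_toy, dummies], axis=1)

print(df_out)
```

With the default index used in this notebook the extra step changes nothing; it only matters once the index stops being 0..n-1.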

###Modeling Purchase Probability

Next, we estimate the probability that a customer makes a purchase on a given store visit, using logistic regression.

First, define the dependent and independent variables.

The dependent variable is whether a purchase happens on a visit or not (values are either 0 or 1).

For the independent variable, price is the natural first candidate, since it is expected to influence the purchase decision most directly. Here we use the mean price across all 5 brands:

In [19]:
Y = df_pa['Incidence']

In [20]:
X = pd.DataFrame()
X['Mean Price'] = (df_pa['Price_1'] +
                   df_pa['Price_2'] +
                   df_pa['Price_3'] +
                   df_pa['Price_4'] +
                   df_pa['Price_5'] ) / 5
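
The same row-wise average can also be written with DataFrame.mean, as sketched below on toy values. The frame df_toy and its numbers are made up for illustration. One difference to be aware of: mean(axis=1) skips missing values by default, whereas summing the columns and dividing by 5 propagates them.

```python
import pandas as pd

# Toy prices for illustration only (not rows from the purchase data).
df_toy = pd.DataFrame({
    'Price_1': [1.0, 1.0],
    'Price_2': [2.0, 1.0],
    'Price_3': [2.0, 2.0],
    'Price_4': [2.0, 2.0],
    'Price_5': [3.0, 2.0],
})

# Build the column list once, then average across columns for each row.
price_cols = [f'Price_{i}' for i in range(1, 6)]
mean_price = df_toy[price_cols].mean(axis=1)

print(mean_price)
```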

Now fit the model to predict purchase probability:

In [21]:
# 'sag' is a solver suited to large datasets; it converges fastest when features are on a similar scale
model_purchase = LogisticRegression(solver = 'sag')
model_purchase.fit(X, Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='sag', tol=0.0001, verbose=0,
                   warm_start=False)

Now check the relationship between mean price (X) and purchase probability (Y) using the coefficient of the independent variable (the coef_ attribute):

In [22]:
model_purchase.coef_

array([[-2.3489037]])

The coefficient for mean price is -2.35. The negative sign means that a higher price lowers the purchase probability, as expected. The coefficient is on the log-odds scale, though, so on its own it mainly tells us the direction of the relationship (negative, in this case); it does not say how much the probability itself changes at a given price.
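
As a rough illustration of what the coefficient does say: in a logistic regression, exponentiating the coefficient times a price change gives the factor by which the odds of purchase are multiplied. The sketch below uses the slope printed above; the 10-cent step is an arbitrary choice made for this example.

```python
import math

# Slope for Mean Price, as reported by model_purchase.coef_ above.
beta = -2.3489037

# Illustrative price change of 10 cents (an assumption for this example).
price_step = 0.10

# Multiplicative change in the odds of purchase for that price change.
odds_ratio = math.exp(beta * price_step)

print(f"Odds multiplier for a +{price_step:.2f} price change: {odds_ratio:.3f}")
```

The multiplier comes out at roughly 0.79, meaning the odds of purchase drop by about a fifth for each 10-cent increase. This is a statement about odds, not about probability, which is why a separate elasticity calculation is still needed.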

Next, let's quantify the magnitude of the negative relationship between mean price and purchase probability.

###Modeling Price Elasticity of Purchase Probability

The goal here is to see how purchase probability responds to different mean prices. To do that, we generate a range of mean price values. How should that range be chosen?

First, check the distributions of the 5 prices (Price_1 to Price_5) using the .describe() method:

In [24]:
df_pa[['Price_1', 'Price_2', 'Price_3', 'Price_4', 'Price_5']].describe()

| | Price_1 | Price_2 | Price_3 | Price_4 | Price_5 |
|---|---|---|---|---|---|
| count | 58693.0 | 58693.0 | 58693.0 | 58693.0 | 58693.0 |
| mean | 1.392074 | 1.780999 | 2.006789 | 2.159945 | 2.654798 |
| std | 0.091139 | 0.170868 | 0.046867 | 0.089825 | 0.098272 |
| min | 1.1 | 1.26 | 1.87 | 1.76 | 2.11 |
| 25% | 1.34 | 1.58 | 1.97 | 2.12 | 2.63 |
| 50% | 1.39 | 1.88 | 2.01 | 2.17 | 2.67 |
| 75% | 1.47 | 1.89 | 2.06 | 2.24 | 2.7 |
| max | 1.59 | 1.9 | 2.14 | 2.26 | 2.8 |


The minimum price is 1.10 (min Price_1), while the maximum price is 2.80 (max Price_5).

A reasonable range for the mean price is therefore, say, 0.50 to 3.50, which comfortably covers the observed prices. Now let's generate values from 0.5 up to (but not including) 3.5 in steps of 0.01, so that each step is a price increase of 1 cent:

In [25]:
price_range = np.arange(0.5, 3.5, 0.01)
price_range

array([0.5 , 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ,
       0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71,
       0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81, 0.82,
       0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91, 0.92, 0.93,
       0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  , 1.01, 1.02, 1.03, 1.04,
       1.05, 1.06, 1.07, 1.08, 1.09, 1.1 , 1.11, 1.12, 1.13, 1.14, 1.15,
       1.16, 1.17, 1.18, 1.19, 1.2 , 1.21, 1.22, 1.23, 1.24, 1.25, 1.26,
       1.27, 1.28, 1.29, 1.3 , 1.31, 1.32, 1.33, 1.34, 1.35, 1.36, 1.37,
       1.38, 1.39, 1.4 , 1.41, 1.42, 1.43, 1.44, 1.45, 1.46, 1.47, 1.48,
       1.49, 1.5 , 1.51, 1.52, 1.53, 1.54, 1.55, 1.56, 1.57, 1.58, 1.59,
       1.6 , 1.61, 1.62, 1.63, 1.64, 1.65, 1.66, 1.67, 1.68, 1.69, 1.7 ,
       1.71, 1.72, 1.73, 1.74, 1.75, 1.76, 1.77, 1.78, 1.79, 1.8 , 1.81,
       1.82, 1.83, 1.84, 1.85, 1.86, 1.87, 1.88, 1.89, 1.9 , 1.91, 1.92,
       1.93, 1.94, 1.95, 1.96, 1.97, 1.98, 1.99, 2.
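
A side note on the grid: np.arange treats the stop value as exclusive, and the NumPy documentation suggests np.linspace when the step is not an integer. The sketch below shows one alternative that includes 3.5; the variable names are chosen for this example only.

```python
import numpy as np

# Half-open grid, as in the cell above: np.arange treats the stop as exclusive.
grid_arange = np.arange(0.5, 3.5, 0.01)

# Closed grid: 301 points give the same 0.01 spacing and end exactly at 3.5.
grid_linspace = np.linspace(0.5, 3.5, 301)

print(grid_linspace[0], grid_linspace[-1], len(grid_linspace))
```

Either grid works for the analysis that follows; the choice only affects whether the upper bound itself is evaluated.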

Now store this range in a DataFrame called df_price_range:

In [26]:
df_price_range = pd.DataFrame(price_range)
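
One way this grid could feed an elasticity calculation is sketched below. For a logistic model with a single price feature, the point elasticity of purchase probability with respect to price is the slope times the price times (1 - purchase probability). The slope is the value printed earlier; the intercept is a hypothetical placeholder, because model_purchase.intercept_ is not shown in this notebook, so the numbers produced are illustrative only.

```python
import numpy as np

# Slope for Mean Price, as reported by model_purchase.coef_ above.
beta = -2.3489037

# HYPOTHETICAL intercept, used only so the sketch runs on its own.
# In practice, read it from model_purchase.intercept_.
intercept = 3.9

# Same price grid as price_range above.
prices = np.arange(0.5, 3.5, 0.01)

# Logistic purchase probability at each price point.
purchase_prob = 1.0 / (1.0 + np.exp(-(intercept + beta * prices)))

# Point elasticity of purchase probability with respect to price.
elasticity = beta * prices * (1.0 - purchase_prob)

print(elasticity[:5])
```

With the fitted model available, the probabilities would more likely come from model_purchase.predict_proba applied to df_price_range, taking the second column, rather than from a hand-written logistic function.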