# Customer Propensity Analysis using Python #


LAXMAN PATEL - B.ARCH-25

This notebook performs Customer Propensity Analysis using several Machine Learning models, including Naive Bayes and Logistic Regression. The main objective is to see which model gives the best result and to find the best prediction model for understanding customer behaviour.  
The dataset comes from an online portal. It gives each user's demographic information, their transaction history on the portal, the price of the item they viewed, and whether the user ultimately purchased it.

We have data on User_Gender, Marital_status, the price of the item the user viewed, the number of items the user has purchased from the website, the total value of the user's earlier transactions on the portal, the user's income, and lastly whether the user purchased the current item. There are 11157 records. We need to develop a predictive model that, given the above information for any new user (except whether they purchased), predicts whether the user will purchase or not.
We need to perform the following activities in order to solve the problem.

### 1. Importing Data
### 2. Cleaning, Preparing and Manipulating Data
### 3. Training and Testing Different Machine Learning Models
### 4. Improving the Models

# 1. Importing the Data #
First we need to import the data. Before that we need to load some useful libraries.
We will use the 'sklearn' library for the machine learning models. The same library lets us split a dataset into train and test sets using the 'train_test_split' function of its 'model_selection' module.
We will need the matplotlib library for plots.
We will need pandas, as it is the basic package for importing a dataset and performing other data-related manipulation.
We will use the NumPy package for arrays.
We will use the seaborn package for heat-maps.
Later we will load LogisticRegression from sklearn to fit the Logistic Regression model, DecisionTreeClassifier to perform Decision Tree classification, and SVC to perform Support Vector Machine classification.

In [2]:
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns

We import the dataset using the 'read_csv' function of the pandas library.

In [3]:
product = pd.read_csv('PURCHASE.csv')

We can look at the first few rows of the loaded data using the 'head' function.

In [4]:
product.head()


Unnamed: 0,USER_GENDER,MARITAL,PRICE,NO_ITEMS,PUR_VALUE,USER_INCOME,PURCHASE
0,1,M,25000,0,0,39171,0
1,1,U,20000,2,21866,249,1
2,0,U,30000,1,16090,1249,0
3,0,U,15000,0,0,7247,1
4,1,U,28000,2,26888,33314,1


If we can see the above data, it means the dataset has been imported correctly. Next we need to move to the next part of our journey, i.e., 'Clean, Prepare and Manipulate the Data'.

# 2. Cleaning, Preparing and Manipulating the Data #

To get some insight into the dataset, we can get a quick summary using the 'info()' and 'describe()' functions.

In [5]:
product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11157 entries, 0 to 11156
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   USER_GENDER  11157 non-null  int64 
 1   MARITAL      11157 non-null  object
 2   PRICE        11157 non-null  int64 
 3   NO_ITEMS     11157 non-null  int64 
 4   PUR_VALUE    11157 non-null  int64 
 5   USER_INCOME  11157 non-null  int64 
 6   PURCHASE     11157 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 610.3+ KB


In [None]:
product.describe()

Unnamed: 0,USER_GENDER,PRICE,NO_ITEMS,PUR_VALUE,USER_INCOME,PURCHASE
count,11157.0,11157.0,11157.0,11157.0,11157.0,11157.0
mean,0.499149,18241.193869,0.666487,6611.297392,11119.78937,0.331272
std,0.500022,6915.991488,0.659351,7581.662231,13055.545027,0.470692
min,0.0,9000.0,0.0,0.0,0.0,0.0
25%,0.0,13000.0,0.0,0.0,561.0,0.0
50%,0.0,16000.0,1.0,3505.0,5724.0,0.0
75%,1.0,24000.0,1.0,12613.0,18018.0,1.0
max,1.0,32000.0,2.0,29943.0,49975.0,1.0


We have seen that marital status is given as 'M' (married) and 'U' (unmarried). We need to change these to 1 and 0 before proceeding. We make the change and then look at the updated dataset.  

In [7]:
# Encode marital status as numbers: married -> 1, unmarried -> 0
product['MARITAL'] = product['MARITAL'].map({'M': 1, 'U': 0})
product.head()

Unnamed: 0,USER_GENDER,MARITAL,PRICE,NO_ITEMS,PUR_VALUE,USER_INCOME,PURCHASE
0,1,1,25000,0,0,39171,0
1,1,0,20000,2,21866,249,1
2,0,0,30000,1,16090,1249,0
3,0,0,15000,0,0,7247,1
4,1,0,28000,2,26888,33314,1


We see that 'M' and 'U' in the marital status column have been converted to 1 and 0.
Now we need to check whether there are any missing values in the data.
For that we use the following code.

In [None]:
product.isnull().sum(axis = 0)

USER_GENDER    0
MARITAL        0
PRICE          0
NO_ITEMS       0
PUR_VALUE      0
USER_INCOME    0
PURCHASE       0
dtype: int64

We find that there are no missing values in the data, so no imputation is needed. We can use the dataset for further processing.

# 3. Training and Testing Different Machine Learning Models #
We have some idea about the dataset by now.
Now it is time to build the models.
Before building a model, we need to create 2 arrays X and Y, where X holds the independent variables and Y holds the dependent variable (PURCHASE).

In [8]:
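X = product[['USER_GENDER', 'MARITAL', 'PRICE',
             'NO_ITEMS', 'PUR_VALUE', 'USER_INCOME']].to_numpy()
# Select PURCHASE as a 1-D array; a 2-D column vector makes sklearn
# emit a DataConversionWarning when fitting
Y = product['PURCHASE'].to_numpy()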

In [None]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# #option 1
# scaler.fit(X)
# X = scaler.transform(X)

To develop a machine learning model, we need a test dataset on which to check how accurate the model is. If we use the whole dataset for developing the model, there is no data left for testing. Hence, it is the norm to split the dataset into a train set and a test set. We develop the model using the train set and measure its accuracy on the test set.
Hence, we will create 4 arrays X_train, X_test, Y_train and Y_test for developing our model.

In [9]:
from sklearn import model_selection

X_train, X_test, Y_train, Y_test = model_selection.\
train_test_split(X, Y, test_size=0.2)
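
The Naive Bayes fit itself is not shown in this notebook. As a minimal sketch of how such a step could look, the block below assumes scikit-learn's `GaussianNB` (the notebook does not say which Naive Bayes variant was used) and a small synthetic dataset in place of PURCHASE.csv. The names `X_demo`, `y_demo` and `nb` are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X and Y (hypothetical data, not PURCHASE.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
noise = rng.normal(scale=0.5, size=500)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Same 80/20 split as in the notebook, with a fixed seed for repeatability
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Train Gaussian Naive Bayes and score it on the held-out set
nb = GaussianNB()
nb.fit(X_tr, y_tr)
nb_pred = nb.predict(X_te)
nb_accuracy = accuracy_score(y_te, nb_pred) * 100
print("Naive Bayes Accuracy Score =", nb_accuracy)
```

On the real data, `X_train`, `X_test`, `Y_train` and `Y_test` would take the place of the synthetic arrays.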

We got 73.16% accuracy from the Naive Bayes model. Now let us check some other models.
Next we will check the Logistic Regression model.

#### Logistic Regression Model

In [10]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train the model on the training set
log_reg = LogisticRegression()
log_reg.fit(X_train, Y_train)

# Test the model. Option 1: compare the predictions with the true labels
y_pred = log_reg.predict(X_test)
print("Logistic Regression Accuracy Score =", accuracy_score(Y_test, y_pred)*100)

# Option 2: let the model compute the accuracy directly
#print("Logistic Regression Accuracy Score =",log_reg.score(X_test,Y_test)*100)

Logistic Regression Accuracy Score = 68.36917562724014




#### Decision Tree Model

In [11]:
from sklearn.tree import DecisionTreeClassifier

# Create the Decision Tree classifier; entropy as split criterion, depth limited to 3
dtclf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train the Decision Tree classifier
dtclf = dtclf.fit(X_train,Y_train)
# Predict the response for the test dataset
y_pred = dtclf.predict(X_test)
print("Decision Tree Accuracy Score =", accuracy_score(Y_test, y_pred)*100)

Decision Tree Accuracy Score = 78.00179211469535


#### Support Vector Machine Model

In [12]:
from sklearn.svm import SVC # "Support vector classifier"

# Polynomial kernel; this run was interrupted before it finished (see the output below)
svmc = SVC(kernel='poly')
svmc.fit(X_train,Y_train)
y_pred = svmc.predict(X_test)
print("SVM Accuracy Score =", accuracy_score(Y_test, y_pred)*100)



KeyboardInterrupt: ignored

In [13]:
from sklearn.svm import SVC # "Support vector classifier"

# RBF kernel; probability=True also enables predict_proba, at extra training cost
svmc = SVC(kernel='rbf', probability = True)
svmc.fit(X_train,Y_train)
y_pred = svmc.predict(X_test)
print("SVM Accuracy Score =", accuracy_score(Y_test, y_pred)*100)



SVM Accuracy Score = 70.96774193548387


The logistic regression model reached an accuracy of only about 68%, which is low.
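
To put these accuracy figures in context: the summary statistics show that about 33% of the records have PURCHASE = 1, so a model that always predicts "no purchase" would already be right about 67% of the time. The following is a minimal sketch of such a baseline, using made-up labels with the same class balance rather than the real data. `DummyClassifier` and `confusion_matrix` are scikit-learn functions; the arrays `y_true` and `X_dummy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels with about one third positives, like PURCHASE
y_true = np.array([0] * 67 + [1] * 33)
X_dummy = np.zeros((100, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base) * 100
# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_base)
print("Baseline accuracy =", baseline_accuracy)
print(cm)
```

The confusion matrix makes the weakness visible: the baseline never predicts a purchase, so every actual purchaser is missed even though the accuracy looks acceptable. Comparing each model's confusion matrix against this baseline is more informative than accuracy alone.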


Before building a better model, we first need to check whether there is any multicollinearity among the independent variables. Multicollinearity inflates the variance of the estimated coefficients, which can make a model such as logistic regression unstable and may reduce its accuracy. So we need to check the correlation matrix of the independent variables.
For that we will first create a new dataframe 'product1' containing only the independent variables.

From the correlation matrix, we find that Number of items and Purchase value are highly correlated. Let us create a heat-map, which gives a better picture of the multicollinearity.
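
The correlation cell is not shown in this notebook. A minimal sketch of the check could look as follows, assuming a small synthetic `product1` in which the purchase value grows with the number of items. The values are made up and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the independent variables (not PURCHASE.csv)
rng = np.random.default_rng(0)
no_items = rng.integers(0, 3, size=200)
product1 = pd.DataFrame({
    "NO_ITEMS": no_items,
    # Purchase value rises with the number of items, so the two correlate
    "PUR_VALUE": no_items * 10000 + rng.integers(0, 3000, size=200),
    "USER_INCOME": rng.integers(0, 50000, size=200),
})

# Pairwise Pearson correlation between the independent variables
corr = product1.corr()
print(corr.round(2))

# Optional heat-map, assuming seaborn (sns) and matplotlib (plt) are imported:
# sns.heatmap(corr, annot=True, cmap="coolwarm")
# plt.show()
```

On the real data, `product1` would be built from `product` by selecting the six independent columns.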

We will now see the impact of feature scaling in the model.
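
One way to test the effect of feature scaling is to put a `StandardScaler` and the model into a scikit-learn `Pipeline`, so that the scaler is fitted only on the training folds during cross-validation. The sketch below uses synthetic data with one unit-scale feature and one price-like feature. The data and the names `X_demo`, `y_demo` and `scaled_log_reg` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
n = 400
small = rng.normal(size=n)                 # unit-scale feature
large = rng.normal(scale=10000.0, size=n)  # price-like feature
X_demo = np.column_stack([small, large])
noise = rng.normal(scale=0.5, size=n)
y_demo = (small + large / 10000.0 + noise > 0).astype(int)

# Scaling and model in one estimator; the scaler is refitted in every fold
scaled_log_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression()),
])

scores_scaled = cross_val_score(scaled_log_reg, X_demo, y_demo, cv=5)
scores_raw = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print("With scaling:   ", scores_scaled.mean())
print("Without scaling:", scores_raw.mean())
```

Whether scaling helps on the real dataset has to be checked by running the same comparison on `X_train` and `Y_train`.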

**Cross Validation**








In [14]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate

In [15]:
scores_dt = cross_val_score(dtclf, X_train, Y_train, cv=5) #5 fold cross-validation
scores_svm = cross_val_score(svmc, X_train, Y_train, cv=5) #5 fold cross-validation
scores_log_reg= cross_val_score(log_reg, X_train, Y_train, cv=5) #5 fold cross-validation





In [16]:
print(scores_dt)
print(scores_svm)
print(scores_log_reg)

[0.77478992 0.77030812 0.78095238 0.77086835 0.7697479 ]
[0.70588235 0.70028011 0.70644258 0.70084034 0.69971989]
[0.68011204 0.65378151 0.6627451  0.68011204 0.67619048]


In [17]:
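# For each model: mean accuracy over the 5 folds, and two standard
# deviations of the fold scores as a rough measure of spread
avg_score_dt = scores_dt.mean()
sqr_std_dt = scores_dt.std()*2

avg_score_svm = scores_svm.mean()
sqr_std_svm = scores_svm.std()*2

avg_score_log_reg = scores_log_reg.mean()
sqr_std_log_reg = scores_log_reg.std()*2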


print('Expected accuracy for Decision Tree: %.2f (+/- %.2f)' %(avg_score_dt, sqr_std_dt))
print('Range for accuracy for Decision Tree: %.2f - %.2f' %(avg_score_dt - sqr_std_dt, avg_score_dt + sqr_std_dt))

print('Expected accuracy for Support Vector Machine: %.2f (+/- %.2f)' %(avg_score_svm, sqr_std_svm))
print('Range for accuracy for Support Vector Machine: %.2f - %.2f' %(avg_score_svm - sqr_std_svm, avg_score_svm + sqr_std_svm))

print('Expected accuracy for Log Reg: %.2f (+/- %.2f)' %(avg_score_log_reg, sqr_std_log_reg))
print('Range for accuracy for Log Reg: %.2f - %.2f' %(avg_score_log_reg - sqr_std_log_reg, avg_score_log_reg + sqr_std_log_reg))





Expected accuracy for Decision Tree: 0.77 (+/- 0.01)
Range for accuracy for Decision Tree: 0.76 - 0.78
Expected accuracy for Support Vector Machine: 0.70 (+/- 0.01)
Range for accuracy for Support Vector Machine: 0.70 - 0.71
Expected accuracy for Log Reg: 0.67 (+/- 0.02)
Range for accuracy for Log Reg: 0.65 - 0.69


**Hyperparameter Tuning to Improve the accuracy for Decision Tree Model**

In [18]:
params = {'max_depth':list(range(2,10)), 'criterion':['gini','entropy'], 'splitter':['best','random']} # specifying the range for max_depths to create multiple trees
params

{'max_depth': [2, 3, 4, 5, 6, 7, 8, 9],
 'criterion': ['gini', 'entropy'],
 'splitter': ['best', 'random']}

In [19]:
# Grid search with 5-fold cross-validation on the training data
gs = GridSearchCV(estimator=dtclf, param_grid=params, scoring='accuracy', cv=5).fit(X_train, Y_train)
gs.best_params_, gs.best_score_

({'criterion': 'entropy', 'max_depth': 5, 'splitter': 'random'},
 0.7735574229691877)
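
The best score above is a cross-validated score on the training data. To judge the tuned model, it also needs to be scored on the held-out test set. The sketch below shows this step on synthetic data; the names `demo_gs`, `demo_params` and `best_tree` are illustrative, and the grid is smaller than the one used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in place of the purchase dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 6))
y_demo = ((X_demo[:, 0] > 0) & (X_demo[:, 1] > -0.5)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Small grid; random_state keeps the trees reproducible
demo_params = {"max_depth": list(range(2, 6)),
               "criterion": ["gini", "entropy"]}
demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       demo_params, scoring="accuracy", cv=5)
demo_gs.fit(X_tr, y_tr)

# With the default refit=True, the best model is retrained on all training data
best_tree = demo_gs.best_estimator_
test_accuracy = demo_gs.score(X_te, y_te) * 100
print(demo_gs.best_params_)
print("Tuned Decision Tree test accuracy =", test_accuracy)
```

Because `splitter='random'` is part of the original grid and no `random_state` is set there, the selected parameters may change from run to run.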

In [20]:
# To see the accuracy scores for the other parameter combinations, use cv_results_
df_cv_results = pd.DataFrame(gs.cv_results_)
df_cv_results.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_splitter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.007583,0.000374,0.001695,0.000246,gini,2,best,"{'criterion': 'gini', 'max_depth': 2, 'splitte...",0.77479,0.770868,0.780952,0.770868,0.769748,0.773445,0.004126,2
1,0.003253,0.000167,0.001254,4.7e-05,gini,2,random,"{'criterion': 'gini', 'max_depth': 2, 'splitte...",0.738375,0.770868,0.780952,0.770868,0.769748,0.766162,0.014478,30
2,0.010104,0.000468,0.001881,0.000404,gini,3,best,"{'criterion': 'gini', 'max_depth': 3, 'splitte...",0.77479,0.770308,0.780952,0.770868,0.769748,0.773333,0.004201,7
3,0.003674,9.7e-05,0.001318,1.4e-05,gini,3,random,"{'criterion': 'gini', 'max_depth': 3, 'splitte...",0.77479,0.770868,0.780952,0.770868,0.7507,0.769636,0.010161,23
4,0.011697,0.00012,0.001631,5.1e-05,gini,4,best,"{'criterion': 'gini', 'max_depth': 4, 'splitte...",0.773669,0.770868,0.780952,0.769748,0.769748,0.772997,0.004228,12

