#  Classifier Evaluation

**Write and execute R code in the code cells per the instructions.  The expected results are provided for you directly following the code cells.**

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Business Model

Our company has 1000 prospective customers, of whom we expect 50 to buy our product, but we are not sure which customers those are.  The cost to meet with a prospective customer is \$4,500.  The revenue from a customer that buys our product is \$100,000.

In [2]:
prospects = 1000
buyers = 50
passers = prospects - buyers
cost = 4500
revenue = 100000

profit.baseline = buyers*revenue - prospects*cost
data.frame(profit.baseline)

profit.baseline
500000
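
To put the baseline in context, here is a minimal sketch that compares three reference strategies under the business model above: meeting nobody, meeting everyone (the baseline), and meeting exactly the 50 buyers, which would require a perfect classifier. The helper `profit()` and the data frame `bounds` are hypothetical names introduced only for this illustration.

```r
# Business parameters as stated in the notebook
prospects = 1000
buyers = 50
cost = 4500
revenue = 100000

# Hypothetical helper: profit when we meet `met` prospects, of whom `met.buyers` buy
profit = function(met, met.buyers) met.buyers*revenue - met*cost

bounds = data.frame(meet.nobody      = profit(0, 0),
                    meet.everyone    = profit(prospects, buyers),
                    meet.only.buyers = profit(buyers, buyers))
bounds
```

The last figure is the ceiling that any model-based strategy can approach, since it pays for no wasted meetings and misses no buyers.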


## Business Decision

Which prospects should we meet so that we maximize profit?

The approach will be to build a model to predict which prospects will buy based on their market research scores, and then meet only with those prospects.

## Data

Here is some data about past customers.  Each customer is associated with two scores, x1 and x2, that were measured by a market research company.  Also, each customer is known to have either bought our company's product or passed on an opportunity to buy our company's product.

In [3]:
data = data.frame(x1=c(1,2,3,4,3,2,5,4,3,2,5,3,3,2,3,1,1,5,4,1,5,1,0,0,1,2,2,5,1,3,1,2,3,4,5,6,
                       1,3,3,6,3,2,5,4,3,4,5,3,3,2,3,1,2,5,4,1,5,1,1,1,1,2,2,5,1,3,1,2,3,4,5,6),
                  x2=c(3,2,6,5,4,5,3,8,9,0,0,9,7,4,5,5,4,5,6,3,2,4,3,5,4,6,5,1,2,3,4,5,4,3,4,8,
                       3,2,6,5,4,5,3,8,5,5,0,9,7,4,5,5,4,5,7,3,2,4,3,4,4,6,5,1,2,3,4,5,4,3,4,8),
                  class=c("buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy",
                          "pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass",
                          "buy","buy","buy","pass","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy",
                          "pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass"))

size(data)
data

observations,variables
72,3


x1,x2,class
1,3,buy
2,2,buy
3,6,buy
4,5,buy
3,4,buy
2,5,buy
5,3,buy
4,8,buy
3,9,buy
2,0,buy


In [4]:
# Count the past customers who passed on the product
sum(data$class == "pass")

## Problem 1

Build a naive Bayes model based on the data to predict which prospects will buy.

You may want to use these function(s):
* naiveBayes()

In [5]:
model = naiveBayes(class ~ x1+x2, data)
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      buy      pass 
0.4861111 0.5138889 

Conditional probabilities:
      x1
Y          [,1]     [,2]
  buy  2.971429 1.271537
  pass 2.702703 1.853898

      x2
Y          [,1]     [,2]
  buy  4.685714 2.348609
  pass 4.027027 1.674755
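
For numeric predictors, the two columns printed under each variable are the per-class mean and standard deviation. The sketch below shows how a posterior probability could be reproduced by hand from those numbers, assuming Gaussian class-conditional densities and using the rounded values as printed. The function name `posterior()` and the example point (x1=3, x2=6) are chosen here for illustration, and small differences from `predict()` are possible because of the rounding.

```r
# Parameters copied from the printed model (rounded as displayed)
prior   = c(buy=0.4861111, pass=0.5138889)
x1.mean = c(buy=2.971429,  pass=2.702703)
x1.sd   = c(buy=1.271537,  pass=1.853898)
x2.mean = c(buy=4.685714,  pass=4.027027)
x2.sd   = c(buy=2.348609,  pass=1.674755)

# Hypothetical helper: prior times the product of per-variable Gaussian densities,
# normalized so that the two class probabilities sum to 1
posterior = function(x1, x2) {
  joint = prior * dnorm(x1, x1.mean, x1.sd) * dnorm(x2, x2.mean, x2.sd)
  joint / sum(joint)
}

posterior(3, 6)
```

With a "buy" cutoff of 0.5, a prospect is predicted to buy when the `buy` entry of this vector reaches the cutoff.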


## Problem 2: In-Sample Evaluation

Make predictions on the same data used to build the model ("buy" cutoff=0.5), and show the resulting confusion matrix, accuracy, & business metric of the model in terms of profit.

You may want to use these function(s):
* colnames()
* predict()
* as.class()
* confusionMatrix()
* round()

To calculate profit, develop a formula that relies on the confusion matrix.  In the formula, round to the nearest whole number of prospects as appropriate.  

In [6]:
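# Predict on the same data used to build the model (in-sample);
# predict() returns the most probable class, which for two classes matches a 0.5 cutoff (ties aside)
new.data = data[,1:2]
predictions = predict(model, new.data)
new.data$class.predicted = predictions

# Confusion matrix as proportions: rows are predicted classes, columns are actual classes
cm = confusionMatrix(new.data$class.predicted, data$class, positive="buy")$table
cm = cm / sum(cm)

# Accuracy is the share of observations on the diagonal
insample_accuracy = (cm[1,1]+cm[2,2])/sum(cm)

fmt.cm(cm)
fmt(insample_accuracy)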

predicted \ actual,buy,pass
buy,0.25,0.0833333
pass,0.2361111,0.4305556


insample_accuracy
0.6805556
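
The relationship between the confusion matrix and accuracy can be checked with base R alone. The sketch below uses hypothetical toy labels (not the notebook's data) to show that the sum of the diagonal of a proportion-valued confusion matrix equals the share of correct predictions.

```r
# Hypothetical toy labels, not taken from the notebook's data
actual    = factor(c("buy","buy","buy","pass","pass","pass","pass","buy"),
                   levels=c("buy","pass"))
predicted = factor(c("buy","pass","buy","pass","buy","pass","pass","buy"),
                   levels=c("buy","pass"))

# Rows are predictions, columns are actual classes
cm.toy = table(predicted, actual)
cm.toy = cm.toy / sum(cm.toy)

# Two equivalent ways to compute accuracy
accuracy.from.cm   = cm.toy[1,1] + cm.toy[2,2]
accuracy.from.hits = mean(predicted == actual)

c(accuracy.from.cm, accuracy.from.hits)
```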


In [7]:
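# Scale the confusion matrix to the 1000 prospects:
# share of actual buyers predicted "buy" (true positive rate) times expected buyers
buy_buy_num = round(cm[1,1] / (cm[1,1] + cm[2,1]) * buyers)
# share of actual passers predicted "buy" (false positive rate) times expected passers
buy_pass_num = round(cm[1,2] / (cm[1,2] + cm[2,2]) * passers)

# Profit contribution of each confusion matrix cell; we meet only predicted buyers
buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
insample_profit = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost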

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, insample_profit)

buy_buy,pass_buy,buy_pass,pass_pass,insample_profit
2483000,0,-693000,0,1790000
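
Because the same profit formula is needed again for every cross-validation fold, it may help to see it as a single function. The sketch below is one possible way to package it. The name `profit_from_cm()` is hypothetical, and the counts 18, 17, 6, and 31 are the in-sample proportions shown above multiplied by the 72 observations.

```r
# Hypothetical helper: profit implied by a confusion matrix whose rows are
# predicted classes and whose columns are actual classes (buy first, pass second)
profit_from_cm = function(cm, buyers, passers, revenue, cost) {
  tpr = cm[1,1] / (cm[1,1] + cm[2,1])   # share of actual buyers predicted "buy"
  fpr = cm[1,2] / (cm[1,2] + cm[2,2])   # share of actual passers predicted "buy"
  met.buyers  = round(tpr * buyers)     # buyers we meet
  met.passers = round(fpr * passers)    # passers we meet by mistake
  met.buyers*revenue - (met.buyers + met.passers)*cost
}

# In-sample confusion matrix as counts (matrix() fills column by column)
cm.counts = matrix(c(18, 17, 6, 31), nrow=2,
                   dimnames=list(predicted=c("buy","pass"), actual=c("buy","pass")))

profit_from_cm(cm.counts, buyers=50, passers=950, revenue=100000, cost=4500)
```

The function works on counts or proportions alike, since only the within-column ratios matter. It does not guard against a column that sums to zero, which a real implementation would need to handle.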


## Problem 3: Evaluation by Cross-Validation

Partition the data into 5 test folds.

For each fold, build a naive Bayes model based on training data, make predictions based on test data ("buy" cutoff=0.5), and show the resulting confusion matrix, accuracy, & business metric in terms of profit.

Show the model, its cross-validation accuracy, & its cross-validation business value in terms of profit.

You may want to use these function(s):
* set.seed()
* createFolds()
* setdiff()
* colnames()
* naiveBayes()
* predict()
* as.class()
* confusionMatrix()
* round()

Use `set.seed(12345)` and `createFolds(..., k=5)` to do the partitioning.

In [8]:
# Partition the data into 5 test folds
set.seed(12345)
fold = createFolds(data$class, k=5)
fold
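
To make the fold structure concrete, here is a base R sketch of a 5-way partition of row indices. It is not what `createFolds()` does internally: `createFolds()` also balances the classes within each fold, while this sketch only shuffles and deals out indices, so it will not reproduce the same folds even with the same seed. The name `fold.sketch` is introduced here for illustration.

```r
set.seed(12345)
n = 72   # number of observations in the notebook's data

# Shuffle the row indices, then deal them into 5 groups of nearly equal size
fold.sketch = split(sample(n), rep(1:5, length.out=n))
names(fold.sketch) = paste0("Fold", 1:5)

# Each fold holds test row indices; the training rows are everything else
sapply(fold.sketch, length)
train.rows = setdiff(1:n, fold.sketch$Fold1)
length(train.rows)
```

Every observation lands in exactly one test fold, so each observation is used for testing once and for training four times.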

In [9]:
# For each fold, build a naive Bayes model based on training data, make predictions based on test data 
# ("buy" cutoff=0.5), and show the resulting confusion matrix, accuracy, & business metric in 
# terms of profit
cutoff = 0.5

data_1.train = data[setdiff(1:nrow(data), fold$Fold1),]
data_1.test  = data[fold$Fold1,]

data_1.u.test = data_1.test[, colnames(data_1.test)!="class"]

model_1 = naiveBayes(class ~ x1+x2, data_1.train)

prob = as.data.frame(predict(model_1, data_1.u.test, type="raw"))
data_1.u.test$prob.buy = prob$buy
data_1.u.test$class.predicted = as.class(prob, class="buy", cutoff)

data_1.u.test$hit = (data_1.u.test$class.predicted == data_1.test$class)

CM.1 = confusionMatrix(data_1.u.test$class.predicted, data_1.test$class, positive="buy")$table
cm.1 = CM.1 / sum(CM.1)

accuracy.1 = (cm.1[1,1]+cm.1[2,2])/sum(cm.1)

fmt.cm(cm.1)
fmt(accuracy.1)

buy_buy_num = round(cm.1[1,1] / (cm.1[1,1] + cm.1[2,1]) * buyers)
buy_pass_num = round(cm.1[1,2] / (cm.1[1,2] + cm.1[2,2]) * passers)


buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
profit.1 = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, profit.1)

predicted \ actual,buy,pass
buy,0.5,0.0
pass,0.0,0.5


accuracy.1
1


buy_buy,pass_buy,buy_pass,pass_pass,profit.1
4775000,0,0,0,4775000


In [10]:
data_2.train = data[setdiff(1:nrow(data), fold$Fold2),]
data_2.test  = data[fold$Fold2,]

data_2.u.test = data_2.test[, colnames(data_2.test)!="class"]

model_2 = naiveBayes(class ~ x1+x2, data_2.train)

prob = as.data.frame(predict(model_2, data_2.u.test, type="raw"))
data_2.u.test$prob.buy = prob$buy
data_2.u.test$class.predicted = as.class(prob, class="buy", cutoff)

data_2.u.test$hit = (data_2.u.test$class.predicted == data_2.test$class)

CM.2 = confusionMatrix(data_2.u.test$class.predicted, data_2.test$class, positive="buy")$table
cm.2 = CM.2 / sum(CM.2)

accuracy.2 = (cm.2[1,1]+cm.2[2,2])/sum(cm.2)

fmt.cm(cm.2)
fmt(accuracy.2)

buy_buy_num = round(cm.2[1,1] / (cm.2[1,1] + cm.2[2,1]) * buyers)
buy_pass_num = round(cm.2[1,2] / (cm.2[1,2] + cm.2[2,2]) * passers)


buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
profit.2 = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, profit.2)

predicted \ actual,buy,pass
buy,0.2857143,0.2857143
pass,0.2142857,0.2142857


accuracy.2
0.5


buy_buy,pass_buy,buy_pass,pass_pass,profit.2
2769500,0,-2443500,0,326000


In [11]:
data_3.train = data[setdiff(1:nrow(data), fold$Fold3),]
data_3.test  = data[fold$Fold3,]

data_3.u.test = data_3.test[, colnames(data_3.test)!="class"]

model_3 = naiveBayes(class ~ x1+x2, data_3.train)

prob = as.data.frame(predict(model_3, data_3.u.test, type="raw"))
data_3.u.test$prob.buy = prob$buy
data_3.u.test$class.predicted = as.class(prob, class="buy", cutoff)

data_3.u.test$hit = (data_3.u.test$class.predicted == data_3.test$class)

CM.3 = confusionMatrix(data_3.u.test$class.predicted, data_3.test$class, positive="buy")$table
cm.3 = CM.3 / sum(CM.3)

accuracy.3 = (cm.3[1,1]+cm.3[2,2])/sum(cm.3)

fmt.cm(cm.3)
fmt(accuracy.3)

buy_buy_num = round(cm.3[1,1] / (cm.3[1,1] + cm.3[2,1]) * buyers)
buy_pass_num = round(cm.3[1,2] / (cm.3[1,2] + cm.3[2,2]) * passers)


buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
profit.3 = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, profit.3)

predicted \ actual,buy,pass
buy,0.2,0.1333333
pass,0.2666667,0.4


accuracy.3
0.6


buy_buy,pass_buy,buy_pass,pass_pass,profit.3
2005500,0,-1071000,0,934500


In [12]:
data_4.train = data[setdiff(1:nrow(data), fold$Fold4),]
data_4.test  = data[fold$Fold4,]

data_4.u.test = data_4.test[, colnames(data_4.test)!="class"]

model_4 = naiveBayes(class ~ x1+x2, data_4.train)

prob = as.data.frame(predict(model_4, data_4.u.test, type="raw"))
data_4.u.test$prob.buy = prob$buy
data_4.u.test$class.predicted = as.class(prob, class="buy", cutoff)

data_4.u.test$hit = (data_4.u.test$class.predicted == data_4.test$class)

CM.4 = confusionMatrix(data_4.u.test$class.predicted, data_4.test$class, positive="buy")$table
cm.4 = CM.4 / sum(CM.4)

accuracy.4 = (cm.4[1,1]+cm.4[2,2])/sum(cm.4)

fmt.cm(cm.4)
fmt(accuracy.4)

buy_buy_num = round(cm.4[1,1] / (cm.4[1,1] + cm.4[2,1]) * buyers)
buy_pass_num = round(cm.4[1,2] / (cm.4[1,2] + cm.4[2,2]) * passers)


buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
profit.4 = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, profit.4)

predicted \ actual,buy,pass
buy,0.0,0.0
pass,0.4666667,0.5333333


accuracy.4
0.5333333


buy_buy,pass_buy,buy_pass,pass_pass,profit.4
0,0,0,0,0


In [13]:
data_5.train = data[setdiff(1:nrow(data), fold$Fold5),]
data_5.test  = data[fold$Fold5,]

data_5.u.test = data_5.test[, colnames(data_5.test)!="class"]

model_5 = naiveBayes(class ~ x1+x2, data_5.train)

prob = as.data.frame(predict(model_5, data_5.u.test, type="raw"))
data_5.u.test$prob.buy = prob$buy
data_5.u.test$class.predicted = as.class(prob, class="buy", cutoff)

data_5.u.test$hit = (data_5.u.test$class.predicted == data_5.test$class)

CM.5 = confusionMatrix(data_5.u.test$class.predicted, data_5.test$class, positive="buy")$table
cm.5 = CM.5 / sum(CM.5)

accuracy.5 = (cm.5[1,1]+cm.5[2,2])/sum(cm.5)

fmt.cm(cm.5)
fmt(accuracy.5)

buy_buy_num = round(cm.5[1,1] / (cm.5[1,1] + cm.5[2,1]) * buyers)
buy_pass_num = round(cm.5[1,2] / (cm.5[1,2] + cm.5[2,2]) * passers)


buy_buy = buy_buy_num * revenue - buy_buy_num * cost
pass_buy = 0
buy_pass = buy_pass_num * -cost
pass_pass = 0
profit.5 = buy_buy_num * revenue - (buy_buy_num + buy_pass_num) * cost

data.frame(buy_buy, pass_buy, buy_pass, pass_pass, profit.5)

predicted \ actual,buy,pass
buy,0.1428571,0.1428571
pass,0.3571429,0.3571429


accuracy.5
0.5


buy_buy,pass_buy,buy_pass,pass_pass,profit.5
1337000,0,-1219500,0,117500
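
The five fold cells follow the same steps, so they could also be written as one function applied to each fold. The sketch below illustrates that pattern under several assumptions: it uses hypothetical toy data rather than the notebook's data, a simple unstratified partition instead of `createFolds()`, and a plain `ifelse()` with `>=` in place of the `as.class()` helper, whose exact behavior at the cutoff is not shown in this notebook. Only `naiveBayes()` and `predict()` from `e1071` are relied on.

```r
library(e1071)  # provides naiveBayes()

# Hypothetical toy data, not the notebook's data
set.seed(1)
toy = data.frame(x1=c(rnorm(30, 3), rnorm(30, 2)),
                 x2=c(rnorm(30, 5), rnorm(30, 4)),
                 class=factor(rep(c("buy","pass"), each=30)))

# Simple unstratified 5-fold partition of row indices
fold.toy = split(sample(nrow(toy)), rep(1:5, length.out=nrow(toy)))

# Build on the training rows, predict on the test rows, return test accuracy
evaluate_fold = function(test.rows, cutoff=0.5) {
  train = toy[-test.rows,]
  test  = toy[test.rows,]
  model = naiveBayes(class ~ x1+x2, train)
  prob  = predict(model, test[, c("x1","x2")], type="raw")
  predicted = factor(ifelse(prob[,"buy"] >= cutoff, "buy", "pass"),
                     levels=c("buy","pass"))
  mean(predicted == test$class)
}

accuracy.toy = sapply(fold.toy, evaluate_fold)
accuracy.toy
mean(accuracy.toy)
```

The same function could return the per-fold profit as well, which would keep the accuracy and profit calculations for all folds in one place.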


In [14]:
# Show the model, its cross-validation accuracy, & its cross-validation business value in terms of profit

cv_accuracy = mean(c(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5))
cv_profit = mean(c(profit.1, profit.2, profit.3, profit.4, profit.5))

model
data.frame(cv_accuracy, cv_profit)


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      buy      pass 
0.4861111 0.5138889 

Conditional probabilities:
      x1
Y          [,1]     [,2]
  buy  2.971429 1.271537
  pass 2.702703 1.853898

      x2
Y          [,1]     [,2]
  buy  4.685714 2.348609
  pass 4.027027 1.674755


cv_accuracy,cv_profit
0.6266667,1230600
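
The cross-validation estimates are averages over only five folds, so it can be informative to look at the spread as well as the mean. The sketch below recomputes the means from the per-fold results shown above and adds the standard deviation and range. The vector names are chosen here for illustration.

```r
# Per-fold results copied from the outputs above
fold.accuracy = c(1, 0.5, 0.6, 0.5333333, 0.5)
fold.profit   = c(4775000, 326000, 934500, 0, 117500)

data.frame(mean.accuracy = mean(fold.accuracy),
           sd.accuracy   = sd(fold.accuracy),
           mean.profit   = mean(fold.profit),
           sd.profit     = sd(fold.profit),
           min.profit    = min(fold.profit),
           max.profit    = max(fold.profit))
```

Fold 1 contributes 4,775,000 of the 6,153,000 total profit across folds, so the mean profit is sensitive to that single fold.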


## Problem 4: Benefit of the Model

What is the model worth to our company in terms of how much it is expected to increase profit? In other words, what is the opportunity cost of not using the model?

In [15]:
profit.with_model = cv_profit
improvement = cv_profit - profit.baseline

data.frame(profit.baseline, profit.with_model, improvement)

profit.baseline,profit.with_model,improvement
500000,1230600,730600
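
The improvement can also be expressed relative to the baseline, which makes it easier to compare across scenarios with different baseline profits. A minimal sketch using the figures above, where `improvement.pct` is a name introduced here:

```r
# Figures copied from the outputs above
profit.baseline   = 500000
profit.with_model = 1230600

improvement     = profit.with_model - profit.baseline
improvement.pct = 100 * improvement / profit.baseline

data.frame(improvement, improvement.pct)
```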


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised December 17, 2019
</span>
</p>
</font>