# Chapter 5: Evaluating Predictive Performance


> (c) 2019-2020 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.
>
> Date: 2020-03-08
>
> Python Version: 3.8.2
> Jupyter Notebook Version: 5.6.1
>
> Packages:
>   - dmba: 0.0.12
>   - matplotlib: 3.2.0
>   - pandas: 1.0.1
>   - scikit-learn: 0.22.2
>
> The assistance from Mr. Kuber Deokar and Ms. Anuja Kulkarni in preparing these solutions is gratefully acknowledged.


In [1]:
# import required packages for this chapter

from pathlib import Path

import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, roc_curve, auc
import matplotlib.pylab as plt 

from dmba import regressionSummary, classificationSummary
from dmba import liftChart, gainsChart



In [2]:
# Working directory:
#
# We assume that data are kept in the same directory as the notebook. If you keep your
# data in a different folder, build the file path with `Path`
# and then load the data using
#
# pd.read_csv('filename.csv')

# 5.1 

A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate.

__Answer:__

<pre>
classification confusion matrix
|----------------------------------------------------------------------------|
|             |                     Predicted Class                          |
|----------------------------------------------------------------------------|
| Actual Class|             C0               |             C1                |
|----------------------------------------------------------------------------|
|      C0     | n0,0 = number of correctly   | n0,1 = number of C0 cases     | 
|             |  classified C0 cases         |  incorrectly classified as C1 | 
|----------------------------------------------------------------------------|
|      C1     | n1,0 = number of C1 cases    | n1,1 = number of correctly    |
|             |  incorrectly classified as C0|  classified C1 cases          |
|----------------------------------------------------------------------------|

Therefore in our problem the confusion matrix is

classification confusion matrix
|----------------------------------------------------------------------|
|                    |                 Predicted Class                 |
|----------------------------------------------------------------------|
| Actual Class       |        Fraudulent (1)    |   Non-fraudulent (0) |
|----------------------------------------------------------------------|
| Fraudulent (1)     |             30           |        32            |
|----------------------------------------------------------------------|
| Non-fraudulent (0) |             58           |       920            |
|----------------------------------------------------------------------|

</pre>

The formula for the overall error rate is:

Overall error rate = (n<sub>0,1</sub> + n<sub>1,0</sub>) / n, where n is the total number of records.

In [3]:
error_rate = (32 + 58) / 1040
error_rate

0.08653846153846154

So the overall error rate is 8.65%.
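
As a cross-check, the confusion matrix can be rebuilt directly from the four counts in the problem statement. The following is a minimal sketch in plain Python; the variable names are illustrative and are not part of the original solution.

```python
# Counts taken from the problem statement
predicted_fraud, correct_fraud = 88, 30          # classified as fraudulent, correctly so
predicted_nonfraud, correct_nonfraud = 952, 920  # classified as non-fraudulent, correctly so

# Rows are actual classes (fraudulent, non-fraudulent);
# columns are predicted classes in the same order.
confusion = [
    [correct_fraud, predicted_nonfraud - correct_nonfraud],  # actual fraudulent
    [predicted_fraud - correct_fraud, correct_nonfraud],     # actual non-fraudulent
]

n = sum(sum(row) for row in confusion)
misclassified = confusion[0][1] + confusion[1][0]
error_rate = misclassified / n

for row in confusion:
    print(row)
print(f'Overall error rate: {error_rate:.4f}')
```

The off-diagonal cells (32 and 58) are obtained by subtraction, which is the same reasoning used to fill in the table above.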

# 5.2

Suppose that this routine has an adjustable cutoff (threshold) mechanism by which you can alter the proportion of records classified as fraudulent. Describe how moving the cutoff up or down would affect

__5.2.a.__ the classification error rate for records that are truly fraudulent

__5.2.b.__ the classification error rate for records that are truly nonfraudulent

__Answer:__

<pre>
classification confusion matrix
|--------------------------------------------------------|
|                   |          Predicted Class           |
|--------------------------------------------------------|
| Actual Class      | Fraudulent (1) | Non-fraudulent (0)|
|--------------------------------------------------------|
| Fraudulent (1)    |       a        |         b         |
|--------------------------------------------------------|
| Non-fraudulent (0)|       c        |         d         |
|--------------------------------------------------------|
</pre>

The classification error rate for truly fraudulent records (with this 0.5 cutoff) is b/(a+b)

The classification error rate for truly non-fraudulent records (with this 0.5 cutoff) is c/(c+d) 

Lowering the cutoff (here, below 0.5) leads to classifying more records, both fraudulent and non-fraudulent, as fraudulent: a and c both increase, b and d decline.

__a.__ With respect to the classification error rate for truly fraudulent records, the error rate, b/(a+b), decreases as b goes down. As you lower the standard for calling a record fraudulent, you catch more and more of the real frauds.

__b.__ With respect to the classification error rate for truly non-fraudulent records, the error rate, c/(c+d), increases as c goes up. As you lower the standard for calling a record fraudulent, you mistakenly identify more and more non-frauds as frauds.

Increasing the cutoff (here, above 0.5) leads to classifying more records, both fraudulent and non-fraudulent, as non-fraudulent:  b and d both increase, a and c decline.

__a.__ With respect to the classification error rate for truly fraudulent records, the error rate, b/(a+b), increases as b goes up. As you raise the standard for calling a record fraudulent, you miss more and more of the real frauds.

__b.__ With respect to the classification error rate for truly non-fraudulent records, the error rate, c/(c+d), decreases as c goes down (and d goes up). As you raise the standard for calling a record fraudulent, fewer non-frauds get mislabeled as frauds.
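
The effect of moving the cutoff can be seen on a toy example. The sketch below uses ten propensities and labels that are invented purely for illustration (they do not come from the transaction dataset) and reports b/(a+b) and c/(c+d) at three cutoffs.

```python
# Hypothetical propensities and actual classes, made up for illustration only
propensity = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def class_error_rates(cutoff):
    """Return (error rate for true frauds, error rate for true non-frauds)."""
    predicted = [1 if p > cutoff else 0 for p in propensity]
    pairs = list(zip(actual, predicted))
    b = pairs.count((1, 0))  # frauds classified as non-fraudulent
    c = pairs.count((0, 1))  # non-frauds classified as fraudulent
    n_fraud = sum(actual)
    n_nonfraud = len(actual) - n_fraud
    return b / n_fraud, c / n_nonfraud


for cutoff in (0.25, 0.5, 0.75):
    fraud_err, nonfraud_err = class_error_rates(cutoff)
    print(f'cutoff={cutoff:.2f}  b/(a+b)={fraud_err:.3f}  c/(c+d)={nonfraud_err:.3f}')
```

On this toy data, raising the cutoff pushes the error rate for true frauds up and the error rate for true non-frauds down, which is the trade-off described above.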

# 5.3

FiscalNote is a startup founded by a Washington, DC entrepreneur and funded by a Singapore sovereign wealth fund, the Winklevoss twins of Facebook fame, and others. It uses machine learning and data mining techniques to predict for its clients whether legislation in the US Congress and in US state legislatures will pass or not. The company reports 94% accuracy. (Washington Post, November 21, 2014, “Capital Business”)

Considering just bills introduced in the US Congress, do a bit of internet research to learn about numbers of bills introduced and passage rates. Identify the possible types of misclassifications, and comment on the use of overall accuracy as a metric. Include a discussion of other possible metrics and the potential role of propensities.

__Answer:__

Web research on govtrack.us shows that, in the 113th Congress (which covered 2013 and 2014), over 10,000 pieces of legislation were introduced but only 3% passed as enacted laws.  "Enacted laws" does not include the 6% that were passed as (usually meaningless) resolutions.  

If we focus just on classifying each bill, and use overall accuracy as a metric, we could achieve 97% accuracy just by predicting that nothing will pass (as a law).   

A data mining model might make two types of classification errors: saying that a bill will pass when it won't, and saying a bill won't pass when it will. The second type of error is probably more costly than the first, since identifying the small number of bills that will pass is probably of more interest than overall accuracy. Therefore, a useful metric would be sensitivity to "will pass", the proportion of "will pass" bills that were correctly predicted (alongside specificity, which is the proportion of "will not pass" bills that are correctly ruled out).
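
To make the point concrete, the sketch below scores the naive rule "no bill will pass" on a hypothetical batch of bills. The batch size of 10,000 is an assumed round number standing in for "over 10,000"; the 3% passage rate is the figure quoted above.

```python
n_bills = 10_000   # assumed round number for illustration
pass_rate = 0.03   # share of bills enacted as laws, as quoted above

n_pass = round(n_bills * pass_rate)
n_fail = n_bills - n_pass

# Naive rule: predict "will not pass" for every bill
tp, fn = 0, n_pass   # bills that pass are all missed
tn, fp = n_fail, 0   # bills that fail are all correctly ruled out

accuracy = (tp + tn) / n_bills
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f'accuracy={accuracy:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}')
```

The naive rule scores high on accuracy and specificity while its sensitivity is zero, which is why accuracy alone says little about how well a model finds the bills that pass.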

However, rather than just assigning a 0/1 class to each bill (classification), we will probably be more interested in ranking the bills and estimating a propensity (probability) for passage for each bill.  We would then focus on the high probability bills, and not be so concerned with the low probability bills.

With ranking as our goal, we could use lift as a metric for how well a model separates out the "will pass" bills. As part of the calculation for lift, the bills would be ranked by their propensity to pass.  Lift gives a picture of how much better the model does than not using a model.

# 5.4

Consider Figure 5.12, the decile lift chart for the transaction data model, applied to new data.

__5.4.a.__ Interpret the meaning of the first and second bars from the left.

__Answer:__

Left-most bar: If we take the 10% "most probable 1's (frauds)" (as ranked by the model), it will catch 6.5 times as many 1's (frauds), as would a random selection of 10% of the records.

Second bar from left: If we take the second highest decile (10%) of records that are ranked by the model as "the most probable 1's (frauds)", it will catch 2.7 times as many 1's (frauds) as would a random selection of 10% of the records.
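
Because each decile holds 10% of the records, a decile's lift divided by 10 equals the share of all frauds that falls in that decile. The sketch below applies this arithmetic to the two lift values quoted above; the dictionary and variable names are illustrative.

```python
# Lift values for the first two deciles, as quoted in the answer above
decile_lift = {1: 6.5, 2: 2.7}

# With equal-sized deciles, lift = (share of frauds in decile) / (share of records in decile),
# and the share of records in each decile is 1/10.
share_of_frauds = {decile: lift / 10 for decile, lift in decile_lift.items()}
cumulative_share = sum(share_of_frauds.values())

for decile, share in share_of_frauds.items():
    print(f'decile {decile}: about {share:.0%} of all frauds')
print(f'top two deciles together: about {cumulative_share:.0%} of all frauds')
```

This is only a restatement of the chart in a different unit, but it is often the more useful number when deciding how many records to investigate.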

__5.4.b.__ Explain how you might use this information in practice.

__Answer:__

Consider a tax authority that wants to allocate their resources for investigating firms that are most likely to submit fraudulent tax returns. Suppose that there are resources for auditing only 10% of firms. Rather than taking a random sample, they can select the top 10% of firms that are predicted to be most likely to report fraudulently (according to the decile chart). Or, to preserve the principle that anyone might be audited, they can establish differential probabilities for being sampled -- those in the top deciles being much more likely to be audited.

__5.4.c.__ Another analyst comments that you could improve the accuracy of the model by classifying everything as nonfraudulent. If you do that, what is the error rate?

__Answer:__

We have the following confusion matrix from Problem 5.1.

<pre>
classification confusion matrix
|----------------------------------------------------------------------|
|                    |                 Predicted Class                 |
|----------------------------------------------------------------------|
| Actual Class       |        Fraudulent (1)    |   Non-fraudulent (0) |
|----------------------------------------------------------------------|
| Fraudulent (1)     |             30           |        32            |
|----------------------------------------------------------------------|
| Non-fraudulent (0) |             58           |       920            |
|----------------------------------------------------------------------|

</pre>

Under the other analyst's proposal every record is classified as non-fraudulent, so the classification confusion matrix becomes:

<pre>
classification confusion matrix
|----------------------------------------------------------------------|
|                    |                 Predicted Class                 |
|----------------------------------------------------------------------|
| Actual Class       |        Fraudulent (1)    |   Non-fraudulent (0) |
|----------------------------------------------------------------------|
| Fraudulent (1)     |              0           |        62            |
|----------------------------------------------------------------------|
| Non-fraudulent (0) |              0           |       978            |
|----------------------------------------------------------------------|

</pre>

The row totals are the actual class counts from Problem 5.1: 30 + 32 = 62 fraudulent records and 58 + 920 = 978 non-fraudulent records. Only the 62 fraudulent records are misclassified.

In [4]:
# Overall misclassification rate

error_rate = 62/1040
error_rate

0.05961538461538462

We see that the misclassification error rate is lower (5.96% versus 8.65%) with the "everything non-fraudulent" proposal, even though this rule identifies no fraudulent records at all.

__5.4.d.__ Comment on the usefulness, in this situation, of these two metrics of model performance (error rate and lift).

__Answer:__ 

The likely purpose of this analysis is to identify fraudulent records. The overall "error rate" is not likely to help much in evaluating competing methods for doing so. The key factor here is the ability to identify records that have a high probability of being fraudulent, and this is what lift measures. Using lift, you can "descend" through the records in order of probability of being fraudulent, knowing at each point how much more likely you are to be getting a fraudulent record than naively selecting at random. The "error rate" measure, by contrast, reveals nothing about the efficiency of identifying fraudulent records.

The vast majority of records are non-fraudulent, and correctly classifying nonfraudulent records drives the overall error rate. One can achieve a very respectably low error rate just by classifying everything as non-fraudulent, which is not practically useful.
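
One way to see what the error rate hides is to compare sensitivity and specificity for the model and for the "everything non-fraudulent" rule. The sketch below uses the counts from the Problem 5.1 confusion matrix; the dictionary names are illustrative.

```python
# Counts from the Problem 5.1 confusion matrix
tp, fn = 30, 32    # actual frauds: flagged by the model, missed by the model
fp, tn = 58, 920   # actual non-frauds: flagged by the model, correctly cleared

model = {
    'sensitivity': tp / (tp + fn),
    'specificity': tn / (tn + fp),
}

# "Everything non-fraudulent": nothing is flagged, so no fraud is caught
# and no non-fraud is falsely flagged.
naive = {
    'sensitivity': 0 / (tp + fn),
    'specificity': (tn + fp) / (tn + fp),
}

for name, scores in (('model', model), ('naive rule', naive)):
    print(f"{name:10s} sensitivity={scores['sensitivity']:.3f}  "
          f"specificity={scores['specificity']:.3f}")
```

The naive rule has perfect specificity and zero sensitivity, so it is useless for the stated purpose of finding frauds regardless of how its overall error rate compares.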

# 5.5

A large number of insurance records are to be examined to develop a model for predicting fraudulent claims. Of the claims in the historical database, 1% were judged to be fraudulent. A sample is taken to develop a model, and oversampling is used to provide a balanced sample in light of the very low response rate. When applied to this sample (n = 800), the model ends up correctly classifying 310 frauds, and 270 nonfrauds. It missed 90 frauds, and classified 130 records incorrectly as frauds when they were not.

__5.5.a.__ Produce the confusion matrix for the sample as it stands.

__Answer:__

<pre>
classification confusion matrix
|--------------------------------------------------------|
|                   |          Predicted Class           |
|--------------------------------------------------------|
| Actual Class      |        1       |         0         |
|--------------------------------------------------------|
|          1        |      310       |        90         |
|--------------------------------------------------------|
|          0        |      130       |       270         |
|--------------------------------------------------------|
</pre>

In [5]:
# Misclassification rate

miscl_rate1 = (90 + 130) / 800
miscl_rate1

0.275

So the overall misclassification rate is 27.5%.

The model ends up classifying (310 + 130) / 800 = 0.55 = 55% of the records as fraudulent.

__5.5.b.__ Find the adjusted misclassification rate (adjusting for the oversampling).


__Answer:__

Now we need to add enough zeros so that the 1's constitute only 1% of the total and the 0's constitute 99% of the total (where x is the total number of records).
<pre>                                                                                                           
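400 + 0.99*x = x
Therefore x = 40,000
Number of zeros = 0.99 * 40,000 = 39,600

Each 0 in the sample therefore stands for 39,600 / 400 = 99 records,
so 130 becomes 12,870 and 270 becomes 26,730.

classification confusion matrix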
|-------------------------------------------------------------|
|                   |          Predicted Class       |        |
|-------------------------------------------------------------|
|    Actual Class   |        1       |         0     | Total  |
|-------------------------------------------------------------|
|          1        |      310       |        90     | 400    |
|----------------------------------------------------|--------|
|          0        |    12870       |     26730     | 39600  |
|----------------------------------------------------|--------|
|       Total       |    13180       |     26820     | 40000  |
|-------------------------------------------------------------|
</pre>

In [6]:
# overall misclassification rate

miscl_rate2 = (90 + 12870) / 40000
miscl_rate2

0.324

So the adjusted overall misclassification rate is 32.4%.

The model ends up classifying (310 + 12870) / 40000 = 0.3295 = 32.95% of the records as fraudulent.

__5.5.c.__ What percentage of new records would you expect to be classified as fraudulent?

__Answer:__

From the above calculations, which adjust for the oversampling, we expect 32.95% of new records to be classified as frauds.
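
The oversampling adjustment in 5.5.b and 5.5.c can be sketched as a reweighting of the non-fraud row of the sample confusion matrix. The variable names below are illustrative; the counts and the 1% fraud rate come from the problem statement.

```python
# Sample confusion matrix from 5.5.a (oversampled, n = 800)
tp, fn = 310, 90    # actual frauds: classified as fraud, missed
fp, tn = 130, 270   # actual non-frauds: classified as fraud, correctly cleared

fraud_rate = 0.01   # share of frauds in the historical database

n_fraud = tp + fn                        # frauds are kept as they are
total = n_fraud / fraud_rate             # records needed for frauds to be 1% of the total
weight = (total - n_fraud) / (fp + tn)   # records represented by each sampled non-fraud

fp_adj = fp * weight   # adjusted count of non-frauds classified as fraud
tn_adj = tn * weight   # adjusted count of non-frauds correctly cleared

adjusted_error_rate = (fn + fp_adj) / total
share_flagged = (tp + fp_adj) / total

print(f'weight per sampled non-fraud: {weight:.0f}')
print(f'adjusted misclassification rate: {adjusted_error_rate:.4f}')
print(f'share of records classified as fraudulent: {share_flagged:.4f}')
```

Only the non-fraud row is rescaled because the oversampling changed the class mix, not the model's behavior within each class.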

# 5.6

A firm that sells software services has been piloting a new product and has records of 500 customers who have either bought the services or decided not to. The target value is the estimated profit from each sale (excluding sales costs). The global mean is about 2128 dollars. However, the cost of the sales effort is not cheap: the company figures it comes to 2500 dollars for each of the 500 customers (whether they buy or not). The firm developed a predictive model in hopes of being able to identify the top spenders in the future. The cumulative gains and decile lift charts for the validation set are shown in Figure 5.13.

__5.6.a__ If the company begins working with a new set of 1000 leads to sell the same services, similar to the 500 in the pilot study, without any use of predictive modeling to target sales efforts, what is the estimated profit?

__Answer:__

If the 1000 new leads are like those in the pilot, then the company can expect the same mean profit of about 2128 dollars per lead, or about 2,128,000 dollars for the 1000 leads. This does not include the cost of the sales effort, estimated at 2500 dollars per lead, or 2.5 million dollars in total. Net of that cost, the company would lose roughly 372,000 dollars. In other words, it would not be a profitable move.

__5.6.b.__ If the firm wants the average profit on each sale to roughly double the sales effort cost, and applies an appropriate cutoff with this predictive model to a new set of 1000 leads, how far down the new list of 1000 should it proceed (how many deciles)?

__Answer:__

If the profit must double the sales effort cost, that would be 5000 dollars. This is twice the average profit across all customers.  The company could achieve this by attempting sales to the first (top) decile among the customers, where the lift is about 2.1.  However, it should not go beyond this point.

__5.6.c__ Still considering the new list of 1000 leads, if the company applies this predictive model with a lower cutoff of 2500, how far should it proceed down the ranked leads, in terms of deciles?

__Answer:__

If the cutoff is lowered to 2500 dollars, a lift as low as 2500/2500 or 1.0 could be tolerated.  This would mean going all the way through the 5th decile, or half the customers. It would also mean that the product is breakeven at the margin (i.e., the last group of leads produces no net profit).

__5.6.d.__ Why use this two-stage process for predicting sales? Why not simply develop a model for predicting profit for the 1000 new leads?

__Answer:__

A model to predict overall profit for all 1000 leads would not be useful.  The profit would be zero.  Individual profit predictions for each of the 1000 leads would be useful, and might be sufficient, but generating staged solutions (corresponding to different cutoffs and differing lift) helps translate the optimization problem into a business problem, and frame a limited number of decision options for managers.  _Note:  different interpretations of the question are possible with respect to whether profit for all 1000 leads all together is intended, or profit for each of the 1000 leads individually, so some leeway is accorded in marking_.

# 5.7

Table 5.7 shows a small set of predictive model validation results for a classification model, with both actual values and propensities.

__5.7.a.__ Calculate error rates, sensitivity, and specificity using cutoffs of 0.25, 0.5, and 0.75.

__Answer:__

In [7]:
# create a data frame of data

data = {'Propensity': [0.03, 0.52, 0.38, 0.82, 0.33, 0.42, 0.55, 0.59, 0.09, 0.21, 0.43, 0.04, 0.08, 0.13, 0.01, 0.79, 0.42, 
                       0.29, 0.08, 0.02],
        'Actual': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}

# convert to data frame
df = pd.DataFrame(data)

In [8]:
# cutoff = 0.25

Predicted = [1 if p > 0.25 else 0 for p in df.Propensity]
classificationSummary(df.Actual, Predicted, class_names=['0', '1'])

Confusion Matrix (Accuracy 0.6000)

       Prediction
Actual 0 1
     0 9 8
     1 0 3


In [9]:
# overall error rate
error_rate = (8) / 20
print('\nError Rate = ', error_rate)
# sensitivity
sensitivity = (3) / (3+0)
print('\nSensitivity=',sensitivity)
# specificity
specificity = (9) / (8+9)
print('\nSpecificity = ', specificity)


Error Rate =  0.4

Sensitivity= 1.0

Specificity =  0.5294117647058824


In [10]:
# cutoff = 0.5

Predicted = [1 if p > 0.5 else 0 for p in df.Propensity]
classificationSummary(df.Actual, Predicted, class_names=['0', '1'])

Confusion Matrix (Accuracy 0.9000)

       Prediction
Actual  0  1
     0 15  2
     1  0  3


In [11]:
# overall error rate
error_rate = (2) / 20
print('\nError Rate = ', error_rate)
# sensitivity
sensitivity = (3) / (3+0)
print('\nSensitivity=',sensitivity)
# specificity
specificity = (15) / (2+15)
print('\nSpecificity = ', specificity)


Error Rate =  0.1

Sensitivity= 1.0

Specificity =  0.8823529411764706


In [12]:
# cutoff = 0.75

Predicted = [1 if p > 0.75 else 0 for p in df.Propensity]
classificationSummary(df.Actual, Predicted, class_names=['0', '1'])

Confusion Matrix (Accuracy 0.9500)

       Prediction
Actual  0  1
     0 17  0
     1  1  2


In [13]:
# overall error rate
error_rate = (1) / 20
print('\nError Rate = ', error_rate)
# sensitivity
sensitivity = (2) / (2+1)
print('\nSensitivity=',sensitivity)
# specificity
specificity = (17) / (0+17)
print('\nSpecificity = ', specificity)


Error Rate =  0.05

Sensitivity= 0.6666666666666666

Specificity =  1.0
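
The three cutoffs can also be evaluated in one loop instead of copying counts out of each confusion matrix by hand. The following is a minimal sketch in plain Python that repeats the Table 5.7 data so it runs on its own; the helper name `evaluate` is illustrative and is not part of `dmba`.

```python
# Table 5.7 data, repeated here so the sketch is self-contained
propensity = [0.03, 0.52, 0.38, 0.82, 0.33, 0.42, 0.55, 0.59, 0.09, 0.21,
              0.43, 0.04, 0.08, 0.13, 0.01, 0.79, 0.42, 0.29, 0.08, 0.02]
actual = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]


def evaluate(cutoff):
    """Return error rate, sensitivity, and specificity at the given cutoff."""
    predicted = [1 if p > cutoff else 0 for p in propensity]
    pairs = list(zip(actual, predicted))
    tp = pairs.count((1, 1))
    fn = pairs.count((1, 0))
    fp = pairs.count((0, 1))
    tn = pairs.count((0, 0))
    return {
        'error_rate': (fp + fn) / len(pairs),
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
    }


metrics = {cutoff: evaluate(cutoff) for cutoff in (0.25, 0.5, 0.75)}
for cutoff, m in metrics.items():
    print(f"cutoff={cutoff:.2f}  error rate={m['error_rate']:.4f}  "
          f"sensitivity={m['sensitivity']:.4f}  specificity={m['specificity']:.4f}")
```

Computing the metrics from the counts, and not from typed-in numbers, avoids transcription slips when the data or cutoffs change.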


__5.7.b.__ Create a decile lift chart.

__Answer:__

In [14]:
# decile lift chart
# sort data by propensities
df = df.sort_values(by=['Propensity'], ascending=False)
#create decile lift chart
liftChart(df.Actual)

<AxesSubplot:title={'center':'Decile Lift Chart'}, xlabel='Percentile', ylabel='Lift'>
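
The bar heights in a decile lift chart can be reproduced by hand. The sketch below assumes equal-sized deciles over the records ranked by propensity, which may differ in detail from how `liftChart` bins records; it repeats the Table 5.7 data so it runs on its own.

```python
# Table 5.7 data, repeated here so the sketch is self-contained
propensity = [0.03, 0.52, 0.38, 0.82, 0.33, 0.42, 0.55, 0.59, 0.09, 0.21,
              0.43, 0.04, 0.08, 0.13, 0.01, 0.79, 0.42, 0.29, 0.08, 0.02]
actual = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# Rank the actual classes from highest to lowest propensity
ranked = [a for _, a in sorted(zip(propensity, actual),
                               key=lambda pair: pair[0], reverse=True)]

n = len(ranked)
overall_rate = sum(ranked) / n   # share of 1's in the whole validation set

# Lift for a decile = (share of 1's in the decile) / (overall share of 1's)
decile_lift = []
for d in range(10):
    members = ranked[d * n // 10:(d + 1) * n // 10]
    decile_lift.append((sum(members) / len(members)) / overall_rate)

for d, lift in enumerate(decile_lift, start=1):
    print(f'decile {d:2d}: lift = {lift:.2f}')
```

With only 20 records each decile contains two records, so the chart is very coarse: all three 1's fall in the first two deciles and every later decile has a lift of zero.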