# Chapter 8: The Naive Bayes Classifier (NB)


> (c) 2019-2020 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.
>
> Date: 2020-03-08
>
> Python Version: 3.8.2
> Jupyter Notebook Version: 5.6.1
>
> Packages:
>   - dmba: 0.0.12
>   - numpy: 1.18.1
>   - pandas: 1.0.1
>   - scikit-learn: 0.22.2
>
> The assistance from Mr. Kuber Deokar and Ms. Anuja Kulkarni in preparing these solutions is gratefully acknowledged.


In [1]:
# Import required packages for this chapter
from pathlib import Path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from dmba import classificationSummary

%matplotlib inline

In [2]:
# Working directory:
#
# We assume that data are kept in the same directory as the notebook. If you keep your 
# data in a different folder, replace the argument of the `Path`
DATA = Path('.')
# and then load data using 
#
# pd.read_csv(DATA / 'filename.csv')

# Problem 8.1 Personal Loan Acceptance.

The file _UniversalBank.csv_ contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise, we focus on two predictors: Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below) (does the customer hold a credit card issued by the bank), and the outcome Personal Loan (abbreviated Loan below).

Partition the data into training (60%) and validation (40%) sets.

## Data Preparation

Remove all unnecessary columns from the dataset and convert _Online_ and _CreditCard_ to categories. Split the data into training (60%), and validation (40%) sets (use <code>random_state=1</code>).

In [3]:
# Load the data
bank_df = pd.read_csv(DATA / 'UniversalBank.csv')

# Consider only the required variables and reorder the columns at the same time
bank_df = bank_df[['Online', 'CreditCard', 'Personal Loan']]
bank_df.Online = bank_df.Online.astype('category')
bank_df.CreditCard = bank_df.CreditCard.astype('category')
bank_df.head()

Unnamed: 0,Online,CreditCard,Personal Loan
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,1,0


Split dataset into training and validation sets.

In [4]:
train_df, valid_df = train_test_split(bank_df, test_size=0.4, random_state=1)
print('Training Set:', train_df.shape, 'Validation Set:', valid_df.shape)

Training Set: (3000, 3) Validation Set: (2000, 3)


__8.1.a__ Create a pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the table should convey the count. Use the pandas dataframe methods _melt()_ and _pivot()_.

__Answer:__

In [5]:
# pivot table for training data
train_df.pivot_table(index=['CreditCard', 'Personal Loan'],
                    columns=['Online'], aggfunc=len)

Unnamed: 0_level_0,Online,0,1
CreditCard,Personal Loan,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,792,1117
0,1,73,126
1,0,327,477
1,1,39,49


__8.1.b.__ Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1)).

__Answer:__

Use the pivot table created in 8.1.a for the answer.

There are 477 + 49 = 526 records where Online = 1 and CC = 1. Of these, 49 accept the loan, so the conditional probability is 49/526 = 0.0932.

In [6]:
p11 = 49 / (477 + 49)
print('Count based probability P(Loan = 1|CC = 1, Online = 1) = ', p11)

Count based probability P(Loan = 1|CC = 1, Online = 1) =  0.09315589353612168


__8.1.c.__ Create two separate pivot tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

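__Answer:__

Pivot tables for Loan (rows) as a function of CC and of Online (columns). Here we can use the `pivot_table` method of the pandas data frame. Each row is divided by its sum, so the tables show conditional proportions rather than counts.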

In [7]:
predictors = ['CreditCard', 'Online']

print(train_df['Personal Loan'].value_counts() / len(train_df))
print()

for predictor in predictors:
    # construct the frequency table
    df = train_df[['Personal Loan', predictor]]
    freqTable = df.pivot_table(index='Personal Loan', columns=predictor, aggfunc=len)

    # divide each row by the sum of the row to get conditional probabilities
    propTable = freqTable.apply(lambda x: x / sum(x), axis=1)
    print(propTable)
    print()

0    0.904333
1    0.095667
Name: Personal Loan, dtype: float64

CreditCard            0         1
Personal Loan                    
0              0.703649  0.296351
1              0.693380  0.306620

Online                0         1
Personal Loan                    
0              0.412459  0.587541
1              0.390244  0.609756



<small>(<em>CreditCard</em> abbreviated as CC, <em>Personal Loan</em> abbreviated as Loan)</small>

__8.1.d.__ Compute the following quantities [P(A | B) means “the probability of A given B”]:

<ul>
<li>i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)</li>
<li>ii. P(Online = 1|Loan = 1)</li>
<li>iii. P(Loan = 1) = the proportion of loan acceptors</li>
<li>iv. P(CC = 1|Loan = 0)</li>
<li>v.  P(Online = 1|Loan = 0)</li>
<li>vi. P(Loan = 0)</li>
</ul>

Use the pivot tables created in 8.1.c.

<ul>
    <li>i. P(CreditCard = 1|Loan = 1) = 0.306620</li>
    <li>ii. P(Online = 1|Loan = 1) = 0.609756</li>
    <li>iii. P(Loan = 1) = 0.095667</li> 
    <li>iv. P(CC = 1|Loan = 0) = 0.296351</li> 
    <li>v. P(Online = 1|Loan = 0) = 0.587541</li> 
    <li>vi. P(Loan = 0) = 0.904333</li>
</ul>
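As a cross-check, the six quantities can also be derived directly from the counts in the pivot table of 8.1.a. The following is a minimal sketch in plain Python; the dictionary `counts` and the helper `total` are illustrative names, not part of the original solution.

```python
# Counts from the training-data pivot table in 8.1.a,
# keyed by (CreditCard, Personal Loan, Online).
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}


def total(cc=None, loan=None, online=None):
    """Sum the counts of all cells matching the given values (None = any)."""
    return sum(
        n for (c, ln, o), n in counts.items()
        if (cc is None or c == cc)
        and (loan is None or ln == loan)
        and (online is None or o == online)
    )


n = total()  # number of training records

p_cc1_loan1 = total(cc=1, loan=1) / total(loan=1)          # i.
p_online1_loan1 = total(online=1, loan=1) / total(loan=1)  # ii.
p_loan1 = total(loan=1) / n                                # iii.
p_cc1_loan0 = total(cc=1, loan=0) / total(loan=0)          # iv.
p_online1_loan0 = total(online=1, loan=0) / total(loan=0)  # v.
p_loan0 = total(loan=0) / n                                # vi.

for name, value in [
    ("P(CC = 1 | Loan = 1)", p_cc1_loan1),
    ("P(Online = 1 | Loan = 1)", p_online1_loan1),
    ("P(Loan = 1)", p_loan1),
    ("P(CC = 1 | Loan = 0)", p_cc1_loan0),
    ("P(Online = 1 | Loan = 0)", p_online1_loan0),
    ("P(Loan = 0)", p_loan0),
]:
    print(f"{name} = {value:.6f}")
```

Working from the counts avoids carrying rounded proportions into later calculations.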

__8.1.e.__ Use the quantities computed above to compute the naive Bayes probability P(Loan = 1 | CC = 1, Online = 1).

Refer to the naive Bayes formula (8.3) in the book.

```
P(Loan=1|CC=1,Online=1) = 
   P(Loan=1) * P(CC=1|Loan=1) * P(Online=1|Loan=1) / 
   [P(Loan=1) * [P(CC=1|Loan=1) * P(Online=1|Loan=1)] + 
    P(Loan=0) * [P(CC=1|Loan=0) * P(Online=1|Loan=0)]]
```

In [8]:
# P(Loan = 1) * P(CC = 1 | Loan = 1) * P(Online = 1 | Loan = 1)
p1 = 0.095667 * 0.306620 * 0.609756
# P(Loan = 0) * P(CC = 1 | Loan = 0) * P(Online = 1 | Loan = 0)
p2 = 0.904333 * 0.296351 * 0.587541

print('Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) = ', p1 / (p1 + p2))

Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) =  0.1020046248320646


__8.1.f.__ Compare this value with the one obtained from the pivot table in (b). Which is a more accurate estimate?

The value obtained from the crossed pivot table (0.0932) is the more accurate estimate, since it does not rely on the simplifying assumption that credit card ownership and online banking use are independent of each other within each loan class. Using the exact estimate is feasible in this case because there are few variables and few categories to consider, and thus there are ample data for all possible combinations.
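The difference between the two estimates can be traced back to the counts. The sketch below (plain Python, with illustrative variable names) recomputes both estimates from the 8.1.a pivot table and shows how many loan acceptors with CC = 1 and Online = 1 the independence assumption would imply, compared with the 49 actually observed.

```python
# Counts taken from the training-data pivot table in 8.1.a.
n_loan1, n_loan0 = 287, 2713          # loan acceptors / non-acceptors
cc1_loan1, online1_loan1 = 88, 175    # acceptors with CC = 1 / Online = 1
cc1_loan0, online1_loan0 = 804, 1594  # non-acceptors with CC = 1 / Online = 1
both_loan1, both_loan0 = 49, 477      # CC = 1 and Online = 1, by loan class

# Exact (count-based) estimate: uses the joint cell directly.
exact = both_loan1 / (both_loan1 + both_loan0)

# Naive Bayes estimate: multiplies the per-predictor conditionals instead.
n = n_loan1 + n_loan0
score1 = (n_loan1 / n) * (cc1_loan1 / n_loan1) * (online1_loan1 / n_loan1)
score0 = (n_loan0 / n) * (cc1_loan0 / n_loan0) * (online1_loan0 / n_loan0)
naive = score1 / (score1 + score0)

# Joint counts implied by conditional independence within each class.
implied_both_loan1 = n_loan1 * (cc1_loan1 / n_loan1) * (online1_loan1 / n_loan1)
implied_both_loan0 = n_loan0 * (cc1_loan0 / n_loan0) * (online1_loan0 / n_loan0)

print(f"exact estimate       : {exact:.4f}")
print(f"naive Bayes estimate : {naive:.4f}")
print(f"acceptors with both  : observed {both_loan1}, implied {implied_both_loan1:.1f}")
print(f"non-acceptors w/ both: observed {both_loan0}, implied {implied_both_loan0:.1f}")
```

The implied joint counts are close to the observed ones, which is why the two estimates differ by less than one percentage point here.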

__8.1.g.__ Which of the entries in this table are needed for computing P(Loan = 1 | CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 | CC = 1,
Online = 1). Compare this to the number you obtained in (e).

In Python, run naive Bayes on the training data. Use data points that match the condition <em>CreditCard=1,Online=1</em> to find the predicted probability for P(Loan=1|CC=1,Online=1).

Change the types of variables to categories and use one-hot-encoding for the independent variables.

In [9]:
train_df = pd.get_dummies(train_df, prefix_sep='_')
train_df['Personal Loan'] = train_df['Personal Loan'].astype('category')
train_df.head()

Unnamed: 0,Personal Loan,Online_0,Online_1,CreditCard_0,CreditCard_1
4522,0,1,0,1,0
2851,0,0,1,1,0
2313,0,0,1,0,1
982,0,1,0,0,1
1164,1,0,1,1,0


In [10]:
predictors = ['Online_0', 'Online_1', 'CreditCard_0', 'CreditCard_1']
nb = MultinomialNB(alpha=0.01)
nb.fit(train_df[predictors], train_df['Personal Loan'])

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

Predict probabilities and check for the probability of "1" in the row where Online = 1 and CreditCard = 1

In [11]:
predProb = nb.predict_proba(train_df.drop(columns=['Personal Loan']))
predicted = pd.concat([train_df, pd.DataFrame(predProb, index=train_df.index)], axis=1)

matches = (predicted.Online_1 == 1) & (predicted.CreditCard_1 == 1)
predicted[matches].head()

Unnamed: 0,Personal Loan,Online_0,Online_1,CreditCard_0,CreditCard_1,0,1
2313,0,0,1,0,1,0.897993,0.102007
1918,1,0,1,0,1,0.897993,0.102007
4506,0,0,1,0,1,0.897993,0.102007
586,0,0,1,0,1,0.897993,0.102007
3591,0,0,1,0,1,0.897993,0.102007


This gives `P(Loan=1|Online=1,CC=1) = 0.1020`, which matches the value computed by hand in (e).
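To see why the fitted model reproduces the hand calculation so closely, the smoothed estimate can be rebuilt from the counts. This is a minimal sketch, assuming scikit-learn's documented `MultinomialNB` estimate `(N_ci + alpha) / (N_c + alpha * n_features)`, where `N_c` is the total feature count in class `c`; the dictionary layout and variable names are illustrative.

```python
alpha = 0.01
n_features = 4  # Online_0, Online_1, CreditCard_0, CreditCard_1

# Per-class record counts and the counts of the two dummies that equal 1
# for a customer with Online = 1 and CreditCard = 1 (from the 8.1.a pivot table).
classes = {
    0: {"n": 2713, "Online_1": 1594, "CreditCard_1": 804},
    1: {"n": 287, "Online_1": 175, "CreditCard_1": 88},
}
n_total = sum(c["n"] for c in classes.values())

scores = {}
for label, c in classes.items():
    # Each record contributes exactly two 1s (one per predictor), so the
    # total feature count in a class is 2 * n.
    denom = 2 * c["n"] + alpha * n_features
    prior = c["n"] / n_total
    theta_online = (c["Online_1"] + alpha) / denom
    theta_cc = (c["CreditCard_1"] + alpha) / denom
    scores[label] = prior * theta_online * theta_cc

posterior_loan1 = scores[1] / sum(scores.values())
print(f"P(Loan=1 | Online=1, CC=1) = {posterior_loan1:.6f}")
```

With one-hot encoding, every class's denominator is roughly twice its record count, so the extra factors cancel when the scores are normalized and only the small smoothing term `alpha` separates the result from the unsmoothed calculation in (e).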

# Problem 8.2 Automobile Accidents.

The file _accidentsFull.csv_ contains information on 42,183 actual automobile accidents in 2001 in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).

Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or
2, and otherwise “no.”

In [12]:
# load the data
accidents_df = pd.read_csv(DATA / "accidentsFull.csv")
accidents_df.head()

Unnamed: 0,HOUR_I_R,ALCHL_I,ALIGN_I,STRATUM_R,WRK_ZONE,WKDY_I_R,INT_HWY,LGTCON_I_R,MANCOL_I_R,PED_ACC_R,...,SUR_COND,TRAF_CON_R,TRAF_WAY,VEH_INVL,WEATHER_R,INJURY_CRASH,NO_INJ_I,PRPTYDMG_CRASH,FATALITIES,MAX_SEV_IR
0,0,2,2,1,0,1,0,3,0,0,...,4,0,3,1,1,1,1,0,0,1
1,1,2,1,0,0,1,1,3,2,0,...,4,0,3,2,2,0,0,1,0,0
2,1,2,1,0,0,1,0,3,2,0,...,4,1,2,2,2,0,0,1,0,0
3,1,2,1,1,0,0,0,3,2,0,...,4,1,2,2,1,0,0,1,0,0
4,1,1,1,0,0,1,0,3,2,0,...,4,0,2,3,1,0,0,1,0,0


In [13]:
accidents_df['INJURY'] = np.where(accidents_df['MAX_SEV_IR']>0, 'yes', 'no')
print(accidents_df.INJURY)

0        yes
1         no
2         no
3         no
4         no
        ... 
42178     no
42179    yes
42180     no
42181     no
42182     no
Name: INJURY, Length: 42183, dtype: object


__8.2.a.__ Using the information in this dataset, if an accident has just been reported and no further information is available, what should the prediction be? (INJURY = Yes or No?) Why?

In [14]:
# proportion of "yes" and "no" in the response variable "Injury"
print(accidents_df['INJURY'].value_counts() / len(accidents_df))
print()

yes    0.508783
no     0.491217
Name: INJURY, dtype: float64



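The proportion of accidents involving an injury is 50.88%, slightly more than half. With no further information, the prediction should therefore be INJURY = Yes, the majority class.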
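The naive rule behind this answer can be written down in a few lines. The sketch below uses the class proportions printed above; the names `proportions`, `naive_prediction`, and `naive_error` are illustrative.

```python
# Class proportions printed above for the full dataset.
proportions = {"yes": 0.508783, "no": 0.491217}

# The naive rule ignores all predictors and always predicts the most frequent class.
naive_prediction = max(proportions, key=proportions.get)

# Every record of the other class is then misclassified.
naive_error = 1 - proportions[naive_prediction]

print(f"naive prediction: INJURY = {naive_prediction}")
print(f"naive rule error: {naive_error:.4f}")
```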

__8.2.b.__ Select the first 12 records in the dataset and look only at the response (INJURY) and the two predictors WEATHER_R and TRAF_CON_R.

In [15]:
accidents_df.head(12)

Unnamed: 0,HOUR_I_R,ALCHL_I,ALIGN_I,STRATUM_R,WRK_ZONE,WKDY_I_R,INT_HWY,LGTCON_I_R,MANCOL_I_R,PED_ACC_R,...,TRAF_CON_R,TRAF_WAY,VEH_INVL,WEATHER_R,INJURY_CRASH,NO_INJ_I,PRPTYDMG_CRASH,FATALITIES,MAX_SEV_IR,INJURY
0,0,2,2,1,0,1,0,3,0,0,...,0,3,1,1,1,1,0,0,1,yes
1,1,2,1,0,0,1,1,3,2,0,...,0,3,2,2,0,0,1,0,0,no
2,1,2,1,0,0,1,0,3,2,0,...,1,2,2,2,0,0,1,0,0,no
3,1,2,1,1,0,0,0,3,2,0,...,1,2,2,1,0,0,1,0,0,no
4,1,1,1,0,0,1,0,3,2,0,...,0,2,3,1,0,0,1,0,0,no
5,1,2,1,1,0,1,0,3,0,0,...,0,2,1,2,1,1,0,0,1,yes
6,1,2,1,0,0,1,1,3,0,0,...,0,2,1,2,0,0,1,0,0,no
7,1,2,1,1,0,1,0,3,0,0,...,0,1,1,1,1,1,0,0,1,yes
8,1,2,1,1,0,1,0,3,0,0,...,0,1,1,2,0,0,1,0,0,no
9,0,2,1,0,0,0,0,3,0,0,...,0,1,1,2,0,0,1,0,0,no


__8.2.b.i.__ Create a pivot table that examines INJURY as a function of the two predictors for these 12 records. Use all three variables in the pivot table as rows/columns.

In [16]:
#accidents1_df = accidents_df.iloc[0:12, :]
#accidents1_df = accidents1_df.loc[:, ['WEATHER_R', 'TRAF_CON_R', 'INJURY']]
#accidents1_df
accidents1_df = accidents_df.head(12)
accidents1_df = accidents1_df[['WEATHER_R', 'TRAF_CON_R', 'INJURY']]
accidents1_df

Unnamed: 0,WEATHER_R,TRAF_CON_R,INJURY
0,1,0,yes
1,2,0,no
2,2,1,no
3,1,1,no
4,1,0,no
5,2,0,yes
6,2,0,no
7,1,0,yes
8,2,0,no
9,2,0,no


In [17]:
# change variable types to appropriate ones
accidents1_df.WEATHER_R = accidents1_df.WEATHER_R.astype('category')
accidents1_df.TRAF_CON_R = accidents1_df.TRAF_CON_R.astype('category')
accidents1_df.INJURY = accidents1_df.INJURY.astype('category')
accidents1_df.dtypes

WEATHER_R     category
TRAF_CON_R    category
INJURY        category
dtype: object

In [18]:
# pivot table. This pivot table shows some 'NaN' values as there are no records in accidents1_df data with those combinations.
accidents1_df.pivot_table(index=['INJURY', 'WEATHER_R'],
                    columns=['TRAF_CON_R'], aggfunc=len)

Unnamed: 0_level_0,TRAF_CON_R,0,1,2
INJURY,WEATHER_R,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
no,1,1.0,1.0,1.0
no,2,5.0,1.0,
yes,1,2.0,,
yes,2,1.0,,


__8.2.b.ii.__ Compute the exact Bayes conditional probabilities of an injury (INJURY = Yes) given the six possible combinations of the predictors.

In [19]:
pd.set_option('display.precision', 4)
# To find P(Injury=yes|WEATHER_R = 1, TRAF_CON_R =0):
# Numerator = (proportion of combination {WEATHER_R =1, TRAF_CON_R = 0} when Injury =  
#               yes) * (proportion of injuries in all cases)
# Denominator = proportion of combination {WEATHER_R =1, TRAF_CON_R = 0} in all cases
numerator1 = 2/3 * 3/12
denominator1 = 3/12
p1 = numerator1 / denominator1
print(p1)

0.6666666666666666


So P(Injury=yes|WEATHER_R = 1, TRAF_CON_R =0) = 0.667. The other probabilities can be calculated in a similar way:

In [20]:
# P(Injury=yes|WEATHER_R = 1, TRAF_CON_R =1) 
numerator2 = 0/3 * 3/12
denominator2 = 1/12
p2 = numerator2/denominator2

# P(Injury=yes| WEATHER_R = 1, TRAF_CON_R =2)
numerator3 = 0/3 * 3/12
denominator3 = 1/12
p3 = numerator3/denominator3

# P(Injury=yes| WEATHER_R = 2, TRAF_CON_R =0)
numerator4 = 1/3 * 3/12
denominator4 = 6/12
p4 = numerator4/denominator4

# P(Injury=yes| WEATHER_R = 2, TRAF_CON_R =1)
numerator5 = 0/3 *3/12
denominator5 = 1/12
p5 = numerator5/denominator5

# P(Injury=yes| WEATHER_R = 2, TRAF_CON_R = 2) is undefined:
# none of the 12 observations has (WEATHER_R = 2, TRAF_CON_R = 2),
# so the denominator of the conditional probability is zero.

print('P(Injury=yes | WEATHER_R = 1, TRAF_CON_R =0) = ', p1)
print('\nP(Injury=yes | WEATHER_R = 1, TRAF_CON_R =1) = ', p2)
print('\nP(Injury=yes | WEATHER_R = 1, TRAF_CON_R =2) = ', p3)
print('\nP(Injury=yes | WEATHER_R = 2, TRAF_CON_R =0) = ', p4)
print('\nP(Injury=yes | WEATHER_R = 2, TRAF_CON_R =1) = ', p5)
print('\nP(Injury=yes | WEATHER_R = 2, TRAF_CON_R =2) is undefined\nNone of the 12 observations has (WEATHER_R = 2, TRAF_CON_R = 2), so the denominator is zero.')

P(Injury=yes | WEATHER_R = 1, TRAF_CON_R =0) =  0.6666666666666666

P(Injury=yes | WEATHER_R = 1, TRAF_CON_R =1) =  0.0

P(Injury=yes | WEATHER_R = 1, TRAF_CON_R =2) =  0.0

P(Injury=yes | WEATHER_R = 2, TRAF_CON_R =0) =  0.16666666666666666

P(Injury=yes | WEATHER_R = 2, TRAF_CON_R =1) =  0.0

P(Injury=yes | WEATHER_R = 2, TRAF_CON_R =2) is undefined
None of the 12 observations has (WEATHER_R = 2, TRAF_CON_R = 2), so the denominator is zero.
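Because the prior cancels between numerator and denominator, each exact Bayes probability reduces to the share of injury records within a predictor combination. The following sketch (plain Python, with the counts copied from the pivot table in 8.2.b.i and illustrative names) computes all six combinations in one pass.

```python
from itertools import product

# Counts from the pivot table in 8.2.b.i, keyed by (INJURY, WEATHER_R, TRAF_CON_R).
# Combinations that do not occur in the 12 records are simply absent.
counts = {
    ("no", 1, 0): 1, ("no", 1, 1): 1, ("no", 1, 2): 1,
    ("no", 2, 0): 5, ("no", 2, 1): 1,
    ("yes", 1, 0): 2, ("yes", 2, 0): 1,
}

exact = {}
for weather, traffic in product([1, 2], [0, 1, 2]):
    n_yes = counts.get(("yes", weather, traffic), 0)
    n_all = n_yes + counts.get(("no", weather, traffic), 0)
    # With no records for a combination, the probability is undefined (None).
    exact[(weather, traffic)] = n_yes / n_all if n_all else None

for (weather, traffic), p in exact.items():
    shown = "undefined" if p is None else f"{p:.3f}"
    print(f"P(Injury=yes | WEATHER_R = {weather}, TRAF_CON_R = {traffic}) = {shown}")
```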


__8.2.b.iii.__ Classify the 12 accidents using these probabilities and a cutoff of 0.5.

In [21]:
accidents1_df["prob_of_injury"] = [0.667, 0.167, 0, 0, 0.667, 0.167, 0.167, 0.667, 0.167, 0.167, 0.167, 0]
accidents1_df

Unnamed: 0,WEATHER_R,TRAF_CON_R,INJURY,prob_of_injury
0,1,0,yes,0.667
1,2,0,no,0.167
2,2,1,no,0.0
3,1,1,no,0.0
4,1,0,no,0.667
5,2,0,yes,0.167
6,2,0,no,0.167
7,1,0,yes,0.667
8,2,0,no,0.167
9,2,0,no,0.167


In [22]:
# classification of 12 accidents using these probabilities and a cutoff of 0.5.
accidents1_df['accident'] = ["Yes" if x > 0.5 else "No" for x in accidents1_df['prob_of_injury']]
accidents1_df

Unnamed: 0,WEATHER_R,TRAF_CON_R,INJURY,prob_of_injury,accident
0,1,0,yes,0.667,Yes
1,2,0,no,0.167,No
2,2,1,no,0.0,No
3,1,1,no,0.0,No
4,1,0,no,0.667,Yes
5,2,0,yes,0.167,No
6,2,0,no,0.167,No
7,1,0,yes,0.667,Yes
8,2,0,no,0.167,No
9,2,0,no,0.167,No


__8.2.b.iv.__ Compute manually the naive Bayes conditional probability of an injury given WEATHER_R = 1 and TRAF_CON_R = 1.

To find P(Injury=yes| WEATHER_R = 1, TRAF_CON_R =1):

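The numerator of the naive Bayes formula for INJURY = yes is (proportion of WEATHER_R = 1 when Injury = yes) \* (proportion of TRAF_CON_R = 1 when Injury = yes) \* (proportion of Injury = yes in all cases). None of the three injury records has TRAF_CON_R = 1, so the numerator, and with it the naive Bayes probability, is zero.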

In [23]:
prob = 2/3 * 0/3 * 3/12
prob

0.0
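For completeness, the normalized naive Bayes probability also needs the corresponding score for INJURY = no. The sketch below works from the counts in the 12 records (3 injury records, 9 non-injury records); the variable names are illustrative.

```python
# Class sizes among the 12 records.
n_yes, n_no = 3, 9
n = n_yes + n_no

# Records with WEATHER_R = 1 and with TRAF_CON_R = 1, by class
# (read off the pivot table in 8.2.b.i).
weather1 = {"yes": 2, "no": 3}
traffic1 = {"yes": 0, "no": 2}

# Naive Bayes scores: prior times the per-predictor conditional proportions.
score_yes = (n_yes / n) * (weather1["yes"] / n_yes) * (traffic1["yes"] / n_yes)
score_no = (n_no / n) * (weather1["no"] / n_no) * (traffic1["no"] / n_no)

# Normalize so that the two class probabilities sum to one.
p_yes = score_yes / (score_yes + score_no)
print(f"score(yes) = {score_yes:.4f}, score(no) = {score_no:.4f}")
print(f"P(Injury=yes | WEATHER_R = 1, TRAF_CON_R = 1) = {p_yes}")
```

Since the score for "yes" is exactly zero, normalization does not change the result.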

__8.2.b.v.__ Run a naive Bayes classifier on the 12 records and two predictors using _scikit-learn_. Check the model output to obtain probabilities and classifications for all 12 records. Compare this to the exact Bayes classification. Are the resulting classifications equivalent? Is the ranking (= ordering) of observations equivalent?

In [24]:
# run naive Bayes model and obtain probabilities and classifications of all 12 records
predictors = ['WEATHER_R', 'TRAF_CON_R']
outcome = 'INJURY'
# fit the model
accidents1_nb = MultinomialNB(alpha=0.01)
accidents1_nb.fit(accidents1_df[predictors], accidents1_df['INJURY'])
# predict probabilities
predProb = accidents1_nb.predict_proba(accidents1_df[predictors])
print('predicted probabilities\n')
print(predProb)
# predict class memberships
print('\npredicted classes\n')
class_pred = accidents1_nb.predict(accidents1_df[predictors])
print(class_pred)

predicted probabilities

[[7.03564216e-01 2.96435784e-01]
 [6.52499618e-01 3.47500382e-01]
 [9.93755543e-01 6.24445695e-03]
 [9.95053326e-01 4.94667443e-03]
 [7.03564216e-01 2.96435784e-01]
 [6.52499618e-01 3.47500382e-01]
 [6.52499618e-01 3.47500382e-01]
 [7.03564216e-01 2.96435784e-01]
 [6.52499618e-01 3.47500382e-01]
 [6.52499618e-01 3.47500382e-01]
 [6.52499618e-01 3.47500382e-01]
 [9.99941348e-01 5.86518326e-05]]

predicted classes

['no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no']


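The classifications (predicted classes) are not the same: exact Bayes classifies records 0, 4, and 7 (WEATHER_R = 1, TRAF_CON_R = 0) as injuries, whereas the naive Bayes model classifies all 12 records as "no". The ranking is not equivalent either: exact Bayes assigns the highest injury probability (0.667) to the combination WEATHER_R = 1, TRAF_CON_R = 0, while the naive Bayes model ranks WEATHER_R = 2, TRAF_CON_R = 0 (0.348) above it (0.296).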
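The comparison can be made explicit by checking both the class labels and the pairwise ordering of the records. This is a small sketch in plain Python; the two lists copy the exact Bayes probabilities from 8.2.b.iii and the rounded second column (INJURY = yes) of the `predict_proba` output above, and the helper names are illustrative.

```python
# Exact Bayes probabilities of INJURY = yes for the 12 records (8.2.b.iii).
exact = [0.667, 0.167, 0, 0, 0.667, 0.167, 0.167, 0.667, 0.167, 0.167, 0.167, 0]
# Naive Bayes probabilities of INJURY = yes (predict_proba output, rounded).
naive = [0.2964, 0.3475, 0.0062, 0.0049, 0.2964, 0.3475,
         0.3475, 0.2964, 0.3475, 0.3475, 0.3475, 0.0001]

cutoff = 0.5
exact_class = ["yes" if p > cutoff else "no" for p in exact]
naive_class = ["yes" if p > cutoff else "no" for p in naive]

# Records on which the two classifications differ.
disagree = [i for i, (a, b) in enumerate(zip(exact_class, naive_class)) if a != b]

# Pairs of records that the two methods order in opposite directions.
discordant = [
    (i, j)
    for i in range(len(exact))
    for j in range(i + 1, len(exact))
    if (exact[i] - exact[j]) * (naive[i] - naive[j]) < 0
]

print("records classified differently:", disagree)
print("number of discordant pairs:", len(discordant))
```

Any discordant pair means the two rankings are not equivalent; here they all involve a record with WEATHER_R = 1, TRAF_CON_R = 0 and a record with WEATHER_R = 2, TRAF_CON_R = 0.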

__8.2.c.__ Let us now return to the entire dataset. Partition the data into training (60%) and validation (40%).

In [25]:
# predictors and outcome
predictors = ['HOUR_I_R', 'ALIGN_I', 'WRK_ZONE', 'WKDY_I_R', 'INT_HWY', 'LGTCON_I_R', 'PROFIL_I_R', 'SPD_LIM',
              'SUR_COND', 'TRAF_CON_R', 'TRAF_WAY', 'WEATHER_R']
outcome = 'INJURY'
# get dummies
X = pd.get_dummies(accidents_df[predictors])
y = accidents_df['INJURY'].astype('category')
classes = list(y.cat.categories)
# partition the data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=1)

__8.2.c.i__ Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis? (Use the data
descriptions page from www.dataminingbook.com ).

All the following predictors are non-specific to the accident. They either describe calendar time or road conditions:
HOUR_I_R, ALIGN_I, WRK_ZONE, WKDY_I_R, INT_HWY, LGTCON_I_R, PROFIL_I_R, SPD_LIM, SUR_COND, TRAF_CON_R, TRAF_WAY and WEATHER_R.

__8.2.c.ii.__ Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.

In [26]:
# fit the model
accidents_nb = MultinomialNB(alpha=0.01)
accidents_nb.fit(X_train, y_train)
# predict probabilities for training and validation sets
predProb_train = accidents_nb.predict_proba(X_train)
predProb_valid = accidents_nb.predict_proba(X_valid)
# predict class memberships for training and validation data
y_train_pred = accidents_nb.predict(X_train)
y_valid_pred = accidents_nb.predict(X_valid)

In [27]:
# confusion matrix
# training
print('training data\n')
classificationSummary(y_train, y_train_pred, class_names=classes)
# validation 
print('\nvalidation data\n')
classificationSummary(y_valid, y_valid_pred, class_names=classes)

training data



Confusion Matrix (Accuracy 0.5291)

       Prediction
Actual   no  yes
    no 4197 8195
   yes 3724 9193

validation data



Confusion Matrix (Accuracy 0.5288)

       Prediction
Actual   no  yes
    no 2838 5491
   yes 2460 6085


In [28]:
#Overall error for the validation set is 47.12%. 
error = 1-0.5288
error

0.47119999999999995

__8.2.c.iii__ What is the overall error for the validation set?

Overall error for the validation set is 47.12%.
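The error rate follows directly from the off-diagonal cells of the validation confusion matrix. A minimal sketch in plain Python, with the matrix stored in an illustrative dictionary:

```python
# Validation confusion matrix shown above, keyed by (actual, predicted).
confusion = {
    ("no", "no"): 2838, ("no", "yes"): 5491,
    ("yes", "no"): 2460, ("yes", "yes"): 6085,
}

total = sum(confusion.values())
# Correct predictions sit on the diagonal (actual == predicted).
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)

accuracy = correct / total
error = 1 - accuracy
print(f"accuracy = {accuracy:.4f}, overall error = {error:.4f}")
```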

__8.2.c.iv.__ What is the percent improvement relative to the naive rule (using the validation set)?

Overall error on the validation set: 0.4712

Naive rule's error: 0.4913

Improvement, measured relative to the naive rule's accuracy of 0.5087: 3.95%

In [29]:
# gain over the naive rule, relative to the naive rule's accuracy (0.5087)
improvement = 100*(0.4913-0.4712)/0.5087
improvement

3.9512482799292323
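One possible variant of this comparison stays entirely within the validation set. The sketch below measures the reduction relative to the naive rule's error rather than its accuracy, and takes the naive rule's error from the validation confusion matrix rather than from the full-data proportion, so its result differs somewhat from the figure above. The variable names are illustrative.

```python
# Validation-set counts taken from the confusion matrix above.
actual_no = 2838 + 5491    # records whose true class is "no"
actual_yes = 2460 + 6085   # records whose true class is "yes"
misclassified = 5491 + 2460

n_valid = actual_no + actual_yes

# The naive rule predicts the majority class ("yes") for every record,
# so it misclassifies exactly the minority-class records.
naive_error = min(actual_no, actual_yes) / n_valid
model_error = misclassified / n_valid

# Reduction in error relative to the naive rule's error.
relative_reduction = 100 * (naive_error - model_error) / naive_error
print(f"naive error {naive_error:.4f}, model error {model_error:.4f}, "
      f"reduction {relative_reduction:.2f}%")
```

Either way, the gain over the naive rule is only a few percent.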

__8.2.c.v.__ Examine the conditional probabilities in the pivot tables. Why do we get a probability of zero for P(INJURY = No | SPD_LIM = 5)?

In [30]:
# consider only required variables
acc_df = accidents_df[['SPD_LIM', 'INJURY']]
# pivot table
acc_df.pivot_table(index=['INJURY'],
                    columns=['SPD_LIM'], aggfunc=len)

SPD_LIM,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75
INJURY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
no,2,11,93,159,2245,1807,3994,1978,3240,844,3306,727,1371,818,126
yes,4,11,90,92,1960,1908,4547,2326,3347,821,3288,931,1344,636,157


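Only 2 of the 20,721 no-injury records have a speed limit of 5 (and only 6 records overall), so the conditional probability P(SPD_LIM = 5 | INJURY = no) used by naive Bayes is about 0.0001, which is almost zero.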
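The two conditional probabilities involved are easy to confuse, so it helps to compute both from the pivot table. A minimal sketch in plain Python, with the counts copied from the table above and illustrative variable names:

```python
# Counts from the pivot table above, in order of SPD_LIM = 5, 10, ..., 75.
no_counts = [2, 11, 93, 159, 2245, 1807, 3994, 1978, 3240, 844,
             3306, 727, 1371, 818, 126]
yes_counts = [4, 11, 90, 92, 1960, 1908, 4547, 2326, 3347, 821,
              3288, 931, 1344, 636, 157]

# Conditional probability that naive Bayes multiplies into the "no" score.
p_spd5_given_no = no_counts[0] / sum(no_counts)

# Count-based share of no-injury accidents among the SPD_LIM = 5 records.
p_no_given_spd5 = no_counts[0] / (no_counts[0] + yes_counts[0])

print(f"P(SPD_LIM = 5 | INJURY = no) = {p_spd5_given_no:.6f}")
print(f"P(INJURY = no | SPD_LIM = 5) = {p_no_given_spd5:.4f}")
```

The first quantity is tiny because speed limit 5 is rare among no-injury records; the second rests on only six records, so it is not a reliable estimate either.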