In [1]:
# from preamble import *
# %matplotlib inline
import pandas as pd

### Naive Bayes classifier

- is a probabilistic method that uses the probability of each attribute value occurring within each (target) class to make a prediction.


- simplifies the calculation of probabilities by assuming that, given the class value, each attribute is independent of all other attributes. This is a strong assumption (hence "naive"), but it results in a fast and effective method.


- The probability of an attribute value given a class value is called the conditional probability. By multiplying the conditional probabilities of all attributes for a given class value, and then multiplying by the probability of that class itself, we get a score proportional to the probability of a data instance belonging to that class.


- To make a prediction, we calculate this score for each class and select the class value with the highest one.


- Like the previous samples, we are going to predict whether a row belongs to one of the two classes - contained within the `Malignant` column.
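
The bullets above can be condensed into one formula. As a sketch, write $C$ for the class and $x_1, \dots, x_n$ for the attribute values of one row (this notation is introduced here for illustration and does not come from the notebook itself):

```latex
P(C = c \mid x_1, \dots, x_n) \;\propto\; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c),
\qquad
\hat{c} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
```

The product over attributes is where the independence assumption enters: each attribute contributes its own conditional probability, with no interaction terms.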

In [2]:
cancer_df = pd.read_csv('data/breast_cancer_wisconsin/full.csv', index_col=0)

In [3]:
cancer_df.head()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Malignant
1000025,5,1,1,1,2,1,3,1,1,0
1002945,5,4,4,5,7,10,3,2,1,0
1015425,3,1,1,1,2,2,3,1,1,0
1016277,6,8,8,1,3,4,3,7,1,0
1017023,4,1,1,3,2,1,3,1,1,0


#### How does Naive Bayes work?

1. Calculating class probabilities

In [11]:
# list the distinct class labels
print("Unique classes: ", cancer_df.Malignant.unique())

Unique classes:  [0 1]


In [12]:
print("probability of patient having a malignant case (Malignant = 1): ", round(cancer_df[cancer_df.Malignant == 1].shape[0]/cancer_df.shape[0] * 100, 2))

probability of patient having a malignant case (Malignant = 1):  34.48


In [13]:
print("probability of patient not having a malignant case (Malignant = 0): ", round(cancer_df[cancer_df.Malignant == 0].shape[0]/cancer_df.shape[0] * 100, 2))

probability of patient not having a malignant case (Malignant = 0):  65.52
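
Both class probabilities can also be obtained in a single call. The sketch below uses a small made-up `Malignant` column (`toy_df` and its values are hypothetical stand-ins for `cancer_df`) and pandas' `value_counts(normalize=True)`. Note that it returns fractions between 0 and 1, whereas the cells above print percentages.

```python
import pandas as pd

# Hypothetical stand-in for cancer_df: 10 samples, 4 of them malignant
toy_df = pd.DataFrame({"Malignant": [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]})

# Relative frequency of each class label, i.e. the class priors P(Malignant=c)
priors = toy_df["Malignant"].value_counts(normalize=True).sort_index()
print(priors)
```

On the real data, the same call on `cancer_df["Malignant"]` should agree with the two percentages printed above after dividing them by 100.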


2. Calculating conditional probabilities

The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value.
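
Written as a formula, and assuming $N(\cdot)$ denotes a count of rows in the data (a notation introduced here only for illustration), the estimate described above is:

```latex
P(x_i = v \mid C = c) \;\approx\; \frac{N(x_i = v,\; C = c)}{N(C = c)}
```

For example, $P(\text{Normal Nucleoli} = 1 \mid \text{Malignant} = 0)$ is the number of benign rows whose Normal Nucleoli value is 1, divided by the total number of benign rows.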

In [16]:
# For a moment let's assume that the columns are actually categories encoded as integers
# For example, in the case of the Normal Nucleoli column
print("Number of unique values in Normal Nucleoli column: ", cancer_df['Normal Nucleoli'].nunique())

Number of unique values in Normal Nucleoli column:  10


In [17]:
cancer_df['Normal Nucleoli'].unique()

array([ 1,  2,  7,  4,  5,  3, 10,  6,  9,  8])

For the "category" 1 of Normal Nucleoli, the conditional probability given a benign sample (Malignant = 0) would be:

In [29]:
numerator = cancer_df[(cancer_df.Malignant== 0) & (cancer_df['Normal Nucleoli'] == 1)].shape[0]
denominator = cancer_df[(cancer_df.Malignant== 0)].shape[0]
print("P(Normal Nucleoli=1|Malignant=0) OR Conditional probability of Normal Nucleoli being of the class 1 and Malignancy being 0 is: ", numerator/float(denominator) * 100)

P(Normal Nucleoli=1|Malignant=0) OR Conditional probability of Normal Nucleoli being of the class 1 and Malignancy being 0 is:  87.77292576419214


In [30]:
# Let's repeat the same for the other "types" of Normal Nucleoli, when the sample is benign (Malignant=0)
unique_values = sorted(cancer_df['Normal Nucleoli'].unique())
# the number of benign samples does not change inside the loop, so compute it once
denominator = cancer_df[cancer_df.Malignant == 0].shape[0]
for x_nucleoli in unique_values:
    numerator = cancer_df[(cancer_df.Malignant == 0) & (cancer_df['Normal Nucleoli'] == x_nucleoli)].shape[0]
    prob = numerator / denominator * 100
    print(f"P(Normal Nucleoli={x_nucleoli}|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' {x_nucleoli} and Malignancy being 0 is: {prob}")

P(Normal Nucleoli=1|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 1 and Malignancy being 0 is: 87.77292576419214
P(Normal Nucleoli=2|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 2 and Malignancy being 0 is: 6.550218340611353
P(Normal Nucleoli=3|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 3 and Malignancy being 0 is: 2.6200873362445414
P(Normal Nucleoli=4|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 4 and Malignancy being 0 is: 0.21834061135371177
P(Normal Nucleoli=5|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 5 and Malignancy being 0 is: 0.43668122270742354
P(Normal Nucleoli=6|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 6 and Malignancy being 0 is: 0.8733624454148471
P(Normal Nucleoli=7|Malignant=0) OR Conditional probability of Normal Nucleoli being of the 'class' 7 and Malign

Let's repeat the same process for a malignant sample where Malignant = 1

In [25]:
# Same calculation as before, but now for malignant samples (Malignant=1)
unique_values = sorted(cancer_df['Normal Nucleoli'].unique())
# the number of malignant samples does not change inside the loop, so compute it once
denominator = cancer_df[cancer_df.Malignant == 1].shape[0]
for x_nucleoli in unique_values:
    numerator = cancer_df[(cancer_df.Malignant == 1) & (cancer_df['Normal Nucleoli'] == x_nucleoli)].shape[0]
    prob = numerator / denominator * 100
    print(f"Conditional probability of Normal Nucleoli being of the class {x_nucleoli} and Malignancy being 1 is: {prob}")

Conditional probability of Normal Nucleoli being of the class 1 and Malignancy being 1 is: 17.012448132780083
Conditional probability of Normal Nucleoli being of the class 2 and Malignancy being 1 is: 2.4896265560165975
Conditional probability of Normal Nucleoli being of the class 3 and Malignancy being 1 is: 13.278008298755188
Conditional probability of Normal Nucleoli being of the class 4 and Malignancy being 1 is: 7.053941908713693
Conditional probability of Normal Nucleoli being of the class 5 and Malignancy being 1 is: 7.053941908713693
Conditional probability of Normal Nucleoli being of the class 6 and Malignancy being 1 is: 7.468879668049793
Conditional probability of Normal Nucleoli being of the class 7 and Malignancy being 1 is: 5.809128630705394
Conditional probability of Normal Nucleoli being of the class 8 and Malignancy being 1 is: 8.29875518672199
Conditional probability of Normal Nucleoli being of the class 9 and Malignancy being 1 is: 6.224066390041494
Conditional proba
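
The two loops above build the conditional probabilities one value at a time. As a possible shortcut, `pd.crosstab` with `normalize="index"` produces the whole table in one call. The sketch below runs on a tiny made-up frame (`toy_df` and its values are hypothetical, not the real data) and returns fractions rather than percentages.

```python
import pandas as pd

# Hypothetical stand-in for cancer_df with a single attribute column
toy_df = pd.DataFrame({
    "Malignant":       [0, 0, 0, 0, 1, 1, 1, 1],
    "Normal Nucleoli": [1, 1, 1, 2, 1, 3, 3, 10],
})

# Rows: class value; columns: attribute value.
# normalize="index" divides each row by its total, giving P(attribute=v | class=c)
cond_prob = pd.crosstab(toy_df["Malignant"], toy_df["Normal Nucleoli"], normalize="index")
print(cond_prob)
```

Each row of `cond_prob` sums to 1. Attribute values that never occur with a class get a probability of 0 (here `cond_prob.loc[0, 3]`), which would zero out the whole product for that class.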

### Predicting with the learnt data

For a new instance with `Normal Nucleoli = new_x`, we first calculate an unnormalized score for each class (the conditional probability times the class probability):

new_nn_non_malignant = `P(Normal Nucleoli=new_x | Malignant=0)` * `P(Malignant=0)`

new_nn_malignant = `P(Normal Nucleoli=new_x | Malignant=1)` * `P(Malignant=1)`

Naive Bayes then normalizes these scores into the probability of Malignant being 0 or 1 and chooses the class with the maximum value:

P(Malignant=0 | Normal Nucleoli=new_x) = `new_nn_non_malignant` / (`new_nn_non_malignant` + `new_nn_malignant`)

P(Malignant=1 | Normal Nucleoli=new_x) = `new_nn_malignant` / (`new_nn_non_malignant` + `new_nn_malignant`)

The class with the max probability above will be the new predicted class.
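
As a worked example of these steps, the sketch below plugs in the values printed earlier for `Normal Nucleoli = 1`, converted from percentages to fractions and rounded to four decimals. The variable names follow the formulas above; `evidence` and `predicted_class` are names introduced here for illustration.

```python
# Values taken from the outputs above (percentages converted to fractions, rounded)
p_nn1_given_benign = 0.8777      # P(Normal Nucleoli=1 | Malignant=0)
p_nn1_given_malignant = 0.1701   # P(Normal Nucleoli=1 | Malignant=1)
p_benign = 0.6552                # P(Malignant=0)
p_malignant = 0.3448             # P(Malignant=1)

# Unnormalized scores: conditional probability times class probability
new_nn_non_malignant = p_nn1_given_benign * p_benign
new_nn_malignant = p_nn1_given_malignant * p_malignant

# Normalize so that the two posteriors sum to 1
evidence = new_nn_non_malignant + new_nn_malignant
posterior_benign = new_nn_non_malignant / evidence
posterior_malignant = new_nn_malignant / evidence

# Pick the class with the larger posterior
predicted_class = 0 if posterior_benign >= posterior_malignant else 1
print(f"P(Malignant=0 | Normal Nucleoli=1) = {posterior_benign:.3f}")
print(f"P(Malignant=1 | Normal Nucleoli=1) = {posterior_malignant:.3f}")
print("Predicted class:", predicted_class)
```

With these numbers the benign class wins, with a posterior of roughly 0.91. When several attributes are used, their conditional probabilities are multiplied together before the class probability is applied.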

### Using sklearn

In [37]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [35]:
train, target = cancer_df.drop(columns='Malignant'), cancer_df.Malignant
# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
train_x, test_x, train_y, test_y = train_test_split(train, target, train_size=0.6)

In [36]:
# Create a Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model on the training features and training labels
model.fit(train_x, train_y)

GaussianNB(priors=None, var_smoothing=1e-09)

In [39]:
# Predict on the held-out test features and compare with the true test labels
predicted = model.predict(test_x)
accuracy_score(test_y, predicted)

#woohoo

0.9607142857142857
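
The manual walk-through treated each column as a category, whereas `GaussianNB` models each column as a normally distributed number. As a rough sketch of how the two views could be compared, the block below fits both `GaussianNB` and scikit-learn's `CategoricalNB` on synthetic integer data. The data, the variable names, and `random_state=0` are assumptions made for illustration; nothing here implies a result for the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic stand-in data: 3 integer features on a 1-10 scale.
# Class 0 draws low values (1-5), class 1 draws higher values (4-10).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
low = rng.integers(1, 6, size=(400, 3))
high = rng.integers(4, 11, size=(400, 3))
X = np.where(y[:, None] == 1, high, low)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y
)

# min_categories=11 reserves categories 0..10 for every feature, so a value
# that is missing from the training split does not break prediction
models = {
    "GaussianNB": GaussianNB(),
    "CategoricalNB": CategoricalNB(min_categories=11),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, round(scores[name], 3))
```

To try this on the notebook's data, `X` and `y` would be replaced by the feature columns and the `Malignant` column of `cancer_df`, provided every feature holds non-negative integers.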