# Bayes Classifier Lab

### Introduction

We've now learned about Bayes' theorem, and how it allows us to calculate the probability of a hypothesis given some evidence.  The key point is that we start from our "prior", the base probability of the event occurring, and update it as we incorporate new evidence.

$P(H|E) = \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)}$
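
As a quick illustration of the formula, here is a minimal sketch in plain Python. The function name `posterior` and the numbers passed to it are hypothetical, chosen only to show how a small prior tempers strong evidence.

```python
def posterior(prior, likelihood_h, likelihood_hc):
    """Return P(H|E) from P(H), P(E|H) and P(E|H^c) via Bayes' theorem."""
    joint_h = prior * likelihood_h            # P(H) * P(E|H)
    joint_hc = (1 - prior) * likelihood_hc    # P(H^c) * P(E|H^c)
    return joint_h / (joint_h + joint_hc)

# Hypothetical numbers, chosen only for illustration.
p = posterior(prior=0.01, likelihood_h=0.9, likelihood_hc=0.05)
print(round(p, 4))  # 0.1538
```

Even though the evidence is 18 times more likely under the hypothesis, the posterior stays near 15% because the prior is only 1%.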

In this lesson, we'll see how we can apply Bayes' theorem to our breast cancer dataset to classify a slide as having cancer present or not.

### Loading our Data

Let's begin by loading our cancer dataset.

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer = load_breast_cancer()

In [6]:
X = pd.DataFrame(cancer.data[:, :1], columns = cancer['feature_names'][:1])  # we only take the first feature (mean radius)
y = pd.Series(cancer.target != 1, dtype = 'int')

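> Remember that in the original dataset, malignant (cancerous) slides are labeled 0 and benign slides are labeled 1. The expression `cancer.target != 1` above flips this, so in `y` a value of 1 means cancerous.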

In [11]:
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')

In [8]:
y.value_counts()

0    357
1    212
dtype: int64
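
The relabeling can be checked on a tiny example. This is a sketch on a made-up array (`toy_target` and `toy_y` are hypothetical names), not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cancer.target, where 0 = malignant and 1 = benign.
toy_target = np.array([0, 1, 1, 0, 1])

# `!= 1` is True for malignant rows; casting to int makes malignant the positive class.
toy_y = pd.Series(toy_target != 1, dtype='int')
print(toy_y.tolist())  # [1, 0, 0, 1, 0]
```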

The feature that we selected, and the one we'll be using to predict cancer, is the mean radius.

In [3]:
X[:2]

|   | mean radius |
|---|-------------|
| 0 | 17.99 |
| 1 | 20.57 |


Now let's split our data.

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 2, stratify = y)

### Using Bayes Theorem

Remember that our formula is the following:

$P(H|E) = \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)}$

* The prior: $P(H)$
* The likelihood: $P(E|H)$

1. The **prior** is just the probability of cancer in general

> Begin by calculating, from the training labels, the probability of both the hypothesis and the complement of the hypothesis.

In [36]:
P_h = None
P_h
# 0.37362637362637363

0.37362637362637363

In [50]:
P_h_c = None
P_h_c

# 0.6263736263736264

0.6263736263736264
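
One possible way to think about the prior, sketched in plain Python: with 0/1 labels, the prior is just the fraction of ones. The list `toy_labels` below is a hypothetical stand-in for `y_train`, built to match the class counts this lab reports for the training split (170 cancerous, 285 non-cancerous).

```python
# Stand-in for y_train: 170 cancerous (1) and 285 non-cancerous (0) labels.
toy_labels = [1] * 170 + [0] * 285

# The mean of a 0/1 label vector is the fraction of positives, i.e. the prior P(H).
prior_h = sum(toy_labels) / len(toy_labels)
prior_hc = 1 - prior_h  # P(H^c), the complement

print(prior_h, prior_hc)
```

With a pandas Series of labels, the same idea is usually written as the mean of the Series.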

2. The **likelihood**, $P(E|H)$.  That is, assuming the hypothesis is true, what is the probability of seeing the evidence?

So let's calculate the likelihood of seeing the evidence, by calculating the probability of seeing our feature value, considering the hypothesis.

> To calculate this, let's first select those events that are cancerous.

In [18]:
cancerous_df = None

In [19]:
cancerous_df.shape

# (170, 1)

(170, 1)

> And select those that are not cancerous.

In [28]:
non_cancerous_df = None
non_cancerous_df.shape

(285, 1)
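
Selecting rows by class is typically done with a boolean mask. The sketch below uses made-up data (`toy_X`, `toy_y` and the derived names are hypothetical) to show the pattern; it is one approach, not necessarily the intended solution.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train (values are made up).
toy_X = pd.DataFrame({'mean radius': [17.99, 11.42, 20.57, 12.45]})
toy_y = pd.Series([1, 0, 1, 0])

# A boolean mask built from the labels keeps only the matching rows of the features.
toy_cancerous = toy_X[toy_y == 1]
toy_non_cancerous = toy_X[toy_y == 0]

print(toy_cancerous.shape, toy_non_cancerous.shape)  # (2, 1) (2, 1)
```

The mask is aligned on the index, so the features and labels need to share the same index, as they do after `train_test_split`.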

### Updating the Likelihood

Now, to get a sense of the likelihood of observing a feature value given the hypothesis, let's plot the distribution of the mean radius for the cancerous observations and for the non-cancerous observations.

> To do so, plot a histogram of each.

<img src="./cancerous-not.png" width="40%">

So we can see there is a difference in likelihood, assuming cancerous or not.  
> Of course our classifier also needs to take into consideration the priors.

Let's continue working with the likelihoods.  Find the mean and standard deviation of the mean radius given a slide with cancer.

In [24]:


# 	mean radius
# mean	17.491118
# std	3.352856

|      | mean radius |
|------|-------------|
| mean | 17.491118 |
| std  | 3.352856 |


Let's also find the mean and standard deviation of the mean radius when there is no cancer present.

In [27]:


# 	mean radius
# mean	12.046354
# std	1.763033

|      | mean radius |
|------|-------------|
| mean | 12.046354 |
| std  | 1.763033 |


So we can see that the non-cancerous cells do appear to be smaller.
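
A summary like the ones above can be produced in a single call. The sketch below runs on made-up values (`toy_class_df` is a hypothetical name), and `agg` is just one of several pandas methods that would work here.

```python
import pandas as pd

# Made-up radii for one class; in the lab this would be one of the per-class DataFrames.
toy_class_df = pd.DataFrame({'mean radius': [17.0, 19.0, 21.0]})

# agg returns one row per statistic; pandas' std is the sample standard deviation (ddof=1).
toy_stats = toy_class_df.agg(['mean', 'std'])
print(toy_stats)
```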

### Finding the likelihoods

Now it's time to find the probability of the hypothesis given the evidence.  Doing this involves finding two joint probabilities: the probability of the evidence together with cancer, $P(H)*P(E|H)$, and the probability of the evidence together with no cancer, $P(H^c)*P(E|H^c)$.

$P(H|E) = \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)}$

To do that, we first initialize normal random variables for the mean radius of cancerous and non-cancerous cells, using the parameters we found above.

In [31]:
import scipy.stats  # import the stats submodule explicitly; a bare `import scipy` may not expose it
# fill in the correct parameters below
rvs_cancerous = scipy.stats.norm()
rvs_not_cancerous = scipy.stats.norm()
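
For reference, `scipy.stats.norm` takes the mean as `loc` and the standard deviation as `scale`. The sketch below illustrates the same idea with the standard library's `statistics.NormalDist` as a stand-in, so it is not the scipy call itself. The variable names and the query value `x` are hypothetical; the parameters are the rounded per-class values reported above.

```python
from statistics import NormalDist

# Parameters are the rounded per-class mean and standard deviation reported above.
toy_cancerous = NormalDist(mu=17.491118, sigma=3.352856)
toy_not_cancerous = NormalDist(mu=12.046354, sigma=1.763033)

# pdf gives a density (not a probability); it plays the role of P(E|H) in Bayes' theorem.
x = 20.0  # a hypothetical mean radius
print(toy_cancerous.pdf(x) > toy_not_cancerous.pdf(x))  # True
```

A radius of 20 sits well within the cancerous distribution but more than four standard deviations above the non-cancerous mean, which is why its density is so much higher under the cancerous hypothesis.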

Ok, so now to make predictions on our test set, we can calculate the probability of cancerous and not-cancerous given the evidence.

> First calculate the likelihood of the evidence assuming that the data is cancerous.  Because the mean radius is continuous, this is the value of the normal density (`pdf`) at each observation.  Calculate it on the test data.

In [32]:
P_e_hs = rvs_cancerous.pdf()
P_e_hs[:3]
# array([0.0929175 , 0.06792287, 0.03471501])

array([0.0929175 , 0.06792287, 0.03471501])

> And then calculate the probability of the evidence given the hypothesis of not-cancerous.

In [34]:
P_e_hcs = rvs_not_cancerous.pdf()
P_e_hcs[:3]
# array([0.04901764, 0.12709726, 0.22505751])

array([0.04901764, 0.12709726, 0.22505751])

Now we have all of the variables we need to make predictions on the test data.  Go for it.  Use Bayes' theorem to predict the probability that each test observation is cancerous.

In [39]:
predictions = None

In [40]:
predictions[:10]

# array([0.53067173, 0.24172038, 0.08425619, 0.07230055, 0.08806994,
#        0.22694748, 0.10916146, 0.98915288, 0.07352632, 0.05812051])

array([0.53067173, 0.24172038, 0.08425619, 0.07230055, 0.08806994,
       0.22694748, 0.10916146, 0.98915288, 0.07352632, 0.05812051])
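
As a sanity check, the first three predictions can be reproduced by hand from the priors and likelihoods printed earlier in this lab. This is a plain-Python sketch with hypothetical variable names; with NumPy arrays the same expression works elementwise without the loop.

```python
# Priors and likelihoods copied from the outputs shown earlier in this lab.
p_h, p_hc = 0.37362637362637363, 0.6263736263736264
lik_h = [0.0929175, 0.06792287, 0.03471501]    # P(E|H) for three test rows
lik_hc = [0.04901764, 0.12709726, 0.22505751]  # P(E|H^c) for the same rows

# Bayes' theorem applied row by row: P(H)P(E|H) / (P(H)P(E|H) + P(H^c)P(E|H^c)).
toy_predictions = [
    p_h * e_h / (p_h * e_h + p_hc * e_hc)
    for e_h, e_hc in zip(lik_h, lik_hc)
]
print([round(p, 4) for p in toy_predictions])  # [0.5307, 0.2417, 0.0843]
```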

Let's see how this compares to the Gaussian naive Bayes classifier from sklearn.

In [49]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

gnb.predict_proba(X_test)[:10, 1]

# array([0.53177892, 0.24165722, 0.08384439, 0.07189382, 0.08765794,
#        0.22683113, 0.10875936, 0.98936907, 0.07311875, 0.05772833])

array([0.53177892, 0.24165722, 0.08384439, 0.07189382, 0.08765794,
       0.22683113, 0.10875936, 0.98936907, 0.07311875, 0.05772833])

Not bad at all.
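
The small gap between the two sets of probabilities may come from how the variance is estimated. pandas' `.std()` defaults to the sample formula (`ddof=1`), whereas `GaussianNB` estimates each class variance with the population formula and adds a tiny smoothing term (`var_smoothing`). Assuming the parameters in this lab came from pandas, the sketch below converts the reported sample standard deviations to population ones to show how small the shift is. The dictionary names are hypothetical.

```python
from math import sqrt

# Sample standard deviations (ddof=1) reported earlier, with the training class sizes.
sample_std = {'cancerous': (3.352856, 170), 'non-cancerous': (1.763033, 285)}

# Converting a sample std (divide by n-1) to a population std (divide by n).
population_std = {
    name: std * sqrt((n - 1) / n)
    for name, (std, n) in sample_std.items()
}
print({name: round(s, 4) for name, s in population_std.items()})
```

Each standard deviation shrinks by well under one percent, which is consistent with predictions that differ only in the third decimal place.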

### Resources

[Sklearn Naive Bayes - Gaussian vs Multinomial](https://www.quora.com/What-is-the-difference-between-the-the-Gaussian-Bernoulli-Multinomial-and-the-regular-Naive-Bayes-algorithms)

[Machine learning mastery](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)