## Naive Bayesian Classifier

Probability-based learning

### 1. Conditional Probability

$P(A|B) = \large \frac{P(A \cap B)}{P(B)}$ 

$P(B|A) = \large \frac{P(A \cap B)}{P(A)}$

From these two definitions, we can infer the following:<br>
$P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)$

### 2. Bayes's Theorem

$P(A|B) = \large \frac{P(A \cap B)}{P(B)} = \large \frac{P(A)P(B|A)}{P(B)}$


### 2.1 Example

Suppose we have two bowls of cookies. The first bowl contains 30 vanilla cookies and 10 chocolate cookies. The second bowl contains 20 of each. We pick a bowl at random and draw a cookie from it, which turns out to be vanilla. What is the probability that this cookie was drawn from the first bowl?

$P(B1|vanilla) = \large \frac{P(vanilla|B1)P(B1)}{P(vanilla)}$

$P(choco) = \frac{3}{8}$

$P(vanilla) = \frac{5}{8}$

$P(B1) = \frac{1}{2}$

$P(B2) = \frac{1}{2}$

$P(vanilla|B1) = \frac{3}{4}$

Therefore, $P(B1|vanilla) = \large \frac{\frac{3}{4}\frac{1}{2}}{\frac{5}{8}} = \frac{3}{5}$
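
The arithmetic above can be checked with exact fractions. This is a minimal sketch: the dictionary names `prior` and `p_vanilla_given` are illustrative and not part of the original notebook, and $P(vanilla)$ is obtained from the law of total probability over the two bowls.

```python
from fractions import Fraction

# Each bowl is equally likely to be chosen
prior = {"B1": Fraction(1, 2), "B2": Fraction(1, 2)}

# P(vanilla | bowl): 30 of 40 cookies in bowl 1, 20 of 40 in bowl 2
p_vanilla_given = {"B1": Fraction(30, 40), "B2": Fraction(20, 40)}

# Law of total probability: P(vanilla) = sum over bowls of P(bowl) * P(vanilla | bowl)
p_vanilla = sum(prior[b] * p_vanilla_given[b] for b in prior)

# Bayes's theorem: P(B1 | vanilla)
posterior_b1 = prior["B1"] * p_vanilla_given["B1"] / p_vanilla

print(p_vanilla, posterior_b1)  # 5/8 3/5
```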

## 3. Single Variable Bayes Classifier

In [2]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [2]:
viagra_span = {'viagra': [1,0,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,1],
              'spam': [1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,1]}

In [4]:
df = pd.DataFrame(viagra_span, columns=['viagra', 'spam'])
df.head()

| | viagra | spam |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |


In [5]:
#convert to matrix
np_data = df.values
np_data

array([[1, 1],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 1],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 0],
       [0, 0],
       [0, 0],
       [1, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 1],
       [0, 1],
       [1, 1]])

In [11]:
#find P(viagra), P(spam), probability of viagra intersection spam
p_viagra = sum(np_data[:, 0] == 1) / len(np_data)
p_spam = sum(np_data[:, 1] == 1) / len(np_data)
p_v_cap_s = sum((np_data[:, 0] == 1) & (np_data[:, 1] == 1)) / len(np_data)
p_n_v_cap_s = sum((np_data[:, 0] == 0) & (np_data[:, 1] == 1)) / len(np_data)

In [12]:
#P(spam|viagra)
p_v_cap_s / p_viagra

0.5
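
As a cross-check, the same value can be reached through Bayes's theorem instead of the intersection. The sketch below rebuilds the two columns from the dictionary defined earlier so that it runs on its own; the variable names are illustrative assumptions.

```python
import numpy as np

# Same 20 observations as in the viagra/spam table
viagra = np.array([1,0,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,1])
spam = np.array([1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,1])

p_viagra = viagra.mean()                     # P(viagra)
p_spam = spam.mean()                         # P(spam)
p_viagra_given_spam = viagra[spam == 1].mean()  # P(viagra | spam)

# Bayes's theorem: P(spam | viagra) = P(viagra | spam) P(spam) / P(viagra)
p_spam_given_viagra_bayes = p_viagra_given_spam * p_spam / p_viagra

# Direct definition: P(spam and viagra) / P(viagra)
p_spam_given_viagra_direct = ((viagra == 1) & (spam == 1)).mean() / p_viagra

print(np.isclose(p_spam_given_viagra_bayes, p_spam_given_viagra_direct))
```

Both routes should agree, since Bayes's theorem is only a rearrangement of the definition of conditional probability.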

## 4. Naive Bayes Classifier

In [3]:
data_url = "./fraud.csv"
df = pd.read_csv(data_url, sep=',')
df.head()

| | ID | History | CoApplicant | Accommodation | Fraud |
|---|---|---|---|---|---|
| 0 | 1 | current | none | own | True |
| 1 | 2 | paid | none | own | False |
| 2 | 3 | paid | none | own | False |
| 3 | 4 | paid | guarantor | rent | True |
| 4 | 5 | arrears | none | own | False |


In [4]:
del df["ID"]
Y_data = df.pop("Fraud")
Y_data = Y_data.values  #as_matrix() is deprecated and was removed in pandas 1.0
Y_data

array([ True, False, False,  True, False,  True, False, False, False,
        True, False,  True,  True, False, False, False, False, False,
       False, False])

In [6]:
df.head()

| | History | CoApplicant | Accommodation |
|---|---|---|---|
| 0 | current | none | own |
| 1 | paid | none | own |
| 2 | paid | none | own |
| 3 | paid | guarantor | rent |
| 4 | arrears | none | own |


In [7]:
#use one-hot encoding
x_df = pd.get_dummies(df)
x_df.head()

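| | History_arrears | History_current | History_none | History_paid | CoApplicant_coapplicant | CoApplicant_guarantor | CoApplicant_none | Accommodation_free | Accommodation_own | Accommodation_rent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |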


In [8]:
x_data = x_df.values  #as_matrix() is deprecated and was removed in pandas 1.0
x_data

array([[0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]], dtype=uint8)

In [9]:
P_Y_True = sum(Y_data==True) / len(Y_data)
P_Y_False = 1 - P_Y_True

P_Y_True,P_Y_False

(0.3, 0.7)

In [10]:
Y_data

array([ True, False, False,  True, False,  True, False, False, False,
        True, False,  True,  True, False, False, False, False, False,
       False, False])

In [11]:
#get indices where Y_data is True and where Y_data is False
ix_Y_True = np.where(Y_data)
ix_Y_False = np.where(Y_data==False)

ix_Y_True, ix_Y_False

((array([ 0,  3,  5,  9, 11, 12]),),
 (array([ 1,  2,  4,  6,  7,  8, 10, 13, 14, 15, 16, 17, 18, 19]),))

In [19]:
p_x_y_true = (x_data[ix_Y_True].sum(axis=0)) / sum(Y_data==True)
p_x_y_false = (x_data[ix_Y_False].sum(axis=0)) / sum(Y_data==False)

p_x_y_true, p_x_y_false

(array([0.16666667, 0.5       , 0.16666667, 0.16666667, 0.        ,
        0.16666667, 0.83333333, 0.        , 0.66666667, 0.33333333]),
 array([0.42857143, 0.28571429, 0.        , 0.28571429, 0.14285714,
        0.        , 0.85714286, 0.07142857, 0.78571429, 0.14285714]))

In [27]:
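x_test = np.array([0,1,0,0,0,1,0,0,1,0])

#naive Bayes score: prior times the product of P(x_i|Y) over the features present in x_test
p_y_true_test = P_Y_True * np.prod(p_x_y_true[x_test == 1])
p_y_false_test = P_Y_False * np.prod(p_x_y_false[x_test == 1])

round(p_y_true_test, 4), round(p_y_false_test, 4)

(0.0167, 0.0)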
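
Under the naive independence assumption, a class score is the class prior multiplied by the conditional probability of each observed feature, and the scores of all classes can be normalized to sum to 1. The following is a minimal sketch of that computation. It assumes the per-class feature counts recovered from the conditional probabilities printed above (probability times class size); `counts_true`, `counts_false`, and `class_score` are illustrative names, not part of the original notebook.

```python
import numpy as np

# Per-class counts of each one-hot feature, recovered from the conditional
# probabilities printed above (6 fraud rows, 14 non-fraud rows)
counts_true = np.array([1, 3, 1, 1, 0, 1, 5, 0, 4, 2])
counts_false = np.array([6, 4, 0, 4, 2, 0, 12, 1, 11, 2])
n_true, n_false = 6, 14

# Test row: History_current, CoApplicant_guarantor, Accommodation_own
x_test = np.array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0], dtype=bool)

def class_score(counts, n_class, n_total, x):
    """Prior times the product of P(x_i = 1 | class) over the active features."""
    prior = n_class / n_total
    likelihood = counts / n_class
    return prior * np.prod(likelihood[x])

score_true = class_score(counts_true, n_true, n_true + n_false, x_test)
score_false = class_score(counts_false, n_false, n_true + n_false, x_test)

# Normalize so the two scores sum to 1 and can be read as posteriors
total = score_true + score_false
posterior_true = score_true / total
posterior_false = score_false / total
print(posterior_true, posterior_false)
```

In this data the non-fraud score is exactly zero because no non-fraud row has a guarantor, so a single unseen feature value decides the outcome. Additive (Laplace) smoothing is a common remedy for this, although the notebook does not apply it here.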

## 5. Multinomial Naive Bayes

* Used when the values in X are not binary but counts that can be greater than 1
* Commonly used for text classification
* Typically paired with a bag-of-words representation
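
To make the points above concrete, here is a small hand-written sketch of multinomial Naive Bayes on word counts. The toy vocabulary, count matrix, and function names are hypothetical, and add-one (Laplace) smoothing is an assumption introduced here so that unseen words do not produce zero probabilities.

```python
import numpy as np

# Hypothetical bag-of-words counts: rows are documents, columns are words
vocab = ["game", "election", "clean"]
X = np.array([[2, 0, 1],
              [0, 2, 0],
              [1, 0, 1],
              [0, 1, 0]])
y = np.array([1, 0, 1, 0])  # 1 = sports, 0 = not sports

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word within each class, plus alpha for smoothing
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict(x_new, classes, log_prior, log_likelihood):
    """Pick the class with the highest log score for one count vector."""
    # Word counts act as exponents, so they multiply the log probabilities
    scores = log_prior + log_likelihood @ x_new
    return classes[np.argmax(scores)]

classes, log_prior, log_likelihood = fit_multinomial_nb(X, y)
print(predict(np.array([1, 0, 1]), classes, log_prior, log_likelihood))  # 1
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.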

## 6. Naive Bayes with scikit-learn

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [39]:
y_example_text = ["Sports", "Not sports","Sports","Sports","Not sports"]
y_example = [1 if c=="Sports" else 0 for c in y_example_text ]
text_example = ["A great game game", "The The election was over",
                "Very clean game match",
                "A clean but forgettable game game","It was a close election", ]


countvect_example = CountVectorizer()
X_example = countvect_example.fit_transform(text_example)
#get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
countvect_example.get_feature_names()[:8]

['but', 'clean', 'close', 'election', 'forgettable', 'game', 'great', 'it']

In [37]:
X_example.toarray()
#the first sentence was "A great game game"

array([[0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)
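
A count matrix like this one can be passed directly to scikit-learn's `MultinomialNB`. The sketch below repeats the example data so that it runs on its own; the test sentence `"A very close game"` and the choice of `alpha=1.0` (add-one smoothing, the library default) are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_example = ["A great game game", "The The election was over",
                "Very clean game match",
                "A clean but forgettable game game", "It was a close election"]
y_example = [1, 0, 1, 1, 0]  # 1 = Sports, 0 = Not sports

# Turn the sentences into a bag-of-words count matrix
vectorizer = CountVectorizer()
X_example = vectorizer.fit_transform(text_example)

# Fit multinomial Naive Bayes with add-one smoothing
clf = MultinomialNB(alpha=1.0)
clf.fit(X_example, y_example)

# Classify a new, hypothetical sentence using the same vocabulary
new_text = ["A very close game"]
X_new = vectorizer.transform(new_text)
pred = clf.predict(X_new)
proba = clf.predict_proba(X_new)
print(pred, proba)
```

The new sentence must go through `transform`, not `fit_transform`, so that it is mapped onto the vocabulary learned from the training sentences.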