# **NAIVE BAYES**

* The predictors are all categorical.
* The outcome variable, y, is also categorical (classes: 0 and 1).

This model only works for classification problems; it does not work for regression problems.

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel('/content/nb_example01.xlsx')
df

Unnamed: 0,X1,X2,Y
0,B,A,1
1,A,A,0
2,A,A,1
3,A,B,1
4,A,B,0
5,B,B,0


**How can we predict the class of point (X1,X2) = (B,A)?**

Based on probabilities...

P(Y=0|X1=B,X2=A)>0.5 => prediction is 0

P(Y=0|X1=B,X2=A)<0.5 => prediction is 1


**BAYES' THEOREM**

P(A|B) = [P(B|A)P(A)]/[P(B)]

P(Y=0|X1=B, X2=A)
= [P(X1=B, X2=A|Y=0)P(Y=0)]/[P(X1=B, X2=A)]
= [0*(3/6)]/(1/6)
= 0, so P(Y=1|X1=B, X2=A) = 1 - 0 = 1

where
P(X1=B, X2=A|Y=0) = 0
P(Y=0) = 3/6
P(X1=B, X2=A) = 1/6
(see dataset `nb_example01`)

The prediction for (X1,X2)=(B,A) is class 1.
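
The counting behind this exact calculation can be sketched in a few lines of plain Python. This is only an illustration: the `rows` list retypes the six observations of `nb_example01` by hand, and the helper `prob` is a name introduced here, not part of the notebook.

```python
from fractions import Fraction

# The six rows of nb_example01, retyped by hand as (X1, X2, Y).
rows = [("B", "A", 1), ("A", "A", 0), ("A", "A", 1),
        ("A", "B", 1), ("A", "B", 0), ("B", "B", 0)]

def prob(event, given=lambda r: True):
    """Empirical P(event | given), estimated by counting rows."""
    pool = [r for r in rows if given(r)]
    return Fraction(sum(1 for r in pool if event(r)), len(pool))

def is_point(r):
    """True when the row is the point (X1, X2) = (B, A)."""
    return (r[0], r[1]) == ("B", "A")

likelihood = prob(is_point, given=lambda r: r[2] == 0)  # P(X1=B, X2=A | Y=0)
prior = prob(lambda r: r[2] == 0)                       # P(Y=0)
evidence = prob(is_point)                               # P(X1=B, X2=A)

posterior_y0 = likelihood * prior / evidence            # Bayes' theorem
print(posterior_y0, 1 - posterior_y0)                   # P(Y=0|...), P(Y=1|...)
```

Using `Fraction` keeps the values exact, so they can be compared directly with the fractions written above.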

We need to calculate:

P(Y=1|X1=B, X2=A) = [P(X1=B, X2=A|Y=1)P(Y=1)]/[P(X1=B, X2=A)]

Naive Bayes assumes, naively, that the predictors are independent of each other given the class. Then, P(X1=B, X2=A|Y=1)=P(X1=B|Y=1)P(X2=A|Y=1)

__________________________________________________
P(X1=B, X2=A)=P(X1=B, X2=A ∩ Y=0) + P(X1=B, X2=A ∩ Y=1)

where
P(X1=B, X2=A ∩ Y=0) = P(X1=B, X2=A|Y=0)P(Y=0)
P(X1=B, X2=A ∩ Y=1) = P(X1=B, X2=A|Y=1)P(Y=1)

For our prediction, the probability will be:
[P(X1=B|Y=1)P(X2=A|Y=1)P(Y=1)]/[P(X1=B|Y=1)P(X2=A|Y=1)P(Y=1)+P(X1=B|Y=0)P(X2=A|Y=0)P(Y=0)]

where
P(Y=0) = 3/6
P(Y=1) = 3/6
P(X1=B|Y=1) = 1/3
P(X2=A|Y=1) = 2/3
P(X1=B|Y=0) = 1/3
P(X2=A|Y=0) = 1/3

So...
= [(1/3)*(2/3)*(3/6)]/[(1/3)*(2/3)*(3/6)+(1/3)*(1/3)*(3/6)]
= 2/3 => the prediction for (X1,X2)=(B,A) is class 1 (because the probability of class 1 is greater than 0.5)
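
The Naive Bayes arithmetic above can be reproduced by counting, as a check on the hand calculation. The sketch below assumes the six rows of `nb_example01` retyped by hand; the helpers `cond` and `prior` are hypothetical names introduced for this illustration.

```python
from fractions import Fraction

# The six rows of nb_example01, retyped by hand as (X1, X2, Y).
rows = [("B", "A", 1), ("A", "A", 0), ("A", "A", 1),
        ("A", "B", 1), ("A", "B", 0), ("B", "B", 0)]

def cond(col, value, y):
    """P(column == value | Y == y), estimated by counting within class y."""
    in_class = [r for r in rows if r[2] == y]
    return Fraction(sum(1 for r in in_class if r[col] == value), len(in_class))

def prior(y):
    """P(Y == y), the share of rows belonging to class y."""
    return Fraction(sum(1 for r in rows if r[2] == y), len(rows))

# Unnormalised score per class: P(X1=B|Y) * P(X2=A|Y) * P(Y)
score = {y: cond(0, "B", y) * cond(1, "A", y) * prior(y) for y in (0, 1)}

# Normalise so that the two class probabilities add up to 1.
posterior_y1 = score[1] / (score[0] + score[1])
print(posterior_y1)
```

Each predictor is counted on its own within a class, which is exactly where the independence assumption enters.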

Let's do this in Python:

In [None]:
from sklearn.naive_bayes import BernoulliNB

`BernoulliNB` works with binary predictors.

In our example, the predictors can only take two possible values: A or B. They are, therefore, binary in essence.

`BernoulliNB` needs numeric input, so we have to transform the letters into numbers.

In [None]:
df

Unnamed: 0,X1,X2,Y
0,B,A,1
1,A,A,0
2,A,A,1
3,A,B,1
4,A,B,0
5,B,B,0


How can we transform the letters into numbers?


In [None]:
df = df.replace('A', 0)
df = df.replace('B', 1)
df

Unnamed: 0,X1,X2,Y
0,1,0,1
1,0,0,0
2,0,0,1
3,0,1,1
4,0,1,0
5,1,1,0
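
Note that `df.replace` rewrites every matching cell in the whole DataFrame, including the outcome column if it happened to contain the same labels. A slightly more defensive variant, sketched here on a hand-typed copy of the data (the names `toy` and `encoding` are introduced for illustration), maps only the predictor columns:

```python
import pandas as pd

# Hand-typed copy of the example data; not read from the Excel file.
toy = pd.DataFrame({
    "X1": ["B", "A", "A", "A", "A", "B"],
    "X2": ["A", "A", "A", "B", "B", "B"],
    "Y":  [1, 0, 1, 1, 0, 0],
})

# Explicit label-to-number mapping, applied to the predictors only.
encoding = {"A": 0, "B": 1}
for col in ["X1", "X2"]:
    toy[col] = toy[col].map(encoding)

print(toy)
```

With `map`, a label that is missing from `encoding` becomes NA instead of silently staying as text, which makes encoding mistakes easier to spot.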


We are going to turn off the smoothing that scikit-learn applies by default (`alpha=1`), because we want the results that **pure** Naive Bayes would produce.

In [None]:
nb_model = BernoulliNB(alpha=0, force_alpha=True)

In [None]:
X=df.drop('Y', axis=1)
y=df['Y']
nb_model.fit(X,y)

In [None]:
X_new = pd.DataFrame({
    'X1':[1],
    'X2':[0]   
})
nb_model.predict(X_new)

array([1])

We can also get the predicted probability of each class.

In [None]:
nb_model.predict_proba(X_new)

array([[0.33333333, 0.66666667]])
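
To see what setting `alpha=0` changes, here is a rough hand calculation with and without smoothing. It assumes additive smoothing of the form (count + alpha)/(n + 2*alpha) for each binary feature, with the class prior left unsmoothed; to our understanding this is what `BernoulliNB` does, but treat it as an assumption. The function `posterior_y1` is a hypothetical helper, and `rows` retypes the encoded data by hand.

```python
from fractions import Fraction

# The encoded rows of the toy dataset, retyped by hand as (X1, X2, Y).
rows = [(1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0), (1, 1, 0)]

def posterior_y1(x1, x2, alpha):
    """P(Y=1 | X1=x1, X2=x2) with additive smoothing alpha on each binary feature."""
    score = {}
    for y in (0, 1):
        in_class = [r for r in rows if r[2] == y]
        n = len(in_class)
        # Smoothed conditional probabilities (assumed form, see the text above).
        p_x1 = Fraction(sum(r[0] == x1 for r in in_class) + alpha, n + 2 * alpha)
        p_x2 = Fraction(sum(r[1] == x2 for r in in_class) + alpha, n + 2 * alpha)
        # The class prior is the plain class frequency.
        score[y] = p_x1 * p_x2 * Fraction(n, len(rows))
    return score[1] / (score[0] + score[1])

no_smoothing = posterior_y1(1, 0, alpha=0)    # matches the hand calculation
with_smoothing = posterior_y1(1, 0, alpha=1)  # pulled towards 1/2
print(no_smoothing, with_smoothing)
```

Under this assumption the smoothed probability of class 1 is 3/5 instead of 2/3: the prediction is the same, but the estimate is less extreme.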

# NAIVE BAYES

Dataset: bank_mark_campaign

https://archive.ics.uci.edu/ml/datasets/bank+marketing

In [None]:
df = pd.read_csv('/content/bank_mark_campaign.csv', sep=';')
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


The following command replaces 'unknown' by NA, using `np.nan`.

In [None]:
import numpy as np

In [None]:
df = df.replace('unknown',np.nan)

**TREATMENT OF NA:**

There are several techniques to deal with NA.

**1.** Remove all the rows with NA - this may not be a good approach, because we would be dropping possibly important information;

**2.** Use imputation - e.g., one strategy is to replace each NA with the most frequent value of its column.
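
Both options can be sketched with pandas on a small invented table. The values in `toy` are made up for illustration and are not rows of the bank dataset; `dropped` and `imputed` are names introduced here.

```python
import numpy as np
import pandas as pd

# Invented toy data with a few missing values.
toy = pd.DataFrame({
    "job":  ["services", np.nan, "admin.", "services"],
    "loan": ["no", "no", np.nan, "yes"],
})

# Option 1: drop every row that contains at least one NA.
dropped = toy.dropna()

# Option 2: fill each column's NA with that column's most frequent value.
most_frequent = toy.mode().iloc[0]   # first mode of each column
imputed = toy.fillna(most_frequent)

print(len(toy), len(dropped))
print(imputed)
```

Here dropping keeps only two of the four rows, while filling with the most frequent value keeps all of them.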

In [None]:
df.isna()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
41184,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
41185,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
41186,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


The method `isna` returns True where a value is NA and False otherwise.

The method `any` returns True if at least one of the values it inspects is True.

With `axis=0`, we get a True if a column has any NA or a False otherwise.

In [None]:
df.isna().any(axis=0)

age               False
job                True
marital            True
education          True
default            True
housing            True
loan               True
contact           False
month             False
day_of_week       False
duration          False
campaign          False
pdays             False
previous          False
poutcome          False
emp.var.rate      False
cons.price.idx    False
cons.conf.idx     False
euribor3m         False
nr.employed       False
y                 False
dtype: bool
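
As a small complement, `sum` can be used in place of `any` to count how many NA each column has. The sketch below uses an invented three-row table (`toy` is not the bank dataset):

```python
import numpy as np
import pandas as pd

# Invented toy data; only the column names echo the bank dataset.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "default": ["no", np.nan, "no"],
    "housing": [np.nan, np.nan, "yes"],
})

has_na = toy.isna().any(axis=0)    # True if the column has at least one NA
na_count = toy.isna().sum(axis=0)  # number of NA in each column

print(has_na)
print(na_count)
```

The counts help to judge whether dropping rows would discard a lot of data.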

If we want a list with the names of all the variables that have at least one NA observation:

In [None]:
col_nan = df.columns[df.isna().any(axis=0)].to_list()
col_nan

['job', 'marital', 'education', 'default', 'housing', 'loan']

How can we get the numerical columns?

In [None]:
col_num = df.describe().columns.to_list()
col_num

['age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

We need to know which columns are numerical because we have to scale them.
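
An alternative to `describe()` for listing the numerical columns is `select_dtypes`, which selects columns by type without computing any statistics. A minimal sketch on an invented table (`toy` is not the bank dataset):

```python
import pandas as pd

# Invented toy data mixing numerical and text columns.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "euribor3m": [4.857, 4.857, 4.857],
})

via_describe = toy.describe().columns.to_list()
via_dtypes = toy.select_dtypes(include="number").columns.to_list()

print(via_describe)
print(via_dtypes)
```

Both approaches return the same column names here; `select_dtypes` simply states the intent more directly.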

The remaining categorical predictors are the columns of `df` that are in neither `col_nan` nor `col_num`.

`list1`+`list2` concatenates the two lists into a single list.

In [None]:
col_nan + col_num

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

In [None]:
df.columns.difference(col_nan + col_num)

Index(['contact', 'day_of_week', 'month', 'poutcome', 'y'], dtype='object')

Since the column named `y` is the outcome variable, we need to exclude it from the result of the previous command.

In [None]:
col_cat = df.columns.difference(col_nan + col_num + ['y']).to_list()
col_cat

['contact', 'day_of_week', 'month', 'poutcome']
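
Note that `difference` returns the names sorted alphabetically (`day_of_week` comes before `month` above), not in the order in which they appear in `df`. If the original order matters, a list comprehension is one possible alternative. The sketch below uses a shortened, hand-typed column list, so `columns`, `col_nan` and `col_num` here are illustrative stand-ins for the real ones.

```python
import pandas as pd

# Shortened, hand-typed stand-ins for df.columns, col_nan and col_num.
columns = pd.Index(["job", "contact", "month", "day_of_week", "age", "poutcome", "y"])
col_nan = ["job"]
col_num = ["age"]
excluded = col_nan + col_num + ["y"]

# Index.difference: the result comes back sorted.
sorted_cat = columns.difference(excluded).to_list()

# List comprehension: keeps the order of the original columns.
ordered_cat = [c for c in columns if c not in excluded]

print(sorted_cat)
print(ordered_cat)
```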