# **NAIVE BAYES**

This model works only for **classification problems**; it does not work for regression problems.

# **1. INTRODUCTION**

Naive Bayes is another prediction model for classification problems. It applies a fundamental theorem of Statistics, Bayes' Theorem, under a simplifying ("naive") assumption: that the predictors are independent of each other given the class. Surprisingly, although this assumption rarely holds exactly, the model performs extremely well in certain problems and is fast.

A limitation of this model is that all the predictors must be categorical. There are several ways to overcome that limitation.

# **2. THE BAYES THEOREM**

See the explanation in paper - **The Bayes Theorem**.

# **3. NAIVE BAYES**

See the explanation in paper - **naive version of Bayes' Theorem**.
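
For quick reference, the two formulas can be summarised as follows. The notation used here ($y$ for the class and $x_1, \dots, x_p$ for the values of the predictors) is an assumption and may differ from the one used in the paper. Bayes' Theorem gives:

$$
P(y \mid x_1, \dots, x_p) = \frac{P(x_1, \dots, x_p \mid y)\, P(y)}{P(x_1, \dots, x_p)}
$$

The naive version replaces the joint likelihood by a product of one term per predictor, which is only exact if the predictors are independent given the class:

$$
P(y \mid x_1, \dots, x_p) \propto P(y) \prod_{j=1}^{p} P(x_j \mid y)
$$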


# **4. IMPLEMENTATION**

We need to first load the needed function:

In [None]:
from sklearn.naive_bayes import BernoulliNB

`BernoulliNB` works with binary predictors and, in our example, our predictors are binary in essence since they can only assume two possible values: A or B.

Read the dataset:

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel('/content/nb_example01.xlsx')
df

Unnamed: 0,X1,X2,Y
0,B,A,1
1,A,A,0
2,A,A,1
3,A,B,1
4,A,B,0
5,B,B,0


Since `BernoulliNB` requires a numerical dataframe, which means that the predictors cannot contain letters, we need to transform the predictors into numbers, i.e., replace the classes A and B by, for instance, 1 and 0, respectively.


In [None]:
df = df.replace({'A':1, 'B':0})
print(df)

   X1  X2  Y
0   0   1  1
1   1   1  0
2   1   1  1
3   1   0  1
4   1   0  0
5   0   0  0


As we have seen in points 2. and 3. (paper), we can calculate the probability of a certain case using two methods: `Bayes' Theorem` itself and `Naive Bayes`, respectively.

The second one is also known as the `naive version of Bayes' Theorem` and:

* is a solution to overcome the difficulty of not finding a row in the dataset where the predictors have the same values as the predictors of the example we want to classify;

* is a method that assumes that the predictors are independent.

Relating this to `BernoulliNB`:

- if we use `BernoulliNB` with its default settings, we are still assuming that the predictors are independent, and we are also using a technique called `smoothing`, which avoids zero probabilities by adding a small positive value `alpha` (usually 1) to the counts used to estimate each probability.

- if we want to use PURE Naive Bayes without smoothing, we can deactivate this mechanism in `BernoulliNB` by setting the parameter `alpha` to 0. With alpha equal to 0 no smoothing is applied and the model will produce the same results as pure Naive Bayes.

We are going to turn off the numerical smoothing Python uses, because we want to get the **results pure Naive Bayes** would produce, just to check the correctness of our hand calculations.

In general, we must **not** turn off numerical smoothing.

To turn off the numerical smoothing mechanism, we use `alpha=0` and `force_alpha=True`.

`alpha=0` => no smoothing technique will be applied.

`force_alpha=True` => the model uses the value of alpha exactly as given, even if it is zero; without it, a zero alpha would be replaced by a very small positive value.
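
To make the effect of `alpha` concrete, here is a minimal sketch of additive smoothing for a binary predictor. The helper `smoothed_prob` and the all-zero example are hypothetical names and values introduced only for illustration; the formula is the usual additive (Laplace) smoothing, which is what `BernoulliNB` is understood to apply.

```python
def smoothed_prob(values, alpha):
    """Estimate P(X = 1 | class) from 0/1 values with additive smoothing."""
    count_ones = sum(values)
    n = len(values)
    # A binary predictor has two possible values, hence 2 * alpha below
    return (count_ones + alpha) / (n + 2 * alpha)

# Values of X1 among the rows with Y = 1 in the encoded table above
x1_given_y1 = [0, 1, 1]
p_pure = smoothed_prob(x1_given_y1, alpha=0)      # 2/3, the plain relative frequency
p_smoothed = smoothed_prob(x1_given_y1, alpha=1)  # 3/5, pulled towards 1/2

# Hypothetical case where the value 1 never occurs in the class
never_one = [0, 0, 0]
p_zero = smoothed_prob(never_one, alpha=0)      # 0.0, which would zero out the whole product
p_nonzero = smoothed_prob(never_one, alpha=1)   # 1/5, no longer zero

print(p_pure, p_smoothed, p_zero, p_nonzero)
```

The last two values show why smoothing is normally left on: a single zero estimate would otherwise force the probability of the whole class to zero.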


Divide the dataset into two objects: `X`, which contains only the predictors (we drop the outcome variable column), and `y`, which contains only the outcome variable.

In [None]:
X = df.drop('Y', axis=1)
y = df['Y']

The difference in the use of `( )` and `[ ]` has to do with the fact that for `X` we are **applying the method** `drop` to remove the column `Y`, while for `y` we are only **selecting** the column `Y`.

In [None]:
nb_model = BernoulliNB(alpha=0, force_alpha=True)
nb_model.fit(X,y)

Basically, this code creates a Naive Bayes model of type `BernoulliNB` and fits it to the data `X` and `y` without smoothing. The fitted model can then be used to make predictions on new data.

NEW DATA:

Creating a new dataframe with two columns `X1` and `X2` and a row with the values [0,1].

Remember that we have replaced A by 1 and B by 0.

If we want a new point like (X1,X2)=(B,A) we have to add the point (X1,X2)=(0,1).

In [None]:
X_new = pd.DataFrame({
    'X1': [0],
    'X2': [1]
})

Predicting the new data:

In [None]:
nb_model.predict(X_new)

array([1])

This means that the prediction of (X1,X2)=(B,A) is class 1.

Since `X_new` has only one row, the result of `predict_proba` will be a 1x2 matrix, where the first column is the probability of the point being class 0 and the second column the probability of it being class 1.

In [None]:
y_proba = nb_model.predict_proba(X_new)
y_proba

array([[0.33333333, 0.66666667]])

Since the probability of being class 1 is higher, this means that the prediction of (X1,X2)=(B,A) is class 1 (the same as before).
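
The probabilities above can be checked by hand. The following is a small sketch in plain Python that applies the naive formula (prior times the product of per-predictor relative frequencies, with no smoothing) to the six encoded rows shown earlier; the variable names are illustrative only.

```python
# Encoded training data (A -> 1, B -> 0), one tuple (X1, X2, Y) per row
rows = [(0, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0), (0, 0, 0)]
x_new = (0, 1)  # the point (X1, X2) = (B, A)

scores = {}
for c in (0, 1):
    in_class = [r for r in rows if r[2] == c]
    prior = len(in_class) / len(rows)
    likelihood = 1.0
    for j, value in enumerate(x_new):
        # Relative frequency of this value among the rows of class c
        matches = sum(1 for r in in_class if r[j] == value)
        likelihood *= matches / len(in_class)
    scores[c] = prior * likelihood

# Normalise so that the two scores add up to 1
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)
```

The two values agree with the output of `predict_proba` (1/3 for class 0 and 2/3 for class 1), which is what we expect when smoothing is turned off.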

# **5. NAIVE BAYES GENERALIZATION**

Naive Bayes, in its original version, is based on the assumption that the predictors are categorical. This is a strong limitation! Fortunately, there is a version of Naive Bayes that works with numerical predictors, the `Gaussian Naive Bayes`.

`Gaussian Naive Bayes` assumes that all predictors follow a Gaussian distribution within each class and that they are independent of each other. This may not hold in reality but, despite that, it can produce good predictions and is therefore worth trying.
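
As an illustration of the idea, the sketch below classifies a point with a single numerical predictor by estimating one Gaussian per class. The toy data and the helper names (`gaussian_pdf`, `posterior`) are hypothetical and are not part of the notebook; with several predictors, the densities of the individual predictors would be multiplied together.

```python
import math

# Hypothetical toy data: one numerical predictor and a binary class
x = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
y = [0, 0, 0, 1, 1, 1]

def gaussian_pdf(value, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x_new):
    """Class probabilities for x_new: prior times Gaussian likelihood, normalised."""
    scores = {}
    for c in sorted(set(y)):
        xs = [xi for xi, yi in zip(x, y) if yi == c]
        mean = sum(xs) / len(xs)
        var = sum((xi - mean) ** 2 for xi in xs) / len(xs)
        prior = len(xs) / len(x)
        scores[c] = prior * gaussian_pdf(x_new, mean, var)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

post = posterior(2.9)
predicted = max(post, key=post.get)
print(predicted, post)
```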

# **EXERCISE**

Dataset: bank_mark_campaign

https://archive.ics.uci.edu/ml/datasets/bank+marketing

In [None]:
df = pd.read_csv('/content/bank_mark_campaign.csv', sep=';')
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


The following command replaces 'unknown' by NA, using `np.nan`.

In [None]:
import numpy as np

In [None]:
df = df.replace('unknown', np.nan)
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


# **ANALYSE THE PRESENCE OF NA:**

The method `isna` gives True or False if the values are NA or not, respectively.

The method `any` gives True if at least one of the values it checks is True, i.e., if there is at least one NA.

With `axis=0`, we get a True if a COLUMN has any NA or a False otherwise.

In [None]:
df.isna().any(axis=0)

age               False
job                True
marital            True
education          True
default            True
housing            True
loan               True
contact           False
month             False
day_of_week       False
duration          False
campaign          False
pdays             False
previous          False
poutcome          False
emp.var.rate      False
cons.price.idx    False
cons.conf.idx     False
euribor3m         False
nr.employed       False
y                 False
dtype: bool

`df.isna()` returns a boolean DataFrame with the same shape as `df`, where each value is True if the corresponding value in `df` is NaN and False otherwise.

`.any(axis=0)` is a method that checks whether there are any True values in each column of the DataFrame. `axis=0` indicates that the check is done column by column. The result is a boolean Series indicating which columns of `df` contain at least one missing value.

If we want a LIST with the names of all the variables that have at least one NA observation:

In [None]:
col_nan = df.columns[df.isna().any(axis=0)].to_list()
col_nan

['job', 'marital', 'education', 'default', 'housing', 'loan']

How can we get the **numerical columns**?

We need to know them because we will have to scale them.



In [None]:
col_num = df.describe().columns.to_list()
col_num

['age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

`df.describe()` returns a dataframe containing several descriptive statistics for each numerical column of the dataframe `df`.

`col_num` is a list containing all the names of the **numerical columns** of the dataframe.
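
A possible alternative, shown here only as a sketch on a hypothetical miniature of the dataset, is to select the numerical columns by their data type with `select_dtypes`. On a dataframe that mixes numerical and text columns, both approaches should return the same list.

```python
import pandas as pd

# Hypothetical miniature of the bank dataset, used only for illustration
toy = pd.DataFrame({
    'age': [56, 57, 37],
    'job': ['housemaid', 'services', 'services'],
    'euribor3m': [4.857, 4.857, 4.857],
    'y': ['no', 'no', 'no'],
})

# Approach used in the notebook: describe() summarises only numerical columns
num_from_describe = toy.describe().columns.to_list()

# Alternative: pick the columns whose dtype is numerical
num_from_dtypes = toy.select_dtypes(include='number').columns.to_list()

print(num_from_describe, num_from_dtypes)
```

The second form avoids computing statistics that are not needed when only the column names are wanted.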

The remaining columns, the **categorical** ones without NA, are those that are included neither in `col_nan` (list of the variables that have at least one NA observation) nor in `col_num` (list of numerical columns).

Let's concatenate the two lists `col_nan` and `col_num` into a single list.

In [None]:
col_nan + col_num

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

Now, let's take the difference between all the columns of the DataFrame and this new list that contains `col_nan` and `col_num`, in order to get the remaining columns, the categorical predictors.

In [None]:
df.columns.difference(col_nan + col_num)

Index(['contact', 'day_of_week', 'month', 'poutcome', 'y'], dtype='object')

Just one more thing: the `y` is the column of the outcome variable, so we need to exclude it from the result of the previous command, since we only want the predictors.

In [None]:
col_cat = df.columns.difference(col_nan + col_num + ['y']).to_list()
col_cat

['contact', 'day_of_week', 'month', 'poutcome']

These are the **categorical predictors** of the DataFrame.

Now that we have seen the predictors that have at least one `NA` observation, the list of numerical predictors and the list of categorical predictors, we can **TREAT NA**.

# **TREATMENT OF NA**:

There are several techniques to deal with NA:

1. Remove all the rows with NA - this may not be a good approach, because we would be dropping possibly important information.

2. Use imputation - an example of a strategy is to replace the NA with the most frequent value.

* We will use the 2nd method, i.e., we are going to **impute the NA with the most frequent value**.

In this dataset, the NA occur only in categorical predictors, i.e., variables that take one of a fixed set of categories rather than numerical values that can vary freely. Each of the variables with NA can only assume certain categories, which is why they are categorical.

Since they are categorical, we need to use `OneHotEncoder`. WHY? Because `Gaussian Naive Bayes` only works with numerical predictors, so we need to transform the categories into numbers.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

**Naive Bayes** - the original version - assumes only categorical predictors.

But since we have both numerical and categorical variables in the dataset, the original Naive Bayes is not appropriate in this case, so we need another method.

There is `Gaussian Naive Bayes`, which assumes that the continuous predictors follow **Gaussian statistical distributions** and **are independent**, and which is able to handle numerical variables instead of only categorical ones.

The assumptions of Gaussian Naive Bayes may not hold, but its predictions may be good nevertheless.

In [None]:
na_treat = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oneh', OneHotEncoder(drop='first'))])

Here, we are creating a pipeline with two elements: `SimpleImputer`, which replaces the NA with the most frequent value of each column, and `OneHotEncoder`, which transforms the categorical variables into binary ones (0 or 1), dropping the first encoded column (category) to avoid multicollinearity between the variables.
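
To see what such a pipeline does, here is a small sketch on a hypothetical single-column dataframe (the name `na_treat_toy` and the toy values are assumptions made for illustration). The missing value is first replaced by the most frequent category, and the column is then one-hot encoded with the first category dropped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing value
toy = pd.DataFrame({'marital': ['married', 'single', np.nan, 'married']}, dtype=object)

na_treat_toy = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oneh', OneHotEncoder(drop='first'))])

# The encoder returns a sparse matrix by default, so convert it for display
encoded = na_treat_toy.fit_transform(toy).toarray()
print(encoded.ravel())
```

The NA becomes 'married' (the most frequent value). Since 'married' is the first category in alphabetical order, it is the one dropped, and the single remaining column indicates 'single'.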

In [None]:
preprocessor = ColumnTransformer([
     ('na_tr', na_treat, col_nan),
     ('cat_tr', OneHotEncoder(drop='first'), col_cat),
     ('scale_tr', StandardScaler(), col_num)],
     remainder='passthrough')

Here it's important to understand some things:

- `preprocessor` is a `ColumnTransformer` object that preprocesses the input data by applying different transformations to different subsets of columns.

* `na_tr` => name, pipeline previously created, list of column names to which the transformer should be applied (in this case, `col_nan` - the list of variables with NA).

* `cat_tr` => name, `OneHotEncoder` - applies one-hot encoding to the categorical variables, list of columns to which the transformation will be applied (in this case, `col_cat`).

* `scale_tr` => name, `StandardScaler` - standardizes the numerical columns by subtracting their mean and dividing by their standard deviation, list of columns that will be transformed (in this case, `col_num` - numerical variables).

* `remainder` => any columns not specified in the above transformers will be passed through without any transformations.
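
The behaviour described in the points above can be sketched on a hypothetical three-column dataframe. The column `flag` and the name `pre_toy` are assumptions made for illustration, and `sparse_threshold=0` is added only so that this small example always returns a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'contact': ['telephone', 'cellular', 'telephone'],
    'age': [30.0, 40.0, 50.0],
    'flag': [1, 0, 1],  # hypothetical column not listed in any transformer
})

pre_toy = ColumnTransformer([
    ('cat_tr', OneHotEncoder(drop='first'), ['contact']),
    ('scale_tr', StandardScaler(), ['age'])],
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array in this sketch

out = pre_toy.fit_transform(toy)
print(out)
```

The output columns follow the order of the transformers: first the encoded `contact`, then the standardized `age`, and finally the untouched `flag` column passed through by `remainder`.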

In [None]:
pipe = Pipeline([
    ('pre', preprocessor),
    ('gnb', GaussianNB())])

==> `pipe` combines the `preprocessor` and the `GaussianNB` classifier into a single object that can be fit and then used to make predictions on new, unseen data.

Divide the dataset into two dataframes (`X` and `y`)

In [None]:
X = df.drop('y', axis=1)
y = df['y']

Divide both dataframes into test (20%) and train (80%).


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

When we call `pipe.fit(X_train, y_train)`, the following steps are performed:

1. `preprocessor` is fit to the training data `X_train` and transforms it - this involves imputing the missing values, one-hot encoding the categorical variables, and standardizing the numerical variables.

2. the transformed data is passed to the `GaussianNB` classifier, which is fit to the transformed data along with the target values `y_train`.

After fitting the `pipe`, it can be used to make predictions on new or unseen data using the `pipe.predict()` method.

In [None]:
pipe.fit(X_train, y_train)

In [None]:
y_pred = pipe.predict(X_train)

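* `y_pred = pipe.predict(X_train)` generates predicted target values for the training data `X_train` using the fitted `pipe` object. Since this is the same data used for fitting, the metrics below describe performance on the training set.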
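
Performance on data the model has not seen is usually measured on the test split. The sketch below shows the pattern on synthetic data generated with `make_classification`, which stands in for the bank dataset and is an assumption of this example; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the real dataset (assumption of this sketch)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=45)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=45)

model = GaussianNB().fit(X_tr, y_tr)

# Accuracy on the data used for fitting versus on held-out data
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(train_acc, test_acc)
```

In the notebook, the analogous calls would be `pipe.predict(X_test)` followed by `accuracy_score(y_test, ...)`.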

**ACCURACY**

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
acur = accuracy_score(y_train, y_pred)
acur

0.8391502276176024

* `acur = accuracy_score(y_train, y_pred)` calculates the accuracy score of the predicted target values `y_pred` compared to the actual target values `y_train`.

An accuracy of about 0.84 means that the model correctly classifies almost 84% of the training instances which, depending on the context of the problem, can be good or bad.

**CONFUSION MATRIX**

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_train, y_pred, labels = ['yes', 'no'])
cm

array([[ 2283,  1417],
       [ 3883, 25367]])

The confusion matrix is a table that shows how often the model classifies the instances of each class correctly or incorrectly.

 - ROWS - actual values (yes or no)

 - COLUMNS - predicted values (yes or no)
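
As a check, the accuracy can be recovered from the four counts of the confusion matrix. The sketch below also computes, for comparison, the accuracy of a trivial rule that always predicts 'no'; this baseline is an addition made here for context, and the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (labels ordered as ['yes', 'no'])
cm = [[2283, 1417],
      [3883, 25367]]

# Rows are actual values, columns are predicted values
(yes_as_yes, yes_as_no), (no_as_yes, no_as_no) = cm

total = yes_as_yes + yes_as_no + no_as_yes + no_as_no
accuracy = (yes_as_yes + no_as_no) / total

# Accuracy of a trivial rule that always predicts 'no'
baseline = (no_as_yes + no_as_no) / total

print(round(accuracy, 4), round(baseline, 4))
```

Because the class 'no' is much more frequent than 'yes', always answering 'no' would already be right about 89% of the time, so accuracy alone says little about how well the 'yes' cases are found.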

**RECALL SCORE**

In [None]:
from sklearn.metrics import recall_score

In [None]:
recall_score(y_train, y_pred, pos_label='yes')

0.617027027027027

The `recall_score` is an evaluation metric that measures the proportion of actual positive instances that the model identifies correctly. In other words, it measures the capacity of the model to find all the examples of the positive class, i.e., its sensitivity.

`recall_score = 2283/(2283+1417) = 0.617` means that the model correctly identified 61.7% of the instances of class 'yes'. The model is therefore not very good at finding all the examples of the positive class: it produces a considerable number of false negatives.



Since we are trying to predict the "yes", these predictions are not good, because many of the actual "yes" values are predicted as "no".

* 1417 => false negatives.

* 2283 => true positives.

* 25367 => true negatives.

* 3883 => false positives.


The ideal case would be to get a recall_score of 1.
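
The recall can also be computed directly from the counts, taking 'yes' as the positive class. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Counts from the confusion matrix, with 'yes' as the positive class
tp, fn = 2283, 1417    # actual 'yes': predicted 'yes' / predicted 'no'
fp, tn = 3883, 25367   # actual 'no':  predicted 'yes' / predicted 'no'

recall = tp / (tp + fn)
missed = fn / (tp + fn)  # share of actual 'yes' cases that the model misses

print(round(recall, 3), round(missed, 3))
```

A recall of 1 would require `fn` to be 0, i.e., no actual 'yes' predicted as 'no'.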