# Classifiers with Bayes' rule

### This notebook is based on the tutorials by

Guillaume J. Clement "Why & How to use the Naive Bayes algorithms in a regulated industry with sklearn | Python + code", available at https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf

Saul Dobilas "Naive Bayes Classifier — How to Successfully Use It in Python?", available at https://towardsdatascience.com/naive-bayes-classifier-how-to-successfully-use-it-in-python-ecf76a995069

Naive Bayes is a simple, fast algorithm whose predictions are typically easy to explain. It is called naive because it assumes that the features are conditionally independent given the class, which is rarely true in practice.

If the variables/features are continuously distributed, then we deal with the Gaussian Naive Bayes model. The crucial assumption here is that, within each class, the features are independent and normally distributed.

If the variables/features are discrete, then we deal with the multinomial distribution, which models the probability of counts for each side of a k-sided die rolled n times. For n independent trials, each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories. There are some special cases:
- If k=2 and n=1, it is the Bernoulli distribution: model the probability of getting "tails" when flipping a coin once. This is about a binary variable.
- If k=2 and n>1, it is the binomial distribution: model the probability of getting "tails" X times (0<=X<=n) when flipping a coin n times independently. This is about counts of a binary variable.
- If k>2 and n=1, it is the categorical distribution: the probability of getting one of several options in a single sampling experiment. This is about a categorical/finite variable.
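
The special cases above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `multinomial_pmf` and the example probabilities are illustrative assumptions, not part of the notebook's dataset or of sklearn.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing `counts` successes per category in sum(counts) trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)  # exact: each partial quotient is an integer
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# k=2, n=1 (Bernoulli): one flip of a fair coin landing on "tails".
bernoulli = multinomial_pmf([1, 0], [0.5, 0.5])

# k=2, n>1 (binomial): exactly 3 "tails" in 5 flips, compared with the binomial formula.
binomial = multinomial_pmf([3, 2], [0.5, 0.5])
binomial_direct = comb(5, 3) * 0.5 ** 3 * 0.5 ** 2

# k>2, n=1 (categorical): a single roll of a hypothetical loaded 3-sided die.
categorical = multinomial_pmf([0, 1, 0], [0.2, 0.5, 0.3])

print(bernoulli, binomial, binomial_direct, categorical)
```

In each case the general multinomial formula reduces to the simpler distribution named in the list.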

We will discuss in this notebook how to build Naive Bayes classification models, including:
1. Gaussian NB with 2 independent variables
2. Categorical NB with 2 independent variables
3. Bernoulli NB with 2 independent variables

#  0. Load the basic packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

# 1. Load the data

We use the Wisconsin breast cancer dataset available at https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. We load a version of the data from https://openml.org/search?type=data&status=any&id=43757. These are consecutive patients seen by Dr. Wolberg 1984-1992, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. Ten real-valued features are offered in this dataset, each with its mean, standard error, and worst value seen for that tumour sample:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. All feature values are recoded with four significant digits.

For each sample we also have the diagnosis: whether the tumour was malignant (M) or benign (B). Our task is to develop a machine learning model that predicts the diagnosis of a sample from its other features.

In [None]:
# Load the dataset from OpenML.org
# Link to the dataset: https://openml.org/search?type=data&status=any&id=43757
# ID of the dataset on OpenML: 43757

from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    data_id=43757,
    as_frame=True,
    return_X_y=True
)

print(type(X), type(y))
print(X.columns)

X.info()

In [None]:
# y includes the labels and X includes the features
y = X['diagnosis']       # M or B

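# To handle the data better, we replace the B/M values with numerical encodings 0/1.
# map + astype(int) gives a plain integer column whether 'diagnosis' was loaded as object or category.

y = y.map({'B': 0, 'M': 1}).astype(int)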

X = X.drop(['Unnamed:_32','diagnosis'], axis = 1 )

# 2. Explore the data

It is always a good idea to explore the data before we start using it: to find outliers, and to decide whether we need a preprocessing phase to make it uniform or to augment it. We also want to make sure that all the classes are covered by more or less the same number of samples.

In [None]:
# Print the samples on rows 50-59.

X.iloc[50:60,:]

We note that all the features are continuous, but on various scales (the only discrete column, 'diagnosis', is now kept separately in y). This should be taken into account when pre-processing the data for the models.

In [None]:
# Check the distribution of samples between malignant/benign

y.value_counts().plot(kind="bar")
plt.show()

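counts = y.value_counts()
B, M = counts.loc[0], counts.loc[1]   # look up by label (0 = benign, 1 = malignant), not by position
print('Number of Benign: ',B)
print('Number of Malignant : ',M)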

We conclude that the classes are only moderately imbalanced (roughly 63% benign and 37% malignant). A stratified split is in order, but no other balancing measures are needed.

In [None]:
# Visualise the distribution of each feature.
# Q: Can they be considered as normally distributed?

fig, axes = plt.subplots(figsize=(10, 20), nrows=10, ncols=3, layout="constrained")
X.plot(subplots=True, ax=axes, kind='kde')
plt.show()

### Another way to visualise the data, this time separated on malignant/benign

In [None]:
# first ten features
data = pd.concat([y,X.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

### Did not work out well: the features are on different scales. Let's normalize them.
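
As a reminder of what this normalization does: `StandardScaler` subtracts the mean of each feature and divides by its (population) standard deviation. The sketch below reproduces that computation with the standard library only; the function name `standardize` and the input values are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return (value - mean) / std for each value, using the population std (ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical raw values on two very different scales.
area_like = [400.0, 550.0, 700.0, 1200.0]
concavity_like = [0.02, 0.05, 0.09, 0.20]

area_scaled = standardize(area_like)
concavity_scaled = standardize(concavity_like)
print(area_scaled)
print(concavity_scaled)
```

After scaling, both features have mean 0 and standard deviation 1, so they can share one axis in a plot.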

In [None]:
# Features 0-9

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = X.iloc[:,0:10]
scaled_features = scaler.fit_transform(data)
X_scaled = pd.DataFrame(scaled_features, index=data.index, columns=data.columns)

data = pd.concat([y, X_scaled],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

In [None]:
# Features 10-19

scaler = StandardScaler()
data = X.iloc[:,10:20]
scaled_features = scaler.fit_transform(data)
X_scaled = pd.DataFrame(scaled_features, index=data.index, columns=data.columns)

data = pd.concat([y, X_scaled],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

In [None]:
# Features 20-29

scaler = StandardScaler()
data = X.iloc[:,20:30]
scaled_features = scaler.fit_transform(data)
X_scaled = pd.DataFrame(scaled_features, index=data.index, columns=data.columns)

data = pd.concat([y, X_scaled],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

In [None]:
del data
del scaled_features
del scaler

# 3. Data preprocessing

In [None]:
# Reset the seed of the random number generator, for reproducibility purposes

import os

def reset_seed(SEED = 0):
    """Reset the seed for every random library in use (System, numpy)"""

    os.environ['PYTHONHASHSEED']=str(SEED)
    np.random.seed(SEED)


reset_seed(2023)

In [None]:
# Split the full dataset into train+validation and test

from sklearn.model_selection import train_test_split

X_train_valid, X_test, y_train_valid, y_test = train_test_split(X,
                                                      y,
                                                      test_size=0.2,
                                                      random_state=2023,
                                                      stratify=y
                                                     )

X_train_valid = X_train_valid.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train_valid = y_train_valid.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)



# Split the training dataset into training and validation

X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid,
                                                      y_train_valid,
                                                      test_size=0.25,
                                                      random_state=2023,
                                                      stratify=y_train_valid
                                                     )

X_train = X_train.reset_index(drop=True)
X_valid = X_valid.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_valid = y_valid.reset_index(drop=True)



# Check the result of the data split

print('# of training samples:', len(X_train))
print(y_train.value_counts())

print('# of validation samples:', len(X_valid))
print(y_valid.value_counts())

print('# of test samples:', len(X_test))
print(y_test.value_counts())
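
The two nested splits give roughly a 60/20/20 partition into training, validation and test data. The arithmetic can be sketched as follows, assuming the dataset has 569 samples (the usual size of this dataset, not stated in the notebook) and assuming that `train_test_split` rounds the held-out part up.

```python
from math import ceil

n_samples = 569  # assumption: the usual size of the Wisconsin diagnostic dataset

# First split: test_size=0.2 of the full dataset (held-out part rounded up).
n_test = ceil(0.2 * n_samples)
n_train_valid = n_samples - n_test

# Second split: test_size=0.25 of the remaining train+validation part.
n_valid = ceil(0.25 * n_train_valid)
n_train = n_train_valid - n_valid

print(n_train, n_valid, n_test)
print([round(n / n_samples, 3) for n in (n_train, n_valid, n_test)])
```

Taking 25% of the remaining 80% is what makes the validation set about as large as the test set.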

# 4. Models

## Model 1: Naive Bayes on a Gaussian distribution, with 2 variables (GaussianNB)
We build a classifier based on Bayes' rule, considering only 2 features: area_mean,
concavity_mean. Recall: the variables are assumed to be independent and normally distributed. 

In [None]:
# Consider only two features in this model: area_mean, concavity_mean.

data_train = X_train[['area_mean', 'concavity_mean']]
data_valid = X_valid[['area_mean', 'concavity_mean']]
# data_test = X_test[['area_mean', 'concavity_mean']] # we do not use the test data in this notebook

#### Scale the training data

In [None]:
scaler = StandardScaler()

scaler = scaler.fit(data_train)
data_train_scaled = scaler.transform(data_train)
data_train = pd.DataFrame(data_train_scaled, index=data_train.index, columns=data_train.columns)

del data_train_scaled

### An sklearn model for Gaussian Naive Bayes classifier

In [None]:
# Train a Gaussian Naive Bayes classifier with sklearn

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

gssnNB = GaussianNB()
gssnNB.fit(data_train, y_train)

# Print model attributes
print('Classes: ', gssnNB.classes_) # class labels known to the classifier
print('Class Priors: ',gssnNB.class_prior_) # prior probability of each class.
print('Variances: ', gssnNB.var_)
print('Score on the training data: ', gssnNB.score(data_train, y_train))
pred_labels = gssnNB.predict(data_train)
print(classification_report(y_train, pred_labels))

In [None]:
data_valid_scaled = scaler.transform(data_valid)
data_valid = pd.DataFrame(data_valid_scaled, index=data_valid.index, columns=data_valid.columns)
del data_valid_scaled
del scaler

print('Score on the validation data: ', gssnNB.score(data_valid, y_valid))
pred_labels = gssnNB.predict(data_valid)
print(classification_report(y_valid, pred_labels))

### An explicit model based on the calculations with Bayes' rule

The following is based on the tutorial available at https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf

#### How does Naive Bayes work for continuous features — GaussianNB

First of all, the Naive Bayes posterior probability is defined as:

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

$$p(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}$$

We aim to calculate over the next few steps the posterior probability for “benign” & “malignant” by using the variables "area_mean", "concavity_mean". To avoid any confusion, we refer to the feature "area_mean" as A and to "concavity_mean" as C. So our data x will be a 2-dimensional vector (A,C). 

First, the estimation of the class prior is simple because it is about the ratio of benign (0)  and malignant (1) samples:

In [None]:
gssnPrior1 = (y_train.sum())/(len(y_train))
gssnPrior0 = 1-(y_train.sum()/(len(y_train)))

print("Class priors: ", gssnPrior0, gssnPrior1)

From the sklearn model we got Class Priors:  [0.62756598 0.37243402].
NOTE: these are the same values!

We estimate the likelihood probabilities p((A,C)|"benign") and p((A,C)|"malignant"). We assume that the two features are independent and normally distributed. We estimate their mean and variance within each class using the training sample.

In [None]:
#Merge the labels into the data
data_train_conc = pd.concat([data_train,y_train], axis=1)

# Calculations for class "benign" / 0

data_train_0 = data_train_conc[data_train_conc['diagnosis'] == 0]
gssn_A_mean_0 = data_train_0['area_mean'].mean()
gssn_C_mean_0 = data_train_0['concavity_mean'].mean()
gssn_A_std_0 = data_train_0['area_mean'].std(ddof=0) # normalization by N
gssn_C_std_0 = data_train_0['concavity_mean'].std(ddof=0) # normalization by N

# Calculations for class "malignant" / 1

data_train_1 = data_train_conc[data_train_conc['diagnosis'] == 1]
gssn_A_mean_1 = data_train_1['area_mean'].mean()
gssn_C_mean_1 = data_train_1['concavity_mean'].mean()
gssn_A_std_1 = data_train_1['area_mean'].std(ddof=0) # normalization by N
gssn_C_std_1 = data_train_1['concavity_mean'].std(ddof=0) # normalization by N

print("Variances calculated with Bayes' rule: ", 
      gssn_A_std_0**2, gssn_C_std_0**2, gssn_A_std_1**2, gssn_C_std_1**2
     )

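Variances obtained from the sklearn model: [[0.13643587 0.21021074], [1.0131341  0.91604935]]
Note: they agree! (sklearn adds a tiny `var_smoothing` term to each variance for numerical stability, so any difference is below the printed precision.)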

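We get the conditional probabilities p(A|0), p(A|1), p(C|0) and p(C|1) by substituting these values into the density of the normal distribution:

$$p(x \mid c) = \frac{1}{\sigma_c \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu_c}{\sigma_c}\right)^2\right)$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the feature within class c.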
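
For a single value, the normal density can be sketched as a small standalone function. The function name `gaussian_pdf` and the parameter values below are hypothetical and only illustrate the formula.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation at x."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2 * pi))

# Hypothetical class statistics for one scaled feature.
density_at_mean = gaussian_pdf(0.0, mean=0.0, std=1.0)
density_one_std_away = gaussian_pdf(1.0, mean=0.0, std=1.0)

# A narrow distribution: the density at the mean exceeds 1.
density_narrow = gaussian_pdf(0.0, mean=0.0, std=0.1)

print(density_at_mean, density_one_std_away, density_narrow)
```

Note that these values are densities, not probabilities, so they may be larger than 1. This is harmless because the evidence term normalises the posterior.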

In [None]:
gssn_A_0 = (1/(gssn_A_std_0*np.sqrt(2*np.pi))) * np.exp( -(1/2)*np.square( (data_train_conc['area_mean']-gssn_A_mean_0)/gssn_A_std_0 ) )
gssn_A_1 = (1/(gssn_A_std_1*np.sqrt(2*np.pi))) * np.exp( -(1/2)*np.square( (data_train_conc['area_mean']-gssn_A_mean_1)/gssn_A_std_1 ) )

gssn_C_0 = (1/(gssn_C_std_0*np.sqrt(2*np.pi))) * np.exp( -(1/2)*np.square( (data_train_conc['concavity_mean']-gssn_C_mean_0)/gssn_C_std_0 ) )
gssn_C_1 = (1/(gssn_C_std_1*np.sqrt(2*np.pi))) * np.exp( -(1/2)*np.square( (data_train_conc['concavity_mean']-gssn_C_mean_1)/gssn_C_std_1 ) )

Now we have all the parameters to get the likelihood for the two classes based on our hypothesis that A and C are independent:
p((A,C)|0)=p(A|0)* p(C|0)
p((A,C)|1)=p(A|1)* p(C|1)


In [None]:
gssnLikelihood0 = gssn_A_0 * gssn_C_0
gssnLikelihood1 = gssn_A_1 * gssn_C_1

We can calculate now the evidence p(A,C)=p(A,C|0)* p(0) + p(A,C|1)* p(1)

In [None]:
gssnEvidence = (gssnPrior0*gssnLikelihood0) + (gssnPrior1*gssnLikelihood1)

Finally, we can now calculate the posterior probabilities p(0|A,C) and p(1|A,C)

In [None]:
gssnP0 = (gssnPrior0*gssnLikelihood0) / gssnEvidence
gssnP1 = (gssnPrior1*gssnLikelihood1) / gssnEvidence

In [None]:
# The predictions we make based on the Bayes' rule
pred2_labels=(gssnP0<gssnP1).astype(int)

# Here are the predictions of our sklearn model
pred_labels = gssnNB.predict(data_train)

compare = (pred2_labels == pred_labels)
print('The values in our comparison vector: ', np.unique(compare))

NOTE: No False value in our comparison vector, i.e., we got the very same predictions as in the sklearn model!

To check the predictions of the explicit Bayes' rule model on the validation and on the test datasets, the calculations above have to be redone for those datasets. We skip this here; you may want to check the calculations yourself.
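
One way to make such a check less repetitive is to wrap the steps (prior, per-feature likelihood, evidence, posterior) into a function that takes the fitted class statistics as arguments. The following is a sketch under stated assumptions: the function names and all parameter values are hypothetical, and the computation is done in log space to avoid numerical underflow for points far from both class means.

```python
from math import exp, log, pi

def gaussian_log_pdf(x, mean, std):
    """Log-density of a normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - log(std) - 0.5 * log(2 * pi)

def gaussian_nb_posterior(point, priors, means, stds):
    """Posterior class probabilities for one sample.

    point: feature values; priors[c]: class prior;
    means[c][j], stds[c][j]: per-class statistics of feature j.
    """
    # log(prior) + sum of per-feature log-likelihoods (the naive independence assumption)
    joint = [
        log(priors[c]) + sum(
            gaussian_log_pdf(x, means[c][j], stds[c][j]) for j, x in enumerate(point)
        )
        for c in range(len(priors))
    ]
    # Normalise by the evidence with the log-sum-exp trick to avoid underflow.
    top = max(joint)
    log_evidence = top + log(sum(exp(v - top) for v in joint))
    return [exp(v - log_evidence) for v in joint]

# Hypothetical parameters for two classes and two scaled features (A, C).
priors = [0.6, 0.4]
means = [[-0.5, -0.5], [0.9, 0.9]]
stds = [[0.4, 0.5], [1.0, 1.0]]

posterior = gaussian_nb_posterior([0.2, 0.1], priors, means, stds)
print(posterior, sum(posterior))
```

With the means and standard deviations estimated on the training data, the same function could be applied to each row of the scaled validation data.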

In [None]:
del data_train
del data_valid

## Model 2: Naive Bayes on discrete distributions, with 2 variables (CategoricalNB)

In this part we will see step by step how the estimation of the a posteriori probability is made when we use the Categorical Naive Bayes. Recall: the features are assumed to be independent and categorically distributed. In other words, each feature can take a finite number of values, and the distribution describes the outcome of a single sampling experiment.

The dataset we use in this notebook only has continuous distributions. We will choose two features and discretize (categorize) them to illustrate the use of this version of the Bayesian model. We will use 'texture_mean' and 'concavity_worst'.

In [None]:
# Consider only two features in this model: texture_mean, concavity_worst.

data_train = X_train[['texture_mean', 'concavity_worst']]
data_valid = X_valid[['texture_mean', 'concavity_worst']]
# data_test = X_test[['texture_mean', 'concavity_worst']] # not used in this notebook

#### Discretize the data
This step would be skipped if the features were already categorical. 

In [None]:
# We train 2 discretizers, one for each of the 2 features. 
# Use 3 bins for the first feature and 5 bins for the second, chosen so that the bins have the same number of points.
# Other choices are also possible, and may work well. 

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [
        ("discr1", KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'), [0]),
        ("discr2", KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), [1])
    ]
)

ct.fit(data_train)
discr_train_transf = ct.transform(data_train).astype(int)
discr_train = pd.DataFrame(
    discr_train_transf, 
    index=data_train.index, 
    columns=data_train.columns
)

del data_train
del discr_train_transf

### An sklearn model for the Categorical Naive Bayes classifier

In [None]:
from sklearn.naive_bayes import CategoricalNB

ctgrclNB = CategoricalNB(alpha=1)
ctgrclNB.fit(discr_train, y_train)

# Print model attributes
print('Classes: ', ctgrclNB.classes_) # class labels known to the classifier
print('Class log priors: ', ctgrclNB.class_log_prior_) 
print('Features log probabilities:\n', ctgrclNB.feature_log_prob_)
print('Score on the training data: ', ctgrclNB.score(discr_train, y_train))
pred_labels = ctgrclNB.predict(discr_train)
print(classification_report(y_train, pred_labels))

In [None]:
discr_valid_transf = ct.transform(data_valid).astype(int)
discr_valid = pd.DataFrame(
    discr_valid_transf, 
    index=data_valid.index, 
    columns=data_valid.columns
)

del data_valid
del discr_valid_transf
del ct

print('Score on the validation data: ', ctgrclNB.score(discr_valid, y_valid))
pred_labels = ctgrclNB.predict(discr_valid)
print(classification_report(y_valid, pred_labels))

### An explicit model based on the calculations with Bayes' rule
#### How does Naive Bayes work for categorical features — CategoricalNB

Recall that the Naive Bayes posterior probability is defined as:

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

$$p(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}$$

In the categorical distribution we have a smoothing parameter alpha (that was also explicit in the sklearn model). It prevents zero probabilities, and the numerical issues they cause, when a category has no observation in a class.

Our data has two features ['texture_mean', 'concavity_worst'], i.e., it can be represented as a 2-dimensional vector denoted (T,W). We aim to calculate over the next few steps the posterior probability for “benign” & “malignant” by using the variables (T,W)=('texture_mean', 'concavity_worst'). 

Let's calculate the posterior probability of "how likely is the tumor to be benign/malignant" if T=1 and W=3. 

The estimation of the class prior is the same as before:

In [None]:
ctgrclPrior1 = (y_train.sum())/(len(y_train))
ctgrclPrior0 = 1-(y_train.sum()/(len(y_train)))

print("Class log priors: ", np.log(ctgrclPrior0), np.log(ctgrclPrior1))

From sklearn we got the Class log priors: [-0.46590646 -0.98769539], i.e., the same!

Now we estimate the likelihood probabilities p((T,W)|"benign") and p((T,W)|"malignant"). We assume that the two features are independent and categorically distributed. So we can perform the probability calculations independently for feature T and for feature W. Let x be either T or W, and c be the target class, 0 or 1. Let alpha be the smoothing parameter (we will use alpha=1 in the calculations).

The conditional probability p(x|c;alpha) for a single variable x categorically distributed is:

                  P(x=t | c;alpha)=(N_{t,c} + alpha)/(N_c + alpha*n_x), where 
                   
- N_{t,c} is the number of times that a category t appears in class c (on feature x), 
- N_c is the number of samples in class c, 
- n_x is the number of categories we have in  feature x,
- t can be any of the categories we have in feature x,
- c is either 0 ('benign') or 1 ('malignant').
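
The effect of the smoothing parameter is easiest to see on a category that never occurs in a class. The sketch below implements the formula above as a small function; the function name `smoothed_probability` and the counts are hypothetical.

```python
def smoothed_probability(count_t_c, count_c, n_categories, alpha=1.0):
    """(N_{t,c} + alpha) / (N_c + alpha * n_x): smoothed estimate of P(x=t | c)."""
    return (count_t_c + alpha) / (count_c + alpha * n_categories)

# Hypothetical counts: class c has 100 samples, feature x has 3 categories,
# and the categories were observed 60, 40 and 0 times within the class.
counts = [60, 40, 0]
n_c = sum(counts)

unsmoothed = [smoothed_probability(n, n_c, len(counts), alpha=0.0) for n in counts]
smoothed = [smoothed_probability(n, n_c, len(counts), alpha=1.0) for n in counts]

print(unsmoothed)  # the unseen category gets probability 0
print(smoothed)    # every category gets a small positive probability
print(sum(smoothed))
```

Without smoothing, a single unseen category would force the whole likelihood product for that class to zero. With alpha=1 the estimates still sum to 1 over the categories.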

The likelihood probabilities for T=1 and W=3, with alpha=1, can now be calculated as follows: 

                  P(T=1 | c;alpha)=(N_{1,c}+1)/(N_c+3): 
                  
- N_{1,c} is the number of times that category 1 appears in class c (on feature T), 
- N_c is the number of samples in class c, 
- n_T=3,
- c is either 0 ('benign') or 1 ('malignant').

Similarly, 

                  P(W=3 | c;alpha)=(N_{3,c}+1)/(N_c+5): 
                  
- N_{3,c} is the number of times that category 3 appears in class c (on feature W), 
- N_c is the number of samples in class c, 
- n_W=5.
- c is either 0 ('benign') or 1 ('malignant').

Here are the numerical calculations for the 4 probabilities with c=0 and c=1:

In [None]:
#Merge the labels into the data
discr_train_conc = pd.concat([discr_train,y_train], axis=1)

# Calculations for class "benign" / 0

ctgrclTp0 = (len(discr_train_conc[(discr_train_conc['texture_mean']==1)&(discr_train_conc['diagnosis']==0)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==0])+(1*3))
ctgrclWp0 = (len(discr_train_conc[(discr_train_conc['concavity_worst']==3)&(discr_train_conc['diagnosis']==0)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==0])+(1*5))

# Calculations for class "malignant" / 1

ctgrclTp1 = (len(discr_train_conc[(discr_train_conc['texture_mean']==1)&(discr_train_conc['diagnosis']==1)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==1])+(1*3))
ctgrclWp1 = (len(discr_train_conc[(discr_train_conc['concavity_worst']==3)&(discr_train_conc['diagnosis']==1)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==1])+(1*5))

We can calculate the likelihood p((T,W)=(1,3)|c)=p(T=1|c)* p(W=3|c):

In [None]:
ctgrclLikelihood0 = ctgrclTp0*ctgrclWp0
ctgrclLikelihood1 = ctgrclTp1*ctgrclWp1

We can calculate the evidence:

In [None]:
ctgrclEvidence = (ctgrclPrior0*ctgrclLikelihood0) + (ctgrclPrior1*ctgrclLikelihood1)

We can calculate the posterior probabilities: p(0|(T,W)=(1,3)) and p(1|(T,W)=(1,3)):

In [None]:
ctgrclP0 = (ctgrclPrior0*ctgrclLikelihood0)/ctgrclEvidence
ctgrclP1 = (ctgrclPrior1*ctgrclLikelihood1)/ctgrclEvidence

print("The posterior probabilities for (T,W)=(1,3) are ", ctgrclP0, ctgrclP1)

In [None]:
# The predictions of the sklearn model on (T,W)=(1,3) are:

x = np.array([1,3]).reshape(1,-1)
x = pd.DataFrame(x, columns = ['texture_mean','concavity_worst'])
ctgrclNB.predict_proba(x)

In [None]:
del discr_train
del discr_train_conc
del discr_valid

#### The results are the same! The models predict class 1/malignant for this data point. 
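
The hand calculation covers a single point. As a compact summary, the same steps can be written as one function that works for any queried point. This is a sketch on hypothetical toy data (not the notebook's dataset); the function name `categorical_nb_posterior` is an assumption made for illustration.

```python
from collections import Counter

def categorical_nb_posterior(rows, labels, query, n_categories, alpha=1.0):
    """Posterior P(c | query) from raw counts, for integer-coded categorical features."""
    class_counts = Counter(labels)
    n_total = len(labels)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n_total  # class prior
        for j, value in enumerate(query):
            # N_{t,c}: samples of class c whose feature j equals the queried category
            n_tc = sum(1 for row, y in zip(rows, labels) if y == c and row[j] == value)
            score *= (n_tc + alpha) / (n_c + alpha * n_categories[j])
        scores[c] = score
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical toy data: feature 0 has 3 categories, feature 1 has 5.
rows = [(0, 0), (0, 1), (1, 1), (1, 3), (2, 3), (2, 4), (1, 4), (0, 2)]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

posterior = categorical_nb_posterior(rows, labels, query=(1, 3), n_categories=(3, 5))
print(posterior)
```

On this toy data the second category of the query never occurs in class 0, yet class 0 still receives a non-zero posterior thanks to the smoothing.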

## Model 3: Naive Bayes on binary features, with 2 variables (BernoulliNB)

In this part we will see step by step how the estimation of the a posteriori probability is made when we use the Bernoulli Naive Bayes. This is applicable to binary features that are independent and Bernoulli distributed. 

The dataset we use in this notebook only has continuous distributions. We will choose two features and binarise them to illustrate the use of this version of the Bayesian model. We will use ['concavity_se', 'area_se'].

In [None]:
# Consider only two features in this model: concavity_se, area_se.

data_train = X_train[['concavity_se', 'area_se']]
data_valid = X_valid[['concavity_se', 'area_se']]
# data_test = X_test[['concavity_se', 'area_se']] # not used in this notebook

#### Binarise the data
This step would be skipped if the features were already binary. 

In [None]:
# We train 2 discretizers, one for each of the 2 features. 
# Use 2 bins for each of them. 

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [
        ("binary1", KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile'), [0]),
        ("binary2", KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile'), [1])
    ]
)

ct.fit(data_train)

discr_train_transf = ct.transform(data_train).astype(int)
discr_train = pd.DataFrame(
    discr_train_transf, 
    index=data_train.index, 
    columns=data_train.columns
)

del data_train
del discr_train_transf

### An sklearn model for the Bernoulli Naive Bayes classifier

In [None]:
from sklearn.naive_bayes import BernoulliNB

brnllNB = BernoulliNB(alpha=1)
brnllNB.fit(discr_train, y_train)

# Print model attributes
print('Classes: ', brnllNB.classes_) # class labels known to the classifier
print('Class log priors: ', brnllNB.class_log_prior_) 
print('Features log probabilities:\n', brnllNB.feature_log_prob_)
print('Score on the training data: ', brnllNB.score(discr_train, y_train))
pred_labels = brnllNB.predict(discr_train)
print(classification_report(y_train, pred_labels))

In [None]:
discr_valid_transf = ct.transform(data_valid).astype(int)
discr_valid = pd.DataFrame(
    discr_valid_transf, 
    index=data_valid.index, 
    columns=data_valid.columns
)

del data_valid
del discr_valid_transf

print('Score on the validation data: ', brnllNB.score(discr_valid, y_valid))
pred_labels = brnllNB.predict(discr_valid)
print(classification_report(y_valid, pred_labels))

### An explicit model based on the calculations with Bayes' rule
#### How does Naive Bayes work for binary features — BernoulliNB

Recall that the Naive Bayes’ posterior probability calculus is defined as:

52bd0ca5938da89d7f9bf388dc7edcbd546c118e.svg

d0d9f596ba491384422716b01dbe74472060d0d7.svg

We use a smoothing parameter alpha (that was also explicit in the sklearn model). It prevents zero probabilities, and the numerical issues they cause, even if a category has no observation in a class.

Our data has two features ['concavity_se', 'area_se'], i.e., it can be represented as a 2-dimensional vector denoted (C,A). We aim to calculate over the next few steps the posterior probability for “benign” & “malignant” by using the variables (C,A)=('concavity_se', 'area_se'). 

Let's calculate the posterior probability of "how likely is the tumor to be benign/malignant" if C=1 and A=0. 

The estimation of the class prior is the same as before:

In [None]:
brnllPrior1 = (y_train.sum())/(len(y_train))
brnllPrior0 = 1-(y_train.sum()/(len(y_train)))

print("Class log priors: ", np.log(brnllPrior0), np.log(brnllPrior1))

Note: from the sklearn model we got the class log priors: [-0.46590646 -0.98769539]. Same values!

Now we estimate the likelihood probabilities p((C,A)|"benign") and p((C,A)|"malignant"). We assume that the two features are independent and Bernoulli distributed. So we can perform the probability calculations independently for feature C and for feature A. Let x be either C or A (with values 0 or 1), and c be the target class (0 or 1). Let alpha be the smoothing parameter. The conditional probability of observing the value v for x on the Bernoulli distribution is:

          P(x=v|c;alpha) = P(x=1|c;alpha)*v + (1-P(x=1|c;alpha))*(1-v)    (where v=0,1)

so it is enough to estimate P(x=1|c;alpha).
                                
As in the case of the categorical distribution, we have now 

                  P(x=1 | c;alpha)=(N_{1,c}+alpha)/(N_c+alpha * n_x), where 
                   
- N_{1,c} is the number of times that category 1 appears in class c (on feature x), 
- N_c is the number of samples in class c, 
- n_x is the number of categories we have in feature x (in this case n_x=2).

This leads to 

                  P(x=1 | c;alpha)=(N_{1,c}+alpha)/(N_c+alpha * 2).

The likelihood probabilities for C=1 and A=0, with alpha=1, can now be calculated as follows:

                  P(C=1 | c;alpha)=(N_{1,c}+1)/(N_c+2): 
                  
- N_{1,c} is the number of times that category 1 appears in class c (on feature C), 
- N_c is the number of samples in class c.

Similarly, 

                  P(A=0 | c;alpha) = 1-P(A=1 | c;alpha)
                                   = 1-(N_{1,c}+1)/(N_c+2)
                                   = (N_c-N_{1,c}+1)/(N_c+2)
                                   = (N_{0,c}+1)/(N_c+2): 
                  
- N_{0,c} is the number of times that category 0 appears in class c (on feature A), 
- N_c is the number of samples in class c.
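
The identity used above, that 1-P(A=1|c) equals the smoothed estimate built from the count of zeros, can be verified numerically. This is a minimal sketch with hypothetical counts; the function names are illustrative assumptions.

```python
def bernoulli_p1(count_ones, count_c, alpha=1.0):
    """Smoothed estimate of P(x=1 | c) = (N_{1,c} + alpha) / (N_c + 2 * alpha)."""
    return (count_ones + alpha) / (count_c + 2 * alpha)

def bernoulli_likelihood(value, p1):
    """P(x=value | c) for a binary value: p1 if value is 1, else 1 - p1."""
    return p1 * value + (1 - p1) * (1 - value)

# Hypothetical counts: class c has 50 samples, 30 of them with A=1 (so 20 with A=0).
n_c, n_ones = 50, 30
p_a1 = bernoulli_p1(n_ones, n_c)

via_complement = bernoulli_likelihood(0, p_a1)   # 1 - P(A=1 | c)
via_zero_count = (n_c - n_ones + 1) / (n_c + 2)  # (N_{0,c} + 1) / (N_c + 2)
print(p_a1, via_complement, via_zero_count)
```

Both routes give the same value, which is why counting the zeros directly is a valid shortcut.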

Here are the numerical calculations for the 4 probabilities (C=1, A=0) with c=0 and c=1:

In [None]:
#Merge the labels into the data
discr_train_conc = pd.concat([discr_train,y_train], axis=1)

# Calculations for class "benign" / 0

brnllCp0 = (len(discr_train_conc[(discr_train_conc['concavity_se']==1)&(discr_train_conc['diagnosis']==0)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==0])+2)
brnllAp0 = (len(discr_train_conc[(discr_train_conc['area_se']==0)&(discr_train_conc['diagnosis']==0)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==0])+2)

# Calculations for class "malignant" / 1

brnllCp1 = (len(discr_train_conc[(discr_train_conc['concavity_se']==1)&(discr_train_conc['diagnosis']==1)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==1])+2)
brnllAp1 = (len(discr_train_conc[(discr_train_conc['area_se']==0)&(discr_train_conc['diagnosis']==1)])+1) / (len(discr_train_conc[discr_train_conc['diagnosis']==1])+2)

We can calculate the likelihood p((C,A)=(1,0)|c)=p(C=1|c)* p(A=0|c):

In [None]:
brnllLikelihood0 = brnllCp0*brnllAp0
brnllLikelihood1 = brnllCp1*brnllAp1

We can calculate the evidence:

In [None]:
brnllEvidence = (brnllPrior0*brnllLikelihood0) + (brnllPrior1*brnllLikelihood1)

We can calculate the posterior probabilities: p(0|(C,A)=(1,0)) and p(1|(C,A)=(1,0)):

In [None]:
brnllP0 = (brnllPrior0*brnllLikelihood0)/brnllEvidence
brnllP1 = (brnllPrior1*brnllLikelihood1)/brnllEvidence

print("The posterior probabilities for (C,A)=(1,0) are ", brnllP0, brnllP1)

In [None]:
# The predictions of the sklearn model on (C,A)=(1,0) are:

x = np.array([1,0]).reshape(1,-1)
x = pd.DataFrame(x, columns = ['concavity_se','area_se'])
brnllNB.predict_proba(x)

In [None]:
del discr_train
del discr_train_conc
del discr_valid

#### The results are the same!

# Assignment 3: Naive Bayes on the Iris dataset. Use the random seed and the split from assignment 1

### 1. Build a Gaussian Naive Bayes classifier (you may use sklearn) to classify the Iris dataset. Check the score of the model on the validation dataset.

### 2. Build a categorical Naive Bayes classifier (you may use sklearn) to classify the Iris dataset. Use 5 bins for each feature, each with the same number of points. Check the score of the model on the validation dataset.

### 3. Build a Bernoulli Naive Bayes classifier (you may use sklearn) to classify the Iris dataset. Check the score of the model on the validation dataset.

### 4. Select your best model as the model with the best score on the validation set. Check its score on the test dataset.