## Table of Contents:
* [Bernoulli Naive Bayes](#bernoulli_naive_bayes)
* [Data load ~ 1](#data_load_1)
* [Math Explanation ~ 1](#math_expl_1)
* [SciKit BernoulliNB ~ 1](#sci_bnb_1)
* [Data load ~ 2](#data_load_2)
* [SciKit BernoulliNB ~ 2](#sci_bnb_2)

In [1]:
import pandas as pd
import traceback
import numpy as np
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.naive_bayes import BernoulliNB

## Bernoulli Naive Bayes <a class="anchor" id="bernoulli_naive_bayes"></a>

1) Bernoulli Naive Bayes models every feature as a <b>binary value</b>, such as true or false, yes or no, success or failure, 0 or 1. When all feature values are binary, the Bernoulli Naive Bayes classifier is the natural choice. <br>
2) Because the values are binary, let 'p' be the probability of success and '1-p' the probability of failure. <br>
For a random variable 'X' that follows a Bernoulli distribution, 'x' can take only two values, 0 or 1, with P(x = 1) = p and P(x = 0) = 1 - p. The classifier predicts the class with the largest prior times likelihood: <br>

$
\begin{align}
& \hat{y} = \underset{k \in \{1, \dots, K\}}{\mathrm{arg\,max}} \; P(y_k) \prod_{i=1}^{d} P(x_i \mid y_k) \\
\end{align}
$

$
\begin{align}
  & P(x_{i}\mid y_{k}) = 
  \begin{cases}
    p     & \text{if $x_i = 1$}, \\
    1 - p & \text{if $x_i = 0$}.
  \end{cases}
\end{align}
$
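
The case-wise definition above is equivalent to the compact form $p^{x_i}(1-p)^{1-x_i}$. A minimal sketch in plain Python follows; `bernoulli_likelihood` is a hypothetical helper name used only for illustration, and the probability 0.75 is a made-up value, not taken from the data.

```python
def bernoulli_likelihood(x, p):
    """Return P(x | y) for one binary feature, where p = P(x = 1 | y)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    # p when the feature is present, 1 - p when it is absent
    return p if x == 1 else 1.0 - p


# Illustrative value only: p = 0.75 is an assumption for this demo
print(bernoulli_likelihood(1, 0.75))
print(bernoulli_likelihood(0, 0.75))
```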

In [2]:
def get_conf():
    try:
        conf = {
            "data1_fl_path": "../DataSets/heart_disease.csv",
            "data2_fl_path": "../DataSets/questions_vs_statements_v1.0.csv"
        }       
        return conf
    except Exception as e:
        raise e

***
<b>HEART DISEASE</b>
***

## Data load <a class="anchor" id="data_load_1"></a>
<b>Data1:</b> <br>
https://www.kaggle.com/code/murattademir/heart-disease-binary-classification/data <br>
-- selected three features ['HighBP', 'HighChol', 'Smoker'] <br>
-- target is 'Class' ~ heart disease or heart attack (1) or neither (0) <br>

In [3]:
def load_heart_disease(conf):
    try:
        df = pd.read_csv(conf["data1_fl_path"])
        df = df[['HighBP', 'HighChol', 'Smoker', 'HeartDiseaseorAttack']]
        df.rename({'HeartDiseaseorAttack': 'Class'}, axis=1, inplace=True)
        return df
    except Exception as e:
        raise e

In [4]:
def data_explor():
    try:
        conf = get_conf()
        heart_df = load_heart_disease(conf)
        display(heart_df.head())
        
        count_df=pd.DataFrame()
        
        highbp_cnt = heart_df['HighBP'].value_counts().to_frame()
        highcol_cnt = heart_df['HighChol'].value_counts().to_frame()
        smpker_cnt = heart_df['Smoker'].value_counts().to_frame()
        cls_cnt = heart_df['Class'].value_counts().to_frame()
        
        count_df = pd.concat([highbp_cnt, highcol_cnt, smpker_cnt, cls_cnt], axis=1)
        display(count_df)
        
        highbp_cnt_gp = heart_df.groupby('Class')['HighBP'].value_counts().to_frame()
        highcol_cnt_gp = heart_df.groupby('Class')['HighChol'].value_counts().to_frame()
        smpker_cnt_gp = heart_df.groupby('Class')['Smoker'].value_counts().to_frame()
        
        count_df_gp = pd.concat([highbp_cnt_gp, highcol_cnt_gp, smpker_cnt_gp], axis=1)
        display(count_df_gp)
        
        return heart_df
    except Exception as e:
        traceback.print_exc()
        
heart_df = data_explor()

Unnamed: 0,HighBP,HighChol,Smoker,Class
0,1.0,1.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,1.0,0.0,0.0


Unnamed: 0,HighBP,HighChol,Smoker,Class
0.0,144851,146089,141257,229787
1.0,108829,107591,112423,23893


Unnamed: 0_level_0,Unnamed: 1_level_0,HighBP,HighChol,Smoker
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,0.0,138886,138949,132165
0.0,1.0,90901,90838,97622
1.0,1.0,17928,16753,14801
1.0,0.0,5965,7140,9092


## Math Explanation <a class="anchor" id="math_expl_1"></a>

<table style="float:left">
    <tr>
         <td>
            <table>
                <tr>
                    <td colspan=3> Frequency Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td>  </td>
                    <td> 229787 </td>
                    <td> 23893 </td>
                    <td> 253680 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Prior Probability Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td>  </td>
                    <td> 229787/253680 </td>
                    <td> 23893/253680 </td>
                    <td> 253680/253680 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Prior Probability Table (rounded) </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td>  </td>
                    <td> 0.91 </td>
                    <td> 0.094 </td>
                    <td> 1.0 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Log Prior Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td>  </td>
                    <td> -0.094310 </td>
                    <td> -2.36446 </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
    </tr>    
</table>

<table style="float:left">
    <tr>
         <td>
            <table>
                <tr>
                    <td colspan=3> Frequency Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighBP </td>
                    <td> Yes </td>
                    <td> 90901 </td>
                    <td> 17928 </td>
                    <td> 108829 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 138886 </td>
                    <td> 5965 </td>
                    <td> 144851 </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787 </td>
                    <td> 23893 </td>
                    <td> 253680 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighBP </td>
                    <td> 1 </td>
                    <td> 90901/229787 </td>
                    <td> 17928/23893 </td>
                    <td> 108829/253680 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 138886/229787 </td>
                    <td> 5965/23893 </td>
                    <td> 144851/253680 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787/253680 </td>
                    <td> 23893/253680 </td>
                    <td> </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighBP </td>
                    <td> 1 </td>
                    <td> 0.3956 </td>
                    <td> 0.75034 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 0.6044 </td>
                    <td> 0.24965 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td>  </td>
                    <td>  </td>
                    <td> </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Log Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighBP </td>
                    <td> 1 </td>
                    <td> -0.9273516 </td>
                    <td> -0.2872288 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> 0 </td>
                    <td> -0.5035190 </td>
                    <td> -1.3876954 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td>  </td>
                    <td>  </td>
                    <td> </td>
                </tr>
            </table>
        </td>
    </tr>    
</table>

<table style="float:left">
    <tr>
         <td>
            <table>
                <tr>
                    <td colspan=3> Frequency Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighChol </td>
                    <td> Yes </td>
                    <td> 90838 </td>
                    <td> 16753 </td>
                    <td> 107591 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 138949 </td>
                    <td> 7140 </td>
                    <td> 146089 </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787 </td>
                    <td> 23893 </td>
                    <td> 253680 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighChol </td>
                    <td> Yes </td>
                    <td> 90838/229787 </td>
                    <td> 16753/23893 </td>
                    <td> 107591/253680 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 138949/229787 </td>
                    <td> 7140/23893 </td>
                    <td> 146089/253680 </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787/253680 </td>
                    <td> 23893/253680 </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighChol </td>
                    <td> Yes </td>
                    <td> 0.39531 </td>
                    <td> 0.70116 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 0.60468 </td>
                    <td> 0.29883 </td>
                    <td>  </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> </td>
                    <td> </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Log Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> HighChol </td>
                    <td> Yes </td>
                    <td> -0.928085 </td>
                    <td> -0.3550192 </td>
                    <td>  </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> -0.5030558 </td>
                    <td> -1.20788043 </td>
                    <td>  </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> </td>
                    <td> </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
    </tr>    
</table>

<table style="float:left">
    <tr>
         <td>
            <table>
                <tr>
                    <td colspan=3> Frequency Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> 0 </td>
                    <td> 1 </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> Smoker </td>
                    <td> Yes </td>
                    <td> 97622 </td>
                    <td> 14801 </td>
                    <td> 112423 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 132165 </td>
                    <td> 9092 </td>
                    <td> 141257 </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787 </td>
                    <td> 23893 </td>
                    <td> 253680 </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> Smoker </td>
                    <td> Yes </td>
                    <td> 97622/229787 </td>
                    <td> 14801/23893 </td>
                    <td> 112423/253680 </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 132165/229787 </td>
                    <td> 9092/23893 </td>
                    <td> 141257/253680 </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> 229787/253680 </td>
                    <td> 23893/253680 </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> Smoker </td>
                    <td> Yes </td>
                    <td> 0.42483 </td>
                    <td> 0.61947 </td>
                    <td> </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> 0.57516 </td>
                    <td> 0.38053 </td>
                    <td> </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> </td>
                    <td> </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td colspan=3> Log Likelihood Table </td>
                </tr>
                <tr>
                    <td> Class </td>
                    <td>  </td>
                    <td> No </td>
                    <td> Yes </td>
                    <td> Total </td>
                </tr>
                <tr>
                    <td> Smoker </td>
                    <td> Yes </td>
                    <td> -0.8560662 </td>
                    <td> -0.478891 </td>
                    <td> </td>
                </tr>
                <tr>
                    <td>  </td>
                    <td> No </td>
                    <td> -0.55310702 </td>
                    <td> -0.9661903 </td>
                    <td> </td>
                </tr>
                 <tr>
                    <td>  </td>
                    <td> Total </td>
                    <td> </td>
                    <td> </td>
                    <td>  </td>
                </tr>
            </table>
        </td>
    </tr>    
</table>
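
The likelihood and log-likelihood tables above can be reproduced from the raw counts. The sketch below hard-codes the counts from the frequency tables and uses no smoothing, matching the hand calculation; the dictionary layout and the `likelihood` helper are assumptions made for illustration.

```python
import math

# Number of rows with feature == 1, per class, copied from the frequency tables
ones = {
    "HighBP":   {0: 90901, 1: 17928},
    "HighChol": {0: 90838, 1: 16753},
    "Smoker":   {0: 97622, 1: 14801},
}
# Number of rows per class
class_total = {0: 229787, 1: 23893}


def likelihood(feature, value, cls):
    """Unsmoothed P(feature = value | Class = cls) from raw counts."""
    p_one = ones[feature][cls] / class_total[cls]
    return p_one if value == 1 else 1.0 - p_one


for feature in ones:
    for value in (1, 0):
        probs = [likelihood(feature, value, c) for c in (0, 1)]
        logs = [math.log(p) for p in probs]
        print(feature, value,
              [round(p, 5) for p in probs],
              [round(v, 5) for v in logs])
```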

--> <b>Let's take two test instances</b> <br>
HighBP : 1 &nbsp; HighChol : 1 &nbsp; Smoker: 0 &nbsp; Class : ? <br>
HighBP: 0 &nbsp; HighChol: 0 &nbsp; Smoker: 1  &nbsp; Class : ? <br>

--> <b>TestCase-1</b> <br>
HighBP : 1 &nbsp; HighChol : 1 &nbsp; Smoker: 0 &nbsp; Class : ? <br>

<b>Let's calculate $ P(y_{Class}|X_{test}) $ for test sample 1 </b>

***

<b>Probability $ P(X|y):- $ </b><br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = \\
P(X_{HighBP} = 1 | y_{Class} = 1) * \\
P(X_{HighChol} = 1 | y_{Class} = 1) * \\
P(X_{Smoker} = 0 | y_{Class} = 1) \\
\end{align}
$
<br>

$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = 0.75034 * 0.70116 * 0.38053
\end{align}
$
<br>
$ 
\begin{align}
P(y_{Class} = 1) = 0.094
\end{align}
$
<br>
So the unnormalized posterior, $ P(y) P(X|y) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
P(y_{Class} = 1) P(X_{test}|y_{Class} = 1) = 0.094 * 0.75034 * 0.70116 * 0.38053 = 0.01881880256
\end{align}
$

***

The first part is $ P(X|y):- $ <br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = \\
P(X_{HighBP} = 1 | y_{Class} = 0) * \\
P(X_{HighChol} = 1 | y_{Class} = 0) * \\
P(X_{Smoker} = 0 | y_{Class} = 0)  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0)) = \\
log(P(X_{HighBP} = 1 | y_{Class} = 0)) + \\
log(P(X_{HighChol} = 1 | y_{Class} = 0)) + \\
log(P(X_{Smoker} = 0 | y_{Class} = 0))  \\
\end{align}
$
<br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = 0.3956 * 0.39531 * 0.57516
\end{align}
$
<br>
$ 
\begin{align}
P(y_{Class} = 0) = 0.91
\end{align}
$
<br>
So the unnormalized posterior, $ P(y) P(X|y) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
P(y_{Class} = 0) P(X_{test}|y_{Class} = 0) = 0.91 * 0.3956 * 0.39531 * 0.57516 = 0.08185103039
\end{align}
$

***
$
\begin{align}
& evidence = 0.01881880256 + 0.08185103039 = 0.10066983295 \\
& P(y_{Class} = 1 | X_{test}) = 0.01881880256/0.10066983295  = 0.18693586756 \\
& P(y_{Class} = 0 | X_{test}) = 0.08185103039/0.10066983295  = 0.81306413243 
\end{align}
$
***

So, the answer is Class = 0, the class with the larger posterior probability.
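
The arithmetic for this test case can be checked with a few lines of Python. This is a sketch that reuses the rounded priors and likelihoods from the tables above; the `posterior` helper is a hypothetical name, not a scikit-learn function.

```python
# Rounded values copied from the tables above
prior = {0: 0.91, 1: 0.094}
likelihoods = {
    # P(HighBP=1 | c), P(HighChol=1 | c), P(Smoker=0 | c)
    0: [0.3956, 0.39531, 0.57516],
    1: [0.75034, 0.70116, 0.38053],
}


def posterior(prior, likelihoods):
    """Multiply prior and likelihoods per class, then normalize by the evidence."""
    unnormalized = {}
    for cls in prior:
        score = prior[cls]
        for p in likelihoods[cls]:
            score *= p
        unnormalized[cls] = score
    evidence = sum(unnormalized.values())
    return {cls: score / evidence for cls, score in unnormalized.items()}


post = posterior(prior, likelihoods)
print(post)
print("predicted class:", max(post, key=post.get))
```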

<b>Joint Log Likelihood $ log(P(X|y)):- $ </b><br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = \\
P(X_{HighBP} = 1 | y_{Class} = 1) * \\
P(X_{HighChol} = 1 | y_{Class} = 1) * \\
P(X_{Smoker} = 0 | y_{Class} = 1) \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1)) = \\
log(P(X_{HighBP} = 1 | y_{Class} = 1)) + \\
log(P(X_{HighChol} = 1 | y_{Class} = 1)) + \\
log(P(X_{Smoker} = 0 | y_{Class} = 1)) \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1)) = -0.2872288 + -0.3550192 + -0.9661903 = -1.6084383
\end{align}
$
<br>
$ 
\begin{align}
log(P(y_{Class} = 1)) = -2.36446
\end{align}
$
<br>
So the unnormalized log posterior, $ log(P(y)) + log(P(X|y)) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
log(P(y_{Class} = 1) P(X_{test}|y_{Class} = 1)) = -1.6084383 + -2.36446 = -3.9728983
\end{align}
$

***

The first part is $ P(X|y):- $ <br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = \\
P(X_{HighBP} = 1 | y_{Class} = 0) * \\
P(X_{HighChol} = 1 | y_{Class} = 0) * \\
P(X_{Smoker} = 0 | y_{Class} = 0)  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0)) = \\
log(P(X_{HighBP} = 1 | y_{Class} = 0)) + \\
log(P(X_{HighChol} = 1 | y_{Class} = 0)) + \\
log(P(X_{Smoker} = 0 | y_{Class} = 0))  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0)) = -0.9273516 + -0.928085 + -0.55310702 = -2.40854
\end{align}
$
<br>
$ 
\begin{align}
log(P(y_{Class} = 0)) = -0.094310
\end{align}
$
<br>
So the unnormalized log posterior, $ log(P(y)) + log(P(X|y)) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
log(P(y_{Class} = 0) P(X_{test}|y_{Class} = 0)) = -2.40854 + -0.094310 = -2.50285
\end{align}
$

***

So, the answer is Class = 0, the class with the larger (less negative) joint log likelihood.
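
Working in log space replaces the product with a sum, which avoids numerical underflow when there are many features. The sketch below reuses the rounded log values from the tables above and normalizes with the log-sum-exp trick; the variable names are illustrative assumptions.

```python
import math

# Rounded log values copied from the tables above
log_prior = {0: -0.094310, 1: -2.36446}
log_lik = {
    # log P(HighBP=1 | c), log P(HighChol=1 | c), log P(Smoker=0 | c)
    0: [-0.9273516, -0.928085, -0.55310702],
    1: [-0.2872288, -0.3550192, -0.9661903],
}

# Joint log likelihood per class: log prior plus the sum of log likelihoods
joint = {c: log_prior[c] + sum(log_lik[c]) for c in log_prior}

# Log-sum-exp: subtract the maximum before exponentiating for stability
m = max(joint.values())
log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
log_post = {c: v - log_evidence for c, v in joint.items()}

print(joint)
print({c: math.exp(v) for c, v in log_post.items()})
```

Exponentiating the normalized values recovers the same posterior probabilities as the direct product form.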

--> <b>TestCase-2</b> <br>
HighBP : 0 &nbsp; HighChol : 0 &nbsp; Smoker: 1 &nbsp; Class : ? <br>

<b>Let's calculate $ P(y_{Class}|X_{test}) $ for test sample 2 </b>

***

<b>Probability $ P(X|y):- $ </b><br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = \\
P(X_{HighBP} = 0 | y_{Class} = 1) * \\
P(X_{HighChol} = 0 | y_{Class} = 1) * \\
P(X_{Smoker} = 1 | y_{Class} = 1)  \\
\end{align}
$
<br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = 0.24965 * 0.29883 * 0.61947
\end{align}
$
<br>
$ 
\begin{align}
P(y_{Class} = 1) = 0.094
\end{align}
$
<br>
So the unnormalized posterior, $ P(y) P(X|y) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
P(y_{Class} = 1) P(X_{test}|y_{Class} = 1) = 0.094 * 0.24965 * 0.29883 * 0.61947 = 0.00434414084
\end{align}
$

***

The first part is $ P(X|y):- $ <br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = \\
P(X_{HighBP} = 0 | y_{Class} = 0) * \\
P(X_{HighChol} = 0 | y_{Class} = 0) * \\
P(X_{Smoker} = 1 | y_{Class} = 0)  \\
\end{align}
$
<br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = 0.6044 * 0.60468 * 0.42483
\end{align}
$
<br>
$ 
\begin{align}
P(y_{Class} = 0) = 0.91
\end{align}
$
<br>
So the unnormalized posterior, $ P(y) P(X|y) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
P(y_{Class} = 0) P(X_{test}|y_{Class} = 0) = 0.91 * 0.6044 * 0.60468 * 0.42483 = 0.14128843996
\end{align}
$

***
$
\begin{align}
& evidence = 0.00434414084 + 0.14128843996 = 0.1456325808 \\
& P(y_{Class} = 1 | X_{test}) = 0.00434414084/0.1456325808  = 0.02982945722 \\
& P(y_{Class} = 0 | X_{test}) = 0.14128843996/0.1456325808  = 0.97017054277  
\end{align}
$
***

So, the answer is Class = 0, the class with the larger posterior probability.

<b>Joint Log Likelihood $ log(P(X|y)):- $ </b><br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1) = \\
P(X_{HighBP} = 0 | y_{Class} = 1) * \\
P(X_{HighChol} = 0 | y_{Class} = 1) * \\
P(X_{Smoker} = 1 | y_{Class} = 1)  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1)) = \\
log(P(X_{HighBP} = 0 | y_{Class} = 1)) + \\
log(P(X_{HighChol} = 0 | y_{Class} = 1)) + \\
log(P(X_{Smoker} = 1 | y_{Class} = 1)) \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 1)) = -1.3876954 + -1.20788043 + -0.478891 = -3.07446683
\end{align}
$
<br>
$ 
\begin{align}
log(P(y_{Class} = 1)) = -2.36446
\end{align}
$
<br>
So the unnormalized log posterior, $ log(P(y)) + log(P(X|y)) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
log(P(y_{Class} = 1) P(X_{test}|y_{Class} = 1)) = -3.07446683 + -2.36446 = -5.43892683
\end{align}
$

***

The first part is $ P(X|y):- $ <br>
$ 
\begin{align}
P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0) = \\
P(X_{HighBP} = 0 | y_{Class} = 0) * \\
P(X_{HighChol} = 0 | y_{Class} = 0) * \\
P(X_{Smoker} = 1 | y_{Class} = 0)  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0)) = \\
log(P(X_{HighBP} = 0 | y_{Class} = 0)) + \\
log(P(X_{HighChol} = 0 | y_{Class} = 0)) + \\
log(P(X_{Smoker} = 1 | y_{Class} = 0))  \\
\end{align}
$
<br>
$ 
\begin{align}
log(P(X_{HighBP, HighChol, Smoker}|y_{Class} = 0)) = -0.5035190 + -0.5030558 + -0.8560662 = -1.862641
\end{align}
$
<br>
$ 
\begin{align}
log(P(y_{Class} = 0)) = -0.094310
\end{align}
$
<br>
So the unnormalized log posterior, $ log(P(y)) + log(P(X|y)) $, is [ignoring the denominator $P(X)$] <br>
$
\begin{align}
log(P(y_{Class} = 0) P(X_{test}|y_{Class} = 0)) = -1.862641 + -0.094310 = -1.956951
\end{align}
$

***

So, the answer is Class = 0, the class with the larger (less negative) joint log likelihood.

## SciKit BernoulliNB <a class="anchor" id="sci_bnb_1"></a>

In [5]:
conf = get_conf()
data_df = load_heart_disease(conf)

X_train = data_df[['HighBP', 'HighChol', 'Smoker']]
y_train = data_df['Class']

print(X_train.shape, y_train.shape)

# instantiate the model
bnb = BernoulliNB()

# fit the model
bnb.fit(X_train, y_train)

print('Sklearn values:')
print('class_count_\n',bnb.class_count_)
print('class_log_prior_\n',bnb.class_log_prior_)
print('feature_count_\n',bnb.feature_count_)
print('feature log-probabilities\n',bnb.feature_log_prob_)

(253680, 3) (253680,)
Sklearn values:
class_count_
 [229787.  23893.]
class_log_prior_
 [-0.09892084 -2.3624881 ]
feature_count_
 [[90901. 90838. 97622.]
 [17928. 16753. 14801.]]
feature log-probabilities
 [[-0.92737949 -0.92807279 -0.85604838]
 [-0.28724972 -0.3550322  -0.47890693]]


In [6]:
arr_test = [[1, 1, 0]]
X_test=pd.DataFrame(arr_test, columns=['HighBP', 'HighChol', 'Smoker'])
y_pred = bnb.predict(X_test)


print(y_pred)
print('Sklearn predict_proba\n', bnb.predict_proba(X_test))
print('Sklearn predict_log_proba\n',bnb.predict_log_proba(X_test))
print('Sklearn _joint_log_likelihood\n',bnb._joint_log_likelihood(X_test))

[0.]
Sklearn predict_proba
 [[0.81206107 0.18793893]]
Sklearn predict_log_proba
 [[-0.20817974 -1.67163819]]
Sklearn _joint_log_likelihood
           0         1
0 -2.507476 -3.970934


In [7]:
arr_test = [[0, 0, 1]]
X_test=pd.DataFrame(arr_test, columns=['HighBP', 'HighChol', 'Smoker'])
y_pred = bnb.predict(X_test)


print(y_pred)
print('Sklearn predict_proba\n', bnb.predict_proba(X_test))
print('Sklearn predict_log_proba\n',bnb.predict_log_proba(X_test))
print('Sklearn _joint_log_likelihood\n',bnb._joint_log_likelihood(X_test))

[0.]
Sklearn predict_proba
 [[0.96997636 0.03002364]]
Sklearn predict_log_proba
 [[-0.03048358 -3.50577017]]
Sklearn _joint_log_likelihood
           0         1
0 -1.961517 -5.436804
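
The scikit-learn values differ slightly from the hand calculation (for example -2.507476 versus -2.50285 for test case 1, Class 0). Two things explain the gap: the hand calculation used priors rounded to 0.91 and 0.094, and `BernoulliNB` applies additive (Laplace) smoothing with `alpha=1.0` by default. The sketch below recomputes the joint log likelihood from the raw counts under those two assumptions; `joint_log_likelihood` here is an illustrative helper, not the library's internal method.

```python
import math

ALPHA = 1.0  # BernoulliNB's default additive smoothing parameter

# Counts copied from the class_count_ and feature_count_ output above
class_count = {0: 229787, 1: 23893}
feature_count = {0: [90901, 90838, 97622], 1: [17928, 16753, 14801]}
n_total = sum(class_count.values())


def joint_log_likelihood(x):
    """Log prior plus smoothed Bernoulli log likelihood for each class."""
    scores = {}
    for cls, n_c in class_count.items():
        score = math.log(n_c / n_total)  # unrounded, unsmoothed class prior
        for x_i, n_ci in zip(x, feature_count[cls]):
            # Smoothed P(x_i = 1 | cls); the 2 is the number of feature values
            p = (n_ci + ALPHA) / (n_c + 2 * ALPHA)
            score += math.log(p) if x_i == 1 else math.log(1.0 - p)
        scores[cls] = score
    return scores


print(joint_log_likelihood([1, 1, 0]))
print(joint_log_likelihood([0, 0, 1]))
```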


***
<b>QUESTION - STATEMENT</b>
***

## Data load <a class="anchor" id="data_load_2"></a>
<b>Data2:</b> <br>
https://www.kaggle.com/datasets/shahrukhkhan/questions-vs-statementsclassificationdataset <br>
-- selected feature ['doc'] (the sentence text) <br>
-- target is 'Class' ~ question (1) or statement (0) <br>

In [8]:
def load_ques_stmt(conf):
    try:
        df = pd.read_csv(conf["data2_fl_path"])
        df = df[['doc', 'target']]
        df.rename({'target': 'Class'}, axis=1, inplace=True)
        return df.head(20)
    except Exception as e:
        raise e

In [9]:
def data_explor():
    try:
        conf = get_conf()
        ques_stmt_df = load_ques_stmt(conf)
        display(ques_stmt_df.head())
        
        count_df=pd.DataFrame()
        
        cls_cnt = ques_stmt_df['Class'].value_counts().to_frame()
        
        count_df = pd.concat([cls_cnt], axis=1)
        display(count_df)
        
        return ques_stmt_df
    except Exception as e:
        traceback.print_exc()
        
ques_stmt_df = data_explor()

Unnamed: 0,doc,Class
0,a CBC or Radio-Canada television station loca...,0
1,"Unsurprisingly, these officers enforced socia...",0
2,"In 1952, Thomas Watson, Sr. In what year did I...",1
3,What Roman battle took place in the year 446 BC,1
4,What often lacks in software developed when it...,1


Unnamed: 0,Class
0,10
1,10


In [10]:
def gen_clean_data(df_row):
    try:
        remove_punc = [char for char in df_row if char not in string.punctuation]
        remove_punc = "".join(remove_punc)
        remove_punc = remove_punc.split()
        
        lower_word = [word.lower() for word in remove_punc]
        
        # clean_word = [word for word in lower_word if word not in stopwords.words('english')]
        
        join_word = " ".join(lower_word)
        
        return join_word       
    except Exception as e:
        raise e

In [11]:
def convert_to_BOW(corpus):
    """Convert the text into a binary bag-of-words (BOW) matrix using CountVectorizer."""
    try:
        # binary=True records whether each word is present (1) or absent (0)
        # in a document instead of how many times it occurs
        vectorizer = CountVectorizer(binary=True)
        X = vectorizer.fit_transform(corpus)
        # Save vectorizer.vocabulary_
        # pickle.dump(vectorizer.vocabulary_,open("vocab.pkl","wb"))
        return X.toarray(), vectorizer
    except Exception as e:
        raise e
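
To make the binary encoding concrete, here is a pure-Python sketch of the idea behind `CountVectorizer(binary=True)`. It is not the library's implementation: `binary_bow` is a hypothetical helper that splits on whitespace, whereas `CountVectorizer` by default also lowercases the text and drops single-character tokens.

```python
def binary_bow(corpus):
    """Build a sorted vocabulary and one 0/1 presence vector per document."""
    vocab = sorted({word for doc in corpus for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in corpus:
        row = [0] * len(vocab)
        for word in doc.split():
            row[index[word]] = 1  # presence only, repeated words still give 1
        rows.append(row)
    return vocab, rows


# Made-up two-sentence corpus for illustration
vocab, rows = binary_bow(["what year was it", "it was a good year year"])
print(vocab)
print(rows)
```

The repeated word "year" in the second sentence still produces a 1, which is exactly the information a Bernoulli model expects.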

In [12]:
ques_stmt_df["doc"] = ques_stmt_df.iloc[:,0].apply(gen_clean_data)
display(ques_stmt_df.head(5))
X, vectorizer = convert_to_BOW(ques_stmt_df["doc"].values)
y = ques_stmt_df["Class"].values

Unnamed: 0,doc,Class
0,a cbc or radiocanada television station locate...,0
1,unsurprisingly these officers enforced social ...,0
2,in 1952 thomas watson sr in what year did ibm ...,1
3,what roman battle took place in the year 446 bc,1
4,what often lacks in software developed when it...,1


In [13]:
display(X.shape) # (number of sentences, vocabulary size): 20 sentences, 249 distinct words
display(X[10:15]) # for each sentence, it creates a vector recording whether each word is present or not
counts = pd.DataFrame(X[10:15], index=['doc_1','doc_2','doc_3','doc_4','doc_5'], columns=vectorizer.get_feature_names_out()) # rows 10 to 14 of the corpus
display(counts[['television','what', 'year', 'when', 'who', 'roman']]) # display presence (1) or absence (0) of selected words

(20, 249)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

Unnamed: 0,television,what,year,when,who,roman
doc_1,0,1,0,0,1,0
doc_2,0,0,0,0,0,0
doc_3,0,0,0,0,0,0
doc_4,0,1,0,0,0,0
doc_5,0,0,0,0,0,0


## SciKit BernoulliNB <a class="anchor" id="sci_bnb_2"></a>

In [14]:
X_train = X
y_train = y

print(X_train.shape, y_train.shape)

# instantiate the model
bnb = BernoulliNB()

# fit the model
bnb.fit(X_train, y_train)

print('Sklearn values:')
print('feature log-probabilities',bnb.feature_log_prob_)

(20, 249) (20,)
Sklearn values:
feature log-probabilities [[-1.79175947 -1.79175947 -2.48490665 -1.79175947 -2.48490665 -1.79175947
  -1.79175947 -1.79175947 -1.79175947 -1.79175947 -2.48490665 -2.48490665
  -2.48490665 -1.79175947 -1.79175947 -1.79175947 -1.38629436 -1.79175947
  -1.79175947 -1.38629436 -1.79175947 -2.48490665 -2.48490665 -2.48490665
  -1.79175947 -1.79175947 -1.79175947 -2.48490665 -1.79175947 -2.48490665
  -1.79175947 -2.48490665 -1.79175947 -1.79175947 -2.48490665 -1.79175947
  -1.79175947 -2.48490665 -2.48490665 -2.48490665 -2.48490665 -2.48490665
  -2.48490665 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947
  -2.48490665 -2.48490665 -2.48490665 -1.79175947 -1.79175947 -1.79175947
  -2.48490665 -2.48490665 -2.48490665 -1.79175947 -1.79175947 -1.79175947
  -1.79175947 -2.48490665 -1.79175947 -2.48490665 -1.79175947 -2.48490665
  -2.48490665 -2.48490665 -1.79175947 -2.48490665 -1.79175947 -2.48490665
  -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79

In [15]:
arr_test = [['theres nothing really that gets in that early']]
X_test_df=pd.DataFrame(arr_test, columns=['doc'])
# Convert into BOW
#loaded_vec = CountVectorizer(decode_error="replace",vocabulary=pickle.load(open("vocab.pkl", "rb")))
# loaded_vec = CountVectorizer(decode_error="replace",vocabulary=vectorizer.vocabulary_)
X_test = vectorizer.transform(X_test_df.doc)
y_pred = bnb.predict(X_test)


print(y_pred)
print('Sklearn predict_proba\n', bnb.predict_proba(X_test))
print('Sklearn predict_log_proba\n',bnb.predict_log_proba(X_test))
print('Sklearn _joint_log_likelihood\n',bnb._joint_log_likelihood(X_test))

[0]
Sklearn predict_proba
 [[0.99796306 0.00203694]]
Sklearn predict_log_proba
 [[-2.03901494e-03 -6.19630779e+00]]
Sklearn _joint_log_likelihood
 [[-46.44510968 -52.63937846]]


In [16]:
arr_test = [['What Roman battle took place in the year']]
X_test_df=pd.DataFrame(arr_test, columns=['doc'])
# Convert into BOW
#loaded_vec = CountVectorizer(decode_error="replace",vocabulary=pickle.load(open("vocab.pkl", "rb")))
# loaded_vec = CountVectorizer(decode_error="replace",vocabulary=vectorizer.vocabulary_)
X_test = vectorizer.transform(X_test_df.doc)
y_pred = bnb.predict(X_test)


print(y_pred)
print('Sklearn predict_proba\n', bnb.predict_proba(X_test))
print('Sklearn predict_log_proba\n',bnb.predict_log_proba(X_test))
print('Sklearn _joint_log_likelihood\n',bnb._joint_log_likelihood(X_test))

[1]
Sklearn predict_proba
 [[0.0015627 0.9984373]]
Sklearn predict_log_proba
 [[-6.46134198e+00 -1.56391947e-03]]
Sklearn _joint_log_likelihood
 [[-51.75567234 -45.29589427]]


## Resources
1) https://towardsdatascience.com/how-i-was-using-naive-bayes-incorrectly-till-now-part-1-4ed2a7e2212b
2) https://iq.opengenus.org/bernoulli-naive-bayes/
3) https://www.codingninjas.com/codestudio/library/bernoulli-naive-bayes
4) https://developer.nvidia.com/blog/faster-text-classification-with-naive-bayes-and-gpus/
5) https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf

## QUESTIONS <a class="anchor" id="questions"></a>

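1) <b>What if a test word is not present in the vocabulary?</b> <br>

`CountVectorizer.transform` ignores words that were not seen during fitting, so an out-of-vocabulary word contributes nothing to the feature vector. A related issue is the zero-frequency problem: if a feature value never occurs together with a class in the training data, its estimated likelihood is zero and the whole product collapses to zero. Laplace (additive) smoothing solves this, and `BernoulliNB` applies it by default with `alpha=1.0`.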