### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


* According to the information provided:
    > P(A) = 70% = 0.70 (probability that an employee uses the health insurance plan)
    
    > P(B|A) = 40% = 0.40 (probability that an employee is a smoker given that they use the health insurance plan)
<br>

* The question asks for P(B|A), the probability that an employee is a smoker given that they use the health insurance plan. This is exactly the conditional probability stated in the survey, so no further calculation is needed:
    > P(B|A) = 0.40
<br>

* Bayes' theorem, P(B|A) = (P(A|B) * P(B)) / P(A), is not required here. Applying it would need P(B), the overall proportion of smokers, and P(A|B), neither of which is given.
<br>

* What the data does allow is the joint probability that an employee both uses the plan and smokes:
    > P(A and B) = P(B|A) * P(A) = 0.40 * 0.70 = 0.28
<br>

* Therefore, the probability that an employee is a smoker given that they use the health insurance plan is 0.40 or 40%. The value 0.28 (28%) is the share of all employees who use the plan and smoke.
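<br>

* As a quick numerical check on the quantities in this question, the sketch below turns the percentages into head counts and reads P(A and B) and P(B|A) from those counts. The head count of 1,000 employees is an assumption made only for illustration; it is not part of the question.

```python
# Hypothetical head count, chosen only to make the percentages concrete.
n_employees = 1000
p_plan = 0.70               # P(A): uses the health insurance plan
p_smoker_given_plan = 0.40  # P(B|A): smoker among plan users

plan_users = round(n_employees * p_plan)
smoking_plan_users = round(plan_users * p_smoker_given_plan)

# Share of ALL employees who use the plan and smoke: P(A and B)
p_joint = smoking_plan_users / n_employees
# Share of PLAN USERS who smoke: P(B|A)
p_conditional = smoking_plan_users / plan_users

print(f"P(A and B) = {p_joint:.2f}")
print(f"P(B|A)     = {p_conditional:.2f}")
```

The two ratios differ only in their denominator: all employees for the joint probability, plan users for the conditional one.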

# -----------------------------------------------------

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


* Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm, a probabilistic classification algorithm that is widely used in natural language processing, text mining, and other machine learning applications. The differences between Bernoulli and Multinomial Naive Bayes are listed below:
<br>


|Bernoulli Naive Bayes|Multinomial Naive Bayes|
|--|--|
|Assumes binary input data|Assumes count input data|
|Represents each document as a binary vector|Represents each document as a count vector|
|Calculates likelihood probabilities based on presence/absence of features|Calculates likelihood probabilities based on frequency of features|
|Suitable for problems focused on the presence or absence of features|Suitable for problems focused on the frequency of features|
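
* The presence/absence versus frequency distinction in the table can be sketched in plain Python. The vocabulary size, the probabilities, and the two example documents below are hypothetical values chosen for illustration, and the functions are a simplified version of the two likelihoods, not scikit-learn's implementation.

```python
import math

def bernoulli_log_likelihood(counts, p_present):
    """log P(x | class) when each word is a present/absent coin flip."""
    total = 0.0
    for count, p in zip(counts, p_present):
        # Absent words contribute as well, through log(1 - p).
        total += math.log(p) if count > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_word):
    """log P(x | class), leaving out the multinomial coefficient,
    which does not depend on the class."""
    return sum(count * math.log(p) for count, p in zip(counts, p_word))

# Hypothetical class parameters for a 3-word vocabulary.
p_present = [0.8, 0.1, 0.5]  # P(word appears at least once | class)
p_word = [0.6, 0.1, 0.3]     # P(a token is this word | class), sums to 1

doc_once = [1, 0, 1]      # first word appears once
doc_repeated = [5, 0, 1]  # first word appears five times

bern_once = bernoulli_log_likelihood(doc_once, p_present)
bern_repeated = bernoulli_log_likelihood(doc_repeated, p_present)
multi_once = multinomial_log_likelihood(doc_once, p_word)
multi_repeated = multinomial_log_likelihood(doc_repeated, p_word)

print("Bernoulli unchanged by repetition:", bern_once == bern_repeated)
print("Multinomial unchanged by repetition:", multi_once == multi_repeated)
```

The Bernoulli score only depends on which words occur, while the multinomial score changes with every extra occurrence.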




# -----------------------------------------------------

### Q3. How does Bernoulli Naive Bayes handle missing values?


* Bernoulli Naive Bayes is a classification algorithm that is commonly used in natural language processing tasks such as text classification. It is a variant of the Naive Bayes algorithm that assumes that the features are binary or Boolean, indicating whether a particular feature is present or not.

<br>

* Bernoulli Naive Bayes has no built-in treatment of missing values. In scikit-learn, `BernoulliNB` raises an error when the input contains NaN, so missing entries have to be handled before fitting. Conceptually, the independence assumption does allow a missing feature to be skipped: the class score is a product of per-feature likelihoods, and leaving out the factor for a missing feature does not change the others. Note that encoding a missing value as 0 is not neutral, because Bernoulli Naive Bayes treats 0 as "feature absent" and counts that absence as evidence.

<br>

* In practice, it is recommended to impute missing values before applying the Bernoulli Naive Bayes algorithm. For binary features, the most frequent value (mode) of the feature is the natural choice; the mean or median is only appropriate for continuous features that are imputed before being binarised.
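
<br>

* The imputation step can be sketched with scikit-learn's `SimpleImputer` placed in front of `BernoulliNB`. The small array and labels below are hypothetical, and mode imputation is one reasonable choice for binary features rather than the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with missing entries marked as NaN.
X_missing = np.array([[1, 0, np.nan],
                      [1, np.nan, 1],
                      [0, 1, 0],
                      [np.nan, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
y_labels = np.array([1, 1, 0, 0, 1, 0])

# Fill each column with its most frequent observed value (the mode).
X_filled = SimpleImputer(strategy="most_frequent").fit_transform(X_missing)
print(X_filled)

# The same imputation as the first step of a pipeline, so that it is
# learned from the training data only and reused at prediction time.
clf = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
clf.fit(X_missing, y_labels)
predictions = clf.predict(X_missing)
print(predictions)
```

Wrapping the imputer in a pipeline also keeps it inside each fold when the model is cross-validated.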

# -----------------------------------------------------

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?


* Yes, Gaussian Naive Bayes can be used for multi-class classification problems, and it does so natively. Naive Bayes computes a posterior score for every class, so no decomposition into binary problems such as "one-vs-all" (OvA) is required.
<br>

* For a problem with three classes (A, B, and C), the algorithm estimates one prior P(class) and one set of per-feature likelihood parameters for each of A, B, and C from a single pass over the training data.
<br>

* During classification, the algorithm computes P(class) multiplied by the likelihood of the sample's features for each class. The sample is assigned to the class with the highest posterior probability.
<br>

* In Gaussian Naive Bayes, the likelihood probability is modeled using a Gaussian distribution for each feature in each class. The algorithm estimates the mean and variance of each feature in each class based on the training data. During classification, the algorithm calculates the probability of each document belonging to each class using the Gaussian distribution parameters for that class.
<br>

* Overall, Gaussian Naive Bayes can be a useful algorithm for multi-class classification problems when the features are continuous and can be modeled using a Gaussian distribution. However, it is important to note that it makes certain assumptions about the data (such as independence of features) that may not always hold in practice.
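
<br>

* The per-class mean and variance estimation and the scoring step described above can be sketched in plain Python for three classes. This is a simplified illustration, not scikit-learn's `GaussianNB`; the toy data, the function names, and the small constant added to the variance are assumptions of the sketch.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posteriors(model, x):
    """Unnormalised log P(class | x) for every class."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)  # class with the highest score

# Hypothetical two-feature data with three well-separated classes.
X_toy = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
         [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
         [9.0, 1.0], [9.2, 0.8], [8.9, 1.1]]
y_toy = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
for point in ([1.0, 1.0], [5.0, 5.0], [9.0, 1.0]):
    print(point, "->", predict(toy_model, point))
```

One score is computed per class and the largest wins, which is why the same procedure works for any number of classes.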

# -----------------------------------------------------

### Q5. Assignment:
<br>

* **Data preparation:**
    * Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). Each row of this dataset describes one email message through 57 numeric features (word and character frequencies and capital-letter run lengths), and the goal is to predict whether the message is spam or not.
<br>

* **Implementation:**
    * Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.
<br>

* **Results:**
    * Report the following performance metrics for each classifier:
        * Accuracy
        * Precision
        * Recall
        * F1 score
<br>    

* **Discussion:**
    * Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?
<br>

* **Conclusion:**
    * Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
df = pd.read_csv('spambase.data',header=None)

# Name the feature columns f1..f57; the last column is the spam label
features = [f'f{i+1}' for i in range(df.shape[1] - 1)] + ['target']
df.columns = features

# Separating X and Y variables
X = df.drop(labels=['target'],axis=1)
Y = df[['target']]

# Train Test Split 
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=0.3,random_state=42,stratify=Y)



In [2]:
## Gaussian NB
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(xtrain,ytrain.values.flatten())

from sklearn.model_selection import StratifiedKFold
skf =  StratifiedKFold(n_splits=10,shuffle=True,random_state=42)

from sklearn.model_selection import cross_val_score
scores_gnb = cross_val_score(GaussianNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
print(scores_gnb)

import numpy as np
mean_score_gnb = np.mean(scores_gnb)
print('Results for Gaussian Naive Bayes')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_gnb:.4f}')


[0.77564103 0.82191781 0.80267559 0.802589   0.78064516 0.81081081
 0.82876712 0.82033898 0.80130293 0.8125    ]
Results for Gaussian Naive Bayes
Mean 10 fold cross validation f1 score is : 0.8057


In [3]:
## Bernoulli NB
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(xtrain,ytrain.values.flatten())

scores_bnb = cross_val_score(BernoulliNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
print(scores_bnb)

mean_score_bnb = np.mean(scores_bnb)
print('Results for BernoulliNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_bnb:.4f}')


[0.84897959 0.84677419 0.84120172 0.8515625  0.85258964 0.81512605
 0.8879668  0.85232068 0.85483871 0.84081633]
Results for BernoulliNB :
Mean 10 fold cross validation f1 score is : 0.8492


In [4]:
## Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(xtrain,ytrain.values.flatten())

scores_mnb = cross_val_score(MultinomialNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
print(scores_mnb)

mean_score_mnb = np.mean(scores_mnb)
print('Results for MultinomialNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_mnb:.4f}')


[0.70817121 0.68907563 0.74509804 0.71604938 0.67741935 0.72131148
 0.76       0.712      0.703125   0.7768595 ]
Results for MultinomialNB :
Mean 10 fold cross validation f1 score is : 0.7209


In [5]:
# Compute, print and return accuracy, precision, recall and F1 for a fitted model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate_model(x,y,model):
    ypred = model.predict(x)
    acc = accuracy_score(y,ypred)
    pre = precision_score(y,ypred)
    rec = recall_score(y,ypred)
    f1 = f1_score(y,ypred)
    print(f'Accuracy  : {acc:.4f}')
    print(f'Precision : {pre:.4f}')
    print(f'Recall    : {rec:.4f}')
    print(f'F1 Score  : {f1:.4f}')
    return acc, pre, rec, f1



In [6]:
## Evaluate Gaussian NB
print('Gaussian Naive Bayes Results : ')
acc_gnb, pre_gnb, rec_gnb, f1_gnb = evaluate_model(xtest,ytest.values.flatten(),gnb)
print('--------------------------------')
## Evaluate Bernoulli NB
print('Bernoulli Naive Bayes Results : ')
acc_bnb, pre_bnb, rec_bnb, f1_bnb = evaluate_model(xtest,ytest.values.flatten(),bnb)
print('--------------------------------')
## Evaluate Multinomial NB
print('Multinomial Naive Bayes Results : ')
acc_mnb, pre_mnb, rec_mnb, f1_mnb = evaluate_model(xtest,ytest.values.flatten(),mnb)

Gaussian Naive Bayes Results : 
Accuracy  : 0.8240
Precision : 0.7048
Recall    : 0.9522
F1 Score  : 0.8100
--------------------------------
Bernoulli Naive Bayes Results : 
Accuracy  : 0.8870
Precision : 0.8865
Recall    : 0.8180
F1 Score  : 0.8509
--------------------------------
Multinomial Naive Bayes Results : 
Accuracy  : 0.7697
Precision : 0.7190
Recall    : 0.6820
F1 Score  : 0.7000


---
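
* The assignment asks for accuracy, precision, recall, and F1 under 10-fold cross-validation, while the cells above cross-validate F1 only. One way to collect all four in a single call is scikit-learn's `cross_validate`, sketched below. So that the sketch runs without `spambase.data`, it uses a synthetic stand-in dataset (an assumption); `X_demo`, `y_demo`, `skf_demo`, and `cv_summary` are hypothetical names, and the printed numbers say nothing about Spambase.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for Spambase (assumption). Clipping at zero keeps the
# features non-negative, which MultinomialNB requires.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_demo = np.clip(X_demo, 0, None)

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

cv_summary = {}
fold_counts = {}
for name, model in [("GaussianNB", GaussianNB()),
                    ("BernoulliNB", BernoulliNB()),
                    ("MultinomialNB", MultinomialNB())]:
    result = cross_validate(model, X_demo, y_demo, cv=skf_demo, scoring=metrics)
    # One array of per-fold scores is returned per metric, keyed "test_<metric>".
    cv_summary[name] = {m: float(np.mean(result[f"test_{m}"])) for m in metrics}
    fold_counts[name] = len(result["test_f1"])

for name, scores in cv_summary.items():
    print(name, {m: round(s, 4) for m, s in scores.items()})
```

With the real data, `X_demo` and `y_demo` would be replaced by the Spambase features and labels.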

* **Discussion:**
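    * The best model for this data is Bernoulli Naive Bayes.

        * **Reasons:**

            * BernoulliNB has the highest test F1 score (0.8509, against 0.8100 for GaussianNB and 0.7000 for MultinomialNB).
            * BernoulliNB has the highest test accuracy (0.8870).
            * BernoulliNB has the highest mean 10-fold cross-validation F1 score (0.8492).
            * GaussianNB has the highest test recall (0.9522) but a much lower precision (0.7048), so it flags more legitimate emails as spam.
            * A plausible explanation is that whether a word or character occurs at all is a stronger spam signal in this dataset than how often it occurs. With default settings, BernoulliNB binarises every feature at 0, which also makes it insensitive to the large capital-run-length features; these can dominate MultinomialNB's count-based likelihood and do not follow the Gaussian shape that GaussianNB assumes.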
            
* Although Naive Bayes is a simple and widely used algorithm, it has some limitations, including:
* Class-conditional independence assumption: Naive Bayes assumes that the features are independent of each other given the class. In real-world data this rarely holds (for example, words in an email are correlated), which can make the predicted probabilities overconfident.

* Sensitivity to feature representation: Each variant assumes a particular feature distribution (binary, count, or Gaussian). As the results above show, the same data gives quite different scores depending on which variant is used.

* Few tuning parameters: Naive Bayes has only a handful of hyperparameters (for example, the smoothing parameter `alpha`, `binarize` in BernoulliNB, and `var_smoothing` in GaussianNB), so there is limited room to improve performance through tuning.

* Data sparsity problem: If a feature value never occurs with a class in the training data, its estimated probability is zero and it wipes out the whole product unless smoothing is applied. Rare features also give noisy probability estimates.

* Imbalanced class distribution: Class priors are estimated from the training data, so a heavily imbalanced training set pulls predictions toward the majority class.

* Distributional assumptions: Gaussian Naive Bayes assumes that each continuous feature is normally distributed within each class; skewed features, such as the capital-letter run lengths in this dataset, violate that assumption.
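
* The sparsity limitation is usually softened with additive (Laplace) smoothing, which is what the `alpha` parameter of `MultinomialNB` and `BernoulliNB` controls. The sketch below shows the idea for a multinomial-style likelihood; the word counts are hypothetical and `smoothed_probability` is an illustrative helper, not a scikit-learn function.

```python
def smoothed_probability(count, class_total, n_features, alpha=1.0):
    """Additive (Laplace) smoothing: add alpha pseudo-counts to every feature."""
    return (count + alpha) / (class_total + alpha * n_features)

# Hypothetical counts: how often each of 4 words appears in spam training emails.
spam_word_counts = [30, 0, 15, 5]
total_count = sum(spam_word_counts)
n_words = len(spam_word_counts)

unsmoothed = [c / total_count for c in spam_word_counts]
smoothed = [smoothed_probability(c, total_count, n_words) for c in spam_word_counts]

print("unsmoothed:", [round(p, 4) for p in unsmoothed])
print("smoothed:  ", [round(p, 4) for p in smoothed])
```

The unseen word no longer has probability zero, and the smoothed values still sum to one.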

<br>

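* **Conclusion:**
    * Bernoulli Naive Bayes performed best on both 10-fold cross-validation (mean F1 of 0.8492) and the held-out test set (F1 of 0.8509, accuracy of 0.8870).
    * Gaussian Naive Bayes came second (test F1 of 0.8100) and Multinomial Naive Bayes last (test F1 of 0.7000).
    * For future work, it would be worth tuning the few available hyperparameters, transforming the skewed features before fitting Gaussian or Multinomial Naive Bayes, and comparing against other classifiers such as a neural network, which has many more tunable parameters and may give better results on email classification.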

# -----------------------------------------------------