Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

The survey already states the conditional probability being asked for: 40% of the employees who use the plan are smokers, so P(Smoker | Uses Plan) = 0.40. The 70% figure is P(Uses Plan) and is needed only to recover the joint probability.

By the definition of conditional probability:

P(Smoker | Uses Plan) = P(Smoker and Uses Plan) / P(Uses Plan)

Here P(Smoker and Uses Plan) = 0.40 × 0.70 = 0.28, so

P(Smoker | Uses Plan) = 0.28 / 0.70 = 0.40.

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is 0.40.
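
The arithmetic can be checked with a few lines of Python. This is only a sanity check, reading the 40% as the share of plan users who smoke; the variable names are illustrative.

```python
# Figures from the survey.
p_uses_plan = 0.70          # P(Uses Plan)
p_smoker_given_plan = 0.40  # P(Smoker | Uses Plan), stated directly in the question

# Joint probability via the product rule.
p_smoker_and_plan = p_smoker_given_plan * p_uses_plan

# Dividing the joint by P(Uses Plan) recovers the conditional probability.
recovered = p_smoker_and_plan / p_uses_plan

print("P(Smoker and Uses Plan):", round(p_smoker_and_plan, 2))
print("P(Smoker | Uses Plan):", round(recovered, 2))
```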

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


Both models are Naive Bayes classifiers: they assume the features are conditionally independent given the class, and both can handle any number of classes. They differ in how they model each feature. Bernoulli Naive Bayes treats every feature as binary (0 or 1), for example whether a word occurs in a document. Multinomial Naive Bayes treats the features as non-negative counts (or count-like frequencies), for example how many times each word occurs.

Bernoulli Naive Bayes is not simply a special case of Multinomial Naive Bayes. Its likelihood includes a factor for every feature, so the absence of a feature (x_i = 0) counts as evidence, whereas the multinomial likelihood depends only on the features that actually occur and on how often they occur.

Here is a table that summarizes the key differences between Bernoulli Naive Bayes and Multinomial Naive Bayes:

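| Aspect | Bernoulli Naive Bayes | Multinomial Naive Bayes |
|---|---|---|
| Number of classes | Two or more | Two or more |
| Type of features | Binary (present/absent) | Non-negative counts or frequencies |
| Independence assumption | Features are conditionally independent given the class | Features are conditionally independent given the class |
| Conditional probability | p_iy^x_i × (1 − p_iy)^(1 − x_i) for each feature i | Proportional to the product of θ_yi^x_i over the features |
| Effect of an absent feature | Contributes the factor (1 − p_iy) | Contributes nothing |
| Prior probability | P(y) | P(y) |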

Neither model is uniformly better. Multinomial Naive Bayes tends to work better when the number of occurrences carries information, as with word counts in longer documents. Bernoulli Naive Bayes is often preferred when only presence or absence matters, as with short texts or inherently binary features.
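
The difference can be seen on a toy example. The sketch below uses made-up word counts (the matrix and labels are hypothetical) and scikit-learn's default settings, where `BernoulliNB` binarizes its input at 0. It shows that the Bernoulli model only sees whether a term occurs, while the multinomial model responds to how often it occurs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: rows are documents, columns are vocabulary terms.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])

# With the default binarize=0.0, BernoulliNB maps every positive count to 1,
# so fitting on counts is the same as fitting on presence/absence indicators.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y)
mnb = MultinomialNB().fit(X_counts, y)

# Two new documents that differ only in how often the first term occurs.
x_many = np.array([[5, 0, 0]])
x_once = np.array([[1, 0, 0]])

same_for_bernoulli = np.allclose(
    bnb_counts.predict_proba(x_many), bnb_counts.predict_proba(x_once)
)
differs_for_multinomial = not np.allclose(
    mnb.predict_proba(x_many), mnb.predict_proba(x_once)
)

print("Bernoulli ignores the count:", same_for_bernoulli)
print("Multinomial uses the count:", differs_for_multinomial)
```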

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values; the model is defined for features that are 0 or 1. In scikit-learn, BernoulliNB raises an error if the input contains NaN, so missing values must be handled before fitting.

Common options are to impute the missing entries (for example, with the most frequent value of the feature), to add a separate binary indicator feature that records whether the value was missing, or, in a custom implementation, to leave the missing feature's term out of the product of conditional probabilities for that instance. The last option works because Naive Bayes multiplies one independent term per feature, so dropping a term amounts to marginalising over the unknown value.

It's important to note that the way missing values are handled can impact the performance of the classifier, and it's advisable to preprocess your data carefully to make informed decisions about how to treat missing values based on the characteristics of your dataset.
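
One way to prepare data with missing entries is to impute them and keep a "was missing" indicator before fitting. The sketch below is a minimal illustration with scikit-learn, not the only reasonable approach; the toy matrix and labels are made up, and most-frequent imputation is an assumption chosen for binary features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan).
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing entry with the feature's most frequent value and append
# a binary indicator column for every feature that had missing values in fit.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# A new instance whose second feature is missing.
predictions = model.predict(np.array([[1, np.nan, 0]]))
print(predictions)
```

The indicator columns let the classifier use the fact that a value was missing as evidence, which is the closest practical equivalent to treating "missing" as its own state.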

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. It's an extension of the Naive Bayes algorithm that assumes the features are continuous and follow a Gaussian distribution. For multi-class classification, it calculates the probability of an instance belonging to each class and assigns the class with the highest probability.

It works well when the features have a Gaussian (normal) distribution within each class. If your data violates the normality assumption, other variations of Naive Bayes or different classifiers might be more suitable.
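
As a short illustration, the sketch below fits scikit-learn's `GaussianNB` on the three-class Iris dataset bundled with the library. The dataset and the split parameters are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gnb = GaussianNB().fit(X_train, y_train)

# One posterior probability per class for every test instance.
proba = gnb.predict_proba(X_test)

# The predicted label is the class with the highest posterior probability.
predicted = gnb.classes_[proba.argmax(axis=1)]

print("Classes:", gnb.classes_)
print("Per-class feature means shape:", gnb.theta_.shape)
print("Test accuracy:", (predicted == y_test).mean())
```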

Q5. Assignment:

Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:

Accuracy

Precision

Recall

F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.


Note: Create your assignment in a Jupyter notebook, upload it to GitHub, and share the GitHub repository link through your dashboard. Make sure the repository is public.

Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data
data = pd.read_csv('spambase.data', header=None)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=[57]), data[57], test_size=0.25)

# Create the classifiers
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

# Train the classifiers
bnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred_bnb = bnb.predict(X_test)
y_pred_mnb = mnb.predict(X_test)
y_pred_gnb = gnb.predict(X_test)

# Evaluate the classifiers
accuracy_bnb = accuracy_score(y_test, y_pred_bnb)
precision_bnb = precision_score(y_test, y_pred_bnb)
recall_bnb = recall_score(y_test, y_pred_bnb)
f1_score_bnb = f1_score(y_test, y_pred_bnb)

accuracy_mnb = accuracy_score(y_test, y_pred_mnb)
precision_mnb = precision_score(y_test, y_pred_mnb)
recall_mnb = recall_score(y_test, y_pred_mnb)
f1_score_mnb = f1_score(y_test, y_pred_mnb)

accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
precision_gnb = precision_score(y_test, y_pred_gnb)
recall_gnb = recall_score(y_test, y_pred_gnb)
f1_score_gnb = f1_score(y_test, y_pred_gnb)

# Print the results
print('Bernoulli Naive Bayes:')
print('Accuracy:', accuracy_bnb)
print('Precision:', precision_bnb)
print('Recall:', recall_bnb)
print('F1 score:', f1_score_bnb)

print('Multinomial Naive Bayes:')
print('Accuracy:', accuracy_mnb)
print('Precision:', precision_mnb)
print('Recall:', recall_mnb)
print('F1 score:', f1_score_mnb)

print('Gaussian Naive Bayes:')
print('Accuracy:', accuracy_gnb)
print('Precision:', precision_gnb)
print('Recall:', recall_gnb)
print('F1 score:', f1_score_gnb)



Bernoulli Naive Bayes:
Accuracy: 0.894005212858384
Precision: 0.8990384615384616
Recall: 0.8237885462555066
F1 score: 0.8597701149425286
Multinomial Naive Bayes:
Accuracy: 0.7958297132927888
Precision: 0.7315010570824524
Recall: 0.762114537444934
F1 score: 0.7464940668824165
Gaussian Naive Bayes:
Accuracy: 0.8079930495221547
Precision: 0.6823161189358372
Recall: 0.960352422907489
F1 score: 0.797804208600183


Discussion
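On this single 75/25 train/test split, Bernoulli Naive Bayes performed best overall, with the highest accuracy (0.894), precision (0.899), and F1 score (0.860). Gaussian Naive Bayes had the highest recall (0.960) but the lowest precision (0.682), meaning it flags many legitimate messages as spam. Multinomial Naive Bayes had the lowest accuracy (0.796) and F1 score (0.746). Because the split is random and unseeded, the exact figures vary between runs.

A likely explanation is that the Spambase features are continuous word and character frequencies plus capital-letter run-length statistics, not binary values or raw counts. BernoulliNB binarizes its input at 0 by default, so each feature becomes "the word occurs" or "it does not", which is a strong signal for spam. Multinomial Naive Bayes expects counts, and the run-length features, which are on a much larger scale than the frequency features, can dominate its likelihood. Gaussian Naive Bayes assumes each feature is normally distributed within a class, but most of these features are heavily skewed with many zeros.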

Limitations of Naive Bayes
Naive Bayes classifiers have a few limitations. They assume that the features are conditionally independent given the class. This rarely holds exactly (for example, the words in an email are correlated), and it tends to make the predicted probabilities overconfident even when the predicted class is correct. Each variant also assumes a particular feature distribution, and the results above show how much that choice matters when the data does not match it. Finally, Gaussian Naive Bayes estimates a mean and a variance for each feature and class, and both are sensitive to outliers, so it can help to inspect or transform heavily skewed features before fitting.

Conclusion
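In conclusion, Bernoulli Naive Bayes was the best choice for this dataset on the split evaluated here, while Gaussian Naive Bayes achieved the highest recall at the cost of precision. For future work, the comparison should be repeated with 10-fold cross-validation as the assignment specifies, and feature transformations (such as a log transform for the Gaussian model) could be tested to see whether they narrow the gap between the variants.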