Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Q3. How does Bernoulli Naive Bayes handle missing values?

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Q5. Assignment:

Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

1. The question already states the quantity being asked for. "40% of the employees who use the plan are smokers" is, by definition, the conditional probability P(Smoker | Uses Health Insurance), so no further calculation is needed.

Given the information provided:

P(Uses Health Insurance) = 0.70  (share of employees who use the health insurance plan)
P(Smoker | Uses Health Insurance) = 0.40  (share of plan users who are smokers)

Bayes' theorem,

P(Smoker | Uses Health Insurance) = P(Uses Health Insurance | Smoker) * P(Smoker) / P(Uses Health Insurance)

would only be required if the question supplied P(Uses Health Insurance | Smoker) and P(Smoker) instead. Neither value is given here, so they cannot be substituted into the formula.

The 70% figure is only needed for the joint probability:

P(Smoker and Uses Health Insurance) = 0.40 * 0.70 = 0.28

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
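
A count-based check of this conditional probability, as a minimal sketch. The headcount of 1000 is a hypothetical figure chosen only to turn the percentages from the question into whole numbers; it is not part of the survey data.

```python
# Hypothetical headcount, used only to make the percentages concrete
total_employees = 1000

# 70% of employees use the health insurance plan
plan_users = round(0.70 * total_employees)

# 40% of the plan users are smokers (as stated in the question)
smokers_among_users = round(0.40 * plan_users)

# Conditional probability: restrict attention to plan users only
p_smoker_given_plan = smokers_among_users / plan_users

# Joint probability: smokers who use the plan, out of all employees
p_smoker_and_plan = smokers_among_users / total_employees

print(f"P(Smoker | Uses Health Insurance)   = {p_smoker_given_plan:.2f}")
print(f"P(Smoker and Uses Health Insurance) = {p_smoker_and_plan:.2f}")
```

Conditioning on plan users means dividing by the number of plan users, not by the total headcount, which is why the 70% figure cancels out of the conditional probability.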

2. Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm, used for text classification and other binary/multiclass classification tasks. The main difference between them lies in the type of data they are designed to handle and the assumptions they make about the distribution of features.

Bernoulli Naive Bayes:

Type of Data: Bernoulli Naive Bayes is used for binary or boolean features, where each feature can be present (1) or absent (0).
Assumption: It assumes that features are conditionally independent given the class label and that each feature follows a Bernoulli distribution (hence the name).
Use Case: It's commonly used in text classification tasks where you're dealing with presence/absence of words in documents.
Example: Sentiment analysis, spam detection, document categorization.

Multinomial Naive Bayes:

Type of Data: Multinomial Naive Bayes is used for features that represent counts or frequencies, typically in the form of discrete values.
Assumption: It assumes that features are conditionally independent given the class label and that, within each class, the vector of feature counts is drawn from a multinomial distribution.
Use Case: It's often used in text classification tasks where features are word frequencies or word counts.
Example: Document classification, topic categorization.

In summary:

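Bernoulli Naive Bayes: Suited for binary data where features are either present or absent. Assumes a Bernoulli distribution for each feature, so the absence of a feature explicitly counts as evidence for or against a class.

Multinomial Naive Bayes: Suited for count or frequency data. Assumes a multinomial distribution over the feature counts of each class, so how often a feature occurs matters, while a feature with a zero count contributes nothing to the likelihood.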

Both types of Naive Bayes classifiers are based on the same fundamental concept of assuming feature independence given the class label. The choice between them depends on the nature of your data and how well it aligns with the assumptions of each classifier. Experimentation and model evaluation on your specific dataset can help you determine which variant works better for your classification task.
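
The practical effect of the two assumptions can be seen on a toy example. This is a minimal sketch: the count matrix and labels are made up for illustration, and it relies on scikit-learn's BernoulliNB binarizing its input at a threshold of 0.0 by default.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word counts: 6 documents, 3 vocabulary terms
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 2, 3],
    [0, 1, 4],
    [1, 3, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Presence/absence version of the same data
X_binary = (X_counts > 0).astype(int)

# BernoulliNB only looks at presence/absence, so counts and
# their binarized form lead to the same fitted model
bnb_on_counts = BernoulliNB().fit(X_counts, y)
bnb_on_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(
    bnb_on_counts.predict_proba(X_counts),
    bnb_on_binary.predict_proba(X_binary),
)

# MultinomialNB uses the magnitudes, so its estimated
# per-class feature probabilities change when counts are discarded
mnb_on_counts = MultinomialNB().fit(X_counts, y)
mnb_on_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(
    mnb_on_counts.feature_log_prob_,
    mnb_on_binary.feature_log_prob_,
)

print("BernoulliNB unchanged by binarizing:  ", same_bernoulli)
print("MultinomialNB unchanged by binarizing:", same_multinomial)
```

If the frequency of a feature carries signal, the multinomial variant can use it; if only presence matters, the Bernoulli variant loses nothing by ignoring the counts.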

3. Bernoulli Naive Bayes has no built-in mechanism for missing values. The model expects every feature to be observed as either 0 or 1, and scikit-learn's BernoulliNB rejects input that contains NaN, so missing values have to be dealt with before the classifier is fitted or used for prediction.

When dealing with missing values in Bernoulli Naive Bayes, you have a few options:

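Ignoring Missing Values: One approach is to drop instances with missing values from the training data. This is simple, but it discards information and gives no way to classify a new instance that has a missing value. Because Naive Bayes treats features as conditionally independent, a missing feature can in principle also be left out of the likelihood product for that instance, although scikit-learn's implementation does not support this directly.

Imputation: You can replace missing values with a fixed value (e.g., 0 or 1), which treats every missing feature as absent or present. This is only reasonable if that reading matches how the data was collected; otherwise it biases the estimated feature probabilities.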

Probabilistic Imputation: Instead of using a fixed placeholder value, you can probabilistically impute the missing values based on the class distribution. For example, if most instances of a certain class have the feature present, you might assign a higher probability of presence for the missing value in that class.

Feature Engineering: If missing values are a significant concern, you might consider adding an additional binary feature indicating whether the original feature was missing. This way, the presence/absence of the original feature and the missingness itself become separate features.

Advanced Techniques: More advanced techniques, like using other machine learning algorithms to predict missing values, might be considered if the dataset allows.

It's important to note that the choice of how to handle missing values depends on the nature of your data and the problem you're trying to solve. Handling missing values appropriately is crucial to ensure that the Naive Bayes classifier's assumptions are valid and that the classifier's performance is not compromised.
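
Two of the options above, imputation and a separate missingness indicator, can be combined in a scikit-learn pipeline. This is a minimal sketch on made-up data; the choice of the most frequent value as the imputation strategy is an assumption for illustration, not a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with some missing entries (np.nan)
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 1, np.nan],
    [0, 0, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Impute each missing entry with the column's most frequent value and
# append one 0/1 indicator column per feature that had missing values
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# Inspect what the classifier actually receives
X_imputed = model.named_steps["simpleimputer"].transform(X)
predictions = model.predict(X)

print("Shape seen by BernoulliNB:", X_imputed.shape)
print("Predictions:", predictions)
```

Putting the imputer inside the pipeline means it is refitted on each training fold during cross-validation, so no information from the held-out fold leaks into the imputed values.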

4. Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is the variant of Naive Bayes that assumes each continuous feature follows a Gaussian (normal) distribution within each class. Nothing in the model is specific to two classes, so it handles any number of classes without modification.

Each class has its own mean and variance for every feature, which together define the class-conditional Gaussian distributions. During classification, the algorithm computes the likelihood of the feature values under each class, weights it by the class prior, and chooses the class with the highest posterior probability as the predicted class.

Here's a general outline of how Gaussian Naive Bayes can be used for multi-class classification:

Training:

For each class, compute the mean and variance of each feature using the training instances belonging to that class.
Calculate the prior probabilities of each class based on the class distribution in the training data.

Classification:

Given a new instance with feature values, calculate the likelihood of the feature values given each class using the Gaussian distribution.
Multiply the likelihood by the prior probability of the class to get the unnormalized posterior probability.
Normalize the posterior probabilities across all classes to get the probability distribution over classes.
Choose the class with the highest probability as the predicted class.
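
The training and classification steps above can be sketched from scratch as follows. This is an illustrative implementation, not scikit-learn's GaussianNB: it uses the three-class Iris dataset bundled with scikit-learn as an assumed example, works in log space for numerical stability, and omits the variance smoothing that GaussianNB applies.

```python
import numpy as np
from sklearn.datasets import load_iris

# Three-class example data (150 samples, 4 continuous features)
X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Training: per-class mean and variance of each feature, plus class priors
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
priors = np.array([np.mean(y == c) for c in classes])


def predict_proba(X_new):
    """Return posterior class probabilities, shape (n_samples, n_classes)."""
    # Difference between every sample and every class mean
    diff = X_new[:, None, :] - means[None, :, :]
    # Log of the Gaussian density, summed over features (independence assumption)
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :],
        axis=2,
    )
    # Unnormalized log posterior = log likelihood + log prior
    log_joint = log_likelihood + np.log(priors)[None, :]
    # Normalize across classes (subtract the row maximum to avoid underflow)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)


proba = predict_proba(X)
predicted = classes[proba.argmax(axis=1)]
training_accuracy = np.mean(predicted == y)

print("Posterior shape:", proba.shape)
print(f"Training accuracy: {training_accuracy:.3f}")
```

The same code works for any number of classes because every step is computed once per class; only the number of rows in the means, variances, and priors arrays changes.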

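5. The following code loads the Spambase data and evaluates the three classifiers with 10-fold cross-validation: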
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset into a Pandas DataFrame (spambase.data has no header row)
data = pd.read_csv("spambase.data", header=None)

# The last column is the target (1 = spam, 0 = not spam); the rest are features
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Initialize the classifiers with default hyperparameters, as the assignment requires
classifiers = {
    'Bernoulli NB': BernoulliNB(),
    'Multinomial NB': MultinomialNB(),
    'Gaussian NB': GaussianNB(),
}

# Perform 10-fold cross-validation and calculate metrics for each classifier
metrics = ['accuracy', 'precision', 'recall', 'f1']

for name, clf in classifiers.items():
    print(f"Results for {name}:\n")
    # A single cross_validate call scores all metrics on the same 10 folds
    results = cross_validate(clf, X, y, cv=10, scoring=metrics)
    for metric in metrics:
        scores = results[f"test_{metric}"]
        print(f"{metric.capitalize()}: Mean = {scores.mean():.4f}, Std = {scores.std():.4f}")
    print("\n")

