Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Q3. How does Bernoulli Naive Bayes handle missing values?

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Q5. Assignment: Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

Implementation: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score

Discussion: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Conclusion: Summarise your findings and provide some suggestions for future work.

Note: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.

# Answer 1:
The probability that an employee is a smoker given that he/she uses the health insurance plan is a conditional probability, and the question states it directly. Let:

A be the event that an employee is a smoker.

B be the event that an employee uses the health insurance plan.

We are given:

P(B) = 0.70 (Probability of using the health insurance plan)

P(A|B) = 0.40 (Probability of being a smoker given that the employee uses the health insurance plan)

The quantity asked for is P(A|B), which is exactly the second value above, so no further calculation is needed:

P(A|B) = 0.40

Bayes' theorem, P(A|B) = (P(B|A) * P(A)) / P(B), would only be required if we had been given P(B|A) and P(A) instead of P(A|B). Neither is provided, and neither is needed here.

P(B) is not needed for the answer either, but it does give the joint probability that an employee both uses the plan and smokes:

P(A and B) = P(A|B) * P(B) = 0.40 * 0.70 = 0.28
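As a quick numerical check on this question, the sketch below uses only the two figures given in Q1. It computes the joint probability of using the plan and smoking, then divides by P(B) to recover the conditional probability, which is the 0.40 stated in the question. The variable names are illustrative.

```python
# Figures stated in Q1
p_plan = 0.70               # P(B): employee uses the health insurance plan
p_smoker_given_plan = 0.40  # P(A|B): smoker, given the employee uses the plan

# Joint probability from the definition of conditional probability:
# P(A and B) = P(A|B) * P(B)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Dividing the joint probability by P(B) recovers P(A|B)
p_recovered = p_smoker_and_plan / p_plan

print(f"P(A and B) = {p_smoker_and_plan:.2f}")
print(f"P(A|B)     = {p_recovered:.2f}")
```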
# Answer 2:
The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they are designed for:

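Bernoulli Naive Bayes: Suitable for binary features (e.g., presence or absence of a word). Each feature is modelled as a Bernoulli variable, so the absence of a feature contributes to the likelihood explicitly.

Multinomial Naive Bayes: Suitable for discrete features representing counts or frequencies (e.g., word counts in a document). The feature vector is modelled as draws from a per-class multinomial distribution, so how often a feature occurs matters, and a feature with a count of zero contributes nothing to the likelihood.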
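The difference can be seen on a tiny example. The sketch below uses a made-up word-count matrix (the data and variable names are hypothetical). With its default `binarize=0.0`, scikit-learn's `BernoulliNB` thresholds the counts to presence/absence before fitting, while `MultinomialNB` accumulates the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: rows are documents, columns are words
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y = np.array([1, 1, 0, 0])

# Presence/absence version of the same data
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes at 0 by default, so fitting on counts or on the
# binarized matrix gives the same model
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_model = np.allclose(bnb_counts.feature_log_prob_,
                         bnb_binary.feature_log_prob_)

# MultinomialNB keeps the magnitudes of the counts
mnb = MultinomialNB().fit(X_counts, y)

print("BernoulliNB ignores count magnitude:", same_model)
print("BernoulliNB per-class feature counts:\n", bnb_counts.feature_count_)
print("MultinomialNB per-class feature counts:\n", mnb.feature_count_)
```

The rows of `feature_count_` follow the sorted class labels, so the first row is class 0 and the second is class 1.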
# Answer 3:
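Bernoulli Naive Bayes has no built-in mechanism for missing values, and scikit-learn's BernoulliNB raises an error if the input contains NaN. Missing values therefore have to be handled before or around the model. Common options are to impute them (for example with the most frequent value of the feature), to encode missingness as an additional binary indicator feature so that its probability is estimated per class like any other feature, or, in a custom implementation, to leave the missing feature out of the likelihood product for that sample, which the conditional independence assumption makes straightforward.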
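One way to combine imputation with a missingness indicator in scikit-learn is sketched below. It is a minimal example on made-up data, not the only approach. `SimpleImputer` with `add_indicator=True` fills each gap with the most frequent value and appends one binary "was missing" column for every feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan)
X = np.array([[1, 0, np.nan],
              [1, 1, 0],
              [0, np.nan, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# Impute with the most frequent value and append missingness indicators,
# then fit Bernoulli Naive Bayes on the completed matrix
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# Shape after imputation: 3 original columns + 2 indicator columns
X_completed = model.named_steps["simpleimputer"].transform(X)
print(X_completed.shape)

# New samples with missing values go through the same preprocessing
prediction = model.predict(np.array([[1, np.nan, 0]]))
print(prediction)
```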
# Answer 4:
Yes, Gaussian Naive Bayes can be used for multi-class classification. It assumes that, within each class, every feature follows a Gaussian (normal) distribution and that the features are conditionally independent given the class. During training it estimates a prior for each class and a mean and variance for each feature within each class. For a given input, it combines each class's prior with the Gaussian likelihood of the feature values and predicts the class with the highest posterior probability. Nothing in this procedure depends on the number of classes.
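A minimal sketch of multi-class use, assuming the three-class Iris dataset bundled with scikit-learn as a stand-in (it is not part of the assignment):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

# One row of per-feature means is estimated for each class
print("Classes:", clf.classes_)
print("Per-class feature means shape:", clf.theta_.shape)

# predict_proba returns one posterior probability per class;
# predict picks the class with the largest posterior
posteriors = clf.predict_proba(X)
predicted = clf.predict(X)
agrees = np.array_equal(predicted, clf.classes_[posteriors.argmax(axis=1)])
print("predict matches argmax of posteriors:", agrees)
```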

# Answer 5:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset (no header row; the last column is the 0/1 spam label)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# Split features and target variable
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Hold out 20% of the data; the cross-validation below runs on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation once per classifier and report the mean of each metric
def evaluate_classifier(classifier, name):
    scores = cross_validate(classifier, X_train, y_train, cv=10,
                            scoring=("accuracy", "precision", "recall", "f1"))

    print(f"Performance metrics for {name}:")
    print(f"Accuracy: {scores['test_accuracy'].mean():.3f}")
    print(f"Precision: {scores['test_precision'].mean():.3f}")
    print(f"Recall: {scores['test_recall'].mean():.3f}")
    print(f"F1 Score: {scores['test_f1'].mean():.3f}")
    print()

# Evaluate classifiers
evaluate_classifier(bernoulli_nb, "BernoulliNB")
evaluate_classifier(multinomial_nb, "MultinomialNB")
evaluate_classifier(gaussian_nb, "GaussianNB")


Performance metrics for BernoulliNB:
Accuracy: 0.885
Precision: 0.881
Recall: 0.814
F1 Score: 0.846

Performance metrics for MultinomialNB:
Accuracy: 0.792
Precision: 0.743
Recall: 0.706
F1 Score: 0.724

Performance metrics for GaussianNB:
Accuracy: 0.821
Precision: 0.696
Recall: 0.956
F1 Score: 0.805



Discussion:

Accuracy: BernoulliNB has the highest accuracy (0.885), followed by GaussianNB (0.821) and MultinomialNB (0.792), so BernoulliNB makes the most correct predictions overall.
Precision: BernoulliNB has the highest precision (0.881), meaning fewer non-spam messages among those it flags as spam. MultinomialNB follows (0.743), and GaussianNB has the lowest precision (0.696).
Recall: GaussianNB has by far the highest recall (0.956), so it catches the most spam messages, ahead of BernoulliNB (0.814) and MultinomialNB (0.706). Its low precision shows that it achieves this by labelling many messages as spam, including legitimate ones.
F1 Score: F1 score is the harmonic mean of precision and recall. BernoulliNB has the highest F1 score (0.846), indicating the best balance between the two, followed by GaussianNB (0.805) and MultinomialNB (0.724).
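The harmonic-mean relationship can be checked against the numbers printed above. The sketch below applies it to the averaged precision and recall of each classifier. The match with the reported F1 is only approximate, because the reported F1 is the mean of the per-fold F1 scores rather than the harmonic mean of the averaged precision and recall.

```python
# (precision, recall, reported F1) copied from the cross-validation output above
reported = {
    "BernoulliNB": (0.881, 0.814, 0.846),
    "MultinomialNB": (0.743, 0.706, 0.724),
    "GaussianNB": (0.696, 0.956, 0.805),
}

def harmonic_mean(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

approx_f1 = {name: harmonic_mean(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print(f"{name}: harmonic mean = {approx_f1[name]:.3f}, reported F1 = {f1:.3f}")
```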

Conclusion:

Based on the performance metrics, Bernoulli Naive Bayes performed best overall on this spam detection task, with the highest accuracy (0.885), precision (0.881), and F1 score (0.846). Gaussian Naive Bayes achieved the highest recall (0.956) but the lowest precision (0.696), so it catches most spam at the cost of flagging many legitimate messages. Multinomial Naive Bayes had the lowest accuracy, recall, and F1 score. A likely explanation is that BernoulliNB binarizes each feature at its default threshold of 0, so it only uses whether a word or character appears at all, which seems to be a strong and robust signal here. The Spambase features are skewed, non-negative frequencies with many zeros, which fits GaussianNB's normality assumption poorly, and they are percentages and run-length statistics rather than the raw counts MultinomialNB is designed for. However, further evaluation on larger datasets and additional feature engineering might be required to draw more definitive conclusions and identify the most suitable variant of Naive Bayes for this particular problem. It is also important to note that Naive Bayes classifiers assume independence between features, which may not hold true in some real-world scenarios, and this can limit their performance on more complex datasets.