# **Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?**


### Probability of an Employee Being a Smoker Given Health Insurance Usage:
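The question already states the quantity we need: "40% of the employees who use the plan are smokers" is the conditional probability $P(S \mid H)$. Bayes' theorem is not required; the definition of conditional probability is enough to confirm the value.

Let's denote:
* $S$: Event that an employee is a smoker.
* $H$: Event that an employee uses the health insurance plan.

We are given:
* $P(H) = 0.7$ (probability of using the health insurance plan)
* $P(S \mid H) = 0.4$ (probability of being a smoker given health insurance usage)

We want to find $P(S \mid H)$.

* By the definition of conditional probability: $$P(S \mid H) = \frac{P(S \cap H)}{P(H)}$$
* The joint probability is $P(S \cap H) = P(S \mid H) \cdot P(H) = 0.4 \cdot 0.7 = 0.28$, so: $$P(S \mid H) = \frac{0.28}{0.7} = 0.4$$
* Bayes' theorem, $P(S \mid H) = \frac{P(H \mid S) \cdot P(S)}{P(H)}$, would require $P(H \mid S)$ and $P(S)$, neither of which is given.

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is 0.4.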
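
A quick numerical check: the sketch below recovers $P(S \mid H)$ from the joint probability using the definition of conditional probability. The variable names are illustrative and not part of the original problem.

```python
# Given values from the problem statement
p_h = 0.7          # P(H): employee uses the health insurance plan
p_s_given_h = 0.4  # P(S|H): employee smokes, given plan usage

# Joint probability P(S and H) via the multiplication rule
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S|H) = P(S and H) / P(H)
p_s_given_h_check = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {p_s_given_h_check:.2f}")
```

Dividing the joint probability by $P(H)$ simply undoes the multiplication, which is why the answer equals the 40% stated in the question.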

# **Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**


## Bernoulli Naive Bayes vs. Multinomial Naive Bayes:
* Both Bernoulli Naive Bayes (BNB) and Multinomial Naive Bayes (MNB) are variants of the Naive Bayes algorithm. They differ in how they model each feature.
* Bernoulli Naive Bayes:
    * Used for binary features (e.g., presence/absence of a word).
    * Models each feature as a Bernoulli variable: it either occurs or it does not.
    * Explicitly accounts for features that do not occur, so the absence of a feature also contributes to the likelihood.

* Example: Text classification where each word is either present or absent.

* Multinomial Naive Bayes:
    * Widely used for document classification.
    * Models how many times each feature occurs (e.g., word counts or frequencies).
    * Only features that actually occur contribute to the likelihood; absent features are ignored.
* Example: Classifying documents based on word frequencies.

* In summary:
Both models use all features. BNB looks only at whether each feature occurs, while MNB also uses how often it occurs.
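
To make the contrast concrete, here is a minimal pure-Python sketch of the two class-conditional log-likelihoods. The vocabulary and the parameter values are hypothetical, chosen only for illustration; a real model would estimate them from training data (with smoothing).

```python
import math

# Hypothetical parameters for one class over a three-word vocabulary.
vocab = ["free", "meeting", "offer"]
# Bernoulli: P(word appears at least once in a document | class)
p_present = {"free": 0.8, "meeting": 0.1, "offer": 0.6}
# Multinomial: P(a given token is this word | class); values sum to 1
theta = {"free": 0.5, "meeting": 0.1, "offer": 0.4}

def bernoulli_log_likelihood(counts):
    """Every vocabulary word contributes, whether it is present or absent."""
    total = 0.0
    for word in vocab:
        present = counts.get(word, 0) > 0
        p = p_present[word]
        total += math.log(p if present else 1.0 - p)
    return total

def multinomial_log_likelihood(counts):
    """Only words that occur contribute, weighted by their counts.
    The multinomial coefficient is omitted: it does not depend on the class."""
    return sum(n * math.log(theta[word]) for word, n in counts.items())

doc_once = {"free": 1}    # "free" appears once
doc_thrice = {"free": 3}  # "free" appears three times

# BNB sees the same presence/absence pattern in both documents.
print(bernoulli_log_likelihood(doc_once) == bernoulli_log_likelihood(doc_thrice))
# MNB scores them differently because the counts differ.
print(multinomial_log_likelihood(doc_once), multinomial_log_likelihood(doc_thrice))
```

Note that the Bernoulli score includes terms for "meeting" and "offer" even though neither word appears, while the multinomial score depends only on the word that does appear.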

# **Q3. How does Bernoulli Naive Bayes handle missing values?**


## Handling Missing Values in Bernoulli Naive Bayes:
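* Bernoulli Naive Bayes assumes binary features (presence/absence) and has no built-in mechanism for missing values.
* In scikit-learn, `BernoulliNB` does not accept `NaN` inputs, so missing values have to be handled before fitting.
* A common convention when dealing with missing values:
    - Treat missing features as absent (i.e., value = 0).
    - This aligns with the binary nature of BNB.
    - It keeps the model simple, but it conflates "not observed" with "observed and absent", which can bias the estimates when many values are missing.
* Alternatives include imputing the most frequent value or adding a separate binary indicator feature that marks missingness.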
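
The "treat missing as absent" convention can be sketched as a small preprocessing step. The helper name `missing_as_absent` and the toy data are hypothetical; this is one possible convention, not something Bernoulli Naive Bayes does on its own.

```python
import math

def missing_as_absent(rows):
    """Return a copy of `rows` with None/NaN entries replaced by 0 (feature absent)."""
    cleaned = []
    for row in rows:
        cleaned.append([
            0 if value is None or (isinstance(value, float) and math.isnan(value)) else value
            for value in row
        ])
    return cleaned

# Toy binary feature matrix with two missing entries
X_raw = [
    [1, None, 0],
    [0, 1, float("nan")],
]
X_clean = missing_as_absent(X_raw)
print(X_clean)  # [[1, 0, 0], [0, 1, 0]]
```

With a pandas DataFrame, `DataFrame.fillna(0)` achieves the same effect before the data is passed to a classifier.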

# **Q4. Can Gaussian Naive Bayes be used for multi-class classification?**

## Gaussian Naive Bayes for Multi-Class Classification:
* Gaussian Naive Bayes (GNB) assumes that, within each class, every feature follows a Gaussian (normal) distribution.
* GNB is primarily used for continuous-valued features.
* It handles multi-class classification natively; it is not an extension of a binary classifier.
* For each class, it estimates a prior plus a mean and variance for each feature.
* To classify a sample, it computes a posterior score for every class from these Gaussians and predicts the class with the highest score.
* Therefore, yes, GNB can be used for multi-class classification.
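
As a rough illustration of why the number of classes does not matter, the sketch below scores a sample against three classes and picks the best one. The class labels and the fitted parameters are hypothetical; in practice they would be estimated from training data.

```python
import math

# Hypothetical fitted parameters for three classes and two features:
# class label -> (prior, per-feature means, per-feature variances)
params = {
    "A": (1 / 3, [0.0, 0.0], [1.0, 1.0]),
    "B": (1 / 3, [5.0, 5.0], [1.0, 1.0]),
    "C": (1 / 3, [10.0, 0.0], [1.0, 1.0]),
}

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(x):
    """Return the class with the highest log prior plus summed log-likelihoods."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        log_post = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            log_post += gaussian_log_pdf(xi, m, v)
        scores[label] = log_post
    return max(scores, key=scores.get)

print(predict([4.5, 5.2]))  # closest to class B's means
print(predict([9.0, 0.5]))  # closest to class C's means
```

The same loop works for two classes or twenty: each class simply adds one more entry to `scores`.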

# **Q5. Assignment:**
## **Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
## **Implementation:**
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
## **Results:**
**Report the following performance metrics for each classifier:**
* Accuracy
* Precision
* Recall
* F1 score
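
For reference, the four metrics can be computed from the counts of a binary confusion matrix, with spam as the positive class. The sketch below uses hypothetical counts, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: of the messages flagged as spam, how many really are spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam messages, how many were flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A classifier that flags almost everything as spam gets high recall but low precision, which is why both are reported alongside accuracy.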
## **Discussion:**
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
## **Conclusion:**
Summarise your findings and provide some suggestions for future work.

```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (the file has no header row, so column names are supplied)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
    "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free",
    "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money",
    "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650",
    "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857",
    "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology",
    "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct",
    "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project",
    "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference",
    "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$",
    "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "is_spam"
]
data = pd.read_csv(url, names=column_names)

# Split features and target variable
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Initialize classifiers with default hyperparameters
classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB()
}

# Metrics to report for each classifier
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Mean cross-validated scores, keyed by classifier name
results = {}

# Evaluate each classifier using 10-fold cross-validation
for name, classifier in classifiers.items():
    # cross_validate computes all metrics on the same folds in a single pass
    cv_results = cross_validate(classifier, X, y, cv=10, scoring=scoring)

    # Store the mean of each metric across the 10 folds
    results[name] = {
        "Accuracy": cv_results['test_accuracy'].mean(),
        "Precision": cv_results['test_precision'].mean(),
        "Recall": cv_results['test_recall'].mean(),
        "F1 Score": cv_results['test_f1'].mean(),
    }

    # Print results
    print("Classifier:", name)
    for metric, value in results[name].items():
        print(f"{metric}:", value)
    print()

# Print overall summary
print("Overall Summary:")
for name, scores in results.items():
    print("Classifier:", name)
    for metric, value in scores.items():
        print(f"{metric}:", value)
    print()
```


```text
Classifier: Bernoulli Naive Bayes
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Classifier: Multinomial Naive Bayes
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Classifier: Gaussian Naive Bayes
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995

Overall Summary:
Classifier: Bernoulli Naive Bayes
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Classifier: Multinomial Naive Bayes
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Classifier: Gaussian Naive Bayes
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995
```

