### Q1. Probability of Being a Smoker Given Health Insurance Plan

The question asks for the probability that an employee is a smoker given that they use the company's health insurance plan, \( P(S | I) \). Bayes' Theorem is not needed here, because this conditional probability is stated directly in the problem.

Let:
- \( P(I) \) = Probability of using the health insurance plan = 0.70
- \( P(S | I) \) = Probability of being a smoker given using the plan = 0.40

The quantity we need, \( P(S | I) \), is already given as 0.40, so the answer is:

\[ P(S | I) = 0.40 \]
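
As a quick sanity check, the same value can be recovered from the definition of conditional probability, \( P(S | I) = P(S \cap I) / P(I) \). The sketch below uses only the two numbers from the problem; the variable names are illustrative.

```python
# Values given in the problem statement
p_insurance = 0.70               # P(I)
p_smoker_given_insurance = 0.40  # P(S | I)

# Joint probability of smoking AND using the plan: P(S and I) = P(S | I) * P(I)
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# Recover the conditional probability from its definition: P(S | I) = P(S and I) / P(I)
p_check = p_smoker_and_insurance / p_insurance

print(f"P(S and I) = {p_smoker_and_insurance:.2f}")
print(f"P(S | I)   = {p_check:.2f}")
```

So 28% of all employees are smokers who use the plan, and dividing by the 70% who use the plan returns the stated 0.40.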

### Q2. Difference Between Bernoulli Naive Bayes and Multinomial Naive Bayes

**Bernoulli Naive Bayes** and **Multinomial Naive Bayes** are two types of Naive Bayes classifiers used for different types of data:

- **Bernoulli Naive Bayes**:
  - Assumes binary/boolean features (0 or 1).
  - Useful for binary or presence/absence data.
  - Models each feature as Bernoulli-distributed within a class, so the absence of a feature (a 0) also contributes evidence to the prediction.

- **Multinomial Naive Bayes**:
  - Assumes features follow a multinomial distribution.
  - Suitable for data where features are counts or frequencies (e.g., word counts in text classification).
  - Models each class by the relative frequency of each feature in that class, so how often a feature occurs matters, not just whether it occurs.
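
The difference is easiest to see on a toy example. The sketch below uses a small hypothetical word-count matrix (the data and variable names are made up for illustration) and shows that `BernoulliNB` treats a count of 5 the same as a count of 1, while `MultinomialNB` does not.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are vocabulary terms
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y_toy = np.array([0, 0, 1, 1])

# MultinomialNB uses the counts themselves
mnb_toy = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB thresholds every feature at `binarize` (default 0.0),
# so only presence/absence of each term is kept
bnb_toy = BernoulliNB(binarize=0.0).fit(X_counts, y_toy)

# Two new documents that differ only in how often the first term appears
x_once = np.array([[1, 0, 0]])
x_five = np.array([[5, 0, 0]])

print("BernoulliNB  :", bnb_toy.predict_proba(x_once), bnb_toy.predict_proba(x_five))
print("MultinomialNB:", mnb_toy.predict_proba(x_once), mnb_toy.predict_proba(x_five))
```

The Bernoulli model returns identical probabilities for both documents, whereas the multinomial model becomes more confident as the count grows.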

### Q3. Handling Missing Values in Bernoulli Naive Bayes

Bernoulli Naive Bayes doesn't handle missing values directly. Common strategies to address missing values before applying Bernoulli Naive Bayes include:

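- **Imputation**: Fill missing values with the mode (most frequent value) of each feature. The mean is a poor fit for binary features because it generally produces values that are neither 0 nor 1.
- **Drop**: Remove rows with missing values.
- **Create a Missing Indicator**: Add a binary feature that flags whether the original value was missing, which fits the Bernoulli model naturally.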
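
One way to combine mode imputation with a missing indicator is sketched below, assuming scikit-learn's `SimpleImputer` is an acceptable preprocessing step. The toy data and variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary feature matrix with missing entries (np.nan)
X_missing = np.array([
    [1, 0],
    [1, np.nan],
    [0, 1],
    [np.nan, 1],
    [1, 0],
    [0, 1],
], dtype=float)
y_missing = np.array([1, 1, 0, 0, 1, 0])

# Impute each column with its most frequent value and append one
# binary "was missing" column per feature that had missing values
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X_missing, y_missing)

# Inspect what the classifier actually sees after preprocessing
X_imputed = model.named_steps["simpleimputer"].transform(X_missing)
print(X_imputed.shape)  # two original columns plus two indicator columns

# New rows with missing values go through the same imputation
print(model.predict(np.array([[np.nan, 0.0]])))
```

Putting the imputer inside the pipeline means the imputation statistics are learned from training folds only when the pipeline is cross-validated.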

### Q4. Gaussian Naive Bayes for Multi-class Classification

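Yes, **Gaussian Naive Bayes** can be used for multi-class classification. It assumes that, within each class, every feature follows a Gaussian distribution. The model estimates a mean and variance per feature for each class, along with a class prior, and predicts the class with the highest posterior probability. Nothing in this procedure is limited to two classes.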
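
A minimal sketch on the three-class Iris dataset bundled with scikit-learn illustrates this. The dataset choice and split parameters are assumptions made for the example, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features
X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

gnb_iris = GaussianNB().fit(Xi_train, yi_train)

# One row of per-feature means is estimated for each class
print("Classes:", gnb_iris.classes_)
print("Per-class means shape:", gnb_iris.theta_.shape)

# predict_proba returns one posterior probability per class
print("Posterior shape:", gnb_iris.predict_proba(Xi_test).shape)
print("Test accuracy:", gnb_iris.score(Xi_test, yi_test))
```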

### Discussion and Conclusion

1. **Discussion**:
   - Compare the performance metrics for each classifier.
   - Discuss which Naive Bayes variant performed the best and why.
   - Address any limitations of Naive Bayes observed during evaluation.

2. **Conclusion**:
   - Summarize findings and suggest improvements or future work.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Path to the dataset
file_path = 'work/spambase.data'  # Update this path to where you saved the file

# Load the dataset
df = pd.read_csv(file_path, header=None)

# Display the first few rows of the dataframe to understand its structure
df.head()


In [None]:
# Define feature names and target column
num_features = df.shape[1] - 1  # Number of features (one less than the number of columns)
feature_names = [f'feature_{i}' for i in range(num_features)]
df.columns = feature_names + ['label']  # Add a label column

# Split into features and target
X = df[feature_names]
y = df['label']

# Split the data into training and test sets
# (kept as a hold-out; the cross-validation below runs on the full X and y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate

# Initialize classifiers
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

# Metrics to report for each classifier
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Function to perform 10-fold cross-validation and report metrics
def evaluate_classifier(clf, X, y):
    # A single cross_validate call scores every metric on the same 10 folds,
    # instead of refitting the model once per metric
    results = cross_validate(clf, X, y, cv=10, scoring=scoring)
    return tuple(results[f'test_{metric}'].mean() for metric in scoring)

# Helper to print the four mean scores for one classifier
def print_metrics(name, metrics):
    accuracy, precision, recall, f1 = metrics
    print(f"{name} Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

# Evaluate each classifier
bnb_metrics = evaluate_classifier(bnb, X, y)
mnb_metrics = evaluate_classifier(mnb, X, y)
gnb_metrics = evaluate_classifier(gnb, X, y)

# Print the metrics for each classifier
print_metrics("Bernoulli Naive Bayes", bnb_metrics)
print()
print_metrics("Multinomial Naive Bayes", mnb_metrics)
print()
print_metrics("Gaussian Naive Bayes", gnb_metrics)


**Discussion**
- **Compare the metrics**: Review the accuracy, precision, recall, and F1 scores for each classifier.
- **Determine the best performer**: The classifier with the highest F1 score, which balances precision and recall, is a reasonable choice for best overall when both error types matter.
- **Discuss limitations**: Note any limitations of Naive Bayes classifiers, such as the assumption of feature independence and the potential impact of correlated features.

**Conclusion**
- **Summarize findings**: Conclude which Naive Bayes variant performed the best and why.
- **Suggestions for future work**: Suggest potential improvements or alternative approaches, such as more advanced classifiers or feature engineering.