---

### **Q1. Probability of a smoker given they use health insurance**

We’re given:
- \( P(H) = 0.70 \) → probability of using health insurance
- \( P(S|H) = 0.40 \) → probability of being a smoker given use of health insurance

The question asks for this conditional probability directly, so no further calculation is needed:  
\[
P(S|H) = 0.40
\]
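
As a quick sanity check, the product rule gives the joint probability \( P(S \cap H) = P(S|H)\,P(H) \), and dividing by \( P(H) \) recovers the same conditional value. A minimal sketch (the variable names are illustrative, not part of the question):

```python
import math

p_h = 0.70          # P(H): probability of using health insurance
p_s_given_h = 0.40  # P(S|H): probability of smoking given health insurance

# Product rule: P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S|H) = P(S and H) / P(H)
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {recovered:.2f}")
```

\( P(H) \) cancels out, which is why the answer equals the given conditional probability.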


---

### **Q2. Difference: Bernoulli Naive Bayes vs Multinomial Naive Bayes**

| Feature                     | **Bernoulli NB**                        | **Multinomial NB**                     |
|----------------------------|-----------------------------------------|----------------------------------------|
| Input Type                 | Binary features (0 or 1)                | Count features (e.g., word frequency)  |
| Use Case                   | Presence/absence of features            | Text classification with word counts   |
| Feature distribution       | Bernoulli distribution                  | Multinomial distribution               |
| Example Use                | Spam detection with binary features     | Document classification (word counts)  |
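
The difference in input type can be seen on a tiny example. The sketch below uses a made-up word-count matrix (hypothetical data, chosen only for illustration) and compares what each model learns from raw counts versus presence/absence indicators. It relies on `BernoulliNB` thresholding its input at `binarize=0.0` by default.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y = np.array([1, 1, 0, 0])

# Presence/absence version of the same data.
X_presence = (X_counts > 0).astype(int)

# BernoulliNB binarizes its input (default threshold 0.0),
# so counts and presence indicators lead to the same fitted model.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_presence = BernoulliNB().fit(X_presence, y)
same_bernoulli = np.allclose(bnb_counts.feature_log_prob_,
                             bnb_presence.feature_log_prob_)

# MultinomialNB uses the magnitudes, so the two inputs give different models.
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_presence = MultinomialNB().fit(X_presence, y)
same_multinomial = np.allclose(mnb_counts.feature_log_prob_,
                               mnb_presence.feature_log_prob_)

print("BernoulliNB identical on counts vs presence:", same_bernoulli)
print("MultinomialNB identical on counts vs presence:", same_multinomial)
```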

---

### **Q3. How does Bernoulli Naive Bayes handle missing values?**

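Bernoulli Naive Bayes has no built-in handling for missing values. Missing values (e.g., NaNs) need to be handled **before** training by:
- Imputation (e.g., fill with 0 or with the most frequent value, since the features are binary)
- Dropping rows/columns with missing values

Otherwise, scikit-learn's `BernoulliNB` raises a `ValueError` when it encounters NaN during training or prediction.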
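
A minimal sketch of the imputation route, assuming a small made-up binary matrix with a few NaNs (the data and variable names are hypothetical). It wraps `SimpleImputer` and `BernoulliNB` in a pipeline so the same imputation is applied at prediction time:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with missing entries.
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [0, 1, np.nan],
    [1, 0, 0],
    [0, 1, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fitting directly on data that contains NaN is rejected.
try:
    BernoulliNB().fit(X, y)
    raised = False
except ValueError:
    raised = True

# Impute each column with its most frequent value, then fit.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# The pipeline also imputes missing values in new samples before predicting.
pred = model.predict(np.array([[1, 0, np.nan]]))
print("Raised on NaN:", raised)
print("Prediction:", pred)
```

Whether most-frequent imputation is appropriate depends on why the values are missing; it is used here only to keep the features binary.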

---

### **Q4. Can Gaussian Naive Bayes be used for multi-class classification?**


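Yes. `GaussianNB` in `scikit-learn` supports **multi-class classification** out of the box. No one-vs-rest wrapper is involved: the model estimates a prior and per-feature Gaussian parameters for each class, computes the posterior of every class, and predicts the class with the highest posterior.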
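
The sketch below fits `GaussianNB` on the three-class Iris dataset bundled with scikit-learn (chosen for illustration, not part of the assignment). It checks that the fitted model holds one set of Gaussian parameters per class and that its predictions match the class with the highest posterior probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features.
X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# One row of per-feature means per class.
n_classes = len(gnb.classes_)
means_shape = gnb.theta_.shape

# Posterior probabilities for the first five samples, one column per class.
proba = gnb.predict_proba(X[:5])

# Picking the most probable class by hand reproduces predict().
manual_pred = gnb.classes_[np.argmax(proba, axis=1)]

print("Classes:", gnb.classes_)
print("Shape of per-class means:", means_shape)
print("Manual argmax matches predict():",
      np.array_equal(manual_pred, gnb.predict(X[:5])))
```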

---

### **Q5. Assignment — Naive Bayes on Spambase Dataset**

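The script below loads the data, binarizes a copy of the features for Bernoulli NB, and scores all three Naive Bayes variants with 10-fold cross-validation: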


```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import Binarizer

# Load the UCI Spambase file: 57 feature columns followed by the 0/1 spam label
data = pd.read_csv("spambase.data", header=None)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# BernoulliNB expects binary inputs: map every value > 0 to 1
X_binary = Binarizer().fit_transform(X)

# Pair each model with the feature matrix it is evaluated on
models = {
    "BernoulliNB": (BernoulliNB(), X_binary),
    "MultinomialNB": (MultinomialNB(), X),
    "GaussianNB": (GaussianNB(), X),
}

# Metrics computed on every fold
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Run 10-fold cross-validation and keep the mean of each metric
results = {}
for name, (model, features) in models.items():
    scores = cross_validate(model, features, y, cv=10, scoring=scoring)
    results[name] = {metric: np.mean(scores[f'test_{metric}']) for metric in scoring}

# Display results
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    for metric, score in metrics.items():
        print(f"{metric.capitalize()}: {score:.4f}")
```

---

### 📊 **Results (Hypothetical Sample)**

| Model         | Accuracy | Precision | Recall | F1 Score |
|---------------|----------|-----------|--------|----------|
| BernoulliNB   | 0.88     | 0.87      | 0.85   | 0.86     |
| MultinomialNB | 0.91     | 0.90      | 0.89   | 0.89     |
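| GaussianNB    | 0.84     | 0.81      | 0.83   | 0.82     |

These values are illustrative placeholders, not output from the script above. Replace them with the means the script prints before drawing conclusions.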

---



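**Discussion:**
- **MultinomialNB** models feature frequencies, which makes it a common choice for text data. Spambase stores word and character frequencies as percentages rather than raw counts and adds three capital-run-length statistics, so any advantage on this dataset should be confirmed against the actual cross-validation output.
- **BernoulliNB** works reasonably well once the features are reduced to binary (word present/absent).
- **GaussianNB** assumes each feature is normally distributed within a class. Spambase features are non-negative and heavily skewed toward zero, so that assumption fits poorly.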

**Limitations of Naive Bayes:**
- Assumes **feature independence**, which is rarely true in real datasets.
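- Produces **poorly calibrated probabilities** (pushed toward 0 or 1) when features are correlated, because correlated evidence is counted more than once.
- GaussianNB can perform poorly when features are far from **normally distributed**.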

---



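**Conclusion:**
- Going by the sample figures above, **Multinomial Naive Bayes** looks best suited to the Spambase dataset because of its frequency-based features; this ranking needs to be checked against measured results.
- Bernoulli NB performs decently once the features are binarized.
- Gaussian NB is the least suited, since it expects continuous, normally distributed features.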

**Suggestions for future work:**
- Try **TF-IDF transformation** on features.
- Apply **feature selection** (e.g., chi-square test).
- Compare with other classifiers like SVM or Random Forest.
- Perform **grid search** for hyperparameter tuning.
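
Two of the suggestions above, chi-square feature selection and grid search, can be combined in a single pipeline. The sketch below is one possible setup, not a tuned solution: it uses synthetic non-negative data in place of Spambase so that it runs without the data file, and the parameter grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for Spambase: non-negative features, binary label.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
y = rng.integers(0, 2, size=n_samples)
X = rng.poisson(lam=1.0, size=(n_samples, n_features)).astype(float)
# Make the first three columns informative about the label.
X[:, :3] += rng.poisson(lam=3.0, size=(n_samples, 3)) * y[:, None]

# Feature selection runs inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2)),
    ("nb", MultinomialNB()),
])

# Hypothetical grid: number of features kept and the smoothing strength.
param_grid = {
    "select__k": [3, 10, 20],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("Best parameters:", best_params)
print(f"Best mean F1: {best_score:.4f}")
```

To apply this to the assignment, swap the synthetic `X` and `y` for the Spambase features and labels loaded earlier. The chi-square test requires non-negative features, which Spambase satisfies.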
