<a href="https://colab.research.google.com/github/golu628/Celebel/blob/main/naive_bayes_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ Q1. Probability of Smoker Given Use of Health Plan
Given:

P(Uses Plan) = 0.70

P(Smoker | Uses Plan) = 0.40

The question asks for P(Smoker | Uses Plan), which is given directly, so no further calculation is needed; P(Uses Plan) = 0.70 does not enter the answer.

✔️ The probability that an employee is a smoker, given that they use the health insurance plan, is 0.40 (40%).
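
As a quick check of the arithmetic, the sketch below restates the two given values and also computes the joint probability with the product rule. The variable names are illustrative only and are not part of the original question.

```python
# Values given in the question
p_plan = 0.70                 # P(Uses Plan)
p_smoker_given_plan = 0.40    # P(Smoker | Uses Plan)

# The quantity asked for is one of the givens, so it is read off directly
answer = p_smoker_given_plan

# Product rule: P(Smoker and Uses Plan) = P(Smoker | Uses Plan) * P(Uses Plan)
p_smoker_and_plan = p_smoker_given_plan * p_plan

print(f"P(Smoker | Uses Plan)   = {answer:.2f}")
print(f"P(Smoker and Uses Plan) = {p_smoker_and_plan:.2f}")
```

P(Uses Plan) only matters for the joint probability (0.28 here) or for the reverse conditional P(Uses Plan | Smoker), which would additionally require P(Smoker).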

✅ Q2. Bernoulli vs Multinomial Naive Bayes
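
To make the contrast concrete, here is a minimal sketch that fits both variants on a tiny, made-up word-count matrix. The data and the column meanings ("free", "meeting") are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import Binarizer

# Hypothetical word counts: column 0 = "free", column 1 = "meeting"
X_counts = np.array([
    [5, 0],
    [3, 1],
    [0, 2],
    [1, 4],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# Binarize: any count greater than 0 becomes 1
X_binary = Binarizer().fit_transform(X_counts)

bnb = BernoulliNB().fit(X_binary, y)     # models presence/absence
mnb = MultinomialNB().fit(X_counts, y)   # models counts

print(X_binary)
print("BernoulliNB  :", bnb.predict(X_binary))
print("MultinomialNB:", mnb.predict(X_counts))
```

After binarization, the rows [3, 1] and [1, 4] collapse to the same vector [1, 1], even though they carry different labels. That is exactly the frequency information BernoulliNB discards and MultinomialNB keeps. BernoulliNB also has a `binarize` parameter (default 0.0) that applies the same thresholding internally.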

| Feature | Bernoulli Naive Bayes | Multinomial Naive Bayes |
| --- | --- | --- |
| Input type | Binary features (0/1) | Count-based (non-negative integer) features |
| Ideal for | Text classification with binary word occurrence | Text classification with term frequencies or counts |
| Likelihood based on | Presence/absence of features | Frequency of features |
| Example | Spam detection with presence/absence of words | Spam detection with word counts |

✅ Q3. How Bernoulli Naive Bayes Handles Missing Values
👉 It doesn’t handle missing values automatically; scikit-learn's BernoulliNB rejects input that contains NaN.

You must impute or remove missing values before training.

Use SimpleImputer from sklearn.impute if needed.
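
One way to wire this up is sketched below, assuming binary features where filling with the most frequent value is acceptable. The toy data is made up, and the imputation strategy is an assumption rather than a recommendation from the assignment.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with a few missing entries
X = np.array([
    [1, 0, np.nan],
    [0, 1, 1],
    [1, np.nan, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y = np.array([1, 0, 1, 0, 1, 0])

# Fitting BernoulliNB directly on data with NaN is expected to fail
try:
    BernoulliNB().fit(X, y)
except ValueError as err:
    print("BernoulliNB rejected NaN input:", type(err).__name__)

# Impute first, then classify; the pipeline keeps both steps together
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Keeping the imputer inside the pipeline means that, under cross-validation, the fill values are learned from each training fold only.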

✅ Q4. Can Gaussian Naive Bayes Be Used for Multi-Class?
✔️ Yes.

GaussianNB in scikit-learn supports multi-class classification.

It assumes that, within each class, every feature follows a normal distribution, and it estimates a separate mean and variance for each class and feature.
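
A short sketch on scikit-learn's bundled three-class Iris dataset illustrates this. Iris is not part of the assignment and is used here only because it ships with the library.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features
X, y = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X, y)

print(gnb.classes_)       # one entry per class
print(gnb.theta_.shape)   # per-class feature means: (n_classes, n_features)

# Class probabilities for the first sample: one column per class
proba = gnb.predict_proba(X[:1])
print(proba.round(3))
```

No one-vs-rest wrapper is needed: the model stores one set of Gaussian parameters per class and picks the class with the highest posterior.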

✅ Q5. Spambase Naive Bayes Assignment (Implementation Plan)
📦 Data Preparation
Download the Spambase dataset from the UCI Machine Learning Repository.

Load spambase.data (comma-separated, no header row, label in the last column) into a pandas DataFrame.
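
The loading step can be sketched as a small helper. `load_spambase` is a hypothetical name, not part of any library, and the two-row sample below is a made-up stand-in so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_spambase(source):
    """Hypothetical helper: read a header-less, Spambase-style CSV.

    `source` may be a file path or a file-like object. The last column is
    treated as the label (1 = spam, 0 = not spam); the rest are features.
    """
    df = pd.read_csv(source, header=None)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1].astype(int)
    return X, y


# Made-up stand-in with 3 features plus a label column
sample = io.StringIO("0.0,0.64,3.7,1\n0.21,0.0,1.0,0\n")
X, y = load_spambase(sample)
print(X.shape, y.tolist())
```

With the real file, the call would be `load_spambase("spambase.data")`, which should give 4601 rows and 57 feature columns.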

🔧 Model Implementation
```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import Binarizer

# Load data: the file has no header row; the last column is the label (1 = spam)
df = pd.read_csv("spambase.data", header=None)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Preprocessing for BernoulliNB: any value greater than 0 becomes 1
X_binary = Binarizer().fit_transform(X)

# Evaluation metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1'
}

# 10-fold stratified CV keeps the spam/non-spam ratio similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Models
bnb = BernoulliNB()
mnb = MultinomialNB()  # requires non-negative features, which Spambase satisfies
gnb = GaussianNB()

# Cross-validation
results_bnb = cross_validate(bnb, X_binary, y, cv=cv, scoring=scoring)
results_mnb = cross_validate(mnb, X, y, cv=cv, scoring=scoring)
results_gnb = cross_validate(gnb, X, y, cv=cv, scoring=scoring)


# Helper to display the mean of each metric across folds
def show_results(name, results):
    print(f"\n{name} Performance:")
    for metric in scoring:
        print(f"{metric.capitalize()}: {results[f'test_{metric}'].mean():.4f}")


show_results("BernoulliNB", results_bnb)
show_results("MultinomialNB", results_mnb)
show_results("GaussianNB", results_gnb)
```
🧠 Discussion of Results
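MultinomialNB is often a strong baseline for text and spam detection because it models how often each feature occurs. Spambase stores word and character frequencies as percentages rather than raw counts, so the ranking should be read from the cross-validated scores above rather than assumed.

BernoulliNB sees only whether each feature is present, so it discards frequency information and may underperform when counts matter.

GaussianNB assumes each feature is normally distributed within a class, which fits poorly here: Spambase features are continuous but mostly zero and heavily right-skewed.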

✅ Conclusion
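Best performer: Likely MultinomialNB, as is typical for spam detection, though this should be confirmed against the cross-validated F1 scores.

Limitation: Naive Bayes assumes features are conditionally independent given the class, which is often not true in real datasets.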

Future Work:

Try TF-IDF preprocessing

Apply ensemble methods or hybrid models

Address correlated features
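
As a starting point for the TF-IDF idea, the sketch below chains `TfidfTransformer` and `MultinomialNB` in a pipeline. It runs on synthetic Poisson counts with a made-up label rule, not on Spambase, so the score it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for a term-count matrix: 200 documents, 20 terms
X_counts = rng.poisson(lam=1.0, size=(200, 20))

# Made-up label rule: "spam" when terms 0 and 1 outweigh terms 2 and 3
y = (X_counts[:, 0] + X_counts[:, 1] > X_counts[:, 2] + X_counts[:, 3]).astype(int)

# TF-IDF reweighting happens inside the pipeline, so it is refit on each fold
model = make_pipeline(TfidfTransformer(), MultinomialNB())
scores = cross_val_score(model, X_counts, y, cv=5, scoring="f1")

print(f"Mean F1 over 5 folds: {scores.mean():.4f}")
```

For Spambase, the same pipeline could replace `mnb` in the cross-validation call, since its features are non-negative. Whether the reweighting helps on features that are already percentages would have to be checked empirically.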

