In [None]:
### Q1. What is the Probability That an Employee is a Smoker Given That He/She Uses the Health Insurance Plan?
This question asks for the conditional probability \( P(S | I) \), where \( S \) is the event that an employee is a smoker, and \( I \) is the event that the employee uses the health insurance plan.

Given:
- 70% of employees use the health insurance plan, so \( P(I) = 0.7 \).
- 40% of employees who use the health insurance plan are smokers, so \( P(S | I) = 0.4 \).

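Because the problem states \( P(S | I) \) directly, no further calculation is needed: \( P(S | I) = 0.4 \), or 40%. The value \( P(I) = 0.7 \) is not required to answer this question.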
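
As a consistency check, the definition of conditional probability returns the same value. The joint probability \( P(S \cap I) \) is not stated in the problem; it is derived here from the two given figures:

\[
P(S \cap I) = P(S | I)\,P(I) = 0.4 \times 0.7 = 0.28,
\qquad
P(S | I) = \frac{P(S \cap I)}{P(I)} = \frac{0.28}{0.7} = 0.4
\]

So 28% of all employees both use the plan and smoke, which is 40% of the plan users.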

### Q2. What is the Difference Between Bernoulli Naive Bayes and Multinomial Naive Bayes?
- **Bernoulli Naive Bayes**: This classifier is designed for binary/Boolean features, where each feature records the presence or absence of an attribute. It is common in text classification, where a feature indicates whether a specific word occurs in a document. The absence of a feature also contributes to the likelihood, so a missing word counts as evidence.
- **Multinomial Naive Bayes**: This classifier is designed for non-negative count features, such as how many times each word occurs in a document. Unlike Bernoulli, which only sees whether a word is present, Multinomial weights each word by its count. Truly categorical (unordered) features are a different case; scikit-learn provides `CategoricalNB` for those.
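
The difference between the two likelihoods can be sketched in a few lines of plain Python. This is an illustrative sketch, not scikit-learn's implementation: the three-word vocabulary, the document counts, and the per-class parameters `p_spam` and `theta_spam` are all hypothetical values chosen for the example.

```python
import math

def bernoulli_log_likelihood(present, p):
    """Log P(x | class) for binary features; absent features add log(1 - p)."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(present, p)
    )

def multinomial_log_likelihood(counts, theta):
    """Log P(x | class) up to the multinomial coefficient, which does not depend on the class."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))

# Hypothetical word counts for one document over a 3-word vocabulary.
counts = [3, 0, 1]
present = [c > 0 for c in counts]

# Hypothetical class parameters (not estimated from any dataset).
p_spam = [0.8, 0.5, 0.3]      # Bernoulli: P(word appears in a spam document)
theta_spam = [0.6, 0.3, 0.1]  # Multinomial: P(a spam token is this word); sums to 1

print("Bernoulli:  ", bernoulli_log_likelihood(present, p_spam))
print("Multinomial:", multinomial_log_likelihood(counts, theta_spam))
```

Changing the first count from 3 to 1 leaves the Bernoulli value unchanged, because the word is still present, but raises the Multinomial value, because two fewer factors of 0.6 are multiplied in.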

### Q3. How Does Bernoulli Naive Bayes Handle Missing Values?
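Bernoulli Naive Bayes has no built-in mechanism for missing values. The model only distinguishes "present" (1) from "absent" (0), so a missing value must be resolved before training. scikit-learn's `BernoulliNB`, for example, rejects input that contains NaN instead of silently treating it as zero. A common convention is to encode a missing value as 0 (absent), but this is a modeling choice: it assumes "unknown" means "not present". If that assumption could distort the results, impute the missing values, add an explicit missing-value indicator feature, or drop the affected rows or columns during preprocessing.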
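
Two of these preprocessing options can be sketched in plain Python. The helper names `impute_absent` and `add_missing_indicators` are hypothetical, and the sketch assumes missing values are represented as `None`; it is one possible encoding, not a prescribed method.

```python
def impute_absent(rows, missing=None):
    """Encode each feature as 0/1, treating a missing value as absent (0)."""
    return [[0 if v is missing else int(bool(v)) for v in row] for row in rows]

def add_missing_indicators(rows, missing=None):
    """Encode features as 0/1 and append one 'was missing' flag per feature."""
    return [
        [0 if v is missing else int(bool(v)) for v in row]
        + [1 if v is missing else 0 for v in row]
        for row in rows
    ]

# Hypothetical binary feature rows with gaps marked as None.
rows = [[1, None, 0], [None, 1, 1]]

print(impute_absent(rows))
print(add_missing_indicators(rows))
```

The indicator variant keeps the information that a value was unknown, so the classifier can learn whether missingness itself is associated with a class.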

### Q4. Can Gaussian Naive Bayes Be Used for Multi-Class Classification?
Yes, Gaussian Naive Bayes can be used for multi-class classification. It assumes that, within each class, every feature follows a Gaussian (normal) distribution, which makes it a common choice for continuous data. For each class it estimates a prior probability plus a mean and variance per feature, and it predicts the class with the highest posterior probability. Nothing in this procedure is limited to two classes, so the same steps apply to any number of classes.
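
The per-class estimation described above can be sketched without any library. This is a minimal illustration on hypothetical toy data with three classes; the function names and the small variance floor of `1e-9` are assumptions of this sketch, not scikit-learn's `GaussianNB`.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps every variance strictly positive (sketch assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical toy data: three classes, two continuous features.
X = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 1.0], [9.2, 1.3]]
y = ["a", "a", "b", "b", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict_gaussian_nb(model, row) for row in [[1.1, 1.0], [5.0, 5.0], [9.1, 1.1]]]
print(predictions)
```

Adding a fourth class only adds one more entry to `model`; the prediction step is unchanged.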



In [4]:
### Q5. Assignment: Implementing Naive Bayes Classifiers with the Spambase Dataset


import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

# Load the Spambase dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
df = pd.read_csv(data_url, header=None)

# Features and target variable
X = df.iloc[:, :-1]  # All but last column (features)
y = df.iloc[:, -1]   # Last column (target)

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

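# Cross-validate on the training split, then score once on the held-out test split
def evaluate_model(model):
    """Return mean CV accuracy plus test-set accuracy, precision, recall, and F1.

    Reads the unscaled X_train, X_test, y_train, y_test, and kf defined above.
    Precision, recall, and F1 are computed for the positive class (label 1, spam).
    """
    # 10-fold cross-validation accuracy on the training split
    cv_accuracy = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
    # Refit on the full training split and predict the test split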
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    return {
        "Accuracy (CV)": np.mean(cv_accuracy),
        "Accuracy (Test)": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
    }

# Instantiate the Naive Bayes classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

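# Standardized copies of the features.
# Note: evaluate_model() reads the unscaled X_train/X_test, so these arrays do
# not affect the metrics reported for any of the three models. MultinomialNB
# could not use them in any case, because it requires non-negative inputs.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)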

# Evaluate each model
results = {
    "BernoulliNB": evaluate_model(bernoulli_nb),
    "MultinomialNB": evaluate_model(multinomial_nb),
    "GaussianNB": evaluate_model(gaussian_nb),
}

# Display the results
for model_name, metrics in results.items():
    print(f"Performance Metrics for {model_name}:")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.4f}")

### Discussion
##The metrics printed below show which Naive Bayes variant performed best on this train/test split:
##- **Bernoulli Naive Bayes**: Highest test accuracy (0.8806), precision (0.9070), and F1 score (0.8501). With scikit-learn's default `binarize=0.0`, each feature is reduced to whether its value is greater than zero.
##- **Multinomial Naive Bayes**: Lowest test accuracy (0.7861) and F1 score (0.7391). It is intended for count-like features such as word counts.
##- **Gaussian Naive Bayes**: Highest recall (0.9462) but lowest precision (0.7193), with a test accuracy of 0.8208. It models each feature as Gaussian within each class.

##The Spambase features are word and character frequencies plus capital-letter run-length statistics, so they are non-negative continuous values and not raw counts. A plausible reason Bernoulli did best is that whether a term appears at all is already a strong spam signal. Multinomial may be hurt because the features are on very different scales and are not true counts. Gaussian may over-predict spam because skewed, zero-heavy frequencies are far from normally distributed. These explanations are hypotheses and were not tested here.

### Limitations of Naive Bayes
##Naive Bayes classifiers assume that features are conditionally independent given the class, which rarely holds in practice. When features are strongly correlated, the same evidence is effectively counted more than once, so predicted probabilities become overconfident and accuracy can drop. Naive Bayes classifiers may also perform poorly on highly skewed or imbalanced data unless preprocessing addresses these issues.

### Conclusion and Future Work
##To conclude, Naive Bayes is a simple yet effective classifier, especially when features are close to independent. The choice of variant depends on the nature of the data: Bernoulli for binary features, Multinomial for counts, and Gaussian for continuous values. Further work might include experimenting with different preprocessing techniques, comparing against other classification algorithms, or combining Naive Bayes with other models to improve performance and robustness.

Performance Metrics for BernoulliNB:
  Accuracy (CV): 0.8856
  Accuracy (Test): 0.8806
  Precision: 0.9070
  Recall: 0.8000
  F1 Score: 0.8501
Performance Metrics for MultinomialNB:
  Accuracy (CV): 0.7927
  Accuracy (Test): 0.7861
  Precision: 0.7644
  Recall: 0.7154
  F1 Score: 0.7391
Performance Metrics for GaussianNB:
  Accuracy (CV): 0.8187
  Accuracy (Test): 0.8208
  Precision: 0.7193
  Recall: 0.9462
  F1 Score: 0.8173
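
The independence limitation noted in the discussion can be made concrete with a small sketch. It uses a hypothetical feature that is three times as likely under spam as under ham and shows what happens when a perfectly correlated copy of that feature is added. The function name `posterior_spam` and all numbers are illustrative assumptions, not values from the Spambase run.

```python
import math

def posterior_spam(likelihood_ratios, prior_spam=0.5):
    """Naive Bayes posterior P(spam | x) from per-feature ratios P(x | spam) / P(x | ham)."""
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio  # independence assumption: each feature multiplies the odds
    return odds / (1.0 + odds)

# One feature with a hypothetical likelihood ratio of 3.
single = posterior_spam([3.0])
# The same feature entered twice, as if it were two independent observations.
duplicated = posterior_spam([3.0, 3.0])

print(single, duplicated)
```

The duplicate carries no new information, yet the posterior rises from 0.75 to 0.9, because the model multiplies the same evidence in twice.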
