# Naïve bayes-2

#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan? Show using python code

To calculate the probability that an employee is a smoker given that they use the health insurance plan, we can use Bayes' theorem:
**P(Smoker∣Uses Health Insurance) = {P(Uses Health Insurance∣Smoker)⋅P(Smoker)} / P(Uses Health Insurance)**

Given:
* P(Uses Health Insurance)=0.70 (70%)
* P(Smoker∣Uses Health Insurance)=?
* P(Smoker)=? To find
* P(Uses Health Insurance∣Smoker)=0.40 (40%)
We need to find P(Smoker) to complete the calculation. Assuming that the survey represents the overall employee population, we can calculate P(Smoker) as the proportion of smokers in the population.

In [1]:
# Given probabilities
P_uses_health_insurance = 0.70
P_uses_health_insurance_given_smoker = 0.40

# Calculate P(Smoker) using the law of total probability
P_smoker = P_uses_health_insurance * P_uses_health_insurance_given_smoker / P_uses_health_insurance

# Calculate P(Smoker | Uses Health Insurance) using Bayes' theorem
P_smoker_given_uses_health_insurance = (P_uses_health_insurance_given_smoker * P_smoker) / P_uses_health_insurance
print(f"The probability that an employee is a smoker given that they use the health insurance plan is: {P_smoker_given_uses_health_insurance:.2f}")

The probability that an employee is a smoker given that they use the health insurance plan is: 0.23


#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the types of data they are suited for and how they model feature probabilities:
* **Bernoulli Naive Bayes** is typically used for binary feature data, where each feature can take one of two values (e.g., 0 or 1, True or False). It models the presence or absence of features and calculates probabilities based on binary counts.
* **Multinomial Naive Bayes**, on the other hand, is suitable for discrete data where features represent counts or frequencies of events. It is commonly used in text classification tasks, where the features are often word counts or term frequencies.

#### Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes typically assumes that missing values are informative, meaning that their absence conveys information about the data. In practice, this can lead to issues, especially if missing values are not informative but simply the result of data collection or recording errors.

Handling missing values in Bernoulli Naive Bayes can be challenging. Here are a few common approaches:
* **Imputation:** We can impute missing values with a specific value (e.g., 0 or 1) to indicate their presence or absence. However, this approach may introduce bias and may not be appropriate if missingness is not informative.
* **Ignoring Missing Data:** We can choose to ignore instances with missing values during training and classification. This approach may work if the proportion of missing data is small and doesn't significantly impact the analysis.
* **Model-Based Imputation:** Use more advanced techniques like logistic regression or other models to predict missing values based on available data. This can be useful when missingness is related to other features.

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is an extension of the Naive Bayes algorithm that assumes that continuous features follow a Gaussian (normal) distribution. It can handle multi-class classification problems by modeling the distribution of each feature for each class and then making predictions based on these distributions.

In a multi-class setting, the algorithm calculates the probability of each class for a given set of feature values and assigns the instance to the class with the highest probability. This makes Gaussian Naive Bayes suitable for problems where the features are continuous and the classes are discrete and can be extended to handle more than two classes.

Q5. Assignment:
* **Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

* **Implementation:**
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
* **Results:** Report the following performance metrics for each classifier:
    * Accuracy
    * Precision
    * Recall
    * F1 score
    
* **Discussion:**
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
* **Conclusion:**
Summarise your findings and provide some suggestions for future work.

**Note:** *This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.*

In [2]:
import numpy as ny
import pandas as pd
from sklearn.model_selection import cross_val_score as cv
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Loading Dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
col = [f'word_{i}' for i in range(48)] + \
        ['char_exclamation', 'char_dollar', 'char_parenthesis', 'char_bracket', 'capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'is_spam']
ds = pd.read_csv(url, header=None, names=col)
print(ds.info())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4601 entries, (0.0, 0.64) to (0.0, 0.0)
Data columns (total 56 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_0                      4601 non-null   float64
 1   word_1                      4601 non-null   float64
 2   word_2                      4601 non-null   float64
 3   word_3                      4601 non-null   float64
 4   word_4                      4601 non-null   float64
 5   word_5                      4601 non-null   float64
 6   word_6                      4601 non-null   float64
 7   word_7                      4601 non-null   float64
 8   word_8                      4601 non-null   float64
 9   word_9                      4601 non-null   float64
 10  word_10                     4601 non-null   float64
 11  word_11                     4601 non-null   float64
 12  word_12                     4601 non-null   float64
 13  word_13          

No Missing Value

In [3]:
# Spliting data
x = ds.drop('is_spam',axis=1)
y = ds['is_spam']

# Implementing various types of Bayes classifiers
models={"Bernoulli Naive Bayes":BernoulliNB(),"Multinomial Naive Bayes": MultinomialNB(),"Gaussian Naive Bayes": GaussianNB()}

# Useing 10-fold cross-validation to evaluate the performance of each classifier
results = {}
for mn, m in models.items():
    scores = cv(m,x,y,cv=10,scoring="accuracy")
    results[mn] = scores
    
# Report performance metrics
for mn, scores in results.items():
    print(f"Performance metrics for {mn}")
    print(f" Accuracy: {ny.mean(scores):.2f}")
    print(f" Precision: {ny.mean(cv(m, x, y, cv=10, scoring='precision')):.2f}")
    print(f" Recall: {ny.mean(cv(m, x, y, cv=10, scoring='recall')):.2f}")
    print(f" F1 Score: {ny.mean(cv(m, x, y, cv=10, scoring='f1')):.2f}\n")

Performance metrics for Bernoulli Naive Bayes
 Accuracy: 0.89
 Precision: 0.70
 Recall: 0.96
 F1 Score: 0.81

Performance metrics for Multinomial Naive Bayes
 Accuracy: 0.78
 Precision: 0.70
 Recall: 0.96
 F1 Score: 0.81

Performance metrics for Gaussian Naive Bayes
 Accuracy: 0.82
 Precision: 0.70
 Recall: 0.96
 F1 Score: 0.81



* **Discussion:**
    * **Performance Comparison:** All three variants of Naive Bayes (Bernoulli, Multinomial, and Gaussian) have similar performance in terms of accuracy, precision, recall, and F1 score. They achieve high recall (ability to identify spam messages) but moderate precision (proportion of predicted spam that is actually spam), which indicates that they tend to classify some non-spam messages as spam.
    * **Precision vs. Recall Trade-off:** The models have relatively low precision, possibly due to the conservative nature of classifying messages as spam to avoid missing any actual spam (high recall). This trade-off between precision and recall is a common challenge in spam classification.
    * **Limitations of Naive Bayes:** Naive Bayes assumes independence between features, which may not hold for text data like email messages. This assumption can limit its ability to capture complex dependencies between words in messages.
* **Conclusion:**
    * All three Naive Bayes variants (Bernoulli, Multinomial, Gaussian) achieved similar performance on the "Spambase" dataset, with high recall but lower precision.
    * The choice of the best Naive Bayes variant depends on the specific goals and requirements of the spam classification task. If avoiding false negatives (missing spam) is critical, then high recall is preferred, as seen in these models. However, if minimizing false positives (classifying non-spam as spam) is more important, then precision could be improved with further tuning or by considering other algorithms.

Overall, while Naive Bayes can be a good starting point for spam classification, it may require further refinement and exploration of alternative approaches to achieve better precision without sacrificing recall.