# Naïve Bayes-2

#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

#### Ans.

P(H) = 0.7, since 70% of the employees use the health insurance plan.

P(S|H) = 0.4, since 40% of the employees who use the health insurance plan are smokers.

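P(S|H) is exactly the quantity the question asks for, so Bayes' theorem is not needed and no assumption about the overall share of smokers, P(S), is required.

An assumed value such as P(S) = 0.15 would in fact be inconsistent with the given figures: the joint probability P(S ∩ H) = P(S|H) × P(H) = 0.4 × 0.7 = 0.28 already exceeds 0.15. Plugging it into Bayes' theorem for P(H|S) gives 0.28 / 0.15 ≈ 1.87, which is not a valid probability because it is greater than 1.

Hence the probability that an employee is a smoker given that he/she uses the health insurance plan is P(S|H) = 0.4, i.e. 40%.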
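
The relationship between the given figures can be checked numerically. The sketch below uses only the two numbers stated in the question; the variable names are illustrative. Since P(S|H) is supplied directly, the check simply confirms that the joint probability P(S ∩ H) = P(S|H) × P(H), divided by P(H), returns 0.4, and that the result is a valid probability in [0, 1].

```python
# Figures given in Q1
p_h = 0.7          # P(H): employee uses the health insurance plan
p_s_given_h = 0.4  # P(S|H): employee smokes, given that they use the plan

# Joint probability: P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Recover the conditional from its definition: P(S|H) = P(S and H) / P(H)
p_s_given_h_check = p_s_and_h / p_h

# Any valid probability must lie in [0, 1]
assert 0.0 <= p_s_given_h_check <= 1.0

print("P(S and H) =", round(p_s_and_h, 2))
print("P(S|H)     =", round(p_s_given_h_check, 2))
```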

#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

#### Ans.
Multinomial Naive Bayes works on a feature vector in which each entry is the number of times a term occurs, i.e. its frequency. Bernoulli Naive Bayes, by contrast, works on binary features that record only whether a term is present or absent.

- Bernoulli Naive Bayes:
    - Assumes binary input data
    - Represents each document as a binary vector
    - Calculates likelihood probabilities based on presence/absence of features
    - Suitable for problems focused on the presence or absence of features
- Multinomial Naive Bayes:
    - Assumes count input data
    - Represents each document as a count vector
    - Calculates likelihood probabilities based on frequency of features
    - Suitable for problems focused on the frequency of features
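
The contrast can be seen on a toy example. The following is a minimal sketch, assuming a made-up four-document corpus and scikit-learn's `CountVectorizer`: with `binary=True` every count is capped at 1 (the Bernoulli view), while the default keeps raw term counts (the Multinomial view). The corpus, labels, and variable names are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus (made up for illustration); 1 = spam, 0 = not spam
docs = [
    "free free free prize",
    "free prize now",
    "meeting agenda attached",
    "agenda for the meeting meeting",
]
labels = np.array([1, 1, 0, 0])

# Multinomial view: each entry is a term count
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# Bernoulli view: each entry is 1 if the term occurs at all, else 0
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)

mnb = MultinomialNB().fit(X_counts, labels)
bnb = BernoulliNB().fit(X_binary, labels)

new_doc = ["free prize prize prize"]
mnb_pred = mnb.predict(count_vec.transform(new_doc))
bnb_pred = bnb.predict(binary_vec.transform(new_doc))
print("Largest count (Multinomial input):", X_counts.max())
print("Largest value (Bernoulli input):  ", X_binary.max())
print("Multinomial prediction:", mnb_pred, "Bernoulli prediction:", bnb_pred)
```

Repeating "prize" three times changes the Multinomial input but not the Bernoulli input, which only records that the word occurred.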

#### Q3. How does Bernoulli Naive Bayes handle missing values?

#### Ans. 
Bernoulli Naive Bayes is designed for binary data, where each feature takes one of two values: 0 (absent) or 1 (present). The model has no built-in mechanism for missing values, so they have to be dealt with before training. There are a few approaches you can consider:

1. Ignore Missing Values: One approach is to simply ignore instances with missing values. This means that if a particular instance has a missing value for a specific feature, you exclude that instance from the analysis involving that feature. However, this approach could lead to a loss of information and might not be ideal, especially if missing values are common.

2. Imputation: Another approach is to impute missing values with a default value before applying Bernoulli Naive Bayes. This might involve filling in the missing value with either 0 (absent) or 1 (present) based on some reasonable assumption or rule. However, this could introduce bias into your data, especially if the missing values are not truly indicative of the absence or presence of the feature.

3. Treat Missing as a Separate Category: Instead of ignoring or imputing missing values, you could encode missingness explicitly. Because each Bernoulli feature can only be 0 or 1, this is done by adding an extra binary indicator feature that is 1 when the original value was missing and 0 otherwise, while the original feature is filled with a default value. This lets the model learn whether the fact that a value is missing is itself informative for the class.

4. Use Advanced Techniques: Depending on the complexity of your data and the problem you're working on, you might consider more advanced techniques for handling missing values, such as using probabilistic models or machine learning algorithms to predict missing values based on other features.
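
Approaches 2 and 3 can be combined in a few lines. The following is a minimal sketch, assuming made-up binary data with `NaN` marking missing entries: scikit-learn's `SimpleImputer` fills each column with its most frequent value and, with `add_indicator=True`, appends a 0/1 "was missing" column for every feature that had gaps. scikit-learn's `BernoulliNB` typically rejects `NaN` inputs, so the imputer is placed in front of it in a pipeline. The data and variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary data with missing entries (values are made up for illustration)
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 1],
    [0, 1, 0],
    [0, 1, np.nan],
    [1, 0, 1],
    [0, 1, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Impute with the most frequent value and append missing-value indicators
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_filled = imputer.fit_transform(X)
print("Shape after imputation + indicators:", X_filled.shape)

# Same preprocessing inside a pipeline so it is applied at predict time too
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)
pred = model.predict([[1, np.nan, np.nan]])
print("Prediction for a row with two missing values:", pred)
```

Two of the three columns contain missing values, so the transformed matrix has the three original columns plus two indicator columns.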

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

#### Ans.
Yes, Gaussian Naive Bayes can indeed be used for multi-class classification tasks. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes that the features in your dataset follow a Gaussian (normal) distribution. It's particularly well-suited for continuous data where the values of features are real numbers.

In the context of multi-class classification, Gaussian Naive Bayes extends naturally to handle multiple classes. The algorithm calculates the probabilities for each class based on the Gaussian distribution of feature values for each class. When you have more than two classes, the algorithm computes the probabilities for each class independently and then assigns the class with the highest probability as the predicted class.

Here's how Gaussian Naive Bayes can be used for multi-class classification:

1. Training:

    - For each class, calculate the mean and standard deviation of each feature's values among the instances belonging to that class.
    - You'll also need the prior probabilities of each class, which can be calculated based on the proportions of instances in each class.
2. Prediction:

    - Given a new instance with feature values, calculate the likelihood of each feature value under each class using the Gaussian probability density function with that class's mean and standard deviation.
    - For each class, multiply the class prior by the likelihoods of all features to get a score proportional to the probability of the instance belonging to that class (in practice, logarithms are summed to avoid numerical underflow).
    - The class with the highest score is predicted as the class for the new instance.
    
Gaussian Naive Bayes is commonly used when the distribution of feature values for each class is approximately Gaussian, or at least can be reasonably approximated as such. However, keep in mind that the assumption of independence between features might not hold true in all cases, and the algorithm might not perform well if the data violates these assumptions significantly.
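
The training and prediction steps can be written out by hand and compared against scikit-learn. The following is a minimal sketch, assuming the three-class Iris dataset bundled with scikit-learn as example data. The small variance floor mimics the idea behind `GaussianNB`'s `var_smoothing`; the exact constant used here is an assumption, and the helper name `log_posterior` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 3 classes, 4 continuous features
classes = np.unique(y)

# Training: per-class priors, feature means and feature variances
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
variances = variances + 1e-9 * X.var(axis=0).max()  # assumed stability floor


def log_posterior(x):
    """Unnormalized log posterior of one sample, one value per class."""
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances,
        axis=1,
    )
    return np.log(priors) + log_likelihood


# Prediction: pick the class with the highest log posterior
manual_pred = classes[np.array([np.argmax(log_posterior(x)) for x in X])]
sk_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)
print("Number of classes:", len(classes))
print("Agreement with GaussianNB:", agreement)
```

Nothing in the procedure depends on the number of classes: the same prior-times-likelihood score is computed once per class and the largest one wins.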

#### Q5. Assignment:
- Data preparation:
    - Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

- Implementation:
    - Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.
- Results:
    - Report the following performance metrics for each classifier:
        - Accuracy
        - Precision
        - Recall
        - F1 score
- Discussion:
    - Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?
- Conclusion:
    - Summarise your findings and provide some suggestions for future work.

In [7]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

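# Load data. Note: this cell reads diabetes.csv (the Pima diabetes data, judging by
# the columns shown in the output), not the Spambase file named in the assignment.
data = pd.read_csv('diabetes.csv')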
data.head()

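|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |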


In [15]:
# Separate features and labels
X = data.drop(columns=['Outcome'])
y = data['Outcome']
X.head(2)

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |


In [16]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [17]:
# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [18]:
bnb = BernoulliNB()
bnb_scores = cross_val_score(bnb, X, y, cv=10)

In [36]:
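# MultinomialNB requires non-negative inputs, and the standardized X contains
# negative values, so the features are binarized with X > 0 before scoring
mnb = MultinomialNB(force_alpha=True)
mnb_scores = cross_val_score(mnb, X > 0, y, cv=10)
mnb_scores.mean()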

In [37]:
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X, y, cv=10)

In [None]:
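from sklearn.model_selection import cross_validate

# Mean 10-fold scores for one model (Gaussian NB shown here)
cv = cross_validate(gnb, X, y, cv=10, scoring=['accuracy', 'precision', 'recall', 'f1'])
print('Accuracy:', cv['test_accuracy'].mean())
print('Precision:', cv['test_precision'].mean())
print('Recall:', cv['test_recall'].mean())
print('F1 score:', cv['test_f1'].mean())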

|   | models | cross_val_score_mean |
|---|---|---|
| 0 | Gaussian | 0.805719 |
| 1 | Bernoulli | 0.849218 |
| 2 | Multinomial | 0.720911 |
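
The assignment asks for accuracy, precision, recall, and F1 for each classifier under 10-fold cross-validation. The following is a minimal sketch of one way to collect all four metrics in a single pass with `cross_validate`. It assumes synthetic stand-in data from `make_classification` (not the Spambase or diabetes data), so its scores are not comparable to the table above. Placing the scalers inside pipelines is a design choice made here so that each fold is scaled using only its own training part; `MinMaxScaler(clip=True)` keeps inputs non-negative for `MultinomialNB`. All classifiers use default hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (an assumption; swap in the real features and labels)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

models = {
    # BernoulliNB binarizes at 0 by default, i.e. above/below the mean here
    "Bernoulli": make_pipeline(StandardScaler(), BernoulliNB()),
    # MultinomialNB rejects negative inputs, so features are scaled to [0, 1]
    "Multinomial": make_pipeline(MinMaxScaler(clip=True), MultinomialNB()),
    "Gaussian": make_pipeline(StandardScaler(), GaussianNB()),
}
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=10, scoring=scoring)
    # Average each metric over the 10 folds
    results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```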