In [1]:
# Q1. A company conducted a survey of its employees and found that 70% of the employees use the
# company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
# probability that an employee is a smoker given that he/she uses the health insurance plan?

70% of the employees use the company's health insurance plan.
40% of the employees who use the plan are smokers.

P(S): probability that an employee is a smoker.
P(I): probability that an employee uses the health insurance plan.

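P(I) = 0.7
P(S|I) = 0.4 (probability that an employee is a smoker given that they use the health insurance plan)

The question states this conditional probability directly, so no further calculation is needed: P(S|I) = 0.4, i.e. 40%. The value P(I) = 0.7 does not enter the answer.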
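
As a quick numerical check, the sketch below restates the two given quantities and also derives the joint probability P(S ∩ I) = P(S|I) · P(I). The question does not ask for the joint probability; it is included only to show that dividing it by P(I) recovers the same 0.4. The variable names are illustrative.

```python
# Given quantities from the question
p_insured = 0.70                # P(I): employee uses the health insurance plan
p_smoker_given_insured = 0.40   # P(S|I): smoker, given that the employee uses the plan

# Joint probability that an employee uses the plan AND is a smoker
p_smoker_and_insured = p_smoker_given_insured * p_insured

# Recover the conditional probability from its definition: P(S|I) = P(S and I) / P(I)
p_check = p_smoker_and_insured / p_insured

print(f"P(S and I) = {p_smoker_and_insured:.2f}")
print(f"P(S|I)     = {p_check:.2f}")
```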

In [2]:
# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm, each designed to handle different types of data inputs. Here's a comparison of the two:

### Bernoulli Naive Bayes:

1. **Data Type**: It is specifically used for binary/boolean features. That is, it's suitable for data where features are either present or absent (typically represented as 1s and 0s).

2. **Modeling Technique**: In Bernoulli Naive Bayes, the feature vectors represent the presence or absence of features. For example, in text classification, it might represent whether a specific word appears in a document or not, regardless of the number of times it appears.

3. **Use Cases**: It's often used in text classification where the "bag of words" model is binary. For instance, classifying emails as spam or not spam based on the presence or absence of certain keywords.

4. **Probability Estimation**: The probability of a feature being present or absent in a particular class is estimated and used for prediction.

### Multinomial Naive Bayes:

1. **Data Type**: It is designed for features that represent counts or frequency counts. The features are typically the counts of events or occurrences.

2. **Modeling Technique**: In Multinomial Naive Bayes, the algorithm counts the occurrence of each feature (e.g., word in text classification) and uses these counts to make predictions.

3. **Use Cases**: Commonly used in text classification (e.g., categorizing documents into different topics) where the frequency of words is important. The model is defined for integer counts, although fractional values such as TF-IDF weights often work in practice.

4. **Probability Estimation**: The probability of observing a certain count of a feature in a particular class is estimated.

### Key Differences:

- **Feature Representation**: Bernoulli Naive Bayes is binary and focuses on the presence/absence of features, while Multinomial Naive Bayes deals with the frequency/count of features.

- **Applicability**: Bernoulli is more suited for binary feature models, while Multinomial is better for features that can take on more than two values (like word counts).

- **Use Case Scenarios**: While both are used in text classification, their applicability depends on how the text data is vectorized. Bernoulli is used when only the appearance of words matters, whereas Multinomial is used when the frequency of words is important. 

Choosing between these two depends on the nature of your data and the specific requirements of your application.
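
To make the difference concrete, here is a minimal pure-Python sketch of the two class-conditional likelihoods for a single class. The toy vocabulary, the training documents, and the Laplace smoothing value `alpha=1.0` are all assumptions chosen for illustration; this is not how any particular library is implemented internally.

```python
import math

vocab = ["free", "money", "meeting"]

# Hypothetical training documents for one class ("spam"), as token lists
spam_docs = [
    ["free", "free", "money"],
    ["free", "money", "money"],
    ["meeting", "free"],
]

def bernoulli_log_likelihood(doc, docs, vocab, alpha=1.0):
    """log P(doc | class) using only presence/absence of each vocabulary word."""
    n_docs = len(docs)
    present = set(doc)
    total = 0.0
    for word in vocab:
        # Smoothed fraction of class documents that contain the word
        doc_freq = sum(1 for d in docs if word in d)
        p = (doc_freq + alpha) / (n_docs + 2 * alpha)
        # Absent words contribute as well, through (1 - p)
        total += math.log(p) if word in present else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(doc, docs, vocab, alpha=1.0):
    """log P(doc | class) using word counts (multinomial coefficient omitted)."""
    counts = {w: sum(d.count(w) for d in docs) for w in vocab}
    total_count = sum(counts.values())
    total = 0.0
    for word in doc:
        # Each occurrence of a word adds its own log-probability term
        p = (counts[word] + alpha) / (total_count + alpha * len(vocab))
        total += math.log(p)
    return total

once = ["free", "money"]
thrice = ["free", "free", "free", "money"]

b_once = bernoulli_log_likelihood(once, spam_docs, vocab)
b_thrice = bernoulli_log_likelihood(thrice, spam_docs, vocab)
m_once = multinomial_log_likelihood(once, spam_docs, vocab)
m_thrice = multinomial_log_likelihood(thrice, spam_docs, vocab)

print("Bernoulli:  ", b_once, b_thrice)
print("Multinomial:", m_once, m_thrice)
```

Repeating "free" leaves the Bernoulli score unchanged, because only the set of words present matters, while the multinomial score changes with every extra occurrence.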

In [3]:
# Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes, like other Naive Bayes variants, is fundamentally a probabilistic model that calculates the likelihood of different classes based on the presence or absence of features. How it handles missing values in the data depends largely on the implementation and data preprocessing steps. However, there are some general approaches:

1. **Treating Missing Values as Absent**:
   - Bernoulli Naive Bayes models each feature as present (1) or absent (0), so a missing value can be encoded as 0 during preprocessing. This is an assumption, not a neutral choice: the model uses absence as evidence, so a missing value encoded as 0 influences the class probabilities. Note that scikit-learn's `BernoulliNB` does not accept NaN inputs, so this encoding has to happen before the data reaches the model.

2. **Data Imputation**:
   - Before feeding data into the model, missing values can be imputed during the preprocessing stage. Common imputation strategies include:
     - Replacing missing values with the most frequent value in a column.
     - Using a placeholder value that represents "unknown" or "missing".
     - Applying more sophisticated imputation methods based on other features in the dataset.
   - However, imputation should be done carefully, considering the nature of the data and the implications of the chosen imputation method on the model's performance.

3. **Custom Handling in Implementation**:
   - Because Naive Bayes multiplies independent per-feature likelihoods, a custom implementation can simply leave the missing features of a sample out of the product, so that only the observed features contribute to the class score.

4. **Modeling Missing Values as a Separate Category**:
   - Missingness can be modeled explicitly, for example by adding a binary "was missing" indicator feature for each affected column. This treats the fact that a value is missing as informative in itself.

### Important Considerations:

- **Impact on Model Performance**: How missing values are handled can significantly impact the model's performance. Careful validation and testing are necessary to understand the impact.

- **Data Nature and Context**: The approach for handling missing values should align with the nature of the data and the specific problem context.

In practice, treating a missing binary feature as absent is a natural fit for Bernoulli Naive Bayes when missingness really does mean that the feature did not occur. When missing values carry information of their own, or the dataset is more complex, explicit imputation or a missingness indicator is the safer choice.
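
The imputation and "separate category" approaches can be sketched with scikit-learn as follows. The small feature matrix is hypothetical, and the choice of `SimpleImputer(strategy="most_frequent", add_indicator=True)` is one possible preprocessing step, not the only reasonable one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary feature matrix with missing entries (np.nan)
X_missing = np.array([
    [1.0, 0.0, np.nan],
    [0.0, 1.0, 1.0],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [np.nan, 1.0, 1.0],
])
y_missing = np.array([1, 0, 1, 0, 1, 0])

# Fitting directly on data that contains NaN is rejected by input validation
try:
    BernoulliNB().fit(X_missing, y_missing)
    raised = False
except ValueError:
    raised = True

# Impute each column with its most frequent value; add_indicator=True appends
# one binary "was missing" column per feature that had missing values
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X_missing, y_missing)
predictions = model.predict(X_missing)

print("NaN rejected without imputation:", raised)
print("Predictions after imputation:   ", predictions)
```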

In [1]:
# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification problems. The Gaussian Naive Bayes classifier is particularly well-suited for continuous input features and is based on the assumption that the continuous values associated with each class are distributed according to a Gaussian (normal) distribution.

### Multi-Class Classification with Gaussian Naive Bayes:

1. **Basic Principle**:
   - In a multi-class setting, the Gaussian Naive Bayes model calculates the probability of each class given an input feature vector and then predicts the class with the highest probability.
   - For each class, it assumes that the features follow a Gaussian distribution, and it calculates the mean and variance of the features in each class during the training process.

2. **Probability Calculation**:
   - For a given input feature vector, the model calculates the probability of that vector belonging to each class, based on the Gaussian distribution of the features in that class.
   - It applies Bayes' theorem to compute the posterior probability for each class, and then it selects the class with the highest posterior probability.

3. **Applicability**:
   - Gaussian Naive Bayes is used in a variety of multi-class classification scenarios where the features are continuous measurements, such as medical diagnosis. Text classification, by contrast, usually relies on the Multinomial or Bernoulli variants.

4. **Advantages**:
   - Gaussian Naive Bayes is easy to implement and fast, making it suitable for large datasets.
   - It performs well in multi-class classification even with the assumption of feature independence, which is the "naive" part of Naive Bayes.

5. **Limitations**:
   - The assumption that features are normally distributed might not always hold true, which can affect the model's performance.
   - The independence assumption between features is often violated in real-world data, but in practice, Naive Bayes classifiers can still perform well even when this independence assumption is not strictly true.

### Conclusion:

Gaussian Naive Bayes is a versatile algorithm that can efficiently handle multi-class classification problems, especially when the input features are continuous and can be reasonably approximated by a Gaussian distribution. As with any model, its effectiveness can depend on the specific characteristics of the dataset and the problem at hand.
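
A minimal sketch of multi-class prediction with scikit-learn's `GaussianNB` is shown below. The Iris dataset (three classes, four continuous features) is an assumption chosen because it ships with scikit-learn; it is not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 continuous features, 3 classes (bundled with scikit-learn)
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# Per-class feature means learned during training: shape (n_classes, n_features)
print(gnb.theta_.shape)

# Posterior probabilities for the first sample, one value per class
proba = gnb.predict_proba(X_iris[:1])
print(proba.round(3))

# The predicted label is the class with the highest posterior probability
predicted = gnb.predict(X_iris[:1])
print(predicted)
```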

In [3]:
# Data preparation:
# Download the "Spambase Data Set" from the UCI Machine Learning Repository
# (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal
# is to predict whether a message is spam or not based on several input features.

# Implementation:
# Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
# scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
# dataset. You should use the default hyperparameters for each classifier.

# Results:
# Report the following performance metrics for each classifier:
# - Accuracy
# - Precision
# - Recall
# - F1 score

# Discussion:
# Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
# the case? Are there any limitations of Naive Bayes that you observed?

# Conclusion:
# Summarise your findings and provide some suggestions for future work.

In [16]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

In [4]:
df=pd.read_csv('spambase.data',header=None)

In [5]:
df.columns=['word_freq_make',
'word_freq_address',
'word_freq_all',
'word_freq_3d',
'word_freq_our',
'word_freq_over',
'word_freq_remove',
'word_freq_internet',
'word_freq_order',
'word_freq_mail',
'word_freq_receive',
'word_freq_will',
'word_freq_people',
'word_freq_report',
'word_freq_addresses',
'word_freq_free',
'word_freq_business',
'word_freq_email',
'word_freq_you',
'word_freq_credit',
'word_freq_your',
'word_freq_font',
'word_freq_000',
'word_freq_money',
'word_freq_hp',
'word_freq_hpl',
'word_freq_george',
'word_freq_650',
'word_freq_lab',
'word_freq_labs',
'word_freq_telnet',
'word_freq_857',
'word_freq_data',
'word_freq_415',
'word_freq_85',
'word_freq_technology',
'word_freq_1999',
'word_freq_parts',
'word_freq_pm',
'word_freq_direct',
'word_freq_cs',
'word_freq_meeting',
'word_freq_original',
'word_freq_project',
'word_freq_re',
'word_freq_edu',
'word_freq_table',
'word_freq_conference',
'char_freq_;',
'char_freq_(',
'char_freq_[',
'char_freq_!',
'char_freq_$',
'char_freq_#',
'capital_run_length_average',
'capital_run_length_longest',
'capital_run_length_total',
'spam'
]

In [6]:
df.head()

,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [7]:
df.isnull().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [8]:
X=df.drop('spam',axis=1)

In [9]:
y=df['spam']

In [10]:
classifier={'BernoulliNB':BernoulliNB(),
           'MultinomialNB':MultinomialNB(),
           'GaussianNB':GaussianNB()}

cv=StratifiedKFold(n_splits=10)

scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score)
}
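
The Spambase features are continuous frequencies, not 0/1 indicators. With default hyperparameters, `BernoulliNB` thresholds its inputs at `binarize=0.0`, so each feature is reduced to "occurs at all" versus "does not occur". The sketch below illustrates this on synthetic data (an assumption made for the example, not the Spambase file): fitting on the raw values gives the same predictions as fitting on an explicitly binarized copy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic non-negative "frequency" features with many exact zeros (illustrative only)
X_demo = rng.random((200, 5)) * (rng.random((200, 5)) > 0.5)
y_demo = (X_demo[:, 0] > 0).astype(int)

# Default BernoulliNB binarizes features internally at threshold 0.0
raw_model = BernoulliNB().fit(X_demo, y_demo)

# Equivalent: binarize explicitly, then switch off the internal thresholding
X_binary = (X_demo > 0).astype(float)
binary_model = BernoulliNB(binarize=None).fit(X_binary, y_demo)

same_predictions = np.array_equal(
    raw_model.predict(X_demo), binary_model.predict(X_binary)
)
print(same_predictions)
```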

In [17]:
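from sklearn.model_selection import cross_validate

results = {}
for name, clf in classifier.items():
    # One cross-validation pass per classifier scores all metrics on the same folds
    cv_scores = cross_validate(clf, X, y, scoring=scoring_metrics, cv=cv)
    # cross_validate prefixes each metric name with "test_"
    results[name] = {
        metric: cv_scores[f"test_{metric}"].mean() for metric in scoring_metrics
    }

results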

{'BernoulliNB': {'accuracy': 0.8839380364047911,
  'precision': 0.8869617393737383,
  'recall': 0.8152389047416673,
  'f1_score': 0.8481249015095276},
 'MultinomialNB': {'accuracy': 0.7863496180326323,
  'precision': 0.7393175533565436,
  'recall': 0.7214983911116508,
  'f1_score': 0.7282909724016348},
 'GaussianNB': {'accuracy': 0.8217730830896915,
  'precision': 0.7103733928118492,
  'recall': 0.9569516119239877,
  'f1_score': 0.8130660909542995}}
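
To compare the classifiers at a glance, the scores can be tabulated and the best classifier per metric picked out. The sketch below copies the mean scores from the output above, rounded to four decimals; the name `results_rounded` is introduced here only for illustration.

```python
# Mean 10-fold scores copied from the output above, rounded to four decimals
results_rounded = {
    "BernoulliNB":   {"accuracy": 0.8839, "precision": 0.8870, "recall": 0.8152, "f1_score": 0.8481},
    "MultinomialNB": {"accuracy": 0.7863, "precision": 0.7393, "recall": 0.7215, "f1_score": 0.7283},
    "GaussianNB":    {"accuracy": 0.8218, "precision": 0.7104, "recall": 0.9570, "f1_score": 0.8131},
}

metrics = ["accuracy", "precision", "recall", "f1_score"]

# Best classifier for each metric
best = {
    m: max(results_rounded, key=lambda name: results_rounded[name][m])
    for m in metrics
}

# Plain-text comparison table
print(f"{'classifier':<15}" + "".join(f"{m:>11}" for m in metrics))
for name, scores in results_rounded.items():
    print(f"{name:<15}" + "".join(f"{scores[m]:>11.4f}" for m in metrics))

print(best)
```

On these numbers, BernoulliNB leads on accuracy, precision, and F1 score, while GaussianNB has the highest recall together with the lowest precision.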