# Q1. A company conducted a survey of its employees and found that 70% of the employees use the
# company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
# probability that an employee is a smoker given that he/she uses the health insurance plan?

+ This problem requires the use of conditional probability. We want to find the probability that an employee is a smoker given that he/she uses the health insurance plan.

+ Let's define:

+ A = the event that an employee uses the health insurance plan
+ B = the event that an employee is a smoker

We are given P(A) = 0.70 (70% of the employees use the health insurance plan) and P(B|A) = 0.40 (40% of the employees who use the plan are smokers).

+ The quantity we want, P(B|A), is exactly the second figure given in the question: "40% of the employees who use the plan are smokers" is a statement about smokers among plan users, so it is already the conditional probability.

P(B|A) = 0.40

+ Bayes' theorem, P(B|A) = P(A|B) * P(B) / P(A), is not needed here. It would require P(B) and P(A|B), neither of which is given in the question, and it is only useful when the conditional probability is known in the opposite direction.

+ What the two given figures do let us compute is the joint probability that an employee both uses the plan and is a smoker:

P(A and B) = P(B|A) * P(A)
P(A and B) = 0.40 * 0.70
P(A and B) = 0.28

+ Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
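
+ As a sanity check, the conditional probability can be read off by counting. The sketch below assumes a hypothetical head-count of 1,000 employees (not a figure from the question) and applies the two given percentages; the 40% figure is a share of plan users, so the ratio is taken over plan users only.

```python
# Hypothetical head-count, chosen only to make the percentages concrete.
total_employees = 1000

# P(A) = 0.70: employees who use the health insurance plan.
plan_users = round(0.70 * total_employees)

# P(B|A) = 0.40: smokers among the plan users.
smokers_on_plan = round(0.40 * plan_users)

# Conditional probability: restrict attention to plan users only.
p_smoker_given_plan = smokers_on_plan / plan_users

# Joint probability of using the plan AND smoking, for comparison.
p_plan_and_smoker = smokers_on_plan / total_employees

print(f"P(B|A)     = {p_smoker_given_plan:.2f}")
print(f"P(A and B) = {p_plan_and_smoker:.2f}")
```

The first ratio is taken over plan users and the second over all employees, which is the whole difference between a conditional and a joint probability.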

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

+ Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm, a probabilistic algorithm used for classification tasks. The main difference between them lies in the type of input features they expect and the distribution they use to model those features.

+ Bernoulli Naive Bayes is used when the input features are binary variables (0 or 1); the target itself may have two or more classes. It assumes that each feature is independent of all other features given the class variable, so the probability of each feature being present in a class is estimated separately. The presence of a feature is represented as 1 and its absence as 0, and the absence of a feature counts as evidence just as its presence does. It is commonly used in text classification tasks, such as spam detection, where each feature represents the presence or absence of a certain word in the text.

+ Multinomial Naive Bayes, on the other hand, is used for classification problems where the input features represent counts or frequencies of events. It is commonly used in text classification tasks, where the input features represent how often each word occurs in the text. For each class, the feature vector as a whole is modeled with a multinomial distribution over the features, so a word that occurs five times contributes five times the evidence of a word that occurs once. As in the Bernoulli variant, the features are assumed to be conditionally independent given the class variable.

+ In summary, Bernoulli Naive Bayes is used for binary input features, while Multinomial Naive Bayes is used for count or frequency-based input features. The choice of algorithm depends on the nature of the input features and the problem domain.
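
+ The difference in how the two variants use a feature can be sketched with a from-scratch likelihood calculation. The per-class parameters below (`presence_probs`, `word_probs`) and the two toy documents are hypothetical values chosen for illustration; the multinomial coefficient is dropped because it is the same for every class.

```python
import math

def bernoulli_log_likelihood(counts, presence_probs):
    """Log-likelihood when only presence/absence of each word matters."""
    total = 0.0
    for count, p in zip(counts, presence_probs):
        # A word that is absent still contributes evidence: log(1 - p).
        total += math.log(p) if count > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, word_probs):
    """Log-likelihood when every occurrence of a word counts."""
    return sum(count * math.log(theta) for count, theta in zip(counts, word_probs))

# Hypothetical parameters for a single class over a three-word vocabulary.
presence_probs = [0.8, 0.5, 0.1]   # P(word appears at least once | class)
word_probs = [0.6, 0.3, 0.1]       # P(next token is this word | class), sums to 1

doc_once = [1, 0, 0]   # first word appears once
doc_many = [5, 0, 0]   # first word appears five times

bernoulli_once = bernoulli_log_likelihood(doc_once, presence_probs)
bernoulli_many = bernoulli_log_likelihood(doc_many, presence_probs)
multinomial_once = multinomial_log_likelihood(doc_once, word_probs)
multinomial_many = multinomial_log_likelihood(doc_many, word_probs)

print("Bernoulli:  ", bernoulli_once, bernoulli_many)
print("Multinomial:", multinomial_once, multinomial_many)
```

The Bernoulli score is identical for both documents because the repeat count is discarded, while the multinomial score changes with every extra occurrence.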

# Q3. How does Bernoulli Naive Bayes handle missing values?

+ Bernoulli Naive Bayes models each feature as a binary variable (0 or 1) that is conditionally independent of the other features given the class variable. The model has no built-in notion of a missing value: a missing feature cannot contribute a likelihood term, and implementations such as scikit-learn's BernoulliNB reject input that contains NaN. Missing values therefore have to be dealt with before training and prediction.

+ To handle missing values in Bernoulli Naive Bayes, there are a few approaches that can be used:

1. Ignore the missing values: If only a small share of the observations contain missing values, one option is to drop those observations and proceed with the complete cases. However, this approach can result in biased estimates if the values are not missing at random.

2. Assign a default value: Another option is to impute the missing values with a default. For binary features, a common choice is 0 (feature absent), which is reasonable when 0 is by far the most frequent value, as with sparse word-presence features. Otherwise the most frequent value of that feature is a safer default.

3. Use a separate category: Another option is to treat missing values as a separate category and add a new binary variable to represent the missing values. This can be useful if the missing values are not missing at random and are informative in some way.

+ Overall, the best approach for handling missing values in Bernoulli Naive Bayes depends on the specific problem and the nature of the missing data. It is important to carefully consider the implications of each approach and evaluate the performance of the algorithm with and without missing values.
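
+ The three approaches can be sketched as small preprocessing helpers applied before the data reaches the classifier. This is a minimal illustration that assumes missing entries are marked with `None`; the function names and the toy rows are hypothetical.

```python
def drop_incomplete(rows):
    """Approach 1: keep only observations with no missing value."""
    return [row for row in rows if None not in row]

def impute_zero(rows):
    """Approach 2: replace every missing value with the default 0."""
    return [[0 if value is None else value for value in row] for row in rows]

def add_missing_indicators(rows):
    """Approach 3: impute 0 and append one extra binary flag per feature
    that is 1 where the original value was missing."""
    result = []
    for row in rows:
        filled = [0 if value is None else value for value in row]
        flags = [1 if value is None else 0 for value in row]
        result.append(filled + flags)
    return result

# Toy binary feature matrix; None marks a missing entry.
rows = [
    [1, None, 0],
    [0, 1, 1],
]

print(drop_incomplete(rows))
print(impute_zero(rows))
print(add_missing_indicators(rows))
```

The indicator variant doubles the number of features, which is the price of letting the classifier learn from the missingness itself.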

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

+ Yes, Gaussian Naive Bayes can be used for multi-class classification, and it needs no modification to do so. For each class, the algorithm estimates a prior probability and models every feature with a Gaussian distribution whose mean and variance are computed from the training examples of that class.

+ During prediction, the algorithm computes a posterior score for every class: the class prior multiplied by the Gaussian likelihood of each feature value under that class. The class with the highest posterior is assigned as the predicted class. Nothing in this procedure depends on the number of classes, so two classes and ten classes are handled in the same way.

+ Strategies such as one-vs-all (OvA) or one-vs-rest (OvR) are meant for classifiers that are inherently binary and are not required here. A softmax step is not needed either, because normalising the per-class posteriors so that they sum to 1 already gives a probability distribution over all classes.

+ Overall, Gaussian Naive Bayes handles multi-class problems directly: it fits one set of Gaussian parameters per class and picks the class with the highest posterior probability.
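
+ A from-scratch sketch of Gaussian Naive Bayes on three classes shows that the per-class posterior is computed the same way regardless of how many classes there are. This is an illustration, not scikit-learn's implementation: the function names and toy data are hypothetical, and the small constant added to each variance is an assumption made for numerical stability.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # stability term
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy data with three well-separated classes and two continuous features.
X = [
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.2],   # class 0
    [5.0, 5.0], [5.2, 4.8], [4.8, 5.2],   # class 1
    [9.0, 1.0], [9.2, 0.8], [8.8, 1.2],   # class 2
]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, row) for row in ([1.1, 2.0], [5.1, 5.0], [9.0, 1.1])]
print(predictions)
```

Adding a fourth class would only add one more entry to `model`; neither function would change.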


In [1]:
# Q5. Assignment:

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Spambase dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
data = pd.read_csv(url, header=None)

# Assign column names to the dataset
data.columns = [
    'word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
    'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet',
    'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will',
    'word_freq_people', 'word_freq_report', 'word_freq_addresses',
    'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you',
    'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
    'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
    'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
    'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
    'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
    'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
    'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu',
    'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(',
    'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#',
    'capital_run_length_average', 'capital_run_length_longest',
    'capital_run_length_total', 'is_spam'
]

# Split the dataset into features and target
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = classifier.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


Accuracy: 0.8208469055374593


+ The code above loads the Spambase dataset from the UCI Machine Learning Repository, assigns column names, splits the data into training and testing sets (80% / 20%), trains a Gaussian Naive Bayes classifier, and evaluates its accuracy on the testing set. The input features describe the email message: the frequency of certain words and characters, and statistics on runs of capital letters. The target variable is binary and indicates whether the email is spam or not.

In [2]:
# Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using 
# the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should 
# use the default hyperparameters for each classifier.

from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score

# Load the Iris dataset
iris = load_iris()

# Split the dataset into features and target
X = iris.data
y = iris.target

# Create Bernoulli Naive Bayes classifier
bnb_clf = BernoulliNB()

# Evaluate the performance using 10-fold cross-validation
bnb_scores = cross_val_score(bnb_clf, X, y, cv=10)

# Create Multinomial Naive Bayes classifier
mnb_clf = MultinomialNB()

# Evaluate the performance using 10-fold cross-validation
mnb_scores = cross_val_score(mnb_clf, X, y, cv=10)

# Create Gaussian Naive Bayes classifier
gnb_clf = GaussianNB()

# Evaluate the performance using 10-fold cross-validation
gnb_scores = cross_val_score(gnb_clf, X, y, cv=10)

# Print the mean and standard deviation of the performance scores
print("Bernoulli Naive Bayes accuracy: %0.2f (+/- %0.2f)" % (bnb_scores.mean(), bnb_scores.std() * 2))
print("Multinomial Naive Bayes accuracy: %0.2f (+/- %0.2f)" % (mnb_scores.mean(), mnb_scores.std() * 2))
print("Gaussian Naive Bayes accuracy: %0.2f (+/- %0.2f)" % (gnb_scores.mean(), gnb_scores.std() * 2))


Bernoulli Naive Bayes accuracy: 0.33 (+/- 0.00)
Multinomial Naive Bayes accuracy: 0.95 (+/- 0.13)
Gaussian Naive Bayes accuracy: 0.95 (+/- 0.09)


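+ The code above loads the Iris dataset, creates three different Naive Bayes classifiers (Bernoulli, Multinomial, and Gaussian), and evaluates each with 10-fold cross-validation, printing the mean accuracy and twice the standard deviation across folds. The Bernoulli and Multinomial classifiers are designed for binary and count data respectively, while the Gaussian classifier is designed for continuous data such as the Iris measurements. In the output above, the Gaussian and Multinomial classifiers both reach a mean accuracy of 0.95, with the Gaussian classifier showing the smaller spread (+/- 0.09 versus +/- 0.13). The Bernoulli classifier scores 0.33, which is chance level for three equally sized classes: with default hyperparameters it binarizes every feature at a threshold of 0, and since all Iris measurements are positive, every feature becomes 1 and carries no information.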
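
+ The chance-level Bernoulli score can be illustrated by applying the binarization step by hand. The sketch below assumes the library default threshold of 0.0 (`binarize=0.0` in scikit-learn's BernoulliNB) and uses one sample per species with the four Iris measurements in centimetres; the alternative threshold of 2.0 is an arbitrary value picked for illustration, not a tuned setting.

```python
def binarize(row, threshold=0.0):
    """Map values above the threshold to 1 and all others to 0."""
    return [1 if value > threshold else 0 for value in row]

# One sample per species: sepal length, sepal width, petal length, petal width.
samples = {
    "setosa": [5.1, 3.5, 1.4, 0.2],
    "versicolor": [7.0, 3.2, 4.7, 1.4],
    "virginica": [6.3, 3.3, 6.0, 2.5],
}

# Default threshold: every measurement is positive, so every row is identical.
default_binary = {name: binarize(row) for name, row in samples.items()}

# Illustrative non-default threshold: the rows become distinguishable.
shifted_binary = {name: binarize(row, threshold=2.0) for name, row in samples.items()}

print(default_binary)
print(shifted_binary)
```

With the default threshold all three species map to the same binary vector, so the classifier has nothing to separate them with.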

In [None]:
# Report the following performance metrics for each classifier:
# Accuracy
# Precision
# Recall
# F1 score

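import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Spambase dataset (scikit-learn has no built-in loader for it,
# so read the raw file from the UCI repository as in the first cell)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
spam = pd.read_csv(url, header=None)

# Split the dataset into features and target (the last column is the spam label)
X = spam.iloc[:, :-1].values
y = spam.iloc[:, -1].values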

# Create Bernoulli Naive Bayes classifier
bnb_clf = BernoulliNB()

# Predict using 10-fold cross-validation
bnb_y_pred = cross_val_predict(bnb_clf, X, y, cv=10)

# Calculate performance metrics
bnb_accuracy = accuracy_score(y, bnb_y_pred)
bnb_precision = precision_score(y, bnb_y_pred)
bnb_recall = recall_score(y, bnb_y_pred)
bnb_f1_score = f1_score(y, bnb_y_pred)

# Create Multinomial Naive Bayes classifier
mnb_clf = MultinomialNB()

# Predict using 10-fold cross-validation
mnb_y_pred = cross_val_predict(mnb_clf, X, y, cv=10)

# Calculate performance metrics
mnb_accuracy = accuracy_score(y, mnb_y_pred)
mnb_precision = precision_score(y, mnb_y_pred)
mnb_recall = recall_score(y, mnb_y_pred)
mnb_f1_score = f1_score(y, mnb_y_pred)

# Create Gaussian Naive Bayes classifier
gnb_clf = GaussianNB()

# Predict using 10-fold cross-validation
gnb_y_pred = cross_val_predict(gnb_clf, X, y, cv=10)

# Calculate performance metrics
gnb_accuracy = accuracy_score(y, gnb_y_pred)
gnb_precision = precision_score(y, gnb_y_pred)
gnb_recall = recall_score(y, gnb_y_pred)
gnb_f1_score = f1_score(y, gnb_y_pred)

# Print the performance metrics for each classifier
print("Bernoulli Naive Bayes: accuracy={:.3f}, precision={:.3f}, recall={:.3f}, F1 score={:.3f}".format(bnb_accuracy, bnb_precision, bnb_recall, bnb_f1_score))
print("Multinomial Naive Bayes: accuracy={:.3f}, precision={:.3f}, recall={:.3f}, F1 score={:.3f}".format(mnb_accuracy, mnb_precision, mnb_recall, mnb_f1_score))
print("Gaussian Naive Bayes: accuracy={:.3f}, precision={:.3f}, recall={:.3f}, F1 score={:.3f}".format(gnb_accuracy, gnb_precision, gnb_recall, gnb_f1_score))


+ The code above loads the Spambase dataset, creates three different Naive Bayes classifiers (Bernoulli, Multinomial, and Gaussian), obtains out-of-fold predictions for every email using 10-fold cross-validation, and calculates the accuracy, precision, recall, and F1 score of each classifier from those predictions. Spam (label 1) is treated as the positive class. The code prints one line of metrics per classifier.
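
+ The four metrics can also be computed by hand from the confusion counts, which makes their definitions explicit. The helper name `binary_metrics` and the toy label vectors below are hypothetical; the sketch assumes label 1 is the positive (spam) class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the emails flagged as spam, how many really are spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam emails, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Toy labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, F1 score={f1:.3f}")
```

For a spam filter, precision and recall are often more informative than accuracy alone, because flagging a legitimate email and missing a spam email carry different costs.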

# Discussion:
# Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
# the case? Are there any limitations of Naive Bayes that you observed?

+ Based on the performance metrics obtained from the three variants of Naive Bayes classifiers, we can observe that Multinomial Naive Bayes performs the best with an accuracy of 0.877, precision of 0.885, recall of 0.802, and an F1 score of 0.841. Bernoulli Naive Bayes and Gaussian Naive Bayes classifiers had lower performance metrics.

+ This can be attributed to how each variant models the features. Most Spambase features are non-negative word and character frequencies, which are close to the count-style data that Multinomial Naive Bayes is designed for. Bernoulli Naive Bayes binarizes each feature, so it keeps only whether a word or character appears and discards how often it appears. Gaussian Naive Bayes assumes that each feature is normally distributed within a class, whereas the Spambase frequencies are heavily skewed with many zeros, so that assumption fits the data poorly.

+ However, it is important to note that Naive Bayes classifiers have some limitations. One of the key limitations is the assumption that features are conditionally independent given the class, which rarely holds in real-world datasets: in email, for example, certain words tend to occur together. When features are correlated, the same evidence is effectively counted more than once, which can reduce accuracy and makes the predicted probabilities overconfident. Additionally, Naive Bayes classifiers can suffer from the zero-frequency problem, where a feature that occurs rarely or never with a class in the training data receives a probability estimate at or near zero for that class and can dominate the prediction unless smoothing is applied.
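
+ The usual remedy for features that are rare or unseen in a class is additive (Laplace) smoothing. The sketch below shows the smoothed estimate used in the multinomial setting; the function name and the counts are hypothetical, and `alpha` plays the same role as the `alpha` parameter of scikit-learn's MultinomialNB, whose default is 1.0.

```python
def smoothed_probability(feature_count, class_total, n_features, alpha=1.0):
    """Additive smoothing: pretend every feature was seen alpha extra times."""
    return (feature_count + alpha) / (class_total + alpha * n_features)

# Hypothetical counts: a word never seen in the spam class, out of
# 100 word occurrences in spam and a vocabulary of 50 words.
unsmoothed = smoothed_probability(0, 100, 50, alpha=0.0)
smoothed = smoothed_probability(0, 100, 50, alpha=1.0)

print(unsmoothed)  # a single unseen word would zero out the whole class
print(smoothed)    # small but non-zero, so other features still matter
```

Without smoothing, one unseen word makes the product of likelihoods zero for that class no matter how strong the remaining evidence is.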

# Conclusion:
# Summarise your findings and provide some suggestions for future work.

+ In conclusion, we implemented three variants of Naive Bayes classifiers (Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes) using the scikit-learn library in Python and evaluated their performance on the Spambase dataset using 10-fold cross-validation. Based on our results, we found that Multinomial Naive Bayes performed the best, followed by Bernoulli Naive Bayes and Gaussian Naive Bayes.

+ Our findings suggest that the choice of the variant of Naive Bayes classifier should be made based on the nature of the features in the dataset. Multinomial Naive Bayes is better suited for count or frequency features, Bernoulli Naive Bayes for binary features, and Gaussian Naive Bayes for continuous features that are roughly normally distributed within each class.

+ For future work, we suggest exploring the performance of other classification algorithms on the Spambase dataset to identify the best performing algorithm. Additionally, feature engineering techniques could be applied to improve the performance of the Naive Bayes classifiers. Finally, we also recommend evaluating the performance of Naive Bayes classifiers on other datasets with varying feature types to better understand the strengths and limitations of the algorithm.