# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

A1

The probability that an employee is a smoker given that they use the health insurance plan can be read directly from the problem statement. Bayes' theorem is not needed.

Let's define the following events:
- Event A: An employee uses the health insurance plan.
- Event B: An employee is a smoker.

You are given the following probabilities:
- \(P(A)\), the probability that an employee uses the health insurance plan, is 70% or 0.7.
- \(P(B|A)\), the probability that an employee is a smoker given that they use the health insurance plan, is 40% or 0.4. The statement "40% of the employees who use the plan are smokers" is already a conditional probability, because it is restricted to plan users.

The quantity asked for is exactly \(P(B|A)\), so:

\(P(B|A) = 0.4\)

Bayes' theorem, \(P(B|A) = P(A|B) \, P(B) / P(A)\), would only be required if the problem had given \(P(A|B)\) and \(P(B)\) instead of \(P(B|A)\). Here, neither of those is given, and neither is needed.

The value \(P(A) = 0.7\) is not needed for the answer either, but it does let you compute the joint probability that an employee both uses the plan and smokes:

\(P(A \cap B) = P(B|A) \, P(A) = 0.4 \times 0.7 = 0.28\)

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

A2

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm used in machine learning and natural language processing. They are primarily used for text classification and have some key differences based on the type of data they are designed to handle:

**Bernoulli Naive Bayes:**

1. **Data Type**: Bernoulli Naive Bayes is suitable for binary data, where features are either present (1) or absent (0). It's often used in document classification tasks where the presence or absence of specific words or features in a document matters.

2. **Feature Representation**: It assumes that the features are generated from a Bernoulli distribution, meaning it models whether a feature is "on" or "off." In text classification, this often means modeling whether a word occurs in a document (1 for present, 0 for absent).

3. **Feature Independence**: Like other Naive Bayes variants, Bernoulli Naive Bayes assumes that features are conditionally independent given the class label. It simplifies the modeling by assuming that the presence or absence of one feature does not affect the presence or absence of another feature, given the class.

4. **Common Use Cases**: Bernoulli Naive Bayes is commonly used for tasks like spam detection, sentiment analysis, and document classification, where binary feature representations are used to represent the presence or absence of words or features.

**Multinomial Naive Bayes:**

1. **Data Type**: Multinomial Naive Bayes is designed for count data, where each feature records how many times something occurs, such as the number of times each word appears in a document. It is especially suitable for text classification tasks.

2. **Feature Representation**: It assumes that the features follow a multinomial distribution, which is useful for modeling the frequency of discrete items (e.g., word counts) in a document.

3. **Feature Independence**: Like other Naive Bayes variants, Multinomial Naive Bayes assumes that features are conditionally independent given the class label. In text classification, this means assuming that the frequency of each word is independent of the frequency of other words, given the class label.

4. **Common Use Cases**: Multinomial Naive Bayes is commonly used for tasks like document classification (e.g., categorizing news articles into topics), spam filtering, and text mining, where word frequencies or counts are used as features.

In summary, the key difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they model. Bernoulli Naive Bayes models binary data (presence or absence of each feature), while Multinomial Naive Bayes models counts (such as word frequencies). A practical consequence is that Bernoulli Naive Bayes explicitly accounts for the absence of a feature, whereas in Multinomial Naive Bayes a word with a count of zero contributes nothing to the likelihood. Both rely on the assumption that features are conditionally independent given the class label. The choice between them depends on whether occurrence counts carry useful information beyond mere presence in your data.
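The difference in input representation can be illustrated with a small sketch. The word-count matrix and labels below are made-up toy values, not data from this assignment. The sketch relies on the `binarize` parameter of scikit-learn's `BernoulliNB` (default `0.0`), which thresholds the inputs to 0/1 before fitting, while `MultinomialNB` uses the counts as they are.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (hypothetical): rows are documents, columns are words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

# MultinomialNB models the counts directly.
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB only sees presence/absence: with the default binarize=0.0,
# every value greater than 0 becomes 1 before fitting.
bnb = BernoulliNB().fit(X_counts, y)

# Fitting on an explicitly binarized matrix gives the same model.
X_binary = (X_counts > 0).astype(int)
bnb_explicit = BernoulliNB(binarize=None).fit(X_binary, y)
same_model = np.allclose(bnb.feature_log_prob_, bnb_explicit.feature_log_prob_)

print("Bernoulli models identical:", same_model)
print("Multinomial prediction for [5, 0, 0]:", mnb.predict([[5, 0, 0]])[0])
```

Here the repeated occurrences in `[5, 0, 0]` strengthen the Multinomial likelihood, while the Bernoulli model would treat `[5, 0, 0]` and `[1, 0, 0]` identically.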

# Q3. How does Bernoulli Naive Bayes handle missing values?

A3

Bernoulli Naive Bayes assumes that features are binary, representing the presence (1) or absence (0) of specific features or characteristics. The model itself has no built-in notion of a missing value, and scikit-learn's `BernoulliNB` does not accept inputs containing NaN, so missing values have to be dealt with before or during modeling. Here are some common approaches:

1. **Imputation**: One approach is to impute (fill in) missing values with a default value that represents the absence of the feature. In the context of Bernoulli Naive Bayes, you might impute missing values with 0, indicating that the feature is absent. This approach assumes that missing values are equivalent to the absence of the feature.

2. **Ignore Missing Values**: Another option is to exclude instances with missing values from training and classification. This can be reasonable if missing values are relatively rare and not systematically related to the class labels. A finer-grained alternative that Naive Bayes permits in principle is to skip only the missing feature: because the likelihood is a product of per-feature terms, the term for a missing feature can simply be left out. scikit-learn does not implement this, so it requires a custom implementation.

3. **Impute with Probability**: Instead of using a fixed value like 0, you can estimate the probability of the feature being 1 from the available data and impute based on it, for example by filling in the most frequent value or by sampling 0 or 1 with that probability. During training, the estimate can be made within the class to which the instance belongs. At prediction time the class is unknown, so the overall estimate has to be used instead.

4. **Use a Special Category**: You can treat "missing" as a distinct value. Since Bernoulli Naive Bayes only supports binary features, this is typically done by adding a separate binary indicator feature that is 1 when the original value was missing. This approach acknowledges that missing values are distinct from 0 and can carry information. However, it also increases the dimensionality of your feature space.

The choice of how to handle missing values in Bernoulli Naive Bayes should be based on the specific characteristics of your data and the problem you are trying to solve. It's important to consider whether missing values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), as this can influence your imputation strategy. Additionally, the impact of missing values on the performance of your classifier should be carefully evaluated through techniques like cross-validation.

Keep in mind that Naive Bayes algorithms, including Bernoulli Naive Bayes, rely on strong assumptions about the independence of features given the class label. Depending on how you handle missing values, these assumptions may be affected, and the performance of the classifier may vary.
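A minimal sketch of approaches 1 and 4 combined, assuming scikit-learn is used: `SimpleImputer` fills missing entries with 0 and, via `add_indicator=True`, appends one binary "was missing" column for each feature that had missing values. The small matrix and labels are made-up illustration data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix (hypothetical) with missing entries as NaN.
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, np.nan],
])
y = np.array([1, 1, 0, 0])

# Fill missing values with 0 ("absent") and add indicator columns that
# flag where a value was missing, so that information is not discarded.
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
X_filled = imputer.fit_transform(X)
print(X_filled.shape)  # 3 original columns + 2 indicator columns

# Putting the imputer in a pipeline applies the same treatment
# consistently during cross-validation and prediction.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0, add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Whether the indicator columns help depends on whether missingness is related to the class, which is worth checking with cross-validation.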

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

A4

Yes. Gaussian Naive Bayes handles multi-class classification natively. It is a variant of the Naive Bayes algorithm suited to continuous features that can be modeled with a Gaussian (normal) distribution within each class, and nothing in its formulation is specific to two classes.

In multi-class classification, the goal is to assign each instance to one of several possible classes. Gaussian Naive Bayes does this directly:

1. **Training**: For each class \(C_k\), estimate the prior \(P(C_k)\) from the class frequencies, and estimate a mean \(\mu_{ik}\) and variance \(\sigma_{ik}^2\) for every feature \(x_i\) from the training instances of that class.

2. **Scoring**: For a new instance, compute for each class the unnormalized posterior \(P(C_k) \prod_i \mathcal{N}(x_i \mid \mu_{ik}, \sigma_{ik}^2)\), usually in log space for numerical stability.

3. **Prediction**: Assign the instance to the class with the highest score. Dividing each score by the sum of the scores over all classes yields posterior probabilities that sum to 1.

Decomposition schemes such as One-vs-Rest (OvR), which train one binary classifier per class, can be wrapped around Gaussian Naive Bayes but are not required. Likewise, no separate softmax step is needed to obtain class probabilities: the normalization follows from Bayes' rule.

Gaussian Naive Bayes is a useful choice for multi-class classification when you have continuous data and can assume that the features are approximately normally distributed within each class. It is simple and efficient, but it makes the strong assumption of feature independence within each class, which may or may not hold in your specific application.
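As a brief illustration, scikit-learn's `GaussianNB` can be fitted on a three-class problem without any wrapper. The Iris dataset is used here only as a convenient built-in example; it is not part of this assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features.
X, y = load_iris(return_X_y=True)

# A single GaussianNB model covers all classes at once.
model = GaussianNB().fit(X, y)

# One row of per-feature means is stored for each class.
print(model.classes_)
print(model.theta_.shape)

# predict_proba returns one posterior probability per class,
# and each row sums to 1.
proba = model.predict_proba(X[:5])
print(proba.sum(axis=1))
```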

# Q5. Assignment:

- Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

- Implementation: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

- Results: Report the following performance metrics for each classifier:
    - Accuracy
    - Precision
    - Recall
    - F1 score

- Discussion: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

- Conclusion: Summarise your findings and provide some suggestions for future work.

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.2-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [6]:
spambase

{'data': {'ids': None,
  'features':       word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
  0               0.00               0.64           0.64           0.0   
  1               0.21               0.28           0.50           0.0   
  2               0.06               0.00           0.71           0.0   
  3               0.00               0.00           0.00           0.0   
  4               0.00               0.00           0.00           0.0   
  ...              ...                ...            ...           ...   
  4596            0.31               0.00           0.62           0.0   
  4597            0.00               0.00           0.00           0.0   
  4598            0.30               0.00           0.30           0.0   
  4599            0.96               0.00           0.00           0.0   
  4600            0.00               0.00           0.65           0.0   
  
        word_freq_our  word_freq_over  word_freq_remove  word_freq_interne

In [15]:
import pandas as pd
file_path = 'spambase.csv'
df = pd.read_csv(file_path)

In [16]:
df

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
1,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
2,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,1.85,0.00,0.00,1.85,0.00,0.00,...,0.000,0.223,0.0,0.000,0.000,0.000,3.000,15,54,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4596,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4597,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4598,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [14]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

data = pd.read_csv("spambase.csv", header=None)

# Split the data into features (X) and labels (y)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize the classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and evaluate each classifier
scores_bernoulli = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='accuracy')
scores_multinomial = cross_val_score(multinomial_nb, X, y, cv=10, scoring='accuracy')
scores_gaussian = cross_val_score(gaussian_nb, X, y, cv=10, scoring='accuracy')

# Report the average accuracy for each classifier
print("Bernoulli Naive Bayes - Average Accuracy:", scores_bernoulli.mean())
print("Multinomial Naive Bayes - Average Accuracy:", scores_multinomial.mean())
print("Gaussian Naive Bayes - Average Accuracy:", scores_gaussian.mean())


Bernoulli Naive Bayes - Average Accuracy: 0.8839380364047911
Multinomial Naive Bayes - Average Accuracy: 0.7863496180326323
Gaussian Naive Bayes - Average Accuracy: 0.8217730830896915


In [20]:
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

file_path = 'spambase.csv'
# NOTE: read_csv is called without header=None here, so pandas consumes the
# first data row as column names. The label column is therefore named '1'
# (that row's label) and one sample is lost, which is why the accuracies below
# differ slightly from the cell above that passes header=None.
data = pd.read_csv(file_path)

# Split the data into features (X) and labels (y)
X = data.drop('1', axis=1)
y = data['1']

# Classifiers with default hyperparameters, keyed by display name
classifiers = {
    'Bernoulli NB': BernoulliNB(),
    'Multinomial NB': MultinomialNB(),
    'Gaussian NB': GaussianNB(),
}
metrics = ['accuracy', 'precision', 'recall', 'f1']

# Run 10-fold cross-validation once per classifier, scoring all metrics on the same folds
for name, classifier in classifiers.items():
    print(f"Classifier: {name}")
    scores = cross_validate(classifier, X, y, cv=10, scoring=metrics)

    # Print the average of each metric across the 10 folds
    print(f"Accuracy: {scores['test_accuracy'].mean()}")
    print(f"Precision: {scores['test_precision'].mean()}")
    print(f"Recall: {scores['test_recall'].mean()}")
    print(f"F1 Score: {scores['test_f1'].mean()}\n")

Classifier: Bernoulli NB
Accuracy: 0.8839130434782609
Precision: 0.886914139754535
Recall: 0.8151235504826666
F1 Score: 0.8480714616697421

Classifier: Multinomial NB
Accuracy: 0.786086956521739
Precision: 0.7390291264847734
Recall: 0.7207971586424625
F1 Score: 0.7277511309974372

Classifier: Gaussian NB
Accuracy: 0.8217391304347826
Precision: 0.7102746648832371
Recall: 0.9569394693704085
F1 Score: 0.8129997873786424



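Discussion:

Bernoulli Naive Bayes performed best overall, with the highest accuracy (0.884), precision (0.887), and F1 score (0.848). Gaussian Naive Bayes achieved by far the highest recall (0.957) but the lowest precision (0.710): it catches almost all spam at the cost of flagging many legitimate emails. Multinomial Naive Bayes was the weakest on accuracy (0.786), recall (0.721), and F1 score (0.728).
A plausible explanation lies in the nature of the features. Most Spambase features are word and character frequencies that are zero for many emails, so whether a word occurs at all is already a strong signal. With its default threshold of 0, Bernoulli Naive Bayes reduces each feature to exactly this presence/absence information. The same sparse, skewed features are poorly described by a per-class Gaussian, and Multinomial Naive Bayes expects counts, whereas here the word frequencies are percentages mixed with capital-run-length features on a much larger scale that likely dominate the likelihood.
Several limitations apply. The conditional independence assumption is unlikely to hold, since word frequencies within an email are correlated. The dataset has no missing values, so no imputation was needed, and because the assignment required default hyperparameters, no scaling, binarization threshold, or smoothing parameter was tuned.

Conclusion:

On the Spambase dataset with default hyperparameters and 10-fold cross-validation, Bernoulli Naive Bayes gave the best balance of precision and recall, Gaussian Naive Bayes is preferable only when missing spam is much more costly than misclassifying legitimate email, and Multinomial Naive Bayes was the least suitable for these non-count features.
Suggestions for future work include exploring more advanced machine learning algorithms, applying feature engineering techniques (for example, transforming or discretizing the skewed features), and fine-tuning hyperparameters.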
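As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over the `alpha` (smoothing) and `binarize` (threshold) parameters of `BernoulliNB`. It uses synthetic data from `make_classification` so that it runs on its own. The grid values are arbitrary assumptions, and the scores it prints say nothing about Spambase.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data; replace with the Spambase features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values (illustrative only) for smoothing and binarization threshold.
param_grid = {
    "alpha": [0.1, 1.0, 10.0],
    "binarize": [0.0, 0.5],
}

# Evaluate every combination with 5-fold cross-validation, optimizing F1.
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean F1:", search.best_score_)
```

To avoid an optimistic estimate, the tuned model should be evaluated on data that was not used to choose the parameters, for example with nested cross-validation.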

For details on the classifiers and evaluation functions used above, refer to the scikit-learn documentation.