# Naive Bayes Classifiers
This notebook demonstrates three Naive Bayes classifiers from scikit-learn:
- **Multinomial Naive Bayes** for count data, such as word counts in text.
- **Bernoulli Naive Bayes** for binary data, such as the presence or absence of features.
- **Gaussian Naive Bayes** for continuous data, where each feature is assumed to be normally distributed within each class.

Additionally, we'll use:
- Binning to convert continuous data into binary features for BernoulliNB
- Laplace (add-one) smoothing to handle zero probabilities
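
To make the smoothing idea concrete, here is a minimal sketch on a toy count matrix (the data and variable names are invented for illustration). It computes the smoothed estimate P(term | class) = (count + alpha) / (total + alpha * n_terms) by hand and compares it with what `MultinomialNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (hypothetical data): 4 documents, 3 vocabulary terms
X = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 0, 3],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # add-one smoothing

# Manual estimate: (count + alpha) / (total count in class + alpha * n_terms)
n_terms = X.shape[1]
manual_probs = []
for c in (0, 1):
    counts = X[y == c].sum(axis=0)
    manual_probs.append((counts + alpha) / (counts.sum() + alpha * n_terms))
manual_probs = np.array(manual_probs)

# Estimate from scikit-learn for comparison
mnb = MultinomialNB(alpha=alpha).fit(X, y)
sklearn_probs = np.exp(mnb.feature_log_prob_)

print(manual_probs)
print(np.allclose(manual_probs, sklearn_probs))
```

The third term never occurs in class 0, yet its smoothed probability is 1/7 rather than 0, so a single unseen word cannot force the whole class likelihood to zero.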

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

## 1. Multinomial Naive Bayes
We will convert a few sample sentences into a bag-of-words representation (one column per word, one count per cell) and fit MultinomialNB on those counts.

In [2]:
# Sample text data: two classes with short text samples
X_text = [
    'I love programming in Python',
    'Python is my favorite language',
    'I dislike dislike bugs in my code',
    'Debugging code can be frustrating',
    'I enjoy solving problems using Python',
    'Syntax errors are annoying',
    'Coding is fun fun',
    'I hate when my code crashes',
    'Python is amazing for amazing data science',
    'I need to refactor this code'
]

# Labels: 1 for positive sentiment, 0 for negative sentiment
y_text = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

In [3]:
# Vectorize text data
vectorizer = CountVectorizer()
X_text_vec = vectorizer.fit_transform(X_text)

In [4]:
# Convert X_text_vec to a DataFrame with feature names
X_text_df = pd.DataFrame(X_text_vec.toarray(), columns=vectorizer.get_feature_names_out())

# Add the labels to the DataFrame
X_text_df['Label'] = y_text

# Display the resulting DataFrame
X_text_df

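| | amazing | annoying | are | be | bugs | can | code | coding | crashes | data | ... | python | refactor | science | solving | syntax | this | to | using | when | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |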


In [5]:
# Split the data (10 samples with test_size=0.3 leaves only 3 test samples)
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    X_text_vec, y_text, test_size=0.3, random_state=42
)

# Train and evaluate MultinomialNB with Laplace smoothing (alpha=1)
mnb = MultinomialNB(alpha=1)
mnb.fit(X_train_text, y_train_text)
y_pred_text = mnb.predict(X_test_text)

print('Accuracy (MultinomialNB):', accuracy_score(y_test_text, y_pred_text))

Accuracy (MultinomialNB): 0.6666666666666666
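
In the cells above, the vocabulary is learned from all ten sentences before the split. One possible alternative, sketched here under the assumption that the raw sentences are split first, is to wrap `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline so the vocabulary comes from the training sentences only. The sentence passed to `predict_proba` is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same sample sentences and labels as in the notebook
X_text = [
    'I love programming in Python',
    'Python is my favorite language',
    'I dislike dislike bugs in my code',
    'Debugging code can be frustrating',
    'I enjoy solving problems using Python',
    'Syntax errors are annoying',
    'Coding is fun fun',
    'I hate when my code crashes',
    'Python is amazing for amazing data science',
    'I need to refactor this code',
]
y_text = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Split the raw sentences first, then vectorize inside the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y_text, test_size=0.3, random_state=42
)

text_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1))
text_clf.fit(X_train, y_train)

test_pred = text_clf.predict(X_test)
new_proba = text_clf.predict_proba(['Python is fun'])  # hypothetical sentence

print('Test predictions:', test_pred)
print('Class probabilities for a new sentence:', new_proba)
```

Words that do not appear in the training sentences are ignored at prediction time. With only three test sentences, any accuracy figure from this split is very coarse.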


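## 2.1 Bernoulli Naive Bayes
We will generate synthetic binary features with random binary labels and apply BernoulliNB. Because the labels are drawn independently of the features, an accuracy close to 0.5 is expected.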

In [6]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic binary data
n_samples = 100  # Number of samples
n_features = 10  # Number of binary features

# Generate binary feature data (0 or 1)
X_binary = np.random.randint(2, size=(n_samples, n_features))

# Generate binary labels (0 or 1)
y_binary = np.random.randint(2, size=n_samples)

# Create a DataFrame to inspect the data
df_binary = pd.DataFrame(X_binary, columns=[f'Feature_{i+1}' for i in range(n_features)])
df_binary['Label'] = y_binary

df_binary

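| | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | Feature_10 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 96 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 97 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 98 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |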


In [7]:
# Split the data into training and testing sets
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(X_binary, y_binary, test_size=0.3, random_state=42)

# Train a Bernoulli Naive Bayes model with Laplace smoothing (alpha=1)
bnb = BernoulliNB(alpha=1)
bnb.fit(X_train_bin, y_train_bin)

# Predict and evaluate the model
y_pred_bin = bnb.predict(X_test_bin)

print("\nAccuracy (BernoulliNB):", accuracy_score(y_test_bin, y_pred_bin))


Accuracy (BernoulliNB): 0.5333333333333333
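
For the Bernoulli model, `alpha=1` smooths the probability that a feature is present in a class. The sketch below, on a small invented binary matrix, computes P(feature = 1 | class) = (N_present + alpha) / (N_class + 2 * alpha) by hand and checks it against `feature_log_prob_`. The factor 2 reflects the two possible outcomes (present or absent) of each feature.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data: 5 samples, 3 features
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
])
y = np.array([1, 1, 0, 0, 0])
alpha = 1.0

# Manual estimate of P(feature = 1 | class) for classes 0 and 1
manual = np.array([
    (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
    for c in (0, 1)
])

bnb = BernoulliNB(alpha=alpha).fit(X, y)
fitted = np.exp(bnb.feature_log_prob_)

print(manual)
print(np.allclose(manual, fitted))
```

The first feature is never present in class 0, but its smoothed probability is 1/5 instead of 0.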


## 2.2 Bernoulli Naive Bayes with Binning
BernoulliNB expects binary features, so we first bin each continuous feature into two values (0 or 1).
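
To show what two-bin discretization does, here is a minimal sketch on a single made-up feature. It uses the same `KBinsDiscretizer` settings as this section and prints the bin edges alongside the binned values.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One hypothetical continuous feature with values between 0 and 10
X = np.array([[0.0], [1.0], [2.0], [10.0]])

binner = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

edges = binner.bin_edges_[0]  # edges of the two equal-width bins
print(edges)
print(X_binned.ravel())
```

With `strategy='uniform'`, the threshold sits at the midpoint of the feature's range, so a single large value can push most samples into bin 0. If a more balanced split is wanted, `strategy='quantile'` is one option to consider.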

In [8]:
# Generate synthetic continuous data
np.random.seed(42)
X_continuous = np.random.normal(size=(100, 5))  # 100 samples, 5 features
y_continuous = np.random.choice([0, 1], size=100)  # Binary labels

In [9]:
# Use KBinsDiscretizer to split each feature into two equal-width bins (0 or 1)
binner = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X_continuous)

In [10]:
# Split the binned data
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(X_binned, y_continuous, test_size=0.3, random_state=42)

# Train and evaluate BernoulliNB with Laplace smoothing (alpha=1)
bnb = BernoulliNB(alpha=1)
bnb.fit(X_train_bin, y_train_bin)
y_pred_bin = bnb.predict(X_test_bin)

print('Accuracy (BernoulliNB):', accuracy_score(y_test_bin, y_pred_bin))

Accuracy (BernoulliNB): 0.6


## 3. Gaussian Naive Bayes
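We will generate continuous synthetic data from a standard normal distribution and apply GaussianNB. The labels are again random, so the accuracy should stay close to chance (0.5).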
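
As a rough sketch of what GaussianNB learns, the example below fits the model on a small random dataset (generated here only for illustration) and checks that the stored per-class means and class priors match values computed directly with NumPy.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: 200 samples, 3 continuous features, random binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

gnb = GaussianNB().fit(X, y)

# Per-class feature means computed by hand
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])
# Class priors are the class frequencies in the training data
manual_priors = np.array([(y == c).mean() for c in gnb.classes_])

print(np.allclose(manual_means, gnb.theta_))
print(np.allclose(manual_priors, gnb.class_prior_))
```

Together with a per-class variance for each feature, these means define one normal distribution per feature and class, which the model combines with the priors to score a sample.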

In [11]:
# Generate synthetic continuous data
np.random.seed(42)
X_continuous = np.random.normal(size=(1000, 15))  # 1000 samples, 15 features
y_continuous = np.random.choice([0, 1], size=1000)  # Binary labels

# Split the data
X_train_cont, X_test_cont, y_train_cont, y_test_cont = train_test_split(X_continuous, y_continuous, test_size=0.3, random_state=42)

# Train and evaluate GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_cont, y_train_cont)

y_pred_cont = gnb.predict(X_test_cont)
print('Accuracy (GaussianNB):', accuracy_score(y_test_cont, y_pred_cont))

Accuracy (GaussianNB): 0.5133333333333333
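
The labels in the cell above are drawn independently of the features, which is why the accuracy stays near 0.5. The following sketch contrasts that setup with a hypothetical variant in which every feature of class 1 is shifted by one unit. The shift size and the helper `gnb_accuracy` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 15
y = rng.integers(0, 2, size=n_samples)

# Features unrelated to the labels (same setup as the notebook cell)
X_noise = rng.normal(size=(n_samples, n_features))
# Hypothetical variant: shift all features of class 1 by one unit
X_signal = X_noise + y[:, None] * 1.0


def gnb_accuracy(X, y):
    """Fit GaussianNB on a 70/30 split and return the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = GaussianNB().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_noise = gnb_accuracy(X_noise, y)
acc_signal = gnb_accuracy(X_signal, y)

print('Accuracy with uninformative features:', acc_noise)
print('Accuracy with shifted class means:', acc_signal)
```

Once the class means differ, the classifier has something to learn and its accuracy should rise well above chance.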
