# Naive Bayes and Probabilities

This notebook demonstrates the fundamentals of **Naive Bayes classification**, highlighting probability computations, assumptions of conditional independence, and example applications.

**Author:** Mahira Banu  
**Date:** June 2025  
**Tags:** #NaiveBayes #ML #Probability #InterviewPrep #GTV

## Joint Probability

Joint probability is the chance that **two events happen at the same time**.

For example, imagine a dataset where people buy ice cream on days with different weather:

| Weather | Buys Ice Cream | Count |
|---------|----------------|-------|
| Sunny   | Yes            | 30    |
| Sunny   | No             | 10    |
| Rainy   | Yes            | 5     |
| Rainy   | No             | 5     |

The total number of observations is 50.

The joint probability of it being **Sunny AND buying ice cream** is:

$$
P(\text{Sunny} \cap \text{Yes}) = \frac{30}{50} = 0.6 
$$
This means that 60% of the observations are sunny days on which a person buys ice cream.


In [65]:
import pandas as pd

# Create the dataset
data = {
    'Weather': ['Sunny', 'Sunny', 'Rainy', 'Rainy'],
    'Buys_IceCream': ['Yes', 'No', 'Yes', 'No'],
    'Count': [30, 10, 5, 5]
}

df = pd.DataFrame(data)

# Total observations
total = df['Count'].sum()

# Calculate joint probability P(Sunny ∩ Yes)
# First, look up how many times "Weather = Sunny AND Buys Ice Cream = Yes" occurs in the dataset
joint_count = df[(df['Weather'] == 'Sunny') & (df['Buys_IceCream'] == 'Yes')]['Count'].values[0]

joint_prob = joint_count / total

print(f"Joint Probability P(Sunny ∩ Buys Ice Cream) = {joint_prob:.2f}")


Joint Probability P(Sunny ∩ Buys Ice Cream) = 0.60


## Marginal Probability (Ice Cream Example)

Marginal probability is the chance of **a single event happening**, regardless of other events.

From our ice cream data:

| Weather | Buys Ice Cream | Count |
|---------|----------------|-------|
| Sunny   | Yes            | 30    |
| Sunny   | No             | 10    |
| Rainy   | Yes            | 5     |
| Rainy   | No             | 5     |

Total observations: 50

To find the probability that a person **buys ice cream regardless of weather**, sum all counts where Buys Ice Cream = Yes:
$$
P(\text{Buys Ice Cream = Yes}) = \frac{30 + 5}{50} = \frac{35}{50} = 0.7
$$

So, there is a 70% chance that a randomly observed person buys ice cream.


In [69]:
# Calculate marginal probability of Buys Ice Cream = Yes
marginal_count_icecream = df[df['Buys_IceCream'] == 'Yes']['Count'].sum()
print(marginal_count_icecream)
marginal_prob_icecream = marginal_count_icecream / total

print(f"Marginal Probability P(Buys Ice Cream = Yes) = {marginal_prob_icecream:.2f}")


35
Marginal Probability P(Buys Ice Cream = Yes) = 0.70


## Conditional Probability (Ice Cream Example)

Conditional probability is the chance of an event happening **given that** another event has already happened.

From our ice cream data:

| Weather | Buys Ice Cream | Count |
|---------|----------------|-------|
| Sunny   | Yes            | 30    |
| Sunny   | No             | 10    |
| Rainy   | Yes            | 5     |
| Rainy   | No             | 5     |

Total observations: 50

Suppose we want to find the probability that a person **buys ice cream given that it is sunny**. This is written as:

$$
P(\text{Buys Ice Cream = Yes} \mid \text{Weather = Sunny}) = \frac{P(\text{Buys Ice Cream = Yes} \cap \text{Weather = Sunny})}{P(\text{Weather = Sunny})}
$$

We already know:

$$
P(\text{Buys Ice Cream = Yes} \cap \text{Weather = Sunny}) = \frac{30}{50} = 0.6
$$

$$
P(\text{Weather = Sunny}) = \frac{30 + 10}{50} = \frac{40}{50} = 0.8
$$

So,

$$
P(\text{Buys Ice Cream = Yes} \mid \text{Weather = Sunny}) = \frac{0.6}{0.8} = 0.75
$$

This means that given the weather is sunny, there is a 75% chance that a person buys ice cream.


In [77]:
# Calculate conditional probability P(Buys Ice Cream = Yes | Weather = Sunny)
joint_prob = df[(df['Weather'] == 'Sunny') & (df['Buys_IceCream'] == 'Yes')]['Count'].values[0] / total
marginal_prob_weather = df[df['Weather'] == 'Sunny']['Count'].sum() / total
conditional_prob = joint_prob / marginal_prob_weather

print(f"Conditional Probability P(Buys Ice Cream = Yes | Weather = Sunny) = {conditional_prob:.2f}")


Conditional Probability P(Buys Ice Cream = Yes | Weather = Sunny) = 0.75
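
Because the total cancels in the ratio, the same conditional probability can be read straight off the counts: 30 of the 40 sunny observations are purchases. The sketch below uses plain Python instead of pandas; the `counts` dictionary and the helper name `conditional` are illustrative assumptions, not part of the cells above. It computes P(Yes | weather) for both weather values.

```python
# Counts copied from the ice cream table: (weather, buys_ice_cream) -> count
counts = {
    ("Sunny", "Yes"): 30,
    ("Sunny", "No"): 10,
    ("Rainy", "Yes"): 5,
    ("Rainy", "No"): 5,
}

def conditional(buys, weather):
    """P(Buys = buys | Weather = weather), computed directly from counts."""
    # Denominator: all observations with the given weather
    weather_total = sum(c for (w, _), c in counts.items() if w == weather)
    return counts[(weather, buys)] / weather_total

for weather in ("Sunny", "Rainy"):
    print(f"P(Yes | {weather}) = {conditional('Yes', weather):.2f}")
```

On sunny days the purchase probability is 0.75, while on rainy days it drops to 0.50, which already hints that weather and buying are not independent.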


## Independence of Events

Two events \(A\) and \(B\) are **independent** if knowing one event does **not affect** the probability of the other.

Mathematically:

$$
P(A \cap B) = P(A) \times P(B)
$$

where $P(A \cap B)$ is the joint probability of both events occurring together, and $P(A)$ and $P(B)$ are the marginal probabilities of the individual events.


### Ice Cream Example

Given the data:

| Weather | Buys Ice Cream | Count |
|---------|----------------|-------|
| Sunny   | Yes            | 30    |
| Sunny   | No             | 10    |
| Rainy   | Yes            | 5     |
| Rainy   | No             | 5     |

Calculate:

- $P(\text{Sunny}) = \frac{40}{50} = 0.8$
- $P(\text{Buys Ice Cream}) = \frac{35}{50} = 0.7$
- $P(\text{Sunny} \cap \text{Buys Ice Cream}) = \frac{30}{50} = 0.6$


Check independence:

$$
P(\text{Sunny} \cap \text{Buys Ice Cream}) \stackrel{?}{=} P(\text{Sunny}) \times P(\text{Buys Ice Cream}) = 0.8 \times 0.7 = 0.56
$$

Since $0.6 \neq 0.56$, the events are **not independent**.


In [89]:
# Calculate probabilities
p_sunny = df[df['Weather'] == 'Sunny']['Count'].sum() / total
p_icecream = df[df['Buys_IceCream'] == 'Yes']['Count'].sum() / total
p_joint = df[(df['Weather'] == 'Sunny') & (df['Buys_IceCream'] == 'Yes')]['Count'].values[0] / total

print(f"P(Sunny) = {p_sunny:.2f}")
print(f"P(Buys Ice Cream) = {p_icecream:.2f}")
print(f"P(Sunny and Buys Ice Cream) = {p_joint:.2f}")
print(f"Product P(Sunny) * P(Buys Ice Cream) = {p_sunny * p_icecream:.2f}")

if abs(p_joint - (p_sunny * p_icecream)) < 1e-6:
    print("Events are independent")
else:
    print("Events are NOT independent")


P(Sunny) = 0.80
P(Buys Ice Cream) = 0.70
P(Sunny and Buys Ice Cream) = 0.60
Product P(Sunny) * P(Buys Ice Cream) = 0.56
Events are NOT independent


## Bayes' Theorem

Bayes' Theorem relates the conditional and marginal probabilities of random events. It is given by:

$$
P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}
$$

where:
- $P(Y \mid X)$ is the posterior probability: the probability of event $Y$ given $X$.
- $P(X \mid Y)$ is the likelihood: the probability of event $X$ given $Y$.
- $P(Y)$ is the prior probability of event $Y$.
- $P(X)$ is the marginal likelihood or evidence.
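
As a quick sanity check, the theorem can be applied to the ice cream data from the earlier sections to reverse the conditioning: given that someone bought ice cream, how likely is it that the day was sunny? The variable names below are illustrative; the numbers come from the table.

```python
# Probabilities taken from the ice cream sections of this notebook
p_yes_given_sunny = 30 / 40   # P(Yes | Sunny) = 0.75
p_sunny = 40 / 50             # P(Sunny) = 0.8
p_yes = 35 / 50               # P(Yes) = 0.7

# Bayes' theorem: P(Sunny | Yes) = P(Yes | Sunny) * P(Sunny) / P(Yes)
p_sunny_given_yes = p_yes_given_sunny * p_sunny / p_yes

# Direct count for comparison: 30 of the 35 purchases happened on sunny days
p_direct = 30 / 35

print(f"P(Sunny | Yes) via Bayes  = {p_sunny_given_yes:.4f}")
print(f"P(Sunny | Yes) via counts = {p_direct:.4f}")
```

Both routes give the same value, which is what the theorem promises.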



In [106]:
# Bayes Theorem example: Spam detection

# Probabilities (assumed values for example)
P_spam = 0.2               # Prior probability of spam
P_word_given_spam = 0.7    # Probability word appears in spam
P_word_given_not_spam = 0.1 # Probability word appears in non-spam
P_not_spam = 1 - P_spam

# Apply Bayes Theorem to calculate P(spam | word)
# Posterior = (Likelihood * Prior) / Evidence

P_word = P_word_given_spam * P_spam + P_word_given_not_spam * P_not_spam  # Total probability of word

P_spam_given_word = (P_word_given_spam * P_spam) / P_word

print(f"Probability of spam given the word: {P_spam_given_word:.2f}")


Probability of spam given the word: 0.64
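
Written out, the arithmetic behind this result (using the assumed example values from the cell above) is:

$$
P(\text{word}) = 0.7 \times 0.2 + 0.1 \times 0.8 = 0.14 + 0.08 = 0.22
$$

$$
P(\text{spam} \mid \text{word}) = \frac{0.7 \times 0.2}{0.22} = \frac{0.14}{0.22} \approx 0.64
$$

So although only 20% of emails are spam a priori, seeing the word raises the probability of spam to about 64%.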


## Naïve Bayes Assumption

**Definition:**  
All features are conditionally independent given the class.

This means that, for a given class \( C \), the probability of observing features \( X_1, X_2, ..., X_n \) is the product of the probabilities of observing each feature individually, assuming independence:

$$
P(X_1, X_2, ..., X_n \mid C) = P(X_1 \mid C) \times P(X_2 \mid C) \times \cdots \times P(X_n \mid C)
$$



### Example: Spam Filtering

If we want to calculate the probability of an email \( E \) being spam \( S \), and the email has words \( w_1, w_2, ..., w_n \), Naïve Bayes assumes:

$$
P(E \mid S) \approx P(w_1 \mid S) \times P(w_2 \mid S) \times \cdots \times P(w_n \mid S)
$$

In simple words:  
The probability of the entire email given it’s spam is approximated by multiplying the probabilities of each word occurring given that it is spam.


In [110]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample emails
emails = [
    "Free money offer just for you",
    "Hi Bob, are we meeting tomorrow?",
    "Congratulations, you won a prize",
    "Dear friend, let's catch up soon",
    "Win a free vacation now",
    "Are you available for a call?"
]

# Labels: 1 for spam, 0 for not spam
labels = [1, 0, 1, 0, 1, 0]

# Convert text data to numeric features (word counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X, labels)

# Predict if new email is spam
new_emails = ["Free prize for you", "Let's schedule a meeting"]
X_new = vectorizer.transform(new_emails)
predictions = nb.predict(X_new)

for email, pred in zip(new_emails, predictions):
    label = "Spam" if pred == 1 else "Not Spam"
    print(f"Email: '{email}' -> Prediction: {label}")


Email: 'Free prize for you' -> Prediction: Spam
Email: 'Let's schedule a meeting' -> Prediction: Not Spam


## Laplace Smoothing

**Definition:**  
Laplace smoothing is a technique used to handle the problem of zero probabilities in probabilistic models like Naïve Bayes.

When a feature (e.g., a word) does **not appear** in the training data for a certain class, the probability estimate for that feature would be zero, which can cause the whole probability product to become zero.

Laplace smoothing adds a small value (usually 1) to all counts to avoid zero probabilities.


### Formula:

If \( \text{count}(X_i, C) \) is the count of feature \( X_i \) in class \( C \), \( \text{count}(C) \) is the total count of all features in class \( C \), and \( V \) is the total number of possible features (vocabulary size), then the smoothed probability is:

$$
P(X_i \mid C) = \frac{\text{count}(X_i, C) + 1}{\text{count}(C) + V}
$$


### Example:

If the word "free" never appears in the "not spam" class in the training data, without smoothing:

$$
P(\text{"free"} \mid \text{not spam}) = 0
$$

With Laplace smoothing:

$$
P(\text{"free"} \mid \text{not spam}) = \frac{0 + 1}{\text{total words in not spam} + V}
$$
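
To put concrete numbers on this, the sketch below counts words in the sample emails used throughout this notebook and applies the formula by hand. The regular expression is an assumption chosen to mimic `CountVectorizer`'s default tokenization (lowercase tokens of two or more word characters), and names such as `tokenize` are illustrative.

```python
import re
from collections import Counter

emails = [
    "Free money offer just for you",
    "Hi Bob, are we meeting tomorrow?",
    "Congratulations, you won a prize",
    "Dear friend, let's catch up soon",
    "Win a free vacation now",
    "Are you available for a call?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

def tokenize(text):
    # Assumption: lowercase tokens of 2+ word characters, like CountVectorizer's default
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary size V: distinct tokens across all training emails
vocabulary = {tok for email in emails for tok in tokenize(email)}
V = len(vocabulary)

# Word counts restricted to the "not spam" class
not_spam_counts = Counter(
    tok
    for email, label in zip(emails, labels)
    if label == 0
    for tok in tokenize(email)
)
total_not_spam = sum(not_spam_counts.values())

unsmoothed = not_spam_counts["free"] / total_not_spam
smoothed = (not_spam_counts["free"] + 1) / (total_not_spam + V)

print(f"V = {V}, total words in not spam = {total_not_spam}")
print(f"Unsmoothed P('free' | not spam) = {unsmoothed:.4f}")
print(f"Smoothed   P('free' | not spam) = {smoothed:.4f}")
```

With this tokenization the smoothed estimate is 1 / (17 + 26), small but no longer zero.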


In [114]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample emails
emails = [
    "Free money offer just for you",
    "Hi Bob, are we meeting tomorrow?",
    "Congratulations, you won a prize",
    "Dear friend, let's catch up soon",
    "Win a free vacation now",
    "Are you available for a call?"
]

# Labels: 1 for spam, 0 for not spam
labels = [1, 0, 1, 0, 1, 0]

# Convert text data to numeric features (word counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train Naive Bayes classifier with Laplace smoothing (alpha=1)
nb = MultinomialNB(alpha=1)
nb.fit(X, labels)

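# Predict probabilities for a new email containing the unseen word "winner"
# Note: "winner" is not in the fitted vocabulary, so vectorizer.transform() ignores it;
# smoothing matters here for "are", which never occurs in the spam training emails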
new_emails = ["You are a winner!"]
X_new = vectorizer.transform(new_emails)
probs = nb.predict_proba(X_new)

print(f"Probability of not spam: {probs[0][0]:.4f}")
print(f"Probability of spam: {probs[0][1]:.4f}")


Probability of not spam: 0.6338
Probability of spam: 0.3662
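
These numbers can be reproduced by hand. Assuming `CountVectorizer`'s default tokenization, the training data has a vocabulary of \( V = 26 \) words, with 17 word occurrences in not spam and 14 in spam, and both class priors are 0.5. Only "you" and "are" contribute, because "winner" is not in the vocabulary:

$$
P(\text{not spam}) \cdot P(\text{you} \mid \text{not spam}) \cdot P(\text{are} \mid \text{not spam}) = 0.5 \times \frac{1 + 1}{17 + 26} \times \frac{2 + 1}{17 + 26} = \frac{3}{1849}
$$

$$
P(\text{spam}) \cdot P(\text{you} \mid \text{spam}) \cdot P(\text{are} \mid \text{spam}) = 0.5 \times \frac{2 + 1}{14 + 26} \times \frac{0 + 1}{14 + 26} = \frac{3}{3200}
$$

Normalizing the two scores so they sum to 1:

$$
P(\text{not spam} \mid \text{email}) = \frac{3/1849}{3/1849 + 3/3200} = \frac{3200}{5049} \approx 0.6338
$$

Without smoothing, the zero count of "are" in spam would force the spam score to exactly zero.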


## Log Probabilities

**Definition:**  
When multiplying many small probabilities (like in Naïve Bayes), the result can become extremely small, causing numerical underflow (computers round to zero).

To avoid this, we use **logarithms** to convert multiplication into addition, which is numerically more stable.

### Formula:

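Instead of computing:

$$
P(Y \mid X) \propto P(Y) \times \prod_{i=1}^n P(X_i \mid Y)
$$

We compute:

$$
\log P(Y \mid X) = \log P(Y) + \sum_{i=1}^n \log P(X_i \mid Y) - \log P(X)
$$

The term $\log P(X)$ is the same for every class, so it can be dropped when comparing classes.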


### Explanation:

- Multiplying tiny probabilities → risk of underflow  
- Adding log probabilities → stable and efficient  
- The class with the highest log posterior is predicted.
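
A minimal sketch of the underflow problem, using a hypothetical setup of 1000 features that each have likelihood 0.01 (the numbers are made up purely for illustration):

```python
import math

# Hypothetical setup: 1000 features, each with likelihood 0.01
probs = [0.01] * 1000

# Multiplying directly: the running product becomes too small for a float
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: the result is an ordinary negative number
log_sum = sum(math.log(p) for p in probs)

print(f"Product of probabilities: {product}")
print(f"Sum of log probabilities: {log_sum:.2f}")
```

The direct product collapses to `0.0`, so every class would tie at zero, while the log sum remains usable for comparing classes.
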


In [118]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample emails
emails = [
    "Free money offer just for you",
    "Hi Bob, are we meeting tomorrow?",
    "Congratulations, you won a prize",
    "Dear friend, let's catch up soon",
    "Win a free vacation now",
    "Are you available for a call?"
]

labels = [1, 0, 1, 0, 1, 0]

# Vectorize text data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train Naive Bayes classifier
nb = MultinomialNB(alpha=1)
nb.fit(X, labels)

# Access log probabilities for class spam (1)
log_probs_spam = nb.feature_log_prob_[1]

# Show log probability for first 5 features
for word, log_prob in zip(vectorizer.get_feature_names_out()[:5], log_probs_spam[:5]):
    print(f"Log P('{word}'|spam) = {log_prob:.4f}")


Log P('are'|spam) = -3.6889
Log P('available'|spam) = -3.6889
Log P('bob'|spam) = -3.6889
Log P('call'|spam) = -3.6889
Log P('catch'|spam) = -3.6889


## Zero-One Loss Function

**Definition:**  
The Zero-One Loss function is a simple way to measure prediction errors. It assigns:

- A loss of 0 if the prediction is **correct**.
- A loss of 1 if the prediction is **incorrect**.


### Formula:

For true label \( y \) and predicted label \( \hat{y} \):

$$
L(y, \hat{y}) = 
\begin{cases}
0, & \text{if } y = \hat{y} \\
1, & \text{if } y \neq \hat{y}
\end{cases}
$$


### Explanation:

- This loss treats all errors equally.
- It's used mostly in classification tasks to count misclassifications.


In [121]:
def zero_one_loss(y_true, y_pred):
    losses = [0 if true == pred else 1 for true, pred in zip(y_true, y_pred)]
    return sum(losses)

# Example usage:
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]

loss = zero_one_loss(true_labels, predicted_labels)
print(f"Total Zero-One Loss: {loss}")


Total Zero-One Loss: 2
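
Dividing the total by the number of examples gives the average zero-one loss, which is the misclassification rate and equals one minus accuracy. A small sketch on the same labels (the function name `mean_zero_one_loss` is illustrative):

```python
def mean_zero_one_loss(y_true, y_pred):
    """Average zero-one loss: the fraction of predictions that are wrong."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = sum(1 for true, pred in zip(y_true, y_pred) if true != pred)
    return errors / len(y_true)

true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]

error_rate = mean_zero_one_loss(true_labels, predicted_labels)
print(f"Mean Zero-One Loss: {error_rate:.2f}")
print(f"Accuracy: {1 - error_rate:.2f}")
```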


## Naive Bayes Classifier

**Definition:**  
A probabilistic classifier based on Bayes' Theorem with the "naïve" assumption of conditional independence between features.



### How it works:

Given a feature vector \( X = (X_1, X_2, ..., X_n) \) and class \( Y \), the classifier computes the posterior probability for each class:

$$
P(Y \mid X) = \frac{P(Y) \times \prod_{i=1}^{n} P(X_i \mid Y)}{P(X)}
$$

Since \( P(X) \) is the same for all classes, we predict the class that maximizes the numerator:

$$
\hat{Y} = \arg\max_Y \left[ P(Y) \times \prod_{i=1}^n P(X_i \mid Y) \right]
$$


### Intuition:

- Calculate prior \( P(Y) \) from class frequencies.
- Calculate likelihood \( P(X_i \mid Y) \) for each feature given class.
- Multiply priors and likelihoods to get the posterior (up to proportionality).
- Predict the class with the highest posterior probability.
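
The steps above can be sketched from scratch in plain Python. This is only an illustration, not how scikit-learn is implemented internally: it reuses the six sample emails from the earlier cells, assumes a tokenizer that mimics `CountVectorizer`'s defaults, and combines Laplace smoothing (`alpha=1.0`) with log probabilities. All function names are hypothetical.

```python
import math
import re
from collections import Counter

emails = [
    "Free money offer just for you",
    "Hi Bob, are we meeting tomorrow?",
    "Congratulations, you won a prize",
    "Dear friend, let's catch up soon",
    "Win a free vacation now",
    "Are you available for a call?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

def tokenize(text):
    # Assumption: lowercase tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: priors P(Y) from class frequencies (stored as logs)
classes = sorted(set(labels))
log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}

# Step 2: per-class word counts, used for the likelihoods P(X_i | Y)
word_counts = {c: Counter() for c in classes}
for text, label in zip(emails, labels):
    word_counts[label].update(tokenize(text))
vocabulary = set().union(*word_counts.values())

def log_likelihood(word, c, alpha=1.0):
    # Laplace-smoothed log P(word | class)
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocabulary)))

def predict(text):
    # Steps 3-4: add log prior and log likelihoods, then take the arg max
    scores = {}
    for c in classes:
        score = log_prior[c]
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += log_likelihood(word, c)
        scores[c] = score
    return max(scores, key=scores.get)

for text in ["Free prize for you", "Let's schedule a meeting"]:
    label = "Spam" if predict(text) == 1 else "Not Spam"
    print(f"Email: '{text}' -> Prediction: {label}")
```

Skipping out-of-vocabulary words is a design choice made here to match what `vectorizer.transform` does with unseen tokens.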


In [127]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data
emails = [
    "Win money now",
    "Hello friend, how are you?",
    "Claim your free prize",
    "Are we still meeting tomorrow?",
    "Exclusive offer just for you"
]

labels = [1, 0, 1, 0, 1]  # 1=spam, 0=not spam

# Convert text to numeric features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict on test set
y_pred = nb.predict(X_test)

# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))



Accuracy: 1.0


## Graphical Model of Naïve Bayes

Naïve Bayes can be represented as a **probabilistic graphical model** where:

- The class variable \( Y \) is the **parent node**.
- Each feature \( X_i \) is a **child node** dependent only on \( Y \).

This expresses the **conditional independence assumption**:

$$
P(X_1, X_2, ..., X_n \mid Y) = \prod_{i=1}^n P(X_i \mid Y)
$$

### Diagram (Simple):

          Y
        / | \
      X1  X2  X3


This means each feature is conditionally independent of others given the class.

### Intuition:

Knowing the class, features are independent — so we only need to learn the distribution of each feature given the class.
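
One way to see why this helps is to count parameters. The sketch below assumes binary features and ignores the class prior; the function names are illustrative. Under the naive factorization each class needs one number per feature, whereas an unrestricted table over all feature combinations grows exponentially.

```python
def naive_bayes_params(n_features, n_classes):
    """Free parameters for binary features under the naive Bayes factorization."""
    # One Bernoulli parameter P(X_i = 1 | Y = c) per feature and class
    return n_classes * n_features

def full_joint_params(n_features, n_classes):
    """Free parameters if P(X_1, ..., X_n | Y) were modeled without the assumption."""
    # A full table over 2**n feature combinations, minus 1 because it sums to 1
    return n_classes * (2 ** n_features - 1)

for n in (3, 10, 20):
    naive = naive_bayes_params(n, 2)
    full = full_joint_params(n, 2)
    print(f"n = {n:2d}: naive = {naive:3d}, full joint = {full:,}")
```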


In [136]:
# Sklearn Naive Bayes models implement this assumption internally
from sklearn.naive_bayes import MultinomialNB

# Features X1, X2, ..., Xn are assumed independent given class Y
# The model learns P(X_i | Y) for each feature separately and P(Y) as class prior
nb = MultinomialNB()
# When you fit nb.fit(X_train, y_train), the model learns P(X_i|Y) and P(Y)


## Conclusion
This notebook covered the basics of Naive Bayes — its use of prior and likelihood to estimate posterior probabilities. These concepts are especially useful in NLP, spam filtering, and real-world probabilistic models.

This notebook contributes to my public portfolio for Kaggle, GitHub, and GTV preparation.
Thank you. Keep learning! 👩🏼‍💻