In [15]:
import numpy as np
from matplotlib import pyplot as plt


Bayes' Theorem is a fundamental result in probability theory and statistics. It describes the probability of an event after evidence has been observed (the posterior), based on prior knowledge of conditions that might be related to the event.

The formula for Bayes' Theorem is:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

Where:
- $P(A|B)$ is the posterior probability of $A$ given $B$,
- $P(B|A)$ is the likelihood of $B$ given $A$,
- $P(A)$ is the prior probability of $A$,
- $P(B)$ is the marginal likelihood or evidence.
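
The evidence term $P(B)$ is rarely known directly. Assuming $A$ and its complement $\neg A$ are the only two possible cases, it can be expanded with the law of total probability, which is the form used in the problems that follow:

$$
P(B) = P(B|A) \cdot P(A) + P(B|\neg A) \cdot P(\neg A)
$$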


## Problem 1: Disease Diagnosis



### Objective

Simulate a scenario for medical testing using Bayes' Theorem.

#### Tasks

1. **Define Variables**
    - Prevalence of a disease.
    - Sensitivity of a test (true positive rate).
    - Specificity of the test (true negative rate).

2. **Write a Function**
    - Calculate the probability of having the disease given a positive test result using Bayes' Theorem.

3. **Run Simulations**
    - Use different prevalence rates and test accuracies to see how they affect the outcome.

#### Expected Learning Outcome

- Understand the critical impact of test accuracy (sensitivity and specificity) and disease prevalence on the reliability of medical diagnostics. This exercise aims to highlight how Bayes' Theorem can be applied to real-world problems in medical diagnostics, showcasing the importance of considering these factors in interpreting test results.

---


### Example Scenario

Suppose there is a rare disease that affects 1 out of every 1000 people. A test for this disease is 99% accurate, meaning it correctly identifies 99% of people who have the disease (true positive) and 99% of people who don't have the disease (true negative).

**Question:** If a person tests positive, what is the probability that they actually have the disease?

**Application of Bayes' Theorem:**

- **Prior Probability** $P(\text{Disease})$ = 0.001 (1 in 1000)
- **Likelihood** $P(\text{Positive} | \text{Disease})$ = 0.99
- **Marginal Probability** $P(\text{Positive}) = P(\text{Disease}) \times P(\text{Positive} | \text{Disease}) + P(\text{No Disease}) \times P(\text{Positive} | \text{No Disease})$
- **Posterior Probability** $P(\text{Disease} | \text{Positive})$ = ?
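
Plugging in the numbers, and taking $P(\text{Positive} | \text{No Disease}) = 1 - 0.99 = 0.01$ from the stated 99% true negative rate, the calculation works out as follows:

$$
P(\text{Positive}) = 0.001 \times 0.99 + 0.999 \times 0.01 = 0.01098
$$

$$
P(\text{Disease} | \text{Positive}) = \frac{0.99 \times 0.001}{0.01098} \approx 0.0902
$$

A positive result therefore implies only about a 9% chance of disease, because false positives among the many healthy people outnumber true positives among the few who are ill.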

---


In [20]:
def Bayes_medical_diagnosis(p_disease, TP, TN):
    """
    Compute P(Disease | Positive) using Bayes' Theorem.

    Parameters:
    p_disease: P(Disease), the prevalence (prior)
    TP: P(Positive | Disease), the sensitivity
    TN: P(Negative | No Disease), the specificity
    """
    # Prior probability of no disease: P(~Disease) = 1 - P(Disease)
    p_no_disease = 1 - p_disease
    # False positive rate: P(Positive | No Disease) = 1 - P(Negative | No Disease) = 1 - TN
    p_positive_no_disease = 1 - TN
    # Marginal probability: P(Positive)
    p_positive = p_disease * TP + p_no_disease * p_positive_no_disease

    # Posterior probability: P(Disease | Positive)
    p_disease_positive = (TP * p_disease) / p_positive

    return p_disease_positive


p_disease = 0.001
TP = 0.99
TN = 0.99
posterior = Bayes_medical_diagnosis(p_disease, TP, TN)
print(f"The probability of disease given positive test: {posterior:.4f}")

The probability of disease given positive test: 0.0902
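
Task 3 asks for simulations across different prevalence rates and test accuracies. The following is a minimal sketch of such a sweep, assuming for simplicity that sensitivity and specificity are equal. The helper name `posterior_given_positive` and the grid values are illustrative choices, not part of the original exercise or any real test.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(Disease | Positive) via Bayes' Theorem."""
    false_positive_rate = 1 - specificity
    p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
    return prevalence * sensitivity / p_positive


# Illustrative grid only (assumed values, not measured data).
prevalences = [0.001, 0.01, 0.1]
accuracies = [0.90, 0.99, 0.999]  # used for both sensitivity and specificity

results = {}
for prevalence in prevalences:
    for accuracy in accuracies:
        value = posterior_given_positive(prevalence, accuracy, accuracy)
        results[(prevalence, accuracy)] = value
        print(f"prevalence={prevalence:<6} accuracy={accuracy:<6} "
              f"P(Disease|Positive)={value:.4f}")
```

In this sketch the posterior rises with both prevalence and accuracy, which is why the same 99% accurate test is far more informative for a common condition than for a rare one.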


## Problem 2: Spam Detection
### Part 1: Solving the Spam Email Scenario


An email filter is designed to catch spam emails. It correctly identifies 99% of spam and correctly passes through 99% of regular emails.

**Question:** If an email is marked as spam, what is the probability that it really is spam, given that 5% of all emails are actually spam?

Given:
- $P(A) = 0.05$: probability that an email is spam (prior)
- $P(B)$: marginal probability that an email is marked as spam
- $P(B|A) = 0.99$: probability of marking an email as spam when it actually is spam
- $P(\neg B|\neg A) = 0.99$: probability of passing a regular email (not marking it as spam when it is not spam)
- $P(B|\neg A) = 1 - 0.99 = 0.01$: probability of marking an email as spam when it is not spam
- $P(A|B)$: posterior probability that an email is spam given that it is marked as spam


$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

In [17]:

def bayes_spam(p_a, p_b_given_a, p_not_b_given_not_a):
    """
    Compute P(A|B), the probability an email is spam given it is marked as spam.

    Parameters:
    p_a: P(A), prior probability of an email being spam
    p_b_given_a: P(B|A), probability of marking spam as spam
    p_not_b_given_not_a: P(~B|~A), probability of passing a regular email
    """
    # Calculate P(B|~A), the probability the filter incorrectly marks a regular email as spam
    p_b_given_not_a = 1 - p_not_b_given_not_a

    # Prior probability of a regular email, P(~A)
    p_not_a = 1 - p_a
    # Calculate P(B), the marginal probability an email is marked as spam
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)

    # Calculate P(A|B), the probability an email is actually spam given it's marked as spam
    p_a_given_b = (p_b_given_a * p_a) / p_b

    return p_a_given_b

# Example usage
p_a = 0.05  # 5% of all emails are actually spam
p_b_given_a = 0.99  # 99% success rate of identifying spam, P(B|A)
p_not_b_given_not_a = 0.99  # 99% success rate of correctly passing through regular emails, P(~B|~A)

# Calculate the probability
probability = bayes_spam(p_a, p_b_given_a, p_not_b_given_not_a)
print(f"The probability that an email is actually spam given it is marked as spam: {probability:.2f}")


The probability that an email is actually spam given it is marked as spam: 0.84
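
One way to sanity-check the analytical result is to simulate many emails passing through a filter with the stated rates and count how many flagged emails really are spam. The sketch below uses only the standard library. The function name `simulate_spam_filter`, the sample size, and the seed are assumptions made for illustration, and the estimate will vary slightly with the seed.

```python
import random


def simulate_spam_filter(n_emails, p_spam, p_flag_given_spam, p_pass_given_regular, seed=0):
    """Estimate P(spam | flagged) by simulating n_emails through a hypothetical filter."""
    rng = random.Random(seed)
    flagged = 0
    flagged_and_spam = 0
    for _ in range(n_emails):
        is_spam = rng.random() < p_spam
        if is_spam:
            # Spam is flagged with probability P(B|A)
            is_flagged = rng.random() < p_flag_given_spam
        else:
            # Regular email is flagged with probability 1 - P(~B|~A)
            is_flagged = rng.random() >= p_pass_given_regular
        if is_flagged:
            flagged += 1
            flagged_and_spam += is_spam
    # Assumes at least one email was flagged, which holds for large n_emails
    return flagged_and_spam / flagged


estimate = simulate_spam_filter(200_000, 0.05, 0.99, 0.99)
print(f"Simulated P(spam | marked as spam): {estimate:.2f}")
```

With a large sample the simulated fraction should land close to the value computed from Bayes' Theorem, since both describe the same share of true spam among flagged emails.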



### Part 2: Objective

Use Bayes' Theorem to calculate the probability of an email being spam.

#### Tasks

1. **Define Probabilities Based on a Given Dataset**
    - Probability of an email being spam.
    - Probability of certain keywords appearing in spam and non-spam emails.

2. **Create a Function**
    - Compute the probability of an email being spam given the presence of certain keywords using Bayes' Theorem.

3. **Test the Function**
    - Test the function with different sets of keywords.


### Step 1: Define Probabilities Based on a Given Dataset

We will use a simplified dataset with the following characteristics:

- **Total emails**: 100
- **Spam emails**: 20 (20% of all emails)
- **Non-spam (Regular) emails**: 80 (80% of all emails)

We focus on two keywords and their distribution in spam and non-spam emails:

- **Keyword "offer"**: 
  - Appears in 15 out of 20 spam emails (75%).
  - Appears in 10 out of 80 regular emails (12.5%).

- **Keyword "free"**: 
  - Appears in 10 out of 20 spam emails (50%).
  - Appears in 5 out of 80 regular emails (6.25%).

Based on this data:

- The prior probability of an email being spam, $P(A)$, is 0.2.
- The probabilities of keywords appearing in spam emails, $P(B|A)$, are 0.75 for "offer" and 0.50 for "free".
- The probabilities of keywords appearing in regular emails, $P(B|\neg A)$, are 0.125 for "offer" and 0.0625 for "free".
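
These probabilities follow directly from the counts in the dataset. A short sketch of that derivation is given here; the variable names are illustrative and chosen so they do not clash with the ones used later in this notebook.

```python
# Counts copied from the simplified dataset described above.
n_spam, n_regular = 20, 80
keyword_counts_spam = {"offer": 15, "free": 10}
keyword_counts_regular = {"offer": 10, "free": 5}

# Prior: fraction of all emails that are spam.
p_spam = n_spam / (n_spam + n_regular)

# Likelihoods: fraction of emails in each class that contain the keyword.
likelihood_spam = {word: count / n_spam for word, count in keyword_counts_spam.items()}
likelihood_regular = {word: count / n_regular for word, count in keyword_counts_regular.items()}

print(f"P(A) = {p_spam}")
print(f"P(B|A) = {likelihood_spam}")
print(f"P(B|~A) = {likelihood_regular}")
```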


### Step 2: Create a Function to Compute the Probability

A function is defined to compute the probability of an email being spam given the presence of certain keywords, using Bayes' Theorem. This function takes the prior probability of spam, the probabilities of keywords appearing in spam and regular emails, and a list of keywords present in the email being analyzed.


In [22]:
# Define the function to compute the probability of spam given keywords
def bayes_spam_keywords(p_a, keyword_probs_spam, keyword_probs_regular, keywords_present):
    """
    Compute the probability of an email being spam given the presence of certain keywords.

    When several keywords are present, their probabilities are averaged into a single
    likelihood. This is a simplification, not the joint probability of all keywords
    appearing together.

    Parameters:
    - p_a (float): Prior probability of an email being spam.
    - keyword_probs_spam (dict): Probabilities of keywords appearing in spam emails.
    - keyword_probs_regular (dict): Probabilities of keywords appearing in regular emails.
    - keywords_present (list): Non-empty list of keywords present in the email being analyzed.

    Returns:
    - Probability that an email is spam given the presence of the keywords.
    """
    # Calculate the average probability of the present keywords appearing in spam emails
    p_b_given_a = sum(keyword_probs_spam[key] for key in keywords_present) / len(keywords_present)

    # Calculate the average probability of the present keywords appearing in regular emails
    p_b_given_not_a = sum(keyword_probs_regular[key] for key in keywords_present) / len(keywords_present)

    # Calculate the prior probability of an email not being spam
    p_not_a = 1 - p_a

    # Calculate the marginal probability of the keywords being present in any email
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)

    # Calculate the posterior probability of an email being spam given the presence of certain keywords
    p_a_given_b = (p_b_given_a * p_a) / p_b

    # Return the calculated probability
    return p_a_given_b




### Step 3: Test the Function with Different Sets of Keywords

The function is tested with different sets of keywords, such as ["offer"], ["free"], and ["offer", "free"], to observe how the presence of these keywords affects the computed probability of an email being classified as spam.


In [23]:
# Define probabilities for keywords in spam and regular emails
keyword_probs_spam = {'offer': 0.75, 'free': 0.50}  # Probabilities of keywords in spam emails
keyword_probs_regular = {'offer': 0.125, 'free': 0.0625}  # Probabilities of keywords in regular emails

# Define test sets of keywords
test_keywords = [['offer'], ['free'], ['offer', 'free']]  # Different sets of keywords to test

# Loop through each set of keywords to test the function
for keywords in test_keywords:
    # Calculate the probability of an email being spam given the keywords
    probability = bayes_spam_keywords(0.2, keyword_probs_spam, keyword_probs_regular, keywords)
    # Print the probability for each set of keywords
    print(f"Keywords: {keywords} => Probability of spam: {probability:.2f}")


Keywords: ['offer'] => Probability of spam: 0.60
Keywords: ['free'] => Probability of spam: 0.67
Keywords: ['offer', 'free'] => Probability of spam: 0.62
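
Note that the result for both keywords together (0.62) falls between the two single-keyword results, because the function above averages the per-keyword probabilities. A common alternative is the naive Bayes assumption: keywords are treated as conditionally independent given the class, so their likelihoods are multiplied rather than averaged. The sketch below illustrates that variant. The name `naive_bayes_spam` is hypothetical, and the sketch only accounts for keywords that are present, ignoring any information carried by absent ones.

```python
from math import prod


def naive_bayes_spam(p_spam, probs_spam, probs_regular, keywords_present):
    """P(spam | all listed keywords present), assuming the keywords are
    conditionally independent given the class (naive Bayes assumption)."""
    likelihood_spam = prod(probs_spam[key] for key in keywords_present)
    likelihood_regular = prod(probs_regular[key] for key in keywords_present)
    numerator = likelihood_spam * p_spam
    evidence = numerator + likelihood_regular * (1 - p_spam)
    return numerator / evidence


# Same keyword probabilities as in the test cell above.
probs_spam = {"offer": 0.75, "free": 0.50}
probs_regular = {"offer": 0.125, "free": 0.0625}

for keywords in (["offer"], ["free"], ["offer", "free"]):
    p = naive_bayes_spam(0.2, probs_spam, probs_regular, keywords)
    print(f"Keywords: {keywords} => Probability of spam: {p:.2f}")
```

For a single keyword this gives the same answer as the averaging approach. With both keywords present it gives roughly 0.92, reflecting that two spam-associated words together are stronger evidence than either one alone, provided the independence assumption is reasonable for the data.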
