<code style="background:yellow;color:red">
    <b>There are 8 lab questions for this notebook. Look for sections that start "Lab part..."</b>
</code>

### Introduction to Bayes' theorem

We're going to be looking at a hypothetical disease known as "bayesitis". This disease has a base rate of 1%, meaning that in any random selection of people, we would expect about 1% of them to have bayesitis.

We have a test that can detect bayesitis. Like most diagnostic tests, it's not perfect. It has a true positive rate of 90%, meaning that if someone has bayesitis, the test will correctly identify them 90% of the time. It also has a false positive rate of 5%, meaning that if someone doesn't have bayesitis, the test will incorrectly identify them as having it 5% of the time.

The questions we're trying to answer are:

- If a person tests positive for bayesitis, what is the probability that they actually have the disease?
- If a person tests negative for bayesitis, what is the probability that they do have the disease?

These questions are best answered using Bayes' theorem, a fundamental concept in probability theory and statistics that addresses how to update the probabilities of hypotheses when given evidence.

---

## Statistical reasoning

Let's try a statistical reasoning question like the ones we talked about in class. The next code cell provides a simple widget to ask the question. This one is kind of superfluous; it really just asks the question...

In [1]:
import ipywidgets as widgets

# Question 1
guidance = """
    For probabilities, enter a value between 0.000 and 1.000 (where 1.0 = 100%).
    For confidence, 1 = not at all sure, 4 = medium, 7 = completely sure.
    
    Again, our test has a true positive rate of 90%, and a false positive rate of 5%.
    1% of the population has the disease.
    """
print(guidance)

print("1. If a person tests positive for bayesitis, what is the probability that they actually have the disease?")

# Create and display the widgets for Question 1
guess_pos_test = widgets.FloatSlider(
    value=0.5,
    min=0,
    max=1.0,
    step=0.01,
    description='Guess:',
    readout_format='.2f',
)
confidence_pos_test = widgets.IntSlider(
    value=4,
    min=1,
    max=7,
    step=1,
    description='Confidence:',
)
display(guess_pos_test, confidence_pos_test)

# Question 2
print("\n2. If a person tests negative for bayesitis, what is the probability that they do have the disease?")

# Create and display the widgets for Question 2
guess_neg_test = widgets.FloatSlider(
    value=0.5,
    min=0,
    max=1.0,
    step=0.01,
    description='Guess:',
    readout_format='.2f',
)
confidence_neg_test = widgets.IntSlider(
    value=4,
    min=1,
    max=7,
    step=1,
    description='Confidence:',
)
display(guess_neg_test, confidence_neg_test)




## Lab part 1

1. Report your initial guesses and confidence for the 2 questions above. We will get to the probability-based answer below. For now, please just give your honest first answer. (Given the timing, we will do this example in class -- report what your initial response was in class, please.)

---

## Understanding Bayes' Theorem

Bayes' theorem is a fundamental concept in the field of probability and statistics that describes how to update the probabilities of hypotheses when given evidence. It's named after Thomas Bayes, who first provided an equation that allows new evidence to update beliefs in his "An Essay towards solving a Problem in the Doctrine of Chances" (1763). It's articulated as:

$$
P(A|B) = \frac{P(B|A) P(A)}{P(B)}
$$


> **Refresher:** to read this out loud we would say:

>> *The probability of A given B = the probability of B given A multiplied by the probability of A over the probability of B.* 

> So note that when we have 2 terms with no operator between them, multiplication is implied. Thus,  $P(B|A)P(A)$ is the same as $P(B|A) \times P(A)$ .


Where:
- $P(A|B)$ is the posterior probability, the updated probability of event A occurring given that event B has occurred (the probability of having the disease *given* that you have a positive test). *Note: there are some subtleties to the updating that we need to discuss.*
- $P(B|A)$ is the likelihood, the probability of event B occurring given that event A has occurred (e.g., the probability of a positive test if you really do have the disease).
- $P(A)$ is the prior probability, the initial degree of belief in A (for example, the probability of having the disease, or the *base rate*).
- $P(B)$ is the marginal likelihood, the total probability of observing evidence B (e.g., the probability of having a positive test, whether or not you have the disease).
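
One standard way to see where the formula comes from (a sketch rather than a full proof) starts from the definition of conditional probability. It assumes $P(A) > 0$ and $P(B) > 0$, and writes $A \cap B$ for the event that both A and B occur; that notation is introduced here and is not used elsewhere in this notebook.

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)}
$$

Solving the second equation for the joint probability gives $P(A \cap B) = P(B|A) P(A)$. Substituting that into the first equation yields $P(A|B) = \frac{P(B|A) P(A)}{P(B)}$, which is Bayes' theorem as stated above.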

In our context:
- Event A is the event that a person has bayesitis.
- Event B is the event that a person tests positive or negative.

We can rewrite Bayes' theorem as follows for our context:

$$
P(\text{Disease}|\text{Test}) = \frac{P(\text{Test}|\text{Disease}) P(\text{Disease})}{P(\text{Test})}
$$

- $P(\text{Disease}|\text{Test})$ is what we want to find: the probability that a person has the disease given that their test result is positive or negative (Posterior Probability).
- $P(\text{Test}|\text{Disease})$ is the probability that a person tests positive or negative given that they have the disease (Likelihood).
- $P(\text{Disease})$ is the base rate, the overall probability that a person (randomly selected from the population) has the disease (Prior Probability).
- $P(\text{Test})$ is the overall probability that a person (randomly selected from the population) tests positive or negative (Marginal Likelihood).

We will use this formulation to calculate the probabilities in the following interactive section.
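
The denominator $P(\text{Test})$ is not given directly in the problem statement. A common way to obtain it, and the approach assumed in this sketch, is the law of total probability, which splits the population into people with and without the disease ($\neg\text{Disease}$ is shorthand introduced here for "does not have the disease"):

$$
P(\text{Test}) = P(\text{Test}|\text{Disease}) P(\text{Disease}) + P(\text{Test}|\neg\text{Disease}) P(\neg\text{Disease}), \qquad P(\neg\text{Disease}) = 1 - P(\text{Disease})
$$

For a positive test with the values stated earlier (true positive rate 0.90, false positive rate 0.05, base rate 0.01):

$$
P(\text{Positive}) = 0.90 \times 0.01 + 0.05 \times 0.99 = 0.009 + 0.0495 = 0.0585
$$

For a negative test, the same expansion applies with the complementary rates: $P(\text{Negative}|\text{Disease}) = 1 - 0.90 = 0.10$ and $P(\text{Negative}|\neg\text{Disease}) = 1 - 0.05 = 0.95$.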


---

### To think about

***Exercise:*** As we proceed, try restating our specific examples in a more generic way. Instead of 'what is the probability of having the disease given a positive test?', the more generic formulation is 'what is the probability of the outcome given the evidence?'. Similarly, the prevalence of the disease, typically stated as something like '1% of the population is infected', is the probability of the outcome before seeing any evidence, also called the *base rate*.

***Assumptions:*** Positive and negative tests correspond to $E$ and $\neg E$ (Evidence and **not** Evidence, where 'Evidence' means the evidence is 'true', as in a positive test). Note that saying $E$ is true or false for every individual in a population or sample implies that **everyone** has been tested. This is only an assumption for the purposes of doing the calculation. The real assumption is conditional: **if we could test everyone in the population**, or **if we could test everyone in a random sample**, these would be our best estimates of the probabilities.
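
The generic restatement suggested in the exercise can be sketched in code. This is a minimal illustration, not part of the notebook's widgets: the function name `posterior` and its parameter names are hypothetical, chosen to mirror the outcome/evidence wording above.

```python
def posterior(p_outcome, p_evidence_given_outcome, p_evidence_given_not_outcome):
    """Return P(outcome | evidence) using Bayes' theorem.

    The names are illustrative only. Returns None when P(evidence) is zero,
    because the posterior is undefined in that case.
    """
    # Marginal probability of the evidence (law of total probability)
    p_evidence = (p_evidence_given_outcome * p_outcome
                  + p_evidence_given_not_outcome * (1 - p_outcome))
    if p_evidence == 0:
        return None
    return p_evidence_given_outcome * p_outcome / p_evidence


# Disease example: outcome = has bayesitis, evidence E = positive test
p_pos = posterior(0.01, 0.90, 0.05)

# Evidence "not E" = negative test: pass the complementary rates instead
p_neg = posterior(0.01, 1 - 0.90, 1 - 0.05)

print(f"P(outcome | E)     = {p_pos:.4f}")
print(f"P(outcome | not E) = {p_neg:.4f}")
```

The same function handles both questions because a negative test is simply a different piece of evidence, with its own probability under each hypothesis.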
 

---

## Bayes' widget

The interactive widget below lets you explore how the base rate, hit rate (true positive rate), and false alarm rate (false positive rate) influence the probability of having or not having the disease based on evidence from the test. Make sure to 'run' the code cell if you do not see the Base Rate, True Positive Rate, and False Positive Rate sliders below the next code cell.


In [3]:
import ipywidgets as widgets
#from ipywidgets import Layout
from IPython.display import display, HTML

# Custom CSS to make the text field wider 
custom_css = """
<style>
.widget-label { width: 40% !important; }
.widget-slider { width: 60% !important; } /* Adjust the slider width */
</style>
"""
display(HTML(custom_css))

# Create the sliders
base_rate = widgets.FloatSlider(
    value=0.01,
    min=0,
    max=1.0,
    step=0.01,
    description='Base Rate:',
    readout_format='.2f',
    #layout = layout
    #description_width='400px'  # Set description_width to 'initial' to allow for wider labels
)

true_positive_rate = widgets.FloatSlider(
    value=0.90,
    min=0,
    max=1.0,
    step=0.01,
    description='True Positive Rate:',
    readout_format='.2f',
    #description_width=200  # Set description_width to 'initial' to allow for wider labels
)

false_positive_rate = widgets.FloatSlider(
    value=0.05,
    min=0,
    max=1.0,
    step=0.01,
    description='False Positive Rate:',
    readout_format='.2f',
    #description_width='initial'  # Set description_width to 'initial' to allow for wider labels
)

# Create an output widget to display the probabilities
output_widget = widgets.Output()

# Function to calculate and update the labels with the posterior probabilities
def calculate_posterior(base_rate, true_positive_rate, false_positive_rate):
    # Clear previous output
    output_widget.clear_output()
    
    with output_widget:
        # Calculate the probability of a positive test
        prob_test_positive = (true_positive_rate * base_rate) + (false_positive_rate * (1 - base_rate))

        # Calculate the probability of a negative test
        prob_test_negative = 1 - prob_test_positive

        # Bayes' theorem for positive test result
        if prob_test_positive > 0:
            prob_disease_given_pos_test = (true_positive_rate * base_rate) / prob_test_positive
            print(f"Probability of having bayesitis given a positive test: {prob_disease_given_pos_test:.4f}")
        else: 
            print(f"Probability of having bayesitis given a positive test: NOT DEFINED BECAUSE PROBABILITY OF A POSITIVE TEST IS ZERO")

        # Bayes' theorem for negative test result
        if prob_test_negative > 0:
            prob_disease_given_neg_test = ((1 - true_positive_rate) * base_rate) / prob_test_negative
            print(f"Probability of having bayesitis given a negative test: {prob_disease_given_neg_test:.4f}")
        else: 
            print(f"Probability of having bayesitis given a negative test: NOT DEFINED BECAUSE PROBABILITY OF A NEGATIVE TEST IS ZERO")

# Create interactive output
out = widgets.interactive_output(
    calculate_posterior, 
    {
        'base_rate': base_rate, 
        'true_positive_rate': true_positive_rate, 
        'false_positive_rate': false_positive_rate
    }
)

# Display everything
widgets.VBox([widgets.VBox([base_rate, true_positive_rate, false_positive_rate]), out, output_widget])



## Lab part 2

2. Compare the correct answer to your initial guess above; can you explain why your initial guess was different than (or the same as) the correct answer?

3. With the default values, what are (a) the probability of **not** having bayesitis given a positive test, and (b) the probability of **not** having bayesitis given a negative test? 

`Note: for questions 4-7, you might find it helpful to use the visualization example at the very end of this notebook.`

4. Explore the Base Rate slider. What changes about the two probabilities as the base rate increases? You don't have to do an exhaustive exploration; try to identify a few representative values (one high, one medium, one low) and also explore setting it to 0 and 1.

5. Put everything back to default values (by re-running the code cell). Now see what happens as you change the true positive rate. You don't have to do an exhaustive exploration; try to identify a few representative values (one high, one medium, one low) and also explore setting it to 0 and 1.

6.  Put everything back to default values (by re-running the code cell). Now see what happens as you change the false positive rate. You don't have to do an exhaustive exploration; try to identify a few representative values (one high, one medium, one low) and also explore setting it to 0 and 1.

7. **Optional challenge question.** Do a little exploration where you change more than 1 parameter at a time. You don't have to be exhaustive. Maybe explore for 5 minutes and see if you can identify 3 cases where you think people might be really bad at guessing the correct answers, and explain why. 

## Frequency format

It can be helpful to think about this concretely in terms of discrete numbers. 
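
As a minimal sketch of what "discrete numbers" means here, the four possible combinations of disease status and test result can be laid out as expected counts. The helper `expected_counts` and its dictionary keys are hypothetical names introduced for this illustration, and the counts are left unrounded.

```python
def expected_counts(population, base_rate, true_positive_rate, false_positive_rate):
    """Expected (unrounded) number of people in each cell of the 2x2 table."""
    haves = base_rate * population   # people with the disease
    nots = population - haves        # people without the disease
    return {
        "true_pos": true_positive_rate * haves,
        "false_neg": (1 - true_positive_rate) * haves,
        "false_pos": false_positive_rate * nots,
        "true_neg": (1 - false_positive_rate) * nots,
    }


counts = expected_counts(1000, 0.01, 0.9, 0.05)

# The four cells partition the population, so they should sum to it
total = sum(counts.values())

# Posteriors read directly off the counts
p_disease_given_pos = counts["true_pos"] / (counts["true_pos"] + counts["false_pos"])
p_disease_given_neg = counts["false_neg"] / (counts["false_neg"] + counts["true_neg"])

for name, value in counts.items():
    print(f"{name:>9}: {value:.1f}")
print(f"    total: {total:.1f}")
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
print(f"P(disease | negative) = {p_disease_given_neg:.4f}")
```

Dividing one cell by the sum of its column (everyone with the same test result) gives the same posterior as Bayes' theorem, because the population size cancels out of the ratio.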

In the code cell below, we assume a population of 1000 and then get actual numbers of individuals *predicted* to have the disease, to test positive, and to test negative. You can modify the values and see what happens (e.g., try changing the population to 100). You can see the original values in the comments at the top. 

*Click run in the code cell if you do not see text between the code cell and **Lab part 3**. Note that you can also choose 'Run all cells' from the 'Run' menu.*

In [None]:
# population = 1000
# base_rate = 0.01
# true_positive_rate = 0.9
# false_positive_rate = 0.05

population = 1000
base_rate = 0.01
true_positive_rate = 0.9
false_positive_rate = 0.05

haves = base_rate * population
nots = population - (base_rate * population)

trues = true_positive_rate * haves
falses = false_positive_rate * nots
all_pos = trues + falses

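# Disease given a positive test: true positives / all positives
p_has = trues / (trues + falses)
# Disease given a negative test: false negatives / (false negatives + true negatives)
p_not = (haves - trues) / ((haves - trues) + (nots - falses))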

approx_prob = round(trues) / ( round(trues) + round(falses) )

print(f'Haves {haves}, Nots {nots}, p_has {p_has}, p_not {p_not}')

print(f'''
We are going to round these numbers now to think about them as individuals. 

That is, approximately {haves:.0f} have it, and approximately {nots:.0f} do not. 
Among those who have the disease, approximately {trues:.0f} will test positive. 
Among those who do not have the disease, approximately {falses:.0f} will also test 
positive.

So while approximately {all_pos:.0f} people are expected to test positive, only 
approximately {trues:.0f} of those have the disease.

So the probability of having the disease given a positive test is approximately {trues:.0f}/({trues:.0f} + {falses:.0f}) = {approx_prob:.4f}. ''')

## Lab part 3

8. Try changing the population to 100 in the example above. What probability do you get? Now change it to 10000. Now what do you get? What are the implications of these differences? 

*To do this, in the code cell above, change the line that is `population = 1000` to `population = 100` and then run the code cell again.*

---

## A visualization example

This is for you to explore. The visualizations below give you a pie chart to represent probabilities. The rectangular plot tries to plot one cell for each individual. It gets fuzzy (literally) when the population is too large (more than 1500?). 


### Lab part 4
This prompts one final challenge question.

9. **Optional challenge question:** Can you improve the 2nd visualization? This is very open ended and subjective. 

In [None]:
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt

# Create the population size slider
population_size = widgets.IntSlider(
    value=1000,
    min=100,
    max=10000,
    step=100,
    description='Population Size:',
)
base_rate = widgets.FloatSlider(
    value=0.01,
    min=0,
    max=1.0,
    step=0.01,
    description='Base Rate:',
    readout_format='.2f',
)

true_positive_rate = widgets.FloatSlider(
    value=0.90,
    min=0,
    max=1.0,
    step=0.01,
    description='True Positive Rate:',
    readout_format='.2f',
)

false_positive_rate = widgets.FloatSlider(
    value=0.05,
    min=0,
    max=1.0,
    step=0.01,
    description='False Positive Rate:',
    readout_format='.2f',
)


# Create labels for displaying the probabilities
pos_test_label = widgets.Label()
neg_test_label = widgets.Label()

true_positive_label = widgets.Label()
false_positive_label = widgets.Label()
true_negative_label = widgets.Label()
false_negative_label = widgets.Label()
excluded_label = widgets.Label()

# Colors for each category in RGB format
colors = [(1, 0.5, 0), (1, 0, 0), (0, 1, 0), (1, 0, 1)]

# Function to calculate and update the labels and plots with the posterior probabilities
def calculate_posterior(base_rate, true_positive_rate, false_positive_rate, population_size):
    # Calculate the probability of a positive test
    prob_test_positive = (true_positive_rate * base_rate) + (false_positive_rate * (1 - base_rate))
    
    # Calculate the probability of a negative test
    prob_test_negative = 1 - prob_test_positive
    
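    # Bayes' theorem for positive test result (undefined if no one can test positive)
    if prob_test_positive > 0:
        prob_disease_given_pos_test = (true_positive_rate * base_rate) / prob_test_positive
        pos_test_label.value = f"Probability of bayesitis given a positive test: {prob_disease_given_pos_test:.4f}"
    else:
        pos_test_label.value = "Probability of bayesitis given a positive test: not defined"
    
    # Bayes' theorem for negative test result (undefined if no one can test negative)
    if prob_test_negative > 0:
        prob_disease_given_neg_test = ((1 - true_positive_rate) * base_rate) / prob_test_negative
        neg_test_label.value = f"Probability of bayesitis given a negative test: {prob_disease_given_neg_test:.4f}"
    else:
        neg_test_label.value = "Probability of bayesitis given a negative test: not defined"
    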
    
    # Create a pie chart
    true_positives = true_positive_rate * base_rate
    false_negatives = (1 - true_positive_rate) * base_rate
    true_negatives = (1 - false_positive_rate) * (1 - base_rate)
    false_positives = false_positive_rate * (1 - base_rate)
    
    plt.figure(figsize=(4,4))
    plt.pie(
        [true_positives, false_positives, true_negatives, false_negatives],
        labels=['True Positives', 'False Positives', 'True Negatives', 'False Negatives'],
        autopct='%1.1f%%',
        colors=colors
    )
    plt.show()
    
    # Calculate number of individuals in each category
    n_true_positives = round(true_positives * population_size)
    n_false_negatives = round(false_negatives * population_size)
    n_true_negatives = round(true_negatives * population_size)
    n_false_positives = population_size - n_true_positives - n_false_negatives - n_true_negatives

    # Convert the rounded counts to observed proportions of the population
    obs_tp = n_true_positives / population_size
    obs_fn = n_false_negatives / population_size
    obs_tn = n_true_negatives / population_size
    obs_fp = n_false_positives / population_size

    # Update labels with each count and its observed proportion
    true_positive_label.value = f"True Positive count: {n_true_positives} ({obs_tp:.4f})"
    false_negative_label.value = f"False Negative count: {n_false_negatives} ({obs_fn:.4f})"
    true_negative_label.value = f"True Negative count: {n_true_negatives} ({obs_tn:.4f})"
    false_positive_label.value = f"False Positive count: {n_false_positives} ({obs_fp:.4f})"
    excluded_label.value = f"Excluded by Rounding: {population_size - n_true_positives - n_false_negatives - n_true_negatives - n_false_positives}"


    # Create grid
    grid_rows = 10  # fixed number of rows
    grid_cols = int(np.ceil(population_size / grid_rows))  # variable number of columns based on population size
    grid = np.zeros((grid_rows, grid_cols, 3))  # initialize grid with zeros
    current_index = 0

    # Fill grid with colors representing each category
    for n, color in zip(
        [n_true_positives, n_false_positives, n_true_negatives, n_false_negatives],  # adjusted order of categories
        colors  # colors for each category
    ):
        for _ in range(n):
            x, y = divmod(current_index, grid_cols)
            grid[x, y] = color
            current_index += 1

    # Display the grid
    plt.figure(figsize=(6,6))
    plt.imshow(grid, aspect='auto')  # set aspect='auto' to allow the plot to fill the width
    plt.axis('off')

    # Create custom legend
    import matplotlib.patches as mpatches
    legend_elements = [mpatches.Patch(color=color, label=label) 
                       for color, label in zip(colors, 
                                               ['True Positives', 
                                                'False Positives', 
                                                'True Negatives', 
                                                'False Negatives'])]
    plt.legend(handles=legend_elements, loc='upper right', bbox_to_anchor=(1.3, 1))
    plt.show()

# Create interactive output
out = widgets.interactive_output(
    calculate_posterior, 
    {
        'base_rate': base_rate, 
        'true_positive_rate': true_positive_rate, 
        'false_positive_rate': false_positive_rate,
        'population_size': population_size
    }
)


# Display everything
# widgets.VBox([widgets.VBox([base_rate, true_positive_rate, false_positive_rate, population_size]), \
#               out, pos_test_label, neg_test_label, true_positive_label, false_positive_label, \
#               true_negative_label, false_negative_label, excluded_label])

widgets.VBox([
    widgets.HBox([base_rate, population_size]),  # Place these on the same line
    widgets.HBox([true_positive_rate, false_positive_rate]),  # Place these on the same line
    out, 
    widgets.HBox([pos_test_label,      neg_test_label]), 
    widgets.HBox([true_positive_label, false_positive_label]), 
    widgets.HBox([true_negative_label, false_negative_label]), 
    #excluded_label
    
])


In [None]:
## Improvement ideas
