# Chapter 2: A case study in conditional probabilities and Bayes' Theorem

In the heart of Aquilonia, a picturesque town nestled amidst rolling hills and lush greenery, a mysterious ailment plagues the citizens: *hydromechanical trepidation syndrome*, or *HTS*. This peculiar affliction instills in the townsfolk a perpetual fear, not of life's typical perils, but of the water fountains that adorn the town square.

For generations, Aquilonians had lived harmoniously with the babbling brooks and glistening lakes that graced their idyllic landscape. Water had been their lifeblood, revered as a symbol of purity and serenity. But something had shifted; a shadow of dread had cast itself over their tranquil existence. Fear now gripped the townsfolk, causing them to cross the street to avoid passing too near a fountain. Even the town's cherished elder, Old Man Willoughby, found himself trembling at the sight of one.

Amidst this despair, the town council turned to *you*, a statistician well-versed in epidemiology and the analysis of unusual outbreaks. *Your mission:* to assess the effectiveness of a newly developed test for detecting HTS in Aquilonia. Your role involves scrutinizing the test's accuracy and reliability in identifying those afflicted with this enigmatic ailment.


## Directions

1. The programming assignment is organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. Each problem contains a blank cell containing the following comment: `# ENTER YOUR CODE IN THIS CELL`. Enter your code in these cells below the comment, being sure to not erase the comment. There are usually directions on the precise syntax that you will use to enter your solution properly. Please pay very careful attention to these directions.

3. Below most of the solution cells are "autograder" cells. Do not alter the autograder cells in any way.

4. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course <a href="https://github.com/jmyers7/stats-book-materials">GitHub repo</a>. Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## Importing and exploring the data

The new test for HTS was given to 10,000 Aquilonians. The results were tabulated and put into a `.csv` file in the course's GitHub repo. Our first task is to import the data into a Pandas dataframe called `df`:

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-2-1.csv'
df = pd.read_csv(url).set_index('cin')

Let's print out the first 10 rows of the dataframe:

In [None]:
df.head(10)

Let's also print out the number of rows in the dataframe using the `len` function, just to verify that there are 10,000 rows:

In [None]:
len(df)

We see that our dataframe consists of two columns called `t` and `d`. It also contains an index column called `cin`, which holds _citizen identification numbers_ (CINs). These are the Aquilonian versions of our social security numbers.

The `t` variable is a binary variable coding whether the person tested positive for HTS:

$$
t = \begin{cases}
0 & : \text{test is negative}, \\
1 & : \text{test is positive}.
\end{cases}
$$

The `d` variable is also a binary variable, coding whether the person actually has HTS:

$$
d = \begin{cases}
0 & : \text{citizen does not have HTS}, \\
1 & : \text{citizen has HTS}.
\end{cases}
$$

For example, we see from the first row in the dataframe that the person with CIN 388144 has $(t,d) = (0,0)$, which means that they do _not_ have HTS and they also tested negative.

In fact, none of the people in the first ten rows of the dataframe have the disease, and they also all tested negative! This makes me wonder:

* Has _anybody_ tested positive?
* Does _anybody_ have the disease?

Let's find out!

### Problem 1 --- Has anybody tested positive?

By using a boolean mask, pull the rows out of `df` for which `t == 1`. Assign the result to a new dataframe called `df_test`.
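If boolean masks are new to you, the following minimal sketch shows the idea on a small made-up dataframe. The dataframe `toy`, its columns `x` and `y`, and its index values are all hypothetical and have nothing to do with the Aquilonia data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Aquilonia dataset.
toy = pd.DataFrame(
    {'x': [0, 1, 0, 1, 1], 'y': [0, 0, 1, 1, 0]},
    index=[101, 102, 103, 104, 105],
)

# Comparing a column to a value gives a boolean Series, one entry per row.
toy_mask = toy['x'] == 1

# Indexing with the mask keeps only the rows where the mask is True.
toy_x = toy[toy_mask]

print(toy_mask.tolist())  # [False, True, False, True, True]
print(len(toy_x))         # 3
```

The mask and the filtered dataframe share the same index, so the rows that survive keep their original labels (here 102, 104, and 105).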

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now print out the number of rows in `df_test` by using the `len` function as I demonstrated above.

In [None]:
# ENTER YOUR CODE IN THIS CELL



So, it turns out that there _are_ people in Aquilonia who have tested positive for HTS.

### Problem 2 --- Does anybody have HTS?

We now want to repeat Problem 1, but by masking for the condition `d == 1`. Using a mask again, pull the rows out of `df` for which `d == 1` and save the result into a new dataframe called `df_disease`. You might also print the first ten rows by calling `df_disease.head(10)`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now get the number of rows in `df_disease` by calling the `len` function:

In [None]:
# ENTER YOUR CODE IN THIS CELL



We see that there is a disparity between the number of people who tested positive for HTS and the number of people who actually have the disease! This means that the test makes errors. We will count the number of errors in a later section, but first we need to check on Old Man Willoughby.

### Problem 3 --- Old Man Willoughby

Does Old Man Willoughby have HTS!?!??!!!?? I must know!

In the next cell, index into the dataframe using Old Man Willoughby's CIN, which is 340870, and pull out the corresponding value for the `d` variable (*Hint*: Use the `.loc` indexer on the dataframe). Save your result into the variable `willoughby`. Also, print it out to see if he has HTS!
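As a reminder of how label-based lookup works, here is a small sketch on made-up data. The dataframe `toy`, the index name `id`, and the labels are hypothetical; only the `.loc[row_label, column_label]` pattern carries over.

```python
import pandas as pd

# Hypothetical toy data with a labeled index, unrelated to the Aquilonia dataset.
toy = pd.DataFrame(
    {'x': [0, 1, 0], 'y': [1, 1, 0]},
    index=pd.Index([101, 102, 103], name='id'),
)

# .loc[row_label, column_label] looks up by index label, not by row position.
toy_value = toy.loc[102, 'y']
print(toy_value)  # 1
```

Note that `102` here is a label in the index, not the 102nd row.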

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## Probability = Proportion

We imagine that all 10,000 citizens in our dataframe make up the sample space $S$ of a uniform probability measure $P$. As $P$ is discrete, it has a PMF given by

$$
p(s) = \frac{1}{10{,}000}
$$

where $s\in S$. (You might also think of $s$ as a literal _row_ in the dataframe.)

Thus, if $A$ is a subset of rows in the dataframe, we have

$$
P(A) = \sum_{s\in A} p(s) = \frac{|A|}{10{,}000} \tag{$\ast$}
$$

where $|A|$ denotes the cardinality of $A$. But the ratio on the right-hand side of $(\ast)$ is nothing but the proportion of rows in the dataframe that lie in the subset $A$.
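To see formula $(\ast)$ in action on something small, here is a sketch with a made-up dataframe of 8 rows. The names `toy`, `toy_subset`, and `toy_prob` are hypothetical, and the denominator is 8 rather than 10,000.

```python
import pandas as pd

# Hypothetical toy data with 8 rows, unrelated to the Aquilonia dataset.
toy = pd.DataFrame({'x': [0, 1, 0, 1, 1, 0, 0, 0]})

# The event A is the set of rows with x == 1.
toy_subset = toy[toy['x'] == 1]

# P(A) = |A| / (total number of rows).
toy_prob = len(toy_subset) / len(toy)
print(toy_prob)  # 0.375

# The mean of a boolean mask is the same proportion.
print((toy['x'] == 1).mean())  # 0.375
```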

### Problem 4 --- Probability of testing positive

Remember, you created the dataframe `df_test` above in Problem 1 by pulling out all rows for which `t == 1`. You also computed the number of rows in `df_test`. Using the formula ($\ast$) above, compute the probability of testing positive, $P(t=1)$. Save your answer into the variable `test_prob`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Keep in mind that the probability $P(t=1)$ is nothing but the proportion of people in the dataframe who tested positive.

### Problem 5 --- Probability of having HTS

You also created the dataframe `df_disease` in Problem 2 by pulling out all rows for which `d == 1`. Using it, compute the probability of having HTS, $P(d=1)$. Save your answer into the variable `disease_prob`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Keep in mind that the probability $P(d=1)$ is the proportion of people in the dataframe that have HTS.

## Quantifying test reliability

There are four combinations for the $t$'s and $d$'s:

$$
(t,d) = (1, 1), \ (0, 0), \ (1, 0), \ (0, 1).
$$

The first two combinations correspond to a _true positive_ (_TP_) and _true negative_ (_TN_), respectively. The third and fourth combinations correspond to a _false positive_ (_FP_) and _false negative_ (_FN_), respectively. False positives and false negatives are the two different types of errors that the test may make.

(By the way, false positives are also called _type I errors_, and false negatives are also called _type II errors_.)
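One way to see all four combinations at once is a cross-tabulation. The sketch below uses made-up data with hypothetical column names `pred` (playing the role of a test result) and `actual` (playing the role of the true status). It is only an illustration and is not needed for the problems below.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Aquilonia dataset.
toy = pd.DataFrame({
    'pred':   [1, 1, 0, 0, 1, 0, 0, 0],
    'actual': [1, 0, 0, 1, 1, 0, 0, 0],
})

# Rows are indexed by the value of pred, columns by the value of actual.
toy_counts = pd.crosstab(toy['pred'], toy['actual'])

print(toy_counts.loc[1, 1])  # true positives:  2
print(toy_counts.loc[0, 0])  # true negatives:  4
print(toy_counts.loc[1, 0])  # false positives: 1
print(toy_counts.loc[0, 1])  # false negatives: 1
```

The four counts add up to the total number of rows, since every row falls into exactly one combination.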



### Problem 6 --- True positive rate

The _true positive rate_ (_TPR_) of the test is the conditional probability

$$
P(t = 1 | d=1) = \frac{P(t=1, d=1)}{P(d=1)}.
$$

Thus, the TPR is the probability that a person tests positive, _given_ that they have HTS.

You already computed the probability $P(d=1)$ in Problem 5 above and saved it into the variable `disease_prob`. So, to compute the TPR, we need only compute the (joint) probability in the numerator. To do this, create a boolean mask for the conditions `t == 1` **and** `d == 1`. Use it to index into `df` to pull out the rows with $(t,d)=(1,1)$. Save the result into a new dataframe called `df_joint`.
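Combining two conditions in a single mask has one syntactic catch, shown in the sketch below on made-up data. The dataframe `toy` and its columns `x` and `y` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Aquilonia dataset.
toy = pd.DataFrame({'x': [0, 1, 0, 1, 1], 'y': [0, 0, 1, 1, 0]})

# Use & (not the keyword `and`) to combine masks elementwise.
# Each comparison needs its own parentheses, because & binds more
# tightly than ==.
toy_both = toy[(toy['x'] == 1) & (toy['y'] == 0)]

print(len(toy_both))  # 2
```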

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Using `df_joint`, compute the probability $P(t=1, d=1)$. Save your answer into the variable `joint_prob`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Finally, compute the TPR using `joint_prob`, `disease_prob`, and the formula above. Save your result into the variable `tpr`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


You just computed `joint_prob` as

$$
P(t=1, d=1) = \frac{|\{\text{rows with $(t,d)=(1,1)$}\}|}{10{,}000}
$$

and then divided it by

$$
P(d=1) = \frac{|\{\text{rows with $d=1$}\}|}{10{,}000}
$$

to get the TPR. But when you do this, both of the $10{,}000$'s will cancel, leaving you with

$$
P(t=1|d=1) = \frac{|\{\text{rows with $(t,d)=(1,1)$}\}|}{|\{\text{rows with $d=1$}\}|}. \tag{$\dagger$}
$$

Using this _new_, simplified formula for the TPR along with your dataframes `df_joint` and `df_disease`, compute the TPR one more time and save it into the variable `tpr_new`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Make sure to print out `tpr_new` to check that it matches `tpr`!

The formula $(\dagger)$ is interesting because it shows that $P(t=1 | d=1)$ is just the proportion of people who tested positive _among_ the people in the dataframe who have the disease. In other words, conditioning on $d=1$ amounts to shrinking the sample space down to the rows with $d=1$.

### Problem 7 --- True negative rate

The _true negative rate_ (_TNR_) of the test is the conditional probability

$$
P(t=0 | d=0) = \frac{P(t=0, d=0)}{P(d=0)}.
$$

Both of the probabilities on the right-hand side are proportions out of $10{,}000$, and so a simplified formula for the TNR is

$$
P(t=0 | d=0) = \frac{|\{\text{rows with $(t,d)=(0,0)$}\}|}{|\{\text{rows with $d=0$}\}|}.
$$

Now, you already know how many rows in the dataframe have $d=0$, because this is equal to

$$
|\{\text{rows with $d=0$}\}| = 10{,}000 - |\{\text{rows with $d=1$}\}|,
$$

and the cardinality on the right-hand side is exactly the number of rows in `df_disease`. Using this formula, compute the number of rows in `df` that have $d=0$. Save your answer into the variable `nodisease_card`.


In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now, pull the rows out of `df` for which `t == 0` **and** `d == 0`. Save your result into the variable `df_joint`. (You are overwriting the old `df_joint`.)

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Use `nodisease_card` and the length of `df_joint` to compute the TNR. Save your result into the variable `tnr`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## Quantifying test errors

Obviously, you want the true positive and true negative rates for the test to be as close to $1$ as possible.

On the other hand, you want the rates at which the test makes errors to be as close to $0$ as possible. There are two such rates: the _false positive rate_ (_FPR_), which is the conditional probability $P(t=1 | d=0)$, and the _false negative rate_ (_FNR_), which is the conditional probability $P(t=0 | d=1)$. Fortunately, they are easily obtainable via the formulas

$$
FPR = 1 - TNR
$$

and

$$
FNR = 1 - TPR.
$$
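These formulas follow from the complement rule: among people with $d=0$, everyone tests either negative or positive, so $P(t=0 | d=0) + P(t=1 | d=0) = 1$, and similarly for $d=1$. The sketch below checks the first formula on made-up counts. The numbers and the variable names are hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical counts among 16 healthy people (d = 0); not from the dataset.
toy_true_negatives = 14   # healthy and tested negative
toy_false_positives = 2   # healthy but tested positive
toy_healthy = toy_true_negatives + toy_false_positives

toy_tnr = toy_true_negatives / toy_healthy    # proportion of healthy who test negative
toy_fpr = toy_false_positives / toy_healthy   # proportion of healthy who test positive

# The two proportions split the healthy group, so they sum to 1.
print(toy_fpr)                                # 0.125
print(math.isclose(toy_fpr, 1 - toy_tnr))     # True
```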

### Problem 8 --- Error rates

Using the formulas just given, compute the FPR and FNR using `tnr` and `tpr` from above. Compute both of these quantities in a single cell, saving them into variables called `fpr` and `fnr`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## More information has arrived!

More research suggests that citizens in Northern Aquilonia are more likely to suffer from HTS than citizens who live in other parts of the town. It has also been determined that citizens who have had HTS previously are less likely to _currently_ have HTS, since they have built up some internal immunity.

Let's code these new variables as follows. We let $n$ represent the following binary variable:

$$
n = \begin{cases}
0 & : \text{citizen does not live in Northern Aquilonia,} \\
1 & : \text{citizen does live in Northern Aquilonia,}
\end{cases}
$$

and we let $r$ represent the following binary variable:

$$
r = \begin{cases}
0 & : \text{citizen has not had HTS in the past,} \\
1 & : \text{citizen has had HTS in the past.}
\end{cases}
$$

We now have four binary variables: the original $t$ and $d$, as well as $n$ and $r$. We can visualize the "flow of influence" between these four variables as in the following figure:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/1d56e0c30ca219c908404e5191a218db54a8bb8c/img/pgm.svg?raw=true" width="400" align="center">
</center>

Our goal over the rest of the assignment is to compute the conditional probability

$$
P\big(d= 1 | t=1, n=1, r=0\big).
$$

This is the probability that a person has HTS, given that they test positive, they live in Northern Aquilonia, and they have not had HTS before. We want to compare this probability to

$$
P(d=1 | t=1),
$$

which is the probability that a person has HTS, given that they have tested positive, but _without_ the additional information brought by the $n$ and $r$ variables. These types of probabilities are called _precisions_, so we want to see how $n=1$ and $r=0$ affect the precision of the test.

### Problem 9 --- Precision, part 1

Use <a href="https://mml.johnmyersmath.com/stats-book/chapters/rules-of-prob.html#the-law-of-total-probability-and-bayes-theorem">Bayes' Theorem</a> to compute the precision $P(d=1 | t=1)$ in _one_ line of code. Use the saved variables `tpr`, `fpr`, and `disease_prob` from above. Save your answer into the variable `precision`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Print out `precision` for your own information, so you can compare it to the precision computed in the next problem.

## Conditional Bayes

In order to compute the precision _with_ $n=1$ and $r=0$, we need to use a more advanced form of Bayes' Theorem given by

$$
P\big(d=1 | t=1, n=1, r=0\big) = \frac{P\big(t=1 | d=1, n=1, r=0\big)P\big(d=1 | n=1, r=0\big)}{P\big(t=1|n=1,r=0\big)}.
$$

Notice that this is essentially the simple version of Bayes' Theorem we discussed in class, but with the string "$n=1, r=0$" added everywhere on the right side of the conditioning bar. Therefore, this might be called a _conditional_ form of Bayes' Theorem.

We need to alter this version of Bayes' Theorem a little in order to use it.

* First, because of the direction of the "flow of influence" in the diagram above, the first factor in the numerator of Bayes' Theorem may be written as

  $$
  P\big(t=1 | d=1, n=1, r=0\big) = P(t=1 | d=1).
  $$

  This equation is actually an assumption built into the model---you can just take it for granted. Essentially, it says that the variable $t$ is (conditionally) independent of $n$ and $r$, _given_ the value $d=1$. This reflects the fact that $n$ and $r$ do _not_ directly influence $t$. Rather, they influence $t$ only through $d$, but if $d$ is assumed to have a given value, the influence of $n$ and $r$ is blocked. The same reasoning applies when $d=0$, so we also have $P\big(t=1 | d=0, n=1, r=0\big) = P(t=1 | d=0)$. (Look up <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian network</a> if you want to see more detail.)

* Second, in order to compute the denominator of the conditional form of Bayes' Theorem, we also need a _conditional_ form of the <a href="https://mml.johnmyersmath.com/stats-book/chapters/rules-of-prob.html#the-law-of-total-probability-and-bayes-theorem">Law of Total Probability</a>:

  $$
  P\big(t=1|n=1, r=0\big) =  P\big(t=1 | d=1\big)P\big(d=1 | n=1, r=0\big) + P\big(t=1 | d=0\big)P\big(d=0 | n=1, r=0\big).
  $$

Making these substitutions, we see that our conditional form of Bayes' Theorem may be written as

$$
P\big(d=1 | t=1, n=1, r=0\big) = \frac{P(t=1 | d=1)P\big(d=1 | n=1, r=0\big)}{P\big(t=1 | d=1\big)P\big(d=1 | n=1, r=0\big) + P\big(t=1 | d=0\big)P\big(d=0 | n=1, r=0\big)}.
$$

This is the form of the theorem that we will use in our code.
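To get a feel for how the pieces of this formula fit together, here is a sketch that evaluates it with made-up numbers. All three input values and all variable names are hypothetical placeholders. They are not the values from the dataset or from the medical experts.

```python
# All numbers below are hypothetical placeholders, not values from the dataset.
toy_sens = 0.9          # stands in for P(t=1 | d=1)
toy_false_alarm = 0.2   # stands in for P(t=1 | d=0)
toy_prior = 0.1         # stands in for P(d=1 | n=1, r=0)

# P(d=0 | n=1, r=0) is the complement of the prior.
toy_numerator = toy_sens * toy_prior
toy_denominator = toy_sens * toy_prior + toy_false_alarm * (1 - toy_prior)

toy_posterior = toy_numerator / toy_denominator
print(round(toy_posterior, 4))  # 0.3333
```

With these placeholder values, a positive test raises the probability of disease from 0.1 to about one third. The denominator is just the numerator plus the corresponding term for $d=0$.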

### Problem 10 --- Precision, part 2

Notice that the only unknown factors in the conditional form of Bayes' Theorem are the probabilities

$$
P\big( d=1 | n=1, r=0 \big) \quad \text{and} \quad P\big( d=0 | n=1, r=0 \big).
$$

Luckily, you do not need to find these probabilities yourself: You are told by medical experts that the first one is equal to $5\%$. From this, you can easily compute the second one. What is it?

Once you answer that question, use the conditional form of Bayes' Theorem along with the values stored in the variables `tpr` and `fpr` to compute the probability

$$
P\big(d=1 | t=1, n=1, r=0\big).
$$

Do this in _one_ line of code, and save your result into the variable `precision_nr`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Make sure to print out `precision_nr` to compare it to `precision`! You should notice that the conditions $n=1$ and $r=0$ have a _big_ influence on the precision of the test.