In [1]:
import pandas as pd

**Exercise 5.2. [Purpose: Getting an intuition for the previous results by using "natural frequency" and "Markov" representations.]**

**(A)** Suppose that the population consists of 100,000 people. Compute how many people would be expected to fall into each cell of Table 5.4. To compute the expected frequency of people in a cell, just multiply the cell probability by the size of the population. 

In [2]:
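# Prior probability of having the disease
p_present = 0.001
# Probability of a positive test when the disease is present (hit rate)
p_positive_present = 0.99
# Probability of a positive test when the disease is absent (false alarm rate)
p_positive_absent = 0.05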

In [3]:
# Joint probabilities
data = [[p_positive_present * p_present, p_positive_absent * (1 - p_present)],
        [(1 - p_positive_present) * p_present, (1 - p_positive_absent) * (1 - p_present)]]
index = ['positive', 'negative']
columns = ['present', 'absent']
df = pd.DataFrame(data, index = index, columns = columns)
df.columns.set_names(['disease'], inplace = True)
df.index.set_names(['test result'], inplace = True)
df

disease      present   absent
test result                  
positive     0.00099  0.04995
negative     0.00001  0.94905


In [4]:
frequencies = df * 100000
print(frequencies)

disease      present   absent
test result                  
positive        99.0   4995.0
negative         1.0  94905.0


In [5]:
# Frequencies per column and row
print(frequencies.sum(axis = 0))
print(frequencies.sum(axis = 1))

disease
present      100.0
absent     99900.0
dtype: float64
test result
positive     5094.0
negative    94906.0
dtype: float64


Notice the frequencies on the lower margin of the table. They indicate that out of 100,000 people, only 100 have the disease, while 99,900 do not have the disease. These marginal frequencies instantiate the prior probability that p(present) = 0.001. Notice also the cell frequencies in the `present` column, which indicate that of 100 people with the disease, 99 have a positive test result and 1 has a negative test result. These cell frequencies instantiate the hit rate of 0.99.
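
As a quick check, the prior and the hit rate can be read back off the natural frequencies. The following is a minimal sketch that hard-codes the cell counts from the table above in a plain dictionary; the names `freq`, `n_present`, `n_absent`, `prior_present` and `hit_rate` are introduced here for illustration only.

```python
# Natural frequencies for 100,000 people, copied from the table above.
# Keys are (test result, disease status); this layout is an illustrative choice.
freq = {
    ("positive", "present"): 99.0,
    ("negative", "present"): 1.0,
    ("positive", "absent"): 4995.0,
    ("negative", "absent"): 94905.0,
}

# Column totals (the lower margin of the table)
n_present = freq[("positive", "present")] + freq[("negative", "present")]
n_absent = freq[("positive", "absent")] + freq[("negative", "absent")]

# Prior: share of the whole population that has the disease
prior_present = n_present / (n_present + n_absent)
# Hit rate: share of people with the disease who test positive
hit_rate = freq[("positive", "present")] / n_present

print(prior_present, hit_rate)
```

Both values should agree with `p_present` and `p_positive_present` as defined at the start of the exercise.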

**(B)** Take a good look at the frequencies in the table you just computed for the previous part. These are the so-called 'natural frequencies' of the events, as opposed to the somewhat unintuitive expression in terms of conditional probabilities. From the cell frequencies alone, determine the proportion of people who have the disease, given that their test result is positive. Before computing the exact answer arithmetically, first give a rough intuitive answer merely by looking at the relative frequencies in the `positive` row. Does your intuitive answer match the intuitive answer you provided when originally reading about Table 5.4? Probably not. Your intuitive answer here is probably closer to the correct answer. Now compute the exact answer arithmetically. It should match the result from applying Bayes' rule to Table 5.4.

Using frequencies, we see that 4,995 + 99 = 5,094 people got a positive result, but only 99 of them have the disease. That is a very small proportion, roughly 2%.

Arithmetically:

In [6]:
99 / (4995 + 99)

0.019434628975265017
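
For comparison, the same proportion can be obtained from Bayes' rule applied to the probabilities rather than the counts. This is a minimal sketch that restates the values from the start of the exercise so it runs on its own; `p_positive`, `posterior` and `from_frequencies` are names chosen here for illustration.

```python
# Probabilities as defined at the start of the exercise
p_present = 0.001
p_positive_present = 0.99
p_positive_absent = 0.05

# Marginal probability of a positive test (law of total probability)
p_positive = (p_positive_present * p_present
              + p_positive_absent * (1 - p_present))

# Bayes' rule: p(present | positive)
posterior = p_positive_present * p_present / p_positive

# Same quantity from the natural frequencies
from_frequencies = 99 / (4995 + 99)

print(posterior, from_frequencies)
```

The two numbers should agree up to floating-point rounding, since the frequencies are just the joint probabilities scaled by the population size.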

**(C)** Now we'll consider a related representation of the probabilities in terms of natural frequencies, which is especially useful when we accumulate more data. This type of representation is called a Markov representation. Suppose now we start with a population of N = 10,000,000 people. We expect 99.9% of them (i.e., 9,990,000) not to have the disease, and just 0.1% (i.e., 10,000) to have the disease. Now consider how many people we expect to test positive. Of the 10,000 people who have the disease, 99% (i.e., 9,900) will be expected to test positive. Of the 9,990,000 people who do not have the disease, 5% (i.e., 499,500) will be expected to test positive. Now consider re-testing everyone who has tested positive on the first test. How many of them are expected to show a negative result on the retest? Use this diagram to compute your answer:

In [7]:
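N = 10000000
# Left branch: people who have the disease
branch_1_1 = N * p_present
# ... of whom these test positive on the first test
branch_1_2 = branch_1_1 * p_positive_present
# ... of whom these test negative on the retest
branch_1_3 = branch_1_2 * (1 - p_positive_present)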

In [8]:
print(branch_1_1)
print(branch_1_2)
print(branch_1_3)

10000.0
9900.0
99.00000000000009


In [9]:
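# Right branch: people who do not have the disease
branch_2_1 = N * (1 - p_present)
# ... of whom these test positive on the first test
branch_2_2 = branch_2_1 * p_positive_absent
# ... of whom these test negative on the retest
branch_2_3 = branch_2_2 * (1 - p_positive_absent)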

In [10]:
print(branch_2_1)
print(branch_2_2)
print(branch_2_3)

9990000.0
499500.0
474525.0


```            
            N = 10,000,000
     <                         >
    < p(present)       p(absent) >    
   <                              >
10,000                        9,990,000
   V                              V
   V p(positive|present)          V p(positive|absent)
   V                              V 
 9,900                         499,500    
   V                              V
   V p(negative|present)          V p(negative|absent)
   V                              V
   99                          474,525
```

When computing the frequencies for the empty boxes above, be careful to use the proper conditional probabilities!
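
Each branch of the diagram is a chain of multiplications: start with the population and scale by one conditional probability per step. The sketch below wraps that idea in a small helper; `propagate` is a hypothetical function written for this illustration, not part of any library, and the probabilities are restated from the start of the exercise.

```python
def propagate(count, probabilities):
    """Return the expected count at every level of one branch.

    `count` is the starting population; each entry of `probabilities`
    is the conditional probability of moving one level down the branch.
    """
    counts = [count]
    for p in probabilities:
        counts.append(counts[-1] * p)
    return counts


N = 10_000_000
p_present = 0.001
p_positive_present = 0.99
p_positive_absent = 0.05

# Left branch: present -> positive -> negative
left = propagate(N, [p_present, p_positive_present, 1 - p_positive_present])
# Right branch: absent -> positive -> negative
right = propagate(N, [1 - p_present, p_positive_absent, 1 - p_positive_absent])

print(left)
print(right)
```

Note that the last step in each branch uses the probability of a negative result conditional on that branch's disease status, which is the point of the warning above.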

**(D)** Use the diagram in the previous part to answer this: what proportion of people, who test positive at first and then negative on retest, actually have the disease? In other words, of the total number of people at the bottom of the diagram in the previous part (those are the people who tested positive then negative), what proportion of them are in the left branch of the tree? How does the result compare with your answer to exercise 5.1?

In [11]:
branch_1_3 / (branch_1_3 + branch_2_3)

0.00020858616504854387

We get exactly the same result as in exercise 5.1.
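
To see why the two approaches agree, the tree result can be compared with applying Bayes' rule twice, using the posterior after the positive test as the prior for the negative retest. This is a sketch under the assumption that exercise 5.1 was solved by such sequential updating; the helper `update` and the variable names are hypothetical and introduced only for this check.

```python
def update(prior, p_result_present, p_result_absent):
    """Posterior p(present | result) from a prior and the two likelihoods."""
    numerator = p_result_present * prior
    return numerator / (numerator + p_result_absent * (1 - prior))


p_present = 0.001
p_positive_present = 0.99
p_positive_absent = 0.05

# First test: positive result
after_positive = update(p_present, p_positive_present, p_positive_absent)
# Retest: negative result, starting from the updated belief
after_negative = update(after_positive,
                        1 - p_positive_present,
                        1 - p_positive_absent)

# Proportion read off the bottom of the tree (counts from the diagram)
from_tree = 99 / (99 + 474525)

print(after_negative, from_tree)
```

The population size cancels out of the ratio at the bottom of the tree, which is why counting people and updating probabilities lead to the same number.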