### Bayesian Statistics with Conditional Probability


People draw on previous experience to protect themselves and to make better decisions in the future, 
such as avoiding a restaurant that once served them bad food. 

We discussed conditional probability of an event as the probability obtained using additional information that some other event has already occurred. 
If the occurrence of Event B is dependent on Event A, we used the following formula for finding P(B|A):

                                             P(A ∩ B)  
                                    P(B|A) = --------  
                                               P(A)
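As a quick check of the formula above, here is a minimal R sketch. The fair six-sided die and the two events are assumptions chosen only for illustration; they do not come from the examples in this notebook.

```r
# Sample space: one roll of a fair six-sided die (illustrative assumption)
outcomes <- 1:6

# Event A: the roll is even. Event B: the roll is greater than 3.
A <- outcomes %% 2 == 0
B <- outcomes > 3

p_A       <- mean(A)       # P(A)     = 3/6
p_A_and_B <- mean(A & B)   # P(A ∩ B) = 2/6 (rolls 4 and 6)

# Conditional probability: P(B|A) = P(A ∩ B) / P(A)
p_B_given_A <- p_A_and_B / p_A
p_B_given_A
```

Because every outcome is equally likely, `mean()` of a logical vector gives the probability of the event directly.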

In this notebook, 
we will extend the discussion of conditional probability to applications of Bayes' theorem (or Bayes' rule). 
Bayes' rule is used to update prior probabilities based on information that is obtained later. 
Bayes' theorem deals with sequences of events where each occurrence of a subsequent event provides 
new information that is used to revise the probability of the previous event.
The probabilities before and after this revision are called the _prior probability_ and the _posterior probability_.

**Prior probability** (a priori) is an initial probability value obtained before any additional information is obtained.

**Posterior probability** (a posteriori) is a probability value that has been revised by using additional information that is later obtained.

### Bayes' Theorem

Bayes' Theorem: https://www.youtube.com/watch?v=R13BD8qKeTg

The probability of event A, given that event X has subsequently occurred, is mathematically represented as below:

                                          P(X|A) * P(A)
                  P(A|X) = ------------------------------------------
                          [P(A) * P(X|A)] + [ P(not A) * P(X|not A)]
                                  

Consider an example of conducting cancer tests. 
Tests detect things that don’t exist (false positive) and miss things that do exist (false negative).
People often consider the test results directly, without considering the errors in the tests. 
Bayes’ theorem converts the results from a test into the real probability of the event. 

**Correct for measurement errors...** 
If you know the real probabilities and the chance of a false positive and false negative, 
you can correct for measurement errors.

**Relate the actual probability to the measured test probability...** 
Bayes’ theorem lets you relate `P(A|X)`, the chance that an event A happened given the indicator X, 
and `P(X|A)`, the chance the indicator X happened given that event A occurred. 
Given mammogram test results and known error rates, you can predict the actual chance of having cancer.

Bayes’ Theorem: 
It lets you take the test results and correct for the “skew” introduced by false positives. 
Consider the example of a cancer test again to illustrate what Bayes' formula is doing.

Let 'A' be the event of a person having cancer.
Let 'X' be the event of a positive test.

P(A|X) = Chance of having cancer (A) given a positive test (X). 
**This is what we want to know**: 
How likely is it to have cancer with a positive result?

P(X|A) = Chance of a positive test (X) given that you had cancer (A). This is the chance of a true positive.

P(A) = Chance of having cancer.

P(not A) = Chance of not having cancer.

P(X|not A) = Chance of a positive test (X) given that you didn’t have cancer (~A). 
This is a false positive.


It all comes down to the chance of a true positive result divided by the chance of any positive result. So we can simplify Bayes' Theorem like this:

                      P(X|A) * P(A)
            P(A|X) = ---------------
                          P(X)          <- Read the note below about what P(X) actually represents


**What do the probabilities in this formula represent?**

<ul>
    <li>P(A|X) is the probability that you have cancer (A) if you tested positive (X)</li>
    <br><li>P(X|A) is the probability that you would test positive (X) if you had cancer (A)</li>
    <br><li>P(A) represents the PRIOR probability of having cancer (that is, the likelihood of having cancer before having the test done).  Note that in some problems, this probability can be difficult to quantify.  P(A) may also need to be revised: for example, if a first test is positive and the same test is run again, the posterior probability from the first test becomes the prior probability for the second.  See this video, starting at the 4:54 mark: https://www.youtube.com/watch?v=R13BD8qKeTg</li>
    <br><li>P(X) is the **total probability** of the test being positive.  </li>
</ul>

The last item in the list, P(X), is the total probability of a positive test. It is the probability of having cancer AND testing positive for cancer [i.e., P(A) multiplied by P(X|A)] **PLUS** the probability of NOT having cancer AND having a false positive result [i.e., P(not A) multiplied by P(X|not A)].  

So in mathematical terms, the **denominator** of the above equation [P(X)] is determined as follows: 

          P(X) = [P(A) * P(X|A)] + [P(~A) * P(X|~A)]
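The total-probability denominator can be sketched in R as follows. The prevalence and error rates here are hypothetical values picked only to show the mechanics; the text above does not give figures for any real test. The second half reuses the posterior as the new prior for a repeated positive test, under the additional assumption that the two test results are independent given the person's true status.

```r
# Hypothetical inputs (assumptions for illustration, not from the text)
p_A          <- 0.01    # P(A): prior probability of having cancer
p_X_given_A  <- 0.80    # P(X|A): chance of a true positive
p_X_given_nA <- 0.096   # P(X|~A): chance of a false positive

# Total probability of a positive test: P(X) = P(A)P(X|A) + P(~A)P(X|~A)
p_X <- p_A * p_X_given_A + (1 - p_A) * p_X_given_nA

# Posterior probability of cancer given one positive test
p_A_given_X <- p_X_given_A * p_A / p_X
round(p_A_given_X, 3)

# A second positive test: the first posterior becomes the new prior
# (assumes the two results are conditionally independent)
p_X2         <- p_A_given_X * p_X_given_A + (1 - p_A_given_X) * p_X_given_nA
p_A_given_2X <- p_X_given_A * p_A_given_X / p_X2
round(p_A_given_2X, 3)
```

With these made-up rates, a single positive result leaves the probability of cancer below 10% because false positives among the many healthy people outnumber true positives; a second positive result raises it substantially.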
          

The example below illustrates the formula... 

----
`1. Consider an example:` In Boone County, Missouri, 51% of the adults are males.
One adult is randomly selected for a survey involving credit card usage. 
What is the prior probability that the selected person is a male?

**Solution** It's known that 51% of the adults in the county are males. 

Consider A to be the event of selecting an adult male. 
So the probability of randomly selecting an adult and getting a male is given by P(A) = 0.51

`2. Let's take this a step further:` 
Now consider that the same survey also includes questions about smoking cigars, and we learn that the adult selected above smokes cigars.   

Based on data from the Substance Abuse and Mental Health Services Administration, we know that 9.5% of males smoke cigars, whereas 1.7% of females smoke cigars.  Now _use this additional information_ (i.e., apply Bayes' theorem) to find the probability that the selected subject is a male.

**Solution:** Based on the additional given information, we have the following:
    
Let X denote the event that the adult smokes cigars
       
    X' (pronounced _X prime_) is the complement event of X and represents the event that the adult does not smoke cigars
        
    P(A) = 0.51 because 51% of the adults are males
  
    A' (pronounced _A prime_) is the complement event of A and represents the event that the adult is not a male (i.e., is a female)
    
    P(A') = 0.49 because 49% of the adults are females (not males)
    
    P(X|A) = 0.095 because 9.5% of the males smoke cigars 
    (That is, the probability of getting someone who smokes cigars, given that the person is a male, is 0.095.)

    P(X|A') = 0.017 because 1.7% of the females smoke cigars 
    (That is, the probability of getting someone who smokes cigars, given that the person is a female, is 0.017)

Applying Bayes' theorem to the information above, we get the following result:

                                               P(X|A) * P(A)
                        P(A|X) = --------------------------------------
                                    [P(A) * P(X|A)] + [P(A') * P(X|A')]
                                    
                                            0.095 * 0.51  
                               =   -------------------------------
                                   (0.51 * 0.095) + (0.49 * 0.017)
                                   
                                   
                                 =  0.853                                                                                                                          

Before we knew that the survey subject smoked cigars, we determined that the probability that the survey subject was male was 0.51.  

But after finding out that the subject smoked cigars, we could revise the probability that the survey subject was a male to 0.853.  
The likelihood that the subject was a male increased dramatically (from the prior of 0.51 to the posterior of 0.853) once we knew the additional piece of information that the subject also smokes cigars.
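The same calculation can be reproduced in R. This is a minimal sketch using the figures from the example; the variable names are illustrative choices rather than anything defined elsewhere in the notebook.

```r
# Figures from the example above
p_male           <- 0.51    # P(A): prior probability of selecting a male
p_cigar_given_m  <- 0.095   # P(X|A): males who smoke cigars
p_cigar_given_f  <- 0.017   # P(X|A'): females who smoke cigars

# Total probability of selecting a cigar smoker, P(X)
p_cigar <- p_male * p_cigar_given_m + (1 - p_male) * p_cigar_given_f

# Posterior probability that the cigar smoker is a male, P(A|X)
p_male_given_cigar <- p_cigar_given_m * p_male / p_cigar
round(p_male_given_cigar, 3)   # 0.853
```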

Let's apply Bayes' theorem to a multivariate dataset to learn more. Load the framingham data from the directory '/dsa/data/all_datasets/framingham' ... 
This data is from the Framingham Heart Study : https://www.framinghamheartstudy.org

In [1]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")
head(framingham_data)

  male age education currentSmoker cigsPerDay BPMeds prevalentStroke
1 1    39  4         0              0         0      0              
2 0    46  2         0              0         0      0              
3 1    48  1         1             20         0      0              
4 0    61  3         1             30         0      0              
5 0    46  3         1             23         0      0              
6 0    43  2         0              0         0      0              
  prevalentHyp diabetes totChol sysBP diaBP BMI   heartRate glucose TenYearCHD
1 0            0        195     106.0  70   26.97 80         77     0         
2 0            0        250     121.0  81   28.73 95         76     0         
3 0            0        245     127.5  80   25.34 75         70     0         
4 1            0        225     150.0  95   28.58 65        103     1         
5 0            0        285     130.0  84   23.10 85         85     0         
6 1            0        228     180.0 110  

In [2]:
with(framingham_data,table(currentSmoker,TenYearCHD))

             TenYearCHD
currentSmoker    0    1
            0 1834  311
            1 1762  333

**Question:** What is the probability that a person has coronary heart disease in ten years given that the person is also a smoker?

To apply Bayes' theorem, we first define the events and compute their probabilities from the table above.

**Solution**: 
<ul>
    <li>Let D be the event of the person having coronary heart disease in 10 years and D' be the event of the person _not_ having coronary heart disease in 10 years.</li>
    <li>Let S be the event of the person being a smoker and S' be the event of the person _not_ being a smoker.</li>
    <li>$P(D|S)$ - This is the probability that the person has coronary heart disease in 10 years given that the person is a smoker.  **This is what we want to know**.  </li>
    <li>$P(S|D)$ - This is the probability that the person is a smoker given that the person has coronary heart disease in 10 years.  It plays the role of the true positive rate P(X|A) in Bayes' formula.  P(S|D) = the number of smokers with heart disease in 10 years / the number of people with heart disease in 10 years = 333 / 644 = 0.517    </li>
    <li>$P(S|D')$ - This is the probability that the person is a smoker given that the person does not have coronary heart disease in 10 years.  It plays the role of the false positive rate P(X|not A) in Bayes' formula.  P(S|D') = the number of smokers who don't have heart disease in 10 years  / the number of people who don't have heart disease in 10 years = 1762 / 3596 = 0.490</li>
    <li>$P(D)$ - This is the probability that the person has heart disease in 10 years (644 / 4240 = 0.152)</li>
    <li>$P(D')$ - The probability that the person doesn't have heart disease in 10 years (3596 / 4240 = 0.848)</li>
    <li>$P(S)$ - The probability that the person smokes (2095 / 4240 = 0.494)</li>
    <li>$P(S')$ - The probability that the person doesn't smoke (2145 / 4240 = 0.506)</li>
</ul>
(In Bayes' formula, D takes the place of A and S takes the place of X.)



                                               P(S|D) * P(D)
                        P(D|S) = --------------------------------------
                                    [P(D) * P(S|D)] + [P(D') * P(S|D')]
                                    
                                            0.517 * 0.152  
                               =   -------------------------------
                                   (0.152 * 0.517) + (0.848 * 0.490)
                                   
                                   
                                 =  0.159     
                                 
Or, using the simplified Bayes' formula: 

                                P(S|D) * P(D)              0.517 * 0.152
                  P(D|S)  =  -------------------    =     ---------------    =   0.159
                                   P(S)                        0.494
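The arithmetic above can be checked in R without reloading the CSV. The sketch below rebuilds the 2 x 2 table of counts shown earlier as a matrix (the object name `chd_counts` is an illustrative choice) and compares the Bayes' theorem result with the direct ratio of smokers with heart disease to all smokers.

```r
# Counts copied from the currentSmoker x TenYearCHD table above
# (matrix() fills column by column)
chd_counts <- matrix(c(1834, 1762, 311, 333), nrow = 2,
                     dimnames = list(currentSmoker = c("0", "1"),
                                     TenYearCHD    = c("0", "1")))

n <- sum(chd_counts)                                           # 4240 people

p_D         <- sum(chd_counts[, "1"]) / n                      # P(D) = 644 / 4240
p_S         <- sum(chd_counts["1", ]) / n                      # P(S) = 2095 / 4240
p_S_given_D <- chd_counts["1", "1"] / sum(chd_counts[, "1"])   # P(S|D) = 333 / 644

# Simplified Bayes' formula: P(D|S) = P(S|D) * P(D) / P(S)
p_D_given_S <- p_S_given_D * p_D / p_S

# Direct calculation from the smoker row: 333 / 2095
p_D_given_S_direct <- chd_counts["1", "1"] / sum(chd_counts["1", ])

round(c(bayes = p_D_given_S, direct = p_D_given_S_direct), 3)
```

Both routes give about 0.159, which confirms that Bayes' theorem is a rearrangement of the conditional probability formula rather than a different estimate.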



**Example from:** [IPSUR](http://ipsur.r-forge.r-project.org/book/download/IPSUR.pdf)

**Misfiling Assistants problem.**
In this problem, there are three assistants working at a company: 
Moe, Larry, and Curly. 
Their primary job duty is to file paperwork in the filing cabinet when papers become available.
The three assistants have different work schedules:

<table>
    <tr>
        <th></th>
        <th>Moe</th>
        <th>Larry</th>
        <th>Curly</th>
    </tr>
    <tr>
        <td>Workload</td>
        <td>60%</td>
        <td>30%</td>
        <td>10%</td>
    </tr>
</table>

That is, Moe works 60% of the time, Larry works 30% of the time, and Curly does the remaining 10%, and they file documents at approximately the same speed. Suppose a person were to select one of the documents from the cabinet at random. 

Let M be the event, M = {Moe filed the document}  and 

Let L and C be the events that Larry and Curly, respectively, filed the document. 


What are these events’ respective probabilities? 
In the absence of additional information, reasonable prior probabilities would just be

<table>
    <tr>
        <th></th>
        <th>Moe</th>
        <th>Larry</th>
        <th>Curly</th>
    </tr>
    <tr>
        <td>Prior probability</td>
        <td>P(M) = 60%</td>
        <td>P(L) = 30%</td>
        <td>P(C) = 10%</td>
    </tr>
</table>

Now, the boss comes in one day, opens up the file cabinet, and selects a file at random. 
The boss discovers that the file has been misplaced. 
The boss is so angry at the mistake that (s)he threatens to fire the one who erred. 
The question is: Who misplaced the file?

The boss decides to use probability to decide, and walks straight to the workload schedule. 
(S)he reasons that, since the three employees work at the same speed, 
the probability that a randomly selected file would have been filed by each one would be proportional to his workload.
The boss notifies Moe that he has until the end of the day to empty his desk. 
But Moe argues in his defense that the boss has ignored additional information.
Moe’s likelihood of having misfiled a document is smaller than Larry’s and Curly’s, 
since he is a diligent worker who pays close attention to his work.
Moe admits that he works longer than the others, 
but he doesn’t make as many mistakes as they do. 
Thus, Moe recommends that – before making a decision – the boss should update the probability 
(initially based on workload alone) to incorporate the likelihood of having observed a misfiled document.

And, as it turns out, the boss has information about Moe, Larry, and Curly’s filing accuracy in the past (due to historical performance evaluations). 
The performance information may be represented by the following table:

<table>
    <tr>
        <th></th>
        <th>Moe</th>
        <th>Larry</th>
        <th>Curly</th>
    </tr>
    <tr>
        <td>Misfile rate</td>
        <td>0.003</td>
        <td>0.007</td>
        <td>0.010</td>
    </tr>
</table>



In other words, on the average, Moe misfiles 0.3% of the documents he is supposed to file. 
Notice that Moe was correct: he is the most accurate filer, followed by Larry, and lastly Curly. 
If the boss were to make a decision based only on the worker’s overall accuracy, 
then Curly should get the axe.
But Curly hears this and interjects that he only works a short period during the day, and consequently makes mistakes only very rarely; 
there is only the tiniest chance that he misfiled this particular document.

The boss would like to use this updated information to update the probabilities for the three assistants, that is, 
(s)he wants to use the additional likelihood that the document was misfiled to update his/her beliefs about the likely culprit. 

Let **A** be the event that **a document is misfiled**.
What the boss would like to know are the three probabilities...

            P(M|A), P(L|A), and P(C|A)
            
We will show the calculation for P(M|A), the other two cases being similar.
We use Bayes’ Rule in the form


                 P(M ∩ A)      P(A|M) * P(M)       
        P(M|A) = ---------- =  ---------------
                    P(A)            P(A)
                                                  

So we'll first find P(M ∩ A), which we know from the Multiplication Rule is the same as P(A|M) · P(M).

We already know P(A|M) is just Moe’s misfile rate (given above as 0.003) and P(M) = 0.6.
Thus, we compute

        P(M ∩ A) = (0.003)(0.6) = 0.0018

        P(L ∩ A) = (0.007)(0.3) = 0.0021
        
        P(C ∩ A) = (0.010)(0.1) = 0.0010


Using the Theorem of Total Probability we can write P(A) = P(M ∩ A) + P(L ∩ A) + P(C ∩ A).

        P(A) = 0.0018 + 0.0021 + 0.0010 = 0.0049
        
                                         0.0018
    According to Bayes' rule,  P(M|A) = --------  
                                         0.0049

                                       = 0.37

This last quantity is called the **posterior probability** that Moe misfiled the document. 
We can use the same argument to calculate the posterior probabilities for Larry and Curly:

<table>
    <tr>
        <th></th>
        <th>Moe</th>
        <th>Larry</th>
        <th>Curly</th>
    </tr>
    <tr>
        <td>Posterior probability</td>
        <td>P(M|A) = 0.37</td>
        <td>P(L|A) = 0.43</td>
        <td>P(C|A) = 0.20</td>
    </tr>
</table>


The conclusion:
Larry gets the axe.
What is happening is an intricate interplay between the time on the job and the misfile rate. 
It is not obvious who the winner (or in this case, loser) will be, 
and the statistician needs to consult Bayes’ Rule to determine the best course of action.

Let's try to implement the same thing in R. 
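All the math in the problem above comes down to four simple steps: set the prior probabilities, set the likelihoods, multiply the two, and divide by the total.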

In [3]:
# prior_probs is the name of the variable that contains the prior probabilities we assumed above
# The prior probabilities are based on the amount of time each person was working (i.e., Moe = 60%, 
# Larry = 30%, Curly = 10%)
prior_probs <- c(0.6, 0.3, 0.1)

# We now add each assistant's misfile rate from the historical performance evaluations.
# This is the likelihood of a misfiled document given who filed it: P(A|M), P(A|L), P(A|C).
like <- c(0.003, 0.007, 0.01)

# Multiply priors by likelihoods to get the joint probabilities P(M ∩ A), P(L ∩ A), P(C ∩ A).
# These are unnormalized posteriors: they do not yet sum to 1.
post <- prior_probs * like   # Note: This is vector math
post

In [4]:
# Divide by P(A) = sum(post) to get the posterior probabilities P(M|A), P(L|A), P(C|A)
post / sum(post) # More vector math

We see that we can compute the results using R.
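The two cells above can also be wrapped into a small reusable function. This is a sketch only: `bayes_update` is a hypothetical helper name introduced here, not a function from base R or from any package, and it assumes the hypotheses (Moe, Larry, Curly) are mutually exclusive and cover every possibility.

```r
# Hypothetical helper: posterior probabilities for a set of mutually
# exclusive, exhaustive hypotheses, given priors and likelihoods
bayes_update <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood),
            isTRUE(all.equal(sum(prior), 1)))
  joint <- prior * likelihood   # P(hypothesis AND evidence)
  joint / sum(joint)            # divide by the total probability of the evidence
}

# Named vectors keep the results labeled by assistant
prior        <- c(Moe = 0.6,   Larry = 0.3,   Curly = 0.1)
misfile_rate <- c(Moe = 0.003, Larry = 0.007, Curly = 0.010)

posterior <- bayes_update(prior, misfile_rate)
round(posterior, 2)

# The assistant with the highest posterior probability
names(which.max(posterior))
```

Using named vectors means the output carries the assistants' names, so the most likely culprit can be read off with `which.max()` instead of by position.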
Later in the course, you will see Bayes' Rule applied to a classification problem.