<span style="color:#1871d6; font-size:24px; font-weight:700"> Bayes' Theorem </span>

The cigar example in the lab applies Bayes' theorem by substituting values directly into the formula. 
That calculation is complicated, and it is easy to get confused or to substitute the wrong probability values. 
Fortunately, there is another approach that is more intuitive and easier to carry out:

Assume some convenient value for the total of all items involved, 
then construct a table of rows and columns with the individual cell frequencies based on the known probabilities.

For example, let's assume that the adult population in Boone County, Missouri is 100,000. 
Now we can use the given information to create a table.

*Number of males who smoke cigars:* 
51% of adults are males, so there are 51,000 males. 
9.5% of them smoke cigars, which gives 0.095 x 51,000 = 4845. 
The number of males who do not smoke cigars is then 51,000 - 4845 = 46,155.
These values fill the male row of the table below.


*Number of females who smoke cigars:* 
49% of adults are females, so there are 49,000 females. 
1.7% of them smoke cigars, which gives 0.017 x 49,000 = 833. 
The number of females who do not smoke cigars is 49,000 - 833 = 48,167. 
These values fill the female row of the table below. 

In [1]:
cigar <- matrix(c(4845, 833, 46155, 48167), ncol = 2)
colnames(cigar) <- c('smoker', 'nonsmoker')
rownames(cigar) <- c('male', 'female')
cigar.table <- as.table(cigar)

addmargins(cigar.table)

Unnamed: 0,smoker,nonsmoker,Sum
male,4845,46155,51000
female,833,48167,49000
Sum,5678,94322,100000


The above table involves simple arithmetic. 
Simply partition the assumed population into the different cell categories by finding suitable percentages.
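
The partitioning can also be done in code rather than by hand. The following is a minimal sketch that assumes the percentages given above; the function name `build_cigar_table` and its argument names are hypothetical and are not part of the lab.

```r
# Hypothetical helper: split an assumed population of size n into the four
# gender-by-smoking cells, using the percentages from the text as defaults.
build_cigar_table <- function(n, p_male = 0.51,
                              p_smoke_male = 0.095, p_smoke_female = 0.017) {
  males   <- round(n * p_male)          # whole number of males
  females <- n - males                  # everyone else is female
  male_smokers   <- round(males * p_smoke_male)
  female_smokers <- round(females * p_smoke_female)
  counts <- matrix(c(male_smokers, female_smokers,
                     males - male_smokers, females - female_smokers),
                   ncol = 2,
                   dimnames = list(c("male", "female"),
                                   c("smoker", "nonsmoker")))
  as.table(counts)
}

cigar_100k <- build_cigar_table(100000)
addmargins(cigar_100k)
```

Because the population size is an argument, the same helper can be reused for any other convenient total.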

Now we can easily address the key question: what is the probability of getting a male subject, given that the subject smokes cigars? 
This is the same conditional probability described before. 

To find it, restrict the table to the column of cigar smokers, 
then find the probability of getting a male in that column.
Among the 5678 cigar smokers, there are 4845 males, so the probability we seek is 4845/5678 = 0.85329341. 
That is, $P(M | C)$ = 4845/5678 = 0.85329341 = 0.853 (rounded).
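
As a quick check, the table-based answer can be compared with the result of substituting the same percentages into Bayes' formula. This is only a sketch; the variable names (`cigar_counts`, `p_table`, `p_formula`) are chosen here for illustration.

```r
# Counts from the assumed population of 100,000
cigar_counts <- matrix(c(4845, 833, 46155, 48167), ncol = 2,
                       dimnames = list(c("male", "female"),
                                       c("smoker", "nonsmoker")))

# Table approach: restrict to the smoker column, then take the male share
p_table <- cigar_counts["male", "smoker"] / sum(cigar_counts[, "smoker"])

# Formula approach: P(M|C) = P(M)P(C|M) / [P(M)P(C|M) + P(F)P(C|F)]
p_formula <- (0.51 * 0.095) / (0.51 * 0.095 + 0.49 * 0.017)

round(c(table = p_table, formula = p_formula), 3)
```

Both approaches give 0.853, because the assumed population size cancels out of the ratio.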

**Activity 1:** 
Now, your turn: 
The actual population of Boone County, Missouri is 170,733 (as of 2013).
Create the above table with actual population values for the given percentages and find the actual $P(M | C)$.

In [2]:
#51% - Male
#49% - Female
#9.5% - M Smoke
#1.7% - F Smoke

170733 * 0.51
170733 * 0.49

In [3]:
# Rounding
# M = 87,074
# F = 83,659

87074 * 0.095
87074 - (87074 * 0.095)

83659 * 0.017
83659 - (83659 * 0.017)

In [4]:
# Add your code here
# --------------------

cigar <- matrix(c(8272, 1422, 78802, 82237), ncol = 2)
colnames(cigar) <- c('smoker', 'nonsmoker')
rownames(cigar) <- c('male', 'female')
cigar.table <- as.table(cigar)

addmargins(cigar.table)

Unnamed: 0,smoker,nonsmoker,Sum
male,8272,78802,87074
female,1422,82237,83659
Sum,9694,161039,170733


a) Using the same table, suppose an individual is selected at random. What is the prior probability that the selected person is a female?

b) You later learn that the randomly selected person was smoking a cigar. 
Use this additional information to find the posterior probability that the selected person is a female.

In [5]:
addmargins(prop.table(cigar))

Unnamed: 0,smoker,nonsmoker,Sum
male,0.048449919,0.4615511,0.510001
female,0.008328794,0.4816702,0.489999
Sum,0.056778713,0.9432213,1.0
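
The proportions above are joint probabilities: each cell is divided by the grand total. For conditional probabilities, `prop.table` accepts a `margin` argument; with `margin = 2`, each column is divided by its own total instead. A sketch using the counts from the Activity 1 table (the object names `actual` and `cond_on_smoking` are illustrative):

```r
# Rounded counts for the actual population of 170,733
actual <- matrix(c(8272, 1422, 78802, 82237), ncol = 2,
                 dimnames = list(c("male", "female"),
                                 c("smoker", "nonsmoker")))

# Each column now sums to 1, so the smoker column holds P(M | C) and P(F | C)
cond_on_smoking <- prop.table(actual, margin = 2)
round(cond_on_smoking[, "smoker"], 3)
```

Reading down the smoker column gives the posterior probabilities for part b) directly, without applying the formula by hand.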


In [6]:
# Add your code here
# --------------------

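prior_probs <- c(male = 0.51, female = 0.49)   # P(M), P(F)

like <- c(male = 0.095, female = 0.017)        # P(C | M), P(C | F)

post <- prior_probs * like
post

In [7]:
post/sum(post)

Based on the above, the prior probability that a randomly selected person is a female is 0.49. Once we learn that the person was smoking a cigar, the posterior probability drops to about 0.147 (1422/9694 in the table), so there is roughly a 15% chance that a randomly selected cigar smoker is a female. 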

Load the Framingham data from the directory '/datasets/framingham'.

In [12]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")
head(framingham_data)

male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>
1,39,4,0,0,0,0,0,0,195,106.0,70,26.97,80,77,0
0,46,2,0,0,0,0,0,0,250,121.0,81,28.73,95,76,0
1,48,1,1,20,0,0,0,0,245,127.5,80,25.34,75,70,0
0,61,3,1,30,0,0,1,0,225,150.0,95,28.58,65,103,1
0,46,3,1,23,0,0,0,0,285,130.0,84,23.1,85,85,0
0,43,2,0,0,0,0,1,0,228,180.0,110,30.3,77,99,0


**Activity 2:** Create a two-way table from this data set with diabetes condition in the columns and gender in the rows. Use addmargins to add totals.


In [17]:
# table(rows, columns): gender (male = 0/1) in the rows, diabetes (0/1) in the columns
dia <- table(framingham_data$male, framingham_data$diabetes)
rownames(dia) <- c('female','male')
colnames(dia) <- c('nondiabetes','diabetes')
dia.table <- as.table(dia)

In [19]:
addmargins(dia.table)

Unnamed: 0,nondiabetes,diabetes,Sum
female,2363,57,2420
male,1768,52,1820
Sum,4131,109,4240


In [20]:
addmargins(prop.table(dia.table))

Unnamed: 0,nondiabetes,diabetes,Sum
female,0.5573113,0.0134434,0.5707547
male,0.41698113,0.01226415,0.42924528
Sum,0.97429245,0.02570755,1.0


**Activity 3:** What is the probability that an individual has diabetes, given that the individual is female? Let $d$ be the event that an individual has diabetes and $d'$ the event that the individual does not.
Similarly, let $f$ be the event that the individual is female and $f'$ the event that the individual is male. 
Find $P(d | f)$ using Bayes' formula.

            
$$P(d | f) = \frac{P(d)\,P(f | d)}{P(d)\,P(f | d) + P(d')\,P(f | d')}$$

In [21]:
# Add your code here
# --------------------


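# Estimate each term of Bayes' formula directly from the data
p_d          <- mean(framingham_data$diabetes == 1)                             # P(d)
p_f_given_d  <- mean(framingham_data$male[framingham_data$diabetes == 1] == 0)  # P(f | d)
p_f_given_dc <- mean(framingham_data$male[framingham_data$diabetes == 0] == 0)  # P(f | d')

(p_d * p_f_given_d) / ((p_d * p_f_given_d) + ((1 - p_d) * p_f_given_dc))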

# Save your notebook!