# Report 01

Mafalda Borges

In [2]:
import thinkbayes2
from thinkbayes2 import Pmf

In this report I will work through a variety of problems that utilize Bayes' Theorem. The theorem, named after Reverend Thomas Bayes, is used to understand how the probability of a theory is affected by the addition of new data. One way of writing Bayes' Theorem is as follows:

P(H|D) = [P(H)P(D|H)]/P(D)

Where:
P(H|D) is the probability of the hypothesis after seeing the new data, called the posterior.
P(H) is the probability of the hypothesis before seeing the new data, called the prior.
P(D|H) is the probability of the data given the hypothesis, called the likelihood.
P(D) is the probability of the data under any hypothesis, called the normalizing constant.
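
As a minimal sketch of how these four quantities fit together, the function below applies the theorem to a small set of hypotheses in plain Python. The name `bayes_update` and the dictionary layout are my own illustrative choices and are not part of thinkbayes2. The numbers in the example call are made up.

```python
def bayes_update(priors, likelihoods):
    """Return (posterior, normalizing_constant) for a discrete set of hypotheses.

    Both arguments are dicts keyed by hypothesis name (a hypothetical layout
    chosen for this sketch).
    """
    # Numerator of Bayes' Theorem for each hypothesis: P(H) * P(D|H)
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # P(D): total probability of the data, summed over all hypotheses
    p_data = sum(unnormalized.values())
    # Divide through so that the posteriors sum to 1
    posterior = {h: value / p_data for h, value in unnormalized.items()}
    return posterior, p_data


# Made-up numbers, purely for illustration
posterior, p_data = bayes_update({'H1': 0.5, 'H2': 0.5}, {'H1': 0.2, 'H2': 0.6})
print(posterior, p_data)
```

The problems below follow the same three steps: multiply each prior by its likelihood, add up the products, and divide.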


The following problems are taken from Allen Downey's blog, "Probably Overthinking It."

Below I have copied in the first problem which involves cookies.

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?

Starting off, the first hypothesis is Fred picked from Bowl #1, which leaves one other option, Fred picking from Bowl #2. Given that there is an equal chance for Fred to pick from either bowl then P(H1) = .5 and P(H2) = .5, which are our priors. This can be expressed using Pmf (probability mass function).

In [5]:
pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)

We also know the composition of cookies within each bowl. This means the likelihood of drawing a plain cookie from the first bowl is 3/4 and from the second bowl is 1/2.

In [6]:
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)

Mult, unsurprisingly, multiplies each prior by the corresponding likelihood, but now the results are not normalized because they do not sum to 1. We next divide by the normalizing constant, which Normalize returns.

In [7]:
pmf.Normalize()

0.625

Now that the distribution is normalized, we can print the probability of picking from Bowl #1.

In [8]:
print(pmf.Prob('Bowl 1'))

0.6


The answer is 0.6, meaning that given the data, there is a 60% chance Fred picked from Bowl #1.

Simple enough? Let's move on to a more fun one. 

The blue M&M was introduced in 1995.  Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan).  Afterward it was (24% Blue , 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown). 
A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996.  He won't tell me which is which, but he gives me one M&M from each bag.  One is yellow and one is green.  What is the probability that the yellow M&M came from the 1994 bag?

From the problem we get two hypotheses, which are equally likely before we see any data. Under H1, the yellow M&M came from the 1994 bag and the green from the 1996 bag. Under H2, the yellow came from the 1996 bag and the green from the 1994 bag.

Our priors are:
P(H1) = P(H2) = .5

Our likelihoods are a bit more difficult. For each hypothesis, we multiply the probabilities of drawing each M&M's color from the bag it is assumed to come from. We will call the observed colors the evidence, E.

Likelihoods:
P(E|H1) = (0.2)(0.2)
P(E|H2) = (0.1)(0.14)

The rest is just arithmetic (multiplying the priors by the corresponding likelihoods and dividing by the normalizing constant).

In [11]:
priorA = thinkbayes2.Suite({'Yellow 1994': 0.5, 'Yellow 1996': 0.5})
likelihoodA = {'Yellow 1994': 0.04, 'Yellow 1996': 0.014}
posteriorA = priorA.Copy()
for hypothesisA in posteriorA:
    posteriorA[hypothesisA] *= likelihoodA[hypothesisA]
posteriorA.Normalize()
posteriorA.Print()

Yellow 1994 0.740740740741
Yellow 1996 0.259259259259


This means that given the data there is approximately a 74% chance the yellow M&M is from the 1994 bag. 

Let's keep going with a problem about Elvis. 

Elvis Presley had a twin brother who died at birth.  What is the probability that Elvis was an identical twin?

This problem requires some background information. According to the Wikipedia article on twins: "Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total---and 8% of all twins."

Since we know Elvis was a twin, let's use only statistics about twin births rather than all births. This leaves two hypotheses: either Elvis was an identical twin or he was a fraternal twin.

Our priors are then:
P(H1) = 0.08
P(H2) = 0.92

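The likelihoods take advantage of the fact that Elvis's twin was a brother. We will assume that males and females are born at roughly the same rate, so a fraternal twin is a brother about half the time. Identical twins are always the same sex, so an identical twin of Elvis was certain to be male.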

Our likelihoods are:
P(E|H1) = 1
P(E|H2) = .5

Then we do the relevant arithmetic.

In [12]:
priorB = thinkbayes2.Suite({'Identical': 0.08, 'Fraternal': 0.92})
likelihoodB = {'Identical': 1, 'Fraternal': .5}
posteriorB = priorB.Copy()
for hypothesisB in posteriorB:
    posteriorB[hypothesisB] *= likelihoodB[hypothesisB]
posteriorB.Normalize()
posteriorB.Print()

Fraternal 0.851851851852
Identical 0.148148148148


This shows that there was only roughly a 15% chance Elvis's twin was identical. 

This next problem is a little stranger.

Two people have left traces of their own blood at the scene of a crime.  A suspect, Oliver, is tested and found to have type O blood.  The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%).  Do these data (the blood types found at the scene) give evidence in favour [sic] of the proposition that Oliver was one of the two people whose blood was found at the scene?

First let's begin with the easier part, which is determining the hypotheses. Hypothesis #1 is that Oliver and one unknown person were at the scene of the crime. Hypothesis #2 is that two unknown people were at the scene of the crime. This problem is different in that it asks for the Bayes' factor rather than the posterior probability. The Bayes' factor is the ratio of the likelihoods. Let's call D the data, that is, 'type O' and 'type AB' blood being found. Then the Bayes' factor can be written as:

BF = [P(D|H1)]/P(D|H2)

We have to compute each part separately. We know both type O and type AB blood were found. If Oliver was at the scene, then the type O blood is his, so P(D|H1) is just the probability that the other person has type AB blood, which the problem gives as 0.01. If Oliver was not at the scene, then we need the probability that two random people have those two blood types. We find P(D|H2) = 2*0.6*0.01. The factor of 2 is there because there are two permutations: the first person is type O and the second is type AB, or vice versa.

By substitution the Bayes' factor is:

BF = 0.01/(2*0.6*0.01) = 0.83

Because BF < 1, the data are weak evidence against Oliver: they make it slightly less probable than before that Oliver was at the scene.
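
The ratio can be checked with a few lines of plain Python. This is only a sketch of the arithmetic described above, and the variable names are my own.

```python
# Blood type frequencies stated in the problem
p_O = 0.60   # type O
p_AB = 0.01  # type AB

# H1: Oliver accounts for the type O trace, so only the other person must be AB
like_h1 = p_AB
# H2: two unknown people, one type O and one type AB, in either order
like_h2 = 2 * p_O * p_AB

bayes_factor = like_h1 / like_h2
print(round(bayes_factor, 3))  # 0.833
```

Note that the type AB frequency cancels in the ratio, so the result depends only on how common type O is: any frequency above 50% gives a Bayes' factor below 1.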

This next problem requires just a little more creativity when coming up with the solution.

According to the CDC, "Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely."
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?

Let's represent the fraction of women who smoke as s. This means the fraction of women who don't smoke is 1-s. Next, if we say that the fraction of non-smoking women with lung cancer is c, then the fraction of smoking women with lung cancer is 13 times that amount, or 13c. This makes the number of women who smoke and have lung cancer proportional to 13sc and the number of non-smoking women who have lung cancer proportional to c(1-s).

Looking at the population of women with lung cancer, the fraction of those who smoke is 13sc/(13sc + c(1-s)), and we can see that the c's cancel, leaving 13s/(13s + (1-s)). In 2013 the CDC estimated that 15.3 percent of women in the US smoked. If you substitute 0.153 for s, this fraction is about 70%, meaning that there is a 70% chance the woman with lung cancer is a smoker.
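
As a quick check of that substitution, here is a small sketch in plain Python. The function name is hypothetical, and the baseline cancer rate c is left out because it cancels.

```python
def fraction_smokers_among_cases(s, relative_risk=13):
    """Fraction of women with lung cancer who smoke, given smoking rate s."""
    # c (the cancer rate among non-smokers) cancels, so it never appears here
    return relative_risk * s / (relative_risk * s + (1 - s))


result = fraction_smokers_among_cases(0.153)
print(round(result, 3))  # 0.701
```

The answer is sensitive to s: a different estimate of the smoking rate would shift the result, which is why the source and year of the 15.3% figure matter.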

Let's try a problem of a less serious nature, which was posted on Reddit.

The fictional university of ZombieLand - has n students volunteer for the role of Killing Assistants, student volunteers who help in smooth killing of zombies in the campus. Every student can either manage to shoot a zombie or miss it.
Out of these n Killing Assistants, m students try to shoot the zombie fairly, that is, they are unbiased, while the rest of them are biased. The probability that a zombie will be hit by a shot if the Killing Assistant is biased is 2/3, otherwise the probability is 1/2.
Little Harry figures out this mess by selecting a Killing Assistant randomly. He tricks that particular Killing Assistant in shooting the same zombie twice. The first time, the Killing Assistant shoots the zombie, while the other time, he ends up missing it.
Harry wants to know what are the chances that the selected Killing Assistant was NOT biased.

Our hypotheses are straightforward. Hypothesis #1 is that the Killing Assistant is not biased. Hypothesis #2 is that the Killing Assistant is biased. We know that there are n student volunteers and m of them are unbiased.

This makes our priors:
P(H1) = m/n - proportion of unbiased Killing Assistants
P(H2) = (n-m)/n - proportion of biased Killing Assistants

Then we must compute likelihood. We know that an unbiased Killing Assistant has a 1/2 chance of shooting the zombie and a 1/2 chance of missing the zombie. The biased Killing Assistant has a 2/3 chance of shooting the zombie and a 1/3 chance of missing. 

This makes the likelihoods:
P(D|H1) = (1/2)*(1/2)
P(D|H2) = (2/3)*(1/3)

Then we go through multiplying the priors by the corresponding likelihoods and dividing it by the normalizing constant. Because I am partial to the pen-and-paper method (I'm biased), I am going to write out some of my steps. 

The normalizing constant, which is the sum of the products of the corresponding priors and likelihoods, is in this case:
P(D) = (m+8n)/(36n)

This makes the posterior probability that the Killing Assistant is unbiased 9m/(m+8n). The posterior probability that the Killing Assistant is biased is 8(n-m)/(m+8n).

Let's pretend that half the Killing Assistants are biased. Then the posterior probability of an unbiased Killing Assistant is 9/17. If instead the fraction of unbiased Killing Assistants in the total population is 8/17, then it is equally likely that the Killing Assistant was biased or unbiased.
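
The pen-and-paper result can be double-checked with exact fractions. This is a sketch using Python's standard `fractions` module, and `posterior_unbiased` is a name I made up for it.

```python
from fractions import Fraction


def posterior_unbiased(m, n):
    """Posterior probability that the chosen assistant is unbiased,
    after observing one hit followed by one miss."""
    prior_unbiased = Fraction(m, n)
    prior_biased = Fraction(n - m, n)
    like_unbiased = Fraction(1, 2) * Fraction(1, 2)  # hit, then miss
    like_biased = Fraction(2, 3) * Fraction(1, 3)    # hit, then miss
    numerator = prior_unbiased * like_unbiased
    # Normalizing constant: sum of prior * likelihood over both hypotheses
    p_data = numerator + prior_biased * like_biased
    return numerator / p_data


print(posterior_unbiased(1, 2))   # 9/17
print(posterior_unbiased(8, 17))  # 1/2
```

Both values agree with the closed form 9m/(m+8n).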

Below is a problem I created myself.

In the United States, in a given year about 2% of women have breast cancer. To detect this cancer, doctors use mammograms to x-ray the tissue. For women with cancer, around 84% of the screens will come back positive; for women without cancer, 6.5% of the screens will yield a false positive. Given a positive result, what are the chances that the woman has cancer?

First I determine my hypotheses, which are either the woman has cancer or she doesn't. Next I need to determine the prior probability that she has cancer based on national statistics of women with breast cancer. I am calling the population of women with cancer 'Cancer' and the population without 'No Cancer'.

In [3]:
prior = thinkbayes2.Suite({'Cancer': 0.02, 'No Cancer': 0.98})
prior.Print()

Cancer 0.02
No Cancer 0.98


Next I enter the rate at which each group gets a positive result after receiving a mammogram. I call this the likelihood.

In [5]:
likelihood = {'Cancer': 0.84, 'No Cancer': 0.065}

Now that the information from the problem has all been accounted for I can apply Bayes' Theorem. I will create a posterior distribution which is the result of multiplying the prior probability of each hypothesis by the likelihood of each event. Afterwards I will have to normalize the distribution. 

In [7]:
posterior = prior.Copy()
for hypothesis in posterior:
    posterior[hypothesis] *= likelihood[hypothesis]
posterior.Print()

Cancer 0.0168
No Cancer 0.0637


Next I normalize the data, so it is expressed in terms of probability. The following returns the normalizing constant.

In [8]:
posterior.Normalize()

0.0805

After normalizing, the distribution shows the probabilities of having cancer (or not) after receiving a positive result.

In [9]:
posterior.Print()

Cancer 0.208695652174
No Cancer 0.791304347826


This shows that given a positive result, there is only about a 21% chance the woman actually has cancer.

This information was taken from breastcancer.org, the Susan G. Komen organization, and cancer.net. 

And now one more original problem.

In 1988 the New York Times conducted a poll about preferred skirt length. One man interviewed stated that he pictured a woman who wore short skirts (above the knee) as "probably 40 or younger, probably not in business, and probably more liberal than conservative."

How likely is a young woman wearing a skirt to be liberal?

The article tells us about 40% of liberal women pay attention to fashion trends, compared to 21% of moderates and 24% of conservatives. However, further reading tells us that 68% of women who pay attention to fashion trends don't wear short skirts.



Let's first set the three hypotheses. Hypothesis #1 is that the woman is liberal, #2 that she is conservative, and #3 that she is moderate.

Let's set up the priors by first looking at the political affiliations of young women, which requires further research. The numbers I found were from a Gallup poll.

In [3]:
priorS = thinkbayes2.Suite({'Liberal': 0.3, 'Conservative': 0.2, 'Moderate':.5})
priorS.Print()

Conservative 0.2
Liberal 0.3
Moderate 0.5


Next we can look at how often each group wears short skirts to establish likelihoods. We can get this number by multiplying the fraction of women in each group who pay attention to fashion trends by the fraction of trend-following women who wear short skirts (32%, since 68% do not).

In [4]:
likelihoodS = {'Liberal': 0.128, 'Conservative': 0.0768, 'Moderate':0.0672}
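
For reference, the three likelihood values can be reproduced from the article's figures with the sketch below. It assumes, as the reasoning above does, that the same 32% rate applies to trend-followers in every political group; the variable names are my own.

```python
# Share of each group that follows fashion trends (from the article)
follows_trends = {'Liberal': 0.40, 'Moderate': 0.21, 'Conservative': 0.24}
# 68% of trend-followers do not wear short skirts, so 32% do
# (assumed here to hold equally for all three groups)
wears_short_given_trends = 1 - 0.68

likelihood_check = {group: round(share * wears_short_given_trends, 4)
                    for group, share in follows_trends.items()}
print(likelihood_check)
```

Because every group is multiplied by the same 32%, that factor cancels when the posterior is normalized, so the final answer depends only on the 40/21/24 split and the priors.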

Now what is left is applying Bayes' Theorem, which involves multiplying the priors by the corresponding likelihoods and then normalizing the data.

In [5]:
posteriorS = priorS.Copy()
for hypothesisS in posteriorS:
    posteriorS[hypothesisS] *= likelihoodS[hypothesisS]
posteriorS.Print()


Conservative 0.01536
Liberal 0.0384
Moderate 0.0336


Normalizing the data allows it to be expressed in terms of probability.

In [6]:
posteriorS.Normalize()

0.08736

Now we can see how likely this young woman was to be liberal.

In [7]:
posteriorS.Print()

Conservative 0.175824175824
Liberal 0.43956043956
Moderate 0.384615384615


Even with a large moderate population, the woman still has about a 44% chance of being liberal, which is higher than for either of the other groups. What is also interesting is how unlikely she is to be conservative.