# BAYES' THEOREM

## Independent Events

The ability to determine whether two events are independent is an important skill for statistics.

If two events are independent, then the occurrence of one event does not affect the probability of the other event. Here are some examples of independent events:

* I wear a blue shirt; my coworker wears a blue shirt
* I take the subway to work; I eat sushi for lunch
* The NY Giants win their football game; the NY Rangers win their hockey game

If two events are dependent, then when one event occurs, the probability of the other event occurring changes in a predictable way.

Here are some examples of dependent events:

* It rains on Tuesday; I carry an umbrella on Tuesday
* I eat spaghetti; I have a red stain on my shirt
* I wear sunglasses; I go to the beach

# Question
1.
A certain family plans to have three children. Is the event that the couple’s third child is a girl independent of the event that the couple’s first two children are girls?

Save your answer ('independent' or 'not independent') to the variable third_child.

In [1]:
third_child = 'independent'
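
One way to check this answer is to list every possible family and count. The sketch below assumes each child is a girl with probability 1/2 regardless of siblings, so all eight birth orders are equally likely; the helper names (`prob`, `third_is_girl`, `first_two_girls`) are illustrative and not part of the lesson.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely birth orders for three children, e.g. ('G', 'B', 'G').
# Assumption: each child is a girl with probability 1/2.
families = list(product("GB", repeat=3))

def prob(event):
    """Exact probability of an event over the equally likely families."""
    return Fraction(sum(1 for f in families if event(f)), len(families))

def third_is_girl(f):
    return f[2] == "G"

def first_two_girls(f):
    return f[0] == "G" and f[1] == "G"

p_third = prob(third_is_girl)
p_first_two = prob(first_two_girls)
p_both = prob(lambda f: third_is_girl(f) and first_two_girls(f))

print(p_third, p_first_two, p_both)      # 1/2 1/4 1/8
print(p_both == p_third * p_first_two)   # True
```

Because the probability of both events equals the product of the individual probabilities, the two events behave as independent under this assumption.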

# Conditional Probability

Conditional probability is the probability that one event happens given that another event has happened. To work with it, we first need the probability that two events both happen, which is easiest to calculate when the two events are independent.

Note: For the rest of this lesson, we’ll be using the statistical convention that the probability of an event is written as P(event).

If the probability of event A is P(A) and the probability of event B is P(B) and the two events are independent, then the probability of both events occurring is the product of the probabilities:

$P(A \cap B) = P(A) \times P(B)$

The symbol ∩ just means “and”, so P(A ∩ B) means the probability that both A and B happen.

For instance, suppose we are rolling a pair of dice, and want to know the probability of rolling two sixes.

<img src="images/dice.png" style="width: 400;"/>



Each die has six sides, so the probability of rolling a six is 1/6. Each die is independent (i.e., rolling one six does not increase or decrease our chance of rolling a second six), so:

$ P(6 \cap 6) = P(6) \times P(6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36} $
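
The same result can be confirmed by counting outcomes directly. This is a minimal sketch that assumes two fair six-sided dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome of two fair six-sided dice: (1, 1), (1, 2), ..., (6, 6).
rolls = list(product(range(1, 7), repeat=2))

# Only one of those outcomes is a pair of sixes.
double_sixes = [r for r in rolls if r == (6, 6)]
p_two_sixes = Fraction(len(double_sixes), len(rolls))

print(len(rolls))     # 36
print(p_two_sixes)    # 1/36
print(p_two_sixes == Fraction(1, 6) * Fraction(1, 6))  # True
```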


# Question
1.
This week, there is a 30% probability that it will rain on any given day. At a certain high school, gym class is held on three days out of the five-day school week.

On a school day, what is the probability that it is raining and the students have gym class?

Save your answer to the variable p_rain_and_gym.

# hint

$P(rain)=0.30$

$P(gym) = \frac{3.0}{5.0} = 0.60$

Because these two events are independent:

$P(rain \cap gym) = P(rain) \times P(gym)$

In [2]:
import numpy as np

p_rain_and_gym = .30 * (3.0/5.0)

print(p_rain_and_gym)

0.18
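
As a rough sanity check, the product rule can be compared against a simulation. The sketch below is not part of the original exercise: it assumes rain and the gym schedule are independent, picks the school day uniformly at random, and uses a fixed seed so the run is repeatable.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
trials = 100_000

hits = 0
for _ in range(trials):
    raining = rng.random() < 0.30     # 30% chance of rain on any given day
    gym_day = rng.randrange(5) < 3    # gym on 3 of the 5 school days
    if raining and gym_day:
        hits += 1

estimate = hits / trials
print(estimate)   # should land close to 0.18
```

The estimate will not match 0.18 exactly, but it should get closer as `trials` grows.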


# Testing for a Rare Disease

Suppose you are a doctor and you need to test if a patient has a certain rare disease. The test is very accurate: it’s correct 99% of the time. The disease is very rare: only 1 in 100,000 patients have it.

You administer the test and it comes back positive, so your patient must have the disease, right?

Not necessarily. If we just consider the test, there is only a 1% chance that it is wrong, but we actually have more information: we know how rare the disease is.

Given that the test came back positive, there are two possibilities:

* The patient had the disease, and the test correctly diagnosed the disease.
* The patient didn’t have the disease and the test incorrectly diagnosed that they had the disease.

# Question
1.
What is the probability that the patient had the disease and the test correctly diagnosed the disease?

Save your answer to the variable p_disease_and_correct.


Hint
The disease is rare, so the probability that the patient had the disease is 1 out of 100,000:

$ P(disease) = \frac{1}{100000}$
 
The test is only wrong 1% of the time, so it is correct 99% of the time:

$P(test\ is\ correct) = 0.99$ 

So the answer should look like:

p_disease_and_correct = (1.0 / 100000) * 0.99

2.
What is the probability that the patient does not have the disease and the test incorrectly diagnosed the disease?

Save your answer to the variable p_no_disease_and_incorrect.


Hint
The disease is rare, so the probability that the patient does not have the disease is 99,999 out of 100,000:

$P(no\ disease) = \frac{99999}{100000}$

The test is only wrong 1% of the time:

$P(test\ is\ incorrect) = 0.01$

So the answer should look like:

p_no_disease_and_incorrect = (99999.0 / 100000) * 0.01

3.
Print both p_disease_and_correct and p_no_disease_and_incorrect.

In [3]:
import numpy as np

p_disease_and_correct = .99 * (1.0/100000)

print(p_disease_and_correct)

p_no_disease_and_incorrect = (99999.0/100000) * .01

print(p_no_disease_and_incorrect)

9.9e-06
0.0099999
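
The gap between these two numbers is easier to see as head counts. The sketch below uses a hypothetical population of 10,000,000 patients, chosen only so that every count comes out as a whole number, and assumes the test is wrong 1% of the time for both sick and healthy patients.

```python
# Hypothetical population size, chosen so every count is a whole number.
population = 10_000_000

diseased = population // 100_000         # 1 in 100,000 has the disease: 100
healthy = population - diseased          # 9,999,900

true_positives = diseased * 99 // 100    # test is right 99% of the time: 99
false_positives = healthy // 100         # test is wrong 1% of the time: 99,999

all_positives = true_positives + false_positives
print(true_positives, false_positives)   # 99 99999
print(true_positives / all_positives)    # roughly 0.001
```

Out of roughly 100,000 positive results, only 99 belong to patients who actually have the disease.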


# Question

1.
Calculate P(positive result | rare disease), or the probability of a positive test result, given that a patient really has this rare disease.

Save your answer (as a decimal) to p_positive_given_disease.


Hint
The test is 99% accurate; given the fact that the patient has the disease, we know that there is a 99% probability that the test will return a positive result.

This is exactly $P(positive\ result\ |\ rare\ disease)$.

So the answer should look like:

p_positive_given_disease = 0.99

2.
What is P(rare disease), the probability that a randomly selected patient has the rare disease?

Save your answer to p_disease.


Hint
The disease is very rare. Only 1 in 100,000 people have it.

3.
We now need to compute the denominator; we need to find P(positive result).

As we discussed previously, there are two ways to get a positive result:

* The patient had the disease, and the test correctly diagnosed the disease.
* The patient didn’t have the disease and the test incorrectly diagnosed that they had the disease.

Using these two probabilities, calculate the total probability that a randomly selected patient receives a positive test result, P(positive result).

Save your answer to the variable p_positive.


Hint
The probability that the patient had the disease, and the test correctly diagnosed the disease is:

$1.0 / 100000.0 * 0.99$

The probability that the patient didn’t have the disease and the test incorrectly diagnosed that they had the disease is:

$99999.0 / 100000.0 * 0.01$

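These two events are mutually exclusive (a patient cannot both have and not have the disease), so the probability of either one happening is the sum of their probabilities:

$P(A\ or\ B) = P(A) + P(B)$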

4.
Substitute all three of these values into Bayes’ Theorem and calculate $P(rare\ disease\ |\ positive\ result)$.

Save your result as p_disease_given_positive.


Hint
The numerator should be (p_positive_given_disease) * (p_disease).

The denominator should be p_positive.
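
Written out with the numbers from the earlier steps, the substitution might look like the following. This is only a worked sketch; it takes $P(positive\ result)$ to be the sum of the two joint probabilities from step 3.

$$
P(rare\ disease \mid positive\ result) = \frac{P(positive\ result \mid rare\ disease) \cdot P(rare\ disease)}{P(positive\ result)} = \frac{0.99 \times 0.00001}{0.0000099 + 0.0099999} = \frac{0.0000099}{0.0100098} \approx 0.000989
$$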

5.
Print p_disease_given_positive.

Is it likely that your patient has this disease?


Hint
print(p_disease_given_positive)
The result should be approximately:

0.000989

In [4]:
import numpy as np

# P(positive result | rare disease): the test is correct 99% of the time
p_positive_given_disease = 0.99
print(p_positive_given_disease)

# P(rare disease)
p_disease = 1./100000.
print(p_disease)


# P(positive result): true positives plus false positives
p_positive = p_positive_given_disease * p_disease + 0.01 * (1 - p_disease)
print(round(p_positive, 7))


# P(rare disease | positive result)
# result
p_disease_given_positive = (p_positive_given_disease) * (p_disease) / (p_positive)

print(round(p_disease_given_positive, 6))

0.99
1e-05
0.0100098
0.000989
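
The calculation can also be wrapped in a small reusable function to see how much the answer depends on how rare the disease is. This is a sketch rather than part of the exercise: the function name `posterior`, its parameter names, and the 1-in-100 comparison case are all assumptions made for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive result) via Bayes' Theorem.

    prior               -- P(condition)
    sensitivity         -- P(positive | condition)
    false_positive_rate -- P(positive | no condition)
    """
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

# The rare disease from this lesson: 1 in 100,000 patients.
rare = posterior(prior=1 / 100_000, sensitivity=0.99, false_positive_rate=0.01)

# Hypothetical comparison: the same test for a condition affecting 1 in 100.
common = posterior(prior=1 / 100, sensitivity=0.99, false_positive_rate=0.01)

print(round(rare, 6))     # 0.000989
print(round(common, 2))   # 0.5
```

With the same test, a positive result is far more convincing when the condition is common, which is the effect of the prior.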


# Spam Filters
Let’s explore a different example. Email spam filters use Bayes’ Theorem to determine if certain words indicate that an email is spam.

Let’s take a word that often appears in spam: “enhancement”.

With just 3 facts, we can make some preliminary steps towards a good spam filter:

1. “enhancement” appears in just 0.1% of non-spam emails

2. “enhancement” appears in 5% of spam emails

3. Spam emails make up about 20% of total emails

Given that an email contains “enhancement”, what is the probability that the email is spam?

# Question
1.
In this example, we are dealing with two probabilities:

P(enhancement) - the probability that the word “enhancement” appears in an email.

P(spam) - the probability that an email is spam.

Using Bayes’ Theorem to answer our question means that we want to calculate P(A|B).

But what are A and B referring to in this case?

Save the string 'enhancement' or 'spam' to the variable a.

Save the string 'enhancement' or 'spam' to the variable b.



2.
What is P(spam)?

Save your answer to p_spam.


3.
What is P(enhancement | spam)?

Save your answer to p_enhancement_given_spam.


4.
We want to know the overall probability that any email (spam or non-spam) contains “enhancement”.

Because we know the probability of “enhancement” occurring in both spam (0.05) and non-spam (0.001) emails, we can use a weighted average to calculate the probability of “enhancement” occurring in an email:

$P(enhancement) = P(enhancement\ |\ spam) \times P(spam) + P(enhancement\ |\ not\ spam) \times P(not\ spam)$

Save your answer to p_enhancement.


Hint
The probability that an email is not spam is just 1 - the probability that the email is spam:

$P(not\ spam) = 1 - P(spam) $

$P(enhancement\ |\ not\ spam)$ is given to you above.

5.
Now that we know:

P(spam)
P(enhancement | spam)
P(enhancement)

We can plug this into Bayes’ Theorem:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$ 


Save your answer as p_spam_enhancement.



6.
Print p_spam_enhancement. This is the probability that an email is spam given that it contains the word “enhancement”.

Should we block all emails that contain “enhancement”?

How much non-spam email would we block?

In [6]:
import numpy as np

a = 'spam'
b = 'enhancement'

p_spam = 0.2

p_enhancement_given_spam = 0.05

p_enhancement = 0.05 * 0.2 + 0.001 * (1 - 0.2)

p_spam_enhancement = p_enhancement_given_spam * p_spam / p_enhancement

print(p_spam_enhancement)

0.9259259259259259
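
The two follow-up questions can be explored with a few more lines. The sketch below reuses the three facts from this section; the variable names are illustrative, and "ham" is used here only as shorthand for non-spam email.

```python
p_spam = 0.2
p_word_given_spam = 0.05
p_word_given_ham = 0.001   # "ham" is shorthand here for non-spam

# Share of all email that contains the word, split by source.
flagged_spam = p_word_given_spam * p_spam        # about 0.01
flagged_ham = p_word_given_ham * (1 - p_spam)    # about 0.0008

p_spam_given_word = flagged_spam / (flagged_spam + flagged_ham)
p_ham_given_word = 1 - p_spam_given_word

print(round(p_spam_given_word, 4))   # 0.9259
print(round(p_ham_given_word, 4))    # 0.0741
```

Under these numbers, about 7.4% of the blocked emails would be legitimate, although that amounts to only 0.1% of all non-spam email. The rule would also catch just 5% of spam, so a single word is a starting point rather than a complete filter.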


# Review

In this course, we learned several new definitions:

* Two events are independent if the occurrence of one event does not affect the probability of the second event

* If two events are independent then:

   $P(A \cap B) = P(A) \times P(B)$
   

* A prior is an additional piece of information that tells us how likely an event is

* A frequentist approach to statistics does not incorporate a prior

* A Bayesian approach to statistics incorporates prior knowledge

* Bayes’ Theorem is the following:

   $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
   
   
 <img src="images/bayes.svg" style="width: 400;"/>