# Chapter 1 - Probability
http://allendowney.github.io/ThinkBayes2/

## Reading

In [1]:
# Load the data file

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv')

Downloaded gss_bayes.csv


In [2]:
import pandas as pd

gss = pd.read_csv('gss_bayes.csv', index_col=0)
gss.head()

caseid,year,age,sex,polviews,partyid,indus10
1,1974,21.0,1,4.0,2.0,4970.0
2,1974,41.0,1,5.0,0.0,9160.0
5,1974,58.0,2,6.0,1.0,2670.0
6,1974,30.0,1,5.0,4.0,6870.0
7,1974,48.0,1,5.0,4.0,7860.0


### Probability

In [3]:
banker = (gss['indus10'] == 6870)
banker.head()

caseid
1    False
2    False
5    False
6     True
7    False
Name: indus10, dtype: bool

In [4]:
banker.sum()

728

In [5]:
banker.mean()

0.014769730168391155

In [6]:
def prob(A):
    """Computes the probability of a proposition, A."""    
    return A.mean()

In [7]:
prob(banker)

0.014769730168391155
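
`prob` works because pandas counts `True` as 1 and `False` as 0, so the mean of a Boolean `Series` is the fraction of its values that are `True`. For `banker` that is 728 out of 49290 respondents. A minimal sketch on a made-up five-element series (the values in `toy` are invented for illustration and do not come from the GSS data):

```python
import pandas as pd

# Hypothetical Boolean Series, not taken from the GSS dataset
toy = pd.Series([True, False, False, True, False])

count_true = toy.sum()       # True counts as 1, False as 0
fraction_true = toy.mean()   # the sum divided by the number of elements

print(count_true, fraction_true)
```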

In [9]:
female = gss['sex'] == 2

In [10]:
prob(female)

0.5378575776019476

In [11]:
liberal = gss['polviews'] <= 3

In [12]:
prob(liberal)

0.27374721038750255

In [13]:
democrat = gss['partyid'] <= 1

In [14]:
prob(democrat)

0.3662609048488537

### Conjunction

In [15]:
prob(banker)

0.014769730168391155

In [16]:
prob(democrat)

0.3662609048488537

In [17]:
prob(banker & democrat)

0.004686548995739501

In [18]:
prob(democrat & banker)

0.004686548995739501

### Conditional probability

In [19]:
democrat

caseid
1       False
2        True
5        True
6       False
7       False
        ...  
2863     True
2864    False
2865    False
2866    False
2867    False
Name: partyid, Length: 49290, dtype: bool

In [20]:
liberal

caseid
1       False
2       False
5       False
6       False
7       False
        ...  
2863     True
2864    False
2865    False
2866    False
2867    False
Name: polviews, Length: 49290, dtype: bool

In [22]:
selected = democrat[liberal]
selected

caseid
12      False
25       True
26      False
32       True
38      False
        ...  
2845    False
2849     True
2856    False
2857     True
2863     True
Name: partyid, Length: 13493, dtype: bool

`selected` contains the values of `democrat` (indexed by `caseid`) for only those respondents whose value in `liberal` is `True`. The selected values themselves can be `True` or `False`.
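
Indexing one Boolean `Series` with another keeps only the rows where the mask is `True`. The sketch below shows this on a small example; the case IDs and values are hypothetical, and the names `liberal_toy`, `democrat_toy`, and `selected_toy` are introduced here only for illustration:

```python
import pandas as pd

# Hypothetical respondents, indexed by made-up case IDs
index = [101, 102, 103, 104, 105]
liberal_toy = pd.Series([True, False, True, True, False], index=index)
democrat_toy = pd.Series([True, True, False, True, False], index=index)

# Keep the democrat values only for rows where liberal_toy is True
selected_toy = democrat_toy[liberal_toy]

print(selected_toy)          # one row per liberal, True or False
print(selected_toy.mean())   # fraction of liberals who are Democrats
```

Rows 102 and 105 are dropped because they are not liberal. Row 103 stays even though its value is `False`, because it describes a liberal who is not a Democrat.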

In [23]:
prob(selected)

0.5206403320240125

This is the probability that someone is a Democrat given that they are liberal.

In [24]:
selected = female[banker]
prob(selected)

0.7706043956043956

In [25]:
def conditional(proposition, given):
    """Probability of a proposition conditioned on given."""
    return prob(proposition[given])

In [26]:
conditional(liberal, given=female)

0.27581004111500884

**Conditional probability is not commutative.**

In [27]:
conditional(female, given=liberal)

0.5419106203216483
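
The two results differ because each is a fraction of a different group: the first is computed over the female respondents only, the second over the liberal respondents only. A small sketch with invented data shows the same effect (`A` and `B` are hypothetical propositions, and `prob` and `conditional` are repeated so the block runs on its own):

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of proposition among rows where given is True."""
    return prob(proposition[given])

# Hypothetical data: 6 made-up respondents
A = pd.Series([True, True, False, False, False, False])
B = pd.Series([True, True, True, True, False, False])

p_a_given_b = conditional(A, given=B)   # 2 of the 4 rows in B are also in A
p_b_given_a = conditional(B, given=A)   # both rows in A are also in B

print(p_a_given_b, p_b_given_a)
```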

### Condition and conjunction

In [28]:
conditional(female, given=liberal & democrat)

0.576085409252669

In [29]:
conditional(liberal & female, given=banker)

0.17307692307692307

### Laws of probability

#### Theorem 1
$$P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)}$$

What fraction of bankers are female?
We can find:
1. The fraction of respondents who are female bankers, and
2. The fraction of respondents who are bankers

and then divide.

In [30]:
prob(female & banker) / prob(banker)

0.7706043956043956

This gives the same result:

In [32]:
conditional(female, given=banker)

0.7706043956043956

#### Theorem 2
$$P(A~\mathrm{and}~B) = P(B) ~ P(A|B)$$
This follows from Thm. 1.

In [33]:
prob(democrat) * conditional(liberal, given=democrat)

0.1425238385067965

In [34]:
prob(democrat & liberal)

0.1425238385067965

#### Theorem 3 - Bayes's Theorem
$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$
This follows from combining Thm. 1 and Thm. 2.
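
The combination takes two steps. Conjunction is commutative (as the `banker & democrat` example showed), so Theorem 2 can be applied with the roles of $A$ and $B$ swapped:

$$P(A~\mathrm{and}~B) = P(B~\mathrm{and}~A) = P(A) ~ P(B|A)$$

Substituting this into the numerator of Theorem 1 gives

$$P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)} = \frac{P(A) ~ P(B|A)}{P(B)}$$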

In [35]:
conditional(liberal, given=banker)

0.2239010989010989

In [36]:
prob(liberal) * conditional(banker, given=liberal) / prob(banker)

0.2239010989010989

### The law of total probability
$$P(A) = \sum_i P(B_i) P(A|B_i)$$


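In this dataset every respondent is recorded as either male or female, so the two groups are mutually exclusive and together cover all respondents, which is what the law requires of the $B_i$.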

In [37]:
prob(banker)

0.014769730168391155

In [38]:
male = (gss['sex'] == 1)

In [39]:
prob(male & banker) + prob(female & banker)

0.014769730168391155

In [40]:
prob(male) * conditional(banker, given=male) + prob(female) * conditional(banker, given=female)

0.014769730168391153
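
The last result ends in `...153` rather than `...155` because the two expressions perform different floating-point operations and round differently; mathematically they are equal. A minimal sketch, assuming made-up data with two groups that partition the respondents, compares the results with a tolerance instead of `==` (`group` and `event` are hypothetical, and `prob` and `conditional` are repeated so the block runs on its own):

```python
import math
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of proposition among rows where given is True."""
    return prob(proposition[given])

# Hypothetical data: 7 made-up respondents, each in exactly one group
group = pd.Series([1, 1, 1, 2, 2, 2, 2])
event = pd.Series([True, False, False, True, True, False, False])

in_group_1 = (group == 1)
in_group_2 = (group == 2)

direct = prob(event)
total = (prob(in_group_1) * conditional(event, given=in_group_1)
         + prob(in_group_2) * conditional(event, given=in_group_2))

# Compare with a tolerance, since the two expressions may round differently
print(direct, total, math.isclose(direct, total))
```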

## Exercises

**Exercise:** Let's use the tools in this chapter to solve a variation of the Linda problem.

> Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.  Which is more probable?
> 1. Linda is a banker.
> 2. Linda is a banker and considers herself a liberal Democrat.

To answer this question, compute 

* The probability that Linda is a female banker,

* The probability that Linda is a liberal female banker, and

* The probability that Linda is a liberal female banker and a Democrat.

In [42]:
# Probability that Linda is a female banker
prob(female & banker)

0.011381618989653074

In [43]:
# Probability that Linda is a liberal female banker
prob(female & banker & liberal)

0.002556299452221546

In [44]:
# Probability that Linda is a liberal female banker and a Democrat
prob(female & banker & liberal & democrat)

0.0012375735443294787
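
The three results shrink with each added condition (about 0.0114, 0.0026, and 0.0012). This is the point of the Linda problem: a conjunction can never be more probable than any of its parts, so "Linda is a banker" is at least as probable as "Linda is a banker and a liberal Democrat". The sketch below checks this on invented data; `a`, `b`, and `c` are hypothetical propositions, not GSS columns:

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

# Hypothetical Boolean propositions for 8 made-up respondents
a = pd.Series([True, True, True, False, False, True, False, False])
b = pd.Series([True, False, True, True, False, False, False, True])
c = pd.Series([True, True, False, True, False, False, True, False])

p_a = prob(a)
p_ab = prob(a & b)
p_abc = prob(a & b & c)

# Each added condition can only remove rows, never add them
print(p_a, p_ab, p_abc)
```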

**Exercise:** Use `conditional` to compute the following probabilities:

* What is the probability that a respondent is liberal, given that they are a Democrat?

* What is the probability that a respondent is a Democrat, given that they are liberal?

Think carefully about the order of the arguments you pass to `conditional`.

In [45]:
# Probability that a respondent is liberal, given that they are a Democrat
conditional(liberal, given=democrat)

0.3891320002215698

In [46]:
# Probability that a respondent is a Democrat, given that they are a liberal
conditional(democrat, given=liberal)

0.5206403320240125

**Exercise:** There's a [famous quote](https://quoteinvestigator.com/2014/02/24/heart-head/) about young people, old people, liberals, and conservatives that goes something like:

> If you are not a liberal at 25, you have no heart. If you are not a conservative at 35, you have no brain.

Whether you agree with this proposition or not, it suggests some probabilities we can compute as an exercise.
Rather than use the specific ages 25 and 35, let's define `young` as under 30 and `old` as 65 or older:

In [47]:
young = (gss['age'] < 30)
prob(young)

0.19435991073240008

In [48]:
old = (gss['age'] >= 65)
prob(old)

0.17328058429701765

For these thresholds, I chose round numbers near the 20th and 80th percentiles.  Depending on your age, you may or may not agree with these definitions of "young" and "old".

I'll define `conservative` as someone whose political views are "Conservative", "Slightly Conservative", or "Extremely Conservative".

In [49]:
conservative = (gss['polviews'] >= 5)
prob(conservative)

0.3419354838709677

Use `prob` and `conditional` to compute the following probabilities.

* What is the probability that a randomly chosen respondent is a young liberal?

* What is the probability that a young person is liberal?

* What fraction of respondents are old conservatives?

* What fraction of conservatives are old?

For each statement, think about whether it is expressing a conjunction, a conditional probability, or both.

For the conditional probabilities, be careful about the order of the arguments.
If your answer to the last question is greater than 30%, you have it backwards!

In [50]:
# Probability that a randomly chosen respondent is a young liberal
prob(young & liberal)

0.06579427875836884

In [51]:
# Probability that a young person is liberal
conditional(liberal, given=young)

0.338517745302714

In [52]:
# Fraction of respondents that are old conservatives
prob(old & conservative)

0.06701156421180766

In [53]:
# Fraction of conservatives that are old
conditional(old, given=conservative)

0.19597721609113564