# Jonathan Halverson
# Wednesday, March 23, 2016
# Probability

### Probability of disjoint or mutually exclusive events: the probabilities simply add. For instance, the probability of rolling a 2 or 5 on a 6-sided die is $P(\textrm{2 or 5}) = P(2) + P(5) = 1/6 + 1/6 = 1/3$.

### OpenIntro: When we write “or” in statistics, we mean “and/or” unless we explicitly state otherwise. Thus, A or B occurs means A, B, or both A and B occur.


### What if events A and B are not disjoint? The general addition rule may be applied: $P(\textrm{A or B}) = P(A) + P(B) - P(\textrm{A and B})$.
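
As a minimal sketch of the general addition rule, the snippet below enumerates the outcomes of a fair 6-sided die with Python sets. The events `A` (even roll) and `B` (roll greater than 3) and the helper `prob` are chosen here purely for illustration and do not come from the original notebook.

```python
from fractions import Fraction

# Sample space of a fair 6-sided die (illustrative assumption)
omega = set(range(1, 7))

def prob(event):
    """Probability of an event (a subset of omega) when outcomes are equally likely."""
    return Fraction(len(event), len(omega))

A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll greater than 3

lhs = prob(A | B)                       # P(A or B) by direct counting
rhs = prob(A) + prob(B) - prob(A & B)   # general addition rule
print(lhs == rhs)

# For disjoint events the intersection is empty, so the rule reduces to a plain sum
C, D = {2}, {5}
disjoint_sum = prob(C) + prob(D)
print(prob(C | D) == disjoint_sum)
```

Exact fractions are used instead of floats so that the two sides can be compared with `==` without rounding concerns.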

### For instance, what is the probability of drawing a face card or a diamond from a deck? There are $4\times3$ face cards (jack, queen, king) and 13 diamonds, but three cards are both diamond and face, so to avoid double counting we need to subtract those out:

In [141]:
P = 12/52.0 + 13/52.0 - 3/52.0
P

0.4230769230769231

### As usual we check with a numerical experiment:

In [142]:
class Card:
    def __init__(self, suit_, label_):
        self.suit = suit_
        self.label = label_

suits = ['club', 'spade', 'heart', 'diamond']
labels = map(str, range(2, 11)) + ['J', 'Q', 'K', 'A']

deck = []
for suit in suits:
    for l in labels:
        deck.append(Card(suit, l))
        
success = 0
for card in deck:
    if (card.suit == 'diamond' or card.label in ['J', 'Q', 'K']):
        success += 1
print success / float(len(deck))

0.423076923077


### An event is a subset of the sample space. For example, {2, 5} is an event for a die roll. The complement of the event is the remaining subset, {1, 3, 4, 6}. $P(A) + P(A^c) = 1$.

### Two events are independent if P(A and B) = P(A) P(B). Note the difference between P(A or B) and P(A and B). What is the probability of randomly selecting two people in the US who are both left-handed? Taking the probability that one person is left-handed to be 0.09 and treating the two selections as independent, P = (0.09)(0.09) = 0.0081
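
The product rule can be checked by brute-force enumeration. The sketch below assumes two fair dice, which are not part of the original example; the events and the helper `prob` are hypothetical and only meant to contrast an independent pair of events with a dependent pair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice (illustrative assumption)
outcomes = list(product(range(1, 7), repeat=2))

def prob(predicate):
    """Fraction of outcomes for which predicate(outcome) is true."""
    hits = sum(1 for o in outcomes if predicate(o))
    return Fraction(hits, len(outcomes))

# A: first die shows 6, B: second die is even
p_a = prob(lambda o: o[0] == 6)
p_b = prob(lambda o: o[1] % 2 == 0)
p_a_and_b = prob(lambda o: o[0] == 6 and o[1] % 2 == 0)
independent = (p_a_and_b == p_a * p_b)
print(independent)

# C: the sum is at least 10, which clearly depends on the first die
p_c = prob(lambda o: o[0] + o[1] >= 10)
p_a_and_c = prob(lambda o: o[0] == 6 and o[0] + o[1] >= 10)
dependent = (p_a_and_c != p_a * p_c)
print(dependent)
```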

## Conditional probability

### The conditional probability of the outcome of interest A given condition B is computed as the following: $P(A | B) = \frac{P(\textrm{A and B})}{P(B)}=\frac{P(A, B)}{P(B)}$.
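
To make the definition concrete, here is a small sketch that reuses the card counts from the face-card/diamond example earlier (12 face cards, 13 diamonds, 3 cards that are both). The variable names are chosen for illustration. It also shows that $P(A | B)$ and $P(B | A)$ are generally different.

```python
from fractions import Fraction

# Counts for a standard 52-card deck
n_cards = 52
n_face = 12              # jack, queen, king in each of the 4 suits
n_diamond = 13
n_face_and_diamond = 3   # J, Q, K of diamonds

p_face = Fraction(n_face, n_cards)
p_diamond = Fraction(n_diamond, n_cards)
p_joint = Fraction(n_face_and_diamond, n_cards)   # P(face and diamond)

# P(A | B) = P(A and B) / P(B)
p_face_given_diamond = p_joint / p_diamond
p_diamond_given_face = p_joint / p_face

print(p_face_given_diamond)   # 3/13
print(p_diamond_given_face)   # 1/4
```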

### If a probability is based on a single variable, it is a marginal probability. The probability of outcomes for two or more variables or processes is called a joint probability.

### The smallpox data set provides a sample of 6,224 individuals from the year 1721 who were exposed to smallpox in Boston.

In [143]:
import numpy as np
import pandas as pd
df = pd.read_csv('smallpox.txt', sep='\t')

In [144]:
print df.head(3)
print df.tail(3)

  result inoculated
0  lived        yes
1  lived        yes
2  lived        yes
     result inoculated
6221   died         no
6222   died         no
6223   died         no


In [167]:
df.shape

(6224, 2)

In [145]:
df.describe()

       result inoculated
count    6224       6224
unique      2          2
top     lived         no
freq     5374       5980


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6224 entries, 0 to 6223
Data columns (total 2 columns):
result        6224 non-null object
inoculated    6224 non-null object
dtypes: object(2)
memory usage: 97.3+ KB


### Decent examples on contingency tables here:
http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-19_17.html

In [147]:
pd.crosstab(index=df.result, columns='count')

col_0   count
result
died      850
lived    5374


In [148]:
pd.crosstab(index=df.inoculated, columns='count')

col_0       count
inoculated
no           5980
yes           244


In [149]:
ct = pd.crosstab(index=df['result'], columns=df['inoculated'], margins=True)
ct.index = ['died', 'lived', 'cTotal']
ct.columns = ['no', 'yes', 'rTotal']
ct

          no  yes  rTotal
died     844    6     850
lived   5136  238    5374
cTotal  5980  244    6224


In [150]:
# reference values in the table
print ct.ix['cTotal', 'rTotal']
print ct.ix['died', 'yes']

6224
6


In [151]:
ct = ct / ct.ix['cTotal', 'rTotal']
ct

              no       yes    rTotal
died    0.135604  0.000964  0.136568
lived   0.825193  0.038239  0.863432
cTotal  0.960797  0.039203  1.000000


### The table above gives joint and marginal probabilities but we can also compute them directly:

In [152]:
def joint(dframe, alive, inoc):
    success = 0
    for a, b in zip(dframe.result, dframe.inoculated):
        if (a == alive and b == inoc): success += 1
    return success / float(dframe.shape[0])

def marginal(dframe, column, value):
    success = 0
    for b in dframe[column]:
        if (b == value): success += 1
    return success / float(dframe.shape[0])

### What is the probability of a random person having died?

In [153]:
marginal(df, column='result', value='died')

0.1365681233933162

### What is the probability that a person was inoculated and died?

In [154]:
joint(df, alive='died', inoc='yes')

0.0009640102827763496

### What is the probability that a person was not inoculated and died?

In [155]:
joint(df, alive='died', inoc='no')

0.13560411311053985

### These three values can also be obtained from the proportions table:

In [156]:
print ct.ix['died', 'rTotal']
print ct.ix['died', 'yes']
print ct.ix['died', 'no']

0.136568123393
0.000964010282776
0.135604113111


## Conditional probabilities

### What is the probability of dying given the person was inoculated?

In [157]:
joint(df, alive='died', inoc='yes') / marginal(df, column='inoculated', value='yes')

0.02459016393442623

### What is the probability of dying given the person was not inoculated? Note this is a conditional probability: $P(\textrm{died | not inoculated})=P(\textrm{died and not inoculated}) / P(\textrm{not inoculated})$.

In [158]:
joint(df, alive='died', inoc='no') / marginal(df, column='inoculated', value='no')

0.1411371237458194

### These numbers suggest that a person who was not inoculated was nearly 6 times more likely to die (0.1411 / 0.0246 ≈ 5.7).
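
The ratio of the two conditional probabilities can also be computed directly from the counts in the contingency table, without loading the data file. This is only a sketch; the variable names, including `relative_risk`, are introduced here for illustration.

```python
# Counts taken from the contingency table above
died_no, total_no = 844, 5980     # not inoculated
died_yes, total_yes = 6, 244      # inoculated

# Conditional probabilities P(died | not inoculated) and P(died | inoculated)
p_died_given_no = died_no / float(total_no)
p_died_given_yes = died_yes / float(total_yes)

# How many times more likely death was without inoculation
relative_risk = p_died_given_no / p_died_given_yes
print(round(relative_risk, 2))
```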

### Suppose we are given only two pieces of information: 96.08% of residents were not inoculated, and 85.88% of the residents who were not inoculated ended up surviving. How could we compute the probability that a resident was not inoculated and lived? Since P(A and B) = P(A | B) P(B):

In [159]:
0.9608 * 0.8588

0.82513504

### Here is the check:

In [160]:
joint(df, alive='lived', inoc='no')

0.8251928020565553

### Let's make sure the conditional probabilities sum to 1: P(A1 | B) + P(A2 | B) = 1. Here we evaluate $P(\textrm{lived | not inoculated}) + P(\textrm{died | not inoculated})$:

In [165]:
joint(df, alive='lived', inoc='no') / marginal(df, column='inoculated', value='no') + joint(df, alive='died', inoc='no') / marginal(df, column='inoculated', value='no')

1.0

# Bayes

It seems strange that A is subscripted but B is not. What is the explanation for this?

### Bayes' theorem is used for inverting conditional probabilities. A tree diagram is the best choice when the number of events is small enough to draw. Here is the general formula:

$$P(A_1 | B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \ldots + P(B|A_k)P(A_k)}$$

### OpenIntro: Jose visits campus every Thursday evening. However, some days the parking garage is full, often due to college events. There are academic events on 35% of evenings, sporting events on 20% of evenings, and no events on 45% of evenings. When there is an academic event, the garage fills up about 25% of the time, and it fills up 70% of evenings with sporting events. On evenings when there are no events, it only fills up about 5% of the time. If Jose comes to campus and finds the garage full, what is the probability that there is a sporting event?

$$P(\textrm{sports | full}) = P(\textrm{sports and full}) / P(\textrm{full})$$

In [173]:
P_sports_given_full = (0.2 * 0.7) / (0.2 * 0.7 + 0.35 * 0.25 + 0.45 * 0.05)
print P_sports_given_full

0.56


### Or, written out with Bayes' theorem:

$$P(\textrm{sports | full}) = \frac{P(\textrm{full | sports})P(\textrm{sports})} {P(\textrm{full | acad})P(\textrm{acad})+P(\textrm{full | sports})P(\textrm{sports})+P(\textrm{full | none})P(\textrm{none})}$$

In [179]:
P_sports = 0.2
P_full_given_sports = 0.7

P_aca = 0.35
P_full_given_aca = 0.25

P_none = 0.45
P_full_given_none = 0.05

P_full_given_sports * P_sports / (P_full_given_sports * P_sports + P_full_given_aca * P_aca + P_full_given_none * P_none)

0.56

In [180]:
P_aca_given_full = (0.35 * 0.25) / (0.2 * 0.7 + 0.35 * 0.25 + 0.45 * 0.05)
print P_aca_given_full

0.35


In [181]:
P_none_given_full = 1.0 - P_aca_given_full - P_sports_given_full
print P_none_given_full

0.09
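
The three calculations above follow the same pattern, so they can be wrapped in one small function. The sketch below is one possible way to do it; `posteriors` and the dictionary keys are hypothetical names, and the probabilities are the ones given in the parking-garage problem.

```python
def posteriors(priors, likelihoods):
    """Return P(A_i | B) for every i, given P(A_i) and P(B | A_i).

    Both arguments are dicts keyed by the name of each event A_i.
    """
    # Numerators of Bayes' theorem: P(B | A_i) P(A_i)
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    # Denominator: P(B), the sum of the numerators over all A_i
    p_b = sum(joint.values())
    return {k: v / p_b for k, v in joint.items()}

priors = {'academic': 0.35, 'sports': 0.20, 'none': 0.45}          # P(A_i)
p_full_given = {'academic': 0.25, 'sports': 0.70, 'none': 0.05}    # P(full | A_i)

post = posteriors(priors, p_full_given)
for event in sorted(post):
    print('%s: %.2f' % (event, post[event]))

# The posteriors over all the A_i should sum to 1
total = sum(post.values())
```

Computing all posteriors at once makes it easy to confirm that they sum to 1, which is the same consistency check used earlier for the conditional probabilities.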


### OpenIntro: The last several exercises offered a way to update our belief about whether there is a sporting event, academic event, or no event going on at the school based on the information that the parking lot was full. This strategy of updating beliefs using Bayes’ Theorem is actually the foundation of an entire section of statistics called Bayesian statistics.