# Conditional Probability

## Introduction: The Three-Card Problem

Suppose one has three cards – one card is blue on both sides, one card is pink on both sides, and one card is blue on one side and pink on the other side. Suppose one chooses a card at random and places it down showing “blue”. What is the chance that the other side is also blue?
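The question can be explored by listing every way a card can land. The sketch below assumes that each card, and each side of the chosen card, is equally likely to end up face up; the names `cards` and `placements` are illustrative only.

```python
from fractions import Fraction

# Each card is described by its two sides.
cards = [("blue", "blue"), ("pink", "pink"), ("blue", "pink")]

# Every equally likely placement, recorded as (visible side, hidden side).
placements = []
for side_1, side_2 in cards:
    placements.append((side_1, side_2))
    placements.append((side_2, side_1))

# Restrict attention to placements that show blue, then count those
# whose hidden side is blue as well.
showing_blue = [p for p in placements if p[0] == "blue"]
both_blue = [p for p in showing_blue if p[1] == "blue"]

p_other_blue = Fraction(len(both_blue), len(showing_blue))
print(p_other_blue)  # 2/3
```

Under these assumptions the chance is 2/3 rather than 1/2, because the all-blue card can show blue in two ways.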

### In a Two-way table

It can be easier to think about and compute conditional probabilities when they are found from observed counts in a two-way table.

The following `hs_athlets` data frame contains counts of high school athletes in 14 sports, classified by sport and gender. These numbers are recorded in thousands, so the 454 entry in the Baseball/Softball – Male cell means that 454,000 males played baseball or softball this year.

-  Counts of high school athletes by sport and gender.

In [1]:
import pandas as pd

hs_athlets = pd.read_csv('./data/table01.csv', index_col='Sport')
hs_athlets

| Sport | Male | Female | TOTAL |
|---|---|---|---|
| Baseball/Softball | 454 | 373 | 827 |
| Basketball | 541 | 456 | 997 |
| Cross Country | 192 | 163 | 355 |
| Football | 1048 | 1 | 1049 |
| Gymnastics | 2 | 21 | 23 |
| Golf | 163 | 62 | 225 |
| Ice Hockey | 35 | 7 | 42 |
| Lacrosse | 50 | 39 | 89 |
| Soccer | 345 | 301 | 646 |
| Swimming | 95 | 141 | 236 |


In [2]:
hs_athlets['TOTAL']['TOTAL']

6489

In [3]:
hs_athlets['Male'][:-1].sum()

3899

In [4]:
hs_athlets['Female'][:-1].sum()

2590

Suppose one chooses a high school athlete at random who is involved in one of these 14 sports. Consider several events:

- $F = \text{athlete chosen is female}$
- $S = \text{athlete is a swimmer}$
- $V = \text{athlete plays volleyball}$


What is the probability that the athlete is female?

In [5]:
round(hs_athlets['Female']['TOTAL']/hs_athlets['TOTAL']['TOTAL'],4)

0.3991

Likewise, the probability that the randomly chosen athlete is a swimmer is

In [6]:
round(hs_athlets.loc['Swimming']['TOTAL']/hs_athlets['TOTAL']['TOTAL'],4)

0.0364

and the probability he or she plays volleyball is

In [7]:
round(hs_athlets.loc['Volleyball']['TOTAL']/hs_athlets['TOTAL']['TOTAL'],4)

0.0672

Next, consider the computation of some conditional probabilities. What is the probability a volleyball player is female? In other words, conditional on the fact that the athlete plays volleyball, what is the chance that the athlete is female:

$$ P(F|V) $$

To find this probability, restrict attention only to the volleyball players in the table.

In [8]:
hs_athlets[hs_athlets.index == 'Volleyball']

| Sport | Male | Female | TOTAL |
|---|---|---|---|
| Volleyball | 39 | 397 | 436 |


Of the 436 (thousand) volleyball players, 397 are female, so

$$P(F|V) = \frac{397}{436} = 0.9106$$

In [9]:
round(hs_athlets[hs_athlets.index == 'Volleyball']['Female']/hs_athlets[hs_athlets.index == 'Volleyball']['TOTAL'],4)

Sport
Volleyball    0.9106
dtype: float64

In [10]:
round(hs_athlets['Female']['Volleyball']/hs_athlets['TOTAL']['Volleyball'],4)

0.9106

What is the probability a woman athlete is a swimmer? In other words, if one knows that the athlete is female, what is the (conditional) probability she is a swimmer, or $P(S|F)$?

Here since one is given the information that the athlete is female, one restricts attention to the "Female" column of counts. There are a total of 2590 (thousand) women who play one of these sports; of these, 141 are swimmers. So

$$P(S|F) = \frac{141}{2590} = 0.0544$$

In [11]:
round(hs_athlets['Female']['Swimming']/hs_athlets['Female']['TOTAL'],4)

0.0544

Are events $F$ and $V$ independent? One can check this several ways. Above it was found that the probability a randomly chosen athlete is a volleyball player is $P(V) = 0.0672$. Suppose one is told that the athlete is a female ($F$). Will that change the probability that she is a volleyball player? Of the 2590 (thousand) women, 397 are volleyball players, and so $P(V|F)=397/2590 = 0.1533$. Since $P(V)$ differs from $P(V|F)$, knowing that the athlete is female increases the probability that the athlete is a volleyball player. So the two events are not independent.
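Another way to check independence is to compare $P(F \cap V)$ with the product $P(F) \times P(V)$; the two agree only when the events are independent. The sketch below hard-codes the counts quoted in the text rather than reading them from `hs_athlets`, so the variable names are illustrative.

```python
# Counts (in thousands) quoted in the text above.
total = 6489
female = 2590
volleyball = 436
female_volleyball = 397

p_f = female / total
p_v = volleyball / total
p_f_and_v = female_volleyball / total

print(round(p_f_and_v, 4))   # 0.0612
print(round(p_f * p_v, 4))   # 0.0268
```

The joint probability is more than twice the product, which agrees with the conclusion that $F$ and $V$ are not independent.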

##### Conditional Probabilities in a Two-Way Table



Suppose one has two spinners, each of which records a 1, 2, 3, or 4 with equal probabilities. Suppose the smaller of the two spins is 2 – what is the probability that the larger spin is equal to 4? One can answer this question by use of a simulation experiment. First one constructs a data frame – by two uses of the `np.random.choice()` function, 5000 random spins of the first spinner are stored in `Spin_1` and 5000 spins of the second spinner in `Spin_2`.

In [12]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spin_1':np.random.choice(np.arange(1,5), size = 5000,replace = True),
                   'Spin_2':np.random.choice(np.arange(1,5), size = 5000,replace = True)})

In [13]:
np.arange(1,5)

array([1, 2, 3, 4])

In [14]:
df['Min'] = np.minimum(df.Spin_1, df.Spin_2)
df['Max'] = np.maximum(df.Spin_1, df.Spin_2)


In [15]:
t1 = df.groupby(['Min','Max']).count()['Spin_1']

In [16]:
t1.unstack()

| Min \ Max | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 289.0 | 665.0 | 645.0 | 617.0 |
| 2 | | 303.0 | 630.0 | 582.0 |
| 3 | | | 321.0 | 625.0 |
| 4 | | | | 323.0 |


Since one is told that the smaller of the two spins is equal to 2, one restricts attention to the row where `Min=2`. One observes that `Max` is equal to 2, 3, 4 with frequencies 303, 630, and 582. So the probability that the larger spin is 4 is approximately

In [17]:
round(582/(303+630+582),4)

0.3842
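The simulated estimate can be compared with an exact value. The sketch below assumes two fair, independent spinners, lists all 16 equally likely pairs of spins, and applies the same "restrict to `Min=2`" reasoning; the names `pairs`, `min_is_2` and `max_is_4` are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely (Spin_1, Spin_2) pairs for two fair four-sided spinners.
pairs = list(product(range(1, 5), repeat=2))

# Reduced sample space: the smaller spin equals 2.
min_is_2 = [p for p in pairs if min(p) == 2]

# Within that reduced space, the larger spin equals 4.
max_is_4 = [p for p in min_is_2 if max(p) == 4]

exact = Fraction(len(max_is_4), len(min_is_2))
print(exact, float(exact))  # 2/5 0.4
```

A simulation of a few thousand spins should land near 0.4, with some run-to-run variation.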

## Definition and The Multiplication Rule

In this chapter, conditional probabilities have been computed by considering a reduced sample space. There is a formal definition of conditional probability that is useful in computing probabilities of complicated events.

Suppose one has two events $A$ and $B$ where the probability of event $B$ is positive, that is $P(B) \gt 0$. Then the probability of $A$ given $B$ is defined as the quotient

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$


##### How many boys?

To illustrate this conditional probability definition, suppose a couple has four children. One is told that this couple has at least one boy. What is the chance that they have exactly two boys?

If one lets $L$ be the event "at least one boy" and $B$ the event "exactly two boys", one wishes to find $P(B|L)$.

Suppose one represents the genders of the four children (in birth order, from oldest to youngest) as a sequence of four letters. For example, the sequence $BBGG$ means that the first two children were boys and the last two were girls. If we represent outcomes this way, there are 16 possible outcomes of four births:

In [18]:
import itertools
a = [''.join(x) for x in itertools.product('BG', repeat=4)]

b = np.array(a).reshape(4, 4)  # arrange the 16 outcomes in a 4-by-4 grid
pd.DataFrame(b)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | BBBB | BBBG | BBGB | BBGG |
| 1 | BGBB | BGBG | BGGB | BGGG |
| 2 | GBBB | GBBG | GBGB | GBGG |
| 3 | GGBB | GGBG | GGGB | GGGG |


In [19]:
len([i for i in a if i.count('B') >= 1])

15

In [20]:
len([i for i in a if i.count('B') == 2])

6

If one assumes that boys and girls are equally likely (is this really true?), then each of the outcomes is equally likely and each outcome is assigned a probability of 1/16. Applying the definition of conditional probability, one has:

$$P(B|L) = \frac{P(B \cap L)}{P(L)}$$

There are 15 outcomes in the set $L$, and 6 outcomes where both events $B$ and $L$ occur. So using the definition

$$P(B|L) = \frac{6/16}{15/16} = \frac{6}{15} = 0.4$$
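The same calculation can be written directly from the definition. The sketch below assumes the 16 outcomes are equally likely; the helper `prob` and the two event functions are hypothetical names introduced for illustration.

```python
from fractions import Fraction
from itertools import product

# The 16 equally likely birth sequences, e.g. 'BBGG'.
outcomes = ["".join(x) for x in product("BG", repeat=4)]

def prob(event):
    """Probability of an event, given as a true/false test on one outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def at_least_one_boy(o):
    return o.count("B") >= 1

def exactly_two_boys(o):
    return o.count("B") == 2

p_l = prob(at_least_one_boy)
p_b_and_l = prob(lambda o: exactly_two_boys(o) and at_least_one_boy(o))

# Definition of conditional probability: P(B|L) = P(B and L) / P(L).
p_b_given_l = p_b_and_l / p_l
print(p_b_given_l)  # 2/5
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared with 6/15 without rounding.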


## The Multiplication Rule

If one takes the conditional probability definition and multiplies both sides of the equation by $P(B)$, one obtains the multiplication rule:

$$ P(A \cap B)=P(B)P(A|B)$$

#### Choosing balls from a random bowl

The multiplication rule is especially useful for experiments that can be divided into stages. Suppose one has two bowls – Bowl 1 is filled with one white and 5 black balls, and Bowl 2 has 4 white and 2 black balls. One first spins a spinner that determines which bowl to select (in the simulation that follows, Bowl 1 is chosen with probability 1/4 and Bowl 2 with probability 3/4), and then selects one ball from the bowl. What is the chance that the ball one selects is white?

In [21]:
import numpy as np
import pandas as pd

# Stage 1: spin to pick a bowl (Bowl 1 with probability 1/4, Bowl 2 with 3/4)
bowl = np.random.choice([1,2], size=10000, p=[1/4,3/4])

# Stage 2: simulate a draw from each bowl for every trial
color_1 = np.random.choice(['white','black'], size=10000, p=[1/6,5/6])
color_2 = np.random.choice(['white','black'], size=10000, p=[4/6,2/6])

# Keep the draw that belongs to the bowl actually selected
color = np.where(bowl == 1, color_1, color_2)

df_bowl = pd.DataFrame({'Bowl':bowl,
              'Color':color})

# Two-way table of counts: bowl selected by ball color
pd.crosstab(index = df_bowl['Bowl'], columns=df_bowl['Color'])

| Bowl \ Color | black | white |
|---|---|---|
| 1 | 2078 | 411 |
| 2 | 2489 | 5022 |


The probability that Bowl 1 was selected and a white ball was chosen is approximately equal to 

In [22]:
pd.crosstab(index = df_bowl['Bowl'], columns=df_bowl['Color']).loc[1]['white']/df_bowl.value_counts().sum()

0.0411

The chance of choosing a white ball is approximated by

In [23]:
pd.crosstab(index = df_bowl['Bowl'], columns=df_bowl['Color'])['white'].sum()/df_bowl.value_counts().sum()

0.5433
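These simulated values can be checked against the multiplication rule. The sketch below takes the spinner probabilities (1/4 and 3/4) and the bowl contents from the simulation code, multiplies along each stage, and adds over the two bowls; the dictionary names are illustrative.

```python
from fractions import Fraction

# Stage 1: probability of selecting each bowl (as used in the simulation).
p_bowl = {1: Fraction(1, 4), 2: Fraction(3, 4)}

# Stage 2: probability of a white ball given the bowl.
p_white_given_bowl = {1: Fraction(1, 6), 2: Fraction(4, 6)}

# Multiplication rule: P(Bowl 1 and white) = P(Bowl 1) * P(white | Bowl 1).
p_bowl1_and_white = p_bowl[1] * p_white_given_bowl[1]

# A white ball can come from either bowl, so add the two products.
p_white = sum(p_bowl[b] * p_white_given_bowl[b] for b in p_bowl)

print(p_bowl1_and_white, round(float(p_bowl1_and_white), 4))  # 1/24 0.0417
print(p_white, round(float(p_white), 4))                      # 13/24 0.5417
```

Both exact values are close to the simulated proportions, as one would expect with 10000 trials.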

## The Multiplication Rule Under Independence

When two events $A$ and $B$ are **independent**, the multiplication rule takes the simple form

$$P(A \cap B) = P(A) \times P(B)$$

Moreover, if one has a sequence of independent events, say $A_1,A_2, \dots, A_k$ then the probability that all events happen simultaneously is the product of the probabilities of the individual events 

$$P(A_1 \cap A_2 \cap \dots \cap A_k) = P(A_1) \times P(A_2) \times \dots \times P(A_k)$$

By use of the assumption of independent events and multiplying, one finds probabilities of sophisticated events. We illustrate this in several examples.

#### Blood types of Couples  

White Americans have the blood types $O$, $A$, $B$, and $AB$ with respective proportions 0.45, 0.40, 0.11, and 0.04. Suppose two people in this group are married.

1. **What is the probability that the man has blood type O and the woman has blood type A?** Let $O_M$ denote the event that the man has type $O$ blood and $A_W$ the event that the woman has type $A$ blood. Since these two people are not related, it is reasonable to assume that $O_M$ and $A_W$ are independent events. Applying the multiplication rule, the probability the couple have these two specific blood types is

$$
\begin{align*} 
P(O_M \cap A_W) &= P(O_M) \times P(A_W)\\ 
&= (0.45) \times (0.40) = 0.18
\end{align*}
$$

2. **What is the probability the couple have O and A blood types?** This is a different question from the first one since it has not been specified who has which blood type. Either the man has blood type $O$ and the woman has blood type $A$, or the other way around. So the probability of interest is

$$
\begin{align*} 
P(\text{two have } A, O \text{ types}) &= P((O_M \cap A_W) \cup (O_W \cap A_M)) \\
&= P(O_M \cap A_W) + P(O_W \cap A_M)
\end{align*}
$$

One adds the probabilities since $O_M \cap A_W$ and $O_W \cap A_M$ are mutually exclusive events: they cannot both occur. One uses the multiplication rule with the independence assumption to find the probability:

$$
\begin{align*} 
P(\text{two have } A, O \text{ types}) &= P((O_M \cap A_W) \cup (O_W \cap A_M)) \\
&= P(O_M \cap A_W) + P(O_W \cap A_M) \\
&= P(O_M) \times P(A_W) + P(O_W) \times P(A_M) \\
&= (0.45) \times (0.40) + (0.45) \times (0.40) \\
&= 0.36
\end{align*}
$$

3. **What is the probability the man and the woman have the same blood type?** This is a more general question than the earlier parts since one hasn’t specified the blood types – one is just interested in the event that the two people have the same type. There are four possible ways for this to happen: they can both have type $O$, both have type $A$, both have type $B$, or both have type $AB$. One first finds the probability of each possible outcome and then sums the outcome probabilities to obtain the probability of interest. One obtains

$$
\begin{align*} 
P(\text{same type})  &= P((O_M \cap O_W) \cup (A_M \cap A_W) \cup (B_M \cap B_W) \cup (AB_M \cap AB_W))\\
&= (0.45)^2 + (0.40)^2 + (0.11)^2 + (0.04)^2 \\
&= 0.3762
\end{align*}
$$

4. **What is the probability the couple have different blood types?** One way of doing this problem is to consider all of the ways to have different blood types – the two people could have blood types $O$ and $A$, types $O$ and $B$, and so on – and add the probabilities of the different outcomes. But it is simpler to note that the event “have different blood types” is the complement of the event “have the same blood type”. Then using the complement property of probability

$$
\begin{align*} 
P(\text{different type})  &= 1-P(\text{same type}) \\
&= 1-0.3762 \\
&= 0.6238
\end{align*}
$$
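The four blood-type calculations can be reproduced in a few lines. The sketch below assumes, as in the text, that the two partners' blood types are independent; the dictionary `p` and the variable names are illustrative.

```python
# Blood-type proportions quoted in the text.
p = {"O": 0.45, "A": 0.40, "B": 0.11, "AB": 0.04}

# 1. Man has type O and woman has type A (independence: multiply).
p_man_o_woman_a = p["O"] * p["A"]

# 2. One partner has O and the other A: two mutually exclusive cases, so add.
p_o_and_a = p["O"] * p["A"] + p["A"] * p["O"]

# 3. Same type: sum over the four types of P(type)^2.
p_same = sum(q ** 2 for q in p.values())

# 4. Different types: complement of "same type".
p_different = 1 - p_same

print(round(p_man_o_woman_a, 4), round(p_o_and_a, 4),
      round(p_same, 4), round(p_different, 4))
```

Rounded to four decimals, these should match the values 0.18, 0.36, 0.3762 and 0.6238 derived above.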

#### A Five-Game Playoff

Suppose two baseball teams play in a "best of five" playoff series, where the first team to win three games wins the series. Suppose the Yankees play the Angels and one believes that the probability the Yankees will win a single game is 0.6. If the results of the games are assumed independent, what is the probability the Yankees win the series?

This is a more sophisticated problem than the first example, since there are numerous outcomes of this series of games. The first thing to note is that the playoff can last three games, four games, or five games. In listing outcomes, one lets $Y$ and $A$ denote respectively the single-game outcomes "Yankees win" and "Angels win". Then a series result is represented by a sequence of letters. For example, $YYAY$ means that the Yankees won the first two games, the Angels won the third game, and the Yankees won the fourth game and the series. Using this notation, all of the possible outcomes of the five-game series are written below.
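One possible way to list the series outcomes is to generate every five-game sequence and cut each one off at the game where a team records its third win. The sketch below takes that approach and is not the notebook's own method; the function `series_outcomes` and its parameter `wins_needed` are hypothetical names, and games are assumed independent with $P(Y) = 0.6$.

```python
from itertools import product

def series_outcomes(wins_needed=3):
    """List each way the series can end, stopping at the clinching game."""
    max_games = 2 * wins_needed - 1
    outcomes = set()
    for games in product("YA", repeat=max_games):
        y_wins = a_wins = 0
        played = ""
        for g in games:
            played += g
            y_wins += g == "Y"
            a_wins += g == "A"
            if y_wins == wins_needed or a_wins == wins_needed:
                break  # series is over; later games are not played
        outcomes.add(played)
    return sorted(outcomes, key=lambda s: (len(s), s))

p_yankees = 0.6
outcomes = series_outcomes()
yankee_wins = [s for s in outcomes if s.count("Y") == 3]

# Independence: multiply single-game probabilities within an outcome;
# distinct outcomes are mutually exclusive, so add across outcomes.
p_series = sum(p_yankees ** s.count("Y") * (1 - p_yankees) ** s.count("A")
               for s in yankee_wins)

print(len(outcomes), len(yankee_wins))  # 20 10
print(round(p_series, 4))               # 0.6826
```

Under these assumptions there are 20 possible series results, 10 of which are Yankees wins, and the Yankees win the series with probability about 0.68.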

In [24]:
s1 = 'YYYA'

[''.join(x) for x in set(itertools.product(s1, repeat = 4))]

['AYYY',
 'AAAY',
 'YYAY',
 'YAYY',
 'YYYY',
 'YAAY',
 'AYAA',
 'AAYA',
 'AAAA',
 'YYYA',
 'YAAA',
 'YYAA',
 'YAYA',
 'AYAY',
 'AAYY',
 'AYYA']

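In [49]:
def startwith(word, rep):
    # Presumably meant to test whether `word` begins with `rep` copies of one letter.
    # Caution: `a` is a notebook-level variable from an earlier cell, not a letter
    # of `word`, so `a[0]` is never 'Y' or 'A' and this returns False for rep >= 1.
    if(word[:rep].count(a[0]) == rep):
        return True
    return False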

In [26]:
import itertools

def contains_consecutive(word):
    groups = itertools.groupby(word)
    return([(sum(1 for _ in group)) for label, group in groups])


In [27]:
import itertools

def contains_consecutive(word, rep):
    groups = itertools.groupby(word)
    for label, group in groups:
        if(sum(1 for _ in group) == rep):
            return True
    return False


In [28]:
contains_consecutive('YYYAA',4)

False

In [29]:
s2 = 'YYYAA'
[''.join(x) for x in set(itertools.permutations(s2)) if not startwith(x,3)]

['YYAAY',
 'AYYAY',
 'AYAYY',
 'YAYAY',
 'YAAYY',
 'YYYAA',
 'YYAYA',
 'AYYYA',
 'YAYYA',
 'AAYYY']

In [30]:
[''.join(x) for x in set(itertools.permutations(s2)) if not contains_consecutive(x,3)]

['YYAAY', 'AYYAY', 'AYAYY', 'YAYAY', 'YAAYY', 'YYAYA', 'YAYYA']

In [31]:
r = [''.join(x) for x in set(itertools.permutations(s2))]

In [32]:
r

['YYAAY',
 'AYYAY',
 'AYAYY',
 'YAYAY',
 'YAAYY',
 'YYYAA',
 'YYAYA',
 'AYYYA',
 'YAYYA',
 'AAYYY']

In [33]:
def not_last(word,c,n):
    if(word[:-1].count(c) == n):
        return True
    return False

In [34]:
for i in r:
    print(not_last(i,'Y',3))

False
False
False
False
False
True
True
True
True
False


In [35]:
[''.join(x) for x in set(itertools.permutations(s2)) if not not_last(x,'Y',3)]

['YYAAY', 'AYYAY', 'AYAYY', 'YAYAY', 'YAAYY', 'AAYYY']

In [36]:
[''.join(x) for x in set(itertools.permutations(s2)) if not not_last(x,'A',3) and not contains_consecutive(x,3)]

['YYAAY', 'AYYAY', 'AYAYY', 'YAYAY', 'YAAYY', 'YYAYA', 'YAYYA']

In [37]:
a1 = [''.join(x) for x in itertools.product('AY', repeat=5)]

In [38]:
def correct_sequence(word, n):
    for i in list(set(word)):
        if(word.count(i) > n):
            return False
    return True

In [39]:
 [''.join(x) for x in itertools.product('AY', repeat=5) if ((correct_sequence(x,3) and not startwith(x,3)))]

['AAAYY',
 'AAYAY',
 'AAYYA',
 'AAYYY',
 'AYAAY',
 'AYAYA',
 'AYAYY',
 'AYYAA',
 'AYYAY',
 'AYYYA',
 'YAAAY',
 'YAAYA',
 'YAAYY',
 'YAYAA',
 'YAYAY',
 'YAYYA',
 'YYAAA',
 'YYAAY',
 'YYAYA',
 'YYYAA']

In [48]:
word = 'BBAAA'

startwith(word,3)

False

In [41]:
[''.join(x) for x in itertools.product('AY', repeat=5) if not contains_consecutive(x,range(0,5))]

['AAAAA',
 'AAAAY',
 'AAAYA',
 'AAAYY',
 'AAYAA',
 'AAYAY',
 'AAYYA',
 'AAYYY',
 'AYAAA',
 'AYAAY',
 'AYAYA',
 'AYAYY',
 'AYYAA',
 'AYYAY',
 'AYYYA',
 'AYYYY',
 'YAAAA',
 'YAAAY',
 'YAAYA',
 'YAAYY',
 'YAYAA',
 'YAYAY',
 'YAYYA',
 'YAYYY',
 'YYAAA',
 'YYAAY',
 'YYAYA',
 'YYAYY',
 'YYYAA',
 'YYYAY',
 'YYYYA',
 'YYYYY']

In [42]:
len(set(itertools.permutations(s2)))

10