# Week 2 -- Probability vs Proportion

<img align="right" style="padding-right:10px;" src="figures_wk2/question_man.png" width=150><br>

Two statistical terms that often get confused are probability and proportion.

><b>Probability</b> is the likelihood of an event <u><i>occurring</i></u>. <br>
<b>Proportion</b> measures how often an event <u><i>has occurred</i></u>.

Let's take a look at a couple of examples to help understand the differences between the two values.

## Examples of the Difference Between Probability and Proportion

### Example 1: Flipping A Coin

<img align="left" style="padding-right:10px;" src="figures_wk2/coin.png" width=180>
If we flip a fair coin, the probability that it will land on heads is 0.5 or 50%.

However, if we flip a fair coin 20 times then we can actually count the proportion of times it landed on heads. For example, perhaps it landed on heads in 60% of the flips.

The probability of landing on heads is theoretical, but the proportion of times the coin landed on heads is empirical – we could actually count the proportion.
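To make the distinction concrete, here is a minimal sketch that simulates the 20 flips with Python's standard `random` module. The seed value and variable names are illustrative assumptions, and the proportion you get will vary with the seed.

```python
import random

random.seed(42)  # arbitrary seed so the sketch is repeatable

p_heads = 0.5   # theoretical probability of heads for a fair coin
n_flips = 20    # number of flips we actually perform

# Simulate the flips: True means the coin landed on heads
flips = [random.random() < p_heads for _ in range(n_flips)]

# Empirical proportion: number of heads counted divided by number of flips
prop_heads = sum(flips) / n_flips

print(f"probability of heads (theoretical): {p_heads}")
print(f"proportion of heads in {n_flips} flips (empirical): {prop_heads}")
```

The probability stays fixed at 0.5 no matter how many times the code runs, while the proportion depends on the flips that actually happened.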

### Example 2: Rolling Dice

<img align="right" style="padding-right:10px;" src="figures_wk2/dice.png" width=180>

If we roll a six-sided die, the probability that it will land on the number “4” is 1/6 or about 16.67%.

However, if we roll the die 10 times then we can actually count the proportion of times it landed on 4. For example, perhaps it landed on “4” in 20% of the rolls.

The probability of rolling a “4” is theoretical, but the proportion of times the die landed on “4” is empirical – we could actually count the proportion.

### Example 3: Drawing A Queen From A Deck Of Cards

<img align="left" style="padding-right:10px;" src="figures_wk2/cards.png" width=180>

In a standard deck of 52 cards, there are 4 Queens. Thus, the probability of choosing a Queen on any random draw is 4/52, or about 7.69%.

However, if we take a random draw (and replace the card we draw) 50 times, we can actually count the proportion of times we draw a Queen. For example, perhaps we draw a Queen in 10% of the draws.

The probability of choosing a Queen is theoretical, but the proportion of times we actually choose a Queen is empirical – we could actually count the proportion.
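The card example can be sketched the same way. The snippet below builds a 52-card deck, computes the theoretical probability as an exact fraction, and then counts Queens in 50 simulated draws with replacement. The deck representation and seed are assumptions made for illustration.

```python
import random
from fractions import Fraction

random.seed(7)  # arbitrary seed so the sketch is repeatable

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["club", "diamond", "heart", "spade"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Theoretical probability: Queens in the deck divided by cards in the deck
n_queens = sum(1 for rank, _ in deck if rank == "Q")
p_queen = Fraction(n_queens, len(deck))  # 4/52, which reduces to 1/13

# Empirical proportion: draw with replacement and count the Queens
n_draws = 50
draws = random.choices(deck, k=n_draws)  # choices() samples with replacement
prop_queen = sum(1 for rank, _ in draws if rank == "Q") / n_draws

print(f"probability of a Queen: {p_queen} (about {float(p_queen):.4f})")
print(f"proportion of Queens in {n_draws} draws: {prop_queen}")
```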

**References:**
>Probability vs. Proportion: What’s the Difference? (2021, September 10). Statology. https://www.statology.org/probability-vs-proportion/


In other words, probability is a measure of uncertainty (a theoretical value), whereas proportion is a measure of certainty (an empirical value). 

><b>Probability</b> talks about the chances of some event happening <u><i>in the future</i></u>.<br>
<b>Proportion</b> describes how often some event actually happened <u><i>in the past</i></u>.

# Demo: Calculating Proportion

<img align="left" style="padding-right:10px;" src="figures_wk2/proportion.png" width=180>

To recap, **proportion** describes how often a particular event has occurred in the past. Additionally, a proportion is the ratio of the number of times the event happened to the total number of observations.

With respect to the diagram to the left, we might want to determine the proportion of dogs within this clinic.

<center>$Proportion = \frac{number \ of \ dogs \ in \ the \ clinic}{number \ of \ animals \ in \ the \  clinic}$</center>
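As a quick illustration of the formula, here is a minimal sketch using a made-up clinic roster. The animals in the list are hypothetical, since the contents of the diagram are not spelled out in the text.

```python
# Hypothetical clinic roster (not taken from the diagram)
clinic = ["dog", "cat", "dog", "rabbit", "dog", "cat", "dog", "bird"]

n_dogs = clinic.count("dog")   # number of dogs in the clinic
n_animals = len(clinic)        # number of animals in the clinic

prop_dogs = n_dogs / n_animals
print(prop_dogs)  # 4 dogs out of 8 animals -> 0.5
```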


Time to try this out with a dataset. The `card_pulls.csv` file contains 10,000 observations of a card being drawn from a standard playing deck. The dataset contains the following information:
* **player:** the individual who drew the card for that observation
* **value:** the face value on the card drawn
* **suit:** the suit of the card drawn
* **color:** the color of the card drawn

In [1]:
import pandas as pd

In [2]:
cards_df = pd.read_csv('data_wk2/card_pulls.csv')

In [3]:
cards_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   player  10000 non-null  int64 
 1   value   10000 non-null  object
 2   suit    10000 non-null  object
 3   color   10000 non-null  object
dtypes: int64(1), object(3)
memory usage: 312.6+ KB


In [4]:
cards_df.head(10)

Unnamed: 0,player,value,suit,color
0,8,7,club,black
1,2,6,club,black
2,3,5,heart,red
3,2,A,spade,black
4,10,2,club,black
5,10,10,heart,red
6,6,3,spade,black
7,6,5,club,black
8,6,2,heart,red
9,1,Q,heart,red


Let's say we want to determine the proportion of red cards that were pulled. 

To calculate this, we need to determine the number of red cards that were drawn and then divide by the total number of cards drawn.

In [5]:
cards_df.groupby('color').size()

color
black    5051
red      4949
dtype: int64

This is a great start, but we only want the number of red cards.

In [6]:
# select the count by its label rather than by position
red_cards = cards_df.groupby('color').size()['red']

In [7]:
total_cards = cards_df.shape[0]
total_cards

10000

In [8]:
prop_red = red_cards/total_cards
prop_red

0.4949

So, this tells us that the proportion of red cards within the dataset is 49.49%.
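As a cross-check, pandas can also produce proportions in one step. The sketch below uses a tiny made-up dataframe (`toy_df` is an assumption, not the course dataset) to show two equivalent approaches: `value_counts(normalize=True)` and the mean of a boolean column.

```python
import pandas as pd

# Tiny stand-in for cards_df; the rows are invented for illustration
toy_df = pd.DataFrame({"color": ["red", "black", "red", "red", "black"]})

# Option 1: value_counts(normalize=True) returns proportions instead of counts
color_props = toy_df["color"].value_counts(normalize=True)
prop_red_counts = color_props["red"]

# Option 2: the mean of a boolean Series is the fraction of True values
prop_red_mean = (toy_df["color"] == "red").mean()

print(prop_red_counts, prop_red_mean)  # 3 red cards out of 5 -> 0.6 for both
```

Both options divide the number of red cards by the total number of rows, which is the same calculation done by hand above.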

Let's dig a little deeper and look at the proportion of red cards drawn by each player. For this, we are going to do things a little bit differently. We are going to add a column with a boolean value indicating whether the card drawn is red or not.

In [9]:
# vectorized comparison: True where the card drawn is red
cards_df['red_drawn'] = cards_df['color'] == 'red'

In [10]:
cards_df.head(10)

Unnamed: 0,player,value,suit,color,red_drawn
0,8,7,club,black,False
1,2,6,club,black,False
2,3,5,heart,red,True
3,2,A,spade,black,False
4,10,2,club,black,False
5,10,10,heart,red,True
6,6,3,spade,black,False
7,6,5,club,black,False
8,6,2,heart,red,True
9,1,Q,heart,red,True


Trust me, we are making progress. I'll now group the dataset by player and red_drawn and store that information in a new dataframe.

In [11]:
red_cards = cards_df.groupby(['player','red_drawn']).size().unstack().reset_index()

<div class="alert alert-block alert-info">
<b>Couple of new Pandas functions above: What do those commands do?</b>  <br>
 * unstack(): Pivots the innermost level of a multi-level index into columns. Here, groupby() on two columns produces a Series indexed by (player, red_drawn), and unstack() turns the red_drawn values (False and True) into columns, leaving player as the index.<br>
 * reset_index(): Moves the index back into a regular column and replaces it with the default integer index. In this case, player becomes an ordinary column again, so the dataframe is flat and easy to work with.
</div>
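To see what each step returns, here is a minimal sketch on a six-row toy dataframe. The data and the names `toy_df`, `counts`, `wide`, and `flat` are invented for illustration.

```python
import pandas as pd

# Toy data: two players, three draws each (invented values)
toy_df = pd.DataFrame({
    "player": [1, 1, 1, 2, 2, 2],
    "red_drawn": [True, False, True, False, False, True],
})

# Step 1: groupby + size gives a Series with a two-level index (player, red_drawn)
counts = toy_df.groupby(["player", "red_drawn"]).size()

# Step 2: unstack moves the inner level (red_drawn) into columns False / True
wide = counts.unstack()

# Step 3: reset_index turns the player index back into an ordinary column
flat = wide.reset_index()

print(counts, wide, flat, sep="\n\n")
```

Printing the three objects side by side shows the shape changing from a long Series to a wide table with one row per player.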

In [12]:
red_cards.head(10)

red_drawn,player,False,True
0,1,501,519
1,2,503,491
2,3,526,512
3,4,487,481
4,5,538,479
5,6,495,484
6,7,485,469
7,8,492,524
8,9,504,474
9,10,520,516


I'm going to clean up the headers a little bit to help make things clearer.

In [13]:
# replacing column names with nicer headers
cols = ['player','red_no','red_yes']
red_cards.columns = cols

In [14]:
red_cards.head(10)

Unnamed: 0,player,red_no,red_yes
0,1,501,519
1,2,503,491
2,3,526,512
3,4,487,481
4,5,538,479
5,6,495,484
6,7,485,469
7,8,492,524
8,9,504,474
9,10,520,516


Since I'm after the proportion of red cards drawn by each player, I will need to know the total number of cards drawn by each individual player.

In [15]:
red_cards['total_draws'] = red_cards.red_no + red_cards.red_yes

red_cards.head(10)

Unnamed: 0,player,red_no,red_yes,total_draws
0,1,501,519,1020
1,2,503,491,994
2,3,526,512,1038
3,4,487,481,968
4,5,538,479,1017
5,6,495,484,979
6,7,485,469,954
7,8,492,524,1016
8,9,504,474,978
9,10,520,516,1036


At this point all we have to do is calculate the proportion of red cards drawn for each player.

In [16]:
red_cards['red_draw_prop'] = red_cards.red_yes/red_cards.total_draws

red_cards.head(10)

Unnamed: 0,player,red_no,red_yes,total_draws,red_draw_prop
0,1,501,519,1020,0.508824
1,2,503,491,994,0.493964
2,3,526,512,1038,0.493256
3,4,487,481,968,0.496901
4,5,538,479,1017,0.470993
5,6,495,484,979,0.494382
6,7,485,469,954,0.491614
7,8,492,524,1016,0.515748
8,9,504,474,978,0.484663
9,10,520,516,1036,0.498069


One final step: let's sort the dataframe based on the proportion.

In [17]:
red_cards = red_cards.sort_values('red_draw_prop')

red_cards.head(10)

Unnamed: 0,player,red_no,red_yes,total_draws,red_draw_prop
4,5,538,479,1017,0.470993
8,9,504,474,978,0.484663
6,7,485,469,954,0.491614
2,3,526,512,1038,0.493256
1,2,503,491,994,0.493964
5,6,495,484,979,0.494382
3,4,487,481,968,0.496901
9,10,520,516,1036,0.498069
0,1,501,519,1020,0.508824
7,8,492,524,1016,0.515748
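For comparison, the same per-player proportion can be reached in fewer steps, because the mean of a boolean column within each group is the fraction of True values. This is a sketch on invented data (`toy_df` is an assumption), not a replacement for the step-by-step walkthrough above.

```python
import pandas as pd

# Toy data: player 1 draws four cards, player 2 draws two (invented values)
toy_df = pd.DataFrame({
    "player": [1, 1, 1, 1, 2, 2],
    "color": ["red", "black", "red", "red", "black", "red"],
})
toy_df["red_drawn"] = toy_df["color"] == "red"

# Mean of the boolean column per player = red draws / total draws for that player
red_prop_by_player = (
    toy_df.groupby("player")["red_drawn"].mean().sort_values()
)

print(red_prop_by_player)  # player 2 -> 0.5, player 1 -> 0.75
```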


# Demo: Calculating Probability

Calculating the probability of an event is a little more complex than calculating the proportion. The main reason for this is that you have to determine what type of probability you wish to calculate, along with the number of random variables you have in your system.

Here are the various types of **probability**:
* **Joint probability:** Probability of two or more events happening at the same time.<br> 
* **Marginal probability:** Probability of an event regardless of the outcomes of other variables.<br>
* **Conditional probability:** Probability of an event occurring given that one or more other events have occurred. <br>

## Probability for a Single Random Variable 

For this course, we will only be examining how to determine the probability with a single random variable in the system. 

Remember, <b>probability</b> is the likelihood of an event <u><i>occurring</i></u>. <br>

Probability of one random variable is the likelihood of an event that is independent of other factors. Examples include: <br>
* Coin toss.<br>
* Roll of a die.<br>
* Drawing one card from a deck of cards. <br>

For random variable `x`, the function `P(x)` assigns a probability to each possible value of `x`.

<center>$Probability\ Distribution\ of\ x = P(x)$</center>

If `A` is a specific event of `x`, 

<center>$Probability\ of\ Event\ A = P(A)$</center>

Probability of an event is calculated as *the number of desired outcomes* divided by *total number of possible outcomes*, where all outcomes are equally likely:

<center>$Probability = \frac{the\ number\ of\ desired\ outcomes}{total\ number\ of\ possible\ outcomes}$</center>

If we apply that principle to our examples above:<br>
* Coin toss: Probability of heads = 1 (desired outcome) / 2 (possible outcomes) = .50 or 50%<br>
* Die roll: Probability of rolling 3 = 1 (specific number) / 6 (possible numbers) = .1667 or 16.67%<br>
* Cards: Probability of drawing a queen of hearts = 1 (specific card) / 52 (possible cards) = .0192 or 1.92%<br>


One other important topic to keep in mind: the probability of an event not occurring is called the **complement** and is calculated as:

<center>$P(not\ A) = 1 - P(A)$</center>
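The formula and its complement can be checked with a few lines of Python. This sketch uses the standard-library `Fraction` type so the results stay exact; the helper name `classical_probability` is an assumption introduced for illustration.

```python
from fractions import Fraction

def classical_probability(desired, possible):
    """Desired outcomes divided by possible outcomes (all equally likely)."""
    return Fraction(desired, possible)

p_heads = classical_probability(1, 2)          # coin toss: heads
p_three = classical_probability(1, 6)          # die roll: a 3
p_queen_hearts = classical_probability(1, 52)  # cards: queen of hearts

for label, p in [("heads", p_heads),
                 ("rolling a 3", p_three),
                 ("queen of hearts", p_queen_hearts)]:
    complement = 1 - p  # P(not A) = 1 - P(A)
    print(f"{label}: P(A) = {p} ({float(p):.4f}), P(not A) = {complement}")
```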

### Calculate the Probability of a Player Drawing a Spade

If you scroll all the way to the top, you'll see that we loaded the original dataset into `cards_df`. Let's start out by ensuring that we still have that object to work with.

In [18]:
cards_df.shape

(10000, 5)

In [19]:
cards_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   player     10000 non-null  int64 
 1   value      10000 non-null  object
 2   suit       10000 non-null  object
 3   color      10000 non-null  object
 4   red_drawn  10000 non-null  bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 322.4+ KB


In [20]:
cards_df.head(10)

Unnamed: 0,player,value,suit,color,red_drawn
0,8,7,club,black,False
1,2,6,club,black,False
2,3,5,heart,red,True
3,2,A,spade,black,False
4,10,2,club,black,False
5,10,10,heart,red,True
6,6,3,spade,black,False
7,6,5,club,black,False
8,6,2,heart,red,True
9,1,Q,heart,red,True


I'm going to start by counting up the number of spade cards that each player drew.

In [21]:
spade_cards = cards_df[cards_df.suit == 'spade'].groupby('player').size()

spade_cards

player
1     237
2     262
3     275
4     235
5     251
6     245
7     235
8     247
9     255
10    246
dtype: int64

<div class="alert alert-block alert-info">
<b>Aggregation of spade cards drawn: What all is going on there?</b>  <br>
   1. Filtered cards_df down to only the spade cards <br>
   2. Grouped the output from step 1 by player <br>
   3. Used size() to get the number of spade cards for each player
</div>

In [22]:
total_pulls = len(cards_df)

total_pulls

10000

In [23]:
# divide each player's spade count by the total number of pulls (all players)
spade_prob = spade_cards / total_pulls

# sort the values in this series
spade_prob = spade_prob.sort_values(ascending=False)

spade_prob

player
3     0.0275
2     0.0262
9     0.0255
5     0.0251
8     0.0247
10    0.0246
6     0.0245
1     0.0237
4     0.0235
7     0.0235
dtype: float64

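Based on the results above, the estimated probability of Player 3 drawing a spade is 2.75%. Remember, probability talks about future events occurring, so another way of stating this would be to say that the likelihood of Player 3 drawing a spade is 2.75%. Keep in mind what the denominator is: we divided by all 10,000 pulls in the dataset, so this figure is the chance that a randomly chosen pull is a spade drawn by Player 3 (a joint probability). Dividing Player 3's 275 spades by their own 1,038 draws would instead give about 26.5%, which is close to the 25% expected from a standard deck.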

Also, we could say that the probability of Player 3 not drawing a spade is 97.25%.
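One detail worth checking in any calculation like this is the denominator. The sketch below, on an invented ten-row dataframe (`toy_df` is an assumption), contrasts dividing a player's spade count by all pulls with dividing it by that player's own pulls.

```python
import pandas as pd

# Toy data: player 1 makes 4 pulls, player 2 makes 6 (invented values)
toy_df = pd.DataFrame({
    "player": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "suit": ["spade", "heart", "club", "spade",
             "spade", "diamond", "heart", "club", "heart", "diamond"],
})
is_spade = toy_df["suit"] == "spade"

# Spade count per player, as in the walkthrough above
spades_by_player = toy_df[is_spade].groupby("player").size()

# Denominator = all pulls: share of the whole dataset that is "this player AND spade"
share_of_all_pulls = spades_by_player / len(toy_df)

# Denominator = the player's own pulls: spade rate given the player
share_of_own_pulls = is_spade.groupby(toy_df["player"]).mean()

print(share_of_all_pulls)  # player 1 -> 0.2, player 2 -> 0.1
print(share_of_own_pulls)  # player 1 -> 0.5, player 2 -> about 0.167
```

The first series adds up to the overall proportion of spades in the data, while the second describes each player separately. Which one is appropriate depends on the question being asked.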

## Binomial (Bernoulli) Distribution

Okay, let's take this one step further. We can see above that Player 3 had the highest probability of drawing a spade and Players 4 and 7 the lowest. Suppose we have each of the ten players draw ten cards. What do you think the probability of them pulling exactly one spade card is?

Before you reach for your calculators, we need to stop and talk about what we are truly asking for.  

We'll start with Player 3... On their first draw the card will either be a spade or will not be a spade. The outcome of this is binary. In the world of statistics, a single draw like this is called a **Bernoulli trial**, and counting the successes across several such trials leads to the **binomial distribution**.

At its heart, a binomial distribution describes the probability of each possible number of successes under certain conditions. 
* A **trial** is an event of interest with a discrete outcome (e.g. a coin toss).
* A **success** is defined as the outcome of interest in the trials. 

For example, in the coin toss above, we could say we are interested in the number of H outcomes out of 10 trials (tosses). Each H outcome would be a *success*, which can also be represented as a "1" (following binary logic).

A **binomial distribution** describes the number of successes (*x*) in *n* independent trials, each with probability *p* of success. With a single trial (*n* = 1), it reduces to the *Bernoulli distribution*.


In general, we are concerned with calculating two situations:

1. The probability of exactly *x* successes out of *n* trials. This is called the **probability mass function (pmf).**<br>
2. The probability of **no more than** _x_ successes out of *n* trials. This is called the **cumulative distribution function (cdf).**

In Python, SciPy provides the `stats.binom.pmf()` and `stats.binom.cdf()` functions, respectively, for that functionality. 

**pmf example:** A fair coin has a 50% (.50) chance of coming up heads on a toss. What is the probability of getting a head (H) 7 times out of 10 tosses?

x = 7<br>
n = 10<br>
p = 0.5<br>
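With those values, the example can be evaluated directly. This sketch calls the two SciPy functions named above; the variable names `pmf_7` and `cdf_7` are illustrative.

```python
from scipy import stats

x, n, p = 7, 10, 0.5

# pmf: probability of exactly 7 heads in 10 tosses
pmf_7 = stats.binom.pmf(x, n, p)

# cdf: probability of no more than 7 heads in 10 tosses
cdf_7 = stats.binom.cdf(x, n, p)

print(f"P(exactly 7 heads) = {pmf_7:.4f}")  # 120/1024, about 0.1172
print(f"P(at most 7 heads) = {cdf_7:.4f}")  # 968/1024, about 0.9453
```

The exact fractions come from counting outcomes: there are 120 ways to place 7 heads among 10 tosses, out of 1,024 equally likely sequences.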

### Calculate the Probability Each Player Will Draw a Spade out of Ten Cards

As before, we are going to start out verifying we still have access to our prior results.

In [24]:
spade_prob

player
3     0.0275
2     0.0262
9     0.0255
5     0.0251
8     0.0247
10    0.0246
6     0.0245
1     0.0237
4     0.0235
7     0.0235
dtype: float64

In [25]:
from scipy import stats

In [26]:
spade_bern = spade_prob.apply(lambda x: stats.binom.pmf(1, n=10, p=x))

In [27]:
spade_bern

player
3     0.213963
2     0.206314
9     0.202105
5     0.199671
8     0.197215
10    0.196598
6     0.195980
1     0.190984
4     0.189722
7     0.189722
dtype: float64

Remember, the binomial pmf is very specific. We calculated the likelihood of each player drawing exactly 1 spade card in their 10 pulls. What if we wanted to see the probability of a player drawing exactly 2 spade cards?

In [31]:
spade_2_bern = spade_prob.apply(lambda x: stats.binom.pmf(2, n=10, p=x))
spade_2_bern

player
3     0.027227
2     0.024979
9     0.023798
5     0.023133
8     0.022476
10    0.022312
6     0.022149
1     0.020863
4     0.020546
7     0.020546
dtype: float64

Wow! Quite a difference between 1 spade and 2 spades.
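To see where those numbers come from, here is a minimal sketch that evaluates the binomial pmf by hand with `math.comb` instead of SciPy, using Player 3's value of 0.0275. The helper name `binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_spade = 0.0275   # Player 3's value from the results above
n_draws = 10

# Probability of every possible spade count, from 0 through 10
pmf_values = [binomial_pmf(k, n_draws, p_spade) for k in range(n_draws + 1)]

for k in range(3):
    print(f"P(exactly {k} spades) = {pmf_values[k]:.6f}")

# "At least one spade" is the complement of "no spades at all"
p_at_least_one = 1 - pmf_values[0]
print(f"P(at least 1 spade) = {p_at_least_one:.6f}")
```

Because the pmf covers every possible count, the values sum to 1, which is why the drop from 1 spade to 2 spades is offset by the large probability of drawing no spades at all.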

# Recap: Pulling It All Together

So what does all this mean?

Going back and looking at all the calculations for Player 3 for this dataset we have:
* Proportion of red cards drawn is 49.3%
* Probability of drawing a spade is 2.75%
* Probability of drawing exactly 1 spade card out of 10 draws is 21.4%
* Probability of drawing exactly 2 spade cards out of 10 draws is 2.72%

All of those metrics are accurate. However, for an individual who is not familiar with statistics, they could be a bit confusing.

Next time someone starts peppering you with stats about something, take a step back and think about what they are **really** telling you. You could even ask them how they arrived at those calculations.