# Probability Practice


## What is Probability?

> **Probability is the likelihood of a specific outcome occurring out of all possible outcomes, expressed as a number between 0 and 1.**

Perhaps more importantly:

> **"Probabilities do not tell us what will happen for sure; they tell us what is _likely to happen_ and what is _less likely to happen_."**
>
> -- _Naked Statistics_, by Charles Wheelan, p. 72

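In general, you can think of dividing the number of outcomes in the event you're exploring by the number of all possible outcomes. This counting rule assumes every outcome in the sample space is equally likely: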

$$ P(Event) = \frac{|Event|}{|Sample\ Space|} $$
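As a quick illustration of the formula, here is a minimal sketch using a hypothetical fair six-sided die (the die is not part of this notebook's data; it is only an assumed example where every outcome is equally likely).

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}
event = {outcome for outcome in sample_space if outcome % 2 == 0}  # "the roll is even"

# P(Event) = |Event| / |Sample Space|, valid because all outcomes are equally likely
p_even = Fraction(len(event), len(sample_space))
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand calculation.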

## Planning Party Playlists with Probabilities & Combinatorics

- We are constructing a dinner party playlist for a gathering we are planning. 
- We asked our attendees to each provide a handful of songs they would like to be played at the dinner party.

In [1]:
# Imports
import pandas as pd
import numpy as np
import math

In [2]:
# This code might be a bit different from anything you've seen before...
# Point is, it grabs each of the CSVs in a folder and reads them in 
import os, glob

datafolder = "probability_playlists/"
rec_files = glob.glob(datafolder+"*.csv")

playlists = {}
for file in rec_files:
    key = os.path.basename(file).replace('_recs.csv','')
    playlists[key] = pd.read_csv(file)
playlists.keys()

dict_keys(['anne', 'james', 'joe', 'john', 'samantha'])

In [3]:
for name, playlist in playlists.items():
    print(f"{name.title()}'s Requests:")
    display(playlist)

Anne's Requests:


Unnamed: 0,artist,track,Recommended By
0,Smashing Pumpkins,"Tonight, Tonight",Anne
1,Black Eyed Peas,Let's Get it Started,Anne
2,Green Day,Time of your Life,Anne


James's Requests:


Unnamed: 0,artist,track,Recommended By
0,Eve 6,Here's to the Night,James
1,Neutral Milk Hotel,Into the Aeroplane Over the Sea,James
2,Rilo Kiley,With Arms Outstretched,James
3,Red Hot Chili Peppers,Otherside,James


Joe's Requests:


Unnamed: 0,artist,track,Recommended By
0,Green Day,Time of your Life,Joe
1,B-52s,Rock Lobster,Joe
2,Lady GaGa,Poker Face,Joe
3,John Lennon,Imagine,Joe


John's Requests:


Unnamed: 0,artist,track,Recommended By
0,Black Eyed Peas,Let's Get it Started,John
1,Lady GaGa,Poker Face,John
2,Lady GaGa,Bad Romance,John
3,Lady GaGa,Just Dance,John


Samantha's Requests:


Unnamed: 0,artist,track,Recommended By
0,Black Eyed Peas,Let's Get it Started,Samantha
1,Panic at the Disco,Hallelujah,Samantha
2,Adele,Set Fire to the Rain,Samantha


For now, let's assume we take everyone's recommendations and add them all to our playlist, even if the same song has been recommended by someone else.

In [4]:
# Create 1 df for all recs
df = pd.concat(playlists).reset_index(drop=True)
df

Unnamed: 0,artist,track,Recommended By
0,Smashing Pumpkins,"Tonight, Tonight",Anne
1,Black Eyed Peas,Let's Get it Started,Anne
2,Green Day,Time of your Life,Anne
3,Eve 6,Here's to the Night,James
4,Neutral Milk Hotel,Into the Aeroplane Over the Sea,James
5,Rilo Kiley,With Arms Outstretched,James
6,Red Hot Chili Peppers,Otherside,James
7,Green Day,Time of your Life,Joe
8,B-52s,Rock Lobster,Joe
9,Lady GaGa,Poker Face,Joe


### Q1: What is the probability of the next song being by Lady Gaga?

Assume we just accept everyone's suggestions allowing duplicate songs and play on shuffle.

Remember: 


$$ P(E) = \frac{|E|}{|S|} $$

In [5]:
# Set up: number of songs, grouped by artist
df['artist'].value_counts()

Lady GaGa                4
Black Eyed Peas          3
Green Day                2
Eve 6                    1
Adele                    1
Red Hot Chili Peppers    1
Rilo Kiley               1
Neutral Milk Hotel       1
John Lennon              1
Panic at the Disco       1
Smashing Pumpkins        1
B-52s                    1
Name: artist, dtype: int64

In [6]:
# What is the sample space?
S = len(df)

In [14]:
# What about the event space? 
E = df.loc[df['artist'] == 'Lady GaGa'].shape[0]

In [15]:
# Find the probability of a Lady GaGa song playing
P_lady_gaga = E / S
P_lady_gaga

0.2222222222222222

In [None]:
# Sanity check: df['artist'].value_counts(normalize=True) returns proportions
# (values between 0 and 1), so its 'Lady GaGa' entry should equal P_lady_gaga
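The sanity check can be sketched as follows. Since the CSV files are not available here, the `artists` Series is rebuilt by hand from the five request tables shown earlier; the names `artists`, `proportions`, and `p_direct` are assumptions made for this sketch.

```python
import pandas as pd

# Artist column rebuilt by hand from the five request tables above (18 rows)
artists = pd.Series(
    ["Smashing Pumpkins", "Black Eyed Peas", "Green Day",        # Anne
     "Eve 6", "Neutral Milk Hotel", "Rilo Kiley",
     "Red Hot Chili Peppers",                                     # James
     "Green Day", "B-52s", "Lady GaGa", "John Lennon",            # Joe
     "Black Eyed Peas", "Lady GaGa", "Lady GaGa", "Lady GaGa",    # John
     "Black Eyed Peas", "Panic at the Disco", "Adele"],           # Samantha
    name="artist",
)

# normalize=True divides each count by the total number of rows
proportions = artists.value_counts(normalize=True)

# Direct computation: |E| / |S|
p_direct = (artists == "Lady GaGa").sum() / len(artists)

print(proportions["Lady GaGa"])
print(p_direct)
```

Both printed values should agree with the `E / S` result computed in the cells above.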

### Q2: What is the probability of the next song being "Time of Your Life"?

In [16]:
# Set Up
df['track'].value_counts()

Let's Get it Started               3
Time of your Life                  2
Poker Face                         2
Into the Aeroplane Over the Sea    1
Here's to the Night                1
Bad Romance                        1
Hallelujah                         1
Rock Lobster                       1
Tonight, Tonight                   1
With Arms Outstretched             1
Imagine                            1
Just Dance                         1
Otherside                          1
Set Fire to the Rain               1
Name: track, dtype: int64

In [21]:
# Event space?
E = df.loc[df['track'] == 'Time of your Life'].shape[0]

In [22]:
# Sample space?
S = len(df)

In [24]:
# Probability of 'Time of Your Life'
P_time_of_your_life = E / S
P_time_of_your_life

0.1111111111111111

### Q3: What is the probability of hearing a song by Lady GaGa or Green Day?


In [46]:
# Set up
df['artist'].value_counts()

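Lady GaGa                4
Black Eyed Peas          3
Green Day                2
Eve 6                    1
Adele                    1
Red Hot Chili Peppers    1
Rilo Kiley               1
Neutral Milk Hotel       1
John Lennon              1
Panic at the Disco       1
Smashing Pumpkins        1
B-52s                    1
Name: artist, dtype: int64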

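In [26]:
# Event space: songs by Lady GaGa OR Green Day.
# Use | (or), not & (and): no single row has both artists, so & always gives 0.
E = df.loc[(df['artist'] == 'Lady GaGa') | (df['artist'] == 'Green Day')].shape[0]

In [27]:
# Sample space
S = len(df)

In [29]:
# Find the probability
P_lady_gaga_or_greenday = E / S
P_lady_gaga_or_greenday

0.3333333333333333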
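As a cross-check, a song cannot be by both artists at once, so the two events are mutually exclusive and their probabilities simply add. The sketch below hard-codes the counts from the `value_counts()` output shown for Q1 (4 Lady GaGa songs and 2 Green Day songs out of 18); the variable names are assumptions for illustration.

```python
from fractions import Fraction

# Counts taken from the artist value_counts() output above
n_songs = 18
n_lady_gaga = 4
n_green_day = 2

p_lady_gaga = Fraction(n_lady_gaga, n_songs)
p_green_day = Fraction(n_green_day, n_songs)

# Mutually exclusive events: P(A or B) = P(A) + P(B)
p_either = p_lady_gaga + p_green_day
print(p_either)         # 1/3
print(float(p_either))  # 0.3333333333333333
```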

### Q4: How many different ways could we build a playlist using everyone's recommendations (without shuffle, no looping, and no repeated songs)?

In [None]:
# First, let's deal with those duplicates we've been ignoring
df = df.drop_duplicates(subset=['artist', 'track'])

In [None]:
# Calculate how many possible playlists
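One possible way to approach this, as a sketch rather than the official solution: if every distinct song is played exactly once and order matters, the number of playlists is the number of permutations of all songs, n!. The value `n_unique = 14` is an assumption read off the track `value_counts()` output above, which lists 14 different tracks; in the notebook, `len(df)` after `drop_duplicates` would play this role.

```python
import math

# Assumption: 14 distinct songs remain after dropping duplicates
n_unique = 14

# Ordering all n distinct songs with no repeats: n! permutations
n_playlists = math.factorial(n_unique)
print(n_playlists)  # 87178291200
```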


### Q5: What if we limit the playlist to only 10 songs, without replacement? How many possible playlists?

In [None]:
# Calculate how many possible playlists
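A hedged sketch for this question: picking 10 songs out of n without replacement, where play order matters, is a permutation count, n! / (n - 10)!. As before, `n_unique = 14` is an assumption based on the 14 distinct tracks listed above. `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math

# Assumption: 14 distinct songs are available after dropping duplicates
n_unique = 14
k = 10

# Ordered selection without replacement: n! / (n - k)!
n_ordered = math.perm(n_unique, k)
print(n_ordered)  # 3632428800

# For comparison: if play order did NOT matter, count combinations instead
n_unordered = math.comb(n_unique, k)
print(n_unordered)  # 1001
```

Whether order matters is a modeling choice; for a playlist played without shuffle, the ordered count is usually the one of interest.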


### Q6: What if we select 10 songs out of the total number of suggestions and allow for repetition?

In [None]:
# Calculate how many possible playlists
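A minimal sketch, assuming order matters and that each of the 10 slots can hold any of the distinct songs: with repetition allowed there are n choices per slot, giving n^k playlists. The pool size `n_unique = 14` is an assumption (the 14 distinct tracks above); if the pool were instead all 18 raw suggestions, the same formula would apply with n = 18.

```python
# Assumption: the pool is the 14 distinct songs left after dropping duplicates
n_unique = 14
k = 10

# Ordered selection with replacement: n choices for each of the k slots
n_with_repetition = n_unique ** k
print(n_with_repetition)  # 289254654976
```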


Hooray! Great job practicing probabilities and combinatorics!

---

## Conditional Probability with Mushrooms

#### When do we compute conditional probabilities? 

- We compute conditional probabilities when the probability of an event depends on the outcome of another event (dependent events). The conditional probability of an event is the probability that it occurs given that another event has already occurred.
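In symbols, for two events $A$ and $B$ with $P(B) > 0$, the standard definition reads:

$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$

Here $A$ and $B$ are generic placeholder events, not columns of the dataset.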


### Mushroom dataset

To discuss conditional probability, let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

The modified dataset includes 7 variables:

* **edible-poisonous**
    * This categorical variable can have one of two values: "edible" if the mushroom is edible, "poisonous" if it is not.

* **bruised**
    * This Boolean variable is either True or False.

* **gill-spacing**
    * This categorical variable can have one of three values: "close", "crowded", or "distant"

* **stalk-shape**
    * This categorical variable can have one of two values: "enlarging" or "tapering"

* **stalk-color-above-ring**
    * This categorical variable can have one of nine values: "brown", "buff", "cinnamon", "gray", "orange", "pink", "red", "white", or "yellow"

* **stalk-color-below-ring**
    * This categorical variable can have one of nine values: "brown", "buff", "cinnamon", "gray", "orange", "pink", "red", "white", or "yellow"

* **gill-color**
    * This categorical variable can have one of twelve values: "black", "brown", "buff", "chocolate", "gray", "green", "orange", "pink", "purple", "red", "white", or "yellow"



In [30]:
df = pd.read_csv('../data/Mushrooms_cleaned.csv')
df.head()

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
4,edible,crowded,tapering,white,white,black,False


### Q1: If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? 

$P(bruised)$

In [34]:
# Calculate the probability of a bruised mushroom
p_bruised = ((df.loc[df['bruised'] == True]).shape[0]) / len(df)
p_bruised

0.4155588380108321

In [31]:
df['bruised'].value_counts(normalize=True)[True]

0.4155588380108321

### Q2: What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible? 

$P(edible \cap bruised)$ 

In [None]:
# Alternative: a pivot table (or pd.crosstab with normalize='all') of
# 'bruised' vs. 'edible-poisonous' gives the same joint probability

In [39]:
event = len(df.loc[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')])

In [41]:
# Calculate the probability of a bruised and edible mushroom
p_bruised_and_edible = event / df.shape[0]
p_bruised_and_edible

0.33874938453963566

### Q3: What is the probability of picking an edible mushroom given it is bruised? 

$P(edible | bruised)$

In [43]:
# Calculate the probability of an edible mushroom if you know it's bruised
p_edible_given_bruised = p_bruised_and_edible / p_bruised
p_edible_given_bruised

0.8151658767772512

In [44]:
# Alternative: keep only the bruised mushrooms, then look at the share that is edible
bruised = df.loc[df['bruised'] == True]
bruised['edible-poisonous'].value_counts(normalize=True)

edible       0.815166
poisonous    0.184834
Name: edible-poisonous, dtype: float64

### Q4: What is the probability of picking a bruised mushroom given it is edible? 

$P(bruised | edible)$

In [None]:
# Calculate the probability of a bruised mushroom if you know it's edible
p_bruised_given_edible = None
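One way this could be computed, sketched on toy data: the `toy` DataFrame below contains only the five preview rows shown by `df.head()` above, so its result is NOT the answer for the full dataset. On the real data, the same expressions would be applied to `df`.

```python
import pandas as pd

# Toy data: only the five preview rows from df.head(), not the full dataset
toy = pd.DataFrame({
    "edible-poisonous": ["poisonous", "edible", "edible", "poisonous", "edible"],
    "bruised": [True, True, True, True, False],
})

edible = toy["edible-poisonous"] == "edible"

# The mean of a Boolean Series is the proportion of True values
p_edible = edible.mean()
p_bruised_and_edible = (toy["bruised"] & edible).mean()

# P(bruised | edible) = P(bruised and edible) / P(edible)
p_bruised_given_edible = p_bruised_and_edible / p_edible
print(p_bruised_given_edible)

# Cross-check: filter to edible mushrooms first, then take the bruised share
print(toy.loc[edible, "bruised"].mean())
```

Note that this is generally different from $P(edible | bruised)$: the conditioning event changes the denominator.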

### Q5: What is the probability that a mushroom is edible if you can see that part of it is orange?

$P(edible | orange)$

Note - explore the data! Lots of parts of a mushroom could be orange!

In [None]:
# Explore the data and find which columns tell you about the mushroom's color


In [None]:
# Calculate the probability of an edible mushroom if you know part of it is orange
p_edible_given_orange = None
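A possible approach, shown on made-up data: treat a mushroom as "orange" if any of its color columns equals "orange", then take the edible share among those rows. The six rows in `toy` are hypothetical and chosen only to exercise the logic; they are not taken from the dataset, so the resulting number says nothing about real mushrooms.

```python
import pandas as pd

# Hypothetical rows for illustration only (column names match the dataset)
toy = pd.DataFrame({
    "edible-poisonous": ["edible", "edible", "poisonous", "edible", "poisonous", "edible"],
    "stalk-color-above-ring": ["orange", "white", "white", "orange", "white", "gray"],
    "stalk-color-below-ring": ["orange", "white", "orange", "white", "white", "gray"],
    "gill-color": ["brown", "orange", "buff", "white", "black", "pink"],
})

# Every column whose name mentions "color" describes a part that could be orange
color_cols = [col for col in toy.columns if "color" in col]

# True for rows where at least one color column is "orange"
has_orange = (toy[color_cols] == "orange").any(axis=1)

# P(edible | orange): edible share among the rows with an orange part
p_edible_given_orange = (toy.loc[has_orange, "edible-poisonous"] == "edible").mean()
print(p_edible_given_orange)  # 0.75 for this toy data
```

Using `any(axis=1)` counts a mushroom once even when several of its parts are orange.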

## Level Up

What's the probability that a mushroom is edible if it has close gill spacing and a tapering stalk?

$P(edible \mid close \cap tapering)$

In [None]:
# P that mushroom is edible given close gill spacing and tapering stalk
p_edible_given_close_tapering = None
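A sketch of conditioning on two features at once, again on made-up data: the six rows in `toy` are hypothetical and exist only to show the mechanics. On the real dataset the same mask would be built from `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only (column names match the dataset)
toy = pd.DataFrame({
    "edible-poisonous": ["edible", "poisonous", "poisonous", "edible", "edible", "poisonous"],
    "gill-spacing": ["close", "close", "close", "close", "crowded", "close"],
    "stalk-shape": ["tapering", "tapering", "tapering", "enlarging", "tapering", "tapering"],
})

# Condition: close gill spacing AND tapering stalk
condition = (toy["gill-spacing"] == "close") & (toy["stalk-shape"] == "tapering")

# P(edible | close and tapering): edible share among rows meeting the condition
p_edible_given_close_tapering = (toy.loc[condition, "edible-poisonous"] == "edible").mean()
print(p_edible_given_close_tapering)  # 0.25 for this toy data

# With only two classes, the poisonous probability is the complement
p_poisonous_given_close_tapering = 1 - p_edible_given_close_tapering
print(p_poisonous_given_close_tapering)  # 0.75 for this toy data
```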