# Week 0
## Building Blocks of Probability

We work through the building blocks of probability by sampling using Python and Pandas.

## load libraries

In [None]:
# numerical libraries
import numpy as np
import scipy.special

# pandas!
import pandas as pd

# plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%matplotlib inline

## sampling

Let's sample some random numbers and apply the naive definition of probability.

### rolling 1 die

In [None]:
# let's roll a 6-sided die
num_times = 12
rolls = np.arange(1,num_times+1) # roll numbers 1..num_times
lets_roll = np.random.randint(low=1,high=7,size=num_times)
lets_roll

In [None]:
# let's put this in a dataframe and count how many times we rolled each number
pd.DataFrame(np.vstack([rolls,lets_roll]).T,columns=['roll','die_1']).groupby('die_1').count()
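To connect these counts back to the naive definition, one option is to turn them into proportions and set them next to the naive value of 1/6 for each face. The cell below is only a sketch: the names `rng`, `sample_rolls`, `observed`, and `comparison` are made up for illustration, and the seed is arbitrary and used only so the cell is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
num_times = 12
faces = np.arange(1, 7)

# roll one fair 6-sided die num_times times
sample_rolls = rng.integers(low=1, high=7, size=num_times)

# proportion of rolls landing on each face; faces never rolled get 0
observed = (
    pd.Series(sample_rolls)
    .value_counts(normalize=True)
    .reindex(faces, fill_value=0.0)
)

# side-by-side: observed proportion vs. the naive probability 1/6
comparison = pd.DataFrame({'observed': observed, 'naive': 1 / 6})
print(comparison)
```

With only 12 rolls, the observed column will usually sit quite far from 1/6; try increasing `num_times` and watch the gap shrink.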

### rolling 2 dice

In [None]:
# let's roll two 6-sided dice
num_times = 200
rolls = np.arange(1,num_times+1) # roll numbers 1..num_times
lets_roll = np.random.randint(low=1,high=7,size=[num_times,2])
#lets_roll

In [None]:
# now let's put this in a dataframe, so we can count
dos = pd.DataFrame(np.hstack([rolls.reshape(num_times,1),lets_roll]),
                   columns=['roll','die_1','die_2']
                  )

In [None]:
# and let's sum the dice
dos['total'] = dos.die_1 + dos.die_2

In [None]:
# and now count the number of times each sum appeared
dos[['total','roll']].groupby('total').count()

In [None]:
# sometimes it's easier to plot things 
# to see if anything jumps out and to build intuition

sns.histplot(dos['total'],discrete=True) # use seaborn to make a pretty histogram (distplot is deprecated)
plt.xlim(1.5,12.5) # set the limits of x-axis so the bars for 2 and 12 are fully visible
plt.xlabel('sum of 2 dice') # label the x-axis
plt.ylabel('number of times sum appeared') # label the y-axis
plt.title('Histogram: Sum of Two Dice, '+str(num_times)+' rolls') # make a nice title

## variability of dice rolling

How variable is this system? To get a sense for this, we will look at how the observed probability of rolling a total of 6 varies with the number of times we _roll_ the dice.

First, apply the naive rule of probability to compute the plausibility of rolling a 6:

$P(\text{rolling a six}) = \frac{|\text{set of outcomes where total equals 6}|}{|\text{set of all outcomes of two dice}|}$

This equals? $P(\text{rolling a six}) = $
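One way to check your answer is to let Python do the counting in the formula above: list every outcome of two dice and keep the ones whose total is 6. This is a sketch under the assumption that both dice are fair, so all 36 ordered pairs are equally likely; the names `outcomes`, `favorable`, and `p_six` are illustrative.

```python
from fractions import Fraction
from itertools import product

# all ordered (die_1, die_2) outcomes of two 6-sided dice
outcomes = list(product(range(1, 7), repeat=2))

# outcomes whose total equals 6
favorable = [o for o in outcomes if sum(o) == 6]

# naive definition: favorable outcomes over all outcomes
p_six = Fraction(len(favorable), len(outcomes))
print(favorable)
print(p_six, float(p_six))
```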

We expect some degree of variation between our simulated probability of rolling a six and our theoretically computed probability. The question is, how much? And how much does that variability change as we roll the dice an increasing number of times?

First, what did we get with our experiment earlier?

In [None]:
# how many sixes?
len(dos[dos.total == 6])

In [None]:
# how many rolls?
num_times

In [None]:
# so our observed naive probability of rolling a 6 is?
len(dos[dos.total == 6]) / num_times

To get a sense of the variability in a sample of experiments, we need to automate experiment generation with code. We can do this with a series of functions that sample from the dice, compute the observed proportion of each total, and plot a histogram of the observed probability of a particular sum of two dice.

Please take apart each function so all of this isn't [wrapped in an enigma](http://churchill-society-london.org.uk/RusnEnig.html). 

In [None]:
def just_roll_them(num_dice=2,num_rolls=100,num_exprmnts=100):
    """
    sample from two dice, take the total, return 
    a dataframe of experiments
    """
    
    # total number of dice throws
    tot_throws = num_rolls * num_exprmnts
    
    # build index of experiments and rolls
    rolls = np.arange(1,num_rolls+1)
    exprmnts = np.arange(1,num_exprmnts+1)
    idx = np.stack(np.meshgrid(exprmnts,rolls),-1).reshape(-1,2)
    
    # sample from dice
    samples = np.random.randint(low=1,high=7,size=[tot_throws,num_dice])
    
    # put it all together in a dataframe, with one column per die
    die_cols = ['die_'+str(i+1) for i in range(num_dice)]
    res = pd.DataFrame(np.hstack([idx,samples]),columns=['exprmnt','roll']+die_cols)
    res['total'] = res[die_cols].sum(axis=1)
    
    # return results
    return res

In [None]:
def compute_prop(exprmnt,num_rolls):
    """
    count dice totals and return nice dataframe
    
    takes dataframe produced by just_roll_them
    """
    
    # count up sums by experiment
    count_df = exprmnt[['exprmnt','total','die_1']].groupby(['exprmnt','total']).count()
    
    # compute proportion
    count_df['prop'] = count_df['die_1'] / num_rolls
    
    # rename columns and drop
    count_df.rename(columns={'die_1':'obs'},inplace=True)
    #count_df.drop(['die_2'],axis=1,inplace=True)
    
    return count_df

In [None]:
def plot_variability(c_df,dice_sum):
    """
    plot the variability of observing dice_sum
    in a dataframe produced by compute_prop
    """
    
    sns.histplot(c_df[c_df.index.get_level_values('total').isin([dice_sum])]['prop'],kde=True,stat='density')
    plt.xlabel('observed probability') # label the x-axis
    plt.ylabel('density') # label the y-axis
    plt.title('Histogram: Variability in Rolling a '+str(dice_sum)) # make a nice title

In [None]:
def assess_variability(dice_sum,num_rolls,num_exprmnts,num_dice):
    """
    put all our helper functions together
    and just produce a graph already!
    """
    
    rolls_df = just_roll_them(num_dice,num_rolls,num_exprmnts)
    c_df = compute_prop(rolls_df,num_rolls)
    plot_variability(c_df,dice_sum)    

Let's focus on the chances of rolling an 8. How does this vary with the number of times we roll the dice in each experiment?

In [None]:
# what if we only roll the dice 25 times?
# note the x-axis
assess_variability(8,25,100,2)

In [None]:
# how about 1000 times?
# compare the x-axis with the above graph
# would you say you're more certain about the probability of rolling an 8?
assess_variability(8,1000,100,2)

Hopefully this result is intuitive, given all of your experience gambling and playing Settlers of Catan.
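To put a number on what the two graphs show, one option is to compute the standard deviation of the observed proportion across experiments. The cell below is a rough sketch, not part of the helper functions above: `prop_spread` is a hypothetical name, the seed is arbitrary, and it works directly on NumPy arrays rather than the dataframes used earlier. Unlike a groupby count, it counts an experiment with zero 8s as a proportion of 0.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def prop_spread(dice_sum, num_rolls, num_exprmnts=100):
    """
    standard deviation, across experiments, of the observed
    proportion of rolls of two dice whose total equals dice_sum
    """
    # shape: (experiments, rolls per experiment, 2 dice)
    throws = rng.integers(low=1, high=7, size=(num_exprmnts, num_rolls, 2))
    totals = throws.sum(axis=2)
    # observed proportion in each experiment (zero hits count as 0)
    props = (totals == dice_sum).mean(axis=1)
    return props.std()

spread_25 = prop_spread(8, 25)
spread_1000 = prop_spread(8, 1000)
print(spread_25, spread_1000)
```

The spread for 1000 rolls should come out much smaller than the spread for 25 rolls, which matches the narrower x-axis in the second graph.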

## conditional probability with dice

## binomial coefficient

In [None]:
# python implementation of binomial coefficient
scipy.special.binom(4,2)
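As a sanity check, the same value can be computed from the factorial formula $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ and with the standard library's `math.comb` (available in Python 3.8+). This is a small sketch; the names `from_factorials` and `from_math` are illustrative. Note that `scipy.special.binom` returns a float, while the two versions below return exact integers.

```python
import math

n, k = 4, 2

# binomial coefficient straight from the factorial formula
from_factorials = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# standard-library implementation, returns an exact integer
from_math = math.comb(n, k)

print(from_factorials, from_math)
```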