# Extending the Dice problem

In this notebook, I extend the dice problem presented in _Think Bayes_ by Allen Downey. 

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.set_printoptions(precision=3)
import itertools as it

## The dice problem

In the Dice walkthrough, I computed the posterior probability of each die in a box after rolling a 6.

(Note: I use N to stand for any whole number, such as the number of sides on a die.)

The box contained a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die.

In the scenario, I selected a die from the box at random, rolled it, and got a 6. The posterior answers the question: given my 5 available hypotheses, what is the probability of each one, given that I rolled a 6?
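As a quick check of that five-dice scenario, here is a minimal sketch in plain Python. The names `dice`, `roll`, and `unnormalized` are my own for illustration, not from the book, and I use `fractions.Fraction` only to keep the arithmetic exact. It assumes every die is fair.

```python
from fractions import Fraction

dice = [4, 6, 8, 12, 20]           # the five hypotheses: number of sides
roll = 6                           # the observed data

prior = Fraction(1, len(dice))     # each die is equally likely to be picked

# prior * likelihood: a die with fewer sides than the roll is impossible
unnormalized = [prior * (Fraction(1, sides) if roll <= sides else 0)
                for sides in dice]

# divide by the total so the posterior probabilities sum to one
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

for sides, p in zip(dice, posterior):
    print("%2d-sided: %.3f" % (sides, float(p)))
```

The 4-sided die is ruled out, and the 6-sided die comes out most probable at 20/51, or about 0.392.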

### Extending the question

What if I have N hypotheses? What if I have N data points? What are the posterior probabilities of the N hypotheses given the N data points?

<hr>

The approach is to define my hypotheses, my data, and my likelihood function.

#### Hypotheses

Let's say I have a box of N-sided dice.

We initially assume that each die in the box is equally likely to be the one picked.

This initial assumption is called our **prior**. Because the hypotheses are equally likely, each has a probability of 1 / N of being the die that was picked, where N here is the number of dice in the box.

In [14]:
start = 2
stop = 20
iterate = 2
# dice with 2, 4, ..., 18 sides (stop is exclusive)
hypos = list(range(start, stop, iterate))

# uniform prior: one equal share per hypothesis
priors = np.repeat(1.0 / len(hypos), len(hypos))

print(pd.DataFrame({'H': hypos, 'P(H)': priors},
                   columns=['H', 'P(H)']))

    H      P(H)
0   2  0.111111
1   4  0.111111
2   6  0.111111
3   8  0.111111
4  10  0.111111
5  12  0.111111
6  14  0.111111
7  16  0.111111
8  18  0.111111


Above, we defined our hypotheses: the N-sided dice in our box. I set them to a range of values, every second number from 2 through 18 (the `stop` value of 20 is excluded). From that, I computed the prior probability of each die. We decided each die is equally likely, so the prior is 1/N, where N is the number of dice in the box (here 1/9).

The important point, and the source of the extension, is that I generalized the process so everything that follows builds on this set-up. In other words, we can change the initial set-up and the later computations will change accordingly. This brings us one step closer to adapting the methodology to other problems, which is the goal of these notebooks.
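To make the generalization concrete, here is a small sketch with a hypothetical helper, `uniform_priors`, that is not part of the notebook. The `demo_` and `other_` names are also mine, chosen so they don't overwrite the notebook's variables. The point is only that swapping the list of hypotheses is enough to get a new prior.

```python
def uniform_priors(hypos):
    """Return one equal prior per hypothesis (hypothetical helper)."""
    n = len(hypos)
    return [1.0 / n for _ in hypos]

# The notebook's set-up: dice with 2, 4, ..., 18 sides
demo_hypos = list(range(2, 20, 2))
demo_priors = uniform_priors(demo_hypos)
print(len(demo_hypos))     # number of dice in the box
print(demo_priors[0])      # prior shared by every die

# A different box: only the set-up line changes
other_hypos = [4, 6, 8, 12, 20]
other_priors = uniform_priors(other_hypos)
print(other_priors[0])
```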

<hr>

Our data can be generalized too. Our data are the outcomes of N rolls of the die; here I use every value from 1 through 20.

In [16]:
data = np.arange(1,21).tolist()
data

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Let's make a dataframe that puts our hypotheses, priors and data all together. 

In [21]:
lstoflsts = []

# pair each hypothesis with its own prior, then cross with every outcome
for hypo, prior in zip(hypos, priors):
    for d in data:
        lstoflsts.append([hypo, prior, d])

df = pd.DataFrame(lstoflsts, columns=['H', 'P(H)', 'D'])
print(df.shape)
df.head()

(180, 3)


|   | H | P(H) | D |
|---|---|------|---|
| 0 | 2 | 0.111111 | 1 |
| 1 | 2 | 0.111111 | 2 |
| 2 | 2 | 0.111111 | 3 |
| 3 | 2 | 0.111111 | 4 |
| 4 | 2 | 0.111111 | 5 |
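As an alternative way to build the same combinations, here is a sketch using `itertools.product`. It keeps each hypothesis tied to its own prior with `zip`, so there is exactly one row per (hypothesis, outcome) combination. The `demo_` names and `rows` are mine, chosen to avoid touching the notebook's variables.

```python
import itertools as it

demo_hypos = list(range(2, 20, 2))                      # 2, 4, ..., 18
demo_priors = [1.0 / len(demo_hypos)] * len(demo_hypos)
demo_data = list(range(1, 21))                          # outcomes 1 through 20

# zip keeps each hypothesis with its own prior; product then crosses
# every (hypothesis, prior) pair with every outcome
rows = [(h, p, d)
        for (h, p), d in it.product(zip(demo_hypos, demo_priors), demo_data)]

print(len(rows))    # one row per (hypothesis, outcome) combination
print(rows[:3])
```

The resulting list can be handed to `pd.DataFrame(rows, columns=['H', 'P(H)', 'D'])` to get a frame with the same three columns.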


I can also assign a probability to each data value. Treating the N outcomes in my data as equally likely, that is just 1/N, where N is the length of my data (here 1/20 = 0.05). I'll add this to our dataframe.

In [25]:
prob_data = np.repeat((1.0 / len(data)),len(df['D']))
df['P(D)'] = prob_data
df.head()

|   | H | P(H) | D | P(D) |
|---|---|------|---|------|
| 0 | 2 | 0.111111 | 1 | 0.05 |
| 1 | 2 | 0.111111 | 2 | 0.05 |
| 2 | 2 | 0.111111 | 3 | 0.05 |
| 3 | 2 | 0.111111 | 4 | 0.05 |
| 4 | 2 | 0.111111 | 5 | 0.05 |


The question is: for the data I've collected, what is the probability that each die was the one rolled?

To determine that probability, I need to calculate the likelihood.

The likelihood is the probability of observing the data under a given hypothesis. For a fair die with h sides, every value from 1 through h has probability 1/h, and any larger value has probability 0.

For example, what's the probability of rolling a 2 with a 4-sided die? It is 1/4.
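That rule fits in a few lines. This is a minimal sketch assuming fair dice; `die_likelihood` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
from fractions import Fraction

def die_likelihood(outcome, sides):
    """P(outcome | fair die with `sides` faces); hypothetical helper name."""
    if outcome < 1 or outcome > sides:
        return Fraction(0)          # impossible on this die
    return Fraction(1, sides)       # every face is equally likely

print(die_likelihood(2, 4))    # rolling a 2 on a 4-sided die
print(die_likelihood(6, 4))    # rolling a 6 on a 4-sided die
```

The first call gives 1/4 and the second gives 0, since a 4-sided die cannot show a 6.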

Let's add the likelihood of the data given the hypotheses to our dataframe.

In [26]:
likelihood = []
df2 = df.copy()
for i in range(df.shape[0]):
    h = df.iloc[i]['H']
    d = df.iloc[i]['D']
    if d > h:
        # an h-sided die can never show a value above h
        likelihood.append(0.0)
    else:
        # each of the h faces is equally likely
        likelihood.append(1.0 / h)

df2['P(D|H)'] = likelihood
df2.head()

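|   | H | P(H) | D | P(D) | P(D\|H) |
|---|---|------|---|------|--------|
| 0 | 2 | 0.111111 | 1 | 0.05 | 0.5 |
| 1 | 2 | 0.111111 | 2 | 0.05 | 0.5 |
| 2 | 2 | 0.111111 | 3 | 0.05 | 0.0 |
| 3 | 2 | 0.111111 | 4 | 0.05 | 0.0 |
| 4 | 2 | 0.111111 | 5 | 0.05 | 0.0 |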


Now we can derive the probability of each hypothesis given the data we observe. This is our **posterior**. You can think of it as an _update_ to our prior. Our prior assigned a probability to each hypothesis being true, but it was generic. Our _posterior_ gives a better estimate for each hypothesis because it is informed by what we actually observe.

In [27]:
# prior times likelihood: the numerator of Bayes' rule
num = df2['P(H)'].values * df2['P(D|H)'].values
posterior = num  # not yet normalized
df2['P(H|D)_raw'] = posterior.tolist()
df2.head()

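|   | H | P(H) | D | P(D) | P(D\|H) | P(H\|D)_raw |
|---|---|------|---|------|--------|------------|
| 0 | 2 | 0.111111 | 1 | 0.05 | 0.5 | 0.055556 |
| 1 | 2 | 0.111111 | 2 | 0.05 | 0.5 | 0.055556 |
| 2 | 2 | 0.111111 | 3 | 0.05 | 0.0 | 0.0 |
| 3 | 2 | 0.111111 | 4 | 0.05 | 0.0 | 0.0 |
| 4 | 2 | 0.111111 | 5 | 0.05 | 0.0 | 0.0 |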


But as we saw before, the values computed aren't really probabilities yet: they need to be normalized so that they sum to one.
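As a rough illustration of what normalizing means, here is a sketch for a single observation. The helper `normalize`, the roll of 6, and the variable names are all assumptions I chose for the example; it works on plain lists rather than on `df2`.

```python
def normalize(values):
    """Scale non-negative numbers so they sum to one (hypothetical helper)."""
    total = sum(values)
    if total == 0:
        raise ValueError("no hypothesis is consistent with the data")
    return [v / total for v in values]

sides = list(range(2, 20, 2))          # the dice in the box: 2, 4, ..., 18
prior = 1.0 / len(sides)
roll = 6                               # a single illustrative observation

# prior * likelihood for each die, before normalizing
raw_scores = [prior * (1.0 / s if roll <= s else 0.0) for s in sides]
normalized = normalize(raw_scores)

for s, p in zip(sides, normalized):
    print("%2d-sided: %.3f" % (s, p))
```

Dividing by the total is what turns the raw products into a proper distribution over the dice. The 2-sided and 4-sided dice stay at zero because they cannot produce a 6.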