# Distributions Warmup

It's another day at the office at Big Research Co &trade;. You look up from your
laptop and see a woman in a lab coat standing in front of your desk.

"I need some help" she says. "We lost some subjects from the trial."

She notices a curious look on your face. "Not like that, they just ran away.
We didn't lock the doors soon enough."

"Anyway, there's probably like a 70%, no maybe 80%, no, let's say 90% chance
that a given subject will stick around, and I need to run the study again with
10, or 20 subjects. We need to gather enough data on them to justify the cost,
so I need you to figure out what are the probabilities are that at least half of
them stick around, only 1 person leaves, and that all the subjects stay."

She sees you start to form another question and cuts you off.

"Don't ask. You *really* don't want to know."

---

- What probability distribution would you use to model the scenario outlined
  above?
- Calculate all the requested probabilities.

    Use all the possible combinations of subject count and chance that a subject
    will stay in the study. For example, first calculate the chance that at
    least half of the subjects stay in the study if there is a 70% chance that
    each subject sticks around and there are 10 subjects, then the probability
    that only one person leaves, then the probability that all the subjects stay.

- **Bonus**: visualize the requested probabilities.

## Hints

- Use `scipy.stats` for this.
- Each distribution has a cumulative distribution function (`cdf`) that tells
  you the likelihood that a value falls at or below a given point.
- Consider storing the results of your calculations in a data frame.
- A fancy list comprehension or the `itertools` module can help you find
  all the possible combinations.
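
A minimal sketch of how these hints fit together, assuming a hypothetical trial of 10 subjects with a 0.7 chance that each one stays (the variable names are illustrative, not part of the exercise). It shows that `cdf(k)` gives P(X ≤ k), that the survival function `sf(k)` gives P(X > k) so "at least k" needs `sf(k - 1)`, and that `itertools.product` enumerates every parameter pairing.

```python
import itertools as it

from scipy import stats

# Hypothetical example values, chosen only for illustration.
n, p = 10, 0.7
dist = stats.binom(n, p)  # X = number of subjects who stay

k = 5
at_most_k = dist.cdf(k)      # P(X <= k)
more_than_k = dist.sf(k)     # P(X > k), equal to 1 - cdf(k)
at_least_k = dist.sf(k - 1)  # P(X >= k): shift by one to include k itself
exactly_k = dist.pmf(k)      # P(X == k)

# Every (subject count, stay probability) pairing, one tuple per scenario.
combos = list(it.product([10, 20], [0.7, 0.8, 0.9]))
print(len(combos), combos[0])
```

The off-by-one between `sf(k)` and `sf(k - 1)` is easy to miss with a discrete distribution, so it is worth checking that `at_least_k` equals `more_than_k + exactly_k`.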



In [16]:
import pandas as pd
import numpy as np

import itertools as it

from scipy import stats


In [33]:
p_subject_stays = [.7, .8, .9]
subject_ct = [10, 20]
pct_leave = [0, .05, .5]

In [34]:
# Build one row per (subject count, stay probability, fraction leaving)
# combination, keyed by a running row index.
row_list = {}
i = 0

for s in subject_ct:
    for p in p_subject_stays:
        for pct in pct_leave:
            row_list[i] = {
                'subject_ct': s,
                'p_subject_stays': p,
                'pct_leave': pct,
            }
            i += 1

In [35]:
df = pd.DataFrame(row_list).T

In [36]:
#df['tgt_remain'] = int(df.subject_ct * (1 - df.pct_leave))
df

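|   | p_subject_stays | pct_leave | subject_ct |
|---|-----------------|-----------|------------|
| 0 | 0.7 | 0.0 | 10.0 |
| 1 | 0.7 | 0.05 | 10.0 |
| 2 | 0.7 | 0.5 | 10.0 |
| 3 | 0.8 | 0.0 | 10.0 |
| 4 | 0.8 | 0.05 | 10.0 |
| 5 | 0.8 | 0.5 | 10.0 |
| 6 | 0.9 | 0.0 | 10.0 |
| 7 | 0.9 | 0.05 | 10.0 |
| 8 | 0.9 | 0.5 | 10.0 |
| 9 | 0.7 | 0.0 | 20.0 |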


In [42]:
for n in subject_ct:
    for p in p_subject_stays:
        dist = stats.binom(n, p)  # number of subjects who stay
        print('\n--- p =', p, 'n =', n)
        # sf(k) is P(X > k), so sf(n / 2) excludes the case where exactly half stay
        print('   p(more than half stay) =', dist.sf(n / 2))
        print('   p(exactly half stay) =', dist.pmf(n / 2))
        # adding the two gives P(X >= n / 2), the "at least half" case
        print('   p(half or more stay) =', dist.sf(n / 2) + dist.pmf(n / 2))
        print('   p(exactly one leaves) =', dist.pmf(n - 1))
        print('   p(nobody leaves) =', dist.pmf(n))


--- p = 0.7 n = 10
   p(more than half stay) = 0.8497316674
   p(exactly half stay) = 0.10291934520000011
   p(half or more stay) = 0.9526510126000001
   p(exactly one leaves) = 0.12106082100000007
   p(nobody leaves) = 0.02824752489999998

--- p = 0.8 n = 10
   p(more than half stay) = 0.9672065024000001
   p(exactly half stay) = 0.02642411520000004
   p(half or more stay) = 0.9936306176
   p(exactly one leaves) = 0.26843545600000035
   p(nobody leaves) = 0.10737418240000005

--- p = 0.9 n = 10
   p(more than half stay) = 0.9983650626
   p(exactly half stay) = 0.0014880347999999988
   p(half or more stay) = 0.9998530974000001
   p(exactly one leaves) = 0.38742048900000037
   p(nobody leaves) = 0.34867844010000004

--- p = 0.7 n = 20
   p(more than half stay) = 0.9520381026686565
   p(exactly half stay) = 0.030817080900085014
   p(half or more stay) = 0.9828551835687416
   p(exactly one leaves) = 0.006839337111223874
   p(nobody leaves) = 0.0007979226629761189
--- p = 0.8 n = 20
   ... (output truncated)

In [38]:
def calc_probs(n, p):
    dist = stats.binom(n, p)  # number of subjects who stay
    half = int(np.ceil(n / 2))
    return {
        'n subjects': n,
        'p_subject_stays': p,
        # sf(k) is P(X > k), so P(X >= half) is sf(half - 1)
        'p_half_or_more_stay': dist.sf(half - 1),
        # "only 1 person leaves" means exactly n - 1 subjects stay
        'p_exactly_one_leaves': dist.pmf(n - 1),
        'p_all_stay': dist.pmf(n)
    }

In [39]:
pd.DataFrame(
    [calc_probs(n, p) for n, p in it.product(subject_ct, p_subject_stays)],
    columns=['n subjects',
        'p_subject_stays',
        'p_half_or_more_stay',
        'p_exactly_one_leaves',
        'p_all_stay']
)

|   | n subjects | p_subject_stays | p_half_or_more_stay | p_exactly_one_leaves | p_all_stay |
|---|------------|-----------------|---------------------|----------------------|------------|
| 0 | 10 | 0.7 | 0.952651 | 0.121061 | 0.028248 |
| 1 | 10 | 0.8 | 0.993631 | 0.268435 | 0.107374 |
| 2 | 10 | 0.9 | 0.999853 | 0.387420 | 0.348678 |
| 3 | 20 | 0.7 | 0.982855 | 0.006839 | 0.000798 |
| 4 | 20 | 0.8 | 0.999437 | 0.057646 | 0.011529 |
| 5 | 20 | 0.9 | 0.999999 | 0.270170 | 0.121577 |
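
As a sanity check that does not depend on `scipy`, the three requested probabilities can also be computed straight from the binomial formula. The sketch below does this for one scenario (10 subjects, 0.7 chance of staying). The helper names `binom_pmf` and `at_least` are made up for this illustration, and "at least half" is summed term by term so there is no ambiguity about whether the boundary value is included.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p), from the closed-form formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def at_least(k, n, p):
    """P(X >= k): sum the pmf over k, k + 1, ..., n."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


# One scenario: 10 subjects, each with a 0.7 chance of staying.
n, p = 10, 0.7
half = (n + 1) // 2  # smallest whole number of subjects that is at least n / 2

p_half_or_more_stay = at_least(half, n, p)
p_exactly_one_leaves = binom_pmf(n - 1, n, p)
p_all_stay = binom_pmf(n, n, p)

print(round(p_half_or_more_stay, 6),
      round(p_exactly_one_leaves, 6),
      round(p_all_stay, 6))
```

Changing `n` and `p` to any of the other combinations gives a quick way to spot-check individual results.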
