In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## The birthday problem 

Given a number of people, **n**, in a room, what is the probability that at least two share a birthday?

Assumptions for the simplest case (**discuss**):

a. 365 days in a year.

b. All days are equally likely.

c. Subjects have independent birthdays.

<br>

We showed that, for $2\leq n\leq 365$, the probability is equal to 

$$P(n)=1-\frac{365}{365}\times\frac{364}{365}\times\frac{363}{365}\times \cdots\times \frac{365-n+1}{365}=
1-\frac{365\times364\times \cdots\times (365-n+1)}{365^n}$$

We first calculate these probabilities for a range of n's. Which formula should we use?

In [None]:
# n=50
# how many digits?
365**50
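One way to explore the question above is to evaluate the single-fraction form directly. The sketch below is an illustration, not part of the original lecture code: it counts the digits of `365**50` and evaluates the formula with exact Python integers. The helper name `birthday_prob_exact` is hypothetical, and `math.perm` assumes Python 3.8 or later.

```python
from fractions import Fraction
from math import perm

# 365**50 is an exact Python integer; count its decimal digits
n_digits = len(str(365**50))
print(n_digits)

def birthday_prob_exact(n):
    """Hypothetical helper: single-fraction form, evaluated with exact integers."""
    # perm(365, n) = 365 * 364 * ... * (365 - n + 1)
    return 1 - Fraction(perm(365, n), 365**n)

p50 = float(birthday_prob_exact(50))
print(p50)

# With floating-point numbers the denominator eventually overflows:
# float(365**130) raises OverflowError, which is one reason to prefer
# the product of small fractions when working with floats.
```

Exact integers avoid the overflow, but the numbers involved grow quickly, so the product form is usually the more practical choice.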

In [None]:
# recall the arange function in numpy - we'll use it below
np.arange(10)

In [None]:
# a function that calculates the probability for 2<=n<=365
def birthday_prob(n):
    """Calculates the probability that at least 2 people out of n have the same birthday"""
    prob=1
    for i in np.arange(n):
        prob = prob * (365-i)/365
    return 1-prob


In [None]:
birthday_prob(50)

In [None]:
# Construct a data frame with the probabilities for a range of n's
number_people=np.arange(2,101,1)
probs= np.array([]) # an empty array
for i in number_people: probs= np.append(probs,birthday_prob(i))

Birthday_df=pd.DataFrame(
    {"Number of people":number_people,
     "Probability":probs})
Birthday_df
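The same probabilities can also be obtained in one pass, without calling `birthday_prob` once per value of n. The following is a minimal sketch, assuming a running product (`np.cumprod`) over the factors of the formula; the helper name `birthday_probs_upto` is hypothetical.

```python
import numpy as np

def birthday_probs_upto(n_max):
    """Hypothetical helper: P(at least one shared birthday) for n = 1, ..., n_max."""
    # factors (365 - i) / 365 for i = 0, ..., n_max - 1
    factors = (365 - np.arange(n_max)) / 365
    # the running product gives P(no shared birthday) for each n
    return 1 - np.cumprod(factors)

probs_vec = birthday_probs_upto(100)
# index n - 1 holds the probability for n people
print(probs_vec[22])  # n = 23
```

Because each entry reuses the product computed for the previous n, this does far less work than recomputing the full product for every row of the table.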

In [None]:
Birthday_df.plot("Number of people","Probability")

In [None]:
# Restricting the range for better visualization
Birthday_df[Birthday_df["Number of people"]<60].plot("Number of people","Probability")

Are the above probabilities surprising? Can you provide an intuition for them?
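One possible intuition (a sketch, not the only answer) is to count pairs rather than people: n people form n(n-1)/2 pairs, and each pair shares a birthday with probability 1/365. Treating the pairs as if they were independent is an approximation, and the helper names below are hypothetical.

```python
import numpy as np

def pair_approximation(n):
    """Hypothetical helper: approximate P(n) by treating all pairs as independent."""
    n_pairs = n * (n - 1) / 2           # number of distinct pairs of people
    return 1 - np.exp(-n_pairs / 365)   # each pair matches with probability 1/365

def exact_prob(n):
    """Exact probability from the product formula."""
    return 1 - np.prod((365 - np.arange(n)) / 365)

for n in [10, 23, 50]:
    n_pairs = n * (n - 1) // 2
    print(n, n_pairs, round(pair_approximation(n), 4), round(exact_prob(n), 4))
```

With 23 people there are already 253 pairs, which helps explain why the probability reaches one half so early.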


## A computational (simulation based) solution to the birthday problem

**Simulation goal: use the computer to mimic a physical experiment.**

Steps in a simulation:
- What to simulate;
- Simulate one instance;
- Decide on the number of repetitions;
- Code and summarize the results of simulations.

Two important issues:
- Simulations give us an estimate (approximation) of the probability we are interested in; the more repetitions, the better the approximation.
- The number of repetitions is important; strategies for selecting it will be discussed in more detail later in the course when we talk about the binomial distribution.
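To make the four steps concrete without giving away the birthday simulation, here is a minimal sketch on a different, simpler experiment: estimating the probability that two fair dice sum to 7 (the exact value is 1/6). The experiment, the seed, and the number of repetitions are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the run is reproducible

# 1. What to simulate: rolling two fair dice and checking whether they sum to 7.
# 2. One instance:
one_roll = rng.integers(1, 7, size=2)   # two integers from 1 to 6
print(one_roll, one_roll.sum() == 7)

# 3. Number of repetitions (an arbitrary choice for this sketch):
nsim = 100_000

# 4. Code and summarize: repeat the experiment and report the proportion of successes.
rolls = rng.integers(1, 7, size=(nsim, 2))
estimate = np.mean(rolls.sum(axis=1) == 7)
print(estimate, 1 / 6)
```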

### Before starting to program (or reading/running the code below), think about a plan for how to simulate 


In [None]:
# one simulation
birthdays=np.arange(1,366,1)
n=50
one_run=np.random.choice(birthdays,n)
one_run

In [None]:
# Are there duplicates?
# numpy has a built-in sorting function
np.sort(one_run)
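Reading a sorted array by eye works for one run, but a simulation needs a yes/no answer. The sketch below shows two possible checks, assuming a freshly drawn `one_run` from a seeded generator (the seed is an arbitrary choice for reproducibility).

```python
import numpy as np

rng = np.random.default_rng(1)           # seeded for reproducibility
one_run = rng.integers(1, 366, size=50)  # 50 birthdays drawn from days 1..365

# duplicates exist exactly when there are fewer distinct values than people
has_duplicates = len(np.unique(one_run)) < len(one_run)
print(has_duplicates)

# equivalently, after sorting, duplicates appear as equal neighbors
sorted_run = np.sort(one_run)
has_equal_neighbors = np.any(sorted_run[1:] == sorted_run[:-1])
print(has_equal_neighbors)
```

Both checks only detect whether some birthday is shared; they do not say how many people share it, which is why a count of the most frequent day is needed next.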

I need a function that will give me the number of occurrences of the most frequent day. There are many ways to do it:
-  use the `Counter` function we saw in Lecture 2
-  numpy has a useful function called `bincount`
-  write your own function - how would you do it?

We will use `bincount`.

In [None]:
# a reminder on how to use Counter (might skip in class)
from collections import Counter
Counter([2,2,2,1,7])

In [None]:
Counter(one_run)

In [None]:
Counter([2,2,2,1,7]).most_common(1)

In [None]:
Counter([2,2,2,1,7]).most_common(1)[0][1]

In [None]:
# the function bincount provides counts for all integers from 0 to the largest
np.bincount([2,2,2,1,7])

In [None]:
np.max(np.bincount([2,2,2,1,7]))
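For the third option in the list above, writing your own function, one possible approach is to keep a dictionary of running counts. This is a sketch of one way to do it, and the name `max_count` is hypothetical.

```python
def max_count(values):
    """Hypothetical helper: number of occurrences of the most frequent value."""
    counts = {}
    for v in values:
        # start at 0 the first time a value is seen, then add one
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values())

print(max_count([2, 2, 2, 1, 7]))  # 3, matching the Counter and bincount results
```

Unlike `np.bincount`, which expects non-negative integers, this version works for any hashable values, at the cost of a slower Python-level loop.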

In [None]:
# we now create a function that will simulate nsim simulations of the 
# birthday problem for n subjects
def birthday_sim(n,nsim):
    outcomes = np.array([])
    for i in np.arange(nsim):
        outcomes = np.append(outcomes, np.max(np.bincount(np.random.choice(birthdays,n))))
    return outcomes


In [None]:
birthday_sim(23,100)

In [None]:
# calculate the probability; how many simulations should we run?
n=23
nsim=1000
sum(birthday_sim(n,nsim)>1)/nsim
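One way to approach the question of how many simulations to run is to compare estimates against the exact value for n=23, where the formula is available. The sketch below assumes a vectorized variant of the simulation (sorting each simulated room and comparing neighbors); the helper names and the seed are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded for reproducibility

def exact_prob(n):
    """Exact probability from the product formula."""
    return 1 - np.prod((365 - np.arange(n)) / 365)

def estimate_prob(n, nsim):
    """Hypothetical vectorized estimate of P(at least two share a birthday)."""
    # each row is one simulated room of n people, sorted by birthday
    days = np.sort(rng.integers(1, 366, size=(nsim, n)), axis=1)
    # a row has a shared birthday when two sorted neighbors are equal
    shared = np.any(days[:, 1:] == days[:, :-1], axis=1)
    return shared.mean()

n = 23
for nsim in [100, 1_000, 10_000, 100_000]:
    est = estimate_prob(n, nsim)
    print(nsim, round(est, 4), round(abs(est - exact_prob(n)), 4))

est_large = estimate_prob(n, 100_000)
```

The printed errors tend to shrink as `nsim` grows, though any single run can be lucky or unlucky.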


# Classroom discussion

We introduced a simple example where we can solve a problem analytically/mathematically and computationally (using simulations).

**Questions:**
- Are the assumptions made in the simulations we ran the same as, or different from, the ones we made in the derivation?
- Give other examples of problems (in science, finance etc.) that can be approached using both mathematics and computation. If you do not know any, give examples of problems where simulations or mathematics play an important role.
- What are the advantages and disadvantages of using mathematical approaches? What are the advantages and disadvantages of using computational approaches (simulations)?

<br>
<br>


## Mathematical derivation versus computational (simulation-based) estimation


**The triplet birthday problem:** Given a number of people in a room, what is the probability that at least three share a birthday?

The assumptions are the same as before:

a. 365 days in a year.

b. All days are equally likely.

c. Subjects have independent birthdays.

Can you derive an exact formula for this probability?


In [None]:
# We can answer the "triplets" question just as easily
nsim=1000
number_people=np.arange(3,50,1)
probs3= np.array([])
for i in number_people: probs3= np.append(probs3,sum(birthday_sim(i,nsim)>2)/nsim)

Birthday3_df=pd.DataFrame(
    {"Number of people":number_people,
     "Triplet Probability":probs3})
Birthday3_df


In [None]:
Birthday3_df.plot("Number of people","Triplet Probability")

If you want to obtain a smoother curve (reflecting a more accurate estimation of probabilities) you need to increase the number of simulations. Play with the code and see how the number of simulations affects the smoothness of the function!
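As a starting point for that experiment, the sketch below puts a number on "smoothness" by summing the absolute second differences of the estimated curve. Both the roughness measure and the vectorized helper `triplet_estimates` are assumptions made for this illustration, not part of the lecture code.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded for reproducibility

def triplet_estimates(number_people, nsim):
    """Hypothetical helper: estimated triplet probability for each group size."""
    estimates = []
    for n in number_people:
        days = np.sort(rng.integers(1, 366, size=(nsim, n)), axis=1)
        # after sorting, three equal birthdays occupy positions j, j+1, j+2
        has_triplet = np.any(days[:, 2:] == days[:, :-2], axis=1)
        estimates.append(has_triplet.mean())
    return np.array(estimates)

def roughness(curve):
    """Sum of absolute second differences: larger values mean a more jagged curve."""
    return np.abs(np.diff(curve, n=2)).sum()

number_people = np.arange(3, 50)
rough_small = roughness(triplet_estimates(number_people, 200))
rough_large = roughness(triplet_estimates(number_people, 5_000))
print(rough_small, rough_large)
```

A lower roughness for the larger `nsim` corresponds to the visibly smoother curve in the plot.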

In [None]:
n=87
nsim=10000
sum(birthday_sim(n,nsim)>2)/nsim

## Can we calculate these probabilities more accurately?

Note that not all days of the year are equally likely to be birthdays. Born in September?

http://thedailyviz.com/2016/09/17/how-common-is-your-birthday-dailyviz/

The visualization in the link above is based on a dataset from the FiveThirtyEight GitHub repository:

https://github.com/fivethirtyeight/data/tree/master/births

Note that in the following dataset, the variable for day of week is coded 1 for Monday and 7 for Sunday.


In [None]:
birth_data=pd.read_csv("US_births_2000-2014_SSA.csv")
print(birth_data.shape)
birth_data.head(10)

**Some interesting observations from the data**

The pandas environment has commands that allow you to group rows by unique values in a column. We used it before and we will use it below. You will learn more about it in Lecture 10 (groups, joins, database operations).

In [None]:
# not surprising (and not relevant for our calculation): the day of the week matters
birth_data.groupby('day_of_week').sum()[['births']]

In [None]:
# another irrelevant (but fun) data grouping: is 13th avoided? 
birth_data.groupby('date_of_month').sum()[['births']]

In [None]:
counts_df=birth_data.groupby(['month','date_of_month']).sum()[['births']]
counts_df.head(10)

In [None]:
# the default histogram from pandas is not very informative
counts_df.hist("births");

In [None]:
counts_df.hist("births",bins = np.arange(116000, 195000, 2000));

In [None]:
# the inferred probabilities of each day of the year 
day_probs=counts_df.births/sum(counts_df.births)
# January probabilities
day_probs[:31]

In [None]:
# we can draw with specified probabilities
np.random.choice(["H","T"],10,p=[0.8,0.2])
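A quick sanity check on weighted draws is to repeat them many times and compare the observed share with the requested probability. The sketch below uses NumPy's `default_rng` generator instead of `np.random.choice`; the seed and sample size are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # seeded for reproducibility

p = [0.8, 0.2]  # must have one entry per outcome and sum to 1
draws = rng.choice(["H", "T"], size=10_000, p=p)

# the observed share of heads should be close to 0.8
share_heads = np.mean(draws == "H")
print(share_heads)
```

The same two requirements apply to the birthday probabilities: one entry per possible day, and a total of 1.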

In [None]:
# add February 29 - the number of possible birthdays is now 366
birthdays2=np.arange(1,367,1)

# we now create a function that will simulate nsim simulations of the 
# birthday problem for n subjects with days weighted by their probabilities
def birthday_sim2(n,nsim,pr):
    outcomes = np.array([])
    for i in np.arange(nsim):
        outcomes = np.append(outcomes, np.max(np.bincount(np.random.choice(birthdays2,n,p=pr))))
    return outcomes

In [None]:
# calculate the probability for n=23 
# before running the code below: do you think it is smaller or bigger than 0.5073 (the exact value we calculated above)?
n=23
nsim=100000
sum(birthday_sim2(n,nsim,day_probs)>1)/nsim
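To isolate the effect of unequal day probabilities from the effect of adding February 29, the sketch below keeps 365 days and uses deliberately exaggerated, made-up weights (not the SSA data). For a fixed number of days, equal probabilities give the smallest chance of a match, so the skewed estimate should come out above the uniform value. The helper name and the weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)  # seeded for reproducibility

# Made-up weights for illustration only (NOT the SSA data):
# the first 182 days are twice as likely as the remaining 183.
weights = np.where(np.arange(365) < 182, 2.0, 1.0)
skewed_probs = weights / weights.sum()

def estimate_match_prob(n, nsim, pr):
    """Hypothetical helper: estimated P(at least two of n share a birthday)."""
    hits = 0
    for _ in range(nsim):
        days = rng.choice(365, size=n, p=pr)
        # a match occurred if there are fewer distinct days than people
        hits += len(np.unique(days)) < n
    return hits / nsim

uniform_exact = 1 - np.prod((365 - np.arange(23)) / 365)
skewed_estimate = estimate_match_prob(23, 20_000, skewed_probs)
print(round(uniform_exact, 4), skewed_estimate)
```

Real birth data is far less skewed than these weights, so the shift seen with the actual probabilities is expected to be much smaller.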

### Some conclusions:
- In some situations math offers the best path (for example, lottery winning probabilities); 
    - for example, in Mega Millions players pick six numbers from two separate pools of numbers - five different numbers from 1 to 70, and one number from 1 to 25. Probability of winning?
- In some situations, exact calculations are very difficult (triplet birthdays);
- Often, it is easy to modify simulations to account for changed assumptions (for example, days that are not equally likely, or leap years);
- More accurate calculations or simulations do not always lead to different results - but we might not know that before doing them.

Final note on the triplet problem - this is a reference for the exact calculation (for math majors):
https://www.sciencedirect.com/science/article/pii/S0378375804002721


In [None]:
# in case we need factorials for board calculations
from math import factorial
factorial(70)
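For the Mega Millions question in the conclusions, a short worked calculation follows. It assumes that "winning" means matching all six numbers and uses the pool sizes stated above (five different numbers from 1 to 70, one number from 1 to 25); `math.comb` assumes Python 3.8 or later.

```python
from math import comb, factorial

# number of ways to choose 5 different numbers from 70, order ignored
white_ball_choices = comb(70, 5)
# the same count written with factorials, as one might on the board
assert white_ball_choices == factorial(70) // (factorial(5) * factorial(65))

# one additional number from 1 to 25
total_tickets = white_ball_choices * 25
print(total_tickets)      # 302575350
print(1 / total_tickets)  # probability that a single ticket matches all six numbers
```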