<p>How would you answer the following questions:</p>
<ul>
    <li>In a classroom of 23 students, what is the probability that at least two of them share the same birthday (day and month)?</li>
    <li>What about a classroom of 40 students?</li>
</ul>

<p>The answers are roughly 50% and 89%, respectively.</p>
<p>How is that possible? The result is counter-intuitive but true, and it is known as the birthday problem (or birthday paradox). It only feels like a "paradox" because our intuition handles compounding probabilities poorly.
We expect the probability to grow linearly and only consider the comparisons we are personally involved in (both faulty assumptions). But the comparison is made pairwise across the whole group, and the number of pairs grows quickly: a class of 23 students already contains 253 distinct pairs.</p>
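
<p>As a rough check on the figures above, the sketch below counts the pairs in a group and computes the exact probability of at least one shared birthday. It assumes birthdays are independent and uniformly spread over 365 days (no leap day), which real data only approximates. The function names <code>pairs</code> and <code>p_shared_birthday</code> are illustrative and not part of the notebook.</p>

```python
from math import comb

def pairs(n):
    """Number of distinct pairs among n people."""
    return comb(n, 2)

def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming independent birthdays, uniform over `days` days."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

for n in (23, 40):
    print(n, pairs(n), round(p_shared_birthday(n), 3))
# 23 253 0.507
# 40 780 0.891
```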

<p>Let's use the birthdates of Rio 2016 Olympics’ athletes to draw random samples and check these probabilities empirically for different sample sizes.</p>

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## 1) Load the dataset. Convert the column dob (date of birth) to a proper datetime type.

In [2]:
athletes = pd.read_csv('athletes.csv')
athletes.head()

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze
0,736041664,A Jesus Garcia,ESP,male,10/17/69,1.72,64.0,athletics,0,0,0
1,532037425,A Lam Shin,KOR,female,9/23/86,1.68,56.0,fencing,0,0,0
2,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1
3,521041435,Aaron Cook,MDA,male,1/2/91,1.83,80.0,taekwondo,0,0,0
4,33922579,Aaron Gate,NZL,male,11/26/90,1.81,71.0,cycling,0,0,0


In [4]:
athletes.dtypes

id               int64
name            object
nationality     object
sex             object
dob             object
height         float64
weight         float64
sport           object
gold             int64
silver           int64
bronze           int64
dtype: object

In [6]:
# errors='coerce' turns unparseable dates into NaT, so the null filter below can drop them
athletes['birthday'] = pd.to_datetime(athletes['dob'], format='%m/%d/%y', errors='coerce')

In [7]:
athletes.dtypes

id                      int64
name                   object
nationality            object
sex                    object
dob                    object
height                float64
weight                float64
sport                  object
gold                    int64
silver                  int64
bronze                  int64
birthday       datetime64[ns]
dtype: object

In [8]:
athletes = athletes[~athletes['birthday'].isnull()]
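
One caveat about the two-digit year in `%m/%d/%y`: the sketch below uses the standard library's `datetime.strptime`, which documents that years 69 to 99 map to 1969 to 1999 and years 00 to 68 map to 2000 to 2068. Whether pandas resolves every year identically is an assumption worth checking on the real data. The helper `parse_dob` and the second date string are hypothetical. Because only the month and day are compared in this exercise, a wrong century would not change the results.

```python
from datetime import datetime

def parse_dob(text):
    """Parse an m/d/yy string with the same format string used above."""
    return datetime.strptime(text, '%m/%d/%y')

# First value is taken from the dataset preview; the second is a made-up example.
for dob in ('10/17/69', '5/20/54'):
    parsed = parse_dob(dob)
    print(dob, '->', parsed.year, parsed.month, parsed.day)
# 10/17/69 -> 1969 10 17
# 5/20/54 -> 2054 5 20
```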

## 2) Build a function called sample() that receives a number representing the sample size and returns the sample dataset.

In [9]:
def sample(sample_size):
    return athletes.sample(sample_size).reset_index()

In [10]:
x = sample(23)
x

Unnamed: 0,index,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze,birthday
0,61,841446548,Abdulrazzaq Murad,QAT,male,6/29/90,1.86,77.0,handball,0,0,0,1990-06-29
1,2446,83570795,David Andersen,AUS,male,6/23/80,2.1,102.0,basketball,0,0,0,1980-06-23
2,1181,718103063,Asena Rokomarama,FIJ,female,5/2/96,1.6,58.0,rugby sevens,0,0,0,1996-05-02
3,8676,784201731,Phuoc Hoang,VIE,male,3/24/93,1.8,75.0,aquatics,0,0,0,1993-03-24
4,3047,717137601,Emma Larsson,SWE,female,11/15/98,1.47,40.0,gymnastics,0,0,0,1998-11-15
5,7484,430176512,Miguel Tabuena,PHI,male,10/13/94,1.74,66.0,golf,0,0,0,1994-10-13
6,1298,316544316,Baboloki Thebe,BOT,male,3/18/97,,,athletics,0,0,0,1997-03-18
7,1044,620488948,Anton Kosmac,SLO,male,12/14/76,1.83,66.0,athletics,0,0,0,1976-12-14
8,5587,250885863,Kayra Sayit,TUR,female,2/13/88,1.65,90.0,judo,0,0,0,1988-02-13
9,1291,682854309,Aziz Ouhadi,MAR,male,7/24/84,1.68,72.0,athletics,0,0,0,1984-07-24


## 3) Build a function called isSameMonthDay() that receives two birth dates and checks whether they have the same day and month.

In [11]:
def isSameMonthDay(date1, date2):
    # Compare (month, day) tuples; the year is deliberately ignored
    return (date1.month, date1.day) == (date2.month, date2.day)

In [18]:
d1 = x.birthday[x.id == 907808928][21]
d2 = x.birthday[x.id == 800526632][22]
isSameMonthDay(d1, d2)

False
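
The check above depends on which athletes happen to be in the random sample. A self-contained way to exercise the same comparison, sketched here with the standard library's `date` type instead of pandas timestamps, is shown below. The first two dates come from the sample output above; the third is a hypothetical date chosen to produce a match.

```python
from datetime import date

def isSameMonthDay(date1, date2):
    # Same comparison as in the notebook: month and day only, year ignored.
    return (date1.month, date1.day) == (date2.month, date2.day)

murad = date(1990, 6, 29)      # Abdulrazzaq Murad, from the sample output
andersen = date(1980, 6, 23)   # David Andersen, from the sample output
made_up = date(1984, 6, 29)    # hypothetical date, not from the dataset

print(isSameMonthDay(murad, andersen))  # False
print(isSameMonthDay(murad, made_up))   # True
```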

## 4) Build a function called oneRound() that receives a sample dataset, compares every possible pair, and returns True if at least one pair of athletes shares the same day and month of birth.

In [19]:
def oneRound(df_sample):
    birthdays = df_sample['birthday'].tolist()
    # Compare each unordered pair once and stop at the first match
    for i in range(len(birthdays)):
        for j in range(i + 1, len(birthdays)):
            if isSameMonthDay(birthdays[i], birthdays[j]):
                return True
    return False

In [20]:
oneRound(x)

False

In [21]:
x = sample(60)
oneRound(x)

True
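
The exercise asks for an explicit pairwise comparison, which needs n(n-1)/2 checks per sample. As an alternative sketch (not the approach the exercise prescribes), the same yes/no answer can be obtained in a single pass by remembering which (month, day) keys have already been seen. The name `has_shared_birthday` is illustrative, and the example uses plain `date` objects rather than the athletes DataFrame.

```python
from datetime import date

def has_shared_birthday(birthdays):
    """Return True if any two dates share the same month and day."""
    seen = set()
    for d in birthdays:
        key = (d.month, d.day)
        if key in seen:
            return True  # this month/day was already encountered
        seen.add(key)
    return False

group_a = [date(1990, 6, 29), date(1980, 6, 23), date(1996, 5, 2)]
group_b = group_a + [date(1975, 5, 2)]  # shares 2 May with the third date

print(has_shared_birthday(group_a))  # False
print(has_shared_birthday(group_b))  # True
```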

## 5) Build a function called trial() that receives the number of trials and the sample size, repeats the sampling (sample) and comparison (oneRound and isSameMonthDay) for each trial, and prints the percentage of trials in which a shared birthdate is found.

In [23]:
def trial(number_of_trials, sample_size):
    count = 0
    for i in range(number_of_trials):
        df = sample(sample_size)
        if oneRound(df):
            count += 1
    print('with sample size of {} in {} trials, the probability of same birthdates is {}%.'.format(sample_size, number_of_trials, count / number_of_trials * 100))
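
For comparison, the same experiment can be sketched without the dataset by drawing synthetic birthdays. This version assumes birthdays are uniform over 365 days, which the athletes' data need not satisfy, and it returns the fraction instead of printing it. The name `simulate` and the `seed` parameter are illustrative additions that make runs repeatable.

```python
import random

def simulate(number_of_trials, sample_size, days=365, seed=0):
    """Fraction of trials in which at least two synthetic birthdays coincide."""
    rng = random.Random(seed)  # dedicated generator so results are repeatable
    hits = 0
    for _ in range(number_of_trials):
        birthdays = [rng.randrange(days) for _ in range(sample_size)]
        # A set drops duplicates, so a shorter set means a shared birthday.
        if len(set(birthdays)) < sample_size:
            hits += 1
    return hits / number_of_trials

for size in (23, 30, 40, 50, 60):
    print(size, simulate(10_000, size))
```

With 10,000 trials instead of 100, the estimates should vary far less from run to run than the notebook's results.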

## 6) Run the trial function with 100 trials and:

### a) sample size of 23. You should expect to find shared birthdates about 50% of the time.

In [24]:
trial(100, 23)

with sample size of 23 in 100 trials, the probability of same birthdates is 44.0%.


### b) sample size of 30. You should expect to find shared birthdates about 70% of the time.

In [25]:
trial(100, 30)

with sample size of 30 in 100 trials, the probability of same birthdates is 67.0%.


### c) sample size of 40. You should expect to find shared birthdates about 89% of the time.

In [26]:
trial(100, 40)

with sample size of 40 in 100 trials, the probability of same birthdates is 86.0%.


### d) sample size of 50. You should expect to find shared birthdates about 97% of the time.

In [27]:
trial(100, 50)

with sample size of 50 in 100 trials, the probability of same birthdates is 100.0%.


### e) sample size of 60. You should expect to find shared birthdates about 99% of the time.

In [29]:
trial(100, 60)

with sample size of 60 in 100 trials, the probability of same birthdates is 100.0%.
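
The observed percentages differ from the expected ones by a few points, which is plausible with only 100 trials per run. The sketch below takes the expected shares from the headings and the observed shares from the outputs above, then expresses each gap in units of the binomial standard error sqrt(p(1-p)/n). Treating trials as independent and reading the gap as a z-score is a rough normal approximation, and it is weakest for shares close to 100%.

```python
from math import sqrt

NUMBER_OF_TRIALS = 100

# (sample size, expected share from the headings, observed share from the runs)
results = [
    (23, 0.50, 0.44),
    (30, 0.70, 0.67),
    (40, 0.89, 0.86),
    (50, 0.97, 1.00),
    (60, 0.99, 1.00),
]

def standard_error(p, n):
    """Standard error of an observed proportion over n independent trials."""
    return sqrt(p * (1 - p) / n)

def z_score(expected, observed, n):
    """Gap between observed and expected, in standard-error units."""
    return (observed - expected) / standard_error(expected, n)

for size, expected, observed in results:
    se = standard_error(expected, NUMBER_OF_TRIALS)
    z = z_score(expected, observed, NUMBER_OF_TRIALS)
    print(f'n={size}: expected {expected:.2f}, observed {observed:.2f}, '
          f'standard error {se:.3f}, z = {z:+.2f}')
```

Every gap is within two standard errors, so none of the runs contradicts the expected probabilities. Raising the number of trials would narrow the spread.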
