## Birth-month advantage in sport
Team members: Jimmy Wong, Oleksii Lavrenin, Nana Kweku Edusah

Github: https://github.com/olavrenin-data-scientist/Project2_Wong_Edusah_Lavrenin

Primary Dataset: https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-01-09/readme.md




## Data structure
1. NHL Player Birth Records (From NHL Rosters)
Observations: Over 7,000 rows, each representing a unique player-season entry across NHL team rosters.
 
Key Variables:
 
player_id – A unique ID for each NHL player; it lets us remove duplicate records and identify distinct players.
 
birth_date – Full date of birth, from which we derive the birth month used in the analysis.
 
birth_country – Country of birth.
 
position_type – Player’s general position.
 
team_code – NHL team abbreviation, which lets us track team-level or regional effects.
 
Most useful variables:
 
birth_date - used to extract the birth month from the date of birth.

birth_country - used to select only Canadian-born players.

player_id - used to count distinct players (without duplicate records).

position_type - used to understand the player's role on the team.

 


2. Canadian Birth Records (Canada, 1991–2022)
Observations: 384 rows — one row for each month from 1991 to 2022 (32 years × 12 months)
 
Purpose: Provide a baseline birth-month distribution for the Canadian population.
 
Key Variables:
 
year - calendar year of birth.

month - calendar month of birth.

births – number of births recorded in that month.
 
Most useful variables:
 
month and births (aggregated) - give the total number of births per month across all years.

Together, these datasets let us compare the birth-month distribution of Canadian-born NHL players with that of the general Canadian population and detect any mismatch.
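To make the role of these variables concrete, here is a minimal, dependency-free sketch of how they could be combined: `player_id` to drop repeated player-season rows, `birth_country` to keep Canadian-born players, `birth_date` to get the month, and `births` summed by `month` for the baseline. All rows below are made-up placeholders, and the helper names `canadian_birth_months` and `month_shares` are hypothetical, not part of the project code.

```python
from collections import Counter
from datetime import date

# Hypothetical player-season rows: (player_id, birth_date, birth_country).
roster_rows = [
    (1, date(1990, 1, 15), "CAN"),
    (1, date(1990, 1, 15), "CAN"),  # same player listed for another season
    (2, date(1992, 12, 3), "CAN"),
    (3, date(1991, 2, 20), "USA"),
    (4, date(1993, 1, 7), "CAN"),
]


def canadian_birth_months(rows):
    """Return {player_id: birth month} with one entry per Canadian-born player."""
    months = {}
    for player_id, birth_date, country in rows:
        if country == "CAN" and player_id not in months:
            months[player_id] = birth_date.month
    return months


def month_shares(counts):
    """Convert {month: count} into {month: percent} over all 12 months."""
    total = sum(counts.values())
    return {m: 100 * counts.get(m, 0) / total for m in range(1, 13)}


months = canadian_birth_months(roster_rows)
player_shares = month_shares(Counter(months.values()))

# Hypothetical (year, month, births) rows standing in for the Canadian records.
birth_rows = [(1991, 1, 300), (1991, 12, 100), (1992, 1, 200), (1992, 12, 200)]
baseline_counts = Counter()
for _year, month, births in birth_rows:
    baseline_counts[month] += births  # aggregate births per month across years
baseline_shares = month_shares(baseline_counts)
```

Comparing `player_shares` with `baseline_shares` month by month is the mismatch check described above; with real data the same steps would run on the full roster and birth tables.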


## Introduction
In sport, athletes born earlier in the year often have an advantage in strength and skill over their younger peers. This phenomenon is known as the birth-month advantage, or relative age effect (RAE). In youth ice hockey leagues, for example, the effect is very notable when children born in January compete against children born in late December. With an 11–12-month age difference, the older children tend to be stronger and more skilled, and this early edge can affect long-term success. This project analyzes the birth-month advantage among Canadian-born NHL players. The pattern matters because the bias can influence how coaches select players and can lead to unequal training opportunities: coaches who believe in the effect may give preference to players born in the earlier months.


## Questions
1. Is there a disproportion between the number of NHL players born in January and in December?
2. How does the birth-month distribution of NHL players compare with that of the general Canadian population?
3. Is the effect consistent for players born before and after 2000?
4. Does the birth-month effect vary by position?
5. How does the effect differ between goaltenders and defensemen?
6. Do some teams have more or fewer players born in the early months than other teams?
7. Does this phenomenon exist only in hockey, or in other sports too? Example: compare with an additional dataset for football (NFL).
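Questions 1 and 2 come down to comparing observed birth-month counts with an expected distribution. One common way to quantify such a gap is a chi-square goodness-of-fit statistic; the sketch below is an illustration of that technique, not the method used in this project. The counts are invented, the uniform expected shares are an assumption (the Canadian population shares would be the natural baseline), and `chi_square_gof` is a hypothetical helper name.

```python
def chi_square_gof(observed, expected_shares):
    """Chi-square goodness-of-fit statistic.

    observed        -- list of observed counts, one per category
    expected_shares -- list of expected proportions (any positive scale)
    """
    if len(observed) != len(expected_shares):
        raise ValueError("observed and expected_shares must have equal length")
    total = sum(observed)
    share_total = sum(expected_shares)
    statistic = 0.0
    for obs, share in zip(observed, expected_shares):
        expected = total * share / share_total  # expected count for this category
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Invented counts of players born in January..December (NOT project data).
observed_counts = [120, 115, 110, 105, 100, 100, 95, 90, 90, 85, 80, 70]

# Assumption: every month is equally likely under the null hypothesis.
uniform_shares = [1] * 12

statistic = chi_square_gof(observed_counts, uniform_shares)

# Critical value of the chi-square distribution, 11 degrees of freedom, alpha = 0.05.
CRITICAL_DF11_ALPHA05 = 19.675
rejects_uniform = statistic > CRITICAL_DF11_ALPHA05
```

A statistic above the critical value suggests the observed months are unlikely under the chosen baseline. It does not by itself say which months drive the difference, so a per-month comparison is still needed.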

## Sanity checking

import pandas as pd
# Load datasets
canadian_births_df = pd.read_csv('data/canada_births_1991_2022.csv')
players_df = pd.read_csv('data/nhl_player_births.csv')
rosters_df = pd.read_csv('data/nhl_rosters.csv')
teams_df = pd.read_csv('data/nhl_teams.csv')

# Preprocess player birth data
players_df['birth_date'] = pd.to_datetime(players_df['birth_date'], errors='coerce')
players_df['birth_month'] = players_df['birth_date'].dt.month
players_df['birth_year'] = players_df['birth_date'].dt.year

# Filter Canadian-born players
canadian_players = players_df[players_df['birth_country'] == 'CAN'].copy()

# --- Question 1: January vs December disproportion ---
jan_pct_nhl = (canadian_players['birth_month'] == 1).mean() * 100
dec_pct_nhl = (canadian_players['birth_month'] == 12).mean() * 100

# --- Question 2: Birth month comparison to Canadian population ---
# Aggregate Canadian population birth data by month
canadian_births_by_month = canadian_births_df.groupby('month')['births'].sum()
canadian_births_pct = canadian_births_by_month / canadian_births_by_month.sum() * 100

nhl_birth_month_counts = canadian_players['birth_month'].value_counts().sort_index()
nhl_birth_month_pct = nhl_birth_month_counts / nhl_birth_month_counts.sum() * 100

# Align to full 12 months
nhl_birth_month_pct = nhl_birth_month_pct.reindex(range(1, 13), fill_value=0)
canadian_births_pct = canadian_births_pct.reindex(range(1, 13), fill_value=0)

# --- Question 3: Consistency over time (before and after 2000) ---
pre_2000 = canadian_players[canadian_players['birth_year'] < 2000]
post_2000 = canadian_players[canadian_players['birth_year'] >= 2000]

pre_2000_pct = pre_2000['birth_month'].value_counts(normalize=True).sort_index() * 100
post_2000_pct = post_2000['birth_month'].value_counts(normalize=True).sort_index() * 100
pre_2000_pct = pre_2000_pct.reindex(range(1, 13), fill_value=0)
post_2000_pct = post_2000_pct.reindex(range(1, 13), fill_value=0)

# Display comparison table
comparison_df = pd.DataFrame({
    'Month': range(1, 13),
    'NHL Players (%)': nhl_birth_month_pct.values,
    'Canada Population (%)': canadian_births_pct.values,
    'Pre-2000 NHL (%)': pre_2000_pct.values,
    'Post-2000 NHL (%)': post_2000_pct.values
})

comparison_df


Unnamed: 0,Month,NHL Players (%),Canada Population (%),Pre-2000 NHL (%),Post-2000 NHL (%)
0,1,9.930505,8.015265,9.767874,20.481928
1,2,9.747623,7.545096,9.693593,13.253012
2,3,9.381858,8.480224,9.340761,12.048193
3,4,9.619605,8.372814,9.637883,8.433735
4,5,9.528164,8.765969,9.526462,9.638554
5,6,8.467447,8.523528,8.486537,7.228916
6,7,7.991953,8.875414,8.022284,6.024096
7,8,7.187271,8.718995,7.22377,4.819277
8,9,7.699342,8.680416,7.706592,7.228916
9,10,7.022677,8.367415,7.056639,4.819277
