## When is resampling an appropriate technique?

- **Small sample sizes**
    - When I don't have enough data for traditional statistical assumptions
    - Example: **Testing a new trading strategy with limited historical data**
    - Problem: I only have 30 data points but I need to estimate confidence intervals
    - Solution: Bootstrap

- **Unknown or complex distributions**
    - When the data doesn't follow distributions that statistical tests assume
    - Example: **Comparing fund performance across different market regimes**
    - Problem: Financial returns are often skewed, have fat tails, or multiple peaks
    - Solution: Permutation tests work regardless of distribution shape, provided the observations are exchangeable under the null hypothesis

- **Dependency structures**
    - When data points are not independent
    - Problem: Stock prices are autocorrelated
    - Solution: Block bootstrap preserves time series structure

- **Model validation**
    - Testing if observed patterns could occur by chance
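    - Problem: Is this trading signal actually predictive, or just an artifact of data mining?
    - Solution: Permutation test: shuffle the signal against the returns and check whether the observed performance stands out from the shuffled results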

- **Robust inference**
    - When I want results that do not depend on distributional assumptions
    - Problem: Outliers or fat tails could invalidate traditional tests
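    - Solution: Resampling needs far fewer distributional assumptions, although it still assumes the sample is representative and resampled in a way that respects its structure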

- **Extreme risk events (tail risk)**
    - When modelling rare events
    - Example: Estimating 99.9% VaR when I only have 5 years of daily data
    - Problem: Not enough historical data
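    - Solution: Bootstrap the observed extremes to gauge how uncertain the tail estimate is (resampling reuses existing observations, so it cannot create losses larger than those already seen)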

- **Portfolio risk aggregation**
    - When combining risks from multiple sources with complex dependencies
    - Problem: Formulas for portfolio risk become intractable with many assets
    - Solution: Monte Carlo resampling
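
The three resampling solutions named most often in the list above (bootstrap, permutation test, block bootstrap) can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names `bootstrap_ci`, `permutation_test`, and `moving_block_bootstrap` are hypothetical, the data are synthetic, and the block length of 10 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def bootstrap_ci(x, stat=np.mean, n_boot=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for stat(x)."""
    x = np.asarray(x)
    # Draw n_boot resamples with replacement, each the size of the original
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_stats = np.apply_along_axis(stat, 1, x[idx])
    return np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])


def permutation_test(a, b, n_perm=5000):
    """Two-sided permutation p-value for a difference in means."""
    a, b = np.asarray(a), np.asarray(b)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffle group labels by permuting the pooled sample
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        count += abs(diff) >= abs(observed)
    # Add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)


def moving_block_bootstrap(x, block_len):
    """One resampled series built from contiguous blocks of the original."""
    x = np.asarray(x)
    n = len(x)
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [x[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]  # trim to the original length


# Synthetic fat-tailed "returns": 30 points, as in the small-sample example
returns = rng.standard_t(df=3, size=30) * 0.01
low, high = bootstrap_ci(returns)
print(f"95% bootstrap CI for the mean: [{low:.4f}, {high:.4f}]")

# Two synthetic groups standing in for returns under two market regimes
p_value = permutation_test(returns[:15], returns[15:])
print(f"permutation p-value: {p_value:.3f}")

# Resample a series while keeping runs of 10 consecutive points together
resampled = moving_block_bootstrap(np.arange(100), block_len=10)
print(resampled[:20])
```

The percentile interval is the simplest bootstrap interval; with very small or heavily skewed samples a bias-corrected variant may behave better. The block length trades off how much dependence is preserved against how many distinct blocks are available.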

**When not to use resampling**
- When I have a large sample of well-behaved data, where traditional methods are already reliable and cheaper to compute
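
To illustrate why resampling adds little in that situation, the sketch below compares a classical normal-approximation interval with a percentile bootstrap interval on a large synthetic sample. The sample size, the distribution parameters, and the 2,000 resamples are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "large, well-behaved" sample: 5,000 draws from a normal distribution
x = rng.normal(loc=0.001, scale=0.01, size=5000)

# Classical 95% interval for the mean (normal approximation)
se = x.std(ddof=1) / np.sqrt(len(x))
classical = (x.mean() - 1.96 * se, x.mean() + 1.96 * se)

# Percentile bootstrap interval for the same statistic
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean() for _ in range(2000)
])
bootstrap = tuple(np.quantile(boot_means, [0.025, 0.975]))

print("classical:", classical)
print("bootstrap:", bootstrap)
```

The two intervals should nearly coincide here, so the extra computation of the bootstrap buys very little.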

In [4]:
from pybaseball import batting_stats
import pandas as pd

In [9]:
# FanGraphs batting stats, one row per player-season, 2015 through 2025
data = batting_stats(2015, 2025)

In [7]:
data

Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,1B,...,maxEV,HardHit,HardHit%,Events,CStr%,CSW%,xBA,xSLG,xwOBA,L-WAR
1,15640,2024,Aaron Judge,NYY,32,158,559,704,180,85,...,117.5,239,0.611,391,0.146,0.267,0.310,0.724,0.480,11.7
3,15640,2022,Aaron Judge,NYY,30,157,570,696,177,87,...,118.4,247,0.611,404,0.169,0.287,0.305,0.706,0.463,11.4
50,25764,2024,Bobby Witt Jr.,KCR,24,161,636,709,211,123,...,116.9,260,0.483,538,0.138,0.236,0.315,0.576,0.407,10.5
6,13611,2018,Mookie Betts,BOS,25,136,520,614,180,96,...,110.6,218,0.502,434,0.220,0.270,0.309,0.607,0.431,10.4
7,10155,2018,Mike Trout,LAA,26,140,471,608,147,80,...,118.0,162,0.460,352,0.201,0.261,0.294,0.603,0.435,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,10815,2023,Jurickson Profar,- - -,30,125,459,521,111,73,...,108.8,119,0.317,375,0.151,0.236,0.248,0.349,0.306,-1.9
1462,393,2018,Victor Martinez,DET,39,133,467,508,117,87,...,107.6,130,0.306,425,0.148,0.212,0.266,0.406,0.313,-1.7
1390,6876,2017,Mark Trumbo,BAL,31,146,559,603,131,86,...,118.5,155,0.377,411,0.154,0.291,0.235,0.394,0.297,-1.8
1440,1177,2017,Albert Pujols,LAA,37,149,593,636,143,103,...,112.2,197,0.391,504,0.180,0.268,0.262,0.445,0.317,-1.9


In [12]:
# Top 50 player-seasons by home runs (a player can appear more than once)
hr = data[['Name', 'HR', 'WAR']].sort_values(by='HR', ascending=False).head(50)

In [25]:
import pandas as pd

# Create the birth location data
birth_data = {
    'Name': [
        'Aaron Judge', 'Giancarlo Stanton', 'Cal Raleigh', 'Shohei Ohtani', 
        'Matt Olson', 'Kyle Schwarber', 'Pete Alonso', 'Eugenio Suarez',
        'Jorge Soler', 'Vladimir Guerrero Jr.', 'Khris Davis', 'Salvador Perez',
        'Chris Davis', 'Mark Trumbo', 'Cody Bellinger', 'Mike Trout',
        'Marcus Semien', 'Nelson Cruz', 'Junior Caminero', 'Christian Yelich',
        'Anthony Santander', 'J.D. Martinez', 'Nolan Arenado', 'Juan Soto',
        'Bryce Harper', 'Edwin Encarnacion', 'Brian Dozier', 'Fernando Tatis Jr.',
        'Ronald Acuna Jr.', 'Alex Bregman', 'Josh Donaldson', 'Chris Carter'
    ],
    'Birth_City': [
        'Linden, California', 'Panorama City, California', 'Cullowhee, North Carolina', 'Oshu, Iwate',
        'Atlanta, Georgia', 'Middletown, Ohio', 'Tampa, Florida', 'Puerto Ordaz',
        'Havana', 'Montreal, Quebec', 'Lakewood, California', 'Valencia',
        'Longview, Texas', 'Anaheim, California', 'Scottsdale, Arizona', 'Vineland, New Jersey',
        'San Francisco, California', 'Las Matas de Santa Cruz', 'Santo Domingo', 'Thousand Oaks, California',
        'Maracay', 'Miami, Florida', 'Newport Beach, California', 'Santo Domingo',
        'Las Vegas, Nevada', 'La Romana', 'Fulton, Mississippi', 'San Diego, California',
        'La Guaira', 'Albuquerque, New Mexico', 'Pensacola, Florida', 'Redwood City, California'
    ],
    'Birth_Country': [
        'USA', 'USA', 'USA', 'Japan',
        'USA', 'USA', 'USA', 'Venezuela',
        'Cuba', 'Canada', 'USA', 'Venezuela',
        'USA', 'USA', 'USA', 'USA',
        'USA', 'Dominican Republic', 'Dominican Republic', 'USA',
        'Venezuela', 'USA', 'USA', 'Dominican Republic',
        'USA', 'Dominican Republic', 'USA', 'USA',
        'Venezuela', 'USA', 'USA', 'USA'
    ]
}

# Create DataFrame
df_birth_locations = pd.DataFrame(birth_data)

# Display the DataFrame
print("Birth Location DataFrame:")
print(df_birth_locations)
print(f"\nDataFrame shape: {df_birth_locations.shape}")

# Show some basic statistics
print(f"\nCountry distribution:")
print(df_birth_locations['Birth_Country'].value_counts())

# Show first few rows
print(f"\nFirst 10 rows:")
print(df_birth_locations.head(10))

# Save to CSV if needed
# df_birth_locations.to_csv('player_birth_locations.csv', index=False)
# print("Data saved to 'player_birth_locations.csv'")

Birth Location DataFrame:
                     Name                 Birth_City       Birth_Country
0             Aaron Judge         Linden, California                 USA
1       Giancarlo Stanton  Panorama City, California                 USA
2             Cal Raleigh  Cullowhee, North Carolina                 USA
3           Shohei Ohtani                Oshu, Iwate               Japan
4              Matt Olson           Atlanta, Georgia                 USA
5          Kyle Schwarber           Middletown, Ohio                 USA
6             Pete Alonso             Tampa, Florida                 USA
7          Eugenio Suarez               Puerto Ordaz           Venezuela
8             Jorge Soler                     Havana                Cuba
9   Vladimir Guerrero Jr.           Montreal, Quebec              Canada
10            Khris Davis       Lakewood, California                 USA
11         Salvador Perez                   Valencia           Venezuela
12            Chris Davis

In [29]:
# Left join keeps every row of hr; names missing from df_birth_locations get NaN birth fields
merged_df = hr.merge(df_birth_locations, on='Name', how='left')
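
Because the join key is a free-text name, it can be worth checking which rows failed to match. The sketch below uses small stand-in frames with hypothetical players (`hr_demo` and `birth_demo` are not the notebook's data) together with the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values, not the notebook's real data)
hr_demo = pd.DataFrame({
    'Name': ['Player A', 'Player A', 'Player B', 'Player C'],
    'HR': [50, 45, 40, 38],
})
birth_demo = pd.DataFrame({
    'Name': ['Player A', 'Player B'],
    'Birth_Country': ['USA', 'Japan'],
})

checked = hr_demo.merge(
    birth_demo,
    on='Name',
    how='left',
    validate='many_to_one',  # raises if a name is duplicated in birth_demo
    indicator=True,          # adds a _merge column describing each row's match
)

# Names present in the left frame that found no birth record
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Name'].unique()
print(unmatched)
```

The same two options could be passed to the notebook's own merge to list any player-seasons whose names have no entry in `df_birth_locations`.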

In [33]:
merged_df.to_csv('/Users/lucasben/Desktop/School/Schulich/Data Visualization/Data/hr_with_birth_locations.csv', index=False)