## Cell Number 1: *Setup Environment*
Import the essential libraries for data exploration:
- **pandas** and **numpy** for data manipulation
- **pathlib** for file path handling
- **pyarrow** for efficient Parquet file reading (pandas uses it as the Parquet engine, so it is not imported directly)

Because the data is now stored as Parquet, column data types are preserved on load, so no manual dtype configuration is needed.
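
To illustrate why dtype preservation matters, here is a minimal sketch of what a CSV round trip does to the kinds of dtypes used in this notebook. The frame below is toy data: the column names mirror the dataset, but the values and the in-memory buffer are assumptions made for illustration only.

```python
import io

import pandas as pd

# Toy frame with dtypes similar to the cleaned dataset (values are illustrative only)
toy = pd.DataFrame({
    "track_id": pd.Categorical(["BKF", "WYO", "WYO"]),
    "race_date": pd.to_datetime(["2025-08-31", "2025-08-09", "2025-07-12"]),
    "speed_figure": pd.array([60, 65, 72], dtype="Int64"),
})

# Round trip through CSV: the file stores text only, not dtype information
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print("Original dtypes:")
print(toy.dtypes)
print("\nAfter CSV round trip:")
print(from_csv.dtypes)
```

With CSV, each of these columns would need `dtype=` or `parse_dates=` arguments on every read to get its original type back. Parquet stores the schema alongside the data, which is what removes that step here.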

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

# Display config for better notebook output
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 0)     # Auto-summarize wide frames
pd.set_option("display.width", 120)

print("Libraries imported successfully!")
print("Ready for data exploration!")


Libraries imported successfully!
Ready for data exploration!


---

## Cell Number 2: *Load Cleaned Data*
Load the cleaned horse racing data from `data/processed/cleaned_data.parquet`:
- **Automatic dtype preservation** - all data types from cleaning are maintained
- **Fast loading** - Parquet is optimized for quick read operations
- **Memory efficient** - compressed format reduces memory usage

In [2]:
# Load cleaned data from Parquet - all dtypes automatically preserved!
data_path = Path("../data/processed/cleaned_data.parquet")
horses_df = pd.read_parquet(data_path)

print("=" * 80)
print("CLEANED DATA LOADED FROM PARQUET")
print("=" * 80)

# Calculate memory usage
total_memory_bytes = horses_df.memory_usage(deep=True).sum()
total_memory_mb = total_memory_bytes / (1024 * 1024)

print("\nData loaded successfully!")
print(f"Shape: {horses_df.shape}")
print(f"Memory Usage: {total_memory_mb:.2f} MB")

print("\nColumn names:")
print(horses_df.columns.tolist())

print("\nFirst few rows:")
display(horses_df.head())

print("\nData types (automatically preserved from Parquet):")
display(horses_df.dtypes)

CLEANED DATA LOADED FROM PARQUET

Data loaded successfully!
Shape: (60752, 31)
Memory Usage: 20.01 MB

Column names:
['registration_number', 'horse_name', 'track_id', 'race_date', 'distance', 'race_number', 'race_type', 'course_type', 'country', 'purse', 'field_size', 'length_behind_at_poc_1', 'length_behind_at_poc_2', 'length_behind_at_poc_3', 'length_behind_at_poc_4', 'length_behind_at_poc_5', 'length_behind_at_finish', 'post_position', 'position_at_point_of_call_1', 'position_at_point_of_call_2', 'position_at_point_of_call_3', 'position_at_point_of_call_4', 'official_position', 'jockey_id', 'trainer_id', 'earnings', 'equipment', 'final_odds', 'favorite_indicator', 'speed_figure', 'is_dnf']

First few rows:


| | registration_number | horse_name | track_id | race_date | distance | race_number | race_type | course_type | country | purse | field_size | length_behind_at_poc_1 | length_behind_at_poc_2 | length_behind_at_poc_3 | length_behind_at_poc_4 | length_behind_at_poc_5 | length_behind_at_finish | post_position | position_at_point_of_call_1 | position_at_point_of_call_2 | position_at_point_of_call_3 | position_at_point_of_call_4 | official_position | jockey_id | trainer_id | earnings | equipment | final_odds | favorite_indicator | speed_figure | is_dnf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13008939 | Restless Rambler | BKF | 2025-08-31 | 4.32 | 5 | STK | D | USA | 4500.0 | 7 | 10 | 0 | 0 | 0 | 60 | 175 | 3 | 2 | 0 | 0 | 0 | 2 | 171618 | 246029 | 1125.0 | | 3.0 | N | 60 | False |
| 1 | 13008939 | Restless Rambler | WYO | 2025-08-09 | 4.5 | 6 | CLM | D | USA | 12000.0 | 8 | 0 | 0 | 0 | 0 | 70 | 860 | 6 | 1 | 0 | 0 | 0 | 7 | 160633 | 153736 | 0.0 | F | 10.8 | N | 65 | False |
| 2 | 13008939 | Restless Rambler | WYO | 2025-07-12 | 4.5 | 8 | CLM | D | USA | 11500.0 | 7 | 0 | 0 | 0 | 0 | 100 | 400 | 7 | 1 | 0 | 0 | 0 | 2 | 160633 | 153736 | 2300.0 | F | 5.5 | N | 72 | False |
| 3 | 13008939 | Restless Rambler | WYO | 2025-06-29 | 4.5 | 9 | SOC | D | USA | 10500.0 | 10 | 260 | 0 | 0 | 0 | 340 | 1225 | 7 | 9 | 0 | 0 | 0 | 7 | 18028 | 153736 | 0.0 | F | 11.9 | N | 64 | False |
| 4 | 13010216 | Libertarian | FAR | 2025-07-25 | 7.0 | 7 | SST | D | USA | 15000.0 | 6 | 150 | 100 | 0 | 0 | 50 | 160 | 4 | 3 | 2 | 0 | 0 | 3 | 111515 | 39754 | 1500.0 | B | 10.9 | N | 76 | False |



Data types (automatically preserved from Parquet):


registration_number    string[python]
horse_name             string[python]
track_id                     category
race_date              datetime64[ns]
distance                      Float32
                            ...      
equipment                    category
final_odds                    Float32
favorite_indicator     string[python]
speed_figure                    Int64
is_dnf                        boolean
Length: 31, dtype: object

---

## Cell Number 3: *Data Quality Overview*
Quick overview of the cleaned dataset including:
- **Data shape and memory usage**
- **Data types verification**
- **Missing values summary**
- **Basic statistics**

In [3]:
print("=" * 80)
print("DATA QUALITY OVERVIEW")
print("=" * 80)

# Basic info
print(f"\nDataset Overview:")
print(f"Shape: {horses_df.shape}")
print(f"Memory Usage: {total_memory_mb:.2f} MB")

# Data types summary
print(f"\nData Types Summary:")
dtype_counts = horses_df.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"  {dtype}: {count} columns")

# Missing values check
print("\nMissing Values Check:")
missing_counts = horses_df.isnull().sum()  # computed once, reused for count and percentage
missing_data = pd.DataFrame({
    'Column': horses_df.columns,
    'Missing_Count': missing_counts,
    'Missing_Percentage': (missing_counts / len(horses_df) * 100).round(2).astype(str) + "%"
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_data) > 0:
    print("Columns with missing values:")
    display(missing_data)
else:
    print("✓ No missing values found!")

# Quick stats for numeric columns
print(f"\nNumeric Columns Overview:")
numeric_cols = horses_df.select_dtypes(include=[np.number]).columns
print(f"Number of numeric columns: {len(numeric_cols)}")
if len(numeric_cols) > 0:
    display(horses_df[numeric_cols].describe())

print("\n" + "=" * 80)
print("DATA READY FOR EXPLORATION!")
print("=" * 80)

DATA QUALITY OVERVIEW

Dataset Overview:
Shape: (60752, 31)
Memory Usage: 20.01 MB

Data Types Summary:
  int64: 11 columns
  Int64: 6 columns
  Float32: 4 columns
  string: 3 columns
  category: 1 columns
  datetime64[ns]: 1 columns
  category: 1 columns
  category: 1 columns
  category: 1 columns
  category: 1 columns
  boolean: 1 columns

Missing Values Check:
Columns with missing values:


| | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| equipment | equipment | 17621 | 29.0% |
| favorite_indicator | favorite_indicator | 1 | 0.0% |



Numeric Columns Overview:
Number of numeric columns: 21


| | distance | race_number | purse | field_size | length_behind_at_poc_1 | length_behind_at_poc_2 | length_behind_at_poc_3 | length_behind_at_poc_4 | length_behind_at_poc_5 | length_behind_at_finish | post_position | position_at_point_of_call_1 | position_at_point_of_call_2 | position_at_point_of_call_3 | position_at_point_of_call_4 | official_position | jockey_id | trainer_id | earnings | final_odds | speed_figure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 | 60752.0 |
| mean | 6.797485 | 5.144555 | 39751.71875 | 7.433303 | 352.1729 | 376.168373 | 147.363972 | 4.968495 | 498.087388 | 735.803299 | 4.214018 | 4.173114 | 4.047373 | 1.528937 | 0.047636 | 4.089116 | 194459.60877 | 384990.742379 | 5345.627441 | 13.339365 | 69.586186 |
| std | 1.415301 | 2.685579 | 114524.929688 | 1.878026 | 480.004107 | 543.680412 | 402.179211 | 103.281908 | 703.189008 | 955.888846 | 2.38062 | 2.379081 | 2.435613 | 2.52141 | 0.540657 | 2.341313 | 403573.047932 | 460865.68996 | 23384.679688 | 17.834431 | 69.573563 |
| min | 2.0 | 1.0 | 2250.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 34.0 | 5.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6.0 | 3.0 | 13000.0 | 6.0 | 100.0 | 50.0 | 0.0 | 0.0 | 100.0 | 150.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 124659.0 | 20416.0 | 425.0 | 3.0 | 53.0 |
| 50% | 6.5 | 5.0 | 21700.0 | 7.0 | 260.0 | 260.0 | 0.0 | 0.0 | 360.0 | 510.0 | 4.0 | 4.0 | 4.0 | 0.0 | 0.0 | 4.0 | 150983.0 | 225666.0 | 1327.0 | 6.6 | 67.0 |
| 75% | 8.0 | 7.0 | 37000.0 | 9.0 | 510.0 | 550.0 | 150.0 | 0.0 | 700.0 | 1005.0 | 6.0 | 6.0 | 6.0 | 3.0 | 0.0 | 6.0 | 168313.0 | 951499.0 | 4800.0 | 15.9 | 79.0 |
| max | 25.0 | 15.0 | 5000000.0 | 19.0 | 9999.0 | 9999.0 | 9999.0 | 9999.0 | 9999.0 | 12025.0 | 19.0 | 18.0 | 18.0 | 18.0 | 15.0 | 14.0 | 3229075.0 | 3229510.0 | 3100000.0 | 273.399994 | 999.0 |



DATA READY FOR EXPLORATION!
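
In the "Data Types Summary" output above, `category` is listed five times with one column each. A likely explanation is that each categorical column carries its own set of categories, so pandas treats their dtypes as distinct when counting. The sketch below shows one way to group them by dtype name instead; it uses a toy frame with made-up values, not the racing data.

```python
import pandas as pd

# Two categorical columns with different category sets, plus one integer column
toy = pd.DataFrame({
    "track_id": pd.Categorical(["BKF", "WYO", "WYO"]),
    "race_type": pd.Categorical(["STK", "CLM", "CLM"]),
    "field_size": [7, 8, 7],
})

# Converting each dtype to its string name makes all categorical columns compare equal
dtype_counts = toy.dtypes.astype(str).value_counts()
for dtype_name, count in dtype_counts.items():
    print(f"  {dtype_name}: {count} columns")
```

Applied to `horses_df`, the same `astype(str)` step should collapse the repeated `category` lines into a single entry.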


---

In [4]:
# Cell 3.5: Critical Data Warning - DNF Races
print("=" * 80)
print("CRITICAL DATA WARNING: DNF (Did Not Finish) Races")
print("=" * 80)

# Check for speed_figure = 999 (DNF indicator)
dnf_races = (horses_df['speed_figure'] == 999).sum()
total_races = len(horses_df)

print(f"\nDNF Race Statistics:")
print(f"Total races with speed_figure = 999: {dnf_races:,}")
print(f"Percentage of DNF races: {dnf_races/total_races*100:.2f}%")

if dnf_races > 0:
    print("\nWARNING: Speed figure of 999 indicates DNF (Did Not Finish)")
    print("These values MUST be excluded when calculating average speeds!")
    print("All downstream analyses must handle this appropriately.")
    
    # Show some examples
    print("\nSample DNF races:")
    dnf_sample = horses_df[horses_df['speed_figure'] == 999][
        ['horse_name', 'race_date', 'official_position', 'speed_figure', 'field_size']
    ].head(5)
    display(dnf_sample)


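CRITICAL DATA WARNING: DNF (Did Not Finish) Races

DNF Race Statistics:
Total races with speed_figure = 999: 311
Percentage of DNF races: 0.51%

WARNING: Speed figure of 999 indicates DNF (Did Not Finish)
These values MUST be excluded when calculating average speeds!
All downstream analyses must handle this appropriately.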

Sample DNF races:


| | horse_name | race_date | official_position | speed_figure | field_size |
|---|---|---|---|---|---|
| 16 | Inagoodway | 2025-07-12 | 6 | 999 | 6 |
| 407 | Salvator Mundi | 2025-08-01 | 6 | 999 | 6 |
| 1032 | Too Crowded | 2025-07-22 | 8 | 999 | 9 |
| 1305 | Rain | 2025-07-13 | 5 | 999 | 5 |
| 2289 | Evie's Prince | 2025-07-16 | 4 | 999 | 9 |
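
One way to honor this warning downstream is to mask the sentinel before aggregating. The sketch below uses made-up speed figures and assumes, as stated above, that 999 marks a DNF. `DNF_SENTINEL` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

DNF_SENTINEL = 999  # illustrative constant; 999 marks a DNF per the warning above

# Toy speed figures using the same nullable Int64 dtype as the dataset
speed = pd.Series([60, 65, 999, 72], dtype="Int64", name="speed_figure")

# Replace the sentinel with a missing value so aggregations skip it
speed_clean = speed.mask(speed == DNF_SENTINEL)

naive_mean = speed.mean()        # inflated by the 999 sentinel
clean_mean = speed_clean.mean()  # missing values are skipped by default

print(f"Mean including sentinel: {naive_mean:.2f}")
print(f"Mean excluding sentinel: {clean_mean:.2f}")
```

The `speed_figure` statistics in the numeric overview (maximum of 999.0) were computed without this masking, so they include the sentinel rows.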


---

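## Cell Number 4: *Foreign Race Horses vs. Domestic*

Flag horses whose `registration_number` starts with `F` as foreign-registered, then compare foreign and domestic horses by race entries (rows) and by unique horses.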

In [5]:
# Cell 4: Flag foreign-registered horses
print("="*80)
print("FOREIGN REGISTRATION ANALYSIS")
print("="*80)

# Total race entries (one row per horse per race, not unique horses)
total_count = len(horses_df)
print(f"Total horse race entries: {total_count:,}")

# Check for F-prefix registrations
horses_df['is_foreign'] = horses_df['registration_number'].str.startswith('F', na=False)
foreign_count = horses_df['is_foreign'].sum()
print(f"Foreign horse count: {foreign_count:,}")

# Check for Non-F-prefix registrations (Domestic Race Horses)
horses_df['is_domestic'] = ~horses_df['is_foreign']
domestic_count = horses_df['is_domestic'].sum()

# Count unique foreign horses (not race appearances)
unique_foreign_horses = horses_df[horses_df['is_foreign']]['registration_number'].nunique()

# Count unique domestic horses (not race appearances)
unique_domestic_horses = horses_df[horses_df['is_domestic']]['registration_number'].nunique()

print(f"\nForeign horse race entries: {foreign_count:,} ({foreign_count/total_count*100:.2f}%)")
print(f"Unique foreign horses: {unique_foreign_horses:,}")
print(f"Domestic horse race entries: {domestic_count:,} ({domestic_count/total_count*100:.2f}%)")
print(f"Unique domestic horses: {unique_domestic_horses:,}")

# Sample of unique foreign horses (no duplicates)
if foreign_count > 0:
    print("\nSample unique foreign-registered horses:")
    foreign_horses = (
        horses_df.loc[horses_df['is_foreign'], ['registration_number', 'horse_name', 'country']]
        .drop_duplicates(subset=['registration_number'])
    )
    display(foreign_horses.head(10))
    
    if len(foreign_horses) > 10:
        print(f"\n... and {len(foreign_horses) - 10} more unique foreign horses")

FOREIGN REGISTRATION ANALYSIS
Total horse race entries: 60,752
Foreign horse count: 772

Foreign horse race entries: 772 (1.27%)
Unique foreign horses: 193
Domestic horse race entries: 59,980 (98.73%)
Unique domestic horses: 15,010

Sample unique foreign-registered horses:


| | registration_number | horse_name | country |
|---|---|---|---|
| 59980 | F0044866 | Move Over (GB) | USA |
| 59984 | F0044885 | Good Governance (GB) | USA |
| 59988 | F0045525 | Lord Wimborne (IRE) | USA |
| 59992 | F0045864 | Silky Warrior (IRE) | USA |
| 59996 | F0046068 | Wind of Change (BRZ) | USA |
| 60000 | F0046207 | My Sea Cottage (IRE) | USA |
| 60004 | F0046244 | Philipsburg (IRE) | USA |
| 60008 | F0046282 | Big Everest (GB) | USA |
| 60012 | F0046296 | Science (IRE) | USA |
| 60016 | F0046297 | Assembly Point (BRZ) | USA |



... and 183 more unique foreign horses
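
The distinction between race entries and unique horses can also be computed in one grouped pass. This is a minimal sketch on toy data: the registration numbers are reused from the samples above purely for illustration, and the `origin` label and `summary` frame are hypothetical names not used elsewhere in this notebook.

```python
import pandas as pd

# Toy entries: one row per race start
toy = pd.DataFrame({
    "registration_number": [
        "F0044866", "F0044866", "F0044885",
        "13008939", "13008939", "13008939", "13010216",
    ],
})

# Same rule as the cell above: an "F" prefix marks a foreign registration
is_foreign = toy["registration_number"].str.startswith("F", na=False)
origin = is_foreign.map({True: "foreign", False: "domestic"}).rename("origin")

# One pass gives both race entries (rows) and unique horses per group
summary = toy.groupby(origin)["registration_number"].agg(
    entries="size",
    unique_horses="nunique",
)
print(summary)
```

Note that with `na=False`, a row with a missing registration number would be counted as domestic. That does not affect this dataset, where `registration_number` has no missing values.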


---


## Next Steps

With the initial exploration complete, including the data quality overview, the DNF speed-figure check, and the foreign/domestic split, proceed to the next analysis notebook:

**→ Continue to: `notebooks/past_performance/logistic_analysis_1.ipynb`**

That notebook analyzes past performance metrics in more depth and builds logistic regression models to predict race outcomes.