
# HORSE + RACE Feature Engineering Encyclopedia (Auto‑Generated)
Generated: 2025-11-10 23:40:40

This notebook compiles feature ideas for each remaining HORSE and RACE field you listed.
Each idea is a short markdown explanation followed by a separate executable code cell that writes its output columns to the shared `merged` DataFrame.

**Conventions**
- Assumes a pandas DataFrame named `merged` with the field names you provided.
- Uses safe conversions (`pd.to_numeric(..., errors="coerce")`, `pd.to_datetime(..., errors="coerce")`).
- Uses per‑race grouping via `race_id` wherever within‑race standardization is appropriate.
- No external plotting libs are used.
- Cells are idempotent where possible.


In [None]:

import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = None  # silence pandas' SettingWithCopyWarning

# Basic hygiene
if "race_date" in merged.columns:
    merged["race_date"] = pd.to_datetime(merged["race_date"], errors="coerce")


## HORSE Fields

### finish_position: Finishing position, 40 if horse didn't finish

#### Normalized rank within race
Convert finish_position to a percentile rank in (0, 1] within each race so positions are comparable across field sizes. Non-finishers (coded 40) rank last.

In [None]:
merged["finish_rank_pct"] = pd.to_numeric(merged["finish_position"], errors="coerce").groupby(merged["race_id"]).rank(pct=True)

#### Finished / Placed / Won flags
Binary targets for multi‑task modeling and evaluation convenience.

In [None]:
merged["finished_flag"] = (pd.to_numeric(merged["finish_position"], errors="coerce") < 40).astype(int)
merged["placed_flag"] = (pd.to_numeric(merged["finish_position"], errors="coerce") <= 3).astype(int)
merged["won_flag"] = (pd.to_numeric(merged["finish_position"], errors="coerce") == 1).astype(int)

#### Within‑race z‑score of finish
Standardize finish_position per race to remove race‑level scale effects.

In [None]:
def _z(s):
    m, sd = s.mean(), s.std()
    # sd > 0 is False for both zero and NaN spread (e.g. single-runner races)
    return (s - m) / sd if sd > 0 else pd.Series(0.0, index=s.index)

merged["finish_z"] = merged.groupby("race_id")["finish_position"].transform(lambda x: _z(pd.to_numeric(x, errors="coerce")))
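
The same within-race standardization recurs throughout this notebook, so a reusable helper may be worth having. The sketch below is one possible shape, not the notebook's own method: `within_group_z` and the toy frame are hypothetical names, and it assumes that groups with no usable spread (a single runner, or identical values) should score 0 while missing inputs stay NaN.

```python
import pandas as pd

def within_group_z(values: pd.Series, groups: pd.Series) -> pd.Series:
    """Z-score `values` inside each group; groups without spread get 0."""
    values = pd.to_numeric(values, errors="coerce")
    grouped = values.groupby(groups)
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample std (ddof=1); NaN for one-row groups
    z = (values - mean) / std.where(std > 0)
    # Rows with a real value but no usable spread get 0; missing inputs stay NaN
    return z.where(std > 0, 0.0).where(values.notna())

# Hypothetical toy data: race "A" has three runners, race "B" has one
toy = pd.DataFrame({
    "race_id": ["A", "A", "A", "B"],
    "finish_position": [1, 2, 3, 1],
})
toy["finish_z"] = within_group_z(toy["finish_position"], toy["race_id"])
print(toy)
```

Using `std > 0` as the guard handles zero and NaN spread in one comparison, because any comparison with NaN evaluates to False.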

#### Residual vs market (finish − expected rank)
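Compare the finishing position with the horse's market rank (1 = highest implied probability). A negative residual means the horse finished ahead of its market rank; a positive one means it underperformed.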

In [None]:
# Expected rank proxy: descending implied probability
if "implied_prob" in merged.columns:
    merged["prob_rank"] = merged.groupby("race_id")["implied_prob"].rank(ascending=False, method="average")
    merged["finish_residual_vs_prob"] = pd.to_numeric(merged["finish_position"], errors="coerce") - merged["prob_rank"]
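
To make the sign convention concrete, here is a small worked example on made-up numbers. The three-runner race below is hypothetical; only the column names follow the cell above.

```python
import pandas as pd

# Hypothetical race: the favourite finishes second, the second favourite wins
toy = pd.DataFrame({
    "race_id": [1, 1, 1],
    "implied_prob": [0.50, 0.30, 0.20],
    "finish_position": [2, 1, 3],
})
# Market rank: 1 = highest implied probability
toy["prob_rank"] = toy.groupby("race_id")["implied_prob"].rank(ascending=False, method="average")
# Negative = finished ahead of market rank, positive = behind it
toy["finish_residual_vs_prob"] = toy["finish_position"] - toy["prob_rank"]
print(toy)
```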

### positionL: Distance behind horse ahead at finish (lengths)

#### Sanitize numeric lengths
Coerce text encodings (e.g., 'nk', 'hd') to fractional lengths; unknowns become NaN.

In [None]:
_map = {"hd": 0.2, "nk": 0.3, "shd": 0.1, "snk": 0.25}
posL_raw = merged.get("positionL")
if posL_raw is not None:
    posL_num = pd.to_numeric(posL_raw, errors="coerce")
    # try mapping text codes
    mask_txt = posL_raw.astype(str).str.lower().isin(_map.keys())
    posL_num = posL_num.where(~mask_txt, posL_raw.astype(str).str.lower().map(_map))
    merged["positionL_num"] = posL_num

#### Cumulative lengths vs winner
Approximate total deficit to the winner by summing gaps.

In [None]:
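if "positionL_num" in merged.columns:
    # Sort by finish within each race so gaps accumulate from the winner back; results realign on the (unique) index
    by_finish = merged.assign(_fp=pd.to_numeric(merged["finish_position"], errors="coerce")).sort_values(["race_id","_fp"])
    merged["cum_lengths_deficit"] = by_finish.groupby("race_id")["positionL_num"].transform(lambda x: x.fillna(0).cumsum())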
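
The cumulative sum only equals the deficit to the winner when gaps are added in finishing order. A minimal sketch on a hypothetical, deliberately shuffled race illustrates this; it assumes the winner's gap is missing or zero and that the frame has a unique index.

```python
import pandas as pd

# Hypothetical race stored out of finishing order
toy = pd.DataFrame({
    "race_id": [7, 7, 7],
    "finish_position": [3, 1, 2],
    "positionL_num": [1.5, None, 0.5],  # gap to the horse directly ahead
})
ordered = toy.sort_values(["race_id", "finish_position"])
# transform keeps the original index, so the assignment realigns to toy's row order
toy["cum_lengths_deficit"] = (
    ordered.groupby("race_id")["positionL_num"]
    .transform(lambda gaps: gaps.fillna(0).cumsum())
)
print(toy)
```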

#### Within‑race z‑score of lengths
Standardize distance gaps for comparability across races.

In [None]:
if "positionL_num" in merged.columns:
    merged["positionL_z"] = merged.groupby("race_id")["positionL_num"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

### distance_behind: Distance behind the winner (lengths)

#### Numeric conversion & clipping
Ensure numeric and cap extreme outliers to robust levels.

In [None]:
merged["distance_behind_num"] = pd.to_numeric(merged["distance_behind"], errors="coerce")
q_hi = merged["distance_behind_num"].quantile(0.99)
merged["distance_behind_num"] = merged["distance_behind_num"].clip(lower=0, upper=q_hi)

#### Win/Place proximity flags
Binary indicators for finishing within 1, 2, or 5 lengths of the winner, useful as classification targets or form features.

In [None]:
for L in [1,2,5]:
    merged[f"within_{L}L"] = (merged["distance_behind_num"] <= L).astype(int)

### weight_st: Horse weight in stone

#### Convert stones to kilograms
Create a common scale for weight modeling and interaction features.

In [None]:
merged["weight_st_num"] = pd.to_numeric(merged["weight_st"], errors="coerce")
merged["weight_kg_st"] = merged["weight_st_num"] * 6.35029

### weight_lb: Horse weight in pounds

#### Convert pounds to kilograms
Consistent weight scale for analysis and model inputs.

In [None]:
merged["weight_lb_num"] = pd.to_numeric(merged["weight_lb"], errors="coerce")
merged["weight_kg_lb"] = merged["weight_lb_num"] * 0.453592
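
The two conversions above treat `weight_st` and `weight_lb` as independent full weights. If in your data they are instead the two parts of one stones-and-pounds figure (for example 9st 7lb), they need to be combined before converting. The sketch below covers that case only; `st_lb_to_kg` and `weight_kg_combined` are hypothetical names, and whether the assumption holds should be checked against `official_weight`.

```python
import pandas as pd

LB_PER_STONE = 14
KG_PER_LB = 0.45359237

def st_lb_to_kg(stone: pd.Series, pounds: pd.Series) -> pd.Series:
    """Combine a stones part and a pounds part into kilograms."""
    st = pd.to_numeric(stone, errors="coerce")
    # Assumption: a missing pounds part means "0 lb"; a missing stones part stays NaN
    lb = pd.to_numeric(pounds, errors="coerce").fillna(0)
    return (st * LB_PER_STONE + lb) * KG_PER_LB

# Hypothetical rows: 9st 7lb, 11st 2lb, and one with no stones value
toy = pd.DataFrame({"weight_st": ["9", "11", None], "weight_lb": [7, 2, 5]})
toy["weight_kg_combined"] = st_lb_to_kg(toy["weight_st"], toy["weight_lb"])
print(toy)
```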

### official_weight: Official weight in kg

#### Unified weight feature
Combine multiple sources into a single best‑effort `weight_kg`.

In [None]:
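cands = []
for c in ["official_weight","weight_kg_st","weight_kg_lb"]:
    if c in merged.columns:
        cands.append(pd.to_numeric(merged[c], errors="coerce"))
# bfill across columns leaves the first non-missing source (official, then stone, then pounds) in column 0
merged["weight_kg"] = pd.concat(cands, axis=1).bfill(axis=1).iloc[:,0] if cands else np.nan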

#### Within‑race z‑score of weight
Standardize weight relative to each race context.

In [None]:
if "weight_kg" in merged.columns:
    merged["z_weight"] = merged.groupby("race_id")["weight_kg"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

#### Weight × distance interaction
Heavier horses may underperform at longer distances.

In [None]:
if "weight_kg" in merged.columns and "race_distance" in merged.columns:
    merged["weight_distance_interaction"] = pd.to_numeric(merged["race_distance"], errors="coerce") * merged["weight_kg"]

### over_weight: Overweight code (extra carried)

#### Overweight numeric & penalty flag
Convert to numeric penalty and identify non‑zero cases.

In [None]:
merged["over_weight_num"] = pd.to_numeric(merged["over_weight"], errors="coerce")
merged["over_weight_flag"] = (merged["over_weight_num"].fillna(0) > 0).astype(int)

### out_handicap: Out of handicap indicator/amount

#### Out‑of‑handicap penalty
Numeric penalty and binary indicator for modeling handicap effects.

In [None]:
merged["out_handicap_num"] = pd.to_numeric(merged["out_handicap"], errors="coerce")
merged["out_handicap_flag"] = (merged["out_handicap_num"].fillna(0) > 0).astype(int)

### headgear: Headgear code

#### Clean categories + dummies
Extract common headgear types and one‑hot encode for models.

In [None]:
hg = merged["headgear"].astype(str).str.upper().str.strip()
# Letter codes are provider-specific; check this mapping against your data before relying on it
common = {"B":"blinkers","V":"visor","P":"pieces","C":"cheekpieces","H":"hood","T":"tongue_tie"}
for k,v in common.items():
    merged[f"headgear_{v}"] = hg.str.contains(k, regex=False, na=False).astype(int)

#### New headgear change flag
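Flag runs where the headgear differs from the horse's previous run; a change, especially first-time headgear, often shifts the performance distribution.

In [None]:
merged = merged.sort_values(["horse_id","race_date"])
hg_clean = merged["headgear"].fillna("")
hg_prev = hg_clean.groupby(merged["horse_id"]).shift(1)
# A horse's first recorded run has no previous value and is not counted as a change
merged["headgear_changed"] = (hg_prev.notna() & (hg_clean != hg_prev)).astype(int)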
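
A change flag alone does not distinguish headgear being added from headgear being swapped or removed. The sketch below separates the two on a hypothetical five-run history; `headgear_added` is an illustrative name, and it only looks one run back, so it marks "added after a run without headgear" and not strictly "never worn before".

```python
import pandas as pd

# Hypothetical history: horse 1 runs bare, then in "b" twice, then switches to "v"
toy = pd.DataFrame({
    "horse_id": [1, 1, 1, 1, 2],
    "race_date": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-01-15"]
    ),
    "headgear": [None, "b", "b", "v", "p"],
})
toy = toy.sort_values(["horse_id", "race_date"])
hg = toy["headgear"].fillna("").astype(str).str.strip()
prev = hg.groupby(toy["horse_id"]).shift(1)
has_history = prev.notna()  # False on each horse's first recorded run

toy["headgear_changed"] = (has_history & (hg != prev)).astype(int)
toy["headgear_added"] = (has_history & (hg != "") & (prev == "")).astype(int)
print(toy)
```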

### rpr_rating: RP Rating

#### Numeric, z‑score, percentile
Create a within-race z-score and a dataset-wide percentile rank of the rating.

In [None]:
merged["rpr_rating_num"] = pd.to_numeric(merged["rpr_rating"], errors="coerce")
# s.std() > 0 is False for zero or NaN spread, so degenerate races fall back to 0
merged["rpr_rating_z"] = merged["rpr_rating_num"].groupby(merged["race_id"]).transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)
merged["rpr_rating_pct"] = merged["rpr_rating_num"].rank(pct=True)

#### Delta from previous run
Recent rating momentum often carries forward.

In [None]:
merged = merged.sort_values(["horse_id","race_date"])
merged["rpr_rating_delta_prev"] = merged.groupby("horse_id")["rpr_rating_num"].diff()

#### Residual vs market implied
Difference from implied probability highlights under/over‑rated horses.

In [None]:
if "implied_prob" in merged.columns:
    r_z = merged["rpr_rating_z"].copy()
    merged["rpr_rating_resid_prob"] = r_z - merged.groupby("race_id")["implied_prob"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

### tr_rating: Topspeed rating

#### Numeric, z‑score, percentile
Create a within-race z-score and a dataset-wide percentile rank of the rating.

In [None]:
merged["tr_rating_num"] = pd.to_numeric(merged["tr_rating"], errors="coerce")
# s.std() > 0 is False for zero or NaN spread, so degenerate races fall back to 0
merged["tr_rating_z"] = merged["tr_rating_num"].groupby(merged["race_id"]).transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)
merged["tr_rating_pct"] = merged["tr_rating_num"].rank(pct=True)

#### Delta from previous run
Recent rating momentum often carries forward.

In [None]:
merged = merged.sort_values(["horse_id","race_date"])
merged["tr_rating_delta_prev"] = merged.groupby("horse_id")["tr_rating_num"].diff()

#### Residual vs market implied
Difference from implied probability highlights under/over‑rated horses.

In [None]:
if "implied_prob" in merged.columns:
    r_z = merged["tr_rating_z"].copy()
    merged["tr_rating_resid_prob"] = r_z - merged.groupby("race_id")["implied_prob"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

### or_rating: Official Rating

#### Numeric, z‑score, percentile
Create a within-race z-score and a dataset-wide percentile rank of the rating.

In [None]:
merged["or_rating_num"] = pd.to_numeric(merged["or_rating"], errors="coerce")
# s.std() > 0 is False for zero or NaN spread, so degenerate races fall back to 0
merged["or_rating_z"] = merged["or_rating_num"].groupby(merged["race_id"]).transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)
merged["or_rating_pct"] = merged["or_rating_num"].rank(pct=True)

#### Delta from previous run
Recent rating momentum often carries forward.

In [None]:
merged = merged.sort_values(["horse_id","race_date"])
merged["or_rating_delta_prev"] = merged.groupby("horse_id")["or_rating_num"].diff()

#### Residual vs market implied
Difference from implied probability highlights under/over‑rated horses.

In [None]:
if "implied_prob" in merged.columns:
    r_z = merged["or_rating_z"].copy()
    merged["or_rating_resid_prob"] = r_z - merged.groupby("race_id")["implied_prob"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

### father: Horse's Sire

#### Target‑encoded win rate with shrinkage
Target-encode the sire by win rate, shrinking toward the global win rate so sires with few runners do not receive extreme values.

In [None]:
merged["won"] = (pd.to_numeric(merged["finish_position"], errors="coerce") == 1).astype(int)
ped_win = merged.groupby("father")["won"].agg(["mean","count"]).rename(columns={"mean":"_m","count":"_n"})
global_m = merged["won"].mean()
prior = 50  # pseudo-count: strength of the pull toward the global rate
ped_win["_shrunk"] = (ped_win["_m"] * ped_win["_n"] + global_m * prior) / (ped_win["_n"] + prior)
merged["father_win_rate_shrunk"] = merged["father"].map(ped_win["_shrunk"])
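
The encoding above is computed over every row, so each runner's own result feeds into its feature, which can leak the target into a model. One common alternative is to use only earlier rows. The sketch below is a simplified version of that idea under stated assumptions: `past_only_shrunk_rate` is a hypothetical helper, rows are assumed to be sorted chronologically, `global_rate` is assumed to come from a separate earlier period, and rows sharing a date (or a race) are not treated specially.

```python
import pandas as pd

def past_only_shrunk_rate(df, key, target, global_rate, prior=50.0):
    """Shrunk rate per `key` built only from rows that appear earlier in `df`."""
    grouped = df.groupby(key)[target]
    n_prev = grouped.cumcount()                 # earlier rows with the same key
    wins_prev = grouped.cumsum() - df[target]   # earlier wins with the same key
    return (wins_prev + global_rate * prior) / (n_prev + prior)

# Hypothetical runs, already in date order
toy = pd.DataFrame({
    "race_date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"]),
    "father": ["S1", "S1", "S2", "S1"],
    "won": [1, 0, 0, 1],
}).sort_values("race_date")

toy["father_win_rate_past"] = past_only_shrunk_rate(
    toy, "father", "won", global_rate=0.1, prior=2.0
)
print(toy)
```

A sire with no earlier runs receives exactly `global_rate`, and the estimate moves toward the observed rate as earlier runs accumulate.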

#### Surface & distance specialization by pedigree
Capture the sire's win rate by distance group and by surface condition.

In [None]:
# Distance groups (bins are in metres; non-numeric distances get no group)
if "race_distance" in merged.columns:
    bins = [0,1400,2000,4000]
    labels = ["sprint","mid","staying"]
    merged["dist_group"] = pd.cut(pd.to_numeric(merged["race_distance"], errors="coerce"), bins=bins, labels=labels)
    ped_dist = merged.groupby(["father","dist_group"], observed=True)["won"].mean().unstack(fill_value=0).add_prefix("father_dist_")
    # Drop columns from an earlier run so re-executing the cell does not create _x/_y duplicates
    merged = merged.drop(columns=ped_dist.columns, errors="ignore").merge(ped_dist, left_on="father", right_index=True, how="left")
# Surface condition specialization if available
if "surface_condition" in merged.columns:
    ped_surf = merged.groupby(["father","surface_condition"])["won"].mean().unstack(fill_value=0).add_prefix("father_surf_")
    merged = merged.drop(columns=ped_surf.columns, errors="ignore").merge(ped_surf, left_on="father", right_index=True, how="left")

### mother: Horse's Dam

#### Target‑encoded win rate with shrinkage
Target-encode the dam by win rate, shrinking toward the global win rate so dams with few runners do not receive extreme values.

In [None]:
merged["won"] = (pd.to_numeric(merged["finish_position"], errors="coerce") == 1).astype(int)
ped_win = merged.groupby("mother")["won"].agg(["mean","count"]).rename(columns={"mean":"_m","count":"_n"})
global_m = merged["won"].mean()
prior = 50  # pseudo-count: strength of the pull toward the global rate
ped_win["_shrunk"] = (ped_win["_m"] * ped_win["_n"] + global_m * prior) / (ped_win["_n"] + prior)
merged["mother_win_rate_shrunk"] = merged["mother"].map(ped_win["_shrunk"])

#### Surface & distance specialization by pedigree
Capture the dam's win rate by distance group and by surface condition.

In [None]:
# Distance groups (bins are in metres; non-numeric distances get no group)
if "race_distance" in merged.columns:
    bins = [0,1400,2000,4000]
    labels = ["sprint","mid","staying"]
    merged["dist_group"] = pd.cut(pd.to_numeric(merged["race_distance"], errors="coerce"), bins=bins, labels=labels)
    ped_dist = merged.groupby(["mother","dist_group"], observed=True)["won"].mean().unstack(fill_value=0).add_prefix("mother_dist_")
    # Drop columns from an earlier run so re-executing the cell does not create _x/_y duplicates
    merged = merged.drop(columns=ped_dist.columns, errors="ignore").merge(ped_dist, left_on="mother", right_index=True, how="left")
# Surface condition specialization if available
if "surface_condition" in merged.columns:
    ped_surf = merged.groupby(["mother","surface_condition"])["won"].mean().unstack(fill_value=0).add_prefix("mother_surf_")
    merged = merged.drop(columns=ped_surf.columns, errors="ignore").merge(ped_surf, left_on="mother", right_index=True, how="left")

### gfather: Horse's Grandsire

#### Target‑encoded win rate with shrinkage
Target-encode the grandsire by win rate, shrinking toward the global win rate so grandsires with few runners do not receive extreme values.

In [None]:
merged["won"] = (pd.to_numeric(merged["finish_position"], errors="coerce") == 1).astype(int)
ped_win = merged.groupby("gfather")["won"].agg(["mean","count"]).rename(columns={"mean":"_m","count":"_n"})
global_m = merged["won"].mean()
prior = 50  # pseudo-count: strength of the pull toward the global rate
ped_win["_shrunk"] = (ped_win["_m"] * ped_win["_n"] + global_m * prior) / (ped_win["_n"] + prior)
merged["gfather_win_rate_shrunk"] = merged["gfather"].map(ped_win["_shrunk"])

#### Surface & distance specialization by pedigree
Capture the grandsire's win rate by distance group and by surface condition.

In [None]:
# Distance groups (bins are in metres; non-numeric distances get no group)
if "race_distance" in merged.columns:
    bins = [0,1400,2000,4000]
    labels = ["sprint","mid","staying"]
    merged["dist_group"] = pd.cut(pd.to_numeric(merged["race_distance"], errors="coerce"), bins=bins, labels=labels)
    ped_dist = merged.groupby(["gfather","dist_group"], observed=True)["won"].mean().unstack(fill_value=0).add_prefix("gfather_dist_")
    # Drop columns from an earlier run so re-executing the cell does not create _x/_y duplicates
    merged = merged.drop(columns=ped_dist.columns, errors="ignore").merge(ped_dist, left_on="gfather", right_index=True, how="left")
# Surface condition specialization if available
if "surface_condition" in merged.columns:
    ped_surf = merged.groupby(["gfather","surface_condition"])["won"].mean().unstack(fill_value=0).add_prefix("gfather_surf_")
    merged = merged.drop(columns=ped_surf.columns, errors="ignore").merge(ped_surf, left_on="gfather", right_index=True, how="left")

### race_runners: Total runners in race

#### Field size and bins
Model nonlinear effects of field size with bins and interactions.

In [None]:
merged["race_runners_num"] = pd.to_numeric(merged["race_runners"], errors="coerce")
merged["field_small"] = (merged["race_runners_num"] <= 6).astype(int)
merged["field_large"] = (merged["race_runners_num"] >= 12).astype(int)

#### Field size × draw / weight interactions
Crowding amplifies draw and weight effects.

In [None]:
for col in ["draw","weight_kg"]:
    if col in merged.columns:
        merged[f"fieldX_{col}"] = merged["race_runners_num"] * pd.to_numeric(merged[col], errors="coerce")

### margin: Sum of decimalPrices (implied probabilities) for the race

#### Race overround
Compute per‑race sum of implied probabilities to capture market tightness (overround > 1).

In [None]:
if "implied_prob" in merged.columns:
    merged["race_overround"] = merged.groupby("race_id")["implied_prob"].transform("sum")

#### Normalized implied probability
Scale each horse’s implied probability by race overround to compare across races.

In [None]:
if "implied_prob" in merged.columns and "race_overround" in merged.columns:
    merged["implied_prob_norm"] = merged["implied_prob"] / merged["race_overround"]
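
A small worked example may help show what the normalization does. The three implied probabilities below are made up; they sum to 1.1, so the race carries a 10% overround, and after dividing by that sum the probabilities add up to 1.

```python
import pandas as pd

# Hypothetical three-runner race with a 10% overround
toy = pd.DataFrame({
    "race_id": [1, 1, 1],
    "implied_prob": [0.50, 0.40, 0.20],
})
toy["race_overround"] = toy.groupby("race_id")["implied_prob"].transform("sum")
toy["implied_prob_norm"] = toy["implied_prob"] / toy["race_overround"]
print(toy)
```

This proportional scaling assumes the bookmaker margin is spread evenly across runners, which is a simplification.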

### result_win: Horse won (binary)

#### Ensure binary
Normalize to {0,1} for modeling consistency.

In [None]:
merged["result_win"] = pd.to_numeric(merged["result_win"], errors="coerce").fillna(0).astype(int)

### result_place: Horse placed (binary)

#### Ensure binary
Normalize to {0,1} for modeling consistency.

In [None]:
merged["result_place"] = pd.to_numeric(merged["result_place"], errors="coerce").fillna(0).astype(int)

## RACE Fields

### race_id: Unique race identifier

#### Race size and concentration (Herfindahl)
Market concentration captures how much a race is dominated by a few contenders.

In [None]:
if "implied_prob" in merged.columns:
    # HHI of implied probabilities per race
    hhi = merged.assign(p2=lambda df: df["implied_prob"]**2).groupby("race_id")["p2"].transform("sum")
    merged["race_hhi_prob"] = hhi
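
The cell above squares raw implied probabilities, so races with a larger overround get a higher index for reasons unrelated to concentration. A possible variation, sketched here on hypothetical data, normalizes the probabilities to shares first; this is an alternative, not what the cell above computes.

```python
import pandas as pd

# Two hypothetical four-runner races: one evenly matched, one with a strong favourite
toy = pd.DataFrame({
    "race_id": ["even"] * 4 + ["dominated"] * 4,
    "implied_prob": [0.30, 0.30, 0.30, 0.30, 0.77, 0.11, 0.11, 0.11],
})
# Convert to within-race shares so each race sums to 1 before squaring
share = toy["implied_prob"] / toy.groupby("race_id")["implied_prob"].transform("sum")
toy["race_hhi_prob"] = (share ** 2).groupby(toy["race_id"]).transform("sum")
print(toy.drop_duplicates("race_id")[["race_id", "race_hhi_prob"]])
```

With shares, an evenly matched field of n runners scores 1/n, so the reciprocal of the index can be read as an effective number of contenders.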

### course: Course name (with country code in brackets; AW = All Weather)

#### Course clean + one‑hot
Normalize course names and create dummies.

In [None]:
course_clean = merged["course"].astype(str).str.strip()
course_code = course_clean.str.extract(r"\(([^\)]+)\)", expand=False)
merged["course_name_clean"] = course_clean.str.replace(r"\s*\([^\)]*\)","",regex=True).str.strip()
merged["country_from_course"] = course_code
# Basic dummies for major courses (avoid explosion)
top_courses = merged["course_name_clean"].value_counts().head(20).index
for c in top_courses:
    merged[f"course_is_{c[:25].replace(' ','_').lower()}"] = (merged["course_name_clean"] == c).astype(int)

### race_time: Local race time (London TZ) in hh:mm

#### Hour, minute, session bins
Extract hour features; build day vs evening flags.

In [None]:
t = merged["race_time"].astype(str).str.extract(r"^(\d{1,2}):(\d{2})$")
merged["race_hour"] = pd.to_numeric(t[0], errors="coerce")
merged["race_minute"] = pd.to_numeric(t[1], errors="coerce")
merged["race_evening"] = ((merged["race_hour"] >= 17) & (merged["race_hour"] <= 21)).astype(int)

### race_date: Date of the race

#### Calendar decomposition
Year, month, week, weekday for seasonal and meet effects.

In [None]:
merged["race_year"] = merged["race_date"].dt.year
merged["race_month"] = merged["race_date"].dt.month
merged["race_week"] = merged["race_date"].dt.isocalendar().week.astype("Int32")  # nullable dtype tolerates NaT dates
merged["race_dow"] = merged["race_date"].dt.dayofweek

### title: Title of the race

#### Keyword flags (Maiden/Handicap/Listed/Group)
Extract class/type signals from title text.

In [None]:
title = merged["title"].astype(str).str.lower()
for kw in ["maiden","handicap","listed","group","stakes","novice","claiming"]:
    merged[f"title_kw_{kw}"] = title.str.contains(kw, na=False).astype(int)
merged["title_len"] = title.str.len()

### rclass: Race class (raw)

#### Ordinal encoding
Map textual class to ordered numeric scale (lower is better class).

In [None]:
rclass = merged["rclass"].astype(str).str.upper().str.strip()
# Example mapping; adjust to your jurisdiction
map_order = {"GROUP 1":1,"G1":1,"GROUP 2":2,"G2":2,"GROUP 3":3,"G3":3,"LISTED":4,"L":4,"CLASS 1":5,"CLASS 2":6,"CLASS 3":7,"CLASS 4":8,"CLASS 5":9,"CLASS 6":10}
merged["rclass_ord"] = rclass.map(map_order)

### race_class: Class type (derived)

#### Consistency check with rclass
Capture disagreement between raw and derived class as a feature.

In [None]:
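if "rclass_ord" in merged.columns:
    rc_num = pd.to_numeric(merged["race_class"], errors="coerce")
    # Compare only where both values exist; NaN != NaN would otherwise count as a disagreement
    merged["race_class_disagree"] = (rc_num.notna() & merged["rclass_ord"].notna() & (rc_num != merged["rclass_ord"])).astype(int)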

### band: Rating band / restrictions

#### Band min/max ratings
Extract numeric limits when band is like '0-85' or '80-95'.

In [None]:
band = merged["band"].astype(str)
rng = band.str.extract(r"(\d+)\s*[-–]\s*(\d+)", expand=True)
merged["band_min"] = pd.to_numeric(rng[0], errors="coerce")
merged["band_max"] = pd.to_numeric(rng[1], errors="coerce")

### ages: Age restrictions (e.g., 3yo+, 2yo only)

#### Min/max allowed age
Parse age constraints to compare with actual horse age.

In [None]:
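ages = merged["ages"].astype(str).str.lower()
# "3yo+" -> min 3; "4-6yo" -> min 4, max 6
rng_age = ages.str.extract(r"(\d+)\s*(?:[-–]\s*(\d+))?\s*yo", expand=True)
merged["age_min_allowed"] = pd.to_numeric(rng_age[0], errors="coerce")
merged["age_max_allowed"] = pd.to_numeric(rng_age[1], errors="coerce")
merged["age_only_flag"] = ages.str.contains("only", na=False).astype(int)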

### race_distance: Official race distance (raw)

#### Distance in meters (canonical)
Coerce to numeric meters for modeling and interaction consistency.

In [None]:
merged["race_distance_m"] = pd.to_numeric(merged["metric"], errors="coerce") if "metric" in merged.columns else pd.to_numeric(merged["race_distance"], errors="coerce")

#### Distance bins & log distance
Capture nonlinearity: sprint/middle/staying; add log scale.

In [None]:
d = merged["race_distance_m"]
bins = [0,1400,2000,4000,10000]
labels = ["sprint","mid","staying","extreme"]
merged["distance_bin"] = pd.cut(d, bins=bins, labels=labels)
merged["log_distance"] = np.log1p(d)

### surface_condition: Going / surface condition

#### Ordered condition scale
Map qualitative going to ordinal numeric (firmer→softer).

In [None]:
cond = merged["surface_condition"].astype(str).str.lower().str.strip()
order = {"firm":1,"good to firm":2,"good":3,"good to soft":4,"soft":5,"heavy":6,"aw":3}
merged["going_ord"] = cond.map(order)

### hurdles: Hurdles type and count

#### Hurdle count and type flags
Extract number of obstacles and type (e.g., hurdles vs fences).

In [None]:
hud = merged["hurdles"].astype(str).str.lower()
merged["hurdle_count"] = pd.to_numeric(hud.str.extract(r"(\d+)", expand=False), errors="coerce")
for kw in ["hurdle","fence","chase","steeple"]:
    merged[f"hurdles_kw_{kw}"] = hud.str.contains(kw, na=False).astype(int)

### prize_total: Total purse for the race

#### Log prize and prize per runner
Stabilize scale; control for field size.

In [None]:
merged["prize_total_num"] = pd.to_numeric(merged["prize_total"], errors="coerce")
merged["log_prize_total"] = np.log1p(merged["prize_total_num"])
if "race_runners_num" in merged.columns:
    merged["prize_per_runner"] = merged["prize_total_num"] / merged["race_runners_num"]

### prize_breakdown: Places prizes

#### Place prize ratio features
Count the paid places listed in the breakdown as a first proxy for how top-heavy the prize structure is (winner-take-most vs balanced).

In [None]:
bd = merged["prize_breakdown"].astype(str).str.lower()
nums = bd.str.findall(r"(\d+[\.,]?\d*)")
# simple count of prize entries
merged["prize_slots"] = nums.apply(lambda x: len(x) if isinstance(x, list) else np.nan)
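
The count of paid places says little about skew on its own. A direct measure is the winner's share of the listed prize money, sketched below. `winner_share` is a hypothetical helper, and the format is an assumption: amounts are plain decimals without thousands separators, listed with first place first.

```python
import re

import numpy as np
import pandas as pd

_AMOUNT = re.compile(r"\d+(?:\.\d+)?")

def winner_share(breakdown) -> float:
    """Share of the listed prize money paid to first place (NaN if unparseable)."""
    text = "" if pd.isna(breakdown) else str(breakdown)
    amounts = [float(x) for x in _AMOUNT.findall(text)]
    total = sum(amounts)
    return amounts[0] / total if total > 0 else np.nan

# Hypothetical breakdown strings
toy = pd.Series(["[3000.0, 1000.0, 500.0, 500.0]", "[2000]", None, ""])
toy_share = toy.apply(winner_share)
print(toy_share)
```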

### winning_time: Race winning time

#### Winning time (seconds)
Parse time formats (m:s.ms) to seconds for speed features.

In [None]:
wt = merged["winning_time"].astype(str)
mins = pd.to_numeric(wt.str.extract(r"^(\d+)\s*[:m]", expand=False), errors="coerce").fillna(0)
secs = pd.to_numeric(wt.str.extract(r"(\d+(?:\.\d+)?)\s*s?$", expand=False), errors="coerce")
merged["winning_time_s"] = mins*60 + secs
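
The two-regex approach above accepts any string that ends in a number. If stricter parsing is preferred, a single anchored pattern can reject malformed values outright. The following is a sketch under assumptions: `winning_time_seconds` is a hypothetical helper, and the accepted formats (`m:ss.ss`, `Xm Y.Ys`, or plain seconds) are guesses about the data that should be checked against real values.

```python
import re

import numpy as np

# Optional minutes part ("1:" or "4m"), then seconds with optional decimals and "s"
_TIME = re.compile(r"^\s*(?:(\d+)\s*(?::|m)\s*)?(\d+(?:\.\d+)?)\s*s?\s*$")

def winning_time_seconds(text) -> float:
    """Parse a winning time to seconds; NaN when the format is not recognised."""
    match = _TIME.match(str(text).lower())
    if match is None:
        return np.nan
    minutes = float(match.group(1)) if match.group(1) else 0.0
    return minutes * 60 + float(match.group(2))

examples = {t: winning_time_seconds(t) for t in ["1:12.34", "4m 7.9s", "72.34", "n/a"]}
print(examples)
```

It can be applied column-wide with `merged["winning_time"].map(winning_time_seconds)`.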

#### Speed figure (m/s) and relative speed
Distance‑normalized speed; compare within race class.

In [None]:
if "race_distance_m" in merged.columns:
    merged["speed_mps"] = merged["race_distance_m"] / merged["winning_time_s"]
    merged["speed_class_z"] = merged.groupby("rclass")["speed_mps"].transform(lambda s: (s - s.mean())/s.std() if s.std() > 0 else 0)

### country_code: Country of the race

#### Country dummies & hemisphere
Capture jurisdiction and seasonal effects via country/hemisphere.

In [None]:
cc = merged["country_code"].astype(str).str.upper().str.strip()
top_cc = cc.value_counts().head(10).index
for c in top_cc:
    merged[f"country_is_{c}"] = (cc == c).astype(int)
hemiN = {"GB","IE","FR","US","CA","JP","HK"}
merged["hemisphere_north"] = cc.isin(hemiN).astype(int)

### ncond: Condition type (derived from condition)

#### One‑hot condition type
Directly encode derived condition buckets.

In [None]:
nc = merged["ncond"].astype(str).str.lower().str.strip()
for v in nc.value_counts().head(10).index:
    merged[f"ncond_is_{str(v)[:25].replace(' ','_')}"] = (nc == v).astype(int)

### ages (compliance): Check horse age vs allowed ages

#### Within allowed age flag
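Flag whether the horse's age meets the race's minimum age (1 = meets it). Only the lower bound is checked, and rows with a missing age or restriction are flagged 0.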

In [None]:
if "age_min_allowed" in merged.columns and "age" in merged.columns:
    merged["age_within_rules"] = (pd.to_numeric(merged["age"], errors="coerce") >= merged["age_min_allowed"]).astype(int)

## Final Sanity Checks

In [None]:

raw_cols = {"finish_position","positionL","distance_behind","weight_st","weight_lb","over_weight","out_handicap","headgear","rpr_rating","tr_rating","or_rating","father","mother",
            "gfather","race_runners","margin","official_weight","result_win","result_place","race_id","course","race_time","race_date","title","rclass","band","ages",
            "race_distance","surface_condition","hurdles","prize_breakdown","winning_time","prize_total","metric","country_code","ncond","race_class"}
new_cols = [c for c in merged.columns if c not in raw_cols]
print("New engineered columns added:", len(new_cols))
new_cols[:50]
