# 03 — Feature Engineering
---
This notebook continues from **02-preprocessing.ipynb** and constructs modeling-ready features.

### Inputs
- `../data/processed/countries_preprocessed.csv`

### Outputs (CSV only)
- `../data/processed/countries_features.csv`

### Objectives
- Load standardized country-level data (`countries_preprocessed.csv`)
- Engineer time-aware domain features (e.g., lags, rolling means, squares, interactions, ratios)
- Validate feature completeness and guard against leakage
- Save the finalized feature matrix for modeling

### Key decisions/assumptions
- **Countries only:** Aggregates were removed in 01 and are not reintroduced here.
- **Time awareness:** Use past information only (e.g., `*_lag1`, `*_roll3`); no `*_future` or contemporaneous target leakage.
- **Feature set hygiene:** Exclude from model features: `cereal_yield`, `log_cereal_yield`, `year`, identifiers (`Country Code`, `Country Name`), and any `lag0_*`/`*_future` fields.
- **Transforms included:** Companion features such as `log_*`, `*_sq`, interactions (e.g., `tempXprecip`), and ratios (e.g., `fertilizer_per_gdp`) where available.
- **Missing values:** Do not impute the target; feature imputation, if required, occurs inside the modeling pipeline (04).
- **Artifacts policy:** CSV-only.
---

In [1]:
# --- Setup ---
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)
plt.rcParams.update({"figure.figsize": (10, 4), "figure.dpi": 120})

# Expected input from previous stage
DATA_PATH = Path("../data/processed/countries_preprocessed.csv")
if not DATA_PATH.exists():
    # Fallback to countries_only if needed
    DATA_PATH = Path("../data/processed/countries_only.csv")

print("Using data from:", DATA_PATH)
df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head()

Using data from: ..\data\processed\countries_preprocessed.csv
Shape: (14560, 17)


Unnamed: 0,Country Name,Country Code,year,fertilizer_use,arable_land_pct,precipitation,cereal_yield,co2_total_mt,co2_per_capita,gdp_per_capita,population,rural_pop_pct,temp_anomaly,is_aggregate,log_cereal_yield,log_fertilizer_use,log_gdp_per_capita
0,Afghanistan,AFG,1960,,,,,,,,9035043.0,91.599,-0.03,False,,,
1,Afghanistan,AFG,1961,0.143791,11.728991,327.0,1115.1,,,,9214083.0,91.316,0.06,False,7.017596,0.134348,
2,Afghanistan,AFG,1962,0.142857,11.805651,327.0,1079.0,,,,9404406.0,91.024,0.03,False,6.984716,0.133531,
3,Afghanistan,AFG,1963,0.141935,11.882311,327.0,985.8,,,,9604487.0,90.724,0.05,False,6.894467,0.132725,
4,Afghanistan,AFG,1964,0.141026,11.958972,327.0,1082.8,,,,9814318.0,90.414,-0.2,False,6.988229,0.131928,


## Consistency Checks

In [2]:
required = ["Country Name","Country Code","year"]
missing = [c for c in required if c not in df.columns]
print("Missing required columns:", missing if missing else "None")
print("\nPreview:")
df.head()

Missing required columns: None

Preview:


Unnamed: 0,Country Name,Country Code,year,fertilizer_use,arable_land_pct,precipitation,cereal_yield,co2_total_mt,co2_per_capita,gdp_per_capita,population,rural_pop_pct,temp_anomaly,is_aggregate,log_cereal_yield,log_fertilizer_use,log_gdp_per_capita
0,Afghanistan,AFG,1960,,,,,,,,9035043.0,91.599,-0.03,False,,,
1,Afghanistan,AFG,1961,0.143791,11.728991,327.0,1115.1,,,,9214083.0,91.316,0.06,False,7.017596,0.134348,
2,Afghanistan,AFG,1962,0.142857,11.805651,327.0,1079.0,,,,9404406.0,91.024,0.03,False,6.984716,0.133531,
3,Afghanistan,AFG,1963,0.141935,11.882311,327.0,985.8,,,,9604487.0,90.724,0.05,False,6.894467,0.132725,
4,Afghanistan,AFG,1964,0.141026,11.958972,327.0,1082.8,,,,9814318.0,90.414,-0.2,False,6.988229,0.131928,


## Temporal Features (Lags & Rolling Means)

In [3]:
# We compute features per country, sorted by year.
id_col = "Country Code" if "Country Code" in df.columns else "Country Name"
df = df.sort_values([id_col, "year"]).reset_index(drop=True)

# Select base variables if present
base_vars = [c for c in ["cereal_yield","temp_anomaly","precipitation"] if c in df.columns]
print("Base variables for temporal features:", base_vars)

# Create 1-year lags and trailing 3-year rolling means (history-only)
for col in base_vars:
    df[f"{col}_lag1"] = df.groupby(id_col)[col].shift(1)
    df[f"{col}_roll3"] = (
        df.groupby(id_col)[col].shift(1).rolling(window=3, min_periods=1).mean().reset_index(level=0, drop=True)
    )

print("Added lag1 and roll3 features for:", base_vars)

Base variables for temporal features: ['cereal_yield', 'temp_anomaly', 'precipitation']
Added lag1 and roll3 features for: ['cereal_yield', 'temp_anomaly', 'precipitation']


## Interaction & Polynomial Features

In [4]:
# Interactions: use lagged forms to reduce time leakage for pre-season use
if "temp_anomaly_lag1" in df.columns and "precipitation_lag1" in df.columns:
    df["tempXprecip_lag1"] = df["temp_anomaly_lag1"] * df["precipitation_lag1"]

# Simple squares on (lagged) climate signals
for col in ["temp_anomaly_lag1","precipitation_lag1"]:
    if col in df.columns:
        df[f"{col}_sq"] = df[col] ** 2

# Ratios (safe checks)
if "fertilizer_use" in df.columns and "gdp_per_capita" in df.columns:
    denom = (df["gdp_per_capita"].replace(0, np.nan))
    df["fertilizer_per_gdp"] = df["fertilizer_use"] / denom

# If both total CO2 and population exist but per-capita not present, compute it
if "co2_total_mt" in df.columns and "population" in df.columns and "co2_per_capita" not in df.columns:
    denom = (df["population"].replace(0, np.nan))
    df["co2_per_capita"] = (df["co2_total_mt"] * 1e6) / denom

print("Created interaction/polynomial/ratio features where applicable.")

Created interaction/polynomial/ratio features where applicable.


## Validation: Missingness & Leakage Guardrails

In [5]:
# Missingness
na_ratio = df.isna().mean().sort_values(ascending=False)
print("Top missingness (%):\n", (na_ratio.head(15) * 100).round(1))

# Simple leakage heuristics
suspects = [c for c in df.columns if c.endswith("_future") or c.startswith("lag0_")]
print("Leakage suspects:", suspects if suspects else "None detected")

Top missingness (%):
 fertilizer_per_gdp       32.8
precipitation_lag1_sq    29.5
precipitation            29.5
tempXprecip_lag1         29.5
precipitation_lag1       29.5
fertilizer_use           28.9
log_fertilizer_use       28.9
precipitation_roll3      27.1
cereal_yield_lag1        27.0
log_cereal_yield         27.0
cereal_yield             27.0
cereal_yield_roll3       24.4
co2_per_capita           21.7
co2_total_mt             21.7
log_gdp_per_capita       16.8
dtype: float64
Leakage suspects: None detected


### Note on missing values
- Several variables show non-trivial missingness. Largest gaps typically occur in:
  - **Precipitation** and derivatives (`precipitation`, `precipitation_lag1`, `precipitation_roll3`, `*_sq`, `tempXprecip_lag1`).
  - **Fertilizer** metrics (`fertilizer_use`, `fertilizer_per_gdp`) and derived ratios.
  - **CO₂** metrics (`co2_total_mt`, `co2_per_capita`).
  - **Target-related** features (`cereal_yield`, `log_cereal_yield`, `cereal_yield_lag1`, `cereal_yield_roll3`), including first-year lag effects.
- No imputation is performed here. Feature imputation (median) happens inside the modeling pipeline with time-aware CV; the **target is never imputed**.
- Lags naturally create `NaN` for each country’s first observed year. Rolling features use `min_periods=1`; consider `2–3` in sensitivity checks if stricter windows are desired.
- Optionally drop highly sparse features later (e.g., >50–60% missing).

## Save Feature Matrix

In [6]:
# --- Final guards (no-op if already clean) ---

# Ensure deterministic order and types
if "year" in df.columns:
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
order_cols = [c for c in ["Country Name", "year"] if c in df.columns]
if order_cols:
    df = df.sort_values(order_cols).reset_index(drop=True)

# Prefer strictly past-looking interactions (avoid any accidental contemporaneous term)
drop_if_present = ["temp_precip_interaction"]  # only if a same-year interaction slipped in
df = df.drop(columns=[c for c in drop_if_present if c in df.columns], errors="ignore")

# (Optional) quick leak check: no *_future or lag0_*
leak_like = [c for c in df.columns if c.endswith("_future") or c.startswith("lag0_")]
assert not leak_like, f"Unexpected future/lag0 features: {leak_like}"


In [7]:
# Save final features for modeling (single canonical output)
features_out = "../data/processed/countries_features.csv"
df.to_csv(features_out, index=False)
print(f"✅ Saved features to: {features_out}\nShape: {df.shape}")

✅ Saved features to: ../data/processed/countries_features.csv
Shape: (14560, 27)


## (Optional) Baseline Modeling Sanity Check

In [8]:
# Set to True to quickly verify features carry signal.
RUN_BASELINE = False

if RUN_BASELINE and "cereal_yield" in df.columns:
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import r2_score, mean_absolute_error

    # Hold-out by time: last 5 years as test
    last_year = int(df["year"].max())
    cutoff = last_year - 5

    # Feature set: all numeric except year, country code, target
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    drop_cols = ["year", "cereal_yield"]
    if "Country Code" in df.columns:
        drop_cols.append("Country Code")
    X_cols = [c for c in numeric_cols if c not in drop_cols]

    train = df[df["year"] <= cutoff].copy()
    test  = df[df["year"] > cutoff].copy()

    X_train, y_train = train[X_cols], train["cereal_yield"]
    X_test,  y_test  = test[X_cols],  test["cereal_yield"]

    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
    ])

    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    print(f"Baseline hold-out ({cutoff+1}-{last_year}) | R2: {r2_score(y_test, preds):.3f}  MAE: {mean_absolute_error(y_test, preds):.1f}")

## Environment Information

In [9]:
import sys, platform, numpy, pandas, matplotlib
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("NumPy:", numpy.__version__)
print("Pandas:", pandas.__version__)
print("Matplotlib:", matplotlib.__version__)

Python: 3.12.11
Platform: Windows-10-10.0.19045-SP0
NumPy: 2.3.3
Pandas: 2.3.3
Matplotlib: 3.10.6


---
### ✅ Notebook Summary
- Loaded preprocessed country dataset
- Created lag/rolling, interaction, and ratio features (history-only where relevant)
- Performed missingness and leakage checks
- Exported final features: `countries_features.csv`

Next: proceed to **04-modeling.ipynb** for training and evaluation.