# 03 — Feature Engineering

**Purpose:** Construct modeling‑ready features from preprocessed country‑year data: lag/rolling statistics, nonlinear transforms, and interactions. Save a single feature matrix for modeling.

**Inputs:** `./data/processed/countries_preprocessed.csv`  
**Outputs:** `./data/processed/countries_features.csv`

**Key decisions/assumptions**
- Countries only (aggregates removed upstream).
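- Lag and rolling features are computed within country groups. Lags use past years only; the rolling means use a window that ends at, and includes, the current year.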
- Target (`cereal_yield`) is not imputed here; any imputation occurs in the modeling pipeline.
- Current‑year nonlinear terms (log, square) and interactions are included so that scenario analysis can reflect nonlinear and conditional responses.
- One‑year lags and 3‑year rolling means are built for `cereal_yield`, `temp_anomaly`, and `precipitation`; lag1 squares and a lag1 interaction are derived for `temp_anomaly` and `precipitation`.
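
The decision to compute lags within country groups can be illustrated on a toy frame. This is a minimal sketch with made-up values (the `toy` frame and the `lag1_naive` / `lag1_grouped` column names are hypothetical, not project data): a grouped `shift(1)` leaves each country's first year empty instead of borrowing the last year of the preceding country.

```python
import pandas as pd

# Hypothetical toy data: two countries, three years each (values are made up).
toy = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "cereal_yield": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
toy = toy.sort_values(["Country Name", "year"])

# Naive lag: shifts across the country boundary, so B's 2000 row receives A's 2002 value.
toy["lag1_naive"] = toy["cereal_yield"].shift(1)

# Grouped lag: each country's first year has no predecessor and stays NaN.
toy["lag1_grouped"] = toy.groupby("Country Name")["cereal_yield"].shift(1)

print(toy)
```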

## 1) Setup

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)

DATA_IN  = Path("../data/processed/countries_preprocessed.csv")
DATA_OUT = Path("../data/processed/countries_features.csv")

print("Reading:", DATA_IN.resolve())
df = pd.read_csv(DATA_IN)
print("Shape in:", df.shape)

# Ensure 'year' is numeric (Int64) for safe sorting
if "year" in df.columns:
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")

Reading: /Users/golibsanaev/Library/CloudStorage/Dropbox/GitHub_gsanaev/crop-yield-prediction-climate-change/data/processed/countries_preprocessed.csv
Shape in: (14040, 17)


## 2) Utilities

In [2]:
def group_key(frame: pd.DataFrame) -> str:
    if "Country Name" in frame.columns: 
        return "Country Name"
    if "Country Code" in frame.columns: 
        return "Country Code"
    raise KeyError("Expected 'Country Name' or 'Country Code' in columns.")

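def add_group_lag(frame: pd.DataFrame, col: str, lag: int, new_name: str) -> None:
    """Add `col` shifted by `lag` rows within each country, in place.

    The shift is positional: after sorting by year, `lag=1` is the previous
    available row for that country, not necessarily the previous calendar year.
    """
    gk = group_key(frame)
    # The shifted Series keeps the original index, so assignment aligns row by row.
    frame[new_name] = (
        frame.sort_values([gk, "year"])
             .groupby(gk, group_keys=False)[col]
             .shift(lag)
    )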

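def add_group_roll_mean(frame: pd.DataFrame, col: str, window: int, new_name: str) -> None:
    """Add a per-country rolling mean of `col`, in place.

    The window ends at the current row and includes it; `min_periods=1`
    yields a value as soon as one non-missing observation is in the window.
    """
    gk = group_key(frame)
    frame[new_name] = (
        frame.sort_values([gk, "year"])
             .groupby(gk, group_keys=False)[col]
             .rolling(window, min_periods=1)
             .mean()
             .reset_index(level=0, drop=True)  # drop the group level to align on the original index
    )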

def safe_log1p(series: pd.Series) -> pd.Series:
    # Non-numeric entries become NaN; negative values are clipped to 0 before the log.
    vals = pd.to_numeric(series, errors="coerce")
    vals = vals.clip(lower=0)
    return np.log1p(vals)

def safe_square(series: pd.Series) -> pd.Series:
    vals = pd.to_numeric(series, errors="coerce")
    return vals ** 2

def safe_interaction(a: pd.Series, b: pd.Series) -> pd.Series:
    a = pd.to_numeric(a, errors="coerce")
    b = pd.to_numeric(b, errors="coerce")
    return a * b
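
`add_group_roll_mean` averages over a window that ends at the current year. If a rolling feature should use past years only (for example, a rolling mean of the target), one option is to shift within each country before rolling. The sketch below is an illustration under that assumption: the helper name `add_group_past_roll_mean`, its `key` parameter, and the toy values are hypothetical and are not part of this notebook's pipeline.

```python
import pandas as pd

def add_group_past_roll_mean(frame, col, window, new_name, key="Country Name"):
    """Hypothetical variant: mean of up to `window` previous rows, excluding the current one."""
    ordered = frame.sort_values([key, "year"])
    frame[new_name] = ordered.groupby(key)[col].transform(
        # shift(1) removes the current year from the window before averaging
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "Country Name": ["A"] * 4 + ["B"] * 2,
    "year": [2000, 2001, 2002, 2003, 2000, 2001],
    "cereal_yield": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
})
add_group_past_roll_mean(toy, "cereal_yield", window=3, new_name="cereal_yield_pastroll3")
print(toy)
```

With this variant the first year of each country has no value, because there is no earlier observation to average.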

## 3) Nonlinear transforms (current year)

In [3]:
added_cols = []

# Log/square on key drivers (current-year)
if "fertilizer_use" in df.columns:
    df["log_fertilizer_use"] = safe_log1p(df["fertilizer_use"])
    df["fertilizer_use_sq"] = safe_square(df["fertilizer_use"])
    added_cols += ["log_fertilizer_use", "fertilizer_use_sq"]

if "temp_anomaly" in df.columns:
    df["temp_anomaly_sq"] = safe_square(df["temp_anomaly"])
    added_cols.append("temp_anomaly_sq")

if "precipitation" in df.columns:
    df["precipitation_sq"] = safe_square(df["precipitation"])
    added_cols.append("precipitation_sq")

# Optional macro logs (kept lightweight)
if "gdp_per_capita" in df.columns:
    df["log_gdp_per_capita"] = safe_log1p(df["gdp_per_capita"])
    added_cols.append("log_gdp_per_capita")

print("Added nonlinear cols:", added_cols if added_cols else "none")

Added nonlinear cols: ['log_fertilizer_use', 'fertilizer_use_sq', 'temp_anomaly_sq', 'precipitation_sq', 'log_gdp_per_capita']
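
To make the behaviour of the log transform concrete, the sketch below runs `safe_log1p` (copied from the Utilities cell so the example is self-contained) on a few made-up inputs. The `raw` values are hypothetical; they only show how zeros, negatives, and non-numeric entries are handled.

```python
import numpy as np
import pandas as pd

def safe_log1p(series: pd.Series) -> pd.Series:
    # Copied from the Utilities cell so this example runs on its own.
    vals = pd.to_numeric(series, errors="coerce")
    vals = vals.clip(lower=0)
    return np.log1p(vals)

# Made-up inputs: zero, a positive value, a negative value, a string, a missing value.
raw = pd.Series([0, 9, -5, "n/a", None])
out = safe_log1p(raw)
print(out)
```

The negative entry is mapped to the same output as zero, so the transform does not distinguish between them; the string and the missing value both come out as NaN.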


## 4) Lag/rolling features (time‑safe)

In [4]:
lag_added, roll_added = [], []

# Lags: 1-year for key series
for base in ["cereal_yield", "temp_anomaly", "precipitation"]:
    if base in df.columns:
        newc = f"{base}_lag1"
        add_group_lag(df, base, lag=1, new_name=newc)
        lag_added.append(newc)

# Roll: 3-year rolling mean for key series
for base in ["cereal_yield", "temp_anomaly", "precipitation"]:
    if base in df.columns:
        newc = f"{base}_roll3"
        add_group_roll_mean(df, base, window=3, new_name=newc)
        roll_added.append(newc)

# Lag1 squares (used later in modeling & scenarios)
for base in ["temp_anomaly", "precipitation"]:
    lagc = f"{base}_lag1"
    if lagc in df.columns:
        sqc = f"{lagc}_sq"
        df[sqc] = safe_square(df[lagc])
        lag_added.append(sqc)

print("Added lag cols:", lag_added if lag_added else "none")
print("Added roll cols:", roll_added if roll_added else "none")

Added lag cols: ['cereal_yield_lag1', 'temp_anomaly_lag1', 'precipitation_lag1', 'temp_anomaly_lag1_sq', 'precipitation_lag1_sq']
Added roll cols: ['cereal_yield_roll3', 'temp_anomaly_roll3', 'precipitation_roll3']
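
Because `shift` is positional, `*_lag1` equals the previous calendar year only when a country's rows are consecutive in `year`. Whether the preprocessed panel contains gaps is not shown in this notebook, so the following is only a sketch of how one might check, using a made-up frame (`toy`, `year_step`, and `gaps` are hypothetical names).

```python
import pandas as pd

# Made-up toy data: country A has no row for 2002.
toy = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B"],
    "year": [2000, 2001, 2003, 2000, 2001],
    "precipitation": [100.0, 110.0, 130.0, 50.0, 55.0],
})
toy = toy.sort_values(["Country Name", "year"])

# Difference in years to the previous row within each country; 1 means consecutive.
toy["year_step"] = toy.groupby("Country Name")["year"].diff()

# Rows whose "lag1" would actually reach back more than one year.
gaps = toy[toy["year_step"] > 1]
print(gaps)
```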


## 5) Interactions (current‑year and lag‑aware)

In [5]:
inter_added = []

# Current-year interactions
if {"temp_anomaly", "precipitation"}.issubset(df.columns):
    df["tempXprecip"] = safe_interaction(df["temp_anomaly"], df["precipitation"])
    inter_added.append("tempXprecip")

if {"temp_anomaly", "fertilizer_use"}.issubset(df.columns):
    df["tempXfertilizer"] = safe_interaction(df["temp_anomaly"], df["fertilizer_use"])
    inter_added.append("tempXfertilizer")

if {"precipitation", "fertilizer_use"}.issubset(df.columns):
    df["precipXfertilizer"] = safe_interaction(df["precipitation"], df["fertilizer_use"])
    inter_added.append("precipXfertilizer")

# Lag1 interaction (used by modeling/scenarios)
if {"temp_anomaly_lag1", "precipitation_lag1"}.issubset(df.columns):
    df["tempXprecip_lag1"] = safe_interaction(df["temp_anomaly_lag1"], df["precipitation_lag1"])
    inter_added.append("tempXprecip_lag1")

print("Added interaction cols:", inter_added if inter_added else "none")

Added interaction cols: ['tempXprecip', 'tempXfertilizer', 'precipXfertilizer', 'tempXprecip_lag1']


## 6) Simple ratios (optional)

In [6]:
ratio_added = []
if set(["fertilizer_use", "gdp_per_capita"]).issubset(df.columns):
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = pd.to_numeric(df["fertilizer_use"], errors="coerce") / pd.to_numeric(df["gdp_per_capita"], errors="coerce")
        ratio = ratio.replace([np.inf, -np.inf], np.nan)
        df["fertilizer_per_gdp"] = ratio
        ratio_added.append("fertilizer_per_gdp")

print("Added ratios:", ratio_added if ratio_added else "none")

Added ratios: ['fertilizer_per_gdp']
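
The ratio cell replaces infinities with NaN. A small sketch with made-up values (the `fert` and `gdp` Series are hypothetical) shows the two cases this covers: a nonzero numerator over zero gives `inf`, while zero over zero is already NaN.

```python
import numpy as np
import pandas as pd

# Made-up values; the last two denominators are zero.
fert = pd.Series([50.0, 20.0, 0.0])
gdp = pd.Series([1000.0, 0.0, 0.0])

ratio = fert / gdp  # 20/0 yields inf, 0/0 yields NaN
ratio = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio)
```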


## 7) Sanity checks

In [7]:
# Leakage guardrail (name-based only): fail if any column is labelled as a future value
bad_cols = [c for c in df.columns if c.endswith("_future") or c.startswith("future_")]
assert not bad_cols, f"Found future-labelled columns: {bad_cols}"

# Quick missingness snapshot (top 15)
mis = df.isna().mean().sort_values(ascending=False) * 100
print("Top missingness (%):")
print(mis.head(15).round(1))

# Minimal column presence check for target and year
assert "cereal_yield" in df.columns, "Target 'cereal_yield' missing after transforms."
assert "year" in df.columns, "'year' missing; required for time-safe features."

Top missingness (%):
precipXfertilizer        34.1
fertilizer_per_gdp       34.1
fertilizer_use_sq        30.0
log_fertilizer_use       30.0
tempXfertilizer          30.0
fertilizer_use           30.0
cereal_yield             28.2
log_cereal_yield         28.2
cereal_yield_lag1        28.2
precipitation            26.9
precipitation_sq         26.9
tempXprecip              26.9
precipitation_lag1_sq    26.9
precipitation_lag1       26.9
tempXprecip_lag1         26.9
dtype: float64
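
In the snapshot above, `precipXfertilizer` is missing more often (34.1%) than either `fertilizer_use` (30.0%) or `precipitation` (26.9%). That is what a product produces when the parents are missing in partly different rows: the interaction is NaN wherever either input is NaN. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up values: each parent is missing in 2 of 5 rows, overlapping in only one row.
toy = pd.DataFrame({
    "precipitation":  [100.0, np.nan, 120.0, 90.0, np.nan],
    "fertilizer_use": [10.0, 12.0, np.nan, 8.0, np.nan],
})
toy["precipXfertilizer"] = toy["precipitation"] * toy["fertilizer_use"]

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```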


## 8) Save features

In [8]:
df.to_csv(DATA_OUT, index=False)
print("Saved:", DATA_OUT.resolve())
print("Final shape:", df.shape)

Saved: /Users/golibsanaev/Library/CloudStorage/Dropbox/GitHub_gsanaev/crop-yield-prediction-climate-change/data/processed/countries_features.csv
Final shape: (14040, 33)
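
CSV does not store dtypes, so the nullable `Int64` type set for `year` in Setup is not carried over to whoever reads the file. The sketch below round-trips a made-up frame through a temporary file (the file name `features_toy.csv` and the `toy` values are hypothetical) and re-applies the same conversion used in Setup after reading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with a missing year and a missing target value.
toy = pd.DataFrame({
    "Country Name": ["A", "A", "B"],
    "year": pd.array([2000, 2001, None], dtype="Int64"),
    "cereal_yield": [1.5, None, 3.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features_toy.csv"
    toy.to_csv(path, index=False)
    back = pd.read_csv(path)

# With a missing value present, read_csv infers a float column for 'year'.
dtype_after_read = str(back["year"].dtype)
print("year dtype after read:", dtype_after_read)

# Same conversion as in Setup restores the nullable integer type.
back["year"] = pd.to_numeric(back["year"], errors="coerce").astype("Int64")
print(back.dtypes)
```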


## 9) Environment

In [9]:
import sys, platform, numpy, pandas
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("NumPy:", numpy.__version__)
print("Pandas:", pandas.__version__)

Python: 3.12.11
Platform: macOS-15.6.1-arm64-arm-64bit
NumPy: 2.3.3
Pandas: 2.3.3
