# Feature engineering

In this notebook, the goal of feature engineering is to transform raw variables into features that are:

- comparable across ZIP codes,
- informative for modeling housing prices, and
- able to capture time dynamics such as momentum, trend, and seasonality.

We intentionally create:

- **Rate and share features** (e.g., crime per 1,000 residents, education shares), because raw counts are heavily driven by population size.

- **Time-series features** (lags, rolling averages, year-over-year changes), because housing markets and neighborhood factors evolve over time and tend to be autocorrelated.

- **Seasonality encodings**, because housing-related behavior often changes by month, and months are cyclical (December and January are adjacent even though 12 and 1 are numerically far apart).

The output of this notebook is a modeling-ready dataset saved as a CSV.

## Imports and setup

In [3]:
import pandas as pd
import numpy as np

IN_PATH = "final_clean_dataset.csv"
OUT_PATH = "final_fe_dataset.csv"

df = pd.read_csv(IN_PATH)

def safe_div(a, b):
    """Divide Series a by Series b, returning NaN (not inf) where b is zero."""
    b = b.replace({0: np.nan})
    return a / b

# Parse month into a real datetime.
# If your month is already YYYY-MM-DD, this still works safely.
# If it's YYYY-MM, we append "-01" to represent the month start.
month_as_str = df["month"].astype(str)
if month_as_str.str.len().eq(7).all():  # looks like YYYY-MM
    df["month"] = pd.to_datetime(month_as_str + "-01", errors="coerce")
else:
    df["month"] = pd.to_datetime(month_as_str, errors="coerce")

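# Sort so that lags and rolling windows follow calendar order within each ZIP.
# (A no-op if the cleaned dataset is already ordered this way.)
df = df.sort_values(["zip", "month"]).reset_index(drop=True)

g = df.groupby("zip", sort=False)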

## Cross-Sectional Feature Engineering: Rates, Shares, and Log Transformations

The following features are **cross-sectional transformations**, meaning they describe the structural characteristics of a neighborhood at a given point in time rather than its temporal dynamics.

- **Crime per 1,000 residents** converts raw crime counts into an intensity measure. This ensures comparability across ZIP codes with different population sizes, preventing larger areas from appearing riskier simply due to scale.

- **Education shares** transform degree counts into proportions of the total population. Proportions are generally more interpretable and stable than raw counts, allowing the model to learn structural differences in educational attainment.

- **Log-transformed median income** (computed as log(1 + income) via `np.log1p`) reduces right-skewness in income distributions. The log transformation stabilizes variance, reduces the influence of extreme values, and often improves the performance of linear models.

These transformations improve interpretability, reduce scale bias, and provide a more meaningful representation of neighborhood characteristics for downstream modeling.
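
As a minimal sketch of these transformations, the block below applies them to a toy frame. The three rows and their values are invented for illustration, and `safe_div` is re-declared only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def safe_div(a, b):
    # Treat a zero denominator as missing instead of producing inf.
    return a / b.replace({0: np.nan})

# Hypothetical toy data: three ZIP codes, one with zero population.
toy = pd.DataFrame({
    "zip": ["10001", "10002", "10003"],
    "crime_count": [50, 5, 3],
    "population": [20000, 1000, 0],
    "median_income": [40000.0, 80000.0, 400000.0],
})

# Rate per 1,000 residents: comparable across ZIP codes of different size.
toy["crime_per_1000"] = safe_div(toy["crime_count"], toy["population"]) * 1000

# log(1 + income): compresses the right tail of the income distribution.
toy["log_median_income"] = np.log1p(toy["median_income"])

print(toy[["zip", "crime_per_1000", "log_median_income"]])
```

In this toy data the ZIP code with ten times as many incidents has half the per-capita rate of the smaller one, and the zero-population row becomes NaN rather than inf. The tenfold spread in raw income shrinks to a difference of about 2.3 on the log scale.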


In [4]:
# Crime intensity (per 1,000 residents)
df["crime_per_1000"] = safe_div(df["crime_count"], df["population"]) * 1000

# Education shares (normalize by population)
df["higher_ed_share"] = safe_div(df["higher_ed_count"], df["population"])
df["edu_bachelors_share"] = safe_div(df["edu_bachelors"], df["population"])
df["edu_masters_share"] = safe_div(df["edu_masters"], df["population"])
df["edu_professional_share"] = safe_div(df["edu_professional"], df["population"])
df["edu_doctorate_share"] = safe_div(df["edu_doctorate"], df["population"])

# Graduate-or-higher share (Masters + Professional + Doctorate)
df["edu_gradplus_share"] = safe_div(
    df["edu_masters"] + df["edu_professional"] + df["edu_doctorate"],
    df["population"]
)

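# Log-transform income (reduces skew; often improves model stability).
# log1p is undefined for values <= -1, so negative incomes (assumed here to be
# missing-value codes rather than real observations) are set to NaN first.
income = df["median_income"].where(df["median_income"] >= 0)
df["log_median_income"] = np.log1p(income)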




## Time-Series Features

Housing markets exhibit strong temporal dependence. Current prices are influenced by recent history, and neighborhood conditions may evolve gradually rather than abruptly. To capture these dynamics, we construct several types of time-based features.

**Lag features** (1-, 3-, 6-, and 12-month lags) capture persistence at short horizons and seasonal effects at the 12-month horizon. These allow the model to learn autoregressive behavior, such as housing price momentum.

**Percentage change features** (month-over-month and year-over-year) represent growth dynamics rather than static levels. These features help capture acceleration, deceleration, or structural shifts in trends.

**Rolling means** summarize recent history and reduce short-term noise. They provide a smoothed representation of local conditions and often improve generalization.

**Rolling standard deviations** measure volatility, capturing the stability or instability of housing prices or crime rates. Volatility can signal transitional market phases or changing neighborhood conditions.

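To avoid data leakage, rolling statistics are computed within each ZIP code using only past information: the series is shifted by one month before the rolling window is applied, so a window of width w for month t covers months t-w through t-1. Lag features are past-only by construction. The percentage-change features, by contrast, include the current month's value, so they would need to be lagged before being used to predict that same month.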
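
The shift-then-roll pattern can be illustrated on a small panel. This is a sketch under stated assumptions: the two ZIP codes `"A"` and `"B"`, their prices, and the window width of 2 are all hypothetical, and the percentage change is computed directly from the lag (which matches `pct_change(1)` when no values are missing).

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two ZIP codes, four months each.
toy = pd.DataFrame({
    "zip": ["A"] * 4 + ["B"] * 4,
    "month": list(pd.date_range("2020-01-01", periods=4, freq="MS")) * 2,
    "zhvi": [100.0, 110.0, 121.0, 133.1, 200.0, 190.0, 180.0, 170.0],
})

# Lags and windows only make sense in calendar order within each ZIP.
toy = toy.sort_values(["zip", "month"]).reset_index(drop=True)
grp = toy.groupby("zip", sort=False)["zhvi"]

# 1-month lag: the first month of each ZIP has no history, so it is NaN.
toy["zhvi_lag1"] = grp.shift(1)

# Month-over-month growth relative to the previous month in the same ZIP.
toy["zhvi_pct_change_1m"] = toy["zhvi"] / toy["zhvi_lag1"] - 1

# Past-only rolling mean: shift by one month, then average the last 2 values,
# all inside each ZIP so that no window mixes two ZIP codes.
toy["zhvi_roll_mean_2"] = grp.transform(lambda s: s.shift(1).rolling(2).mean())

print(toy)
```

For ZIP `"A"`, the rolling mean in the third month is the average of the first two months (105.0) and does not include the third month's own price. The first row of ZIP `"B"` has a missing lag instead of inheriting the last value of ZIP `"A"`.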


In [7]:
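def add_ts_features(col, lags=(1, 3, 6, 12), rolls=(3, 6, 12), add_std=(6, 12), prefix=None):
    """Add per-ZIP lag, percentage-change, and rolling features for `col` to df.

    lags: lag lengths in months; rolls: rolling-mean windows;
    add_std: rolling-std windows; prefix: name stem for new columns (defaults to col).
    """
    if prefix is None:
        prefix = col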

    # Lags
    for k in lags:
        df[f"{prefix}_lag{k}"] = g[col].shift(k)

    # Month-over-month and year-over-year changes
    df[f"{prefix}_pct_change_1m"] = g[col].pct_change(1)
    df[f"{prefix}_pct_change_12m"] = g[col].pct_change(12)

    # Rolling mean and std computed using only past information (shifted by 1).
    # The shifted values are re-grouped by ZIP so a window never spans two ZIP codes.
    shifted_by_zip = g[col].shift(1).groupby(df["zip"], sort=False)
    for w in rolls:
        df[f"{prefix}_roll_mean_{w}"] = (
            shifted_by_zip.rolling(w).mean().reset_index(level=0, drop=True)
        )
    for w in add_std:
        df[f"{prefix}_roll_std_{w}"] = (
            shifted_by_zip.rolling(w).std().reset_index(level=0, drop=True)
        )

# Price dynamics (ZHVI)
add_ts_features("zhvi", prefix="zhvi")

# Crime dynamics (use rate, not raw count)
add_ts_features("crime_per_1000", prefix="crime_per_1000", add_std=(6, 12))


## Seasonality Encoding

Month-of-year effects are cyclical. Using the numeric month (1–12) directly imposes an artificial ordering where December and January appear far apart. To preserve the cyclical structure of time, we encode month using sine and cosine transformations.

This approach allows the model to learn seasonal housing patterns (e.g., spring buying cycles) without introducing discontinuities in the feature space.
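
The encoding maps month m to the point (sin(2πm/12), cos(2πm/12)) on the unit circle. The sketch below checks the adjacency property numerically; the helper `encoded_distance` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Encode months 1..12 as points on the unit circle.
months = np.arange(1, 13)
angle = 2 * np.pi * months / 12
month_sin, month_cos = np.sin(angle), np.cos(angle)

def encoded_distance(m1, m2):
    """Euclidean distance between two months (1-12) in (sin, cos) space."""
    i, j = m1 - 1, m2 - 1
    return np.hypot(month_sin[i] - month_sin[j], month_cos[i] - month_cos[j])

dec_jan = encoded_distance(12, 1)  # adjacent across the year boundary
jan_feb = encoded_distance(1, 2)   # adjacent within the year
jan_jul = encoded_distance(1, 7)   # half a year apart

print(f"Dec-Jan: {dec_jan:.3f}  Jan-Feb: {jan_feb:.3f}  Jan-Jul: {jan_jul:.3f}")
```

December to January is exactly as far as January to February in the encoded space, while months half a year apart sit at opposite ends of the circle. Both columns are needed: sine or cosine alone maps two different months to the same value.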


In [10]:
df["year"] = df["month"].dt.year
df["month_num"] = df["month"].dt.month

df["month_sin"] = np.sin(2 * np.pi * df["month_num"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month_num"] / 12)


## Export

In [11]:
df.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)
print("Shape:", df.shape)

Saved: final_fe_dataset.csv
Shape: (940922, 52)


## Overall Impact on Modeling

Collectively, these engineered features allow the model to learn:

- Structural neighborhood differences (rates and shares)
- Temporal persistence and momentum (lags)
- Trend dynamics (percentage changes)
- Stability and risk (rolling volatility)
- Seasonal patterns (cyclical encoding)

By transforming raw observations into economically meaningful signals, feature engineering aims to improve the predictive performance, interpretability, and robustness of the downstream models.