# Tutorial 7: Real Data Applications

This tutorial demonstrates **GMM and SMM estimation with real datasets**. We'll work through two complete applications:

1. **GMM**: Labor supply estimation using PSID data (Mroz 1987)
2. **SMM**: Income dynamics estimation using consumption data

## What You'll Learn

1. How to apply GMM to real microeconomic data
2. Instrumental variables estimation for labor supply
3. How to use SMM for models with latent dynamics
4. Best practices for real-world estimation

## Prerequisites

- Completed Tutorials 1-5
- Understanding of IV estimation
- Basic knowledge of labor economics

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sts

from momentest import (
    gmm_estimate,
    smm_estimate,
    load_labor_supply,
    load_consumption,
    list_datasets,
    j_test,
    table_estimates,
    confidence_interval,
    plot_moment_comparison,
)

np.random.seed(42)
np.set_printoptions(precision=4, suppress=True)

## Available Datasets

Let's see what real datasets are available in `momentest`:

In [None]:
print("Available datasets:")
for name in list_datasets():
    print(f"  - {name}")

---

# Part 1: GMM with Labor Supply Data

## The Mroz (1987) Dataset

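This classic dataset from the Panel Study of Income Dynamics (PSID) contains labor supply data for 753 married women in 1976. Hours and wages are observed only for the women who worked (428 in the original sample), so the estimation below is restricted to working women. It's widely used for: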

- Labor supply estimation (wage elasticity)
- Sample selection models (Heckman correction)
- IV/GMM examples in econometrics courses

### The Economic Question

**How responsive is labor supply to wages?**

The wage elasticity of labor supply (γ) measures the percentage change in hours worked for a one percent change in the wage:

$\gamma = \frac{\partial \ln(hours)}{\partial \ln(wage)}$

This is crucial for:
- Tax policy (how do taxes affect work incentives?)
- Welfare programs (how do benefits affect labor supply?)
- Gender wage gap analysis

In [None]:
# Load the labor supply dataset
labor_data = load_labor_supply()

# Print detailed information
labor_data.info()

In [None]:
# Explore the data
print(f"\nSample size: {labor_data.n} working women")
print("\nVariable summary:")
print("-" * 60)
for var in ['log_hours', 'log_wage', 'age', 'education', 'experience']:
    x = labor_data.data[var]
    print(f"{var:15s}: mean={x.mean():8.2f}, std={x.std():8.2f}, min={x.min():8.2f}, max={x.max():8.2f}")

In [None]:
# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Hours distribution
ax = axes[0, 0]
ax.hist(labor_data.data['hours'], bins=30, edgecolor='black', alpha=0.7)
ax.set_xlabel('Annual Hours Worked')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Hours Worked')

# Wage distribution
ax = axes[0, 1]
ax.hist(labor_data.data['wage'], bins=30, edgecolor='black', alpha=0.7, color='orange')
ax.set_xlabel('Hourly Wage ($)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Wages')

# Hours vs Wage scatter
ax = axes[1, 0]
ax.scatter(labor_data.data['log_wage'], labor_data.data['log_hours'], alpha=0.5, s=20)
ax.set_xlabel('Log Wage')
ax.set_ylabel('Log Hours')
ax.set_title('Log Hours vs Log Wage')

# Education vs Wage
ax = axes[1, 1]
ax.scatter(labor_data.data['education'], labor_data.data['log_wage'], alpha=0.5, s=20, color='green')
ax.set_xlabel('Years of Education')
ax.set_ylabel('Log Wage')
ax.set_title('Log Wage vs Education')

plt.tight_layout()
plt.show()

## The Endogeneity Problem

### Why OLS Fails

Consider the labor supply equation:

$\ln(hours_i) = \alpha + \gamma \ln(wage_i) + \beta' X_i + \varepsilon_i$

**Problem**: Wages are **endogenous**!

- Unobserved ability affects both wages and hours
- High-ability workers earn more AND may work more (or less!)
- $Cov(\ln(wage), \varepsilon) \neq 0$

**Result**: OLS estimates of γ are biased.
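
To see the bias concretely, here is a minimal simulation sketch. Everything in it is made up for illustration: the data-generating process, the coefficients, and the variable names `ability` and `z` are assumptions and have nothing to do with the Mroz data. `z` plays the role of a valid instrument (it shifts wages but is unrelated to ability).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_gamma = 0.3  # hypothetical "true" wage elasticity

ability = rng.normal(size=n)   # unobserved; drives both wages and hours
z = rng.normal(size=n)         # instrument: moves wages, independent of ability
log_wage = 0.5 * z + 0.8 * ability + rng.normal(size=n)
eps = 0.6 * ability + rng.normal(size=n)          # error term contains ability
log_hours = 1.0 + true_gamma * log_wage + eps


def slope(y, x, instrument):
    """Slope cov(instrument, y) / cov(instrument, x); equals OLS when instrument is x."""
    return np.cov(instrument, y)[0, 1] / np.cov(instrument, x)[0, 1]


gamma_ols = slope(log_hours, log_wage, log_wage)  # ignores endogeneity
gamma_iv = slope(log_hours, log_wage, z)          # uses the instrument
print(f"true: {true_gamma:.3f}  OLS: {gamma_ols:.3f}  IV: {gamma_iv:.3f}")
```

Because `ability` enters both `log_wage` and `eps` with the same sign here, OLS overstates γ in this toy setup. With opposite signs it would understate it, which is why the direction of the bias in real data is not obvious in advance.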

In [None]:
# OLS estimation (biased)
Y = labor_data.data['log_hours']
X_ols = np.column_stack([
    labor_data.data['constant'],
    labor_data.data['log_wage'],
    labor_data.data['education'],
    labor_data.data['experience'],
    labor_data.data['experience_sq'] / 100,  # Scale for numerical stability
])

beta_ols = np.linalg.lstsq(X_ols, Y, rcond=None)[0]

print("OLS Estimates (potentially biased):")
print("=" * 50)
print(f"{'Parameter':<20} {'Estimate':>12}")
print("-" * 35)
print(f"{'Constant':<20} {beta_ols[0]:>12.4f}")
print(f"{'γ (wage elasticity)':<20} {beta_ols[1]:>12.4f}")
print(f"{'Education':<20} {beta_ols[2]:>12.4f}")
print(f"{'Experience':<20} {beta_ols[3]:>12.4f}")
print(f"{'Experience²/100':<20} {beta_ols[4]:>12.4f}")
print("=" * 50)
print("\n⚠️  The wage elasticity may be biased due to endogeneity!")

## GMM with Instrumental Variables

### The Solution: Use Instruments

We need instruments $Z$ that:
1. **Relevance**: $Cov(Z, \ln(wage)) \neq 0$ (correlated with wage)
2. **Exclusion**: $Cov(Z, \varepsilon) = 0$ (uncorrelated with labor supply error)
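
Relevance can be checked in the data. One common diagnostic is the first-stage F statistic on the excluded instruments, and a widely used rule of thumb treats values below about 10 as a sign of weak instruments. The sketch below is a hand-rolled version under homoskedasticity; `first_stage_f` is a hypothetical helper (not part of `momentest`) and the data are synthetic.

```python
import numpy as np


def first_stage_f(endog, controls, excluded):
    """Homoskedastic F statistic for H0: all coefficients on `excluded` are zero.

    endog:    (n,) endogenous regressor, e.g. log wage
    controls: (n, q) exogenous regressors, including the constant
    excluded: (n,) or (n, m) instruments left out of the main equation
    """
    def rss(X):
        coef = np.linalg.lstsq(X, endog, rcond=None)[0]
        resid = endog - X @ coef
        return resid @ resid

    X_restricted = controls
    X_full = np.column_stack([controls, excluded])
    m = X_full.shape[1] - X_restricted.shape[1]   # number of excluded instruments
    dof = len(endog) - X_full.shape[1]            # residual degrees of freedom
    return ((rss(X_restricted) - rss(X_full)) / m) / (rss(X_full) / dof)


# Synthetic check with made-up data: one relevant and one irrelevant instrument
rng = np.random.default_rng(0)
n = 500
controls = np.column_stack([np.ones(n), rng.normal(size=n)])
z_strong, z_weak = rng.normal(size=n), rng.normal(size=n)
wage = 0.3 * controls[:, 1] + 0.5 * z_strong + rng.normal(size=n)

f_strong = first_stage_f(wage, controls, z_strong)
f_weak = first_stage_f(wage, controls, z_weak)
print(f"F (relevant instrument):   {f_strong:.1f}")
print(f"F (irrelevant instrument): {f_weak:.1f}")
```

The exclusion condition, by contrast, cannot be verified this way, because $\varepsilon$ is never observed. It has to be argued on economic grounds.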

### Classic Instruments: Husband's Characteristics

- **Husband's education** (`heducation`)
- **Husband's age** (`hage`)

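**Why plausibly valid?**
- Relevance (assortative mating): High-education husbands tend to have high-education, high-wage wives. Because the wife's own education is already a control, the instrument must predict her wage beyond that, which is what the first stage checks.
- Exclusion: Husband's education is assumed not to affect the wife's hours directly, only through her wage. This is an assumption rather than a tested fact; for example, a high-earning husband could lower her hours through an income effect.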

### GMM Moment Conditions

The residual $\varepsilon_i = \ln(hours_i) - \alpha - \gamma \ln(wage_i) - \beta' X_i$ should be orthogonal to instruments:

1. $E[\varepsilon] = 0$
2. $E[\varepsilon \cdot heducation] = 0$ (IV for wage)
3. $E[\varepsilon \cdot education] = 0$ (exogenous)
4. $E[\varepsilon \cdot experience] = 0$ (exogenous)
5. $E[\varepsilon \cdot experience^2] = 0$ (exogenous)
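
With as many instruments as parameters and a linear model, these moment conditions can be solved exactly: setting $\frac{1}{n} Z'(y - X\theta) = 0$ gives $\hat\theta = (Z'X)^{-1} Z'y$, the textbook IV estimator. The sketch below illustrates this on synthetic data with fewer regressors than the tutorial's model. All variable values are made up, and it is meant as a cross-check on what a numerical GMM optimizer should converge to, not as a replacement for it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Synthetic stand-ins for the tutorial's variables (not the Mroz data)
edu = rng.normal(12, 2, n)
hedu = 0.5 * edu + rng.normal(0, 2, n)
ability = rng.normal(size=n)
log_wage = 0.1 * edu + 0.2 * hedu + 0.5 * ability + rng.normal(0, 0.5, n)
eps = 0.5 * ability + rng.normal(0, 0.5, n)
log_hours = 7.0 + 0.3 * log_wage - 0.02 * edu + eps

# Regressors X (with the endogenous wage) and instruments Z (wage swapped for hedu)
X = np.column_stack([np.ones(n), log_wage, edu])
Z = np.column_stack([np.ones(n), hedu, edu])

# Just-identified case: solve the sample moment conditions directly
theta_iv = np.linalg.solve(Z.T @ X, Z.T @ log_hours)

# Sample moments evaluated at the solution should be zero up to rounding
g_bar = Z.T @ (log_hours - X @ theta_iv) / n
print("theta_iv:", theta_iv)
print("sample moments at theta_iv:", g_bar)
```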

In [None]:
# Check instrument relevance: First stage regression
# Regress log_wage on instruments
Z_first = np.column_stack([
    labor_data.data['constant'],
    labor_data.data['heducation'],
    labor_data.data['education'],
    labor_data.data['experience'],
    labor_data.data['experience_sq'] / 100,
])

gamma_first = np.linalg.lstsq(Z_first, labor_data.data['log_wage'], rcond=None)[0]
fitted_wage = Z_first @ gamma_first
residuals_first = labor_data.data['log_wage'] - fitted_wage
r_squared = 1 - np.var(residuals_first) / np.var(labor_data.data['log_wage'])

print("First Stage: log_wage on instruments")
print("=" * 50)
print(f"Coefficient on heducation: {gamma_first[1]:.4f}")
print(f"R-squared: {r_squared:.4f}")
print("\nRelevance requires a clearly nonzero heducation coefficient; check its t- or F-statistic, not only R-squared")

In [None]:
def labor_supply_moments(data, theta):
    """
    GMM moment conditions for labor supply estimation.
    
    Model: ln(hours) = α + γ*ln(wage) + β₁*edu + β₂*exp + β₃*exp² + ε
    
    Args:
        data: Dictionary with labor supply variables
        theta: [alpha, gamma, beta_edu, beta_exp, beta_exp2]
    
    Returns:
        Moment conditions of shape (n, k)
    """
    alpha, gamma, beta_edu, beta_exp, beta_exp2 = theta
    
    # Compute residual
    residual = (data['log_hours'] 
                - alpha 
                - gamma * data['log_wage']
                - beta_edu * data['education']
                - beta_exp * data['experience']
                - beta_exp2 * data['experience_sq'] / 100)
    
    # Moment conditions: E[ε * Z] = 0
    moments = np.column_stack([
        residual,                              # E[ε] = 0
        residual * data['heducation'],         # E[ε * hedu] = 0 (IV)
        residual * data['education'],          # E[ε * edu] = 0
        residual * data['experience'],         # E[ε * exp] = 0
        residual * data['experience_sq'] / 100, # E[ε * exp²] = 0
    ])
    
    return moments

In [None]:
# GMM estimation - just identified (5 moments, 5 parameters)
result_labor = gmm_estimate(
    data=labor_data.data,
    moment_func=labor_supply_moments,
    bounds=[
        (0, 10),      # alpha (constant)
        (-2, 2),      # gamma (wage elasticity)
        (-0.5, 0.5),  # beta_edu
        (-0.5, 0.5),  # beta_exp
        (-0.1, 0.1),  # beta_exp2
    ],
    k=5,
    weighting="optimal",
    n_global=200,
    seed=42,
)

print(result_labor)

In [None]:
# Compare OLS vs GMM
print("\n" + "=" * 70)
print("COMPARISON: OLS vs GMM (IV)")
print("=" * 70)
print(f"{'Parameter':<20} {'OLS':>12} {'GMM (IV)':>12} {'SE':>12}")
print("-" * 60)
print(f"{'Constant':<20} {beta_ols[0]:>12.4f} {result_labor.theta[0]:>12.4f} {result_labor.se[0]:>12.4f}")
print(f"{'γ (wage elasticity)':<20} {beta_ols[1]:>12.4f} {result_labor.theta[1]:>12.4f} {result_labor.se[1]:>12.4f}")
print(f"{'Education':<20} {beta_ols[2]:>12.4f} {result_labor.theta[2]:>12.4f} {result_labor.se[2]:>12.4f}")
print(f"{'Experience':<20} {beta_ols[3]:>12.4f} {result_labor.theta[3]:>12.4f} {result_labor.se[3]:>12.4f}")
print(f"{'Experience²/100':<20} {beta_ols[4]:>12.4f} {result_labor.theta[4]:>12.4f} {result_labor.se[4]:>12.4f}")
print("=" * 70)

In [None]:
# Formatted results table
ci_lower, ci_upper = confidence_interval(result_labor.theta, result_labor.se)

print(table_estimates(
    theta=result_labor.theta,
    se=result_labor.se,
    param_names=["α (constant)", "γ (wage elasticity)", "β_edu", "β_exp", "β_exp²"],
    ci_lower=ci_lower,
    ci_upper=ci_upper,
))

### Interpretation

The **wage elasticity** γ tells us:

- A 1% increase in wages leads to approximately γ% change in hours worked
- Positive γ: Higher wages → more hours (substitution effect dominates)
- Negative γ: Higher wages → fewer hours (income effect dominates)
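
As a quick worked example of the first bullet: in a log-log model the exact response to a wage change of $x$% is $(1 + x/100)^{\gamma} - 1$, which is close to $\gamma x$% only for small changes. The values of γ below are illustrative, not estimates from this tutorial, and `hours_change_pct` is a hypothetical helper.

```python
import numpy as np


def hours_change_pct(gamma, wage_change_pct):
    """Exact % change in hours implied by a log-log model with elasticity gamma."""
    return (np.exp(gamma * np.log1p(wage_change_pct / 100)) - 1) * 100


for gamma in (-0.1, 0.1, 0.5):  # illustrative elasticities
    small = hours_change_pct(gamma, 1)
    large = hours_change_pct(gamma, 10)
    print(f"gamma={gamma:+.1f}: 1% wage rise -> {small:+.3f}% hours, "
          f"10% wage rise -> {large:+.3f}% hours")
```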

**Typical findings in the literature:**
- Women's labor supply elasticity: 0.1 to 0.5
- Men's labor supply elasticity: ~0 (inelastic)

## Overidentification Test

Let's add husband's age as an additional instrument to test overidentifying restrictions:

In [None]:
def labor_supply_moments_overid(data, theta):
    """
    Overidentified GMM: 6 moments, 5 parameters.
    Adds husband's age as additional instrument.
    """
    alpha, gamma, beta_edu, beta_exp, beta_exp2 = theta
    
    residual = (data['log_hours'] 
                - alpha 
                - gamma * data['log_wage']
                - beta_edu * data['education']
                - beta_exp * data['experience']
                - beta_exp2 * data['experience_sq'] / 100)
    
    moments = np.column_stack([
        residual,                              # E[ε] = 0
        residual * data['heducation'],         # E[ε * hedu] = 0 (IV)
        residual * data['hage'],               # E[ε * hage] = 0 (additional IV)
        residual * data['education'],          # E[ε * edu] = 0
        residual * data['experience'],         # E[ε * exp] = 0
        residual * data['experience_sq'] / 100, # E[ε * exp²] = 0
    ])
    
    return moments

# Estimate overidentified model
result_labor_overid = gmm_estimate(
    data=labor_data.data,
    moment_func=labor_supply_moments_overid,
    bounds=[
        (0, 10),      # alpha
        (-2, 2),      # gamma
        (-0.5, 0.5),  # beta_edu
        (-0.5, 0.5),  # beta_exp
        (-0.1, 0.1),  # beta_exp2
    ],
    k=6,  # Now 6 moments
    weighting="optimal",
    n_global=200,
    seed=42,
)

print("Overidentified estimates:")
print(f"  γ (wage elasticity) = {result_labor_overid.theta[1]:.4f} (SE: {result_labor_overid.se[1]:.4f})")

In [None]:
# J-test for overidentifying restrictions
j_result = j_test(
    objective=result_labor_overid.objective,
    n=labor_data.n,
    k=6,  # 6 moments
    p=5,  # 5 parameters
)

print(j_result)
print("\nInterpretation:")
if j_result.p_value > 0.05:
    print("  ✓ Fail to reject H₀: no evidence against the overidentifying restrictions")
else:
    print("  ⚠️ Reject H₀: At least one instrument may be invalid")
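
For reference, the statistic behind this test can be computed by hand. The sketch below assumes the textbook form $J = n \, \bar g' W \bar g$, which is asymptotically $\chi^2(k - p)$ when $W$ is the optimal weighting matrix. `hansen_j` is a hypothetical helper, not part of `momentest`, and whether `result.objective` is already scaled by $n$ is something to confirm in the library's documentation. The numbers fed in are made up, and the identity matrix stands in for the optimal weight purely to keep the arithmetic visible.

```python
import numpy as np
from scipy import stats


def hansen_j(g_bar, W, n, n_params):
    """Return (J, degrees of freedom, p-value) for J = n * g' W g."""
    g_bar = np.asarray(g_bar, dtype=float)
    df = g_bar.size - n_params
    if df <= 0:
        raise ValueError("The J-test needs more moments than parameters")
    j_stat = float(n * g_bar @ W @ g_bar)
    p_value = float(stats.chi2.sf(j_stat, df))  # upper-tail probability
    return j_stat, df, p_value


# Made-up sample moments for 6 conditions and 5 parameters
g_bar = np.array([0.01, -0.02, 0.005, 0.0, 0.0, 0.01])
W = np.eye(6)  # placeholder; in practice the inverse of the moment covariance
j_stat, df, p_value = hansen_j(g_bar, W, n=428, n_params=5)
print(f"J = {j_stat:.4f}, df = {df}, p-value = {p_value:.3f}")
```

A large p-value only means the data do not contradict the overidentifying restrictions. The test has no power against instruments that are all invalid in the same direction.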

## Robustness: Comparing Estimates

Let's compare our just-identified and overidentified estimates:

In [None]:
# Compare estimates
print("\nComparison of GMM Estimates:")
print("=" * 70)
print(f"{'Specification':<25} {'γ (wage elast)':>15} {'SE':>12}")
print("-" * 55)
print(f"{'Just-identified (k=5)':<25} {result_labor.theta[1]:>15.4f} {result_labor.se[1]:>12.4f}")
print(f"{'Overidentified (k=6)':<25} {result_labor_overid.theta[1]:>15.4f} {result_labor_overid.se[1]:>12.4f}")
print("=" * 70)
print("\nNote: For bootstrap inference, see the bootstrap() function in momentest.")
print("Bootstrap is especially useful for complex models where asymptotic SE may be unreliable.")
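
To make the bootstrap idea concrete, here is a minimal nonparametric (pairs) bootstrap for a simple IV slope on synthetic data. `iv_slope` and `bootstrap_se` are hypothetical helpers written for this sketch; they are not the `momentest` `bootstrap()` function, whose signature is not shown in this tutorial.

```python
import numpy as np


def iv_slope(y, x, z):
    """Simple IV slope with one regressor and one instrument."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]


def bootstrap_se(y, x, z, n_boot=500, seed=0):
    """Resample observations with replacement and re-estimate the slope each time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same rows for y, x and z
        draws[b] = iv_slope(y[idx], x[idx], z[idx])
    return draws.std(ddof=1), draws


# Synthetic data (made-up coefficients) with an endogenous regressor x
rng = np.random.default_rng(42)
n = 1_000
a = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.5 * z + 0.8 * a + rng.normal(size=n)
y = 1.0 + 0.3 * x + 0.6 * a + rng.normal(size=n)

se_boot, draws = bootstrap_se(y, x, z)
ci_low, ci_high = np.percentile(draws, [2.5, 97.5])
print(f"IV slope: {iv_slope(y, x, z):.3f}, bootstrap SE: {se_boot:.3f}")
print(f"95% percentile interval: [{ci_low:.3f}, {ci_high:.3f}]")
```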

---

# Part 2: SMM for Income Dynamics

Now let's use **Simulated Method of Moments (SMM)** for a model where analytical moments are difficult to compute.

## The Model: AR(1) Income Process with Measurement Error

Many economic models assume income follows an AR(1) process around a mean $\mu$:

$y_t^* = \mu + \rho (y_{t-1}^* - \mu) + \sigma_\eta \eta_t, \quad \eta_t \sim N(0, 1)$

But we observe income with measurement error:

$y_t = y_t^* + \sigma_\varepsilon \varepsilon_t, \quad \varepsilon_t \sim N(0, 1)$

**Parameters to estimate:**
- $\mu$: Unconditional mean of the process
- $\rho$: Persistence of income shocks
- $\sigma_\eta$: Standard deviation of the persistent (AR(1)) shocks
- $\sigma_\varepsilon$: Standard deviation of measurement error

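**Why SMM?**
- The latent process $y_t^*$ is never observed directly, only its noisy measurement $y_t$
- SMM lets us simulate the model and match moments directly, and the same workflow carries over to richer models whose moments have no closed form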

## Using Consumption Data

We'll use the consumption growth data to estimate income dynamics. The idea:
- Consumption growth reflects underlying income shocks
- We can match moments of consumption growth to identify income process parameters
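
Before choosing moments, it helps to check which parameters they can pin down. For a stationary AR(1) observed with white-noise measurement error, the population moments are $Var(y) = \sigma_\eta^2/(1-\rho^2) + \sigma_\varepsilon^2$ and $Corr(y_t, y_{t-j}) = \rho^j \, \frac{\sigma_\eta^2/(1-\rho^2)}{Var(y)}$ for $j \geq 1$. The sketch below uses these formulas to build two different parameter vectors that share the same mean, variance, and first-order autocorrelation. The parameter values are illustrative, and `ar1_noise_moments` is a hypothetical helper.

```python
import numpy as np


def ar1_noise_moments(mu, rho, sigma_eta, sigma_eps):
    """Population [mean, variance, lag-1 autocorr, lag-2 autocorr] of the observed series."""
    var_star = sigma_eta**2 / (1 - rho**2)   # variance of the latent AR(1)
    var_y = var_star + sigma_eps**2          # measurement error adds variance
    acf1 = rho * var_star / var_y
    acf2 = rho**2 * var_star / var_y
    return np.array([mu, var_y, acf1, acf2])


theta_a = (1.008, 0.3, 0.010, 0.005)
m_a = ar1_noise_moments(*theta_a)

# Pick a different rho, then back out sigma_eta and sigma_eps so that
# the variance and the lag-1 autocorrelation are unchanged
rho_b = 0.6
var_star_b = m_a[2] * m_a[1] / rho_b
sigma_eta_b = np.sqrt(var_star_b * (1 - rho_b**2))
sigma_eps_b = np.sqrt(m_a[1] - var_star_b)
theta_b = (1.008, rho_b, sigma_eta_b, sigma_eps_b)
m_b = ar1_noise_moments(*theta_b)

print("theta_a moments:", m_a)
print("theta_b moments:", m_b)
```

The first three moments coincide while the lag-2 autocorrelation differs, so mean, variance, and first-order autocorrelation alone cannot separate $\rho$, $\sigma_\eta$, and $\sigma_\varepsilon$. Adding at least one higher-order autocorrelation is needed to identify all four parameters.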

In [None]:
# Load consumption data
cons_data = load_consumption()
cons_data.info()

In [None]:
# Extract consumption growth
c_growth = cons_data.data['c_growth']
c_growth_lag1 = cons_data.data['c_growth_lag1']

print(f"Sample size: {len(c_growth)} quarters")
print("\nConsumption growth statistics:")
print(f"  Mean: {c_growth.mean():.4f} ({(c_growth.mean()-1)*100:.2f}% per quarter)")
print(f"  Std:  {c_growth.std():.4f}")
print(f"  Autocorrelation: {np.corrcoef(c_growth[1:], c_growth[:-1])[0,1]:.4f}")

In [None]:
# Visualize consumption growth
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Time series
ax = axes[0]
ax.plot(c_growth, linewidth=0.8)
ax.axhline(1.0, color='red', linestyle='--', alpha=0.5)
ax.set_xlabel('Quarter')
ax.set_ylabel('Consumption Growth (C_t / C_{t-1})')
ax.set_title('U.S. Consumption Growth (1947-2025)')

# Distribution
ax = axes[1]
ax.hist(c_growth, bins=50, density=True, edgecolor='black', alpha=0.7)
ax.axvline(c_growth.mean(), color='red', linestyle='--', label=f'Mean: {c_growth.mean():.4f}')
ax.set_xlabel('Consumption Growth')
ax.set_ylabel('Density')
ax.set_title('Distribution of Consumption Growth')
ax.legend()

plt.tight_layout()
plt.show()

## SMM Setup

For SMM, we need:
1. **Target moments** from the data
2. **Simulation function** that generates data given parameters
3. **Moment function** that computes moments from simulated data

In [None]:
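# Target moments from data
# We'll match: mean, variance, and first-order autocorrelation.
# Caution: that is 3 moments for 4 parameters, so rho, sigma_eta and sigma_eps
# are not separately pinned down by these moments alone (see Exercise 2).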
data_mean = c_growth.mean()
data_var = c_growth.var()
data_autocorr = np.corrcoef(c_growth[1:], c_growth[:-1])[0, 1]

data_moments = np.array([data_mean, data_var, data_autocorr])

print("Target moments from data:")
print(f"  Mean:            {data_mean:.6f}")
print(f"  Variance:        {data_var:.6f}")
print(f"  Autocorrelation: {data_autocorr:.6f}")

In [None]:
def simulate_ar1_measurement_error(theta, shocks):
    """
    Simulate AR(1) process with measurement error.
    
    Model:
        y*_t = mu + rho * (y*_{t-1} - mu) + sigma_eta * eta_t
        y_t = y*_t + sigma_eps * eps_t
    
    Args:
        theta: [mu, rho, sigma_eta, sigma_eps]
        shocks: Array of shape (n_sim, T, 2) - [eta, eps] shocks
    
    Returns:
        Simulated observed series of shape (n_sim, T)
    """
    mu, rho, sigma_eta, sigma_eps = theta
    
    # Ensure positive std devs
    sigma_eta = max(sigma_eta, 1e-6)
    sigma_eps = max(sigma_eps, 1e-6)
    rho = np.clip(rho, -0.999, 0.999)
    
    n_sim, T, _ = shocks.shape
    
    # Initialize latent process
    y_star = np.zeros((n_sim, T))
    y_star[:, 0] = mu + sigma_eta * shocks[:, 0, 0] / np.sqrt(1 - rho**2)
    
    # Simulate AR(1)
    for t in range(1, T):
        y_star[:, t] = mu + rho * (y_star[:, t-1] - mu) + sigma_eta * shocks[:, t, 0]
    
    # Add measurement error
    y_obs = y_star + sigma_eps * shocks[:, :, 1]
    
    return y_obs


def compute_moments_ar1(sim_data):
    """
    Compute moments from simulated data.
    
    Args:
        sim_data: Simulated series of shape (n_sim, T)
    
    Returns:
        Moments of shape (n_sim, 3): [mean, variance, autocorr]
    """
    n_sim, T = sim_data.shape
    
    # Mean
    means = sim_data.mean(axis=1)
    
    # Variance
    variances = sim_data.var(axis=1)
    
    # Autocorrelation
    autocorrs = np.zeros(n_sim)
    for i in range(n_sim):
        if variances[i] > 1e-10:
            autocorrs[i] = np.corrcoef(sim_data[i, 1:], sim_data[i, :-1])[0, 1]
        else:
            autocorrs[i] = 0.0
    
    return np.column_stack([means, variances, autocorrs])
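
The per-row loop above runs on every objective evaluation, so it can dominate the runtime when `n_sim` is large. A possible vectorized alternative is sketched below. `lag1_autocorr_rows` is a hypothetical helper; it reproduces `np.corrcoef(x[1:], x[:-1])[0, 1]` row by row and returns 0 for rows with no variation, which is slightly different from the `1e-10` variance threshold used above.

```python
import numpy as np


def lag1_autocorr_rows(sim_data):
    """Lag-1 autocorrelation of each row of an (n_sim, T) array."""
    lead = sim_data[:, 1:]
    lag = sim_data[:, :-1]
    # Demean each sub-series separately, as np.corrcoef does
    lead_c = lead - lead.mean(axis=1, keepdims=True)
    lag_c = lag - lag.mean(axis=1, keepdims=True)
    num = (lead_c * lag_c).sum(axis=1)
    den = np.sqrt((lead_c**2).sum(axis=1) * (lag_c**2).sum(axis=1))
    out = np.zeros(sim_data.shape[0])
    np.divide(num, den, out=out, where=den > 0)  # leave 0 where a row is constant
    return out


# Check against the loop-based computation on random data
rng = np.random.default_rng(0)
demo = rng.standard_normal((50, 200))
loop_result = np.array([np.corrcoef(row[1:], row[:-1])[0, 1] for row in demo])
vec_result = lag1_autocorr_rows(demo)
print("max abs difference:", np.abs(loop_result - vec_result).max())
```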

In [None]:
# Test the simulation
T = len(c_growth)
n_sim_test = 100

np.random.seed(42)
test_shocks = np.random.randn(n_sim_test, T, 2)

# Test with reasonable parameters
test_theta = [1.008, 0.3, 0.01, 0.005]  # mu, rho, sigma_eta, sigma_eps
test_sim = simulate_ar1_measurement_error(test_theta, test_shocks)
test_moments = compute_moments_ar1(test_sim)

print("Test simulation:")
print(f"  Simulated mean:     {test_moments[:, 0].mean():.6f} (target: {data_mean:.6f})")
print(f"  Simulated variance: {test_moments[:, 1].mean():.6f} (target: {data_var:.6f})")
print(f"  Simulated autocorr: {test_moments[:, 2].mean():.6f} (target: {data_autocorr:.6f})")

In [None]:
# SMM estimation
# Note: We need wrapper functions for smm_estimate

def sim_func_wrapper(theta, shocks):
    """Wrapper for simulation function."""
    # Reshape shocks from (n_sim, T*2) to (n_sim, T, 2)
    n_sim = shocks.shape[0]
    T_local = shocks.shape[1] // 2
    shocks_reshaped = shocks.reshape(n_sim, T_local, 2)
    return simulate_ar1_measurement_error(theta, shocks_reshaped)

def moment_func_wrapper(sim_data):
    """Wrapper for moment function."""
    return compute_moments_ar1(sim_data)

# Run SMM
print("Running SMM estimation...")
result_smm = smm_estimate(
    sim_func=sim_func_wrapper,
    moment_func=moment_func_wrapper,
    data_moments=data_moments,
    bounds=[
        (0.99, 1.02),   # mu (mean consumption growth)
        (-0.5, 0.9),    # rho (persistence)
        (0.001, 0.05),  # sigma_eta (persistent shock std)
        (0.001, 0.05),  # sigma_eps (measurement error std)
    ],
    n_sim=500,
    shock_dim=T * 2,  # T periods × 2 shocks per period
    seed=42,
    weighting="optimal",
    n_global=100,
)

print(result_smm)

In [None]:
# Display results
print("\n" + "=" * 70)
print("SMM ESTIMATION RESULTS: AR(1) with Measurement Error")
print("=" * 70)

ci_lower_smm, ci_upper_smm = confidence_interval(result_smm.theta, result_smm.se)

print(table_estimates(
    theta=result_smm.theta,
    se=result_smm.se,
    param_names=["μ (mean)", "ρ (persistence)", "σ_η (AR(1) shock)", "σ_ε (meas error)"],
    ci_lower=ci_lower_smm,
    ci_upper=ci_upper_smm,
))

In [None]:
# Check moment fit
print("\nMoment Fit:")
print("=" * 50)
print(f"{'Moment':<20} {'Data':>12} {'Model':>12} {'Diff':>12}")
print("-" * 50)
moment_names = ['Mean', 'Variance', 'Autocorrelation']
for i, name in enumerate(moment_names):
    diff = result_smm.sim_moments[i] - data_moments[i]
    print(f"{name:<20} {data_moments[i]:>12.6f} {result_smm.sim_moments[i]:>12.6f} {diff:>12.6f}")
print("=" * 50)

In [None]:
# Visualize moment fit
fig = plot_moment_comparison(
    data_moments=data_moments,
    model_moments=result_smm.sim_moments,
    moment_names=moment_names,
)
plt.suptitle("SMM Moment Fit: AR(1) with Measurement Error", y=1.02)
plt.tight_layout()
plt.show()

### Interpretation

The estimated parameters tell us about consumption dynamics:

- **μ ≈ 1.008**: Average quarterly consumption growth of ~0.8%
- **ρ**: Persistence of consumption shocks (how much past shocks affect current consumption)
- **σ_η**: Volatility of the persistent (AR(1)) consumption shocks
- **σ_ε**: Measurement error in consumption data

**Economic implications:**
- Low ρ suggests consumption shocks are not very persistent
- The ratio σ_η/σ_ε tells us about signal-to-noise in the data

## Comparing Simulated vs Actual Data

In [None]:
# Simulate data at estimated parameters
np.random.seed(123)
final_shocks = np.random.randn(1000, T, 2)
sim_at_estimate = simulate_ar1_measurement_error(result_smm.theta, final_shocks)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Distribution comparison
ax = axes[0]
ax.hist(c_growth, bins=50, density=True, alpha=0.6, label='Data', edgecolor='black')
ax.hist(sim_at_estimate.flatten(), bins=50, density=True, alpha=0.6, label='Simulated', edgecolor='black')
ax.set_xlabel('Consumption Growth')
ax.set_ylabel('Density')
ax.set_title('Distribution: Data vs Simulated')
ax.legend()

# Autocorrelation comparison
ax = axes[1]
max_lag = 10
data_acf = [np.corrcoef(c_growth[lag:], c_growth[:-lag])[0,1] if lag > 0 else 1.0 for lag in range(max_lag+1)]
sim_acf = []
for lag in range(max_lag+1):
    if lag == 0:
        sim_acf.append(1.0)
    else:
        # Average over the first 100 simulated paths to keep this cell fast
        acfs = [np.corrcoef(sim_at_estimate[i, lag:], sim_at_estimate[i, :-lag])[0,1] 
                for i in range(100)]
        sim_acf.append(np.mean(acfs))

ax.bar(np.arange(max_lag+1) - 0.15, data_acf, width=0.3, label='Data', alpha=0.7)
ax.bar(np.arange(max_lag+1) + 0.15, sim_acf, width=0.3, label='Simulated', alpha=0.7)
ax.set_xlabel('Lag')
ax.set_ylabel('Autocorrelation')
ax.set_title('Autocorrelation Function')
ax.legend()
ax.set_xticks(range(max_lag+1))

plt.tight_layout()
plt.show()

---

# Summary

## What We Covered

### Part 1: GMM with Labor Supply Data
- Loaded and explored the Mroz (1987) PSID dataset
- Identified the endogeneity problem in labor supply estimation
- Used husband's characteristics as instruments
- Estimated wage elasticity of labor supply via GMM
- Tested overidentifying restrictions with J-test

### Part 2: SMM for Income Dynamics
- Modeled consumption as AR(1) with measurement error
- Defined simulation and moment functions
- Estimated parameters by matching mean, variance, and autocorrelation
- Validated model fit by comparing simulated vs actual data

## Key Takeaways

1. **GMM is powerful for IV estimation** with real microdata
2. **SMM is essential** when analytical moments are unavailable
3. **Always check instrument validity** (first stage, J-test)
4. **Moment selection matters** - choose informative moments
5. **Bootstrap** (see Tutorial 4) provides robust inference for complex models

## Exercises

1. **Labor supply**: Try different instruments (e.g., husband's wage). Does the J-test still pass?
2. **Income dynamics**: Add higher-order autocorrelations as moments. Does estimation improve?
3. **Subsample analysis**: Estimate labor supply separately for women with/without children.
4. **Model comparison**: Compare AR(1) vs AR(2) for consumption dynamics using SMM.

## Next Steps

- See **Tutorial 6** for advanced structural models (Euler equations, dynamic discrete choice)
- Explore the `asset_pricing` dataset for CCAPM estimation

In [None]:
# Exercise starter: Labor supply by presence of young children
print("Exercise: Labor supply by presence of young children")
print("=" * 60)

for has_kids, label in [(0, 'No young kids'), (1, 'Has young kids')]:
    # Filter data: women with at least one young child vs. none
    young = labor_data.data['youngkids']
    mask = (young > 0) if has_kids else (young == 0)
    
    subset_data = {k: v[mask] for k, v in labor_data.data.items()}
    n_subset = int(mask.sum())
    
    if n_subset < 50:
        print(f"\n{label}: Too few observations ({n_subset})")
        continue
    
    result_subset = gmm_estimate(
        data=subset_data,
        moment_func=labor_supply_moments,
        bounds=[(0, 10), (-2, 2), (-0.5, 0.5), (-0.5, 0.5), (-0.1, 0.1)],
        k=5,
        weighting="optimal",
        n_global=100,
        seed=42,
    )
    
    print(f"\n{label} (n={n_subset}):")
    print(f"  γ (wage elasticity) = {result_subset.theta[1]:.4f} (SE: {result_subset.se[1]:.4f})")