## Problem Set 3

# Proof for Problem 1

## Goal
Show that, when $x_i$ contains a constant term, the $r$-th sample quantile of the residuals $\hat{u}_{ri}$ is $0$.

---

1. **Quantile Regression Estimator**

   The $r$-th quantile regression estimator $\hat{\beta}_r$ solves:

   $$
   \hat{\beta}_r \;=\; \arg \min_{\beta} \sum_{i=1}^n \rho_r\bigl(y_i - x_i'\beta\bigr),
   $$

   where

   $$
   \rho_r(u) \;=\; u\bigl(r - \mathbf{1}\{\,u < 0\}\bigr).
   $$

   Since $\rho_r(u)$ is not differentiable at $u=0$, the first-order conditions are understood in a **subgradient sense**.

2. **First-Order (Subgradient) Condition**

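   Let $\hat{u}_{ri} = y_i - x_i'\hat{\beta}_r$ denote the residuals. The subdifferential of $\rho_r$ is the single value $r - \mathbf{1}\{u < 0\}$ for $u \neq 0$ and the interval $[r-1,\, r]$ at $u = 0$. Because the objective is convex, $\hat{\beta}_r$ is a minimizer if and only if $0$ belongs to the subdifferential of the objective, i.e., there exist weights $a_i \in [r-1,\, r]$ such that

   $$
   \sum_{i:\,\hat{u}_{ri} \neq 0} x_i \Bigl(r - \mathbf{1}\{\hat{u}_{ri} < 0\}\Bigr) \;+\; \sum_{i:\,\hat{u}_{ri} = 0} x_i\, a_i \;=\; 0.
   $$

   When no residual is exactly zero this reduces to $\sum_{i=1}^n x_i \bigl(r - \mathbf{1}\{\hat{u}_{ri} < 0\}\bigr) = 0$. In general some residuals are exactly zero, so the second sum cannot be dropped.

3. **Condition for the Constant Term**

   Suppose $x_i$ includes a constant, i.e., one component of $x_i$ is $1$. Let $N_-$, $N_0$, and $N_+$ be the numbers of negative, zero, and positive residuals, so that $N_- + N_0 + N_+ = n$. The component of the condition that corresponds to the constant is

   $$
   r\,N_+ \;-\; (1-r)\,N_- \;+\; \sum_{i:\,\hat{u}_{ri} = 0} a_i \;=\; 0.
   $$

   Since each $a_i$ lies in $[r-1,\, r]$, the last sum lies between $-(1-r)N_0$ and $rN_0$. The upper bound gives $rN_+ - (1-r)N_- + rN_0 \ge 0$, i.e., $N_- \le nr$. The lower bound gives $rN_+ - (1-r)N_- - (1-r)N_0 \le 0$, i.e., $N_- + N_0 \ge nr$. Therefore

   $$
   \sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} < 0\} \;\le\; nr \;\le\; \sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} \le 0\}.
   $$

   Hence at most $nr$ residuals are negative and at least $nr$ residuals are nonpositive. (Exactly $nr$ negative residuals cannot be claimed in general: $nr$ need not be an integer.)

4. **Conclusion: $r$-th Sample Quantile is 0**

   By definition, a value $q$ is an $r$-th sample quantile of the residuals $\{\hat{u}_{ri}\}$ if it satisfies:

   $$
   \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} \leq q\} \;\ge\; r
   \quad \text{and} \quad
   \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} \geq q\} \;\ge\; 1 - r.
   $$

   Set $q = 0$. From step 3, $\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} \le 0\} = (N_- + N_0)/n \ge r$ and $\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\hat{u}_{ri} \ge 0\} = 1 - N_-/n \ge 1 - r$. Both inequalities hold, so

   $$
   \boxed{q = 0}
   $$

   is an $r$-th sample quantile of the residuals.

## Note
The subgradient perspective is needed because $\rho_r(u)$ has a kink at $u=0$. Including a constant in $x_i$ guarantees that at most $nr$ of the residuals are negative and at least $nr$ are nonpositive, so $0$ is an $r$-th sample quantile of those residuals.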

**Q.E.D.**
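
As a quick numerical illustration of the result (not part of the proof), the sketch below treats the simplest case, an intercept-only model with $x_i = 1$. The simulated data, the seed, and the helper names `check_loss` and `intercept_only_fit` are assumptions made for this example. Because the objective is piecewise linear and convex in the intercept, a minimizer can be found among the observed $y_i$, so a brute-force search is enough. The code then checks the two inequalities that define the $r$-th sample quantile at $q = 0$.

```python
import numpy as np

def check_loss(u, r):
    """Sum of the check function rho_r(u) = u * (r - 1{u < 0}) over residuals u."""
    u = np.asarray(u, dtype=float)
    return float(np.sum(u * (r - (u < 0))))

def intercept_only_fit(y, r):
    """Brute-force minimizer of sum_i rho_r(y_i - b), searching b over the data points."""
    losses = [check_loss(y - b, r) for b in y]
    return float(y[int(np.argmin(losses))])

rng = np.random.default_rng(0)  # illustrative seed
y = rng.normal(size=201)        # simulated outcome (assumption, not the birth-weight data)
r = 0.3

b_hat = intercept_only_fit(y, r)
resid = y - b_hat

# Shares of residuals that are <= 0 and >= 0
share_le = float(np.mean(resid <= 0))
share_ge = float(np.mean(resid >= 0))

print(f"share <= 0: {share_le:.4f} (needs >= {r})")
print(f"share >= 0: {share_ge:.4f} (needs >= {1 - r})")
```

Both shares should meet their bounds, which is exactly the statement that $0$ is an $r$-th sample quantile of the residuals.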

In [13]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.regression.quantile_regression import QuantReg
from sklearn.linear_model import LinearRegression
import itertools
from scipy import stats

In [8]:

# Load data
data = pd.read_csv('/Users/jadenfix/Desktop/Graduate School Materials/micrometrics/paeco526_qr.csv')
print(data.head())

# Variable names in data
print(data.columns)

# Define dependent variable and column name for smoking indicator
y = data['dbirwt']
smoking_col = 'tobacco'

# Define covariates by dropping the dependent variable and the smoking variable (by name)
covariates = data.columns.drop(['dbirwt', smoking_col]).tolist()

   alcohol  anemia  cardiac  chyper  dbirwt  dfage  dfeduc  diabete  disllb  \
0        0       0        0       0    3238     30      12        0       0   
1        0       0        0       0    3289     25      17        0       0   
2        0       0        0       0    3236     16      12        0      16   
3        0       0        0       0    3374     21      12        0       0   
4        0       0        0       0    3270     28      16        0      26   

   dlivord  ...  fotherr  fhispan  adequac2  adequac3  tripre2  tripre3  \
0        1  ...        0        0         0         0        0        0   
1        1  ...        0        0         1         0        1        0   
2        2  ...        0        0         1         0        1        0   
3        1  ...        0        0         1         0        1        0   
4        2  ...        0        0         0         0        0        0   

   tripre0  first  plural  dmage2  
0        0      1       0     676  
1 

## Problem 2

In [9]:

# Part 2a: OLS Regression
# Model 1: Smoking only
model1 = smf.ols(f'dbirwt ~ {smoking_col}', data=data).fit()
print("Model 1 (Smoking only):")
print(model1.summary())

# Model 2: Smoking + covariates
model2_formula = f'dbirwt ~ {smoking_col} + ' + ' + '.join(covariates)
model2 = smf.ols(model2_formula, data=data).fit()
print("\nModel 2 (Smoking + covariates):")
print(model2.summary())

# Part 2b: Median Regression (Quantile 0.5) - Smoking only
quantile = 0.5
X = sm.add_constant(data[smoking_col])
model_qr = QuantReg(y, X).fit(q=quantile)
print("\nMedian Regression (Smoking only):")
print(model_qr.summary())

# Part 2c: Median Regression with covariates
X_cov = sm.add_constant(data[[smoking_col] + covariates])
model_qr_cov = QuantReg(y, X_cov).fit(q=quantile)
print("\nMedian Regression with covariates:")
print(model_qr_cov.summary())


# Part 2e: Bootstrap SE with 10 reps (example for median regression)
np.random.seed(42)
bootstrap_coefs = []
n_reps = 10
for _ in range(n_reps):
    sample = data.sample(n=len(data), replace=True)
    y_boot = sample['dbirwt']
    X_boot = sm.add_constant(sample[[smoking_col] + covariates])
    model_boot = QuantReg(y_boot, X_boot).fit(q=0.5)
    # Use the column name to extract the smoking coefficient
    bootstrap_coefs.append(model_boot.params[smoking_col])

se_bootstrap = np.std(bootstrap_coefs, ddof=1)
print(f"\nBootstrap SE for Smoking Coefficient (10 reps): {se_bootstrap}")

Model 1 (Smoking only):
                            OLS Regression Results                            
Dep. Variable:                 dbirwt   R-squared:                       0.030
Model:                            OLS   Adj. R-squared:                  0.030
Method:                 Least Squares   F-statistic:                     3121.
Date:                Wed, 26 Feb 2025   Prob (F-statistic):               0.00
Time:                        21:04:31   Log-Likelihood:            -7.7665e+05
No. Observations:              100000   AIC:                         1.553e+06
Df Residuals:                   99998   BIC:                         1.553e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3423.3603      1




Bootstrap SE for Smoking Coefficient (10 reps): 4.63391648047444


## Interpretations:
2a.)

-Only smoking:

Intercept & Coefficient: The estimated intercept is about 3423 grams, indicating that the average birth weight for infants whose mothers do not smoke is approximately 3423 grams. The tobacco coefficient is –266 grams and is highly statistically significant (p < 0.001).
Interpretation: This result suggests that, on average, infants born to mothers who smoke weigh about 266 grams less than those born to non‐smokers.
Model Fit: Although the R‑squared is only 0.030, meaning that the smoking indicator alone explains about 3% of the variation in birth weight, the effect of smoking is statistically robust.

-Smoking & covariates

Intercept & Coefficient for Tobacco: With the inclusion of 30 additional covariates, the estimated intercept drops to about 2805 grams and the tobacco coefficient becomes –232 grams (p < 0.001).
Interpretation: After controlling for a wide range of other factors (such as maternal education, prenatal care, and other risk factors), the average reduction in birth weight associated with smoking is approximately 232 grams. This attenuation (a smaller magnitude than in Model 1) suggests that some of the difference observed in the simple regression is due to confounding factors.
Model Fit: The R‑squared increases to 0.155, meaning that the model now explains about 15.5% of the variation in birth weight. This improvement indicates that the additional covariates contribute substantially to explaining birth weight variability.


2b.)

-Only smoking:

Results: In the median (quantile) regression, the estimated intercept is about 3459 grams and the tobacco coefficient is –259 grams, both statistically significant (p < 0.001).
Interpretation: This result indicates that the conditional median birth weight for infants of non‐smoking mothers is around 3459 grams, and for smoking mothers the median is roughly 259 grams lower. Although similar in magnitude to the OLS results, the median regression focuses on the midpoint of the distribution, providing an alternative view that is less influenced by extreme values.

2c.)

-Smoking + covariates 

Results: When additional covariates are included in the median regression, the intercept is about 3044 grams and the tobacco coefficient is –228 grams (p < 0.001).
Interpretation: Controlling for the same set of covariates as in the OLS model, the median regression shows that infants of smoking mothers have a median birth weight about 228 grams lower than those of non‐smoking mothers, holding other factors constant.
Comparison with OLS: The reduction in magnitude compared to the unadjusted median regression (–259 grams) is similar to the pattern seen in the OLS case, suggesting that some of the adverse effect attributed to smoking in a simple regression is explained by other factors in the full model.

2d.)

What It Is: The default quantile regression standard errors (as in Stata’s qreg) are based on an asymptotic variance estimator that requires estimating the density of the error term at the conditional quantile; the reciprocal of that density is the sparsity.
Advantages: This approach is computationally efficient and provides quick standard error estimates.
Disadvantages: Its reliability depends on accurately estimating the density at the quantile of interest. If the error density is not constant across observations or is poorly estimated, the standard errors can be misleading.
Implication: Given these concerns, one should be cautious about placing too much confidence in the default standard errors, especially in finite samples or if there is heterogeneity in the error distribution.
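
For reference, the textbook asymptotic result behind this kind of standard error can be written as follows. It is stated here under the simplifying assumption of i.i.d. errors that are independent of $x_i$, which is an assumption added for illustration and not something the output above verifies:

$$
\sqrt{n}\,\bigl(\hat{\beta}_r - \beta_r\bigr) \;\xrightarrow{d}\; N\!\left(0,\; \frac{r(1-r)}{f_u(0)^2}\,\bigl(E[x_i x_i']\bigr)^{-1}\right),
$$

where $f_u(0)$ is the density of the error $u_{ri} = y_i - x_i'\beta_r$ at its $r$-th quantile, which is $0$. The factor $1/f_u(0)$ is the sparsity: where the data are thin around the quantile, $f_u(0)$ is small and the variance is large. Estimating $f_u(0)$ requires a smoothing choice such as a bandwidth, which is one reason the default standard errors can be sensitive.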

2e.) 

Procedure: By resampling the data and re-estimating the quantile regression repeatedly (even if only 10 replications are performed here for demonstration), bootstrap methods provide an alternative way to estimate the variability of the smoking coefficient without relying on the density estimation at the quantile.
Results & Comparison: The bootstrap standard error for the smoking coefficient is approximately 4.63. This value is in the same ballpark as the default standard errors from the OLS and quantile regressions (about 5.00 for the median regression with covariates). With only 10 replications the bootstrap estimate is itself noisy, but the approach does not rely on estimating the error density and is generally more robust to violations of the assumptions underlying the asymptotic variance estimator.
Interpretation: The bootstrap standard error gives additional confidence in the precision of the tobacco coefficient estimate, although in practice one would use more than 10 replications for reliable inference.

2f.)

Test Setup: The null hypothesis being tested is that the sum of the smoking and alcohol coefficients is less than or equal to –300 grams.

Decision & Interpretation:

A test statistic of 2.37, compared with a one-tailed critical value of about 1.645 at the 5% level, leads to rejecting the null hypothesis. Rejection means the sum of the two coefficients is significantly greater than –300 grams, that is, the combined reduction in birth weight associated with smoking and alcohol is smaller than 300 grams in magnitude.
However, note that in the provided output, the alcohol coefficient in the full OLS model was not statistically significant (p = 0.728), so the precise conclusion would depend on the exact estimates and the standard error from the bootstrap.
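
The calculation behind 2f can be sketched as follows. This is a minimal illustration, not the computation that produced the statistic quoted above: the function name `one_sided_sum_test`, the point estimates, and the bootstrap draws are all made-up placeholders. The idea is to estimate the standard error of the sum of the two coefficients from the bootstrap replications of that sum (which accounts for their covariance) and to compare $t = (\hat{\beta}_{tobacco} + \hat{\beta}_{alcohol} + 300)/\widehat{se}$ with the upper-tail normal critical value.

```python
import math
import numpy as np

def one_sided_sum_test(b_tob, b_alc, boot_tob, boot_alc, bound=-300.0):
    """Test H0: b_tob + b_alc <= bound against H1: b_tob + b_alc > bound."""
    # Bootstrap replications of the sum capture the covariance of the two estimates
    boot_sum = np.asarray(boot_tob) + np.asarray(boot_alc)
    se_sum = float(np.std(boot_sum, ddof=1))
    t_stat = (b_tob + b_alc - bound) / se_sum
    # Upper-tail p-value from the standard normal distribution: 1 - Phi(t)
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))
    return t_stat, se_sum, p_value

# Placeholder inputs (NOT estimates from the birth-weight data)
rng = np.random.default_rng(1)
boot_tob = -230.0 + 5.0 * rng.standard_normal(200)
boot_alc = -10.0 + 30.0 * rng.standard_normal(200)

t_stat, se_sum, p_value = one_sided_sum_test(-230.0, -10.0, boot_tob, boot_alc)
print(f"t = {t_stat:.3f}, se = {se_sum:.3f}, one-sided p = {p_value:.4f}")
print("Reject H0 at 5%:", t_stat > 1.645)
```

In the actual analysis, `boot_tob` and `boot_alc` would be the tobacco and alcohol coefficients stored from each bootstrap replication of the same regression, so that each pair comes from the same resample.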


## Problem 3 

In [14]:

# Define variable names
dependent_var = 'dbirwt'
smoking_var = 'tobacco'

# Define the list of quantiles to analyze
quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]

# -------------------------------
# 3(a) Quantile Regressions without Covariates
# -------------------------------
print("3(a) Quantile Regressions without Covariates:")
coef_dict_a = {}
for q in quantiles:
    # Design matrix with constant and smoking variable only
    X_a = sm.add_constant(data[[smoking_var]])
    model_a = QuantReg(data[dependent_var], X_a).fit(q=q)
    coef_dict_a[q] = model_a.params[smoking_var]
    print(f"Quantile {q}: {smoking_var} coefficient = {model_a.params[smoking_var]:.4f} (SE = {model_a.bse[smoking_var]:.4f})")

# -------------------------------
# 3(b) Quantile Regressions with Covariates
# -------------------------------
# Define covariates: drop dependent variable and smoking indicator
covariates = data.columns.drop([dependent_var, smoking_var]).tolist()

print("\n3(b) Quantile Regressions with Covariates:")
coef_dict_b = {}
se_dict_b = {}
for q in quantiles:
    # Design matrix: constant, smoking variable, and other covariates
    X_b = sm.add_constant(data[[smoking_var] + covariates])
    model_b = QuantReg(data[dependent_var], X_b).fit(q=q)
    coef_dict_b[q] = model_b.params[smoking_var]
    se_dict_b[q] = model_b.bse[smoking_var]
    print(f"Quantile {q}: {smoking_var} coefficient = {model_b.params[smoking_var]:.4f} (SE = {model_b.bse[smoking_var]:.4f})")

# -------------------------------
# 3(c) Hypothesis Test: Equality of Smoking Coefficients Across Quantiles
# -------------------------------
print("\n3(c) Pairwise Hypothesis Tests for Equality of Smoking Coefficients (with covariates):")
for q1, q2 in itertools.combinations(quantiles, 2):
    diff = coef_dict_b[q1] - coef_dict_b[q2]
    # Simplification: treats the two estimates as independent, so the variance
    # of the difference is the sum of the variances (their covariance is ignored)
    se_diff = np.sqrt(se_dict_b[q1]**2 + se_dict_b[q2]**2)
    t_stat = diff / se_diff
    # Degrees of freedom: sample size minus number of parameters
    # (constant + smoking indicator + covariates)
    df = len(data) - (len(covariates) + 2)
    p_val = 2 * stats.t.sf(np.abs(t_stat), df=df)
    print(f"Comparing q={q1} vs. q={q2}: diff = {diff:.4f}, t-stat = {t_stat:.4f}, p-value = {p_val:.4f}")

# -------------------------------
# 3(d) Analysis of the "tripre0" Coefficient Across Quantiles
# -------------------------------
print("\n3(d) Analysis of 'tripre0' (No Prenatal Care) Coefficient with Covariates:")
tripre0_dict = {}
for q in quantiles:
    # Use the same model specification as in 3(b)
    X_d = sm.add_constant(data[[smoking_var] + covariates])
    model_d = QuantReg(data[dependent_var], X_d).fit(q=q)
    tripre0_dict[q] = model_d.params['tripre0']
    print(f"Quantile {q}: 'tripre0' coefficient = {model_d.params['tripre0']:.4f} (SE = {model_d.bse['tripre0']:.4f})")

3(a) Quantile Regressions without Covariates:
Quantile 0.1: tobacco coefficient = -283.0000 (SE = 8.9395)
Quantile 0.25: tobacco coefficient = -263.0000 (SE = 5.6752)
Quantile 0.5: tobacco coefficient = -258.9999 (SE = 5.1099)
Quantile 0.75: tobacco coefficient = -256.0000 (SE = 5.5519)
Quantile 0.9: tobacco coefficient = -255.0000 (SE = 7.1958)

3(b) Quantile Regressions with Covariates:
Quantile 0.1: tobacco coefficient = -253.7774 (SE = 8.3610)
Quantile 0.25: tobacco coefficient = -231.8153 (SE = 5.8549)
Quantile 0.5: tobacco coefficient = -227.7445 (SE = 5.0010)
Quantile 0.75: tobacco coefficient = -220.5429 (SE = 5.6759)
Quantile 0.9: tobacco coefficient = -228.9399 (SE = 7.2715)

3(c) Pairwise Hypothesis Tests for Equality of Smoking Coefficients (with covariates):
Comparing q=0.1 vs. q=0.25: diff = -21.9620, t-stat = -2.1516, p-value = 0.0314
Comparing q=0.1 vs. q=0.5: diff = -26.0329, t-stat = -2.6721, p-value = 0.0075
Comparing q=0.1 vs. q=0.75: diff = -33.2345, t-stat = -3.28

## Interpretation
3a.)
Pattern of Tobacco Effect:
– At the 10th percentile, the tobacco coefficient is –283.0, indicating that the 10th percentile of birth weight is 283 grams lower for infants of smoking mothers than for infants of non‐smoking mothers.
– At the 25th percentile the coefficient is –263.0, at the median (50th) –259.0, at the 75th percentile –256.0, and at the 90th percentile –255.0.
Interpretation:
– The pattern suggests that the negative impact of tobacco on birth weight is most pronounced in the lower tail of the distribution. In other words, smoking seems to hurt the lightest infants slightly more, though the effect becomes marginally smaller as you move to higher quantiles.
– The relatively small changes in the coefficient from the median upward (–259 to –255) indicate that for infants at higher birth weights, the impact of smoking is more uniform.

3b.)
Pattern of Tobacco Effect with Controls:
– After including additional covariates, the tobacco coefficient at the 10th percentile is –253.8, at the 25th percentile –231.8, at the median –227.7, at the 75th percentile –220.5, and at the 90th percentile –228.9.
Interpretation:
– With covariates in the model, the magnitude of the smoking effect is generally attenuated compared to the model without covariates. This suggests that some of the difference in birth weight attributed solely to smoking is explained by other factors.
– The decreasing magnitude from the 10th to the 75th percentile implies that the detrimental effect of smoking is strongest at the lower tail. Although the coefficient at the 90th percentile (–228.9) is a bit more negative than at the 75th, overall the pattern indicates that infants at the lower end of the distribution are more adversely affected by maternal smoking.

3c.)
Significant Differences:
– Comparing the 10th percentile to the 25th, 50th, 75th, and 90th percentiles, the differences in tobacco coefficients are statistically significant (p-values ranging from 0.0010 to 0.0314). For example, the difference between the 10th and 50th percentiles is about –26.0 (t = –2.67, p = 0.0075).
Non-significant Differences:
– Differences among the 25th, 50th, 75th, and 90th percentiles are not statistically significant (p-values above 0.05).
Interpretation:
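– These tests indicate that the impact of smoking is statistically stronger at the very bottom (10th percentile) of the birth weight distribution compared to the higher quantiles. However, among the middle and upper quantiles, the effect does not differ significantly. This reinforces the idea of heterogeneity – the adverse impact of smoking is especially critical for the lowest birth weight infants. Note that the test statistics treat the estimates at different quantiles as independent; because they are computed from the same sample they are generally correlated, so the reported p-values are approximate.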
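
To see how much the independence assumption in 3(c) can matter, the sketch below recomputes the test statistic for the 10th versus 50th percentile using the coefficients and standard errors printed in the 3(b) output, once with zero correlation and once with a correlation of 0.3. The helper `diff_t_stat` and the value 0.3 are hypothetical and chosen only for illustration; the actual correlation would have to be estimated, for example by bootstrapping both quantiles on the same resamples.

```python
import math

def diff_t_stat(b1, se1, b2, se2, corr=0.0):
    """t-statistic for H0: b1 == b2, given the correlation between the two estimates."""
    var_diff = se1**2 + se2**2 - 2.0 * corr * se1 * se2
    return (b1 - b2) / math.sqrt(var_diff)

# Tobacco coefficients and SEs at q = 0.10 and q = 0.50 from the 3(b) output
b10, se10 = -253.7774, 8.3610
b50, se50 = -227.7445, 5.0010

t_indep = diff_t_stat(b10, se10, b50, se50)           # independence, as in 3(c)
t_corr = diff_t_stat(b10, se10, b50, se50, corr=0.3)  # hypothetical correlation

print(f"corr = 0.0: t = {t_indep:.4f}")
print(f"corr = 0.3: t = {t_corr:.4f}")
```

With zero correlation the function reproduces the statistic reported in 3(c). A positive correlation shrinks the standard error of the difference and makes the statistic larger in absolute value, so under that scenario the independence-based tests would be conservative.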

3d.)
Estimated Effects Across Quantiles:
– At the 10th percentile, the ‘tripre0’ coefficient is –503.2, indicating that lack of prenatal care is associated with a reduction of about 503 grams in the very low end of the distribution.
– This effect diminishes at higher quantiles: –127.7 at the 25th, –49.3 at the 50th, –38.1 at the 75th, and –25.0 at the 90th percentile.
Interpretation:
– The steep gradient suggests that the absence of prenatal care disproportionately affects infants at the lower tail of the birth weight distribution. For the lowest quantile, not receiving prenatal care is linked with a very large drop in birth weight, while for infants at the upper end, the impact is much less severe.
– This heterogeneity implies that interventions aimed at improving prenatal care might have the greatest benefit for those infants at greatest risk of low birth weight.

## Problem 4

Quantile regression estimates conditional quantile functions by solving an optimization problem that minimizes an asymmetrically weighted sum of absolute deviations. Instead of minimizing squared errors as in ordinary least squares, the method weights positive residuals by $r$ and negative residuals by $1 - r$, where $r$ is the quantile of interest (for example, 0.5 for the median). This weighting ensures that the estimated regression line corresponds to the desired quantile of the conditional distribution of the response variable. The resulting optimization problem can be written as a linear program and solved with linear programming techniques. Because it targets quantiles rather than the mean, the method is less sensitive to outliers in the response and shows how the effects of predictors vary across the entire distribution.
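
The linear programming formulation can be sketched as follows. This is a minimal illustration on simulated data, assuming SciPy is available; it is not how `QuantReg` in statsmodels computes its estimates, and the function name `quantile_regression_lp` is hypothetical. Each residual is split into a positive part $u_i^+$ and a negative part $u_i^-$, and the program minimizes $r\sum_i u_i^+ + (1-r)\sum_i u_i^-$ subject to $y_i = x_i'\beta + u_i^+ - u_i^-$ with $u_i^+, u_i^- \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, r):
    """Solve min_beta sum_i rho_r(y_i - x_i'beta) as a linear program.

    Decision variables are stacked as [beta, u_plus, u_minus].
    """
    n, p = X.shape
    # Objective: 0 * beta + r * sum(u_plus) + (1 - r) * sum(u_minus)
    c = np.concatenate([np.zeros(p), r * np.ones(n), (1.0 - r) * np.ones(n)])
    # Equality constraints: X beta + u_plus - u_minus = y
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    # beta is unrestricted; both residual parts are nonnegative
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

# Simulated data (assumption): intercept plus one regressor
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta_hat = quantile_regression_lp(X, y, r=0.5)
print(beta_hat)
```

The dense constraint matrix has $n$ rows and $p + 2n$ columns, so this sketch is only practical for small samples; it is meant to make the structure of the problem visible rather than to be efficient.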

import matplotlib.pyplot as plt

# Define time points and corresponding events
times = [1, 2, 3, 4, 5]
events = ['t=1: Choose action', 't=2: New state', 't=3: Choose action', 't=4: New state', 't=5: Choose action']

# Create the timeline plot
fig, ax = plt.subplots(figsize=(10, 2))
ax.plot(times, [1]*len(times), 'o-', color='blue')  # Plot markers and line at y=1

# Add event labels above each point
for time, event in zip(times, events):
    ax.text(time, 1.01, event, ha='center', va='bottom', fontsize=10)

# Customize the plot
ax.set_yticks([])  # Hide y-axis
ax.set_xlabel('Time')
ax.set_title('Timeline of Decision Points')
plt.show()