# How much do Covariates Matter?

## Motivation

In regression analyses, we often ask how much covariates matter for explaining the relationship between a target variable $D$ and an outcome variable $Y$. 

For example, we might start analysing the gender wage gap with a simple regression of `log(wage)` on `gender`. But arguably, men and women differ in many socio-economic characteristics: they might have different (average) levels of education or career experience, and they might select into higher- or lower-paying industries. So which fraction of the gender wage gap can be explained by these observable characteristics? 

In this notebook, we will compute and decompose the gender wage gap based on a subset of the PSID data set using a method commonly known as the 
"Gelbach Decomposition" ([Gelbach, JoLE 2016](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1425737)). 

We start with loading a subset of the PSID data provided by the AER R package. 

In [1]:
import re

import pandas as pd

import pyfixest as pf

psid = pd.read_csv(
    "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/AER/PSID7682.csv"
)
psid["experience"] = pd.Categorical(psid["experience"])
psid["year"] = pd.Categorical(psid["year"])
psid.head()

Unnamed: 0,rownames,experience,weeks,occupation,industry,south,smsa,married,gender,union,education,ethnicity,wage,year,id
0,1,3,32,white,no,yes,no,yes,male,no,9,other,260,1976,1
1,2,4,43,white,no,yes,no,yes,male,no,9,other,305,1977,1
2,3,5,40,white,no,yes,no,yes,male,no,9,other,402,1978,1
3,4,6,39,white,no,yes,no,yes,male,no,9,other,402,1979,1
4,5,7,42,white,yes,yes,no,yes,male,no,9,other,429,1980,1


In a first regression of log wages on gender, we find that men earn on average 0.474 log points
more than women. 

In [2]:
fit_base = pf.feols("log(wage) ~ gender", data=psid, vcov="hetero")
fit_base.summary()

###

Estimation:  OLS
Dep. var.: log(wage), Fixed effects: 0
Inference:  hetero
Observations:  4165

| Coefficient    |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:---------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept      |      6.255 |        0.020 |   320.714 |      0.000 |  6.217 |   6.294 |
| gender[T.male] |      0.474 |        0.021 |    22.818 |      0.000 |  0.434 |   0.515 |
---
RMSE: 0.436 R2: 0.106 
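
Because the outcome is in logs, the coefficient can be translated into a proportional difference. The snippet below is a minimal sketch that takes the point estimate of 0.474 from the summary above as given; exponentiating it suggests a raw gap of roughly 60%.

```python
import math

# Point estimate on gender[T.male], copied from the summary output above.
log_gap = 0.474

# exp(b) - 1 converts a difference in log points into a proportional difference.
pct_gap = math.exp(log_gap) - 1
print(f"Men earn about {pct_gap:.1%} more than women on average.")
```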


To examine the impact of observables on the relationship between wage and gender, a common strategy in applied research is to incrementally add a set of covariates to the baseline regression of `log(wage)` on `gender`. Here, we will incrementally add the following covariates:  

- education
- experience
- occupation
- industry
- year
- ethnicity

We can do so by using **multiple estimation syntax**: 

In [3]:
fit_stepwise1 = pf.feols(
    "log(wage) ~ gender + csw0(education, experience, occupation, industry, year, ethnicity)",
    data=psid,
)
pf.etable(fit_stepwise1)

Unnamed: 0_level_0,log(wage),log(wage),log(wage),log(wage),log(wage),log(wage),log(wage)
Unnamed: 0_level_1,(1),(2),(3),(4),(5),(6),(7)
coef,coef,coef,coef,coef,coef,coef,coef
gender[T.male],0.474*** (0.021),0.474*** (0.019),0.425*** (0.018),0.444*** (0.018),0.428*** (0.018),0.432*** (0.016),0.410*** (0.017)
education,,0.065*** (0.002),0.075*** (0.002),0.061*** (0.003),0.063*** (0.003),0.060*** (0.002),0.059*** (0.002)
experience[T.2],,,0.147 (0.156),0.156 (0.155),0.158 (0.154),0.124 (0.137),0.128 (0.136)
experience[T.3],,,0.273 (0.139),0.284* (0.138),0.278* (0.138),0.232 (0.122),0.241* (0.122)
experience[T.4],,,0.377** (0.137),0.389** (0.136),0.385** (0.135),0.287* (0.121),0.298* (0.120)
experience[T.5],,,0.452*** (0.135),0.461*** (0.134),0.456*** (0.133),0.311** (0.119),0.323** (0.118)
experience[T.6],,,0.483*** (0.134),0.495*** (0.133),0.491*** (0.132),0.300* (0.118),0.312** (0.118)
experience[T.7],,,0.591*** (0.133),0.604*** (0.132),0.599*** (0.132),0.374** (0.118),0.386*** (0.117)
experience[T.8],,,0.624*** (0.133),0.641*** (0.132),0.637*** (0.131),0.387*** (0.117),0.400*** (0.117)


Because the table is too long to read comfortably, we restrict it to display only a few variables: 

In [4]:
pf.etable(fit_stepwise1, keep=["gender", "ethnicity", "education"])

Unnamed: 0_level_0,log(wage),log(wage),log(wage),log(wage),log(wage),log(wage),log(wage)
Unnamed: 0_level_1,(1),(2),(3),(4),(5),(6),(7)
coef,coef,coef,coef,coef,coef,coef,coef
gender[T.male],0.474*** (0.021),0.474*** (0.019),0.425*** (0.018),0.444*** (0.018),0.428*** (0.018),0.432*** (0.016),0.410*** (0.017)
ethnicity[T.other],,,,,,,0.133*** (0.020)
education,,0.065*** (0.002),0.075*** (0.002),0.061*** (0.003),0.063*** (0.003),0.060*** (0.002),0.059*** (0.002)
stats,stats,stats,stats,stats,stats,stats,stats
Observations,4165,4165,4165,4165,4165,4165,4165
S.E. type,iid,iid,iid,iid,iid,iid,iid
R2,0.106,0.260,0.376,0.387,0.391,0.520,0.525
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)


We see that the coefficient on gender changes only modestly across models, from 0.474 without controls to 0.410 with all of them. Tentatively, we might already conclude that the observable characteristics in the data do not explain a large part of the gender wage gap. 

But how much do differences in education matter? We have computed 6 additional models that contain education as a covariate. The obtained point estimates 
vary between $0.059$ and $0.075$. Which of these numbers should we report?  

Additionally, note that while we have only computed 6 additional models with covariates, the number of possible models is much larger. 
With 6 covariates there are $2^6 - 1 = 63$ distinct non-empty sets of controls, so we could have computed $57$ further models (not all of which would have included `education` as a control).
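
The count can be checked by enumerating every non-empty set of controls. A short sketch using only the standard library: there are 63 such sets, 6 of which were estimated above, which leaves 57.

```python
from itertools import combinations

covariates = ["education", "experience", "occupation", "industry", "year", "ethnicity"]

# Every non-empty subset of the six covariates defines one possible model.
control_sets = [
    subset
    for size in range(1, len(covariates) + 1)
    for subset in combinations(covariates, size)
]

n_total = len(control_sets)
n_remaining = n_total - 6  # models not yet estimated above
n_with_education = sum("education" in subset for subset in control_sets)

print(n_total, n_remaining, n_with_education)
```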

As it turns out, different models **will lead to different point estimates**. The order of incrementally adding covariates **might** impact our conclusion. To illustrate this, we keep the same ordering as before, but start with `ethnicity` as our first variable: 

In [5]:
fit_stepwise2 = pf.feols(
    "log(wage) ~ gender + csw0(ethnicity, education, experience, occupation, industry, year)",
    data=psid,
)
pf.etable(fit_stepwise2, keep=["gender", "ethnicity", "education"])

Unnamed: 0_level_0,log(wage),log(wage),log(wage),log(wage),log(wage),log(wage),log(wage)
Unnamed: 0_level_1,(1),(2),(3),(4),(5),(6),(7)
coef,coef,coef,coef,coef,coef,coef,coef
gender[T.male],0.474*** (0.021),0.436*** (0.022),0.450*** (0.020),0.399*** (0.019),0.418*** (0.019),0.404*** (0.019),0.410*** (0.017)
ethnicity[T.other],,0.227*** (0.026),0.141*** (0.024),0.158*** (0.023),0.151*** (0.022),0.146*** (0.022),0.133*** (0.020)
education,,,0.064*** (0.002),0.074*** (0.002),0.060*** (0.003),0.062*** (0.003),0.059*** (0.002)
stats,stats,stats,stats,stats,stats,stats,stats
Observations,4165,4165,4165,4165,4165,4165,4165
S.E. type,iid,iid,iid,iid,iid,iid,iid
R2,0.106,0.121,0.266,0.383,0.394,0.397,0.525
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)


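This gives five coefficients on `education`, ranging from 0.059 to 0.074. Four of them are new; the last one comes from the full model and is identical to the one obtained before. 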

So, which share of the "raw" gender wage gap can be attributed to differences in education between men and women? Should we report a statistic based on the 0.075 estimate? Or 
on the 0.059 estimate? Which value should we pick? 
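
The order dependence is easy to reproduce on synthetic data. In the sketch below, all variables and coefficients are made up for illustration: `d` plays the role of the target and `a` and `b` are two correlated covariates. The change in the coefficient on `d` that one would attribute to `a` depends on whether `a` is added before or after `b`.

```python
import numpy as np


def coef_on_target(y, target, controls):
    """OLS coefficient on `target` in a regression of y on target and controls."""
    X = np.column_stack([np.ones(len(y)), target, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


# Synthetic data: a and b are correlated with d and with each other.
rng = np.random.default_rng(42)
n = 5_000
d = rng.normal(size=n)
a = 0.6 * d + rng.normal(size=n)
b = 0.5 * a + 0.3 * d + rng.normal(size=n)
y = 0.5 * d + 0.4 * a + 0.4 * b + rng.normal(size=n)

b_none = coef_on_target(y, d, [])
b_a = coef_on_target(y, d, [a])
b_b = coef_on_target(y, d, [b])
b_ab = coef_on_target(y, d, [a, b])

# "Contribution" of a under the two possible orderings.
change_a_first = b_none - b_a   # a added first
change_a_second = b_b - b_ab    # a added after b

print(f"a added first:  {change_a_first:.3f}")
print(f"a added second: {change_a_second:.3f}")
```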

To help us with this problem, Gelbach (2016, JoLE) develops a decomposition procedure building on the omitted variable bias formula that produces a single value for the contribution of a given covariate, say education, to the gender wage gap.

## Notation and Gelbach's Algorithm

Before we dive into a code example, let us first introduce the notation and Gelbach's algorithm. We are interested in "decomposing" the effect of 
a variable $X_{1} \in \mathbb{R}$ on an outcome $Y \in \mathbb{R}$ into a part explained by covariates $X_{2} \in \mathbb{R}^{k_{2}}$ and an unexplained part. 

Thus we can specify two regression models: 

- The **short** model 
    $$
        Y = X_{1} \beta_{1} + u_{1}
    $$

- the **long** (or full) model 

    $$
        Y = X_{1} \beta_{1} + X_{2} \beta_{2} + e
    $$

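By fitting the **short** regression, we obtain an estimate of the coefficient on $X_{1}$ without controls, which we will denote as the **direct effect**. By estimating the **long** regression, we obtain an estimate of the regression coefficients $\hat{\beta}_{2} \in \mathbb{R}^{k_2}$, and we will denote the estimate on $X_1$ in the long regression as the **full** effect. Although both models write the coefficient on $X_{1}$ as $\beta_{1}$, the two estimates generally differ; their difference is the part explained by the covariates. 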

We can then compute the contribution of an individual covariate $\hat{\delta}_{k}$ via the following algorithm: 

- Step 1: we compute coefficients from $k_{2}$ auxiliary regression models $\hat{\Gamma}$ as 
    $$
        \hat{\Gamma} = (X_{1}'X_{1})^{-1} X_{1}'X_{2}
    $$

    In words, we regress each covariate in $X_{2}$ on the target variable $X_{1}$. In practice, we can easily do this in one line of code via `scipy.linalg.lstsq()`.

- Step 2: We can compute the total effect **explained** by the covariates, which we denote by $\hat{\delta}$, as 

    $$
        \hat{\delta} = \sum_{k=1}^{k_2} \hat{\Gamma}_{k} \hat{\beta}_{2,k}
    $$

    where $\hat{\Gamma}_{k}$ is the coefficient from an auxiliary regression of covariate $X_{2,k}$ on $X_1$ and $\hat{\beta}_{2,k}$ is the associated estimate on $X_{2,k}$ from the **full** model. 

    The individual **contribution of covariate $k$** is then defined as 

    $$
        \hat{\delta}_{k} = \hat{\Gamma}_{k} \hat{\beta}_{2,k}.
    $$


After having obtained $\hat{\delta}_{k}$ for each covariate $k$, we can easily aggregate multiple variables into a single group of interest. For example, if $X_{2}$ contains a set of dummies from industry fixed effects, we could compute the explained part of "industry" by summing over all the dummies: 

$$
        \hat{\delta}_{\textit{industry}} = \sum_{k \in \textit{industry dummies}} \hat{\Gamma}_{k} \hat{\beta}_{2,k}
$$
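
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not pyfixest's implementation: the data-generating process and all variable names are made up, an intercept is included alongside the target (so $\hat{\Gamma}$ has one row for the intercept and one for the target), and `numpy.linalg.lstsq` is used in place of `scipy.linalg.lstsq`. The sketch also checks numerically that the direct effect equals the full effect plus $\sum_{k} \hat{\delta}_{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Synthetic data: x1 is the target, x2 holds two covariates correlated with x1.
x1 = rng.normal(size=n)
x2 = np.column_stack(
    [0.5 * x1 + rng.normal(size=n), -0.3 * x1 + rng.normal(size=n)]
)
y = 1.0 + 0.8 * x1 + x2 @ np.array([0.4, 0.2]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])  # short design: intercept + target
X_full = np.column_stack([X1, x2])      # long design: adds the covariates

beta_short = np.linalg.lstsq(X1, y, rcond=None)[0]
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
beta2_hat = beta_full[2:]

# Step 1: regress every covariate in x2 on X1; gamma_hat has shape (2, k2).
gamma_hat = np.linalg.lstsq(X1, x2, rcond=None)[0]

# Step 2: individual contributions for the target (row 1 of gamma_hat).
delta_k = gamma_hat[1, :] * beta2_hat
delta_total = delta_k.sum()

direct_effect = beta_short[1]
full_effect = beta_full[1]

print("contributions:", delta_k)
print("direct - full:", direct_effect - full_effect)
print("sum of contributions:", delta_total)
```

Grouping covariates, as in the industry example, then amounts to summing the relevant entries of `delta_k`.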

## `PyFixest` Example

To employ Gelbach's decomposition in `pyfixest`, we start with the **full** regression model that contains **all variables of interest**: 

In [6]:
fit_full = pf.feols(
    "log(wage) ~ gender + ethnicity + education + experience + occupation + industry + year",
    data=psid,
)

After fitting the **full model**, we can run the decomposition procedure by calling the `decompose()` method. The only required argument is the target parameter `param`, which in this case is the coefficient name `"gender[T.male]"`. Inference is conducted via a non-parametric bootstrap and can optionally be turned off.

In [7]:
gb = fit_full.decompose(param="gender[T.male]", digits=5)

100%|██████████| 1000/1000 [00:13<00:00, 72.16it/s]
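
To give an idea of what a non-parametric bootstrap of the explained effect might look like, here is a sketch on synthetic data. It is not pyfixest's code: resampling whole rows with replacement and reporting percentile intervals are assumptions made for illustration, and the helper `explained_effect` is hypothetical.

```python
import numpy as np


def explained_effect(y, x1, x2):
    """Total explained effect: sum over k of Gamma_k * beta_{2,k}."""
    X1 = np.column_stack([np.ones(len(y)), x1])
    beta_full = np.linalg.lstsq(np.column_stack([X1, x2]), y, rcond=None)[0]
    gamma = np.linalg.lstsq(X1, x2, rcond=None)[0]
    return float(gamma[1] @ beta_full[2:])


# Synthetic data (made up for illustration).
rng = np.random.default_rng(1)
n, n_boot = 500, 200
x1 = rng.normal(size=n)
x2 = np.column_stack([0.5 * x1 + rng.normal(size=n), rng.normal(size=n)])
y = 0.8 * x1 + x2 @ np.array([0.4, 0.2]) + rng.normal(size=n)

point = explained_effect(y, x1, x2)

# Resample rows with replacement and recompute the statistic each time.
draws = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    draws[b] = explained_effect(y[idx], x1[idx], x2[idx])

lower, upper = np.percentile(draws, [2.5, 97.5])
print(f"explained effect: {point:.4f}, 95% interval: [{lower:.4f}, {upper:.4f}]")
```

Since the intervals come from random resampling, they change from run to run unless a seed is fixed, which is consistent with the slightly different intervals across the tables in this notebook.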


As before, this produces a pretty big output table that reports 
- the **direct effect** from the regression `log(wage) ~ gender`
- the **full effect** of gender on log wage using the **full regression** with all control variables
- the **explained effect** as the difference between the direct and the full effect
- a **single scalar value** for the individual contribution of each covariate to the overall **explained effect** 

For our example at hand, the additional covariates explain only a small part of the difference in log wages between men and women: 0.064 of the 0.474 log points, or roughly 14%. 
Of the explained part, around one third can be attributed to ethnicity, 0.00064 log points to years of education, etc. 
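
These shares follow from simple arithmetic on the point estimates that the decomposition reports. The sketch below hard-codes the values from the output table of the next cell rather than reading them from the `gb` object, whose attributes are not shown in this notebook.

```python
# Point estimates copied from the decomposition table.
direct_effect = 0.47447
full_effect = 0.41034
explained = 0.06413
ethnicity = 0.02275
education = 0.00064

# The explained effect is the direct effect minus the full effect.
gap = direct_effect - full_effect

share_explained = explained / direct_effect  # share of the raw gap
share_ethnicity = ethnicity / explained      # share of the explained part
share_education = education / explained

print(f"direct - full:               {gap:.5f}")
print(f"explained share of raw gap:  {share_explained:.1%}")
print(f"ethnicity share of explained: {share_ethnicity:.1%}")
print(f"education share of explained: {share_education:.1%}")
```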

In [8]:
pf.make_table(gb)

Unnamed: 0,direct_effect,full_effect,explained_effect
gender[T.male],0.47447,0.41034,0.06413
,"[0.45957, 0.50398]","[0.39909, 0.43440]","[0.04599, 0.10157]"
ethnicity[T.other],,,0.02275
,,,"[0.01928, 0.02913]"
education,,,0.00064
,,,"[-0.00669, 0.00551]"
experience[T.2],,,-0.00030
,,,"[-0.00038, 0.00020]"
experience[T.3],,,-0.00234
,,,"[-0.00545, 0.00068]"


Because experience is a categorical variable, the table gets pretty unwieldy: we obtain one estimate for each level. Luckily, Gelbach's decomposition 
allows us to group individual contributions into a single number. In the `decompose()` method, we can combine variables via the `combine_covariates` argument: 

In [9]:
gb2 = fit_full.decompose(
    param="gender[T.male]",
    combine_covariates={
        "experience": re.compile("experience"),
        "occupation": re.compile("occupation"),
        "industry": re.compile("industry"),
        "year": re.compile("year"),
        "ethnicity": re.compile("ethnicity"),
    },
)

100%|██████████| 1000/1000 [00:07<00:00, 128.06it/s]
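
To see which coefficients each pattern would pick up, the grouping can be mimicked with the standard `re` module. The coefficient names below are hypothetical, modelled on the names that appear in the tables of this notebook, and using `Pattern.search` is an assumption: the notebook does not show how pyfixest applies the patterns internally.

```python
import re

# Hypothetical coefficient names, modelled on the output tables above.
coef_names = [
    "ethnicity[T.other]",
    "education",
    "experience[T.2]",
    "experience[T.3]",
    "occupation[T.white]",
    "industry[T.yes]",
    "year[T.1977]",
]

# Same patterns as passed to combine_covariates above.
groups = {
    "experience": re.compile("experience"),
    "occupation": re.compile("occupation"),
    "industry": re.compile("industry"),
    "year": re.compile("year"),
    "ethnicity": re.compile("ethnicity"),
}

# Assign each coefficient name to every group whose pattern it matches.
grouped = {
    label: [name for name in coef_names if pattern.search(name)]
    for label, pattern in groups.items()
}
unmatched = [
    name for name in coef_names
    if not any(pattern.search(name) for pattern in groups.values())
]

print(grouped)
print("not covered by any pattern:", unmatched)
```

In this sketch `education` is not covered by any pattern, since the dictionary above contains no entry for it.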


We now report a single value for "experience", which accounts for a good chunk - around 60% - of the explained part of the gender wage gap. 

In [14]:
pf.make_table(gb2)

Unnamed: 0,direct_effect,full_effect,explained_effect
gender[T.male],0.4745,0.4103,0.0635
,"[0.4558, 0.4775]","[0.3968, 0.4222]","[0.0512, 0.0784]"
experience,,,0.0390
,,,"[0.0378, 0.0421]"
occupation,,,-0.0165
,,,"[-0.0277, -0.0166]"
industry,,,0.0182
,,,"[0.0170, 0.0247]"
year,,,0.0000
,,,"[-0.0013, 0.0141]"


We can aggregate even more to "individual level" and "job" level variables: 

In [11]:
gb3 = fit_full.decompose(
    param="gender[T.male]",
    combine_covariates={
        "job": re.compile(r".*(occupation|industry).*"),
        "personal": re.compile(r".*(education|experience|ethnicity).*"),
        "year": re.compile("year"),
    },
)

100%|██████████| 1000/1000 [00:08<00:00, 116.51it/s]


In [12]:
pf.make_table(gb3)

Unnamed: 0,direct_effect,full_effect,explained_effect
gender[T.male],0.4745,0.4103,0.0641
,"[0.4511, 0.4826]","[0.3974, 0.4057]","[0.0536, 0.0852]"
job,,,0.0017
,,,"[-0.0060, 0.0044]"
personal,,,0.0624
,,,"[0.0504, 0.0775]"
year,,,0.0000
,,,"[-0.0101, 0.0092]"
,,,
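
As a quick consistency check, the grouped contributions should add up to the explained effect, and the "job" group should equal the sum of the separate occupation and industry contributions reported earlier. The sketch below uses the rounded values from the tables, so agreement is only expected up to rounding.

```python
# Rounded point estimates copied from the two grouped tables above.
explained = 0.0641
job, personal, year = 0.0017, 0.0624, 0.0000
occupation, industry = -0.0165, 0.0182

sum_of_groups = job + personal + year
job_from_parts = occupation + industry

print(f"sum of groups:         {sum_of_groups:.4f} (explained: {explained:.4f})")
print(f"occupation + industry: {job_from_parts:.4f} (job: {job:.4f})")
```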


## Literature 

- ["When do Covariates Matter? And Which Ones, and How Much?" by Gelbach, Jonah B. (2016), Journal of Labor Economics](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1425737)