# Fixed Effects
Consider a structural equation relating $Y$ to $X_1$ in which the mean of $Y$ also depends on group membership $j=1,...,J$.
$$
Y_i = \gamma_1X_{1,i} + \alpha_{j(i)} + \epsilon_i
$$
$$
= \gamma_1X_{1,i} + \sum_{j=1}^J\alpha_j\boldsymbol{1}\{j(i)=j\} + \epsilon_i,
$$
where $\boldsymbol{1}\{j(i)=j\}$ is an indicator that equals 1 if individual $i$ is part of group $j$. Each observation $i$ must belong to some group $j$. Our goal is to better understand this equation and under what circumstances we can identify $\gamma_1$ by regression. This setup is applicable to many scenarios, e.g., when data contains information about individuals within families. In this case $j$ could denote the family. We explore an example below that we've already covered in class.

## Example: J=2, A dummy variable for group membership
I will argue that we've already seen this setup. Consider the case where $J=2$ (i.e., there are two groups). Let $X_{2,i}=1$ if individual $i$ is a member of group $j=1$ and $X_{2,i}=0$ if a member of group $j=2$. We can therefore rewrite the structural equation as
$$
Y_i = \gamma_1X_{1,i} + \alpha_1X_{2,i} + \alpha_2(1-X_{2,i}) + \epsilon_i
$$
or
$$
Y_i = \alpha_2 + \gamma_1X_{1,i} + (\alpha_1 - \alpha_2)X_{2,i} + \epsilon_i
$$
$$
= \gamma_0 + \gamma_1X_{1,i} + \gamma_2X_{2,i} + \epsilon_i
$$
where we define $\gamma_0=\alpha_2$ and $\gamma_2=(\alpha_1-\alpha_2)$. Now suppose we estimate the regression of $Y$ on $X_1$ and $X_2$. How would we interpret the coefficient $\beta_1$ on $X_1$ from this regression, and under what conditions will it identify $\gamma_1$?
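To make the reparameterization concrete, here is a minimal simulated sketch. The parameter values, sample size, and variable names are all hypothetical and chosen only for illustration. It generates data from the two-group structural equation and checks that the regression of $Y$ on $X_1$ and $X_2$ returns an intercept near $\alpha_2$ and a coefficient on $X_2$ near $\alpha_1-\alpha_2$.

```r
# Hypothetical two-group design: alpha1, alpha2, gamma1 are made-up values
set.seed(1)
n      <- 5000
alpha1 <- 3
alpha2 <- 1
gamma1 <- 2

x2 <- rbinom(n, 1, 0.5)        # 1 = member of group 1, 0 = member of group 2
x1 <- rnorm(n) + 0.5 * x2      # X1 is allowed to differ in mean across groups
eps <- rnorm(n)                # unobservable, independent of x1 and x2 here

# Structural equation with group-specific means
y <- gamma1 * x1 + alpha1 * x2 + alpha2 * (1 - x2) + eps

# Regression with a dummy for group membership
fit_dummy <- lm(y ~ x1 + x2)

# Compare estimates with (gamma0, gamma1, gamma2) = (alpha2, gamma1, alpha1 - alpha2)
rbind(estimate = coef(fit_dummy),
      target   = c(alpha2, gamma1, alpha1 - alpha2))
```

Because $\epsilon$ is drawn independently of $X_1$ and $X_2$ in this sketch, the estimates should land close to their targets; the sections below discuss what is needed when that is not guaranteed.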

### Interpreting the Fixed Effects Estimand as the Within Group Estimand
By Frisch-Waugh, we know that
$$
\beta_1 = \frac{Cov[\tilde{Y},\tilde{X}_1]}{Var[\tilde{X}_1]}
$$
where $\tilde{Y} = Y-BLP(Y|X_2)$ and $\tilde{X}_1 = X_1 - BLP(X_1|X_2)$. We are used to interpreting $\beta_1$ as the effect of $X_1$ on $Y$ after controlling for $X_2$. But when $X_2$ denotes group membership, is there a more intuitive interpretation? Note that
$$
\tilde{Y} = Y - BLP(Y|X_2) = Y - E[Y|X_2=0](1-X_2) - E[Y|X_2=1]X_2
$$
and similarly
$$
\tilde{X}_1 = X_1 - BLP(X_1|X_2) = X_1 - E[X_1|X_2=0](1-X_2) - E[X_1|X_2=1]X_2.
$$
Studying these two expressions carefully shows that $\tilde{Y}$ and $\tilde{X}_1$ reflect deviations of $Y$ and $X_1$ from their group means. For example, when $X_2=1$ (i.e., denoting membership in group 1), then $\tilde{Y} = Y - E[Y|X_2=1]$ and $\tilde{X}_1=X_1 - E[X_1|X_2=1]$. Thus, $\beta_1$ uses variation *within groups* to identify the impact of $X_1$ on $Y$. When, then, does $\beta_1$ identify $\gamma_1$? We require that
$$
Cov[\tilde{X}_1,\epsilon] = 0,
$$
that is, the deviations of $X_1$ from its group mean must be uncorrelated with the unobservables. As an example, suppose $j=1,2$ denotes one of two families, the Joneses (1) and the Smiths (2), $Y$ denotes wages, and $X_1$ denotes years of schooling. Then $\beta_1$ identifies $\gamma_1$ if the within-family deviations in schooling are uncorrelated with $\epsilon$. Suppose we thought that ability was an omitted variable in a regression of $Y$ on $X_1$, but that ability was the same within families and differed only across families. Then by comparing the outcomes of siblings with different levels of schooling within the same family (i.e., using within-group variation), we can identify $\gamma_1$. That is, if on average children in the Jones family with more schooling earn a higher wage than their siblings, and likewise in the Smith family, then this is evidence of a causal effect of schooling on wages, provided that within-family differences in schooling are unrelated to ability, our omitted variable.
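Returning to the condition $Cov[\tilde{X}_1,\epsilon]=0$, a short derivation shows where it comes from. The sketch below uses only the definitions above, plus the shorthand $\tilde{\epsilon} = \epsilon - BLP(\epsilon|X_2)$, which is notation introduced here for illustration. Since $\gamma_0 + \gamma_2X_2$ is linear in $X_2$, residualizing the structural equation on $X_2$ removes it entirely:
$$
\tilde{Y} = \gamma_1\tilde{X}_1 + \tilde{\epsilon}.
$$
Substituting into the Frisch-Waugh formula gives
$$
\beta_1 = \frac{Cov[\gamma_1\tilde{X}_1 + \tilde{\epsilon},\tilde{X}_1]}{Var[\tilde{X}_1]} = \gamma_1 + \frac{Cov[\tilde{\epsilon},\tilde{X}_1]}{Var[\tilde{X}_1]} = \gamma_1 + \frac{Cov[\epsilon,\tilde{X}_1]}{Var[\tilde{X}_1]},
$$
where the last equality holds because $\tilde{X}_1$ is uncorrelated with $X_2$ by construction, so $Cov[BLP(\epsilon|X_2),\tilde{X}_1]=0$. Hence $\beta_1=\gamma_1$ exactly when $Cov[\tilde{X}_1,\epsilon]=0$.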

Still another way of describing our exogeneity assumption is that within-family variation in schooling is exogenous variation. If siblings within families had different levels of ability, and ability influenced both schooling choices and labor income directly, then we would still have an endogeneity problem with schooling.
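The two cases just described can be illustrated with a small simulation. This is a sketch under assumed, hypothetical data-generating processes (the number of families, the coefficient of 2 on ability, and all variable names are made up), not an analysis of real data. In case (a) ability is shared by siblings; in case (b) ability also has an individual component that drives both schooling and wages.

```r
# Hypothetical design: J families with m siblings each, true gamma1 = 1
set.seed(3)
J <- 300
m <- 4
N <- J * m
fam    <- rep(seq_len(J), each = m)   # family identifier j(i)
gamma1 <- 1

ability_fam <- rnorm(J)[fam]          # component shared by all siblings
ability_ind <- rnorm(N)               # component that differs across siblings

# Case (a): only family-level ability affects schooling and wages
school_a <- 12 + ability_fam + rnorm(N)
wage_a   <- gamma1 * school_a + 2 * ability_fam + rnorm(N)

# Case (b): individual-level ability also affects schooling and wages
school_b <- 12 + ability_fam + ability_ind + rnorm(N)
wage_b   <- gamma1 * school_b + 2 * (ability_fam + ability_ind) + rnorm(N)

# Slope from regressing group-demeaned y on group-demeaned x
within_slope <- function(y, x, g) {
  y_dev <- y - ave(y, g)              # deviation from the group mean of y
  x_dev <- x - ave(x, g)              # deviation from the group mean of x
  sum(x_dev * y_dev) / sum(x_dev^2)
}

pooled_a <- unname(coef(lm(wage_a ~ school_a))["school_a"])                # no fixed effects
fe_a     <- within_slope(wage_a, school_a, fam)                            # within estimator
lsdv_a   <- unname(coef(lm(wage_a ~ school_a + factor(fam)))["school_a"])  # dummy-variable regression
fe_b     <- within_slope(wage_b, school_b, fam)

round(c(pooled_a = pooled_a, fe_a = fe_a, lsdv_a = lsdv_a, fe_b = fe_b), 3)
```

Under this design the population value of the pooled slope in case (a) is 2, while the within-family slope equals the true $\gamma_1=1$ and coincides with the coefficient from the regression with family dummies. In case (b) the within-family slope has population value 2, so family fixed effects no longer remove the bias.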

## Example: Family Fixed Effects with Large J
Now that you've seen the intuition for the $J=2$ case, consider the case where $J$ is large. That is, suppose we have data on individuals in many families, such as the PSID. The intuition is largely the same: $\beta_1$ reflects the association between within-group variation in $X_1$ and within-group variation in $Y$. Let's estimate a fixed effects regression in R using data from the PSID to estimate the returns to schooling.

In [33]:
library(data.table)
fpath = "/home/jake/Dropbox/classes/3rd Year/Teaching/ECON210_Fall2017/Data/psid-base.csv"

df = na.omit(fread(fpath, 
    select=c("inc_labor2011", "age2011", "edu", "f_id")))
df <- as.data.frame(df)
df = df[df$age2011 > 30 & df$age2011 < 50,]

Read 74384 rows and 4 (of 6524) columns from 0.911 GB file in 00:00:05


In [46]:
fit = lm(inc_labor2011 ~ edu + factor(f_id), data=df)
summary(fit)$coef[1:2,]

              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -18965.174  50358.721 -0.3766016    0.7065644
edu           5771.379   1066.748  5.4102552 8.202175e-08


We can also cluster at the family level to account for correlation in $\epsilon$ within families. As we see below, the standard error on `edu` rises from about 1,067 to about 1,979.

In [49]:
cl <- function(dat, fm, cluster){
    # dat: data frame on which the model was estimated (not used in the body)
    # fm: fitted lm model
    # cluster: vector of cluster identifiers, one per observation used in fm
    require(sandwich, quietly = TRUE)
    require(lmtest, quietly = TRUE)
    M = length(unique(cluster))
    N = length(cluster)
    K = fm$rank
    dfc = (M/(M-1))*((N-1)/(N-K))
    uj  = apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
    vcovCL = dfc*sandwich(fm, meat=crossprod(uj)/N)
    coeftest(fm, vcovCL)
}
cl(df, fit, df$f_id)[1:2,]

              Estimate Std. Error    t value    Pr(>|t|)
(Intercept) -18965.174   21773.29 -0.8710294 0.383985779
edu           5771.379    1979.39  2.9157367 0.003642449


We use `factor(f_id)` to generate dummy variables for the father's ID, which serve as the family fixed effects. If our assumption that within-family variation in schooling is exogenous is correct, then we estimate a causal effect of schooling on labor income of about $5,800.

Equivalently, we could run a regression of labor income on education, with each variable expressed as a deviation from its group mean. That is, a regression of $Y_i - \bar{Y}_{j(i)}$ on $X_{1,i} - \bar{X}_{1,j(i)}$.

In [44]:
fmeans = aggregate(df[,c("inc_labor2011", "edu")], by=list(df$f_id), FUN=mean)
df.g = merge(df, fmeans, by.x="f_id", by.y="Group.1")

fit = lm(I(inc_labor2011.x - inc_labor2011.y) ~ I(edu.x - edu.y), data=df.g)
summary(fit)


Call:
lm(formula = I(inc_labor2011.x - inc_labor2011.y) ~ I(edu.x - 
    edu.y), data = df.g)

Residuals:
    Min      1Q  Median      3Q     Max 
-401446   -5241       0    4441  401446 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.365e-13  6.044e+02   0.000        1    
I(edu.x - edu.y) 5.771e+03  6.389e+02   9.034   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29330 on 2353 degrees of freedom
Multiple R-squared:  0.03352,	Adjusted R-squared:  0.03311 
F-statistic:  81.6 on 1 and 2353 DF,  p-value: < 2.2e-16


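We get the same coefficient estimate on schooling as we did for the fixed effects estimator, but not the same standard error. The standard error here is too small because `lm` does not know that we also estimated the group means: it computes the residual variance with 2353 degrees of freedom, as if only two parameters had been estimated, rather than also accounting for the family means that were removed.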
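The degrees-of-freedom point can be checked on simulated data. The sketch below uses a hypothetical design (the group count, group size, and coefficients are made up, and this is not the PSID sample). It rescales the standard error from the demeaned regression by $\sqrt{(N-2)/(N-J-1)}$, where $N-J-1$ is the residual degrees of freedom of the regression with $J$ group dummies and one slope, and compares the result with the dummy-variable regression.

```r
# Hypothetical design: J groups with m observations each
set.seed(4)
J <- 200
m <- 3
N <- J * m
g <- rep(seq_len(J), each = m)

x <- rnorm(N) + rnorm(J)[g]            # regressor correlated with group effects
y <- 2 * x + rnorm(J)[g] + rnorm(N)    # outcome with group-specific means

# Fixed effects via group dummies: residual df = N - J - 1
fit_lsdv <- lm(y ~ x + factor(g))
se_lsdv  <- coef(summary(fit_lsdv))["x", "Std. Error"]

# Same slope via group-demeaned variables: lm uses residual df = N - 2
y_dev  <- y - ave(y, g)
x_dev  <- x - ave(x, g)
fit_dm <- lm(y_dev ~ x_dev)
se_dm  <- coef(summary(fit_dm))["x_dev", "Std. Error"]

# Rescale the demeaned-regression standard error to the correct df
se_fixed <- se_dm * sqrt((N - 2) / (N - J - 1))

c(se_lsdv = se_lsdv, se_dm = se_dm, se_fixed = se_fixed)
```

Both regressions have the same residual sum of squares and the same within-group variation in the regressor, so the only difference in the conventional standard errors is the divisor used for the residual variance. This rescaling addresses only that divisor; it does not account for within-family correlation in $\epsilon$, which is what clustering is for.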

## Example: Classroom Fixed Effects
Another scenario might be where $j=1,...,J$ represents different classrooms. Suppose we were interested in the impact of class attendance on student performance. We are worried about the endogeneity of "conscientiousness", a socioemotional trait that affects both attendance and performance. But we think that students are sorted into classes based in part on this trait and that variation in attendance within classrooms will help us identify the impact of attendance on performance. So we estimate a regression with classroom fixed effects.

Of course, if variation in attendance within classrooms still depends on students' conscientiousness, then this fixed effects strategy won't identify the causal effect of attendance on student performance.