# Exercise 1: Theory and Simulation for Multivariate Regression
## (a)
$$
E[Y] = E[\beta_0 + \beta_1X_1 + \beta_2X_2 + U] = \beta_0 + \beta_1E[X_1] + \beta_2E[X_2] + E[U] = 4 + 2\cdot 1 + 1\cdot 2 + 0 = 8
$$

In [1]:
library(MASS)
N = 1000
set.seed(210)
b0 = 4
b1 = 2
b2 = 1

X1.X2 = mvrnorm(N, mu=c(1, 2), Sigma=cbind(c(1, 1), c(1, 4)))
X1 = X1.X2[,1]
X2 = X1.X2[,2]

U = rnorm(N, mean=0, sd=2)

Y = b0 + b1*X1 + b2*X2 + U

In [2]:
mean(Y)

As expected, the sample mean is close to its population counterpart, $E[Y] = 8$.

## (b)

In [3]:
# OLS coefficients: (X'X)^{-1} X'Y
reg = function(Y,X) {solve(t(X)%*%X)%*%t(X)%*%Y}

# Residuals from a regression of Y on X
res = function(Y,X) {Y - X%*%reg(Y,X)}

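# Residuals from a regression on X2 (with intercept)
X1.t2 = res(X1, cbind(rep(1,N), X2))
Y.t2 = res(Y, cbind(rep(1,N), X2))

# Slope on X1 after partialling out X2
# (named b1.hat so the true parameter b1 is not overwritten)
b1.hat = cov(Y.t2, X1.t2)/var(X1.t2)

# Residuals from a regression on X1 (with intercept)
X2.t1 = res(X2, cbind(rep(1,N), X1))
Y.t1 = res(Y, cbind(rep(1,N), X1))

# Slope on X2 after partialling out X1
b2.hat = cov(Y.t1, X2.t1)/var(X2.t1)

# Intercept recovered from the sample means
b0.hat = mean(Y) - b1.hat*mean(X1) - b2.hat*mean(X2)

c(b0.hat, b1.hat, b2.hat)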

We obtain estimates very close to the population values $(\beta_0, \beta_1, \beta_2) = (4, 2, 1)$.

## (c)

In [4]:
b = reg(Y, cbind(rep(1,N), X1, X2))
c(b)

We obtain estimates numerically identical to those from part (b).
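The equivalence between parts (b) and (c) can also be checked against R's built-in `lm`. The following is a minimal sketch on freshly simulated data, not the notebook's own cell: the seed, sample size, and lowercase variable names are assumptions made for illustration, and the correlated regressors are built by hand in base R instead of with `mvrnorm`.

```r
# Illustrative data in the spirit of Exercise 1 (base R only)
set.seed(1)
n  = 500
x1 = rnorm(n, mean = 1, sd = 1)
x2 = 2 + (x1 - 1) + sqrt(3) * rnorm(n)   # Var[x2] = 4, Cov[x1, x2] = 1
y  = 4 + 2 * x1 + 1 * x2 + rnorm(n, sd = 2)

# Full multivariate regression
full = coef(lm(y ~ x1 + x2))

# Partial out x2 from both y and x1, then regress residual on residual
y.t2   = resid(lm(y ~ x2))
x1.t2  = resid(lm(x1 ~ x2))
fwl.b1 = cov(y.t2, x1.t2) / var(x1.t2)

# The two slope estimates on x1 agree up to floating-point error
print(all.equal(unname(full["x1"]), fwl.b1))
```

The same comparison works for the coefficient on `x2` by swapping the roles of the two regressors.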

## (d)
We know from univariate regression that

$$
\alpha_1 = \frac{Cov[Y,X_1]}{Var[X_1]}.
$$

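Let's work this out by writing $Y$ as its best linear predictor plus the prediction error, $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + (Y - BLP(Y|X_1,X_2))$:

$$
\alpha_1 = \frac{Cov[\beta_0 + \beta_1X_1 + \beta_2X_2 + (Y - BLP(Y|X_1,X_2)), X_1]}{Var[X_1]}
$$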

$$
= \beta_1 + \beta_2\frac{Cov[X_2,X_1]}{Var[X_1]} + \frac{Cov[Y - BLP(Y|X_1,X_2), X_1]}{Var[X_1]}
$$

Since the BLP residual has mean zero and is uncorrelated with each regressor, $Cov[Y - BLP(Y|X_1,X_2), X_1] = E[(Y - BLP(Y|X_1,X_2))X_1] = 0$, and the last term drops out:

$$
\alpha_1 = \beta_1 + \beta_2\frac{Cov[X_2,X_1]}{Var[X_1]},
$$

so that $\alpha_1\neq\beta_1$ unless $\beta_2=0$ (after controlling for $X_1$, $X_2$ is uncorrelated with $Y$) or $Cov[X_2,X_1]=0$ ($X_1$ is uncorrelated with $X_2$). In our case $Cov[X_1,X_2] = 1$, $\beta_2=1$, and $Var[X_1] = 1$, so that $\alpha_1 = 2 + 1\cdot\frac{1}{1} = 3$.

In [5]:
a1 = cov(Y, X1)/var(X1)
a1
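The formula derived above has an exact sample analogue, because OLS residuals are uncorrelated with the regressors in the sample. The sketch below checks this on illustrative simulated data; the seed, sample size, and variable names are assumptions, not taken from the notebook.

```r
# Illustrative data with the same population moments as Exercise 1
set.seed(2)
n  = 500
x1 = rnorm(n, mean = 1, sd = 1)
x2 = 2 + (x1 - 1) + sqrt(3) * rnorm(n)
y  = 4 + 2 * x1 + x2 + rnorm(n, sd = 2)

# Long regression (both regressors) and short regression (x1 only)
b          = coef(lm(y ~ x1 + x2))
a1.short   = cov(y, x1) / var(x1)

# Slope implied by the omitted-variable formula
a1.implied = unname(b["x1"] + b["x2"] * cov(x2, x1) / var(x1))

# The short-regression slope equals the implied slope
print(all.equal(a1.short, a1.implied))
```

In the sample, as in the population, the short-regression slope equals the long-regression slope plus the omitted coefficient times the slope from regressing `x2` on `x1`.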

# Exercise 2: Gender Wage Gaps

In [6]:
url = "https://raw.githubusercontent.com/jtorcasso/teaching/master/econ210_fall2017/data/psid_1980.csv"
df = read.csv(url)

# Generating wage and log wage data
df$wage =  ifelse(df$inc_labor==0 | df$hours==0, NaN, df$inc_labor/df$hours)
df$logwage = log(df$wage)

# Generating works variable
df$works = ifelse(df$hours > 200, 1, 0)

## (a)

In [7]:
means = aggregate(wage ~ male, data=df, mean, na.rm=T)
means

male,wage
0,16.42977
1,28.10706


In [8]:
diff = means$wage[2] - means$wage[1]
diff

## (b)

In [9]:
df.cc = na.omit(df) # complete cases
N = dim(df.cc)[1]

Y = df.cc$wage
X = cbind(rep(1, N), df.cc$male)

b = reg(Y, X)
b

0
16.42977
11.6773


It appears that the OLS estimates from a regression of wages on a male indicator reproduce the group means from part (a): the estimate of the intercept $\beta_0$ is the mean wage of females (16.43), and the estimate of the slope $\beta_1$ is the difference in mean wages between males and females (11.68).

## (c)
First consider the slope parameter $\beta_1$. Because this is the univariate regression case, 
$$
\beta_1 = \frac{Cov[W,D]}{Var[D]} = \frac{E[WD] - E[W]E[D]}{E[D^2] - E[D]^2}
$$
By the Law of Iterated Expectations, $E[WD]=E[W\cdot 1|D=1]P(D=1) + E[W\cdot 0|D=0]P(D=0) = E[W|D=1]P(D=1)$ and $E[W] = E[W|D=1]P(D=1) + E[W|D=0]P(D=0)$. Because $D$ is binary, $E[D]=P(D=1)$, $E[D^2]=E[D]$, and $P(D=0) = 1 - P(D=1)$, so that
$$
\beta_1 = \frac{E[W|D=1]P(D=1) - [E[W|D=1]P(D=1) + E[W|D=0]P(D=0)]P(D=1)}{P(D=1)[1-P(D=1)]}
$$
$$
= \frac{E[W|D=1][P(D=1) - P(D=1)^2] - E[W|D=0]P(D=0)P(D=1)}{P(D=1)[1-P(D=1)]}
$$
$$
= \frac{E[W|D=1]P(D=1)[1 - P(D=1)] - E[W|D=0]P(D=1)[1-P(D=1)]}{P(D=1)[1-P(D=1)]}
$$
$$
= E[W|D=1] - E[W|D=0].
$$

Next, since $\beta_0=E[W] - \beta_1E[D]$ we have
$$
\beta_0 = E[W|D=1]P(D=1) + E[W|D=0]P(D=0) - \bigg(E[W|D=1] - E[W|D=0]\bigg)P(D=1)
$$
$$
= E[W|D=0]P(D=0) + E[W|D=0]P(D=1)
$$
$$
= E[W|D=0]P(D=0) + E[W|D=0][1 - P(D=0)] = E[W|D=0]
$$

## (d)
Sample means are sums, just as population expectations are sums (or integrals), so every step of the proof in part (c) carries through with sample means in place of expectations. The OLS estimates
$$
\hat{\beta}_1 = \frac{\hat{\sigma}_{WD}}{\hat{\sigma}_D^2} = \frac{\frac{1}{N}\sum_{i=1}^NW_iD_i - \bar{W}_N\bar{D}_N}{\frac{1}{N}\sum_{i=1}^ND_i^2 - \bar{D}_N^2}
$$
and
$$
\hat{\beta}_0 = \bar{W}_N - \hat{\beta}_1\bar{D}_N
$$
are just functions of sample means analogous to the population means in part (c).
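The sample version of the result can be verified directly. This is a minimal sketch on made-up data: the wage equation, seed, and names `w` and `d` are hypothetical and unrelated to the PSID file.

```r
# Hypothetical wages for two groups indexed by a binary indicator d
set.seed(3)
n = 400
d = rbinom(n, size = 1, prob = 0.5)
w = 15 + 10 * d + rnorm(n, sd = 5)

# OLS of w on an intercept and d
b = unname(coef(lm(w ~ d)))

# Group means
mean0 = mean(w[d == 0])
mean1 = mean(w[d == 1])

# Intercept equals the d = 0 mean; slope equals the difference in means
print(all.equal(b, c(mean0, mean1 - mean0)))
```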

## (e)
From class, we have that
$$
\beta_1 = \gamma_1 + \gamma_2\frac{Cov[S,D]}{Var[D]}
$$
and using our result from (c), this simplifies to
$$
\beta_1 = \gamma_1 + \gamma_2\bigg(E[S|D=1] - E[S|D=0]\bigg)
$$
Looking at this expression, we first note that it is likely $\gamma_2>0$, since more education should imply higher wages. Second, it is likely $E[S|D=1] - E[S|D=0] > 0$ since at this time (1980), men were still more educated than women. Therefore, it is likely that $\beta_1 > \gamma_1$. Let's see if this holds empirically. 

## (f)
It is likely that part of the positive gap captured by $\beta_1$ is associated with women having less schooling. So once we control for schooling (i.e., hold schooling fixed), the difference between male and female wages should be smaller.

## (g)

In [10]:
X = cbind(rep(1, N), df.cc$male, df.cc$edu)
g = reg(Y, X)
g[2]

My prediction was wrong! The wage gap appears to grow after controlling for education. Let's see if our assumption that males have more schooling was correct.

In [11]:
aggregate(edu ~ male, data=df.cc, mean, na.rm=T)

male,edu
0,13.39966
1,13.23971


Apparently, women in this sample have more schooling than men. Thus, once we control for schooling, the wage gap (i.e., $\gamma_1$) increases relative to $\beta_1$, the gap when we don't control for schooling. Why does this defy expectations? Because our sample is already heavily selected.
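To make the direction of the change concrete, the sample analogue of the formula from part (e) can be evaluated with the schooling means reported above. The identity holds exactly in the complete-case sample, since both the regressions and the schooling means use `df.cc`; writing $\bar{S}_1$ and $\bar{S}_0$ for the sample mean schooling of men and women is notation introduced here.

$$
\hat{\beta}_1 - \hat{\gamma}_1 = \hat{\gamma}_2\big(\bar{S}_1 - \bar{S}_0\big) = \hat{\gamma}_2\,(13.23971 - 13.39966) = -0.15995\,\hat{\gamma}_2
$$

So whenever $\hat{\gamma}_2 > 0$, the negative schooling difference implies $\hat{\beta}_1 < \hat{\gamma}_1$, which is the pattern found in part (g).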

## (h)
We are estimating the regression conditional on observing wages. But it is likely that in 1980, women who worked (for whom we observe wages) were very different from women who did not work. For instance, they were likely of higher ability and, therefore, had more schooling. Let's see if sample means for all men and women (including those who don't work) conform to our expectations.

In [12]:
aggregate(edu ~ male, data=df, mean)

male,edu
0,13.06799
1,13.16766


Indeed, males have more schooling when we don't restrict the sample to people who are working. Thus, selection is what drives our result that $\hat{\beta}_1 < \hat{\gamma}_1$. More importantly, if we were able to observe wages for everyone, the wage gap would likely be higher, since women who work are likely of higher ability (and therefore have higher wages) than women who choose not to work.

On the other hand, this could be completely wrong. It could be that women that don't work are of higher ability if they are able to match with higher ability husbands who can support a household with their income. But this argument is not consistent with the data saying that women not in the workforce have lower levels of schooling.
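The selection story can be illustrated with a toy simulation. Everything in this sketch is hypothetical: the ability, schooling, and participation equations are invented for illustration and are not estimated from the PSID data. It only shows that if ability raises both schooling and the chance of working, then mean schooling among workers exceeds mean schooling in the full group.

```r
# Toy selection model (all parameters are assumptions, not estimates)
set.seed(4)
n       = 10000
ability = rnorm(n)

# Schooling rises with ability
edu     = 12 + ability + rnorm(n)

# Participation is more likely for higher-ability individuals
works   = (ability + rnorm(n)) > 0

mean.all     = mean(edu)          # everyone
mean.workers = mean(edu[works])   # only those with observed wages

print(c(all = mean.all, workers = mean.workers))
```

Under these assumptions, restricting the sample to workers overstates the group's schooling, which is the mechanism suggested for women in parts (g) and (h).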