<h2> Exercise 9 - Omitted Factors </h2>

We now examine how omitted variables correlated with the explanatory variable can bias estimates.  We will do so by considering the specification
\begin{align}
y = \boldsymbol{\beta}\boldsymbol{x}' + u
\end{align}
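where $(\boldsymbol{x},u)\sim \text{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}_i)$ with $\boldsymbol{x}=(x_1,x_2,x_3)$ and $\boldsymbol{\mu}=\boldsymbol{0}$, for each of $i=1,2$. The two covariance matrices differ only in whether $x_2$ is correlated with $x_3$: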
\begin{align}
\boldsymbol{\Sigma}_1  =\begin{pmatrix}
1 & 0 & 0.5 & 0\\
0 & 1 & 0.5 & 0\\
0.5 & 0.5 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}
\end{align}
and
\begin{align}
\boldsymbol{\Sigma}_2  =\begin{pmatrix}
1 & 0 & 0.5 & 0\\
0 & 1 & 0 & 0\\
0.5 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}
\end{align}

In [33]:
using Distributions;
using PyPlot;

# Covariance matrices for (x_1, x_2, x_3, u); they differ only in Cov(x_2, x_3).
Sig1 = [1.0 0.0 0.5 0.0; 0.0 1.0 0.5 0.0; 0.5 0.5 1.0 0.0; 0.0 0.0 0.0 1.0];
Sig2 = [1.0 0.0 0.5 0.0; 0.0 1.0 0.0 0.0; 0.5 0.0 1.0 0.0; 0.0 0.0 0.0 1.0];
mu = zeros(4);  # every variable has mean zero

d1 = MvNormal(mu,Sig1);
d2 = MvNormal(mu,Sig2);

We now draw $N = 10000$ samples from each distribution and construct $y$ from a chosen coefficient vector $\boldsymbol{\beta}$.

In [39]:
N = 10000;

xu1 = rand(d1,N);
xu2 = rand(d2,N);

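# x1 and x2 hold the regressors drawn under Sig1 and Sig2 respectively.
# Each column is one draw; rows 1-3 are the variables x_1, x_2, x_3.
x1 = xu1[1:3,:];
x2 = xu2[1:3,:];

# Row 4 is the error term u.
u1 = xu1[4,:];
u2 = xu2[4,:];

# True coefficients, stored as a row vector.
beta = [0.64 0.78 0.25];

# y = beta*x' + u for every draw.
y1 = (beta*x1)' + u1;
y2 = (beta*x2)' + u2;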

We now construct the sample matrices $\boldsymbol{X}$ and $\boldsymbol{Y}$ for our regressions.  First, consider regressing $y$ on the full set of variables.
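
Each regression in this exercise uses the ordinary least squares estimator without an intercept, which is reasonable here because every variable has mean zero. With $\boldsymbol{X}$ holding one observation per row and $\hat{\boldsymbol{\beta}}$ written as a column vector,
\begin{align}
\hat{\boldsymbol{\beta}} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y}.
\end{align}
The code evaluates this with the backslash operator, which solves the normal equations $(\boldsymbol{X}'\boldsymbol{X})\hat{\boldsymbol{\beta}} = \boldsymbol{X}'\boldsymbol{Y}$ rather than forming the inverse explicitly.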

In [45]:
X1 = x1';
X2 = x2';
Y1 = y1;
Y2 = y2;

betaHat1 = (X1'*X1)\(X1'*Y1);
betaHat2 = (X2'*X2)\(X2'*Y2);

println("Full regression, Sig1: ", betaHat1)
println("Full regression, Sig2: ", betaHat2)
println("True coefficients: ", [0.64 0.78 0.25])

Full regression, Sig1: [0.6296747453668043; 0.7552913255509044; 0.26342516720001163]
Full regression, Sig2: [0.6321842390902478; 0.7914918771962408; 0.2508417840310814]
True coefficients: [0.64 0.78 0.25]


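In both cases the estimates are close to the specified value of $\boldsymbol{\beta}$: with every regressor included and $u$ uncorrelated with $\boldsymbol{x}$, each coefficient is within about 0.03 of its true value.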

Next, regress $y$ on the explanatory variable $x_1$ alone.

In [46]:
X1 = x1[1,:];
X2 = x2[1,:];
Y1 = y1;
Y2 = y2;

betaHat1 = (X1'*X1)\(X1'*Y1);
betaHat2 = (X2'*X2)\(X2'*Y2);

println("Explanatory variable only, Sig1: ", betaHat1)
println("Explanatory variable only, Sig2: ", betaHat2)
println("True coefficient: ", [0.64])

Explanatory variable only, Sig1: [0.7612060309089219]
Explanatory variable only, Sig2: [0.748261408411741]
True coefficient: [0.64]
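
The size of the gap can be worked out from the covariance matrices. As a sketch, using the standard omitted-variable-bias argument and assuming $u$ is uncorrelated with $\boldsymbol{x}$ (true under both $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$): as $N$ grows, the coefficient from regressing $y$ on $x_1$ alone converges to
\begin{align}
\frac{\text{Cov}(x_1,y)}{\text{Var}(x_1)} = \beta_1 + \beta_2\frac{\text{Cov}(x_1,x_2)}{\text{Var}(x_1)} + \beta_3\frac{\text{Cov}(x_1,x_3)}{\text{Var}(x_1)} = 0.64 + 0.78\cdot 0 + 0.25\cdot 0.5 = 0.765.
\end{align}
The first row of the two covariance matrices is identical, so the limit is the same in both specifications, and both estimates above land near it.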


In both cases the omitted variable $x_3$ is correlated with $x_1$, so the estimates (about 0.76 and 0.75) overshoot the true coefficient of 0.64.  Next, let's add $x_2$ back in.

In [47]:
X1 = x1[1:2,:]';
X2 = x2[1:2,:]';
Y1 = y1;
Y2 = y2;

betaHat1 = (X1'*X1)\(X1'*Y1);
betaHat2 = (X2'*X2)\(X2'*Y2);

println("x1 and x2, Sig1: ", betaHat1)
println("x1 and x2, Sig2: ", betaHat2)
println("True coefficients: ", [0.64 0.78])

x1 and x2, Sig1: [0.7583788575756066; 0.8843428034909009]
x1 and x2, Sig2: [0.755411526969277; 0.7944880085164532]
True coefficients: [0.64 0.78]
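
These values can be predicted from the covariance matrices. As a sketch, using the standard omitted-variable-bias argument and assuming $u$ is uncorrelated with the regressors: with $x_3$ omitted, the coefficients on $(x_1,x_2)$ converge to
\begin{align}
\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} + \beta_3\,\text{Var}\begin{pmatrix}x_1\\ x_2\end{pmatrix}^{-1}\text{Cov}\left(\begin{pmatrix}x_1\\ x_2\end{pmatrix},x_3\right).
\end{align}
Since $x_1$ and $x_2$ have unit variance and are uncorrelated with each other, the variance matrix is the identity, and the limits are
\begin{align}
\boldsymbol{\Sigma}_1:&\quad (0.64 + 0.25\cdot 0.5,\; 0.78 + 0.25\cdot 0.5) = (0.765,\; 0.905),\\
\boldsymbol{\Sigma}_2:&\quad (0.64 + 0.25\cdot 0.5,\; 0.78 + 0.25\cdot 0) = (0.765,\; 0.78).
\end{align}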


The coefficient on $x_1$ is still off in both cases, since $x_1$ is correlated with the omitted third variable $x_3$.  In the first specification the coefficient on $x_2$ is biased as well (about 0.88 against a true value of 0.78), because $x_2$ is also correlated with $x_3$.  However, in the second specification we <b>do</b> get a decent estimate of the coefficient on $x_2$, since this variable is uncorrelated with both $x_1$ and $x_3$.  Finally, we estimate using $x_1$ and $x_3$:

In [48]:
X1 = [x1[1,:] x1[3,:]];
X2 = [x2[1,:] x2[3,:]];
Y1 = y1;
Y2 = y2;

betaHat1 = (X1'*X1)\(X1'*Y1);
betaHat2 = (X2'*X2)\(X2'*Y2);

println("x1 and x3, Sig1: ", betaHat1)
println("x1 and x3, Sig2: ", betaHat2)
println("True coefficients: ", [0.64 0.25])

x1 and x3, Sig1: [0.38181585226372095; 0.774035714412056]
x1 and x3, Sig2: [0.6188294601506036; 0.26352967126902094]
True coefficients: [0.64 0.25]


In the first specification, $x_3$ is correlated with the omitted variable $x_2$, and because $x_1$ and $x_3$ are themselves correlated, <b>both</b> coefficients end up biased (about 0.38 and 0.77 against true values of 0.64 and 0.25).  In the second specification, with this correlation removed, including the single control $x_3$ is enough to obtain a decent estimate of the coefficient on the explanatory variable $x_1$.
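
As a cross-check on all four regressions, the large-sample limit of each estimate can be computed directly from the covariance matrices instead of from simulated data. The following is a minimal sketch and not part of the original exercise: the helper `olsLimit` and the names `SigX1`, `SigX2` and `b` are hypothetical. It assumes $u$ is uncorrelated with $\boldsymbol{x}$, so that for an included index set $k$ the limit is $\boldsymbol{\Sigma}_{kk}^{-1}\boldsymbol{\Sigma}_{k\cdot}\boldsymbol{\beta}'$, where $\boldsymbol{\Sigma}$ is the covariance matrix of $\boldsymbol{x}$ alone.

```julia
# Upper-left 3x3 blocks of Sig1 and Sig2: the covariance of (x_1, x_2, x_3).
SigX1 = [1.0 0.0 0.5; 0.0 1.0 0.5; 0.5 0.5 1.0]
SigX2 = [1.0 0.0 0.5; 0.0 1.0 0.0; 0.5 0.0 1.0]
b = [0.64, 0.78, 0.25]   # true coefficients as a column vector

# Hypothetical helper: large-sample limit of the OLS coefficients when only
# the regressors listed in `keep` are included. Cov(x_keep, y) equals
# SigX[keep, :] * b under the assumption that u is uncorrelated with x.
olsLimit(SigX, b, keep) = SigX[keep, keep] \ (SigX[keep, :] * b)

for keep in ([1, 2, 3], [1], [1, 2], [1, 3])
    println("Regressors ", keep, ", Sig1: ", olsLimit(SigX1, b, keep))
    println("Regressors ", keep, ", Sig2: ", olsLimit(SigX2, b, keep))
end
```

With regressors $x_1$ and $x_3$ this gives approximately $(0.38, 0.77)$ under $\boldsymbol{\Sigma}_1$, in line with the biased estimates above, and the true values $(0.64, 0.25)$ under $\boldsymbol{\Sigma}_2$.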