In [4]:
import numpy as np
from sklearn.linear_model import LinearRegression

### Omitted-variable Bias
https://en.wikipedia.org/wiki/Omitted-variable_bias

Suppose a dependent variable $y$ depends linearly on the independent
variables $x_1$ and $x_2$:

$$ y = a + bx_1 + cx_2 + u $$

with parameters $a$, $b$, $c$ and error term $u$.
Let's assume that $b \neq 0$, $c \neq 0$ and $\mathrm{cov}(x_1, x_2) \neq 0$
(variables $x_1$, $x_2$ are determinants of $y$ and they are correlated),
and that there is a relationship
$x_2 = d + fx_1 + e$ with parameters $d$, $f$ and error term $e$. Substituting this into the equation for $y$ gives

$$ y = a + bx_1 + cd + cfx_1 + ce + u = (a + cd) + (b + cf)x_1 + ce + u $$

When we omit $x_2$ from our analysis, our estimated effect of $x_1$ is the total derivative
$\frac{\mathrm{d} y}{\mathrm{d} x_1} = b + cf$ instead of just the partial derivative $\frac{\partial y}{\partial x_1} = b$.
The difference $cf$ is the omitted-variable bias.
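
The same result can be sketched in terms of covariances. This assumes, in addition to the setup above, that $x_1$ is uncorrelated with both error terms, i.e. $\mathrm{cov}(x_1, u) = 0$ and $\mathrm{cov}(x_1, e) = 0$, so that $f = \mathrm{cov}(x_1, x_2) / \mathrm{var}(x_1)$. The slope of a regression of $y$ on $x_1$ alone then converges to

$$ \frac{\mathrm{cov}(x_1, y)}{\mathrm{var}(x_1)} = \frac{\mathrm{cov}(x_1, a + bx_1 + cx_2 + u)}{\mathrm{var}(x_1)} = b + c \, \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)} = b + cf $$

Under these assumptions the bias $cf$ vanishes only if $c = 0$ or $f = 0$, which is exactly what the conditions $c \neq 0$ and $\mathrm{cov}(x_1, x_2) \neq 0$ rule out.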

In [40]:
np.random.seed(1)
n = 1000  # number of samples
a = 1
b = 2.5
c = 2
d = 1.5
f = 3
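u = np.random.randn(n) / 10  # error term of y (standard deviation 0.1)
e = np.random.randn(n) / 10  # error term of x_2 (standard deviation 0.1)

x_1 = np.random.rand(n)  # uniform on [0, 1)
x_2 = d + f * x_1 + e  # x_2 depends on x_1, so the two are correlated
y = a + b * x_1 + c * x_2 + u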

In [41]:
X = np.column_stack([x_1, x_2])
lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)

[2.43156422 2.02081192] 0.9755872695286598


When we include both $x_1$ and $x_2$, we can see that our
estimate of the effect of $x_1$ is close to the true effect $b=2.5$.

In [42]:
X = np.column_stack([x_1])  # omit x_2: design matrix with x_1 only
lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)

[8.49380426] 4.01242433899085


When we omit $x_2$, we can see that our
estimate of the effect of $x_1$ is approximately equal to $b + cf = 2.5 + 2 \cdot 3 = 8.5$,
and the estimated intercept is close to $a + cd = 1 + 2 \cdot 1.5 = 4$.
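
The decomposition $b + cf$ can also be checked on the estimates themselves. The following is a minimal sketch, not part of the original analysis: it draws a fresh sample with the same parameter values (using its own random generator, so the numbers differ slightly from the outputs above) and uses a small hypothetical helper `ols` built on `np.linalg.lstsq` instead of scikit-learn. It compares the slope from the regression without $x_2$ to the estimate of $b$ plus the estimate of $c$ times the slope from an auxiliary regression of $x_2$ on $x_1$.

```python
import numpy as np


def ols(features, target):
    """Least-squares fit with an intercept; returns (intercept, slopes).

    Hypothetical helper for this sketch, not a scikit-learn function.
    """
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0], coef[1:]


# Same parameter values as in the cells above, but a separate random sample.
rng = np.random.default_rng(0)
n = 1000
a, b, c, d, f = 1, 2.5, 2, 1.5, 3
x_1 = rng.random(n)
x_2 = d + f * x_1 + rng.standard_normal(n) / 10
y = a + b * x_1 + c * x_2 + rng.standard_normal(n) / 10

# Regression with both variables: estimates of b and c.
_, (b_long, c_long) = ols(np.column_stack([x_1, x_2]), y)
# Auxiliary regression of x_2 on x_1: estimate of f.
_, (f_hat,) = ols(x_1, x_2)
# Regression with x_2 omitted: biased estimate of the effect of x_1.
_, (b_short,) = ols(x_1, y)

print("slope with x_2 omitted:     ", b_short)
print("b_long + c_long * f_hat:    ", b_long + c_long * f_hat)
```

For ordinary least squares the two printed values should agree up to floating-point error, because the decomposition holds exactly within a sample, not only in expectation. Both should also be close to the theoretical value $b + cf = 8.5$.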