# Is it a good idea to use classical regression to identify causal effects?

Hans Olav Melberg (hans.melberg@gmail.com), University of Oslo, October, 2016

#### Example

$$ Y = a + b D + c X + e $$

- Y is the outcome
- D is an intervention (dummy)
- X are confounding variables (affect both Y and D)
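The role of X can be illustrated with a small simulation. This is a minimal sketch, not part of the original notebook: the parameter values (`a`, `b`, `c`), the way D depends on X, and the helper `ols` are all assumptions chosen for illustration. It fits the model with and without the confounder and compares the estimated b.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a, b, c = 1.0, 2.0, 3.0  # hypothetical true parameters

X = rng.normal(size=n)                           # confounder
D = (X + rng.normal(size=n) > 0).astype(float)   # treatment is more likely when X is high
e = rng.normal(size=n)
Y = a + b * D + c * X + e


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) using least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b_with_x = ols(Y, D, X)[1]     # confounder included
b_without_x = ols(Y, D)[1]     # confounder omitted
print(f"true b = {b}, with X: {b_with_x:.3f}, without X: {b_without_x:.3f}")
```

In this setup the estimate that controls for X should land close to the true b, while the estimate that omits X also picks up the effect of X on Y.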

#### Under what circumstances does the coefficient b capture the causal effect of D on Y?

The estimator should be unbiased, $$E(\hat b) = b$$, and have the least possible variance.

#### Standard assumptions

- No measurement errors (weak exogeneity)
- No heteroscedasticity (constant error variance)
- No autocorrelation (Independent errors)
- Exogeneity (All confounders included)
- No multicollinearity
- Existence
- Linearity/additivity/correct functional form/model/Errors normally distributed (Neither are really necessary)


- http://people.duke.edu/~rnau/testing.htm
- https://economictheoryblog.com/2015/04/01/ols_assumptions/


#### Some surprises

- If effects are heterogeneous, b is NOT the same as the average treatment effect!
- Including more variables may increase bias (lagged versions of the dependent variable are especially problematic)

#### Explaining and simulating this

#### Inclusion of some variables may create bias!

- Theory: Colliders in a DAG
- Example: SAT scores, motivation, acceptance to college
- Example (from Hein?)
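The SAT example can be simulated as follows. This is a hedged sketch under assumed numbers: SAT score and motivation are drawn independently, and acceptance (the collider) depends on both plus noise. The threshold of 1.0 and the helper `ols` are illustrative choices, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

sat = rng.normal(size=n)          # standardized SAT score
motivation = rng.normal(size=n)   # independent of sat by construction
# Collider: acceptance is caused by both SAT score and motivation
accepted = (sat + motivation + rng.normal(size=n) > 1.0).astype(float)


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) using least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


slope_plain = ols(motivation, sat)[1]               # collider left out
slope_collider = ols(motivation, sat, accepted)[1]  # collider included as a control
print(f"without collider: {slope_plain:.3f}, with collider: {slope_collider:.3f}")
```

Without the collider the slope should be close to zero, as it should be for two independent variables. Adding acceptance as a control opens a path between them and produces a negative slope that has no causal meaning.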



#### Heterogeneous effects

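Theory: the estimator does not weight all observations equally. Since it tries to minimize variance, it puts more weight on groups where the treatment D varies most given X, so b is a weighted average of the group effects rather than the average treatment effect.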
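A small simulation can make the weighting visible. This is a sketch under assumed numbers, not the author's example: two groups defined by a binary X, with hypothetical treatment probabilities (0.5 and 0.9) and hypothetical treatment effects (1 and 3). With a binary X as the only control, the OLS coefficient on D is expected to be close to an average of the group effects weighted by the within-group variance of D, p(1 - p).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

X = rng.integers(0, 2, size=n).astype(float)   # two equally likely groups
p_treat = np.where(X == 1, 0.9, 0.5)           # hypothetical treatment probabilities
D = (rng.random(n) < p_treat).astype(float)
effect = np.where(X == 1, 3.0, 1.0)            # hypothetical heterogeneous effects
Y = 0.5 * X + effect * D + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) using least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


ate = effect.mean()                            # average treatment effect in the sample
b_ols = ols(Y, D, X)[1]                        # coefficient on D, controlling for X

# Average of the effects weighted by the conditional variance of D, p(1 - p)
var_d = p_treat * (1 - p_treat)
weighted = (var_d * effect).sum() / var_d.sum()

print(f"ATE: {ate:.3f}, OLS b: {b_ols:.3f}, variance-weighted effect: {weighted:.3f}")
```

The group with treatment probability 0.9 has little variation in D, so its larger effect gets less weight and the OLS coefficient falls below the average treatment effect.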

In [6]:
import statsmodels.formula.api as smf
import numpy as np
import pandas

In [7]:
x = np.arange(30, dtype=float)

# Make some y data where each observation has its own random slope (mean 10, sd 2.4)

y = x*(10. + 2.4*np.random.randn(30)) + 200
print(x,y)

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.  14.
  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.] [ 200.          212.7214584   218.70649316  227.33922487  226.10617952
  255.42794394  264.26289185  269.96858159  263.21216535  260.68723603
  259.92537188  315.82022216  320.45417986  322.07054861  342.22975821
  379.51038916  323.43978779  292.03432847  462.34257825  426.0733861
  359.65638486  375.9447247   435.15762314  371.67104907  385.71312485
  340.29093922  533.88568602  432.07313609  627.77225113  487.3451977 ]
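Because each observation in the data above has its own slope, the scatter around the average line grows with x. The following sketch regenerates data of the same form with a fixed seed (an assumption, so the numbers differ from the output printed above), fits a straight line with NumPy, and compares the residual spread in the lower and upper half of x.

```python
import numpy as np

rng = np.random.default_rng(3)  # seed is an assumption; the notebook does not set one

x = np.arange(30, dtype=float)
y = x * (10.0 + 2.4 * rng.standard_normal(30)) + 200

# Fit y = intercept + slope * x by least squares
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef

spread_low = resid[:15].std()    # residual spread for x = 0..14
spread_high = resid[15:].std()   # residual spread for x = 15..29
print(f"intercept: {coef[0]:.1f}, slope: {coef[1]:.2f}")
print(f"residual sd, low x: {spread_low:.1f}, high x: {spread_high:.1f}")
```

A single fitted slope summarizes the individual slopes, and the larger residual spread at high x shows that the constant error variance assumption does not hold for data generated this way.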


In [None]:
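# Assumption: df is not defined above; the column names match the Guerry data set from the statsmodels docs
import statsmodels.api as sm
df = sm.datasets.get_rdataset("Guerry", "HistData").data
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())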