In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Exercise 1

In the lecture, we argued that the original model suffers from endogeneity bias because income is likely to affect institutional development

Although endogeneity is often best identified by thinking about the data and model, we can formally test for endogeneity using the **Hausman test**

We want to test for correlation between the endogenous variable, $avexpr_i$, and the errors, $u_i$

$H_0: Cov(avexpr_i, u_i) = 0 \quad (\text{no endogeneity})$

$H_1: Cov(avexpr_i, u_i) \neq 0 \quad (\text{endogeneity})$

This test is run in two stages

First, we regress $avexpr_i$ on the instrument $logem4_i$

$avexpr_i = \pi_0 + \pi_1logem4_i + v_i$

Second, we retrieve the residuals $\hat{v}_i$ and include them in the original equation

$logpgp95_i = \beta_0 + \beta_1 avexpr_i + \alpha \hat{v}_i + u_i$

If $\alpha$ is statistically significant (with a p-value < 0.05), then we reject the null hypothesis and conclude that $avexpr_i$ is endogenous

Using the above information, estimate a Hausman test and interpret your results

In [2]:
df = (
    pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable1.dta')
    [['avexpr', 'logem4', 'logpgp95']]
    .assign(constant=1)
    .dropna()
    .reset_index(drop=True)
)
df.head()

     avexpr    logem4  logpgp95  constant
0  5.363636  5.634789  7.770645         1
1  6.386364  4.232656  9.133459         1
2  9.318182  2.145931  9.897972         1
3  4.454545  5.634789  6.845880         1
4  5.136364  4.268438  6.877296         1


In [3]:
get_resid_model = sm.OLS(
    df['avexpr'],
    df[['constant', 'logem4']],
).fit()
print(get_resid_model.summary())

                            OLS Regression Results                            
Dep. Variable:                 avexpr   R-squared:                       0.305
Model:                            OLS   Adj. R-squared:                  0.294
Method:                 Least Squares   F-statistic:                     29.80
Date:                Wed, 27 Mar 2019   Prob (F-statistic):           7.29e-07
Time:                        17:15:00   Log-Likelihood:                -116.90
No. Observations:                  70   AIC:                             237.8
Df Residuals:                      68   BIC:                             242.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       9.5146      0.548     17.359      0.0

In [4]:
df['vhat'] = get_resid_model.resid

In [5]:
check_endog = sm.OLS(
    df['logpgp95'],
    df[['constant', 'avexpr', 'vhat']]
).fit()
print(check_endog.summary())

                            OLS Regression Results                            
Dep. Variable:               logpgp95   R-squared:                       0.689
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     74.05
Date:                Wed, 27 Mar 2019   Prob (F-statistic):           1.07e-17
Time:                        17:15:00   Log-Likelihood:                -62.031
No. Observations:                  70   AIC:                             130.1
Df Residuals:                      67   BIC:                             136.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       2.3702      0.565      4.197      0.0

The coefficient on $\hat{v}$ is statistically significant so we reject the null hypothesis and conclude there is endogeneity
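
The two-stage logic can also be checked on data where the answer is known in advance. The block below is a minimal sketch on synthetic data (not the lecture's dataset): the variable names, coefficients and the helper `ols` are all assumptions made for illustration. The regressor is built to share a shock with the error term, so the coefficient on the first-stage residuals should come out clearly different from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: an assumption for illustration, not the lecture's dataset
z = rng.normal(size=n)                            # instrument
common = rng.normal(size=n)                       # shock shared by x and u
x = 1.0 + 0.8 * z + common + rng.normal(size=n)   # endogenous regressor
u = common + rng.normal(size=n)                   # error, correlated with x
y = 2.0 + 0.5 * x + u


def ols(X, y):
    """Hypothetical helper: OLS coefficients, residuals, conventional std errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, resid, se


const = np.ones(n)

# Stage 1: regress the endogenous regressor on the instrument, keep residuals
_, v_hat, _ = ols(np.column_stack([const, z]), x)

# Stage 2: add the first-stage residuals to the original equation
beta, _, se = ols(np.column_stack([const, x, v_hat]), y)

alpha_hat = beta[2]
t_alpha = alpha_hat / se[2]
print(f"alpha_hat = {alpha_hat:.3f}, t-statistic = {t_alpha:.2f}")
```

A t-statistic on `v_hat` that is large in absolute value corresponds to rejecting $H_0$, which is the expected outcome here because endogeneity was built into the simulated data.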

# Exercise 2

The OLS parameter $\beta$ can also be estimated using matrix algebra and `numpy` (you may need to review the [numpy](https://lectures.quantecon.org/py/numpy.html) lecture to complete this exercise)

The linear equation we want to estimate is (written in matrix form)

$$y = X\beta + u$$

To solve for the unknown parameter $\beta$ we want to minimise the sum of squared residuals

$$\underset{\hat{\beta}}{\min} \hat{u}'\hat{u}$$

Rearranging the first equation and substituting into the second equation, we can write

$$\underset{\hat{\beta}}{\min} (y - X\hat{\beta})' (y - X\hat{\beta})$$

Solving this optimization problem gives the solution for the $\hat{\beta}$ coefficients

$$\hat{\beta} = (X'X)^{-1}X'y$$
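
The step from the minimisation problem to this formula can be sketched as follows, assuming $X$ has full column rank so that $X'X$ is invertible. Expanding the objective gives

$$(y - X\hat{\beta})'(y - X\hat{\beta}) = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta}$$

Setting the derivative with respect to $\hat{\beta}$ equal to zero gives the normal equations

$$-2X'y + 2X'X\hat{\beta} = 0 \quad \Longrightarrow \quad X'X\hat{\beta} = X'y$$

Premultiplying both sides by $(X'X)^{-1}$ yields the expression for $\hat{\beta}$ above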

Using the above information, compute $\hat{\beta}$ from model 1 using `numpy`; your results should be the same as those in the `statsmodels` output from earlier in the lecture

In [6]:
df = (
    pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable1.dta')
    [['avexpr', 'logpgp95']]
    .assign(constant=1)
    .dropna()
    .reset_index(drop=True)
)


In [7]:
# Do Statsmodels way first for easy comparison
ex2_mod = sm.OLS(
    df['logpgp95'],
    df[['constant', 'avexpr']]
).fit()
print(ex2_mod.summary())

                            OLS Regression Results                            
Dep. Variable:               logpgp95   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.608
Method:                 Least Squares   F-statistic:                     171.4
Date:                Wed, 27 Mar 2019   Prob (F-statistic):           4.16e-24
Time:                        17:15:01   Log-Likelihood:                -119.71
No. Observations:                 111   AIC:                             243.4
Df Residuals:                     109   BIC:                             248.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       4.6261      0.301     15.391      0.0

In [8]:
x = df[['constant', 'avexpr']].to_numpy()
y = df['logpgp95'].to_numpy()

In [9]:
b_hat = np.linalg.inv(x.T @ x) @ (x.T @ y)
b_hat

array([4.62608941, 0.53187135])

In [10]:
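# Exercise 2 preferred solution: solve the normal equations (X'X) b = X'y
# directly rather than forming the explicit inverse of X'X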
b_hat2 = np.linalg.solve(x.T @ x, x.T @ y)
b_hat2

array([4.62608941, 0.53187135])
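
The two approaches above agree because both solve the same normal equations; `np.linalg.solve` simply avoids forming the inverse explicitly. As a hedged illustration, the sketch below compares them on synthetic data (the sample size, coefficients and noise scale are assumptions, not values from the lecture) and adds `np.linalg.lstsq`, a third option that works on `X` directly without forming `X.T @ X`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic design matrix with a constant column (illustrative values only)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([4.6, 0.5]) + rng.normal(scale=0.7, size=n)

# 1. Textbook formula with an explicit inverse
b_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# 2. Solve the normal equations (X'X) b = X'y without inverting X'X
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Least-squares solver applied to X directly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_inv, b_solve), np.allclose(b_solve, b_lstsq))
```

On a well-conditioned problem like this one the three estimates coincide up to floating-point error; the differences matter mainly when the columns of `X` are close to collinear.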