In [32]:
import numpy as np
from numpy import linalg as la
from scipy.stats import chi2
from tabulate import tabulate
import LinearModelsWeek3 as lm
import pandas as pd
%load_ext autoreload
%autoreload 2



In [33]:
y, x, N, T, year, label_y, label_x = lm.load_example_data()

---

### Our dataset

| Variable | Content |
|-|-|
| nr | Variable that identifies the individual  |
| year | Year of observation |
| Black | Black |
| Hisp | Hispanic |
| Educ | Years of schooling |
| Exper | Years since left school |
| Expersq | $Exper^2$ |
| Married | Marital status |
| Union | Union membership |
| Lwage | Natural logarithm of hourly wages |


---

# Problem set introduction
Last time we briefly discussed how the presence of fixed effects can make the pooled OLS (POLS) estimator inconsistent. To recap, consider the following model,

$$ y_{it} = \boldsymbol{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \quad i=1, \dotsc, N \quad t=1, \dotsc, T \tag{1} $$

where $c_i$ is an unobservable individual specific component which is constant across time. We consider two different scenarios: 

* **Part 1:** If $c$ is systematically related to one or more of the observed variables in the sense of $E[c_{i}\boldsymbol{x}_{it}] \neq \boldsymbol{0}$, then the POLS estimator is _not_ consistent for $\boldsymbol{\beta}$.
* **Part 2:** If $c_i$ is uncorrelated with the regressors such that $E[c_i\boldsymbol{x}_{it}]=\boldsymbol{0}$ for all $t$, then $\boldsymbol{\beta}$ can be consistently estimated by pooled OLS (POLS) and random effects (RE).
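
The two scenarios can be illustrated with a small simulation. This is only a sketch on made-up data: the sample sizes, the 0.8 loading of $x_{it}$ on $c_i$, and the single-regressor model are all assumptions chosen for illustration and have nothing to do with the wage data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 2000, 8, 1.0                    # hypothetical panel dimensions and true slope

c = rng.normal(size=(N, 1))                  # individual effect c_i, constant over time
x = 0.8 * c + rng.normal(size=(N, T))        # regressor correlated with c_i (Part 1 scenario)
u = rng.normal(size=(N, T))                  # idiosyncratic error
y = beta * x + c + u

# POLS: regress y on x and ignore c_i (all variables have mean zero, so no constant).
b_pols = (x * y).sum() / (x * x).sum()

# FE: subtract each individual's time average, then run the same regression.
x_dd = x - x.mean(axis=1, keepdims=True)
y_dd = y - y.mean(axis=1, keepdims=True)
b_fe = (x_dd * y_dd).sum() / (x_dd * x_dd).sum()

print(f"POLS: {b_pols:.3f}, FE: {b_fe:.3f}, true beta: {beta}")
```

In this design POLS picks up the correlation between $x_{it}$ and $c_i$ and overshoots the true slope, while the demeaned regression stays close to it. Setting the loading to zero would put us in the Part 2 scenario, where both estimators are consistent.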

### Example
Let's take a look at a concrete example. We are interested in the effect of unionization on wages, which could be modelled as

$$
\log(wage_{it}) = \beta_0 + \beta_1\textit{union}_{it} + c_{i} + u_{it} \tag{2}
$$

What could be in $c_i$ that may be correlated with union membership? Let us first calculate the average union participation by taking the mean of the union variable.

In [34]:
mean_union = x[:, -1].mean()
print(f'About {mean_union * 100:.2f}% of our sample is in a union.')

About 24.40% of our sample is in a union.


There are some fixed effects that we could control for. For example, we may believe that African Americans are more or less prone to unionizing because of socioeconomic factors. We can look at the conditional mean for Blacks and Hispanics.

In [35]:
black_union = x[x[:, 1] == 1, -1].mean()
hispanic_union = x[x[:, 2] == 1, -1].mean()
print(f'If we look at the unionization of some sub populations, Black membership is at {black_union * 100:.2f}%, Hispanic membership is at {hispanic_union * 100:.2f}%.')

If we look at the unionization of some sub populations, Black membership is at 37.10%, Hispanic membership is at 27.35%.


Ethnicity may therefore be a fixed effect which is systematically related to $\textit{union}$ (again, most likely ethnicity does not affect union, but it might be a proxy for some socio-economic factors that affect union membership). In our data, this is something which we can control for by including controls in our regression.

We therefore consider the somewhat more elaborate model from last time,

$$
\begin{align}
\log(wage_{it}) & =\beta_{0}+\beta_{1}\textit{exper}_{it}+\beta_{2}\textit{exper}_{it}^{2}+\beta_{3}\textit{union}_{it}+\beta_{4}\textit{married}_{it} +\beta_{5}\textit{educ}_{i}+\beta_{6}\textit{hisp}_{i}+\beta_{7}\textit{black}_{i}+c_{i}+u_{it}. \tag{3}
\end{align}
$$

This should solve some of our problems compared to eq. (2), but we still have an issue. If, for example, people select into union or non-union jobs based on which sector rewards their innate characteristics best, then $E[union_{it}c_i]\neq0$.

## Part 1: Compare POLS to FE/FD
### Question 1:

Start by estimating eq. (3) by POLS. You should already have all the data and code that you need. Print the results in a nice table. Is the unionization coefficient statistically significant?

In [73]:

est_pols = lm.estimate(y=y, x=x)
lm.print_table(labels=(label_y, label_x), results=est_pols, title='POLS-estimator')

# First, regress y on x without any transformations. Store the resulting dictionary.
# Then, print the resulting dictionary using the provided print_table() function. The labels should have been provided to you.

POLS-estimator
Dependent variable: Log wage

                       Beta           Se    t-values
--------------  -----------  -----------  ----------
Constant        -0.0347057   0.064569      -0.537498
Black           -0.143842    0.0235595     -6.10546
Hispanic         0.015698    0.0208112      0.754305
Education        0.0993878   0.0046776     21.2476
Experience       0.0891791   0.010111       8.81996
Experience sqr  -0.00284866  0.000707362   -4.02716
Married          0.107666    0.0156965      6.85922
Union            0.180073    0.0171205     10.5179
R² = 0.187
σ² = 0.231


---

The unionization coefficient is strongly statistically significant, with a t-value above 10.

---

You should get a table that looks like this:

Pooled OLS <br>
Dependent variable: Log wage <br>

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Constant       | -0.0347 | 0.0646 |    -0.5375 |
| Black          | -0.1438 | 0.0236 |    -6.1055 |
| Hispanic       |  0.0157 | 0.0208 |     0.7543 |
| Education      |  0.0994 | 0.0047 |    21.2476 |
| Experience     |  0.0892 | 0.0101 |     8.8200 |
| Experience sqr | -0.0028 | 0.0007 |    -4.0272 |
| Married        |  0.1077 | 0.0157 |     6.8592 |
| Union          |  0.1801 | 0.0171 |    10.5179 |
R² = 0.187 <br>
σ² = 0.231

### Short recap of fixed effects

As discussed last time, one way to control for fixed effects is to "demean" the data. We need to calculate the mean within each person, so we define $\bar{y}_{i}=T^{-1}\sum_{t=1}^{T}y_{it}$, $\mathbf{\bar{x}}_{i}=T^{-1}\sum_{t=1}^{T}\mathbf{x}_{it}$, $\bar{u}_{i}=T^{-1}\sum_{t=1}^{T}u_{it}$, and $\bar{c}_{i} = T^{-1}\sum_{t=1}^{T}c_{i} = c_{i}$.

Subtracting these means from eq. (1) we are able to demean away the fixed effects,

$$
\begin{align}
y_{it}-\bar{y}_{i} & =\left(\mathbf{x}_{it}-\mathbf{\bar{x}}_{i}\right)\mathbf{\beta}+(\color{red}{c_{i}-c_{i}})+\left(u_{it}-\bar{u}_{i}\right) \\
\Leftrightarrow\ddot{y}_{it} & =\ddot{\mathbf{x}}_{it}\mathbf{\beta} + \ddot{u}_{it}. \tag{4}
\end{align}
$$

Subtracting the mean within each person is not immediately easy. But you are provided with a `perm` function that takes a "transformation matrix" Q and uses it to transform some vector or matrix A, one person at a time.

In order to demean the data, we need to give this `perm` function the following transformation matrix:

$$
\mathbf{Q}_{T}:=\mathbf{I}_{T}-\left(\begin{array}{ccc}
1/T & \ldots & 1/T\\
\vdots & \ddots & \vdots\\
1/T & \ldots & 1/T
\end{array}\right)_{T\times T}.
$$
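
To see what $\mathbf{Q}_{T}$ does, here is a minimal sketch on toy data. The helper `apply_by_unit` is a hypothetical stand-in for the provided `perm` function; it assumes the rows of the data are stacked person by person with exactly $T$ rows each, which may or may not match how `perm` is implemented.

```python
import numpy as np

def apply_by_unit(Q, A, T):
    """Hypothetical stand-in for `perm`: apply Q to each consecutive block of T rows of A."""
    N = A.shape[0] // T
    blocks = A.reshape(N, T, -1)             # shape (N, T, K)
    out = Q @ blocks                         # matmul broadcasts Q over the N blocks
    return out.reshape(N * Q.shape[0], -1)

T = 4
Q_T = np.eye(T) - np.full((T, T), 1 / T)     # identity minus a T x T matrix of 1/T

# Two individuals with T = 4 periods each and one variable.
a = np.array([1., 2., 3., 6., 10., 10., 10., 10.]).reshape(-1, 1)
a_demeaned = apply_by_unit(Q_T, a, T)
print(a_demeaned.ravel())
```

The first person's values are centred around their own mean of 3, and the second person's constant series becomes all zeros. That is exactly why time-invariant regressors drop out of the FE regression.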

### Question 2:
Estimate eq. (3) by fixed effects. You need to perform the following steps:
* Create the demeaning matrix Q.
* Demean x and y using the `perm` function and Q.
* Remove the columns in the demeaned x that are only zeroes (remember to shorten the `label_x` as well).
* Estimate y on x using the demeaned arrays.
* Print it out in a nice table.

In [37]:
def demeaning_matrix(T):
    Q_T = np.eye(T) - np.full((T, T), 1 / T)  # identity minus a T x T matrix filled with 1/T
    return Q_T

In [38]:
Q_T = demeaning_matrix(T)

In [39]:
y_demean = lm.perm(Q_T, y)
x_demean = lm.perm(Q_T, x)

In [40]:
lm.check_rank(x_demean)

'The matrix is NOT full rank with rank = 4. Eliminate linearly dependent columns.'

In [41]:
pd.DataFrame(x_demean, columns=label_x).head()

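|    | Constant | Black | Hispanic | Education | Experience | Experience sqr | Married | Union |
|----|----------|-------|----------|-----------|------------|----------------|---------|-------|
| 0  | 0.0 | 0.0 | 0.0 | 0.0 | -3.5 | -24.5 | 0.0 | -0.125 |
| 1  | 0.0 | 0.0 | 0.0 | 0.0 | -2.5 | -21.5 | 0.0 | 0.875 |
| 2  | 0.0 | 0.0 | 0.0 | 0.0 | -1.5 | -16.5 | 0.0 | -0.125 |
| 3  | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 | -9.5 | 0.0 | -0.125 |
| 4  | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | -0.5 | 0.0 | -0.125 |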


In [42]:
x_demean = x_demean[:, 4:]
lm.check_rank(x_demean)

'The matrix is of full rank with rank = 4'

In [47]:
label_x_fe = label_x[4:]

In [75]:
## FILL IN
# The steps are outlined in question 2 above.


est_fe = lm.estimate(y=y_demean, x=x_demean, transform='fe', T=T) # fill in
lm.print_table(labels=(label_y, label_x_fe), results = est_fe, title='FE-estimator')

FE-estimator
Dependent variable: Log wage

                       Beta           Se    t-values
--------------  -----------  -----------  ----------
Experience       0.116847    0.00841968     13.8778
Experience sqr  -0.00430089  0.000605274    -7.10569
Married          0.0453033   0.0183097       2.47428
Union            0.0820871   0.0192907       4.25526
R² = 0.178
σ² = 0.123


You should get a table that looks like this:

FE regression<br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1168 | 0.0084 |    13.8778 |
| Experience sqr | -0.0043 | 0.0006 |    -7.1057 |
| Married        |  0.0453 | 0.0183 |     2.4743 |
| Union          |  0.0821 | 0.0193 |     4.2553 |
R² = 0.178 <br>
σ² = 0.123

## Short recap of first differences

The within transformation is one particular transformation that enables us to get rid of $c_{i}$. An alternative is the first-difference transformation. To see how it works, lag equation (1) one period and subtract it from (1) such that

\begin{equation}
\Delta y_{it}=\Delta\mathbf{x}_{it}\mathbf{\beta}+\Delta u_{it},\quad t=\color{red}{2},\dotsc,T, \tag{5}
\end{equation}

where $\Delta y_{it}:=y_{it}-y_{it-1}$, $\Delta\mathbf{x}_{it}:=\mathbf{x}_{it}-\mathbf{x}_{it-1}$ and $\Delta u_{it}:=u_{it}-u_{it-1}$. As was the case for the within transformation, first differencing eliminates the time invariant component $c_{i}$. Note, however, that one time period is lost when differencing.

In order to first difference the data, we can pass the following transformation matrix to the `perm` function,

$$
\mathbf{D}:=\left(\begin{array}{cccccc}
-1 & 1 & 0 & \ldots & 0 & 0\\
0 & -1 & 1 &  & 0 & 0\\
\vdots &  &  & \ddots &  & \vdots\\
0 & 0 & 0 & \ldots & -1 & 1
\end{array}\right)_{(T - 1)\times T}.
$$
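
As a quick check of what $\mathbf{D}$ does, the sketch below applies it to a single made-up series. The numbers are arbitrary; the point is only that the product equals the period-to-period changes and that any time-constant component disappears.

```python
import numpy as np

T = 5
D = (-np.eye(T) + np.eye(T, k=1))[:-1]       # (T-1) x T first-difference matrix

y_i = np.array([2.0, 3.0, 5.0, 4.0, 8.0])    # one individual's series (toy values)
c_i = 7.0                                    # any time-constant component

dy = D @ y_i                                 # y_t - y_{t-1} for t = 2, ..., T
dy_with_c = D @ (y_i + c_i)                  # adding c_i leaves the differences unchanged

print(dy)
print(np.allclose(dy, dy_with_c))
```

The result has $T-1$ entries rather than $T$, which is the lost time period mentioned above.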

### Question 3:
Estimate eq. (3) by first differences. You need to perform the following steps:
* Create the first difference matrix D.
* First difference x and y using the `perm` function and D.
* Remove the columns in the first differenced x that are only zeroes (remember to shorten the `label_x` as well).
* Estimate y on x using the first differenced arrays.
* Print it out in a nice table.

In [49]:
def fd_matrix(T):
    D_T = -np.eye(T)+np.eye(T,k=1) # Fill in
    D_T = D_T[:-1]
    return D_T

In [53]:
D_T = fd_matrix(T)
D_T

array([[-1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0., -1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0., -1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0., -1.,  1.]])

In [54]:
y_diff = lm.perm(D_T, y)
x_diff = lm.perm(D_T,x)

In [55]:
lm.check_rank(x_diff)

'The matrix is NOT full rank with rank = 4. Eliminate linearly dependent columns.'

In [57]:
x_diff = x_diff[:, 4:]

In [58]:
lm.check_rank(x_diff)

'The matrix is of full rank with rank = 4'

In [71]:
## FILL IN
# The steps are outlined in question 3 above.

fd_result = lm.estimate(y=y_diff, x=x_diff, T=T-1, transform='fd') # fill in
lm.print_table(labels=(label_y, label_x_fe), results=fd_result, title='FD-estimator')

FD-estimator
Dependent variable: Log wage

                       Beta          Se    t-values
--------------  -----------  ----------  ----------
Experience       0.11575     0.0195867      5.90964
Experience sqr  -0.00388237  0.00138632    -2.80049
Married          0.0381377   0.0229283      1.66335
Union            0.0427878   0.0196575      2.17667
R² = 0.004
σ² = 0.196


You should get a table that looks like this:

FD regression <br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1158 | 0.0196 |     5.9096 |
| Experience sqr | -0.0039 | 0.0014 |    -2.8005 |
| Married        |  0.0381 | 0.0229 |     1.6633 |
| Union          |  0.0428 | 0.0197 |     2.1767 |
R² = 0.004 <br>
σ² = 0.196

## Summing up Part 1: questions 1, 2, and 3.
Compare the results from your POLS, FE and FD estimations. We were mainly interested in the effect of $\textit{union}$ on wages. Did the POLS estimation give a correct conclusion on this? Is the effect greater or lower than we first thought? Is the effect still statistically significant?

---

### Union estimates 


|       | POLS | FE | FD |
| ------ |------| -- | -- |
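| $\beta$ | 0.18 | 0.08 | 0.04 |
| t-values | 10.52 | 4.26 | 2.18 |

It would seem that POLS overestimates the effect of unionization on wages: the coefficient falls from 0.18 under POLS to 0.08 under FE and 0.04 under FD, although it remains statistically significant at the 5% level in both. That is, we probably have some fixed effects (i.e. some inherent propensity to unionize) that render POLS inconsistent.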

---

# Part 2: The random effects (RE) estimator.
In part 1 we assumed that $E[union_{it}c_i]\neq0$, and used two methods to remove these fixed effects from each person. Now, what if $E[union_{it}c_i] = 0$? Then POLS is consistent but not efficient, since it does not use the panel structure of the data. We can therefore do better with the RE estimator.

## A short introduction to the RE estimator
The FE and FD estimators are both OLS applied to data that have first been transformed in a specific way. We do the same now, but our mission is no longer to transform away the fixed effects, but rather to estimate the following model,

$$\check{y}_{it} = \mathbf{\check{x}}_{it}\boldsymbol{\beta} + \check{v}_{it},\tag{6}$$ 

Here $\check{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_{i}$, $\mathbf{\check{x}}_{it} = \mathbf{x}_{it} - \hat{\lambda}\mathbf{\bar{x}}_{i}$, and $\check{v}_{it} = v_{it} - \hat{\lambda}\bar{v}_{i}$, where we have gathered the errors $v_{it} = c_i + u_{it}$. We are *"quasi-demeaning"* the variables by subtracting only a fraction $\hat{\lambda}$ of each individual mean.

Our challenge is thus to estimate $\lambda$, which we can construct in the following way:

$$\hat{\lambda} = 1 - \sqrt{\frac{\widehat{\sigma}_{u}^{2}}{(\widehat{\sigma}_{u}^{2} + T\widehat{\sigma}_{c}^{2})}}, $$

where $\widehat{\sigma}_{u}^{2}$ is estimated from the fixed effects regression, and $\hat{\sigma}_{c}^{2} = \hat{\sigma}_{w}^{2} - \frac{1}{T}\hat{\sigma}_{u}^{2}$. Finally, what is $\hat{\sigma}_{w}^{2}$? That is the error variance from the between estimator, 


$$
\hat{\sigma}_{w}^{2} = \frac{1}{N-K}\left(\bar{\mathbf{y}} - \mathbf{\bar{X}}\hat{\mathbf{\beta}}_{BE}\right)^{\prime}\left(\bar{\mathbf{y}} - \mathbf{\bar{X}}\hat{\mathbf{\beta}}_{BE}\right),
$$

where $\hat{\boldsymbol{\beta}}_{BE}$ are the between estimator coefficients. The between-groups estimator is not something we have introduced before; it is obtained by regressing the time-averaged outcomes $\overline{y}_i$ on the time-averaged regressors $\overline{\mathbf{x}}_i,i=1,2,\dotsc,N$.
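
To build some intuition for $\hat{\lambda}$, the following sketch evaluates the formulas above for a few made-up variance values. The function name `re_lambda` and all the numbers are hypothetical; the sketch also does not guard against a negative $\hat{\sigma}_{c}^{2}$, which can occur in finite samples.

```python
import numpy as np

def re_lambda(sigma2_u, sigma2_w, T):
    """Quasi-demeaning weight from the FE variance (sigma2_u) and the BE variance (sigma2_w)."""
    sigma2_c = sigma2_w - sigma2_u / T
    return 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))

T = 8
lam_no_c = re_lambda(0.16, 0.16 / T, T)      # sigma2_c = 0: no individual effect
lam_mid = re_lambda(0.16, 0.10, T)           # moderate individual effect
lam_big = re_lambda(0.16, 100.0, T)          # individual effect dominates

print(lam_no_c, lam_mid, lam_big)
```

With no variance in $c_i$ the weight is zero, so nothing is subtracted and RE collapses to POLS. As the variance of $c_i$ grows relative to that of $u_{it}$, the weight approaches one and RE approaches the FE (within) transformation.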

### Question 1: The Between Estimator
Estimate the between-groups model, which is simply eq. (1) averaged over time within each individual,

$$
\bar{y}_{i} = \boldsymbol{\bar{x}}_{i}\boldsymbol{\beta} + c_i + \bar{u}_{i}.
$$

So instead of demeaning, like we did in FE, we just calculate the mean with the following transformation *vector* $\mathbf{P}_T$,

\begin{equation} 
\mathbf{P}_T \equiv \left( \frac{1}{T}, \frac{1}{T}, ..., \frac{1}{T} \right)_{1 \times T}
\end{equation}
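
A small sketch of the averaging step on toy data is given below. It applies $\mathbf{P}_T$ block by block with an explicit loop, which is only an assumption about what the provided `perm` function does internally.

```python
import numpy as np

T = 4
P_T = np.full((1, T), 1 / T)                 # 1 x T averaging row vector

# Two individuals stacked on top of each other, T = 4 rows each, K = 2 columns.
x = np.array([[1., 0.], [1., 1.], [1., 1.], [1., 2.],
              [1., 4.], [1., 4.], [1., 6.], [1., 6.]])

N = x.shape[0] // T
# Apply P_T to each individual's T x K block; every block collapses to a single row.
x_bar = np.vstack([P_T @ x[i * T:(i + 1) * T] for i in range(N)])

print(x_bar.shape)
print(x_bar)
```

Each individual contributes one row of time averages, so the between regression runs on $N$ observations. Unlike demeaning, averaging keeps the constant and the time-invariant columns.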

In order to estimate eq. (3) with the between estimator, you need to perform the following steps:
* Create the mean vector `P`.
* Average `x` and `y` within each individual using the `perm` function and `P`.
* Regress `y_mean` on `x_mean`. Note that there are $N$ rows in each, not $NT$. 
* Print it out in a nice table.

In [61]:
def between_mat(T):
    P_T = np.tile(1/T, (1,T))
    return P_T

In [64]:
P_T = between_mat(T)

In [66]:
y_be = lm.perm(P_T, y)
x_be = lm.perm(P_T, x)

In [67]:
lm.check_rank(x_be)

'The matrix is of full rank with rank = 8'

In [70]:
est_be = lm.estimate(y=y_be, x=x_be, transform='be', T=T)
lm.print_table(labels=(label_y, label_x), results=est_be, title='BE-estimator')

BE-estimator
Dependent variable: Log wage

                       Beta          Se    t-values
--------------  -----------  ----------  ----------
Constant         0.492309    0.221009      2.22755
Black           -0.138812    0.0488709    -2.84039
Hispanic         0.00477579  0.0426925     0.111865
Education        0.0946036   0.0109043     8.6758
Experience      -0.0504371   0.0503326    -1.00208
Experience sqr   0.00512449  0.00321182    1.59551
Married          0.143664    0.0411983     3.48713
Union            0.270677    0.0465645     5.81294
R² = 0.219
σ² = 0.121


You should get a table that looks like this:

BE <br>
Dependent variable: Log wage

|                |   Beta |     Se |   t-values |
|----------------|--------|--------|------------|
| Constant        |  0.4923 | 0.2210 |  2.23 | 
| Black           | -0.1388 | 0.0489 | -2.84 | 
| Hispanic        |  0.0048 | 0.0427 |  0.11 | 
| Education       |  0.0946 | 0.0109 |  8.68 | 
| Experience      | -0.0504 | 0.0503 | -1.00 | 
| Experience sqr  |  0.0051 | 0.0032 |  1.60 | 
| Married         |  0.1437 | 0.0412 |  3.49 | 
| Union           |  0.2707 | 0.0466 |  5.81 | 
R² = 0.219 <br>
σ² = 0.121

### Question 2
You should now have all the error variances that you need to calculate

$$\hat{\lambda} = 1 - \sqrt{\frac{\widehat{\sigma}_{u}^{2}}{(\widehat{\sigma}_{u}^{2} + T\widehat{\sigma}_{c}^{2})}}. $$

In [92]:
# Fill in
from math import sqrt


sigma2_u = est_fe['sigma2'][0][0] # fill in
sigma2_c = est_be['sigma2'][0][0]-1/T*est_fe['sigma2'][0][0] # fill in
_lambda = 1-sqrt(sigma2_u/(sigma2_u+T*sigma2_c)) # fill in
print(_lambda)

0.6426409407862546


### Question 3
Now we are finally ready to estimate eq. (3) with random effects. Since we have to use $\hat{\lambda}$ to quasi-demean within each individual, we again use the `perm` function. This time, we pass it the following transformation matrix,

$$
\mathbf{C}_{T}:=\mathbf{I}_{T} - \hat{\lambda}\mathbf{P}_{T},
$$

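where $\mathbf{P}_{T}$ is the $1 \times T$ transformation vector we used earlier to calculate the mean of each person. In this expression it is stacked $T$ times (NumPy broadcasting does this automatically), so that $\hat{\lambda}\mathbf{P}_{T}$ is a $T \times T$ matrix with every element equal to $\hat{\lambda}/T$.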
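
Why this particular $\lambda$? Under the standard RE assumptions, the composite error $v_{it} = c_i + u_{it}$ is correlated over time within a person, and quasi-demeaning removes that correlation. The sketch below checks this numerically for hypothetical variances; the values 0.12 and 0.10 are made up and are not the estimates from this problem set.

```python
import numpy as np

T, sigma2_u, sigma2_c = 8, 0.12, 0.10        # hypothetical variance components
lam = 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))

J_T = np.full((T, T), 1 / T)                 # T x T matrix in which every row equals P_T
C_T = np.eye(T) - lam * J_T                  # quasi-demeaning matrix

# Covariance of (v_i1, ..., v_iT): sigma2_c everywhere plus sigma2_u on the diagonal.
Omega = sigma2_u * np.eye(T) + sigma2_c * np.ones((T, T))

# Covariance of the quasi-demeaned errors C_T v_i.
whitened = C_T @ Omega @ C_T.T
print(np.round(whitened, 6))
```

The transformed errors have covariance $\sigma_u^2\mathbf{I}_T$, so OLS on the quasi-demeaned data is efficient under these assumptions. The two limiting cases are also easy to read off: $\lambda = 0$ leaves the data untouched (POLS) and $\lambda = 1$ gives the demeaning matrix $\mathbf{Q}_T$ (FE).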

In [93]:
def quasi_demean_mat(T, _lambda):
    P_T = between_mat(T)
    C_T = np.eye(T)-_lambda*P_T
    return C_T

In [95]:
C_T = quasi_demean_mat(T, _lambda)

In [97]:
y_re = lm.perm(C_T, y) # fill in
x_re = lm.perm(C_T, x) # fill in

In [98]:
lm.check_rank(x_re)

'The matrix is of full rank with rank = 8'

In [102]:
# FILL IN
# Create first the transformation matrix C
# Use the perm function to "quasi-demean" x and y using C
# Estimate RE using OLS and print a nice table


est_re = lm.estimate(y=y_re, x=x_re, transform='re', T=T)
# print_table appears to call _lambda.item(), which a plain Python float lacks,
# so pass a NumPy scalar instead.
lm.print_table(labels=(label_y, label_x), results=est_re, _lambda=np.float64(_lambda),
    floatfmt=['', '.3f', '.4f', '.2f']
)

Results
Dependent variable: Log wage

                  Beta      Se    t-values
--------------  ------  ------  ----------
Constant        -0.107  0.1107       -0.97
Black           -0.144  0.0476       -3.03
Hispanic         0.020  0.0426        0.47
Education        0.101  0.0089       11.36
Experience       0.112  0.0083       13.57
Experience sqr  -0.004  0.0006       -6.88
Married          0.063  0.0168        3.74
Union            0.107  0.0178        6.02
R² = 0.178
σ² = 0.124
λ = 0.643

The table should look like this:

RE <br>
Dependent variable: Log wage

|                |   Beta |     Se |   t-values |
|----------------|--------|--------|------------|
| Constant       | -0.107 | 0.1107 |      -0.97 |
| Black          | -0.144 | 0.0476 |      -3.03 |
| Hispanic       |  0.020 | 0.0426 |       0.47 |
| Education      |  0.101 | 0.0089 |      11.36 |
| Experience     |  0.112 | 0.0083 |      13.57 |
| Experience sqr | -0.004 | 0.0006 |      -6.88 |
| Married        |  0.063 | 0.0168 |       3.74 |
| Union          |  0.107 | 0.0178 |       6.02 |
R² = 0.178 <br>
σ² = 0.124 <br>
λ = 0.643

## Short introduction to Hausman test

It is evident from the previous question that RE has the advantage over FE that time-invariant variables are not demeaned away. But if $E[c_{i}\boldsymbol{x}_{it}] \neq \boldsymbol{0}$, then the RE estimator is inconsistent, whereas the FE estimator is consistent (but inefficient), assuming strict exogeneity.

We can use the results from the FE and RE estimations to test whether RE is consistent by calculating the following test statistic, where $M$ is the number of time-varying coefficients being compared (here $M = 4$),

$$
H := (\hat{\boldsymbol{\beta}}_{FE} - \hat{\boldsymbol{\beta}}_{RE})'[\widehat{\mathrm{avar}}(\hat{\boldsymbol{\beta}}_{FE}) - \widehat{\mathrm{avar}}(\hat{\boldsymbol{\beta}}_{RE})]^{-1}(\hat{\boldsymbol{\beta}}_{FE}-\hat{\boldsymbol{\beta}}_{RE})\overset{d}{\to}\chi_{M}^{2}, \tag{7}
$$

*Note 1*: The vector $\hat{\boldsymbol{\beta}}_{RE}$ excludes time invariant variables as these are not present in $\hat{\boldsymbol{\beta}}_{FE}$. <br>
*Note 2:* $\widehat{\mathrm{avar}}(\hat{\boldsymbol{\beta}}_{RE})$ means the RE covariance (but again, we only keep the rows and columns for time-variant variables)
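
Eq. (7) can be sketched as a small function. The example below uses made-up two-coefficient inputs, not the estimates from this problem set, and it assumes the coefficients are passed as 1-D arrays that already contain only the time-varying entries. The name `hausman` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def hausman(b_fe, b_re, cov_fe, cov_re):
    """Hausman statistic and p-value for coefficient vectors of equal length."""
    diff = b_fe - b_re
    # Solve (cov_fe - cov_re) z = diff instead of forming the inverse explicitly.
    H = float(diff @ np.linalg.solve(cov_fe - cov_re, diff))
    p_val = chi2.sf(H, df=diff.size)         # degrees of freedom = number of coefficients
    return H, p_val

# Hypothetical inputs, chosen so the arithmetic is easy to follow by hand.
b_fe = np.array([0.5, 0.1])
b_re = np.array([0.4, 0.3])
cov_fe = np.diag([0.02, 0.06])
cov_re = np.diag([0.01, 0.02])

H, p_val = hausman(b_fe, b_re, cov_fe, cov_re)
print(f"H = {H:.2f}, p-value = {p_val:.3f}")
```

With diagonal covariances the statistic is just the sum of squared differences divided by the variance differences, here $0.1^2/0.01 + 0.2^2/0.04$. A large statistic (small p-value) means the FE and RE coefficients differ by more than sampling noise would explain, which speaks against RE.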

#### Question 4: Comparing FE and RE
Use the results from the FE and RE estimations to compute the Hausman test statistic in eq. (7).

* Start by calculating the differences in the FE and RE coefficients $\hat{\boldsymbol{\beta}}_{FE} - \hat{\boldsymbol{\beta}}_{RE}$ (again, remember to remove the time invariant columns from RE)
* Then calculate the differences in the covariances $\widehat{\mathrm{avar}}(\hat{\boldsymbol{\beta}}_{FE}) - \widehat{\mathrm{avar}}(\hat{\boldsymbol{\beta}}_{RE})$ (you need to keep the "lower right" part of the RE covariance)
* You now have all the components to compute the Hausman test statistic in eq. (7)

In [106]:
# FILL IN
# Follow the steps in the question
# Differences in beta hat. FE is constructed by demeaning, so it only has the
# time-varying coefficients; rows 4: of RE are the matching coefficients.
hat_diff = est_fe['b_hat'] - est_re['b_hat'][4:]
# Difference in covariances, keeping the lower-right (time-varying) block of RE.
cov_diff = est_fe['cov'] - est_re['cov'][4:, 4:]

H = hat_diff.T @ la.inv(cov_diff) @ hat_diff  # The Hausman test value

# p-value of the Hausman test: chi-squared with M = 4 degrees of freedom.
p_val = chi2.sf(H.item(), 4)

---

### Error structure of RE:

Under the RE assumptions, the composite errors $v_{it} = c_i + u_{it}$ of each individual have the covariance matrix

$$\left[\begin{array}{cccc}\sigma_c^2+\sigma_u^2 & \sigma_c^2 & \cdots & \sigma_c^2 \\ \sigma_c^2 & \sigma_c^2+\sigma_u^2 & \cdots & \sigma_c^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_c^2 & \sigma_c^2 & \cdots & \sigma_c^2+\sigma_u^2\end{array}\right]_{T \times T}$$

## NB!
The RE coefficient vector contains both the time-invariant and the time-varying estimates. If we are to compare with FE, we must compare the time-varying estimates of FE with the same time-varying estimates of RE.

---

In [111]:
# This code takes the results that you have made, and prints a nice looking table.
def print_h_test(est_fe, est_re, hat_diff, H, p_val):
    # One row per time-varying coefficient: FE, RE, and their difference.
    table = [
        [est_fe['b_hat'][i], est_re['b_hat'][4:][i], hat_diff[i]]
        for i in range(len(hat_diff))
    ]
    print(tabulate(table, headers=['b_fe', 'b_re', 'b_diff'], floatfmt='.4f'))
    # H is now passed in explicitly instead of being read from the global scope.
    print(f'The Hausman test statistic is: {H.item():.2f}, with p-value: {p_val:.2f}.')

print_h_test(est_fe, est_re, hat_diff, H, p_val)

   b_fe     b_re    b_diff
-------  -------  --------
 0.1168   0.1121    0.0047
-0.0043  -0.0041   -0.0002
 0.0453   0.0628   -0.0175
 0.0821   0.1074   -0.0253
The Hausman test statistic is: 31.45, with p-value: 0.00.


---

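$H_0$: $E[c_{i}\boldsymbol{x}_{it}] = \boldsymbol{0}$, so RE is consistent and the random effects model is the preferred model.

$H_A$: $E[c_{i}\boldsymbol{x}_{it}] \neq \boldsymbol{0}$, so the fixed effects model is the preferred model (the reason being unobserved heterogeneity!).

With a test statistic of 31.45 and a p-value that rounds to 0.00, $H_0$ is strongly rejected. This indicates that the preferred model is the FE model.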

---

Your table should look like this:

| b_fe    | b_re    | b_diff  |
|---------|---------|---------|
| 0.1168  | 0.1121  | 0.0047  |
| -0.0043 | -0.0041 | -0.0002 |
| 0.0453  | 0.0628  | -0.0175 |
| 0.0821  | 0.1074  | -0.0253 |

The Hausman test statistic is: 31.45, with p-value: 0.00.