# Dynamic linear models

In [None]:
import numpy as np
from numpy import linalg as la
from tabulate import tabulate 
import pandas as pd 
import LinearDynamic_ante as lm
from scipy.stats import chi2
import seaborn as sns
sns.set_theme();

np.set_printoptions(precision=5)
%load_ext autoreload
%autoreload 2

# Prepare the data

In this problem set, we consider the question of state dependence in **firm revenue** ($y$). The data, `cvr_extract.csv`, come from the Danish register of firms, "CVR registret". They can be reconstructed or modified by downloading the tax files from [skst.dk](https://www.sktst.dk/aktuelt/skatteoplysninger-for-selskaber/) and running the notebook `clean_data.ipynb`. The accompanying notebook, `analysis.ipynb`, runs a deliberately simple analysis that illustrates some questions one might want to explore with the data.

**Research question:** Should we provide assistance to firms? If the effects of adverse shocks (such as a pandemic) are very persistent, then we want to provide a safety net for firms. Conversely, if firms just rebound after a shock, that can be a waste of tax funds. 

The data consist of all firms observed in every year from 2012 to 2019 whose pre-tax net income was below DKK 10 million. The variables in the data are:

  |*Variable*  | *Content* |
  |------------| --------------------------------------------|
  |`net_inc`          | Net income (`income - deficit`) |
  |`taxable_income`          | Income |
  |`deficit`          | Losses |
  |`tax`          | Tax payment |
 | `cat`         | A categorical variable, based on the dummies, `dum_X` below |
  | `dum_X`          | Dummy for whether the firm's name contains the string `X` (in Danish) |
  
The dummies, e.g. `dum_doctor`, are explained below.

| **Substring** | **If name contains** | 
| ---- | ------ | 
| `as` | 'a/s' | 
| `aps` | 'aps' | 
| `ivs` | 'ivs' | 
| `ab` | 'a/b' | 
| `realestate` | 'ejendom' | 
| `holding` | 'holding' | 
| `invest` | 'invest' | 
| `consult` | 'consult' | 
| `service` | 'service' | 
| `dot_dk` | '.dk' | 
| `doctor` | 'læge' | 
| `carpenter` | 'tømrer' | 
| `transport` | 'transport' or 'lastvogn' | 
| `plumbing` | 'vvs' or 'kloak' | 
| `import` | 'import' | 
| `masonry` | 'murer' | 
| `nielsen` | 'nielsen' | 
| `sorensen` | 'sørensen' | 

In [None]:
dat = pd.read_csv('cvr_extract.csv')
dat.sample(3)

If you'd like a quick summary of your data, pandas data frames have useful summary commands.

In [None]:
dat.info() # returns number of elements, data type and column name for each column

In [None]:
dat.describe() # returns a number of summary stats (e.g. mean, max, count etc.) for each column

In [None]:
# print table which can be copied to latex
print(dat.describe().style.to_latex())

# Descriptives

In [None]:
ax=dat.groupby('year').net_inc.sum().plot(marker='o'); 
ax.set_ylabel('Total net income'); 

In [None]:
ax = dat.groupby('cat').taxable_income.count().plot(kind='bar'); 
ax.set_ylabel('Observations'); 

In [None]:
ax = dat.groupby('cat').taxable_income.mean().plot(kind='bar'); 
ax.set_ylabel('Avg. taxable income'); 

In [None]:
ax=dat.groupby(['year', 'cat']).taxable_income.mean().unstack().plot(marker='o'); 
ax.set_ylabel('Avg. taxable income'); 
ax.legend(title='Firm name contains', loc=(1.05,0.3)); 

# Set up data

In [None]:
# Select only firms that are in real estate in all years 
I = dat.groupby('firmid').dum_realestate.transform('all') # <-- creates an indicator for real estate
dat = dat[I].copy() # <-- pulls out real estate firms

In [None]:
# convenient list of the names of all the dummy variables
cols_dum = [c for c in dat.columns if c.startswith('dum_')]

# convert int->bool 
for c in cols_dum: 
    dat[c] = dat[c].astype('bool')

N = dat.firmid.unique().size
T = dat.year.unique().size
print(f'Data has {dat.shape[0]:,d} rows: N = {N:,d}, T = {T}')

In [None]:
# measure money in 1000 DKK 
for v in ['net_inc', 'taxable_income', 'deficit', 'tax']: 
    dat[v] = dat[v] / 1000.

In [None]:
# lag net income using "shift" from pandas
dat['lag_net_inc'] = dat.groupby('firmid').net_inc.shift(1)
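
A minimal sketch of what the grouped `shift` does, using a made-up toy panel (the data frame `toy` and its numbers are assumptions for illustration only). It shows that the lag never leaks from one firm into the next, and that the rows should be sorted by firm and year before shifting, since `shift` works on row order within each group.

```python
import pandas as pd

# Toy balanced panel: 2 firms, 3 years (made-up numbers)
toy = pd.DataFrame({
    'firmid':  [1, 1, 1, 2, 2, 2],
    'year':    [2012, 2013, 2014, 2012, 2013, 2014],
    'net_inc': [10.0, 12.0, 9.0, 100.0, 90.0, 95.0],
})

# Sort first: shift and diff use the row order within each firm
toy = toy.sort_values(['firmid', 'year'])

# Lag and first difference within firm; the first year of each firm becomes NaN
toy['lag_net_inc'] = toy.groupby('firmid').net_inc.shift(1)
toy['diff_net_inc'] = toy.groupby('firmid').net_inc.diff()
print(toy)
```

Each firm loses its first year, which is why the usable `T` drops by one after lagging.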

### Pandas to numpy 

In [None]:
# remove NaNs created by lagging
I = dat.lag_net_inc.notnull() # cannot use first year: no lagged variable 

T = dat[I].year.unique().size # NB this measure of T already has one year subtracted 
N = dat[I].firmid.unique().size

assert dat[I].shape[0] == N*T, 'Data is not a balanced panel'

In [None]:
# turn pandas data frames into numpy arrays
y = dat[I].net_inc.values.reshape((-1,1))
y_l = dat[I].lag_net_inc.values.reshape((-1,1))
const = np.ones((N*T,1))
x = np.column_stack((const, y_l))

#labels
ylbl = 'profit'
xlbl = ['const', 'lagged profit']

# Part 1: POLS
Today we will focus on a parsimonious model of profit $\pi$ (econometricians often use "parsimonious" to mean "simple").

Consider first the following AR(1) (autoregressive model of order $1$),

$$
\pi_{it} = \alpha_0 +  \rho \pi_{it-1} + c_i + u_{it}, \quad t = 1, 2, \dotsc, T \tag{1}
$$

As we have seen before, ignoring $c_i$ when estimating $\rho$ gives biased results. A common remedy for AR(1) processes is to take first differences, which gives the model

$$
\Delta \pi_{it} = \rho \Delta \pi_{it-1} + \Delta u_{it}, \quad t = 2, \dotsc, T \tag{2}
$$

First differencing removes the fixed effect $c_i$ (and the constant $\alpha_0$) from the equation.
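
To see why differencing removes $c_i$, here is a small sketch with made-up numbers. The helper `fd_matrix` is a hypothetical name for illustration, not a function from `LinearDynamic_ante.py`; it builds a $(T-1) \times T$ matrix $\mathbf{D}$ whose rows compute $v_t - v_{t-1}$.

```python
import numpy as np

def fd_matrix(T):
    """(T-1) x T matrix D such that D @ v stacks v[t] - v[t-1]."""
    return np.eye(T, k=1)[:-1] - np.eye(T)[:-1]

T = 4
D = fd_matrix(T)
c_i = 7.0                                 # made-up fixed effect
pi_i = np.array([1.0, 3.0, 6.0, 10.0])    # made-up profit path for one firm

print(D @ pi_i)                  # first differences of profit: [2. 3. 4.]
print(D @ (c_i * np.ones(T)))    # the fixed effect differences out: [0. 0. 0.]

# For a stacked balanced panel of N firms, one option is a block-diagonal D
N = 2
stacked = np.tile(pi_i, N).reshape(-1, 1)   # (N*T, 1)
D_big = np.kron(np.eye(N), D)               # (N*(T-1), N*T)
print((D_big @ stacked).ravel())
```

The Kronecker product is only practical for small panels. Applying $\mathbf{D}$ firm by firm, which is presumably what the course's `perm` function is for, avoids building the large matrix.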

### Question 1.1
Estimate eq. (1) using POLS and robust standard errors (you'll need to adjust the `robust` function in `LinearDynamic_ante.py` first.)
* Are there signs of autocorrelation in profit?
* What assumptions are no longer satisfied? What happens with fixed effects when we include a lag?

*Note:* We need to use the lagged values of net income. But this time we don't need to lag it ourselves, as that has already been done above.

In [None]:
# FILL IN
# Estimate the AR(1) model using OLS and robust standard errors
# Print out in a nice table

ar1_result= # fill in

Your table should look like this:


AR(1)
Dependent variable: profit

|              |  Beta |     Se|   t-values | 
|------------- | ------|  -------|  ----------|
|const   |       73.494 | 7.77624 |      9.45|
|lagged profit |  0.517 | 0.03473  |     14.90| 

R² = 0.252 <br>
σ² = 476937.886

In [None]:
# in case you want results in a pandas data frame
ar1_res=pd.DataFrame([ar1_result['b_hat'].flatten(), ar1_result['se'].flatten(),ar1_result['t_values'].flatten()], 
             index=['beta', 'se', 't'], columns=xlbl).T
ar1_res.round(2)


In [None]:
# or as a latex table
(ar1_res.round(2)).style.to_latex()

### Question 1.2
Estimate eq. (2) using first differences. 
* What problem does this solve? 
* What type of exogeneity assumption is used to justify this method of estimation?

*Note 1:* You have to create the first-differencing matrix yourself and use the `perm` function to transform the dependent and independent variables. <br>
*Note 2:* This time you should use robust standard errors. The function is provided to you.

In [None]:
# FILL IN
# Create a first difference matrix
# First difference both profit and lag of profit
# Estimate AR(1) model using OLS with robust se and print a nice table


Your table should look like this:

FD AR(1) <br>
Dependent variable: profit

|              |   Beta |     Se |   t-values |
|------------- | ------ | ------ | ---------- |
|lagged profit | -0.412 | 0.0244 |     -16.89 |

R² = 0.167 <br>
σ² = 520184.128

## Super short introduction to pooled IV (piv)

Suppose we want to estimate the effect of $x_K$ on $y$, conditional on $K - 1$ other controls. We then have the usual equation,

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \tag{3}
$$

where $\mathbf{X} = (\mathbf{x}_1, \dotsc, \mathbf{x}_K)$. If $\mathbf{x}_K$ is not exogenous, we can define the instrument vector $\mathbf{Z} = (\mathbf{x}_1, \dotsc, \mathbf{x}_{K - 1}, \mathbf{z}_1)$, where $\mathbf{z}_1$ is an instrument for $\mathbf{x}_K$. The details and necessary assumptions and conditions are outlined in Wooldridge (2010) (chapter 5).

We can estimate eq. (3) by two-stage least squares (2SLS), using $\mathbf{z}_1$ as an instrument for $\mathbf{x}_K$. To make the coding easier, here is the estimator in matrix notation,

$$
\hat{\boldsymbol{\beta}} = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1} \hat{\mathbf{X}}'\mathbf{y}, \tag{4}
$$

where $\hat{\mathbf{X}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$.
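
As a rough illustration of eq. (4), the sketch below applies the formula to simulated data. The function name `piv_sketch` and all parameter values are assumptions made for this example; it is not the `est_piv` function you are asked to finish.

```python
import numpy as np

def piv_sketch(y, X, Z):
    """2SLS: project X on Z, then regress y on the fitted values X_hat."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
    return beta, X_hat

rng = np.random.default_rng(0)
n = 5_000
z1 = rng.normal(size=(n, 1))             # instrument
v = rng.normal(size=(n, 1))              # source of endogeneity
xK = 0.8 * z1 + v                        # endogenous regressor
u = 0.9 * v + rng.normal(size=(n, 1))    # error, correlated with xK through v

const = np.ones((n, 1))
X = np.column_stack((const, xK))
Z = np.column_stack((const, z1))         # exogenous regressors plus instrument
y = X @ np.array([[1.0], [2.0]]) + u     # true slope is 2

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv, _ = piv_sketch(y, X, Z)
print('OLS :', b_ols.ravel())            # slope is biased upwards
print('2SLS:', b_iv.ravel())             # slope is close to 2
```

If you pass $\mathbf{Z} = \mathbf{X}$, the projection leaves $\mathbf{X}$ unchanged and the function returns the OLS estimate, which is a handy sanity check.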

# Part 2: Pooled IV
It should not be a surprise that models (1) and (2) violate the strict exogeneity assumption. But even if we relax this assumption to sequential exogeneity, the FD estimator remains inconsistent, because $\Delta u_{it}$ contains $u_{it-1}$, which is correlated with $\Delta \pi_{it-1}$.

A solution to this is to use an instrument for $\Delta \pi_{it-1}$. The biggest issue is to find an instrument that is not only relevant, but also exogenous.

We often use additional lags as instruments. For $\Delta \pi_{it-1}$, we can use $\pi_{it-2}$. More generally, all earlier lags are available, so for $\Delta \pi_{it-1}$ we have $\pi_{it-2}^{\textbf{o}} = (\pi_{i0}, \pi_{i1}, \dotsc, \pi_{it-2})$ available as instruments.

*Note:* $R^2$ has no meaning in IV regressions. You can report it if you want to, but I set it to 0.
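
The following simulation is a sketch of why the instrument helps, under assumed values ($\rho = 0.5$, standard normal $c_i$ and $u_{it}$, no constant). It compares FD-OLS with FD-IV using $\pi_{it-2}$ as the single instrument (the Anderson-Hsiao approach). None of the numbers relate to the CVR data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, rho = 5_000, 8, 0.5

# Simulate pi_it = rho * pi_it-1 + c_i + u_it for t = 0, ..., T-1
c = rng.normal(size=N)
pi = np.empty((N, T))
pi[:, 0] = c / (1 - rho) + rng.normal(size=N)   # start near the long-run mean
for t in range(1, T):
    pi[:, t] = rho * pi[:, t - 1] + c + rng.normal(size=N)

d_pi = np.diff(pi, axis=1)          # column j holds pi_{j+1} - pi_j
dy = d_pi[:, 1:].reshape(-1, 1)     # Delta pi_t,     t = 2, ..., T-1
dx = d_pi[:, :-1].reshape(-1, 1)    # Delta pi_{t-1}
z = pi[:, :-2].reshape(-1, 1)       # pi_{t-2}, the instrument in levels

b_fd = (dx.T @ dy).item() / (dx.T @ dx).item()   # FD-OLS
b_iv = (z.T @ dy).item() / (z.T @ dx).item()     # FD-IV, just identified
print(f'true rho = {rho}, FD-OLS = {b_fd:.3f}, FD-IV = {b_iv:.3f}')
```

In this design FD-OLS comes out negative even though the true $\rho$ is positive, while FD-IV lands near the true value. How well this carries over to real data depends on instrument strength.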

### Question 2.1
Estimate eq. (2) by using the lag of the independent variable in levels, $z_{it} = \pi_{it-2}$ as an instrument. You need to finish writing the `est_piv` function and a part of the `estimate` function in `LinearDynamic_ante`.

*Note 1:* In the `estimate` function, the `variance` function takes $\mathbf{X}$ as an argument. But we want to pass the `variance` function $\mathbf{\hat{X}}$ instead. <br>
*Note 2:* In order to create the instrument, you need to create a lag matrix, and use `perm`.

In [None]:
# FILL IN
# First create a lag matrix
# Lag the lagged pi variable
# Finish writing the est_piv function in LinearDynamic_ante
# Finish writing the estimate function in LinearDynamic_ante
# Estimate using first differences and lagged first differences. Use the second lag as instrument.

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: delta profit

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lagged profit     |  0.129 |  0.0763|      1.69 | 

R² = n.a. <br>
σ² = 700805.974

### Question 2.2
Estimate eq. (2) by using the lag of the independent variable in first differences, $z_{it} = \Delta \pi_{it-2}$ as an instrument.

In [None]:
# FILL IN
# Lag the first differenced lag profit variable
# The second lag uses up an extra observation, so you need to shorten both first-differenced profit and its first lag.
# Estimate using first differences and lagged first differences. Use the second lag of the first difference as instrument.

In [None]:
# do lagging and differencing with pandas data frames (instead of the perm function)
dat['diff_net_inc']      = dat.groupby('firmid').net_inc.diff()
dat['lag_diff_net_inc']  = dat.groupby('firmid').diff_net_inc.shift(1)
dat['lag2_diff_net_inc'] = dat.groupby('firmid').diff_net_inc.shift(2)

In [None]:
# load into numpy
I = dat.lag2_diff_net_inc.notnull() # removes years which have become NaN due to differencing and lagging
yfd    = dat[I].diff_net_inc.values.reshape((-1,1))
yfd_l1 = dat[I].lag_diff_net_inc.values.reshape((-1,1))
yfd_l2 = dat[I].lag2_diff_net_inc.values.reshape((-1,1))

assert (dat.groupby('firmid').year.size() == dat.year.unique().size).all(), 'not balanced'

In [None]:
# Estimate using lagged difference as IV and data generated with pandas above
x_label= None # fill in
AR_fdiv2_res2 = None # Fill in
lm.print_table((ylbl,x_label), AR_fdiv2_res2,title='FD-IV AR(1)')

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: delta profit

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag delta profit|  -0.032 | 0.1317 |       -0.23 |

R² = NaN <br>
σ² = 659600.152

### Summing up Exercises 1 and 2.

First of all, is it more convincing to use $\pi_{it-2}$ or $\Delta \pi_{it-2}$ as an instrument for $\Delta \pi_{it-1}$?

Then consider how the different models compare to each other. Some questions that you might discuss with your classmates:
* Which ones make the most sense from an economic perspective?
* Which ones make the most sense from an econometric perspective?
* Do you feel that there is conclusive evidence of state dependence in profit?
* Should we give financial aid packages to this industry when exogenous shocks hit (e.g. Covid)?

# Part 3: GMM

### Question 3.1: Create the level instrument matrix $\mathbf{Z^{\mathbf{o}}}$

The function `sequential_instruments` in `gmm_ante.py` creates the instrument matrix $\mathbf{Z^{\mathbf{o}}}$ using the second lag of $\pi$ in **levels**. Note that you will not have one array that looks like $\mathbf{Z^{\mathbf{o}}}$, but an array that has something that looks like $\mathbf{Z^{\mathbf{o}}}$ for each firm in the data. Since we have six time periods, and access to $y_{i0}$, you should get five rows of instruments for each firm.
$$
\mathbf{Z^{\mathbf{o}}} = 
\begin{bmatrix}
    y_{i0} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
    0 & y_{i0} & y_{i1} & 0 & 0 & 0 & \cdots & 0 \\
    0 & 0 & 0 & y_{i0} & y_{i1} & y_{i2} & \cdots & 0 \\
    \vdots  & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 0 & 0 & 0 & 0 & \cdots & \mathbf{y^o_{it-2}} \\
\end{bmatrix}
\begin{pmatrix}
t = 2 \\
t = 3 \\
t = 4 \\
\vdots \\
t = T
\end{pmatrix}.
$$
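
To make the staircase pattern concrete, here is a small sketch that builds the block for a single firm from made-up levels. The helper `sequential_instruments_sketch` is hypothetical and written for this illustration only; the signature of the course's `sequential_instruments` is not shown here and may well differ.

```python
import numpy as np

def sequential_instruments_sketch(y_levels):
    """Staircase instrument block for ONE firm.

    y_levels: 1-D array (y_0, y_1, ...). Row r uses y_0, ..., y_r,
    placed in its own set of columns, with zeros everywhere else.
    """
    n_rows = len(y_levels)
    n_cols = n_rows * (n_rows + 1) // 2      # 1 + 2 + ... + n_rows
    Z = np.zeros((n_rows, n_cols))
    start = 0
    for r in range(n_rows):
        Z[r, start:start + r + 1] = y_levels[:r + 1]
        start += r + 1                       # next row starts after this block
    return Z

Z_i = sequential_instruments_sketch(np.array([1.0, 2.0, 3.0]))
print(Z_i)
# [[1. 0. 0. 0. 0. 0.]
#  [0. 1. 2. 0. 0. 0.]
#  [0. 0. 0. 1. 2. 3.]]
```

The number of columns grows as $1 + 2 + \dotsb$ with the number of rows, which is why the instrument count rises quickly with $T$.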

In [None]:
import gmm_ante as gmm

In [None]:
# reset: now we no longer need to delete *3* years but only 2. This is because we're letting the IV matrix expand over time.
I = dat.lag_diff_net_inc.notnull()

yfd    = dat[I].diff_net_inc.values.reshape((-1,1))
yfd_l  = dat[I].lag_diff_net_inc.values.reshape((-1,1))
years  = dat[I].year.values

In [None]:
# Create (telescoping) instrument matrix 
z = # fill in
print(z)

Your instrument matrix should look like this:
$$
\begin{bmatrix}
-28.187 &  0. &     0.  &  ... &  0.  &    0.  &    0.   \\
 0.  &  -28.187 & -62.855& ... &  0.   &   0.  &    0.   \\
  0. &     0.  &    0. &   ... &  0.  &    0.  &    0.   \\
 &&&...&&&\\
   0.   &   0.   &   0. &   ... &  0.  &    0.  &    0.   \\
   0.    &  0.  &    0.  &  ...  & 0.    &  0.   &   0.   \\
   0.  &    0.   &   0.  &  ... & 690.033 & 511.393 &585.533\\
 \end{bmatrix}
 $$

### Question 3.2: GMM 1-step and 2-step

Compute the following quantities: 

a) the initial weighting matrix, 

b) the first-step gmm estimator, 

c) the updated weighting matrix, 

d) the 2-step gmm estimator,

e) the standard errors,

f) the Sargan statistic

*Note: In formulas below we do not include $\frac1N$ in  the definition of $\hat{\mathbf{W}}$. This only has implications for the formula for $\text{Avar}(\hat{\boldsymbol{\beta}})$.*

a) Write the initial weighting matrix $\hat{\mathbf{W}}$ used for the 1-step GMM estimator (equivalent to System 2SLS). What are $\hat{\mathbf{W}}$'s dimensions?


$$\hat{\mathbf{W}}=$$

In [None]:
# Compute the initial weighting matrix 
W = # fill in


b) Compute the first step GMM estimator:

$$\hat{\boldsymbol{\beta}}_{GMM}= (\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}\mathbf{Z}'\mathbf{Y}$$
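
The sketch below checks the formula on simulated, non-panel data. It assumes $\hat{\mathbf{W}} = (\mathbf{Z}'\mathbf{Z})^{-1}$, in line with the 2SLS equivalence mentioned in a), and the function name `gmm_estimate` is made up for this example.

```python
import numpy as np

def gmm_estimate(X, Z, W, Y):
    """beta = (X'Z W Z'X)^{-1} X'Z W Z'Y for a given weighting matrix W."""
    XZ = X.T @ Z                      # (K, L)
    ZY = Z.T @ Y                      # (L, 1)
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ ZY)

rng = np.random.default_rng(3)
n, K, L = 500, 1, 3                   # made-up sizes: 1 regressor, 3 instruments
Z = rng.normal(size=(n, L))
X = Z @ np.array([[0.5], [0.3], [0.2]]) + rng.normal(size=(n, K))
Y = 0.7 * X + rng.normal(size=(n, 1))

W = np.linalg.inv(Z.T @ Z)            # (L, L), assumed first-step weighting matrix
b_gmm = gmm_estimate(X, Z, W, Y)

# Same number from the 2SLS formula with X_hat = Z (Z'Z)^{-1} Z'X
X_hat = Z @ W @ Z.T @ X
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
print(b_gmm.ravel(), b_2sls.ravel())
```

Note that $\hat{\mathbf{W}}$ is square, with one row and column per instrument, regardless of the number of observations.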

In [None]:
# Compute the 2sls estimator (1-step GMM)
beta_gmm= # fill in
print("beta_gmm=",beta_gmm)

You should get: `beta_gmm=[[-0.02163]]`

c) Write the expression for the updated weighting matrix used for 2-step GMM (Arellano-Bond). What is the dimension of this matrix?


$$\hat{\mathbf{W}}^{\text{opt}}
= $$

In [None]:
# Compute the updated weighting matrix
# you'll need to compute residuals ui_hat
# you should use a loop to multiply each individual's Z_i and ui_hat separately
res = # fill in

W_opt = # fill in


d) Compute the second-step GMM estimator using this weighting matrix

$$\hat{\boldsymbol{\beta}}_{GMM}^{\text{opt}}= (\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}^{\text{opt}}\mathbf{Z}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}^{\text{opt}}\mathbf{Z}'\mathbf{Y}$$

In [None]:
# Compute the 2-step GMM estimator 
beta_gmm = # fill in 
print("beta_gmm=",beta_gmm)

You should get: `beta_gmm= [[0.07076]]`

e) Compute standard errors: 
$$\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}_{\text{GMM}}^{\text{opt}}) = \left(\mathbf{X'}\mathbf{Z}\hat{\mathbf{W}}^{\text{opt}}\mathbf{Z}'\mathbf{X}\right)^{-1}$$

In [None]:
X = yfd_l
Z = z
W = W_opt
cov = # fill in 
se = # fill in
print(f'se = {se}, t = {beta_gmm/se}')

You should get: `se = [0.03733], t = [[1.8954]]`

f) Write up the Sargan Test Statistic. What does it provide a test of? What can you conclude from this test?


$$\mathbf{J}=$$

where $\mathbf{W} = (N \hat{\mathbf{S}})^{-1}= (\sum_i \mathbf{Z}_i' \hat{\mathbf{u}}_i\hat{\mathbf{u}}_i' \mathbf{Z}_i)^{-1}$
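
For orientation, here is one possible end-to-end sketch on simulated data (assumed $\rho = 0.5$, no constant, made-up sample sizes). It builds the weighting matrix from first-step residuals firm by firm and then forms a $J$ statistic as a quadratic form in the stacked moments. Treat it as an illustration under these assumptions, not as the reference solution for the CVR data.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
N, T, rho = 5_000, 6, 0.5

# Simulated AR(1) panel with fixed effects
c = rng.normal(size=N)
pi = np.empty((N, T))
pi[:, 0] = c / (1 - rho) + rng.normal(size=N)
for t in range(1, T):
    pi[:, t] = rho * pi[:, t - 1] + c + rng.normal(size=N)

d_pi = np.diff(pi, axis=1)
Y = d_pi[:, 1:]            # (N, T-2): Delta pi_t
X = d_pi[:, :-1]           # (N, T-2): Delta pi_{t-1}
R = T - 2                  # rows per firm
L = R * (R + 1) // 2       # number of instruments

def Z_of(i):
    """Staircase instrument block for firm i: row r holds pi_0, ..., pi_r."""
    Zi = np.zeros((R, L))
    start = 0
    for r in range(R):
        Zi[r, start:start + r + 1] = pi[i, :r + 1]
        start += r + 1
    return Zi

Z = np.vstack([Z_of(i) for i in range(N)])     # (N*R, L), firms stacked
y, x = Y.reshape(-1, 1), X.reshape(-1, 1)

def gmm(W):
    A = x.T @ Z @ W @ Z.T
    return np.linalg.solve(A @ x, A @ y)

# Step 1: 2SLS-type weighting matrix
b1 = gmm(np.linalg.inv(Z.T @ Z))

# Step 2: W_opt = (sum_i Z_i' u_i u_i' Z_i)^{-1} from first-step residuals
u1 = (y - x @ b1).reshape(N, R)
S = np.zeros((L, L))
for i in range(N):
    g = Z_of(i).T @ u1[i]                      # (L,) moments for firm i
    S += np.outer(g, g)
W_opt = np.linalg.inv(S)
b2 = gmm(W_opt)

# J statistic from second-step residuals
g_sum = Z.T @ (y - x @ b2)
J = (g_sum.T @ W_opt @ g_sum).item()
df = L - x.shape[1]
print(f'b1 = {b1.item():.3f}, b2 = {b2.item():.3f}, J = {J:.2f}, p = {chi2.sf(J, df):.3f}')
```

Because neither the moments nor `W_opt` are divided by $N$ here, the factors of $N$ cancel in the quadratic form, which matches the convention in the note above.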

In [None]:
#Compute the Sargan Test Stat
J = # fill in

# Run test and print results
r=z.shape[1] # number of instruments
K=yfd_l.shape[1] # number of regressors
df=r-K # number of overidentifying restrictions
p_val = chi2.sf(J.flatten()[0], df)

print(f'The Sargan test statistic is: {J.item():.2g}, with p-value: {p_val:.2g}.')

You should get

The Sargan test statistic is: 43, with p-value: 0.002.