# Introduction to GMM

## What is this and why am I learning it?
GMM (the generalized method of moments) is a way of recovering structural parameters from data when the model may or may not be linear. We have seen over and over again how regressions can uncover unknown "betas" that summarize causal (and correlational) relationships between variables, but those models are _linear_ in the parameters: $y$ is a linear function of $\beta$. However, in many of the models we may write down (especially in Trade, Economic Geography, IO...), the structural relationships are not at all linear in the unknown parameters we want to estimate!

To give a real-world example, take the system of structural equations from the model in Redding and Sturm (2008) (an Economic Geography, Rosen-Roback style model):

$$w_i^{\sigma} = \sum_n L_nw_n \left(\frac{\tau_{ni}}{P_nT_n}\right)^{1-\sigma} \equiv \text{FMA}_i$$

where $P_n$ is the price index, $w_i$ are wages in city $i$, $L_n$ is labor (population), $\tau_{ni}$ are trade costs from $n$ to $i$, and $T_n$ are unknown productivity parameters.

Suppose for simplicity that we know $P_n$, we have data on $\{w_n, L_n, \tau_{ni}\}_{i,n}$, and we take $\sigma = 4$ from previous estimates. Then the only unknowns left in the equation are the $T_n$, and we can use the equation as a GMM moment condition to estimate them!
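
To make the idea concrete, here is a minimal sketch of how the equation above could be turned into a residual that GMM tries to drive toward zero. The function name `fma_residual`, the array layout (`tau[n, i]` holds the trade cost from $n$ to $i$), and the toy numbers are all assumptions made for illustration; none of them come from the paper itself.

```python
import numpy as np

def fma_residual(T, w, L, tau, P, sigma=4.0):
    """w_i^sigma - sum_n L_n w_n (tau_ni / (P_n T_n))^(1 - sigma), one entry per city i."""
    T = np.asarray(T, dtype=float)
    # scale row n of tau by P_n * T_n, then raise to the power (1 - sigma)
    cost_term = (tau / (P * T)[:, None]) ** (1.0 - sigma)
    # weight row n by L_n * w_n and sum over n (axis 0) to get FMA_i
    fma = ((L * w)[:, None] * cost_term).sum(axis=0)
    return w ** sigma - fma

# made-up data for 3 cities (hypothetical numbers)
w = np.array([1.0, 1.2, 0.9])
L = np.array([10.0, 5.0, 8.0])
P = np.array([1.0, 1.1, 0.95])
tau = np.array([[1.0, 1.5, 2.0],
                [1.5, 1.0, 1.3],
                [2.0, 1.3, 1.0]])

resid = fma_residual(np.ones(3), w, L, tau, P)
print(resid.shape)  # one residual per city
```

A candidate vector of $T_n$ that fits the data well is one that makes these residuals small.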

## A simpler example

Let's walk through a more stylized example for our purposes. Suppose you have written down an IO model that goes like this:

Firms produce 2 goods $g = 1,2$ using the input $x$ for each. However, each good uses a different technology, like so:

$$y_g = f_g(x)$$

This is essentially what we saw in class. Now to be more explicit, we are going to say that our model assumes Cobb-Douglas technology, so that our production functions look like this:

\begin{align}
y_1 &= x_1^{\alpha_1} \\ 
y_2 &= x_2^{\alpha_2} 
\end{align}

$x_1, x_2$ index how much of $x$ is used in the production of goods 1, 2 respectively.

Now firms have uncertainty over their prices. Then our first order conditions from the model imply that the following should hold:
\begin{align}
E\left(p_1\alpha_1x_1^{\alpha_1 - 1} - w\right) &= 0 \\ 
E\left(p_2\alpha_2x_2^{\alpha_2-1} - w\right) &= 0
\end{align}
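
One way to see where these conditions come from, assuming (the model above does not spell this out) that the firm buys the input at price $w$ and earns profit $\pi_g$ on good $g$:

\begin{align}
\pi_g &= p_g x_g^{\alpha_g} - w x_g \\
\frac{\partial \pi_g}{\partial x_g} &= p_g \alpha_g x_g^{\alpha_g - 1} - w
\end{align}

Setting the derivative to zero and solving for the input gives $x_g = \left(\frac{w}{\alpha_g p_g}\right)^{1/(\alpha_g - 1)}$. The moment conditions only require this first order condition to hold on average across firms, not exactly for every firm.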

Now, we have data on lots of firms $i = 1,\ldots, N$, and for each firm we observe prices, inputs, and wages. From these data we can use GMM to back out the values of $\alpha_1, \alpha_2$ by applying the analogy principle, which replaces each expectation with a sample average:

\begin{align}
\frac{1}{N}\sum_{i=1}^N \left(p_{1i}\alpha_1x_{1i}^{\alpha_1 - 1} - w_i\right) &= 0 \\ 
\frac{1}{N}\sum_{i=1}^N \left(p_{2i}\alpha_2x_{2i}^{\alpha_2-1} - w_i\right) &= 0
\end{align}

Our steps will be:
1. Simulate this scenario by proposing a DGP.
2. Create data.
3. Set up our moment conditions.
4. Apply the analogy principle to get from expectations to averages.
5. Estimate GMM.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats.distributions as iid

In [2]:
# Step 1: The DGP
# note that we need 2N noise draws (N for each input)
def dgp(true_alphas, N):
    alpha1, alpha2 = true_alphas
    # wages and prices for each of the N firms
    wage = iid.pareto(5).rvs(size=(N,1))
    p_1 = iid.norm(10,2).rvs(size=(N,1))
    # in scipy, gamma(3,2) means shape a=3 and loc=2 (not scale)
    p_2 = iid.gamma(3,2).rvs(size=(N,1)) + 10
    
    # create x from these draws by solving p*alpha*x**(alpha-1) = w for x, with noise
    x_1 = (wage/(alpha1*p_1))**(1/(alpha1-1)) + iid.norm(scale=2).rvs(size=(N,1))
    x_2 = (wage/(alpha2*p_2))**(1/(alpha2-1)) + iid.norm(scale=2).rvs(size=(N,1))
    
    return (x_1, x_2, p_1, p_2, wage)

In [3]:
# Step 2: create data
alpha_true = (0.75, 0.5)
data = dgp(alpha_true, 100)

In [4]:
data_df = pd.DataFrame(np.c_[data], columns = ["x1", "x2", "p1", "p2", "w"])
data_df.head()

| | x1 | x2 | p1 | p2 | w |
|---|---|---|---|---|---|
| 0 | 303.548871 | 16.598636 | 8.857415 | 13.331628 | 1.594061 |
| 1 | 1387.691634 | 40.802832 | 9.514598 | 15.616165 | 1.168534 |
| 2 | 1694.956075 | 42.609204 | 8.899414 | 13.264405 | 1.039825 |
| 3 | 822.70585 | 37.676943 | 8.016366 | 13.764793 | 1.122764 |
| 4 | 5132.296637 | 57.086445 | 12.272475 | 17.011806 | 1.087512 |


In [5]:
data_df.describe()

| | x1 | x2 | p1 | p2 | w |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 2017.145618 | 38.272059 | 10.008748 | 14.995389 | 1.30004 |
| std | 2232.322597 | 15.573732 | 2.179358 | 1.819647 | 0.351552 |
| min | 6.085039 | 5.302891 | 5.01694 | 12.416995 | 1.000121 |
| 25% | 606.615146 | 29.501383 | 8.641036 | 13.560891 | 1.081742 |
| 50% | 1405.433578 | 37.396499 | 9.910974 | 14.638081 | 1.211738 |
| 75% | 2590.689909 | 47.480326 | 11.623843 | 16.271642 | 1.386971 |
| max | 15566.982417 | 92.588302 | 15.806297 | 20.071344 | 2.99085 |
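
Because the DGP adds normal noise to the inputs, nothing guarantees that every draw of `x1` or `x2` stays positive (in this sample the minimums are about 6.1 and 5.3, so we are fine). That matters because NumPy returns `nan` when a negative number is raised to a fractional power, which would silently contaminate the moment conditions. A small guard can catch this before estimation; the helper name `all_positive` and the toy arrays are hypothetical, made up for this sketch:

```python
import numpy as np

def all_positive(*arrays):
    """Return True only if every entry of every array is strictly positive."""
    return all(bool(np.all(np.asarray(a) > 0)) for a in arrays)

# toy inputs standing in for x_1 and x_2 (made-up numbers)
x_good = np.array([[303.5], [1387.7], [6.1]])
x_bad = np.array([[16.6], [-0.4], [40.8]])

print(all_positive(x_good))         # True
print(all_positive(x_good, x_bad))  # False

# why it matters: a negative base with a fractional exponent gives nan
with np.errstate(invalid="ignore"):
    has_nan = np.isnan(x_bad ** (0.5 - 1)).any()
print(has_nan)  # True
```

If the check fails, redrawing the data or shrinking the noise scale are two simple options.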


### Creating moment conditions
Our moment conditions will have to be functions of the unknown parameters, since these are... unknown. Here I imitate Ethan's code, with some minor changes. First we set up what the moment condition should look like for each observation _i_. Since we have 2 moment conditions, I am going to stack these horizontally. What will the dimensions of this be?

In [6]:
def gj(alphas, data):
    # unpack my inputs
    alpha1, alpha2 = alphas
    x_1, x_2, p_1, p_2, wage = data
    # set up moments (separately)
    moment1 = p_1*alpha1*x_1**(alpha1 - 1) - wage
    moment2 = p_2*alpha2*x_2**(alpha2 - 1) - wage
    # stack the moments next to each other
    moments = np.c_[moment1, moment2]
    
    return moments

In [7]:
gj(alpha_true, data)[:5] # these are pretty small, bc the alphas are right!

array([[-2.54471943e-03,  4.20658915e-02],
       [ 6.37980560e-04,  5.38261627e-02],
       [ 4.13595104e-04, -2.37962984e-02],
       [-1.57651174e-04, -1.51594930e-03],
       [-4.65035360e-05,  3.82690443e-02]])

In [8]:
gj((0.1, 0.2), data)[:5] # these are much larger!

array([[-1.58889334, -1.3123167 ],
       [-1.16712054, -1.00782072],
       [-1.03872087, -0.90796465],
       [-1.12085781, -0.97177754],
       [-1.08695008, -0.95368181]])

Now, we apply the analogy principle to these moment conditions: by analogy, the sample mean approximates the expectation, so we want to take the mean over N. In our data, the _N_ observations run down the _rows_. The rows are the 0-axis in python, so we want to take the mean over ``axis = 0``. If we instead took the mean over ``axis = 1``, we would be averaging across the columns (that is, across the two moment conditions within each observation), which wouldn't make sense here.
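
A quick toy illustration of the axis convention, using a made-up 3 x 2 array in place of the real moments (three observations, two moment conditions):

```python
import numpy as np

# rows are observations, columns are moment conditions (made-up numbers)
e = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = e.mean(axis=0)  # average over observations: one value per moment
row_means = e.mean(axis=1)  # average over moments: one value per observation

print(col_means)                      # the two values 2.0 and 20.0
print(row_means)                      # the three values 5.5, 11.0 and 16.5
print(col_means.reshape(2, 1).shape)  # (2, 1)
```

Only the `axis=0` version has one entry per moment condition, which is what the GMM objective needs.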

In [9]:
def gN(alphas, data):
    # get individual moments (N x 2)
    e = gj(alphas, data)
    # take the mean over observations and return a column vector (2 x 1)
    g_bar = e.mean(axis=0).reshape((len(alphas),1))
    return g_bar

In [10]:
# test: we want these = 0!
print(f"using true alphas: Eg_j =\n {gN(alpha_true, data)}\n")
print(f"using other alphas: Eg_j =\n {gN((0.9, 0.9), data)}")

using true alphas: Eg_j =
 [[ 0.00258226]
 [-0.00146721]]

using other alphas: Eg_j =
 [[3.1104619 ]
 [8.15269741]]


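Now we have our sample moment conditions, and GMM will try to find the alphas that minimize a weighted sum of squares of these moments:

$$E(g_j(a))'\Omega^{-1} E(g_j(a))$$

where $E(g_j(a))$ is the 2 x 1 vector of averaged moments and $\Omega$ is a 2 x 2 matrix, so the objective itself is a scalar.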

But... what is $\Omega$? The most efficient estimator weights by the inverse of the covariance matrix of the moment conditions. So let's set it up!

In [11]:
def invOmega(alphas, data):
    e = gj(alphas, data)
    # recenter
    e = e - e.mean(axis=0)
    N = e.shape[0]
    var = e.T@e/N
    return np.linalg.inv(var)

In [12]:
invOmega(alpha_true, data)

array([[1318.53877336,  198.82829761],
       [ 198.82829761,  212.27829487]])
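
As a sanity check on the formula inside `invOmega`, the sketch below compares the hand-rolled covariance (recenter, then `e.T @ e / N`) with NumPy's `np.cov`. The random array is a stand-in made up for illustration, not the data from above.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(size=(100, 2))  # stand-in for the N x 2 matrix of moments

centered = e - e.mean(axis=0)  # recenter each column
var_manual = centered.T @ centered / e.shape[0]

# rowvar=False: columns are variables; bias=True: divide by N rather than N - 1
var_numpy = np.cov(e, rowvar=False, bias=True)

print(np.allclose(var_manual, var_numpy))  # True
weight = np.linalg.inv(var_manual)         # the weighting matrix used by GMM
```

Dividing by N instead of N - 1 makes no practical difference here, since any constant rescaling of the weighting matrix leaves the minimizer unchanged.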

In [13]:
# putting it all together, we get the objective function we want to minimize:
def J(alphas,W,data):
    # W is the weighting matrix, i.e. the *inverse* of Omega (or a guess for it)
    g = gN(alphas,data) # Sample moments
    N = data[0].shape[0]

    return (N*g.T@W@g).squeeze() # Scale by sample size

In [14]:
J(alpha_true,invOmega(alpha_true, data),data)

array(0.77424605)

### The 2-step estimator.
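So now we have all the ingredients: we have set up the moment conditions and the objective function J() to minimize. But J() depends on $\Omega$, which is unknown! The idea now will be to proceed in 2 stages, a first pass with a guessed weighting matrix (steps 1 and 2) and a second pass with an estimated one (steps 3 and 4):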

1. Guess $\Omega$. We'll guess the identity matrix.
2. Minimize J using our guess for $\Omega$.
3. Use our estimated alphas to calculate $\widehat{\Omega}$.
4. Using $\widehat{\Omega}$, re-run GMM to re-estimate the alphas.
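
The four steps above can be sketched as one generic routine. This is a minimal illustration on a deliberately different toy problem: two noisy measurements `y1` and `y2` of a single unknown mean `mu`, so there are two moments and one parameter. The names `moments`, `objective`, and `two_step_gmm` are hypothetical, chosen only for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def moments(mu, data):
    """N x 2 matrix of per-observation moments: y1 - mu and y2 - mu."""
    y1, y2 = data
    return np.c_[y1 - mu, y2 - mu]

def objective(mu, W, data):
    """N * gbar' W gbar, where gbar is the vector of sample-average moments."""
    g = moments(mu, data).mean(axis=0)
    return data[0].shape[0] * (g @ W @ g)

def two_step_gmm(data, x0):
    # Steps 1-2: identity weighting matrix, first-stage estimate
    W1 = np.eye(2)
    first = minimize(lambda m: objective(m[0], W1, data), x0=x0).x
    # Step 3: inverse covariance of the moments at the first-stage estimate
    e = moments(first[0], data)
    e = e - e.mean(axis=0)
    W2 = np.linalg.inv(e.T @ e / e.shape[0])
    # Step 4: re-estimate with the updated weighting matrix
    second = minimize(lambda m: objective(m[0], W2, data), x0=x0).x
    return first, second

rng = np.random.default_rng(0)
N = 2000
data = (rng.normal(1.0, 1.0, size=N), rng.normal(1.0, 3.0, size=N))
first, second = two_step_gmm(data, x0=[0.0])
print(first, second)  # both should be close to the true mean of 1.0
```

In this toy problem the second stage puts more weight on the less noisy measurement, which is the whole point of re-estimating the weighting matrix.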

In [15]:
from scipy.optimize import minimize

In [16]:
# Step 1: Guess Omega.
Omega_guess = np.eye(len(alpha_true))

# Step 2: Minimize J using our guess. 
# We're bringing back our old friend, the minimize function!
minimizer = minimize(lambda a: J(a, Omega_guess, data), x0 = [0.5,0.5])
alpha_hat = minimizer.x
print(f"alpha_hat from first stage = {alpha_hat}")

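# Step 3: Update Omega
# (invOmega returns the inverse, so the matrix printed below is the
# inverse of Omega_hat, i.e. the weighting matrix itself)
invOmega_hat = invOmega(tuple(alpha_hat), data)
print(f"Omega_hat =\n {invOmega_hat}")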

# Step 4: re-run with omega_hat
minimizer2 = minimize(lambda a: J(a, invOmega_hat, data), x0 = [0.5,0.5])
alpha_hat2 = minimizer2.x
print(f"alpha_hat from second stage = {alpha_hat2}")

alpha_hat from first stage = [0.74975481 0.50020788]
Omega_hat =
 [[1319.2528229   199.56123631]
 [ 199.56123631  212.50245123]]
alpha_hat from second stage = [0.74975481 0.50020789]
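
The first-stage and second-stage estimates agree to about seven decimal places. That is what we should expect here: there are two moment conditions and two parameters, so the model is exactly identified, and the minimizer can set both sample moments to (numerically) zero no matter which positive definite weighting matrix is used. The weighting matrix only changes the estimate when there are more moments than parameters. A minimal sketch with a made-up linear system of moments $g(a) = Aa - b$ shows the same thing:

```python
import numpy as np

# made-up exactly identified linear "moments": g(a) = A @ a - b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def gmm_solution(W):
    """Closed-form minimizer of (A a - b)' W (A a - b)."""
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

W_identity = np.eye(2)
W_other = np.array([[5.0, 1.0],
                    [1.0, 2.0]])  # another positive definite weighting matrix

a1 = gmm_solution(W_identity)
a2 = gmm_solution(W_other)
print(np.allclose(a1, a2))                     # True
print(np.allclose(a1, np.linalg.solve(A, b)))  # True: both solve A a = b exactly
```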
