# Introduction to GMM

## What is this and why am I learning it?
GMM (the generalized method of moments) is a way of recovering structural parameters from data when the model may or may not be linear in those parameters. We have seen over and over again how regressions can uncover unknown "betas" that summarize causal (and correlational) relationships between variables, but these "betas" enter _linearly_: $y$ is a linear function of $\beta$. However, in many of the models we may write down (especially in Trade, Economic Geography, IO, etc.) we will have structural relationships that are not at all linear in the unknown parameters we want to estimate!

To give a real-world example, take the system of structural equations in Ahlfeldt et al. (2015), who estimate an economic geography model that seeks to explain the changes in rental rates and wages that West Berlin experienced during the Cold War, when it was cut off from East Berlin. Their model puts neighborhood amenities $\tilde{a}_{it}$ in a structural residual that captures the benefits of living in a neighborhood that aren't related to model mechanisms (commuting, labor supply and demand, rents). This is in and of itself a moment condition: it asserts that, in the data, the spatial gradient of changes in these amenities should not be correlated with the distance to the Berlin Wall. That is, if I remove everything the model says is important, leaving me with residuals $\tilde{a}_{it}$, I shouldn't be able to predict changes in how nice a neighborhood is as a function of the distance to the Wall. Their moment conditions are:

$$E(I_k \times \Delta \ln \tilde{a}_{it}/a_t(\Lambda)) = 0$$ 

where $a_t$ is the average amenity in year $t$, and $\tilde{a}_{it}/a_t$ is a non-linear function of the model's parameter (fundamentals) set $\Lambda$.

**Check understanding:** Try to think about what would happen if you tried to estimate this via OLS. What would go wrong?

## A simpler example

Let's walk through a more stylized example for our purposes. Suppose you have written down an IO model that goes like this:

Firms produce 2 goods, $g\in\{1,2\}$ using a single input $x$ for each. For example, they have one input, corn masa, and can use it to make either tortillas or sopes.  However, each good uses a different technology, like so:

$$y_g = f_g(x)$$

This is essentially what we saw in class. Now to be more explicit, we are going to say that our model assumes Cobb-Douglas technology, so that our production functions look like this:

\begin{align}
y_1 = x_1^{\alpha_1} \\ 
y_2 = x_2^{\alpha_2}  
\end{align}

$x_1, x_2$ denote how much of $x$ is used in the production of goods 1 and 2, respectively.

Now suppose firms face uncertainty over their prices, and let $w$ be the per-unit cost of the input (the wage in our data). Then the first order conditions from the model imply that the following should hold:
\begin{align}
E\left(p_1\alpha_1x_1^{\alpha_1 - 1} - w\right) &= 0 \\ 
E\left(p_2\alpha_2x_2^{\alpha_2-1} - w\right) &= 0
\end{align}
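
One way to arrive at conditions of this form is sketched below. It assumes an objective that is not spelled out above: each firm picks $x_g$ before prices are realized, pays $w$ per unit of input, and maximizes expected profit from good $g$.

\begin{align}
\max_{x_g} \; & E\left(p_g x_g^{\alpha_g} - w x_g\right) \\
\Rightarrow \; & E\left(p_g \alpha_g x_g^{\alpha_g - 1} - w\right) = 0, \quad g \in \{1, 2\}
\end{align}

Under that assumption, the second line is just the derivative of expected profit with respect to $x_g$ set to zero: expected marginal revenue equals marginal cost.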

Now, we have data on lots of firms $i = 1,\ldots, N$, and for each firm we observe prices, inputs, and wages. From these data we can use GMM to back out the values of $\alpha_1, \alpha_2$ by applying the analogy principle (replace each expectation with a sample average), which would imply:

\begin{align}
\frac{1}{N}\sum_i \left(p_{1i}\alpha_1x_{1i}^{\alpha_1 - 1} - w_i\right) &= 0 \\ 
\frac{1}{N}\sum_i \left(p_{2i}\alpha_2x_{2i}^{\alpha_2-1} - w_i\right) &= 0
\end{align}

The terms inside these sums are our $g_j$ functions, and the averages are their sample analogs! Intuitively, what follows is that we will numerically find a vector $[\widehat{\alpha}_1, \widehat{\alpha}_2]$ that forces the left-hand sides of these equations to be as close to zero as possible, that is, we try to make the equations hold. In this case we have 2 equations and 2 unknowns, so GMM should be able to solve the system exactly. If we had more conditions than unknown parameters, GMM would almost surely _not_ be able to solve all equations exactly, but it would find the parameter set that gets closest to doing so. This is actually a good thing: it gives us the opportunity to test how well GMM is doing.
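
The contrast between the just-identified and over-identified cases can be seen in a minimal sketch. It does not use the firm model: the exponential toy data, the two moments, and the names `sample_moments` and `objective` are assumptions made only for illustration, and the moments are left unweighted.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical toy data (not the firm model): exponential draws with true mean 2
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)

def sample_moments(theta, x, overidentified):
    # Moment 1: E(x - theta) = 0, the mean of an exponential
    m1 = np.mean(x - theta)
    if not overidentified:
        return np.array([m1])
    # Moment 2: E(x^2 - 2*theta^2) = 0, the second moment of an exponential
    m2 = np.mean(x**2 - 2 * theta**2)
    return np.array([m1, m2])

def objective(theta, x, overidentified):
    g = sample_moments(theta, x, overidentified)
    return g @ g  # unweighted sum of squared sample moments

# One parameter, one moment: the moment can be set to zero exactly
just = minimize_scalar(objective, bounds=(0.1, 10), args=(x, False), method="bounded")
# One parameter, two moments: in general no theta zeroes both at once
over = minimize_scalar(objective, bounds=(0.1, 10), args=(x, True), method="bounded")

print(f"just identified: theta_hat = {just.x:.3f}, objective = {just.fun:.2e}")
print(f"over identified: theta_hat = {over.x:.3f}, objective = {over.fun:.2e}")
```

With one moment the minimized objective is zero up to optimizer tolerance, whatever the data look like. With two moments it stays strictly positive, and the size of that leftover is what an over-identification test works with.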

**Check your understanding:** Why can't we learn anything about the appropriateness of the moment conditions when GMM is just identified? Intuitively, how can we learn about how appropriate the moment conditions are in the over-identified case?

In summary, our steps to reproduce and estimate this model will be:
1. Simulate this scenario by proposing a DGP.
2. Create data.
3. Set up our moment conditions.
4. Apply the analogy principle to get from expectations to averages.
5. Estimate GMM.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats.distributions as iid
import matplotlib.pyplot as plt

In [None]:
# Step 1: The DGP
# note that we need 2N
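def dgp(true_alphas, N):
    """Simulate N firms: inputs x_1, x_2, prices p_1, p_2, and wages."""
    alpha1, alpha2 = true_alphas
    # in scipy, positional arguments after any shape parameter are (loc, scale)
    wage = iid.pareto(5).rvs(size=(N,1))        # Pareto with shape 5
    p_1 = iid.norm(10,2).rvs(size=(N,1))        # normal with mean 10, sd 2
    p_2 = iid.gamma(3,2).rvs(size=(N,1)) + 10   # gamma with shape 3, loc 2, scale 1; then shifted by 10
    
    # create x by solving each first order condition for x, then adding noise
    x_1 = (wage/(p_1*alpha1))**(1/(alpha1-1)) + iid.norm(scale=2).rvs(size=(N,1))
    x_2 = (wage/(p_2*alpha2))**(1/(alpha2-1)) + iid.norm(scale=2).rvs(size=(N,1))
    
    return (x_1, x_2, p_1, p_2, wage)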

In [None]:
# Step 2: create data
alpha_true = (0.75, 0.5)
data = dgp(alpha_true, 100)

In [None]:
data_df = pd.DataFrame(np.concatenate(data, axis=1), columns = ["x1", "x2", "p1", "p2", "w"])
data_df.head()

In [None]:
data_df.describe()

### Creating moment conditions
Our moment conditions will have to be functions of the unknown parameter, since this is... unknown. Here I imitate Ethan's code, with some minor changes. First we set up, for each observation _i_, what the moment condition should look like. Since we have 2 moment conditions, I am going to stack these horizontally. What will the dimensions of this be?
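
A quick way to check the answer is to stack two stand-in columns and inspect the shape. This is only a sketch: the arrays below are placeholders, not the actual moments, and `N = 5` is an arbitrary small sample.

```python
import numpy as np

N = 5  # hypothetical small sample, just to inspect shapes
moment1 = np.ones((N, 1))       # stand-in for the first moment, one row per firm
moment2 = 2 * np.ones((N, 1))   # stand-in for the second moment

# stacking horizontally puts each moment condition in its own column
stacked = np.concatenate([moment1, moment2], axis=1)
print(stacked.shape)  # (5, 2): N rows, one column per moment condition
```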

In [None]:
def gj(alphas, data):
    # unpack my inputs
    alpha1, alpha2 = alphas
    x_1, x_2, p_1, p_2, wage = data
    # set up moments (separately)
    moment1 = p_1*alpha1*x_1**(alpha1-1) - wage
    moment2 = p_2*alpha2*x_2**(alpha2-1) - wage
    # stack the moments next to each other
    moments = np.concatenate([moment1, moment2], axis=1)
    
    return moments

In [None]:
gj(alpha_true, data)[:5] # these are pretty small, bc the alphas are right!

In [None]:
gj((0.1, 0.1), data)[:5] # these are much larger!

Now we apply the analogy principle to these moment conditions: the sample mean approximates the expectation, so we want to take the mean over the N observations. In our data, the _N_ observations run down the _rows_. The rows are the 0-axis in Python, so we want to take the mean over ``axis = 0``. Taking the mean over ``axis = 1`` would instead average across the columns (that is, across the two moment conditions within each observation), which wouldn't make sense here.
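
The difference between the two axes is easy to see on a tiny made-up array (the numbers are arbitrary and only for illustration):

```python
import numpy as np

# 3 observations (rows) of 2 moment conditions (columns); values are made up
e = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])

col_means = e.mean(axis=0)  # averages over observations: one number per moment
row_means = e.mean(axis=1)  # averages over moments: one number per observation

print(col_means.shape, col_means)  # length 2: the means are 3.0 and 30.0
print(row_means.shape, row_means)  # length 3: the means are 5.5, 16.5 and 27.5
```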

In [None]:
def gN(alphas, data):
    # get individual moments
    e = gj(alphas, data)
    # take mean over observations (rows); return as a column vector
    g_bar = e.mean(axis=0).reshape((len(alphas),1))
    return g_bar

In [None]:
# test: we want these = 0!
print(f"using true alphas: Eg_j =\n {gN(alpha_true, data)}\n")
print(f"using other alphas: Eg_j =\n {gN((0.9, 0.9), data)}")

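Now we have our analogous moment conditions, and GMM will try to find the alphas that minimize a weighted quadratic form in them:

$$E(g_j(\alpha))'\,\Omega^{-1}\,E(g_j(\alpha))$$

Here $\Omega$ is a 2 x 2 matrix, and the whole expression is a scalar. In the sample, each expectation is replaced by the corresponding average.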

But... what is $\Omega$? The most efficient estimator weights by the inverse of the covariance matrix of the moment conditions. So let's set it up!
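
As a sanity check on the shapes involved, here is a small sketch using random stand-in moments (the array `e` is made up, not drawn from the firm model). It builds the covariance by hand, compares it with `np.cov`, and confirms that the quadratic form is a scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(size=(200, 2))  # hypothetical per-observation moments, one column each

# Covariance of the moments: demean, then average the outer products
centered = e - e.mean(axis=0)
omega = centered.T @ centered / e.shape[0]

# np.cov with bias=True uses the same 1/N normalization
same_as_numpy = np.allclose(omega, np.cov(e, rowvar=False, bias=True))

g = e.mean(axis=0).reshape((2, 1))  # 2 x 1 vector of sample moments
quad = (g.T @ np.linalg.inv(omega) @ g).squeeze()

print(omega.shape, quad.shape, same_as_numpy)
```

The weighting matrix is 2 x 2 while the quadratic form has shape `()`, so it is a single number that an optimizer can minimize.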

In [None]:
def invOmega(alphas, data):
    e = gj(alphas, data)
    # recenter
    e = e - e.mean(axis=0)
    N = e.shape[0]
    var = e.T@e/N
    return np.linalg.inv(var)

In [None]:
invOmega(alpha_true, data)

In [None]:
# putting it all together, we get the objective function we want to minimize:
def J(alphas,invomega,data):

    g = gN(alphas,data) # Sample moments
    N = data[0].shape[0]

    return ((N*g.T)@invomega@g).squeeze() # Scale by sample size

In [None]:
print(f'J at the true alphas = {J(alpha_true,invOmega(alpha_true, data),data)}')

In [None]:
print(f'J at some wrong alphas = {J((0.1, 0.1),invOmega((0.1, 0.1), data),data)}')

**Check your understanding:** Why doesn't J = 0 at the true alphas?

### The 2-step estimator.
So now we have all the ingredients: we have set up the moment conditions and the objective function J() to minimize. But J() depends on $\Omega$, which is unknown! The two-step idea is to estimate once with a guess for $\Omega$, and then again with an estimate of it:

1. Guess $\Omega$. We'll guess the identity matrix.
2. Minimize J using our guess for $\Omega$.
3. Use our estimated alphas to calculate $\widehat{\Omega}$.
4. Using $\widehat{\Omega}$, re-run GMM to re-estimate the alphas.
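
The four steps above can also be wrapped into one reusable function. The sketch below is one possible arrangement, not the notebook's own implementation: `two_step_gmm` and `exp_moments` are hypothetical names, and the over-identified exponential toy problem (one parameter, two moments) is an assumption chosen so the example runs on its own.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(moment_fn, theta0, data):
    """moment_fn(theta, data) must return an (N, k) array of per-observation moments."""
    def objective(theta, inv_weight):
        e = moment_fn(theta, data)
        g = e.mean(axis=0)
        return e.shape[0] * (g @ inv_weight @ g)  # N * g' W g

    k = moment_fn(np.asarray(theta0, dtype=float), data).shape[1]
    # Steps 1 and 2: minimize using the identity matrix as the guess
    first = minimize(objective, theta0, args=(np.eye(k),), method="Nelder-Mead")
    # Step 3: covariance of the moments at the first-step estimate, then invert
    e = moment_fn(first.x, data)
    e = e - e.mean(axis=0)
    inv_omega_hat = np.linalg.inv(e.T @ e / e.shape[0])
    # Step 4: re-estimate, starting from the first-step estimate
    second = minimize(objective, first.x, args=(inv_omega_hat,), method="Nelder-Mead")
    return first.x, second.x, second.fun

def exp_moments(theta, x):
    # Toy moments for an exponential with mean t: E(x - t) = 0 and E(x^2 - 2 t^2) = 0
    t = theta[0]
    return np.column_stack([x - t, x**2 - 2 * t**2])

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=2000)  # true mean is 2
theta_first, theta_second, j_stat = two_step_gmm(exp_moments, [1.0], x)
print(f"first step: {theta_first}, second step: {theta_second}, J = {j_stat:.3f}")
```

Passing the moment function as an argument keeps the two-step logic separate from any particular model, so the same wrapper could in principle be pointed at a different moment function without changes.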

In [None]:
from scipy.optimize import minimize

In [None]:
# Step 1: Guess Omega. The identity is its own inverse, so we can pass it straight to J.
Omega_guess = np.eye(len(alpha_true))

# Step 2: Minimize J using our guess. 
# We're bringing back our old friend, the minimize function!
minimizer = minimize(lambda a: J(a, Omega_guess, data), x0 = [0.5,0.5], method = 'Nelder-Mead')
alpha_hat = minimizer.x
print(f'Remember true alphas: {alpha_true}\n')
print(f"alpha_hat from first stage = {alpha_hat} -> gives J = {minimizer.fun}\n")

# Step 3: Update Omega (we store its inverse, which is what J takes)
invOmega_hat = invOmega(tuple(alpha_hat), data)
print(f"inverse of Omega_hat =\n {invOmega_hat}\n")

# Step 4: re-run with omega_hat. May as well use our current alpha_hats as new starting point
minimizer2 = minimize(lambda a: J(a, invOmega_hat, data), x0 = alpha_hat, method = 'Nelder-Mead')
alpha_hat2 = minimizer2.x
print(f"alpha_hat from second step = {alpha_hat2}")

In [None]:
# quick look at the thing we're minimizing 
fig, ax = plt.subplots(1, 2, figsize=(10,5))

# hold alpha_2 at its first-step estimate; vary alpha_1
# zooming in for easier viewing
alpha1 = np.linspace(0.5, 0.85, 100)
j = [J((a1, alpha_hat[1]), invOmega_hat, data) for a1 in alpha1]

# zooming out and taking logs to help with viewing
alpha1_wide = np.linspace(0.1, 1, 100)
j_log = [np.log(J((a1, alpha_hat[1]), invOmega_hat, data)) for a1 in alpha1_wide]

ax[0].plot(alpha1, j, label='J function')
ax[0].axvline(x=alpha_true[0], color='red', label='True alpha 1')
ax[0].legend()
ax[0].set(xlabel="alpha1 guess",ylabel="J function")

ax[1].plot(alpha1_wide, j_log, label='log J function')
ax[1].axvline(x=alpha_true[0], color='red', label='True alpha 1')
ax[1].legend()
ax[1].set(xlabel="alpha1 guess",ylabel="log J function")
