# Data Generating Process

In this section I create the sample data according to a *true* model of my own design. Consider the simple two-equation model

\begin{align}
Y &= \beta X + e \\
X &= \pi Z + v
\end{align}


All variables are scalars for simplicity. 

The variables $X$ and $Y$ are generated given the following information:

* $Z \sim N(0,1)$

* $u:=
    \begin{pmatrix}
    e \\ v
    \end{pmatrix}
    \sim 
    N
    \left(
    \begin{pmatrix}
    0 \\ 0
    \end{pmatrix}
    , 
    \begin{pmatrix}
    1 & \rho \\ \rho & 1
    \end{pmatrix}
    \right)$, drawn independently of $Z$
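
One way to draw the correlated errors $(e, v)$ is to multiply independent standard normal draws by a Cholesky factor of the covariance matrix; the `sample` function in this notebook relies on the same idea. The following is only a minimal sketch: the names `rho_demo`, `n_demo`, `Sigma_demo`, and the seed are illustrative assumptions and are not used elsewhere in the notebook.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1234)                 # hypothetical seed, only for reproducibility
rho_demo = 0.9                     # illustrative degree of endogeneity
n_demo = 100_000                   # illustrative (large) number of draws

Sigma_demo = [1.0 rho_demo; rho_demo 1.0]
C_demo = cholesky(Sigma_demo)      # C_demo.U is upper triangular with U'U = Sigma_demo

# each row of u_demo is one draw of (e, v) with covariance matrix Sigma_demo
u_demo = randn(n_demo, 2) * C_demo.U
e_demo, v_demo = u_demo[:, 1], u_demo[:, 2]

rho_hat_demo = cor(e_demo, v_demo) # should be close to rho_demo
println("sample correlation of (e, v): ", round(rho_hat_demo, digits=3))
```

With a large number of draws, the sample correlation should land close to the chosen $\rho$, which is a quick sanity check on the construction.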
  

In [None]:
# loading modules
using LinearAlgebra  # exports cholesky, so no separate import is needed
using Plots
using Statistics

In [None]:
# defining composite types that
# serve as containers to store results

struct Parameters
    n::Int64
    beta::Float64
    F::Float64
    rho::Float64
end

struct IV_results
    biv::Float64
    se::Float64
    tstat::Float64
end

struct Sample
    x::Array{Float64,1}
    y::Array{Float64,1}
    z::Array{Float64,1}
end

struct Simresults
    dist::Vector{Float64}
    power::Float64
end

# Defining useful functions

In [None]:
function sample(parms::Parameters)
    
    # correlated errors (e, v): iid N(0,1) draws times the upper Cholesky
    # factor U of the covariance matrix [1 rho; rho 1], where U'U is that matrix
    C = cholesky([1 parms.rho; parms.rho 1])
    u = randn(parms.n, 2) * C.U
        
    e = u[:,1]
    v = u[:,2] 
    Z = randn(parms.n)
    
    # creating endogenous variables
    # first-stage coefficient implied by the target first-stage F: pi = sqrt(F/N)
    pie = sqrt(parms.F/parms.n)
    X = pie * Z + v
    Y = parms.beta * X + e
    
    # returning Sample object
    Sample(X, Y, Z)
end

In [None]:
"""  
    iv(X, Y, Z)

Just-identified IV of Y on X with instrument Z; tstat is |t| for H0: beta = 0.
note: the function iv(...) nests OLS estimation:
        iv(x,y,x) does OLS
        iv(x,y,z) does IV
"""
function iv(X::Array{Float64,1}, Y::Array{Float64,1}, Z::Array{Float64,1})
    
    biv = Z'Y/(Z'X)
    ehat = Y - X*biv
    shat = ehat'ehat/length(ehat)
    Omegahat_hom = shat * (inv(Z'X) * (Z'Z) * inv(X'Z))
    
    se_hom = sqrt(Omegahat_hom)
    tstat_hom = abs(biv/se_hom)

    # returning IV_results object
    IV_results(biv, se_hom, tstat_hom)

end

In [None]:
function simulate(parms::Parameters, reps::Int64=5000)
    
    b = zeros(reps)
    pow = 0
    
    for r = 1:reps
        
        mysample = sample(parms)
        myresults = iv(mysample.x, mysample.y, mysample.z)
        
        b[r] = myresults.biv
        pow += (myresults.tstat>1.96)
    
    end
    
    # returning simulation result object
    Simresults(b, pow/reps)
end

In [None]:
function power(parms::Parameters, betarange=(-2:0.1:2))

    powerfunc = zeros(length(betarange))

    for (i, b) in enumerate(betarange)

        # same design, but with true coefficient b;
        # the t-test inside simulate(...) is still of H0: beta = 0
        parmsnew = Parameters(parms.n, b, parms.F, parms.rho)
        powerfunc[i] = simulate(parmsnew).power

    end

    powerfunc

end

# Simulating Distribution of IV Estimator

There are four things we need to "choose": sample size, $\beta$, degree of endogeneity, and strength of the instrument. 

Here are the values of the four that we will be studying:

* $N=1000$ (could play around with that)

* $\beta=0$

* degree of endogeneity $\rho \in \lbrace 0, 0.5, 0.9 \rbrace$ (increasing degree)

* strength of instrument: via first stage $F \in \lbrace 5.53, 6.66, 8.96, 16.38 \rbrace$. The first-stage coefficient $\pi$ is governed indirectly by the choice of $F$, as explained below.

The first stage $F$ determines $\pi$. Here's the explanation:

* Because $Z$ and $v$ are independent with unit variance, $s_Z^2 \approx 1$ and $s_X^2 \approx 1+\hat{\pi}^2$. The first-stage $R^2$ is therefore $R^2 = (\widehat{\text{Corr}}(X,Z))^2 = s_{XZ}^2/(s_{X}^2 s_{Z}^2) \approx \hat{\pi}^2/(1+\hat{\pi}^2)$, and so $\hat{\pi} \approx \sqrt{R^2/(1-R^2)}$.

* At the same time, for the simple linear regression model $F=(N-2) \cdot R^2/(1-R^2) \approx N \cdot R^2/(1-R^2)$, which suggests the following relationship between the first stage $F$ and the coefficient $\hat{\pi}$: $F \approx N \cdot \hat{\pi}^2.$

* For practical purposes we take that to imply that $\pi = \sqrt{\frac{F}{N}}$. 
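
To make the mapping from $F$ to $\pi$ concrete, here is a small sketch that computes the implied first-stage coefficient and population $R^2$ for each of the $F$ values considered here. The names `N_demo`, `F_grid`, `pi_grid`, and `R2_grid` are illustrative assumptions and are not used elsewhere in the notebook.

```julia
N_demo = 1000                          # sample size, as in the design above
F_grid = [5.53, 6.66, 8.96, 16.38]     # first-stage F values under study

# pi = sqrt(F/N), the relationship adopted above
pi_grid = sqrt.(F_grid ./ N_demo)

# implied population first-stage R^2 = pi^2 / (1 + pi^2)
R2_grid = pi_grid .^ 2 ./ (1 .+ pi_grid .^ 2)

for (F, p, r2) in zip(F_grid, pi_grid, R2_grid)
    println("F = $F  ->  pi = $(round(p, digits=4)),  R^2 = $(round(r2, digits=4))")
end
```

Even the largest $F$ in this set corresponds to a small $\pi$ and a first-stage $R^2$ of under 2% when $N=1000$, which is why these designs are useful for studying weak instruments.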

So what we will be doing here is setting $F$ to govern $\pi$, the strength of the instrument. From the Stock and Yogo paper we know that interesting values for $F$ are $\{5.53, 6.66, 8.96, 16.38\}$ (see their Table 5.2 as shown in my week 9 lecture notes).

In [None]:
# setting parameters
# DGP as described above
beta = 0
N = 1000 # sample size
F = 5.53
rho = 0.90
myparms = Parameters(N, beta, F, rho);

In [None]:
biv_dstn = simulate(myparms);

In [None]:
# finite sample distribution, it's just a histogram!
# bias = mean of the simulated estimates minus the true beta
bias_hat = round(mean(biv_dstn.dist) - myparms.beta, digits=2)
var_hat = round(var(biv_dstn.dist), digits=1)
mydistplot = histogram(biv_dstn.dist, normalize=true,
    xlims=(-3,3), ylims=(0,2.0), xticks=-3:1:3, yticks=0:.25:2.0,
    title="Simulated distribution of IV estimator\n" *
        "(N=$(myparms.n), F=$(myparms.F), rho=$(myparms.rho))\n" *
        "Bias=$bias_hat, Variance=$var_hat",
    label="", color="#268bd2")

In [None]:
# save figure for LaTeX lecture notes
savefig(plot(mydistplot, dpi=300, background_color="#eee8d5"), 
    "simulated_distribution_iv_N$(myparms.n)_F$(myparms.F)_rho$(myparms.rho).png")

# Computing Actual Power/Size of t-test
Recall that the power function gives, most generally, the probability of rejecting the null hypothesis as a function of the true coefficient. 

To study the power function we will do this:

* We generate many models by letting the value of $\beta$ vary over the interval $[-2, 2]$ in small discrete steps.

* Irrespective of the value of the *true* coefficient $\beta$, we will always be testing the null hypothesis $H_0: \beta=0$. 

This allows us to simulate the statistical power and size. 

\begin{align*}
    \text{Power} &= \Pr(\text{reject } H_0: \beta = 0 \mid \text{true } \beta \neq 0 )\\
    \text{Size}  &= \Pr(\text{reject } H_0: \beta = 0 \mid \text{true } \beta = 0 )
\end{align*}

Of course we want the size to equal the nominal 5% level (the test rejects when $|t| > 1.96$) while at the same time maximizing the power.
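
Because the size is estimated from a finite number of replications, the simulated rejection rate will not equal 5% exactly even for a correctly sized test. The sketch below gives a rough sense of the simulation noise, assuming independent replications and the default of 5000 replications in `simulate`; the names `reps_demo`, `nominal_size`, `mc_se`, and `band` are illustrative assumptions.

```julia
reps_demo = 5000        # default number of replications in simulate(...)
nominal_size = 0.05     # nominal size of the two-sided t-test

# standard error of a rejection frequency (a binomial proportion)
# when the true rejection probability equals the nominal size
mc_se = sqrt(nominal_size * (1 - nominal_size) / reps_demo)

# approximate 95% band for the simulated size of a correctly sized test
band = (nominal_size - 1.96 * mc_se, nominal_size + 1.96 * mc_se)

println("Monte Carlo s.e. of the empirical size: ", round(mc_se, digits=4))
println("approximate 95% band: ", round.(band, digits=4))
```

An empirical size well outside this band points to a size distortion of the test rather than to simulation noise.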




In [None]:
betagrid = -2:0.1:2   # same grid as the default betarange in power(...)
powergraph = power(myparms, betagrid); 
# empirical size = rejection rate at the grid point where the true beta is 0
size_hat = round(100*powergraph[argmin(abs.(betagrid))], digits=2)
mypowerplot = plot(betagrid, powergraph, lw=3, 
    xlims=(-3,3), ylims=(0,1.02), xticks=-3:1:3, yticks=0:.25:1,
    title="Simulated power function of IV estimator\n" *
        "(N=$(myparms.n), F=$(myparms.F), rho=$(myparms.rho))\n" *
        "empirical size = $size_hat %", 
    label="", linecolor="#268bd2", background_color = :transparent, foreground_color=:black)

In [None]:
# save figure for LaTeX lecture notes
savefig(plot(mypowerplot, dpi=300, background_color="#eee8d5"), 
    "powergraph_N$(myparms.n)_F$(myparms.F)_rho$(myparms.rho).png")