# ECON5280 Lecture 9 Deep IV

<font size="5">Junlong Feng</font>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/junlong-feng/econ5280/main?filepath=Lecture9_DeepIV.ipynb)

## Outline

* Motivation: To give you a glimpse of the frontier of deep learning+econometrics.
* Nonparametric Instrumental Variable (NPIV): Identification and estimation scheme.
* Neural Networks and Deep Learning: A powerful way to approximate functions.
* The Deep IV Estimator: Using neural networks to estimate NPIV models.
* Applications and Implementation: When and how to use Deep IV.

## 1. Nonparametric Instrumental Variable (NPIV)

From Lecture 7, no matter what the true function $Y=g(D,W,U)$ is, we have the following results:

- When $D$ and $Z$ are both binary and no other covariates are present, a linear model captures LATE. 2SLS/IV/GMM gives you consistent estimates of LATE.
- When $D$ and $Z$ are binary with covariates, CLATE is identified by a ratio of differences in conditional expectations. It can be estimated consistently by causal forest.
  - $CLATE(W)=[\mathbb{E}(Y|Z=1,W)-\mathbb{E}(Y|Z=0,W)]/[\mathbb{E}(D|Z=1,W)-\mathbb{E}(D|Z=0,W)]$.

- When $D$ and $Z$ are more complicated (multi-valued discrete or continuous), identification of CLATE/LATE is much more involved and is still an active research area.

In this lecture, instead of identifying and estimating causal effects without any structure, as we did when $D$ and $Z$ are both binary, we make assumptions on $g$, the function that determines $Y$, in order to handle more complicated $D$ and $Z$. These assumptions are nevertheless much weaker than a linear model.

Recall that $Y=g(D,W,U)$ in general, where $g$ is an unknown function and $U$ is a vector of unobservables. We assume in this lecture that this function has an **additively separable form** such that:
$$
g(D,W,U)=m(D,W)+U,
$$
and $U$ is a **scalar**. 

- This form is more flexible than linear models since it nests the latter: a linear model further assumes $m(D,W)=\beta_{0}+D\beta+W'\beta_{W}$.
- This model is less flexible than the original one because i) $U$ is **additively separable** and ii) $U$ is a scalar.

Suppose $D$ is endogenous in the sense that $cov(D,U)\neq 0$, but suppose we have an instrument $Z$ such that $\mathbb{E}(U|Z,W)=\mathbb{E}(U|W)$. Then we call this model a **nonparametric instrumental variable model** (NPIV, Newey and Powell, 2003, *Econometrica*).

### 1.1 Identification of the NPIV Model

Since we have assumed $m$ is the true function that determines $Y$, as long as we know $m$, we know everything and thus all kinds of causal effects.

To know $m$, we need to construct moment conditions:
$$
\begin{align*}
\mathbb{E}(Y|W,Z)=&\mathbb{E}(m(D,W)|W,Z)+\mathbb{E}(U|W,Z)\\
=&\mathbb{E}(m(D,W)|W,Z)+\mathbb{E}(U|W)\\
=&\int_{d} [m(d,W)+\mathbb{E}(U|W)]dF_{D|WZ}(d|W,Z),
\end{align*}
$$
where $F_{D|WZ}(d|W,Z)$ is the conditional distribution of $D$ given $(W,Z)$: the integral is taken against the conditional density $f_{D|WZ}(d|W,Z)$ if $D$ is absolutely continuous, and becomes a sum weighted by the probability function $\Pr(D=d|W,Z)$ if $D$ is discrete.

The above gives us a moment equation for $m(d,W)+\mathbb{E}(U|W)$ as everything else involved is identifiable moments from the population. If this gives us a unique solution, we say $m(d,W)+\mathbb{E}(U|W)$ is **identified**. Although $m$ itself is not identified, this is sufficient to give us any causal effect because $\mathbb{E}(U|W)$ is cancelled out:
$$
\text{Effect of $D$ on $Y$ from $d'$ to $d$}: m(d,W)-m(d',W)=m(d,W)+\mathbb{E}(U|W)-(m(d',W)+\mathbb{E}(U|W)).
$$

**Notes**. Uniqueness of the solution to the moment equation relies on technical conditions that are beyond the scope of this course. Inverting an integral equation is what is called an "ill-posed problem" in math. If the integral is simply a finite sum, then a full-rank type condition suffices. However, when it is a genuine integral, the infinite-dimensional analogue of full rank is not enough. Let me know if you're interested in this topic.

### 1.2 Estimation

Let $h(D,W)=m(D,W)+\mathbb{E}(U|W)$. The moment condition gives us an integral equation for $h(\cdot,W)$.

**When $D$ is discrete**, say $D\in\{1,2,\ldots,J\}$, the equations look like:
$$
\sum_{d=1}^{J}h(d,W)\Pr(D=d|W,Z)=\mathbb{E}(Y|W,Z).
$$
Then for every $W=w$ of interest, this equation has $J$ unknowns $(h(1,w),h(2,w),\ldots,h(J,w))$. If $Z$ is continuous or discrete with at least $J$ values as well, we can have at least $J$ equations for different values of $Z$. Then all we need to do is:

- Estimate $\Pr(D=d|W=w,Z=z)$ and $\mathbb{E}(Y|W=w,Z=z)$ by, say, causal forest, for every $z$ and every $w$ of interest.
- Then we invert the matrix of $\hat{\Pr}(D=d|W=w,Z=z)$ and obtain $\hat{h}(d,w)$ for all $d$.
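
The two steps above can be sketched numerically for a fixed $W=w$ with $J=3$. This is only a toy illustration: the probabilities in `P` and the values in `h_true` are made up, and in practice `P` and `EY` would be replaced by first-step estimates.

```r
# Toy check of the discrete-D case at a fixed W = w (all numbers are hypothetical).
# Rows of P index instrument values z = 1, 2, 3; columns index d = 1, 2, 3.
# P[z, d] plays the role of Pr(D = d | W = w, Z = z).
P <- matrix(c(0.6, 0.3, 0.1,
              0.2, 0.5, 0.3,
              0.1, 0.2, 0.7),
            nrow = 3, byrow = TRUE)

# Hypothetical true values h(1, w), h(2, w), h(3, w).
h_true <- c(1.0, 2.5, 4.0)

# The moment equation: E(Y | W = w, Z = z) = sum_d h(d, w) * Pr(D = d | w, z).
EY <- as.vector(P %*% h_true)

# Solving the J x J linear system recovers h, provided P has full rank.
h_hat <- solve(P, EY)
print(h_hat)
```

If $Z$ takes more than $J$ values, the system is overdetermined; one option is a least-squares solution, e.g. `qr.solve(P, EY)` with a non-square `P`.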

**When $D$ is continuous**, $Z$ also needs to be continuous and we have an integral equation:
$$
\int_{s}h(s,W)f_{D|WZ}(s|W,Z)ds=\mathbb{E}(Y|W,Z).
$$
This is not easy to solve numerically: you can again estimate $f_{D|WZ}$ and $\mathbb{E}(Y|W,Z)$, but now that $s$ takes values in a continuum, there are infinitely many unknowns to solve for, which is extremely challenging.

Newey and Powell (2003) proposed a method based on traditional nonparametrics. In this lecture, we introduce a new one based on ML: **Deep IV** by Hartford, Lewis, Leyton-Brown, and Taddy (2017, "Deep IV: A Flexible Approach for Counterfactual Prediction", *Proceedings of the 34th International Conference on Machine Learning*).

## 2. Neural Networks

### 2.1 Basics and Approximate Functions

Neural networks are (one of, modestly put) the most popular building blocks of deep learning. They have many uses, but today we focus on function approximation: we are going to build neural networks to approximate arbitrary functions, here $h$ and $f_{D|WZ}$.

- Origins: Algorithms that try to mimic the brain.
- Widely used in the 1980s and early 1990s. Less popular in the late 1990s.
- Today: State-of-the-art technique for many applications.

Example 1. Suppose we have a real vector $x=(x_{1},x_{2},x_{3})$ and a real-valued function of it, $h(x)$. We observe the function value $y$, but don't know what $h$ is. We want to use a flexible enough **parametric** function $h_{\theta}(x)$ to approximate the unknown $h$. A neural net is such an approximation: (See the graph on Canvas)

- Each circle is a neuron, or a unit.

- The first layer has 4 units, consisting of the three variables and a constant $x_{0}=1$.

- They are going to be combined linearly (called the input function) and passed to the hidden layer by a known function $f$ (called the activation function).

  - $a_{1}^{(2)}=f^{(1)}(\theta_{01}^{(1)}+\theta_{11}^{(1)}x_{1}+\theta_{21}^{(1)}x_{2}+\theta_{31}^{(1)}x_{3})$, $a_{2}^{(2)}=f^{(1)}(\theta_{02}^{(1)}+\theta_{12}^{(1)}x_{1}+\theta_{22}^{(1)}x_{2}+\theta_{32}^{(1)}x_{3})$, etc.
  - Then from the hidden layer to the final outcome, we again linearly combine the $a^{(2)}$s using some $\theta^{(2)}$s, and then pass the result into the second activation function $f^{(2)}$. That is our final $h_{\theta}(x)$:

  $$
  \begin{align*}
  h_{\theta}(x)=&f^{(2)}(\theta^{(2)}_{0}+\theta^{(2)}_{1}a^{(2)}_{1}+\theta^{(2)}_{2}a^{(2)}_{2}+\theta^{(2)}_{3}a^{(2)}_{3}).
  \end{align*}
  $$

  - One example is $f(z)=e^{z}/(1+e^{z})$, called the **logistic** function in stats/econometrics and the **sigmoid** function in ML.

In [None]:
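# Plot the logistic (sigmoid) function on [-6, 6].
x <- seq(-6, 6, length.out = 1000)
y <- exp(x) / (1 + exp(x))
plot(x, y, type = "l", xlab = "z", ylab = "f(z)")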

- So $h_{\theta}$ in this example is a known function up to $\theta$, where $\theta$ contains $(3+1)\times 3+(3+1)\times 1=16$ parameters.

- Generalize this example:

  - Hidden layer can have more units than the inputs.
    - A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units (Hornik (1991), "Approximation capabilities of multilayer feedforward networks", *Neural Networks*).
  - There can be more than one hidden layer.
    - One or two hidden layers: Shallow neural network.
    - Three or more hidden layers: Deep neural network. Deep learning.
  - The outcome can be multiple (pattern recognition etc.) 
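
The network in Example 1 can be sketched in a few lines of base R. This is a minimal illustration rather than a recommended implementation: the weights are arbitrary random numbers, both activation functions are taken to be the sigmoid, and the names `sigmoid`, `h_theta`, `Theta1` and `theta2` are chosen here for illustration.

```r
# Sigmoid (logistic) activation.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass of the network in Example 1: 3 inputs, one hidden layer with
# 3 units, one output. Theta1 is 3 x 4 (row j holds the bias and weights of
# hidden unit j); theta2 has length 4 (bias plus weights on the hidden units).
h_theta <- function(x, Theta1, theta2) {
  a2 <- sigmoid(Theta1 %*% c(1, x))            # hidden-layer activations a^(2)
  as.numeric(sigmoid(sum(theta2 * c(1, a2))))  # output after the second activation
}

set.seed(1)
Theta1 <- matrix(rnorm(12), nrow = 3)  # arbitrary illustrative weights
theta2 <- rnorm(4)

# Parameter count: (3 + 1) * 3 + (3 + 1) * 1.
n_par <- length(Theta1) + length(theta2)
print(n_par)

out <- h_theta(c(0.5, -1, 2), Theta1, theta2)
print(out)
```

Fitting the network means choosing `Theta1` and `theta2` so that `h_theta(x, ...)` is close to the observed $y$; here they are left at their random values.
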

Example 2. A sigmoid activation function approximating the max function.

To understand how $h_{\theta}$ can approximate $h$, suppose we have $x_{1}\in \{0,1\}$ and $x_{2}\in \{0,1\}$. Suppose $h(x)=\max\{x_{1}, x_{2}\}$. We can use a sigmoid function $h_{\theta}(x)=\exp(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2})/(1+\exp(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}))$ to approximate it by setting $\theta_{0}=-20$, $\theta_{1}=40$ and $\theta_{2}=40$. Now we can verify that

- $h_{\theta}(0,0)=\exp(-20)/(1+\exp(-20))\approx 0=\max\{0,0\}$.
- $h_{\theta}(0,1)=h_{\theta}(1,0)=\exp(20)/(1+\exp(20))\approx 1=\max\{0,1\}=\max\{1,0\}$.
- $h_{\theta}(1,1)=\exp(60)/(1+\exp(60))\approx1=\max\{1,1\}$.
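
The three checks above can also be reproduced numerically. The short sketch below evaluates the sigmoid with an intercept of $-20$ and a weight of $40$ on each input at all four points of $\{0,1\}^{2}$ and compares it with the max function; the variable names are illustrative.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Parameters from Example 2: intercept -20 and weight 40 on each input.
theta <- c(-20, 40, 40)

# All four points of {0, 1}^2.
grid <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
grid$h_theta <- sigmoid(theta[1] + theta[2] * grid$x1 + theta[3] * grid$x2)
grid$max <- pmax(grid$x1, grid$x2)

# Largest gap between the approximation and the true max over the four points.
max_gap <- max(abs(grid$h_theta - grid$max))
print(grid)
print(max_gap)
```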

Example 3. Play with neural nets at https://playground.tensorflow.org.

### 2.2 Approximate Distributions

We can also use neural networks to approximate densities.

#### 2.2.1 Gaussian Mixture

A Gaussian mixture is a weighted combination of normal distributions, for instance, $\omega_{1}N(2,4)+\omega_{2}N(-1,3)$ where weights $\omega_{1},\omega_{2}>0$ and $\sum_{k=1}^{2}\omega_{k}=1$.

It is known that "any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components" (Chapter 3 in Goodfellow, Bengio, and Courville, 2016, *Deep Learning*, MIT Press).

So one can use a Gaussian mixture to estimate the conditional density $f_{D|WZ}(d|W,Z)$ in the moment equation:
$$
f_{D|WZ;\theta_{density}}(d|W_{i},Z_{i})=\sum_{k}\omega_{k}(W_{i},Z_{i};\theta_{density})\phi(d;\mu_{k}(W_{i},Z_{i};\theta_{density}),\sigma^{2}_{k}(W_{i},Z_{i};\theta_{density})).
$$

- The right hand side is a weighted average of normal densities: $\sum_{k}\omega_{k}\phi(d;\mu_{k},\sigma_{k}^{2})$.
- $\phi(d;\mu_{k},\sigma_{k}^{2})$ is the density function of $N(\mu_{k},\sigma_{k}^{2})$ evaluated at $d$.
- The means $\mu_{k}$, variances $\sigma_{k}^{2}$, and weights $\omega_{k}$ all depend on the data $(W_{i},Z_{i})$ because we are estimating a conditional density, which, by definition, is a function of the conditioning variables.
- Therefore, $\mu_{k}$, $\sigma_{k}^{2}$, and $\omega_{k}$ are **unknown functions** of $(W_{i},Z_{i})$. We use neural networks to approximate these functions. $\theta_{density}$ collects the parameters in the neural networks.
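
To make the formula concrete, the sketch below evaluates a Gaussian mixture density for one fixed set of weights, means and variances and checks that it integrates to one. The weights `0.3` and `0.7` are hypothetical, and the function name `mixture_density` is chosen here for illustration. In the conditional setting, `omega`, `mu` and `sigma2` would be outputs of a neural network evaluated at $(W_{i},Z_{i})$.

```r
# Density of a Gaussian mixture at d: sum_k omega_k * phi(d; mu_k, sigma2_k).
mixture_density <- function(d, omega, mu, sigma2) {
  stopifnot(abs(sum(omega) - 1) < 1e-12, all(omega > 0))
  sapply(d, function(di) sum(omega * dnorm(di, mean = mu, sd = sqrt(sigma2))))
}

# The two-component example N(2, 4) and N(-1, 3), with hypothetical weights.
omega  <- c(0.3, 0.7)
mu     <- c(2, -1)
sigma2 <- c(4, 3)

# Evaluate the density at a few points.
print(mixture_density(c(-1, 0, 2), omega, mu, sigma2))

# A valid density integrates to one.
total <- integrate(mixture_density, -Inf, Inf,
                   omega = omega, mu = mu, sigma2 = sigma2)$value
print(total)
```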

#### 2.2.2 The Maximum Likelihood Principle

Following the expression of $f_{D|WZ;\theta_{density}}(d|W_{i},Z_{i})$, we can see that once we fix $\theta_{density}$, this quantity is fixed by plugging in the realization of $W_{i}$ and $Z_{i}$. However, how do we find appropriate $\theta_{density}$?

We are going to use the so-called maximum likelihood (ML) principle.

- Likelihood: A density function $f(y;\theta)$ is a function of $y$, treating $\theta$ as fixed. It is called the **likelihood function** if we instead treat it as a function of $\theta$, holding $y$ fixed.
- So likelihood and density are two sides of the same coin.
- The ML principle says the best estimator $\hat{\theta}$ is the one yielding the largest likelihood given the observed data. In other words, it is the $\theta$ such that the density at the observed data values is largest.
- Put simply, what we observe is what has the largest "possibility".

Under this principle, we can estimate $\theta_{density}$ such that the joint density of the observed data $\{D_{i}:i=1,\ldots,n\}$, conditional on $\{(W_{i},Z_{i}):i=1,\ldots,n\}$, is largest. By i.i.d., the joint density is equal to the product of the individual densities, so
$$
\hat{\theta}_{density}=\arg\max_{\theta_{density}}\prod_{i=1}^{n}f_{D|WZ;\theta_{density}}(D_{i}|W_{i},Z_{i}).
$$
Note that the maximizer is invariant to strictly monotonic transformations. Since log is strictly increasing and log of product is equal to the sum of logs, we have
$$
\begin{align*}
\hat{\theta}_{density}=&\arg\max_{\theta_{density}}\log\left[\prod_{i=1}^{n}f_{D|WZ;\theta_{density}}(D_{i}|W_{i},Z_{i})\right]\\
=&\arg\max_{\theta_{density}}\sum_{i=1}^{n}\log \left[f_{D|WZ;\theta_{density}}(D_{i}|W_{i},Z_{i})\right].
\end{align*}
$$
The right hand side is called the **log likelihood function** of $\theta_{density}$, and the resulting estimator is called the **maximum likelihood estimator**, or MLE.
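
As a small illustration of the ML principle, the sketch below fits a deliberately simple conditional density, $D|Z\sim N(a+bZ,\sigma^{2})$, by maximizing the log likelihood with `optim`. This single-component model with a linear mean is only a stand-in for the neural-network mixture above, and the simulated design and the names `neg_loglik` and `theta_hat` are assumptions made for illustration.

```r
set.seed(42)
n <- 500
Z <- rnorm(n)
D <- 1 + 2 * Z + rnorm(n, sd = 0.5)   # simulated data, hypothetical design

# Negative log likelihood of theta = (a, b, log sigma)
# for the model D | Z ~ N(a + b * Z, sigma^2).
neg_loglik <- function(theta) {
  mu <- theta[1] + theta[2] * Z
  sigma <- exp(theta[3])              # log parameterization keeps sigma positive
  -sum(dnorm(D, mean = mu, sd = sigma, log = TRUE))
}

# Maximizing the log likelihood is the same as minimizing its negative.
start <- c(mean(D), 0, log(sd(D)))
fit <- optim(start, neg_loglik, method = "BFGS")
theta_hat <- fit$par
print(c(a = theta_hat[1], b = theta_hat[2], sigma = exp(theta_hat[3])))

# For this Gaussian model the MLE of (a, b) coincides with OLS.
ols <- coef(lm(D ~ Z))
print(ols)
```

With a neural network, the same objective is used, but $\theta_{density}$ collects the network weights and the optimization is typically done by stochastic gradient descent.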

## 3. Deep IV

Now we are going to estimate $h$.

### 3.1 Conditional Expectation as the Best Predictor

We first need to know the following fact: 

Suppose we want to find a function of $Z$, $f(Z)$, that best approximates the random variable $Y$ in the mean-squared-error sense, i.e., that solves
$$
\min_{f}\mathbb{E}\left[(Y-f(Z))^{2}\right].
$$
Then the minimizer is $f^{*}(Z)=\mathbb{E}(Y|Z)$.

Proof. (Not required.)
$$
\begin{align*}
\mathbb{E}\left[(Y-f(Z))^{2}\right]=&\mathbb{E}\left[(Y-\mathbb{E}(Y|Z)+\mathbb{E}(Y|Z)-f(Z))^{2}\right]\\
=&\mathbb{E}\left[(Y-\mathbb{E}(Y|Z))^{2}\right]+\mathbb{E}\left[(\mathbb{E}(Y|Z)-f(Z))^{2}\right]\\
&+2\mathbb{E}\left[(Y-\mathbb{E}(Y|Z))\cdot(\mathbb{E}(Y|Z)-f(Z))\right]\\
=&\mathbb{E}\left[(Y-\mathbb{E}(Y|Z))^{2}\right]+\mathbb{E}\left[(\mathbb{E}(Y|Z)-f(Z))^{2}\right].
\end{align*}
$$
The first term does not depend on $f$ and the second is nonnegative, so the sum is minimized by setting $f(Z)=\mathbb{E}(Y|Z)$. (The last equality holds because $\mathbb{E}(Y\mathbb{E}(Y|Z))=\mathbb{E}[\mathbb{E}(Y\mathbb{E}(Y|Z)|Z)]=\mathbb{E}[(\mathbb{E}(Y|Z))^{2}]$ and $\mathbb{E}(Yf(Z))=\mathbb{E}[\mathbb{E}(Yf(Z)|Z)]=\mathbb{E}[f(Z)\mathbb{E}(Y|Z)]$, so the cross-product term is 0.) *Q.E.D.*
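
The result can also be checked by simulation. In the sketch below the design is hypothetical: $Z$ is uniform on $[-2,2]$ and $Y=Z^{2}+\varepsilon$, so that $\mathbb{E}(Y|Z)=Z^{2}$ by construction. We compare the sample mean squared error of the conditional expectation with that of two other candidate functions.

```r
set.seed(7)
n <- 100000
Z <- runif(n, -2, 2)
Y <- Z^2 + rnorm(n)          # by construction E(Y | Z) = Z^2

# Sample analogue of E[(Y - f(Z))^2] for a candidate function f.
mse <- function(f) mean((Y - f(Z))^2)

mse_cond_mean <- mse(function(z) z^2)          # the conditional expectation
mse_constant  <- mse(function(z) 4/3 + 0 * z)  # best constant: E(Y) = E(Z^2) = 4/3
mse_shifted   <- mse(function(z) z^2 + 0.3)    # conditional mean plus a bias

print(c(cond_mean = mse_cond_mean,
        constant  = mse_constant,
        shifted   = mse_shifted))
```

The conditional expectation attains the smallest value, which is close to the noise variance of 1.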

Therefore, since $\mathbb{E}(Y|W,Z)=\int_{s}h(s,W)f_{D|WZ}(s|W,Z)ds$, we know that
$$
\int_{s}h(s,W)f_{D|WZ}(s|W,Z)ds=\arg\min_{f} \mathbb{E}\left[(Y-f(W,Z))^{2}\right],
$$
that is, $h$ is the function that solves
$$
\min_{h} \mathbb{E}\left[\left(Y-\int_{s}h(s,W)f_{D|WZ}(s|W,Z)ds\right)^{2}\right].
$$
$$


### 3.2 Putting Everything Together

Finally, we use a neural network $h_{\theta}(s,W)$ to approximate $h(s,W)$, and determine $\theta$ in the following way:

- Replace $\mathbb{E}$ by the sample average $\frac{1}{n}\sum_{i}$ in the last equation of the previous section.

- Replace the density $f_{D|WZ}(s|W,Z)$ by the ML estimate of its neural network approximation, $f_{D|WZ;\hat{\theta}_{density}}(s|W,Z)$.

- Replace $h(s,W)$ by its neural network approximation $h_{\theta}(s,W)$.

- Estimate $\theta$, i.e., find the neural network parameters $\theta$ that solve
  $$
  \min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\left[Y_{i}-\int_{s}h_{\theta}(s,W_{i})f_{D|WZ;\hat{\theta}_{density}}(s|W_{i},Z_{i})ds\right]^{2}.
  $$

Everything in the objective function is known except for $\theta$. The minimizer is called $\hat{\theta}$, and $h_{\hat{\theta}}(s,W)$ is the **Deep IV** estimator of $h(s,W)$.

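- The integral can be approximated by drawing many values of $s$ from the estimated density $f_{D|WZ;\hat{\theta}_{density}}(\cdot|W_{i},Z_{i})$ and averaging $h_{\theta}(s,W_{i})$ over the draws. This is called Monte Carlo integration.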
- The minimization can be solved by stochastic gradient descent, or any other numerical minimization methods for neural networks. 
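
The Monte Carlo step can be sketched as follows for a single observation. Everything here is an illustrative assumption: the mixture parameters stand in for a fitted density at one $(W_{i},Z_{i})$, and `h` is a simple stand-in for $h_{\theta}(\cdot,W_{i})$, chosen so that the integral has a closed form to compare against.

```r
set.seed(123)

# Hypothetical fitted mixture for one observation (W_i, Z_i).
omega  <- c(0.3, 0.7)
mu     <- c(2, -1)
sigma2 <- c(4, 3)

# A stand-in for h_theta(s, W_i); chosen so the integral has a closed form.
h <- function(s) s^2

# Draw s from the mixture: pick a component, then draw from its normal.
B <- 100000
k <- sample(seq_along(omega), size = B, replace = TRUE, prob = omega)
s <- rnorm(B, mean = mu[k], sd = sqrt(sigma2[k]))

# Monte Carlo estimate of int h(s) f(s) ds: the average of h over the draws.
mc_integral <- mean(h(s))

# Closed form for h(s) = s^2: sum_k omega_k * (mu_k^2 + sigma2_k).
exact <- sum(omega * (mu^2 + sigma2))
print(c(monte_carlo = mc_integral, exact = exact))
```

In the Deep IV objective this average is computed for every observation $i$ and plugged into the squared residual, so the number of draws trades off accuracy against computation.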

### 3.3 Pros and Cons

Pros:

- Like causal forest, Deep IV can handle high dimensional $W$ (that is, a large number of covariates can be included into the model), yielding rich heterogeneity.
- Variables need not be numerical.
- Computation can be run on GPUs.

Cons:

- Asymptotic inference is still unclear.

## 4. Applications and Implementation

In principle, Deep IV can be applied in any scenario where you do not want to make too many assumptions on the functional form of the relationship between $Y$ and $(D,W)$. However, given its drawbacks, you can consider alternative methods as well when possible.

- Its best application scenario is when your $D$ and $Z$ are continuous and you have a lot of covariates in $Z$ and/or $W$, especially when some of them are not traditional numerical variables.

One application in Hartford, Lewis, Leyton-Brown, and Taddy (2017) is "search-advertisement position effects".

- Sponsored advertisements generate over 40 billion U.S. dollars annually for search engines (Goldman and Rao, 2014, "Experiments as instruments: Heterogeneous position effects in sponsored search auctions", *EAI Endorsed Transactions on Serious Games*). Example: search for anything, like health insurance, on Google. Results marked **Ad** can show up among the first few results.

- They examine how advertiser's position on the Bing search page (called *slot*, our $D$) affects probability of a user click ($Y$).
- $D$ is endogenous because it is correlated with latent user intent. For instance, when a user searches for "Coke", it is likely both that she will click and that "Coke" is the top advertiser.
- Instruments $Z$ contain a large number of indicators for experiments run by Bing in which advertiser-query pairs were randomly assigned to different algorithms that in turn scored advertisers differently in search auctions, resulting in random variation in position.
- Covariates $W$ contain text data like user query.
- $n>20$ million.
- They find that, among other results, for "on-brand" queries (e.g., Coke bidding on the word "Coke"), position is worth more to small websites.

For implementation, Hartford, Lewis, Leyton-Brown, and Taddy (2017) developed a Python package called DeepIV.

- https://github.com/jhartford/DeepIV.
- The package builds on Keras.
- Keras is available in R.