# ECON5280 Lecture 5 Potential Outcome

<font size="5">Junlong Feng</font>

## Outline

* Motivation: Econ cares about causality, but what is causality?
* Potential Outcome and Causal Effects: The key elements in Rubin's approach to causal inference.
* Average Causal Effect and Linear Models: When does a linear model represent causality? 
* Identification of Average Effect: Preliminaries for MM estimation.
* Conditional Treatment Effect: Weaker assumption yet rich information.

## 1. Potential Outcome and Causal Effects

There is more than one way to formally define causality. This course follows the one most widely adopted in frequentist statistics and econometrics: Rubin's potential outcome framework (developed in the 1970s).

From now on, let's imagine we have someone called $i$:

* Treatment $D_{i}$: A random variable that indicates the treatment $i$ gets.
  * e.g. Whether $i$ goes to UST or not; whether $i$ studies for a master's degree or not; the brand of COVID-19 vaccine $i$ gets.
  * $D_{i}$ can be either discrete or continuous. 
* Potential outcome $Y_{i}(d)$: A random variable representing the outcome that $i$ would get if she took treatment $d$.
  * e.g. Corresponding to the first example above, let's call going to UST 1 and not going to UST 0. Then $Y_{i}(1)$ can represent $i$'s future income if she goes to UST and $Y_{i}(0)$ is then $i$'s income if she does not.
  * Once we write the potential outcome as $Y_{i}(d)$, we have already made assumptions: i) we implicitly assume that other people's treatment status does not affect $i$'s potential outcome, and ii) we implicitly assume that everyone can have treatment status $d$. These two assumptions are called the **stable unit treatment value assumption (SUTVA)** and we'll maintain them implicitly throughout the semester.
* Observed or realized outcome $Y_{i}$: $Y_{i}=Y_{i}(d)$ if and only if $D_{i}=d$. Or equivalently, $Y_{i}=Y_{i}(D_{i})$.
  * The realized outcome is **the potential outcome** associated with **the** treatment status which $i$ indeed receives.
  * e.g. Suppose you are $i$. You have already come to UST to study. So your $D_{i}=1$, and your future income will be equal to your potential future income at $d=1$. However, there may exist a *parallel universe* where you didn't come to UST (i.e. your $D_{i}=0$). Then the actual future income of *that you* in the parallel world will be $Y_{i}(0)$.
* Individual causal effect, or individual treatment effect (ITE), of receiving treatment $d$ compared to $d'$: $Y_{i}(d)-Y_{i}(d')$.
* **Fundamental problem of causal inference**: You only live once so you only know at most one of $Y_{i}(d)$ and $Y_{i}(d')$.
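
The definitions above can be made concrete with a tiny simulation. This is only a sketch: the numbers, the sample size, and the variable names (`y0`, `y1`, `y0_obs`, `y1_obs`) are assumptions made for illustration. It shows that the realized outcome is $Y_{i}=Y_{i}(D_{i})$ and that, in real data, exactly one potential outcome per person is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical potential outcomes (only a simulation can see both of them).
y0 = rng.normal(loc=10.0, scale=1.0, size=n)        # Y_i(0)
y1 = y0 + rng.normal(loc=2.0, scale=1.0, size=n)    # Y_i(1)

d = rng.integers(0, 2, size=n)      # treatment D_i in {0, 1}
y = np.where(d == 1, y1, y0)        # realized outcome Y_i = Y_i(D_i)

ite = y1 - y0                       # individual effects, never observable in real data

# What a researcher actually sees: the counterfactual outcome is missing (NaN).
y0_obs = np.where(d == 0, y0, np.nan)
y1_obs = np.where(d == 1, y1, np.nan)

for i in range(n):
    print(f"i={i}  D={d[i]}  Y(0)={y0_obs[i]:.2f}  Y(1)={y1_obs[i]:.2f}  Y={y[i]:.2f}")
```

Each printed row has one `nan`, which is the fundamental problem of causal inference in table form.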

### 1.1 Potential Outcome Framework in Action and Average Treatment Effect (ATE)

Why do we think the potential outcome is random? Because it may be affected by other random factors. Let's put everything that would affect $i$'s potential outcome into a vector $U_{i}$. Then essentially we can write a model for $Y_{i}(d)$ and $Y_{i}$:
$$
Y_{i}(d)=g_{i}(d,U_{i}),\ \ \ \ Y_{i}=Y_{i}(d)\ \text{if and only if}\ D_{i}=d,
$$
Or equivalently 
$$
Y_{i}(d)=g_{i}(d,U_{i}),\ \ \ \ Y_{i}=g_{i}(D_{i},U_{i}).
$$
where $g_{i}$ is an **unknown** function, and $U_{i}$ is a random vector.

* e.g. if $Y$ is your income and $D$ is whether you come to UST, then $U$ may contain things like family income,  parents' education level, your age, your gender, your college, your major, your IQ, your ability, etc.

Now we can see that it's ridiculously difficult to know the individual treatment effect (ITE) $g_{i}(d,U_{i})-g_{i}(d',U_{i})$ because

* By direct observation, you see at most one of the two potential outcomes.
* So a researcher may think *"well then maybe I can somehow get to know function $g_{i}$ so that I can compute the potential outcome that I cannot observe"*. But this is very hard without additional assumptions on $g_{i}$. 
* Even if you know $g_{i}$, some variables in $U_{i}$ are not observable by researchers, e.g., ability.
* Now this researcher may think *"well then maybe I can use someone else's ($j$'s, for example) outcome to replace the potential outcome of $i$ that I cannot observe"*. But this is very hard as well because
  * Suppose your friend $j$ didn't come to UST $(d=0)$, and you want to use his outcome as your $Y_{i}(0)$. This amounts to assuming $Y_{i}(0)=Y_{j}(0)$, which is implied by $U_{i}=U_{j}$ and $g_{i}=g_{j}$. But this is highly unlikely: every variable in his $U$ must take the same value as in yours, i.e. your parents have the same level of education, your family incomes are equal, you went to the same college and chose the same major, and you have the same IQ, gender, age, ability, etc. Even if all of this miraculously happened, the income production functions $g$ of you two would also need to be equal. 

So, let us give up ITE and pursue a more approachable causal parameter, the average treatment effect $ATE(d,d')\equiv \mathbb{E}(Y_{i}(d))-\mathbb{E}(Y_{i}(d'))$. 

* Note ATE has NO subscript $i$ because we are going to work with an i.i.d. sample when we study estimation in the next lecture, so $i$'s expectation is equal to everyone else's and the subscript can be dropped.
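
To see why the ATE is a more approachable target than the ITE, here is a minimal sketch. The outcome function `g` and the distribution of $U_{i}$ are assumptions chosen for illustration (the true $g_{i}$ is unknown in practice). The individual effects vary from person to person, while their average is a single number.

```python
import numpy as np

def g(d, u):
    """Hypothetical outcome function: nonlinear in u, with an effect of d that depends on u."""
    return np.exp(0.5 * u) * (1.0 + d)

rng = np.random.default_rng(1)
n = 200_000
u = rng.normal(size=n)              # one scalar U_i ~ N(0, 1) per person (an assumption)

y0, y1 = g(0, u), g(1, u)           # both potential outcomes, visible only in a simulation
ite = y1 - y0                       # equals exp(0.5 * u): different for every person
ate = ite.mean()                    # sample counterpart of E[Y(1)] - E[Y(0)]

ate_theory = np.exp(0.125)          # E[exp(0.5 U)] for U ~ N(0, 1)
print(f"spread of ITE: {ite.std():.3f}, simulated ATE: {ate:.3f}, theoretical ATE: {ate_theory:.3f}")
```

The spread of `ite` is what makes individual effects hard to pin down; the mean is stable because averaging washes out the dependence on each person's $U_{i}$.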

## 2. Linear Model and Causality

Suppose you have a model like $Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}$. Under what assumption does $\beta_{1}$ capture ATE?

### 2.1 Linear Model, Binary Treatment, and ATE

One may think it crazy to believe that your future income is simply a linear function of whether you come to UST or not. Indeed, it's crazy. However, no one has said yet that the linear model is the **true** model $g_{i}$. It could be something else.

Let's start from the simplest case where $D_{i}$ only takes on two values. Without loss of generality (wlog), assume $D_{i}\in\{0,1\}$. We call such a variable a **binary** or a **dummy** variable. Recall $Y_{i}=Y_{i}(d)$ if and only if $D_{i}=d$, so equivalently $Y_{i}=Y_{i}(D_{i})$. Below, $\mathbb{E}(g_{i}(D_{i},U_{i}))$ is shorthand for $\mathbb{E}(g_{i}(d,U_{i}))$ evaluated at $d=D_{i}$: the expectation is taken over $U_{i}$ only, so it is a function of $D_{i}$, not a constant. Since $D_{i}$ is binary, this function equals $\mathbb{E}(g_{i}(0,U_{i}))$ when $D_{i}=0$ and $\mathbb{E}(g_{i}(1,U_{i}))$ when $D_{i}=1$, which gives the third equality:
$$
\begin{align*}
Y_{i}=&g_{i}(D_{i},U_{i})\\
=& \mathbb{E}(g_{i}(D_{i},U_{i}))+\left(g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right)\\
=& \mathbb{E}(g_{i}(0,U_{i}))+[\mathbb{E}(g_{i}(1,U_{i}))-\mathbb{E}(g_{i}(0,U_{i}))]D_{i}+\left(g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right)\\
=&\mathbb{E}(g_{i}(0,U_{i}))+ATE\times D_{i}+\left[g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right].
\end{align*}
$$
Now let's call the **nonrandom** $\mathbb{E}(g_{i}(0,U_{i}))\equiv \beta_{0}$ and the **nonrandom** $ATE\equiv \beta_{1}$, and the random $\left[g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right]\equiv\varepsilon_{i}$. We then have 
$$
Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}.
$$
In the above derivation, we made **no assumption** on the form of $g_{i}$; we never said $g_{i}$ is linear or anything. But we end up with a nice linear model linking $Y_{i}$ and $D_{i}$ where the slope $\beta_{1}$ is exactly ATE!
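
The decomposition can be checked numerically. In the sketch below, the function `g`, the lottery assignment of `d`, and all variable names are assumptions made for illustration. The point is that `g` is deliberately nonlinear, yet $Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}$ holds exactly with $\beta_{1}$ equal to the ATE.

```python
import numpy as np

def g(d, u):
    """Hypothetical, deliberately nonlinear outcome function."""
    return np.sin(u) + d * (1.0 + u**2)

rng = np.random.default_rng(2)
n = 200_000
u = rng.normal(size=n)
d = rng.integers(0, 2, size=n)      # assigned by lottery, independent of u (an assumption)
y = g(d, u)                         # realized outcome Y_i = g(D_i, U_i)

# In a simulation we can approximate E[g(d, U)] for each fixed d by averaging over u.
mu0 = g(0, u).mean()
mu1 = g(1, u).mean()
beta0, beta1 = mu0, mu1 - mu0       # beta1 approximates ATE = E[1 + U^2] = 2

# epsilon_i = g(D_i, U_i) - E[g(d, U_i)] evaluated at d = D_i
eps = y - (beta0 + beta1 * d)

# Under the lottery, the regression slope cov(D, Y) / var(D) is also close to the ATE.
slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
print(f"beta1 (ATE): {beta1:.3f}, regression slope: {slope:.3f}")
```

Note that the match between `slope` and `beta1` relies on the lottery; the linear representation itself does not.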

### 2.2 Linear Model, Discrete Treatment, and ATE

Now let's consider a more complex situation: $D_{i}$ is discrete and can take on more than 2 values, e.g. the brand of COVID-19 vaccine.

Suppose $D_{i}\in\{1,...,J\}$. Let's first construct $J$ dummies: $D^{(j)}_{i}=1$ if and only if $D_{i}=j$, $j=1,...,J$. By construction, $\sum_{j=1}^{J}D_{i}^{(j)}=1$. Now, following a similar idea as in [Section 2.1](#2.1 Linear Model, Binary Treatment, and ATE), we have
$$
\begin{align*}
Y_{i}=&g_{i}(D_{i},U_{i})\\
=& \mathbb{E}(g_{i}(D_{i},U_{i}))+\left(g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right)\\
=& \sum_{j=1}^{J}\mathbb{E}(g_{i}(j,U_{i}))\times D_{i}^{(j)}+\left(g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))\right)\\
\equiv&\sum_{j=1}^{J}\gamma_{j}D_{i}^{(j)}+\varepsilon_{i}\\
\text{(by $\sum_{j}D_{i}^{(j)}=1$):}=&\gamma_{1}+\sum_{j=2}^{J}ATE(j,1)\times D_{i}^{(j)}+ \varepsilon_{i}\\
\equiv&\beta_{1}+\sum_{j=2}^{J}\beta_{j}D_{i}^{(j)}+\varepsilon_{i}.
\end{align*}
$$
where $ATE(j,1)$ is the average treatment effect of changing the treatment status from 1 to $j$.

* The representation is not unique. You can choose any status $j$ as the base level and reconstruct the fifth and sixth equality.
* The intercept is the expectation of potential outcome $Y_{i}(1)$, and with its presence, only $J-1$ dummies are included in the final linear model. Or, you can include all $J$ dummies without intercept (the fourth equality). You **cannot** simultaneously include **all $J$** dummies **and** the intercept, which is called the **dummy variable trap**.

Like the binary $D$'s case, we see that when $D$ is multivalued, a linear model can still capture all the ATEs.

* $ATE(j,j')=\beta_{j}-\beta_{j'}$ for $j,j'\geq 2$ (why?), while $ATE(j,1)=\beta_{j}$ because $\beta_{1}$ here is the intercept.
* You have to construct dummies for all treatment statuses; simply writing $Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}$ will NOT do. See [Section 2.3](#2.3 Linear Model and Continuous Treatment) for details.
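
The dummy construction and the dummy variable trap can be sketched as follows. The data-generating process (three treatment levels with hypothetical means 1, 3, 2, assigned by lottery) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, J = 100_000, 3
d = rng.integers(1, J + 1, size=n)          # D_i in {1, 2, 3}, assigned by lottery
u = rng.normal(size=n)
means = np.array([1.0, 3.0, 2.0])           # hypothetical E[Y(1)], E[Y(2)], E[Y(3)]
y = means[d - 1] + u * d                    # noise scale depends on d

# J dummies: column j-1 is D_i^{(j)}; every row sums to 1.
dummies = (d[:, None] == np.arange(1, J + 1)).astype(float)

# Intercept plus J - 1 dummies, with status 1 as the base level.
X = np.column_stack([np.ones(n), dummies[:, 1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] is close to E[Y(1)] = 1, beta[1] to ATE(2,1) = 2, beta[2] to ATE(3,1) = 1.

# Dummy variable trap: intercept plus all J dummies has linearly dependent columns.
X_trap = np.column_stack([np.ones(n), dummies])
rank_trap = np.linalg.matrix_rank(X_trap)   # J, although X_trap has J + 1 columns
print(beta, rank_trap)
```

The difference `beta[1] - beta[2]` approximates $ATE(2,3)$, which is one way to see the first bullet above.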

### 2.3 Linear Model and Continuous Treatment

Unfortunately, a linear representation is **no longer** wlog when $D_{i}$ is continuous. The intuition is that if you think of a continuous $D$ as a discrete variable taking on infinitely many values, you would have to repeat the same process as in [Section 2.2](#2.2 Linear Model, Discrete Treatment, and ATE). But this is impossible because you would have to construct a continuum of dummies, and there is no way to put them all on the right hand side.

In this case, when we write a linear model $Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}$ (including the case where $D_{i}$ is multivalued discrete), and if we still claim that $\beta_{1}$ is causal, **we are making additional assumptions**: 

* For instance, suppose $D_{i}\in[0,1]$.
* Let $\beta_{0}=\mathbb{E}(Y_{i}(0))$, then we need to assume that the **marginal** ATE at any $d$, defined as $\partial_{d}\mathbb{E}(g_{i}(d,U_{i}))$, is a constant $\beta_{1}$. 
* Equivalently, this is assuming the expectation of the potential outcome $\mathbb{E}(g_{i}(d,U_{i}))$ is a **linear function** in $d$.
* This can be a strong assumption, and, as we have seen in the previous two subsections, is not needed when $D$ is binary or discrete (when a complete set of dummies is constructed).
* But if we are willing to assume such linearity, then $\varepsilon_{i}$, again, is equal to $Y_{i}(D_{i})-\mathbb{E}(Y_{i}(D_{i}))$, exactly the same as the previous two cases.
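
A quick way to see that linearity is a real restriction is to simulate a case where it fails. In this sketch, the choice $\mathbb{E}(g_{i}(d,U_{i}))=d^{2}$, the uniform lottery for $D_{i}$, and the helper `marginal_ate` are all assumptions for illustration. The marginal ATE is $2d$, which varies with $d$, so no single slope can equal it everywhere.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
d = rng.uniform(0.0, 1.0, size=n)   # continuous treatment in [0, 1], randomly assigned
u = rng.normal(size=n)
y = d**2 + u                        # hypothetical g(d, u), so E[g(d, U)] = d^2

# Fit the linear model Y = b0 + b1 * D + error by least squares.
X = np.column_stack([np.ones(n), d])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

def marginal_ate(d0):
    """Derivative of E[g(d, U)] = d^2 with respect to d, evaluated at d0."""
    return 2.0 * d0

print(f"linear slope: {b1:.3f}")
print(f"marginal ATE at d=0.1: {marginal_ate(0.1):.1f}, at d=0.9: {marginal_ate(0.9):.1f}")
```

For this particular design the fitted slope is close to 1, a kind of average of the marginal effects, but it is not the marginal ATE at, say, $d=0.1$ or $d=0.9$.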

## 3. Identification of ATE

Now for simplicity we focus on the following linear model:
$$
Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}.
$$
From [Section 2](#2. Linear Model and Causality), $\beta_{1}$ is the (marginal) ATE if

* $D_{i}$ is binary, or $D$ is multivalued discrete or continuous but the marginal ATE is assumed to be a constant, AND
* $\varepsilon_{i}=Y_{i}(D_{i})-\mathbb{E}(Y_{i}(D_{i}))$ or equivalently, $\varepsilon_{i}=g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))$.

In these scenarios, we calculate $\beta_{1}$ and get causality. But, how? Let's recall the idea of the MM estimator:

1. We are interested in an unknown parameter.
2. We try to come up with equations at the population level such that the parameter is the unique solution to the equations.
3. We try to construct **sample analogues** of the population moment equations, and solve for the parameter at the sample level. The solution is called an estimator.

Step 2 is called **identification** and step 3 is estimation. We focus on identification in this lecture and study estimation in the next.

Basically we now need to do two things: i) find assumptions that give us equations for $\beta_{1}$, and ii) find assumptions such that the solution to the equations is unique. For simplicity, let's call $X_{i}\equiv(1,D_{i})'$ and $\beta\equiv (\beta_{0},\beta_{1})'$, so our linear model becomes 
$$
Y_{i}=X_{i}'\beta+\varepsilon_{i}.
$$
**Construction of moment equations**.

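* First, note that $\mathbb{E}(X_{i}\varepsilon_{i})=(\mathbb{E}(\varepsilon_{i}),\mathbb{E}(\varepsilon_{i}D_{i}))'$. If $\mathbb{E}(\varepsilon_{i})=0$, then $\mathbb{E}(\varepsilon_{i}D_{i})=cov(\varepsilon_{i},D_{i})$, so $\mathbb{E}(X_{i}\varepsilon_{i})=0$ says that $\varepsilon_{i}$ has mean zero and is uncorrelated with $D_{i}$. Because the expectation in $\varepsilon_{i}=g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))$ is taken over $U_{i}$ with $d$ fixed at $D_{i}$, neither property holds automatically; both follow from the sufficient condition in the remarks below.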

* Now let's **assume** $\mathbb{E}(X_{i}\varepsilon_{i})=0$, where the right hand side $0$ is a $2\times 1 $ vector.

* Substituting $\varepsilon_{i}=Y_{i}-X_{i}'\beta$ into the left hand side, we have
  $$
  \mathbb{E}[X_{i}(Y_{i}-X_{i}'\beta)]=0\implies \mathbb{E}(X_{i}X_{i}')\beta=\mathbb{E}(X_{i}Y_{i}).
  $$
  Two moment equations, two unknowns. Moment equations are constructed.

* **Remarks. A sufficient condition** for $cov(D_{i},\varepsilon_{i})=0$ is $\mathbb{E}(\varepsilon_{i}|D_{i})=0$ as we reviewed in Lecture 4. What is a sufficient condition for the latter?

  * $\mathbb{E}(\varepsilon_{i}|D_{i})=0$ is equivalent to $\mathbb{E}[g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))|D_{i}]=0$. Note that
    $$
    \begin{align*}
    \mathbb{E}[g_{i}(D_{i},U_{i})-\mathbb{E}(g_{i}(D_{i},U_{i}))|D_{i}]=\mathbb{E}[g_{i}(D_{i},U_{i})|D_{i}]-\mathbb{E}[g_{i}(D_{i},U_{i})].
    \end{align*}
    $$
    A sufficient condition for the right hand side to be 0 is $U_{i}\perp D_{i}$. This condition is called **complete random assignment**, meaning that the treatment status one receives is independent of everything else that could affect the outcome. E.g. suppose whether you come to UST is completely determined by a lottery. Then it is independent of everything in your $U$.
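
The role of $U_{i}\perp D_{i}$ can be illustrated by comparing a lottery with self-selection. Everything in this sketch is an assumption for illustration: the scalar `u` (think of ability), the constant effect `true_ate`, the selection rule, and the helper `slope`.

```python
import numpy as np

def slope(d, y):
    """Population formula cov(D, Y) / var(D), computed on a sample."""
    return np.cov(d, y)[0, 1] / np.var(d, ddof=1)

rng = np.random.default_rng(5)
n = 200_000
u = rng.normal(size=n)              # hypothetical unobservable, e.g. ability
true_ate = 1.0

# Case 1: treatment decided by a lottery, independent of u.
d_lottery = rng.integers(0, 2, size=n)
y_lottery = u + true_ate * d_lottery

# Case 2: people with higher u are more likely to take the treatment.
d_selected = (u + rng.normal(size=n) > 0).astype(int)
y_selected = u + true_ate * d_selected

slope_lottery = slope(d_lottery, y_lottery)
slope_selected = slope(d_selected, y_selected)
print(f"lottery: {slope_lottery:.3f}, self-selection: {slope_selected:.3f}, true ATE: {true_ate}")
```

Under the lottery the slope is close to the ATE. Under self-selection it is far too large, because $\varepsilon_{i}$ is correlated with $D_{i}$ and the moment condition fails.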

**Conditions under which $\beta$ is unique**.

* From the moment condition $\mathbb{E}(X_{i}X_{i}')\beta=\mathbb{E}(X_{i}Y_{i})$, we know immediately that $\beta$ is unique if $\mathbb{E}(X_{i}X_{i}')$ is invertible, i.e., full-rank.
* In the current case, $X_{i}=(1,D_{i})'$, a two dimensional vector. We can easily work out the if and only if condition for $\mathbb{E}(X_{i}X_{i}')$ to be full-rank: $\mathbb{E}(X_{i}X_{i}')=\begin{pmatrix}1&\mathbb{E}(D_{i})\\\mathbb{E}(D_{i})&\mathbb{E}(D_{i}^{2})\end{pmatrix}$, and thus it is invertible if and only if $\mathbb{E}(D_{i}^{2})-(\mathbb{E}(D_{i}))^{2}\neq 0$, i.e., $\mathbb{V}(D_{i})>0$.
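
For the two-dimensional case the moment equations can also be solved by hand. The following worked inversion is a sketch under the assumption $\mathbb{V}(D_{i})>0$; the shorthand $p\equiv\Pr(D_{i}=1)$ used afterwards is introduced here only for illustration.
$$
\begin{align*}
\beta=&[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Y_{i})=\frac{1}{\mathbb{V}(D_{i})}\begin{pmatrix}\mathbb{E}(D_{i}^{2})&-\mathbb{E}(D_{i})\\-\mathbb{E}(D_{i})&1\end{pmatrix}\begin{pmatrix}\mathbb{E}(Y_{i})\\\mathbb{E}(D_{i}Y_{i})\end{pmatrix},\\
\beta_{1}=&\frac{\mathbb{E}(D_{i}Y_{i})-\mathbb{E}(D_{i})\mathbb{E}(Y_{i})}{\mathbb{V}(D_{i})}=\frac{cov(D_{i},Y_{i})}{\mathbb{V}(D_{i})},\\
\beta_{0}=&\mathbb{E}(Y_{i})-\beta_{1}\mathbb{E}(D_{i}).
\end{align*}
$$
When $D_{i}$ is binary with $p\in(0,1)$, we have $cov(D_{i},Y_{i})=p(1-p)[\mathbb{E}(Y_{i}|D_{i}=1)-\mathbb{E}(Y_{i}|D_{i}=0)]$ and $\mathbb{V}(D_{i})=p(1-p)$, so $\beta_{1}$ reduces to the difference in conditional means $\mathbb{E}(Y_{i}|D_{i}=1)-\mathbb{E}(Y_{i}|D_{i}=0)$.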

**Summary**. Under SUTVA, $Y_{i}(d)=g_{i}(d,U_{i})$.

*  $\beta_{1}$ in the representation $Y_{i}=\beta_{0}+\beta_{1}D_{i}+\varepsilon_{i}$ is the (marginal) ATE if $D_{i}$ is binary, or if $D_{i}$ is multivalued (discrete or continuous) and $\mathbb{E}(Y_{i}(d))$ is linear in $d$.
*  $\beta\equiv(\beta_{0},\beta_{1})'$ is equal to $[\mathbb{E}(X_{i}X_{i}')]^{{-1}}\mathbb{E}(X_{i}Y_{i})$ if the following holds:
   * $\mathbb{E}(X_{i}\varepsilon_{i})=0$, which holds under $D_{i}\perp U_{i}$.
   * Matrix $\mathbb{E}(X_{i}X_{i}')$ is full-rank.

Finally, when $D$ is multivalued discrete and $X_{i}=(1,D_{i}^{(2)},D_{i}^{(3)},...,D_{i}^{(J)})'$, all the above derivations go through (except the formula for $\mathbb{E}(X_{i}X_{i}')$, since $X_{i}$ is now more complicated). 
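
Although estimation is the topic of the next lecture, the identification formula can be previewed by replacing population moments with sample averages. This is a minimal sketch: the helper name `mm_beta` and the simulated data (true $\beta=(2,1.5)'$, lottery assignment) are assumptions for illustration.

```python
import numpy as np

def mm_beta(X, y):
    """Sample analogue of [E(X X')]^{-1} E(X Y)."""
    n = X.shape[0]
    Sxx = X.T @ X / n               # sample version of E(X_i X_i')
    Sxy = X.T @ y / n               # sample version of E(X_i Y_i)
    return np.linalg.solve(Sxx, Sxy)

rng = np.random.default_rng(6)
n = 100_000
d = rng.integers(0, 2, size=n)      # binary treatment assigned by lottery
u = rng.normal(size=n)
y = 2.0 + 1.5 * d + u               # hypothetical data with beta = (2, 1.5)'

X = np.column_stack([np.ones(n), d])
beta_hat = mm_beta(X, y)
print(beta_hat)

# Rank condition: if D has no variation, the moment matrix is singular.
X_const = np.column_stack([np.ones(n), np.ones(n)])
rank_const = np.linalg.matrix_rank(X_const.T @ X_const / n)
print(rank_const)
```

When everyone receives the same treatment the moment matrix has rank 1, so the two equations cannot pin down two unknowns, which is the $\mathbb{V}(D_{i})>0$ requirement in numerical form.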

## 4. Conditional ATE

The strongest assumption so far is $U_{i}\perp D_{i}$. Recall $U_{i}$ is a vector that contains everything affecting $Y_{i}$. In most economic applications, it's hard to believe the treatment $D_{i}$ is independent of everything else.

* e.g. Whether you come to UST $(D_{i})$ may be correlated with your major, your family background, your IQ etc. All these things may also affect your future income $Y_{i}$ so should be in $U_{i}$. Then the assumption $U_{i}\perp D_{i}$ no longer holds.

One possibility to resolve this issue is that when these possible factors in $U_{i}$ can be observed in your data, you can explicitly include them in your model. Let's imagine $U_{i}=(W_{i},V_{i})$, where both vectors $W_{i}$ and $V_{i}$ contain variables that affect $Y_{i}$ but $W_{i}$ are recorded in your data set whereas $V_{i}$ are unobservable. 

We make the following assumption: $U_{i}\not\perp D_{i}$ **only because** $W_{i}\not\perp D_{i}$ but $V_{i}\perp D_{i}|W_{i}$. This is called **conditional random assignment**.

* Again, imagine $D_{i}$ is whether you are able to get an offer from UST. $W_{i}$ contains your undergrad GPA, your TOEFL score, etc. $V_{i}$ is your cognitive ability. $D_{i}$ is correlated with both $W_{i}$ and $V_{i}$. But conditional on $W_{i}$, that is, after I observe your academic performance $W_{i}$, I may believe that anything remaining in your cognitive ability $V_{i}$ is no longer correlated with your offer status because the admission process does not take it into account; the admission committee only saw your $W$ and made the decision.
* In an experimental scenario, suppose $D_{i}$ is a lottery that gives $i$ a free subscription to bilibili/youtube premium. However, bilibili/youtube may run different lotteries for different groups: if you're a new user, your $D=1$ with prob.=0.6. If you're an existing user, $D=1$ with prob.=0.3. Conditional on your user history ($W$), $D$ is independent of everything else. But $D$ is indeed correlated with $W$. 

But what's the use of this assumption? It is useful for identifying another causal parameter: the **conditional average treatment effect (CATE)**: $CATE(d,d')\equiv\mathbb{E}(Y_{i}(d)|W_{i})-\mathbb{E}(Y_{i}(d')|W_{i})$. Compared with ATE, CATE gives you more information: it is a function of the covariates $W_{i}$, so it can reveal heterogeneity in the effect.

* e.g. CATE may tell you the average causal effect of coming to UST on future income for a male student graduated from a top 10 college in China whose parents also hold a bachelor's degree, whereas ATE only tells you the average effect for the overall population.
* By the law of iterated expectations (LIE), $ATE=\mathbb{E}(CATE)$. That is, once you calculate CATE for all possible values of $W_{i}$, averaging over the distribution of $W_{i}$ gives the ATE.
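
As a small worked illustration of the LIE step, suppose (purely as an assumption for this example) that $W_{i}$ is a single binary covariate with $\Pr(W_{i}=1)=0.5$, and that $CATE(d,d')$ equals 2 when $W_{i}=1$ and 1 when $W_{i}=0$. Then
$$
ATE(d,d')=\mathbb{E}(CATE(d,d'))=2\times\Pr(W_{i}=1)+1\times\Pr(W_{i}=0)=2\times 0.5+1\times 0.5=1.5.
$$
The weights are the population shares of each $W$ group, not the shares among those who received treatment $d$.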

Like ATE, let's consider identification of CATE.

**Identification of CATE**. Under conditional random assignment $V_{i}\perp D_{i}|W_{i}$, CATE is identified as
$$
CATE(d,d')=\mathbb{E}(Y_{i}|D_{i}=d,W_{i})-\mathbb{E}(Y_{i}|D_{i}=d',W_{i}).
$$
Proof. Recall that $Y_{i}=g_{i}(D_{i},W_{i},V_{i})$. Hence,
$$
\begin{align*}
\mathbb{E}(Y_{i}|D_{i}=d,W_{i})-\mathbb{E}(Y_{i}|D_{i}=d',W_{i})=&\mathbb{E}(Y_{i}(d)|D_{i}=d,W_{i})-\mathbb{E}(Y_{i}(d')|D_{i}=d',W_{i})\\
=&\mathbb{E}(g_{i}(d,W_{i},V_{i})|D_{i}=d,W_{i})-\mathbb{E}(g_{i}(d',W_{i},V_{i})|D_{i}=d',W_{i})\\
=&\mathbb{E}(g_{i}(d,W_{i},V_{i})|W_{i})-\mathbb{E}(g_{i}(d',W_{i},V_{i})|W_{i})\\
=&\mathbb{E}(Y_{i}(d)|W_{i})-\mathbb{E}(Y_{i}(d')|W_{i})\\
\equiv& CATE(d,d').
\end{align*}
$$
So, CATE is identified under conditional random treatment assignment. 
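
The identification result can be checked in a simulation modeled on the subscription lottery example. The outcome equation, the effect sizes (2 for new users, 1 for existing users), and the helper `cond_mean` are assumptions for illustration. Differences of conditional means within each $W$ group recover the CATE, while the unconditional difference in means does not recover the ATE.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000
w = rng.integers(0, 2, size=n)                  # 1 = new user, 0 = existing user
p = np.where(w == 1, 0.6, 0.3)                  # P(D = 1 | W)
d = (rng.uniform(size=n) < p).astype(int)       # lottery within each W group
v = rng.normal(size=n)                          # unobserved, independent of D given W

# Hypothetical outcome: effect of D is 2 for new users and 1 for existing users.
y = 3.0 * w + (1.0 + w) * d + v

def cond_mean(d_val, w_val):
    """Sample analogue of E(Y | D = d_val, W = w_val)."""
    return y[(d == d_val) & (w == w_val)].mean()

cate = {w_val: cond_mean(1, w_val) - cond_mean(0, w_val) for w_val in (0, 1)}
ate = cate[0] * np.mean(w == 0) + cate[1] * np.mean(w == 1)   # average CATE over W
naive = y[d == 1].mean() - y[d == 0].mean()                   # ignores W

print(f"CATE(W=0): {cate[0]:.3f}, CATE(W=1): {cate[1]:.3f}, ATE: {ate:.3f}, naive: {naive:.3f}")
```

The naive comparison is biased upward here because new users are both more likely to be treated and have higher outcomes regardless of treatment.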

**Can CATE be represented as a linear model like ATE?** For simplicity, let's assume $D$ is binary. We can mimic the derivation of the linear model for ATE:
$$
\begin{align*}
Y_{i}=&g_{i}(D_{i},W_{i},V_{i})\\
=&\mathbb{E}(g_{i}(0,W_{i},V_{i})|W_{i})+D_{i}\times CATE(1,0)+\left[g_{i}(D_{i},W_{i},V_{i})-\mathbb{E}(g_{i}(D_{i},W_{i},V_{i})|W_{i})\right]\\
=&\beta_{0}(W_{i})+\beta_{1}(W_{i})D_{i}+\varepsilon_{i}.
\end{align*}
$$
We can verify that $\mathbb{E}(\varepsilon_{i}|D_{i},W_{i})=0$. Here the conditional expectation inside $\varepsilon_{i}$ is $\mathbb{E}(g_{i}(d,W_{i},V_{i})|W_{i})$ evaluated at $d=D_{i}$, so for each $d$, $\mathbb{E}(\varepsilon_{i}|D_{i}=d,W_{i})=\mathbb{E}(g_{i}(d,W_{i},V_{i})|D_{i}=d,W_{i})-\mathbb{E}(g_{i}(d,W_{i},V_{i})|W_{i})=0$ by **conditional random assignment**. By LIE, $\mathbb{E}(\varepsilon_{i}|W_{i})=0$ as well.

**Important**. Now we see that the obtained model is indeed linear in $D$ (we assumed $D$ is binary). However, the CATE is an unknown function $\beta_{1}(W_{i})$ which is **not necessarily linear in $W$**! So if we write a model like $Y_{i}=\beta_{0}+\beta_{1}D_{i}+W_{i}'\beta_{2}+\varepsilon_{i}$, or $Y_{i}=\beta_{0}+\beta_{1}D_{i}+W_{i}'\beta_{2}+W_{i}'D_{i}\beta_{3}+\varepsilon_{i}$ and claim that $\beta_{1}$ or $\beta_{1}+W_{i}'\beta_{3}$ are CATE, we are **making additional assumptions** on the functional form of $\beta_{0}(\cdot)$ and $\beta_{1}(\cdot)$!
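
To see what can go wrong, consider a case where the true $\beta_{1}(\cdot)$ is not linear. In this sketch the choices $\beta_{0}(w)=w$, $\beta_{1}(w)=w^{2}$, the uniform distribution of $W_{i}$, and the helper names `cate_linear` and `cate_true` are assumptions for illustration. The interacted linear model is fitted by least squares and its implied CATE is compared with the truth.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
w = rng.uniform(-1.0, 1.0, size=n)      # scalar covariate W_i
d = rng.integers(0, 2, size=n)          # lottery, so also random conditional on W
v = rng.normal(size=n)
y = w + (w**2) * d + v                  # hypothetical: beta_0(w) = w, beta_1(w) = w^2

# Fit Y = b0 + b1 * D + b2 * W + b3 * W * D + error.
X = np.column_stack([np.ones(n), d, w, w * d])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

def cate_linear(w0):
    """CATE implied by the fitted linear-in-W specification."""
    return b1 + b3 * w0

def cate_true(w0):
    """True CATE in this simulation."""
    return w0**2

for w0 in (0.0, 1.0):
    print(f"W={w0}: linear spec {cate_linear(w0):.3f}, truth {cate_true(w0):.3f}")
```

In this design the linear specification reports a CATE of roughly $1/3$ at every $W$, which is the population average effect, while the true effect ranges from 0 to 1. The model is still correctly linear in $D$; only the assumed shape in $W$ is wrong.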

## Summary

* Causal effect is defined by the difference of potential outcomes.
* Individual causal effect is almost impossible to trace out. We instead focus on some average, ATE or the more informative CATE.
* We studied two modes of randomization: complete random assignment, $D_{i}\perp(W_{i},V_{i})$, and conditional random assignment, $D_{i}\perp V_{i}|W_{i}$. Note **complete random assignment implies conditional random assignment**.
  * This is because $D_{i}\perp (W_{i},V_{i})$ if and only if $D_{i}\perp V_{i}|W_{i}$ and $D_{i}\perp W_{i}$.
* Under complete random assignment, $\beta$ is identified as $[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Y_{i})$ given that $\mathbb{E}(X_{i}X_{i}')$ is full-rank, and its slope coefficients are the ATEs, if
  * $X_{i}=(1,D_{i})'$ when $D$ is binary, or 
  * $X_{i}=(1,D_{i}^{(2)},...,D_{i}^{(J)})'$ when $D$ is multivalued discrete (including $D_{i}^{(1)}$ as well would fall into the dummy variable trap), or
  * $X_{i}=(1,D_{i})'$ if we believe $\mathbb{E}(g_{i}(d,U_{i}))$ is linear in $d$ when $D$ is continuous or multivalued discrete.
* Under conditional random assignment (or complete random assignment because it implies the former), CATE from $d'$ to $d$ is identified as $\mathbb{E}(Y_{i}|D_{i}=d,W_{i})-\mathbb{E}(Y_{i}|D_{i}=d',W_{i})$. 
* Representing ATE as the slope in a linear model in $D$ is wlog when $D$ is binary.
* Representing CATE as the slope in a linear model in $D$ is wlog when $D$ is binary. But this slope is a function of $W$. Taking this function as linear is NOT wlog but an assumption. 