# Introduction
Econometrics textbooks cover many cases in which the outcome $y$ is missing for part of the sample, so that we cannot draw inference from those observations directly. Several models address this: truncated regression, the sample selection model, the tobit model, the hurdle model, and so on. In this text, I first show that all of these models share the same framework and idea by introducing a general setting and a general estimation strategy. Each specific model then follows from particular assumptions on the functional form and on the distribution of the error terms in the general setting.

# General Framework
We start from a general framework.
$$y_i = \begin{cases}
y^*_i,\quad w_i =1 \\
'Abnormal', \quad w_i=0 \\
\end{cases}
$$
and we have 
$$y^*_i = x'_i \beta +\epsilon_i$$
and 
$$
w_i = \begin{cases}
1, \quad z'_i \gamma +v_i>0 \\
0, \quad z'_i \gamma+ v_i<0
\end{cases}
$$
Here, when $w_i=1$, the individual's $y^*_i$ is observed. When $w_i =0$, $y^*_i$ is not observed: either the observation disappears from the sample (truncation), or $y_i$ is recorded as a constant value (censoring). I use 'abnormal' to describe this situation informally. Either way, we get no information on $y^*_i$ itself. $\epsilon_i$ and $v_i$ follow some joint distribution. We discuss the following two cases.

### Case 1: Sample with y= 'abnormal'  totally missing
This means that individuals with $y='abnormal'$ are not in the sample at all, and we observe nothing about them. All we observe is the sample with $w_i=1$. We first derive the conditional expectation of $y$ for the observed sample.
$$E(y_i | x_i,z_i,w_i =1)= E(x'_i\beta+ \epsilon_i | x_i,v_i > -z'_i \gamma )=x'_i\beta+  E(\epsilon_i | x_i,v_i > -z'_i \gamma ) \tag{1}$$
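This implies that the true conditional expectation of $y_i$ in the observed sample is a linear function of $x_i$ plus a (generally non-linear) function of $z_i$. <b>Therefore, if $x$ and $z$ are two different sets of variables and the selection term is uncorrelated with $x_i$, then a linear regression of $y_i$ on $x_i$ is consistent for the slope coefficients (the variance of the estimator may be large, though). If they contain common variables (or are correlated) and $\epsilon_i$ and $v_i$ are dependent, the selection term acts as an omitted variable and such a linear regression is inconsistent.</b>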
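To see the point of (1) numerically, here is a minimal simulation sketch in Python (standard library only). Everything specific in it is my own assumption for illustration: a single regressor with $z_i=x_i$, $\gamma=(0,1)$, $\beta=(1,2)$, jointly normal errors with correlation $\rho=0.8$, and the helper names `ols_slope` and `simulate`. It compares the OLS slope on the full (normally unobservable) sample with the slope on the selected $w_i=1$ sample.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta0=1.0, beta1=2.0, rho=0.8, seed=0):
    """Return (slope on full sample, slope on selected sample)."""
    rng = random.Random(seed)
    full_x, full_y, sel_x, sel_y = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        # eps has variance 1 and correlation rho with v (assumed design)
        eps = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        y_star = beta0 + beta1 * x + eps
        full_x.append(x)
        full_y.append(y_star)
        if x + v > 0:  # selection rule w_i = 1 with z_i = x_i, gamma = (0, 1)
            sel_x.append(x)
            sel_y.append(y_star)
    return ols_slope(full_x, full_y), ols_slope(sel_x, sel_y)


slope_full, slope_selected = simulate()
print(slope_full, slope_selected)
```

Under these assumptions the full-sample slope should be close to 2, while the selected-sample slope should be noticeably smaller, because $E(\epsilon_i | x_i, v_i>-x_i)$ decreases in $x_i$.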
<p>A general estimation strategy is maximum likelihood (MLE). When writing down the likelihood function for each individual, we need to condition on $w_i=1$:
$$L_i(y=y_i, x_i, z_i) \equiv pr(y=y_i | w_i =1,z_i,x_i)= \frac{pr(y =y_i, w_i=1| z_i,x_i)}{pr(w_i =1 | z_i, x_i)}$$
$$=\frac{pr(\epsilon_i = y_i - x'_i\beta,v_i>-z'_i \gamma)}{pr(v_i>-z'_i \gamma)} \tag{2}$$
We can calculate these two terms if we know the joint distribution of $\epsilon_i$ and $v_i$. Using this likelihood function, we can estimate $\gamma$ and $\beta$ by MLE (taking the parameters of the distribution of $\epsilon_i$ and $v_i$ as given).

### Case 2: Sample with y='abnormal' still have $z$ and $x$
This means that even if an individual has $y='abnormal'$, we can still observe its $z_i$ and $x_i$.
Still, from (1), a simple linear regression of $y$ on $x$ using the $w_i=1$ sample may be inconsistent. If $y$ does not disappear but is instead censored at a level (say, $c$), then the true conditional expectation of $y$ for the whole sample is 
$$E(y_i | x_i,z_i)= E(y_i |x_i, w_i =1)pr(w_i =1 |z_i)+ E(y_i|x_i,w_i =0)pr(w_i =0 |z_i)$$
$$=\left[ x'_i\beta+ E(\epsilon|x,v_i>-z'_i \gamma) \right]pr(v_i>-z'_i \gamma)+c*pr(v_i<-z'_i \gamma) \tag{3}$$
Again, this true model is a linear function of $x_i$ and a non-linear function of $z_i$. A simple linear regression using $x_i$ is not a good idea. 
We can write down the likelihood function of individual $i$ as 
$$L_i \equiv {\left[ pr(y =y_i, w_i=1| z_i,x_i)  \right]}^{w_i } { \left[pr(w_i =0 | z_i)\right]}^{1-w_i }$$
$$={\left[ pr(\epsilon_i = y_i - x'_i\beta,v_i>-z'_i \gamma ) \right]}^{w_i} { \left[pr(v_i<-z'_i \gamma)\right]}^{1-w_i} \tag{4}$$

# Sample with abnormal y totally missing,  $x=z$,$\gamma = \beta$,$v=\epsilon$: Truncated Model
The framework now becomes 
$$y_i = \begin{cases}
y^*_i,\quad w_i =1 \\
unobservable, \quad w_i=0 \\
\end{cases}
$$
and we have 
$$y^*_i = x'_i \beta +\epsilon_i$$
and 
$$
w_i = \begin{cases}
1, \quad x'_i \beta +\epsilon_i>c \\
0, \quad x'_i \beta+ \epsilon_i<c
\end{cases}
$$
For exogenous reasons, you may be unable to observe values of the outcome below (or above) a certain level. Such data are truncated (<b>which means that observations with $y<c$ are lost entirely; you see no information about them</b>). What you observe is only the data after truncation. The purpose of regression is to estimate the conditional expectation, but what you can learn directly is only $E(y_i|x_i,w_i=1)$, no longer $E(y|x)$.
<p>We first find the expression for $E(y_i|x_i,w_i=1)$, in order to check whether OLS is consistent. Assume that $\epsilon$ follows $N(0,\sigma^2)$. Then it is easy to see that
    $$
    E(y_i|x_i,w_i=1)= E(\beta x_i +\epsilon_i|x_i,\epsilon_i>c-\beta x_i  )=\beta x_i + E(\epsilon_i| \epsilon_i>c-\beta x_i)
    $$
</p>
Before we proceed further, we first state the following simple result: if $s \sim N(0,1)$, then 
$$
E(s|s>c)= \frac{\phi(c)}{1-\Phi(c)}
$$
Since we assume that $\epsilon $ follows $N(0,\sigma^2)$, we have the following simple result:
$$
E(\epsilon_i| \epsilon_i>c-\beta x_i)= \sigma \frac{\phi(\frac{c-\beta x_i}{\sigma})}{1-\Phi(\frac{c-\beta x_i}{\sigma})}\equiv \sigma\lambda(\frac{c-\beta x_i}{\sigma})
$$
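The result $E(s|s>c)=\phi(c)/(1-\Phi(c))$ can be checked numerically. The sketch below (Python standard library; the function names `inverse_mills` and `truncated_mean_mc` are mine, not part of any library) evaluates the ratio and compares it with the simulated mean of standard normal draws that exceed $c$.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def inverse_mills(c):
    """lambda(c) = phi(c) / (1 - Phi(c)), i.e. E(s | s > c) for s ~ N(0, 1)."""
    return STD.pdf(c) / (1.0 - STD.cdf(c))


def truncated_mean_mc(c, n=200000, seed=1):
    """Monte Carlo estimate of E(s | s > c) from n standard normal draws."""
    rng = random.Random(seed)
    kept = [s for s in (rng.gauss(0, 1) for _ in range(n)) if s > c]
    return sum(kept) / len(kept)


print(inverse_mills(0.5), truncated_mean_mc(0.5))
```

For $\epsilon \sim N(0,\sigma^2)$ the same function gives $E(\epsilon|\epsilon>a)$ as `sigma * inverse_mills(a / sigma)`.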
The function $\lambda(\cdot)$ is called the 'inverse Mills ratio'. Therefore, it is easy to see that 
 $$
    E(y_i|x_i,w_i=1)= E(\beta x_i +\epsilon_i|x_i,\epsilon_i>c-\beta x_i )=\beta x_i +  \sigma\lambda(\frac{c-\beta x_i}{\sigma})
 $$
This is what the true model looks like. Therefore, if we simply regress $y$ on $x$, there is an omitted variable problem and the estimator is no longer consistent. How, then, should we estimate $\beta$?
Maximum likelihood is often a good method. Following (2), we want the likelihood for each individual.
$$
pr(y=y_i |x_i, w_i=1)=f(y_i|x_i, y_i>c)=\frac{f(y_i)}{pr(y_i >c)}
$$
Since $y_i$ given $x_i$ follows $N(\beta x_i, \sigma^2)$, we have 
$$
f(y)= \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{{\left(y-\beta x\right)}^2}{2\sigma^2}}= \frac{1}{\sigma}\phi(\frac{y-\beta x}{\sigma})
$$
and 
$$
pr(y>c)= pr(\beta x + \epsilon>c)= 1-\Phi(\frac{c-\beta x}{\sigma})
$$
Therefore we have 
$$
f(y_i|x_i, y_i>c)= \frac{\frac{1}{\sigma}\phi(\frac{y_i-\beta x_i}{\sigma})}{ 1-\Phi(\frac{c-\beta x_i}{\sigma})}
$$
Now you are ready to construct the MLE to estimate $\beta$ and $\sigma$.
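As a rough sketch of that MLE, the Python function below (standard library only) evaluates the truncated-regression log-likelihood for a single regressor with an intercept. It uses $pr(y_i>c|x_i)=1-\Phi(\frac{c-\beta x_i}{\sigma})=\Phi(\frac{\beta x_i-c}{\sigma})$ in the denominator. The simulated data, the parameter values, and the names `truncated_loglik` and `simulate_truncated` are assumptions made for illustration; in practice one would hand the function to a numerical optimizer.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def truncated_loglik(b0, b1, sigma, xs, ys, c=0.0):
    """Sum of log f(y_i | x_i, y_i > c) with y_i = b0 + b1*x_i + eps_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = b0 + b1 * x
        r = (y - mu) / sigma
        # log of (1/sigma) * phi((y - mu) / sigma)
        log_density = -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * r * r
        # log pr(y > c | x) = log Phi((mu - c) / sigma)
        log_prob_observed = math.log(STD.cdf((mu - c) / sigma))
        total += log_density - log_prob_observed
    return total


def simulate_truncated(n=5000, b0=1.0, b1=2.0, sigma=1.0, c=0.0, seed=3):
    """Draw (x, y) pairs and keep only those with y > c (assumed design)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = b0 + b1 * x + rng.gauss(0, sigma)
        if y > c:
            xs.append(x)
            ys.append(y)
    return xs, ys


xs, ys = simulate_truncated()
ll_true = truncated_loglik(1.0, 2.0, 1.0, xs, ys)
ll_wrong = truncated_loglik(1.0, 1.5, 1.0, xs, ys)
print(ll_true, ll_wrong)
```

On data simulated this way, the log-likelihood at the true parameters should exceed the value at a clearly wrong slope, which is what a maximizer exploits.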

# Sample with unobservable y still have $x$ and $z$,$v\ne \epsilon$, $\gamma\ne \beta$, $cov(\epsilon,v)\ne 0$: Sample Selection Model
The model now becomes:
$$y_i = \begin{cases}
y^*_i,\quad w_i =1 \\
Unobservable, \quad w_i=0 \\
\end{cases}
$$
and we have 
$$y^*_i = x'_i \beta +\epsilon_i$$
and 
$$
w_i = \begin{cases}
1, \quad z'_i \gamma +v_i>0 \\
0, \quad z'_i \gamma+ v_i<0
\end{cases}
$$
Following (1), we immediately know $E(y| x, w_i=1)$: 
    $$
    E(y|  x, w_i=1)= \beta x + E(\epsilon | v>-z'\gamma )
    $$
If we assume that $v$ and $\epsilon$ are jointly normally distributed with mean zero, variances $1$ and $\sigma^2_\epsilon$, and correlation coefficient $\rho$, then we have 
$$
 E(y| x, w_i=1)= \beta x + \rho \sigma_{\epsilon}\lambda(-z' \gamma) 
$$
If $v$ and $\epsilon$ are uncorrelated (i.e., $\rho=0$), the selection term vanishes and it is safe to use OLS. The same holds for the slope coefficients if $x$ and $z$ have no variables in common and are unrelated, so that being in the sample is unrelated to $x$ and $\lambda(-z'\gamma)$ is uncorrelated with $x$. But in most cases $\epsilon$ and $v$ are correlated ($\rho \ne 0$) and $x$ and $z$ contain common variables, so OLS is no longer consistent. How do we deal with this?

## Method 1: Heckman two-step 
### step 1
A natural way to deal with this is to estimate $\lambda(-z' \gamma)$ first, which means we need to estimate $\gamma$ first. This is easy to do by MLE (it is just a probit). For an individual, the likelihood function is
$$
{pr(z'_i \gamma + v_i>0)}^{d_i}{pr(z'_i \gamma + v_i<0)}^{1-d_i} 
$$
in which $d_i=1$ means that $y_i$ is observable (so $d_i$ plays the role of $w_i$). Since $v_i$ is standard normal, we can write down the likelihood and estimate $\hat{\gamma}$. We then get $\hat{\lambda}_i=\lambda(-z'_i\hat{\gamma})$ for each individual.

### step 2
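We can then estimate $\beta$ using (1): regress $y_i$ on $x_i$ and $\hat{\lambda}_i$ in the $w_i=1$ sample. (But can $\rho$ and $\sigma_\epsilon$ be identified? The coefficient on $\hat{\lambda}_i$ estimates only the product $\rho\sigma_\epsilon$, so this regression alone does not separate the two.)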
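The two steps can be sketched in Python with the standard library only. This is an illustration under assumptions of my own, not a production estimator: one selection variable $z_i$ and one regressor $x_i$ (correlated with $z_i$), true values $\gamma=(0,1)$, $\beta=(1,2)$, $\sigma_\epsilon=1$, $\rho=0.8$, and hypothetical helper names `fit_probit` and `ols_two_regressors`. Standard errors, which need a correction for the estimated $\hat{\lambda}_i$, are not computed.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def fit_probit(zs, ws, iters=15):
    """Step 1: Newton-Raphson for pr(w = 1 | z) = Phi(g0 + g1 * z)."""
    g0 = g1 = 0.0
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for z, w in zip(zs, ws):
            q = g0 + g1 * z
            # Derivative of the log-likelihood contribution with respect to q
            lam = STD.pdf(q) / STD.cdf(q) if w else -STD.pdf(q) / STD.cdf(-q)
            k = lam * (lam + q)  # minus the second derivative (positive)
            s0 += lam
            s1 += lam * z
            h00 += k
            h01 += k * z
            h11 += k * z * z
        det = h00 * h11 - h01 * h01
        g0 += (h11 * s0 - h01 * s1) / det
        g1 += (h00 * s1 - h01 * s0) / det
    return g0, g1


def ols_two_regressors(x1, x2, y):
    """Step 2: OLS of y on an intercept, x1 and x2; returns the two slopes."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Simulated data (assumed design); y is only used where w = 1.
rng = random.Random(4)
rho, zs, ws, xs, ys = 0.8, [], [], [], []
for _ in range(20000):
    z = rng.gauss(0, 1)
    x = 0.6 * z + 0.8 * rng.gauss(0, 1)
    v = rng.gauss(0, 1)
    eps = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
    zs.append(z)
    ws.append(1 if z + v > 0 else 0)
    xs.append(x)
    ys.append(1.0 + 2.0 * x + eps)

g0, g1 = fit_probit(zs, ws)
sel = [i for i, w in enumerate(ws) if w == 1]
x_sel = [xs[i] for i in sel]
y_sel = [ys[i] for i in sel]
# lambda(-z'gamma) = phi(z'gamma) / Phi(z'gamma)
lam_hat = [STD.pdf(g0 + g1 * zs[i]) / STD.cdf(g0 + g1 * zs[i]) for i in sel]
beta_hat, rho_sigma_hat = ols_two_regressors(x_sel, lam_hat, y_sel)
print(g0, g1, beta_hat, rho_sigma_hat)
```

The second returned coefficient is the estimate of $\rho\sigma_\epsilon$, the coefficient on $\hat{\lambda}_i$.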

## Method 2: MLE 
Of course, we can also use MLE directly. Following (4), the likelihood function for individual $i$ is 
$$
{pr(z_i \gamma + v_i>0, y=y_i)}^{w_i}{pr(z_i \gamma + v_i<0)}^{1-w_i}
$$
in which 
$$
pr(z_i \gamma + v_i>0, y=y_i) = f( y=y_i) pr(z_i \gamma + v_i>0|y=y_i)
$$
We know that $y$ follows $N(\beta x, \sigma^2_\epsilon)$. Therefore $f(y=y_i)= \frac{1}{\sigma_\epsilon}\phi(\frac{y_i-\beta x_i}{\sigma_\epsilon})$. Calculating $pr(z_i \gamma + v_i>0|y=y_i)$ is a bit tricky. First we know that
$$pr(z_i \gamma + v_i>0|y=y_i)=pr( v_i>-z_i \gamma |\epsilon_i=y_i-x_i\beta)$$
Since $v_i$ and $\epsilon_i$ are jointly normal, $v_i$ given $\epsilon_i$ is normal with mean $\rho\epsilon_i/\sigma_\epsilon$ and variance $1-\rho^2$, so this term equals $\Phi\left(\frac{z_i\gamma+\rho(y_i-x_i\beta)/\sigma_\epsilon}{\sqrt{1-\rho^2}}\right)$.
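A minimal Python sketch of this log-likelihood follows (standard library only). It relies on the fact that, under joint normality with $var(v)=1$, $v_i$ given $\epsilon_i$ is $N(\rho\epsilon_i/\sigma_\epsilon, 1-\rho^2)$, so the selected observations contribute $\frac{1}{\sigma_\epsilon}\phi(e_i)\,\Phi\left(\frac{z_i\gamma+\rho e_i}{\sqrt{1-\rho^2}}\right)$ with $e_i=(y_i-x_i\beta)/\sigma_\epsilon$. The scalar regressors, the simulated design and the name `selection_loglik` are my own assumptions.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def selection_loglik(b0, b1, g0, g1, sigma, rho, data):
    """data: list of (x, z, w, y); y is ignored whenever w == 0."""
    total = 0.0
    for x, z, w, y in data:
        index = g0 + g1 * z
        if w == 1:
            e = (y - b0 - b1 * x) / sigma  # standardized residual
            log_f = -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * e * e
            arg = (index + rho * e) / math.sqrt(1 - rho ** 2)
            total += log_f + math.log(STD.cdf(arg))
        else:
            total += math.log(STD.cdf(-index))  # pr(z'gamma + v < 0)
    return total


# Simulated data (assumed design): gamma = (0, 1), beta = (1, 2), rho = 0.8
rng = random.Random(5)
data = []
for _ in range(5000):
    z = rng.gauss(0, 1)
    x = 0.6 * z + 0.8 * rng.gauss(0, 1)
    v = rng.gauss(0, 1)
    eps = 0.8 * v + 0.6 * rng.gauss(0, 1)
    w = 1 if z + v > 0 else 0
    data.append((x, z, w, 1.0 + 2.0 * x + eps if w else None))

ll_true = selection_loglik(1.0, 2.0, 0.0, 1.0, 1.0, 0.8, data)
ll_rho0 = selection_loglik(1.0, 2.0, 0.0, 1.0, 1.0, 0.0, data)
ll_beta = selection_loglik(1.0, 1.5, 0.0, 1.0, 1.0, 0.8, data)
print(ll_true, ll_rho0, ll_beta)
```

Setting `rho=0.0` reduces the function to a probit plus a separate normal regression, which is one way to see why selection only matters when $\rho\ne 0$.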


# Sample with censored y, still observe $x$, and $x=z$, $v=\epsilon$,$\gamma=\beta$: Tobit Model Type 1

The framework now becomes 
$$y_i = \begin{cases}
y^*_i,\quad w_i =1 \\
0, \quad w_i=0 \\
\end{cases}
$$
and we have 
$$y^*_i = x'_i \beta +\epsilon_i$$
and 
$$
w_i = \begin{cases}
1, \quad x'_i \beta +\epsilon_i>0 \\
0, \quad x'_i \beta+ \epsilon_i<0
\end{cases}
$$
and $\epsilon_i$ is normally distributed, $N(0,\sigma^2)$, with $E(\epsilon_i|x_i)=0$.
We again follow the general framework to check whether OLS gives a consistent estimate, by looking at $E(y|x)$ and $E(y|x,w_i =1)$. First, following (1),
$$E(y|x, w_i=1)= E(x'_i\beta +\epsilon_i|x, x'_i\beta +\epsilon_i>0)=x'_i\beta +E(\epsilon_i|x, \epsilon_i>-x'_i\beta )=x'_i\beta+\sigma \lambda(-x'_i\beta/\sigma)$$
where, with $\lambda(\cdot)$ defined as in the truncated model, $\lambda(-x'_i\beta/\sigma)=\phi(x'_i\beta/\sigma)/\Phi(x'_i\beta/\sigma)$. Clearly, for the observations whose $y^*$ is not censored, OLS does not give a consistent estimate of $\beta$, because of the second term (but non-linear least squares works).
<p>On the other hand, consider $E(y|x)$. According to (3),
$$E(y|x)= E(y|x, w_i=1)pr(w_i =1)+E(y|x,w_i =0)pr(w_i =0)=\Phi(x'_i\beta/\sigma)(x'_i\beta+\sigma \lambda(-x'_i\beta/\sigma))$$</p>
This expression is non-linear in $x_i$, so an OLS linear regression is still not consistent (but non-linear least squares works).
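The expression for $E(y|x)$ is easy to check by simulation. Writing the inverse Mills term out as $\phi/\Phi$, it becomes $\Phi(x'_i\beta/\sigma)\,x'_i\beta+\sigma\phi(x'_i\beta/\sigma)$. The Python sketch below (standard library; the names `tobit_mean` and `simulated_mean` and the test values are my own choices) compares this closed form with the average of $\max(0, x'_i\beta+\epsilon_i)$ over many draws.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def tobit_mean(xb, sigma):
    """Closed-form E(y | x) for y = max(0, xb + eps), eps ~ N(0, sigma^2)."""
    a = xb / sigma
    return STD.cdf(a) * xb + sigma * STD.pdf(a)


def simulated_mean(xb, sigma, n=200000, seed=2):
    """Monte Carlo average of the censored outcome max(0, xb + eps)."""
    rng = random.Random(seed)
    return sum(max(0.0, xb + rng.gauss(0, sigma)) for _ in range(n)) / n


print(tobit_mean(0.5, 2.0), simulated_mean(0.5, 2.0))
```

Evaluating `tobit_mean` on a grid of `xb` values makes the non-linearity in $x'_i\beta$ visible: the function flattens towards zero for very negative index values and approaches the 45-degree line for large positive ones.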
<p>As an exercise, we can also show the above result by writing 
    $$y_i = x'_i\beta +u_i$$ and showing that $E(u_i|x_i)\ne 0$ (so an OLS linear regression is not consistent). Here $u_i=\max(\epsilon_i,-x'_i\beta)$, so $$E(u_i|x_i)= E(\epsilon_i|x_i, \epsilon_i \ge -x'_i \beta)pr( \epsilon_i \ge -x'_i \beta)+ E(-x'_i\beta|x_i, \epsilon_i \le -x'_i \beta)pr( \epsilon_i \le -x'_i \beta)$$
    $$=\int{\max{(\epsilon,-x'_i\beta )}d F(\epsilon|x_i)} > \int{\epsilon\, d F(\epsilon|x_i)}=0$$
</p>

## Estimation Strategy
Following (4), the likelihood of an individual is
$${\left[ pr(y=y_i|x_i,w_i=1)pr(w_i=1 |x_i)  \right]}^{w_i}{\left[pr(w_i=0|x_i) \right]}^{1-w_i}$$
$$={\left[ pr(y=y_i|x_i,y^*_i>0)pr(y^*_i>0 |x_i)  \right]}^{w_i}{\left[pr(y^*_i<0|x_i) \right]}^{1-w_i}={\left[\frac{f(y=y_i|x_i)}{pr(y^*_i>0|x_i)}pr(y^*_i>0 |x_i)  \right]}^{w_i}{\left[pr(y^*_i<0|x_i) \right]}^{1-w_i}$$
$$={\left[ f(y=y_i|x_i) \right]}^{w_i}{\left[pr(y^*_i<0|x_i) \right]}^{1-w_i}={\left( \frac{1}{\sigma}\phi(\frac{y_i-x'_i\beta}{\sigma})  \right)}^{w_i} {\left[ \Phi(\frac{-x'_i\beta}{\sigma})\right]}^{1-w_i} $$
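The final expression translates almost line by line into code. Below is a hedged Python sketch (standard library only) for one regressor plus an intercept: uncensored observations contribute $\log\left[\frac{1}{\sigma}\phi(\frac{y_i-x'_i\beta}{\sigma})\right]$ and censored ones contribute $\log\Phi(\frac{-x'_i\beta}{\sigma})$. The function name `tobit_loglik`, the simulated data and the parameter values are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def tobit_loglik(b0, b1, sigma, xs, ys):
    """Type 1 tobit log-likelihood with censoring at zero."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = b0 + b1 * x
        if y > 0:  # w_i = 1: density of the observed outcome
            r = (y - mu) / sigma
            total += -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * r * r
        else:      # w_i = 0: probability that the latent outcome is negative
            total += math.log(STD.cdf(-mu / sigma))
    return total


# Simulated censored data (assumed design): beta = (1, 2), sigma = 1
rng = random.Random(6)
xs, ys = [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    xs.append(x)
    ys.append(max(0.0, 1.0 + 2.0 * x + rng.gauss(0, 1)))

ll_true = tobit_loglik(1.0, 2.0, 1.0, xs, ys)
ll_slope = tobit_loglik(1.0, 1.5, 1.0, xs, ys)
ll_sigma = tobit_loglik(1.0, 2.0, 2.0, xs, ys)
print(ll_true, ll_slope, ll_sigma)
```

Unlike the truncated model, the censored observations stay in the sum, which is why no division by $pr(y^*_i>0|x_i)$ appears.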

# Sample with censored y, still observe $x$, and $x=z$, $v \ne \epsilon$,$\gamma \ne \beta$, $cov(\epsilon,v)=0$: Cragg's hurdle model
The model now becomes:
$$y_i = \begin{cases}
y^*_i,\quad w_i =1 \\
0, \quad w_i=0 \\
\end{cases}
$$
and we have 
$$y^*_i = x'_i \beta +\epsilon_i$$
and 
$$
w_i = \begin{cases}
1, \quad z'_i \gamma +v_i>0 \\
0, \quad z'_i \gamma+ v_i<0
\end{cases}
$$
in which $\epsilon_i$ follows <b>a truncated normal distribution with lower bound $-x'_i \beta$ and variance parameter $\sigma^2$ (that of the underlying normal)</b>. This assumption ensures that $y^*_i$ is non-negative. Assume that $v$ follows a standard normal distribution. Taking consumption as an example, the framework describes a situation in which people first choose whether to buy ($w_i=1$) or not ($w_i=0$). People choosing $w_i=0$ do not consume at all, so $y_i =0$. People choosing $w_i =1$ further choose the amount of consumption, $y^*_i$, which is non-negative. We first check the conditional expectation of $y_i$:
$$E(y_i | x_i,w_i =1)= E(x'_i\beta+ \epsilon_i | x_i, v_i > -z'_i \gamma )=x'_i\beta+  E(\epsilon_i | x_i, v_i > -z'_i \gamma )=x'_i\beta+  E(\epsilon_i | x_i )$$

The final equality holds since $v_i$ and $\epsilon_i$ are independent. Since $\epsilon_i$ here follows a truncated normal distribution, it is easy to get $E(\epsilon_i | x_i )=\sigma \lambda(-\frac{x'_i \beta}{\sigma})=\sigma\frac{\phi(x'_i\beta/\sigma)}{\Phi(x'_i\beta/\sigma)}$, with $\lambda(\cdot)$ as defined in the truncated model. Following (3), it is also easy to see that the conditional expectation of $y$ for the whole sample is
$$
E(y_i | x_i)=\left[ x'_i\beta+ E(\epsilon|x,v_i>-z'_i \gamma) \right]pr(v_i>-z'_i \gamma)+0*pr(v_i<-z'_i \gamma)
$$
$$=\left[x'_i\beta+\sigma \lambda(-\frac{x'_i \beta}{\sigma})\right]\Phi(z'_i \gamma)$$
<p>
Following (4),we can immediately write down the likelihood for individual $i$ as follows 
$$
{pr(z_i \gamma + v_i>0, y=y_i)}^{w_i}{pr(z_i \gamma + v_i<0)}^{1-w_i}
$$
Since $\epsilon$ and $v$ are independent, we have 
$$
pr(z_i \gamma + v_i>0, \epsilon=y_i-x'_i \beta)=pr(z_i \gamma + v_i>0 )f_\epsilon(y_i-x'_i \beta)=\Phi(z'_i\gamma )\frac{\frac{1}{\sigma}\phi(\frac{y_i-x'_i \beta}{\sigma})}{1-\Phi(\frac{-x'_i \beta}{\sigma})}
$$
and $pr(z_i \gamma + v_i<0)=1- \Phi(z'_i\gamma )$. Now we can estimate $\gamma$, $\beta$ and $\sigma$.</p>
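Here is a small Python sketch of this hurdle log-likelihood (standard library only). It keeps the participation part, $\Phi(z'_i\gamma)$ or $1-\Phi(z'_i\gamma)$, and the amount part, the truncated normal density, in separate accumulators, since with independent errors the two parts do not share parameters. The scalar regressor with $z_i=x_i$, the parameter values, the rejection sampler and the name `hurdle_loglik` are my own assumptions.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def hurdle_loglik(g0, g1, b0, b1, sigma, data):
    """data: list of (x, z, y); y == 0 means w_i = 0, y > 0 means w_i = 1."""
    participation = amount = 0.0
    for x, z, y in data:
        index = g0 + g1 * z
        if y > 0:
            mu = b0 + b1 * x
            r = (y - mu) / sigma
            participation += math.log(STD.cdf(index))
            # log of (1/sigma) phi(r) / (1 - Phi(-mu/sigma)), using 1 - Phi(-a) = Phi(a)
            amount += (-0.5 * math.log(2 * math.pi) - math.log(sigma)
                       - 0.5 * r * r - math.log(STD.cdf(mu / sigma)))
        else:
            participation += math.log(STD.cdf(-index))
    return participation + amount


# Simulated data (assumed design): z = x, gamma = (0.3, 1), beta = (1, 1), sigma = 1
rng = random.Random(7)
data = []
for _ in range(5000):
    x = rng.uniform(-1, 1)
    if 0.3 + 1.0 * x + rng.gauss(0, 1) > 0:
        y = -1.0
        while y <= 0:  # rejection sampling from the normal truncated at zero
            y = 1.0 + 1.0 * x + rng.gauss(0, 1)
    else:
        y = 0.0
    data.append((x, x, y))

ll_true = hurdle_loglik(0.3, 1.0, 1.0, 1.0, 1.0, data)
ll_beta = hurdle_loglik(0.3, 1.0, 1.0, 0.0, 1.0, data)
ll_gamma = hurdle_loglik(0.3, 0.0, 1.0, 1.0, 1.0, data)
print(ll_true, ll_beta, ll_gamma)
```

Because the two accumulators depend on disjoint parameters, one could maximize them separately: a probit for $\gamma$ and a truncated regression on the $y_i>0$ sample for $\beta$ and $\sigma$.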
<p>The following lecture notes also discuss the log-normal hurdle model. The basic idea is similar to the one here, except for the specification of the error distribution:</p>
<p>
    http://legacy.iza.org/teaching/wooldridge-course-09/course_html/docs/slides_twopart_5_r1.pdf
</p>

# Exercise 
## Switching Regression model
Suppose that we have two models
$$y^*_{0i}= \beta_{0} x_{0i} + u_{0i}, \quad y^*_{1i}= \beta_{1} x_{1i} + u_{1i}$$
and $$
y_i= \begin{cases}
y^*_{0i},\quad z_i\gamma+v_i<0\\
y^*_{1i},\quad z_i\gamma+v_i>0\\
\end{cases}
$$
$y^*_{0i}$ and $y^*_{1i}$ are unobservable, but $y_i$ is observable. The conditional expectation of $y_i$ is therefore 
$$E(y|x_1,x_0,z)= E(y|x_1,x_0,z,z\gamma+v<0 )pr(z\gamma+v<0)+ E(y|x_1,x_0,z,z\gamma+v>0 )pr(z\gamma+v>0)$$
$$= \left[ x_0\beta_0 + E(u_0|v<-z \gamma) \right]pr(v<-z \gamma)+\left[ x_1\beta_1 + E(u_1|v>-z \gamma) \right]pr(v>-z \gamma)$$
This is a non-linear function of $z$, so a linear regression specification may not be a good idea. Let's again use MLE. It is easy to write down the likelihood function for each individual $i$:
$$pr(y_i|x_i, z_i)= pr(\beta_{0} x_{0i} + u_{0i}=y_i, z_i\gamma+v_i<0|x_i,z_i )+pr(\beta_{1} x_{1i} + u_{1i}=y_i, z_i\gamma+v_i>0|x_i,z_i )$$
$$=pr(v_i<-z_i\gamma | u_{0i}=y_i-\beta_{0} x_{0i},z_i)f_{u_{0}}(y_i-\beta_{0} x_{0i})+pr(v_i>-z_i\gamma | u_{1i}=y_i-\beta_{1} x_{1i},z_i)f_{u_{1}}(y_i-\beta_{1} x_{1i})$$
It is easy to get the exact expression for this likelihood if we know the joint distribution of $u_0,u_1,v$.

## Tobit Model with endogenous variable
Suppose that we have the following model:
$$y^*_{1i}=y_{2i}\beta +u_i$$
$$y_{2i}=z'_i \gamma + v_i$$
$$y_i = \begin{cases}
y^*_{1i},\quad w_i=1 \\
0, \quad w_i=0\\
\end{cases}
$$
$$
w_i = \begin{cases}
1,\quad y^*_{1i}>0 \\
0, \quad otherwise \\
\end{cases}
$$
in which $cov(u,v)\ne 0$. We also assume that 
$$
\left[\begin{array}{c}
            u \\
            v \\
        \end{array}\right] \sim
        N
        \left(
            \left[\begin{array}{c}
            0 \\
            0 \\
        \end{array}\right] ,
            \left[\begin{array}{cc}
            1 & \rho \sigma \\
            \rho \sigma & \sigma^2 \\
        \end{array}\right]
        \right)
$$
Therefore $y_{2}$ is endogenous if $\rho \ne 0$.
The likelihood function is straightforward to write down. For $w_i=1$, the likelihood is 
$$pr(y_1 = y_{1i}, y_2=y_{2i},w_i =1)=pr(y_1=y_{1i}| y_2=y_{2i}, w_i =1)pr(w_i=1 , y_2= y_{2i})$$
$$=pr(y_1=y_{1i}| y_2=y_{2i}, w_i =1)pr(u_i>-y_{2i}\beta , v_i= y_{2i}-z'_i \gamma)$$
We have
$$pr(y_1=y_{1i}| y_2=y_{2i}, w_i =1)=pr(y_1=y_{1i}| y_2=y_{2i}, y_{1i}>0)=\frac{f_{u|v}(y_{1i}-y_{2i}\beta \,|\, v_i=y_{2i}-z'_i \gamma) }{pr(u_i>-y_{2i}\beta \,|\, v_i=y_{2i}-z'_i \gamma)}$$
where $f_{u|v}$ is the density of $u_i$ conditional on $v_i$; under the joint normality above, $u_i$ given $v_i$ is $N(\rho v_i/\sigma, 1-\rho^2)$. The term $pr(u_i>-y_{2i}\beta , v_i= y_{2i}-z'_i \gamma)$ is also easy to handle since we already know the joint distribution of $u_i$ and $v_i$.
On the other hand, for $w_i=0$, we no longer observe the exact value of $y^*_1$. The likelihood is
$$pr(y_2=y_{2i},w_i =0)= pr( v_i= y_{2i}-z'_i \gamma,u_i\le -y_{2i}\beta ) $$