# <font color = firebrick>Tutorial 21: OLS Regression with Discrete Dependent Variables</font><a id='home'></a><br>

Welcome to this second tutorial in econometrics. In this session, we will explore how Ordinary Least Squares (OLS) regression can be applied when the dependent variable is discrete — that is, when it takes only a limited number of possible values (for example, 0 or 1).

So far, we have used OLS to analyze continuous outcomes, such as income or prices. Here, we move one step further and see what happens when our outcome variable represents a binary decision (e.g., “employed or not,” “buy or not,” “success or failure”). Although OLS can be applied to discrete dependent variables, it is not always the best tool for this type of data. Therefore, after understanding its limitations, we will introduce alternative models that are better suited for discrete choices, such as Probit and Logit regressions.

We will use the `statsmodels` package — a powerful Python library for estimating and testing statistical models. You can consult its documentation here: [(docs)](https://devdocs.io/statsmodels/). For a broader perspective on discrete choice models in statsmodels, you can refer to this [overview](http://www.statsmodels.org/0.6.1/examples/notebooks/generated/discrete_choice_overview.html). 

In this tutorial, we will cover:
1. [Mathematical basis](#math)
2. [Probit regression](#probit)
3. [Logit Regression](#logit)

# 1. Mathematical basis ([top](#home))<a id="math"></a>

Up to now, we have studied models where the **dependent variable is continuous** — for example, income, hours worked, or price. However, in many economic applications, the outcomes we want to explain are discrete: they describe choices or events rather than numerical quantities.

For instance:
- A person decides whether to participate in the labor market (work = 1, not work = 0);
- A consumer chooses whether or not to buy a product (buy = 1, not buy = 0);
- A survey respondent answers “yes” or “no.”

In these situations, the dependent variable $Y$ takes only two possible values: 0 or 1. We call such a variable binary.

### A simple model of binary decisions

We often imagine that behind each binary choice, there is an unobserved (or latent) variable $Y^*$ that represents the individual’s *propensity* or *inclination* to choose 1 rather than 0.

We write:

$$
Y^* = \beta X + \epsilon.
$$

where:

- $X$ is a vector of explanatory variables (for example, income, education, or age),
- $\beta$ is a vector of parameters showing how each variable affects the decision,
- and $\epsilon$ is an **error term** capturing unobserved factors (like motivation, preferences, or luck).

We do **not** observe $Y^*$ directly. What we do observe is:

$$
Y = 
\begin{cases} 
1 & \text{if }\; Y^* > 0, \\
0 & \text{if }\; Y^* \leq 0.
\end{cases}
$$  

It means that if the underlying tendency $Y^*$ is positive, the person chooses option 1 (e.g., works, buys, says yes). Otherwise, she chooses 0. We can express this relationship in probabilistic terms. The probability that $Y=1$ given $X$ is:

$$
P(Y = 1 \mid X) = P(Y^* > 0 \mid X) = P( \beta X + \epsilon >0 \mid X) = P(\epsilon > -\beta X \mid X).
$$

Now, we make an assumption about the distribution of the error term $\epsilon$. If we assume it follows a symmetric distribution centered around zero, such as the standard normal (for the Probit model) or the logistic distribution (for the Logit model), then we can compute these probabilities using the corresponding cumulative distribution function. Symmetry means that:
$$
P(\epsilon > -a) = P(\epsilon < a) \quad \text{for any } a,
$$

so that, setting $a = \beta X$,

$$
P(Y = 1 \mid X) = P(\epsilon < \beta X \mid X),
$$

which is simply the cumulative distribution function of $\epsilon$ evaluated at $\beta X$.
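The latent-variable argument can be checked with a small simulation. This is only a sketch under stated assumptions: the index value `beta_x` is made up for illustration, and the errors are drawn from a standard normal distribution. It compares the simulated share of $Y=1$ with the normal CDF evaluated at the index.

```python
import random
from statistics import NormalDist

random.seed(0)        # fixed seed so the run is reproducible
beta_x = 0.8          # hypothetical value of the index beta * X
n_draws = 200_000

# Draw the latent errors and apply the observation rule Y = 1 if Y* > 0.
eps = [random.gauss(0.0, 1.0) for _ in range(n_draws)]
share_y1 = sum(beta_x + e > 0 for e in eps) / n_draws    # P(beta*X + eps > 0)
share_below = sum(e < beta_x for e in eps) / n_draws     # P(eps < beta*X)
cdf_value = NormalDist().cdf(beta_x)                     # normal CDF at beta*X

print(f"simulated P(Y = 1):      {share_y1:.4f}")
print(f"simulated P(eps < bX):   {share_below:.4f}")
print(f"normal CDF at beta*X:    {cdf_value:.4f}")
```

The three numbers should agree up to simulation noise, which is the symmetry argument at work.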


### Why this matters

This formulation is essential since it connects economic reasoning (the idea of a latent decision process) with statistical modeling (using probabilities). Once we assume a distribution for $\epsilon$, we can estimate $\beta$ using data and predict probabilities such as:

<center>“Given a person’s age, income, and education, what is the probability that they participate in the labor force?”</center>

The Probit and Logit regressions are widely used tools that will make you more autonomous when analyzing databases involving binary or discrete outcomes.

# 2. Probit regression ([top](#home))<a id="probit"></a>

When we model a binary outcome (a variable that takes only the values 0 or 1), we often want to express the probability that the outcome equals 1. Our main question is then:
> **What is the probability that $Y=1$ for a given value of $X$?**

In the probit model, this probability comes from an assumption about the error term, denoted $\epsilon$.

### Understanding the probability $P(\epsilon < X\beta)$

From here on we write the index as $X\beta$; it is the same quantity as $\beta X$ above. The expression $P(\epsilon < X\beta)$ asks: what is the probability that the random error $\epsilon$ takes a value smaller than $X\beta$? This is exactly what a **cumulative distribution function (CDF)** measures: the probability that a random variable takes a value less than or equal to a given threshold. (For a continuous distribution, $<$ and $\leq$ give the same probability.)

In other words, $P(\epsilon < X\beta)$ is the CDF of the error term $\epsilon$ evaluated at the point $X\beta$. For example, if $\epsilon$ follows a standard normal distribution, the standard normal CDF evaluated at $X\beta$ gives this probability.

### Assumptions about the error term

To make the model workable, we assume that:
1. The error terms are **independently and identically distributed (iid)**: each $\epsilon_j$ is independent of the others and drawn from the same distribution.
2. The error term follows a **standard normal** distribution, which has mean 0 and variance 1 and is symmetric around 0. Its CDF $\Phi(z)=P(Z \leq z)$ gives the probability that a standard normal random variable is less than or equal to $z$.


### Probability of observing $Y = 1$

With these assumptions, the probability of observing $Y = 1$ conditional on $X$ becomes:
$$
P(Y=1 \mid X) = P(\epsilon < X\beta).
$$

Using the assumption that $\epsilon$ follows a standard normal distribution, we find:
$$
P(Y=1 \mid X) = P(\epsilon < X\beta) = \Phi(X\beta).
$$

This is the core of the probit model: the probability that $Y=1$ is given by the normal CDF evaluated at $X\beta$.

### Likelihood function
For each observation $Y_j$ (which is either 0 or 1), the probability of observing it is:

$$
\mathcal{L}(\beta; y_j, x_j)
= \Phi(x_j\beta)^{y_j} \, \big(1 - \Phi(x_j\beta) \big)^{1-y_j}.
$$

Here is the intuition:

- If $y_j = 1$, the term $\Phi(x_j\beta)$ is “activated” and contributes to the likelihood.
- If $y_j = 0$, the term $1 - \Phi(x_j\beta)$ is the one that matters.

So this compact expression simply says to use the probability of 1 when we observe a 1, and the probability of 0 when we observe a 0. Viewed as a function of $\beta$, this probability is called the **likelihood function**.

To estimate $\beta$, we multiply the likelihoods of all observations:
$$
\mathcal{L}(\beta; Y, X)
= \prod_{j=1}^J 
\Phi(x_j\beta)^{y_j} 
\big(1 - \Phi(x_j\beta)\big)^{1-y_j}.
$$


We then choose the values of $\beta$ that maximize this function. This is called **Maximum Likelihood Estimation (MLE)**. Even though the formula looks complicated, the idea is simple. We just want to pick the values of $\beta$ that make the observed data as likely as possible.
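To make the idea concrete, here is a minimal sketch of maximum likelihood done by hand. It is not how `statsmodels` implements probit; it only illustrates the principle. The data are simulated, the "true" parameters are made up, and the optimizer (`scipy.optimize.minimize` with BFGS) is one choice among many. The code works with the logarithm of the likelihood, which has the same maximizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated data from a probit model with hypothetical true parameters.
n = 5_000
true_beta = np.array([-0.5, 1.0])                      # intercept, slope (made up)
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant plus one regressor
y = (X @ true_beta + rng.normal(size=n) > 0).astype(float)

def neg_log_likelihood(beta):
    """Negative log of the product formula for the probit likelihood."""
    xb = X @ beta
    # norm.logcdf(-xb) equals log(1 - Phi(xb)) by symmetry of the normal.
    return -np.sum(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb))

# Minimizing the negative log-likelihood maximizes the likelihood.
result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
beta_hat = result.x

print("true beta:     ", true_beta)
print("estimated beta:", beta_hat.round(3))
```

With a few thousand observations the estimates should land close to the values used to simulate the data.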

### Clarification of key concepts
Before going further with the Probit model, let us take a moment to clarify a few important ideas we've used in this section.

1. **Cumulative Distribution Function (CDF):**
    The CDF of a random variable measures the probability that this variable takes a value *less than or equal to* some threshold. For example, if $Z$ follows a standard normal distribution, its CDF is denoted by $\Phi(z)$, and it represents:
   $$
   \Phi(z) = P(Z \leq z).
   $$
    In our model, when we write an expression such as $P(\varepsilon < X\beta)$, we are essentially evaluating the CDF of the error term at the point $X\beta$.

2. **Standard normal distribution:** This is a normal distribution with mean $0$ and variance $1$. It is widely used because it is symmetric and has well-known mathematical properties. When we assume that the error term $\varepsilon$ is standard normal, we can compute probabilities such as $P(\varepsilon < X\beta)$ simply by applying the standard normal CDF $\Phi(\cdot)$.

3. **Likelihood:** The likelihood tells us “how probable” our observed data are, given particular parameter values. In regression models, we choose the parameters $\beta$ that make the observed data *most likely*. This is the same logic as OLS, where we choose $\beta$ to minimize squared errors; here, we choose $\beta$ to maximize the likelihood instead.

4. **iid (Independently and Identically Distributed):** A sequence of random variables is said to be iid if:
   - all variables come from the **same distribution**, and  
   - no variable influences the others (**independence**).  
   In discrete choice models, we often assume the errors $\varepsilon_j$ are iid, which greatly simplifies the mathematics.


#### From the CDF to the Probit model

Recall that $P(\varepsilon < X\beta)$ is simply the CDF of $\varepsilon$ evaluated at $X\beta$. Now, suppose we assume that each error term is drawn independently from a standard normal distribution:
$$
\varepsilon \sim \mathcal{N}(0,1) \quad \text{iid}.
$$

Under this assumption, the probability of observing $Y=1$ given $X$ becomes:

$$
P(Y=1 \mid X) = \Phi(X\beta),
$$

where $\Phi(\cdot)$ is the CDF of the standard normal distribution.

#### Likelihood for a single observation

For a single observation $(y_j, x_j)$, the likelihood is:

$$
\mathcal{L}(\beta; y_j, x_j) 
= \Phi(x_j\beta)^{y_j} \, \big( 1 - \Phi(x_j\beta) \big)^{1 - y_j}.
$$

This expression simply says:

- If $y_j = 1$, we use the probability $\Phi(x_j\beta)$.
- If $y_j = 0$, we use the probability $1 - \Phi(x_j\beta)$.

The exponents ($y_j$ and $1 - y_j$) act as “switches,” activating the correct term.

#### Total likelihood and estimation

To estimate the full vector $\beta$, we multiply the likelihoods of all $J$ observations:

$$
\mathcal{L}(\beta; Y, X) 
= \prod_{j=1}^J 
\Phi(x_j\beta)^{y_j} 
\left( 1 - \Phi(x_j\beta) \right)^{1 - y_j}.
$$

Even though the formula looks intimidating, the idea is simple:

> **Just like OLS chooses $\beta$ to minimize errors, the Probit model chooses $\beta$ to maximize the likelihood of the data.**

Statistical software (such as `statsmodels`) carries out this optimization for us.
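One practical detail is worth a quick illustration. A product of many probabilities becomes too small for floating-point arithmetic, so estimation routines typically work with the sum of log-probabilities (the log-likelihood) instead; since the logarithm is increasing, both have the same maximizer. The sketch below uses a made-up sample in which every observation has probability 0.5 under the model.

```python
import math

# 2,000 hypothetical observations, each with probability 0.5 under the model.
probs = [0.5] * 2000

likelihood = math.prod(probs)                      # product of the probabilities
log_likelihood = sum(math.log(p) for p in probs)   # sum of the log-probabilities

print("likelihood:    ", likelihood)       # underflows in double precision
print("log-likelihood:", log_likelihood)   # stays a perfectly ordinary number
```

This is why the regression output later in this tutorial reports a "Log-Likelihood" rather than the likelihood itself.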

## An example: Gambling

When we're talking probability, there is no better example than gambling. Actually, gambling is the source (inspiration?) for a lot of the probability theory we have today. Since relativity and quantum mechanics use probability heavily, let's attribute that to gambling too.

The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is [here](http://fmwww.bc.edu/ec-p/data/wooldridge/pntsprd.des). We will use `favwin`, which is equal to 1 if the favored team won and 0 otherwise, and `spread`, which holds the betting spread. In this context, the spread is the number of points by which the favored team must beat the unfavored team for a bet on the favored team to count as a win.

In [None]:
import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting
import seaborn as sea                  # advanced plotting

import statsmodels.formula.api as smf  # provides a way to directly spec models from formulas

In [None]:
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('Tutorial_Python_20_OLS/PNTSPRD.DTA')

# Take a look...
vegas.head()

In [None]:
vegas.info()

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')

ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')

sea.despine(ax=ax)

### Estimation

We begin with the linear probability model. The model is 

$$\text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1 spread,$$

or, equivalently, $favwin = \beta_0 + \beta_1 spread + \epsilon$ with $E(\epsilon \mid spread) = 0$.

There is nothing new here technique-wise. Let's start with OLS, which amounts to treating the binary Y variable as if it were continuous.

In [None]:
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()

print(res_ols.summary())

### Hypothesis testing with t-test
If bookies were all-knowing, the spread would **exactly** account for the predictable winning probability and all we would be left with is the noise. A game with a spread of zero would then be a coin toss, so the intercept should be one-half. Is it true in the data? We can use the `t_test()` method of the results object to perform t-tests. 

The null hypothesis is $H_0: \beta_0 = 0.5$ and the alternative hypothesis is $H_1: \beta_0 \neq 0.5$.

In [None]:
t_test = res_ols.t_test('Intercept = 0.5')
print(t_test)
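For reference, the statistic behind this test is the usual t-ratio, centered at the hypothesized value of one-half rather than at zero:

$$
t = \frac{\hat\beta_0 - 0.5}{\text{se}(\hat\beta_0)},
$$

where $\hat\beta_0$ is the estimated intercept and $\text{se}(\hat\beta_0)$ its standard error, both taken from the OLS summary above. We reject $H_0$ when $|t|$ is large relative to the critical value of the t distribution.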

Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one! 
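The out-of-range problem is easy to reproduce. The sketch below does not use the Vegas file: it generates synthetic binary data from a made-up process in which the true probability rises with `x`, fits a straight line by least squares, and counts the fitted values that fall outside the unit interval.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (not the Vegas file): the true win probability rises with x.
x = rng.uniform(0, 30, size=2_000)
true_prob = 1 / (1 + np.exp(-(0.1 + 0.2 * x)))   # hypothetical data-generating process
y = (rng.uniform(size=x.size) < true_prob).astype(float)

# Linear probability model by least squares: y = b0 + b1 * x.
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = b0 + b1 * x

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("fitted 'probabilities' above 1:", int((fitted > 1).sum()))
print("fitted 'probabilities' below 0:", int((fitted < 0).sum()))
```

Because the straight line keeps rising while the true probability flattens out near one, the fitted values at large `x` overshoot.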

We can use the `fittedvalues` attribute of the results object to recover the fitted values of the y variable. Let's plot them and take a look. 

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], res_ols.fittedvalues,  facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')

sea.despine(ax=ax, trim=True)

Now, let's account for the discreteness and estimate with probit.

In [None]:
res_probit = smf.probit('favwin ~ spread', data=vegas).fit()
print(res_probit.summary())

Notice the top: "Optimization terminated successfully..." That's because with probit there is no analytical solution like there is with OLS. Instead, the computer has to maximize the likelihood function numerically: it starts from an initial guess for $\beta$ and then iterates, using the derivatives of the likelihood to decide where to move next.

The coefficients are very different. Just look at the intercept! That's in large part because the coefficients have a different meaning in a probabilistic model: they shift the index $X\beta$, not the probability itself. In order to determine the effect on Y, we have to run the coefficient through the distributional assumption, here Normal. When we do this, we call the results 'marginal effects.' The math is pretty straightforward, but recovering marginal effects is standard stuff, so there's a method for that:

In [None]:
margeff = res_probit.get_margeff('mean')
print(margeff.summary())

Okay, so a unit increase in the spread is associated with a (statistically significant) 2.5 percentage point increase in the probability the favored team wins. Makes sense: otherwise those bright, shiny Vegas lights wouldn't be so shiny.

Note that the marginal effect calculation required us to take a stand on where to evaluate the derivative. In a linear model like OLS, the derivative is just the coefficient, which is constant. Here, the model is non-linear (because of the Normal CDF), so the derivative changes depending on where we evaluate it. Evaluating at the average is standard, though skewed data might make the median more reasonable.
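For the probit model, the derivative of $\Phi(\beta_0 + \beta_1 x)$ with respect to $x$ is $\phi(\beta_0 + \beta_1 x)\,\beta_1$, where $\phi$ is the standard normal density. The sketch below shows how much the answer depends on the evaluation point. The coefficients and the sample of spreads are hypothetical, chosen only for illustration; they are not the estimates from the Vegas data.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

# Hypothetical probit estimates, for illustration only (not the Vegas results).
b0, b1 = -0.2, 0.1
spreads = [1.0, 3.0, 5.0, 8.0, 20.0]   # made-up, right-skewed sample of spreads

def marginal_effect(x):
    """Derivative of Phi(b0 + b1*x) with respect to x: phi(b0 + b1*x) * b1."""
    return std_normal.pdf(b0 + b1 * x) * b1

mean_spread = sum(spreads) / len(spreads)
effect_at_mean = marginal_effect(mean_spread)                             # effect at the average x
average_effect = sum(marginal_effect(x) for x in spreads) / len(spreads)  # average of the effects

print(f"effect at the mean spread: {effect_at_mean:.4f}")
print(f"average of the effects:    {average_effect:.4f}")
```

The two conventions give different numbers. To my knowledge they correspond to the `at='mean'` and `at='overall'` options of `get_margeff`, and `at='median'` is also available.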

Let's take a look at the marginal effects at different points in the data. Note that the reported marginal effect above is located at the intersection of the marginal effects plot and the vertical dashed line indicating the average spread.

In [None]:
from scipy.stats import norm # import functions related to the normal distribution

y = norm.pdf(res_probit.fittedvalues,0,1)*res_probit.params.spread

fig, ax = plt.subplots(figsize=(15,6))

avg_spread = np.mean(vegas['spread'])

# Create the marginal effects
ax.scatter(vegas['spread'],y, color='black', label = 'marg. effects')

ax.set_ylabel('estimated marginal effect')
ax.set_xlabel('point spread')
ax.set_title('plotting marginal effects')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.axvline(x=avg_spread, color='red', linestyle='--')
ax.text(avg_spread+.5,0.035,'Average Spread',fontsize=14)
ax.set_ylim([-1e-3,0.04])

plt.show()

Let's look at the predicted values. In OLS this was easy. Here, the `fittedvalues` attribute holds the linear index $X\hat\beta$ rather than a probability, so we run it through the standard Normal CDF ourselves. (Calling `res_probit.predict()` with no arguments should return the same probabilities directly.)

In [None]:
pred_probit = norm.cdf(res_probit.fittedvalues,0,1)  # Standard Normal (ie, mean = 0, stdev = 1)

Plot the estimated probability of the favored team winning and the actual data. 

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], pred_probit,  facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'],  facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')

# Create the line of best fit to plot
p = res_ols.params                            # params from the OLS model linear probability model
x = range(0,35)                               # some x data
y = [p.Intercept + p.spread*i for i in x]     # apply the coefficients 
ax.plot(x,y, color='black', label = 'linear prob.')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a probit model')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)

# 3. Logistic Regression (Logit) ([top](#home))<a id="logit"></a>

Our framework is actually pretty flexible, so we can use different distributions. The other popular distributional assumption is that the $\epsilon$ errors come from a Logistic distribution. Why Logistic? Because the result is a nice simple function for the probability:

$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$

and we predict a team wins whenever $\text{prob} \ge 0.5$. As above, the computer chooses $\beta$ to best fit the data.

In [None]:
X = np.arange(-8, 8, 0.1)   # grid of values for the index
    
# Determine Y
Y = 1/(1+np.exp(-X))

# Create Figure
fig, ax = plt.subplots(figsize=(15,8))

ax.axhline(y=0.5, color='red',linewidth=1,ls='--')

ax.annotate('Class One: Y=1',xy=(-6,.6),va='center',ha='left',size=18)
ax.annotate('Class Two: Y=0',xy=(-6,.4),va='center',ha='left',size=18)
ax.axhspan(.5, 1, alpha=0.2, color='blue')

ax.plot(X,Y, color = 'black')

ax.set_ylim(0,1)
ax.set_yticks(np.arange(0, 1.01, step=0.1))
ax.set_xlim(-8,8)
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Logistic Function and Decision Rule',size=20)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()
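The plotting code writes the logistic function as $1/(1+e^{-x})$, while the formula above writes it as $e^{x}/(1+e^{x})$. The two are the same function (divide numerator and denominator by $e^{x}$). The short check below, using a handful of made-up index values, also shows that the decision rule $\text{prob} \ge 0.5$ is equivalent to asking whether the index is non-negative.

```python
import math

def logistic_ratio(z):
    """exp(z) / (1 + exp(z)), the form used in the formula."""
    return math.exp(z) / (1 + math.exp(z))

def logistic_inverse(z):
    """1 / (1 + exp(-z)), the form used in the plotting code."""
    return 1 / (1 + math.exp(-z))

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:     # made-up index values
    prob = logistic_ratio(z)
    predicted_class = int(prob >= 0.5)    # same as checking z >= 0
    print(f"z = {z:+.1f}  prob = {prob:.3f}  predicted Y = {predicted_class}")
```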

We estimate the logit model with the `logit()` function from `smf`, in the same way as `probit`.  

In [None]:
res_logit = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_logit.summary())

Again, interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression

$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)}$$

To get marginal effects, we differentiate the non-linear function above with respect to `spread` and evaluate the derivative at $X\hat\beta$. Let's take a look:

In [None]:
margeff = res_logit.get_margeff('mean')
print(margeff.summary())
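For the logistic function the derivative has a compact form: if $\text{prob} = e^{z}/(1+e^{z})$ with $z = \beta_0 + \beta_1 spread$, then $\partial\, \text{prob} / \partial\, spread = \text{prob}\,(1-\text{prob})\,\beta_1$. The sketch below checks this against a numerical derivative. The coefficients and the evaluation point are hypothetical, not the estimates printed above.

```python
import math

# Hypothetical logit estimates, for illustration only (not the Vegas results).
b0, b1 = -0.3, 0.15

def prob(x):
    """Logistic probability at x."""
    z = b0 + b1 * x
    return 1 / (1 + math.exp(-z))

def marginal_effect(x):
    """Analytical derivative of prob(x): p * (1 - p) * b1."""
    p = prob(x)
    return p * (1 - p) * b1

x0 = 5.0      # made-up evaluation point
step = 1e-5
numerical = (prob(x0 + step) - prob(x0 - step)) / (2 * step)   # central difference

print(f"analytical: {marginal_effect(x0):.6f}")
print(f"numerical:  {numerical:.6f}")
```

Since $\text{prob}(1-\text{prob})$ is at most one-quarter, the marginal effect in a logit model is never larger than $\beta_1/4$ in absolute value.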

Let's again plot the estimated probability of the favored team winning and the actual data, but now let's compare the implications of our distributional assumptions. First, generate predicted values using `numpy` and the above expression for the probability.

In [None]:
pred_logit = np.exp(res_logit.fittedvalues) /( 1+np.exp(res_logit.fittedvalues) )
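The expression above works fine for index values of moderate size, as in this dataset. As a side note, a more robust alternative is `scipy.special.expit`, which computes the same logistic function without overflowing. The sketch below uses made-up index values, two of them deliberately extreme, to show the difference.

```python
import numpy as np
from scipy.special import expit   # expit(z) = 1 / (1 + exp(-z))

# Made-up index values; the two extreme ones are there to provoke overflow.
z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / (1 + np.exp(z))   # exp(800) overflows to inf, and inf/inf is nan

stable = expit(z)                         # stays within [0, 1] for every input

print("naive: ", naive)
print("stable:", stable)
```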

Now, plot probit vs logit:

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], pred_logit,  facecolors='none', edgecolors='red', label='predicted-logit')
ax.scatter(vegas['spread'], pred_probit,  facecolors='none', edgecolors='black', label='predicted-probit')
ax.scatter(vegas['spread'], vegas['favwin'],  facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')

# Create the line of best fit to plot
p = res_ols.params                            # params from the OLS model linear probability model
x = range(0,35)                               # some x data
y = [p.Intercept + p.spread*i for i in x]     # apply the coefficients 
ax.plot(x,y, color='black', label = 'linear prob.')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from logit and probit models')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)

We observe that the probit and logit models are nearly on top of each other. That's a common occurrence. In practice, the models are often interchangeable and the practitioner will choose one over the other because in their setting one may have some slightly better properties (e.g., more intuitive interpretation of the marginal effects). 
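The similarity is not specific to this dataset. Once the index is rescaled, the standard normal CDF and the logistic function are very close everywhere. The sketch below measures the largest gap between the two curves on a grid, using the commonly quoted rescaling constant 1.702; that constant is an approximation brought in for illustration, not something estimated here.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
scale = 1.702   # commonly quoted rescaling constant (an approximation)

grid = [i / 100 for i in range(-500, 501)]   # index values from -5 to 5

# Absolute difference between the normal CDF and the rescaled logistic function.
gaps = [abs(std_normal.cdf(z) - 1 / (1 + math.exp(-scale * z))) for z in grid]

print(f"largest gap between the two curves: {max(gaps):.4f}")
```

The gap stays below one percentage point across the whole grid. The rescaling is also a plausible reason why logit coefficients tend to come out larger in magnitude than probit coefficients on the same data, even when the predicted probabilities almost coincide.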

## <font color='red'>Practice</font>

1. Load the data 'apple.dta'. The data dictionary can be found [here](http://fmwww.bc.edu/ec-p/data/wooldridge/apple.des). The variable `ecolbs` is purchases of eco-friendly apples (whatever that means).  

2. Create a variable named `ecobuy` that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs>0) and 0 otherwise.

3. Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics. 

$$\text{ecobuy} = \beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} +  \epsilon$$

4. How many estimated probabilities are negative? How many are greater than one?

5. Now estimate the model as a probit; i.e., <br><br>
$$\text{Pr}(\text{ecobuy}=1 \mid X) = \Phi \left(\beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} \right),$$<br>where $\Phi(\cdot)$ is the CDF of the standard normal distribution.

6. Compute the **marginal effects** of the coefficients at **the means** and print them out using `summary()`. Interpret the results.

7. Re-estimate the model as a logit model. 

8. Compute the marginal effects of the logit coefficients at the averages in the data.

9. We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data. Create a pandas DataFrame with the row index  'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age'. The columns should be labeled 'logit', 'probit', and 'ols'. The columns should contain the marginal effects for the logit and probit models and the coefficients from the ols model.