# <font color = firebrick>Lecture 21: OLS Regression with Discrete Dependent Variables</font><a id='home'></a><br>
    
We continue to learn about the statsmodels package [(docs)](https://devdocs.io/statsmodels/), which provides functions for formulating and estimating statistical models. In this notebook we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize). At the end of the lecture, we extend the analysis to dependent variables with many discrete values.  

[Here](http://www.statsmodels.org/0.6.1/examples/notebooks/generated/discrete_choice_overview.html) is a nice overview of the discrete choice models in statsmodels. 

The agenda for today's lecture is as follows:

1. [Math Primer](#math)


2. [Probit Regression](#probit)


3. [Logit Regression](#logit)


## <font color=orange>Class Announcements</font> 

None.

# 1. Math Primer ([top](#home))<a id="math"></a>

So far we've been dealing with continuous dependent (i.e., LHS) variables such as hours worked. A lot of outcomes we observe and are interested in are not continuous, however. For example, labor force participation in the United States is roughly 65%, so the choice of whether to work or not appears to be a significant one. 

Suppose our dependent variable Y is binary (i.e., zero or one). For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X which we think influence Y. As before, suppose we also have an error term $\epsilon$ which is drawn from some distribution. Define $Y^*$ as some latent (i.e., we can't actually observe it) variable where 

$$
Y^*= X\beta + \epsilon
$$

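and we think $Y=1$ whenever $Y^*>0$, or equivalently whenever $X\beta + \epsilon>0$. Define $P(Y=1|X)$ as the 'probability Y is equal to one conditional on the variables X.' If the distribution of $\epsilon$ is symmetric around zero (so that $-\epsilon$ has the same distribution as $\epsilon$), it follows that

$$
P(Y=1|X)= P(Y^*>0) = P(\epsilon > -X\beta)
$$
$$
= P(\epsilon < X\beta)
$$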
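
To make the latent-variable story concrete, here is a small simulation sketch. The coefficients, the regressor value, and the choice of standard Normal errors are all assumptions made up for illustration; nothing here comes from real data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical coefficients for illustration: intercept and one slope.
beta = np.array([-0.5, 0.8])

# Evaluate the model at one fixed regressor value so P(Y=1|X) is a single number.
x = np.array([1.0, 1.5])           # [constant, regressor]
xb = x @ beta                      # linear index X*beta

# Draw latent errors, build Y* and the observed binary outcome Y.
eps = rng.standard_normal(200_000)
y_star = xb + eps
y = (y_star > 0).astype(int)

share_ones = y.mean()              # simulated P(Y=1|X)
cdf_value = norm.cdf(xb)           # P(eps < X*beta) under standard Normal errors

print(f"simulated share of Y=1: {share_ones:.4f}")
print(f"P(eps < X*beta):        {cdf_value:.4f}")
```

The two printed numbers should be close: the share of simulated draws with $Y^*>0$ approximates $P(\epsilon < X\beta)$.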

# 2. Probit Regression ([top](#home))<a id="probit"></a>

Note that $P(\epsilon < X\beta)$ is the definition of a CDF. Suppose we specify that $\epsilon$ is drawn iid from a standard Normal distribution. With this added assumption, we can do a lot more:

$$
P(Y=1|X)= \Phi(X\beta)
$$

where $\Phi()$ is the CDF for the standard Normal distribution. The *likelihood* of a single observation ($y_j=1$ or $y_j=0$) is therefore

$$
\mathcal{L}(\beta;y_j,x_j)= \Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}.
$$

The first part ($\Phi(x_j\beta)^{y_j}$) gets turned on when $y_j=1$ while the second part gets turned on when $y_j=0$. Treating the $J$ observations as independent, the likelihood of the whole sample is the product of the individual likelihoods, and we solve for the $\beta$ vector that best matches the data by maximizing this 'likelihood' function; i.e., 

$$
\mathcal{L}(\beta;Y,X)= \prod_{j=1}^J\Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}
$$

This looks complicated but it's really just a simple maximization problem like OLS.
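
As a rough sketch of that maximization, the block below builds the probit log-likelihood by hand and hands it to a generic optimizer. Taking logs turns the product into a sum without moving the maximizer. The data are synthetic and `true_beta` is a made-up value; this is not how statsmodels is implemented, just one way to see the idea.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Synthetic data from a probit model with hypothetical true coefficients.
n = 5_000
true_beta = np.array([0.2, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (X @ true_beta + rng.standard_normal(n) > 0).astype(float)

def neg_log_likelihood(beta):
    """Negative log of the probit likelihood, summed over observations."""
    xb = X @ beta
    # logcdf is numerically safer than log(cdf); 1 - Phi(z) equals Phi(-z).
    return -np.sum(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb))

# Minimizing the negative log-likelihood is the same as maximizing the likelihood.
result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
beta_hat = result.x
print("estimated beta:", beta_hat.round(3))
```

With a few thousand observations the estimates should land near the coefficients used to simulate the data.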

## An Example: Gambling

When we're talking probability, there is no better example than gambling. Actually, gambling is the source (inspiration?) for a lot of the probability theory we have today. Since relativity and quantum mechanics use probability heavily, let's attribute that to gambling too.

The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is [here](http://fmwww.bc.edu/ec-p/data/wooldridge/pntsprd.des). We will use `favwin`, which is equal to 1 if the favored team won and zero otherwise, and `spread`, which holds the betting spread. In this context, a spread is the number of points by which the favored team must beat the unfavored team in order to be counted as a win for the favored team.

In [None]:
import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting
import seaborn as sea                  # advanced plotting

import statsmodels.formula.api as smf  # provides a way to directly spec models from formulas

In [None]:
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('./Data/pntsprd.dta')

# Take a look...so clean!
vegas.head()

In [None]:
vegas.info()

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')

ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')

sea.despine(ax=ax)

### Estimation

We begin with the linear probability model. The model is 

$$\text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1 spread .$$

There is nothing new here technique-wise. Let's start with OLS which is like pretending the Y variable is continuous.

In [None]:
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()

print(res_ols.summary())

### Hypothesis testing with t-test
If bookies were all-knowing, the spread would **exactly** account for the predictable winning probability and all we would be left with is the noise --- the intercept should be one-half. Is it true in the data? We can use the `t_test( )` method of the results object to perform t-tests. 

The null hypothesis is $H_0: \beta_0 = 0.5$ and the alternative hypothesis is $H_1: \beta_0 \neq 0.5$.

In [None]:
t_test = res_ols.t_test('Intercept = 0.5')
print(t_test)
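
For intuition, the same kind of test can be sketched by hand. The block below uses made-up data (not the Vegas file) and assumes the classical, non-robust OLS covariance matrix; the t statistic is just the distance between the estimated intercept and 0.5, measured in standard errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-in for the betting data (values are made up).
n = 500
spread = rng.uniform(0, 20, size=n)
favwin = (rng.uniform(size=n) < np.clip(0.5 + 0.02 * spread, 0, 1)).astype(float)

# OLS by least squares: columns are [constant, spread].
X = np.column_stack([np.ones(n), spread])
b, *_ = np.linalg.lstsq(X, favwin, rcond=None)

# Classical (homoskedastic) covariance matrix of the OLS estimates.
resid = favwin - X @ b
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X.T @ X)
se_intercept = np.sqrt(cov[0, 0])

# t statistic for H0: intercept = 0.5 and its two-sided p-value.
t_stat = (b[0] - 0.5) / se_intercept
p_value = 2 * stats.t.sf(abs(t_stat), df=dof)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```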

Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one! 

We can use the `fittedvalues` attribute of the results object to recover the fitted values of the y variable. Let's plot them and take a look. 

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], res_ols.fittedvalues,  facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')

sea.despine(ax=ax, trim=True)
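
The out-of-range problem is easy to reproduce on synthetic data. In this sketch the "true" win probability is an invented curve that rises with the spread but never reaches one; a straight line fit to the resulting zeros and ones still overshoots at large spreads.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: the true win probability rises with the spread but stays below 1.
n = 2_000
spread = rng.uniform(0, 30, size=n)
true_prob = 1 - 0.5 * np.exp(-0.15 * spread)
favwin = (rng.uniform(size=n) < true_prob).astype(float)

# Fit the linear probability model by least squares.
X = np.column_stack([np.ones(n), spread])
b, *_ = np.linalg.lstsq(X, favwin, rcond=None)
fitted = X @ b

# Count fitted "probabilities" that fall outside the unit interval.
n_above_one = int(np.sum(fitted > 1))
n_below_zero = int(np.sum(fitted < 0))
print(f"intercept = {b[0]:.3f}, slope = {b[1]:.4f}")
print(f"fitted values above 1: {n_above_one}, below 0: {n_below_zero}")
```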

Now, let's account for the discreteness and estimate with probit.

In [None]:
res_probit = smf.probit('favwin ~ spread', data=vegas).fit()
print(res_probit.summary())

Notice the top: "Optimization terminated successfully..." That's because with probit there is no analytical solution like there is with OLS. Instead, the computer has to maximize the likelihood function by taking a guess for an initial $\beta$ and then iterating using calculus to make smart choices.

The coefficients are very different. Just look at the intercept! That's in large part b/c the coefficients have a different meaning in a probabilistic model. In order to determine the effect on Y, we have to run the coefficient through the distributional assumption, here Normal. When we do this, we call the results 'marginal effects.' The math is pretty straightforward -- but then again recovering marginal effects is standard stuff so there's a method for that:

In [None]:
margeff = res_probit.get_margeff('mean')
print(margeff.summary())

Okay, so a one-point increase in the spread is associated with a (statistically significant) 2.5 percentage point increase in the probability the favored team wins. Makes sense -- otherwise those bright, shiny Vegas lights wouldn't be so shiny.

Note that the marginal effect calculation required us to take a stand on where to evaluate the derivative. In a linear model like OLS, the derivative is just the coefficients and those are constant. Here, the model is non-linear (b/c of the Normal distribution) so the derivative changes depending on the point we choose. The average is the standard choice, though skewed data might make the median more reasonable.
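
Here is a minimal sketch of how the choice of evaluation point changes the answer. The coefficients `b0`, `b1` and the `spread` values are hypothetical, chosen only for illustration. For a probit with one regressor, the marginal effect at $x$ is $\phi(\beta_0 + \beta_1 x)\,\beta_1$, where $\phi$ is the standard Normal density.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical probit estimates and regressor values, for illustration only.
b0, b1 = -0.01, 0.09
spread = np.array([1.0, 2.5, 4.0, 6.0, 7.5, 10.0, 14.0, 21.0])

def marginal_effect(x):
    """Derivative of Phi(b0 + b1*x) with respect to x: phi(b0 + b1*x) * b1."""
    return norm.pdf(b0 + b1 * x) * b1

me_at_mean = marginal_effect(spread.mean())        # evaluate at the average x
me_at_median = marginal_effect(np.median(spread))  # evaluate at the median x
avg_me = marginal_effect(spread).mean()            # average of individual effects

# Sanity check against a numerical derivative of the probability.
h = 1e-6
x0 = spread.mean()
numeric = (norm.cdf(b0 + b1 * (x0 + h)) - norm.cdf(b0 + b1 * (x0 - h))) / (2 * h)

print(f"at mean:   {me_at_mean:.4f}")
print(f"at median: {me_at_median:.4f}")
print(f"average:   {avg_me:.4f}")
```

The three numbers differ because the made-up `spread` values are skewed to the right. In statsmodels, the `at` argument of `get_margeff` selects the evaluation point (for example `'mean'`, `'median'`, or `'overall'` for the average of the individual effects).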

Let's take a look at the marginal effects at different points in the data. Note that the reported marginal effect above is located at the intersection of the marginal effects plot and the vertical dashed line indicating the average spread.

In [None]:
from scipy.stats import norm # import functions related to the normal distribution

y = norm.pdf(res_probit.fittedvalues,0,1)*res_probit.params.spread

fig, ax = plt.subplots(figsize=(15,6))

avg_spread = np.mean(vegas['spread'])

# Create the marginal effects
ax.scatter(vegas['spread'],y, color='black', label = 'marg. effects')

ax.set_ylabel('estimated marginal effect')
ax.set_xlabel('point spread')
ax.set_title('plotting marginal effects')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.axvline(x=avg_spread, color='red', linestyle='--')
ax.text(avg_spread+.5,0.035,'Average Spread',fontsize=14)
ax.set_ylim([-1e-3,0.04])

plt.show()

Let's look at the predicted values. In OLS this was easy. Here, things are a bit more complicated: for a probit, the `fittedvalues` attribute holds the linear index $X\hat\beta$ rather than a probability, so we run it through the standard Normal CDF ourselves. 

In [None]:
pred_probit = norm.cdf(res_probit.fittedvalues,0,1)  # Standard Normal (ie, mean = 0, stdev = 1)
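
As a cross-check on the hand-rolled calculation, the sketch below fits a probit on a synthetic stand-in for the betting data (the DataFrame `demo` and its values are made up) and compares the CDF-transformed `fittedvalues` with what the results object's `predict()` method returns in sample.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm

rng = np.random.default_rng(4)

# Synthetic stand-in for the betting data (values are made up).
n = 400
demo = pd.DataFrame({"spread": rng.uniform(0, 20, size=n)})
latent = 0.1 * demo["spread"] + rng.standard_normal(n)
demo["favwin"] = (latent > 0).astype(int)

res = smf.probit("favwin ~ spread", data=demo).fit(disp=False)

by_hand = norm.cdf(res.fittedvalues)   # Phi applied to the linear index X*beta_hat
built_in = res.predict()               # in-sample predicted probabilities

print("largest difference:", np.max(np.abs(by_hand - np.asarray(built_in))))
```

If the two agree, `predict()` is a convenient shortcut for the same calculation.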

Plot the estimated probability of the favored team winning and the actual data. 

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], pred_probit,  facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'],  facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')

# Create the line of best fit to plot
p = res_ols.params                            # params from the OLS linear probability model
x = range(0,35)                               # some x data
y = [p.Intercept + p.spread*i for i in x]     # apply the coefficients 
ax.plot(x,y, color='black', label = 'linear prob.')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a probit model')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)

# 3. Logistic Regression (aka Logit) ([top](#home))<a id="logit"></a>

Our framework is actually pretty flexible so we can use different distributions. The other popular distributional assumption is to assume the $\epsilon$ errors come from a Logistic distribution. Why Logistic? Because the result is a nice simple function for the probability:

$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$

and we predict a team wins whenever $\text{prob} \ge 0.5$. As above, the computer chooses $\beta$ to best fit the data.

In [None]:
X = np.arange(-8, 8, 0.1)
    
# Determine Y
Y = 1/(1+np.exp(-X))

# Create Figure
fig, ax = plt.subplots(figsize=(15,8))

ax.axhline(y=0.5, color='red',linewidth=1,ls='--')

ax.annotate('Class One: Y=1',xy=(-6,.6),va='center',ha='left',size=18)
ax.annotate('Class Two: Y=0',xy=(-6,.4),va='center',ha='left',size=18)
ax.axhspan(.5, 1, alpha=0.2, color='blue')

ax.plot(X,Y, color = 'black')

ax.set_ylim(0,1)
ax.set_yticks(np.arange(0, 1.01, step=0.1))
ax.set_xlim(-8,8)
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Logistic Function and Decision Rule',size=20)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()
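
A quick numerical sketch of the function and decision rule plotted above. The grid `z` is arbitrary; `expit` from `scipy.special` is the logistic function and is used here only as a convenient, overflow-safe alternative to typing the formula out.

```python
import numpy as np
from scipy.special import expit

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # a few values of the linear index

# Two algebraically identical ways to write the logistic function.
prob_a = np.exp(z) / (1 + np.exp(z))
prob_b = 1 / (1 + np.exp(-z))

# scipy's expit computes the same function and avoids overflow for large |z|.
prob_c = expit(z)

# Decision rule: predict Y = 1 when prob >= 0.5, which is the same as z >= 0.
predict_from_prob = prob_c >= 0.5
predict_from_index = z >= 0

print(prob_c.round(3))
print(predict_from_prob)
```

Because the logistic function is increasing and equals one-half at zero, thresholding the probability at 0.5 is the same as checking the sign of the linear index.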

We estimate the logit model with the `logit( )` function from `smf`, in the same way we used `probit( )`.  

In [None]:
res_logit = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_logit.summary())
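
The "Optimization terminated successfully" message shows up again because the logit likelihood is also maximized iteratively. Below is a bare-bones sketch of one such scheme, Newton's method, on synthetic data with made-up `true_beta`. It illustrates the idea of guess-and-update and is not a copy of what statsmodels does internally.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic logit data with hypothetical true coefficients.
n = 5_000
true_beta = np.array([-0.3, 0.7])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1 / (1 + np.exp(-(X @ true_beta)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(2)                       # initial guess
for iteration in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))    # current predicted probabilities
    gradient = X.T @ (y - p)             # first derivative of the log-likelihood
    hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                   # Newton update
    if np.max(np.abs(step)) < 1e-10:     # stop once the update is negligible
        break

print("iterations used:", iteration + 1)
print("estimated beta:", beta.round(3))
```

Each pass uses the slope and curvature of the log-likelihood at the current guess to pick the next one, which is the "iterating using calculus" idea in action.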

Again, interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression

$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)}$$

To get the marginal effects, we hammer $X\hat\beta$ through the derivative of the above non-linear function. Let's take a look:

In [None]:
margeff = res_logit.get_margeff('mean')
print(margeff.summary())
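
For the logit, the derivative of the probability with respect to the regressor has a tidy form: $\text{prob}\,(1-\text{prob})\,\beta_1$. The sketch below checks that formula against a numerical derivative; the coefficients and the evaluation point `x0` are hypothetical.

```python
import numpy as np

# Hypothetical logit estimates, for illustration only.
b0, b1 = -0.02, 0.16

def logistic(z):
    """Logistic function exp(z) / (1 + exp(z)), written in its equivalent form."""
    return 1 / (1 + np.exp(-z))

def logit_marginal_effect(x):
    """Derivative of the logistic probability with respect to x: p * (1 - p) * b1."""
    p = logistic(b0 + b1 * x)
    return p * (1 - p) * b1

x0 = 5.0                                  # made-up evaluation point
analytic = logit_marginal_effect(x0)

# Central finite difference as a check on the formula.
h = 1e-6
numeric = (logistic(b0 + b1 * (x0 + h)) - logistic(b0 + b1 * (x0 - h))) / (2 * h)

print(f"analytic: {analytic:.5f}, numeric: {numeric:.5f}")
```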

Let's again plot the estimated probability of the favored team winning and the actual data, but now let's compare the implications of our distributional assumptions. First, generate predicted values using `numpy` and the above expression for the probability.

In [None]:
pred_logit = np.exp(res_logit.fittedvalues) /( 1+np.exp(res_logit.fittedvalues) )

Now, plot probit vs logit:

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], pred_logit,  facecolors='none', edgecolors='red', label='predicted-logit')
ax.scatter(vegas['spread'], pred_probit,  facecolors='none', edgecolors='black', label='predicted-probit')
ax.scatter(vegas['spread'], vegas['favwin'],  facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')

# Create the line of best fit to plot
p = res_ols.params                            # params from the OLS linear probability model
x = range(0,35)                               # some x data
y = [p.Intercept + p.spread*i for i in x]     # apply the coefficients 
ax.plot(x,y, color='black', label = 'linear prob.')

ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from logit and probit models')

ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)

We observe that the probit and logit models are nearly on top of each other. That's a common occurrence. In practice, the models are often interchangeable and the practitioner will choose one over the other because in their setting one may have some slightly better properties (e.g., more intuitive interpretation of the marginal effects). 
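
One way to see why the two models tend to agree is to compare the two CDFs directly. In this sketch the logistic index is rescaled by 1.7, a common rule-of-thumb factor (an assumption here, not something estimated from the data) that roughly matches the steepness of the two curves.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

z = np.linspace(-6, 6, 2_401)

probit_prob = norm.cdf(z)
# Rescale the index by about 1.7 so both curves have similar steepness; the
# scale factor is a common rule of thumb, not something estimated here.
logit_prob = expit(1.7 * z)

max_gap = np.max(np.abs(probit_prob - logit_prob))
print(f"largest gap between the two curves: {max_gap:.4f}")
```

The gap stays small across the whole grid, which is also why logit coefficients usually come out larger than probit coefficients by a roughly constant factor while the predicted probabilities barely move.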

## <font color='red'>Practice</font>

1. Load the data 'apple.dta'. The data dictionary can be found [here](http://fmwww.bc.edu/ec-p/data/wooldridge/apple.des). The variable `ecolbs` is purchases of eco-friendly apples (whatever that means).  

2. Create a variable named `ecobuy` that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs>0).

3. Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics. 

$$\text{ecobuy} = \beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} +  \epsilon$$

4. How many estimated probabilities are negative? Are greater than one?

5. Now estimate the model as a probit; i.e., <br><br>
$$\text{Pr}(\text{ecobuy}=1 \mid X) = \Phi \left(\beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} \right),$$<br>where $\Phi( )$ is the CDF of the standard Normal distribution.

6. Compute the **marginal effects** of the coefficients at **the means** and print them out using `summary()`. Interpret the results.

7. Re-estimate the model as a logit model. 

8. Compute the marginal effects of the logit coefficients at the averages in the data.

9. We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data. Create a pandas DataFrame with the row index  'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age'. The columns should be labeled 'logit', 'probit', and 'ols'. The columns should contain the marginal effects for the logit and probit models and the coefficients from the ols model.