## Logistic Regression

In this notebook, you'll learn about **logistic regression** and see how to fit a logistic regression model using the statsmodels library.

Logistic regression models a **binary target variable**, that is, a target that is either true or false. Our goal is to estimate the probability that the target is true, given the values of one or more **explanatory variables**. More precisely, we assume that the target variable follows a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution), conditional on the explanatory variables.

A Bernoulli distribution is determined by a single parameter $p$, the probability of the target being true (sometimes called the probability of "success"). With logistic regression, we assume that the logit (the log-odds) of this probability is a linear function of the explanatory variables. Specifically, if $x$ is our explanatory variable, we assume

$$\text{logit}(p) = \beta_0 + \beta_1\cdot x$$

Here,

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

To convert a logit back to a probability, we apply the inverse of the logit, the **logistic function**:

$$\text{logistic}(x) = \frac{1}{1 + e^{-x}}$$
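
As a quick check, the two functions are inverses of each other: applying the logistic function to $\text{logit}(p)$ returns $p$. The sketch below assumes only NumPy; the helper names `logit` and `logistic` are chosen here for illustration.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def logistic(x):
    """Inverse of logit: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
log_odds = logit(p)

# A probability of 0.5 corresponds to log-odds of 0
print(log_odds)

# Round trip: logistic(logit(p)) recovers p (up to floating-point error)
print(logistic(log_odds))
```

Probabilities below 0.5 map to negative log-odds and probabilities above 0.5 map to positive log-odds, which is why a linear function (which can take any real value) is a sensible model for the logit but not for the probability itself.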

Now, let's see how we can fit a logistic regression model using Python.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

We'll look at a dataset containing the distance and result of all field goal kicks during the 2021 NFL season.

In [None]:
field_goals = pd.read_csv('../data/fg.csv')
field_goals.head(2)

To model the probability of making a field goal (`target` = 1) as a function of distance, we can use the `logit` function from the statsmodels formula API.

In [None]:
fg_dist_logreg = smf.logit("target ~ distance",
                          data = field_goals).fit()

In [None]:
fg_dist_logreg.params

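This says that the fitted model is

$$\text{logit}(p) = 6.992968 - 0.119864\cdot\text{distance}$$

The negative coefficient on distance means that the estimated log-odds of a make decrease as the kick gets longer.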
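
One way to read the slope: because the model is linear in the log-odds, each additional yard multiplies the estimated odds of a make by $e^{\beta_1}$. The following is a minimal sketch that hard-codes the coefficients from the equation above rather than reading them from the fitted model; the variable names are illustrative.

```python
import numpy as np

# Coefficients copied from the fitted equation above
intercept = 6.992968
slope = -0.119864

# Multiplicative change in the odds of a make per additional yard
odds_ratio_per_yard = np.exp(slope)
# ... and per additional 10 yards
odds_ratio_per_10_yards = np.exp(10 * slope)

print(f'Odds ratio per yard: {odds_ratio_per_yard:.3f}')
print(f'Odds ratio per 10 yards: {odds_ratio_per_10_yards:.3f}')

# Distance at which logit(p) = 0, i.e. where the estimated probability is 0.5
even_odds_distance = -intercept / slope
print(f'Distance with a 50% estimated make probability: {even_odds_distance:.1f} yards')
```

An odds ratio below 1 corresponds to a negative slope: the odds shrink with every extra yard. The even-odds distance is a property of the fitted line only, so it should be read with caution if few kicks in the data were attempted from that far out.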

What is the model's estimated probability of making a 40-yard field goal?

In [None]:
def logistic(x):
    return 1 / (1 + np.exp(-x))

In [None]:
distance = 40

logit_p = fg_dist_logreg.params['Intercept'] + fg_dist_logreg.params['distance']*distance

print(f'Estimated Probability of Make: {logistic(logit_p)}')

What about a 60-yard field goal?

In [None]:
distance = 60

logit_p = fg_dist_logreg.params['Intercept'] + fg_dist_logreg.params['distance']*distance

print(f'Estimated Probability of Make: {logistic(logit_p)}')
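
The same calculation can be done for several distances at once. The sketch below hard-codes the coefficients reported earlier so that it runs on its own; `make_probability` is a hypothetical helper name. In the notebook, calling `fg_dist_logreg.predict` on a DataFrame with a `distance` column should give the same values.

```python
import numpy as np

# Coefficients copied from the fitted equation earlier in the notebook
intercept, slope = 6.992968, -0.119864

def make_probability(distance):
    """Estimated probability of a make at the given distance(s), in yards."""
    logit_p = intercept + slope * np.asarray(distance, dtype=float)
    return 1 / (1 + np.exp(-logit_p))

distances = np.array([20, 30, 40, 50, 60])
probs = make_probability(distances)

for d, p in zip(distances, probs):
    print(f'{d} yards: {p:.3f}')
```

Because the slope is negative, the estimated probabilities fall as distance grows, and they fall fastest near the distance where the probability is 0.5.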

Let's plot the estimated probability of a make based on distance.

In [None]:
fit_df = pd.DataFrame({
    'distance': np.linspace(start = field_goals['distance'].min(),
                            stop = field_goals['distance'].max(),
                            num = 150)
})

fit_df['fit'] = fg_dist_logreg.predict(fit_df)

fit_df.plot(x = 'distance',
            y = 'fit',
            legend = False,
            figsize = (10,6),
            color = 'black',
            title = 'Estimated Probability of a Make');

Does this model explain the data well?

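To check, we can look at some diagnostic plots.

First, let's make a summary table by dividing the kicks into up to ten groups of roughly equal size based on distance, then computing the mean distance and the proportion of kicks made (the empirical probability, `eprob`) in each group.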

In [None]:
fg_summary = (
    field_goals
    .assign(group = pd.qcut(field_goals['distance'], 
                            q = 10, 
                            duplicates = 'drop'))
    .groupby('group', observed = False)
    [['distance', 'target']]
    .mean()
    .reset_index()
    .rename(columns = {'target': 'eprob'})
)
fg_summary

In [None]:
fg_summary['fit_prob'] = fg_dist_logreg.predict(fg_summary[['distance']])
fg_summary

In [None]:
ax = fg_summary.plot(x = 'distance', y = 'eprob')
fg_summary.plot(x = 'distance', y = 'fit_prob',
                color = 'black',
                ax = ax);

In [None]:
fg_summary['elogit'] = np.log(fg_summary['eprob'] / (1 - fg_summary['eprob']))
fg_summary['fit_logit'] = np.log(fg_summary['fit_prob'] / (1 - fg_summary['fit_prob']))
fg_summary

In [None]:
ax = fg_summary.plot(x = 'distance', y = 'elogit')
fg_summary.plot(x = 'distance', y = 'fit_logit',
                color = 'black',
                ax = ax);
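
To get a feel for what this empirical logit diagnostic looks like when the model's assumption holds, it can help to run the same binning on simulated data. The sketch below does not use the field goal data: the distances, sample size, and coefficients (7.0 and -0.12) are made up for illustration, and the outcomes are drawn from a model that is exactly linear in the log-odds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated kicks (hypothetical): distances and a known linear log-odds relationship
n = 20000
distance = rng.integers(18, 66, size=n)
true_logit = 7.0 - 0.12 * distance
target = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
sim = pd.DataFrame({'distance': distance, 'target': target})

# Same binning as above: deciles of distance, then group means
sim_summary = (
    sim
    .assign(group=pd.qcut(sim['distance'], q=10, duplicates='drop'))
    .groupby('group', observed=False)
    [['distance', 'target']]
    .mean()
    .reset_index()
    .rename(columns={'target': 'eprob'})
)
sim_summary['elogit'] = np.log(sim_summary['eprob'] / (1 - sim_summary['eprob']))

# Straight-line fit through the empirical logits as a rough linearity check
sim_slope, sim_intercept = np.polyfit(sim_summary['distance'].to_numpy(),
                                      sim_summary['elogit'].to_numpy(),
                                      deg=1)
print(sim_summary[['distance', 'eprob', 'elogit']])
print(f'Slope of empirical logits: {sim_slope:.3f}')
```

When the log-odds really are linear in distance, the empirical logits fall close to a straight line with a slope near the true one. Clear curvature in the corresponding plot for the real data would suggest that a model linear in distance is missing something.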