## Logistic Regression Agenda

  * Attempt to use linear regression for classification
  * Logistic regression is a better alternative for classification
  * Brief overview of probability, odds, e, log, and log-odds
  * What is the logistic regression model?
  * Interpreting logistic regression coefficients
  * Compare logistic regression with other models
  
By the end of this portion of the class you will be able to:
  * Use logistic regression for a classification problem in the future
  * Interpret the coefficients of a trained logistic regression model

### Predicting a categorical response

In the first part of today's lesson, we were attempting to predict a **continuous response**. Now we want to see whether the same sort of logic can predict an outcome that takes one of a fixed set of discrete values (here, only 2), which is known as a **categorical response**.

In machine learning parlance, we looked at **regression** when we were using linear regression, but we are now going to try to use the same approach for what is known as a **classification** problem (problems with only a discrete, finite number of outcomes; in our case, just 2).

As always, we are going to import all of the functionality we need before we get started:

In [None]:
# Python 2 and 3 compatibility
from __future__ import print_function

#data handling/modeling
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn import metrics
import scipy.stats as stats

# visualization
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

Now we are going to import a slightly different dataset. This dataset is also from the famed [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Vertebral+Column).

This dataset contains 6 biomechanical features used to classify orthopaedic patients into 2 classes - normal and abnormal:
  * pelvic incidence
  * pelvic tilt
  * lumbar lordosis angle
  * sacral slope
  * pelvic radius
  * grade of spondylolisthesis
  
Let's load the data in:

In [None]:
vertebral_data = pd.read_csv("../data/vertebral_column_2_categories.dat", sep=" ",
                             names=["pelvic_incidence","pelvic_tilt","lumbar_lordosis_angle","sacral_slope","pelvic_radius","spondy_grade","outcome"])
vertebral_data.outcome.value_counts()

In order to use linear regression for this task, we have to convert our **categorical** target into a number:

In [None]:
vertebral_data["outcome_number"] = (vertebral_data.outcome=='AB').astype(int)
vertebral_data.outcome_number.value_counts()

Cool, so now our outcome is no longer a string label, but a number (1 for abnormal, 0 for normal). Let's plot `pelvic_incidence` relative to this new numeric `outcome_number`:

In [None]:
# recent seaborn versions use `height` in place of the older `size` argument
sns.pairplot(vertebral_data, x_vars=["pelvic_incidence"], y_vars="outcome_number", height=6, aspect=0.8);

And now let's do a simple linear regression on that feature like we did before:

In [None]:
# fit a linear regression model and store the predictions
feature_cols = ['pelvic_incidence']
X = vertebral_data[feature_cols]
y = vertebral_data.outcome_number
linreg = LinearRegression()
linreg.fit(X, y)
outcome_pred = linreg.predict(X)

In [None]:
# scatter plot that includes the regression line
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, outcome_pred, color='red');

Let's examine the predictions:

In [None]:
outcome_pred[:10]

If **pelvic_incidence=35**, what class do we predict for outcome? **0**

So, we predict the 0 class for **lower** values of `pelvic_incidence`, and the 1 class for **higher** values of `pelvic_incidence`. What's our cutoff value? Around **pelvic_incidence=45**, because that's where the linear regression line crosses the midpoint (0.5) between predicting class 0 and class 1.

So, we'll say that if **outcome_pred >= 0.5**, we predict a class of **1**, else we predict a class of **0**.

In [None]:
# transform predictions to 1 or 0
outcome_pred_class = np.where(outcome_pred >= 0.5, 1, 0)
outcome_pred_class

In [None]:
# plot the class predictions
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, outcome_pred_class, color='red');

What went wrong? This is a line plot, and it connects points in the order they appear in the DataFrame. Let's sort the DataFrame by `pelvic_incidence` to fix this:

In [None]:
# add predicted class to DataFrame
vertebral_data['outcome_pred_class'] = outcome_pred_class

# sort DataFrame by pelvic_incidence so that the line plot makes sense
vertebral_data.sort_values('pelvic_incidence', inplace=True)

In [None]:
# plot the class predictions again
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, vertebral_data.outcome_pred_class, color='red');

### Logistic regression?

[**Linear regression:**](https://en.wikipedia.org/wiki/Linear_regression) a continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$$

[**Logistic regression:**](https://en.wikipedia.org/wiki/Logistic_regression) model based on the [**logistic function**](https://en.wikipedia.org/wiki/Logistic_function) which takes any number and outputs a number between 0 and 1. We can interpret the output as a probability.

**Logistic function:**

$$y = \frac{1}{1 + e^{-x}}$$

Here's what that looks like:

![logistic curve](../images/logistic_curve.png)
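To make the formula concrete, here is a minimal sketch of the logistic function in plain Python. The helper name `logistic` is our own (it is not part of scikit-learn), and the input values are arbitrary:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# evaluate at a few arbitrary inputs to trace out the S shape
for z in [-6, -2, 0, 2, 6]:
    print(z, round(logistic(z), 4))
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 maps to exactly 0.5.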

The input to the logistic function (the $x$ in the formula above) is just the linear combination of features: $$\beta_0 + \beta_1x_1 + ... + \beta_nx_n$$

scikit-learn's `LogisticRegression` estimator will choose the coefficients that best fit the data. We can then use the logistic function to calculate probabilities.

$$P(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_nx_n)}}$$

We can rearrange this equation:

$$\log \left( \frac{P(y)}{1-P(y)} \right) = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$$

The quantity on the left is called the **log-odds**, because it is the log of the odds $\frac{P(y)}{1-P(y)}$. Here $P(y)$ denotes the predicted probability of class 1.
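As a quick sanity check on the rearrangement, the sketch below converts a probability to odds and log-odds and back again. The helper names `prob_to_logodds` and `logodds_to_prob` are made up for illustration, and 0.8 is an arbitrary example probability:

```python
import math

def prob_to_logodds(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def logodds_to_prob(z):
    """Invert the log-odds: exponentiate to get odds, then odds / (1 + odds)."""
    odds = math.exp(z)
    return odds / (1 + odds)

p = 0.8                          # arbitrary example probability
odds = p / (1 - p)               # roughly 4, i.e. "4 to 1"
logodds = prob_to_logodds(p)     # log of the odds
print(odds, logodds, logodds_to_prob(logodds))
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why 0 is the natural dividing line on the log-odds scale.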

In other words:

- Logistic regression outputs the **probabilities of a specific class**
- Those probabilities can be converted into **class predictions**:

$f(x)= 
\begin{cases}
    1,& \text{if } p\geq 0.5\\
    0,              & \text{otherwise}
\end{cases}$
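The thresholding rule can be sketched in a few lines. The coefficients below are hypothetical values chosen only to keep the arithmetic easy, and `predict_proba` / `predict_class` are our own plain-Python helpers, not the scikit-learn methods of similar name:

```python
import math

# hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -3.0, 0.5

def predict_proba(x):
    """Predicted probability of class 1 for a single feature value x."""
    z = beta_0 + beta_1 * x          # log-odds
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, threshold=0.5):
    """Class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x) >= threshold else 0

# p >= 0.5 is the same as log-odds >= 0, so the boundary sits at -beta_0 / beta_1
boundary = -beta_0 / beta_1

for x in [2, 6, 10]:
    print(x, round(predict_proba(x), 3), predict_class(x))
```

With a single feature, the 0.5 cutoff on the probability is therefore just a cutoff on the feature itself, located where the log-odds cross zero.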

The **logistic function** has some nice properties:

- Takes on a smooth "S" shape and is differentiable everywhere (a really important math property, because fitting the model relies on derivatives)
- Output is bounded by 0 and 1

Some things to note:

- **Multinomial logistic regression** is used when there are more than 2 classes.
- Coefficients are estimated using **maximum likelihood estimation**, meaning that we choose the parameters that maximize the likelihood of the observed data. We do this using fancy math involving taking derivatives, and that's why the differentiability of that S-shaped curve is so important.
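To give a feel for what maximum likelihood estimation does, here is a rough sketch that fits a one-feature model by plain gradient ascent on the log-likelihood. The data are made up, the learning rate and iteration count are arbitrary choices, and scikit-learn itself uses more sophisticated solvers, so treat this as an illustration of the idea only:

```python
import math

# toy, made-up data: one feature x and a binary label y (the classes overlap)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1):
    """Log-likelihood of the observed labels under coefficients b0, b1."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

b0, b1 = 0.0, 0.0
learning_rate = 0.01                 # arbitrary small step size
for _ in range(20000):
    # derivative of the log-likelihood: sum of (y - p) for the intercept,
    # and sum of (y - p) * x for the slope
    residuals = [y - logistic(b0 + b1 * x) for x, y in zip(xs, ys)]
    b0 += learning_rate * sum(residuals)
    b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs))

print("intercept:", b0, "slope:", b1)
print("log-likelihood:", log_likelihood(b0, b1))
```

Each step nudges the coefficients in the direction that makes the observed labels more likely, which is only possible because the logistic function has a derivative everywhere.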

### Use Logistic Regression Instead of Linear Regression on Categorical Outcome Variables

Logistic regression can do exactly what we just did:

In [None]:
logreg = LogisticRegression(C=1e9)
feature_cols = ['pelvic_incidence']
X = vertebral_data[feature_cols]
y = vertebral_data.outcome_number
logreg.fit(X, y)
outcome_pred_class_log = logreg.predict(X)

In [None]:
# print the class predictions
outcome_pred_class_log

In [None]:
# plot the class predictions
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, outcome_pred_class_log, color='red');

What if we wanted the **predicted probabilities** instead of just the **class predictions**, to understand how confident we are in a given prediction?

In [None]:
# store the predicted probabilities of class 1
outcome_probs = logreg.predict_proba(X)[:, 1]

In [None]:
# plot the predicted probabilities, and the 50% line
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, outcome_probs, color='red')
plt.plot(vertebral_data.pelvic_incidence,np.ones(outcome_probs.shape)*.5,'k--');

In [None]:
# examine some example predictions
# predict_proba expects a 2D array: one row per sample, one column per feature
print("Pelvic incidence of 15:", logreg.predict_proba([[15]]))
print("Pelvic incidence of 10:", logreg.predict_proba([[10]]))
print("Pelvic incidence of 55:", logreg.predict_proba([[55]]))

What are these numbers? 

The first number in each entry indicates the predicted probability of **class 0**, and the second number in each entry indicates the predicted probability of **class 1**.

### Interpreting Logistic Regression Coefficients

In [None]:
# plot the predicted probabilities again
plt.figure(figsize=(8, 6))
plt.scatter(vertebral_data.pelvic_incidence, vertebral_data.outcome_number)
plt.plot(vertebral_data.pelvic_incidence, outcome_probs, color='red');

In [None]:
# compute predicted probability for pelvic_incidence=55 using the predict_proba method
logreg.predict_proba([[55]])[:, 1]

In [None]:
# examine the coefficient for pelvic_incidence
list(zip(feature_cols, logreg.coef_[0]))

**Interpretation:** A 1 unit increase in `pelvic_incidence` is associated with a ~0.054 unit increase in the log-odds of `outcome`, where a positive outcome is having a vertebral abnormality (not positive in the real world, but positive in how we coded our outcome feature).

In [None]:
# compute predicted probability for pelvic_incidence=56 using the predict_proba method
logreg.predict_proba([[56]])[:, 1]

### What does this mean actually? 

**Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).**
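One way to see this is on the odds scale: a 1 unit increase in the feature multiplies the odds by $e^{\beta_1}$, no matter where you start. The sketch below uses the rounded values quoted in this lesson (an intercept of about -2.36 and a coefficient of about 0.054) as assumptions, so its output will not exactly match the fitted model:

```python
import math

# rounded values quoted in this lesson, used here as assumptions
beta_0, beta_1 = -2.36, 0.054

def prob(x):
    """Predicted probability of class 1 at pelvic_incidence = x."""
    z = beta_0 + beta_1 * x
    return 1.0 / (1.0 + math.exp(-z))

def odds(x):
    p = prob(x)
    return p / (1 - p)

# the odds ratio for a 1 unit increase equals exp(beta_1), whatever the start
odds_ratio = odds(56) / odds(55)
print(odds_ratio, math.exp(beta_1))

# the change in probability, by contrast, depends on where you start
print(prob(56) - prob(55), prob(11) - prob(10))
```

So the coefficient has a constant effect on the log-odds (and a constant multiplicative effect on the odds), but not on the probability itself.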

In [None]:
# examine the intercept
logreg.intercept_

**Interpretation:** For a `pelvic_incidence` value of 0, the log-odds of `outcome` is -2.36.

In [None]:
# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob

That makes sense from the plot above, because the probability of outcome=1 should be very low for such a low `pelvic_incidence` value.

![logistic betas example](../images/logistic_betas_example.png)

Changing the $\beta_0$ value shifts the curve **horizontally**, whereas changing the $\beta_1$ value changes the **slope** of the curve.
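A small sketch makes the shift and the slope change visible. The curve crosses a probability of 0.5 where the log-odds are zero, that is at $x = -\beta_0/\beta_1$. The coefficient values below are hypothetical, and `prob` and `midpoint` are our own helper names:

```python
import math

def prob(x, beta_0, beta_1):
    """Predicted probability of class 1 for feature value x."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

def midpoint(beta_0, beta_1):
    """Feature value at which the predicted probability is exactly 0.5."""
    return -beta_0 / beta_1

# changing beta_0 alone moves the midpoint (a horizontal shift)
print(midpoint(-2.0, 1.0), midpoint(-4.0, 1.0))

# doubling both coefficients keeps the midpoint but steepens the curve:
# one unit past the midpoint, the steeper curve is already closer to 1
print(midpoint(-4.0, 2.0), prob(3, -2.0, 1.0), prob(3, -4.0, 2.0))
```

Note that changing $\beta_1$ on its own also moves the midpoint whenever $\beta_0$ is not zero, which is why the second comparison rescales both coefficients together.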

The non-intercept $\beta$ coefficients measure how much evidence each feature contributes toward the outcome. The larger a coefficient's magnitude (positive or negative), the more a one-unit change in that feature shifts the log-odds, and so the more certain the prediction becomes as that feature moves away from the decision boundary. Keep in mind that a coefficient's magnitude also depends on the units of its feature.

### How do we measure model performance for classification problems?

Now that we have a trained model just as we did before with linear regression, what is our **evaluation metric/loss function**?

There are two common (inverse) measurements we can make that capture the performance of our classification model:
  * **Classification accuracy**: percentage of correct predictions (**reward function**)
  * **Classification error**: percentage of incorrect predictions (**loss function**)
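Both measurements are easy to compute by hand, which helps demystify what `accuracy_score` reports. The labels and predictions below are made up for illustration (and the division assumes Python 3):

```python
# made-up true labels and predictions for ten samples
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# count the positions where the prediction matches the true label
n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = n_correct / len(y_true)   # fraction of correct predictions
error = 1 - accuracy                 # fraction of incorrect predictions

print("accuracy:", accuracy)
print("error:", error)
```

Here 8 of the 10 predictions match, so the accuracy is 0.8 and the error is the remaining 0.2.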

In our case, we are going to use classification accuracy. Let's compute our classification accuracy after training on the whole dataset, using our just-trained one-feature logistic regression model and the scikit-learn function `accuracy_score`:

In [None]:
y = vertebral_data.outcome_number
# use the logistic regression predictions, which are in the same (sorted) row order as y
y_pred = outcome_pred_class_log
print("Model accuracy:", metrics.accuracy_score(y, y_pred))

68% is OK, but it's not really fantastic. Can we do better? (YES WE CAN!)

#### Exercise Time!!
  * Generate the logistic regression model incorporating all of the features we have available to predict `outcome_number` and get the accuracy when training and testing on all data. How much better is this than the case where we trained our model using only `pelvic_incidence`?
  * Use train/test split with 70% training, 30% testing and get the test error of the model trained on all features using `train_test_split` like we did during linear regression 
  * Inspect all of the model coefficients of the model trained on all features. Which feature is the most important for the prediction? Which is the least important?
  * What are some problems you can see in using the data like we have been? (Look at the fraction of positive and negative outcomes in the dataset)

In [None]:
pass

### Comparing Logistic Regression with Other Models

Logistic regression has some really awesome advantages:

  * It is a highly interpretable method (if you remember what the conversions from log-odds to probability are)
  * Model training and prediction are fast
  * No tuning is required (excluding regularization, which we will talk about later)
  * No need to scale features
  * Outputs well-calibrated predicted probabilities (the probabilities behave like probabilities)

However, logistic regression also has some disadvantages:

  * It presumes a linear relationship between the features and the log-odds of the response
  * Performance is (generally) not competitive with the best, more fancypants supervised learning methods
  * Like linear regression for regression, it is sensitive to irrelevant features
  * Unless you explicitly code them (we will see how to do that later), logistic regression can't automatically learn feature interactions
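The last point can be illustrated with a tiny made-up example where the label depends on the combination of two features (an XOR pattern). The sketch below does not fit a model. It brute-forces a small, arbitrary grid of linear log-odds coefficients, then compares the best of them against one hand-picked model that includes the interaction term `x1 * x2`:

```python
from itertools import product

# made-up XOR-style data: the label is 1 only when exactly one feature is 1
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def accuracy(predict):
    """Fraction of the four points that a prediction function gets right."""
    hits = sum(1 for (x1, x2), y in zip(points, labels) if predict(x1, x2) == y)
    return hits / len(points)

# try many linear log-odds models: predict class 1 when b0 + b1*x1 + b2*x2 >= 0
grid = range(-5, 6)
best_linear = max(
    accuracy(lambda x1, x2: 1 if b0 + b1 * x1 + b2 * x2 >= 0 else 0)
    for b0, b1, b2 in product(grid, repeat=3)
)

# hand-picked coefficients that include the interaction term x1 * x2
def with_interaction(x1, x2):
    z = -1 + 2 * x1 + 2 * x2 - 4 * x1 * x2
    return 1 if z >= 0 else 0

print("best linear accuracy:", best_linear)
print("with interaction term:", accuracy(with_interaction))
```

No purely linear combination of the two features gets more than 3 of the 4 points right, while adding the product of the features as an extra input lets the same kind of model classify all four.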