# Logistic Regression

## Agenda

1. Refresh your memory on how to do linear regression in scikit-learn
2. Attempt to use linear regression for classification
3. Show you why logistic regression is a better alternative for classification
4. Give a brief overview of probability, odds, e, log, and log-odds
5. Explain the form of logistic regression
6. Explain how to interpret logistic regression coefficients
7. Compare logistic regression with other models

## Part 1: Predicting a Continuous Response

In [None]:
# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass['assorted'] = glass.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})

In [None]:
glass.head()

Pretend that we want to predict **ri**, and our only feature is **al**. How would we do it using machine learning? We would frame it as a regression problem, and use a linear regression model with **al** as the only feature and **ri** as the response.

How would we **visualize** this model? Create a scatter plot with **al** on the x-axis and **ri** on the y-axis, and draw the line of best fit.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.lmplot(x='al', y='ri', data=glass, ci=None)

If we had an **al** value of 2, what would we predict for **ri**? Roughly 1.517.



In [None]:
# Exercise: Draw the scatter plot using Pandas.
















# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')

In [None]:
# fit a linear regression model to predict ri from al
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)

In [None]:
# look at the coefficients to get the equation for the line, but then how do you plot the line?
print(linreg.intercept_)
print(linreg.coef_)

In [None]:
# you could make predictions for arbitrary points, and then plot a line connecting them
# (predict expects a 2D input: one row per observation, one column per feature)
print(linreg.predict([[1]]))
print(linreg.predict([[2]]))
print(linreg.predict([[3]]))

In [None]:
# or you could make predictions for all values of X, and then plot those predictions connected by a line
ri_pred = linreg.predict(X)

# draw regression line with matplotlib and pandas
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, ri_pred, color='red')

### Refresher: interpreting linear regression coefficients

Linear regression equation: $y = \beta_0 + \beta_1x$

In [None]:
# compute prediction for al=2 using the predict method
linreg.predict([[2]])

In [None]:
# examine coefficient for al
pd.DataFrame(list(zip(feature_cols, linreg.coef_)), columns=['feature', 'coef'])

In [None]:
# Note that cross_val_score only returns scores, not a fitted model, so to investigate variable relationships we fit the model ourselves

**Interpretation:** A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.
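The interpretation can be checked numerically. The sketch below uses made-up data rather than the glass dataset; the names `toy_X`, `toy_y`, and `toy_linreg` are illustrative assumptions. It shows that a prediction is just the line equation evaluated at x, and that moving x by one unit changes the prediction by exactly the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data (an assumption for illustration, not the glass dataset)
rng = np.random.RandomState(0)
toy_X = rng.uniform(0, 4, size=(50, 1))
toy_y = 1.52 - 0.0025 * toy_X[:, 0] + rng.normal(0, 0.001, size=50)

toy_linreg = LinearRegression().fit(toy_X, toy_y)
b0, b1 = toy_linreg.intercept_, toy_linreg.coef_[0]

# the prediction is the line equation evaluated at x=2
manual_pred = b0 + b1 * 2
model_pred = toy_linreg.predict([[2]])[0]

# moving x from 2 to 3 changes the prediction by exactly b1
step = toy_linreg.predict([[3]])[0] - model_pred
print(manual_pred, model_pred, step, b1)
```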

In [None]:
# compute prediction for al=3 using the predict method
linreg.predict([[3]])

## Part 2: Predicting a Categorical Response

Let's change our task, so that we're predicting **assorted** using **al**. Let's visualize the relationship to figure out how to do this:

In [None]:
plt.scatter(glass.al, glass.assorted)

Let's draw a **regression line**, like we did before:

In [None]:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
linreg.fit(X, y)
assorted_pred = linreg.predict(X)

In [None]:
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred, color='red')

If **al=3**, what class do we predict for assorted? **1**

If **al=1.5**, what class do we predict for assorted? **0**

So, we predict the 0 class for **lower** values of al, and the 1 class for **higher** values of al. What's our cutoff value? Around **al=2**, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.

So, we'll say that if **assorted_pred >= 0.5**, we predict a class of **1**, else we predict a class of **0**.

In [None]:
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])

In [None]:
# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')

In [None]:
# examine the predictions
assorted_pred[:10]

In [None]:
# transform predictions to 1 or 0
assorted_pred_class = np.where(assorted_pred >= 0.5, 1, 0)
assorted_pred_class

In [None]:
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')

What went wrong? This is a line plot, and it connects points in the order they are found. Let's sort the DataFrame by "al" to fix this:

In [None]:
# add predicted class to DataFrame
glass['assorted_pred_class'] = assorted_pred_class

# sort DataFrame by al
glass.sort_values('al', inplace=True)

In [None]:
# plot the class predictions again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_class, color='red')

## Part 3: Using Logistic Regression Instead

Logistic regression can do what we just did, but better.

In [None]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
logreg.fit(X, y)
assorted_pred_class = logreg.predict(X)

In [None]:
# print the class predictions
assorted_pred_class

In [None]:
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')

What if we wanted the **predicted probabilities** instead of just the **class predictions**, to understand how confident we are in a given prediction?

In [None]:
# store the predicted probabilities of class 1
assorted_pred_prob = logreg.predict_proba(X)[:, 1]
print(assorted_pred_prob)

In [None]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')

In [None]:
# examine some example predictions
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))

What is this? The first column indicates the predicted probability of **class 0**, and the second column indicates the predicted probability of **class 1**.
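The column layout can be confirmed on any fitted model. This is a minimal sketch on synthetic data (the `toy_` names are assumptions for illustration): the columns of `predict_proba` follow the order of `classes_`, each row sums to 1, and `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic one-feature data (illustrative assumption)
rng = np.random.RandomState(1)
toy_X = rng.normal(size=(200, 1))
toy_y = (toy_X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_logreg = LogisticRegression().fit(toy_X, toy_y)
new_points = [[-1.0], [0.5], [1.0]]
proba = toy_logreg.predict_proba(new_points)

# columns follow the order of classes_, and each row sums to 1
class_order = toy_logreg.classes_
row_sums = proba.sum(axis=1)

# predict() picks the class with the larger probability
pred_from_proba = class_order[proba.argmax(axis=1)]
pred_direct = toy_logreg.predict(new_points)
print(class_order, proba, pred_direct)
```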

## Part 4: Probability, odds, e, log, log-odds

$$probability = \frac {one\ outcome} {all\ outcomes}$$

$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

Examples:

- Dice roll of 1: probability = 1/6, odds = 1/5
- Even dice roll: probability = 3/6, odds = 3/3 = 1
- Dice roll less than 5: probability = 4/6, odds = 4/2 = 2

$$odds = \frac {probability} {1 - probability}$$
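The dice examples above can be checked with exact fractions. The helper names `odds_from_probability` and `probability_from_odds` are hypothetical, introduced only for this sketch.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Invert the conversion: odds / (1 + odds)."""
    return odds / (1 + odds)

# the three dice examples, kept as exact fractions
roll_of_one = odds_from_probability(Fraction(1, 6))     # 1/5
even_roll = odds_from_probability(Fraction(3, 6))       # 1
less_than_five = odds_from_probability(Fraction(4, 6))  # 2
print(roll_of_one, even_roll, less_than_five)
```

Going the other way, `probability_from_odds(Fraction(1, 5))` recovers the original probability of 1/6.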

In [None]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table

What is **e**? It is the base rate of growth shared by all continually growing processes:

In [None]:
# exponential function: e^1
e = np.exp(1)
e

What is a **(natural) log**? It gives you the time needed to reach a certain level of growth:

In [None]:
# time needed to grow 1 unit to 2.718 units
np.log(e)

It is also the **inverse** of the exponential function:

In [None]:
np.log(np.exp(5))

In [None]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table

## Part 5: What is Logistic Regression?

**Linear regression:** continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x$$

**Logistic regression:** log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$

This is called the **logit function**.

Probability is sometimes written as pi:

$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$

The equation can be rearranged into the **logistic function**:

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$

In other words:

- Logistic regression outputs the **probabilities of a specific class**
- Those probabilities can be converted into **class predictions**

The **logistic function** has some nice properties:

- Takes on an "s" shape
- Output is bounded by 0 and 1
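Both properties, and the fact that the logistic function undoes the logit, can be verified in a few lines. This sketch writes the logistic function as $1 / (1 + e^{-z})$, which is algebraically the same as $e^z / (1 + e^z)$; the function names are illustrative.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability."""
    return np.log(p / (1 - p))

def logistic(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 201)
p = logistic(z)

# bounded by 0 and 1, always increasing (the "s" shape), and 0.5 at z=0
print(p.min(), p.max(), logistic(0))

# logit and logistic are inverses of each other
print(np.allclose(logit(p), z))
```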

Notes:

- **Multinomial logistic regression** is used when there are more than 2 classes.
- Coefficients are estimated using **maximum likelihood estimation**, meaning that we choose parameters that maximize the likelihood of the observed data.
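To make "maximize the likelihood" concrete, here is a rough sketch on synthetic data. It assumes a very large `C` so that scikit-learn's default L2 penalty is negligible, and the helper `negative_log_likelihood` is hypothetical. Nudging either fitted parameter away from its estimate makes the observed labels less likely.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def negative_log_likelihood(b0, b1, x, y):
    """Negative log-likelihood of 0/1 labels y under the logistic model."""
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# synthetic data drawn from a known logistic model (illustrative assumption)
rng = np.random.RandomState(42)
x = rng.normal(size=500)
true_p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.uniform(size=500) < true_p).astype(int)

# a very large C makes the default L2 penalty negligible
mle_model = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)
b0_hat, b1_hat = mle_model.intercept_[0], mle_model.coef_[0, 0]

nll_fitted = negative_log_likelihood(b0_hat, b1_hat, x, y)
nll_shifted = negative_log_likelihood(b0_hat + 0.5, b1_hat, x, y)
nll_steeper = negative_log_likelihood(b0_hat, b1_hat + 0.5, x, y)
print(nll_fitted, nll_shifted, nll_steeper)
```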

## Part 6: Interpreting Logistic Regression Coefficients

In [None]:
# plot the predicted probabilities again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')

In [None]:
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_ * 2
logodds

In [None]:
# convert log-odds to odds
odds = np.exp(logodds)
odds

In [None]:
# convert odds to probability
prob = odds/(1 + odds)
prob

In [None]:
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba([[2]])[:, 1]

In [None]:
# examine the coefficient for al
pd.DataFrame(list(zip(feature_cols, logreg.coef_[0])), columns=['feature', 'coef'])

**Interpretation:** A 1 unit increase in 'al' is associated with a 2.0109 unit increase in the log-odds of 'assorted'.

In [None]:
# increasing al by 1 (so that al=3) increases the log-odds by 2.0109

# the -0.10592543 is the logodds we calculated a few cells ago for al=2
# I am stepping through the equation by one "unit" of al

logodds = -0.10592543 + 2.0109
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob

In [None]:
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba([[3]])[:, 1]

**Bottom line:** Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
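A related way to read the coefficient: adding a constant to the log-odds multiplies the odds by a constant factor. The sketch below reuses the two values reported earlier in this notebook (log-odds of -0.10592543 at al=2 and a coefficient of 2.0109); the variable names are illustrative.

```python
import numpy as np

# values reported earlier in this notebook
logodds_al2 = -0.10592543   # predicted log-odds at al=2
coef_al = 2.0109            # coefficient for al

odds_al2 = np.exp(logodds_al2)
odds_al3 = np.exp(logodds_al2 + coef_al)

# a one-unit increase in al multiplies the odds by exp(coef)
odds_ratio = odds_al3 / odds_al2
print(odds_ratio, np.exp(coef_al))

# the change in probability, by contrast, depends on the starting point
prob_al2 = odds_al2 / (1 + odds_al2)
prob_al3 = odds_al3 / (1 + odds_al3)
print(prob_al2, prob_al3)
```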

In [None]:
# examine the intercept
logreg.intercept_

**Interpretation:** For an 'al' value of 0, the log-odds of 'assorted' is -4.127.

In [None]:
# convert log-odds to probability
# Probability of assorted is low if al = 0
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob

That makes sense from the plot above, because the probability of assorted=1 should be very low for such a low 'al' value.

![](images/logistic_betas.png)

Changing the $\beta_0$ value shifts the curve **horizontally**, whereas changing the $\beta_1$ value changes the **slope** of the curve.
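Both effects can be seen numerically. This sketch plugs in the intercept and coefficient reported earlier (-4.127 and 2.0109); the helper `logistic` and the other names are assumptions for illustration. The curve crosses a probability of 0.5 where $\beta_0 + \beta_1x = 0$, and its slope at that point is $\beta_1 / 4$.

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# coefficients reported earlier in this notebook
b0, b1 = -4.127, 2.0109

# the curve crosses probability 0.5 where b0 + b1*x = 0
midpoint = -b0 / b1

# a more negative intercept moves that crossing to the right
midpoint_shifted = -(b0 - 1) / b1

# numerical slope of the curve at its midpoint, which equals b1 / 4
eps = 1e-6
upper = logistic(b0 + b1 * (midpoint + eps))
lower = logistic(b0 + b1 * (midpoint - eps))
slope_at_mid = (upper - lower) / (2 * eps)
print(midpoint, midpoint_shifted, slope_at_mid)
```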

## Part 7: Comparing Logistic Regression with Other Models

Advantages of logistic regression:

- Highly interpretable (if you remember how)
- Model training and prediction are fast
- No tuning is required (excluding regularization)
- Features don't need scaling (unless regularization is applied, since the penalty depends on the scale of the coefficients)
- Can perform well with a small number of observations
- Outputs well-calibrated predicted probabilities

Disadvantages of logistic regression:

- Presumes a linear relationship between the features and the log-odds of the response
- Performance is (generally) not competitive with the best supervised learning methods
- Sensitive to irrelevant features
- Can't automatically learn feature interactions

## Bonus: Confusion Matrix



In [None]:
from sklearn import metrics
preds = logreg.predict(X)
print(metrics.confusion_matrix(y, preds))
# Note that we can't make this matrix using cross_val_score, which only returns scores, so a train_test_split has to do!

- Top Left: True Negatives
- Top Right: False Positives
- Bottom Left: False Negatives
- Bottom Right: True Positives

**Exercise:** Calculate the following by hand:

- Accuracy
- Sensitivity
- Specificity
- Precision



<br><br><br><br><br><br><br><br><br><br><br><br><br>







#### Accuracy = (157 + 28) / 214 = .8645
#### Sensitivity (Recall) = 28 / (23 + 28) = .5490
#### Specificity = 157 / (157 + 6) = .9632
#### Precision = 28 / (28 + 6) = .8235
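The same arithmetic can be done in code. This sketch hard-codes the four counts used in the answers above and assumes scikit-learn's layout for a binary confusion matrix (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# counts taken from the worked answers above
# (rows are actual classes, columns are predicted classes)
cm = np.array([[157, 6],
               [23, 28]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(accuracy, sensitivity, specificity, precision)
```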

In [None]:
print(metrics.classification_report(y, preds))

In [None]:
# MORE DATA

# Logistic Regression is a high bias, low variance model, and it is parametric (it learns a fixed set of coefficients)

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
plt.scatter(circles_X[:,0], circles_X[:,1])

In [None]:
# It has a linear decision boundary, i.e. the shapes it draws between classes are lines!

from matplotlib.colors import ListedColormap
import numpy as np

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we create an instance of LogisticRegression and fit the data.
logreg = LogisticRegression()
logreg.fit(circles_X, circles_y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = circles_X[:, 0].min() - 1, circles_X[:, 0].max() + 1
y_min, y_max = circles_X[:, 1].min() - 1, circles_X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(circles_X[:, 0], circles_X[:, 1], c=circles_y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Circle classification Logistic Regression")

plt.show()

In [None]:
logreg = LogisticRegression()
cross_val_score(logreg, circles_X, circles_y, cv=5, scoring='accuracy').mean()
# lame

In [None]:
from sklearn.neighbors import KNeighborsClassifier  # compare to knn
knn = KNeighborsClassifier(n_neighbors=7)
cross_val_score(knn, circles_X, circles_y, cv=5, scoring='accuracy').mean()
# not as lame, remember?
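One way to see that the linear decision boundary is the limiting factor: hand the model a feature in which the classes are linearly separable. The sketch below is an illustration, not part of the original lesson. It regenerates the same circles data, adds the squared distance from the origin as a single hand-built feature (`radius_sq`, a hypothetical name), and uses `sklearn.model_selection`, the current home of `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

# hand-built feature: squared distance of each point from the origin
radius_sq = (circles_X ** 2).sum(axis=1).reshape(-1, 1)

# same model, same data, two different feature sets
raw_acc = cross_val_score(LogisticRegression(), circles_X, circles_y,
                          cv=5, scoring='accuracy').mean()
radius_acc = cross_val_score(LogisticRegression(), radius_sq, circles_y,
                             cv=5, scoring='accuracy').mean()
print(raw_acc, radius_acc)
```

The model did not change; only the features did. Logistic regression will not discover a transformation like this on its own.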

In [None]:
from sklearn import datasets

# new dataset, handwritten digits!
digits = datasets.load_digits()
digits.data

In [None]:
plt.imshow(digits.images[-5], cmap=plt.cm.gray_r, interpolation='nearest')
# the number 9


digits.target[-5]

In [None]:
digits.data.shape
# 1,797 observations, 64 features (8 x 8 image)

In [None]:
digits_X, digits_y = digits.data, digits.target

In [None]:
logreg = LogisticRegression()
cross_val_score(logreg, digits_X, digits_y, cv=5, scoring='accuracy').mean()

In [None]:
# compare to KNN
knn = KNeighborsClassifier(n_neighbors=5)
cross_val_score(knn, digits_X, digits_y, cv=5, scoring='accuracy').mean()

In [None]:
# Thought exercise: why would KNN potentially be a better model than logistic regression
# for handwriting?

In [None]:
# OK so wait, when should we use Logistic Regression?

In [None]:
# Using a dataset from a 1978 survey conducted to measure the likelihood of women having extramarital affairs
# http://statsmodels.sourceforge.net/stable/datasets/generated/fair.html

import statsmodels.api as sm
affairs_df = sm.datasets.fair.load_pandas().data

In [None]:
affairs_df

In [None]:
affairs_df['affair_binary'] = (affairs_df['affairs'] > 0)

In [None]:
sns.heatmap(affairs_df.corr())

In [None]:
affairs_df.corr()
# Obviously affairs will correlate to affair_binary but what else?






# It seems children, yrs_married, rate_marriage, and age all correlate with affair_binary
# Remember that correlations are NOT the only way to identify which features to use
# Correlations only give us a number describing how linearly related two variables are
# We may find another variable that affects affairs by evaluating the coefficients of our LR

In [None]:
affairs_X = affairs_df.drop(['affairs', 'affair_binary'], axis=1)
affairs_y = affairs_df['affair_binary']

In [None]:
model = LogisticRegression()
from sklearn.model_selection import cross_val_score
# check the cross-validated accuracy (10 folds)
scores = cross_val_score(model, affairs_X, affairs_y, cv=10)
print(scores)
print(scores.mean())

# Looks pretty good

In [None]:
# Explore individual features that make the biggest impact
# religious, yrs_married, and occupation. But one of these variables doesn't quite make sense, right?
# (cross_val_score fits copies of the model, so fit it here before reading coef_)
model.fit(affairs_X, affairs_y)
pd.DataFrame(list(zip(affairs_X.columns, model.coef_[0])))

In [None]:
# Dummy Variables:

# Encoding qualitative (nominal) data using separate columns (see slides for linear regression for more)

<img src="images/dummy.png">
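As a small illustration of the encoding, here is a sketch on a made-up column of occupation codes (the `toy` DataFrame is an assumption, not the survey data). A column with k categories becomes k indicator columns; dropping the first leaves k-1, and the dropped category becomes the baseline against which the other coefficients are read.

```python
import pandas as pd

# toy column of occupation codes (made-up values for illustration)
toy = pd.DataFrame({'occupation': [1, 2, 3, 2]})

# one indicator column per category
all_dummies = pd.get_dummies(toy['occupation'], prefix='occ')

# dropping the first column leaves k-1 indicators; rows in the
# baseline category (occupation 1) are all zeros
dummies = pd.get_dummies(toy['occupation'], prefix='occ', drop_first=True)
print(all_dummies.columns.tolist())
print(dummies.astype(int))
```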

In [None]:
occupation_dummies = pd.get_dummies(affairs_df['occupation'], prefix='occ_').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_dummies], axis=1)
affairs_df.head()


In [None]:
occupation_husb_dummies = pd.get_dummies(affairs_df['occupation_husb'], prefix='occ_husb_').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_husb_dummies], axis=1)
affairs_df.head()

In [None]:
# remove the appropriate columns for the feature set
affairs_X = affairs_df.drop(['affairs', 'affair_binary', 'occupation', 'occupation_husb'], axis=1)
affairs_y = affairs_df['affair_binary']

In [None]:
model = LogisticRegression()
model = model.fit(affairs_X, affairs_y)

# check the accuracy on the training set
model.score(affairs_X, affairs_y)

In [None]:
pd.DataFrame(list(zip(affairs_X.columns, model.coef_[0])), columns=['features', 'coef'])

In [None]:
# compare KNN to LR

In [None]:
knn = KNeighborsClassifier(n_neighbors=7)
cross_val_score(knn, affairs_X, affairs_y, cv=5, scoring='accuracy').mean()

In [None]:
logreg = LogisticRegression()
cross_val_score(logreg, affairs_X, affairs_y, cv=5, scoring='accuracy').mean()

In [None]:
# When we are investigating individual correlations between features and categorical responses
# Logistic regression has a good shot :)

# KNN relies on the entire n-space to make predictions while LR uses the model parameters to focus
# on one or more particular features

# LR has concept of "importance" of features

In [None]:
# Final Thought Experiment

# Why might KNN (a kind of look alike model) not perform well here?