In [None]:
Logistic Regression
Agenda
Refresh your memory on how to do linear regression in scikit-learn
Attempt to use linear regression for classification
Show you why logistic regression is a better alternative for classification
Brief overview of probability, odds, e, log, and log-odds
Explain the form of logistic regression
Explain how to interpret logistic regression coefficients
Compare logistic regression with other models
Part 1: Predicting a Continuous Response

In [None]:
# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass['assorted'] = glass.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})

In [None]:
glass.head()

Suppose we want to predict ri, and our only feature is al. How would we do it using machine learning? We would frame it as a regression problem and use a linear regression model with al as the only feature and ri as the response.

How would we visualize this model? Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.lmplot(x='al', y='ri', data=glass, ci=None)

In [None]:
If we had an al value of 2, what would we predict for ri? Roughly 1.517.

Exercise: Draw this plot without using Seaborn.

In [None]:
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')

In [None]:
# scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)

In [None]:
# fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)

In [None]:
# look at the coefficients to get the equation for the line, but then how do you plot the line?
print(linreg.intercept_)
print(linreg.coef_)

In [None]:
# you could make predictions for arbitrary points, and then plot a line connecting them
# (predict expects a 2D input: one row per sample, one column per feature)
print(linreg.predict([[1]]))
print(linreg.predict([[2]]))
print(linreg.predict([[3]]))

In [None]:
# or you could make predictions for all values of X, and then plot those predictions connected by a line
ri_pred = linreg.predict(X)
plt.plot(glass.al, ri_pred, color='red')

In [None]:
# put the plots together
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, ri_pred, color='red')

In [None]:
Refresher: interpreting linear regression coefficients
Linear regression equation: $y = \beta_0 + \beta_1x$
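With a single feature, the ordinary least squares estimates have a closed form: $\beta_1$ is the covariance of $x$ and $y$ divided by the variance of $x$, and $\beta_0 = \bar{y} - \beta_1\bar{x}$. The sketch below works this out in plain Python on made-up numbers (not the glass data); the function name `fit_simple_ols` is hypothetical and only for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# made-up data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

beta_0, beta_1 = fit_simple_ols(xs, ys)

# a prediction is the equation evaluated at a given x
prediction_at_2 = beta_0 + beta_1 * 2
print(beta_0, beta_1, prediction_at_2)
```

Because the made-up points sit exactly on a line, the recovered intercept and slope match the line that generated them.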

In [None]:
# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2

In [None]:
# compute prediction for al=2 using the predict method (input must be 2D)
linreg.predict([[2]])

In [None]:
# examine coefficient for al
list(zip(feature_cols, linreg.coef_))

Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.

In [None]:
# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243

In [None]:
# compute prediction for al=3 using the predict method (input must be 2D)
linreg.predict([[3]])

In [None]:
Part 2: Predicting a Categorical Response
Let's change our task, so that we're predicting assorted using al. Let's visualize the relationship to figure out how to do this:

In [None]:
plt.scatter(glass.al, glass.assorted)

In [None]:
Let's draw a regression line, like we did before:

In [None]:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
linreg.fit(X, y)
assorted_pred = linreg.predict(X)

In [None]:
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred, color='red')

In [None]:
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])

In [None]:
# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')
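For comparison, the same element-by-element choice can be written without NumPy. This is only a plain-Python sketch of the idea; it returns a list rather than a NumPy array.

```python
nums = [5, 15, 8]

# pick 'big' where the condition holds and 'small' where it does not
labels = ['big' if n > 10 else 'small' for n in nums]
print(labels)
```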

In [None]:
# examine the predictions
assorted_pred[:10]

In [None]:
# transform predictions to 1 or 0
assorted_pred_class = np.where(assorted_pred >= 0.5, 1, 0)
assorted_pred_class

In [None]:
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')

In [None]:
# add predicted class to DataFrame
glass['assorted_pred_class'] = assorted_pred_class

# sort DataFrame by al (so that the line plot connects points from left to right)
glass.sort_values('al', inplace=True)

In [None]:
# plot the class predictions again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_class, color='red')
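One limitation of thresholding a linear regression is that a straight line is unbounded, so its raw output cannot be read as a probability. The sketch below uses a hypothetical intercept and slope (not the values fitted to the glass data) to show predictions falling below 0 and above 1, even though the 0.5 cutoff still yields class labels.

```python
# hypothetical coefficients, chosen only for illustration
beta_0 = -0.5
beta_1 = 0.5

def linear_prediction(al):
    """Evaluate the line beta_0 + beta_1 * al."""
    return beta_0 + beta_1 * al

for al in [0.5, 2.0, 3.5]:
    pred = linear_prediction(al)
    # same 0.5 cutoff used for assorted_pred_class
    pred_class = 1 if pred >= 0.5 else 0
    print(al, pred, pred_class)
```

The smallest and largest inputs produce values outside the range from 0 to 1, which is the motivation for a model whose output is bounded.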

Part 3: Using Logistic Regression Instead
Logistic regression can do what we just did:

In [None]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
logreg.fit(X, y)
assorted_pred_class = logreg.predict(X)


In [None]:
# print the class predictions
assorted_pred_class

In [None]:
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')

What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?

In [None]:
# store the predicted probabilities of class 1
assorted_pred_prob = logreg.predict_proba(X)[:, 1]

In [None]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')

In [None]:
# examine some example predictions (input must be 2D)
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))

In [None]:
What do these numbers mean? In each row, the first column is the predicted probability of class 0, and the second column is the predicted probability of class 1. The two values in a row sum to 1.
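For a two-class problem, the class 1 probability comes from passing the linear combination $\beta_0 + \beta_1x$ through the logistic function, and the class 0 probability is its complement. The sketch below reproduces that two-column layout in plain Python. The coefficients are hypothetical (not the values fitted to the glass data), and `predict_proba_row` is an illustrative helper, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical coefficients, chosen only for illustration
beta_0 = -8.0
beta_1 = 4.0

def predict_proba_row(al):
    """Return [P(class 0), P(class 1)] for one value of al."""
    p1 = sigmoid(beta_0 + beta_1 * al)
    return [1.0 - p1, p1]

for al in [1, 2, 3]:
    p0, p1 = predict_proba_row(al)
    print(al, round(p0, 3), round(p1, 3))
```

With these made-up coefficients the linear combination is zero at al=2, so both classes get probability 0.5 there; smaller al values favor class 0 and larger ones favor class 1.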

In [None]:
https://github.com/justmarkham/DAT7/blob/master/notebooks/11_logistic_regression.ipynb