# Classification II: Logistic Regression

**OBJECTIVES**:

- Differentiate between *Regression* and *Classification* problem settings
- Connect Least Squares methods to Classification through Logistic Regression
- Interpret coefficients of the model in terms of probabilities
- Discuss the performance of a classification model in terms of accuracy
- Understand the effect of an imbalanced target class on model performance

### Classification Problems as Predicting a Categorical Target Feature



In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer, load_digits, load_iris

### Some example datasets in sklearn for classification
<center>
<img src = "https://archive.ics.uci.edu/ml/assets/MLimages/Large14.jpg">
</center>

In [None]:
cancer = load_breast_cancer()
#print(cancer.DESCR)
#think of the dataset as a class with attributes
#.data
#.target
#.feature_names
X = cancer.data
y = cancer.target

In [None]:
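#create a dataframe
cancer_df = pd.DataFrame(X, columns = cancer.feature_names)
#sklearn encodes target 0 as malignant and 1 as benign (see cancer.target_names)
cancer_df['malignant'] = (cancer.target == 0).astype(int)
cancer_df.head(2)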

In [None]:
#mean radius by malignance
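
One way to fill in this cell is a `groupby` on the class label. The sketch below rebuilds the dataframe so that it runs on its own, and it assumes the flag is 1 for malignant tumors (sklearn's raw target uses 0 for malignant and 1 for benign).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# sklearn encodes target 0 as malignant and 1 as benign (see cancer.target_names),
# so build the malignant flag explicitly
cancer_df['malignant'] = (cancer.target == 0).astype(int)

# average of the 'mean radius' feature within each class
radius_by_class = cancer_df.groupby('malignant')['mean radius'].mean()
print(radius_by_class)
```

A clear gap between the two group means suggests that `mean radius` carries signal about the class, which is what makes it a useful predictor.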


### Our Motivating Example



In [None]:
default = pd.read_csv('data/Default.csv', index_col = 0)

In [None]:
default.info()

In [None]:
default.head(2)

### Visualizing Default by Continuous Features

In [None]:
#scatterplot of balance vs. income colored by default status

In [None]:
#boxplots for balance and income by default


### Considering only `balance` as the predictor



In [None]:
#create binary default column
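
A minimal sketch of the encoding step, mirroring the `np.where` pattern used later for `student`. It assumes the raw column is named `default` and holds the strings `'Yes'` and `'No'`; the three rows in `toy` are made up so the example runs without `data/Default.csv`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in rows; the notebook itself reads data/Default.csv
toy = pd.DataFrame({'default': ['No', 'Yes', 'No'],
                    'balance': [729.5, 2100.0, 1073.5]})

# 1 where the customer defaulted, 0 otherwise
toy['binary_default'] = np.where(toy['default'] == 'Yes', 1, 0)
print(toy)
```

On the real data the same line would read `default['binary_default'] = np.where(default['default'] == 'Yes', 1, 0)`, again assuming that column name.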


In [None]:
#scatter of Balance vs Default


##### PROBLEM

1. Build a `LinearRegression` model with balance as the predictor.
2. Interpret the $r^2$ score and $rmse$ for your regressor.
3. Predict the default for balances: `[500, 1000, 1500, 2000, 2500]`.  Do these make sense?
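
To see why part 3 is worth asking, here is a rough sketch on synthetic data, not the real Default file. It assumes an artificial rule where every balance above 1800 defaults, then fits a line to the 0/1 target and predicts at the requested balances.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-in (assumed rule): default occurs whenever balance exceeds 1800
balance = np.linspace(0, 2500, 200).reshape(-1, 1)
binary_default = (balance.ravel() > 1800).astype(int)

lr = LinearRegression().fit(balance, binary_default)

# predict at the balances from the problem statement
new_balances = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)
preds = lr.predict(new_balances)
print(preds.round(2))
```

A straight line is unbounded, so some predictions fall below 0 and cannot be read as probabilities. That is the motivation for squeezing the line through a function whose output stays between 0 and 1.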

### The Sigmoid aka Logistic Function


$$y = \frac{1}{1 + e^{-x}}$$

In [None]:
#define the logistic


In [None]:
#domain


In [None]:
#plot it
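
The three cells above could be filled in along these lines. This is a sketch: the function name `logistic` and the domain `[-10, 10]` are choices made here, not fixed by the notebook.

```python
import numpy as np

def logistic(x):
    """Sigmoid: maps any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-x))

# domain: evenly spaced inputs centered at zero
x = np.linspace(-10, 10, 201)
y = logistic(x)

print(logistic(0))        # midpoint of the curve
print(y.min(), y.max())   # both ends stay strictly between 0 and 1
```

Calling `plt.plot(x, y)` on these arrays draws the familiar S-shaped curve.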


### Usage should seem familiar

Fit a `LogisticRegression` estimator from `sklearn` using the following feature matrix and target:

```python 
X = default[['balance']]
y = default['binary_default']
```

In [None]:
#instantiate


In [None]:
#define X and y


In [None]:
#train test split


In [None]:
#fit on the train


In [None]:
#examine train and test scores
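
The five steps above might look like the following sketch. Because `data/Default.csv` is not available here, it runs on a synthetic stand-in whose default probability rises with balance; the generating coefficients and `max_iter=1000` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the Default data (assumed; not the real file)
rng = np.random.default_rng(42)
balance = rng.uniform(0, 2500, size=1000)
true_prob = 1 / (1 + np.exp(-(0.005 * balance - 9)))
toy = pd.DataFrame({'balance': balance,
                    'binary_default': rng.binomial(1, true_prob)})

# instantiate (extra iterations because balance is unscaled)
clf = LogisticRegression(max_iter=1000)

# define X (2D) and y (1D)
X = toy[['balance']]
y = toy['binary_default']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit on the train
clf.fit(X_train, y_train)

# examine train and test scores; for classifiers .score is mean accuracy
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(train_acc, test_acc)
```

Since `.score` reports accuracy, it is worth comparing against the share of the majority class: with few defaults, always predicting "no default" already scores high.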


### Similarities to our earlier work

In [None]:
#there is a coefficient


In [None]:
#there is an intercept


### Where was the line?

The logistic model we just fit composes the sigmoid with a line, where $a$ is the fitted coefficient and $b$ is the fitted intercept:

$$ y = \frac{e^{ax + b}}{1 + e^{ax + b}} $$

Its output is the estimated probability of the positive class, which in our example is default.  Because we used the `balance` feature to predict `default`, we can plug a balance in for $x$ and read off a probability of default directly from the fitted parameters.
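
As a short worked step, this form agrees with the sigmoid defined earlier once its input is the line $ax + b$. Multiplying numerator and denominator by $e^{-(ax + b)}$ gives:

$$ y = \frac{e^{ax + b}}{1 + e^{ax + b}} = \frac{1}{1 + e^{-(ax + b)}} $$

Solving for the line answers the question in the heading. Since $1 - y = \frac{1}{1 + e^{ax + b}}$, the odds are $\frac{y}{1 - y} = e^{ax + b}$, so:

$$ \log\left(\frac{y}{1 - y}\right) = ax + b $$

The line is therefore the log-odds of the positive class: each one-unit increase in $x$ adds $a$ to the log-odds, or equivalently multiplies the odds by $e^{a}$.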

In [None]:
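def predictor(x):
    #the fitted line gives the log-odds of default for a balance x
    line = clf.coef_[0]*x + clf.intercept_
    #pass the log-odds through the logistic to get a probability
    return np.exp(line)/(1 + np.exp(line))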

In [None]:
#predict 1000


In [None]:
#predict 2000


In [None]:
#estimator has this too
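
The estimator's built-in counterpart is `predict_proba`. The sketch below checks that it agrees with the hand-written formula; it refits on synthetic stand-in data (an assumption, since the real file is not loaded here) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (assumed): default becomes likelier as balance grows
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000).reshape(-1, 1)
true_prob = 1 / (1 + np.exp(-(0.005 * balance.ravel() - 9)))
y = rng.binomial(1, true_prob)

clf = LogisticRegression(max_iter=1000).fit(balance, y)

def predictor(x):
    # the fitted line a*x + b is the log-odds; the logistic turns it into a probability
    line = clf.coef_[0, 0] * x + clf.intercept_[0]
    return np.exp(line) / (1 + np.exp(line))

new_balances = np.array([[1000], [2000]])
by_hand = predictor(new_balances.ravel())

# predict_proba returns one column per class; column 1 is the positive class
from_sklearn = clf.predict_proba(new_balances)[:, 1]
print(by_hand, from_sklearn)
```

The two arrays should match up to floating point error, which confirms that the coefficient and intercept fully describe the fitted curve.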


### Using Categorical Features

In [None]:
default.head(2)

In [None]:
default['student_binary'] = np.where(default.student == 'No', 0, 1)

In [None]:
#sklearn expects a 2D feature matrix, so select with a list of column names
X = default[['student_binary']]

In [None]:
#instantiate and fit


In [None]:
#performance


In [None]:
#coefficients


In [None]:
#compare probabilities


### Using Multiple Features



In [None]:
default.columns

In [None]:
features = ['balance', 'income', 'student_binary']
X = default.loc[:, features]
y = default['binary_default']

**Predictions**:

- student: yes
- balance: 1,500 dollars
- income: 40,000 dollars

- student: no
- balance: 1,500 dollars
- income: 40,000 dollars
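
The two profiles above can be scored by passing a two-row dataframe to `predict_proba`. This sketch only illustrates the mechanics: the training data is synthetic, with student status and income generated as pure noise, and the `StandardScaler` pipeline is a choice made here because income and balance differ by orders of magnitude.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ['balance', 'income', 'student_binary']

# synthetic stand-in for the Default data (assumed values, not the real file)
rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    'balance': rng.uniform(0, 2500, size=n),
    'income': rng.uniform(10_000, 70_000, size=n),
    'student_binary': rng.integers(0, 2, size=n),
})
true_prob = 1 / (1 + np.exp(-(0.005 * toy['balance'].to_numpy() - 9)))
toy['binary_default'] = rng.binomial(1, true_prob)

# scale the features, then fit the logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(toy[features], toy['binary_default'])

# the two profiles from the list: identical except for student status
profiles = pd.DataFrame({
    'balance': [1500, 1500],
    'income': [40_000, 40_000],
    'student_binary': [1, 0],
}, index=['student', 'non-student'])

probs = model.predict_proba(profiles[features])[:, 1]
print(dict(zip(profiles.index, probs.round(3))))
```

On this toy data the two probabilities say nothing about real borrowers. On the actual Default data, the comparison is the interesting part: it shows how student status shifts the predicted probability once balance and income are held fixed.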

### This is similar to our multicollinearity in regression; we will call it confounding

<center>
<img src = 'images/default_confound.png'/>
</center>

In [None]:
b_sort = default.sort_values(by = 'balance')
students = b_sort.loc[b_sort['student_binary'] == 1]
non_students = b_sort.loc[b_sort['student_binary'] == 0]
num_defaults = b_sort['binary_default'].sum()

In [None]:
plt.plot(students['balance'], students['binary_default'].cumsum()/students['binary_default'].sum(), label = 'students')
plt.plot(non_students['balance'], non_students['binary_default'].cumsum()/non_students['binary_default'].sum(), label = 'non-students')
plt.title('Confounding in the Default Data')
plt.xlabel('Balance')
#the curves are running totals of defaults divided by each group's total defaults
plt.ylabel('Cumulative Share of Defaults')
plt.grid()
plt.xlim(0, 2300);
plt.legend();
plt.savefig('images/default_confound.png')