# Classification II: Logistic Regression

**OBJECTIVES**:

- Differentiate between *Regression* and *Classification* problem settings
- Connect Least Squares methods to Classification through Logistic Regression
- Interpret coefficients of the model in terms of probabilities
- Discuss the performance of a classification model in terms of accuracy
- Understand the effect of an imbalanced target class on model performance

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer, load_digits, load_iris


### Our Motivating Example



In [None]:
default = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/Default.csv', index_col = 0)

In [None]:
default.info()

In [None]:
default.head(2)

### Visualizing Default by Continuous Features

In [None]:
#scatterplot of balance vs. income colored by default status

In [None]:
sns.scatterplot(data = default, x = 'balance', y = 'income', hue = 'default')

In [None]:
#boxplots for balance and income by default
sns.boxplot(data = default, x = 'default', y = 'balance')

In [None]:
sns.boxplot(data = default, x = 'default', y = 'income')

### Considering only `balance` as the predictor



In [None]:
#create binary default column
default['binary_default'] = np.where(default['default'] == 'No', 0, 1)

In [None]:
#scatter of Balance vs Default
sns.scatterplot(data = default, x = 'balance', y = 'binary_default')

##### PROBLEM

1. Build a `LinearRegression` model with balance as the predictor.
2. Interpret the $R^2$ score and RMSE for your regressor.
3. Predict the default for balances: `[500, 1000, 1500, 2000, 2500]`.  Do these make sense?
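
One possible way to approach this problem is sketched below. It does not use the real `Default.csv`: the data is synthetic, and the names `balance`, `p_default`, `X_demo`, and `lr` as well as the shape of the simulated relationship are assumptions made for illustration. The point is only to show what a straight line does when the target is 0 or 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the Default data (an assumption, not the real file)
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=5000)
# assumed relationship: probability of default rises with balance
p_default = 1 / (1 + np.exp(-(balance - 1900) / 200))
binary_default = rng.binomial(1, p_default)

# fit a straight line to the 0/1 target
X_demo = balance.reshape(-1, 1)
lr = LinearRegression().fit(X_demo, binary_default)

# R^2 and RMSE on the data used for fitting
r2 = lr.score(X_demo, binary_default)
rmse = np.sqrt(mean_squared_error(binary_default, lr.predict(X_demo)))
print(f'R^2: {r2:.3f}, RMSE: {rmse:.3f}')

# predictions for the requested balances; nothing keeps them between 0 and 1
new_balances = np.array([[500], [1000], [1500], [2000], [2500]])
linear_preds = lr.predict(new_balances)
print(linear_preds)
```

On this synthetic data the prediction at a balance of 500 comes out below zero, which cannot be read as a probability. That is the motivation for replacing the line with a function bounded between 0 and 1.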

In [None]:
X = default[['balance']]
y = default['binary_default']

In [None]:
#regplot


### The Sigmoid aka Logistic Function


$$y = \frac{1}{1 + e^{-x}}$$

In [None]:
#define the logistic
def logistic(x): return 1/(1 + np.exp(-x))

In [None]:
#domain
x = np.arange(-10, 10, .1)

In [None]:
#plot it
plt.plot(x, logistic(x))
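
A few properties of the logistic function are worth checking numerically before using it as a model of probability. The sketch below redefines `logistic` so that it runs on its own; the variable name `xs` is an assumption for illustration.

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

# the curve crosses 0.5 at x = 0
print(logistic(0))  # 0.5

# outputs stay strictly between 0 and 1, approaching the bounds in the tails
tails = logistic(np.array([-10.0, 10.0]))
print(tails)

# symmetry: logistic(-x) = 1 - logistic(x)
xs = np.linspace(-5, 5, 11)
symmetric = np.allclose(logistic(-xs), 1 - logistic(xs))
print(symmetric)  # True
```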

### Usage should seem familiar

Fit a `LogisticRegression` estimator from `sklearn` on the features:

```python 
X = default[['balance']]
y = default['binary_default']
```
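
As a rough sketch of the workflow (instantiate, split, fit, score), here is the same sequence on synthetic data. The data frame `demo`, the estimator name `demo_clf`, and the simulated relationship are assumptions; `max_iter=1000` is set only to give the solver extra room on an unscaled feature and is not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the Default data (an assumption)
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=2000)
p_default = 1 / (1 + np.exp(-(balance - 1900) / 200))
demo = pd.DataFrame({'balance': balance,
                     'binary_default': rng.binomial(1, p_default)})

# define X and y, then hold out a test set
X_demo = demo[['balance']]
y_demo = demo['binary_default']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=22)

# instantiate and fit on the training data only
demo_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# for a classifier, .score returns accuracy
train_acc = demo_clf.score(X_tr, y_tr)
test_acc = demo_clf.score(X_te, y_te)
print(f'Train accuracy: {train_acc:.3f}, Test accuracy: {test_acc:.3f}')
```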

In [None]:
#instantiate


In [None]:
#define X and y
X = default[['balance']]
y = default['binary_default']

In [None]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 22)

In [None]:
#fit on the train


In [None]:
#examine train and test scores


### Similarities to our earlier work

In [None]:
#there is a coefficient


In [None]:
#there is an intercept


### Where was the line?

The version of the logistic we have just developed is actually:

$$ y = \frac{e^{ax + b}}{1 + e^{ax + b}} $$

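Here $a$ is the fitted coefficient and $b$ is the fitted intercept, so the line $ax + b$ sits inside the exponent. This is the sigmoid $\frac{1}{1 + e^{-z}}$ evaluated at $z = ax + b$. Its output is the predicted probability of the positive class (default, in our example), so we can compute probabilities directly from the fitted parameters, remembering that we used the `balance` feature to predict `default`.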
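
The formula above can be checked against `predict_proba` directly. The sketch below fits a one-feature model on synthetic data (the data and the names `demo_clf`, `a`, `b`, `grid_x`, and `boundary` are assumptions) and compares the hand-computed probabilities with the estimator's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for balance vs. default (an assumption)
rng = np.random.default_rng(1)
balance = rng.uniform(0, 2500, size=2000)
p_default = 1 / (1 + np.exp(-(balance - 1900) / 200))
y_demo = rng.binomial(1, p_default)
demo_clf = LogisticRegression(max_iter=1000).fit(balance.reshape(-1, 1), y_demo)

# slope and intercept of the line inside the exponent
a = demo_clf.coef_[0, 0]
b = demo_clf.intercept_[0]

# probabilities by hand: e^(ax + b) / (1 + e^(ax + b))
grid_x = np.array([500.0, 1000.0, 1500.0, 2000.0])
line = a * grid_x + b
by_hand = np.exp(line) / (1 + np.exp(line))

# probabilities of the positive class from the estimator
from_model = demo_clf.predict_proba(grid_x.reshape(-1, 1))[:, 1]
print(np.allclose(by_hand, from_model))

# the line equals 0 where the predicted probability is exactly 0.5
boundary = -b / a
p_at_boundary = demo_clf.predict_proba(np.array([[boundary]]))[0, 1]
print(boundary, p_at_boundary)
```

The value `-b / a` is the balance at which the model switches its predicted label, since `predict` returns the positive class once the probability passes 0.5.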

In [None]:
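def predictor(x):
    #the line: coefficient * balance + intercept
    line = clf.coef_[0]*x + clf.intercept_
    #pass the line through the logistic to get a probability
    return np.exp(line)/(1 + np.exp(line))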

In [None]:
#predict 1000
predictor(1000)

In [None]:
#predict 2000
predictor(2000)

In [None]:
#estimator has this too
clf.predict_proba(np.array([[1000]]))

In [None]:
clf.predict(np.array([[1000]]))

### Using Categorical Features

In [None]:
default.head(2)

In [None]:
default['student_binary'] = np.where(default.student == 'No', 0, 1)

In [None]:
X = default[['student_binary']]

In [None]:
#instantiate and fit
clf = LogisticRegression()
clf.fit(X, y)

In [None]:
#performance
clf.score(X, y)

In [None]:
#coefficients
clf.coef_

In [None]:
#compare probabilities
clf.predict_proba(X)
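
With a single 0/1 feature, one way to read the coefficient is as a change in log-odds: exponentiating it gives the ratio of the odds for the two groups. The sketch below illustrates this on synthetic data. The default rates used to simulate it (10% and 5%) are made-up values, and the names `student`, `demo_clf`, and `odds_ratio` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: a binary feature and a binary target (rates are made up)
rng = np.random.default_rng(2)
student = rng.binomial(1, 0.3, size=5000)
y_demo = rng.binomial(1, np.where(student == 1, 0.10, 0.05))

demo_clf = LogisticRegression().fit(student.reshape(-1, 1), y_demo)

# predicted probability of the positive class for each group
p_no, p_yes = demo_clf.predict_proba(np.array([[0], [1]]))[:, 1]

# odds = p / (1 - p); the ratio of the two odds equals e^(coefficient)
odds_ratio = (p_yes / (1 - p_yes)) / (p_no / (1 - p_no))
print(odds_ratio, np.exp(demo_clf.coef_[0, 0]))
```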

### Using Multiple Features



In [None]:
default.columns

In [None]:
features = ['balance', 'income', 'student_binary']
X = default.loc[:, features]
y = default['binary_default']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

In [None]:
clf = LogisticRegression().fit(X_train, y_train)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

**Predictions**:

- student: yes
- balance: 1,500 dollars
- income: 40,000 dollars

In [None]:
ex1 = np.array([[1500, 40_000, 1]])
#predict probability


- student: no
- balance: 1,500 dollars
- income: 40,000 dollars

In [None]:
ex2 = np.array([[1500, 40_000, 0]])
#predict probability


### This is similar to our multicollinearity in regression; we will call it confounding

<center>
<img src = 'images/default_confound.png'/>
</center>
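
To make the idea concrete, here is a small simulation. It is synthetic data built on purpose so that students carry higher balances, while at a fixed balance they are less likely to default. Every number in it is an assumption chosen for illustration, as are the names `balance_k` (balance in thousands), `alone`, and `together`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
student = rng.binomial(1, 0.3, size=n)

# assumption: students tend to carry higher balances (in thousands of dollars)
balance_k = rng.normal(0.8 + 0.6 * student, 0.4)

# assumption: balance raises the log-odds of default, student status lowers it
log_odds = -6 + 4 * balance_k - 1.0 * student
y_demo = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

demo = pd.DataFrame({'student_binary': student, 'balance_k': balance_k})

# student as the only feature vs. student alongside balance
alone = LogisticRegression().fit(demo[['student_binary']], y_demo)
together = LogisticRegression().fit(demo[['student_binary', 'balance_k']], y_demo)

coef_alone = alone.coef_[0, 0]
coef_together = together.coef_[0, 0]
print(f'student coefficient alone: {coef_alone:.2f}')
print(f'student coefficient with balance: {coef_together:.2f}')
```

In this simulation the student coefficient is positive when student status is the only feature and negative once balance is included. The single-feature model credits students with the risk that actually comes from their higher balances.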

#### Using `scikit-learn` and its `Pipeline`

Starting from the original data, building a model involved:

1. One-hot (dummy) encoding the categorical feature.
2. Standard scaling the continuous features.
3. Building the logistic model.

We can accomplish all of this with a `Pipeline`, where the first step is a column transformer built with `make_column_transformer` and the second is a `LogisticRegression`.

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
# create OneHotEncoder instance
ohe = OneHotEncoder(drop = 'first')

In [None]:
# create StandardScaler instance
sscaler = StandardScaler()

In [None]:
# make column transformer
transformer = make_column_transformer((ohe, ['student']), 
                                     remainder = sscaler)

In [None]:
# logistic regressor
clf = LogisticRegression()

In [None]:
# pipeline
pipe = Pipeline([('transform', transformer), ('model', clf)])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(default[['student', 'income', 'balance']], default['default'],
                                                   random_state = 22)

In [None]:
# fit it
pipe.fit(X_train, y_train)

In [None]:
# score on train and test
print(f'Train Score: {pipe.score(X_train, y_train)}')
print(f'Test Score: {pipe.score(X_test, y_test)}')

In [None]:
pipe.named_steps['model'].coef_

#### Compare to KNN and Grid Searching

Let's compare this estimator with the `KNeighborsClassifier`.  This time, however, we will try many KNN models across different numbers of neighbors.  One way to do this is with a loop, something like:

```python
for neighbor in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors = neighbor).fit(X_train, y_train)
```



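Instead, we can use the `GridSearchCV` object from sklearn.  It takes an estimator and a dictionary of parameter values to search over, scores each combination with cross-validation (5 folds by default), and refits the best one on all of the data it was given.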

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# parameters we want to try
params = {'n_neighbors': range(1, 30, 2)}

In [None]:
# estimator with parameters
knn = KNeighborsClassifier()

In [None]:
# grid search object
grid = GridSearchCV(knn, param_grid=params)

In [None]:
# fit it
X = default[['student_binary', 'income', 'balance']]
y = default['default']
grid.fit(X, y)

In [None]:
# what was best?
grid.best_estimator_

In [None]:
# score it 
grid.score(X, y)
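
The search above is fit and scored on the same rows, so that score is an optimistic estimate. A variation worth considering is sketched below: search on the training split only, and put a scaler in front of KNN, since distances are otherwise dominated by the feature with the largest units. The data is synthetic, and the step names `scale` and `knn` and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in with the same feature names (an assumption)
rng = np.random.default_rng(5)
n = 1500
demo = pd.DataFrame({
    'student_binary': rng.binomial(1, 0.3, size=n),
    'income': rng.normal(35_000, 10_000, size=n),
    'balance': rng.uniform(0, 2500, size=n),
})
p_default = 1 / (1 + np.exp(-(demo['balance'].to_numpy() - 1900) / 200))
y_demo = rng.binomial(1, p_default)

X_tr, X_te, y_tr, y_te = train_test_split(demo, y_demo, random_state=22)

# scale, then KNN; parameters of a step are addressed as <step name>__<parameter>
knn_pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
knn_params = {'knn__n_neighbors': range(1, 30, 2)}

# cross-validate on the training data only
knn_grid = GridSearchCV(knn_pipe, param_grid=knn_params, cv=5).fit(X_tr, y_tr)

best_k = knn_grid.best_params_['knn__n_neighbors']
cv_acc = knn_grid.best_score_          # mean cross-validated accuracy of the best k
test_acc = knn_grid.score(X_te, y_te)  # accuracy on rows the search never saw
print(best_k, round(cv_acc, 3), round(test_acc, 3))
```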

#### Comparing Results

A good way to think about classifier performance is with a **confusion matrix**, which counts the observations for each combination of true label and predicted label.  Below, we visualize this using `ConfusionMatrixDisplay.from_estimator`.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# a single confusion matrix
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test, display_labels=['No', 'Yes'])

In [None]:
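# compare knn and logistic
# the grid search was fit on `student_binary`, so build matching test features for it
# note: the grid was fit on all rows, including these, so its matrix is optimistic
X_test_knn = default.loc[X_test.index, ['student_binary', 'income', 'balance']]
fig, ax = plt.subplots(1, 2, figsize = (19, 5))
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test, display_labels=['No', 'Yes'], ax = ax[0])
ax[0].set_title('Logistic')
ConfusionMatrixDisplay.from_estimator(grid, X_test_knn, y_test, display_labels=['No', 'Yes'], ax = ax[1])
ax[1].set_title('KNN')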
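
The counts behind a confusion matrix also show why accuracy can mislead when one class is rare. The sketch below uses hypothetical labels (95 negatives, 5 positives) and a "model" that always predicts the majority class; none of these numbers come from the Default data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# a "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

# rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# accuracy looks strong even though no positive is ever found
accuracy = (tn + tp) / cm.sum()
print(accuracy)  # 0.95

# share of actual positives that the model caught
caught = tp / (tp + fn)
print(caught)  # 0.0
```

When the positive class is rare, it helps to compare a model's accuracy with the share of the majority class before calling it good.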

#### Practice

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
cancer = load_breast_cancer(as_frame=True).frame

In [None]:
cancer.head(3)

In [None]:
# use all features


In [None]:
# train/test split -- random_state = 42


In [None]:
# pipeline to scale then knn


In [None]:
# pipeline to scale then logistic


In [None]:
# fit knn


In [None]:
# fit logreg


In [None]:
# compare confusion matrices on test data
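
One possible outline for the practice steps is sketched below. It is not the only reasonable solution. The pipeline step names `scale` and `model` and the variable names are assumptions, and the confusion matrices are computed as arrays with `confusion_matrix` instead of being plotted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer(as_frame=True).frame

# use all features; the label column in this frame is named 'target'
X = cancer.drop(columns='target')
y = cancer['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# scale then knn, and scale then logistic
knn_pipe = Pipeline([('scale', StandardScaler()), ('model', KNeighborsClassifier())])
log_pipe = Pipeline([('scale', StandardScaler()), ('model', LogisticRegression())])

knn_pipe.fit(X_train, y_train)
log_pipe.fit(X_train, y_train)

# confusion matrices on the held-out test data
knn_cm = confusion_matrix(y_test, knn_pipe.predict(X_test))
log_cm = confusion_matrix(y_test, log_pipe.predict(X_test))
print('KNN\n', knn_cm)
print('Logistic\n', log_cm)
```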
