# Logistic Regression
- Logistic regression uses one or more independent variables to predict a binary outcome for the dependent variable.
- It is a classification algorithm: given a set of independent variables, it predicts a binary (1 or 0) outcome, so the dependent variable is discrete.

### The Math Behind Logistic Regression
$$
\text{odds}(\theta) = \frac{p}{1 - p}
$$
Where, 
- p = probability of an event happening 
- 1-p = probability of an event not happening
- $\theta$ = odds

__Remember__
- The value of odds($\theta$) can range from 0 to $\infty$
- The probability $p$ ranges from 0 to 1
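
To make the relationship between probability and odds concrete, here is a minimal sketch that converts in both directions. The function names are illustrative and are not part of the original notebook; the inverse formula $p = \theta / (1 + \theta)$ follows directly from solving the odds equation for $p$.

```python
def odds_from_probability(p):
    """Return the odds p / (1 - p) for a probability 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert the odds formula: p = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


# a few illustrative probabilities and the odds they correspond to
for p in (0.1, 0.5, 0.8):
    odds = odds_from_probability(p)
    print(p, round(odds, 4), round(probability_from_odds(odds), 4))
```

A probability of 0.5 gives odds of 1 (the event is as likely as not), and the odds grow without bound as the probability approaches 1.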

## Building and Evaluating Logistic Regression Model 

This walkthrough uses the [Pima Indians Diabetes dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database), originally from the UCI Machine Learning Repository.

In [1]:
# read the data into a pandas DataFrame
import pandas as pd
data = pd.read_csv('../data/diabetes.csv')

In [2]:
# print the first 5 rows of data
data.head()

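|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |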


### Research Question
**Can we predict the diabetes status of a patient given their health measurements?**

In [None]:
data.columns

In [None]:
# define X and y
feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age']
X = data[feature_cols]
y = data['Outcome']

In [None]:
# examine the first few rows of `X` matrix 
X.head() 

In [None]:
# examine the first few values of the response vector `y`
y.head() 

![](../img/test_train.png)

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

In [None]:
X.shape, y.shape

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape

In [None]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

In [None]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

In [None]:
y_pred_class

In [None]:
y_test

## Model Evaluation

- Need a way to choose between models: different model types, tuning parameters, and features
- Use a **model evaluation procedure** to estimate how well a model will generalize to out-of-sample data
- Requires a **model evaluation metric** to quantify the model performance

### Model evaluation procedures

1. **Training and testing on the same data**
    - Rewards overly complex models that "overfit" the training data and won't necessarily generalize
2. **Train/test split**
    - Split the dataset into two pieces, so that the model can be trained and tested on different data
    - Better estimate of out-of-sample performance, but still a "high variance" estimate
    - Useful due to its speed, simplicity, and flexibility
3. **K-fold cross-validation**
    - Systematically create "K" train/test splits and average the results together
    - Even better estimate of out-of-sample performance
    - Runs "K" times slower than train/test split
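
As a rough illustration of how K-fold cross-validation partitions the data, the sketch below generates the train/test index lists for each fold using only the standard library. The helper `k_fold_indices` is hypothetical and uses contiguous, unshuffled folds for simplicity; in practice the data is often shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# every observation lands in the test set of exactly one fold
folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

The model is then fit once per fold on the training indices and scored on the test indices, and the K scores are averaged.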

### Model Evaluation Metrics

- **Regression problems:** Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
- **Classification problems:** Classification accuracy

## Confusion Matrix 
![](../img/cm.png)

![](../img/metrics.png)

**Classification accuracy:** percentage of correct predictions

In [None]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class

In [None]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()

In [None]:
# calculate the percentage of ones
y_test.mean()

In [None]:
# calculate the percentage of zeros
1 - y_test.mean()

In [None]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())
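
The `max(...)` expression above only works for labels coded as 0/1. A minimal sketch of a more general version, which counts the most frequent label directly, might look like this. The function name and the example labels are made up for illustration and do not come from the diabetes data.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# hypothetical labels: six negatives and four positives
example_labels = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
print(null_accuracy(example_labels))
```

Because it does not rely on the mean, the same function also handles string labels or more than two classes.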

Comparing the **true** and **predicted** response values

In [None]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

**Conclusion:**

- Classification accuracy is the **easiest classification metric to understand**
- But, it does not tell you the **underlying distribution** of response values
- And, it does not tell you what **"types" of errors** your classifier is making

## Confusion matrix
Table that describes the performance of a classification model

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))

![Small confusion matrix](../img/cm.png)

- Every observation in the testing set is represented in **exactly one box**
- It's a 2x2 matrix because there are **2 response classes**
- The format shown here is **not** universal

**Basic terminology**

- **True Positives (TP):** we *correctly* predicted that they *do* have diabetes
- **True Negatives (TN):** we *correctly* predicted that they *don't* have diabetes
- **False Positives (FP):** we *incorrectly* predicted that they *do* have diabetes (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that they *don't* have diabetes (a "Type II error")
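
The four terms can be counted by hand from paired true and predicted labels. The sketch below does this with plain Python on a small set of made-up labels (not taken from the diabetes data); `confusion_counts` is an illustrative helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# hypothetical labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 0, 0]
print(confusion_counts(y_true_demo, y_pred_demo))
```

For comparison, scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, so with 0/1 labels its layout is `[[TN, FP], [FN, TP]]`.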

In [None]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

In [None]:
print(metrics.confusion_matrix(y_test, y_pred_class))

In [None]:
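# save confusion matrix and slice into four pieces
# sklearn layout: rows are actual classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
print(TP, TN, FP, FN)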

![Large confusion matrix](../img/09_confusion_matrix_2.png)

## Metrics computed from a confusion matrix

**Classification Accuracy:** Overall, how often is the classifier correct?

In [None]:
# manually
print((TP + TN) / (TP + TN + FP + FN))

# using sklearn 
print(metrics.accuracy_score(y_test, y_pred_class))

**Classification Error:** Overall, how often is the classifier incorrect?

- Also known as "Misclassification Rate"

In [None]:
# manually
print((FP + FN) / (TP + TN + FP + FN))

# using sklearn
print(1 - metrics.accuracy_score(y_test, y_pred_class))

**Sensitivity:** When the actual value is positive, how often is the prediction correct?

- How "sensitive" is the classifier to detecting positive instances?
- Also known as "True Positive Rate" or "Recall"

In [None]:
# manually
print(TP / (TP + FN))

# using sklearn 
print(metrics.recall_score(y_test, y_pred_class))

**Specificity:** When the actual value is negative, how often is the prediction correct?

- How "specific" (or "selective") is the classifier in predicting positive instances?

In [None]:
print(TN / (TN + FP))

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [None]:
print(FP / (TN + FP))

**Precision:** When a positive value is predicted, how often is the prediction correct?

- How "precise" is the classifier when predicting positive instances?

In [None]:
print(TP / (TP + FP))
print(metrics.precision_score(y_test, y_pred_class))

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
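
As a sketch of how two of these additional metrics follow from the same four counts, the functions below compute the F1 score and the Matthews correlation coefficient from their standard definitions. The function names and the example counts are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
import math


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# hypothetical counts for illustration only
print(f1_from_counts(tp=2, fp=1, fn=2))
print(mcc_from_counts(tp=2, tn=3, fp=1, fn=2))
```

scikit-learn provides `metrics.f1_score` and `metrics.matthews_corrcoef`, which take the true and predicted labels directly.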

**Conclusion:**

- Confusion matrix gives you a **more complete picture** of how your classifier is performing
- Also allows you to compute various **classification metrics**, and these metrics can guide your model selection

**Which metrics should you focus on?**

- Choice of metric depends on your **business objective**
- **Spam filter** (positive class is "spam"): Optimize for **precision or specificity** because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
- **Fraudulent transaction detector** (positive class is "fraud"): Optimize for **sensitivity** because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)

## Adjusting the classification threshold

In [None]:
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]

In [None]:
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]

In [None]:
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]

In [None]:
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

In [None]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
plt.show() 

**Decrease the threshold** for predicting diabetes in order to **increase the sensitivity** of the classifier

In [None]:
# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]

In [None]:
# print the first 10 predicted probabilities
y_pred_prob[0:10]

In [None]:
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]

In [None]:
y_test[0:10]

In [None]:
# previous confusion matrix (default threshold of 0.5)
print(confusion)

In [None]:
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))

In [None]:
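# sensitivity at the 0.3 threshold (lowering the threshold cannot decrease it)
print(metrics.recall_score(y_test, y_pred_class))

In [None]:
# specificity at the 0.3 threshold (lowering the threshold cannot increase it)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print(tn / (tn + fp))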

**Conclusion:**

- **Threshold of 0.5** is used by default (for binary problems) to convert predicted probabilities into class predictions
- Threshold can be **adjusted** to increase sensitivity or specificity
- Sensitivity and specificity have an **inverse relationship**
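
The trade-off can be seen on a toy example. The sketch below scores made-up labels and probabilities (not the diabetes predictions) at two thresholds; `sensitivity_specificity` is an illustrative helper that predicts the positive class when the probability is strictly greater than the threshold, as `binarize` does.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when probability > threshold, then score."""
    tp = tn = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# hypothetical labels and predicted probabilities for illustration only
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob_demo = [0.05, 0.15, 0.25, 0.35, 0.40, 0.45, 0.55, 0.65, 0.70, 0.90]

for threshold in (0.5, 0.3):
    print(threshold, sensitivity_specificity(y_true_demo, y_prob_demo, threshold))
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.75 to 1.0 while specificity falls from about 0.83 to 0.5.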

## ROC Curves and Area Under the Curve (AUC)

**Question:** Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?

**Answer:** Plot the ROC curve!

In [None]:
from sklearn import metrics

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

- ROC curve can help you to **choose a threshold** that balances sensitivity and specificity in a way that makes sense for your particular context
- You can't actually **see the thresholds** used to generate the curve on the ROC curve itself
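
To show where the points on the curve come from, here is a minimal sketch that sweeps every distinct score as a threshold and records the resulting false positive and true positive rates. The helper `roc_points` and the four-observation example are illustrative only; it predicts positive when the score is greater than or equal to the threshold, and it assumes both classes are present.

```python
def roc_points(y_true, y_score):
    """Return (threshold, fpr, tpr) triples, one per distinct score."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = []
    # sweep thresholds from the highest score down to the lowest
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


# hypothetical labels and scores for illustration only
for threshold, fpr_value, tpr_value in roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]):
    print(threshold, fpr_value, tpr_value)
```

Each printed row is one point on the curve, which makes the threshold behind each point visible.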

In [None]:
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

In [None]:
evaluate_threshold(0.5)

In [None]:
evaluate_threshold(0.3)

AUC is the **percentage** of the ROC plot that is **underneath the curve**:

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))

- AUC is useful as a **single number summary** of classifier performance.
- If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a **higher predicted probability** to the positive observation.
- AUC is useful even when there is **high class imbalance** (unlike classification accuracy).
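
The pairwise interpretation of AUC can be checked directly on a small example. The sketch below compares every positive observation with every negative one and counts how often the positive receives the higher score, with ties counted as half. The function name and the data are made up for illustration.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative observation")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# hypothetical labels and predicted probabilities for illustration only
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob_demo = [0.05, 0.15, 0.25, 0.35, 0.40, 0.45, 0.55, 0.65, 0.70, 0.90]
print(pairwise_auc(y_true_demo, y_prob_demo))
```

Here 20 of the 24 positive/negative pairs are ordered correctly, giving an AUC of about 0.83. This brute-force version is quadratic in the number of observations, so it is meant for intuition rather than large datasets.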

In [None]:
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

**Confusion matrix advantages:**

- Allows you to calculate a **variety of metrics**
- Useful for **multi-class problems** (more than two response classes)

**ROC/AUC advantages:**

- Does not require you to **set a classification threshold**
- Still useful when there is **high class imbalance**