To start, let's use the logistic regression model we fit in the last mission to predict the class label for each observation in the dataset and add these labels to the DataFrame in a separate column.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression

admissions = pd.read_csv("admission.csv")

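# Fit a logistic regression on GPA alone, then predict a label for every row.
model = LogisticRegression()
model.fit(admissions[["gpa"]], admissions["admit"])
labels = model.predict(admissions[["gpa"]])
admissions["predicted_label"] = labels
# Count how many observations fall into each predicted class.
print(admissions["predicted_label"].value_counts())
print(admissions.head(5))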

0    598
1     46
Name: predicted_label, dtype: int64
   admit       gpa         gre  predicted_label
0      0  3.177277  594.102992                0
1      0  3.412655  631.528607                0
2      0  2.728097  553.714399                0
3      0  3.093559  551.089985                0
4      0  3.141923  537.184894                0


# Accuracy
The simplest way to determine the effectiveness of a classification model is prediction accuracy. Accuracy helps us answer the question:
- What fraction of the predictions were correct (actual label matched predicted label)?

### Accuracy = # of Correctly Predicted / # of Observations

To decide who gets admitted, we set a threshold and accept every student whose computed probability exceeds that threshold. This threshold is called the discrimination threshold, and scikit-learn sets it to 0.5 by default when predicting labels. If the predicted probability is greater than 0.5, the label for that observation is 1. Otherwise, the label for that observation is 0.
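
The thresholding rule can be sketched in a few lines of plain Python. The helper name `label_from_probability` and the example probabilities are made up for illustration; scikit-learn applies the equivalent of this rule internally when `predict` is called on a binary model.

```python
def label_from_probability(probability, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, otherwise 0."""
    return 1 if probability > threshold else 0

# Hypothetical predicted probabilities of admission for four applicants.
probabilities = [0.12, 0.48, 0.51, 0.93]

default_labels = [label_from_probability(p) for p in probabilities]
strict_labels = [label_from_probability(p, threshold=0.9) for p in probabilities]

print(default_labels)  # [0, 0, 1, 1]
print(strict_labels)   # [0, 0, 0, 1]
```

Raising the threshold makes the rule more reluctant to predict an admission, so the labels change even though the probabilities stay the same.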

An accuracy of 1.0 means that the model predicted 100% of admissions correctly for the given discrimination threshold. An accuracy of 0.2 means that the model predicted 20% of the admissions correctly.
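
As a minimal sketch of the formula, the snippet below computes accuracy for two short, made-up lists of labels (not the admissions data). The function name `accuracy_score_simple` is hypothetical.

```python
def accuracy_score_simple(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))

# Made-up labels: three of the five predictions match the actual labels.
actual = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0]

toy_accuracy = accuracy_score_simple(actual, predicted)
print(toy_accuracy)  # 0.6
```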

In [5]:
# Rename the target column so actual and predicted labels are easy to compare.
admissions.rename(columns={"admit": "actual_label"}, inplace=True)
matches = admissions["actual_label"] == admissions["predicted_label"]
# A boolean Series can index the DataFrame directly.
correct_predictions = admissions[matches]
print(correct_predictions.head(5))
accuracy = correct_predictions.shape[0] / float(admissions.shape[0])
print(accuracy)

   actual_label       gpa         gre  predicted_label
0             0  3.177277  594.102992                0
1             0  3.412655  631.528607                0
2             0  2.728097  553.714399                0
3             0  3.093559  551.089985                0
4             0  3.141923  537.184894                0
0.645962732919


# Binary classification outcomes
To start, let's discuss the 4 different outcomes of a binary classification model:

- True Positive - The model correctly predicted that the student would be admitted.
- True Negative - The model correctly predicted that the student would be rejected.
- False Positive - The model incorrectly predicted that the student would be admitted even though the student was actually rejected.
- False Negative - The model incorrectly predicted that the student would be rejected even though the student was actually admitted.
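
One way to make the four outcomes concrete is to classify each (actual, predicted) pair directly. The sketch below uses made-up labels and a hypothetical helper named `outcome`; it is an illustration rather than part of the notebook's own workflow.

```python
from collections import Counter

def outcome(actual, predicted):
    """Name the binary classification outcome for one observation."""
    if predicted == 1:
        return "true positive" if actual == 1 else "false positive"
    return "true negative" if actual == 0 else "false negative"

# Made-up labels: 1 = admitted, 0 = rejected.
actual = [1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 0, 1]

counts = Counter(outcome(a, p) for a, p in zip(actual, predicted))
print(counts["true positive"], counts["true negative"],
      counts["false positive"], counts["false negative"])  # 2 2 1 1
```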

In [8]:
true_positives = admissions[(admissions["actual_label"] == 1) & (admissions["predicted_label"] == 1)].shape[0]
true_neagetives = admissions[(admissions["actual_label"] == 0) & (admissions["predicted_label"] == 0)].shape[0]
print(true_positives, true_neagetives)

 31 385


# Sensitivity
Let's now look at a few measures that are much more insightful than simple accuracy. Let's start with sensitivity:
- Sensitivity or True Positive Rate - The proportion of actually admitted applicants that the model correctly predicted as admitted:
### TPR = True Positives / (True Positives + False Negatives)

This measure helps us answer the question:
- How effective is this model at identifying positive outcomes?

We want a highly sensitive model that is able to "catch" all of the positive cases (in this case, the positive case is a student who was admitted).

In [10]:
false_negatives = admissions[(admissions["actual_label"] == 1) & (admissions["predicted_label"] == 0)].shape[0]
sensitivity = true_positives / float(true_positives + false_negatives)
print(sensitivity)

0.127049180328
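
This value can be cross-checked from the outputs printed earlier, assuming they all come from the same 644-row DataFrame: 598 observations were predicted as 0 and 385 of those were true negatives, so the rest must be false negatives.

```python
# Counts taken from the outputs printed earlier in this section.
predicted_negative = 598   # value_counts() for predicted_label == 0
true_positives = 31
true_negatives = 385

# Every predicted 0 is either a true negative or a false negative.
false_negatives = predicted_negative - true_negatives

sensitivity = true_positives / float(true_positives + false_negatives)
print(false_negatives)        # 213
print(round(sensitivity, 6))  # 0.127049
```

In other words, the model identifies only about 12.7% of the students who were actually admitted.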


# Specificity
Let's now learn about specificity:
- Specificity or True Negative Rate - The proportion of actually rejected applicants that the model correctly predicted as rejected:
### TNR = True Negatives / (False Positives + True Negatives)

This helps us answer the question:
- How effective is this model at identifying negative outcomes?

In [12]:
false_positives = admissions[(admissions["actual_label"] == 0) & (admissions["predicted_label"] == 1)].shape[0]
specificity = true_neagetives / float(true_neagetives + false_positives)
print(specificity)

0.9625
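
As a sanity check, the specificity and the earlier accuracy figure can both be recovered from the printed counts alone. The sketch below assumes the `value_counts()` output (46 predicted as 1, 598 predicted as 0) and the true positive and true negative counts all describe the same DataFrame.

```python
# Counts taken from the outputs printed earlier in this section.
predicted_positive = 46    # value_counts() for predicted_label == 1
predicted_negative = 598   # value_counts() for predicted_label == 0
true_positives = 31
true_negatives = 385

# Every predicted 1 is either a true positive or a false positive.
false_positives = predicted_positive - true_positives

total = predicted_positive + predicted_negative
specificity = true_negatives / float(true_negatives + false_positives)
accuracy = (true_positives + true_negatives) / float(total)

print(false_positives)     # 15
print(specificity)         # 0.9625
print(round(accuracy, 6))  # 0.645963
```

High specificity alongside low sensitivity suggests the model predicts rejection for most applicants, which is consistent with only 46 of the 644 predictions being 1.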
