## Introduction to the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

In [2]:
admissions = pd.read_csv("admissions.csv")
model = LogisticRegression()
model.fit(admissions[["gpa"]], admissions["admit"])

LogisticRegression()

In [3]:
labels = model.predict(admissions[['gpa']])
admissions['predicted_label'] = labels

print(admissions['predicted_label'].value_counts())

admissions.head()

0    507
1    137
Name: predicted_label, dtype: int64


Unnamed: 0,admit,gpa,gre,predicted_label
0,0,3.177277,594.102992,0
1,0,3.412655,631.528607,0
2,0,2.728097,553.714399,0
3,0,3.093559,551.089985,0
4,0,3.141923,537.184894,0


## Accuracy

Accuracy helps us answer the question:

* **What fraction of the predictions were correct (actual label matched predicted label)?**

$\text{Accuracy} = \dfrac{\text{Number of Correct Predictions}}{\text{Number of Observations}}$
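
As a minimal sketch of this formula, the snippet below computes accuracy on a small made-up set of labels. The names `toy_actual`, `toy_predicted`, `toy_matches`, and `toy_accuracy` are hypothetical, and the values are not taken from `admissions.csv`. Comparing the two Series element-wise gives a boolean mask, and the share of `True` values in that mask is the accuracy.

```python
import pandas as pd

# Hypothetical labels, for illustration only (not from admissions.csv).
toy_actual = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])
toy_predicted = pd.Series([1, 0, 1, 0, 0, 1, 0, 0])

# True wherever the predicted label matches the actual label.
toy_matches = toy_actual == toy_predicted

# Accuracy = number of correct predictions / number of observations.
toy_accuracy = toy_matches.sum() / len(toy_matches)
print(toy_accuracy)
```

Six of the eight toy predictions match, so the result is 0.75. scikit-learn's `accuracy_score` in `sklearn.metrics` computes the same quantity.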

In [5]:
mapper = {'admit': 'actual_label'}
admissions.rename(mapper, axis=1, inplace=True)
matches = admissions['actual_label'] == admissions['predicted_label']

In [7]:
correct_predictions = admissions[matches]
accuracy = correct_predictions.shape[0]/admissions.shape[0]
correct_predictions.sample(5)

Unnamed: 0,actual_label,gpa,gre,predicted_label
116,0,3.220292,680.028661,0
89,0,3.036726,562.163269,0
200,0,3.055598,686.667456,0
31,0,2.957306,620.739871,0
625,1,3.532418,677.019051,1


In [8]:
accuracy

0.6847826086956522

## Binary classification outcomes

<div>
<table class="table-bordered">
<thead>
<tr>
<th>Prediction</th>
<th colspan="2"><center>Observation</center></th>
</tr>
<tr>
<th></th>
<th>Admitted (1)</th>
<th>Rejected (0)</th>
</tr>
</thead>
<tbody>
<tr>
<th>Admitted (1) </th>
<td>True Positive (TP)</td>
<td>False Positive (FP)</td>
</tr>
<tr>
<th>Rejected (0) </th>
<td>False Negative (FN)</td>
<td>True Negative (TN)</td>
</tr>
</tbody>
</table>
</div>

We can define these outcomes as:

* True Positive - The model correctly predicted that the student would be admitted.

    - Said another way, the model predicted that the label would be `Positive`, and that ended up being `True`.
    - In our case, Positive refers to being admitted and maps to the label 1 in the dataset.
    - For this dataset, a true positive is whenever predicted_label is 1 and actual_label is 1.


* True Negative - The model correctly predicted that the student would be rejected.

    - Said another way, the model predicted that the label would be `Negative`, and that ended up being `True`.
    - In our case, Negative refers to being rejected and maps to the label 0 in the dataset.
    - For this dataset, a true negative is whenever predicted_label is 0 and actual_label is 0.


* False Positive - The model incorrectly predicted that the student would be admitted even though the student was actually rejected.

    - Said another way, the model predicted that the label would be `Positive`, but that prediction was `False` (the actual label was `Negative`).
    - For this dataset, a false positive is whenever predicted_label is 1 but the actual_label is 0.


* False Negative - The model incorrectly predicted that the student would be rejected even though the student was actually admitted.

    - Said another way, the model predicted that the label would be `Negative`, but that prediction was `False` (the actual label was `Positive`).
    - For this dataset, a false negative is whenever predicted_label is 0 but the actual_label is 1.
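
The four definitions above can be sketched as a small helper that labels a single pair of values. The function name `classify_outcome` is hypothetical and is not part of pandas or scikit-learn; it assumes the same encoding as the dataset, with 1 for admitted and 0 for rejected.

```python
def classify_outcome(actual_label, predicted_label):
    """Return 'TP', 'TN', 'FP', or 'FN' for one pair of binary labels.

    Assumes 1 means admitted (Positive) and 0 means rejected (Negative).
    """
    if predicted_label == 1:
        # The model predicted Positive; it is a true positive only if the
        # student really was admitted.
        return "TP" if actual_label == 1 else "FP"
    # The model predicted Negative; it is a true negative only if the
    # student really was rejected.
    return "TN" if actual_label == 0 else "FN"


# One example pair (actual, predicted) for each of the four outcomes.
for actual, predicted in [(1, 1), (0, 0), (0, 1), (1, 0)]:
    print(actual, predicted, classify_outcome(actual, predicted))
```

The first letter of each code says whether the prediction was right, and the second letter says what the model predicted.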

## Binary classification outcomes

In [9]:
true_positives = admissions[(admissions['actual_label'] == admissions['predicted_label'])&(admissions['predicted_label']==1)].shape[0]

true_negatives = admissions[(admissions['actual_label'] == admissions['predicted_label'])&(admissions['predicted_label']==0)].shape[0]

In [10]:
print(true_positives)
print(true_negatives)

89
352
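
As an alternative sketch, `pd.crosstab` can count all four outcomes in one step instead of filtering once per outcome. The `toy` DataFrame below is made up for illustration; it only borrows the column names `actual_label` and `predicted_label` from the notebook. Rows are predictions and columns are observations, matching the layout of the outcome table earlier.

```python
import pandas as pd

# Hypothetical data, for illustration only (not from admissions.csv).
toy = pd.DataFrame({
    "actual_label":    [1, 0, 0, 1, 0, 1, 0, 0],
    "predicted_label": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows: predicted label, columns: actual label.
outcomes = pd.crosstab(toy["predicted_label"], toy["actual_label"])
print(outcomes)

# .loc[predicted, actual] picks out each cell of the table.
toy_tp = int(outcomes.loc[1, 1])  # predicted 1, actual 1
toy_fp = int(outcomes.loc[1, 0])  # predicted 1, actual 0
toy_fn = int(outcomes.loc[0, 1])  # predicted 0, actual 1
toy_tn = int(outcomes.loc[0, 0])  # predicted 0, actual 0
print(toy_tp, toy_fp, toy_fn, toy_tn)
```

The four counts always sum to the number of rows, which is a quick way to check that no observation was dropped or counted twice.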


## Sensitivity

Accuracy treats every mistake the same way, so it can hide how the model performs on each class. Let's now look at measures that separate the two kinds of error, starting with **sensitivity**:

* Sensitivity or True Positive Rate - The proportion of actually admitted applicants that the model correctly predicted as admitted:

**$TPR=\dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$**

Of all of the students that should have been admitted (True Positives + False Negatives), what fraction did the model correctly admit (True Positives)? More generally, this measure helps us answer the question:

* **How effective is this model at identifying positive outcomes?**

In our case, the positive outcome (label of `1`) is admitting a student. If the True Positive Rate is low, the model isn't effective at catching positive cases. For certain problems, high sensitivity is critical. If we're building a model to predict which patients have cancer, every patient the model misses could mean a loss of life. We want a **highly sensitive** model that is able to "catch" as many of the positive cases as possible (here, the positive case is a patient with cancer).

In [13]:
false_negatives = admissions[(admissions['actual_label'] != admissions['predicted_label']) & (admissions['predicted_label'] == 0)].shape[0]

sensitivity = true_positives/(true_positives + false_negatives)
sensitivity

0.36475409836065575

In [14]:
false_negatives

155
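
Sensitivity is the same quantity that scikit-learn calls recall. The sketch below, on made-up labels (`toy_actual` and `toy_predicted` are hypothetical and not from `admissions.csv`), computes it by hand from the true positive and false negative counts and compares the result with `recall_score`.

```python
from sklearn.metrics import recall_score

# Hypothetical labels, for illustration only (not from admissions.csv).
toy_actual = [1, 0, 0, 1, 0, 1, 0, 0]
toy_predicted = [1, 0, 1, 0, 0, 1, 0, 0]

pairs = list(zip(toy_actual, toy_predicted))

# Actual positives the model caught, and actual positives it missed.
toy_true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
toy_false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)

# TPR = TP / (TP + FN)
manual_sensitivity = toy_true_positives / (toy_true_positives + toy_false_negatives)

# recall_score treats label 1 as the positive class by default.
sklearn_sensitivity = recall_score(toy_actual, toy_predicted)

print(manual_sensitivity, sklearn_sensitivity)
```

Both values equal 2/3 here: the toy data has three actual positives and the model catches two of them.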

## Specificity

$TNR=\dfrac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}}$

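Specificity, or True Negative Rate, is the proportion of actually rejected applicants that the model correctly predicted as rejected. It helps us answer the question: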

* **How effective is this model at identifying negative outcomes?**

In [15]:
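# Denominator: all actual negatives (False Positives + True Negatives),
# computed as the total row count minus the number of actual positives.
specificity = true_negatives/(admissions.shape[0] - admissions['actual_label'].sum())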
specificity

0.88

In [17]:
false_positives = admissions[(admissions['actual_label']==0)&(admissions['predicted_label']==1)].shape[0]
false_positives

48
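
With the counts above, the formula gives 352 / (48 + 352) = 0.88, which matches the value computed from the row totals. As a sketch of another route, scikit-learn has no dedicated specificity function, but specificity is the recall of the negative class, so `recall_score` with `pos_label=0` returns it. The labels below are made up for illustration (`toy_actual` and `toy_predicted` are hypothetical names).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels, for illustration only (not from admissions.csv).
toy_actual = [1, 0, 0, 1, 0, 1, 0, 0]
toy_predicted = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_actual, toy_predicted, labels=[0, 1]).ravel()

# TNR = TN / (FP + TN)
manual_specificity = tn / (fp + tn)

# Recall of the negative class (label 0) is the same quantity.
sklearn_specificity = recall_score(toy_actual, toy_predicted, pos_label=0)

print(manual_specificity, sklearn_specificity)
```

The toy data has five actual negatives and the model labels four of them correctly, so both values are 0.8.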