https://eng.uber.com/deepeta-how-uber-predicts-arrival-times/

# Lecture Notes
### Confusion Matrix and Imbalanced Datasets

Suppose I told you my model had a 98% accuracy rate. Is the model good or bad?

The real answer is that it depends. If I told you that 98% of the dataset had the label 0 and 2% had the label 1 (highly imbalanced data), this result is no longer impressive, because your model can simply be $\hat{y} = 0$ and still get a 98% accuracy rate. Now let's look at the confusion matrix.

|            | $Y_{pred} = 0$ | $Y_{pred} = 1$ |
|------------|--------------|--------------|
|$Y_{true} = 0$|        True Negatives     |     False Positives      | 
|$Y_{true} = 1$|  False Negatives         |      True Positives       |

$$\textrm{Accuracy} = \frac{\textrm{True Negatives} + \textrm{True Positives}}{\textrm{Number of Samples}}$$

$$ \textrm{Precision} = \frac{\textrm{True Positives}}{\textrm{True Positives + False Positives}}$$
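As a minimal sketch of how the four cells and the two formulas above fit together, the following plain-Python snippet tallies a confusion matrix from label lists and computes accuracy and precision. The function names and the toy labels are made up for illustration, and returning 0 when nothing is predicted positive is an assumed convention, not something the formulas dictate.

```python
def confusion_counts(y_true, y_pred):
    """Tally (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    # (True Negatives + True Positives) / Number of Samples
    return (tn + tp) / (tn + fp + fn + tp)


def precision(tn, fp, fn, tp):
    # True Positives / (True Positives + False Positives).
    # Assumption: return 0.0 when the model never predicts 1 (0/0 case).
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Toy labels, invented for this example: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)              # (6, 2, 1, 1)
print(accuracy(*counts))   # 0.7
print(precision(*counts))  # 1/3
```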


### Ideal Model (Accuracy 100%)

|            | $Y_{pred} = 0$ | $Y_{pred} = 1$ |
|------------|--------------|--------------|
|$Y_{true} = 0$|        98     |     0      | 
|$Y_{true} = 1$|  0        |      2       |


$$ \textrm{Precision} = \frac{2}{2}=1$$
<br>
<br>

 
### Model: `y = 0` (Accuracy 98%)

|            | $Y_{pred} = 0$ | $Y_{pred} = 1$ |
|------------|--------------|--------------|
|$Y_{true} = 0$|        98     |     0      | 
|$Y_{true} = 1$|  2         |      0       |


$$ \textrm{Precision} = \frac{0}{0 + 0} = \textrm{undefined (set to 0 by convention)}$$
<br>
<br>
### Model: Flipping a Coin (Accuracy 50%)

|            | $Y_{pred} = 0$ | $Y_{pred} = 1$ |
|------------|--------------|--------------|
|$Y_{true} = 0$|        49     |     49      | 
|$Y_{true} = 1$|  1         |      1       |

$$ \textrm{Precision} = \frac{1}{1 + 49}=0.02$$


Now suppose I tell you that this data is about cancer cells:
- label 0: No Cancer
- label 1: Has Cancer

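It would seem that flipping a coin (50% accuracy) is much better than the 98% accuracy of always predicting 0: the coin flip at least detects one of the two cancer cases, while the `y = 0` model detects none.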
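The comparison between the three models can be reproduced directly from their confusion matrices. The sketch below uses the cell counts from the tables above; the dictionary layout and the `summarize` helper are illustrative choices, and precision is set to 0 when the model never predicts label 1 (an assumed convention).

```python
# Cell counts (tn, fp, fn, tp) copied from the three tables above.
models = {
    "ideal": (98, 0, 0, 2),
    "always 0": (98, 0, 2, 0),
    "coin flip": (49, 49, 1, 1),
}


def summarize(tn, fp, fn, tp):
    """Accuracy, precision, and how many cancer cases were found or missed."""
    total = tn + fp + fn + tp
    predicted_positive = tp + fp
    return {
        "accuracy": (tn + tp) / total,
        # Assumption: precision is reported as 0.0 in the 0/0 case.
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        "cancers_found": tp,
        "cancers_missed": fn,
    }


summary = {name: summarize(*cells) for name, cells in models.items()}
for name, row in summary.items():
    print(name, row)
```

The `always 0` row has the higher accuracy but finds no cancer cases, while the `coin flip` row finds one of the two.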

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('german_credit.csv')

In [6]:
data["default"].value_counts(normalize=True)

0    0.7
1    0.3
Name: default, dtype: float64

In [11]:
data['default'].value_counts(normalize=True).round(3)

index_train, index_valid = train_test_split(data.index, train_size=0.7, random_state=189)
train = data.loc[index_train,:].copy()
valid = data.loc[index_valid,:].copy()

train['default'].value_counts(normalize=True).round(3)

0    0.679
1    0.321
Name: default, dtype: float64

In [12]:
valid['default'].value_counts(normalize=True).round(3)

0    0.75
1    0.25
Name: default, dtype: float64

# 2. Business understanding

The data documentation specifies the following loss matrix: 

<table>
  <tr>
    <th>Actual/ Predicted</th>
    <th>Repayment</th>
     <th>Default</th>
  </tr>
  <tr>
    <th>Repayment</th>
    <td>0</td>
    <td>1</td>
  </tr>
  <tr>
    <th>Default</th>
    <td>5</td>
    <td>0</td>
  </tr>
</table>

That is, if we predict a default but the client is creditworthy, the loss is 1. If we predict that the client will repay the loan but there is a default, the loss is 5. The loss for a correct classification is 0. Using general classification terminology, we say that the loss from a false positive is 1, the loss from a false negative is 5, and the loss from both true positives and true negatives is zero.

Here we essentially say that
1. If the true value was default and we predict repayment, the penalty is 5 (false negative).
2. If the true value was repayment and we predict default, the penalty is 1 (false positive).

Using the formula from the lecture, the optimal decision threshold is
$$\tau = \frac{\textrm{False Positive Loss}}{\textrm{False Positive Loss} + \textrm{False Negative Loss}} = \frac{1}{1 + 5} \approx 0.167$$
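One way to see where this threshold comes from is to compare the expected loss of each decision. This is a sketch under the assumption that we pick whichever prediction has the smaller expected loss; $p$ (the predicted probability of default), $L_{FP}$ and $L_{FN}$ (the false positive and false negative losses) are shorthand introduced here.

$$
\begin{aligned}
\mathbb{E}[\textrm{loss} \mid \textrm{predict default}] &= (1-p)\, L_{FP} \\
\mathbb{E}[\textrm{loss} \mid \textrm{predict repayment}] &= p\, L_{FN} \\
(1-p)\, L_{FP} \le p\, L_{FN} &\iff p \ge \frac{L_{FP}}{L_{FP} + L_{FN}} = \frac{1}{1 + 5} \approx 0.167
\end{aligned}
$$

So predicting default is the cheaper choice exactly when the predicted probability of default is at least $1/6$.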

First recall that our model does not predict 0 or 1. Rather, it predicts a probability value, i.e. a number between 0 and 1. The modeller then needs to set a threshold at which we classify things as 0 or 1.

In particular, here our model is predicting if someone defaults. So if your model prediction was $\hat{Y}=0.2$ then
$$\mathbb{P}[\textrm{Customer Default}] = 0.2$$
Again, we are not classifying anything yet, we are only giving a probability. Now depending on the threshold you set the classification will be different.

So if the threshold is $\tau=0.167$, a model prediction of $\hat{Y}=0.2 \geq \tau$ is classified as $\textrm{Default}$.

However, if the threshold is $\tau=0.25$, the same prediction $\hat{Y}=0.2 < \tau$ is classified as $\textrm{Not Default}$.

Essentially, the threshold sets the level of confidence you require your model to have. A low threshold means that the model requires very little confidence to predict default, while a high threshold means that it requires high confidence to predict default.
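The effect of the threshold on a single prediction can be sketched in a few lines of Python. The helper names `optimal_threshold` and `classify` are hypothetical, and treating a probability exactly equal to the threshold as a default is an assumed tie-breaking rule.

```python
def optimal_threshold(loss_fp, loss_fn):
    """Threshold from the loss matrix: loss_fp / (loss_fp + loss_fn)."""
    return loss_fp / (loss_fp + loss_fn)


def classify(prob_default, threshold):
    # Assumption: a probability equal to the threshold counts as a default.
    return "Default" if prob_default >= threshold else "Not Default"


tau = optimal_threshold(loss_fp=1, loss_fn=5)
print(round(tau, 3))        # 0.167
print(classify(0.2, tau))   # Default
print(classify(0.2, 0.25))  # Not Default
```

The same predicted probability of 0.2 flips from `Default` to `Not Default` purely because the required confidence was raised.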

### Consider Court

    It is better that ten guilty persons escape than that one innocent suffer
    - Blackstone's ratio

<table>
  <tr>
    <th>Actual/ Predicted</th>
    <th>Innocent</th>
     <th>Guilty</th>
  </tr>
  <tr>
    <th>Innocent</th>
    <td>0</td>
    <td>10</td>
  </tr>
  <tr>
    <th>Guilty</th>
    <td>1</td>
    <td>0</td>
  </tr>
</table>

The decision threshold is 
$$\frac{10}{11}\approx 91\%$$

That is, if $\hat{Y} < 91\% \to \textrm{Innocent}$ and $\hat{Y} \geq 91\% \to \textrm{Guilty}$. This is what we mean when we say "guilty beyond a reasonable doubt".

# 9. Validation Results

<table>
  <tr>
    <th>Actual/ Predicted</th>
    <th>Repayment</th>
     <th>Default</th>
  </tr>
  <tr>
    <th>Repayment</th>
    <td>0</td>
    <td>1</td>
  </tr>
  <tr>
    <th>Default</th>
    <td>5</td>
    <td>0</td>
  </tr>
</table>

Equivalently


|            | $Y_{pred} = 0$ | $Y_{pred} = 1$ |
|------------|--------------|--------------|
|$Y_{true} = 0$|        0     |     1      | 
|$Y_{true} = 1$|  5         |      0       |
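To turn a confusion matrix into a single validation score under this loss matrix, each cell count can be weighted by its loss. The following is a rough sketch: the two baseline matrices assume 100 validation rows split 75/25 as in the validation proportions shown earlier, and `hypothetical_model` is an invented confusion matrix, not a result from any fitted model.

```python
# Loss matrix: rows are the true label, columns the predicted label
# (0 = repayment, 1 = default).
LOSS = [[0, 1],
        [5, 0]]


def total_loss(confusion, loss=LOSS):
    """Sum of cell count times cell loss over the 2x2 matrix."""
    return sum(confusion[i][j] * loss[i][j] for i in range(2) for j in range(2))


def average_loss(confusion, loss=LOSS):
    n = sum(sum(row) for row in confusion)
    return total_loss(confusion, loss) / n


# Baselines per 100 validation rows (75 repayments, 25 defaults).
always_repay = [[75, 0], [25, 0]]
always_default = [[0, 75], [0, 25]]
# Invented counts, for illustration only.
hypothetical_model = [[50, 25], [5, 20]]

print(average_loss(always_repay))        # 1.25
print(average_loss(always_default))      # 0.75
print(average_loss(hypothetical_model))  # 0.5
```

Under this loss matrix, always predicting default is cheaper than always predicting repayment, so a useful model has to beat an average loss of 0.75 rather than a particular accuracy.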