Plan --> Acquire --> Prepare --> Explore --> Model --> Deliver

# Why do we evaluate?

- we have to quantify the model's performance
- so we can compare it to other models (decision tree vs. k-nearest neighbors)

## Vocab:

- **Classifier**
    - Binary
    - Multi-Class
    
- **Evaluation Metric**: how we put a number on classifier's performance

- **Label/Target/outcome**: the thing we are trying to predict

- **Actual and Predicted Values**:
    - Actual: ground truth from dataset, known ahead of time
    - Predicted: the value the ML model outputs
    - Compare actual and predicted to see how close they match
    
- **Classification Outcomes**:
    - True Positive (TP): predict positive and it's positive
    - True Negative (TN): predict negative and it's negative
    - False Positive (FP): predict positive but it's negative
    - False Negative (FN): predict negative but it's positive
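
The four outcomes can be made concrete with a small sketch. The helper name `classify_outcome` and the `'+'`/`'-'` labels are assumptions chosen for illustration; they are not part of the original notes.

```python
def classify_outcome(actual, predicted, positive='+'):
    """Return 'TP', 'TN', 'FP', or 'FN' for one actual/predicted pair."""
    if predicted == positive:
        # We predicted positive: right if the actual value is positive too
        return 'TP' if actual == positive else 'FP'
    # We predicted negative: wrong if the actual value was positive
    return 'FN' if actual == positive else 'TN'

# One illustrative (actual, predicted) pair for each outcome
pairs = [('+', '+'), ('-', '-'), ('-', '+'), ('+', '-')]
outcomes = [classify_outcome(a, p) for a, p in pairs]
print(outcomes)  # ['TP', 'TN', 'FP', 'FN']
```

The first letter says whether the prediction was right (True/False); the second says what was predicted (Positive/Negative).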

## HOW TO CALCULATE HOW GOOD OUR CLASSIFIER IS

### Focus on:

- **Accuracy**: how well the model does overall (the share of predictions you get right!)
    - (TP + TN) / (TP + TN + FP + FN)

- **Recall**: how many of the actually positive cases did we catch
     - TP/ (TP + FN)
     - sensitivity: aka recall
     - specificity: recall for the negative class, TN / (TN + FP)
    
- **Precision**: of the cases we predicted positive, how many were actually positive
     - TP/ (TP+FP)
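
As a minimal sketch, the formulas above translate directly into functions of the four outcome counts. The function names and the counts used here are hypothetical, made up only to show the arithmetic.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Share of actual positives that were caught (sensitivity)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that were caught (recall for the negative class)."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

# Hypothetical counts, made up for illustration only
tp, tn, fp, fn = 40, 45, 5, 10

print(f'accuracy:    {accuracy(tp, tn, fp, fn):.3f}')  # 0.850
print(f'recall:      {recall(tp, fn):.3f}')            # 0.800
print(f'specificity: {specificity(tn, fp):.3f}')       # 0.900
print(f'precision:   {precision(tp, fp):.3f}')         # 0.889
```

Note that precision and recall share the same numerator (TP) and differ only in which kind of mistake appears in the denominator: FP for precision, FN for recall.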

_________________________

# Example:
Predict the Weather
- rain (+)
- no rain (-)

- **TP**: bring an umbrella for rain, it rains
- **TN**: no umbrella, no rain
- **FP**: bring an umbrella, no rain
- **FN**: no umbrella, it rains

- **Taking action**: associated with the positive value
    - (ex) bring an umbrella

__________________________

#### 7 day weather prediction model
- Actual:       +, +, -, +, -, -, +
- Predicted:    -, +, +, +, +, -, +

<br>

- Accuracy: 4/7 correct, 57% accurate
- Precision: 3/5, 60% precise
- Recall: 3/4, 75% recall


#### For this model, recall matters most: you don't want to get wet, but you don't mind bringing an umbrella on a day when it doesn't rain.

_________________________

## Example:
### In a model that predicts (+) EVERY time

- Actual:       +, +, -, +, -, -, +
- Predicted:    +, +, +, +, +, +, +

<br>

- Accuracy: 4/7 correct, 57% accurate
- Precision: 4/7, 57% precise
- Recall: 4/4, 100% recall    

_____________________________

## Example:
### In a model that predicts (-) EVERY time

- Actual:       +, +, -, +, -, -, +
- Predicted:    -, -, -, -, -, -, -
    

<br>

- Accuracy: 3/7 correct, 43% accurate
- Precision: 0/0, undefined
- Recall: 0/4, 0% recall
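
The three weather examples can be checked with a short sketch in plain Python. The `evaluate` helper and the model names are assumptions introduced for illustration; returning `None` when a denominator is zero is one possible way to represent an undefined metric.

```python
actual = ['+', '+', '-', '+', '-', '-', '+']
models = {
    'weather model': ['-', '+', '+', '+', '+', '-', '+'],
    'always +': ['+'] * 7,
    'always -': ['-'] * 7,
}

def evaluate(actual, predicted, positive='+'):
    """Return (accuracy, precision, recall); a metric is None when undefined."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against 0/0: no positive predictions means precision is undefined
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall

results = {name: evaluate(actual, preds) for name, preds in models.items()}
for name, (acc, prec, rec) in results.items():
    prec_text = 'undefined' if prec is None else f'{prec:.0%}'
    rec_text = 'undefined' if rec is None else f'{rec:.0%}'
    print(f'{name:14} accuracy={acc:.0%} precision={prec_text} recall={rec_text}')
# weather model  accuracy=57% precision=60% recall=75%
# always +       accuracy=57% precision=57% recall=100%
# always -       accuracy=43% precision=undefined recall=0%
```

The always-positive model matches the real model on accuracy and beats it on recall, which is why a single metric is rarely enough to judge a classifier.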

_________

## Scenario:
- you are bringing coffee to a meeting
- need to predict if each person at the meeting wants coffee or not


- Lola: really good coffee, but super expensive
    - cost of a FP is higher than the cost of a FN
    - precision is the better metric here, because buying a cup of coffee for someone who won't drink it is expensive
    - we want to be sure about our positive predictions

- Taco Cabana: bad coffee, but cheap
    - cost of a FN is higher than the cost of a FP
    - recall is the better metric here: the coffee is cheap, so buying one for someone who won't drink it costs little; it's worse to skip someone who wanted coffee

- meeting with a super important client
    - cost of a FN is higher, because they might be offended if we don't get them coffee
    - cost of a FN == not signing a contract
    - recall is the better metric here

### Walk through the steps:

- **Step 1**. list out the possible outcomes
- **Step 2**. spec out what the outcomes mean

<br>

- FP: Buy a coffee for someone who won't drink it
- FN: Don't buy a coffee for someone who wanted one
- TP: Buy a coffee for someone who will drink it
- TN: Don't buy a coffee for someone who wouldn't drink it anyway
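
One way to make the cost reasoning in this scenario explicit is to attach a price to each kind of mistake and compare strategies. This is only a sketch: the cost values, the error counts, and the names `total_cost`, `cautious`, and `generous` are all hypothetical and not from the original notes.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a set of mistakes, given a price for each FP and FN."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical strategies for the same meeting
cautious = {'fp': 0, 'fn': 3}   # buys only when sure: no wasted cups, some people go without
generous = {'fp': 3, 'fn': 0}   # buys for nearly everyone: some wasted cups, nobody goes without

# Hypothetical costs in made-up units
expensive_shop = {'cost_fp': 6, 'cost_fn': 2}  # a wasted cup hurts more than a missed one
cheap_shop = {'cost_fp': 1, 'cost_fn': 2}      # a missed cup hurts more than a wasted one

for shop_name, costs in [('expensive', expensive_shop), ('cheap', cheap_shop)]:
    cautious_cost = total_cost(**cautious, **costs)
    generous_cost = total_cost(**generous, **costs)
    print(f'{shop_name}: cautious={cautious_cost}, generous={generous_cost}')
# expensive: cautious=6, generous=18
# cheap: cautious=6, generous=3
```

With expensive coffee the cautious, few-false-positives strategy is cheaper, which corresponds to favoring precision; with cheap coffee the generous, few-false-negatives strategy wins, which corresponds to favoring recall.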

______________________

### Scenario 2:  
Build a classifier to predict whether a given face should unlock the iPhone.

- What is the positive and negative case?
    - positive: phone unlocking
    - negative: phone staying locked
    
<br>

- What are the possible outcomes?
    - FP: Phone unlocks but it shouldn't
    - FN: Phone doesn't unlock but it should
    - TP: Phone unlocks and it should
    - TN: Phone doesn't unlock and it shouldn't 
    
<br>

- What are the costs of the outcomes?
    - FP is most costly. We do not want a phone to unlock unless it should 
    
<br>

- Which metric should we use?
    - Precision is the metric we should use. The cost of a false negative is not very high: we don't mind much if the phone occasionally fails to unlock when it should



__________________________

### Scenario 3: 
Predict whether an email is spam or not. Emails marked as spam skip the inbox and go to the spam folder.

- What is the positive and negative case?
    - positive: is marked as spam
    - negative: not marked as spam

<br>

- What are the possible outcomes?
    - FP: not spam marked as spam
    - FN: spam that is not marked as spam (it lands in the inbox)
    - TP: marked spam, is spam
    - TN: marked not spam, is not spam
    
<br>

- What are the costs of the outcomes?
    - FP is most costly: an email is marked as spam but is not spam, so we miss an email we don't want to miss
    
<br>

- Which metric should we use?
    - precision: we don't want legitimate inbox emails to end up in the spam folder

______________________

## Scenario 4: 
Predict whether an email is a phishing attempt. When we predict positive, show an additional banner warning the user that this might be a phishing email.

- What is the positive and negative case?
    - positive: the email is flagged as a phishing attempt (banner shown)
    - negative: the email is not flagged (no banner)

<br>

- What are the possible outcomes?
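    - FP: banner shown on an email that is not phishing
    - FN: no banner shown on an email that is phishing
    - TP: banner shown on an email that is phishing
    - TN: no banner shown on an email that is not phishing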

<br>

- What are the costs of the outcomes?
    - 
    
<br>

- Which metric should we use?
    - precision: 

_________

# Python Implementation

In [2]:
import pandas as pd

In [3]:
#create a dataframe with two columns (predicted and actual)
df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df

      actual prediction
0     coffee  no coffee
1  no coffee  no coffee
2  no coffee     coffee
3     coffee     coffee
4     coffee     coffee
5     coffee     coffee
6  no coffee  no coffee
7     coffee  no coffee


In [5]:
# Confusion matrix: rows are predictions, columns are actual values
pd.crosstab(df.prediction, df.actual)

actual      coffee  no coffee
prediction
coffee           3          1
no coffee        2          2


- TP: predicted coffee + actual is coffee
- FP: predicted coffee, but they didn't like coffee
- FN: predicted no coffee, but really they liked coffee
- TN: predicted no coffee, actual no coffee

### Metrics
- accuracy: (TP + TN) / (TP + TN + FP + FN)
- (3 + 2) / (3 + 1 + 2 +2) = 62.5%

<br>

- precision: TP / (TP + FP)
- 3 / (3 + 1) = 75%
- use precision when a FP is more costly than a FN

<br>

- recall: TP / (TP + FN)
- 3 / (3 + 2) = 60%
- use recall when a FN is more costly than a FP
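
The same numbers can be read straight out of the confusion matrix instead of being typed in by hand. The sketch below rebuilds the dataframe so it runs on its own; the variable names `matrix`, `tp`, `fp`, `fn`, and `tn` are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})

# Rows are predictions, columns are actual values
matrix = pd.crosstab(df.prediction, df.actual)

# .loc[row_label, column_label] -> .loc[predicted, actual]
tp = matrix.loc['coffee', 'coffee']
fp = matrix.loc['coffee', 'no coffee']
fn = matrix.loc['no coffee', 'coffee']
tn = matrix.loc['no coffee', 'no coffee']

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}')
# accuracy=62.5% precision=75.0% recall=60.0%
```

Keeping track of which axis holds the predictions matters: swapping the two arguments to `pd.crosstab` transposes the matrix and swaps FP with FN.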

In [7]:
# add a baseline column that always predicts the most common actual value ('coffee')
df['baseline'] = 'coffee'

In [8]:
df

      actual prediction baseline
0     coffee  no coffee   coffee
1  no coffee  no coffee   coffee
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee
6  no coffee  no coffee   coffee
7     coffee  no coffee   coffee


In [9]:
# model accuracy
(df.actual == df.prediction).mean()

0.625

In [10]:
# baseline accuracy
(df.actual == df.baseline).mean()

0.625

In [12]:
# precision -- how good are our positive predictions
# precision -- model performance | pred +

#create a subset of data where we predicted the positive case
subset = df[df.prediction == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

#this shows that our precision is 75%

      actual prediction baseline
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee


0.75

In [13]:
# recall -- how often do we get the actual positive cases
# recall -- model performance | actual +

#create a subset of data where actual value is positive
subset = df[df.actual == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

#this shows that recall is 60%

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


0.6

In [14]:
# baseline precision -- the baseline predicts 'coffee' for every row, so the subset is the whole dataframe
subset = df[df.baseline == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

      actual prediction baseline
0     coffee  no coffee   coffee
1  no coffee  no coffee   coffee
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee
6  no coffee  no coffee   coffee
7     coffee  no coffee   coffee


0.625

In [15]:
# baseline recall -- the baseline always predicts 'coffee', so it catches every actual positive
subset = df[df.actual == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


1.0

In [16]:
positive = 'coffee'

# accuracy -- overall hit rate
model_accuracy = (df.prediction == df.actual).mean()
baseline_accuracy = (df.baseline == df.actual).mean()

# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive
subset = df[df.prediction == positive]
model_precision = (subset.prediction == subset.actual).mean()
subset = df[df.baseline == positive]
baseline_precision = (subset.baseline == subset.actual).mean()

# recall -- how good are we at detecting actual positives?
# recall -- model performance | actual positive
subset = df[df.actual == positive]
model_recall = (subset.prediction == subset.actual).mean()
baseline_recall = (subset.baseline == subset.actual).mean()


print(f'''
positive: {positive}

         | accuracy | recall | precision
         | -------- | ------ | ---------         
   model | {model_accuracy:8.1%} | {model_recall:6.1%} | {model_precision:9.1%}
baseline | {baseline_accuracy:8.1%} | {baseline_recall:6.1%} | {baseline_precision:9.1%}
''')


positive: coffee

         | accuracy | recall | precision
         | -------- | ------ | ---------         
   model |    62.5% |  60.0% |     75.0%
baseline |    62.5% | 100.0% |     62.5%

