### Exercises from https://ds.codeup.com/classification/evaluation/#exercises

## EVALUATION

In [193]:
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, classification_report


### 2. Confusion matrix:
|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


Treating cat as the positive class:

False Positive: predicted cat but actually a dog

False Negative: predicted dog but actually a cat

True Positive: predicted cat and actually a cat

True Negative: predicted dog and actually a dog

In [194]:
FP = 7
FN = 13
TP = 34
TN = 46

## Metrics

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)

- **precision**: TP / (TP + FP)

- **recall**: TP / (TP + FN)


In [195]:
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}")


Accuracy: 0.8, Precision: 0.8292682926829268, Recall: 0.723404255319149
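As a cross-check on the hand arithmetic, the same counts can be fed through scikit-learn. This is only a sketch: the label lists below are rebuilt from the four counts in the confusion matrix (they are not the original data), and the `sk_` variable names are made up for this example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Counts from the confusion matrix above (positive class = cat).
TP, TN, FP, FN = 34, 46, 7, 13

# Rebuild label lists that reproduce those counts.
actual = ["cat"] * TP + ["dog"] * TN + ["dog"] * FP + ["cat"] * FN
predicted = ["cat"] * TP + ["dog"] * TN + ["cat"] * FP + ["dog"] * FN

# With labels ordered [negative, positive], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["dog", "cat"]).ravel()

sk_accuracy = accuracy_score(actual, predicted)
sk_precision = precision_score(actual, predicted, pos_label="cat")
sk_recall = recall_score(actual, predicted, pos_label="cat")

print(tn, fp, fn, tp)
print(sk_accuracy, sk_precision, sk_recall)
```

Passing `labels=["dog", "cat"]` fixes the row and column order, so the unpacking does not depend on how the class names happen to sort.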


## 3. Rubber ducks

In [196]:
# Read data
df = pd.read_csv('c3.csv')
df

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


### Want to identify as many ducks with defect as possible. Which metric to use?

- Positive: defect
- Negative: no defect

- False positive: predict defect but reality is no defect - customer still happy
- False negative: predict no defect but reality is defect - customer unhappy

Want to err on the side of not letting a single defective duck through, so we want to catch as many of the positive cases as possible. A false negative is more expensive because a defective duck would then reach a customer.

###  -> USE RECALL
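To make the link between false negatives and recall concrete, here is a small sketch using `pd.crosstab`. The six rows are toy labels invented for illustration, not rows from `c3.csv`.

```python
import pandas as pd

# Toy labels, invented for illustration (not rows from c3.csv).
toy = pd.DataFrame({
    "actual":    ["Defect", "Defect", "Defect", "No Defect", "No Defect", "No Defect"],
    "predicted": ["Defect", "Defect", "No Defect", "Defect", "No Defect", "No Defect"],
})

# Rows are actual classes, columns are predicted classes.
table = pd.crosstab(toy.actual, toy.predicted)

tp = table.loc["Defect", "Defect"]      # defects that were caught
fn = table.loc["Defect", "No Defect"]   # defects that slipped through
toy_recall = tp / (tp + fn)

print(table)
print(f"recall: {toy_recall:.2%}")
```

Every missed defect adds to `fn` and lowers recall, which is why recall is the metric to watch when a missed defect is the costly mistake.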

### Which model to go with?

In [197]:
positive = 'Defect'
models = ['model1','model2','model3']
recalls={}
precisions={}
for model in models:
    
    # accuracy -- overall hit rate
    model_accuracy = (df[model]== df.actual).mean()
    # baseline_accuracy = (df.baseline == df.actual).mean()

    # precision -- how good are our positive predictions?
    # precision -- model performance | predicted positive
    subset = df[df[model] == positive]
    model_precision = (subset[model] == subset.actual).mean()
    precisions[model] = model_precision
    # subset = df[df.baseline == positive]
    # baseline_precision = (subset.baseline == subset.actual).mean()

    # recall -- how good are we at detecting actual positives?
    # recall -- model performance | actual positive
    subset = df[df.actual == positive]
    model_recall = (subset[model] == subset.actual).mean()
    recalls[model]=model_recall
    # baseline_recall = (subset.baseline == subset.actual).mean()

    print(f'{model}')
    print(f'   model accuracy: {model_accuracy:.2%}')
    # print(f'baseline accuracy: {baseline_accuracy:.2%}')
    print()
    print(f'   model recall: {model_recall:.2%}')
    # print(f'baseline recall: {baseline_recall:.2%}')
    print()
    print(f'model precision: {model_precision:.2%}')
    # print(f'baseline precision: {baseline_precision:.2%}')
    print('--------')

print(f'The model with the highest recall is {max(recalls, key = recalls.get)}')
# print(f'The model with the highest precision is {max(precisions, key = precisions.get)}')

model1
   model accuracy: 95.00%

   model recall: 50.00%

model precision: 80.00%
--------
model2
   model accuracy: 56.00%

   model recall: 56.25%

model precision: 10.00%
--------
model3
   model accuracy: 55.50%

   model recall: 81.25%

model precision: 13.13%
--------
The model with the highest recall is model3


## -> USE MODEL 3

### Going to give a vacation to customers who received a defective duck
### Really don't want to accidentally give someone a vacation package if their duck is not defective
#### Reminder: a false positive is predicting defect when the duck is not defective; a false negative is predicting no defect when the duck is defective
#### False positive is more expensive -> we give a vacation to someone who got a duck without a defect
### ->>> optimize for precision
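One way to read precision in this scenario: `1 - precision` is the share of "Defect" predictions that are wrong, so roughly the share of vacations that would go to customers whose duck was fine. The sketch below uses the precision values printed by the loop earlier in this notebook; the `wasted_share` name is made up for this example.

```python
# Precision values printed for the three duck models above.
reported_precision = {"model1": 0.80, "model2": 0.10, "model3": 0.1313}

# 1 - precision is the share of "Defect" predictions that are wrong,
# i.e. the share of vacations that would go to customers with a good duck.
wasted_share = {name: 1 - p for name, p in reported_precision.items()}

for name, share in wasted_share.items():
    print(f"{name}: {share:.2%} of flagged ducks are not actually defective")
```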

In [198]:
print(f'The model with the highest precision is {max(precisions, key = precisions.get)}')

The model with the highest precision is model1


## -> USE MODEL 1

## 4. Given the Gives you paws dataset use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [199]:
df = pd.read_csv('gives_you_paws.csv')

In [200]:
df.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [201]:
baseline = df.actual.value_counts().idxmax()

- positive = cat
- negative = dog

- false positive = predicts cat but dog
- false negative = predicts dog but cat

In [202]:
df['baseline'] = baseline

In [203]:
(df.baseline == df.actual).mean()

0.6508
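The baseline accuracy is just the share of the most common class, since the baseline is right exactly when the actual label is that class. A minimal sketch on toy labels (invented for illustration, not rows from `gives_you_paws.csv`):

```python
import pandas as pd

# Toy labels, invented for illustration (not rows from gives_you_paws.csv).
toy = pd.DataFrame({"actual": ["dog", "dog", "dog", "cat", "cat"]})

# The baseline predicts the most common class for every row.
most_common = toy.actual.value_counts().idxmax()
toy["baseline"] = most_common

baseline_accuracy = (toy.baseline == toy.actual).mean()
majority_share = toy.actual.value_counts(normalize=True).max()

print(most_common, baseline_accuracy, majority_share)
```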

In [204]:
models = df.columns.to_list()

In [205]:
models.remove('actual')

In [206]:
models

['model1', 'model2', 'model3', 'model4', 'baseline']

In [214]:
positive = 'dog'

models = df.columns.to_list()
models.remove('actual')
accuracies= {}
precisions = {}
recalls = {}

for model in models:
    # accuracy -- overall hit rate
    model_accuracy = (df[model] == df.actual).mean()

    # precision -- how good are our positive predictions?
    # precision -- model performance | predicted positive
    subset = df[df[model] == positive]
    model_precision = (subset[model] == subset.actual).mean()

    # recall -- how good are we at detecting actual positives?
    # recall -- model performance | actual positive
    subset = df[df.actual == positive]
    model_recall = (subset[model] == subset.actual).mean()

    accuracies[model] = model_accuracy
    precisions[model] = model_precision
    recalls[model] = model_recall
        
    print(model)
    print(f'   model accuracy: {model_accuracy:.2%}')
    print(f'   model recall: {model_recall:.2%}')
    print(f'   model precision: {model_precision:.2%}')

results = pd.DataFrame(data = [accuracies, precisions, recalls], index = ['accuracy','precision','recall'])
results

model1
   model accuracy: 80.74%
   model recall: 80.33%
   model precision: 89.00%
model2
   model accuracy: 63.04%
   model recall: 49.08%
   model precision: 89.32%
model3
   model accuracy: 50.96%
   model recall: 50.86%
   model precision: 65.99%
model4
   model accuracy: 74.26%
   model recall: 95.57%
   model precision: 73.12%
baseline
   model accuracy: 65.08%
   model recall: 100.00%
   model precision: 65.08%


Unnamed: 0,model1,model2,model3,model4,baseline
accuracy,0.8074,0.6304,0.5096,0.7426,0.6508
precision,0.890024,0.893177,0.659888,0.731249,0.6508
recall,0.803319,0.490781,0.508605,0.955747,1.0


In terms of accuracy, how do the various models compare to the baseline model?

In [215]:
print(f"Models better than baseline based on accuracy: {results.columns[results.loc['accuracy'].baseline<results.loc['accuracy']].to_list()}")
print(f"Models worse than baseline based on accuracy: {results.columns[results.loc['accuracy'].baseline>results.loc['accuracy']].to_list()}")

Models better than baseline based on accuracy: ['model1', 'model4']
Models worse than baseline based on accuracy: ['model2', 'model3']
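The same comparison can be written a little more readably by pulling the accuracy row out as a Series first. This is just an alternative sketch; the values are copied from the results table above so the block runs on its own.

```python
import pandas as pd

# Accuracy row copied from the results table above.
accuracy = pd.Series({
    "model1": 0.8074, "model2": 0.6304, "model3": 0.5096,
    "model4": 0.7426, "baseline": 0.6508,
})

baseline_accuracy = accuracy["baseline"]
candidates = accuracy.drop("baseline")

# Boolean masks select the models on either side of the baseline.
better = candidates[candidates > baseline_accuracy].index.to_list()
worse = candidates[candidates < baseline_accuracy].index.to_list()

print(f"better than baseline: {better}")
print(f"worse than baseline: {worse}")
```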


Are any of the models better than the baseline?

In [216]:
print(f"Models better than baseline based on accuracy: {results.columns[results.loc['accuracy'].baseline<results.loc['accuracy']].to_list()}")

Models better than baseline based on accuracy: ['model1', 'model4']


### Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?


This means:
 - positive = dog
 - negative = cat

 - false positive = predict dog but actually a cat
 - false negative = predict cat but actually a dog

 - Phase 1: want to err on the side of more dogs so we can correct - cost of missing out on a positive case is high - optimize for recall
 - Phase 2: want to err on the side of fewer dogs - cost of acting on a positive prediction is high - optimize for precision
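The Phase 1 / Phase 2 choice boils down to "pick the model with the highest score for one metric and one positive class". A possible sketch of that as a helper is below. The `best_model` function, the `eager` and `picky` columns, and the toy rows are all invented for illustration; only `recall_score` and `precision_score` come from scikit-learn.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score


def best_model(df, metric, positive, models):
    """Return the name of the model column with the highest score for `metric`."""
    scores = {m: metric(df.actual, df[m], pos_label=positive) for m in models}
    return max(scores, key=scores.get)


# Toy predictions, invented for illustration (not rows from gives_you_paws.csv).
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "eager":  ["dog", "dog", "dog", "dog", "dog", "cat"],  # calls most things dog
    "picky":  ["dog", "cat", "cat", "cat", "cat", "cat"],  # rarely calls dog
})

phase_1 = best_model(toy, recall_score, "dog", ["eager", "picky"])
phase_2 = best_model(toy, precision_score, "dog", ["eager", "picky"])

print(phase_1, phase_2)
```

The model that says "dog" freely wins on recall, while the cautious one wins on precision, which mirrors the trade-off described for the two phases.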

### From review:

- Phase 1: don't miss out on cat pictures
- Phase 2: make sure don't show any cats
- Recall best

#### I want the Phase I model to let through some actual cats that are predicted as dogs, since we can correct them later - happier with a false positive. A false negative is more expensive because then I don't have a chance to correct it. Optimize for recall.

In [218]:
recalls.pop('baseline')

1.0

In [220]:
print(f"Best model based on recall: {max(recalls, key = recalls.get)}")

Best model based on recall: model4


#### I want the Phase II model to avoid sending any cats through - happier with a false negative. A false positive is more expensive because then the customer sees a cat!! Optimize for precision.

In [221]:
precisions.pop('baseline')

0.6508

In [222]:
print(f"Best model based on precision: {max(precisions, key = precisions.get)}")

Best model based on precision: model2


### Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

 - positive = cat
 - negative = dog

 - false positive = predict cat but actually a dog
 - false negative = predict dog but actually a cat

 - Phase 1 - optimize for recall as better to act on a predicted positive than not to
 - Phase 2 - optimize for precision as cost of acting on positive prediction is high

In [223]:
positive = 'cat'

models = df.columns.to_list()
models.remove('actual')
accuracies= {}
precisions = {}
recalls = {}

for model in models:
    # accuracy -- overall hit rate
    model_accuracy = (df[model] == df.actual).mean()

    # precision -- how good are our positive predictions?
    # precision -- model performance | predicted positive
    subset = df[df[model] == positive]
    model_precision = (subset[model] == subset.actual).mean()

    # recall -- how good are we at detecting actual positives?
    # recall -- model performance | actual positive
    subset = df[df.actual == positive]
    model_recall = (subset[model] == subset.actual).mean()

    accuracies[model] = model_accuracy
    precisions[model] = model_precision
    recalls[model] = model_recall
        
    print(model)
    print(f'   model accuracy: {model_accuracy:.2%}')
    print(f'   model recall: {model_recall:.2%}')
    print(f'   model precision: {model_precision:.2%}')

results = pd.DataFrame(data = [accuracies, precisions, recalls], index = ['accuracy','precision','recall'])
results

model1
   model accuracy: 80.74%
   model recall: 81.50%
   model precision: 68.98%
model2
   model accuracy: 63.04%
   model recall: 89.06%
   model precision: 48.41%
model3
   model accuracy: 50.96%
   model recall: 51.15%
   model precision: 35.83%
model4
   model accuracy: 74.26%
   model recall: 34.54%
   model precision: 80.72%
baseline
   model accuracy: 65.08%
   model recall: 0.00%
   model precision: nan%


Unnamed: 0,model1,model2,model3,model4,baseline
accuracy,0.8074,0.6304,0.5096,0.7426,0.6508
precision,0.689772,0.484122,0.358347,0.807229,
recall,0.815006,0.890607,0.511455,0.345361,0.0


#### I want the Phase I model to let through some actual dogs that are predicted as cats, since we can correct them later - happier with a false positive. A false negative is more expensive because then I don't have a chance to correct it. Optimize for recall.

In [224]:
recalls.pop('baseline')

0.0

In [225]:
print(f"Best model based on recall: {max(recalls, key = recalls.get)}")

Best model based on recall: model2


#### I want Phase II model to avoid sending any dogs through - happier with a false negative. False positive more expensive as then the customer sees a dog!! Optimize for precision.

In [226]:
precisions.pop('baseline')

nan

In [227]:
print(f"Best model based on precision: {max(precisions, key = precisions.get)}")

Best model based on precision: model4
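An alternative to popping the baseline before calling `max` is to put the scores in a Series, since `idxmax` skips NaN by default. This only handles the NaN case: when the baseline has a valid score (such as its 100% recall with dog as the positive class), it would still need to be dropped explicitly. The values below are copied from the cat-as-positive results table, and `precision_row` is a name made up for this sketch.

```python
import numpy as np
import pandas as pd

# Cat-as-positive precision values from the results table above;
# the baseline never predicts cat, so its precision is NaN.
precision_row = pd.Series({
    "model1": 0.689772, "model2": 0.484122, "model3": 0.358347,
    "model4": 0.807229, "baseline": np.nan,
})

# idxmax skips NaN by default (skipna=True).
best = precision_row.idxmax()
print(f"Best model based on precision: {best}")
```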


### Using scikit-learn

In [228]:
positive = 'cat'

models = df.columns.to_list()
models.remove('actual')
accuracies= {}
precisions = {}
recalls = {}

for model in models:
    # accuracy -- overall hit rate
    model_accuracy = accuracy_score(df.actual, df[model])

    # precision -- how good are our positive predictions?
    # precision -- model performance | predicted positive
    model_precision = precision_score(df.actual, df[model], pos_label=positive)

    # recall -- how good are we at detecting actual positives?
    # recall -- model performance | actual positive
    model_recall = recall_score(df.actual, df[model], pos_label = positive)

    accuracies[model] = model_accuracy
    precisions[model] = model_precision
    recalls[model] = model_recall
        
    print(model)
    print(f'   model accuracy: {model_accuracy:.2%}')
    print(f'   model recall: {model_recall:.2%}')
    print(f'   model precision: {model_precision:.2%}')

results_sci = pd.DataFrame(data = [accuracies, precisions, recalls], index = ['accuracy','precision','recall'])
results_sci

model1
   model accuracy: 80.74%
   model recall: 81.50%
   model precision: 68.98%
model2
   model accuracy: 63.04%
   model recall: 89.06%
   model precision: 48.41%
model3
   model accuracy: 50.96%
   model recall: 51.15%
   model precision: 35.83%
model4
   model accuracy: 74.26%
   model recall: 34.54%
   model precision: 80.72%
baseline
   model accuracy: 65.08%
   model recall: 0.00%
   model precision: 0.00%


  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,model1,model2,model3,model4,baseline
accuracy,0.8074,0.6304,0.5096,0.7426,0.6508
precision,0.689772,0.484122,0.358347,0.807229,0.0
recall,0.815006,0.890607,0.511455,0.345361,0.0


In [229]:
results

Unnamed: 0,model1,model2,model3,model4,baseline
accuracy,0.8074,0.6304,0.5096,0.7426,0.6508
precision,0.689772,0.484122,0.358347,0.807229,
recall,0.815006,0.890607,0.511455,0.345361,0.0
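The two tables differ only in the baseline precision: the pandas version yields NaN (the mean of an empty subset), while scikit-learn reports 0.0 and emits a warning. Recent scikit-learn releases expose a `zero_division` parameter on `precision_score` that makes this choice explicit. The sketch below assumes such a version is installed and uses toy labels invented for illustration.

```python
from sklearn.metrics import precision_score

# Toy labels, invented for illustration: the "baseline" never predicts cat.
actual = ["cat", "dog", "dog", "cat", "dog"]
baseline = ["dog"] * len(actual)

# No row is predicted as cat, so TP + FP = 0 and precision is undefined.
# zero_division sets the value to report in that case.
as_zero = precision_score(actual, baseline, pos_label="cat", zero_division=0)
as_one = precision_score(actual, baseline, pos_label="cat", zero_division=1)

print(as_zero, as_one)
```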


In [230]:
models

['model1', 'model2', 'model3', 'model4', 'baseline']

In [231]:
for model in models:
    print(model)
    # output_dict=True returns a nested dict, which pandas can render as a table
    print(pd.DataFrame(classification_report(df.actual, df[model], output_dict=True)).T)
    print('-----')

model1
              precision    recall  f1-score    support
cat            0.689772  0.815006  0.747178  1746.0000
dog            0.890024  0.803319  0.844452  3254.0000
accuracy       0.807400  0.807400  0.807400     0.8074
macro avg      0.789898  0.809162  0.795815  5000.0000
weighted avg   0.820096  0.807400  0.810484  5000.0000
-----
model2
              precision    recall  f1-score    support
cat            0.484122  0.890607  0.627269  1746.0000
dog            0.893177  0.490781  0.633479  3254.0000
accuracy       0.630400  0.630400  0.630400     0.6304
macro avg      0.688649  0.690694  0.630374  5000.0000
weighted avg   0.750335  0.630400  0.631310  5000.0000
-----
model3
              precision    recall  f1-score    support
cat            0.358347  0.511455  0.421425  1746.0000
dog            0.659888  0.508605  0.574453  3254.0000
accuracy       0.509600  0.509600  0.509600     0.5096
macro avg      0.509118  0.510030  0.497939  5000.0000
weighted avg   0.554590  0.50960

  _warn_prf(average, modifier, msg_start, len(result))
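The `f1-score` column in these reports is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it for the cat row of model1 from the precision and recall values shown in the report.

```python
# Precision and recall for the cat class of model1, from the report above.
precision = 0.689772
recall = 0.815006

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The result should agree with the reported `f1-score` of 0.747178 up to rounding of the inputs.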
