In [61]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report

# Exercises

Create a new file named model_evaluation.py or model_evaluation.ipynb for these exercises.

### Given the following confusion matrix, evaluate (by hand) the model's performance.


|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

* Decide the Positive and Negative Classes
> `positive`: cat
> `negative`: dog

* In the context of this problem, what is a false positive?
> Predicting `cat`, with actual `dog` --> 7

* In the context of this problem, what is a false negative?
> Predicting `dog`, with actual `cat` --> 13

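* How would you describe this model?
> The prompt is vague. In practice it means computing accuracy, precision, and recall from the four cells of the matrix, with `cat` as the positive class.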

In [4]:
# set values 
true_positive = 34
false_positive = 7
true_negative = 46
false_negative = 13

accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print("Accuracy is", accuracy)
print("Recall is", round(recall,2))
print("Precision is", round(precision,2))

Accuracy is 0.8
Recall is 0.72
Precision is 0.83
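
The same arithmetic can be wrapped in a small helper, which also makes it easy to report metrics the exercise does not ask for. The sketch below is an illustration rather than part of the original exercise: the function name `metrics_from_counts` and the two extra metrics (specificity and F1) are assumptions added here, with `cat` still treated as the positive class.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return common classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),   # recall of the negative class
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the dog/cat confusion matrix, with cat as the positive class
cat_metrics = metrics_from_counts(tp=34, fp=7, tn=46, fn=13)

for name, value in cat_metrics.items():
    print(f"{name}: {value:.2f}")
```

Swapping the roles of the counts (`tp=46, fp=13, tn=34, fn=7`) would give the same metrics with `dog` as the positive class.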


### You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

In [5]:
df = pd.read_csv('c3.csv')

In [6]:
df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


Use the predictions dataset and pandas to help answer the following questions:

#### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible.

* Which evaluation metric would be appropriate here? 
> Decide positive and negative class --> Positive == Defect<br>
> `Recall` is best here because we don't want to miss any defective units

* Which model would be the best fit for this use case?
> Calculate recall for each model and pick the highest<br>
> `model3` has the best recall

In [25]:
df.actual.value_counts(normalize=True)

No Defect    0.92
Defect       0.08
Name: actual, dtype: float64

In [26]:
df['baseline_prediction'] = 'No Defect'

In [27]:
# subset to the rows whose actual label is Defect (the positive class)
subset = df[df.actual =='Defect']

In [22]:
# reuse the subset of actual defects from above; the share of matches is each model's recall
for model in df.columns[1:4]:
    print(f'{model} recall: {(subset.actual == subset[model]).mean()}')

model1 recall: 0.5
model2 recall: 0.5625
model3 recall: 0.8125


#### Takeaways (answers to those earlier questions)
> QC should use a model that prioritizes recall, which reduces false negatives (missed defects).<br>
> QC should use `model3`
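
Recall and precision can both be read from a single cross-tabulation of actual versus predicted labels. The sketch below uses a small made-up frame (the `toy` data and the `model_x` column are hypothetical, not taken from `c3.csv`) to show where each number comes from.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the c3.csv data
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "Defect", "Defect",
                "No Defect", "No Defect", "No Defect", "No Defect"],
    "model_x": ["Defect", "Defect", "Defect", "No Defect",
                "Defect", "No Defect", "No Defect", "No Defect"],
})

# Rows are actual labels, columns are predicted labels
matrix = pd.crosstab(toy.actual, toy.model_x)
print(matrix)

tp = matrix.loc["Defect", "Defect"]        # defects the model caught
fn = matrix.loc["Defect", "No Defect"]     # defects the model missed
fp = matrix.loc["No Defect", "Defect"]     # good ducks flagged as defective

recall = tp / (tp + fn)       # share of actual defects that were caught
precision = tp / (tp + fp)    # share of predicted defects that were real

print(f"recall: {recall}, precision: {precision}")
```

Recall divides by a row total (all actual defects), while precision divides by a column total (all predicted defects), which is why the two use cases in this exercise can favor different models.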

### Recently, several stories in the local news have highlighted customers who received a rubber duck with a defect, portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but they tell you they really don't want to accidentally give out a vacation package when the duck doesn't actually have a defect (a false positive).

#### Which evaluation metric would be appropriate here? 

> Decide positive and negative class --> Positive == Defect<br>
> `precision` is best here because PR wants to avoid false positives

#### Which model would be the best fit for this use case?
> `model1` is the best choice for PR because it has the highest precision

In [34]:
# iterate through models and print the precision of each

for model in df.columns[1:4]:
    
    sub_mod = df[df[model] == 'Defect']
    model_precision = (sub_mod[model] == sub_mod.actual).mean()
    
    print(f'{model} precision: {model_precision}')

model1 precision: 0.8
model2 precision: 0.1
model3 precision: 0.13131313131313133


### You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I).

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [35]:
#acquire data and form baseline
df_paws = pd.read_csv('gives_you_paws.csv')

In [36]:
df_paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [37]:
df_paws.actual.value_counts(normalize=True)

dog    0.6508
cat    0.3492
Name: actual, dtype: float64

In [38]:
df_paws['baseline_prediction'] = 'dog'

In terms of accuracy, how do the various models compare to the baseline model? 
> Two of the four models beat the baseline accuracy of 0.6508 and two fall below it. All four are within about 16 percentage points of the baseline.

Are any of the models better than the baseline?
> `model1` and `model4` perform better than baseline

In [48]:
baseline_accuracy = (df_paws.baseline_prediction == df_paws.actual).mean()

print(f'baseline accuracy: {baseline_accuracy}')

for model in df_paws.columns[1:5]:
    model_accuracy = (df_paws[model] == df_paws.actual).mean()
    print(f'{model} accuracy: {model_accuracy}')

baseline accuracy: 0.6508
model1 accuracy: 0.8074
model2 accuracy: 0.6304
model3 accuracy: 0.5096
model4 accuracy: 0.7426
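
Rather than typing the majority class by hand, the baseline can be derived from the data. A minimal sketch on made-up labels (the `toy` frame is hypothetical, not the `gives_you_paws.csv` data), assuming the most frequent value of `actual` is what the baseline should always predict:

```python
import pandas as pd

# Hypothetical labels for illustration only
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat"],
    "model1": ["dog", "dog", "cat", "cat", "cat"],
})

# mode() returns the most frequent value(s); take the first one
majority_class = toy.actual.mode()[0]
toy["baseline_prediction"] = majority_class

# Accuracy is the share of rows where prediction and actual label agree
baseline_accuracy = (toy.baseline_prediction == toy.actual).mean()
model_accuracy = (toy.model1 == toy.actual).mean()

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy}")
print(f"model1 accuracy: {model_accuracy}")
print(f"model1 beats baseline: {model_accuracy > baseline_accuracy}")
```

Deriving the class this way keeps the baseline correct if the data changes and a different class becomes the most common.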


Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend? 
* This question asks for further evaluation, not just picking the most accurate model.
> Minimize cat pictures tagged as dogs (false positives, with `dog` as the positive class)<br>
> Use precision<br>
> Recommend `model2`

In [54]:
print(f'baseline: {baseline_accuracy}')   

for model in df_paws.columns[1:5]:  
    sub_dog = df_paws[df_paws[model] == 'dog'] 
    precision = (sub_dog.actual == sub_dog[model]).mean()
    print(f'{model} precision: {precision}')

baseline: 0.6508
model1 precision: 0.8900238338440586
model2 precision: 0.8931767337807607
model3 precision: 0.6598883572567783
model4 precision: 0.7312485304490948


Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?
> Recommend `model4`

In [56]:
print(f'baseline: {1-baseline_accuracy}')   

for model in df_paws.columns[1:5]:  
    sub_cat = df_paws[df_paws[model] == 'cat'] 
    precision = (sub_cat.actual == sub_cat[model]).mean()
    print(f'{model} precision: {precision}')

baseline: 0.34919999999999995
model1 precision: 0.6897721764420747
model2 precision: 0.4841220423412204
model3 precision: 0.358346709470305
model4 precision: 0.8072289156626506
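
One caveat about the `baseline` figure printed with the cat results: a baseline that always predicts `dog` never predicts `cat`, so its cat precision is undefined. The value `1 - baseline_accuracy` is the share of cats in the data, which is still a useful reference point. A minimal sketch on made-up labels (the `toy` frame is hypothetical) shows what the subset-and-compare approach returns for such a baseline:

```python
import pandas as pd

# Hypothetical labels for illustration only
toy = pd.DataFrame({"actual": ["dog", "dog", "cat", "dog", "cat"]})
toy["baseline_prediction"] = "dog"

# Keep only rows where the baseline predicted cat: there are none
sub_cat = toy[toy.baseline_prediction == "cat"]
cat_precision = (sub_cat.actual == sub_cat.baseline_prediction).mean()

# Share of cats in the data, which is what 1 - baseline accuracy measures
cat_share = (toy.actual == "cat").mean()

print(f"rows predicted cat: {len(sub_cat)}")
print(f"baseline cat precision: {cat_precision}")  # NaN, the mean of an empty selection
print(f"share of cats: {cat_share}")
```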


Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

* `sklearn.metrics.accuracy_score`
* `sklearn.metrics.precision_score`
* `sklearn.metrics.recall_score`
* `sklearn.metrics.classification_report`
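
Because the labels here are strings rather than 0/1, `precision_score` and `recall_score` need to be told which label counts as positive through `pos_label`. A minimal sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical), treating `dog` as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = ["dog", "dog", "dog", "cat", "cat", "dog", "cat", "dog"]
y_pred = ["dog", "cat", "dog", "cat", "dog", "dog", "cat", "dog"]

# Accuracy does not depend on which class is considered positive
accuracy = accuracy_score(y_true, y_pred)

# Precision and recall do, so name the positive class explicitly
dog_precision = precision_score(y_true, y_pred, pos_label="dog")
dog_recall = recall_score(y_true, y_pred, pos_label="dog")

print(f"accuracy: {accuracy}")
print(f"dog precision: {dog_precision}")
print(f"dog recall: {dog_recall}")
```

Passing `pos_label="cat"` instead gives the metrics from the cat team's point of view without changing the data.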

In [65]:
paws_labels = ['cat', 'dog']
for model in df_paws.columns[1:5]:
    print(model)
    report = classification_report(df_paws.actual, df_paws[model], labels=paws_labels, output_dict=True)
    print(pd.DataFrame(report))
    print('-----\n')


model1
                   cat          dog  accuracy    macro avg  weighted avg
precision     0.689772     0.890024    0.8074     0.789898      0.820096
recall        0.815006     0.803319    0.8074     0.809162      0.807400
f1-score      0.747178     0.844452    0.8074     0.795815      0.810484
support    1746.000000  3254.000000    0.8074  5000.000000   5000.000000
-----

model2
                   cat          dog  accuracy    macro avg  weighted avg
precision     0.484122     0.893177    0.6304     0.688649      0.750335
recall        0.890607     0.490781    0.6304     0.690694      0.630400
f1-score      0.627269     0.633479    0.6304     0.630374      0.631310
support    1746.000000  3254.000000    0.6304  5000.000000   5000.000000
-----

model3
                   cat          dog  accuracy    macro avg  weighted avg
precision     0.358347     0.659888    0.5096     0.509118      0.554590
recall        0.511455     0.508605    0.5096     0.510030      0.509600
f1-score      0.