2. Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


- In the context of this problem, what is a false positive?
- In the context of this problem, what is a false negative?
- How would you describe this model?

*Positive:* it is a dog

*Negative:* it is a cat

*TP:* We predicted it is a dog, and it's a dog - 46

*FP:* We predicted it is a dog, but it is a cat - 13

*TN:* We predicted it is a cat, and it is a cat - 34

*FN:* We predicted it is a cat, but it is a dog - 7

Accuracy:

$\frac{TP + TN}{TP + TN + FP + FN}$

In [2]:
(46 + 34) / (46 + 13 + 34 + 7)

0.8

Recall:

$\frac{TP}{TP+FN}$

In [6]:
46 / (46 + 7)

0.8679245283018868

Precision:

$\frac{TP}{TP+FP}$

In [7]:
46 /(46 + 13)

0.7796610169491526

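The model performs reasonably well: 80% accuracy (versus 53% for a baseline that always predicts dog, since 53 of the 100 pictures are dogs), 87% recall, and 78% precision. It finds most of the dogs, but roughly one in five of its dog predictions is actually a cat.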
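The hand calculations above can be checked in one place. This is a minimal sketch, assuming "dog" is the positive class; the `evaluate` helper is a hypothetical name introduced for illustration, and the baseline is taken to be "always predict the more common actual class".

```python
def evaluate(tp, fp, tn, fn):
    """Return basic classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        'accuracy': (tp + tn) / total,
        'recall': tp / (tp + fn),       # share of actual positives found
        'precision': tp / (tp + fp),    # share of positive predictions that are right
        # Assumption: baseline always predicts the more common actual class
        'baseline_accuracy': max(tp + fn, tn + fp) / total,
    }

# Counts taken from the confusion matrix in this problem
metrics = evaluate(tp=46, fp=13, tn=34, fn=7)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Keeping the four counts as named arguments makes it harder to mix up FP and FN when the positive class changes.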

In [1]:
import pandas as pd

3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [2]:
c3_df = pd.read_csv('https://ds.codeup.com/data/c3.csv')

In [6]:
c3_df

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


In [7]:
#An internal team wants to investigate the cause of the manufacturing defects. 
#They tell you that they want to identify as many of the ducks that have a defect as possible. 
#Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

*Positive:* Duck has a defect

*Negative:* No defect

*TP:* The model predicted that the duck has a defect and it does

*FP:* The model predicted that the duck has a defect but it doesn't

*TN:* The model predicted that the duck has no defect and it doesn't

*FN:* The model predicted that the duck has no defect but it does have a defect

1. *Cost of the FP:* At most, the cost of the rubber duck
2. *Cost of the FN:* The cost of the rubber duck + transportation + customer service

We would like to avoid <u>false negative results</u>, when we predict that the rubber duck has no defect, but it actually does.

Model evaluation: **recall** $\frac{TP}{TP + FN}$

In [3]:
defect = c3_df[c3_df.actual == 'Defect']

In [9]:
defect.head()

Unnamed: 0,actual,model1,model2,model3
13,Defect,No Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect
65,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect
74,Defect,No Defect,No Defect,Defect


In [4]:
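# Initialize the trackers once, before the loop, so they are not reset on every iteration
best = 0
best_index = 0
for i in range(1, 4):
    print(f'Model {i}')
    # Recall: share of the actual defects that the model also labeled as defects
    recall_score = (defect.actual == defect['model' + str(i)]).mean()
    print(f'Recall score is {recall_score: >20}\n\n')

    # Keep the model with the highest recall seen so far
    if recall_score > best:
        best = recall_score
        best_index = i

print(f'Best model is Model {best_index}')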

Model 1
Recall score is                  0.5


Model 2
Recall score is               0.5625


Model 3
Recall score is               0.8125


Best model is Model 3


In [42]:
from sklearn.metrics import classification_report

print(classification_report(c3_df.actual, c3_df.model1))

              precision    recall  f1-score   support

      Defect       0.80      0.50      0.62        16
   No Defect       0.96      0.99      0.97       184

    accuracy                           0.95       200
   macro avg       0.88      0.74      0.79       200
weighted avg       0.95      0.95      0.94       200



In [43]:
print(classification_report(c3_df.actual, c3_df.model2))

              precision    recall  f1-score   support

      Defect       0.10      0.56      0.17        16
   No Defect       0.94      0.56      0.70       184

    accuracy                           0.56       200
   macro avg       0.52      0.56      0.44       200
weighted avg       0.87      0.56      0.66       200



In [44]:
print(classification_report(c3_df.actual, c3_df.model3))

              precision    recall  f1-score   support

      Defect       0.13      0.81      0.23        16
   No Defect       0.97      0.53      0.69       184

    accuracy                           0.56       200
   macro avg       0.55      0.67      0.46       200
weighted avg       0.90      0.56      0.65       200
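The second question (the Hawaii vacation program) is about avoiding false positives, which points to precision, $\frac{TP}{TP + FP}$. In the classification reports above, the precision for the Defect class is 0.80 for Model 1, 0.10 for Model 2, and 0.13 for Model 3, so Model 1 looks like the best fit for that use case. The sketch below shows one way to compute this with pandas. The toy DataFrame and the `precision` helper are hypothetical and only illustrate the technique; they are not the contents of `c3.csv`.

```python
import pandas as pd

# Hypothetical toy data, NOT the real c3.csv contents
toy = pd.DataFrame({
    'actual': ['Defect', 'Defect', 'No Defect', 'No Defect', 'No Defect', 'Defect'],
    'model1': ['Defect', 'No Defect', 'No Defect', 'No Defect', 'No Defect', 'Defect'],
    'model2': ['Defect', 'Defect', 'Defect', 'Defect', 'No Defect', 'Defect'],
})

def precision(df, model_col, positive='Defect'):
    """Share of positive predictions that are truly positive: TP / (TP + FP)."""
    predicted_positive = df[df[model_col] == positive]
    return (predicted_positive['actual'] == positive).mean()

# Compare the models and pick the one with the highest precision
precisions = {col: precision(toy, col) for col in ['model1', 'model2']}
best_model = max(precisions, key=precisions.get)
print(precisions)
print(f'Highest precision: {best_model}')
```

The only difference from the recall calculation is the subset: recall filters on the rows that are actually positive, while precision filters on the rows that were predicted positive.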



4. **You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).**

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

- a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
- b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?
- c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?

In [30]:
paws = pd.read_csv('gives_you_paws.csv')

In [31]:
paws

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog
...,...,...,...,...,...
4995,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog
4997,dog,cat,cat,dog,dog
4998,cat,cat,cat,cat,dog


In [36]:
paws['baseline'] = paws.actual.value_counts().idxmax()

In [38]:
paws.baseline.unique()

array(['dog'], dtype=object)

In [39]:
#a. In terms of accuracy, how do the various models compare to the baseline model? 
#Are any of the models better than the baseline?

In [45]:
baseline_accuracy = (paws.actual == paws.baseline).mean()

In [46]:
baseline_accuracy

0.6508

In [51]:
cols = paws.columns[1:-1]  # model1..model4: skip 'actual' (first) and 'baseline' (last)

In [56]:
# Compare each model's accuracy with the baseline accuracy
for i, col in enumerate(cols, start=1):
    accuracy = (paws.actual == paws[col]).mean()
    if accuracy > baseline_accuracy:
        print(f'Model {i} performs better than a baseline. Accuracy is {accuracy}\n')
    else:
        print(f'Model {i} failed the test. Accuracy is {accuracy}\n')

Model 1 performs better than a baseline. Accuracy is 0.8074

Model 2 failed the test. Accuracy is 0.6304

Model 3 failed the test. Accuracy is 0.5096

Model 4 performs better than a baseline. Accuracy is 0.7426



In [57]:
#b. Suppose you are working on a team that solely deals with dog pictures. 
#Which of these models would you recommend?

In [58]:
#positive --> dogs

In [59]:
#fp: picture predicted as a dog but it is a cat
#fn: picture predicted as a cat but it's dog

#we would like to avoid false positive results -> recommended precision evaluation

In [60]:
for col in cols:
    dog = paws[paws[col] == 'dog']
    precision_dog = (dog.actual == dog[col]).mean()
    print(f'{col}    {precision_dog}')
    

model1    0.8900238338440586
model2    0.8931767337807607
model3    0.6598883572567783
model4    0.7312485304490948


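Models 1 and 2 have the highest precision for dogs (about 0.89 each, with Model 2 marginally ahead). Model 1 also beats the baseline on accuracy, while Model 2 does not.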

In [62]:
#c. Suppose you are working on a team that solely deals with cat pictures. 
#Which of these models would you recommend?

In [63]:
#positive --> cats

#fp picture predicted as a cat but it is a dog
#fn picture predicted as a dog but it is a cat

#we would like to avoid false positive results -> recommended precision evaluation

In [64]:
for col in cols:
    cat = paws[paws[col] == 'cat']
    precision_cat = (cat.actual == cat[col]).mean()
    print(f'{col}    {precision_cat}')

model1    0.6897721764420747
model2    0.4841220423412204
model3    0.358346709470305
model4    0.8072289156626506


Model 4 is recommended for the cats team: it has the highest precision for cats (about 0.81).
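The recommendations above rank the models by precision only. If a team cares more about not missing pictures of its animal, recall would be the metric to compare instead, so it can help to look at both per class. This is a minimal sketch under stated assumptions: the toy DataFrame is hypothetical (not the contents of `gives_you_paws.csv`), and `precision_and_recall` is a helper name introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT the real gives_you_paws.csv contents
toy = pd.DataFrame({
    'actual': ['dog', 'dog', 'dog', 'cat', 'cat', 'dog'],
    'model1': ['dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
})

def precision_and_recall(df, model_col, positive):
    """Return (precision, recall) for one model, treating `positive` as the positive class."""
    predicted = df[model_col] == positive
    actual = df['actual'] == positive
    tp = (predicted & actual).sum()
    fp = (predicted & ~actual).sum()
    fn = (~predicted & actual).sum()
    return tp / (tp + fp), tp / (tp + fn)

dog_precision, dog_recall = precision_and_recall(toy, 'model1', 'dog')
cat_precision, cat_recall = precision_and_recall(toy, 'model1', 'cat')

# Rows are actual labels, columns are predicted labels
print(pd.crosstab(toy['actual'], toy['model1']))
print(f'dog: precision={dog_precision:.2f}, recall={dog_recall:.2f}')
print(f'cat: precision={cat_precision:.2f}, recall={cat_recall:.2f}')
```

Switching the `positive` argument is all it takes to re-run the comparison from the point of view of the other team.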