# Evaluation Exercises

- 2. Given the following confusion matrix, evaluate (by hand) the model's performance.

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'pred_dog': [46, 13], 'pred_cat': [7, 34]}, 
                  index=['actual_dog', 'actual_cat'])
df

Unnamed: 0,pred_dog,pred_cat
actual_dog,46,7
actual_cat,13,34


In [2]:
rubric_df = pd.DataFrame(
    [['True Negative', 'False Positive'], ['False Negative', 'True Positive']],
    columns=df.columns, index=df.index)
rubric_df

Unnamed: 0,pred_dog,pred_cat
actual_dog,True Negative,False Positive
actual_cat,False Negative,True Positive


In [3]:
rubric_df + ': ' + df.values.astype(str)

Unnamed: 0,pred_dog,pred_cat
actual_dog,True Negative: 46,False Positive: 7
actual_cat,False Negative: 13,True Positive: 34


In [4]:
# 
# Positive Class: Cat
#

In [5]:
#Calculating the values manually
# Recall: TP / (TP + FN)
recall = 34 / (34 + 13)
# Precision: TP / (TP + FP)
precision = 34 / (34 + 7)
# Accuracy : (TP + TN) / (TP + TN + FP + FN)
accuracy = (34 + 46) / df.values.sum()

In [6]:
accuracy

0.8

In [7]:
precision

0.8292682926829268

In [8]:
recall

0.723404255319149
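
The hand calculations above can be wrapped in a small helper so the same formulas apply to any 2x2 confusion matrix. This is only a sketch: the function name `metrics_from_counts` is introduced here for illustration, and cat is treated as the positive class, as in the cells above.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Counts from the matrix above, with cat as the positive class
metrics = metrics_from_counts(tp=34, fp=7, fn=13, tn=46)
print(metrics)
```

Passing the counts by keyword keeps the four cells from being mixed up, which is the most common mistake when reading a confusion matrix by hand.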

- 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

    - Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.
       - Use the predictions dataset and pandas to help answer the following questions:


In [9]:
c3 = pd.read_csv('c3.csv')
type(c3)

pandas.core.frame.DataFrame

In [10]:
c3

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


In [11]:
c3.head(),c3.shape

(      actual     model1  model2     model3
 0  No Defect  No Defect  Defect  No Defect
 1  No Defect  No Defect  Defect     Defect
 2  No Defect  No Defect  Defect  No Defect
 3  No Defect     Defect  Defect     Defect
 4  No Defect  No Defect  Defect  No Defect,
 (200, 4))

In [12]:
c3.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

- 3a: An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

    - Evaluation metric: Recall, because the goal is to miss as few actual defects as possible (minimize false negatives)
    - Best model: Model 3, with the highest recall (0.8125)

- 3b: Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light.
    - The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii.
    - They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect.
        - Which evaluation metric would be appropriate here?
            - Precision
        - Which model would be the best fit for this use case?
            - Model 1, with a precision of 0.8

In [17]:
c3['baseline'] = 'No Defect'

In [19]:
pd.crosstab(c3.baseline, c3.actual)

actual,Defect,No Defect
baseline,Unnamed: 1_level_1,Unnamed: 2_level_1
No Defect,16,184


In [24]:
rubric_df = pd.DataFrame(rubric_df.values, columns = ['Pred_Good', 'Pred_Defect'], index= ['Act Good', 'Act Defect'])

In [27]:
# confusion_matrix expects (y_true, y_pred): rows are actual, columns are predicted
confusion_matrix(c3.actual, c3.baseline, labels=['No Defect', 'Defect'])

array([[184,   0],
       [ 16,   0]])
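
As a rough illustration of how the matrix is laid out, the sketch below runs `confusion_matrix` on a handful of made-up labels (hypothetical data, not taken from `c3.csv`). It relies on the fact that `confusion_matrix` takes the actual labels first and the predictions second, so rows correspond to actual classes and columns to predicted classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, not taken from c3.csv
actual = pd.Series(['Defect', 'No Defect', 'No Defect', 'Defect', 'No Defect'])
predicted = pd.Series(['Defect', 'No Defect', 'Defect', 'No Defect', 'No Defect'])

labels = ['No Defect', 'Defect']  # negative class first, positive class second
cm = confusion_matrix(actual, predicted, labels=labels)

# Rows are actual classes, columns are predicted classes
cm_df = pd.DataFrame(cm,
                     index=['actual_' + label for label in labels],
                     columns=['pred_' + label for label in labels])

# With the negative class listed first, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(cm_df)
```

Listing the negative class first in `labels` makes the layout match the rubric table, with true negatives in the top-left cell and true positives in the bottom-right.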

In [28]:
rubric_df

Unnamed: 0,Pred_Good,Pred_Defect
Act Good,True Negative,False Positive
Act Defect,False Negative,True Positive


In [None]:
# Problem 1: Internal/Manufacturing: Optimize for Recall
# Answer: We will pick model 3, with a recall of 0.8125, for the internal team

In [29]:
# c3 holds the actual labels and each model's predictions
# Recall looks only at the rows where the actual value is 'Defect'
# and measures how often the model also predicted 'Defect'

subset = c3[c3.actual == 'Defect']
(subset.actual == subset.model1).mean()

0.5

In [30]:
pd.crosstab(c3.baseline, c3.actual)

actual,Defect,No Defect
baseline,Unnamed: 1_level_1,Unnamed: 2_level_1
No Defect,16,184


In [31]:
subset = c3[c3.actual == 'Defect']
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
model1_recall, model2_recall, model3_recall

(0.5, 0.5625, 0.8125)
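
The repeated subset-and-compare pattern above could be folded into one helper. The sketch below assumes a frame shaped like `c3` (an `actual` column plus one column per model); the `toy` frame and the `recall` function name are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy predictions, not taken from c3.csv
toy = pd.DataFrame({
    'actual': ['Defect', 'Defect', 'No Defect', 'No Defect', 'Defect', 'No Defect'],
    'model1': ['Defect', 'No Defect', 'No Defect', 'No Defect', 'Defect', 'Defect'],
    'model2': ['Defect', 'Defect', 'Defect', 'No Defect', 'Defect', 'Defect'],
})

def recall(df, model, positive='Defect'):
    """Share of actual positives that the model also labels positive."""
    actual_positive = df[df.actual == positive]
    return (actual_positive[model] == positive).mean()

recalls = {model: recall(toy, model) for model in ['model1', 'model2']}
print(recalls)
```

Note that `model2` reaches perfect recall here only by predicting `Defect` almost everywhere, which is why recall alone is not enough when false positives are costly.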

In [32]:
# Baseline
# For a classification problem, a common choice for the baseline model is a model 
# that simply predicts the most common class every single time.

baseline = c3.actual.value_counts()
baseline

No Defect    184
Defect        16
Name: actual, dtype: int64

In [33]:
type(baseline)

pandas.core.series.Series
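
One way to turn the class counts into an actual baseline prediction is to take the label with the highest count and assign it to every row. The sketch below uses a small made-up frame (hypothetical, not `c3.csv`) to show the idea.

```python
import pandas as pd

# Hypothetical toy labels, not taken from c3.csv
toy = pd.DataFrame({'actual': ['No Defect', 'No Defect', 'Defect', 'No Defect']})

# The most common class becomes the baseline prediction for every row
most_common = toy.actual.value_counts().idxmax()
toy['baseline'] = most_common

# Baseline accuracy is simply the share of rows in the most common class
baseline_accuracy = (toy.baseline == toy.actual).mean()
print(most_common, baseline_accuracy)
```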

In [None]:
# Problem 2: Marketing: Optimize for Precision
# Answer: Model 1 for the marketing team

In [34]:
# Set the baseline to always predict 'Defect' so a baseline precision can be computed
c3['baseline'] = 'Defect'
c3

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,Defect
1,No Defect,No Defect,Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect,Defect
3,No Defect,Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect,Defect
...,...,...,...,...,...
195,No Defect,No Defect,Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect,Defect
197,No Defect,No Defect,No Defect,No Defect,Defect
198,No Defect,No Defect,Defect,Defect,Defect


In [35]:
# Problem 2: Marketing: Optimize for Precision
# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive

# We will pick model 1 for the marketing team, with a precision of 0.8

In [36]:
# Calc Precision for model1 = .8
subset = c3[c3.model1 == 'Defect']
model1_precision = (subset.model1 == subset.actual).mean()
model1_precision

0.8

In [37]:
# Calc Precision for model2 = .1
subset = c3[c3.model2 == 'Defect']
model2_precision = (subset.model2 == subset.actual).mean()
model2_precision

0.1

In [38]:
# Calc Precision for model3 = .1313
subset = c3[c3.model3 == 'Defect']
model3_precision = (subset.model3 == subset.actual).mean()
model3_precision

0.13131313131313133

In [39]:
subset = c3[c3.baseline == 'Defect']
print(subset)
(subset.baseline == subset.actual).mean()

        actual     model1     model2     model3 baseline
0    No Defect  No Defect     Defect  No Defect   Defect
1    No Defect  No Defect     Defect     Defect   Defect
2    No Defect  No Defect     Defect  No Defect   Defect
3    No Defect     Defect     Defect     Defect   Defect
4    No Defect  No Defect     Defect  No Defect   Defect
..         ...        ...        ...        ...      ...
195  No Defect  No Defect     Defect     Defect   Defect
196     Defect     Defect  No Defect  No Defect   Defect
197  No Defect  No Defect  No Defect  No Defect   Defect
198  No Defect  No Defect     Defect     Defect   Defect
199  No Defect  No Defect  No Defect     Defect   Defect

[200 rows x 5 columns]


0.08
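
The precision cells above all follow the same pattern: keep the rows where the model predicted the positive class, then check how often the actual label agrees. A minimal sketch of that pattern as a helper follows. The `toy` frame and the `precision` function name are hypothetical, and returning NaN when a model never predicts the positive class is a choice made here, not something the exercise prescribes.

```python
import pandas as pd

# Hypothetical toy predictions, not taken from c3.csv
toy = pd.DataFrame({
    'actual': ['Defect', 'No Defect', 'No Defect', 'Defect', 'No Defect'],
    'model1': ['Defect', 'No Defect', 'No Defect', 'No Defect', 'No Defect'],
    'model2': ['Defect', 'Defect', 'Defect', 'Defect', 'No Defect'],
    'baseline': ['No Defect'] * 5,
})

def precision(df, model, positive='Defect'):
    """Share of positive predictions that are actually positive."""
    predicted_positive = df[df[model] == positive]
    if predicted_positive.empty:
        # Precision is undefined when the model never predicts the positive class
        return float('nan')
    return (predicted_positive.actual == positive).mean()

precisions = {m: precision(toy, m) for m in ['model1', 'model2', 'baseline']}
print(precisions)
```

The NaN case explains why the notebook switches the baseline to `'Defect'` before computing its precision: a baseline that always predicts `'No Defect'` makes no positive predictions to evaluate.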

- 4. You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

    - At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).
        - Several models have already been developed with the data, and you can find their results here.



In [40]:
paws = pd.read_csv('gives_you_paws.csv')
print(paws)

     actual model1 model2 model3 model4
0       cat    cat    dog    cat    dog
1       dog    dog    cat    cat    dog
2       dog    cat    cat    cat    dog
3       dog    dog    dog    cat    dog
4       cat    cat    cat    dog    dog
...     ...    ...    ...    ...    ...
4995    dog    dog    dog    dog    dog
4996    dog    dog    cat    cat    dog
4997    dog    cat    cat    dog    dog
4998    cat    cat    cat    cat    dog
4999    dog    dog    dog    dog    dog

[5000 rows x 5 columns]


In [41]:
paws.actual.value_counts().idxmax()

'dog'

In [52]:
modelcols = paws.columns[1:]

In [53]:
output = {}
for model in modelcols:
    accuracy = (paws.actual == paws[model]).mean()
    output.update({model:accuracy})

In [54]:
output

{'model1': 0.8074,
 'model2': 0.6304,
 'model3': 0.5096,
 'model4': 0.7426,
 'baseline': 0.6508}

In [55]:
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

- Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

# 4a
- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
    - Yes, Model 1 and Model 4 are more accurate than the baseline; Model 2 and Model 3 are not.
            - Baseline model: 0.6508
            - Model 1: 0.8074
            - Model 2: 0.6304
            - Model 3: 0.5096
            - Model 4: 0.7426

In [56]:
baseline = paws.actual.value_counts()
baseline

dog    3254
cat    1746
Name: actual, dtype: int64

In [57]:
#add baseline column
paws['baseline'] = 'dog'
paws

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


In [58]:
# phase 1: Recall
subset = paws[paws.actual == 'dog']
(subset.actual == subset.model1).mean()

0.803318992009834

In [59]:
(subset.actual == subset.model2).mean()

0.49078057775046097

In [60]:
(subset.actual == subset.model3).mean()

0.5086047940995697

In [61]:
(subset.actual == subset.model4).mean()

0.9557467732022127

In [63]:
# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive

In [64]:
# paws holds the actual labels and each model's predictions
# Recall looks only at the rows where the actual value is 'dog'
# and measures how often the model also predicted 'dog'

subset = paws[paws.actual == 'dog']
(subset.actual == subset.model1).mean()

0.803318992009834

In [66]:
subset = paws[paws.actual == 'dog']
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
model4_recall = (subset.model4 == subset.actual).mean()
model1_recall, model2_recall, model3_recall, model4_recall

(0.803318992009834,
 0.49078057775046097,
 0.5086047940995697,
 0.9557467732022127)

In [67]:
# a: model 4
# phase 2: Precision
subset = paws[paws.model1 == 'dog']
(subset.actual == subset.model1).mean()

0.8900238338440586

In [68]:
# Calc Precision for model1 = .89
subset = paws[paws.model1 == 'dog']
model1_precision = (subset.model1 == subset.actual).mean()
model1_precision

0.8900238338440586

In [69]:
# Calc Precision for model2 = .89
subset = paws[paws.model2 == 'dog']
model2_precision = (subset.model2 == subset.actual).mean()
model2_precision

0.8931767337807607

In [70]:
# Calc Precision for model3 = .66
subset = paws[paws.model3 == 'dog']
model3_precision = (subset.model3 == subset.actual).mean()
model3_precision

0.6598883572567783

In [72]:
# Calc Precision for model4 = .73
subset = paws[paws.model4 == 'dog']
model4_precision = (subset.model4 == subset.actual).mean()
model4_precision

0.7312485304490948

In [71]:
subset = paws[paws.baseline == 'dog']
print(subset)
(subset.baseline == subset.actual).mean()

     actual model1 model2 model3 model4 baseline
0       cat    cat    dog    cat    dog      dog
1       dog    dog    cat    cat    dog      dog
2       dog    cat    cat    cat    dog      dog
3       dog    dog    dog    cat    dog      dog
4       cat    cat    cat    dog    dog      dog
...     ...    ...    ...    ...    ...      ...
4995    dog    dog    dog    dog    dog      dog
4996    dog    dog    cat    cat    dog      dog
4997    dog    cat    cat    dog    dog      dog
4998    cat    cat    cat    cat    dog      dog
4999    dog    dog    dog    dog    dog      dog

[5000 rows x 6 columns]


0.6508

# 4b. 
- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?
    - Phase I: an automated algorithm tags each picture as either a cat or a dog
        - Model 4: 0.9557
        - Recall is a good metric for Phase I, since the goal is to catch as many actual dog pictures as possible
    - Phase II: the photos identified in Phase I go through another round of review, possibly with some human oversight, before being presented to the users
        - Model 2: 0.8931
        - Precision is a good metric for Phase II, since a picture presented as a dog should really be a dog

# 4c. 
- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?
    - Phase I: an automated algorithm tags each picture as either a cat or a dog
        - Model 2: recall of .8906
    - Phase II: the photos identified in Phase I go through another round of review, possibly with some human oversight, before being presented to the users
        - Model 4: precision of .8072
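
Since Phase I leans on recall and Phase II on precision, it can help to see both metrics for every model side by side, once per positive class. The sketch below is one possible way to build such a table. The `toy` frame and the `summarize` function are hypothetical and only mimic the layout of `gives_you_paws.csv`.

```python
import pandas as pd

# Hypothetical toy predictions, not taken from gives_you_paws.csv
toy = pd.DataFrame({
    'actual': ['dog', 'dog', 'cat', 'dog', 'cat', 'cat'],
    'model1': ['dog', 'cat', 'cat', 'dog', 'dog', 'cat'],
    'model2': ['dog', 'dog', 'dog', 'dog', 'cat', 'dog'],
})

def summarize(df, positive):
    """Recall and precision of every model column for one positive class."""
    rows = {}
    for model in df.columns.drop('actual'):
        true_positive = ((df.actual == positive) & (df[model] == positive)).sum()
        rows[model] = {
            'recall': true_positive / (df.actual == positive).sum(),
            'precision': true_positive / (df[model] == positive).sum(),
        }
    # One row per model, one column per metric
    return pd.DataFrame(rows).T

dog_summary = summarize(toy, 'dog')
cat_summary = summarize(toy, 'cat')
print(dog_summary)
print(cat_summary)
```

With a table like this, the Phase I choice is the row with the highest recall and the Phase II choice is the row with the highest precision, for whichever class the team cares about.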

In [73]:
# Set up the dataframe for the team dealing solely with cats:

#add baseline column
paws['baseline'] = 'cat'
paws

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,cat
1,dog,dog,cat,cat,dog,cat
2,dog,cat,cat,cat,dog,cat
3,dog,dog,dog,cat,dog,cat
4,cat,cat,cat,dog,dog,cat
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,cat
4996,dog,dog,cat,cat,dog,cat
4997,dog,cat,cat,dog,dog,cat
4998,cat,cat,cat,cat,dog,cat


In [74]:
# paws holds the actual labels and each model's predictions
# Recall looks only at the rows where the actual value is 'cat'
# and measures how often the model also predicted 'cat'

subset = paws[paws.actual == 'cat']
(subset.actual == subset.model1).mean()

0.8150057273768614

In [75]:
subset = paws[paws.actual == 'cat']
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
model1_recall, model2_recall, model3_recall

(0.8150057273768614, 0.8906071019473081, 0.5114547537227949)

In [76]:
# Calc Precision for model1 = .6897
subset = paws[paws.model1 == 'cat']
model1_precision = (subset.model1 == subset.actual).mean()
model1_precision

0.6897721764420747

In [77]:
# Calc Precision for model2 = .4841
subset = paws[paws.model2 == 'cat']
model2_precision = (subset.model2 == subset.actual).mean()
model2_precision

0.4841220423412204

In [78]:
# Calc Precision for model3 = .3583
# The subset must be the rows where model3 (not model1) predicted 'cat'
subset = paws[paws.model3 == 'cat']
model3_precision = (subset.model3 == subset.actual).mean()
round(model3_precision, 6)

0.358347

In [79]:
# Check the class counts again before resetting the baseline
baseline = paws.actual.value_counts()
baseline

dog    3254
cat    1746
Name: actual, dtype: int64

In [80]:
# Set the baseline column back to 'dog', the most common class
paws['baseline'] = 'dog'
paws

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


# 5
- Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

- sklearn.metrics.accuracy_score
- sklearn.metrics.precision_score
- sklearn.metrics.recall_score
- sklearn.metrics.classification_report


In [81]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

In [84]:
# Accuracy uses every row, so no subsetting is needed
{model: accuracy_score(paws.actual, paws[model]) for model in modelcols}

{'model1': 0.8074,
 'model2': 0.6304,
 'model3': 0.5096,
 'model4': 0.7426,
 'baseline': 0.6508}

In [88]:
print('Model 1')
pd.DataFrame(classification_report(paws.actual, paws.model1, output_dict=True))

Model 1


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.689772,0.890024,0.8074,0.789898,0.820096
recall,0.815006,0.803319,0.8074,0.809162,0.8074
f1-score,0.747178,0.844452,0.8074,0.795815,0.810484
support,1746.0,3254.0,0.8074,5000.0,5000.0


In [89]:
print('Model 2')
pd.DataFrame(classification_report(paws.actual, paws.model2, output_dict=True))

Model 2


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.484122,0.893177,0.6304,0.688649,0.750335
recall,0.890607,0.490781,0.6304,0.690694,0.6304
f1-score,0.627269,0.633479,0.6304,0.630374,0.63131
support,1746.0,3254.0,0.6304,5000.0,5000.0


In [94]:
print('Model 3')
pd.DataFrame(classification_report(paws.actual, paws.model3, output_dict=True))

Model 3


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.358347,0.659888,0.5096,0.509118,0.55459
recall,0.511455,0.508605,0.5096,0.51003,0.5096
f1-score,0.421425,0.574453,0.5096,0.497939,0.521016
support,1746.0,3254.0,0.5096,5000.0,5000.0


In [95]:
print('Model 4')
pd.DataFrame(classification_report(paws.actual, paws.model4, output_dict=True))

Model 4


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.807229,0.731249,0.7426,0.769239,0.757781
recall,0.345361,0.955747,0.7426,0.650554,0.7426
f1-score,0.483755,0.82856,0.7426,0.656157,0.708154
support,1746.0,3254.0,0.7426,5000.0,5000.0
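
The exercise also lists `precision_score` and `recall_score`, which the cells above do not call directly. A minimal sketch of how they might be applied follows, using a small made-up frame (hypothetical, not `gives_you_paws.csv`). With string labels, the `pos_label` argument selects which class counts as positive.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy predictions, not taken from gives_you_paws.csv
toy = pd.DataFrame({
    'actual': ['dog', 'dog', 'cat', 'dog', 'cat', 'cat'],
    'model1': ['dog', 'cat', 'cat', 'dog', 'dog', 'cat'],
})

# Both functions take the actual labels first, then the predictions
dog_precision = precision_score(toy.actual, toy.model1, pos_label='dog')
dog_recall = recall_score(toy.actual, toy.model1, pos_label='dog')
cat_recall = recall_score(toy.actual, toy.model1, pos_label='cat')
print(dog_precision, dog_recall, cat_recall)
```

Switching `pos_label` between `'dog'` and `'cat'` should reproduce the per-class columns that `classification_report` prints, which gives a quick cross-check on the manual subset calculations.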
