In [44]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report

### Exercise 2

Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


Positive = dog

Negative = cat

a. In the context of this problem, what is a false positive?

**false positive = predict dog but actually cat**

b. In the context of this problem, what is a false negative?

**false negative = predict cat but actually dog**

In [58]:
# setting number of true/false, positive/negatives & calculating accuracy, recall, & precision manually
tp = 46
tn = 34
fp = 13
fn = 7

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

accuracy, recall, precision

(0.8, 0.8679245283018868, 0.7796610169491526)
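
As a cross-check on the hand calculation, the same three formulas can be wrapped in a small helper and re-run with cat treated as the positive class. This is only a minimal sketch: the helper name `metrics_from_counts` is hypothetical and not part of the exercise, and the counts are read straight from the confusion matrix above.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return (accuracy, recall, precision) from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision

# Positive = dog (same counts as the cell above)
dog_metrics = metrics_from_counts(tp=46, tn=34, fp=13, fn=7)

# Positive = cat: the diagonal cells swap roles, and so do the two error cells
cat_metrics = metrics_from_counts(tp=34, tn=46, fp=7, fn=13)

print([round(m, 4) for m in dog_metrics])
print([round(m, 4) for m in cat_metrics])
```

Accuracy is the same from either point of view, while recall and precision depend on which class is called positive.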

c. How would you describe this model?

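**A dog classifier that is correct 80% of the time: it finds about 87% of the actual dogs (recall), and about 78% of its dog predictions are right (precision)**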

### Exercise 3, Part I

You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

Use the predictions dataset and pandas to help answer the following questions:

In [51]:
# reading in dataframe
df = pd.read_csv('c3.csv')
df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


a. An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

**Positive = defect**

**Negative = no defect**

true positive = predict defect, has defect

false positive = predict defect, no defect

false negative = predict no defect, has defect

true negative = predict no defect, no defect

**Recall is the appropriate metric here because we want to minimize false negatives (defective ducks predicted as having no defect)**

**Model 3 would be the best fit for this case**
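
The four outcome definitions above can be turned into a small counting routine. The sketch below is illustrative only: `confusion_counts` is a hypothetical helper, and the labels are made-up toy values rather than rows from `c3.csv`.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Count tp, fp, fn, tn for one prediction column, given the positive label."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Hypothetical toy labels, not taken from c3.csv
actual    = ["Defect", "Defect", "No Defect", "No Defect", "Defect"]
predicted = ["Defect", "No Defect", "Defect", "No Defect", "Defect"]

counts = confusion_counts(actual, predicted, positive="Defect")

# Recall only looks at the ducks that really are defective: tp and fn
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(dict(counts), recall)
```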

In [52]:
# checking values
df.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [53]:
# setting baseline to most common
df['baseline'] = 'No Defect'
df.head()

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect
2,No Defect,No Defect,Defect,No Defect,No Defect
3,No Defect,Defect,Defect,Defect,No Defect
4,No Defect,No Defect,Defect,No Defect,No Defect


In [54]:
# Subset to the ducks that actually have a defect, since recall only considers actual positives
subset = df[df.actual == 'Defect']
subset.head()

Unnamed: 0,actual,model1,model2,model3,baseline
13,Defect,No Defect,Defect,Defect,No Defect
30,Defect,Defect,No Defect,Defect,No Defect
65,Defect,Defect,Defect,Defect,No Defect
70,Defect,Defect,Defect,Defect,No Defect
74,Defect,No Defect,No Defect,Defect,No Defect


In [55]:
# Share of the actual defects that each model also predicted as a defect (recall)
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print(f'model1 recall: {model1_recall:.2%}')
print(f'model2 recall: {model2_recall:.2%}')
print(f'model3 recall: {model3_recall:.2%}')
print(f'baseline recall: {baseline_recall:.2%}')

model1 recall: 50.00%
model2 recall: 56.25%
model3 recall: 81.25%
baseline recall: 0.00%
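
The per-model lines above repeat the same comparison, so the calculation can also be done for every model column at once. This is a sketch on a made-up frame; `toy`, `model_cols`, and `positive` are hypothetical names, and the values are not from `c3.csv`.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as c3.csv
toy = pd.DataFrame({
    "actual": ["Defect", "Defect", "No Defect", "Defect", "No Defect"],
    "model1": ["Defect", "No Defect", "No Defect", "Defect", "Defect"],
    "model2": ["Defect", "Defect", "Defect", "Defect", "No Defect"],
})

positive = "Defect"
model_cols = ["model1", "model2"]

# Keep only rows where the duck really is defective, then take the share of
# those rows that each model also labelled as defective (recall per model).
actual_pos = toy[toy.actual == positive]
recalls = (actual_pos[model_cols] == positive).mean()
print(recalls)
```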


In [60]:
# Completing the same calculation as above but with sklearn.metrics.recall_score
model1_recall = recall_score(subset.actual, subset.model1, pos_label = 'Defect')
model2_recall = recall_score(subset.actual, subset.model2, pos_label = 'Defect')
model3_recall = recall_score(subset.actual, subset.model3, pos_label = 'Defect')
baseline_recall = recall_score(subset.actual, subset.baseline, pos_label = 'Defect')

print(f'model1 recall: {model1_recall:.2%}')
print(f'model2 recall: {model2_recall:.2%}')
print(f'model3 recall: {model3_recall:.2%}')
print(f'baseline recall: {baseline_recall:.2%}')

model1 recall: 50.00%
model2 recall: 56.25%
model3 recall: 81.25%
baseline recall: 0.00%


**Model 3 would be the best fit**

### Exercise 3, Part II

b. Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?


**In this case, precision would be the appropriate evaluation metric**

In [63]:
# Creating subset as we only care about the accuracy of those times when the model predicted defects
subset1 = df[df.model1 == 'Defect']
subset2 = df[df.model2 == 'Defect']
subset3 = df[df.model3 == 'Defect']
subset4 = df[df.baseline == 'Defect']

In [64]:
# Checking each model to see which was the most accurate about predicting defects
model1_precision = (subset1.model1 == subset1.actual).mean()
model2_precision = (subset2.model2 == subset2.actual).mean()
model3_precision = (subset3.model3 == subset3.actual).mean()
baseline_precision = (subset4.baseline == subset4.actual).mean()

print(f'model1 precision: {model1_precision:.2%}')
print(f'model2 precision: {model2_precision:.2%}')
print(f'model3 precision: {model3_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

model1 precision: 80.00%
model2 precision: 10.00%
model3 precision: 13.13%
baseline precision: nan%
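
The baseline shows `nan` because it never predicts a defect: its subset is empty, so precision is 0 / 0. A minimal sketch of making that case explicit, using a hypothetical `precision` helper and made-up labels:

```python
def precision(actual, predicted, positive):
    """Precision for one positive label; None when the label is never predicted."""
    # Actual labels of the rows that were predicted positive
    predicted_positive = [a for a, p in zip(actual, predicted) if p == positive]
    if not predicted_positive:
        return None  # 0 / 0: there are no positive predictions to evaluate
    true_positive = sum(a == positive for a in predicted_positive)
    return true_positive / len(predicted_positive)

# Hypothetical toy labels, not taken from c3.csv
actual   = ["Defect", "No Defect", "No Defect", "Defect"]
model    = ["Defect", "Defect", "No Defect", "No Defect"]
baseline = ["No Defect"] * 4

model_precision = precision(actual, model, "Defect")
baseline_precision = precision(actual, baseline, "Defect")
print(model_precision, baseline_precision)
```

Returning `None` rather than a number is a design choice for this sketch; it keeps "undefined" visibly different from a genuine score of zero.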


In [65]:
# Completing the same calculation as above but with sklearn.metrics.precision_score
model1_precision = precision_score(subset1.actual, subset1.model1, pos_label = 'Defect')
model2_precision = precision_score(subset2.actual, subset2.model2, pos_label = 'Defect')
model3_precision = precision_score(subset3.actual, subset3.model3, pos_label = 'Defect')
baseline_precision = precision_score(subset4.actual, subset4.baseline, pos_label = 'Defect')

print(f'model1 precision: {model1_precision:.2%}')
print(f'model2 precision: {model2_precision:.2%}')
print(f'model3 precision: {model3_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

model1 precision: 80.00%
model2 precision: 10.00%
model3 precision: 13.13%
baseline precision: 0.00%
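
When a model never predicts the positive class, `precision_score` has nothing to score and reports 0 along with an undefined-metric warning. The sketch below assumes scikit-learn 0.22 or later, where the `zero_division` parameter is available, and uses made-up labels. It also passes the full columns rather than a pre-filtered subset, since `precision_score` already restricts itself to the positive predictions.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels, not taken from c3.csv
actual   = ["Defect", "No Defect", "No Defect", "Defect"]
baseline = ["No Defect"] * 4

# An explicit zero_division value makes the "no positive predictions" case
# return 0.0 without emitting a warning.
baseline_precision = precision_score(
    actual, baseline, pos_label="Defect", zero_division=0
)
print(baseline_precision)
```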




**Model 1 would be the best fit in this case.**

### Exercise 4

You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [66]:
df = pd.read_csv('gives_you_paws.csv')
df.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [67]:
df.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [68]:
df['baseline'] = 'dog'
df.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
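
The baseline label above was typed in after reading `value_counts()`. It could also be derived from the data, which avoids hard-coding the class. A sketch on made-up labels; `toy` and `most_common` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy labels standing in for the `actual` column
toy = pd.DataFrame({"actual": ["dog", "cat", "dog", "dog", "cat"]})

# The most frequent class becomes the constant baseline prediction
most_common = toy.actual.value_counts().idxmax()
toy["baseline"] = most_common

# Baseline accuracy equals the share of the most common class
baseline_accuracy = (toy.actual == toy.baseline).mean()
print(most_common, baseline_accuracy)
```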


### Exercise 4a

In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [69]:
# Checking each model to see how accurate its predictions were and comparing to the baseline model
model1_accuracy = (df.actual == df['model1']).mean()
model2_accuracy = (df.actual == df['model2']).mean()
model3_accuracy = (df.actual == df['model3']).mean()
model4_accuracy = (df.actual == df['model4']).mean()
baseline_accuracy = (df.actual == df['baseline']).mean()

print(f'model1 accuracy: {model1_accuracy:.2%}')
print(f'model2 accuracy: {model2_accuracy:.2%}')
print(f'model3 accuracy: {model3_accuracy:.2%}')
print(f'model4 accuracy: {model4_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

model1 accuracy: 80.74%
model2 accuracy: 63.04%
model3 accuracy: 50.96%
model4 accuracy: 74.26%
baseline accuracy: 65.08%


In [70]:
# Completing the same calculation as above but with sklearn.metrics.accuracy_score
model1_accuracy = accuracy_score(df.actual, df.model1)
model2_accuracy = accuracy_score(df.actual, df.model2)
model3_accuracy = accuracy_score(df.actual, df.model3)
model4_accuracy = accuracy_score(df.actual, df.model4)
baseline_accuracy = accuracy_score(df.actual, df.baseline)

print(f'model1 accuracy: {model1_accuracy:.2%}')
print(f'model2 accuracy: {model2_accuracy:.2%}')
print(f'model3 accuracy: {model3_accuracy:.2%}')
print(f'model4 accuracy: {model4_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

model1 accuracy: 80.74%
model2 accuracy: 63.04%
model3 accuracy: 50.96%
model4 accuracy: 74.26%
baseline accuracy: 65.08%


**Models 1 & 4 are the only ones that perform better than the baseline**
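
The comparison against the baseline can also be done in one pass over all prediction columns. This is a sketch on a made-up frame with hypothetical names (`toy`, `pred_cols`, `beats_baseline`), not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as gives_you_paws.csv
toy = pd.DataFrame({
    "actual":   ["dog", "cat", "dog", "dog", "cat", "dog"],
    "model1":   ["dog", "cat", "dog", "cat", "cat", "dog"],
    "model2":   ["cat", "cat", "dog", "cat", "dog", "cat"],
    "baseline": ["dog"] * 6,
})

# Compare every prediction column to `actual` row by row, then average
pred_cols = ["model1", "model2", "baseline"]
accuracy = toy[pred_cols].eq(toy.actual, axis=0).mean()

# Models whose accuracy is strictly above the baseline's
beats_baseline = accuracy[accuracy > accuracy["baseline"]].index.tolist()
print(accuracy)
print(beats_baseline)
```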

### Exercise 4b

Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?


**Positive = dog**

**Negative = cat**

true positive = predict dog, is dog

false positive = predict dog, is cat

false negative = predict cat, is dog

true negative = predict cat, is cat

**For Phase I, we would want to minimize false negatives (predicting cat when actually dog) so we would use recall**

**For Phase II, we would want to minimize false positives (predicting dog when actually cat) so we would use precision**

In [78]:
# creating subset since we care about finding the model that best predicts dog when actually dog
subset = df[df.actual == 'dog']
subset.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog


In [79]:
# Checking each model to find which one is most accurate at predicting dog when actually dog
model1_recall = (subset.actual == subset['model1']).mean()
model2_recall = (subset.actual == subset['model2']).mean()
model3_recall = (subset.actual == subset['model3']).mean()
model4_recall = (subset.actual == subset['model4']).mean()
baseline_recall = (subset.actual == subset['baseline']).mean()

print(f'model1 recall: {model1_recall:.2%}')
print(f'model2 recall: {model2_recall:.2%}')
print(f'model3 recall: {model3_recall:.2%}')
print(f'model4 recall: {model4_recall:.2%}')
print(f'baseline recall: {baseline_recall:.2%}')

model1 recall: 80.33%
model2 recall: 49.08%
model3 recall: 50.86%
model4 recall: 95.57%
baseline recall: 100.00%


In [84]:
# Same check with sklearn.metrics.classification_report; because this subset holds
# only actual dogs, only the dog recall value in the report is meaningful
model1_recall = pd.DataFrame(classification_report(subset.actual, subset.model1, labels=['cat','dog'], output_dict=True))
model1_recall



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,1.0,0.803319,0.5,1.0
recall,0.0,0.803319,0.803319,0.401659,0.803319
f1-score,0.0,0.890934,0.803319,0.445467,0.890934
support,0.0,3254.0,0.803319,3254.0,3254.0


In [85]:
# Same but for model2
model2_recall = pd.DataFrame(classification_report(subset.actual, subset.model2, labels=['cat','dog'], output_dict=True))
model2_recall



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,1.0,0.490781,0.5,1.0
recall,0.0,0.490781,0.490781,0.24539,0.490781
f1-score,0.0,0.658421,0.490781,0.32921,0.658421
support,0.0,3254.0,0.490781,3254.0,3254.0


In [86]:
# Same but for model3
model3_recall = pd.DataFrame(classification_report(subset.actual, subset.model3, labels=['cat','dog'], output_dict=True))
model3_recall



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,1.0,0.508605,0.5,1.0
recall,0.0,0.508605,0.508605,0.254302,0.508605
f1-score,0.0,0.674272,0.508605,0.337136,0.674272
support,0.0,3254.0,0.508605,3254.0,3254.0


In [87]:
# Same but for model4
model4_recall = pd.DataFrame(classification_report(subset.actual, subset.model4, labels=['cat','dog'], output_dict=True))
model4_recall



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,1.0,0.955747,0.5,1.0
recall,0.0,0.955747,0.955747,0.477873,0.955747
f1-score,0.0,0.977373,0.955747,0.488686,0.977373
support,0.0,3254.0,0.955747,3254.0,3254.0


**For Phase I, would recommend model 4 with 96% recall**

In [91]:
# creating subsets since we care about minimizing false positives so want to look at only the predictions that were 'dog' 
subset1 = df[df.model1 == 'dog']
subset2 = df[df.model2 == 'dog']
subset3 = df[df.model3 == 'dog']
subset4 = df[df.model4 == 'dog']
subset5 = df[df.baseline == 'dog']

In [92]:
# comparing models to find the one that was most precise at predicting dog when actually dog
model1_precision = (subset1.actual == subset1['model1']).mean()
model2_precision = (subset2.actual == subset2['model2']).mean()
model3_precision = (subset3.actual == subset3['model3']).mean()
model4_precision = (subset4.actual == subset4['model4']).mean()
baseline_precision = (subset5.actual == subset5['baseline']).mean()

print(f'model1 precision: {model1_precision:.2%}')
print(f'model2 precision: {model2_precision:.2%}')
print(f'model3 precision: {model3_precision:.2%}')
print(f'model4 precision: {model4_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

model1 precision: 89.00%
model2 precision: 89.32%
model3 precision: 65.99%
model4 precision: 73.12%
baseline precision: 65.08%


In [95]:
# Same check with sklearn.metrics.classification_report; because this subset holds
# only rows predicted as dog, only the dog precision value in the report is meaningful
model1_precision = pd.DataFrame(classification_report(subset1.actual, subset1.model1, labels=['cat','dog'], output_dict=True))
model1_precision



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,0.890024,0.890024,0.445012,0.792142
recall,0.0,1.0,0.890024,0.5,0.890024
f1-score,0.0,0.941812,0.890024,0.470906,0.838235
support,323.0,2614.0,0.890024,2937.0,2937.0


In [96]:
# Same but for model2
model2_precision = pd.DataFrame(classification_report(subset2.actual, subset2.model2, labels=['cat','dog'], output_dict=True))
model2_precision



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,0.893177,0.893177,0.446588,0.797765
recall,0.0,1.0,0.893177,0.5,0.893177
f1-score,0.0,0.943575,0.893177,0.471787,0.842779
support,191.0,1597.0,0.893177,1788.0,1788.0


In [97]:
# same but for model3
model3_precision = pd.DataFrame(classification_report(subset3.actual, subset3.model3, labels=['cat','dog'], output_dict=True))
model3_precision



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,0.659888,0.659888,0.329944,0.435453
recall,0.0,1.0,0.659888,0.5,0.659888
f1-score,0.0,0.7951,0.659888,0.39755,0.524677
support,853.0,1655.0,0.659888,2508.0,2508.0


In [98]:
# Same but for model4
model4_precision = pd.DataFrame(classification_report(subset4.actual, subset4.model4, labels=['cat','dog'], output_dict=True))
model4_precision



Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.0,0.731249,0.731249,0.365624,0.534724
recall,0.0,1.0,0.731249,0.5,0.731249
f1-score,0.0,0.844764,0.731249,0.422382,0.617733
support,1143.0,3110.0,0.731249,4253.0,4253.0


**For Phase II, would recommend model 2 with just over 89% precision**
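
Recall and precision for both classes can also be read off a normalized cross-tabulation of the full data, with no subsetting. The sketch below uses made-up labels and the hypothetical names `toy`, `by_actual`, and `by_predicted`. It relies on the `normalize` argument of `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy labels, not taken from gives_you_paws.csv
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model1": ["dog", "cat", "dog", "dog", "cat", "dog"],
})

# Rows sum to 1: the diagonal holds recall for each actual class
by_actual = pd.crosstab(toy.actual, toy.model1, normalize="index")

# Columns sum to 1: the diagonal holds precision for each predicted class
by_predicted = pd.crosstab(toy.actual, toy.model1, normalize="columns")

dog_recall = by_actual.loc["dog", "dog"]
dog_precision = by_predicted.loc["dog", "dog"]
print(dog_recall, dog_precision)
```

Because the whole frame is used, the off-diagonal cells stay visible, which the subset-based reports above cannot show.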

### Exercise 4c

Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

**Positive = cat**

**Negative = dog**

true positive = predict cat, is cat

false positive = predict cat, is dog

false negative = predict dog, is cat

true negative = predict dog, is dog

In [39]:
subset = df[df.actual == 'cat']
subset.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
6,cat,cat,cat,cat,dog,dog
7,cat,dog,cat,cat,dog,dog
11,cat,cat,dog,cat,cat,dog


In [40]:
model1_recall = (subset.actual == subset['model1']).mean()
model2_recall = (subset.actual == subset['model2']).mean()
model3_recall = (subset.actual == subset['model3']).mean()
model4_recall = (subset.actual == subset['model4']).mean()
baseline_recall = (subset.actual == subset['baseline']).mean()

print(f'model1 recall: {model1_recall:.2%}')
print(f'model2 recall: {model2_recall:.2%}')
print(f'model3 recall: {model3_recall:.2%}')
print(f'model4 recall: {model4_recall:.2%}')
print(f'baseline recall: {baseline_recall:.2%}')

model1 recall: 81.50%
model2 recall: 89.06%
model3 recall: 51.15%
model4 recall: 34.54%
baseline recall: 0.00%


**For Phase I, would recommend model 2 with 89% recall**

In [41]:
subset1 = df[df.model1 == 'cat']
subset2 = df[df.model2 == 'cat']
subset3 = df[df.model3 == 'cat']
subset4 = df[df.model4 == 'cat']
subset5 = df[df.baseline == 'cat']

In [42]:
model1_precision = (subset1.actual == subset1['model1']).mean()
model2_precision = (subset2.actual == subset2['model2']).mean()
model3_precision = (subset3.actual == subset3['model3']).mean()
model4_precision = (subset4.actual == subset4['model4']).mean()
baseline_precision = (subset5.actual == subset5['baseline']).mean()

print(f'model1 precision: {model1_precision:.2%}')
print(f'model2 precision: {model2_precision:.2%}')
print(f'model3 precision: {model3_precision:.2%}')
print(f'model4 precision: {model4_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

model1 precision: 68.98%
model2 precision: 48.41%
model3 precision: 35.83%
model4 precision: 80.72%
baseline precision: nan%


**For Phase II, would recommend model 4 with 81% precision**
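
The selection logic used throughout Exercise 4 (highest recall for Phase I, highest precision for Phase II, for whichever class is positive) can be sketched as one routine. All names here (`recall`, `precision`, `best_model`, `model_a`, `model_b`) are hypothetical and the labels are made up; models with an undefined score are skipped.

```python
def recall(actual, predicted, positive):
    """Share of actual positives that were predicted positive."""
    hits = [p == positive for a, p in zip(actual, predicted) if a == positive]
    return sum(hits) / len(hits) if hits else None

def precision(actual, predicted, positive):
    """Share of positive predictions that were actually positive."""
    hits = [a == positive for a, p in zip(actual, predicted) if p == positive]
    return sum(hits) / len(hits) if hits else None

def best_model(actual, predictions, positive, metric):
    """Name of the model with the highest defined score for `metric`."""
    scores = {name: metric(actual, preds, positive)
              for name, preds in predictions.items()}
    defined = {name: s for name, s in scores.items() if s is not None}
    return max(defined, key=defined.get)

# Hypothetical toy labels
actual = ["cat", "cat", "dog", "dog", "cat", "dog"]
predictions = {
    "model_a": ["cat", "cat", "cat", "cat", "cat", "dog"],  # finds every cat, with false alarms
    "model_b": ["cat", "dog", "dog", "dog", "dog", "dog"],  # rarely says cat, but is right when it does
}

phase1 = best_model(actual, predictions, "cat", recall)     # minimize false negatives
phase2 = best_model(actual, predictions, "cat", precision)  # minimize false positives
print(phase1, phase2)
```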