# Model Evaluation Exercises

## 1. Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

* In the context of this problem, what is a false positive?
* In the context of this problem, what is a false negative?
* How would you describe this model?

In [1]:
# Accuracy

accuracy = (34 + 46) / (34 + 7 + 13 + 46)

# Recall

recall = (34) / (34 + 13)

# Precision

precision = (34) / (34 + 7)

print(f'''
The model's accuracy is: {accuracy:.0%}
The model's recall is: {recall:.0%}
The model's precision is: {precision:.0%}
''')


The model's accuracy is: 80%
The model's recall is: 72%
The model's precision is: 83%


In this model, we are trying to predict whether an image shows a cat or a dog. For evaluation purposes, we treat cat as the positive class (i.e. we either correctly identify that it is a cat, or we correctly identify that it is not a cat).

* TP: Predicted cat, and it was a cat
* TN: Predicted not a cat, and it was not a cat
* FP: Predicted cat, but it was a dog
* FN: Predicted dog, but it was a cat

The model is right 80% of the time overall. When it predicts cat it is correct 83% of the time (precision), but it finds only 72% of the actual cats (recall). I would likely use recall to evaluate this model, because we want to catch as many of the positive cases as possible.
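The hand calculation can also be written as a small helper, which makes it easy to see how the scores change when the other class is treated as positive. This is only a sketch: `metrics_from_counts` is a hypothetical name, not part of the exercise, and the counts are read directly from the confusion matrix in the question.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Counts from the confusion matrix, with "cat" as the positive class.
cat_positive = metrics_from_counts(tp=34, fp=7, fn=13, tn=46)

# Same matrix with "dog" as the positive class: TP and TN swap, and so do FP and FN.
dog_positive = metrics_from_counts(tp=46, fp=13, fn=7, tn=34)

for label, scores in [("cat", cat_positive), ("dog", dog_positive)]:
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Accuracy is the same either way, while recall and precision depend on which class counts as positive.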

## 2. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

In [2]:
import pandas as pd

In [3]:
cody = pd.read_csv("c3.csv")

In [4]:
cody.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


In [5]:
cody.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
actual    200 non-null object
model1    200 non-null object
model2    200 non-null object
model3    200 non-null object
dtypes: object(4)
memory usage: 6.4+ KB


### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

Objective: Detect defects in rubber ducks 

* TP: Predicted a defect and there was a defect
* TN: Didn't predict a defect, and no defect was present
* FP: Predicted a defect, and there wasn't one
* FN: Didn't predict a defect, but there was a defect
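The four outcomes listed above are the cells of a confusion matrix. As a rough sketch, `pd.crosstab` can count them; the small frame below is made-up toy data standing in for `c3.csv`, and the `predicted` column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented, not taken from c3.csv.
toy = pd.DataFrame({
    "actual":    ["Defect", "Defect", "No Defect", "No Defect", "No Defect", "Defect"],
    "predicted": ["Defect", "No Defect", "No Defect", "Defect", "No Defect", "Defect"],
})

# Rows are predictions, columns are actual labels.
matrix = pd.crosstab(toy.predicted, toy.actual)
print(matrix)

# With "Defect" as the positive class, each outcome is one cell of the matrix.
tp = matrix.loc["Defect", "Defect"]
fp = matrix.loc["Defect", "No Defect"]
fn = matrix.loc["No Defect", "Defect"]
tn = matrix.loc["No Defect", "No Defect"]
```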

In [32]:
cody["baseline"] = cody.actual.value_counts().index[0]

To find as many of the defective ducks as possible, we want the model that misses the fewest actual defects, i.e. the one with the most TP relative to FN. That is what **recall** measures, so we evaluate the models on recall.

In [6]:
positive = "Defect"

subset = cody[cody.actual==positive]
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()


print(f'model 1 recall = {model1_recall:.0%}')
print(f'model 2 recall = {model2_recall:.0%}')
print(f'model 3 recall = {model3_recall:.0%}')

model 1 recall = 50%
model 2 recall = 56%
model 3 recall = 81%


In this case, model 3 has the highest recall (it detects the largest share of actual defects, so it has the fewest FN), and it is the model I would recommend using.

### Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In this case, we want to be as sure as possible that a duck we flag as defective really is defective, so we evaluate the models using **precision**. Our objective is now to avoid giving away a trip to Hawaii for a duck that has no defect; in other words, the cost of a *False Positive* is much higher than the cost of a False Negative.

In [7]:
positive = "Defect"

model1 = cody[cody.model1== positive]
model1_precision = (model1.model1 == model1.actual).mean()

model2 = cody[cody.model2== positive]
model2_precision = (model2.model2 == model2.actual).mean()

model3 = cody[cody.model3 == positive]
model3_precision = (model3.model3 == model3.actual).mean()



print(f'Precision of model 1 is: {model1_precision:.0%}')
print(f'Precision of model 2 is: {model2_precision:.0%}')
print(f'Precision of model 3 is: {model3_precision:.0%}')



Precision of model 1 is: 80%
Precision of model 2 is: 10%
Precision of model 3 is: 13%


In this case, model 1 has the highest precision (the largest share of its defect predictions are true positives rather than false positives), so it is the model I would recommend using for the Hawaii offer.
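Both duck questions repeat the same two calculations for each model column. A minimal sketch of how that could be folded into one helper follows; `precision_recall`, the toy frame, and the `model_a`/`model_b` column names are hypothetical and only illustrate the idea.

```python
import pandas as pd

def precision_recall(df, model_cols, positive, actual_col="actual"):
    """Compute precision and recall for each prediction column."""
    rows = {}
    actual_pos = df[actual_col] == positive
    for col in model_cols:
        predicted_pos = df[col] == positive
        tp = (predicted_pos & actual_pos).sum()
        rows[col] = {
            # Share of positive predictions that are correct.
            # This is NaN if the model never predicts the positive class.
            "precision": tp / predicted_pos.sum(),
            # Share of actual positives that the model found.
            "recall": tp / actual_pos.sum(),
        }
    return pd.DataFrame(rows).T

# Hypothetical toy data; the values are invented, not taken from c3.csv.
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "No Defect", "No Defect", "Defect", "No Defect"],
    "model_a": ["Defect", "No Defect", "No Defect", "No Defect", "Defect", "No Defect"],
    "model_b": ["Defect", "Defect", "Defect", "Defect", "Defect", "No Defect"],
})

result = precision_recall(toy, ["model_a", "model_b"], positive="Defect")
print(result)
```

In this toy example the cautious `model_a` wins on precision while the eager `model_b` wins on recall, which is the same trade-off the two duck questions explore.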

## 3. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

### Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [8]:
paws = pd.read_csv("gives_you_paws.csv")

In [9]:
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [10]:
paws.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
actual    5000 non-null object
model1    5000 non-null object
model2    5000 non-null object
model3    5000 non-null object
model4    5000 non-null object
dtypes: object(5)
memory usage: 195.4+ KB


In [11]:
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [12]:
paws["baseline"] = "dog"
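The cell above hard-codes "dog" after reading it off `value_counts()`. A small sketch of deriving the most common class programmatically is shown below, using made-up toy labels rather than `gives_you_paws.csv`.

```python
import pandas as pd

# Hypothetical toy labels; the notebook itself uses paws.actual.
toy = pd.DataFrame({"actual": ["dog", "cat", "dog", "dog", "cat"]})

# idxmax() on the counts returns the label with the highest frequency.
most_common = toy.actual.value_counts().idxmax()

# The baseline "model" predicts that label for every row.
toy["baseline"] = most_common

baseline_accuracy = (toy.baseline == toy.actual).mean()
print(most_common, baseline_accuracy)
```

The baseline's accuracy always equals the share of the most common class, which is why it makes a useful floor for the other models.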

### In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [13]:
df = paws

# accuracy -- overall hit rate
model1_accuracy = (df.model1 == df.actual).mean()
model2_accuracy = (df.model2 == df.actual).mean()
model3_accuracy = (df.model3 == df.actual).mean()
model4_accuracy = (df.model4 == df.actual).mean()
baseline_accuracy = (df.baseline == df.actual).mean()


print(f'Model 1 accuracy = {model1_accuracy:.0%}')
print(f'Model 2 accuracy = {model2_accuracy:.0%}')
print(f'Model 3 accuracy = {model3_accuracy:.0%}')
print(f'Model 4 accuracy = {model4_accuracy:.0%}')
print(f'Baseline accuracy = {baseline_accuracy:.0%}')

Model 1 accuracy = 81%
Model 2 accuracy = 63%
Model 3 accuracy = 51%
Model 4 accuracy = 74%
Baseline accuracy = 65%


In terms of accuracy, model1 and model4 performed better than the baseline.

### Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

Positive = "dog"

* TP: Predict dog picture, and it is a dog picture
* TN: Predict not a picture of a dog (i.e. a cat) and it is a cat picture
* FP: Predict picture of dog, and it wasn't 
* FN: Predicted not a picture of a dog, and it was a picture of a dog

For Phase I, missing dog pictures is more costly than mislabeling some cat pictures as dogs, because the tagged photos are reviewed again in Phase II. We want to capture as many of the actual dog pictures as possible, so we evaluate on **recall**.

In [14]:
positive = "dog"

subset = paws[paws.actual==positive]
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
model4_recall = (subset.model4 == subset.actual).mean()


print(f'model 1 recall = {model1_recall:.0%}')
print(f'model 2 recall = {model2_recall:.0%}')
print(f'model 3 recall = {model3_recall:.0%}')
print(f'model 4 recall = {model4_recall:.0%}')

model 1 recall = 80%
model 2 recall = 49%
model 3 recall = 51%
model 4 recall = 96%


Model 4 would be the best model for Phase I. For Phase II, we need the pictures we label as dogs to really be dogs, because it would be pretty bad if a cat picture showed up in a dog stream, so we want as few False Positives as possible. In other words, it is better to predict that a dog picture is not a dog picture than to predict that a cat picture is a dog picture. As such, here we evaluate on **precision**.

In [15]:
positive = "dog"

model1 = paws[paws.model1== positive]
model1_precision = (model1.model1 == model1.actual).mean()

model2 = paws[paws.model2== positive]
model2_precision = (model2.model2 == model2.actual).mean()

model3 = paws[paws.model3 == positive]
model3_precision = (model3.model3 == model3.actual).mean()

model4 = paws[paws.model4 == positive]
model4_precision = (model4.model4 == model4.actual).mean()



print(f'Precision of model 1 is: {model1_precision:.0%}')
print(f'Precision of model 2 is: {model2_precision:.0%}')
print(f'Precision of model 3 is: {model3_precision:.0%}')
print(f'Precision of model 4 is: {model4_precision:.0%}')




Precision of model 1 is: 89%
Precision of model 2 is: 89%
Precision of model 3 is: 66%
Precision of model 4 is: 73%


In this case, model 1 or model 2 would be the better choice: both have the highest precision (89%), meaning the largest share of their dog predictions are correct.

### Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

Objective: Predict that an image is a cat or not

* TP: Predicted cat image, and it was a cat image
* TN: Predicted not a cat image, and it wasn't a cat image
* FP: Predicted cat image, and it wasn't a cat image
* FN: Predicted not a cat image, and it was a cat image


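As with the dog team, the goal in Phase I is to capture as many cat images as possible, and in Phase II to make sure the images labeled as cats really are cats. I would therefore use recall for the first phase and precision for the second phase.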

In [16]:
positive = "cat"

subset = paws[paws.actual==positive]
model1_recall = (subset.model1 == subset.actual).mean()
model2_recall = (subset.model2 == subset.actual).mean()
model3_recall = (subset.model3 == subset.actual).mean()
model4_recall = (subset.model4 == subset.actual).mean()


print(f'model 1 recall = {model1_recall:.0%}')
print(f'model 2 recall = {model2_recall:.0%}')
print(f'model 3 recall = {model3_recall:.0%}')
print(f'model 4 recall = {model4_recall:.0%}')

model 1 recall = 82%
model 2 recall = 89%
model 3 recall = 51%
model 4 recall = 35%


Model 2 has the highest recall (it identifies the largest share of the actual cat pictures), which makes it the best choice for Phase I.

In [17]:
positive = "cat"

model1 = paws[paws.model1== positive]
model1_precision = (model1.model1 == model1.actual).mean()

model2 = paws[paws.model2== positive]
model2_precision = (model2.model2 == model2.actual).mean()

model3 = paws[paws.model3 == positive]
model3_precision = (model3.model3 == model3.actual).mean()

model4 = paws[paws.model4 == positive]
model4_precision = (model4.model4 == model4.actual).mean()



print(f'Precision of model 1 is: {model1_precision:.0%}')
print(f'Precision of model 2 is: {model2_precision:.0%}')
print(f'Precision of model 3 is: {model3_precision:.0%}')
print(f'Precision of model 4 is: {model4_precision:.0%}')

Precision of model 1 is: 69%
Precision of model 2 is: 48%
Precision of model 3 is: 36%
Precision of model 4 is: 81%


Model 4 has the highest precision (81% of its cat predictions are correct), so it would be the best choice for Phase II.

## 4. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

## Accuracy

* sklearn.metrics.accuracy_score

Using sklearn to calculate accuracy for all models.

In [18]:
import sklearn
from sklearn import metrics

In [19]:
sklearn.metrics.accuracy_score(paws.actual, paws.model1)

0.8074

In [20]:
paws.apply(lambda col: sklearn.metrics.accuracy_score(paws.actual, col))

actual      1.0000
model1      0.8074
model2      0.6304
model3      0.5096
model4      0.7426
baseline    0.6508
dtype: float64
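`accuracy_score` gives the same numbers as the manual `(predicted == actual).mean()` used earlier. The sketch below checks that on made-up toy data and drops the `actual` column first so it is not compared with itself; the toy frame is an assumption for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy data; the values are invented, not taken from gives_you_paws.csv.
toy = pd.DataFrame({
    "actual": ["dog", "cat", "dog", "dog"],
    "model1": ["dog", "cat", "cat", "dog"],
    "model2": ["cat", "cat", "dog", "cat"],
})

# Keep only the prediction columns.
predictions = toy.drop(columns="actual")

by_sklearn = predictions.apply(lambda col: accuracy_score(toy.actual, col))
by_pandas = predictions.apply(lambda col: (col == toy.actual).mean())

print(by_sklearn)
```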

## Precision

* sklearn.metrics.precision_score

Working with the dog team first, we compute precision for every model, where the positive class is "dog".

In [21]:
subgroup = paws[paws.model1 == "dog"]

sklearn.metrics.precision_score(subgroup.actual, subgroup.model1, pos_label="dog")

0.8900238338440586

In [22]:
paws.apply(lambda col: sklearn.metrics.precision_score(paws.actual, col, pos_label="dog"))

actual      1.000000
model1      0.890024
model2      0.893177
model3      0.659888
model4      0.731249
baseline    0.650800
dtype: float64

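Now for the cat team, where the positive class is "cat". The baseline never predicts "cat", so its precision for this class is undefined; sklearn reports it as 0 and emits a warning.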

In [23]:
paws.apply(lambda col: sklearn.metrics.precision_score(paws.actual, col, pos_label = "cat"))


actual      1.000000
model1      0.689772
model2      0.484122
model3      0.358347
model4      0.807229
baseline    0.000000
dtype: float64
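The baseline's 0 for "cat" comes from dividing by zero positive predictions, not from wrong predictions. As a sketch, assuming scikit-learn 0.22 or newer, the `zero_division` parameter of `precision_score` makes that choice explicit; the toy lists below are made up.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels; the values are invented.
actual = ["dog", "cat", "dog", "dog"]
baseline = ["dog"] * 4  # a baseline that never predicts "cat"

# No "cat" predictions means TP + FP = 0, so precision is undefined.
# zero_division=0 returns 0.0 for that case without emitting the warning.
cat_precision = precision_score(actual, baseline, pos_label="cat", zero_division=0)

# For "dog" the baseline predicts positive on every row, so precision
# equals the share of rows that are actually dogs.
dog_precision = precision_score(actual, baseline, pos_label="dog")

print(cat_precision, dog_precision)
```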

## Recall

* sklearn.metrics.recall_score

As above, we start with the dog team, now evaluating on recall.

In [24]:
sklearn.metrics.recall_score(paws.actual, paws.model1, pos_label ="dog")

0.803318992009834

In [25]:
paws.apply(lambda col: sklearn.metrics.recall_score(paws.actual, col, pos_label = "dog"))

actual      1.000000
model1      0.803319
model2      0.490781
model3      0.508605
model4      0.955747
baseline    1.000000
dtype: float64

Now for the cat team - where positive is "cat"

In [26]:
paws.apply(lambda col: sklearn.metrics.recall_score(paws.actual, col, pos_label="cat"))

actual      1.000000
model1      0.815006
model2      0.890607
model3      0.511455
model4      0.345361
baseline    0.000000
dtype: float64

## Classification Report

* sklearn.metrics.classification_report

Build a text report showing the main classification metrics.

In [27]:
scores = sklearn.metrics.classification_report(paws.actual, paws.model1, output_dict = True)

In [28]:
import pprint

pprint.pprint(scores)

{'accuracy': 0.8074,
 'cat': {'f1-score': 0.7471777369388292,
         'precision': 0.6897721764420747,
         'recall': 0.8150057273768614,
         'support': 1746},
 'dog': {'f1-score': 0.8444516233241802,
         'precision': 0.8900238338440586,
         'recall': 0.803318992009834,
         'support': 3254},
 'macro avg': {'f1-score': 0.7958146801315047,
               'precision': 0.7898980051430666,
               'recall': 0.8091623596933477,
               'support': 5000},
 'weighted avg': {'f1-score': 0.8104835821984157,
                  'precision': 0.8200959550792857,
                  'recall': 0.8074,
                  'support': 5000}}
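The remaining entries in the report can be derived from the per-class precision, recall, and support. The sketch below recomputes them from the values printed above, using the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over classes, weighted average as the mean weighted by support); the helper name `f1` is only illustrative.

```python
# Per-class values copied from the classification report for model1.
report = {
    "cat": {"precision": 0.6897721764420747, "recall": 0.8150057273768614, "support": 1746},
    "dog": {"precision": 0.8900238338440586, "recall": 0.803318992009834, "support": 3254},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(row["precision"], row["recall"]) for label, row in report.items()}

# Macro average: unweighted mean over the classes.
macro_precision = sum(row["precision"] for row in report.values()) / len(report)

# Weighted average: mean over the classes, weighted by support
# (the number of actual rows in each class).
total_support = sum(row["support"] for row in report.values())
weighted_precision = (
    sum(row["precision"] * row["support"] for row in report.values()) / total_support
)

print(f1_scores)
print(macro_precision, weighted_precision)
```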


In [29]:
print(sklearn.metrics.classification_report(paws.actual, paws.model1))

              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000



In [30]:
for i in range(4):
    print(f" Model {i + 1} ")
    # classification_report returns a string, so a single print is enough
    print(sklearn.metrics.classification_report(paws.actual, paws[f"model{i + 1}"]))
    print("--------------------------")

 Model 1 
              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000

--------------------------
 Model 2 
              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89      0.49      0.63      3254

    accuracy                           0.63      5000
   macro avg       0.69      0.69      0.63      5000
weighted avg       0.75      0.63      0.63      5000

--------------------------
 Model 3 
              precision    recall  f1-score   support

         cat       0.36      0.51      0.42      1746
         dog       0.66      0.51      0.57      3254

    accuracy                           0.51      5000
   macro avg       0.51      0.51 