# Model Evaluation Exercises

In [None]:
# Do these exercises using the manual method, then go back using sklearn's functions as the 2nd half of the exercise.

In [None]:
# These exercises mostly reuse the code shown in the lesson, plus some DataFrame subsetting.

In [None]:
import pandas as pd

### 1. Given the following confusion matrix, evaluate (by hand) the model's performance.


|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

In [None]:
total_cats = 34 + 13
total_dogs = 7 + 46

print(total_cats)
print(total_dogs)

In [None]:
# There are 47 total actual cats, and 53 total actual dogs.

In [None]:
# What is the positive case?
# By convention the upper-left cell holds the true positives, so the positive class in this scenario is: cat.

- In the context of this problem, what is a false positive?

In [None]:
# Given the way the confusion matrix is set up, with cat as the positive class:
# A false positive is one of the 7 animals predicted to be a cat that was actually a dog.

- In the context of this problem, what is a false negative?

In [None]:
# A false negative is one of the 13 actual cats that the model predicted to be a dog.

- How would you describe this model?

In [None]:
# The model is fairly accurate: 80 of its 100 predictions are correct, for an accuracy of 80%.

In [3]:
true_positives = 34
true_negatives = 46
false_positives = 7
false_negatives = 13

accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)

recall = true_positives / (true_positives + false_negatives)

precision = true_positives / (true_positives + false_positives)

specificity = true_negatives / (true_negatives + false_positives)

print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("Specificity:", specificity)

Accuracy: 0.8
Recall: 0.723404255319149
Precision: 0.8292682926829268
Specificity: 0.8679245283018868
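
To put the 80% accuracy in context, it can help to compare it with a baseline that always predicts the majority class. The sketch below is a minimal check that uses only the counts from the confusion matrix in this problem; the variable names are illustrative.

```python
# Counts taken from the confusion matrix in problem 1.
actual_cats = 34 + 13   # predicted cat + predicted dog, actually cat
actual_dogs = 7 + 46    # predicted cat + predicted dog, actually dog
total = actual_cats + actual_dogs

# A baseline that always predicts the most common class (dog here).
baseline_accuracy = max(actual_cats, actual_dogs) / total

# The model's accuracy: correct predictions over all predictions.
model_accuracy = (34 + 46) / total

print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"   model accuracy: {model_accuracy:.2%}")
```

Under these counts the model beats the always-dog baseline by a wide margin, which supports describing it as reasonably good rather than merely matching the class balance.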


### 2. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [80]:
# First off, need to define what the positive case and negative case are. What's the target?

# The target: we are trying to predict which ducks are defective for the internal team.
# Thus, my "positive" case is a defective duck.
# So, a True positive would be correctly predicting that a duck was defective. 
# A True Negative would be correctly predicting a duck was not defective.

In [152]:
# Loading in the prior models:

ducks = pd.read_csv('c3.csv')
ducks.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


#### Which evaluation metric would be appropriate here?

Based on the wording of the question, the goal is to identify as many defective ducks as possible. Thus, the **Recall/Sensitivity** metric should be employed, because a True Positive is a defective duck, which is what the internal team is trying to look for. We want the model with the highest True Positive rate.

The other way of looking at this is to treat non-defective ducks as the positive class. In that case, a correctly flagged defective duck is a True Negative, and the metric we'd want to employ would be specificity, which is the percentage of true negatives out of all actual negatives, or "Recall for the negative class".
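
The equivalence between specificity and "recall for the negative class" can be checked with a few lines of arithmetic. This is a small sketch that uses the model 1 counts from the confusion matrix shown in this notebook; the variable names are illustrative.

```python
# Model 1 counts with "Defect" as the positive class.
tp, fp, fn, tn = 8, 2, 8, 182

recall_defect = tp / (tp + fn)        # share of actual defects caught
specificity_defect = tn / (tn + fp)   # share of actual non-defects kept

# Swap the positive class to "No Defect": every count trades places.
tp_swapped, fp_swapped, fn_swapped, tn_swapped = tn, fn, fp, tp
recall_no_defect = tp_swapped / (tp_swapped + fn_swapped)

# Specificity for one class is recall for the other.
print(specificity_defect == recall_no_defect)
```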

#### Which model would be the best fit for this use case?

In [153]:
ducks['baseline_prediction'] = ducks.actual.value_counts().index[0]
ducks.head(2)

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect


In [81]:
# Since I'm focused on sensitivity, I want to find the model that most correctly predicts True positives, or has the closest match to actual defective ducks.

In [82]:
# Recall (sensitivity): TP / (TP + FN)

In [100]:
# Confusion Matrices:

pd.crosstab(ducks.model1, ducks.actual)

| model1 (rows) / actual (columns) | Defect | No Defect |
|:---------------------------------|-------:|----------:|
| Defect                           |      8 |         2 |
| No Defect                        |      8 |       182 |


In [101]:
pd.crosstab(ducks.model2, ducks.actual)

| model2 (rows) / actual (columns) | Defect | No Defect |
|:---------------------------------|-------:|----------:|
| Defect                           |      9 |        81 |
| No Defect                        |      7 |       103 |


In [102]:
pd.crosstab(ducks.model3, ducks.actual)

| model3 (rows) / actual (columns) | Defect | No Defect |
|:---------------------------------|-------:|----------:|
| Defect                           |     13 |        86 |
| No Defect                        |      3 |        98 |
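
Recall can be read straight off these three confusion matrices. The sketch below hard-codes the counts from the crosstabs (with "Defect" as the positive class) and applies TP / (TP + FN); the dictionary layout is just one convenient way to hold the numbers.

```python
# True positives and false negatives from each crosstab,
# taking "Defect" as the positive class.
counts = {
    "model1": {"tp": 8, "fn": 8},
    "model2": {"tp": 9, "fn": 7},
    "model3": {"tp": 13, "fn": 3},
}

# Recall = TP / (TP + FN), i.e. the share of the 16 actual defects caught.
recalls = {name: c["tp"] / (c["tp"] + c["fn"]) for name, c in counts.items()}

for name, value in recalls.items():
    print(f"{name} recall: {value:.2%}")
```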


In [103]:
ducks.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [104]:
ducks.model1.value_counts()

No Defect    190
Defect        10
Name: model1, dtype: int64

In [105]:
ducks.model2.value_counts()

No Defect    110
Defect        90
Name: model2, dtype: int64

In [66]:
ducks.model3.value_counts()

No Defect    101
Defect        99
Name: model3, dtype: int64

#### Model 1 most closely matches the true number of defective ducks.

##### The cost is high if we get a False Negative, thus we want to use the Recall/Sensitivity metric

- We are setting defective duck as the positive case in this scenario, since the sensitivity metric looks only at true positives and false negatives.

I can compare the models against the actual results and come up with an isolated list of true positives and true negatives:

In [154]:
# Set the positive case:
positive = "Defect"

In [156]:
# Creating a simple boolean mask on only the 'actual' column in the duck df:

ducks.actual == positive

0      False
1      False
2      False
3      False
4      False
       ...  
195    False
196     True
197    False
198    False
199    False
Name: actual, Length: 200, dtype: bool

In [155]:
ducks

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect
2,No Defect,No Defect,Defect,No Defect,No Defect
3,No Defect,Defect,Defect,Defect,No Defect
4,No Defect,No Defect,Defect,No Defect,No Defect
...,...,...,...,...,...
195,No Defect,No Defect,Defect,Defect,No Defect
196,Defect,Defect,No Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect,No Defect


In [157]:
# My baseline prediction is that the duck is defective, since that is what was defined above as our positive case.
# In this situation, we do not care about any actual negatives (ie, we don't care about non-defective ducks).
# Technically, I think we should've done this in reverse, but this'll work...

# ducks['baseline_prediction'] = 'Defect'
# ducks['baseline_prediction_p'] = "No Defect"
# ducks

In [158]:
# Now, narrowing the dataframe down to only those rows that match the "Defect" boolean, i.e. those rows in my above mask that returned "True".
# This next step simply creates a new df containing the 16 rows that actually had a defect.
# Boolean masks are very powerful...

subset = ducks[ducks.actual == positive]
subset

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
13,Defect,No Defect,Defect,Defect,No Defect
30,Defect,Defect,No Defect,Defect,No Defect
65,Defect,Defect,Defect,Defect,No Defect
70,Defect,Defect,Defect,Defect,No Defect
74,Defect,No Defect,No Defect,Defect,No Defect
87,Defect,No Defect,Defect,Defect,No Defect
118,Defect,No Defect,Defect,No Defect,No Defect
135,Defect,Defect,No Defect,Defect,No Defect
140,Defect,No Defect,Defect,Defect,No Defect
147,Defect,Defect,No Defect,Defect,No Defect


In [159]:
subset.shape

(16, 5)

In [160]:
subset.actual == subset.model1

13     False
30      True
65      True
70      True
74     False
87     False
118    False
135     True
140    False
147     True
163     True
171    False
176    False
186    False
194     True
196     True
dtype: bool

In [161]:
# Now what I'm doing is figuring out when model1 matched actual results:

model_recall = (subset.actual == subset.model1)
model_recall

# Still 16 rows. Wherever the value is False, the model missed an actual defect, i.e. returned a false negative, which is what we need for the recall metric.

13     False
30      True
65      True
70      True
74     False
87     False
118    False
135     True
140    False
147     True
163     True
171    False
176    False
186    False
194     True
196     True
dtype: bool

In [163]:
subset.actual != subset.model1

13      True
30     False
65     False
70     False
74      True
87      True
118     True
135    False
140     True
147    False
163    False
171     True
176     True
186     True
194    False
196    False
dtype: bool

In [164]:
subset

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
13,Defect,No Defect,Defect,Defect,No Defect
30,Defect,Defect,No Defect,Defect,No Defect
65,Defect,Defect,Defect,Defect,No Defect
70,Defect,Defect,Defect,Defect,No Defect
74,Defect,No Defect,No Defect,Defect,No Defect
87,Defect,No Defect,Defect,Defect,No Defect
118,Defect,No Defect,Defect,No Defect,No Defect
135,Defect,Defect,No Defect,Defect,No Defect
140,Defect,No Defect,Defect,Defect,No Defect
147,Defect,Defect,No Defect,Defect,No Defect


In [165]:
(subset.actual == subset.model1).mean()

0.5

In [166]:
# positive = "Defect"
# subset = ducks[ducks.actual == positive]
model_recall1 = (subset.actual == subset.model1).mean()
baseline_recall = (subset.baseline_prediction == subset.actual).mean()
print("Model 1")
print(f"Model recall: {model_recall1:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 1
Model recall: 50.00%
Baseline recall: 0.00%


In [167]:
model_recall2 = (subset.actual == subset.model2).mean()
baseline_recall = (subset.baseline_prediction == subset.actual).mean()
print("Model 2")
print(f"Model recall: {model_recall2:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 2
Model recall: 56.25%
Baseline recall: 0.00%


In [168]:
model_recall3 = (subset.actual == subset.model3).mean()
baseline_recall = (subset.baseline_prediction == subset.actual).mean()
print("Model 3")
print(f"Model recall: {model_recall3:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 3
Model recall: 81.25%
Baseline recall: 0.00%
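
The three cells above repeat the same steps for each model, so the calculation can also be written as a loop. This is a minimal sketch on a small made-up frame (the `toy` data and the `model_a`/`model_b` column names are hypothetical, not the C3 dataset); it assumes the same subsetting approach of filtering to the actual positives first.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real predictions file.
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "No Defect", "No Defect", "Defect", "No Defect"],
    "model_a": ["Defect", "No Defect", "No Defect", "Defect", "Defect", "No Defect"],
    "model_b": ["Defect", "Defect", "Defect", "Defect", "Defect", "No Defect"],
})

positive = "Defect"

# Recall only looks at rows where the actual value is the positive class.
actual_positives = toy[toy.actual == positive]

# For each model column, the share of actual positives it predicted as positive.
recalls = {
    col: (actual_positives[col] == positive).mean()
    for col in ["model_a", "model_b"]
}

for col, value in recalls.items():
    print(f"{col} recall: {value:.2%}")
```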


### In this case, model 3 would be the model we would use, since it most often correctly identified a defective duck.

### Notably, Model 3 is also the least accurate model overall. If the goal were simple accuracy across defective and non-defective ducks, we would have used Model 1. However, the goal was to find defective ducks, not to classify every duck correctly.
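
The accuracy comparison can be verified from the confusion matrices shown earlier. A short sketch, with the counts copied from the three crosstabs and 200 ducks in total:

```python
# Correct predictions per model: true positives + true negatives,
# read from the diagonal of each crosstab.
correct = {
    "model1": 8 + 182,
    "model2": 9 + 103,
    "model3": 13 + 98,
}
total_ducks = 200

accuracies = {name: n / total_ducks for name, n in correct.items()}

for name, value in accuracies.items():
    print(f"{name} accuracy: {value:.2%}")
```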

## 2b. Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect.

- Which evaluation metric would be appropriate here? 
- Which model would be the best fit for this use case?

##### A False Positive is more costly here than a False Negative, thus we want to use the Precision metric.

Precision: TP / (TP + FP)
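
As a quick check of the formula, precision can be computed by hand from the confusion matrices shown earlier. The sketch below hard-codes the counts from the "Defect" prediction row of each crosstab; the dictionary is only a convenient container.

```python
# True positives and false positives from each crosstab,
# taking "Defect" as the positive class.
counts = {
    "model1": {"tp": 8, "fp": 2},
    "model2": {"tp": 9, "fp": 81},
    "model3": {"tp": 13, "fp": 86},
}

# Precision = TP / (TP + FP), i.e. the share of "Defect" predictions that were right.
precisions = {name: c["tp"] / (c["tp"] + c["fp"]) for name, c in counts.items()}

for name, value in precisions.items():
    print(f"{name} precision: {value:.2%}")
```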

We already know that Model 3 had the highest recall for defective ducks, but we need to do the reverse test now: find out which model over-predicts defective ducks the least.

In [150]:
positive_p = "Defect"

subset_p = ducks[ducks.actual == positive_p]

In [137]:
# subset_p = subset_p.drop(columns = 'baseline_prediction_p')
subset_p

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
13,Defect,No Defect,Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect,Defect
65,Defect,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect,Defect
74,Defect,No Defect,No Defect,Defect,Defect
87,Defect,No Defect,Defect,Defect,Defect
118,Defect,No Defect,Defect,No Defect,Defect
135,Defect,Defect,No Defect,Defect,Defect
140,Defect,No Defect,Defect,Defect,Defect
147,Defect,Defect,No Defect,Defect,Defect


In [138]:
(subset_p.actual == subset_p.model1).mean()

0.5

In [148]:
# Precision looks only at the rows the model predicted as positive.
subset_p = ducks[ducks.model1 == positive_p]

model1_precision = (subset_p.actual == subset_p.model1).mean()

# Baseline agreement with actual, measured on those same rows.
baseline_precision = (subset_p.actual == subset_p.baseline_prediction).mean()
print("Model 1")
print(f"Model precision: {model1_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 1
Model precision: 80.00%
Baseline precision: 80.00%


In [145]:
subset_p = ducks[ducks.model2 == positive_p]

model2_precision = (subset_p.actual == subset_p.model2).mean()
baseline_precision = (subset_p.actual == subset_p.baseline_prediction).mean()
print("Model 2")
print(f"Model precision: {model2_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 2
Model precision: 10.00%
Baseline precision: 10.00%


In [146]:
subset_p = ducks[ducks.model3 == positive_p]

model3_precision = (subset_p.actual == subset_p.model3).mean()
baseline_precision = (subset_p.actual == subset_p.baseline_prediction).mean()
print("Model 3")
print(f"Model precision: {model3_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 3
Model precision: 13.13%
Baseline precision: 13.13%


#### Model 1 has the highest precision, so with Model 1 the PR team would be least likely to reward the owner of a non-defective duck with a free vacation.

#### They should use Model 1

## 3. You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?
- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [1]:
import pandas as pd
df = pd.read_csv('gives_you_paws.csv')
df

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog
...,...,...,...,...,...
4995,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog
4997,dog,cat,cat,dog,dog
4998,cat,cat,cat,cat,dog


In [7]:
class_counts = df.actual.value_counts()
class_counts

dog    3254
cat    1746
Name: actual, dtype: int64

In [12]:
# Use pandas to create a baseline model (i.e. a model that just predicts the most common class):
most_common_class = df.actual.value_counts().idxmax()  # 'dog'
df['baseline_prediction'] = most_common_class
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(f'baseline accuracy: {baseline_accuracy:.2%}')

baseline accuracy: 65.08%


In [37]:
(df.baseline_prediction == df.actual).mean()

0.6508

In [176]:
df

Unnamed: 0,actual,model1,model2,model3,model4,baseline_prediction
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [17]:
# Model 1

df.model1.value_counts()

model_accuracy1 = (df.model1 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(f'   model accuracy: {model_accuracy1:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 80.74%
baseline accuracy: 65.08%


In [18]:
# Model 2

df.model2.value_counts()

model_accuracy2 = (df.model2 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(f'   model accuracy: {model_accuracy2:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 63.04%
baseline accuracy: 65.08%


In [19]:
# Model 3

df.model3.value_counts()

model_accuracy3 = (df.model3 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(f'   model accuracy: {model_accuracy3:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 50.96%
baseline accuracy: 65.08%


In [20]:
# Model 4

df.model4.value_counts()

model_accuracy4 = (df.model4 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(f'   model accuracy: {model_accuracy4:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 74.26%
baseline accuracy: 65.08%


In [182]:
# Ryan's code:

# Programmatically get all the model columns
# .loc[starting_row:ending_row, starting_column:ending_column]

models = df.loc[:, "model1":"baseline_prediction"].columns.tolist()
models

output = []
for model in models:
    
    output.append({
        "model": model,
        "accuracy": (df[model] == df.actual).mean(),
    })


metrics = pd.DataFrame(output)
metrics = metrics.sort_values(by="accuracy", ascending=False, ignore_index=True)
metrics

Unnamed: 0,model,accuracy
0,model1,0.8074
1,model4,0.7426
2,baseline_prediction,0.6508
3,model2,0.6304
4,model3,0.5096


In [183]:
metrics.accuracy.max()

0.8074

In [180]:
# Highest Models:

model_list = pd.DataFrame([model_accuracy1, model_accuracy2, model_accuracy3, model_accuracy4, baseline_accuracy])
model_list

Unnamed: 0,0
0,0.8074
1,0.6304
2,0.5096
3,0.7426
4,0.6508


In [181]:
model_list.max()

# Model 1 has the highest accuracy, and is higher than the baseline model.

0    0.8074
dtype: float64

Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?


Phase I: With dog as the positive class, the automated tagger should catch as many dog pictures as possible, so I'd pick the model with the highest recall for dogs.

Phase II: We'd want precision, since we want to filter down to pictures that are only dogs as much as possible, because human labor is expensive.
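
The Phase I / Phase II reasoning can be sketched with the same subsetting approach used in problem 2. The example below runs on a small made-up frame (the `toy` data, the `model_x`/`model_y` columns, and the `precision_recall` helper are all hypothetical); on the real data the same helper would be called once per model column.

```python
import pandas as pd

def precision_recall(frame, model_col, positive):
    """Return (precision, recall) for one model column and one positive class."""
    predicted_positive = frame[frame[model_col] == positive]
    actual_positive = frame[frame.actual == positive]
    precision = (predicted_positive.actual == positive).mean()
    recall = (actual_positive[model_col] == positive).mean()
    return precision, recall

# Hypothetical toy data, standing in for gives_you_paws.csv.
toy = pd.DataFrame({
    "actual":  ["dog", "dog", "cat", "dog", "cat", "cat"],
    "model_x": ["dog", "cat", "cat", "dog", "dog", "cat"],
    "model_y": ["dog", "dog", "dog", "dog", "dog", "cat"],
})

results = {col: precision_recall(toy, col, "dog") for col in ["model_x", "model_y"]}

for col, (precision, recall) in results.items():
    print(f"{col}: precision={precision:.2%}, recall={recall:.2%}")
```

In this toy frame `model_y` has the higher dog recall (a Phase I candidate) while `model_x` has the higher dog precision (a Phase II candidate), which is the kind of trade-off the question asks about.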

## 4. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

- sklearn.metrics.accuracy_score
- sklearn.metrics.precision_score
- sklearn.metrics.recall_score
- sklearn.metrics.classification_report
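
A minimal sketch of how these four functions might be applied is shown below. It uses short made-up label lists rather than the real dataset, and assumes dog as the positive class via the `pos_label` argument; on the notebook's data, `y_true` would be `df.actual` and `y_pred` one of the model columns.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Hypothetical labels, standing in for df.actual and one model column.
y_true = ["dog", "dog", "cat", "dog", "cat", "cat"]
y_pred = ["dog", "cat", "cat", "dog", "dog", "cat"]

accuracy = accuracy_score(y_true, y_pred)

# With string labels, the positive class has to be named explicitly.
precision = precision_score(y_true, y_pred, pos_label="dog")
recall = recall_score(y_true, y_pred, pos_label="dog")

print(f"accuracy:  {accuracy:.2%}")
print(f"precision: {precision:.2%}")
print(f"recall:    {recall:.2%}")

# classification_report returns a text table with per-class metrics.
report = classification_report(y_true, y_pred)
print(report)
```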