In [1]:
import pandas as pd

# Exercises:

## 2. Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

### a. In the context of this problem, what is a false positive?

A false positive would be predicting that it's a dog, but it is actually a cat. 

### b. In the context of this problem, what is a false negative?

A false negative would be predicting that it's a cat, but it is actually a dog.

### c. How would you describe this model?

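One reasonable reading is to summarize the model with the standard metrics. Treating dog as the positive class, accuracy is (46 + 34) / 100 = 0.80, precision is 46 / (46 + 13) ≈ 0.78, and recall is 46 / (46 + 7) ≈ 0.87. The model is fairly accurate and misses few dogs, but it mislabels cats as dogs (13) more often than dogs as cats (7).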
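The by-hand arithmetic for this confusion matrix can be checked with a few lines of Python. This is only a sketch: it assumes "dog" is the positive class, and the variable names are illustrative.

```python
# Counts read off the confusion matrix above, with "dog" as the positive class.
tp = 46  # actual dog, predicted dog
fn = 7   # actual dog, predicted cat
fp = 13  # actual cat, predicted dog
tn = 34  # actual cat, predicted cat

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted dogs, how many are dogs
recall = tp / (tp + fn)                     # of the actual dogs, how many were found

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
```

Choosing "cat" as the positive class instead would swap the roles of `fp` and `fn` and of `tp` and `tn`.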

## 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

## Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found [here](https://ds.codeup.com/data/c3.csv).

## Use the predictions dataset and pandas to help answer the following questions:

In [2]:
#Importing the dataset:

defects_df = pd.read_csv('cody_defects_data.csv')
defects_df

| Unnamed: 0 | actual    | model1    | model2    | model3    |
|-----------:|:----------|:----------|:----------|:----------|
| 0          | No Defect | No Defect | Defect    | No Defect |
| 1          | No Defect | No Defect | Defect    | Defect    |
| 2          | No Defect | No Defect | Defect    | No Defect |
| 3          | No Defect | Defect    | Defect    | Defect    |
| 4          | No Defect | No Defect | Defect    | No Defect |
| ...        | ...       | ...       | ...       | ...       |
| 195        | No Defect | No Defect | Defect    | Defect    |
| 196        | Defect    | Defect    | No Defect | No Defect |
| 197        | No Defect | No Defect | No Defect | No Defect |
| 198        | No Defect | No Defect | Defect    | Defect    |


### a. An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

To catch as many defective ducks as possible, we want to minimize false negatives (defective ducks predicted as "No Defect") and can tolerate some false positives. With "Defect" as the positive class, that makes recall the appropriate metric.

In [3]:
#Determining which model is best:

models = ['model1', 'model2', 'model3']

for model in models:
    true_positive = defects_df[(defects_df[model] == "Defect") & (defects_df['actual'] == "Defect")].shape[0]
    false_negative = defects_df[(defects_df[model] == "No Defect") & (defects_df['actual'] == "Defect")].shape[0]
    recall = true_positive / (true_positive + false_negative)
    print(f"Recall for {model} is {recall}. \n")

Recall for model1 is 0.5. 

Recall for model2 is 0.5625. 

Recall for model3 is 0.8125. 



The above shows that the best model to use would be Model 3.

### b. Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The PR team's main concern is false positives: predicting a defect when the duck is fine means paying for a trip to Hawaii that wasn't owed. That points to precision as the primary metric. Missed defects (false negatives) still lead to complaints, so F1, which balances precision and recall, is a useful secondary check. Both are computed below.

In [15]:
for model in models:
    true_positive = defects_df[(defects_df[model] == "Defect") & (defects_df['actual'] == "Defect")].shape[0]
    true_negative = defects_df[(defects_df[model] == "No Defect") & (defects_df['actual'] == "No Defect")].shape[0]
    false_negative = defects_df[(defects_df[model] == "No Defect") & (defects_df['actual'] == "Defect")].shape[0]
    false_positive = defects_df[(defects_df[model] == "Defect") & (defects_df['actual'] == "No Defect")].shape[0]
    recall = true_positive / (true_positive + false_negative)
    precision = true_positive / (true_positive + false_positive)
    f1 = (2 * ((precision * recall)/(precision + recall)))
    print(f"The f1 score for {model} is {f1}. \n")

The f1 score for model1 is 0.6153846153846154. 

The f1 score for model2 is 0.169811320754717. 

The f1 score for model3 is 0.22608695652173916. 



Based on the above, it appears that the best model is Model 1.

In [5]:
#If we were going for precision:

for model in models:
    true_positive = defects_df[(defects_df[model] == "Defect") & (defects_df['actual'] == "Defect")].shape[0]
    false_positive = defects_df[(defects_df[model] == "Defect") & (defects_df['actual'] == "No Defect")].shape[0]
    precision = true_positive / (true_positive + false_positive)
    print(f"The precision score for {model} is {precision}. \n")

The precision score for model1 is 0.8. 

The precision score for model2 is 0.1. 

The precision score for model3 is 0.13131313131313133. 



If we're going for precision, minimizing false positives, then we would go with Model 1.
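The three loops above repeat the same counting logic. As a minimal sketch, a hypothetical helper named `score_models` (not part of the original notebook) could compute precision, recall, and F1 in one pass. The toy DataFrame below is made up for illustration, so its numbers say nothing about the C3 models.

```python
import pandas as pd

def score_models(df, models, actual="actual", positive="Defect"):
    """Return precision, recall, and F1 per model column, computed by counting."""
    rows = {}
    is_pos = df[actual] == positive
    for model in models:
        pred_pos = df[model] == positive
        tp = (pred_pos & is_pos).sum()   # predicted positive, actually positive
        fp = (pred_pos & ~is_pos).sum()  # predicted positive, actually negative
        fn = (~pred_pos & is_pos).sum()  # predicted negative, actually positive
        precision = tp / (tp + fp) if (tp + fp) else float("nan")
        recall = tp / (tp + fn) if (tp + fn) else float("nan")
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else float("nan")
        rows[model] = {"precision": precision, "recall": recall, "f1": f1}
    return pd.DataFrame(rows).T

# Made-up toy data for illustration only; this is NOT the C3 dataset.
toy = pd.DataFrame({
    "actual": ["Defect", "Defect", "No Defect", "No Defect", "No Defect", "Defect"],
    "model1": ["Defect", "No Defect", "No Defect", "No Defect", "No Defect", "Defect"],
    "model2": ["Defect", "Defect", "Defect", "Defect", "No Defect", "Defect"],
})

scores = score_models(toy, ["model1", "model2"])
print(scores)
```

With the real data, a call such as `score_models(defects_df, models)` would be expected to reproduce the per-model values printed above, assuming the column names match.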

## 4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

## At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I).

## Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

## Several models have already been developed with the data, and you can find their results [here](https://ds.codeup.com/data/gives_you_paws.csv).

## Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [22]:
paws_df = pd.read_csv('gives_you_paws_data.csv')

paws_model = ['model1', 'model2', 'model3', 'model4']
paws_df.head()

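| Unnamed: 0 | actual | model1 | model2 | model3 | model4 |
|-----------:|:-------|:-------|:-------|:-------|:-------|
| 0          | cat    | cat    | dog    | cat    | dog    |
| 1          | dog    | dog    | cat    | cat    | dog    |
| 2          | dog    | cat    | cat    | cat    | dog    |
| 3          | dog    | dog    | dog    | cat    | dog    |
| 4          | cat    | cat    | cat    | dog    | dog    |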


In [11]:
#Creating the baseline model:
paws_df['actual'].value_counts()

baseline_model_score = (paws_df[paws_df['actual'] == 'dog'].shape[0]) / (paws_df.actual.shape[0])
print(f"Since 'dog' is the most common, our baseline model would predict 'dog' every time and have a score of {baseline_model_score}.")

Since 'dog' is the most common, our baseline model would predict 'dog' every time and have a score of 0.6508.
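The cell above hardcodes `'dog'` as the most common class. One possible alternative, sketched here on made-up toy labels rather than the real dataset, is to let pandas pick the most frequent class and store the baseline as its own column, so it can be scored like any other model. The column name `baseline` is an assumption.

```python
import pandas as pd

# Made-up toy labels for illustration only; this is NOT the Gives You Paws dataset.
toy = pd.DataFrame({"actual": ["dog", "cat", "dog", "dog", "cat"]})

# The baseline always predicts the most frequent actual class.
most_common = toy["actual"].value_counts().idxmax()
toy["baseline"] = most_common

# Baseline accuracy is the share of rows whose actual label matches that class.
baseline_accuracy = (toy["baseline"] == toy["actual"]).mean()
print(f"baseline predicts {most_common!r} with accuracy {baseline_accuracy:.2f}")
```

Storing the baseline as a column means the same accuracy loop used for the other models can include it without special handling.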


### a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [34]:
for model in paws_model:
    true_positive = paws_df[(paws_df[model] == "dog") & (paws_df['actual'] == "dog")].shape[0]
    true_negative = paws_df[(paws_df[model] == "cat") & (paws_df['actual'] == "cat")].shape[0]
    false_negative = paws_df[(paws_df[model] == "cat") & (paws_df['actual'] == "dog")].shape[0]
    false_positive = paws_df[(paws_df[model] == "dog") & (paws_df['actual'] == "cat")].shape[0]
    accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
    print(f"The accuracy score for {model} is {accuracy}. \n")

The accuracy score for model1 is 0.8074. 

The accuracy score for model2 is 0.6304. 

The accuracy score for model3 is 0.5096. 

The accuracy score for model4 is 0.7426. 



Based on the above, it appears that models 1 and 4 are better than baseline in terms of accuracy.

### b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

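For Phase I, the automated tagger should catch as many dog pictures as possible, so calculate recall for each model with 'dog' as the positive class and select the highest.

For Phase II, the photos are about to be shown to users, so a wrongly tagged picture is more costly; calculate precision for each model with 'dog' as the positive class and select the highest.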
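A rough sketch of that calculation with pandas follows. The helper `recall_and_precision` and the toy data are hypothetical and only illustrate the counting; they are not results for the real models.

```python
import pandas as pd

def recall_and_precision(df, model, positive, actual="actual"):
    """Count-based recall and precision for one model column.

    Assumes the positive class appears at least once in both the
    actual labels and the model's predictions.
    """
    is_pos = df[actual] == positive
    pred_pos = df[model] == positive
    tp = (is_pos & pred_pos).sum()
    recall = tp / is_pos.sum()       # found positives / all actual positives
    precision = tp / pred_pos.sum()  # found positives / all predicted positives
    return recall, precision

# Made-up toy data for illustration only; this is NOT the Gives You Paws dataset.
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model1": ["dog", "dog", "cat", "cat", "cat", "cat"],
    "model2": ["dog", "dog", "dog", "dog", "cat", "dog"],
})

results = {m: recall_and_precision(toy, m, positive="dog") for m in ["model1", "model2"]}
for m, (recall, precision) in results.items():
    print(f"{m}: recall={recall:.2f}, precision={precision:.2f}")
```

Passing `positive="cat"` instead gives the corresponding numbers for a team that only handles cat pictures.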

### c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

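- positive = 'cat': same approach as in part b (recall for Phase I, precision for Phase II), but with 'cat' as the positive class.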

In [36]:
from sklearn.metrics import classification_report

#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [None]:
report = classification_report(paws_df.actual, paws_df.model1, labels=['cat', 'dog'], output_dict=True)

#This will convert the output into a DataFrame for legibility
pd.DataFrame(report).T

#I think I will need to go back and create separate dataframes, and then run a similar script for each of the models. Model 4 will be the best, if done correctly.

## 5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.
[sklearn.metrics.accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) <br>
[sklearn.metrics.precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) <br>
[sklearn.metrics.recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)<br>
[sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
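
As a minimal sketch of how these four functions might be applied, the block below uses a small made-up DataFrame rather than the real Gives You Paws data. The `pos_label` argument tells `precision_score` and `recall_score` which string label to treat as positive.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Made-up toy data for illustration only; this is NOT the Gives You Paws dataset.
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model1": ["dog", "dog", "cat", "cat", "cat", "cat"],
})

accuracy = accuracy_score(toy["actual"], toy["model1"])
# pos_label names the class treated as positive when labels are strings.
precision = precision_score(toy["actual"], toy["model1"], pos_label="dog")
recall = recall_score(toy["actual"], toy["model1"], pos_label="dog")

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")

# classification_report shows precision and recall for both classes at once.
print(classification_report(toy["actual"], toy["model1"], labels=["cat", "dog"]))
```

On the real data, the same calls would take `paws_df.actual` and each model column in turn, with `pos_label` set to `"dog"` or `"cat"` depending on the team.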
    