# Evaluating Classification Model Performance
There are multiple methods to evaluate classification model performance.

# False Positives and Negatives
We cannot always assume that the classification is correct, even for our training data set.

False Positive (Type 1 Error): Model predicted 1 (positive) but actual was 0 (negative).  
False Negative (Type 2 Error): Model predicted 0 (negative) but actual was 1 (positive).  
True Positive: Model predicted 1 (positive) and actual was 1 (positive).  
True Negative: Model predicted 0 (negative) and actual was 0 (negative).
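
The four outcomes can be counted directly from paired labels. The following is a minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample lists are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 1 and a == 0:
            fp += 1  # predicted positive, actually negative (Type 1 error)
        elif p == 0 and a == 1:
            fn += 1  # predicted negative, actually positive (Type 2 error)
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels, invented only to exercise each branch.
actual = [1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```

A false positive is always a wrong "1" and a false negative is always a wrong "0", which is why the two error branches check the predicted label first.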


### Visual Example
<img src="images/evaluation/false_positives_negatives.png" height="75%" width="75%"></img>
- The "red" data points are the testing data points, classified as either 0 or 1
- The "blue" data points are the predicted classifications of the red (testing) data points
- The "gray" data points are the classification of the blue (predicted) data points

As we can see, some of the predictions for the testing set were incorrect.
- Type 1 (False Positive) Error with data point #3
- Type 2 (False Negative) Error with data point #2

# Confusion Matrix
<img src="images/evaluation/confusion_matrix.png" height="75%" width="75%"></img>

A confusion matrix captures the intuition behind false positives and negatives by counting every combination of actual and predicted class. In this example:
- 40 data points were actually 0
    - 35 data points predicted 0
    - 5 data points predicted 1
- 60 data points were actually 1
    - 10 data points predicted 0
    - 50 data points predicted 1
    
Therefore, there were 5 + 10 = 15 incorrect predictions and 35 + 50 = 85 correct predictions.
- 85% accuracy rate
- 15% error rate
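
The same arithmetic can be written as a short sketch. The nested-list layout (rows are actual classes, columns are predicted classes) is an assumption made for this example, not something the image prescribes.

```python
# Rows are actual classes, columns are predicted classes (assumed layout).
confusion_matrix = [
    [35, 5],   # actual 0: 35 predicted 0, 5 predicted 1
    [10, 50],  # actual 1: 10 predicted 0, 50 predicted 1
]

total = sum(sum(row) for row in confusion_matrix)
# Correct predictions sit on the diagonal.
correct = confusion_matrix[0][0] + confusion_matrix[1][1]
incorrect = total - correct

accuracy_rate = correct / total
error_rate = incorrect / total
print(f"accuracy = {accuracy_rate:.0%}, error = {error_rate:.0%}")
```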

# Precision vs Recall vs Accuracy
<img src="images/evaluation/precision_recall_accuracy.png" height="75%" width="75%"></img>
- https://towardsdatascience.com/precision-vs-recall-386cf9f89488

Precision is the percentage of the model's positive predictions that are actually positive: TP / (TP + FP).  
Recall is the percentage of the actual positives that the model correctly identified: TP / (TP + FN).  
Accuracy is the percentage of all predictions, positive and negative, that are correct: (TP + TN) / total.

### Example of Differences
Let's say a model classifies cats and dogs in a photograph.

The model identifies 18 cats and 18 dogs in a photograph containing 12 dogs and 24 cats.
- Of the 18 identified cats, 12 are actually cats (true positive) and 6 are dogs (false positive)
- Of the 18 identified dogs, 12 are cats (false negative) and 6 are actually dogs (true negative)

The precision of identifying cats is 12 / 18. The recall of identifying cats is 12 / 24.  
The precision of identifying dogs is 6 / 18. The recall of identifying dogs is 6 / 12.  
The accuracy of identifying cats and dogs is 18 / 36.
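
As a rough check of these numbers, the sketch below recomputes them from the counts in the example. The helper names `precision` and `recall` are chosen for illustration; each class is treated as the positive class in turn.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Counts from the example, with "cat" as the positive class.
cat_tp, cat_fp, cat_fn = 12, 6, 12
# The same photograph, with "dog" as the positive class.
dog_tp, dog_fp, dog_fn = 6, 12, 6

cat_precision = precision(cat_tp, cat_fp)  # 12 / 18
cat_recall = recall(cat_tp, cat_fn)        # 12 / 24
dog_precision = precision(dog_tp, dog_fp)  # 6 / 18
dog_recall = recall(dog_tp, dog_fn)        # 6 / 12
accuracy = (cat_tp + dog_tp) / 36          # 18 / 36

print(cat_precision, cat_recall, dog_precision, dog_recall, accuracy)
```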

# Accuracy Paradox
Let's say you have a training set of 100 people, and the model predicts if they have cancer.
- In reality, only 2 of them have cancer

You create 2 classification models.

### Model 1:
Your model simply assumes nobody has cancer.

You find out 2 people actually had cancer. The model therefore made 2 incorrect predictions, so its accuracy is 98%.

### Model 2:
You run a random forest classification, and you predict 5 people have cancer.

You find out 2 of those people actually had cancer, and 3 did not. The model therefore made 3 incorrect predictions, so its accuracy is 97%.

### Comparison of Models
Even though Model 2 is less accurate, it's a far more useful algorithm than just assuming no one has cancer: it flags the people who have cancer, while Model 1 never flags anyone.

In conclusion, accuracy alone is not a reliable measure of a classification model. We need to delve deeper and find out exactly what the model is doing.
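
One way to see the paradox in numbers is to compare accuracy with recall for both models. This is a sketch only: it assumes the 2 people with cancer are among the 5 that Model 2 flagged, which is how the example reads, and the dictionary layout is made up for illustration.

```python
# Counts for 100 people, 2 of whom actually have cancer.
# Model 2's counts assume both cancer cases are among its 5 positive predictions.
models = {
    "Model 1": {"TP": 0, "FP": 0, "FN": 2, "TN": 98},  # predicts nobody has cancer
    "Model 2": {"TP": 2, "FP": 3, "FN": 0, "TN": 95},  # predicts 5 people have cancer
}

results = {}
for name, c in models.items():
    total = c["TP"] + c["FP"] + c["FN"] + c["TN"]
    accuracy = (c["TP"] + c["TN"]) / total
    # Recall: share of the actual cancer cases that the model caught.
    recall = c["TP"] / (c["TP"] + c["FN"])
    results[name] = (accuracy, recall)
    print(f"{name}: accuracy = {accuracy:.0%}, recall = {recall:.0%}")
```

Under that assumption, Model 1 wins on accuracy but catches none of the cancer cases, while Model 2 catches all of them.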

# Cumulative Accuracy Profile (CAP) Curve
A CAP curve shows how much we gain from using a specific model compared with selecting data points at random.
- Hit Ratio: Return on investment

### CAP Curve Example
<img src="images/evaluation/cap_random_scenario.png" height="50%" width="50%"></img>
- https://medium.com/@lotass/classification-models-performance-evaluation-c3a91562793

In the random scenario, 10% of a randomly contacted set of people purchased the product.

#### But can we improve this? Yes, we can!  
1. Inspect your training data, take a group of customers who bought the product, then extract their features (independent variables) such as browsing device, age, salary, etc.  
2. Fit that group data set to a classification model.  
3. Predict who in the group data set would purchase the product.  
4. Contact the people predicted to purchase the product.  
5. Plot the response of the people we contacted as a CAP curve.

<img src="images/evaluation/cap_directed_scenario.png" height="50%" width="50%"></img>
- This CAP curve was determined using a Logistic Regression model

As we can see in this CAP curve, we did a lot better than the random scenario.
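
A CAP curve like this can be sketched by sorting customers from most to least likely to purchase and tracking the cumulative share of buyers reached. The code below is a minimal illustration, not the method behind the image: the function name `cap_curve`, the scores, and the purchase outcomes are all hypothetical.

```python
def cap_curve(scores, purchased):
    """Return (fraction contacted, fraction of buyers captured) points."""
    # Contact customers in order of decreasing predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_buyers = sum(purchased)
    points = [(0.0, 0.0)]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += purchased[i]
        points.append((rank / len(scores), captured / total_buyers))
    return points


# Hypothetical model scores and actual outcomes (1 = purchased) for 10 customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
purchased = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

points = cap_curve(scores, purchased)
for contacted, captured in points:
    print(f"{contacted:.0%} contacted -> {captured:.0%} of buyers reached")
```

In the random scenario the points would fall on the diagonal, so the further the curve bends above that line, the more the model helps.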

#### Let's use a different classification model!
What would the CAP curve look like if we tried a different classification model?

<img src="images/evaluation/svm_lr_cap_curves.png" height="60%" width="60%"></img>

It seems as though the SVM model performs worse than the Logistic Regression model.

### CAP Curve Analysis
The better your model, the larger the area between its CAP curve and the straight line of the random scenario.

However, the area between curves is difficult to measure, so we're going to try an easier approach.

1. Draw a vertical line at the 50% point of the x-axis.  
2. From the point where that line intersects your model's CAP curve, project across to the y-axis to get an X% value.

<img src="images/evaluation/cap_curve_analysis.png" height="60%" width="60%"></img>
- If X < 60% (6000) then you have a rubbish model
- If 60% < X < 70% (7000) then you have a poor model
- If 70% < X < 80% (8000) then you have a good model
- If 80% < X < 90% (9000) then you have a very good model
- If 90% < X < 100% (10,000) then your model is too good to be true!
    - This usually happens due to overfitting to the training group data set
        - The model predicts well for data points from the training group data set, but poorly for unseen data points
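
The rule of thumb above can be written as a small lookup. This is a sketch under one stated assumption: the list leaves the exact boundary values (60%, 70%, and so on) undefined, so here each boundary is assigned to the higher band. The function name `rate_model` is made up for illustration.

```python
def rate_model(captured_at_half):
    """Rate a model from its CAP value at x = 50%, given as a fraction in [0, 1]."""
    # Assumption: a value exactly on a boundary falls into the higher band.
    if captured_at_half < 0.6:
        return "rubbish"
    if captured_at_half < 0.7:
        return "poor"
    if captured_at_half < 0.8:
        return "good"
    if captured_at_half < 0.9:
        return "very good"
    return "too good to be true"


for x in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(f"X = {x:.0%}: {rate_model(x)}")
```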