# Evaluation Metrics
The metrics you choose to evaluate your machine learning model are very important. The choice of metrics influences how the performance of machine learning algorithms is measured and compared.



## Confusion Matrix
The confusion matrix is one of the most intuitive and easiest tools for assessing the correctness and accuracy of a model. It is used for classification problems where the output can be one of two or more classes.

The confusion matrix is a table with two dimensions (__Actual__ and __Predicted__), with the same set of “classes” along both dimensions. Here, the Actual classes are the columns and the Predicted classes are the rows.
 ![1_0exdQRxrXQgIBZdPFIxbTw-300x153.webp](attachment:e3bd3eed-f812-4dc9-917e-8b5124e10c09.webp)

The confusion matrix is not a performance measure in itself, but almost all performance metrics are based on the confusion matrix and the numbers inside it.

__Terminology__
- __True Positives (TP):__ Correctly predicted positive instances.

Ex: A person actually has cancer (1) and the model classifies their case as cancer (1). This is a True Positive.

- __True Negatives (TN):__  Correctly predicted negative instances.

Ex: A person does NOT have cancer and the model classifies their case as no cancer. This is a True Negative.

- __False Positives (FP):__ Incorrectly predicted positive instances (Type I error).

Ex: A person does NOT have cancer but the model classifies their case as cancer. This is a False Positive.

- __False Negatives (FN):__ Incorrectly predicted negative instances (Type II error).

Ex: A person has cancer but the model classifies their case as no cancer. This is a False Negative.
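The four counts can be tallied directly from a list of actual labels and a list of predicted labels. The snippet below is a minimal sketch in plain Python; the function name `confusion_counts` and the sample labels are hypothetical and chosen only for illustration (1 = cancer, 0 = no cancer).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels, made up for this example
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 5, 'FP': 1, 'FN': 1}
```

In practice a library routine such as scikit-learn's `confusion_matrix` is commonly used instead. Note that it places the actual labels on the rows and the predicted labels on the columns, so always check which orientation a given table uses.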



## Accuracy
Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.

![1_5XuZ_86Rfce3qyLt7XMlhw.webp](attachment:9f6d453c-1ed8-4e6a-a7df-84ad85c17ffe.webp)

Accuracy is the most straightforward metric. It represents the proportion of correctly classified instances out of the total number of instances. In terms of the confusion matrix, Accuracy = (TP + TN) / (TP + TN + FP + FN).

__Use Case:__ Useful when classes are balanced (roughly equal number of instances in each class) and when all types of errors are equally important.

__Limitation:__ Can be misleading with imbalanced datasets. A model that always predicts the majority class can achieve high accuracy even if it performs poorly on the minority class.
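The limitation is easy to reproduce. The sketch below uses a made-up dataset with 95 negatives and 5 positives, and a "model" that always predicts the majority class; the helper `accuracy` and all the numbers are assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class
y_majority = [0] * len(y_true)

acc = accuracy(y_true, y_majority)
positives_found = sum(
    1 for actual, predicted in zip(y_true, y_majority) if actual == 1 and predicted == 1
)

print(acc)              # 0.95
print(positives_found)  # 0
```

The accuracy looks high even though the model never identifies a single positive case, which is why accuracy alone is not enough on imbalanced data.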

## Precision
Precision tells us what proportion of the patients we diagnosed as having cancer actually had cancer. The predicted positives (people predicted as having cancer) are TP + FP, and the ones among them who actually have cancer are TP, so Precision = TP / (TP + FP).

![1_KhlD7Js9leo0B0zfsIfAIA-300x224.webp](attachment:ace75922-4e9e-4f61-8dd9-bd9bf8d76c07.webp)

Imagine a fishing net.

- __True Positives:__ The fish you actually catch (the things you wanted to catch).
- __False Positives:__ The seaweed, rocks, and other debris you accidentally catch in the net (things you didn't want).

__Precision is like the "purity" of your catch.__ A high precision means you caught mostly fish and very little debris.

__When is High Precision Important?__

High precision is crucial in situations where the cost or consequence of a false positive is significant. Here are some detailed examples:


1. __Spam Email Detection:__
    - __Goal:__ Identify spam emails and filter them out of the inbox.
    - __Consequence of a False Positive (FP):__ A legitimate email (e.g., from a friend, a work colleague, or an important service) is marked as spam and might be missed by the user. This can lead to missed opportunities, damaged relationships, or other problems.
    - __Consequence of a False Negative (FN):__ A spam email gets through to the inbox. This is generally less harmful, as most users are accustomed to manually deleting occasional spam.
    - __Priority:__ Minimize False Positives (FP). We want to be very sure that when we mark an email as spam, it truly is spam. Therefore, we prioritize high precision.

2. __Search Engines:__
    - __Goal:__ Return relevant search results to the user's query.
    - __Consequence of a False Positive (FP):__ The search engine shows irrelevant results. This is annoying for the user but usually not catastrophic.
    - __Consequence of a False Negative (FN):__ The search engine doesn't show relevant results. This is also bad, but users can often refine their search or try another search engine.
    - __Priority:__ Often a balance between precision and recall is desired, but in many cases, especially when dealing with very large datasets, precision is more important. Users are more forgiving of missing a few relevant results than having a lot of irrelevant ones.

3. __Fraud Detection (High-Value Transactions):__
    - __Goal:__ Identify fraudulent transactions (e.g., credit card fraud).
    - __Consequence of a False Positive (FP):__ A legitimate transaction is flagged as fraudulent, causing inconvenience to the customer (e.g., card being blocked).
    - __Consequence of a False Negative (FN):__ A fraudulent transaction goes undetected, leading to financial loss for the customer or the company.
    - __Priority:__ For high-value transactions, minimizing FPs is often prioritized to avoid disrupting legitimate customers, although the cost of a missed fraud (FN) still has to be weighed against this.


In summary, precision is about the correctness of your positive predictions. Use it when the cost of a false positive is high. It's about being "precise" in your positive predictions, even if you miss some actual positive cases.
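As a small worked example, precision can be computed from the TP and FP counts alone. This is a sketch under stated assumptions: the spam-filter numbers are invented, and returning 0.0 when nothing was predicted positive is a convention chosen here, not something the definition itself fixes.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP).

    Returns 0.0 when there are no positive predictions at all
    (an assumed convention; the ratio is undefined in that case).
    """
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical spam filter: 40 emails flagged as spam,
# 38 of them really spam (TP) and 2 legitimate (FP)
p = precision(tp=38, fp=2)
print(p)  # 0.95
```

Here 95% of the flagged emails were truly spam, and the remaining 5% are legitimate emails the user might miss.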

## Recall or Sensitivity
Recall tells us what proportion of the patients who actually had cancer were diagnosed by the algorithm as having cancer. The actual positives (people who have cancer) are TP + FN, and the ones among them that the model diagnosed as having cancer are TP, so Recall = TP / (TP + FN).

![1_a8hkMGVHg3fl4kDmSIDY_A-300x192.webp](attachment:53d5a693-efa9-4f15-af6e-509f7a6c3920.webp)

Using the same fishing net analogy:
- __True Positives:__ The fish you actually catch (the things you wanted to catch).
- __False Negatives:__ The fish that escape the net (the things you failed to catch).

__Recall is about how much of what you were looking for you actually "recalled" or retrieved.__ A high recall means you caught most of the fish, even if you also caught some debris.

__When is High Recall Important?__

High recall is crucial in situations where the cost or consequence of a false negative is significant. Here are some detailed examples:
1. __Medical Diagnosis (Disease Detection):__
    - __Goal:__ Identify patients with a specific disease (e.g., cancer, a contagious infection).
    - __Consequence of a False Negative (FN):__ A patient with the disease is not diagnosed, leading to delayed treatment, disease progression, and potentially severe consequences, including death.
    - __Consequence of a False Positive (FP):__ A healthy patient is incorrectly diagnosed, leading to unnecessary anxiety, further tests, and potentially unnecessary treatment.
    - __Priority:__ Minimize False Negatives (FN). It's more important to catch all or most of the actual cases of the disease, even if it means some healthy individuals undergo further testing.  
2. __Security and Surveillance (Threat Detection):__
    - __Goal:__ Identify potential threats (e.g., weapons at an airport, intrusions in a network).
    - __Consequence of a False Negative (FN):__ A real threat goes undetected, potentially leading to catastrophic consequences (e.g., a terrorist attack, a data breach).
    - __Consequence of a False Positive (FP):__ A false alarm triggers an investigation or security procedure. This is inconvenient and costly but generally less severe than missing a real threat.
    - __Priority:__ Minimize False Negatives (FN). It's crucial to catch all or most of the real threats, even if it means some false alarms.
3. __Customer Churn Prediction (Focus on Retention):__
    - __Goal:__ Identify customers who are likely to stop using a service (churn).
    - __Consequence of a False Negative (FN):__ A customer who is about to churn is not identified, and no retention efforts are made, leading to lost revenue.
    - __Consequence of a False Positive (FP):__ A loyal customer is targeted with unnecessary retention offers, potentially annoying them.
    - __Priority:__ In this context, minimizing FNs is often more important. It's better to offer retention incentives to some loyal customers unnecessarily than to miss customers who are about to leave.

In summary, recall is about capturing as many of the actual positive instances as possible. Use it when the cost of missing a positive case (a false negative) is high. It's about being "sensitive" to positive cases, even if it means having some false alarms.
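Recall can be computed the same way from the TP and FN counts. The sketch below uses invented screening numbers, and returning 0.0 when there are no actual positives is an assumed convention for the undefined case.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN).

    Returns 0.0 when there are no actual positives
    (an assumed convention; the ratio is undefined in that case).
    """
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical screening test: 50 patients have the disease,
# 45 are detected (TP) and 5 are missed (FN)
r = recall(tp=45, fn=5)
print(r)  # 0.9
```

The model finds 90% of the patients who have the disease; the other 10% are the false negatives that a high-recall system tries to minimize.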

## F1 Score
The F1-score is a way to balance precision and recall. It's especially useful when you have imbalanced datasets (where one class has significantly more samples than the other) or when both false positives and false negatives have important consequences.
![1_YjBz5UyU04AqDtS-EagBVw.webp](attachment:48312ffe-f7fc-48b5-8fd5-4173f3254c5a.webp)

The F1-score is the harmonic mean of precision and recall. The harmonic mean is used instead of a simple average because it is pulled toward the smaller of the two values. This means the F1-score will be low if either precision or recall is low.

Mathematically, it's calculated as:

__F1-score = 2 * (Precision * Recall) / (Precision + Recall)__  


Let's break it down:
- __Precision:__ TP / (TP + FP) - How many of the predicted positives were actually positive?
- __Recall:__ TP / (TP + FN) - How many of the actual positives did we correctly predict?

__Why the Harmonic Mean?__

Imagine you have two models:
- __Model A:__ Precision = 1.0 (perfect precision), Recall = 0.1
- __Model B:__ Precision = 0.1, Recall = 1.0 (perfect recall)


If you use a simple average:
- Average of Model A = (1.0 + 0.1) / 2 = 0.55
- Average of Model B = (0.1 + 1.0) / 2 = 0.55

The simple average gives both models the same, fairly high score, even though they are drastically different. Model A is useless because it doesn't find most of the positive cases, and Model B is also bad because it produces many false alarms.

Now, let's calculate the F1-score:
- F1-score of Model A = 2 * (1.0 * 0.1) / (1.0 + 0.1) ≈ 0.18
- F1-score of Model B = 2 * (0.1 * 1.0) / (0.1 + 1.0) ≈ 0.18

Both models now get a low F1-score, correctly reflecting their poor performance.
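The comparison between the simple average and the F1-score can be checked with a few lines of Python. This is a minimal sketch; the function name `f1_score` is chosen here for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall.

    Returns 0.0 when both are zero (an assumed convention to avoid
    dividing by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The two models from the text: (precision, recall)
models = {"Model A": (1.0, 0.1), "Model B": (0.1, 1.0)}

for name, (p, r) in models.items():
    simple_average = (p + r) / 2
    print(f"{name}: average={simple_average:.2f}, F1={f1_score(p, r):.2f}")
# Model A: average=0.55, F1=0.18
# Model B: average=0.55, F1=0.18
```

When precision and recall are equal, the F1-score equals that common value (for example, `f1_score(0.5, 0.5)` is 0.5), so the penalty only appears when the two metrics diverge.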

__When is the F1-Score Important?__

The F1-score is most useful when:

1. __You need to balance precision and recall:__ You care about both minimizing false positives and minimizing false negatives.
2. __You have an imbalanced dataset:__ Accuracy can be misleading on imbalanced datasets. When the positive class is the minority class, the F1-score gives a more realistic picture of the model's performance on it.

__Examples:__
1. __Information Retrieval (Search Engines):__ While sometimes precision is prioritized, often a good balance between precision and recall is needed to provide relevant results without missing too many important ones. The F1-score can be used to optimize this balance.

2. __Bioinformatics:__ Identifying specific proteins or genes related to a disease. Both false positives (incorrectly identifying a protein) and false negatives (missing a crucial protein) can have serious consequences. The F1-score helps to find a model that balances these risks.

3. __Natural Language Processing (NLP) Tasks (Named Entity Recognition):__ Identifying named entities (people, organizations, locations) in text. You want to correctly identify as many entities as possible (high recall) without incorrectly labeling words as entities (high precision).


__How to Interpret the F1-Score:__
- The F1-score ranges from 0 to 1.
- 1 is the best possible score (perfect precision and recall).
- 0 is the worst possible score (the model has no true positives).

The F1-score is a valuable metric when you need to consider both precision and recall. It's particularly useful for imbalanced datasets and situations where both types of errors have significant costs. By using the harmonic mean, it penalizes models that have a high value in one metric but a low value in the other, encouraging a balanced performance.