#### Error analysis for classification tasks

* When the ratio of positive to negative examples is very skewed, i.e. far from 50-50 with $y=1$ being the rare class, the usual metrics such as accuracy (and the bias/variance diagnostics based on it) don't work very well: a model that always predicts $y=0$ can reach a very high accuracy while never detecting a single positive. Other metrics, such as the common pair of precision and recall, may work better for assessing model performance.
* Error analysis means manually examining, counting, and categorizing the misclassified samples in the cross validation dataset, based on common traits. It is recommended to invest most of the improvement effort in the most common types of misclassified data. For example, in the case of classifying different types of spam emails:
  * $m_{cv}$ = 500 examples in the cross validation set
  * 60 misclassified: 15 pharmaceutical | 3 misspellings | 7 spam header routing issues | 35 password stealing attempts
  * As recommended, it's better to first examine and improve the handling of password stealing emails, since they account for 35 of the 60 errors.
![image.png](attachment:59e6efc7-c61e-4b48-8844-1d0f02097bc3.png)
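
The tally above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes each misclassified email has already been labeled by hand with exactly one category, and the `labels` list is built artificially from the counts in the example.

```python
from collections import Counter

# Hypothetical hand-assigned categories for the 60 misclassified emails,
# built here from the counts in the example above.
labels = (
    ["pharmaceutical"] * 15
    + ["misspellings"] * 3
    + ["spam header routing"] * 7
    + ["password stealing"] * 35
)

m_cv = 500  # number of examples in the cross validation set

counts = Counter(labels)
n_errors = sum(counts.values())

# Most common categories first: these are the best candidates for improvement.
for category, n in counts.most_common():
    print(f"{category:<22} {n:>3}  {n / n_errors:.1%} of errors")

print(f"overall error rate: {n_errors / m_cv:.1%}")
```

Sorting by frequency makes the priority explicit: the category at the top of the list is where an improvement can remove the most errors.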

#### Confusion matrix 
* A matrix (2 x 2 for binary classification, K x K for K classes) that visualizes which classes are most frequently confused by a classifier.
* In binary classification: it shows how many true positives, false positives, true negatives, and false negatives there are.
* In multi-class classification: it shows which specific class pairs are being confused (e.g., class A often predicted as class B).
  
|                       | **Predicted Positive**  | **Predicted Negative**  |
|-----------------------|-------------------------|-------------------------|
| **Actual Positive**   | **True Positive (TP)**  | **False Negative (FN)** |
| **Actual Negative**   | **False Positive (FP)** | **True Negative (TN)**  |
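
As a rough illustration, the four cells of the binary confusion matrix can be counted directly from the true and predicted labels. The function name `confusion_counts` and the label lists are made up for this sketch and are not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Same layout as the table above: rows = actual, columns = predicted.
print(f"actual positive: TP={tp}  FN={fn}")
print(f"actual negative: FP={fp}  TN={tn}")
```

Libraries may order the rows and columns differently from the table above, so it is worth checking the layout before reading values off a library-generated matrix.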

#### Precision and recall

* Precision = the fraction of true positives among all predicted positives
* High precision = when the model predicts positive, it is usually right, i.e. high confidence in positive predictions
$$ \text{precision} = \frac{\#\ \text{of true positives}}{\#\ \text{of all predicted positives}} = \frac{\#\ \text{of true positives}}{\#\ \text{of true positives} + \#\ \text{of false positives}}$$
* Recall = the fraction of actual positives that the model correctly predicts as positive
* High recall = high probability of detecting an actual positive, i.e. few positives are missed
$$ \text{recall} = \frac{\#\ \text{of true positives}}{\#\ \text{of all actual positives}} = \frac{\#\ \text{of true positives}}{\#\ \text{of true positives} + \#\ \text{of false negatives}}$$
* Depending on the chosen threshold (in logistic regression), we may aim for higher recall or higher precision. Typically, when precision rises, recall falls and vice versa. For example:
  * predicting 1 if $f_{w,b} \geq 0.7$ and 0 if $f_{w,b} < 0.7$ gives higher precision, i.e. higher certainty in positive predictions, but lower recall, i.e. more actual positives are missed
  * a threshold of 0.5 is the usual default; it is a middle ground, but it does not guarantee equal precision and recall
  * a threshold of 0.3 helps us miss fewer actual positives (higher recall), at the cost of more false positives
  * a threshold of 0.9 gives the highest certainty of these thresholds, but also the lowest recall
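
The threshold trade-off can be sketched numerically. The labels and scores below are invented for illustration (the scores stand in for the logistic regression output $f_{w,b}$), and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, scores, threshold):
    """Compute (precision, recall) when predicting 1 if score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    # Precision is undefined when nothing is predicted positive;
    # recall is undefined when there are no actual positives.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall


# Made-up data: 4 actual positives followed by 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.35, 0.75, 0.55, 0.40, 0.20, 0.10, 0.05]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall at every step. On real data the trend is usually the same, though precision is not guaranteed to rise monotonically.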

#### F1 score

* The F1 score is the harmonic mean of precision (P) and recall (R). The harmonic mean emphasizes the smaller of the two values, so a model scores well only when both precision and recall are reasonably high. This makes it a more balanced measure than the arithmetic mean when the two values are very uneven.
$$ F_1 = \frac {1} { \frac{1}{2} ( \frac{1}{P} + \frac{1}{R} ) } =  2\frac{PR}{P + R} $$
* It gives us a way to compare different models by combining precision and recall into one score, which makes the choice easier. For example:

|                       | **Precision (P)**       | **Recall (R)**          | **F1 score**            |
|-----------------------|-------------------------|-------------------------|-------------------------|
| **Model 1**           |   0.5                   |   0.4                   | 0.444 **best model**    |
| **Model 2**           |   0.7 **max precision** |   0.1                   | 0.175                   |
| **Model 3**           |   0.02                  |   1.0  **max recall**   | 0.0392                  |
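
The F1 values in the table can be checked with a short script. This is a minimal sketch; the helper `f1_score` is defined here for illustration, and the arithmetic mean is printed only to show why it would be a poor choice for ranking.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
models = {
    "Model 1": (0.5, 0.4),
    "Model 2": (0.7, 0.1),
    "Model 3": (0.02, 1.0),
}

for name, (p, r) in models.items():
    print(f"{name}: F1 = {f1_score(p, r):.4f}, arithmetic mean = {(p + r) / 2:.4f}")

# Pick the model with the highest F1 score.
best = max(models, key=lambda name: f1_score(*models[name]))
print("best by F1:", best)
```

Note that the arithmetic mean would rank Model 3 highest (0.51), even though its precision is only 0.02, while F1 penalizes the small value and selects Model 1.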