# **Evaluating Models**

| | |
|-|-|
| Author(s) | [Keeyana Jones](https://github.com/keeyanajones/) |

## **Overview**

Evaluating machine learning models is a critical step in the machine learning lifecycle, following model training.  It's how you assess how well your model performs its intended task, whether it's making accurate predictions, classifying data correctly, or generating relevant content.  Without proper evaluation, you can't be sure whether your model is truly useful, whether it's overfitting, or whether it's biased.

The specific evaluation metrics and strategies depend heavily on the type of machine learning task your model is designed for.  

### **General Principles of Model Evaluation**

1. **Use a Separate Test Set:** Always evaluate your model on data it has never seen before during training.  This is typically achieved by splitting your dataset into:
   - **Training Set:** Used to train the model.
   - **Validation Set:** Used during training to tune hyperparameters and prevent overfitting. 
   - **Test Set:** A completely held-out dataset used only once at the very end to get an unbiased estimate of the model's performance on new, unseen data.
   - **Cross Validation:** For smaller datasets, cross validation techniques (e.g., k-fold cross validation) help ensure that the evaluation is robust and not overly dependent on a single train-test split.
2. **Understand Your Goal:** The best metric depends on the business problem you're trying to solve.
   - Is it more important to catch all positive cases (high recall), even if some are false alarms?
   - Or is it more important that positive predictions are almost always correct (high precision) even if you miss some true positives? 
   - Is a balanced view of both recall and precision important (F1 score)?
3. **Consider Baselines:** Compare your model's performance against simple baselines (e.g., random guessing, predicting the majority class, a simple rule-based system) to ensure your model is actually adding value.
4. **Visualize Results:** Beyond numerical metrics, visualizations (e.g., confusion matrices, ROC curves, residual plots) provide deeper insights into model behavior.
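
The splitting and baseline principles above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`train_val_test_split`, `k_fold_indices`, `majority_baseline_accuracy`) and the default 70/15/15 proportions are assumptions chosen for the example, and libraries such as scikit-learn provide more complete versions.

```python
import random
from collections import Counter


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))

for train_idx, val_idx in k_fold_indices(10, k=3):
    print("validate on", val_idx)

print(majority_baseline_accuracy([0, 0, 1], [0, 1, 0, 0]))  # 0.75
```

A trained model that cannot beat the number returned by `majority_baseline_accuracy` on the same test labels is not yet adding value.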

### **Evaluation Metrics by Task Type**

#### 1. **Classification Models**
These models predict a categorical label.
   - **Confusion Matrix:** A table that summarizes the performance of a classification model on a set of test data.  It breaks down predictions into:
      - **True Positives (TP):** Correctly predicted positive class.
      - **True Negatives (TN):** Correctly predicted negative class.
      - **False Positives (FP):** Predicted positive, but actually negative (Type I error).
      - **False Negatives (FN):** Predicted negative, but actually positive (Type II error).
   
   - **Accuracy:**
      - **Formula:**
         `(TP + TN)/(TP + TN + FP + FN)`
      - **Interpretation:** The proportion of correctly classified instances.
      - **Caveat:** Can be misleading on imbalanced datasets (e.g., if 99% of the instances are negative, a model that classifies everything as negative reaches 99% accuracy without finding a single positive).

   - **Precision (Positive Predictive Value):**
      - **Formula:**
        `TP/(TP + FP)`
      - **Interpretation:** Of all instances predicted as positive, what proportion are actually positive?  Useful when the cost of False Positives is high (e.g., spam detection, recommending a stock).

   - **Recall (Sensitivity, True Positive Rate):**
      - **Formula:**
        `TP/(TP + FN)`
      - **Interpretation:** Of all actual positive instances, what proportion did the model correctly identify?  Useful when the cost of False Negatives is high (e.g., disease detection, fraud detection).

   - **F1 Score:**
      - **Formula:**
        `2 * (Precision * Recall)/(Precision + Recall)`
      - **Interpretation:** The harmonic mean of precision and recall. Provides a balance between the two and is usually more informative than accuracy on imbalanced datasets.

   - **Specificity (True Negative Rate):**
      - **Formula:**
        `TN/(TN + FP)`
      - **Interpretation:** Of all actual negative instances, what proportion did the model correctly identify?

   - **ROC Curve (Receiver Operating Characteristic) & AUC (Area Under the Curve):**
      - **ROC Curve:** Plots the True Positive Rate (Recall) against the False Positive Rate for various classification thresholds.
      - **AUC:** The area under the ROC curve. A single scalar value (0 to 1) that summarizes classifier performance across all possible thresholds. A value of 0.5 corresponds to random guessing, and a higher AUC indicates better overall performance.

   - **Log Loss (Cross Entropy Loss):**
      - **Interpretation:** Measures the performance of a classification model whose output is a probability between 0 and 1. Penalizes confident wrong predictions most heavily and rewards confident correct ones.  Lower is better.
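
The formulas above translate directly into code. The following is a minimal sketch for the binary case using only the standard library. The helper names (`confusion_counts`, `binary_metrics`, `log_loss`, `roc_auc`) are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return 0.0 when the denominator is zero (a convention assumed here)."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds the predicted P(label == 1)."""
    total = 0.0
    for truth, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= truth * math.log(p) + (1 - truth) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# The accuracy caveat in numbers: 99 negatives, 1 positive, predict all negative.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
report = binary_metrics(y_true, y_pred)
print(report["accuracy"], report["recall"])  # 0.99 0.0

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for illustration but too slow for large test sets.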

#### 2. **Regression Models**

These models predict a continuous numerical value.

**Mean Absolute Error (MAE):**
   - **Formula:** 
$ \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| $
   - **Interpretation:** The average of the absolute differences between predictions and actual values.  Less sensitive to outliers than MSE. 
----
**Mean Squared Error (MSE):**
   - **Formula:** 
$ \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $
   - **Interpretation:** The average of the squared differences between predictions and actual values.  Penalizes larger errors more heavily.  Units are squared.  
----
**Root Mean Squared Error (RMSE):**
   - **Formula:** 
$ \sqrt {\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} $
   - **Interpretation:** The square root of MSE.  Puts the error back into the same units as the target variable, making it more interpretable than MSE.
----
**R Squared (Coefficient of Determination):**
   - **Formula:** 
$ 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} $
   - **Interpretation:** Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.  Typically ranges from 0 to 1, where 1 means the model perfectly explains the variance.  Can be negative if the model fits worse than simply predicting the mean.
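
As a worked example, the four regression formulas can be computed side by side. This is a small sketch with made-up numbers; the function name `regression_metrics` is illustrative, and the code assumes the targets are not all identical (otherwise the R Squared denominator is zero).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R squared for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


y_true = [3.0, 5.0, 7.0, 9.0]

# Errors are 1, 0, -1, -3: MAE = 1.25, MSE = 2.75, R squared is about 0.45.
print(regression_metrics(y_true, [2.0, 5.0, 8.0, 12.0]))

# Predictions in reverse order fit worse than the mean, so R squared is negative (-3.0).
print(regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])["r2"])
```

Note how the single error of 3 contributes 9 of the 11 units of squared error, which is why MSE and RMSE react more strongly to large errors than MAE does.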

#### 3. **Clustering Models**

These models group data points without predefined labels.  Evaluation is often more challenging and can involve internal or external metrics.  

- Internal Metrics (without ground truth):
   - **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters.  Ranges from -1 (bad clustering) to +1 (dense, well-separated clusters).
   - **Davies-Bouldin Index:** Lower values indicate better clustering (clusters are well separated and compact).
   - **Dunn Index:** Higher values indicate better clustering (clusters are compact and well separated).
- External Metrics (with ground truth labels):
   - **Adjusted Rand Index (ARI):** Measures the similarity between two clusterings (your model's and the ground truth), accounting for chance.  Ranges from -1 to 1 with 1 being perfect agreement.
   - **Normalized Mutual Information (NMI):** Measures the mutual information between the true labels and the clusters, normalized to be between 0 and 1. 
   - **Homogeneity, Completeness, V-measure:** Metrics that describe different aspects of a clustering solution relative to a ground truth. 
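
To make the silhouette score concrete, here is a minimal sketch that assumes Euclidean distance and small in-memory data. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; the point's coefficient is `(b - a) / max(a, b)`. Scoring singleton clusters as 0 is a convention assumed here, and `math.dist` requires Python 3.8 or later.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # close to +1: tight, separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: neighbours split apart
```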

#### 4. **Generative Models (GANs, VAEs, LLMs)**

Evaluation is often more subjective and can involve human judgement.  

- Image Generation:
   - **Inception Score (IS):** Measures the quality and diversity of generated images (higher is better).
   - **Fréchet Inception Distance (FID):** Measures the similarity between real and generated images (lower is better).
   - **Human Evaluation:** Surveys, A/B tests to assess realism, quality, and creativity.

- Text Generation (LLMs):
   - **Perplexity:** Measures how well a probability model predicts a sample.  Lower perplexity generally means a better language model.
   - **BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR:** Used for machine translation, summarization, and text generation, comparing generated text to reference text.
   - **Human Evaluation:** For coherence, fluency, relevance, harmfulness, bias, factual accuracy. This is paramount for LLMs.
   - **Adversarial Testing/Red Teaming:** Deliberately trying to make the model generate harmful or biased content.
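
Perplexity is the one metric in this list that reduces to a short formula: the exponential of the average negative log-probability the model assigned to each token that actually occurred. The sketch below assumes you already have those per-token probabilities as a plain list; the name `perplexity` and the input format are illustrative and not tied to any particular LLM library.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that spreads its probability evenly over 4 options at every step
# is "as confused as" a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# More confident, correct predictions give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.7]))
```

The first call returns approximately 4, which matches the intuition that perplexity is the effective number of equally likely choices the model is hesitating between.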

### **Challenges in Model Evaluation**

- **Data Drift:** Model performance can degrade over time if the characteristics of the production data change from those of the training data.
- **Concept Drift:** The relationship between input features and the target variable changes over time.
- **Imbalanced Datasets:** Require careful selection of metrics (accuracy can be misleading).
- **Subjectivity:** For generative models, quality can be subjective and hard to quantify.
- **Cost of Errors:** Different types of errors (FP vs. FN) can have vastly different real-world costs.
- **Explainability:** Understanding why a model makes certain predictions can be as important as the accuracy itself.
- **Bias and Fairness:** Ensuring the model performs equally well across different demographic groups or sensitive attributes.  This often requires specialized fairness metrics.
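
One simple way to start on the bias and fairness point is to compute the same metric separately for each group and compare the results. The sketch below does this for recall. The group labels, the sample data, and the function name `recall_by_group` are all hypothetical, and a gap in recall is only one of many possible fairness checks.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups, positive=1):
    """Recall computed separately for each distinct value in `groups`."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth != positive:
            continue  # recall only looks at actual positives
        if pred == positive:
            tp[group] += 1
        else:
            fn[group] += 1
    # Groups with no actual positives are omitted from the result.
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
groups = ["a", "a", "b", "b", "a", "b"]

per_group = recall_by_group(y_true, y_pred, groups)
print(per_group)  # group "a" has recall 1.0, group "b" has recall 0.5
print(max(per_group.values()) - min(per_group.values()))  # recall gap: 0.5
```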

Effective model evaluation is an iterative process that helps refine models, ensure they meet business objectives, and maintain their performance over time in real-world applications.

----