# Performance Evaluation in Machine Learning

Assessing the performance of a machine learning model is a crucial step in developing an effective ML solution. To measure the quality or effectiveness of a model, various metrics are employed, known as performance metrics or evaluation metrics. These metrics provide insights into how well the model performs on the given data, enabling us to enhance its performance by fine-tuning hyperparameters. The ultimate goal of any ML model is to generalize effectively on unseen or new data, and performance metrics help determine the model's ability to achieve this generalization.

In supervised machine learning, tasks are broadly categorized into classification and regression. Not every metric applies to every type of problem, so it is essential to understand which metrics suit which task. This discussion covers the metrics used for classification first and then those used for regression.

---

## 1. Performance Metrics for Classification

In classification tasks, the model identifies categories or classes of data based on the training dataset. It learns from the provided data and then classifies new data into predefined classes or groups. The output is a predicted class label, such as Yes/No, 0/1, or Spam/Not Spam. To evaluate the performance of a classification model, several metrics are used, including:

- **Accuracy**
- **Confusion Matrix**
- **Precision**
- **Recall**
- **F-Score**

### I. Accuracy

Accuracy is one of the simplest and most commonly used classification metrics. It is calculated as the ratio of correct predictions to the total number of predictions made.

**Formula for Accuracy:**
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$


To implement the accuracy metric, we can compare the ground truth values with the predicted values either manually or by using the scikit-learn library.

**Example using scikit-learn:**
1. Import the `accuracy_score` function from the library:
   ```python
   from sklearn.metrics import accuracy_score
   ```
   Here, `metrics` is a module (subpackage) within the `sklearn` package.

2. Pass the ground truth (`y_test`) and predicted values (`y_pred`) to the function to compute accuracy:
   ```python
   print(f'Accuracy Score is {accuracy_score(y_test, y_pred)}')
   ```
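
Putting the two steps together, the following is a minimal, self-contained sketch. The label lists `y_test` and `y_pred` are made-up values for illustration only; in practice they would come from a test split and a trained model. It computes accuracy manually and checks it against `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels, for illustration only.
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Manual computation: number of correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_test, y_pred))
manual_accuracy = correct / len(y_test)

# Same computation via scikit-learn.
sklearn_accuracy = accuracy_score(y_test, y_pred)

print(f"Manual accuracy:  {manual_accuracy}")
print(f"sklearn accuracy: {sklearn_accuracy}")
```

Six of the eight predictions match the ground truth, so both approaches should report 0.75.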

**When to Use Accuracy?**
The accuracy metric is most appropriate when the classes in the target variable are roughly balanced. For instance, in a fruit image dataset where 60% of the images are apples and 40% are mangoes, if the model predicts whether an image is an apple or a mango with 97% accuracy, this metric is meaningful.

**When Not to Use Accuracy?**
Accuracy should not be used when the target variable is heavily imbalanced, with one class dominating the dataset. For example, consider a disease prediction model where, out of 100 people, only 5 have the disease and 95 do not. If the model predicts that no one has the disease (a poor prediction), the accuracy would still be 95%, which is misleading and does not reflect the model's actual performance.
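
The disease scenario above can be reproduced in a few lines of plain Python. This is only a sketch: the label lists are constructed to match the 5/95 split described in the text, and the "model" is simply a list of all-negative predictions.

```python
# Scenario from the text: 100 people, 5 with the disease (1), 95 without (0).
y_true = [1] * 5 + [0] * 95

# A trivial "model" that predicts that no one has the disease.
y_pred = [0] * 100

# Accuracy: correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the 5 sick patients did the model actually find?
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Sick patients detected: {detected} of 5")
```

The accuracy looks high even though the model never identifies a single sick patient, which is why accuracy alone is a poor guide on imbalanced data.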

---

### II. Confusion Matrix

A confusion matrix is a table that summarizes the prediction results of a classifier (shown here for the binary case), providing a detailed breakdown of the model's performance on a test dataset where the true values are known. It is a useful tool for evaluating the effectiveness of a classification model.

While the concept of a confusion matrix is straightforward, the terminology associated with it can be confusing for beginners. Terms like true positives, false negatives, true negatives, and false positives are often used to describe the different outcomes in the matrix, which may require some time to fully grasp.
$$
\begin{array}{|c|c|c|}
\hline
 & \text{Predicted Positive} & \text{Predicted Negative} \\
\hline
\text{Actual Positive} & \text{TP (True Positive)} & \text{FN (False Negative)} \\
\hline
\text{Actual Negative} & \text{FP (False Positive)} & \text{TN (True Negative)} \\
\hline
\end{array}
$$

In the binary case, the table has four cells:

1. **True Positive (TP):** The model predicts the positive class, and the actual class is positive.
2. **True Negative (TN):** The model predicts the negative class, and the actual class is negative.
3. **False Positive (FP):** The model predicts the positive class, but the actual class is negative.
4. **False Negative (FN):** The model predicts the negative class, but the actual class is positive.
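
As a rough illustration, the four counts can be obtained with scikit-learn's `confusion_matrix`. The labels below are made up for this sketch. Note that with `labels=[0, 1]` scikit-learn orders the matrix as `[[TN, FP], [FN, TP]]`, which places the negative class first, unlike the table above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
# With labels=[0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```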

---

### III. Precision

The **precision metric** addresses the limitations of accuracy by focusing on the proportion of positive predictions that were actually correct. It is calculated as the ratio of **True Positives (TP)**—predictions that are correctly identified as positive—to the total number of positive predictions, which includes both **True Positives (TP)** and **False Positives (FP)**. In other words, precision measures how many of the predicted positive cases are truly positive.

**Formula for Precision:**
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$
This metric is particularly useful in scenarios where the cost of false positives is high, such as spam detection, where a legitimate email flagged as spam may never be read. It evaluates the model's ability to avoid incorrect positive predictions.
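
A minimal sketch of the formula in plain Python follows. The function name `precision` and the spam-filter counts are illustrative assumptions, and returning 0.0 when there are no positive predictions is a convention chosen for this sketch, since the ratio is undefined in that case.

```python
def precision(tp: int, fp: int) -> float:
    """Return TP / (TP + FP), or 0.0 when no positive predictions were made."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # The ratio is undefined here; 0.0 is a convention for this sketch.
        return 0.0
    return tp / predicted_positive


# Hypothetical spam filter: 40 emails flagged as spam, 30 of them really spam.
spam_precision = precision(tp=30, fp=10)
print(f"Precision: {spam_precision}")
```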

---

### IV. Recall or Sensitivity

Recall, also known as sensitivity, complements precision: it measures the proportion of actual positive cases that the model correctly identified. It is calculated as the ratio of **True Positives (TP)** to the total number of actual positives, which includes both **True Positives (TP)** and **False Negatives (FN)** (positive cases that were incorrectly predicted as negative).

**Formula for Recall:**

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

Recall is particularly important in scenarios where missing positive cases (false negatives) is costly, such as in disease detection or fraud prevention. It measures the model's ability to correctly identify all relevant positive instances from the dataset.
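
The formula can be sketched in plain Python as follows. The function name `recall` and the screening counts are hypothetical, and returning 0.0 when there are no actual positives is a convention chosen for this sketch.

```python
def recall(tp: int, fn: int) -> float:
    """Return TP / (TP + FN), or 0.0 when there are no actual positives."""
    actual_positive = tp + fn
    if actual_positive == 0:
        # The ratio is undefined here; 0.0 is a convention for this sketch.
        return 0.0
    return tp / actual_positive


# Hypothetical screening test with 5 truly sick patients.
recall_all_negative = recall(tp=0, fn=5)  # model that flags nobody
recall_better_model = recall(tp=4, fn=1)  # model that finds 4 of the 5

print(f"Recall (flags nobody): {recall_all_negative}")
print(f"Recall (finds 4 of 5): {recall_better_model}")
```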

**When to Use Precision and Recall?**
Precision and recall help evaluate a model’s performance, but they focus on different types of errors.

- **Recall** is important when missing a positive result (false negative) is costly. For example, in a medical test for a serious disease, we want to catch every possible case, even if it means a few false alarms.
- **Precision** is crucial when false positives need to be avoided. For instance, in spam detection, we don’t want important emails mistakenly marked as spam.

If you want fewer false negatives, aim for high recall. If you want fewer false positives, focus on high precision.
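
One common way to trade precision against recall is to move the decision threshold applied to a model's predicted scores. The threshold mechanism, the scores, and the labels below are assumptions introduced for illustration; they are not part of the discussion above. The sketch shows precision falling and recall rising as the threshold is lowered.

```python
# Hypothetical predicted scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Treat score >= threshold as positive and return (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.15):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, a strict threshold yields only confident positives (high precision, low recall), while a lenient one catches every positive at the cost of more false alarms.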

---

### V. F-Score

The F1 score, the most common form of the F-score, is a metric used to assess the performance of a binary classification model, specifically focusing on predictions for the positive class. It combines Precision and Recall into a single value: the F1 score is the harmonic mean of the two, giving them equal importance.
$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**When to Use F-Score?**
Because the F1 score uses both precision and recall and weights them equally, it is a good choice when false positives and false negatives both matter and neither clearly dominates. When one of the two is more important than the other, the more general F-beta score lets you weight recall more heavily (β > 1) or precision more heavily (β < 1).

**F1-Score Interpretation** (rough rules of thumb; acceptable values depend on the task and class balance):
- **F1-score ≈ 1:** The model is close to perfect; an F1 of exactly 1 means both Precision and Recall are 100%. This is an ideal case that rarely happens in real-world models.
- **F1-score high (0.7 - 0.9):** The model is performing well, meaning it is correctly detecting positive cases with few incorrect predictions.
- **F1-score moderate (0.5 - 0.7):** The model is performing okay, but there is room for improvement. Precision, Recall, or both may be compromised.
- **F1-score low (< 0.5):** The model is weak, meaning there are too many false positives, too many false negatives, or both.

**F1-Score Comparison with Other Metrics:**
- If Precision is high but Recall is low, the model is only predicting sure positive cases but missing many actual positives.
- If Recall is high but Precision is low, the model is labeling too many cases as positive: it catches most actual positives but also produces many false positives.
- F1-score balances both, providing an overall idea of how well the model is controlling false positives and false negatives.
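
The effect of the harmonic mean can be seen in a short sketch. The helper name `f1_from` and the precision/recall pairs are hypothetical values chosen for illustration. Note how a lopsided pair scores far below its arithmetic average.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs.
cases = {
    "high precision, low recall": (0.95, 0.10),
    "high recall, low precision": (0.10, 0.95),
    "balanced": (0.60, 0.60),
}

for name, (p, r) in cases.items():
    arithmetic = (p + r) / 2
    print(f"{name}: F1 = {f1_from(p, r):.3f} (arithmetic mean = {arithmetic:.3f})")
```

Both lopsided cases have an arithmetic mean above 0.5, yet their F1 is below 0.2, while the balanced case keeps an F1 of 0.6.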

**Example:**

For a fraud detection model:
- Precision = 0.8 (80% of detected fraud cases are correct)
- Recall = 0.6 (60% of total fraud cases are detected)

The F1-score would be:

$$
F1 = 2 \times \frac{0.8 \times 0.6}{0.8 + 0.6} = 2 \times \frac{0.48}{1.4} = \frac{0.96}{1.4} \approx 0.686
$$

This means the model is performing okay, but Recall needs improvement to detect more fraud cases.
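
The same numbers can be checked with scikit-learn. The counts below (TP=12, FP=3, FN=8, TN=77) are hypothetical; they were chosen only because they give a precision of 0.8 and a recall of 0.6, matching the example.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical counts that reproduce precision = 0.8 and recall = 0.6.
tp, fp, fn, tn = 12, 3, 8, 77

# Build label lists with exactly those counts (1 = fraud, 0 = not fraud).
y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```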

**Conclusion:**
The F1-score balances precision and recall in a single number: it is high only when both are high and drops sharply when either one is low. To see which of the two is holding the score down, inspect precision and recall individually.

---

## 2. Performance Metrics for Regression

Regression is a supervised learning method used to identify relationships between dependent and independent variables. A regression model predicts continuous or discrete numerical values. Regression uses different evaluation metrics from classification: metrics like Accuracy are not applicable because the predictions are numerical values rather than class labels. Instead, regression models are assessed based on the errors in their predictions. Below are some commonly used metrics to evaluate the performance of regression models:

- **Mean Absolute Error (MAE)**
- **Mean Squared Error (MSE)**
- **R2 Score (R-squared)**
- **Adjusted R2 (Adjusted R-squared)**

### I. Mean Absolute Error (MAE)

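Mean Absolute Error (MAE) is one of the simplest regression metrics. It is the average of the absolute differences between actual and predicted values, where the absolute value ignores the sign of each error so that positive and negative errors do not cancel out. In the formula, y is the actual value, ŷ is the predicted value, and N is the number of samples.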

**Formula for MAE:**
$$
\text{MAE} = \frac{\sum |y - \hat{y}|}{N}
$$


**Interpretation of MAE:**
1. **MAE = 0:** The model's predictions are perfect (no difference between predictions and actual values).
2. **Low MAE:** The model's predictions are very accurate (small average difference between predictions and actual values).
3. **High MAE:** The model's predictions have significant errors (large average difference between predictions and actual values).
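
A minimal sketch of the MAE formula in plain Python follows. The function name `mae_manual` and the house-price values are made up for illustration; scikit-learn also provides `sklearn.metrics.mean_absolute_error` for the same purpose.

```python
def mae_manual(y_true, y_pred):
    """Average of |y - y_hat| over all samples."""
    total_abs_error = sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))
    return total_abs_error / len(y_true)


# Hypothetical house prices (in thousands) and model predictions.
y_true = [200, 150, 300, 250]
y_pred = [210, 140, 280, 250]

mae = mae_manual(y_true, y_pred)
print(f"MAE: {mae}")  # absolute errors are 10, 10, 20, 0
```

Because MAE is expressed in the same units as the target, a value of 10 here means the predictions are off by 10 thousand on average.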

---

### II. Mean Squared Error (MSE)

Mean Squared Error (MSE) is a widely used metric for evaluating regression models. It calculates the average of the squared differences between the predicted and actual values.

**Formula for MSE:**
$$
\text{MSE} = \frac{\sum |y - \hat{y}|^2}{N}
$$


**Interpretation:**
- MSE squares the errors, so it gives more importance to large errors and outliers.
- A low MSE indicates good model performance, while a high MSE indicates inaccurate predictions.
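
To illustrate how squaring emphasizes large errors, the sketch below compares two made-up sets of predictions that have the same MAE but different MSE. The helper names `mae_manual` and `mse_manual` and all values are assumptions for this example.

```python
def mae_manual(y_true, y_pred):
    """Average of |y - y_hat| over all samples."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mse_manual(y_true, y_pred):
    """Average of (y - y_hat)^2 over all samples."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 12, 14, 16, 18]
small_errors = [11, 11, 15, 15, 19]  # every prediction is off by 1
one_outlier = [10, 12, 14, 16, 23]   # four exact predictions, one off by 5

for name, y_pred in (("small errors", small_errors), ("one outlier", one_outlier)):
    print(f"{name}: MAE = {mae_manual(y_true, y_pred)}, MSE = {mse_manual(y_true, y_pred)}")
```

Both prediction sets have an MAE of 1.0, but the single large error raises the MSE from 1.0 to 5.0.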

---

### III. R-Squared (Coefficient of Determination)

R-squared (R²), also called the Coefficient of Determination, is a widely used metric for evaluating regression models. Despite sometimes being called an "error", it is a score, so higher is better. It assesses how well a model performs by comparing its squared errors to those of a constant baseline that always predicts the mean of the actual values (ȳ in the formula below).
$$
R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
$$


**Interpretation of R-squared:**
- **R² = 1:** Perfect model (every prediction equals the actual value).
- **R² = 0:** The model does no better than always predicting the mean of the actual values.
- **R² < 0:** The model performs worse than the mean baseline.
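
These three cases can be reproduced with a direct implementation of the formula. The function name `r2_manual` and the toy values are assumptions for this sketch; scikit-learn offers `sklearn.metrics.r2_score` for real use.

```python
def r2_manual(y_true, y_pred):
    """1 - (sum of squared residuals) / (sum of squared deviations from the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    # Note: ss_tot is zero when all actual values are identical; not handled here.
    return 1 - ss_res / ss_tot


y_true = [2.0, 4.0, 6.0, 8.0]  # mean is 5.0

r2_perfect = r2_manual(y_true, [2.0, 4.0, 6.0, 8.0])  # exact predictions
r2_mean = r2_manual(y_true, [5.0, 5.0, 5.0, 5.0])     # always predict the mean
r2_worse = r2_manual(y_true, [8.0, 6.0, 4.0, 2.0])    # predictions reversed

print(f"Perfect predictions: R2 = {r2_perfect}")
print(f"Mean baseline:       R2 = {r2_mean}")
print(f"Reversed:            R2 = {r2_worse}")
```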

**R² Values and Interpretation** (rough guidelines; acceptable values vary by domain):
- **R² = 1:** The model fits the data perfectly; on training data this can be a sign of overfitting.
- **R² = 0.9 – 1.0:** The model is performing very well, explaining most of the variation in the target.
- **R² = 0.7 – 0.9:** The model is performing well but could be improved.
- **R² = 0.5 – 0.7:** The model's performance is moderate, explaining some of the variation.
- **R² < 0.5:** The model's performance is weak, leaving most of the variation unexplained.

**Ways to Improve R²:**
1. **Include Additional Features:** If the model is missing important factors, adding more features can improve R².
2. **Use Better Models:** If using linear regression, consider more complex models like decision trees or random forests.
3. **Data Preprocessing:** Handle outliers and clean the data to improve model performance.
4. **Handle Non-Linearity:** If the data relationship is non-linear, use non-linear models to improve R².
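
As a small illustration of the non-linearity point, the sketch below fits a straight line and a quadratic curve to synthetic data using NumPy's `polyfit`. The data is noise-free and purely quadratic by construction, and R² is measured on the same data used for fitting, so this only demonstrates the idea; a real comparison should use held-out data.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination for NumPy arrays."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic data with a purely quadratic relationship (illustrative assumption).
x = np.linspace(-3, 3, 50)
y = x ** 2

# Fit a degree-1 (linear) and a degree-2 (quadratic) polynomial.
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r2(y, linear_pred)
r2_quadratic = r2(y, quadratic_pred)

print(f"Linear fit:    R2 = {r2_linear:.4f}")
print(f"Quadratic fit: R2 = {r2_quadratic:.4f}")
```

On this data the straight line does no better than the mean baseline, while the quadratic fit captures the relationship almost exactly.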

**Conclusion:**
In general, a higher R² on unseen data indicates a better model, although a value very close to 1 on training data may signal overfitting. If R² is low, optimize features or try more flexible models to improve performance.
