<a href="https://colab.research.google.com/github/koffqq/rehab-exercise/blob/main/course_ai_rehab_hc_2024_tutorial_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Classification Metrics (Binary)**

### 1. **Confusion Matrix**

A **Confusion Matrix** is a summary of prediction results that compares the actual (true) labels to the predicted labels made by a classifier. It provides insights into the number of true positives, false positives, true negatives, and false negatives.

|                   | **Predicted Positive** | **Predicted Negative** |
|-------------------|------------------------|------------------------|
| **True Positive**  | **TP**                 | **FN**                 |
| **True Negative**  | **FP**                 | **TN**                 |

Where:
- **TP** = True Positives (truely predicted positive cases),
- **FP** = False Positives (falsely predicted positive cases),
- **FN** = False Negatives (falsely predicted negative cases),
- **TN** = True Negatives (truely predicted negative cases).

This matrix helps in calculating other metrics like accuracy, precision, recall, F1-score, and specificity.

---

### 2. **Accuracy**

**Accuracy** measures the percentage of correct predictions out of all predictions.

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:
- **TP** = True Positives,
- **TN** = True Negatives,
- **FP** = False Positives,
- **FN** = False Negatives.

---

### 3. **Precision**

**Precision** measures the proportion of true positive predictions out of all positive predictions (i.e., how many selected items are relevant).

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

It is useful when the cost of false positives is high (e.g., in spam detection).

---

### 4. **Recall (Sensitivity or True Positive Rate)**

**Recall** measures the proportion of actual positive cases that were correctly identified (i.e., how many relevant items are selected).

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

It is important when the cost of false negatives is high (e.g., in medical diagnosis).

---

### 5. **Specificity (True Negative Rate)**

**Specificity** measures the proportion of actual negative cases that were correctly identified. It is the complement of recall and focuses on the correct identification of negative cases.

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

Specificity is important in scenarios where false positives are costly, such as in medical tests.

---

### 6. **Balanced Accuracy**

**Balanced Accuracy** adjusts for class imbalance by averaging the accuracy for both the positive and negative classes.

$$
\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)
$$

This is particularly useful when working with imbalanced datasets, as it gives equal weight to the positive and negative classes.

---

### 7. **F1-Score**

**F1-Score** is the harmonic mean of precision and recall, providing a single metric that balances both.

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

F1-Score is useful when you need a balance between precision and recall.

---

### 8. **ROC-AUC Score**

**ROC (Receiver Operating Characteristic)** curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold (probability threshold to output positive) is varied.

- **AUC (Area Under Curve)** is the area under the ROC curve and provides a single value to compare different models (a higher AUC means the model is more robust and its good performance does not vary a lot by changing the threshold).

The **True Positive Rate (TPR)** and **False Positive Rate (FPR)** are used to plot the ROC curve:

$$
\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}
$$
<br>

A perfect model has an AUC of 1, while a random classifier has an AUC of 0.5 (assuming binary and balanced classes), because it represents the performance of a **random classifier**. Here's why:

##### 1. **Random Guessing**:
   - A **random classifier** does not have any discriminative ability; it essentially guesses the class label randomly, with a 50/50 chance for binary classification (assuming balanced classes).
   - This means that the classifier will predict positive and negative labels in a completely unstructured manner.

##### 2. **ROC Curve for a Random Classifier**:
   - A random classifier would have an equal chance of making correct and incorrect predictions, leading to a **True Positive Rate (TPR)** that is proportional to the **False Positive Rate (FPR)**.
   - In this case, the ROC curve is a diagonal line from the bottom-left (0, 0) to the top-right (1, 1), indicating that as the threshold changes, the TPR and FPR increase at the same rate. This is essentially a 50% chance of being correct.

##### 3. **AUC for a Random Classifier**:
   - The **AUC** measures the area under the ROC curve. For a random classifier, the ROC curve is the diagonal line, and the area under this line is exactly **0.5**.
   - An AUC of **0.5** implies that the model has no predictive ability; it is effectively guessing. This is the baseline for comparing other models.

##### Why Not Below 0.5?
- If a classifier has an **AUC below 0.5**, it means the model is worse than random guessing. In theory, a classifier with an AUC less than 0.5 can be inverted (flipping its predictions), and it would perform better than random guessing (yielding an AUC greater than 0.5).
- For example, an AUC of 0.3 means that the classifier systematically predicts incorrectly more often than not. You could swap the predictions (i.e., predict positive where the model predicts negative, and vice versa), and you would get an AUC of **1 - 0.3 = 0.7**, which would indicate a better performance.

##### AUC and Model Performance:
- **AUC = 0.5**: The model has no discriminative ability (random guessing).
- **AUC < 0.5**: The model is worse than random guessing and can be improved by inverting predictions.
- **AUC > 0.5**: The model is better than random guessing, with higher values indicating better performance.
- **AUC = 1.0**: A perfect classifier, with no false positives and no false negatives.

In summary, the minimum AUC of **0.5** reflects the performance of a random classifier because it cannot do any better than randomly guessing the outcomes of positive or negative classes. This is the baseline against which the discriminative ability of other models is measured.

---

### 9. **Precision-Recall Curve (PR-curve)**

The **Precision-Recall Curve (PR-curve)** is a graphical representation that shows the trade-off between **precision** and **recall** for different threshold values in a classification model, particularly useful when dealing with **imbalanced datasets**. Here’s a detailed explanation:



### What the PR-Curve Shows:
- The **Precision-Recall Curve** plots **precision** on the Y-axis and **recall** on the X-axis for different threshold values.
- As the threshold for classifying an instance as positive changes, the values of precision and recall will change.
  - When the threshold is lowered, more instances are classified as positive, leading to higher recall but often lower precision because of more false positives.
  - When the threshold is raised, fewer instances are classified as positive, leading to higher precision but lower recall.

### Why the PR-Curve is Useful:
- **Imbalanced Datasets**: The PR-curve is particularly valuable when dealing with imbalanced datasets where the number of positive instances is much smaller than the negative instances. In such cases, **accuracy** and even the **ROC curve** can be misleading.
  - For example, in a dataset where 95% of the data is negative, a classifier can achieve high accuracy by predicting everything as negative. However, the PR-curve focuses on the performance for the minority class (positive class).
  
- **Trade-off between Precision and Recall**: It shows how well the model performs in balancing precision and recall, which is crucial in scenarios where you need to optimize one over the other. For instance:
  - In **spam detection**, you might want higher precision (to avoid labeling legitimate emails as spam).
  - In **medical diagnostics**, you might prioritize recall (to minimize false negatives and ensure sick patients are identified).

### Interpretation of the PR-Curve:
- A good model will maintain high precision while recall increases, meaning the curve will be pushed towards the top-right corner.
- A **baseline** model would produce a horizontal line representing the proportion of positive examples in the dataset. If your model’s PR curve is above this line, it is performing better than random guessing.

### Comparison with the ROC Curve:
- **ROC Curve**: The ROC curve plots **True Positive Rate (TPR)** (recall) against **False Positive Rate (FPR)**, and it’s useful when you care about both classes equally. However, the ROC curve can be misleading when classes are highly imbalanced because it includes FPR, which is not very informative when negatives dominate the dataset.
  
- **PR Curve**: The PR-curve only focuses on the positive class, and it is more informative than the ROC curve in scenarios where:
  - You are working with imbalanced data.
  - You care more about the performance of the positive class (e.g., fraud detection, rare disease detection).
  
### Area Under the PR-Curve (AUC-PR):
- Similar to the ROC curve, the **Area Under the Precision-Recall Curve (AUC-PR)** can be used to summarize the model’s performance.
  - **AUC-PR = 1.0**: Perfect classifier (high precision and recall at all thresholds).
  - **AUC-PR near 0**: Poor classifier.
  - **AUC-PR closer to the baseline**: Indicates a model that is close to random guessing.

### Example:
In a dataset where you are detecting fraudulent transactions (rare positives), the PR curve can tell you how much you will sacrifice in precision as you increase recall. This trade-off is critical because in such a case, you may want to balance precision and recall differently depending on the application (e.g., minimizing false positives vs. catching as many fraud cases as possible).

---

### 10. **Log-Loss**

**Log-Loss**, also known as **Logarithmic Loss** or **Binary Cross-Entropy**, is a performance metric that evaluates the accuracy of probabilistic predictions for a binary classification model. Unlike simpler metrics like accuracy, log-loss takes into account the predicted probability of each class and penalizes incorrect classifications more harshly.

The formula for log-loss is:

$$
\text{Log-Loss} = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$

Where:
- $(n)$ is the number of data points.
- $(y_i)$ is the actual label (0 or 1).
- $(p_i)$ is the predicted probability for the positive class (between 0 and 1).

### Key Points:
- **Log-Loss** measures how well a model’s predicted probabilities match the actual outcomes.
- **Lower log-loss** values are better, as it indicates that the predicted probabilities are closer to the true labels.
  - **Log-Loss = 0**: Perfect predictions (the model predicts 1 with 100% certainty when the true label is 1, and similarly for 0).
  - **Log-Loss approaches infinity**: The model is confidently wrong (e.g., predicting a probability of 1 when the true label is 0).

#### Why Log-Loss is Important:

1. **Probabilistic Predictions**:
   - Unlike accuracy, precision, or recall, which only evaluate hard class predictions (0 or 1), log-loss evaluates the **quality of the predicted probabilities**.
   - In many real-world applications, knowing the probability of an event is just as important as the predicted class. For example, in medical diagnostics or financial risk assessments, you may want to understand the confidence level of predictions (e.g., 80% chance of a disease vs. 55% chance).

2. **Capturing Confidence**:
   - Log-loss penalizes **overconfident wrong predictions** much more heavily than predictions that are closer to the correct probability. For example:
     - Predicting 0.99 when the true label is 1 leads to a small penalty.
     - Predicting 0.01 when the true label is 1 leads to a much larger penalty.
   - This characteristic encourages models to be cautious and avoid making extremely confident incorrect predictions.

3. **Handling Imbalanced Classes**:
   - In **imbalanced datasets**, accuracy can be misleading because a model might perform well by always predicting the majority class. However, log-loss takes into account the confidence of the model's predictions, providing a more nuanced assessment of performance, especially for the minority class.

4. **Continuous Evaluation**:
   - Since log-loss evaluates the predicted probabilities rather than the final binary classification, it provides more detailed feedback during model training. It can guide model tuning by revealing how confident and correct the model is on average.

5. **Penalizes Misclassification**:
   - Log-loss penalizes false predictions more harshly than correct predictions. If a model is unsure about a prediction (e.g., predicting a probability close to 0.5), the penalty is less severe compared to confidently making an incorrect prediction.

### Example:
Consider a binary classification task, and you have the following predictions:

| Actual | Predicted Probability |
|--------|-----------------------|
| 1      | 0.9                   |
| 0      | 0.2                   |
| 1      | 0.6                   |
| 0      | 0.8                   |

- For the first prediction, the model is confident (0.9) and correct, so the log-loss will be small.
- For the last prediction, the model is confident but wrong (0.8 when the true label is 0), so the log-loss will be large.

This shows how log-loss differentiates between predictions that are confident and wrong versus predictions that are uncertain.

### Why is Log-Loss Preferable in Some Cases?

- **Probability-Sensitive**: Log-loss cares about how confident the model is in its predictions, which is crucial when working with probabilistic predictions. For instance, if you want your model to provide insights into how likely a certain event is (e.g., fraud detection, medical diagnosis), log-loss will penalize overconfidence in wrong predictions.
  
- **Model Calibration**: Log-loss helps in assessing how well-calibrated a model is. A well-calibrated model will output probabilities that reflect the true likelihood of events. For example, if a model predicts a 0.7 probability for 100 events, about 70 of those events should be positive.

### When to Use Log-Loss:

- **Classification problems with imbalanced data** where you need to account for the confidence of predictions.
- **Applications requiring probabilistic predictions** rather than hard classifications, such as financial risk models, medical diagnosis, or weather prediction.
- **When you need a metric that penalizes confident misclassifications**, which encourages models to be careful when predicting probabilities.

In summary, **log-loss** is important because it evaluates the quality of probabilistic predictions and penalizes overconfident wrong predictions, providing a more nuanced and useful measure for tasks where understanding prediction probabilities is critical.


# **Classification Metrics (Multiclass)**

### 1. **Confusion Matrix (Multiclass)**

In a **multiclass confusion matrix**, the rows represent the actual classes, and the columns represent the predicted classes. Instead of just two classes (positive and negative), each class has its own row and column.

|                   | **Predicted Class 1** | **Predicted Class 2** | **...** | **Predicted Class N** |
|-------------------|-----------------------|-----------------------|--------|-----------------------|
| **True Class 1**   | **TP for Class 1**    | **FP for Class 1**     | ...    |                       |
| **True Class 2**   | **FN for Class 2**    | **TP for Class 2**     | ...    |                       |
| **...**           |                       |                       | ...    |                       |
| **True Class N**   |                       |                       |        | **TP for Class N**    |

- **True Positives (TP)**: Correctly predicted instances for each class.
- **False Positives (FP)**: Instances incorrectly predicted as a particular class.
- **False Negatives (FN)**: Instances of a class that were incorrectly predicted as another class.
- **True Negatives (TN)** are generally calculated for each class by summing all the instances not in that class.

This matrix helps in calculating other multiclass metrics like precision, recall, F1-score, and accuracy.

---

### 2. **Accuracy (Multiclass)**

In the multiclass case, **accuracy** still measures the proportion of correct predictions out of all predictions, regardless of the class.

$$
\text{Accuracy} = \frac{\text{Total Correct Predictions}}{\text{Total Predictions}}
$$

Where **Total Correct Predictions** is the sum of the diagonal elements of the confusion matrix (all true positives).

---

### 3. **Precision (Multiclass)**

For multiclass problems, **precision** can be calculated in three ways:

- **Macro-averaged Precision**: Precision is calculated for each class separately and then averaged equally, treating all classes equally.
  
  $$
  \text{Macro Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}
  $$

- **Weighted Precision**: Precision is calculated for each class and then weighted by the number of true instances in each class to account for class imbalance.
  
  $$
  \text{Weighted Precision} = \sum_{i=1}^{N} \frac{n_i}{n} \cdot \frac{TP_i}{TP_i + FP_i}
  $$

- **Micro-averaged Precision**: Treats all classes as a single binary classification problem by summing the **TP**, **FP**, and **FN** across all classes:
  
  $$
  \text{Micro Precision} = \frac{\sum TP}{\sum (TP + FP)}
  $$

---

### 4. **Recall (Sensitivity or True Positive Rate) (Multiclass)**

Similarly, recall can be computed in three ways:

- **Macro-averaged Recall**: Calculated by averaging the recall for each class equally:
  
  $$
  \text{Macro Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}
  $$

- **Weighted Recall**: Adjusts for the number of true instances in each class:
  
  $$
  \text{Weighted Recall} = \sum_{i=1}^{N} \frac{n_i}{n} \cdot \frac{TP_i}{TP_i + FN_i}
  $$

- **Micro-averaged Recall**: Treats all classes as a single binary classification problem by summing the **TP**, **FP**, and **FN** across all classes:
  
  $$
  \text{Micro Recall} = \frac{\sum TP}{\sum (TP + FN)}
  $$

---

### 5. **Specificity (True Negative Rate) (Multiclass)**

For multiclass classification, **specificity** is calculated for each class by treating it as a binary classification (this class vs. all others):

$$
\text{Specificity for Class i} = \frac{TN_i}{TN_i + FP_i}
$$

Similar to precision and recall, you can compute **macro-averaged specificity** (equal weight for each class) or **weighted specificity** (weighted by class size).

---

### 6. **Balanced Accuracy (Multiclass)**

**Balanced Accuracy** is particularly useful for imbalanced datasets in multiclass problems. It is calculated by averaging the recall (sensitivity) for each class, giving equal importance to each class, regardless of class size.

#### **Unweighted Balanced Accuracy**:

$$
\text{Balanced Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}
$$

Where:
- $ N $ is the total number of classes.
- $ TP_i $ is the number of true positives for class $ i $.
- $ FN_i $ is the number of false negatives for class $ i $.

---

#### **Weighted Balanced Accuracy**:


$$ {Weighted Balanced Accuracy} = \sum_{i=1}^{N} w_i \cdot \frac{TP_i}{TP_i + FN_i}
$$

Where:
- $ w_i $ is the weight for class $ i $, which is proportional to the number of true samples in class $ i $:
  $$
  w_i = \frac{n_i}{n}
  $$
  Where $ n_i $ is the number of true instances of class $ i $, and $ n $ is the total number of instances across all classes.


---

### 7. **F1-Score (Multiclass)**

The **F1-Score** can be computed in three ways:

- **Macro F1-Score**: The F1-score is calculated for each class separately, and the results are averaged equally.
  
  $$
  \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
  $$

- **Weighted F1-Score**: Similar to precision and recall, it weighs the F1 score for each class based on the number of true instances in that class:
  
  $$
  \text{Weighted F1} = \sum_{i=1}^{N} \frac{n_i}{n} \cdot \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
  $$

- **Micro-averaged F1-Score**: In this case, micro-averaged precision and recall are used to compute a single F1-score across all classes. Since micro-averaging sums up the **TP**, **FP**, and **FN** globally, the **micro F1** is also computed globally:
  
  $$
  \text{Micro F1} = \frac{2 \times \text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}
  $$

---

### Summary of Averages:

- **Macro**: Treats each class equally and averages metrics across classes.
- **Weighted**: Weighs each class's contribution based on its size.
- **Micro**: Sums up the true positives, false positives, and false negatives globally across all classes and treats the problem as a single binary classification.

These approaches provide flexibility when evaluating multiclass classifiers, allowing you to emphasize overall performance (micro), treat all classes equally (macro), or account for class imbalance (weighted).

### 8. **ROC-AUC for Multiclass**

In the multiclass setting, **ROC-AUC** can be computed by considering each class against all other classes (one-vs-rest approach). For each class $i$, we compute the ROC curve, treating class $i$ as the positive class and all others as the negative class.

The overall **multiclass AUC** can then be calculated by averaging the AUC scores for all classes.

For each class $i$, the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** are defined as:

$$
\text{TPR}_i = \frac{TP_i}{TP_i + FN_i}, \quad \text{FPR}_i = \frac{FP_i}{FP_i + TN_i}
$$

The **ROC-AUC** is then computed as the area under the ROC curve for each class, and the average AUC can be taken across all classes.

# Hyper-parameters (for you)
Which k gives the best performance for the following dataset and its split,using a k-NN classifier?

In [13]:
# Modifying the model to K-Nearest Neighbors (KNN)

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score

# Create a synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=5, n_clusters_per_class=2, flip_y=0, random_state=0)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the K-Nearest Neighbors (KNN) model with n_neighbors = {1, 3, 5, 7, 10, 12, 15, 40, 100}
model = KNeighborsClassifier(n_neighbors=100)

# Predict x_test and calculate all the matrix
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Calculate classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average = "macro")
recall = recall_score(y_test, y_pred, average = "macro")
specificity = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1]) # TN / (TN + FP)
balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average = "macro")

# Compile all metrics into a DataFrame
metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'Specificity', 'Balanced Accuracy', 'F1 Score'],
    'Score': [accuracy, precision, recall, specificity, balanced_accuracy, f1]
})
# Convert metrics to percentages and round to 1 decimal place
metrics_df['Score'] = (metrics_df['Score'] * 100).round(1).astype(str) + '%'

metrics_df

Unnamed: 0,Metric,Score
0,Accuracy,76.5%
1,Precision,76.6%
2,Recall,76.8%
3,Specificity,89.2%
4,Balanced Accuracy,76.8%
5,F1 Score,76.3%


# (For you)
Which k gives the best performance for the following dataset and its cross-validated split,using a k-NN classifier?

In [22]:
# Implementing cross-validation for the same example

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Create a synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=5, n_clusters_per_class=2, flip_y=0, random_state=0)

# Set up cross-validation with 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Arrays to store results for each fold
accuracy_scores = []
precision_scores = []
recall_scores = []
balanced_accuracy_scores = []
f1_scores = []

# Loop through the cross-validation splits
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the K-Nearest Neighbors (KNN) model with n_neighbors = {1, 3, 5, 7, 10, 12, 15, 40, 100}
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

    # Precision, Recall, F1 (Macro)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')
    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)

    # Balanced Accuracy
    balanced_acc = balanced_accuracy_score(y_test, y_pred)
    balanced_accuracy_scores.append(balanced_acc)

# Aggregate Results
avg_accuracy = np.mean(accuracy_scores)
avg_precision = np.mean(precision_scores)
avg_recall = np.mean(recall_scores)
avg_f1 = np.mean(f1_scores)
avg_balanced_acc = np.mean(balanced_accuracy_scores)

std_accuracy = np.std(accuracy_scores)
std_precision = np.std(precision_scores)
std_recall = np.std(recall_scores)
std_f1 = np.std(f1_scores)
std_balanced_acc = np.std(balanced_accuracy_scores)

# Display aggregated metrics with mean and standard deviation in formatted style
metrics_df = {
    'Metric': ['Accuracy', 'Precision (Macro)', 'Recall (Macro)', 'F1 (Macro)', 'Balanced Accuracy'],
    'Score': [f"{(avg_accuracy * 100):.1f}% ± {(std_accuracy * 100):.1f}%",
              f"{(avg_precision * 100):.1f}% ± {(std_precision * 100):.1f}%",
              f"{(avg_recall * 100):.1f}% ± {(std_recall * 100):.1f}%",
              f"{(avg_f1 * 100):.1f}% ± {(std_f1 * 100):.1f}%",
              f"{(avg_balanced_acc * 100):.1f}% ± {(std_balanced_acc * 100):.1f}%"]
}

# Convert to DataFrame
df_metrics = pd.DataFrame(metrics_df)

df_metrics

Unnamed: 0,Metric,Score
0,Accuracy,78.4% ± 1.2%
1,Precision (Macro),78.5% ± 1.1%
2,Recall (Macro),78.4% ± 1.2%
3,F1 (Macro),78.4% ± 1.1%
4,Balanced Accuracy,78.4% ± 1.2%


In [23]:
# Implementing cross-validation with GridSearch for KNN hyperparameter tuning

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score

# Create a synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=5, n_clusters_per_class=2, flip_y=0, random_state=0)

# Set up cross-validation with 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Define the parameter grid for KNeighborsClassifier (range of n_neighbors)
param_grid = {'n_neighbors': [1, 3, 5, 7, 10, 12, 15, 40, 100]}

# Initialize the KNN model, doesn't learn as bootstrapping classification
knn = KNeighborsClassifier()

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=cv, scoring='accuracy', return_train_score=True)

# Perform the grid search
grid_search.fit(X, y)

# Get the best hyperparameters and corresponding score
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_

# Output results rounded and converted to percentage
best_score_percent = round(best_score * 100, 1)
print(f"Best number of neighbors (k): {best_k}")
print(f"Best accuracy score: {best_score_percent}%")

Best number of neighbors (k): 10
Best accuracy score: 83.2%


# Understanding Train, Validation, and Test Data

When building machine learning models, it is crucial to divide the dataset into three distinct parts: **training**, **validation**, and **test** sets. These splits help ensure that the model can generalize well to new data and avoid overfitting. Let’s explore the purpose of each one in detail:

## 1. **Training Data**
The **training data** is the portion of the dataset used to train the model. This is the data the model "learns" from by adjusting its internal parameters to minimize the error between predictions and actual labels.

### Key Points:
- The model has direct access to the training data.
- Used for optimizing the model’s parameters (e.g., the neighbors in k-NN or weights in neural networks).
- The performance on the training data should improve as the model learns.

### Potential Issue:
If the model performs well only on the training data but poorly on new data, it is likely **overfitting**—i.e., learning patterns specific to the training data that don't generalize to unseen data.

---

## 2. **Validation Data**
The **validation data** is used during the training process to evaluate the model’s performance on unseen data. This helps fine-tune the model and make decisions like choosing hyperparameters (e.g., learning rate, number of neighbors in KNN, etc.). Choose or tune k with the (seen) validation data for this specified test. -> and then go to the next test set -> use cross validation.

### Key Points:
- The model does **not** train on the validation data; it only evaluates performance.
- Helps in tuning model hyperparameters.
- Used for **early stopping** in some algorithms to prevent overfitting.
  
### Why It's Needed:
If you only rely on training data for evaluating the model, it can lead to an over-optimized model that does not generalize. Validation data acts as a checkpoint to measure how well the model performs on data it hasn’t seen during training.

---

## 3. **Test Data**
The **test data** is used **only after** the model is fully trained and optimized. This is the final evaluation step to check how well the model performs on completely unseen data.

### Key Points:
- The test data should only be used **once** for final evaluation.
- Provides an unbiased estimate of the model’s performance on real-world data.
- Helps answer the question: **"How well will this model perform on new data?"**

### Why Separate from Validation Data?
The validation set is used multiple times during training for tuning and model selection. Thus, the model can indirectly "see" this data, potentially leading to overfitting on the validation set as well. The test set provides an objective, **final evaluation** that the model hasn’t been exposed to before.

---

## Summary

- **Training Data**: Used to fit the model and learn patterns.
- **Validation Data**: Used to fine-tune and select the best model during training.
- **Test Data**: Used only after the model is fully trained to evaluate final performance.

# For you!
- Split the data into test and train/validation.<br>
- Use Cross Validation and the train/validation data to find the optimal k-nn model.<br>
- Use your optimal model and test it on the test data, and report the classification metrics.

```python
# Splitting the data and creating the grid search object
# ...

# Finding the optimal hyper-pars for the model
grid_search.fit(X_train_valid, y_train_valid)

# Testing the optimal model on the test data
y_pred = grid_search.predict(X_test)

# Creating the classification metrics for the test set
# ...
```

In [25]:
# Implementing cross-validation with GridSearch for KNN hyperparameter tuning

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score

# Create a synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=5, n_clusters_per_class=2, flip_y=0, random_state=0)

# Split the dataset into training, validation and test sets
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up cross-validation with 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Define the parameter grid for KNeighborsClassifier (range of n_neighbors)
param_grid = {'n_neighbors': [1, 3, 5, 7, 10, 12, 15, 40, 100]}

# Initialize the KNN model
knn = KNeighborsClassifier()

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=cv, scoring='accuracy', return_train_score=True)

# Perform the grid search
grid_search.fit(X_train_valid, y_train_valid)

# Get the best hyperparameters and corresponding score
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_

# Output results rounded and converted to percentage
best_score_percent = round(best_score * 100, 1)
print(f"Best number of neighbors (k): {best_k}")
print(f"Best accuracy score: {best_score_percent}%")

# Testing the optimal model on the test data
y_pred = grid_search.predict(X_test)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Calculate classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average = "macro")
recall = recall_score(y_test, y_pred, average = "macro")
specificity = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1]) # TN / (TN + FP)
balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average = "macro")

# Compile all metrics into a DataFrame
metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'Specificity', 'Balanced Accuracy', 'F1 Score'],
    'Score': [accuracy, precision, recall, specificity, balanced_accuracy, f1]
})
# Convert metrics to percentages and round to 1 decimal place
metrics_df['Score'] = (metrics_df['Score'] * 100).round(1).astype(str) + '%'

metrics_df

Best number of neighbors (k): 10
Best accuracy score: 81.8%


Unnamed: 0,Metric,Score
0,Accuracy,82.1%
1,Precision,82.1%
2,Recall,82.1%
3,Specificity,93.4%
4,Balanced Accuracy,82.1%
5,F1 Score,81.9%


# (For you) Everything in a Cross Validation loop!

In [3]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Create a synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=5, n_clusters_per_class=2, flip_y=0, random_state=0)

# Set up cross-validation for the test and validation data
cv_test = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_valid = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Define the parameter grid for KNeighborsClassifier (range of n_neighbors) for tuning hyperparameter, to have ultimate model before testing the validation data
# Depending on the validation data
param_grid = {'n_neighbors': [1, 3, 5, 7, 10, 12, 15, 40, 100]}

# Initialize the KNN model
knn = KNeighborsClassifier()

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=cv_valid, scoring='accuracy', return_train_score=True)

# Arrays to store results for each fold
accuracy_scores = []
precision_scores = []
recall_scores = []
balanced_accuracy_scores = []
f1_scores = []

# Loop through the cross-validation splits
for train_valid_index, test_index in cv_test.split(X, y):
    X_train_valid, X_test = X[train_valid_index], X[test_index]
    y_train_valid, y_test = y[train_valid_index], y[test_index]

    # Perform the grid search on the training and validation data
    grid_search.fit(X_train_valid, y_train_valid)
    y_pred = grid_search.predict(X_test)

    # Calculate classification metrics for this split
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred, average='macro'))
    recall_scores.append(recall_score(y_test, y_pred, average='macro'))
    balanced_accuracy_scores.append(balanced_accuracy_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred, average='macro'))

# Calculate mean and standard deviation for each metric
metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'Balanced Accuracy', 'F1 Score'],
    'Mean Score': [np.mean(accuracy_scores), np.mean(precision_scores), np.mean(recall_scores), np.mean(balanced_accuracy_scores), np.mean(f1_scores)],
    'Std Dev': [np.std(accuracy_scores), np.std(precision_scores), np.std(recall_scores), np.std(balanced_accuracy_scores), np.std(f1_scores)]
})

# Convert metrics to percentages and round to 1 decimal place
metrics_df['Mean Score'] = (metrics_df['Mean Score'] * 100).round(1).astype(str) + '%'
metrics_df['Std Dev'] = (metrics_df['Std Dev'] * 100).round(1).astype(str) + '%'

metrics_df



Unnamed: 0,Metric,Mean Score,Std Dev
0,Accuracy,82.7%,0.9%
1,Precision,83.0%,0.9%
2,Recall,82.7%,0.9%
3,Balanced Accuracy,82.7%,0.9%
4,F1 Score,82.7%,0.9%


In [4]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Random Forest works better: high variance dataset with noise
X1, y1 = make_classification(n_samples=5000, n_features=20, n_informative=5, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=1, flip_y=0.3, class_sep=2, random_state=42)

# XGBoost works better: dataset with complex interactions and subtle patterns
X2, y2 = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=2, flip_y=0.05, class_sep=0.5, random_state=42)

# Train-test split for both datasets
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)

# Random Forest model
rf_model = RandomForestClassifier(random_state=42)
# XGBoost model
xgb_model = XGBClassifier(random_state=42, eval_metric='mlogloss')

# Evaluate Random Forest on both datasets
rf_acc_dataset1 = cross_val_score(rf_model, X1, y1, cv=5, scoring='accuracy').mean()
rf_acc_dataset2 = cross_val_score(rf_model, X2, y2, cv=5, scoring='accuracy').mean()

# Evaluate XGBoost on both datasets
xgb_acc_dataset1 = cross_val_score(xgb_model, X1, y1, cv=5, scoring='accuracy').mean()
xgb_acc_dataset2 = cross_val_score(xgb_model, X2, y2, cv=5, scoring='accuracy').mean()

# Output the results
print(f"Random Forest Accuracy on Dataset 1 (High Variance): {rf_acc_dataset1:.4f}")
print(f"XGBoost Accuracy on Dataset 1 (High Variance): {xgb_acc_dataset1:.4f}")

print(f"Random Forest Accuracy on Dataset 2 (Complex Patterns): {rf_acc_dataset2:.4f}")
print(f"XGBoost Accuracy on Dataset 2 (Complex Patterns): {xgb_acc_dataset2:.4f}")


Random Forest Accuracy on Dataset 1 (High Variance): 0.8480
XGBoost Accuracy on Dataset 1 (High Variance): 0.8390
Random Forest Accuracy on Dataset 2 (Complex Patterns): 0.8790
XGBoost Accuracy on Dataset 2 (Complex Patterns): 0.8924


The difference between the two datasets and why **Random Forest** works better on one while **XGBoost** works better on the other can be explained by examining their characteristics.

### 1. **Dataset 1 (High Variance, Simple Patterns, Noisy Data)**

#### Characteristics:
- **Noise**: The `flip_y=0.3` parameter introduces 30% label noise, meaning that 30% of the labels are randomly flipped, making the dataset highly noisy and inconsistent, contributing to high variance.
- **Simple relationships**: Only 5 out of the 20 features are informative (`n_informative=5`), and there are no redundant or repeated features (`n_redundant=0, n_repeated=0`). Additionally, there is only 1 cluster per class (`n_clusters_per_class=1`), making the relationships between features and labels relatively simple.
- **Class separation**: The `class_sep=2` parameter indicates a high degree of separation between classes, meaning that the different classes are easily distinguishable.

#### Why Random Forest Works Better:
- **Bagging Reduces Variance**: Random Forest works well with high-variance models by using bagging (Bootstrap Aggregating), which reduces overfitting by averaging multiple decision trees trained on different subsets of the data. This helps smooth out the noise in the dataset.
- **Robust to Noise**: Since Random Forest averages predictions from multiple decision trees, it is more robust to noise. Even if individual trees overfit due to the high label noise, the ensemble tends to generalize better.
- **Simple Relationships**: Given the simpler relationships between features and labels (due to fewer informative features and only one cluster per class), Random Forest captures the patterns effectively without overcomplicating the model.

#### Conclusion:
Random Forest performs better in **high-variance, noisy datasets** with simpler patterns, where its ability to average predictions across independent trees helps mitigate the noise and variance.

---

### 2. **Dataset 2 (Complex Patterns, Low Noise)**

#### Characteristics:
- **Complex relationships**: This dataset has more informative features (`n_informative=10`), which means there are more meaningful and complex patterns between the features and the target labels. There are also two clusters per class (`n_clusters_per_class=2`), further increasing the complexity of the relationships.
- **Low noise**: The `flip_y=0.05` parameter introduces only 5% label noise, meaning that the relationships between features and labels are more consistent and less prone to random fluctuations.
- **Subtle class separation**: The `class_sep=0.5` parameter indicates that the separation between classes is more subtle, requiring the model to capture nuanced relationships between features to make accurate predictions.

#### Why XGBoost Works Better:
- **Boosting Reduces Bias**: XGBoost is a boosting method that focuses on reducing bias by sequentially improving weak learners. It works well when there are complex patterns to uncover because it refines the model iteratively by focusing on the hardest-to-classify examples.
- **Handling Complex Patterns**: XGBoost is particularly well-suited for datasets with more informative features and subtle class separations. Its iterative approach helps capture the deeper relationships in the data.
- **Low Noise**: XGBoost is more prone to overfitting in highly noisy datasets, but here, the low noise allows it to focus on capturing the complex patterns without being distracted by random noise.

#### Conclusion:
XGBoost performs better in **datasets with complex patterns and low noise**, where its sequential approach to refining the model allows it to capture subtle relationships between features and target labels.

---

### Summary of Differences:

| Dataset                   | Characteristics                                          | Why Random Forest Works Better           | Why XGBoost Works Better                 |
|---------------------------|----------------------------------------------------------|------------------------------------------|------------------------------------------|
| **Dataset 1 (High Variance, Noisy)**  | High noise (`flip_y=0.3`), few informative features, simple class separation (`class_sep=2`) | Reduces variance and handles noise effectively by averaging multiple models | Prone to overfitting to noisy data       |
| **Dataset 2 (Complex Patterns, Low Noise)** | Low noise (`flip_y=0.05`), more informative features, subtle class separation (`class_sep=0.5`) | Independent trees struggle to capture complex patterns | Captures complex relationships through iterative refinement |

### Key Takeaways:
- **Random Forest** works better on high-variance, noisy datasets with simpler relationships by reducing variance through bagging, making it more robust to noisy data.
- **XGBoost** excels in datasets with complex patterns and low noise, where it can iteratively reduce bias and improve performance by focusing on harder-to-classify instances.

In this case:
- **Random Forest performs better on Dataset 1**, which has high noise, few informative features, and simple class separation.
- **XGBoost performs better on Dataset 2**, which has low noise, more informative features, and complex relationships between features and classes.

# Excercise!

For the same datasets above, create a nested cross validation pipeline and then compare the results for XGBoost and RandomForest classifiers (with proper hyper-parameter tuning)


```python
# Random Forest works better: high variance dataset with noise
X1, y1 = make_classification(n_samples=5000, n_features=20, n_informative=5, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=1, flip_y=0.3, class_sep=2, random_state=42)

# XGBoost works better: dataset with complex interactions and subtle patterns
X2, y2 = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=2, flip_y=0.05, class_sep=0.5, random_state=42)
```

In [5]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Datasets
X1, y1 = make_classification(n_samples=5000, n_features=20, n_informative=5, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=1, flip_y=0.3, class_sep=2, random_state=42)
X2, y2 = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=0, n_repeated=0,
                             n_clusters_per_class=2, flip_y=0.05, class_sep=0.5, random_state=42)

# Models and hyperparameter grids
rf_model = RandomForestClassifier(random_state=42)
rf_param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}

xgb_model = XGBClassifier(random_state=42, eval_metric='mlogloss')
xgb_param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1, 0.3]}


def nested_cv(X, y, model, param_grid):
  outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
  outer_scores = []

  for train_index, test_index in outer_cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    inner_cv = KFold(n_splits=4, shuffle=True, random_state=42)
    grid_search = GridSearchCV(model, param_grid, cv=inner_cv, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    outer_scores.append(accuracy_score(y_test, grid_search.predict(X_test)))
  return np.mean(outer_scores)

# Evaluate models using nested cross-validation
rf_score_dataset1 = nested_cv(X1, y1, rf_model, rf_param_grid)
xgb_score_dataset1 = nested_cv(X1, y1, xgb_model, xgb_param_grid)

rf_score_dataset2 = nested_cv(X2, y2, rf_model, rf_param_grid)
xgb_score_dataset2 = nested_cv(X2, y2, xgb_model, xgb_param_grid)

# Output results
print(f"Random Forest Accuracy (Dataset 1): {rf_score_dataset1:.4f}")
print(f"XGBoost Accuracy (Dataset 1): {xgb_score_dataset1:.4f}")
print(f"Random Forest Accuracy (Dataset 2): {rf_score_dataset2:.4f}")
print(f"XGBoost Accuracy (Dataset 2): {xgb_score_dataset2:.4f}")

Random Forest Accuracy (Dataset 1): 0.8480
XGBoost Accuracy (Dataset 1): 0.8466
Random Forest Accuracy (Dataset 2): 0.8820
XGBoost Accuracy (Dataset 2): 0.8944
