## Validation Metrics


1. Introduction to Validation Metrics:
Validation metrics are used to evaluate the performance of machine learning models. In imbalanced datasets, where one class significantly outnumbers the other(s), traditional metrics like accuracy may not be informative. Therefore, specialized metrics are necessary to assess a model's performance in such scenarios.

2. Handling Imbalanced Datasets:
Before diving into metrics, let's briefly discuss techniques for handling imbalanced datasets:

  * Resampling: You can oversample the minority class, undersample the majority class, or use synthetic data generation techniques like SMOTE (Synthetic Minority Over-sampling Technique).
  * Algorithm Choice: Tree-based ensemble methods (Random Forest, Gradient Boosting) often cope with imbalanced data better than a single model, especially when combined with class weighting.
  * Cost-Sensitive Learning: Modify the algorithm's cost function so that misclassifying the minority class is penalized more heavily.
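
Two of the techniques above can be sketched with scikit-learn alone. This is a minimal illustration on synthetic data, not part of the notebook's exercise: the names `X_demo`, `y_demo`, `X_balanced`, `y_balanced`, and `weighted_rf` are made up for the example, and the 90/10 class ratio is an assumption chosen to make the imbalance visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Synthetic binary problem with roughly 10% positives (illustrative data only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Class counts before resampling:", np.bincount(y_demo))

# Resampling: oversample the minority class (with replacement) up to the majority size
X_maj, X_min = X_demo[y_demo == 0], X_demo[y_demo == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate(
    [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
)
print("Class counts after oversampling:", np.bincount(y_balanced))

# Cost-sensitive learning: weight each class inversely to its frequency
weighted_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
weighted_rf.fit(X_demo, y_demo)
```

In practice, resampling would normally be applied to the training split only, so that the test set keeps the original class distribution.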

3. Common Validation Metrics for Imbalanced Datasets:
Here are some common validation metrics for imbalanced datasets:

  * Accuracy:
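Accuracy is the ratio of correctly predicted instances to the total instances in the dataset. It can be misleading on imbalanced datasets: a model that always predicts the majority class scores high accuracy while never detecting the minority class.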

  * Precision:
Precision measures the percentage of true positives (correctly predicted positive instances) among all predicted positives.

  * Recall:
Recall calculates the percentage of true positives among all actual positive instances.

  * F1-Score:
The F1-Score is the harmonic mean of precision and recall and is especially useful when you want to balance precision and recall.

  * Area Under the Receiver Operating Characteristic Curve (ROC-AUC):
ROC-AUC measures the area under the Receiver Operating Characteristic curve and helps evaluate the model's ability to distinguish between classes.

  * Area Under the Precision-Recall Curve (PR AUC):
PR AUC calculates the area under the Precision-Recall curve and is useful when you care more about the minority class's performance.
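
To make the definitions above concrete, here is a small worked example on hand-made labels. The label lists are invented for illustration (95 negatives, 5 positives) and do not come from the datasets used later.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth: 95 negatives followed by 5 positives (illustrative only)
y_true_toy = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_majority = [0] * 100
acc_majority = accuracy_score(y_true_toy, y_majority)  # 95 correct out of 100
rec_majority = recall_score(y_true_toy, y_majority)    # 0 of 5 positives found
print(f"Always-negative: accuracy={acc_majority:.2f}, recall={rec_majority:.2f}")

# A model with 6 false positives, 89 true negatives, 4 true positives, 1 false negative
y_better = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1
acc_better = accuracy_score(y_true_toy, y_better)    # (89 + 4) / 100
prec_better = precision_score(y_true_toy, y_better)  # 4 / (4 + 6)
rec_better = recall_score(y_true_toy, y_better)      # 4 / (4 + 1)
f1_better = f1_score(y_true_toy, y_better)           # harmonic mean of the two
print(f"Better model: accuracy={acc_better:.2f}, precision={prec_better:.2f}, "
      f"recall={rec_better:.2f}, f1={f1_better:.2f}")
```

The always-negative model has the higher accuracy (0.95 versus 0.93) even though it finds none of the positives, which is why precision, recall, and F1 are reported alongside accuracy.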

4. Implementation and Example:
Let's implement these metrics using Python and scikit-learn. We'll use the "Breast Cancer" dataset from scikit-learn, which is mildly imbalanced (212 malignant and 357 benign samples), and the multi-class "Digits" dataset:


## Importing Libraries

In [None]:
# Import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc, precision_recall_curve, average_precision_score

## Importing Dataset

In [None]:
# Import breast cancer dataset for demo task
from sklearn.datasets import load_breast_cancer

In [None]:
# Import the load_digits dataset from sklearn.datasets
from sklearn.datasets import load_digits

## Analysing Dataset

### Loading and Analysing Breast Cancer Dataset

The dataset includes a total of 569 instances with 30 feature variables.

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()

# Print the attributes and methods of the data object
print(dir(data))

In [None]:
# Extract the data -> It will be your X
X = data.data

# Extract the target -> It will be your y
y = data.target

In [None]:
# Create a DataFrame with the data and feature names to view data
df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y  # Add target column

# Display the DataFrame
print(df)

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
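
Before training, it can help to check how imbalanced the classes actually are and to keep that ratio in both splits. The sketch below is a self-contained illustration; the names `bc`, `counts`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and chosen so they do not clash with the notebook's variables. The `stratify` argument is an optional addition, not something the split above uses.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()

# Count samples per class; label 0 is malignant and label 1 is benign
counts = np.bincount(bc.target)
for name, count in zip(bc.target_names, counts):
    print(f"{name}: {count} ({count / counts.sum():.1%})")

# Stratified split: both parts keep approximately the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42, stratify=bc.target
)
print(f"Benign share in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```

Without `stratify`, a random split can shift the class ratio between training and test sets, which matters more as the imbalance grows.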

### Loading and Analysing Digits Dataset

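The Digits dataset in scikit-learn contains 1,797 samples with 64 features each. The features are the pixel values of an 8x8 image grid, flattened into 64 columns. Note that the cells below reuse the names data, X, y, X_train, X_test, y_train, and y_test, so they overwrite the Breast Cancer variables; re-run the Breast Cancer cells above before training or evaluating the Breast Cancer model.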

In [None]:
# Load the Digits dataset
data = load_digits()

# Print the attributes and methods of the data object
print(dir(data))

In [None]:
# Extract the data -> It will be your X (this overwrites the breast cancer X)
X = data.data

# Extract the target -> It will be your y (this overwrites the breast cancer y)
y = data.target

# Extract the images -> These are the 8x8 digit images used for visualization
images = data.images


In [None]:
# Display images of the digits along with their labels
# Hint: Use plt.subplot, plt.imshow, plt.title, and plt.show
plt.figure(figsize=(20, 4))
# Plot the first 10 digits
for index, (image, label) in enumerate(zip(images[:10], y[:10])):
    plt.subplot(1, 10, index + 1)
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title(f'Label: {label}\n', fontsize=20)
plt.show()


In [None]:
# Create a DataFrame with the data and feature names to view data
df = pd.DataFrame(X, columns=data.feature_names)
# Add a target column to the DataFrame to include the target labels
df['Target'] = y

# Display the first few rows of the DataFrame
print(df.head())


In [None]:
# Split the dataset into training and testing sets -> test_size = 0.2 and random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Building and Training your Model

### Model for Breast Cancer Dataset

In [None]:
# Create a random forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the random forest classifier
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

### Model for Digits Dataset

In [None]:
# Create a random forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)


In [None]:
# Display images of the digits along with their predicted and actual labels
plt.figure(figsize=(25, 5))
# Plot the first 10 digits

for index, (image, label, prediction) in enumerate(zip(X_test[:10], y_test[:10], y_pred[:10])):
    plt.subplot(1, 10, index + 1)
    plt.imshow(image.reshape(8,8), cmap=plt.cm.gray)

    plt.title(f'Actual: {label}\nPrediction: {prediction}', fontsize=20)
plt.show()

## Validation Metrics

### Validation Metrics for Breast Cancer Dataset

In [None]:
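# Calculate validation metrics
# Threshold-based metrics use the hard class predictions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Ranking-based metrics need scores, so use the predicted probability of class 1
y_score = rf_classifier.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_score)
precision_recall_auc = average_precision_score(y_test, y_score)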

In [None]:
# Print the metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
print(f"Precision-Recall AUC: {precision_recall_auc:.2f}")
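
The precision, recall, and F1 values above treat label 1 as the positive class, which in this dataset is benign, the majority class. A per-class view can make the minority (malignant) class visible. The following is a self-contained sketch that retrains the same kind of model; `bc`, `clf`, `y_hat`, `cm`, and `malignant_recall` are hypothetical names used only here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes,
# both in label order 0 (malignant), 1 (benign)
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Precision, recall, and F1 for each class separately
print(classification_report(y_te, y_hat, target_names=bc.target_names))

# Recall with malignant (label 0) treated as the positive class
malignant_recall = recall_score(y_te, y_hat, pos_label=0)
print(f"Malignant recall: {malignant_recall:.2f}")
```

Which class counts as "positive" is a modelling decision; passing `pos_label=0` is one way to report metrics for the malignant class without relabelling the data.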


### Validation Metrics for Digits Dataset

In [None]:
# Calculate validation metrics
# Hint -> for multi-class classification, set the average argument (here: 'weighted')
from sklearn.preprocessing import label_binarize

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# ROC-AUC and PR AUC are computed from class probabilities (one column per digit)
y_proba = rf_classifier.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=rf_classifier.classes_)
roc_auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='weighted')
precision_recall_auc = average_precision_score(y_test_bin, y_proba, average='weighted')


# Print the metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
print(f"Precision-Recall AUC: {precision_recall_auc:.2f}")



### ROC Curve for Breast Cancer Dataset

In [None]:
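# Plot ROC curve
# Use the predicted probability of class 1 so the curve covers many thresholds
y_score = rf_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)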

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
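
The Precision-Recall curve described earlier can be plotted in much the same way as the ROC curve. This is a self-contained sketch rather than the notebook's own cell: it retrains a random forest on the Breast Cancer data, and the names `bc`, `clf`, `scores`, `pr_precision`, `pr_recall`, and `ap` are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Predicted probability of class 1, used as the ranking score
scores = clf.predict_proba(X_te)[:, 1]

# Precision and recall at every score threshold, plus the average precision summary
pr_precision, pr_recall, _ = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)

plt.figure()
plt.plot(pr_recall, pr_precision, color='darkorange', lw=2,
         label=f'PR curve (AP = {ap:.2f})')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()
```

Unlike the ROC curve, the baseline of a PR curve depends on the share of positives in the data, so it tends to be the more informative plot when the positive class is rare.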

### ROC Curve for Digits Dataset


In [None]:
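# Plot a one-vs-rest ROC curve for digit 3
digit = 3

# Binary ground truth: 1 where the true label is the chosen digit, 0 otherwise
y_test_digit = (y_test == digit).astype(int)
# Predicted probability of the chosen digit; columns follow rf_classifier.classes_
digit_column = list(rf_classifier.classes_).index(digit)
y_score_digit = rf_classifier.predict_proba(X_test)[:, digit_column]

fpr, tpr, _ = roc_curve(y_test_digit, y_score_digit)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Receiver Operating Characteristic (digit {digit} vs rest)')
plt.legend(loc='lower right')
plt.show()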

### Optional

In [None]:
# Plot ROC curves for all digits
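
One possible approach to this optional task is a one-vs-rest ROC curve per digit, drawn on a single figure. The sketch below is only one way to do it and is self-contained: it reloads the Digits data and retrains a random forest, and the names `digits`, `clf`, `proba`, `y_te_bin`, and `per_digit_auc` are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Class probabilities: one column per digit, in the order of clf.classes_
proba = clf.predict_proba(X_te)
# One-hot encode the true labels so each column is a "this digit vs rest" problem
y_te_bin = label_binarize(y_te, classes=clf.classes_)

per_digit_auc = {}
plt.figure(figsize=(8, 6))
for col, digit in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, col], proba[:, col])
    per_digit_auc[int(digit)] = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1.5,
             label=f'Digit {digit} (area = {per_digit_auc[int(digit)]:.2f})')

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('One-vs-Rest ROC Curves for All Digits')
plt.legend(loc='lower right')
plt.show()
```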

