# [Nicholas Yim, Aseef Durrani]
# Dataset \#2 - Brain Tumor MRI Classification
---

In [39]:
# Import relevant Python libraries

import os
import time
import cv2
import pandas as pd
import numpy as np


In [40]:
# Plotly Dependencies

import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio

# **A. Load and Explore Dataset**

In [41]:
# Define dataset paths
TRAINING_DIR = "../datasets/mri/Training/"
TESTING_DIR = "../datasets/mri/Testing/"
CLASSES = ["glioma", "meningioma", "notumor", "pituitary"]

In [42]:
# Function to load images and labels
def load_data(data_dir, target_size=(128, 128)):
    images, labels, invalid_files = [], [], []
    for label, cls in enumerate(CLASSES):
        class_path = os.path.join(data_dir, cls)
        for img_name in os.listdir(class_path):
            img_path = os.path.join(class_path, img_name)
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)  # Load as grayscale
            if img is not None and img.size > 0:  # Check for valid image
                resized_img = cv2.resize(img, target_size)  # Resize to target size
                images.append(resized_img)
                labels.append(label)
            else:
                invalid_files.append(img_path)
    return np.array(images), np.array(labels), invalid_files

In [43]:
# Load training and testing data
X_train, y_train, invalid_train = load_data(TRAINING_DIR)
X_test, y_test, invalid_test = load_data(TESTING_DIR)

# Report invalid files
print(f"Invalid training files: {len(invalid_train)}")
if invalid_train:
    print(f"Invalid training files: {invalid_train}")
print(f"Invalid testing files: {len(invalid_test)}")
if invalid_test:
    print(f"Invalid testing files: {invalid_test}")

# Dataset information
print(f"Training samples: {len(X_train)}, Testing samples: {len(X_test)}")
print(f"Image shape after resizing: {X_train[0].shape}")

Invalid training files: 0
Invalid testing files: 0
Training samples: 5712, Testing samples: 1311
Image shape after resizing: (128, 128)


In [44]:
# Analyze class distribution
class_counts = {cls: int(np.sum(y_train == idx)) for idx, cls in enumerate(CLASSES)}
print("Training class distribution:")
for cls, count in class_counts.items():
    print(f"{cls}: {count} samples")

Training class distribution:
glioma: 1321 samples
meningioma: 1339 samples
notumor: 1595 samples
pituitary: 1457 samples


In [45]:
# Visualize class distribution
fig = go.Figure(data=[
    go.Bar(x=list(class_counts.keys()), y=list(class_counts.values()), marker_color='skyblue')
])
fig.update_layout(
    title="Class Distribution in Training Data",
    xaxis_title="Class",
    yaxis_title="Number of Samples",
    xaxis_tickangle=-45
)
fig.show()

In [46]:
# Visualize sample images
sample_images = []
for i, cls in enumerate(CLASSES):
    cls_indices = np.where(y_train == i)[0]
    if len(cls_indices) > 0:
        cls_idx = cls_indices[0]  # Get the first index for the class
        sample_images.append(X_train[cls_idx])
    else:
        print(f"No samples found for class {cls}")

# Use plotly express for a clean visualization
fig = px.imshow(
    np.array(sample_images),
    facet_col=0,
    facet_col_wrap=4,
    color_continuous_scale='gray',
)
fig.update_layout(
    title="Sample Images from Each Class",
    coloraxis_showscale=False,
)
fig.for_each_annotation(lambda a: a.update(text=CLASSES[int(a.text.split('=')[-1])]))
fig.update_xaxes(visible=False)  # Hide x-axis
fig.update_yaxes(visible=False)  # Hide y-axis
fig.show()

In [47]:
# Plot histogram of pixel intensity values
pixel_values = X_train.ravel()
histogram = np.histogram(pixel_values, bins=50)

fig = go.Figure(data=[
    go.Bar(x=histogram[1][:-1], y=histogram[0], marker_color='gray')
])
fig.update_layout(
    title="Pixel Intensity Distribution",
    xaxis_title="Pixel Value",
    yaxis_title="Frequency"
)
fig.show()

# **Dataset Exploration**

## Overview
This notebook explores the dataset by loading images, analyzing their distribution, and visualizing them. The exploration uses **OpenCV (`cv2`)**, a computer vision library, for image preprocessing tasks. Below is a detailed explanation of its usage in this notebook.

---

### **Modules and Functions Explained**

#### **`cv2.imread()`**
- **Purpose**: Reads image files and loads them into memory.
- **Usage in Notebook**:
  - `cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)` loads each image from the dataset directory in grayscale mode.
  - **Reason**: Grayscale mode simplifies analysis by reducing the image to a single channel instead of three (RGB), thereby reducing computational complexity [1].

#### **`cv2.resize()`**
- **Purpose**: Resizes images to a specified target size.
- **Usage in Notebook**:
  - `cv2.resize(img, target_size)` resizes each grayscale image to \(128 \times 128\) pixels.
  - **Reason**: Ensures all images are uniform in size, which is essential for consistent processing and analysis in machine learning tasks [1].

---

### **Dataset Exploration Steps**
1. **Invalid Image Checks**:
   - Images are checked to ensure they are neither `None` (missing) nor zero-sized.
   - Invalid images are logged for inspection.

2. **Training and Testing Samples**:
   - The number of images in each dataset is reported, confirming successful loading.

3. **Class Distribution**:
   - A bar chart visualizes the number of samples for each class, revealing any imbalances.

4. **Image Visualization**:
   - Representative images from each class are displayed to confirm proper loading and class labeling.

5. **Pixel Intensity Distribution**:
   - A histogram shows the distribution of pixel values (0-255) across the dataset, aiding in understanding the brightness and contrast characteristics of the images [1].
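
The class-count and intensity-histogram steps can be sketched on synthetic data. This is a minimal illustration rather than the notebook's own cell: `labels` and `images` are made-up stand-ins for `y_train` and `X_train`, and the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for y_train and X_train (not the real MRI data)
labels = rng.integers(0, 4, size=200)
images = rng.integers(0, 256, size=(200, 8, 8)).astype(np.uint8)

# Class counts: np.bincount returns one count per integer label
counts = np.bincount(labels, minlength=4)

# Intensity histogram: np.histogram returns (counts, bin_edges),
# with one more edge than there are bins
hist, edges = np.histogram(images.ravel(), bins=50)

print(counts, hist.sum(), len(edges))
```

Because `np.histogram` returns 51 edges for 50 bins, plotting against `edges[:-1]` (as the notebook does) places each bar at the left edge of its bin.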

---

## Insights
- The use of **`cv2.imread`** and **`cv2.resize`** ensures efficient image preprocessing, making the dataset ready for machine learning workflows.
- These functions are computationally efficient and widely used in the field, making them suitable for this task.

# **B. Pre-Processing of the Dataset**

In [48]:
# Normalize pixel values to [0, 1] range
def preprocess_images(images):
    return images / 255.0

# One-hot encode labels
def one_hot_encode(labels, num_classes):
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1
    return encoded

# Apply pre-processing
X_train = preprocess_images(X_train)
X_test = preprocess_images(X_test)

# One-hot encode labels
y_train_one_hot = one_hot_encode(y_train, num_classes=len(CLASSES))
y_test_one_hot = one_hot_encode(y_test, num_classes=len(CLASSES))

# Print pre-processed dataset stats
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
print(f"One-hot encoded label shape: {y_train_one_hot.shape}")

# Dictionary to store the first index found for each class
first_occurrences = {}

# Iterate over training labels and find the first occurrence of each class
for i, one_hot_vector in enumerate(y_train_one_hot):
    class_index = np.argmax(one_hot_vector)
    if class_index not in first_occurrences:
        first_occurrences[class_index] = i
    # If we have found at least one occurrence of each class, we can stop.
    if len(first_occurrences) == len(CLASSES):
        break

# Now print out the first occurrence for each class
for class_index in sorted(first_occurrences.keys()):
    i = first_occurrences[class_index]
    one_hot_vector = y_train_one_hot[i]
    class_name = CLASSES[class_index]
    print(f"One-hot vector: {one_hot_vector} -> {class_name}")

Training data shape: (5712, 128, 128)
Testing data shape: (1311, 128, 128)
One-hot encoded label shape: (5712, 4)
One-hot vector: [1. 0. 0. 0.] -> glioma
One-hot vector: [0. 1. 0. 0.] -> meningioma
One-hot vector: [0. 0. 1. 0.] -> notumor
One-hot vector: [0. 0. 0. 1.] -> pituitary


In [49]:
# Prepare data for visualization
visualization_data = [
    {
        "Image": X_train[i],
        "Label": CLASSES[np.argmax(y_train_one_hot[i])]
    }
    for i in range(5)
]

# Create subplots to display images
fig = px.imshow(
    np.stack([data["Image"] for data in visualization_data]),
    facet_col=0,
    color_continuous_scale="gray"
)

# Update layout for clean visualization
fig.update_layout(
    title="First 5 Training Images and Their Labels",
    coloraxis_showscale=False,
)
fig.for_each_annotation(lambda a: a.update(text=visualization_data[int(a.text.split("=")[-1])]["Label"]))
fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)
fig.show()


# **Pre-processing of the Dataset**

## Overview
Preprocessing is an essential step to prepare the dataset for machine learning. Based on the dataset exploration, the following preprocessing steps were applied:

---

### **Image Normalization**
- **Reason**: 
  - Pixel intensity values in the images range from 0 to 255. Normalizing these values to the range [0, 1] improves numerical stability and ensures consistency across images.
- **Implementation**:
  - Each pixel value was divided by 255.0 using a custom normalization function.
  - This scales all images to the same range, making them suitable for input into machine learning models.
- **Impact**:
  - Normalization helps speed up the convergence of the model during training by standardizing the input data.
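
As a small illustration of the scaling step, the sketch below divides a made-up 2x2 image by 255.0. The array `img` is hypothetical; only the division mirrors `preprocess_images`.

```python
import numpy as np

# Hypothetical 2x2 grayscale image with 8-bit intensities
img = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Dividing by a float promotes the uint8 array to float64
scaled = img / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```

One side effect worth keeping in mind: the result is `float64`, which uses eight times the memory of `uint8`. For 5,712 images of 128x128 pixels that is roughly 750 MB, so casting to `np.float32` may be a reasonable choice if memory is tight.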

---

### **One-Hot Encoding of Labels**
- **Reason**: 
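  - The labels are categorical integers (`0-3`), representing the four classes: glioma, meningioma, notumor, and pituitary. Models trained with a categorical cross-entropy loss, such as neural networks, typically expect one-hot targets. The scikit-learn classifiers used later in this notebook are fit on the integer labels `y_train` directly, so the one-hot vectors serve as a prepared alternative representation.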
- **Implementation**:
  - A custom one-hot encoding function was used to convert each label into a vector of size 4:
    - Label `0` → `[1, 0, 0, 0]` (glioma)
    - Label `1` → `[0, 1, 0, 0]` (meningioma)
    - Label `2` → `[0, 0, 1, 0]` (notumor)
    - Label `3` → `[0, 0, 0, 1]` (pituitary)
  - The encoded vectors ensure that the model treats each class as distinct and non-ordinal.
- **Impact**:
  - One-hot encoding prevents the model from interpreting the class labels as having any inherent order.
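
The encoding and its inverse can be sketched in a few lines. This version indexes an identity matrix instead of using the notebook's `one_hot_encode` function, and the `labels` array is a made-up example.

```python
import numpy as np

CLASSES = ["glioma", "meningioma", "notumor", "pituitary"]
labels = np.array([2, 0, 3, 1, 2])  # hypothetical integer labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(CLASSES))[labels]

# argmax along each row recovers the original integer label
recovered = one_hot.argmax(axis=1)

print(one_hot[0], "->", CLASSES[recovered[0]])
```

The round trip through `argmax` is the same operation the notebook uses when it maps one-hot vectors back to class names.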

---

### **Verification**
- The shapes of the preprocessed datasets are as follows:
  - **Training Data**:
    - Images: `(5712, 128, 128)` — Each image is normalized to a 128x128 pixel array with values between 0 and 1.
    - Labels: `(5712, 4)` — Labels are represented as one-hot encoded vectors for four classes: glioma, meningioma, notumor, and pituitary.
  - **Testing Data**:
    - Images: `(1311, 128, 128)` — Each image is normalized to a 128x128 pixel array with values between 0 and 1.
    - Labels: `(1311, 4)` — Labels are represented as one-hot encoded vectors.

---

## Insights
- All images now share the same [0, 1] intensity range, which keeps feature scales consistent for PCA and the downstream classifiers.
- Labels are correctly formatted for classification tasks, with each class treated independently.
- These steps address findings from the dataset exploration phase (e.g., wide pixel intensity range and the need for standardized class encoding).
- With normalization and one-hot encoding complete, the dataset is now ready for feature engineering and model training, ensuring efficient and accurate downstream processing.


# **C. Feature Engineering / Learning from Dataset**

In [50]:
from sklearn.decomposition import PCA

# Flatten images for PCA
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_test_flat = X_test.reshape(X_test.shape[0], -1)

# Apply PCA to reduce dimensions to 1600 components
n_components = 1600
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_flat)
X_test_pca = pca.transform(X_test_flat)

# Explained variance ratio
explained_variance_ratio = np.sum(pca.explained_variance_ratio_)
print(f"Explained Variance with {n_components} components: {explained_variance_ratio:.2f}")

# Verify shape after PCA
print(f"Shape of training data after PCA: {X_train_pca.shape}")
print(f"Shape of testing data after PCA: {X_test_pca.shape}")

Explained Variance with 1600 components: 0.97
Shape of training data after PCA: (5712, 1600)
Shape of testing data after PCA: (1311, 1600)


# **Feature Engineering: Principal Component Analysis (PCA)**

## Overview
This notebook applies **Principal Component Analysis (PCA)** to reduce the dimensionality of the dataset. PCA transforms the high-dimensional image data into a lower-dimensional subspace while retaining as much variance as possible. The implementation leverages **`scikit-learn`'s `PCA` module**, which uses Singular Value Decomposition (SVD) internally for efficient computation [2].

---

### **Modules and Functions Explained**

#### **`PCA` Class (from `sklearn.decomposition`)**
- **Purpose**: The `PCA` class performs Principal Component Analysis for dimensionality reduction [2].
- **Key Parameters**:
  - `n_components`: Specifies the number of principal components to retain. In this implementation, `n_components=1600`, which represents approximately 10% of the original features.
- **Key Attributes**:
  - `explained_variance_ratio_`: Proportion of variance explained by each principal component.
  - `components_`: Principal axes in the feature space (eigenvectors).
- **Key Methods**:
  - `fit_transform`: Fits the PCA model to the data and applies dimensionality reduction simultaneously.
  - `transform`: Projects new data onto the existing PCA subspace.

*(Adapted from the PCA method explained in [2].)*

---

### **PCA Steps**

1. **Flattening Images**:
   - Each image was reshaped from a 2D array of shape `(128, 128)` to a 1D array of shape `(16384,)`, so the training matrix has shape `(5712, 16384)`.
   - Flattening ensures that PCA treats each pixel as an individual feature.

2. **Data Normalization**:
   - Pixel values were normalized to the range [0, 1] during preprocessing.
   - Normalization ensures consistent scaling and prepares the data for PCA.

3. **Applying PCA**:
   - The `PCA` class from `scikit-learn` was used to reduce the dataset's dimensionality.
   - PCA automatically computes principal components using SVD, bypassing the need to construct the covariance matrix explicitly.

4. **Dimensionality Reduction**:
   - \(n_{\text{components}} = 1600\) was selected, reducing the data from \(16,384\) features to \(1,600\) features.
   - These components captured 97% of the dataset's variance, significantly reducing dimensionality while retaining most of the information.

5. **Projection of Test Data**:
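   - scikit-learn's `PCA` centers the data internally: the per-feature mean learned from the training set is subtracted in `fit_transform`, and that same training mean is subtracted from the test set in `transform`.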
   - The test data was then projected onto the same PCA subspace as the training data using the selected principal components.
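
The steps above can be sketched with plain NumPy to show what happens under the hood. This is an illustrative reconstruction on random data, not scikit-learn's actual code path; the names `X_tr`, `X_te`, and `k` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 20))  # hypothetical flattened training images
X_te = rng.normal(size=(30, 20))   # hypothetical flattened test images
k = 5                              # number of components to keep

# Center with the TRAINING mean only
mean = X_tr.mean(axis=0)
U, S, Vt = np.linalg.svd(X_tr - mean, full_matrices=False)
components = Vt[:k]                # principal axes, shape (k, n_features)

# Project both sets onto the same subspace
Z_tr = (X_tr - mean) @ components.T
Z_te = (X_te - mean) @ components.T

# Fraction of total variance captured by the first k components
explained = (S[:k] ** 2).sum() / (S ** 2).sum()

print(Z_tr.shape, Z_te.shape, round(float(explained), 3))
```

Reusing the training mean and training components for the test set keeps information from the test data out of the fitted transformation.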

---

### **Results**
1. **Explained Variance**:
   - The top 1,600 components captured **97% of the variance**, indicating that most of the dataset's information is retained in the reduced subspace.

2. **Dataset Shapes**:
   - **Training Data**: Reduced from \(16,384\) features to \(1,600\) features per sample, resulting in a shape of `(5712, 1600)`.
   - **Testing Data**: Reduced from \(16,384\) features to \(1,600\) features per sample, resulting in a shape of `(1311, 1600)`.

---

## Insights
- PCA successfully reduced the dimensionality of the dataset while retaining **97% of the variance**.
- This dimensionality reduction simplifies computations for downstream tasks, such as classification, clustering, or visualization [3].
- By using `scikit-learn`, we leveraged an efficient and robust implementation of PCA that handles high-dimensional data effectively.
- The selection of 1,600 components balances variance retention with computational efficiency, ensuring that the majority of meaningful information in the dataset is preserved [3].
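
An alternative to fixing the component count by hand is to let `PCA` pick it from a variance target. The sketch below runs on synthetic data with an assumed low-rank structure; the 0.97 threshold simply mirrors the variance reported above, and `svd_solver="full"` is set explicitly because fractional `n_components` relies on the full decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data with correlated features so a few components dominate
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 40)) + 0.05 * rng.normal(size=(200, 40))

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.97, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

On the real 16,384-feature data this would cost more than a fixed, smaller decomposition, so it is best treated as a way to choose the number once rather than something to rerun routinely.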


# **D. Processing of Dataset**

In [51]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define a hyperparameter grid
svm_params = {
    'C': [1, 10],  # Reduced number of values
    'gamma': ['scale', 0.1],
    'kernel': ['rbf']  # Retain RBF for non-linear data
}

# Initialize SVM and GridSearchCV
svm = SVC()
svm_grid = GridSearchCV(
    estimator=svm,
    param_grid=svm_params,
    cv=3,  # Use 3-fold cross-validation
    scoring='accuracy',
    verbose=1,
    n_jobs=-1  # Utilize all available cores
)

# Fit GridSearchCV on the entire training data
print("Training SVM model...")
svm_grid.fit(X_train_pca, y_train)

# Save the best model
best_svm = svm_grid.best_estimator_
print(f"Best SVM parameters: {svm_grid.best_params_}")

Training SVM model...
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best SVM parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}


In [52]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set using the best SVM model
y_test_pred_svm = best_svm.predict(X_test_pca)

# Evaluate SVM on the test set
accuracy_svm = accuracy_score(y_test, y_test_pred_svm)
print(f"SVM Accuracy on Test Set: {accuracy_svm}")
print("SVM Classification Report:")
print(classification_report(y_test, y_test_pred_svm, target_names=CLASSES))

SVM Accuracy on Test Set: 0.9610983981693364
SVM Classification Report:
              precision    recall  f1-score   support

      glioma       0.95      0.93      0.94       300
  meningioma       0.93      0.91      0.92       306
     notumor       0.98      1.00      0.99       405
   pituitary       0.98      1.00      0.99       300

    accuracy                           0.96      1311
   macro avg       0.96      0.96      0.96      1311
weighted avg       0.96      0.96      0.96      1311



In [53]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define updated Random Forest hyperparameter grid
rf_params = {
    'n_estimators': [50, 100, 200],  # Flexibility in the number of trees
    'max_depth': [10, 20, None],  # Limited and unlimited depth
    'min_samples_split': [2, 5],  # Minimum samples for a node split
}

# Initialize Random Forest model
rf = RandomForestClassifier(random_state=42)

# Perform Grid Search with Cross-Validation
rf_grid = GridSearchCV(rf, rf_params, cv=3, scoring='accuracy', n_jobs=-1)

# Fit Grid Search on training data
print("Training Random Forest model...")
rf_grid.fit(X_train_pca, y_train)

# Save the best model
best_rf = rf_grid.best_estimator_
print(f"Best Random Forest parameters: {rf_grid.best_params_}")

Training Random Forest model...
Best Random Forest parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}


In [54]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set using the best Random Forest model
y_test_pred_rf = best_rf.predict(X_test_pca)

# Evaluate performance on the test set
accuracy_rf = accuracy_score(y_test, y_test_pred_rf)
print(f"Random Forest Accuracy on Test Set: {accuracy_rf}")
print("Random Forest Classification Report:")
print(classification_report(y_test, y_test_pred_rf, target_names=CLASSES))

Random Forest Accuracy on Test Set: 0.8794813119755912
Random Forest Classification Report:
              precision    recall  f1-score   support

      glioma       0.83      0.74      0.78       300
  meningioma       0.81      0.80      0.80       306
     notumor       0.99      0.99      0.99       405
   pituitary       0.85      0.95      0.90       300

    accuracy                           0.88      1311
   macro avg       0.87      0.87      0.87      1311
weighted avg       0.88      0.88      0.88      1311



In [55]:
from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

# Function to plot confusion matrices using Plotly
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    cm_text = [[str(cell) for cell in row] for row in cm]

    fig = ff.create_annotated_heatmap(
        z=cm,
        x=CLASSES,
        y=CLASSES,
        annotation_text=cm_text,
        colorscale="Blues",
    )
    fig.update_layout(
        title=title,
        xaxis_title="Predicted",
        yaxis_title="Actual",
        xaxis=dict(tickmode="array", tickvals=list(range(len(CLASSES))), ticktext=CLASSES),
        yaxis=dict(tickmode="array", tickvals=list(range(len(CLASSES))), ticktext=CLASSES),
    )
    fig.show()

# Plot confusion matrices
plot_confusion_matrix(y_test, y_test_pred_svm, "SVM Confusion Matrix")
plot_confusion_matrix(y_test, y_test_pred_rf, "Random Forest Confusion Matrix")

In [56]:
# Extract feature importances from the best Random Forest model
feature_importances = best_rf.feature_importances_

# Create a DataFrame for visualization
importance_df = pd.DataFrame({
    "Feature": [f"PC{i+1}" for i in range(len(feature_importances))],
    "Importance": feature_importances,
}).sort_values(by="Importance", ascending=False).head(10)  # Top 10 features

# Horizontal bar plot
fig = px.bar(
    importance_df,
    x="Importance",
    y="Feature",
    orientation="h",
    title="Top 10 Feature Importances (PCA Components)",
    labels={"Importance": "Importance", "Feature": "Feature (Principal Components)"},
)
fig.update_layout(
    xaxis_title="Importance",
    yaxis_title="Feature",
    showlegend=False,
    height=500,
)
fig.show()

In [57]:
# Print a table of top 10 feature importances
from tabulate import tabulate

# Extract data for table
table_data = importance_df.values.tolist()
table_headers = ["Feature", "Importance"]

# Print the table
print(tabulate(table_data, headers=table_headers, tablefmt="pretty"))

+---------+----------------------+
| Feature |      Importance      |
+---------+----------------------+
|   PC2   | 0.03862667406751771  |
|   PC1   | 0.03783486017439883  |
|   PC4   | 0.01274167606101232  |
|   PC3   | 0.009630583848444625 |
|  PC21   | 0.00935151815175204  |
|  PC15   | 0.00698767003888738  |
|   PC7   | 0.006411683330571576 |
|  PC24   | 0.005981861092320659 |
|  PC13   | 0.005939145405281331 |
|   PC6   | 0.005835037241322338 |
+---------+----------------------+


# **Classification Using SVM and Random Forest**

## Overview
This section implements two machine learning models, **Support Vector Machine (SVM)** and **Random Forest**, to classify MRI images into four classes: three tumor types (glioma, meningioma, pituitary) and notumor. Both models use PCA-reduced features as input, optimizing performance while retaining computational efficiency [4], [6].

---

### **Modules and Functions Explained**

#### **`SVC` Class (from `sklearn.svm`)** [2]
- **Purpose**: The `SVC` class implements the Support Vector Machine algorithm for classification tasks [4].
- **Key Parameters**:
  - `C`: Regularization parameter. Smaller values regularize more strongly (wider margin, more training errors tolerated); larger values fit the training data more tightly.
  - `kernel`: Specifies the kernel type. Used `rbf` in this implementation for its flexibility in decision boundaries.
  - `gamma`: Kernel coefficient for non-linear kernels (e.g., RBF).
- **Key Methods**:
  - `fit`: Trains the model on the given dataset.
  - `predict`: Predicts labels for new data points.

#### **`RandomForestClassifier` Class (from `sklearn.ensemble`)** [2]
- **Purpose**: Implements the Random Forest algorithm, an ensemble method that combines multiple decision trees to improve classification performance [5].
- **Key Parameters**:
  - `n_estimators`: Number of decision trees in the forest.
  - `max_depth`: Maximum depth of each tree.
  - `min_samples_split`: Minimum number of samples required to split a node.
- **Key Methods**:
  - `fit`: Trains the Random Forest model on the dataset.
  - `predict`: Predicts labels for new data points.

#### **`GridSearchCV` Class (from `sklearn.model_selection`)** [2]
- **Purpose**: Performs exhaustive search over a grid of hyperparameters to find the best model configuration [2].
- **Key Parameters**:
  - `param_grid`: Dictionary specifying the hyperparameters to search.
  - `cv`: Number of cross-validation folds.
  - `scoring`: Metric used to evaluate model performance (e.g., accuracy).
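- **Key Methods**:
  - `fit`: Trains and cross-validates the model for every parameter combination, then refits the best one on the full training data (default `refit=True`).
- **Key Attributes**:
  - `best_estimator_`: The refit model with the best cross-validated score.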
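
To make the grid-search workflow concrete, here is a minimal sketch on a synthetic four-class problem. The data from `make_classification` is a stand-in for the PCA features, and the split sizes are arbitrary assumptions; only the parameter grid matches the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic 4-class problem standing in for the PCA features
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf"]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

cv_accuracy = grid.best_score_          # mean accuracy over the 3 validation folds
test_accuracy = grid.score(X_te, y_te)  # accuracy of the refit best model on held-out data

print(grid.best_params_, cv_accuracy, test_accuracy)
```

The two numbers answer different questions: `best_score_` is a cross-validation (validation) estimate used to pick hyperparameters, while the held-out score estimates performance on unseen data.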

---

### **1. Support Vector Machine (SVM)**

#### **Overview**
- **SVM** is a supervised learning algorithm that finds the optimal hyperplane to separate classes in a high-dimensional feature space.
- Works effectively with the PCA-reduced data due to its ability to handle high-dimensional input.

#### **Implementation**
- A grid search was used to tune hyperparameters:
  - `C`: Regularization parameter controlling the trade-off between margin width and classification accuracy.
  - `kernel`: Determines the type of decision boundary (`rbf` used for non-linear decision boundaries).
  - `gamma`: Kernel coefficient for non-linear kernels.
- Cross-validation (\(k=3\)) was used to evaluate performance across folds [4].
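
For intuition about `gamma`, the sketch below computes the value scikit-learn documents for `gamma='scale'`, namely `1 / (n_features * X.var())`, and evaluates the RBF kernel by hand. The helper `rbf_similarity` and the random matrix `X` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # hypothetical feature matrix

# gamma='scale' in scikit-learn is 1 / (n_features * X.var())
gamma = 1.0 / (X.shape[1] * X.var())

def rbf_similarity(a, b, gamma):
    """RBF kernel value: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

k_same = rbf_similarity(X[0], X[0], gamma)  # identical points
k_diff = rbf_similarity(X[0], X[1], gamma)  # distinct points

print(gamma, k_same, k_diff)
```

Identical points always score 1, and the score decays toward 0 with distance; a larger `gamma` makes that decay faster, which yields a more local, more flexible decision boundary.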

#### **Results**
- The optimized SVM achieved a test accuracy of **96.11%**.
- **Classification Report (SVM)**:
  - **Precision**: Average 96%
  - **Recall**: Average 96%
  - **F1-Score**: Average 96%
- **Confusion Matrix**:
  - Glioma and meningioma showed the most confusion: their recall in the classification report is 0.93 and 0.91, against 1.00 for notumor and pituitary.
  - Misclassifications were few across all four classes.

---

### **2. Random Forest**

#### **Overview**
- **Random Forest** is an ensemble method that trains multiple decision trees and combines their outputs for classification [5], [6].
- Offers robust performance on high-dimensional data and provides feature importance insights.

#### **Implementation**
- A grid search was used to tune hyperparameters:
  - `n_estimators`: Number of trees in the forest.
  - `max_depth`: Maximum depth of each tree.
  - `min_samples_split`: Minimum number of samples required to split a node.
- Cross-validation (\(k=3\)) was used to evaluate performance [5].
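
The cost of each grid search follows directly from the grid size. As a quick check, the sketch below counts candidates and cross-validation fits for both grids used in this notebook; `count_fits` is a hypothetical helper written for this illustration.

```python
from itertools import product

svm_params = {"C": [1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf"]}
rf_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
}

def count_fits(param_grid, cv):
    """Number of candidate settings and total cross-validation fits."""
    candidates = len(list(product(*param_grid.values())))
    return candidates, candidates * cv

print(count_fits(svm_params, cv=3))  # (4, 12)
print(count_fits(rf_params, cv=3))   # (18, 54)
```

The SVM figure matches the log line "Fitting 3 folds for each of 4 candidates, totalling 12 fits". With the default `refit=True`, each search also performs one additional fit of the best setting on the full training set.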

#### **Results**
- The optimized Random Forest achieved a test accuracy of **87.95%**.
- **Classification Report (Random Forest)**:
  - **Precision**: Average 88%
  - **Recall**: Average 88%
  - **F1-Score**: Average 88%
- **Confusion Matrix**:
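  - The notumor class was classified almost perfectly (recall 0.99).
  - Glioma and meningioma were classified markedly worse (recall 0.74 and 0.80).
  - Pituitary recall stayed high (0.95), but its precision dropped to 0.85, meaning samples from other classes were often predicted as pituitary.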
*(Results adapted from the implemented model performance.)*

---

### **Top Features (Random Forest)**

| Feature | Importance          |
|---------|----------------------|
| PC2     | 0.03862667406751771 |
| PC1     | 0.03783486017439883 |
| PC4     | 0.01274167606101232 |
| PC3     | 0.009630583848444625|
| PC21    | 0.00935151815175204 |
| PC15    | 0.00698767003888738 |
| PC7     | 0.006411683330571576|
| PC24    | 0.005981861092320659|
| PC13    | 0.005939145405281331|
| PC6     | 0.005835037241322338|

*(Feature importance details adapted from [7]).*

---

### **Evaluation and Comparison**

| Metric              | SVM (%)   | Random Forest (%) |
|---------------------|-----------|--------------------|
| Accuracy            | **96.11** | **87.95**         |
| Precision (avg)     | **96**    | **88**            |
| Recall (avg)        | **96**    | **88**            |
| F1-Score (avg)      | **96**    | **88**            |

- **SVM** demonstrated superior performance, particularly in accuracy and recall, with minimal misclassifications across classes.
- **Random Forest**, while robust, showed more misclassifications, particularly in separating glioma and meningioma.
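
The "avg" rows can be read in two ways, macro or weighted. The sketch below recomputes both for SVM precision from the per-class values in the classification report. Since those inputs are already rounded to two decimals, the result should only be expected to agree with the report to about two decimals.

```python
import numpy as np

# Per-class precision and support copied from the SVM classification report
precision = np.array([0.95, 0.93, 0.98, 0.98])
support = np.array([300, 306, 405, 300])

macro_avg = precision.mean()                           # every class counts equally
weighted_avg = np.average(precision, weights=support)  # classes weighted by sample count

print(round(float(macro_avg), 2), round(float(weighted_avg), 2))
```

The two averages nearly coincide here because the class supports are fairly balanced; on a strongly imbalanced test set they could diverge noticeably.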

---

### **Visualization**

1. **Confusion Matrices**:
   - **SVM Confusion Matrix**:
     - Strong classification performance across all tumor types.
     - Very few misclassifications observed.
   - **Random Forest Confusion Matrix**:
     - Notable misclassifications in glioma and meningioma classes.

2. **Feature Importances (Random Forest)**:
   - Random Forest identified PC2, PC1, and PC4 as the most influential components, indicating their relevance in distinguishing tumor classes [7].
   - The bar chart above visualizes the top 10 feature importances.
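
To put these importances in context, it helps to compare them with a uniform baseline. scikit-learn normalizes `feature_importances_` to sum to 1, so the sketch below adds up the ten values from the table and compares them with the share each of the 1,600 components would receive if all were equally important.

```python
# Top-10 importances copied from the table above
top10 = [
    0.03862667406751771, 0.03783486017439883, 0.01274167606101232,
    0.009630583848444625, 0.00935151815175204, 0.00698767003888738,
    0.006411683330571576, 0.005981861092320659, 0.005939145405281331,
    0.005835037241322338,
]
n_features = 1600

top10_share = sum(top10)   # fraction of total importance held by the top 10
uniform = 1 / n_features   # importance per feature if all were equal

print(f"top-10 share: {top10_share:.3f}, uniform baseline: {uniform:.6f}")
# → top-10 share: 0.139, uniform baseline: 0.000625
```

Ten of 1,600 components carry about 14% of the total importance, and PC2 alone is roughly 60 times the uniform baseline, which supports the reading that the leading components matter most.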

---

### **Insights**
- Both models successfully classify tumor types using PCA-reduced features [2], [4].
- **SVM** outperformed Random Forest, achieving a test accuracy of **96.11%**, with consistent results across tumor classes.
- **Random Forest** achieved a test accuracy of **87.95%**, with strong performance in the notumor and pituitary classes but was less effective in distinguishing glioma and meningioma [5], [6].
- The feature importance analysis suggests that **PC2** and **PC1** contribute significantly to Random Forest classification decisions [7].
- The results suggest that **SVM** is better suited for this task due to its ability to create clear decision boundaries in the PCA-reduced feature space [4], [5].

# **E. Comparative Analysis of SVM and Random Forest**

In [61]:
from sklearn.metrics import confusion_matrix
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # Correct import for subplots

# Function to overlay confusion matrices side-by-side using Plotly
def plot_confusion_matrices_side_by_side(y_true, y_pred_svm, y_pred_rf, classes):
    # SVM Confusion Matrix
    cm_svm = confusion_matrix(y_true, y_pred_svm)
    svm_heatmap = go.Heatmap(
        z=cm_svm, 
        x=classes, 
        y=classes, 
        colorscale="Blues", 
        showscale=True, 
        text=cm_svm, 
        texttemplate="%{z}"
    )
    
    # Random Forest Confusion Matrix
    cm_rf = confusion_matrix(y_true, y_pred_rf)
    rf_heatmap = go.Heatmap(
        z=cm_rf, 
        x=classes, 
        y=classes, 
        colorscale="Blues", 
        showscale=True, 
        text=cm_rf, 
        texttemplate="%{z}"
    )
    
    # Subplots
    fig = make_subplots(
        rows=1, 
        cols=2, 
        subplot_titles=["SVM Confusion Matrix", "Random Forest Confusion Matrix"], 
        shared_yaxes=True
    )
    fig.add_trace(svm_heatmap, row=1, col=1)
    fig.add_trace(rf_heatmap, row=1, col=2)
    
    # Update layout
    fig.update_layout(
        title="Confusion Matrices: SVM vs Random Forest", 
        width=1000, 
        height=500
    )
    fig.update_xaxes(title_text="Predicted", row=1, col=1)
    fig.update_yaxes(title_text="Actual", row=1, col=1)
    fig.update_xaxes(title_text="Predicted", row=1, col=2)
    
    fig.show()

# Call the function to plot confusion matrices for test results
plot_confusion_matrices_side_by_side(y_test, y_test_pred_svm, y_test_pred_rf, CLASSES)

In [62]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Function to calculate and return metrics
def calculate_metrics(y_true, y_pred, model_name):
    precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    accuracy = accuracy_score(y_true, y_pred)
    return {
        "Model": model_name, 
        "Accuracy": accuracy, 
        "Precision": precision, 
        "Recall": recall, 
        "F1-Score": f1_score
    }

# Calculate metrics for both models on the test set
svm_metrics = calculate_metrics(y_test, y_test_pred_svm, "SVM")
rf_metrics = calculate_metrics(y_test, y_test_pred_rf, "Random Forest")

# Print metrics
print("SVM Metrics:", svm_metrics)
print("Random Forest Metrics:", rf_metrics)

SVM Metrics: {'Model': 'SVM', 'Accuracy': 0.9610983981693364, 'Precision': 0.9607235479172626, 'Recall': 0.9610983981693364, 'F1-Score': 0.9608425675068786}
Random Forest Metrics: {'Model': 'Random Forest', 'Accuracy': 0.8794813119755912, 'Precision': 0.8786254655072444, 'Recall': 0.8794813119755912, 'F1-Score': 0.8777682346491501}


In [63]:
# Combine metrics into a single DataFrame for comparison
metrics_df = pd.DataFrame([svm_metrics, rf_metrics])
print(metrics_df)

# Bar Chart to Compare Metrics
fig = px.bar(
    metrics_df.melt(id_vars="Model"), 
    x="variable", 
    y="value", 
    color="Model", 
    barmode="group", 
    title="Model Performance Comparison"
)
fig.update_layout(
    xaxis_title="Metric", 
    yaxis_title="Score", 
    legend_title="Model", 
    height=500, 
    width=700
)
fig.show()

           Model  Accuracy  Precision    Recall  F1-Score
0            SVM  0.961098   0.960724  0.961098  0.960843
1  Random Forest  0.879481   0.878625  0.879481  0.877768


# **Comparative Analysis of SVM and Random Forest**

## Overview
This section compares the **Support Vector Machine (SVM)** and **Random Forest** models based on:
1. Computational complexity.
2. Performance metrics (accuracy, precision, recall, F1-score).
3. Confusion matrices.
4. Metric comparison chart.
5. Random Forest feature importances.
6. Final recommendation for production use.

---

### **1. Computational Complexity**

#### **SVM**
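- **Time Complexity**: Roughly O(n² · m) to O(n³ · m) for kernel SVMs, where n is the number of samples and m is the number of features, so training time grows quickly with larger datasets [4].
- **Memory Usage**: Up to O(n²) if the full n × n kernel matrix is held in memory; libsvm-based implementations such as scikit-learn's `SVC` cache only part of it.

#### **Random Forest**
- **Time Complexity**: Roughly O(t · k · n log n) for training, where t is the number of trees, k is the number of features considered at each split, and n is the number of samples. Predicting one sample costs O(t · d), where d is the tree depth [6].
- **Memory Usage**: Grows with the number and size of the trees rather than quadratically in n. Because each tree is trained independently, training also parallelizes well.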
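
To make the quadratic memory growth of a kernel matrix concrete, here is a back-of-the-envelope sketch. The sample counts are illustrative assumptions, not the size of this dataset, and the helper name `kernel_matrix_bytes` is hypothetical.

```python
def kernel_matrix_bytes(n_samples, bytes_per_value=8):
    """Size in bytes of a dense n x n kernel matrix (float64 by default)."""
    return n_samples * n_samples * bytes_per_value

# Illustrative sample counts (assumptions, not taken from this notebook)
for n in (1_000, 5_000, 20_000):
    size_mb = kernel_matrix_bytes(n) / 1e6
    print(f"n = {n:>6}: {size_mb:,.0f} MB")
```

Doubling n quadruples the estimate, which is why a dense kernel matrix becomes impractical for large training sets.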

---

### **2. Performance Metrics**

| Metric                   | SVM (%)   | Random Forest (%) |
|--------------------------|-----------|-------------------|
| Accuracy                 | **96.11** | 87.95             |
| Precision (weighted avg) | **96.07** | 87.86             |
| Recall (weighted avg)    | **96.11** | 87.95             |
| F1-Score (weighted avg)  | **96.08** | 87.78             |

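#### **Insights**
- SVM outperformed Random Forest on every metric by roughly 8 percentage points on the test set, indicating better generalization on this dataset [4]. Weighted recall equals accuracy by definition, which is why those two rows match.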
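
The gap between the two models can be computed directly from the test-set metrics printed earlier. This is a small sketch; the names `svm_scores`, `rf_scores`, and `gap_in_points` are hypothetical and chosen to avoid clashing with the notebook's own variables.

```python
# Test-set metrics copied from the printed output above
svm_scores = {"Accuracy": 0.9610983981693364, "Precision": 0.9607235479172626,
              "Recall": 0.9610983981693364, "F1-Score": 0.9608425675068786}
rf_scores = {"Accuracy": 0.8794813119755912, "Precision": 0.8786254655072444,
             "Recall": 0.8794813119755912, "F1-Score": 0.8777682346491501}

def gap_in_points(a, b):
    """Difference a - b for each metric, in percentage points."""
    return {name: round(100 * (a[name] - b[name]), 2) for name in a}

gaps = gap_in_points(svm_scores, rf_scores)
for name, gap in gaps.items():
    print(f"{name}: SVM leads by {gap} points")
```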

---

### **3. Confusion Matrices**

#### **Visualization**
The confusion matrices for both models are shown above, with the SVM matrix on the left and the Random Forest matrix on the right. The SVM model makes fewer misclassifications than Random Forest, particularly for the "glioma" and "meningioma" classes [6], [7].
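
Per-class error patterns can also be read off a confusion matrix numerically. The sketch below uses a small hypothetical matrix (`toy_cm`), not the notebook's actual results, and assumes rows are actual classes and columns are predicted classes, as returned by scikit-learn's `confusion_matrix`. The helper names are hypothetical.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each actual class (row) that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

def top_confusions(cm, classes, k=3):
    """Largest off-diagonal cells as (actual, predicted, count) tuples."""
    off = np.array(cm)
    np.fill_diagonal(off, 0)  # Ignore correct predictions
    order = np.argsort(off, axis=None)[::-1][:k]
    pairs = []
    for idx in order:
        i, j = np.unravel_index(idx, off.shape)
        if off[i, j] > 0:
            pairs.append((classes[i], classes[j], int(off[i, j])))
    return pairs

classes = ["glioma", "meningioma", "notumor", "pituitary"]
# Hypothetical counts for illustration only
toy_cm = [[8, 2, 0, 0],
          [1, 7, 1, 1],
          [0, 0, 10, 0],
          [0, 1, 0, 9]]

print(dict(zip(classes, per_class_recall(toy_cm))))
print(top_confusions(toy_cm, classes, k=1))
```

Running the same helpers on `confusion_matrix(y_test, y_test_pred_svm)` and its Random Forest counterpart would show which class pairs each model confuses most often.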

---

### **4. Metric Comparison**

#### **Visualization**
The bar chart above compares the accuracy, precision, recall, and F1-score for the two models. SVM consistently outperformed Random Forest across all metrics.

---

### **5. Top Features (Random Forest)**

| Feature | Importance |
|---------|------------|
| PC2     | 0.0386     |
| PC1     | 0.0378     |
| PC4     | 0.0127     |
| PC3     | 0.0096     |
| PC21    | 0.0094     |
| PC15    | 0.0070     |
| PC7     | 0.0064     |
| PC24    | 0.0060     |
| PC13    | 0.0059     |
| PC6     | 0.0058     |

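- **Insight**: Random Forest ranked PC2, PC1, and PC4 as the most influential components, indicating their relevance in distinguishing tumor classes [7]. Each contributes less than 4% of the importance score, so no single component dominates, and because these are principal components rather than raw pixels they do not correspond directly to image regions.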
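
As a worked check on how much of the model these components explain, the sketch below accumulates the ten listed importances. It assumes the values come from scikit-learn's `feature_importances_`, which are normalized to sum to 1 across all features; the notebook cell that produced them is not shown in this section.

```python
# Importances copied from the table above
top_importances = {
    "PC2": 0.03862667406751771,
    "PC1": 0.03783486017439883,
    "PC4": 0.01274167606101232,
    "PC3": 0.009630583848444625,
    "PC21": 0.00935151815175204,
    "PC15": 0.00698767003888738,
    "PC7": 0.006411683330571576,
    "PC24": 0.005981861092320659,
    "PC13": 0.005939145405281331,
    "PC6": 0.005835037241322338,
}

cumulative = 0.0
for name, value in top_importances.items():
    cumulative += value
    print(f"{name:>5}: {value:.4f} (cumulative {cumulative:.4f})")
```

Under that assumption, the ten listed components together account for roughly 14% of the total importance, with the remainder spread across the other components.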

---

### **6. Recommendation**

#### **Summary**
- **SVM** is the superior model for this dataset, achieving an accuracy of **96.11%** and excelling across all other metrics.
- **Random Forest** provides feature-importance estimates but achieved only **87.95%** accuracy, with more misclassifications in key classes.

#### **Final Recommendation**
- For **production use**, **SVM** is recommended due to its superior performance in this classification task. While it has higher computational complexity during training, the gains in accuracy and precision justify its use.
- **Random Forest** may be considered if feature importance insights are critical to the use case.

---

## Conclusion
Both models classified the MRI scans into the four classes (glioma, meningioma, pituitary tumor, and no tumor). However, the SVM model performed substantially better on the test set (96.11% vs. 87.95% accuracy), making it the preferred choice for this specific classification problem.

# **F. Discussion on Ethical Issues**

## Overview
Machine learning applications in medical imaging, particularly in brain tumor classification, have significant ethical implications. While these technologies have the potential to improve healthcare outcomes, they must be developed and deployed responsibly. Below, we discuss the key ethical issues associated with this dataset and task [5].

---

### **1. Bias and Fairness**
- **Potential Issues**:
  - The dataset used for training might not be representative of the diverse population it aims to serve. For example, certain demographic groups (e.g., based on age, gender, ethnicity) may be underrepresented.
  - Models trained on biased data could lead to disproportionate misclassification rates for underrepresented groups, potentially causing harm.
- **Mitigation Strategies**:
  - Ensure diversity in the training dataset by including images from varied demographics.
  - Continuously monitor model performance across different subgroups to identify and address biases.
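
Subgroup monitoring can be sketched as follows. The `groups` values are hypothetical: the notebook loads only images and class labels, so any demographic metadata would have to come from another source. The helper name `accuracy_by_group` is also an assumption.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each subgroup label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for actual, predicted, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(actual == predicted)
    return {group: correct[group] / total[group] for group in total}

# Toy labels and hypothetical subgroup tags, for illustration only
toy_true = [0, 1, 2, 3, 0, 1]
toy_pred = [0, 1, 2, 0, 0, 2]
toy_groups = ["A", "A", "A", "B", "B", "B"]

group_accuracy = accuracy_by_group(toy_true, toy_pred, toy_groups)
print(group_accuracy)
```

A large accuracy gap between groups would be a signal to collect more data for the weaker group or to re-examine the training set.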

---

### **2. Misdiagnosis and Accountability**
- **Potential Issues**:
  - Misclassifications by the model (e.g., labeling a tumor as "notumor") could lead to severe consequences, including delayed treatment or inappropriate medical interventions.
  - It is unclear who would be held accountable for such errors: the developers, the healthcare providers, or the model itself.
- **Mitigation Strategies**:
  - Use these models only as decision-support tools, ensuring that healthcare professionals verify the outputs.
  - Clearly communicate the limitations and potential error rates of the model to end users.
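
One error rate worth reporting explicitly is the share of true tumor cases predicted as "notumor". The sketch below assumes the notebook's label encoding, where index 2 in `CLASSES` is "notumor"; the function name `missed_tumor_rate` and the toy labels are hypothetical.

```python
def missed_tumor_rate(y_true, y_pred, notumor_label=2):
    """Fraction of actual tumor cases that were predicted as 'notumor'."""
    tumor_cases = [(t, p) for t, p in zip(y_true, y_pred) if t != notumor_label]
    if not tumor_cases:
        return 0.0
    missed = sum(1 for _, p in tumor_cases if p == notumor_label)
    return missed / len(tumor_cases)

# Toy labels for illustration only (0=glioma, 1=meningioma, 2=notumor, 3=pituitary)
toy_true = [0, 1, 3, 2, 0]
toy_pred = [0, 2, 3, 2, 2]
print(missed_tumor_rate(toy_true, toy_pred))
```

Applied to `y_test` and each model's test predictions, this gives a single number that clinicians can weigh alongside overall accuracy.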

---

### **3. Privacy and Data Security**
- **Potential Issues**:
  - Medical imaging data contains sensitive information. Improper handling or storage of the dataset could lead to data breaches and compromise patient privacy.
  - Sharing datasets across institutions may inadvertently expose personal identifiers if not properly anonymized.
- **Mitigation Strategies**:
  - Follow strict protocols for anonymizing datasets before use.
  - Store and process data in secure environments that comply with regulations such as HIPAA (Health Insurance Portability and Accountability Act).

---

### **4. Transparency and Interpretability**
- **Potential Issues**:
  - Kernel SVMs and large Random Forest ensembles are hard to inspect directly, and features such as PCA components do not map to recognizable image regions, making it difficult for stakeholders to understand how decisions are made.
  - This "black-box" nature may erode trust among clinicians and patients.
- **Mitigation Strategies**:
  - Pair machine learning models with interpretability tools (e.g., SHAP or LIME) to explain individual predictions.
  - Regularly audit model outputs to identify potential anomalies or biases.
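
SHAP and LIME are separate packages; as a dependency-light alternative, the sketch below implements permutation importance, a different model-agnostic technique that measures how much accuracy drops when one feature column is shuffled. The function name `permutation_accuracy_drop`, the toy data, and the toy model are all assumptions for illustration. scikit-learn ships its own version as `sklearn.inspection.permutation_importance`.

```python
import numpy as np

def permutation_accuracy_drop(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline = np.mean(predict(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # Break the feature-label link
            scores.append(np.mean(predict(X_perm) == y))
        drops[j] = baseline - np.mean(scores)
    return drops

# Toy setup: the label depends only on the first feature
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Stand-in for a fitted model's predict method."""
    return (X[:, 0] > 0).astype(int)

drops = permutation_accuracy_drop(toy_predict, X_toy, y_toy)
print(drops)
```

With a fitted classifier, `predict` would be the model's own `predict` method and `X` the held-out feature matrix; features whose shuffling barely changes accuracy contribute little to the model's decisions.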

---

### **5. Accessibility and Equity**
- **Potential Issues**:
  - High computational requirements for training and deploying models may limit their accessibility to under-resourced healthcare settings.
  - Uneven access to these tools could widen the gap in healthcare quality between different regions or institutions.
- **Mitigation Strategies**:
  - Optimize models for deployment on low-resource hardware without compromising accuracy.
  - Partner with organizations to subsidize or expand access to these technologies in underserved areas.

---

## Final Thoughts
While machine learning holds promise for improving brain tumor diagnosis and treatment, careful attention must be paid to ethical considerations at every stage of the process. By addressing bias, accountability, privacy, transparency, and accessibility, we can work toward the responsible deployment of these models to enhance, rather than hinder, healthcare equity and outcomes.

# **G. Bibliography**

[1] OpenCV, "OpenCV Documentation." Available: https://docs.opencv.org/4.x/index.html.

[2] Scikit-learn, "Scikit-learn: Machine Learning in Python." Available: https://scikit-learn.org/stable/index.html.

[3] J. Lever, M. Krzywinski, and N. Altman, "Principal component analysis," *Nature Methods*, vol. 14, no. 7, pp. 641–642, Jul. 2017. DOI: https://doi.org/10.1038/nmeth.4346.

[4] GeeksforGeeks, "SVM Hyperparameter Tuning using GridSearchCV | ML." Available: https://www.geeksforgeeks.org/svm-hyperparameter-tuning-using-gridsearchcv-ml/.

[5] B. D. Mittelstadt, P. Allo, M. Taddeo, S. Wachter, and L. Floridi, "The ethics of algorithms: Mapping the debate," *Big Data & Society*, vol. 3, no. 2, Nov. 2016. DOI: https://doi.org/10.1177/2053951716679679.

[6] L. Breiman, "Random Forests," *Statistics Department, University of California*, Berkeley, CA, Jan. 2001. Available: https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf.

[7] W. Koehrsen, "Random Forest in Python: A Practical End-to-End Machine Learning Example," *Towards Data Science*, Dec. 27, 2017. Available: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0.
