## 🩺 Dataset Selection: DermMNIST (MedMNIST v2)

### What is DermMNIST?

**DermMNIST** (published in MedMNIST v2 under the name **DermaMNIST**) is a curated dataset of **dermatoscopic images** labeled by skin lesion type. It is a **multi-class** task with seven diagnostic categories that cover both **benign and malignant lesions**, which makes it a good fit for medical image classification research and machine learning experiments.

---

### 🎯 Problem Statement

**Research Question:**  
Can a k-Nearest Neighbors (k-NN) classifier accurately classify dermatology lesion images in DermMNIST using similarity in feature space?

**Project Objectives:**
1. Evaluate k-NN's performance on medical image classification
2. Compare accuracy across different values of *k* (number of neighbors)
3. Assess performance across different feature representations (raw pixels vs. CNN embeddings)
4. Determine if distance-based similarity aligns with diagnostic similarity in dermatology
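
Objective 2 can be sketched as a simple sweep over candidate values of *k*, scored on a held-out validation split. The example below is a minimal illustration, not the project's actual experiment: the data is synthetic (generated with `make_classification`), and the helper name `sweep_k` and the list of candidate values are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for image embeddings (assumption: NOT DermMNIST data).
X, y = make_classification(
    n_samples=600, n_features=32, n_informative=10,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def sweep_k(X_tr, y_tr, X_va, y_va, k_values):
    """Return {k: validation accuracy} for each candidate k (hypothetical helper)."""
    scores = {}
    for k in k_values:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_tr, y_tr)
        scores[k] = clf.score(X_va, y_va)  # mean accuracy on the validation split
    return scores


k_values = [1, 3, 5, 7, 11, 15]
scores = sweep_k(X_tr, y_tr, X_va, y_va, k_values)
best_k = max(scores, key=scores.get)

for k, acc in scores.items():
    print(f"k={k:>2}  val accuracy={acc:.3f}")
print("best k on validation:", best_k)
```

Choosing *k* on the validation split and reporting the final number on the test split keeps the test set untouched during tuning.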

---

### ✅ Why DermMNIST is ideal for k-NN:

1. **🔍 Distance-based similarity matches medical intuition**  
   k-NN's core assumption, that similar things lie close together, matches a working intuition in dermatology: **similar lesions tend to look similar**.

2. **🧠 Image embeddings enable meaningful comparisons**  
   By extracting **feature embeddings** with a pretrained CNN (such as ResNet), we move beyond noisy raw pixels to representations in which distance is more likely to reflect visual similarity.

3. **🏥 Real-world medical use case**  
   This loosely mirrors clinical practice, where skin lesions are **compared against known cases**, and it illustrates how k-NN could be used in medical diagnosis support systems.

4. **📊 Multi-class classification challenge**  
   With multiple lesion types, this dataset tests k-NN's ability to handle complex, real-world medical classification beyond simple binary decisions.

5. **🔓 Publicly accessible and well-documented**  
   As part of the MedMNIST v2 benchmark, the dataset is standardized and available for reproducible research.

---

### 🛠️ Typical k-NN Setup for DermMNIST:

**Step 1: Feature Extraction**  
Extract features with a pretrained CNN (e.g., ResNet) to convert each image into a feature vector.

**Step 2: k-NN Classification**  
Run k-NN on these embeddings (not raw pixels) to classify lesions by similarity.

**Step 3: Evaluation**  
Measure how often "visually similar" lesions share the same diagnosis.
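
One way to make Step 3 concrete is to compute, for each query point, the fraction of its *k* nearest neighbors that carry the same label. The sketch below is an illustration under stated assumptions: the function name `neighbor_label_agreement` is hypothetical, and the two-cluster data is synthetic rather than taken from DermMNIST.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbor_label_agreement(X_ref, y_ref, X_query, y_query, k=5):
    """Mean fraction of each query's k nearest reference points sharing its label."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_ref)
    _, idx = nn.kneighbors(X_query)                 # idx has shape (n_query, k)
    neighbor_labels = np.asarray(y_ref).ravel()[idx]
    matches = neighbor_labels == np.asarray(y_query).ravel()[:, None]
    return matches.mean()


rng = np.random.default_rng(0)


def two_blobs(n_per_class):
    """Two well-separated Gaussian clusters in 8 dimensions (synthetic data)."""
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(n_per_class, 8)),
        rng.normal(10.0, 1.0, size=(n_per_class, 8)),
    ])
    y = np.repeat([0, 1], n_per_class)
    return X, y


X_ref, y_ref = two_blobs(100)
X_query, y_query = two_blobs(25)

agreement = neighbor_label_agreement(X_ref, y_ref, X_query, y_query, k=5)
# Baseline: permuting the reference labels breaks the link between distance and label.
shuffled = neighbor_label_agreement(X_ref, rng.permutation(y_ref), X_query, y_query, k=5)

print(f"agreement with true labels:     {agreement:.2f}")
print(f"agreement with shuffled labels: {shuffled:.2f}")
```

Comparing the score against a shuffled-label baseline gives a rough sense of how much of the agreement comes from the feature space rather than from class frequencies alone.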

---

### 🎓 Why This Matters:

This project demonstrates that **k-NN isn't limited to toy datasets**. When paired with suitable feature engineering (CNN embeddings), it can be applied to real-world medical imaging tasks, which fits the workshop's goal of applying ML pipeline patterns to meaningful, complex problems in healthcare.


## Why Use CNN Embeddings for This Dataset?

The DermMNIST dataset consists of **medical images** of skin lesions. Unlike tabular datasets, image data is high-dimensional and contains complex visual patterns such as texture, color variation, and shape. Applying k-Nearest Neighbors (k-NN) directly to raw image pixels is generally ineffective because raw pixels do not represent visual similarity in a meaningful way.

In raw pixel space, small changes in lighting, scale, or position can cause large differences in pixel values, even when two images appear visually similar. Since k-NN relies on distance calculations, this makes similarity measurements unreliable and leads to poor classification performance.
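
The effect of position on pixel-space distance can be seen with a toy example. The sketch below uses synthetic 28×28 grayscale discs rather than real lesion images, and the helper `disc` is an assumption made for illustration. It compares the Euclidean distance between a disc and a shifted copy of itself with the distance between the disc and a smaller disc at the same position.

```python
import numpy as np

# Pixel coordinate grids for a 28x28 image.
yy, xx = np.mgrid[0:28, 0:28]


def disc(cx, cy, r=6):
    """Binary image of a filled disc centered at (cx, cy) (hypothetical helper)."""
    return ((xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2).astype(np.float32)


original = disc(14, 14)
shifted = disc(18, 14)          # same disc, moved 4 pixels to the right
smaller = disc(14, 14, r=5)     # a different (smaller) disc, same position

d_shift = np.linalg.norm(original - shifted)
d_size = np.linalg.norm(original - smaller)

print(f"distance to shifted copy: {d_shift:.2f}")
print(f"distance to smaller disc: {d_size:.2f}")
```

In this toy setup the shifted copy of the same shape ends up farther away in pixel space than a genuinely different shape, which is the kind of mismatch that makes raw-pixel k-NN unreliable.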

To address this, we use **CNN embeddings**. A pretrained Convolutional Neural Network (CNN), such as ResNet-18, is used to transform each image into a fixed-length numerical vector (embedding) that captures high-level visual features. These embeddings encode important characteristics of the images, such as lesion structure and texture, while being more robust to low-level noise and variations.

In this project, the CNN is used **only as a feature extractor**, not as a classifier. The extracted embeddings provide a more meaningful feature space in which visually similar images are closer together. The k-NN algorithm is then applied to these embeddings to perform classification based on visual similarity.

Using CNN embeddings allows us to:
- Effectively apply k-NN to image data
- Improve distance-based similarity comparisons
- Reduce the impact of irrelevant pixel-level variations
- Focus the analysis on the behavior of k-NN rather than raw image representation

---

### 🔧 Feature Extraction Process

The embeddings used in this project were **pre-computed** using the `extract_embeddings_dermamnist.py` script, which implements the following pipeline:

1. **Load DermMNIST dataset**: downloads and loads the train, validation, and test splits (28×28 RGB images)
2. **Initialize pretrained ResNet-18**: uses ImageNet-pretrained weights for robust feature extraction
3. **Remove classification layer**: replaces the final fully connected layer with an identity layer, converting the model into a pure feature extractor that outputs **512-dimensional embeddings**
4. **Process all splits**: extracts embeddings for all images in the train, validation, and test sets
5. **Save to disk**: stores all embeddings and labels in a compressed `.npz` file for efficient reuse

This preprocessing step separates feature extraction from k-NN experimentation, allowing us to:
- ⚡ **Run k-NN experiments quickly** without re-computing embeddings each time
- 🔬 **Focus on k-NN hyperparameter tuning** (different values of *k*)
- 📊 **Ensure reproducibility** by using the same feature representations across all experiments
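
The save-and-reload step can be illustrated with a small round trip through a compressed `.npz` file. The sketch below uses random placeholder arrays and a temporary, hypothetical file name. The key names follow the ones used when the embeddings are loaded in this notebook, and only two of the splits are shown for brevity.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays standing in for real embeddings and labels (assumption).
splits = {
    "X_train": rng.normal(size=(8, 512)).astype(np.float32),
    "y_train": rng.integers(0, 7, size=8),
    "X_val": rng.normal(size=(2, 512)).astype(np.float32),
    "y_val": rng.integers(0, 7, size=2),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_demo.npz")   # hypothetical file name
    np.savez_compressed(path, **splits)               # one named array per key

    # np.load returns a lazy archive; copy the arrays out before it is closed.
    with np.load(path) as data:
        loaded = {name: data[name] for name in data.files}

for name, arr in loaded.items():
    print(name, arr.shape, arr.dtype)
```

Because each array is stored under its own name, later experiments can load only the splits they need.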


In [2]:
```python
import numpy as np

# Load the pre-computed ResNet-18 embeddings and labels for each split
emb_path = "data/dermamnist_28_resnet18_embeddings.npz"
data = np.load(emb_path)

X_train, y_train = data["X_train"], data["y_train"]
X_val, y_val     = data["X_val"], data["y_val"]
X_test, y_test   = data["X_test"], data["y_test"]

print("datasets Train shape: ", X_train.shape)
print("datasets Val shape: ", X_val.shape)
print("datasets Test shape: ", X_test.shape)
```

```
datasets Train shape:  (7007, 512)
datasets Val shape:  (1003, 512)
datasets Test shape:  (2005, 512)
```


## Scale the Embeddings

Although CNN embeddings represent high-level image features, their individual dimensions can still have different scales. Without scaling, certain dimensions of the embedding vector may disproportionately influence the distance calculation, even if they are not more informative.

To address this, we apply **feature scaling** using standardization (zero mean and unit variance). Scaling ensures that all embedding dimensions contribute more evenly to the distance computation, making similarity comparisons more meaningful.

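In [3]:
```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then reuse its statistics
# for the validation and test splits to avoid leaking information.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s   = scaler.transform(X_val)
X_test_s  = scaler.transform(X_test)
```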
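
To see why scaling matters for distance calculations, consider a toy case with two features on very different scales. The sketch below uses synthetic data, and the helper `distance_share` is an assumption made for illustration. It relies on the fact that, for two rows drawn at random, the expected squared difference along a dimension is twice that dimension's variance, so each dimension's share of the variance is also its share of the expected squared Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features: the first has a much larger scale than the second.
X = np.column_stack([
    rng.normal(0.0, 100.0, size=500),
    rng.normal(0.0, 1.0, size=500),
])


def distance_share(X):
    """Share of the expected squared Euclidean distance owed to each dimension."""
    var = X.var(axis=0)
    return var / var.sum()


share_raw = distance_share(X)
share_scaled = distance_share(StandardScaler().fit_transform(X))

print("share before scaling:", np.round(share_raw, 4))
print("share after scaling: ", np.round(share_scaled, 4))
```

Before scaling, almost all of the distance comes from the large-scale feature. After standardization, both features contribute equally.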
