# Data Preprocessing and Spectral Enhancement Pipeline

## Multispectral Breast Cancer Classification - Phase 1

This notebook implements the data preprocessing and spectral enhancement pipeline for the multispectral breast cancer classification project, which targets a classification accuracy of **98-99.5%**.

### Pipeline Components:
1. **Image Standardization**: Resize to 224x224, normalize intensities
2. **Spectral Enhancement**: RGB → HSV → Jet conversions (HSV is a color space; Jet is a colormap applied to a single intensity channel)
3. **Data Augmentation**: Rotation, flip, zoom, shear, contrast enhancement
4. **Quality Assessment**: Image quality metrics and validation
5. **Dataset Organization**: Structured data loading for ML training
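
The first two components can be sketched in a few lines. This is a minimal, NumPy-only illustration rather than the notebook's actual implementation: the helper names (`resize_nearest`, `normalize`, `rgb_to_hsv`, `jet_colormap`, `enhance`) are hypothetical, the Jet mapping is a common piecewise-linear approximation, and applying Jet to the HSV value channel is an assumption, since the list above does not say which channel is used. In practice `cv2.resize`, `cv2.cvtColor`, and `cv2.applyColorMap` would likely do the same work.

```python
import numpy as np

IMG_SIZE = 224  # target size named in the component list


def resize_nearest(img: np.ndarray, size: int = IMG_SIZE) -> np.ndarray:
    """Resize an HxWxC image to size x size by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]


def normalize(img: np.ndarray) -> np.ndarray:
    """Scale uint8 intensities to float32 values in [0, 1]."""
    return img.astype(np.float32) / 255.0


def rgb_to_hsv(rgb: np.ndarray) -> np.ndarray:
    """Vectorised RGB -> HSV for float images in [0, 1]; H, S, V are in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    delta = v - rgb.min(axis=-1)
    s = np.where(v > 0, delta / np.where(v > 0, v, 1), 0)
    safe = np.where(delta > 0, delta, 1)  # avoids division by zero on gray pixels
    h = np.where(
        v == r,
        ((g - b) / safe) % 6,
        np.where(v == g, (b - r) / safe + 2, (r - g) / safe + 4),
    )
    h = np.where(delta > 0, h / 6.0, 0)
    return np.stack([h, s, v], axis=-1)


def jet_colormap(gray: np.ndarray) -> np.ndarray:
    """Piecewise-linear approximation of the Jet colormap for values in [0, 1]."""
    x = np.clip(gray, 0.0, 1.0)
    r = np.clip(1.5 - np.abs(4 * x - 3), 0, 1)
    g = np.clip(1.5 - np.abs(4 * x - 2), 0, 1)
    b = np.clip(1.5 - np.abs(4 * x - 1), 0, 1)
    return np.stack([r, g, b], axis=-1)


def enhance(img_uint8: np.ndarray):
    """Standardize an image, then derive its HSV and Jet representations."""
    rgb = normalize(resize_nearest(img_uint8))
    hsv = rgb_to_hsv(rgb)
    jet = jet_colormap(hsv[..., 2])  # assumption: Jet is applied to the V channel
    return rgb, hsv, jet


# Usage on a synthetic image (stand-in for a real dataset sample)
demo = np.random.default_rng(0).integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
rgb, hsv, jet = enhance(demo)
print(rgb.shape, hsv.shape, jet.shape)
```

Nearest-neighbour resizing keeps the sketch dependency-free; an interpolating resize would usually be preferred for real images.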

### Expected Outcomes:
- Standardized dataset ready for CNN training
- Enhanced spectral representations for improved feature extraction
- Augmented dataset with 5-10x more training samples
- Optimized data loading pipeline for efficient training
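
To give a sense of how a 5-10x multiplier can arise, the sketch below generates the eight flip/rotation variants of an image plus one contrast-adjusted copy, for nine samples per input. It is only an illustration under assumptions: `dihedral_variants` and `adjust_contrast` are hypothetical helpers, the contrast factor of 1.3 is arbitrary, and zoom and shear are left out. The notebook imports `albumentations`, which would likely handle the full set of transforms.

```python
import numpy as np


def dihedral_variants(img: np.ndarray) -> list:
    """Return the 8 flip/rotation variants of an HxWxC image (identity included)."""
    variants = []
    for k in range(4):  # rotations by 0, 90, 180, and 270 degrees
        rotated = np.rot90(img, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))  # horizontal mirror of each rotation
    return variants


def adjust_contrast(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale contrast around the mean intensity; expects floats in [0, 1]."""
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)


# Usage on a synthetic 224x224 image (stand-in for a preprocessed sample)
rng = np.random.default_rng(42)
image = rng.random((224, 224, 3), dtype=np.float32)
augmented = dihedral_variants(image)
augmented.append(adjust_contrast(image, 1.3))  # 1.3 is an arbitrary example factor
print(len(augmented))
```

Flips and 90-degree rotations are lossless, so they add variety without interpolation artifacts. Whether every orientation is a plausible input depends on how the images were acquired.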

---

In [None]:
# Import Required Libraries for Data Preprocessing
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image, ImageEnhance, ImageFilter
import cv2
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Deep Learning and Image Processing
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision.utils import make_grid
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Scientific computing and visualization
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import json
from tqdm import tqdm
import random

# Set random seeds for reproducibility (Python, NumPy, and PyTorch)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Configuration
DATASET_PATH = r"c:\Users\mrhas\Downloads\archive\MultiModel Breast Cancer MSI Dataset"
PROCESSED_PATH = r"c:\Users\mrhas\Downloads\archive\processed_dataset"
IMG_SIZE = 224  # Input size used by many ImageNet-pretrained CNNs (e.g. ResNet, VGG)
BATCH_SIZE = 32
NUM_WORKERS = 4

# Create processed dataset directory
os.makedirs(PROCESSED_PATH, exist_ok=True)

print("🔧 Data Preprocessing and Spectral Enhancement Pipeline")
print("=" * 60)
print(f"Source Dataset: {DATASET_PATH}")
print(f"Processed Output: {PROCESSED_PATH}")
print(f"Target Image Size: {IMG_SIZE}x{IMG_SIZE}")
print("=" * 60)