# **Development Guide: Modular Pre-Processing for Segmentation and Classification (CBIS-DDSM)**

This outline defines the transformation steps to convert Kaggle JPEG images into high-fidelity tensors. The focus is on removing systematic noise and preserving critical morphological signals for distinguishing malignant from benign pathologies.

## 1. Manifest Structuring and Early Filtering

Unlike the original DICOM version, the Kaggle files already provide image paths mapped in simplified CSVs. The first step is to create a "Master Manifest" to avoid processing irrelevant data.

- **Action:** Consolidate the files `calc_case(with_jpg_img).csv` and `mass_case(with_jpg_img).csv`.
- **Exclusion Filter:** Immediately remove all records labeled `BENIGN_WITHOUT_CALLBACK`. This class represents findings that do not require immediate clinical action, and removing it early saves about 20% of total processing time.
- **Expected Output:** A single CSV containing the columns `patient_id`, `image_path`, `mask_path`, `view` (CC/MLO), and the final binary label (0 for Benign, 1 for Malignant).
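
The exclusion filter and binary labelling can be sketched on a toy table. The rows below are invented for illustration; only the `pathology` values and the 0/1 label convention come from this guide.

```python
import pandas as pd

# Hypothetical rows; the real manifest is built from the CBIS-DDSM CSV files.
toy = pd.DataFrame({
    "patient_id": ["P_00001", "P_00002", "P_00003"],
    "pathology": ["MALIGNANT", "BENIGN_WITHOUT_CALLBACK", "BENIGN"],
})

# Drop the excluded class first, so later steps never touch those images.
kept = toy[toy["pathology"] != "BENIGN_WITHOUT_CALLBACK"].copy()

# Binary label: 1 for malignant, 0 for benign.
kept["label"] = (kept["pathology"] == "MALIGNANT").astype(int)

print(kept[["patient_id", "label"]].to_string(index=False))
```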

## 2. Intensity Normalization (8-bit to Float32)

Since the data is already converted to 8 bits, the goal is to ensure the neural network receives a stable statistical distribution.

- **Action:** Convert the loaded image to `float32` and normalize pixels to the [0.0, 1.0] range.
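- **Justification:** Neural networks tend to converge faster when inputs are on a small, consistent scale, which reduces the risk of vanishing or exploding gradients in deep architectures like U-Net. Dividing by 255 scales the data but does not center it; centering requires an additional mean subtraction.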
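
A minimal sketch of the conversion on a toy array. The standardization at the end is an optional extra step and an assumption, not part of this guide; in practice the mean and standard deviation would be estimated on the training set rather than per image.

```python
import numpy as np

# Toy 8-bit "image"; the values are arbitrary.
img_u8 = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Step described above: cast to float32 and scale to [0.0, 1.0].
img = img_u8.astype(np.float32) / 255.0

# Optional (assumption): standardize to zero mean and unit variance.
mean, std = img.mean(), img.std()
img_std = (img - mean) / (std + 1e-8)

print(img.dtype, float(img.min()), float(img.max()))
```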

## 3. Anatomical Isolation (Breast Masking)

DDSM scans contain noisy borders and burned-in text labels ("L", "CC", "R") that a model can learn as spurious cues, which encourages overfitting.

- **Action:** Apply the **Otsu Method** to binarize the image and identify the largest connected object (the breast).
- **Refinement:** Use morphological opening to smooth the contour and apply a binary mask over the original image to turn all background and text artifacts to absolute black (0).
- **Expected Output:** Grayscale image where only breast tissue has positive pixel values.
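
To show what the Otsu step computes, here is a NumPy-only sketch that picks the threshold maximizing the between-class variance of the gray-level histogram. The helper `otsu_threshold` is hypothetical and is not OpenCV's implementation; selecting the largest connected component is left out.

```python
import numpy as np

def otsu_threshold(img_u8: np.ndarray) -> int:
    """Return the threshold t that maximizes between-class variance.

    Pixels <= t count as background, pixels > t as foreground.
    """
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # background weight for each t
    w1 = 1.0 - w0                        # foreground weight for each t
    cum_mean = np.cumsum(prob * levels)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance; undefined where one class is empty.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    sigma_b = np.nan_to_num(sigma_b, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(sigma_b))

# Toy image: dark background (10) with a bright block (200).
toy = np.full((8, 8), 10, dtype=np.uint8)
toy[2:6, 2:6] = 200

t = otsu_threshold(toy)
mask = toy > t
print(t, int(mask.sum()))
```

On real mammograms the histogram is not this cleanly bimodal, so the resulting mask still needs the morphological refinement described above.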

## 4. Pectoral Muscle Suppression (MLO Views)

The pectoral muscle appears as a high-density region that can distort local contrast normalization and mimic malignant masses.

- **Action:** For images identified as MLO view, apply the **Hough Transform** to detect the linear muscle edge in the upper corner.
- **Processing:** Zero out (value 0) all pixels inside the triangle or polygon that delimits the detected muscle.
- **Expected Output:** Clean anatomical image, free of interfering muscle densities.
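
The geometry behind the corner cut can be worked through without OpenCV: extend the detected segment to the top border and to the border on the chest-wall side, then clip both crossings to the image. The helper `muscle_triangle` and the example segments are hypothetical; clipping to `w - 1` and `h - 1` is a choice made for this sketch.

```python
import numpy as np

def muscle_triangle(x1, y1, x2, y2, h, w, breast_on_left=True):
    """Extend an edge segment to the image borders.

    Returns the three (x, y) vertices of the corner triangle to zero out.
    Assumes the segment is neither vertical nor horizontal.
    """
    m = (y2 - y1) / (x2 - x1)   # slope in image coordinates (y grows downward)
    b = y1 - m * x1             # value of y where the line crosses x = 0

    x_top = -b / m              # where the line crosses the top row (y = 0)
    if breast_on_left:
        corner_x, y_side = 0, b                       # crossing with the left border
    else:
        corner_x, y_side = w - 1, m * (w - 1) + b     # crossing with the right border

    # Clip both crossings so the polygon stays inside the image.
    x_top = int(np.clip(x_top, 0, w - 1))
    y_side = int(np.clip(y_side, 0, h - 1))
    return [(corner_x, 0), (corner_x, y_side), (x_top, 0)]

# Hypothetical segments in a 400 x 300 (height x width) image.
left_tri = muscle_triangle(50, 150, 150, 50, h=400, w=300)
right_tri = muscle_triangle(150, 50, 250, 150, h=400, w=300, breast_on_left=False)
print(left_tri, right_tri)
```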

## 5. Adaptive Contrast Enhancement (CLAHE)

Compression to 8 bits reduces the separation between gray tones. CLAHE boosts local contrast so that spicules and tumor margins are easier to distinguish, although it cannot restore information lost in compression.

- **Action:** Apply CLAHE with `clipLimit` 2.0 and `tileGridSize` 8x8.
- **Quality Control:** It is mandatory to generate automatic comparative visualizations at this stage. Excessive enhancement can generate noise that the segmentation model may confuse with calcifications.
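
The "contrast limited" part of CLAHE can be illustrated on a single tile histogram: bins above a limit are clipped and the excess is spread over all bins, which flattens the equalization mapping and limits noise amplification. This is a simplified sketch with a hypothetical helper `clip_and_redistribute`; it is not OpenCV's implementation, and `clip_limit` here is an absolute bin count, which is not necessarily how `cv2.createCLAHE` interprets its `clipLimit` argument.

```python
import numpy as np

def clip_and_redistribute(hist, clip_limit):
    """Cap each histogram bin at clip_limit and spread the excess evenly."""
    hist = np.asarray(hist, dtype=np.float64)
    excess = np.maximum(hist - clip_limit, 0.0).sum()
    clipped = np.minimum(hist, clip_limit)
    return clipped + excess / hist.size

# Hypothetical 4-bin tile histogram dominated by one gray level.
tile_hist = np.array([100, 10, 10, 0])
limited = clip_and_redistribute(tile_hist, clip_limit=40)
print(limited)
```

The total pixel count is preserved while the dominant bin shrinks. A lower limit gives a flatter histogram and a gentler enhancement, which is the knob to turn if the comparative visualizations show amplified noise.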

## 6. Precision Patch Extraction

Resizing full mammograms to small sizes (e.g., 224x224) destroys essential texture information. Using patches preserves native lesion resolution.

- **Action:** Use the provided binary masks to compute the lesion centroid.
- **Crop:** Extract a 598x598 patch centered on the lesion. If the lesion is near the edge, apply zero padding to keep a fixed dimension.
- **Expected Output:** Synchronized patch pairs (Image and Mask) ready for direct model input.
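
The centroid-and-pad logic can be checked on a toy array with a lesion in the corner. This NumPy-only sketch uses a hypothetical helper `crop_centered` and a 6x6 patch instead of 598x598 so the result is easy to inspect.

```python
import numpy as np

def crop_centered(image, mask, size):
    """Crop a size x size window around the mask centroid, zero-padding at borders."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # Empty mask: fall back to the image center.
        cy, cx = image.shape[0] // 2, image.shape[1] // 2
    else:
        cy, cx = int(ys.mean()), int(xs.mean())

    half = size // 2
    # Pad by half the patch size so the slice below never leaves the array.
    img_p = np.pad(image, half, mode="constant", constant_values=0)
    msk_p = np.pad(mask, half, mode="constant", constant_values=0)

    # Original (cy, cx) sits at (cy + half, cx + half) in the padded arrays,
    # so the window starts at (cy, cx).
    return (img_p[cy:cy + size, cx:cx + size],
            msk_p[cy:cy + size, cx:cx + size])

# Toy 10x10 image of ones with a 2x2 lesion in the top-left corner.
image = np.ones((10, 10), dtype=np.float32)
mask = np.zeros((10, 10), dtype=np.float32)
mask[0:2, 0:2] = 1.0

img_patch, mask_patch = crop_centered(image, mask, size=6)
print(img_patch.shape, float(mask_patch.sum()))
```

The image and mask share the same slice, so the pair stays aligned even when part of the window falls outside the original image.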

## 7. Data Balancing Strategy

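Any imbalance between malignant and benign cases in CBIS-DDSM can bias the model toward the majority class, so it should be measured and handled explicitly.

- **Action:** Implement class weights in the loss function calculation.
- **Justification:** Class weights penalize errors on the under-represented class more heavily. The `balanced` heuristic up-weights malignant cases only when they are the minority; in the run logged at the end of this notebook the filtered training split is nearly balanced, so both weights are close to 1.0. If false negatives must always cost more, set the malignant weight explicitly.
- **Validation:** Train with and without class weights and compare the results.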
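
The `balanced` heuristic is simple enough to compute by hand: each class weight is `n_samples / (n_classes * class_count)`. The label counts below are hypothetical and chosen to make the imbalance obvious.

```python
import numpy as np

# Hypothetical training labels: 300 benign (0) and 100 malignant (1).
y = np.array([0] * 300 + [1] * 100)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # n_samples / (n_classes * count)

weight_dict = {cls: round(float(w), 4) for cls, w in enumerate(weights)}
print(weight_dict)
```

With these counts the minority class receives three times the weight of the majority class; with nearly equal counts both weights approach 1.0 and the weighting has little effect.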

## 8. Data Augmentation

The idea is to simulate physical variability in mammography exams to increase model generalization.

- **Action:** Apply **Elastic Transforms** along with random rotations (up to 175°) and flips.
- **Justification:** Because breast tissue is non-rigid, elastic deformations simulate different compression levels, making the model resilient to acquisition variations.
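
Whatever transforms are used, the image and its mask must receive the same random parameters, otherwise the lesion annotation drifts away from the lesion. The sketch below shows that idea with flips and quarter-turn rotations only; it is a NumPy simplification with a hypothetical helper `paired_flip_rot90`, not a substitute for elastic transforms or arbitrary-angle rotation.

```python
import numpy as np

def paired_flip_rot90(image, mask, rng):
    """Apply the same random flips and 90-degree rotation to image and mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)
    k = int(rng.integers(0, 4))  # number of quarter turns
    return np.rot90(image, k).copy(), np.rot90(mask, k).copy()

rng = np.random.default_rng(0)
img = np.arange(16, dtype=np.float32).reshape(4, 4)
msk = (img > 10).astype(np.float32)   # toy "lesion": the brightest pixels

img_aug, msk_aug = paired_flip_rot90(img, msk, rng)

# The mask still marks exactly the same pixels after augmentation.
print(np.array_equal(msk_aug, (img_aug > 10).astype(np.float32)))
```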


In [7]:
import os
import numpy as np
import pandas as pd
import cv2
from sklearn.model_selection import GroupShuffleSplit
from sklearn.utils.class_weight import compute_class_weight
from typing import Tuple, Optional
import matplotlib.pyplot as plt
import albumentations as A
import random

class CBISDDSM_Preprocessor:
    
    def __init__(self, base_path=None, output_path=None):
        default_kaggle = '/kaggle/input/cbis-ddsm-breast-cancer-image-dataset'
        default_local = 'cbis-ddsm-breast-cancer-image-dataset'

        if base_path is None:
            candidate_paths = [
                os.path.join(os.getcwd(), default_local),
                default_local,
                default_kaggle,
            ]
            for path in candidate_paths:
                if os.path.isfile(os.path.join(path, 'csv', 'calc_case_description_train_set.csv')):
                    base_path = path
                    break
            else:
                base_path = default_local

        if output_path is None:
            output_path = '/kaggle/working' if os.path.isdir('/kaggle/working') else 'preprocessed_output'

        self.base_path = base_path
        self.output_path = output_path
        self.manifest = None
    
    # 1. Manifest Structuring and Early Filtering
    def load_and_filter_manifest(self):
        """
        Loads, consolidates, filters and saves the manifest, ensuring correct file paths.

        Args:
            None

        Returns:
            pd.DataFrame: Filtered manifest with correct paths and labels.
        """
        print(">> Loading CSV files and consolidating Manifest...")

        # 1. Load diagnosis CSVs and dicom_info (contains correct JPEG image paths)
        try:
            calc_df = pd.read_csv(f'{self.base_path}/csv/calc_case_description_train_set.csv')
            mass_df = pd.read_csv(f'{self.base_path}/csv/mass_case_description_train_set.csv')
            dicom_info = pd.read_csv(f'{self.base_path}/csv/dicom_info.csv')
        except FileNotFoundError:
            print(f'ERROR: Files not found in {self.base_path}')
            return None

        # 2. Merge both diagnosis datasets
        full_df = pd.concat([calc_df, mass_df], ignore_index=True)
        
        # 3. Clean dicom_info filepaths
        dicom_info['image_path_clean'] = dicom_info['image_path'].str.replace('CBIS-DDSM/', '', regex=False)

        # 4. Filter by image type in dicom_info
        full_mamo_info = dicom_info[dicom_info['SeriesDescription'] == 'full mammogram images']
        roi_mask_info = dicom_info[dicom_info['SeriesDescription'] == 'ROI mask images']

        # 5. Create mapping dictionaries: PatientID (key) -> Filepaths (value)
        img_map = dict(zip(full_mamo_info['PatientID'], full_mamo_info['image_path_clean']))
        mask_map = dict(zip(roi_mask_info['PatientID'], roi_mask_info['image_path_clean']))

        # 6. Extract keys (directory ID) from the original CSV paths
        full_df['img_key'] = full_df['image file path'].str.split('/').str[0]
        full_df['mask_key'] = full_df['ROI mask file path'].str.split('/').str[0]

        # 7. Map real paths using filtered keys
        full_df['image_path'] = full_df['img_key'].map(img_map)
        full_df['mask_path'] = full_df['mask_key'].map(mask_map)
        
        # 8. Pathology and labels filtering
        # Filter BENIGN_WITHOUT_CALLBACK data to focus on binary classification
        full_df = full_df[full_df['pathology'] != 'BENIGN_WITHOUT_CALLBACK'].copy()
        full_df['label'] = full_df['pathology'].apply(lambda x: 1 if x == 'MALIGNANT' else 0)
        full_df['participant_id'] = full_df['patient_id'].astype(str).str.extract(r'(P_\d+)', expand=False)
        full_df['participant_id'] = full_df['participant_id'].fillna(full_df['patient_id'].astype(str))

        # 9. Remove rows where mapping has failed
        initial_count = len(full_df)
        self.manifest = full_df.dropna(subset=['image_path', 'mask_path']).copy()
        print(f"Mapping concluded: {len(self.manifest)} valid samples of {initial_count} initial ones.")
            
        # 10. Final columns organization
        self.manifest = self.manifest[[
            'patient_id', 'participant_id', 'image_path', 'mask_path', 'image view', 'label', 'abnormality type'
        ]].rename(columns={'image view': 'view', 'abnormality type': 'abnormality_type'})

        # 11. Saves final file
        self.manifest = self.manifest.reset_index(drop=True)
        os.makedirs(self.output_path, exist_ok=True)
        save_loc = os.path.join(self.output_path, 'manifest.csv')
        self.manifest.to_csv(save_loc, index=False)

        # 12. Prints statistics
        self.print_dataset_statistics()
        
        return self.manifest

    def print_dataset_statistics(self):
        """
        Calculates and prints out dataset statistics.

        Args:
            None

        Returns:
            None
        """
        print(">> Calculating and printing statistics...")
        
        if self.manifest is None: return
        print("\n" + "="*50)
        print("Dataset Statistics:")
        print("="*50)
        print(f"Number of samples: {len(self.manifest)}")
        print(f"\nClass Distribution:\n{self.manifest['label'].value_counts()}")
        print(f"\nAbnormality types:\n{self.manifest['abnormality_type'].value_counts()}")
        print("="*50 + "\n")

    # 2. Intensity normalization & Full Pipeline Execution
    def read_and_normalize_image(self, image_rel_path: str, view: str) -> Optional[np.ndarray]:
        """
        Loads image and runs the full cleaning pipeline:
        Load -> Float32 -> Otsu -> Muscle Suppression.

        Args:
            image_rel_path (str): Relative path to the image.
            view (str): The view type ('CC' or 'MLO').
        
        Returns:
            Optional[np.ndarray]: Normalized image array or None if not found.
        """
        full_path = f"{self.base_path}/{image_rel_path}"
        img = cv2.imread(full_path, cv2.IMREAD_GRAYSCALE)

        if img is None: return None
        
        # 1. Normalize
        img_float = img.astype(np.float32) / 255.0
        
        # 2. Apply Otsu (Step 3)
        img_otsu = self.segment_breast_otsu(img_float)
        
        # 3. Apply Muscle Suppression (Step 4)
        img_final = self.suppress_pectoral_muscle(img_otsu, view)
        
        return img_final
    
    # 3. Anatomical Isolation (Breast Masking)
    def segment_breast_otsu(self, image: np.ndarray) -> np.ndarray:
        """
        Uses Otsu's Thresholding to separate the foreground and keeps only
        the largest connected component (the breast), removing labels and background noise.

        Args:
            image (np.ndarray): Input image (float32 or uint8).

        Returns:
            np.ndarray: Image with clean background (absolute black) and normalized.
        """
        # 1. Ensure image is 8-bit (required for cv2.threshold)
        if image.dtype != np.uint8:
            # If float [0,1], convert to [0,255]
            img_uint8 = (image * 255).astype(np.uint8)
        else:
            img_uint8 = image

        # 2. Apply Otsu's Thresholding
        # cv2.THRESH_OTSU automatically calculates the optimal threshold value
        thresh_val, binary_mask = cv2.threshold(img_uint8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # 3. Morphological Operations (Refinement)
        # Use 'Opening' (Erosion followed by Dilation) to remove small white noise dots
        kernel = np.ones((5, 5), np.uint8)
        binary_mask = cv2.morphologyEx(binary_mask, cv2.MORPH_OPEN, kernel)

        # 4. Find Contours
        # Returns all isolated white objects in the mask
        contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        if not contours:
            print(">> Warning: No contours detected by Otsu.")
            return image  # Return original if segmentation fails critically

        # 5. Select Largest Connected Component (The Breast)
        # The largest contour by area is almost invariably the breast. The rest are labels.
        largest_contour = max(contours, key=cv2.contourArea)

        # 6. Create the Final Clean Mask
        mask_clean = np.zeros_like(img_uint8)
        # Fill the largest contour with white (255)
        cv2.drawContours(mask_clean, [largest_contour], -1, 255, thickness=cv2.FILLED)

        # 7. Apply Mask to Original Image
        # Where mask is 0, the image becomes 0 (Absolute Black)
        img_masked = cv2.bitwise_and(img_uint8, img_uint8, mask=mask_clean)

        # Return normalized float32 [0.0, 1.0]
        return img_masked.astype(np.float32) / 255.0
    
    def _is_breast_on_left(self, image: np.ndarray) -> bool:
        """
        Helper to determine if the breast is on the left or right side of the image
        by comparing the sum of pixel intensities in the left vs right halves.
        """
        h, w = image.shape
        left_sum = np.sum(image[:, :w//2])
        right_sum = np.sum(image[:, w//2:])
        return left_sum > right_sum

    # 4. Pectoral Muscle Suppression (MLO Views)
    def suppress_pectoral_muscle(self, image: np.ndarray, view: str) -> np.ndarray:
        """
        For MLO views, uses Hough Transform to detect the muscle edge and mask it out.
        Extrapolates the detected muscle edge to the image boundaries
        to ensure a clean 'corner cut' rather than a floating geometric shape.

        Args:
            image (np.ndarray): Input image (float32 or uint8).
            view (str): The view type ('CC' or 'MLO').

        Returns:
            np.ndarray: Image with pectoral muscle masked (set to 0).
        """
        # 0. Only apply to MLO views
        if view != 'MLO':
            return image

        # 1. Prepare image for Edge Detection (uint8)
        if image.dtype != np.uint8:
            img_uint8 = (image * 255).astype(np.uint8)
        else:
            img_uint8 = image

        # 2. Determine Orientation (Left or Right side)
        is_left = self._is_breast_on_left(img_uint8)
        h, w = img_uint8.shape

        # 3. Canny Edge Detection
        # We focus on the top half of the image where the muscle is located
        edges = cv2.Canny(img_uint8, 30, 100)
        
        # Region of Interest (ROI): The muscle is always in the upper corner
        # We mask out the bottom half to avoid detecting ribs or skin folds
        roi_mask = np.zeros_like(edges)
        if is_left:
            roi_mask[0:h//2, 0:w//2] = 255 # Top-Left quadrant
        else:
            roi_mask[0:h//2, w//2:w] = 255 # Top-Right quadrant
            
        edges_roi = cv2.bitwise_and(edges, edges, mask=roi_mask)

        # 4. Hough Transform to find lines
        lines = cv2.HoughLinesP(
            edges_roi, 
            rho=1, 
            theta=np.pi/180, 
            threshold=25, 
            minLineLength=30, 
            maxLineGap=30
        )

        if lines is None:
            return image # No line found, return original

        # 5. Filter for the "Best" Line (The Muscle Edge)
        # The muscle edge is usually the longest line in that quadrant.
        best_line = None  # stays None if no line passes the angle filter
        max_len = 0

        for line in lines:
            x1, y1, x2, y2 = line[0]
            
            # Avoid division by zero
            if x2 == x1: continue 
            
            length = np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
            angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi)
            
            # Muscle is roughly diagonal (20-85 degrees)
            if 20 < angle < 85 and length > max_len:
                max_len = length
                best_line = (x1, y1, x2, y2)

        if best_line is None: return image

        # Extrapolate Line to Edges
        # y = mx + b
        x1, y1, x2, y2 = best_line
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1

        mask = np.zeros_like(img_uint8)
        
        if is_left:
            # Muscle is Top-Left. We need points where line hits Top (y=0) and Left (x=0)
            # 1. x-intercept (where y=0): x = -b/m
            x_int = int(-b / m) if m != 0 else 0
            # 2. y-intercept (where x=0): y = b
            y_int = int(b)
            
            # Define Triangle: (0,0), (0, y_int), (x_int, 0)
            # Clip to valid coordinates to be safe
            pts = np.array([[[0, 0], [0, min(y_int, h)], [min(x_int, w), 0]]], dtype=np.int32)
        
        else:
            # Muscle is Top-Right. We need points where line hits Top (y=0) and Right (x=W)
            # 1. x-intercept (where y=0): x = -b/m
            x_int = int(-b / m) if m != 0 else w
            # 2. y-intercept (where x=W): y = mW + b
            y_int = int(m * w + b)
            
            # Define Triangle: (W,0), (W, y_int), (x_int, 0)
            pts = np.array([[[w, 0], [w, min(y_int, h)], [min(x_int, w), 0]]], dtype=np.int32)

        # 6. Draw and Invert Mask
        cv2.fillPoly(mask, pts, 255)
        
        # Keep everything NOT in the mask
        img_no_muscle = cv2.bitwise_and(img_uint8, img_uint8, mask=cv2.bitwise_not(mask))
        
        return img_no_muscle.astype(np.float32) / 255.0

    def read_mask(self, mask_rel_path: str) -> Optional[np.ndarray]:
        """
        Reads and binarizes the mask image.

        Args:
            mask_rel_path (str): Relative path to the mask image.
        
        Returns:
            Optional[np.ndarray]: Binarized mask array or None if not found.
        """

        full_path = f"{self.base_path}/{mask_rel_path}"
        mask = cv2.imread(full_path, cv2.IMREAD_GRAYSCALE)

        if mask is None: return None
        # Binarization
        _, mask_binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
        return (mask_binary / 255.0).astype(np.float32)

    # 5. Contrast Enhancement (CLAHE)
    def apply_clahe(self, image: np.ndarray) -> np.ndarray:
        """
        Applies CLAHE (Contrast Limited Adaptive Histogram Equalization) to enhance image contrast.

        Args:
            image (np.ndarray): Input image array.

        Returns:
            np.ndarray: CLAHE enhanced image array.
        """
        img_uint8 = (image * 255).astype(np.uint8)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        return clahe.apply(img_uint8).astype(np.float32) / 255.0

    # 6. Precision Patch Extraction
    def extract_lesion_patch(self, image: np.ndarray, mask: np.ndarray, patch_size=598):
        """
        Calculates the center of the lesion based on the mask and extracts
        a fixed-size patch. Applies Zero-Padding if the lesion is near the edge.

        Args:
            image (np.ndarray): Full normalized image (float32).
            mask (np.ndarray): Full binary mask (float32).
            patch_size (int): Target size (default 598x598).

        Returns:
            Tuple[np.ndarray, np.ndarray]: (Image Patch, Mask Patch).
        """
        # 1. Find the Center of the Lesion (Centroid)
        # We convert mask to uint8 for cv2.moments
        mask_uint8 = (mask * 255).astype(np.uint8)
        moments = cv2.moments(mask_uint8)

        # Safety check: if mask is empty, fallback to image center
        if moments["m00"] == 0:
            cy, cx = image.shape[0] // 2, image.shape[1] // 2
        else:
            cx = int(moments["m10"] / moments["m00"])
            cy = int(moments["m01"] / moments["m00"])

        # 2. Calculate Padding
        # We pad the ORIGINAL image with half the patch size on all sides.
        # This simplifies the math: we can simply slice without worrying about
        # negative indices or going out of bounds.
        half_size = patch_size // 2
        
        # Pad: ((top, bottom), (left, right))
        # Constant values=0 means black padding
        pad_width = ((half_size, half_size), (half_size, half_size))
        
        img_padded = np.pad(image, pad_width, mode='constant', constant_values=0)
        mask_padded = np.pad(mask, pad_width, mode='constant', constant_values=0)

        # 3. Calculate Coordinates in the Padded Image
        # The center (cx, cy) in original moves to (cx+half, cy+half) in padded
        pad_cx = cx + half_size
        pad_cy = cy + half_size

        # Determine start and end points for the crop
        start_x = pad_cx - half_size
        end_x = start_x + patch_size
        start_y = pad_cy - half_size
        end_y = start_y + patch_size

        # 4. Perform the Crop
        img_patch = img_padded[start_y:end_y, start_x:end_x]
        mask_patch = mask_padded[start_y:end_y, start_x:end_x]

        # Final sanity check on shape
        if img_patch.shape != (patch_size, patch_size):
            img_patch = cv2.resize(img_patch, (patch_size, patch_size))
            mask_patch = cv2.resize(mask_patch, (patch_size, patch_size), interpolation=cv2.INTER_NEAREST)

        return img_patch, mask_patch
    
    # 7. Data Balancing Strategy
    def compute_class_weights(self, train_df: pd.DataFrame) -> dict:
        """
        Calculates class weights to penalize errors on the minority class
        more heavily during the Loss Function calculation.

        Args:
            train_df (pd.DataFrame): DataFrame containing only TRAINING data.

        Returns:
            dict: Dictionary mapping class index -> weight (e.g., {0: 1.0, 1: 2.5}).
        """
        # Extract labels (0 = Benign, 1 = Malignant)
        y_train = train_df['label'].values

        # Calculate weights using sklearn's 'balanced' heuristic
        # Formula: n_samples / (n_classes * np.bincount(y))
        weights = compute_class_weight(
            class_weight='balanced',
            classes=np.unique(y_train),
            y=y_train
        )

        # Create dictionary for easy usage in PyTorch/Keras
        weight_dict = dict(zip(np.unique(y_train), weights))
        
        print(f"\n>> Class Weights Calculated:")
        print(f"   Benign (0): {weight_dict[0]:.4f}")
        print(f"   Malignant (1): {weight_dict[1]:.4f}")
        
        return weight_dict
    
    # 8. Data Augmentation Pipeline
    def get_augmentation_pipeline(self):
        """
        Defines the sequence of geometric transformations to simulate
        physical variations in mammograms. 
        Adjusted for compatibility with newer Albumentations versions.
        """
        return A.Compose([
            # 1. Flips
            A.HorizontalFlip(p=0.5),
            A.VerticalFlip(p=0.5),

            # 2. Random Rotations
            # border_mode=0 (cv2.BORDER_CONSTANT) automatically pads with 0 (black)
            A.Rotate(limit=175, border_mode=0, p=0.5),

            # 3. Elastic Transforms
            # Removed 'alpha_affine', 'value', and 'mask_value' which caused errors.
            A.ElasticTransform(
                alpha=120, 
                sigma=120 * 0.05, 
                border_mode=0, 
                p=0.5
            )
        ])

    def apply_augmentation(self, image: np.ndarray, mask: np.ndarray):
        """
        Applies the augmentation pipeline to a pair of Image and Mask.
        
        Args:
            image (np.ndarray): Input image patch.
            mask (np.ndarray): Input mask patch.
            
        Returns:
            Tuple[np.ndarray, np.ndarray]: Augmented image and mask.
        """
        aug_pipeline = self.get_augmentation_pipeline()
        
        # Albumentations expects specific keys
        augmented = aug_pipeline(image=image, mask=mask)

        img_out = augmented['image'].astype(np.float32)
        mask_out = (augmented['mask'] > 0.5).astype(np.float32)
        return img_out, mask_out

    def create_group_split(self, test_size=0.15, val_size=0.15, group_col='participant_id'):
        """
        Creates group-aware train, validation, and test splits to avoid leakage.

        Args:
            test_size (float): Proportion of the dataset to include in the test split.
            val_size (float): Proportion of the dataset to include in the validation split.
            group_col (str): Column name used to group samples (e.g., participant_id).
        
        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: Train, validation, and test DataFrames.
        """
        if group_col not in self.manifest.columns:
            raise ValueError(f"Missing group column: {group_col}")

        gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
        train_val_idx, test_idx = next(gss.split(self.manifest, groups=self.manifest[group_col]))
        train_val_df = self.manifest.iloc[train_val_idx].reset_index(drop=True)
        test_df = self.manifest.iloc[test_idx].reset_index(drop=True)

        val_ratio = val_size / (1 - test_size)
        gss_val = GroupShuffleSplit(n_splits=1, test_size=val_ratio, random_state=42)
        train_idx, val_idx = next(gss_val.split(train_val_df, groups=train_val_df[group_col]))
        train_df = train_val_df.iloc[train_idx].reset_index(drop=True)
        val_df = train_val_df.iloc[val_idx].reset_index(drop=True)

        return train_df, val_df, test_df

    def visualize_sample(self, idx=0):
        """
        Visualizes the full processing pipeline for a given sample index.
        Plots 6 panels to compare phases.
        """
        row = self.manifest.iloc[idx]
        
        # Load Raw
        full_path = f"{self.base_path}/{row['image_path']}"
        img_raw = cv2.imread(full_path, cv2.IMREAD_GRAYSCALE)
        if img_raw is None: return
        img_raw_norm = img_raw.astype(np.float32) / 255.0
        
        # Load Mask
        mask = self.read_mask(row['mask_path'])
        if mask is None: return  # mask file missing or unreadable
        if mask.shape != img_raw_norm.shape:
            mask = cv2.resize(mask, (img_raw_norm.shape[1], img_raw_norm.shape[0]), interpolation=cv2.INTER_NEAREST)

        # Step 3: Otsu
        img_otsu = self.segment_breast_otsu(img_raw_norm)
        
        # Step 4: Muscle Suppression
        img_muscle = self.suppress_pectoral_muscle(img_otsu, row['view'])
        
        # Step 5: CLAHE
        img_clahe = self.apply_clahe(img_muscle)
        
        # Step 6: Patch
        img_patch, mask_patch = self.extract_lesion_patch(img_clahe, mask)
        
        # Step 8: Augmentation
        img_aug, mask_aug = self.apply_augmentation(img_patch, mask_patch)
        
        # PLOTTING
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        # Row 1: Cleaning phases
        axes[0,0].imshow(img_raw_norm, cmap='gray'); axes[0,0].set_title('1. Raw Image')
        axes[0,1].imshow(img_otsu, cmap='gray'); axes[0,1].set_title('2. Otsu (Clean BG)')
        axes[0,2].imshow(img_muscle, cmap='gray'); axes[0,2].set_title(f'3. Muscle Removed ({row["view"]})')
        
        # Row 2: Enhancement & Patching
        axes[1,0].imshow(img_clahe, cmap='gray'); axes[1,0].set_title('4. CLAHE Enhanced')
        
        # Overlay for Patch
        overlay_patch = np.stack([img_patch]*3, axis=-1)
        overlay_patch[:, :, 0] = np.where(mask_patch > 0.5, 1.0, overlay_patch[:, :, 0])
        axes[1,1].imshow(overlay_patch); axes[1,1].set_title('5. Lesion Patch (598x598)')
        
        # Overlay for Augmentation
        overlay_aug = np.stack([img_aug]*3, axis=-1)
        overlay_aug[:, :, 0] = np.where(mask_aug > 0.5, 1.0, overlay_aug[:, :, 0])
        axes[1,2].imshow(overlay_aug); axes[1,2].set_title('6. Augmented (+Elastic)')

        plt.suptitle(f"Processing Pipeline ID: {row['patient_id']} | Label: {row['label']}", fontsize=16)
        plt.tight_layout()
        plt.show()

    def save_verification_batch(self, n_samples=20, save_dir='pipeline_verification'):
        """
        Saves a batch of processed samples (plots) to disk for manual inspection.
        """
        full_save_path = os.path.join(self.output_path, save_dir)
        os.makedirs(full_save_path, exist_ok=True)
        print(f"\n>> Generating {n_samples} verification images in: {full_save_path}")

        # Pick samples: Try to get 50% MLO (to check muscle) and 50% CC
        mlo_indices = self.manifest[self.manifest['view'] == 'MLO'].index.tolist()
        cc_indices = self.manifest[self.manifest['view'] == 'CC'].index.tolist()
        
        # Shuffle and select
        random.shuffle(mlo_indices)
        random.shuffle(cc_indices)
        
        # Combine indices (e.g., 10 MLO + 10 CC)
        half_n = n_samples // 2
        selected_indices = mlo_indices[:half_n] + cc_indices[:half_n]
        
        for i, idx in enumerate(selected_indices):
            self._save_single_sample(idx, full_save_path, i)
            
        print(f">> Done. Check the '{save_dir}' folder.")

    def _save_single_sample(self, idx, save_folder, file_counter):
        row = self.manifest.iloc[idx]
        
        # Pipeline Steps
        full_path = f"{self.base_path}/{row['image_path']}"
        img_raw = cv2.imread(full_path, cv2.IMREAD_GRAYSCALE)
        if img_raw is None: return
        img_raw_norm = img_raw.astype(np.float32) / 255.0
        
        mask = self.read_mask(row['mask_path'])
        if mask is None: return  # mask file missing or unreadable
        if mask.shape != img_raw_norm.shape:
            mask = cv2.resize(mask, (img_raw_norm.shape[1], img_raw_norm.shape[0]), interpolation=cv2.INTER_NEAREST)

        img_otsu = self.segment_breast_otsu(img_raw_norm)
        img_muscle = self.suppress_pectoral_muscle(img_otsu, row['view'])
        img_clahe = self.apply_clahe(img_muscle)
        img_patch, mask_patch = self.extract_lesion_patch(img_clahe, mask)
        img_aug, mask_aug = self.apply_augmentation(img_patch, mask_patch)
        
        # Plotting
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        axes[0,0].imshow(img_raw_norm, cmap='gray'); axes[0,0].set_title(f'1. Raw ({row["view"]})')
        axes[0,1].imshow(img_otsu, cmap='gray'); axes[0,1].set_title('2. Otsu')
        axes[0,2].imshow(img_muscle, cmap='gray'); axes[0,2].set_title('3. Muscle Removed')
        axes[1,0].imshow(img_clahe, cmap='gray'); axes[1,0].set_title('4. CLAHE')
        
        overlay_patch = np.stack([img_patch]*3, axis=-1)
        overlay_patch[:, :, 0] = np.where(mask_patch > 0.5, 1.0, overlay_patch[:, :, 0])
        axes[1,1].imshow(overlay_patch); axes[1,1].set_title('5. Patch')
        
        overlay_aug = np.stack([img_aug]*3, axis=-1)
        overlay_aug[:, :, 0] = np.where(mask_aug > 0.5, 1.0, overlay_aug[:, :, 0])
        axes[1,2].imshow(overlay_aug); axes[1,2].set_title('6. Augmented')

        plt.suptitle(f"ID: {row['patient_id']} | Label: {row['label']} | View: {row['view']}")
        plt.tight_layout()
        
        # SAVE instead of Show
        filename = f"sample_{file_counter:02d}_{row['patient_id']}_{row['view']}.png"
        plt.savefig(os.path.join(save_folder, filename))
        plt.close(fig) # Close the figure to free memory

def run_pipeline():
    processor = CBISDDSM_Preprocessor()
    df = processor.load_and_filter_manifest()
    
    if df is not None:
        train, val, test = processor.create_group_split()
        print(f"Split concluded: Train={len(train)}, Val={len(val)}, Test={len(test)}")
        
        processor.compute_class_weights(train)
        
        print("\n>> Validating pipeline steps on sample...")

        # EXECUTE BATCH SAVING
        # Saves 20 images (10 MLO, 10 CC) to <output_path>/pipeline_verification/
        processor.save_verification_batch(n_samples=20)

        # OPTIONAL MLO TEST (disabled)
        # Try to find an MLO sample to demonstrate muscle removal:
        # mlo_samples = df[df['view'] == 'MLO'].index
        # idx = mlo_samples[0] if len(mlo_samples) > 0 else 0
        # processor.visualize_sample(idx=idx)

    print("\n>> Process finished...")

if __name__ == "__main__":
    run_pipeline()

>> Loading CSV files and consolidating Manifest...
Mapping concluded: 2286 valid samples of 2286 initial ones.
>> Calculating and printing statistics...

Dataset Statistics:
Number of samples: 2286

Class Distribution:
label
1    1181
0    1105
Name: count, dtype: int64

Abnormality types:
abnormality_type
mass             1214
calcification    1072
Name: count, dtype: int64

Split concluded: Train=1598, Val=344, Test=344

>> Class Weights Calculated:
   Benign (0): 1.0025
   Malignant (1): 0.9975

>> Validating pipeline steps on sample...

>> Generating 20 verification images in: preprocessed_output/pipeline_verification
>> Done. Check the 'pipeline_verification' folder.

>> Process finished...
