# Medical Image Preprocessing

- Works with datasets stored **anywhere** (absolute or relative paths).
- Supports **single transform** or **multi-step pipeline** (ordered list).
- Saves outputs in a run folder **inside each input root**.

**Output directory structure:**

- `<INPUT_ROOT>/<RUN_NAME>_<pipelineSlug>/`
  - `processed/`: processed images (mirrors the original folder structure)
  - `_previews/`: per-step and final preview grids
  - `manifest.csv`: source → destination log with status and errors
  - `pipeline.json`: snapshot of the steps and parameters used


- Safety: If a run folder already exists, the run **raises** an error unless `overwrite=True`.
- Tip: Use a date in **MMDDYYYY** format for `RUN_NAME` (e.g., `11012025`), or any label you like.
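
The naming and safety rules above can be sketched roughly as follows. This is only an illustration, not the code in `utils`: `resolve_run_dir` is a hypothetical helper, and the use of `FileExistsError` is an assumption about how the real implementation signals an existing run folder.

```python
from pathlib import Path
import tempfile


def resolve_run_dir(input_root, run_name, pipeline_slug, overwrite=False):
    """Hypothetical helper: return <input_root>/<run_name>_<pipeline_slug>.

    Assumption: an existing run folder raises FileExistsError unless overwrite=True.
    """
    run_dir = Path(input_root) / f"{run_name}_{pipeline_slug}"
    if run_dir.exists() and not overwrite:
        raise FileExistsError(f"{run_dir} already exists; pass overwrite=True to reuse it")
    # Create the sub-folders described in the layout above.
    (run_dir / "processed").mkdir(parents=True, exist_ok=True)
    (run_dir / "_previews").mkdir(parents=True, exist_ok=True)
    return run_dir


with tempfile.TemporaryDirectory() as root:
    first = resolve_run_dir(root, "11012025", "clahe")
    print(first.name)  # 11012025_clahe
    try:
        resolve_run_dir(root, "11012025", "clahe")  # same run again, no overwrite
    except FileExistsError:
        second_attempt_raised = True
    else:
        second_attempt_raised = False
    reused = resolve_run_dir(root, "11012025", "clahe", overwrite=True)
```

Raising by default means a rerun with an unchanged `RUN_NAME` cannot silently mix new outputs with old ones.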


-----

### 0) Imports

In [None]:
from pathlib import Path
import random

# Local utilities (make sure these files are next to the notebook)
from utils import (
    list_images, load_image_rgb, make_preview_grid,
    apply_pipeline_for_root, split_paths_by_root, run_pipeline
)
from pipeline_utils import build_pipeline_slug
from transforms import REGISTRY, SPECS  # exposes available transforms by name

import matplotlib.pyplot as plt

### 1) Configure Input Paths & Run Name

- **RUN_NAME**: enter *MMDDYYYY* (e.g., `11012025`).  
- **TRAIN_DIRS / TEST_DIRS**: list one or more folders. You can also leave one empty.


In [None]:
# 👉 Enter a run name as MMDDYYYY (or any label you like)
RUN_NAME = input("Enter RUN NAME (MMDDYYYY): ").strip()  # e.g., 11012025

# 👉 Point to your data directories (absolute or relative).
# You can include multiple directories per split.
TRAIN_DIRS = [
    r"G:\diabetic-retinopathy-1519\resized_traintest15_train19",
    # "/another/train/path",
]
TEST_DIRS = [
    r"G:\diabetic-retinopathy-1519\resized_test19",
    # "/another/test/path",
]

print("RUN_NAME:", RUN_NAME)
print("TRAIN_DIRS:", TRAIN_DIRS)
print("TEST_DIRS:", TEST_DIRS)
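
If you prefer not to type the date by hand, a default can be derived from today's date and checked against the MMDDYYYY convention. The sketch below is optional, since any label is accepted as `RUN_NAME`; `default_run_name` and `is_mmddyyyy` are hypothetical helpers, not part of `utils`.

```python
from datetime import date, datetime


def default_run_name(today=None):
    """Hypothetical helper: format a date (today by default) as MMDDYYYY."""
    return (today or date.today()).strftime("%m%d%Y")


def is_mmddyyyy(label):
    """Return True if `label` is exactly 8 characters and parses as a real date."""
    try:
        datetime.strptime(label, "%m%d%Y")
    except ValueError:
        return False
    # strptime also accepts unpadded forms such as "1112025", so check the length too.
    return len(label) == 8


print(default_run_name(date(2025, 11, 1)))  # 11012025
print(is_mmddyyyy("11012025"))              # True
print(is_mmddyyyy("13012025"))              # False (there is no month 13)
```

One possible use is `RUN_NAME = input(...).strip() or default_run_name()`, so that pressing Enter falls back to today's date.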


### 2) Discover Images

In [None]:
train_paths = list_images(TRAIN_DIRS)
test_paths  = list_images(TEST_DIRS)

print(f"Found {len(train_paths)} train images")
print(f"Found {len(test_paths)} test images")
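
For orientation, image discovery of this kind is often a recursive, extension-filtered directory walk. The sketch below shows one way to do it; `find_images` is a hypothetical stand-in for `list_images`, and the extension set is an assumption, since the extensions the real helper accepts are not listed in this notebook.

```python
from pathlib import Path
import tempfile

# Assumption: the real helper may accept a different set of extensions.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def find_images(roots):
    """Hypothetical stand-in for list_images: recursive, case-insensitive, sorted."""
    found = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            print(f"Warning: {root} is not a directory, skipping")
            continue
        found.extend(
            p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_EXTS
        )
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "grade0").mkdir()
    (Path(tmp) / "grade0" / "a.PNG").touch()   # upper-case extension is still matched
    (Path(tmp) / "notes.txt").touch()          # non-image file is ignored
    names = [p.name for p in find_images([tmp, Path(tmp) / "missing"])]
    print(names)  # ['a.PNG']
```

If a split reports zero images, a mistyped root or an unexpected extension are the usual suspects.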


### 3) Available Transforms

Below is the list of transform names you can use in a pipeline (the keys of `REGISTRY`).


In [None]:
for name in sorted(REGISTRY.keys()):
    print(f"• {name}")
    spec = SPECS.get(name, {})
    if not spec:
        print("    (no parameters)")
    else:
        for k, meta in spec.items():
            default = meta.get("default")
            typ = meta.get("type", "")
            desc = meta.get("desc", "")
            extra = []
            if "min" in meta:    extra.append(f"min={meta['min']}")
            if "max" in meta:    extra.append(f"max={meta['max']}")
            if "choices" in meta: extra.append(f"choices={meta['choices']}")
            extra_str = f" [{', '.join(extra)}]" if extra else ""
            print(f"    - {k}: default={default} ({typ}){extra_str}\n      {desc}")
    print()
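
The same metadata can also be used to sanity-check parameters before a run. The sketch below assumes each spec entry has the shape read by the cell above (optional `min`, `max`, and `choices` keys); `check_params` is a hypothetical helper, and `example_spec` is made up for illustration rather than copied from `transforms.SPECS`.

```python
def check_params(spec, params):
    """Return a list of human-readable problems found in `params` for one transform."""
    problems = []
    for key, value in params.items():
        meta = spec.get(key)
        if meta is None:
            problems.append(f"unknown parameter: {key}")
            continue
        if "min" in meta and value < meta["min"]:
            problems.append(f"{key}={value!r} is below min={meta['min']}")
        if "max" in meta and value > meta["max"]:
            problems.append(f"{key}={value!r} is above max={meta['max']}")
        if "choices" in meta and value not in meta["choices"]:
            problems.append(f"{key}={value!r} is not one of {meta['choices']}")
    return problems


# Made-up spec for illustration; the real entries live in transforms.SPECS.
example_spec = {
    "clip_limit": {"default": 2.0, "type": "float", "min": 0.1, "max": 40.0},
    "space": {"default": "LAB", "type": "str", "choices": ["LAB", "HSV"]},
}

print(check_params(example_spec, {"clip_limit": 2.0, "space": "LAB"}))  # []
for problem in check_params(example_spec, {"clip_limit": 0.0, "gamma": 1.2}):
    print(problem)
```

In the notebook, a call along the lines of `check_params(SPECS.get(METHOD_NAME, {}), PARAMS)` would apply the same check to a configured transform, assuming `SPECS` follows this shape.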


-----

## Choose one of the two modes below

### A) **Single-Method Mode** (preview + apply one transform)
- Good for quick experiments.
- Produces a run folder named `<RUN_NAME>_<method>` for each input root.

### B) **Pipeline Mode** (preview + apply multiple transforms in order)
- Chain steps like `crop_dark_borders → clahe → resize`.
- Produces a run folder named `<RUN_NAME>_<name1+name2+...>` for each input root.

> You can run A or B (or both), in any order.
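
Going by the folder names described for the two modes, the slug appears to be the step names joined with `+`. The sketch below illustrates that reading; `slug_from_steps` is a hypothetical stand-in, and the real `build_pipeline_slug` may sanitize or shorten names differently.

```python
def slug_from_steps(pipeline):
    """Hypothetical stand-in for build_pipeline_slug: step names joined with '+'."""
    return "+".join(name for name, _params in pipeline)


single = [("clahe", {"clip_limit": 2.0})]
multi = [
    ("crop_dark_borders", {"tol": 7}),
    ("clahe", {"clip_limit": 2.0}),
    ("resize", {"width": 512, "height": 512}),
]

print(f"11012025_{slug_from_steps(single)}")  # 11012025_clahe
print(f"11012025_{slug_from_steps(multi)}")   # 11012025_crop_dark_borders+clahe+resize
```

Under this scheme, Mode A is simply the one-step case of Mode B, which is why both modes can share the same run-folder convention.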


-----

## MODE A

### A1) Configure a single transform

Set `METHOD_NAME` to one of the names from `REGISTRY`, and adjust `PARAMS` as needed.


In [None]:
# Example: single transform
METHOD_NAME = "clahe"  # e.g., "crop_dark_borders", "circle_crop", "resize", "unsharp_mask", "clahe"
PARAMS = {
    "clip_limit": 2.0,
    "tile_grid_size": (8, 8),
    "space": "LAB",
}

assert METHOD_NAME in REGISTRY, f"{METHOD_NAME} not found. Available: {list(REGISTRY.keys())}"
print("Single-method:", METHOD_NAME, PARAMS)


### A2) Preview a few images in-notebook (no files written)

- Shows **before/after** pairs for a small sample from `TRAIN_DIRS` (or `TEST_DIRS` if train is empty).


In [None]:
# Choose a small sample
sample_from = train_paths if len(train_paths) > 0 else test_paths
sample_n = min(8, len(sample_from))
sample_paths = random.sample(sample_from, sample_n) if sample_n else []

if not sample_paths:
    print("No images available to preview.")
else:
    fn = REGISTRY[METHOD_NAME]
    pairs = []
    for p in sample_paths:
        img = load_image_rgb(p)
        out = fn(img, **PARAMS)
        pairs.extend([img, out])
    grid = make_preview_grid(pairs, cols=4, pad=6)
    plt.figure(figsize=(12, 8))
    plt.title(f"Preview: BEFORE/AFTER pairs - {METHOD_NAME}")
    plt.imshow(grid)
    plt.axis("off")
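
Because the preview cell uses `random.sample` without a seed, it shows different images on every run. If you want to compare parameter settings on the same images, a seeded sample is one option. `pick_preview_sample` below is a hypothetical helper, not part of `utils`.

```python
import random


def pick_preview_sample(paths, n=8, seed=0):
    """Return up to `n` paths, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(list(paths), min(n, len(paths)))


demo_paths = [f"img_{i:03d}.png" for i in range(20)]
first = pick_preview_sample(demo_paths, n=8, seed=42)
second = pick_preview_sample(demo_paths, n=8, seed=42)
print(first == second)  # True
```

Changing `seed` gives a different but still repeatable sample.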


### A3) Apply & Save single transform

- Creates `<INPUT_ROOT>/<RUN_NAME>_<METHOD_NAME>/`
- Writes processed images into `processed/` and previews into `_previews/`.
- Also writes `manifest.csv` and `pipeline.json`.

> **Safety:** If the run folder exists, this cell raises unless `overwrite=True`.


In [None]:
# Wrap single transform into a one-step pipeline
PIPELINE_SINGLE = [(METHOD_NAME, PARAMS)]
PIPELINE_SLUG = build_pipeline_slug(PIPELINE_SINGLE)
print("Pipeline slug:", PIPELINE_SLUG)

# Group paths by their root and run once per root
train_buckets = split_paths_by_root(train_paths, TRAIN_DIRS)
test_buckets  = split_paths_by_root(test_paths, TEST_DIRS)

OVERWRITE = False   # set True to reuse an existing run folder

# Same call for both splits; only the log label differs
for split, buckets in (("TRAIN", train_buckets), ("TEST", test_buckets)):
    for root, paths in buckets.items():
        summary = apply_pipeline_for_root(
            input_root=root,
            src_paths=paths,
            pipeline=PIPELINE_SINGLE,
            run_name=RUN_NAME,
            overwrite=OVERWRITE,
            save_previews=True,
            preview_sample=12,
        )
        print(f"[{split}] {root} ->", summary)
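
Grouping paths by root is what lets each input directory get its own run folder. The sketch below shows one way such grouping could work; `group_by_root` is a hypothetical stand-in for `split_paths_by_root`, whose actual behavior (for example with nested roots or unmatched paths) is not documented in this notebook. It uses `PurePosixPath` so the example behaves the same on any OS; real code would more likely use `Path(...).resolve()`.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_root(paths, roots):
    """Hypothetical stand-in for split_paths_by_root: {root: [paths under that root]}."""
    buckets = defaultdict(list)
    # Check deeper roots first so a nested root wins over its parent.
    ordered = sorted(
        (PurePosixPath(r) for r in roots),
        key=lambda r: len(r.parts),
        reverse=True,
    )
    for p in map(PurePosixPath, paths):
        for root in ordered:
            if root in p.parents:
                buckets[str(root)].append(str(p))
                break  # paths that match no root are silently dropped in this sketch
    return dict(buckets)


grouped = group_by_root(
    ["/data/train/a.png", "/data/train/extra/b.png", "/other/c.png"],
    ["/data/train", "/data/train/extra"],
)
for root, members in grouped.items():
    print(root, "->", members)
```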


-----

## MODE B

### B1) Configure a multi-step pipeline

Order matters (e.g., `crop_dark_borders` **before** `clahe`).  
Adjust params as needed.


In [None]:
PIPELINE = [
    ("crop_dark_borders", {"tol": 7}),
    ("clahe", {"clip_limit": 2.0, "tile_grid_size": (8, 8), "space": "LAB"}),
    # ("resize", {"width": 512, "height": 512, "keep_aspect": False}),
    # ("unsharp_mask", {"sigma": 10.0, "amount_a": 4.0, "amount_b": -4.0, "bias": 128.0}),
]
# Validate steps
for name, _ in PIPELINE:
    assert name in REGISTRY, f"{name} not found. Available: {list(REGISTRY.keys())}"

PIPELINE_SLUG = build_pipeline_slug(PIPELINE)
PIPELINE_SLUG
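
To see why step order matters, it helps to picture the pipeline as a simple loop that feeds each step's output into the next. The sketch below uses toy transforms on a flat list of pixel values. `run_steps`, `crop_zeros`, and `stretch` are all hypothetical and only mimic the idea of cropping dark borders before enhancing contrast; they are not the functions in `utils` or `transforms`.

```python
def run_steps(image, pipeline, registry):
    """Hypothetical stand-in for run_pipeline: apply each step to the previous output."""
    out = image
    for name, params in pipeline:
        out = registry[name](out, **params)
    return out


def crop_zeros(pixels):
    """Toy 'crop': drop pure-black values."""
    return [v for v in pixels if v != 0]


def stretch(pixels):
    """Toy 'contrast stretch': rescale values linearly to 0..255."""
    lo, hi = min(pixels), max(pixels)
    return [round(255 * (v - lo) / (hi - lo)) for v in pixels]


toy_registry = {"crop_zeros": crop_zeros, "stretch": stretch}
pixels = [0, 0, 50, 100, 250]

crop_then_stretch = run_steps(pixels, [("crop_zeros", {}), ("stretch", {})], toy_registry)
stretch_then_crop = run_steps(pixels, [("stretch", {}), ("crop_zeros", {})], toy_registry)
print(crop_then_stretch)  # [0, 64, 255]
print(stretch_then_crop)  # [51, 102, 255]
```

When the black values are removed first, the stretch uses the full output range for the remaining pixels; in the other order, the black border takes up the bottom of the range and the content ends up with less contrast.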


### B2) Quick look: run pipeline on a few images (no files written)

Shows before/after pairs (original image and final pipeline output) for a small sample from the training set, or from the test set if no training images were found.


In [None]:
sample_from = train_paths if len(train_paths) > 0 else test_paths
sample_n = min(8, len(sample_from))
sample_paths = random.sample(sample_from, sample_n) if sample_n else []

if not sample_paths:
    print("No images available to preview.")
else:
    pairs = []
    for p in sample_paths:
        img = load_image_rgb(p)
        out = run_pipeline(img, PIPELINE)
        # Add before and after in sequence
        pairs.extend([img, out])

    grid = make_preview_grid(pairs, cols=4, pad=6)
    plt.figure(figsize=(12, 8))
    plt.title(f"Pipeline BEFORE/AFTER pairs - {PIPELINE_SLUG}")
    plt.imshow(grid)
    plt.axis("off")


### B3) Apply & Save the pipeline

- Creates `<INPUT_ROOT>/<RUN_NAME>_<PIPELINE_SLUG>/`
- Saves processed images, previews, `manifest.csv`, and `pipeline.json` once per input root.


In [None]:
train_buckets = split_paths_by_root(train_paths, TRAIN_DIRS)
test_buckets  = split_paths_by_root(test_paths, TEST_DIRS)

OVERWRITE = False  # set True to reuse an existing run folder

# Same call for both splits; only the log label differs
for split, buckets in (("TRAIN", train_buckets), ("TEST", test_buckets)):
    for root, paths in buckets.items():
        summary = apply_pipeline_for_root(
            input_root=root,
            src_paths=paths,
            pipeline=PIPELINE,
            run_name=RUN_NAME,
            overwrite=OVERWRITE,
            save_previews=True,
            preview_sample=12,
        )
        print(f"[{split}] {root} ->", summary)
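
After a run, `manifest.csv` is the quickest place to check for failures. The sketch below counts rows per status and lists the failed sources. The column names (`src`, `dst`, `status`, `error`) and the `ok` status value are assumptions for illustration, so adjust them to match the header of your actual manifest.

```python
import csv
import io
from collections import Counter


def summarize_manifest(handle):
    """Count rows per status and collect (source, error) for rows that did not succeed.

    Assumption: columns are named src, dst, status, error and success is 'ok'.
    """
    rows = list(csv.DictReader(handle))
    counts = Counter(row["status"] for row in rows)
    failures = [(row["src"], row["error"]) for row in rows if row["status"] != "ok"]
    return counts, failures


# In-memory sample standing in for <run_dir>/manifest.csv
sample = io.StringIO(
    "src,dst,status,error\n"
    "a.png,processed/a.png,ok,\n"
    "b.png,,error,cannot read image\n"
)
counts, failures = summarize_manifest(sample)
print(dict(counts))
print(failures)
```

With a real run, pass an open file instead of the sample, for example `open(run_dir / "manifest.csv", newline="")`, where `run_dir` is the run folder created above.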


------

### Tips & Troubleshooting

- **Folder exists error**: Change `RUN_NAME` or set `OVERWRITE=True`.
- **No images found**: Check the paths in `TRAIN_DIRS` / `TEST_DIRS` and the file extensions of your images.
- **Transform not found**: Confirm the name is one of the keys printed in section 3 (`sorted(REGISTRY.keys())`).
- **Preview without saving**: Use the preview cells (A2 and B2); they don't create any run folders.
- **Performance**: If I/O is the bottleneck, lower `preview_sample` or process fewer roots at once.


------