# Cell Instance Segmentation with UNI-style Backbone (BCCD)

This notebook provides a reproducible pipeline for the assignment:
- data split + preprocessing
- model training
- instance segmentation post-processing
- evaluation (IoU, Dice, mAP)
- qualitative success/failure analysis

## 1) Setup

In [None]:
from pathlib import Path
import json
import os

ROOT = Path('.').resolve()
CKPT = ROOT / 'github_uni/checkpoints_single/best_model.pt'
EVAL_DIR = ROOT / 'github_uni/checkpoints_single/instance_eval_test'
print('ROOT:', ROOT)
print('Checkpoint exists:', CKPT.exists())

## 2) Data Preparation (Train/Val/Test + EDT)
Use the provided `test/` split as the hold-out set. Split only `train/` into train/val for model selection, so test data never influences training or checkpoint choice.
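
The split step can be sketched as follows. This is a minimal illustration, assuming `create_split.py` shuffles the training image names with a fixed seed and holds out `val_ratio` of them. The function name `split_train_val`, the JSON keys, and the file names are hypothetical and not taken from the script.

```python
import json
import random

def split_train_val(names, val_ratio=0.2, seed=8888):
    """Deterministically split a list of image names into train/val."""
    names = sorted(names)              # fix the order so the seed fully determines the split
    rng = random.Random(seed)          # local RNG, leaves the global random state untouched
    rng.shuffle(names)
    n_val = int(round(len(names) * val_ratio))
    return {"train": sorted(names[n_val:]), "val": sorted(names[:n_val])}

# Hypothetical file names, for illustration only.
example = [f"img_{i:03d}.png" for i in range(10)]
split = split_train_val(example)
print(json.dumps(split, indent=2))
```

Sorting before shuffling makes the result independent of directory listing order, which is what keeps the split reproducible across machines.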

Image preprocessing and augmentation strategy:
- Resize to `224x224` (with 16-pixel ViT patches this gives a 14x14 token grid).
- Normalize with ImageNet mean/std to match pretrained encoder statistics.
- Generate EDT supervision from binary masks (`generate_edt.py`).
- Training-time augmentation (conservative):
  - horizontal/vertical flip
  - 90/180/270 rotation
  - no aggressive color jitter/elastic transforms by default
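
The EDT supervision mentioned above can be sketched like this. It is an assumption that `generate_edt.py` clips the distance at `d_max` and rescales to `[0, 1]`; the helper name `edt_target` is hypothetical. Only `scipy.ndimage.distance_transform_edt` is a real library call here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edt_target(mask, d_max=15.0):
    """Distance-to-background map, clipped at d_max and scaled to [0, 1] (assumed convention)."""
    fg = np.asarray(mask) > 0
    dist = distance_transform_edt(fg)      # 0 on background, grows toward cell interiors
    return np.clip(dist, 0.0, d_max) / d_max

mask = np.zeros((9, 9), dtype=np.uint8)
mask[2:7, 2:7] = 1                          # one 5x5 toy "cell"
target = edt_target(mask)
print(target.round(2))
```

Clipping keeps large cells from dominating the regression target, while the interior maxima still mark approximate cell centers.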

Why conservative augmentation (pathology cell segmentation):
- Cell morphology and boundary texture are key signals; aggressive distortions can corrupt those cues.
- Strong synthetic color changes may create unrealistic stain distributions.
- Rotation/flip are label-preserving and biologically plausible for patch-level cell images.
- With a strong pretrained backbone and limited data, conservative augmentation is usually more stable.
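
A minimal sketch of the flip/rotation augmentation, assuming the image, the binary mask, and the EDT target are NumPy arrays that must receive the same transform. The function name `random_flip_rot90` is hypothetical and the actual dataset class may be organized differently.

```python
import numpy as np

def random_flip_rot90(image, mask, edt, rng):
    """Apply one random flip / 90-degree rotation to image (H, W, C) and targets (H, W)."""
    k = int(rng.integers(0, 4))            # number of 90-degree rotations: 0, 1, 2 or 3
    flip_h = bool(rng.integers(0, 2))
    flip_v = bool(rng.integers(0, 2))
    out = []
    for arr in (image, mask, edt):
        arr = np.rot90(arr, k, axes=(0, 1))
        if flip_h:
            arr = np.flip(arr, axis=1)
        if flip_v:
            arr = np.flip(arr, axis=0)
        out.append(np.ascontiguousarray(arr))   # drop negative strides before tensor conversion
    return tuple(out)

rng = np.random.default_rng(8888)
img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
msk = (img[..., 0] > 20).astype(np.uint8)
aug_img, aug_msk, aug_edt = random_flip_rot90(img, msk, msk.astype(np.float32), rng)
print(aug_img.shape, aug_msk.shape)
```

Flips and 90-degree rotations map the pixel grid onto itself, so a precomputed EDT target stays valid after the transform and does not need to be regenerated.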

Why no explicit stain normalization in this baseline:
- First objective was a stable and reproducible baseline with minimal extra preprocessing.
- BCCD color variation is moderate relative to multi-center pathology cohorts.
- Stain normalization can introduce texture shifts/artifacts if not tuned carefully.
- We leave Reinhard/Macenko normalization as a controlled follow-up ablation.


In [None]:
# Run once (terminal or notebook shell):
# !python3 github_uni/datasets/create_split.py \
#   --train_original "data/BCCD Dataset with mask/train/original" \
#   --train_mask "data/BCCD Dataset with mask/train/mask" \
#   --val_ratio 0.2 --seed 8888 \
#   --out_json "github_uni/datasets/splits/bccd_train_val_split.json"

# !python3 github_uni/datasets/generate_edt.py \
#   --dataset_root "data/BCCD Dataset with mask" --d_max 15

## 3) Model Adaptation and Design (Pathology Context)
- **Backbone**:
  - Current reported results use `vit_base_patch16_224` as a fallback encoder.
  - The pipeline follows the UNI-style adaptation (tokenized encoder -> dense decoder), but the pretrained weights are not the UNI weights.
- **Token-to-map adaptation**: remove prefix/register tokens, reshape patch tokens into 2D feature map.
- **Decoder (lightweight by design)**:
  - `1x1` channel projection from encoder embedding to decoder width
  - 3-stage `bilinear interpolate + Conv-BN-ReLU`
  - final resize to full image resolution
  - decoder normalization: **BatchNorm**
- **Why interpolate + Conv-BN-ReLU (not deconv)**:
  - avoids checkerboard artifacts often seen with deconvolution
  - produces smoother boundaries for thin/overlapping cell edges
  - lower complexity and faster iteration with frozen encoder
- **Dual heads (shared decoder feature input)**:
  - semantic head (`seg_logits`): pixel-wise cell/background classification
  - distance head (`dist_pred`): EDT regression for touching-cell separation cues
- **Instance step**: threshold foreground + smooth distance + local peaks + watershed.
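
The token-to-map step can be illustrated with plain array indexing. This sketch assumes a single prefix (CLS) token and row-major patch order, which is the usual ViT layout; the function name `tokens_to_feature_map` is hypothetical. The same slicing and reshaping applies to PyTorch tensors with `permute` in place of `transpose`.

```python
import numpy as np

def tokens_to_feature_map(tokens, num_prefix_tokens=1, patch_size=16, img_size=224):
    """(B, P + N, C) token sequence -> (B, C, H, W) feature map, dropping P prefix tokens."""
    grid = img_size // patch_size                      # 224 // 16 = 14
    patch_tokens = tokens[:, num_prefix_tokens:, :]    # drop CLS / register tokens
    b, n, c = patch_tokens.shape
    if n != grid * grid:
        raise ValueError(f"expected {grid * grid} patch tokens, got {n}")
    # token i sits at row i // grid, column i % grid
    return patch_tokens.reshape(b, grid, grid, c).transpose(0, 3, 1, 2)

tokens = np.zeros((2, 1 + 14 * 14, 768), dtype=np.float32)  # ViT-B/16: 196 patches + CLS
print(tokens_to_feature_map(tokens).shape)                  # (2, 768, 14, 14)
```

For an encoder with register tokens, `num_prefix_tokens` would be set to the total number of non-patch tokens.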

Design rationale:
- Small crowded cells need strong global features + stable local refinement.
- A simple decoder avoids overfitting on BCCD while preserving boundary detail.
- EDT head targets the overlap problem where semantic masks alone tend to merge cells.
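
The instance step (threshold foreground, smooth distance, local peaks, watershed) can be sketched with SciPy and scikit-image. This is an illustration under assumptions, not the project's post-processing code: the function name `instances_from_heads` and all default values are hypothetical, and the parameter names only mirror the ones listed for tuning later in this notebook.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def instances_from_heads(fg_prob, dist_pred, fg_thresh=0.5, sigma=1.0,
                         min_distance=5, peak_threshold=0.2, area_min=10):
    """Label map (0 = background) from a foreground probability map and a distance map."""
    fg = fg_prob > fg_thresh
    dist = ndi.gaussian_filter(dist_pred.astype(np.float32), sigma=sigma) * fg
    peaks = peak_local_max(dist, min_distance=min_distance,
                           threshold_abs=peak_threshold, exclude_border=False)
    markers = np.zeros(dist.shape, dtype=np.int32)
    if len(peaks) == 0:
        return markers                                   # no seeds, no instances
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)   # one seed per peak
    labels = watershed(-dist, markers, mask=fg)          # flood from seeds, stay inside fg
    ids, counts = np.unique(labels, return_counts=True)
    for lab, cnt in zip(ids, counts):
        if lab != 0 and cnt < area_min:
            labels[labels == lab] = 0                    # drop tiny fragments
    return labels

# Toy input: two overlapping disks and their normalized distance map.
yy, xx = np.mgrid[:40, :48]
mask = ((yy - 20) ** 2 + (xx - 15) ** 2 <= 100) | ((yy - 20) ** 2 + (xx - 33) ** 2 <= 100)
fg_prob = mask.astype(np.float32)
dist_pred = ndi.distance_transform_edt(mask) / 15.0
labels = instances_from_heads(fg_prob, dist_pred)
print(len(np.unique(labels)) - 1)                        # number of instances found
```

Foreground regions that contain no peak above `peak_threshold` receive no seed and stay unlabeled in this sketch, which trades a few missed cells for fewer spurious fragments.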


## 4) Training and Hyperparameter Strategy


In [None]:
# Train command (example):
# !python3 github_uni/main_single.py

# Baseline hyperparameters
BASELINE_CFG = {
    'seed': 8888,
    'lr': 2e-4,
    'wd': 1e-4,
    'batch_size': 4,
    'img_size': 224,
    'max_num_epochs': 50,
    'dist_weight': 0.5,
}

# Tuning notes
# Exp-2: lr=1e-4, wd=5e-5 (kept other settings fixed)
# Result: slight degradation vs baseline on this split.
# Exp-3: lr=1e-4, wd=1e-4, dist_weight=0.3
# Goal: reduce over-segmentation and improve instance precision/AJI.

# Important tuning principle:
# balance two-head performance jointly:
# - higher dist_weight can improve splitting but may increase false positives / over-splitting
# - lower dist_weight can protect semantic Dice/IoU but weaken touching-cell separation
# therefore monitor Dice/IoU + AP/AJI/F1@0.5 together.

BASELINE_CFG
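
The role of `dist_weight` can be illustrated with a small sketch. It assumes the total loss is a weighted sum of a semantic term and a distance-regression term; the specific choices here (binary cross-entropy on a single logit, mean squared error, the name `joint_loss`) are assumptions for illustration and may differ from what `main_single.py` uses.

```python
import numpy as np

def joint_loss(seg_logits, seg_target, dist_pred, dist_target, dist_weight=0.5):
    """Return (total, bce, mse) with total = bce + dist_weight * mse."""
    x, y = seg_logits, seg_target
    # numerically stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    bce = np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x))))
    mse = np.mean((dist_pred - dist_target) ** 2)
    return bce + dist_weight * mse, bce, mse

logits = np.zeros((2, 2))
target = np.array([[0.0, 1.0], [1.0, 0.0]])
dist = np.full((2, 2), 0.5)
for w in (0.3, 0.5):
    total, bce, mse = joint_loss(logits, target, dist, dist + 0.2, dist_weight=w)
    print(w, round(float(total), 4), round(float(bce), 4), round(float(mse), 4))
```

Logging the two terms separately, as the tuple return allows, makes it easier to see which head a change in `dist_weight` actually affects.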


## 5) Baseline + Ablation Checkpoint Metrics


In [None]:
baseline_metrics = {
    'best_val': {'loss': 0.29771364391860317, 'f1': 0.9237635902519266, 'f1_0': 0.9530601222049384, 'f1_1': 0.8944670582989146},
    'best_test': {'loss': 0.3000761054456234, 'f1': 0.9239783591861196, 'f1_0': 0.9521112985077236, 'f1_1': 0.8958454198645156}
}

exp2_metrics = {
    'best_val': {'loss': 0.30855946848958227, 'f1': 0.9207980251792034, 'f1_0': 0.9510220371592142, 'f1_1': 0.8905740131991925},
    'best_test': {'loss': 0.3117054454982281, 'f1': 0.920871966296492, 'f1_0': 0.9499723239967988, 'f1_1': 0.8917716085961853}
}

exp3_metrics = {
    'best_val': {'loss': 0.2980442628011865, 'f1': 0.9228892796054556, 'f1_0': 0.952499013768235, 'f1_1': 0.8932795454426763},
    'best_test': {'loss': 0.3008011419326067, 'f1': 0.9230502689125806, 'f1_0': 0.951569863205416, 'f1_1': 0.8945306746197451}
}

baseline_metrics, exp2_metrics, exp3_metrics


## 6) Instance Segmentation + Evaluation (IoU, Dice, mAP)

In [None]:
# Baseline instance evaluation
# !python3 github_uni/eval/evaluate_instance.py \
#   --checkpoint github_uni/checkpoints_single/best_model.pt \
#   --split test \
#   --save_dir github_uni/checkpoints_single/instance_eval_test

# Exp-2 instance evaluation
# !python3 github_uni/eval/evaluate_instance.py \
#   --checkpoint github_uni/checkpoints_single2/best_model.pt \
#   --split test \
#   --save_dir github_uni/checkpoints_single2/instance_eval_test

# Exp-3 instance evaluation
# !python3 github_uni/eval/evaluate_instance.py \
#   --checkpoint github_uni/checkpoints_single3/best_model.pt \
#   --split test \
#   --save_dir github_uni/checkpoints_single3/instance_eval_test


In [None]:
summary_path = EVAL_DIR / 'metrics_summary.json'
if summary_path.exists():
    summary = json.loads(summary_path.read_text())
    print(json.dumps(summary, indent=2)[:2000])
else:
    print('Run evaluation first, then re-run this cell.')

## 7) Metrics Table (IoU, Dice, mAP)


In [None]:
import json
from pathlib import Path

def load_summary(p):
    p = Path(p)
    return json.loads(p.read_text()) if p.exists() else None

s1 = load_summary('github_uni/checkpoints_single/instance_eval_test/metrics_summary.json')
s2 = load_summary('github_uni/checkpoints_single2/instance_eval_test/metrics_summary.json')
s3 = load_summary('github_uni/checkpoints_single3/instance_eval_test/metrics_summary.json')

rows = []
for name, s in [('Baseline', s1), ('Exp-2', s2), ('Exp-3', s3)]:
    if s is None:
        continue
    rows.append({
        'Run': name,
        'Dice': round(s['semantic']['dice_fg'], 4),
        'IoU': round(s['semantic']['iou_fg'], 4),
        'AP50': round(s['detection']['AP@0.50'], 4),
        'mAP(0.50:0.95)': round(s['detection']['mAP_50_95'], 4),
        'Inst_F1@0.5': round(s['instance']['f1'], 4),
        'AJI': round(s['instance']['aji'], 4),
    })

rows


## 8) Qualitative Visualizations (3-5 cases)
The evaluation script exports overlays to:
- `.../qualitative/success`
- `.../qualitative/failure`

In [None]:
from IPython.display import display
from PIL import Image

def show_first_n(folder, n=3):
    folder = Path(folder)
    files = sorted(folder.glob('*.png'))  # all exported overlays in this folder
    print(folder, 'count=', len(files), 'showing=', min(n, len(files)))
    for f in files[:n]:
        print(f.name)
        display(Image.open(f))

show_first_n(EVAL_DIR / 'qualitative' / 'success', n=3)
show_first_n(EVAL_DIR / 'qualitative' / 'failure', n=3)

## 9) Interpretation of Success/Failure Cases and Main Challenge
For each selected case, briefly comment on:
- foreground overlap quality (Dice/IoU behavior)
- split quality for touching cells (mAP/AJI behavior)
- likely error source: clumping, weak boundaries, tiny objects, staining/background variation

Main challenge: **cell overlap / touching cells**.
- In our runs, semantic quality is strong, but instance metrics are moderate.
- This indicates foreground is detected well, yet separation of adjacent cells is still imperfect.
- Precision lower than recall suggests occasional over-segmentation in crowded regions.

Trade-off reminder:
- higher Dice/IoU does not always imply higher instance mAP
- instance metrics are more sensitive to merge/split errors than semantic metrics.
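
A toy example makes the trade-off concrete: three touching cells predicted as one merged blob give a perfect semantic Dice but no instance matches at IoU 0.5. The greedy matching below is a simplified stand-in and an assumption; `evaluate_instance.py` may match instances differently. All helper names are hypothetical.

```python
def dice(a, b):
    """Dice coefficient between two pixel sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def iou(a, b):
    """Intersection over union between two pixel sets."""
    return len(a & b) / len(a | b)

def instance_f1(gt_instances, pred_instances, thr=0.5):
    """Greedy one-to-one matching at IoU >= thr, then F1 = 2TP / (2TP + FP + FN)."""
    unmatched = list(pred_instances)
    tp = 0
    for g in gt_instances:
        best = max(unmatched, key=lambda p: iou(g, p), default=None)
        if best is not None and iou(g, best) >= thr:
            unmatched.remove(best)
            tp += 1
    fp, fn = len(unmatched), len(gt_instances) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

def square(r0, c0, size=4):
    """Pixel set of a size x size square with top-left corner (r0, c0)."""
    return {(r, c) for r in range(r0, r0 + size) for c in range(c0, c0 + size)}

gt = [square(0, 0), square(0, 4), square(0, 8)]     # three touching 4x4 cells
merged = [set().union(*gt)]                          # prediction fuses them into one blob

print(dice(set().union(*gt), merged[0]))             # semantic Dice: 1.0
print(instance_f1(gt, merged))                       # instance F1@0.5: 0.0
```

Each ground-truth cell overlaps the merged blob with IoU 16/48, which is below the 0.5 threshold, so the single prediction counts as a false positive and all three cells as misses.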


## 10) Reflection and Next Improvements
- Replace fallback backbone with UNI2-h when gated access is approved.
- Add explicit stain normalization (e.g., Reinhard/Macenko) and compare against the current pipeline, which applies only ImageNet mean/std normalization.
- Decoder ablation: BatchNorm vs GroupNorm (not tested due to time).
- Explore alternative semantic losses (e.g., Focal loss) that down-weight easy pixels, to improve the foreground/background decision balance.
- Test more aggressive augmentation in controlled studies (e.g., random resized crop, mild color jitter, blur).
- Tune watershed post-processing (`min_distance`, `peak_threshold`, `area_min`).
- Add stronger instance-aware supervision for overlap regions:
  - example: center-heatmap auxiliary head or boundary-aware auxiliary head.
- Use Ray Tune for broader hyperparameter search under frozen-encoder lightweight training.
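
For the Focal loss item above, a per-pixel sketch of the standard binary form, `-alpha_t * (1 - p_t)**gamma * log(p_t)`, shows how easy pixels are down-weighted. The defaults `gamma=2.0` and `alpha=0.25` are common choices and not values tuned for this project; the function name is hypothetical.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel, given predicted foreground probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct pixel contributes far less than a confidently wrong one.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy, which gives a simple sanity check when swapping it into an existing training loop.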
