# 8. Reviewer Response — Reproducibility & Statistical Analysis

This notebook addresses the following reviewer concerns:

1. **[A] Missing full hyperparameters** — Dump the fully resolved Detectron2 config
   (anchors, augmentation, normalization, input sizes, etc.) for each model.
2. **[B] No uncertainty quantification** — Bootstrap confidence intervals on
   existing test-set predictions.
3. **[C] COCO mAP thresholds not justified for small colonies** — Evaluate at
   multiple IoU thresholds and analyse colony size distribution.
4. **[D] Summary** — Auto-generated summary for the reviewer response letter.
5. **[E] Filter sensitivity** — Quantify the impact of the >100 annotation
   threshold on data distribution and results.
6. **[F] Multi-seed & YOLOv8 integration** — Load results from notebooks 10/11
   to report training variance and cross-architecture comparison.

**Prerequisites:** Run `1_setup.ipynb`. For Part F, also run `10_multi_seed_train.ipynb` and `11_yolo_baseline.ipynb`.

## 8.0 Imports

In [None]:
import os
import json

import config
from utils.reproducibility import (
    dump_full_config,
    extract_key_config_summary,
    generate_reproducibility_report,
    print_config_summary,
)
from utils.evaluation import (
    bootstrap_coco_eval,
    format_ci_table,
    plot_bootstrap_distributions,
    multi_threshold_evaluate,
    format_multi_threshold_table,
    size_distribution_analysis,
    filter_sensitivity_analysis,
)

---
## Part A: Full Hyperparameter & Config Documentation

**Reviewer concern:** *"Missing full hyperparameters, anchors, normalization,
random seeds, augmentation details. Missing exact Detectron2 config files."*

**Response:** We reconstruct the fully resolved Detectron2 config for each
model architecture. All training used the Detectron2 model zoo defaults;
only the parameters listed in our thesis (LR, batch size, epochs, etc.)
were overridden. The full resolved configs are exported as YAML files.

### A.1 Generate Config Summary for One Model

In [None]:
# Pick a representative model
MODEL_KEY = "faster_rcnn_R101"
NUM_CLASSES = 3  # AGAR dataset

summary = extract_key_config_summary(MODEL_KEY, num_classes=NUM_CLASSES)
print_config_summary(summary)

### A.2 Export Full Resolved Config (YAML) for All Architectures

In [None]:
# Export for all 6 architectures used in the study
EXPORT_DIR = os.path.join(config.RESULTS_DIR, "reproducibility")
os.makedirs(EXPORT_DIR, exist_ok=True)

for model_key in config.MODELS:
    yaml_path = os.path.join(EXPORT_DIR, f"{model_key}_full_config.yaml")
    dump_full_config(model_key, num_classes=NUM_CLASSES, output_path=yaml_path)

print(f"\n✓ All configs exported to {EXPORT_DIR}")
print("  Files:", os.listdir(EXPORT_DIR))

### A.3 Export Per-Trained-Model Reports

For each trained model directory, save the full config + summary alongside the model weights.

In [None]:
# Comment out any models you do not want to export reports for.
# This saves full_config.yaml + config_summary.json in each model directory.

MODELS_TO_REPORT = {
    # (model_key, trained_model_dir, num_classes)
    # AGAR models
    "total_faster_rcnn_R50":  ("faster_rcnn_R50",  config.AGAR_TRAINED_MODELS["total_faster_rcnn_R50"],  3),
    "total_faster_rcnn_R101": ("faster_rcnn_R101", config.AGAR_TRAINED_MODELS["total_faster_rcnn_R101"], 3),
    "total_retinanet_R50":    ("retinanet_R50",    config.AGAR_TRAINED_MODELS["total_retinanet_R50"],    3),
    "total_retinanet_R101":   ("retinanet_R101",   config.AGAR_TRAINED_MODELS["total_retinanet_R101"],   3),
    "total_mask_rcnn_R50":    ("mask_rcnn_R50",    config.AGAR_TRAINED_MODELS["total_mask_rcnn_R50"],    3),
    "total_mask_rcnn_R101":   ("mask_rcnn_R101",   config.AGAR_TRAINED_MODELS["total_mask_rcnn_R101"],   3),
}

for label, (model_key, model_dir, n_cls) in MODELS_TO_REPORT.items():
    print(f"\n--- {label} ---")
    generate_reproducibility_report(model_key, model_dir, num_classes=n_cls)

---
## Part B: Uncertainty Quantification (Bootstrap Confidence Intervals)

**Reviewer concern:** *"No uncertainty quantification (confidence intervals,
variance estimates)."*

**Response:** We compute 95% bootstrap confidence intervals by resampling
test images with replacement (N=1000) and running COCOeval on each resample.
This quantifies the variability of AP due to the specific test-set composition
without requiring retraining.
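
The resampling idea can be sketched in a few lines. This is a minimal illustration of a percentile bootstrap over images, not the implementation of `bootstrap_coco_eval`: the `percentile_bootstrap_ci` helper and the per-image scores are hypothetical, and a plain mean stands in for the COCOeval run that the procedure above performs on each resample.

```python
import random
import statistics


def percentile_bootstrap_ci(per_image_scores, n_bootstrap=1000,
                            confidence_level=0.95, seed=42):
    """Percentile bootstrap CI for the mean of per-image scores.

    Hypothetical helper: the notebook's procedure reruns COCOeval on each
    resample; a plain mean is used here only to keep the sketch runnable.
    """
    rng = random.Random(seed)
    n = len(per_image_scores)
    resampled_means = []
    for _ in range(n_bootstrap):
        # Draw n images with replacement, then recompute the metric.
        sample = rng.choices(per_image_scores, k=n)
        resampled_means.append(statistics.fmean(sample))
    resampled_means.sort()
    alpha = 1.0 - confidence_level
    low_idx = int((alpha / 2) * (n_bootstrap - 1))
    high_idx = int((1 - alpha / 2) * (n_bootstrap - 1))
    return {
        "mean": statistics.fmean(resampled_means),
        "std": statistics.stdev(resampled_means),
        "ci_low": resampled_means[low_idx],
        "ci_high": resampled_means[high_idx],
    }


# Made-up per-image scores, for illustration only.
toy_scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
ci = percentile_bootstrap_ci(toy_scores)
print(f"{ci['mean']:.3f} [{ci['ci_low']:.3f}, {ci['ci_high']:.3f}]")
```

Fixing the seed makes the interval reproducible across runs, which is why `SEED` is part of the configuration in B.1.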

### B.1 Configure

In [None]:
# ===================== CONFIGURE =====================

# Ground truth annotation file
DATASET_SOURCE = "agar"
SUBSET = "total"

if DATASET_SOURCE == "agar":
    GT_PATH = config.AGAR_DATASETS[SUBSET]["test"]
else:
    GT_PATH = config.ROBOFLOW_DATASETS["curated"]["test"]

# Prediction file from the model to analyze
TRAINED_MODEL_KEY = "total_faster_rcnn_R101"
MODEL_SOURCE = "agar"   # 'agar' or 'roboflow'
PREDICTIONS_SUBFOLDER = "0"   # or 'test', '2', etc.

PREDICTIONS_PATH = config.get_predictions_path(
    TRAINED_MODEL_KEY, MODEL_SOURCE, PREDICTIONS_SUBFOLDER
)

# Bootstrap params
N_BOOTSTRAP = 1000
CONFIDENCE_LEVEL = 0.95
SEED = 42

# ====================================================

print(f"Ground truth: {GT_PATH}")
print(f"Predictions:  {PREDICTIONS_PATH}")
print(f"Bootstrap:    {N_BOOTSTRAP} iterations, {int(CONFIDENCE_LEVEL*100)}% CI")

### B.2 Run Bootstrap

In [None]:
ci_results = bootstrap_coco_eval(
    gt_path=GT_PATH,
    predictions_path=PREDICTIONS_PATH,
    n_bootstrap=N_BOOTSTRAP,
    confidence_level=CONFIDENCE_LEVEL,
    seed=SEED,
)

### B.3 Results Table

In [None]:
ci_table = format_ci_table(ci_results, confidence_level=CONFIDENCE_LEVEL)
display(ci_table)

# Save to CSV
ci_output = os.path.join(config.RESULTS_DIR, "bootstrap_ci")
os.makedirs(ci_output, exist_ok=True)
csv_path = os.path.join(ci_output, f"{TRAINED_MODEL_KEY}_bootstrap_ci.csv")
ci_table.to_csv(csv_path, index=False)
print(f"\nSaved to: {csv_path}")

### B.4 Distribution Plots

In [None]:
plot_save = os.path.join(ci_output, f"{TRAINED_MODEL_KEY}_bootstrap_dist.png")
plot_bootstrap_distributions(ci_results, save_path=plot_save)

### B.5 Run for Multiple Models (Batch)

Run bootstrap CIs for all models to build a comparison table.

In [None]:
# Uncomment to run batch bootstrap for all AGAR total models
#
# MODELS_TO_BOOTSTRAP = [
#     "total_faster_rcnn_R50",
#     "total_faster_rcnn_R101",
#     "total_retinanet_R50",
#     "total_retinanet_R101",
#     "total_mask_rcnn_R50",
#     "total_mask_rcnn_R101",
# ]
#
# all_ci = {}
# for key in MODELS_TO_BOOTSTRAP:
#     pred_path = config.get_predictions_path(key, "agar", PREDICTIONS_SUBFOLDER)
#     if not os.path.exists(pred_path):
#         print(f"Skipping {key}: {pred_path} not found")
#         continue
#     print(f"\n{'='*50}")
#     print(f"Model: {key}")
#     print(f"{'='*50}")
#     all_ci[key] = bootstrap_coco_eval(
#         GT_PATH, pred_path, n_bootstrap=N_BOOTSTRAP,
#         confidence_level=CONFIDENCE_LEVEL, seed=SEED,
#     )
#
# # Summary comparison table
# import pandas as pd
# rows = []
# for key, res in all_ci.items():
#     rows.append({
#         "Model": key,
#         "AP (mean ± std)": f"{res['AP']['mean']:.4f} ± {res['AP']['std']:.4f}",
#         "AP 95% CI": f"[{res['AP']['ci_low']:.4f}, {res['AP']['ci_high']:.4f}]",
#         "AP50 (mean ± std)": f"{res['AP50']['mean']:.4f} ± {res['AP50']['std']:.4f}",
#     })
# display(pd.DataFrame(rows))

---
## Part C: IoU Threshold Justification & Size Analysis

**Reviewer concern:** *"COCO mAP thresholds not justified for very
small colonies."*

**Response:** We (1) report the size distribution of colonies using COCO
size categories, (2) evaluate at multiple individual IoU thresholds
including a lenient 0.25 suitable for small objects, and (3) report
AP-small, AP-medium, and AP-large separately.
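
For reference, the COCO size categories split objects by area at 32² and 96² px². The sketch below shows that bucketing on made-up annotations. It is an illustration only, not the code of `size_distribution_analysis`: `coco_size_bucket` is a hypothetical helper, the handling of areas exactly on a boundary is an assumption, and box area is used in place of the annotation's `area` field.

```python
from collections import Counter

SMALL_MAX_AREA = 32 ** 2    # 1024 px²
MEDIUM_MAX_AREA = 96 ** 2   # 9216 px²


def coco_size_bucket(area):
    """Return the COCO size category for an object area in px²."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Made-up COCO-style annotations; bbox is [x, y, width, height].
toy_annotations = [
    {"bbox": [10, 10, 12, 14]},
    {"bbox": [40, 25, 20, 18]},
    {"bbox": [5, 60, 50, 45]},
    {"bbox": [0, 0, 120, 110]},
]

counts = Counter(
    coco_size_bucket(a["bbox"][2] * a["bbox"][3]) for a in toy_annotations
)
for bucket in ("small", "medium", "large"):
    share = 100 * counts[bucket] / len(toy_annotations)
    print(f"{bucket}: {counts[bucket]} ({share:.0f}%)")
```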

### C.1 Colony Size Distribution Analysis

In [None]:
size_save = os.path.join(config.RESULTS_DIR, "size_analysis")
os.makedirs(size_save, exist_ok=True)

size_stats = size_distribution_analysis(
    gt_path=GT_PATH,
    save_path=os.path.join(size_save, f"{SUBSET}_size_distribution.png"),
)

print("\nSize Statistics:")
print(json.dumps(size_stats, indent=2))

### C.2 Multi-Threshold Evaluation

In [None]:
# Evaluate at IoU = 0.25, 0.50, 0.75, 0.90
# IoU=0.25 is included as a lenient threshold for small colonies

mt_results = multi_threshold_evaluate(
    gt_path=GT_PATH,
    predictions_path=PREDICTIONS_PATH,
    iou_thresholds=[0.25, 0.5, 0.75, 0.9],
    per_category=True,
)

### C.3 Multi-Threshold Results Table

In [None]:
mt_table = format_multi_threshold_table(mt_results)
display(mt_table)

# Save
mt_csv = os.path.join(size_save, f"{TRAINED_MODEL_KEY}_multi_iou.csv")
mt_table.to_csv(mt_csv, index=False)
print(f"\nSaved to: {mt_csv}")

### C.4 Interpretation

Key points for the reviewer response:

- **Size distribution:** Report the percentage of colonies that fall in each
  COCO size category. If most colonies are "small" (area < 32² px²), then the
  standard IoU=0.5 threshold is indeed strict: shifting a 10×10 box by 2 px
  in both x and y already lowers IoU to about 0.47.
- **AP at IoU=0.25:** Provides a lenient evaluation suitable for small objects.
- **AP-small vs AP-medium vs AP-large:** Shows how model performance varies
  by colony size, directly addressing the reviewer's concern.
- **Standard AP@[.50:.95]:** Remains the primary metric for comparability
  with other work, but the additional thresholds provide context.
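
The effect of box size on IoU can be checked directly. The sketch below compares the same 2 px diagonal shift on boxes of different sizes; `box_iou` and `shifted_iou` are hypothetical helpers written for this illustration and are not part of `utils.evaluation`.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def shifted_iou(side, offset):
    """IoU between a side x side box and a copy shifted by offset px in x and y."""
    gt = (0, 0, side, side)
    pred = (offset, offset, side + offset, side + offset)
    return box_iou(gt, pred)


for side in (10, 32, 100):
    print(f"{side}x{side} box, 2 px shift: IoU = {shifted_iou(side, 2):.3f}")
```

With the same localisation error, the 10×10 box falls below the 0.5 threshold while the 100×100 box stays above 0.9, which is the motivation for also reporting AP at IoU=0.25.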

---
## Part D: Summary for Reviewer Letter

Run this cell to generate a text summary suitable for copy-pasting into
the reviewer response letter.

In [None]:
print("="*70)
print("REVIEWER RESPONSE SUMMARY")
print("="*70)

# A: Config
print("\n[A] HYPERPARAMETERS & CONFIG")
print("  Full resolved Detectron2 YAML configs exported for all 6 architectures.")
print("  Key defaults (not overridden):")
s = extract_key_config_summary("faster_rcnn_R101", num_classes=3)
print(f"    Anchor sizes:     {s['anchor_generator']['sizes']}")
print(f"    Anchor ratios:    {s['anchor_generator']['aspect_ratios']}")
print(f"    Pixel mean:       {s['input_preprocessing']['pixel_mean']}")
print(f"    Pixel std:        {s['input_preprocessing']['pixel_std']}")
print(f"    Input format:     {s['input_preprocessing']['format']}")
print(f"    Min size train:   {s['input_preprocessing']['min_size_train']}")
print(f"    Max size train:   {s['input_preprocessing']['max_size_train']}")
print(f"    Augmentations:    {[a['name'] for a in s['data_augmentation_train']['augmentations']]}")
print(f"    Random seed:      {s['random_seed']['seed']}")

# B: Bootstrap CIs
print("\n[B] UNCERTAINTY QUANTIFICATION")
print(f"  Model: {TRAINED_MODEL_KEY}")
ci_pct = round(CONFIDENCE_LEVEL * 100)  # CI label follows the configured level
print(f"  Bootstrap: {N_BOOTSTRAP} iterations, {ci_pct}% CI")
for metric in ("AP", "AP50", "AP75"):
    m = ci_results[metric]
    print(f"  {metric + ':':<7}{m['mean']:.4f} ± {m['std']:.4f}  "
          f"{ci_pct}% CI [{m['ci_low']:.4f}, {m['ci_high']:.4f}]")

# C: Multi-threshold
print("\n[C] IoU THRESHOLD ANALYSIS")
for iou_thr, vals in sorted(mt_results.items()):
    print(f"  IoU={iou_thr}: AP={vals['AP']*100:.1f}%  "
          f"AP-small={vals['AP_small']*100:.1f}%  "
          f"AP-medium={vals['AP_medium']*100:.1f}%  "
          f"AP-large={vals['AP_large']*100:.1f}%")

# C (continued): colony size distribution from Part C.1
print("\n[C] COLONY SIZE DISTRIBUTION")
for bucket, count in size_stats['coco_size_buckets'].items():
    pct = size_stats['coco_size_percentages'][bucket.split('(')[0].strip()]
    print(f"  {bucket}: {count} annotations ({pct})")
print(f"  Median area: {size_stats['area']['median']:.1f} px²")
print(f"  Mean area:   {size_stats['area']['mean']:.1f} px²")

---
## Part E: Filter Sensitivity Analysis (>100 annotations threshold)

**Reviewer concern:** *Filtering images with >100 annotations could bias results.*

**Response:** We analyze what the filter excludes — how many images/annotations
are removed, and whether the remaining data has a meaningfully different
size distribution. This shows the filter is a minor data-cleaning step,
not a source of systematic bias.
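
The core of such a check is counting annotations per image and seeing what a threshold removes. The sketch below does this on a made-up COCO-style dict. It is an illustration under assumptions, not the code of `filter_sensitivity_analysis`: `split_by_annotation_count` is a hypothetical helper, and it assumes the filter excludes images with strictly more than `threshold` annotations.

```python
from collections import Counter


def split_by_annotation_count(coco, threshold=100):
    """Split image ids into kept / excluded by annotations per image.

    Assumption: images with MORE than `threshold` annotations are excluded.
    """
    per_image = Counter(ann["image_id"] for ann in coco["annotations"])
    kept, excluded = [], []
    for img in coco["images"]:
        target = excluded if per_image[img["id"]] > threshold else kept
        target.append(img["id"])
    n_excluded_anns = sum(per_image[i] for i in excluded)
    return {
        "kept_images": kept,
        "excluded_images": excluded,
        "pct_images_excluded": 100 * len(excluded) / len(coco["images"]),
        "pct_annotations_excluded":
            100 * n_excluded_anns / len(coco["annotations"]),
    }


# Made-up COCO-style dict: image 1 has 3 annotations, image 2 has 150.
toy_coco = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"image_id": 1}] * 3 + [{"image_id": 2}] * 150,
}
result = split_by_annotation_count(toy_coco, threshold=100)
print(result["excluded_images"],
      f"{result['pct_annotations_excluded']:.1f}% of annotations")
```

The toy case also shows why both percentages are worth reporting: a filter can drop few images while removing a large share of the annotations.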

In [None]:
from utils.evaluation import filter_sensitivity_analysis

# ── Paths to UNFILTERED annotations ──
# These are the annotations BEFORE the >100 filter was applied
UNFILTERED_TRAIN = config.AGAR_UNFILTERED["total_train"]

sensitivity_save = os.path.join(config.RESULTS_DIR, "filter_sensitivity")
os.makedirs(sensitivity_save, exist_ok=True)

filter_stats = filter_sensitivity_analysis(
    gt_path=UNFILTERED_TRAIN,
    threshold=100,
    save_path=os.path.join(sensitivity_save, "filter_sensitivity_total.png"),
)

print("\nFilter Sensitivity Results:")
print(json.dumps(filter_stats, indent=2))

# Save to JSON
with open(os.path.join(sensitivity_save, "filter_sensitivity_stats.json"), 'w') as f:
    json.dump(filter_stats, f, indent=2)

In [None]:
# Test sensitivity to different threshold values
import matplotlib.pyplot as plt  # needed for plt.close() in the loop below

thresholds = [50, 75, 100, 150, 200]
threshold_results = []

for thr in thresholds:
    stats = filter_sensitivity_analysis(
        gt_path=UNFILTERED_TRAIN,
        threshold=thr,
        save_path=None,  # don't save individual plots
    )
    threshold_results.append({
        "Threshold": thr,
        "Images kept": stats["filtered"]["num_images"],
        "Images excluded": stats["excluded"]["num_images"],
        "% excluded": stats["excluded"]["pct_images_excluded"],
        "Annotations kept": stats["filtered"]["num_annotations"],
        "Annotations excluded": stats["excluded"]["num_annotations"],
        "% anns excluded": stats["excluded"]["pct_annotations_excluded"],
        "Area shift (px²)": f"{stats['impact']['area_shift']:.1f}",
    })
    plt.close('all')  # close the auto-generated plots

import pandas as pd
df_thr = pd.DataFrame(threshold_results)
display(df_thr)

csv_path = os.path.join(sensitivity_save, "filter_threshold_comparison.csv")
df_thr.to_csv(csv_path, index=False)
print(f"\nSaved to: {csv_path}")

---
## Part F: Integration of Multi-Seed & YOLOv8 Results

**Prerequisites:** Run `10_multi_seed_train.ipynb` and `11_yolo_baseline.ipynb` first.

This section loads the multi-seed training results and cross-architecture
comparison to generate the final reviewer response tables.
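
Training variance is reported as mean ± standard deviation across seeds. The sketch below shows that aggregation with the standard library on made-up runs. The record layout (one dict per run with `model`, `seed`, and `AP`) is an assumption for illustration; only the `model` and `AP` keys are taken from the cells that follow.

```python
import statistics
from collections import defaultdict

# Made-up per-seed records, one dict per (model, seed) run, AP in percent.
toy_runs = [
    {"model": "total_faster_rcnn_R101", "seed": 0, "AP": 50.0},
    {"model": "total_faster_rcnn_R101", "seed": 1, "AP": 52.0},
    {"model": "total_faster_rcnn_R101", "seed": 2, "AP": 51.0},
]

by_model = defaultdict(list)
for run in toy_runs:
    by_model[run["model"]].append(run["AP"])

summary = {}
for model, values in by_model.items():
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator), the same convention
    # as the pandas default `Series.std()` (ddof=1).
    std = statistics.stdev(values)
    summary[model] = (mean, std)
    print(f"{model}: AP = {mean:.1f} ± {std:.1f} (n={len(values)})")
```

With only a few seeds the sample standard deviation is itself a rough estimate, so reporting the number of seeds next to each value helps the reader judge it.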

### F.1 Load Multi-Seed and YOLOv8 Results

In [None]:
import pandas as pd

# ── Load multi-seed Detectron2 results ──
multi_seed_path = os.path.join(config.RESULTS_DIR, "multi_seed_results.json")
yolo_path = os.path.join(config.RESULTS_DIR, "yolo_test_results.json")

if os.path.exists(multi_seed_path):
    with open(multi_seed_path, 'r') as f:
        d2_seed_results = json.load(f)
    df_d2 = pd.DataFrame(d2_seed_results)
    
    print("=== Detectron2 Multi-Seed Results ===")
    for model in df_d2['model'].unique():
        sub = df_d2[df_d2['model'] == model]
        print(f"\n{model}:")
        for m in ['AP', 'AP50', 'AP75']:
            mean = sub[m].mean()
            std = sub[m].std()
            print(f"  {m}: {mean:.1f} ± {std:.1f}")
else:
    print(f"Multi-seed results not found: {multi_seed_path}")
    print("Run notebook 10 first.")
    d2_seed_results = None

if os.path.exists(yolo_path):
    with open(yolo_path, 'r') as f:
        yolo_results = json.load(f)
    df_yolo = pd.DataFrame(yolo_results)
    
    print("\n=== YOLOv8 Results ===")
    for m in ['mAP50', 'mAP50_95', 'mAP75']:
        mean = df_yolo[m].mean()
        std = df_yolo[m].std()
        print(f"  {m}: {mean:.1f} ± {std:.1f}")
else:
    print(f"\nYOLO results not found: {yolo_path}")
    print("Run notebook 11 first.")
    yolo_results = None

### F.2 Cross-Architecture Comparison Table

In [None]:
if d2_seed_results and yolo_results:
    rows = []
    
    # Detectron2 models
    for model in df_d2['model'].unique():
        sub = df_d2[df_d2['model'] == model]
        rows.append({
            "Model": model.replace("total_", ""),
            "Framework": "Detectron2",
            "AP (mean±std)": f"{sub['AP'].mean():.1f} ± {sub['AP'].std():.1f}",
            "AP50 (mean±std)": f"{sub['AP50'].mean():.1f} ± {sub['AP50'].std():.1f}",
            "AP75 (mean±std)": f"{sub['AP75'].mean():.1f} ± {sub['AP75'].std():.1f}",
            "N seeds": len(sub),
        })
    
    # YOLO
    rows.append({
        "Model": "YOLOv8m",
        "Framework": "Ultralytics",
        "AP (mean±std)": f"{df_yolo['mAP50_95'].mean():.1f} ± {df_yolo['mAP50_95'].std():.1f}",
        "AP50 (mean±std)": f"{df_yolo['mAP50'].mean():.1f} ± {df_yolo['mAP50'].std():.1f}",
        "AP75 (mean±std)": f"{df_yolo['mAP75'].mean():.1f} ± {df_yolo['mAP75'].std():.1f}",
        "N seeds": len(df_yolo),
    })
    
    df_comparison = pd.DataFrame(rows)
    display(df_comparison)
    
    comp_csv = os.path.join(config.RESULTS_DIR, "final_comparison_table.csv")
    df_comparison.to_csv(comp_csv, index=False)
    print(f"\nSaved to: {comp_csv}")
else:
    print("Cannot build comparison — run notebooks 10 and 11 first.")

### F.3 Updated Reviewer Response Summary

With multi-seed and cross-architecture results available, the response
to reviewers now includes:

1. **[A] Full configs** — resolved YAML for all 6 architectures
2. **[B] Bootstrap CIs** — evaluation-level uncertainty
3. **[C] IoU thresholds** — AP@0.25/0.50/0.75/0.90 with size breakdown
4. **[C] Colony sizes** — COCO size bucket distribution (Part C.1)
5. **[E] Filter sensitivity** — impact of >100 annotation threshold
6. **[F] Training variance** — mean ± std across 3 seeds (Detectron2 + YOLOv8)

Copy the tables from Parts B, C, E, and F into the reviewer letter.