# RTpipeline on Google Colab - Part 2: CPU Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kstawiski/rtpipeline/blob/main/rtpipeline_colab_part2_cpu.ipynb)

**💰 Cost Optimization:** This notebook runs on **CPU ONLY** (no GPU needed)!

## What This Part Does

✅ **Loads segmentations** from Part 1
✅ **DVH extraction** (dose-volume histogram metrics)
✅ **Radiomics features** (150+ texture/shape features)
✅ **Robustness testing** (optional - feature stability)
✅ **Aggregation and visualization**
✅ **Downloadable results**

## Prerequisites

- Completed Part 1 (GPU segmentation)
- Part 1 outputs saved to Google Drive
- **CPU runtime** (Runtime → Change runtime type → None/CPU)

---

**⚡ Quick Start:**
1. Run cells 1-3 (setup)
2. Mount Google Drive (cell 4)
3. **UPDATE CONFIGURATION** (cell 5) - Point to Part 1 output folder
4. Run remaining cells

## 1️⃣ Setup: Install Miniconda & System Dependencies

In [None]:
%%bash
echo "=== Installing System Dependencies ==="
apt-get update -qq
apt-get install -y -qq dcm2niix pigz > /dev/null

echo -e "\n=== Installing Python dependencies (pydicom, SimpleITK, etc.) ==="
python3 -m pip install -q "pydicom>=3.0.0" "SimpleITK>=2.3.0" "dicompyler-core>=0.5.6" "rt-utils>=1.4.0" "nibabel>=5.1.0" "xlsxwriter" "openpyxl"
echo "✅ Core Python deps installed"

if [ ! -d "/content/miniconda" ]; then
    echo -e "\n=== Installing Miniconda ==="
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
    bash /tmp/miniconda.sh -b -p /content/miniconda
    rm /tmp/miniconda.sh
    echo "✅ Miniconda installed"
else
    echo "✅ Miniconda already installed"
fi

export PATH="/content/miniconda/bin:$PATH"
eval "$(/content/miniconda/bin/conda shell.bash hook)"
conda init bash


echo -e "\n=== Installing Snakemake (base env) ==="
conda install -n base -c conda-forge -c bioconda -y -q snakemake
echo -e "\n✅ Setup complete!"

## 2️⃣ Clone RTpipeline Repository

In [None]:
%%bash
if [ ! -d "/content/rtpipeline" ]; then
    echo "Cloning rtpipeline repository..."
    git clone -q https://github.com/kstawiski/rtpipeline.git /content/rtpipeline
    echo "✅ Repository cloned"
else
    echo "✅ Repository already exists"
    cd /content/rtpipeline
    git pull origin main
    echo "Repository updated"
fi

## 3️⃣ Create Conda Environments

Creates the two conda environments, `rtpipeline` and `rtpipeline-radiomics` (~5-10 minutes, needed only once per session).

In [None]:
%%bash
export PATH="/content/miniconda/bin:$PATH"
eval "$(/content/miniconda/bin/conda shell.bash hook)"

echo "=== Accepting Anaconda Terms of Service ==="
conda config --set channel_priority flexible
if ! conda tos accept --channel defaults 2>&1; then
    echo "⚠️ ToS acceptance failed or already accepted"
fi

cd /content/rtpipeline

if conda env list | grep -q "^rtpipeline "; then
    echo "✅ Environment 'rtpipeline' exists"
else
    echo "Creating 'rtpipeline' environment..."
    conda env create -f envs/rtpipeline.yaml -q
    echo "✅ Created"
fi

if conda env list | grep -q "^rtpipeline-radiomics "; then
    echo "✅ Environment 'rtpipeline-radiomics' exists"
else
    echo "Creating 'rtpipeline-radiomics' environment..."
    conda env create -f envs/rtpipeline-radiomics.yaml -q
    echo "✅ Created"
fi

echo ""
echo "✅ Environments ready"
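
The cell above detects existing environments by grepping the table printed by `conda env list`. A minimal sketch of the same check from Python, parsing `conda env list --json` instead, is shown here. The helper names `env_names_from_json` and `conda_env_exists` are hypothetical, and the conda path is assumed to be the Miniconda location used in step 1.

```python
import json
import os
import subprocess


def env_names_from_json(payload: str) -> set[str]:
    """Extract environment names from `conda env list --json` output."""
    env_paths = json.loads(payload).get("envs", [])
    # Each entry is an absolute prefix path; the last component is the env name
    # (the base env shows up as the install root, e.g. /content/miniconda).
    return {os.path.basename(p.rstrip("/")) for p in env_paths}


def conda_env_exists(name: str, conda: str = "/content/miniconda/bin/conda") -> bool:
    """Return True if a conda environment called `name` is registered."""
    out = subprocess.run([conda, "env", "list", "--json"],
                         check=True, capture_output=True, text=True).stdout
    return name in env_names_from_json(out)
```

For example, `conda_env_exists("rtpipeline-radiomics")` could be called before step 7 to confirm that environment creation finished.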

## 4️⃣ Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✅ Google Drive mounted at /content/drive/MyDrive/")

---

# ⚙️ CONFIGURATION - UPDATE THIS!

## 5️⃣ Configure Part 1 Output Path & Processing Options

**🔴 REQUIRED:** Update `PART1_OUTPUT_DIR`
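
If you are unsure of the exact folder name, a small helper can list candidates on Drive. This is a sketch under one assumption: Part 1 folders are named `rtpipeline_part1_output_<YYYYMMDD_HHMMSS>`, as in the default value of `PART1_OUTPUT_DIR`. The function name `find_part1_outputs` is hypothetical.

```python
from pathlib import Path


def find_part1_outputs(drive_root: str = "/content/drive/MyDrive") -> list[str]:
    """List candidate Part 1 output folders, newest first.

    Assumes folder names follow rtpipeline_part1_output_<YYYYMMDD_HHMMSS>,
    so sorting by name in reverse also sorts by timestamp.
    """
    root = Path(drive_root)
    if not root.is_dir():
        return []
    candidates = [p for p in root.glob("rtpipeline_part1_output_*") if p.is_dir()]
    return [str(p) for p in sorted(candidates, key=lambda p: p.name, reverse=True)]


for folder in find_part1_outputs():
    print(folder)
```

Copy the path you want into `PART1_OUTPUT_DIR` in the cell below.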

---

In [None]:
import os
import shutil
import multiprocessing

PART1_OUTPUT_DIR = "/content/drive/MyDrive/rtpipeline_part1_output_20250101_120000"

USE_LOCAL_COPY = False  # Set True for faster performance (uses /content space)
OUTPUT_DIR = "/content/output" if USE_LOCAL_COPY else PART1_OUTPUT_DIR
LOGS_DIR = "/content/logs"
LOCAL_TEMP_DIR = "/content/tmp_part2"
os.makedirs(LOGS_DIR, exist_ok=True)
os.makedirs(LOCAL_TEMP_DIR, exist_ok=True)

CPU_COUNT = multiprocessing.cpu_count()
WORKERS = max(1, CPU_COUNT - 1)
SNAKEMAKE_JOB_THREADS = WORKERS
DVH_THREADS_PER_COURSE = 2
RADIOMICS_THREAD_LIMIT = 4
RADIOMICS_SEQUENTIAL = False
AGGREGATION_THREADS = "auto"

ENABLE_DVH = True
ENABLE_RADIOMICS = True
ENABLE_ROBUSTNESS = False
ENABLE_QC = True

RADIOMICS_SKIP_ROIS = ["body", "couchsurface", "bones"]
RADIOMICS_MAX_VOXELS = 1500000000
RADIOMICS_MIN_VOXELS = 10
RADIOMICS_PARAMS_CT = "/content/rtpipeline/rtpipeline/radiomics_params.yaml"
RADIOMICS_PARAMS_MR = "/content/rtpipeline/rtpipeline/radiomics_params_mr.yaml"

ROBUSTNESS_STRUCTURES = [
    "GTV*", "CTV*", "PTV*",
    "urinary_bladder", "rectum", "prostate"
]
ROBUSTNESS_INTENSITY = "standard"
ROBUSTNESS_VOLUME_CHANGES = [-0.15, 0.0, 0.15]
ROBUSTNESS_TRANSLATIONS_MM = 0.0
ROBUSTNESS_NOISE_LEVELS = [0.0]
ROBUSTNESS_CONTOUR_REALIZATIONS = 0

CUSTOM_STRUCTURES_FILE = "custom_structures_pelvic.yaml"

if not os.path.exists(PART1_OUTPUT_DIR):
    print("🔴 ERROR: Part 1 output not found!\n   Update PART1_OUTPUT_DIR above.")
else:
    if USE_LOCAL_COPY:
        print("Copying Part 1 outputs to /content for faster processing...")
        if os.path.exists(OUTPUT_DIR):
            shutil.rmtree(OUTPUT_DIR)
        shutil.copytree(PART1_OUTPUT_DIR, OUTPUT_DIR)
        print("✅ Copy complete")
    else:
        print("⚠️ Using Part 1 outputs directly on Drive (slower but saves space)")

    import glob
    courses = []
    for patient_dir in glob.glob(f"{OUTPUT_DIR}/*/"):
        patient_name = os.path.basename(patient_dir.rstrip('/'))
        if patient_name.startswith('_') or patient_name.startswith('.'):
            continue
        for course_dir in glob.glob(f"{patient_dir}/*/"):
            course_name = os.path.basename(course_dir.rstrip('/'))
            if not course_name.startswith('_'):
                courses.append(f"{patient_name}/{course_name}")

    if courses:
        print(f"\n✅ Found {len(courses)} course(s)")
        for c in courses[:3]:
            print(f"  - {c}")
        if len(courses) > 3:
            print(f"  ... and {len(courses) - 3} more")
    else:
        print("\n⚠️ No courses found. Verify Part 1 outputs.")

print("\n📋 Configuration Summary:")
print(f"   CPU cores: {CPU_COUNT}")
print(f"   Snakemake workers: {WORKERS}")
print(f"   Snakemake job threads: {SNAKEMAKE_JOB_THREADS}")
print(f"   Parallelism: DVH threads/course={DVH_THREADS_PER_COURSE}, Radiomics limit={RADIOMICS_THREAD_LIMIT}")
print(f"   Outputs directory: {OUTPUT_DIR}")
print(f"   Logs directory: {LOGS_DIR}")


---

## 6️⃣ Generate Configuration File

In [None]:
try:
    import yaml
except ImportError:
    import subprocess as _subprocess
    import sys as _sys
    _subprocess.check_call([_sys.executable, '-m', 'pip', 'install', 'pyyaml'])
    import yaml

config_data = {
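    # DICOM_ROOT is not defined in cell 5. Falling back to the Part 1 output
    # folder is an assumption; define DICOM_ROOT yourself if the pipeline
    # needs the original DICOM location.
    'dicom_root': globals().get('DICOM_ROOT', PART1_OUTPUT_DIR),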
    'output_dir': OUTPUT_DIR,
    'logs_dir': LOGS_DIR,
    'snakemake_job_threads': SNAKEMAKE_JOB_THREADS,
    'workers': WORKERS,
    'dvh': {
        'threads_per_course': DVH_THREADS_PER_COURSE
    },
    'radiomics': {
        'sequential': bool(RADIOMICS_SEQUENTIAL),
        'params_file': RADIOMICS_PARAMS_CT,
        'mr_params_file': RADIOMICS_PARAMS_MR,
        'thread_limit': RADIOMICS_THREAD_LIMIT,
        'skip_rois': RADIOMICS_SKIP_ROIS,
        'max_voxels': RADIOMICS_MAX_VOXELS,
        'min_voxels': RADIOMICS_MIN_VOXELS
    },
    'radiomics_robustness': {
        'enabled': bool(ENABLE_ROBUSTNESS),
        'structures': ROBUSTNESS_STRUCTURES if ENABLE_ROBUSTNESS else [],
        'intensity': ROBUSTNESS_INTENSITY,
        'volume_changes': ROBUSTNESS_VOLUME_CHANGES if ENABLE_ROBUSTNESS else [],
        'translations_mm': ROBUSTNESS_TRANSLATIONS_MM,
        'noise_levels': ROBUSTNESS_NOISE_LEVELS if ENABLE_ROBUSTNESS else [],
        'contour_realizations': ROBUSTNESS_CONTOUR_REALIZATIONS
    },
    'aggregation': {
        'threads': AGGREGATION_THREADS
    },
    'custom_structures': CUSTOM_STRUCTURES_FILE,
    'components': {
        'dvh': bool(ENABLE_DVH),
        'radiomics': bool(ENABLE_RADIOMICS),
        'robustness': bool(ENABLE_ROBUSTNESS),
        'qc': bool(ENABLE_QC)
    }
}

config_path = '/content/config_part2.yaml'
with open(config_path, 'w') as f:
    f.write('# RTpipeline Configuration - Part 2 (CPU Analysis)\n')
    yaml.safe_dump(config_data, f, sort_keys=False)

print(f"✅ Configuration written to: {config_path}")
print(f"\nReview configuration: !cat {config_path}")
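
Before starting a long run, it can help to check the generated configuration for paths that do not exist. The sketch below is a hedged example, not part of rtpipeline: the function name `validate_config` is hypothetical, and which keys the pipeline treats as mandatory is an assumption based on the dictionary built above.

```python
import os


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Directories the pipeline reads from must already exist.
    for key in ("dicom_root", "output_dir"):
        path = config.get(key)
        if not path or not os.path.isdir(path):
            problems.append(f"{key}: directory not found ({path!r})")
    # Parameter files referenced by the radiomics section must exist.
    radiomics = config.get("radiomics", {})
    for key in ("params_file", "mr_params_file"):
        path = radiomics.get(key)
        if not path or not os.path.isfile(path):
            problems.append(f"radiomics.{key}: file not found ({path!r})")
    # The worker count must be a positive integer.
    workers = config.get("workers")
    if not isinstance(workers, int) or workers < 1:
        problems.append(f"workers: expected a positive integer, got {workers!r}")
    return problems
```

In the notebook, `validate_config(config_data)` returning an empty list suggests the paths in the file are at least reachable.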


## 7️⃣ Run Part 2 Pipeline

⏱️ **Estimated Time:**
- DVH only: 5-15 minutes
- DVH + Radiomics: 20-45 minutes
- DVH + Radiomics + Robustness: 30-90 minutes

In [None]:
import os
import subprocess
import time

os.environ['PATH'] = f"/content/miniconda/bin:{os.environ.get('PATH', '')}"
os.chdir('/content/rtpipeline')

print("═══════════════════════════════════════════════════")
print("   RTpipeline Part 2: CPU Analysis")
print("═══════════════════════════════════════════════════")
print("\n⚡ Processing:")
if ENABLE_DVH:
    print(f"   ✓ DVH extraction ({DVH_THREADS_PER_COURSE} threads/course)")
if ENABLE_RADIOMICS:
    print(f"   ✓ Radiomics (thread limit: {RADIOMICS_THREAD_LIMIT})")
if ENABLE_ROBUSTNESS:
    print(f"   ✓ Robustness testing ({ROBUSTNESS_INTENSITY})")
if ENABLE_QC:
    print("   ✓ Quality control")
print(f"   ✓ Aggregation\n")

start_time = time.time()

# Check that Snakemake is available in the base env; install it only if the check fails
try:
    subprocess.run(["conda", "run", "-n", "base", "snakemake", "--version"],
                   check=True, capture_output=True)
except subprocess.CalledProcessError:
    print("Installing Snakemake...")
    subprocess.run(["conda", "install", "-n", "base", "-c", "conda-forge",
                    "-c", "bioconda", "snakemake", "-y", "-q"], check=True)
    print("✅ Snakemake installed\n")

# Run pipeline
cmd = [
    "conda", "run", "-n", "base", "snakemake",
    "--configfile", "/content/config_part2.yaml",
    "--use-conda",
    "--cores", str(WORKERS),
    "--printshellcmds",
    "--keep-going"
]

result = subprocess.run(cmd, capture_output=False, text=True)

total_time = time.time() - start_time
print("\n" + "="*50)
if result.returncode == 0:
    print("✅ Part 2 Complete!")
else:
    print("⚠️ Completed with some errors")
print("="*50)
print(f"Total time: {total_time/60:.1f} minutes")
print(f"\nResults: {OUTPUT_DIR}/_RESULTS/")
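
Because a full run can take up to 90 minutes, a dry run is a cheap way to preview the jobs first. The sketch below rebuilds the command from the cell above and optionally appends Snakemake's `--dry-run` flag. The helper name `build_snakemake_cmd` is hypothetical, and nothing is executed here.

```python
def build_snakemake_cmd(configfile: str, cores: int, dry_run: bool = False) -> list[str]:
    """Assemble the Snakemake invocation used in the cell above."""
    cmd = [
        "conda", "run", "-n", "base", "snakemake",
        "--configfile", configfile,
        "--use-conda",
        "--cores", str(cores),
        "--printshellcmds",
        "--keep-going",
    ]
    if dry_run:
        # --dry-run lists the jobs Snakemake would execute without running them.
        cmd.append("--dry-run")
    return cmd


preview = build_snakemake_cmd("/content/config_part2.yaml", cores=1, dry_run=True)
print(" ".join(preview))
```

Passing the returned list to `subprocess.run` from `/content/rtpipeline` would perform the preview.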

## 8️⃣ View Results

In [None]:
import pandas as pd
import os

results_dir = f"{OUTPUT_DIR}/_RESULTS"

if not os.path.exists(results_dir):
    print("⚠️ Results directory not found")
else:
    print("═══════════════════════════════════")
    print("   Results Summary")
    print("═══════════════════════════════════\n")
    
    files = [f for f in os.listdir(results_dir) if f.endswith('.xlsx')]
    print(f"Generated {len(files)} files:\n")
    for f in files:
        size_mb = os.path.getsize(os.path.join(results_dir, f)) / 1024 / 1024
        print(f"  ✓ {f} ({size_mb:.1f} MB)")
    
    # Load results
    try:
        dvh = pd.read_excel(os.path.join(results_dir, "dvh_metrics.xlsx"))
        print(f"\n📊 DVH: {len(dvh)} rows")
        print(f"   Structures: {', '.join(dvh['Structure'].value_counts().head(5).index.tolist())}")
        globals()['dvh'] = dvh
    except FileNotFoundError:
        print(f"\n⚠️ DVH metrics file not found.")
    except Exception as e:
        print(f"\n⚠️ Error loading DVH metrics: {e}")
    
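    if ENABLE_RADIOMICS:
        try:
            radiomics = pd.read_excel(os.path.join(results_dir, "radiomics_ct.xlsx"))
            print(f"\n🔬 Radiomics: {len(radiomics)} rows, {len(radiomics.columns)} columns")
            globals()['radiomics'] = radiomics
        except FileNotFoundError:
            print("\n⚠️ Radiomics file not found.")
        except Exception as e:
            print(f"\n⚠️ Error loading radiomics: {e}")
    
    if ENABLE_ROBUSTNESS:
        try:
            rob = pd.read_excel(os.path.join(results_dir, "radiomics_robustness_summary.xlsx"),
                               sheet_name='global_summary')
            print(f"\n🎯 Robustness: {len(rob)} features")
            print(f"   {rob['robustness_label'].value_counts().to_dict()}")
            globals()['robustness_summary'] = rob
        except FileNotFoundError:
            print("\n⚠️ Robustness summary file not found.")
        except Exception as e:
            print(f"\n⚠️ Error loading robustness summary: {e}")
    
    # Report only the tables that were actually loaded
    loaded = [name for name in ('dvh', 'radiomics', 'robustness_summary') if name in globals()]
    print(f"\n✅ Results loaded: {', '.join(loaded) if loaded else 'none'}")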
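
A quick way to see which outputs are missing is to compare the results folder against the file names this notebook reads. The sketch below uses only those three names from the cell above. The helper `missing_result_files` is hypothetical, and the pipeline may write additional files that it does not check.

```python
import os


def missing_result_files(results_dir: str, dvh: bool = True, radiomics: bool = True,
                         robustness: bool = False) -> list[str]:
    """Return expected result files that are absent from results_dir."""
    expected = []
    if dvh:
        expected.append("dvh_metrics.xlsx")
    if radiomics:
        expected.append("radiomics_ct.xlsx")
    if robustness:
        expected.append("radiomics_robustness_summary.xlsx")
    return [name for name in expected
            if not os.path.isfile(os.path.join(results_dir, name))]
```

In the notebook this could be called as `missing_result_files(results_dir, ENABLE_DVH, ENABLE_RADIOMICS, ENABLE_ROBUSTNESS)`.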

## 9️⃣ Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

if 'dvh' in globals():
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    top_structures = dvh.groupby('Structure')['Dmean_Gy'].mean().sort_values(ascending=False).head(10)
    top_structures.plot(kind='barh', ax=axes[0], color='steelblue')
    axes[0].set_xlabel('Mean Dose (Gy)')
    axes[0].set_title('Top 10 Structures by Mean Dose')
    
    dvh['ROI_Volume_cc'].hist(bins=50, ax=axes[1], color='coral', edgecolor='black')
    axes[1].set_xlabel('ROI Volume (cc)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('ROI Volume Distribution')
    axes[1].set_yscale('log')
    
    plt.tight_layout()
    plt.show()
    print("✅ DVH visualizations")
else:
    print("⚠️ No DVH data")

## 🔟 Download Results

In [None]:
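%%bash -s "$OUTPUT_DIR"
# $1 is OUTPUT_DIR from cell 5, so this works with or without USE_LOCAL_COPY
cd "$1"
rm -f /content/results.zip
zip -r -q /content/results.zip _RESULTS/
echo "✅ Archive: /content/results.zip"
ls -lh /content/results.zip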

In [None]:
from google.colab import files
files.download('/content/results.zip')
print("\n✅ Download started")
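
The archive step can also be written in Python so that it follows `OUTPUT_DIR` wherever it points (local copy or Drive). This is a minimal sketch using the standard library's `shutil.make_archive`. The function name `archive_results` is hypothetical.

```python
import os
import shutil


def archive_results(output_dir: str, archive_base: str = "/content/results") -> str:
    """Zip <output_dir>/_RESULTS and return the path of the archive."""
    results_dir = os.path.join(output_dir, "_RESULTS")
    if not os.path.isdir(results_dir):
        raise FileNotFoundError(f"No results folder at {results_dir}")
    # root_dir/base_dir keep paths inside the archive relative to output_dir,
    # so entries look like _RESULTS/dvh_metrics.xlsx.
    return shutil.make_archive(archive_base, "zip",
                               root_dir=output_dir, base_dir="_RESULTS")
```

With the defaults, `archive_results(OUTPUT_DIR)` writes `/content/results.zip`, the path that `files.download` expects.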

## 1️⃣1️⃣ Save to Google Drive

In [None]:
import os
import shutil
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
drive_results = f"/content/drive/MyDrive/rtpipeline_results_{timestamp}"

try:
    shutil.copytree(f"{OUTPUT_DIR}/_RESULTS", drive_results)
    # Named xlsx_files so it does not shadow google.colab.files from the download cell
    xlsx_files = [f for f in os.listdir(drive_results) if f.endswith('.xlsx')]
    
    print("\n" + "="*60)
    print("🎉 ALL DONE!")
    print("="*60)
    print(f"\nSaved to: {drive_results}")
    print(f"\n{len(xlsx_files)} files:")
    for f in xlsx_files:
        print(f"  ✓ {f}")
except Exception as e:
    print(f"\n⚠️ Error: {e}")
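
Drive writes can lag or fail partway through, so it may be worth confirming that the copy is complete before the Colab session ends. The sketch below compares file presence and size between the source and the copy. The helper `compare_trees` is hypothetical, and a size match is only a light check, not a checksum.

```python
import os


def compare_trees(src: str, dst: str) -> list[str]:
    """Return relative paths that are missing or differ in size in dst."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(src):
        for name in filenames:
            src_file = os.path.join(dirpath, name)
            rel = os.path.relpath(src_file, src)
            dst_file = os.path.join(dst, rel)
            if not os.path.isfile(dst_file):
                problems.append(f"missing: {rel}")
            elif os.path.getsize(src_file) != os.path.getsize(dst_file):
                problems.append(f"size mismatch: {rel}")
    return sorted(problems)
```

An empty list from `compare_trees(f"{OUTPUT_DIR}/_RESULTS", drive_results)` indicates every file arrived with the expected size.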

---

## 🎉 Complete!

**What you have:**
- ✅ DVH metrics
- ✅ Radiomic features
- ✅ Robustness analysis (if enabled)
- ✅ Analysis-ready data

**💰 Cost Savings:** GPU used only for Part 1 segmentation!

---

**Resources:**
- [Output Format Guide](https://github.com/kstawiski/rtpipeline/blob/main/output_format.md)
- [Robustness Guide](https://github.com/kstawiski/rtpipeline/blob/main/RADIOMICS_ROBUSTNESS.md)
- [Repository](https://github.com/kstawiski/rtpipeline)

**Version:** 2.0 (Part 2 - CPU Analysis)