# RTpipeline on Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kstawiski/rtpipeline/blob/main/rtpipeline_colab.ipynb)

This notebook allows you to run the RTpipeline radiotherapy data processing system on Google Colab with GPU acceleration.

## What This Pipeline Does

RTpipeline processes DICOM radiotherapy data and generates:
- ✅ **Automatic segmentation** of 100+ organs using TotalSegmentator
- ✅ **DVH metrics** (dose-volume histograms)
- ✅ **Radiomics features** (150+ texture/shape features)
- ✅ **Quality control reports**
- ✅ **Analysis-ready tables** for machine learning

## Prerequisites

- Google Colab account (free tier works, but GPU runtime recommended)
- DICOM files (CT, RTPLAN, RTDOSE, RTSTRUCT)
- ~10-30 minutes processing time per patient (GPU)

---

**⚡ Quick Start:** Run all cells in order.


## 1️⃣ Setup: Install Dependencies

This cell installs all required packages (~5 minutes).

In [None]:
%%bash
# Check GPU availability
echo "=== GPU Check ==="
nvidia-smi || echo "⚠️ No GPU detected. Pipeline will use CPU (slower)."

# Install system dependencies
echo -e "\n=== Installing System Dependencies ==="
apt-get update -qq
apt-get install -y -qq dcm2niix pigz > /dev/null

# Install Python packages
echo -e "\n=== Installing Python Packages ==="
pip install -q --upgrade pip
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install -q pydicom dicompyler-core numpy pandas scipy SimpleITK \
    dicom2nifti rt-utils plotly matplotlib openpyxl xlsxwriter \
    "TotalSegmentator>=2.4.0" snakemake

# Install PyRadiomics (separate due to numpy compatibility)
pip install -q "numpy<2.0" pyradiomics

echo -e "\n✅ Setup complete!"

## 2️⃣ Clone RTpipeline Repository

In [None]:
%%bash
# Clone repository
if [ ! -d "/content/rtpipeline" ]; then
    echo "Cloning rtpipeline repository..."
    git clone https://github.com/kstawiski/rtpipeline.git /content/rtpipeline
    echo "✅ Repository cloned"
else
    echo "✅ Repository already exists"
fi

cd /content/rtpipeline
git pull origin main
echo "Repository updated to latest version"

## 3️⃣ Upload Your DICOM Files

You have two options:

### Option A: Use Files from Google Drive

Run this cell to mount your Google Drive, then access your files under `/content/drive/MyDrive/`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✅ Google Drive mounted at /content/drive/MyDrive/")
print("\nYour DICOM files should be in: /content/drive/MyDrive/your_dicom_folder/")

### Option B: Upload Files Directly

Run the cell below to create the working directories, then use the file browser on the left (📁 icon) to upload DICOM files to `/content/dicom_data/`.

In [None]:
%%bash
# Create directories
mkdir -p /content/dicom_data
mkdir -p /content/output
mkdir -p /content/logs

echo "✅ Directories created:"
echo "  - /content/dicom_data (upload your DICOM files here)"
echo "  - /content/output (results will be saved here)"
echo "  - /content/logs (processing logs)"
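
Uploading many individual files through the file browser can be slow. As a minimal sketch, assuming you have instead uploaded a single ZIP archive of your DICOM folder, the following helper unpacks it into the upload directory. The function name `extract_dicom_zip` and the archive path in the usage comment are hypothetical and not part of RTpipeline.

```python
import zipfile
from pathlib import Path


def extract_dicom_zip(zip_path, dest_dir="/content/dicom_data"):
    """Extract a ZIP archive into dest_dir and return the number of files written."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so the count reflects files only
        members = [m for m in archive.infolist() if not m.is_dir()]
        archive.extractall(dest, members=members)
    return len(members)


# Example usage (hypothetical archive name):
# n = extract_dicom_zip("/content/my_patients.zip")
# print(f"Extracted {n} files")
```

The folder structure inside the archive is preserved, so per-patient subfolders stay intact.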

## 4️⃣ Configure Pipeline

Modify the settings below according to your needs:

In [None]:
import os

# ============ CONFIGURATION ============

# DICOM directory (change if using Google Drive)
DICOM_ROOT = "/content/dicom_data"
# Example: DICOM_ROOT = "/content/drive/MyDrive/my_dicom_folder"

# Output directory
OUTPUT_DIR = "/content/output"
LOGS_DIR = "/content/logs"

# Processing options
USE_GPU = True  # Set to False if no GPU available
ENABLE_RADIOMICS = True  # Extract radiomic features
ENABLE_CT_CROPPING = False  # Crop CT to anatomical region
CROP_REGION = "pelvis"  # Options: pelvis, thorax, abdomen, head_neck, brain

# Advanced settings
WORKERS = 4  # Parallel workers (adjust based on available memory)
SEG_WORKERS = 2  # Segmentation workers (GPU: 1-4, CPU: 1)
FAST_MODE = False  # CPU-friendly mode (lower quality)

# =======================================

# Detect GPU
import subprocess
try:
    subprocess.run(['nvidia-smi'], check=True, capture_output=True)
    gpu_available = True
    print("✅ GPU detected")
except (subprocess.CalledProcessError, FileNotFoundError):
    # nvidia-smi is missing or failed: fall back to CPU settings
    gpu_available = False
    USE_GPU = False
    SEG_WORKERS = 1
    print("⚠️ No GPU detected - using CPU mode")

# Check DICOM directory
if not os.path.exists(DICOM_ROOT):
    print(f"\n⚠️ WARNING: DICOM directory not found: {DICOM_ROOT}")
    print("Please upload your DICOM files or update the DICOM_ROOT variable above.")
else:
    # Count files with a .dcm extension (case-insensitive)
    dicom_count = sum(1 for root, dirs, files in os.walk(DICOM_ROOT) for f in files if f.lower().endswith('.dcm'))
    print(f"\n✅ DICOM directory found: {DICOM_ROOT}")
    print(f"   Found {dicom_count} DICOM files")

print(f"\nConfiguration:")
print(f"  GPU: {USE_GPU}")
print(f"  Radiomics: {ENABLE_RADIOMICS}")
print(f"  CT Cropping: {ENABLE_CT_CROPPING}")
print(f"  Workers: {WORKERS}")

## 5️⃣ Generate Configuration File

In [None]:
config_yaml = f"""# RTpipeline Configuration for Google Colab
# Generated automatically

# Input/Output directories
dicom_root: "{DICOM_ROOT}"
output_dir: "{OUTPUT_DIR}"
logs_dir: "{LOGS_DIR}"

# Processing parameters
workers: {WORKERS}

segmentation:
  workers: {SEG_WORKERS}
  threads_per_worker: null
  force: false
  fast: {str(FAST_MODE).lower()}
  roi_subset: null
  extra_models: []
  device: "{'gpu' if USE_GPU else 'cpu'}"
  force_split: true
  nr_threads_resample: 1
  nr_threads_save: 1
  num_proc_preprocessing: 1
  num_proc_export: 1

custom_models:
  enabled: false
  root: "custom_models"
  models: []
  workers: 1
  force: false
  nnunet_predict: "nnUNet_predict"
  retain_weights: true
  conda_activate: null

radiomics:
  sequential: false
  params_file: "/content/rtpipeline/rtpipeline/radiomics_params.yaml"
  mr_params_file: "/content/rtpipeline/rtpipeline/radiomics_params_mr.yaml"
  thread_limit: 4
  skip_rois:
    - body
    - couchsurface
    - bones
  max_voxels: 1500000000
  min_voxels: 10

aggregation:
  threads: auto

environments:
  main: "base"
  radiomics: "base"

custom_structures: "custom_structures_pelvic.yaml"

ct_cropping:
  enabled: {str(ENABLE_CT_CROPPING).lower()}
  region: "{CROP_REGION}"
  superior_margin_cm: 2.0
  inferior_margin_cm: 10.0
  use_cropped_for_dvh: true
  use_cropped_for_radiomics: true
  keep_original: true
"""

# Write config file
config_path = "/content/config_colab.yaml"
with open(config_path, 'w') as f:
    f.write(config_yaml)

print(f"✅ Configuration written to: {config_path}")
print("\nPreview:")
print(config_yaml[:500] + "...")

## 6️⃣ Run Pipeline

⏱️ **Estimated time:**
- With GPU: 10-30 minutes per patient
- Without GPU: 1-3 hours per patient

**Note:** Colab sessions may time out after 12 hours. For large datasets, process patients in batches.
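
One way to plan batches is sketched below. It assumes each immediate subdirectory of the DICOM root holds one patient, which may not match your layout. The helper name `patient_batches` and the default batch size are illustrative and not part of RTpipeline.

```python
from pathlib import Path


def patient_batches(dicom_root, batch_size=5):
    """Yield lists of patient folders, batch_size at a time.

    Assumes each immediate subdirectory of dicom_root holds one patient.
    """
    patients = sorted(p for p in Path(dicom_root).iterdir() if p.is_dir())
    for start in range(0, len(patients), batch_size):
        yield patients[start:start + batch_size]


# Example usage: list the batches without running anything
# for i, batch in enumerate(patient_batches("/content/dicom_data"), start=1):
#     print(f"Batch {i}: {[p.name for p in batch]}")
```

Each batch could then be copied into its own input folder and processed in a separate run, so a session timeout costs at most one batch.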

In [None]:
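%%bash -s "$DICOM_ROOT" "$OUTPUT_DIR" "$LOGS_DIR" "$WORKERS" "$SEG_WORKERS"
# Python variables from the configuration cell are not visible inside a
# %%bash cell, so they are passed in as positional arguments.
DICOM_ROOT="$1"
OUTPUT_DIR="$2"
LOGS_DIR="$3"
WORKERS="$4"
SEG_WORKERS="$5"

cd /content/rtpipeline

echo "=== Starting RTpipeline ==="
echo "Configuration: /content/config_colab.yaml"
echo ""

# Run pipeline using Python CLI (simpler than Snakemake for Colab)
python3 -m rtpipeline.cli \
    --dicom-root "${DICOM_ROOT}" \
    --outdir "${OUTPUT_DIR}" \
    --logs "${LOGS_DIR}" \
    --workers "${WORKERS}" \
    --seg-workers "${SEG_WORKERS}"

echo ""
echo "=== Pipeline Complete ==="
echo "Results saved to: ${OUTPUT_DIR}"
echo "Check aggregated results: ${OUTPUT_DIR}/_RESULTS/"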
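
If the run fails or produces no results, the logs directory is the first place to look. The sketch below prints the tail of the most recently modified file under `/content/logs`. It makes no assumption about how RTpipeline names its log files, and the helper `tail_latest_log` is hypothetical.

```python
from pathlib import Path


def tail_latest_log(logs_dir="/content/logs", n_lines=20):
    """Return the last n_lines of the most recently modified file in logs_dir."""
    root = Path(logs_dir)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    if not files:
        return []
    latest = max(files, key=lambda p: p.stat().st_mtime)
    # errors="replace" avoids crashing on unexpected bytes in a log file
    with open(latest, errors="replace") as fh:
        return fh.read().splitlines()[-n_lines:]


# Example usage:
# print("\n".join(tail_latest_log()))
```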

## 7️⃣ View Results

Load and preview the aggregated results:

In [None]:
import pandas as pd
import os

results_dir = f"{OUTPUT_DIR}/_RESULTS"

# Check if results exist
if not os.path.exists(results_dir):
    print("⚠️ Results directory not found. The pipeline may still be running or may have failed.")
    print(f"Expected location: {results_dir}")
else:
    print("✅ Results found!\n")

    # List result files
    result_files = os.listdir(results_dir)
    print("Available files:")
    for f in result_files:
        if f.endswith('.xlsx'):
            filepath = os.path.join(results_dir, f)
            size_mb = os.path.getsize(filepath) / 1024 / 1024
            print(f"  - {f} ({size_mb:.2f} MB)")

    # Load DVH metrics
    try:
        dvh_path = os.path.join(results_dir, "dvh_metrics.xlsx")
        dvh = pd.read_excel(dvh_path)
        print(f"\n✅ Loaded DVH metrics: {len(dvh)} rows")
        print("\nFirst few rows:")
        display(dvh.head())

        print("\nStructures found:")
        print(dvh['Structure'].value_counts().head(10))
    except Exception as e:
        print(f"\n⚠️ Could not load DVH metrics: {e}")

    # Load radiomics
    if ENABLE_RADIOMICS:
        try:
            rad_path = os.path.join(results_dir, "radiomics_ct.xlsx")
            radiomics = pd.read_excel(rad_path)
            # The column count includes identifier columns, not only features
            print(f"\n✅ Loaded radiomics: {len(radiomics)} rows, {len(radiomics.columns)} columns")
            print("\nFirst few rows:")
            display(radiomics.head())
        except Exception as e:
            print(f"\n⚠️ Could not load radiomics: {e}")

    # Load metadata
    try:
        meta_path = os.path.join(results_dir, "case_metadata.xlsx")
        metadata = pd.read_excel(meta_path)
        print(f"\n✅ Loaded metadata: {len(metadata)} courses")
        print("\nSummary:")
        print(f"  Patients: {metadata['PatientID'].nunique()}")
        print(f"  Courses: {len(metadata)}")
    except Exception as e:
        print(f"\n⚠️ Could not load metadata: {e}")

## 8️⃣ Quick Visualization

Create some basic plots:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# DVH metrics visualization
try:
    # Plot mean dose by structure
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Mean dose
    top_structures = dvh.groupby('Structure')['Dmean_Gy'].mean().sort_values(ascending=False).head(10)
    top_structures.plot(kind='barh', ax=axes[0], color='steelblue')
    axes[0].set_xlabel('Mean Dose (Gy)')
    axes[0].set_title('Top 10 Structures by Mean Dose')

    # Volume distribution
    dvh['ROI_Volume_cc'].hist(bins=50, ax=axes[1], color='coral', edgecolor='black')
    axes[1].set_xlabel('ROI Volume (cc)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('ROI Volume Distribution')
    axes[1].set_yscale('log')

    plt.tight_layout()
    plt.show()

    print("✅ Visualizations created")
except Exception as e:
    print(f"⚠️ Could not create visualizations: {e}")

## 9️⃣ Download Results

Download results to your local machine:

In [None]:
%%bash
# Create ZIP archive of results
echo "Creating results archive..."
cd /content
zip -r -q results.zip output/_RESULTS/

echo "✅ Results archived: /content/results.zip"
ls -lh /content/results.zip

In [None]:
from google.colab import files

print("Downloading results.zip...")
files.download('/content/results.zip')
print("\n✅ Download started. Check your browser's download folder.")

### Alternative: Save to Google Drive

If you mounted Google Drive earlier, copy results there:

In [None]:
%%bash
# Check if Drive is mounted
if [ -d "/content/drive/MyDrive" ]; then
    # Build the timestamped destination once so the message shows the exact path
    DEST="/content/drive/MyDrive/rtpipeline_results_$(date +%Y%m%d_%H%M%S)"
    echo "Copying results to Google Drive..."
    cp -r /content/output/_RESULTS "$DEST"
    echo "✅ Results copied to: $DEST"
else
    echo "⚠️ Google Drive not mounted. Run the 'Mount Google Drive' cell first."
fi

## 🧹 Cleanup (Optional)

Free up space by removing large intermediate files:

In [None]:
%%bash
echo "Disk usage before cleanup:"
du -sh /content/output

# Remove intermediate segmentation files (keep only _RESULTS)
# Uncomment to clean:
# find /content/output -type d -name "Segmentation_*" -exec rm -rf {} + 2>/dev/null
# find /content/output -type f -name "*.nii.gz" -delete 2>/dev/null

echo -e "\nTo clean up, uncomment the find commands in this cell and re-run."
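
Before deleting anything, it can help to see how much space the cleanup would free. The following sketch only reports the count and total size of files matching the same `*.nii.gz` pattern as the commented-out `find` command; it deletes nothing. The helper name `intermediate_size_mb` is hypothetical.

```python
from pathlib import Path


def intermediate_size_mb(output_dir="/content/output", pattern="*.nii.gz"):
    """Return (file_count, total_size_mb) for files under output_dir matching pattern."""
    files = [p for p in Path(output_dir).rglob(pattern) if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1024 / 1024


# Example usage:
# count, size_mb = intermediate_size_mb()
# print(f"{count} NIfTI files, {size_mb:.1f} MB would be removed")
```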

---

## 📚 Additional Resources

- **Output Format Guide:** [output_format.md](https://github.com/kstawiski/rtpipeline/blob/main/output_format.md)
- **Quick Reference:** [output_format_quick_ref.md](https://github.com/kstawiski/rtpipeline/blob/main/output_format_quick_ref.md)
- **GitHub Repository:** https://github.com/kstawiski/rtpipeline
- **Issues/Questions:** https://github.com/kstawiski/rtpipeline/issues

## ⚠️ Troubleshooting

**Pipeline fails with GPU errors:**
- Set `USE_GPU = False` in the configuration cell
- Reduce `SEG_WORKERS` to 1

**Out of memory errors:**
- Reduce `WORKERS` and `SEG_WORKERS`
- Enable `FAST_MODE = True`
- Process patients in smaller batches

**Colab timeout:**
- Upgrade to Colab Pro for longer runtime
- Process in batches
- Save intermediate results to Google Drive
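
Saving intermediate results to Google Drive could look like the sketch below. It copies only files that are missing or older at the destination, so it can be re-run between batches. The helper `sync_new_files` and the destination folder in the usage comment are hypothetical, and Drive must already be mounted.

```python
import shutil
from pathlib import Path


def sync_new_files(src_dir, dest_dir):
    """Copy files from src_dir to dest_dir if missing or newer; return count copied."""
    src, dest = Path(src_dir), Path(dest_dir)
    copied = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dest / path.relative_to(src)
        if target.exists() and target.stat().st_mtime >= path.stat().st_mtime:
            continue  # already backed up and not modified since
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 preserves the modification time
        copied += 1
    return copied


# Example usage (hypothetical Drive folder):
# n = sync_new_files("/content/output", "/content/drive/MyDrive/rtpipeline_backup")
# print(f"Copied {n} new or updated files")
```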

**Missing DICOM files:**
- Ensure DICOM directory is correct
- Check file permissions
- Verify .dcm file extensions
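
DICOM exports do not always carry a `.dcm` extension. As a rough check that does not depend on file names, the sketch below looks for the `DICM` marker that standard DICOM files store after a 128-byte preamble. Older files written without the preamble will be missed, so treat the count as a lower bound. The helper names are hypothetical.

```python
import os


def is_dicom_file(path):
    """Return True if the file has the 'DICM' marker at byte offset 128."""
    try:
        with open(path, "rb") as fh:
            fh.seek(128)
            return fh.read(4) == b"DICM"
    except OSError:
        return False


def count_dicom_files(root):
    """Count files under root that carry the DICM marker, regardless of extension."""
    return sum(
        is_dicom_file(os.path.join(dirpath, name))
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    )


# Example usage:
# print(count_dicom_files("/content/dicom_data"))
```

If this count is much higher than the `.dcm` count printed by the configuration cell, the files are probably valid but named without the extension.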

---

**Notebook Version:** 1.0
**Compatible with:** rtpipeline v2.0+
**Last Updated:** 2025-11-13
