# Data Owner (DO) - Federated SAM2 Medical Image Segmentation

This notebook is for **Data Owners** (hospitals/medical institutions) participating in federated learning for medical image segmentation.

## What this notebook does:
1. **Log in** to your datasite using your Google account
2. **Upload** your medical imaging dataset (images + masks)
3. **View** submitted FL jobs from Data Scientists
4. **Approve** and **run** FL training jobs

Your data **never leaves** your system: only model updates (LoRA adapters, ~2-8 MB) are shared.

## Prerequisites
1. Go to https://colab.research.google.com/
2. Upload this notebook with `File` -> `Upload Notebook`

## Install Dependencies

In [None]:
# Install syft-flwr from the development branch
!uv pip install -v "git+https://github.com/OpenMined/syft-flwr.git@feat/syft-client-p2p"

## Login to Datasite

Log in as a Data Owner using your Google account. This will:
- Authenticate with Google Drive
- Create your SyftBox folder structure
- Enable P2P communication with Data Scientists

In [None]:
import syft_client as sc
import syft_flwr

print(f"{sc.__version__ = }")
print(f"{syft_flwr.__version__ = }")

do_email = input("Enter the Data Owner's email: ")
do_client = sc.login_do(email=do_email)

## View Peers

Check which Data Scientists have connected to your datasite.

In [None]:
# View connected Data Scientists
do_client.peers

## Upload Medical Imaging Dataset

**IMPORTANT**: Before uploading the dataset, make sure the Data Scientist (DS) has added this Data Owner (DO) as a peer on their side.
Otherwise, the DS will not be able to discover the dataset.

### Dataset Structure
Your dataset should have the following structure:
```
dataset/
├── train/
│   ├── images/
│   │   ├── image001.png
│   │   ├── image002.png
│   │   └── ...
│   └── masks/
│       ├── image001.png
│       ├── image002.png
│       └── ...
├── test/
│   ├── images/
│   └── masks/
├── mock/
│   ├── images/  (synthetic/anonymized samples for preview)
│   └── masks/
└── README.md
```
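
Before uploading, it can help to confirm that the folder actually matches this layout. The sketch below is a minimal, standard-library-only check that assumes each image and its mask share the same `.png` filename, as in the tree above. The function name `check_dataset` is illustrative and not part of syft-client.

```python
from pathlib import Path


def check_dataset(root: Path, splits=("train", "test", "mock")) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks consistent."""
    problems = []
    for split in splits:
        images_dir = root / split / "images"
        masks_dir = root / split / "masks"
        if not images_dir.is_dir() or not masks_dir.is_dir():
            problems.append(f"{split}: missing images/ or masks/ folder")
            continue
        # Assumption: an image and its mask share the same filename.
        images = {p.name for p in images_dir.glob("*.png")}
        masks = {p.name for p in masks_dir.glob("*.png")}
        for name in sorted(images - masks):
            problems.append(f"{split}: {name} has no mask")
        for name in sorted(masks - images):
            problems.append(f"{split}: mask {name} has no image")
    return problems
```

Calling `check_dataset(Path("./dataset"))` and printing the returned list before the upload step surfaces unpaired files early, rather than during training.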

In [None]:
from pathlib import Path

DATASET_DIR = Path("./dataset/").expanduser().absolute()

# NOTE: This cell does not download any data. It only creates an empty
# placeholder folder layout; copy your own medical imaging dataset into it.
if not DATASET_DIR.exists():
    print("Creating sample dataset structure...")
    DATASET_DIR.mkdir(parents=True, exist_ok=True)
    
    # Create directory structure
    for split in ["train", "test", "mock"]:
        (DATASET_DIR / split / "images").mkdir(parents=True, exist_ok=True)
        (DATASET_DIR / split / "masks").mkdir(parents=True, exist_ok=True)
    
    print("Please add your medical imaging data to:")
    print(f"  {DATASET_DIR}")
    print("\nExpected structure:")
    print("  train/images/*.png - training images")
    print("  train/masks/*.png  - training masks")
    print("  test/images/*.png  - test images")
    print("  test/masks/*.png   - test masks")
    print("  mock/              - anonymized samples for DS preview")

DATASET_PATH = DATASET_DIR
print(f"Dataset path: {DATASET_PATH}")

In [None]:
# Create and upload the dataset
do_client.create_dataset(
    name="medical-segmentation",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH,  # Dataset root: contains train/ and test/ (plus mock/ and README.md)
    summary="Medical image segmentation dataset for federated SAM2 training",
    readme_path=(DATASET_PATH / "README.md") if (DATASET_PATH / "README.md").exists() else None,
    tags=["medical", "segmentation", "sam2"],
    sync=True,
)

In [None]:
# View all datasets
do_client.datasets.get_all()

## View Submitted Jobs

Check for FL training jobs submitted by Data Scientists.

In [None]:
# List all submitted jobs
do_client.jobs

In [None]:
# View details of a specific job
if len(do_client.jobs) > 0:
    print(do_client.jobs[0])

## Approve Submitted Jobs

Review and approve FL training jobs. This allows the job to run on your data.

**Important**: Review the job code before approving to ensure it:
- Only trains LoRA adapters (not the full model)
- Does not export raw data
- Follows your data governance policies
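
As a first pass before reading the code by hand, a static scan can highlight files that import networking or process-spawning modules. The sketch below uses only the standard library. The deny-list is a hypothetical starting point, and `code_dir` is assumed to be wherever the job's source files sit on your machine (this notebook does not show where syft-client stores them). It does not replace a manual review.

```python
import ast
from pathlib import Path

# Hypothetical deny-list; adjust it to your own data governance policy.
SUSPICIOUS_MODULES = {"requests", "urllib", "http", "socket", "ftplib", "smtplib", "subprocess"}


def imported_modules(source: str) -> set[str]:
    """Return the top-level names of all absolute imports in a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def flag_suspicious(code_dir: Path) -> dict[str, set[str]]:
    """Map each .py file under code_dir to the deny-listed modules it imports."""
    flagged = {}
    for path in sorted(code_dir.rglob("*.py")):
        hits = imported_modules(path.read_text(encoding="utf-8")) & SUSPICIOUS_MODULES
        if hits:
            flagged[str(path.relative_to(code_dir))] = hits
    return flagged
```

An empty result from `flag_suspicious` only means none of the listed modules are imported directly; dynamic imports or data written to shared folders would not be caught this way.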

In [None]:
# Approve the first job in the list (review its code before running this cell)
if len(do_client.jobs) > 0:
    do_client.jobs[0].approve()
    print("Job approved!")
else:
    print("No jobs to approve")

## Run Approved Jobs

Execute the approved FL training job. This will:
1. Load your local medical imaging data
2. Train SAM2 LoRA adapters on your data
3. Send only the adapter weights (~2-8 MB) to the aggregator

**Your raw data never leaves your system!**
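
To see why adapter updates stay in the megabyte range, recall that LoRA replaces the update to a `d_out x d_in` weight matrix with two low-rank factors, which adds `rank * (d_in + d_out)` parameters per adapted matrix. The sketch below works through that arithmetic. The layer count, dimensions, and rank are illustrative assumptions, not the actual SAM2 configuration used by the job.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_size_mib(layer_shapes, rank: int, bytes_per_param: int = 4) -> float:
    """Total adapter size in MiB for a list of (d_in, d_out) adapted matrices."""
    total_params = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return total_params * bytes_per_param / 2**20


# Hypothetical setup: 48 blocks, 2 adapted 1024x1024 projections each, rank 8, float32.
example_shapes = [(1024, 1024)] * (48 * 2)
print(adapter_size_mib(example_shapes, rank=8))  # 6.0
```

Under these assumed numbers the adapters total 6 MiB, which is in line with the ~2-8 MB figure quoted above; the real size depends on the rank and on which layers the job adapts.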

In [None]:
# Process approved jobs
do_client.process_approved_jobs()

## Check Job Status

In [None]:
# Check status of all jobs
do_client.jobs

## Clean Up

Optional: Delete the SyftBox folder from your Drive when done.

In [None]:
# WARNING: This will delete your SyftBox folder!
# Uncomment to run:
# do_client.delete_syftbox()

## Debug (Optional)

View your SyftBox folder structure for debugging.
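
If installing `tree` is not possible, the folder can also be listed from Python once Drive is mounted. This is a minimal standard-library sketch; `format_tree` is an illustrative helper, not part of syft-client.

```python
from pathlib import Path


def format_tree(root: Path, max_depth: int = 3) -> list[str]:
    """Return an indented listing of root (directories first), up to max_depth levels."""
    lines = [f"{root.name}/"]

    def walk(folder: Path, depth: int) -> None:
        if depth > max_depth:
            return
        # Sort directories before files, then alphabetically.
        for entry in sorted(folder.iterdir(), key=lambda p: (p.is_file(), p.name)):
            suffix = "/" if entry.is_dir() else ""
            lines.append(f"{'    ' * depth}{entry.name}{suffix}")
            if entry.is_dir():
                walk(entry, depth + 1)

    walk(root, 1)
    return lines
```

For example, `print("\n".join(format_tree(Path("/content/drive/MyDrive/SyftBox"))))` prints the same folder that the `tree` command below inspects.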

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!sudo apt install tree -qq

In [None]:
!tree /content/drive/MyDrive/SyftBox