# SAR-RARP50 Dataset Pre-processing

This project uses the SAR-RARP50 dataset, which contains 50 video segments from the suturing phase of Robotic Assisted Radical Prostatectomy (RARP) procedures. The dataset was introduced for the EndoVis 2022 Challenge by Dimitris Psychogyios, Beatrice Van Amsterdam, Emanuele Colleoni and Danail Stoyanov. It includes both [training](https://rdr.ucl.ac.uk/articles/dataset/SAR-RARP50_train_set/24932499) and [test](https://rdr.ucl.ac.uk/articles/dataset/SAR-RARP50_test_set/24932499) splits.

The data for each of the 50 surgical clips is distributed as a `.zip` file. Each archive contains the raw `video_left.avi` file and the corresponding segmentation masks (`.png` files). Action annotations are also provided, but they are not used in this segmentation project.

To prepare the data for use, two pre-processing steps are required:

1. **Unzip Archives:** Each `video_XX.zip` file must be extracted into its own corresponding `video_XX` directory.

2. **Extract Frames:** The `.avi` video files must be sampled into individual RGB frames at a rate of `1 Hz`. This is necessary to align each frame with its corresponding segmentation mask.
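
As a rough illustration of what sampling at `1 Hz` means, the sketch below lists which frames would be kept from a clip. The 60 fps source frame rate, the nine-digit zero-padded file names, and the helper name `sampled_frame_names` are assumptions made for illustration only; the toolkit's unpack script is what actually performs the extraction.

```python
def sampled_frame_names(total_frames, video_fps=60, sample_hz=1, digits=9):
    """Return file names for the frames kept when sampling at `sample_hz`.

    Assumes each frame is named by its zero-padded index in the source video.
    """
    if video_fps % sample_hz != 0:
        raise ValueError("video_fps must be a multiple of sample_hz")
    step = video_fps // sample_hz  # keep one frame every `step` frames
    return [f"{idx:0{digits}d}.png" for idx in range(0, total_frames, step)]


# A hypothetical 3-second clip at 60 fps keeps frames 0, 60 and 120.
print(sampled_frame_names(180))
# ['000000000.png', '000000060.png', '000000120.png']
```

Under these assumptions, a mask and its RGB frame share a file name, which is why matching can later be done by comparing directory listings.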

The directory structure before and after unzipping is as follows:

**Initial Structure:**
```text
data/
├── train_dataset
│   ├── video_01.zip
│   ├── ...
│   └── video_40.zip
└── test_dataset
    ├── video_41.zip
    ├── ...
    └── video_50.zip
```

**After Unzipping:**
```text
data/
├── train_dataset
│   ├── video_XX
│   │   ├── action_continuous.txt
│   │   ├── action_discrete.txt
│   │   ├── segmentation
│   │   │   ├── 000000000.png
│   │   │   ├── ...
│   │   │   └── nnnnnnnnn.png
│   │   └── video_left.avi
│   └── ...
└── test_dataset
    ├── video_XX
    │   ├── action_continuous.txt
    │   ├── action_discrete.txt
    │   ├── segmentation
    │   │   ├── 000000000.png
    │   │   ├── ...
    │   │   └── nnnnnnnnn.png
    │   └── video_left.avi
    └── ...
```

In [None]:
# !pip install monai

import os
from google.colab import drive
import glob
from tqdm.notebook import tqdm
import shutil

## 1. Setup and Configuration

This section mounts Google Drive, where I have stored the datasets, and clones the official SAR-RARP50 toolkit.

**What is the SAR-RARP50 Toolkit?**

The [SAR-RARP50-evaluation repository](https://github.com/surgical-vision/SAR_RARP50-evaluation) is a companion toolkit created by the dataset authors. It provides helper scripts to automate common data handling tasks. For the pre-processing step, we use its unpack script to extract video frames at a fixed frequency so that each frame lines up with one of the provided segmentation masks.

In [None]:
# --- Mount Google Drive to access the project files and dataset ---
drive.mount('/content/drive')

# --- Configuration ---
PROJECT_ROOT = '/content/drive/MyDrive/Colab Notebooks/Surgical_Tool_Segmentation'
DATA_ROOT = os.path.join(PROJECT_ROOT, 'data')
TOOLKIT_REPO_PATH = os.path.join(PROJECT_ROOT, 'SAR_RARP50-evaluation') # Path to the SAR-RARP50 toolkit repository

# --- Paths to the training and test set directories ---
train_path = os.path.join(DATA_ROOT, 'train_dataset')
test_path = os.path.join(DATA_ROOT, 'test_dataset')

# --- Frame Extraction Frequency ---
# The ground-truth segmentation masks for this dataset were created for frames
# sampled at a rate of 1 frame per second (1 Hz). Thus, we will use the same
# sampling frequency for our RGB frames
EXTRACTION_FREQUENCY = 1

# --- Clone the Official Toolkit Repository ---
# Check if it already exists to prevent re-downloading
if not os.path.exists(TOOLKIT_REPO_PATH):
  # Quote the destination: the Drive path contains a space ("Colab Notebooks")
  !git clone https://github.com/surgical-vision/SAR_RARP50-evaluation "{TOOLKIT_REPO_PATH}"
else:
  print("Toolkit repository already exists.")

Mounted at /content/drive
Toolkit repository already exists.


## 2. Unzip Video Files

This step finds all `video_XX.zip` files in the train_dataset and test_dataset folders and extracts each one into a new, corresponding directory (e.g., `video_01.zip` is extracted to a folder named `video_01`).
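
For reference, the same extraction can be sketched in pure Python with the standard-library `zipfile` module, which sidesteps shell quoting for paths that contain spaces. The helper name `extract_archive` is hypothetical, and like the cell below it treats an existing output directory as already extracted without checking its contents.

```python
import os
import zipfile


def extract_archive(zip_path):
    """Extract `video_XX.zip` into a sibling `video_XX` directory.

    Returns the output directory; skips extraction if it already exists.
    """
    output_dir = os.path.splitext(zip_path)[0]  # drop the ".zip" suffix
    if not os.path.isdir(output_dir):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(output_dir)
    return output_dir
```

Calling `extract_archive(path)` for each entry of the globbed zip list would then play the role of the `!unzip` shell command.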

In [None]:
# --- Find all .zip files ---
all_zips = glob.glob(f'{train_path}/*.zip') + glob.glob(f'{test_path}/*.zip')

print(f"Found {len(all_zips)} zip files to extract.")

# --- Loop through each zip file and extract it ---
for zip_file_path in tqdm(all_zips, desc="Unzipping files"):
  base_name = os.path.splitext(os.path.basename(zip_file_path))[0]
  output_dir = os.path.join(os.path.dirname(zip_file_path), base_name)

  # Only unzip if the output directory doesn't already exist
  if not os.path.exists(output_dir):
    # Quote both paths: the Drive path contains a space ("Colab Notebooks")
    !unzip -q "{zip_file_path}" -d "{output_dir}"

print("All files have been successfully unzipped.")

Found 54 zip files to extract.


Unzipping files:   0%|          | 0/54 [00:00<?, ?it/s]

All files have been successfully unzipped.


## 3. Extract Image Frames from Videos

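Now that the `.avi` files are available, this step passes each `video_XX` directory to the toolkit's unpack script, which samples `video_left.avi` into individual `.png` frames at the frequency set in the configuration step and writes them to an `rgb` subfolder. Each video is then verified by checking that every segmentation mask has an RGB frame with the same file name; videos that already pass this check are skipped.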
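
The completeness check that follows extraction can also be expressed as a small standalone helper. This is a minimal sketch rather than part of the toolkit: the name `find_missing_frames` is hypothetical, and it assumes masks and frames are `.png` files that share file names.

```python
import os


def find_missing_frames(segmentation_dir, rgb_dir):
    """Return sorted mask file names that have no same-named RGB frame."""
    masks = {f for f in os.listdir(segmentation_dir) if f.endswith(".png")}
    if not os.path.isdir(rgb_dir):
        return sorted(masks)  # nothing extracted yet: every mask is unmatched
    frames = set(os.listdir(rgb_dir))
    return sorted(masks - frames)
```

An empty list means the video is fully extracted; a non-empty list names exactly the frames that still need to be produced.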

In [None]:
# --- Setup ---
# Change the notebook's current working directory to the toolkit's folder
%cd {TOOLKIT_REPO_PATH}

# --- Get a list of video directories ---
all_paths = glob.glob(f'{DATA_ROOT}/*_dataset/video_*')
all_video_dirs = [path for path in all_paths if os.path.isdir(path)]
all_video_dirs.sort() # Sort the list for consistent processing order

print(f"Found {len(all_video_dirs)} video directories to process.")

# --- Loop, Extract, and Verify ---
for video_dir in tqdm(all_video_dirs, desc="Processing videos"):
  video_name = os.path.basename(video_dir)
  segmentation_path = os.path.join(video_dir, 'segmentation')
  images_path = os.path.join(video_dir, 'rgb')

  # --- 1. Check if the 'rgb' directory already exists. ---
  should_skip = False
  if os.path.exists(images_path):
    # If it exists, check if its contents are complete.
    segmentation_files = set(os.listdir(segmentation_path))
    extracted_images = set(os.listdir(images_path))
    if segmentation_files.issubset(extracted_images):
      print(f"\n--- Skipping: {video_name} (already processed and verified) ---")
      should_skip = True

  if not should_skip:

    print(f"\n--- Processing: {video_name} ---")

    # --- 2. Extract frames for the current video ---
    !python -m scripts.sarrarp50 unpack "{video_dir}" -j4 -f {EXTRACTION_FREQUENCY} > /dev/null 2>&1

    # --- 3. Verify the extracted frames ---
    # Extraction output is silenced above, so confirm the 'rgb' folder exists
    if not os.path.isdir(images_path):
      print(f"ERROR: No 'rgb' folder was created for {video_name}.")
      continue

    # Get the filenames of all segmentation masks
    segmentation_files = set(os.listdir(segmentation_path))
    # Get the filenames of all newly extracted RGB images
    extracted_images = set(os.listdir(images_path))

    # Check if every segmentation mask has a corresponding extracted image
    if segmentation_files.issubset(extracted_images):
      print(f"SUCCESS: All {len(segmentation_files)} segmentation masks have a matching RGB frame.")
    else:
      # Find which files are missing (sorted so the report is deterministic)
      missing_files = sorted(segmentation_files - extracted_images)
      print(f"ERROR: Verification failed for {video_name}.")
      print(f"Missing {len(missing_files)} frame(s); first missing: {missing_files[0]}")

print("\nAll videos have been processed and verified.")

/content/drive/MyDrive/Colab Notebooks/Surgical_Tool_Segmentation/SAR_RARP50-evaluation
Found 54 video directories to process.


Processing videos:   0%|          | 0/54 [00:00<?, ?it/s]


--- Skipping: video_41 (already processed and verified) ---

--- Skipping: video_42 (already processed and verified) ---

--- Skipping: video_43 (already processed and verified) ---


KeyboardInterrupt: 

## 4. Aggregate Files and Create New Archives

The following steps will streamline the dataset for easier use in the main training notebook.

1.  **Aggregate Files:** All `.png` image frames and segmentation masks from the individual `video_XX` folders will be copied into new, top-level `all_rgb` and `all_segmentation` folders inside their respective `train_dataset` and `test_dataset` directories. Files will be renamed (e.g., `video_01_000000000.png`) to ensure there are no name conflicts.

2.  **Create New Zip Archives:** New archives (`train_dataset.zip` and `test_dataset.zip`) will be created in the `data` directory. These will contain **only** the aggregated `all_rgb` and `all_segmentation` folders, so the data can be copied and loaded as two archives of flat folders in future sessions.
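
The archiving step can be sketched in pure Python as follows. This is an illustrative alternative to the `zip` shell command, not the method used in the cells below; the helper name `zip_subfolders` is hypothetical, and the archive is written next to the dataset directory as `<dataset_dir>.zip`.

```python
import os
import zipfile


def zip_subfolders(dataset_dir, subfolders=("all_rgb", "all_segmentation")):
    """Write `<dataset_dir>.zip` containing only the listed subfolders."""
    archive_path = f"{dataset_dir}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for sub in subfolders:
            for root, _dirs, files in os.walk(os.path.join(dataset_dir, sub)):
                for name in sorted(files):
                    full_path = os.path.join(root, name)
                    # Store paths relative to dataset_dir,
                    # e.g. all_rgb/video_01_000000000.png
                    archive.write(full_path, os.path.relpath(full_path, dataset_dir))
    return archive_path
```

Because only the two aggregated folders are walked, the per-video directories and the original archives are left out of the new zip file.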

In [None]:
# --- Define paths for aggregated folders ---
agg_train_rgb_path = os.path.join(train_path, 'all_rgb')
agg_train_seg_path = os.path.join(train_path, 'all_segmentation')
agg_test_rgb_path = os.path.join(test_path, 'all_rgb')
agg_test_seg_path = os.path.join(test_path, 'all_segmentation')

# --- Function to find, rename, and copy files ---
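def aggregate_files(source_dir, dest_rgb_dir, dest_seg_dir):
    """Copies and renames files from nested video folders to flat directories."""
    # Create the destination folders if needed; shutil.copy does not create them
    os.makedirs(dest_rgb_dir, exist_ok=True)
    os.makedirs(dest_seg_dir, exist_ok=True)
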
    # --- Get a list of video directories ---
    all_paths = glob.glob(f'{source_dir}/video_*')
    all_video_dirs = [path for path in all_paths if os.path.isdir(path)]
    all_video_dirs.sort() # Sort the list for consistent processing order

    # Use tqdm for a progress bar
    for video_dir in tqdm(all_video_dirs, desc=f"Aggregating {os.path.basename(source_dir)}"):
        video_name = os.path.basename(video_dir)

        # Aggregate RGB images
        rgb_files = glob.glob(os.path.join(video_dir, 'rgb', '*.png'))
        for f_path in rgb_files:
            filename = os.path.basename(f_path)
            new_filename = f"{video_name}_{filename}"
            shutil.copy(f_path, os.path.join(dest_rgb_dir, new_filename))

        # Aggregate segmentation masks
        seg_files = glob.glob(os.path.join(video_dir, 'segmentation', '*.png'))
        for f_path in seg_files:
            filename = os.path.basename(f_path)
            new_filename = f"{video_name}_{filename}"
            shutil.copy(f_path, os.path.join(dest_seg_dir, new_filename))

# --- Run the aggregation process ---
aggregate_files(train_path, agg_train_rgb_path, agg_train_seg_path)
aggregate_files(test_path, agg_test_rgb_path, agg_test_seg_path)

print("\n File aggregation complete.")

Aggregating test_dataset:   0%|          | 0/10 [00:00<?, ?it/s]


 File aggregation complete.


In [None]:
# --- Create the train_dataset.zip archive (written next to the train_dataset folder) ---
! (cd "{train_path}" && zip -qr "{train_path}.zip" all_rgb all_segmentation)

# --- Create the test_dataset.zip archive (written next to the test_dataset folder) ---
! (cd "{test_path}" && zip -qr "{test_path}.zip" all_rgb all_segmentation)

print("\n New zip folders created successfully.")


 New zip folders created successfully.
