# 🚀 Open in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/matheus-rech/pitvqa-surgical-workflow/blob/main/notebooks/01_upload_pitvqa_to_huggingface.ipynb)

**Quick Start:** Click the badge above to open this notebook directly in Google Colab!

---

# 🏥 PitVQA Dataset → HuggingFace Hub

This notebook downloads the PitVQA surgical VQA dataset from the UCL Research Data Repository
and uploads it to the HuggingFace Hub for easy access during training.

**What this does:**
1. Downloads the PitVQA dataset (~7.56 GB) to Colab's disk, or to a local `data/` folder when run outside Colab
2. Extracts and organizes the data
3. Inspects the annotations and writes a dataset card (`README.md`)
4. Uploads everything to your HuggingFace account

**Requirements:**
- HuggingFace account with write token
- ~15GB free disk space (Colab provides ~100GB)

**Time estimate:** 30-60 minutes on a fast connection (mostly download time); at a few hundred kiB/s the download alone can take several hours

---
**Project:** PitVQA Surgical Workflow Understanding  
**Target:** MICCAI 2026  
**Author:** Generated with Claude Code

## 1. Setup & Dependencies

In [15]:
# Install required packages
%pip install -q huggingface_hub datasets pillow tqdm pandas ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [16]:
# Check available disk space
try:
    import google.colab
    import subprocess
    result = subprocess.run(['df', '-h', '/content'], capture_output=True, text=True)
    print(result.stdout)
    print("\n✅ You need ~15GB free. Colab typically provides ~100GB.")
except ImportError:
    import shutil
    import os
    total, used, free = shutil.disk_usage(os.getcwd())
    print(f"💾 Disk space check:")
    print(f"   Total: {total // (1024**3)} GB")
    print(f"   Used: {used // (1024**3)} GB")
    print(f"   Free: {free // (1024**3)} GB")
    print("\n✅ You need ~15GB free space.")


💾 Disk space check:
   Total: 460 GB
   Used: 438 GB
   Free: 21 GB

✅ You need ~15GB free space.
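
The check above only prints the numbers. A minimal sketch of turning it into a guard that warns before a long download starts; `has_enough_space` and its 15 GB default are illustrative names and values, not part of the original notebook.

```python
import shutil


def has_enough_space(path: str, required_gb: float = 15.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Hypothetical usage: warn before starting a multi-gigabyte download
if not has_enough_space(".", required_gb=15.0):
    print("⚠️ Less than 15 GB free; the download or extraction may fail.")
```

Extraction needs room for both the zip archives and their unpacked contents, which is why the threshold is roughly twice the download size.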


## 2. HuggingFace Authentication

Enter your HuggingFace token with **write** permissions.  
Get one at: https://huggingface.co/settings/tokens

In [17]:
from huggingface_hub import notebook_login

# Login to HuggingFace (will prompt for token)
notebook_login()
print("✅ Logged in to HuggingFace!")

✅ Logged in to HuggingFace!
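
Outside a notebook the interactive prompt may not be available. A minimal sketch, assuming the token is stored in the `HF_TOKEN` environment variable; `get_hf_token` and `login_from_env` are hypothetical helper names, while `huggingface_hub.login(token=...)` is the library's non-interactive login call.

```python
import os


def get_hf_token(env=os.environ):
    """Return the token from HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env():
    """Log in non-interactively when HF_TOKEN is set; return False otherwise."""
    token = get_hf_token()
    if token is None:
        print("HF_TOKEN not set; fall back to notebook_login().")
        return False
    # Imported lazily so get_hf_token() has no third-party dependency
    from huggingface_hub import login
    login(token=token)
    return True
```

Keeping the token in an environment variable (or Colab's secrets panel) avoids pasting it into a cell whose output might be saved with the notebook.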


In [18]:
# Configure your dataset repository
HF_USERNAME = "matheus-rech"  # Your HuggingFace username
DATASET_NAME = "pitvqa-surgical"  # Dataset name on HuggingFace
REPO_ID = f"{HF_USERNAME}/{DATASET_NAME}"

print(f"📦 Dataset will be uploaded to: https://huggingface.co/datasets/{REPO_ID}")

📦 Dataset will be uploaded to: https://huggingface.co/datasets/matheus-rech/pitvqa-surgical


## 3. Download PitVQA Dataset

Downloading from UCL Research Data Repository.  
Source: https://doi.org/10.5522/04/27004666

In [19]:
import os
from tqdm import tqdm
import urllib.request

# Detect if running in Colab or locally
import importlib.util
try:
    IN_COLAB = importlib.util.find_spec("google.colab") is not None
except (ModuleNotFoundError, ValueError):
    IN_COLAB = False

if IN_COLAB:
    BASE_DIR = "/content"
else:
    BASE_DIR = os.path.join(os.getcwd(), "data")

# Create directories
DOWNLOAD_DIR = os.path.join(BASE_DIR, "pitvqa_download")
DATA_DIR = os.path.join(BASE_DIR, "pitvqa_data")

os.makedirs(DOWNLOAD_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

print(f"📍 Running in: {'Google Colab' if IN_COLAB else 'Local environment'}")
print(f"📂 Download directory: {DOWNLOAD_DIR}")
print(f"📂 Data directory: {DATA_DIR}\n")

# Download URLs from UCL RDR
DOWNLOAD_URLS = {
    "videos": "https://rdr.ucl.ac.uk/ndownloader/files/49158880",
    "annotations": "https://rdr.ucl.ac.uk/ndownloader/files/49228108",
    "frame_annotations": "https://rdr.ucl.ac.uk/ndownloader/files/49228111"
}

print("📥 Starting download... This may take 30-60 minutes.")
print("   (Dataset is ~7.56 GB total)\n")

📍 Running in: Local environment
📂 Download directory: /Users/matheusrech/Downloads/pitvqa-surgical-workflow/notebooks/data/pitvqa_download
📂 Data directory: /Users/matheusrech/Downloads/pitvqa-surgical-workflow/notebooks/data/pitvqa_data

📥 Starting download... This may take 30-60 minutes.
   (Dataset is ~7.56 GB total)



In [None]:
# Download with progress bar
def download_with_progress(url, filename):
    """Download a file with a progress bar, skipping files that are already complete."""
    filepath = os.path.join(DOWNLOAD_DIR, filename)
    # Write to a temporary ".part" file so that an interrupted download
    # is never mistaken for a finished one on the next run
    partial_path = filepath + ".part"

    if os.path.exists(filepath):
        print(f"⏭️  {filename} already exists, skipping...")
        return filepath

    print(f"📥 Downloading {filename}...")

    block_size = 8192
    # The context managers close the connection, the file, and the bar on error too
    with urllib.request.urlopen(url) as response:
        # Get file size (0 if the server does not report it)
        total_size = int(response.headers.get('content-length', 0))

        with open(partial_path, 'wb') as f, tqdm(
            total=total_size, unit='iB', unit_scale=True
        ) as progress_bar:
            while True:
                buffer = response.read(block_size)
                if not buffer:
                    break
                f.write(buffer)
                progress_bar.update(len(buffer))

    # Only a fully downloaded file receives its final name
    os.replace(partial_path, filepath)
    print(f"✅ Downloaded: {filename}")
    return filepath

# Download all files
downloaded_files = {}
for name, url in DOWNLOAD_URLS.items():
    downloaded_files[name] = download_with_progress(url, f"{name}.zip")

print("\n✅ All downloads complete!")

📥 Downloading videos.zip...


 31%|███       | 2.50G/8.11G [2:40:14<3:47:40, 411kiB/s]   

In [None]:
# Check downloaded files
import subprocess
result = subprocess.run(['ls', '-lh', DOWNLOAD_DIR], capture_output=True, text=True)
print(result.stdout)
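
A directory listing shows file sizes but not whether an archive is intact. A minimal sketch of an integrity check using only the standard library; `is_valid_zip` is a hypothetical helper, not part of the original notebook.

```python
import zipfile


def is_valid_zip(path: str) -> bool:
    """Return True if `path` is a readable zip archive whose members pass their CRC checks."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False
```

Calling it on each path in `downloaded_files` before extraction would catch truncated downloads early. Note that `testzip()` reads every member, so on a multi-gigabyte archive it can take several minutes.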

## 4. Extract & Organize Data

In [None]:
import zipfile
import shutil

# Extract all zip files
print("📦 Extracting files...\n")

for name, filepath in downloaded_files.items():
    if filepath and os.path.exists(filepath):
        extract_dir = os.path.join(DATA_DIR, name)
        os.makedirs(extract_dir, exist_ok=True)
        
        print(f"📂 Extracting {name}...")
        try:
            with zipfile.ZipFile(filepath, 'r') as zip_ref:
                zip_ref.extractall(extract_dir)
            print(f"   ✅ Extracted to {extract_dir}")
        except zipfile.BadZipFile:
            print(f"   ⚠️  {filepath} is not a zip file, copying directly...")
            shutil.copy(filepath, extract_dir)

print("\n✅ Extraction complete!")

In [None]:
# Explore the extracted structure
import subprocess
result = subprocess.run(['find', DATA_DIR, '-type', 'f'], capture_output=True, text=True)
files = result.stdout.strip().split('\n')
print('\n'.join(files[:50]))
print("\n...")
print(f"{len(files)} total files")
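
The `find` command above is not available on every platform. A portable sketch with `pathlib` that also groups the files by extension, which gives a quick picture of what the archives contain; `count_by_extension` is an illustrative name.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root: str) -> Counter:
    """Count regular files under `root`, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

For example, `count_by_extension(DATA_DIR).most_common()` lists the extensions from most to least frequent.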

In [None]:
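# Check disk usage
import glob
import subprocess

# Per-folder usage: expand the wildcard in Python, because an argument list
# combined with shell=True would run a bare `du` and ignore the path
subdirs = sorted(glob.glob(os.path.join(DATA_DIR, '*')))
if subdirs:
    result = subprocess.run(['du', '-sh', *subdirs], capture_output=True, text=True)
    print(result.stdout)

# Total usage
result = subprocess.run(['du', '-sh', DATA_DIR], capture_output=True, text=True)
print(result.stdout)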

## 5. Create HuggingFace Dataset Structure

In [None]:
import json
import pandas as pd
from pathlib import Path

# Find annotation files
data_path = Path(DATA_DIR)

# List all JSON and CSV files
json_files = list(data_path.rglob("*.json"))
csv_files = list(data_path.rglob("*.csv"))

print(f"Found {len(json_files)} JSON files")
print(f"Found {len(csv_files)} CSV files")

# Show first few
print("\nJSON files:")
for f in json_files[:5]:
    print(f"  {f}")
    
print("\nCSV files:")
for f in csv_files[:5]:
    print(f"  {f}")

In [None]:
# Load and inspect annotation structure
if json_files:
    sample_json = json_files[0]
    print(f"Sample JSON: {sample_json}")
    with open(sample_json, 'r') as f:
        sample_data = json.load(f)
    
    if isinstance(sample_data, list):
        print(f"\nType: List with {len(sample_data)} items")
        print(f"First item keys: {sample_data[0].keys() if sample_data else 'empty'}")
        if sample_data:
            print(f"\nSample item:")
            print(json.dumps(sample_data[0], indent=2)[:500])
    elif isinstance(sample_data, dict):
        print(f"\nType: Dict with keys: {sample_data.keys()}")
        print(f"\nSample:")
        print(json.dumps(sample_data, indent=2)[:500])
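
Once the annotation layout is known, the records can be flattened into rows and wrapped in a `datasets.Dataset`. The sketch below assumes, purely for illustration, that each frame has a text annotation made of `question|answer` lines; the real layout should be confirmed with the inspection cells above. `parse_qa_lines`, `rows_to_dataset`, and the `frame_id` field are hypothetical names.

```python
def parse_qa_lines(text: str, frame_id: str) -> list[dict]:
    """Turn 'question|answer' lines into one row per QA pair (the format is an assumption)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        question, answer = line.split("|", 1)
        rows.append(
            {"frame_id": frame_id, "question": question.strip(), "answer": answer.strip()}
        )
    return rows


def rows_to_dataset(rows):
    """Build a datasets.Dataset from the rows (requires the `datasets` package)."""
    from datasets import Dataset
    return Dataset.from_list(rows)
```

A dataset built this way can be pushed with `push_to_hub`, which stores it in a form that `load_dataset` can read without extra configuration.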

In [None]:
# Create a README for the dataset
readme_content = '''---
license: cc-by-nc-nd-4.0
task_categories:
  - visual-question-answering
  - video-classification
language:
  - en
tags:
  - medical
  - surgical
  - pituitary-surgery
  - vqa
  - endoscopic
  - neurosurgery
size_categories:
  - 100K<n<1M
---

# PitVQA: Visual Question Answering in Pituitary Surgery

## Dataset Description

PitVQA is a dataset for Visual Question Answering (VQA) in endoscopic pituitary surgery.

### Dataset Summary

- **25 videos** of endoscopic pituitary surgeries
- **109,173 frames** extracted at 1 fps
- **884,242 question-answer pairs** (~8 per frame)
- **59 annotation classes**:
  - 4 surgical phases
  - 15 surgical steps
  - 18 surgical instruments
  - 3 instrument presence variations
  - 5 instrument positions
  - 14 operation notes

### Source

Original dataset from UCL Research Data Repository:  
https://doi.org/10.5522/04/27004666

Collected at the National Hospital for Neurology and Neurosurgery, London, UK.

### Citation

```bibtex
@article{he2024pitvqa,
  title={PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery},
  author={He, Runlong and Xu, Mengya and Das, Adrito and Khan, Danyal Z. and Bano, Sophia and Marcus, Hani J. and Stoyanov, Danail and Clarkson, Matthew J. and Islam, Mobarakol},
  journal={arXiv preprint arXiv:2405.13949},
  year={2024}
}
```

### License

CC BY-NC-ND 4.0 (Non-commercial, No Derivatives)

### Related Resources

- [PitVQA Paper (arXiv)](https://arxiv.org/abs/2405.13949)
- [PitVQA GitHub](https://github.com/mobarakol/PitVQA)
- [MICCAI PitVis Challenge](https://www.synapse.org/Synapse:syn51232283)

---

*Uploaded to HuggingFace Hub for research purposes.*
'''

readme_path = os.path.join(DATA_DIR, "README.md")
with open(readme_path, "w") as f:
    f.write(readme_content)

print("✅ Created README.md for dataset")
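
The figures quoted in the dataset card can be cross-checked with a little arithmetic. This sketch only restates the numbers from the card above and confirms that the class breakdown sums to 59 and that the QA density is roughly 8 pairs per frame.

```python
# Numbers copied from the dataset card above
class_counts = {
    "surgical phases": 4,
    "surgical steps": 15,
    "surgical instruments": 18,
    "instrument presence variations": 3,
    "instrument positions": 5,
    "operation notes": 14,
}
frames = 109_173
qa_pairs = 884_242

total_classes = sum(class_counts.values())
pairs_per_frame = qa_pairs / frames

print(f"Total annotation classes: {total_classes}")
print(f"QA pairs per frame: {pairs_per_frame:.2f}")
```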

## 6. Upload to HuggingFace Hub

In [None]:
from huggingface_hub import HfApi, create_repo

api = HfApi()

# Create the dataset repository
try:
    create_repo(
        repo_id=REPO_ID,
        repo_type="dataset",
        private=False,
        exist_ok=True
    )
    print(f"✅ Repository created/exists: https://huggingface.co/datasets/{REPO_ID}")
except Exception as e:
    print(f"⚠️ Note: {e}")

In [None]:
# Upload the entire dataset folder
print(f"📤 Uploading dataset to HuggingFace Hub...")
print(f"   This may take 30-60 minutes for ~7.5GB")
print(f"   Repository: https://huggingface.co/datasets/{REPO_ID}\n")

api.upload_folder(
    folder_path=DATA_DIR,
    repo_id=REPO_ID,
    repo_type="dataset",
    commit_message="Upload PitVQA dataset from UCL RDR",
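    # Patterns are matched against the full relative path, so macOS metadata
    # inside the extracted subfolders needs the "*/" variants as well
    ignore_patterns=["*.zip", "__MACOSX/*", "*/__MACOSX/*", ".DS_Store", "*/.DS_Store"]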
)

print(f"\n🎉 Upload complete!")
print(f"\n📦 Dataset available at: https://huggingface.co/datasets/{REPO_ID}")
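
Before a long upload it can help to preview which files the ignore patterns leave in place. A minimal sketch, assuming that patterns are matched `fnmatch`-style against each file's path relative to the folder (worth confirming in the `huggingface_hub` documentation); `preview_upload` is a hypothetical helper.

```python
import fnmatch
import os


def preview_upload(folder: str, ignore_patterns: list[str]) -> list[tuple[str, int]]:
    """List (relative_path, size_in_bytes) for files that no pattern excludes."""
    kept = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, folder).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, pattern) for pattern in ignore_patterns):
                continue  # this file would be skipped
            kept.append((rel, os.path.getsize(full)))
    return sorted(kept)
```

Summing the sizes, for instance `sum(size for _, size in preview_upload(DATA_DIR, patterns))`, gives an estimate of how much data will be transferred.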

## 7. Verify Upload

In [None]:
# List files in the uploaded repository
from huggingface_hub import list_repo_files

files = list_repo_files(REPO_ID, repo_type="dataset")
print(f"📂 Files in repository ({len(files)} total):\n")
for f in files[:20]:
    print(f"  {f}")
if len(files) > 20:
    print(f"  ... and {len(files) - 20} more files")
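
Listing the first 20 remote files does not show whether anything is missing. A sketch that compares the local folder against the list returned by `list_repo_files`; `missing_from_remote` and its `skip_suffixes` default are illustrative and should mirror whatever ignore patterns were used for the upload.

```python
import os


def missing_from_remote(local_dir, remote_files, skip_suffixes=(".zip", ".DS_Store")):
    """Return local files (as repo-style relative paths) that are absent from `remote_files`."""
    remote = set(remote_files)
    missing = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            if name.endswith(skip_suffixes):
                continue  # these were never meant to be uploaded
            rel = os.path.relpath(os.path.join(dirpath, name), local_dir).replace(os.sep, "/")
            if rel not in remote:
                missing.append(rel)
    return sorted(missing)
```

An empty result from `missing_from_remote(DATA_DIR, files)` suggests that every local file reached the repository.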

In [None]:
# Test loading the dataset
from datasets import load_dataset

print("🔄 Testing dataset loading...")
try:
    # Try to load (this verifies the upload worked). trust_remote_code is not
    # passed: the repository holds only data files, not a loading script.
    ds = load_dataset(REPO_ID)
    print(f"\n✅ Dataset loads successfully!")
    print(ds)
except Exception as e:
    print(f"\n⚠️ Note: {e}")
    print("Dataset uploaded, but automatic loading may need a restructured layout or explicit data_files.")
    print(f"Files are accessible at: https://huggingface.co/datasets/{REPO_ID}/tree/main")

## 8. Cleanup (Optional)

In [None]:
# Clean up downloaded files to free disk space
cleanup = input("Delete local files to free disk space? (y/n): ")

if cleanup.lower() == 'y':
    import shutil
    shutil.rmtree(DOWNLOAD_DIR, ignore_errors=True)
    shutil.rmtree(DATA_DIR, ignore_errors=True)
    print("✅ Cleaned up local files")
else:
    print("⏭️ Keeping local files")

# Check disk usage
if IN_COLAB:
    import subprocess
    result = subprocess.run(['df', '-h', '/content'], capture_output=True, text=True)
    print(result.stdout)
else:
    import shutil
    total, used, free = shutil.disk_usage(BASE_DIR)
    print(f"\n💾 Disk usage:")
    print(f"   Total: {total // (1024**3)} GB")
    print(f"   Used: {used // (1024**3)} GB")
    print(f"   Free: {free // (1024**3)} GB")

---

## ✅ Done!

Your PitVQA dataset is now on HuggingFace Hub!

### Next Steps

1. **View your dataset**: https://huggingface.co/datasets/matheus-rech/pitvqa-surgical
2. **Use in training**:
   ```python
   from datasets import load_dataset
   ds = load_dataset("matheus-rech/pitvqa-surgical")
   ```
3. **Stream without downloading**:
   ```python
   ds = load_dataset("matheus-rech/pitvqa-surgical", streaming=True)
   ```
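
With streaming, examples arrive lazily, so a quick look at the first few is cheap. A minimal sketch; `peek` and `peek_remote` are hypothetical helpers, and the `"train"` split name is an assumption that depends on how the repository ends up being structured.

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first `n` items of any iterable without consuming more than needed."""
    return list(islice(examples, n))


def peek_remote(repo_id="matheus-rech/pitvqa-surgical", split="train", n=3):
    """Stream the dataset and return its first `n` examples (needs `datasets` and network access)."""
    from datasets import load_dataset
    stream = load_dataset(repo_id, split=split, streaming=True)
    return peek(stream, n)
```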

### Project Links

- GitHub: https://github.com/matheus-rech/pitvqa-surgical-workflow
- Dataset: https://huggingface.co/datasets/matheus-rech/pitvqa-surgical

---

*Generated with Claude Code for MICCAI 2026 project*