# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/document_visual_ai_with_fiftyone_workshop/blob/main/01_loading_document_datasets.ipynb)


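Let's install the dependencies for this notebook. `pdf2image` relies on the Poppler command-line tools, which on Debian-based systems such as Colab are provided by the `poppler-utils` package:


In [None]:
!pip install fiftyone pdf2image gdown tqdm
!apt-get install -y -qq poppler-utils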

## Download datasets


Let's start by downloading a folder from my Google Drive account. This folder contains:

- A zip file with PDFs of the first page of Visual AI papers at NeurIPS 2025

- A JSON file with metadata (one JSON object per line)

Let's go ahead and download the data.

⚠️ This will take ~2GB of disk space

In [None]:
import gdown

folder_id = '1WK-cPumZ2FeiKXEcdD0wDnN0MlBDzCq_'

gdown.download_folder(id=folder_id, output='data')

Now we can extract the files:

In [None]:
!unzip data/neurips_vision_papers.zip -d vision_papers_pdfs
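If the `unzip` command isn't available in your environment, the same extraction can be done from Python. This is a minimal sketch using the standard library's `zipfile` module; the helper name `extract_zip` is made up for illustration, and the paths are the ones used in the cells above.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Paths taken from the cells above; the guard keeps this cell harmless
# if the download step has not been run yet.
archive = Path("data/neurips_vision_papers.zip")
if archive.exists():
    names = extract_zip(archive, "vision_papers_pdfs")
    print(f"Extracted {len(names)} entries")
```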

We need images to create a [FiftyOne Dataset](https://docs.voxel51.com/user_guide/using_datasets.html). The following code converts the first page of each PDF we just downloaded to a PNG image so we can create [Samples](https://docs.voxel51.com/api/fiftyone.core.sample.html#module-fiftyone.core.sample) and add them to a Dataset.

In [None]:
from pdf2image import convert_from_path
from pathlib import Path
from tqdm.auto import tqdm

def pdf_to_image(pdf_path, output_dir, dpi):
    """Render the first page of a PDF as a PNG and return the output path."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Derive arxiv_id from the filename stem (drop the "_page1" suffix)
    arxiv_id = Path(pdf_path).stem.replace('_page1', '')
    output_path = Path(output_dir) / f"{arxiv_id}.png"
    
    # Skip if already exists
    if output_path.exists():
        return output_path
    
    # Convert PDF to image
    images = convert_from_path(pdf_path, first_page=1, last_page=1, dpi=dpi)
    images[0].save(output_path, 'PNG')
    
    return output_path

# Convert all PDFs to images
pdf_dir = Path("vision_papers_pdfs/neurips_vision_papers")
pdf_files = list(pdf_dir.glob("*_page1.pdf"))

for pdf_file in tqdm(pdf_files):
    pdf_to_image(pdf_file, output_dir="neurips_vision_papers_images", dpi=500)
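To get a feel for what `dpi=500` means in practice, here is a rough back-of-the-envelope sketch. The pixel size of a rendered page is its physical size in inches multiplied by the DPI. The US Letter page size (8.5 x 11 inches) is an assumption for illustration; the actual PDFs may use a different size, and `rendered_size` is a made-up helper name.

```python
def rendered_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (in inches) rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)

# Assumption: US Letter pages (8.5 x 11 inches); actual page sizes may differ.
for dpi in (150, 300, 500):
    w, h = rendered_size(8.5, 11, dpi)
    print(f"{dpi} DPI -> {w} x {h} px (~{w * h / 1e6:.1f} megapixels)")
```

Under that assumption a 500 DPI render is 4250 x 5500 pixels, which goes some way towards explaining the disk space these images need. If you are short on space or time, a lower `dpi` value is a reasonable trade-off.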

Next we will:

- Load the metadata from the JSON file
- Create Samples for a FiftyOne Dataset
- Add the Samples to the Dataset
- Launch the FiftyOne App to explore what we have

Notice that we are storing `arxiv_category` as a [FiftyOne Classification](https://docs.voxel51.com/api/fiftyone.core.labels.html#fiftyone.core.labels.Classification).

In [None]:
import json

vision_papers = []
with open("data/neurips_2025_vision_papers.json", "r") as f:
    for line in f:
        vision_papers.append(json.loads(line))

print(f"Loaded {len(vision_papers)} papers")
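The loop above reads the file one line at a time, so it treats the metadata as JSON Lines (one JSON object per line) and would fail on a blank line. As a slightly more defensive sketch, the hypothetical helper below (`load_json_lines` is not part of any library) skips blank lines and reports which line failed to parse:

```python
import json

def load_json_lines(path):
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from err
    return records
```

It could be used in place of the loop above as `vision_papers = load_json_lines("data/neurips_2025_vision_papers.json")`.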

In [None]:
import fiftyone as fo
from pathlib import Path

# Create FiftyOne dataset
dataset = fo.Dataset(
    "neurips-2025-vision-papers",  # name of the dataset
    overwrite=True,  # replace any existing dataset with this name
    persistent=True,  # keep the dataset across Python sessions
)

image_dir = Path("neurips_vision_papers_images")

# Add samples from vision_papers
samples = []
for paper in vision_papers:
    arxiv_id = paper['arxiv_id']
    
    # Skip papers whose page image is missing
    image_path = image_dir / f"{arxiv_id}.png"
    if not image_path.exists():
        continue

    # Create sample
    sample = fo.Sample(filepath=str(image_path))
    
    # Add metadata fields
    sample["type"] = paper["type"]
    sample["name"] = paper["name"]
    sample["virtualsite_url"] = paper["virtualsite_url"]
    sample["abstract"] = paper["abstract"]
    sample["arxiv_id"] = arxiv_id
    sample["arxiv_authors"] = paper["arxiv_authors"]
    
    # Add classification for arxiv_category
    sample["arxiv_category"] = fo.Classification(label=paper["arxiv_category"])
    
    samples.append(sample)

# Add samples to dataset
dataset.add_samples(samples)
dataset.compute_metadata()
dataset.save()

print(f"Created dataset with {len(dataset)} samples")

### The Dataset

You now have a dataset that contains NeurIPS 2025 accepted papers focused on computer vision and related fields, enriched with arXiv metadata and first-page images.

It includes papers from multiple vision-related categories including Computer Vision (cs.CV), Multimedia (cs.MM), Image and Video Processing (eess.IV), Graphics (cs.GR), and Robotics (cs.RO).

Each entry includes paper metadata, abstracts, author information, and a high-resolution (500 DPI) PNG image of the paper's first page.

Let's inspect the Dataset.

Evaluating `dataset` in a notebook cell, or calling `print(dataset)`, displays a summary of the dataset: its name, media type, number of samples, whether it is persistent, and the schema of its sample fields.

This is a useful way to check your dataset after loading or creating it.

In [None]:
dataset

You can also call the Dataset's [first()](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.first) method to see the fields and values of a single Sample:

In [None]:
dataset.first()

We can get a sense of the distribution of `arxiv_category` as follows:

In [None]:
dataset.count_values("arxiv_category.label")
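Conceptually, `count_values` tallies how often each distinct value occurs in the given field and returns the result as a dictionary. A plain-Python sketch of the same idea, using a made-up list of labels rather than the real dataset:

```python
from collections import Counter

# Hypothetical labels standing in for the values of "arxiv_category.label"
labels = ["cs.CV", "cs.CV", "cs.RO", "eess.IV", "cs.CV"]

counts = dict(Counter(labels))
print(counts)

# Sort by frequency to see the dominant categories first
for label, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {n}")
```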

Now, let's [map these category labels](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.map_labels) to something more human-readable. We're doing this because, towards the end of this notebook, we'll use a visual document retrieval model to perform zero-shot classification of the document images.

Begin by [cloning the sample field](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.clone_sample_field):

In [None]:
dataset.clone_sample_field("arxiv_category", "arxiv_category_mapped")

mapping = {
    "cs.CV": "Computer Vision",
    "cs.MM": "Multimedia",
    "eess.IV": "Image and Video Processing",
    "cs.GR": "Graphics",
    "cs.RO": "Robotics",
}

view = dataset.map_labels("arxiv_category_mapped", mapping)
view.save()

And we can verify this worked:

In [None]:
dataset.count_values("arxiv_category_mapped.label")
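The relabeling step above can be pictured as a dictionary lookup applied to every label. The sketch below is plain Python, not FiftyOne code; `map_label` is a hypothetical helper, and it assumes that labels missing from the mapping are left unchanged, which is worth confirming against the counts printed above.

```python
mapping = {
    "cs.CV": "Computer Vision",
    "cs.MM": "Multimedia",
    "eess.IV": "Image and Video Processing",
    "cs.GR": "Graphics",
    "cs.RO": "Robotics",
}

def map_label(label, mapping):
    """Return the mapped name, or the original label if it has no mapping (assumed behavior)."""
    return mapping.get(label, label)

# "cs.LG" is a made-up example of a category that is not in the mapping
for label in ["cs.CV", "eess.IV", "cs.LG"]:
    print(f"{label} -> {map_label(label, mapping)}")
```

If a raw arXiv code still shows up in the mapped field's counts, it most likely means that category is missing from `mapping`.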

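You can launch the [FiftyOne App](https://docs.voxel51.com/user_guide/app.html) and visualize the entire dataset as follows. Passing `auto=False` keeps the App from being displayed automatically in the notebook output, so open the printed `session.url` in a browser tab instead: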

In [None]:
session = fo.launch_app(dataset, auto=False)
session.url

### Additional resources

You can check out other datasets that have already been parsed into FiftyOne format and are hosted on the Hugging Face Hub:

- [NutriGreen Dataset](https://huggingface.co/datasets/Voxel51/NutriGreen) - a collection of images representing branded packaged food products

- [CommonForms](https://huggingface.co/datasets/Voxel51/commonforms_val_subset) (subset of validation set) - contains 10,000 annotated document images with bounding boxes for three types of form fields: text inputs, choice buttons (checkboxes/radio buttons), and signature fields

- [Form Understanding in Noisy Scanned Documents](https://huggingface.co/datasets/Voxel51/form_understanding_in_noisy_scanned_documents_plus) - provides ground truth data for extracting structured information from scanned forms, including entity recognition and relationship extraction between form fields and their values.

- [Consolidated Receipt Dataset](https://huggingface.co/datasets/Voxel51/consolidated_receipt_dataset) - contains over 11,000 Indonesian receipts collected from shops and restaurants, featuring images with OCR annotations (bounding boxes and text) and multi-level semantic labels for parsing. This FiftyOne implementation provides an accessible interface for exploring the training split with 800 annotated receipt images.

- [Scanned Receipts OCR and Information Extraction Dataset](https://huggingface.co/datasets/Voxel51/scanned_receipts) - comprises 1,000 whole scanned receipt images collected from real-world scenarios.


- [Document Haystack (subset)](https://huggingface.co/datasets/Voxel51/document-haystack-10pages) - a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. This expands on the "Needle in a Haystack" concept by embedding needles (short key-value statements in pure text or as multimodal text+image snippets) within real-world documents.