# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/document_visual_ai_with_fiftyone_workshop/blob/main/01_loading_document_datasets.ipynb)


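Let's install the dependencies for this notebook. Note that `pdf2image` also requires the Poppler utilities to be installed on your system.


In [None]:
!pip install fiftyone pdf2image gdown tqdm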

## Download datasets


Let's start by downloading a folder from my Google Drive account. This folder contains:

- A zip file with PDFs of the first page of Visual AI papers at NeurIPS 2025

- A JSON file with metadata

Let's go ahead and download the data.

⚠️ This will take ~2GB of disk space

In [None]:
import gdown

folder_id = '1WK-cPumZ2FeiKXEcdD0wDnN0MlBDzCq_'

gdown.download_folder(id=folder_id, output='data')

Now we can extract the files:

In [None]:
!unzip data/neurips_vision_papers.zip -d vision_papers_pdfs
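
If the `unzip` command is not available in your environment, the same extraction can be sketched with Python's standard library. The helper name `extract_zip` is hypothetical and not part of this workshop's code:

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, output_dir):
    """Extract every member of zip_path into output_dir.

    Returns the paths of the extracted files (directory entries are skipped).
    NOTE: extract_zip is a hypothetical helper, shown only as an alternative
    to the `unzip` shell command.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        names = archive.namelist()

    # Directory entries end with "/", so keep only regular files
    return [output_dir / name for name in names if not name.endswith("/")]
```

Calling `extract_zip("data/neurips_vision_papers.zip", "vision_papers_pdfs")` should leave the files in the same place as the shell command above.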

We need images to create a [FiftyOne Dataset](https://docs.voxel51.com/user_guide/using_datasets.html). The following code converts the first page of each PDF we just downloaded to a PNG image so we can create [Samples](https://docs.voxel51.com/api/fiftyone.core.sample.html#module-fiftyone.core.sample) and add them to a Dataset.

In [1]:
from pdf2image import convert_from_path
from pathlib import Path
from tqdm.auto import tqdm

def pdf_to_image(pdf_path, output_dir, dpi):
    """Convert PDF to PNG image."""
    Path(output_dir).mkdir(exist_ok=True)
    
    # Get arxiv_id from filename (remove _page1.pdf)
    arxiv_id = Path(pdf_path).stem.replace('_page1', '')
    output_path = Path(output_dir) / f"{arxiv_id}.png"
    
    # Skip if already exists
    if output_path.exists():
        return output_path
    
    # Convert PDF to image
    images = convert_from_path(pdf_path, first_page=1, last_page=1, dpi=dpi)
    images[0].save(output_path, 'PNG')
    
    return output_path

# Convert all PDFs to images
pdf_dir = Path("vision_papers_pdfs/neurips_vision_papers")
pdf_files = list(pdf_dir.glob("*_page1.pdf"))

for pdf_file in tqdm(pdf_files):
    pdf_to_image(pdf_file, output_dir="neurips_vision_papers_images", dpi=500)

  0%|          | 0/1133 [00:00<?, ?it/s]
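
The conversion names each image after the arXiv ID embedded in the PDF file name, which is what later lets us match images to metadata records. Here is a minimal sketch of that mapping in isolation. The helper names `arxiv_id_from_pdf` and `image_path_for` are hypothetical, and the example file name is assumed from the `*_page1.pdf` pattern used above:

```python
from pathlib import Path


def arxiv_id_from_pdf(pdf_path):
    """Return the arXiv ID encoded in a '<arxiv_id>_page1.pdf' file name."""
    # .stem drops only the final '.pdf' suffix, so dots inside the ID survive
    return Path(pdf_path).stem.replace("_page1", "")


def image_path_for(pdf_path, output_dir):
    """Return the PNG path that corresponds to the given PDF."""
    return Path(output_dir) / f"{arxiv_id_from_pdf(pdf_path)}.png"


# Assumed example file name, following the '*_page1.pdf' glob pattern
example = "vision_papers_pdfs/neurips_vision_papers/2510.11296v2_page1.pdf"
print(arxiv_id_from_pdf(example))
print(image_path_for(example, "neurips_vision_papers_images"))
```

Because the ID itself contains a dot, using `.stem` rather than splitting on the first `.` matters here.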

Next we will:

- Load the metadata from the JSON file (one JSON record per line)
- Create Samples for a FiftyOne Dataset
- Add the Samples to the Dataset
- Launch the FiftyOne App to explore what we have

Notice that we store `arxiv_category` as a [FiftyOne Classification](https://docs.voxel51.com/api/fiftyone.core.labels.html#fiftyone.core.labels.Classification) label rather than as a plain string.

In [3]:
import json

vision_papers = []
with open("data/neurips_2025_vision_papers.json", "r") as f:
    for line in f:
        vision_papers.append(json.loads(line))

print(f"Loaded {len(vision_papers)} papers")

Loaded 1134 papers
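
The loop above treats the file as JSON Lines: each line is parsed as its own JSON object. A slightly more defensive sketch of the same idea is shown below. It skips blank lines and reports the line number when parsing fails. The function name `load_json_lines` is hypothetical:

```python
import json


def load_json_lines(path):
    """Read a JSON Lines file and return its records as a list.

    NOTE: load_json_lines is a hypothetical helper, not part of this workshop.
    """
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                # Tolerate blank lines, e.g. a trailing newline at the end
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from err
    return records
```

With this helper, `vision_papers = load_json_lines("data/neurips_2025_vision_papers.json")` would be a drop-in replacement for the loop.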


In [None]:
import fiftyone as fo
from pathlib import Path

# Create FiftyOne dataset
dataset = fo.Dataset(
    "neurips-2025-vision-papers",  # name the dataset
    overwrite=True,  # replace any existing dataset with this name
    persistent=True,  # keep it persistent across Python sessions
    )

image_dir = Path("neurips_vision_papers_images")

# Add samples from vision_papers
samples = []
for paper in vision_papers:
    arxiv_id = paper['arxiv_id']
    
    # Build the path to this paper's image (its existence is not verified here)
    image_path = image_dir / f"{arxiv_id}.png"

    # Create sample
    sample = fo.Sample(filepath=str(image_path))
    
    # Add metadata fields
    sample["type"] = paper["type"]
    sample["name"] = paper["name"]
    sample["virtualsite_url"] = paper["virtualsite_url"]
    sample["abstract"] = paper["abstract"]
    sample["arxiv_id"] = arxiv_id
    sample["arxiv_authors"] = paper["arxiv_authors"]
    
    # Add classification for arxiv_category
    sample["arxiv_category"] = fo.Classification(label=paper["arxiv_category"])
    
    samples.append(sample)

# Add samples to dataset
dataset.add_samples(samples)
dataset.compute_metadata()
dataset.save()

print(f"Created dataset with {len(dataset)} samples")

You are running the oldest supported major version of MongoDB. Please refer to https://deprecation.voxel51.com for deprecation notices. You can suppress this exception by setting your `database_validation` config parameter to `False`. See https://docs.voxel51.com/user_guide/config.html#configuring-a-mongodb-connection for more information
 100% |███████████████| 1134/1134 [229.0ms elapsed, 0s remaining, 5.0K samples/s]      
Computing metadata...
 100% |███████████████| 1134/1134 [638.7ms elapsed, 0s remaining, 1.8K samples/s]     
Created dataset with 1134 samples
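
The conversion progress bar earlier reports 1133 PDFs while the metadata file holds 1134 records, so at least one record may point to an image that does not exist. A minimal sketch for separating records with and without an image is shown below. The helper name `split_by_image_presence` is hypothetical:

```python
from pathlib import Path


def split_by_image_presence(papers, image_dir):
    """Split metadata records into (present, missing) by image availability.

    A record counts as present when '<arxiv_id>.png' exists in image_dir.
    NOTE: split_by_image_presence is a hypothetical helper.
    """
    image_dir = Path(image_dir)
    present, missing = [], []
    for paper in papers:
        image_path = image_dir / f"{paper['arxiv_id']}.png"
        if image_path.exists():
            present.append(paper)
        else:
            missing.append(paper)
    return present, missing
```

Running `present, missing = split_by_image_presence(vision_papers, image_dir)` before building the samples would let you inspect `missing` and decide whether to skip those records.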


You can inspect the Dataset as follows:

In [5]:
dataset

Name:        neurips-2025-vision-papers
Media type:  image
Num samples: 1134
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    type:             fiftyone.core.fields.StringField
    name:             fiftyone.core.fields.StringField
    virtualsite_url:  fiftyone.core.fields.StringField
    abstract:         fiftyone.core.fields.StringField
    arxiv_id:         fiftyone.core.fields.StringField
    arxiv_authors:    fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    arxiv_category:   fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifica

And you can call the [first()](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.first) method of the Dataset to inspect the first Sample and the values stored in its fields:

In [6]:
dataset.first()

<Sample: {
    'id': '69139a206c4d6c2f8f6d04c4',
    'media_type': 'image',
    'filepath': '/Users/harpreetsahota/workspace/document_visual_ai_with_fiftyone_workshop/neurips_vision_papers_images/2510.11296v2.png',
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': 1173920,
        'mime_type': 'image/png',
        'width': 4250,
        'height': 5500,
        'num_channels': 3,
    }>,
    'created_at': datetime.datetime(2025, 11, 11, 20, 18, 40, 882000),
    'last_modified_at': datetime.datetime(2025, 11, 11, 20, 18, 41, 190000),
    'type': 'Poster',
    'name': '$\\Delta \\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization',
    'virtualsite_url': 'https://neurips.cc/virtual/2025/poster/116579',
    'abstract': "Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inev
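
The `width` and `height` reported in the metadata above follow from the `dpi=500` setting used during conversion: pixels are roughly page inches times DPI. The sketch below assumes the pages are US Letter (8.5 × 11 inches), which is not stated in this notebook but is consistent with the reported dimensions. The function name `page_pixels` is hypothetical:

```python
def page_pixels(width_in, height_in, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumption: US Letter pages (8.5 x 11 inches)
print(page_pixels(8.5, 11, 500))  # (4250, 5500)
print(page_pixels(8.5, 11, 200))  # (1700, 2200)
```

If disk space or loading time becomes a concern, a lower `dpi` value produces proportionally smaller images.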

You can launch the [FiftyOne App](https://docs.voxel51.com/user_guide/app.html) and visualize the entire dataset as follows:

In [None]:
session = fo.launch_app(dataset, auto=False)
session.url