# Step 1: Creating a Mini Dataset from RVL-CDIP

In this notebook, I create a mini dataset from the original RVL-CDIP dataset.
I randomly sample up to 100 images from each top-level folder, copy them into a single flat directory,
and rewrite the corresponding label files to reflect the new file paths.

In [2]:
# Import standard-library modules for file operations and random sampling.
import os
import shutil
import random

## Define Paths and Parameters

- **SOURCE_IMAGES**: Path to the original folder containing images.
- **SOURCE_LABELS**: Path to the original folder containing label files.
- **DEST_PATH**: Destination path for my mini dataset.
- **N_SAMPLES_PER_FOLDER**: Number of images I randomly select from each top-level folder.

The set `selected_images` is used to store the basenames of selected images for later label matching.


In [6]:
# Define source and destination paths for images and labels.
SOURCE_IMAGES = 'data/rvl-cdip/images'
SOURCE_LABELS = 'data/rvl-cdip/labels'
DEST_PATH = 'data/rvl-cdip-mini-dataset'
DEST_IMAGES = os.path.join(DEST_PATH, 'images')
DEST_LABELS = os.path.join(DEST_PATH, 'labels')

# Number of images to select from each top-level folder.
N_SAMPLES_PER_FOLDER = 100

# Set to store the basenames of selected images (used for label matching).
selected_images = set()
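
Because the images are later drawn with `random.sample`, each run of the notebook produces a different mini dataset. If reproducibility matters, one option is to seed a generator before sampling. The sketch below is not part of the original notebook: the value `SEED = 42` and the helper `sample_paths` are hypothetical names used only for illustration.

```python
import random

SEED = 42  # hypothetical value; any fixed integer would do

def sample_paths(paths, n, seed=SEED):
    """Return up to n paths, chosen reproducibly from a sorted copy of paths."""
    rng = random.Random(seed)   # local generator, leaves the global random state untouched
    ordered = sorted(paths)     # directory listing order is not guaranteed, so sort first
    return rng.sample(ordered, k=min(n, len(ordered)))

# The same inputs in a different order yield the same selection.
first = sample_paths(["c.tif", "a.tif", "b.tif", "d.tif"], 2)
second = sample_paths(["d.tif", "b.tif", "a.tif", "c.tif"], 2)
print(first == second)
```

Sorting before sampling matters here: without it, a different traversal order would give a different selection even with a fixed seed.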


## Copy Images

In this step, the code iterates over every top-level folder in the source images directory.
For each folder:
- Recursively gather all image files.
- Randomly select up to `N_SAMPLES_PER_FOLDER` images.
- Copy the selected images into the destination folder, maintaining a flat structure.


In [7]:
# Create the destination images folder (flat structure).
os.makedirs(DEST_IMAGES, exist_ok=True)

# Iterate over each top-level folder in SOURCE_IMAGES.
for top_folder in os.listdir(SOURCE_IMAGES):
    top_folder_path = os.path.join(SOURCE_IMAGES, top_folder)
    if os.path.isdir(top_folder_path):
        # Gather all image files recursively within this top-level folder.
        image_files = []
        for root, _, files in os.walk(top_folder_path):
            for file in files:
                if file.lower().endswith(('.png', '.jpg', '.jpeg', '.tif', '.tiff')):
                    image_files.append(os.path.join(root, file))
        
        # Randomly select up to N_SAMPLES_PER_FOLDER images.
        selected_files = random.sample(image_files, k=min(N_SAMPLES_PER_FOLDER, len(image_files)))
        
        # Copy each selected image to the destination folder (flat structure).
        for src_file in selected_files:
            # Use basename only since a flat structure is desired.
            base_name = os.path.basename(src_file)
            dest_file = os.path.join(DEST_IMAGES, base_name)
            shutil.copy2(src_file, dest_file)
            selected_images.add(base_name)

print("Mini dataset images created at:", DEST_IMAGES)

Mini dataset images created at: data/rvl-cdip-mini-dataset/images
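
Flattening the directory tree means that two source images with the same basename in different subfolders would map to the same destination file, and the later copy would silently overwrite the earlier one. Whether this actually happens in RVL-CDIP is not checked in this notebook. The following is a minimal sketch of how such collisions could be detected before copying; `find_basename_collisions` is a hypothetical helper, and the example paths are made up.

```python
import os
from collections import defaultdict

def find_basename_collisions(paths):
    """Map each basename that occurs more than once to the full paths sharing it."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}

# Illustrative paths only; they are not taken from the real dataset.
example_paths = [
    "imagesa/a/doc1/0001.tif",
    "imagesb/b/doc2/0001.tif",
    "imagesb/b/doc3/0002.tif",
]
collisions = find_basename_collisions(example_paths)
print(collisions)
```

An empty result would indicate that the flat layout is safe for the sampled files.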


## Copy and Update Labels

In this step:
- Create the destination folder for label files.
- Process each label file (e.g., `train.txt`, `test.txt`, `val.txt`) by updating the image paths to reflect the flat folder structure.
- Write only the lines for images that were selected.


In [8]:
# Create the destination labels folder.
os.makedirs(DEST_LABELS, exist_ok=True)

# Process each label file (e.g., train.txt, test.txt, val.txt) and update image paths.
for label_filename in os.listdir(SOURCE_LABELS):
    src_label_file = os.path.join(SOURCE_LABELS, label_filename)
    if not os.path.isfile(src_label_file):
        continue  # Skip subdirectories and other non-file entries.
    dest_label_file = os.path.join(DEST_LABELS, label_filename)
    
    with open(src_label_file, 'r', encoding='utf-8') as infile, \
         open(dest_label_file, 'w', encoding='utf-8') as outfile:
        for line in infile:
            parts = line.strip().split()
            if len(parts) < 2:
                continue  # Skip malformed lines
            image_path = parts[0]
            label = " ".join(parts[1:])
            
            # Update image path to flat structure: use only the basename.
            base_name = os.path.basename(image_path)
            
            # If the image was selected, write the updated label line with the flat image name.
            if base_name in selected_images:
                outfile.write(f"{base_name} {label}\n")

print("Mini dataset label files written to:", DEST_LABELS)


Mini dataset label files written to: data/rvl-cdip-mini-dataset/labels
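
To make the per-line transformation concrete, the sketch below isolates it in a small function. It assumes, as the cell above does, that each label line is an image path followed by a whitespace-separated label. The name `flatten_label_line` and the example paths are made up for illustration.

```python
import os

def flatten_label_line(line, selected):
    """Return the rewritten label line, or None if the line should be skipped."""
    parts = line.strip().split()
    if len(parts) < 2:
        return None                        # malformed line
    base_name = os.path.basename(parts[0])
    if base_name not in selected:
        return None                        # image was not sampled
    label = " ".join(parts[1:])
    return f"{base_name} {label}"

selected = {"0001.tif"}
kept = flatten_label_line("imagesa/a/doc1/0001.tif 3\n", selected)
dropped = flatten_label_line("imagesb/b/doc2/0002.tif 7\n", selected)
print(kept)
print(dropped)
```

Only the first line survives, with its nested path reduced to the bare file name.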


## Document Categories

The categories are numbered from 0 to 15 in the following order:

- 0. letter
- 1. form
- 2. email
- 3. handwritten
- 4. advertisement
- 5. scientific report
- 6. scientific publication
- 7. specification
- 8. file folder
- 9. news article
- 10. budget
- 11. invoice
- 12. presentation
- 13. questionnaire
- 14. resume
- 15. memo
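
For later experiments it can be convenient to have this mapping available in code. A minimal sketch follows; the names `CATEGORIES` and `label_name` are hypothetical, and the list simply restates the order given above.

```python
# Category names indexed by their numeric label (0 to 15).
CATEGORIES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

def label_name(label_id):
    """Translate a numeric label (int or numeric string) into its category name."""
    return CATEGORIES[int(label_id)]

print(label_name("3"))
```

Accepting numeric strings lets the function be applied directly to the label field read from a label file.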


## Summary

- I defined the paths and parameters for my mini dataset.
- A sample of images was copied from the original dataset into a new directory with a flat structure.
- I updated the label files so that they refer to the new image paths.

I plan to use this mini dataset in further document classification experiments.
