# IIIF Workflow: Preparing for Text Analysis

#### For this tutorial, make sure you are using Python [conda env:base] as your kernel

This notebook shows how to prepare files for text analysis of digitized archival materials available in UCLA Library Digital Collections. The workflow also provides a good introduction to using Python alongside the command line.

**Steps used in this workflow:**

1. Download JPEG images from IIIF-based digital libraries (PYTHON)
2. Convert JPEGs to PDFs (PYTHON)
3. Render PDFs as text-searchable (OCR the PDFs) (BASH)
4. Convert PDFs to .txt files (BASH)

## Step 1: Download JPEG images from IIIF-based digital libraries
This section is based on ChatGPT-assisted code, written after failed attempts to use the Python packages iiif-download (https://pypi.org/project/iiif-download/) and iiif-downloader (https://github.com/YaleDHLab/iiif-downloader, https://github.com/ClaudioMartino/IIIF-Downloader). I may do further research to see whether UCLA intentionally blocks IIIF downloaders, which would explain why this more complicated code is needed.

In [None]:
#confirm the parent directory where your files will be downloaded
pwd

In [None]:
#create the sub-directory for your project
import os

# Define the folder path
folder_path = "iiif-studentactivism"

# Create the folder
os.makedirs(folder_path, exist_ok=True)  # exist_ok=True avoids error if folder exists

print(f"Folder '{folder_path}' created successfully!")

In [None]:
#install the requests library, which the next cell uses to fetch the IIIF manifest and download all JPEGs associated with the digital object described in the manifest
%pip install requests

In [None]:
#Download images from the first manifest
import os
import requests
from urllib.parse import urlparse

# === CONFIG ===
manifest_url = "https://iiif.library.ucla.edu/ark%3A%2F21198%2Fz1xq34bs/manifest"

# Create a folder name from the last segment of the manifest URL path (for this URL, "manifest")
parsed_url = urlparse(manifest_url)
folder_name = os.path.splitext(os.path.basename(parsed_url.path))[0]
output_dir = os.path.join("iiif-studentactivism", folder_name)
os.makedirs(output_dir, exist_ok=True)

# Step 1: Download manifest JSON
resp = requests.get(manifest_url)
resp.raise_for_status()
manifest = resp.json()

# Step 2: Loop through canvases
canvases = manifest.get("items", [])
print(f"Found {len(canvases)} canvases.")

image_counter = 1
for canvas in canvases:
    items = canvas.get("items", [])
    
    for item in items:
        for body in item.get("items", []):
            image_info = body.get("body", body)
            service = image_info.get("service")
            
            # If service is a list, take the first
            if isinstance(service, list):
                service = service[0]
            
            if service:
                iiif_id = service.get("@id") or service.get("id")
                if iiif_id:
                    # Full-resolution IIIF URL
                    image_url = f"{iiif_id}/full/full/0/default.jpg"
                    
                    # Generic filename
                    filename = f"image_{image_counter}.jpg"
                    image_path = os.path.join(output_dir, filename)
                    
                    # Download image
                    r = requests.get(image_url)
                    if r.status_code == 200:
                        with open(image_path, "wb") as f:
                            f.write(r.content)
                        print(f"Downloaded {filename}")
                        image_counter += 1
                    else:
                        print(f"Failed to download image {image_counter}, status: {r.status_code}")
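
The manifest traversal above can also be factored into a small helper, which makes it possible to check the logic without any network access. The following is a minimal sketch, assuming a IIIF Presentation 3 style manifest (canvases under `items`), as in the cell above. The function name `iter_image_urls` and the sample manifest are hypothetical and exist only for illustration. The `full/full/0/default.jpg` request pattern is the same one used above.

```python
def iter_image_urls(manifest):
    """Yield one full-size image URL per image service found in a manifest dict."""
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", annotation)
                service = body.get("service")
                # A service may be given as a list; take the first entry if so
                if isinstance(service, list):
                    service = service[0] if service else None
                if not service:
                    continue
                iiif_id = service.get("@id") or service.get("id")
                if iiif_id:
                    yield f"{iiif_id.rstrip('/')}/full/full/0/default.jpg"


# Hypothetical, heavily trimmed manifest used only to illustrate the traversal
sample_manifest = {
    "items": [
        {"items": [{"items": [{"body": {"service": [{"id": "https://example.org/iiif/page1"}]}}]}]},
        {"items": [{"items": [{"body": {"service": {"@id": "https://example.org/iiif/page2"}}}]}]},
    ]
}

urls = list(iter_image_urls(sample_manifest))
for url in urls:
    print(url)
```

With a real manifest, the same helper could be called on the `manifest` dictionary loaded above, and each URL passed to `requests.get`.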

## Step 2: Convert JPEGs to PDFs: Python
Code source: https://stackoverflow.com/questions/27327513/create-pdf-from-a-list-of-images

Note: for a small number of files, right-clicking and choosing "convert to PDF" may suffice; for a large number of files (or if you want to practice your Python skills), follow these steps!

In [37]:
# Source - https://stackoverflow.com/a
# Posted by ilovecomputer, modified by community. See post 'Timeline' for change history
# Retrieved 2025-12-03, License - CC BY-SA 4.0

from PIL import Image  
# install by > python3 -m pip install --upgrade Pillow  # ref. https://pillow.readthedocs.io/en/latest/installation.html#basic-installation

# Open image_1.jpg through image_29.jpg in page order
images = [
    Image.open(f"/Users/mollyhaigh/iiif-studentactivism/manifest/image_{i}.jpg")
    for i in range(1, 30)
]

pdf_path = "/Users/mollyhaigh/iiif-studentactivism/manifest/manifest.pdf"
    
images[0].save(
    pdf_path, "PDF", resolution=100.0, save_all=True, append_images=images[1:]
)
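
If the list of image files is read from the folder rather than written out, note that plain alphabetical sorting places `image_10.jpg` before `image_2.jpg`, which would scramble the page order in the PDF. The following is a minimal sketch of sorting by the number in each filename. The helper name `page_number` is hypothetical, and the sketch assumes filenames follow the `image_N.jpg` pattern produced in Step 1.

```python
import re


def page_number(filename):
    """Return the first integer found in a filename, or -1 if there is none."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


names = ["image_10.jpg", "image_2.jpg", "image_1.jpg"]
ordered = sorted(names, key=page_number)
print(ordered)  # ['image_1.jpg', 'image_2.jpg', 'image_10.jpg']
```

In practice, `names` could come from `os.listdir()` on the download folder, filtered to files ending in `.jpg`.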

## Step 3: Render PDFs as text-searchable (OCR the PDFs): Bash
This section is based on the Programming Historian tutorial: https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files

In [None]:
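# install ocrmypdf using %pip
# %pip is an IPython 'magic' command: it runs pip inside the environment of the notebook's current kernel
# note: ocrmypdf also relies on Tesseract and Ghostscript being installed on your system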
%pip install ocrmypdf

In [None]:
# If prior cell runs correctly, you will be prompted to restart the kernel. Go ahead and do so.

In [None]:
# Run this code to make sure ocrmypdf has installed
!ocrmypdf --version

In [None]:
#make sure you are in the directory that contains the pdf you need to OCR
!pwd

In [None]:
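# bash command line to OCR one PDF (here, the manifest.pdf created in Step 2)
# insert the name of the PDF file twice: first the input file, then the output file
# (using the same name for both overwrites the original with the OCRed version)
# note: the --clean option requires the separate 'unpaper' program; remove it if unpaper is not installed
!ocrmypdf --language eng --deskew --clean 'manifest.pdf' 'manifest.pdf'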

In [None]:
#the best way to test that the OCR worked is to manually open the PDF and use Control+F (Command+F on a Mac) to search for a word that appears on a page
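
To OCR many PDFs rather than one, the same command can be assembled in Python and run once per file. The following is a minimal sketch, not a definitive implementation. The function names `build_ocr_command` and `ocr_folder` and the `_ocr` output suffix are assumptions made for illustration (the cell above overwrites the file in place instead). The `--clean` flag is left out here because it needs the separate `unpaper` program.

```python
import subprocess
from pathlib import Path


def build_ocr_command(pdf_path, language="eng"):
    """Return the ocrmypdf command (as a list) for one PDF, writing to <name>_ocr.pdf."""
    pdf_path = Path(pdf_path)
    output_path = pdf_path.with_name(pdf_path.stem + "_ocr.pdf")
    return ["ocrmypdf", "--language", language, "--deskew", str(pdf_path), str(output_path)]


def ocr_folder(folder, run=False):
    """Build one command per PDF in a folder; execute them only if run=True."""
    pdfs = sorted(p for p in Path(folder).glob("*.pdf") if not p.stem.endswith("_ocr"))
    commands = [build_ocr_command(p) for p in pdfs]
    if run:
        for command in commands:
            # check=True raises an error if ocrmypdf reports a failure
            subprocess.run(command, check=True)
    return commands


command = build_ocr_command("manifest.pdf")
print(" ".join(command))
```

Calling `ocr_folder("iiif-studentactivism/manifest", run=True)` would then process every PDF in that folder, skipping files that already carry the `_ocr` suffix.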

## Step 4: Convert PDFs to .txt files: Bash
This section is based on the Programming Historian tutorial: https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files

Optional final step: use ChatGPT to clean up the outputs (see https://aigenealogyinsights.com/2023/03/22/ai-genealogy-use-case-cleaning-up-ocr-text/)

In [None]:
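#this step requires you to run some code OUTSIDE of this notebook, in your command line (Terminal on a Mac). Copy and paste the following two commands, one at a time:

#/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
#brew install poppler

#the first command installs Homebrew; the second installs Poppler, which provides the pdftotext command-line tool used below
#once Poppler is installed, you are ready to install the pdftotext Python package!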

In [None]:
%pip install pdftotext

In [None]:
cd iiif-studentactivism/manifest/

In [None]:
# convert the OCRed PDF from Step 3 to a plain-text file: input name first, output name second
!pdftotext 'manifest.pdf' 'manifest.txt'

In [None]:
#check that it worked by listing the files in the directory
# new .txt file means it was created
!ls
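
Before running text analysis, it can help to tidy the raw `.txt` output, since OCRed text usually keeps the line breaks and end-of-line hyphenation of the printed page. The following is a minimal sketch of such a cleanup using only regular expressions. The function name `tidy_ocr_text` and the sample string are hypothetical. Note that the dehyphenation rule will also join genuine hyphenated compounds that happen to break across lines, so results should be spot-checked.

```python
import re


def tidy_ocr_text(text):
    """Apply a few conservative cleanups to text extracted from an OCRed PDF."""
    # Rejoin words hyphenated across a line break: "activ-\nism" -> "activism"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # pdftotext separates pages with form feeds; treat them as paragraph breaks
    text = text.replace("\f", "\n\n")
    # Turn single newlines inside a paragraph into spaces, keeping blank lines
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "Student activ-\nism at UCLA\ngrew in  the 1960s.\n\nA second paragraph."
cleaned = tidy_ocr_text(sample)
print(cleaned)
```

To apply it to a file, read the text with `open(path, encoding="utf-8").read()`, pass it through `tidy_ocr_text`, and write the result to a new file so the original output is preserved.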
