# IIIF Workflow: Preparing for Text Analysis

#### For this tutorial, make sure you are using Python [conda env:base] as your kernel

This notebook shows you how to prepare files for text analysis of digitized archival materials available in UCLA Library Digital Collections. The workflow is also a good introduction to using Python alongside the command line.

**Steps used in this workflow:**

1. Download JPEG images from IIIF-based digital libraries (BASH & PYTHON)
2. Convert JPEGs to PDFs (PYTHON)
3. Render PDFs as text-searchable (OCR the PDFs) (BASH)
4. Convert PDFs to .txt files (BASH)

## Step 1: Download JPEG images from IIIF-based digital libraries (BASH & PYTHON)
This section is based on ChatGPT-assisted code, written after failed attempts to use the Python packages iiif-download (https://pypi.org/project/iiif-download/) and iiif-downloader (https://github.com/YaleDHLab/iiif-downloader, https://github.com/ClaudioMartino/IIIF-Downloader). I may do further research to see whether UCLA intentionally blocks IIIF downloaders, which would explain why this more complicated code is needed.

In [1]:
#Bash
#confirm the parent directory where your files will be downloaded
!pwd

/Users/mhaigh/Downloads


In [2]:
#Python
#create the sub-directory for your project
import os

# Define the folder path
folder_path = "iiif-studentactivism"

# Create the folder
os.makedirs(folder_path, exist_ok=True)  # exist_ok=True avoids error if folder exists

print(f"Folder '{folder_path}' created successfully!")

Folder 'iiif-studentactivism' created successfully!


In [3]:
#Bash
#install iiif-download, a library that uses a IIIF manifest to download all JPEGs for the digital object the manifest describes
#(the download cell that follows does not import this package; it calls requests directly)
!pip install iiif-download



In [4]:
#Python
#Download images from the first manifest
import os
import requests
from urllib.parse import urlparse

# === CONFIG ===
manifest_url = "https://iiif.library.ucla.edu/ark%3A%2F21198%2Fz1xq34bs/manifest"

# Create a folder name based on manifest URL (safe folder name)
parsed_url = urlparse(manifest_url)
folder_name = os.path.splitext(os.path.basename(parsed_url.path))[0]
output_dir = os.path.join("iiif-studentactivism", folder_name)
os.makedirs(output_dir, exist_ok=True)

# Step 1: Download manifest JSON
resp = requests.get(manifest_url)
resp.raise_for_status()
manifest = resp.json()

# Step 2: Loop through canvases
canvases = manifest.get("items", [])
print(f"Found {len(canvases)} canvases.")

image_counter = 1
for canvas in canvases:
    items = canvas.get("items", [])
    
    for item in items:
        for body in item.get("items", []):
            image_info = body.get("body", body)
            service = image_info.get("service")
            
            # If service is a list, take the first
            if isinstance(service, list):
                service = service[0]
            
            if service:
                iiif_id = service.get("@id") or service.get("id")
                if iiif_id:
                    # Full-resolution IIIF URL
                    image_url = f"{iiif_id}/full/full/0/default.jpg"
                    
                    # Generic filename
                    filename = f"image_{image_counter}.jpg"
                    image_path = os.path.join(output_dir, filename)
                    
                    # Download image
                    r = requests.get(image_url)
                    if r.status_code == 200:
                        with open(image_path, "wb") as f:
                            f.write(r.content)
                        print(f"Downloaded {filename}")
                        image_counter += 1
                    else:
                        print(f"Failed to download image {image_counter}, status: {r.status_code}")

Found 29 canvases.
Downloaded image_1.jpg
Downloaded image_2.jpg
Downloaded image_3.jpg
Downloaded image_4.jpg
Downloaded image_5.jpg
Downloaded image_6.jpg
Downloaded image_7.jpg
Downloaded image_8.jpg
Downloaded image_9.jpg
Downloaded image_10.jpg
Downloaded image_11.jpg
Downloaded image_12.jpg
Downloaded image_13.jpg
Downloaded image_14.jpg
Downloaded image_15.jpg
Downloaded image_16.jpg
Downloaded image_17.jpg
Downloaded image_18.jpg
Downloaded image_19.jpg
Downloaded image_20.jpg
Downloaded image_21.jpg
Downloaded image_22.jpg
Downloaded image_23.jpg
Downloaded image_24.jpg
Downloaded image_25.jpg
Downloaded image_26.jpg
Downloaded image_27.jpg
Downloaded image_28.jpg
Downloaded image_29.jpg
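
The download cell above names its subfolder after the last segment of the manifest URL, which is `manifest` for this URL, so a second manifest would be saved into the same folder. The sketch below shows one possible way to derive a distinct folder name from the identifier in the URL, and wraps the canvas loop in a function that only collects image URLs. The function names and the sample manifest are hypothetical, and the code assumes the identifier is the URL segment just before `manifest`. The `size` parameter defaults to `full`, as in the cell above; as far as I know, servers that implement version 3 of the IIIF Image API expect `max` instead.

```python
import re
from urllib.parse import unquote, urlparse


def folder_name_from_manifest_url(manifest_url):
    """Return a filesystem-safe folder name built from the identifier in the URL."""
    segments = [s for s in urlparse(manifest_url).path.split("/") if s]
    if not segments:
        return "manifest"
    # Assumption: the identifier is the segment just before the trailing "manifest".
    identifier = segments[-2] if len(segments) >= 2 else segments[-1]
    # Decode %3A, %2F, etc., then replace anything that is not a letter or digit.
    return re.sub(r"[^A-Za-z0-9]+", "_", unquote(identifier)).strip("_")


def image_urls_from_manifest(manifest, size="full"):
    """Collect one full-region JPEG URL per image service found in the manifest."""
    urls = []
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", annotation)
                service = body.get("service")
                # The service may be a list or a single object.
                if isinstance(service, list):
                    service = service[0] if service else None
                if not service:
                    continue
                service_id = service.get("@id") or service.get("id")
                if service_id:
                    urls.append(f"{service_id}/full/{size}/0/default.jpg")
    return urls


# Tiny hand-made manifest (hypothetical data) showing the expected nesting.
sample_manifest = {
    "items": [
        {"items": [{"items": [{"body": {"service": [{"id": "https://example.org/iiif/page1"}]}}]}]},
        {"items": [{"items": [{"body": {"service": {"@id": "https://example.org/iiif/page2"}}}]}]},
    ]
}

ucla_url = "https://iiif.library.ucla.edu/ark%3A%2F21198%2Fz1xq34bs/manifest"
print(folder_name_from_manifest_url(ucla_url))
print(image_urls_from_manifest(sample_manifest))
```

Separating "find the URLs" from "download the files" also makes it easier to inspect the list before requesting 29 full-size images.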


## Step 2: Convert JPEGs to PDFs (PYTHON)
Code source: https://stackoverflow.com/questions/27327513/create-pdf-from-a-list-of-images

Note: for a small number of files, right-clicking and choosing "convert to PDF" may suffice; for a large number of files (or if you want to practice your Python skills), follow these steps!

In [5]:
# Source - https://stackoverflow.com/a
# Posted by ilovecomputer, modified by community. See post 'Timeline' for change history
# Retrieved 2025-12-03, License - CC BY-SA 4.0

from PIL import Image  
# install by > python3 -m pip install --upgrade Pillow  # ref. https://pillow.readthedocs.io/en/latest/installation.html#basic-installation

# Replace the directory below with the directory where your files were downloaded
image_dir = "/Users/mhaigh/Downloads/iiif-studentactivism/manifest/"

# Open image_1.jpg through image_29.jpg in page order
images = [
    Image.open(image_dir + f"image_{n}.jpg")
    for n in range(1, 30)
]

pdf_path = "/Users/mhaigh/Downloads/iiif-studentactivism/manifest/manifest.pdf"
    
images[0].save(
    pdf_path, "PDF", resolution=100.0, save_all=True, append_images=images[1:]
)
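
If you would rather not list or count the files by hand, the filenames can be read from the folder instead. One catch is that a plain alphabetical sort puts `image_10.jpg` before `image_2.jpg`, which would scramble the page order in the PDF. The sketch below shows a "natural" sort key that compares the digits as numbers. The helper names `natural_key` and `list_page_images` are my own, not part of Pillow or any other library.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that compares the digits in a filename as numbers, not text."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def list_page_images(folder, pattern="image_*.jpg"):
    """Return matching image paths in page order (image_2 before image_10)."""
    return sorted(Path(folder).glob(pattern), key=natural_key)


# Small demonstration on filenames only; no files are needed for this part.
names = ["image_10.jpg", "image_2.jpg", "image_1.jpg"]
ordered = sorted(names, key=natural_key)
print(ordered)
```

The result of `list_page_images(...)` could then be passed to `Image.open` in a list comprehension in place of the hard-coded filenames.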

## Step 3: Render PDFs as text-searchable (OCR the PDFs) (BASH)
This section is based on the Programming Historian tutorial: https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files

In [None]:
#Steps 3 and 4 require you to run some code OUTSIDE of this notebook, in your command line.

#Copy-paste the following into your command line to install Homebrew, and follow all of the instructions it gives you:
#/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

#Then, copy-paste the following in your command line
#brew install poppler pkg-config

#Then, copy-paste the following in your command line
#brew install unpaper

#Then, copy-paste the following in your command line
#brew install ghostscript

#Then you are ready to return to this notebook and install ocrmypdf!

In [7]:
# Install ocrmypdf using %pip
# Commands that start with % are Jupyter "magic" commands; %pip installs the package into the environment this notebook's kernel is using
%pip install ocrmypdf

Note: you may need to restart the kernel to use updated packages.


In [None]:
# If prior cell runs correctly, you will be prompted to restart the kernel. Go ahead and do so.

In [8]:
#Bash
# Run this code to make sure ocrmypdf has installed
!ocrmypdf --version

16.12.0


In [9]:
# Add the Homebrew bin directory to PATH so the notebook can find unpaper
import os
os.environ["PATH"] = "/opt/homebrew/bin:" + os.environ["PATH"]

In [10]:
!which unpaper
!unpaper --version

/opt/homebrew/bin/unpaper
7.0.0


In [11]:
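# Add the Homebrew and /usr/local bin directories to PATH so the notebook can find Ghostscript (gs)
import os
os.environ["PATH"] = "/opt/homebrew/bin:/usr/local/bin:" + os.environ["PATH"]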

In [12]:
!which gs
!gs --version

/opt/homebrew/bin/gs
10.06.0
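
The `!which` cells above check one tool at a time. As a possible alternative, the sketch below checks several tools in one go from Python, using `shutil.which` from the standard library. The list of tool names is my assumption about what this workflow needs: `tesseract` is included because ocrmypdf calls it for the OCR itself, and `pdftotext` because Step 4 uses it.

```python
import shutil


def missing_tools(tools):
    """Return the tool names that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed list of command-line tools used in Steps 3 and 4.
required = ["ocrmypdf", "tesseract", "gs", "unpaper", "pdftotext"]
missing = missing_tools(required)

if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If a tool is reported as missing even though Homebrew installed it, the PATH cells above are the first thing to re-run.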


In [13]:
# bash command line to OCR one PDF
# the input and output names are the same, so manifest.pdf is replaced by the searchable version
%cd /Users/mhaigh/Downloads/iiif-studentactivism/manifest/
!pwd
!ocrmypdf --language eng --deskew --clean 'manifest.pdf' 'manifest.pdf'

/Users/mhaigh/Downloads/iiif-studentactivism/manifest
/Users/mhaigh/Downloads/iiif-studentactivism/manifest
Scanning contents     100% 29/29 0:00:00
Start processing 10 pages concurrently                                 ocr.py:96
    3 [tesseract] lots of diacritics - possibly poor OCR        tesseract.py:251
    5 [tesseract] lots of diacritics - possibly poor OCR

In [None]:
#The simplest way to check that the OCR worked is to open the PDF and use Ctrl+F (Cmd+F on a Mac) to see whether you can search for words.
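
The command above handles a single PDF. If you end up with one PDF per manifest, something like the sketch below could run the same ocrmypdf options over every PDF in a folder. The function names and the `_ocr` suffix are hypothetical choices of mine; unlike the cell above, this version writes to a new file rather than overwriting the original. It defaults to a dry run that only builds the commands, so nothing is executed until you pass `dry_run=False`.

```python
import subprocess
from pathlib import Path


def build_ocr_command(input_pdf, output_pdf, language="eng"):
    """Build the ocrmypdf command with the same options used in this notebook."""
    return [
        "ocrmypdf", "--language", language, "--deskew", "--clean",
        str(input_pdf), str(output_pdf),
    ]


def ocr_folder(folder, dry_run=True):
    """Build (and optionally run) one OCR command per PDF in the folder."""
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # skip files produced by an earlier run
        output = pdf.with_name(pdf.stem + "_ocr.pdf")
        command = build_ocr_command(pdf, output)
        commands.append(command)
        if not dry_run:
            # check=True raises an error if ocrmypdf exits with a failure code
            subprocess.run(command, check=True)
    return commands


# Dry run on the current directory: print the commands without running them.
for command in ocr_folder(".", dry_run=True):
    print(" ".join(command))
```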

## Step 4: Convert PDFs to .txt files (BASH)
This section is based on the Programming Historian tutorial: https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files

Optional final step: use ChatGPT to clean up the outputs (see https://aigenealogyinsights.com/2023/03/22/ai-genealogy-use-case-cleaning-up-ocr-text/)

In [14]:
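# The pdftotext command used in the next cell comes from poppler, which was installed with Homebrew in Step 3.
# This pip package adds Python bindings for pdftotext; it is only needed if you want to call pdftotext from Python code.
!pip install pdftotext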



In [15]:
%cd /Users/mhaigh/Downloads/iiif-studentactivism/manifest/
!pwd
!pdftotext 'manifest.pdf' 'manifest.txt'

/Users/mhaigh/Downloads/iiif-studentactivism/manifest
/Users/mhaigh/Downloads/iiif-studentactivism/manifest


In [16]:
#check that it worked by listing the files in the directory
# new .txt file means it was created
!ls

image_1.jpg  image_15.jpg image_20.jpg image_26.jpg image_5.jpg  manifest.txt
image_10.jpg image_16.jpg image_21.jpg image_27.jpg image_6.jpg
image_11.jpg image_17.jpg image_22.jpg image_28.jpg image_7.jpg
image_12.jpg image_18.jpg image_23.jpg image_29.jpg image_8.jpg
image_13.jpg image_19.jpg image_24.jpg image_3.jpg  image_9.jpg
image_14.jpg image_2.jpg  image_25.jpg image_4.jpg  manifest.pdf
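
Before any text analysis, it can help to do some light cleanup of the .txt file. The sketch below is one possible starting point, not a full OCR correction: it splits the text into pages and rejoins words that were hyphenated across line breaks. It assumes pdftotext's default behavior of separating pages with a form feed character (`\f`). The sample string is made up for illustration, and the function names are my own.

```python
import re


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    return [page.strip() for page in text.split("\f") if page.strip()]


def tidy_page(page):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", page)
    return re.sub(r"\s+", " ", joined).strip()


# Made-up sample standing in for the contents of manifest.txt.
sample = "Students orga-\nnized a rally\non campus.\fSecond page\ntext.\f"
pages = [tidy_page(p) for p in split_pages(sample)]
print(pages)
```

To try it on the real output, the sample could be replaced with the contents of `manifest.txt`. Note that the hyphen rule will also join genuinely hyphenated words that happen to fall at a line break, so spot-check the results.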

