# Text Extraction

## Introduction

The following notebook details Merve Tekgürler's Summer 2025 approach to text extraction from the Parti Pris corpus.

<figure>
    <img src="../img/parti_pris_cover.png"
         alt="Cover of Parti Pris's October 1963 issue">
    <figcaption>Cover of <a href="https://collections.banq.qc.ca/ark:/52327/2314782" target="_blank">Parti Pris' inaugural issue</a> from October 1963.</figcaption>
</figure>

In 2023-2024, Chloé Brault, Clare Chua, and Em Ho compiled the Parti Pris corpus and extracted texts from the same PDFs using ABBYY FineReader. More details about their work can be found [here](https://msuglobaldh.org/abstracts/#brault).

Tekgürler and Brault revisited the project in Summer 2025. This notebook describes how we used the Gemini 2.5 API to extract text from the Parti Pris corpus. It shares the prompting code and offers some insight into the decisions behind the design of the prompts, as well as the safeguards we used to verify the quality of the OCR output.

## Corpus

The complete Parti Pris PDF corpus is approximately 3.48GB in size and includes all 42 issues from the magazine’s entire print run, digitized by the Bibliothèque et Archives Nationales du Québec (BAnQ).

<figure>
    <img src="../img/banq.png"
         alt="Screenshot showing the search page for Parti Pris on the BAnQ website">
    <figcaption>Searching for Parti Pris on the BAnQ website. You can find all the issues of Parti Pris and download the PDFs by following <a href="https://numerique.banq.qc.ca/rechercheExterne/encoded/Kg==/false/D/asc/W3sibm9tIjoiY29ycHVzIiwidmFsZXVyIjoiUGF0cmltb2luZSUyMHF1w6liw6ljb2lzIn0seyJub20iOiJ0eXBlX2RvY19mIiwidmFsZXVyIjoiUmV2dWVzJTIwZXQlMjBqb3VybmF1eCJ9LHsibm9tIjoibnVtZXJvX25vdGljZSIsInZhbGV1ciI6IjAwMDAxNjMxMjIifV0=/Toutes%20les%20ressources/true/false/" target="_blank">this search link</a>. </figcaption>
</figure>

There are three example PDFs in the data folder of this repository (../data/full_pdfs and ../data/split_pdfs), for testing the code in this notebook.

## Approaches

The previous iteration of this project experimented with OCR'ing entire issues of Parti Pris and then splitting them into small chunks of text for further analysis. This allowed us, for example, to experiment with Named Entity Recognition algorithms to identify places mentioned in the corpus, or to run word-frequency-based analyses to discover trends across the whole corpus. It proved rather difficult, however, to split the OCR'ed issues into individual articles.

In this new iteration of the project, we revisit our approach to OCR. We use the Gemini 2.5 API to extract text and return a JSON object containing each article as a separate entry. 
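To make the target format concrete, the sketch below parses a response of this kind and checks that each article entry is complete. The key names (`articles`, `title`, `author`, `text`) and the sample values are assumptions made for illustration; they are not taken from the project's actual prompt or schema.

```python
import json

# Hypothetical response: the keys and values below are illustrative
# placeholders, not the schema or content produced in this project.
sample_response = """
{
  "articles": [
    {"title": "Example title A", "author": "Example Author",
     "text": "First paragraph.\\n\\nSecond paragraph."},
    {"title": "Example title B", "author": null,
     "text": "Unsigned text."}
  ]
}
"""

REQUIRED_KEYS = {"title", "author", "text"}

def parse_articles(raw: str) -> list[dict]:
    """Parse the model output and check that every entry has the expected keys."""
    data = json.loads(raw)
    articles = data["articles"]
    for i, article in enumerate(articles):
        missing = REQUIRED_KEYS - article.keys()
        if missing:
            raise ValueError(f"article {i} is missing keys: {sorted(missing)}")
    return articles

articles = parse_articles(sample_response)
for article in articles:
    print(article["author"], "|", article["title"])
```

A check like this fails early on malformed output, before any downstream analysis relies on the author or title fields.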

### Advantages

There are many advantages to this approach. We automatically obtain texts split at the article level and associated with the author of each article, as opposed to just with Parti Pris. Parti Pris has a relatively standard layout and very few, if any, images or advertisements. The language is standard mid-century French. Most issues are about 60 pages long, which fits within the model's context window. The scans are high-resolution. The model can be prompted to perform specific transformations, such as preserving paragraph breaks but not line breaks and joining hyphenated words back into full words. All this makes text extraction easier.
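For readers unfamiliar with these transformations, here is a rule-based sketch of what we ask for: words hyphenated across line ends are rejoined, line breaks inside a paragraph become spaces, and blank lines (paragraph breaks) are kept. This only illustrates the target output; it is not how the model works, and the function name `normalize_page_text` and the sample string are made up for this example.

```python
import re

def normalize_page_text(text: str) -> str:
    """Join words hyphenated across line ends and unwrap lines inside paragraphs."""
    # "révolu-\ntion" -> "révolution"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline (not adjacent to another newline) becomes a space;
    # blank lines, i.e. paragraph breaks, are left untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

raw = "La révolu-\ntion tranquille\ncontinue.\n\nNouveau paragraphe."
print(normalize_page_text(raw))
```

A purely rule-based version like this one would also strip the hyphen from a genuine compound such as "peut-être" when it happens to break at the end of a line, which is one reason to let the model handle these cases in context.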

### Challenges

At the same time, there are some challenges inherent in prompting a Large Language Model (LLM) for OCR. These include issues related to the size of the PDFs, difficulties with verifying the quality and completeness of the output, and differences between prompting approaches.

**PDF Size**

The PDFs uploaded to the Gemini API cannot be larger than [50MB](https://github.com/googleapis/python-genai/issues/308). We had a total of 42 PDFs in this corpus, and 18 of them were over 50MB. Some were only marginally larger; others were combined issues, reaching over 120 pages and 100MB. We could not split the PDFs automatically by size, since our goal was to capture each article separately. A split based on file size or page count could cut an article in the middle. This meant that we had to either reduce the size of the PDFs, and risk lowering the resolution, or split them manually.

We took a two-part approach. First, we transcribed all the PDFs, skipping the ones that were too large and making a note of each skipped file. Afterwards, we manually split the 18 large PDFs in Adobe Acrobat Pro, creating 42 new PDFs. We retained the filenames, adding '-A', '-B', etc. to the end of each. Then we ran a second pass of Gemini 2.5 transcription on the split PDFs.
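The first-pass size check can be sketched as follows. This is a minimal illustration and not the code we ran: the function name `partition_by_size` is hypothetical, and the limit is assumed to be 50 × 1024 × 1024 bytes, so check the current API documentation for the exact threshold.

```python
from pathlib import Path

# Assumed limit, based on the 50MB figure above; whether the API counts
# megabytes as 10^6 or 2^20 bytes is not verified here.
MAX_BYTES = 50 * 1024 * 1024

def partition_by_size(pdf_dir: str, max_bytes: int = MAX_BYTES):
    """Return (ok, too_large) lists of PDF paths, each sorted by filename."""
    ok, too_large = [], []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        (too_large if pdf.stat().st_size > max_bytes else ok).append(pdf)
    return ok, too_large

ok, too_large = partition_by_size("../data/full_pdfs")
print(f"{len(ok)} PDFs can be uploaded directly")
for pdf in too_large:
    print(f"skipped (too large): {pdf.name}, {pdf.stat().st_size / 1e6:.1f} MB")
```

The `too_large` list serves as the note of skipped files: these are the PDFs to split manually before the second pass.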

<figure>
    <img src="../img/large_pdfs.png"
         alt="Screenshot showing the transcription token usage for API calls">
    <figcaption>Transcription token usage after the first pass</figcaption>
</figure>

<figure>
    <img src="../img/second_pass_pdfs.png"
         alt="Screenshot showing the transcription token usage for API calls">
    <figcaption>Transcription token usage after the second pass</figcaption>
</figure>

**Quality and Completeness of the Output**

It is difficult to verify that the entire PDF has been OCR'ed. Sometimes the model stopped producing text because it ran out of output tokens, particularly if the PDF was very long or if the model was using too many 'thinking' tokens. We retained the token usage for each PDF and looked for outliers.

As you can see in the transcription usage table above, the PDF for June 1966 (163122_2-1966-06.pdf) has only 33 output tokens, a small fraction of the 44k output tokens for the April 1966 issue. In this case, the low count was not a model error: this issue does not exist and was never published. The PDF scanned by [the BAnQ](https://collections.banq.qc.ca/ark:/52327/2314811) reads "Parti Pris Juin à Août Non paru" (Parti Pris, June to August not published). In other instances, such as September 1964 (163122_2-1964-09-01.pdf) and December 1964 (163122_2-1964-12-01.pdf), there are additional short pamphlets alongside the main issue. The BAnQ denotes these pamphlets with '-01' after the date in the PDFs' filenames. In these examples the output legitimately has far fewer tokens, but checking output token counts still gave us a simple way to discover inconsistencies in the OCR output and to rerun the API calls on issues where text was missing.

In [2]:
import pandas as pd
df = pd.read_csv("../data/transcription_usage_fulltext.csv")
df.head()

| | pdf_filename | json_filename | prompt_tokens | thoughts_tokens | output_tokens | total_tokens |
|---|---|---|---|---|---|---|
| 0 | 163122_1-1963-10.pdf | 163122_1-1963-10.json | 17107 | 1406 | 34382 | 52895 |
| 1 | 163122_1-1963-11.pdf | 163122_1-1963-11.json | 17107 | 1456 | 29757 | 48320 |
| 2 | 163122_1-1963-12.pdf | 163122_1-1963-12.json | 17107 | 1986 | 32445 | 51538 |
| 3 | 163122_1-1964-01.pdf | 163122_1-1964-01.json | 17107 | 1381 | 38662 | 57150 |
| 4 | 163122_1-1964-02.pdf | 163122_1-1964-02.json | 17107 | 1371 | 42279 | 60757 |


In [8]:
# Remove the -A, -B, -C, etc. from filenames to group them
df['base_filename'] = df['json_filename'].str.replace(r'-[A-Z]\.json$', '.json', regex=True)

# List of columns to sum
token_cols = ['prompt_tokens', 'thoughts_tokens', 'output_tokens', 'total_tokens']

# Group by base_filename, sum token columns, and aggregate other columns as needed
df_combined = df.groupby('base_filename', as_index=False).agg(
    {col: 'sum' for col in token_cols} | {
        'json_filename': lambda x: ', '.join(sorted(set(x))),
        'pdf_filename': lambda x: ', '.join(sorted(set(x)))
    }
)

# Count how many original rows were grouped for each base_filename
counts = df.groupby('base_filename').size().reset_index(name='count')

# Merge counts into df_combined
df_combined = df_combined.merge(counts, on='base_filename')

# Add note only where count > 1
df_combined['note'] = df_combined['count'].apply(lambda x: 'combined values' if x > 1 else '')

# Optionally drop the 'count' column if you don't need it
df_combined = df_combined.drop(columns=['count'])

# Save or display the result
df_combined.head()

| | base_filename | prompt_tokens | thoughts_tokens | output_tokens | total_tokens | json_filename | pdf_filename | note |
|---|---|---|---|---|---|---|---|---|
| 0 | 163122_1-1963-10.json | 17107 | 1406 | 34382 | 52895 | 163122_1-1963-10.json | 163122_1-1963-10.pdf | |
| 1 | 163122_1-1963-11.json | 17107 | 1456 | 29757 | 48320 | 163122_1-1963-11.json | 163122_1-1963-11.pdf | |
| 2 | 163122_1-1963-12.json | 17107 | 1986 | 32445 | 51538 | 163122_1-1963-12.json | 163122_1-1963-12.pdf | |
| 3 | 163122_1-1964-01.json | 17107 | 1381 | 38662 | 57150 | 163122_1-1964-01.json | 163122_1-1964-01.pdf | |
| 4 | 163122_1-1964-02.json | 17107 | 1371 | 42279 | 60757 | 163122_1-1964-02.json | 163122_1-1964-02.pdf | |
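One way to turn the outlier check into code is to flag every row whose output token count falls well below the median. The sketch below is a hedged illustration: the function name `flag_low_output` and the 0.5 ratio are assumptions, not the threshold we actually applied, and the toy DataFrame reuses two rows from the first table plus the 33-token June 1966 value mentioned in the text.

```python
import pandas as pd

def flag_low_output(usage: pd.DataFrame, ratio: float = 0.5) -> pd.DataFrame:
    """Return rows whose output_tokens fall below `ratio` times the median."""
    threshold = ratio * usage["output_tokens"].median()
    return usage[usage["output_tokens"] < threshold]

# Toy data for illustration: only the two columns needed here are included.
usage = pd.DataFrame({
    "pdf_filename": [
        "163122_1-1963-10.pdf",
        "163122_1-1963-11.pdf",
        "163122_2-1966-06.pdf",
    ],
    "output_tokens": [34382, 29757, 33],
})
print(flag_low_output(usage))
```

On the real data, the same function could be applied to the combined table, for example `flag_low_output(df_combined)`. Flagged rows still need a manual look, since some (the unpublished issue, the pamphlets) are legitimately short.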


## Gemini API
