# Text Preprocessing

This notebook contains the text preprocessing or data cleaning step for the Parti Pris corpus. In the previous notebook (`01_text_extraction.ipynb`), we used Gemini 2.5 to extract text and metadata from the Parti Pris corpus. We validated the OCR output for completeness and quality.

In this notebook, we will extract the metadata and the full texts of the articles and clean up the texts.

## Metadata

We will extract the metadata from the table of contents json files.

This is what our metadata looks like for the first issue (`163122_1-1963-10.json`):

```python
"metadata": {
    "volume_number": "1",
    "issue_number": "1",
    "season_or_month": "octobre",
    "year": "1963",
    "director": "",
    "description": "revue politique et littéraire paraît chaque mois sur 64 pages",
    "price": "Prix 50 cents 12 numéros: $5.00",
    "editors": [
      "André Brochu",
      "Paul Chamberland",
      "Pierre Maheu",
      "André Major",
      "Jean-Marc Piotte"
    ],
    "collaborators": [],
    "administration": [
      "Yvon Dionne",
      "Laurent Girouard",
      "Pierre Maheu",
      "Robert Maheu",
      "Gérald McKenzie",
      "Lise Théberge"
    ],
    "publisher": "La Revue PARTI PRIS, inc. 790-B rue Champagneur, Montréal (8) Québec.",
    "distributor": "Agence de Distribution Populaire, 1130 est rue Lagauchetière Montréal. Tél. LA 3-1182"
  }
```

We will first extract everything (metadata_unedited) and then extract a subset of this metadata that only contains temporal information, which we will later connect to our full texts.

In [22]:
import os
import json
import pandas as pd

toc_dir = "../data/toc_transcriptions"
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

# Gather all metadata
metadata_list = []
for fname in os.listdir(toc_dir):
    if fname.endswith(".json"):
        with open(os.path.join(toc_dir, fname), "r", encoding="utf-8") as f:
            toc_json = json.load(f)
            meta = toc_json.get("metadata", {})
            meta["source_file"] = fname.replace("toc_", "")  # remove 'toc_' prefix
            metadata_list.append(meta)

# Save full metadata as metadata_unedited.csv
metadata_unedited = pd.DataFrame(metadata_list)
# Move source_file to the first column
cols = ['source_file'] + [col for col in metadata_unedited.columns if col != 'source_file']
metadata_unedited = metadata_unedited[cols]
metadata_unedited.to_csv(os.path.join(processed_dir, "metadata_unedited.csv"), index=False)

# Save simplified metadata as metadata.csv
simple_cols = ["source_file", "volume_number", "issue_number", "season_or_month", "year"]
metadata = metadata_unedited[simple_cols]
metadata.to_csv(os.path.join(processed_dir, "metadata.csv"), index=False)

In [23]:
metadata_unedited.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      39 non-null     object
 1   volume_number    39 non-null     object
 2   issue_number     39 non-null     object
 3   season_or_month  39 non-null     object
 4   year             39 non-null     object
 5   director         39 non-null     object
 6   description      39 non-null     object
 7   price            39 non-null     object
 8   editors          39 non-null     object
 9   collaborators    39 non-null     object
 10  administration   39 non-null     object
 11  publisher        39 non-null     object
 12  distributor      39 non-null     object
dtypes: object(13)
memory usage: 4.1+ KB


In [24]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      39 non-null     object
 1   volume_number    39 non-null     object
 2   issue_number     39 non-null     object
 3   season_or_month  39 non-null     object
 4   year             39 non-null     object
dtypes: object(5)
memory usage: 1.7+ KB


## Full Text Corpus

Now, we will extract the full text corpus and merge it with the simplified metadata. Each row in the partipris dataframe will be one article from one issue.

We will use the source_file column to connect the metadata to the full text. However, since we had to split the PDFs that were too large into -A, -B, etc, we will need to create a clean source_file column. We will first extract the texts as is and save the corpus as partipris_unedited.csv. Afterwards, we will edit the source_file information and merge with the metadata to create partipris.csv. 
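The suffix clean-up described above can be sketched on a few file names. The split-part names below (ending in `-A` and `-B`) are hypothetical illustrations of the naming scheme; the regular expression is the one this notebook applies in Step 2.

```python
import re

# Matches a single capital-letter suffix such as -A or -B right before ".json"
SPLIT_SUFFIX = re.compile(r'-[A-Z](?=\.json$)')

def to_base_name(fname):
    """Drop the split suffix so all parts of an issue share one source_file."""
    return SPLIT_SUFFIX.sub('', fname)

examples = [
    "163122_1-1964-09-A.json",   # hypothetical split part
    "163122_1-1964-09-B.json",   # hypothetical split part
    "163122_1-1963-10.json",     # unsplit issue: left unchanged
    "163122_1-1964-09-01.json",  # "-01" is not a capital letter: left unchanged
]
cleaned = [to_base_name(name) for name in examples]
for raw, base in zip(examples, cleaned):
    print(f"{raw} -> {base}")
```

Because the pattern only accepts one capital letter before `.json`, numeric suffixes such as `-01` are not touched.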

In [4]:
# Like we did in the previous notebook, these 3 files are not relevant for the full text corpus
# So we will skip them during processing

skip_files = {
    "163122_1-1964-09-01.json",
    "163122_1-1964-12-01.json",
    "163122_2-1966-06.json"
}

In [69]:
# Step 1: read the json files from data/transcriptions and save the unedited full text corpus

transcriptions_dir = "../data/transcriptions"
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

json_files = [f for f in os.listdir(transcriptions_dir) if f.endswith('.json')]

data_unedited = []
for file in json_files:
    if file in skip_files:
        continue
    with open(os.path.join(transcriptions_dir, file), 'r', encoding='utf-8') as f:
        entry = json.load(f)
        if isinstance(entry, list):
            for item in entry:
                item['source_file_raw'] = file
                data_unedited.append(item)
        else:
            entry['source_file_raw'] = file
            data_unedited.append(entry)

partipris_unedited = pd.DataFrame(data_unedited)
# Move source_file_raw to the first column
cols = ['source_file_raw'] + [col for col in partipris_unedited.columns if col != 'source_file_raw']
partipris_unedited = partipris_unedited[cols]
partipris_unedited.to_csv(os.path.join(processed_dir, "partipris_unedited.csv"), index=False)


In [70]:
partipris_unedited.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file_raw  698 non-null    object
 1   author           698 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
dtypes: object(5)
memory usage: 27.4+ KB


In [71]:
# Step 2: Clean up source_file and prepare for merge
import re

data_clean = []
for item in data_unedited:
    # Remove -A, -B, etc. before .json for matching
    base_file = re.sub(r'-[A-Z](?=\.json$)', '', item['source_file_raw'])
    item_clean = item.copy()
    item_clean['source_file'] = base_file
    data_clean.append(item_clean)

partipris = pd.DataFrame(data_clean)
# Move source_file to the first column and drop source_file_raw
cols = ['source_file'] + [col for col in partipris.columns if col not in ['source_file', 'source_file_raw']]
partipris = partipris[cols]

In [72]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   source_file  698 non-null    object
 1   author       698 non-null    object
 2   title        698 non-null    object
 3   page_range   698 non-null    object
 4   text         698 non-null    object
dtypes: object(5)
memory usage: 27.4+ KB


In [33]:
partipris.head()

Unnamed: 0,source_file,author,title,page_range,text
0,163122_1-1964-09.json,parti pris,manifeste 64-65,4-17,manifeste 64-65 Toute révolution détruit l'anc...
1,163122_1-1964-09.json,la direction.,lettre au lecteur,18-19,lettre au lecteur parti pris se prépare à une ...
2,163122_1-1964-09.json,paul chamberland,bilan d'un combat,20-35,bilan d'un combat paul chamberland Nous lutton...
3,163122_1-1964-09.json,jean-marc piotte,autocritique de parti pris,36-44,autocritique de parti pris jean-marc piotte Pi...
4,163122_1-1964-09.json,pierre maheu,notes pour une politisation,45-56,notes pour une politisation pierre maheu Dans ...


In [73]:
# Let's make sure that we have 42-3 = 39 unique source files as we intended
partipris['source_file'].nunique()

39

In [74]:
# Step 3: Merge with metadata

# Check for matching source_file values
print("Unique source_file in partipris:", partipris['source_file'].nunique())
print("Unique source_file in metadata:", metadata['source_file'].nunique())
print("partipris source_file not in metadata:", set(partipris['source_file']) - set(metadata['source_file']))
print("metadata source_file not in partipris:", set(metadata['source_file']) - set(partipris['source_file']))

Unique source_file in partipris: 39
Unique source_file in metadata: 39
partipris source_file not in metadata: set()
metadata source_file not in partipris: set()


In [75]:
# Merge on source_file
partipris = partipris.merge(metadata, on="source_file", how="left")
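As an optional safeguard, pandas can check the shape of the merge itself. The sketch below uses toy frames (`articles` and `issues` are hypothetical stand-ins for `partipris` and `metadata`) to show the `validate` and `indicator` options of `DataFrame.merge`; it is an illustration, not part of the pipeline above.

```python
import pandas as pd

# Toy stand-ins: several articles per issue, exactly one metadata row per issue
articles = pd.DataFrame({
    "source_file": ["a.json", "a.json", "b.json"],
    "title": ["t1", "t2", "t3"],
})
issues = pd.DataFrame({
    "source_file": ["a.json", "b.json"],
    "year": ["1963", "1964"],
})

merged = articles.merge(
    issues,
    on="source_file",
    how="left",
    validate="many_to_one",  # raises MergeError if an issue appears twice in `issues`
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

all_matched = bool((merged["_merge"] == "both").all())
print(len(merged), all_matched)
```

With a left join, the row count should equal the number of articles, and any `left_only` row would point to an article whose issue has no metadata.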

In [76]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      698 non-null    object
 1   author           698 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
 5   volume_number    698 non-null    object
 6   issue_number     698 non-null    object
 7   season_or_month  698 non-null    object
 8   year             698 non-null    object
dtypes: object(9)
memory usage: 49.2+ KB


In [77]:
# Step 4: Save partipris with metadata as partipris_v1.csv

partipris.to_csv(os.path.join(processed_dir, "partipris_v1.csv"), index=False)

## Data Cleaning

In [78]:
# Let's load the partipris_v1.csv
partipris = pd.read_csv('../data/processed/partipris_v1.csv')

In [79]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      698 non-null    object
 1   author           695 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
 5   volume_number    593 non-null    object
 6   issue_number     698 non-null    object
 7   season_or_month  698 non-null    object
 8   year             698 non-null    int64 
dtypes: int64(1), object(8)
memory usage: 49.2+ KB


Three entries have no author and 105 have no volume number. The missing volume numbers are expected, because volume numbers did not exist for the first year of Parti Pris. The missing authors will be addressed later.

Why did this not appear when we first loaded the JSONs?

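When we first built the DataFrame from the JSON files, every field appeared non-null. The most likely reason is that the JSON stores missing values as empty strings (as with `"director": ""` in the metadata example above), and an empty string counts as a non-null object. When the DataFrame is written to CSV and read back, `pd.read_csv` treats empty fields as missing by default and converts them to `NaN`. The same round trip also re-infers column types, which is why `year` is now `int64`. This is normal behavior when semi-structured data like JSON passes through CSV.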
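A small round trip on toy data shows how nulls can appear only after reloading. It assumes that missing values are stored as empty strings in the JSON, as `"director": ""` is in the metadata sample; the frame below is made up for illustration.

```python
import io
import pandas as pd

# Toy frame: the second author is an empty string, not a null
df = pd.DataFrame({
    "author": ["paul chamberland", ""],
    "year": ["1963", "1964"],
})
non_null_before = int(df["author"].notna().sum())

# Write to an in-memory CSV and read it back with default settings
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
non_null_after = int(reloaded["author"].notna().sum())

# Reading with keep_default_na=False and dtype=str keeps the empty string and the text year
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False, dtype=str)

print(non_null_before, non_null_after, reloaded["year"].dtype)
```

If the empty strings should survive a reload, `keep_default_na=False` together with `dtype=str` is one way to keep the columns as they were written.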

### Hyphenation and Whitespace

As we briefly mentioned in the text extraction notebook, hyphens play an important role in French. There are two types of hyphen use: one serves a morphological function, such as linking words in inverted questions (C'est -> Est-ce), and the other is hyphenation at the end of a line (enterprise -> enter-\nprise). We prompted the model to undo the end-of-line hyphenation, but it was not always successful, so we will repair as many of the remaining cases as we can.

The second issue that we observed consistently is missing whitespace between sentences. We do not know why, but some sentences are merged with no space between them; examples are given in the Whitespace section below.

In [80]:
# Let's look at hyphens that exist in the text
hyphen_rows = partipris[partipris['text'].str.contains('-', na=False)]
len(hyphen_rows)

681

Out of 698 rows, 681 of them have hyphens.

Let's print some of them:

In [44]:
for idx, row in hyphen_rows[['source_file', 'text']].head(10).iterrows():
    print(f"Source: {row['source_file']}\nText: {row['text']}\n{'-'*40}")

Source: 163122_1-1964-09.json
Text: manifeste 64-65 Toute révolution détruit l'ancienne société; en tant qu'elle est sociale. Toute révolution abat l'ancien pouvoir; en tant qu'elle est politique. La révolution en général - le bouleversement du pouvoir existant et la dissolution des anciennes relations sociales - est un acte politique. Sans révolution, le socialisme ne peut pas se réaliser. Karl MARX (Morceaux choisis par Henri Lefebvre et N. Guterman, p. 177.) 2 I bilan 1.1.- introduction parti pris existe depuis bientôt un an. Dès les premiers numéros, il devint évident que toute notre pensée tournait autour d'un maître-mot, qui revenait dans tous les titres: REVOLUTION. Le contexte politique de l'heu- re, où la question qui faisait les manchettes était celle du séparatisme, semblait réduire fortement le contenu de ce mot. Et en fait, nous ne disions pas claire- ment ce que serait cette révolution, nous n'arrivions pas à en faire ressortir les aspects social et politique; au fond, la

**Some examples**

- je n'ai pas l'intention de **reve- nir** *là-dessus*
- en **accomplis- sant** ensemble
- de **désagré- gation**
- *c'est-à-dire* réaliser une idée

Bold marks hyphens at word breaks; italics mark legitimate French hyphens.



There is an easy fix! We can distinguish between these two uses by checking if there is a space after the hyphen or not.

But it is always a good idea to first make sure that we do not have other hyphen use cases.
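The space-after-hyphen heuristic can be tried on a short string built from the examples above. This is a minimal sketch; `COMPOUND` is a slight variation on the in-word pattern that keeps multi-hyphen compounds such as `est-à-dire` in one piece.

```python
import re

# A hyphen followed by a space: a leftover end-of-line word break
WORD_BREAK = re.compile(r"\w+- \w+")
# One or more hyphens with no space: a legitimate compound
COMPOUND = re.compile(r"\w+(?:-\w+)+")

sample = (
    "je n'ai pas l'intention de reve- nir là-dessus, "
    "c'est-à-dire réaliser une idée de désagré- gation"
)

word_breaks = WORD_BREAK.findall(sample)
compounds = COMPOUND.findall(sample)

print("word breaks:", word_breaks)
# The apostrophe is not a word character, so "c'est-à-dire" is reported as "est-à-dire"
print("compounds:", compounds)
```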

In [49]:
import re

word_break_pat = re.compile(r"\w+- \w+")
in_word_pat = re.compile(r"\w+-\w+")

def print_other_hyphens(text, source_file):
    # Find all hyphens in the text
    for m in re.finditer(r"-", text):
        # Get a window around the hyphen for context
        start = max(0, m.start()-15)
        end = min(len(text), m.end()+15)
        window = text[start:end]
        # Check if this hyphen is NOT part of a word-break or in-word hyphen
        if not word_break_pat.search(window) and not in_word_pat.search(window):
            print(f"Source: {source_file}")
            print(f"Context: ...{window}...")
            print('-'*40)

# Apply to all rows with at least one hyphen
for idx, row in partipris[partipris['text'].str.contains('-', na=False)].iterrows():
    print_other_hyphens(row['text'], row['source_file'])

Source: 163122_1-1964-09.json
Context: ...ion en général - le bouleversem...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...tions sociales - est un acte po...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... 2 I bilan 1.1.- introduction p...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... travail. 1.2. - la période de ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... de 1960. 1.3. - la révolution ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...destin. 5 1.4. - l'indépendance...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...rroriste. 1.6. - une nouvelle g...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...texte de l'indé- •9 pendantisme...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...rètement. 1.7.

**New Types**

Looking through the hyphens other than the ones we expected, we discovered some new types. They serve a role in the corpus, so we should not change them.

*Dash as punctuation or separator*  
- Used like an em dash or to separate clauses, items, or list elements.  
  Example: `Justice sociale - histoire de vo...`  
  Example: `Place des Arts - pauvre de moi ...`

*Numbered/lettered lists and section markers*  
- Used after numbers or letters to mark sections or list items.  
  Example: `1.1.- introduction`  
  Example: `a - les animateurs`

*Double or triple hyphens*  
- Used for emphasis, as a separator, or to mimic a long dash.  
  Example: `d'équipement, -- nous envoyer ...`  
  Example: `contradiction--- en ce sens...`

*Hyphens in abbreviations or initials*  
- Used between initials, abbreviations, or acronyms.  
  Example: `J.-L. G.`  
  Example: `C.S.N.-F.T.Q.`  
  Example: `P.-E. Trudeau`

*Hyphens in numeric ranges*  
- Used to indicate ranges or as part of numeric formatting.  
  Example: `$5-$10 par mois`  
  Example: `la guerre '39-'45`


**Edge Cases**

We also found some edge cases that we need to account for.

*Word-break with capital letter*
- Sometimes a word-break hyphen is followed by a capital letter and then the rest of the word, likely due to OCR or formatting issues.
- Example:
  - `rap- P port à Cité Libre`

*Word-breaks with page numbers or bullets*
- Occasionally, a word-break hyphen is followed by a bullet, page number, or other artifact from the page layout.
- Example:  
  - `d'appren- • 61 dre`

Turns out `rap- P port à Cité Libre` is a unique edge case and not an example of a model mistake pattern. Other instances of hyphen and then space and then a capital letter and then space and then some lowercase letters are just lists: `1- A ceux`. We edited the word rapport in the csv directly.

In [144]:
# Let's load the partipris_v2.csv, which is the same as v1 but with the word "rapport" fixed
# We will overwrite this file after the entire hyphen processing is done
partipris = pd.read_csv('../data/processed/partipris_v2.csv')

In [118]:
# Let's see if the edge case is fixed
edge_case_count = partipris['text'].str.contains(r'rap- P port à Cité Libre', na=False).sum()
if edge_case_count > 0:
    print(f"Found {edge_case_count} instances of the edge case 'rap- P port à Cité Libre'.")
else:
    print("No instances of the edge case 'rap- P port à Cité Libre' found.")

No instances of the edge case 'rap- P port à Cité Libre' found.


The Haut-Canada case works like a canary in a coal mine: it will help us keep track of the hyphen removal. It is a good test because we want to repair the broken `Haut- Canada` to `Haut-Canada` without turning it into `HautCanada`.

In [145]:
# HautCanada vs Haut-Canada
# Let's check how many instances of 'HautCanada' exist
haut_canada_count = partipris['text'].str.contains(r'HautCanada', na=False).sum()
if haut_canada_count > 0:
    print(f"Found {haut_canada_count} instances of 'HautCanada'.")
else:
    print("No instances of 'HautCanada' found.")

No instances of 'HautCanada' found.


In [146]:
haut_dash_canada_count = partipris['text'].str.contains(r'Haut-Canada', na=False).sum()
if haut_dash_canada_count > 0:
    print(f"Found {haut_dash_canada_count} instances of 'Haut-Canada'.")
else:
    print("No instances of 'Haut-Canada' found.")

Found 5 instances of 'Haut-Canada'.


In [147]:
haut_dash_space_canada_count = partipris['text'].str.contains(r'Haut- Canada', na=False).sum()
if haut_dash_space_canada_count > 0:
    print(f"Found {haut_dash_space_canada_count} instances of 'Haut- Canada'.")
else:
    print("No instances of 'Haut- Canada' found.")

Found 2 instances of 'Haut- Canada'.


In [102]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      698 non-null    object
 1   author           695 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
 5   volume_number    593 non-null    object
 6   issue_number     698 non-null    object
 7   season_or_month  698 non-null    object
 8   year             698 non-null    int64 
dtypes: int64(1), object(8)
memory usage: 49.2+ KB


Was this important enough to fix? Probably not. But it is an example of how one can deal with some edge cases manually. It is ok to just command-F in a csv and edit things :D

In [148]:
# Let's handle the other edge case: d'appren- • 61 dre
# there are more examples of this: démis- • 27 sionnaires; progres- • 29 siste
# and also s'as- 30. sure; de- 83 venue
# these seem to be cases where the hyphenated word continues on the next page

# Exclude cases where the hyphen is between two numbers (e.g. $5-$10; '39-'45)
# Pattern: letter(s) + hyphen + optional spaces + (non-word or nothing) + optional spaces + 1-3 digits + optional punctuation + optional spaces + letter(s)
pattern = r"[a-zA-ZÀ-ÿ]+-\s*(?:[^\w\s]+)?\s*\d{1,3}[^\w\s]?\s*[a-zA-ZÀ-ÿ]+"

# Count total number of matches across all rows
total_count = partipris['text'].apply(lambda x: len(re.findall(pattern, str(x)))).sum()
print(f"Total number of pattern matches: {total_count}")

# Let's look at the matches
matches = partipris[partipris['text'].str.contains(pattern, na=False, regex=True)]

for idx, row in matches[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, row['text']):
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        print(f"Source: {row['source_file']}")
        print(f"Context: ...{row['text'][start:end]}...")
        print('-'*40)

Total number of pattern matches: 86
Source: 163122_1-1964-09.json
Context: ...ne révolution qui pourtant se vou- 8. drait populaire. La seule violence ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...ez nous dans le contexte de l'indé- •9 pendantisme dont elle était un approfondi...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...au contraire un progrès et un approfondisse- 12. ment de la lutte qui, en s'inséran...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... politique; lorsque l'on veut pas- 20. ser à l'action, c'est-à-dire réal...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ..., le "combat" ne fut alors qu'ex- • 25 clusivement idéologique pour ne pas dire ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...se interne (cf. l'épisode des démis- • 27 sionnaires), semble de plus en plus somb...
-------------

86 instances of this kind of word break that goes over page boundaries. Not bad!

Let's fix them: d'appren- • 61 dre → d'apprendre, s'as- 30. sure → s'assure

In [149]:
# Pattern from above to clean up the edge case hyphenation
pattern = r"([a-zA-ZÀ-ÿ]+)-\s*(?:[^\w\s]+)?\s*\d{1,3}[^\w\s]?\s*([a-zA-ZÀ-ÿ]+)"

def clean_edge_case_hyphens(text):
    # Join the split word fragments, removing the page/bullet artifacts
    return re.sub(pattern, r"\1\2", text)

# Apply cleaning to the entire column
partipris['text'] = partipris['text'].apply(clean_edge_case_hyphens)

In [150]:
# Let's see if we were able to fix this

pattern = r"[a-zA-ZÀ-ÿ]+-\s*(?:[^\w\s]+)?\s*\d{1,3}[^\w\s]?\s*[a-zA-ZÀ-ÿ]+"

# Count total number of matches across all rows
total_count = partipris['text'].apply(lambda x: len(re.findall(pattern, str(x)))).sum()
print(f"Total number of pattern matches: {total_count}")

Total number of pattern matches: 0
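As an extra spot check, the same substitution can be run on the examples quoted in this section, together with the numeric ranges that must stay untouched. This is a small self-contained sketch that reuses the pattern from the cell above; `join_across_page` is a name introduced here for illustration.

```python
import re

# Same pattern as above: word fragment, hyphen, optional bullet, page number, word fragment
PAGE_BREAK = re.compile(
    r"([a-zA-ZÀ-ÿ]+)-\s*(?:[^\w\s]+)?\s*\d{1,3}[^\w\s]?\s*([a-zA-ZÀ-ÿ]+)"
)

def join_across_page(text):
    """Join the two fragments and drop the bullet / page-number artifact."""
    return PAGE_BREAK.sub(r"\1\2", text)

expected = {
    "d'appren- • 61 dre": "d'apprendre",
    "s'as- 30. sure": "s'assure",
    "l'indé- •9 pendantisme": "l'indépendantisme",
    "$5-$10 par mois": "$5-$10 par mois",      # numeric range: unchanged
    "la guerre '39-'45": "la guerre '39-'45",  # numeric range: unchanged
}
results = {raw: join_across_page(raw) for raw in expected}

for raw, fixed in results.items():
    print(f"{raw!r} -> {fixed!r}")
```

The numeric ranges are left alone because the pattern requires letters immediately before the hyphen.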


Now, let's move on to the word-break corrections.

In [129]:
# The simplest way to do this would be with a word- word pattern

pattern = r'(\w+)- (\w+)' 

matches = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, str(row['text'])):
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        matches.append((row['source_file'], row['text'][start:end]))

print(f"Found {len(matches)} instances of 'word- word' pattern.")
for source, context in matches:
    print(f"Source: {source}")
    print(f"Context: ...{context}...")
    print('-'*40)

Found 6162 instances of 'word- word' pattern.
Source: 163122_1-1964-09.json
Context: ...N. Le contexte politique de l'heu- re, où la question qui faisait l...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... en fait, nous ne disions pas claire- ment ce que serait cette révolutio...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...le jour. En somme le front se dé- plaçait, une nouvelle lutte s'engagea...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...uttes sociales commençaient à infor- mer la lutte de libération nation...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...e se radicaliser et de donner nais- sance au vrai combat, à la lutte ré...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... en éclairer l'origine et les struc- tures, et tentent d'en dévoiler les...
----------------------------------------
Source: 1

There are 6162 instances of this pattern, but not all of them are word breaks. We need to look further to avoid hits like `4- Le Devoir`, `1- Quand`, `3- Nous` or `Haut- Canada`.

Let's look at the number pattern first.

In [130]:
# number pattern

pattern = r'(\d+)- (\w+)' 

matches = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, str(row['text'])):
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        matches.append((row['source_file'], row['text'][start:end]))

print(f"Found {len(matches)} instances of 'number- word' pattern.")
for source, context in matches:
    print(f"Source: {source}")
    print(f"Context: ...{context}...")
    print('-'*40)

Found 104 instances of 'number- word' pattern.
Source: 163122_1-1964-09.json
Context: ...ons. 46 les buts à long terme 1- Je voudrais d'abord répondre à c...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...artis de masse yvon hussereau 1- les partis traditionnels quelques...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...n. leader OFFICEUX qui exerce 2- organisation par cellules les caractéristi...
----------------------------------------
Source: 163122_1-1965-04.json
Context: ...estera les mines de baloney". 3- La Presse, 17 fév. 1965. 4- Le D...
----------------------------------------
Source: 163122_1-1965-04.json
Context: .... 3- La Presse, 17 fév. 1965. 4- Le Devoir, 2 mars 1965. mario du...
----------------------------------------
Source: 163122_2-1968-05.json
Context: ...oin du premier que du second. 1- Quand une certaine gauche consentir...
----------------------------------------
Source: 163122_2-1968-05.json

104 instances! We should make sure not to include these in our edits.

Let's look at the `word- Word` pattern, that is, cases like `Haut- Canada` where the left side is not a number.

In [None]:
# Pattern: lowercase or capitalized word, hyphen, space, Capitalized word
pattern = r'\b(?!\d+\b)\w+- [A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]\w+'

matches = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, str(row['text'])):
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        matches.append((row['source_file'], row['text'][start:end]))

print(f"Found {len(matches)} instances of 'word- Word' pattern.")
for source, context in matches:  # print for review
    print(f"Source: {source}")
    print(f"Context: ...{context}...")
    print('-'*40)

Found 68 instances of 'word- Word' pattern.
Source: 163122_1-1964-09.json
Context: ... plusieurs journalistes de LA PRES- SE, aujourd'hui menacés dans leu...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... et de faire du conflit de LA PRES- SE l'importante affaire politiqu...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...s of Lightnin' Hopkins (Verve Folk- FV/FVS-9000). Hootin' the blues ...
----------------------------------------
Source: 163122_1-1965-08.json
Context: ...ément cela: ORGANISATION DE L'AVANT- GARDE EN VUE DE CREER LE PARTI REVO...
----------------------------------------
Source: 163122_1-1965-08.json
Context: ...EER LE PARTI REVOLUTIONNAIRE, INSTRU- MENT DE LA PRISE DU POUVOIR. Nous ...
----------------------------------------
Source: 163122_1-1965-08.json
Context: ...geoisie, à ORGANISER LE PARTI REVOLUTION- NAIRE. A nos militants, nous demand...
----------------------------------------
Source: 1

68 instances! Some are like `INSTRU- MENT`, where a single word was split by end-of-line hyphenation, and others are like `Saint- Laurent`, a compound that is legitimately spelled with a hyphen but picked up a stray space after it.

Now this becomes more generalizable. Let's incorporate what we already know about word breaks.
- If both sides of the hyphen are ALL UPPERCASE or all lowercase, join them (likely a split word).
- If the left is Titlecase or uppercase and the right is Titlecase (e.g., Haut- Canada), keep the hyphen and remove the space (→ Haut-Canada).


In [151]:
import re

# Patterns used by the fix_hyphen_splits function defined in the next cell
pattern_allcaps = r'\b([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]+)- ([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]+)\b'
pattern_alllower = r'\b([a-zà-ÿ]+)- ([a-zà-ÿ]+)\b'
pattern_titlecase = r'\b(?!\d+\b)([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ][a-zà-ÿ]+)- ([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ][a-zà-ÿ]+)\b'

def count_hyphen_cases(texts):
    count = 0
    for text in texts:
        count += len(re.findall(pattern_allcaps, str(text)))
        count += len(re.findall(pattern_alllower, str(text)))
        count += len(re.findall(pattern_titlecase, str(text)))
    return count

# BEFORE cleaning
before_count = count_hyphen_cases(partipris['text'])
print(f"Before cleaning: {before_count} cases matching fix_hyphen_splits patterns")


Before cleaning: 5638 cases matching fix_hyphen_splits patterns


In [152]:
def fix_hyphen_splits(text):
    # 1. Join split words: ALLCAPS- ALLCAPS or all lowercase, but not digit- word
    text = re.sub(r'\b([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]+)- ([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]+)\b', r'\1\2', text)
    text = re.sub(r'\b([a-zà-ÿ]+)- ([a-zà-ÿ]+)\b', r'\1\2', text)
    # 2. Remove space after hyphen for compounds: Title- Case → Title-Case, but not digit- Word
    text = re.sub(r'\b(?!\d+\b)([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ][a-zà-ÿ]+)- ([A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ][a-zà-ÿ]+)\b', r'\1-\2', text)
    return text

partipris['text'] = partipris['text'].apply(fix_hyphen_splits)

In [153]:
after_count = count_hyphen_cases(partipris['text'])
print(f"After cleaning: {after_count} cases matching fix_hyphen_splits patterns")

After cleaning: 0 cases matching fix_hyphen_splits patterns
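To see what the three rules do on individual strings, the function can be run on a handful of examples taken from the outputs above. The sketch repeats `fix_hyphen_splits` so that it runs on its own; the last case (`peut- être`) is a made-up input added to show one behavior of the all-lowercase rule.

```python
import re

UPPER = "A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ"
LOWER = "a-zà-ÿ"

def fix_hyphen_splits(text):
    # 1. Join split words: ALLCAPS- ALLCAPS or lowercase- lowercase
    text = re.sub(rf'\b([{UPPER}]+)- ([{UPPER}]+)\b', r'\1\2', text)
    text = re.sub(rf'\b([{LOWER}]+)- ([{LOWER}]+)\b', r'\1\2', text)
    # 2. Title- Case compounds: keep the hyphen, drop the space
    text = re.sub(
        rf'\b(?!\d+\b)([{UPPER}][{LOWER}]+)- ([{UPPER}][{LOWER}]+)\b', r'\1-\2', text
    )
    return text

examples = [
    "INSTRU- MENT DE LA PRISE DU POUVOIR",
    "nous ne disions pas claire- ment",
    "Haut- Canada",
    "4- Le Devoir, 2 mars 1965.",
    "peut- être",  # made-up input: a hyphenated compound broken at its own hyphen
]
fixed = [fix_hyphen_splits(s) for s in examples]
for raw, out in zip(examples, fixed):
    print(f"{raw!r} -> {out!r}")
```

The numbered list item is left alone, and `Haut- Canada` keeps its hyphen. The last example shows that a lowercase compound broken exactly at its hyphen would lose it (`peutêtre`); whether the corpus contains such cases is not checked here, so a manual review of the joined words may be worthwhile.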


**Whitespace**

The next problem, which we first observed in the text extraction phase, is that words that should be separate were joined, like:

- des amours **étrangèresJe** te reconnais
- des **motsTu** es beau
- pierre **maheuNote:** la plupart

We could describe them as words where a lowercase letter is followed by an upper case letter without whitespace in between. But there are also real examples of this, most notably names like *McGill* or *McLuhan*.
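To make the ambiguity concrete, here is a minimal sketch on the examples above. The lowercase-then-uppercase pattern flags both the merged words and the legitimate names; `NAME_PREFIXES` is a hypothetical allowlist introduced only for illustration, and it is not necessarily how the names should be handled in the full corpus.

```python
import re

# A lowercase letter immediately followed by an uppercase letter
BOUNDARY = re.compile(r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]')

# Hypothetical allowlist of name prefixes that legitimately contain such a boundary
NAME_PREFIXES = ("Mc", "Mac")

def suspect_tokens(text):
    """Return tokens that contain a lower/upper boundary and are not allowlisted names."""
    return [
        token
        for token in re.findall(r"\w+", text)
        if BOUNDARY.search(token) and not token.startswith(NAME_PREFIXES)
    ]

samples = [
    "des amours étrangèresJe te reconnais",
    "des motsTu es beau",
    "pierre maheuNote: la plupart",
    "l'existence de McLuhan",
    "MacKenzie King",
]

flagged_by_pattern = [s for s in samples if BOUNDARY.search(s)]
suspects = [suspect_tokens(s) for s in samples]

print(len(flagged_by_pattern), "of", len(samples), "samples contain a boundary")
print(suspects)
```

The raw pattern alone cannot tell `motsTu` from `McLuhan`, which is why the matches need to be inspected before any automatic fix.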

Let's see what the situation is:

In [154]:
# Find places where a lowercase letter is immediately followed by an uppercase letter (possible missing space)
import re

boundary_pattern = r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]'

all_suspect_matches = []
rows_with_suspect = set()

for idx, row in partipris[['source_file', 'text']].iterrows():
    found_in_row = False
    for m in re.finditer(boundary_pattern, str(row['text'])):
        # Get a window for context
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        all_suspect_matches.append((row['source_file'], row['text'][start:end]))
        found_in_row = True
    if found_in_row:
        rows_with_suspect.add(idx)

print(f"Total suspect boundaries: {len(all_suspect_matches)}")
print(f"Rows with possible missing whitespace between words: {len(rows_with_suspect)}")

# Print first 10 examples
for example in all_suspect_matches[:10]:
    print(f"Source: {example[0]}")
    print(f"Context: ...{example[1]}...")
    print('-'*40)

Total suspect boundaries: 352
Rows with possible missing whitespace between words: 126
Source: 163122_2-1966-02.json
Context: ...ques et de luttes entre des facRussie, avant 1917, des cellule...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...alker, Muddy Waters, Brownie McGhee, Sonny Terry, etc...) Parm...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...dit: je croyais que c'était MacKenzie King, le premier ministr...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...e informés de l'existence de McLuhan? Il est vrai que dans la ...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...la traduction en français de McLuhan. Sinon, il faudra faire a...
----------------------------------------
Source: 163122_2-1966-02.json
Context: ...es... A moins qu'on confonde McLuhan, avec son "délire d'inter...
----------------------------------------
Source: 163122_2-19

In [155]:
# Of the 352 words with possible missing whitespace, let's see how many are unique
# This pattern captures the full merged word: last lowercase letters + first uppercase + following lowercase letters
import collections

boundary_pattern = r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]'

def expand_to_word(text, match_start, match_end):
    # Expand left
    left = match_start
    while left > 0 and text[left-1].isalnum():
        left -= 1
    # Expand right
    right = match_end
    while right < len(text) and text[right].isalnum():
        right += 1
    return text[left:right]

all_expanded_merged_words = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    text = str(row['text'])
    for m in re.finditer(boundary_pattern, text):
        merged_word = expand_to_word(text, m.start(), m.end())
        all_expanded_merged_words.append(merged_word)

counter = collections.Counter(all_expanded_merged_words)
print(f"Unique merged words: {len(counter)}")
print("Most frequent merged words:")
for word, count in counter.most_common(20):
    print(f"{word}: {count}")

Unique merged words: 273
Most frequent merged words:
McGill: 28
McLuhan: 14
LaDurantaye: 7
MacLean: 6
McNamara: 5
DesRochers: 4
McGregor: 3
McCoy: 3
McCarthy: 3
MacWilliams: 3
TvB: 3
McGhee: 2
McLaren: 2
LeMoyne: 2
McLean: 2
DuBellay: 2
McGance: 2
littéraireL: 2
espritAux: 2
DuFour: 2


A lot of these begin with Mc, Mac, Le, La, Du, or Des. We know that these are real names. Let's create a pattern that does not take them into account.
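
One caveat of filtering on prefixes alone is that any merged word that happens to start with `Le`, `La`, `Du` or `Des` is skipped too. A stricter test, sketched here as an assumption rather than as what the notebook does, only protects words shaped like a prefix plus one capitalized part. `McGillJe` and `LesMots` are invented examples of merges that a plain prefix check would let through.

```python
import re

# Assumption: a protected surname is a known prefix + one capital + lowercase letters only
PROTECTED_NAME = re.compile(r'^(?:Mc|Mac|Le|La|Du|Des)[A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ][a-zà-ÿ]+$')

def is_protected_name(word):
    """True for words shaped like McGill or DesRochers, False for longer merges."""
    return bool(PROTECTED_NAME.match(word))

for word in ['McGill', 'LaDurantaye', 'DesRochers', 'McGillJe', 'LesMots']:
    print(word, is_protected_name(word))
```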

In [157]:
boundary_pattern = r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]'

# List of prefixes to ignore
prefixes = ('Mc', 'Mac', 'Le', 'La', 'Du', 'Des')

def expand_to_word(text, match_start, match_end):
    left = match_start
    while left > 0 and text[left-1].isalnum():
        left -= 1
    right = match_end
    while right < len(text) and text[right].isalnum():
        right += 1
    return text[left:right]

all_expanded_merged_words = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    text = str(row['text'])
    for m in re.finditer(boundary_pattern, text):
        merged_word = expand_to_word(text, m.start(), m.end())
        # Only add if it does NOT start with any of the prefixes
        if not merged_word.startswith(prefixes):
            all_expanded_merged_words.append(merged_word)

counter = collections.Counter(all_expanded_merged_words)
print(f"Unique merged words (excluding Mc, Mac, Le, La, Du, Des): {len(counter)}")
print("Most frequent merged words (filtered):")
for word, count in counter.most_common(30):
    print(f"{word}: {count}")

Unique merged words (excluding Mc, Mac, Le, La, Du, Des): 238
Most frequent merged words (filtered):
TvB: 3
littéraireL: 2
espritAux: 2
facRussie: 1
SaintChristophe: 1
LéviStrauss: 1
EtatsUnis: 1
SainteCatherine: 1
maheuNote: 1
bonheurMontréal: 1
commUlysses: 1
motsDIT: 1
RadioCanada: 1
MountRoyal: 1
NouveauBrunswick: 1
extérieurJ: 1
révolutionLa: 1
libérationJusqu: 1
socialeNous: 1
révolutionUne: 1
xénophobieHubert: 1
inventionSi: 1
auteurL: 1
oeuvreUn: 1
constelléPorte: 1
désoléEt: 1
saisChaque: 1
sablierGaston: 1
EhrembourgJe: 1
encoreIls: 1


We got rid of 35 unique words and they were the most frequent ones. Great! Now let's look at the rest.

`TvB` is a real term, so we should set it aside as well.

Some cases, like `EtatsUnis`, should have been hyphenated (`Etats-Unis`). Once we add a space between the two parts, they will be fine for our purposes.

In [158]:
boundary_pattern = r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]'

# List of prefixes and protected terms to ignore
prefixes = ('Mc', 'Mac', 'Le', 'La', 'Du', 'Des', 'TvB')

def expand_to_word(text, match_start, match_end):
    left = match_start
    while left > 0 and text[left-1].isalnum():
        left -= 1
    right = match_end
    while right < len(text) and text[right].isalnum():
        right += 1
    return text[left:right]

def split_merged_words(text):
    # Find all merged words not starting with protected prefixes
    matches = []
    for m in re.finditer(boundary_pattern, text):
        merged_word = expand_to_word(text, m.start(), m.end())
        if not merged_word.startswith(prefixes):
            matches.append((m.start(), m.end()))
    # Insert spaces from left to right; offset accounts for the spaces already added
    new_text = text
    offset = 0
    for start, end in matches:
        insert_at = start + 1 + offset  # after the lowercase letter
        new_text = new_text[:insert_at] + ' ' + new_text[insert_at:]
        offset += 1
    return new_text

# Apply the splitting to the text column
partipris['text'] = partipris['text'].apply(split_merged_words)

In [159]:
# Let's test how we did. Do TvB and littéraireL appear now?
tvb_count = partipris['text'].str.contains(r'\bTvB\b', na=False).sum()
if tvb_count > 0:
    print(f"Found {tvb_count} instances of 'TvB'.")
else:
    print("No instances of 'TvB' found.")

litteraireL_count = partipris['text'].str.contains(r'\blittéraireL\b', na=False).sum()
if litteraireL_count > 0:
    print(f"Found {litteraireL_count} instances of 'littéraireL'.")
else:
    print("No instances of 'littéraireL' found.")

Found 1 instances of 'TvB'.
No instances of 'littéraireL' found.
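
The two spot checks above cover only two words. A broader check could recount every remaining lowercase-to-uppercase boundary after the split. The sketch below is one way to do that; `remaining_merged_words` is a hypothetical helper, and the short list stands in for `partipris['text']`.

```python
import re

BOUNDARY = re.compile(r'[a-zà-ÿ][A-ZÉÈÊÎÏÔÛÙÇÀÂÄÖÜ]')
PROTECTED = ('Mc', 'Mac', 'Le', 'La', 'Du', 'Des', 'TvB')

def remaining_merged_words(texts, protected=PROTECTED):
    """Return merged-looking words that do not start with a protected prefix."""
    leftovers = []
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing values
        # \w+ tokens approximate the isalnum()-based word expansion used above
        for token in re.findall(r'\w+', text):
            if BOUNDARY.search(token) and not token.startswith(protected):
                leftovers.append(token)
    return leftovers

# Tiny stand-in for the text column after splitting
demo = ['des mots Tu es beau', 'McGill et McLuhan', 'le TvB', 'littéraireL', None]
print(remaining_merged_words(demo))
```

An empty list on the real column would confirm that no unprotected boundary is left.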


Perfect! We have successfully addressed the whitespace issue. Let's save the cleaned `partipris` DataFrame.

In [160]:
# save after cleaning
partipris.to_csv('../data/processed/partipris_v3.csv', index=False)

### Page Numbers in Text

The next step in our preprocessing is finding and removing page numbers.

During text extraction, asking the model to report page numbers for each article also helped it keep page numbers out of the main text body. This was not always successful, however. The model was also not very accurate in assigning page numbers: it often mixed up the numbers printed on the page with the page numbers of the PDF. We will discard the page number column later.

In [161]:
partipris = pd.read_csv('../data/processed/partipris_v3.csv')

In [162]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      698 non-null    object
 1   author           695 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
 5   volume_number    593 non-null    object
 6   issue_number     698 non-null    object
 7   season_or_month  698 non-null    object
 8   year             698 non-null    int64 
dtypes: int64(1), object(8)
memory usage: 49.2+ KB


In [170]:
import re

pattern = r'\s\d{1,3}\s'

number_examples = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, str(row['text'])):
        # Get a window for context
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        context = row['text'][start:end]
        number_examples.append((row['source_file'], context))

print(len(number_examples))

# Print first 10 examples
for example in number_examples[:10]:
    print(f"Source: {example[0]}")
    print(f"Context: ...{example[1]}...")
    print('-'*40)

1928
Source: 163122_1-1964-09.json
Context: ...ebvre et N. Guterman, p. 177.) 2 I bilan 1.1.- introduction par...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...lus sérieuse de ces tentatives 3 fut celle d'une importante fra...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...endance comme vers son destin. 5 1.4. - l'indépendance tranquil...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... de l'indépendance mis à part, 7 le programme de RIN dépasse à ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...n ou yankee. De l'autre, les • 11 classes populaires de plus en ...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...outumiers; sous son emprise la 13 petite entreprise familiale te...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...miques, démographiques, etc. • 15 du pays est un pr

Most of the 1928 matches are actual numbers, but some are page numbers preceded by a bullet, such as `• 91` or `• 101`. This is very similar to the cases where hyphenation ran across a page break. We can refine our regex and remove these bullet-marked numbers.

In [172]:
# bullet-marked page number pattern
pattern = r'•\s\d{1,3}'

bullet_number_examples = []
for idx, row in partipris[['source_file', 'text']].iterrows():
    for m in re.finditer(pattern, str(row['text'])):
        start = max(0, m.start()-30)
        end = min(len(row['text']), m.end()+30)
        context = row['text'][start:end]
        bullet_number_examples.append((row['source_file'], context))

print(f"Total matches: {len(bullet_number_examples)}")
for example in bullet_number_examples[:10]:
    print(f"Source: {example[0]}")
    print(f"Context: ...{example[1]}...")
    print('-'*40)

Total matches: 44
Source: 163122_1-1964-09.json
Context: ...an ou yankee. De l'autre, les • 11 classes populaires de plus en...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...omiques, démographiques, etc. • 15 du pays est un préalable à to...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...t une vie d'homme. parti pris • 17...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ... fasse sa part. la direction. • 19...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...rer, il faut compter celuilà. • 21 Nous entendions, il n'y a pas...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...t de responsabilité politique • 23 ne s'imposera vraiment qu'une...
----------------------------------------
Source: 163122_1-1964-09.json
Context: ...i: résoudre, entre patrons et • 31 ouvriers, des conflits qui ri...
--------------------------

In [173]:
# Let's remove it!

def remove_bullet_page_numbers(text):
    return re.sub(r'•\s\d{1,3}', '', text)

partipris['text'] = partipris['text'].apply(remove_bullet_page_numbers)


Maybe Gemini was not so bad at keeping page numbers out of the text!

Additionally, we will run named entity recognition on this corpus later, which should allow us to identify dates.
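
One detail of the substitution used for the bullet-marked page numbers: it removes `• 11` but leaves the spaces on both sides, so a doubled space can remain. The variant below is a sketch, not the function applied above. It absorbs the surrounding whitespace and uses a lookahead so that a longer number such as a year is not cut into. It also strips leading and trailing whitespace from the whole text, which is an added assumption.

```python
import re

# Surrounding whitespace is absorbed; (?!\d) avoids cutting into longer numbers
BULLET_PAGE = re.compile(r'\s*•\s\d{1,3}(?!\d)\s*')

def remove_bullet_page_numbers_sketch(text):
    """Replace a bullet-marked page number and its surrounding spaces with one space."""
    return BULLET_PAGE.sub(' ', text).strip()

print(remove_bullet_page_numbers_sketch("De l'autre, les • 11 classes populaires"))
print(remove_bullet_page_numbers_sketch("la direction. • 19"))
```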

In [174]:
# save after cleaning
partipris.to_csv('../data/processed/partipris_v4.csv', index=False)

### Author Names

In [233]:
partipris = pd.read_csv('../data/processed/partipris_v4.csv')

The next step in the preprocessing is cleaning up the author names. What we observed throughout this corpus is that authors appear once with their full names and then, if they have other articles in the same issue, under an abbreviation of their names. For example, pierre maheu becomes p.m.

We can clean this up by creating some kind of a lookup table for each issue.

In [235]:
# Step 1: Find an example of an abbreviated author name

def is_abbreviation(author):
    """Return True if the author name contains a dot (.)"""
    return '.' in str(author)

# Find the first issue with abbreviated author names
for issue, group in partipris.groupby('source_file'):
    abbrev_rows = group[group['author'].apply(is_abbreviation)]
    if not abbrev_rows.empty:
        print(f"Issue: {issue}")
        print("All author names in this issue:")
        print(group['author'].unique())
        break

Issue: 163122_1-1964-03.json
All author names in this issue:
['Parti Pris / J.-M.P.' 'Pierre Maheu' 'Jacques Berque'
 'Michel van Schendel' 'Jean-Marc Piotte, Paul Chamberland' 'Roger Nantel'
 'André Major' 'Paul Chamberland' 'Laurent Girouard' 'Jacques Ferron'
 'Patrick Straram' 'Gérald Godin']


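The first issue flagged above, `163122_1-1964-03.json`, only has the joint byline `Parti Pris / J.-M.P.`. A more instructive example is `163122_1-1965-05.json`.
All author names in this issue are `'M. D.' 'michael draper' 'madeleine richer' 'luc dufresne' 'J.M.' 'parti-pris' 'rené beaudin' 'parti pris'`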

Upon closer investigation, we found out that J.M. was intentionally left anonymous in the issue itself, which is one more reason to match abbreviations to names per issue rather than across the whole corpus: someone else with the initials J.M. could appear elsewhere.

Another interesting point is that both `parti-pris` and `parti pris` appear as authors. This makes sense because some articles are written jointly. We should still normalize both to `parti pris` for counting later.

Otherwise the M. D. is Michael Draper, so our mapping logic should work!

In [236]:
# Let's create a list like this per issue where there are initials

# For each issue, collect all abbreviated and full author names
issue_abbrev_map = {}

for issue, group in partipris.groupby('source_file'):
    abbrev_authors = [a for a in group['author'].unique() if is_abbreviation(a)]
    full_authors = [a for a in group['author'].unique() if not is_abbreviation(a)]
    if abbrev_authors:
        issue_abbrev_map[issue] = {
            'abbreviated': abbrev_authors,
            'full_names': full_authors
        }

print(f"Number of issues with abbreviated author names: {len(issue_abbrev_map)}")

Number of issues with abbreviated author names: 18


In [191]:
issue_abbrev_map.keys()

dict_keys(['163122_1-1964-03.json', '163122_1-1964-09.json', '163122_1-1965-05.json', '163122_1-1965-06.json', '163122_1-1965-10.json', '163122_1-1965-12.json', '163122_2-1966-01.json', '163122_2-1966-02.json', '163122_2-1966-04.json', '163122_2-1966-11.json', '163122_2-1967-01.json', '163122_2-1967-03.json', '163122_2-1967-05.json', '163122_2-1967-10.json', '163122_2-1968-01.json', '163122_2-1968-02.json', '163122_2-1968-03.json', '163122_2-1968-05.json'])

In [237]:
issue_abbrev_map['163122_2-1966-02.json']

{'abbreviated': ['g. g.',
  'P.',
  'j. d.',
  'm. g...',
  'p. s',
  'p. s.',
  'j. f.',
  'p.m.'],
 'full_names': ['yvon hussereau',
  'jean racine',
  'jacques trudel',
  'gérald godin',
  'paul chamberland',
  'michel euvrard',
  'patrick straram',
  nan,
  'robert boily',
  'michel guénard']}

Looking at the 1966-02 issue, we see that the initials `p.m.` appear but `pierre maheu` does not, so per-issue matching cannot resolve them. We may need a second pass for the usual suspects.

Let's test a function to do the mapping first.

In [238]:
# Step 2: Mapping initials and names

def initials(name):
    """Extract initials from a name, e.g. 'jean racine' -> 'j. r.'"""
    if not isinstance(name, str) or pd.isna(name):
        return ''
    parts = re.findall(r'\b\w', name.lower())
    return '. '.join(parts) + '.' if parts else ''

def map_abbreviations(abbrev_list, full_names_list):
    mapping = {}
    for abbrev in abbrev_list:
        if not isinstance(abbrev, str) or pd.isna(abbrev):
            mapping[abbrev] = None
            continue
        abbr_initials = ''.join(re.findall(r'[a-z]', abbrev.lower()))
        matches = []
        for full in full_names_list:
            if not isinstance(full, str) or pd.isna(full):
                continue
            full_initials = ''.join(re.findall(r'[a-z]', initials(full)))
            if abbr_initials == full_initials:
                matches.append(full)
        mapping[abbrev] = matches[0] if len(matches) == 1 else None
    return mapping

# Example from '163122_2-1966-02.json'
abbrev_list = ['g. g.', 'P.', 'j. d.', 'm. g...', 'p. s', 'p. s.', 'j. f.', 'p.m.']
full_names_list = ['yvon hussereau', 'jean racine', 'jacques trudel', 'gérald godin', 'paul chamberland', 'michel euvrard', 'patrick straram', None, 'robert boily', 'michel guénard']

abbrev_to_full = map_abbreviations(abbrev_list, full_names_list)
print(abbrev_to_full)

{'g. g.': 'gérald godin', 'P.': None, 'j. d.': None, 'm. g...': 'michel guénard', 'p. s': 'patrick straram', 'p. s.': 'patrick straram', 'j. f.': None, 'p.m.': None}
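
The `None` entries above are initials whose owner has no full byline in the same issue. A second pass could look for a unique candidate across the whole corpus. The sketch below is one possible approach and not the route taken in this notebook; the helper names are invented, and the four names are a small sample taken from the corpus.

```python
import re
from collections import defaultdict

def name_initials(name):
    """'pierre maheu' -> 'pm'; each part of a hyphenated name contributes a letter."""
    return ''.join(re.findall(r'\b\w', name.lower()))

def build_corpus_lookup(full_names):
    """Group full names from the whole corpus by their initials."""
    lookup = defaultdict(set)
    for name in full_names:
        if isinstance(name, str):
            lookup[name_initials(name)].add(name)
    return lookup

def resolve_initials(abbrev, lookup):
    """Return the full name only when exactly one corpus-wide candidate exists."""
    key = ''.join(re.findall(r'\w', abbrev.lower()))
    candidates = lookup.get(key, set())
    return next(iter(candidates)) if len(candidates) == 1 else None

corpus_names = ['pierre maheu', 'jan depocas', 'gérald godin', 'gabriel gagnon']
lookup = build_corpus_lookup(corpus_names)
for abbrev in ['p.m.', 'j. d.', 'g. g.']:
    print(abbrev, '->', resolve_initials(abbrev, lookup))
```

In this sample, `g. g.` stays unresolved because two authors share those initials, which is why the per-issue match is tried first.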


In [239]:
# Step 3: Let's create a new author_edited column

import numpy as np

def create_author_edited_column(df):
    # For each issue, build the mapping and apply it.
    # Results are keyed by row index so they stay aligned with df
    # even if the rows are not sorted by source_file.
    author_edited = {}
    for issue, group in df.groupby('source_file'):
        abbrev_list = [a for a in group['author'].unique() if isinstance(a, str) and '.' in a]
        full_names_list = [a for a in group['author'].unique() if isinstance(a, str) and '.' not in a]
        mapping = map_abbreviations(abbrev_list, full_names_list)
        # Map each author in the group
        for idx, row in group.iterrows():
            author = row['author']
            if pd.isna(author):
                author_edited[idx] = np.nan
            elif author in mapping and mapping[author]:
                author_edited[idx] = mapping[author].lower()
            elif isinstance(author, str):
                author_edited[idx] = author.lower()
            else:
                author_edited[idx] = author
    # Assigning a Series aligns on the index instead of on position
    df['author_edited'] = pd.Series(author_edited)
    return df

partipris = create_author_edited_column(partipris)


In [241]:
partipris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   source_file      698 non-null    object
 1   author           695 non-null    object
 2   title            698 non-null    object
 3   page_range       698 non-null    object
 4   text             698 non-null    object
 5   volume_number    593 non-null    object
 6   issue_number     698 non-null    object
 7   season_or_month  698 non-null    object
 8   year             698 non-null    int64 
 9   author_edited    695 non-null    object
dtypes: int64(1), object(9)
memory usage: 54.7+ KB


In [242]:
# Let's look at all the unique values in author_edited

unique_authors = partipris['author_edited'].unique()
print(f"Number of unique authors in author_edited: {len(unique_authors)}")
for author in unique_authors:
    if pd.notna(author):
        print(author)

Number of unique authors in author_edited: 190
parti pris editorial team
pierre maheu
jean-marc piotte
yvon dionne
paul chamberland
andré brochu
andré major
pierre vadeboncoeur
robert maheu
denys arcand
jacques ferron
jacques godbout
camille limoges
parti pris editorial board
georges schoeters
gaston miron
jacques brault
paul-marie lapointe
robert mackay
jean-claude lebel
parti pris
pierre bourgault
laurent girouard
jacques berque
andré balestard
patrick straram
hubert aquin
gilles vigneault
nina bruneau
luc racine
michel van schendel
unknown
gérald godin
jacques allard
parti pris / j.-m.p.
jean-marc piotte, paul chamberland
roger nantel
gilles carle
clément perron
gilles groulx
jacques renaud
jean-robert rémillard
jan depocas
éditorial
jean-claude lapointe
roger guy
serge grenier
pierre lefebvre
andrée benoist
réginald hamel
archival document
charles gill
albert laberge
la direction.
raymond villeneuve
jacques lebrecque
andré garand
jacques poisson
gilles derome
alain bouchard
jacques

This is already better, but we are not done yet. Let's look at all the names that refer to the editorial team. They appear in many variants, some combined with initials, like `parti pris / j.-m.p.`

This is the part where we again turn to AI for help. This time we prompted GPT-4.1 in GitHub Copilot to propose a mapping for all 190 unique names. We then evaluated the mapping and applied it.
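
Part of such an evaluation can be automated. The sketch below is a hypothetical helper, not the procedure we followed: it compares a proposed mapping with the names actually present and reports keys that match nothing and names that the mapping does not cover. The small `proposed` and `observed` examples are for illustration only.

```python
def audit_mapping(mapping, observed_names):
    """Compare a proposed name mapping with the names actually present in the data."""
    observed = {n.strip().lower() for n in observed_names if isinstance(n, str)}
    # Keys the model proposed that never occur in the data
    unused_keys = sorted(k for k in mapping if k not in observed)
    # Observed names the mapping leaves untouched
    unmapped_names = sorted(n for n in observed if n not in mapping)
    return unused_keys, unmapped_names

proposed = {'parti-pris': 'parti pris', 'la rédaction': 'parti pris', 'p. s': 'p.s.'}
observed = ['parti-pris', 'parti pris', 'p. s', 'pierre maheu', None]

unused_keys, unmapped_names = audit_mapping(proposed, observed)
print('Keys not found in the data:', unused_keys)
print('Names left as they are:', unmapped_names)
```

Duplicate keys in a dict literal are silently collapsed by Python (the last value wins), so this check will not reveal them.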

In [None]:
# GPT-4.1 mapping

author_name_mapping = {
    "parti pris editorial team": "parti pris",
    "parti pris editorial board": "parti pris",
    "parti pris": "parti pris",
    "parti-pris": "parti pris",
    "parti pris / j.-m.p.": "parti pris; j.-m.p.",
    "parti pris / g.t.": "parti pris; g.t.",
    "parti pris/p.m.": "parti pris; p.m.",
    "club parti pris": "parti pris",
    "la direction.": "parti pris",
    "la rédaction": "parti pris",
    "le comité de rédaction": "parti pris",
    "équipe de rédaction": "parti pris",
    "éditorial": "parti pris",
    "collectif (not explicitly stated, but implies common effort)": "parti pris",
    "archival document": "parti pris",
    "déclaration": "parti pris",
    "non spécifié": "parti pris",
    "unknown": "parti pris",
    "anonymous": "parti pris",
    "various": "parti pris",
    "mouvement de libération populaire et de la revue parti pris": "parti pris",
    "l'équipe parti pris; patrick straram; gaëtan tremblay": "parti pris; patrick straram; gaëtan tremblay",
    "p.p.": "parti pris",
    "p. p.": "parti pris",
    "p.p. / g. t., p. m.": "parti pris; g.t.; p.m.",
    "p. p. / g. t., p. m.": "parti pris; g.t.; p.m.",
    "p.p. / g.t., p.m.": "parti pris; g.t.; p.m.",
    "p.p. / g.t.": "parti pris; g.t.",
    "p.p. / p.m.": "parti pris; p.m.",
    "p.p. / g.t., p. m.": "parti pris; g.t.; p.m.",
    "p.p. / p. m.": "parti pris; p.m.",
    "p.": "parti pris",
    # We are not yet changing the initials, just standardizing them
    # Some more straightforward initials will be edited next
    "g.m.": "g.m.",
    "r.-g. m.": "r.-g.m.",
    "j.m.": "j.m.",
    "g.d.": "g.d.",
    "ph. b.": "ph.b.",
    "sh. b.": "sh.b.",
    "i. r.": "i.r.",
    "ch. g.": "ch.g.",
    "p.m.": "p.m.",
    "g.t.": "g.t.",
    "j. d.": "j.d.",
    "j. f.": "j.f.",
    "r. s.": "r.s.",
    "p. s.": "p.s.",
    "p. s": "p.s.",
    "m. g...": "m.g.",
    "g. g.": "g.g.",
    # Multi-author and separator normalization
    "jean-marc piotte, paul chamberland": "jean-marc piotte; paul chamberland",
    "mario dumais, robert tremblay": "mario dumais; robert tremblay",
    "charles gagnon, pierre vallieres": "charles gagnon; pierre vallieres",
    "philippe bernard, gaëtan tremblay": "philippe bernard; gaëtan tremblay",
    "andrée paul, raoul duguay": "andrée paul; raoul duguay",
    "luc racine, michel pichette, narciso pizarro, gilles bourque": "luc racine; michel pichette; narciso pizarro; gilles bourque",
    "gabriel gagnon, jean-guy loranger, louis gendreau, rémi savard, gaétan dufour, gilles dostaler": "gabriel gagnon; jean-guy loranger; louis gendreau; rémi savard; gaétan dufour; gilles dostaler",
    "pierre renaud, robert tremblay": "pierre renaud; robert tremblay",
    "gilles bourque et luc racine": "gilles bourque; luc racine",
    "robert tremblay et pierre renaud": "robert tremblay; pierre renaud",
    "jean-guy loranger - luc racine": "jean-guy loranger; luc racine",
    "gilles bourque; michel pichette; narcisso pizarro; luc racine": "gilles bourque; michel pichette; narcisso pizarro; luc racine",
    # Other editorial bylines
    "l'équipe parti pris": "parti pris",
    "le bureau exécutif": "parti pris",
}

In [244]:
# Let's apply this to our author_edited column

def map_author_name(name):
    if pd.isna(name):
        return name
    key = name.strip().lower()
    return author_name_mapping.get(key, key)

partipris['author_edited'] = partipris['author_edited'].apply(map_author_name)

In [245]:
unique_authors = partipris['author_edited'].unique()
print(f"Number of unique authors in author_edited: {len(unique_authors)}")
for author in unique_authors:
    if pd.notna(author):
        print(author)

Number of unique authors in author_edited: 169
parti pris
pierre maheu
jean-marc piotte
yvon dionne
paul chamberland
andré brochu
andré major
pierre vadeboncoeur
robert maheu
denys arcand
jacques ferron
jacques godbout
camille limoges
georges schoeters
gaston miron
jacques brault
paul-marie lapointe
robert mackay
jean-claude lebel
pierre bourgault
laurent girouard
jacques berque
andré balestard
patrick straram
hubert aquin
gilles vigneault
nina bruneau
luc racine
michel van schendel
gérald godin
jacques allard
parti pris; j.-m.p.
jean-marc piotte; paul chamberland
roger nantel
gilles carle
clément perron
gilles groulx
jacques renaud
jean-robert rémillard
jan depocas
jean-claude lapointe
roger guy
serge grenier
pierre lefebvre
andrée benoist
réginald hamel
charles gill
albert laberge
raymond villeneuve
jacques lebrecque
andré garand
jacques poisson
gilles derome
alain bouchard
jacques trudel
pierre prézeau
reggie chartrand
jean-pierre trempe
stéphane venne
pierre charbonneau
andrée ferr

Already a lot cleaner, especially all the Parti Pris names. But we can do better, especially since there are typos and small spelling errors in the names like `pierre vadboncoeur`. Let's also see how many times they appear before we attempt another clean up.

In [250]:
# Let's also see how many times they appear before we attempt another clean up

author_counts = partipris['author_edited'].value_counts(dropna=False)
print("Author name frequencies in author_edited:")
for author, count in author_counts.items():
    print(f"{author}: {count}")

Author name frequencies in author_edited:
parti pris: 106
patrick straram: 48
pierre maheu: 35
gérald godin: 33
paul chamberland: 27
luc racine: 22
gabriel gagnon: 19
gaëtan tremblay: 19
andré major: 16
philippe bernard: 15
jean-marc piotte: 13
jacques ferron: 13
laurent girouard: 12
michel euvrard: 11
thérèse dumouchel: 10
andré brochu: 10
denys arcand: 9
robert maheu: 8
jacques godbout: 8
jacques brault: 8
raoul duguay: 7
gilles bourque: 7
gilles dostaler: 6
jacques trudel: 6
pierre lefebvre: 6
jan depocas: 6
jacques renaud: 5
gaston miron: 5
jacques allard: 5
mario dumais: 5
g.m.: 4
guy bourassa: 4
gaétan tremblay: 4
charles gagnon: 4
françois aquin: 4
michael draper: 4
michel guénard: 4
camille limoges: 4
pierre vadeboncoeur: 4
pierre desrosiers: 3
pierre vallières: 3
j.d.: 3
rené beaudin: 3
yvon dionne: 3
raoul roy: 3
p.m.: 3
nan: 3
jacques-victor morin: 2
michel mill: 2
hubert aquin: 2
gilles vigneault: 2
robert tremblay: 2
jacques berque: 2
robert mackay: 2
andré théberge: 2
mic

Let's do another round of mapping with GPT-4.1 to correct the typos. We also have some usual suspects: `p.m.` is clearly `pierre maheu`, and several of the remaining initials have obvious candidates too.
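
Typo candidates can also be surfaced without a language model. As an illustration only, the sketch below compares every pair of names with `difflib.SequenceMatcher` from the standard library and keeps the pairs above a similarity threshold. The threshold of 0.9 and the function name are assumptions, and the flagged pairs still need a human decision.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_name_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for pairs whose similarity meets the threshold."""
    unique = sorted({n for n in names if isinstance(n, str)})
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

names = ['pierre vadeboncoeur', 'pierre vadboncoeur', 'gaëtan tremblay',
         'gaétan tremblay', 'pierre maheu', 'robert maheu']
for a, b, ratio in similar_name_pairs(names):
    print(f'{a!r} ~ {b!r} ({ratio})')
```

With about 170 unique names this means roughly 14,000 comparisons, which is still fast. Distinct people with similar names, like the two Maheus, stay below the threshold in this sample.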

In [249]:
# The `nan: 3` row above comes from value_counts(dropna=False), which also counts missing values.
# These are probably texts for which Gemini 2.5 could not extract an author name.
# As a safeguard, map any literal "nan" or empty string back to a proper NaN.

import numpy as np
partipris['author_edited'] = partipris['author_edited'].replace({'nan': np.nan, '': np.nan})

In [256]:
# Usual suspects mapping and typos

# Mapping for author typos, initials, and variants
author_typo_initials_mapping = {
    # Initials to full names based on what we know of the corpus and who the potential names are.
    # Note: unlike the per-issue matching above, these replacements apply to the whole corpus.
    "p.m.": "pierre maheu",
    "j.m.": "jean-marc piotte",
    "g.t.": "gaëtan tremblay",
    "ph.b.": "philippe bernard",
    "ch.g.": "charles gagnon",
    "g.m.": "gaston miron",
    "j.d.": "jan depocas",
    "parti pris; g.t.; p.m." : "parti pris; gaëtan tremblay; pierre maheu",
    "parti pris; g.t.": "parti pris; gaëtan tremblay",
    "parti pris; j.-m.p.": "parti pris; jean-marc piotte",
    # Typos and accent/spacing variants
    "gerald godin": "gérald godin",
    "pierre vallieres": "pierre vallières",
    "yvon husereau": "yvon hussereau",
    "pierre vadboncoeur": "pierre vadeboncoeur",
    "andrée bertrand-ferreti": "andrée ferretti-bertrand",
    "jean marc piotte": "jean-marc piotte",
    "gaétan tremblay": "gaëtan tremblay",
    "gaetan tremblay": "gaëtan tremblay",
    "charles gagnon; pierre vallieres": "charles gagnon; pierre vallières",
}

In [257]:
def map_author_typo_initials(name):
    if pd.isna(name):
        return name
    key = name.strip().lower()
    return author_typo_initials_mapping.get(key, key)

partipris['author_edited'] = partipris['author_edited'].apply(map_author_typo_initials)

In [259]:
author_counts = partipris['author_edited'].value_counts(dropna=False)
print(f"Number of unique authors in author_edited after typo mapping: {len(author_counts)}")
print("Author name frequencies in author_edited:")
for author, count in author_counts.items():
    print(f"{author}: {count}")

Number of unique authors in author_edited after typo mapping: 157
Author name frequencies in author_edited:
parti pris: 106
patrick straram: 48
pierre maheu: 38
gérald godin: 34
paul chamberland: 27
gaëtan tremblay: 23
luc racine: 22
gabriel gagnon: 19
andré major: 16
philippe bernard: 16
jean-marc piotte: 15
jacques ferron: 13
laurent girouard: 12
michel euvrard: 11
andré brochu: 10
thérèse dumouchel: 10
denys arcand: 9
gaston miron: 9
jan depocas: 9
jacques godbout: 8
jacques brault: 8
robert maheu: 8
gilles bourque: 7
raoul duguay: 7
pierre lefebvre: 6
gilles dostaler: 6
jacques trudel: 6
charles gagnon: 5
mario dumais: 5
pierre vadeboncoeur: 5
jacques renaud: 5
jacques allard: 5
michel guénard: 4
françois aquin: 4
guy bourassa: 4
camille limoges: 4
michael draper: 4
nan: 3
yvon dionne: 3
pierre desrosiers: 3
raoul roy: 3
yvon hussereau: 3
rené beaudin: 3
pierre vallières: 3
jacques-victor morin: 2
andré théberge: 2
andrée ferretti-bertrand: 2
michel mill: 2
jacques poisson: 2
jacqu

Much cleaner! Author name normalization involves a lot of judgment calls and is specific to the corpus. We chose to normalize all cases of Parti Pris multi-authorship to `parti pris` only, unless otherwise specified, because our goal is to make it easier to count authors.

The first part of the author name normalization, mapping initials to author names per issue, was the more reliable and scalable part of the process.

The GPT-4.1-supported mapping of names could also have been done manually, and it was indeed edited manually, because names are easy to sort out when there are fewer than 200 unique ones. It is, however, not as easily scalable and requires knowledge of the corpus.

In [260]:
# saving the corpus
partipris.to_csv('../data/processed/partipris_v5.csv', index=False)

### Metadata Clean Up

- Represent months and seasons as integers
- Figure out how to represent dual-month issues
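
As a starting point for these two items, here is a minimal sketch under stated assumptions: month names are in French as in `season_or_month`, a dual-month issue is written with a `-` or `/` separator (a guess, since the actual format has not been checked here), and each value is represented as a `(start_month, end_month)` tuple. Seasons are left unresolved because they need a convention that has not been decided yet.

```python
import re

# French month names to integers
MONTHS = {'janvier': 1, 'février': 2, 'mars': 3, 'avril': 4, 'mai': 5, 'juin': 6,
          'juillet': 7, 'août': 8, 'septembre': 9, 'octobre': 10,
          'novembre': 11, 'décembre': 12}

def parse_season_or_month(value):
    """Return (start_month, end_month), or None if the value is not recognized."""
    if not isinstance(value, str):
        return None
    # Assumed separators for dual-month issues: '-' or '/'
    parts = [p.strip() for p in re.split(r'[-/]', value.lower()) if p.strip()]
    months = [MONTHS.get(p) for p in parts]
    if not months or any(m is None for m in months):
        return None
    return (months[0], months[-1])

print(parse_season_or_month('octobre'))
print(parse_season_or_month('janvier-février'))  # hypothetical dual-month value
print(parse_season_or_month('été'))              # seasons are not handled yet
```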