# Data Preprocessing

This notebook cleans, tokenizes, and augments the excerpt data used for author classification.

In [48]:
import pandas as pd
import re
import os
from dotenv import load_dotenv
from tqdm import tqdm
import random
import numpy as np
import warnings

# Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Data Augmentation
import openai

# Burrows' Delta score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cityblock  # Burrows' Delta uses Manhattan distance


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hadrienstrichard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/hadrienstrichard/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Cleaning and tokenizing

In [3]:
# Load initial data

texts = pd.read_parquet('../Data/excerpts_df.parquet')
final_authors = pd.read_csv('../Data/final_authors.csv')

In [4]:
# Basic cleaning function tailored for stylometry (preserving punctuation, no lemmatization)
def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # remove emails
    text = re.sub(r'\d+', '', text)  # remove digits
    text = re.sub(r'\s+', ' ', text)  # normalize whitespace
    text = re.sub(r'[“”«»]', '"', text)  # normalize quotes
    text = re.sub(r"[’]", "'", text)  # normalize apostrophes
    return text.strip()
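
Removing digits can leave a stray space before punctuation, as in the `page , ont` fragment of the printed sample further down. If that matters for character-level features, a small optional post-step could close the gap. The helper name `tidy_punctuation_gaps` is hypothetical and not part of this notebook; as a sketch it only touches commas and periods, since French typography legitimately places a space before `;`, `:`, `!` and `?`.

```python
import re

def tidy_punctuation_gaps(text):
    """Remove whitespace left before commas and periods (hypothetical helper)."""
    return re.sub(r"\s+([,.])", r"\1", text)

# Fragment taken from the cleaned sample printed in this notebook
sample = "cité par fortis, tome i, page , ont en effet une origine commune"
tidied = tidy_punctuation_gaps(sample)
print(tidied)
```

Word-level tokens are probably unaffected either way, because the tokenizer output shown below already separates commas and periods from the preceding word.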

In [5]:
# Apply cleaning
texts['Cleaned_Text'] = texts['Excerpt_Text'].apply(clean_text)

In [6]:
# Display the first 3 cleaned texts for verification
pd.set_option('display.max_colwidth', None)
print(texts['Cleaned_Text'][:3])

index
0                                                                              n'est souvent déterminée que par un mot. en ce point même, cependant, il n'a fait que se conformer au caprice piquant de la nature, qui se joue à nous faire parcourir dans la durée d'un seul rêve, plusieurs fois interrompu par des épisodes étrangers à son objet, tous les développements d'une action régulière, complète et plus ou moins vraisemblable. les personnes qui ont lu apulée s'apercevront facilement que la fable du premier livre de l'_âne d'or_ de cet ingénieux conteur a beaucoup de rapports avec celle-ci, et qu'elles se ressemblent par le fond presque autant qu'elles diffèrent par la forme. l'auteur paraît même avoir affecté de solliciter ce rapprochement en conservant à son principal personnage le nom de lucius. le récit du philosophe de madaure et celui du prêtre dalmate, cité par fortis, tome i, page , ont en effet une origine commune dans les chants traditionnels d'une contrée qu'apulée avai

In [7]:
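# Instantiate an untrained Punkt sentence tokenizer with default parameters.
# Note: it is not French-specific and is not used below; word_tokenize(..., language='french')
# loads the French Punkt model on its own.
punkt_param = PunktParameters()
tokenizer = PunktSentenceTokenizer(punkt_param)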

In [8]:
# Apply word tokenization
texts['Tokens'] = texts['Cleaned_Text'].apply(lambda x: word_tokenize(x, language='french'))

In [9]:
# Display the first 3 tokenized texts for verification
print(texts['Tokens'][:3])

index
0                                                                     [n'est, souvent, déterminée, que, par, un, mot, ., en, ce, point, même, ,, cependant, ,, il, n, ', a, fait, que, se, conformer, au, caprice, piquant, de, la, nature, ,, qui, se, joue, à, nous, faire, parcourir, dans, la, durée, d'un, seul, rêve, ,, plusieurs, fois, interrompu, par, des, épisodes, étrangers, à, son, objet, ,, tous, les, développements, d'une, action, régulière, ,, complète, et, plus, ou, moins, vraisemblable, ., les, personnes, qui, ont, lu, apulée, s'apercevront, facilement, que, la, fable, du, premier, livre, de, l'_âne, d'or_, de, cet, ingénieux, conteur, a, beaucoup, de, rapports, avec, celle-ci, ,, et, qu'elles, se, ...]
1                                                                                                              [le, reste, ne, me, regarde, point, ., j'ai, dit, de, qui, était, la, fable, :, sauf, quelques, phrases, de, transition, ,, tout, appartient, à, homère, ,, à, théo

In [10]:
# Drop authors with 25 or fewer excerpts, along with their excerpts
author_counts = texts['Author'].value_counts()
valid_authors = author_counts[author_counts > 25].index
texts = texts[texts['Author'].isin(valid_authors)]

# Filter final_authors
final_authors = final_authors[final_authors['Name'].isin(valid_authors)]

# Update books.csv
books = pd.read_csv('../Data/books.csv')
books = books[books['Author'].isin(valid_authors)]
books.to_csv('../Data/books.csv', index=False)

# Save updated data
texts.to_parquet('../Data/excerpts_df.parquet')
final_authors.to_csv('../Data/final_authors.csv', index=False)
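
The filter above uses a strict comparison (`> 25`), so an author with exactly 25 excerpts is dropped. The following standard-library sketch reproduces that rule on toy labels; `authors_to_keep` and `MIN_EXCERPTS` are illustrative names, not part of the notebook.

```python
from collections import Counter

MIN_EXCERPTS = 25  # same threshold as the filter cell (assumed to be intentional)

def authors_to_keep(author_labels, min_excerpts=MIN_EXCERPTS):
    """Return the authors with strictly more than `min_excerpts` excerpts."""
    counts = Counter(author_labels)
    return {author for author, n in counts.items() if n > min_excerpts}

# Toy data: A has 26 excerpts, B has exactly 25, C has 3
labels = ["A"] * 26 + ["B"] * 25 + ["C"] * 3
kept = authors_to_keep(labels)
print(sorted(kept))
```

Note that the cell also overwrites its own input files, so re-running the notebook starts from the already filtered data.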

In [11]:
final_authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        28 non-null     object
 1   birth_date  28 non-null     object
 2   death_date  28 non-null     object
 3   n_books     28 non-null     int64 
 4   Mouvement   28 non-null     object
 5   Genres      28 non-null     object
 6   n_excerpts  28 non-null     int64 
dtypes: int64(2), object(5)
memory usage: 1.7+ KB


In [12]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Author     299 non-null    object 
 1   Title      299 non-null    object 
 2   Language   299 non-null    object 
 3   EBook-No   299 non-null    int64  
 4   URL        299 non-null    object 
 5   Roman      299 non-null    bool   
 6   Genres     299 non-null    object 
 7   Mouvement  299 non-null    object 
 8   Date       298 non-null    float64
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 19.1+ KB


In [13]:
texts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14625 entries, 0 to 14657
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Author        14625 non-null  object
 1   Title         14625 non-null  object
 2   URL           14625 non-null  object
 3   Excerpt_ID    14625 non-null  object
 4   Excerpt_Text  14625 non-null  object
 5   Cleaned_Text  14625 non-null  object
 6   Tokens        14625 non-null  object
dtypes: object(7)
memory usage: 914.1+ KB


## Data Augmentation

In [15]:
# Load OpenAI key

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

In [38]:
# Delta Score Calculation

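def compute_author_profiles(texts_df, vectorizer=None):
    """Create stylometric profiles (z-scored mean token counts) per author."""
    # The vectorizer configuration is not shown in this notebook; a default
    # CountVectorizer is assumed here when none is passed in.
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts_df["Cleaned_Text"])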
    features = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
    features["Author"] = texts_df["Author"].values

    # Compute mean frequencies per author
    author_profiles = features.groupby("Author").mean()

    # Z-score normalization (feature-wise)
    scaler = StandardScaler()
    scaled_profiles = pd.DataFrame(
        scaler.fit_transform(author_profiles),
        index=author_profiles.index,
        columns=author_profiles.columns
    )
    return scaled_profiles, scaler, vectorizer

def delta_score(text, scaled_author_profiles, scaler, vectorizer):
    """Compute Delta between one excerpt and all author profiles"""
    vec = vectorizer.transform([text])
    vec_df = pd.DataFrame(vec.toarray(), columns=vectorizer.get_feature_names_out())

    # Align columns
    for col in scaled_author_profiles.columns:
        if col not in vec_df.columns:
            vec_df[col] = 0
    vec_df = vec_df[scaled_author_profiles.columns]

    # Normalize using same scaler as author profiles
    vec_scaled = pd.DataFrame(
        scaler.transform(vec_df),
        columns=vec_df.columns
    ).iloc[0]

    # Compute Manhattan (L1) distance: Burrows’ Delta
    deltas = {}
    for author in scaled_author_profiles.index:
        profile = scaled_author_profiles.loc[author]
        deltas[author] = cityblock(vec_scaled, profile)

    return deltas


def get_delta_distribution_for_author(author, texts_df, profiles, scaler, vectorizer):
    subset = texts_df[texts_df["Author"] == author]["Cleaned_Text"]
    deltas = [delta_score(text, profiles, scaler, vectorizer)[author] for text in subset]
    return deltas
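
To make the distance concrete, here is a minimal standard-library sketch of the same computation on two toy author profiles: each feature is z-scored using the mean and population standard deviation across author profiles (which is what `StandardScaler` does), and Delta is the Manhattan distance between the z-scored excerpt and each z-scored profile. The function name `burrows_delta` and all numbers are illustrative assumptions. Classical Burrows' Delta is usually described on relative frequencies of the most frequent words, whereas the functions above work on whatever counts the vectorizer produces.

```python
from statistics import mean, pstdev

def burrows_delta(excerpt_freqs, author_profiles):
    """Manhattan distance between a z-scored excerpt and each z-scored author profile."""
    features = sorted(excerpt_freqs)
    authors = list(author_profiles)
    # Per-feature mean and population std across author profiles;
    # a zero std is replaced by 1.0, as StandardScaler does
    mu = {f: mean(author_profiles[a][f] for a in authors) for f in features}
    sigma = {f: pstdev([author_profiles[a][f] for a in authors]) or 1.0 for f in features}

    def z(value, feature):
        return (value - mu[feature]) / sigma[feature]

    return {
        a: sum(abs(z(excerpt_freqs[f], f) - z(author_profiles[a][f], f)) for f in features)
        for a in authors
    }

# Toy mean counts for two function words (made-up numbers)
profiles = {
    "A": {"de": 10.0, "et": 4.0},
    "B": {"de": 6.0, "et": 8.0},
}
excerpt = {"de": 9.0, "et": 5.0}

deltas = burrows_delta(excerpt, profiles)
closest = min(deltas, key=deltas.get)
print(deltas, closest)
```

A lower Delta means the excerpt sits closer to that author's profile, which is how the validation step further down ranks candidate authors.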

In [39]:
def build_prompt(target_author, example_1, example_2, input_excerpt, author_metadata):
    bio = (
        f"{target_author} est un écrivain français ({author_metadata['birth_date'][:4]}–{author_metadata['death_date'][:4]}). "
        f"Appartenant au mouvement {author_metadata['Mouvement']}, il a principalement écrit dans les genres suivants : "
        f"{author_metadata['Genres']}."
    )

    return f"""
Tu es un modèle de génération de texte littéraire spécialisé dans l'augmentation de données pour la classification d'auteurs.

Ton rôle :
- Générer des extraits de fiction de 1500 mots (+/- 10%) en français.
- Respecter scrupuleusement le style de l’auteur : vocabulaire, syntaxe, ponctuation, ton, époque.
- Ne pas inclure de dialogues, titres ou en-têtes : que du texte narratif ou descriptif.
- Générer un passage qui suit naturellement celui donné, comme s’il s’agissait de la suite immédiate dans le livre.

Informations :
{bio}

Exemple de suite entre deux extraits d’un même livre de cet auteur:
--- Extrait A ---
{example_1}
--- Extrait B (suite de A) ---
{example_2}

Tâche :
Voici un extrait tiré de l’œuvre suivante : "{input_excerpt['Title']}". Génère la suite immédiate de cet extrait dans le même style, sans introduire de coupure ou de résumé.

--- Extrait donné ---
{input_excerpt['Excerpt_Text']}
--- Ta réponse (suite de l'extrait) ---
"""


In [None]:
def generate_and_validate(author_row, texts_df, n_needed, author_profiles, vectorizer, scaler, client):

    author_name = author_row["Name"]
    subset = texts_df[texts_df["Author"] == author_name].copy()

    # Extract book ID and position
    subset[["Book_ID", "Book_Pos"]] = subset["Excerpt_ID"].str.extract(r"(\d+)_(\d+)")
    subset["Book_Pos"] = subset["Book_Pos"].astype(int)

    # For each book, keep middle 50% of excerpts
    middle_excerpts = []
    for book_id, group in subset.groupby("Book_ID"):
        group_sorted = group.sort_values("Book_Pos")
        n = len(group_sorted)
        if n >= 4:
            middle = group_sorted.iloc[n // 4: 3 * n // 4]
            middle_excerpts.append(middle)
        else:
            # If too few excerpts, keep them all (fallback)
            middle_excerpts.append(group_sorted)

    middle_subset = pd.concat(middle_excerpts).reset_index(drop=True)


    new_excerpts = []
    attempts = 0
    max_attempts = 5 * n_needed 

    while len(new_excerpts) < n_needed and attempts < max_attempts:
        attempts += 1

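        # Pick two adjacent excerpts from the same book as the example pair
        ordered = middle_subset.sort_values(["Book_ID", "Book_Pos"]).reset_index(drop=True)
        same_book = (ordered["Book_ID"] == ordered["Book_ID"].shift(-1)).to_numpy()
        i = random.choice(list(ordered.index[same_book]))
        example_1, example_2 = ordered.loc[i, "Excerpt_Text"], ordered.loc[i + 1, "Excerpt_Text"]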

        # Input excerpt also from middle
        input_excerpt = middle_subset.sample(1).iloc[0]

        prompt = build_prompt(
            target_author=author_name,
            example_1=example_1,
            example_2=example_2,
            input_excerpt=input_excerpt,
            author_metadata=author_row
        )

        try:
            print(f"Generating excerpt for {author_name}... (Attempt {attempts})")
            response = client.chat.completions.create(
                model="gpt-4.1-2025-04-14",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.5,
                max_tokens=2048
            )
            generated = response.choices[0].message.content
            print(f"Generated excerpt for {author_name}: {generated[:100]}...")

            cleaned_gen = clean_text(generated)
            tokenized_gen = word_tokenize(cleaned_gen, language='french')
            print(len(tokenized_gen), "tokens in generated excerpt")
            length_ok = 1350 <= len(tokenized_gen) <= 1650

            delta_scores = delta_score(cleaned_gen, author_profiles, scaler, vectorizer)
            print(f"Delta scores for {author_name}: {delta_scores}")
            sorted_authors = sorted(delta_scores.items(), key=lambda x: x[1])
            closest_author, closest_score = sorted_authors[0]
            target_score = delta_scores[author_name]
            print(f"Closest author: {closest_author} ({closest_score:.2f}), Target author ({author_name}): {target_score:.2f}")

            # Allow a small tolerance if the target author is nearly the closest;
            # the allowed relative gap grows by one percentage point per attempt
            tolerance = attempts / 100
            is_closest_or_near = (
                closest_author == author_name or
                (target_score - closest_score) / closest_score <= tolerance
            )

            # Delta range check: the target delta must not exceed the 90th percentile
            # of the deltas of the author's real excerpts. This distribution does not
            # depend on the generated text, so it could be computed once before the loop.
            real_deltas = get_delta_distribution_for_author(
                author_name, texts_df, author_profiles, scaler, vectorizer
            )
            threshold = np.percentile(real_deltas, 90)
            within_range = target_score <= threshold

            print(f"Delta to {author_name}: {target_score:.2f} | 90th percentile: {threshold:.2f}")
            good_style = is_closest_or_near and within_range
            print(f"Excerpt length OK: {length_ok}, Good style: {good_style}")

            if length_ok and good_style:
                new_excerpts.append({
                    "Author": author_name,
                    "Title": input_excerpt["Title"],
                    "URL": input_excerpt["URL"],
                    "Excerpt_ID": f"{input_excerpt['Excerpt_ID']}_gen{len(new_excerpts)}",
                    "Excerpt_Text": generated,
                    "Cleaned_Text": cleaned_gen,
                    "Tokens": tokenized_gen,  # same column name as in the real excerpts
                    "of_which_generated": input_excerpt["Excerpt_ID"]
                })
                print(f"✅ Accepted excerpt for {author_name}: {len(new_excerpts)} / {n_needed}")

        except Exception as e:
            print(f"Error (attempt {attempts}):", e)
            continue

    if len(new_excerpts) < n_needed:
        warnings.warn(
            f"⚠️ Could not generate all excerpts for {author_name}. Only {len(new_excerpts)} out of {n_needed} were accepted. Review needed."
        )

    return pd.DataFrame(new_excerpts)
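
The function draws its examples and input excerpts only from the middle 50% of each book, presumably to stay away from front and back matter (that motivation is an assumption; the code only shows the slicing). The slice bounds can be checked in isolation with a plain list; `middle_half` is an illustrative name, not part of the notebook.

```python
def middle_half(items):
    """Return the middle 50% of an ordered list; keep everything if fewer than 4 items."""
    n = len(items)
    if n < 4:
        return list(items)
    # Same bounds as group_sorted.iloc[n // 4: 3 * n // 4]
    return list(items[n // 4 : 3 * n // 4])

positions = list(range(10))  # excerpt positions 0..9 within one book
print(middle_half(positions))
print(middle_half([0, 1, 2]))
```

With 10 excerpts the slice keeps positions 2 to 6, so the first two and last three excerpts of the book are never used as prompts.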


In [42]:
# Precompute author profiles
author_profiles, scaler, vectorizer = compute_author_profiles(texts)

# Get underrepresented authors
need_aug = final_authors[final_authors["n_excerpts"] < 50]

print(f"Authors needing augmentation: {len(need_aug)}")

Authors needing augmentation: 4
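
When `n_excerpts` is refreshed after augmentation, the new counts should be matched to authors by name rather than by row position. `DataFrame.update` aligns on the index, so a counts table with a fresh `0..n` index is applied row by row, whatever the names are. The toy frames below are illustrative stand-ins, not the real `final_authors` and `texts`.

```python
import pandas as pd

# Toy stand-ins (values are made up for illustration)
authors = pd.DataFrame({"Name": ["A", "B", "C"], "n_excerpts": [30, 40, 60]})
texts = pd.DataFrame({"Author": ["A"] * 50 + ["B"] * 55 + ["C"] * 60})

counts = texts["Author"].value_counts()  # Series indexed by author name

# Name-keyed refresh: each row looks up its own author's count
refreshed = authors.assign(n_excerpts=authors["Name"].map(counts))

# Index-aligned update: rows are matched by position, not by name
by_position = counts.reset_index()
by_position.columns = ["Name", "n_excerpts"]
positional = authors.copy()
positional.update(by_position)

print(refreshed)
print(positional)
```

In the positional variant the `Name` column itself is overwritten in count order, so any other metadata columns on the same rows would end up attached to the wrong author.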


In [None]:
augmented_all = []

client = openai.OpenAI()

for _, author_row in tqdm(need_aug.iterrows(), total=need_aug.shape[0]):
    n_needed = 50 - author_row["n_excerpts"]
    df_aug = generate_and_validate(author_row, texts, n_needed, author_profiles, vectorizer, scaler, client)
    augmented_all.append(df_aug)

augmented_df = pd.concat(augmented_all, ignore_index=True)

# Add new excerpts
texts = pd.concat([texts, augmented_df], ignore_index=True)

# Update author excerpt counts, matching on author name
# (DataFrame.update aligns on the index, which would overwrite rows by position)
updated_counts = texts["Author"].value_counts()
final_authors = final_authors.assign(n_excerpts=final_authors["Name"].map(updated_counts))

# Make sure the "of_which_generated" column exists; it stays NaN for real texts
texts["of_which_generated"] = texts.get("of_which_generated", np.nan)

# Save final versions
texts.to_parquet('../Data/excerpts_df.parquet', index=False)
final_authors.to_csv('../Data/final_authors.csv', index=False)


  0%|          | 0/4 [00:00<?, ?it/s]

Generating excerpt for Charles Nodier... (Attempt 1)
Generated excerpt for Charles Nodier: Dès lors, la maison de madame Alberti prit, sans qu’on s’en aperçût, une physionomie nouvelle. Il se...
1518 tokens in generated excerpt
Delta scores for Charles Nodier: {'Alain-Fournier': np.float64(684.917461700297), 'Alexandre Dumas': np.float64(669.7759743701403), 'Alfred de Vigny': np.float64(699.8145589846887), 'Alphonse Daudet': np.float64(650.4940214335786), 'Anatole France': np.float64(611.0257418704662), 'André Gide': np.float64(669.8493259053164), 'Charles Nodier': np.float64(551.8406946660012), 'François Mauriac': np.float64(599.8796768482035), 'George Sand': np.float64(696.2068825461565), 'Georges Bernanos': np.float64(703.1687084132642), 'Gustave Flaubert': np.float64(613.3346308854663), 'Guy de Maupassant': np.float64(638.3237627725039), 'Henri Barbusse': np.float64(710.9329421727745), 'Honoré de Balzac': np.float64(704.3729619325727), 'Jules Renard': np.float64(703.4556478032977),

 25%|██▌       | 1/4 [14:35<43:46, 875.41s/it]

Generated excerpt for Charles Nodier: C’est ainsi que les jours s’écoulèrent dans un demi-repos, une attente inquiète, où le présent s’emp...
1494 tokens in generated excerpt
Delta scores for Charles Nodier: {'Alain-Fournier': np.float64(711.9001017550061), 'Alexandre Dumas': np.float64(710.6140497980475), 'Alfred de Vigny': np.float64(735.7433295293812), 'Alphonse Daudet': np.float64(670.209882292147), 'Anatole France': np.float64(637.9970458313135), 'André Gide': np.float64(705.6221235236378), 'Charles Nodier': np.float64(600.2539611649038), 'François Mauriac': np.float64(602.7206821139447), 'George Sand': np.float64(740.0998088091349), 'Georges Bernanos': np.float64(717.0560902355104), 'Gustave Flaubert': np.float64(635.5680392367825), 'Guy de Maupassant': np.float64(676.9460511182372), 'Henri Barbusse': np.float64(707.5574167652003), 'Honoré de Balzac': np.float64(735.9861038872733), 'Jules Renard': np.float64(730.8092447058127), 'Jules Vallès': np.float64(776.9470054261221), 'Jule
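
The acceptance decision behind these logs combines two checks: the target author must be the closest profile (or within a relative gap that grows with the attempt number), and the target Delta must not exceed the 90th percentile of the Deltas of that author's real excerpts. The sketch below restates that rule with the standard library only. The three scores are rounded from the first attempt in the truncated log above; the `real_deltas` list is made up, and `percentile` and `accept` are illustrative names.

```python
def percentile(values, q):
    """Percentile with linear interpolation (numpy's default method)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def accept(delta_scores, target, attempt, real_deltas):
    """Return True if the generated excerpt passes both style checks."""
    closest = min(delta_scores, key=delta_scores.get)
    closest_score = delta_scores[closest]
    target_score = delta_scores[target]
    tolerance = attempt / 100  # allowed relative gap grows with each attempt
    near = closest == target or (target_score - closest_score) / closest_score <= tolerance
    within = target_score <= percentile(real_deltas, 90)
    return near and within

# Scores visible in the log (rounded); the full dictionary is truncated there
scores = {"Charles Nodier": 551.84, "François Mauriac": 599.88, "Anatole France": 611.03}
real_deltas = [500, 520, 540, 560, 580, 600]  # hypothetical distribution

print(accept(scores, "Charles Nodier", attempt=1, real_deltas=real_deltas))
print(accept({"X": 100.0, "Y": 103.0}, "Y", attempt=1, real_deltas=[90, 100, 110]))
print(accept({"X": 100.0, "Y": 103.0}, "Y", attempt=5, real_deltas=[90, 100, 110]))
```

Because the tolerance is tied to the attempt counter, the filter becomes more permissive the longer generation runs for an author, which is worth keeping in mind when reviewing excerpts accepted late.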

In [None]:
# TODO: check the df.sample usage and the generated-excerpt ID scheme