<a href="https://colab.research.google.com/github/len-rtz/plus-facile/blob/main/finetuning-BARThez.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install transformers datasets torch huggingface_hub evaluate rouge_score



# Data Cleaning

In [5]:
import pandas as pd
import numpy as np
from datasets import Dataset
import re
from tqdm import tqdm

# Load TSV file
df = pd.read_csv('wivico_dataset_v2.tsv', sep='\t')
print(f"Original dataset size: {len(df)} pairs")

# Filter for simplification pairs (pair == 0)
simplification_df = df[df['pair (0: simplification, 1: complexification)'] == 0]
print(f"Simplification pairs: {len(simplification_df)}")

Original dataset size: 46525 pairs
Simplification pairs: 42478
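
Boolean filtering like the one above returns a new DataFrame that pandas may treat as a slice of `df`, so adding columns to it later can raise `SettingWithCopyWarning`. A minimal sketch on toy data (the rows are invented for illustration) of taking an explicit copy right after filtering:

```python
import pandas as pd

# Toy stand-in for the TSV file; the rows are invented for illustration
PAIR_COL = 'pair (0: simplification, 1: complexification)'
toy_df = pd.DataFrame({
    'wiki_sent': ['Phrase complexe A', 'Phrase complexe B', 'Phrase complexe C'],
    'viki_sent': ['Phrase simple A', 'Phrase simple B', 'Phrase simple C'],
    PAIR_COL: [0, 1, 0],
})

# .copy() makes the filtered frame independent of toy_df, so later
# column assignments are unambiguous
toy_simplification_df = toy_df[toy_df[PAIR_COL] == 0].copy()
toy_simplification_df['complex_len'] = toy_simplification_df['wiki_sent'].str.len()

print(len(toy_simplification_df))        # number of simplification pairs
print('complex_len' in toy_df.columns)   # the original frame is untouched
```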


In [6]:
# 1. Basic cleaning and filtering
def clean_text(text):
    if not isinstance(text, str):
        return ""
    # Replace multiple spaces with single space
    text = re.sub(r'\s+', ' ', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    return text.strip()

# Apply cleaning on an explicit copy so that the column assignments below
# do not trigger pandas' SettingWithCopyWarning
simplification_df = simplification_df.copy()
simplification_df['wiki_sent_clean'] = simplification_df['wiki_sent'].apply(clean_text)
simplification_df['viki_sent_clean'] = simplification_df['viki_sent'].apply(clean_text)


In [7]:
# 2. Length checks
simplification_df['complex_len'] = simplification_df['wiki_sent_clean'].apply(len)
simplification_df['simple_len'] = simplification_df['viki_sent_clean'].apply(len)

# Filter out pairs where either text is empty or too short (10 characters or fewer)
min_length = 10
simplification_df = simplification_df[(simplification_df['complex_len'] > min_length) &
                                      (simplification_df['simple_len'] > min_length)].copy()
print(f"After removing short texts: {len(simplification_df)} pairs")

After removing short texts: 42475 pairs




In [8]:
# 3. Simplification verification
# Check that the "simple" text is plausibly a simplification of the "complex" text,
# using basic metrics such as character and word length ratios.

simplification_df['word_count_complex'] = simplification_df['wiki_sent_clean'].apply(lambda x: len(x.split()))
simplification_df['word_count_simple'] = simplification_df['viki_sent_clean'].apply(lambda x: len(x.split()))
simplification_df['char_ratio'] = simplification_df['simple_len'] / simplification_df['complex_len']
simplification_df['word_ratio'] = simplification_df['word_count_simple'] / simplification_df['word_count_complex']

# Define reasonable thresholds for simplification
# Usually the simple text should be shorter, or at least not much longer
max_length_ratio = 1.5  # Simple text may be at most 50% longer than the complex text
min_length_ratio = 0.3  # Simple text may be at most 70% shorter than the complex text

# Only the character ratio is used for filtering; word_ratio is kept for inspection
simplification_df = simplification_df[(simplification_df['char_ratio'] <= max_length_ratio) &
                                      (simplification_df['char_ratio'] >= min_length_ratio)]
print(f"After simplification ratio check: {len(simplification_df)} pairs")

After simplification ratio check: 40099 pairs
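
To make the thresholds concrete, here is a small sketch of the same character-ratio rule applied to single pairs. The helper name `passes_ratio_check` and the toy strings are illustrative assumptions, not part of the original pipeline:

```python
MAX_LENGTH_RATIO = 1.5  # same thresholds as in the cell above
MIN_LENGTH_RATIO = 0.3

def passes_ratio_check(complex_text, simple_text):
    """Return True if len(simple) / len(complex) lies within [0.3, 1.5]."""
    char_ratio = len(simple_text) / len(complex_text)
    return MIN_LENGTH_RATIO <= char_ratio <= MAX_LENGTH_RATIO

complex_text = "x" * 100
print(passes_ratio_check(complex_text, "x" * 70))   # ratio 0.7, kept
print(passes_ratio_check(complex_text, "x" * 20))   # ratio 0.2, too short
print(passes_ratio_check(complex_text, "x" * 160))  # ratio 1.6, too long
```

On the real data this rule removed 42475 - 40099 = 2376 pairs.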


In [9]:
# 4. Content similarity check
# Ensure that simple and complex texts are actually related
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_content_similarity(df, sample_size=1000):
    sample_df = df.sample(min(sample_size, len(df)))

    vectorizer = TfidfVectorizer()
    all_texts = list(sample_df['wiki_sent_clean']) + list(sample_df['viki_sent_clean'])
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    similarities = []
    n = len(sample_df)
    for i in range(n):
        sim = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix[i+n:i+n+1])[0][0]
        similarities.append(sim)

    return similarities

similarities = compute_content_similarity(simplification_df)
avg_similarity = np.mean(similarities)
print(f"Average content similarity between complex and simple texts: {avg_similarity:.4f}")

Average content similarity between complex and simple texts: 0.5777


In [10]:
# Count pairs with very low similarity in the sample (nothing is removed from the dataset here)
similarity_threshold = 0.3
low_similarity_count = sum(s < similarity_threshold for s in similarities)
print(f"Pairs with similarity below {similarity_threshold}: {low_similarity_count} ({low_similarity_count/len(similarities)*100:.2f}%)")

Pairs with similarity below 0.3: 64 (6.40%)
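
The cell above only counts low-similarity pairs in the 1,000-pair sample; `simplification_df` itself is not filtered. Removing such pairs would need a score for every pair. The sketch below uses a plain bag-of-words cosine similarity from the standard library as a stand-in for the TF-IDF score, so its values will differ from the ones reported above. `bow_cosine` and `filter_pairs` are hypothetical helper names:

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between lowercase bag-of-words count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold=0.3):
    """Keep (complex, simple) pairs whose similarity reaches the threshold."""
    return [(c, s) for c, s in pairs if bow_cosine(c, s) >= threshold]

toy_pairs = [
    ("Le chat noir dort sur le canapé", "Le chat dort sur le canapé"),
    ("Le chat noir dort sur le canapé", "Paris est une grande ville"),
]
kept = filter_pairs(toy_pairs)
print(len(kept))  # only the related pair survives
```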


In [11]:
# 5. Create final cleaned dataset
final_df = simplification_df[['wiki_sent_clean', 'viki_sent_clean']].rename(
    columns={'wiki_sent_clean': 'complex', 'viki_sent_clean': 'simple'})

print(f"\nFinal dataset size: {len(final_df)} pairs")


Final dataset size: 40099 pairs


In [12]:
# Save a sample of the data to inspect
final_df.sample(10).to_csv('sample_cleaned_data.csv', index=False)

In [13]:
# Inspect dataset
final_df.head()

Unnamed: 0,complex,simple
0,Catharanthus roseus La Pervenche de Madagascar...,"La pervenche de Madagascar (nom commun), ou Ca..."
1,"Claude de France (Romorantin, 13 octobre 1499 ...",Claude de France est née le 13 octobre 1499 à ...
2,"Hippocrate de Kos Hippocrate de Kos, ou simple...",Hippocrate de Cos (surnommé Hippocrate le Gran...
3,"‌L'ASM Clermont Auvergne, anciennement Associa...","L'ASM Clermont Auvergne, anciennement AS Montf..."
4,"‌L'ASM Clermont Auvergne, anciennement Associa...","L'ASM Clermont Auvergne, anciennement AS Montf..."
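
The `clean_text` helper defined earlier collapses whitespace before it strips tags, so removing a tag that sits between two spaces can leave a double space behind. Wiki-derived text can also carry invisible characters such as the zero-width non-joiner (U+200C). A minimal sketch of a variant, here called `clean_text_v2` (a hypothetical name, not part of the original notebook), that strips tags and zero-width characters first and normalizes whitespace last:

```python
import re

# Zero-width space, non-joiner, joiner, and byte-order mark
ZERO_WIDTH = re.compile('[\u200b\u200c\u200d\ufeff]')
HTML_TAG = re.compile(r'<.*?>')
WHITESPACE = re.compile(r'\s+')

def clean_text_v2(text):
    """Hypothetical variant of clean_text: tags and invisible characters first, whitespace last."""
    if not isinstance(text, str):
        return ""
    text = HTML_TAG.sub('', text)      # remove HTML tags
    text = ZERO_WIDTH.sub('', text)    # drop invisible characters
    text = WHITESPACE.sub(' ', text)   # collapse runs of whitespace
    return text.strip()

print(clean_text_v2("Un <br> mot"))
print(clean_text_v2("\u200cL'ASM Clermont Auvergne"))
```

Switching to such a variant would change the pair counts, so the numbers reported in this notebook correspond to the original `clean_text`.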


In [14]:
# Statistics summary
print("\nData Statistics:")
print(f"Average complex text length: {final_df['complex'].str.len().mean():.2f} characters")
print(f"Average simple text length: {final_df['simple'].str.len().mean():.2f} characters")
print(f"Average complex words: {final_df['complex'].apply(lambda x: len(x.split())).mean():.2f}")
print(f"Average simple words: {final_df['simple'].apply(lambda x: len(x.split())).mean():.2f}")


Data Statistics:
Average complex text length: 238.12 characters
Average simple text length: 167.75 characters
Average complex words: 38.49
Average simple words: 28.07


In [15]:
# Convert to HuggingFace Dataset
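# preserve_index=False avoids carrying the filtered pandas index along as an extra column
dataset = Dataset.from_pandas(final_df, preserve_index=False)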

# Finetuning

In [16]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "moussaKam/barthez"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [17]:
# Check tokenizer configuration
print(f"Pad token: {tokenizer.pad_token}, ID: {tokenizer.pad_token_id}")
print(f"EOS token: {tokenizer.eos_token}, ID: {tokenizer.eos_token_id}")

Pad token: <pad>, ID: 1
EOS token: </s>, ID: 2


In [18]:
# Preprocessing function
def preprocess_function(examples):
    inputs = examples["complex"]
    targets = examples["simple"]

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=256,  # Reduced from 512
        truncation=True,
        padding="max_length"
    )

    # Tokenize targets; text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=targets,
        max_length=256,  # Reduced from 512
        truncation=True,
        padding="max_length"
    )

    # Replace pad token id with -100 explicitly
    labels_with_ignore_index = []
    for label in labels["input_ids"]:
        labels_with_ignore_index.append(
            [-100 if token == tokenizer.pad_token_id else token for token in label]
        )

    model_inputs["labels"] = labels_with_ignore_index
    return model_inputs

# Tokenize first
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names  # Remove original columns
)

def check_tokenized_data(dataset):
    for example in dataset:
        if any(isinstance(token_id, (float, str)) or token_id is None for token_id in example["input_ids"]):
            print(f"Corrupt example found: {example}")
            return False
    return True

print("Checking tokenized dataset integrity...")
assert check_tokenized_data(tokenized_dataset), "Tokenized dataset contains invalid values!"

# Split after tokenization
tokenized_splits = tokenized_dataset.train_test_split(test_size=0.2, seed=42)


# Check splits
print(f"Training examples: {len(tokenized_splits['train'])}")
print(f"Validation examples: {len(tokenized_splits['test'])}")


Checking tokenized dataset integrity...
Training examples: 32079
Validation examples: 8020
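
The label post-processing in `preprocess_function` matters because the loss ignores positions labeled -100, so padding does not contribute to training. A minimal sketch of that masking step on toy token IDs (apart from the pad ID 1 and EOS ID 2 printed earlier, the IDs are made up):

```python
PAD_TOKEN_ID = 1      # pad token ID, as printed by the tokenizer check
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad token IDs with -100 so the loss skips them."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

# Toy sequence: three made-up word IDs, EOS (2), then padding (1)
toy_labels = [57, 4521, 87, 2, 1, 1, 1]
print(mask_padding(toy_labels))  # [57, 4521, 87, 2, -100, -100, -100]
```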


In [19]:
# Define Training arguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",           # Save checkpoint each epoch
    learning_rate=5e-5,
    per_device_train_batch_size=2,   # Reduced from 8
    per_device_eval_batch_size=2,    # Reduced from 8
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,          # Explicit beam search
    push_to_hub=False,
    fp16=False,
    gradient_accumulation_steps=4,   # Effectively batch size of 8
    max_grad_norm=1.0,               # Gradient clipping
    load_best_model_at_end=True,
    metric_for_best_model="rouge1"
)
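
The batch settings above interact: with gradient accumulation, the optimizer only steps once every few batches. A rough back-of-the-envelope sketch, assuming a single GPU and the 32,079 training examples from the split above (the step count logged by the trainer may differ slightly depending on how the last partial batch is handled):

```python
import math

train_examples = 32079            # size of the training split
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1                      # assumption: one GPU
num_epochs = 3

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_gpus))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

print(effective_batch_size)            # 8
print(batches_per_epoch)               # 16040
print(updates_per_epoch)               # 4010
print(updates_per_epoch * num_epochs)  # 12030
```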



In [20]:
# Define trainer and train
from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq
import numpy as np
import evaluate

# Metric for evaluation
metric = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Ensure valid token IDs
    predictions = np.where(predictions < 0, tokenizer.pad_token_id, predictions)
    labels = np.where(labels < 0, tokenizer.pad_token_id, labels)

    # Decode predictions safely
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Clean up
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]

    # Compute metrics
    result = metric.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )

    return {k: round(v * 100, 4) for k, v in result.items()}



# Data collator: builds batches and pads labels with -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8 if training_args.fp16 else None
)

# Initialize trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_splits['train'],
    eval_dataset=tokenized_splits['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # `tokenizer=` is not passed: recent transformers releases deprecate it in favor of `processing_class=`
)

# Check GPU memory before training
print("\nChecking GPU memory before training:")
!nvidia-smi

# Train the model
trainer.train()

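# Save the fine-tuned model; the tokenizer is saved separately because it was not passed to the trainer
trainer.save_model("./french_simplification_model")
tokenizer.save_pretrained("./french_simplification_model")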



Checking GPU memory before training:
Fri Mar  7 13:21:56 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P0             25W /   70W |     692MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+





Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,1.3667,1.166508,56.5367,40.8966,50.9067,50.924
2,1.1301,1.059562,57.8214,43.0369,52.6009,52.6208
3,1.0128,1.028692,58.9092,44.2144,53.5878,53.5968
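
The Rouge1 column reports unigram overlap between generated and reference simplifications, scaled to 0-100 by `compute_metrics`. As a simplified illustration of what is being measured (whitespace tokenization and no stemming, so it will not reproduce the `rouge_score` numbers exactly), here is a ROUGE-1 F1 sketch with a hypothetical helper `rouge1_f1`:

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Two of three unigrams match, so precision = recall = F1 = 2/3
score = rouge1_f1("le chat dort", "le chat mange")
print(round(score * 100, 4))
```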


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


In [22]:
# Test the model on a sample
test_sentence = "Pour justifier le nombre élevé de victimes palestiniennes, Israël affirme que la responsabilité en incombe aux membres du Hamas, qui opéreraient au mépris de la vie de leurs compatriotes. Certes, le droit international estime que se protéger derrière des non-combattants est un crime de guerre. Mais une question demeure : combien de civils peut-on tuer pour éliminer un seul ennemi ?"

inputs = tokenizer(test_sentence, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_length=256)
simplified = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {test_sentence}")
print(f"Simplified: {simplified}")

Original: Pour justifier le nombre élevé de victimes palestiniennes, Israël affirme que la responsabilité en incombe aux membres du Hamas, qui opéreraient au mépris de la vie de leurs compatriotes. Certes, le droit international estime que se protéger derrière des non-combattants est un crime de guerre. Mais une question demeure : combien de civils peut-on tuer pour éliminer un seul ennemi ?
Simplified: Pour justifier le nombre élevé de victimes palestiniennes, Isral affirme que la responsabilité en incombe aux membres du Hamas, qui opéreraient au mépris de la vie de leurs compatriotes.
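
In this example the output reproduces the first sentence of the input almost verbatim and drops the rest, and `Israël` comes back as `Isral`. A small sketch for spotting such cases automatically, using two hypothetical helpers: `novel_words` lists output words that are absent from the source, and `length_ratio` compares character lengths:

```python
def novel_words(source, output):
    """Return output words (whitespace-separated) that never occur in the source."""
    source_words = set(source.split())
    return [word for word in output.split() if word not in source_words]

def length_ratio(source, output):
    """Character length of the output relative to the source."""
    return len(output) / len(source)

# Shortened stand-ins for the test sentence and the model output above
source = "Israël affirme que la responsabilité en incombe au Hamas. Mais une question demeure."
output = "Isral affirme que la responsabilité en incombe au Hamas."

print(novel_words(source, output))
print(round(length_ratio(source, output), 2))
```

A word such as `Isral` that is missing from the source may point to a character that does not survive tokenization and decoding. That is a guess worth checking, for instance by decoding `tokenizer(test_sentence)["input_ids"]` and comparing it with the input, not a confirmed diagnosis.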
