<a href="https://colab.research.google.com/github/len-rtz/plus-facile/blob/main/finetuning-BARThez.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install required libraries
!pip install transformers datasets evaluate accelerate



In [3]:
# Download BARThez model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initialize model and tokenizer
model_name = "moussaKam/barthez"  # French BART model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [15]:
import pandas as pd
import numpy as np
from datasets import Dataset
import re
from tqdm import tqdm

# Load TSV file
df = pd.read_csv('wivico_dataset_v2.tsv', sep='\t')
print(f"Original dataset size: {len(df)} pairs")

# Filter for simplification pairs (pair == 0)
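# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
simplification_df = df[df['pair (0: simplification, 1: complexification)'] == 0].copy()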
print(f"Simplification pairs: {len(simplification_df)}")

Original dataset size: 46525 pairs
Simplification pairs: 42478
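
Later cells add new columns to the filtered frame. A minimal sketch on toy data of why taking an explicit copy after a boolean-mask filter keeps those assignments unambiguous. The frame and the shortened column name `pair` are made up for illustration and are not the real TSV schema.

```python
import pandas as pd

# Toy stand-in for the TSV; the column names are hypothetical and shortened.
df = pd.DataFrame({
    "pair": [0, 1, 0],
    "wiki_sent": ["Phrase  complexe A", "Phrase B", "Phrase  complexe C"],
})

# .copy() makes the filtered frame independent of df.
subset = df[df["pair"] == 0].copy()

# Adding a column now only touches the copy.
subset["wiki_sent_clean"] = subset["wiki_sent"].str.replace(r"\s+", " ", regex=True)

print(subset["wiki_sent_clean"].tolist())  # ['Phrase complexe A', 'Phrase complexe C']
print("wiki_sent_clean" in df.columns)     # False
```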


In [17]:
# 1. Basic cleaning and filtering
def clean_text(text):
    if not isinstance(text, str):
        return ""
    # Replace multiple spaces with single space
    text = re.sub(r'\s+', ' ', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    return text.strip()

# Apply cleaning
simplification_df['wiki_sent_clean'] = simplification_df['wiki_sent'].apply(clean_text)
simplification_df['viki_sent_clean'] = simplification_df['viki_sent'].apply(clean_text)
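
A quick check of `clean_text` on hand-written inputs, as a sketch. Because whitespace is collapsed before tags are removed, a tag surrounded by spaces leaves a double space behind. `clean_text_tags_first` is a hypothetical variant, not part of the notebook, that swaps the two steps. The counts reported in the following cells come from the original order.

```python
import re

def clean_text(text):
    """Same steps as the notebook cell: collapse whitespace, then strip tags."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', '', text)
    return text.strip()

def clean_text_tags_first(text):
    """Hypothetical variant: strip tags first, then collapse whitespace."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

raw = "Paris <b> est </b> une  ville"
print(repr(clean_text(raw)))             # 'Paris  est  une ville'
print(repr(clean_text_tags_first(raw)))  # 'Paris est une ville'

# Missing values read by pandas arrive as float NaN and map to an empty string.
print(repr(clean_text(float("nan"))))    # ''
```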


In [18]:
# 2. Length checks
simplification_df['complex_len'] = simplification_df['wiki_sent_clean'].apply(len)
simplification_df['simple_len'] = simplification_df['viki_sent_clean'].apply(len)

# Filter out empty pairs or too short texts
min_length = 10
# Strictly greater than: texts of exactly min_length characters are dropped too
simplification_df = simplification_df[(simplification_df['complex_len'] > min_length) &
                                      (simplification_df['simple_len'] > min_length)].copy()
print(f"After removing short texts: {len(simplification_df)} pairs")

After removing short texts: 42475 pairs




In [19]:
# 3. Simplification verification
# Check that the "simple" text is plausibly a simplification of the "complex" text.
# Length ratios serve as a rough proxy; only char_ratio is used for filtering below,
# word_ratio is computed for inspection.

simplification_df['word_count_complex'] = simplification_df['wiki_sent_clean'].apply(lambda x: len(x.split()))
simplification_df['word_count_simple'] = simplification_df['viki_sent_clean'].apply(lambda x: len(x.split()))
simplification_df['char_ratio'] = simplification_df['simple_len'] / simplification_df['complex_len']
simplification_df['word_ratio'] = simplification_df['word_count_simple'] / simplification_df['word_count_complex']

# Define reasonable thresholds for simplification
# The simple text is usually shorter than the complex one, or at least not much longer
max_length_ratio = 1.5  # Simple text may be at most 50% longer than the complex text
min_length_ratio = 0.3  # Simple text may be at most 70% shorter than the complex text

simplification_df = simplification_df[(simplification_df['char_ratio'] <= max_length_ratio) &
                                      (simplification_df['char_ratio'] >= min_length_ratio)].copy()
print(f"After simplification ratio check: {len(simplification_df)} pairs")

After simplification ratio check: 40099 pairs
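
To make the two thresholds concrete, here is a small standalone sketch of the same character-ratio rule applied to synthetic strings. The helper names `char_ratio` and `passes_ratio_check` are illustrative and do not appear in the notebook.

```python
MAX_LENGTH_RATIO = 1.5  # same values as max_length_ratio / min_length_ratio above
MIN_LENGTH_RATIO = 0.3

def char_ratio(complex_text: str, simple_text: str) -> float:
    """Length of the simple text divided by length of the complex text."""
    return len(simple_text) / len(complex_text)

def passes_ratio_check(complex_text: str, simple_text: str) -> bool:
    """Keep the pair only if the ratio lies in [MIN_LENGTH_RATIO, MAX_LENGTH_RATIO]."""
    return MIN_LENGTH_RATIO <= char_ratio(complex_text, simple_text) <= MAX_LENGTH_RATIO

complex_text = "x" * 100  # synthetic 100-character "complex" sentence
results = {}
for simple_len in (25, 30, 80, 150, 160):
    simple_text = "y" * simple_len
    results[simple_len] = passes_ratio_check(complex_text, simple_text)
    print(simple_len, char_ratio(complex_text, simple_text), results[simple_len])
```

With a 100-character complex sentence, simple sentences of 30 to 150 characters pass, while 25 (ratio 0.25) and 160 (ratio 1.6) are rejected.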


In [20]:
# 4. Content similarity check
# Ensure that simple and complex texts are actually related
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_content_similarity(df, sample_size=1000):
    # Estimate on a random sample of pairs; pass random_state to df.sample for a reproducible figure
    sample_df = df.sample(min(sample_size, len(df)))

    vectorizer = TfidfVectorizer()
    all_texts = list(sample_df['wiki_sent_clean']) + list(sample_df['viki_sent_clean'])
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    similarities = []
    n = len(sample_df)
    for i in range(n):
        sim = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix[i+n:i+n+1])[0][0]
        similarities.append(sim)

    return similarities

similarities = compute_content_similarity(simplification_df)
avg_similarity = np.mean(similarities)
print(f"Average content similarity between complex and simple texts: {avg_similarity:.4f}")

Average content similarity between complex and simple texts: 0.5671


In [22]:
# Count pairs with very low similarity in the sample (no rows are removed here)
similarity_threshold = 0.3
low_similarity_count = sum(s < similarity_threshold for s in similarities)
print(f"Pairs with similarity below {similarity_threshold}: {low_similarity_count} ({low_similarity_count/len(similarities)*100:.2f}%)")

Pairs with similarity below 0.3: 73 (7.30%)
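
The cell above only counts low-similarity pairs within the sample. If one wanted to actually drop such pairs, a per-pair score is needed. The following is a minimal sketch under stated assumptions: it uses cosine similarity over raw word counts instead of TF-IDF, so its scores are not comparable to the figures above, and the helper `bow_cosine`, the example sentences, and the reuse of the 0.3 threshold are all illustrative.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over raw word counts (a stand-in for TF-IDF)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Made-up (complex, simple) pairs: one related, one unrelated.
pairs = [
    ("Le chat dort sur le canapé du salon.", "Le chat dort sur le canapé."),
    ("La photosynthèse convertit la lumière en énergie.", "Il pleut souvent en Bretagne."),
]

similarity_threshold = 0.3
kept = [(c, s) for c, s in pairs if bow_cosine(c, s) >= similarity_threshold]
print(len(kept))  # 1
```

The unrelated pair shares only the word "en" and falls below the threshold, so a single pair is kept.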


In [26]:
# 5. Create final cleaned dataset
final_df = simplification_df[['wiki_sent_clean', 'viki_sent_clean']].rename(
    columns={'wiki_sent_clean': 'complex', 'viki_sent_clean': 'simple'})

print(f"\nFinal dataset size: {len(final_df)} pairs")


Final dataset size: 40099 pairs
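
The counts printed so far can be lined up to see how much each step removes. This sketch hard-codes the numbers from the outputs above; the stage labels are informal.

```python
# (stage label, number of pairs) taken from the printed outputs above
stages = [
    ("original", 46525),
    ("simplification pairs", 42478),
    ("after length filter", 42475),
    ("after ratio filter", 40099),
]

total = stages[0][1]
previous = total
drops = []
for name, count in stages:
    drops.append(previous - count)
    print(f"{name:<22} {count:>6}  dropped: {previous - count:>5}  retained: {count / total:.1%}")
    previous = count
```

The length filter removes 3 pairs and the ratio filter 2,376, so the ratio check is the step that shapes the final dataset most after the initial selection of simplification pairs.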


In [27]:
# Save a sample of the data to inspect
final_df.sample(10).to_csv('sample_cleaned_data.csv', index=False)

In [28]:
# Statistics summary
print("\nData Statistics:")
print(f"Average complex text length: {final_df['complex'].str.len().mean():.2f} characters")
print(f"Average simple text length: {final_df['simple'].str.len().mean():.2f} characters")
print(f"Average complex words: {final_df['complex'].apply(lambda x: len(x.split())).mean():.2f}")
print(f"Average simple words: {final_df['simple'].apply(lambda x: len(x.split())).mean():.2f}")


Data Statistics:
Average complex text length: 238.12 characters
Average simple text length: 167.75 characters
Average complex words: 38.49
Average simple words: 28.07
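
As a worked reading of these averages, the sketch below derives compression ratios from the four printed numbers. These are ratios of averages, not averages of per-pair ratios, so they only approximate the typical pair.

```python
# Averages copied from the statistics output above
avg_complex_chars, avg_simple_chars = 238.12, 167.75
avg_complex_words, avg_simple_words = 38.49, 28.07

char_compression = avg_simple_chars / avg_complex_chars
word_compression = avg_simple_words / avg_complex_words

print(f"Character compression: {char_compression:.2f}")  # 0.70
print(f"Word compression: {word_compression:.2f}")       # 0.73

# Characters per word, spaces included, as a rough hint at word length
print(f"Complex chars/word: {avg_complex_chars / avg_complex_words:.2f}")  # 6.19
print(f"Simple chars/word: {avg_simple_chars / avg_simple_words:.2f}")     # 5.98
```

On average the simple side is about 30% shorter in characters and about 27% shorter in words.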
