# Training - Fine-tune mT5 for text summarization
Course here: https://huggingface.co/learn/nlp-course/chapter7/5?fw=pt#models-for-text-summarization


## Docker image to use: PyTorch with CUDA < 11.7

Goal: fine-tune a model for text summarization in French and English. I use Google's mT5 model.

Install packages

In [1]:
!pip install transformers datasets torch nltk rouge_score evaluate scikit-learn ydata-profiling sentencepiece protobuf --quiet
!pip install transformers[torch] -q
!pip install accelerate -U -q


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import nltk
import evaluate


nltk.download("punkt")
rouge_score = evaluate.load("rouge")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [4]:
import torch

print("Torch version: ", torch.__version__)
print("Cuda is available: ", torch.cuda.is_available())
print(torch.version.cuda)

Torch version:  2.0.0
Cuda is available:  True
11.7


Put the notebook in a logging mode so we can save the output to a file. This is useful for debugging and sharing the results of the notebook. The log file will be saved in the same directory as the notebook. 

In [5]:
import sys
import logging

nblog = open("nb-finetune-mt5.log", "a+")
sys.stdout.echo = nblog
sys.stderr.echo = nblog

get_ipython().log.handlers[0].stream = nblog
get_ipython().log.setLevel(logging.INFO)

%autosave 5


Autosaving every 5 seconds


Enable text wrapping so we don't have to scroll horizontally, and define a function that flushes the CUDA cache.

In [6]:
from IPython.display import HTML, display

def set_css():
    display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

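def clear_cache():
    # Drop the global model reference so its GPU memory can actually be released,
    # then ask PyTorch to return its cached blocks to the driver
    global model
    if torch.cuda.is_available():
        model = None
        torch.cuda.empty_cache()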

Enter your Hugging Face token

In [7]:
from huggingface_hub import notebook_login

notebook_login()


## Load model

Given my goal and my limited compute resources, I will use a small model.

In [8]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_checkpoint = "google/mt5-small"  # or "thekenken/mt5small-finetuned-summary-en-fr" to start from the already fine-tuned model
 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/833 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/416 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/802 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

## Load datasets from the Hugging Face Hub: XSum (English) and MLSUM (French)

In [9]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Load the XSum (English) and MLSUM (French) datasets
#xsum_en_datasets = load_dataset("xsum")
#mlsum_fr_datasets = load_dataset("mlsum", "fr")

# Load XSum; the map keeps 'document' and 'summary' unchanged
xsum_en_datasets = load_dataset("xsum").map(lambda example: {"document": example["document"], "summary": example["summary"]})
xsum_en_datasets = xsum_en_datasets.remove_columns(['id'])  # Drop the 'id' column

# Load MLSUM and copy 'text' into 'document' so both datasets share the same column names
mlsum_fr_datasets = load_dataset("mlsum", "fr").map(lambda example: {"document": example["text"], "summary": example["summary"]})
mlsum_fr_datasets = mlsum_fr_datasets.remove_columns(['text', 'topic', 'url', 'title', 'date'])  # Drop the unused columns

print(xsum_en_datasets)
print(mlsum_fr_datasets)

Downloading builder script:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

Downloading builder script:   0%|          | 0.00/3.72k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/12.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.69G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/81.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/80.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/392902 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16059 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/15828 [00:00<?, ? examples/s]

Map:   0%|          | 0/392902 [00:00<?, ? examples/s]

Map:   0%|          | 0/16059 [00:00<?, ? examples/s]

Map:   0%|          | 0/15828 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 11334
    })
})
DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 392902
    })
    validation: Dataset({
        features: ['summary', 'document'],
        num_rows: 16059
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 15828
    })
})


In [10]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Document: {example['document']}'")
        print(f"'>> Summary: {example['summary']}'")


show_samples(mlsum_fr_datasets)


'>> Document: Willem-Alexander, incognito. Jason Reed / REUTERS Willem-Alexander, devenu roi des Pays-Bas en 2013, exerçait un autre métier à temps partiel depuis vingt et un ans : pilote de ligne pour la compagnie néerlandaise KLM. Le monarque a révélé sa double vie au journal De Telegraaf, le 18 mai. Il raconte qu’il volait au moins deux fois par mois sur des Fokker 70 couvrant des courtes distances en Europe du Nord. Il officiait en tant que copilote mais ne révélait jamais sa vraie identité à ses passagers. « L’avantage, c’est que je pouvais toujours les accueillir au nom du capitaine et de l’équipage. Je n’étais pas obligé de dire mon nom. » Il se souvient qu’avant le 11-Septembre, quand l’accès au cockpit était encore autorisé aux plus curieux, « des gens venaient régulièrement jeter un coup d’œil et étaient surpris et contents de m’y voir assis ». Mais tout compte fait, sur deux décennies, peu de gens ont reconnu sa voix – « de toute façon, la plupart des gens n’écoutent pas » 

## Concatenate datasets 

In [13]:
#from sklearn.model_selection import train_test_split
from datasets import concatenate_datasets, DatasetDict

# Fraction of each dataset to keep (40%).
# Note: the row counts printed below are the full totals (e.g. 392902 + 204045 = 596947
# for train), so that output corresponds to a run that kept 100% of the data.
fraction_to_keep = 0.4

# New DatasetDict that will hold the combined data
summary_dataset = DatasetDict()

for split in mlsum_fr_datasets.keys():
    # Total number of examples in mlsum_fr_datasets for this split
    total_examples_mlsum = len(mlsum_fr_datasets[split])
    # Total number of examples in xsum_en_datasets for this split
    total_examples_xsum = len(xsum_en_datasets[split])

    # Number of examples to keep, given the fraction
    num_to_keep_mlsum = int(fraction_to_keep * total_examples_mlsum)
    num_to_keep_xsum = int(fraction_to_keep * total_examples_xsum)

    # Random sample of MLSUM examples: shuffle, then take the first n
    kept_data_mlsum = mlsum_fr_datasets[split].shuffle(seed=42).select(range(num_to_keep_mlsum))

    # Random sample of XSum examples: shuffle, then take the first n
    kept_data_xsum = xsum_en_datasets[split].shuffle(seed=42).select(range(num_to_keep_xsum))

    # Concatenate the kept subsets
    summary_dataset[split] = concatenate_datasets([kept_data_mlsum, kept_data_xsum])

    # Shuffle so that French and English examples are interleaved
    summary_dataset[split] = summary_dataset[split].shuffle(seed=42)

# Show a few examples
#show_samples(summary_dataset)


summary_dataset

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 596947
    })
    validation: Dataset({
        features: ['summary', 'document'],
        num_rows: 27391
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 27162
    })
})
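
As a sanity check on the subsampling arithmetic, here is a small sketch that recomputes how many rows each split would contribute for a given fraction. The split sizes are copied from the dataset printouts above; the helper names `rows_kept` and `combined_sizes` are made up for illustration and are not part of the notebook.

```python
# Split sizes copied from the DatasetDict printouts earlier in this notebook.
MLSUM_FR = {"train": 392902, "validation": 16059, "test": 15828}
XSUM_EN = {"train": 204045, "validation": 11332, "test": 11334}


def rows_kept(total, fraction):
    """Hypothetical helper: mirrors int(fraction_to_keep * total) from the cell above."""
    return int(fraction * total)


def combined_sizes(fraction):
    """Rows per split after subsampling both corpora and concatenating them."""
    return {
        split: rows_kept(MLSUM_FR[split], fraction) + rows_kept(XSUM_EN[split], fraction)
        for split in MLSUM_FR
    }


for fraction in (0.4, 1.0):
    print(fraction, combined_sizes(fraction))
```

Running it for both fractions makes it easy to compare the expected sizes against the row counts printed by the notebook.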

In [14]:
from datasets import DatasetDict

max_input_length = 600
max_target_length = 300

def clean_and_preprocess_dataset(dataset_dict, text_column, summary_column, max_input_length, max_target_length):
    # New DatasetDict that will hold the preprocessed data
    cleaned_dataset_dict = DatasetDict()
    
    for split in dataset_dict.keys():
        # Truncate both columns by characters (string slicing), not by tokens
        cleaned_data = dataset_dict[split].map(lambda example: {
            text_column: example[text_column][:max_input_length],
            summary_column: example[summary_column][:max_target_length]
        })
        
        # Drop examples whose document or summary is empty
        cleaned_data = cleaned_data.filter(lambda example: len(example[text_column]) > 0 and len(example[summary_column]) > 0)

        # Store the preprocessed split in the new DatasetDict
        cleaned_dataset_dict[split] = cleaned_data

    return cleaned_dataset_dict

# Apply the function to the combined DatasetDict
summary_dataset = clean_and_preprocess_dataset(summary_dataset, "document", "summary", max_input_length, max_target_length)
show_samples(summary_dataset)
summary_dataset



Map:   0%|          | 0/596947 [00:00<?, ? examples/s]

Filter:   0%|          | 0/596947 [00:00<?, ? examples/s]

Map:   0%|          | 0/27391 [00:00<?, ? examples/s]

Filter:   0%|          | 0/27391 [00:00<?, ? examples/s]

Map:   0%|          | 0/27162 [00:00<?, ? examples/s]

Filter:   0%|          | 0/27162 [00:00<?, ? examples/s]


'>> Document: The infection is suspected of leading to thousands of babies being born with underdeveloped brains in Brazil.
"I trust that the British Olympic Association will keep us best informed," said Muir, 22.
"I'm just going to keep on training and hopefully come the summer time, everything will be OK."
Veterinary medicine student Muir said she would continue to monitor the situation, with Zika having been declared a global public health emergency by the World Health Organisation.
"Because of my background with my studies, I know a little bit about it," she explained.
Last year, Muir finished fourth in'
'>> Summary: Scottish middle-distance runner Laura Muir insists she has no concerns at this stage about the Zika virus ahead of this summer's Olympic Games in Rio.'

'>> Document: Pour ce premier samedi du mois, Paris Première propose son programme érotique "à dimension culturelle" : un documentaire inédit de la collection "Sex in the World's Cities" consacré à la capitale argenti

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 596919
    })
    validation: Dataset({
        features: ['summary', 'document'],
        num_rows: 27386
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 27161
    })
})
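
Note that the slicing in `clean_and_preprocess_dataset` counts characters, not tokens, so a document can end in the middle of a word (the French sample above stops at "argenti"). The sketch below contrasts the plain slice with one possible alternative that backs up to the previous space. `truncate_at_word_boundary` is a hypothetical helper, not something the notebook uses, and the sample sentence is invented.

```python
def truncate_chars(text, max_chars):
    """What the notebook does: a plain character slice."""
    return text[:max_chars]


def truncate_at_word_boundary(text, max_chars):
    """Hypothetical alternative: never end in the middle of a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the next character is not whitespace, the slice ended inside a word:
    # back up to the last space (when there is one).
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip()


sample = "The quick brown fox jumps over the lazy dog"  # invented example text
print(repr(truncate_chars(sample, 12)))
print(repr(truncate_at_word_boundary(sample, 12)))
```

Since the tokenizer call in the next cell truncates by tokens anyway, the character slice mainly serves to shorten very long articles before tokenization.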

In [15]:
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["document"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["summary"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = summary_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/596919 [00:00<?, ? examples/s]

Map:   0%|          | 0/27386 [00:00<?, ? examples/s]

Map:   0%|          | 0/27161 [00:00<?, ? examples/s]

## Fine-tuning using Trainer - Training

### Configuration

In [16]:
from transformers import Seq2SeqTrainingArguments

# Training configuration
batch_size = 8
num_train_epochs = 5
logging_steps = len(tokenized_datasets['train']) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-summary-en-fr",
    evaluation_strategy="epoch",
    learning_rate=3.6e-5,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,  # Logging frequency: roughly once per epoch
    push_to_hub=True,
)
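
To make the step counts behind this configuration concrete, here is a short worked calculation. It assumes a single device and no gradient accumulation (neither is stated explicitly in the notebook); the number of training rows is taken from the `summary_dataset` printout after cleaning.

```python
import math

num_examples = 596919   # training rows after cleaning (from the printout above)
batch_size = 8
num_train_epochs = 5

# The last, partially filled batch still counts as an optimizer step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the configuration cell: integer division.
logging_steps = num_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 74615 373075 74614
```

Under these assumptions the total agrees with the `global_step=373075` reported by `trainer.train()` further down, and `logging_steps` is one step short of a full epoch.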


In [17]:
import numpy as np
from rouge_score import rouge_scorer
from nltk.tokenize import sent_tokenize

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge_score.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

def compute_metrics_2(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Scale the aggregated scores to percentages
    result = {key: value * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}


def compute_metrics_chatgpt(predictions, references):
    # Initialize the ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Calculate ROUGE scores for each prediction-reference pair
    rouge_scores = {
        "rouge1_precision": [],
        "rouge1_recall": [],
        "rouge1_f1": [],
        "rouge2_precision": [],
        "rouge2_recall": [],
        "rouge2_f1": [],
        "rougeL_precision": [],
        "rougeL_recall": [],
        "rougeL_f1": []
    }

    for prediction, reference in zip(predictions, references):
        scores = scorer.score(reference, prediction)

        rouge_scores["rouge1_precision"].append(scores["rouge1"].precision)
        rouge_scores["rouge1_recall"].append(scores["rouge1"].recall)
        rouge_scores["rouge1_f1"].append(scores["rouge1"].fmeasure)

        rouge_scores["rouge2_precision"].append(scores["rouge2"].precision)
        rouge_scores["rouge2_recall"].append(scores["rouge2"].recall)
        rouge_scores["rouge2_f1"].append(scores["rouge2"].fmeasure)

        rouge_scores["rougeL_precision"].append(scores["rougeL"].precision)
        rouge_scores["rougeL_recall"].append(scores["rougeL"].recall)
        rouge_scores["rougeL_f1"].append(scores["rougeL"].fmeasure)

    # Calculate average ROUGE scores
    average_rouge_scores = {key: np.mean(value) * 100 for key, value in rouge_scores.items()}

    return average_rouge_scores
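
For intuition about what these functions report, here is a deliberately simplified ROUGE-1 sketch: unigram overlap between a prediction and a reference, turned into precision, recall and F1. It tokenizes on whitespace and does no stemming, so its numbers will not match the `rouge_score` package exactly; `rouge1_scores` is a name chosen for this illustration.

```python
from collections import Counter


def rouge1_scores(prediction, reference):
    """Simplified ROUGE-1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())

    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_scores("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 4) for name, value in scores.items()})
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.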


In [18]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


In [19]:
tokenized_datasets = tokenized_datasets.remove_columns(
    summary_dataset["train"].column_names
)
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[  1243,    263,    259,  34471,   1643,    259,    263,    293,    278,
          50764,   4328,    259,   7434,    259,    845,  74040,    289,   3086,
            380,    261,    303,    342,  67516,    413,  39676,    624,   2708,
            261,    413,  19993,    624,    796,    261,    289,    259,  24884,
            265,    317,  84572,    269,  40991,    259,    369,    443, 103917,
            261,   1080,    340,   6130,    404, 191777,    295,    618,    269,
            283,    259,  52486,    498,  16674,    299,    260,   1170,  13487,
            849,    259,  37601,    380,    261,    763,    340,   1451,   5826,
            269,    327,    786,  19328,  34407,  29389,   1919,    804,   9597,
            322,  11477,    269,    763, 138187,    380,   1218,  15204,    293,
            369,    259,  27338,    720,    259,    369,   8480,  64601,    383,
            259,  16650,   8818,    322,    261,    340,   9562,    843,    259,
          5846

In [20]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 596919
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 27386
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 27161
    })
})
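
The data collator created above pads each batch dynamically to the longest sequence in that batch. The sketch below imitates the core of that behaviour in plain Python, under two assumptions: inputs are padded with the T5/mT5 pad token id `0`, and labels are padded with `-100` so that the loss ignores those positions. The real `DataCollatorForSeq2Seq` does more (it returns tensors and, when given the model, also prepares `decoder_input_ids`); `pad_batch` and the token ids are invented for illustration.

```python
PAD_TOKEN_ID = 0      # pad id used by the T5/mT5 tokenizers
LABEL_PAD_ID = -100   # default ignore_index of PyTorch's cross-entropy loss


def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * input_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * label_pad)
    return batch


# Two toy examples with made-up token ids.
toy_features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch)
```

This is also why `compute_metrics` replaces `-100` with the pad token id before decoding the labels.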

### Fine-tuning - Training 

In [21]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [22]:
%%time
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.4654 | 2.375458 | 0.2166 | 0.0691 | 0.1789 | 0.1789 | 18.9725 |
| 2 | 2.3603 | 2.31885 | 0.2222 | 0.0728 | 0.1835 | 0.1836 | 18.9777 |
| 3 | 2.3052 | 2.292186 | 0.2246 | 0.0749 | 0.1854 | 0.1854 | 18.9823 |
| 4 | 2.272 | 2.273301 | 0.2252 | 0.0757 | 0.1861 | 0.1861 | 18.9801 |
| 5 | 2.2532 | 2.26633 | 0.226 | 0.0761 | 0.1868 | 0.1867 | 18.9787 |


CPU times: user 12h 5min 42s, sys: 42min 5s, total: 12h 47min 47s
Wall time: 12h 19min 9s


TrainOutput(global_step=373075, training_loss=2.3312351312211073, metrics={'train_runtime': 44349.3742, 'train_samples_per_second': 67.297, 'train_steps_per_second': 8.412, 'total_flos': 6.072053934186394e+17, 'train_loss': 2.3312351312211073, 'epoch': 5.0})

### Evaluation

In [23]:
%%time
trainer.evaluate()

CPU times: user 12min 59s, sys: 1.02 s, total: 13min
Wall time: 12min 59s


{'eval_loss': 2.2663304805755615,
 'eval_rouge1': 0.226,
 'eval_rouge2': 0.0761,
 'eval_rougeL': 0.1868,
 'eval_rougeLsum': 0.1867,
 'eval_gen_len': 18.9787,
 'eval_runtime': 779.9134,
 'eval_samples_per_second': 35.114,
 'eval_steps_per_second': 4.39,
 'epoch': 5.0}

### Push to the Hugging Face Hub

In [27]:
#trainer.save_model("./trained_model")

In [24]:
trainer.push_to_hub("Training done - 5 epochs", tags="summarization")

pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

'https://huggingface.co/thekenken/mt5-small-finetuned-summary-en-fr-finetuned-summary-en-fr/tree/main/'

## Using my fine-tuned model 

In [14]:
from transformers import pipeline

hub_model_id = "thekenken/mt5small-finetuned-summary-en-fr"
summarizer = pipeline("summarization", model=hub_model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [18]:
def print_summary(idx, max_length=150):
    document = summary_dataset["test"][idx]["document"]
    summary = summary_dataset["test"][idx]["summary"]
    prediction = summarizer(summary_dataset["test"][idx]["document"], max_length=max_length)[0]["summary_text"]
    print(f"'>>> Document: {document}'")
    print(f"\n'>>> Summary: {summary}'")
    print(f"\n'>>> Predicted - Summary: {prediction}'")

In [19]:
print_summary(10, max_length=300)

Your max_length is set to 300, but your input_length is only 160. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=80)


'>>> Document: Donald Trump et Xi Jinping, le président chinois, au sommet du G20 à Osaka (Japon), le 29 juin. KEVIN LAMARQUE / REUTERS La Chine a annoncé, lundi 2 septembre, qu’elle avait déposé une plainte auprès de l’Organisation mondiale du commerce (OMC) en réaction à l’entrée en vigueur aux Etats-Unis, dimanche, de nouveaux droits de douane sur des produits chinois représentant des milliards de dollars d’importations annuelles. « Ces taxes américaines enfreignent gravement le consensus auquel étaient parvenus les ch'

'>>> Summary: De nouveaux droits de douane sont entrés en vigueur dimanche aux Etats-Unis sur des produits chinois'

'>>> Predicted - Summary: La Chine a déposé une plainte auprès de l’OMC en réaction à l’entrée en vigueur aux Etats-Unis, dimanche, de nouveaux '


In [41]:
!pip install langdetect -q 
from langdetect import detect

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [44]:
def generate_summary(document, max_length=150):
    # Detect the document's language automatically
    language = detect(document)

    # Pick the maximum length by language.
    # Note: for English and French this overrides the max_length argument.
    if language == "en":
        max_length = 300  # Maximum length for English
    elif language == "fr":
        max_length = 350  # Maximum length for French

    # Generate a summary of the full document
    summary = model.generate(
        tokenizer.encode(document, return_tensors="pt"),
        max_length=max_length,
        num_beams=5,  # More beams usually give better summaries, at a higher compute cost
        early_stopping=True,  # Stop beam search once enough finished candidates exist
    )[0]
    full_summary = tokenizer.decode(summary, skip_special_tokens=True)

    return full_summary, language

In [46]:
idx=3 
document = summary_dataset["test"][idx]["document"]
summary = summary_dataset["test"][idx]["summary"]
prediction, language = generate_summary(document, max_length=100)  # Requested maximum length (generate_summary overrides it for 'en' and 'fr')
print(f"'>>> Language: {language}'")
print(f"'>>> Document: {document}'")
print(f"\n'>>> Summary: {summary}'")
print(f"\n'>>> Predicted - Summary: {prediction}'")

'>>> Language: fr'
'>>> Document: EDF et les partisans du nucléaire font de longue date la promotion du chauffage électrique. PHILIPPE HUGUEN / AFP C’est une intense bataille de lobbying qui se joue en coulisses, mais cette fois-ci, elle concerne un objet du quotidien de tous les Français : le radiateur. Ces derniers mois, les partisans de l’électrique et du gaz démultiplient leurs efforts pour tenter de convaincre les pouvoirs publics. En ligne de mire : la nouvelle réglementation environnementale, dite RE 2020, qui doit définir quel mode '

'>>> Summary: Le gouvernement prépare une réglementation pour les bâtiments neufs. Partisans du tout-électrique et'

'>>> Predicted - Summary: EDF et les partisans du nucléaire font de longue date la promotion du chauffage électrique.'


#### Generate a summary of the whole text:

In [24]:
!pip install langdetect -q 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [26]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk

# Load the fine-tuned summarization model and its tokenizer from the Hub
hub_model_id = "thekenken/mt5small-finetuned-summary-en-fr"
tokenizer = AutoTokenizer.from_pretrained(hub_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(hub_model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [30]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from langdetect import detect


def split_text_into_sentences(text, language="english"):
    # Map langdetect's ISO codes to the names of NLTK's punkt models;
    # any other value is passed through unchanged (e.g. the default "english")
    punkt_names = {"en": "english", "fr": "french"}
    language = punkt_names.get(language, language)
    sentence_splitter = nltk.data.load(f'tokenizers/punkt/{language}.pickle')
    sentences = sentence_splitter.tokenize(text)
    return sentences

def generate_summary(text, max_length=150):
    # Detect the text's language automatically
    language = detect(text)
    print("Language: ", language)

    # Split the text into sentences for that language
    sentences = split_text_into_sentences(text, language)

    # Generate a summary for each sentence
    summaries = []
    for sentence in sentences:
        summary = model.generate(
            tokenizer.encode(sentence, return_tensors="pt"),
            max_length=max_length,
            num_beams=5,  # More beams usually give better summaries, at a higher compute cost
            early_stopping=True,  # Stop beam search once enough finished candidates exist
        )[0]
        summaries.append(tokenizer.decode(summary, skip_special_tokens=True))

    # Join the per-sentence summaries into the full summary
    full_summary = " ".join(summaries)
    return full_summary


idx=20 

document = summary_dataset["test"][idx]["document"]
summary = summary_dataset["test"][idx]["summary"]
prediction = generate_summary(document, max_length=200)  # Maximum length of each per-sentence summary
print(f"'>>> Document: {document}'")
print(f"\n'>>> Summary: {summary}'")
print(f"\n'>>> Predicted - Summary: {prediction}'")

Language:  fr
'>>> Document: Caps (à droite), leader des G2 Esports, part favori face à Doinb de FunPlux Phoenix, une équipe chinoise sans référence au niveau international, mais sans complexe. (Riot Games) Il y a comme un petit parfum de finale d’Euro 2016 de football. Les Européens de G2 Esports contre les Chinois de FunPlus Phoenix (FPX) va-t-il devenir le France-Portugal des compétitions de jeu vidéo ? Dimanche à 13 h 30, devant les 20 000 spectateurs de l’AccorHotels Arena (Paris-Bercy) forcément acquis à la cause du dernier repré'

'>>> Summary: L’équipe européenne G2 part ultrafavorite des Worlds, dont la finale doit réunir dimanche 20 000 spe'

'>>> Predicted - Summary: Caps (à droite), leader des G2 Esports, part favori face à Doinb de FunPlux Phoenix, une équipe chinoise La finale d’Euro 2016 de football s’est achevée dimanche à Riot Games. Il y a comme un petit parfum de finale d Les Européens de G2 Esports contre les Chinois de FunPlus Phoenix (FPX) va-t-il devenir le Franc

## Fine-tuning with Accelerate

### Configuration

In [None]:
tokenized_datasets.set_format("torch")

In [None]:
from transformers import DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
from torch.utils.data import DataLoader

batch_size = 8
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
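
To show what the `"linear"` schedule does to the learning rate, here is a plain-Python sketch of the same shape: an optional linear warmup followed by a linear decay to zero at the last step. `linear_lr` is a hypothetical stand-in written for illustration, not the function `get_scheduler` uses internally, and the 1000-step horizon is an arbitrary example value.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: go from base_lr down to 0 at num_training_steps.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


base_lr = 2e-5   # same value as the AdamW optimizer above
horizon = 1000   # arbitrary example horizon
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, base_lr, horizon))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and the rate shrinks by the same amount at every step.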

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "mt5-small-summary-en-fr-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
from huggingface_hub import Repository

output_dir = "results-mt5-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

### Training loop

The training loop for summarization is quite similar to the other 🤗 Accelerate examples that we’ve encountered and is roughly split into four main steps:

1. Train the model by iterating over all the examples in train_dataloader for each epoch.
2. Generate model summaries at the end of each epoch, by first generating the tokens and then decoding them (and the reference summaries) into text.
3. Compute the ROUGE scores using the same techniques we saw earlier.
4. Save the checkpoints and push everything to the Hub. Here we rely on the nifty blocking=False argument of the Repository object so that we can push the checkpoints per epoch asynchronously. This allows us to continue training without having to wait for the somewhat slow upload associated with a GB-sized model!


In [None]:
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            labels = batch["labels"]

            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # evaluate's ROUGE metric returns aggregated F-measures as floats; scale them to percentages
    result = {key: value * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

### Using my fine-tuned model 

In [None]:
from transformers import pipeline

hub_model_id = "thekenken/mt5-finetuned-summary-en-fr-accelerate"
summarizer = pipeline("summarization", model=hub_model_id)

In [None]:
def print_summary(idx):
    document = summary_dataset["test"][idx]["document"]
    summary = summary_dataset["test"][idx]["summary"]
    prediction = summarizer(summary_dataset["test"][idx]["document"])[0]["summary_text"]
    print(f"'>>> Document: {document}'")
    print(f"\n'>>> Summary: {summary}'")
    print(f"\n'>>> Predicted - Summary: {prediction}'")

In [None]:
print_summary(100)