In [1]:
import torch
print("CUDA dispo :", torch.cuda.is_available())
print("GPU :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "Aucun")



CUDA dispo : True
GPU : NVIDIA GeForce RTX 4060 Laptop GPU


In [2]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

Charger automatiquement le dataset CNN/DailyMail

Résumer plusieurs articles

Évaluer la qualité avec ROUGE ou BLEU

In [3]:
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

Found existing installation: transformers 4.53.2
Uninstalling transformers-4.53.2:
  Successfully uninstalled transformers-4.53.2
Found existing installation: accelerate 1.8.1
Uninstalling accelerate-1.8.1:
  Successfully uninstalled accelerate-1.8.1
Collecting transformers
  Using cached transformers-4.53.2-py3-none-any.whl.metadata (40 kB)
Collecting accelerate
  Using cached accelerate-1.8.1-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.53.2-py3-none-any.whl (10.8 MB)
Using cached accelerate-1.8.1-py3-none-any.whl (365 kB)
Installing collected packages: accelerate, transformers

   ---------------------------------------- 0/2 [accelerate]
   ---------------------------------------- 0/2 [accelerate]
   ---------------------------------------- 0/2 [accelerate]
   ---------------------------------------- 0/2 [accelerate]
   ---------------------------------------- 0/2 [accelerate]
   ---------------------------------------- 0/2 [accelerate]
   --------------------------

# Restart Kernel

In [1]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


In [2]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt 
import pandas as pd

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nico_\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Basic functionnality of Hugging Face Model

In [2]:
from transformers import AutoTokenizer, PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum",use_safetensors=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, return_tensors="pt", truncation=True).to("cuda")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


KeyboardInterrupt: 

to("cuda") : Pour envoyer les données sur le GPU

truncation = True : Coupe proprement les textes trop longs au bon endroit

use_safetensors=True : Permet de charger le modèle même sans PyTorch 2.6, tout en restant sécurisé, contourne la faille de sécurité dans torch.load

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Fine Tuning

In [4]:
model = "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model, use_safetensors=True).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
model_pegasus.config.decoder_start_token_id = 2  # Ou un autre ID ≠ pad_token_id (qui est 0)

In [6]:
print("Pad token id:", tokenizer.pad_token_id)
print("Decoder start token id:", model_pegasus.config.decoder_start_token_id)
print("Vocab size:", tokenizer.vocab_size)


Pad token id: 0
Decoder start token id: 2
Vocab size: 96103


In [7]:
from datasets import load_dataset

dataset_samsum = load_dataset("knkarthick/samsum")

In [8]:
print(dataset_samsum)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})


In [9]:
sample = dataset_samsum["train"][0]
print(sample["dialogue"])
print(sample["summary"])

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)
Amanda baked cookies and will bring Jerry some tomorrow.


## Preparing data for training for Sequence to Sequence Model

Cette fonction sert à préparer le dataset pour l'entraînement ou l'inférence avec un modèle de type Seq2Seq comme PEGASUS.

| Champ            | Rôle                                            |
| ---------------- | ----------------------------------------------- |
| `input_ids`      | Texte source encodé (ici, les dialogues)        |
| `attention_mask` | Masque indiquant quelles positions sont valides |
| `labels`         | Résumé cible (encodé) utilisé comme référence   |


In [10]:
def convert_examples_to_features(example_batch):
    dialogues = [str(d) for d in example_batch["dialogue"]]
    summaries = [str(s) for s in example_batch["summary"]]

    # Inputs
    inputs = tokenizer(
        dialogues,
        max_length=512,   # limite de Pegasus
        padding="max_length",
        truncation=True
    )

    # Targets
    targets = tokenizer(
        summaries,
        max_length=128,
        padding="max_length",
        truncation=True
    )

    # Remplacement des [PAD] par -100
    labels = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in targets["input_ids"]
    ]

    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": labels
    }





In [11]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)

Map: 100%|██████████| 819/819 [00:00<00:00, 1310.71 examples/s]


In [12]:
dataset_samsum_pt['test']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 819
})

In [13]:
# Training
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model = model_pegasus)

Un data collator (spécifique pour les modèles seq2seq) sert à :

- assembler des exemples individuels en mini-batchs

- faire le padding dynamique

- masquer les tokens ignorés dans la loss (comme [PAD] dans les labels)

In [14]:
# Pegasus Training parameters
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir = 'pegasus-samsum', num_train_epochs= 1, warmup_steps= 500,
    per_device_train_batch_size= 1,per_device_eval_batch_size= 1,
    weight_decay= 0.01, logging_steps= 10,
    eval_strategy= 'steps', eval_steps= 500, save_steps= 1e6,
    gradient_accumulation_steps= 16
)

In [15]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt['test'],
                  eval_dataset=dataset_samsum_pt['validation'])

  trainer = Trainer(model=model_pegasus, args=trainer_args,


In [16]:
trainer.train()

Step,Training Loss,Validation Loss




TrainOutput(global_step=52, training_loss=4.323927906843332, metrics={'train_runtime': 2975.4502, 'train_samples_per_second': 0.275, 'train_steps_per_second': 0.017, 'total_flos': 1183235677618176.0, 'train_loss': 4.323927906843332, 'epoch': 1.0})

In [17]:
## Save model
model_pegasus.save_pretrained("pegasus-samsum-model")

In [19]:
#Load
tokenizer.save_pretrained("pegasus-samsum-tokenizer")

('pegasus-samsum-tokenizer\\tokenizer_config.json',
 'pegasus-samsum-tokenizer\\special_tokens_map.json',
 'pegasus-samsum-tokenizer\\spiece.model',
 'pegasus-samsum-tokenizer\\added_tokens.json',
 'pegasus-samsum-tokenizer\\tokenizer.json')

## Evaluation

In [20]:
### lst[1,2,3,4,5,6]-> [1,2,3][4,5,6]
def generate_batch_sized_chunks(list_of_elements, batch_size):
    #split the dataset into smaller batches that we can process simultaneously
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]



def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=512,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
        #parameter for length penalty ensures that the model does not generate sequences that are too long

        # decode the generated texts,
        # replace the  token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        decoded_summaries = [d.replace("", " ") for d in decoded_summaries]


        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  compute and return the ROUGE scores.
    score = metric.compute()
    return score

In [21]:
import evaluate

rouge_metric = evaluate.load('rouge')
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
#rouge_metric = load_metric('rouge')

In [22]:
rouge_metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value('string'), 'references': List(Value('string'))}, {'predictions': Value('string'), 'references': Value('string')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
 

In [23]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

# Directly use the scores without accessing fmeasure or mid
rouge_dict = {rn: score[rn] for rn in rouge_names}

# Convert the dictionary to a DataFrame for easy visualization
import pandas as pd
pd.DataFrame(rouge_dict, index=[f'pegasus'])

100%|██████████| 5/5 [01:16<00:00, 15.25s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.036987,0.0,0.037084,0.037014


Interpreting Good vs. Bad ROUGE Scores:
Scores close to 1: This indicates a strong overlap between the generated summary and the reference summary, which is desirable in summarization tasks. For example, an F1-score of 0.7 or higher across metrics is generally considered good.
Scores between 0.5 and 0.7: Indicates moderate overlap. The summary might be capturing key points but is likely missing some structure or important information.
Scores below 0.5: Suggest a poor match between the generated and reference summaries. The model might be generating irrelevant or incomplete summaries that don’t capture the key ideas well.

In [24]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Hey Hannah, do you have Betty's number? Amanda: Lemme check Hannah: file_gif> Amanda: Sorry, can't find it.
