## Evaluation Metrics

In [2]:
# Install the dependency first if needed: !pip3.9 install rouge
from rouge import Rouge

hypothesis = "the #### transcript is a written version of each day 's cnn student news program use this transcript to he    lp students with reading comprehension and vocabulary use the weekly newsquiz to test your knowledge of storie s you     saw on cnn student news"

reference = "this page includes the show transcript use the transcript to help students with reading comprehension and     vocabulary at the bottom of the page , comment for a chance to be mentioned on cnn student news . you must be a teac    her or a student age # # or older to request a mention on the cnn student news roll call . the weekly newsquiz tests     students ' knowledge of even ts in the news"

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)

scores



[{'rouge-1': {'r': 0.42857142857142855,
   'p': 0.5833333333333334,
   'f': 0.49411764217577864},
  'rouge-2': {'r': 0.18571428571428572,
   'p': 0.3170731707317073,
   'f': 0.23423422957552154},
  'rouge-l': {'r': 0.3877551020408163,
   'p': 0.5277777777777778,
   'f': 0.44705881864636676}}]
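
To make the `r`, `p`, and `f` fields above concrete, here is a minimal sketch of ROUGE-N computed from scratch. It assumes whitespace tokenization and counts each distinct n-gram once. The helper name `rouge_n` is hypothetical, and the `rouge` package may differ in details such as tokenization and smoothing, so treat this as an illustration rather than a reimplementation of the library.

```python
def rouge_n(hypothesis: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N on whitespace tokens (assumption: distinct n-grams only)."""

    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = len(hyp & ref)

    # Precision: share of hypothesis n-grams that also occur in the reference.
    p = overlap / len(hyp) if hyp else 0.0
    # Recall: share of reference n-grams that the hypothesis recovers.
    r = overlap / len(ref) if ref else 0.0
    # F-score: harmonic mean of precision and recall.
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


example = rouge_n("the cat sat", "the cat sat on the mat")
print(example)
```

A short hypothesis that only repeats reference words gets perfect precision but low recall, which is why the two values are usually reported together with their F-score.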

In [1]:
import wandb



### Motivation

In today's fast-paced society, it is impossible to keep up with the information generated every single minute.  
Even if one limits oneself to articles, the volume is still too large. Moreover, not everything in an article is actually relevant.
Given this abundance of information, it is necessary to focus only on the relevant articles and on the relevant information within them.

### Overall goal of the project
Our aim is to perform abstractive and extractive text summarization on news articles. This will reduce the time a reader needs to spend on any given article.

### What framework are you going to use (Kornia, Transformers, PyTorch Geometric)
The [Transformers](https://github.com/huggingface/transformers) framework from Hugging Face offers high-performance NLP models suitable for a wide range of applications, including text summarization.


### What data are you going to run on (initially, may change)
The [CNN Dailymail](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail) Dataset contains approximately 300k news articles.
Each entry contains the article, its summary, and a unique id.
If time allows, we may extend training with the [XSum dataset and additional articles from Multi-News](https://www.kaggle.com/datasets/sbhatti/news-summarization), which are also available on Kaggle.



### What deep learning models do you expect to use
Due to both time and computational constraints, we will rely on pre-trained models, which we intend to fine-tune on the dataset.
As the dataset is popular for text summarization, several models already fitted to it are available. We will use [BigBirdPegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus) or [Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus), and might extend the work using [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) or [ALBERT](https://huggingface.co/docs/transformers/model_doc/albert).




## Dataset

In [3]:
import pandas as pd

# https://www.kaggle.com/datasets/sbhatti/news-summarization
data = pd.read_csv('data.csv', nrows=10000)
data = data[['Content', 'Summary']]

In [4]:
data.head()

| | Content | Summary |
|---|---|---|
| 0 | New York police are concerned drones could bec... | Police have investigated criminals who have ri... |
| 1 | By . Ryan Lipman . Perhaps Australian porn sta... | Porn star Angela White secretly filmed sex act... |
| 2 | This was, Sergio Garcia conceded, much like be... | American draws inspiration from fellow country... |
| 3 | An Ebola outbreak that began in Guinea four mo... | World Health Organisation: 635 infections and ... |
| 4 | By . Associated Press and Daily Mail Reporter ... | A sinkhole opened up at 5:15am this morning in... |


## Model Loading

In [5]:
import os
# Hide all GPUs so that everything runs on the CPU.
# Set this to '0,1' instead to expose the first two GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = ''

In [7]:
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

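# transformers automatically downloads the weights and vocabulary from the Hugging Face Hub
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# Run on the CPU; use the commented line instead to pick up a GPU when one is available
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'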

In [16]:
# Tokenize one article by calling the tokenizer directly
# (prepare_seq2seq_batch is deprecated in Transformers)
batch = tokenizer(data['Content'].iloc[20],
                  truncation=True,
                  padding='longest',
                  return_tensors="pt")
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)


tgt_text, data['Summary'].iloc[20]



(['She worked like crazy and completed 115 hours of community service in Brooklyn to meet her Thursday deadline and avoid jail.'],
 "– Tomorrow looks to be a milestone day for Lindsay Lohan: Her lawyer will be able to report to a Los Angeles judge that she has completed all her necessary community service, paving the way for her to be off probation for the first time in seven years, reports TMZ. The community service stems from a reckless driving case, and things looked bleak for Lohan less than three weeks ago when a judge informed her that she still had 115 hours to complete before a May 28 deadline, notes the New York Daily News. Lohan got it done, however, and she's been posting photos of herself on the job at a women's shelter in New York City.")

In [9]:
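# Note: Rouge.get_scores expects (hypothesis, reference). Here the reference summary is
# passed first, so 'r' and 'p' below are computed with the two roles reversed.
rouge.get_scores(data['Summary'].iloc[20], tgt_text[0])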

[{'rouge-1': {'r': 0.5, 'p': 0.11235955056179775, 'f': 0.18348623553572932},
  'rouge-2': {'r': 0.1, 'p': 0.017699115044247787, 'f': 0.030075185414664696},
  'rouge-l': {'r': 0.45, 'p': 0.10112359550561797, 'f': 0.16513761168251836}}]
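
The `rouge-l` entry is based on the longest common subsequence (LCS) between the two texts rather than on fixed-size n-grams. The sketch below shows the textbook sentence-level variant on whitespace tokens. The names `lcs_length` and `rouge_l` are hypothetical, and the `rouge` package may use a different, summary-level formulation, so the numbers it prints need not match the library's output.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hypothesis: str, reference: str) -> dict:
    """Sentence-level ROUGE-L on whitespace tokens (simplified sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    p = lcs / len(hyp) if hyp else 0.0   # LCS length relative to the hypothesis
    r = lcs / len(ref) if ref else 0.0   # LCS length relative to the reference
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))
```

Unlike ROUGE-2, the LCS rewards words that appear in the same order even when other words sit between them.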

## Dataset Construction

In [10]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
import torch

# Custom dataset that processes the tokenized text and returns one record at a time
# Each record is a (content, summary) pair
class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])  # torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels['input_ids'])
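
When summaries are padded to a common length, the padded positions in `labels` count toward the training loss unless they are masked. A common convention with Transformers models is to replace pad token ids in the labels with -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists. `mask_label_padding` is a hypothetical helper, and the pad id of 0 in the usage line is an assumption that should be checked against `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids: list, pad_token_id: int) -> list:
    """Return a copy of the label ids with every pad token replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two toy label sequences padded with 0 (assumed pad id).
padded = [[5, 7, 1, 0, 0], [3, 1, 0, 0, 0]]
print(mask_label_padding(padded, pad_token_id=0))
```

Applied to the `input_ids` of the tokenized summaries before they are wrapped in the dataset, this would keep the model from being trained to predict padding.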
    


In [11]:
def tokenize_data(texts, labels, tokenizer):
    # Tokenize the articles (model inputs), truncated to at most 300 tokens
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=300)
    
    # Tokenize the summaries (targets), truncated to at most 200 tokens
    decodings = tokenizer(labels, truncation=True, padding=True, max_length=200)
    
    dataset_tokenized = PegasusDataset(encodings, decodings)
    return dataset_tokenized

def prepare_data(model_name, 
                 train_texts, train_labels, 
                 val_texts=None, val_labels=None, 
                 test_texts=None, test_labels=None):
    """
    Prepare input data for model fine-tuning
    """
    # Load the tokenizer that matches the model and pass it on explicitly,
    # so tokenization does not depend on a global variable
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    prepare_val = val_texts is not None and val_labels is not None
    prepare_test = test_texts is not None and test_labels is not None

    train_dataset = tokenize_data(train_texts, train_labels, tokenizer)
    val_dataset = tokenize_data(val_texts, val_labels, tokenizer) if prepare_val else None
    test_dataset = tokenize_data(test_texts, test_labels, tokenizer) if prepare_test else None

    return train_dataset, val_dataset, test_dataset, tokenizer
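
`prepare_data` only builds a validation set when `val_texts` and `val_labels` are supplied. One way to obtain them is a seeded random split of the loaded records, sketched below on plain Python lists. The helper `split_records` and its default values are hypothetical and not part of the original notebook.

```python
import random


def split_records(texts: list, labels: list, val_fraction: float = 0.1, seed: int = 0):
    """Shuffle paired records with a fixed seed and split off a validation share."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")

    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(texts, train_idx), pick(labels, train_idx),
            pick(texts, val_idx), pick(labels, val_idx))


articles = [f"article {i}" for i in range(10)]
summaries = [f"summary {i}" for i in range(10)]
train_x, train_y, val_x, val_y = split_records(articles, summaries, val_fraction=0.2)
print(len(train_x), len(val_x))
```

The four returned lists can then be passed as `train_texts`, `train_labels`, `val_texts`, and `val_labels`.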

## Model Training

In [14]:
def prepare_fine_tuning(model_name, tokenizer, train_dataset, val_dataset=None, freeze_encoder=False, output_dir='./results'):
    """
    Prepare configurations and base model for fine-tuning
    """
    # Force the CPU here; use the commented line instead to pick up a GPU when one is available
    # torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    torch_device = 'cpu'
    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
    
    
    
    if freeze_encoder:
        for param in model.model.encoder.parameters():
            param.requires_grad = False

    if val_dataset is not None:
        training_args = TrainingArguments(
          output_dir=output_dir,           # output directory
          num_train_epochs=1,              # total number of training epochs
          per_device_train_batch_size=1,   # batch size per device during training, can increase if memory allows
          per_device_eval_batch_size=1,    # batch size for evaluation, can increase if memory allows
          save_steps=500,                  # number of updates steps before checkpoint saves
          save_total_limit=5,              # limit the total amount of checkpoints and deletes the older checkpoints
          evaluation_strategy='steps',     # evaluation strategy to adopt during training
          eval_steps=100,                  # number of update steps before evaluation
          warmup_steps=500,                # number of warmup steps for learning rate scheduler
          weight_decay=0.01,               # strength of weight decay
          logging_dir='./logs',            # directory for storing logs
          logging_steps=10,
          report_to = 'wandb',
        )

        trainer = Trainer(
          model=model,                         # the instantiated 🤗 Transformers model to be trained
          args=training_args,                  # training arguments, defined above
          train_dataset=train_dataset,         # training dataset
          eval_dataset=val_dataset,            # evaluation dataset
          tokenizer=tokenizer
        )

    else:
        training_args = TrainingArguments(
          output_dir=output_dir,           # output directory
          num_train_epochs=1,           # total number of training epochs
          per_device_train_batch_size=1,   # batch size per device during training, can increase if memory allows
          save_steps=500,                  # number of updates steps before checkpoint saves
          save_total_limit=5,              # limit the total amount of checkpoints and deletes the older checkpoints
          warmup_steps=500,                # number of warmup steps for learning rate scheduler
          weight_decay=0.01,               # strength of weight decay
          logging_dir='./logs',            # directory for storing logs
          logging_steps=10,
          report_to = 'wandb',
          
        )

        trainer = Trainer(
          model=model,                         # the instantiated 🤗 Transformers model to be trained
          args=training_args,                  # training arguments, defined above
          train_dataset=train_dataset,         # training dataset
          tokenizer=tokenizer
        )

    return trainer
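
The step-based settings above (`warmup_steps=500`, `save_steps=500`, `eval_steps=100`) only take effect if training runs for enough optimizer steps. The sketch below estimates the step count for a small run. Both helper names are hypothetical, the formula is an approximation of how the step count is usually derived, and the warmup estimate assumes the learning rate rises linearly during warmup.

```python
import math


def total_training_steps(num_examples: int, per_device_batch_size: int, num_epochs: int,
                         num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Approximate number of optimizer steps for a training run."""
    examples_per_step = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / examples_per_step) * num_epochs


def warmup_fraction_reached(total_steps: int, warmup_steps: int) -> float:
    """Share of the peak learning rate reached by the end of training (linear warmup assumed)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, total_steps / warmup_steps)


# For instance: 100 examples, batch size 1, one epoch.
steps = total_training_steps(num_examples=100, per_device_batch_size=1, num_epochs=1)
print(steps, warmup_fraction_reached(steps, warmup_steps=500))
```

Under these assumptions, a run of 100 steps ends before the 500-step warmup completes and before the first step-based checkpoint at step 500, so scaling `warmup_steps` and `save_steps` to the dataset size may be worth considering.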

In [15]:
train_texts, train_labels = list(data['Content'].iloc[:100]), list(data['Summary'].iloc[:100])

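# Use the Pegasus (large) XSum checkpoint as the base for fine-tuning.
# './pegasus-xsum' must be an existing local directory containing that checkpoint;
# otherwise pass the Hub ID 'google/pegasus-xsum' instead.
model_name = './pegasus-xsum'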

train_dataset, _, _, tokenizer = prepare_data(model_name, train_texts, train_labels)

trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
trainer.train()

HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: './pegasus-xsum'.

In [12]:
trainer.save_model('results/model.pt')

Saving model checkpoint to results/model.pt
Configuration saved in results/model.pt/config.json
Model weights saved in results/model.pt/pytorch_model.bin
tokenizer config file saved in results/model.pt/tokenizer_config.json
Special tokens file saved in results/model.pt/special_tokens_map.json


In [13]:
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
# model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

tokenizer = PegasusTokenizer.from_pretrained("results/model.pt/")
model = PegasusForConditionalGeneration.from_pretrained("results/model.pt/")

# Run on the CPU; use the commented line instead to pick up a GPU when one is available
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'

Didn't find file results/model.pt/added_tokens.json. We won't load it.
Didn't find file results/model.pt/tokenizer.json. We won't load it.
loading file results/model.pt/spiece.model
loading file None
loading file results/model.pt/special_tokens_map.json
loading file results/model.pt/tokenizer_config.json
loading file None
loading configuration file results/model.pt/config.json
Model config PegasusConfig {
  "_name_or_path": "./pegasus-xsum",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "PegasusForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 16,
  "decoder_start_token_id": 0,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_f

In [16]:
# Call the tokenizer directly (prepare_seq2seq_batch is deprecated in Transformers)
batch = tokenizer(data['Content'].iloc[50], truncation=True, padding='longest', return_tensors="pt")
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
tgt_text, data['Summary'].iloc[50]

(['Bordeaux clinched the French league title with a 1-0 win over Caen on a dramatic final day of the season.'],
 'Bordeaux beat relegated Caen 1-0 to clinch the French league title .\nMarseille finish runners-up despite 4-0 home win over Rennes .\nAtletico Madrid clinch Champions League spot from Primera Liga .\nBesiktas claim Turkish league title with 2-1 win over Denizlispor .')

In [17]:
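# Note: Rouge.get_scores expects (hypothesis, reference). Here the reference summary is
# passed first, so 'r' and 'p' below are computed with the two roles reversed.
rouge.get_scores(data['Summary'].iloc[50], tgt_text[0])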

[{'rouge-1': {'r': 0.5555555555555556,
   'p': 0.29411764705882354,
   'f': 0.3846153800887574},
  'rouge-2': {'r': 0.2631578947368421,
   'p': 0.1388888888888889,
   'f': 0.18181817729586788},
  'rouge-l': {'r': 0.4444444444444444,
   'p': 0.23529411764705882,
   'f': 0.30769230316568047}}]
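
A single example says little about the effect of fine-tuning, so scores are normally averaged over many articles. The sketch below averages a list of per-example score dictionaries in the format shown above. `average_rouge` is a hypothetical helper and the input values are made up for illustration. The `rouge` package also offers `get_scores(hyps, refs, avg=True)` for lists of texts, which may be the simpler route.

```python
def average_rouge(score_list: list) -> dict:
    """Average per-example ROUGE dicts of the form {'rouge-1': {'r':..,'p':..,'f':..}, ...}."""
    totals = {}
    for scores in score_list:
        for metric, values in scores.items():
            bucket = totals.setdefault(metric, {})
            for key, value in values.items():
                bucket[key] = bucket.get(key, 0.0) + value

    n = len(score_list)
    return {metric: {key: value / n for key, value in values.items()}
            for metric, values in totals.items()}


# Illustrative inputs only; these are not results from the notebook.
per_example = [
    {"rouge-1": {"r": 0.5, "p": 1.0, "f": 0.5}},
    {"rouge-1": {"r": 0.25, "p": 0.5, "f": 0.25}},
]
print(average_rouge(per_example))
```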