# Text Summarization

In this notebook, it will:

    I. explain the NER problem.
    II. Model
    III. Realization

## I. Presentation

### 1. definition

This task is part on the sequence to sequence generations. 
Given a longer text, the goal is to generate a shorter version of the text.

Example: 
**text:** OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.

**summary:** OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.

### 2. metric - Rouge

There are Rouge-1, Rouge-2, Rouge-L based on 1-gram, 2-gram, l-gram.

Example:

| text                     |  1-gram               | 2-gram |
----------                 |:-------             |:------- 
| The cat is on the mat.   | **The** **cat** is on **the** mat | **The-cat** cat-is is-on on-the the-mat |
| The cat and the dog.     | **The** **cat** and **the** dog   | **The-cat** cat-and and-the the-dog      |

* 1-gram

precision = 3 / 5 = 0.6
recall = 3 / 6 = 0.5
F1 = 2 * (0.6 * 0.5) / (0.6 + 0.5) = 0.54

* 2-gram

precision = 1 / 2 = 0.25
recall = 1 / 5 = 0.2
F1 = 2 * (0.25 * 0.2) / (0.25 + 0.2) = 0.22

### 3. data structure

The data structure is:

    _______________________________________________________________________
       encoder input           |eos| |bos|         decoder input      |eos|
    -----------------------------------------------------------------------
                                          |
                                          >________________________________
                                     |bos|      decoder output        |eos|
                                          ---------------------------------


https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/


## II. model

For this task, we need encoder and decoder.

Other tasks can follow this similar pattern to achieve, such as translations, completions...

model is XXForCOnditionalGeneration. If using T5, use t5ForConditionalGeneration, otherwise, use seq2seq models (see below)

## III. Realization

In [1]:
# to set the gpu to use
# Since I have 2 GPUs and I only want to use one, I need to run this.
# Should be run the first
# skip this if you don't need.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
## defin repos for data and model

# data

ckp_data = "Ateeqq/news-title-generator"

# model

ckp = "google-t5/t5-base"

### 1. import

In [3]:
import evaluate, torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

2024-06-21 15:28:50.958952: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-21 15:28:50.959022: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-21 15:28:50.962044: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-21 15:28:50.977125: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 2. load data

In [4]:
data = load_dataset(ckp_data, split="train[:1000]")
data

Downloading readme:   0%|          | 0.00/183 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/41.5M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['summary', 'text'],
    num_rows: 1000
})

In [6]:
# show data

data[0]

{'summary': 'upGrad learner switches to career in ML & Al with 90% salary hike',
 'text': "Saurav Kant, an alumnus of upGrad and IIIT-B's PG Program in Machine learning and Artificial Intelligence, was a Sr Systems Engineer at Infosys with almost 5 years of work experience. The program and upGrad's 360-degree career support helped him transition to a Data Scientist at Tech Mahindra with 90% salary hike. upGrad's Online Power Learning has powered 3 lakh+ careers."}

### 3. split data

In [7]:
split_data = data.train_test_split(test_size=0.2, seed=42)
split_data

DatasetDict({
    train: Dataset({
        features: ['summary', 'text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['summary', 'text'],
        num_rows: 200
    })
})

### 4. tokenization

In [8]:
tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer



config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

T5TokenizerFast(name_or_path='google-t5/t5-base', vocab_size=32100, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extr

In [13]:
def process(samples):

    # if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    # prefix = "summarize: "

    prefix = "summarize: "

    input = [prefix + t for t in samples["text"]]

    toks = tokenizer(input, truncation=True, padding=True, max_length=256)

    label = tokenizer(text_target=samples["summary"], truncation=True, padding=True, max_length=64)
    toks["labels"] = label["input_ids"]

    return toks


In [14]:
tokenized_data = split_data.map(process, batched=True)
tokenized_data

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['summary', 'text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['summary', 'text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [11]:
# we can see that the tokenizer added an end token "</s>" and paddings
# compared to the original sentence

tokenizer.decode(tokenized_data["train"][0]["input_ids"])

"summarize: England pacer James Anderson matched Sir Ian Botham's tally of 27 Test five-wicket hauls for England on day two of the first Test against the Windies. Anderson's milestone five-wicket haul came 16 years after his first, which he bagged on debut against Zimbabwe at Lord's in May 2003. Anderson has now picked up 570 wickets in Test cricket.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"

In [16]:
tokenizer.decode(tokenized_data["train"][0]["labels"])

'Cong MLA accused of attempt to murder by fellow MLA absconding</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

### 5. load model

In [18]:
model = AutoModelForSeq2SeqLM.from_pretrained(ckp)
model

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### 6. metric

In [22]:
# rouge is calculated based on n-grams

rouge = evaluate.load("rouge")

In [23]:
# a test 

rouge.compute(predictions=["This is a test", "john smith"], references=["I want a test", "john smith"])

{'rouge1': 0.75,
 'rouge2': 0.6666666666666666,
 'rougeL': 0.75,
 'rougeLsum': 0.75}

In [24]:
import numpy as np

def metric(pred):

    preds, refs = pred

    decode_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    refs = np.where(refs != -100, refs, tokenizer.pad_token_id)
    decode_refs = tokenizer.batch_decode(refs, skip_special_tokens=True)

    # concat decodes line by line
    decode_preds = [" ".join(p.strip()) for p in decode_preds]
    decode_refs = [" ".join(r.strip()) for r in decode_refs]

    # compute metric

    res = rouge.compute(predictions=decode_preds, references=decode_refs)

    return res


### 7. train args

In [26]:
args = Seq2SeqTrainingArguments(
    output_dir="../tmp//checkpoint",
    num_train_epochs=3,
    per_device_eval_batch_size=16,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    logging_steps=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rougeL",
    predict_with_generate=True, # for seq2seq training, we enable this option to do evaluation
)



### 8. trainer

In [27]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    compute_metrics=metric,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer)
)

### 9. train

In [28]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
0,No log,5.350233,0.718139,0.417513,0.460833,0.461075
1,7.673600,3.573808,0.717973,0.422881,0.465524,0.465305
2,4.429700,2.985825,0.717236,0.420341,0.462359,0.462388




TrainOutput(global_step=18, training_loss=5.786354647742377, metrics={'train_runtime': 74.9099, 'train_samples_per_second': 32.038, 'train_steps_per_second': 0.24, 'total_flos': 386383781560320.0, 'train_loss': 5.786354647742377, 'epoch': 2.88})

### 10. inference

In [29]:
from transformers import pipeline

# if you don't know the type of pipeline, you can just put whatever and run
# you will get error message in this case, but at the end of the message you
# can find all type names avalaible and choose the one you need

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

In [50]:
# by default, the generated text may be shorter than nessary, so use max_length to increase
# do_sample allows to show different results

print(pipe("summarize: " + split_data["test"][1]["text"], max_length=32, do_sample=True))
print(split_data["test"][1]["summary"])

[{'generated_text': 'the 38-year-old is to receive the highest peacetime gallantry award in india . he died in 2004 after giving up terrorism'}]
Martyred terrorist-turned-soldier Nazir Wani to get Ashoka Chakra


## references:
 - https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb#scrollTo=UmvbnJ9JIrJd

 - https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb