<a href="https://colab.research.google.com/github/lakhanrajpatlolla/aiml-learning/blob/master/U4W21_72_Part_A_Finetuning_T5model_for_Dialogue_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
### Part-A: Finetuning a Seq2Seq (T5) Model for Summarization

## Reference notebook not for submission

> **Note:** this assignment has two parts:
> - Part-A: Finetuning a Seq2Seq (T5) Model for Summarization
> - Part-B: PEFT for Dialogue Summary
>
> Only Part-B needs to be submitted for grading.

## Learning Objectives

At the end of the experiment, you will be able to:

* fine-tune a seq2seq model (BART, `facebook/bart-large-cnn`) on the SAMSum dataset for summarization
* push the finetuned model to the HuggingFace Model Hub
* load the finetuned model from the Hub for inference

## Dataset Description

The **[SAMSum](https://huggingface.co/datasets/samsum) dataset** contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English, who were asked to write conversations similar to those they have on a daily basis, reflecting the proportion of topics in their real-life messenger conversations. The style and register are diverse: conversations can be informal, semi-formal or formal, and may contain slang words, emoticons and typos. **The conversations were then annotated with summaries**, each intended as a concise brief, written in the third person, of what people talked about in the conversation. The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes.

Data Splits:
- train: 14732
- val: 818
- test: 819

Data Fields:

- ***dialogue***: text of dialogue
- ***summary***: human written summary of the dialogue
- ***id***: unique id of an example

<br>

**Example:**

\{
> '**id**': '13818513',

>'**summary**': 'Amanda baked cookies and will bring Jerry some tomorrow.',

>'**dialogue**': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

\}
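
As a small illustration of the record layout above: the `dialogue` field is a single string whose turns are separated by `\r\n`, each of the form `Speaker: utterance`. The sketch below assumes that layout holds for the record; the helper name `split_turns` is hypothetical and not part of any dataset tooling.

```python
def split_turns(dialogue: str) -> list[tuple[str, str]]:
    """Split a SAMSum-style dialogue string into (speaker, utterance) pairs."""
    turns = []
    for line in dialogue.splitlines():          # handles both \r\n and \n
        if not line.strip():
            continue                            # skip empty lines
        speaker, _, utterance = line.partition(": ")
        turns.append((speaker, utterance))
    return turns


example_dialogue = (
    "Amanda: I baked cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)"
)
turns = split_turns(example_dialogue)
for speaker, utterance in turns:
    print(f"{speaker!r} said {utterance!r}")
```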

## Information

**Summarization** creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task.

**BART** is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.

BART is pre-trained by
1. corrupting text with an arbitrary noising function, and
2. learning a model to reconstruct the original text.
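
The two pre-training steps can be pictured with a toy example. The sketch below is an illustration under simplifying assumptions, not BART's actual noising code: it replaces one random span of words with a single `<mask>` placeholder (a simplified form of text infilling) and pairs the corrupted text with the original as the reconstruction target. The function name `mask_span` and its parameters are hypothetical.

```python
import random


def mask_span(words: list[str], span_len: int, rng: random.Random) -> list[str]:
    """Replace one contiguous span of `span_len` words with a single <mask> token."""
    span_len = min(span_len, len(words))
    start = rng.randrange(len(words) - span_len + 1)
    return words[:start] + ["<mask>"] + words[start + span_len:]


rng = random.Random(0)  # fixed seed so the corruption is reproducible
original = "Amanda baked cookies and will bring Jerry some tomorrow .".split()
corrupted = mask_span(original, span_len=3, rng=rng)

# A denoising model is trained on (corrupted -> original) pairs.
print("input :", " ".join(corrupted))
print("target:", " ".join(original))
```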

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint, `facebook/bart-large-cnn`, has been fine-tuned on CNN Daily Mail dataset, a large collection of text-summary pairs.

To learn more about `facebook/bart-large-cnn`, see its [model card](https://huggingface.co/facebook/bart-large-cnn).

### Install required dependencies

In [None]:
#!pip install "transformers==4.27.2" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.40.2" loralib --upgrade --quiet
# install additional dependencies needed for training

In [None]:
# HuggingFace transformers and datasets
!pip -q install transformers datasets

In [None]:
# 'Accelerate' is the backend for the PyTorch side
#  It enables the PyTorch code to be run across any distributed configuration
!pip -q install accelerate -U


# To install both 'transformers' and 'accelerate' in one go
# !pip install transformers[torch]

In [None]:
# A dependency required for loading the SAMSum dataset
!pip -q install py7zr


### Import required packages

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, Trainer

import warnings
warnings.filterwarnings('ignore')

### **Load Model & Tokenizer**

* **Load the model and tokenizer from HF Model Hub for finetuning**

    - In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the `from_pretrained()` method. **AutoClasses** can be used to automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

    - Instantiating one of `AutoConfig`, `AutoModel`, and `AutoTokenizer` will directly create a class of the relevant architecture.

    - `AutoModelForSeq2SeqLM` instantiates one of the model classes of the library (with a sequence-to-sequence language modeling head) from a configuration.

    - The full path of the model repo, i.e. ***'USER-NAME/REPO-NAME'***, must be specified when calling the `from_pretrained()` method.
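
The idea of picking an architecture from a checkpoint name can be sketched with a toy registry. Everything here (`ToyBartModel`, `ToyT5Model`, `TOY_REGISTRY`, `toy_auto_from_pretrained`) is hypothetical and only mimics the dispatch idea; the real Auto classes work differently in detail and mainly rely on the `model_type` recorded in the checkpoint's configuration.

```python
class ToyBartModel:
    def __init__(self, name: str) -> None:
        self.name = name


class ToyT5Model:
    def __init__(self, name: str) -> None:
        self.name = name


# Hypothetical registry: architecture keyword -> model class.
TOY_REGISTRY = {"bart": ToyBartModel, "t5": ToyT5Model}


def toy_auto_from_pretrained(checkpoint: str):
    """Pick a class by looking for a known architecture keyword in the repo name."""
    repo_name = checkpoint.split("/")[-1].lower()
    for keyword, model_cls in TOY_REGISTRY.items():
        if keyword in repo_name:
            return model_cls(checkpoint)
    raise ValueError(f"Cannot infer architecture from {checkpoint!r}")


toy_model = toy_auto_from_pretrained("facebook/bart-large-cnn")
print(type(toy_model).__name__, "<-", toy_model.name)
```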

In [None]:
# Load model from HF Model Hub

# BART-large has about 400M parameters; see the model card and
# https://github.com/facebookresearch/fairseq/tree/main/examples/bart

checkpoint = "facebook/bart-large-cnn"                # username/repo-name

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

### **Load Dataset**

In [None]:
# Load SAMSum dataset
dataset = load_dataset("samsum", trust_remote_code=True)
dataset

### **Testing the pre-trained model**

#### Observing the data

In [None]:
sample = dataset['test'][0]['dialogue']
label = dataset['test'][0]['summary']
print(sample,'\n','--------------')
print(label)

#### Prompt Preparation

In [None]:
def generate_summary(dialogue, llm):
    """Prepare prompt  -->  tokenize  -->  generate output using LLM  -->  detokenize output"""

    # Same prompt template as the one used to prepare the training data
    input_prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "

    # Truncate to the 512-token limit used for training and move the tensors to the model's device
    inputs = tokenizer(input_prompt, return_tensors='pt', truncation=True, max_length=512).to(llm.device)
    tokenized_output = llm.generate(**inputs, min_length=30, max_length=200)
    output = tokenizer.decode(tokenized_output[0], skip_special_tokens=True)

    return output

#### Getting the output

In [None]:
output = generate_summary(sample, llm=model)
print("Sample")
print(sample)
print("-------------------")
print("Model Generated Summary:")
print(output)
print("Correct Summary:")
print(label)
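
Eyeballing the two summaries is a start; a rough numeric comparison can help too. The sketch below computes a unigram-overlap F1 score, in the spirit of ROUGE-1 but without stemming or the other details of the official metric. The function name `unigram_f1` and the candidate string are illustrative assumptions; for reported results, a maintained ROUGE implementation would be the better choice.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())        # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Amanda baked cookies and will bring Jerry some tomorrow."
candidate = "Amanda will bring Jerry some cookies tomorrow."   # made-up model output
score = unigram_f1(candidate, reference)
print(f"unigram F1 = {score:.3f}")
```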

### **Prepare Dataset**

In [None]:
# Define function to prepare dataset

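def tokenize_inputs(example):

    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]

    # return_tensors='pt' returns PyTorch tensors
    inputs = tokenizer(prompt, padding='max_length', truncation=True, max_length=512, return_tensors='pt')
    targets = tokenizer(example['summary'], padding='max_length', truncation=True, max_length=512, return_tensors='pt')

    example['input_ids'] = inputs.input_ids
    example['attention_mask'] = inputs.attention_mask      # lets the model ignore padded input positions
    # Set padded label positions to -100 so that they are ignored by the loss
    example['labels'] = targets.input_ids.masked_fill(targets.attention_mask == 0, -100)

    return example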
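
What `padding='max_length'` and `truncation=True` do to a sequence, and why padded label positions are commonly set to `-100`, can be pictured with plain lists. This is a toy sketch with made-up token ids; the helper names `pad_or_truncate` and `mask_label_padding` are hypothetical, and a real tokenizer also preserves special tokens when truncating, which this simple slice does not.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_or_truncate(ids: list[int], max_length: int, pad_id: int) -> tuple[list[int], list[int]]:
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def mask_label_padding(label_ids: list[int], attention_mask: list[int]) -> list[int]:
    """Replace padded label positions with IGNORE_INDEX so they do not affect the loss."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(label_ids, attention_mask)]


toy_ids = [0, 713, 16, 10, 2]                    # made-up token ids
input_ids, attention_mask = pad_or_truncate(toy_ids, max_length=8, pad_id=1)
labels = mask_label_padding(input_ids, attention_mask)
print(input_ids)
print(attention_mask)
print(labels)
```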

The code below calls `map()` with `batched=True`, so `tokenize_inputs` receives a batch of examples at a time instead of a single example. This lets a fast tokenizer process many texts in one call.

**Slow** tokenizers are written in Python inside the HF Transformers library, while **fast** tokenizers are provided by HF Tokenizers and written in Rust.

To learn more about 'slow' and 'fast' tokenizers, see [this chapter](https://huggingface.co/learn/nlp-course/chapter6/3?fw=pt) of the Hugging Face NLP course.
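
To make the effect of `batched=True` concrete, here is a toy stand-in for a batched map, written with plain lists and dictionaries. It is only a sketch of the calling convention (the function receives a dictionary of columns, each a list); `toy_batched_map` and `add_length` are hypothetical names and this is not how the `datasets` library is implemented.

```python
def toy_batched_map(rows, fn, batch_size=1000):
    """Call `fn` on column-wise batches (dict of lists) and flatten the results back into rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}   # dict of lists
        result = fn(batch)
        out.extend({key: values[i] for key, values in result.items()} for i in range(len(chunk)))
    return out


calls = []  # records how many examples each call received


def add_length(batch):
    calls.append(len(batch["dialogue"]))
    batch["n_chars"] = [len(text) for text in batch["dialogue"]]
    return batch


rows = [{"dialogue": "Hi!"}, {"dialogue": "Hello there"}, {"dialogue": "Bye"}]
mapped = toy_batched_map(rows, add_length, batch_size=2)
print(mapped)
print("function calls:", len(calls), "| batch sizes:", calls)
```

With three rows and a batch size of 2, the function is called twice rather than three times; with real data and a large batch size the saving in per-call overhead is much greater.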

In [None]:
# Prepare dataset
# Note: the BART tokenizer already defines a padding token (`<pad>`), so it is not overridden here
tokenized_datasets = dataset.map(tokenize_inputs, batched=True)       # batched=True passes batches of examples to the function

# Remove columns/keys that are not needed further
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary'])

In [None]:
# Shorten the data: keep only rows whose index is divisible by 100
# For learning purposes only; this reduces the compute requirement and training time

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

In [None]:
print(tokenized_datasets['train'].shape)
print(tokenized_datasets['validation'].shape)
print(tokenized_datasets['test'].shape)
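
As a quick sanity check on the shapes printed above, the expected number of rows per split can be worked out by hand. The sketch assumes the split sizes listed under "Data Splits" and that the filter keeps indices 0, 100, 200, and so on.

```python
import math

# Split sizes as listed in the dataset description
split_sizes = {"train": 14732, "validation": 818, "test": 819}

# Keeping indices 0, 100, 200, ... leaves ceil(n / 100) rows out of n.
kept = {name: math.ceil(n / 100) for name, n in split_sizes.items()}

# The same count, obtained by applying the filter predicate to every index.
kept_by_filter = {
    name: sum(1 for index in range(n) if index % 100 == 0)
    for name, n in split_sizes.items()
}

print(kept)
print(kept == kept_by_filter)
```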

In [None]:
tokenized_datasets['train'][0].keys()

### **Define Training Arguments and Trainer object**

**To upload the finetuned model to the HF Model Hub, you first need a Hugging Face account, a new model repository, and an access token with read/write permission.**
* [Sign up](https://huggingface.co/join) for a Hugging Face account
* Follow the steps below to create a repository:
    - Click your profile icon on Hugging Face and select `New Model`.
    - Enter a model name, choose MIT as the license, keep the repository public, and create the model.
    - The repository location is shown in the browser URL: https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]
    - In the training arguments, replace `"sumanthk/PEFT_expo"` with your own user name and model repo name.



In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./bart-cnn-samsum-finetuned",        # local directory
    hub_model_id="sumanthk/PEFT_expo",      # identifier on the Hub for directly pushing to HFhub model
    learning_rate=1e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    auto_find_batch_size=True,
    evaluation_strategy='epoch',        # evaluate at the end of each epoch (named `eval_strategy` in recent transformers releases)
    logging_steps=10,
)
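
To get a feel for what these arguments imply, the number of optimizer steps and the learning-rate decay can be estimated by hand. The sketch below rests on several assumptions that are not stated in the notebook: 148 training rows after subsampling, a per-device batch size of 8 that `auto_find_batch_size` does not need to reduce, a single device, no gradient accumulation, and a linear schedule without warmup.

```python
import math

num_rows = 148          # assumed: training rows left after the 1-in-100 subsampling
batch_size = 8          # assumed: per-device batch size, single device
num_epochs = 2          # from the training arguments
base_lr = 1e-5          # from the training arguments

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer updates under linear decay, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


print("steps per epoch:", steps_per_epoch)
print("total steps    :", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```

With `logging_steps=10`, a run of this size would log the training loss only a handful of times.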

In [None]:
from transformers import GenerationConfig

In [None]:
# Configure generation settings

generation_config = GenerationConfig(
    max_length=142,  # Maximum length of generated sequences
    min_length=56,  # Minimum length of generated sequences
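    early_stopping=True,  # Stop beam search as soon as `num_beams` complete candidates are found
    num_beams=4,  # Number of beams for beam search
    length_penalty=2.0,  # Exponent applied to the sequence length in beam scoring; values > 0 favor longer sequences
    no_repeat_ngram_size=3,  # No 3-gram may appear more than once in the output
    forced_bos_token_id=0,  # Token id forced as the first generated token after the decoder start token
    forced_eos_token_id=2,  # Token id forced as the last generated token when max_length is reached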
)

model.generation_config.decoder_start_token_id = tokenizer.cls_token_id
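
The effect of `length_penalty` is easy to misread, so here is a small worked example. It assumes the beam score is the sum of token log-probabilities divided by the sequence length raised to `length_penalty`, as described in the transformers documentation; the two hypotheses and the helper name `beam_score` are made up for illustration.

```python
def beam_score(sum_log_probs: float, length: int, length_penalty: float) -> float:
    """Length-normalised beam score: sum of token log-probabilities / length ** penalty."""
    return sum_log_probs / (length ** length_penalty)


# Two made-up hypotheses with the same average log-probability per token (-0.5).
short_hyp = {"sum_log_probs": -5.0, "length": 10}
long_hyp = {"sum_log_probs": -10.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short_hyp["sum_log_probs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_log_probs"], long_hyp["length"], penalty)
    if long_score > short_score:
        winner = "long"
    elif short_score > long_score:
        winner = "short"
    else:
        winner = "tie"
    print(f"length_penalty={penalty}: short={short_score:.4f} long={long_score:.4f} -> {winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long sequence closer to zero, which is why a value such as 2.0 tends to favor longer summaries.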

In [None]:
trainer = Trainer(
    model=model,           # model to be finetuned
    tokenizer=tokenizer,       # tokenizer to use
    args=training_args,        # training arguments such as epochs, learning_rate, etc
    train_dataset=tokenized_datasets['train'],         # training data to use
    eval_dataset=tokenized_datasets['validation'],     # validation data to use
)

In [None]:
# Disabling Weights and Biases logging
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Training
trainer.train()

### **Save the model on Local system**

In [None]:
ver = 1
output_directory="./bart-cnn-samsum-finetuned"
model_path = os.path.join(output_directory, f"tuned_model_{ver}" )

# Save finetuned model
trainer.save_model(model_path)

# Save associated tokenizer
tokenizer.save_pretrained(model_path)

print(f"\nSaved at path: {model_path}")

### **Load the model from Local system and test**

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
model4local = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [None]:
output = generate_summary(sample, llm = model4local)

print("Sample")
print(sample)
print("-------------------")
print("Summary:")
print(output)
print("Ground Truth Summary:")
print(label)

### **Push your model to Hugging Face Model Hub**

To upload the finetuned model to the HF Model Hub, follow the steps below:

1. Log in to your Hugging Face account.

2. Create an access token for your account and save it.

    To create an access token:

    - Go to your `Settings`, then click on the `Access Tokens` tab. Click on the `New token` button to create a new User Access Token.
    - Select a role (`write`) and a name for your token.
    - Click `Generate a token`.

    To learn more about Access Tokens, see the [documentation](https://huggingface.co/docs/hub/security-tokens).

3. Once you have your User Access Token, run `notebook_login()` to authenticate your identity to the Hub, and paste your access token when prompted.

    For more details on login, see the [quick-start guide](https://huggingface.co/docs/huggingface_hub/quick-start#login).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

4. Push your fine-tuned model and tokenizer to the Model Hub
    - Use the `push_to_hub()` method of the `Trainer`; it uploads the model together with the tokenizer that was passed to the `Trainer`

In [None]:
trainer.push_to_hub()

### **Test your finetuned model downloaded from HF Model Hub**

- Specify your user name and the repository from which the finetuned model will be loaded.
    

In [None]:
username = "sumanthk"      # change it to your HuggingFace username

checkpoint = username + '/PEFT_expo'  # change it to your Repo name

loaded_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
output = generate_summary(sample, llm=loaded_model)

print("Sample")
print(sample)
print("-------------------")
print("Summary:")
print(output)
print("Ground Truth Summary:")
print(label)

### References

- [Summarization](https://huggingface.co/docs/transformers/tasks/summarization)