<div style="text-align: right">INFO 7390 Advances Data Sciences and Architecture SEC 03 Spring 2024</div>
<div style="text-align: right">Creating Data with Generative AI</div>
<div style="text-align: right">Vinay Jogani, NEU ID: 002839145</div>

# <div style="text-align: center">CREATING DATA WITH GENERATIVE AI: PART 1</div>

## Step 1:  Theoretical Foundations of Generative AI

### -  Introduction to generative AI and its applications.

Generative AI encompasses artificial intelligence technologies and models capable of producing new content, such as text, images, audio, and video, that mimic human-created works. Unlike discriminative models, which classify or differentiate between types of input data, generative models generate new, unseen outputs. They learn from large datasets, using algorithms to replicate the data's underlying patterns and styles. This branch of AI diverges from traditional task-specific models, like those used for classification or regression, by prioritizing the creation of novel and original content. It leverages complex models, including Generative Adversarial Networks (GANs), Transformers, and Variational Autoencoders (VAEs), to inspire innovations in content creation, entertainment, and beyond.

Some popular examples of generative AI models include:

- **Generative Adversarial Networks (GANs):** These involve two models, a generator and a discriminator, which are trained simultaneously through adversarial processes. The generator creates content, and the discriminator evaluates it against real data, improving iteratively until the generator produces realistic outputs.
- **Transformers:** Initially designed for natural language processing tasks, transformer models like GPT (Generative Pre-trained Transformer) can generate coherent and contextually relevant text over long passages. They have also been adapted for use in generating images, music, and other types of content.
- **Variational Autoencoders (VAEs):** VAEs are designed to compress data into a lower-dimensional space and then reconstruct it, which can be used for generating new data points with similar properties to the input data.

The development and deployment of generative AI raise important considerations regarding ethics, copyright, and the potential for misuse. However, with appropriate safeguards and ethical guidelines, generative AI can offer significant benefits across various sectors by enhancing creativity, personalizing user experiences, and automating routine or complex creation processes.

#### Applications

Generative AI has a wide range of applications across various industries:

1. **Content Creation:** It can generate creative writing, art, music, and digital images. This technology is revolutionizing the media, entertainment, and advertising industries by automating the content creation process and offering new forms of artistic expression.

2. **Design and Modeling:** In architecture and product design, generative AI can propose numerous design options based on specific criteria, significantly speeding up the design process and enabling more innovation.

3. **Simulation and Training:** Generative AI can create realistic simulations and environments for training purposes in fields such as medicine, aviation, and military.

4. **Personalization:** It can generate personalized content in real-time, enhancing user experiences in apps and websites by tailoring content, recommendations, and interactions to the individual user.

5. **Data Augmentation:** Generative AI is used to augment data sets in machine learning projects, especially where data may be scarce or expensive to obtain, improving the robustness and accuracy of predictive models.

6. **Drug Discovery and Healthcare:** It accelerates the drug discovery process by predicting molecular properties and generating novel molecular structures, leading to faster development of new medicines. In healthcare, generative AI can create realistic patient data for research without compromising privacy.

Generative AI's potential is vast, promising to transform how we create, innovate, and interact with technology. As these systems become more sophisticated, their influence is set to expand, touching every aspect of our digital lives. However, ethical considerations and potential misuse are important concerns that must be addressed to ensure these technologies benefit society as a whole.

### - The relevance of data generation in various data science tasks.

Data generation plays a crucial role in various data science tasks, serving multiple purposes that enhance model training, testing, and deployment across industries. Here's why data generation is so relevant:

1. **Augmenting Training Data:** In many machine learning projects, obtaining a large and diverse dataset is a significant challenge. Data generation techniques can create additional training data, enhancing the model's ability to learn and generalize. For instance, in image recognition tasks, data augmentation methods like flipping, rotation, and color variation can significantly increase the diversity of training samples without the need for more original images.

2. **Overcoming Data Privacy Issues:** Generative models can create synthetic datasets that mimic the statistical properties of real-world data without containing any actual real-world data. This is particularly relevant in fields like healthcare and finance, where data privacy is paramount. Synthetic data can be used for training machine learning models without risking exposure of sensitive information.

3. **Improving Model Robustness and Generalization:** Generated data can be used to test and improve the robustness of models against unusual or rare scenarios that might not be well-represented in the original training data. By exposing a model to a wider array of scenarios, including edge cases, data generation can help ensure that the model performs well under diverse conditions.

4. **Simulating Time-series Data:** In domains such as finance, weather forecasting, and supply chain management, generative models can simulate future scenarios or missing historical data. This can be crucial for stress testing, scenario analysis, and forecasting tasks where real historical data might be limited or non-existent.

5. **Enhancing User Experience with Personalization:** Generative AI can create personalized content for users in applications such as recommendation systems, gaming, and interactive media. By understanding user preferences and generating content tailored to individual tastes, AI can significantly enhance user engagement and satisfaction.

6. **Enabling Data-Driven Decision Making:** In environments where real-world experimentation is costly or impractical, generated data can provide a valuable alternative for testing hypotheses and making decisions. For example, in manufacturing and product design, generative models can simulate the outcomes of design choices without the need for physical prototypes.

7. **Training Models in Safe Environments:** In fields like autonomous driving and robotics, generative models can create simulated environments for training AI systems. This allows models to learn from a vast array of scenarios, including dangerous situations, without any real-world risk.

The relevance of data generation in data science cannot be overstated. It not only addresses practical challenges like data scarcity and privacy but also opens up new possibilities for innovation, personalization, and improved decision-making across various sectors.
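To make the augmentation methods from item 1 (flipping, rotation, color variation) concrete, here is a minimal NumPy sketch. The function name `augment_image`, the image size, and the 0.8 to 1.2 color scaling range are illustrative assumptions, and a random array stands in for a real photo.

```python
import numpy as np

def augment_image(image: np.ndarray, rng: np.random.Generator) -> list:
    """Return simple augmented variants of an H x W x C image with values in [0, 1]."""
    flipped = np.fliplr(image)                    # horizontal flip (reverses the width axis)
    rotated = np.rot90(image, k=1, axes=(0, 1))   # 90-degree rotation in the image plane
    # Color variation: scale each channel by a random factor close to 1 (assumed range).
    scale = rng.uniform(0.8, 1.2, size=(1, 1, image.shape[2]))
    recolored = np.clip(image * scale, 0.0, 1.0)
    return [flipped, rotated, recolored]

rng = np.random.default_rng(seed=0)
image = rng.random((32, 48, 3))   # hypothetical stand-in for a real training image
variants = augment_image(image, rng)
print([v.shape for v in variants])
```

Each original image yields three extra training samples here; libraries built for this purpose offer many more transformations, but the idea is the same.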

### - Theoretical underpinnings of the chosen generative AI method.

The selection of a generative AI method depends on the specific task, desired outcomes, and the nature of the data. Let's explore the theoretical underpinnings of three prominent generative AI methods: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformer-based models, each serving different purposes and based on distinct principles.

#### Generative Adversarial Networks (GANs)

**Theoretical Foundation:** GANs are based on a game-theoretic scenario where two neural networks—the generator and the discriminator—compete against each other. The generator tries to produce data that is indistinguishable from real data, while the discriminator tries to distinguish between real data and data produced by the generator. This process can be thought of as a minimax game, where the generator network is trying to minimize a loss function while the discriminator is trying to maximize it.

**Key Concepts:**
- **Adversarial Training:** This process improves the generator's ability to produce realistic data and the discriminator's ability to detect fakes. Over time, the generator becomes so good at producing data that the discriminator can hardly distinguish real from fake.
- **Latent Space:** GANs map points in a latent space to the data space. By exploring this latent space, GANs can generate novel data instances that mimic the training data.
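
The minimax game described above is usually written as the following value function (the standard formulation from the original GAN paper, shown here for reference). The symbols are not defined elsewhere in this notebook: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the real data distribution, and $p_z$ the prior over latent vectors $z$.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by making $D(G(z))$ as large as possible.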

#### Variational Autoencoders (VAEs)

**Theoretical Foundation:** VAEs are grounded in the principles of Bayesian inference and deep learning. They consist of an encoder, a decoder, and a loss function that measures two things: how well the decoder's outputs match the original inputs and how closely the encoded latent variables match a prior distribution, usually a Gaussian distribution.

**Key Concepts:**
- **Probabilistic Latent Variables:** VAEs introduce a probabilistic twist to the encoding process, where the encoder outputs parameters to a probability distribution representing each data point in the latent space.
- **Reconstruction Loss:** This measures how well the decoded samples match the original inputs, encouraging the decoder to reconstruct the original data accurately.
- **KL Divergence:** This part of the loss function ensures that the learned distribution of latent variables doesn't stray too far from the prior distribution, aiding in the generation of new samples that are similar to the input data.
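
The two loss terms above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a training implementation: it assumes a diagonal Gaussian encoder output (`mu`, `log_var`), a standard normal prior, and squared error as the reconstruction term. The function name `vae_loss` and the toy arrays are made up for the example.

```python
import numpy as np

def vae_loss(x, x_reconstructed, mu, log_var):
    """Per-example VAE loss: reconstruction error plus KL divergence to N(0, I)."""
    # Reconstruction term: squared error summed over the feature dimension.
    reconstruction = np.sum((x - x_reconstructed) ** 2, axis=1)
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return reconstruction + kl, reconstruction, kl

# Toy batch with one example, three features, and two latent dimensions.
x = np.array([[0.2, 0.8, 0.5]])
x_hat = np.array([[0.25, 0.7, 0.5]])
mu = np.zeros((1, 2))        # encoder mean equal to the prior mean
log_var = np.zeros((1, 2))   # encoder variance equal to the prior variance

total, rec, kl = vae_loss(x, x_hat, mu, log_var)
print(total, rec, kl)
```

When the encoder output matches the prior exactly, as in this toy input, the KL term is zero and only the reconstruction error contributes to the loss.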

#### Transformer-based Models

**Theoretical Foundation:** Transformer models, initially introduced for natural language processing tasks, are based on self-attention mechanisms. They can weigh the relevance of different parts of the input data differently, making them highly effective for generating coherent and contextually relevant sequences of data, whether text, images, or something else.

**Key Concepts:**
- **Attention Mechanisms:** These allow the model to focus on different parts of the input sequence when producing each part of the output sequence, facilitating the generation of complex patterns and sequences.
- **Positional Encoding:** Transformers use positional encodings to maintain the order of the sequence, crucial for generating coherent outputs that follow a logical or temporal order.
- **Parallel Processing:** Unlike recurrent models, which process tokens one at a time, transformers attend over all positions of a sequence at once, which makes training efficient and scalable on parallel hardware.
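
The attention mechanism listed above can be illustrated with a minimal NumPy sketch of single-head scaled dot-product self-attention. It leaves out the learned projection matrices, masking, and multiple heads that a real transformer uses, and the sequence length and model width are arbitrary assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q @ k.T / sqrt(d_k)) @ v for 2-D inputs (seq_len x d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                    # illustrative sizes
x = rng.normal(size=(seq_len, d_model))    # stand-in for token embeddings

# Self-attention: queries, keys, and values all come from the same sequence.
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1 and says how much one position draws on every other position. All rows are computed in a single matrix product, which is why the whole sequence can be processed at once.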

Each of these generative AI methods has its theoretical nuances and practical applications, making them suited to different kinds of data generation tasks. The choice among them depends on the specific requirements of the task, including the complexity of the data, the need for realism, interactivity, and the computational resources available.

### - How generative AI contributes to solving data-related problems.

Generative AI significantly contributes to solving a wide range of data-related problems, addressing issues such as data scarcity, privacy concerns, data diversity, and the need for synthetic data for testing and simulation. Here’s how generative AI is making a difference:

#### Addressing Data Scarcity

- **Synthetic Data Generation:** Generative AI models can produce synthetic datasets when collecting real data is challenging or expensive. This is especially useful in domains where data is scarce or difficult to obtain, such as rare medical conditions.
- **Data Augmentation:** In cases where there is a limited amount of training data, generative AI can augment existing datasets by creating variations of the data, thereby enhancing the robustness and performance of machine learning models.

#### Enhancing Privacy and Security

- **Privacy-preserving Synthetic Data:** Generative models can create datasets that mimic the statistical properties of original data without containing any personal or sensitive information. This allows for the safe sharing of data and enables research and development activities while complying with privacy regulations like GDPR.
- **Secure Data Sharing:** By generating synthetic data that retains the utility of real datasets but does not expose sensitive information, generative AI facilitates secure data sharing between organizations and researchers.

#### Increasing Data Diversity and Quality

- **Diversifying Training Data:** Generative AI can introduce diversity into datasets, creating more inclusive and representative data. This helps in developing AI systems that are fair and unbiased, performing well across varied demographics.
- **Improving Data Quality:** Generative models can be used to clean data by generating missing values or correcting erroneous entries, thereby improving the overall quality and reliability of datasets.

#### Enabling Advanced Simulations and Testing

- **Simulating Complex Scenarios:** In fields such as autonomous driving and robotics, generative AI can create realistic simulation environments that help in training and testing AI systems under a wide range of conditions without the real-world risks.
- **Testing and Validation:** Generative AI aids in generating data for testing and validation purposes, ensuring that models and systems are robust, reliable, and ready for deployment in diverse and unpredictable real-world situations.

#### Facilitating Research and Development

- **Accelerating Drug Discovery:** In the pharmaceutical industry, generative models can predict molecular structures that could lead to new drugs, speeding up the discovery process and reducing the reliance on traditional trial-and-error methods.
- **Innovative Content Creation:** In creative industries, generative AI assists in designing art, music, literature, and even video game environments, pushing the boundaries of creativity and enabling new forms of expression.

Generative AI's ability to generate, augment, and enhance data has broad implications, offering solutions to some of the most pressing data-related challenges across industries. By enabling more efficient and effective use of data, generative AI drives innovation, enhances productivity, and opens up new possibilities for solving complex problems.

# Refine a Generative AI Model for Summarizing Conversations

In this notebook, I will fine-tune an existing LLM from Hugging Face to enhance dialogue summarization. I'll be using the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can already summarize text effectively. To refine the inferences further, I plan to implement a full fine-tuning approach and assess the outcomes using ROUGE metrics. Afterwards, I'll explore Parameter Efficient Fine-Tuning (PEFT), evaluate the performance of the optimized model, and demonstrate that the advantages of PEFT compensate for its slightly lower performance metrics.
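
To give a sense of what a ROUGE metric measures, here is a simplified sketch of ROUGE-1, which counts unigram overlap between a reference summary and a generated one. It is an illustration only: it splits on whitespace and lowercases, whereas the `rouge_score` package applies its own tokenization and optional stemming. The function name `rouge1_scores` and the two sentences are made up for the example.

```python
from collections import Counter

def rouge1_scores(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1_scores(
    reference="the cat sat on the mat",
    candidate="the cat lay on the mat",
)
print(scores)
```

Five of the six words match in this toy pair, so precision, recall, and F1 all equal 5/6. ROUGE-2 and ROUGE-L follow the same pattern with bigrams and longest common subsequences.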

### Model

FLAN-T5 represents an evolution in the development of language models, focusing on the ability to understand and execute a wide variety of tasks described through natural language instructions. Here's a more detailed look at how FLAN-T5 is designed and how it functions:

### 1. **Background and Motivation**
   T5, or the Text-to-Text Transfer Transformer, was initially developed as a versatile model that treats every language problem as a text-to-text task. While highly effective, its performance could vary across tasks that it hadn't been explicitly fine-tuned on. FLAN-T5 builds on this by emphasizing the model's ability to handle instructions directly, thereby making it more adaptable across a broader range of tasks without specific fine-tuning.

### 2. **Instruction Tuning**
   The core innovation in FLAN-T5 is instruction tuning. Instead of training the model solely on task-specific data, FLAN-T5 is exposed to a variety of tasks described through instructions during its training phase. This includes a diverse set of datasets where the task is to follow written instructions to generate an appropriate output. For example, the model might be given an instruction like "Explain the steps involved in photosynthesis," and it must generate an informative response based on this prompt.

   The training process incorporates a large amount of data encompassing different formats and domains, such as question answering, summarization, translation, and more, each framed as an instruction-following task. This method teaches the model not just the specifics of a task, but how to interpret and respond to instructions generally.

### 3. **Few-Shot Learning Capabilities**
   Few-shot learning refers to the model's ability to perform tasks effectively with very limited examples. By training FLAN-T5 on a variety of instruction-based tasks, it learns a generalizable way to handle new tasks it has never seen before, using only a few examples. This is particularly important in scenarios where gathering large datasets is impractical or impossible.
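
As a rough sketch of what "a few examples" means in practice, the helper below builds either a zero-shot prompt or a few-shot prompt by placing solved examples in front of the new dialogue. The instruction text matches the one used later in this notebook; the helper name `build_prompt` and the short dialogues are hypothetical.

```python
from typing import List, Optional, Tuple

INSTRUCTION = "Summarize the following conversation."

def build_prompt(dialogue: str,
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Build a zero-shot prompt, or a few-shot prompt when solved examples are given."""
    parts = []
    for example_dialogue, example_summary in examples or []:
        # Each solved example shows the model the expected input/output pattern.
        parts.append(f"{INSTRUCTION}\n\n{example_dialogue}\n\nSummary: {example_summary}")
    # The final block leaves the summary empty for the model to complete.
    parts.append(f"{INSTRUCTION}\n\n{dialogue}\n\nSummary: ")
    return "\n\n".join(parts)

# Hypothetical dialogues, written only for this illustration.
examples = [(
    "#Person1#: Is the report ready?\n#Person2#: Yes, I sent it this morning.",
    "#Person2# tells #Person1# the report was sent this morning.",
)]
new_dialogue = "#Person1#: Lunch at noon?\n#Person2#: Sure, see you then."

zero_shot = build_prompt(new_dialogue)
few_shot = build_prompt(new_dialogue, examples)
print(few_shot)
```

The few-shot variant costs more input tokens per request, so the number of examples has to fit within the model's context window.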

### 4. **Generalization Across Tasks**
   The instruction tuning makes FLAN-T5 robust in generalizing its learning to new tasks. It can understand the requirements of a task based on how the instruction is phrased, even if it has limited exposure to that specific type of task. This ability to generalize from instructions to task execution without extensive retraining on task-specific data sets FLAN-T5 apart from many traditional models.

### 5. **Performance**
   Empirical results have shown that instruction-tuned models like FLAN-T5 perform better across a broad spectrum of tasks compared to models that are not instruction-tuned. This includes achieving state-of-the-art results on benchmarks that measure a model’s ability to generalize from limited examples.

### 6. **Applications**
   FLAN-T5's ability to understand and process instructions in natural language makes it highly suitable for applications such as virtual assistants, automated customer support, content creation, and any application requiring interaction in natural language where the tasks can vary widely.

Overall, FLAN-T5 represents a significant step forward in the development of versatile, robust language models capable of adapting to and performing a wide array of tasks based solely on natural language instructions.

<a name='1'></a>
## Set up the kernel, load the necessary dependencies, prepare the dataset, and initialize the large language model

<a name='1.1'></a>
### 1 - Setting up Kernel and Required Dependencies

I am now installing the necessary packages for the LLM and datasets.

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: Could not find a version that satisfies the requirement torch==1.13.1 (from versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2)
ERROR: No matching distribution found for torch==1.13.1
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


Importing the necessary components. 

In [2]:
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    GenerationConfig,
    TrainingArguments,
    Trainer,
)
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 2 - Loading Dataset and LLM

I am going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset, which includes over 10,000 dialogues along with their manually labeled summaries and topics.

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum" # HF dataset for summarization task

dataset = load_dataset(huggingface_dataset_name)

dataset

Found cached dataset csv (/Users/vinayjogani/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Because the pinned `torch==1.13.1` release was not available for this environment (see the install error above), I install the current releases of `torch` and `transformers` instead and check which version of `torch` was picked up.

In [4]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [6]:
import transformers
print(torch.__version__)

2.2.2


In [7]:
pip install torch torchvision


Note: you may need to restart the kernel to use updated packages.


In [8]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

try:
    model_name = "google/flan-t5-base"
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print("Failed to load model and tokenizer:", str(e))

Model and tokenizer loaded successfully.


In [9]:
model_name = "google/flan-t5-base"  # FLAN-T5 model for the summarization task

original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

I first move the model to the GPU if one is available. The function defined right after that extracts the number of model parameters and determines how many are trainable; there's no need to delve into its details at this stage.

In [10]:
# run flan t5 in GPU mode if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
original_model.to(device)


T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [11]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='1.3'></a>
### 3 - Testing the Model with Zero-Shot Inference

In [12]:
index = 200

dialogue = dataset["test"][index]["dialogue"]
summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors="pt")
output = tokenizer.decode(
    original_model.generate(
        # keep the inputs on the same device as the model (CPU or GPU)
        inputs["input_ids"].to(original_model.device),
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True,
)

dash_line = "-" * 99  # separator line for the printed output
print(dash_line)
print(f"INPUT PROMPT:\n{prompt}")
print(dash_line)
print(f"BASELINE HUMAN SUMMARY:\n{summary}\n")
print(dash_line)
print(f"MODEL GENERATION - ZERO SHOT:\n{output}")

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

<a name='2'></a>
##  Perform Full Fine-Tuning

<a name='2.1'></a>
### 1 - Preprocessing the Dialog-Summary Dataset

I need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. I prepend the instruction `Summarize the following conversation.` to the start of each dialogue and append `Summary: ` to its end; the human-written summary then serves as the training response, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```


In [13]:
def tokenize_function(example):
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example["input_ids"] = tokenizer(
        prompt, padding="max_length", truncation=True, return_tensors="pt"
    ).input_ids
    example["labels"] = tokenizer(
        example["summary"], padding="max_length", truncation=True, return_tensors="pt"
    ).input_ids

    return example


# The dataset contains 3 different splits: train, validation, test.
# With batched=True, map applies tokenize_function to every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(
    [
        "id",
        "topic",
        "dialogue",
        "summary",
    ]
)

In [14]:
# Subsample to save time: keep only every 100th example of each split.
tokenized_datasets = tokenized_datasets.filter(
    lambda example, index: index % 100 == 0, with_indices=True
)

Checking the shapes of all parts of the dataset:

In [15]:
print("Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2 - Fine-Tune the Model with the Preprocessed Dataset


In [16]:
output_dir = f"./dialogue-summary-training-{int(time.time())}"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=50,  # a positive max_steps overrides num_train_epochs
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Start training process...

In [None]:
trainer.train()

Fully fine-tuning the model would take a few hours on a GPU. To save time, I download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. I refer to this fully fine-tuned model as the **instruct model**.

In [16]:
!aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/

download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/generation_config.json to flan-dialogue-summary-checkpoint/generation_config.json
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/config.json to flan-dialogue-summary-checkpoint/config.json
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/trainer_state.json to flan-dialogue-summary-checkpoint/trainer_state.json
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/scheduler.pt to flan-dialogue-summary-checkpoint/scheduler.pt
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/training_args.bin to flan-dialogue-summary-checkpoint/training_args.bin
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/rng_state.pth to flan-dialogue-summary-checkpoint/rng_state.pth
download: s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/pytorch_model.bin to flan-dialogue-summary-checkpoint/pytorch_model.

The downloaded instruct model is approximately 1 GB in size.

In [17]:
!ls -alh ./flan-dialogue-summary-checkpoint/pytorch_model.bin

-rw-r--r-- 1 root root 945M May 15 10:25 ./flan-dialogue-summary-checkpoint/pytorch_model.bin


Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [18]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    "./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16
)

<a name='2.3'></a>
### 3 - Evaluating the Model Qualitatively (Human Evaluation)

For generative AI applications, a qualitative assessment is often an effective first check: I read the model's output myself and ask, "Is my model behaving the way it is supposed to?" This helps surface discrepancies between the expected and actual behavior of the model.

The example below compares the original model with the fine-tuned one on a single test dialogue. The original model struggled to understand what it was being asked to do and often failed to generate a coherent response, whereas the fine-tuned model is expected to produce a more accurate and reasonable summary of the dialogue. This underscores the importance of iterative adjustment and evaluation when developing effective AI models.

In [60]:
index = 300
dialogue = dataset["test"][index]["dialogue"]
human_baseline_summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
original_model_text_output = tokenizer.decode(
    original_model_outputs[0], skip_special_tokens=True
)

instruct_model_outputs = instruct_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
instruct_model_text_output = tokenizer.decode(
    instruct_model_outputs[0], skip_special_tokens=True
)

print(dash_line)
print(f"BASELINE HUMAN SUMMARY:\n{human_baseline_summary}")
print(dash_line)
print(f"ORIGINAL MODEL:\n{original_model_text_output}")
print(dash_line)
print(f"INSTRUCT MODEL:\n{instruct_model_text_output}")

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is crazy for Trump and voted for him. #Person2# doesn't agree with #Person1# on Trump and will vote for Biden.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: I am proud to say he is our President again. #Person2: I am not sure about this. #Person1: I am not sure about this. #Person2: I trust that he will take good care of our country. #Person1: I will vote for Biden anyway.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# and #Person2# are talking about the future of the President. #Person2# has nothing but faith in Trump and #Person1# will vote for Biden.


<a name='2.4'></a>
### 2.4 - Evaluating the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores a generated summary by its word and n-gram overlap with a "baseline" summary, typically written by a human. It is not a perfect measure of quality, but it does show how much summarization improved through fine-tuning.
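
To make the metric concrete, here is a minimal sketch of ROUGE-1 as clipped unigram overlap, reported as an F1 score. It is a simplified illustration, not the implementation behind `evaluate.load("rouge")`: it assumes whitespace tokenization and lowercasing with no stemming, so its numbers will differ from `rouge.compute(...)`. The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter & Counter keeps the minimum count per token (clipped overlap).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge1_f1(prediction, reference)
print(f"ROUGE-1 F1: {score:.3f}")
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.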

In [61]:
rouge = evaluate.load("rouge")

I will generate outputs for a sample of the test dataset (only 10 dialogues and summaries, to save time) and collect the results in a DataFrame.

In [62]:
dialogues = dataset["test"][0:10]["dialogue"]
human_baseline_summaries = dataset["test"][0:10]["summary"]

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    original_model_text_output = tokenizer.decode(
        original_model_outputs[0], skip_special_tokens=True
    )
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0], skip_special_tokens=True
    )
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(
    zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries)
)

df = pd.DataFrame(
    zipped_summaries,
    columns=[
        "human_baseline_summaries",
        "original_model_summaries",
        "instruct_model_summaries",
    ],
)
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: Ms. Dawson, I need a dictation. #Pe... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | The following are the new policy guidelines fo... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1#: I'm here. I'm a driver. #Person2#: ... | #Person2# got stuck in traffic again. #Person1... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | #Person1: I'm sorry to hear that. #Person2: I'... | #Person2# got stuck in traffic again. #Person1... |
| 5 | #Person2# complains to #Person1# about the tra... | People are talking about their car. | #Person2# got stuck in traffic again. #Person1... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting a divorce. | Masha and Hero are getting divorced. Kate can'... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Brian's a really popular person. #P... | Brian's birthday is coming. #Person1# invites ... |


I'll evaluate the models using ROUGE metrics and look at the improvements in the results!

In [63]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0 : len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0 : len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.21994389562610994, 'rouge2': 0.0767787352834082, 'rougeL': 0.18932090663536644, 'rougeLsum': 0.188761363905078}
INSTRUCT MODEL:
{'rouge1': 0.41026607717457186, 'rouge2': 0.17840645241958838, 'rougeL': 0.2977022096267017, 'rougeLsum': 0.2987374187518165}


The file `data/dialogue-summary-training-results.csv` includes a pre-populated list of all model results. I will use this to evaluate a larger section of data for each model.

In [64]:
results = pd.read_csv("data/dialogue-summary-training-results.csv")

human_baseline_summaries = results["human_baseline_summaries"].values
original_model_summaries = results["original_model_summaries"].values
instruct_model_summaries = results["instruct_model_summaries"].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0 : len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0 : len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}


The fine-tuned model improves on every ROUGE metric. The figures below are absolute differences in percentage points, not relative gains:

In [65]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = np.array(list(instruct_model_results.values())) - np.array(
    list(original_model_results.values())
)
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
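
Because the printed figures are differences in percentage points, they understate the relative change. As a rough sketch of the distinction, the snippet below recomputes both the absolute and the relative gain from the full-dataset scores reported above, rounded to four decimals. The dictionary names are illustrative only.

```python
# Full-dataset ROUGE scores copied from the output above (rounded to 4 decimals).
original = {"rouge1": 0.2334, "rouge2": 0.0760, "rougeL": 0.2015, "rougeLsum": 0.2015}
instruct = {"rouge1": 0.4216, "rouge2": 0.1804, "rougeL": 0.3384, "rougeLsum": 0.3384}

absolute_gain = {}
relative_gain = {}
for key in original:
    # Absolute gain: difference of the two scores, in percentage points.
    absolute_gain[key] = (instruct[key] - original[key]) * 100
    # Relative gain: improvement as a percentage of the original score.
    relative_gain[key] = (instruct[key] / original[key] - 1) * 100
    print(f"{key}: +{absolute_gain[key]:.2f} points, +{relative_gain[key]:.1f}% relative")
```

Reporting both makes it clear whether a figure such as "18.82%" refers to points on the ROUGE scale or to growth over the baseline.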


<a name='3'></a>
## 3 - Performing Parameter Efficient Fine-Tuning (PEFT)

This section looks more closely at **Parameter Efficient Fine-Tuning (PEFT)**, focusing on **Low-Rank Adaptation (LoRA)**, to show how it streamlines fine-tuning while staying efficient.

**PEFT** updates a large language model (LLM) without retraining all of its weights, as full fine-tuning does. Full fine-tuning requires significant compute and memory because every parameter is updated. PEFT instead makes smaller, more targeted changes, which saves resources and leaves the original model intact while adapting it to a specific task.

Among the techniques under the PEFT umbrella, **LoRA** stands out. LoRA targets specific components of the neural network, typically the weight matrices of linear layers such as the attention projections. It freezes the original weights and learns a low-rank update for each targeted matrix, so the vast majority of the model's parameters remain unchanged. Here's how it works:

1. **Selectivity**: Instead of updating all parameters of the model, LoRA trains only a small set of newly added low-rank matrices attached to the chosen layers (in this notebook, the `q` and `v` attention projections). The original weights stay frozen.

2. **Efficiency**: Because LoRA trains far fewer parameters, it requires fewer computational resources. Often a single GPU is sufficient, which is particularly beneficial for organizations without access to large computing clusters.

3. **Adapter Creation**: LoRA training produces a compact module, or "adapter", that encapsulates the updates the model needs for a specific task. These adapters are significantly smaller than the full model, often just a single-digit percentage of the original model's size.

4. **Compatibility and Reusability**: At inference time, the LoRA adapter must be combined with the original LLM, because the adapter stores only the update and relies on the base model for everything else. The same base LLM can be paired with multiple adapters, each trained for a different task or use case. This reuse greatly reduces the memory footprint when the model is deployed for multiple purposes.

In summary, LoRA and other PEFT methods provide a practical and efficient way to customize large language models for specific needs without the extensive resource demands of traditional full fine-tuning. This makes PEFT particularly attractive for applications where model agility and computational efficiency are priorities.
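
The low-rank update at the heart of LoRA can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the `peft` implementation: the matrix sizes, the `matmul` and `random_matrix` helpers, and the random initialization are all chosen for the demo. In the LoRA formulation the frozen weight `W` is augmented as `W + (lora_alpha / r) * B @ A`, where only `A` and `B` are trained.

```python
import random

random.seed(0)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols):
    """Small random matrix, used here only to have concrete numbers."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


d_out, d_in, r, lora_alpha = 8, 8, 2, 4  # toy sizes, chosen only for illustration

W = random_matrix(d_out, d_in)  # frozen pretrained weight
A = random_matrix(r, d_in)      # trainable, shape r x d_in
# Trainable, shape d_out x r. LoRA normally starts B at zero so the update is
# initially a no-op; it is random here so the demo shows a nonzero change.
B = random_matrix(d_out, r)
scale = lora_alpha / r

# Merged weight: W' = W + (lora_alpha / r) * B @ A
delta = matmul(B, A)
W_merged = [
    [w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)
]

# Applying the merged weight equals the base output plus the adapter branch.
x = [[random.gauss(0.0, 1.0)] for _ in range(d_in)]  # column vector
merged_out = matmul(W_merged, x)
adapter_out = [
    [wx[0] + scale * bax[0]]
    for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
max_diff = max(abs(m[0] - a[0]) for m, a in zip(merged_out, adapter_out))

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"full update: {full_params} parameters, LoRA update: {lora_params} parameters")
print(f"max difference between merged and adapter outputs: {max_diff:.2e}")
```

The saving grows with the layer width: for a fixed rank `r`, the full matrix scales with `d_out * d_in` while the adapter scales with `d_out + d_in`.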

<a name='3.1'></a>
### 3.1 - Setting up the PEFT/LoRA Model for Fine-Tuning

I need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. With PEFT/LoRA, the underlying LLM is frozen and only the adapter is trained. Note the rank (`r`) hyper-parameter in the LoRA configuration below, which sets the rank (inner dimension) of the adapter matrices to be trained.

In [66]:
from peft import LoraConfig, get_peft_model, TaskType

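lora_config = LoraConfig(
    r=16,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q", "v"],  # Apply LoRA to the query and value projections
    lora_dropout=0.05,  # Dropout on the LoRA branch during training
    bias="none",  # Do not train bias terms
    task_type=TaskType.SEQ_2_SEQ_LM,  # FLAN-T5
)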

In [67]:
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1769472
all model parameters: 249347328
percentage of trainable model parameters: 0.71%
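
The trainable-parameter count can be reproduced by hand. The sketch below assumes the published FLAN-T5-base architecture (12 encoder and 12 decoder blocks, with 768-dimensional query and value projections), which this notebook does not state. Under that assumption, each targeted `q` or `v` layer gains two matrices of shapes `r x 768` and `768 x r`.

```python
# Assumed FLAN-T5-base dimensions (not stated in this notebook).
d_model = 768        # input and output width of each q / v projection
encoder_blocks = 12  # each has one self-attention layer
decoder_blocks = 12  # each has self-attention plus cross-attention
r = 16               # LoRA rank from lora_config

# target_modules=["q", "v"] -> two adapted projections per attention layer.
attention_layers = encoder_blocks + 2 * decoder_blocks
adapted_projections = 2 * attention_layers

# Each adapted projection adds A (r x d_model) and B (d_model x r).
params_per_projection = r * (d_model + d_model)
trainable = adapted_projections * params_per_projection

total = 249_347_328  # "all model parameters" printed above
print(f"trainable: {trainable}")
print(f"share of all parameters: {100 * trainable / total:.2f}%")
```

Doubling `r` doubles the number of trainable parameters, since the count is linear in the rank.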


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [68]:
output_dir = f"./peft-dialogue-summary-training-{str(int(time.time()))}"

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1,  # Demo only: stop after one step (overrides num_train_epochs).
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [69]:
peft_trainer.train()

peft_model_path = "./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



| Step | Training Loss |
|---|---|
| 1 | 51.0 |


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

That training was performed on only a small subset of the data (a single step, since `max_steps=1`), so the local adapter is barely trained. To work with a fully trained PEFT model, I will download a PEFT checkpoint from S3.

In [70]:
!aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ ./peft-dialogue-summary-checkpoint-from-s3/ 

download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_config.json to peft-dialogue-summary-checkpoint-from-s3/adapter_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer_config.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/special_tokens_map.json to peft-dialogue-summary-checkpoint-from-s3/special_tokens_map.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_model.bin to peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


Check that this adapter is much smaller than the original LLM (about 14 MB, compared with roughly 945 MB for the full model checkpoint downloaded earlier):

In [71]:
!ls -al ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin

-rw-r--r-- 1 root root 14208525 May 15 11:18 ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin
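
As a quick sanity check on the size claim, the adapter file can be compared with the fully fine-tuned checkpoint listed earlier. The sketch assumes that the `945M` reported by `ls -alh` is in units of 1024 x 1024 bytes, so the ratio is approximate.

```python
adapter_bytes = 14_208_525        # adapter_model.bin, from `ls -al` above
full_model_bytes = 945 * 1024**2  # pytorch_model.bin, "945M" from `ls -alh` (approximate)

ratio = adapter_bytes / full_model_bytes
print(f"adapter size: {adapter_bytes / 1024**2:.1f} MiB")
print(f"adapter / full checkpoint: {ratio:.2%}")
```

The adapter is between 1% and 2% of the full checkpoint, in line with the "single-digit percentage" rule of thumb for LoRA adapters.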


I will prepare this model by adding an adapter to the original FLAN-T5 model. I am setting `is_trainable=False` because my plan is only to perform inference with this PEFT model. If I were preparing the model for further training, I would set `is_trainable=True`.

In [72]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    "./peft-dialogue-summary-checkpoint-from-s3/",
    torch_dtype=torch.bfloat16,
    is_trainable=False,
)

The setting `is_trainable=False` means I will have `0` trainable parameters.

In [73]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%
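
The total printed here (251,116,800) is larger than the total printed in the earlier PEFT setup cell (249,347,328). The short calculation below is offered as an inference, not something the source confirms: it suggests the downloaded adapter holds twice as many LoRA parameters as the `r=16` adapter configured above, which would be consistent with a rank of 32 on the same `q` and `v` modules. The actual rank is recorded in the downloaded `adapter_config.json`.

```python
local_total = 249_347_328  # all parameters with the local r=16 adapter
local_lora = 1_769_472     # trainable LoRA parameters at r=16
s3_total = 251_116_800     # all parameters with the adapter downloaded from S3

# Subtract the local adapter to get the size of the frozen base model.
base_params = local_total - local_lora
# Whatever remains of the second total belongs to the downloaded adapter.
s3_lora = s3_total - base_params

print(f"base model parameters: {base_params}")
print(f"LoRA parameters in the downloaded adapter: {s3_lora}")
print(f"ratio to the r=16 adapter: {s3_lora / local_lora}")
```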


<a name='3.3'></a>
### 3.3 - Evaluating the Model Qualitatively (Human Evaluation)

I will run inference on the same example discussed in sections 1.3 and 2.3, using the original model, the fully fine-tuned model, and the PEFT model.

In [74]:
index = 300
dialogue = dataset["test"][index]["dialogue"]
human_baseline_summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
original_model_text_output = tokenizer.decode(
    original_model_outputs[0], skip_special_tokens=True
)

instruct_model_outputs = instruct_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
instruct_model_text_output = tokenizer.decode(
    instruct_model_outputs[0], skip_special_tokens=True
)

peft_model_outputs = peft_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
peft_model_text_output = tokenizer.decode(
    peft_model_outputs[0], skip_special_tokens=True
)

print(dash_line)
print(f"BASELINE HUMAN SUMMARY:\n{human_baseline_summary}")
print(dash_line)
print(f"ORIGINAL MODEL:\n{original_model_text_output}")
print(dash_line)
print(f"INSTRUCT MODEL:\n{instruct_model_text_output}")
print(dash_line)
print(f"PEFT MODEL: {peft_model_text_output}")

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is crazy for Trump and voted for him. #Person2# doesn't agree with #Person1# on Trump and will vote for Biden.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Pord1: I am proud to say that he is our President. #Pord2: Did you vote for him? #Pord1: I know, because I know that he is our President. #Pord2: I am pretty sure he will make America great again! #Pord3: I am sure he will make America great again! #Pord4: I am pretty sure he will make America great again! #Pord5: I am sure he will make America great again!
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# and #Person2# are talking about the future of the President. #Person2# has nothing but faith in Trump and #Person1# will vote for Biden.
---------------

<a name='3.4'></a>
### 3.4 - Evaluating the Model Quantitatively (with ROUGE Metric)

I will run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [75]:
dialogues = dataset["test"][0:10]["dialogue"]
human_baseline_summaries = dataset["test"][0:10]["summary"]

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    original_model_text_output = tokenizer.decode(
        original_model_outputs[0], skip_special_tokens=True
    )

    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0], skip_special_tokens=True
    )

    peft_model_outputs = peft_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    peft_model_text_output = tokenizer.decode(
        peft_model_outputs[0], skip_special_tokens=True
    )

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(
    zip(
        human_baseline_summaries,
        original_model_summaries,
        instruct_model_summaries,
        peft_model_summaries,
    )
)

df = pd.DataFrame(
    zipped_summaries,
    columns=[
        "human_baseline_summaries",
        "original_model_summaries",
        "instruct_model_summaries",
        "peft_model_summaries",
    ],
)
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Employees: This memo should go out to all empl... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | Employees are being directed to take a dictati... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: Thank you for your dictation. | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | People are complaining about the traffic in th... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | People are talking about how they're going to ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person1#: I'm going to take the subway to wor... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. #Person1:... | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are having a divorce. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 9 | #Person1# and Brian are at the birthday party ... | Brian is having his birthday party. | Brian's birthday is coming. #Person1# invites ... | Brian remembers his birthday and invites #Pers... |


In [76]:
rouge = evaluate.load("rouge")

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0 : len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0 : len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0 : len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)
print("PEFT MODEL:")
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.27335027696318015, 'rouge2': 0.0730783699059561, 'rougeL': 0.22176129080628232, 'rougeLsum': 0.22464092940323838}
INSTRUCT MODEL:
{'rouge1': 0.41026607717457186, 'rouge2': 0.17840645241958838, 'rougeL': 0.2977022096267017, 'rougeLsum': 0.2987374187518165}
PEFT MODEL:
{'rouge1': 0.3725351062275605, 'rouge2': 0.12138811933618107, 'rougeL': 0.27620639623170606, 'rougeLsum': 0.2758134870822362}


Notice that the PEFT model's scores fall between those of the original model and the fully fine-tuned model, which is a good result considering how much lighter the training process was!

I already loaded the full-dataset results from the `data/dialogue-summary-training-results.csv` file. Now I'll read the PEFT model's summaries from the same results and compare its ROUGE scores with those of the other models.

In [77]:
human_baseline_summaries = results["human_baseline_summaries"].values
original_model_summaries = results["original_model_summaries"].values
instruct_model_summaries = results["instruct_model_summaries"].values
peft_model_summaries = results["peft_model_summaries"].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0 : len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0 : len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0 : len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)
print("PEFT MODEL:")
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
PEFT MODEL:
{'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712, 'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}


The results show that PEFT improves on the original model by less than full fine-tuning does. However, the benefits of PEFT generally outweigh the modest drop in these metrics.

Here's how to calculate the improvement of PEFT over the original model:

In [78]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = np.array(list(peft_model_results.values())) - np.array(
    list(original_model_results.values())
)
for key, value in zip(peft_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


Now I will calculate the difference between the PEFT model and the fully fine-tuned model (negative values mean PEFT scores lower):

In [79]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = np.array(list(peft_model_results.values())) - np.array(
    list(instruct_model_results.values())
)
for key, value in zip(peft_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here we can observe a slight decrease in the ROUGE metrics (about 1.3 to 1.7 percentage points) compared with full fine-tuning. However, PEFT training demands significantly fewer computing and memory resources, often requiring only a single GPU.
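
One way to summarize the trade-off is the share of the full fine-tuning gain that PEFT recovers. The sketch below uses the full-dataset scores printed above, rounded to four decimals. The variable names are illustrative.

```python
# Full-dataset ROUGE scores copied from the outputs above (rounded to 4 decimals).
original = {"rouge1": 0.2334, "rouge2": 0.0760, "rougeL": 0.2015, "rougeLsum": 0.2015}
instruct = {"rouge1": 0.4216, "rouge2": 0.1804, "rougeL": 0.3384, "rougeLsum": 0.3384}
peft = {"rouge1": 0.4081, "rouge2": 0.1633, "rougeL": 0.3251, "rougeLsum": 0.3249}

recovered = {}
for key in original:
    full_gain = instruct[key] - original[key]  # gain from full fine-tuning
    peft_gain = peft[key] - original[key]      # gain from PEFT/LoRA
    recovered[key] = peft_gain / full_gain
    print(f"{key}: PEFT recovers {recovered[key]:.0%} of the full fine-tuning gain")
```

With these numbers, PEFT recovers roughly 84% to 93% of the improvement while training only a small fraction of the model's parameters.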

## Conclusion

In this study, we compared three variants of FLAN-T5 on dialogue summarization (the original model, a fully fine-tuned model, and a PEFT/LoRA model), scoring their outputs against human-written summaries. The PEFT model shows a small decline in ROUGE scores relative to full fine-tuning, but it requires significantly fewer computational resources. This matters in resource-constrained environments where access to powerful computing infrastructure is limited. Training on a single GPU without a substantial loss in output quality makes such models more scalable and accessible across sectors. This balance between performance and computational demand is important for advancing AI while remaining mindful of economic and environmental considerations.

## References

LLM: https://www.databricks.com/resources/ebook/tap-full-potential-llm

PEFT: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html

GEN AI: https://writer.com/guides/generative-ai/

Transformer: https://blogs.nvidia.com/blog/what-is-a-transformer-model/

FLAN T5: https://www.datacamp.com/tutorial/flan-t5-tutorial

## License 

Copyright (c) 2024 Vinay Jogani

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.