# Crash Course in Generative AI Worked Examples

By: Shivani Sahu

# Abstract
In this notebook, we explore several inference techniques, including zero-shot, one-shot, and few-shot prompting, and assess their impact on output quality. We focus on prompt engineering to understand how it affects model responses, using Google's FLAN-T5 models as the base and deepening our understanding of in-context learning along the way. We also introduce the instruct model, a variant fine-tuned to follow instructions, to explore how models can be tailored for particular tasks. Our analysis includes both manual and quantitative evaluation, using ROUGE scores to measure performance. Additionally, we investigate Parameter-Efficient Fine-Tuning (PEFT) and how it affects summarization results. The notebook further covers fine-tuning models to detoxify summaries, using reinforcement learning with feedback and rewards, along with qualitative assessments that highlight differences in model behavior.

# Brief Overview of the Three Weeks of Labs

- Week 1 of the course covers the basics of Generative AI, including model pre-training, the architecture of large language models (LLMs), and the project lifecycle, with a focus on the computational and strategic decisions involved. 
- Week 2 delves into fine-tuning LLMs using prompt datasets to enhance performance and introduces Parameter-Efficient Fine-Tuning (PEFT) to reduce computational costs and prevent catastrophic forgetting.
- Week 3 explores reinforcement learning, specifically RLHF (Reinforcement Learning from Human Feedback), to improve model alignment and performance, and discusses methods to enhance LLM reasoning through chain-of-thought prompting and overcome knowledge cut-offs with advanced information retrieval techniques.

# Lab 1 - Generative AI Use Case: Dialogue Summarization
Welcome to the hands-on component of this course. In this lab, we will tackle the task of summarizing dialogues using generative AI. We'll examine how different input texts influence the model's output and engage in prompt engineering to steer the model toward our desired task. Through comparisons of zero-shot, one-shot, and few-shot inferences, we will begin to explore prompt engineering and discover how it can improve the generative capabilities of Large Language Models.

## 1 - Set up Kernel and Required Dependencies

In [3]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 --quiet

%pip install \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

# Install the reinforcement learning library (trl) directly from GitHub.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd

Note: you may need to restart the kernel to use updated packages.
Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to /private/var/folders/1t/5d1gml0j7gd7ff6tt8s4rzfh0000gn/T/pip-req-build-j_h68r0b
  Running command git clone --filter=blob:none --quiet https://github.com/lvwerra/trl.git /private/var/folders/1t/5d1gml0j7gd7ff6tt8s4rzfh0000gn/T/pip-req-build-j_h68r0b
  Running command git checkout -q 25fa1bd
  Resolved https://github.com/lvwerra/trl.git to commit 25fa1bd
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: trl
  Building wheel for trl (setup.py) ... done
  Created wheel for trl: filename=trl-0.4.2.dev0-py3-none-any.whl size=67534 sha256=7d2c266181197f28

In [4]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

In [6]:
%pip install py7zr

Collecting py7zr
  Downloading py7zr-0.21.0-py3-none-any.whl.metadata (17 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.20.0-cp35-abi3-macosx_10_9_universal2.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.15.10-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.9 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl.metadata (6.3 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (4.0 kB)
Collecting brotli>=1.1.0 (from py7zr)
  Downloading Brotli-1.1.0-cp31

Problems encountered here:

The datasets package was not upgraded by the installs above. Running pip install -U datasets fixed it.

## 2 - Summarize Dialogue without Prompt Engineering
In this use case, we will generate a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. The list of available models is in the Hugging Face transformers documentation.

Let's load some simple dialogues from the SAMSum Hugging Face dataset. This dataset contains about 16,000 dialogues (14,732 train, 819 test, and 818 validation, as the download log below shows) with corresponding manually written summaries.

Changes: Changed the dialog dataset to samsum

In [7]:
huggingface_dataset_name = "samsum"

dataset = load_dataset(huggingface_dataset_name)

Downloading and preparing dataset samsum/samsum to /Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e...


Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Dataset samsum downloaded and prepared to /Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Print a couple of dialogues with their baseline summaries.

In [8]:
example_indices = [50, 400]

dash_line = '-' * 99  # separator line of 99 dashes

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to

In [10]:
model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Downloading config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

To handle encoding and decoding, the text must be in tokenized form. Tokenization breaks text into smaller pieces (tokens) that an LLM can process.

You can download the tokenizer for the FLAN-T5 model by using the AutoTokenizer.from_pretrained() method. Enable the fast tokenizer by setting the use_fast parameter to True. While the specifics of this parameter are not crucial at this point, you can learn more about the tokenizer's parameters in the documentation.

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Downloading tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [12]:
sentence = "Are your bringing him over tonight"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([1521,   39,    3, 3770,  376,  147, 8988,    1])

DECODED SENTENCE:
Are your bringing him over tonight
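
The encode/decode round trip above can be imitated with a toy word-level tokenizer. This is only a sketch for intuition: FLAN-T5 actually uses a SentencePiece subword vocabulary, and the names ToyTokenizer, encode, and decode, as well as the tiny vocabulary, are hypothetical. The one detail borrowed from the real output is that an end-of-sequence id (the trailing 1 in the tensor above) is appended during encoding and dropped again when special tokens are skipped.

```python
class ToyTokenizer:
    """Word-level stand-in for a real subword tokenizer (illustration only)."""

    PAD, EOS, UNK = 0, 1, 2  # same special-token ids that T5 uses
    SPECIAL_NAMES = {0: "<pad>", 1: "</s>", 2: "<unk>"}

    def __init__(self, corpus):
        # Build a vocabulary from whitespace-separated words; ids start after the specials.
        words = sorted({w for text in corpus for w in text.split()})
        self.word_to_id = {w: i + 3 for i, w in enumerate(words)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text):
        # Unknown words map to UNK; an end-of-sequence id is appended.
        ids = [self.word_to_id.get(w, self.UNK) for w in text.split()]
        return ids + [self.EOS]

    def decode(self, ids, skip_special_tokens=True):
        words = []
        for i in ids:
            if i in self.SPECIAL_NAMES:
                if not skip_special_tokens:
                    words.append(self.SPECIAL_NAMES[i])
                continue
            words.append(self.id_to_word[i])
        return " ".join(words)


sentence = "Are your bringing him over tonight"
toy = ToyTokenizer([sentence])
ids = toy.encode(sentence)
print("ENCODED:", ids)
print("DECODED:", toy.decode(ids))
```

A real subword tokenizer differs mainly in that rare words are split into several pieces rather than mapped to a single unknown id.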


Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. Prompt engineering is the practice of changing the prompt (input) to improve the response for a given task.

In [13]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to get to know h

## 3 - Summarize Dialogue with an Instruction Prompt
### 3.1 - Zero Shot Inference with an Instruction Prompt
In order to instruct the model to perform a task - summarize a dialogue - we can take the dialogue and convert it into an instruction prompt. This is often called zero shot inference. Wrap the dialogue in a descriptive instruction and see how the generated text will change:

In [14]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds J

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}


    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

Observation: Even though the model is able to understand and summarize parts of the conversation, it still does not pick up on the nuance of the conversation.

In [16]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Write a short summary for the given conversation:

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Write a short summary for the given conversation:

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:


### 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5
Let's use a slightly different prompt. Many prompt templates have been published for FLAN-T5 for specific tasks. In the following code, we will use one of the pre-built FLAN-T5 prompts:

In [17]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
        
    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

What was going on?

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites
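
The prompt variants tried in this section differ only in the text wrapped around the dialogue, so they can be kept in one place and compared side by side. The sketch below uses nothing beyond plain Python string formatting. The names PROMPT_TEMPLATES and build_prompt are hypothetical helpers, not part of transformers, the sample dialogue is made up, and the template strings follow the f-strings in the cells above except for trailing whitespace.

```python
# Hypothetical registry of the instruction wrappers tried in this section.
PROMPT_TEMPLATES = {
    "none": "{dialogue}",
    "summarize": "\nSummarize the following conversation.\n\n{dialogue}\n\nSummary:\n",
    "short_summary": "\nWrite a short summary for the given conversation:\n\n{dialogue}\n\nSummary:\n",
    "flan_what_was_going_on": "\nDialogue:\n\n{dialogue}\n\nWhat was going on?\n",
}


def build_prompt(dialogue, template_name):
    """Wrap a dialogue in one of the named instruction templates."""
    return PROMPT_TEMPLATES[template_name].format(dialogue=dialogue)


# Made-up dialogue, used only to show what each prompt looks like.
sample_dialogue = "Ann: Lunch at noon?\nBob: Sure, see you then."
for name in PROMPT_TEMPLATES:
    print(f"--- {name} ---")
    print(build_prompt(sample_dialogue, name))
```

Each resulting string could then be tokenized and passed to model.generate() in the same way as the prompts in the cells above.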

## 4 - Summarize Dialogue with One Shot and Few Shot Inference
One shot and few shot inference mean providing an LLM with one or more full examples of prompt-response pairs that match your task, placed before the actual prompt you want completed. This is called "in-context learning": the examples in the prompt show the model your specific task without changing its weights.



### 4.1 - One Shot Inference
Let's build a function that takes a list of example_indices_full, generates a prompt with full examples, then at the end appends the prompt which we want the model to complete (example_index_to_summarize). We will use the FLAN-T5 prompt template.

In [18]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
    
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']
    
    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""
        
    return prompt

In [19]:
example_indices_full = [80]
example_index_to_summarize = 250

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

Ryan: I have a bad feeling about this
Ryan: <file_other>
Sebastian: Ukraine...
Sebastian: This russian circus will never end...
Ryan: I hope the leaders of of nations will react somehow to this shit.
Sebastian: I hope so too :(

What was going on?
Ryan and Sebastian are worried about the political situation in Ukraine.



Dialogue:

Shaldona: WE ARE GONNA GET MARRIED ❤️❤️
Shaldona: <file_others>
Shaldona: This is our mobile inviation for our wedding.
Shaldona: Invitation*
Piper: Hey. You haven’t sent me any messages for a few years.
Piper: And now you are sending me your wedding invitation 
Piper: THROUGH MESSENGER?
Shaldona: .....
Shaldona: Well..
Shaldona: I had no enough time to meet everybody and give this in person.
Shaldona: Hope you understand.
Piper: If you don't have time to give the invitation card in person but expect people go to your wedding
Piper: Shaldona, if so, you are too greedy.

What was going on?



In [20]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Shaldona sends mobile invitations to her wedding, as she has no time to give them in person.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
Shaldona and Piper are getting married. Shaldona hasn't sent Piper messages for a few years. Piper is worried about Shaldona's wedding invitation.


### 4.2 - Few Shot Inference
Let's explore few shot inference by adding two more full dialogue-summary pairs to our prompt.

In [21]:
example_indices_full = [70, 100, 200]
example_index_to_summarize = 260

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

Ali: I think I left my wallet at your place yesterday. Could you check? 
Mohammad: Give me a sec, I'll have a look around my room.
Ali: OK.
Mohammad: Found it!
Ali: Phew, I don't know what I'd do if it wasn't there. Can you bring it to uni tomorrow?
Mohammad: Sure thing.

What was going on?
Ali left his wallet at Mohammad's place. Mohammad'll bring it to uni tomorrow.



Dialogue:

Chris: Hi there! Where are you? Any chance of skyping?
Rick: Hi! Our last two days in Cancun before flying to Havana. Yeah, skyping is an idea. When would it suit you?
Rick: We don't have the best of connections in the room but I can get you pretty well in the lobby.
Chris: What's the time in your place now?
Rick: 6:45 pm
Chris: It's a quarter to one in the morning here. Am still in front of the box.
Rick: Gracious me! Sorry mate. You needn't have answered.
Chris: 8-D
Rick: Just tell me when we could skype.
Chris: Preferably in the evening. Just a few hours earlier than now. And not

In [22]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (697 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Debbie can't decide between buying a red dress and a green one. On Kelly and Denise's advice she will buy the green one. Kelly is considering buying the red one for herself.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Debbie is looking for a red dress. Kelly recommends the green dress. Kelly is considering buying the red one for herself.


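In this case, few shot did not provide much of an improvement over one shot inference. Note that this few shot prompt was 697 tokens long, which exceeds the model's 512-token maximum (see the warning above). Anything above 5 or 6 shots will typically not help much, either.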

However, we can see that feeding in at least one full example (one shot) provides the model with more information and qualitatively improves the summary overall.

Exercise:
Experiment with the few shot inferencing.

- Choose different dialogues: change the indices in the example_indices_full list and the example_index_to_summarize value.
- Change the number of shots. Be sure to stay within the model's 512-token context length, however.

How well does few shot inferencing work with other examples?
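
One way to stay inside the context window is to add examples only while the prompt still fits. The sketch below is a rough illustration, not the lab's method: fit_shots and count_tokens_roughly are hypothetical names, the example blocks are made up, and the whitespace-based counter is a crude stand-in for the real tokenizer, which generally produces more tokens than there are words. With the FLAN-T5 tokenizer loaded, a counter such as lambda text: len(tokenizer(text)["input_ids"]) could be passed in instead.

```python
def count_tokens_roughly(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_shots(example_blocks, final_block, max_tokens=512,
              count_tokens=count_tokens_roughly):
    """Keep as many leading examples as fit, always keeping the final block.

    Returns the assembled prompt and the number of examples used.
    """
    used = []
    budget = max_tokens - count_tokens(final_block)
    for block in example_blocks:
        cost = count_tokens(block)
        if cost > budget:
            break  # adding this example would overflow the context window
        used.append(block)
        budget -= cost
    return "".join(used) + final_block, len(used)


# Made-up blocks in the "Dialogue / What was going on?" layout used above.
examples = [
    "\nDialogue:\n\nA: Hi\nB: Hello\n\nWhat was going on?\nA and B greet each other.\n\n\n",
    "\nDialogue:\n\nC: Late again?\nD: Sorry!\n\nWhat was going on?\nD is late.\n\n\n",
]
query = "\nDialogue:\n\nE: Coffee?\nF: Yes please.\n\nWhat was going on?\n"

# A deliberately tiny budget so that only some of the examples fit.
prompt, shots = fit_shots(examples, query, max_tokens=30)
print(f"shots used: {shots}")
print(prompt)
```

The helper stops at the first example that does not fit, so the examples that survive are always a prefix of the list, in their original order.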

Choosing various other dialogs:

In [23]:
example_indices_full = [20, 50, 70, 110]
example_index_to_summarize = 160

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

Deirdre: Hi Beth, how are you love?
Beth: Hi Auntie Deirdre, I'm been meaning to message you, had a favour to ask.
Deirdre: Wondered if you had any thought about your Mum's 40th, we've got to do something special!
Beth: How about a girls weekend, just mum, me, you and the girls, Kira will have to come back from Uni, of course.
Deirdre: Sounds fab! Get your thinking cap on, it's only in 6 weeks! Bet she's dreading it, I remember doing that!
Beth: Oh yeah, we had a surprise party for you, you nearly had a heart attack! 
Deirdre: Well, it was a lovely surprise! Gosh, thats nearly 4 years ago now, time flies! What was the favour, darling?
Beth: Oh, it was just that I fancied trying a bit of work experience in the salon, auntie.
Deirdre: Well, I am looking for Saturday girls, are you sure about it? you could do well in the exams and go on to college or 6th form.
Beth: I know, but it's not for me, auntie, I am doing all foundation papers and I'm struggling with those.
D

In [24]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom a message when he will be in taxi. Tom arrived safely without luggages.


## 5 - Configuring Generation Parameters for Inference
You can alter the configuration parameters of the generate() method to vary the output of the large language model. Previously, you've primarily adjusted the max_new_tokens=50 parameter, which limits the number of tokens the model can generate.

The GenerationConfig class is a useful way to organize these parameters.

Exercise:
Modify the configuration parameters to explore their impact on the output.

Setting do_sample=True switches generate() from greedy decoding, which always picks the most likely next token, to sampling the next token from the probability distribution over the vocabulary. You can then adjust the output with parameters like temperature, top_k, and top_p.
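
To see what these parameters do to the next-token distribution, here is a small self-contained sketch in plain Python. It illustrates the general idea of temperature scaling followed by top-k and top-p (nucleus) filtering and is not the code path inside generate(). The function name sample_next_token and the toy logits are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id from raw scores after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample among the surviving tokens; choices() treats the weights as relative.
    token_ids = [i for _, i in kept]
    token_weights = [p for p, _ in kept]
    return rng.choices(token_ids, weights=token_weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
for temperature in (0.1, 0.5, 1.0):
    draws = [sample_next_token(toy_logits, temperature, rng=rng) for _ in range(1000)]
    counts = [draws.count(token_id) for token_id in range(len(toy_logits))]
    print(f"temperature={temperature}: counts per token id = {counts}")
```

At low temperature nearly every draw lands on the highest-scoring token, which is why low-temperature sampling behaves much like greedy decoding, while higher temperatures spread the draws across more tokens.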

Uncomment the lines in the cell below and run the code again. Try to analyze the results. Below are some comments for guidance.

In [25]:
generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom a message when he will be in taxi. Tom arrived safely without luggages.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



In [26]:
generation_config = GenerationConfig(max_new_tokens=5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom 
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



In [27]:
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will be in a taxi, has received the taxi confirmation below. He arrived safely not even without luggages.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



Comments related to the choice of the parameters in the code cell above:

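- A small max_new_tokens value (10 in the commented-out line, 5 in the second cell above) makes the output too short, so the dialogue summary is cut off.
- Setting do_sample=True and changing the temperature value gives more varied output. With temperature=1.0, the sampled summary above is less coherent than the one produced without sampling.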
As you can see, prompt engineering can take you a long way for this use case, but there are some limitations. Next, you will start to explore how you can use fine-tuning to help your LLM to understand a particular use case in better depth!




# Lab 2 - Fine-Tune a Generative AI Model for Dialogue Summarization
## 1 - Load Libraries

In [28]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

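### 1.1 - Load the Model
Changes: the intent here was to use the smaller google/flan-t5-small. Note, however, that the cell below defines new_model_name but still passes model_name (google/flan-t5-base from Lab 1) to from_pretrained(), so the base model is what actually gets loaded. The parameter count printed further down (about 248 million) is that of flan-t5-base. To load the small model, pass new_model_name instead.

In [30]:
new_model_name='google/flan-t5-small'

# As run, these calls load model_name ('google/flan-t5-base'), not new_model_name.
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)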

You can extract the number of model parameters and determine how many are trainable using the function below. At this stage, detailed understanding of the function is not necessary.

In [31]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


# 1.2 - Evaluating the Model with Zero-Shot Inference
Evaluate the model using zero-shot inference. You'll notice that the model has difficulty summarizing the dialogue as effectively as the baseline summary, but it still extracts some crucial information from the text. This suggests that the model has potential for further fine-tuning specific to the task.

In [32]:
index = 800

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Linda: Hi Dad, I want to buy flowers for mum! But I don't remember which one she likes :(
Michael: Well, she likes all the flowers I believe
Linda: That doesn't help! I'm on a flower market right now!
Michael: Send me some pics then
Linda: <file_photo> 
Michael: Tulips are nice, roses too
Linda:  What about carnations?
Michael: No, carnations are boring :D
Linda: Thanks Dad, srsly…
Michael:  What about freesias? She likes them a lot, are there any there?
Linda: <file_photo> 
Michael: Take those!

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Linda wants to buy flowers for her mother and asks Michael which flowers does she like. Michael suggests Linda to buy freesias.

----------------------------------------------------------------------

# 2 - Perform Full Fine-Tuning
2.1 - Preprocess the Dialog-Summary Dataset

In [33]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains three splits: train, validation, and test.
# With batched=True, tokenize_function processes the data of every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary',])

In [34]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

In [35]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (148, 2)
Validation: (9, 2)
Test: (9, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 148
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 9
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 9
    })
})


The output dataset is ready for fine-tuning.
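The shapes printed above follow directly from the `index % 100 == 0` filter. As a quick sanity check, the sketch below counts how many indices survive that filter for each split size; `rows_kept` is a hypothetical helper, not part of the `datasets` library.

```python
def rows_kept(num_rows, step=100):
    """Count the indices in range(num_rows) that satisfy index % step == 0."""
    return len(range(0, num_rows, step))

# Split sizes as reported for the unfiltered dataset.
split_sizes = {"train": 14732, "validation": 818, "test": 819}

for name, size in split_sizes.items():
    print(name, rows_kept(size))
```

The counts match the 148, 9, and 9 rows shown in the output, which confirms that only about 1% of the data is kept to save time.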

## 2.2 - Fine-Tune the Model with the Preprocessed Dataset
The built-in Hugging Face Trainer class is the usual tool for this step: it takes the preprocessed dataset and a reference to the original model. The remaining training parameters are typically found experimentally and are not covered in detail here.

Changes: Training a fully fine-tuned version of the model would take a few hours on a GPU. Instead, we download an already fine-tuned model, mrm8488/flan-t5-small-finetuned-samsum, and use it in the rest of this notebook. This fully fine-tuned model is also referred to as the instruct model in this lab.

In [36]:
instruct_model_name="mrm8488/flan-t5-small-finetuned-samsum"

Create an instance of the AutoModelForSeq2SeqLM class for the instruct model:

In [38]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name, torch_dtype=torch.bfloat16)

Downloading config.json:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

## 2.3 - Qualitative Evaluation of the Model (Human Assessment)
In many generative AI applications, a qualitative check that asks "Is my model behaving the way it is supposed to?" is a good starting point. In the example below, compare the summaries produced by the original model and the fine-tuned (instruct) model against the human baseline. In this run both models return almost the same one-line summary, and neither captures that Jane rejects Nick.

In [39]:
index = 50
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to get to know her better. Jane rejects Nick and is unpleasant to him. Nick suggests Jane to forget about their conversation.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Nick and Jane are going to meet for a drink.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Nick and Jane are going to meet up for a drink.


## 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The ROUGE metric scores a generated summary by its word and n-gram overlap with a "baseline" summary, typically written by a human. Although it isn't flawless, it serves as an indicator of how much fine-tuning improves summarization.
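To show what the overlap means, here is a simplified sketch of a ROUGE-1 F1 score. It is not the implementation behind `evaluate.load('rouge')`: it assumes whitespace tokenization and lowercasing only, with no stemming, and the function name `rouge1_f1` is made up for this example. The sentences are adapted from the baseline summary seen earlier, with punctuation removed.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "tom arrived safely but without his luggage"
prediction = "tom arrived without luggage"
print(round(rouge1_f1(prediction, reference), 3))
```

Every word of the prediction appears in the reference, so precision is 1.0, but only 4 of the 7 reference words are covered, which pulls the F1 score down. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence.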

In [40]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [41]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't... | Amanda can't find Betty's number. Amanda will ... | Betty called Larry last time they were at the ... |
| 1 | Eric and Rob are going to watch a stand-up on ... | Eric and Rob are watching a stand-up. Eric and... | Eric and Rob are watching a show on YouTube. |
| 2 | Lenny can't decide which trousers to buy. Bob ... | Lenny wants to buy two pairs of purple trouser... | Bob will send Lenny photos of the trousers. Le... |
| 3 | Emma will be home soon and she will let Will k... | Emma will be home soon. Will will pick her up. | Emma will pick Will up at the moment. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party.... | Jane lost her calendar. Ollie and Jane have lu... | Jane is in Warsaw. Ollie will bring some sun w... |
| 5 | Hilary has the keys to the apartment. Benjamin... | Hilary and Elliot are meeting at the conferenc... | Benjamin, Hilary and Daniel are meeting for dr... |
| 6 | Payton provides Max with websites selling clot... | Payton likes shopping but he doesn't always bu... | Payton is looking for clothes to buy. Max will... |
| 7 | Rita and Tina are bored at work and have still... | Rita is tired and is not able to concentrate a... | Rita is tired and is tired. Tina is tired. |
| 8 | Beatrice wants to buy Leo a scarf, but he does... | Beatrice is in town, shopping. She has a scarf... | Beatrice is in town. She doesn't have a scarf.... |
| 9 | Eric doesn't know if his parents let him go to... | Eric is coming to Ivan's brother's wedding. Er... | Eric is coming to the wedding. He has a lot to... |


Evaluate the models by computing ROUGE metrics and compare the two sets of scores:

In [47]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.47028472286610057, 'rouge2': 0.22995235132837163, 'rougeL': 0.37927486414974926, 'rougeLsum': 0.3802753543727778}
INSTRUCT MODEL:
{'rouge1': 0.36376015671847006, 'rouge2': 0.1298066050954753, 'rougeL': 0.2917625018054684, 'rougeLsum': 0.29182423684675857}


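On this 10-dialogue sample, the instruct model scores lower than the original model on every ROUGE metric (for example, rouge1 of about 0.36 versus 0.47). A sample this small is noisy, so the next step evaluates on a larger set.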

The file data/dialogue-summary-training-results.csv contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [50]:
results = pd.read_csv("/Users/shivanisahu/Desktop/ADV_DS_Assignment/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334267762606164, 'rouge2': 0.07583872163969117, 'rougeL': 0.20145533544294464, 'rougeLsum': 0.2013454634200133}
INSTRUCT MODEL:
{'rouge1': 0.42157999366320953, 'rouge2': 0.18024457353812656, 'rougeL': 0.3383623425777854, 'rougeLsum': 0.3382783013380308}


The results show substantial improvement in all ROUGE metrics:

In [51]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.44%
rougeL: 13.69%
rougeLsum: 13.69%
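The figures above are absolute differences between two ROUGE scores, expressed in percentage points. As a small sketch of the distinction, the snippet below contrasts the absolute gain with the relative gain for rouge1, using the two scores printed earlier; the variable names are chosen for this example only.

```python
original_rouge1 = 0.2334267762606164   # ORIGINAL MODEL rouge1 from the larger evaluation
instruct_rouge1 = 0.42157999366320953  # INSTRUCT MODEL rouge1 from the larger evaluation

# Absolute gain: plain difference between the two scores.
absolute_gain = instruct_rouge1 - original_rouge1
# Relative gain: the same difference as a fraction of the starting score.
relative_gain = absolute_gain / original_rouge1

print(f"absolute: {absolute_gain * 100:.2f} points")
print(f"relative: {relative_gain * 100:.1f}%")
```

The absolute gain reproduces the rouge1 line above, while the relative gain is much larger because the original score was low to begin with. It helps to state which of the two is being reported.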


# 3 - Implementing Parameter Efficient Fine-Tuning (PEFT)
Next, we will explore Parameter Efficient Fine-Tuning (PEFT), an alternative to the "full fine-tuning" discussed above. PEFT is a more resource-efficient form of fine-tuning that typically yields results comparable to full fine-tuning.

PEFT is a generic term that covers techniques such as Low-Rank Adaptation (LoRA) and prompt tuning (which is distinct from prompt engineering). In practice, PEFT often refers to LoRA. LoRA fine-tunes a model with significantly fewer computational resources, in some cases a single GPU. The original LLM weights stay unchanged, and training produces only a small set of additional weights, the "LoRA adapter", which is a small fraction of the original model's size (often megabytes rather than gigabytes).

During inference, the LoRA adapter is combined with its original LLM to serve requests. Because the base model is untouched, the same LLM can be reused with multiple LoRA adapters, which saves memory when serving several tasks and use cases.
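The idea of combining an adapter with frozen weights can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the `peft` implementation: the helper names `matmul` and `merge_lora` are hypothetical, the matrices are made-up 3x3 values, and the update rule used is the usual LoRA form, weight + (alpha / rank) * B A.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) without changing weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy frozen 3x3 weight and a rank-1 adapter (values are made up).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_b = [[0.5], [0.0], [-0.5]]   # shape 3 x 1
lora_a = [[1.0, 2.0, 3.0]]        # shape 1 x 3

merged = merge_lora(weight, lora_b, lora_a, alpha=1, rank=1)
print(merged)

# The adapter stores 3*1 + 1*3 numbers instead of a second 3*3 matrix.
full_params = 3 * 3
adapter_params = 3 * 1 + 1 * 3
print(full_params, adapter_params)
```

At this toy size the saving is small, but for a d x d matrix the adapter needs 2 * d * r values instead of d * d, so the gap grows quickly once d is in the hundreds and r stays small. Swapping in a different `lora_b` and `lora_a` pair reuses the same frozen `weight`, which is the reuse across tasks described above.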

In [52]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       "RohitKeswani/flan-t5-base-peft-samsum",
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

Downloading adapter_config.json:   0%|          | 0.00/439 [00:00<?, ?B/s]

Downloading adapter_model.bin:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

The number of trainable parameters will be 0 due to is_trainable=False setting:

In [53]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


## 3.2 - Evaluate the Model Qualitatively (Human Evaluation)
Run inference with the original model, the fully fine-tuned (instruct) model, and the PEFT model.

In [54]:
index = 200
dialogue = dataset['test'][index]['dialogue']
# baseline_human_summary = dataset['test'][index]['summary']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Sam is at work. He finishes at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work at about 9.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Sam is working at 9. Sam will bring him over tonight.
---------------------------------------------------------------------------------------------------
PEFT MODEL: Sam is at work. He finishes at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work at about 9.


## 3.3 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [56]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't... | Amanda can't find Betty's number. Amanda will ... | Betty called Larry last time they were at the ... | Amanda can't find Betty's number. Amanda will ... |
| 1 | Eric and Rob are going to watch a stand-up on ... | Eric and Rob are watching a stand-up. Eric and... | Eric and Rob are watching a show on YouTube. | Eric and Rob are watching a stand-up. Eric and... |
| 2 | Lenny can't decide which trousers to buy. Bob ... | Lenny wants to buy two pairs of purple trouser... | Bob will send Lenny photos of the trousers. Le... | Lenny wants to buy two pairs of purple trouser... |
| 3 | Emma will be home soon and she will let Will k... | Emma will be home soon. Will will pick her up. | Emma will pick Will up at the moment. | Emma will be home soon. Will will pick her up. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party.... | Jane lost her calendar. Ollie and Jane have lu... | Jane is in Warsaw. Ollie will bring some sun w... | Jane lost her calendar. Ollie and Jane have lu... |
| 5 | Hilary has the keys to the apartment. Benjamin... | Hilary and Elliot are meeting at the conferenc... | Benjamin, Hilary and Daniel are meeting for dr... | Hilary and Elliot are meeting at the conferenc... |
| 6 | Payton provides Max with websites selling clot... | Payton likes shopping but he doesn't always bu... | Payton is looking for clothes to buy. Max will... | Payton likes shopping but he doesn't always bu... |
| 7 | Rita and Tina are bored at work and have still... | Rita is tired and is not able to concentrate a... | Rita is tired and is tired. Tina is tired. | Rita is tired and is not able to concentrate a... |
| 8 | Beatrice wants to buy Leo a scarf, but he does... | Beatrice is in town, shopping. She has a scarf... | Beatrice is in town. She doesn't have a scarf.... | Beatrice is in town, shopping. She has a scarf... |
| 9 | Eric doesn't know if his parents let him go to... | Eric is coming to Ivan's brother's wedding. Er... | Eric is coming to the wedding. He has a lot to... | Eric is coming to Ivan's brother's wedding. Er... |


Compute ROUGE score for this subset of the data.

In [57]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.47028472286610057, 'rouge2': 0.22995235132837163, 'rougeL': 0.37927486414974926, 'rougeLsum': 0.3802753543727778}
INSTRUCT MODEL:
{'rouge1': 0.36376015671847006, 'rouge2': 0.1298066050954753, 'rougeL': 0.2917625018054684, 'rougeLsum': 0.29182423684675857}
PEFT MODEL:
{'rouge1': 0.47753687688771096, 'rouge2': 0.23028476049778054, 'rougeL': 0.3773752395014854, 'rougeLsum': 0.37961260085429527}


Notice that on this sample the PEFT model scores slightly higher than the original model (flan-t5-base) on rouge1 and rouge2, and slightly lower on rougeL and rougeLsum.

We already computed ROUGE score on the full dataset, after loading the results from the data/dialogue-summary-training-results.csv file. Load the values for the PEFT model now and check its performance compared to other models.

In [58]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334267762606164, 'rouge2': 0.07583872163969117, 'rougeL': 0.20145533544294464, 'rougeLsum': 0.2013454634200133}
INSTRUCT MODEL:
{'rouge1': 0.42157999366320953, 'rouge2': 0.18024457353812656, 'rougeL': 0.3383623425777854, 'rougeLsum': 0.3382783013380308}
PEFT MODEL:
{'rouge1': 0.4080553198406258, 'rouge2': 0.16332717404983593, 'rougeL': 0.3251568978594342, 'rougeLsum': 0.32488871719602286}


Calculate the improvement of PEFT over the original model:

In [59]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.46%
rouge2: 8.75%
rougeL: 12.37%
rougeLsum: 12.35%


In [60]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.69%
rougeL: -1.32%
rougeLsum: -1.34%


Here the PEFT model scores roughly 1.3 to 1.7 percentage points lower than the fully fine-tuned model on each ROUGE metric, a small decrease.
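One way to read the two comparisons together is to ask how much of the full fine-tuning gain the PEFT model keeps. The sketch below computes that ratio from the three ROUGE dictionaries printed above. The "share of gain retained" framing is an illustration added here, not a metric defined by the lab.

```python
# Scores copied from the larger evaluation printed above.
original = {"rouge1": 0.2334267762606164, "rouge2": 0.07583872163969117,
            "rougeL": 0.20145533544294464, "rougeLsum": 0.2013454634200133}
instruct = {"rouge1": 0.42157999366320953, "rouge2": 0.18024457353812656,
            "rougeL": 0.3383623425777854, "rougeLsum": 0.3382783013380308}
peft = {"rouge1": 0.4080553198406258, "rouge2": 0.16332717404983593,
        "rougeL": 0.3251568978594342, "rougeLsum": 0.32488871719602286}

retained = {}
for key in original:
    full_gain = instruct[key] - original[key]   # gain from full fine-tuning
    peft_gain = peft[key] - original[key]       # gain from the PEFT adapter
    retained[key] = peft_gain / full_gain
    print(f"{key}: PEFT keeps {retained[key]:.0%} of the full fine-tuning gain")
```

Seen this way, the adapter recovers most of the improvement on every metric while training only a small fraction of the weights.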

# Lab 3 - Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries

1 - Load libraries

In [5]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator
2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction
You will keep working with the same Hugging Face dataset samsum and the pre-trained model FLAN-T5-BASE.

In [6]:
huggingface_dataset_name = "samsum"

dataset = load_dataset(huggingface_dataset_name)

Found cached dataset samsum (/Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
model_name="google/flan-t5-base"

dataset_original = load_dataset("samsum")

dataset_original

Found cached dataset samsum (/Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

The next step will be to preprocess the dataset. We will take only a part of it, then filter the dialogues of a particular length (just to make those examples long enough and, at the same time, easy to read). Then wrap each dialogue with the instruction and tokenize the prompts. Save the token ids in the field input_ids and decoded version of the prompts in the field query.

We could do all of that step by step in the cell below, but it is a good habit to organize it in a function, build_dataset:

In [8]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """


    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)

Found cached dataset samsum (/Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)
Loading cached processed dataset at /Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-a2c1ba800a8fbe71.arrow
Loading cached processed dataset at /Users/shivanisahu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-c0c08df2e22f6f24.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'query'],
        num_rows: 7851
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'query'],
        num_rows: 1963
    })
})


Prepare a function to pull out the number of model parameters (it is the same as in the previous lab):

In [9]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Add the adapter to the original FLAN-T5 model. In the previous lab, you added the fully trained adapter only for inference, so there was no need to pass a LoRA configuration. Now you need to pass the configuration to the constructed PEFT model and also set is_trainable=True.

In [10]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       'RohitKeswani/flan-t5-base-peft-samsum',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%



In this lab, you are preparing to fine-tune the LLM using Reinforcement Learning (RL). RL will be briefly discussed in the next section of this lab. At this stage, you just need to prepare the Proximal Policy Optimization (PPO) model by passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model.

In [11]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


During PPO, only a small fraction of the parameters will be updated: the LoRA adapter weights counted earlier plus the parameters of the ValueHead. More information about this class of models can be found in the documentation. The number of trainable ValueHead parameters can be computed as $(n+1) \times m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units ($m=1$). The $+1$ term in the equation takes into account the bias term.
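
As a quick sanity check on the counts printed above, the following sketch recomputes the ValueHead size from its layer shape and adds it to the PEFT model's trainable count. It is plain Python with no model loading; the helper name `linear_param_count` is an assumption made for this example (it is not part of any library), and the figures are copied from the cell outputs above.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of parameters in a fully connected layer: (n + 1) * m when a bias is used."""
    return (in_features + (1 if bias else 0)) * out_features

# Shape taken from the printed ValueHead: Linear(in_features=768, out_features=1, bias=True).
value_head_params = linear_param_count(768, 1)

# Trainable count printed for the PEFT model before the value head was attached.
peft_trainable_params = 3_538_944

total_trainable = peft_trainable_params + value_head_params
print(value_head_params)  # 769
print(total_trainable)    # 3539713
```

The total agrees with the trainable parameter count reported for `ppo_model`.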

Now create a frozen copy of the PPO model, which will not be fine-tuned: a reference model. The reference model represents the LLM before detoxification. None of its parameters will be updated during PPO training. This is on purpose.

In [12]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



## 2.2 - Setting Up the Reward Model
Reinforcement Learning (RL) is a branch of machine learning where agents perform actions in an environment to maximize cumulative rewards. The behavior of these agents is directed by a policy, and the primary aim of RL is for the agent to develop an optimal or near-optimal policy that maximizes the reward function.

Previously, the original policy was based on the instruct PEFT model, which is the LLM before undergoing detoxification. Although human labelers could provide feedback on the toxicity levels of the outputs, relying on them throughout the fine-tuning process can be costly. An effective alternative is to employ a reward model that encourages the agent to produce less toxic dialogue summaries. A straightforward strategy would be to utilize sentiment analysis, categorizing outputs into two classes—nothate and hate—and assigning higher rewards for outputs more likely to be classified as nothate.

For this purpose, you will utilize Meta AI’s RoBERTa-based model for detecting hate speech. This model will produce logits, which are then used to calculate probabilities for the two classes: nothate and hate. The logits corresponding to nothate will be considered as positive rewards. The model will subsequently undergo fine-tuning using these reward values with Proximal Policy Optimization (PPO).

To begin, we need to instantiate the necessary RoBERTa model class and load a tokenizer to evaluate the model. In this setup, label 0 represents the class nothate, and label 1 corresponds to the class hate.







In [13]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


In [19]:
# Use CPU if GPU is not available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Move the model to the device
toxicity_model = toxicity_model.to(device)

In [22]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [3.1140999794006348, -2.489616870880127]
probabilities [not hate, hate]: [0.9963293671607971, 0.003670621896162629]
reward (high): [3.1140999794006348]
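
The relationship between the logits and the probabilities printed above can be checked by hand. The sketch below is a minimal, library-free softmax written for illustration (the helper `softmax` is our own, not the PyTorch method used in the cell); the two logits are copied from the output above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above, in the order [nothate, hate].
logits = [3.1140999794006348, -2.489616870880127]
probabilities = softmax(logits)

# The reward is the raw "nothate" logit, not the probability.
not_hate_index = 0
reward = logits[not_hate_index]

print(probabilities)
print(reward)
```

Using the raw logit rather than the probability keeps the reward unbounded, so differences between already-confident predictions are not squashed near 1.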


In [23]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(toxicity_input_ids).logits
# print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')

probabilities [not hate, hate]: [0.2564719319343567, 0.7435280084609985]
reward (low): [-0.6921163201332092]


In [31]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    """
    Generate summaries for samples from the dataset and measure their toxicity.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.

    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        # Stop once num_samples samples have been evaluated.
        if i >= num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        # Sample from the full distribution (top-k disabled, top_p=1.0).
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

# 3 - Perform Fine-Tuning to Detoxify the Summaries
Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).

## 3.1 - Initialize PPOTrainer
For the PPOTrainer initialization, we will need a collator. Here it is a function that turns a list of dictionaries into a single dictionary of lists, one list per key. We can define and test it:

In [30]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


Set up the configuration parameters. Load the ppo_model and the tokenizer. We will also load ref_model, the frozen version of the model. The first model is optimized, while the second serves as a reference for calculating the KL divergence from the starting point. This works as an additional reward signal in PPO training to make sure the optimized model does not deviate too much from the original LLM.
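
To make the role of the reference model concrete, here is a minimal, library-free sketch of how a KL penalty can be folded into the reward. It is a simplified illustration and not the PPO library's actual implementation: the function name `kl_penalized_reward`, the coefficient `kl_coef=0.2`, and all log-probability values are assumptions chosen for the example.

```python
def kl_penalized_reward(score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Subtract a KL-style penalty from the reward model's score.

    The per-token difference log pi(token) - log pi_ref(token), summed over the
    sampled response, is a sample-based estimate of the KL divergence between
    the policy being trained and the frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return score - kl_coef * kl_estimate

# Hypothetical per-token log-probabilities for one three-token response.
ref_logprobs = [-1.0, -2.0, -0.5]

# A policy identical to the reference pays no penalty.
same = kl_penalized_reward(3.0, ref_logprobs, ref_logprobs)

# A policy that has drifted away from the reference is penalized.
drifted = kl_penalized_reward(3.0, [-0.5, -1.0, -0.25], ref_logprobs)

print(same, drifted)
```

With the same reward-model score, the drifted policy ends up with a lower effective reward, which is what discourages the optimized model from moving too far from the original LLM.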

## 3.2 - Fine-Tune the Model
The fine-tuning loop consists of the following main steps:

1. Get the query responses from the policy LLM (PEFT model).
2. Get sentiments for the query/response pairs from the hate speech RoBERTa model.
3. Optimize the policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:

- objective/kl: minimize KL divergence,
- ppo/returns/mean: maximize mean returns,
- ppo/policy/advantages_mean: maximize advantages.


In [None]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # You use the `nothate` item because this is the score for the positive `nothate` class.
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))


## 3.3 - Evaluate the Model Qualitatively
Let's inspect some examples from the test dataset. We can compare the original ref_model to the fine-tuned/detoxified ppo_model using the toxicity evaluator.

In [None]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]
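
One way to read the collected results is to compute, for each example, how much the reward changed after fine-tuning and then sort by that change. The sketch below assumes `compare_results` holds the keys built in the cell above; `rank_by_reward_gain` is a hypothetical helper, and the two rows are made-up placeholder values for illustration, not outputs of this notebook.

```python
def rank_by_reward_gain(compare_results):
    """Return one row per example, sorted by reward gain (largest first)."""
    rows = []
    for query, before, after, r_before, r_after in zip(
        compare_results["query"],
        compare_results["response_before"],
        compare_results["response_after"],
        compare_results["reward_before"],
        compare_results["reward_after"],
    ):
        rows.append({
            "query": query,
            "response_before": before,
            "response_after": after,
            "reward_diff": r_after - r_before,
        })
    return sorted(rows, key=lambda row: row["reward_diff"], reverse=True)

# Placeholder data for illustration only.
example = {
    "query": ["dialogue A", "dialogue B"],
    "response_before": ["summary A0", "summary B0"],
    "response_after": ["summary A1", "summary B1"],
    "reward_before": [1.0, 2.0],
    "reward_after": [1.5, 3.0],
}

ranked = rank_by_reward_gain(example)
print([(row["query"], row["reward_diff"]) for row in ranked])
```

Reading the top and bottom rows side by side shows where detoxification helped most and where the fine-tuned summaries scored lower than before.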

# Conclusion 

Throughout the labs, we explored the implementation of zero-shot, one-shot, and few-shot inference, observing how these methods impact model output. We also delved into prompt engineering, using various Google models as foundational models to enhance our understanding of in-context learning. Additionally, we worked with instruct models, also known as instruction fine-tuned models, and assessed their performance both manually and quantitatively using ROUGE scores. We further explored Parameter Efficient Fine-Tuning (PEFT) models, noting the variations in summarization outputs. Our experiments with reinforcement learning for detoxifying summaries by utilizing feedback and rewards allowed us to qualitatively measure the improvement in outputs.

In summary, the labs provided an extensive overview of advanced techniques for enhancing the capabilities of language models. From prompt engineering and instruction fine-tuning to PEFT and reinforcement learning, these sessions equipped us with a diverse set of tools to customize and optimize language models for specific applications, highlighting the broad applicability and adaptability of contemporary natural language processing techniques.

## References

- **Generative AI with Large Language Models**: [https://www.coursera.org/learn/generative-ai-with-llms](https://www.coursera.org/learn/generative-ai-with-llms)
- **SAMSum Dataset**: [https://huggingface.co/datasets/samsum](https://huggingface.co/datasets/samsum)
- **RohitKeswani/flan-t5-base-peft-samsum**: [https://huggingface.co/RohitKeswani/flan-t5-base-peft-samsum](https://huggingface.co/RohitKeswani/flan-t5-base-peft-samsum)
- **google/flan-t5-small**: [https://huggingface.co/google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
- **google/flan-t5-base**: [https://huggingface.co/google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
- **mrm8488/flan-t5-small-finetuned-samsum**: [https://huggingface.co/mrm8488/flan-t5-small-finetuned-samsum](https://huggingface.co/mrm8488/flan-t5-small-finetuned-samsum)
