# Text Summarization - Worked Examples

Author: Yamini Manral (manral.y@northeastern.edu)

## Lab 1 - Generative AI Use Case: Summarize Dialogue

Welcome to the practical side of this course. In this lab we will perform dialogue summarization using generative AI. We will explore how the input text affects the model's output and use prompt engineering to steer the model toward the task we need. By comparing zero shot, one shot, and few shot inference, we will take a first step into prompt engineering and see how it can improve the output of Large Language Models.

### 1 - Set up Kernel and Required Dependencies

In [5]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0  --quiet

# %pip install \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet

# # Installing the Reinforcement Learning library directly from github.
# %pip install git+https://github.com/lvwerra/trl.git@25fa1bd    

In [6]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

**Problems encountered here:**

The `datasets` package had not been upgraded. Running the following command fixed it:
`pip install -U datasets`

### 2 - Summarize Dialogue without Prompt Engineering

In this use case, we will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. The list of available models in the Hugging Face `transformers` package can be found [here](https://huggingface.co/docs/transformers/index). 

Let's load some simple dialogues from the [Samsum](https://huggingface.co/datasets/samsum) Hugging Face dataset. This dataset contains roughly 16,000 messenger-style dialogues with corresponding manually written summaries.

**Changes:** Changed the dialog dataset to [samsum](https://huggingface.co/datasets/samsum) 

In [7]:
huggingface_dataset_name = "samsum"

dataset = load_dataset(huggingface_dataset_name)

Print a couple of dialogues with their baseline summaries.

In [8]:
example_indices = [50, 400]

dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to get to know

Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method. 

In [9]:
model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

To perform encoding and decoding, you need to work with text in a tokenized form. **Tokenization** is the process of splitting text into smaller units that can be processed by LLMs.

Download the tokenizer for the FLAN-T5 model using the `AutoTokenizer.from_pretrained()` method. The `use_fast` parameter switches on the fast tokenizer. At this stage, there is no need to go into the details of that, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [10]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Test the tokenizer by encoding and decoding a simple sentence:

In [11]:
sentence = "Are your bringing him over tonight"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([1521,   39,    3, 3770,  376,  147, 8988,    1])

DECODED SENTENCE:
Are your bringing him over tonight
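
The encoded tensor above ends in `1`, which is the id T5-style tokenizers use for the end-of-sequence token `</s>`; `skip_special_tokens=True` drops it again on decoding. The round trip can be mimicked with a toy word-level tokenizer. This is only an illustrative sketch: the real FLAN-T5 tokenizer works on subword pieces, and the helper names `build_vocab`, `encode`, and `decode` are made up for this example.

```python
# Toy word-level tokenizer: an illustration only, not the real subword tokenizer.
EOS_ID = 1  # T5-style tokenizers use id 1 for the end-of-sequence token </s>
PAD_ID = 0  # and id 0 for padding


def build_vocab(words):
    # Reserve ids 0 and 1 for the special tokens, then number words from 2.
    vocab = {"<pad>": PAD_ID, "</s>": EOS_ID}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    # Map each whitespace-separated word to its id and append EOS.
    return [vocab[word] for word in text.split()] + [EOS_ID]


def decode(ids, vocab, skip_special_tokens=True):
    # Invert the vocabulary and optionally drop the special tokens.
    id_to_word = {i: w for w, i in vocab.items()}
    special = {PAD_ID, EOS_ID}
    words = [id_to_word[i] for i in ids
             if not (skip_special_tokens and i in special)]
    return " ".join(words)


sentence = "Are your bringing him over tonight"
vocab = build_vocab(sentence.split())
ids = encode(sentence, vocab)

print(ids)
print(decode(ids, vocab))
print(decode(ids, vocab, skip_special_tokens=False))
```

The last line shows what the decoded text looks like when the end-of-sequence marker is kept.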


Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. **Prompt engineering** is the practice of changing the **prompt** (input) to improve the response for a given task.

In [12]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to get to know h

You can see that the model's guesses make some sense, but it doesn't seem to be sure what task it is supposed to accomplish. It seems to just make up the next sentence in the dialogue. Prompt engineering can help here.

### 3 - Summarize Dialogue with an Instruction Prompt

#### 3.1 - Zero Shot Inference with an Instruction Prompt

In order to instruct the model to perform a task - summarize a dialogue - we can take the dialogue and convert it into an instruction prompt. This is often called **zero shot inference**. Wrap the dialogue in a descriptive instruction and see how the generated text will change:

In [13]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds J

**Observation:** Even though the model is able to understand and summarize parts of the conversation, it still does not pick up on the nuance of the conversation.

##### ***Exercise:***

- Experiment with the `prompt` text and see how the inferences change. Will the inferences change if you end the prompt with an empty string instead of `Summary: `?
- Try to rephrase the beginning of the `prompt` text from `Summarize the following conversation.` to something different - and see how it will influence the generated output.
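
One way to keep these experiments organized is a small helper that builds each prompt variant from the same pieces. The sketch below is not part of the original lab: `build_prompt` and the sample dialogue are hypothetical, and the whitespace layout is simplified compared with the triple-quoted prompts used in the cells.

```python
def build_prompt(dialogue,
                 instruction="Summarize the following conversation.",
                 suffix="Summary:"):
    # Join instruction, dialogue and (optional) suffix with blank lines.
    parts = [instruction, dialogue]
    if suffix:
        parts.append(suffix)
    return "\n\n".join(parts) + "\n"


# Made-up dialogue, used only to show the prompt shapes.
sample_dialogue = "Ann: Lunch at noon?\nBob: Sure, see you then."

variants = {
    "with_suffix": build_prompt(sample_dialogue),
    "no_suffix": build_prompt(sample_dialogue, suffix=""),
    "rephrased": build_prompt(
        sample_dialogue,
        instruction="Write a short summary for the given conversation:",
    ),
}

for name, prompt in variants.items():
    print(f"--- {name} ---")
    print(prompt)
```

Each variant can then be passed to `tokenizer(...)` and `model.generate(...)` exactly as in the cells of this section, which makes it easier to compare outputs side by side.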

In [14]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}


    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!


    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pret

**Changes:** Replacing the `Summary:` suffix with nothing does not change the generated output.

In [15]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Write a short summary for the given conversation:

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Write a short summary for the given conversation:

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:


**Changes:** Rephrasing the instruction does not seem to change anything either. The output for zero-shot inference remains the same.

#### 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5

Let's use a slightly different prompt. FLAN-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/tree/main/flan/v2). In the following code, we will use one of the [pre-built FLAN-T5 prompts](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py):

In [16]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
        
    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

Nick: You look absolutely gorgeous and have a lovely smile. 
Nick: Would love to get to know you a bit more. How about we meet up for a drink sometime?
Jane: Hmmm... You're shooting a bit above your range aren't you?
Nick: Why would you think that hon?
Jane: Because I'm not that desperate.
Nick: That was a bit below the belt.
Nick: You're nice but you're not THAT hot.
Jane: Oh is your poor little dick shriveling at the thought?
Nick: Actually I'll take it back. Forget about the drink.
Nick: Forget I ever wrote to you.
Jane: Bye loser!
Nick: Fucking bitch!
Jane: You're welcome!

What was going on?

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites

**Observation:** No noticeable change from introducing the FLAN-T5 prompt template. This is what you will try to solve with few shot inference.

### 4 - Summarize Dialogue with One Shot and Few Shot Inference

**One shot and few shot inference** are the practices of providing an LLM with one or more full examples of prompt-response pairs that match your task, placed before the actual prompt that you want completed. This is called "in-context learning" and puts your model into a state that understands your specific task.


#### 4.1 - One Shot Inference

Let's build a function that takes a list of `example_indices_full`, generates a prompt with full examples, then at the end appends the prompt which we want the model to complete (`example_index_to_summarize`).  We will use the FLAN-T5 prompt template.

In [17]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
    
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']
    
    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""
        
    return prompt

Construct the prompt to perform one shot inference:

In [18]:
example_indices_full = [80]
example_index_to_summarize = 250

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

Ryan: I have a bad feeling about this
Ryan: <file_other>
Sebastian: Ukraine...
Sebastian: This russian circus will never end...
Ryan: I hope the leaders of of nations will react somehow to this shit.
Sebastian: I hope so too :(

What was going on?
Ryan and Sebastian are worried about the political situation in Ukraine.



Dialogue:

Shaldona: WE ARE GONNA GET MARRIED ❤️❤️
Shaldona: <file_others>
Shaldona: This is our mobile inviation for our wedding.
Shaldona: Invitation*
Piper: Hey. You haven’t sent me any messages for a few years.
Piper: And now you are sending me your wedding invitation 
Piper: THROUGH MESSENGER?
Shaldona: .....
Shaldona: Well..
Shaldona: I had no enough time to meet everybody and give this in person.
Shaldona: Hope you understand.
Piper: If you don't have time to give the invitation card in person but expect people go to your wedding
Piper: Shaldona, if so, you are too greedy.

What was going on?



Now pass this prompt to perform the one shot inference:

In [19]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Shaldona sends mobile invitations to her wedding, as she has no time to give them in person.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
Shaldona and Piper are getting married. Shaldona hasn't sent Piper messages for a few years. Piper is worried about Shaldona's wedding invitation.


**Observation:** One-shot inference works noticeably better: the summary is more detailed, though not fully accurate (the model states that Shaldona and Piper are getting married).

#### 4.2 - Few Shot Inference



Let's explore few shot inference by adding two more full dialogue-summary pairs to our prompt.

In [20]:
example_indices_full = [70, 100, 200]
example_index_to_summarize = 260

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

Ali: I think I left my wallet at your place yesterday. Could you check? 
Mohammad: Give me a sec, I'll have a look around my room.
Ali: OK.
Mohammad: Found it!
Ali: Phew, I don't know what I'd do if it wasn't there. Can you bring it to uni tomorrow?
Mohammad: Sure thing.

What was going on?
Ali left his wallet at Mohammad's place. Mohammad'll bring it to uni tomorrow.



Dialogue:

Chris: Hi there! Where are you? Any chance of skyping?
Rick: Hi! Our last two days in Cancun before flying to Havana. Yeah, skyping is an idea. When would it suit you?
Rick: We don't have the best of connections in the room but I can get you pretty well in the lobby.
Chris: What's the time in your place now?
Rick: 6:45 pm
Chris: It's a quarter to one in the morning here. Am still in front of the box.
Rick: Gracious me! Sorry mate. You needn't have answered.
Chris: 8-D
Rick: Just tell me when we could skype.
Chris: Preferably in the evening. Just a few hours earlier than now. And not tomorrow.
Ric

Now pass this prompt to perform a few shot inference:

In [21]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (697 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Debbie can't decide between buying a red dress and a green one. On Kelly and Denise's advice she will buy the green one. Kelly is considering buying the red one for herself.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Debbie is looking for a red dress. Kelly recommends the green dress. Kelly is considering buying the red one for herself.


In this case, few shot inference did not provide much of an improvement over one shot inference. Anything above 5 or 6 shots will typically not help much either.

However, we can see that feeding in at least one full example (one shot) provides the model with more information and qualitatively improves the summary overall.

##### ***Exercise:***

Experiment with the few shot inferencing.
- Choose different dialogues - change the indices in the `example_indices_full` list and `example_index_to_summarize` value.
- Change the number of shots. Be sure to stay within the model's 512-token context length, however.

How well does few shot inferencing work with other examples?
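
The tokenizer warning in the previous run (697 > 512) shows how quickly extra shots use up the context budget. A rough way to stay within it is to add examples greedily until the budget would be exceeded. The following is a sketch under assumptions: `select_shots` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a stand-in. With the real tokenizer, something like `len(tokenizer(text)["input_ids"])` would give the actual count, and summing per-example counts is only an approximation of the full prompt length.

```python
def select_shots(examples, query, count_tokens, max_tokens=512):
    """Greedily keep few-shot examples while the whole prompt fits the budget."""
    used = count_tokens(query)  # the dialogue to summarize must always fit
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if used + cost > max_tokens:
            break  # adding this example would exceed the budget
        kept.append(example)
        used += cost
    return kept, used


def whitespace_count(text):
    # Stand-in counter for illustration; real token counts are usually higher.
    return len(text.split())


# Made-up "examples" of 200 words each and a 100-word query.
examples = ["a " * 200, "b " * 200, "c " * 200]
query = "q " * 100

kept, used = select_shots(examples, query, whitespace_count)
print(f"kept {len(kept)} of {len(examples)} examples, about {used} tokens")
```

Here the third example is dropped because it would push the estimated total past 512.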

Choosing various other dialogs:

In [22]:
example_indices_full = [20, 50, 70, 110]
example_index_to_summarize = 160

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

Deirdre: Hi Beth, how are you love?
Beth: Hi Auntie Deirdre, I'm been meaning to message you, had a favour to ask.
Deirdre: Wondered if you had any thought about your Mum's 40th, we've got to do something special!
Beth: How about a girls weekend, just mum, me, you and the girls, Kira will have to come back from Uni, of course.
Deirdre: Sounds fab! Get your thinking cap on, it's only in 6 weeks! Bet she's dreading it, I remember doing that!
Beth: Oh yeah, we had a surprise party for you, you nearly had a heart attack! 
Deirdre: Well, it was a lovely surprise! Gosh, thats nearly 4 years ago now, time flies! What was the favour, darling?
Beth: Oh, it was just that I fancied trying a bit of work experience in the salon, auntie.
Deirdre: Well, I am looking for Saturday girls, are you sure about it? you could do well in the exams and go on to college or 6th form.
Beth: I know, but it's not for me, auntie, I am doing all foundation papers and I'm struggling with those.
Deirdre: Wh

In [23]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom a message when he will be in taxi. Tom arrived safely without luggages.


**Observation:** Few shot inference with different dialogs does a good job of summarization.

Changing the number of shots from 3 to 4 does not seem to make a lot of changes to the output.
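
Comparisons like these are easier to track with a number attached. The sketch below computes a simple unigram overlap between the human summary and the model output from the run above. It is a simplified stand-in in the spirit of ROUGE-1, not the `rouge_score` implementation: `unigram_overlap` is a hypothetical helper, and it does no stemming, so `luggage` and `luggages` count as different words.

```python
import re
from collections import Counter


def unigram_overlap(reference, candidate):
    # Lowercase and split into word tokens, then count clipped matches.
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Strings copied from the few shot run above.
reference = "Tom arrived safely, but without his luggage."
candidate = ("Alexander will send Tom a message when he will be in taxi. "
             "Tom arrived safely without luggages.")

scores = unigram_overlap(reference, candidate)
print({name: round(value, 3) for name, value in scores.items()})
```

Recall is higher than precision here because the model output contains most of the reference's words plus an extra sentence the reference does not mention.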

### 5 - Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see a different output from the LLM. So far the only parameter that you have been setting was `max_new_tokens=50`, which defines the maximum number of tokens to generate. 

A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class.

##### ***Exercise:***

Change the configuration parameters to investigate their influence on the output. 

Setting `do_sample=True` switches generation from greedy decoding to sampling: the next token is drawn from the probability distribution over the entire vocabulary. You can then shape that distribution with `temperature` and other parameters (such as `top_k` and `top_p`).

Uncomment the lines in the cell below and rerun the code. Try to analyze the results. You can read some comments below.

In [24]:
generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom a message when he will be in taxi. Tom arrived safely without luggages.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



In [25]:
generation_config = GenerationConfig(max_new_tokens=5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander will send Tom 
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



In [26]:
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Alexander asks Tom to inform Alexander when Tom is in taxi. Tom and Alexander are travelling together. Tom arrived safely without luggage.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Tom arrived safely, but without his luggage.



Comments related to the choice of the parameters in the code cell above:
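- Choosing a small value such as `max_new_tokens=10` makes the output text too short, so the dialogue summary is cut off.
- Setting `do_sample=True` and changing the temperature gives more varied output: higher values flatten the next-token distribution, while lower values keep generations close to the greedy result.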
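
To see why temperature has this effect, it helps to look at the arithmetic on a tiny example. The sketch below divides made-up logits by the temperature before applying softmax, and optionally restricts sampling to the `top_k` most likely tokens. It illustrates the idea only and is not the `transformers` implementation; `softmax_with_temperature` and `sample_token` are hypothetical names.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the result.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    # Rank token ids from most to least likely and keep only the top_k, if set.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    weights = [probs[i] for i in ranked]  # choices() renormalizes the weights
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

for t in (0.1, 0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

print(sample_token(logits, temperature=1.0, top_k=2, rng=random.Random(0)))
```

At temperature 0.1 almost all probability mass sits on the highest-scoring token, which is why low-temperature sampling tends to reproduce the greedy output.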

As you can see, prompt engineering can take you a long way for this use case, but there are some limitations. Next, you will start to explore how you can use fine-tuning to help your LLM to understand a particular use case in better depth!

<a name='Lab2'></a>
## Lab 2: Fine-Tune a Generative AI Model for Dialogue Summarization

### 1 - Load Libraries

In [27]:
from transformers import AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np


#### 1.1 - Load the model

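**Changes:** The intent here was to switch to the smaller [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) checkpoint. As written, the cell below still passes `model_name` (`google/flan-t5-base` from Lab 1) to `from_pretrained()`, so the base model is what gets loaded; this matches the parameter count printed further down.

In [28]:
new_model_name='google/flan-t5-small'

# NOTE: `model_name` is still 'google/flan-t5-base' from Lab 1, so these calls load the base checkpoint.
# Pass `new_model_name` instead to load flan-t5-small.
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)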

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that. At this stage, you do not need to go into its details.

In [29]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
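
The parameter count also gives a quick estimate of how much memory the weights alone occupy. The sketch below is simple arithmetic, assuming 2 bytes per parameter for `bfloat16` (the dtype used when loading the model above) and 4 bytes for `float32`. It ignores gradients, optimizer state, and activations, all of which add to the footprint during fine-tuning; `weight_memory_bytes` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_bytes(num_params, dtype):
    # Memory for the weights only: parameter count times bytes per parameter.
    return num_params * BYTES_PER_PARAM[dtype]


num_params = 247_577_856  # parameter count printed above

for dtype in BYTES_PER_PARAM:
    size = weight_memory_bytes(num_params, dtype)
    print(f"{dtype}: {size:,} bytes ({size / 1024**3:.2f} GiB)")
```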



#### 1.2 - Test the Model with Zero Shot Inferencing

Test the model with zero shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [30]:
index = 800

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Linda: Hi Dad, I want to buy flowers for mum! But I don't remember which one she likes :(
Michael: Well, she likes all the flowers I believe
Linda: That doesn't help! I'm on a flower market right now!
Michael: Send me some pics then
Linda: <file_photo> 
Michael: Tulips are nice, roses too
Linda:  What about carnations?
Michael: No, carnations are boring :D
Linda: Thanks Dad, srsly…
Michael:  What about freesias? She likes them a lot, are there any there?
Linda: <file_photo> 
Michael: Take those!

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Linda wants to buy flowers for her mother and asks Michael which flowers does she like. Michael suggests Linda to buy freesias.

---------------------------------------------------------------------------------


### 2 - Perform Full Fine-Tuning


#### 2.1 - Preprocess the Dialog-Summary Dataset

In [31]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains 3 different splits: train, validation, test.
# tokenize_function processes the data of all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary',])

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

To save some time in the lab, you will subsample the dataset:

In [32]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/14732 [00:00<?, ? examples/s]

Filter:   0%|          | 0/819 [00:00<?, ? examples/s]

Filter:   0%|          | 0/818 [00:00<?, ? examples/s]

Check the shapes of all three parts of the dataset:

In [33]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (148, 2)
Validation: (9, 2)
Test: (9, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 148
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 9
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 9
    })
})


The output dataset is ready for fine-tuning.
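
As a quick sanity check on the shapes above: keeping every row whose index satisfies `index % 100 == 0` retains indices 0, 100, 200, and so on, which is `ceil(n / 100)` rows for a split of `n` rows. The sketch below reproduces the counts in plain Python. The split sizes come from the progress bars above, assuming they appear in the same order as the `DatasetDict` printout (train, test, validation); the helper name `subsample_count` is ours, for illustration only.

```python
import math

def subsample_count(n_rows: int, step: int = 100) -> int:
    """Count the indices in range(n_rows) that satisfy index % step == 0."""
    return sum(1 for index in range(n_rows) if index % step == 0)

# Split sizes as reported by the Map/Filter progress bars (order assumed).
split_sizes = {"train": 14732, "test": 819, "validation": 818}

for name, n_rows in split_sizes.items():
    kept = subsample_count(n_rows)
    # The closed form ceil(n / step) gives the same count.
    assert kept == math.ceil(n_rows / 100)
    print(f"{name}: {n_rows} -> {kept} rows")
```

The counts agree with the shapes printed above: 148 training rows and 9 rows each for validation and test.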

#### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

**Changes:** Training a fully fine-tuned version of the model would take a few hours on a GPU. Instead, we download a pre-fine-tuned model, [mrm8488/flan-t5-small-finetuned-samsum](https://huggingface.co/mrm8488/flan-t5-small-finetuned-samsum), to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

In [34]:
instruct_model_name="mrm8488/flan-t5-small-finetuned-samsum"

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [35]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name, torch_dtype=torch.bfloat16)

#### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), compare the summaries from the original model and the fine-tuned model against the human baseline. For this dialogue the two models produce nearly identical summaries, and both miss that Jane rejects the invitation.

In [36]:
index = 50
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Nick finds Jane pretty and invites her for a drink to get to know her better. Jane rejects Nick and is unpleasant to him. Nick suggests Jane to forget about their conversation.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Nick and Jane are going to meet for a drink.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Nick and Jane are going to meet up for a drink.


#### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
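
To make the metric more concrete, here is a minimal sketch of ROUGE-N (n-gram overlap between a prediction and a reference). It is a simplified illustration, not the implementation behind `evaluate.load('rouge')`: it lowercases and splits on whitespace, and it omits details such as stemming (`use_stemmer=True`). The function names and the toy sentences are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: lowercase, whitespace tokenization, no stemming."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy sentences, made up for illustration (not taken from the dataset).
reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(rouge_n(prediction, reference, n=1))  # unigram overlap (ROUGE-1)
print(rouge_n(prediction, reference, n=2))  # bigram overlap (ROUGE-2)
```

Changing a single word leaves five of six unigrams intact but breaks two of the five bigrams, which is why ROUGE-2 scores are usually lower than ROUGE-1 scores for the same summary.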

In [37]:
rouge = evaluate.load('rouge')

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [38]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't... | Amanda can't find Betty's number. Amanda will ... | Betty called Larry last time they were at the ... |
| 1 | Eric and Rob are going to watch a stand-up on ... | Eric and Rob are watching a stand-up. Eric and... | Eric and Rob are watching a show on YouTube. |
| 2 | Lenny can't decide which trousers to buy. Bob ... | Lenny wants to buy two pairs of purple trouser... | Bob will send Lenny photos of the trousers. Le... |
| 3 | Emma will be home soon and she will let Will k... | Emma will be home soon. Will will pick her up. | Emma will pick Will up at the moment. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party.... | Jane lost her calendar. Ollie and Jane have lu... | Jane is in Warsaw. Ollie will bring some sun w... |
| 5 | Hilary has the keys to the apartment. Benjamin... | Hilary and Elliot are meeting at the conferenc... | Benjamin, Hilary and Daniel are meeting for dr... |
| 6 | Payton provides Max with websites selling clot... | Payton likes shopping but he doesn't always bu... | Payton is looking for clothes to buy. Max will... |
| 7 | Rita and Tina are bored at work and have still... | Rita is tired and is not happy at work. | Rita is tired and is tired. Tina is tired. |
| 8 | Beatrice wants to buy Leo a scarf, but he does... | Beatrice is in town, shopping. She has a scarf... | Beatrice is in town. She doesn't have a scarf.... |
| 9 | Eric doesn't know if his parents let him go to... | Eric is coming to Ivan's brother's wedding. Er... | Eric is coming to the wedding. He has a lot to... |


Evaluate the models by computing ROUGE metrics and compare the two sets of scores.

In [39]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.46985436352130705, 'rouge2': 0.22581970994728395, 'rougeL': 0.3816583797939229, 'rougeLsum': 0.3852519615421176}
INSTRUCT MODEL:
{'rouge1': 0.3994572016278605, 'rouge2': 0.14111831257416574, 'rougeL': 0.3108604710116508, 'rougeLsum': 0.3118647194203424}


On this 10-dialogue sample, the ROUGE scores of the instruct model are lower than those of the original model on every metric. Let's move on to the next step.

The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [40]:
results = pd.read_csv("data/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.23337391746914432, 'rouge2': 0.07620718933525607, 'rougeL': 0.2017702446072403, 'rougeLsum': 0.20152238762082608}
INSTRUCT MODEL:
{'rouge1': 0.4214309303597156, 'rouge2': 0.18040230847807043, 'rougeL': 0.33809319137790006, 'rougeLsum': 0.3381026931436848}


The results show substantial improvement in all ROUGE metrics:

In [41]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.81%
rouge2: 10.42%
rougeL: 13.63%
rougeLsum: 13.66%
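
The percentages above are absolute differences in ROUGE score, that is, percentage points, not relative gains. The sketch below shows both views side by side, using the full-dataset scores printed earlier for the original and instruct models. The helper name `improvements` is ours, for illustration only.

```python
# ROUGE scores copied from the full-dataset output above.
original = {
    "rouge1": 0.23337391746914432,
    "rouge2": 0.07620718933525607,
    "rougeL": 0.2017702446072403,
    "rougeLsum": 0.20152238762082608,
}
instruct = {
    "rouge1": 0.4214309303597156,
    "rouge2": 0.18040230847807043,
    "rougeL": 0.33809319137790006,
    "rougeLsum": 0.3381026931436848,
}

def improvements(new: dict, old: dict) -> dict:
    """Per metric: (absolute gain in percentage points, relative gain in percent)."""
    return {
        key: ((new[key] - old[key]) * 100, (new[key] - old[key]) / old[key] * 100)
        for key in new
    }

for key, (absolute, relative) in improvements(instruct, original).items():
    print(f"{key}: +{absolute:.2f} points absolute, +{relative:.1f}% relative")
```

Because the original scores are low, a gain of under 20 points corresponds to a much larger relative gain, so it helps to state which of the two is being reported.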


### 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
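
The idea can be sketched in a few lines of plain Python. LoRA keeps the pretrained weight matrix `W` frozen and learns two small matrices `B` and `A` whose product, scaled by `alpha / r`, is added to `W` when the adapter is combined with the base model. The code below is a toy illustration with made-up sizes and hand-written matrix helpers, not the `peft` library's implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matadd(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def lora_delta(lora_b, lora_a, alpha, rank):
    """Low-rank update (alpha / rank) * B @ A, with B: d_out x r and A: r x d_in."""
    scale = alpha / rank
    return [[scale * value for value in row] for row in matmul(lora_b, lora_a)]

random.seed(0)
d_out, d_in, rank, alpha = 6, 6, 2, 4  # toy sizes, chosen only for illustration

# Frozen pretrained weight: never updated during LoRA training.
w_frozen = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable adapter factors. B starts at zero, so the adapter is initially a no-op.
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]

# At inference time the adapter is combined with the frozen weight.
w_effective = matadd(w_frozen, lora_delta(lora_b, lora_a, alpha, rank))
assert w_effective == w_frozen  # zero-initialised B leaves the model unchanged

full_params = d_out * d_in
adapter_params = rank * (d_in + d_out)
print(f"full matrix: {full_params} params, adapter: {adapter_params} params")
```

With these toy sizes the saving is small. It grows quickly with the matrix size, because the adapter scales with `r * (d_in + d_out)` while the full matrix scales with `d_in * d_out`.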

#### 3.1 - Set up the PEFT/LoRA Model for Fine-Tuning

**Changes:**
Since training a model ourselves is time consuming and needs compute resources, the models chosen here are the base model [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) and the pre-trained PEFT adapter [flan-t5-base-peft-samsum](https://huggingface.co/RohitKeswani/flan-t5-base-peft-samsum).

In [63]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       "RohitKeswani/flan-t5-base-peft-samsum",
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [64]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%
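
The two parameter counts printed in this notebook give a rough size for the adapter itself. The sketch below works through the arithmetic, assuming the base model under the adapter has the same parameter count as the one printed earlier for the original model. The LoRA configuration in the second half (rank 32 on the query and value projections) is a hypothetical one that happens to reproduce the difference; it is not read from the adapter's configuration.

```python
# Parameter counts printed earlier in this notebook.
base_params = 247_577_856   # original model
peft_params = 251_116_800   # base model plus the loaded adapter

adapter_params = peft_params - base_params
print(f"adapter parameters: {adapter_params}")
print(f"share of combined model: {100 * adapter_params / peft_params:.2f}%")

# Hypothetical LoRA configuration (an assumption): rank 32 on the query and value
# projections of every attention block in a T5-base-sized model (hidden size 768,
# 12 encoder layers, 12 decoder layers with self- and cross-attention each).
hidden_size, rank = 768, 32
attention_blocks = 12 + 12 * 2            # encoder self-attn + decoder self/cross-attn
adapted_matrices = attention_blocks * 2   # query and value in each block
params_per_matrix = rank * (hidden_size + hidden_size)  # A is r x d, B is d x r
print(f"hypothetical config: {adapted_matrices * params_per_matrix} parameters")

# bfloat16 stores each parameter in 2 bytes.
bytes_per_param = 2
print(f"base weights:    {base_params * bytes_per_param / 1e6:.0f} MB")
print(f"adapter weights: {adapter_params * bytes_per_param / 1e6:.1f} MB")
```

The adapter accounts for roughly 1.4% of the combined parameter count, in line with the single-digit percentage mentioned above.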


#### 3.2 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences with the original model, the fully fine-tuned model, and the PEFT model.

In [69]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Sam is at work. He finishes at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work at about 9.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Sam is working at 9. Sam will bring him over tonight.
---------------------------------------------------------------------------------------------------
PEFT MODEL: Sam is at work. He finishes at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work at about 9.


The PEFT model's summary is detailed and close to the human baseline. Note that for this dialogue it is identical to the original model's output.

#### 3.3 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time). 

In [71]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't... | Amanda can't find Betty's number. Amanda will ... | Betty called Larry last time they were at the ... | Amanda can't find Betty's number. Amanda will ... |
| 1 | Eric and Rob are going to watch a stand-up on ... | Eric and Rob are watching a stand-up. Eric and... | Eric and Rob are watching a show on YouTube. | Eric and Rob are watching a stand-up. Eric and... |
| 2 | Lenny can't decide which trousers to buy. Bob ... | Lenny wants to buy two pairs of purple trouser... | Bob will send Lenny photos of the trousers. Le... | Lenny wants to buy two pairs of purple trouser... |
| 3 | Emma will be home soon and she will let Will k... | Emma will be home soon. Will will pick her up. | Emma will pick Will up at the moment. | Emma will be home soon. Will will pick her up. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party.... | Jane lost her calendar. Ollie and Jane have lu... | Jane is in Warsaw. Ollie will bring some sun w... | Jane lost her calendar. Ollie and Jane have lu... |
| 5 | Hilary has the keys to the apartment. Benjamin... | Hilary and Elliot are meeting at the conferenc... | Benjamin, Hilary and Daniel are meeting for dr... | Hilary and Elliot are meeting at the conferenc... |
| 6 | Payton provides Max with websites selling clot... | Payton likes shopping but he doesn't always bu... | Payton is looking for clothes to buy. Max will... | Payton likes shopping but he doesn't always bu... |
| 7 | Rita and Tina are bored at work and have still... | Rita is tired and is not happy at work. | Rita is tired and is tired. Tina is tired. | Rita is tired and is not able to concentrate a... |
| 8 | Beatrice wants to buy Leo a scarf, but he does... | Beatrice is in town, shopping. She has a scarf... | Beatrice is in town. She doesn't have a scarf.... | Beatrice is in town, shopping. She has a scarf... |
| 9 | Eric doesn't know if his parents let him go to... | Eric is coming to Ivan's brother's wedding. Er... | Eric is coming to the wedding. He has a lot to... | Eric is coming to Ivan's brother's wedding. Er... |


Compute ROUGE score for this subset of the data. 

In [73]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.46985436352130705, 'rouge2': 0.22581970994728395, 'rougeL': 0.3816583797939229, 'rougeLsum': 0.3852519615421176}
INSTRUCT MODEL:
{'rouge1': 0.3994572016278605, 'rouge2': 0.14111831257416574, 'rougeL': 0.3108604710116508, 'rougeLsum': 0.3118647194203424}
PEFT MODEL:
{'rouge1': 0.47451100961572235, 'rouge2': 0.22690539599006532, 'rougeL': 0.37721480092992143, 'rougeLsum': 0.38095574175069424}


Notice that on this small subset the PEFT model scores slightly higher than the original model on ROUGE-1 and ROUGE-2, and slightly lower on ROUGE-L and ROUGE-Lsum.
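
ROUGE-L can move differently from ROUGE-1 and ROUGE-2 because it is based on the longest common subsequence (LCS) of the two texts, so word order matters. Below is a minimal sketch of that idea. It is a simplified illustration with whitespace tokenization and no stemming, not the implementation used by `evaluate`; the function names and the toy sentences are ours.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of a seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction: str, reference: str) -> dict:
    """Simplified sentence-level ROUGE-L from the LCS of whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    precision = lcs / max(len(pred_tokens), 1)
    recall = lcs / max(len(ref_tokens), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: same words in a different order.
reference = "emma will be home soon"
prediction = "soon emma will be home"
print(rouge_l(prediction, reference))
```

In this toy pair every word of the prediction also appears in the reference, so a unigram-overlap score would be perfect, while the LCS covers only four of the five tokens.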

We already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [74]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.23337391746914432, 'rouge2': 0.07620718933525607, 'rougeL': 0.2017702446072403, 'rougeLsum': 0.20152238762082608}
INSTRUCT MODEL:
{'rouge1': 0.4214309303597156, 'rouge2': 0.18040230847807043, 'rougeL': 0.33809319137790006, 'rougeLsum': 0.3381026931436848}
PEFT MODEL:
{'rouge1': 0.40810554423302325, 'rouge2': 0.16353829312593815, 'rougeL': 0.3250376063319481, 'rougeLsum': 0.32473416304982294}


The PEFT model improves on the original model slightly less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [75]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.33%
rougeLsum: 12.32%


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [77]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.33%
rouge2: -1.69%
rougeL: -1.31%
rougeLsum: -1.34%


Here we see a small decrease in the ROUGE metrics, between 1 and 2 percentage points, compared with the fully fine-tuned model.

## Conclusion

In these labs, we implemented zero-shot, one-shot, and few-shot inference and saw how each affects the output. We also touched on prompt engineering. We used different Google models as our base models and built on them to learn the concepts of in-context learning. We also used an instruct model, also known as an instruction fine-tuned model, and compared the results both manually and quantitatively (with ROUGE scores). Finally, we used PEFT models and observed the differences in the summarization outputs.

## References

- [Generative AI with Large Language Models](https://www.coursera.org/learn/generative-ai-with-llms/home/welcome)
- [samsum](https://huggingface.co/datasets/samsum)
- [RohitKeswani/flan-t5-base-peft-samsum](https://huggingface.co/RohitKeswani/flan-t5-base-peft-samsum)
- [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
- [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
- [mrm8488/flan-t5-small-finetuned-samsum](https://huggingface.co/mrm8488/flan-t5-small-finetuned-samsum)

