### Introduction

This project focuses on **dialogue summarization**, which involves generating concise and coherent summaries of multi-turn conversations. By leveraging pre-trained language models like **FLAN-T5**, we aim to capture the essential points, intentions, and context of dialogues, ensuring the summary is both informative and contextually accurate. The project explores techniques such as **zero-shot**, **one-shot** and **few-shot** prompting to guide the model in summarizing conversations without fine-tuning, using carefully crafted prompts. We also evaluate different configurations of the model's generation parameters to optimize the quality and relevance of the summaries, addressing challenges like handling long conversations and nuanced context.

### Library installations and importing

First, install the required libraries. The pinned versions below may produce dependency warnings, but they work together - <br/>
pip install torch==1.13.1 <br/>
pip install torchdata==0.5.1 <br/>
pip install datasets==2.17.0 <br/>
pip install transformers==4.27.2 <br/>
You might also need py7zr, which the samsum loader uses to unpack its archive: pip install py7zr <br/>
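
As a quick sanity check after installation, the sketch below compares the installed versions against the pins listed above. It uses only the standard library; the helper names `installed_version` and `check_pins` are hypothetical and not part of any of the installed packages.

```python
from importlib import metadata

# Pinned versions copied from the install commands above.
PINNED = {
    "torch": "1.13.1",
    "torchdata": "0.5.1",
    "datasets": "2.17.0",
    "transformers": "4.27.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package name to (expected, found, matches)."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        # Ignore local build suffixes such as "+cu117" when comparing.
        matches = found is not None and found.split("+")[0] == expected
        report[package] = (expected, found, matches)
    return report


for package, (expected, found, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "mismatch"
    print(f"{package}: expected {expected}, found {found} ({status})")
```

A mismatch is not necessarily a problem; it only signals that results may differ from the ones shown in this notebook.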




The next step is to import the necessary libraries. First, we import the `load_dataset` function from the `datasets` module. It loads datasets from the Hugging Face Hub or from local files in formats such as JSON and CSV, which we need for predefined datasets like samsum as well as for custom dialogue summarization data. Then we import `AutoModelForSeq2SeqLM` from `transformers`, a generic class for sequence-to-sequence (seq2seq) models used for tasks such as summarization, translation, or question answering; it simplifies loading pre-trained seq2seq models from the Hugging Face Hub. Next we import `AutoTokenizer`, which loads the tokenizer that matches the pre-trained model and converts raw text into token IDs the model can process. Finally, we import `GenerationConfig`, a utility for configuring text generation parameters (e.g., maximum new tokens, beam search), so the generation process can be customized when producing summaries.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig


### Summarize Dialogue without Prompt Engineering

First, we will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. <br/> 

We will use two datasets - <br/>

The samsum dataset consists of dialogues and corresponding human-written summaries. It is commonly used for testing and fine-tuning dialogue summarization models. <br/>

The knkarthick/dialogsum dataset on Hugging Face is designed for dialogue summarization tasks. It contains a collection of dialogue texts sourced from various domains, such as daily conversations, interviews, and chats, along with corresponding human-written summaries.

In [12]:
# Specify the name of the dataset to be used
huggingface_dataset_name_dialogsum =  "knkarthick/dialogsum"  
huggingface_dataset_name_samsum = "samsum"


# Loads the specified dataset from the Hugging Face Hub into memory.
dataset_dialogsum = load_dataset(huggingface_dataset_name_dialogsum)  
dataset_samsum = load_dataset(huggingface_dataset_name_samsum, trust_remote_code=True)

Generating train split: 100%|███████████████████████████████████████████| 14732/14732 [00:01<00:00, 9098.76 examples/s]
Generating test split: 100%|████████████████████████████████████████████████| 819/819 [00:00<00:00, 5249.17 examples/s]
Generating validation split: 100%|██████████████████████████████████████████| 818/818 [00:00<00:00, 4325.64 examples/s]


The loaded dataset is a dictionary-like object (a DatasetDict) with train, validation, and test splits. Each split contains the individual records.

In [13]:
dataset_dialogsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [14]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Let's print a few dialogues along with their human-written baseline summaries -

In [21]:
example_indices = [100, 200, 300]  # Looking at the records at indexes 100, 200, and 300 for both the datasets

for i, index in enumerate(example_indices):
    print('Example Number :  ', i + 1)
    print()
    print('INPUT DIALOGUE:    ')
    print(dataset_samsum['train'][index]['dialogue'])
    print('-'*120)
    print('SUMMARY:   ')
    print(dataset_samsum['train'][index]['summary'])
    print('-'*120)
    print('\n\n')

Example Number :   1

INPUT DIALOGUE:    
Gabby: How is you? Settling into the new house OK?
Sandra: Good. The kids and the rest of the menagerie are doing fine. The dogs absolutely love the new garden. Plenty of room to dig and run around.
Gabby: What about the hubby?
Sandra: Well, apart from being his usual grumpy self I guess he's doing OK.
Gabby: :-D yeah sounds about right for Jim.
Sandra: He's a man of few words. No surprises there. Give him a backyard shed and that's the last you'll see of him for months.
Gabby: LOL that describes most men I know.
Sandra: Ain't that the truth! 
Gabby: Sure is. :-) My one might as well move into the garage. Always tinkering and building something in there.
Sandra: Ever wondered what he's doing in there?
Gabby: All the time. But he keeps the place locked.
Sandra: Prolly building a portable teleporter or something. ;-)
Gabby: Or a time machine... LOL
Sandra: Or a new greatly improved Rabbit :-P
Gabby: I wish... Lmfao!
----------------

In [23]:
example_indices = [100, 200, 300]  # Looking at the records at indexes 100, 200, and 300 for both the datasets

for i, index in enumerate(example_indices):
    print('Example Number :  ', i + 1)
    print()
    print('TOPIC:    ', dataset_dialogsum['train'][index]['topic'])
    print('-'*120)
    print('INPUT DIALOGUE:    ')
    print(dataset_dialogsum['train'][index]['dialogue'])
    print('-'*120)
    print('SUMMARY:   ')
    print(dataset_dialogsum['train'][index]['summary'])
    print('-'*120)
    print('\n\n')

Example Number :   1

TOPIC:     cable
------------------------------------------------------------------------------------------------------------------------
INPUT DIALOGUE:    
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
------------------------------------------------------------------------------------------------------------------------
SUMMARY:   
#Person1# has a problem with the cable. #Person2# promises it should work again and

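Load the FLAN-T5 model by creating an instance of the AutoModelForSeq2SeqLM class with the .from_pretrained() method. We use flan-t5-base, a checkpoint with roughly 250M parameters that is publicly available on the Hugging Face Hub. Larger variants such as flan-t5-large (780M parameters) are also available if more capacity is needed.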

In [24]:
model_name='google/flan-t5-base'  # Specifies the name of the pre-trained model to be used.

model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # Loads the specified pre-trained seq2seq model into memory.

To enable encoding and decoding of text with a language model, it is essential to work with text in a **tokenized form**. Tokenization is the process of breaking down a string of text into smaller, manageable units called **tokens**. These tokens can represent words, subwords, or even individual characters, depending on the tokenizer's configuration. This step is critical because Large Language Models (LLMs) like FLAN-T5 are designed to process numerical representations of tokens rather than raw text, ensuring efficiency and consistency in handling diverse inputs.
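
To make the idea concrete, here is a minimal toy tokenizer that maps whitespace-separated words to integer IDs and back. It is an illustration only: the class `ToyTokenizer` is hypothetical, and FLAN-T5 actually uses a subword tokenizer rather than word splitting. The reserved IDs (0 for padding, 1 for end-of-sequence, 2 for unknown) mirror T5's convention.

```python
class ToyTokenizer:
    """Word-level tokenizer for illustration; not the FLAN-T5 tokenizer."""

    def __init__(self, corpus):
        # Reserve the first IDs for special tokens.
        self.pad_id, self.eos_id, self.unk_id = 0, 1, 2
        self.token_to_id = {"<pad>": 0, "</s>": 1, "<unk>": 2}
        for word in corpus.split():
            # Assign the next free ID to every word not seen before.
            self.token_to_id.setdefault(word, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Map each word to its ID and append the end-of-sequence ID."""
        ids = [self.token_to_id.get(w, self.unk_id) for w in text.split()]
        return ids + [self.eos_id]

    def decode(self, ids, skip_special_tokens=True):
        """Map IDs back to words, optionally dropping padding and end-of-sequence."""
        special = {self.pad_id, self.eos_id}
        tokens = [
            self.id_to_token[i]
            for i in ids
            if not (skip_special_tokens and i in special)
        ]
        return " ".join(tokens)


toy = ToyTokenizer("I love learning about LLMs")
ids = toy.encode("I love LLMs")
print(ids)
print(toy.decode(ids))
```

A subword tokenizer follows the same encode and decode contract, but splits rare words into smaller pieces so that far fewer inputs fall back to the unknown token.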

To tokenize text for the FLAN-T5 model, we will download and initialize the tokenizer using the `AutoTokenizer.from_pretrained()` method from the Transformers library. This method retrieves the pre-trained tokenizer associated with the FLAN-T5 model, ensuring compatibility with its architecture. Additionally, you can utilize the `use_fast` parameter, which activates the **fast tokenizer** implementation. The fast tokenizer leverages the **Hugging Face Tokenizers library**, offering improved speed and lower memory usage without compromising accuracy—ideal for handling large-scale text processing tasks.

For more information, you can refer to the documentation - https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # Loads the tokenizer corresponding to the model

Let's test the tokenizer's encoding and decoding with a simple sentence. In the tokenizer call below, we pass return_tensors='pt', which specifies that the tokenized data should be returned as PyTorch tensors (pt stands for PyTorch). This format is required when feeding the tokenized data to PyTorch-based models. sentence_encoded is a dictionary-like object containing - <br/>
input_ids: Numerical token IDs representing the input text. <br/>
attention_mask: Indicates which tokens are meaningful (1) and which are padding (0). <br/>
In the tokenizer's decode function, skip_special_tokens=True ensures that special tokens added during tokenization (for T5, e.g., `</s>` and `<pad>`) are not included in the decoded text.
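
The relationship between input_ids and attention_mask is easiest to see on a padded batch. The following pure-Python sketch pads two ID lists to the same length and builds the matching masks; `pad_batch` is a hypothetical helper and the IDs are arbitrary, so this only mimics what a real tokenizer does when padding is enabled.

```python
def pad_batch(sequences, pad_id=0):
    """Pad ID lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[27, 333, 1036, 1], [27, 1]])
print(batch["input_ids"])       # [[27, 333, 1036, 1], [27, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With a single unpadded sentence, as in the cell that follows, the mask is simply all ones.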

In [26]:
sentence = "I love learning about llm. LLM is fun!"

# Converts the raw input text (sentence) into tokenized form using the tokenizer.
sentence_encoded = tokenizer(sentence, return_tensors='pt')


# Begins the process of converting tokenized IDs back into a human-readable string.
sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],  # Selects the token IDs for the first input as there may be a batch of inputs.
        skip_special_tokens=True  
    )



print('ENCODED SENTENCE:   ', sentence_encoded["input_ids"][0])
print('-'*120)
print('DECODED SENTENCE:   ', sentence_decoded)

ENCODED SENTENCE:    tensor([   27,   333,  1036,    81,     3,   195,    51,     5,   301, 11160,
           19,   694,    55,     1])
------------------------------------------------------------------------------------------------------------------------
DECODED SENTENCE:    I love learning about llm. LLM is fun!


Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. Prompt engineering is the practice of changing the prompt (input) to improve the response for a given task. The cells below call model.generate(), which produces new text (e.g., a summary, response, or continuation) from the tokenized input. The model uses its architecture and weights (in this case FLAN-T5) to predict the next tokens iteratively, forming the output text. The max_new_tokens parameter specifies the maximum number of new tokens that the model can generate during this call.
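
The iterative prediction described above can be sketched as a simple greedy loop. This is a toy illustration, not the Transformers implementation: `greedy_generate` and `toy_next_token` are hypothetical names, and the "model" is a stand-in function that emits a fixed sequence. It shows the two ways generation ends: the token budget runs out, or the end-of-sequence token is produced.

```python
def greedy_generate(next_token_fn, input_ids, max_new_tokens, eos_id=1):
    """Append one predicted token at a time until EOS or the budget is reached."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token_fn(input_ids, generated)
        generated.append(token)
        if token == eos_id:
            break  # the model signalled that the output is complete
    return generated


def toy_next_token(input_ids, generated):
    """Stand-in for a model: emits 10, 11, 12, 13, 14 and then EOS (1)."""
    return 1 if len(generated) >= 5 else 10 + len(generated)


# Budget of 3: generation is cut off before EOS is reached.
print(greedy_generate(toy_next_token, [27, 333], max_new_tokens=3))
# Budget of 100: generation stops early at EOS.
print(greedy_generate(toy_next_token, [27, 333], max_new_tokens=100))
```

A small max_new_tokens value can therefore truncate a summary mid-sentence, while a large value costs nothing extra once the model emits its end-of-sequence token.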

In [27]:
for i, index in enumerate(example_indices):
    dialogue = dataset_samsum['train'][index]['dialogue']
    summary = dataset_samsum['train'][index]['summary']
    
    # Tokenization
    inputs = tokenizer(dialogue, return_tensors='pt')
    
    # Model prediction with tokenized data
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
    # Detokenization
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{outputs}\n')
    print('-'*120)
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
Gabby: How is you? Settling into the new house OK?
Sandra: Good. The kids and the rest of the menagerie are doing fine. The dogs absolutely love the new garden. Plenty of room to dig and run around.
Gabby: What about the hubby?
Sandra: Well, apart from being his usual grumpy self I guess he's doing OK.
Gabby: :-D yeah sounds about right for Jim.
Sandra: He's a man of few words. No surprises there. Give him a backyard shed and that's the last you'll see of him for months.
Gabby: LOL that describes most men I know.
Sandra: Ain't that the truth! 
Gabby: Sure is. :-) My one might as well move into the garage. Always tinkering and building something in there.
Sandra: Ever wondered what he's doing in there?
Gabby: All t

In [28]:
# Let's try with the other dataset

for i, index in enumerate(example_indices):
    dialogue = dataset_dialogsum['train'][index]['dialogue']
    summary = dataset_dialogsum['train'][index]['summary']
    
    # Tokenization
    inputs = tokenizer(dialogue, return_tensors='pt')
    
    # Model prediction with tokenized data
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
    # Detokenization
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{outputs}\n')
    print('-'*120)
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
-----------------------------------------------------------------------------------------------------------------------

You may notice that the model's predictions appear somewhat reasonable, as they align with the context of the input. However, the model does not seem entirely certain about the specific task it is expected to perform. Instead, it tends to generate a continuation of the dialogue that feels improvised or arbitrary rather than purposeful. This behavior suggests that the model lacks clarity about the intended outcome, such as summarizing, translating, or answering a question. In such cases, **prompt engineering**—the process of crafting clear, specific, and informative input prompts—can significantly improve the model's performance by guiding it more effectively toward the desired goal.

### Summarize Dialogue with an Instruction Prompt

Prompt engineering is the practice of designing effective input prompts to guide language models (LLMs) like GPT or T5 to produce desired outputs. It leverages the model's inherent training on diverse data, enabling task-specific results without modifying the model's architecture. <br/>

Ways of Prompt Engineering:

- Instruction-based Prompts: Clearly define the task, e.g., "Summarize this passage."
- Few-shot Prompts: Provide examples within the prompt to guide the model.
- Zero-shot Prompts: Assume the model understands the task solely from the instruction.
- Chain-of-thought Prompts: Encourage step-by-step reasoning for complex tasks.
- Role-based Prompts: Assign roles like, “Act as a teacher explaining...”

Usefulness:
Prompt engineering is effective for:

- Text generation tasks (summarization, question answering, translation).
- Tasks where models must follow nuanced instructions.
- Prototyping without costly fine-tuning.

Limitations:

- Reliability: Output varies with subtle prompt changes.
- Complex Tasks: Struggles with intricate domain-specific applications.
- Model Dependency: Performance hinges on the pre-trained model's limitations.

#### Zero Shot Inference with an Instruction Prompt

Zero-shot inference with an instruction prompt involves guiding a pre-trained language model (LLM) to perform a task it hasn't been explicitly trained on, without providing task-specific examples. Instead, a well-crafted instruction prompt is given to help the model understand the task from its general training knowledge. <br/>

The LLM, like GPT-3 or FLAN-T5, relies on its extensive training on diverse text datasets, enabling it to infer tasks from natural language instructions. For instance, a prompt such as: "Summarize the following text: [text]" tells the model to generate a concise summary, even if summarization isn’t its primary task. <br/>

Benefits:

- Flexibility: Adapts to multiple tasks without retraining.
- Cost-Efficiency: Avoids the need for labeled data or fine-tuning.
- Fast Prototyping: Quickly evaluates model performance on new tasks.

Key Techniques:

- Direct Instruction: Straightforward commands like “Translate this text into French.”
- Role Assignment: Specifying a role, e.g., “You are a helpful assistant. Write a formal email.”
- Explicit Output Structure: Defining the desired format, e.g., “Answer in bullet points.”

Challenges:

- Ambiguity: Vague instructions yield inconsistent outputs.
- Complexity: The model struggles with specialized domains requiring expertise.
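
As a minimal sketch of a zero-shot instruction prompt, the helper below places the instruction first, then the dialogue, then a cue for the answer, with no examples in between. `build_zero_shot_prompt` is a hypothetical helper, not something provided by Transformers. It builds the string without source-code indentation, so no stray leading whitespace ends up in the prompt.

```python
def build_zero_shot_prompt(dialogue, instruction="Summarize the following conversation."):
    """Instruction, then the dialogue, then a cue for the answer; no examples."""
    return f"{instruction}\n\n{dialogue.strip()}\n\nSummary:"


dialogue = "#Person1#: I have a problem with my cable.\n#Person2#: What about it?"
print(build_zero_shot_prompt(dialogue))
```

The resulting string can be passed to the tokenizer in place of the raw dialogue. Changing the `instruction` argument is a cheap way to test how sensitive the model is to wording.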

In [30]:
for i, index in enumerate(example_indices):
    dialogue = dataset_dialogsum['train'][index]['dialogue']
    summary = dataset_dialogsum['train'][index]['summary']

    
    # Here in this prompt we are not providing any example
    # Rather we are just instructing to summarize the input which is a conversation in our case
    prompt = f"""
                Summarize the following conversation.
                {dialogue}
                Summary:
            """
    

    # We pass prompt in place of dialogue in the tokenizer.
    inputs = tokenizer(prompt, return_tensors='pt')
    
    # Model prediction with tokenized data
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
    # Detokenization
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    
    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)  
    print(f'MODEL GENERATION - ZERO SHOT:\n{outputs}\n')
    print('-'*120)  
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
-----------------------------------------------------------------------------------------------------------------------

In [31]:
# Let's perform the same with samsum

for i, index in enumerate(example_indices):
    dialogue = dataset_samsum['train'][index]['dialogue']
    summary = dataset_samsum['train'][index]['summary']

    
    # Here in this prompt we are not providing any example
    # Rather we are just instructing to summarize the input which is a conversation in our case
    prompt = f"""
                Summarize the following conversation.
                {dialogue}
                Summary:
            """
    

    # We pass prompt in place of dialogue in the tokenizer.
    inputs = tokenizer(prompt, return_tensors='pt')
    
    # Model prediction with tokenized data
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
    # Detokenization
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    
    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)  
    print(f'MODEL GENERATION - ZERO SHOT:\n{outputs}\n')
    print('-'*120)  
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
Gabby: How is you? Settling into the new house OK?
Sandra: Good. The kids and the rest of the menagerie are doing fine. The dogs absolutely love the new garden. Plenty of room to dig and run around.
Gabby: What about the hubby?
Sandra: Well, apart from being his usual grumpy self I guess he's doing OK.
Gabby: :-D yeah sounds about right for Jim.
Sandra: He's a man of few words. No surprises there. Give him a backyard shed and that's the last you'll see of him for months.
Gabby: LOL that describes most men I know.
Sandra: Ain't that the truth! 
Gabby: Sure is. :-) My one might as well move into the garage. Always tinkering and building something in there.
Sandra: Ever wondered what he's doing in there?
Gabby: All t

Despite using zero-shot prompting, the model still struggles to capture the subtle nuances of the conversations. While it can generate coherent responses based on the provided instructions, it often misses context-specific details, tone, and underlying intentions that are crucial for understanding complex dialogues. Zero-shot prompting relies on the model's general training, but without fine-tuning or task-specific examples, the model may fail to fully grasp intricate aspects of the conversation, such as sarcasm, implied meanings, or emotional undertones. This limitation arises because the model isn't explicitly trained on these finer conversational cues. To address this, prompt engineering can help refine instructions, but there are still inherent challenges when dealing with nuanced language, especially in highly specialized or emotional contexts. <br/>
Let's use a slightly different prompt. Many prompt templates have been published for FLAN-T5 for specific tasks; you can browse them here - https://github.com/google-research/FLAN/tree/main/flan/v2
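
When comparing several prompt wordings, it can help to keep them in one place as format strings. The sketch below is one possible way to do that; `PROMPT_TEMPLATES` and `render_prompts` are hypothetical names, and the two templates are simplified versions of the wordings used in this notebook rather than entries copied from the FLAN repository.

```python
# Each template has a single {dialogue} placeholder.
PROMPT_TEMPLATES = {
    "instruction": "Summarize the following conversation.\n\n{dialogue}\n\nSummary:",
    "question": "Dialogue:\n\n{dialogue}\n\nWhat was going on?",
}


def render_prompts(dialogue, templates=PROMPT_TEMPLATES):
    """Fill every template with the same dialogue, keyed by template name."""
    return {
        name: template.format(dialogue=dialogue.strip())
        for name, template in templates.items()
    }


for name, prompt in render_prompts("#Person1#: Hi.\n#Person2#: Hello.").items():
    print(f"--- {name} ---")
    print(prompt)
```

Each rendered prompt can then be tokenized and passed to the model in turn, which makes side-by-side comparison of the outputs straightforward.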

In [32]:
for i, index in enumerate(example_indices):
    dialogue = dataset_dialogsum['train'][index]['dialogue']
    summary = dataset_dialogsum['train'][index]['summary']
        
    prompt = f""" 
                Dialogue:
                {dialogue}
                
                What was going on?
             """

    inputs = tokenizer(prompt, return_tensors='pt')
    
    
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
 
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    

    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)  
    print(f'MODEL GENERATION - ZERO SHOT:\n{outputs}\n')
    print('-'*120)  
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
-----------------------------------------------------------------------------------------------------------------------

In [33]:
for i, index in enumerate(example_indices):
    dialogue = dataset_dialogsum['train'][index]['dialogue']
    summary = dataset_dialogsum['train'][index]['summary']
        
    # Using another slightly different prompt
    prompt = f"""
                Dialogue:
                {dialogue}
                
                What was going on? Summary: ?
              """

    inputs = tokenizer(prompt, return_tensors='pt')
    
    
    model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
 
    outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)
    
    

    print('-'*120)
    print('Example Number:   ', i + 1)
    print('-'*120)
    print(f'INPUT PROMPT:\n{dialogue}')
    print('-'*120)
    print(f'HUMAN SUMMARY:\n{summary}')
    print('-'*120)  
    print(f'MODEL GENERATION - ZERO SHOT:\n{outputs}\n')
    print('-'*120)  
    print('\n\n')

------------------------------------------------------------------------------------------------------------------------
Example Number:    1
------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
-----------------------------------------------------------------------------------------------------------------------

Notice that this FLAN-T5 prompt template did help a bit, but the model still struggles to pick up on the nuance of the conversation. This is what we will try to address with one-shot and few-shot inference.

#### Summarize Dialogue with One Shot and Few Shot Inference

**One-shot** and **few-shot inference** are techniques used to improve the performance of large language models (LLMs) by providing them with one or more examples of prompt-response pairs that are representative of the task you want the model to perform. These examples are presented before the actual task prompt, helping the model understand the context and expected output style. This process is known as **in-context learning**.

In **one-shot inference**, a single example is provided, while in **few-shot inference**, multiple examples are given. These examples serve as demonstrations of the task, guiding the model on how to generate responses that align with the task's requirements. By learning from these examples, the model adjusts its responses to match the pattern seen in the prompt-response pairs.

In-context learning leverages the model's ability to understand patterns in the examples and apply that understanding to new, unseen tasks. It doesn’t require retraining the model but instead relies on the model’s inherent ability to generalize from the provided examples. This makes one-shot and few-shot prompting highly effective for tasks where fine-tuning the model is impractical or resource-intensive.

However, the quality of the task-specific response depends heavily on the clarity and relevance of the examples provided. With fewer examples, the model may struggle to fully understand the task's nuances, whereas providing more examples increases the chances of achieving high-quality, contextually accurate results.

<b> Let's perform One Shot Inference first </b>

In [42]:
def make_prompt(example_indices, example_index_to_summarize):
    prompt = ''
    
    for index in example_indices:
        dialogue = dataset_samsum['train'][index]['dialogue']
        summary = dataset_samsum['train'][index]['summary']
        
        # The first part of the prompt provides each example dialogue together with its human summary.
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        # Check the model's documentation and use its prompt templates as they are.
        prompt += f"""
    Dialogue:

    {dialogue}

    What was going on?
    {summary}


                """
    
    # The second part of the prompt is only providing the input dialogues without summary.
    # The model has to generate the summary
    dialogue = dataset_samsum['train'][example_index_to_summarize]['dialogue']
    
    prompt += f"""
    Dialogue:

    {dialogue}
    
    What was going on?
            """
    
    # Returning the full prompt
    return prompt

In [46]:
example_indices = [100]
example_index_to_summarize = 222

one_shot_prompt = make_prompt(example_indices, example_index_to_summarize)

# Let's see what our prompt looks like
print(one_shot_prompt)


    Dialogue:

    Gabby: How is you? Settling into the new house OK?
Sandra: Good. The kids and the rest of the menagerie are doing fine. The dogs absolutely love the new garden. Plenty of room to dig and run around.
Gabby: What about the hubby?
Sandra: Well, apart from being his usual grumpy self I guess he's doing OK.
Gabby: :-D yeah sounds about right for Jim.
Sandra: He's a man of few words. No surprises there. Give him a backyard shed and that's the last you'll see of him for months.
Gabby: LOL that describes most men I know.
Sandra: Ain't that the truth! 
Gabby: Sure is. :-) My one might as well move into the garage. Always tinkering and building something in there.
Sandra: Ever wondered what he's doing in there?
Gabby: All the time. But he keeps the place locked.
Sandra: Prolly building a portable teleporter or something. ;-)
Gabby: Or a time machine... LOL
Sandra: Or a new greatly improved Rabbit :-P
Gabby: I wish... Lmfao!

    What was going on?
    Sandra is 

In [47]:
# Fetch the human-written summary to compare against the model-generated summary
summary = dataset_samsum['train'][example_index_to_summarize]['summary']


inputs = tokenizer(one_shot_prompt, return_tensors='pt')

model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)


print('-'*120)
print(f'HUMAN SUMMARY:\n{summary}\n')
print('-'*120)
print(f'MODEL GENERATION - ONE SHOT:\n{outputs}')

Token indices sequence length is longer than the specified maximum sequence length for this model (589 > 512). Running this sequence through the model will result in indexing errors


------------------------------------------------------------------------------------------------------------------------
HUMAN SUMMARY:
Richard suspects his girlfriend is cheating on him because of her emotional distance. Matt has once accused his girlfriend of cheating, when in reality she was throwing him a surprise birthday party. Richard will talk to his girlfriend as Matt advises.

------------------------------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
Richard has a feeling that his girl is cheating on him. Matt accuses his girl of cheating on him. Richard is not sure if he is cheating on her. Matt will try to talk to her.


<b> Now let's proceed with Few Shot Inference </b>

In [48]:
# We will pick more examples in this case

example_indices = [111, 222, 333, 444]
example_index_to_summarize = 555

few_shot_prompt = make_prompt(example_indices, example_index_to_summarize)

print(few_shot_prompt)


    Dialogue:

    Joe: R U watching 'The Millionaire'?
Tim: Sure!
Jack: Me too!
Joe: Oooops. the commercial block is finishing.
Joe: Talk to you later!

    What was going on?
    Joe, Tim and Jack are watching 'The Millionaire'.


                
    Dialogue:

    Richard: I have a feeling that my girl is cheating on me...
Matt: Well...
Matt: Don't know what to reply
Matt: I'm sorry man..
Matt: But are you sure?
Richard: I'm not. But I have my reasons to believe so.
Matt: I once made a mistake.
Matt: I accussed my girl of cheating on me. But it turned out she conspired with my friends to throw me a suprprise birthday party.
Matt: She was furious when I confronted her.
Richard: Wow. You've never told me that, and I was partially resposible for it...
Matt: Nevermind. Everything is fine now.
Richard: In my case however it's not about conspiring. It's the distance. She created so much distance between us that I have a feeling there is someone else she likes to be with.

In [49]:
summary = dataset_samsum['train'][example_index_to_summarize]['summary']


inputs = tokenizer(few_shot_prompt, return_tensors='pt')

model_generate = model.generate(inputs['input_ids'], max_new_tokens=100)
    
outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)


print('-'*120)
print(f'HUMAN SUMMARY:\n{summary}\n')
print('-'*120)
print(f'MODEL GENERATION - FEW SHOT:\n{outputs}')

------------------------------------------------------------------------------------------------------------------------
HUMAN SUMMARY:
Penny will wear some black dress for the company dinner.

------------------------------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Penny will wear a black dress on the company dinner.


In this case, few-shot inference produced a summary much closer to the human reference than the one-shot run did, although the two runs summarize different dialogues, so the comparison is only indicative. Providing multiple examples helps the model pick up the desired output format and task. However, increasing the number of examples beyond five or six does not tend to provide significant additional benefits. Beyond that point the improvement plateaus, likely because the model has already picked up the relevant pattern from the first few examples.

It's also crucial to ensure that the number of tokens in the input prompt does not exceed the model's input-context length. In our scenario, the model has a context length limit of 512 tokens. If the total token count surpasses this limit, the tokenizer warns about it, as in the one-shot run above (589 > 512), and the excess tokens may be truncated or ignored, potentially leading to incomplete or less accurate outputs. Therefore, it’s important to balance the number of examples provided with the model's token capacity.

Despite these limitations, giving the model at least one full example (one-shot) can still be highly effective. Even a single example shows the model the expected structure and format, which helps it produce a more relevant and coherent summary. In short, while too many examples can be counterproductive, one or a few well-chosen examples can greatly improve the model's output by setting a clear context for the task at hand.
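
To make the balance between the number of examples and the 512-token limit concrete, here is a minimal sketch that keeps only as many example blocks as fit next to the dialogue to summarize. The helper names `count_tokens_whitespace` and `pick_examples_within_budget` are hypothetical and not part of this notebook, and the whitespace word count is only a stand-in: in the notebook you would pass something like `lambda text: len(tokenizer(text)['input_ids'])` instead. Summing per-block counts is an approximation of the length of the final prompt.

```python
from typing import Callable, List, Sequence


def count_tokens_whitespace(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pick_examples_within_budget(
    example_blocks: Sequence[str],
    query_block: str,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
    max_tokens: int = 512,
) -> List[str]:
    """Keep as many leading example blocks as fit alongside the query block."""
    used = count_tokens(query_block)  # the dialogue to summarize always stays
    kept: List[str] = []
    for block in example_blocks:
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break  # adding this example would overflow the budget
        kept.append(block)
        used += cost
    return kept


# Toy blocks (made up for illustration, not taken from the dataset)
examples = ["Dialogue: A: hi B: hello What was going on? A greets B."] * 4
query = "Dialogue: C: lunch? D: sure What was going on?"

kept = pick_examples_within_budget(examples, query, max_tokens=40)
print(f"{len(kept)} of {len(examples)} examples fit in the budget")
```

With the toy budget of 40 "tokens", only two of the four example blocks are kept. The same idea applies with the real tokenizer and the 512-token limit.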

### Generative Configuration Parameters for Inference

You can customize the output of a large language model (LLM) by adjusting various configuration parameters in the `generate()` method. So far, the primary parameter you’ve been using is `max_new_tokens=100`, which limits the number of tokens the model generates. However, the `generate()` method offers a wide range of other parameters that can influence the output, such as `num_beams`, `temperature`, `top_p`, and `repetition_penalty`, each of which affects how the model generates text.

- **`num_beams`**: Controls the number of beams used in beam search. A higher number of beams typically improves the quality of the generated text by exploring more candidate sequences, but increases computation.
- **`temperature`**: Scales the randomness of predictions when sampling is enabled (`do_sample=True`). A lower temperature (e.g., 0.1) makes the model’s output more deterministic, while a higher temperature (e.g., 1.0, which leaves the model's probabilities unchanged) introduces more creativity and variety.
- **`top_p`**: Implements nucleus sampling, where the model samples from the smallest set of most probable tokens whose cumulative probability is greater than or equal to `p`. A smaller `p` gives more focused output, while a `p` closer to 1.0 gives more diverse output. Like `temperature`, it only applies when `do_sample=True`.
- **`repetition_penalty`**: Discourages the model from repeating the same phrases or tokens by applying a penalty to previously generated tokens.
  - `repetition_penalty = 1.0`: no repetition penalty (default behavior)
  - `repetition_penalty = 1.5`: moderate repetition penalty, discourages repetition
  - `repetition_penalty = 2.0`: strong repetition penalty, significantly reduces repetition
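
To build some intuition for `temperature`, `top_p`, and `repetition_penalty`, the sketch below applies each idea to a tiny made-up vocabulary using only the standard library. It is an illustration under assumptions, not the library's code: the function names are hypothetical, the scores are invented, and the divide-or-multiply rule for the repetition penalty follows the commonly described formulation rather than anything shown in this notebook. The real `generate()` works on tensors over the whole vocabulary.

```python
import math
from typing import Dict, Iterable


def softmax_with_temperature(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Divide the raw scores by the temperature, then normalize to probabilities."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens reaching top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def apply_repetition_penalty(logits: Dict[str, float], generated: Iterable[str], penalty: float) -> Dict[str, float]:
    """Lower the score of tokens that were already generated."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            # Positive scores are divided, negative ones multiplied, so both go down
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = {"dress": 2.0, "suit": 1.0, "hat": 0.0}  # made-up scores

sharp = softmax_with_temperature(logits, temperature=0.1)
flat = softmax_with_temperature(logits, temperature=1.0)
nucleus = top_p_filter(flat, top_p=0.8)
penalized = apply_repetition_penalty(logits, ["dress"], penalty=2.0)

print("temperature=0.1:", sharp)
print("temperature=1.0:", flat)
print("top_p=0.8:", nucleus)
print("repetition_penalty=2.0:", penalized)
```

At temperature 0.1 almost all probability mass lands on `dress`, while at 1.0 it gets about 0.67. With `top_p=0.8` the least likely token `hat` is dropped before sampling, and the penalty of 2.0 halves the score of the already generated `dress`.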

Managing these parameters individually can become cumbersome, especially when you need to experiment with different combinations. A more efficient way to organize and manage these settings is by using the **`GenerationConfig`** class. This class allows you to group related parameters into a single object, making it easier to modify and experiment with different configurations. By encapsulating your generation settings in one place, you can streamline the process and ensure consistency when generating multiple outputs.



A full list of available parameters can be found in the Hugging Face Generation documentation - https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
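
When experimenting with different combinations, it can help to generate the settings programmatically. The sketch below builds one plain dictionary per combination; `build_config_grid` is a hypothetical helper, and the swept values are arbitrary choices for illustration. In the notebook, each dictionary could then be turned into a config object with `GenerationConfig(**settings)`.

```python
from itertools import product
from typing import Any, Dict, List


def build_config_grid(base: Dict[str, Any], sweep: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one settings dict per combination of the swept values."""
    names = list(sweep)
    grid: List[Dict[str, Any]] = []
    for values in product(*(sweep[name] for name in names)):
        settings = dict(base)  # copy so the base settings stay untouched
        settings.update(zip(names, values))
        grid.append(settings)
    return grid


base_settings = {"max_new_tokens": 100, "do_sample": True}
sweep = {"temperature": [0.1, 0.7, 1.0], "repetition_penalty": [1.0, 1.5]}  # arbitrary values

grid = build_config_grid(base_settings, sweep)

for settings in grid:
    print(settings)
    # In the notebook, each dict could be used like this:
    # generation_config = GenerationConfig(**settings)
    # model.generate(inputs['input_ids'], generation_config=generation_config)
```

Keeping the combinations as plain dictionaries makes it easy to log which settings produced which summary.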



In [54]:
generation_config = GenerationConfig(max_new_tokens=100, do_sample=True, temperature=0.1, repetition_penalty=1.5)
generation_config

GenerationConfig {
  "do_sample": true,
  "max_new_tokens": 100,
  "repetition_penalty": 1.5,
  "temperature": 0.1
}

In [62]:
# Reuse the few-shot prompt, this time generating with the settings in generation_config
# (do_sample=True means the output can vary between runs)

# Let's try giving many examples
example_indices = [111, 222, 333, 444, 555, 666, 777, 888]
example_index_to_summarize = 1111

few_shot_prompt = make_prompt(example_indices, example_index_to_summarize)


summary = dataset_samsum['train'][example_index_to_summarize]['summary']


inputs = tokenizer(few_shot_prompt, return_tensors='pt')

model_generate = model.generate(inputs['input_ids'], generation_config=generation_config)
    
outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)


print('-'*120)
print(f'HUMAN SUMMARY:\n{summary}\n')
print('-'*120)
print(f'MODEL GENERATION - FEW SHOT:\n{outputs}')

------------------------------------------------------------------------------------------------------------------------
HUMAN SUMMARY:
It's snowing in Satle in October.

------------------------------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
It's snowing in Satle.


In [63]:
# Same generation settings, but with fewer examples in the few-shot prompt

example_indices = [111, 222, 333, 444]
example_index_to_summarize = 1111

few_shot_prompt = make_prompt(example_indices, example_index_to_summarize)


summary = dataset_samsum['train'][example_index_to_summarize]['summary']


inputs = tokenizer(few_shot_prompt, return_tensors='pt')

model_generate = model.generate(inputs['input_ids'], generation_config=generation_config)
    
outputs = tokenizer.decode(model_generate[0], skip_special_tokens=True)


print('-'*120)
print(f'HUMAN SUMMARY:\n{summary}\n')
print('-'*120)
print(f'MODEL GENERATION - FEW SHOT:\n{outputs}')

------------------------------------------------------------------------------------------------------------------------
HUMAN SUMMARY:
It's snowing in Satle in October.

------------------------------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
It's snowing in Stockholm.


I have tried one-shot and few-shot prompting using the samsum data. Try the dialogsum data and see what you get! Also, try changing the generation parameters to see how the output differs.
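
When comparing many runs, reading every summary by eye gets tedious. As a rough aid, the sketch below scores a generated summary by its word overlap with the human summary. This is not a metric used elsewhere in this notebook: `unigram_overlap` is a hypothetical helper, and simple word overlap is only a crude stand-in for a proper summarization metric.

```python
import re
from collections import Counter
from typing import Dict


def unigram_overlap(reference: str, candidate: str) -> Dict[str, float]:
    """Word-level precision, recall and F1 between a reference and a candidate summary."""
    ref_counts = Counter(re.findall(r"[a-z0-9']+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9']+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Summaries copied from the two runs above
human = "It's snowing in Satle in October."
eight_examples = "It's snowing in Satle."
four_examples = "It's snowing in Stockholm."

score_eight = unigram_overlap(human, eight_examples)
score_four = unigram_overlap(human, four_examples)

print("8 examples:", score_eight)
print("4 examples:", score_four)
```

On these two outputs the run with eight examples gets an F1 of 0.8 and the run with four examples gets 0.6, which matches the impression that the first summary is closer to the human one.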

### Conclusion

As demonstrated, prompt engineering can significantly enhance the performance of a model for specific use cases, but it does come with certain limitations, such as the context-length limit and the plateau in gains from additional examples. This is where **fine-tuning** becomes essential. Fine-tuning allows a model to adapt more specifically to a particular task or dataset, overcoming the general limitations of prompt engineering. Model size matters as well: if we were using larger models like GPT-3 or LLaMA, which contain billions of parameters, the output quality and accuracy would likely be much higher even with the same process we used with FLAN-T5. These models, due to their massive scale and more extensive pre-training, could better handle the task at hand without requiring fine-tuning, especially when the dataset consists of general language rather than highly specialized, domain-specific terms.

In such cases, fine-tuning might not be necessary because these larger models already generalize across a wide variety of tasks and input types. Additionally, a variety of **prompt template types** are available for models like GPT-3, LLaMA, and others, which provide structured ways to input queries and guide the model toward more relevant and accurate responses. These templates range from simple instructions to more complex role-based or chain-of-thought prompting techniques, all of which can improve the model's output without extensive fine-tuning when the data is not highly domain-specific.