# **Dialogue Summarization**
# **This notebook was created by: [mahdi khoshmaram](https://github.com/mahdi-khoshmaram)** 🤗

In this notebook, I give various prompt formats to the `FLAN-T5` model to see how prompt engineering can improve summarization performance. This is what I am going to do:

1. Summarize `Dialogue` without Prompt Engineering
2. Summarize `Dialogue` with an Instruction Prompt
    - Zero Shot Inference
    - One Shot Inference
    - Few Shot Inference

## **0-Install libraries**
**Note:** I ran this notebook on Google Colab. If you are running it in Jupyter Notebook, you need to install the `torch` and `transformers` libraries. Use the following commands:

1. ***Transformers:***
- `pip install transformers`


2. ***PyTorch:***
- ***Windows-cpu:*** `pip3 install torch torchvision torchaudio`
- ***Linux-cpu:*** `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`
- ***Windows-cuda12.4:*** `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124`
- ***Linux-cuda12.4:*** `pip3 install torch torchvision torchaudio`
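
Before running the cells, it can help to confirm that the required libraries are importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Libraries this notebook imports or installs.
required = ["torch", "transformers", "datasets"]
missing = missing_packages(required)
if missing:
    print("Install first:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

This only checks that each package can be located; it does not verify versions or CUDA support.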

In [4]:
%pip install datasets --quiet

## **1-Load libraries**

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

## **2-Dataset**
I am using the `knkarthick/dialogsum` dataset from Hugging Face.

In [6]:
hf_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(hf_dataset_name, split=None)

Get information about the dataset:

In [7]:
# Dataset Inspection
for split in dataset.keys():
    print(f"{split}: {len(dataset[split])} rows")

train: 12460 rows
validation: 500 rows
test: 1500 rows


In [9]:
# Print two examples from the test split
indices = [20, 400]

# Separator line reused by every print block in this notebook
dashLine = '-' * 100

for num, index in enumerate(indices):
    print(dashLine)
    print(f"Example {num+1}")
    print(dashLine)
    print("dialogue:")
    print(dataset['test'][index]['dialogue'])
    print(dashLine)
    print("summary:")
    print(dataset['test'][index]['summary'])
    print(dashLine)

----------------------------------------------------------------------------------------------------
Example 1
----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath.
--------------------------------------------------------

## **3-Set Model**
I am using `FLAN-T5` (`google/flan-t5-base`), an instruction-tuned **encoder-decoder** model from the `T5` family, suitable for text summarization.

In [24]:
model_name = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1)
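
Because `do_sample=True`, the model samples each token from a probability distribution, so the generated summary can differ between runs. The `temperature` value rescales the token scores before they are turned into probabilities. The sketch below is a conceptual illustration in plain Python, not the library's internal code; the function name and the example scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

With `temperature=1`, as configured above, the scores are left unchanged. Smaller values concentrate probability on the top-scoring token, and larger values flatten the distribution.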

Checking the tokenizer, just for fun!

In [20]:
text = "Hi, I am Mahdi!"
sentence_encoded = tokenizer(text, return_tensors = 'pt')
sentence_decoded = tokenizer.decode(sentence_encoded['input_ids'][0], skip_special_tokens=True)
print(f'sentence_encoded:\n{sentence_encoded}')
print(dashLine)
print(f'sentence_decoded:\n{sentence_decoded}')

sentence_encoded:
{'input_ids': tensor([[2018,    6,   27,  183, 8555,   26,   23,   55,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
----------------------------------------------------------------------------------------------------
sentence_decoded:
Hi, I am Mahdi!


## **4-Prompt Engineering <u>vs</u> NOT Prompt Engineering**
From now on, I give various prompt formats to the model to see how prompt engineering can improve performance. This is what I am going to do:

1. Summarize `Dialogue` without Prompt Engineering
2. Summarize `Dialogue` with an Instruction Prompt
    - Zero Shot Inference
    - One Shot Inference
    - Few Shot Inference

### **4-1-Without Prompt Engineering**
I use one example from the `test` split; its `index` is 400.

In [22]:
index = 400
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

In [27]:
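# Tokenize the raw dialogue (no instruction) and generate a summary.
# The name `inputs` avoids shadowing Python's built-in `input`.
inputs = tokenizer(dialogue, return_tensors='pt')
output = tokenizer.decode(
    model.generate(inputs['input_ids'], generation_config=generation_config)[0],
    skip_special_tokens=True
)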

print(dashLine)
print('dialogue:')
print(dialogue)
print(dashLine)
print('original_summary:')
print(summary)
print(dashLine)
print('model_summary:')
print(output)
print(dashLine)

----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
----------------------------------------------------------------------------------------------------
original_summary:
#Person1# and #Person2# are talking about the heavy storm last night, and #Person1#'s positive. #Person2# thinks the weat

Without an instruction, the model is not sure what task it is supposed to do; it just makes up the next sentence in the dialogue. Prompt engineering can help here.

### **4-2-1-With Prompt Engineering**
#### **Zero-shot inference**

In [28]:
prompt = f"""
Summarize the following conversation.

{dialogue}

summary:
"""

In [30]:
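# Tokenize the instruction prompt and generate a summary.
# The name `inputs` avoids shadowing Python's built-in `input`.
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(inputs['input_ids'], generation_config=generation_config)[0],
    skip_special_tokens=True)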

print(dashLine)
print('dialogue:')
print(dialogue)
print(dashLine)
print('original_summary:')
print(summary)
print(dashLine)
print('model_summary:')
print(output)
print(dashLine)

----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
----------------------------------------------------------------------------------------------------
original_summary:
#Person1# and #Person2# are talking about the heavy storm last night, and #Person1#'s positive. #Person2# thinks the weat

Performance is much better than without prompt engineering, but the summary still lacks detail!
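
Judgements like this one are made by eye. A rough way to put a number on the comparison is word overlap between the reference summary and the model summary. The sketch below computes a unigram F1 score in plain Python; it is a simplified ROUGE-1-style measure, not the official ROUGE implementation, and the function name and sample strings are made up for illustration.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Word-overlap F1 between two texts (lowercased, split on whitespace)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings, not outputs from the notebook.
reference = "two people talk about the storm last night"
candidate = "two people discuss the storm"
print(round(unigram_f1(reference, candidate), 3))
```

In the notebook, the same function could be called with `summary` and `output` after each experiment. Because sampling is enabled, scores from a single run are noisy.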

**Zero-shot inference with a `FLAN-T5` prompt template**

In [31]:
prompt = f"""
dialogue:

{dialogue}

What was going on?
"""

In [32]:
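# Tokenize the template prompt and generate a summary.
# The name `inputs` avoids shadowing Python's built-in `input`.
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(inputs['input_ids'], generation_config=generation_config)[0],
    skip_special_tokens=True)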

print(dashLine)
print('dialogue:')
print(dialogue)
print(dashLine)
print('original_summary:')
print(summary)
print(dashLine)
print('model_summary:')
print(output)
print(dashLine)

----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
----------------------------------------------------------------------------------------------------
original_summary:
#Person1# and #Person2# are talking about the heavy storm last night, and #Person1#'s positive. #Person2# thinks the weat

Again, performance is much better than without prompt engineering, but the summary still lacks detail!
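
The two zero-shot prompts above differ only in the text placed before and after the dialogue. A minimal sketch of how they could be generated from one place follows; `build_zero_shot_prompt` and the `templates` dictionary are hypothetical names, and the sample dialogue is made up.

```python
def build_zero_shot_prompt(dialogue, prefix, suffix):
    """Wrap a dialogue between a prefix and a suffix, mirroring the f-strings above."""
    return f"\n{prefix}\n\n{dialogue}\n\n{suffix}\n"


# The two prefix/suffix pairs used in the zero-shot cells of this notebook.
templates = {
    "instruction": ("Summarize the following conversation.", "summary:"),
    "question": ("dialogue:", "What was going on?"),
}

sample_dialogue = "#Person1#: Hello!\n#Person2#: Hi, how are you?"  # made-up dialogue
for name, (prefix, suffix) in templates.items():
    print(f"--- {name} ---")
    print(build_zero_shot_prompt(sample_dialogue, prefix, suffix))
```

Keeping the templates in one dictionary makes it easier to try further wordings on the same dialogue.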

### **4-2-2-With Prompt Engineering**
#### **One-shot inference**

First, I wrote a `make_prompt` function that builds a prompt, including the example shots.

In [37]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5.
        prompt += f"""
Dialogue:

{dialogue}

what was going on?
{summary}



        """
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

what was going on?
    """
    return prompt

An example prompt:

In [38]:
# One-shot prompt: one worked example (index 40) plus the dialogue to summarize
example_indices_full = [40]
example_index_to_summarize = 400
one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)
print(one_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

what was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



        
Dialogue:

#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.

Give the prompt to the model:



In [41]:
# inference
summary = dataset['test'][example_index_to_summarize]['summary']

# The name `inputs` avoids shadowing Python's built-in `input`.
inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs['input_ids'],
        generation_config=generation_config
    )[0],
    skip_special_tokens=True
)

print(dashLine)
print('dialogue:')
print(dialogue)
print(dashLine)
print('original_summary:')
print(summary)
print(dashLine)
print('model_summary:')
print(output)
print(dashLine)

----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
----------------------------------------------------------------------------------------------------
original_summary:
#Person1# and #Person2# are talking about the heavy storm last night, and #Person1#'s positive. #Person2# thinks the weat

Nice output!

### **4-2-3-With Prompt Engineering**
#### **Few-shot inference**

An example prompt:

In [42]:
# Few-shot prompt: two worked examples (indices 40 and 100) plus the dialogue to summarize
example_indices_full = [40, 100]
example_index_to_summarize = 400
few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)
print(few_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

what was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



        
Dialogue:

#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At 

In [43]:
# inference
summary = dataset['test'][example_index_to_summarize]['summary']

# The name `inputs` avoids shadowing Python's built-in `input`.
inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs['input_ids'],
        generation_config=generation_config
    )[0],
    skip_special_tokens=True
)

print(dashLine)
print('dialogue:')
print(dialogue)
print(dashLine)
print('original_summary:')
print(summary)
print(dashLine)
print('model_summary:')
print(output)
print(dashLine)

Token indices sequence length is longer than the specified maximum sequence length for this model (645 > 512). Running this sequence through the model will result in indexing errors


----------------------------------------------------------------------------------------------------
dialogue:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
----------------------------------------------------------------------------------------------------
original_summary:
#Person1# and #Person2# are talking about the heavy storm last night, and #Person1#'s positive. #Person2# thinks the weat

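I think few-shot prompting did not improve performance here; one shot is sufficient! Note that the few-shot prompt is 645 tokens long, which exceeds the 512-token maximum reported in the tokenizer warning above, and this may be part of the reason.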
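
The tokenizer warning above reports a 645-token prompt against a 512-token maximum. One way to guard against this is to check the prompt length before generating and drop examples that do not fit. This is a minimal sketch under stated assumptions: `build_few_shot_prompt` and `fit_shots` are hypothetical helpers, the prompt layout is a simplified variant of `make_prompt`, and the word-based counter is only a stand-in for real token counting.

```python
def build_few_shot_prompt(shots, query_dialogue):
    """Assemble worked examples followed by the dialogue to summarize."""
    parts = [
        f"Dialogue:\n\n{dialogue}\n\nwhat was going on?\n{summary}\n\n\n"
        for dialogue, summary in shots
    ]
    parts.append(f"Dialogue:\n\n{query_dialogue}\n\nwhat was going on?\n")
    return "".join(parts)


def fit_shots(shots, query_dialogue, count_tokens, max_tokens=512):
    """Drop trailing shots until the prompt fits within max_tokens.

    Returns an empty list if no shot fits; the query alone may still be too long.
    """
    kept = list(shots)
    while kept and count_tokens(build_few_shot_prompt(kept, query_dialogue)) > max_tokens:
        kept.pop()
    return kept


# Stand-in counter for illustration only: counts whitespace-separated words.
# With the notebook's tokenizer, one could instead pass
# lambda text: len(tokenizer(text)["input_ids"])
def count_words(text):
    return len(text.split())


# Made-up shots: one short example and one deliberately long example.
shots = [
    ("A: Hi. B: Hello.", "A and B greet each other."),
    ("A: " + "word " * 50, "A repeats a word."),
]
kept = fit_shots(shots, "A: Bye. B: See you.", count_words, max_tokens=40)
print(len(kept), "shot(s) kept")
```

Dropping shots from the end is only one possible policy; choosing shorter examples would be another way to stay within the limit.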