##Agenda
1. Install and import neccessary libraries
2. Dialogue dataset examples
3. Tokenizer encoding and decoding
4. LLM summarizes dialogues without promoting techniques
5. LLM summarizes dialogues with promoting techniques
  1. Zero-shot learning
  2. One-shot learning
  3. Few-shot learning

##Install and load libraries

In [1]:
%pip install -U datasets
%pip install transformers

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K  

In [2]:
#importing necessary libraries
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

In [3]:
#accessing dataset from HuggingFace (https://huggingface.co/datasets/knkarthick/dialogsum)
huggingface_dataset_name = "knkarthick/dialogsum"
dataset_dialogue = load_dataset(huggingface_dataset_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/442k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [4]:
#loading the pre-trained model and tokenizer from HuggingFace (https://huggingface.co/google/flan-t5-base)
model = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model)
model_flan_t5 = AutoModelForSeq2SeqLM.from_pretrained(model)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

##Dialogue dataset examples

In [6]:
#printing some examples from the dialogue dataset
indices_dialogue_example = [1, 15]
for i, index in enumerate(indices_dialogue_example):
    print('Example', i + 1)
    print ('----------------------')
    print('Input dialogue:')
    print ('----------------------')
    print(dataset_dialogue['test'][index]['dialogue'])
    print ('----------------------')
    print('Baseline summary:')
    print ('----------------------')
    print(dataset_dialogue['test'][index]['summary'])
    print('\n')

Example 1
----------------------
Input dialogue:
----------------------
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this

##Tokenizer encoding and decoding

In [8]:
#checking tokenizer
dialog = "Where are you from?"
#tokenizing
dialog_encoded = tokenizer(dialog, return_tensors='pt')
dialog_decoded = tokenizer.decode(dialog_encoded["input_ids"][0], skip_special_tokens=True)
print('encoded dialog:')
print(dialog_encoded["input_ids"][0])
print('\ndecoded dialog:')
print(dialog_decoded)

encoded dialog:
tensor([2840,   33,   25,   45,   58,    1])

decoded dialog:
Where are you from?


##LLM summarizes dialogues without promoting techniques

In [9]:
#printing some examples from the dialogue dataset
indices_dialogue_example = [1, 15]

for i, index in enumerate(indices_dialogue_example):
    dialogue = dataset_dialogue['test'][index]['dialogue']
    summary = dataset_dialogue['test'][index]['summary']
    #tokenizing
    input = tokenizer(dialogue, return_tensors='pt')
    #decoding
    output = model_flan_t5.generate(input['input_ids'], max_length=150, num_beams=5, early_stopping=True)
    #print(outputs)
    output = tokenizer.decode(output[0], skip_special_tokens=True)

    print('Example', i + 1)
    print ('----------------------')
    print('Input dialogue:')
    print ('----------------------')
    print(dataset_dialogue['test'][index]['dialogue'])
    print ('----------------------')
    print('Baseline summary:')
    print ('----------------------')
    print(dataset_dialogue['test'][index]['summary'])
    print ('----------------------')
    print(f'Flan-t5 model summary:')
    print ('----------------------')
    print(output)
    print ('----------------------')
    print('\n')

Example 1
----------------------
Input dialogue:
----------------------
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this

##LLM summarizes dialogues with promoting techniques
1. Zero-shot prompting
2. One-shot prompting
3. Few-shot prompting

### Zero-shot (no example prompt)

In [10]:
#no example provided, only the dialogue to summarize

indices_dialogue_example = [1, 15]

for i, index in enumerate(indices_dialogue_example):
    dialogue = dataset_dialogue['test'][index]['dialogue']
    summary = dataset_dialogue['test'][index]['summary']

    #passing a prompt without any examples
    prompt = f"What was the conversation?:\n\n{dialogue} Please summerize it in three lines:\n\n"

    #tokenizing
    input = tokenizer(prompt, return_tensors='pt')
    #decoding
    output = model_flan_t5.generate(input['input_ids'], max_length=150, num_beams=5, early_stopping=True)
    #print(outputs)
    zero_shot_summary = tokenizer.decode(output[0], skip_special_tokens=True)

    print('Input dialogue:', i+1)
    print ('----------------------')
    print(dataset_dialogue['test'][index]['dialogue'])
    print ('----------------------')
    print('Baseline summary:')
    print ('----------------------')
    print(dataset_dialogue['test'][index]['summary'])
    print ('----------------------')
    print(f'Flan-t5 model summary:')
    print ('----------------------')
    print(zero_shot_summary)
    print ('----------------------')
    print('\n')


Input dialogue: 1
----------------------
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much tim

###One-shot (one example prompt)

In [11]:
def one_shot_prompt(example_index_for_one_shot, dialogue_index_to_summarize):
    example_dialogue = dataset_dialogue['test'][example_index_for_one_shot[0]]['dialogue']
    example_summary = dataset_dialogue['test'][example_index_for_one_shot[0]]['summary']
    dialogue = dataset_dialogue['test'][dialogue_index_to_summarize]['dialogue']
    #passing a prompt with one example of dialogue-summary
    prompt = f"Dialogue:\n\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n"
    #instructing the prompt to summarize a new dialogue
    prompt += f"Now, summarize this dialogue:\n\n{dialogue}\n\n"
    return prompt

In [12]:
#setting one example index and dialogue to summarize
example_index_for_one_shot = [11]
dialogue_index_to_summarize = 200

#generating and printing one-shot prompt
one_shot_prompt_text = one_shot_prompt(example_index_for_one_shot, dialogue_index_to_summarize)
print(one_shot_prompt_text)

Dialogue:

#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday

Summary:
#Person1# has a dance with Brian at Brian's birthday party. Brian thinks #Person1# looks great and is popular.

Now, summarize this dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting progra

In [13]:
#accessing baseline summary for comparison
baseline_summary = dataset_dialogue['test'][dialogue_index_to_summarize]['summary']

#tokenizing and generating the summary using one-shot technique
inputs = tokenizer(one_shot_prompt_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model_flan_t5.generate(inputs['input_ids'], max_length=150, num_beams=5, early_stopping=True)
one_shot_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

#printing the generated summary for one-shot
print ('Dialogue')
print ('----------------------')
print(dialogue)
print ('----------------------')
print('Baseline summary:')
print ('----------------------')
print(baseline_summary)
print ('----------------------')
print(f'One shot summary by Flan-t5 model:')
print ('----------------------')
print(one_shot_summary)
print ('----------------------')
print('\n')


Dialogue
----------------------
#Person1#: I've had it! I am done working for a company that is taking me nowhere!
#Person2#: So what are you gonna do? Just quit?
#Person1#: That's exactly what I am going to do! I have decided to create my own company! I am going to write up a business plan, get some investors and start working for myself!
#Person2#: Have you ever written up a business plan before?
#Person1#: Well, no, it can't be that hard! I mean, all you have to do is explain your business, how you are going to do things and that's it, right?
#Person2#: You couldn't be more wrong! A well written business plan will include an executive summary which highlights the idea of the business in two pages or less. Then you need to describe your company with information such as what type of legal structure it has, history, etc.
#Person1#: Well, that seems easy enough.
#Person2#: Wait, there is more! Then you need to introduce and describe your goods or services. What they are and how they are

### Few-shot (multiple example prompt)

In [14]:
#including multiple examples of dialogue-summary before the task
def few_shot_prompt(example_indices_for_few_shot, dialog_index_to_summarize_for_few_shot):
    prompt = ""
    for i in example_indices_for_few_shot:
        example_dialogue = dataset_dialogue['test'][i]['dialogue']
        example_summary = dataset_dialogue['test'][i]['summary']
        #passing a prompt with three examples of dialog-summary
        prompt += f"--- Dialogue {i}: ---\n\n{example_dialogue}\n\n--- Summary: ---\n{example_summary}\n\n"

    #instructing the prompt to summarize a new dialogue
    dialogue = dataset_dialogue['test'][dialog_index_to_summarize_for_few_shot]['dialogue']
    prompt += f"--- Now, summarize this dialogue: ---\n\n{dialogue}.\n\n"
    return prompt

In [15]:
#setting multiple example indices and a dialogue to summarize
example_indices_for_few_shot = [1, 15, 55]
dialogue_index_to_summarize_for_few_shot = 200

#generating and printing one-shot prompt
few_shot_prompt_text = few_shot_prompt(example_indices_for_few_shot, dialogue_index_to_summarize_for_few_shot)
print(few_shot_prompt_text)

--- Dialogue 1: ---

#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please conti

In [16]:
#accessing baseline summary for comparison
baseline_summary = dataset_dialogue['test'][dialogue_index_to_summarize_for_few_shot]['summary']

#tokenizing and generating the summary using few-shot technique
inputs = tokenizer(few_shot_prompt_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model_flan_t5.generate(inputs['input_ids'], max_length=150, num_beams=5, early_stopping=True)
few_shot_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

#printing the generated summary for few-shot
print ('Dialogue')
print ('----------------------')
print(dialogue)
print ('----------------------')
print('Baseline summary:')
print ('----------------------')
print(baseline_summary)
print ('----------------------')
print(f'Few shot summary by Flan-t5 model:')
print ('----------------------')
print(few_shot_summary)
print ('----------------------')
print('\n')


Dialogue
----------------------
#Person1#: I've had it! I am done working for a company that is taking me nowhere!
#Person2#: So what are you gonna do? Just quit?
#Person1#: That's exactly what I am going to do! I have decided to create my own company! I am going to write up a business plan, get some investors and start working for myself!
#Person2#: Have you ever written up a business plan before?
#Person1#: Well, no, it can't be that hard! I mean, all you have to do is explain your business, how you are going to do things and that's it, right?
#Person2#: You couldn't be more wrong! A well written business plan will include an executive summary which highlights the idea of the business in two pages or less. Then you need to describe your company with information such as what type of legal structure it has, history, etc.
#Person1#: Well, that seems easy enough.
#Person2#: Wait, there is more! Then you need to introduce and describe your goods or services. What they are and how they are