<span style="font-size:20px">**Dialogue Summarization Using Generative AI and ROUGE Score Analysis** </span>

**Objective :** 

To summarize dialogues using various generative AI models and analyze the results using ROUGE scores.

**Model Selection :**

1) Amazon Titan Text Lite accessed through Amazon Bedrock
2) LangChain gpt-3.5-turbo using Chat Messages 
3) LangChain gpt-3.5-turbo using Promp template
4) HuggingFace's Flan-T5

**Dataset :** https://huggingface.co/datasets/knkarthick/dialogsum

**Evaluate Using ROUGE Scores** : Compute ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) for each model's summaries against the reference summaries.


In [1]:
#Install the dependencies

%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \

Collecting datasets==2.17.0
  Downloading datasets-2.17.0-py3-none-any.whl.metadata (20 kB)
Collecting xxhash (from datasets==2.17.0)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets==2.17.0)
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting huggingface-hub>=0.19.4 (from datasets==2.17.0)
  Downloading huggingface_hub-0.23.4-py3-none-any.whl.metadata (12 kB)
Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.6/536.6 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[?25hDownloading fsspec-2023.10.0-py3-none-any.whl (166 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m166.4/166.4 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[?25hDownloading huggingface_hub-0.23.4-py3-none-any.whl (402 kB)
[2K   [90m━━━━━━━━━━━━━

### Model 1 : Amazon Titan Text Lite from Amazon Bedrock

In [2]:
from datasets import load_dataset
import boto3, json

### Initializing AWS Bedrock Runtime Client






In [10]:
bedrock_runtime = boto3.client(service_name = "bedrock-runtime", region_name = "us-east-1")

### Import the dataset from hugging face

Upload some simple dialogues from the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. This dataset contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [57]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

### Print a couple of dialogues with their baseline summaries.

In [58]:
example_indices = [40, 200]

dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Exa

### Summarize the dialogue

To simplify, let's randomly select index 40 and focus on summarizing that specific dialogue.

In [59]:
import json

input_text = dataset['test'][40]['dialogue']
body = {
    "inputText": f"Summarize the conversation in a single line or two {input_text}",
    "textGenerationConfig": {
        "maxTokenCount": 4096,
        "stopSequences": [],
        "temperature": 0,
        "topP": 1
    }
}

kwargs = {
    "modelId": "amazon.titan-text-lite-v1",
    "contentType": "application/json",
    "accept": "application/json",
    "body": json.dumps(body)
}

print(kwargs)




{'modelId': 'amazon.titan-text-lite-v1', 'contentType': 'application/json', 'accept': 'application/json', 'body': '{"inputText": "Summarize the conversation in a single line or two #Person1#: What time is it, Tom?\\n#Person2#: Just a minute. It\'s ten to nine by my watch.\\n#Person1#: Is it? I had no idea it was so late. I must be off now.\\n#Person2#: What\'s the hurry?\\n#Person1#: I must catch the nine-thirty train.\\n#Person2#: You\'ve plenty of time yet. The railway station is very close. It won\'t take more than twenty minutes to get there.", "textGenerationConfig": {"maxTokenCount": 4096, "stopSequences": [], "temperature": 0, "topP": 1}}'}


### Get the summarized text from the model and store in completion_bedrock

In [60]:
response = bedrock_runtime.invoke_model(**kwargs)
response

{'ResponseMetadata': {'RequestId': '5db42d2e-1b58-41db-8339-449b30416705',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Wed, 03 Jul 2024 09:50:06 GMT',
   'content-type': 'application/json',
   'content-length': '240',
   'connection': 'keep-alive',
   'x-amzn-requestid': '5db42d2e-1b58-41db-8339-449b30416705',
   'x-amzn-bedrock-invocation-latency': '1426',
   'x-amzn-bedrock-output-token-count': '35',
   'x-amzn-bedrock-input-token-count': '121'},
  'RetryAttempts': 1},
 'contentType': 'application/json',
 'body': <botocore.response.StreamingBody at 0x7f85c3441a50>}

In [61]:
response_body = json.loads(response.get('body').read())
response_body

{'inputTextTokenCount': 121,
 'results': [{'tokenCount': 35,
   'outputText': '\nPerson 1 is rushing to catch a nine-thirty train, but Person 2 tells them they have plenty of time and that the railway station is close.',
   'completionReason': 'FINISH'}]}

In [62]:
completion_bedrock = response_body.get('results')[0].get('outputText')
completion_bedrock

'\nPerson 1 is rushing to catch a nine-thirty train, but Person 2 tells them they have plenty of time and that the railway station is close.'

In [63]:
example_indices = [40]
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION :\n{completion_bedrock}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION :

Person 1 is rushing to catch a nine-thirty train, but Person 2 tells them they have pl

### Model 2 : gpt-3.5-turbo from langchain and use Chat Messages to generate summaries

In [29]:
%pip install langchain
%pip install langchain_community
%pip install openai
%pip install tiktoken

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting openai
 

In [30]:
import os 
open_api_key="sk-proj-t####"
os.environ["OPENAI_API_KEY"]=open_api_key

In [64]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import(
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [65]:
speech = dataset['test'][40]['dialogue']
speech

"#Person1#: What time is it, Tom?\n#Person2#: Just a minute. It's ten to nine by my watch.\n#Person1#: Is it? I had no idea it was so late. I must be off now.\n#Person2#: What's the hurry?\n#Person1#: I must catch the nine-thirty train.\n#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there."

### Get the summarized text from the model and store in completion_langchain

In [66]:
chat_messages=[
    SystemMessage(content='You are an expert assistant with expertize in summarizing speeches'),
    HumanMessage(content=f'Please provide a short and concise summary of the following speech in not more than one or two sentences:\n TEXT: {speech}')
]

llm=ChatOpenAI(model_name='gpt-3.5-turbo')

In [67]:
llm.get_num_tokens(speech)

102

In [68]:
completion_langchain = llm.invoke(chat_messages).content
completion_langchain

"Person 1 realizes it is almost nine o'clock and needs to catch a train at 9:30, but Person 2 reassures them they have plenty of time as the train station is nearby and it will only take around twenty minutes to get there."

In [69]:
example_indices = [40]
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION :\n{completion_langchain}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION :
Person 1 realizes it is almost nine o'clock and needs to catch a train at 9:30, but Per

### Model 3 : gpt-3.5-turbo from langchain and use Prompt template

In [70]:
from langchain.chains import LLMChain
from langchain import PromptTemplate

In [71]:
generic_template='''
Write a summary of the following speech in not more than one or two sentences:
Speech : `{speech}`

'''
prompt=PromptTemplate(
    input_variables=['speech'],
    template=generic_template
)

In [72]:
complete_prompt = prompt.format(speech=speech)

In [73]:
llm.get_num_tokens(complete_prompt)

123

### Get the summarized text from the model and store in completion_langchain_prompt

In [74]:
from langchain_core.output_parsers import StrOutputParser

chain = prompt | llm | StrOutputParser()
completion_langchain_prompt = chain.invoke({"speech": speech})
completion_langchain_prompt

'Person 1 realizes it is later than they thought and needs to leave to catch the nine-thirty train, but Person 2 reassures them there is still plenty of time as the railway station is close by.'

In [75]:
example_indices = [40]
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION :\n{completion_langchain_prompt}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION :
Person 1 realizes it is later than they thought and needs to leave to catch the nine-th

### Model 4 : Flan-T5 from Hugging Face

In [7]:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig
import torch

In [76]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)



### Get the summarized text from the model and store in completion_hugging

In [77]:
index = 40

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation in not more than one or two lines.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
completion_hugging = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION:\n{completion_hugging}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation in not more than one or two lines.

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

---------------------------------------------------------------------------------------------------
MODEL GENERATION:
The train is about to leave.


### Store the summarized text from each of the model into a dataframe. This would help us manually check the resluts.

In [82]:
import pandas as pd

human_baseline_summaries = dataset['test'][40]['summary']
summaries = [(human_baseline_summaries, completion_bedrock, completion_langchain, completion_langchain_prompt, completion_hugging)]
df = pd.DataFrame(summaries, columns = ['human_baseline_summaries', 'bedrock_titan', 'langchain_gpt', 'langchain_prompt_gpt', 'hugging_flant5'])
df.head()

Unnamed: 0,human_baseline_summaries,bedrock_titan,langchain_gpt,langchain_prompt_gpt,hugging_flant5
0,#Person1# is in a hurry to catch a train. Tom ...,\nPerson 1 is rushing to catch a nine-thirty t...,Person 1 realizes it is almost nine o'clock an...,Person 1 realizes it is later than they though...,The train is about to leave.


### Calculate the ROUGE score for each of the model

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating the quality of automatic summaries by comparing them to a set of reference summaries (typically human-generated). Here's what each ROUGE metric measures:

1. **ROUGE-1 (unigram overlap):**
   - Measures the overlap of unigrams (individual words) between the generated summary and the reference summaries.
   - Example: If a reference summary contains "good morning" and the generated summary contains "morning is good", ROUGE-1 would count "morning" and "good" as overlapping unigrams.

2. **ROUGE-2 (bigram overlap):**
   - Measures the overlap of bigrams (sequences of two adjacent words) between the generated summary and the reference summaries.
   - Example: If a reference summary contains "good morning" and the generated summary contains "morning is good", ROUGE-2 would count "morning is" and "is good" as overlapping bigrams.

3. **ROUGE-L (longest common subsequence):**
   - Measures the longest common subsequence (LCS) between the generated summary and the reference summaries. LCS is the longest sequence that can be found in both the generated and reference summaries without reordering the words.
   - Example: If a reference summary is "The cat sat on the mat" and the generated summary is "A cat sat on a mat", ROUGE-L would find the longest common subsequence as "cat sat on mat".

4. **ROUGE-Lsum (average ROUGE-L):**
   - Measures the average of the ROUGE-L scores of the individual sentences in the generated summary against the reference summaries. This metric is useful for multi-sentence summaries.
   - Example: If a generated summary consists of multiple sentences, ROUGE-Lsum calculates the average ROUGE-L score across all sentences.

### Interpretation:

- **Higher Scores**: A higher ROUGE score (close to 1.0) indicates a better overlap or similarity between the generated summary and the reference summaries, suggesting better quality in terms of content overlap.
  


In [92]:
import evaluate
rouge = evaluate.load('rouge')

min_length = min(len(completion_bedrock), len(human_baseline_summaries))
completion_bedrock = completion_bedrock[:min_length]
bedrock_model_results = rouge.compute(
    predictions=completion_bedrock,
    references=human_baseline_summaries[0:min_length],
    use_aggregator=True,
    use_stemmer=True,
)

min_length = min(len(completion_langchain), len(human_baseline_summaries))
completion_langchain = completion_langchain[:min_length]
langchain_model_results = rouge.compute(
    predictions=completion_langchain,
    references=human_baseline_summaries[0:min_length],
    use_aggregator=True,
    use_stemmer=True,
)

min_length = min(len(completion_langchain_prompt), len(human_baseline_summaries))
completion_langchain_prompt = completion_langchain_prompt[:min_length]
langchain2_model_results = rouge.compute(
    predictions=completion_langchain_prompt,
    references=human_baseline_summaries[0:min_length],
    use_aggregator=True,
    use_stemmer=True,
)

min_length = min(len(completion_hugging), len(human_baseline_summaries))
completion_hugging = completion_hugging[:min_length]
hugging_model_results = rouge.compute(
    predictions=completion_hugging,
    references=human_baseline_summaries[0:min_length],
    use_aggregator=True,
    use_stemmer=True,
)



BEDROCK TITAN TEXT LITE MODEL:
{'rouge1': 0.11627906976744186, 'rouge2': 0.0, 'rougeL': 0.11627906976744186, 'rougeLsum': 0.11627906976744186}
LANGCHIAN OPENAI gpt-3.5-turbo MODEL:
{'rouge1': 0.046511627906976744, 'rouge2': 0.0, 'rougeL': 0.046511627906976744, 'rougeLsum': 0.046511627906976744}
LANGCHIAN OPENAI gpt-3.5-turbo MODEL WITH PROMPT TEMPLATE:
{'rouge1': 0.08139534883720931, 'rouge2': 0.0, 'rougeL': 0.08139534883720931, 'rougeLsum': 0.08139534883720931}
HUGGING FLANT5 MODEL:
{'rouge1': 0.10714285714285714, 'rouge2': 0.0, 'rougeL': 0.10714285714285714, 'rougeLsum': 0.10714285714285714}


### Analysis:

### ROUGE-1 Score Comparison: 
Among the models evaluated, the BEDROCK TITAN TEXT LITE MODEL achieved the highest ROUGE-1 score of 0.116. This indicates that it had the highest overlap in unigrams (individual words) with the reference summaries compared to the other models.

### Other ROUGE Metrics: 
All models had a ROUGE-2 score of 0, indicating that none of them performed well in capturing bigram overlaps with the reference summaries. ROUGE-L and ROUGE-Lsum scores were comparable across the models, with slight variations.

In [93]:
print('BEDROCK TITAN TEXT LITE MODEL:')
print(bedrock_model_results)
print('\n')
print('LANGCHIAN OPENAI gpt-3.5-turbo MODEL:')
print(langchain_model_results)
print('\n')
print('LANGCHIAN OPENAI gpt-3.5-turbo MODEL WITH PROMPT TEMPLATE:')
print(langchain2_model_results)
print('\n')
print('HUGGING FLANT5 MODEL:')
print(hugging_model_results)

BEDROCK TITAN TEXT LITE MODEL:
{'rouge1': 0.11627906976744186, 'rouge2': 0.0, 'rougeL': 0.11627906976744186, 'rougeLsum': 0.11627906976744186}


LANGCHIAN OPENAI gpt-3.5-turbo MODEL:
{'rouge1': 0.046511627906976744, 'rouge2': 0.0, 'rougeL': 0.046511627906976744, 'rougeLsum': 0.046511627906976744}


LANGCHIAN OPENAI gpt-3.5-turbo MODEL WITH PROMPT TEMPLATE:
{'rouge1': 0.08139534883720931, 'rouge2': 0.0, 'rougeL': 0.08139534883720931, 'rougeLsum': 0.08139534883720931}


HUGGING FLANT5 MODEL:
{'rouge1': 0.10714285714285714, 'rouge2': 0.0, 'rougeL': 0.10714285714285714, 'rougeLsum': 0.10714285714285714}
