# RLHF Dialogue Summarizer with PPO and Reward Model


## Background

Dialogue summarization is a core task in natural language processing (NLP): it aims to generate concise summaries of conversations while preserving key information. Reinforcement learning from human feedback (RLHF) has been shown to improve fine-tuned models, producing more accurate and contextually relevant summaries. This project applies RLHF techniques, specifically Proximal Policy Optimization (PPO), to a dialogue summarization model that was previously fine-tuned with PEFT.

## Project Objectives

The primary goal of this project is to develop a dialogue summarization model that uses RLHF to reduce the toxicity of its summaries. This involves:
- Fine-tuning a transformer-based model using RLHF.
- Implementing a reward model to evaluate summary quality.
- Using Proximal Policy Optimization (PPO) to iteratively improve performance.
- Evaluating the effectiveness of the model using established metrics.

In [1]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# Make sure the trl version is 0.11.4 when running this notebook
# pip install trl==0.11.4
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

from tqdm import tqdm
tqdm.pandas()

device = "mps" if torch.backends.mps.is_available() else "cpu"

## Dataset

This project uses the DialogSum dataset from Hugging Face, specifically the "knkarthick/dialogsum" dataset. It contains a diverse collection of dialogue transcripts along with human-written summaries. During preprocessing, only dialogues longer than 200 and at most 1,000 characters are kept, and each dialogue is wrapped in an instruction prompt to guide the summarization model.

In [2]:
# Load the DialogSum dataset from Hugging Face

model_name = "google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)
dataset_original

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [3]:
# Take the training split and keep only dialogues within the given length range
# Then wrap each dialogue with the instruction and tokenize the prompt
# Save the token ids in the field input_ids
# Save the decoded version of the prompt in the field query

def build_dataset(model_name, dataset_name, input_min_text_length, input_max_text_length):
    """
    Preprocess the dataset and split into training and test sets.

    Parameters:
    - model_name: name of the model
    - dataset_name: name of the dataset
    - input_min_text_length: minimum text length
    - input_max_text_length: maximum text length

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Processed dataset containing training and test sets.
    """
    dataset = load_dataset(dataset_name, split="train")
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)
    tokenizer = AutoTokenizer.from_pretrained(model_name, device="auto")

    def tokenize(sample):

        # wrap each dialogue with the instruction
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)
    return dataset_splits

dataset = build_dataset(model_name, huggingface_dataset_name, 200, 1000)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


Here I use the PEFT model that was fine-tuned with summarization instructions in a previous project as my base model. The training in that notebook was done on a subset of the data.

In [4]:
def print_number_of_trainable_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

In [5]:
# use fine-tuned PEFT model with summarization instructions from previous project
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16)

peft_model = PeftModel.from_pretrained(model,
                                       "intotheverse/peft-dialogue-summary-checkpoint",
                                       lora_config=lora_config,
                                       torch_dtype=torch.float32,
                                       device_map={"": device},
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_parameters(peft_model)}')

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
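
The trainable-parameter count above can be reproduced by hand. The following is a minimal sketch, assuming the usual FLAN-T5-base shapes (hidden size 768, 12 encoder blocks, 12 decoder blocks, and `q`/`v` projections that map 768 to 768); these shapes are assumptions and are not read from the checkpoint here.

```python
# Worked check of the LoRA parameter count reported above.
# Assumed FLAN-T5-base shapes (not read from the checkpoint):
d_model = 768          # input and output size of the q and v projections
encoder_blocks = 12
decoder_blocks = 12
rank = 32              # r in the LoraConfig above

# Each adapted projection gets two low-rank matrices:
# A with shape (rank, d_model) and B with shape (d_model, rank).
params_per_adapter = rank * d_model + d_model * rank

# Encoder blocks have one attention layer (self-attention);
# decoder blocks have two (self-attention and cross-attention).
attention_layers = encoder_blocks * 1 + decoder_blocks * 2

# target_modules=["q", "v"] adapts two projections per attention layer.
adapted_projections = attention_layers * 2

lora_params = adapted_projections * params_per_adapter
total_params = 251116800  # "all model parameters" from the output above

print(f"LoRA parameters: {lora_params}")
print(f"share of all parameters: {100 * lora_params / total_params:.2f}%")
```

Under these assumptions the arithmetic lands on the same 3,538,944 trainable parameters that the notebook prints, which suggests that only the LoRA adapters are being updated.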


This section prepares the model for PPO training, which requires a value model and a reference model. The value model is the policy model augmented with a value head: a small extra layer that predicts a scalar value for the generated response. It is trained to estimate the expected future reward.

The value model (`ppo_model`) is initialized from the pre-trained PEFT model with an added value head, making it trainable for reinforcement learning.

In [6]:
# Prepare the PPO model (policy with a value head)
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.float32,
                                                               is_trainable=True,
                                                               device_map={"": device})
ppo_model.generation_config = ppo_model.config

print(f'PPO model parameters to be updated (ValueHead + 769 parameters):\n{print_number_of_trainable_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 parameters):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)
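
As a rough illustration of what the value head adds, the sketch below mimics its linear layer in plain Python: one weight per hidden dimension plus a bias, mapping a 768-dimensional hidden state to a single scalar. The function name `value_head` and the toy weights are hypothetical; the real layer is the `Linear(in_features=768, out_features=1)` printed above, and dropout is ignored here.

```python
hidden_size = 768  # in_features of the value head's linear layer

# One weight per hidden dimension plus a single bias term.
value_head_params = hidden_size * 1 + 1


def value_head(hidden_state, weights, bias):
    """Map one hidden state vector to a scalar value estimate (linear layer only)."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias


# Toy inputs, chosen only to show the shape of the computation.
toy_hidden_state = [1.0] * hidden_size
toy_weights = [0.001] * hidden_size
toy_bias = 0.5

value = value_head(toy_hidden_state, toy_weights, toy_bias)
print(f"value head parameters: {value_head_params}")
print(f"toy value estimate: {value:.3f}")
```

The 769 parameters match the difference between the PPO model's trainable count (3,539,713) and the PEFT model's trainable count (3,538,944).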


PPO also requires a reference model, which serves as a fixed baseline to stabilize training. It remains unchanged throughout the PPO training process and is used to compute the KL-divergence penalty, which keeps the updated policy from drifting too far from the original model. Without this penalty, the policy could maximize the reward model's score at the expense of summary quality.

In [7]:
ref_model = create_reference_model(ppo_model).to(device)
print(f'Reference model parameters to be updated:\n{print_number_of_trainable_parameters(ref_model)}')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%
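
To make the role of the frozen reference model more concrete, here is a minimal sketch of one common way a KL penalty is folded into the reward: each generated token is penalized by `beta * (log_prob_policy - log_prob_ref)`, and the scalar score from the reward model is added on the final token. The function name, the coefficient `beta=0.2`, and the toy log-probabilities are assumptions for illustration, not values taken from this notebook.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Return per-token rewards: a KL penalty on every token, plus the score on the last one.

    policy_logprobs / ref_logprobs: log-probabilities that the trained policy and the
    frozen reference model assign to the same generated tokens.
    """
    rewards = [
        -beta * (policy_lp - ref_lp)
        for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += score  # the reward model's scalar score lands on the final token
    return rewards


# Toy example with three generated tokens (hypothetical numbers).
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.0]

rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=3.0)
print(rewards)
```

Tokens that the policy makes more likely than the reference model does are penalized, so a larger `beta` pulls the policy back toward the original summarizer.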


In [8]:
# Load the toxicity classifier that serves as the reward model
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map={"": device})
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map={"": device})
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


In [9]:
# Example of reward for a non-toxic text
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."
toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(device)
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

not_hate_index = 0
not_hate_reward = (logits[:, not_hate_index].tolist())
print(f'reward (high): {not_hate_reward}')

logits [not hate, hate]: [3.114098310470581, -2.4896152019500732]
probabilities [not hate, hate]: [0.9963293671607971, 0.003670633537694812]
reward (high): [3.114098310470581]


In [10]:
# Example of reward for a toxic text
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."
toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(device)
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

not_hate_reward = (logits[:, not_hate_index].tolist())
print(f'reward (low): {not_hate_reward}')

logits [not hate, hate]: [-0.692116916179657, 0.37227126955986023]
probabilities [not hate, hate]: [0.25647175312042236, 0.7435283064842224]
reward (low): [-0.692116916179657]
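
The probabilities printed in the two examples above follow from the logits through a softmax. The sketch below recomputes them in plain Python from the logged logits; the helper name `softmax` is introduced here for illustration only.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits in the order [not hate, hate], copied from the outputs above.
non_toxic_logits = [3.114098310470581, -2.4896152019500732]
toxic_logits = [-0.692116916179657, 0.37227126955986023]

non_toxic_probs = softmax(non_toxic_logits)
toxic_probs = softmax(toxic_logits)

print(f"non-toxic text [not hate, hate]: {non_toxic_probs}")
print(f"toxic text     [not hate, hate]: {toxic_probs}")
```

The raw `nothate` logit is used as the reward rather than the probability, presumably because it is not squashed into the range 0 to 1 and so keeps a wider spread between clearly safe and borderline text.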


In [11]:
# Create a Hugging Face pipeline to simplify the reward model
sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          device=device)

reward_logits_kwargs = {
    "top_k": None, # Return all scores (not just the top prediction)
    "function_to_apply": "none", # PPO uses raw logits for reward estimation
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None,
    "function_to_apply": "softmax", # Easier to interpret since it shows confidence scores
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Device set to use mps


Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.114098310470581}, {'label': 'hate', 'score': -2.4896152019500732}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706337705254555}]
For toxic text
[{'label': 'hate', 'score': 0.37227126955986023}, {'label': 'nothate', 'score': -0.692116916179657}]
[{'label': 'hate', 'score': 0.7435281872749329}, {'label': 'nothate', 'score': 0.25647175312042236}]
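
Note that in the output above the entries are ordered by score, so `nothate` comes first for the non-toxic text but second for the toxic text. A small lookup by label avoids depending on position. This is a minimal sketch; the helper name `score_for_label` is hypothetical and the sample data is copied from the output above.

```python
def score_for_label(pipeline_output, label="nothate"):
    """Return the score of the entry with the given label, wherever it appears."""
    for item in pipeline_output:
        if item["label"] == label:
            return item["score"]
    raise KeyError(f"label {label!r} not found in pipeline output")


# Raw-logit outputs copied from the cell output above.
non_toxic_output = [
    {"label": "nothate", "score": 3.114098310470581},
    {"label": "hate", "score": -2.4896152019500732},
]
toxic_output = [
    {"label": "hate", "score": 0.37227126955986023},
    {"label": "nothate", "score": -0.692116916179657},
]

print(score_for_label(non_toxic_output))
print(score_for_label(toxic_output))
```

Indexing with position 0 would instead return the `hate` logit for the toxic text, which would reward the wrong class.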


In [12]:
# Set up an evaluation metric for toxicity
toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement",
                                   toxic_label="hate",
                                   device=device)

Device set to use mps:0


In [13]:
toxicity_score = toxicity_evaluator.compute(predictions=[non_toxic_text, toxic_text])

print(f"Toxicity score for non-toxic text and toxic text: {toxicity_score['toxicity']}")

Toxicity score for non-toxic text and toxic text: [0.0036706337705254555, 0.7435281872749329]


In [14]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples,
                      device):
    """
    Generate summaries for samples of the dataset and measure their toxicity.

    Parameters:
        - model (trl model): model to be evaluated.
        - toxicity_evaluator: toxicity evaluator.
        - tokenizer (transformers tokenizer): tokenizer to be used.
        - dataset (dataset): dataset to be evaluated.
        - num_samples (int): index of the last sample to evaluate; the loop below
          processes samples 0..num_samples, i.e. num_samples + 1 samples.
        - device (str): device on which generation runs.
    Returns:
        tuple: a tuple containing two numpy.float64 values:
            - mean of samples toxicity
            - standard deviation of samples toxicity
    """

    toxicities = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]
        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids.to(device)
        generation_config = GenerationConfig(max_new_tokens=100,
                                             do_sample=True,
                                             top_k=0.0,
                                             top_p=1.0)
        response_token_ids = model.generate(input_ids=input_ids, generation_config=generation_config)
        generated_text_output = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text_output)])
        toxicities.extend(toxicity_score["toxicity"])

    mean = np.mean(toxicities)
    std = np.std(toxicities)
    return mean, std

In [33]:
# Assess toxicity of PEFT trained model before RLHF
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map={"": device})
mean_before_rlhf, std_before_rlhf = evaluate_toxicity(model=ref_model,
                                                      toxicity_evaluator=toxicity_evaluator,
                                                      tokenizer=tokenizer,
                                                      dataset=dataset["test"],
                                                      num_samples=10,
                                                      device=device)
print(f"Toxicity[mean, std] before RLHF: [{mean_before_rlhf}, {std_before_rlhf}]")

11it [00:27,  2.54s/it]

Toxicity[mean, std] before RLHF: [0.033246518705378876, 0.04670102926609221]





In [17]:
# Create a collator function; a data collator batches the training samples before they are passed to PPO
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
test_data = [
    {"query": "Summarize this text", "response": "Summary A", "reward": 0.8},
    {"query": "Explain this topic", "response": "Explanation B", "reward": 0.6}
]
print(collator(test_data))

{'query': ['Summarize this text', 'Explain this topic'], 'response': ['Summary A', 'Explanation B'], 'reward': [0.8, 0.6]}


In [18]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)



In [19]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

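    # Use the `nothate` logit as the reward. With `top_k=None` the pipeline returns the
    # entries sorted by score, so select the entry by label rather than by position.
    reward_tensors = [
        torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate"))
        for reward in rewards
    ]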

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

0it [00:00, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
1it [00:59, 59.91s/it]

objective/kl: 30.84881591796875
ppo/returns/mean: -0.8026288747787476
ppo/policy/advantages_mean: -0.003186337649822235
---------------------------------------------------------------------------------------------------


2it [02:11, 66.95s/it]

objective/kl: 34.12253952026367
ppo/returns/mean: -0.9107413291931152
ppo/policy/advantages_mean: 0.0091189444065094
---------------------------------------------------------------------------------------------------


3it [03:07, 61.77s/it]

objective/kl: 26.020965576171875
ppo/returns/mean: -0.6090663075447083
ppo/policy/advantages_mean: 0.04889874905347824
---------------------------------------------------------------------------------------------------


4it [04:00, 58.29s/it]

objective/kl: 18.702266693115234
ppo/returns/mean: -0.13440899550914764
ppo/policy/advantages_mean: 0.011839397251605988
---------------------------------------------------------------------------------------------------


5it [04:59, 58.48s/it]

objective/kl: 22.52288055419922
ppo/returns/mean: -0.27323660254478455
ppo/policy/advantages_mean: -0.008797511458396912
---------------------------------------------------------------------------------------------------


6it [05:55, 57.86s/it]

objective/kl: 24.54822540283203
ppo/returns/mean: -0.4767216742038727
ppo/policy/advantages_mean: 0.017924286425113678
---------------------------------------------------------------------------------------------------


7it [07:01, 60.43s/it]

objective/kl: 25.726707458496094
ppo/returns/mean: -0.5373175144195557
ppo/policy/advantages_mean: 0.00110650435090065
---------------------------------------------------------------------------------------------------


8it [08:03, 60.81s/it]

objective/kl: 25.71717071533203
ppo/returns/mean: -0.35913965106010437
ppo/policy/advantages_mean: 0.05765759199857712
---------------------------------------------------------------------------------------------------


9it [08:58, 59.16s/it]

objective/kl: 18.82454490661621
ppo/returns/mean: -0.27355071902275085
ppo/policy/advantages_mean: 0.03237628936767578
---------------------------------------------------------------------------------------------------


10it [09:54, 59.47s/it]

objective/kl: 22.212810516357422
ppo/returns/mean: -0.15175923705101013
ppo/policy/advantages_mean: 0.0038682222366333008
---------------------------------------------------------------------------------------------------





In [28]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer,
                                                                        dataset=dataset["test"],
                                                                        num_samples=10,
                                                                        device=device)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:30,  2.78s/it]

toxicity [mean, std] after detox: [0.02404899633785879, 0.026512699287254908]





In [34]:
mean_improvement = (mean_before_rlhf - mean_after_detoxification) / mean_before_rlhf
std_improvement = (std_before_rlhf - std_after_detoxification) / std_before_rlhf

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: 27.66%
std: 43.23%


In [21]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors_ppo = []

for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()
    summary_tensors_ppo.append(summary)

compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors_ppo[i]) for i in range(batch_size)]

# The pipeline returns entries sorted by score, so pick the `nothate` entry by label
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [
    next(item["score"] for item in reward if item["label"] == "nothate")
    for reward in rewards_before
]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [
    next(item["score"] for item in reward if item["label"] == "nothate")
    for reward in rewards_after
]

100%|██████████| 20/20 [01:40<00:00,  5.05s/it]


In [22]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

index,query,response_before,response_after,reward_before,reward_after,reward_diff
0,"Summarize the following conversation. #Person1#: I'm forming a music band. #Person2#: Do you already know how to play an instrument? #Person1#: Uh. . . Yeah! I'Ve told you a thousand times that I'm learning to play the drums. Now that I know how to play well, I would like to form a rock band. #Person2#: Aside from yourself, who are the other members of the band? #Person1#: We have a guy who plays guitar, and another who plays bass. Although we still haven't found anyone to be our singer. You...",<pad> #Person1# wants to form a music band. #Person1# says she is learning the drums. A singer is joining them and they can audition this weekend.</s>,"<pad> #Person1# tells #Person2# about the members of the call for a rock band to form, and #Person2# is a singer. #Person1# invites #Person2# to audition this weekend at #Person1#'s house. #Person2# doesn't have enough room for the amplifiers, microphones, or even the drums.</s>",2.505527,3.022619,0.517092
1,"Summarize the following conversation. #Person1#: I would like to order some internet today. #Person2#: What kind would you like? #Person1#: What kind of internet is there? #Person2#: You can get DEL or dial-up. #Person1#: Which of those two is best? #Person2#: I would recommend DEL. #Person1#: So that one better? #Person2#: It's better because it doesn't tie up the phone. #Person1#: What do you mean by that? #Person2#: DEL isn't connected through your phone line, but dial-up is. #Person1#: S...","<pad> Mr.1# wants to order some internet on DEL instead of dial-up, since it isn't connected through the phone line, so it can't use his phone.</s>",<pad> #Person1# requests to order some dial-up routers based on #Person2#'s recommendation. The suggestions include DEL which is better because it doesn't tie up the phone as it belongs to #Person1#.</s>,1.910335,2.411916,0.501582
2,"Summarize the following conversation. #Person1#: Mom, I just finished my paper. Can you proofread it before I hand it in? #Person2#: Sure, let's take a look. Sweetie, this is terrific. Your ideas are so original. #Person1#: Thanks. #Person2#: I can tell you worked hard on it. #Person1#: I really did! I started thinking about what I wanted to say three weeks ago. #Person2#: Well, it was definitely worth all the time. #Person1#: Let's just hope my teacher agrees. Summary: </s>",<pad> Mom gives #Person1# a proposal Melson gave her before handing it in. Mom thinks #Person1#'s paper sounds original and is thankful.</s>,<pad> #Person1#'s mom and her mom review the paper before handing it in by #Person1# and talk about it. #Person1#'s parents thank they for work on the piece and hope #Person1#'s teacher accepts the idea.</s>,2.404763,2.696096,0.291333
3,"Summarize the following conversation. #Person1#: Today more and more families have personal computers. People have wider range of choice to communicate with the outside world. #Person2#: Right. With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web. #Person1#: One of the common uses of PC is that people can buy goods through it without going out to the physical stores. #Person2#: Can you tell me how it is done? #Person1#: If a cus...",<pad> #Person1# tells #Person2# how the PC is used to buy things online without going to the physical stores. The service is perfect and the delivery is free of charge.</s>,<pad> #Person1# and #Person2# talk about the day to day use of personal computers. John shows #Person2# how companies create the technologies for posting and contacting information through their websites.</s>,2.406128,2.641305,0.235176
4,"Summarize the following conversation. #Person1#: Let's take a coffee break, shall we? #Person2#: I wish I could, but I can't. #Person1#: What keeps you so busy? You've been sitting there for hours. You've got to walk around. You just can't stay on the computer forever. #Person2#: Well, I am up to my neck in work. I've got to finish this report. Sarah needs it by noon. I don't want to be scolded if I can't finish my work by the deadline. #Person1#: I understand that, but you'd feel better if ...",<pad> #Person1# is raising concerns about #Person2#'s fight with an employee and asks #Person2# to take a coffee break. They decided to give a break only because they are sitting there for hours.</s>,"<pad> #Person1#, Sarah, and #Person2# are going to make a coffee break after their work problem arises.</s>",1.749866,1.934913,0.185047
5,"Summarize the following conversation. #Person1#: Hello. I want to reconfirm our flight to London. #Person2#: Yes, sir. Did you call the airline? #Person1#: Yes, I did. But I couldn't communicate with them in English. They speak only Spanish. So I need your help. #Person2#: Certainly, sir. What is the flight number and when are you leaving? #Person1#: We are taking IB 385 to London tomorrow at 1 p. m. #Person2#: Oh, I see, sir. We have the airline office inside the hotel. They have an English...",<pad> #Person1# calls the airline but they speak only Spanish. #Person2# helps #Person1# to reconfirm the flight to London and accepts the call.</s>,<pad> #Person1# wants to confirm a flight to London. #Person2# helps faccessed the flight number and when departing. #Person1# is about to go to London at 1 p.m.</s>,1.843381,2.026775,0.183394
6,"Summarize the following conversation. #Person1#: Amanda, how do you like this peaked cap? #Person2#: Didn't you say you want to buy a top hat? #Person1#: But I think this one fits me Well. Why don't you try on the sombrero in black? #Person2#: I don't like caps at all. Summary: </s>",<pad> Amanda likes the peaked cap she bought at her shopping place but she doesn't like caps.</s>,"<pad> Amanda loves the peaked cap which fits her and wants to buy a top hat, but #Person2# tells her it seems wrong.</s>",1.279083,1.442296,0.163214
7,"Summarize the following conversation. #Person1#: I'd like to have this cashed, please. #Person2#: Please put you name and address here. May I see your passport? #Person1#: Yes. #Person2#: How would you like it? #Person1#: Ten hundreds and ten twenties, and the rest in small change, please. #Person2#: OK. Here you are. Summary: </s>",<pad> #Person1# wants to make a cashed up with #Person2# and looks at every year.</s>,<pad> #Person1# wants to have a credit card cashed 10 hundreds and ten twenties in small change and #Person2# suffices the journey.</s>,1.432373,1.5668,0.134427
8,"Summarize the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary: </s>",<pad> #Person1#'s flight got in 11 minutes but #Person1#'s not ready to be done because #Person2# gives none of the baggage.</s>,"<pad> #Person1#'s flight got in 15 minutes ago but #Person2# doesn't come, so #Person1# asks for a flight new. They are frustrated.</s>",2.073201,2.151229,0.078028
9,"Summarize the following conversation. #Person1#: So how did you like the restaurant? #Person2#: Actually, it could have been better. #Person1#: What didn't you like about it? #Person2#: It is a new restaurant. I don't think they have their act together yet. #Person1#: What did you think about the food? #Person2#: I felt that the food was pretty mediocre. #Person1#: The service wasn't that great, either. #Person2#: I agree. The service was not good. #Person1#: Do you think that you want to tr...","<pad> #Person2# tells #Person1# #Person2#'s criticized the restaurant at first, the food. #Person2# says the service is not good and #Person2#'s is tired.</s>",<pad> #Person2# regrets the restaurant. It is a new restaurant but it's a new restaurant and there's no figure together. #Person1# remarks on the food. #Person2# thinks #Person2# has had too much of this restaurant.</s>,2.09129,2.159722,0.068432
