# Using a T5 model for Text Summarization

### 1. Install dependencies

In [1]:
!pip install datasets
!pip install transformers
!pip install pandas
!pip install accelerate -U
!pip install transformers[torch]
!pip install stable-baselines --upgrade
!pip install rouge_score
!pip install numpy
!pip install textblob
!pip install -U scikit-learn scipy matplotlib
!pip install torch

import datasets
from transformers import pipeline
import numpy as np
import re  # Used for data cleaning 
import nltk  # Natural Language Toolkit
nltk.download('punkt')                                                    


  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package punkt to /home/jordan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 2. Import dataset from file

This notebook can work with a variety of datasets. The code assumes that the data is stored in a local directory. For our project we use the Samsum Dataset Text Summarization set from Kaggle: https://www.kaggle.com/datasets/nileshmalode1/samsum-dataset-text-summarization/data

In [2]:
from datasets import load_dataset
import pandas as pd

train_file = '/home/data1/T5/CS6140_Final/datasets/Samsum/samsum-train.csv'
val_file = '/home/data1/T5/CS6140_Final/datasets/Samsum/samsum-validation.csv'
test_file = '/home/data1/T5/CS6140_Final/datasets/Samsum/samsum-test.csv'

train_dataset = pd.read_csv(train_file)
val_dataset = pd.read_csv(val_file)
test_dataset = pd.read_csv(test_file)


#### 2a. Display datasets 
Referenced from: https://www.kaggle.com/code/lusfernandotorres/text-summarization-with-large-language-models

In [3]:
def display_feature_list(features, feature_type):

    '''
    This function displays the features within each list for each type of data
    '''

    print(f"\n{feature_type} Features: ")
    print(', '.join(features) if features else 'None')

def describe_df(df):
    """
    This function prints some basic info on the dataset and
    sets global variables for feature lists.
    """

    global categorical_features, continuous_features, binary_features
    categorical_features = [col for col in df.columns if df[col].dtype == 'object']
    binary_features = [col for col in df.columns if df[col].nunique() <= 2 and df[col].dtype != 'object']
    continuous_features = [col for col in df.columns if df[col].dtype != 'object' and col not in binary_features]

    print(f"\n{type(df).__name__} shape: {df.shape}")
    print(f"\n{df.shape[0]:,.0f} samples")
    print(f"\n{df.shape[1]:,.0f} attributes")
    print(f'\nMissing Data: \n{df.isnull().sum()}')
    print(f'\nDuplicates: {df.duplicated().sum()}')
    print(f'\nData Types: \n{df.dtypes}')

    display_feature_list(categorical_features, 'Categorical')
    display_feature_list(continuous_features, 'Continuous')
    display_feature_list(binary_features, 'Binary')

    print(f'\n{type(df).__name__} Head: \n')
    display(df.head(5))
    print(f'\n{type(df).__name__} Tail: \n')
    display(df.tail(5))

In [4]:
describe_df(train_dataset)


DataFrame shape: (14731, 3)

14,731 samples

3 attributes

Missing Data: 
id          0
dialogue    0
summary     0
dtype: int64

Duplicates: 0

Data Types: 
id          object
dialogue    object
summary     object
dtype: object

Categorical Features: 
id, dialogue, summary

Continuous Features: 
None

Binary Features: 
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13818513,Amanda: I baked cookies. Do you want some?\r\...,Amanda baked cookies and will bring Jerry some...
1,13728867,Olivia: Who are you voting for in this electio...,Olivia and Olivier are voting for liberals in ...
2,13681000,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa...",Kim may try the pomodoro technique recommended...
3,13730747,"Edward: Rachel, I think I'm in ove with Bella....",Edward thinks he is in love with Bella. Rachel...
4,13728094,Sam: hey overheard rick say something\r\nSam:...,"Sam is confused, because he overheard Rick com..."



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
14726,13863028,Romeo: You are on my ‘People you may know’ lis...,Romeo is trying to get Greta to add him to her...
14727,13828570,Theresa: <file_photo>\r\nTheresa: <file_photo>...,Theresa is at work. She gets free food and fre...
14728,13819050,John: Every day some bad news. Japan will hunt...,Japan is going to hunt whales again. Island an...
14729,13828395,Jennifer: Dear Celia! How are you doing?\r\nJe...,Celia couldn't make it to the afternoon with t...
14730,13729017,Georgia: are you ready for hotel hunting? We n...,Georgia and Juliette are looking for a hotel i...


In [5]:
describe_df(val_dataset)


DataFrame shape: (818, 3)

818 samples

3 attributes

Missing Data: 
id          0
dialogue    0
summary     0
dtype: int64

Duplicates: 0

Data Types: 
id          object
dialogue    object
summary     object
dtype: object

Categorical Features: 
id, dialogue, summary

Continuous Features: 
None

Binary Features: 
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13817023,"A: Hi Tom, are you busy tomorrow’s afternoon?\...",A will go to the animal shelter tomorrow to ge...
1,13716628,Emma: I’ve just fallen in love with this adven...,Emma and Rob love the advent calendar. Lauren ...
2,13829420,Jackie: Madison is pregnant\r\nJackie: but she...,Madison is pregnant but she doesn't want to ta...
3,13819648,Marla: <file_photo>\r\nMarla: look what I foun...,Marla found a pair of boxers under her bed.
4,13728448,Robert: Hey give me the address of this music ...,Robert wants Fred to send him the address of t...



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
813,13829423,Carla: I've got it...\r\nDiego: what?\r\nCarla...,Carla's date for graduation is on June 4th. Di...
814,13727710,"Gita: Hello, this is Beti's Mum Gita, I wanted...",Bev is going on the school trip with her son. ...
815,13829261,"Julia: Greg just texted me\r\nRobert: ugh, del...",Greg cheated on Julia. He apologises to her. R...
816,13680226,"Marry: I broke my nail ;(\r\nTina: oh, no!\r\n...",Marry broke her nail and has a party tomorrow....
817,13862383,Paige: I asked them to wait and send the decla...,Paige wants to have the declaration sent later...


In [6]:
describe_df(test_dataset)


DataFrame shape: (819, 3)

819 samples

3 attributes

Missing Data: 
id          0
dialogue    0
summary     0
dtype: int64

Duplicates: 0

Data Types: 
id          object
dialogue    object
summary     object
dtype: object

Categorical Features: 
id, dialogue, summary

Continuous Features: 
None

Binary Features: 
None

DataFrame Head: 



Unnamed: 0,id,dialogue,summary
0,13862856,"Hannah: Hey, do you have Betty's number?\nAman...",Hannah needs Betty's number but Amanda doesn't...
1,13729565,Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric:...,Eric and Rob are going to watch a stand-up on ...
2,13680171,"Lenny: Babe, can you help me with something?\r...",Lenny can't decide which trousers to buy. Bob ...
3,13729438,"Will: hey babe, what do you want for dinner to...",Emma will be home soon and she will let Will k...
4,13828600,"Ollie: Hi , are you in Warsaw\r\nJane: yes, ju...",Jane is in Warsaw. Ollie and Jane has a party....



DataFrame Tail: 



Unnamed: 0,id,dialogue,summary
814,13611902-1,Alex: Were you able to attend Friday night's b...,Benjamin didn't come to see a basketball game ...
815,13820989,Jamilla: remember that the audition starts at ...,The audition starts at 7.30 P.M. in Antena 3.
816,13717193,"Marta: <file_gif>\r\nMarta: Sorry girls, I cli...","Marta sent a file accidentally,"
817,13829115,Cora: Have you heard how much fuss British med...,There was a meet-and-greet with James Charles ...
818,13818810,Rachel: <file_other>\r\nRachel: Top 50 Best Fi...,Rachel sends a list of Top 50 films of 2018. J...


### 3. Preprocessing Data
We can see in the data that some dialogues contain placeholder tags for files or photos, such as ```<file_photo>```. These need to be cleaned out. The summaries above report no missing values, but we also drop any null rows as a safeguard.

In [7]:
# The ID label is not necessary for this.
categorical_features.remove('id')

In [8]:
# Remove null values
train_dataset = train_dataset.dropna()
val_dataset = val_dataset.dropna()
test_dataset = test_dataset.dropna()

In [9]:
def clean_tags(text):
    # Remove placeholder tags such as <file_photo>
    clean = re.sub(r'<.*?>', '', text)

    # Drop dialogue turns that are empty once their tags are removed (e.g. "Marla: ")
    clean = '\n'.join([line for line in clean.split('\n') if not re.match(r'.*:\s*$', line)])

    return clean
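
As a quick check of the cleaning logic, here is a minimal, self-contained sketch that applies the same two regular expressions to a made-up dialogue. The function name `strip_tags_and_empty_turns` and the sample text are hypothetical and used only for illustration.

```python
import re

# Same two patterns as the cleaning cell above
TAG_PATTERN = re.compile(r'<.*?>')          # placeholder tags such as <file_photo>
EMPTY_TURN_PATTERN = re.compile(r'.*:\s*$')  # a line with nothing after its last colon

def strip_tags_and_empty_turns(text):
    """Remove tags, then drop dialogue lines that end right after a colon."""
    without_tags = TAG_PATTERN.sub('', text)
    kept = [line for line in without_tags.split('\n')
            if not EMPTY_TURN_PATTERN.match(line)]
    return '\n'.join(kept)

# Hypothetical sample: the first turn contains only a tag
sample = "Marla: <file_photo>\nMarla: look what I found\nTom: wow"
cleaned = strip_tags_and_empty_turns(sample)
print(cleaned)
```

Note that the second pattern also drops any line whose text happens to end with a colon, so a turn such as `A: here is the list:` would be removed as well.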

In [10]:
def clean_dataframe(df, column_labels):
    for col in column_labels:
        df[col] = df[col].fillna('').apply(clean_tags)
    return df

In [11]:
train_dataset = clean_dataframe(train_dataset,['dialogue', 'summary'])
test_dataset = clean_dataframe(test_dataset,['dialogue', 'summary'])
val_dataset = clean_dataframe(val_dataset,['dialogue', 'summary'])

### Print Sample

In [12]:
print(train_dataset["dialogue"].iloc[13950])

Steve: Can you remind me what pages we were supposed to read for tomorrow's classes?
Chris: You mean the International Political Relations?
Steve: Yup.
Chris: pages 10-25
Steve: Thanks a lot man.
Steve: By the way, do you think we really need to read that?
Chris: Yup. The guys passionate about these texts he makes students read. He gets angry when he sees people don't read it.
Chris: When he's not satisfied, he often punishes students with quicktests...
Steve: I hate the educational system, when everybody tries to discourage me from absorbing the knowledge...
Chris: Nothing comes easy.
Steve: So they say... 


In [13]:
# Converts pandas dataframe to a Dataset
train_dataset = datasets.Dataset.from_pandas(train_dataset)
test_dataset = datasets.Dataset.from_pandas(test_dataset)
val_dataset = datasets.Dataset.from_pandas(val_dataset)

In [14]:
# Verify the data structure
train_dataset

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 14731
})

### 4. Tokenize the data
For this we use the tokenizer from the `t5-base` checkpoint. Note that the model fine-tuned in section 5 is loaded from `t5-small`.
Tokenizer script referenced from: https://medium.com/askdata/train-t5-for-text-summarization-a1926f52d281

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')

max_source = 1000
max_target = 128

def tokenize(batch):
    
    tokenized_input = tokenizer(batch['dialogue'], max_length=max_source, truncation=True)

    # Tokenize the targets (summaries); text_target replaces the deprecated as_target_tokenizer()
    tokenized_labels = tokenizer(text_target=batch["summary"], max_length=max_target, truncation=True)

    tokenized_input['labels'] = tokenized_labels['input_ids']

    return tokenized_input

train_dataset_tokenized = train_dataset.map(tokenize, batched=True, remove_columns=['id', 'dialogue', 'summary'])
val_dataset_tokenized = val_dataset.map(tokenize, batched=True, remove_columns=['id', 'dialogue', 'summary'])
test_dataset_tokenized = test_dataset.map(tokenize, batched=True, remove_columns=['id', 'dialogue', 'summary'])

train_dataset_tokenized.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset_tokenized.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset_tokenized.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Map: 100%|██████████| 14731/14731 [00:02<00:00, 5600.67 examples/s]
Map: 100%|██████████| 818/818 [00:00<00:00, 6097.95 examples/s]
Map: 100%|██████████| 819/819 [00:00<00:00, 6127.87 examples/s]


Verify Dataset dictionaries and column labels are correct

In [16]:
print(train_dataset_tokenized)
print(val_dataset_tokenized)
print(test_dataset_tokenized)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 14731
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 818
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 819
})


### 5. Computing Metrics
To evaluate our model we use the ROUGE score. The metric is loaded through the datasets library and passed to the trainer through a `compute_metrics` function.
Code referenced from: https://www.kaggle.com/code/lusfernandotorres/text-summarization-with-large-language-models
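
To make the metric more concrete, the following is a simplified sketch of a ROUGE-1 F-measure: the unigram overlap between a prediction and a reference, combined into an F1 score. It assumes lowercase whitespace tokenization and no stemming, so it is not the implementation used by the library and will not reproduce its scores exactly. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# 3 shared tokens: precision 3/4, recall 3/6
score = rouge1_f("alex is watching millionaires",
                 "alex and sam are watching millionaires")
print(round(score, 4))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of n-gram counts.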

In [17]:
metric = datasets.load_metric('rouge')



In [20]:
def compute_metrics(eval_prediction):
    
    # Retrieve prediction and label
    predictions, labels = eval_prediction
    
    # Decoding predictions
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Obtaining the true labels tokens, while eliminating masked tokens
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    
    # Computing rouge score
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()} 

    # Add mean-generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
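
The label handling in `compute_metrics` can be illustrated without NumPy or a tokenizer. The sketch below assumes a pad token id of 0 (the value T5 uses) and the ignore index -100 that marks padded label positions. The token ids and helper names are made up for illustration.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
IGNORE_INDEX = -100   # padded label positions that the loss ignores

def restore_padding(label_rows):
    """Replace the ignore index with the pad id so the rows can be decoded."""
    return [[PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row]
            for row in label_rows]

def non_pad_length(row):
    """Count non-pad tokens, as done for the gen_len statistic."""
    return sum(1 for tok in row if tok != PAD_TOKEN_ID)

# Hypothetical label batch, padded to a common length with -100
labels = [[71, 56, 1, -100, -100], [12, 1, -100, -100, -100]]
restored = restore_padding(labels)
lengths = [non_pad_length(row) for row in restored]
print(restored, lengths)
```

Decoding -100 directly would fail because it is not a valid vocabulary index, which is why the replacement happens before `batch_decode`.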

In [35]:
# Instantiating Data Collator - required for seq2seq trainer
from transformers import T5ForConditionalGeneration, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

model = T5ForConditionalGeneration.from_pretrained('t5-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
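
For intuition, here is a simplified pure-Python sketch of the dynamic padding a seq2seq data collator performs: each batch is padded to its own longest sequence, with inputs padded using the pad token id and labels padded using -100. This is an illustration under those assumptions, not the library's implementation, which also returns tensors and can prepare decoder inputs when given the model. The function name `pad_batch` and the token ids are hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad a list of tokenized examples to the longest sequence in the batch."""
    max_input = max(len(f['input_ids']) for f in features)
    max_label = max(len(f['labels']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        input_pad = max_input - len(f['input_ids'])
        label_pad = max_label - len(f['labels'])
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * input_pad)
        # Attention mask is 1 for real tokens and 0 for padding
        batch['attention_mask'].append([1] * len(f['input_ids']) + [0] * input_pad)
        batch['labels'].append(f['labels'] + [label_pad_token_id] * label_pad)
    return batch

# Hypothetical tokenized examples of different lengths
features = [
    {'input_ids': [5, 6, 7, 1], 'labels': [9, 1]},
    {'input_ids': [8, 1], 'labels': [3, 4, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch instead of to a fixed global length avoids spending compute on padding for short dialogues.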

In [36]:
# Enable GPU power if available
import torch

if torch.cuda.is_available():
    print("Using GPU")
    device = torch.device('cuda')
else:
    print("Using CPU")
    device = torch.device('cpu')

torch.cuda.empty_cache()

Using GPU


### 6. Run Trainer
For training we use the `Seq2SeqTrainer` from Hugging Face: https://huggingface.co/docs/transformers/main_classes/trainer

The hyperparameters were tuned by training several candidate configurations for a small number of epochs each.
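
The relationship between batch size, gradient accumulation, and the number of optimizer steps can be sketched as follows. The device count of 3 is an assumption: it is not stated anywhere in this notebook, but it is the value that reproduces the 3070 total steps shown in the training log for 14,731 training rows, a per-device batch size of 8, 2 accumulation steps, and 10 epochs. The function name is hypothetical and the arithmetic is a simplification of what the trainer does.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, num_devices,
                          grad_accum_steps, epochs):
    """Approximate number of optimizer updates over a full training run."""
    # Batches drawn from the dataloader in one epoch (last partial batch kept)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer update every grad_accum_steps batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs

# num_devices=3 is an assumption inferred from the logged step count
steps = total_optimizer_steps(14731, 8, 3, 2, 10)
print(steps)
```

Under that assumption the effective batch size is 8 x 3 x 2 = 48 examples per optimizer update.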

In [37]:
output_dir = '/home/data1/T5/CS6140_Final/out/runs/'

training_args = Seq2SeqTrainingArguments(
    output_dir = output_dir,
    evaluation_strategy = "epoch",
    save_strategy = 'epoch',
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=10,
    predict_with_generate=True, # whether to use generate to calculate generative metrics (ROUGE score)
    fp16=False,
)

# Defining Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tokenized,
    eval_dataset=test_dataset_tokenized,  # per-epoch metrics in the log are computed on the test split
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model(output_dir + 'model')  # output_dir already ends with '/'

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

{'eval_loss': 1.8259080648422241, 'eval_rouge1': 39.826, 'eval_rouge2': 16.2435, 'eval_rougeL': 32.6084, 'eval_rougeLsum': 36.6319, 'eval_gen_len': 16.8217, 'eval_runtime': 18.6659, 'eval_samples_per_second': 43.877, 'eval_steps_per_second': 1.875, 'epoch': 1.0}
{'loss': 2.0582, 'learning_rate': 8.371335504885994e-05, 'epoch': 1.63}
{'eval_loss': 1.7647408246994019, 'eval_rouge1': 40.6385, 'eval_rouge2': 17.768, 'eval_rougeL': 33.7577, 'eval_rougeLsum': 37.5597, 'eval_gen_len': 16.8291, 'eval_runtime': 18.1169, 'eval_samples_per_second': 45.206, 'eval_steps_per_second': 1.932, 'epoch': 2.0}
{'eval_loss': 1.7425086498260498, 'eval_rouge1': 41.5634, 'eval_rouge2': 18.6244, 'eval_rougeL': 34.6747, 'eval_rougeLsum': 38.4635, 'eval_gen_len': 16.7021, 'eval_runtime': 18.0925, 'eval_samples_per_second': 45.267, 'eval_steps_per_second': 1.934, 'epoch': 3.0}
{'loss': 1.8789, 'learning_rate': 6.742671009771987e-05, 'epoch': 3.26}
{'eval_loss': 1.7244532108306885, 'eval_rouge1': 41.6636, 'eval_rouge2': 18.5605, 'eval_rougeL': 34.8702, 'eval_rougeLsum': 38.5296, 'eval_gen_len': 16.8022, 'eval_runtime': 17.8542, 'eval_samples_per_second': 45.872, 'eval_steps_per_second': 1.96, 'epoch': 4.0}
{'loss': 1.8017, 'learning_rate': 5.114006514657981e-05, 'epoch': 4.89}
{'eval_loss': 1.713733434677124, 'eval_rouge1': 42.1674, 'eval_rouge2': 18.8684, 'eval_rougeL': 35.3144, 'eval_rougeLsum': 38.8547, 'eval_gen_len': 16.8449, 'eval_runtime': 18.0718, 'eval_samples_per_second': 45.319, 'eval_steps_per_second': 1.937, 'epoch': 5.0}
{'eval_loss': 1.704947590827942, 'eval_rouge1': 42.3811, 'eval_rouge2': 19.0467, 'eval_rougeL': 35.2594, 'eval_rougeLsum': 39.0351, 'eval_gen_len': 16.9573, 'eval_runtime': 18.1806, 'eval_samples_per_second': 45.048, 'eval_steps_per_second': 1.925, 'epoch': 6.0}
{'loss': 1.7563, 'learning_rate': 3.485342019543974e-05, 'epoch': 6.51}
{'eval_loss': 1.7039817571640015, 'eval_rouge1': 42.7403, 'eval_rouge2': 19.2299, 'eval_rougeL': 35.5139, 'eval_rougeLsum': 39.3526, 'eval_gen_len': 16.8999, 'eval_runtime': 18.2717, 'eval_samples_per_second': 44.824, 'eval_steps_per_second': 1.916, 'epoch': 7.0}
{'eval_loss': 1.695377230644226, 'eval_rouge1': 43.0632, 'eval_rouge2': 19.4124, 'eval_rougeL': 35.6752, 'eval_rougeLsum': 39.5463, 'eval_gen_len': 16.8645, 'eval_runtime': 18.1417, 'eval_samples_per_second': 45.145, 'eval_steps_per_second': 1.929, 'epoch': 8.0}
{'loss': 1.7305, 'learning_rate': 1.8566775244299675e-05, 'epoch': 8.14}
{'eval_loss': 1.6927355527877808, 'eval_rouge1': 42.9158, 'eval_rouge2': 19.3969, 'eval_rougeL': 35.5905, 'eval_rougeLsum': 39.4768, 'eval_gen_len': 16.9084, 'eval_runtime': 17.8049, 'eval_samples_per_second': 45.999, 'eval_steps_per_second': 1.966, 'epoch': 9.0}
{'loss': 1.7104, 'learning_rate': 2.280130293159609e-06, 'epoch': 9.77}
{'eval_loss': 1.6928337812423706, 'eval_rouge1': 43.0563, 'eval_rouge2': 19.4298, 'eval_rougeL': 35.7404, 'eval_rougeLsum': 39.6116, 'eval_gen_len': 16.9267, 'eval_runtime': 18.1568, 'eval_samples_per_second': 45.107, 'eval_steps_per_second': 1.928, 'epoch': 10.0}

100%|██████████| 3070/3070 [37:00<00:00,  1.38it/s]

{'train_runtime': 2220.0479, 'train_samples_per_second': 66.354, 'train_steps_per_second': 1.383, 'train_loss': 1.8200511248958227, 'epoch': 10.0}
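
The `learning_rate` values in the log are consistent with a linear decay from the initial rate of 1e-4 to zero over the 3070 training steps with no warmup, which is the default schedule when no scheduler is specified in the training arguments. A minimal sketch of that schedule (the function name `linear_decay_lr` is hypothetical) reproduces the logged values at steps 500 and 3000:

```python
def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

TOTAL_STEPS = 3070  # total optimizer steps reported in the training log
BASE_LR = 1e-4      # learning_rate passed to the training arguments

lr_500 = linear_decay_lr(500, TOTAL_STEPS, BASE_LR)    # logged as 8.371335504885994e-05
lr_3000 = linear_decay_lr(3000, TOTAL_STEPS, BASE_LR)  # logged as 2.280130293159609e-06
print(lr_500, lr_3000)
```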


### 7. Inference
Although the trainer reports ROUGE scores after every epoch, it is also important to inspect the summaries the model produces on the validation set. Even when a generated summary does not exactly match the dataset summary, manual inspection suggests the results are promising, and the `summary_length` argument lets us control the maximum length of the returned summary.

In [29]:
model = '/home/data1/T5/CS6140_Final/out/runs/model'

In [28]:
def generate_summary(validation_dataset, model, index_position, summary_length):
    
    text = validation_dataset[index_position]['dialogue']
    summary = validation_dataset[index_position]['summary']
    summarizer = pipeline('summarization', model = model)
    generated_summary = summarizer(text, max_length=summary_length)
    
    print('Original Dialogue:\n')
    print(text)
    print('\n')
    print('Reference Summary:\n')
    print(summary)
    print('\n')
    print('Model-generated Summary:\n')
    print(generated_summary)

In [30]:
generate_summary(val_dataset, model, 35, 30)

Original Dialogue:

John: doing anything special?
Alex: watching 'Millionaires' on tvn
Sam: me too! He has a chance to win a million!
John: ok, fingers crossed then! :)


Reference Summary:

Alex and Sam are watching Millionaires.


Model-generated Summary:

[{'summary_text': "Alex is watching 'Millionaires' on tv. John and Sam are going to win a million if he wins"}]


In [31]:
generate_summary(val_dataset, model, 0, 30)

Original Dialogue:

A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))


Reference Summary:

A will go to the animal shelter tomorrow to get a puppy for her son. They already visited the shelter last Monday and the son chose the puppy. 


Model-generated Summary:

In [33]:
generate_summary(val_dataset, model, 100, 45)

Original Dialogue:

Mike: Do you know where Tomas is from?
Jenny: Eastern Europe I believe
Mike: sure, but what country exactly
Mike: I heard him speaking English today with Kamil, so I think he's not Polish
Jack: Really? I was sure he was Polish
Kyle: He's from Slovenia
Mike: oh, how cute, how do you know?
Kyle: We talked many times about Slovenia and his home town
Mike: Which is?
Kyle: Bled I think, close to the Alps
Jack: and why do you find Slovenia cute? hahaha
Mike: I think he's the only Slovenian in the company now
Jack: true, quite exotic


Reference Summary:

Mike, Jenny and Jack wonder where Tomas is from. Kyle is sure Tomas is from Slovenia. Mike thinks Tomas is now the only Slovenian in the company.


Model-generated Summary:

[{'summary_text': "Tomas is from Eastern Europe, but he's not Polish. He's from Slovenia and is the only Slovenian in the company now."}]
