<a href="https://colab.research.google.com/github/nayankote/meeting_summarization/blob/main/ami_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Installing dependencies

In [None]:
!pip install pytextrank
!python3 -m spacy download en_core_web_sm
# gensim.summarization was removed in gensim 4.0, so pin an older release
!pip install "gensim<4.0"
!pip install transformers datasets rouge-score nltk sentencepiece

# Data

**Download the AMI data from here**  
https://drive.google.com/file/d/1e87oOSDPdFwCGSh6-HHn63j0GSpgND5r/view?usp=sharing

In [None]:
import pandas as pd
import numpy as np

In [None]:
!tar -xzvf "/content/drive/MyDrive/Colab Notebooks/nvidia_task/ami.tar"

ami/
ami/test.source
ami/val.source
ami/train.target
ami/test.target
ami/val.target
ami/train.source


In [None]:
def load_split(name):
  # Row i pairs line i of <name>.source (transcript) with line i of <name>.target (summary)
  with open(f"/content/ami/{name}.source") as src, open(f"/content/ami/{name}.target") as tgt:
    return pd.DataFrame({"source" : src.readlines(), "target" : tgt.readlines()})

train, val, test = load_split("train"), load_split("val"), load_split("test")

Getting train, val and test stats

In [None]:
len(train), len(val), len(test)

(105, 17, 20)

In [None]:
train['source_length'] = train['source'].apply(lambda x : len(x.split()))
train['target_length'] = train['target'].apply(lambda x : len(x.split()))
val['source_length'] = val['source'].apply(lambda x : len(x.split()))
val['target_length'] = val['target'].apply(lambda x : len(x.split()))
test['source_length'] = test['source'].apply(lambda x : len(x.split()))
test['target_length'] = test['target'].apply(lambda x : len(x.split()))
print(train['source_length'].describe(), train['target_length'].describe(), val['source_length'].describe(), val['target_length'].describe(), test['source_length'].describe(), test['target_length'].describe(), sep='\n')

count     105.000000
mean     5012.247619
std      1992.087071
min       747.000000
25%      3366.000000
50%      5188.000000
75%      6549.000000
max      9113.000000
Name: source_length, dtype: float64
count    105.00000
mean     164.60000
std       49.73963
min       78.00000
25%      138.00000
50%      169.00000
75%      192.00000
max      530.00000
Name: target_length, dtype: float64
count      17.000000
mean     4921.058824
std      1944.404551
min      1489.000000
25%      2726.000000
50%      4868.000000
75%      6685.000000
max      7518.000000
Name: source_length, dtype: float64
count     17.000000
mean     149.529412
std       50.943741
min       41.000000
25%      131.000000
50%      175.000000
75%      188.000000
max      200.000000
Name: target_length, dtype: float64
count      20.000000
mean     4833.350000
std      2124.942087
min      1614.000000
25%      3158.000000
50%      4889.000000
75%      6041.250000
max      9625.000000
Name: source_length, dtype: float64
coun

In [None]:
# an example of the source sentences
train['source'][0]

"No. Mm no. Um 'kay um yeah. uh some uh research uh a about um designing of an interface. Um the uh last meeting uh we had a about um uh using a f few buttons. So uh um uh that's w what I what I want to uh uh to do in uh our design. So um finding an attractive uh way to control uh the remote control. Um the uh I found some uh something about uh speech uh recognition. So maybe uh we can uh use uh that. Um Uh and uh using a little uh display. So um findings. Um yeah just um we have just to focus on the primary um functions. So uh only uh buttons uh for uh sound, um for uh on-off, um uh shifting u up uh sa uh ca channel or uh down shifting down. Um uh let's see. Um yeah and Uh we uh need some uh new a attractive functions uh uh which attract uh uh people for using it. So uh it's uh like a speak uh speech uh recognition and um a special button for selecting uh subtitles. Just uh what we uh mentioned uh last uh meeting. Um and yeah overall um user-friendly. So uh using uh large large button

# Extractive Summarization

The input transcripts are very long, averaging roughly 5,012 words in the training set. Most open-source summarization models, such as T5, BART and Pegasus, accept at most 512 or 1024 input tokens, so feeding a transcript in directly would truncate most of it during tokenization. An extractive step is therefore needed first, to select the most information-rich sentences and fit them into the model's input budget. A PageRank-style algorithm such as TextRank, as implemented in pytextrank or gensim, is sufficient for this step. The final pipeline uses gensim, because its output preserves the original sentence order and is more readable.
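To make the idea of a PageRank-style extractive step concrete, here is a minimal, dependency-free sketch. It is not the pytextrank or gensim implementation: the sentence splitter, the Jaccard word-overlap similarity, and the names `split_sentences`, `rank_sentences` and `extract` are all assumptions made for illustration. It ranks sentences on a similarity graph, keeps the best ones that fit a word budget, and returns them in their original order.

```python
import re
from itertools import combinations


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Jaccard overlap between the word sets of two sentences.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def rank_sentences(sentences, damping=0.85, iterations=50):
    # PageRank by power iteration on the weighted sentence-similarity graph.
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract(text, word_budget):
    # Keep the highest-ranked sentences that fit in word_budget words,
    # then emit them in their original order.
    sentences = split_sentences(text)
    if not sentences:
        return ""
    scores = rank_sentences(sentences)
    by_score = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in by_score:
        n_words = len(sentences[i].split())
        if used + n_words > word_budget:
            continue
        chosen.append(i)
        used += n_words
    return " ".join(sentences[i] for i in sorted(chosen))
```

With a budget of about 1024 words, `extract(transcript, word_budget=1024)` would play the same role as the extractive step described above, although its rankings will not match those of either library.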

# Using pytextrank

In [None]:
import spacy
import pytextrank
nlp = spacy.load("en_core_web_sm")
# Register the TextRank component once, instead of reloading the model for every transcript
if "textrank" not in nlp.pipe_names : nlp.add_pipe("textrank")

def clean_text(text):
  cleaned_lines = []
  forbidden_list = {"um", "uh", "hmm", "mm-hmm", "mm", "oops", "'kay", "yeah"}
  for line in text.lower().replace("\n", " ").split("."):
    words = []
    for word in line.split():
      # match whole tokens only: a substring replace would also turn "number" into "nber"
      if word.strip(",?!") in forbidden_list : continue
      # drop single-letter stutters, but keep the words "a" and "i"
      if len(word) <= 1 and word not in ['a','i'] : continue
      words.append(word)
    if len(words)>2 : cleaned_lines.append(" ".join(words))

  return ". ".join(cleaned_lines)

def get_sentences(text, max_words=1024):
    doc = nlp(text)

    sentences = []
    len_sentence = 0
    # take the highest-ranked sentences until roughly max_words words have been collected
    for sent in doc._.textrank.summary(limit_phrases=100, limit_sentences=200, preserve_order=False):
        sentences.append(str(sent).strip())
        len_sentence += len(str(sent).split())
        if len_sentence >= max_words : break
    return " ".join(sentences)

#train['source_final_1'] = train['source'].apply(lambda x :get_sentences(clean_text(x)))
#val['source_final_1'] = val['source'].apply(lambda x : get_sentences(clean_text(x)))
#test['source_final_1'] = test['source'].apply(lambda x : get_sentences(clean_text(x)))

# Using gensim

**Text cleaning**  
Since the input data is conversational, it contains many filler words, affirmations, repetitions, and stutters that show up as stray single letters. `clean_text` removes these before summarization.

In [None]:
# gensim.summarization was removed in gensim 4.0; this import needs gensim<4.0
from gensim.summarization import summarize

def clean_text(text):
  cleaned_lines = []
  forbidden_list = {"um", "uh", "hmm", "mm-hmm", "mm", "oops", "'kay", "yeah"}
  for line in text.lower().replace("\n", " ").split("."):
    words = []
    for word in line.split():
      # match whole tokens only: a substring replace would also turn "number" into "nber"
      if word.strip(",?!") in forbidden_list : continue
      # drop single-letter stutters, but keep the words "a" and "i"
      if len(word) <= 1 and word not in ['a','i'] : continue
      words.append(word)
    if len(words)>2 : cleaned_lines.append(" ".join(words))

  return ". ".join(cleaned_lines)
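One detail worth keeping in mind when removing fillers such as "um" or "uh": `str.replace` works on substrings, so it also cuts those letters out of ordinary words. The sketch below contrasts substring removal with whole-token filtering on a made-up sentence. The helper names and the shortened filler list are assumptions for illustration only.

```python
FILLERS = ("um", "uh", "mm")  # shortened list, for illustration


def remove_fillers_substring(text):
    # Removes every occurrence of each filler, even inside other words.
    for filler in FILLERS:
        text = text.replace(filler, "")
    return text


def remove_fillers_tokens(text):
    # Removes a filler only when it is a whole whitespace-separated token.
    return " ".join(word for word in text.split() if word not in FILLERS)


sample = "um the summary uh mentions a number"
print(repr(remove_fillers_substring(sample)))  # "summary" and "number" are damaged
print(repr(remove_fillers_tokens(sample)))     # 'the summary mentions a number'
```

Token-level filtering leaves real words intact, at the cost of missing fillers that are glued to punctuation unless the punctuation is stripped first.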

In [None]:
train['source_final'] = train['source'].apply(lambda x : " ".join(summarize(clean_text(x), word_count=1024, split=True)))
val['source_final'] = val['source'].apply(lambda x : " ".join(summarize(clean_text(x), word_count=1024, split=True)))
test['source_final'] = test['source'].apply(lambda x : " ".join(summarize(clean_text(x), word_count=1024, split=True)))

In [None]:
train.head()

Unnamed: 0,source,target,source_length,target_length,source_final
0,No. Mm no. Um 'kay um yeah. uh some uh researc...,The project manager opened the meeting and rec...,4588,131,so it's like a speak speech recognition and a ...
1,What? Yeah. Yeah. We didn't make any uh Oh in ...,The project manager opened the meeting and the...,9113,150,you push the scroll button and it's claps out ...
2,Okay. B you think uh I I'm User Interface Mana...,The project manager opened the meeting and the...,5188,97,"okay, about what i found about different these..."
3,Yep. Um So hello everybody. So uh you everybod...,When the meeting opens the project manager giv...,7532,87,so the goal for today is to decide for a movie...
4,"You could change the vegetable, or fruit. Yeah...",The Project Manager reviewed the minutes from ...,4516,181,it's been a it's been a little bit difficult t...


In [None]:
# Optional: save the processed splits to Drive
# train.to_csv("/content/drive/MyDrive/Colab Notebooks/nvidia_task/train_processed.csv")
# val.to_csv("/content/drive/MyDrive/Colab Notebooks/nvidia_task/val_processed.csv")
# test.to_csv("/content/drive/MyDrive/Colab Notebooks/nvidia_task/test_processed.csv")

# Model training

**Abstractive summarization**  
Three of the best-scoring summarization models available on Hugging Face are tried: T5, BART and Pegasus. Each model's own tokenizer is used to convert the input text into token IDs.

In [None]:
!nvidia-smi

Tue May 18 17:24:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import nltk
import numpy as np
import random
nltk.download('punkt')
from datasets import load_metric

metric = load_metric('rouge')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
import torch
import transformers
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [None]:
model_checkpoint = "facebook/bart-large"
prefix = "summarize: " if "t5" in model_checkpoint else ""
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) #.task_specific_params['summarization']

max_input_length = 1024
max_target_length = 256

def tokenize_sentences(data):
  inputs = [prefix + text for text in data['source_final']]
  labels = [target for target in data['target']]
  tokenized_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True, add_special_tokens=True) # , return_tensors='pt'

  with tokenizer.as_target_tokenizer():
    labels = tokenizer(labels, max_length=max_target_length, padding=True, truncation=True, add_special_tokens=True) # , return_tensors='pt'

  model_inputs = {"input_ids" : tokenized_inputs['input_ids'], "attention_mask" : tokenized_inputs['attention_mask'], "decoder_attention_mask" : labels['attention_mask'], "labels" : labels['input_ids']}
  return model_inputs

train_tokenized = tokenize_sentences(train)
val_tokenized = tokenize_sentences(val)
test_tokenized = tokenize_sentences(test)
train_tokenized.keys()

dict_keys(['input_ids', 'attention_mask', 'decoder_attention_mask', 'labels'])
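A note on the labels produced above: with `padding=True` the label sequences are padded with the tokenizer's pad token id, and as far as I can tell those positions then count towards the training loss, because PyTorch's cross-entropy only skips positions labelled `-100` by default. A common variant is to mask label padding before training. The sketch below shows that masking on plain lists; `mask_pad_labels` is a hypothetical helper and the token ids are made up.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    # Replace padding positions in each label sequence with ignore_index,
    # so that a loss configured to skip ignore_index does not score them.
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Two made-up label sequences, padded to the same length with pad id 1.
batch = [[0, 713, 16, 2, 1, 1], [0, 100, 200, 300, 400, 2]]
masked = mask_pad_labels(batch, pad_token_id=1)
print(masked)  # [[0, 713, 16, 2, -100, -100], [0, 100, 200, 300, 400, 2]]
```

If labels are masked this way, any code that decodes them afterwards has to map `-100` back to the pad token id first, as `compute_metrics` does further down.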

In [None]:
class AMIDataset(torch.utils.data.Dataset):
  def __init__(self,encodings):
    super(AMIDataset, self).__init__()
    self.encodings=encodings

  def __getitem__(self, idx):
    #return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    return {key: val[idx] for key, val in self.encodings.items()}

  def __len__(self):
    return len(self.encodings['input_ids'])

train_dataset = AMIDataset(train_tokenized)
val_dataset = AMIDataset(val_tokenized)
test_dataset = AMIDataset(test_tokenized)

In [None]:
if "bart" in model_checkpoint:
  for i, batch in enumerate(train_dataset):
    if 2 not in batch['input_ids'] and 2 not in batch['labels'] : print(i, batch)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint) #.task_specific_params['summarization']
device = torch.device('cuda') if torch.cuda.is_available() else 'cpu'
batch_size = 4
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

def freeze_params(model, encoder=True, decoder=False, embedding=True, t5=False):
  if t5 : 
    encoder_params = model.encoder.parameters()
    decoder_params = model.decoder.parameters()
    embedding_params = model.shared.parameters()
  else : 
    encoder_params = model.model.encoder.parameters()
    decoder_params = model.model.decoder.parameters()
    embedding_params = model.model.shared.parameters()
  if encoder : 
    for param in encoder_params:
      param.requires_grad = False

  if decoder : 
    for param in decoder_params:
      param.requires_grad = False

  if embedding : 
    for param in embedding_params:
      param.requires_grad = False

freeze_params(model,encoder=True,decoder=False,embedding=False,t5 = "t5" in model_checkpoint)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,6.03874,8.4866,1.4905,6.5386,7.6515,20.0
2,No log,5.334928,15.6203,7.1909,13.6526,14.8954,20.0
3,No log,4.798257,15.4047,6.9054,13.1146,14.3633,20.0
4,No log,4.4058,15.7828,6.8987,13.4752,14.5758,20.0
5,No log,4.325643,15.7509,7.1491,13.5666,14.4923,20.0


TrainOutput(global_step=135, training_loss=5.189368127893519, metrics={'train_runtime': 184.1857, 'train_samples_per_second': 0.733, 'total_flos': 1638167150592000.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 4096, 'init_mem_gpu_alloc_delta': 1625367040, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 638976, 'train_mem_gpu_alloc_delta': 2431385088, 'train_mem_cpu_peaked_delta': 61440, 'train_mem_gpu_peaked_delta': 5273206272})

In [None]:
def quick_test(model,device,dataset,idx=-1,tensor=False):
  if idx == -1 : 
    idx = random.randrange(len(dataset))  # randint(0, len(dataset)) can return len(dataset) and raise IndexError
    print(idx)
  if tensor:
    print("input is tensor")
    decoded = tokenizer.decode(model.generate(dataset[idx]['input_ids'].view(1,-1).to(device),min_length = 0, max_length=max_target_length, attention_mask=dataset[idx]['attention_mask'].view(1,-1).to(device))[0], skip_special_tokens=True) #
    original = tokenizer.decode(dataset[idx]['labels'].to(device), skip_special_tokens=True)
  else : 
    decoded = tokenizer.decode(model.generate(torch.tensor(dataset[idx]['input_ids']).view(1,-1).to(device),max_length=180, attention_mask=torch.tensor(dataset[idx]['attention_mask']).view(1,-1).to(device), 
                                              early_stopping = True, no_repeat_ngram_size = 3,  top_p=0.9, top_k=15)[0], skip_special_tokens=True) #  forced_eos_token_id=tokenizer.eos_token_id,
    original = tokenizer.decode(torch.tensor(dataset[idx]['labels']).to(device), skip_special_tokens=True)
  # removing truncated last sentence : 
  if not decoded.endswith(".") :  # endswith also copes with an empty generation, where decoded[-1] would raise
    decoded = ".".join(decoded.split(".")[:-1]) + "."
  scores = metric.compute(predictions=[decoded], references=[original], use_stemmer=True)
  rouge1, rouge2, rougeL = scores['rouge1'].mid.fmeasure, scores['rouge2'].mid.fmeasure, scores['rougeL'].mid.fmeasure

  return decoded, original, rouge1, rouge2, rougeL
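The last step of `quick_test` drops a trailing sentence that was cut off by the generation length limit. Below is a small standalone sketch of that trimming so its edge cases are easy to check. `trim_to_last_sentence` is a hypothetical name, and unlike the cell above it leaves text with no period at all unchanged rather than reducing it to a single ".".

```python
def trim_to_last_sentence(text):
    # Keep everything up to and including the last period; text that already
    # ends with a period, is empty, or has no period is returned unchanged.
    text = text.strip()
    if not text or text.endswith("."):
        return text
    head, sep, _ = text.rpartition(".")
    return head + sep if sep else text


print(trim_to_last_sentence("The team met. The marketing expert presented the"))
# The team met.
```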

In [None]:
quick_test(model,device,val_dataset,3)

('The project manager opened the meeting by talking about the components that would be used to make the remote. The team discussed the energy source, the design of the case, the buttons, and the face-plates. The project manager discussed the possibility of using kinetic energy to power the device. The user interface designer presented the idea of using a scroll button to control the remote, and suggested that the user interface could be based on a fruit and vegetable theme. The marketing expert presented the use of a graphical user interface, which would include a number of different symbols. The industrial designer presented an idea for a light-up display, which could be incorporated into the design. The group discussed how to make a remote that could be easily reconstituted, and how to incorporate a scroll-button into it. The Project Manager then presented the project budget for the project. The Marketing Expert presented the',
 "The project manager opens this conceptual design meeti

In [None]:
def evaluate(dataset):
  # Generate a summary for every example and collect the per-example ROUGE F-measures
  results = {}
  for i in range(len(dataset)):
    decoded, original, rouge1, rouge2, rougeL = quick_test(model,device,dataset,i)
    results[i] = {"generated" : decoded, "original" : original, "rouge1" : rouge1, "rouge2" : rouge2, "rougeL" : rougeL}
  results = pd.DataFrame(results).transpose()
  # Mean score per metric, scaled to 0-100 and formatted to two decimals
  print(*(f"{100 * results[k].astype(float).mean():.2f}" for k in ("rouge1", "rouge2", "rougeL")))
  return results

test_results = evaluate(test_dataset)
val_results = evaluate(val_dataset)

47.28 17.25 26.40
45.79 15.83 24.27


In [None]:
torch.save(model.state_dict(), "/content/drive/MyDrive/Colab Notebooks/nvidia_task/model_bart_large")

In [None]:
import json
with open("/content/drive/MyDrive/Colab Notebooks/nvidia_task/bart_large_test.json",'w') as f: json.dump(test_results.to_json(),f)
with open("/content/drive/MyDrive/Colab Notebooks/nvidia_task/bart_large_val.json",'w') as f: json.dump(val_results.to_json(),f)

In [None]:
with open("/content/drive/MyDrive/Colab Notebooks/nvidia_task/bart_large_test.json",'r') as inf : a = json.load(inf)
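# The file holds a JSON string (to_json output passed through json.dump), so decode it once more instead of using eval
pd.DataFrame(json.loads(a))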

Unnamed: 0,generated,original,rouge1,rouge2,rougeL
0,The project manager opened the meeting by goin...,This last meeting started with the presentatio...,0.419453,0.122324,0.200608
1,The Project Manager opens the meeting. The Pro...,The project manager opened the meeting and had...,0.42623,0.198347,0.245902
2,The project manager opened the meeting by goin...,The Project Manager presented the goals of the...,0.574018,0.297872,0.308157
3,The project manager opens the meeting by going...,The project manager opened the meeting and sta...,0.479167,0.167832,0.256944
4,The Project Manager opened the meeting by stat...,The Project Manager presented the final cost o...,0.564706,0.252964,0.313725
5,The project manager opened the meeting by goin...,"For the conceptual design, the ID suggested to...",0.449568,0.127536,0.190202
6,The Project Manager opened the meeting by reca...,The project manager opened the meeting and rea...,0.523529,0.213018,0.294118
7,The project manager opened the meeting by goin...,The project manager opens the meeting by stati...,0.433526,0.087209,0.213873
8,The project manager opened the meeting by tell...,The interface specialist and industrial design...,0.507463,0.18797,0.268657
9,The project manager opened the meeting by goin...,The project manager recapped the decisions mad...,0.431095,0.120996,0.261484


Final results for facebook/bart-large (ROUGE-1, ROUGE-2, ROUGE-L F-measure, scaled to 0-100):  
Test  47.28  17.25  26.40  
Val  45.79  15.83  24.27
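
For intuition about what these numbers measure, ROUGE-1 is the overlap of single words between the generated and reference summaries, combined into an F-measure of precision and recall. The sketch below computes that by hand. It is a simplification and not the `rouge` metric used in this notebook: it splits on whitespace, applies no stemming, and does no aggregation across examples. `rouge1_f` is a hypothetical name.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    # Unigram-overlap F-measure between a prediction and a reference.
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 2 shared words: precision 2/2, recall 2/5, F-measure 4/7
print(round(rouge1_f("the cat", "the cat sat on mat"), 4))  # 0.5714
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.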