<a href="https://colab.research.google.com/github/osullik/bc-autoreporter/blob/main/INT_Summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [1]:
!pip install sentencepiece
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration
from transformers import BartTokenizer, BartForConditionalGeneration
import numpy as np
import torch
import random
import re
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
device = torch.device("cpu")

if torch.cuda.is_available():
   print("Using GPU")
   device = torch.device("cuda:0")

Using GPU


# Read in data
Assume each set of evaluations corresponds to the same employee. Different employees will be read in separately and run through the model separately to generate per-employee summaries.

In [4]:
original_text = [{'entity': 'HomerSimpson', 'date': '2021-10-14', 'tagList': ['resilience', 'problem-solving', 'safetyconsciousness'], 'observation': 'On October 14th, 2021, @HomerSimpson was observed to perform to an excellent standard. This was evidenced by his quick response to a power outage in Sector 7G, restoring power and ensuring that critical systems remained operational. His actions show #resilience #problem-solving #safetyconsciousness.\n', 'sentiment': 69.08, 'adjectives': ['excellent', 'quick', 'critical', 'resilience', 'problem-solving'], 'observer': 'MontgomeryBurns', 'lastModified': '2023-04-09T01:16:32'}, {'entity': 'HomerSimpson', 'date': '2021-08-25', 'tagList': ['forgetfulness', 'lackofattention'], 'observation': 'On August 25th, @HomerSimpson was observed to perform to a poor standard. This was evidenced by his forgetting to inform Mr. Burns of an important meeting, resulting in a missed opportunity for the company. His actions show #forgetfulness and #lackofattention.\n', 'sentiment': -17.79, 'adjectives': ['poor', 'important', 'missed', 'forgetfulness', 'lackofattention'], 'observer': 'MontgomeryBurns', 'lastModified': '2023-04-09T01:16:32'}, {'entity': 'HomerSimpson', 'date': '2021-11-29', 'tagList': ['Conflict', 'Tension', 'TragicAccident'], 'observation': "On November 29th, @FrankGrimes was observed to still be at odds with @HomerSimpson. Despite attempts by other employees to mediate the situation, the two were unable to resolve their differences and tensions continued to escalate. This culminated in the tragic accident that took Frank's life later that day. #Conflict #Tension #TragicAccident\n", 'sentiment': -73.50999999999999, 'adjectives': ['still', 'other', 'unable', 'tragic', 'later', 'Conflict'], 'observer': 'MontgomeryBurns', 'lastModified': '2023-04-09T01:16:34'}]

# Preprocessing Functions

In [5]:
# generate chunks of text \ sentences <= 512 tokens
def nest_sentences(document):
  nested = []
  sent = ["summarize"]  # need to provide task instruction as the first token
  length = 1
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 512:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = ["summarize"]  # need to provide task instruction as the first token
      length = 1

  if len(sent) > 1:
    nested.append(sent)

  return nested

# Summarization function
Nested function feeds input sentence by sentence, which doesn't allow summarization across a report. Use generate_summary instead.

In [6]:
# generate summary on text with <= 512 tokens
def generate_summary(sentences, tokenizer, model):
  input_tokenized = tokenizer.encode(sentences, truncation=True, return_tensors='pt')
  input_tokenized = input_tokenized.to(device)
  summary_ids = model.to(device).generate(input_tokenized,
                                          num_beams=4,
                                          min_length=50,
                                          max_length=1000,
                                          early_stopping=True)
  output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
  return output

# Create Tokenizer
Corresponding T5 tokenizer from Hugging Face library

# Create Model
Pretrained T5 model from Hugging Face library. This is a generic model that can do translation, summarization, etc. We will pass it the 'summarize' command appended to the front of the input stream to trigger a summary output from the model.

https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426 

# Summarize
Generate summary text given preprocessed input text containing a set of reports for a given employee.

In [7]:
def employee_summarize(list_of_dicts):
  employee_record = ''
  for _dict in list_of_dicts:
    employee_record = employee_record + _dict['observation'].replace("\n"," ")
    
  tokenizer = T5Tokenizer.from_pretrained('t5-small')
  model = T5ForConditionalGeneration.from_pretrained('t5-small')

  nested_text = nest_sentences(employee_record)

  employee_summary = []
  for i in range(len(nested_text)):
    inputs = []

    for j in nested_text[i]:
      inputs.append(j.split())

    inputs_squeezed = np.concatenate(inputs, axis=0 )
    inputs_squeezed = [j + ' ' for j in inputs_squeezed]
    employee_summary.append(generate_summary(' '.join(inputs_squeezed), tokenizer, model))

  sub_summaries = [''.join(ele) for ele in employee_summary]
  return ''.join(sub_summaries)

In [8]:
employee_summarize(original_text)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


"@HomerSimpson was observed to perform to an excellent standard . this was evidenced by his quick response to a power outage in Sector 7G . his actions show #resilience #problem-solving #safetyconsciousness .@FrankGrimes was observed to still be at odds with @HomerSimpson . the two were unable to resolve their differences and tensions continued to escalate . this culminated in the tragic accident that took Frank's life later that day ."