---
# Part 3 - Auto-Summarization
For the last part, you will use the sections of your book of choice that are

    a) the most descriptive of the crime and
    b) the most descriptive of the resolution of the crime (e.g., description and uncovering of the perpetrator).

Use sections that are at least 256 tokens long, and test the summarization with the T5 model. Then assess and analyze whether the key and relevant facts are present in the summarized material.
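
The 256-token minimum can be checked before a section is summarized. The snippet below is a minimal sketch, not part of the assignment: the helper names `count_tokens` and `is_long_enough` are hypothetical, and the default whitespace tokenizer is only a stand-in. In the notebook, passing the T5 tokenizer's `tokenize` method would count subword tokens instead, which usually yields a higher count than whitespace splitting.

```python
from typing import Callable, List


def count_tokens(text: str, tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Return how many tokens `tokenize` produces for `text`."""
    return len(tokenize(text))


def is_long_enough(text: str, min_tokens: int = 256,
                   tokenize: Callable[[str], List[str]] = str.split) -> bool:
    """Return True when `text` has at least `min_tokens` tokens."""
    return count_tokens(text, tokenize) >= min_tokens


# Illustrative inputs only (not taken from the book sections).
short_section = "Sir Charles was found dead."
long_section = "hound " * 300

print(is_long_enough(short_section))  # False: only 5 whitespace tokens
print(is_long_enough(long_section))   # True: 300 whitespace tokens
```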

For extra credit: write your own manual summary, then use the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) score to compare the auto-generated summary against the manually produced one.

---
### Imports

In [1]:
# io
import os
import re

# sentence tokenization
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
import spacy

# huggingface
import evaluate
from transformers import pipeline
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

---
### Globals

In [2]:
INPUT_DIR = 'part3-input'
OUTPUT_DIR = 'part3-output'

---
# Summarization

---
### Setup model

In [3]:
nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-large")
pipe = pipeline('summarization', model=model, tokenizer=tokenizer)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
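
T5 is a text-to-text model that is conditioned on a task prefix. When `model.generate` is called directly rather than through the `summarization` pipeline, the prefix is usually prepended by hand. The sketch below assumes the prefix `"summarize: "`, which is the one documented for T5 summarization. The helper `with_task_prefix` is hypothetical and not part of this notebook.

```python
from typing import List

# Assumed task prefix for T5 summarization.
SUMMARIZE_PREFIX = "summarize: "


def with_task_prefix(chunks: List[str], prefix: str = SUMMARIZE_PREFIX) -> List[str]:
    """Prepend the task prefix to every chunk that does not already start with it."""
    return [c if c.startswith(prefix) else prefix + c for c in chunks]


print(with_task_prefix(["Stapleton waited on the moor."]))
# ['summarize: Stapleton waited on the moor.']
```

The prefix is tokenized along with the text, so it counts against the model's input length budget.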

---
### Load in text

In [4]:
# get number of samples
sample_fn = os.listdir(INPUT_DIR)
n_samples = len(sample_fn)//2

# get filenames
inp_fn = [os.path.join(INPUT_DIR, f'inp{i}.txt') for i in range(1,n_samples+1)]
ref_fn = [os.path.join(INPUT_DIR, f'ref{i}.txt') for i in range(1,n_samples+1)]
print('input text filenames:', inp_fn)
print('reference text filenames:', ref_fn)

# load in texts
def clean_context(filename):
    with open(filename, 'r', encoding="utf8") as f:
        text = f.read()
    text = re.sub("\n", r' ', text)
    text = re.sub(r"\s{2,}", r' ', text)
    text = re.sub(r"“|”", r'"', text)
    text = re.sub(r"‘|’", r"'", text)
    text = re.sub(r"_", r'', text, flags=re.ASCII)  # pass flags by keyword; the 4th positional argument is count
    text = re.sub(r"\s{2,}", r' ', text)
    text = text.strip()
    return text
inp_text = [clean_context(fn) for fn in inp_fn]
ref_text = [clean_context(fn) for fn in ref_fn]


input text filenames: ['part3-input/inp1.txt']
reference text filenames: ['part3-input/ref1.txt']
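
The underscore-stripping step in `clean_context` relies on `re.sub`, whose signature is `re.sub(pattern, repl, string, count=0, flags=0)`. A flag passed in the fourth position is therefore read as `count`. The sketch below illustrates the difference on a made-up string; it assumes nothing beyond the standard library.

```python
import re

# Made-up text with 300 underscores, more than the integer value of re.ASCII (256).
text = "_" * 300 + "word"

# A flag in the fourth position acts as `count`, so only 256 underscores are removed.
as_count = re.sub(r"_", "", text, count=int(re.ASCII))
print(as_count.count("_"))  # 44

# Passing the flag by keyword leaves `count` at 0, which replaces every match.
as_flag = re.sub(r"_", "", text, flags=re.ASCII)
print(as_flag)  # word
```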


---
# Summarize text

---
### Automate Summarization

In [16]:
def prepare(text):
    # split the text into sentences with spaCy
    doc = nlp(text)
    sentences = [str(s) for s in doc.sents]

    length = 0
    chunk = ""
    chunks = []

    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence))

        if length + n_tokens <= tokenizer.max_len_single_sentence:
            # the sentence still fits into the current chunk
            chunk += sentence + " "
            length += n_tokens
        else:
            # the current chunk is full: store it and start a new one
            if chunk:
                chunks.append(chunk.strip())
            chunk = sentence + " "
            length = n_tokens

    # store the last chunk, which the loop itself never flushes
    if chunk:
        chunks.append(chunk.strip())

    return chunks


def get_output(chunks):
    outputs = []
    for chunk in chunks:
        # encode and summarize one chunk at a time (batch of size 1)
        encoded = tokenizer(chunk, return_tensors='tf')
        generated = model.generate(**encoded)
        outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))

    # split every generated summary into sentences, one per line
    out_sent = []
    for output in outputs:
        out_sent += sent_tokenize(output)
    output = "\n".join(out_sent)
    print(output)
    return output


def predict(text):
    chunks = prepare(text)
    output = get_output(chunks)
    return output
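
The chunking idea behind `prepare` can be tried without loading spaCy or the T5 tokenizer. The sketch below is a simplified stand-in, not the notebook's function: `chunk_sentences` is a hypothetical name, sentences are passed in as a list, and tokens are counted by whitespace unless another counter is supplied.

```python
from typing import Callable, List


def chunk_sentences(sentences: List[str], max_tokens: int,
                    count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most `max_tokens` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk when this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    # Flush whatever is left after the last sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


sents = ["The hound ran.", "Sir Charles fled down the alley.", "He fell.", "Stapleton waited."]
print(chunk_sentences(sents, max_tokens=8))
# ['The hound ran.', 'Sir Charles fled down the alley. He fell.', 'Stapleton waited.']
```

A single sentence longer than the budget still becomes its own oversized chunk in this sketch, so truncation or further splitting would have to be handled separately.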


---
### Summarize all the input texts

In [20]:
predictions = []
for inp in inp_text:
    predictions.append(predict(inp))




.
"The old gentleman was not to be decoyed.
He could not be deceived.
Stapleton waited.
He waited and waited, but he waited no longer.
"Stapleton was determined.
He was determined to kill Sir Charles.
"He waited for the old gentleman to come.
He had hoped that his wife might lure him to his ruin, but she refused.
"But he was not. "
he.
"It was not long before He "
.
It is a case which has remained unsolved for many years.
It has...
It was a dreadful sight to see that huge black creature bounding after its victim.
It must have been a terrible sight indeed to see.
In that gloomy tunnel it must indeed have been awful to see hound left.............. creature
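
One simple way to assess whether key facts survive summarization is to check which hand-picked terms appear in the generated text. The sketch below is an assumption about how such a check could look: `fact_presence` is a hypothetical helper, the term list is illustrative, and case-insensitive substring matching is a crude proxy that cannot detect paraphrases or garbled claims.

```python
from typing import Dict, List


def fact_presence(summary: str, key_terms: List[str]) -> Dict[str, bool]:
    """Map each key term to whether it occurs (case-insensitively) in the summary."""
    lowered = summary.lower()
    return {term: term.lower() in lowered for term in key_terms}


# Two sentences taken from the generated output; the term list is illustrative.
summary = "He was determined to kill Sir Charles. Stapleton waited."
terms = ["Stapleton", "Sir Charles", "hound", "wife"]

report = fact_presence(summary, terms)
print(report)
# {'Stapleton': True, 'Sir Charles': True, 'hound': False, 'wife': False}
print(sum(report.values()) / len(report))  # 0.5
```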


---
# Evaluation with ROUGE Score (Extra Credit)

---
### Setup Rouge Evaluation

In [None]:
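import evaluate as hf_evaluate  # alias: the function below reuses the name `evaluate`
rouge = hf_evaluate.load('rouge')

def evaluate(predictions, references):
    # compute ROUGE scores of the predictions against the references
    results = rouge.compute(
        predictions=predictions,
        references=references
    )
    return results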
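
For intuition about what a ROUGE score measures, ROUGE-1 can be approximated by counting overlapping unigrams between a prediction and a reference. The sketch below is a simplified illustration and not the implementation behind the `rouge` metric loaded above: it lowercases, splits on whitespace, and applies no stemming, so its numbers may differ from the library's. The function name `rouge1` and the example strings are made up.

```python
from collections import Counter
from typing import Dict


def rouge1(prediction: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 between two whitespace-tokenized strings."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("stapleton waited for sir charles",
                "stapleton waited on the moor for sir charles")
print(scores["precision"], scores["recall"])  # 1.0 0.625
```

Here every predicted word also appears in the reference, so precision is 1.0, while only 5 of the 8 reference words are covered, which gives a recall of 0.625.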
