---
# Part 3 - Auto-Summarization
For the last part, you will use the sections of your book of choice that are

    a) the most descriptive of the crime and
    b) the most descriptive of the resolution of the crime (e.g., description and uncovering of the perpetrator).

You will use sections that are at least 256 tokens long, and you will test the summarization using the T5 model. You will then assess and analyze whether the key and relevant facts are present in the summarized material.
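
One rough way to check for key facts is to list them by hand and test which ones appear in the generated summary. The sketch below assumes plain case-insensitive substring matching; the `fact_coverage` helper, the sample summary, and the fact list are hypothetical and only illustrate the idea. A manual read of the summary is still needed, since paraphrased facts will not match.

```python
import re

def fact_coverage(summary, key_facts):
    """Split key_facts into (found, missing) using case-insensitive substring matching."""
    # Collapse whitespace so line breaks inside the summary do not hide a match.
    normalized = re.sub(r"\s+", " ", summary.lower())
    found, missing = [], []
    for fact in key_facts:
        target = re.sub(r"\s+", " ", fact.lower().strip())
        if target in normalized:
            found.append(fact)
        else:
            missing.append(fact)
    return found, missing

# Hypothetical summary and hand-picked facts, for illustration only.
summary = "he then put pressure upon Mrs. Lyons to write this letter"
key_facts = ["Mrs. Lyons", "letter", "hound"]
found, missing = fact_coverage(summary, key_facts)
print("found:", found)
print("missing:", missing)
```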

For extra credit: create your own manually summarized content, then use the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) score to compare the auto-summarized content against the manually produced summaries.
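
To make the metric concrete, here is a minimal sketch of ROUGE-N as n-gram overlap between a prediction and a reference. It assumes lowercasing and whitespace tokenization, so its numbers will not match the library's exactly (the library applies its own tokenization and optional stemming). The `ngrams` and `rouge_n` names and the sample sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Precision, recall, and F1 of clipped n-gram overlap (simplified ROUGE-N)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative sentences only, not taken from the evaluation data.
scores_1 = rouge_n("the hound chased the baronet", "the hound chased sir charles", n=1)
scores_2 = rouge_n("the hound chased the baronet", "the hound chased sir charles", n=2)
print(scores_1)
print(scores_2)
```

Higher-order n-grams are stricter, which is why ROUGE-2 is usually lower than ROUGE-1 for the same pair of texts.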

---
### Imports

In [1]:
# io
import os
import re

# sentence tokenization
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
import spacy

# huggingface
from transformers import pipeline


[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

---
### Globals

In [2]:
INPUT_DIR = 'part3-text'
OUTPUT_DIR = 'part3-text'

---
# Summarization

---
### Set up model

In [3]:
pipe = pipeline("summarization", model="t5-large", device=0)




In [17]:
pipe.framework

'tf'

---
### Load in text

In [22]:
# get number of samples
sample_fn = os.listdir(INPUT_DIR)
sample_fn = list(filter(lambda x: 'inp' in x.lower(), sample_fn))
n_samples = len(sample_fn)

# get filenames
inp_fn = [os.path.join(INPUT_DIR, f'inp{i}.txt') for i in range(1,n_samples+1)]
ref_fn = [os.path.join(INPUT_DIR, f'ref{i}.txt') for i in range(1,n_samples+1)]
print('input text filenames:', inp_fn)
print('reference text filenames:', ref_fn)

# load in texts
def clean_context(filename):
    with open(filename, 'r', encoding="utf8") as f:
        text = f.read()
    text = re.sub("\n", r' ', text)
    text = re.sub(r"\s{2,}", r' ', text)
    text = re.sub(r"“|”", r'"', text)
    text = re.sub(r"‘|’", r"'", text)
    text = re.sub(r"_", r'', text, flags=re.ASCII)  # pass flags by keyword; the 4th positional argument of re.sub is count
    text = re.sub(r"\s{2,}", r' ', text)
    text = text.strip()
    return text
inp_text = [clean_context(fn) for fn in inp_fn]
ref_text = ["\n".join(sent_tokenize(clean_context(fn))) for fn in ref_fn]


input text filenames: ['part3-text/inp1.txt', 'part3-text/inp2.txt', 'part3-text/inp3.txt', 'part3-text/inp4.txt']
reference text filenames: ['part3-text/ref1.txt', 'part3-text/ref2.txt', 'part3-text/ref3.txt', 'part3-text/ref4.txt']


In [23]:
inp_text[3]

'"The circumstances connected with the death of Sir Charles cannot be said to have been entirely cleared up by the inquest, but at least enough has been done to dispose of those rumours to which local superstition has given rise. There is no reason whatever to suspect foul play, or to imagine that death could be from any but natural causes. Sir Charles was a widower, and a man who may be said to have been in some ways of an eccentric habit of mind. In spite of his considerable wealth he was simple in his personal tastes, and his indoor servants at Baskerville Hall consisted of a married couple named Barrymore, the husband acting as butler and the wife as housekeeper. Their evidence, corroborated by that of several friends, tends to show that Sir Charles\'s health has for some time been impaired, and points especially to some affection of the heart, manifesting itself in changes of colour, breathlessness, and acute attacks of nervous depression. Dr. James Mortimer, the friend and medica

---
# Summarize text

---
### Summarize all the input texts

In [24]:
predictions = []
for i, inp in enumerate(inp_text):
    print(f'Summarizing Input {i}')
    predictions.append(pipe(inp))
print("Done")

Summarizing Input 0
Summarizing Input 1
Summarizing Input 2
Summarizing Input 3
Done
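
A caveat on input length: T5 checkpoints are, as far as I know, configured for inputs of about 512 tokens, so longer passages may be truncated or trigger a warning when passed to the pipeline in one piece. One possible workaround is to split a passage into sentence groups that stay under a budget and summarize each group separately. The sketch below assumes whitespace word counts as a rough stand-in for T5 token counts and a naive regex sentence splitter; `chunk_sentences` and `max_words` are hypothetical names, not part of the notebook above.

```python
import re

def chunk_sentences(text, max_words=300):
    """Group sentences into chunks of at most max_words whitespace-separated words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    # Naive split after ., ! or ? followed by whitespace (an assumption; it
    # will also split after abbreviations such as "Mr.").
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "One two three. Four five six. Seven eight nine ten."
chunks = chunk_sentences(sample, max_words=6)
print(chunks)
```

Each chunk could then be passed to `pipe` in turn and the partial summaries joined, at the cost of losing context across chunk boundaries.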


In [25]:
def clean_pred(preds):
    out = []
    for p in preds:
        for x in p:
            out.append('\n'.join(sent_tokenize(x['summary_text'])))
    return out
predictions = clean_pred(predictions)


In [26]:
predictions

['"sir Charles Baskerville was in the habit every night before going to bed" "the evidence of the Barrymores shows that this had been his custom," "the coroner\'s jury returned a verdict in accordance with the medical evidence" "it is obviously of the utmost importance that Sir Charles\'s heir should settle at the Hall"',
 '"he found a way out of his difficulties through the chance that sir Charles made him minister of his charity in the case of unfortunate woman, Mrs. Laura Lyons" by representing himself as a single man he acquired complete influence over her .\nhe then put pressure upon Mrs. Lyons to write this letter, imploring the old man to give her an interview on the evening before his departure .',
 '"the other words were all simple and might be found in any issue, but \'moor\' would be less common" "have you read anything else in this message, Mr.\nHolmes?"\nhe asks .\n"the address, you observe, is printed in rough characters, but the times is a paper which is seldom found in 

In [27]:
for i, p in enumerate(predictions):
    with open(os.path.join(OUTPUT_DIR, f'out{i+1}.txt'), 'w', encoding="utf8") as f:
        f.write(p)

---
# Evaluation with ROUGE Score (Extra Credit)

---
### Set up ROUGE Evaluation

In [28]:
import evaluate

rouge = evaluate.load('rouge')

def eval_rouge(predictions, references):
    results = rouge.compute(
        predictions=predictions,
        references=references
    )
    return results
res = eval_rouge(predictions, ref_text)

In [29]:
res

{'rouge1': 0.30477971889501143,
 'rouge2': 0.08313524369280692,
 'rougeL': 0.172717101887657,
 'rougeLsum': 0.2431638779022941}
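
The gap between `rouge1` and `rougeL` above is consistent with summaries that share words with the references but present them in a different order, since ROUGE-L is based on the longest common subsequence (LCS) rather than on unordered overlap. The sketch below is a simplified ROUGE-L F1 that assumes lowercased whitespace tokens; `lcs_length` and `rouge_l` are hypothetical helpers, and the sentences are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """F1 of LCS-based precision and recall (simplified ROUGE-L)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Four of five words are shared, but the word order differs.
score = rouge_l("the hound chased the baronet", "the baronet fled the hound")
print(score)
```

In this pair the unigram overlap is 4 of 5 words, yet the longest in-order match is only 2 words, so the LCS-based score is half the unigram score.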

In [37]:
ref_text

["Sir Charles Baskerville did not return after his nightly walk. Barrymore noticed and followed Sir Charles' footsteps. It seemed Sir Charles stood by the gate to the moor then walked down the alley. Barrymore then found Sir Charles' body at the end of the alley. The body did not show any signs of struggle. Interestingly, the victim's face showed incredible facial distortion.",
 'Stapleton convinced Mrs. Laura Lyons to write a letter to Sir Charles telling him to give her an interview. He then painted his hound and brought it to the gate. The hound jumped over the gate and chased the baronet until he fell dead at the end of the alley from terror.']