# Text Summarisation Using Transformers

### Context
From researching a new topic, to extracting key points from an investor report, or even standardising text documents for input into another LM, the summarisation of text documents has wide ranging and far reaching appications. In the following notebook,  the Transformer Architecture is leveraged to summarise text documents, ranging from various blog posts, to the famous paper, 'Attention is All You Need'. 


### Solution
As mentioned above, the Transformer Architecture is leveraged in order to create a NLP model capable of text summarisation. In particular, a pre-trained Transformer from Hugging Face is used, leveraging its 'Summarization Pipeline' capability. 

In the case of blog posts, or other similar internet media, Beautiful Soup can be used to scrape the relevant data. In the case of research papers, which are often PDF files, the papers must be saved locally. They are then read in using PdfReader. 

This scraped data is then preocessed and chunked into blocks of sentences before it is passed to our summariser. Lastly, we export the summarised text so we can use it as desired. 


## 0. Setup

To run the following Jupyter Notebook, the following setup is recommended, although alternatives methods are available. 

* First create and activate a virtual environment in the desired directory. 
<br>
* Next install the required packages within the virtual environment. 

*(See the ReadMe.md file for detailed instructions on how the above steps are done)*

* Then execute `jupyter notebook` in your terminal, with the virtual environment still activated, in order to open a Jupyter Notebook, whose kernel is the virtual environment created. This ensures we do not need to install all package requirements from the Notebook itself.
<br>
* The Notebook should then open. If it does not open, check your terminal, as a link to open the Notebook may have been logged here. Once opened, ensure the kernel selected is your virtual environment.



In case, you wish to install packages directly from the Jupyter Notebook, uncomment and run the following installations. 

In [None]:
# !pip install tensorflow
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# !pip install transformers
# !pip install beautifulsoup4
#!pip install pypdf

Check the imports below exist within the available poackages for the Notebook:

In [1]:
!pip list

Package                      Version
---------------------------- ----------
absl-py                      1.4.0
anyio                        3.6.2
argon2-cffi                  21.3.0
argon2-cffi-bindings         21.2.0
arrow                        1.2.3
asttokens                    2.2.1
astunparse                   1.6.3
attrs                        23.1.0
backcall                     0.2.0
beautifulsoup4               4.12.2
bleach                       6.0.0
cachetools                   5.3.0
certifi                      2022.12.7
cffi                         1.15.1
charset-normalizer           3.1.0
comm                         0.1.3
debugpy                      1.6.7
decorator                    5.1.1
defusedxml                   0.7.1
executing                    1.2.0
fastjsonschema               2.16.3
filelock                     3.12.0
flatbuffers                  23.3.3
fqdn                         1.5.1
fsspec                       2023.4.0
gast  

## 1. Imports

In [2]:
from transformers import pipeline  # Lets us import summarization model
from bs4 import BeautifulSoup  # Needed for scraping of text documents
import requests  # Needed to make https requests 
from pypdf import PdfReader  # Needed to read PDF files
import tensorflow as tf  # Install preferred ML framework
import torch
import os

2023-05-01 15:05:30.624194: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-01 15:05:31.350766: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-01 15:05:31.362496: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 2. Functions

In [None]:
def extract_text_from_url(url: str, tags: list) -> str:
    """
    Returns text from the url specified, where text is retrieved from tags (of the html) specified.
    """
    request = requests.get(url)
    if request.ok:
        print('URL request successful!')
    else:
        raise ValueError('URL request unsuccessful!')
        
    # Extract all tags
    soup = BeautifulSoup(request.text, 'html.parser')
    results = soup.find_all(tags)
    text_list = [result.text for result in results]
    text = ' '.join(text_list)
    return text

def chunk_text(text: str, max_chunk: int = 500) -> list:
    """
    Returns a list of chunks, ready to to be passed to the transformers pipeline. 
    """
    # Split text into sentences
    eos_punctuation = ['.', '?', '!']
    for symbol in eos_punctuation:
        text = text.replace(symbol, f'{symbol}<EOS>')
    sentences = text.split('<EOS>')
    
    # Chunk sentences
    current_chunk = 0 
    chunks = []
    for s in sentences:
        if len(chunks) == current_chunk + 1: 
            if len(chunks[current_chunk]) + len(s.split(' ')) <= max_chunk:
                chunks[current_chunk].extend(s.split(' '))
            else:
                current_chunk += 1
                chunks.append(s.split(' '))
        else:
            chunks.append(s.split(' '))

    for chunk_id in range(len(chunks)):
        chunks[chunk_id] = ' '.join(chunks[chunk_id])
        
    print(f'Total Chunks: {len(chunks)}')
    
    return chunks

def extract_text_from_pdf(filename: str) -> str:
    """
    Returns text from the specified PDF file. 
    """
    reader = PdfReader(filename)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + " "
    if len(text) > 0:
        print('PDF reading successful!')
    return text

def extract_text_between_keywords(text: str, start_keyword: str, end_keyword: str):
    """
    Returns text between the start_keyword (Inclusive) and the end_keyword (Not Inclusive). 
    """
    start_index = text.find(start_keyword)
    end_index = text.find(end_keyword)
    
    if start_index == -1 or end_index == -1:
        return ""
    
    return text[start_index:end_index]

def process_text_from_pdf(text: str) -> str:
    """
    Returns the text processed as desired. 
    
    Notes
    -----
    - Only needed for the PDF file. 
    
    #TODO: Abstract and add more processing steps. 
    """
    text = extract_text_between_keywords(text, "Abstract", "Acknowledgements")
    return text
    

## 3. Notebook Variables

In [15]:
ML_FRAMEWORK = 'TORCH'
SUMMARY_MAX_LENGTH = 80
SUMMARY_MIN_LENGTH = 30

## 4. Load Summarization Pipeline

*Notes:*
- *`pipeline` defaults to use pytorch framework if both pytorch and tensorflow installed.*
- *The `pipeline` will load in if not done so before.*


In [5]:
if ML_FRAMEWORK == 'TORCH':
    # Use bart in pytorch
    summarizer = pipeline('summarization')
elif ML_FRAMEWORK == 'TENSORFLOW':
    # Use t5 in tensorflow
    summarizer = pipeline('summarization', model="t5-base", tokenizer="t5-base", framework="tf")
else:
    raise ValueError('Invalid ML Framework Specified!')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

## 6.A Summarise Blog Posts

In [16]:
URL = "https://towardsdatascience.com/a-bayesian-take-on-model-regularization-9356116b6457"
text = extract_text_from_url(URL, ['h1', 'p'])
chunks = chunk_text(text)
chunk_summaries = summarizer(chunks, max_length=SUMMARY_MAX_LENGTH, min_length=SUMMARY_MIN_LENGTH, do_sample=False)
summary = ' '.join([chunk_summary['summary_text'] for chunk_summary in chunk_summaries])
summary

URL request successful
Total Chunks: 4


' In machine learning, regularization, or model complexity control, is an essential and common practice to ensure that a model attains high out-of-sample performance . We can do this by placing a prior on our distribution of models . We will focus on analytical regularization techniques, since their Bayesian interpretation is more well-defined .  We will analyze these claims for regression problems, but they extend to other supervised learning tasks, such as classification, as well . We will focus on rigorously presenting the mathematics behind these claims . In the next section, we will use Bayes’ Rule to derive our L2 regularized estimator .  Using Bayes’ Rule, we can show that the mean and mode of the posterior distribution of w is the solution for LASSO regression when we invoke a Gaussian prior distribution on w . We’ll now examine a similar case with a Laplace prior . Here is the corresponding derivation for Lasso: (Note that in the last step, we set p =  A huge thank you to CODE

In [25]:
URL = "https://asianabsolute.co.uk/blog/2021/01/28/what-is-bleu-score/"
text = extract_text_from_url(URL, ['h1', 'p'])
chunks = chunk_text(text)
chunk_summaries = summarizer(chunks, max_length=SUMMARY_MAX_LENGTH, min_length=SUMMARY_MIN_LENGTH, do_sample=False)
summary = ' '.join([chunk_summary['summary_text'] for chunk_summary in chunk_summaries])
summary

ValueError: URL request unsuccessful!

## 6.B Summarise PDF

In [18]:
current_directory = os.getcwd()
pdf_filepath = os.path.join(current_directory, '../../Resources/attention-is-all-you-need.pdf')

text = extract_text_from_pdf(pdf_filepath)
processed_text = process_text_from_pdf(text)
chunks = chunk_text(text, max_chunk = 400)
chunk_summaries = summarizer(chunks, max_length=SUMMARY_MAX_LENGTH, min_length=SUMMARY_MIN_LENGTH, do_sample=False)
summary = ' '.join([chunk_summary['summary_text'] for chunk_summary in chunk_summaries])
summary

Total Chunks: 15


' We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions . Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train .  The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs .  The Transformer is the ﬁrst transduction model relying solely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution . The model is auto-regressive[10], consuming the previously generated symbols as additional input when generating the next .  An attention function can be described as mapping a query and a set of key-value pairs to an output . The weight assigned to each value is computed by a compatibility function of 

### Summarising the summarisation?

In [19]:
chunks = chunk_text(summary, max_chunk = 400)
chunk_summaries = summarizer(chunks, max_length=SUMMARY_MAX_LENGTH, min_length=SUMMARY_MIN_LENGTH, do_sample=False)
summary = ' '.join([chunk_summary['summary_text'] for chunk_summary in chunk_summaries])
summary

Total Chunks: 2


' The Transformer is the ﬁrst transduction model relying solely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution . Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train . The model is  The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and French newstest tests at a fraction of the training cost . Sentences were encoded using byte-pair encoding [ 3], which has a shared source-target vocabulary of about 37000 tokens . For English-French, we used the signi�'