# Document Summarization
I will use the transformers library from Hugging Face to summarize text. Hugging Face is an open-source platform that provides pre-trained models. Transformer-based text generators are autoregressive sequence models: they use their own previous outputs as inputs when predicting the next token. What makes them so powerful is that each prediction attends to the whole sequence of tokens that came before it, and that sequence grows by one token after every prediction.
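
The autoregressive loop described above can be sketched in a few lines. This is a toy illustration, not how BART is implemented: `generate`, `toy_next_token`, and the lookup table are hypothetical stand-ins for a trained model, and the toy predictor only inspects the last word even though it receives the full sequence.

```python
def generate(next_token, prompt, max_new_tokens, stop_token="<eos>"):
    """Repeatedly predict one token from everything generated so far."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        # the predictor receives the whole sequence, including its own earlier outputs
        token = next_token(sequence)
        sequence.append(token)
        if token == stop_token:
            break
    return sequence


def toy_next_token(sequence):
    # hypothetical stand-in for a trained model: a fixed lookup on the last word
    table = {"the": "cat", "cat": "sat", "sat": "<eos>"}
    return table.get(sequence[-1], "<eos>")


result = generate(toy_next_token, ["the"], max_new_tokens=10)
print(result)
```

A real transformer replaces the lookup table with a network that attends to every position in `sequence`, but the outer loop has the same shape.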

I will use the pre-trained BART model. It has an encoder-decoder architecture, which makes it well suited to text generation and sequence-to-sequence tasks such as summarization, translation, and paraphrasing.


I will extend this project by building a transformer model from scratch and comparing its summarization results with those of the pre-trained transformer model from Hugging Face.


## Summarization for Large Text
BART has a maximum input length of 1024 tokens. Documents often exceed this length, so a further technique is needed. To summarize a large document, the text is first broken up into chunks of at most 1024 tokens. Each chunk is then summarized individually and appended to an aggregate summary. Once the summaries of all chunks have been generated, the aggregate is summarized again to produce a more concise final output.
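
The two-stage flow described above can be sketched without a model. This is a minimal illustration under stated assumptions: `split_into_chunks`, `first_words`, and `two_stage_summary` are hypothetical names, and `first_words` merely keeps the first few tokens as a stand-in for a real summarizer such as BART.

```python
def split_into_chunks(tokens, chunk_size):
    """Cut a token list into consecutive pieces of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def first_words(tokens, limit):
    """Hypothetical stand-in for a summarization model: keep the first `limit` tokens."""
    return tokens[:limit]


def two_stage_summary(tokens, chunk_size, summarize, chunk_limit, final_limit):
    # stage 1: summarize every chunk and append the result to the aggregate summary
    aggregate = []
    for chunk in split_into_chunks(tokens, chunk_size):
        aggregate.extend(summarize(chunk, chunk_limit))
    # stage 2: summarize the aggregate to get the final output
    return summarize(aggregate, final_limit)


words = "a b c d e f g h i j".split()
final = two_stage_summary(words, chunk_size=4, summarize=first_words,
                          chunk_limit=2, final_limit=3)
print(final)
```

The functions later in this notebook follow the same shape, with BART taking the place of `first_words`.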


In [2]:
#installations
#!pip install transformers
#!pip install torch

In [1]:
#imports
import torch
from transformers import BartTokenizer, BartForConditionalGeneration



In [2]:
#loading the BART tokenizer and model 
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')



In [11]:
#breaking up the larger document into chunks; the chunk size will be 1024 as that is the max input of BART

def chunk_text(text, chunk_size = 1024):

    """
    Splits a long document into chunks that fit within the model's max token limit
    
    :param text: the original long text to be chunked
    
    :param chunk_size: max number of tokens for each chunk, including the start and end tokens
    
    :return: list of token ID tensors of shape (1, n), each with n <= chunk_size
    """

    #tokenizing the text into token IDs; special tokens are left out here and added per chunk below
    token_ids = tokenizer.encode(text, add_special_tokens=False, truncation=False)

    #reserving two positions in every chunk for the <s> and </s> tokens that BART expects
    body_size = chunk_size - 2

    #splitting the token IDs into chunks and wrapping each one in the start and end tokens
    chunks = []
    for i in range(0, len(token_ids), body_size):
        body = token_ids[i:i + body_size]
        chunk = torch.tensor([[tokenizer.bos_token_id] + body + [tokenizer.eos_token_id]])
        chunks.append(chunk)

    return chunks
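
As a quick check on the slicing, the sketch below works out how many chunks a document produces and how long the last one is. It assumes plain fixed-size slicing with a window of `chunk_size` tokens; if positions are reserved for special tokens, pass the reduced window size instead. `chunk_layout` is a hypothetical helper, not part of the notebook's pipeline.

```python
import math


def chunk_layout(total_tokens, chunk_size=1024):
    """Return (number of chunks, length of the last chunk) for fixed-size slicing."""
    if total_tokens == 0:
        return 0, 0
    num_chunks = math.ceil(total_tokens / chunk_size)
    # every chunk except the last is full, so the last one holds the remainder
    last_len = total_tokens - (num_chunks - 1) * chunk_size
    return num_chunks, last_len


layout = chunk_layout(2500)
print(layout)
```

A very short last chunk is worth watching for, since it still gets its own summary of at least `min_length` tokens.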


In [20]:
#summarizing a single chunk from the list of chunks using the BART model

def summarize_chunk(chunk):
    """
    Summarizes a chunk of text using the BART model
    
    :param chunk: A chunk of tokenized input
    :return: the summary of the given chunk
    
    """

    #summary_ids holds the token indices (IDs) of the summary generated by the model
    #it is a tensor containing the IDs of the generated tokens
    #these token IDs are the predicted tokens that form the summary text
    ''' 
    max_length specifies the max number of tokens in the generated summary
    min_length specifies the min number of tokens in the generated summary
    length_penalty is the exponent applied to the sequence length when beam scores are normalized (values > 0 favour longer outputs, values < 0 favour shorter outputs)
    num_beams the number of candidate sequences the model keeps at each step (more beams explore more candidates but run slower)
    early_stopping stops the beam search as soon as num_beams finished candidate sequences have been found
    '''
    summary_ids = model.generate(chunk, max_length = 150, min_length = 30,
                                 length_penalty = 2.0, num_beams = 4, early_stopping = True)
    

    #need to decode the token IDs of the summary to convert them back into human-readable text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    #returning the human-readable summary
    return summary
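
To see why a positive `length_penalty` favours longer summaries, it helps to work through the normalization. The Hugging Face documentation describes the penalty as an exponent on the sequence length, which then divides the sequence's summed log-probability. The sketch below reproduces that arithmetic with made-up numbers; `beam_score` and the two candidate sequences are illustrative assumptions, not library code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: summed log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# two hypothetical finished candidates: (summed log-probability, length in tokens)
short_candidate = (-5.0, 10)
long_candidate = (-8.0, 20)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_candidate, penalty)
    long_score = beam_score(*long_candidate, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f}, winner={winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score closer to zero, so the longer candidate wins once the penalty is large enough.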



In [None]:
#combining the above functions
#summarize a long document by splitting it up into chunks and summarizing each chunk

def summarize_document(text, chunk_size = 1024):
    
    """
    Summarizes a long document by splitting it into chunks, summarizing each, and combining the summaries.
    
    :param text: The entire long document to summarize.
    :param chunk_size: Maximum number of tokens per chunk.
    :return: Combined summary of the entire document.

    """

    #use the chunk_text function to split up the text
    text_chunks = chunk_text(text, chunk_size)

    #summarize each chunk using the summarize_chunk definition
    summaries = []
    for chunk in text_chunks:
        summary = summarize_chunk(chunk)
        summaries.append(summary)


    #combining the summaries
    full_summary = " ".join(summaries)

    return full_summary
    

In [21]:
#taking all of the summaries of the chunks and generating a final concise summary

def concise_summary(summaries, max_length = 250):

    """
    Takes the chunk summaries and generates a final concise summary.
    
    :param summaries: The combined summaries from all chunks.
    :param max_length: The maximum length of the final summary, in tokens.
    :return: The final summarized version of the summaries.

    """

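    #Encode the combined summaries into token IDs, truncating to BART's 1024-token input limit
    #no task prefix is added: "summarize: " is a T5 convention and BART would treat it as part of the text
    inputs = tokenizer.encode(summaries, return_tensors = 'pt', 
                             max_length = 1024, truncation = True)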
    
    #Summarize the combined summaries using the BART model
    summary_ids = model.generate(inputs, max_length = max_length, min_length = 50, 
                                 length_penalty = 2.0, num_beams = 4, early_stopping = True)
    
    #Decode the final summary so that it is human-readable
    final_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    #return the final summary
    return final_summary
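
One consequence of truncating the combined summaries at 1024 tokens is that the second pass can only see a limited number of chunk summaries. The rough estimate below assumes every chunk summary reaches the 150-token `max_length` used in `summarize_chunk`; it ignores special tokens and the small differences that re-tokenizing the joined text can introduce. `chunks_that_fit` and `aggregate_may_truncate` are hypothetical helpers for illustration.

```python
def chunks_that_fit(input_limit=1024, summary_max_tokens=150):
    """Worst-case number of chunk summaries that fit in one second-pass input."""
    return input_limit // summary_max_tokens


def aggregate_may_truncate(num_chunks, input_limit=1024, summary_max_tokens=150):
    """True if the combined summaries could exceed the input limit in the worst case."""
    return num_chunks * summary_max_tokens > input_limit


print(chunks_that_fit())
print(aggregate_may_truncate(6), aggregate_may_truncate(7))
```

Under these assumptions, documents that produce more than a handful of chunks may lose their tail in the final pass. One possible remedy is to repeat the chunk-and-summarize step on the combined summaries until they fit.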


