# Ensemble Model for Summarization
To achieve a higher-quality, more accurate summary, this approach combines the outputs of multiple models rather than relying on a single one. This can be done in two main ways: averaging the token probabilities of the different models at each decoding step, or using a voting (ranking) mechanism to determine the final summary.




## Method 1: Averaging Token Probabilities
This method averages the models' next-token probabilities at each decoding step.

Step 1) Generate token probabilities from the different models.

Step 2) Average the probabilities for each token at every step.

Step 3) Pick the token with the highest average probability at each step to form the summary.
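
The three steps above can be sketched on a toy problem. This is only an illustration under simplifying assumptions: `VOCAB`, `model_a`, `model_b`, and `ensemble_greedy_decode` are hypothetical stand-ins, and both toy models share one vocabulary, which is not the case for the real models used later.

```python
# Toy illustration of Steps 1-3. The vocabulary and both "models" are
# hypothetical; real models would supply these probabilities themselves.
VOCAB = ["<eos>", "AI", "helps", "doctors", "cars"]
END_OF_SUMMARY = [1.0, 0.0, 0.0, 0.0, 0.0]

def model_a(prefix):
    """Return next-token probabilities over VOCAB given the tokens so far."""
    table = {
        0: [0.0, 0.7, 0.1, 0.1, 0.1],
        1: [0.1, 0.0, 0.6, 0.2, 0.1],
        2: [0.1, 0.0, 0.0, 0.5, 0.4],
    }
    return table.get(len(prefix), END_OF_SUMMARY)

def model_b(prefix):
    """A second toy model that disagrees with model_a on the last token."""
    table = {
        0: [0.0, 0.6, 0.2, 0.1, 0.1],
        1: [0.2, 0.0, 0.5, 0.2, 0.1],
        2: [0.1, 0.0, 0.0, 0.3, 0.6],
    }
    return table.get(len(prefix), END_OF_SUMMARY)

def ensemble_greedy_decode(models, max_steps=10):
    prefix = []
    for _ in range(max_steps):
        # Step 1: one probability distribution per model
        distributions = [model(prefix) for model in models]
        # Step 2: average the probabilities token by token
        averaged = [sum(column) / len(models) for column in zip(*distributions)]
        # Step 3: keep the token with the highest average probability
        best = max(range(len(VOCAB)), key=averaged.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        prefix.append(VOCAB[best])
    return prefix

toy_summary = ensemble_greedy_decode([model_a, model_b])
print(" ".join(toy_summary))  # prints: AI helps cars
```

On its own, `model_a` would end with "doctors" (0.5 versus 0.4), but the averaged distribution prefers "cars" (0.5 versus 0.4), so the ensemble can change the result at any step.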

In [5]:
#installations
#!pip install transformers sentence-transformers
#!pip install tf-keras
#!pip install SentencePiece

In [4]:
#imports
from transformers import BartTokenizer, BartForConditionalGeneration, PegasusTokenizer, PegasusForConditionalGeneration
from sentence_transformers import SentenceTransformer, util
import torch


In [1]:
#pip install sentencepiece


In [2]:
import sentencepiece

In [7]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

### Loading the Models and Tokenizers

In [5]:
#loading the BART and Pegasus models and tokenizers
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

pegasus_tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-cnn_dailymail')
pegasus_model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-cnn_dailymail')

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
#loading the t5 model and tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-large')



### Averaging the Logits
Each model produces token logits: unnormalized scores over its vocabulary that a softmax turns into next-token probabilities. We want to average these scores across all models at each decoding step. The next token of the summary is then the token with the highest averaged score.

This method fails as written because the logits from the BART, Pegasus, and T5 models do not have the same size: each model has its own vocabulary, so the tensors cannot be added element-wise.
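
Logits and probabilities are related but not identical, and the difference matters for averaging. The sketch below uses made-up numbers and a hypothetical `softmax` helper written in plain Python. It shows that averaging logits and averaging probabilities can select different tokens; averaging logits corresponds to a geometric mean of the distributions (up to normalization), while averaging probabilities is an arithmetic mean.

```python
import math

def softmax(logits):
    """Convert unnormalized logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits from two models over the same three-token vocabulary
logits_a = [5.0, 0.0, 0.0]    # model A is very confident in token 0
logits_b = [-10.0, 1.0, 0.0]  # model B strongly rejects token 0

probs_a = softmax(logits_a)
probs_b = softmax(logits_b)

avg_probs = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
avg_logits = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]

best_by_probs = argmax(avg_probs)
best_by_logits = argmax(avg_logits)
print("average of probabilities picks token", best_by_probs)
print("average of logits picks token", best_by_logits)
```

With these numbers the probability average keeps token 0, because model A's confidence dominates, whereas the logit average picks token 1, because model B's strong negative score pulls token 0 down.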

In [23]:
#defining a function to return the logits of the models

def get_model_logits(text, model, tokenizer, max_length = 1024):
    
    #converting the plain text into tokens which can be interpreted by the model (encoding the text)
    inputs = tokenizer.encode(text, return_tensors = 'pt', max_length = max_length, truncation = True)
    
    #using the model's configured decoder start token as the first decoder input
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]], dtype=torch.long)

    #passing the model the tokenized text (gradients are not needed for inference)
    with torch.no_grad():
        outputs = model(input_ids=inputs, decoder_input_ids=decoder_input_ids, return_dict=True)

    #returning the token logits and the tokens
    return outputs.logits, inputs



In [24]:
#defining a function to generate a summary by averaging the token logits

def ensemble_generate(text, max_length = 150, min_length = 30, num_beams = 4, early_stopping = True):

    #retrieving the logits from BART, Pegasus, and T5 using our above defined function
    bart_logits, inputs_bart = get_model_logits(text, bart_model, bart_tokenizer)
    pegasus_logits, inputs_pegasus = get_model_logits(text, pegasus_model, pegasus_tokenizer)
    t5_logits, inputs_t5 = get_model_logits(text, t5_model, t5_tokenizer)    


    #taking the average of the logits
    combined_logits = (bart_logits + pegasus_logits + t5_logits) / 3

    #generating tokens from the averaged logits
    generated_tokens = torch.argmax(combined_logits, dim=-1)

    #decoding the generated tokens to a readable summary 
    final_summary = bart_tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

    #returning the final summary
    return final_summary



In [25]:
#running an example with a long text document
long_text = """
Artificial Intelligence (AI) has been a major breakthrough in technology. It is being utilized in various domains such as healthcare, autonomous driving, and natural language processing...
... [continued long text]
"""


#generating a summary using this ensemble model
final_summary = ensemble_generate(long_text)

print("Final Summary After Model Ensembling:\n", final_summary)

RuntimeError: The size of tensor a (50264) must match the size of tensor b (96103) at non-singleton dimension 2
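
The size mismatch in the error above is a symptom of a deeper issue: each tokenizer assigns its own ids, so index `i` does not refer to the same token in two different vocabularies. One possible workaround, sketched here with hypothetical toy vocabularies and a hypothetical `average_by_token` helper, is to average by token string instead of by index. Real subword tokenizers also split text differently, so string matching would only partially align them.

```python
from collections import defaultdict

# Hypothetical toy vocabularies of different sizes and orderings
vocab_a = ["<pad>", "the", "cat", "sat"]
vocab_b = ["<pad>", "sat", "the", "dog", "cat", "ran"]

# Made-up next-token probabilities, one entry per vocabulary id
probs_a = [0.0, 0.1, 0.6, 0.3]
probs_b = [0.0, 0.2, 0.1, 0.3, 0.3, 0.1]

# The same id points at different tokens in the two vocabularies
print(vocab_a[1], "vs", vocab_b[1])  # prints: the vs sat

def average_by_token(models):
    """Average probabilities over the union of token strings.

    A token that is missing from a model's vocabulary contributes 0
    for that model.
    """
    totals = defaultdict(float)
    for vocab, probs in models:
        for token, prob in zip(vocab, probs):
            totals[token] += prob
    return {token: total / len(models) for token, total in totals.items()}

shared = average_by_token([(vocab_a, probs_a), (vocab_b, probs_b)])
best_token = max(shared, key=shared.get)
print(best_token)  # prints: cat
```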

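Next, we attempted to pad all logits to the size of the largest vocabulary so that their dimensions match. The addition now succeeds, but decoding fails: `None` values appear where token strings are expected and cannot be processed. The most likely cause is that the selected token id falls outside BART's vocabulary, so the BART tokenizer has no token for it.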

In [28]:
def get_model_logits(text, model, tokenizer, max_length=1024):
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=max_length, truncation=True)
    
    # Use the model's configured decoder start token as the first decoder input
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]], dtype=torch.long)

    # Call model with encoder input and decoder start token (no gradients needed for inference)
    with torch.no_grad():
        outputs = model(input_ids=inputs, decoder_input_ids=decoder_input_ids, return_dict=True)
    
    return outputs.logits, inputs

# Function to pad logits to the maximum size
def pad_logits(logits, target_size):
    current_size = logits.size(-1)
    if current_size < target_size:
        # Pad the logits with zeros to match the target size
        pad_size = target_size - current_size
        logits = torch.nn.functional.pad(logits, (0, pad_size), value=0)
    return logits

# Ensemble function to average logits from different models
def ensemble_generate(text, max_length=150, min_length=30, num_beams=4, early_stopping=True):
    # Get logits from BART, Pegasus, and T5
    bart_logits, inputs_bart = get_model_logits(text, bart_model, bart_tokenizer)
    pegasus_logits, inputs_pegasus = get_model_logits(text, pegasus_model, pegasus_tokenizer)
    t5_logits, inputs_t5 = get_model_logits(text, t5_model, t5_tokenizer)

    # Find the largest vocabulary size among the models
    max_vocab_size = max(bart_logits.size(-1), pegasus_logits.size(-1), t5_logits.size(-1))

    # Pad the logits so they all match the largest size
    bart_logits_padded = pad_logits(bart_logits, max_vocab_size)
    pegasus_logits_padded = pad_logits(pegasus_logits, max_vocab_size)
    t5_logits_padded = pad_logits(t5_logits, max_vocab_size)

    # Average the padded logits
    combined_logits = (bart_logits_padded + pegasus_logits_padded + t5_logits_padded) / 3

    # Generate tokens from averaged logits
    generated_tokens = torch.argmax(combined_logits, dim=-1)

    # Apply a mask to remove padding or invalid tokens
    valid_token_mask = generated_tokens != bart_tokenizer.pad_token_id

    # Keep only valid tokens for decoding
    valid_tokens = generated_tokens[valid_token_mask]

    # Decode the valid tokens to readable summary
    final_summary = bart_tokenizer.decode(valid_tokens, skip_special_tokens=True)

    return final_summary

# Example long text document (scientific or article)
long_text = """
Artificial Intelligence (AI) has been a major breakthrough in technology. It is being utilized in various domains such as healthcare, autonomous driving, and natural language processing...
... [continued long text]
"""

# Generate summary using ensemble method
final_summary = ensemble_generate(long_text)

print("Final Summary After Model Ensembling:\n", final_summary)

TypeError: sequence item 0: expected str instance, NoneType found
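
The `TypeError` above can be reproduced on a toy scale. This is a sketch of one plausible cause, not a trace of the actual run: the names `pad_with_zeros` and `id_to_token` and all numbers are hypothetical. When the highest averaged score lands on an id that only exists in the larger vocabulary, the smaller tokenizer has no string for it and the lookup yields `None`.

```python
def pad_with_zeros(logits, target_size):
    """Extend a list of logits with zeros up to target_size."""
    return logits + [0.0] * (target_size - len(logits))

# Hypothetical three-token vocabulary of the tokenizer used for decoding
id_to_token = {0: "<pad>", 1: "AI", 2: "helps"}

logits_small = [-3.0, -1.0, -2.0]             # model with 3 tokens
logits_large = [-2.0, -4.0, -1.0, 0.5, -3.0]  # model with 5 tokens

padded = pad_with_zeros(logits_small, len(logits_large))
averaged = [(a + b) / 2 for a, b in zip(padded, logits_large)]

# The winning id lies beyond the small vocabulary
best_id = max(range(len(averaged)), key=averaged.__getitem__)
token = id_to_token.get(best_id)  # no entry for this id, so None

try:
    "".join([token])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(best_id, token, error_message)
```

Padding with `float("-inf")` instead of zeros would stop the padded positions from ever winning the argmax, but it would not fix the underlying problem that the same id means different tokens in each vocabulary.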

## Method 2: Using a Ranking or Voting Mechanism
Instead of averaging the logits, this method selects the best of the summaries that the models generate independently. Each model produces its own summary; the original text and every summary are then embedded with a sentence-embedding model, and the summaries are ranked by the cosine similarity between their embedding and the embedding of the original text. The highest-ranked summary is returned as the final summary.
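
The ranking step can be illustrated without any model downloads. The sketch below makes a strong simplifying assumption: word counts stand in for sentence embeddings. `bag_of_words`, `cosine_similarity`, and `rank_summaries` are hypothetical helpers, and the texts are made up.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough stand-in for a sentence embedding: word counts."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_summaries(source, candidates):
    """Return (score, summary) pairs sorted from most to least similar."""
    source_vector = bag_of_words(source)
    scored = [(cosine_similarity(source_vector, bag_of_words(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

source = "ai is used in healthcare and in autonomous driving"
candidates = [
    "ai is used in healthcare and driving",
    "the weather was sunny today",
    "ai is used",
]

ranking = rank_summaries(source, candidates)
best_score, best_summary = ranking[0]
print(best_summary)  # prints: ai is used in healthcare and driving
```

Because the score measures closeness to the source text, this kind of ranking tends to favor candidates that copy the source most directly.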

In [31]:
# Sentence embedding model for ranking
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Function to generate summary from a model
def generate_summary(text, model, tokenizer, max_length=150, min_length=30, num_beams=4, early_stopping=True):
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)
    
    # Generate summary
    summary_ids = model.generate(inputs, max_length=max_length, min_length=min_length, num_beams=num_beams, early_stopping=early_stopping)
    
    # Decode and return the summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Ensemble function to rank summaries
def ensemble_generate(text):
    # Generate summaries from all models
    bart_summary = generate_summary(text, bart_model, bart_tokenizer)
    pegasus_summary = generate_summary(text, pegasus_model, pegasus_tokenizer)
    # T5 is a multi-task model and expects a task prefix for summarization
    t5_summary = generate_summary("summarize: " + text, t5_model, t5_tokenizer)
    
    # Get sentence embeddings for the original text and each summary
    original_embedding = embedder.encode(text, convert_to_tensor=True)
    summaries = [bart_summary, pegasus_summary, t5_summary]
    summary_embeddings = embedder.encode(summaries, convert_to_tensor=True)
    
    # Compute cosine similarity between the original text and each summary
    similarities = util.pytorch_cos_sim(original_embedding, summary_embeddings)
    
    # Select the summary with the highest similarity score
    best_summary_idx = torch.argmax(similarities).item()
    
    return summaries[best_summary_idx]

# Example long text document (scientific or article)
long_text = """
Artificial Intelligence (AI) has been a major breakthrough in technology. It is being utilized in various domains such as healthcare, autonomous driving, and natural language processing...
... [continued long text]
"""

# Generate summary using ensemble method
final_summary = ensemble_generate(long_text)

print("Final Summary After Model Ensembling:\n", final_summary)



Final Summary After Model Ensembling:
 Artificial Intelligence (AI) has been a major breakthrough in technology. It is being utilized in various domains such as healthcare, autonomous driving, and natural language processing.
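
The cell above ranks the summaries against the original text. A voting-style alternative, closer to the majority voting mentioned in the introduction, is to pick the summary that agrees most with the other summaries. The sketch below is a hypothetical variant and not the method used in this notebook: `jaccard` and `consensus_pick` are illustrative names, and word overlap stands in for embedding similarity.

```python
def jaccard(a, b):
    """Word-level overlap between two texts, between 0 and 1."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def consensus_pick(candidates):
    """Return the candidate with the highest mean overlap with the others."""
    if len(candidates) < 2:
        return candidates[0], [0.0] * len(candidates)
    scores = []
    for i, candidate in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(jaccard(candidate, other) for other in others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores

# Made-up summaries standing in for the three model outputs
candidates = [
    "ai is used in healthcare",
    "ai is used in healthcare and driving",
    "ai is used in driving",
]

winner, scores = consensus_pick(candidates)
print(winner)  # prints: ai is used in healthcare and driving
```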
