# LLM Inference: Introduction to Optimization and Efficiency Lab

## Introduction to Summarization

Text summarization is a natural language processing task: producing a shorter text from a document or sequence of text that captures its most important information.

Summarization can take the following forms:

- Extractive summarization is the process of extracting the most relevant text from the document and using the relevant text to form a summary.
- Abstractive summarization is the process of generating new text that captures the most relevant information from the document. The generated summary may contain text that does not appear in the document.
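
To make the extractive case concrete, here is a minimal sketch that scores each sentence by the average document-wide frequency of its words and returns the top-scoring sentences verbatim. The function name `extractive_summary` and the scoring rule are illustrative assumptions; this is not how the models in this lab work.

```python
import re
from collections import Counter


def extractive_summary(document: str, n_sentences: int = 1) -> str:
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    word_counts = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


doc = "The cat sat on the mat. The cat ate the fish. Dogs bark."
print(extractive_summary(doc, n_sentences=1))
```

Every sentence in the result appears verbatim in the input, which is the defining property of an extractive summary. An abstractive model is free to produce wording that is not in the document.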

Summarization is an example of a sequence-to-sequence task. Models for such tasks are typically encoder-decoder models, which use both parts of the Transformer architecture. The encoder's attention layers have access to all the words of the input text, while the decoder's attention layers only have access to the words positioned before the target word in the summary generated so far (together with the encoder's representation of the input).
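
The difference between the two attention patterns can be sketched as boolean masks, where `True` means "this position may attend to that position". The helper names `encoder_mask` and `decoder_mask` are hypothetical and only illustrate the idea; real implementations build these masks as tensors.

```python
def encoder_mask(n_tokens: int) -> list:
    # Bidirectional attention: every position may attend to every position.
    return [[True] * n_tokens for _ in range(n_tokens)]


def decoder_mask(n_tokens: int) -> list:
    # Causal attention: position i may attend only to positions 0..i,
    # never to tokens that come later in the sequence.
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]


for row in decoder_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

The decoder mask is lower triangular, which is what prevents the model from looking at words it has not generated yet.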


## Objective 

In this lab, you will:

1. Understand the concept of inference optimization.
2. Learn techniques to optimize inference for machine learning models.
3. Implement and evaluate these optimization techniques.


## Set Up Your Environment



### Install Required Libraries 

Ensure you have the necessary libraries installed. You can install them using pip if they are not already installed.
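
As a quick check before running the notebook, the sketch below reports which third-party packages imported in this lab cannot be found in the current environment. The package list is taken from the import cell; the helper name `missing_packages` is an assumption, and the lab's own `llm_inference_lab` package is not covered here because its installation steps are project-specific.

```python
import importlib.util

# Third-party packages imported by this notebook.
REQUIRED = ["torch", "transformers", "pandas", "polars"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```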

### Import Libraries 

Import the necessary libraries for data manipulation, model loading, and optimization.

In [None]:
# Import libraries
import os
import sys
import time
import torch
import pandas as pd
import polars as pl

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    PegasusForConditionalGeneration,
    BartForConditionalGeneration,
    PegasusTokenizerFast,
    BartTokenizerFast,
    BatchEncoding,
)

from llm_inference_lab.utils.download_models import download_models

from llm_inference_lab.utils.models import pegasus_model, distilbart_model

from llm_inference_lab.tools.torch.quantization import dynamic_quantization

from llm_inference_lab.summarization.summarization import TextSummarizer

from llm_inference_lab.utils.benchmark import measure_inference_latency

### Download Models

Download the models for the lab. Depending on your network connection, this may take longer than the estimate below.

*Expected download time is approximately 2 minutes.*

More information about the models we will download for this lab can be found here:

- [google/pegasus-cnn_dailymail](https://huggingface.co/google/pegasus-cnn_dailymail) - This model provides an abstractive summary that is high in extractive coverage/density, which means the summaries returned are more extractive.
- [sshleifer/distilbart-cnn-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6) - This model is a compressed model produced by a technique known as distillation. Distillation is the process of transferring knowledge from a larger model, also referred to as the teacher, to a smaller model, also referred to as the student. This model provides an abstractive summary that is high in extractive coverage/density, which means the summaries returned tend to contain snippets of verbatim text from the input document (so may resemble an extractive summary).
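
One common way to express the teacher-to-student transfer is a soft-target loss: the student is penalized when its output distribution differs from the teacher's, with both distributions softened by a temperature. The sketch below illustrates that general idea in plain Python. It is not claimed to be the recipe used to produce the DistilBART checkpoint above, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's to the student's softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by temperature squared is a common convention that keeps
    # gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


print(soft_target_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))  # identical predictions
print(soft_target_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.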


#### Default Hyperparameters for Models in this Lab

**Default hyperparameters for Pegasus**

    Model Parameters:
        Tokenization:
            max_length - 512
            padding - True
            truncation - True

    Generation:
        Parameters that control the generation strategy:
            num_beams - 4 (model default is 8)

        Parameters that control the length of the output:
            min_length - 32
            max_length - 128
            early_stopping - True
            max_new_tokens - 128

        Parameters for manipulation of the model output logits:
            length_penalty - 0.8
            no_repeat_ngram_size - 0 (default)

**Default hyperparameters for DistilBART**

    Model Parameters:
        Tokenization:
            max_length - 512
            padding - True
            truncation - True

    Generation:
        Parameters that control the generation strategy:
            num_beams - 4

        Parameters that control the length of the output:
            min_length - 56
            max_length - 142
            early_stopping - True
            max_new_tokens - 128

        Parameters for manipulation of the model output logits:
            length_penalty - 2
            no_repeat_ngram_size - 3
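
One way to keep these defaults in a single place is to store them as keyword-argument dictionaries. The sketch below copies the values from the listings above; the dictionary names are hypothetical, and `max_length` is left out of the generation settings on the assumption that `max_new_tokens` is the limit you want to apply.

```python
# Tokenizer settings shared by both models.
TOKENIZER_KWARGS = {
    "max_length": 512,
    "padding": True,
    "truncation": True,
    "return_tensors": "pt",
}

# Generation settings for Pegasus.
PEGASUS_GENERATION_KWARGS = {
    "num_beams": 4,
    "min_length": 32,
    "max_new_tokens": 128,
    "early_stopping": True,
    "length_penalty": 0.8,
    "no_repeat_ngram_size": 0,
}

# Generation settings for DistilBART.
DISTILBART_GENERATION_KWARGS = {
    "num_beams": 4,
    "min_length": 56,
    "max_new_tokens": 128,
    "early_stopping": True,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 3,
}
```

Once a model and tokenizer are loaded, these could be unpacked into the calls, for example `tokenizer(texts, **TOKENIZER_KWARGS)` and `model.generate(**inputs, **PEGASUS_GENERATION_KWARGS)`.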

In [None]:
%%time

download_models(all=True)

## Load and Prepare Data

### Load Dataset 

For this lab, we will use the [XSum dataset](https://www.kaggle.com/datasets/mdnaveedmmulla/xsumdataset?resource=download&select=xsum_test.csv), which is a classic dataset for summarization tasks. The cell below loads the validation split from `../data/xsum_validation.csv`.

In [None]:
df = pd.read_csv("../data/xsum_validation.csv")


In [None]:
df.head()

In [None]:
df["document"].values[2]

In [None]:
# Wrap the document in a one-element list; list(<str>) would split it into single characters.
src_text = [df["document"].values[2]]

*Note: Keep in mind that some models require text preprocessing before training, fine-tuning, or inference. This lab does not cover specific text preprocessing techniques, but you should consider what kind of preprocessing the model you use needs.*

In [None]:
import re

In [None]:
# src_text = re.sub("\\s{2,}", " ", src_text)
# src_text = re.sub("\\n{2,}", "\n", src_text).strip()
# src_text
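
The commented-out cell above hints at whitespace cleanup. A minimal sketch of that idea follows; the helper name `normalize_whitespace` is an assumption. It collapses blank lines first and then collapses runs of spaces and tabs, so that single newlines survive (a pattern such as `\s{2,}` would also swallow the newlines).

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse repeated blank lines and repeated spaces or tabs in a document."""
    # Collapse runs of newlines into a single newline.
    text = re.sub(r"\n{2,}", "\n", text)
    # Collapse runs of spaces and tabs (but not newlines) into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


print(repr(normalize_whitespace("  Hello   world.\n\n\nNext  line. ")))
```

To apply it to a list of documents, a list comprehension such as `[normalize_whitespace(t) for t in src_text]` would do.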

## Load and Prepare the Model

### Load Pre-trained Model and Tokenizer 

For this lab, we will use the Pegasus CNN/DailyMail model (`google/pegasus-cnn_dailymail`), a pretrained language model, as the base model that we want to optimize for inference.


#### Load Base Model: Pegasus

In [None]:
model_name = pegasus_model.path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

#### Load Quantized Model using PyTorch's Dynamic Quantization

Dynamic quantization will only be applied to the Pegasus model.

In [None]:
quantized_model = dynamic_quantization(model)

#### Load Distilled Model: DistilBart

In [None]:
model_name_dist = distilbart_model.path

tokenizer_dist = AutoTokenizer.from_pretrained(model_name_dist)
distilled_model = BartForConditionalGeneration.from_pretrained(model_name_dist)

In [None]:
# import psutil as ps

# ps.cpu_count(logical=False)

### Set device

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
model.to(device)

In [None]:
# PyTorch dynamic quantization targets CPU backends, so the quantized model stays on the CPU.
quantized_model.to("cpu")

In [None]:
distilled_model.to(device)

## Optimize Inference

### Enable Model Evaluation Mode 

Set each model to evaluation mode to disable dropout layers.

In [None]:
for m in (model, quantized_model, distilled_model):
    m.eval()

### Optimize Tokenization

Tokenize the input text efficiently.

#### Base Model: Pegasus

In [None]:
inputs = tokenizer(src_text, max_length=512, padding=True, truncation=True, return_tensors="pt").to(device)

*Note: The quantized model created with PyTorch's dynamic quantization uses the same tokenized inputs as the base model.*

#### Distilled Model: DistilBart

In [None]:
inputs_dist = tokenizer_dist(src_text, max_length=512, padding=True, truncation=True, return_tensors="pt").to(device)

### Optimize Inference with Batch Processing

Use batch processing to optimize inference for multiple inputs.

In [None]:
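def predict(inputs):
    # Generate summaries for the whole batch in one call, without tracking gradients.
    with torch.inference_mode():
        outputs = model.generate(**inputs, num_beams=4, early_stopping=True, max_new_tokens=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)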
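
Batch processing means sending several documents through the model in one call instead of one call per document. A minimal sketch of the batching step itself, in plain Python, is shown below; the helper name `iter_batches` and the batch size are illustrative assumptions.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


documents = ["doc one", "doc two", "doc three", "doc four", "doc five"]
for batch in iter_batches(documents, batch_size=2):
    print(len(batch), batch)
```

Each batch can then be tokenized with `padding=True` so that its sequences share one length, and passed to the model in a single call. Larger batches generally improve throughput at the cost of more memory.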


### Optimize Inference with Dynamic Quantization

Use PyTorch's dynamic quantization on the Pegasus model.
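
In dynamic quantization, weights are stored as 8-bit integers ahead of time and activations are quantized on the fly during inference. The sketch below illustrates only the underlying arithmetic, using a simplified symmetric scheme in plain Python. PyTorch's actual implementation differs in its details, and the function names here are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
print(q_weights, scale)
print(restored)
```

Each weight now fits in one byte instead of four, and the restored values differ from the originals by at most half a quantization step. That rounding error is the source of any quality loss in the quantized model.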

### Optimize Inference with Distillation

Use DistilBart model to demonstrate applying a distilled model.

## Make Predictions

Use the optimized inference process to make predictions.

### Use Base Model: Pegasus model

In [None]:
outputs = model.generate(
            inputs["input_ids"],
            num_beams=4,
            early_stopping=True,
            max_new_tokens=128
        )

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(summaries)

### Use Quantized Pegasus Model

In [None]:
# The dynamically quantized model runs on the CPU, so move the input IDs there first.
quantized_outputs = quantized_model.generate(
            inputs["input_ids"].to("cpu"),
            num_beams=4,
            early_stopping=True,
            max_new_tokens=128
        )

quantized_summaries = tokenizer.batch_decode(quantized_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(quantized_summaries)

### Use Distilled Model: DistilBART for summarization

In [None]:
distilled_outputs = distilled_model.generate(
            inputs_dist["input_ids"],
            num_beams=4,
            early_stopping=True,
            max_new_tokens=128
        )

distilled_summaries = tokenizer_dist.batch_decode(distilled_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(distilled_summaries)

## Evaluate the Optimized Inference

Evaluate the performance of the optimized inference process.

### Measure Inference Time

Measure the time taken for inference before and after optimization.

In [None]:
def measure_inference_time(generate_fn, input_ids, n_runs=3):
    """Return the mean wall-clock time in seconds of n_runs generate calls."""
    with torch.inference_mode():
        # Warm-up call so one-time setup costs are not included in the timing.
        generate_fn(input_ids, num_beams=4, early_stopping=True, max_new_tokens=128)
        start_time = time.perf_counter()
        for _ in range(n_runs):
            generate_fn(input_ids, num_beams=4, early_stopping=True, max_new_tokens=128)
        if input_ids.is_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        end_time = time.perf_counter()
    return (end_time - start_time) / n_runs

# Time the base, quantized, and distilled models on the same document.
# The quantized model is timed on CPU inputs because PyTorch dynamic quantization targets CPU backends.
original_time = measure_inference_time(model.generate, inputs["input_ids"])
quantized_time = measure_inference_time(quantized_model.generate, inputs["input_ids"].to("cpu"))
distilled_time = measure_inference_time(distilled_model.generate, inputs_dist["input_ids"])

print(f"Original Inference Time: {original_time:.2f} seconds")
print(f"Quantized Inference Time: {quantized_time:.2f} seconds")
print(f"Distilled Inference Time: {distilled_time:.2f} seconds")



## Put it All Together: Optimize Inference

## Conclusion

In this lab, you learned how to:

- Load a pre-trained small language model.
- Optimize the inference process for faster and more efficient predictions.
- Evaluate the optimized inference process.


This simple inference optimization task demonstrates the basic workflow of using a small pretrained language model for optimized prediction and evaluation. 

You can extend this lab by using different models, optimization techniques, and evaluation metrics.