# LLM Inference: Introduction to Optimization and Efficiency Lab

## Introduction to Summarization

Text summarization is a natural language processing task: producing a shorter text that captures the most important information in a document or other sequence of text.

Summarization can take one of the following forms:

- Extractive summarization is the process of extracting the most relevant text from the document and using the relevant text to form a summary.
- Abstractive summarization is the process of generating new text that captures the most relevant information from the document. The generated summary may contain text that does not appear in the document.

Summarization is an example of a sequence-to-sequence task. Such tasks are typically handled by encoder-decoder models, which use both parts of the Transformer architecture. The encoder's attention layers have access to all the words of the input text, while the decoder's self-attention layers only have access to the words positioned before the target word in the sequence being generated.
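
The difference in what each side can see may be easier to picture as a visibility mask. The following is a minimal plain-Python sketch, not the Transformer implementation itself; the function name `visibility_mask` is hypothetical and introduced only for illustration.

```python
def visibility_mask(length: int, causal: bool) -> list[list[int]]:
    """Return mask[i][j] == 1 when position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(length)]
        for i in range(length)
    ]


# Encoder-style attention: every position sees every other position.
encoder_mask = visibility_mask(4, causal=False)

# Decoder-style self-attention: position i sees positions 0..i only.
decoder_mask = visibility_mask(4, causal=True)

for row in decoder_mask:
    print(row)
```

The decoder mask is lower triangular, which is what prevents a position from looking at tokens that have not been generated yet.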


## Objective 

In this lab, you will:

1. Understand the concept of inference optimization.
2. Learn techniques to optimize inference for machine learning models.
3. Implement and evaluate these optimization techniques.


## Set Up Your Environment



### Install Required Libraries 

Ensure you have the necessary libraries installed. You can install them using pip if they are not already installed.

### Import Libraries 

Import the necessary libraries for data manipulation, model loading, and optimization.

In [None]:
# Import libraries
import time
import torch
import pandas as pd

from transformers import (
    AutoTokenizer,
    PegasusForConditionalGeneration,
    BartForConditionalGeneration,
    PegasusTokenizerFast,
    BartTokenizerFast,
    BatchEncoding,
)

from llm_inference_lab.utils.download_models import download_models

from llm_inference_lab.utils.models import pegasus_model, distilbart_model

from llm_inference_lab.tools.torch.quantization import dynamic_quantization


The imports below are not used in this lab. They will be used in another lab focused on benchmarking and metrics.

In [None]:

# from llm_inference_lab.summarization.summarization import TextSummarizer

# from llm_inference_lab.utils.benchmark import measure_inference_time

### Download Models

Download the models for the Lab. Depending on your network connectivity, this may take longer than expected.

*Expected download time is approximately 2 minutes.*

More information about the models we will download for this lab can be found here:

- [google/pegasus-cnn_dailymail](https://huggingface.co/google/pegasus-cnn_dailymail) - This model provides an abstractive summary that is high in extractive coverage/density, which means the summaries returned are more extractive.
- [sshleifer/distilbart-cnn-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6) - This model is the product of a model compression technique known as distillation. Distillation is the process of transferring knowledge from a larger model, also referred to as the teacher, to a smaller model, also referred to as the student. This model provides an abstractive summary that is high in extractive coverage/density, which means the summaries returned tend to contain snippets of verbatim text from the input document (so they may resemble an extractive summary).


#### Default Hyperparameters for Models used in this Lab

**Default hyperparameters for Pegasus**

    Model Parameters:
        Tokenization:
            max_length - 512
            padding - True
            truncation - True

    Generation:
        Parameters that control Generation Strategy:
            num_beams - 4 (model default is 8)

        Parameters that control the length of output:
            min_length - 32
            max_length - 128
            early_stopping - True
            max_new_tokens - 128

        Parameters for manipulation of model output logits:
            length_penalty - 0.8
            no_repeat_ngram_size - 0 (default)

**Default hyperparameters for DistilBART**

    Model Parameters:
        Tokenization:
            max_length - 512
            padding - True
            truncation - True

    Generation:
        Parameters that control Generation Strategy:
            num_beams - 4

        Parameters that control the length of output:
            min_length - 56
            max_length - 142
            early_stopping - True
            max_new_tokens - 128

        Parameters for manipulation of model output logits:
            length_penalty - 2
            no_repeat_ngram_size - 3
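
As a hedged sketch, the defaults listed above could be collected into keyword-argument dictionaries and passed to the tokenizer and to `generate`. The dictionary names are hypothetical and are not part of the lab package. The generation `max_length` is left out on purpose: the lists above also set `max_new_tokens`, which recent `transformers` releases are expected to prefer when both are given.

```python
# Hypothetical containers for the defaults listed above (names are illustrative).
TOKENIZER_KWARGS = {"max_length": 512, "padding": True, "truncation": True}

PEGASUS_GENERATION_KWARGS = {
    "num_beams": 4,
    "min_length": 32,
    "max_new_tokens": 128,
    "early_stopping": True,
    "length_penalty": 0.8,
    "no_repeat_ngram_size": 0,
}

DISTILBART_GENERATION_KWARGS = {
    "num_beams": 4,
    "min_length": 56,
    "max_new_tokens": 128,
    "early_stopping": True,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 3,
}

# Assumed usage once a tokenizer and model are loaded (not executed here):
#   inputs = tokenizer(src_text, return_tensors="pt", **TOKENIZER_KWARGS)
#   outputs = model.generate(**inputs, **PEGASUS_GENERATION_KWARGS)
for name, value in PEGASUS_GENERATION_KWARGS.items():
    print(f"{name} = {value}")
```

Keeping the settings in one place makes it easier to compare runs when a single hyperparameter is changed.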

In [None]:
%%time

download_models(all=True)

## Load and Prepare Data

### Load Dataset 

For this lab, we will use the text labeled `src_text`.

For other examples of text, we will use the [Xsum dataset](https://www.kaggle.com/datasets/mdnaveedmmulla/xsumdataset?resource=download&select=xsum_test.csv), which is a classic dataset for summarization tasks. Whether you can use this dataset depends on your available resources. **When using local resources, the kernel may die. We recommend increasing your resources if you use the Xsum dataset or change any model hyperparameters.**

In [None]:
df = pd.read_csv("../data/xsum_validation.csv")


In [None]:
df.head()

In [None]:
df["document"].values[0]

We will proceed with `src_text`, shown below, for the remainder of this lab.

Original source of the text below: [Pegasus Usage Example](https://huggingface.co/docs/transformers/main/model_doc/pegasus#usage-example)

In [None]:
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]


*Note: Keep in mind that some models require text preprocessing before training, fine-tuning, or inference. This lab does not cover the specifics of text preprocessing techniques, but you should consider what kind of text preprocessing is needed to support the model you use.*
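
As one small illustration of what preprocessing can mean, the sketch below normalizes whitespace before tokenization. It is an assumption about a step that might be useful, not a requirement of the models in this lab; the helper name `normalize_text` is hypothetical.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


raw = "  PG&E stated it scheduled the blackouts\n in response to   forecasts. "
print(normalize_text(raw))
```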

## Load and Prepare the Model

### Load Pre-trained Model and Tokenizer 

For this lab, we will use the Pegasus CNN_Dailymail model, a pretrained language model, as the base model that we want to optimize for inference.



#### Load Base Model: Pegasus

In [None]:
model_name = pegasus_model.path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

#### Load Quantized Model using PyTorch's Dynamic Quantization

Dynamic quantization will only be applied to the Pegasus model.

In [None]:
quantized_model = dynamic_quantization(model)
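
The internals of the lab's `dynamic_quantization` helper are not shown here. As a rough intuition only, dynamic quantization stores weights as 8-bit integers plus a scale factor and converts back to floating point around each computation. The plain-Python sketch below illustrates that mapping with symmetric per-tensor quantization; the function names and the sample weights are made up for illustration and this is not PyTorch's implementation. Note also that PyTorch's dynamic quantization is generally aimed at CPU inference, so behavior on a GPU device may differ.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per value.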

#### Load Distilled Model: DistilBart

In [None]:
model_name_dist = distilbart_model.path

tokenizer_dist = AutoTokenizer.from_pretrained(model_name_dist)
distilled_model = BartForConditionalGeneration.from_pretrained(model_name_dist)

### Set device

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

Move each model to the selected device. The output of each cell shows the model architecture.

In [None]:
model.to(device)

In [None]:
quantized_model.to(device)

In [None]:
distilled_model.to(device)

## Optimize Inference

### Enable Model Evaluation Mode 

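Set the model to evaluation mode to disable dropout layers.

`model.eval()` - Models loaded with `from_pretrained` in transformers are already in evaluation mode, so this call is not required here. Plain PyTorch modules start in training mode, so the call is needed before inference with them.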
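
To show why evaluation mode matters, here is a toy dropout layer written in plain Python so that it runs without PyTorch. It is a simplified stand-in, not the `torch.nn.Dropout` implementation; the class name `ToyDropout` is hypothetical.

```python
import random


class ToyDropout:
    """Plain-Python stand-in for a dropout layer."""

    def __init__(self, p: float = 0.5, seed: int = 0):
        self.p = p
        self.training = True  # like PyTorch modules, start in training mode
        self._rng = random.Random(seed)

    def eval(self):
        """Switch to evaluation mode and return self, mirroring module.eval()."""
        self.training = False
        return self

    def __call__(self, values: list[float]) -> list[float]:
        if not self.training:
            return list(values)  # evaluation mode: pass through unchanged
        keep = 1.0 - self.p
        # Training mode: zero each value with probability p, rescale the rest.
        return [v / keep if self._rng.random() < keep else 0.0 for v in values]


layer = ToyDropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]
train_out = layer(x)         # random: some entries zeroed, the others doubled
eval_out = layer.eval()(x)   # deterministic: identical to the input

print(train_out)
print(eval_out)
```

In training mode the same input can produce different outputs on every call, which is why inference should run in evaluation mode.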

### Optimize Tokenization

Tokenize the input text efficiently.

#### Base Model: Pegasus

In [None]:
# DO NOT USE - MORE RESOURCE INTENSIVE

# inputs = tokenizer(src_text, max_length=512, padding=True, truncation=True, return_tensors="pt").to(device)

In [None]:
inputs = tokenizer(src_text, padding=True, truncation=True, return_tensors="pt").to(device)

*Note: The Quantized Model created using PyTorch's Dynamic Quantization uses the same tokenized inputs as the Base Model.*

#### Distilled Model: DistilBart

In [None]:
inputs_dist = tokenizer_dist(src_text, padding=True, truncation=True, return_tensors="pt").to(device)
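
The cost difference between padding strategies can be sketched without a real tokenizer. In Hugging Face tokenizers, `padding=True` pads to the longest sequence in the batch, while `padding="max_length"` pads every sequence to `max_length`. The helper below mimics both on made-up token ids; `pad_batch` and the pad id of 0 are assumptions for illustration only.

```python
def pad_batch(sequences, pad_id=0, max_length=None, pad_to_max_length=False):
    """Truncate and pad lists of token ids; return ids plus attention masks."""
    if pad_to_max_length and max_length is None:
        raise ValueError("max_length is required when pad_to_max_length=True")
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]  # truncation
    longest = max(len(seq) for seq in sequences)
    target = max_length if pad_to_max_length else longest
    return {
        "input_ids": [seq + [pad_id] * (target - len(seq)) for seq in sequences],
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [
            [1] * len(seq) + [0] * (target - len(seq)) for seq in sequences
        ],
    }


batch = [[5, 6, 7], [8, 9]]  # made-up token ids
dynamic = pad_batch(batch)   # pad to the longest sequence
fixed = pad_batch(batch, max_length=6, pad_to_max_length=True)

print(dynamic["input_ids"])
print(len(fixed["input_ids"][0]))
```

Padding to the longest sequence keeps tensors as small as the batch allows, whereas padding to a fixed maximum spends compute on padding positions.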

### Optimize Inference with Batch Processing

Use batch processing to optimize inference for multiple inputs.

We will not complete this step during the lab, but you can review the following modules for other ways to optimize inference via batch processing, distillation, quantization, and distributed processing with Ray.

- `from llm_inference_lab.summarization.summarization import TextSummarizer`
- `from llm_inference_lab.summarization.summarization_ray import TextSummarizer`
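
The modules above are not reproduced here. As a minimal sketch of the batching idea only, the helper below splits a list of documents into fixed-size batches; the name `iter_batches` is hypothetical, and the commented lines show how each batch might be fed to the tokenizer and model loaded earlier.

```python
from typing import Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


documents = [f"document {i}" for i in range(5)]  # placeholder inputs
batches = list(iter_batches(documents, batch_size=2))

for batch in batches:
    # Assumed usage with the objects from this lab (not executed here):
    #   inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
    #   outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"])
    print(len(batch), batch)
```

When a batch contains texts of different lengths, passing the attention mask lets the model ignore the padding tokens.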

## Generate Summaries

Use the optimized inference process to make predictions (generate responses).

### Use Base Model: Pegasus model

In [None]:
outputs = model.generate(inputs["input_ids"])

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(summaries)

### Use Quantized Pegasus Model

In [None]:
quantized_outputs = quantized_model.generate(inputs["input_ids"])

quantized_summaries = tokenizer.batch_decode(quantized_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(quantized_summaries)

### Use Distilled Model: DistilBART for summarization

In [None]:
distilled_outputs = distilled_model.generate(inputs_dist["input_ids"])

distilled_summaries = tokenizer_dist.batch_decode(distilled_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(distilled_summaries)

## Evaluate the Optimized Inference

Evaluate the performance of the optimized inference process.

### Measure Inference Time

Measure the time taken for inference before and after optimization.

In [None]:
def measure_inference_time(
    model: PegasusForConditionalGeneration | BartForConditionalGeneration,
    tokenizer: PegasusTokenizerFast | BartTokenizerFast,
    inputs: BatchEncoding,
) -> float:
    """Return the wall-clock latency, in seconds, of one generate-and-decode pass."""
    # Record start time (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()

    # Run inference
    outputs = model.generate(inputs["input_ids"])

    # Decode the generated token ids (included in the measured latency)
    _ = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Record end time and calculate latency
    end_time = time.perf_counter()
    return end_time - start_time

In [None]:
original_time = measure_inference_time(model, tokenizer, inputs)
optimized_quantized_time = measure_inference_time(quantized_model, tokenizer, inputs)
optimized_distilled_time = measure_inference_time(distilled_model, tokenizer_dist, inputs_dist)

print(f"Pegasus Model - Original Inference Time: {original_time:.2f} seconds")
print(f"Quantized Pegasus Model - Optimized Quantized Inference Time: {optimized_quantized_time:.2f} seconds")
print(f"DistilBART Model - Optimized Distilled Inference Time: {optimized_distilled_time:.2f} seconds")
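
A single timed call can be noisy, and the first call often includes one-time setup costs. The sketch below shows one possible way to add warm-up runs and repeat the measurement; the `benchmark` helper is hypothetical and uses a stand-in workload so that it runs anywhere. On a GPU, kernels run asynchronously, so a call such as `torch.cuda.synchronize()` before reading the clock would likely be needed for accurate numbers.

```python
import statistics
import time


def benchmark(fn, warmup: int = 1, runs: int = 5) -> dict:
    """Time `fn` over several runs after untimed warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "runs": runs,
    }


# Stand-in workload; in this lab it could be, for example,
#   lambda: measure_inference_time(model, tokenizer, inputs)
stats = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=1, runs=5)
print(f"mean={stats['mean']:.6f}s median={stats['median']:.6f}s min={stats['min']:.6f}s")
```

Reporting the median alongside the mean makes the comparison between models less sensitive to an occasional slow run.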

### Evaluate Prediction Accuracy

Evaluate the accuracy of the generated summaries using the test dataset (or a validation or hold-out set).

**For the purposes of this lab, we will not evaluate prediction accuracy.**
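
For readers who want a starting point outside the lab, summaries are commonly scored with ROUGE, which measures word overlap with a reference summary. The sketch below is a simplified ROUGE-1 F1 in plain Python; it assumes whitespace tokenization and lowercasing, skips the stemming and other steps of the official metric, and uses made-up example sentences.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the blackouts aim to reduce the risk of wildfires"  # made-up reference
candidate = "blackouts reduce the risk of wildfires"             # made-up summary
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.2f}")
```

Here all six candidate words appear in the nine-word reference, giving a precision of 1.0, a recall of about 0.67, and an F1 of 0.80.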

## Questions

Recall the presentation material and lab content. Spend time answering the questions below or be ready to discuss as a group.

**Questions to Ask Yourself:**
- When and why would you use a particular technique?
- What are ways to determine if inference optimization is necessary?
- What do we expect the output to be for the techniques used?
- What are ways we measure inference efficiency?


## Conclusion

In this lab, you learned how to:

- Load a pre-trained language model.
- Optimize the inference process for faster and more efficient predictions.
- Evaluate the optimized inference process.

This simple inference optimization task demonstrates the basic workflow of using a *smaller* pretrained language model for optimized prediction and evaluation. 

You can extend this lab by using different models, optimization techniques, and evaluation metrics.