# Introduction

This notebook is the first in a series aimed at optimizing large language model (LLM) deployment for efficient and scalable inference. We focus on inference using models from **Hugging Face**, a key source of cutting-edge LLMs, starting with the **Llama 3** model. We also explore **quantization**, which reduces memory usage, allowing models to run efficiently on smaller hardware.

Additionally, we introduce **vLLM**, a specialized framework for improving inference speed and memory handling, particularly useful in high-load scenarios involving large batches or long sequences. This notebook sets the foundation for future installments in the series, where we'll further refine performance strategies for various LLM use cases.

Finally, we build a simple **chatbot** as our first application, showcasing how these techniques can be applied in a real-world setting.


In [None]:
import gc
import json
import time

import torch
import vllm
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Loading Tokenizer and Model

### Using Llama 3

#### Conditions

Llama 3 can be used for commercial products, but there are certain requirements and restrictions to follow:

1. **Attribution**:  
   You must provide a clear and prominent acknowledgment, such as "Built with Meta Llama 3," in all relevant user interfaces, documentation, and webpages.

2. **User Threshold**:  
   If your product or service utilizing Llama 3 exceeds **700 million monthly active users**, you must obtain a separate, specific license from Meta.

3. **Restrictions on Enhancing Other Models**:  
   Llama 3 materials or outputs cannot be used to improve or train any other large language models outside the Llama family.

4. **Compliance**:  
   Users must ensure compliance with applicable laws and regulations, such as GDPR and trade compliance laws.

In conclusion, Llama 3 offers great flexibility for commercial applications, provided you adhere to these licensing terms and restrictions. See [Llama 3 Overview](https://ai.meta.com/static-resource/july-responsible-use-guide).


#### Steps

1. **Apply for model access**: Visit [Llama-3-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to request access to the model. Please note that it may take a few days for your application to be approved. Once approved, you will see the following message on the website:

    * **Gated model** You have been granted access to this model
   
2. **Create your Hugging Face Access Key**: Go to your [Hugging Face settings](https://huggingface.co/settings/tokens) to create an access token. When creating the token, ensure you check the box:

    * `Read access to contents of all public gated repos you can access` under **Permissions**.

3. **Provide your Hugging Face Access Key**: Once you have your access token, save it in `api_keys.json` as the value of the `HF_ACCESS_KEY` field so that the notebook can authenticate with Hugging Face.
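For reference, the login cell reads `api_keys.json` as a JSON object and looks up the `HF_ACCESS_KEY` entry. Below is a minimal sketch of creating and reading such a file. It assumes a placeholder value (`hf_xxx` is not a real token) and writes to a temporary directory so that no existing file is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder value for illustration only; substitute your own token.
example_keys = {"HF_ACCESS_KEY": "hf_xxx"}

with tempfile.TemporaryDirectory() as tmp_dir:
    key_file = Path(tmp_dir) / "api_keys.json"

    # Write the key file in the layout the notebook reads.
    key_file.write_text(json.dumps(example_keys, indent=2))

    # Read it back the same way the login cell does.
    with open(key_file, "r") as file:
        loaded_key = json.load(file).get("HF_ACCESS_KEY")

print(loaded_key)
```

Keeping the token in a separate file rather than in the notebook itself makes it easier to avoid committing it by accident.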


In [None]:
# Read HF_ACCESS_KEY into hf_access_key
with open("api_keys.json", "r") as file:
    hf_access_key = json.load(file).get("HF_ACCESS_KEY")

# Login to HuggingFace
login(hf_access_key)

### Quantization

Suppose you have access to **16 GB of GPU memory**, which is not enough to hold the full Llama 3 8B model in 16-bit precision alongside activations and the KV cache. With `device_map="auto"`, Hugging Face then offloads part of the model to CPU memory (or disk) and moves it onto the GPU as needed at runtime, which makes inference **extremely slow**.

To address this limitation, we **quantize** the model to 4 bits with the `bitsandbytes` library, which is integrated into Hugging Face Transformers through `BitsAndBytesConfig`. This significantly reduces GPU memory consumption, so the entire model fits on the GPU and inference no longer pays the cost of offloading.
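To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the weight memory at different precisions. It is only a sketch: the parameter count is an approximation, and the estimate ignores activations, the KV cache, quantization constants, and any layers that stay in higher precision.

```python
# Rough weight-memory estimate; the parameter count is an approximation.
NUM_PARAMS = 8e9  # assumed: roughly 8 billion parameters for an 8B model
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:   {nf4_gib:.1f} GiB")
```

Under these assumptions, the 16-bit weights alone come close to the full 16 GB, while the 4-bit weights need about a quarter of that, leaving room for activations and the KV cache.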


In [None]:
# Create a BitsAndBytesConfig for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Change this to `False` to disable quantization
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save extra memory
    bnb_4bit_quant_type='nf4',  # 4-bit NormalFloat (NF4) quantization
    bnb_4bit_compute_dtype=torch.float16  # Set compute dtype to float16 for faster inference
)

# Model name; you can switch to other causal language models on the Hugging Face Hub
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

Now, let's load the tokenizer and model.

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=quantization_config
)

Let’s check the GPU information and verify that all parts of the model are loaded onto the GPU.

In [None]:
def clear_info_gpu():
    """Run garbage collection, clear the CUDA cache, and print GPU memory info."""
    # Collect garbage first so that tensors without references are actually freed
    gc.collect()
    # Then release the cached memory blocks held by PyTorch
    torch.cuda.empty_cache()
    # Ensure all CUDA operations are finished before measuring memory (optional)
    torch.cuda.synchronize()
    # Print GPU info
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    allocated_memory = torch.cuda.memory_allocated(0) / 1024 ** 3
    free_memory = total_memory - allocated_memory
    print(f"Total GPU memory: {total_memory:.2f} GB")
    print(f"Allocated GPU memory: {allocated_memory:.2f} GB")
    print(f"Free GPU memory: {free_memory:.2f} GB")


# Check GPU info
clear_info_gpu()

# Check the device of each parameter of the model
for name, param in model.named_parameters():
    print(f"{name} is on device: {param.device}")

# Inference

Let's try the model.

In [None]:
# Generate chat response
def generate_response(prompt, max_new_tokens=30, temperature=0.2, num_beams=1):
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate the output ids
    outputs = model.generate(inputs.input_ids,
                             attention_mask=inputs['attention_mask'],  # Avoid warning
                             pad_token_id=tokenizer.eos_token_id,  # Avoid warning
                             max_new_tokens=max_new_tokens,  # Length of generation
                             temperature=temperature,  # Temperature for randomness
                             num_beams=num_beams  # Number of beams
                             )

    # Decode the output ids to a string
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Try it
generate_response("Hi mate!")

### Key Parameters in `model.generate()`

There are numerous arguments for `model.generate()`. Below, we highlight the most useful ones.

#### **Length Control**
- **`max_length`**: Specifies the maximum number of tokens for the entire sequence, including both input tokens (from the prompt) and generated tokens. The model stops generating once the total number of tokens reaches this limit.
- **`max_new_tokens`**: Defines the maximum number of new tokens that the model can generate, excluding the tokens from the input prompt. The model will generate up to this many tokens after receiving the input.
- **`eos_token_id`**: The ID of the end-of-sequence (EOS) token. The generation will stop once the model generates this token, marking the end of the sequence.

#### **Diversity and Quality**
- **`temperature`**: Controls the randomness of predictions by rescaling the token probabilities. It only takes effect when sampling is enabled (`do_sample=True`, passed explicitly or set in the model's generation config). Lower values (e.g., 0.7) make the model more deterministic, while higher values (e.g., 1.0 or above) increase randomness, making the outputs more diverse.
- **`top_k`**: Limits the next token selection to the top `k` most likely tokens. A higher value allows for more variety in the generated text, while a lower value makes it more deterministic.
- **`top_p` (nucleus sampling)**: Limits token selection to the smallest set of most likely tokens whose cumulative probability reaches `p`. For example, `top_p=0.9` keeps only the tokens that together cover 90% of the probability mass, promoting diverse but controlled generation.
- **`do_sample`**: Enables random sampling of tokens instead of greedy decoding (which selects the highest-probability token). This is essential for generating diverse outputs.
- **`num_beams`**: The number of beams for beam search. Higher values explore more candidate sequences during generation, which often yields higher-probability outputs at the cost of increased computation.

![beam](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)
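To build intuition for `temperature`, `top_k`, and `top_p`, the following sketch applies them to a toy distribution. It is a simplified illustration in plain Python, not the implementation used by `model.generate()`; the logits and the four-token vocabulary are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

sharp = softmax_with_temperature(toy_logits, temperature=0.2)
flat = softmax_with_temperature(toy_logits, temperature=1.5)

print("T=0.2:", [round(x, 3) for x in sharp])
print("T=1.5:", [round(x, 3) for x in flat])
print("top_k=2:", top_k_filter(flat, k=2))
print("top_p=0.8:", top_p_filter(flat, p=0.8))
```

A low temperature concentrates almost all probability on the most likely token, whereas a higher temperature spreads it out. The two filters then decide which of the remaining candidates may be sampled at all.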

#### **Repetition and Token Constraints**
- **`repetition_penalty`**: Penalizes repeated tokens, discouraging the model from generating repetitive sequences. A value greater than 1.0 reduces the likelihood of repeating the same token.
- **`no_repeat_ngram_size`**: Prevents repetition of n-grams of a specified size. For example, `no_repeat_ngram_size=3` ensures that trigrams do not repeat in the generated output.
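The two repetition controls can be sketched on toy data as follows. This is an illustrative simplification rather than the library code: the penalty rule (divide positive scores, multiply negative ones) reflects our understanding of how the penalty is commonly applied, and the token lists are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that have already been generated."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Assumed rule: both branches make the token less likely when penalty > 1.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


def banned_next_tokens(generated, ngram_size):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram being built.
    prefix = tuple(generated[len(generated) - ngram_size + 1:])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1], penalty=2.0)
print(penalized)

words = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(words, ngram_size=3))
```

In the second example, the sequence already contains the trigram "the cat sat", so after the second "the cat" the token "sat" is blocked.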

#### **Output Control**
- **`num_return_sequences`**: The number of different sequences to generate. For example, `num_return_sequences=3` generates three separate outputs from the same prompt.


### Batch inference

Now, we enhance our `generate_response()` function to support batch inference, which is crucial for serving multiple users in production environments. Additionally, we add functionality to measure the runtime of the inference process.

In [None]:
def generate_response(prompts, max_new_tokens=30, temperature=0.2, num_beams=1, measure_time=False):
    # Check if input is a single prompt or batch of prompts
    if isinstance(prompts, str):
        prompts = [prompts]  # Convert single prompt to a list for batch processing

    # Tokenize the batch of prompts (single or multiple)
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    # Start time measurement for model.generate() if requested
    start_time = time.time() if measure_time else None

    # Generate responses for the batch
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs['attention_mask'],  # Avoid warning
        pad_token_id=tokenizer.eos_token_id,  # Avoid warning
        max_new_tokens=max_new_tokens,  # Length of generation
        temperature=temperature,  # Temperature for randomness
        num_beams=num_beams  # Number of beams
    )

    # Measure time after generation
    if measure_time:
        end_time = time.time()
        runtime = end_time - start_time
    else:
        runtime = None

    # Decode the batch of generated outputs
    responses = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    # If the original input was a single prompt, return a single string, not a list
    if len(prompts) == 1:
        responses = responses[0]

    # Return both the response and runtime if time measurement was requested
    return (responses, runtime) if measure_time else responses

For batched inference, padding is necessary to ensure that all input sequences in a batch are of the same length. This allows the model to process multiple inputs in parallel. To set up padding correctly, we need to configure the tokenizer to handle padding. Specifically, we can assign a padding token (typically the `eos_token`) and set the padding side to ensure proper alignment of input sequences.
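The effect of the padding side can be seen on a toy batch. The sketch below pads lists of hypothetical token ids by hand and builds the matching attention masks; it only mimics what the tokenizer does with `padding=True` and is not the tokenizer's own code.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)  # 0 marks padding positions
        real = [1] * len(seq)          # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


toy_batch = [[11, 12, 13, 14], [21, 22]]  # hypothetical token ids
PAD_ID = 0                                # hypothetical pad token id

left_ids, left_mask = pad_batch(toy_batch, PAD_ID, side="left")
right_ids, right_mask = pad_batch(toy_batch, PAD_ID, side="right")

print("left: ", left_ids, left_mask)
print("right:", right_ids, right_mask)
```

With left padding, the last position of every row is a real prompt token, so a decoder-only model continues each prompt directly. With right padding, the shorter prompt would be followed by pad tokens before the generated text begins.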


In [None]:
# Set the pad token to the eos token 
tokenizer.pad_token = tokenizer.eos_token

# Set padding side to left for decoder-only models
tokenizer.padding_side = "left"

Now let's create a multi-input task, and measure its runtime.

In [None]:
# Batch input
many_prompts = [
    "Once upon a time in a distant land...",
    "Explain the theory of relativity in simple terms.",
    "What is the capital of France?",
    "Generate a story about a robot in the future.",
    "How do you bake a chocolate cake?"
]

# Batch inference
many_responses, t = generate_response(many_prompts, measure_time=True)

# Print results
for res in many_responses:
    print("---------------")
    print(res)
print("\n\nRuntime before acceleration:", t)

# Acceleration

In this section, we use **vLLM** to optimize inference through **paged attention**, as described in [this paper](https://arxiv.org/abs/2309.06180). The key idea of paged attention is to manage the key-value (KV) cache in small, fixed-size "pages" that are allocated on demand, instead of reserving one large, contiguous memory block per sequence. This reduces memory fragmentation and over-reservation, so more sequences fit in GPU memory at the same time. Through this approach, vLLM can greatly improve inference throughput, particularly for batched or parallel requests, while maintaining high performance on limited GPU resources.
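The bookkeeping behind this idea can be illustrated with a toy allocator. This is a conceptual sketch only, not vLLM's implementation: the block size, the per-request reservation, and the class and function names are all assumptions made for the example.

```python
import math

BLOCK_SIZE = 16     # assumed number of tokens per KV-cache block
MAX_SEQ_LEN = 2048  # assumed per-request reservation for contiguous allocation


def contiguous_slots(seq_lens, max_seq_len=MAX_SEQ_LEN):
    """Slots reserved when every request gets a full-length contiguous buffer."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each request only holds the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


class BlockAllocator:
    """Hands out fixed-size blocks from a shared pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def append_tokens(self, request_id, num_tokens_total, block_size=BLOCK_SIZE):
        """Grow a request's block table so it can hold num_tokens_total tokens."""
        table = self.block_tables.setdefault(request_id, [])
        needed = math.ceil(num_tokens_total / block_size)
        while len(table) < needed:
            table.append(self.free_blocks.pop(0))  # blocks need not be adjacent
        return table

    def release(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


seq_lens = [37, 120, 512, 9, 64]
used = sum(seq_lens)
print("contiguous waste:", contiguous_slots(seq_lens) - used)
print("paged waste:     ", paged_slots(seq_lens) - used)

allocator = BlockAllocator(num_blocks=8)
allocator.append_tokens("a", 20)  # request "a" needs 2 blocks
allocator.append_tokens("b", 10)  # request "b" needs 1 block
allocator.append_tokens("a", 40)  # "a" grows to 3 blocks, no longer adjacent
print(allocator.block_tables)
```

With paging, the wasted space per request is at most one partially filled block, and a growing sequence can pick up any free block rather than needing a larger contiguous region.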

If you encounter an error, it is likely that vLLM is not properly installed. In this case, set the following flag to `True` to skip all vLLM-related cells.

In [None]:
SKIP_VLLM_CELLS = False

Before proceeding, we release the GPU memory by deleting the previous model, ensuring efficient usage of resources on a small GPU.

In [None]:
if not SKIP_VLLM_CELLS:
    # Delete the model and free its GPU memory
    try:
        del model  # Deletes the model object
    except NameError:
        pass

    clear_info_gpu()

First, we load our Llama 3 model with the `vllm.LLM` class. The `gpu_memory_utilization` parameter is essential: it sets the fraction of GPU memory (between 0 and 1) that vLLM may use for model weights, activations, and the KV cache. Higher values leave more room for the KV cache, which improves throughput for longer sequences or larger batches. However, setting this value too high risks out-of-memory (OOM) errors, so it should be chosen based on the memory capacity of the GPU and on what else is running on it.
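A rough estimate shows how this setting translates into KV cache capacity. The sketch below is an approximation under stated assumptions: the architecture values are what we would expect for an 8B Llama-3-style model (verify them against the model's `config.json`), and the weight size and overhead figures are guesses for illustration, not measurements.

```python
GIB = 1024 ** 3

# Assumed architecture values for an 8B Llama-3-style model.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # float16 KV cache


def kv_bytes_per_token():
    """Bytes of KV cache needed for one token across all layers."""
    # The factor 2 covers one key vector and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_token_budget(total_gib, gpu_memory_utilization, weights_gib, overhead_gib=1.0):
    """Estimate how many tokens fit in the KV cache (overhead_gib is a guess)."""
    budget_gib = total_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * GIB // kv_bytes_per_token())


ASSUMED_WEIGHTS_GIB = 5.5  # illustrative size of the 4-bit weights, not measured

for utilization in (0.3, 0.5, 0.9):
    tokens = kv_cache_token_budget(16, utilization, ASSUMED_WEIGHTS_GIB)
    print(f"gpu_memory_utilization={utilization}: about {tokens} cached tokens")
```

If the budget comes out at zero, the chosen fraction does not even cover the weights and overhead, which is the situation in which loading the model fails.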

In [None]:
if not SKIP_VLLM_CELLS:
    # Use vllm to load model
    model = vllm.LLM(model=model_name,
                     skip_tokenizer_init=True,
                     quantization="bitsandbytes", load_format="bitsandbytes",
                     dtype="half", device="cuda",
                     gpu_memory_utilization=0.5,
                     max_seq_len_to_capture=1024)

    # Check GPU info
    clear_info_gpu()

Next, we adapt the generate function for `vllm.LLM`.

In [None]:
def generate_response_vllm(prompts, max_new_tokens=50, temperature=0.2, measure_time=False):
    # Check if input is a single prompt or batch of prompts
    if isinstance(prompts, str):
        prompts = [prompts]  # Convert single prompt to a list for batch processing

    # Tokenize each prompt without padding: vLLM batches requests itself,
    # and pad tokens would otherwise be treated as part of the prompt
    token_ids = tokenizer(prompts)["input_ids"]
    inputs = [vllm.TokensPrompt(prompt_token_ids=ids) for ids in token_ids]

    # Start time measurement for model.generate() if requested
    start_time = time.time() if measure_time else None

    # Generate responses for the batch
    outputs = model.generate(
        prompts=inputs,
        sampling_params=vllm.SamplingParams(max_tokens=max_new_tokens, temperature=temperature),
    )

    # Measure time after generation
    if measure_time:
        end_time = time.time()
        runtime = end_time - start_time
    else:
        runtime = None

    # Decode the batch of generated outputs
    responses = [tokenizer.decode(output.outputs[0].token_ids, skip_special_tokens=True) for output in outputs]

    # If the original input was a single prompt, return a single string, not a list
    if len(prompts) == 1:
        responses = responses[0]

    # Return both the response and runtime if time measurement was requested
    return (responses, runtime) if measure_time else responses

In [None]:
if not SKIP_VLLM_CELLS:
    # Batch inference
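    # Use the same 30-token budget as the Hugging Face run so the timings are comparable
    many_responses, t = generate_response_vllm(many_prompts, max_new_tokens=30, measure_time=True)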

    # Print results
    for res in many_responses:
        print("---------------")
        print(res)
    print("\n\nRuntime after acceleration:", t)

### Results

The results indicate that while vLLM significantly increases GPU memory usage, it does not provide faster inference speeds in our specific setup.

- **Increased memory usage**: This occurs because vLLM pre-allocates GPU memory to optimize execution. The pre-allocation is designed for handling large-scale workloads efficiently, which can lead to higher memory usage even when dealing with smaller models or batches.
  
- **No speedup observed**: The similar inference speeds are likely due to our small batch size and short sequence lengths. vLLM's paged attention is designed for large batches and long sequences, where efficient KV cache management lets more requests be processed concurrently. For small workloads, the overhead of this machinery can outweigh the potential speed gains.

#### When to use vLLM

You will benefit most from using vLLM if:

- You have a large model (e.g., a non-quantized model that requires significant memory) and a large GPU (or a GPU cluster).
- You are working with large batch sizes and/or long sequence lengths, where vLLM's paged attention can significantly improve speed and resource utilization.


# Chatbot

In this section, we build a simple chatbot as our first application. 

First, let's re-create the model using `from_pretrained`.

In [None]:
# Delete the model and free its GPU memory
try:
    del model  # Deletes the model object
except NameError:
    pass

clear_info_gpu()

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=quantization_config
)

LLMs do not retain memory between interactions. To illustrate this, consider the following example:

In [None]:
print(generate_response("What are the first three colors in the rainbow?"))
print("-------")
print(generate_response("What are the rest?"))

Clearly, the model fails to interpret what "the rest" refers to. This is because LLMs process each query independently and do not retain the context of previous interactions.

To address the limitation of LLMs not retaining memory between interactions, we can implement a **streamed history**. This technique involves maintaining a conversation history by appending previous user inputs and model responses to a single prompt. By doing so, we simulate context retention, allowing the model to handle follow-up questions more effectively.

We will also provide a **world prompt** to set the stage for the conversation, leading the chatbot to respond in a specific context or theme. This world prompt acts as an overarching guide for the interaction, helping maintain consistency throughout the conversation.
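The prompt layout used for the streamed history can be sketched as a small helper. The function name `build_prompt` and the representation of past turns as `(user, bot)` pairs are assumptions made for this illustration; the chatbot cell builds the same kind of string inline with f-strings.

```python
def build_prompt(world_prompt, turns, bot_name, user_input):
    """Join the world prompt, past turns, and the new user input into one prompt."""
    lines = [world_prompt.strip()]
    for user_text, bot_text in turns:
        lines.append(f"User: {user_text}")
        lines.append(f"{bot_name}: {bot_text}")
    lines.append(f"User: {user_input}")
    lines.append(f"{bot_name}:")  # the model continues from here
    return "\n".join(lines)


example_world_prompt = "Your name is Rachel. You are a knowledgeable assistant."
past_turns = [
    ("What are the first three colors in the rainbow?", "Red, orange, and yellow."),
]

prompt = build_prompt(example_world_prompt, past_turns, "Rachel", "What are the rest?")
print(prompt)
```

Because the earlier question and answer are part of the prompt, the follow-up "What are the rest?" now arrives with the context needed to interpret it.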


In [None]:
# Give your bot a name
bot_name = "Rachel"

# Initial world prompt to guide the conversation
world_prompt = f"""
Your name is {bot_name}. 
You are a knowledgeable assistant, capable of answering questions on a wide variety of topics. 
Assist the user with information, advice, or explanations based on their queries. 
You only generate your own answer. Do not continue to generate user input.
"""

# Initialize the conversation history with the world prompt
history = world_prompt

print(f"Chatbot {bot_name} initialized. Type 'exit' to end the conversation.\n")

# Loop for continuous user input
while True:
    # Get user input
    user_input = input("[You] ")

    # Exit condition
    if user_input.lower() == "exit":
        print("Ending the conversation.")
        break

    # Append the new user input to the history
    final_prompts = f"{history}\nUser: {user_input}\n{bot_name}:"

    # Generate chatbot response using the existing generate_response function
    response = generate_response(final_prompts, max_new_tokens=100)

    # Exclude the input tokens (prompt) from the response by slicing out the history part
    generated_text = response[len(final_prompts):]  # Only take the newly generated part

    # Only take the first line because further conversation may be generated
    if "\n" in generated_text:
        generated_text = generated_text.split("\n")[0]

    # Print the chatbot response without including the prompt
    print(f"[{bot_name}] {generated_text}\n")

    # Update conversation history with the user input and chatbot response
    history = f"{history}\nUser: {user_input}\n{bot_name}: {generated_text}"


Great! Our chatbot works as expected. However, as the conversation continues, the history accumulates, so **the prompt grows longer with every turn**. This gradually slows down inference and will eventually hit the model's context length limit. To address this, we can apply more advanced prompt engineering techniques, such as conversation chaining (e.g., in LangChain), which manage context by retaining only the most relevant parts of the conversation while discarding older, less relevant exchanges. This keeps the sequence length manageable and inference efficient over time.
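One simple way to bound the history is a sliding window that keeps only the most recent turns within a token budget. The sketch below is a minimal illustration under stated assumptions: `count_tokens` is a crude word-count stand-in for a real tokenizer, and `trim_turns` is a hypothetical helper, not part of any library.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_turns(turns, max_tokens):
    """Keep the most recent turns whose combined size fits within max_tokens."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


turns = [
    "User: What are the first three colors in the rainbow?",
    "Rachel: Red, orange, and yellow.",
    "User: What are the rest?",
    "Rachel: Green, blue, indigo, and violet.",
]

recent = trim_turns(turns, max_tokens=16)
print("\n".join(recent))
```

In the notebook, the word count could be replaced by the length of `tokenizer(text).input_ids`, and the world prompt would be kept outside the window so that it is always included.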

# More Exercises

* Experiment with different models available on Hugging Face to compare their performance and suitability for various tasks.
* Explore additional parameters in `model.generate()`, such as `temperature`, `top_k`, and `num_beams`, to understand their impact on the quality and creativity of responses, as well as the inference speed.
* Enhance the chatbot's memory management: when the chat history exceeds a certain length, prompt the LLM to summarize the conversation into a concise paragraph. This will help maintain context while reducing token usage and improving efficiency. 