# LangChain and Llama2: Prompt Templates and Batch GPU Inference

This notebook uses LangChain and local Llama2-Chat inference that can be run on consumer grade hardware. The following LangChain features explored are:
1) [LangChain Custom Prompt Template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template) for a Llama2-Chat model
2) [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)
3) [4-Bit Quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
4) [Batch GPU Inference](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines#batch-gpu-inference)

Llama2-chat was trained using the below [template](https://gpus.llm-utils.org/llama-2-prompt-template/) and should be prompted the same to best performance. 

**NOTE**: `<s>` is the beginning of sequence (bos) token.

**Llama2-Chat Prompt Template**
```
<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]
```

At the time of writing LangChain does not offer a Llama2-Chat prompt template but Custom Prompt Templates be created for such situations.

# User Inputs and Libraries

In [1]:
# Example context
EXAMPLE_CONTEXT = """ 
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
The Eiffel Tower is 1,083 ft tall.
""".strip()

# Llama2-Chat Prompt Template
llama2_template = """
<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]
""".strip()

# System Message
sys_template = """
You are a question generating assistant. 
Given a document, please generate a simple and short question based on the information provided.
The question can be a maximum of 10 words long.
""".strip()

# Human Message
human_template = """
DOCUMENT: {context}
QUESTION: Your question here.
""".strip()

# Path to Model
model_id = '/nvme4tb/Projects/llama2_models/Llama-2-13b-chat-hf'

In [2]:
# Import libraries and packages
import gc
import torch
from time import time
from torch import cuda, bfloat16
from transformers import (AutoConfig,
                          AutoTokenizer,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          pipeline)
from langchain.llms import HuggingFacePipeline
from langchain.prompts import (ChatPromptTemplate,
                               HumanMessagePromptTemplate,
                               SystemMessagePromptTemplate,
                               StringPromptTemplate)

# LangChain Standard Chat Prompt Templates

LangChain [SystemMessagePromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.SystemMessagePromptTemplate.html#langchain.prompts.chat.SystemMessagePromptTemplate) and [HumanMessagePromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.HumanMessagePromptTemplate.html#langchain.prompts.chat.HumanMessagePromptTemplate) are commonly used LangChain prompt templates; however, at the time of writing, are not optimized for Llama2-Chat.

The below cell shows the prompt template returned using these LangChain classes.

**REFERENCES**
- [LangChain Message Prompt Templates](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/msg_prompt_templates)

In [3]:
# Chat template
chat_template = ChatPromptTemplate.from_messages(
    [SystemMessagePromptTemplate.from_template(sys_template),
     HumanMessagePromptTemplate.from_template(human_template)])

# Invoke the chat template
chat_invoked = chat_template.invoke({'context': EXAMPLE_CONTEXT})

# Print the template that would be passed to the LLM
print(chat_invoked.to_string())

System: You are a question generating assistant. 
Given a document, please generate a simple and short question based on the information provided.
The question can be a maximum of 10 words long.
Human: DOCUMENT: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
The Eiffel Tower is 1,083 ft tall.
QUESTION: Your question here.


Notice the above prompt template does not match what is required for Llama2-chat. In the next section will implement a LangChain custom prompt template that can be used for Llama2-Chat.

# Custom Prompt Template

This section demonstrates how to create a [LangChain custom prompt template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template) for Llama2. The custom class could easily be modified to work with any LLM model of choice. Further, the input parameters (e.g., model_template, system_message, human_message) could be pointed to a prompt template databases for robust usage. 

**References**
- [LangChain - Custom Agent With Tool Retrieval](https://python.langchain.com/docs/modules/agents/how_to/custom_agent_with_tool_retrieval)
- [LangChain - Custom Prompt Template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template)
- [LangChain - StringPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.base.StringPromptTemplate.html#langchain.prompts.base.StringPromptTemplate)
- [Blog on LangChain and Llama2](https://www.mlexpert.io/prompt-engineering/langchain-quickstart-with-llama-2)


In [4]:
# Configure a class that defines a function to explain the source code of the given function
class Llama2ChatPromptTemplate(StringPromptTemplate):
    """
    Llama2-Chat prompt template customized for system and human messages
    """
    # Define templates
    model_template: str
    system_template: str
    user_template: str


    def __get_template(self,
                       model_template: str,
                       your_system_message: str,
                       user_message_1: str) -> str:
        """
        Insert the System and User Messages into the Model Prompt Template

        Args:
            model_template (str): Model prompt template (e.g. Llama2-Chat)
            your_system_message (str): System Message with placeholders for examples, etc.
            user_message_1 (str): User message with placeholders for context, etc.

        Returns:
            str: Prompt template with placeholders (context, documents, examples, etc.)
        """
        # Insert system message into model template, then insert human message
        template = model_template.replace('{your_system_message}', your_system_message)
        template = template.replace('{user_message_1}', user_message_1)
        return template


    def format(self, **kwargs) -> str:
        """
        LangChain method for formatting the template
        """
        # Create a prompt template with placeholder for context
        PROMPT = self.__get_template(model_template=self.model_template,
                                     your_system_message=self.system_template,
                                     user_message_1=self.user_template)
        
        # Generate the prompt to be sent to the llm
        prompt = PROMPT.format(context=kwargs['context'])
        return prompt

# Initialize Prompt Template
llama2_prompt_template = Llama2ChatPromptTemplate(model_template=llama2_template,
                                                  system_template=sys_template,
                                                  user_template=human_template,
                                                  input_variables=["context"])

# Create the prompt using example context
prompt = llama2_prompt_template.format(context=EXAMPLE_CONTEXT)

# Print the template that would be passed to the LLM
print(prompt)

<s>[INST] <<SYS>>
You are a question generating assistant. 
Given a document, please generate a simple and short question based on the information provided.
The question can be a maximum of 10 words long.
<</SYS>>

DOCUMENT: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
The Eiffel Tower is 1,083 ft tall.
QUESTION: Your question here. [/INST]


The above template satisfies the Llama2-Chat format. Again, the above class can be modified to suit any LLM.

# Chat with Llama2

The following will form a LangChain [Chain](https://python.langchain.com/docs/modules/chains/) using their [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/expression_language/). First, a `text-generation` model will be configured using 4-bit quantization and then the the above prompt template will be chained to the model. 

This will provide a chat model that can be invoked and queries will be generated.

**REFERENCES**
- [LangChain HuggingFace Pipeline](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)

In [5]:
# Select the device
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Model
model_config = AutoConfig.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval() # set to evaluation for inference only

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Transformers pipeline
pipe = pipeline(
    model=model, 
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    # stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    # max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

# LangChain Hugging Face Pipeline
hf = HuggingFacePipeline(pipeline=pipe)

# Create a chain
chain = llama2_prompt_template | hf

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
# View GPU Memory
!nvidia-smi

Sun Nov 26 13:40:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   62C    P2   116W / 350W |   7383MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 30%   60C    P2   120W / 350W |   9031MiB / 24576MiB |      0%      Default |
|       

In [7]:
# Invoke the chain and get a response from Llama2
print(chain.invoke({"context": EXAMPLE_CONTEXT}))

  Sure! Here's a simple and short question based on the information provided:

"What is the height of the Eiffel Tower in Paris?"


As shown above a question is returned based on the `EXAMPLE_CONTEXT` provided to Llama2.

# Generate Context-Query Pairs using Batch GPU Inference

The HuggingFacePipeline can be used to Batch GPU Inferences and it will be demonstrated on a subset of the [Stanford Question Answering Dataset squad_v2](https://huggingface.co/datasets/squad_v2).

## Load SquadV2 Data

In [8]:
from datasets import load_dataset
from pathlib import Path
from types import SimpleNamespace

# Specify paths to data, prompt templates, llama model, etc.
paths = {'base_dir': Path.cwd().parents[0],
         'prompt_vanilla': 'notebooks/question-answering-prompts/vanilla.txt',
         'prompt_gbq': 'notebooks/question-answering-prompts/gbq.txt',
         'squad_data': 'data/squad_v2',
         'model': '/nvme4tb/Projects/llama2_models/Llama-2-13b-chat-hf',
         }

# Number of context samples for experimentation
NUM_SAMPLES = 100

# Convert from dictionary to SimpleNamespace
paths = SimpleNamespace(**paths)

# Load squad_v2 data locally from disk
df = load_dataset(str(paths.base_dir / paths.squad_data),
                  split='train').to_pandas()

# Remove redundant context
df = df.drop_duplicates(subset=['context', 'title']).reset_index(drop=True)

# Randomly select 50 contexts
df = df.sample(n=NUM_SAMPLES, random_state=42)[['id', 'context', 'question']]

# Print Info.
print(f'df.shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')

df.shape: (100, 3)
Columns: ['id', 'context', 'question']


In [9]:
# Update the hf pipeline batch size
hf.batch_size = 50

# Create a chain
chain_batch = llama2_prompt_template | hf.bind(stop=['\n\n'])

In [10]:
# Place the SquadV2 context in a list of dictionaries
contexts = []
for context in df.context.tolist():
    contexts.append({'context': context})

In [11]:
# Start time
st = time()

# GPU Batch Inference
queries = chain_batch.batch(contexts)

# Total time to generate the queries
total_secs = time() - st
secs_per_sample = (total_secs / NUM_SAMPLES)
print(f'Total Time to Generate {NUM_SAMPLES} Queries: {(total_secs / 60):.1f} mins.')
print(f'Avg. Amount of Seconds Per Sample: {secs_per_sample:.1f}')

Total Time to Generate 100 Queries: 4.5 mins.
Avg. Amount of Seconds Per Sample: 2.7


In [12]:
# Assign to dataframe and strip any whitespaces
df['query'] = [x.strip() for x in queries]

# View a few examples
for ii in range(3):
    print(f'Example # {ii + 1}')
    print(f'Context: {df.iloc[ii].context}')
    print(f'Synthetic Query: {df.iloc[ii].question}\n')

Example # 1
Context: The Oklahoma City Police Department, has a uniformed force of 1,169 officers and 300+ civilian employees. The Department has a central police station and five substations covering 2,500 police reporting districts that average 1/4 square mile in size.
Synthetic Query: How many substations does Oklahoma city have?

Example # 2
Context: The U.S. Federal Reserve and central banks around the world have taken steps to expand money supplies to avoid the risk of a deflationary spiral, in which lower wages and higher unemployment lead to a self-reinforcing decline in global consumption. In addition, governments have enacted large fiscal stimulus packages, by borrowing and spending to offset the reduction in private sector demand caused by the crisis. The U.S. Federal Reserve's new and expanded liquidity facilities were intended to enable the central bank to fulfill its traditional lender-of-last-resort role during the crisis while mitigating stigma, broadening the set of in

This notebook showcased several LangChain features and is generic so other models or prompt templates can be easily integrated for your application.