# LangChain and Llama2: Question-Answer Generation with Output Parser

This notebook uses LangChain with local Llama2-Chat inference that can run on consumer-grade hardware.


The following LangChain features are explored:
1) [LangChain Custom Prompt Template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template) for a Llama2-Chat model
2) [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)
3) [4-Bit Quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
4) [Batch GPU Inference](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines#batch-gpu-inference)
5) [PydanticOutputParser](https://python.langchain.com/docs/modules/model_io/output_parsers/pydantic)
6) [OutputFixingParser](https://python.langchain.com/docs/modules/model_io/output_parsers/output_fixing_parser)

## Key Concepts: 
- This notebook is tailored to running LangChain with local LLMs. At the time of writing, many LangChain examples use OpenAI models, but local LLMs pose considerable challenges, which this notebook addresses. For example, local LLMs typically have fewer parameters than OpenAI models and therefore tend to produce lower-quality responses (e.g., when asked to follow a data structure).
- [LangChain Output Parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/) - many LLM applications ultimately require structured responses (e.g., JSON, markdown) and output parsers are used to return the required format. This notebook shows how to obtain the required structured responses and then run these chains in batch mode. This is helpful because parsing errors are handled in a custom manner.
- [Pydantic is all you need: Jason Liu](https://www.youtube.com/watch?v=yj-wSRJwrrc) - a YouTube video describing the benefits of using Pydantic parsers and data validation models.

# User Inputs and Libraries

In [1]:
# Example context
EXAMPLE_CONTEXT = """ 
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
The Eiffel Tower is 1,083 ft tall.
""".strip()

# Llama2-Chat Prompt Template
llama2_template = """
<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]
""".strip()

# System Message
sys_template = """
You are a question and answer generating assistant. 
Given context, please generate a question and answer based on the information provided.
Use only information from the context to answer the question.
The question and answer should both be single sentences and no longer than 15 words.

{format_instructions}
""".strip()

# Human Message
human_template = """
Context information is below. 
---------------------
{context}
---------------------

{format_instructions}
""".strip()

# Output Parser template to format instructions
parser_template = """
It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{{
	"QUESTION": string  // Your question generated using only information from the context?,
	"ANSWER": string  // Your answer to the generated question.
}}
```
""".strip()

# Path to Model
model_id = '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf'

In [2]:
# Import libraries and packages
import os, sys
import gc
import torch
from time import time
from torch import cuda, bfloat16
from transformers import (AutoConfig,
                          AutoTokenizer,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          pipeline)
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

# Custom Modules
sys.path.append(os.getenv('SRC_DIRECTORY'))
from src.lc_output_parsers import (insert_templates,
                                   QAOutputFixingParser,
                                   QuestionAnswerOutputParser)

# Custom Output Parser - Inherits Pydantic Base Model

The [QuestionAnswerOutputParser](../src/lc_output_parsers.py) is a custom output parser that demonstrates how to modify the existing [PydanticOutputParser](https://python.langchain.com/docs/modules/model_io/output_parsers/pydantic) to better handle parsing of Llama2-Chat responses for question and answer.

In [3]:
# Custom question-answer parser
parser = QuestionAnswerOutputParser(parser_template=parser_template)

# View the data structure formatting instructions 
print(parser.get_format_instructions())

It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{
	"QUESTION": string  // Your question generated using only information from the context?,
	"ANSWER": string  // Your answer to the generated question.
}
```
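The source of `QuestionAnswerOutputParser` is not shown in this notebook, so the following is only a minimal sketch of the general idea, not the actual implementation: find the `{...}` span in a chatty Llama2-Chat response, load it as JSON, and check the two expected keys. The names `QAPair` and `parse_qa` are hypothetical, and a plain dataclass stands in for the Pydantic model so the example runs with only the standard library.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    """Hypothetical stand-in for the Pydantic model used by the parser."""
    QUESTION: str
    ANSWER: str


def parse_qa(text: str) -> QAPair:
    """Extract a JSON object from an LLM response and validate its fields."""
    # Llama2-Chat often wraps the JSON in pleasantries or a code fence,
    # so search from the first "{" to the last "}" instead of parsing
    # the whole response.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the response.")
    data = json.loads(match.group(0))  # raises json.JSONDecodeError if malformed
    for key in ("QUESTION", "ANSWER"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"Missing or non-string field: {key}")
    return QAPair(QUESTION=data["QUESTION"], ANSWER=data["ANSWER"])


response = """Sure! Here's a question and answer:

{
"QUESTION": "What is the height of the Eiffel Tower?",
"ANSWER": "The Eiffel Tower is 1,083 ft tall."
}
"""
qa = parse_qa(response)
print(qa.QUESTION)
```

A real Pydantic model would add type coercion and richer error messages; the extraction step is the part that makes chatty local-model output parseable.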


# Custom Question-Answer Data Generation Prompt Template

This section demonstrates how to create a [LangChain custom prompt template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template) for Llama2. The custom class could easily be modified to work with any LLM of choice. Further, the input parameters (e.g., model_template, system_message, human_message) could be pointed to a prompt template database for more robust usage.

**References**
- [LangChain - Custom Agent With Tool Retrieval](https://python.langchain.com/docs/modules/agents/how_to/custom_agent_with_tool_retrieval)
- [LangChain - Custom Prompt Template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/custom_prompt_template)
- [LangChain - StringPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.base.StringPromptTemplate.html#langchain.prompts.base.StringPromptTemplate)
- [Blog on LangChain and Llama2](https://www.mlexpert.io/prompt-engineering/langchain-quickstart-with-llama-2)
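
`insert_templates` lives in the custom `src.lc_output_parsers` module and its source is not shown here. As a rough sketch of what such a helper might do (an assumption, not the actual implementation), the function below substitutes the system and human templates into the model template with `str.replace`, leaving the `{context}` and `{format_instructions}` placeholders untouched for `PromptTemplate` to fill later. The name `insert_templates_sketch` is hypothetical.

```python
def insert_templates_sketch(model_template: str, **sections: str) -> str:
    """Fill each `{name}` slot in `model_template` with the matching section."""
    filled = model_template
    for name, text in sections.items():
        # str.replace is used instead of str.format so that placeholders
        # inside the inserted sections (e.g. {context}) are left as-is.
        filled = filled.replace("{" + name + "}", text)
    return filled


model_template = (
    "<s>[INST] <<SYS>>\n{your_system_message}\n<</SYS>>\n\n{user_message_1} [/INST]"
)
system = "You generate a question and answer.\n\n{format_instructions}"
human = "Context information is below.\n{context}"

combined = insert_templates_sketch(
    model_template,
    your_system_message=system,
    user_message_1=human,
)
print(combined)
```

The combined string still contains `{context}` and `{format_instructions}`, which is what allows it to be used as a template with those two variables.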


In [4]:
# Create a Llama2 template using LangChain PromptTemplate
llama_template = insert_templates(model_template=llama2_template,
                                  your_system_message=sys_template,
                                  user_message_1=human_template)


# Initialize Prompt Template
llama2_prompt_template = PromptTemplate(
    template=llama_template,
    input_variables=['context'],
    partial_variables={
        "format_instructions": parser.get_format_instructions()},
)
del llama_template

# Create the prompt using example context
prompt = llama2_prompt_template.format(context=EXAMPLE_CONTEXT)

# Print the template that would be passed to the LLM
print(prompt)

<s>[INST] <<SYS>>
You are a question and answer generating assistant. 
Given context, please generate a question and answer based on the information provided.
Use only information from the context to answer the question.
The question and answer should both be single sentences and no longer than 15 words.

It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{
	"QUESTION": string  // Your question generated using only information from the context?,
	"ANSWER": string  // Your answer to the generated question.
}
```
<</SYS>>

Context information is below. 
---------------------
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
The Eiffel Tower is 1,083 ft tall.
---------------------

It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{
	"QUESTION": stri
```

The above template satisfies the Llama2-Chat format. Again, the above class can be modified to suit any LLM.

# Model 

The following forms a LangChain [Chain](https://python.langchain.com/docs/modules/chains/) using the [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/expression_language/). First, a `text-generation` model is configured with 4-bit quantization, and then the above prompt template is chained to the model.

This provides a chat model that can be invoked to generate queries.

**References**
- [LangChain HuggingFace Pipeline](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)

In [5]:
# Select the device
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Model
model_config = AutoConfig.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval() # set to evaluation for inference only

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Transformers pipeline
pipe = pipeline(
    model=model, 
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    # stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs (lower is more deterministic); only applies when sampling is enabled (do_sample=True)
    max_new_tokens=200,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

# LangChain Hugging Face Pipeline
hf = HuggingFacePipeline(pipeline=pipe)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
# View GPU vRAM
!nvidia-smi

Sat Nov 25 15:49:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 32%   58C    P2   143W / 350W |   2529MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   47C    P2   140W / 350W |   3163MiB / 24576MiB |      1%      Default |
|       

In [7]:
# Create a chain
chain = llama2_prompt_template | hf

# Invoke the chain and get a response from Llama2
output = chain.invoke({"context": EXAMPLE_CONTEXT})
print(output)

# Parse the output
print(f'\nOutput after Passing Through Custom Parser')
parser.parse(output)

  Sure! Here's a question and answer based on the provided context:

{
"QUESTION": "What is the height of the Eiffel Tower?",
"ANSWER": "The Eiffel Tower is 1,083 feet (330 meters) tall."
}

Output after Passing Through Custom Parser


Llama2QuestionAnswer(QUESTION='What is the height of the Eiffel Tower?', ANSWER='The Eiffel Tower is 1,083 feet (330 meters) tall.')
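
The `|` operator in `llama2_prompt_template | hf` composes steps so that each step's output becomes the next step's input. The toy class below illustrates that idea with Python's `__or__` operator overloading; it is not LangChain's implementation, and `Step` along with the stand-in prompt, model, and parser steps are hypothetical.

```python
from typing import Any, Callable


class Step:
    """Toy runnable: wraps a function and supports `a | b` composition."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Feed this step's output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the prompt template, the LLM, and the output parser.
toy_prompt = Step(lambda inputs: f"[INST] {inputs['context']} [/INST]")
toy_llm = Step(lambda text: text + ' {"QUESTION": "Q?", "ANSWER": "A."}')
toy_parser = Step(lambda text: text[text.index("{"):])

toy_chain = toy_prompt | toy_llm | toy_parser
result = toy_chain.invoke({"context": "The Eiffel Tower is in Paris."})
print(result)
```

The same left-to-right reading applies to the real chain: the dictionary goes into the prompt template, the formatted prompt goes into the model, and the generated text comes out.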

# Custom Auto-fixing Output Parser

This output parser wraps another output parser, and in the event that the first one fails it calls out to another LLM to fix any errors.

Rather than simply raising an error, the misformatted output can be passed, along with the format instructions, to the model with a request to fix it.

**NOTE**: The default LangChain `PydanticOutputParser` wrapped in the `OutputFixingParser` was tested (not shown here) and it was not capable of fixing the misformatted output from Llama2-Chat. Therefore, a custom [QAOutputFixingParser](../src/lc_output_parsers.py) is presented here that is more effective for this use case.

In [8]:
# Llama2-Chat prompt template to fix incorrect json formatting
output_fixing_template = """
<s>[INST] <<SYS>>
You are a formatting expert and the Context must be re-formatted to meet the below Instructions.

{instructions}
<</SYS>>

Context is provided below. 
---------------------
{completion}
--------------------- [/INST]

No additional responses or wordings is allowed. Only respond in the Instructions format.
""".strip()

# Custom Output Fixing Parser for Question Answering
fixing_parser = QAOutputFixingParser.from_llm(
    parser=parser,
    prompt=PromptTemplate.from_template(output_fixing_template),
    llm=hf,
    max_retries=1,
)
# Turn on to view the retry chain instructions
fixing_parser.retry_chain.verbose = True

### Demo GOOD Output

The output below does not need to be sent to the `QAOutputFixingParser` because it is formatted correctly.

In [9]:
# Simple example of correctly formatted output
good_output = """
{"QUESTION": "This is a question?",
"ANSWER": "Sounds good."}
""".strip()
fixing_parser.parse(good_output)

Llama2QuestionAnswer(QUESTION='This is a question?', ANSWER='Sounds good.')

### Demo Incorrectly Formatted Output

The output below is slightly misformatted and is sent to the `QAOutputFixingParser` to fix its formatting.

The formatting errors are: 1) a missing comma, and 2) missing quotes around the keys and values.

*Notice* the new template passed to the LLM, which asks it to fix the JSON formatting issues.

In [10]:
# The formatting errors below are: 1) a missing comma, 2) missing quotes around keys and values.
bad_output = """ 
{QUESTION: This is a question?
ANSWER: Sounds good.}
""".strip()

# Pass into the fixing parser
output = fixing_parser.parse(bad_output)
print(output)



> Entering new LLMChain chain...
Prompt after formatting:
<s>[INST] <<SYS>>
You are a formatting expert and the Context must be re-formatted to meet the below Instructions.

It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{
	"QUESTION": string  // Your question generated using only information from the context?,
	"ANSWER": string  // Your answer to the generated question.
}
```
<</SYS>>

Context is provided below. 
---------------------
{QUESTION: This is a question?
ANSWER: Sounds good.}
--------------------- [/INST]

No additional responses or wordings is allowed. Only respond in the Instructions format.

> Finished chain.
QUESTION='This is a question?' ANSWER='Sounds good.'


### Demo WRONGLY Formatted Output

The output below is completely unacceptable and cannot be fixed by `QAOutputFixingParser` because it contains no information.

In this instance we want to return a fixed string (i.e., NULL), which demonstrates error handling.

In [11]:
# Very bad response with no results
bad_output = """ """.strip()

# Pass into the fixing parser to get response when it CANNOT be fixed
output = fixing_parser.parse(bad_output)
print(output)



> Entering new LLMChain chain...
Prompt after formatting:
<s>[INST] <<SYS>>
You are a formatting expert and the Context must be re-formatted to meet the below Instructions.

It is mandatory the output be a markdown code snippet formatted in the following schema:

```json
{
	"QUESTION": string  // Your question generated using only information from the context?,
	"ANSWER": string  // Your answer to the generated question.
}
```
<</SYS>>

Context is provided below. 
---------------------

--------------------- [/INST]

No additional responses or wordings is allowed. Only respond in the Instructions format.

> Finished chain.
QUESTION='NULL?' ANSWER='NUll'


As shown above, when the output cannot be fixed, placeholder `NULL` values are returned for the question and answer instead of an error being raised.
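
The retry-then-fallback behavior demonstrated in this section can be sketched in a few lines. This is an illustration under assumptions, not the code of `QAOutputFixingParser` (whose source is not shown here): `parse_with_fixing`, `strict_parse`, and `fake_fixer` are hypothetical names, `fake_fixer` stands in for the LLM call that re-formats the text, and the `NULL` placeholder dictionary is an assumed fallback value.

```python
import json
from typing import Callable, Dict

FALLBACK = {"QUESTION": "NULL", "ANSWER": "NULL"}  # assumed placeholder value


def strict_parse(text: str) -> Dict[str, str]:
    """Parse strict JSON and require both keys; raise ValueError otherwise."""
    data = json.loads(text)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(data, dict) or not {"QUESTION", "ANSWER"} <= data.keys():
        raise ValueError("Missing QUESTION or ANSWER.")
    return data


def parse_with_fixing(text: str,
                      fixer: Callable[[str], str],
                      max_retries: int = 1) -> Dict[str, str]:
    """Try to parse; on failure ask `fixer` to repair the text and retry."""
    for attempt in range(max_retries + 1):
        try:
            return strict_parse(text)
        except ValueError:
            if attempt < max_retries:
                text = fixer(text)  # in the notebook this would be an LLM call
    return dict(FALLBACK)  # give up without raising


def fake_fixer(text: str) -> str:
    """Stand-in for the LLM: only knows how to repair one specific input."""
    if "This is a question?" in text:
        return '{"QUESTION": "This is a question?", "ANSWER": "Sounds good."}'
    return text


fixed = parse_with_fixing("{QUESTION: This is a question?\nANSWER: Sounds good.}", fake_fixer)
unfixable = parse_with_fixing("", fake_fixer)
print(fixed)
print(unfixable)
```

Returning a placeholder instead of raising keeps a batch run from aborting on a single bad response; the placeholder rows can be filtered out afterwards.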

# Generate Synthetic Context-Query-Answer Results using Batch GPU Inference

The below demonstrates how to generate synthetic datasets for IR and RAG evaluation of custom documents / text.

Batch GPU inference is used with all the previously demonstrated custom prompts and output parsers.

The demonstration data is a subset of the [Stanford Question Answering Dataset squad_v2](https://huggingface.co/datasets/squad_v2).

## Load SquadV2 Data

In [12]:
from datasets import load_dataset
from pathlib import Path
from types import SimpleNamespace

# Specify paths to data, prompt templates, llama model, etc.
paths = {'base_dir': Path.cwd().parents[0],
         'prompt_vanilla': 'notebooks/question-answering-prompts/vanilla.txt',
         'prompt_gbq': 'notebooks/question-answering-prompts/gbq.txt',
         'squad_data': 'data/squad_v2',
         }

# Convert from dictionary to SimpleNamespace
paths = SimpleNamespace(**paths)

# Load squad_v2 data locally from disk
df = load_dataset(str(paths.base_dir / paths.squad_data),
                  split='train').to_pandas()

# Remove redundant context
df = df.drop_duplicates(subset=['context', 'title']).reset_index(drop=True)

# Number of context samples for experimentation
NUM_SAMPLES = 10

# Randomly select contexts
df = df.sample(n=NUM_SAMPLES, random_state=42)[['id', 'context', 'question', 'answers']]

# Print Info.
print(f'df.shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')
display(df.head())

df.shape: (10, 4)
Columns: ['id', 'context', 'question', 'answers']


| | id | context | question | answers |
|---|---|---|---|---|
| 2141 | 56df5cdd96943c1400a5d438 | The Oklahoma City Police Department, has a uni... | How many substations does Oklahoma city have? | {'text': ['5'], 'answer_start': [182]} |
| 18339 | 5733823bd058e614000b5c03 | The U.S. Federal Reserve and central banks aro... | What have central banks around the world done ... | {'text': ['expand money supplies'], 'answer_st... |
| 980 | 56d37ac659d6e414001464d5 | The two finalists were Kris Allen and Adam Lam... | Who were the final two contestants on season e... | {'text': ['Kris Allen and Adam Lambert'], 'ans... |
| 326 | 56cddc2a62d2951400fa690a | Christoph Waltz was cast in the role of Franz ... | Who did Christoph Waltz portray in Spectre? | {'text': ['Franz Oberhauser'], 'answer_start':... |
| 12083 | 5727cb4a4b864d1900163d30 | Detroit and the rest of southeastern Michigan ... | What body of water affects Detroit's climate? | {'text': ['Great Lakes'], 'answer_start': [119]} |


## Batch Inference with Chaining

In [13]:
# Place the SquadV2 context in a list of dictionaries
contexts = []
for context in df.context.tolist():
    contexts.append({'context': context})

# Update the hf pipeline batch size
hf.batch_size = 50

# Turn off verbose in the retry chain instructions
fixing_parser.retry_chain.verbose = False

# Create a Chain
chain = llama2_prompt_template | hf | fixing_parser

# Start time
st = time()

# GPU Batch Inference
results = chain.batch(contexts)

# Total time to generate the queries
total_secs = time() - st
secs_per_sample = (total_secs / NUM_SAMPLES)
print(f'Total Time to Generate {NUM_SAMPLES} Queries: {(total_secs / 60):.1f} mins.')
print(f'Avg. Amount of Seconds Per Sample: {secs_per_sample:.1f}')

Total Time to Generate 10 Queries: 0.8 mins.
Avg. Amount of Seconds Per Sample: 4.9
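
Setting `hf.batch_size = 50` means up to 50 prompts are grouped per pipeline call, so the 10 prompts here fit in a single batch. As a simple illustration of that grouping (the helper `chunk` is hypothetical and is not taken from LangChain), a list of inputs can be split like this:

```python
from typing import Dict, Iterator, List


def chunk(items: List[Dict[str, str]],
          batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


demo_contexts = [{"context": f"document {i}"} for i in range(10)]

sizes_large = [len(batch) for batch in chunk(demo_contexts, batch_size=50)]
sizes_small = [len(batch) for batch in chunk(demo_contexts, batch_size=4)]
print(sizes_large)
print(sizes_small)
```

A smaller batch size lowers peak vRAM usage at the cost of more pipeline calls, so the value is worth tuning to the available GPU memory.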


# Results and Example

In [14]:
# Place the results into the data frame
df['question_synthetic'] = [x.QUESTION for x in results]
df['answer_synthetic'] = [x.ANSWER for x in results]

# View a few examples
for ii in range(3):
    print(f'Example # {ii + 1}')
    print(f'Context: {df.iloc[ii].context}')
    print(f'Original Query: {df.iloc[ii].question}')
    print(f'Original Answer: {df.iloc[ii].answers["text"][0]}')
    print(f'Synthetic Query: {df.iloc[ii].question_synthetic}')
    print(f'Synthetic Answer: {df.iloc[ii].answer_synthetic}\n')

Example # 1
Context: The Oklahoma City Police Department, has a uniformed force of 1,169 officers and 300+ civilian employees. The Department has a central police station and five substations covering 2,500 police reporting districts that average 1/4 square mile in size.
Original Query: How many substations does Oklahoma city have?
Original Answer: 5
Synthetic Query: How many police reporting districts does the Oklahoma City Police Department cover?
Synthetic Answer: The Oklahoma City Police Department covers 2,500 police reporting districts that average 1/4 square mile in size.

Example # 2
Context: The U.S. Federal Reserve and central banks around the world have taken steps to expand money supplies to avoid the risk of a deflationary spiral, in which lower wages and higher unemployment lead to a self-reinforcing decline in global consumption. In addition, governments have enacted large fiscal stimulus packages, by borrowing and spending to offset the reduction in private sector deman

# Takeaway
The three examples above show that this workflow can generate high-quality synthetic question-answer pairs for IR and RAG evaluation of custom documents/text. It is an attractive supplement to human annotation of custom documents.