# Question-Answer Synthetic Query Generation

This notebook shows how to use Llama2-Chat prompt templates to create synthetic query-context data. It runs the [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama) model through the Hugging Face `transformers` library.

[Two prompting techniques](https://blog.reachsumit.com/posts/2023/03/llm-for-text-ranking/) are demonstrated:
1) Basic zero-shot query generation - referred to as vanilla
2) Few-shot query generation Guided by Bad Questions (GBQ)

# User Inputs and Libraries

In [1]:
from pathlib import Path
from types import SimpleNamespace

# Specify paths to data, prompt templates, llama model, etc.
paths = {'base_dir': Path.cwd().parents[0],
         'prompt_vanilla': 'notebooks/question-answering-prompts/vanilla.txt',
         'prompt_gbq': 'notebooks/question-answering-prompts/gbq.txt',
         'squad_data': 'data/squad_v2',
         'model': '/nvme4tb/Projects/llama2_models/Llama-2-13b-chat-hf',
         }

# Number of context samples for experimentation
NUM_SAMPLES = 3

# Convert from dictionary to SimpleNamespace
paths = SimpleNamespace(**paths)

In [2]:
# Import libraries and packages
import pandas as pd
from time import time
from IPython.display import clear_output
import re
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
from datasets import load_dataset

# Load Data
The [SQuAD v2 dataset (`squad_v2`)](https://huggingface.co/datasets/squad_v2) was downloaded from Hugging Face and stored locally. The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, the dataset is more diverse than some other question-answering datasets.

If an internet connection is available, you can instead download the dataset directly:
```python
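# Select the train split and convert to pandas, matching the local load below
df = load_dataset('squad_v2', split='train').to_pandas()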
```
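To make the answer-span property described above concrete, the sketch below builds one hypothetical record in the `squad_v2` layout (an `answers` field holding parallel `text` and `answer_start` lists) and slices the answer back out of the context. The record and the `extract_answer` helper are illustrative assumptions; the offset is computed with `str.find` rather than copied from the dataset.

```python
# Hypothetical record mimicking the squad_v2 schema (not a row copied from the dataset)
context = ("The Department has a central police station and five substations "
           "covering 2,500 police reporting districts.")
answer_text = "five"

record = {
    "context": context,
    "question": "How many substations does Oklahoma city have?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context, computed here
        "answer_start": [context.find(answer_text)],
    },
}


def extract_answer(rec):
    """Slice the first answer span out of the context; None if unanswerable."""
    texts = rec["answers"]["text"]
    if len(texts) == 0:  # squad_v2 also contains unanswerable questions
        return None
    start = rec["answers"]["answer_start"][0]
    return rec["context"][start:start + len(texts[0])]


span = extract_answer(record)
print(span)
```

Because the answer is a literal substring of the context, a generated query can later be checked against the same passage it was produced from.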

In [3]:
# Load squad_v2 data locally from disk
df = load_dataset(str(paths.base_dir / paths.squad_data),
                  split='train').to_pandas()

# Remove duplicate contexts (keep one row per context/title pair)
df = df.drop_duplicates(subset=['context', 'title']).reset_index(drop=True)
print(f'df.shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')

# Approximate the # of words in context
df['num_words_context'] = df.context.apply(lambda x: len(x.split()))
print('Number of Words in Context')
display(df.num_words_context.describe())

df.shape: (19029, 5)
Columns: ['id', 'title', 'context', 'question', 'answers']
Number of Words in Context


count    19029.000000
mean       116.600137
std         49.666777
min         20.000000
25%         87.000000
50%        107.000000
75%        139.000000
max        653.000000
Name: num_words_context, dtype: float64

In [4]:
# Randomly select NUM_SAMPLES contexts
df = df.sample(n=NUM_SAMPLES, random_state=42)[['id', 'context', 'question']]

# View a few context and questions from original dataset
for ii in range(NUM_SAMPLES):
    print(f'Example # {ii + 1}')
    print(f'Context: {df.iloc[ii].context}')
    print(f'Squad Query: {df.iloc[ii].question}\n')

Example # 1
Context: The Oklahoma City Police Department, has a uniformed force of 1,169 officers and 300+ civilian employees. The Department has a central police station and five substations covering 2,500 police reporting districts that average 1/4 square mile in size.
Squad Query: How many substations does Oklahoma city have?

Example # 2
Context: The U.S. Federal Reserve and central banks around the world have taken steps to expand money supplies to avoid the risk of a deflationary spiral, in which lower wages and higher unemployment lead to a self-reinforcing decline in global consumption. In addition, governments have enacted large fiscal stimulus packages, by borrowing and spending to offset the reduction in private sector demand caused by the crisis. The U.S. Federal Reserve's new and expanded liquidity facilities were intended to enable the central bank to fulfill its traditional lender-of-last-resort role during the crisis while mitigating stigma, broadening the set of instit

# Prompt Templates

Two different prompt templates will be demonstrated in this notebook:
1) Basic vanilla zero-shot query generation.
2) [Few-shot with Guided by Bad Questions (GBQ)](https://blog.reachsumit.com/posts/2023/03/llm-for-text-ranking/): illustrated in the image below and detailed in the [InPars paper](https://arxiv.org/abs/2301.01820).
<p align="center"> 
    <img src="https://raw.githubusercontent.com/mddunlap924/LLM-Prompting/main/imgs/inpars-gbq.png"
    style="width:756px;height:512px;">
    <br>
    Left: Vanilla template, Right: GBQ prompts <a href="https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0">[Source]</a>.
</p>
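As a minimal sketch of how such a template is filled, the snippet below defines a shortened, hypothetical stand-in for the vanilla template (the real text lives in `vanilla.txt`) and substitutes a passage for the `[CONTEXT]` placeholder. The `fill_template` helper and its placeholder check are assumptions added for illustration; the notebook itself calls `str.replace` directly.

```python
# Shortened, hypothetical stand-in for vanilla.txt, following the Llama-2 chat layout:
# <s>[INST] <<SYS>> system prompt <</SYS>> user message [/INST]
VANILLA_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "You are a question generating assistant.\n"
    "Given a document, please generate a simple and short question.\n"
    "<</SYS>>\n\n"
    '"DOCUMENT": [CONTEXT]\n'
    '{"QUESTION": Your question here.}\n'
    "[/INST]"
)


def fill_template(template: str, context: str, placeholder: str = "[CONTEXT]") -> str:
    """Insert one context passage into a prompt template."""
    if placeholder not in template:
        # Fail loudly instead of silently sending a prompt without the document
        raise ValueError(f"Template is missing the {placeholder} placeholder")
    return template.replace(placeholder, context.strip())


prompt = fill_template(VANILLA_TEMPLATE, "The Department has five substations.")
print(prompt)
```

Checking for the placeholder up front guards against a template file that was edited and lost its `[CONTEXT]` marker.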

In [5]:
# Load each prompt template and insert the context
for template_name in [paths.prompt_vanilla, paths.prompt_gbq]:
    # Name of prompt template
    print(f'Prompt Template: {(paths.base_dir / template_name).name}')
    
    # Load the prompt template 
    prompt_template = (paths.base_dir / template_name).read_text()
    
    # Insert the context into the prompt template
    prompts = [prompt_template.replace('[CONTEXT]', i) for i in df.context.tolist()]

    # Example prompt for the first instance of data
    print(f'{prompts[0]}\n')

Prompt Template: vanilla.txt
<s>[INST] <<SYS>>
You are a question generating assistant. 
Given a document, please generate a simple and short question based on the information provided.
The question can be a maximum of 10 words long.
Return only the question in the JSON format shown in the examples.
<</SYS>>

"DOCUMENT": The Oklahoma City Police Department, has a uniformed force of 1,169 officers and 300+ civilian employees. The Department has a central police station and five substations covering 2,500 police reporting districts that average 1/4 square mile in size.
{"QUESTION": Your question here.}
[/INST]

Prompt Template: gbq.txt
<s>[INST] <<SYS>>
You are a question generating assistant. Given a document and a bad question; please generate a better more detailed question based on the information provided. 
Below are three examples of bad questions and good questions with sample relevant documents for each question.
Return only the question in the JSON format shown in the examples.


# Generate Synthetic Queries

This section generates queries with each of the two prompts and compares them to the original SQuAD query.
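The generation loop in this section samples with `do_sample=True`, `top_k=50`, `top_p=0.9`, and `temperature=0.75`. The pure-Python sketch below illustrates conceptually what those three settings do to a vector of next-token logits. It is an assumption-laden illustration, not the `transformers` implementation, and `sample_next_token` is a hypothetical helper.

```python
import math
import random


def sample_next_token(logits, temperature=0.75, top_k=50, top_p=0.9, rng=random):
    """Pick a token index using temperature, top-k and top-p (nucleus) filtering."""
    # 1) Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [value / temperature for value in logits]

    # 2) Top-k: keep only the k highest-scoring token indices
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    peak = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3) Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample from the renormalized nucleus
    threshold = rng.random() * sum(p for _, p in kept)
    running = 0.0
    for idx, p in kept:
        running += p
        if threshold <= running:
            return idx
    return kept[-1][0]


print(sample_next_token([10.0, 0.0, 0.0, 0.0]))
```

With one dominant logit the nucleus collapses to a single token, so sampling becomes effectively deterministic; flatter distributions leave more candidates and therefore more varied questions.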

In [6]:
# Load the tokenizer and model
tokenizer = LlamaTokenizer.from_pretrained(paths.model)
model = LlamaForCausalLM.from_pretrained(paths.model,
                                         load_in_8bit=True,
                                         device_map='cuda:0',
                                         torch_dtype=torch.float32)

# View GPU vRAM
!nvidia-smi

# Note: Llama-2 13B with 8-bit quantization uses ~14.8GB of vRAM

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Sat Oct 21 21:05:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P2   113W / 350W |  14875MiB / 24576MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [7]:
# Start time
st = time()

# Loop over each prompt template
counter = 0
prompt_names = []
for template_name in [paths.prompt_vanilla, paths.prompt_gbq]:
    # Name of prompt template
    prompt_name = (paths.base_dir / template_name).name.split('.txt')[0]
        
    # Load the prompt template 
    prompt_template = (paths.base_dir / template_name).read_text()
    
    # Insert the context into the prompt template
    prompts = [prompt_template.replace('[CONTEXT]', i) for i in df.context.tolist()]
    
    # Loop over each prompt
    llama_questions = []
    for prompt in prompts:
        # Tokenize the prompt
        batch = tokenizer(prompt, return_tensors='pt')
        
        # Generate the response from Llama2
        response = model.generate(batch["input_ids"].cuda(),
                                  do_sample=True,
                                  top_k=50,
                                  top_p=0.9,
                                  temperature=0.75)
        # Decode the response
        decode_response = tokenizer.decode(response[0], skip_special_tokens=True)
        llama_questions.append(decode_response)
        clear_output()
        counter += 1
        
    # Store llama queries in the dataframe
    df[f'prompt_{prompt_name}'] = llama_questions
    prompt_names.append(f'prompt_{prompt_name}')

# Total time to generate the queries
total_secs = time() - st
secs_per_sample = (total_secs / counter)
print(f'Total Time to Generate {counter} Queries: {(total_secs / 60):.1f} mins.')
print(f'Avg. Amount of Seconds Per Sample: {secs_per_sample:.1f}')

# Print an example of a returned llama response
print(df['prompt_gbq'].iloc[0])

Total Time to Generate 6 Queries: 0.9 mins.
Avg. Amount of Seconds Per Sample: 8.7
[INST] <<SYS>>
You are a question generating assistant. Given a document and a bad question; please generate a better more detailed question based on the information provided. 
Below are three examples of bad questions and good questions with sample relevant documents for each question.
Return only the question in the JSON format shown in the examples.

Example # 1
DOCUMENT: Guam lies between 13.2°N and 13.7°N and between 144.6°E and 145.0°E, and has an area of 212 square miles (549 km2), making it the 32nd largest island of the United States. It is the southernmost and largest island in the Mariana island chain and is also the largest island in Micronesia. This island chain was created by the colliding Pacific and Philippine Sea tectonic plates. Guam is the closest land mass to the Mariana Trench, a deep subduction zone, that lies beside the island chain to the east. Challenger Deep, the deepest surveye

In [8]:
# Clean up the llama response to parse only the returned question
def parse_response(text: str):
    
    # Extract Llama response
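    # Note: str.strip("</s>") strips the characters <, /, s, > (not the substring),
    # which can eat a trailing "s" from the reply, so remove the token explicitly
    text = text.split('[/INST]')[-1].replace('</s>', '').strip()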

    # Extract only the question
    if 'question' in text.lower():
        text = text.lower().split('question')[-1].split('?')[0].strip() + '?'
    elif '?' in text:
        text = text.split('?')[0].split('\n')[-1] + '?'
    else:
        text = 'NAN'
    text = re.sub('[":]', '', text)
    text = text.strip()
    text = text.capitalize()
    return text

# Parse each llama response
for prompt_name in prompt_names:
    df[f'{prompt_name}_cleaned'] = df[prompt_name].apply(parse_response)
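Because both prompts ask the model to return the question in JSON format, one alternative is to try JSON parsing first and fall back to string heuristics only when that fails. The sketch below is an assumption-based variant, not the parser used in this notebook; `parse_question_json` is a hypothetical helper and, unlike `parse_response`, it preserves the original casing.

```python
import json
import re


def parse_question_json(response: str) -> str:
    """Return the generated question, or 'NAN' if none can be found."""
    # Keep only the text after the final [/INST] tag (the model's reply)
    reply = response.split('[/INST]')[-1].strip()

    # First choice: a well-formed JSON object with a QUESTION key
    for candidate in re.findall(r'\{.*?\}', reply, flags=re.DOTALL):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            for key, value in obj.items():
                if key.lower() == 'question' and isinstance(value, str):
                    return value.strip()

    # Fallback: first line containing a question mark, minus any "label: " prefix
    for line in reply.splitlines():
        if '?' in line:
            question = line.split('?')[0] + '?'
            return question.split(': ')[-1].strip().strip('"')
    return 'NAN'


examples = [
    '[INST] prompt text [/INST]  {"QUESTION": "How many officers are in the force?"}',
    '[INST] prompt text [/INST] Sure! Here is a question: What is the size of each district?',
    '[INST] prompt text [/INST] {"QUESTION": How many substations are there?}',
    '[INST] prompt text [/INST] I cannot help.',
]
for example in examples:
    print(parse_question_json(example))
```

The third example covers the case where the model copies the unquoted `{"QUESTION": ...}` pattern shown in the template, which is not valid JSON and therefore goes through the fallback.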

In [9]:
# View the examples
for ii in range(NUM_SAMPLES):
    print(f'Example # {ii + 1}')
    print(f'Context: {df.iloc[ii].context}')
    print(f'Original Squad Query: {df.iloc[ii].question}')
    print(f'Llama-2 Vanilla Query: {df.iloc[ii].prompt_vanilla_cleaned}')
    print(f'Llama-2 GBQ Query: {df.iloc[ii].prompt_gbq_cleaned}\n')

Example # 1
Context: The Oklahoma City Police Department, has a uniformed force of 1,169 officers and 300+ civilian employees. The Department has a central police station and five substations covering 2,500 police reporting districts that average 1/4 square mile in size.
Original Squad Query: How many substations does Oklahoma city have?
Llama-2 Vanilla Query: How many officers are in the oklahoma city police department's uniformed force?
Llama-2 GBQ Query: What is the total size of the police reporting districts in the oklahoma city police department's jurisdiction, on average?

Example # 2
Context: The U.S. Federal Reserve and central banks around the world have taken steps to expand money supplies to avoid the risk of a deflationary spiral, in which lower wages and higher unemployment lead to a self-reinforcing decline in global consumption. In addition, governments have enacted large fiscal stimulus packages, by borrowing and spending to offset the reduction in private sector deman

# Takeaways:
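- Varying the prompt (Vanilla vs. GBQ) produces noticeably different queries for the same context: in Example 1, the vanilla prompt asks about the number of officers, while the GBQ prompt asks about the average size of the reporting districts.
- Further prompt experimentation can steer the generated queries toward the style a downstream task needs.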