# Data generation for QA

I want to fine-tune a small LLM on examples generated by a more capable model. The source material is my own knowledge base.

In [320]:
import pandas as pd
import random

In [16]:
df = pd.read_csv("df.csv", index_col=0)

In [18]:
df.head()

Unnamed: 0,section,subsection,question,answer,text,hash_answer
0,Classical models,Linear Regression,Regression _1,Regression in machine learning refers to a sup...,Classical models\nLinear Regression\nRegressio...,8f8499b5f59e9390a87f7d2b183cc8bd
1,Classical models,Linear Regression,Regression _2,regression.\n4. Ridge & Lasso Regression\nRidg...,Classical models\nLinear Regression\nRegressio...,a37096af9620af5eca2a696c03a4b397
2,Classical models,Linear Regression,What Is a Linear Regression Model? List Its Dr...,A linear regression model is a model in which ...,Classical models\nLinear Regression\nWhat Is a...,376cf3108393d26d6d09952af3a4f1b8
3,Classical models,Linear Regression,What are various assumptions used in linear re...,Linear regression is done under the following ...,Classical models\nLinear Regression\nWhat are ...,cc89d249384cd42bccf680fb513ae05c
4,Classical models,Linear Regression,What methods for solving linear regression do ...,"To solve linear regression, you need to find t...",Classical models\nLinear Regression\nWhat meth...,c7811418f1a69095d8bd9c190adac605


In [20]:
len(df)

645

In [32]:
texts = df['text'].tolist()

In [203]:
context = texts[0]

In [46]:
import openai
import os

In [28]:
# put your API key here
%env API_KEY=...

env: API_KEY=...


In [205]:
base_url = "https://api.together.xyz"
model_name = "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"
api_key = os.environ.get("API_KEY")
temperature = 0.7
top_p = 0.95
max_tokens = 512
seed = 34

In [117]:
client = openai.OpenAI(
            api_key=api_key,
            base_url=base_url,
        )

This is the prompt I use to generate several question-answer pairs for each text chunk:

In [175]:
prompt_template = """{context}

Using this text, generate 1-5 questions and the answers to them.
Your response must be an array of pairs of "Question" and "Answer" in json format with a parent element named QA.
"""

In [207]:
prompt = prompt_template.format(**{"context": context})

In [209]:
prompt

'Classical models\nLinear Regression\nRegression\xa0_1\nRegression in machine learning refers to a\xa0supervised learning\xa0technique where the goal is to predict a continuous numerical value based on one or more independent features. It finds relationships between variables so that predictions can be made. we have two types of variables present in regression:\nDependent Variable (Target): The variable we are trying to predict e.g house price.\nIndependent Variables (Features): The input variables that influence the prediction e.g locality, number of rooms.\nRegression analysis problem works with if output variable is a real or continuous value such as “salary” or “weight”. Many different regression models can be used but the simplest model in them is linear regression.\nTypes of Regression\nRegression can be classified into different types based on the number of predictor variables and the nature of the relationship between variables:\n1.\xa0Simple Linear Regression\nLinear regressio

In [181]:
messages = [{"role": "user", "content": prompt}]

In [185]:
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
    response_format={"type": "json_object"},
    seed=seed
)

In [193]:
import json

In [187]:
json_string = response.choices[0].message.content
json_string

'{"QA": [\n  {"Question": "What is the goal of regression in machine learning?", "Answer": "To predict a continuous numerical value based on one or more independent features"},\n  {"Question": "What are the two types of variables present in regression?", "Answer": "Dependent Variable (Target) and Independent Variables (Features)"},\n  {"Question": "What type of output variable does regression analysis work with?", "Answer": "A real or continuous value such as salary or weight"},\n  {"Question": "What is the simplest model of regression?", "Answer": "Linear Regression"},\n  {"Question": "What is an example of when to use polynomial regression?", "Answer": "When predicting a non-linear trend like population growth over time"}\n] }'

In [199]:
json_dict = json.loads(json_string)
json_dict

{'QA': [{'Question': 'What is the goal of regression in machine learning?',
   'Answer': 'To predict a continuous numerical value based on one or more independent features'},
  {'Question': 'What are the two types of variables present in regression?',
   'Answer': 'Dependent Variable (Target) and Independent Variables (Features)'},
  {'Question': 'What type of output variable does regression analysis work with?',
   'Answer': 'A real or continuous value such as salary or weight'},
  {'Question': 'What is the simplest model of regression?',
   'Answer': 'Linear Regression'},
  {'Question': 'What is an example of when to use polynomial regression?',
   'Answer': 'When predicting a non-linear trend like population growth over time'}]}

This way we can get pairs of questions and answers from the context.
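
The model is not guaranteed to follow the requested schema on every call, so it may be worth validating the parsed response before building a DataFrame from it. Below is a minimal sketch that uses only the standard library; `parse_qa` is a hypothetical helper name, not part of the notebook above, and the filtering rules (non-empty string fields) are my assumption about what counts as a usable pair.

```python
import json


def parse_qa(json_string):
    """Return a list of (question, answer) tuples, skipping malformed items."""
    try:
        payload = json.loads(json_string)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("QA", [])
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = item.get("Question")
        answer = item.get("Answer")
        # Keep only items where both fields are non-empty strings
        if isinstance(question, str) and isinstance(answer, str):
            if question.strip() and answer.strip():
                pairs.append((question.strip(), answer.strip()))
    return pairs
```

An empty result can then be treated like a failed request, for example by retrying the same chunk.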

In [213]:
df_qa = pd.DataFrame.from_records(json_dict['QA'])
df_qa['Context'] = context
df_qa

Unnamed: 0,Question,Answer,Context
0,What is the goal of regression in machine lear...,To predict a continuous numerical value based ...,Classical models\nLinear Regression\nRegressio...
1,What are the two types of variables present in...,Dependent Variable (Target) and Independent Va...,Classical models\nLinear Regression\nRegressio...
2,What type of output variable does regression a...,A real or continuous value such as salary or w...,Classical models\nLinear Regression\nRegressio...
3,What is the simplest model of regression?,Linear Regression,Classical models\nLinear Regression\nRegressio...
4,What is an example of when to use polynomial r...,When predicting a non-linear trend like popula...,Classical models\nLinear Regression\nRegressio...


In [255]:
import tqdm
import time
from tenacity import retry, stop_after_attempt, wait_exponential

Let's wrap it in a function:

In [263]:
def generate_qa(client, context: str, model_name: str, max_tokens, temperature=0.7, top_p=0.95, seed=34):
    prompt_template = """{context}

Using this text, generate 1-5 questions and the answers to them.
Your response must be an array of pairs of "Question" and "Answer" in json format with a parent element named QA.
"""
    prompt = prompt_template.format(**{"context": context})
    messages = [{"role": "user", "content": prompt}]
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
            response_format={"type": "json_object"},
            seed=seed
        )
    except openai.OpenAIError as e:
        # API-level failure: return None so the retry wrapper can try again
        print(f"Error: {e}")
        return None
    # The response is a JSON string with a top-level "QA" list
    json_string = response.choices[0].message.content
    json_dict = json.loads(json_string)
    df_qa = pd.DataFrame.from_records(json_dict['QA'])
    df_qa['Context'] = context
    return df_qa

In [265]:
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=4, max=60))
def generate_qa_with_retry(client, context, model_name, max_tokens):
    df_qa = generate_qa(client, context, model_name, max_tokens)
    if df_qa is None:
        raise ValueError("Retrying...")
    return df_qa
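
To make the retry policy easier to reason about, here is a dependency-free sketch of the same idea: retry a call a fixed number of times and sleep for an exponentially growing, clamped delay between attempts. It is not tenacity's implementation, and the exact delays tenacity produces may differ between versions; `backoff_delays` and `call_with_retry` are hypothetical names introduced only for this illustration.

```python
import time


def backoff_delays(attempts, base=1.0, min_wait=4.0, max_wait=60.0):
    """Delays (in seconds) to sleep between consecutive attempts."""
    # Delay doubles each time and is clamped to [min_wait, max_wait]
    return [min(max(base * 2 ** i, min_wait), max_wait) for i in range(attempts - 1)]


def call_with_retry(func, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call func() until it succeeds or the attempts are used up."""
    delays = backoff_delays(attempts, **backoff_kwargs)
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            # No sleep after the final attempt
            if attempt < len(delays):
                sleep(delays[attempt])
    raise last_error
```

Passing `sleep` as a parameter makes it possible to check the schedule without actually waiting.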

In [269]:
dfs_qa = []  # one DataFrame of QA pairs per text chunk

for context in tqdm.tqdm(texts):
    try:
        df_qa = generate_qa_with_retry(client, context, model_name, max_tokens)
        dfs_qa.append(df_qa)
    except Exception as e:
        print(f"Failed after retries for context: {context[:50]}... Error: {e}")
        continue

100%|██████████| 645/645 [1:45:01<00:00,  9.77s/it]  


Sometimes the context does not contain the answer. We want our model to say so in that case, so let's generate a set of suitable refusal answers:

In [310]:
prompt = "Generate 20 sentences meaning 'Unfortunately, I can't answer this question'. Return only the sentences divided by '\\n' symbol"

messages = [{"role": "user", "content": prompt}]

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
    seed=seed
)

In [314]:
no_info_str = response.choices[0].message.content

In [316]:
no_info = [s.strip() for s in no_info_str.split('\n') if s.strip()]
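
Even when asked for plain sentences, a model may return a numbered or bulleted list with blank lines in between. A small cleanup step can guard against that. This is a sketch under that assumption; `clean_lines` is a hypothetical helper, and the regular expression only covers the common markers (`1.`, `2)`, `-`, `*`, `•`).

```python
import re


def clean_lines(raw_text):
    """Split on newlines, strip list markers such as '1. ' or '- ', drop empty lines."""
    cleaned = []
    for line in raw_text.split("\n"):
        # Remove a leading list marker followed by whitespace, if present
        line = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", line).strip()
        if line:
            cleaned.append(line)
    return cleaned
```

It could be applied to `no_info_str` in place of a bare `split('\n')`.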

We'll pair questions with mismatched contexts and label them with the "I don't know" answers:

In [364]:
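# Pool all generated pairs: after the loop, df_qa holds only the last chunk's pairs
df_pool = pd.concat(dfs_qa, ignore_index=True)

contexts = []
questions = []
answers = []

for _ in range(500):
    # Pick a context, then a question that was generated from a different context
    context = df_pool["Context"].sample(n=1).item()
    question = df_pool[df_pool['Context'] != context].sample(n=1)["Question"].item()
    answer = random.choice(no_info)
    contexts.append(context)
    questions.append(question)
    answers.append(answer)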
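
The mismatching step can be sketched on toy data without pandas. The function below follows the same logic (sample a context, then a question that came from a different context, then a random refusal); `make_unanswerable` and its arguments are hypothetical names used only here, and it assumes the data contains at least two distinct contexts.

```python
import random


def make_unanswerable(rows, refusals, n, rng):
    """rows: list of dicts with 'Question' and 'Context' keys."""
    samples = []
    for _ in range(n):
        # Sample a context in proportion to how many questions it has
        context = rng.choice(rows)["Context"]
        # Only questions generated from a different context are candidates
        candidates = [row["Question"] for row in rows if row["Context"] != context]
        samples.append({
            "Question": rng.choice(candidates),
            "Answer": rng.choice(refusals),
            "Context": context,
        })
    return samples


rows = [
    {"Question": "q_a1", "Context": "A"},
    {"Question": "q_a2", "Context": "A"},
    {"Question": "q_b1", "Context": "B"},
]
negatives = make_unanswerable(rows, ["I can't answer."], 20, random.Random(0))
```

One caveat: a question taken from another chunk may still be answerable from the sampled context when the two chunks cover overlapping topics, so some of these pairs may be labeled as unanswerable incorrectly.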

In [369]:
df_no_info = pd.DataFrame({"Question": questions, "Answer": answers, "Context": contexts})

In [373]:
dfs_qa.append(df_no_info)

In [375]:
df_qa = pd.concat(dfs_qa).reset_index(drop=True)[['Question', 'Answer', 'Context']]

In [389]:
df_qa.to_csv("df_qa.csv")

Now we have a dataset of questions, answers, and contexts that can be used for training!

In [387]:
df_qa

Unnamed: 0,Question,Answer,Context
0,What is the main goal of regression in machine...,To predict a continuous numerical value based ...,Classical models\nLinear Regression\nRegressio...
1,What are the two types of variables present in...,Dependent Variable (Target) and Independent Va...,Classical models\nLinear Regression\nRegressio...
2,What type of regression is used when there is ...,Simple Linear Regression.,Classical models\nLinear Regression\nRegressio...
3,What type of regression is used to model non-l...,Polynomial Regression.,Classical models\nLinear Regression\nRegressio...
4,What are the extensions of linear regression t...,Ridge and Lasso Regression.,Classical models\nLinear Regression\nRegressio...
...,...,...,...
3720,When would you use a T-test?,This question is outside of my knowledge domai...,"LLM\nTraining\nWhat is Fine-tuning, and Why is..."
3721,"What is the purpose of the samples x1, . . . ,...",It's not possible for me to answer this questi...,Classical NLP\nPreprocessing\nUnigram_1\nUnigr...
3722,How is Key-Value cache commonly implemented?,The information needed to answer this question...,Classical NLP\nWord Embeddings\nSentencePiece_...
3723,What type of bias might occur when a sample is...,"Unfortunately, I'm not in a position to answer...",LLM\nSupervised Fine-Tuning\nPrompt Tuning_1\n...
