## Objective: To build a simple RAG evaluation framework

### Part 1: Synthesize and filter an instruction dataset from a custom knowledge base

#### Primary reference: https://huggingface.co/learn/cookbook/en/rag_evaluation by Aymeric Roucher (https://huggingface.co/m-ric)

For the knowledge base, we use the litgpt GitHub repo: https://github.com/Lightning-AI/litgpt/tree/main. Only the Markdown files are used: this keeps the knowledge base small for a first effort, and because those files cover the quickstart, tutorials, and similar documentation, they **should** yield coherent QA pairs.

### Installs and Dependencies

In [1]:
%pip install --quiet torch transformers langchain tqdm pandas datasets
%pip install -U --quiet langchain-openai Gitpython python-dotenv huggingface_hub
%pip install --quiet openai huggingface langchain_experimental sentence_transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# openai_api_key = os.environ['OPENAI_API_KEY'] 
hf_api_key = os.environ['HF_API_KEY'] 

In [3]:
import textwrap
from tqdm import tqdm
import pandas as pd
import json
import datasets
import random
import bs4
import glob

pd.set_option("display.max_colwidth", None)

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import GitLoader
from langchain_openai import ChatOpenAI
from langchain.docstore.document import Document as LangchainDocument
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.llms import HuggingFaceHub
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM
from langchain_core.language_models import BaseChatModel


from huggingface_hub import InferenceClient


### Load in knowledge base and prepare documents

In [4]:
loader = GitLoader(
    clone_url="https://github.com/Lightning-AI/litgpt",
    repo_path="./litgpt_data_github/",
    branch="main",
    file_filter=lambda file_path: file_path.endswith(".md") # Only get the markdown files
)

data = loader.load()

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
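    # Note: separators are tried in order and "" always matches,
    # so the trailing "\n\n\n" entry is never reached
    separators=["\n\n", "\n", ".", " ", "", "\n\n\n"],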
)

docs_processed = []
for doc in data:
    docs_processed += text_splitter.split_documents([doc])
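
Before generating questions, it can be useful to check how the chunks are actually sized relative to `chunk_size=2000`. The following is a minimal sketch using only the standard library; `chunk_length_stats` and the sample strings are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from statistics import mean, median


def chunk_length_stats(texts, chunk_size=2000):
    """Summarize chunk lengths (in characters) and count chunks above chunk_size."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0, "over_limit": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "over_limit": sum(1 for n in lengths if n > chunk_size),
    }


# Stand-in strings for illustration; in the notebook you would pass
# [doc.page_content for doc in docs_processed] instead.
sample_chunks = ["a" * 500, "b" * 1500, "c" * 2500]
stats = chunk_length_stats(sample_chunks)
print(stats)
```

A very small minimum length or a non-zero `over_limit` would suggest revisiting the splitter settings before spending time on generation.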

### Setup Question-Answer generation agent

In [9]:
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
    token=hf_api_key
)


def call_llm(inference_client: InferenceClient, prompt: str):
    response = inference_client.post(
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 1000},
            "task": "text-generation",
        },
    )
    return json.loads(response.decode())[0]["generated_text"]

QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

In [13]:
len(docs_processed)

136

In [14]:
# Generate one QA couple per chunk and upload the result to the Hugging Face Hub for later use


N_GENERATIONS = len(docs_processed)  # 136 here: one QA couple per chunk, which keeps cost and time manageable

# 136 is simply the number of chunks in docs_processed for the litgpt markdown knowledge base

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(
        llm_client, QA_generation_prompt.format(context=sampled_context.page_content)
    )
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except Exception:  # skip samples that cannot be parsed or whose answer is too long
        continue

Generating 136 QA couples...


100%|██████████| 136/136 [12:34<00:00,  5.55s/it]
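
The loop above extracts the question and answer with chained `split` calls and relies on an `assert` inside `try`/`except` to discard long answers. A minimal sketch of the same idea as an explicit parser follows; `parse_qa_output` is a hypothetical helper, not part of the original notebook, and it assumes the model output follows the `Factoid question:` / `Answer:` format requested in the prompt. It also strips surrounding whitespace, which the loop above does not do.

```python
def parse_qa_output(raw, max_answer_chars=300):
    """Return (question, answer), or None if the text does not match the expected format."""
    # Keep only the text after the last marker, mirroring split(...)[-1] in the loop,
    # so a prompt echoed back by the model is ignored.
    _, marker, tail = raw.rpartition("Factoid question:")
    if not marker:
        return None
    question, marker, answer = tail.partition("Answer:")
    if not marker:
        return None
    question, answer = question.strip(), answer.strip()
    # Reject empty fields and answers at or above the length limit.
    if not question or not answer or len(answer) >= max_answer_chars:
        return None
    return question, answer


# Illustrative input only; not an actual model response.
raw_example = (
    "Output:::\n"
    "Factoid question: What is the default checkpoint directory?\n"
    "Answer: ./checkpoints\n"
)
print(parse_qa_output(raw_example))
```

Returning `None` instead of raising makes the skip condition explicit at the call site, for example `if parsed is None: continue`.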


In [16]:
display(pd.DataFrame(outputs).sample(5))

Unnamed: 0,context,question,answer,source_doc
105,"# TPU support\n\nThis project utilizes [`Fabric`](https://lightning.ai/docs/fabric/stable), which supports TPUs via [PyTorch XLA](https://github.com/pytorch/xla).\n\n> [!NOTE]\n> This guide assumes that you have already set-up your [Google Cloud environment](https://cloud.google.com/run/docs/setup).\n\nTo set up a Google Cloud instance with a TPU v4 VM, run the following commands:\n\n```shell\ngcloud compute tpus tpu-vm create litgpt --version=tpu-vm-v4-base --accelerator-type=v4-8 --zone=us-central2-b\ngcloud compute tpus tpu-vm ssh litgpt --zone=us-central2-b\n```\n\nYou can also choose a different TPU type. To do so, change the `version`, `accelerator-type`, and `zone` arguments. Find all regions and zones [here](https://cloud.google.com/tpu/docs/regions-zones).\n\n<details>\n<summary>Multihost caveats</summary>\n\nTPU v4-8 uses a single host. SSH'ing into the machine and running commands manually will only work when using a single host (1 slice in the TPU pod).\nIn multi-host environments, such as larger TPU pod slices, it's necessary to launch all commands on all hosts simultaneously to avoid hangs.\nFor local development, it is advisable to upload a zip file containing all your current changes and execute it inside the VM from your personal computer:\n\n```shell\n# Zip the local directory, excluding large directories from the zip. You may want to keep them.\nzip -r local_changes.zip . 
-x "".git/*"" ""checkpoints/*"" ""data/*"" ""out/*""\n# Copy the .zip file to the TPU VM\ngcloud compute tpus tpu-vm scp --worker=all local_changes.zip ""litgpt:~""\n# Unzip on each host\ngcloud compute tpus tpu-vm ssh litgpt --worker=all --command=""cd ~; unzip -q -o local_changes.zip""\n\n# Example of a typical workflow\ngcloud compute tpus tpu-vm ssh tmp --worker=all --command=""cd ~; bash install_dependencies.sh""\ngcloud compute tpus tpu-vm ssh tmp --worker=all --command=""cd ~; bash prepare_checkpoints.sh""\ngcloud compute tpus tpu-vm ssh tmp --worker=all --command=""cd ~; bash run_desired_script.sh""",How does this project support TPUs?\n,"This project supports TPUs via PyTorch XLA, which is integrated into Fabric.",extensions/xla/README.md
125,| Config | Model | Epochs | Max seq length | Micro batch size | Machine | Training runtime | Cost | Peak memory | Validation loss | Validation perplexity | Multitask score (MMLU) |\n| --------------------------------- | ---------------------- | ------ | -------------- | ---------------- | ------- | ---------------- | ---- | ----------- | --------------- | --------------------- | --------------- |\n| falcon-7b/lora.yaml | falcon-7b | 4 | 512 | 1 | 1xA10G | 24.84 min | $0.7 | 16.69 GB | 0.945 | 2.573 | 26.2% |\n| falcon-7b/lora.yaml | falcon-7b | 4 | 512 | 1 | 4xA10G | 24.94 min | $2.0 | 16.69 GB | 0.945 | 2.573 | 26.4% |\n| falcon-7b/qlora.yaml | falcon-7b | 4 | 512 | 1 | 1xA10G | 50.85 min | $1.5 | 9.44 GB | 0.993 | 2.699 | 26.3% |\n| falcon-7b/qlora.yaml | falcon-7b | 4 | 512 | 1 | 4xA10G | 50.88 min | $4.1 | 9.44 GB | 0.993 | 2.699 | 26.3% |\n| | | | | | | | | | | | |\n| gemma-2b/full.yaml | gemma-2b | 1 | 512 | 1 | 4xA10G | 14.06 min | $1.1 | 17.43 GB | 1.021 | 2.777 | 32.4% |\n| gemma-2b/lora.yaml | gemma-2b | 2 | 512 | 2 | 1xA10G | 9.41 min | $0.3 | 12.62 GB | 0.981 | 2.666 | 34.4% |,What is the training runtime for the gemma-2b model with the lora configuration?\n,9.41 min,config_hub/finetune/README.md
121,| Size | Model | Quantization | GPU | Max GPU RAM | Token/sec |\n|-------|----------------|--------------|----------|-------------------------------------------|-----------|\n| 1.3 B | phi-1.5 | None | 1 x A100 | 2.86 GB | 42.56 |\n| 1.3 B | phi-1.5 | bnb.nf4 | 1 x A100 | 1.39 GB | 22.89 |\n| 1.3 B | phi-1.5 | bnb.nf4-dq | 1 x A100 | 1.33 GB | 22.75 |\n| | | | | | |\n| 3 B | StableLM Alpha | None | 1 x A100 | 7.30 GB | 49.01 |\n| 3 B | StableLM Alpha | bnb.nf4 | 1 x A100 | 3.20 GB | 29.04 |\n| 3 B | StableLM Alpha | bnb.nf4-dq | 1 x A100 | 3.04 GB | 27.15 |\n| | | | | | |\n| 7 B | Llama 2 | None | 1 x A100 | 13.52 GB | 30.97 |\n| 7 B | Llama 2 | bnb.nf4 | 1 x A100 | 4.57 GB | 19.98 |\n| 7 B | Llama 2 | bnb.nf4-dq | 1 x A100 | 4.26 GB | 17.3 |\n| | | | | | |\n| 13 B | Llama 2 | None | 1 x A100 | 26.21 GB | 24.82 |\n| 13 B | Llama 2 | bnb.nf4 | 1 x A100 | 8.32 GB | 16.73 |\n| 13 B | Llama 2 | bnb.nf4-dq | 1 x A100 | 7.72 GB | 14.43 |\n| | | | | | |,What is the maximum GPU RAM required for the 13 B Llama 2 model with bnb.nf4-dq quantization?\n,7.72 GB,tutorials/resource-tables.md
87,# Download Model Weights with LitGPT\n\nLitGPT supports a variety of LLM architectures with publicly available weights. You can download model weights and access a list of supported models using the LitGPT `download.py` script.,What can I use to download model weights in LitGPT?\n,The LitGPT `download.py` script,tutorials/download_model_weights.md
104,"# Serve and Deploy LLMs\n\nThis document shows how you can serve a LitGPT for deployment. \n\n&nbsp;\n## Serve an LLM\n\nThis section illustrates how we can set up an inference server for a phi-2 LLM using `litgpt serve` that is minimal and highly scalable.\n\n\n&nbsp;\n## Step 1: Start the inference server\n\n\n```bash\n# 1) Download a pretrained model (alternatively, use your own finetuned model)\nlitgpt download --repo_id microsoft/phi-2\n\n# 2) Start the server\nlitgpt serve --checkpoint_dir checkpoints/microsoft/phi-2\n```\n\n> [!TIP]\n> Use `litgpt serve --help` to display additional options, including the port, devices, LLM temperature setting, and more.\n\n\n&nbsp;\n## Step 2: Query the inference server\n\nYou can now send requests to the inference server you started in step 2. For example, in a new Python session, we can send requests to the inference server as follows:\n\n\n```python\nimport requests, json\n\nresponse = requests.post(\n ""http://127.0.0.1:8000/predict"", \n json={""prompt"": ""Fix typos in the following sentence: Exampel input""}\n)\n\nprint(response.json()[""output""])\n```\n\nExecuting the code above prints the following output:\n\n```\nInstruct: Fix typos in the following sentence: Exampel input\nOutput: Example input.\n```",How can I start an inference server for a phi-2 LLM using litgpt serve?\n,You can start an inference server for a phi-2 LLM using litgpt serve by first downloading a pretrained model using `litgpt download --repo_id microsoft/phi-2` and then starting the server using `litgpt serve --checkpoint_dir checkpoints/microsoft/phi-2`.,tutorials/deploy.md


### Set-up Critique Agents

In [17]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to AI and ML Practitioners working with Large Language Models using litgpt.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain specific technical nouns or acronyms like LoRA, fp4, litgpt or Llama 3 and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [18]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        # Note: a parsing failure on one criterion skips the remaining criteria
        # for this sample, leaving their score columns empty (NaN)
        for criterion, evaluation in evaluations.items():
            score, rationale = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": rationale,
                }
            )
    except Exception:
        continue

Generating critique for each QA couple...


100%|██████████| 127/127 [1:28:56<00:00, 42.02s/it]
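
The critique loop converts everything after `Total rating: ` with `int(...)`, which raises if the model appends any text after the number. The sketch below shows one possible, more tolerant approach using a regular expression and parsing each criterion independently. `parse_critique` and `score_sample` are hypothetical helpers, not part of the original notebook, and the example strings are illustrative rather than real model output.

```python
import re

# A single digit from 1 to 5 right after the marker, not followed by another digit.
RATING_PATTERN = re.compile(r"Total rating:\s*([1-5])(?!\d)")


def parse_critique(raw):
    """Return (score, rationale), or None if no valid rating or rationale is found."""
    matches = list(RATING_PATTERN.finditer(raw))
    if not matches:
        return None
    last = matches[-1]  # use the last rating, in case the prompt is echoed back
    _, marker, rationale = raw[: last.start()].rpartition("Evaluation:")
    if not marker:
        return None
    return int(last.group(1)), rationale.strip()


def score_sample(evaluations):
    """Parse each criterion on its own so one failure does not discard the others."""
    parsed = {}
    for criterion, raw in evaluations.items():
        result = parse_critique(raw)
        if result is None:
            continue
        score, rationale = result
        parsed[f"{criterion}_score"] = score
        parsed[f"{criterion}_eval"] = rationale
    return parsed


# Illustrative critiques only.
example_evaluations = {
    "groundedness": "Answer:::\nEvaluation: The context states the flag directly.\nTotal rating: 5\n\nExtra commentary.",
    "relevance": "I cannot rate this question.",
    "standalone": "Evaluation: Clear on its own.\nTotal rating: 4",
}
parsed_scores = score_sample(example_evaluations)
print(parsed_scores)
```

In the notebook, `output.update(score_sample(evaluations))` could stand in for the `try`/`except` block, under the assumption that partially scored samples should keep the scores that did parse.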


In [19]:
# Filter out the bad questions, keep 3 as the threshold for now

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

# Filter out low rated QA pairs

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 3)
    & (generated_questions["relevance_score"] >= 3)
    & (generated_questions["standalone_score"] >= 3)
]
print("============================================")
print("Final evaluation dataset:")

generated_questions.reset_index(inplace=True, drop=True)

display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)


eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,What is the shape of tensor t24?\n,"The shape of tensor t24 is [2, 5, 4096].\n```",,,
1,What is a config in LitGPT?\n,"In LitGPT, a config is a configuration file that lets you customize training for all granular parameters like learning rate, batch size, number of epochs, and more.",5.0,,
2,How many parameters are there in the LLaMA-Adapter v2?\n,There are ~2.3 M trainable parameters in the LLaMA-Adapter v2.,2.0,,
3,What is the name of the argument used to resume training?\n,The name of the argument used to resume training is `--resume`.,1.0,5.0,3.0
4,What was the time taken to complete training?\n,The time taken to complete training was ~ 4 weeks with 64 A100 GPUs.,5.0,,
...,...,...,...,...,...
122,What is the URL for the Lightning Studio templates?\n,https://lightning.ai/lightning-ai/studios,2.0,,
123,What is the version of nvfuser\_cu121 used?\n,The version of nvfuser\_cu121 used is 0.2.0.dev20240327.,4.0,3.0,4.0
124,What is the command to download the pretrained model weights for the Llama-2-7b-hf model?\n,`litgpt download --repo_id meta-llama/Llama-2-7b-hf`,3.0,5.0,5.0
125,What is the training runtime for the gemma-2b model with the lora configuration?\n,9.41 min,3.0,3.0,5.0


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,What is the memory usage of Llama 2 with 7B when using bnb.nf4-dq?\n,13.84 GB,5.0,3.0,3.0
1,What is the command to run the evaluation harness?\n,"The command to run the evaluation harness is `lm_eval --model hf --model_args pretrained=out/hf-tinyllama/converted --tasks ""hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge"" --device ""cuda:0"" --batch_size 4`.",5.0,5.0,5.0
2,What is the command to run the Evaluation Harness?\n,"The command to run the Evaluation Harness is `lm_eval --model hf --model_args pretrained=""out/converted_model"" --tasks ""hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge"" --device ""cuda:0"" --batch_size 4`.",5.0,5.0,4.0
3,What is the default value of the 'precision' parameter in the LoRA finetuning config?\n,The default value of the 'precision' parameter in the LoRA finetuning config is null.,5.0,4.0,5.0
4,What is the name of the directory where the model weights are stored by default?\n,The model weights are stored in a `./checkpoints` subdirectory by default.,5.0,5.0,5.0
5,What is the command to download a pretrained model?\n,litgpt download --repo_id [model_name],5.0,4.0,5.0
6,What is the name of the studio that provides LitGPT pretraining projects?\n,Lightning Studio,5.0,4.0,5.0
7,How long does it take to finetune a model on a GPU?\n,It takes about a minute to finetune a model on a GPU.,3.0,4.0,4.0
8,What is the most memory-intensive finetuning technique in LitGPT?\n,The most memory-intensive finetuning technique in LitGPT is full finetuning.,5.0,4.0,5.0
9,What is the recommended approach for preprocessing large datasets?\n,The recommended approach for preprocessing large datasets is to use LitData for preprocessing and then read it from a local directory or S3 connection using `--data LitData`.,3.0,5.0,5.0
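
The `>= 3` filter above also drops every row with a missing score, because comparisons against NaN evaluate to False in pandas. The plain-Python sketch below makes that rule explicit; `passes_filter`, `CRITERIA`, and the sample records are hypothetical and only illustrate the logic.

```python
import math

CRITERIA = ("groundedness", "relevance", "standalone")


def passes_filter(record, threshold=3):
    """Keep a record only if every criterion has a score at or above the threshold."""
    for criterion in CRITERIA:
        score = record.get(f"{criterion}_score")
        # A missing or NaN score counts as a failure, matching the pandas filter.
        if score is None or (isinstance(score, float) and math.isnan(score)):
            return False
        if score < threshold:
            return False
    return True


# Illustrative records only; not rows from the actual dataset.
records = [
    {"question": "A", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 5},
    {"question": "B", "groundedness_score": 5},  # two critiques failed to parse
    {"question": "C", "groundedness_score": 1, "relevance_score": 5, "standalone_score": 3},
    {"question": "D", "groundedness_score": float("nan"), "relevance_score": 5, "standalone_score": 5},
]
kept = [r["question"] for r in records if passes_filter(r)]
print(kept)
```

Counting how many records fail because of missing scores, as opposed to low scores, is one way to tell whether the small final dataset reflects question quality or parsing failures.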


### Push dataset to HuggingFace Hub for future use

In [20]:
from huggingface_hub import notebook_login

notebook_login()

In [21]:
from huggingface_hub import create_repo

repo_name = "litgpt_instruction_qa"  # Choose a name for your dataset repository
repo_url = create_repo(repo_name, repo_type="dataset")
print("Repository URL:", repo_url)

Repository URL: https://huggingface.co/datasets/delayedkarma/litgpt_instruction_qa


In [22]:
eval_dataset.push_to_hub("delayedkarma/litgpt_instruction_qa")

CommitInfo(commit_url='https://huggingface.co/datasets/delayedkarma/litgpt_instruction_qa/commit/40ec15f700cffd95440825e20ac06ed05d6513a1', commit_message='Upload dataset', commit_description='', oid='40ec15f700cffd95440825e20ac06ed05d6513a1', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
### The dataset is now pushed to the Hub

### You can load it as follows:

# eval_dataset = datasets.load_dataset("delayedkarma/litgpt_instruction_qa", split="train")

### Next, in Part 2: build and evaluate a RAG system using the synthesized dataset (LLM-as-a-judge)