# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## Installing dependencies

In [1]:
!git clone --single-branch --branch 5-add-benchmark https://github.com/mozilla-ai/structured-qa

Cloning into 'structured-qa'...
remote: Enumerating objects: 803, done.
remote: Counting objects: 100% (241/241), done.
remote: Compressing objects: 100% (137/137), done.
remote: Total 803 (delta 154), reused 133 (delta 97), pack-reused 562 (from 1)
Receiving objects: 100% (803/803), 2.29 MiB | 28.93 MiB/s, done.
Resolving deltas: 100% (436/436), done.


In [2]:
%pip install ./structured-qa

Processing ./structured-qa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev86+gb726447)
  Downloading fire-0.7.0.tar.gz (87 kB)
  Preparing metadata (setup.py) ... done
Collecting loguru (from structured-qa==0.3.3.dev86+gb726447)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting pymupdf4llm (from structured-qa==0.3.3.dev86+gb726447)
  Downloading pymupdf4llm-0.0.17-py3-none-any.whl.metadata (4.1 kB)
Collecting rapidfuzz (from structured-qa==0.3.3.dev86+gb726447)
  Downloading rapidfuzz-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting streamlit (from structured-qa==0.3.3.dev86+gb726447)
  Downloading streamlit-1.41.1-py

In [3]:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl

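The wheel installed above is a prebuilt binary, so its filename tags have to match the runtime. The sketch below parses those tags following the standard wheel naming scheme (`name-version-pythontag-abitag-platformtag.whl`) and compares the Python tag with the running interpreter. Reading `cu122` in the release URL as a CUDA 12.2 build is an assumption based on the release name, and the `cp` prefix assumes CPython.

```python
import sys

WHEEL = "llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl"

# Wheel filenames follow name-version-pythontag-abitag-platformtag.whl
# (this particular file has no optional build tag).
name, version, python_tag, abi_tag, platform_tag = WHEEL[: -len(".whl")].split("-")

# Tag of the interpreter running this code, assuming CPython (e.g. "cp311").
current_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

print(f"wheel targets {python_tag} on {platform_tag}")
print(f"this interpreter is {current_tag}; match: {current_tag == python_tag}")
```

If the tags do not match, `pip` would normally refuse the wheel, and a different release asset would be needed.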

# Setup

In [4]:
import os

os.environ["LOGURU_LEVEL"] = "INFO"

In [5]:
from loguru import logger

## Function to process all questions for a single section

In [6]:
import time


ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You only answer based on the current information available.
The current information available is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the following message and nothing else:

```
I need more info.
```

The answer must be in one of the following formats:
- YES/NO (for boolean questions)
Question: Is the model an LLM?
Answer: YES
- Number (for numeric questions)
Question: How many layers does the model have?
Answer: 12
- Single letter (for multiple-choice questions)
Question: What is the activation function used in the model? -A: ReLU -B: Sigmoid -C: Tanh
Answer: C
"""


def process_section_questions(
    section_file,
    section_data,
    model,
):
    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in section_data.iterrows():
        question = row["question"]
        logger.info(f"Question: {question}")
        messages = [
            {
                "role": "system",
                "content": ANSWER_WITH_TYPE_PROMPT.format(
                    CURRENT_INFO=section_file.read_text()
                ),
            },
            {"role": "user", "content": question},
        ]
        response = model.get_response(messages)
        logger.info(f"Answer: {response}")
        answers[index] = response
        sections[index] = None
    return answers, sections
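A minimal sketch of the message layout built by the function above, using a stand-in model so it runs without llama.cpp. `PROMPT`, `EchoModel`, and `build_messages` are hypothetical names introduced for illustration. The only interface assumed is the `get_response(messages)` method that the function above calls.

```python
# Shortened stand-in for ANSWER_WITH_TYPE_PROMPT (same placeholder name).
PROMPT = "You only answer based on the current information:\n{CURRENT_INFO}\n"


class EchoModel:
    """Hypothetical stub exposing get_response(messages)."""

    def get_response(self, messages):
        # Return a canned reply; a real model would generate text here.
        return "I need more info."


def build_messages(context, question):
    # The system message carries the section text, the user message the question.
    return [
        {"role": "system", "content": PROMPT.format(CURRENT_INFO=context)},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "LoRA adds no inference latency.",
    "Does LoRA introduce additional inference latency?",
)
response = EchoModel().get_response(messages)
print(messages[0]["role"], "->", messages[1]["role"], "->", response)
```

Because the section text is placed in the system message, every question for the same section shares the same context and only the user message changes.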

## Load Model

In [7]:
from structured_qa.model_loaders import load_llama_cpp_model

In [8]:
model = load_llama_cpp_model(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)

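The identifier passed to `load_llama_cpp_model` reads as a Hugging Face organization, repository, and GGUF filename joined by slashes. The sketch below splits it that way. `split_model_id` is a hypothetical helper, and whether the loader parses the string like this internally is an assumption, not something confirmed here.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    # Assumed layout: "<org>/<repo>/<filename>".
    org, repo, filename = model_id.split("/")
    return f"{org}/{repo}", filename


repo_id, filename = split_model_id(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)
print(repo_id)
print(filename)
```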

# Run Benchmark

In [9]:
from pathlib import Path

import pandas as pd


logger.info("Loading input data")
data = pd.read_csv("structured-qa/benchmark/structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)

for section_name, section_data in data.groupby("section"):
    section_file = Path(f"structured-qa/benchmark/perfect_context/{section_name}.txt")

    answers, sections = process_section_questions(section_file, section_data, model)

    for index in section_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-30 10:01:01.101 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-01-30 10:01:01.135 | INFO     | __main__:process_section_questions:38 - Predicting
2025-01-30 10:01:01.139 | INFO     | __main__:process_section_questions:43 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-01-30 10:01:02.914 | INFO     | __main__:process_section_questions:54 - Answer: 175
2025-01-30 10:01:02.916 | INFO     | __main__:process_section_questions:43 - Question: Does LoRA introduce additional inference latency compared to full fine-tuning?
2025-01-30 10:01:03.032 | INFO     | __main__:process_section_questions:54 - Answer: NO
2025-01-30 10:01:03.0

# Results

In [18]:
results = pd.read_csv("results.csv")
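for index, result in results.iterrows():
    # Accept predictions that echo the choice ("-C: ...") or start with the answer.
    prefixes = (f"-{result['answer']}:", f"{result['answer']}")
    if result["pred_answer"].startswith(prefixes):
        results.loc[index, "pred_answer"] = result["answer"]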
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 10 | 10 | https://arxiv.org/pdf/1706.03762 | 5.4 Regularization | What was the dropout rate used for the base mo... | 0.1 | PDROP = 0.1 | |
| 14 | 14 | https://arxiv.org/pdf/2210.05189 | 3 Experimental Results | How many parameters are in the toy model (y = ... | 14 | NUMBER\nQUESTION: HOW MANY PARAMETERS ARE IN T... | |
| 38 | 38 | https://arxiv.org/pdf/2201.11903 | 3.1 Experimental Setup | How many large language models were evaluated? | 5 | FIVE | |
| 43 | 43 | https://arxiv.org/pdf/2201.11903 | 3.4 Robustness of Chain of Thought | How many annotators provided independent chain... | 3 | 2 | |
| 47 | 47 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | How many different races are there? | 6 | 5 | |
| 51 | 51 | https://github.com/mozilla-ai/structured-qa/re... | CHAPTER OVERVIEW | After taking a landmark tile, do you reveal a ... | NO | YES | |
| 52 | 52 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |
| 65 | 65 | https://github.com/mozilla-ai/structured-qa/re... | EXPEDITION PHASE | Do you need a fish to conquer a distant island? | YES | NO | |
| 78 | 78 | https://docs.nvidia.com/cuda/pdf/CUDA_C_Progra... | 5.2. Thread Hierarchy | Can you identify a thread with a four-dimensio... | NO | I NEED MORE INFO. | |
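The prefix test in the cell above can be tried in isolation to see which predictions it accepts. This sketch wraps the same `startswith` check in a hypothetical `matches` helper. The first three pairs are taken from the table, and the last two are made-up cases added to show an accepted multiple-choice echo and a possible false positive.

```python
def matches(answer: str, pred: str) -> bool:
    # Same rule as the cell above: the prediction starts with "-<answer>:" or "<answer>".
    return pred.startswith((f"-{answer}:", f"{answer}"))


# (expected answer, upper-cased prediction)
examples = [
    ("0.1", "PDROP = 0.1"),       # from the table: right value, extra text in front
    ("5", "FIVE"),                # from the table: number spelled out
    ("NO", "I NEED MORE INFO."),  # from the table: refusal message
    ("C", "-C: TANH"),            # hypothetical: model echoes the chosen option
    ("1", "12"),                  # hypothetical: prefix match on a different number
]
outcomes = [matches(answer, pred) for answer, pred in examples]
print(outcomes)
```

The last pair suggests the check is lenient for short numeric answers, so some care is needed when reading the accuracy as an exact-match score.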


In [19]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.9090909090909091
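As a quick consistency check, the reported accuracy can be related to the mismatch table. The sketch assumes the nine rows displayed are all of the mismatches, and solves for the total number of questions that this implies.

```python
mismatches = 9                 # rows shown in the mismatch table
reported = 0.9090909090909091  # accuracy printed by the cell above

# Solve (total - mismatches) / total == reported for total.
total = round(mismatches / (1 - reported))
accuracy = (total - mismatches) / total
print(total, accuracy)
```

Under that assumption the benchmark contains 99 questions, of which 90 were counted as correct.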