# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## Installing dependencies

In [1]:
!git clone --single-branch --branch 5-add-benchmark https://github.com/mozilla-ai/structured-qa

Cloning into 'structured-qa'...
remote: Enumerating objects: 795, done.
remote: Counting objects: 100% (233/233), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 795 (delta 148), reused 126 (delta 92), pack-reused 562 (from 1)
Receiving objects: 100% (795/795), 2.27 MiB | 5.75 MiB/s, done.
Resolving deltas: 100% (430/430), done.


In [8]:
%pip install ./structured-qa

Processing ./structured-qa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: structured-qa
  Building wheel for structured-qa (pyproject.toml) ... done
  Created wheel for structured-qa: filename=structured_qa-0.3.3.dev84+g7b9c96c-py3-none-any.whl size=16325 sha256=3a2543903414e4e12121937c7c91c685062c83f3fc53f84a7316c8bec56b4181
  Stored in directory: /root/.cache/pip/wheels/b8/d1/8b/1585580e7787d68790745653775eb485d52a0d5386b616c827
Successfully built structured-qa
Installing collected packages: structured-qa
  Attempting uninstall: structured-qa
    Found existing installation: structured-qa 0.3.3.dev84+g7b9c96c
    Uninstalling structured-qa-0.3.3.dev84+g7b9c96c:
      Successfully uninstalled structured-qa-0.3.3.dev84+g7b9c96c
Successfully installed structured-qa-0.3.3.dev84+g7b9c96c


In [6]:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl

# Setup

In [1]:
import os

os.environ["LOGURU_LEVEL"] = "INFO"

In [2]:
from loguru import logger

## Function to process all questions for a single section

In [11]:
ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You only answer based on the current information available.
The current information available is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the following message and nothing else:

```
I need more info.
```

The answer must be in one of the following formats:
- YES/NO (for boolean questions)
Question: Is the model an LLM?
Answer: YES
- Number (for numeric questions)
Question: How many layers does the model have?
Answer: 12
- Single letter (for multiple-choice questions)
Question: What is the activation function used in the model? -A: ReLU -B: Sigmoid -C: Tanh
Answer: C
"""


def process_section_questions(
    section_file,
    section_data,
    model,
):
    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in section_data.iterrows():
        question = row["question"]
        logger.info(f"Question: {question}")
        messages = [
            {
                "role": "system",
                "content": ANSWER_WITH_TYPE_PROMPT.format(
                    CURRENT_INFO=section_file.read_text()
                ),
            },
            {"role": "user", "content": question},
        ]
        response = model.get_response(messages)
        logger.info(f"Answer: {response}")
        answers[index] = response
        sections[index] = None
    return answers, sections
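
To check the message structure without downloading model weights, the loop above can be dry-run against a stand-in model. The sketch below is only an illustration: `SHORT_PROMPT`, `EchoModel`, and `build_messages` are hypothetical names that are not part of `structured_qa`. The only thing taken from the function above is the assumption that the model exposes a `get_response(messages)` method.

```python
# Hypothetical stand-ins for illustration; none of these names come from structured_qa.
SHORT_PROMPT = """
You only answer based on the current information available.
The current information available is:

{CURRENT_INFO}
"""


class EchoModel:
    """Stub exposing the same `get_response(messages)` method the loop calls."""

    def get_response(self, messages):
        # Report how much context was received and repeat the question back.
        system, user = messages
        return f"context_chars={len(system['content'])} question={user['content']}"


def build_messages(section_text, question):
    # Same two-message layout as above: the section text goes into the
    # system prompt and the question becomes the user turn.
    return [
        {"role": "system", "content": SHORT_PROMPT.format(CURRENT_INFO=section_text)},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "The model has 12 layers.", "How many layers does the model have?"
)
print(EchoModel().get_response(messages))
```

Because the section text is the same for every question in a section, it could also be read once before the loop rather than on each iteration.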

## Load Model

In [4]:
from structured_qa.model_loaders import load_llama_cpp_model

In [7]:
model = load_llama_cpp_model(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)

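
The identifier passed to `load_llama_cpp_model` appears to pack a Hugging Face repository and a file name into a single string. As a minimal sketch, assuming the layout is `{org}/{repo}/{filename}` (the helper `split_model_id` is hypothetical and not part of `structured_qa`), the pieces can be separated like this:

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split '{org}/{repo}/{filename}' into (repo_id, filename).

    Assumption: the identifier has exactly three '/'-separated parts.
    """
    org, repo, filename = model_id.split("/")
    return f"{org}/{repo}", filename


repo_id, filename = split_model_id(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)
print(repo_id)
print(filename)
```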

# Run Benchmark

In [12]:
from pathlib import Path

import pandas as pd


logger.info("Loading input data")
data = pd.read_csv("structured-qa/benchmark/structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)

for section_name, section_data in data.groupby("section"):
    section_file = Path(f"structured-qa/benchmark/perfect_context/{section_name}.txt")

    answers, sections = process_section_questions(section_file, section_data, model)

    for index in section_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-29 13:02:00.443 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-01-29 13:02:00.449 | INFO     | __main__:process_section_questions:38 - Predicting
2025-01-29 13:02:00.451 | INFO     | __main__:process_section_questions:43 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-01-29 13:02:01.668 | INFO     | __main__:process_section_questions:54 - Answer: 175
2025-01-29 13:02:01.670 | INFO     | __main__:process_section_questions:43 - Question: Does LoRA introduce additional inference latency compared to full fine-tuning?
2025-01-29 13:02:01.779 | INFO     | __main__:process_section_questions:54 - Answer: NO
2025-01-29 13:02:01.7

# Results

In [13]:
results = pd.read_csv("results.csv")
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 10 | 10 | https://arxiv.org/pdf/1706.03762 | 5.4 Regularization | What was the dropout rate used for the base mo... | 0.1 | PDROP = 0.1 | |
| 14 | 14 | https://arxiv.org/pdf/2210.05189 | 3 Experimental Results | How many parameters are in the toy model (y = ... | 14 | NUMBER\nQUESTION: HOW MANY PARAMETERS ARE IN T... | |
| 16 | 16 | https://arxiv.org/pdf/2210.05189 | 3 Experimental Results | What is the main computational advantage of de... | B | B: FEWER OPERATIONS | |
| 21 | 21 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Data and data governance | what is a requirement for datasets used in hig... | B | B: DATASETS ENSURING QUALITY AND DIVERSITY | |
| 38 | 38 | https://arxiv.org/pdf/2201.11903 | 3.1 Experimental Setup | How many large language models were evaluated? | 5 | FIVE | |
| 42 | 42 | https://arxiv.org/pdf/2201.11903 | 5 Symbolic Reasoning | Which symbolic reasoning task is used as an ou... | A | A\nBASED ON THE INFORMATION PROVIDED, THE OUT-... | |
| 43 | 43 | https://arxiv.org/pdf/2201.11903 | 3.4 Robustness of Chain of Thought | How many annotators provided independent chain... | 3 | 2 | |
| 47 | 47 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | How many different races are there? | 6 | 5 | |
| 51 | 51 | https://github.com/mozilla-ai/structured-qa/re... | CHAPTER OVERVIEW | After taking a landmark tile, do you reveal a ... | NO | YES | |
| 52 | 52 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |


In [14]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.8383838383838383
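
Several of the mismatches listed above look like formatting differences rather than wrong answers (for example `FIVE` versus `5`, or `B: FEWER OPERATIONS` versus `B`). As a rough sketch of how such cases could be separated from genuine errors, the hypothetical helper below reduces a raw response to YES/NO, a single letter, or a number before comparison. `normalize_answer` is not part of `structured_qa`, and its rules are assumptions chosen to fit the rows shown.

```python
import re

# Assumption: only small number words need handling; extend as required.
NUMBER_WORDS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}


def normalize_answer(raw: str) -> str:
    """Reduce a verbose model response to YES/NO, a letter, or a number."""
    text = raw.strip().upper()
    if not text:
        return text
    first_line = text.splitlines()[0].strip()
    # Boolean answers.
    if first_line in {"YES", "NO"}:
        return first_line
    # Multiple choice: a lone letter, optionally followed by ':' and the option text.
    match = re.fullmatch(r"([A-Z])(?::.*)?", first_line)
    if match:
        return match.group(1)
    # Spelled-out numbers.
    if first_line in NUMBER_WORDS:
        return NUMBER_WORDS[first_line]
    # Otherwise take the last number on the first line, e.g. "PDROP = 0.1".
    numbers = re.findall(r"\d+(?:\.\d+)?", first_line)
    return numbers[-1] if numbers else first_line


for raw in ["PDROP = 0.1", "B: FEWER OPERATIONS", "FIVE", "A\nBASED ON THE INFORMATION PROVIDED"]:
    print(repr(raw), "->", normalize_answer(raw))
```

Whether such lenient matching is appropriate depends on how strictly the benchmark is meant to enforce the answer format. Rows such as `2` versus `3` or `YES` versus `NO` would still count as errors after normalization.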