# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## Installing dependencies

In [1]:
!git clone --single-branch --branch 5-add-benchmark https://github.com/mozilla-ai/structured-qa

Cloning into 'structured-qa'...
remote: Enumerating objects: 939, done.
remote: Counting objects: 100% (377/377), done.
remote: Compressing objects: 100% (224/224), done.
remote: Total 939 (delta 246), reused 220 (delta 140), pack-reused 562 (from 1)
Receiving objects: 100% (939/939), 2.56 MiB | 9.80 MiB/s, done.
Resolving deltas: 100% (528/528), done.


In [2]:
%pip install ./structured-qa

Processing ./structured-qa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev113+g0ab4688)
  Downloading fire-0.7.0.tar.gz (87 kB)
  Preparing metadata (setup.py) ... done
Collecting loguru (from structured-qa==0.3.3.dev113+g0ab4688)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting pymupdf4llm (from structured-qa==0.3.3.dev113+g0ab4688)
  Downloading pymupdf4llm-0.0.17-py3-none-any.whl.metadata (4.1 kB)
Collecting rapidfuzz (from structured-qa==0.3.3.dev113+g0ab4688)
  Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting streamlit (from structured-qa==0.3.3.dev113+g0ab4688)
  Downloading streamlit-1.41

In [3]:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl


# Setup

In [4]:
import os

os.environ["LOGURU_LEVEL"] = "INFO"

In [5]:
from loguru import logger

## Function to process all questions for a single section

In [6]:
ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You must only answer based on the current information available which is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the string "I need more info" and nothing else.

If the current information is enough to answer, you must return one of the following formats:
- YES/NO (for boolean questions)
- Number (for numeric questions)
- Single letter (for multiple-choice questions)
"""


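def process_section_questions(
    section_file,
    section_data,
    model,
):
    """Ask `model` every question in `section_data`, using `section_file` as context.

    Returns two dicts keyed by row index: the raw answers and the predicted
    sections (always None here, because the section is given as input).
    """
    logger.info("Predicting")
    # The context is identical for every question in the section, so read it once.
    system_prompt = ANSWER_WITH_TYPE_PROMPT.format(
        CURRENT_INFO=section_file.read_text()
    )
    answers = {}
    sections = {}
    for index, row in section_data.iterrows():
        question = row["question"]
        logger.info(f"Question: {question}")
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ]
        response = model.get_response(messages)
        logger.info(f"Answer: {response}")
        answers[index] = response
        sections[index] = None
    return answers, sections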
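
The function only requires that `model` expose a `get_response(messages)` method, so the message structure can be checked without loading a real model. The following is a minimal sketch under that assumption: `StubModel`, `build_messages`, and the shortened `PROMPT` are hypothetical names introduced for illustration and are not part of `structured_qa`.

```python
import tempfile
from pathlib import Path

# Hypothetical, shortened stand-in for ANSWER_WITH_TYPE_PROMPT.
PROMPT = "You must only answer based on:\n{CURRENT_INFO}"


class StubModel:
    """Hypothetical stand-in that records calls instead of running an LLM."""

    def __init__(self):
        self.calls = []

    def get_response(self, messages):
        self.calls.append(messages)
        return "I need more info"


def build_messages(context, question):
    """Build the system + user message pair in the same shape as the loop above."""
    return [
        {"role": "system", "content": PROMPT.format(CURRENT_INFO=context)},
        {"role": "user", "content": question},
    ]


# Write a tiny section file and read it back, as the notebook does with Path.read_text().
with tempfile.TemporaryDirectory() as tmp:
    section_file = Path(tmp) / "section.txt"
    section_file.write_text("GPT-3 has 175 billion trainable parameters.")
    context = section_file.read_text()

stub = StubModel()
messages = build_messages(
    context, "In billions, how many trainable parameters does GPT-3 have?"
)
reply = stub.get_response(messages)
print(messages[0]["content"])
print(reply)
```

A stub like this can help confirm that the section text reaches the system prompt before spending time on a full benchmark run.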

## Load Model

In [7]:
from structured_qa.model_loaders import load_llama_cpp_model

In [8]:
model = load_llama_cpp_model(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Run Benchmark

In [9]:
from pathlib import Path

import pandas as pd


logger.info("Loading input data")
data = pd.read_csv("structured-qa/benchmark/structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)

for section_name, section_data in data.groupby("section"):
    section_file = Path(f"structured-qa/benchmark/perfect_context/{section_name}.txt")

    answers, sections = process_section_questions(section_file, section_data, model)

    for index in section_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-02-04 18:01:42.914 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-02-04 18:01:42.952 | INFO     | __main__:process_section_questions:24 - Predicting
2025-02-04 18:01:42.956 | INFO     | __main__:process_section_questions:29 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-02-04 18:01:44.761 | INFO     | __main__:process_section_questions:40 - Answer: 175
2025-02-04 18:01:44.763 | INFO     | __main__:process_section_questions:29 - Question: Does LoRA introduce additional inference latency compared to full fine-tuning?
2025-02-04 18:01:44.882 | INFO     | __main__:process_section_questions:40 - Answer: NO
2025-02-04 18:01:44.8

# Results

In [10]:
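results = pd.read_csv("results.csv")
for index, result in results.iterrows():
    # Accept a prediction that starts with the expected answer, optionally
    # preceded by a dash, and normalize it to the expected answer.
    if result["pred_answer"].startswith(
        (f"-{result['answer']}", f"{result['answer']}")
    ):
        results.loc[index, "pred_answer"] = result["answer"]
# Show only the rows where the normalized prediction still differs.
results.loc[results["answer"] != results["pred_answer"]]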

| Unnamed: 0.1 | Unnamed: 0 | document | type | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|---|
| 10 | 10 | https://arxiv.org/pdf/1706.03762 | Scientific Paper | 5.4 Regularization | What was the dropout rate used for the base mo... | 0.1 | PDROP = 0.1 | |
| 28 | 28 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 3.1 Experimental Setup | How many large language models were evaluated? | 5 | FIVE | |
| 32 | 32 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 5 Symbolic Reasoning | Which symbolic reasoning task is used as an ou... | A | I NEED MORE INFO | |
| 37 | 37 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CARD AND TILE EFFECTS | How many different races are there? | 6 | 5 | |
| 41 | 41 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CHAPTER OVERVIEW | After taking a landmark tile, do you reveal a ... | NO | YES | |
| 42 | 42 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |
| 55 | 55 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | EXPEDITION PHASE | Do you need a fish to conquer a distant island? | YES | NO | |
| 58 | 58 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | LOCATION ABILITIES | How many victory points are granted by a built... | 1 | I NEED MORE INFO | |
| 68 | 68 | https://docs.nvidia.com/cuda/pdf/CUDA_C_Progra... | Techincal Documentation | 5.2. Thread Hierarchy | Can you identify a thread with a four-dimensio... | NO | I NEED MORE INFO | |
| 94 | 94 | https://arxiv.org/pdf/2302.13971 | Scientific Report | 3 Main results | Was the model compared against GPT-4? | NO | I NEED MORE INFO | |
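
The matching rule in the cell above can be isolated into a small function to see which kinds of predictions it accepts. This is a sketch of the same prefix check, not a replacement for it; `normalize_prediction` is a hypothetical helper name, and the first two example predictions are made up for illustration (the last two come from the mismatch table).

```python
def normalize_prediction(pred_answer, answer):
    """Return `answer` if the prediction starts with it (optionally after a dash)."""
    pred, expected = str(pred_answer), str(answer)
    if pred.startswith((f"-{expected}", expected)):
        return expected
    return pred


examples = [
    ("YES, IT DOES", "YES"),   # hypothetical prediction: accepted by the prefix rule
    ("-A", "A"),               # hypothetical prediction: leading dash is tolerated
    ("PDROP = 0.1", "0.1"),    # from the table: answer is not a prefix, so it stays a mismatch
    ("FIVE", "5"),             # from the table: spelled-out number is not recognized
]
for pred, expected in examples:
    normalized = normalize_prediction(pred, expected)
    print(f"{pred!r} vs {expected!r} -> match={normalized == expected}")
```

Because the check is a plain prefix match, it is lenient in one direction (a prediction of `50` would be accepted for an expected `5`) and strict in the other (`PDROP = 0.1` and `FIVE` are counted as errors even though they carry the right value).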


In [11]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.9029126213592233
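
As a sanity check, the reported accuracy can be related back to the ten mismatches listed in the table. The sketch below assumes the table shows every mismatch; under that assumption the figure is consistent with 93 correct answers out of 103 questions. The variable names are illustrative only.

```python
from collections import Counter

reported_accuracy = 0.9029126213592233  # value printed by the accuracy cell

# pred_answer values copied from the mismatch table in the Results section.
mismatched_predictions = [
    "PDROP = 0.1", "FIVE", "I NEED MORE INFO", "5", "YES",
    "NO", "NO", "I NEED MORE INFO", "I NEED MORE INFO", "I NEED MORE INFO",
]

n_mismatches = len(mismatched_predictions)
# accuracy = (total - mismatches) / total  =>  total = mismatches / (1 - accuracy)
n_total = round(n_mismatches / (1 - reported_accuracy))
n_correct = n_total - n_mismatches

# Count how many mismatches are refusals rather than wrong answers.
refusals = Counter(mismatched_predictions)["I NEED MORE INFO"]

print(f"{n_correct} correct out of {n_total} questions")
print(f"{refusals} of {n_mismatches} mismatches are 'I NEED MORE INFO' replies")
```

Separating refusals from wrong or badly formatted answers makes it easier to see whether errors come from the model declining to answer or from the strict string comparison.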