# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## Installing dependencies

In [1]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-zpym8juf
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-zpym8juf
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit 7b9c96cd5fc3cd34781aa26e2519a6f4731feedc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-01-29 13:54:23--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21734 (21K) [text/plain]
Saving to: ‘structured_qa.csv.3’


2025-01-29 13:54:24 (9.19 MB/s) - ‘structured_qa.csv.3’ saved [21734/21734]



# Setup

In [3]:
import os

os.environ["LOGURU_LEVEL"] = "INFO"

In [4]:
from loguru import logger

## Function to Process a single Document

In [5]:
from pathlib import Path

from structured_qa.config import FIND_PROMPT
from structured_qa.preprocessing import document_to_sections_dir
from structured_qa.workflow import find_retrieve_answer


ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You only answer based on the current information available.
The current information available is:

```
{CURRENT_INFO}
```

The answer must be in one of the following formats:
- YES/NO (for boolean questions)
Is the model an LLM?
YES
- Number (for numeric questions)
How many layers does the model have?
12
- Single letter (for multiple-choice questions)
What is the activation function used in the model? -A: ReLU -B: Sigmoid -C: Tanh
C
"""


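def process_document(
    document_file,
    document_data,
    model,
    find_prompt: str = FIND_PROMPT,
    answer_prompt: str = ANSWER_WITH_TYPE_PROMPT,
):
    """Answer every question in `document_data` about `document_file`.

    Returns two dicts keyed by the row index of `document_data`: the
    predicted answers and, for each question, the last section checked
    (None if no section was checked).
    """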
    sections_dir = Path("sections") / Path(document_file).stem
    if not sections_dir.exists():
        logger.info("Splitting document into sections")
        document_to_sections_dir(document_file, sections_dir)

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        question = row["question"]
        logger.info(f"Question: {question}")
        answer, sections_checked = find_retrieve_answer(
            question, model, sections_dir, find_prompt, answer_prompt
        )
        logger.info(f"Answer: {answer}")
        answers[index] = answer
        sections[index] = sections_checked[-1] if sections_checked else None

    return answers, sections
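
The prompt above restricts answers to three formats: YES/NO, a number, or a single letter. The following is a minimal sketch of a checker for those formats. The function name `classify_answer` and its category labels are hypothetical and are not part of `structured_qa`; it may help to spot predictions that do not follow the requested format.

```python
import re


def classify_answer(answer: str) -> str:
    """Return which of the prompt's answer formats `answer` matches.

    Hypothetical helper for illustration; not part of structured_qa.
    """
    text = answer.strip().upper()
    if text in {"YES", "NO"}:
        return "boolean"
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return "number"
    if re.fullmatch(r"[A-Z]", text):
        return "choice"
    # Anything else does not follow the requested formats.
    return "invalid"


for candidate in ["YES", "12", "C", "GENERATION ERROR"]:
    print(candidate, "->", classify_answer(candidate))
```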

## Load Model

In [6]:
from structured_qa.model_loaders import load_llama_cpp_model

In [7]:
model = load_llama_cpp_model(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)

# Run Benchmark

In [8]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)

for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    # Note: links that already end in ".pdf" get a doubled suffix here
    # (e.g. "HAI_AI-Index-Report-2024.pdf.pdf").
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    # Download each document only once; later runs reuse the local copy.
    if not downloaded_document.exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-29 13:54:26.672 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-01-29 13:54:26.682 | INFO     | __main__:<cell line: 0>:12 - Downloading document https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
2025-01-29 13:54:26.686 | INFO     | __main__:<cell line: 0>:18 - File HAI_AI-Index-Report-2024.pdf.pdf already exists
2025-01-29 13:54:26.687 | INFO     | __main__:process_document:40 - Predicting
2025-01-29 13:54:26.689 | INFO     | __main__:process_document:45 - Question: which type of risk was identified as the leading concern globally? -A: Fairness risks. -B: Privacy and data governance risks. -C: Risks related to generative AI deployment.
2025-01-29 13:54:27.691 | 

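The log above shows the AI Index report saved as `HAI_AI-Index-Report-2024.pdf.pdf`, because `.pdf` is appended even when the link already ends with it. One possible way to derive the local filename without doubling the suffix is sketched below. The helper name `local_pdf_name` is hypothetical, and the sketch assumes every document in the benchmark is a PDF.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_pdf_name(document_link: str) -> str:
    """Derive a local PDF filename from a document URL.

    Hypothetical helper: appends ".pdf" only when the last path
    component of the URL does not already end with it.
    """
    name = PurePosixPath(urlparse(document_link).path).name
    return name if name.lower().endswith(".pdf") else f"{name}.pdf"


print(local_pdf_name("https://arxiv.org/pdf/1706.03762"))
print(
    local_pdf_name(
        "https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf"
    )
)
```

Switching to such a helper changes the filenames on disk, so documents already downloaded under the old names would be fetched again.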
In [9]:
results = pd.read_csv("results.csv")
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 0 | 0 | https://arxiv.org/pdf/1706.03762 | 3 Model Architecture | What type of architecture does the model use? ... | C | GENERATION ERROR | |
| 1 | 1 | https://arxiv.org/pdf/1706.03762 | 3.1 Encoder and Decoder Stacks | How many layers compose the encoder? | 6 | GENERATION ERROR | |
| 2 | 2 | https://arxiv.org/pdf/1706.03762 | 3.1 Encoder and Decoder Stacks | How many layers compose the decoder? | 6 | GENERATION ERROR | |
| 3 | 3 | https://arxiv.org/pdf/1706.03762 | 3.2.2 Multi-Head Attention | How many parallel attention heads are used? | 8 | GENERATION ERROR | |
| 4 | 4 | https://arxiv.org/pdf/1706.03762 | 3.4 Embeddings and Softmax | Does the final model use learned embeddings fo... | YES | GENERATION ERROR | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 94 | https://aiindex.stanford.edu/wp-content/upload... | LLM Tokenization Introduces Unfairness | What are the three major inequalities resultin... | B | GENERATION ERROR | |
| 95 | 95 | https://aiindex.stanford.edu/wp-content/upload... | U.S. Regulation | How many AI-related regulations were enacted i... | 25 | GENERATION ERROR | |
| 96 | 96 | https://aiindex.stanford.edu/wp-content/upload... | U.S. Regulation | Which of the following was identified as a hig... | B | GENERATION ERROR | |
| 97 | 97 | https://aiindex.stanford.edu/wp-content/upload... | Europe | Which country had the highest proportion of fe... | B | GENERATION ERROR | |


In [10]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.0
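
The mismatched rows displayed above all carry `GENERATION ERROR` as the prediction, so a single accuracy figure does not distinguish failed generations from wrong answers. The sketch below counts the three outcomes separately. The function name `summarize_predictions` is hypothetical, and the example rows are made up for illustration; they are not benchmark results.

```python
from collections import Counter


def summarize_predictions(records):
    """Count correct, wrong, and failed predictions.

    `records` is an iterable of dicts with "answer" and "pred_answer"
    keys. Hypothetical helper for illustration only.
    """
    counts = Counter()
    for record in records:
        predicted = str(record["pred_answer"])
        if predicted == "GENERATION ERROR":
            counts["generation_error"] += 1
        elif predicted == str(record["answer"]):
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Made-up rows for illustration only, not benchmark results.
example = [
    {"answer": "C", "pred_answer": "GENERATION ERROR"},
    {"answer": "6", "pred_answer": "6"},
    {"answer": "YES", "pred_answer": "NO"},
    {"answer": "8", "pred_answer": "GENERATION ERROR"},
]
print(summarize_predictions(example))
```

With the notebook's DataFrame, `results.to_dict("records")` would supply the rows in the expected shape.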