# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## Installing dependencies

In [1]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-q1o0cypa
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-q1o0cypa
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit a02ffd7c45a36261597af3f00a2316d7e349d05b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev109+ga02ffd7)
  Downloading fire-0.7.0.tar.gz (87 kB)

In [2]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-02-04 10:16:08--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21441 (21K) [text/plain]
Saving to: ‘structured_qa.csv’


2025-02-04 10:16:08 (13.8 MB/s) - ‘structured_qa.csv’ saved [21441/21441]



# Setup

In [3]:
import os
import google.generativeai as genai
from google.colab.userdata import get, SecretNotFoundError

try:
    genai.configure(api_key=get("GOOGLE_API_KEY"))
except SecretNotFoundError as e:
    raise RuntimeError("Please set the GOOGLE_API_KEY secret to your API key") from e
os.environ["LOGURU_LEVEL"] = "INFO"
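`google.colab.userdata` is only importable inside a Colab runtime, so the cell above fails elsewhere. A minimal sketch of a fallback, assuming the key is exported as an environment variable outside Colab; `resolve_api_key` is a hypothetical helper, not part of `structured_qa` or `google.generativeai`.

```python
import os


def resolve_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except userdata.SecretNotFoundError:
            pass  # fall through to the environment variable

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Please set {name} as a Colab secret or environment variable")
    return key
```

With this helper, the configuration line would become `genai.configure(api_key=resolve_api_key())`.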

In [4]:
from loguru import logger

## Function to Process a Single Document

In [5]:
from pathlib import Path

from structured_qa.config import FIND_PROMPT
from structured_qa.preprocessing import document_to_sections_dir
from structured_qa.workflow import find_retrieve_answer


ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You must only answer based on the current information available which is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the "I need more info" string and nothing else.

If the current information is enough to answer, you must return one of the following formats:
- YES/NO (for boolean questions)
- Number (for numeric questions)
- Single letter (for multiple-choice questions)
"""


def process_document(
    document_file,
    document_data,
    model,
    find_prompt: str = FIND_PROMPT,
    answer_prompt: str = ANSWER_WITH_TYPE_PROMPT,
):
    sections_dir = Path("sections") / Path(document_file).stem
    if not sections_dir.exists():
        logger.info("Splitting document into sections")
        document_to_sections_dir(document_file, sections_dir)

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        question = row["question"]
        logger.info(f"Question: {question}")
        answer, sections_checked = find_retrieve_answer(
            question, model, sections_dir, find_prompt, answer_prompt
        )
        logger.info(f"Answer: {answer}")
        answers[index] = answer
        sections[index] = sections_checked[-1] if sections_checked else None

    return answers, sections
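`ANSWER_WITH_TYPE_PROMPT` allows four reply shapes: the "I need more info" string, YES/NO, a number, or a single letter. A small sketch of a check for those shapes, which may help when inspecting raw replies; `answer_kind` and its labels are hypothetical names, and the number pattern (optional sign and decimals) is an assumption about what counts as numeric.

```python
import re

NEED_MORE_INFO = "I need more info"


def answer_kind(text: str) -> str:
    """Classify a raw model reply by the formats the prompt allows."""
    reply = text.strip()
    if reply.lower() == NEED_MORE_INFO.lower():
        return "need_more_info"
    if reply.upper() in {"YES", "NO"}:
        return "boolean"
    if re.fullmatch(r"-?\d+(\.\d+)?", reply):
        return "number"
    if re.fullmatch(r"[A-Za-z]", reply):
        return "choice"
    # Anything else does not follow the prompt's formats.
    return "unexpected"
```

A reply such as `NOT FOUND`, which appears in the `pred_answer` column of the results, falls into the `unexpected` bucket under this check.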

## Load Model

In [6]:
from structured_qa.model_loaders import load_gemini_model

In [7]:
model = load_gemini_model("gemini-2.0-flash-exp", system_prompt=None)
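The benchmark log further down shows `structured_qa.model_loaders.get_response` reporting `Current calls: N`, logging `Waiting for 60 seconds` once the count reaches 9, and then restarting at 1. A minimal sketch of a throttle with that observable behavior follows; `CallThrottle` is a hypothetical class written for illustration and is not the library's implementation.

```python
import time


class CallThrottle:
    """Pause after a fixed number of calls (illustrative sketch only)."""

    def __init__(self, max_calls: int = 9, wait_seconds: float = 60.0, sleep=time.sleep):
        self.max_calls = max_calls
        self.wait_seconds = wait_seconds
        self.sleep = sleep  # injectable so the example can run without waiting
        self.current_calls = 0

    def before_call(self) -> None:
        """Count one call; pause and reset once the budget is used up."""
        self.current_calls += 1
        if self.current_calls >= self.max_calls:
            self.sleep(self.wait_seconds)
            self.current_calls = 0
```

Calling `before_call()` ahead of each model request keeps the request rate under a per-minute quota at the cost of idle time, which is consistent with the one-minute jumps between timestamps in the log.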

# Run Benchmark
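The loop in this section saves each document as `Path(document_link).name` plus `.pdf`, which the log shows produces names such as `HAI_AI-Index-Report-2024.pdf.pdf` and keeps URL query strings and `%20` escapes in file and section directory names. A sketch of an alternative naming helper, assuming it is acceptable for local names to differ from those in the log; `local_pdf_name` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def local_pdf_name(document_link: str) -> str:
    """Derive a local file name from a URL without doubling the .pdf suffix."""
    path = urlparse(document_link).path  # drops ?query and #fragment
    name = unquote(PurePosixPath(path).name)  # decodes %20 and similar escapes
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name
```

Swapping this in for the f-string would change the downloaded file names and the `sections/` directory names, so previously cached sections would not be reused.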

In [8]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)

for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    if not Path(downloaded_document).exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-02-04 10:16:14.200 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-02-04 10:16:14.223 | INFO     | __main__:<cell line: 0>:12 - Downloading document https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
2025-02-04 10:16:14.616 | INFO     | __main__:<cell line: 0>:16 - Downloaded https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf to HAI_AI-Index-Report-2024.pdf.pdf
2025-02-04 10:16:14.623 | INFO     | __main__:process_document:34 - Splitting document into sections
2025-02-04 10:16:14.628 | INFO     | structured_qa.preprocessing:document_to_sections_dir:85 - Converting HAI_AI-Index-Report-2024.pdf.pdf


Processing HAI_AI-Index-Report-2024.pdf.pdf...

2025-02-04 10:21:38.611 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:21:38.612 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:21:38.763 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 84 sections
2025-02-04 10:21:38.764 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/HAI_AI-Index-Report-2024.pdf
2025-02-04 10:21:38.780 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:21:38.782 | INFO     | __main__:process_document:37 - Predict

2025-02-04 10:21:51.594 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 10:21:52.767 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 2
2025-02-04 10:21:53.990 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 3
2025-02-04 10:21:55.165 | INFO     | __main__:process_document:46 - Answer: B
2025-02-04 10:21:55.168 | INFO     | __main__:process_document:42 - Question: In which geographical area were fairness risks selected by the smallest percentage of respondents? -A: Asia. -B: Europe. -C: North America.
2025-02-04 10:21:55.171 | INFO     | structured_qa.model_loaders:get_response:96 - Current call

Processing 1706.03762.pdf...

2025-02-04 10:30:30.455 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:30:30.457 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:30:30.466 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 12 sections
2025-02-04 10:30:30.469 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/1706.03762
2025-02-04 10:30:30.474 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:30:30.476 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 10:31:31.907 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 10:31:33.483 | INFO     | __main__:process_document:46 - Answer: C
2025-02-04 10:31:33.485 | INFO     | __main__:process_document:42 - Question: How many layers compose the encoder?
2025-02-04 10:31:33.488 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 2
2025-02-04 10:31:35.114 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 3
2025-02-04 10:31:36.412 | INFO     | __main__:process_document:46 - Answer: 6
2025-02-04 10:31:36.413 | INFO     | __main__:process_document:42 -

Processing 2106.09685.pdf.pdf...

2025-02-04 10:35:27.059 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:35:27.061 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:35:27.073 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 22 sections
2025-02-04 10:35:27.075 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/2106.09685.pdf
2025-02-04 10:35:27.080 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:35:27.081 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 10:35:28.486 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 9
2025-02-04 10:35:28.488 | INFO     | structured_qa.model_loaders:get_response:99 - Waiting for 60 seconds
2025-02-04 10:36:29.635 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 10:36:30.757 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 2
2025-02-04 10:36:31.730 | INFO     | __main__:process_document:46 - Answer: YES
2025-02-04 10:36:31.731 | INFO     | __main__:process_document:42 - Question: By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x
2025-02-04 10:36:31.734

Processing 2201.11903.pdf...

2025-02-04 10:39:55.481 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:39:55.482 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:39:55.500 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 21 sections
2025-02-04 10:39:55.503 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/2201.11903
2025-02-04 10:39:55.510 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:39:55.512 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 10:39:56.741 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 6
2025-02-04 10:39:57.887 | INFO     | __main__:process_document:46 - Answer: NO
2025-02-04 10:39:57.890 | INFO     | __main__:process_document:42 - Question: How many large language models were evaluated?
2025-02-04 10:39:57.891 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 7
2025-02-04 10:39:58.915 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 8
2025-02-04 10:40:00.034 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 9
2025-02-04 10:40:00.036 | INFO     | structured_qa.model_load

Processing 2210.05189.pdf...

2025-02-04 10:48:08.615 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:48:08.616 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:48:08.625 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 12 sections
2025-02-04 10:48:08.627 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/2210.05189
2025-02-04 10:48:08.631 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:48:08.636 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 10:48:09.940 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 4
2025-02-04 10:48:10.960 | INFO     | __main__:process_document:46 - Answer: YES
2025-02-04 10:48:10.962 | INFO     | __main__:process_document:42 - Question: How many layers are in the toy model (y = x^2)?
2025-02-04 10:48:10.965 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 5
2025-02-04 10:48:11.937 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 6
2025-02-04 10:48:12.957 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 7
2025-02-04 10:48:13.976 | INFO     | structured_qa.model_lo

Processing 2302.13971.pdf...

2025-02-04 10:56:30.134 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 10:56:30.135 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 10:56:30.151 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 17 sections
2025-02-04 10:56:30.153 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/2302.13971
2025-02-04 10:56:30.160 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 10:56:30.164 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 10:56:30.168 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 6
2025-02-04 10:56:31.145 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 7
2025-02-04 10:56:32.163 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 8
2025-02-04 10:56:33.107 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 9
2025-02-04 10:56:33.108 | INFO     | structured_qa.model_loaders:get_response:99 - Waiting for 60 seconds
2025-02-04 10:57:34.278 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 10:57:35.249 | INFO     | structured_qa.model_loa

Processing 6.8558_CO_Generative_AI_Framework_Report_v7_WEB.pdf.pdf...

2025-02-04 11:06:05.243 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:06:05.245 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:06:05.272 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 26 sections
2025-02-04 11:06:05.275 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/6.8558_CO_Generative_AI_Framework_Report_v7_WEB.pdf
2025-02-04 11:06:05.281 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:06:05.283 | INFO     | __main__:process_document:

2025-02-04 11:06:06.360 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 7
2025-02-04 11:06:07.305 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 8
2025-02-04 11:06:08.351 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 9
2025-02-04 11:06:08.352 | INFO     | structured_qa.model_loaders:get_response:99 - Waiting for 60 seconds
2025-02-04 11:07:09.649 | INFO     | __main__:process_document:46 - Answer: C
2025-02-04 11:07:09.651 | INFO     | __main__:process_document:42 - Question: Can LLMs be used as an alternative to visiting a doctor?
2025-02-04 11:07:09.655 | INFO     | structure

Processing Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf.pdf...

2025-02-04 11:08:38.611 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:08:38.616 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:08:38.650 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 40 sections
2025-02-04 11:08:38.654 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf
2025-02-04 11:08:38.666 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:08:38.670 | INFO     | __main__:process_document:37 - Predicting
2025-02-04 11:08:38.673 | INFO     | __main__:process_document:42 - Question: According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT

Processing 1654ca52-ec72-4bae-ba40-d2fc0f3d71ae_en?filename=mit-1-performance-and-technical-performance-specification-v1-2_en.pdf.pdf...

2025-02-04 11:11:39.970 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:11:39.971 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:11:40.067 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 254 sections
2025-02-04 11:11:40.070 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/1654ca52-ec72-4bae-ba40-d2fc0f3d71ae_en?filename=mit-1-performance-and-technical-performance-specification-v1-2_en.pdf
2025-02-04 11:11:40.101 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:11:40.102 | INFO     | __main__:process_document:37 - Predicting
2025-02-04 11:11:40.106 | INFO     | __main__:process_document:42 - Question: Which type of water must be supplied in a toilet sink? -A: hot -B: cold -C: hot and cold

Processing CUDA_C_Programming_Guide.pdf.pdf...

2025-02-04 11:20:20.044 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:20:20.049 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:20:20.245 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 447 sections
2025-02-04 11:20:20.250 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/CUDA_C_Programming_Guide.pdf
2025-02-04 11:20:20.318 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:20:20.324 | INFO     | __main__:process_document:37 - Predic

Processing 7DUME_EN01_Rules.pdf.pdf...

2025-02-04 11:26:16.222 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:26:16.224 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:26:16.231 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 25 sections
2025-02-04 11:26:16.235 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/7DUME_EN01_Rules.pdf
2025-02-04 11:26:16.246 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:26:16.250 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 11:27:17.430 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 11:27:18.424 | INFO     | __main__:process_document:46 - Answer: 3
2025-02-04 11:27:18.426 | INFO     | __main__:process_document:42 - Question: How many victory conditions are there?
2025-02-04 11:27:18.427 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 2
2025-02-04 11:27:19.420 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 3
2025-02-04 11:27:20.367 | INFO     | __main__:process_document:46 - Answer: 3
2025-02-04 11:27:20.369 | INFO     | __main__:process_document:42 -

Processing is_eotn_rulebook.pdf.pdf...

2025-02-04 11:51:02.798 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:88 - Converted
2025-02-04 11:51:02.799 | INFO     | structured_qa.preprocessing:document_to_sections_dir:90 - Extracting sections
2025-02-04 11:51:02.808 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:94 - Found 40 sections
2025-02-04 11:51:02.811 | INFO     | structured_qa.preprocessing:document_to_sections_dir:95 - Writing sections to sections/is_eotn_rulebook.pdf
2025-02-04 11:51:02.819 | SUCCESS  | structured_qa.preprocessing:document_to_sections_dir:103 - Done
2025-02-04 11:51:02.821 | INFO     | __main__:process_document:37 - Predicting

2025-02-04 11:52:04.181 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 1
2025-02-04 11:52:05.176 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 2
2025-02-04 11:52:06.424 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 3
2025-02-04 11:52:07.419 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 4
2025-02-04 11:52:08.666 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 5
2025-02-04 11:52:09.661 | INFO     | structured_qa.model_loaders:get_response:96 - Current calls: 6
2025-02-04 11:52:10.705 | INFO     | structured_qa.model_loaders

In [9]:
results = pd.read_csv("results.csv")
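for index, result in results.iterrows():
    # Strip once and reuse the cleaned value for both the update and the prefix check.
    pred_answer = result["pred_answer"].strip()
    results.loc[index, "pred_answer"] = pred_answer
    # Accept replies such as "-B: ..." or "B ..." as the expected label.
    if pred_answer.startswith((f"-{result['answer']}", f"{result['answer']}")):
        results.loc[index, "pred_answer"] = result["answer"]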
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 12 | https://arxiv.org/pdf/2210.05189 | 3 Experimental Results | How many layers are in the toy model (y = x^2)? | 3 | NOT FOUND | Caglar Aytekin AI Lead AAC Technologies |
| 28 | 28 | https://arxiv.org/pdf/2201.11903 | 3.1 Experimental Setup | How many large language models were evaluated? | 5 | 3 | Abstract |
| 32 | 32 | https://arxiv.org/pdf/2201.11903 | 5 Symbolic Reasoning | Which symbolic reasoning task is used as an ou... | A | YES | 5 Symbolic Reasoning |
| 33 | 33 | https://arxiv.org/pdf/2201.11903 | 3.4 Robustness of Chain of Thought | How many annotators provided independent chain... | 3 | 2 | H Appendix: Alternate Annotators for MWP |
| 34 | 34 | https://arxiv.org/pdf/2201.11903 | 3.2 Results | How many random samples were examined to under... | 100 | NOT FOUND | Chain-of-Thought Prompting Elicits Reasoning i... |
| 45 | 45 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | Which type of cards provide coins? -A: Gray -B... | B | NOT FOUND | 4 |
| 48 | 48 | https://github.com/mozilla-ai/structured-qa/re... | END OF THE GAME | Can the game end in a tie? | YES | NO | OVERVIEW AND GOAL |
| 50 | 50 | https://github.com/mozilla-ai/structured-qa/re... | LOOKOUT PHASE | What is the maximum number of cards a player m... | 4 | NOT FOUND | 1 3 3 5 7 8 |
| 51 | 51 | https://github.com/mozilla-ai/structured-qa/re... | LOOKOUT PHASE | Is there a limit to the number of cards a play... | NO | YES | CLEANUP PHASE players with extra Goods if enou... |
| 54 | 54 | https://github.com/mozilla-ai/structured-qa/re... | EXPEDITION PHASE | Can players conquer and pillage the same islan... | NO | YES | EXPEDITION PHASE |
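The comparison cell above strips each prediction and counts it as correct when it starts with the expected label, optionally preceded by `-`. The same rule can be sketched as a standalone function, which makes its edge cases easier to probe; `normalize_prediction` is a hypothetical name and the sample strings are made up for illustration.

```python
def normalize_prediction(pred: str, expected: str) -> str:
    """Collapse a prediction to the expected label when it starts with that label."""
    cleaned = pred.strip()
    if cleaned.startswith((f"-{expected}", expected)):
        return expected
    return cleaned


# Made-up (prediction, expected) pairs, not benchmark rows.
samples = [("-B: EUROPE\n", "B"), ("NO\n", "YES"), ("30", "3")]
normalized = [normalize_prediction(pred, expected) for pred, expected in samples]
```

Because the rule is a prefix match, a prediction of `30` is accepted for an expected answer of `3`; an exact comparison after stripping would be stricter. On the DataFrame, the function could be applied with `results.apply(lambda r: normalize_prediction(r["pred_answer"], r["answer"]), axis=1)`.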


In [10]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.8640776699029126
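The single accuracy figure hides how errors are distributed; in the mismatch table, several rows share the same document. A sketch of a per-document breakdown, assuming rows are dictionaries with the `document`, `answer`, and `pred_answer` keys used in `results.csv`; `accuracy_by_document` is a hypothetical helper and the toy rows are not benchmark data.

```python
from collections import defaultdict


def accuracy_by_document(rows):
    """Return {document: fraction of rows whose prediction equals the answer}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["document"]] += 1
        hits[row["document"]] += row["answer"] == row["pred_answer"]
    return {doc: hits[doc] / totals[doc] for doc in totals}


# Toy rows for illustration only.
toy_rows = [
    {"document": "doc_a", "answer": "YES", "pred_answer": "YES"},
    {"document": "doc_a", "answer": "3", "pred_answer": "2"},
    {"document": "doc_b", "answer": "B", "pred_answer": "B"},
]
per_document = accuracy_by_document(toy_rows)
```

With the notebook's DataFrame, the rows could be supplied as `results.to_dict("records")`.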