# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## GPU Check

First, you'll need to enable GPUs for the notebook:

- Navigate to `Edit`→`Notebook Settings`
- Select T4 GPU from the Hardware Accelerator section
- Click `Save` and accept.

Next, we'll confirm that we can connect to the GPU:

In [6]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("GPU not available")
else:
    print("GPU is available!")

GPU is available!


## Installing dependencies

In [7]:
%pip install ragatouille PyPDF2



In [8]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-v6q9_weu
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-v6q9_weu
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit 97049d67d83ec6129569d442bd365c7a5e490578
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [9]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-02-04 15:27:16--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23304 (23K) [text/plain]
Saving to: ‘structured_qa.csv.1’


2025-02-04 15:27:17 (28.5 MB/s) - ‘structured_qa.csv.1’ saved [23304/23304]



# Setup

In [10]:
import os
import google.generativeai as genai
from google.colab.userdata import get, SecretNotFoundError

try:
    genai.configure(api_key=get("GOOGLE_API_KEY"))
except SecretNotFoundError as e:
    raise RuntimeError("Please set the GOOGLE_API_KEY secret to your API key") from e
os.environ["LOGURU_LEVEL"] = "INFO"

In [11]:
from loguru import logger

In [12]:
import PyPDF2


def load_pdf(pdf_file: str) -> str | None:
    try:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() for page in pdf_reader.pages)
    except Exception as e:
        logger.exception(e)
        return None
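
`load_pdf` returns `None` when extraction fails, and text extraction may also yield an empty string for pages without a text layer. The sketch below shows one way to stop early in that case instead of passing the value on to the indexing step. `require_text` is a hypothetical helper introduced here for illustration; it is not part of `structured_qa` or of this notebook.

```python
from typing import Optional


def require_text(text: Optional[str], source: str) -> str:
    """Return `text` unchanged, or raise if extraction produced nothing usable.

    Hypothetical guard: `source` is only used to build the error message.
    """
    if text is None or not text.strip():
        raise ValueError(f"No text could be extracted from {source}")
    return text


# Usage with a made-up value standing in for the output of load_pdf(...)
print(require_text("Some extracted text", "example.pdf"))
```

Failing fast here keeps a broken download from surfacing later as a less obvious error inside the retrieval code.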

## Function to Process all questions for a single Document

In [13]:
import json
import time

from ragatouille import RAGPretrainedModel
from ragatouille.data import CorpusProcessor


def process_document(
    document_file,
    document_data,
    model,
):
    logger.info("Setting up RAG")
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    corpus_processor = CorpusProcessor()
    documents = corpus_processor.process_corpus([load_pdf(document_file)])
    RAG.encode([x["content"] for x in documents])

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        if model.n > 0 and model.n % 9 == 0:  # throttle: pause after every 9 requests
            logger.info("Waiting for 60 seconds")
            time.sleep(60)
        question = row["question"]
        question_part, *options = question.split("?")  # retrieval uses the text before the first "?"

        logger.info(f"Question: {question}")
        results = RAG.search_encoded_docs(query=question_part, k=3)
        current_info = "\n".join(result["content"] for result in results)
        logger.info(current_info[:100])

        answer = model.model.generate_content(
            [f"This is the document: {current_info}", question]
        )
        logger.info(answer.text)
        answers[index] = answer.text.strip()
        sections[index] = None
        model.n += 1

    return answers, sections
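
The retrieval query in `process_document` is whatever precedes the first `?` in the question, so multiple-choice options are excluded from the search. A small standalone illustration of that unpacking, using two questions that appear in the benchmark logs (`split_question` is a hypothetical wrapper added only for this example):

```python
def split_question(question: str) -> tuple[str, list[str]]:
    """Split a benchmark question into its stem and the trailing pieces.

    Mirrors `question_part, *options = question.split("?")`.
    """
    question_part, *options = question.split("?")
    return question_part, options


stem, options = split_question(
    "What optimizer was used for training? -A: AdamW -B: Adam -C: SGD"
)
print(repr(stem))  # 'What optimizer was used for training'
print(options)     # [' -A: AdamW -B: Adam -C: SGD']

stem, options = split_question("How many layers compose the encoder?")
print(repr(stem))  # 'How many layers compose the encoder'
print(options)     # ['']
```

Note that the stem loses its question mark, and that a question containing more than one `?` would be cut at the first one.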

## Load Model

In [14]:
from structured_qa.model_loaders import load_gemini_model

In [15]:
SYSTEM_PROMPT = """
You are a rigorous assistant answering questions.
You must only answer based on the current information available which is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the "I need more info" string and nothing else.

If the current information is enough to answer, you must return one of the following formats:
- YES/NO (for boolean questions)
- Number (for numeric questions)
- Single letter (for multiple-choice questions)
"""

In [16]:
model = load_gemini_model("gemini-2.0-flash-exp", system_prompt=SYSTEM_PROMPT)
model.n = 0
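
The `n` attribute set above is the request counter that `process_document` checks (`model.n > 0 and model.n % 9 == 0`) before sleeping for 60 seconds. The same logic can be sketched as a small standalone object; `CallThrottle` is a hypothetical class written for this illustration, and the injectable `sleep` argument exists only so the behaviour can be observed without waiting.

```python
import time
from typing import Callable


class CallThrottle:
    """Pause for `pause_seconds` before every call whose index is a
    positive multiple of `calls_per_window`."""

    def __init__(
        self,
        calls_per_window: int = 9,
        pause_seconds: float = 60.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.calls_per_window = calls_per_window
        self.pause_seconds = pause_seconds
        self.sleep = sleep
        self.n = 0  # number of calls made so far

    def before_call(self) -> None:
        # Same condition as in process_document, with 9 made configurable.
        if self.n > 0 and self.n % self.calls_per_window == 0:
            self.sleep(self.pause_seconds)
        self.n += 1


# Record the pauses instead of actually sleeping.
recorded: list[float] = []
throttle = CallThrottle(sleep=recorded.append)
for _ in range(20):
    throttle.before_call()
print(recorded)  # [60.0, 60.0]
```

Because the counter lives on the shared `model` object in the notebook, the pauses carry across documents, which is why some documents in the log start with "Waiting for 60 seconds".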

# Run Benchmark

In [17]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)
for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    if not Path(downloaded_document).exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-02-04 15:27:45.015 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-02-04 15:27:45.044 | INFO     | __main__:<cell line: 0>:11 - Downloading document https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
2025-02-04 15:27:45.376 | INFO     | __main__:<cell line: 0>:15 - Downloaded https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf to HAI_AI-Index-Report-2024.pdf.pdf
2025-02-04 15:27:45.381 | INFO     | __main__:process_document:13 - Setting up RAG
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will b



Encoding 1214 documents...

100%|██████████| 38/38 [00:05<00:00,  6.52it/s]
2025-02-04 15:28:42.876 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:28:42.878 | INFO     | __main__:process_document:29 - Question: which type of risk was identified as the leading concern globally? -A: Fairness risks. -B: Privacy and data governance risks. -C: Risks related to generative AI deployment.

Shapes:
encodings: torch.Size([1214, 508, 128])
doc_masks: torch.Size([1214, 508])
Documents encoded!

2025-02-04 15:28:43.129 | INFO     | __main__:process_document:32 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to
2025-02-04 15:28:44.269 | INFO     | __main__:process_document:37 - B
2025-02-04 15:28:44.271 | INFO     | __main__:process_document:29 - Question: In which geographical area were fairness risks selected by the smallest percentage of respondents? -A: Asia. -B: Europe. -C: North America.
2025-02-04 15:28:44.295 | INFO     | __main__:process_document:32 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to
2025-02-04 15:28:45.485 | INFO     | __main__:process_document:37 - C
2025-02-04 15:28:45.488 | INFO

Encoding 56 documents...

100%|██████████| 2/2 [00:00<00:00,  7.91it/s]
2025-02-04 15:29:59.549 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:29:59.551 | INFO     | __main__:process_document:29 - Question: What type of architecture does the model use? -A: decoder only -B: encoder only -C: encoder-decoder
2025-02-04 15:29:59.569 | INFO     | __main__:process_document:32 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi

Shapes:
encodings: torch.Size([56, 508, 128])
doc_masks: torch.Size([56, 508])
Documents encoded!

2025-02-04 15:30:00.910 | INFO     | __main__:process_document:37 - C
2025-02-04 15:30:00.913 | INFO     | __main__:process_document:29 - Question: How many layers compose the encoder?
2025-02-04 15:30:00.929 | INFO     | __main__:process_document:32 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi
2025-02-04 15:30:01.941 | INFO     | __main__:process_document:37 - 6
2025-02-04 15:30:01.943 | INFO     | __main__:process_document:29 - Question: How many layers compose the decoder?
2025-02-04 15:30:01.961 | INFO     | __main__:process_document:32 - To facilitate these residual connections, all sub-layers in the model, as well as th

Encoding 137 documents...

100%|██████████| 5/5 [00:00<00:00,  9.22it/s]
2025-02-04 15:31:22.466 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:31:22.468 | INFO     | __main__:process_document:29 - Question: Does LoRA work with any neural network containing dense layers?
2025-02-04 15:31:22.484 | INFO     | __main__:process_document:32 - More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off betw

Shapes:
encodings: torch.Size([137, 508, 128])
doc_masks: torch.Size([137, 508])
Documents encoded!

2025-02-04 15:31:23.246 | INFO     | __main__:process_document:37 - YES
2025-02-04 15:31:23.247 | INFO     | __main__:process_document:29 - Question: By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x
2025-02-04 15:31:23.265 | INFO     | __main__:process_document:32 - Using GPT-3 175B as an example – deploying indepen-
dent instances of ﬁne-tuned models, each with 17
2025-02-04 15:31:24.581 | INFO     | __main__:process_document:37 - C
2025-02-04 15:31:24.582 | INFO     | __main__:process_document:29 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-02-04 15:31:24.600 | INFO     | __main__:process_document:32 - The

Encoding 199 documents...

100%|██████████| 7/7 [00:00<00:00,  7.52it/s]
2025-02-04 15:31:30.253 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:31:30.256 | INFO     | __main__:process_document:29 - Question: Is Arithmetic reasoning is a task that language models often find very easy?
2025-02-04 15:31:30.276 | INFO     | __main__:process_document:32 - 3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which mea

Shapes:
encodings: torch.Size([199, 508, 128])
doc_masks: torch.Size([199, 508])
Documents encoded!

2025-02-04 15:31:31.138 | INFO     | __main__:process_document:37 - NO
2025-02-04 15:31:31.141 | INFO     | __main__:process_document:29 - Question: How many large language models were evaluated?
2025-02-04 15:31:31.169 | INFO     | __main__:process_document:32 - For AQuA, we used four exemplars
and solutions from the training set, as given in Appendix Table 21.
2025-02-04 15:31:32.586 | INFO     | __main__:process_document:37 - Five
2025-02-04 15:31:32.588 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2025-02-04 15:32:32.589 | INFO     | __main__:process_document:29 - Question: How many benchmarks were used to evaluate arithmetic reasoning?
2025-02-04

Encoding 44 documents...

100%|██████████| 2/2 [00:00<00:00,  8.95it/s]
2025-02-04 15:32:42.648 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:32:42.650 | INFO     | __main__:process_document:29 - Question: Can recurrent networks also be converted to decision trees?
2025-02-04 15:32:42.669 | INFO     | __main__:process_document:32 - Therefore, for continuous activations, the neural
network equivalent tree immediately becomes inﬁnit

Shapes:
encodings: torch.Size([44, 508, 128])
doc_masks: torch.Size([44, 508])
Documents encoded!

2025-02-04 15:32:43.884 | INFO     | __main__:process_document:37 - YES
2025-02-04 15:32:43.886 | INFO     | __main__:process_document:29 - Question: How many layers are in the toy model (y = x^2)?
2025-02-04 15:32:43.904 | INFO     | __main__:process_document:32 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq. 17, ci^ZT
i=a(t)^VT
ci^Wi.As one can observe from
Eq
2025-02-04 15:32:45.093 | INFO     | __main__:process_document:37 - 3
2025-02-04 15:32:45.094 | INFO     | __main__:process_document:29 - Question: Does the toy model (y = x^2) use Sigmoid activation function?
2025-02-04 15:32:45.112 | INFO     | __main__:process_document:32 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq

Encoding 144 documents...

100%|██████████| 5/5 [00:00<00:00,  8.78it/s]
2025-02-04 15:33:54.219 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:33:54.221 | INFO     | __main__:process_document:29 - Question: What proportion of the pre-training data was from Github? -A: 4.5% -B: 15.0% -C: 4%
2025-02-04 15:33:54.239 | INFO     | __main__:process_document:32 - Finally, we deduplicate the result-
ing dataset at the ﬁle level, with exact matches.
Wikipedia [4.5

Shapes:
encodings: torch.Size([144, 508, 128])
doc_masks: torch.Size([144, 508])
Documents encoded!

2025-02-04 15:33:55.379 | INFO     | __main__:process_document:37 - A
2025-02-04 15:33:55.381 | INFO     | __main__:process_document:29 - Question: How many languages did the Wikipedia data cover?
2025-02-04 15:33:55.398 | INFO     | __main__:process_document:32 - This process deduplicates the data at the
line level, performs language identiﬁcation with
a fastTex
2025-02-04 15:33:56.562 | INFO     | __main__:process_document:37 - 20
2025-02-04 15:33:56.563 | INFO     | __main__:process_document:29 - Question: What optimizer was used for training? -A: AdamW -B: Adam -C: SGD
2025-02-04 15:33:56.581 | INFO     | __main__:process_document:32 - To improve the
training stability, we norma

Encoding 168 documents...

100%|██████████| 6/6 [00:00<00:00,  9.11it/s]
2025-02-04 15:35:10.918 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:35:10.919 | INFO     | __main__:process_document:29 - Question: Which of the following is not considered a limitation of the Large Language Models? -A: Hallucination -B: Explainability -C: Memorization
2025-02-04 15:35:10.936 | INFO     | __main__:process_document:32 - It can help you
to compare the capabilities of a large number of language models. Here are some of

Shapes:
encodings: torch.Size([168, 508, 128])
doc_masks: torch.Size([168, 508])
Documents encoded!

2025-02-04 15:35:11.950 | INFO     | __main__:process_document:37 - I need more info
2025-02-04 15:35:11.952 | INFO     | __main__:process_document:29 - Question: Can LLMs be used as an alternative to visiting a doctor?
2025-02-04 15:35:11.969 | INFO     | __main__:process_document:32 - LLMs, in particular,
rely heavily on computational power both during their training phase and then
2025-02-04 15:35:12.982 | INFO     | __main__:process_document:37 - NO
2025-02-04 15:35:12.985 | INFO     | __main__:process_document:29 - Question: Which of the following is NOT mentioned as a relevant legal or regulatory provision regarding the use of AI technologies? -A: UK data protection law -B: The Online Safety Act -C: Digital, Data and Technol

Encoding 143 documents...

100%|██████████| 5/5 [00:00<00:00,  8.54it/s]
2025-02-04 15:35:19.149 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:35:19.151 | INFO     | __main__:process_document:29 - Question: According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT license
2025-02-04 15:35:19.176 | INFO     | __main__:process_document:32 - One of these options is open access.
The basic idea of open access is that it makes
copyrightable w

Shapes:
encodings: torch.Size([143, 508, 128])
doc_masks: torch.Size([143, 508])
Documents encoded!

2025-02-04 15:35:20.290 | INFO     | __main__:process_document:37 - B
2025-02-04 15:35:20.292 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2025-02-04 15:36:20.294 | INFO     | __main__:process_document:29 - Question: how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? -A: Over 10,000 -B: Over 20,000 -C: Exactly 30,000
2025-02-04 15:36:20.313 | INFO     | __main__:process_document:32 - For authors of articles, a good
place to start is the Directory of Open Access Journals (“DOAJ”), a
2025-02-04 15:36:21.706 | INFO     | __main__:process_document:37 - A
2025-02-04 15:36:21.708 | INFO     | __main__:process_document:29

Encoding 364 documents...

100%|██████████| 12/12 [00:01<00:00,  9.77it/s]
2025-02-04 15:36:35.031 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:36:35.035 | INFO     | __main__:process_document:29 - Question: Which type of water must be supplied in a toilet sink? -A: hot -B: cold -C: hot and cold
2025-02-04 15:36:35.054 | INFO     | __main__:process_document:32 - Welfare facilities for areas other than offices (restaurants, k itchens, conference rooms, daycare

Shapes:
encodings: torch.Size([364, 508, 128])
doc_masks: torch.Size([364, 508])
Documents encoded!

2025-02-04 15:36:36.220 | INFO     | __main__:process_document:37 - B
2025-02-04 15:36:36.221 | INFO     | __main__:process_document:29 - Question: In which type of parkings must a carbon monoxide detector be installed? -A: indoor -B: underground -C: indoor or underground
2025-02-04 15:36:36.239 | INFO     | __main__:process_document:32 - I.2.8.    GAS DETECTION AND VE NTING
The requirements outlined below must be met in addition to th
2025-02-04 15:36:37.531 | INFO     | __main__:process_document:37 - C
2025-02-04 15:36:37.533 | INFO     | __main__:process_document:29 - Question: What percentage is the daylight factor required for façades with exterior obstructions? -A: 0.7% -B: 80% -C: 0.77%
2025-02-04 15:36:37.564 |

Encoding 1803 documents...

100%|██████████| 57/57 [00:10<00:00,  5.55it/s]
2025-02-04 15:37:08.221 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:37:08.222 | INFO     | __main__:process_document:24 - Waiting for 60 seconds

Shapes:
encodings: torch.Size([1803, 508, 128])
doc_masks: torch.Size([1803, 508])
Documents encoded!

2025-02-04 15:38:08.225 | INFO     | __main__:process_document:29 - Question: What is the maximum number of threads within a thread block?
2025-02-04 15:38:08.250 | INFO     | __main__:process_document:32 - On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by
2025-02-04 15:38:10.222 | INFO     | __main__:process_document:37 - 1024
2025-02-04 15:38:10.223 | INFO     | __main__:process_document:29 - Question: Can you identify a thread with a four-dimensional index?
2025-02-04 15:38:10.250 | INFO     | __main__:process_document:32 - This provides a natural way
to invoke computation across the elements in a domain such as a vector,
2025-02-04 15:38:11.590 | INFO     |

Encoding 17 documents...

100%|██████████| 1/1 [00:00<00:00, 13.42it/s]
2025-02-04 15:39:27.827 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:39:27.831 | INFO     | __main__:process_document:29 - Question: How many chapters does the game last?
2025-02-04 15:39:27.847 | INFO     | __main__:process_document:32 - 1
Will you play as the Fellowship of the Ring to defend the free races and destroy the One Ring?
O

Shapes:
encodings: torch.Size([17, 508, 128])
doc_masks: torch.Size([17, 508])
Documents encoded!

2025-02-04 15:39:29.014 | INFO     | __main__:process_document:37 - 3
2025-02-04 15:39:29.015 | INFO     | __main__:process_document:29 - Question: How many victory conditions are there?
2025-02-04 15:39:29.033 | INFO     | __main__:process_document:32 - Conquering Middle-earth
If you are present in all 7 regions (with a Fortress and/or at least 1 Unit)
2025-02-04 15:39:30.149 | INFO     | __main__:process_document:37 - 3
2025-02-04 15:39:30.151 | INFO     | __main__:process_document:29 - Question: How many different races are there?
2025-02-04 15:39:30.168 | INFO     | __main__:process_document:32 - •
 Re
veal tokens to both players. There is no hidden information.Bonus spaces
2 ma

Encoding 48 documents...

100%|██████████| 2/2 [00:00<00:00,  9.87it/s]
2025-02-04 15:40:50.170 | INFO     | __main__:process_document:19 - Predicting
2025-02-04 15:40:50.171 | INFO     | __main__:process_document:24 - Waiting for 60 seconds

Shapes:
encodings: torch.Size([48, 508, 128])
doc_masks: torch.Size([48, 508])
Documents encoded!

2025-02-04 15:41:50.175 | INFO     | __main__:process_document:29 - Question: What is the maximum number of cards a player may acquire during the lookout phase?
2025-02-04 15:41:50.192 | INFO     | __main__:process_document:32 - 13. At the beginning of the game, each player draws 5 cards from their
Clan deck, looks through the
2025-02-04 15:41:51.507 | INFO     | __main__:process_document:37 - 4
2025-02-04 15:41:51.509 | INFO     | __main__:process_document:29 - Question: Is there a limit to the number of cards a player may have in their hand?
2025-02-04 15:41:51.526 | INFO     | __main__:process_document:32 - NOTE 1:  There’s no limit to the number of cards a player may have
in their hand.
NOTE 2:  If the
2025-02-04 15:4

In [18]:
results = pd.read_csv("results.csv")
for index, result in results.iterrows():
    results.loc[index, "pred_answer"] = result["pred_answer"].strip()
    if result["pred_answer"].startswith(
        (f"-{result['answer']}", f"{result['answer']}")
    ):
        results.loc[index, "pred_answer"] = result["answer"]
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | type | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|---|
| 10 | 10 | https://arxiv.org/pdf/1706.03762 | Scientific Paper | 5.4 Regularization | What was the dropout rate used for the base mo... | 0.1 | 0. 1 | |
| 26 | 26 | https://authorsalliance.org/wp-content/uploads... | Techincal Documentation | CHAPTER 5: WHERE DO YOU WANT TO MAKE YOUR WORK... | Are Gold Open Access and Green Open Access mut... | NO | YES | |
| 28 | 28 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 3.1 Experimental Setup | How many large language models were evaluated? | 5 | FIVE | |
| 29 | 29 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 3.1 Experimental Setup | How many benchmarks were used to evaluate arit... | 5 | FIVE | |
| 33 | 33 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 3.4 Robustness of Chain of Thought | How many annotators provided independent chain... | 3 | THREE | |
| 34 | 34 | https://arxiv.org/pdf/2201.11903 | Scientific Report | 3.2 Results | How many random samples were examined to under... | 100 | 50 | |
| 37 | 37 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CARD AND TILE EFFECTS | How many different races are there? | 6 | I NEED MORE INFO | |
| 42 | 42 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |
| 45 | 45 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CARD AND TILE EFFECTS | Which type of cards provide coins? -A: Gray -B... | B | I NEED MORE INFO | |
| 57 | 57 | https://github.com/mozilla-ai/structured-qa/re... | Board Game | CLEANUP PHASE | Is there a cleanup phase in the final round? | NO | YES | |
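
Several of the mismatches listed above differ from the reference only in form (`FIVE` vs `5`, `THREE` vs `3`, `0. 1` vs `0.1`). The sketch below shows one possible normalization step that would treat those as equal. It is a hypothetical post-processing helper, not part of the benchmark, it was not applied to the results reported here, and applying it would change the reported accuracy. The word list is an assumption and only covers zero to ten.

```python
import re

# Assumed mapping; extend as needed for larger spelled-out numbers.
NUMBER_WORDS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}


def normalize_answer(raw: str) -> str:
    """Upper-case and trim an answer, then canonicalize simple numbers."""
    text = raw.strip().upper()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    # Drop whitespace inside numbers such as "0. 1".
    compact = re.sub(r"\s+", "", text)
    if re.fullmatch(r"-?\d+(\.\d+)?", compact):
        return compact
    return text


for raw in ["FIVE", "THREE", "0. 1", "I NEED MORE INFO"]:
    print(repr(raw), "->", repr(normalize_answer(raw)))
```

Answers that are wrong in substance, such as `50` against a reference of `100`, are left untouched by this step.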


In [19]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.8640776699029126
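
The score is the share of rows where `pred_answer` exactly equals `answer`. As a quick sanity check on the printed float, the standard library can recover the simplest fraction consistent with it. The bound of 1000 on the denominator is an assumption, and since the number of benchmark rows is not shown above, reading the result as "correct out of total" is an inference rather than something the notebook states.

```python
from fractions import Fraction

reported_accuracy = 0.8640776699029126

# Closest fraction whose denominator does not exceed 1000 (assumed bound).
simplest = Fraction(reported_accuracy).limit_denominator(1000)
print(simplest)  # 89/103
```

If the benchmark has 103 questions, this corresponds to 89 exact matches and 14 mismatches.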