# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## GPU Check

First, you'll need to enable GPUs for the notebook:

- Navigate to `Edit`→`Notebook Settings`
- Select T4 GPU from the Hardware Accelerator section
- Click `Save` and accept.

Next, we'll confirm that we can connect to the GPU:

In [1]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("GPU not available")
else:
    print("GPU is available!")

GPU is available!


## Installing dependencies

In [2]:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl

In [3]:
%pip install ragatouille PyPDF2

Collecting ragatouille
  Downloading ragatouille-0.0.8.post4-py3-none-any.whl.metadata (15 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu<2.0.0,>=1.7.4 (from ragatouille)
  Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index>=0.7 (from ragatouille)
  Downloading llama_index-0.12.14-py3-none-any.whl.metadata (12 kB)
Collecting onnx<2.

In [4]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-mwesk5v0
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-mwesk5v0
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit d12fa7202729134b9f88707e1dd1dfaae886a670
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev90+gd12fa72)
  Downloading fire-0.7.0.tar.gz (87 kB)

In [5]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-01-30 12:52:01--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21734 (21K) [text/plain]
Saving to: ‘structured_qa.csv’


2025-01-30 12:52:03 (24.2 MB/s) - ‘structured_qa.csv’ saved [21734/21734]



# Setup

In [6]:
import os

os.environ["LOGURU_LEVEL"] = "INFO"

In [7]:
from loguru import logger

In [8]:
import PyPDF2


def load_pdf(pdf_file: str) -> str | None:
    try:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() for page in pdf_reader.pages)
    except Exception as e:
        logger.exception(e)
        return None

## Function to process all questions for a single document

In [9]:
from ragatouille import RAGPretrainedModel
from ragatouille.data import CorpusProcessor


ANSWER_WITH_TYPE_PROMPT = """
You are a rigorous assistant answering questions.
You only answer based on the current information available.
The current information available is:

```
{CURRENT_INFO}
```

The answer must return ONLY one of the following strings and nothing else:
- YES/NO (for boolean questions)
Is the model an LLM?
YES
- Number (for numeric questions)
How many layers does the model have?
12
- Single letter (for multiple-choice questions)
What is the activation function used in the model? -A: ReLU -B: Sigmoid -C: Tanh
C
"""


def process_document(
    document_file,
    document_data,
    model,
):
    logger.info("Setting up RAG")
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    corpus_processor = CorpusProcessor()
    documents = corpus_processor.process_corpus([load_pdf(document_file)])
    RAG.encode([x["content"] for x in documents])

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        question = row["question"]
        question_part, *options = question.split("?")

        logger.info(f"Question: {question}")
        results = RAG.search_encoded_docs(query=question_part, k=3)
        current_info = "\n".join(result["content"] for result in results)
        logger.info(current_info[:100])

        messages = [
            {
                "role": "system",
                "content": ANSWER_WITH_TYPE_PROMPT.format(CURRENT_INFO=current_info),
            },
            {"role": "user", "content": question},
        ]
        answer = model.get_response(messages)
        logger.info(answer)
        answers[index] = answer
        sections[index] = None

    return answers, sections
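
The question handling inside `process_document` can be tried in isolation. The sketch below is a minimal illustration, not the notebook's actual run: `PROMPT_TEMPLATE` and the two `results` passages are made-up stand-ins for `ANSWER_WITH_TYPE_PROMPT` and the retrieval output. It shows that `question.split("?")` keeps only the stem (without the question mark) for retrieval, while the full question, options included, is still sent to the model.

```python
# Shortened stand-in for ANSWER_WITH_TYPE_PROMPT (assumption, for illustration only).
PROMPT_TEMPLATE = "The current information available is:\n{CURRENT_INFO}"

question = (
    "What type of architecture does the model use? "
    "-A: decoder only -B: encoder only -C: encoder-decoder"
)

# Text before the first "?" is the stem; whatever follows holds the options.
question_part, *options = question.split("?")

# Made-up retrieval results, shaped like the dicts read in process_document.
results = [{"content": "passage one"}, {"content": "passage two"}]
current_info = "\n".join(result["content"] for result in results)

messages = [
    {"role": "system", "content": PROMPT_TEMPLATE.format(CURRENT_INFO=current_info)},
    {"role": "user", "content": question},
]

print(question_part)
print(options)
```

For a question without options, such as `"How many layers compose the encoder?"`, the same split leaves `options` as a list holding one empty string.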

## Load Model

In [10]:
from structured_qa.model_loaders import load_llama_cpp_model

In [11]:
model = load_llama_cpp_model(
    "bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Qwen2.5-7B-Instruct-Q8_0.gguf:   0%|          | 0.00/8.10G [00:00<?, ?B/s]
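
The only thing `process_document` asks of the loaded model is a `get_response(messages)` method. For a dry run without downloading the GGUF file, a stand-in along these lines may be enough. `StubModel` is a hypothetical helper written for this sketch; it is not part of `structured_qa`.

```python
class StubModel:
    """Hypothetical stand-in exposing the get_response(messages) call."""

    def __init__(self, canned_answer: str = "YES") -> None:
        self.canned_answer = canned_answer
        self.calls: list[list[dict[str, str]]] = []

    def get_response(self, messages: list[dict[str, str]]) -> str:
        # Record the messages so the prompts can be inspected afterwards.
        self.calls.append(messages)
        return self.canned_answer


stub = StubModel("NO")
reply = stub.get_response([{"role": "user", "content": "Is the model an LLM?"}])
print(reply)
```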

# Run Benchmark

In [12]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)
for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    if not downloaded_document.exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-30 12:56:33.700 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-01-30 12:56:33.792 | INFO     | __main__:<cell line: 0>:11 - Downloading document https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
2025-01-30 12:56:34.296 | INFO     | __main__:<cell line: 0>:15 - Downloaded https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf to HAI_AI-Index-Report-2024.pdf.pdf
2025-01-30 12:56:34.300 | INFO     | __main__:process_document:32 - Setting up RAG



  self.scaler = torch.cuda.amp.GradScaler()


Encoding 1214 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 38/38 [00:06<00:00,  5.99it/s]
2025-01-30 12:57:37.060 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:57:37.065 | INFO     | __main__:process_document:45 - Question: which type of risk was identified as the leading concern globally? -A: Fairness risks. -B: Privacy and data governance risks. -C: Risks related to generative AI deployment.


Shapes:
encodings: torch.Size([1214, 508, 128])
doc_masks: torch.Size([1214, 508])
Documents encoded!


2025-01-30 12:57:37.300 | INFO     | __main__:process_document:48 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to
2025-01-30 12:57:38.613 | INFO     | __main__:process_document:58 - -B
2025-01-30 12:57:38.616 | INFO     | __main__:process_document:45 - Question: In which geographical area were fairness risks selected by the smallest percentage of respondents? -A: Asia. -B: Europe. -C: North America.
2025-01-30 12:57:38.649 | INFO     | __main__:process_document:48 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to
2025-01-30 12:57:39.338 | INFO     | __main__:process_document:58 - C
2025-01-30 12:57:39.340 | INFO

Encoding 56 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  8.10it/s]
2025-01-30 12:57:51.991 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:57:51.992 | INFO     | __main__:process_document:45 - Question: What type of architecture does the model use? -A: decoder only -B: encoder only -C: encoder-decoder
2025-01-30 12:57:52.010 | INFO     | __main__:process_document:48 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi


Shapes:
encodings: torch.Size([56, 508, 128])
doc_masks: torch.Size([56, 508])
Documents encoded!


2025-01-30 12:57:53.315 | INFO     | __main__:process_document:58 - C
2025-01-30 12:57:53.316 | INFO     | __main__:process_document:45 - Question: How many layers compose the encoder?
2025-01-30 12:57:53.334 | INFO     | __main__:process_document:48 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi
2025-01-30 12:57:54.041 | INFO     | __main__:process_document:58 - 6
2025-01-30 12:57:54.042 | INFO     | __main__:process_document:45 - Question: How many layers compose the decoder?
2025-01-30 12:57:54.060 | INFO     | __main__:process_document:48 - To facilitate these residual connections, all sub-layers in the model, as well as the

Encoding 137 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  8.77it/s]
2025-01-30 12:58:11.465 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:58:11.466 | INFO     | __main__:process_document:45 - Question: Does LoRA work with any neural network containing dense layers?
2025-01-30 12:58:11.486 | INFO     | __main__:process_document:48 - More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off betw


Shapes:
encodings: torch.Size([137, 508, 128])
doc_masks: torch.Size([137, 508])
Documents encoded!


2025-01-30 12:58:12.422 | INFO     | __main__:process_document:58 - YES
2025-01-30 12:58:12.424 | INFO     | __main__:process_document:45 - Question: By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x
2025-01-30 12:58:12.441 | INFO     | __main__:process_document:48 - Using GPT-3 175B as an example – deploying indepen-
dent instances of ﬁne-tuned models, each with 17
2025-01-30 12:58:13.325 | INFO     | __main__:process_document:58 - C
2025-01-30 12:58:13.327 | INFO     | __main__:process_document:45 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-01-30 12:58:13.352 | INFO     | __main__:process_document:48 - The

Encoding 199 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 7/7 [00:00<00:00,  7.78it/s]
2025-01-30 12:58:19.802 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:58:19.805 | INFO     | __main__:process_document:45 - Question: Is Arithmetic reasoning is a task that language models often find very easy?
2025-01-30 12:58:19.824 | INFO     | __main__:process_document:48 - 3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which mea


Shapes:
encodings: torch.Size([199, 508, 128])
doc_masks: torch.Size([199, 508])
Documents encoded!


2025-01-30 12:58:20.707 | INFO     | __main__:process_document:58 - NO
2025-01-30 12:58:20.709 | INFO     | __main__:process_document:45 - Question: How many large language models were evaluated?
2025-01-30 12:58:20.727 | INFO     | __main__:process_document:48 - For AQuA, we used four exemplars
and solutions from the training set, as given in Appendix Table 21.
2025-01-30 12:58:21.634 | INFO     | __main__:process_document:58 - 5
2025-01-30 12:58:21.636 | INFO     | __main__:process_document:45 - Question: How many benchmarks were used to evaluate arithmetic reasoning?
2025-01-30 12:58:21.655 | INFO     | __main__:process_document:48 - 3 Arithmetic Reasoning
We begin by considering

Encoding 44 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  9.42it/s]
2025-01-30 12:58:29.318 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:58:29.319 | INFO     | __main__:process_document:45 - Question: Can recurrent networks also be converted to decision trees?
2025-01-30 12:58:29.336 | INFO     | __main__:process_document:48 - Therefore, for continuous activations, the neural
network equivalent tree immediately becomes inﬁnit


Shapes:
encodings: torch.Size([44, 508, 128])
doc_masks: torch.Size([44, 508])
Documents encoded!


2025-01-30 12:58:30.073 | INFO     | __main__:process_document:58 - YES
2025-01-30 12:58:30.074 | INFO     | __main__:process_document:45 - Question: How many layers are in the toy model (y = x^2)?
2025-01-30 12:58:30.093 | INFO     | __main__:process_document:48 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq. 17, ci^ZT
i=a(t)^VT
ci^Wi.As one can observe from
Eq
2025-01-30 12:58:30.952 | INFO     | __main__:process_document:58 - 3
2025-01-30 12:58:30.954 | INFO     | __main__:process_document:45 - Question: Does the toy model (y = x^2) use Sigmoid activation function?
2025-01-30 12:58:30.971 | INFO     | __main__:process_document:48 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq.

Encoding 143 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  7.92it/s]
2025-01-30 12:58:38.771 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:58:38.773 | INFO     | __main__:process_document:45 - Question: According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT license
2025-01-30 12:58:38.792 | INFO     | __main__:process_document:48 - One of these options is open access.
The basic idea of open access is that it makes
copyrightable w


Shapes:
encodings: torch.Size([143, 508, 128])
doc_masks: torch.Size([143, 508])
Documents encoded!


2025-01-30 12:58:39.709 | INFO     | __main__:process_document:58 - B
2025-01-30 12:58:39.711 | INFO     | __main__:process_document:45 - Question: how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? -A: Over 10,000 -B: Over 20,000 -C: Exactly 30,000
2025-01-30 12:58:39.736 | INFO     | __main__:process_document:48 - For authors of articles, a good
place to start is the Directory of Open Access Journals (“DOAJ”), a
2025-01-30 12:58:40.619 | INFO     | __main__:process_document:58 - A
2025-01-30 12:58:40.621 | INFO     | __main__:process_document:45 - Question: Does open access eliminate price barriers?
2025-01-30 12:58:40.649 | INFO     | __main__:

Encoding 364 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 12/12 [00:01<00:00,  9.08it/s]
2025-01-30 12:58:54.973 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:58:54.975 | INFO     | __main__:process_document:45 - Question: Which type of water must be supplied in a toilet sink? -A: hot -B: cold -C: hot and cold
2025-01-30 12:58:54.994 | INFO     | __main__:process_document:48 - Welfare facilities for areas other than offices (restaurants, k itchens, conference rooms, daycare


Shapes:
encodings: torch.Size([364, 508, 128])
doc_masks: torch.Size([364, 508])
Documents encoded!


2025-01-30 12:58:55.940 | INFO     | __main__:process_document:58 - B
2025-01-30 12:58:55.941 | INFO     | __main__:process_document:45 - Question: In which type of parkings must a carbon monoxide detector be installed? -A: indoor -B: underground -C: indoor or underground
2025-01-30 12:58:55.961 | INFO     | __main__:process_document:48 - I.2.8.    GAS DETECTION AND VE NTING
The requirements outlined below must be met in addition to th
2025-01-30 12:58:56.858 | INFO     | __main__:process_document:58 - C
2025-01-30 12:58:56.859 | INFO     | __main__:process_document:45 - Question: What percentage is the daylight factor required for façades with exterior obstructions? -A: 0.7% -B: 80% -C: 0.77%
2025-01-30 12:58:56.879 |

Encoding 1803 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 57/57 [00:10<00:00,  5.36it/s]
2025-01-30 12:59:29.178 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:59:29.180 | INFO     | __main__:process_document:45 - Question: What is the maximum number of threads within a thread block?
2025-01-30 12:59:29.204 | INFO     | __main__:process_document:48 - On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by


Shapes:
encodings: torch.Size([1803, 508, 128])
doc_masks: torch.Size([1803, 508])
Documents encoded!


2025-01-30 12:59:30.270 | INFO     | __main__:process_document:58 - 1024
2025-01-30 12:59:30.275 | INFO     | __main__:process_document:45 - Question: Can you identify a thread with a four-dimensional index?
2025-01-30 12:59:30.304 | INFO     | __main__:process_document:48 - This provides a natural way
to invoke computation across the elements in a domain such as a vector,
2025-01-30 12:59:31.053 | INFO     | __main__:process_document:58 - NO
2025-01-30 12:59:31.057 | INFO     | __main__:process_document:45 - Question: In the offline compilation process using nvcc, what happens to the device code? -A: It is directly executed on the host CPU. -B:  It is transformed into assembly and/or binary form. -C:  It is ignored and not used in t

Encoding 754 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 24/24 [00:03<00:00,  7.64it/s]
2025-01-30 12:59:52.140 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 12:59:52.141 | INFO     | __main__:process_document:45 - Question: what is a requirement for datasets used in high-risk AI systems? -A: Exclusively open-source datasets -B: Datasets ensuring quality and diversity -C: Datasets not exceeding 1 GB in size
2025-01-30 12:59:52.162 | INFO     | __main__:process_document:48 - In particular , data
sets should take into account, to the extent required by their intended purpos


Shapes:
encodings: torch.Size([754, 508, 128])
doc_masks: torch.Size([754, 508])
Documents encoded!


2025-01-30 12:59:52.977 | INFO     | __main__:process_document:58 - B
2025-01-30 12:59:52.979 | INFO     | __main__:process_document:45 - Question: What is the threshold, measured in floating point operations, that leads to a presumption that a general-purpose AI model has systemic risk? -A: 10^1 -B: 10^20 -C: 10^25
2025-01-30 12:59:52.999 | INFO     | __main__:process_document:48 - The full range of capabilities in a model could be better
understo od after its placing on the mark
2025-01-30 12:59:53.889 | INFO     | __main__:process_document:58 - B
2025-01-30 12:59:53.890 | INFO     | __main__:process_document:45 - Question: What should providers of AI systems that generate synthetic content ensure? -A: That the content is not marke

Encoding 17 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 1/1 [00:00<00:00, 13.44it/s]
2025-01-30 13:00:05.740 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 13:00:05.745 | INFO     | __main__:process_document:45 - Question: How many chapters does the game last?
2025-01-30 13:00:05.761 | INFO     | __main__:process_document:48 - 1
Will you play as the Fellowship of the Ring to defend the free races and destroy the One Ring?
O


Shapes:
encodings: torch.Size([17, 508, 128])
doc_masks: torch.Size([17, 508])
Documents encoded!


2025-01-30 13:00:06.454 | INFO     | __main__:process_document:58 - 3
2025-01-30 13:00:06.455 | INFO     | __main__:process_document:45 - Question: How many victory conditions are there?
2025-01-30 13:00:06.474 | INFO     | __main__:process_document:48 - Conquering Middle-earth
If you are present in all 7 regions (with a Fortress and/or at least 1 Unit)
2025-01-30 13:00:07.093 | INFO     | __main__:process_document:58 - 3
2025-01-30 13:00:07.095 | INFO     | __main__:process_document:45 - Question: How many different races are there?
2025-01-30 13:00:07.114 | INFO     | __main__:process_document:48 - •
 Re
veal tokens to both players. There is no hidden information.Bonus spaces
2 ma

Encoding 48 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  9.80it/s]
2025-01-30 13:00:25.742 | INFO     | __main__:process_document:38 - Predicting
2025-01-30 13:00:25.743 | INFO     | __main__:process_document:45 - Question: What is the maximum number of cards a player may acquire during the lookout phase?
2025-01-30 13:00:25.762 | INFO     | __main__:process_document:48 - 13. At the beginning of the game, each player draws 5 cards from their
Clan deck, looks through the


Shapes:
encodings: torch.Size([48, 508, 128])
doc_masks: torch.Size([48, 508])
Documents encoded!


2025-01-30 13:00:26.628 | INFO     | __main__:process_document:58 - 4
2025-01-30 13:00:26.630 | INFO     | __main__:process_document:45 - Question: Is there a limit to the number of cards a player may have in their hand?
2025-01-30 13:00:26.649 | INFO     | __main__:process_document:48 - NOTE 1:  There’s no limit to the number of cards a player may have
in their hand.
NOTE 2:  If the
2025-01-30 13:00:27.506 | INFO     | __main__:process_document:58 - NO
2025-01-30 13:00:27.508 | INFO     | __main__:process_document:45 - Question: Can you raid the locations of a player that has passed during the action phase?
2025-01-30 13:00:27.532 | INFO     | __main__:process_document:48 - Start
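
The log shows the first document being saved as `HAI_AI-Index-Report-2024.pdf.pdf`. The sketch below reproduces the naming rule from the benchmark loop on two links taken from this notebook, which suggests why the suffix is appended unconditionally: arXiv links carry no extension. `local_name` is a hypothetical alternative written for this sketch, not code from the repository.

```python
from pathlib import Path

links = [
    "https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf",
    "https://arxiv.org/pdf/1706.03762",
]

# Naming rule used in the benchmark loop: always append ".pdf".
names = [f"{Path(link).name}.pdf" for link in links]
print(names)


def local_name(link: str) -> str:
    """Hypothetical variant that appends ".pdf" only when it is missing."""
    name = Path(link).name
    return name if name.endswith(".pdf") else f"{name}.pdf"


print([local_name(link) for link in links])
```

Either rule gives each document a stable local name, so the `exists()` check can skip files that were already downloaded.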

In [13]:
results = pd.read_csv("results.csv")
for index, result in results.iterrows():
    if result["pred_answer"].startswith(
        (f"-{result['answer']}", f"{result['answer']}")
    ):
        results.loc[index, "pred_answer"] = result["answer"]
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 10 | 10 | https://arxiv.org/pdf/1706.03762 | 5.4 Regularization | What was the dropout rate used for the base mo... | 0.1 | PDROP= 0.1 | |
| 22 | 22 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Classification of general-purpose AI models as... | What is the threshold, measured in floating po... | C | B | |
| 28 | 28 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Compliant AI systems which present a risk | What is the time period for a market surveilla... | C | A | |
| 44 | 44 | https://arxiv.org/pdf/2201.11903 | 3.2 Results | How many random samples were examined to under... | 100 | 50 | |
| 47 | 47 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | How many different races are there? | 6 | 3 | |
| 51 | 51 | https://github.com/mozilla-ai/structured-qa/re... | CHAPTER OVERVIEW | After taking a landmark tile, do you reveal a ... | NO | YES | |
| 52 | 52 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |
| 53 | 53 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE COSTS | If a player is missing 2 skill symbols, how ma... | 2 | NO\n- SINGLE LETTER (FOR MULTIPLE-CHOICE QUEST... | |
| 55 | 55 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | Which type of cards provide coins? -A: Gray -B... | B | A | |
| 65 | 65 | https://github.com/mozilla-ai/structured-qa/re... | EXPEDITION PHASE | Do you need a fish to conquer a distant island? | YES | NO | |
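
The normalization in the cell above accepts a prediction when it starts with the expected answer, optionally preceded by a dash (for example `-B` for `B`). The sketch below restates that check as a plain function on a few made-up pairs. `normalize` is a name introduced here for illustration and does not appear in the notebook.

```python
def normalize(pred_answer: str, answer: str) -> str:
    # Mirror the startswith check: "B..." or "-B..." counts as the answer "B".
    if pred_answer.startswith((f"-{answer}", answer)):
        return answer
    return pred_answer


# (prediction, expected answer) pairs; the values are illustrative.
pairs = [("-B", "B"), ("C", "C"), ("PDROP= 0.1", "0.1"), ("3", "6")]
normalized = [normalize(pred, answer) for pred, answer in pairs]
print(normalized)

# Prefix matching is lenient: a prediction of "10" is accepted for answer "1".
print(normalize("10", "1"))
```

Because the match is on a prefix, a numeric prediction that merely begins with the expected digits is also counted as correct, which is worth keeping in mind when reading the accuracy figure.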


In [14]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.898989898989899
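
The accuracy above is the fraction of rows whose normalized prediction equals the expected answer. A value of 0.898989... is consistent with 89 correct answers out of 99, which matches the 10 mismatched rows listed earlier. The sketch below shows the same computation on a small made-up list, without pandas.

```python
# Illustrative values, not taken from results.csv.
answers = ["YES", "NO", "B", "6"]
predictions = ["YES", "NO", "A", "6"]

# Count exact matches, then divide by the number of questions.
correct = sum(a == p for a, p in zip(answers, predictions))
accuracy = correct / len(answers)
print(accuracy)
```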