# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## GPU Check

First, you'll need to enable GPUs for the notebook:

- Navigate to `Edit`→`Notebook Settings`
- Select T4 GPU from the Hardware Accelerator section
- Click `Save` and accept.

Next, we'll confirm that we can connect to the GPU:

In [1]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("GPU not available")
else:
    print("GPU is available!")

GPU is available!


## Installing dependencies

In [2]:
%pip install ragatouille PyPDF2

Collecting ragatouille
  Downloading ragatouille-0.0.8.post4-py3-none-any.whl.metadata (15 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.7/86.7 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting faiss-cpu<2.0.0,>=1.7.4 (from ragatouille)
  Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index>=0.7 (from ragatouille)
  Downloading llama_index-0.12.14-py3-none-any.whl.metadata (12 kB)
Collecting onnx<2.0.0,>=1.15.0 (from ragatouille)
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.w

In [3]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-fztvdq23
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-fztvdq23
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit 0b8e5cf9d2db91af71478a715bfdbba1b36316fa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev77+g0b8e5cf)
  Downloading fire-0.7.0.tar.gz (87 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.2/87.2 k

In [4]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-01-29 09:01:27--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21734 (21K) [text/plain]
Saving to: ‘structured_qa.csv’


2025-01-29 09:01:27 (13.3 MB/s) - ‘structured_qa.csv’ saved [21734/21734]



# Setup

In [5]:
import os
import google.generativeai as genai
from google.colab.userdata import get, SecretNotFoundError

try:
    genai.configure(api_key=get("GOOGLE_API_KEY"))
except SecretNotFoundError as e:
    raise RuntimeError("Please set the GOOGLE_API_KEY secret to your API key") from e
os.environ["LOGURU_LEVEL"] = "INFO"

In [6]:
from loguru import logger

In [7]:
import PyPDF2


def load_pdf(pdf_file: str) -> str | None:
    try:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() for page in pdf_reader.pages)
    except Exception as e:
        logger.exception(e)
        return None

## Function to process all questions for a single document

In [8]:
import json
import time

from ragatouille import RAGPretrainedModel
from ragatouille.data import CorpusProcessor


def process_document(
    document_file,
    document_data,
    model,
):
    logger.info("Setting up RAG")
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    corpus_processor = CorpusProcessor()
    documents = corpus_processor.process_corpus([load_pdf(document_file)])
    RAG.encode([x["content"] for x in documents])

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        if model.n > 0 and model.n % 9 == 0:
            logger.info("Waiting for 60 seconds")
            time.sleep(60)
        question = row["question"]
        question_part, *options = question.split("?")

        logger.info(f"Question: {question}")
        results = RAG.search_encoded_docs(query=question_part, k=3)
        current_info = "\n".join(result["content"] for result in results)
        logger.info(current_info[:100])

        answer = model.model.generate_content(
            [f"This is the document: {current_info}", question]
        )
        logger.info(answer.text)
        answers[index] = json.loads(answer.text)["answer"]
        sections[index] = None
        model.n += 1

    return answers, sections
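
The loop in `process_document` does two small things that are easy to miss: it splits each question on `?` so that only the stem is used as the retrieval query, and it pauses once the request counter reaches a non-zero multiple of 9. The following is a minimal standalone sketch of both steps. The helper names `split_question` and `should_pause` are hypothetical and are not part of `structured_qa`; the sample question is taken from the benchmark.

```python
def split_question(question: str) -> tuple[str, list[str]]:
    """Split a benchmark question into its stem and any trailing option text."""
    # Same unpacking as in process_document: the text before the first "?"
    # is the retrieval query; whatever follows holds the answer options.
    question_part, *options = question.split("?")
    return question_part, options


def should_pause(n_requests: int, every: int = 9) -> bool:
    """Return True when the counter is a non-zero multiple of `every`."""
    return n_requests > 0 and n_requests % every == 0


stem, options = split_question(
    "What type of architecture does the model use? "
    "-A: decoder only -B: encoder only -C: encoder-decoder"
)
print(stem)     # the query sent to the retriever
print(options)  # the option text, which is not used for retrieval

# Counters at which the loop would sleep during the first 30 requests.
print([n for n in range(30) if should_pause(n)])
```

Note that a question without options still yields one (empty) element in `options`, because `str.split` returns the empty string after the trailing `?`.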

## Load Model

In [9]:
from structured_qa.model_loaders import load_gemini_model

In [10]:
SYSTEM_PROMPT = """
You are a rigorous assistant answering questions.
You must only answer based on the current information available which is:

```
{CURRENT_INFO}
```

If the current information available is not enough to answer the question,
you must return the string "I need more info" and nothing else.

If the current information is enough to answer, you must return one of the following formats:
- YES/NO (for boolean questions)
- Number (for numeric questions)
- Single letter (for multiple-choice questions)
"""

In [11]:
model = load_gemini_model("gemini-2.0-flash-exp", system_prompt=SYSTEM_PROMPT)
model.n = 0
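
The system prompt above asks for one of three answer formats (YES/NO, a number, or a single letter) or the fallback string "I need more info". As a rough illustration, a returned answer could be checked against those formats before scoring. This is only a sketch: the helper `classify_answer` and its category labels are hypothetical and do not exist in `structured_qa`.

```python
import re


def classify_answer(answer: str) -> str:
    """Label an answer with the prompt format it matches (hypothetical helper)."""
    text = answer.strip().upper()
    if text in {"YES", "NO"}:
        return "boolean"
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return "number"
    if re.fullmatch(r"[A-Z]", text):
        return "choice"
    if text == "I NEED MORE INFO":
        return "need_more_info"
    # Anything else (for example a free-text sentence) matches no format.
    return "invalid"


for sample in ["Yes", "1024", "b", "I need more info", "None of the above"]:
    print(f"{sample!r} -> {classify_answer(sample)}")
```

A check like this would flag free-text responses, which can never equal a reference answer under exact string comparison.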

# Run Benchmark

In [12]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)
for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    if not downloaded_document.exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-29 09:01:52.976 | INFO     | __main__:<cell line: 0>:6 - Loading input data
2025-01-29 09:01:53.001 | INFO     | __main__:<cell line: 0>:11 - Downloading document https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
2025-01-29 09:01:53.336 | INFO     | __main__:<cell line: 0>:15 - Downloaded https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf to HAI_AI-Index-Report-2024.pdf.pdf
2025-01-29 09:01:53.338 | INFO     | __main__:process_document:13 - Setting up RAG
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will b

artifact.metadata:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/405 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

  self.scaler = torch.cuda.amp.GradScaler()


Encoding 1214 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 38/38 [00:06<00:00,  5.93it/s]
2025-01-29 09:02:49.579 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:02:49.581 | INFO     | __main__:process_document:29 - Question: which type of risk was identified as the leading concern globally? -A: Fairness risks. -B: Privacy and data governance risks. -C: Risks related to generative AI deployment.
2025-01-29 09:02:49.751 | INFO     | __main__:process_document:32 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to


Shapes:
encodings: torch.Size([1214, 508, 128])
doc_masks: torch.Size([1214, 508])
Documents encoded!


2025-01-29 09:02:51.147 | INFO     | __main__:process_document:37 - {
  "answer": "B"
}
2025-01-29 09:02:51.148 | INFO     | __main__:process_document:29 - Question: In which geographical area were fairness risks selected by the smallest percentage of respondents? -A: Asia. -B: Europe. -C: North America.
2025-01-29 09:02:51.166 | INFO     | __main__:process_document:32 - Notably, they observe that these concerns are
significantly higher in Asia and Europe compared to
2025-01-29 09:02:52.457 | INFO     | __main__:process_document:37 - {
  "answer": "C"
}
2025-01-29 09:02:52.460 | INFO     | __main__:process_document:29 - Question: What is a major consequence of the rising training costs for foundation models? -A: The exclusion of un

Encoding 56 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  8.31it/s]
2025-01-29 09:04:09.484 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:04:09.485 | INFO     | __main__:process_document:29 - Question: What type of architecture does the model use? -A: decoder only -B: encoder only -C: encoder-decoder
2025-01-29 09:04:09.500 | INFO     | __main__:process_document:32 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi


Shapes:
encodings: torch.Size([56, 508, 128])
doc_masks: torch.Size([56, 508])
Documents encoded!


2025-01-29 09:04:10.640 | INFO     | __main__:process_document:37 - {
  "answer": "C"
}
2025-01-29 09:04:10.641 | INFO     | __main__:process_document:29 - Question: How many layers compose the encoder?
2025-01-29 09:04:10.657 | INFO     | __main__:process_document:32 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi
2025-01-29 09:04:14.969 | INFO     | __main__:process_document:37 - {
  "answer": 6
}
2025-01-29 09:04:14.970 | INFO     | __main__:process_document:29 - Question: How many layers compose the decoder?
2025-01-29 09:04:14.987 | INFO     | __main__:process_document:32 - To facilitate these residual connections, all sub-l

Encoding 137 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  9.20it/s]
2025-01-29 09:05:41.921 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:05:41.923 | INFO     | __main__:process_document:29 - Question: Does LoRA work with any neural network containing dense layers?
2025-01-29 09:05:41.939 | INFO     | __main__:process_document:32 - More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off betw


Shapes:
encodings: torch.Size([137, 508, 128])
doc_masks: torch.Size([137, 508])
Documents encoded!


2025-01-29 09:05:42.979 | INFO     | __main__:process_document:37 - {
  "answer": "Yes"
}
2025-01-29 09:05:42.981 | INFO     | __main__:process_document:29 - Question: By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x
2025-01-29 09:05:42.996 | INFO     | __main__:process_document:32 - Using GPT-3 175B as an example – deploying indepen-
dent instances of ﬁne-tuned models, each with 17
2025-01-29 09:05:44.438 | INFO     | __main__:process_document:37 - {
  "answer": "C"
}
2025-01-29 09:05:44.439 | INFO     | __main__:process_document:29 - Question: In billions, how many trainable parameters does GPT-3 have?
2025-01-29 09:05:44.456 | INFO     | __main__:process

Encoding 199 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 7/7 [00:00<00:00,  7.62it/s]
2025-01-29 09:05:54.515 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:05:54.516 | INFO     | __main__:process_document:29 - Question: Is Arithmetic reasoning is a task that language models often find very easy?
2025-01-29 09:05:54.533 | INFO     | __main__:process_document:32 - 3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which mea


Shapes:
encodings: torch.Size([199, 508, 128])
doc_masks: torch.Size([199, 508])
Documents encoded!


2025-01-29 09:05:57.435 | INFO     | __main__:process_document:37 - {
  "answer": "No"
}
2025-01-29 09:05:57.436 | INFO     | __main__:process_document:29 - Question: How many large language models were evaluated?
2025-01-29 09:05:57.453 | INFO     | __main__:process_document:32 - For AQuA, we used four exemplars
and solutions from the training set, as given in Appendix Table 21.
2025-01-29 09:05:58.894 | INFO     | __main__:process_document:37 - {
"answer": 5
}
2025-01-29 09:05:58.896 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2025-01-29 09:06:58.897 | INFO     | __main__:process_document:29 - Question: How many benchmarks were used to evaluate arithmetic rea

Encoding 44 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00, 10.13it/s]
2025-01-29 09:07:13.045 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:07:13.047 | INFO     | __main__:process_document:29 - Question: Can recurrent networks also be converted to decision trees?
2025-01-29 09:07:13.067 | INFO     | __main__:process_document:32 - Therefore, for continuous activations, the neural
network equivalent tree immediately becomes inﬁnit


Shapes:
encodings: torch.Size([44, 508, 128])
doc_masks: torch.Size([44, 508])
Documents encoded!


2025-01-29 09:07:14.509 | INFO     | __main__:process_document:37 - {
  "answer": "Yes"
}
2025-01-29 09:07:14.511 | INFO     | __main__:process_document:29 - Question: How many layers are in the toy model (y = x^2)?
2025-01-29 09:07:14.535 | INFO     | __main__:process_document:32 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq. 17, ci^ZT
i=a(t)^VT
ci^Wi.As one can observe from
Eq
2025-01-29 09:07:16.101 | INFO     | __main__:process_document:37 - {
  "answer": 3
}
2025-01-29 09:07:16.103 | INFO     | __main__:process_document:29 - Question: Does the toy model (y = x^2) use Sigmoid activation function?
2025-01-29 09:07:16.119 | INFO     | __main__:process_document:32 - o(t)=c1^ZT
1W

Encoding 143 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  8.77it/s]
2025-01-29 09:08:27.733 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:08:27.735 | INFO     | __main__:process_document:29 - Question: According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT license
2025-01-29 09:08:27.752 | INFO     | __main__:process_document:32 - One of these options is open access.
The basic idea of open access is that it makes
copyrightable w


Shapes:
encodings: torch.Size([143, 508, 128])
doc_masks: torch.Size([143, 508])
Documents encoded!


2025-01-29 09:08:29.018 | INFO     | __main__:process_document:37 - {
  "answer": "B"
}
2025-01-29 09:08:29.020 | INFO     | __main__:process_document:29 - Question: how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? -A: Over 10,000 -B: Over 20,000 -C: Exactly 30,000
2025-01-29 09:08:29.036 | INFO     | __main__:process_document:32 - For authors of articles, a good
place to start is the Directory of Open Access Journals (“DOAJ”), a
2025-01-29 09:08:30.426 | INFO     | __main__:process_document:37 - {
  "answer": "A"
}
2025-01-29 09:08:30.428 | INFO     | __main__:process_document:29 - Question: Does open access eliminate price barriers?
2025-01-29 09:08:30.446 | I

Encoding 364 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 12/12 [00:01<00:00,  9.75it/s]
2025-01-29 09:08:44.585 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:08:44.587 | INFO     | __main__:process_document:24 - Waiting for 60 seconds


Shapes:
encodings: torch.Size([364, 508, 128])
doc_masks: torch.Size([364, 508])
Documents encoded!


2025-01-29 09:09:44.590 | INFO     | __main__:process_document:29 - Question: Which type of water must be supplied in a toilet sink? -A: hot -B: cold -C: hot and cold
2025-01-29 09:09:44.608 | INFO     | __main__:process_document:32 - Welfare facilities for areas other than offices (restaurants, k itchens, conference rooms, daycare
2025-01-29 09:09:46.579 | INFO     | __main__:process_document:37 - {
  "answer": "B"
}
2025-01-29 09:09:46.581 | INFO     | __main__:process_document:29 - Question: In which type of parkings must a carbon monoxide detector be installed? -A: indoor -B: underground -C: indoor or underground
2025-01-29 09:09:46.598 | INFO     | __main__:process_document:32 - I.2.8.    GAS DETECTION AND VE NTING
The requir

Encoding 1803 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 57/57 [00:10<00:00,  5.68it/s]
2025-01-29 09:10:21.132 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:10:21.134 | INFO     | __main__:process_document:29 - Question: What is the maximum number of threads within a thread block?
2025-01-29 09:10:21.154 | INFO     | __main__:process_document:32 - On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by


Shapes:
encodings: torch.Size([1803, 508, 128])
doc_masks: torch.Size([1803, 508])
Documents encoded!


2025-01-29 09:10:22.294 | INFO     | __main__:process_document:37 - {
  "answer": 1024
}
2025-01-29 09:10:22.296 | INFO     | __main__:process_document:29 - Question: Can you identify a thread with a four-dimensional index?
2025-01-29 09:10:22.316 | INFO     | __main__:process_document:32 - This provides a natural way
to invoke computation across the elements in a domain such as a vector,
2025-01-29 09:10:23.833 | INFO     | __main__:process_document:37 - {
"answer": "No"
}
2025-01-29 09:10:23.835 | INFO     | __main__:process_document:29 - Question: In the offline compilation process using nvcc, what happens to the device code? -A: It is directly executed on the host CPU. -B:  It is transformed into assembly and/or binary form. -C: 

Encoding 754 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 24/24 [00:03<00:00,  7.94it/s]
2025-01-29 09:11:57.140 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:11:57.142 | INFO     | __main__:process_document:29 - Question: what is a requirement for datasets used in high-risk AI systems? -A: Exclusively open-source datasets -B: Datasets ensuring quality and diversity -C: Datasets not exceeding 1 GB in size
2025-01-29 09:11:57.161 | INFO     | __main__:process_document:32 - In particular , data
sets should take into account, to the extent required by their intended purpos


Shapes:
encodings: torch.Size([754, 508, 128])
doc_masks: torch.Size([754, 508])
Documents encoded!


2025-01-29 09:11:59.383 | INFO     | __main__:process_document:37 - {
  "answer": "B"
}
2025-01-29 09:11:59.384 | INFO     | __main__:process_document:29 - Question: What is the threshold, measured in floating point operations, that leads to a presumption that a general-purpose AI model has systemic risk? -A: 10^1 -B: 10^20 -C: 10^25
2025-01-29 09:11:59.404 | INFO     | __main__:process_document:32 - The full range of capabilities in a model could be better
understo od after its placing on the mark
2025-01-29 09:12:00.845 | INFO     | __main__:process_document:37 - {
  "answer": "No specific threshold is mentioned in the document."
}
2025-01-29 09:12:00.847 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2

Encoding 17 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 1/1 [00:00<00:00, 13.61it/s]
2025-01-29 09:13:15.250 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:13:15.252 | INFO     | __main__:process_document:29 - Question: How many chapters does the game last?
2025-01-29 09:13:15.271 | INFO     | __main__:process_document:32 - 1
Will you play as the Fellowship of the Ring to defend the free races and destroy the One Ring?
O


Shapes:
encodings: torch.Size([17, 508, 128])
doc_masks: torch.Size([17, 508])
Documents encoded!


2025-01-29 09:13:16.968 | INFO     | __main__:process_document:37 - {
"answer": 3
}
2025-01-29 09:13:16.969 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2025-01-29 09:14:16.972 | INFO     | __main__:process_document:29 - Question: How many victory conditions are there?
2025-01-29 09:14:16.989 | INFO     | __main__:process_document:32 - Conquering Middle-earth
If you are present in all 7 regions (with a Fortress and/or at least 1 Unit)
2025-01-29 09:14:18.783 | INFO     | __main__:process_document:37 - {
  "answer": 3
}
2025-01-29 09:14:18.785 | INFO     | __main__:process_document:29 - Question: How many different races are there?
2025-01-29 09:14:18.80

Encoding 48 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  9.92it/s]
2025-01-29 09:15:43.029 | INFO     | __main__:process_document:19 - Predicting
2025-01-29 09:15:43.030 | INFO     | __main__:process_document:29 - Question: What is the maximum number of cards a player may acquire during the lookout phase?
2025-01-29 09:15:43.047 | INFO     | __main__:process_document:32 - 13. At the beginning of the game, each player draws 5 cards from their
Clan deck, looks through the


Shapes:
encodings: torch.Size([48, 508, 128])
doc_masks: torch.Size([48, 508])
Documents encoded!


2025-01-29 09:15:44.189 | INFO     | __main__:process_document:37 - {
  "answer": 4
}
2025-01-29 09:15:44.192 | INFO     | __main__:process_document:29 - Question: Is there a limit to the number of cards a player may have in their hand?
2025-01-29 09:15:44.213 | INFO     | __main__:process_document:32 - NOTE 1:  There’s no limit to the number of cards a player may have
in their hand.
NOTE 2:  If the
2025-01-29 09:15:50.715 | INFO     | __main__:process_document:37 - {
"answer": "No"
}
2025-01-29 09:15:50.716 | INFO     | __main__:process_document:29 - Question: Can you raid the locations of a player that has passed during the action phase?
2025-01-29 09:15:50.735 | INFO     | __main__:process_docu

In [13]:
results = pd.read_csv("results.csv")
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 22 | 22 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Classification of general-purpose AI models as... | What is the threshold, measured in floating po... | C | NO SPECIFIC THRESHOLD IS MENTIONED IN THE DOCU... | |
| 44 | 44 | https://arxiv.org/pdf/2201.11903 | 3.2 Results | How many random samples were examined to under... | 100 | 50 | |
| 47 | 47 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | How many different races are there? | 6 | 7 | |
| 52 | 52 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE COSTS | Can a player pay coins to compensate for missi... | YES | NO | |
| 55 | 55 | https://github.com/mozilla-ai/structured-qa/re... | CARD AND TILE EFFECTS | Which type of cards provide coins? -A: Gray -B... | B | NONE OF THE ABOVE | |
| 66 | 66 | https://github.com/mozilla-ai/structured-qa/re... | EXPEDITION PHASE | How many victory points you get from each conq... | 1 | THE DOCUMENT SAYS THAT PLAYERS GAIN VPS FROM P... | |
| 83 | 83 | https://docs.nvidia.com/cuda/pdf/CUDA_C_Progra... | 15.3. API Fundamentals | When are virtual addresses assigned to graph a... | C | A | |


In [14]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.9292929292929293
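
The reported accuracy can be cross-checked against the seven mismatched rows listed above. The short calculation below is a sanity check, assuming that the mismatch table is complete. The resulting total of 99 questions is inferred from those two outputs and is not a value printed by the notebook.

```python
reported_accuracy = 0.9292929292929293  # value printed by the notebook
n_mismatches = 7  # rows shown in the mismatch table

# accuracy = (total - mismatches) / total  =>  total = mismatches / (1 - accuracy)
n_total = round(n_mismatches / (1 - reported_accuracy))
n_correct = n_total - n_mismatches

print(f"total={n_total}, correct={n_correct}, accuracy={n_correct / n_total:.4f}")
```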