# Structured Q&A

Source code: https://github.com/mozilla-ai/structured-qa

Docs: https://mozilla-ai.github.io/structured-qa

## GPU Check

First, you'll need to enable GPUs for the notebook:

- Navigate to `Edit`→`Notebook Settings`
- Select T4 GPU from the Hardware Accelerator section
- Click `Save` and accept.

Next, we'll confirm that we can connect to the GPU:

In [2]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("GPU not available")
else:
    print("GPU is available!")

GPU is available!


## Installing dependencies

In [3]:
%pip install ragatouille PyPDF2

Collecting ragatouille
  Downloading ragatouille-0.0.8.post4-py3-none-any.whl.metadata (15 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.7/86.7 kB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu<2.0.0,>=1.7.4 (from ragatouille)
  Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index>=0.7 (from ragatouille)
  Downloading llama_index-0.12.13-py3-none-any.whl.metadata (12 kB)
Collecting onnx<2.0.0,>=1.15.0 (from ragatouille)
  Downloading onnx-1.17.0-cp311-cp311-manylinux_

In [4]:
%pip install git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark

Collecting git+https://github.com/mozilla-ai/structured-qa.git@5-add-benchmark
  Cloning https://github.com/mozilla-ai/structured-qa.git (to revision 5-add-benchmark) to /tmp/pip-req-build-49ruike5
  Running command git clone --filter=blob:none --quiet https://github.com/mozilla-ai/structured-qa.git /tmp/pip-req-build-49ruike5
  Running command git checkout -b 5-add-benchmark --track origin/5-add-benchmark
  Switched to a new branch '5-add-benchmark'
  Branch '5-add-benchmark' set up to track remote branch '5-add-benchmark' from 'origin'.
  Resolved https://github.com/mozilla-ai/structured-qa.git to commit 17942ca192e0493c7c061e6f908cc2b945122ef6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from structured-qa==0.3.3.dev62+g17942ca)
  Downloading fire-0.7.0.tar.gz (87 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.2/87.2 k

In [5]:
!wget https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv

--2025-01-24 10:32:07--  https://raw.githubusercontent.com/mozilla-ai/structured-qa/refs/heads/5-add-benchmark/benchmark/structured_qa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14711 (14K) [text/plain]
Saving to: ‘structured_qa.csv’


2025-01-24 10:32:07 (73.1 MB/s) - ‘structured_qa.csv’ saved [14711/14711]



# Setup

In [6]:
import os
import google.generativeai as genai
from google.colab.userdata import get, SecretNotFoundError

try:
    genai.configure(api_key=get("GOOGLE_API_KEY"))
except SecretNotFoundError as e:
    raise RuntimeError("Please set the GOOGLE_API_KEY secret to your API key") from e
os.environ["LOGURU_LEVEL"] = "INFO"

In [7]:
from loguru import logger

In [10]:
import PyPDF2


def load_pdf(pdf_file: str) -> str | None:
    try:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() for page in pdf_reader.pages)
    except Exception as e:
        logger.exception(e)
        return None
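
`load_pdf` returns `None` when parsing fails, so a caller may want to stop early rather than pass `None` on to the retrieval step. A minimal sketch of such a guard, assuming a hypothetical helper named `require_text` (it is not part of this notebook or of `structured-qa`):

```python
from typing import Optional


def require_text(text: Optional[str], source: str) -> str:
    """Return `text` unchanged, or raise if loading failed or produced no text.

    `require_text` is a hypothetical helper used only for illustration.
    """
    if text is None or not text.strip():
        raise ValueError(f"No text could be extracted from {source}")
    return text


print(require_text("Attention is all you need", "1706.03762.pdf"))
```

One way to use it would be to wrap the `load_pdf(document_file)` call inside `process_document`, so that a failed extraction surfaces as a clear error.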

## Function to process all questions for a single document

In [47]:
import json
import time

from ragatouille import RAGPretrainedModel
from ragatouille.data import CorpusProcessor


def process_document(
    document_file,
    document_data,
    model,
):
    logger.info("Setting up RAG")
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    corpus_processor = CorpusProcessor()
    documents = corpus_processor.process_corpus([load_pdf(document_file)])
    RAG.encode([x["content"] for x in documents])

    logger.info("Predicting")
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        if model.n > 0 and model.n % 9 == 0:
            logger.info("Waiting for 60 seconds")
            time.sleep(60)
        question = row["question"]

        logger.info(f"Question: {question}")
        results = RAG.search_encoded_docs(query=question, k=3)
        current_info = "\n".join(result["content"] for result in results)
        logger.info(current_info[:100])

        answer = model.model.generate_content(
            [f"This is the document: {current_info}", question]
        )
        logger.info(answer.text)
        answers[index] = json.loads(answer.text)["answer"]
        sections[index] = None
        model.n += 1

    return answers, sections
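
The loop above pauses for 60 seconds after every 9 model calls, presumably to stay under a per-minute request quota (that motivation is an assumption; the notebook does not state it). The same counting logic can be sketched on its own, with a hypothetical `CallThrottle` class and an injectable `sleep` function so it can be tried without waiting:

```python
import time
from typing import Callable, List


class CallThrottle:
    """Pause for `pause_seconds` once every `calls_per_window` calls.

    Hypothetical helper mirroring the `model.n % 9 == 0` check above.
    """

    def __init__(
        self,
        calls_per_window: int = 9,
        pause_seconds: float = 60.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.calls_per_window = calls_per_window
        self.pause_seconds = pause_seconds
        self._sleep = sleep
        self.n = 0  # number of calls made so far

    def wait_if_needed(self) -> bool:
        """Sleep if a full window of calls has been used; return True if it slept."""
        should_pause = self.n > 0 and self.n % self.calls_per_window == 0
        if should_pause:
            self._sleep(self.pause_seconds)
        self.n += 1
        return should_pause


# Demonstration with a fake sleep so the example finishes instantly.
pauses: List[float] = []
throttle = CallThrottle(sleep=pauses.append)
paused_before = [i for i in range(20) if throttle.wait_if_needed()]
print(paused_before)  # [9, 18]
```

Because the counter lives on the shared object rather than inside the function, it keeps counting across documents, which matches how `model.n` is used here.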

## Load Model

In [48]:
from structured_qa.model_loaders import load_gemini_model

In [49]:
SYSTEM_PROMPT = """
You are given an input document and a question.
You can only answer the question based on the information in the document.
You will return a JSON name with a single key: "answer".
In `"answer"`, you will return the answer using one of the following JSON types:
- Yes/No (for boolean questions)
Is the model an LLM?
{
  "answer": "No"
}
- Single number (for numeric questions)
How many layers does the model have?
{
  "answer": 12
}
- Single letter (for multiple-choice questions)
What is the activation function used in the model?
-A: ReLU
-B: Sigmoid
-C: Tanh
{
  "answer": "C"
}
"""

In [50]:
model = load_gemini_model(
    "gemini-2.0-flash-exp",
    system_prompt=SYSTEM_PROMPT,
    generation_config={
        "response_mime_type": "application/json",
    },
)
model.n = 0
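
`SYSTEM_PROMPT` asks for a JSON object whose `"answer"` is a Yes/No string, a single number, or a single letter. Nothing in the notebook enforces that, so a strict check can be useful when inspecting responses. A minimal sketch, assuming a hypothetical `parse_answer` helper (not part of `structured-qa`):

```python
import json
from typing import Union


def parse_answer(raw: str) -> Union[int, float, str]:
    """Parse a JSON response and check `answer` against the prompted shapes.

    Hypothetical helper: accepts "Yes"/"No", a number, or a single letter.
    """
    answer = json.loads(raw)["answer"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(answer, bool):
        raise ValueError(f"Unexpected answer: {answer!r}")
    if isinstance(answer, (int, float)):
        return answer
    if isinstance(answer, str):
        is_yes_no = answer in {"Yes", "No"}
        is_letter = len(answer) == 1 and answer.isalpha()
        if is_yes_no or is_letter:
            return answer
    raise ValueError(f"Unexpected answer: {answer!r}")


print(parse_answer('{"answer": "C"}'))
print(parse_answer('{"answer": 6}'))
```

A check this strict would reject a number returned as a string such as `"850"`, so whether to raise or to coerce such values is a design choice left open here.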

# Run Benchmark

In [51]:
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

logger.info("Loading input data")
data = pd.read_csv("structured_qa.csv")
data["pred_answer"] = [None] * len(data)
data["pred_section"] = [None] * len(data)
for document_link, document_data in data.groupby("document"):
    logger.info(f"Downloading document {document_link}")
    downloaded_document = Path(f"{Path(document_link).name}.pdf")
    if not Path(downloaded_document).exists():
        urlretrieve(document_link, downloaded_document)
        logger.info(f"Downloaded {document_link} to {downloaded_document}")
    else:
        logger.info(f"File {downloaded_document} already exists")

    answers, sections = process_document(downloaded_document, document_data, model)

    for index in document_data.index:
        data.loc[index, "pred_answer"] = str(answers[index]).upper()
        data.loc[index, "pred_section"] = sections[index]

data.to_csv("results.csv")

2025-01-24 11:10:27.632 | INFO     | __main__:<cell line: 0>:8 - Loading input data
2025-01-24 11:10:27.640 | INFO     | __main__:<cell line: 0>:13 - Downloading document https://arxiv.org/pdf/1706.03762
2025-01-24 11:10:27.641 | INFO     | __main__:<cell line: 0>:19 - File 1706.03762.pdf already exists
2025-01-24 11:10:27.643 | INFO     | __main__:process_document:13 - Setting up RAG
  self.scaler = torch.cuda.amp.GradScaler()


Encoding 56 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  7.78it/s]
2025-01-24 11:10:30.151 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:10:30.157 | INFO     | __main__:process_document:28 - Question: What type of architecture does the model use? -A: decoder only -B: encoder only -C: encoder-decoder
2025-01-24 11:10:30.174 | INFO     | __main__:process_document:31 - (2015) [23] multi-task 93.0
Dyer et al. (2016) [8] generative 93.3
increased the maximum output leng


Shapes:
encodings: torch.Size([56, 508, 128])
doc_masks: torch.Size([56, 508])
Documents encoded!


2025-01-24 11:10:31.916 | INFO     | __main__:process_document:34 - {
"answer": "C"
}
2025-01-24 11:10:31.918 | INFO     | __main__:process_document:28 - Question: How many layers compose the encoder?
2025-01-24 11:10:31.941 | INFO     | __main__:process_document:31 - At each step the model is auto-regressive
[10], consuming the previously generated symbols as additi
2025-01-24 11:10:33.208 | INFO     | __main__:process_document:34 - {
  "answer": 6
}
2025-01-24 11:10:33.209 | INFO     | __main__:process_document:28 - Question: How many layers compose the decoder?
2025-01-24 11:10:33.225 | INFO     | __main__:process_document:31 - To facilitate these residual connections, all sub-lay

Encoding 137 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  9.18it/s]
2025-01-24 11:11:53.319 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:11:53.323 | INFO     | __main__:process_document:28 - Question: Does LoRA work with any neural network containing dense layers?
2025-01-24 11:11:53.349 | INFO     | __main__:process_document:31 - More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off betw


Shapes:
encodings: torch.Size([137, 508, 128])
doc_masks: torch.Size([137, 508])
Documents encoded!


2025-01-24 11:11:54.815 | INFO     | __main__:process_document:34 - {
  "answer": "Yes"
}
2025-01-24 11:11:54.818 | INFO     | __main__:process_document:28 - Question: How much memory is saved (in GB) when training GPT-3 175B with LoRA compared to full fine-tuning?
2025-01-24 11:11:54.847 | INFO     | __main__:process_document:31 - On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to
350GB. With r= 4and only
2025-01-24 11:11:56.114 | INFO     | __main__:process_document:34 - {
  "answer": "850"
}
2025-01-24 11:11:56.116 | INFO     | __main__:process_document:28 - Question: By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x
2025-01-24 11:11:56.132 | INFO

Encoding 199 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 7/7 [00:00<00:00,  7.74it/s]
2025-01-24 11:12:03.727 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:12:03.729 | INFO     | __main__:process_document:28 - Question: Is Arithmetic reasoning is a task that language models often find very easy?
2025-01-24 11:12:03.746 | INFO     | __main__:process_document:31 - 3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which mea


Shapes:
encodings: torch.Size([199, 508, 128])
doc_masks: torch.Size([199, 508])
Documents encoded!


2025-01-24 11:12:05.361 | INFO     | __main__:process_document:34 - {
  "answer": "No"
}
2025-01-24 11:12:05.362 | INFO     | __main__:process_document:28 - Question: How many large language models were evaluated?
2025-01-24 11:12:05.384 | INFO     | __main__:process_document:31 - For AQuA, we used four exemplars
and solutions from the training set, as given in Appendix Table 21.
2025-01-24 11:12:10.267 | INFO     | __main__:process_document:34 - {
"answer": 5
}
2025-01-24 11:12:10.269 | INFO     | __main__:process_document:24 - Waiting for 60 seconds
2025-01-24 11:13:10.271 | INFO     | __main__:process_document:28 - Question: How many benchmarks were used to evaluate arithmetic rea

Encoding 44 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00,  9.65it/s]
2025-01-24 11:13:22.598 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:13:22.599 | INFO     | __main__:process_document:28 - Question: How many layers are in the toy model (y = x^2)?
2025-01-24 11:13:22.615 | INFO     | __main__:process_document:31 - o(t)=c1^ZT
1WTh(0)+tX
i=1ci^ZiUTx(i)(17)
In Eq. 17, ci^ZT
i=a(t)^VT
ci^Wi.As one can observe from
Eq


Shapes:
encodings: torch.Size([44, 508, 128])
doc_masks: torch.Size([44, 508])
Documents encoded!


2025-01-24 11:13:23.678 | INFO     | __main__:process_document:34 - {
"answer": 3
}
2025-01-24 11:13:23.679 | INFO     | __main__:process_document:28 - Question: Does the model use Sigmoid activation function?
2025-01-24 11:13:23.695 | INFO     | __main__:process_document:31 - Computation and memory analysis of toy problems.
3 layers with leaky-ReLU activations, except for la
2025-01-24 11:13:25.083 | INFO     | __main__:process_document:34 - {
"answer": "Yes"
}
2025-01-24 11:13:25.086 | INFO     | __main__:process_document:28 - Question: How many parameters are in the y = x^2 toy model tree?
2025-01-24 11:13:25.101 | INFO     | __main__:process_document:31 - o(t)=c1^ZT
1WTh(0)+tX
i=

Encoding 143 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 5/5 [00:00<00:00,  8.82it/s]
2025-01-24 11:14:33.237 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:14:33.239 | INFO     | __main__:process_document:28 - Question: According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT license
2025-01-24 11:14:33.256 | INFO     | __main__:process_document:31 - CHAPTER 4: HOW
“OPEN” DO YOU
WANT TO MAKE
YOUR WORK?
UNDER CURRENT U.S. COPYRIGHT LAW, COPY-
righ


Shapes:
encodings: torch.Size([143, 508, 128])
doc_masks: torch.Size([143, 508])
Documents encoded!


2025-01-24 11:14:34.368 | INFO     | __main__:process_document:34 - {
  "answer": "B"
}
2025-01-24 11:14:34.370 | INFO     | __main__:process_document:28 - Question: how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? -A: Over 10,000 -B: Over 20,000 -C: Exactly 30,000
2025-01-24 11:14:34.386 | INFO     | __main__:process_document:31 - For authors of articles, a good
place to start is the Directory of Open Access Journals (“DOAJ”), a
2025-01-24 11:14:35.750 | INFO     | __main__:process_document:34 - {
  "answer": "A"
}
2025-01-24 11:14:35.752 | INFO     | __main__:process_document:28 - Question: what is the term of office for members of the advisory board of the Authors Alliance? -

Encoding 364 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 12/12 [00:01<00:00,  9.92it/s]
2025-01-24 11:14:48.915 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:14:48.917 | INFO     | __main__:process_document:24 - Waiting for 60 seconds


Shapes:
encodings: torch.Size([364, 508, 128])
doc_masks: torch.Size([364, 508])
Documents encoded!


2025-01-24 11:15:48.921 | INFO     | __main__:process_document:28 - Question: Which type of water must be supplied in a toilet sink? -A: hot -B: cold -C: hot and cold
2025-01-24 11:15:48.943 | INFO     | __main__:process_document:31 - This range may be amended to between 2 and 15  French degrees in the event that a specific
demand e
2025-01-24 11:15:50.434 | INFO     | __main__:process_document:34 - {
  "answer": "B"
}
2025-01-24 11:15:50.436 | INFO     | __main__:process_document:28 - Question: In which type of parkings must a carbon monoxide detector be installed? -A: indoor -B: underground -C: indoor or underground
2025-01-24 11:15:50.453 | INFO     | __main__:process_document:31 - I.2.8.    GAS DETECTION AND VE NTING
The requir

Encoding 754 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 24/24 [00:02<00:00,  8.24it/s]
2025-01-24 11:16:02.140 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:16:02.145 | INFO     | __main__:process_document:28 - Question: Which type of AI systems are banned by the AI Act? -A: High-risk systems, -B: Manipulative systems, -C: Real-time biometric systems in public spaces
2025-01-24 11:16:02.171 | INFO     | __main__:process_document:31 - education and training and the context the AI systems are to be used in, and consider ing
the perso


Shapes:
encodings: torch.Size([754, 508, 128])
doc_masks: torch.Size([754, 508])
Documents encoded!


2025-01-24 11:16:03.313 | INFO     | __main__:process_document:34 - {
"answer": "B"
}
2025-01-24 11:16:03.315 | INFO     | __main__:process_document:28 - Question: what is a requirement for datasets used in high-risk AI systems? -A: Exclusively open-source datasets -B: Datasets ensuring quality and diversity -C: Datasets not exceeding 1 GB in size
2025-01-24 11:16:03.332 | INFO     | __main__:process_document:31 - In particular , data
sets should take into account, to the extent required by their intended purpos
2025-01-24 11:16:04.721 | INFO     | __main__:process_document:34 - {
  "answer": "B"
}
2025-01-24 11:16:04.724 | INFO     | __main__:process_document:28 - Question: What is the threshold, measured in floating point operation

Encoding 17 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 1/1 [00:00<00:00, 13.17it/s]
2025-01-24 11:17:24.728 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:17:24.730 | INFO     | __main__:process_document:28 - Question: How many chapters does the game last?
2025-01-24 11:17:24.749 | INFO     | __main__:process_document:31 - 1
Will you play as the Fellowship of the Ring to defend the free races and destroy the One Ring?
O


Shapes:
encodings: torch.Size([17, 508, 128])
doc_masks: torch.Size([17, 508])
Documents encoded!


2025-01-24 11:17:25.988 | INFO     | __main__:process_document:34 - {
"answer": 3
}
2025-01-24 11:17:25.990 | INFO     | __main__:process_document:28 - Question: How many victory conditions are there?
2025-01-24 11:17:26.006 | INFO     | __main__:process_document:31 - Conquering Middle-earth
If you are present in all 7 regions (with a Fortress and/or at least 1 Unit)
2025-01-24 11:17:27.345 | INFO     | __main__:process_document:34 - {
  "answer": 3
}
2025-01-24 11:17:27.346 | INFO     | __main__:process_document:28 - Question: How many different races are there?
2025-01-24 11:17:27.362 | INFO     | __main__:process_document:31 - •
 Re
veal tokens to both players. There is no hidden

Encoding 48 documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
100%|██████████| 2/2 [00:00<00:00, 10.41it/s]
2025-01-24 11:19:49.710 | INFO     | __main__:process_document:19 - Predicting
2025-01-24 11:19:49.712 | INFO     | __main__:process_document:28 - Question: What is the maximum number of cards a player may acquire during the lookout phase?
2025-01-24 11:19:49.729 | INFO     | __main__:process_document:31 - 13. At the beginning of the game, each player draws 5 cards from their
Clan deck, looks through the


Shapes:
encodings: torch.Size([48, 508, 128])
doc_masks: torch.Size([48, 508])
Documents encoded!


2025-01-24 11:19:51.094 | INFO     | __main__:process_document:34 - {
  "answer": 4
}
2025-01-24 11:19:51.096 | INFO     | __main__:process_document:28 - Question: Is there a limit to the number of cards a player may have in their hand?
2025-01-24 11:19:51.112 | INFO     | __main__:process_document:31 - NOTE 1:  There’s no limit to the number of cards a player may have
in their hand.
NOTE 2:  If the
2025-01-24 11:19:52.626 | INFO     | __main__:process_document:34 - {
  "answer": "No"
}
2025-01-24 11:19:52.628 | INFO     | __main__:process_document:28 - Question: Can you raid the locations of a player that has passed during the action phase?
2025-01-24 11:19:52.645 | INFO     | __main__:process_do

In [53]:
results = pd.read_csv("results.csv")
results.loc[results["answer"] != results["pred_answer"]]

| Unnamed: 0.1 | Unnamed: 0 | document | section | question | answer | pred_answer | pred_section |
|---|---|---|---|---|---|---|---|
| 12 | 12 | https://arxiv.org/pdf/2210.05189 | 2.1 Fully Connected Networks | Does the model use Sigmoid activation function? | NO | YES | |
| 22 | 22 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Prohibited AI Practices (Article 5) | Which type of AI systems are banned by the AI ... | C | B | |
| 24 | 24 | https://eur-lex.europa.eu/legal-content/EN/TXT... | Classification rules (article 51) | What is the threshold, measured in floating po... | C | NOT APPLICABLE | |
| 50 | 50 | https://github.com/mozilla-ai/structured-qa/re... | CHAPTER OVERVIEW | Which player begins the game? -A: Sauron -B: T... | A | C | |


In [54]:
accuracy = sum(results["answer"] == results["pred_answer"]) / len(results)
accuracy

0.9473684210526315
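
The score is an exact-match accuracy: predictions were stored as `str(answer).upper()`, so `"No"` becomes `"NO"` and `6` becomes `"6"` before the comparison. The reported value is consistent with 72 of 76 rows matching, given the four mismatches listed above. A minimal sketch of the same comparison on plain lists, assuming hypothetical `normalize` and `accuracy` helpers and made-up sample values:

```python
from typing import Iterable


def normalize(answer: object) -> str:
    """Mirror the notebook's normalization: stringify, then uppercase."""
    return str(answer).upper()


def accuracy(expected: Iterable[object], predicted: Iterable[object]) -> float:
    """Fraction of positions where the normalized answers are identical."""
    pairs = list(zip(expected, predicted))
    if not pairs:
        raise ValueError("No answers to compare")
    matches = sum(normalize(e) == normalize(p) for e, p in pairs)
    return matches / len(pairs)


# Illustrative values only, not taken from the benchmark.
expected = ["NO", "C", 6, "A"]
predicted = ["Yes", "c", "6", "A"]
print(accuracy(expected, predicted))  # 0.75
```

Exact matching after normalization is strict: a free-text prediction such as `NOT APPLICABLE` counts as wrong whatever the expected answer is.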