# Document question answering

This tutorial shows how to build a simple document question-answering system from scratch using OpenAI, [PyMuPDF](https://github.com/pymupdf/PyMuPDF), and [LanceDB](https://github.com/lancedb/lancedb).

The supporting package (`aiutils`) [is available here](https://github.com/ploomber/doc/tree/main/aiutils).

In [2]:
import logging

logging.basicConfig(level=logging.CRITICAL)
aiutils_cache_logger = logging.getLogger('aiutils.cache')
aiutils_cache_logger.setLevel(logging.INFO)

Let's download the [OLMo](https://arxiv.org/abs/2402.00838) paper in PDF format:

In [3]:
from pathlib import Path

import requests

url = "https://arxiv.org/pdf/2402.00838.pdf"

path_to_data = Path(".data")
path_to_data.mkdir(exist_ok=True)
path_to_paper = path_to_data / "paper.pdf"

if not path_to_paper.exists():
    # a timeout keeps the cell from hanging if the server is unresponsive
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    with open(path_to_paper, 'wb') as f:
        f.write(response.content)

`Document` is a small abstraction that extracts the text from `.pdf` files; it uses [`PyMuPDF`](https://github.com/pymupdf/PyMuPDF) under the hood.

In [4]:
from aiutils.document import Document

In [5]:
doc = Document(".data/paper.pdf")
doc

Document(path=.data/paper.pdf, n_tokens=20,910, price=0.01 USD)

`APICache` caches calls to OpenAI's API in a SQLite database, so calls with the same arguments return the cached response. This lets us refactor and re-run code without paying for redundant requests.

In [6]:
from aiutils.cache import APICache
from openai import OpenAI

client = OpenAI()

embeddings_create = APICache(client.embeddings.create)
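
To make the caching idea concrete, here is a minimal sketch of a SQLite-backed memoizer. It is not the `aiutils` implementation: `SimpleCache` and `fake_embed` are hypothetical names, and the sketch assumes the wrapped function takes keyword arguments and returns something JSON-serializable (real OpenAI responses are objects and would need their own serialization).

```python
import hashlib
import json
import sqlite3


class SimpleCache:
    """Hypothetical stand-in for APICache: memoizes a function in SQLite."""

    def __init__(self, fn, path=":memory:"):
        self.fn = fn
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __call__(self, **kwargs):
        # Serialize the keyword arguments deterministically and hash them
        raw = json.dumps(kwargs, sort_keys=True)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()

        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            # Cache hit: return the stored value without calling the function
            return json.loads(row[0])

        # Cache miss: call the function and store its result
        value = self.fn(**kwargs)
        self.conn.execute(
            "INSERT INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()
        return value


calls = []


def fake_embed(input, model):
    # Toy function standing in for a paid API call
    calls.append(input)
    return [float(len(input)), 0.0]


cached_embed = SimpleCache(fake_embed)
first = cached_embed(input="hello", model="toy-model")
second = cached_embed(input="hello", model="toy-model")
```

The second call never reaches `fake_embed`: the arguments hash to the same key, so the stored value is returned. Passing a file path instead of `":memory:"` would make the cache survive kernel restarts.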

To get the most relevant pages to answer a question, we need to perform vector search. LanceDB is an embedded vector database that requires no setup, so it's perfect for this tutorial.

In [7]:
import shutil

import lancedb
import pyarrow as pa

path_to_vector_db = path_to_data / "vector-db"

if path_to_vector_db.exists():
    shutil.rmtree(path_to_vector_db)

db = lancedb.connect(path_to_vector_db)

# vector contains the embeddings we'll compute for each page
# (1536 is the default output dimension of text-embedding-3-small)
# content contains the page's text
schema = pa.schema([pa.field("vector", pa.list_(pa.float32(), list_size=1536)),
                    pa.field("content", pa.string())])
table = db.create_table("embeddings", schema=schema)

Iterate over the pages, compute each page's embedding, and insert it into the database:

In [8]:
for page in doc.pages():
    response = embeddings_create(
        input=page,
        model="text-embedding-3-small",
    )

    embedding = response.data[0].embedding
    data = dict(vector=embedding, content=page)
    table.add(data=[data])

INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.
INFO:aiutils.cache:Cache hit, using cached response.

We have everything we need. Let's put the user's query in a string:

In [9]:
user_query = "Which optimizer was used to train this model?"

We need to compute the embedding for the user's query so we can search for the most relevant pages in the document:

In [10]:
response = embeddings_create(
    input=user_query,
    model="text-embedding-3-small")

embedding_query = response.data[0].embedding

INFO:aiutils.cache:Cache hit, using cached response.


We search for the most relevant pages using the vector DB and retrieve the text from them:

In [11]:
result = table.search(embedding_query).limit(2)
content = [r["content"] for r in result.to_list()]
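
Conceptually, the search ranks the stored rows by how close their vectors are to the query vector. The sketch below does this by brute force in plain Python with toy 3-dimensional vectors. It assumes Euclidean (L2) distance and exact search; LanceDB's actual metric and indexing depend on how the table and query are configured, and `l2_distance` and `brute_force_search` are hypothetical helpers, not LanceDB functions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, rows, limit=2):
    """Return the `limit` rows whose vectors are closest to `query`."""
    ranked = sorted(rows, key=lambda row: l2_distance(query, row["vector"]))
    return ranked[:limit]


# Toy rows with the same shape as the table: a vector and the page content
rows = [
    {"vector": [1.0, 0.0, 0.0], "content": "page about optimizers"},
    {"vector": [0.0, 1.0, 0.0], "content": "page about datasets"},
    {"vector": [0.9, 0.1, 0.0], "content": "page about training"},
]

top = brute_force_search([1.0, 0.0, 0.0], rows, limit=2)
top_content = [r["content"] for r in top]
```

Here `top_content` holds the two pages nearest to the query, closest first. A vector database does the same job but scales to far more rows than a full scan comfortably handles.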

We convert the retrieved pages into a single string:

In [12]:
from jinja2 import Template

template = Template("""
{% for string in strings %}
### START PAGE ###
{{ string }}
{% endfor %}
""")

rendered = template.render(strings=content)
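
For readers unfamiliar with Jinja, the following plain-Python sketch produces a string with roughly the same shape as the rendered template: each page preceded by a separator line. The exact whitespace differs from Jinja's output, and `SEPARATOR` and `render_pages` are hypothetical names used only for illustration.

```python
SEPARATOR = "### START PAGE ###"


def render_pages(pages):
    """Join pages into one string, each preceded by the separator line."""
    return "\n".join(f"{SEPARATOR}\n{page}\n" for page in pages)


example = render_pages(["First page text.", "Second page text."])
print(example)
```

The separator gives the model an unambiguous marker for where one retrieved page ends and the next begins.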

Now we use OpenAI to answer the question:

In [13]:
system_prompt = f"""
You're a system that answers questions from a document.

Here are the relevant sections from the document, use it to answer the question. Each
page is separated by: ### START PAGE ###

{rendered}
"""

completions_create = APICache(client.chat.completions.create)

response = completions_create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ],
)

print(user_query + '\n\n')
print(response.choices[0].message.content)

INFO:aiutils.cache:Cache hit, using cached response.


Which optimizer was used to train this model?


The AdamW optimizer was used to train this model.


---

The answer is correct! In section "3.2 Optimizer", the paper says:

> We use the AdamW optimizer (Loshchilov and Hutter, 2019) with the hyperparameters shown in Table 4.
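
The steps in this tutorial (embed the query, retrieve pages, build the prompt, ask the model) can be chained in a single function. The sketch below is one possible arrangement, not part of `aiutils`: `answer_question` and the `fake_*` helpers are hypothetical, and the three callables are passed in as arguments so the example runs without network access.

```python
def answer_question(question, embed, search, complete, limit=2):
    """Hypothetical helper that chains embed -> search -> complete."""
    # 1. Embed the question
    query_vector = embed(question)
    # 2. Retrieve the text of the most relevant pages
    pages = search(query_vector, limit)
    # 3. Build the system prompt from the retrieved pages
    context = "\n".join(f"### START PAGE ###\n{page}" for page in pages)
    system_prompt = (
        "You're a system that answers questions from a document.\n\n"
        "Here are the relevant sections from the document. Each page is "
        "separated by: ### START PAGE ###\n\n" + context
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    # 4. Ask the model
    return complete(messages)


# Toy stand-ins so the sketch runs offline
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, limit):
    return ["We use the AdamW optimizer.", "Unrelated page."][:limit]


def fake_complete(messages):
    return f"received {len(messages)} messages"


result = answer_question(
    "Which optimizer?", fake_embed, fake_search, fake_complete
)
```

In the notebook, `embed` could wrap `embeddings_create`, `search` could wrap `table.search(...).limit(...)`, and `complete` could wrap `completions_create`, each adapted to return the plain values this function expects.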