# RAG with PDFs

Retrieval-augmented generation (RAG) gives a large language model additional context in its prompt, retrieved at the moment a user submits a query.

This guide uses Atlas as the data layer for retrieval: a vector search over an Atlas Dataset returns the relevant passages, which are then passed to an LLM for inference.

## Setup

To run the code in this guide, make sure you have `docling`, `nomic`, `openai`, and `requests` installed in your Python environment:

In [33]:
!pip install docling nomic openai requests

Then, log in to `nomic` with your Nomic API key. If you don't have a Nomic API key, you can create one [here](https://atlas.nomic.ai/cli-login).

In [34]:
!nomic login nk-...

## Create Atlas Dataset

Let's start with a collection of PDFs and chunk them into snippets that can be fetched at retrieval time.

For this example, we will download and parse PDFs with `docling` from the open-access paper repository arXiv.

Make sure `docling` is installed in your Python environment:

In [None]:
!pip install docling

In [1]:
from docling.chunking import HybridChunker
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_pipeline_options = PdfPipelineOptions(do_ocr=False, do_table_structure=False)
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(
        pipeline_options=pdf_pipeline_options
    )}
)
chunker = HybridChunker()

In [7]:
# You can replace this with any list of PDFs you want
# The file can be a URL or a local filename for a PDF
PDFs = [
    {'title': "Attention Is All You Need", 'file': "https://arxiv.org/pdf/1706.03762"},
    {'title': "Deep Residual Learning", 'file': "https://arxiv.org/pdf/1512.03385"},
    {'title': "BERT", 'file': "https://arxiv.org/pdf/1810.04805"},
    {'title': "GPT-3", 'file': "https://arxiv.org/pdf/2005.14165"},
    {'title': "Adam Optimizer", 'file': "https://arxiv.org/pdf/1412.6980"},
    {'title': "GANs", 'file': "https://arxiv.org/pdf/1406.2661"},
    {'title': "U-Net", 'file': "https://arxiv.org/pdf/1505.04597"},
    {'title': "DALL-E 2", 'file': "https://arxiv.org/pdf/2204.06125"},
    {'title': "Stable Diffusion", 'file': "https://arxiv.org/pdf/2112.10752"}
]

data = []
for pdf in PDFs:
    print("Downloading and parsing", pdf['title'])
    doc = doc_converter.convert(pdf['file']).document
    for chunk in chunker.chunk(dl_doc=doc):
        chunk_dict = chunk.model_dump()
        filename = chunk_dict['meta']['origin']['filename']
        heading = chunk_dict['meta']['headings'][0] if chunk_dict['meta']['headings'] else None
        page_num = chunk_dict['meta']['doc_items'][0]['prov'][0]['page_no']
        data.append(
            {"text": chunk.text, "title": pdf['title'], "filename": filename, "heading": heading, "page_num": page_num}
        )

Downloading and parsing Attention Is All You Need
Downloading and parsing Deep Residual Learning
Downloading and parsing BERT
Downloading and parsing GPT-3
Downloading and parsing Adam Optimizer
Downloading and parsing GANs
Downloading and parsing U-Net
Downloading and parsing DALL-E 2
Downloading and parsing Stable Diffusion


Let's take a look at the data to make sure it looks alright:

In [11]:
data[250]

{'text': 'We also empirically evaluate the effect of the bias correction terms explained in sections 2 and 3. Discussed in section 5, removal of the bias correction terms results in a version of RMSProp (Tieleman & Hinton, 2012) with momentum. We vary the β 1 and β 2 when training a variational autoencoder (VAE) with the same architecture as in (Kingma & Welling, 2013) with a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β 1 ∈ [0 , 0 . 9] and β 2 ∈ [0 . 99 , 0 . 999 , 0 . 9999] , and log 10 ( α ) ∈ [ -5 , ..., -1] . Values of β 2 close to 1, required for robustness to sparse gradients, results in larger initialization bias; therefore we expect the bias correction term is important in such cases of slow decay, preventing an adverse effect on optimization.\nIn Figure 4, values β 2 close to 1 indeed lead to instabilities in training when no bias
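An aggregate check can complement spot-checking a single record. The sketch below assumes records shaped like the dictionaries built above (`text`, `title`, `page_num`). `summarize_chunks` is a hypothetical helper name, not part of `docling` or `nomic`, and the sample records are made up for illustration.

```python
from collections import Counter


def summarize_chunks(records: list[dict]) -> dict:
    """Count chunks per title and flag records whose text is empty or whitespace."""
    per_title = Counter(r["title"] for r in records)
    empty = [i for i, r in enumerate(records) if not (r.get("text") or "").strip()]
    return {"total": len(records), "per_title": dict(per_title), "empty_indices": empty}


# Hypothetical sample records, shaped like the entries in `data`
sample = [
    {"text": "Adam is a method for stochastic optimization.", "title": "Adam Optimizer", "page_num": 1},
    {"text": "   ", "title": "Adam Optimizer", "page_num": 2},
    {"text": "We introduce a new language representation model.", "title": "BERT", "page_num": 1},
]
summary = summarize_chunks(sample)
print(summary)
```

Calling `summarize_chunks(data)` on the real list would show whether any paper produced unexpectedly few chunks or any blank ones before the upload.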

## Upload Dataset to Atlas

Now, we can upload this data to Atlas to create a data map:

In [12]:
from nomic import atlas

atlas_dataset = atlas.map_data(
    data=data,
    indexed_field="text",
    identifier="pdf-data-for-rag"
)

2025-01-29 16:37:39.721 | INFO     | nomic.dataset:_create_project:867 - Organization name: `nomic`
2025-01-29 16:37:40.162 | INFO     | nomic.dataset:_create_project:895 - Creating dataset `pdf-data-for-rag`
2025-01-29 16:37:40.535 | INFO     | nomic.atlas:map_data:145 - Uploading data to Atlas.
1it [00:00,  1.34it/s]
2025-01-29 16:37:41.317 | INFO     | nomic.dataset:_add_data:1714 - Upload succeeded.
2025-01-29 16:37:41.320 | INFO     | nomic.atlas:map_data:163 - `nomic/pdf-data-for-rag`: Data upload succeeded to dataset`
2025-01-29 16:37:42.887 | INFO     | nomic.dataset:create_index:1301 - Created map `pdf-data-for-rag` in dataset `nomic/pdf-data-for-rag`: https://atlas.nomic.ai

You'll get an email when your data map is built!

## Retrieval Over Your Data Map

The Nomic Atlas vector search API returns the k most semantically similar items from your Atlas Dataset for a given query. You can read more about this endpoint in the API reference [here](https://docs.nomic.ai/reference/api/query/vector-search).
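To build intuition for what a top-k vector search does, here is a toy in-memory sketch. It is not how Atlas implements search: the three-dimensional vectors, the labels, and the `top_k` function are made-up stand-ins for real embeddings and the hosted endpoint.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], items: list[tuple[str, list[float]]], k: int):
    """Return the k (similarity, label) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), label) for label, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical "embeddings": one axis per topic
items = [
    ("optimizers", [1.0, 0.0, 0.0]),
    ("attention", [0.0, 1.0, 0.0]),
    ("diffusion", [0.0, 0.0, 1.0]),
]
results = top_k([0.9, 0.1, 0.0], items, k=2)
print(results)
```

The query vector points mostly along the first axis, so `optimizers` ranks first and `attention` second. A hosted search applies the same ranking idea to embeddings of your text chunks.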

This helper function calls the Nomic Atlas vector search endpoint. It reads your API key from the `NOMIC_API_KEY` environment variable, so set that variable before calling it:

In [51]:
import requests
import os
from nomic import AtlasDataset

def retrieve(query: str, dataset_identifier: str, k: int, fields: list[str]) -> list:
    """Retrieve semantically similar items from an Atlas Dataset based on a query."""
    
    # load the projection ID for your map
    atlas_dataset = AtlasDataset(dataset_identifier)
    atlas_map_projection_id = atlas_dataset.maps[0].projection_id

    response = requests.post(
        'https://api-atlas.nomic.ai/v1/query/topk',
        headers={'Authorization': f'Bearer {os.environ.get("NOMIC_API_KEY")}'},
        json={
            'query': query,
            'k': k,
            'fields': fields,
            'projection_id': atlas_map_projection_id,
        }
    )
    response.raise_for_status()  # fail loudly on HTTP errors (e.g. a bad API key)
    return response.json()['data']

The parameters for this helper function are:

- `query`: the text query to search against
- `dataset_identifier`: a string of the form "your_org_name/your_dataset_name" used to load your dataset from Atlas
- `k`: number of similar items to return
- `fields`: which fields/columns from your dataset to return in the response
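Because the helper builds its `Authorization` header from `os.environ.get("NOMIC_API_KEY")`, an unset variable would produce the header value `Bearer None` rather than an immediate error. A minimal sketch of a guard follows; `require_env` and `auth_headers` are hypothetical helper names, and the key shown is a placeholder.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def auth_headers(api_key: str) -> dict:
    """Build the bearer-token header used by the retrieval helper."""
    return {"Authorization": f"Bearer {api_key}"}


# Demo with a placeholder mapping instead of the real environment
headers = auth_headers(require_env("NOMIC_API_KEY", env={"NOMIC_API_KEY": "nk-placeholder"}))
print(headers)
```

In `retrieve`, the `headers=` argument could then be `auth_headers(require_env("NOMIC_API_KEY"))`, so a missing key is reported before any request is sent.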

Let's inspect the output of `retrieve` on the query "What metrics are mentioned for evaluation?":


In [52]:
query = "What metrics are mentioned for evaluation?"
dataset_identifier = "YOUR_ORG_HERE/pdf-data-for-rag"
retrieved_data = retrieve(
    query, dataset_identifier, 3, ["title", "heading", "text"]
)

2025-01-29 17:09:26.797 | INFO     | nomic.dataset:__init__:804 - Loading existing dataset `nomic/pdf-data-for-rag`.


In [53]:
retrieved_data

[{'title': 'Stable Diffusion',
  'heading': 'E.3.5 Efficiency Analysis',
  'text': 'For efficiency reasons we compute the sample quality metrics plotted in Fig. 6, 17 and 7 based on 5k samples. Therefore, the results might vary from those shown in Tab. 1 and 10. All models have a comparable number of parameters as provided in Tab. 13 and 14. We maximize the learning rates of the individual models such that they still train stably. Therefore, the learning rates slightly vary between different runs cf . Tab. 13 and 14.',
  '_similarity': 0.7273091077804565},
 {'title': 'GPT-3',
  'heading': 'Context → Article:',
  'text': "Figure G.11: Formatted dataset example for ARC (Challenge). When predicting, we normalize by the unconditional probability of each answer as described in 2.\nFigure G.13: Formatted dataset example for Winograd. The 'partial' evaluation method we use compares the probability of the completion given a correct and incorrect context.\n53\nFigure G.14: Formatted dataset exa
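Each returned item carries a `_similarity` score, as in the output above. The sketch below shows one way to drop weak matches before they reach the LLM. The `0.6` threshold and the `filter_by_similarity` name are assumptions for illustration, and the sample items are abbreviated, made-up stand-ins for real results.

```python
def filter_by_similarity(items: list[dict], min_similarity: float) -> list[dict]:
    """Keep items at or above the threshold, ordered from most to least similar."""
    kept = [item for item in items if item.get("_similarity", 0.0) >= min_similarity]
    return sorted(kept, key=lambda item: item["_similarity"], reverse=True)


# Hypothetical results; only the fields needed for the demo are included
sample = [
    {"title": "GPT-3", "_similarity": 0.69},
    {"title": "Stable Diffusion", "_similarity": 0.73},
    {"title": "U-Net", "_similarity": 0.41},
]
strong = filter_by_similarity(sample, min_similarity=0.6)
print([item["title"] for item in strong])
```

A sensible threshold depends on the dataset and the embedding model, so it is worth inspecting scores for a few known-good and known-bad queries before fixing a value.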

# End-to-End RAG with the Data Map

With a retrieval function for our data map, we can now perform end-to-end RAG with Atlas as our intermediate data layer.

We will use GPT-4o mini from OpenAI as the LLM in this example. Make sure you have the `openai` package installed and an OpenAI API key available.

In [23]:
!pip install openai

## Full Code Example

Here is a complete working example that goes from a user query to an LLM response, retrieving data from Atlas as an intermediate step:

In [3]:
import requests
from openai import OpenAI
import os
from nomic import AtlasDataset

client = OpenAI(
    # api_key="sk-proj-...",  # add your OpenAI API key here, or set the OPENAI_API_KEY environment variable
)

def retrieve(query: str, dataset_identifier: str, k: int, fields: list[str]) -> list:
    """Retrieve semantically similar items from an Atlas Dataset based on a query."""
    
    # load the projection ID for your map
    atlas_dataset = AtlasDataset(dataset_identifier)
    atlas_map_projection_id = atlas_dataset.maps[0].projection_id

    response = requests.post(
        'https://api-atlas.nomic.ai/v1/query/topk',
        headers={'Authorization': f'Bearer {os.environ.get("NOMIC_API_KEY")}'},
        json={
            'query': query,
            'k': k,
            'fields': fields,
            'projection_id': atlas_map_projection_id,
        }
    )
    response.raise_for_status()  # fail loudly on HTTP errors (e.g. a bad API key)
    return response.json()['data']

query = "What metrics are mentioned for evaluation?"

print("retrieving data from Atlas...")

dataset_identifier = "YOUR_ORG_HERE/pdf-data-for-rag"
retrieved_data = retrieve(
    query, dataset_identifier, 3, ["title", "heading", "text"]
)


print("generating response from OpenAI...")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "developer", "content": "You are a helpful assistant. Be specific and cite the context you are given"},
        {"role": "user", "content": f"Context:\n{retrieved_data}\n\nQuestion: {query}"}
    ]
).choices[0].message.content

retrieving data from Atlas...


2025-01-29 17:12:21.287 | INFO     | nomic.dataset:__init__:804 - Loading existing dataset `nomic/pdf-data-for-rag`.


generating response from OpenAI...


Now let's print the RAG response:

In [4]:
print(f"Q: {query}\n\nA: {response}")

Q: What metrics are mentioned for evaluation?

A: The context provided mentions several metrics related to sample quality and performance evaluation:

1. **Sample Quality Metrics** - Mentioned in the first entry regarding Stable Diffusion, which are computed based on 5,000 samples. These metrics are displayed in figures referenced as Fig. 6, 7, and 17.

2. **Cross-Entropy Validation Loss** - In the third entry about GPT-3, performance is measured in terms of cross-entropy validation loss, which follows a power-law trend with the amount of compute used for training.

3. **Normalization by Unconditional Probability** - The second entry describes a method used for predicting that involves normalizing by the unconditional probability of each answer, indicating a probabilistic evaluation method.

These metrics help assess the quality and performance of the respective models discussed in the context.
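The full example interpolates the raw Python list of retrieved items into the prompt. One alternative, sketched below, formats each item as a numbered source so the model has stable labels to cite. `format_context` is a hypothetical helper, and the sample items are abbreviated stand-ins for real results.

```python
def format_context(items: list[dict]) -> str:
    """Render retrieved items as numbered sources separated by blank lines."""
    blocks = []
    for i, item in enumerate(items, start=1):
        title = item.get("title") or "Unknown title"
        heading = item.get("heading") or "No heading"  # heading may be None for some chunks
        text = (item.get("text") or "").strip()
        blocks.append(f"[{i}] {title} / {heading}\n{text}")
    return "\n\n".join(blocks)


# Abbreviated, illustrative items shaped like the retrieval output
sample = [
    {"title": "Stable Diffusion", "heading": "E.3.5 Efficiency Analysis",
     "text": "For efficiency reasons we compute the sample quality metrics ..."},
    {"title": "GPT-3", "heading": None,
     "text": "Figure G.11: Formatted dataset example ..."},
]
context = format_context(sample)
print(context)
```

The user message could then be built as `f"Context:\n{format_context(retrieved_data)}\n\nQuestion: {query}"`, which keeps internal fields such as `_similarity` out of the prompt.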
