## Do you really need a RAG?

If a document fits within the model's context window, you can skip the chunking and retrieval steps and simply put the entire document into the prompt.

This is illustrated here using Barack Obama's 2014 tax return.

In [1]:
#%pip install --quiet google-genai

In [1]:
GEMINI="gemini-2.0-flash-001"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
assert os.environ.get("GEMINI_API_KEY", "")[:2] == "AI",\
       "Please specify the GEMINI_API_KEY access token in keys.env file"

## Cache President Obama's tax return

It has long been a bipartisan tradition for US candidates for high office to release their tax returns.
Here, we download the return and upload it into Gemini's cache, so the document does not have to be
resent with every query (see the Prompt Caching pattern).

In [None]:
def cache_pdf(pdf_path: str, 
              model_id: str = GEMINI,
              system_instruction: str = "You are a tax attorney") -> str:
    from google import genai
    from google.genai import types
    import io, httpx

    client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
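    # Download the PDF (pdf_path is expected to be a URL) and fail early on HTTP errors
    pdf_response = httpx.get(pdf_path)
    pdf_response.raise_for_status()
    doc_io = io.BytesIO(pdf_response.content)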
    document = client.files.upload(
      file=doc_io,
      config=dict(mime_type='application/pdf')
    )
    # Create a cached content object
    cache = client.caches.create(
        model=model_id,
        config=types.CreateCachedContentConfig(
          system_instruction=system_instruction,
          contents=[document],
        )
    )
    # Display the cache details
    print(f'{cache=}')
    return cache.name

cache_name = cache_pdf(pdf_path="https://s3.amazonaws.com/pdfs.taxnotes.com/2019/B_Obama_2014.pdf")
print(cache_name)

## Find the document in the cache
Only one document was cached here, so we simply use that one. Normally, you would need some other way to track which cached document you want.
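
One possible way to track caches is to look them up by a human-readable label. The sketch below assumes each cache was created with a `display_name` (this notebook leaves it empty, as the listing shows) and uses stand-in objects instead of a live `client.caches.list()` call. The helper `find_cache_name` and the label `obama-tax-return` are hypothetical names introduced only for illustration.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_cache_name(caches: Iterable, display_name: str) -> Optional[str]:
    """Return the resource name of the first cache whose display_name matches."""
    for cache in caches:
        if getattr(cache, "display_name", None) == display_name:
            return cache.name
    return None  # nothing matched; the caller may need to create the cache


# Stand-ins for the objects yielded by client.caches.list(). Only the two
# attributes used above are modelled, and these names are made up.
fake_caches = [
    SimpleNamespace(name="cachedContents/aaa", display_name="other-doc"),
    SimpleNamespace(name="cachedContents/bbb", display_name="obama-tax-return"),
]

print(find_cache_name(fake_caches, "obama-tax-return"))
```

With a real client, the same helper could be handed the iterable returned by `client.caches.list()`.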

In [9]:
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
for cache in client.caches.list():
  print(cache)

name='cachedContents/wc0yofrcdjiv47at07bxfy9i0o9qyyimg6suw0pn' display_name='' model='models/gemini-2.0-flash-001' create_time=datetime.datetime(2025, 6, 1, 17, 2, 54, 807005, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 6, 1, 17, 2, 54, 807005, tzinfo=TzInfo(UTC)) expire_time=datetime.datetime(2025, 6, 1, 18, 2, 53, 965026, tzinfo=TzInfo(UTC)) usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=None, text_count=None, total_token_count=9811, video_duration_seconds=None)


No chunking is necessary because the whole document is only 9811 tokens, which is well within the context window supported by the LLM.
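
The "does it fit?" decision can be sketched as a simple comparison. The context window size below is an assumed placeholder, not a value taken from this notebook; check the model's documentation for the real limit. The function `fits_in_context` and the `reserved_tokens` headroom are hypothetical names for illustration.

```python
def fits_in_context(doc_tokens: int, context_window: int,
                    reserved_tokens: int = 0) -> bool:
    """True if the document, plus room reserved for the question and the
    answer, fits inside the model's context window."""
    return doc_tokens + reserved_tokens <= context_window


# Assumption: placeholder limit; look up the actual figure for your model.
ASSUMED_CONTEXT_WINDOW = 1_000_000
# total_token_count reported for the cached tax return in the listing above.
DOC_TOKENS = 9811

print(fits_in_context(DOC_TOKENS, ASSUMED_CONTEXT_WINDOW, reserved_tokens=2_000))
```

If the check fails, chunking and retrieval become necessary again.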

In [10]:
cache_name = client.caches.list()[0].name
print(cache_name)

cachedContents/wc0yofrcdjiv47at07bxfy9i0o9qyyimg6suw0pn
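
The listing above shows an `expire_time` roughly one hour after `create_time`, so a cached document will not be available indefinitely. A minimal sketch of a liveness check before reusing a cache name follows; the helper `cache_is_live` and its safety `margin` are hypothetical, and the timestamps are copied from the output above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cache_is_live(expire_time: datetime,
                  now: Optional[datetime] = None,
                  margin: timedelta = timedelta(minutes=1)) -> bool:
    """True if the cache will still exist `margin` from now."""
    now = now or datetime.now(timezone.utc)
    return now + margin < expire_time


# expire_time as printed in the cache listing above
expire_time = datetime(2025, 6, 1, 18, 2, 53, 965026, tzinfo=timezone.utc)

half_past_five = datetime(2025, 6, 1, 17, 30, tzinfo=timezone.utc)
half_past_six = datetime(2025, 6, 1, 18, 30, tzinfo=timezone.utc)
print(cache_is_live(expire_time, now=half_past_five))  # True: still within the hour
print(cache_is_live(expire_time, now=half_past_six))   # False: already expired
```

When the check returns `False`, the document would need to be uploaded and cached again.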


## Queries that use cached content

Each of these prompts is answered against the full cached PDF.

In [11]:
# Generate content using the cached prompt and document
def answer_question(prompt: str, cached_tax_return: str) -> str:
    response = client.models.generate_content(
      model=GEMINI,
      contents=prompt,
      config=types.GenerateContentConfig(
        cached_content=cached_tax_return
      ))
    print(f'{response.usage_metadata=}')
    return response.text

answer_question("How much did Obama claim in business expenses?", cache_name)

response.usage_metadata=GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=7), ModalityTokenCount(modality=<MediaModality.DOCUMENT: 'DOCUMENT'>, token_count=9804)], cached_content_token_count=9811, candidates_token_count=25, candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=25)], prompt_token_count=9820, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.DOCUMENT: 'DOCUMENT'>, token_count=9804), ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=16)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=9845, traffic_type=None)


'According to the Schedule C form provided, Obama claimed $6,708 in business expenses (line 28).'

In [12]:
answer_question("Did Obama make any retirement plan contributions?", cache_name)

response.usage_metadata=GenerateContentResponseUsageMetadata(cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=7), ModalityTokenCount(modality=<MediaModality.DOCUMENT: 'DOCUMENT'>, token_count=9804)], cached_content_token_count=9811, candidates_token_count=43, candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=43)], prompt_token_count=9819, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=15), ModalityTokenCount(modality=<MediaModality.DOCUMENT: 'DOCUMENT'>, token_count=9804)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=9862, traffic_type=None)


'Yes, according to line 28 on form 1040, Obama made self-employed SEP, SIMPLE, and qualified plans contributions. The amount of contribution was $17,400.'

Note that the latter query used 9862 total tokens, of which 9811 came from the cache.
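
To make that proportion concrete, the arithmetic can be sketched as below. The two numbers are the `cached_content_token_count` and `total_token_count` values from the usage metadata printed above; the helper name `cached_share` is hypothetical.

```python
def cached_share(cached_tokens: int, total_tokens: int) -> float:
    """Fraction of a request's tokens that were served from the cache."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return cached_tokens / total_tokens


# Values from the usage metadata of the retirement-plan query above.
share = cached_share(cached_tokens=9811, total_tokens=9862)
print(f"{share:.1%}")  # 99.5%
```

Only the short question and the generated answer fall outside the cache.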