[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb)

#### [LangChain Handbook](https://www.pinecone.io/learn/series/langchain/)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. Even the most powerful LLMs in the world are not up to date with recent world events.

LLMs are frozen in time: their knowledge is a static snapshot of the world as it appeared in their training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.

[![Open fast notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/05-langchain-retrieval-augmentation-fast.ipynb)

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [22]:
!pip install -qU \
    datasets==3.6.0 \
    langchain==0.3.25 \
    langchain-openai==0.3.22 \
    langchain-pinecone==0.2.8 \
    tiktoken==0.9.0

## Building the Knowledge Base

In [2]:
from datasets import load_dataset

data = load_dataset(
    "aurelio-ai/reddit-finance",
    split="train",
)
data

Dataset({
    features: ['id', 'subreddit', 'title', 'selftext'],
    num_rows: 107
})

In [3]:
data[5]

{'id': '1k7782l',
 'subreddit': 'stocks',
 'title': 'Waymo reports 250,000 paid robotaxi rides per week in U.S.',
 'selftext': 'Alphabet\xa0reported Thursday that Waymo, its autonomous vehicle unit, is now delivering more than 250,000 paid robotaxi rides per week in the U.S.\n\nCEO Sundar Pichai said Waymo has options in terms of “business models across geographies,” and the robotaxi company is building partnerships with ride-hailing app Uber, automakers and operations and maintenance businesses that tend to its vehicle fleets.\n\n“We can’t possibly do it all ourselves,” said Pichai on a call with analysts for Alphabet’s\xa0first-quarter earnings.\xa0\n\nPichai noted that Waymo has not entirely defined its long-term business model, and there is “future optionality around personal ownership” of vehicles equipped with Waymo’s self-driving technology. The company is also exploring the ways it can scale up its operations, he said.\n\nThe 250,000 paid rides per week are up from 200,000 in F

Many records contain *a lot* of text. Our first task is therefore to identify a good preprocessing methodology for chunking these articles into more "concise" chunks to later be embedded and stored in our Pinecone vector database.

For this we use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of a specified max length.

In [4]:
import tiktoken

# We use gpt-4.1-mini as standard but tiktoken does not support gpt-4.1.
# Fortunately, 4.1 and 4o models all use the same underlying tokenizer and so
# we can use gpt-4o here
encoding = tiktoken.encoding_for_model('gpt-4o')

The tokenizer encoding that we'll use is:

In [5]:
encoding.name

'o200k_base'

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding(encoding.name)

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

27

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

In [8]:
chunks = text_splitter.split_text(data[5]['selftext'])
chunks

['Alphabet\xa0reported Thursday that Waymo, its autonomous vehicle unit, is now delivering more than 250,000 paid robotaxi rides per week in the U.S.\n\nCEO Sundar Pichai said Waymo has options in terms of “business models across geographies,” and the robotaxi company is building partnerships with ride-hailing app Uber, automakers and operations and maintenance businesses that tend to its vehicle fleets.\n\n“We can’t possibly do it all ourselves,” said Pichai on a call with analysts for Alphabet’s\xa0first-quarter earnings.\xa0\n\nPichai noted that Waymo has not entirely defined its long-term business model, and there is “future optionality around personal ownership” of vehicles equipped with Waymo’s self-driving technology. The company is also exploring the ways it can scale up its operations, he said.\n\nThe 250,000 paid rides per week are up from 200,000 in February, before Waymo opened in\xa0Austin\xa0and expanded in the\xa0San Francisco Bay Area\xa0in March.\xa0\n\nWaymo, which is

In [9]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

(279, 291)

Using the `text_splitter` we get much better sized chunks of text. We'll use this functionality during the indexing process later. Now let's take a look at embedding.
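To make the splitting behavior more concrete, here is a simplified pure-Python sketch of the recursive idea: try the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is not LangChain's implementation. It ignores `chunk_overlap`, measures length in characters by default rather than tokens, and the name `recursive_split` is hypothetical.

```python
def recursive_split(text, max_len, separators, length_fn=len):
    """Split text into pieces of at most max_len, coarse separators first.

    Simplified sketch: no overlap between chunks, and separators are
    dropped at chunk boundaries.
    """
    if length_fn(text) <= max_len:
        return [text] if text else []
    if not separators:
        # nothing left to split on, return the oversized piece as-is
        return [text]

    sep, finer = separators[0], separators[1:]
    # an empty separator means "last resort": split into single characters
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length_fn(candidate) <= max_len:
            # the part still fits, keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
        if length_fn(part) <= max_len:
            current = part
        else:
            # the part alone is too long, retry with finer separators
            chunks.extend(recursive_split(part, max_len, finer, length_fn))
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo_text = "aaa bbb\n\nccc ddd eee\n\nfff"
demo_chunks = recursive_split(demo_text, 7, ["\n\n", "\n", " ", ""])
print(demo_chunks)
```

Passing a token-counting function such as `tiktoken_len` as `length_fn` would make the limit apply to tokens instead of characters.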

## Creating Embeddings

Building embeddings using LangChain's OpenAI embedding support is fairly straightforward. We first need to add our OpenAI API key by running the next cell:

In [10]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") \
    or getpass("Enter your OpenAI API key: ")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

Enter your OpenAI API key: ··········


*(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)*

After initializing the API key we can initialize our `text-embedding-3-small` embedding model like so:

In [11]:
from langchain_openai import OpenAIEmbeddings

model_name = 'text-embedding-3-small'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now we embed some text like so:

In [12]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get *two* (aligning to our two chunks of text) 1536-dimensional embeddings.
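Embeddings become useful once we compare them. The sketch below ranks a few documents against a query by dot product, which is the metric this notebook's index is created with. The 3-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are hypothetical stand-ins for real 1536-dimensional embeddings. OpenAI's documentation describes its embeddings as normalized to length 1, and for unit-length vectors the dot product equals cosine similarity, which the sketch also checks.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, valid for vectors of any non-zero length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# hypothetical toy "embeddings", normalized to unit length
query_vec = normalize([1.0, 0.0, 1.0])
doc_vecs = {
    "doc_a": normalize([1.0, 0.1, 0.9]),
    "doc_b": normalize([0.0, 1.0, 0.0]),
    "doc_c": normalize([0.5, 0.5, 0.5]),
}

# score every document against the query, highest score first
scores = {name: dot(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```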

Now we move on to initializing our Pinecone vector database.

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [13]:
import os
from getpass import getpass

from pinecone import Pinecone

os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") \
    or getpass("Enter your Pinecone API key: ")

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# configure client
pc = Pinecone(api_key=PINECONE_API_KEY)

Enter your Pinecone API key: ··········


Then we initialize the index. We will be using OpenAI's `text-embedding-3-small` model for creating the embeddings, so we set the `dimension` to `1536`.

In [24]:
from pinecone import AwsRegion, CloudProvider, Metric, ServerlessSpec

index_name = 'langchain-retrieval-augmentation'

# check if index already exists (it shouldn't if this is first time)
if not pc.has_index(name=index_name):
    # if does not exist, create index
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimensionality of text-embedding-3-small
        metric=Metric.DOTPRODUCT,
        spec=ServerlessSpec(
            cloud=CloudProvider.AWS,
            region=AwsRegion.US_EAST_1
        )
    )

# connect to index
index = pc.Index(name=index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

We should see that the new Pinecone index has a `total_vector_count` of `0`, as we haven't added any vectors yet.

## Indexing

We can perform the indexing task using the LangChain vector store object or via the Pinecone Python client directly. Here, we will do this via the Pinecone client, upserting our records in batches of `100` or more chunks.

In [25]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    url = f"https://reddit.com/r/{record['subreddit']}/comments/{record['id']}"
    metadata = {
        'thread_id': str(record['id']),
        'source': url,
        'subreddit': record['subreddit']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['selftext'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

We've now indexed everything. We can check the number of vectors in our index like so:

In [29]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'': {'vector_count': 170}},
 'total_vector_count': 170,
 'vector_type': 'dense'}
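The accumulate-and-flush pattern used in the indexing loop can be isolated into a small generator. This is a minimal sketch with a hypothetical helper name, `iter_batches`. Unlike the loop above, which flushes only after a whole record's chunks have been added and can therefore exceed `batch_limit`, this helper caps every batch at exactly `batch_size` items.

```python
def iter_batches(items, batch_size):
    """Yield successive lists containing at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # flush whatever is left over after the loop ends
    if batch:
        yield batch


demo_batches = list(iter_batches(range(7), 3))
print(demo_batches)
```

Each yielded batch could then be embedded and upserted in a single call, as the notebook does with `embed.embed_documents` and `index.upsert`.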

## Creating a Vector Store and Querying

Now that we've built our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

In [30]:
from langchain_pinecone import PineconeVectorStore

# initialize the vector store object
vectorstore = PineconeVectorStore(index=index, embedding=embed)

In [31]:
query = "how many robotaxi rides did waymo report in the US?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(id='1f8c9fe9-cbf9-48e4-81fc-a1528faed668', metadata={'chunk': 0.0, 'source': 'https://reddit.com/r/stocks/comments/1k7782l', 'subreddit': 'stocks', 'thread_id': '1k7782l'}, page_content='Alphabet\xa0reported Thursday that Waymo, its autonomous vehicle unit, is now delivering more than 250,000 paid robotaxi rides per week in the U.S.\n\nCEO Sundar Pichai said Waymo has options in terms of “business models across geographies,” and the robotaxi company is building partnerships with ride-hailing app Uber, automakers and operations and maintenance businesses that tend to its vehicle fleets.\n\n“We can’t possibly do it all ourselves,” said Pichai on a call with analysts for Alphabet’s\xa0first-quarter earnings.\xa0\n\nPichai noted that Waymo has not entirely defined its long-term business model, and there is “future optionality around personal ownership” of vehicles equipped with Waymo’s self-driving technology. The company is also exploring the ways it can scale up its operations,

All of these are good, relevant results. But what can we do with them? There are many possible tasks, and one of the most interesting (and well supported by LangChain) is _"Retrieval Augmented Generation"_, or RAG.

## Retrieval Augmented Generation

In RAG we treat the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we initialize a retrieval pipeline like so:

In [32]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-4.1-mini',
    temperature=0.0
)

# Create prompt template
template = """Answer the question based on the following context:

{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Create LCEL chain
retrieval_chain = (
    {"context": vectorstore.as_retriever(), "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)

print(query)

retrieval_chain.invoke(query)

how many robotaxi rides did waymo report in the US?


'Waymo reported delivering more than 250,000 paid robotaxi rides per week in the U.S.'
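Conceptually, the chain retrieves documents, fills `{context}` and `{question}` in the template, sends the result to the LLM, and parses the reply into a string. The sketch below covers only the template-filling step in plain Python, with no LangChain or API calls. The helper `build_prompt` and the retrieved snippets are hypothetical, and joining the snippets with blank lines is an assumption about formatting rather than what the chain above does internally.

```python
TEMPLATE = """Answer the question based on the following context:

{context}

Question: {question}

Answer:"""


def build_prompt(question, retrieved_texts):
    """Fill the prompt template with retrieved text and the user question."""
    context = "\n\n".join(retrieved_texts)
    return TEMPLATE.format(context=context, question=question)


# hypothetical snippets standing in for what the retriever returns
retrieved = [
    "Waymo is now delivering more than 250,000 paid robotaxi rides per week in the U.S.",
    "The 250,000 paid rides per week are up from 200,000 in February.",
]
demo_prompt = build_prompt(
    "how many robotaxi rides did waymo report in the US?", retrieved
)
print(demo_prompt)
```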

We can also include the sources of information that the LLM is using to answer our question:

In [35]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create prompt template with source formatting
template = """Answer the question based on the following context. Include the source URLs in your answer.

{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Create LCEL chain with source formatting
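def format_docs(docs):
    # prefix each chunk with its source URL so the LLM can cite it;
    # use real newlines ("\n"), not escaped backslashes, as separators
    return "\n\n".join(
        f"Source: {doc.metadata.get('source', 'Unknown')}\n{doc.page_content}"
        for doc in docs
    )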

retrieval_chain_with_sources = (
    {"context": vectorstore.as_retriever() | format_docs, "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)

print(f"Question: {query}\n")
print(retrieval_chain_with_sources.invoke(query))

Question: how many robotaxi rides did waymo report in the US?

Waymo reported delivering more than 250,000 paid robotaxi rides per week in the U.S. This figure is an increase from 200,000 rides per week in February, following Waymo's expansion into Austin and the San Francisco Bay Area in March.

Sources:  
- https://reddit.com/r/stocks/comments/1k7782l  
- https://www.cnbc.com/2025/04/24/waymo-reports-250000-paid-robotaxi-rides-per-week-in-us.html


The chain now answers the question *and* cites the sources of the information the LLM used.
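To see what the `{context}` string looks like once sources are attached, the formatting step can be run on its own. The sketch below uses a hypothetical `StubDoc` dataclass in place of LangChain's `Document`, and joins entries with real newline characters so that each source sits on its own line above its text.

```python
from dataclasses import dataclass, field


@dataclass
class StubDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Render each doc as a 'Source: <url>' line followed by its text."""
    return "\n\n".join(
        f"Source: {doc.metadata.get('source', 'Unknown')}\n{doc.page_content}"
        for doc in docs
    )


stub_docs = [
    StubDoc(
        "Waymo is now delivering more than 250,000 paid robotaxi rides per week in the U.S.",
        {"source": "https://reddit.com/r/stocks/comments/1k7782l"},
    ),
    # a doc without source metadata falls back to 'Unknown'
    StubDoc("A chunk with no source metadata."),
]
demo_context = format_docs(stub_docs)
print(demo_context)
```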

Delete the index to save resources when you're done!

In [36]:
pc.delete_index(index_name)

---