[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb)

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as it appeared in the model's training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
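The retrieve-then-augment idea can be sketched without any external services. The toy below is only an illustration under loud assumptions: the three-sentence knowledge base, the word-overlap scoring, and the prompt wording are all made up for this example, whereas the rest of the notebook uses embeddings, Pinecone, and an LLM.

```python
# Toy illustration of retrieval augmentation. Everything here is a stand-in:
# a real system would rank documents with embeddings and a vector database,
# and would send the final prompt to an LLM.

knowledge_base = [
    "Benito Mussolini was the Prime Minister of Italy from 1922 until 1943.",
    "Alan Turing was an English mathematician and computer scientist.",
    "April is the fourth month of the year.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, contexts: list) -> str:
    """Place the retrieved text in front of the question."""
    context_block = "\n".join(contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


question = "Who was Benito Mussolini?"
contexts = retrieve(question, knowledge_base)
prompt = augment(question, contexts)
print(prompt)
```

The `prompt` string is what would be handed to the LLM, so the model answers from the retrieved text instead of relying only on its training data.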

[![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb)

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [78]:
!pip install -qU \
  pinecone==5.4.2 \
  pinecone-datasets==1.0.2 \
  pinecone-notebooks==0.1.1 \
  langchain==0.3.20 \
  langchain-openai==0.3.9 \
  langchain-pinecone==0.2.3 \
  tqdm

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Building the Knowledge Base

We will download a pre-embedded dataset from `pinecone-datasets`, which lets us skip the embedding and preprocessing steps. If you'd rather work through those steps, you can find the [full notebook here](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb).

The dataset we will be working with in this demo contains 50K chunks of Simple English Wikipedia articles that have been embedded using OpenAI's `text-embedding-ada-002` embedding model. This model produces embeddings with a dimension of 1536.

In [11]:
from pinecone_datasets import load_dataset

dataset = load_dataset('wikipedia-simple-text-embedding-ada-002-50K')

# Drop the sparse_values and blob columns, as they are not needed for this example
dataset.documents.drop(columns=['sparse_values', 'blob'], inplace=True)

dataset.head()

Loading documents parquet files: 100%|██████████| 5/5 [01:21<00:00, 16.22s/it]


| | id | values | metadata |
|---|---|---|---|
| 0 | 1-0 | [-0.011254455894231796, -0.01698738895356655, ... | {'chunk': 0, 'source': 'https://simple.wikiped... |
| 1 | 1-1 | [-0.0015197008615359664, -0.007858820259571075... | {'chunk': 1, 'source': 'https://simple.wikiped... |
| 2 | 1-2 | [-0.009930099360644817, -0.012211072258651257,... | {'chunk': 2, 'source': 'https://simple.wikiped... |
| 3 | 1-3 | [-0.011600767262279987, -0.012608098797500134,... | {'chunk': 3, 'source': 'https://simple.wikiped... |
| 4 | 1-4 | [-0.026462381705641747, -0.016362832859158516,... | {'chunk': 4, 'source': 'https://simple.wikiped... |
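Because every record carries an `id`, a `values` vector, and a `metadata` dict, it can be worth checking that shape before sending anything to a vector database. The helper below is a hypothetical sketch, not part of `pinecone-datasets`; it works on plain dicts and assumes the 1536 dimensions stated above.

```python
EXPECTED_DIM = 1536  # dimension of text-embedding-ada-002 embeddings


def validate_record(record: dict, dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if a record lacks a required key or has the wrong dimension."""
    missing = {"id", "values", "metadata"} - set(record)
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    if len(record["values"]) != dim:
        raise ValueError(
            f"record {record['id']!r} has {len(record['values'])} dimensions, "
            f"expected {dim}"
        )


# Made-up records that mimic the layout of the dataset rows
good_record = {"id": "1-0", "values": [0.0] * EXPECTED_DIM, "metadata": {"chunk": 0}}
bad_record = {"id": "1-1", "values": [0.0] * 3, "metadata": {"chunk": 1}}

validate_record(good_record)  # passes silently

try:
    validate_record(bad_record)
except ValueError as err:
    validation_error = str(err)
    print(validation_error)
```

With the real dataset, the same check could be run over `dataset.documents.to_dict(orient="records")`, assuming the columns shown in the table.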


Now we move on to initializing our Pinecone vector database.

## Initializing the Pinecone client

Now that the data is ready, we need to set up an index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [3]:
import os

if not os.environ.get("PINECONE_API_KEY"):
    from pinecone_notebooks.colab import Authenticate
    Authenticate()

In [4]:
from pinecone import Pinecone

# Instantiate a Pinecone client
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))



### Creating a Pinecone Index

When creating the index we need to define several configuration properties. 

- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on. 
- `metric` specifies the similarity metric that will be used later when you make queries to the index.
- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. Here the vectors come from `text-embedding-ada-002`, so we use 1536.
- `spec` holds a specification that tells Pinecone how we would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

There are more configurations available, but this minimal set will get us started.
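To make the `metric` choice more concrete, the sketch below compares dot product and cosine similarity on made-up two-dimensional vectors (real embeddings here have 1536 dimensions). It shows that the two metrics can rank documents differently when vector lengths differ, and that they agree once vectors are scaled to unit length. OpenAI describes its embeddings as normalized to length 1, so under that assumption `dotproduct` and cosine should produce the same ranking for this dataset.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


# Made-up vectors for illustration only
query = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, length 1
long_doc = [3.0, 3.0]   # different direction, but much longer

# Dot product rewards vector length, so the long vector scores higher
print(dot(query, short_doc), dot(query, long_doc))

# Cosine ignores length, so the vector pointing the same way scores higher
print(cosine(query, short_doc), cosine(query, long_doc))

# After normalization the dot product equals the cosine similarity
unit_long = normalize(long_doc)
print(dot(query, unit_long))
```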

In [5]:
from pinecone import ServerlessSpec

index_name = 'langchain-retrieval-augmentation-fast'

if not pc.has_index(name=index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='dotproduct',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

pc.describe_index(name=index_name)

{
    "name": "langchain-retrieval-augmentation-fast",
    "dimension": 1536,
    "metric": "dotproduct",
    "host": "langchain-retrieval-augmentation-fast-dojoi3u.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "deletion_protection": "disabled"
}

## Upserting data into the index

In [6]:
# Instantiate an Index client
index = pc.Index(name=index_name)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 50000}},
 'total_vector_count': 50000}

On a freshly created index, `total_vector_count` is `0` because no vectors have been added yet. The output above reports 50,000, which suggests it was captured after the index had already been populated in an earlier run.

Now we upsert the data to Pinecone:

In [16]:
from tqdm import tqdm

batch_size = 100

for start in tqdm(range(0, len(dataset.documents), batch_size), "Upserting records batch"):
    batch = dataset.documents.iloc[start:start + batch_size].to_dict(orient="records")
    index.upsert(vectors=batch)

Upserting records batch: 100%|██████████| 500/500 [07:07<00:00,  1.17it/s]
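The batching arithmetic in the loop above can be checked in isolation. This is a minimal sketch using a plain list as a stand-in for the dataset rows; `batched` is a hypothetical helper written for this example, not a Pinecone function.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


records = list(range(50_000))  # stand-in for the 50K dataset rows
batches = list(batched(records, 100))
print(len(batches))  # 500, matching the 500 steps in the progress bar

# When the total is not a multiple of the batch size, the last batch is smaller
uneven_sizes = [len(batch) for batch in batched(list(range(250)), 100)]
print(uneven_sizes)  # [100, 100, 50]
```

Slicing past the end of a sequence is safe in Python, which is why the loop needs no special case for the final batch.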


We've now indexed everything. We can check the number of vectors in our index again using `describe_index_stats()`.

We may see that the `total_vector_count` is a bit less than the total we expect (50K). This is because Pinecone is eventually consistent and not all vectors may be reflected in the index yet.

In [2]:
index.describe_index_stats()

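Since the count can lag behind the upserts, one option is to poll until the expected total appears. The sketch below is an assumption-heavy illustration: `wait_for_count` is a hypothetical helper, and the counter is faked so the example runs without a Pinecone connection.

```python
import time
from typing import Callable


def wait_for_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> int:
    """Poll get_count() until it reaches `expected` or the timeout passes.

    Returns the last observed count, which may be below `expected` on timeout.
    """
    deadline = time.monotonic() + timeout_s
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        time.sleep(poll_s)
        count = get_count()
    return count


# Fake counter that simulates an index catching up over three checks.
# With a real index, get_count might be something like
# lambda: index.describe_index_stats()["total_vector_count"] (an assumption).
reported_counts = iter([49_800, 49_950, 50_000])
final_count = wait_for_count(lambda: next(reported_counts), expected=50_000, poll_s=0.01)
print(final_count)
```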

## Creating a LangChain Vector Store and Querying

Now that we've built our index, we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object, which we initialize like so:

In [7]:
from langchain_openai import OpenAIEmbeddings

# Get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now initialize the vector store:

In [8]:
from langchain_pinecone import PineconeVectorStore

pinecone_vectorstore = PineconeVectorStore(
    index_name=index_name, 
    embedding=embed, 
    text_key="text"
)
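The `text_key` argument names the metadata field that holds each chunk's raw text. As a rough illustration of the idea (not the library's actual code), the sketch below splits a stored metadata dict into a `page_content` string plus the remaining metadata. The sample record and the `match_to_document` helper are hypothetical.

```python
def match_to_document(metadata: dict, text_key: str = "text") -> dict:
    """Split stored metadata into page_content and the remaining metadata."""
    remaining = dict(metadata)  # copy so the caller's dict is not modified
    page_content = remaining.pop(text_key)
    return {"page_content": page_content, "metadata": remaining}


# Made-up metadata shaped like the records in this dataset
stored_metadata = {
    "chunk": 0,
    "source": "https://simple.wikipedia.org/wiki/Benito%20Mussolini",
    "title": "Benito Mussolini",
    "text": "Benito Mussolini was an Italian politician and journalist.",
}

document = match_to_document(stored_metadata, text_key="text")
print(document["page_content"])
print(sorted(document["metadata"]))
```

If `text_key` points at a field that does not exist in the metadata, this sketch raises a `KeyError`, which is a reminder that the key must match how the data was stored.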

Now we can query the vector store directly using `pinecone_vectorstore.similarity_search`:

In [9]:
from pprint import pprint

documents = pinecone_vectorstore.similarity_search(
    query="Who was Benito Mussolini?",  # our search query
    k=3  # return 3 most relevant docs
)

for doc in documents:
    pprint(doc.__dict__)
    print()

{'id': '6754-0',
 'metadata': {'chunk': 0.0,
              'source': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini',
              'title': 'Benito Mussolini',
              'wiki-id': '6754'},
 'page_content': 'Benito Amilcare Andrea Mussolini KSMOM GCTE (29 July 1883 – '
                 '28 April 1945) was an Italian politician and journalist. He '
                 'was also the Prime Minister of Italy from 1922 until 1943. '
                 'He was the leader of the National Fascist Party.\n'
                 '\n'
                 'Biography\n'
                 '\n'
                 'Early life\n'
                 'Benito Mussolini was named after Benito Juarez, a Mexican '
                 'opponent of the political power of the Roman Catholic '
                 'Church, by his anticlerical (a person who opposes the '
                 'political interference of the Roman Catholic Church in '
                 "secular affairs) father. Mussolini's father was a "
           

All of these are good, relevant results. But what can we do with them? There are many possible tasks; one of the most interesting (and well supported by LangChain) is _"Generative Question-Answering"_, or GQA.

## Generative Question-Answering

In GQA we pass the query to an LLM as a question, but the LLM must base its answer on the documents returned from the vector store (`pinecone_vectorstore`).

To do this we initialize a `RetrievalQA` object like so:

In [10]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Chat Completion LLM
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-4.5-preview',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=pinecone_vectorstore.as_retriever()
)
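The `"stuff"` chain type places all retrieved documents into a single prompt. The sketch below shows that idea in plain Python. The template wording and the `stuff_prompt` helper are assumptions made for illustration and are not LangChain's actual prompt; the documents are dict stand-ins for LangChain `Document` objects.

```python
def stuff_prompt(question: str, docs: list) -> str:
    """Join every retrieved document into one context block ahead of the question."""
    context = "\n\n".join(doc["page_content"] for doc in docs)
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up stand-ins for retrieved documents
retrieved_docs = [
    {"page_content": "Benito Mussolini was an Italian politician and journalist."},
    {"page_content": "He was the leader of the National Fascist Party."},
]

stuffed = stuff_prompt("Who was Benito Mussolini?", retrieved_docs)
print(stuffed)
```

Because every document is included in full, this approach only works while the combined text fits within the model's context window.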

In [12]:
qa.invoke("Who was Benito Mussolini?")

{'query': 'Who was Benito Mussolini?',
 'result': 'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and established a dictatorship in Italy, calling himself "Il Duce" (meaning "leader" in Italian). Initially a socialist, Mussolini was expelled from the socialist party due to his support for Italy\'s involvement in World War I. He then created a new political ideology known as Fascism, which emphasized nationalism, authoritarianism, and strong centralized control.\n\nMussolini rose to power in 1922 after organizing the "March on Rome," where his supporters, known as the "Black Shirts," threatened to seize control of the government. King Vittorio Emanuele III appointed him Prime Minister, and by 1927 Mussolini had consolidated power, establishing a dictatorship enforced by his secret police, the OVRA.\n\nMussolini aimed to restore Italy to the glory of the ancient Roma

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [13]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=pinecone_vectorstore.as_retriever(),
    return_source_documents=True
)

In [None]:
qa_with_sources.invoke("Who was Benito Mussolini?")

{'question': 'Who was Benito Mussolini?',
 'answer': 'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and established a dictatorship in Italy. Mussolini initially was a socialist but later created the ideology of Fascism, combining nationalist and conservative views. He rose to power through the March on Rome in 1922, eventually becoming dictator by 1927. Mussolini aimed to create a new Roman Empire, engaging in aggressive military actions such as invading Abyssinia (Ethiopia) and Albania. He allied with Adolf Hitler during World War II, forming part of the Axis Powers. Mussolini was deposed in 1943, briefly reinstated by the Germans as head of a puppet state, and was ultimately captured and executed by partisans in 1945.\n\n',
 'sources': '',
 'source_documents': [Document(id='6754-0', metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/Benito%20Mu

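The chain now answers the question *and* returns the documents that were given to the LLM. Note that in this run the `sources` field is empty; the retrieved documents, including their `source` URLs, are listed under `source_documents`.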
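A common follow-up step is to turn the returned source documents into a short list of citations. This is a minimal sketch that assumes each document exposes a `metadata` dict with a `source` field, as in the output above. Plain dicts are used as stand-ins, so with real LangChain `Document` objects the lookup would be `doc.metadata` instead of `doc["metadata"]`.

```python
def unique_sources(source_documents: list) -> list:
    """Collect distinct source URLs, keeping the order in which they first appear."""
    seen = set()
    ordered = []
    for doc in source_documents:
        url = doc["metadata"].get("source")
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Made-up stand-ins: two chunks from one article and one chunk from another
sample_documents = [
    {"id": "6754-0", "metadata": {"source": "https://simple.wikipedia.org/wiki/Benito%20Mussolini"}},
    {"id": "6754-1", "metadata": {"source": "https://simple.wikipedia.org/wiki/Benito%20Mussolini"}},
    {"id": "example-0", "metadata": {"source": "https://simple.wikipedia.org/wiki/Italy"}},
]

citations = unique_sources(sample_documents)
print(citations)
```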

## Demo cleanup

Once you're done, you can free up resources by deleting your index. Deleting an index is a permanent operation and cannot be undone.

In [None]:
pc.delete_index(name=index_name)

---