[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb)

# Building RAG Chatbots with LangChain

In this example, we'll build an AI chatbot from start to finish. We will use LangChain, OpenAI, and the Pinecone vector database to build a chatbot capable of learning from the external world using **R**etrieval **A**ugmented **G**eneration (RAG).

We will be using a dataset sourced from the Deepseek R1 ArXiv paper to help our chatbot answer questions about the latest and greatest in the world of AI.

By the end of the example we'll have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on a knowledge base.

### Before you begin

You'll need to get an [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io).

### Prerequisites

Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain** and **langchain-community**: LangChain is a library for building GenAI applications. We'll use it to chain together the language model and the other components of our chatbot.
- **langchain-openai**: This is LangChain's OpenAI integration. We'll use it to call OpenAI's chat and embedding models.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **langchain-pinecone**: This is LangChain's Pinecone integration. Together with the Pinecone Python client (imported later as `pinecone`), we'll use it to store our chatbot's knowledge base in a vector database.

You can install these libraries using pip like so:

In [1]:
!pip install -qU \
    langchain==0.3.23 \
    langchain-community==0.3.21 \
    langchain-pinecone==0.2.5 \
    langchain-openai==0.3.12 \
    datasets==3.5.0

### Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [2]:
import os
from langchain_openai import ChatOpenAI
from getpass import getpass

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "Enter your OpenAI API key: "
)

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-4o-mini'
)

Chats with OpenAI's chat models, such as `gpt-4o-mini`, are typically structured (in plain text) like this:

```
System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

Assistant:
```

The final `"Assistant:"` without a response is what would prompt the model to continue the conversation. In the official OpenAI `ChatCompletion` endpoint these would be passed to the model in a format like:

```python
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."}
]
```

In LangChain there is a slightly different format. We use three _message_ objects like so:

In [3]:
from langchain_core.messages import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand string theory.")
]

The format is very similar; we've just swapped the role of `"user"` for `HumanMessage`, and the role of `"assistant"` for `AIMessage`.
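
The correspondence between the two formats can be sketched in plain Python. This is only an illustration: `Msg` is a hypothetical stand-in for LangChain's message classes (which expose `.type` and `.content` attributes), and `to_openai_format` is a helper name invented here, not part of either library.

```python
from collections import namedtuple

# Hypothetical stand-in for LangChain message objects (.type and .content).
Msg = namedtuple("Msg", ["type", "content"])

# LangChain message type -> OpenAI chat role
ROLE_BY_TYPE = {"system": "system", "human": "user", "ai": "assistant"}


def to_openai_format(messages):
    """Convert message-like objects into OpenAI-style role/content dicts."""
    return [
        {"role": ROLE_BY_TYPE[m.type], "content": m.content}
        for m in messages
    ]


history = [
    Msg("system", "You are a helpful assistant."),
    Msg("human", "Hi AI, how are you today?"),
    Msg("ai", "I'm great thank you. How can I help you?"),
]
openai_messages = to_openai_format(history)
print(openai_messages[1])
```

In practice LangChain performs this conversion for us when the messages are sent to the model, so a helper like this is only useful for understanding the mapping.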

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [4]:
res = chat.invoke(messages)
res


AIMessage(content='String theory is a theoretical framework in physics that attempts to unify all fundamental forces of nature, including gravity, quantum mechanics, electromagnetism, and the strong and weak nuclear forces. Here are some key concepts to help you understand string theory better:\n\n1. **Basic Idea**: At its core, string theory posits that the fundamental building blocks of the universe are not point-like particles (as in traditional particle physics) but rather one-dimensional "strings." These strings can vibrate at different frequencies, and the different vibrational modes correspond to different particles.\n\n2. **Dimensions**: String theory requires additional spatial dimensions beyond the familiar three dimensions of space and one dimension of time. In many versions of string theory, there are 10 or 11 total dimensions. The extra dimensions are typically compactified, meaning they are curled up and not directly observable at human scales.\n\n3. **Types of Strings**:

In response we get another AI message object. We can print it more clearly like so:

In [5]:
print(res.content)

String theory is a theoretical framework in physics that attempts to unify all fundamental forces of nature, including gravity, quantum mechanics, electromagnetism, and the strong and weak nuclear forces. Here are some key concepts to help you understand string theory better:

1. **Basic Idea**: At its core, string theory posits that the fundamental building blocks of the universe are not point-like particles (as in traditional particle physics) but rather one-dimensional "strings." These strings can vibrate at different frequencies, and the different vibrational modes correspond to different particles.

2. **Dimensions**: String theory requires additional spatial dimensions beyond the familiar three dimensions of space and one dimension of time. In many versions of string theory, there are 10 or 11 total dimensions. The extra dimensions are typically compactified, meaning they are curled up and not directly observable at human scales.

3. **Types of Strings**: There are open strings (

Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [6]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Why do physicists believe it can produce a 'unified theory'?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat.invoke(messages)

print(res.content)

Physicists believe string theory has the potential to produce a "unified theory" — often called a theory of everything — for several reasons:

1. **Inclusion of Different Forces**: String theory naturally incorporates all four fundamental forces of nature (gravity, electromagnetism, the weak nuclear force, and the strong nuclear force) into one framework. Traditional particle physics, particularly the Standard Model, successfully describes electromagnetic and weak interactions but struggles to include gravity. String theory, with its additional dimensions and fundamental strings, can account for gravity by treating it as a manifestation of strings vibrating in a higher-dimensional space.

2. **Quantum Gravity**: One of the major successes of string theory is that it provides a consistent description of gravity at the quantum level. In contrast, attempts to quantize gravity using conventional methods lead to inconsistencies and infinities. String theory's framework manages to avoid thes
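
The append-and-resend pattern used in the cell above can be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `chat_turn` and `echo_model` are hypothetical names, messages are plain `(role, text)` tuples rather than LangChain objects, and `echo_model` is a stand-in for the real chat model so the example runs without an API key.

```python
def chat_turn(history, user_text, model):
    """Append the user's message, call the model, and append its reply."""
    history.append(("human", user_text))
    reply = model(history)  # the model sees the full conversation so far
    history.append(("ai", reply))
    return reply


def echo_model(history):
    # Stand-in for a real chat model: reports how many messages it received.
    return f"received {len(history)} messages"


history = [("system", "You are a helpful assistant.")]
first = chat_turn(history, "I'd like to understand string theory.", echo_model)
second = chat_turn(history, "Why might it unify physics?", echo_model)
print(first)
print(second)
```

The point of the sketch is that every call resends the whole history, so the context grows by two messages per turn.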

### Dealing with Hallucinations

We have our chatbot, but the knowledge of LLMs is limited. LLMs learn everything they know during training: an LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, like about Deepseek R1.

In [7]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Deepseek R1?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat.invoke(messages)

In [8]:
print(res.content)

As of my last knowledge update in October 2023, the Deepseek R1 is a notable advancement in the field of underwater exploration and robotics. Here are several features and aspects that may make the Deepseek R1 special:

1. **Versatile Underwater Capabilities**: The Deepseek R1 is designed to operate effectively in various underwater environments, making it valuable for research, exploration, inspection, and surveillance tasks in both freshwater and saltwater.

2. **High-Quality Imaging**: Often equipped with advanced imaging technology, the Deepseek R1 may offer high-resolution cameras, sonar systems, or even 3D imaging capabilities, allowing for detailed exploration and analysis of underwater environments.

3. **Autonomous Operation**: Many modern underwater drones, including the Deepseek R1, are capable of autonomous navigation. This means they can map, explore, and even conduct tasks without constant human control, which is vital for missions in challenging or hazardous underwater c

Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. Here it is clear from the answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it _does_ know the answer, and this can be very hard to detect.

There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the Deepseek question. We can take the paper abstract from the [Deepseek R1 paper](https://arxiv.org/abs/2501.12948).

In [9]:
source_knowledge = (
    "We introduce our first-generation reasoning models, DeepSeek-R1-Zero and "
    "DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale "
    "reinforcement learning (RL) without supervised fine-tuning (SFT) as a "
    "preliminary step, demonstrates remarkable reasoning capabilities. Through "
    "RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and "
    "intriguing reasoning behaviors. However, it encounters challenges such as "
    "poor readability, and language mixing. To address these issues and "
    "further enhance reasoning performance, we introduce DeepSeek-R1, which "
    "incorporates multi-stage training and cold-start data before RL. "
    "DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on "
    "reasoning tasks. To support the research community, we open-source "
    "DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, "
    "32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama."
)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [10]:
query = "What is so special about Deepseek R1?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now we feed this into our chatbot as we did before.

In [11]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat.invoke(messages)

In [12]:
print(res.content)

DeepSeek R1 is special due to its enhanced reasoning capabilities achieved through a multi-stage training process. Unlike its predecessor, DeepSeek-R1-Zero, which was trained solely using large-scale reinforcement learning (RL), DeepSeek R1 incorporates additional steps, including cold-start data before the RL phase. This approach effectively addresses challenges like poor readability and language mixing seen in R1-Zero, resulting in improved reasoning performance. Consequently, DeepSeek R1 demonstrates performance comparable to well-known models like OpenAI-o1-1217 on reasoning tasks, making it a significant advancement in reasoning models. Additionally, the models are open-sourced, contributing to the research community by providing access to diverse dense models distilled from DeepSeek R1.


This answer is accurate and grounded in the abstract we supplied. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

We learned in the previous chapters about Pinecone and vector databases. Well, they can help us here too. But first, we'll need a dataset.

### Importing the Data

In this task, we will be importing our data. We will be using the Hugging Face Datasets library to load our data. Specifically, we will be using the `"jamescalam/deepseek-r1-paper-chunked"` dataset. This dataset contains the Deepseek R1 paper pre-processed into RAG-ready chunks.

In [13]:
from datasets import load_dataset

dataset = load_dataset(
    "jamescalam/deepseek-r1-paper-chunked",
    split="train"
)

dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'num_tokens', 'pages', 'source'],
    num_rows: 76
})

In [14]:
dataset[0]

{'doi': '2501.12948v1',
 'chunk-id': 1,
 'chunk': "uestion: If a > 1, then the sum of the real solutions of √a - √a + x = x is equal to Response: <think> To solve the equation √a – √a + x = x, let's start by squaring both . . . (√a-√a+x)² = x² ⇒ a - √a + x = x². Rearrange to isolate the inner square root term:(a – x²)² = a + x ⇒ a² – 2ax² + (x²)² = a + x ⇒ x⁴ - 2ax² - x + (a² – a) = 0",
 'num_tokens': 145,
 'pages': [1],
 'source': 'https://arxiv.org/abs/2501.12948'}

#### Dataset Overview

The dataset we are using is sourced from the Deepseek R1 ArXiv paper. Each entry in the dataset represents a "chunk" of text from the R1 paper.

Because most **L**arge **L**anguage **M**odels (LLMs) only contain knowledge of the world as it was during training, even many of the newest LLMs cannot answer questions about Deepseek R1 — at least not without this data.

### Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our Pinecone client. This requires a [free API key](https://app.pinecone.io).

In [15]:
from pinecone import Pinecone

# get API key at app.pinecone.io
api_key = os.getenv("PINECONE_API_KEY") or getpass(
    "Enter your Pinecone API key: "
)

# initialize client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/troubleshooting/available-cloud-regions).

In [16]:
from pinecone import ServerlessSpec, CloudProvider, AwsRegion, Metric

index_name = "deepseek-r1-rag"

if not pc.has_index(name=index_name):
    pc.create_index(
        name=index_name,
        metric=Metric.DOTPRODUCT,
        dimension=1536,  # this aligns with text-embedding-3-small dims
        spec=ServerlessSpec(
            cloud=CloudProvider.AWS,
            region=AwsRegion.US_EAST_1
        )
    )

index = pc.Index(name=index_name)

Our index is now ready but it's empty. It is a vector index, so it needs vectors. To create these vector embeddings we will use OpenAI's `text-embedding-3-small` model, which we can access via LangChain like so:

In [18]:
from langchain_openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

Using this model we can create embeddings like so:

In [19]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.
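
To see how an index configured with the dot product metric would rank such embeddings, here is a toy sketch. The three-dimensional vectors are made up for illustration (real embeddings from this model have 1536 dimensions), and the helper names `dot` and `normalize` are assumptions of this example. For unit-length vectors, the dot product equals cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy 3-dimensional "embeddings" (made up for illustration).
query_vec = normalize([1.0, 0.0, 1.0])
chunks = {
    "chunk-a": normalize([1.0, 0.1, 0.9]),  # points roughly the same way
    "chunk-b": normalize([0.0, 1.0, 0.0]),  # orthogonal to the query
}

# Score every chunk against the query and pick the highest-scoring one.
scores = {chunk_id: dot(query_vec, vec) for chunk_id, vec in chunks.items()}
best = max(scores, key=scores.get)
print(best)
```

A vector database performs the same kind of comparison, but over many vectors and with an index structure that avoids scoring every vector.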

We're now ready to embed and index all of our data! We do this by looping through our dataset, embedding the chunks, and inserting everything in batches.

In [22]:
from tqdm.auto import tqdm  # for progress bar

data = dataset.to_pandas()  # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source']} for _, x in batch.iterrows()
    ]
    # add to Pinecone as a list of (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
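
The batching and ID logic in the loop above can be checked in isolation, without calling OpenAI or Pinecone. This is a minimal sketch: `batch_ranges` and `make_id` are hypothetical helper names that mirror the `range`/`min` arithmetic and the `doi`-`chunk-id` f-string used in the loop.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


def make_id(doi, chunk_id):
    """Build a vector ID in the same '<doi>-<chunk-id>' form as the loop."""
    return f"{doi}-{chunk_id}"


# With 76 chunks and a batch size of 100, everything fits in one batch.
ranges = list(batch_ranges(76, 100))
print(ranges)

# A larger (hypothetical) dataset is split, with a shorter final batch.
larger = list(batch_ranges(250, 100))
print(larger)

print(make_id("2501.12948v1", 1))
```

Because IDs are derived from the data rather than generated randomly, rerunning the upsert overwrites the same records instead of adding duplicates.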


We can check that the vector index has been populated using `describe_index_stats`:

In [23]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'': {'vector_count': 76}},
 'total_vector_count': 76,
 'vector_type': 'dense'}
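
A quick sanity check on these stats is to compare the vector count and dimension with what we expect from the dataset and embedding model. The sketch below works on a plain dictionary copied from the output above; `index_is_populated` is a hypothetical helper, and whether the object returned by the client supports the same key access is an assumption you should verify for your client version.

```python
def index_is_populated(stats, expected_count, expected_dim):
    """Return True if the stats match the expected vector count and dimension."""
    return (
        stats["total_vector_count"] == expected_count
        and stats["dimension"] == expected_dim
    )


# Values copied from the describe_index_stats output shown above.
stats = {
    "dimension": 1536,
    "namespaces": {"": {"vector_count": 76}},
    "total_vector_count": 76,
}

ok = index_is_populated(stats, expected_count=76, expected_dim=1536)
print(ok)
```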

### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to link that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [24]:
from langchain_pinecone import PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index=index,
    embedding=embed_model,
    text_key=text_field
)

Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about Deepseek R1.

In [25]:
query = "What is so special about Deepseek R1?"

vectorstore.similarity_search(query, k=3)

[Document(id='2501.12948v1-39', metadata={'source': 'https://arxiv.org/abs/2501.12948'}, page_content='## 1.2. Summary of Evaluation Results - **Reasoning tasks:** (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-01-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-01-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks. - **Knowledge:** On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly 

We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to link the output from our `vectorstore` to our `chat` chatbot. To do that we can use the same logic as we used earlier.

In [26]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt
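
Because the triple-quoted f-string sits inside the function body, its leading indentation becomes part of the prompt text. This rarely matters to the model, but if you prefer a prompt without stray whitespace, one option is to join the lines explicitly. A minimal sketch, assuming a hypothetical `build_prompt` helper that takes the already-retrieved chunk texts as a list of strings:

```python
def build_prompt(query, contexts):
    """Build the augmented prompt line by line so no indentation leaks in."""
    source_knowledge = "\n".join(contexts)
    return "\n".join([
        "Using the contexts below, answer the query.",
        "",
        "Contexts:",
        source_knowledge,
        "",
        f"Query: {query}",
    ])


# Placeholder strings stand in for the page_content of retrieved documents.
prompt_text = build_prompt(
    "What is so special about Deepseek R1?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt_text)
```

Separating prompt construction from retrieval also makes the formatting easy to test without querying the vector store.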

Using this we produce an augmented prompt:

In [27]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    ## 1.2. Summary of Evaluation Results - **Reasoning tasks:** (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-01-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-01-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks. - **Knowledge:** On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-01-1217 on these be

There is still a lot of text here, so let's pass it onto our chat model to see how it performs.

In [28]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat.invoke(messages)

print(res.content)

DeepSeek-R1 is special for several reasons:

1. **Enhanced Reasoning Performance**: DeepSeek-R1 incorporates a multi-stage training pipeline that leverages both reinforcement learning (RL) and cold-start data. This method addresses the limitations encountered by its predecessor, DeepSeek-R1-Zero, which had issues with readability and language mixing. As a result, DeepSeek-R1 achieves impressive performance on reasoning tasks, surpassing OpenAI-01-1217 with a score of 79.8% Pass@1 on AIME 2024 and a remarkable 97.3% on the MATH-500 benchmark.

2. **Expert Performance in Coding Tasks**: DeepSeek-R1 demonstrates expert-level capabilities in coding-related tasks, achieving an Elo rating of 2,029 on Codeforces, outperforming 96.3% of human competitors. This showcases its practical utility in real-world coding challenges.

3. **Strong Performance on Knowledge Benchmarks**: DeepSeek-R1 excels on various knowledge benchmarks, achieving high scores such as 90.8% on MMLU, 84.0% on MMLU-Pro, and 

We can continue with another Deepseek R1 question:

In [30]:
prompt = HumanMessage(
    content=augment_prompt(
        "how does deepseek r1 compare to deepseek r1 zero?"
    )
)

res = chat.invoke(messages + [prompt])
print(res.content)

DeepSeek-R1 is an improvement over DeepSeek-R1-Zero in several key aspects:

1. **Reasoning Performance**: DeepSeek-R1 addresses some of the challenges faced by DeepSeek-R1-Zero, such as poor readability and language mixing, enhancing its overall reasoning capabilities. This results in DeepSeek-R1 achieving a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-01-1217, while DeepSeek-R1-Zero's performance was not mentioned in this specific context.

2. **Training Methodology**: DeepSeek-R1 incorporates a multi-stage training pipeline and utilizes a small amount of cold-start data before reinforcement learning (RL), which helps to refine its reasoning skills. In contrast, DeepSeek-R1-Zero was trained solely via large-scale RL without supervised fine-tuning as a preliminary step.

3. **Knowledge Benchmarking**: In comparison to DeepSeek-R1-Zero, DeepSeek-R1 achieves higher scores on knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, indicating better performance in

You can continue asking questions about Deepseek R1, but once you're done you can delete the index to save resources:

In [None]:
pc.delete_index(index_name)

---