[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/chatbots/chatbot.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/chatbots/chatbot.ipynb)

# Chatbots

The most powerful chatbots in the world all **hallucinate** and lack an up-to-date understanding of the real world. GPT-4, for example, cannot answer questions about recent world events, no matter how important they were.

The world of chatbots is frozen in time. Their knowledge is a static snapshot of the world as it appeared in their training data.

A solution to this problem is *retrieval augmentation*. The idea is to plug a chatbot into an external vector database, where it can access *accurate* and up-to-date information. This helps us limit hallucinations and answer questions about the latest events. In this example we will see how to do that.

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [None]:
!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.1 \
  pinecone-datasets=='0.5.0rc11'

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preparation

The chatbot use-case requires that we store relevant information (called **contexts**) inside the vector database. To make these retrievable we encode them to create vector embeddings. Once these embeddings are indexed we can use them to augment user queries as demonstrated below:

![Chatbot with retrieval augmentation diagram](https://github.com/pinecone-io/examples/blob/master/learn/generation/chatbots/assets/chatbot-diagram.png?raw=true)
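To make the retrieval step concrete, here is a minimal, self-contained sketch using only the standard library. The three-dimensional vectors, the `contexts` list, and the `top_k` helper are made up for illustration; in this notebook the embeddings have 1536 dimensions and the similarity search is handled by Pinecone rather than by a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy "index": (text, embedding) pairs with made-up 3-d embeddings
contexts = [
    ("Rome is the capital of Italy.", [0.9, 0.1, 0.0]),
    ("Python is a programming language.", [0.0, 0.2, 0.9]),
    ("Italy is a country in southern Europe.", [0.8, 0.3, 0.1]),
]

def top_k(query_embedding, k=2):
    """Return the k context texts most similar to the query embedding."""
    ranked = sorted(
        contexts,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# pretend this is the embedding of "tell me about Italy"
query_embedding = [1.0, 0.2, 0.0]
retrieved = top_k(query_embedding, k=2)

# the retrieved contexts are placed in front of the user's question
augmented_prompt = "\n".join(retrieved) + "\n\nQuestion: tell me about Italy"
print(augmented_prompt)
```

The unrelated context about Python is ranked last and never reaches the prompt, which is the behavior we want from the retrieval step.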

### Downloading the Dataset

We will download a pre-embedded dataset from `pinecone-datasets`, which allows us to skip the embedding and any other preprocessing steps. When working with your own dataset you will need to perform the embedding step yourself, but here the embeddings are prebuilt so we can jump right to the action.

In [None]:
import pinecone_datasets

dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002')
dataset.head()



| | id | values | sparse_values | metadata | blob |
|---|---|---|---|---|---|
| 0 | 1-0 | [-0.011254455894231796, -0.01698738895356655, ... | | | {'chunk': 0, 'source': 'https://simple.wikiped... |
| 1 | 1-1 | [-0.0015197008615359664, -0.007858820259571075... | | | {'chunk': 1, 'source': 'https://simple.wikiped... |
| 2 | 1-2 | [-0.009930099360644817, -0.012211072258651257,... | | | {'chunk': 2, 'source': 'https://simple.wikiped... |
| 3 | 1-3 | [-0.011600767262279987, -0.012608098797500134,... | | | {'chunk': 3, 'source': 'https://simple.wikiped... |
| 4 | 1-4 | [-0.026462381705641747, -0.016362832859158516,... | | | {'chunk': 4, 'source': 'https://simple.wikiped... |


In [None]:
len(dataset)

283945

A dataset loaded with `pinecone-datasets` always contains the columns `id`, `values`, `sparse_values`, `metadata`, and `blob`. All we need are the IDs, the vector embeddings (stored in `values`), and some metadata (which is actually stored in `blob`). Let's reformat the dataset so it is ready to add to Pinecone, and keep only a smaller subset of the full dataset.

In [None]:
# we drop sparse_values and the empty metadata column, as neither is needed for this example
dataset.documents.drop(['sparse_values', 'metadata'], axis=1, inplace=True)
# the blob column holds the actual metadata, so we rename it
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we keep only the first 30_000 rows of the dataset
dataset.documents.drop(dataset.documents.index[30_000:], inplace=True)
len(dataset)

30000
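After this reformatting, each row corresponds to a record with an ID, a vector, and a metadata dictionary. The sketch below shows that mapping on a single made-up row. The `to_record` helper, the shortened vector, the `example.com` URL, and the `text` key are all assumptions for illustration and are not taken from the dataset itself.

```python
def to_record(row):
    """Keep only the fields needed for upserting; blob becomes metadata."""
    return {
        "id": row["id"],
        "values": row["values"],
        "metadata": row["blob"],
    }

# a made-up row in the original five-column layout
row = {
    "id": "1-0",
    "values": [-0.0113, -0.0170, 0.0042],  # real vectors have 1536 values
    "sparse_values": None,
    "metadata": None,
    "blob": {
        "chunk": 0,
        "source": "https://example.com/article",  # placeholder URL
        "text": "Example passage text.",  # assumed key
    },
}

record = to_record(row)
print(sorted(record.keys()))
```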

Now we move on to initializing our Pinecone vector database.

## Creating an Index

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [None]:
index_name = 'chatbot-onboarding'

In [None]:
import os
import time

import pinecone

# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

# with pinecone-client 2.x, list_indexes() returns a list of index names
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
        metadata_config={'indexed': ['wiki-id', 'title']}
    )
    # wait a moment for the index to be fully initialized
    time.sleep(1)



Then we connect to the new index:

In [None]:
index = pinecone.Index(index_name)
# wait a moment for the index to be fully initialized
time.sleep(1)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We should see that the new Pinecone index has a `total_vector_count` of `0`, as we haven't added any vectors yet.
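The fixed `time.sleep(1)` calls above assume the index is ready after one second, which may not always hold. A more defensive pattern is to poll a readiness check until it succeeds or a timeout passes. The sketch below is generic: `wait_until` and `fake_index_ready` are hypothetical names, and the stand-in check would need to be replaced with whatever readiness check your client version exposes.

```python
import time

def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Call is_ready() every `interval` seconds until it returns True.

    Returns True once ready, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# stand-in readiness check that succeeds on the third call
calls = {"count": 0}

def fake_index_ready():
    calls["count"] += 1
    return calls["count"] >= 3

print(wait_until(fake_index_ready, timeout=5.0, interval=0.01))
```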

Now we upsert the data to Pinecone:

In [None]:
index.upsert_from_dataframe(dataset.documents, batch_size=100)

sending upsert requests:   0%|          | 0/30000 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/300 [00:00<?, ?it/s]

upserted_count: 30000
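The progress bars above report 30000 records and 300 responses, which is what we would expect when 30,000 rows are sent in batches of 100. The batching idea can be sketched as follows. This is an illustration of the technique only, not the client's actual implementation, and `batched` and `record_ids` are hypothetical names.

```python
def batched(records, batch_size):
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# stand-in for the 30_000 rows of the dataset
record_ids = [f"id-{i}" for i in range(30_000)]

batches = list(batched(record_ids, batch_size=100))
print(len(batches))  # 300
```

Smaller batches mean more requests, while larger batches mean bigger payloads per request, so the batch size is a trade-off to tune for your own data.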

We've now indexed everything. We can check the number of vectors in our index like so:

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 30000}},
 'total_vector_count': 30000}

## Building the Chatbot and Querying

Now that we've built our index we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object, which we initialize like so:

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now initialize the vector store:

In [None]:
from langchain.vectorstores import Pinecone

text_field = "text"

# (re)connect to the index for use with LangChain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

Now we can query the vector store directly using `vectorstore.similarity_search`:

In [None]:
query = "who was Benito Mussolini?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Benito Amilcare Andrea Mussolini KSMOM GCTE (29 July 1883 – 28 April 1945) was an Italian politician and journalist. He was also the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party.\n\nBiography\n\nEarly life\nBenito Mussolini was named after Benito Juarez, a Mexican opponent of the political power of the Roman Catholic Church, by his anticlerical (a person who opposes the political interference of the Roman Catholic Church in secular affairs) father. Mussolini\'s father was a blacksmith. Before being involved in politics, Mussolini was a newspaper editor (where he learned all his propaganda skills) and elementary school teacher.\n\nAt first, Mussolini was a socialist, but when he wanted Italy to join the First World War, he was thrown out of the socialist party. He \'invented\' a new ideology, Fascism, much out of Nationalist\xa0and Conservative views.\n\nRise to power and becoming dictator\nIn 1922, he took power b

All of these are good, relevant results. But this actually covers only the first few steps of our retrieval augmented chatbot, the **retrieval augmentation** steps. We're still missing the **chatbot** part.

### Creating the Chatbot

Our chatbot will treat the query as a question to be answered, but it must base its answer on the information returned from the `vectorstore`.

To do this we initialize a chat model and `RetrievalQA` object like so:

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# chatbot language model
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# retrieval augmented pipeline for chatbot
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [None]:
qa.run(query)

'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and was also a newspaper editor and elementary school teacher before being involved in politics. Mussolini was a dictator of Italy by the end of 1927, and his form of Fascism, "Italian Fascism," was different and less destructive than Hitler\'s Nazism. He wanted Italy to become a new Roman Empire and attacked several countries, including Abyssinia (now called Ethiopia) and Greece. Mussolini was captured and shot by partisans in 1945.'
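The `chain_type="stuff"` setting used above means that all retrieved documents are placed ("stuffed") into a single prompt alongside the question. The sketch below illustrates that idea in plain Python. The `build_stuff_prompt` helper and the prompt wording are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Combine every retrieved document and the question into one prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# short stand-ins for the documents returned by the retriever
docs = [
    "Benito Mussolini was the Prime Minister of Italy from 1922 until 1943.",
    "He was the leader of the National Fascist Party.",
]

prompt = build_stuff_prompt("who was Benito Mussolini?", docs)
print(prompt)
```

Because every document goes into one prompt, this approach is simple but limited by the context window of the model, so the number of retrieved documents needs to stay small.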

We can also include the sources of information that the chatbot is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [None]:
qa_with_sources(query)

{'question': 'who was Benito Mussolini?',
 'answer': 'Benito Mussolini was an Italian politician and journalist who was the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and created his own form of Fascism called "Italian Fascism". He wanted Italy to become a new Roman Empire and attacked several countries, including Abyssinia (now called Ethiopia) and Greece. He was dictator of Italy by the end of 1927 and was deposed in 1943. He was later captured and shot by partisans. Mario Draghi is the current head of government of Italy. Italy was not a state before 1861 and was a group of separate states ruled by other countries. In 1860, Giuseppe Garibaldi took control of Sicily, creating the Kingdom of Italy in 1861. Victor Emmanuel II was made the king. \n',
 'sources': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini, https://simple.wikipedia.org/wiki/Italy'}

Now we answer the question being asked *and* return the sources of the information used by the LLM.
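As the output above shows, `sources` comes back as a single comma-separated string. If you want to display or post-process the sources individually, a small helper can split it into a list. This is a minimal sketch: `split_sources` is a hypothetical helper, the `answer` is shortened, and the approach assumes the URLs themselves contain no commas.

```python
# result in the same shape as the output above (answer shortened)
result = {
    "question": "who was Benito Mussolini?",
    "answer": "Benito Mussolini was an Italian politician and journalist ...",
    "sources": (
        "https://simple.wikipedia.org/wiki/Benito%20Mussolini, "
        "https://simple.wikipedia.org/wiki/Italy"
    ),
}

def split_sources(sources):
    """Split a comma-separated sources string into a list of clean URLs."""
    return [s.strip() for s in sources.split(",") if s.strip()]

for url in split_sources(result["sources"]):
    print(url)
```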

Once done, we can delete the index to save resources.

In [None]:
pinecone.delete_index(index_name)

## Where would we use this?

Chatbots and retrieval augmented LLMs are incredibly prevalent in our world despite their lack of adoption pre-ChatGPT. Retrieval augmentation is being used to improve the performance and reliability of chatbots, and the retrieval *with sources* that we demonstrated above is clearly visible in both Google's Bard and Microsoft's Bing AI. Beyond these tech giants, giving LLMs access to up-to-date information is essential for customer service chatbots that must refer to customer data and FAQ docs, for chatbots plugged into fast-changing ecommerce databases, and for any scenario where up-to-date information is important or helpful.

---