We know how powerful retrieval augmentation and conversational agents (chatbots) can be. With Pinecone and LangChain we can combine them.
Chatbots built on an LLM often struggle with data freshness, domain-specific knowledge, or access to internal documentation. Coupling agents with retrieval augmentation tools helps address these problems.
On the other hand, using "naive" retrieval augmentation without an agent means we retrieve contexts for every query. This isn't always ideal, as not every query requires access to external knowledge.

Merging these methods gives us the best of both worlds. In this notebook we will build a chatbot with a specialised knowledge base of heat pump installation guides.

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [None]:
!pip install -qU \
  pandas \
  langchain==0.1.0 \
  openai==1.7.1 \
  tiktoken==0.5.2 \
  "pinecone-client[grpc]"==2.2.1 \
  pinecone-datasets=='0.5.0rc11' \
  PyMuPDF \
  nltk

We build the knowledge base by extracting the text from each PDF manual and then splitting it into sentence-level chunks, so that each chunk can be embedded and retrieved on its own.

In [1]:
# Code to extract and chunk text from PDF files in a specified directory
# To run this code, you'll need the NLTK and PyMuPDF (fitz) libraries. 
# Install them using: pip install nltk PyMuPDF
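import nltk
import fitz  # PyMuPDF
import os

# sent_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)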

pdf_dir = '/Users/user_name/sustainability/installer_chatbot/air2water_HP_installation_guides/'
texts = [] # List to hold text
metadata_tags = []  # List to hold metadata tags
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_dir, pdf_file)
        with fitz.open(pdf_path) as pdf_document:
            pdf_text = ''
            for page in pdf_document:
                pdf_text += page.get_text()
            texts.append(pdf_text)
            metadata_tags.append(pdf_file) 

# Split each document into sentences and tag every chunk with its source PDF
chunked_texts = []
chunked_metadata_tags = []
for text, metadata_tag in zip(texts, metadata_tags):
    chunks = nltk.sent_tokenize(text) 
    chunk_tags = [metadata_tag] * len(chunks)
    chunked_texts.extend(chunks)
    chunked_metadata_tags.extend(chunk_tags)
    



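Single sentences can be very short, so a chunk may carry little context on its own. As a minimal sketch of one alternative (not what the cell above does), consecutive sentences could be packed greedily into chunks under a character budget. The helper name `group_sentences` and the `max_chars` parameter are hypothetical, introduced only for illustration.

```python
def group_sentences(sentences, max_chars=500):
    """Greedily pack consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # Skip empty strings left over from extraction
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # The budget would be exceeded: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy sentences standing in for the output of nltk.sent_tokenize
example_sentences = [
    "Install the unit outdoors.",
    "Keep the air inlet clear.",
    "Route condensate to a drain.",
]
example_chunks = group_sentences(example_sentences, max_chars=55)
```

With the toy input, the first two sentences fit in one chunk and the third starts a new one. The same call could be applied to the sentences of each PDF before the tags are repeated per chunk.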

Creating vector embeddings for text chunks from a PDF is like translating a story into a secret code that only computers can understand. Each sentence is turned into a series of numbers that captures its essence, allowing the computer to see how all the different sentences are related to each other.
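To make "related" concrete: a common way to compare two embeddings is cosine similarity, which is also the metric chosen for the index later in this notebook. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings from text-embedding-ada-002 have 1536 dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, not produced by any model)
drain = [0.9, 0.1, 0.0]
condensate = [0.8, 0.2, 0.1]
warranty = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(drain, condensate)
sim_unrelated = cosine_similarity(drain, warranty)
```

Vectors that point in a similar direction score close to 1, while unrelated ones score close to 0, so `sim_related` comes out larger than `sim_unrelated`.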

In [3]:
from openai import OpenAI
import pandas as pd

client = OpenAI(api_key='<INSERT OPENAI API KEY>')
# Map each PDF file name to its public URL; every PDF in pdf_dir needs an entry here
map_pdf_to_web_dictionary = {'Nibe_F2040_231844-5.pdf':'https://www.nibe.eu/assets/documents/16900/231844-5.pdf'}
embeddings = []
metadata_list = []  # This will contain our metadata with chunk and source
ids = []  # One id per chunk, formatted as "<document>-<chunk>"
document_counter = 1  # Starting with the first document
chunk_counter = 0  # Initialize chunk counter
prev_tag = None  # Keep track of the previous tag
# Loop through the chunked texts and metadata tags created in the previous cell
for text, tag in zip(chunked_texts, chunked_metadata_tags):
    response = client.embeddings.create(
        input=text,
        model='text-embedding-ada-002'  # Use your chosen model ID
    )
    embedding = response.data[0].embedding
    embeddings.append(embedding)
    # Increment document counter if the tag changes (indicating a new document)
    if tag != prev_tag:
        if prev_tag is not None:
            document_counter += 1
        chunk_counter = 0  # Reset chunk counter for a new document
        prev_tag = tag  # Update the previous tag
    # Create the metadata with chunk number and the PDF name from the tag
    metadata = {'chunk': chunk_counter, 'pdf source': tag, 'source': map_pdf_to_web_dictionary[tag], 'text': text}
    metadata_list.append(metadata)
    # Build the id while the counters still refer to this chunk
    ids.append(f"{document_counter}-{chunk_counter}")
    # Increment the chunk counter
    chunk_counter += 1

# Create a DataFrame of the embeddings which will be used for our database later on
hpInstallerEmbeddingsDF = pd.DataFrame({
    'id': ids,  # Use the ids generated
    'values': embeddings, # The vector embeddings which have been generated
    'metadata': metadata_list # All the necessary metadata
})

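The loop above sends one request per sentence. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so requests can be grouped. The following is a minimal sketch of that idea, with the API call abstracted behind a hypothetical `embed_batch` callable so the batching logic can be read (and run) without a key; `batched` and `embed_in_batches` are illustrative names, not part of any library.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch.

    embed_batch is assumed to map a list of strings to a list of vectors
    in the same order.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder so the sketch runs offline: one number per text.
# With the client from the cell above, embed_batch could instead be:
#   lambda batch: [d.embedding for d in client.embeddings.create(
#       input=batch, model='text-embedding-ada-002').data]
fake_vectors = embed_in_batches(
    ["a", "bb", "ccc"],
    lambda batch: [[float(len(t))] for t in batch],
    batch_size=2,
)
```

The batch size of 100 is an arbitrary choice here; the right value depends on chunk length and the rate limits of your account.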
In [None]:
# Print the head of the DataFrame
hpInstallerEmbeddingsDF.head()

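Before records like these are sent to a vector database, a quick sanity check can catch problems early, such as duplicate ids or vectors of the wrong length. The sketch below is one possible check; `validate_records` is a hypothetical helper, and 1536 is the dimension of text-embedding-ada-002 used in this notebook.

```python
def validate_records(ids, vectors, dimension=1536):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(ids) != len(vectors):
        problems.append(f"{len(ids)} ids but {len(vectors)} vectors")
    if len(set(ids)) != len(ids):
        problems.append("ids are not unique")
    for record_id, vector in zip(ids, vectors):
        if len(vector) != dimension:
            problems.append(
                f"{record_id}: expected {dimension} dimensions, got {len(vector)}"
            )
    return problems


# Toy records with a tiny dimension so the example stays readable
ok_report = validate_records(["1-0", "1-1"], [[0.0] * 4, [0.0] * 4], dimension=4)
bad_report = validate_records(["1-0", "1-0"], [[0.0] * 4, [0.0] * 3], dimension=4)
```

On the DataFrame above, the call could look like `validate_records(list(hpInstallerEmbeddingsDF['id']), list(hpInstallerEmbeddingsDF['values']))`.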
Next we initialize the vector database. A vector database is like a vast library where, instead of books, ideas and information are stored as numbers in a way that machines can quickly find, compare, and understand. It is designed to store and search these numerical codes (vectors) efficiently, so it can return fast and relevant results when you look for specific pieces of information. We can create a free API key with Pinecone and then create the index:

In [None]:
import os
import pinecone
import time

# Index name for the heat pump chatbot
hp_chatbot_index_name = 'chatbot-onboarding'
# Read credentials from environment variables, falling back to the placeholders
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or '<PINECONE API KEY>'
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'gcp-starter'
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

# Create a new Pinecone index if it doesn't exist
if hp_chatbot_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=hp_chatbot_index_name,
        metric='cosine',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
        metadata_config={'indexed': ['chunk', 'source']}
    )
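    # Wait until the new index reports that it is ready
    while not pinecone.describe_index(hp_chatbot_index_name).status['ready']:
        time.sleep(1)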
# Initialize the Pinecone index
hp_chatbot_index = pinecone.GRPCIndex(hp_chatbot_index_name)
# Upsert data from DataFrame to the Pinecone index
hp_chatbot_index.upsert_from_dataframe(hpInstallerEmbeddingsDF, batch_size=100)

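Conceptually, querying the index means finding the stored vectors closest to a query vector. The sketch below shows that idea as a brute-force search over a tiny in-memory dictionary. It is only an illustration: a real vector database relies on specialised index structures to do this at scale, and the names `top_k` and `toy_index` as well as the vectors are made up.

```python
import heapq
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def top_k(query, records, k=2):
    """Return the k (score, id) pairs with the highest cosine similarity to query."""
    q = normalize(query)
    scored = (
        # The dot product of unit vectors equals their cosine similarity
        (sum(a * b for a, b in zip(q, normalize(vec))), rec_id)
        for rec_id, vec in records.items()
    )
    return heapq.nlargest(k, scored)


# Toy index keyed by "<document>-<chunk>" ids, as in this notebook
toy_index = {
    "1-0": [0.9, 0.1, 0.0],
    "1-1": [0.1, 0.9, 0.0],
    "1-2": [0.7, 0.3, 0.1],
}
matches = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Here the two chunks pointing most nearly in the query's direction, "1-0" and "1-2", are returned in that order.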
In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize OpenAI embeddings
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or '<INSERT OPENAI API KEY>'
embed = OpenAIEmbeddings(
    model='text-embedding-ada-002',
    openai_api_key=OPENAI_API_KEY
)

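# Set up Pinecone vector store (reconnect with a standard Index, which the LangChain wrapper is assumed to expect instead of a GRPCIndex)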
hp_chatbot_index = pinecone.Index(hp_chatbot_index_name)
vectorstore = Pinecone(hp_chatbot_index, embed.embed_query, "text")

# Set up the chatbot language model
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

# Create a QA chain with retrieval from vectorstore
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Example question
question = "Can you tell me how to deal with condensation run off for the NIBE F2040 heat pump?"
# invoke returns a dict; the generated answer is stored under 'result'
answer = qa.invoke({"query": question})
print(answer["result"])