<a href="https://colab.research.google.com/github/knivore/colab/blob/main/Internal_Knowledge_Base_Q%26A_Using_LangChain_%26_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating an Internal Knowledge Base Q&A Bot Using LangChain & OpenAI

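This example shows how to query an internal knowledge base stored in a FAISS vector store: a web page is loaded, split into chunks, embedded with OpenAI embeddings, and then queried through an OpenAI chat model.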

This notebook is adapted from LangChain's [Question Answering](https://python.langchain.com/docs/use_cases/question_answering/) use case.

# Installations


In [None]:
!pip install langchain
!pip install openai
!pip install faiss-cpu
!pip install tiktoken



# Set up OPENAI_API_KEY and other variables

In [None]:
import os

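os.environ["OPENAI_API_KEY"] = ""  # Paste your OpenAI API key here
# Optional: LangSmith tracing. Without a LangSmith API key, set LANGCHAIN_TRACING_V2 to "false".
os.environ["LANGCHAIN_API_KEY"] = ""
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true"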

DOCUMENT_BASE_URL = "https://www.govbenefits.gov.sg/am-i-eligible/ap-cash"  # Actual URL
FAISS_DATA_STORE_DIR = "faiss_data_store"  # Folder to save/load the database
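
The keys above are left blank on purpose, and a blank key only fails later, when the first API call is made. A minimal sketch of an early check, assuming a hypothetical helper named `missing_env_vars` (it is not part of LangChain or OpenAI):

```python
import os


def missing_env_vars(names):
    """Return the names whose environment variable is unset or blank."""
    return [name for name in names if not os.environ.get(name, "").strip()]


# Hypothetical usage: report blank keys before any chain is built.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Set these variables before continuing: {', '.join(missing)}")
```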

# Building the datastore

* Side note: [LangChain Integrations](https://integrations.langchain.com/) has a list of 154 types of document loaders



In [None]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = WebBaseLoader(DOCUMENT_BASE_URL)
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_data = text_splitter.split_documents(data)
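
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `split_fixed` is made up for illustration.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))  # ['abcd', 'defg', 'ghij']
```

With an overlap of 0, as used in this notebook, a sentence that straddles a chunk boundary is cut in two; a small overlap keeps some shared text on both sides of the cut.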

Creating the vector data

* Side note: [LangChain Integrations](https://integrations.langchain.com/) has a list of 37 types of embedding models & 46 types of vector stores

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(split_data, embeddings)
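
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea with made-up 3-dimensional vectors and cosine similarity. It is an assumption-laden toy: real embeddings have far more dimensions, and the FAISS index may rank by a different metric (for example Euclidean distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up "embeddings" for three chunks (illustrative values only).
toy_index = [
    ("eligibility criteria", [0.9, 0.1, 0.0]),
    ("payout schedule", [0.1, 0.9, 0.0]),
    ("contact details", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print([text for text, _ in results])  # ['eligibility criteria', 'payout schedule']
```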

OPTIONAL: Save to local

In [None]:
# Save to a local FAISS index so that you don't have to recreate it every time you use it.
if os.path.exists(path=FAISS_DATA_STORE_DIR):
    local_index = FAISS.load_local(folder_path=FAISS_DATA_STORE_DIR, embeddings=embeddings)
    local_index.merge_from(target=vector_store)
    local_index.save_local(folder_path=FAISS_DATA_STORE_DIR)
else:
    vector_store.save_local(folder_path=FAISS_DATA_STORE_DIR)

OPTIONAL: Load from local

In [None]:
vector_store = FAISS.load_local(folder_path=FAISS_DATA_STORE_DIR, embeddings=embeddings)

# Query ChatGPT with vector store

Setting up chat model and prompts

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template = """
If you need any details clarified, please ask questions until all issues are clarified.
For tabular information, return it as an HTML table. Do not return markdown format.
Take note of the sources, include them in the answer, and provide URL links to each source.
If you do not know the answer, just say "I don't know". Never try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

# llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)  # Set the temperature to 0.0 because we don't want creative answers

# Stuff - The full text of all retrieved documents is stuffed into a single LLM prompt.
# This provides maximum context, at the cost of possible repetition and more tokens sent to the model.
stuff_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# Refine - The chain loops over the retrieved documents one at a time: it drafts an answer from the first document,
# then asks the LLM to revise that answer with each further document. Prompts stay small, at the cost of more LLM calls.
# With FAISS, maximal marginal relevance (MMR) is selected through search_type. The 'distance_metric' and
# 'maximal_marginal_relevance' search_kwargs are Deep Lake options that the FAISS retriever does not use.
# Background on distance metrics: https://weaviate.io/blog/distance-metrics-in-vector-search
refine_chain_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "fetch_k": 25,  # How many candidates to fetch before MMR re-ranking
        "k": 10,  # How many documents to return
    },
)

refine_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=refine_chain_retriever,
    return_source_documents=True
)


def print_result(results):
    output_text = f"""
    * 🤖 Reply: {results['answer']}
    * 📚 All relevant sources:
    {' '.join(list(set([doc.metadata['source'] for doc in results['source_documents']])))}
    """
    print(output_text)
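
The refine retriever above refers to maximal marginal relevance (MMR): fetch a larger pool of candidates, then pick the final documents so that each one is relevant to the query but not a near-duplicate of one already picked. The following is a minimal sketch of that selection step on toy 2-dimensional vectors, not LangChain's implementation; the function name `mmr_select` and the parameter name `lambda_mult` are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Pick up to k candidate indices, balancing relevance and diversity.

    lambda_mult=1.0 ranks purely by relevance; lower values penalise
    candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
candidates = [
    [1.0, 0.25],  # 0: most relevant
    [1.0, 0.2],   # 1: near-duplicate of 0
    [0.3, 1.0],   # 2: less relevant, but different
]

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
```

In this toy example, pure relevance ranking returns the two near-duplicates, while the balanced setting swaps the duplicate for the different document.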

# Ask away!

In [None]:
query = "What is Assurance Package?"
result = stuff_chain(query)
# result = refine_chain({"question": query})
print_result(result)


    * 🤖 Reply: The Assurance Package is a government initiative in Singapore that provides various forms of assistance and support to Singaporeans. It includes cash payments, MediSave top-ups, U-Save rebates, Seniors' Bonus, Cost of Living Special Payment, and CDC Vouchers. The package aims to help Singaporeans cope with the cost of living and provide additional support to those who need it. The specific benefits and eligibility criteria vary for each component of the Assurance Package.
    * 📚 All relevant sources: 
    https://www.govbenefits.gov.sg/am-i-eligible/ap-cash
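
The reply above came from the stuff chain. To show how the two chain types differ in the number and shape of LLM calls, here is a simplified sketch that uses a stand-in function in place of a real chat model. The names `stuff_answer`, `refine_answer`, and `fake_llm`, and the prompt wording, are hypothetical and do not reflect LangChain's internal prompts.

```python
def stuff_answer(llm, question, docs):
    """One call: all documents are concatenated into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine_answer(llm, question, docs):
    """One call per document: each call revises the previous answer."""
    answer = ""
    for doc in docs:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {doc}\nQuestion: {question}"
        )
    return answer


# Stand-in for a chat model: records each prompt and returns a canned reply.
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer after call {len(calls)}"


docs = ["chunk A", "chunk B", "chunk C"]

stuff_answer(fake_llm, "What is X?", docs)
print(len(calls))  # 1

calls.clear()
refine_answer(fake_llm, "What is X?", docs)
print(len(calls))  # 3
```

Stuff sends every chunk in one prompt, so it is limited by the model's context window; refine spends one call per chunk and carries the running answer forward.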
    
