# Knowledge Based System with Langchain and PaLM 
This notebook walks through building a question-answering system that retrieves information from a private knowledge base. A pre-trained LLM, and likely even a fine-tuned one, is not sufficient by itself when you want a conversational interface for asking specific questions about specific data. A private knowledge base can be a collection of documents, websites, research papers, structured data tables, and more.

The steps to set up the private knowledge base are as follows:
1) Split documents into chunks
2) Vectorize (embed) each chunk 
3) Store vectors/embeddings in a database

Once you have a vector store of embeddings (the private knowledge base), the process of using it in a conversational workflow is as follows:
1) Embed the query (question)
2) Nearest neighbors lookup with query in vectorstore to find relevant chunks
3) Use relevant chunks to formulate response  

This process requires an LLM (such as PaLM) to formulate responses to queries from the relevant chunks found via the nearest neighbors lookup.

There are many options for a vector store, including managed and scalable offerings like [Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview). There are also different options for the underlying language model. In this walkthrough we will use [Chroma](https://www.trychroma.com/) as the vector store and [PaLM](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart) as the language model. In a production environment, consider a more scalable vector store such as Vertex AI Matching Engine.

**NOTE:** This notebook requires you to have a Google Cloud project and uses Google Cloud resources. If you are not running this lab in a Vertex AI Workbench Notebook, you need to set up the proper permissions to access these resources. Help can be found [here](https://cloud.google.com/docs/authentication/provide-credentials-adc).

### Setup

In [None]:
!pip3 install --user \
    langchain==0.0.217 \
    wikipedia==1.4.0 \
    chromadb==0.3.26 \
    google-cloud-aiplatform==1.26.1

Restart the kernel so that the newly installed packages are available.

In [1]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI 
from langchain.memory import ConversationBufferMemory
from langchain.chains import RetrievalQA, ConversationalRetrievalChain

### Document Loading 
Langchain provides classes to load data from different sources. Some useful data loaders are [Google Cloud Storage Directory Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_cloud_storage_directory), [Google Drive Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_drive), [Recursive URL Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/recursive_url_loader), [PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf), [JSON Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/json), [Wikipedia Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/wikipedia), and [more](https://python.langchain.com/docs/modules/data_connection/document_loaders/). 

In this notebook we will use the Wikipedia loader to create a private knowledge base of Wikipedia articles about machine learning, but the overall process is similar regardless of which document loader you use.

In [2]:
docs = WikipediaLoader(query="Machine Learning", load_max_docs=10).load()
docs += WikipediaLoader(query="Deep Learning", load_max_docs=10).load() 
docs += WikipediaLoader(query="Neural Networks", load_max_docs=10).load() 

# Take a look at a single document
docs[0]

Document(page_content='Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines \'discover\' their \'own\' algorithms, without needing to be explicitly told what to do by any human-developed algorithms. When there was a vast amount of potential answers, the correct ones needed to be labeled as valid by human labelers initially and human supervision was need. With advance of faster machines and new methods, however,  \'discovering\' machine\'s own models became possible not only by using supervised learning but also by using unsupervised learning or reinforcement learning. Although not all machine learning is statistically-based, computational statistics is an important source of the field\'s methods. \nGenerative artificial neural networks, mimicking the working of a biological brain, has been recently able to surpass results of many previous

### Split text into chunks
Now that we have the documents we will split them into chunks. Each chunk will become one vector in the vector store. To do this we will define a chunk size (number of characters) and a chunk overlap (the number of characters shared between consecutive chunks, i.e. a sliding window). The ideal chunk size can be difficult to determine. A chunk size that is too large puts too much information in each chunk, so individual chunks are not specific enough; a chunk size that is too small leaves too little information per chunk. In both cases, a nearest neighbors lookup with a query/question embedding may struggle to retrieve the relevant chunks. Chunks that are too large may also fail outright if they do not fit as context in an LLM query.

In this notebook we will use a chunk size of 800 characters and a chunk overlap of 400 characters, but feel free to experiment with other sizes! Note: you can specify a custom `length_function` with `RecursiveCharacterTextSplitter` if you want chunk size/overlap to be determined by something other than Python's `len` function. In addition to `RecursiveCharacterTextSplitter`, there are [other text splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) you can consider.
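
To illustrate how chunk size and chunk overlap interact, here is a minimal sketch of a fixed-stride sliding window over characters. The helper `split_with_overlap` is hypothetical and is deliberately simpler than `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, newlines, and spaces rather than at fixed character offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=5)
print(demo_chunks)
```

With a chunk size of 10 and an overlap of 5, each chunk repeats the last 5 characters of the previous chunk. A larger overlap lowers the chance that a relevant passage is cut in half at a chunk boundary, at the cost of producing more chunks to embed and store.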

In [3]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=400,
    length_function=len,
)

chunks = text_splitter.split_documents(docs)

# Look at the first two chunks 
chunks[0:2]

[Document(page_content="Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines 'discover' their 'own' algorithms, without needing to be explicitly told what to do by any human-developed algorithms. When there was a vast amount of potential answers, the correct ones needed to be labeled as valid by human labelers initially and human supervision was need. With advance of faster machines and new methods, however,  'discovering' machine's own models became possible not only by using supervised learning but also by using unsupervised learning or reinforcement learning. Although not all machine learning is statistically-based, computational statistics is an important source", metadata={'title': 'Machine learning', 'summary': "Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers wo

In [4]:
print(f'Number of documents: {len(docs)}')
print(f'Number of chunks: {len(chunks)}')

Number of documents: 30
Number of chunks: 258


### Vectorize/Embed Document Chunks
Now we need to embed the document chunks (turn them into vectors) and store them in a vector store. We can use any text embedding model for this, but we must use the same model to embed our queries/questions at prediction time. To keep things simple we will use the PaLM API for Embeddings. The langchain library provides a wrapper class around the PaLM Embeddings API, `VertexAIEmbeddings()`.

Since Vertex AI Matching Engine takes a while (~45 minutes) to create an index, we will use [Chroma](https://www.trychroma.com/) instead to keep things simple. Of course, in a real-world use case with a large private knowledge base, you may not be able to fit everything in memory. Langchain has a wrapper class for Chroma that lets us pass in a list of documents and an embedding class to create the vector store.
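
To make the lookup concrete, here is a minimal sketch of what a similarity search over stored vectors does, assuming tiny hand-written 3-dimensional vectors in place of real PaLM embeddings. The helpers `cosine_similarity` and `top_k` and the `toy_store` data are hypothetical and are not part of Chroma or langchain; a real vector store typically uses an approximate index and may use a different distance metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[Sequence[float], str]],
    k: int,
) -> List[Tuple[float, str]]:
    """Brute-force search: score every stored vector and keep the k most similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for chunk embeddings (assumption, not real model output).
toy_store = [
    ([1.0, 0.0, 0.0], "chunk about transformers"),
    ([0.0, 1.0, 0.0], "chunk about decision trees"),
    ([0.9, 0.1, 0.0], "chunk about attention"),
]

query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question
nearest = top_k(query_vec, toy_store, k=2)
for score, text in nearest:
    print(f"{score:.3f}  {text}")
```

The scores are only meaningful when the query vector and the stored vectors come from the same embedding model, which is why the same model must be used at indexing time and at query time.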

In [5]:
embedding = VertexAIEmbeddings() # PaLM embedding API 

# set persist directory so the vector store is saved to disk
db = Chroma.from_documents(chunks, embedding, persist_directory="./vectorstore")

### Putting it all together
Now that everything is in place, we can tie it all together with a langchain chain. A langchain chain simply orchestrates the multiple steps required to use an LLM for a specific use case. In this case the process we will chain together first embeds the query/question, then performs a nearest neighbors lookup to find the relevant chunks, then uses the relevant chunks to formulate a response with an LLM. We will use the Chroma database as our vector store and PaLM as our LLM. Langchain provides a wrapper around PaLM, `VertexAI()`. 

For this simple Q/A use case we can use langchain's `RetrievalQA` to link together the process.
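
The `stuff` chain type used with `RetrievalQA` in this notebook places all retrieved chunks into a single prompt. A minimal sketch of that idea follows, assuming a made-up prompt wording; `build_stuffed_prompt` is a hypothetical helper and the template is not langchain's actual prompt.

```python
from typing import List


def build_stuffed_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative chunks (assumption); in the notebook these come from the retriever.
example_chunks = [
    "The transformer architecture was introduced in 2017.",
    "Recurrent networks process tokens sequentially.",
]
example_prompt = build_stuffed_prompt("When was the transformer introduced?", example_chunks)
print(example_prompt)
```

Because every chunk is placed in one prompt, the number of neighbors retrieved multiplied by the chunk size has to fit within the model's context window. That is one reason to keep both values modest.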

In [6]:
# vector store 
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k":5} # number of nearest neighbors to retrieve  
)

# PaLM API 
# You can also set temperature, top_p, top_k 
llm = VertexAI(
    model_name="text-bison",
    max_output_tokens=1024
)

# q/a chain 
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

### Query 
Now that everything is tied together we can send queries and get answers! 

In [7]:
def ask_question(question: str):
    response = qa({"query": question})
    print(f"Response: {response['result']}\n")

    citations = {doc.metadata['source'] for doc in response['source_documents']}
    print(f"Citations: {citations}\n")

    # uncomment below to print source chunks used  
    # print(f"Source Chunks Used: {response['source_documents']}")

In [8]:
ask_question("What technology underpins large language models?")

Response: The technology that underpins large language models is the transformer architecture.

Citations: {'https://en.wikipedia.org/wiki/Recurrent_neural_network', 'https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'}



In [9]:
ask_question("What is a gradient boosted tree?")

Response: Gradient boosted trees (GBTs) are a type of ensemble machine learning algorithm that combines decision trees with gradient boosting. Gradient boosting is an iterative process that builds a model by adding new trees to an existing model in order to reduce the error of the model. The first tree is built using the original data, and then subsequent trees are built using the residuals from the previous tree. This process is repeated until the desired level of accuracy is achieved. GBTs are often used for classification and regression tasks, and they are particularly well-suited for tasks where the data is noisy or incomplete.

Citations: {'https://en.wikipedia.org/wiki/Support_vector_machine', 'https://en.wikipedia.org/wiki/Rectifier_(neural_networks)', 'https://en.wikipedia.org/wiki/Graph_neural_network', 'https://en.wikipedia.org/wiki/Boosting_(machine_learning)'}



In [10]:
ask_question("When was the transformer invented?")

Response: The Transformer model came out in 2017.

Citations: {'https://en.wikipedia.org/wiki/Artificial_neural_network', 'https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)', 'https://en.wikipedia.org/wiki/Recurrent_neural_network'}



### Preserve Chat History
`RetrievalQA` is great for asking single questions and getting an answer, but what if you want a chatbot that is able to track conversation history and understand context within a conversation? For that, we can use `ConversationalRetrievalChain` to orchestrate the flow (similar to `RetrievalQA`) and `ConversationBufferMemory` to preserve chat history.
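
A follow-up such as "When were they invented?" cannot be used for retrieval as-is, because the pronoun has no meaning without the earlier turns. A conversational retrieval chain handles this by first asking the LLM to rewrite the follow-up as a standalone question using the chat history. The sketch below shows roughly how such a rewrite prompt could be assembled; `build_condense_prompt` and its wording are hypothetical and are not langchain's actual template.

```python
from typing import List, Tuple


def build_condense_prompt(chat_history: List[Tuple[str, str]], follow_up: str) -> str:
    """Format prior (human, assistant) turns plus a follow-up into a rewrite request."""
    lines = []
    for human_msg, ai_msg in chat_history:
        lines.append(f"Human: {human_msg}")
        lines.append(f"Assistant: {ai_msg}")
    history_text = "\n".join(lines)
    return (
        "Given the conversation below and a follow-up question, "
        "rephrase the follow-up as a standalone question.\n\n"
        f"Chat history:\n{history_text}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


history = [
    (
        "What technology underpins large language models?",
        "The technology that underpins large language models is the transformer architecture.",
    )
]
condense_prompt = build_condense_prompt(history, "When were they invented?")
print(condense_prompt)
```

The LLM's rewritten question (for example, one that names the transformer architecture explicitly) is then embedded and sent to the retriever in place of the original follow-up.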

In [11]:
# preserve chat history in memory 
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chat_session = ConversationalRetrievalChain.from_llm(
    llm=llm, 
    retriever=retriever, 
    memory=memory
)

In [12]:
chat_session({'question': 'What technology underpins large language models?'})

{'question': 'What technology underpins large language models?',
 'chat_history': [HumanMessage(content='What technology underpins large language models?', additional_kwargs={}, example=False),
  AIMessage(content='The technology that underpins large language models is the transformer architecture.', additional_kwargs={}, example=False)],
 'answer': 'The technology that underpins large language models is the transformer architecture.'}

In [13]:
# With chat history it will understand that "they" refers to transformers 
chat_session({'question': 'When were they invented?'})

{'question': 'When were they invented?',
 'chat_history': [HumanMessage(content='What technology underpins large language models?', additional_kwargs={}, example=False),
  AIMessage(content='The technology that underpins large language models is the transformer architecture.', additional_kwargs={}, example=False),
  HumanMessage(content='When were they invented?', additional_kwargs={}, example=False),
  AIMessage(content='The Transformer model came out in 2017.', additional_kwargs={}, example=False)],
 'answer': 'The Transformer model came out in 2017.'}