
kn-neeraj/NotionKnowledgeAssistant


Asking LLMs to answer questions based on your private data sources using the Retrieval Augmented Generation (RAG) technique.

Why Retrieval Augmented Generation (RAG)?

Suppose you have a data repository (for example, Notion notes) and you want to be able to do QnA over it using prompts. Base (pre-trained) Large Language Models are only aware of the information they were pre-trained on, so a base LLM's answers to prompts about your private data will fall short or even hallucinate.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) fetches relevant information from your data source based on your prompt and enhances the input prompt to the LLM, providing richer context to improve the LLM's output. It combines the power of existing information retrieval techniques and LLMs: it fetches relevant documents from your data repository and augments the final prompt to the base LLM, improving factuality and reducing hallucination in the LLM's answers.

How does it work?

Diagram : LLM RAG technique

QnA bot over your Notion Database using RAG technique & OpenAI LLM

As highlighted in the diagram above, we will do the following steps :

  1. Load the Notion database using Langchain's Notion Document Loader
  2. Split the loaded Notion documents into semantically relevant chunks
  3. Create vector embeddings of the chunks & store them in a vector database (Pinecone)
  4. Based on the user's prompt, retrieve the semantically relevant chunks from the vector database
  5. Create a prompt template using the user's prompt + retrieved documents + any other metadata and pass it to the base LLM (in our case OpenAI) to get the right answers

Prerequisites

  1. Have an OpenAI API key. Refer : OpenAI API Keys
  2. Set OPENAI_API_KEY as an environment variable (a minimal example follows this list)
  3. Download your Notion documents in markdown format, unzip the export, and put the folder in your repo with the name "Notion_db" : Export Notion Db
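For example, you can set the key in your shell before running the scripts, or at the top of a script from Python; a minimal sketch (the key value is a placeholder, not a real key):

```python
import os

# Placeholder only - use your real key from the OpenAI dashboard
os.environ["OPENAI_API_KEY"] = "sk-..."
```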

Note : While we are using Notion docs as an example, you can use any data source, but you will have to make the appropriate changes at every step, i.e. data loaders, chunking parameters, retrieval methods, etc. Most of it should be straightforward using the Langchain documentation.

Step 1 : Load the Notion Database

  1. Refer to the file dataloader.py
  2. We are using Langchain's NotionDirectoryLoader and providing the directory path of the Notion export in our repo
  3. We print the number of Notion documents loaded (a minimal sketch follows this list)
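A minimal sketch of this step, assuming the classic Langchain import path and the "Notion_db" folder name from the prerequisites:

```python
from langchain.document_loaders import NotionDirectoryLoader

# Point the loader at the unzipped Notion export inside the repo
loader = NotionDirectoryLoader("Notion_db")
docs = loader.load()

print(f"Loaded {len(docs)} Notion documents")
```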

Step 2 : Split loaded documents into semantically relevant chunks

  1. Refer to the file document_chunks.py
  2. Why are we splitting the docs into chunks? Our ultimate objective is to retrieve semantically relevant document portions/chunks based on the user's query. Therefore, the quality of the chunks greatly impacts the end outcome.
  3. Chunking can be done with the following splitters : characters, tokens, or context-aware splitting used for structured data like Markdown.
  4. Parameters to experiment with : splitter, chunk size, chunk overlap, and length function.
  5. I experimented with different splitters & chunk sizes on my test prompts and finally chose a character splitter with a chunk size of 1024.
  6. How do you evaluate the quality of these chunks objectively, without relying on trial and error over parameter values? Good question! I am still looking for answers. If you have suggestions or comments, reach out to me : Neeraj@Twitter
  7. Finally, we test the chunking logic by printing the number of chunks created and the first chunk's content & metadata (a minimal sketch follows this list)
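A minimal sketch of the chunking step, assuming Langchain's CharacterTextSplitter with the chunk size mentioned above; the chunk_overlap value here is an illustrative assumption, not necessarily the repo's exact setting:

```python
from langchain.text_splitter import CharacterTextSplitter

# Character-based splitting; chunk_size matches the value chosen above,
# chunk_overlap of 128 is only an assumption for illustration
splitter = CharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=128,
    length_function=len,
)
chunks = splitter.split_documents(docs)

print(f"Created {len(chunks)} chunks")
print(chunks[0].page_content)   # content of the first chunk
print(chunks[0].metadata)       # metadata of the first chunk
```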

Fantastic! Now that we have the chunks created, we need to create vector embeddings of each chunk and store them in our vector DB. Using these vectors, we will find the chunks semantically relevant to the user's query and retrieve them from the database.

Step 3 : Create embeddings & persist them in the vector database

  1. Refer to file : document_embedings.py
  2. We are using the OpenAIEmbeddings module for creating embeddings & Pinecone as the vector database to save the embeddings (a sketch follows this list).
  3. Create embeddings and persist the DB so that it can be used later.
  4. Finally, we test it using the test prompts in the test_data.py file. You can manually check whether it is able to retrieve semantically relevant document chunks based on your test prompt.
  5. Validate : try different prompts and see whether the current system is able to retrieve relevant chunks or not. If not, continue to refine the document splitter built in the previous step.
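A sketch of this step using the classic Langchain Pinecone integration; the index name and environment variables are placeholders, and newer Pinecone/Langchain releases use a different client API:

```python
import os
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Initialise the Pinecone client (API key and environment are placeholders)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"],
)

embeddings = OpenAIEmbeddings()

# Embed every chunk and persist the vectors in an existing Pinecone index
# ("notion-knowledge-assistant" is an assumed index name)
vectorstore = Pinecone.from_documents(
    chunks, embeddings, index_name="notion-knowledge-assistant"
)

# Sanity check: retrieve semantically relevant chunks for a test prompt
for doc in vectorstore.similarity_search("an example test prompt from test_data.py", k=3):
    print(doc.page_content[:200])
```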

Step 4 : Create & pass a prompt to the LLM containing the user's query + the top 3 retrieved document chunks from our vector database

  1. Refer to file : retrievalqachain.py
  2. We are using OpenAI's gpt-3.5-turbo chat model
  3. To construct the prompt, we are using Langchain's RetrievalQA chain with a custom prompt template (a sketch follows this list).
  4. Finally, we test our complete tool by providing it with different test prompts and evaluating the output manually.
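A sketch of this step, assuming Langchain's RetrievalQA chain with a "stuff" chain type; the prompt template wording and the example question are illustrative:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Prompt template that combines the retrieved context with the user's question
template = """Use the following context from my Notion notes to answer the question.
If the answer is not in the context, say that you don't know.

Context: {context}

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Retrieve the top 3 chunks from the vector store and stuff them into the prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": prompt},
)

print(qa_chain.run("An example question about your Notion notes"))
```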

Dashboard using Gradio

Dashboard
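A minimal sketch of wiring the chain into a Gradio interface; the function name and labels are illustrative, not necessarily the repo's exact code:

```python
import gradio as gr

def answer_question(question: str) -> str:
    # Delegate to the RetrievalQA chain built in Step 4
    return qa_chain.run(question)

demo = gr.Interface(
    fn=answer_question,
    inputs=gr.Textbox(label="Ask a question about your Notion notes"),
    outputs=gr.Textbox(label="Answer"),
    title="Notion Knowledge Assistant",
)
demo.launch()
```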

Our bot is ready and can answer questions over our private data sources. This would have been unimaginable a few years back, and today we can achieve it in a few hundred lines of code!


Author : Neeraj Kumar. Say hello on LinkedIn or Twitter.