How to Create a Chatbot with a Private Knowledge-Base Using RAG
==============

A tutorial for creating a chatbot that uses a private knowledge-base with retrieval augmented generation.

## What is it?

This tutorial walks through building a basic chatbot backed by a private knowledge-base. The chatbot can answer questions about a specific business, product, or domain. Unlike general-purpose chatbots (ChatGPT, etc.), a chatbot that uses retrieval augmented generation (RAG) draws its answers from documents you supply, so it can answer questions that are specific to that domain. No model training is involved: the relevant text is retrieved and added to the prompt at query time. For example, the chatbot could answer questions from your company's technical support documentation, your business brochure, information about a specific person (such as yourself), or even a personal hobby.

## How does it work?

The chatbot lets users upload documents (text files, PDF documents, HTML pages) and uses their content to construct its responses.

1. Each document is split into chunks (i.e., sentences and paragraphs), and each chunk is [stemmed](https://www.ibm.com/think/topics/stemming).
2. Each chunk is converted into a numeric vector.
3. The user enters a query for the chatbot.
4. The user's query is stemmed and converted into a numeric vector.
5. The user's query is matched against the database of content using a text similarity algorithm.
6. The top-N matches are included in the prompt as context, along with the user's query, and sent to an LLM.
7. The LLM uses the context to answer the question and respond to the user.

In [None]:
%pip install numpy pandas scikit-learn nltk PyPDF2 cohere beautifulsoup4 python-dotenv
from dotenv import load_dotenv
load_dotenv(override=True)

NUMBER_OF_MATCHES = 3 # Number of matching items to provide as context to the AI
CHUNK_SIZE = 99999 # Maximum number of characters per chunk (99999 = effectively the entire document as one chunk)

## Adding documents to the knowledge-base

Content can be added into the knowledge-base by providing text, PDF, or HTML files. The following methods read the associated files, extract the text, and return the stemmed and processed chunk data.
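
The effect of the chunk size is easiest to see on a short text. The following is a minimal sketch of greedy sentence packing, assuming a naive regular-expression sentence splitter as a stand-in for NLTK's `sent_tokenize`; the names `split_sentences` and `pack_chunks` are hypothetical helpers for illustration and are not part of the tutorial's code.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    Only a stand-in for nltk's sent_tokenize, which handles abbreviations better."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def pack_chunks(sentences, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.
    A single sentence longer than chunk_size becomes its own oversized chunk."""
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new chunk with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "RAG retrieves context. The context is added to the prompt. The LLM then answers."
sentences = split_sentences(text)
small_chunks = pack_chunks(sentences, chunk_size=40)    # roughly one sentence per chunk
one_chunk = pack_chunks(sentences, chunk_size=99999)    # whole text in a single chunk
print(small_chunks)
print(one_chunk)
```

A small chunk size gives the retrieval step finer-grained pieces to rank, while a very large value passes the whole document to the LLM as context.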

In [None]:
import os
import nltk
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from PyPDF2 import PdfReader
from bs4 import BeautifulSoup

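nltk.download('punkt')      # Sentence tokenizer data used by sent_tokenize
nltk.download('punkt_tab')  # Assumption: newer NLTK releases load this resource instead of 'punkt'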
ps = PorterStemmer()

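def process_text(text, chunk_size=CHUNK_SIZE):
    """Split text into chunks of whole sentences; return (original_chunks, processed_chunks)."""
    sentences = sent_tokenize(text)
    original_chunks = []
    chunk = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed chunk_size.
        # The `chunk and` guard avoids emitting an empty first chunk.
        if chunk and len(chunk) + len(sentence) + 1 > chunk_size:
            original_chunks.append(chunk)
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}" if chunk else sentence
    if chunk:
        original_chunks.append(chunk)
    # Stem every whitespace-separated word so queries and documents share word forms
    processed_chunks = [' '.join(ps.stem(word) for word in c.split()) for c in original_chunks]
    return original_chunks, processed_chunks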

def read_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text() or ''  # Guard against pages with no extractable text
    return process_text(text)

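def read_html(file_path):
    # Assumes UTF-8 encoded files; adjust the encoding if your documents differ
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        text = soup.get_text()
        return process_text(text)

def read_txt(file_path):
    # Assumes UTF-8 encoded files; adjust the encoding if your documents differ
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        return process_text(text)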

## Processing content

After the documents have been processed with stemming and divided into chunks, they are converted to numeric vectors and added into a database in memory.
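
To make "numeric vectors" concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is intended to mirror the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default, but tokenization here is a plain whitespace split, so the numbers should not be expected to match scikit-learn exactly on real text. The function name `tfidf_vectors` and the toy chunks are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(chunks):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per chunk."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(chunks)
    # Document frequency: the number of chunks that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

# Toy, already-stemmed chunks (hypothetical data)
toy_chunks = ["coffee tast bitter", "decaf coffee tast smooth", "report author list"]
toy_vocabulary, toy_vectors = tfidf_vectors(toy_chunks)
print(toy_vocabulary)
print([round(w, 3) for w in toy_vectors[0]])
```

Each chunk becomes one row with one column per vocabulary term. Terms that appear in fewer chunks (such as `bitter`) receive a higher weight than terms shared across chunks (such as `coffee`).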

## Finding the best matches

The LLM requires the most relevant context in order to provide an accurate response. To identify the best matching knowledge from the processed content, a similarity algorithm is executed against the user query and the document database. The top-N best matches are returned, along with their original text, for use within the prompt that will be sent to the LLM.
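
The ranking step can be illustrated in isolation. The sketch below scores each chunk vector against a query vector with cosine similarity and keeps the indices of the top-N scores. The helper names `cosine_similarity` and `top_n_matches` and the three-dimensional toy vectors are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_matches(query_vector, chunk_vectors, top_n=3):
    """Return (indices of the top_n most similar chunks, best first; all scores)."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n], scores

# Toy vectors (hypothetical): chunk 1 points in nearly the same direction as the query
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [0.8, 0.6, 0.0]
best, scores = top_n_matches(toy_query_vector, toy_chunk_vectors, top_n=2)
print(best)
print(scores)
```

Because `TfidfVectorizer` L2-normalizes its rows by default, a plain dot product between the query vector and the document matrix already yields cosine similarity, which is what `find_best_matches()` relies on.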

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
documents = []  # This will hold all processed documents
original_documents = []  # This will hold all original documents
vectors = None

def process_and_add_document(file_path, file_type):
    if file_type == 'pdf':
        original_chunks, processed_chunks = read_pdf(file_path)
    elif file_type == 'html':
        original_chunks, processed_chunks = read_html(file_path)
    elif file_type == 'txt':
        original_chunks, processed_chunks = read_txt(file_path)
    else:
        raise ValueError('Unsupported file type')
    
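    original_documents.extend(original_chunks)  # Store the original text chunks
    return add_document(processed_chunks)

def add_document(chunks):
    """Add processed chunks to the corpus and refit the TF-IDF model on all chunks."""
    global vectors  # Update the module-level matrix used by find_best_matches()
    documents.extend(chunks)
    # Refit on the whole corpus so the vocabulary and IDF weights cover every document
    vectors = vectorizer.fit_transform(documents)
    return vectors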

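def find_best_matches(query, top_n=NUMBER_OF_MATCHES):
    if vectors is None:
        raise ValueError('No documents loaded; add a document before querying')
    # Stem the query the same way as the documents; join in case it was split into several chunks
    query_processed = ' '.join(process_text(query)[1])
    query_vector = vectorizer.transform([query_processed])
    # TF-IDF rows are L2-normalized by default, so this dot product is the cosine similarity
    similarities = (query_vector @ vectors.T).toarray()
    best_match_indices = similarities.argsort()[0][-top_n:][::-1]  # Indices of the top N matches, best first
    return [original_documents[i] for i in best_match_indices], [documents[i] for i in best_match_indices]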

## Constructing the prompt for the LLM

The next step is to construct the prompt that the LLM will respond to. This is the core of the chatbot. We first provide a system prompt that tells the LLM it is an assistant and that it must use the provided context to formulate its response. It should not use general knowledge from outside the scope of that context.

The complete prompt sent to the LLM includes the instructional system prompt, the best matches from the document database as context, and the user's query.
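
The three parts of the prompt can be assembled as a list of role-tagged messages. The following is a minimal sketch, assuming a chat-style API that accepts `system` and `user` roles; the helper name `build_messages` and the `Context:` label prepended to the retrieved text are illustrative choices, not something the LLM API requires.

```python
def build_messages(system_prompt, context_chunks, user_query):
    """Assemble the instructions, retrieved context, and user query into chat messages."""
    # Separate the retrieved chunks with blank lines so they stay visually distinct
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": system_prompt},             # instructions
        {"role": "system", "content": f"Context:\n{context}"},    # retrieved knowledge
        {"role": "user", "content": user_query},                  # the actual question
    ]

demo_messages = build_messages(
    "Answer only from the provided context.",
    ["Chunk one.", "Chunk two."],
    "What is in chunk one?",
)
for message in demo_messages:
    print(message["role"], "->", repr(message["content"]))
```

Keeping the instructions and the context in separate messages makes it easy to swap the context for each query while the instructions stay fixed.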

## Calling the LLM

Once the prompt is constructed, the Cohere LLM is called via its API endpoint and a response is returned to the user.

In [13]:
import cohere

co = cohere.ClientV2(os.getenv('COHERE_API_KEY'))

def get_cohere_response(query, context):
    messages = [
        {"role": "system", "content": "You are an AI assistant. Use the provided context to answer the user's query accurately in a short and concise response. Do not generate information that is not present in the context. If the context does not contain the answer, inform the user that the information is not available."},
        {"role": "system", "content": context},
        {"role": "user", "content": query}
    ]

    response = co.chat(
        model='command-r-plus-08-2024',
        messages=messages
    )
    return response.message.content[0].text.strip()

## Putting it all together

The `chat()` function shown below puts all the pieces together. It calls `find_best_matches()` to retrieve the most relevant context for the user's query, then passes the query and that context to `get_cohere_response()`, which calls the Cohere LLM and returns its response.
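
Before calling a real API, the retrieve-then-generate flow can be exercised offline with stand-ins. The sketch below is one possible way to do that: `make_chat`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the fake generator simply echoes its inputs instead of calling an LLM.

```python
def make_chat(retrieve, generate):
    """Build a chat function from a retriever and a generator callable."""
    def chat_fn(user_query):
        matches = retrieve(user_query)         # step 1: find relevant chunks
        context = "\n\n".join(matches)         # step 2: join them into one context string
        return generate(user_query, context)   # step 3: ask the (fake or real) LLM
    return chat_fn

received_contexts = []  # records what the generator was given, for inspection

def fake_retrieve(query):
    # Stand-in for the similarity search: always returns the same two chunks
    return ["Chunk A about coffee.", "Chunk B about decaf."]

def fake_generate(query, context):
    # Stand-in for the LLM call: echoes the query and reports the context size
    received_contexts.append(context)
    return f"Answer to {query!r} using {len(context)} characters of context"

offline_chat = make_chat(fake_retrieve, fake_generate)
reply = offline_chat("Which coffee is decaf?")
print(reply)
```

Passing the retriever and generator in as arguments keeps the flow testable without an API key; the real functions can be supplied in their place once the wiring is confirmed.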

In [18]:
def reset_database():
    global documents, original_documents, vectors
    documents = []
    original_documents = []
    vectors = None

def initialize(file_name):
    file_type = os.path.splitext(file_name)[1].lstrip('.').lower()  # e.g. 'report.PDF' -> 'pdf'
    return process_and_add_document(file_name, file_type)

def chat(user_query, is_debug = False):
    original_best_matches, processed_best_matches = find_best_matches(user_query)
    context = "\n\n".join(original_best_matches)  # Concatenate the top-N best matches as context
    if is_debug:
        print(f"Context: {context}")
    response = get_cohere_response(user_query, context)
    return response

## Example

The example below initializes the knowledge-base using a PDF document `climatechange.pdf`. The document is chunked, stemmed, vectorized, and added to the in-memory database. The user then provides a query, the best matches are located in the document database, and those matches are included as context alongside the user's query in the prompt to the LLM. Finally, the response is printed to the output.

In [15]:
reset_database()
vectors = initialize('climatechange.pdf')
response = chat('Who are the authors of the report?')
print(response)

The report was written by members of the Working Group I Technical Support Unit (WGI TSU) and several authors of the report. The authors are:

- Sarah Connors (WGI TSU)
- Sophie Berger (WGI TSU)
- Clotilde Péan (WGI TSU)
- Govindasamy Bala (Chapter 4 author)
- Nada Caud (WGI TSU)
- Deliang Chen (Chapter 1 author)
- Tamsin Edwards (Chapter 9 author)
- Sandro Fuzzi (Chapter 6 author)
- Thian Yew Gan (Chapter 8 author)
- Melissa Gomis (WGI TSU)
- Ed Hawkins (Chapter 1 author)
- Richard Jones (Atlas Chapter author)
- Robert Kopp (Chapter 9 author)
- Katherine Leitzell (WGI TSU)
- Elisabeth Lonnoy (WGI TSU)
- Douglas Maraun (Chapter 10 author)
- Valérie Masson-Delmotte (WGI Co-Chair)
- Tom Maycock (WGI TSU)
- Anna Pirani (WGI TSU)
- Roshanka Ranasinghe (Chapter 12 author)
- Joeri Rogelj (Chapter 5 author)
- Alex C. Ruane (Chapter 12 author)
- Sophie Szopa (Chapter 6 author)
- Panmao Zhai (WGI Co-Chair)


## Completed Chatbot

By providing a continuous loop to chat with the chatbot, the user can repeatedly enter queries for the chatbot to respond to. Each query performs the same process of locating the best matching context and calling the LLM with the constructed prompt.

In this scenario, an article reviewing [coffee](https://medium.com/illumination/i-tried-10-decaf-coffees-as-a-first-time-coffee-drinker-heres-what-i-found-a8c5fb93a40e?sk=03a1bb8109f779521d9ffec8f5f275ae) is provided as knowledge to the LLM. This allows it to answer questions specific to that article.

In [19]:
reset_database()
vectors = initialize('coffee.html')
while True:
    user_query = input("Enter your query (type 'quit' or 'exit' to stop): ")
    if user_query.lower() in ['quit', 'exit']:
        print("Exiting the chat. Goodbye!")
        break
    print('--------------------')
    print(f"User: \"{user_query}\"")
    response = chat(user_query)
    print("AI:", response, flush=True)


--------------------
User: "Which coffee has the highest rating?"
AI: The highest-rated coffee in the review is Merit Coffee Espresso Decaf.
--------------------
User: "What was the author afraid that coffee might do to them?"
AI: The author was afraid that coffee might stain their teeth, upset their stomach, or make them jittery.
--------------------
User: "Who is the author?"
AI: The author of the article is Kory Becker.
--------------------
User: "How was the review for Folgers decaf?"
AI: The Folgers Instant Decaf coffee was ranked 8th in the list. It was described as having a bitter and dark flavor, with no impact on the stomach or caffeine effect. The coffee flakes dissolved quickly in water, and the price was noted as being a bit more expensive compared to some other options.
Exiting the chat. Goodbye!
