# Using Yellowbrick as the vector store for ChatGPT

This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). What you'll need:

1. An account on the [Yellowbrick sandbox](https://cloudlabs.yellowbrick.com/)
2. An API key from [OpenAI](https://platform.openai.com/)

The tutorial is divided into four parts. First, we'll create a simple chatbot that interacts with ChatGPT using the langchain library. Second, we'll extend it into a chatbot that is backed by ChatGPT and uses Yellowbrick as its vector store. Third, we'll show how to create a table in Yellowbrick to store your own embeddings, and finally we'll describe how to load your own embeddings into that table.

## Part 1: Creating a chatbot backed by ChatGPT

Our chatbot integrates with ChatGPT via the langchain library, so you'll need to install this first. To do this, clone this fork of the langchain GitHub project, which includes the Yellowbrick vector store connector:
1. `git clone https://github.com/markcusack/langchain.git`
2. `cd PATH/TO/REPO/langchain/libs/langchain`
3. `pip3 install -e .`

To get an API key for OpenAI and install the OpenAI client library:
1. Register at https://platform.openai.com/
2. Add a payment method
3. Create an API key
4. `pip3 install openai`

Install the PostgreSQL connection library for Python:
1. `pip3 install psycopg2-binary`
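
Before moving on, it can help to confirm that the three packages installed above are importable. The following is a minimal sketch that uses only the standard library. The helper name `missing_packages` is hypothetical and is not part of langchain, openai, or psycopg2.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level module names provided by the install steps above
    # (the psycopg2-binary distribution provides the `psycopg2` module).
    missing = missing_packages(["langchain", "openai", "psycopg2"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as not installed, rerun the corresponding `pip3 install` step in the same Python environment that runs this notebook.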

Run the following script, enter the question "What is Yellowbrick?", and see what response you get before we give ChatGPT some additional context about Yellowbrick...


In [None]:
# Query using ChatGPT without Yellowbrick vector store integration

import os
import sys
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key here and hit enter:")

# Set up the chat model and specific prompt

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template="""If you don't know the answer, just say that "I don't know", don't try to make up an answer.
"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256)  # Modify model_name if you have access to GPT-4
chain = LLMChain(llm=llm,
                 prompt=prompt,
                 verbose=False,
)

from IPython.display import display, Markdown
def print_result(result):
  output_text = f"""### Question: 
  {query}
  ### Answer: 
  {result['text']}
  """
  display(Markdown(output_text))

# Use the chain to query

while True:
  try:
    query = input("ybChatKB> ")
    print_result(chain(query))
  except KeyboardInterrupt:
    break

## Part 2: Creating a chatbot backed by ChatGPT that uses Yellowbrick as the vector store

Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing
all of the Yellowbrick product documentation.
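
Conceptually, retrieval from a vector store means embedding the question and returning the stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity on toy three-dimensional vectors. It is an illustration only, not the Yellowbrick connector's implementation. The function names and the vector values are made up for the example, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings (assumed values).
stored_chunks = [
    ("chunk about loading data", [0.9, 0.1, 0.0]),
    ("chunk about backups", [0.0, 1.0, 0.0]),
    ("chunk about queries", [0.7, 0.0, 0.7]),
]
results = top_k([1.0, 0.0, 0.0], stored_chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The retrieved chunks are then passed to the chat model as context, which is the role of the `{summaries}` placeholder in the prompt used in this part.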

Modify the yellowbrick_connection_string to use your user name, password, host, and database, and then run the script. Try entering the same question as above, "What is Yellowbrick?", and compare the results with the prior run.
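
If your password contains characters such as `@`, `/`, or `:`, it generally needs to be percent-encoded before it is placed in a connection URI. The following is a minimal sketch, assuming the `postgres://` URI form shown in this tutorial. `build_connection_string` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database, port=5432):
    """Assemble a postgres:// URI, percent-encoding the user name and password."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
yellowbrick_connection_string = build_connection_string(
    "USERNAME", "p@ss/word", "HOSTNAME", "DATABASE"
)
print(yellowbrick_connection_string)
```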

In [None]:
# Query using ChatGPT with Yellowbrick vector store integration

import os
import sys
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key here and hit enter:")

# Change the Yellowbrick connection string to reflect your target instance and database

yellowbrick_connection_string = "postgres://USERNAME:PASSWORD@HOSTNAME:5432/DATABASE"

# Set up the chat model and specific prompt

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template="""Use the following pieces of context to answer the user's question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.vectorstores import Yellowbrick

vector_store = Yellowbrick(OpenAIEmbeddings(),
                            yellowbrick_connection_string,
                            "oai_embeddings", # Change the table name to reflect your embeddings
)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256)  # Modify model_name if you have access to GPT-4
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)

from IPython.display import display, Markdown
def print_result(result):
  output_text = f"""### Question: 
  {query}
  ### Answer: 
  {result['answer']}
  ### Sources: 
  {result['sources']}
  ### All relevant sources:
  {' '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
  """
  display(Markdown(output_text))

# Use the chain to query

while True:
  try:
    query = input("ybChatKB> ")
    print_result(chain(query))
  except KeyboardInterrupt:
    break

## Part 3: Creating a table in Yellowbrick to store your own embeddings

If you want to load your own document embeddings into Yellowbrick, you should create your own table to store them in. Note that the Yellowbrick database containing the table must be UTF-8 encoded.

Create a table in a UTF-8 database with the following schema, providing a table name of your choice:

<pre>
CREATE TABLE oai_embeddings (
    id uuid,
    embedding_id integer,
    text character varying(60000),
    metadata character varying(1024),
    embedding double precision
)
DISTRIBUTE ON (id);
</pre>

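**Note**: Be sure to use the same table name in every script that accesses your embeddings. The examples in this tutorial use several names (`oai_embeddings`, `my_embeddings`, `oai_embeddings_2000`), so adjust them to match your table.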
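
The schema declares `embedding` as a single `double precision` value rather than an array, which suggests that each vector is spread over many rows, one per element, with `embedding_id` giving the element's position. The sketch below illustrates that layout under this assumption. `vector_to_rows` is a hypothetical helper and the sample values are made up. When you load documents in Part 4, the langchain connector writes the rows for you.

```python
import json
import uuid


def vector_to_rows(text, metadata, vector):
    """Expand one embedded chunk into (id, embedding_id, text, metadata, embedding) rows."""
    doc_id = str(uuid.uuid4())  # one id shared by every row of this chunk
    metadata_json = json.dumps(metadata)
    return [
        (doc_id, position, text, metadata_json, float(value))
        for position, value in enumerate(vector)
    ]


rows = vector_to_rows(
    "example chunk text",       # made-up chunk text
    {"source": "example.md"},   # made-up metadata
    [0.12, -0.03, 0.47],        # toy 3-element vector
)
for row in rows:
    print(row)
```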

To help, here's a script for connecting to a Yellowbrick instance and creating the table...

In [None]:
import psycopg2

# Establish connection parameters to Yellowbrick

yellowbrick_connection_string = "postgres://USERNAME:PASSWORD@HOST:5432/DATABASE"

# Establish a connection to the Yellowbrick database
try:
    conn = psycopg2.connect(yellowbrick_connection_string)
except psycopg2.Error as e:
    print(f"Error connecting to the database: {e}")
    raise SystemExit(1)

# Create a cursor object using the connection
cursor = conn.cursor()

# Define the SQL statement to create a table
create_table_query = """
CREATE TABLE my_embeddings (
    id uuid,
    embedding_id integer,
    text character varying(60000),
    metadata character varying(1024),
    embedding double precision
)
DISTRIBUTE ON (id);
"""

# Execute the SQL query to create a table
try:
    cursor.execute(create_table_query)
    print("Table 'my_embeddings' created successfully!")
except psycopg2.Error as e:
    print(f"Error creating table: {e}")
    conn.rollback()

# Commit changes and close the cursor and connection
conn.commit()
cursor.close()
conn.close()


## Part 4: Loading your own document embeddings into Yellowbrick

If you want to load your own document embeddings into Yellowbrick, you can use the following script...

In [None]:
# Build the embeddings store

# Load documents and split them into chunks for conversion to embeddings

DOCS_FOLDER = "/Users/markcusack/llms/yb/dita2md"  # Folder to check out to
REPO_DOCUMENTS_PATH = ""  # Set to "" to index the whole data folder
DOCUMENT_BASE_URL = "https://yellowbrick.com/resources/product-documentation/"  # Actual URL
DATA_STORE_DIR = "/Users/markcusack/llms/yb/data_store"

import os
import pathlib
import re
from getpass import getpass

from langchain.docstore.document import Document
from langchain.vectorstores import Yellowbrick
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key here and hit enter:")

# Change the Yellowbrick connection string to reflect your target instance and database

yellowbrick_connection_string = "postgres://USERNAME:PASSWORD@HOST:5432/DATABASE"

# Change the table to reflect your destination table for the embeddings in Yellowbrick

table = "oai_embeddings_2000"

name_filter = "**/*.md"
separator = "\n## "  # This separator assumes the Markdown docs in the repo use ## as the main logical header most of the time
chunk_size_limit = 2000
max_chunk_overlap = 200

repo_path = pathlib.Path(os.path.join(DOCS_FOLDER, REPO_DOCUMENTS_PATH))
document_files = list(repo_path.glob(name_filter))

def convert_path_to_doc_url(doc_path):
  # Convert a local file path to its document URL: make it relative to the repo folder and drop the file extension
  relative_path = pathlib.Path(doc_path).relative_to(repo_path).with_suffix("")
  return DOCUMENT_BASE_URL.rstrip("/") + "/" + relative_path.as_posix()

documents = [
    Document(
        page_content=file.read_text(),
        metadata={"source": convert_path_to_doc_url(file)}
    )
    for file in document_files
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size_limit, 
                                               chunk_overlap=max_chunk_overlap,
                                               separators=[separator, "\n\n", "\n", ",", " ", ""]
)
split_docs = text_splitter.split_documents(documents)

docs_text = [doc.page_content for doc in split_docs]

embeddings = OpenAIEmbeddings()
vector_store = Yellowbrick.from_documents(documents=split_docs, 
                                          embedding=embeddings, 
                                          connection_string=yellowbrick_connection_string, 
                                          table=table
)
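
The script above caps each chunk at 2000 characters and lets consecutive chunks share up to 200 characters, so that context around a chunk boundary is not lost. The sketch below illustrates fixed-size chunking with overlap on a toy string. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break at the separators it is given, and `chunk_with_overlap` is a hypothetical helper, not a langchain function.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts `overlap` characters before the
    previous chunk ended.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Larger chunks give the model more context per retrieved document but leave room for fewer of them in the prompt, so `chunk_size_limit` and `max_chunk_overlap` are worth tuning for your own documents.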