<a href="https://colab.research.google.com/github/henryantwi/-Profile/blob/main/RAG_(Retrieval_Augmented_Generation)_Chatbot_with_OpenAI_and_Upstash_Vector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG (Retrieval Augmented Generation) Chatbot with OpenAI and Upstash Vector

This notebook is an example implementation of a simple chatbot with RAG using OpenAI models and Upstash Vector.

RAG means supplying the model with data from external sources to reduce incorrect or "hallucinated" answers. This is done by appending a string called _context_ to the prompt, containing useful information about the topic.

> For example, the context can contain statistics about a company or documentation for a tool.

The model then tries to answer the question with information contained in the context. This way, the model can be instructed to say _"I don't know"_ when it can't find the information in the context.
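As a minimal sketch of what "appending a context to the prompt" looks like, the snippet below joins a few chunks into one context string and attaches it to the question. The helper name `build_prompt` and the company facts are made up for illustration; the real chunks come from the vector index later in the notebook.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Join the chunks into one context string and append it to the question."""
    context = "\n".join(context_chunks)
    return f"Question: {question}\n\nContext: {context}"

# Hypothetical facts standing in for retrieved chunks.
example_prompt = build_prompt(
    "How many employees does the company have?",
    ["Acme Corp was founded in 2010.", "Acme Corp has 250 employees."],
)
print(example_prompt)
```

The model only sees this single string, so anything missing from the context is something it should answer with _"I don't know"_.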

## Steps

The steps are as follows. We begin by breaking our data into small chunks and inserting their embeddings into a vector database. We then compute the embedding of the question and query the index for the five closest chunks, which gives us the most relevant information. Finally, we combine the question and the context into a single large prompt and pass it to the model. For the embeddings we use `text-embedding-ada-002`, and the completions model is `gpt-3.5-turbo`.
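The retrieval part of these steps can be sketched without any external service. The example below is a toy, not the notebook's method: `toy_embed` is a hypothetical stand-in that uses normalized word counts instead of a real embedding model, and `top_k` is a brute-force search over an in-memory list instead of a vector database.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Toy stand-in for an embedding model: unit-length word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query_vec, stored, k):
    """Return the texts of the k stored (text, vector) pairs with the highest dot product."""
    scored = [(sum(q * v for q, v in zip(query_vec, vec)), text) for text, vec in stored]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

toy_chunks = [
    "evans played piano in a trio",
    "the waltz was written for his niece",
    "plainfield is a city in new jersey",
]
vocab = sorted({word for chunk in toy_chunks for word in chunk.split()})

# "Insert" step: store every chunk next to its embedding.
stored = [(chunk, toy_embed(chunk, vocab)) for chunk in toy_chunks]

# "Query" step: embed the question and fetch the closest chunks.
closest = top_k(toy_embed("who was the waltz written for", vocab), stored, k=2)
print(closest[0])
```

The closest chunk is the one sharing the most words with the question. A real embedding model captures meaning rather than word overlap, but the insert-then-query flow is the same.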

## Outline:

1. Install dependencies and create an index

2. Download and chunk the data

3. Generate embeddings

4. Query and run the prompt

5. Outro


## Create an Upstash Vector Index

Create a free vector database in the [Upstash Console](https://console.upstash.com) with `1536` dimensions and the `DOT_PRODUCT` distance metric, then paste your `url` and `token` below.

The dimension size is important as it must match the dimensions of the [embedding model](https://platform.openai.com/docs/guides/embeddings).
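A note on the distance metric: OpenAI's embeddings guide describes its embeddings as normalized to length 1. Assuming that holds for the model you use, the dot product of two embeddings equals their cosine similarity. The sketch below checks this on two made-up 3-dimensional vectors; the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

# Made-up vectors; real embeddings would have 1536 entries.
u = normalize([3.0, 4.0, 0.0])
w = normalize([1.0, 2.0, 2.0])

print(dot(u, w), cosine(u, w))
```

Both values agree up to floating-point error. For vectors that are not unit length the two metrics can rank results differently.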


Generate an OpenAI key and paste it here.



In [None]:
UPSTASH_VECTOR_REST_URL="<YOUR_UPSTASH_VECTOR_REST_URL>"
UPSTASH_VECTOR_REST_TOKEN="<YOUR_UPSTASH_VECTOR_REST_TOKEN>"

OPENAI_KEY="<YOUR_OPENAI_KEY>"

## Install dependencies

In [None]:
%pip install tiktoken langchain openai upstash_vector pypdf


## Download and chunk the data

In this notebook, we use the Wikipedia article on [Bill Evans](https://en.wikipedia.org/wiki/Bill_Evans) as the data source.

We'll download the PDF version of this article, extract the text and store it in a variable called `filedata`.

In [None]:
import os
from pypdf import PdfReader

if not os.path.exists('data.pdf'):
  !wget -O data.pdf https://en.wikipedia.org/api/rest_v1/page/pdf/Bill_Evans

reader = PdfReader("data.pdf")
filedata = ""
for page in reader.pages:
    filedata += page.extract_text() + "\n"

# A sample
print(filedata[500:1000])


poser who worked primarily
as the leader of his trio.[2] His interpretations of traditional jazz
repertoire, his ways of using impressionist harmony and block
chords, and his trademark rhythmically independent, "singing"
melodic lines, continue to influence jazz pianists today.
Born in Plainfield, New Jersey, United States, he studied classical
music at Southeastern Louisiana University and the Mannes
School of Music, in New York City, where he majored in
composition and received the Artist Dipl


Define the `token_len` function to be used by the langchain splitter.

In [None]:
import tiktoken

enc = tiktoken.encoding_for_model('gpt-3.5-turbo')

def token_len(text):
    return len(enc.encode(text))

print(f"File has {token_len(filedata)} tokens")

File has 17562 tokens


`text_splitter` splits our long article into chunks of about 150 tokens each, with an overlap of 20 tokens between neighboring chunks. You can learn more about it [here](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html).
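To make chunk size and overlap concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators first); the function `sliding_window_chunks` is a hypothetical helper that only shows how neighboring chunks share tokens.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return windows

toy_tokens = list(range(10))
windows = sliding_window_chunks(toy_tokens, chunk_size=4, chunk_overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window repeats the last token of the previous one, so a sentence cut at a chunk boundary still appears with some of its surroundings in the next chunk.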

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=token_len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_text(filedata)
chunks[:4]

['Bill Evans\nEvans in 1961\nBackground information\nBirth name William John\nEvans\nBorn August 16, 1929\nPlainfield, New\nJersey, U.S.\nDied September 15,\n1980 (aged 51)\nNew York City,\nU.S.\nGenres Jazz · modal jazz ·\nthird stream · cool\njazz · smooth jazz\n· post-bop\nOccupation(s)Musician ·\ncomposer ·\narranger ·\nconductor\nInstrument(s)Piano\nDiscographyBill Evans\ndiscography\nYears active 1950s–1980[1]Bill Evans',
 'discography\nYears active 1950s–1980[1]Bill Evans\nWilliam John Evans (Augus t 16, 1929 – September 15, 1980)\nwas an American jazz pianist and composer who worked primarily\nas the leader of his trio.[2] His interpretations of traditional jazz\nrepertoire, his ways of using impressionist harmony and block\nchords, and his trademark rhythmically independent, "singing"\nmelodic lines, continue to influence jazz pianists today.\nBorn in Plainfield, New Jersey, United States, he studied classical\nmusic at Southeastern Louisiana University and the Mannes\nSchool 

## Generate embeddings

Here are some utility functions for creating embeddings for single and multiple chunks. An embedding is essentially an array of floats. When given a string, the embedding model generates an embedding representing that string.

Here we check the dimension count generated by the embedding model and it is `1536` as expected.

In [None]:
from openai import OpenAI

openai = OpenAI(
   api_key=OPENAI_KEY
)

def get_embeddings(chunks, model="text-embedding-ada-002"):
   chunks = [c.replace("\n", " ") for c in chunks]

   res =  openai.embeddings.create(input = chunks, model=model).data

   return [r.embedding for r in res]

# For a single text
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return get_embeddings([text], model)[0]

# Embedding of the first chunk
len(get_embedding(chunks[0]))

1536

Using the utility functions, we generate embeddings in batches of 10 and convert them to `Vector` objects to be inserted into our index.

The conversion to `Vector` objects is only there for type safety.

In [None]:
from tqdm import trange
from upstash_vector import Vector

vectors = []

# Generate the embeddings in batches of 10
batch_size = 10

for start in trange(0, len(chunks), batch_size):
    batch = chunks[start:start + batch_size]

    embeddings = get_embeddings(batch)

    for offset, chunk in enumerate(batch):
        # Use the chunk's global position so that ids stay unique across batches;
        # reusing ids would overwrite earlier vectors on upsert
        vec = Vector(
            id=f"chunk-{start + offset}",
            vector=embeddings[offset],
            metadata={"text": chunk},
        )

        vectors.append(vec)


Here is a sample of the vectors we will upsert into Upstash Vector.

In [None]:
print(vectors[0])
print(vectors[0].metadata)

Vector(id='chunk-0', vector=[-0.02431023307144642, -0.010620915330946445, 0.009922416880726814, -0.009548221714794636, -0.014955345541238785, 0.04093698412179947, -0.009691663086414337, -0.012379634194076061, -0.017499875277280807, -0.005054757464677095, -0.028139499947428703, 0.012859851121902466, 0.01218006294220686, 0.007434017024934292, -0.005297984462231398, 0.009591877460479736, 0.022900763899087906, -0.002179688774049282, 0.003816793905571103, -0.01680137775838375, -0.013732974417507648, -0.02116699144244194, -0.007371651008725166, 0.004795938730239868, -0.0034924910869449377, 0.013533403165638447, 0.0222646314650774, -0.02764057368040085, 0.010327795520424843, -0.009941126219928265, 0.021828070282936096, 0.014855560846626759, -0.024983784183859825, -0.033877164125442505, -0.006517237983644009, -0.02006935141980648, -0.020268922671675682, -0.0025694756768643856, 0.005974654573947191, 0.010595968924462795, 0.041660431772470474, 0.022813450545072556, -0.00118807062972337, 0.008700

Upsert all of the vectors to the index at once. Upstash supports up to 1000 vectors per request on free indexes.

In [None]:
from upstash_vector import Index

index = Index(
    url=UPSTASH_VECTOR_REST_URL,
    token=UPSTASH_VECTOR_REST_TOKEN
)

# If you want to reset your index beforehand uncomment this
# index.reset()

index.upsert(vectors)

'Success'
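If a larger document produced more vectors than the per-request limit mentioned above, the upsert could be split into several requests. The sketch below shows only the splitting; `batched` is a hypothetical helper and `fake_vectors` stands in for a list of `Vector` objects.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for a long list of Vector objects.
fake_vectors = [f"vec-{n}" for n in range(2500)]

batch_sizes = [len(batch) for batch in batched(fake_vectors, 1000)]
print(batch_sizes)  # [1000, 1000, 500]

# With the real index, each batch would be sent as its own request:
# for batch in batched(vectors, 1000):
#     index.upsert(batch)
```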

## Query and run the prompt




The first part is complete. Now we can query the index with the embedding of any text, and it returns the most relevant chunks to use as context.

In [None]:
# Embed the search text
embedding = get_embedding("waltz for debby")

# Search for similar vectors
res = index.query(vector=embedding, top_k=5, include_metadata=True)
[r.metadata['text'] for r in res]

['Helen" and "Song for Helen", for manager Helen Keane; "B minor Waltz (For Ellaine)", for girlfriend',
 'pianists Jean-Yves Thibaudet and Denis Matsuev, and many other musicians in jazz and other music\ngenres.[81]\nMany of his tunes, such as "Waltz for Debby", "Turn Out the Stars", "Very Early", and "Funkallero", have\nbecome often-recorded jazz standards.\nDuring his lifetime, Evans was honored with 31 Grammy nominations and seven Awards.[53] In 1994, he\nwas posthumously honored with the Grammy Lifetime Achievement Award.\nThe Bill Evans Jazz Festival at Southeastern Louisiana University began in 2002.[82] A Bill Evans painting',
 'List of compositions\nEvans\'s repertoire consisted of both jazz standards and original compositions. Many of these were dedicated\nto people close to him. Some known examples are: "Waltz for Debby", for his niece; "For Nenette", for his\nwife; "Letter to Evan", for his son; "NYC\'s No Lark", an anagram of Sonny Clark in memory of his friend\nthe pianist

Here is a utility function for asking questions. Note that the prompt is a combination of the question and the context.

In [None]:
def ask_question(question):
    # Get the embedding for the question
    question_embedding = get_embedding(question)

    # Search for similar vectors
    res = index.query(vector=question_embedding, top_k=5, include_metadata=True)

    # Collect the results in a context
    context = "\n".join([r.metadata['text'] for r in res])

    prompt = f"Question:{question}\n\nContext: {context}"

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": 'You are a helpful search assistant. You answer the question using only the given context. If you do not know the answer, you can say "I do not know" and the user will be notified.'},
            {"role": "user", "content": prompt},
        ],
    )

    text = response.choices[0].message.content

    print("Response: ", text)
    print("Context used in the prompt:\n" + context)

As the examples below show, the model uses the information given in the context to answer the questions. The approach taken here can be improved significantly; this is just a showcase of how Upstash Vector can be used. Hope you enjoyed this tutorial!

In [None]:
ask_question("Who is debby in the album waltz for debby?")

Response:  Debby is Bill Evans' niece. "Waltz for Debby" is a composition dedicated to her.
Context used in the prompt:
List of compositions
Evans's repertoire consisted of both jazz standards and original compositions. Many of these were dedicated
to people close to him. Some known examples are: "Waltz for Debby", for his niece; "For Nenette", for his
wife; "Letter to Evan", for his son; "NYC's No Lark", an anagram of Sonny Clark in memory of his friend
the pianist; "Re: Person I Knew", another anagram, of the name of his friend and producer Orrin
Keepnews; "We Will Meet Again", for his brother; "Peri's Scope", for girlfriend Peri Cousins; "One for
Helen" and "Song for Helen", for manager Helen Keane; "B minor Waltz (For Ellaine)", for girlfriend

Ellaine Schultz; "Laurie", for girlfriend Laurie Verchomin; "Yet Ne'er Broken", an anagram of the name of
cocaine dealer Robert Kenney; "Maxine", for his stepdaughter; "Tiffany", for Joe LaBarbera's daughter;
"Knit For Mary F." for fan Mary 

In [None]:
ask_question("Which school did Bill Evans study in?")

Response:  Bill Evans studied at Southeastern Louisiana University.
Context used in the prompt:
hangs in the Recital Hall lobby of the Department of Music and Performing Arts. The Center for
Southeastern Louisiana Studies at the Simms Library holds the Bill Evans archives.[83] He was named
Outstanding Alumnus of the year in 1969 at Southeastern Louisiana University.[84]
Evans influenced the character Seb's wardrobe in the film La La Land.[85]
Reception
Music critic Richard S. Ginell wrote: "With the passage of time, Bill Evans has become an entire school
unto himself for pianists and a singular mood unto himself for listeners. There is no more influential jazzoriented pianist—only McCoy Tyner exerts nearly as much pull among younger players and
journeymen."[80]
During his short tenure with Davis in 1958, when the band left New York to go on the road, Evans
sometimes received cold receptions from the mostly black audiences. Evans later acknowledged that some
felt his presence threatened

## Outro

In this example, we demonstrated how to implement a RAG-based chatbot using Upstash Vector and the OpenAI API. Check out [our examples](https://drive.google.com/drive/folders/1_W7MgkKGJmbfVQ_QiW_6qcfq0JZYFnhw?usp=sharing) for more AI notebooks and [follow us on X](https://twitter.com/upstash) for product updates.