# Vector Database and the Use of Metadata

## Framework of RAG

1. **Create a vector store:** This involves embedding documents into a vector space using a model like Sentence Transformers or OpenAI's text-embedding-ada-002.

2. **Add documents to the vector store:** This step involves storing the embedded documents in a vector database such as Pinecone, Weaviate, or a local vector store like FAISS.

3. **Query the vector store by calculating similarity between the query and the documents:** This is done by embedding the query and comparing it to the stored document vectors using cosine similarity or other distance metrics.

<p align="center">
  <img src="./resource/vec.jpg" alt="Vector Search" width="800"/>
</p>
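
The three steps above can be sketched end to end in a few lines. This is only a toy illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words from a tiny fixed vocabulary (it is not a real embedding model), and the "vector store" is just a Python list. A real system would swap in one of the embedding models and vector databases named above.

```python
import math

# Hypothetical stand-in for an embedding model: it counts how often each
# vocabulary word appears in the text. Real models produce dense vectors.
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Step 1: create the (toy) vector store
store = []

# Step 2: add documents together with their embeddings
documents = [
    "the cat is a quiet pet",
    "the dog is a loyal pet",
    "the car engine is loud on the road",
]
for doc in documents:
    store.append({"text": doc, "vector": toy_embed(doc)})

# Step 3: embed the query and rank the stored documents by similarity
def search(query, k=2):
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

top = search("which pet is a dog", k=1)
print(top)  # ['the dog is a loyal pet']
```

The ranking logic in step 3 is the part that carries over to a real system; only the embedding function and the storage layer change.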


## Vectors: word embeddings vs. document embeddings

Vectors are numerical representations of words or documents in a high-dimensional space. They capture semantic meaning, allowing for similarity comparisons. But there are two types of vectors that can easily confuse beginners:

1. **Word embeddings:** These are vectors representing individual words, capturing their meanings and relationships. They are typically generated using models like Word2Vec or GloVe.

2. **Document embeddings:** These vectors represent entire documents, capturing the overall meaning and context. They are often generated by aggregating word embeddings or by using models like Sentence Transformers or OpenAI's text-embedding-ada-002.

For RAG systems, we primarily use document embeddings, as they allow us to compare the meaning of entire documents rather than just individual words. Note that the `"document"` in this context can be a sentence, paragraph, or even an entire article, depending on the granularity of the information we want to retrieve. It is NOT limited to what we usually think of as a "document" in the traditional sense. This confused the hell out of me when I first learned about RAG systems, so I want to clarify it here.
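
To make "aggregating word embeddings" concrete, here is a minimal sketch of one simple aggregation, mean pooling. The 3-dimensional word vectors are made up for illustration, and `mean_pool` is a hypothetical helper. Models like Sentence Transformers do something more sophisticated internally, so treat this as an intuition aid rather than a description of how they work.

```python
import numpy as np

# Made-up 3-dimensional word vectors, for illustration only
word_vectors = {
    "rag": np.array([0.9, 0.1, 0.0]),
    "is": np.array([0.1, 0.1, 0.1]),
    "awesome": np.array([0.2, 0.8, 0.3]),
}

def mean_pool(sentence):
    """Average the word vectors of a sentence into one document vector."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

doc_vector = mean_pool("RAG is awesome")
print(doc_vector.shape)  # (3,): one vector for the whole sentence
print(doc_vector)
```

However many words the sentence has, the result is a single vector with the same number of dimensions as the word vectors.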

There are many models available for generating document embeddings, each with its own strengths and weaknesses. 

Table 5.1 Popular models for document embeddings

| Model Name | Description | Use Case | Local Option | Commercial Option |
| --- | --- | --- | --- | --- |
| OpenAI's text-embedding-ada-002 | A hosted model accessed through OpenAI's API; produces 1536-dimensional embeddings. | General-purpose document embedding. | No | Yes |
| BAAI/bge-base-en-v1.5 | An open model for English text; produces 768-dimensional embeddings. | General-purpose document embedding. | Yes | No |
| intfloat/e5-base | An open model for English text; produces 768-dimensional embeddings and expects `query: ` / `passage: ` prefixes on its inputs. | General-purpose document embedding. | Yes | No |




I will use the `BAAI/bge-base-en-v1.5` model, loaded through the sentence-transformers library, to generate document embeddings in this notebook. It is available for local use, which makes it a good choice for many applications. It converts each text into a 768-dimensional vector, returned as a numpy ndarray; 768 is a common size for document embeddings.



In [None]:
!pip install sentence-transformers
!pip install tf-keras

In [None]:
# Clean reinstall: remove existing Keras packages, then install up-to-date versions
!pip uninstall -y keras keras-nightly keras-preprocessing keras-vis
!pip uninstall -y tf-keras-nightly tf-keras
!pip install tf-keras --upgrade
!pip install --upgrade transformers sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# Load the model; if it has not been downloaded yet, it is downloaded automatically on first use.
# The model runs locally. It is not a big model and can be run on a laptop.


In [None]:
query = ["What are the symptoms of COVID-19?"]
# Note: a "query: " prefix is a convention of the E5 models (e.g. intfloat/e5-base); bge-base-en-v1.5 does not use it
embedding = model.encode(query, normalize_embeddings=True)
print(embedding.shape)  # Output: (1, 768): one input text, represented by a 768-dimensional vector
print(embedding[:, :5])  # first 5 elements of the embedding
# model.encode(list_of_str) returns a 2D numpy array with one row per input text
# normalize_embeddings=True scales each embedding to length 1

In [None]:
res = model.encode("RAG is awesome") 
#model.encode(str) returns a numpy array which represents the embedding of the input text
print(res.shape) # Output: (768,) 
# the bge-base-en-v1.5 model produces 768-dimensional embeddings ((768,) means a 1D array with 768 elements)
print(res[:5]) # first 5 elements of the embedding

In [None]:
# Example of encoding multiple words
# We can pass a list of words to the model to get their embeddings

words = ['apple', 'car', 'fruit', 'automobile', 'love', 'sentiment','book']
vectorized_words = model.encode(words) 
# returns a 2D numpy array where each row is the embedding of a word

print(vectorized_words.shape) 
# Output: (7, 768), which means there are 7 words, each represented by a 768-dimensional vector

print(vectorized_words[:5,:5]) # first 5 word embeddings' first 5 elements
# there's a note on numpy slicing at the end of the notebook.

[→Jump to numpy slicing reference←](#quick-refresher-note-on-slicing-numpy-arrays)

In [None]:
# Cosine Similarity Function
def cosine_similarity(v1, array_of_vectors):
    """
    Compute the cosine similarity between a vector and an array of vectors.
    
    Parameters:
    v1 (array-like): The first vector.
    array_of_vectors (array-like): An array of vectors or a single vector.

    Returns:
    list: A list of cosine similarities between v1 and each vector in array_of_vectors.
    """
    # Ensure that v1 is a numpy array
    v1 = np.array(v1)
    # Initialize a list to store similarities
    similarities = []
    
    # Check if array_of_vectors is a single vector
    if len(np.shape(array_of_vectors)) == 1:
        array_of_vectors = [array_of_vectors]
    
    # Iterate over each vector in the array
    for v2 in array_of_vectors:
        # Convert the current vector to a numpy array
        v2 = np.array(v2)
        # Compute the dot product of v1 and v2
        dot_product = np.dot(v1, v2)
        # Compute the norms of the vectors, np.linalg.norm(vector) computes the length of the vector
        norm_v1 = np.linalg.norm(v1) 
        norm_v2 = np.linalg.norm(v2)
        # Compute the cosine similarity and append to the list
        similarity = dot_product / (norm_v1 * norm_v2)
        similarities.append(similarity)
    return [float(x) for x in similarities]
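

The loop above is easy to read, but the same computation can be written without a Python loop. The following is a sketch of a vectorized variant; `cosine_similarity_vectorized` is a name I made up for this example, and it returns a numpy array instead of a list of floats.

```python
import numpy as np

def cosine_similarity_vectorized(v1, array_of_vectors):
    """Cosine similarity between v1 and every row of array_of_vectors."""
    v1 = np.asarray(v1, dtype=float)
    # atleast_2d turns a single vector into a matrix with one row
    matrix = np.atleast_2d(np.asarray(array_of_vectors, dtype=float))
    dots = matrix @ v1                                    # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(v1)
    return dots / norms

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = cosine_similarity_vectorized([1.0, 0.0], vectors)
print(sims)  # approximately [1.0, 0.0, 0.7071]
```

One related shortcut: when embeddings are created with `normalize_embeddings=True`, every vector already has length 1, so the division changes nothing and the dot product alone gives the cosine similarity.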



In [None]:
# Using the cosine similarity function to compare the word 'car' with other words
word = 'car'
# Look up the embedding of the reference word once, before the loop
vectorized_word = vectorized_words[words.index(word)]
print(f"{word}:")
for w, vector in zip(words, vectorized_words):
    similarity = cosine_similarity(vectorized_word, vector)[0]
    print(f"\t{w}:\t\tCosine Similarity: {similarity:.4f}")
print("\n\n\n")


In [None]:
# cosine similarity between the two vectors
def words_cosine_similarity(v1, v2):
    # ensure that v1 and v2 are numpy arrays
    v1 = np.array(v1)
    v2 = np.array(v2)
    # compute the dot product of v1 and v2
    dot_product = np.dot(v1, v2)
    # compute the norms of the vectors, np.linalg.norm(vector) computes the length of the vector
    norm_v1 = np.linalg.norm(v1) 
    norm_v2 = np.linalg.norm(v2)
    # compute the cosine similarity
    similarity = dot_product / (norm_v1 * norm_v2)
    return similarity

In [None]:
v1_cow = model.encode("cow")
v2_apple = model.encode("apple")
v3_alien = model.encode("alien")
v4_dog = model.encode("dog")


print("similarity between apple and alien:", words_cosine_similarity(v2_apple, v3_alien)) 
print("similarity between cow and alien:", words_cosine_similarity(v1_cow, v3_alien))
print("similarity between cow and dog:", words_cosine_similarity(v1_cow, v4_dog)) 
print("similarity between apple and dog:", words_cosine_similarity(v2_apple, v4_dog))         

### Limitation of Local Models: Input size

There is a limit to how much text these models can process at once, leading to truncation of text that exceeds this limit. When truncation occurs, all information beyond a certain point in the text is lost, potentially impacting the effectiveness and accuracy of the embedding.

To demonstrate this, I will use the `BAAI/bge-base-en-v1.5` model to generate document embeddings for a long text. The model has a maximum input size of 512 tokens, so any text longer than that will be truncated.

Note that tokens are not the same as words. A token can be a word, part of a word, or even punctuation. For example, the sentence "I love pizza!" might be tokenized into four tokens: "I", "love", "pizza", and "!". The exact number of tokens in a piece of text can vary depending on the tokenizer used by the model.

When we call `len()` on a long text, it returns the number of `characters` in the text, not the number of `tokens` or `words`. This can lead to confusion: for English text the number of tokens is usually much smaller than the number of characters, so a character count cannot be compared directly with the model's token limit.
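
The difference between characters, words, and tokens can be seen with a toy example. The `toy_tokenize` function below is a hypothetical stand-in that simply splits punctuation away from words. It is not the tokenizer used by `BAAI/bge-base-en-v1.5`, which may split the text differently, for example by breaking rare words into several pieces.

```python
import re

def toy_tokenize(text):
    # Toy rule: each run of word characters is a token,
    # and each punctuation mark is a token of its own
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I love pizza!"
tokens = toy_tokenize(sentence)

print(len(sentence))          # 13 characters
print(len(sentence.split()))  # 3 whitespace-separated words
print(tokens)                 # ['I', 'love', 'pizza', '!']
print(len(tokens))            # 4 tokens
```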



In [None]:
# Read the file with a context manager so it is closed automatically
with open("./resource/long_text.txt") as f:
    long_text = f.read()
print(len(long_text))  # the number of characters in the text, not tokens

12181


In [50]:
long_text_embedding = model.encode(long_text, normalize_embeddings=True)
print(long_text_embedding.shape)  # Output: (768,), which means the long text is represented by a 768-dimensional vector
print(long_text_embedding[:5])  # first 5 elements of the embedding

long_text_embedding_truncated = model.encode(long_text[:3000], normalize_embeddings=True)
print(long_text_embedding_truncated.shape)  # Output: (768,), which means the truncated long text is also represented by a 768-dimensional vector
print(long_text_embedding_truncated[:5])  # first 5 elements of the truncated embedding

np.array_equal(long_text_embedding, long_text_embedding_truncated)

(768,)
[ 0.0412914   0.03247961  0.01876977 -0.04653402  0.04523577]
(768,)
[ 0.0412914   0.03247961  0.01876977 -0.04653402  0.04523577]


True

The two embeddings above are identical because the model truncates any input that is longer than its maximum input size. Both the full text (12,181 characters) and its first 3,000 characters exceed the 512-token limit, so in both cases the model sees the same first 512 tokens and silently ignores the rest. Everything after the cut-off contributes nothing to the embedding.

However, if we cut the text short enough to fit within the model's maximum input size, the model embeds exactly what we give it. The first 1,000 characters fit within the limit, so the model sees less text than in the two cases above, and the resulting embedding is different from the embedding of the original long text.
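
The truncation behaviour can be imitated without loading a model. In the sketch below, `toy_model_input` is a hypothetical helper that returns the tokens a model with a hard input limit would actually see. Whitespace splitting stands in for the real tokenizer, and the text is synthetic, so the numbers are only illustrative.

```python
MAX_TOKENS = 512  # assumed limit, matching the 512-token limit mentioned above

def toy_model_input(text, max_tokens=MAX_TOKENS):
    """Return the tokens a model with a hard input limit would actually see."""
    return text.split()[:max_tokens]

# Synthetic text with 2000 "tokens": word0 word1 ... word1999
all_words = [f"word{i}" for i in range(2000)]
synthetic_text = " ".join(all_words)

full = toy_model_input(synthetic_text)
prefix_long = toy_model_input(" ".join(all_words[:600]))   # still over the limit
prefix_short = toy_model_input(" ".join(all_words[:100]))  # under the limit

print(len(full))             # 512
print(full == prefix_long)   # True: both are cut to the same first 512 tokens
print(full == prefix_short)  # False: the short prefix contains less text
```

Identical model input produces an identical embedding, which is why only a prefix that falls below the limit changes the result.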

In [51]:
long_text_small_truncated = model.encode(long_text[:1000], normalize_embeddings=True)
print(long_text_small_truncated.shape)  # Output: (768,), which means the small truncated long text is also represented by a 768-dimensional vector
print(long_text_small_truncated[:5])  # first 5 elements of the small truncated embedding
np.array_equal(long_text_embedding, long_text_small_truncated)


(768,)
[ 0.03915093  0.00793503  0.03163996 -0.05213862  0.05705126]


False

## Models for Embeddings: commercial options

Besides the local models, there are also commercial options available for generating document embeddings. These models are hosted by companies and accessed via APIs. They may provide higher-quality embeddings than small local models, but they usually come with usage costs. In addition, since they are hosted by third-party companies, there may be concerns about data privacy, security, and accessibility.

I use OpenAI's API for generating embeddings in my projects. It provides a simple interface and high-quality embeddings, making it a great choice for many applications.

It is important to note that the choice of embedding model can significantly impact the performance of your RAG system. Once you choose a model, you should stick with it throughout your project. Embeddings from different models live in different vector spaces and cannot be compared with each other, so switching models mid-project means re-embedding all your documents.
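
One way to guard against accidentally mixing models is to record which model produced the stored vectors and to check every new vector against that record. The `EmbeddingIndex` class below is a hypothetical toy written for this note, not part of any library; the zero vectors are placeholders for real embeddings.

```python
class EmbeddingIndex:
    """Toy index that remembers which model produced its vectors."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = []

    def add(self, vector, model_name):
        # Refuse vectors that come from a different model or have the wrong size
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name}, got a vector from {model_name}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))


index = EmbeddingIndex("BAAI/bge-base-en-v1.5", dim=768)
index.add([0.0] * 768, model_name="BAAI/bge-base-en-v1.5")  # accepted

try:
    # A 1536-dimensional vector from another model must not be mixed in
    index.add([0.0] * 1536, model_name="text-embedding-ada-002")
except ValueError as err:
    print("rejected:", err)
```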

## Data Storage: FAISS vs Chroma 

If we have a lot of documents, it becomes inefficient to keep all the document embeddings in a plain in-memory array and compare the query against every one of them. Instead, we can use a `vector database` to store and retrieve the embeddings efficiently.

The choice of vector database confused me a lot when I first started learning about RAG systems. These terms were, and still are, so abstract to me that I decided to just follow the recommendations from the course and from ChatGPT. As far as I can tell, I should be fine with either FAISS or Chroma, as they are both popular choices for storing and retrieving vectors.

My application is very small and personal, so I figure I don't need to worry about scalability or performance issues as I wouldn't be able to tell the difference anyway. I will use Chroma, together with LangChain and OpenAI's commercial API, to build my simple RAG system.

## LangChain: beginner's guide

## Metadata 



## Quick refresher note on slicing numpy arrays

The basic syntax for slicing a 2D numpy array is `arr[start:stop:step, start:stop:step]`, with one `start:stop:step` slice per dimension (rows first, then columns).

- `start` is the index to start slicing from (inclusive)
  
- `stop` is the index to stop slicing at (exclusive)
 
- `step` is the step size (default is 1)
  
- if you omit `start`, it defaults to 0, so the slice looks like `arr[:stop]`
  
- `step` is optional; if you omit it, it defaults to 1, so `arr[start:stop]` is equivalent to `arr[start:stop:1]`
  
- if you use `[:]`, it means all elements along that dimension; for example, `arr[:, :]` means all rows and all columns, and `arr[:, 0:2]` means all rows and the first two columns
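
The rules above also cover `step`, which is easy to overlook. A short sketch using a 4x5 array holding the numbers 1 to 20 (built here with `np.arange` and `reshape`):

```python
import numpy as np

arr = np.arange(1, 21).reshape(4, 5)  # 4 rows, 5 columns, values 1..20

every_other_row = arr[::2]         # rows 0 and 2
every_other_column = arr[:, ::2]   # columns 0, 2 and 4
middle_block = arr[1:3, 1:4]       # rows 1-2, columns 1-3
first_column_reversed = arr[::-1, 0]  # a negative step walks backwards

print(every_other_row)        # [[ 1  2  3  4  5] [11 12 13 14 15]]
print(every_other_column)     # [[ 1  3  5] [ 6  8 10] [11 13 15] [16 18 20]]
print(middle_block)           # [[ 7  8  9] [12 13 14]]
print(first_column_reversed)  # [16 11  6  1]
```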

In [3]:
# quick note on slicing numpy arrays

import numpy as np

# Create a 4x5 numpy array (4 rows, 5 columns)
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]])

# Select the first row
first_row = arr[0]

# Slice the first two rows
first_two_rows = arr[0:2]

# Slice the first two columns: [:, 0:2] means all rows, columns 0 and 1
first_two_columns = arr[:, 0:2]

# Slice the first two rows and first two columns
first_two_rows_and_columns = arr[0:2, 0:2]
last_element = arr[-1, -1]  # Access the last element in the array

# formatting the output
print("Array:\n", arr)
print("First row:", first_row)
print("=="*15)
print("First two rows:\n", first_two_rows)  
print("=="*15)
print("First two columns:\n", first_two_columns)
print("=="*15)
print("First two rows and columns:\n", first_two_rows_and_columns)
print("=="*15)
print("Last element:", last_element)  # Output: 20



Array:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
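First row: [1 2 3 4 5]
==============================
First two rows:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
==============================
First two columns:
 [[ 1  2]
 [ 6  7]
 [11 12]
 [16 17]]
==============================
First two rows and columns:
 [[1 2]
 [6 7]]
==============================
Last element: 20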
