## Document embedding and vector store

There are many ways to build a retriever for a RAG model. One of the most common is to combine document embeddings with a vector store.

The document embedding is a fixed-size representation of the document, and the vector store is a database that stores the embeddings of all the documents.

To retrieve documents, we compute the similarity between the query embedding and each document embedding in the vector store, then return the documents with the highest similarity scores. These documents are expected to be relevant to the query.
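The retrieval step described above can be sketched with plain NumPy. This is only a minimal illustration, assuming cosine similarity as the score; the function `top_k_by_cosine` and the 3-dimensional toy vectors are made up for this sketch and are not part of any library used later.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k document vectors closest to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical 3-dimensional embeddings for three documents.
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.1, 0.0]

top_matches = top_k_by_cosine(toy_query_vec, toy_doc_vecs, k=2)
print(top_matches)
```

A real vector store usually replaces this exhaustive scan with an index, but the ranking idea is the same.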

![rag.png](./public/rag.png)

### Document loading

Documents are often stored in different formats (such as JSON, CSV, TXT, ...) and in different places (such as local files, the internet, databases, ...).

Document loading is the technique of loading documents from different sources and converting them into a unified format.

In this example, we assume that all the documents are already stored in the `./knowledge` folder, and that each document file is an rst file.
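To make "a unified format" concrete, here is a small sketch that turns JSON, CSV, and plain-text inputs into the same record shape. The helper `to_record`, the in-memory sample inputs, and the file names are all hypothetical; the `page_content` and `metadata` keys are chosen only to resemble the fields printed later in this notebook.

```python
import csv
import io
import json


def to_record(text, source):
    """Wrap raw text in one common shape: the content plus where it came from."""
    return {"page_content": text, "metadata": {"source": source}}


# Hypothetical in-memory inputs standing in for real files.
raw_json = '{"title": "Deploy guide", "body": "Run the deploy stage."}'
raw_csv = "title,body\nRollback guide,Revert the last release.\n"
raw_txt = "Plain text notes about the pipeline."

records = []
records.append(to_record(json.loads(raw_json)["body"], "guide.json"))
for row in csv.DictReader(io.StringIO(raw_csv)):
    records.append(to_record(row["body"], "guide.csv"))
records.append(to_record(raw_txt, "notes.txt"))

for record in records:
    print(record)
```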

If you want to pull the documents, run this command:
```bash 
    git clone git@gitlab.myteksi.net:sentry/t6/t6.git ./tmp/t6

    mkdir -p knowledge/t6

    rsync -avm --include='*.rst' --remove-source-files -f 'hide,! */' "tmp/t6/doc" "knowledge/t6" 

    rm -rf tmp
```

In [None]:
# ! git clone git@gitlab.myteksi.net:sentry/t6/t6.git ./tmp/t6 && mkdir -p knowledge/t6 && rsync -avm --include='*.rst' --remove-source-files -f 'hide,! */' "tmp/t6/doc" "knowledge/t6" && rm -rf tmp

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 50,
    chunk_overlap = 10,
    length_function=len,
)

example_string = (
    "This is a really long string and I want to see how it works with text splitter."
    "This is a really long string and I want to see how it works with text splitter."
)

splits = text_splitter.split_text(example_string)

print(len(example_string))
print(len(splits))
print(splits[0])
print(splits[1])

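As the example above shows, a text splitter breaks long text into chunks of at most `chunk_size` characters, with up to `chunk_overlap` characters shared between neighbouring chunks. We can use the same kind of splitter to split the loaded documents.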
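The roles of chunk size and chunk overlap can be illustrated with a plain sliding window over characters. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to cut at separators); the function `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.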

In [None]:
from typing import List
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.docstore.document import Document

BASE_PATH = "./knowledge/t6"
if not os.path.exists(BASE_PATH):
    raise ValueError(f"Directory {BASE_PATH} does not exist")

# prepare the documents
loader = DirectoryLoader(
    path=BASE_PATH, loader_cls=TextLoader, glob="**/*.rst", exclude=["index.rst"]
)
documents: List[Document] = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"],
    chunk_size=2000,
    chunk_overlap=150,
    length_function=len,
    is_separator_regex=False,
)
splits = text_splitter.split_documents(documents)

In [None]:
print(len(splits))

print(splits[0], end="\n<========================>\n\n")
print(splits[1], end="\n<========================>\n\n")

After loading and splitting, we have a list of document chunks that we can use to answer questions.

As discussed before, we need to embed these documents and store them in a vector store.

### Document embedding

Document embedding is the technique of converting a document into a fixed-size representation, a.k.a. an embedding vector.

There are many embedding techniques, but in this hands-on session we assume that embedding means semantic embedding.

Under the hood, document embedding is usually implemented by an AI model that can **capture the semantic meaning of text** and **encode it into a fixed-size vector**.

So in general, we expect that if two documents are **similar or have similar meaning**, their **embeddings should be close to each other in the vector space**.

In this example, we will use an API from Azure to get the embedding of the document.

In [None]:
from dotenv import load_dotenv
load_dotenv(".env")

# these variables are required to initialize Langchain AzureChatOpenAI instance
required_env_vars = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_MODEL",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
]

for var in required_env_vars:
    if os.environ.get(var) is None:
        raise Exception(f"Missing `{var}` environment variable")


api_key = os.environ.get("AZURE_OPENAI_API_KEY", "")
api_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2023-03-15-preview")
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT", "https://public-api.grabgpt.managed.catwalk-k8s.stg-myteksi.com")
deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4-turbo")
model = os.environ.get("AZURE_OPENAI_MODEL", "gpt-4-turbo")
embedding_model = os.environ.get("AZURE_OPENAI_EMBEDDING_MODEL", "text-embedding-3-large")
embedding_deployment_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME", "text-embedding-3-large")

In [None]:
from langchain_openai import AzureOpenAIEmbeddings
embedding = AzureOpenAIEmbeddings(
    api_key=api_key,
    api_version=api_version,
    azure_deployment=embedding_deployment_name,
)

First, let's run a small experiment to see how embedding works with the Azure API.

We will embed texts with different semantic meanings and see how their embedding vectors compare.

In [None]:
sentence1 = "Today is pleasantly warm with a gentle breeze."
sentence2 = "The temperature today is mild and the air is calm."
sentence3 = "It’s uncomfortably hot and humid today."

In [None]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [None]:
import numpy as np

print(np.dot(embedding1, embedding2))
print(np.dot(embedding2, embedding3))
print(np.dot(embedding3, embedding1))

In the example above, sentence1 and sentence2 have similar meanings, so their embeddings are closer to each other (a larger dot product) than to the embedding of sentence3.
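Comparing raw dot products is only a fair measure of closeness when the vectors have the same length. If the embedding service does not return unit-length vectors (worth checking for your deployment), cosine similarity is the safer comparison. The sketch below uses made-up 2-dimensional vectors and a hypothetical helper `cosine_similarity` to show the difference.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, each of length 5 rather than length 1.
toy_a = np.array([3.0, 4.0])
toy_b = np.array([4.0, 3.0])

print(np.dot(toy_a, toy_b))             # raw dot product grows with vector length
print(cosine_similarity(toy_a, toy_b))  # cosine depends only on direction

# After scaling both vectors to unit length, the dot product equals the cosine.
unit_a = toy_a / np.linalg.norm(toy_a)
unit_b = toy_b / np.linalg.norm(toy_b)
print(np.dot(unit_a, unit_b))
```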

### Vector store


In [None]:
! rm -rf ./docs

In [None]:
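import os

from langchain.vectorstores import Chroma

persist_directory = './docs/chroma/'

if os.path.exists(persist_directory):
    # Reuse the index that was persisted by an earlier run.
    vectordb = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding,
    )
else:
    # No persisted index yet: embed the splits and build a new one.
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embedding,
        persist_directory=persist_directory,
    )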

In [None]:
vectordb._collection.count()

In [None]:
question = "how to resolve Fail to push image error while running cop_image:envoy-base stage in ci/cd pipeline?"
candidates = vectordb.similarity_search(question, k=5)

In [None]:
for candidate in candidates:
    print(candidate.metadata)
    print(candidate.page_content[:1000])

We can see that the retrieved documents have content similar to the query, which means we already have relevant information that is ready to feed into the model.
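One possible way to feed the retrieved chunks into a model is to join them into a single context string inside the prompt. The sketch below is an assumption about how that could look, not the method used later in this material: `RetrievedChunk` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes printed above, and `build_context`, the sample chunks, and the prompt wording are made up.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in exposing the two attributes used in the loop above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(chunks, max_chars=1000):
    """Join chunks into one numbered context string, truncating each chunk."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk.metadata.get("source", "unknown")
        parts.append(f"[{i}] source: {source}\n{chunk.page_content[:max_chars]}")
    return "\n\n".join(parts)


toy_candidates = [
    RetrievedChunk("Check the registry credentials.", {"source": "a.rst"}),
    RetrievedChunk("Retry the push stage.", {"source": "b.rst"}),
]
toy_question = "Why does the image push fail?"
toy_prompt = (
    "Answer using only the context below.\n\n"
    f"{build_context(toy_candidates)}\n\n"
    f"Question: {toy_question}"
)
print(toy_prompt)
```

Numbering each chunk and keeping its source makes it easier to trace an answer back to the document it came from.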