# RAG with Vector Databases

## A quick definition: What are vector embeddings?

Vector embeddings are numerical representations of text, images, or other data that capture semantic meaning in a high-dimensional space. An embedding model converts each input into a vector (a list of numbers), and inputs with similar meanings end up close to each other in that space.

For example, the words "dog" and "puppy" would have vectors that are mathematically close to each other, while "dog" and "airplane" would be farther apart. This allows AI systems to understand relationships and similarities between different pieces of content, making them essential for tasks like search, recommendation systems, and retrieval-augmented generation (RAG).

![Image](../../images/vector-embeddings.png)

*Image courtesy of [Qdrant](https://medium.com/@qdrant/what-are-vector-embeddings-a7bda215702d)*
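
To make "mathematically close" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any real model.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "airplane": [0.1, 0.2, 0.95],
}

dog_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
dog_airplane = cosine_similarity(toy_embeddings["dog"], toy_embeddings["airplane"])

print(f"dog vs puppy:    {dog_puppy:.3f}")
print(f"dog vs airplane: {dog_airplane:.3f}")
```

With these toy values, the "dog"/"puppy" score comes out much higher than the "dog"/"airplane" score, which is the property a vector search relies on.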

In [1]:
import os
import glob
import gradio as gr
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv
from IPython.display import Markdown, display

# LangChain libraries
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if OPENAI_API_KEY is None:
    raise Exception("API key is missing")

## Step 1: Load your documents

**Note:** Plain-text files are only one option. LangChain's [DocumentLoaders](https://python.langchain.com/docs/concepts/document_loaders/) can ingest data from a variety of sources, such as:
- CSV
- XLSX
- PDF
- Web pages
- Cloud providers such as AWS, Azure, and GCP

In [2]:

folders = glob.glob("docs/*")

# load in all the txt files in the directory
documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(
        path=folder,
        glob="**/*.txt",
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"}
    )
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
    documents.extend(folder_docs)


In [3]:
documents

[Document(metadata={'source': 'docs/services/intellectual-property.txt', 'doc_type': 'services'}, page_content='The Intellectual Property & Technology team, led by Sarah Chen, provides comprehensive IP protection strategies including patent prosecution, trademark enforcement, software licensing, and data privacy law. Clients range from tech startups to Fortune 500 firms.'),
 Document(metadata={'source': 'docs/services/employment-law.txt', 'doc_type': 'services'}, page_content='David Thompson heads the Employment Law & Labor Relations practice, representing management in wage and hour disputes, harassment claims, union issues, and employment aspects of corporate deals. Our team helps employers stay compliant and mitigate legal risks.'),
 Document(metadata={'source': 'docs/services/litigation-and-dispute-resolution.txt', 'doc_type': 'services'}, page_content='Our litigation team handles high-stakes commercial disputes, class actions, and white-collar defense. Led by Alexandra Sterling, w

## Step 2: Split documents into chunks

In [4]:
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_documents(documents)
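
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of overlapping chunking. It is an assumption-laden illustration, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (by default `"\n\n"`) and then merges the pieces up to `chunk_size`, whereas the hypothetical `sliding_window_chunks` helper below just slides a fixed-size character window over the text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share an overlap.

    Simplified illustration only; not the algorithm CharacterTextSplitter uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last four characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.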

In [5]:
chunks[0:5]

[Document(metadata={'source': 'docs/services/intellectual-property.txt', 'doc_type': 'services'}, page_content='The Intellectual Property & Technology team, led by Sarah Chen, provides comprehensive IP protection strategies including patent prosecution, trademark enforcement, software licensing, and data privacy law. Clients range from tech startups to Fortune 500 firms.'),
 Document(metadata={'source': 'docs/services/employment-law.txt', 'doc_type': 'services'}, page_content='David Thompson heads the Employment Law & Labor Relations practice, representing management in wage and hour disputes, harassment claims, union issues, and employment aspects of corporate deals. Our team helps employers stay compliant and mitigate legal risks.'),
 Document(metadata={'source': 'docs/services/litigation-and-dispute-resolution.txt', 'doc_type': 'services'}, page_content='Our litigation team handles high-stakes commercial disputes, class actions, and white-collar defense. Led by Alexandra Sterling, w

In [6]:
for chunk in chunks:
    if "Alex" in chunk.page_content:
        print(chunk.page_content)
        print(chunk.metadata)
        print("\n\n")

Our litigation team handles high-stakes commercial disputes, class actions, and white-collar defense. Led by Alexandra Sterling, we are known for courtroom excellence and effective dispute resolution strategies including arbitration and mediation.
{'source': 'docs/services/litigation-and-dispute-resolution.txt', 'doc_type': 'services'}



STERLING & ASSOCIATES LAW GROUP - ATTORNEY RECORD

Attorney ID: SAL-001
Name: Alexandra Catherine Sterling
Position: Managing Partner & Senior Litigation Attorney
Department: Corporate Litigation
Bar Admission: New York (1995), Pennsylvania (1996), Federal Courts (1997)
Hire Date: January 15, 1995 (Founding Partner)

PROFESSIONAL BACKGROUND:
------------------------

Current Position (1995 - Present):
Managing Partner & Senior Litigation Attorney
- Founded Sterling & Associates Law Group in 1995
- Leads high-stakes corporate litigation and white-collar defense
- Manages firm operations and strategic direction
- Oversees 45+ attorneys and 80+ support s

## Step 3: Generate vector embeddings for your documents

In [7]:
MODEL = 'gpt-4o-mini'
embedding_model = 'text-embedding-3-large'
db_name = "vector_db"

In [8]:
embeddings = OpenAIEmbeddings(model=embedding_model)

## Step 4: Create a vectorstore

In [9]:
# clearing out any previous vectorstores to have a fresh start
if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

In [10]:
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vector store created with {vectorstore._collection.count()} documents.")

Vector store created with 47 documents.


In [11]:
vectorstore._collection.get().keys()

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])

In [12]:
len(vectorstore._collection.get(include=['embeddings'])['embeddings'][0])

3072

## Step 4.5: Visualizing the vectorstore

In [13]:
# Prework: pull the vectors, texts, and metadata out of the Chroma collection

collection = vectorstore._collection
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]
colors = [['blue', 'green', 'red'][['about', 'employees', 'services'].index(t)] for t in doc_types]

In [14]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)

from sklearn.manifold import TSNE
import plotly.graph_objects as go

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

## Step 5: Building the RAG chatbot logic

In their most basic form, RAG applications have four parts:

1. The **LLM** engine
2. The **memory** of the application
3. The information **retriever**
4. The **conversation chain**
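
The way these four parts fit together can be sketched in plain Python. Everything in this block is a hypothetical stand-in written for illustration: `embed` is a bag-of-words counter rather than a real embedding model, `fake_llm` does not generate text, and `ToyRetriever` and `ToyConversationChain` are not LangChain classes. The sample sentences paraphrase the documents loaded earlier.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRetriever:
    """3. The retriever: returns the k texts most similar to the query."""

    def __init__(self, texts, k=3):
        self.entries = [(text, embed(text)) for text in texts]
        self.k = k

    def retrieve(self, query):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:self.k]]


def fake_llm(prompt):
    """1. The LLM engine (stub): reports the prompt size instead of generating."""
    return f"(answer based on a prompt of {len(prompt)} characters)"


class ToyConversationChain:
    """4. The conversation chain: wires the LLM, memory, and retriever together."""

    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever
        self.chat_history = []  # 2. The memory: a list of (role, text) turns

    def ask(self, question):
        context = self.retriever.retrieve(question)
        history = "\n".join(f"{role}: {text}" for role, text in self.chat_history)
        prompt = (
            "Context:\n" + "\n".join(context)
            + "\n\nChat history:\n" + history
            + "\n\nQuestion: " + question
        )
        answer = self.llm(prompt)
        # Store both sides of the turn so later questions can see them.
        self.chat_history.append(("user", question))
        self.chat_history.append(("assistant", answer))
        return {"answer": answer, "source_documents": context}


texts = [
    "The litigation team is led by Alexandra Sterling.",
    "The employment law practice is headed by David Thompson.",
    "The intellectual property team is led by Sarah Chen.",
]
chain = ToyConversationChain(fake_llm, ToyRetriever(texts, k=1))
result = chain.ask("Who leads the litigation team?")
print(result["source_documents"])
```

This sketch retrieves with the raw question. As far as I know, LangChain's `ConversationalRetrievalChain` additionally uses the chat history to rephrase a follow-up into a standalone question before retrieval, a step omitted here for brevity.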

In [None]:
# 1. The LLM engine
llm = ChatOpenAI(model=MODEL, temperature=0.7)

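# 2. The memory of the application
from langchain.memory import ConversationBufferMemory
# output_key tells the memory which chain output to store. It is needed here
# because return_source_documents=True makes the chain return two outputs.
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

# 3. The information retriever (returns the 3 most similar chunks per query)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})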

# 4. The conversation chain
from langchain.chains import ConversationalRetrievalChain
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True
)