# 2. Vectors and embeddings

This notebook is a step-by-step guide to processing documents and creating vector representations (embeddings) of their content. In short, it does the following:

1) **Load the Document**: It starts by loading a document (e.g., a PDF file) into the program.
2) **Split the Document**: The document is then split into smaller chunks to make it easier to process.
3) **Create Embeddings**: For each chunk, the notebook uses a model to create a vector representation (embedding). This is like converting the text into a series of numbers that capture its meaning.
4) **Store the Embeddings**: These embeddings are stored in a special database called a vector store. This allows for efficient searching and retrieval based on the content of the documents.
5) **Query the Vector Store**: Finally, the notebook demonstrates how to query this vector store to find relevant document chunks based on a search query.

Overall, the notebook shows how to transform text documents into a format that can be easily searched and analyzed using machine learning techniques.
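
To make the idea of "numbers that capture meaning" concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The three-dimensional vectors are hypothetical values chosen by hand for illustration; a real embedding model produces much longer vectors, and the names `cosine_similarity` and `toy_vectors` are not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written "embeddings" (assumption: not produced by any model)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}

sim_cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_cat_invoice = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])

# Vectors pointing in similar directions score close to 1
print(sim_cat_kitten > sim_cat_invoice)  # prints True
```

Texts with related meaning are expected to land close together in the vector space, which is what the similarity searches later in the notebook rely on.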

# Qdrant
Qdrant is a high-performance vector database used to store and search vector representations (embeddings) of data efficiently. It is scalable and integrates well with machine learning and AI applications, making it suitable for production environments.

In [1]:
# https://python.langchain.com/docs/tutorials/retrievers/
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http import models

In [2]:
OLLAMA_SERVER = os.getenv("OLLAMA_SERVER")

In [None]:

file_path = "./docs/test.pdf"
# Load a PDF document and split it into individual pages as document objects
loader    = PyPDFLoader(file_path)
docs      = loader.load()
print(len(docs))

In [None]:
# Let's examine the first document
print(f"{docs[0].page_content[:100]}\n")
docs[0].metadata

In [None]:
# Initialize text splitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=10, 
    add_start_index=True
)
# Split the documents into chunks
all_splits    = text_splitter.split_documents(docs)
len(all_splits)
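
The effect of `chunk_size`, `chunk_overlap`, and `add_start_index` can be illustrated with a simplified sliding-window splitter. This is only a sketch under stated assumptions: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not do.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record the offset, similar in spirit to add_start_index=True
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk["start_index"], chunk["text"])
# 0 abcdefghij
# 8 ijklmnopqr
# 16 qrstuvwxyz
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.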

In [6]:
# Initialize Ollama embeddings
embeddings = OllamaEmbeddings(
    base_url=OLLAMA_SERVER,
    model="mxbai-embed-large"
)
# Embed every chunk in one call; the vector length is used below to size the collection
vectors    = embeddings.embed_documents([split.page_content for split in all_splits])

In [None]:
# Instantiate the Qdrant client and reset the collection.
# For this tutorial, we delete the collection if it already exists
# so that it is always created from scratch.
qdrant_client     = QdrantClient(host='lawboxai_qdrant')
qdrant_collection = 'example'
qdrant_client.delete_collection(collection_name=qdrant_collection)

In [None]:
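# Create a new collection with the same name,
# sized to the dimension of the embedding vectors
qdrant_client.create_collection(
    collection_name=qdrant_collection,
    vectors_config=models.VectorParams(
        size=len(vectors[0]),
        distance=models.Distance.COSINE
    ),
)
# Wrap the collection in a LangChain vector store and add the chunks
# (add_documents embeds the chunks itself using `embeddings`)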
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=qdrant_collection,
    embedding=embeddings
)
vector_store.add_documents(all_splits)

In [9]:
# Note: this is not an LLM query, but a similarity search in the vector store.
# We will do a more advanced query with an LLM in the next tutorial.
query = "I'm looking for Margaret"

In [None]:
# similarity_search: returns the top k documents based solely on vector similarity
results = vector_store.similarity_search(query, k=10)
for doc in results:
    print(50 * '-')
    print(doc.page_content)

In [None]:
# Similar to similarity_search, but each result is a (document, relevance score) pair
scored_results = await vector_store.asimilarity_search_with_relevance_scores(query, k=5)
for doc, score in scored_results:
    print(50 * '-', score)
    print(doc.page_content)
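
Conceptually, a scored similarity search ranks stored vectors by their similarity to the query vector and keeps the best `k`. The sketch below does this by brute force over a handful of hypothetical two-dimensional vectors; the helper names `normalize` and `top_k` and the sample data are assumptions made for illustration. A vector database such as Qdrant relies on index structures rather than a full scan, and the score scale it returns may differ from the raw cosine values used here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, stored, k):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = [
        # The dot product of unit vectors equals their cosine similarity
        (text, sum(a * b for a, b in zip(q, normalize(vec))))
        for text, vec in stored
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical chunks with hand-written vectors (not produced by any model)
stored_chunks = [
    ("Margaret signed the lease.", [0.9, 0.1]),
    ("The invoice is overdue.", [0.1, 0.9]),
    ("Margaret called the landlord.", [0.8, 0.3]),
]

toy_results = top_k([1.0, 0.0], stored_chunks, k=2)
for text, score in toy_results:
    print(f"{score:.3f}  {text}")
```

The two chunks whose vectors point closest to the query direction are returned, highest score first.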

In [None]:
# The LangChain way, using a Retriever
# Retrievers are a higher-level abstraction that wraps the VectorStore
# and exposes it through a simpler, uniform interface
# The number of results is configured through search_kwargs
retriever = vector_store.as_retriever(search_kwargs={"k": 7})
retrieved_docs = retriever.invoke(query)

# Display the results
for doc in retrieved_docs:
    print(50*'-')
    print(doc.page_content)