# 2. Vectors and embeddings

This notebook is a step-by-step guide to processing documents and creating vector representations (embeddings) of their content. In outline, it does the following:

1) **Load the Document**: It starts by loading a document (e.g., a PDF file) into the program.
2) **Split the Document**: The document is then split into smaller chunks to make it easier to process.
3) **Create Embeddings**: For each chunk, the notebook uses a model to create a vector representation (embedding). This is like converting the text into a series of numbers that capture its meaning.
4) **Store the Embeddings**: These embeddings are stored in a special database called a vector store. This allows for efficient searching and retrieval based on the content of the documents.
5) **Query the Vector Store**: Finally, the notebook demonstrates how to query this vector store to find relevant document chunks based on a search query.

Overall, the notebook shows how to transform text documents into a format that can be easily searched and analyzed using machine learning techniques.
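
The five steps above can be sketched without any external service. This is a toy illustration only: the hard-coded `document` string, the word-count `toy_embed` function, and the in-memory `store` list are all assumptions made for this example, standing in for the PDF, the embedding model, and the vector database used in the rest of the notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def toy_embed(text, vocabulary):
    """Toy stand-in for an embedding model: one word count per vocabulary entry."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1) "Load" a document (a hard-coded string here, not a PDF)
document = "The ship is trapped in ice. Margaret waits in England. The creature fled north."

# 2) Split it into chunks (one sentence per chunk)
chunks = [s.strip() for s in document.split(".") if s.strip()]

# 3) Embed each chunk with the toy model
vocabulary = sorted({word for chunk in chunks for word in tokenize(chunk)})

# 4) Store (chunk, vector) pairs in a plain list acting as the vector store
store = [(chunk, toy_embed(chunk, vocabulary)) for chunk in chunks]

# 5) Query: embed the query the same way and rank chunks by cosine similarity
def search(query, k):
    query_vector = toy_embed(query, vocabulary)
    ranked = sorted(store, key=lambda item: cosine(item[1], query_vector), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

results = search("I'm looking for Margaret", k=1)
print(results)
```

A real embedding model captures meaning rather than exact word overlap, and a real vector store uses an index instead of scanning every vector, but the flow of data is the same.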

# Qdrant
Qdrant is a high-performance vector database used to store and search vector representations (embeddings) of data efficiently. It is scalable and integrates well with machine learning and AI applications, making it suitable for production environments.

In [31]:
# https://python.langchain.com/docs/tutorials/retrievers/
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http import models

In [32]:
OLLAMA_SERVER = os.getenv("OLLAMA_SERVER")

In [33]:

file_path = "./docs/test.pdf"
# Load a PDF document and split it into individual pages as document objects
loader    = PyPDFLoader(file_path)
docs      = loader.load()
print(len(docs))

31


In [34]:
# Let's examine the first document
print(f"{docs[0].page_content[:100]}\n")
docs[0].metadata

a wretch as I felt pride. Justine, poor unhappy Justine, was as innocent
as I, and she suffered the 



{'source': './docs/test.pdf', 'page': 0}

In [35]:
# Initialize text splitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=10, 
    add_start_index=True
)
# Split the documents into chunks
all_splits    = text_splitter.split_documents(docs)
len(all_splits)

100
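
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is a sketch under a strong assumption: it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical. The returned start offset plays the role of the `start_index` metadata that `add_start_index=True` records.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return (start, chunk) pairs where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
for start, chunk in chunks:
    print(start, chunk)
# 0 abcdefghij
# 8 ijklmnopqr
# 16 qrstuvwxyz
```

Each chunk repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary partly visible in both chunks.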

In [36]:
# Initialize Ollama embeddings
embeddings = OllamaEmbeddings(
    base_url=OLLAMA_SERVER, 
    model="mxbai-embed-large"
)
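# Embed all chunks in one batched call; below, these vectors are only used to read the embedding dimension
vectors    = embeddings.embed_documents([split.page_content for split in all_splits])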

In [37]:
# Instantiate the Qdrant client.
# For this tutorial, we delete the collection if it already exists
# so that it is always rebuilt from scratch.
qdrant_client     = QdrantClient(host='lawboxai_qdrant')
qdrant_collection = 'example'
qdrant_client.delete_collection(collection_name=qdrant_collection)

True

In [38]:
# Create a new collection with the same name
qdrant_client.create_collection(
    collection_name=qdrant_collection,
    vectors_config=models.VectorParams(
        size=len(vectors[0]),
        distance=models.Distance.COSINE
    ),
)
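# Wrap the collection in a LangChain vector store and add the chunks
# (add_documents computes the embeddings itself, using `embeddings`)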
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=qdrant_collection,
    embedding=embeddings
)
vector_store.add_documents(all_splits)

['1cf0b2839ef14265b985cafb4162917e',
 '3fe84b42a7c64964a201b5bfd29f1a25',
 '0adb60123eca45668e1d5b7cf5b019a4',
 '52fe595e697b4297a1e81f38b52ce113',
 '59b31fef3b454ce282ed7895b638991b',
 '6f5571086e1e4151bdc75823557422ec',
 'eef55b5db7e141859542a70b307ea283',
 '12a78a0889f94659b4870ba0fa5a8999',
 'f0d9e67f27bb4a02886f381ba937c56c',
 'a8f78a13d6f54947a79f063b2f3c94af',
 '0dcec62039c44aa69d0103c1b7250d4c',
 '62168cb35d304f659e525e837342069d',
 '57054b77af7248a38f3872313f6e8616',
 '7a6d1bbe97234af59a8bab0749740510',
 '288b179bcd3b4290ac9d946ace366622',
 '1f85968f3e5a4d5dbbe9ed4c350cb0d6',
 'be23a505110e48bf96ec27b4cc9101d5',
 'cf0fd5f6d34844f0bdd272f85820d180',
 '43fcc73077f6459fb0914ab3d4a8709b',
 '469022352f4c4e2099ef129d0aaf2d98',
 '3c880be7b9544e8b8274a3d72135416b',
 'c953ff875d5a4910be8dc4b1c063cd1d',
 '2d93911b0004492fa4e367b0ba403c6b',
 '9310929016f9457c92b303a1dc0db54d',
 '709add28afeb4bafab76ea64c3e75e5f',
 '6af2abf00443480788ad8cd751461262',
 'c08a702386ac4f4db91b8b214d5d7ee6',
 

In [39]:
query = "I'm looking for Margaret"

In [40]:
# similarity_search returns the top k chunks ranked purely by vector similarity
results = vector_store.similarity_search(query, k=10)
for doc in results:
    print(50*'-')
    print(doc.page_content)

--------------------------------------------------
such is not my destiny; I must pursue and destroy the being to whom I
gave existence; then my lot on earth will be fulfilled and I may die.”
My beloved Sister,
September 2d.
I write to you, encompassed by peril and ignorant whether I am ever
doomed to see again dear England and the dearer friends that inhabit
it. I am surrounded by mountains of ice which admit of no escape and
threaten every moment to crush my vessel. The brave fellows whom I
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
you will have visitings of de

In [41]:
# Like similarity_search, but each result is a (document, relevance score) pair.
# This is the async variant; top-level await works inside a notebook cell.
results = await vector_store.asimilarity_search_with_relevance_scores(query, k=5)
for doc, score in results:
    print(50*'-', score)
    print(doc.page_content)

-------------------------------------------------- 0.7903445499999999
such is not my destiny; I must pursue and destroy the being to whom I
gave existence; then my lot on earth will be fulfilled and I may die.”
My beloved Sister,
September 2d.
I write to you, encompassed by peril and ignorant whether I am ever
doomed to see again dear England and the dearer friends that inhabit
it. I am surrounded by mountains of ice which admit of no escape and
threaten every moment to crush my vessel. The brave fellows whom I
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
you will h
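
The score printed above lies between 0 and 1, while cosine similarity itself ranges from -1 to 1. A minimal sketch of one common rescaling, assuming the wrapper maps cosine similarity `s` to `(s + 1) / 2` for a collection created with `Distance.COSINE`; that mapping is an assumption about the library's behavior, not something this notebook verifies, and the function names are hypothetical.

```python
def cosine_to_relevance(similarity):
    """Map a cosine similarity in [-1, 1] to a relevance score in [0, 1] (assumed mapping)."""
    return (similarity + 1.0) / 2.0

def relevance_to_cosine(score):
    """Inverse of the assumed mapping: recover the cosine similarity from a relevance score."""
    return 2.0 * score - 1.0

# Under this assumption, the score printed above corresponds to this raw cosine similarity
raw_similarity = relevance_to_cosine(0.7903445499999999)
print(raw_similarity)
```

If the assumption holds, a relevance score of 0.5 means the query and chunk vectors are orthogonal, so scores should be compared against 0.5 rather than 0 when judging whether a match is meaningful.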

In [42]:
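# The LangChain way, using a retriever
# A retriever is a higher-level abstraction that wraps the vector store
# and exposes a uniform invoke() interface.
# The number of results is set through search_kwargs when the retriever is created;
# passing k to invoke() is not honored by every LangChain version.
retriever = vector_store.as_retriever(search_kwargs={"k": 7})
retrieved_docs = retriever.invoke(query)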

# Display the results
for doc in retrieved_docs:
    print(50*'-')
    print(doc.page_content)

--------------------------------------------------
such is not my destiny; I must pursue and destroy the being to whom I
gave existence; then my lot on earth will be fulfilled and I may die.”
My beloved Sister,
September 2d.
I write to you, encompassed by peril and ignorant whether I am ever
doomed to see again dear England and the dearer friends that inhabit
it. I am surrounded by mountains of ice which admit of no escape and
threaten every moment to crush my vessel. The brave fellows whom I
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
you will have visitings of de
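
The idea of a retriever as a thin wrapper that fixes the search settings up front can be sketched without LangChain. Everything here is hypothetical and for illustration only: `ToyRetriever`, `keyword_search`, and the three-sentence corpus are not part of any library, and the keyword overlap score is a stand-in for vector similarity.

```python
class ToyRetriever:
    """Binds a search function and its k once, so callers only pass a query."""

    def __init__(self, search_fn, k):
        self.search_fn = search_fn
        self.k = k

    def invoke(self, query):
        return self.search_fn(query, self.k)

CORPUS = [
    "Margaret waits in England",
    "The ship is trapped in ice",
    "The creature fled north",
]

def keyword_search(query, k):
    """Rank corpus entries by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

toy_retriever = ToyRetriever(keyword_search, k=2)
hits = toy_retriever.invoke("looking for Margaret")
print(hits)
```

Because the search backend and `k` are bound at construction time, the calling code stays the same when the backend is swapped, which is the main reason to prefer a retriever over calling the vector store directly.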