# 2. Vectors and embeddings

This notebook walks step by step through processing a document and creating vector representations (embeddings) of its content. In outline:

1) **Load the Document**: It starts by loading a document (e.g., a PDF file) into the program.
2) **Split the Document**: The document is then split into smaller chunks to make it easier to process.
3) **Create Embeddings**: For each chunk, the notebook uses a model to create a vector representation (embedding). This is like converting the text into a series of numbers that capture its meaning.
4) **Store the Embeddings**: These embeddings are stored in a special database called a vector store. This allows for efficient searching and retrieval based on the content of the documents.
5) **Query the Vector Store**: Finally, the notebook demonstrates how to query this vector store to find relevant document chunks based on a search query.

Overall, the notebook shows how to transform text documents into a format that can be easily searched and analyzed using machine learning techniques.
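
To make the five steps concrete without a running Ollama or Qdrant server, here is a minimal, standard-library sketch of the same flow. Everything in it is an illustrative stand-in rather than the notebook's code: `toy_embed` counts words instead of calling an embedding model, the `store` list plays the role of the vector store, and the sample text is invented.

```python
import math
import re
from collections import Counter

# Hypothetical sample text; the notebook loads a PDF instead.
text = (
    "Walton writes a letter to Margaret. "
    "Victor builds a creature in his laboratory. "
    "The creature flees into the mountains."
)

def split_sentences(document):
    """Stand-in splitter: one chunk per sentence."""
    return [part.strip() for part in document.split(".") if part.strip()]

def toy_embed(chunk):
    """Stand-in embedding: sparse word counts (a real model returns dense floats)."""
    return Counter(re.findall(r"[a-z]+", chunk.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(store, query, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Steps 2-4: split, embed, store. Step 5: query.
store = [(chunk, toy_embed(chunk)) for chunk in split_sentences(text)]
top_chunks = search(store, "letter to Margaret")
print(top_chunks)
```

A word-count vector only matches exact words, whereas a learned embedding model is meant to place texts with similar meaning close together even when they share no vocabulary. That is the reason the notebook uses a model for step 3.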

# Qdrant
Qdrant is a vector database: it stores vector representations (embeddings) of data and searches them by similarity. It is designed to scale to production workloads. In this notebook it is accessed through `qdrant_client` and LangChain's `QdrantVectorStore`.

In [1]:
# https://python.langchain.com/docs/tutorials/retrievers/
import os
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http import models

In [2]:
# Base URL of the Ollama server, read from the environment (None if the variable is unset)
OLLAMA_SERVER = os.getenv("OLLAMA_SERVER")

In [8]:
# 1. We load the document
# -----------------------
file_path = "./docs/test.pdf"
loader    = PyPDFLoader(file_path)
docs      = loader.load()
print(len(docs))

31


In [9]:
print(f"{docs[0].page_content[:100]}\n")
docs[0].metadata

a wretch as I felt pride. Justine, poor unhappy Justine, was as innocent
as I, and she suffered the 



{'source': './docs/test.pdf', 'page': 0}

In [20]:
# We split this document into chunks
# ----------------------------------
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=10, 
    add_start_index=True
)
all_splits    = text_splitter.split_documents(docs)
len(all_splits)

10
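
The effect of `chunk_size` and `chunk_overlap` is easier to see on a tiny input. The sketch below is a simplified fixed-window splitter, not `RecursiveCharacterTextSplitter` itself, which additionally prefers to break on separators such as paragraph and line breaks. The function `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record the offset, similar in spirit to add_start_index=True.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
# prints:
# 0 abcdefghij
# 8 ijklmnopqr
# 16 qrstuvwxyz
```

The overlap means a sentence that straddles a boundary still appears whole, or nearly whole, in at least one chunk. With `chunk_overlap=10` on 1000-character chunks, as in the cell above, only a few words are shared.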

In [21]:
# We select the embedder and create vectors
# ------------------------------------------
embeddings = OllamaEmbeddings(
    base_url=OLLAMA_SERVER, 
    model="mxbai-embed-large"
)
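# embed_documents embeds every chunk in one call; the vectors are only used to size the collection
vectors    = embeddings.embed_documents([split.page_content for split in all_splits])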

In [22]:
# Instantiate the Qdrant client and recreate a collection
# --------------------------------------------------------
qdrant_client     = QdrantClient(host='lawboxai_qdrant')
qdrant_collection = 'example'
qdrant_client.delete_collection(collection_name=qdrant_collection)

True

In [23]:
qdrant_client.create_collection(
    collection_name=qdrant_collection,
    vectors_config=models.VectorParams(
        size=len(vectors[0]),
        distance=models.Distance.COSINE
    ),
)
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=qdrant_collection,
    embedding=embeddings
)
vector_store.add_documents(all_splits)

['606babc67d0e47ac90a4eb6a383ff998',
 '4bd435e79f1f471aaf531d49c5f61248',
 '0fb5730d7cb547858a5a4ee203531269',
 '556829a3dc324ce0aed6325c9c3cb59c',
 '879deefe986f42a6b2f2b582ae91f2e5',
 '6b5fd21546974ca1818310a7a21d8deb',
 'e67b1c0c0dc546beb9d2b2c910f7f2ee',
 '528f6b042b2d499883c8a27c257de852',
 'e8b0a5b233f24e99990a48f46a35fd4b',
 'ad49507c092244cfb56a8a232e139c13']
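
`Distance.COSINE` asks Qdrant to compare vectors by the angle between them rather than by their length. As a rough illustration of the metric (a plain-Python sketch, not Qdrant's implementation; `cosine_similarity` is a hypothetical helper):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(round(same_direction, 6), orthogonal, opposite)
```

Because only direction matters, a long chunk and a short chunk about the same topic can still score as close neighbours.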

In [27]:
query = "Am looking for Margaret"

In [28]:
# similarity_search: returns the top k documents ranked by vector similarity alone
docs  = vector_store.similarity_search(query, k=10)
for doc in docs:
    print(50*'-')
    print(doc.page_content)

--------------------------------------------------
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
--------------------------------------------------
trust him not. His soul is as hellish as his form, full of treachery
and fiend-like malice. Hear him not; call on the names of William,
Justine, Clerval, Elizabeth, my father, and of the wretched Victor, and
thrust your sword into his heart. I will hover near and direct the
steel aright.
Walton, _in continuation._
August 26th, 17—.
You have read this strange and terrific story, Margaret; and do you not
feel your blood con

In [29]:
# Like similarity_search, but each result is a (document, relevance score) pair.
# Top-level await works here because Jupyter runs cells inside an event loop.
scored_docs = await vector_store.asimilarity_search_with_relevance_scores(query, k=5)
for doc, score in scored_docs:
    print(50*'-', score)
    print(doc.page_content)

-------------------------------------------------- 0.78439495
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
-------------------------------------------------- 0.7714237500000001
trust him not. His soul is as hellish as his form, full of treachery
and fiend-like malice. Hear him not; call on the names of William,
Justine, Clerval, Elizabeth, my father, and of the wretched Victor, and
thrust your sword into his heart. I will hover near and direct the
steel aright.
Walton, _in continuation._
August 26th, 17—.
You have read this strange and terrific story, Margaret; and 
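
Each result above is a `(document, score)` pair, and a higher score means a closer match, so weak matches are simple to discard. A minimal sketch, using plain strings in place of `Document` objects: `filter_by_score`, the third entry, and the 0.5 threshold are hypothetical, while the first two scores are the ones printed above, rounded to three decimals.

```python
def filter_by_score(results, min_score):
    """Keep (text, score) pairs scoring at least min_score, best first."""
    kept = [(text, score) for text, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Placeholder texts; only the first two scores come from the notebook output.
results = [
    ("second chunk mentioning Margaret", 0.771),
    ("first chunk mentioning Margaret", 0.784),
    ("hypothetical unrelated chunk", 0.41),
]

for text, score in filter_by_score(results, min_score=0.5):
    print(score, text)
```

A sensible threshold depends on the embedding model and the corpus, so it is worth inspecting the scores of a few known-good and known-bad queries before fixing one.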

In [30]:
# The LangChain way, using a Retriever
# The number of results is set on the retriever through search_kwargs
retriever = vector_store.as_retriever(search_kwargs={"k": 7})
retrieved_docs = retriever.invoke(query)

# Display the results
for doc in retrieved_docs:
    print(50*'-')
    print(doc.page_content)

--------------------------------------------------
have persuaded to be my companions look towards me for aid, but I have
none to bestow. There is something terribly appalling in our
situation, yet my courage and hopes do not desert me. Yet it is
terrible to reflect that the lives of all these men are endangered
through me. If we are lost, my mad schemes are the cause.
And what, Margaret, will be the state of your mind? You will not hear of 
my
destruction, and you will anxiously await my return. Years will pass, 
and
--------------------------------------------------
trust him not. His soul is as hellish as his form, full of treachery
and fiend-like malice. Hear him not; call on the names of William,
Justine, Clerval, Elizabeth, my father, and of the wretched Victor, and
thrust your sword into his heart. I will hover near and direct the
steel aright.
Walton, _in continuation._
August 26th, 17—.
You have read this strange and terrific story, Margaret; and do you not
feel your blood con