In [None]:
'''VectorDB:
- Stores and searches high-dimensional vectors (embeddings) generated from data such as text, images, or audio. These embeddings represent the semantic meaning of the data.

Q. Why Do We Need a Vector Database?
- In traditional databases, we query data using exact values (e.g., ID = 5). But with semantic search, we want to find similar things:
- E.g., find documents similar to: "How to invest in real estate?"

# To do that:
- Convert the query and documents to vectors (via a model like OpenAI, Hugging Face, etc.).
- Store those vectors in a vector database.
- Use similarity search (cosine similarity, Euclidean distance) to find the most similar vectors.

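- Indexing: how the stored vectors are organized so that search stays fast (Flat, IVF, HNSW, LSH).
- Similarity search: how closeness between the query vector and the stored vectors is measured (cosine similarity, dot product, Euclidean distance).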

# Popular Vector Databases:

- FAISS (Facebook)
- Chroma
- Pinecone (SaaS)
- Weaviate
- Milvus
- Qdrant
'''
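
The three steps above (embed, store, search) can be sketched without any library. This is a minimal illustration, not how FAISS or Chroma work internally: the 3-dimensional vectors are made-up stand-ins for real embeddings, and `toy_store`, `cosine_similarity`, and `top_k` are hypothetical names chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Brute-force search: score every stored vector, keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:k]]

# Hypothetical "embeddings"; a real model would produce hundreds of dimensions.
toy_store = {
    "real_estate_guide": [0.9, 0.1, 0.0],
    "stock_basics": [0.6, 0.4, 0.1],
    "pasta_recipe": [0.0, 0.1, 0.9],
}

query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedded query
results = top_k(query_vector, toy_store, k=2)
print(results)
```

Scoring every stored vector like this is exactly what a flat index does; the index techniques described later exist to avoid that full scan on large collections.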

In [1]:
!pip install sentence-transformers faiss-cpu



In [None]:
'''
# Index Techniques (indexing means arranging the data in a structure that makes searching faster).
1. Flat Index (Brute Force / FlatL2)
    ✅ What it does: Compares the query vector to every single vector in the database.
    - Uses exact search, so it always returns the true nearest neighbors.
    - Slow for large datasets, because search time grows linearly with the number of vectors.
    - Best for: small-scale datasets, or when you want perfect accuracy.
    - Think of it like: Searching for a specific song by listening to every single one in a playlist.

2. IVF (Inverted File Index)
    ✅ What it does:
    - Clusters all vectors into "buckets" using KMeans.
    - For any new query, only compares it to vectors within the nearest clusters, not all.
    - It's a type of Approximate Nearest Neighbor (ANN) search.
    - Trade-off: Speed vs Accuracy — faster but a bit less precise than brute force.
    - Think of it like: Sorting books by genre and then searching only within the crime section if you're looking for a detective novel.

3. HNSW (Hierarchical Navigable Small World Graph)
    ✅ What it does:
    - Builds a graph of vectors, where each vector is connected to nearby vectors.
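    - Search starts at an entry point in the sparse top layer and moves down through denser layers, getting closer to the query vector at each step.
    - Very fast at query time, but uses more memory than Flat or IVF because the graph links are stored alongside the vectors.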
    - Common in FAISS, Qdrant, Weaviate, etc.
    - Think of it like: Finding a city on a map by first zooming into the region, then the state, then the city.

4. LSH (Locality Sensitive Hashing)
    ✅ What it does:
    - Uses a hashing function that puts similar vectors into the same buckets.
    - Then only searches within that bucket.
    - Often used when the data is very high-dimensional.
    - Fast, but typically less accurate than IVF or HNSW.
    - Think of it like: You and your friends all use the same locker number because you have similar names — so if someone wants to find you, they look in that locker.

------------------------------------
  🔁 SIMILARITY SEARCH TECHNIQUES:
------------------------------------

1. Cosine Similarity
    ✅ Measures the angle between two vectors, not their magnitude.
    Useful when direction (semantic meaning) is more important than size.
    Ranges from -1 to 1 (1 = identical direction).
    📌 cos_sim = dot(A, B) / (||A|| * ||B||)

    🧠 Use when: You care about semantic similarity, like in NLP.

2. Dot Product
    ✅ Multiplies matching components and sums them; equal to ||A|| * ||B|| * cos(angle).
    Similar to cosine, but magnitude affects the result. On unit-length vectors it gives the same ranking as cosine.
    Commonly used in models like transformers.
    📌 dot(A, B) = Σ Aᵢ * Bᵢ

    🧠 Use when: Size and direction both matter (like attention mechanisms).

3. MMR (Maximal Marginal Relevance)
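    ✅ Reranking technique applied after the initial search:
    Balances relevance to the query against diversity among the results.
    Helps avoid near-duplicate results in the top-k answers.
    📌 next = argmax over unselected d of [λ * Sim(d, query) - (1 - λ) * max over selected s of Sim(d, s)]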

    🧠 Use when: You want variety in answers — e.g., summarization, recommendations.

4. Euclidean Distance
    ✅ Measures the straight-line distance between vectors in space.
    Sensitive to magnitude.
    Used in FlatL2 and other exact searches.
    📌 distance = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ...)

    🧠 Use when: Absolute values matter, like in image embeddings or location data.

5. Approximate Nearest Neighbors (ANN)
    ✅ General term for fast but not 100% accurate methods (e.g., IVF, HNSW, LSH).
    Much faster than brute force.
    🧠 Use when: You have millions of vectors and can accept tiny accuracy loss for huge speed gain.

6. Exact Search
    ✅ Compares the query against all vectors and returns the true nearest neighbors.
    Uses Flat index with cosine or euclidean distance.
    ❌ Slower, expensive.
    🧠 Use when: Accuracy is critical (e.g., law, medicine, finance).
'''
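
The MMR rule from the notes above can be sketched in a few lines. This is a minimal illustration under stated assumptions: cosine similarity is used for both the relevance and the redundancy term, the 2-dimensional vectors are made up, and `mmr`, `cosine`, and `candidates` are hypothetical names. Vector stores ship their own MMR implementations, which may differ in detail.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k=2, lam=0.5):
    """Greedy MMR: repeatedly pick the candidate with the best
    lam * relevance - (1 - lam) * redundancy score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(name):
            relevance = cosine(query, candidates[name])
            # Redundancy: similarity to the closest already-selected result.
            redundancy = max(
                (cosine(candidates[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up vectors: "a" and "a_dup" are near-duplicates, "b" is different.
candidates = {"a": [1.0, 0.1], "a_dup": [1.0, 0.12], "b": [0.6, 0.8]}
query = [1.0, 0.0]

print(mmr(query, candidates, k=2, lam=1.0))  # pure relevance
print(mmr(query, candidates, k=2, lam=0.3))  # diversity weighted more heavily
```

With `lam=1.0` the redundancy term vanishes and the two near-duplicates are returned together; lowering `lam` penalizes the duplicate and lets the less similar but more diverse candidate in.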

In [33]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
load_dotenv()

True

In [None]:

loader = PyPDFLoader("Transfomers.pdf")

documents = loader.load()
documents

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Transfomers.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukas

In [22]:
documents[1].page_content

'1 Introduction\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks\nin particular, have been firmly established as state of the art approaches in sequence modeling and\ntransduction problems such as language modeling and machine translation [ 35, 2, 5]. Numerous\nefforts have since continued to push the boundaries of recurrent language models and encoder-decoder\narchitectures [38, 24, 15].\nRecurrent models typically factor computation along the symbol positions of the input and output\nsequences. Aligning the positions to steps in computation time, they generate a sequence of hidden\nstates ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently\nsequential nature precludes parallelization within training examples, which becomes critical at longer\nsequence lengths, as memory constraints limit batching across examples. Recent work has achieved\nsignificant improvements in computational efficiency t

In [23]:
for doc in documents:
    print(doc)

page_content='Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions

In [25]:
documents[0].metadata

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-04-10T21:11:43+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2024-04-10T21:11:43+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'Transfomers.pdf',
 'total_pages': 15,
 'page': 0,
 'page_label': '1'}

In [26]:
for doc in documents:
    print(doc.page_content)
    print("##################################################")

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp

In [None]:
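# Common chunking methods:
# CharacterTextSplitter: splits on a single given separator.
# RecursiveCharacterTextSplitter: tries a list of separators in order ("\n\n", "\n", " ", "") until the chunks are small enough; separators can optionally be treated as regex.
# TokenTextSplitter: splits by token count, so chunks fit the model's token limits.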
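
The core idea shared by these splitters, fixed-size chunks that overlap so context is not cut at a boundary, can be sketched in plain Python. This is a simplified illustration, not LangChain's algorithm: it ignores separators entirely, and `chunk_text` is a hypothetical helper whose parameter names merely mirror `chunk_size` and `chunk_overlap`.

```python
def chunk_text(text, chunk_size=10, chunk_overlap=3):
    """Slide a window of chunk_size characters over the text,
    moving forward by chunk_size - chunk_overlap each step."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text,
        # so no chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last `chunk_overlap` characters of the previous one. A separator-aware splitter applies the same size and overlap budget but prefers to cut at paragraph, line, or word boundaries.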

In [31]:
openai_embedding = OpenAIEmbeddings(model='text-embedding-3-large')

In [39]:
import faiss  # FAISS: the similarity-search library that provides the vector index.
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

In [41]:
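from langchain_community.embeddings import HuggingFaceEmbeddings  # `langchain.embeddings` is a deprecated import path

# Local sentence-transformers model; embed a sample string to discover the vector dimension.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
dim = len(embedding_model.embed_query("test"))

# Flat (brute-force) index with L2 distance: exact search, fine for a handful of pages.
faiss_index = faiss.IndexFlatL2(dim)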

vector_store = FAISS(
    embedding_function=embedding_model,
    index=faiss_index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)



In [42]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x335956510>

In [43]:
vector_store.add_documents(documents)

['2a70d36e-7bcd-4fb9-9bbe-2cea409b4c68',
 '7dcbac37-4694-44dd-90ab-e8b6766bf1cc',
 '47753722-6739-4f21-afb0-3a3877c73b01',
 '7996fbbc-27ae-49a2-9eaf-90a8f50c527d',
 '1bccf3d3-ee4a-4a39-ab76-2df97585c9fd',
 'f6a6ada4-c5bd-4a38-85d4-4db023edb5ef',
 '58c24066-5eb0-462f-aaf8-003d570d1a8e',
 'ac9e9de6-1319-42bc-ab90-073260523aac',
 '7e0f7231-fef5-4058-8d74-53b2aca5c243',
 'd4cb77db-53c0-4e6b-88ab-584b2b8bdb42',
 '35ce738d-bbe9-42b2-9dcd-12c0b93ca37f',
 '4c1a50db-49af-4ec8-a001-3b7db8827aac',
 '56400a78-b1b6-43c7-9573-62b223ed902e',
 '39a552b4-35c3-4ef4-ae2a-c5794223686c',
 '0a7d74d2-3cd1-426c-bc39-610254af1a08']

In [44]:
vector_store.similarity_search(
    query="what is llama2 and what is a difference between llama2 and mistral?",
    k=1
)

[Document(id='7e0f7231-fef5-4058-8d74-53b2aca5c243', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Transfomers.pdf', 'total_pages': 15, 'page': 8, 'page_label': '9'}, page_content='Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base\nmodel. All metrics are on the English-to-German translation development set, newstest2013. Listed\nperplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to\nper-word perplexities.\nN d model dff h d k dv Pdrop ϵls\ntrain PPL BLEU params\nsteps (dev) (dev) ×106\nbase 6 512 2048 8 64 64 0.1 0.1 100K 4.92 25.8 65\n(A)\n1 512 512 5.29 24.9\n4 128 128 5.00 25.5\n

#### **ChromaDB**

In [1]:
!pip install chromadb


In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env

# Load the PDF (one Document per page).
# PyPDFDirectoryLoader expects a directory path, so a single file is loaded with PyPDFLoader.
loader = PyPDFLoader("Transfomers.pdf")
documents = loader.load()

# Check the number of pages loaded.
# print(len(documents))

# Split the pages recursively into overlapping chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=300,
    length_function=len,
    is_separator_regex=False,
)

further_split_doc = text_splitter.split_documents(documents)

# Vector DB types:
# in memory (FAISS with an in-memory docstore)
# on disk (Chroma DB on disk)
# in the cloud (Pinecone)

# Store the vector database in 'vdb_latest'.
persist_directory = "vdb_latest"

# Defined in this cell so it does not depend on an earlier cell having run.
# Requires an OpenAI API key with available quota.
openai_embedding = OpenAIEmbeddings(model='text-embedding-3-large')

# Embed the chunks and write them to disk.
# With chromadb 0.4+ the data is persisted automatically when persist_directory is set,
# so no explicit persist() call is needed.
chroma_vdb = Chroma.from_documents(
    documents=further_split_doc,
    embedding=openai_embedding,
    persist_directory=persist_directory,
)

# Load the database back from disk.
vdb = Chroma(
    persist_directory=persist_directory,
    embedding_function=openai_embedding,
)

# Build a retriever over the vdb; k is set through search_kwargs.
retriever = vdb.as_retriever(search_kwargs={"k": 1})

# Retrieve the most similar chunk for the query.
print(retriever.invoke("what is transformer and how it is working for llama2 model?"))

# Search strategy used by the retriever.
print(retriever.search_type)

# Extra arguments passed to the search.
print(retriever.search_kwargs)

#### **Pinecone**