# Vector Databases and RAG

- A vector database is a type of database that indexes, stores, and manipulates high-dimensional vector data.
- A commonly cited estimate is that about 80% of data is unstructured and does not fit neatly into a relational database.

**It provides:**
- fast retrieval and similarity search
- CRUD operations
- metadata filtering
- horizontal scaling


### Vector Embeddings

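Vector embedding is the process of converting data (audio, images, and documents) into numerical values called vectors. Vectors have both magnitude and direction, so the angle and distance between two vectors can express how closely related the underlying items are. This is what makes semantic search possible.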
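To make "magnitude and direction" concrete, here is a minimal sketch in plain Python. The two 3-dimensional vectors are made-up toy values, not the output of a real embedding model, and the helper names `magnitude` and `direction` are hypothetical.

```python
import math

def magnitude(v):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(sum(x * x for x in v))

def direction(v):
    """Unit vector pointing the same way as v."""
    m = magnitude(v)
    return [x / m for x in v]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
cat = [3.0, 4.0, 0.0]
kitten = [6.0, 8.0, 0.0]  # same direction as `cat`, twice the magnitude

print(magnitude(cat))     # 5.0
print(direction(cat))     # [0.6, 0.8, 0.0]
print(direction(kitten))  # [0.6, 0.8, 0.0]
```

The two vectors differ in length but share a direction, which is why direction-based measures would treat them as closely related.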

### Sentence Transformers - embedding models

The embedding model is the core component of a vector database workflow: it converts textual data into numerical vectors for efficient storage and querying.

[sentence transformers](https://sbert.net/)


### RAG

Unstructured data comes in various forms, and these are stored using different approaches such as key-value pairs, documents, and graphs. Vector data addresses this problem because all of these forms can be converted to vectors.

**Vector Database**

A vector database operates on vectors. A similarity metric is applied to find the vectors that are most similar to the query. Vector databases use algorithms for Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, and graph-based search, and they are assembled into a pipeline. The results are approximate: there is a trade-off between accuracy and speed, so more accurate results generally mean slower retrieval.

1. Indexing: The vector database indexes vectors using an algorithm that maps them to a data structure enabling faster retrieval.
2. Querying: The vector database compares the query vector with the indexed vectors to find its nearest neighbors.
3. Post-processing: In some cases, the retrieved results are post-processed. Post-processing can include re-ranking the results using a different similarity measure.
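The three steps above can be sketched with a toy hash-based index. This is only an illustration under stated assumptions, not how any particular vector database is implemented. The hyperplanes are fixed by hand for reproducibility (real locality-sensitive hashing draws them at random), and every name here (`HYPERPLANES`, `bucket_key`, `build_index`, `query`, `rerank`) is hypothetical.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assumption: hand-picked hyperplanes so the example is deterministic.
HYPERPLANES = [[1.0, 0.0], [0.0, 1.0]]

def bucket_key(v):
    """Hash a vector to a tuple of sign bits, one per hyperplane."""
    return tuple(dot(v, h) >= 0 for h in HYPERPLANES)

def build_index(vectors):
    """Step 1, indexing: group vector ids by hash bucket."""
    index = defaultdict(list)
    for vec_id, v in vectors.items():
        index[bucket_key(v)].append(vec_id)
    return index

def query(index, vectors, q, top_k=2):
    """Step 2, querying: score only the candidates in the query's bucket."""
    candidates = index.get(bucket_key(q), [])
    ranked = sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:top_k]

def rerank(vectors, q, ids):
    """Step 3, post-processing: re-order the hits by Euclidean distance."""
    return sorted(ids, key=lambda i: euclidean(q, vectors[i]))

vectors = {
    "a": [1.0, 0.1],
    "b": [5.0, 4.0],
    "c": [-1.0, 0.5],
    "d": [0.2, -1.0],
}
index = build_index(vectors)
q = [1.0, 1.0]

hits = query(index, vectors, q)
final = rerank(vectors, q, hits)
print(hits)   # ['b', 'a']  (ranked by cosine similarity)
print(final)  # ['a', 'b']  (re-ranked by Euclidean distance)
```

Only the vectors sharing the query's bucket are scored, which is where the speed comes from and also why the result is approximate: a close neighbor that hashes to a different bucket is never considered.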


**Similarity measures:**

- Similarity measures are used to compare the query vectors with the indexed vectors.

- **Cosine similarity:** the cosine of the angle between two vectors, in the range [-1, 1]. 1 means the vectors point in the same direction, 0 means they are orthogonal (not similar), and -1 means they are diametrically opposed (opposite).
ex: semantic search, document classification, recommendation systems based on past behavior.

- **Euclidean distance:** measures the straight-line distance between two vectors, in the range [0, infinity). 0 means the vectors are identical, and larger values mean less similar vectors.
ex: locality-sensitive hashing

- **Dot product:** the product of the magnitudes of two vectors and the cosine of the angle between them, in the range (-infinity, infinity). Positive values indicate similar vectors, 0 indicates orthogonal vectors, and negative values indicate opposite vectors.
ex: LLM training
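The three measures can be written in a few lines of plain Python. This is a minimal sketch with made-up 2-dimensional vectors; the function names are hypothetical, and production code would normally use a numerical library instead.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]
opposite = [-1.0, 0.0]

print(cosine_similarity(u, same_direction))   # 1.0
print(cosine_similarity(u, orthogonal))       # 0.0
print(cosine_similarity(u, opposite))         # -1.0
print(euclidean_distance(u, same_direction))  # 1.0
print(euclidean_distance(u, opposite))        # 2.0
print(dot_product(u, same_direction))         # 2.0
```

Note that `u` and `same_direction` have a cosine similarity of 1.0 even though their Euclidean distance is not zero: cosine similarity ignores magnitude, while Euclidean distance and the dot product do not.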


**Filtering**

- A vector store typically maintains two indexes: a vector index and a metadata index. Metadata filtering is applied when querying for similar vectors. The filtering can be done in two ways:

1. Pre-filtering: the metadata filter is applied first, and the vector search is then performed on the remaining vectors. This reduces the search space, but relevant data can be overlooked because it was excluded by the metadata filter, and extensive metadata filtering adds computational overhead. It can also force a brute-force search over the filtered subset, which increases the time complexity.
2. Post-filtering: the metadata filter is applied after the vector search. This ensures that all relevant vectors are considered, but it adds computational overhead because the search space is unchanged and the metadata filtering is an additional step. It can also return few or no results when most of the top matches fail the filter.
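The difference between the two approaches can be seen on a tiny in-memory example. This sketch uses exact (brute-force) search over made-up records, so it only illustrates the order of operations, not the performance characteristics. The record layout and function names are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical records: a vector plus one metadata field.
records = [
    {"id": "1", "vector": [1.0, 0.0], "category": "quotes"},
    {"id": "2", "vector": [0.9, 0.1], "category": "quotes"},
    {"id": "3", "vector": [0.0, 1.0], "category": "recipes"},
    {"id": "4", "vector": [0.1, 0.9], "category": "recipes"},
]

def top_k(candidates, query, k):
    """Return the k candidates most similar to the query."""
    ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return ranked[:k]

def pre_filter_search(query, category, k):
    """Filter on metadata first, then search only the survivors."""
    allowed = [r for r in records if r["category"] == category]
    return [r["id"] for r in top_k(allowed, query, k)]

def post_filter_search(query, category, k):
    """Search everything first, then drop hits that fail the filter."""
    hits = top_k(records, query, k)
    return [r["id"] for r in hits if r["category"] == category]

query = [1.0, 0.0]
print(pre_filter_search(query, "recipes", k=2))   # ['4', '3']
print(post_filter_search(query, "recipes", k=2))  # []
```

Post-filtering returns nothing here because both of the top 2 matches belong to the other category, which is the "few or no results" case described above.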

<i><b>Note:</b> Pinecone uses single-stage filtering, which combines the vector index and the metadata index.</i>

#### RAG use cases

1. Customer Support and Chatbots
2. Knowledge Management and Information Retrieval
3. Content Generation and Creation


### Pinecone

In [37]:
! pip install -U sentence-transformers pinecone






In [38]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

In [40]:
from pinecone import Pinecone
import os

client = Pinecone(api_key=os.environ['PINECONE_API_KEY'])

In [None]:
index_name = "ai-training"

index = client.Index(index_name)

sentences = [
    "It does not do to dwell on dreams and forget to live.",
    "The ones that love us never really leave us.",
    "Happiness can be found even in the darkest of times if one only remembers to turn on the light.",
    "I solemnly swear that I am up to no good.",
    "After all this time? Always.",
    "Fear of a name increases fear of the thing itself.",
    "Working hard is important, but there is something that matters even more: believing in yourself.",
    "Mischief managed.",
    "You’re a wizard, Harry.",
    "Differences of habit and language are nothing at all if our aims are identical."
]





In [42]:
embeddings = model.encode(sentences, show_progress_bar=True)

Batches: 100%|██████████| 1/1 [00:00<00:00,  2.45it/s]


In [43]:
embeddings

array([[ 0.10806907, -0.033264  ,  0.054138  , ...,  0.01019363,
        -0.04888077,  0.01341156],
       [ 0.03924286, -0.05734717,  0.05856791, ...,  0.04650099,
         0.11111326, -0.00022875],
       [ 0.05457173,  0.03627849, -0.01686732, ...,  0.02342676,
        -0.04271597,  0.00404132],
       ...,
       [-0.01241201,  0.10178198,  0.03104583, ..., -0.0239843 ,
         0.06974334,  0.00350064],
       [ 0.00393334,  0.04085503, -0.01967292, ...,  0.06329961,
         0.02514079, -0.02411763],
       [ 0.10225588,  0.01978284, -0.02231681, ...,  0.00211693,
        -0.02913054, -0.06332339]], shape=(10, 384), dtype=float32)

In [49]:
vectors = []

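# One record per sentence: string id, embedding as a plain list, and filterable metadata.
for i, sentence in enumerate(sentences):
    metadata = {"text": sentence, "category": "harry potter"}
    vectors.append({'id': str(i), 'values': embeddings[i].tolist(), 'metadata': metadata})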

In [50]:
vectors

[{'id': '0',
  'values': [0.10806906968355179,
   -0.033263999968767166,
   0.054138004779815674,
   0.08762252330780029,
   0.040791451930999756,
   0.03738950192928314,
   -0.008698814548552036,
   -0.04701211303472519,
   0.11042595654726028,
   -0.09718664735555649,
   -0.011290483176708221,
   0.031321607530117035,
   0.015964992344379425,
   -0.06343211233615875,
   -0.014008059166371822,
   -0.006548354402184486,
   -0.05209029093384743,
   0.0315399095416069,
   0.011743316426873207,
   0.02359914779663086,
   0.03500012680888176,
   -0.03367157652974129,
   0.04471508413553238,
   -0.001578671857714653,
   0.003226776607334614,
   0.03583745285868645,
   -0.04590176045894623,
   -0.014071449637413025,
   0.0010921242646872997,
   0.015257418155670166,
   0.06531573832035065,
   0.10229740291833878,
   -0.055762674659490585,
   0.05170952528715134,
   0.030481666326522827,
   0.03426063433289528,
   -0.038321010768413544,
   0.01601576618850231,
   0.008692648261785507,
   0.01

In [51]:
index.upsert(namespace="default", vectors=vectors)

{'upserted_count': 10}

In [52]:
query = "Mischief"

query_embeddings = model.encode(query, show_progress_bar=True)

Batches: 100%|██████████| 1/1 [00:00<00:00,  4.90it/s]


In [55]:
result = index.query(
    namespace="default",
    top_k=3,
    include_values=True,
    include_metadata=True,
    vector=query_embeddings.tolist(),
    filter={
        "category": "harry potter"
    }
)

result

{'matches': [{'id': '7',
              'metadata': {'category': 'harry potter',
                           'text': 'Mischief managed.'},
              'score': 0.838246882,
              'values': [-0.012412007,
                         0.101781979,
                         0.0310458336,
                         0.0435147546,
                         -0.0107422145,
                         -0.0233873278,
                         0.0723281726,
                         -0.00608180137,
                         -0.0932470709,
                         0.0606928281,
                         0.0708687231,
                         0.0587492,
                         0.0228557065,
                         0.0297914073,
                         -0.0593473613,
                         0.00299838488,
                         -0.0594094172,
                         -0.00591090368,
                         -0.0764741749,
                         0.0639664,
                         0.0332575701,
    