Vector databases allow for fast comparison of embeddings.

Here I am going to use ChromaDB:

https://docs.trychroma.com/usage-guide

In [None]:
%pip install chromadb

In [4]:
# read passages from a file
import json

with open('passages.jsonl', 'r') as f:
    passages = [json.loads(line) for line in f]


These passages are from #dealing_with_pdfs.ipynb

In [30]:
passages[0]

{'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Cybersecurity\\Application Security\\Security_slides08_Apps.pdf',
 'page': 0,
 'passage': ['Application \nSecurity\nMichał Szychowiak\n , PhD\nhttp s://www.cs.put.poznan.pl/mszychowiak'],
 'embedding': [0.026378992944955826,
  0.032453376799821854,
  0.006036689970642328,
  0.030262457206845284,
  0.0931553840637207,
  0.03764684870839119,
  0.04652584716677666,
  0.03759697824716568,
  -0.009518101811408997,
  -0.007155050523579121,
  -0.01823224313557148,
  0.0024533483665436506,
  -0.04043308272957802,
  0.007214752957224846,
  -0.010217620059847832,
  0.05745114013552666,
  0.08077249675989151,
  0.020502407103776932,
  0.0024849968031048775,
  -0.005711267236620188,
  0.062455419450998306,
  0.07145043462514877,
  0.0598815493285656,
  0.03083709441125393,
  0.02414828911423683,
  0.03239484876394272,
  0.042618077248334885,
  -0.006426290608942509,
  -0.046566564589738846,
  0.003730559954419732,
  0.046498674899339676,
  

In [5]:
documents = []
embeddings = []
metadatas = []

for passage in passages:
    if len(passage["passage"]) > 0:
        documents.append(passage["passage"][0])
        embeddings.append(
            passage['embedding']
        )
        metadatas.append({
            "path": passage['path'],
            "page": passage['page'],
        }
        )
    else:
        print(passage)
        

{'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\Social Networks\\social_networks_01_linked.pdf', 'page': 1, 'passage': [], 'embedding': [0.016788996756076813, -0.013808876276016235, -0.01604822464287281, 0.021892279386520386, 0.07405026257038116, 0.02662951871752739, 0.028776604682207108, 0.015657663345336914, -0.03983648121356964, -0.007657313719391823, 0.011292863637208939, 0.018824435770511627, -0.044349633157253265, 0.0019514847081154585, -0.005497219040989876, 0.07410973310470581, 0.07029037922620773, 0.0014849513536319137, 0.012235247530043125, -0.03864269703626633, 0.06009656563401222, 0.04208151996135712, 0.060616087168455124, 0.02429085597395897, 0.05661562457680702, 0.02079658955335617, 0.04402822256088257, -0.0012350984616205096, -0.04035063087940216, 0.014038575813174248, 0.025642238557338715, 0.023573245853185654, -0.023927075788378716, 0.00347224622964859, 0.03293684124946594, -0.027249563485383987, 0.016938041895627975, -0.0193

In [6]:
import chromadb
client = chromadb.Client()

client.list_collections()

[]

In [7]:
len(embeddings)

1453

In [8]:

collection = client.create_collection("sample_collection")

collection.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=[f"{i}" for i in range(len(documents))]  # in practice use sturdier IDs, e.g. a hash of the passage, which also handles duplicates
)
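The comment above suggests hash-based IDs. Here is a minimal sketch of that idea, assuming a SHA-256 digest of the passage text is an acceptable identifier. `passage_id` and the sample strings are hypothetical and not part of ChromaDB:

```python
import hashlib


def passage_id(text: str) -> str:
    """Return a deterministic ID derived from the passage text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Identical passages map to the same ID, so duplicates collapse into one entry
# before anything is handed to collection.add().
sample_documents = ["Local search", "SPARQL basics", "Local search"]
unique = {passage_id(doc): doc for doc in sample_documents}

ids = list(unique.keys())
deduplicated_documents = list(unique.values())
```

In the notebook itself the embeddings and metadatas would have to be deduplicated in the same pass so that all four lists stay aligned.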


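The key thing to note: use the same model for query embeddings as for passage embeddings! Vectors produced by different models live in different spaces, so distances between them are meaningless.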

In [9]:
from fastembed.embedding import FlagEmbedding as Embedding


model = Embedding(model_name="BAAI/bge-base-en-v1.5", max_length=512)


Let's test some queries

In [24]:
query = "How does SPARQL syntax work?"
query_embedding = list(model.query_embed(query))[0].tolist()

In [25]:
collection.query(
    query_embeddings=[query_embedding],  # a list holding our single query vector
)

{'ids': [['1420',
   '1426',
   '1444',
   '1421',
   '1424',
   '1442',
   '1423',
   '1427',
   '1437',
   '1428']],
 'distances': [[0.34534579515457153,
   0.35128283500671387,
   0.39873144030570984,
   0.40308111906051636,
   0.4037553369998932,
   0.40621864795684814,
   0.41147512197494507,
   0.4136652648448944,
   0.42037996649742126,
   0.42296984791755676]],
 'metadatas': [[{'page': 0,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL query language\\SPARQL - lecture - en(1).pdf'},
   {'page': 6,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL query language\\SPARQL - lecture - en(1).pdf'},
   {'page': 24,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL query language\\SPARQL - lecture - en(1).pdf'},
   {'page': 1,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL qu

In [26]:
query = "what versions of local search?"
query_embedding = list(model.query_embed(query))[0].tolist()

In [27]:
collection.query(
    query_embeddings=[query_embedding],  # a list holding our single query vector
)

{'ids': [['1189',
   '1178',
   '1426',
   '1448',
   '1435',
   '1436',
   '1437',
   '1104',
   '1441',
   '1440']],
 'distances': [[0.5299314260482788,
   0.6316799521446228,
   0.6416252255439758,
   0.663886308670044,
   0.6649327874183655,
   0.6657594442367554,
   0.6669726371765137,
   0.6676359176635742,
   0.671486496925354,
   0.6754134893417358]],
 'metadatas': [[{'page': 97,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\Theory.pdf'},
   {'page': 86,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\Theory.pdf'},
   {'page': 6,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL query language\\SPARQL - lecture - en(1).pdf'},
   {'page': 28,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\SPARQL query language\\SPARQL - lecture - en(1).pdf'},
   {'page': 15,
    'path': 'C:\\Users\\kryst\\Documents\\St

This is not what we were looking for. Sometimes the results make no sense.

Let's try again with a "where_document" filter to guide our search

In [28]:
collection.query(
    query_embeddings=[query_embedding],
    where_document={"$contains": "version"},  # keep only passages containing this substring
)

{'ids': [['889',
   '888',
   '899',
   '1230',
   '903',
   '906',
   '1283',
   '1250',
   '917',
   '918']],
 'distances': [[0.726885199546814,
   0.7334774136543274,
   0.7376772165298462,
   0.7504379749298096,
   0.7521195411682129,
   0.7550857067108154,
   0.7675451040267944,
   0.774927020072937,
   0.7796245813369751,
   0.7843963503837585]],
 'metadatas': [[{'page': 100,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\EC.pdf'},
   {'page': 99,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\EC.pdf'},
   {'page': 110,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\EC.pdf'},
   {'page': 18,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Semantic Web and Social Networks\\Knowledge Graphs\\Knowledge_graphs_en .pdf'},
   {'page': 114,
    'path': 'C:\\Users\\kryst\\Documents\\Stuff\\University\\Evolutionary Computation\\EC.pdf'},
   {'page': 117,
   

Now we get good results.

Getting bad results is sometimes inevitable: no embedding model can capture everything yet, so we need a way to fix that. Imagine being stuck with an interface that allows only semantic search.

When possible, use the structure of the problem to get better results. Here we know that the slide we were looking for contains the string "version".

ChromaDB allows filtering on metadata as well: https://docs.trychroma.com/usage-guide#querying-a-collection . One obvious suggestion here would be to put the course name in the metadata, which would fix our results as well.
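A rough sketch of the course-name idea, assuming every path follows the layout visible in the outputs above, with the course folder directly under `University`. `course_from_path` and the `course` metadata key are hypothetical names introduced for illustration:

```python
from pathlib import PureWindowsPath


def course_from_path(path: str, root: str = "University") -> str:
    """Return the folder directly below `root` (hypothetical helper).

    Raises ValueError if `root` does not appear in the path.
    """
    parts = PureWindowsPath(path).parts
    return parts[parts.index(root) + 1]


example_path = (
    "C:\\Users\\kryst\\Documents\\Stuff\\University"
    "\\Evolutionary Computation\\Theory.pdf"
)

# Metadata entry as it could be built in the loop that fills `metadatas`
metadata = {
    "path": example_path,
    "page": 97,
    "course": course_from_path(example_path),
}

# Filter dictionary that could be passed as `where=` when querying
course_filter = {"course": "Evolutionary Computation"}
```

With the course stored for each passage, a call such as `collection.query(query_embeddings=[query_embedding], where=course_filter)` should restrict the search to a single course. It can be combined with `where_document` when both a keyword and a course are known.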

The holy grail of search is combining old and new methods to:

- allow for imprecise queries
- allow for typos (techniques like Levenshtein distance)
- filter on keywords
- filter on arbitrary metadata
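To make the typo-handling point concrete, here is a small sketch of Levenshtein distance using the standard dynamic-programming formulation. The `close_matches` helper, its `max_distance` threshold, and the sample vocabulary are assumptions made for illustration:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def close_matches(word, vocabulary, max_distance=2):
    """Return vocabulary entries within `max_distance` edits of `word`."""
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


suggestions = close_matches("verison", ["version", "vector", "search"])
```

A mistyped keyword could be corrected this way before it is used in a `$contains` filter. Comparing against every vocabulary entry is slow for large vocabularies, which is one reason dedicated full-text engines exist.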


Vector databases are effective at vector search, but they are not always designed for full-text search (FTS).

Some important features of FTS:

- filtering out stopwords
- lemmatization
- handling typos

E.g. if a user drops a stopword while searching for a literal occurrence ("looking at mountains" instead of "looking at the mountains"), we still want them to get the proper results.
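The stopword case can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular FTS engine works; the tiny `STOPWORDS` set and both helper names are assumptions:

```python
import re

# Hypothetical, deliberately tiny stopword list; real engines ship much larger ones.
STOPWORDS = {"a", "an", "the", "at", "of", "in", "on", "to"}


def content_tokens(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def phrase_matches(query, document):
    """True if the query's content words appear consecutively in the document."""
    q = content_tokens(query)
    d = content_tokens(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


found = phrase_matches("looking at mountains", "We were looking at the mountains all day.")
missed = phrase_matches("looking at rivers", "We were looking at the mountains all day.")
```

Because stopwords are removed on both sides, the query without "the" still matches the passage that contains it. Lemmatization would be a further normalisation step applied in the same place.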

Some solutions I am aware of that allow a hybrid approach:
- Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
- CozoDB: https://docs.cozodb.org/en/latest/tutorial.html (though it may be too complicated)


I encourage you to dig into which features suit you best. For me, ChromaDB handles everything I need for my projects, and I don't have much data, so performance is not an issue. It is good practice to calculate embeddings outside of the vector database's API.


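Most vector databases use the HNSW algorithm.

A good explanation is available here: https://www.pinecone.io/learn/series/faiss/hnsw/

Sometimes we want to set *ef_construction*. In short, it controls how many candidate neighbours are considered when a point is inserted, and so how well connected our "small worlds" will be. It is a quality/efficiency hyperparameter: higher values tend to improve recall at the cost of slower index construction.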
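As a sketch of where such a parameter might be set: ChromaDB releases from around the time of this notebook read HNSW settings from the collection metadata at creation time. The key names and values below are assumptions to check against the documentation of your installed version, since newer releases configure the index differently:

```python
# Assumed metadata keys and illustrative values; verify against your ChromaDB version.
hnsw_settings = {
    "hnsw:construction_ef": 200,  # candidate list size while building the index
    "hnsw:M": 16,                 # maximum number of neighbours kept per node
}

# These settings would be passed when the collection is created, e.g.:
# collection = client.create_collection("sample_collection", metadata=hnsw_settings)
```

Settings like these are normally fixed when the index is built, so changing them later usually means re-creating the collection.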