# Experiments in using embeddinghub to index embeddings

tl;dr: nmslib does not currently support [incremental
updates](https://github.com/nmslib/nmslib/issues/73), which makes it difficult
to use for my scenarios, so I'm now investigating
[embeddinghub](https://github.com/featureform/embeddinghub) instead.

In [1]:
import nmslib
import embeddinghub as eh

from sentence_transformers import SentenceTransformer, util

Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


In [7]:
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
embeddings = model.encode(corpus, convert_to_tensor=True).cpu()

## Add embeddings to nmslib index

In [21]:
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(embeddings.cpu())
# "post" sets how much post-processing nmslib applies to the HNSW graph
# after construction (0 = none, 2 = the most).
index.createIndex({"post": 2})
index.saveIndex("./index.bin", save_data=True)

## Add embeddings to a local embeddinghub instance

In [13]:
hub = eh.connect(eh.LocalConfig("./data/"))
space = hub.create_space("kb", 384)
for i, doc in enumerate(corpus):
    space.set(doc, embeddings[i].tolist())
hub.save()

SAVE NOT IMPLEMENTED YET


## Query embeddinghub

In [17]:
queries = ['A man is eating pasta.', 
    'Someone in a gorilla costume is playing a set of drums.', 
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True).cpu()
    results = space.nearest_neighbors(5, vector=query_embedding.tolist())
    print(f"QUERY: {query}")
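    # The space was populated with the sentences themselves as keys,
    # so each returned key is already the matching sentence.
    for key in results:
        print(key, "(Score: unknown for now)")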

QUERY: A man is eating pasta.
A man is eating food. (Score: unknown for now)
A man is eating a piece of bread. (Score: unknown for now)
A man is riding a horse. (Score: unknown for now)
A man is riding a white horse on an enclosed ground. (Score: unknown for now)
A cheetah is running behind its prey. (Score: unknown for now)
QUERY: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. (Score: unknown for now)
A woman is playing violin. (Score: unknown for now)
A man is riding a horse. (Score: unknown for now)
A man is riding a white horse on an enclosed ground. (Score: unknown for now)
A cheetah is running behind its prey. (Score: unknown for now)
QUERY: A cheetah chases prey on across a field.
A cheetah is running behind its prey. (Score: unknown for now)
A man is eating food. (Score: unknown for now)
A monkey is playing drums. (Score: unknown for now)
A man is riding a white horse on an enclosed ground. (Score: unknown for now)
A man is riding a horse. (Score: unknown for now)
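
The results above carry no score because the query returned only keys in this experiment. A minimal sketch of one way to recover scores, assuming you keep your own key-to-vector mapping: the `vectors_by_key` dict and both helper names are hypothetical and not part of embeddinghub, and the toy 2-d vectors stand in for the real 384-d embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_neighbors(query_vector, neighbor_keys, vectors_by_key):
    """Pair each returned key with its cosine similarity to the query."""
    return [
        (key, cosine_similarity(query_vector, vectors_by_key[key]))
        for key in neighbor_keys
    ]


# Toy 2-d vectors standing in for the 384-d sentence embeddings.
vectors_by_key = {"doc a": [1.0, 0.0], "doc b": [0.0, 1.0]}
for key, score in score_neighbors([1.0, 0.0], ["doc a", "doc b"], vectors_by_key):
    print(f"{key} (Score: {score:.4f})")
```

In this notebook the mapping could be built as `{doc: embeddings[i].tolist() for i, doc in enumerate(corpus)}`, at the cost of holding a second copy of every vector in memory.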

In [20]:
new_corpus_entries = [
    "A man is eating spaghetti",
    "A monkey is playing timpani"
]
corpus += new_corpus_entries
incremental_embeddings = model.encode(new_corpus_entries, convert_to_tensor=True)
for i, doc in enumerate(new_corpus_entries):
    space.set(doc, incremental_embeddings[i].tolist())


In [21]:
# Query against new entries
queries = ['A man is eating pasta.', 
    'Someone in a gorilla costume is playing a set of drums.', 
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True).cpu()
    results = space.nearest_neighbors(5, vector=query_embedding.tolist())
    print(f"QUERY: {query}")
    # Keys are the sentences themselves, so print them directly.
    for key in results:
        print(key, "(Score: unknown for now)")

QUERY: A man is eating pasta.
A man is eating spaghetti (Score: unknown for now)
A man is eating food. (Score: unknown for now)
A man is eating a piece of bread. (Score: unknown for now)
A man is riding a horse. (Score: unknown for now)
A monkey is playing timpani (Score: unknown for now)
QUERY: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. (Score: unknown for now)
A monkey is playing timpani (Score: unknown for now)
A woman is playing violin. (Score: unknown for now)
A man is riding a horse. (Score: unknown for now)
A man is riding a white horse on an enclosed ground. (Score: unknown for now)
QUERY: A cheetah chases prey on across a field.
A cheetah is running behind its prey. (Score: unknown for now)
A monkey is playing timpani (Score: unknown for now)
A man is eating food. (Score: unknown for now)
A monkey is playing drums. (Score: unknown for now)
A man is riding a white horse on an enclosed ground. (Score: unknown for now)


## Load nmslib index from disk and query it

In [22]:
index = nmslib.init(method="hnsw", space="cosinesimil")
index.loadIndex("./index.bin")

In [23]:
queries = ['A man is eating pasta.', 
    'Someone in a gorilla costume is playing a set of drums.', 
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embeddings = model.encode(query, convert_to_tensor=True).cpu()
    ids, distances = index.knnQuery(query_embeddings, k=5)
    print(f"QUERY: {query}")
    for i, j in zip(ids, distances):
        print(f"{corpus[i]} (Score: {1-j:.4f})")

QUERY: A man is eating pasta.
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)
QUERY: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)
QUERY: A cheetah chases prey on across a field.
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a horse. (Score: 0.0650)


## Incrementally add some entries to the nmslib index and re-run the query

In [16]:
new_corpus_entries = [
    "A man is eating spaghetti",
    "A monkey is playing timpani"
]
incremental_embedding = model.encode(new_corpus_entries, convert_to_tensor=True)
index.addDataPointBatch(incremental_embedding.cpu())
index.createIndex({"post": 2})

In [17]:
queries = ['A man is eating pasta.', 
    'Someone in a gorilla costume is playing a set of drums.', 
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embeddings = model.encode(query, convert_to_tensor=True).cpu()
    ids, distances = index.knnQuery(query_embeddings, k=5)
    print(f"QUERY: {query}")
    for i, j in zip(ids, distances):
        print(f"{corpus[i]} (Score: {1-j:.4f})")

QUERY: A man is eating pasta.
A man is eating food. (Score: 0.8458)
A man is eating a piece of bread. (Score: 0.1698)
QUERY: Someone in a gorilla costume is playing a set of drums.
A man is eating a piece of bread. (Score: 0.5022)
A man is eating food. (Score: 0.0639)
QUERY: A cheetah chases prey on across a field.
A man is eating a piece of bread. (Score: 0.1537)
A man is eating food. (Score: 0.0189)
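
Only two neighbours come back per query, which is consistent with the rebuilt index holding just the two newly added vectors (ids 0 and 1). The print loop then maps those ids onto the first two corpus entries, so the labels above are likely misleading. One workaround, sketched here under assumptions, is to keep every vector yourself and rebuild the whole index whenever entries are added, so that ids keep lining up with document positions. `RebuildingIndex`, `BruteForceCosineIndex`, and their methods are hypothetical names; the brute-force class is a pure-Python stand-in for the approximate index so the sketch runs without nmslib.

```python
import math


def _cosine_distance(a, b):
    """1 - cosine similarity; treats an all-zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class BruteForceCosineIndex:
    """Stand-in index: exact cosine search over a fixed list of vectors."""

    def __init__(self, vectors):
        self._vectors = [list(v) for v in vectors]

    def knn_query(self, query, k):
        # Sort (distance, id) pairs so the closest vectors come first.
        scored = sorted(
            (_cosine_distance(query, vec), idx)
            for idx, vec in enumerate(self._vectors)
        )[:k]
        return [idx for _, idx in scored], [dist for dist, _ in scored]


class RebuildingIndex:
    """Keeps all documents and vectors, rebuilding the index on every add."""

    def __init__(self, build_index):
        self._build_index = build_index  # callable: list of vectors -> index
        self.documents = []
        self._vectors = []
        self._index = None

    def add(self, documents, vectors):
        documents, vectors = list(documents), list(vectors)
        if len(documents) != len(vectors):
            raise ValueError("need exactly one vector per document")
        self.documents.extend(documents)
        self._vectors.extend(list(v) for v in vectors)
        # Rebuild over everything so ids match positions in self.documents.
        self._index = self._build_index(self._vectors)

    def search(self, query, k=5):
        if self._index is None:
            return []
        ids, distances = self._index.knn_query(query, k)
        return [(self.documents[i], 1.0 - d) for i, d in zip(ids, distances)]


index = RebuildingIndex(BruteForceCosineIndex)
index.add(["doc a", "doc b"], [[1.0, 0.0], [0.0, 1.0]])
index.add(["doc c"], [[1.0, 1.0]])  # incremental add triggers a full rebuild
for doc, score in index.search([1.0, 0.9], k=2):
    print(f"{doc} (Score: {score:.4f})")
```

With nmslib, the `build_index` callable would instead need to create a fresh index, add all stored vectors, and build it, wrapped in a small adapter exposing the same query method. Each add then costs time proportional to the whole corpus, so batching additions is worth considering.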
