# Annoy

This notebook shows how to use the `Annoy` vector store.

> "Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data."

via https://github.com/spotify/annoy


```{note}
Annoy is read-only: once the index is built, you cannot add any more embeddings.
If you need to add to your VectorStore incrementally, choose an alternative.
```
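Because the index cannot grow, one workaround for small collections is to rebuild the store from the full list of texts whenever something new arrives. A minimal sketch, assuming a hypothetical helper `rebuild_with_new_texts` and a `build_store` callable that you supply yourself (neither is part of LangChain):

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Store = TypeVar("Store")


def rebuild_with_new_texts(
    existing_texts: Sequence[str],
    new_texts: Sequence[str],
    build_store: Callable[[List[str]], Store],
) -> Tuple[Store, List[str]]:
    """Hypothetical helper: build a fresh store from the old and new texts."""
    all_texts = list(existing_texts) + list(new_texts)
    # The old index is discarded; every text is embedded and indexed again.
    return build_store(all_texts), all_texts
```

With the objects from the Quickstart below, `build_store` could be `lambda t: Annoy.from_texts(t, embeddings_func)`. Every rebuild re-embeds all texts, so this only makes sense when additions are rare.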

## Quickstart

In [1]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Annoy

embeddings_func = HuggingFaceEmbeddings()

In [2]:
texts = ["pizza is great", "I love salad", "my car", "a dog"]

vector_store = Annoy.from_texts(texts, embeddings_func)

In [3]:
vector_store.similarity_search("food", k=3)

[Document(page_content='pizza is great', metadata={}),
 Document(page_content='I love salad', metadata={}),
 Document(page_content='my car', metadata={})]

In [4]:
# the score is a distance metric, so lower is better
vector_store.similarity_search_with_score("food", k=3)

[(Document(page_content='pizza is great', metadata={}), 1.0944390296936035),
 (Document(page_content='I love salad', metadata={}), 1.1273186206817627),
 (Document(page_content='my car', metadata={}), 1.1580758094787598)]
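Assuming the index uses Annoy's `angular` metric (the metric passed to `AnnoyIndex` in the "Construct from scratch" section), the score is the Euclidean distance between the normalized vectors, that is `sqrt(2 * (1 - cos))`. The sketch below uses plain Python and hypothetical helper names to show the relationship to cosine similarity:

```python
import math
from typing import Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Angular distance as defined by Annoy: sqrt(2 * (1 - cosine_similarity))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def distance_to_cosine(distance: float) -> float:
    """Invert the formula above: cos = 1 - d^2 / 2."""
    return 1.0 - distance * distance / 2.0
```

Under this assumption, a score of 0 means the vectors point in the same direction, about 1.414 means they are orthogonal, and 2 means they are opposite. The top score above, roughly 1.094, would correspond to a cosine similarity of about 0.40.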

## Search via embeddings

In [5]:
motorbike_emb = embeddings_func.embed_query("motorbike")

In [6]:
vector_store.similarity_search_by_vector(motorbike_emb, k=3)

[Document(page_content='my car', metadata={}),
 Document(page_content='a dog', metadata={}),
 Document(page_content='pizza is great', metadata={})]

In [7]:
vector_store.similarity_search_with_score_by_vector(motorbike_emb, k=3)

[(Document(page_content='my car', metadata={}), 1.0870471000671387),
 (Document(page_content='a dog', metadata={}), 1.2095637321472168),
 (Document(page_content='pizza is great', metadata={}), 1.3254905939102173)]

## Search via docstore id

In [8]:
vector_store.index_to_docstore_id

{0: '21109f80-38d0-4944-9b3e-9fe17fba87ba',
 1: '67d9de18-a93d-46a5-af11-00c05922f9a5',
 2: '17effa8f-0603-4a5e-9572-f3570e5eff67',
 3: 'fb25ef58-0226-423b-8267-ba925258768f'}

In [9]:
some_docstore_id = 0  # a position in the Annoy index (here texts[0]), not a docstore UUID

vector_store.docstore._dict[vector_store.index_to_docstore_id[some_docstore_id]]

Document(page_content='pizza is great', metadata={})

In [10]:
# same document has distance 0
vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)

[(Document(page_content='pizza is great', metadata={}), 0.0),
 (Document(page_content='I love salad', metadata={}), 1.0734446048736572),
 (Document(page_content='my car', metadata={}), 1.2895267009735107)]
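`similarity_search_with_score_by_index` takes the integer position in the Annoy index, not the docstore UUID. If you only have the UUID, one option is to invert `index_to_docstore_id`. A sketch with made-up ids standing in for the real mapping:

```python
# Stand-in for vector_store.index_to_docstore_id (the ids are made up).
index_to_docstore_id = {0: "id-pizza", 1: "id-salad", 2: "id-car", 3: "id-dog"}

# Reverse lookup: docstore id -> position in the Annoy index.
docstore_id_to_index = {
    doc_id: idx for idx, doc_id in index_to_docstore_id.items()
}

annoy_index = docstore_id_to_index["id-car"]
print(annoy_index)  # 2
```

The resulting integer can then be passed to `similarity_search_with_score_by_index`.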

## Save and load

In [11]:
vector_store.save_local("my_annoy_index_and_docstore")

In [12]:
loaded_vector_store = Annoy.load_local("my_annoy_index_and_docstore", embeddings=embeddings_func)

In [13]:
# same document has distance 0
loaded_vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)

[(Document(page_content='pizza is great', metadata={}), 0.0),
 (Document(page_content='I love salad', metadata={}), 1.0734446048736572),
 (Document(page_content='my car', metadata={}), 1.2895267009735107)]

## Construct from scratch

In [14]:
import uuid
from annoy import AnnoyIndex
from langchain.docstore.document import Document
from langchain.docstore.in_memory import InMemoryDocstore

metadatas = [{"x": "food"}, {"x": "food"}, {"x": "stuff"}, {"x": "animal"}]

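# embed every text; one vector per document
embeddings = embeddings_func.embed_documents(texts)

# embedding dimension
f = len(embeddings[0])

# build the Annoy index; item ids are the positions in `texts`
index = AnnoyIndex(f, metric="angular")
for i, emb in enumerate(embeddings):
    index.add_item(i, emb)
index.build(10)  # 10 trees; no items can be added after this call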

# docstore
documents = []
for i, text in enumerate(texts):
    metadata = metadatas[i] if metadatas else {}
    documents.append(Document(page_content=text, metadata=metadata))
index_to_docstore_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
docstore = InMemoryDocstore(
    {index_to_docstore_id[i]: doc for i, doc in enumerate(documents)}
)

db_manually = Annoy(embeddings_func.embed_query, index, docstore, index_to_docstore_id)

In [17]:
db_manually.similarity_search_with_score("eating!", k=3)

[(Document(page_content='pizza is great', metadata={'x': 'food'}),
  1.1314140558242798),
 (Document(page_content='I love salad', metadata={'x': 'food'}),
  1.1668788194656372),
 (Document(page_content='my car', metadata={'x': 'stuff'}), 1.226445198059082)]
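The pieces passed to `Annoy(...)` fit together roughly as follows: the raw index returns item positions and distances, `index_to_docstore_id` turns positions into ids, and the docstore turns ids into documents. The sketch below illustrates that lookup chain with stand-in data; it is not LangChain's actual implementation. `resolve_hits` is a hypothetical helper, a plain dict plays the role of the docstore, and the `(indices, distances)` pair has the shape returned by `AnnoyIndex.get_nns_by_vector(..., include_distances=True)`:

```python
from typing import Dict, List, Sequence, Tuple


def resolve_hits(
    indices: Sequence[int],
    distances: Sequence[float],
    index_to_docstore_id: Dict[int, str],
    docs_by_id: Dict[str, str],
) -> List[Tuple[str, float]]:
    """Hypothetical helper: map Annoy item positions to stored documents."""
    hits = []
    for idx, dist in zip(indices, distances):
        doc_id = index_to_docstore_id[idx]  # position -> docstore id
        hits.append((docs_by_id[doc_id], dist))  # docstore id -> document
    return hits


# Stand-in data; the ids and distances are made up.
index_to_docstore_id = {0: "id-a", 1: "id-b"}
docs_by_id = {"id-a": "pizza is great", "id-b": "I love salad"}
hits = resolve_hits([1, 0], [0.5, 0.9], index_to_docstore_id, docs_by_id)
print(hits)  # [('I love salad', 0.5), ('pizza is great', 0.9)]
```

This is why the keys of `index_to_docstore_id` must match the item ids given to `index.add_item`.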