# DocArrayRetriever: Backends

[DocArray](https://github.com/docarray/docarray) is a versatile, open-source tool for managing your multi-modal data. It lets you shape your data however you want, and gives you the flexibility to store and search it using various document index backends. Better still, you can use your DocArray document index to create a DocArrayRetriever and build awesome LangChain apps!

Right now, DocArray works with five kinds of document index backends:

1. [InMemoryExactNNIndex](#InMemoryExactNNIndex)
2. [HnswDocumentIndex](#HnswDocumentIndex)
3. [WeaviateDocumentIndex](#WeaviateDocumentIndex)
4. [ElasticDocIndex](#ElasticDocIndex)
5. [QdrantDocumentIndex](#QdrantDocumentIndex)

This notebook is mainly about introducing the backends. You'll see how to set up and index each backend, and how to create a DocArrayRetriever to search for relevant documents.

However, if you want to learn more about using the DocArrayRetriever in general, check out this other notebook: https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/docarray_usage.html






In [1]:
from langchain.retrievers import DocArrayRetriever
from docarray import BaseDoc
from docarray.typing import NdArray
import numpy as np
from langchain.embeddings import FakeEmbeddings
import random

embeddings = FakeEmbeddings(size=32)
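A note on `FakeEmbeddings`: it is a testing stand-in, and to our understanding it returns random vectors of the requested size instead of calling a real model. That would mean the documents retrieved later in this notebook are essentially arbitrary and will differ between runs. The sketch below imitates that behaviour in plain Python. `ToyFakeEmbeddings` is a hypothetical name used only for illustration; it is not LangChain's implementation.

```python
import random
from typing import List


class ToyFakeEmbeddings:
    """Hypothetical stand-in that mimics the idea behind FakeEmbeddings."""

    def __init__(self, size: int) -> None:
        self.size = size

    def embed_query(self, text: str) -> List[float]:
        # The text is ignored: every call draws a fresh random vector.
        return [random.gauss(0.0, 1.0) for _ in range(self.size)]


toy = ToyFakeEmbeddings(size=32)
vec_a = toy.embed_query("some query")
vec_b = toy.embed_query("some query")

print(len(vec_a))      # 32
print(vec_a == vec_b)  # False: same text, different vectors
```

For meaningful search results, swap in a real embedding model whose output dimension matches the schema.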



Before you start building the index, it's important to define your document schema. This determines what fields your documents will have and what type of data each field will hold.

For this demonstration, we'll create a somewhat random schema containing 'title' (str), 'title_embedding' (numpy array of length 32), 'year' (int), and 'color' (str).

In [2]:
class MyDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32]
    year: int
    color: str
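To make the role of the schema concrete, here is a plain-Python analogue that uses only the standard library. It assumes, as we understand DocArray's typed fields to behave, that the shape in `NdArray[32]` is checked when a document is created. `PlainDoc` and `EMBEDDING_DIM` are hypothetical names for this sketch and are not part of DocArray.

```python
from dataclasses import dataclass
from typing import List

EMBEDDING_DIM = 32  # mirrors the 32 in NdArray[32]


@dataclass
class PlainDoc:
    """Hypothetical plain-Python analogue of MyDoc, for illustration only."""

    title: str
    title_embedding: List[float]
    year: int
    color: str

    def __post_init__(self) -> None:
        # Reject embeddings whose length differs from the declared dimension.
        if len(self.title_embedding) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} values, got {len(self.title_embedding)}"
            )


ok = PlainDoc(title="My document 0", title_embedding=[0.0] * 32, year=0, color="red")

try:
    PlainDoc(title="bad", title_embedding=[0.0] * 8, year=1, color="blue")
    rejected = False
except ValueError:
    rejected = True

print(rejected)  # True
```

The practical takeaway is that the embedding size in the schema has to match the size produced by your embedding model (32 in this notebook).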

## InMemoryExactNNIndex

InMemoryExactNNIndex stores all Documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.

Learn more here: https://docs.docarray.org/user_guide/storing/index_in_memory/

In [3]:
from docarray.index import InMemoryExactNNIndex


# initialize the index
db = InMemoryExactNNIndex[MyDoc]()
# index data
db.index(
    [
        MyDoc(
            title=f'My document {i}',
            title_embedding=embeddings.embed_query(f'query {i}'),
            year=i,
            color=random.choice(['red', 'green', 'blue']),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}

In [4]:
# create a retriever
retriever = DocArrayRetriever(
    index=db, 
    embeddings=embeddings, 
    search_field='title_embedding', 
    content_field='title',
    filters=filter_query,
)

# find the relevant document
doc = retriever.get_relevant_documents('some query')
print(doc)

[Document(page_content='My document 48', metadata={'id': '50d3db418c6e70749c1d72045fd591c6', 'year': 48, 'color': 'red'})]
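If you are curious what "exact" nearest-neighbour search with a filter boils down to, here is a minimal brute-force sketch in plain Python. It is an illustration under assumptions, not DocArray's code: the cosine metric, the filter-then-rank order, and the names `cosine` and `exact_search` are all choices made for this example.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_search(
    docs: List[Dict],
    query: List[float],
    max_year: Optional[int] = None,
    top_k: int = 1,
) -> List[Dict]:
    # Step 1: keep only documents that pass the filter (year <= max_year).
    candidates = [d for d in docs if max_year is None or d["year"] <= max_year]
    # Step 2: score every remaining document against the query, no approximation.
    ranked = sorted(
        candidates, key=lambda d: cosine(d["embedding"], query), reverse=True
    )
    return ranked[:top_k]


docs = [
    {"title": "My document 0", "year": 0, "embedding": [1.0, 0.0]},
    {"title": "My document 50", "year": 50, "embedding": [0.6, 0.8]},
    {"title": "My document 95", "year": 95, "embedding": [0.0, 1.0]},
]

hits = exact_search(docs, query=[0.0, 1.0], max_year=90)
print(hits[0]["title"])  # My document 50
```

Without the filter, 'My document 95' would be the best match. The filter removes it, so the next-closest document wins. Scanning every document like this is fine for small datasets, which is why an in-memory exact index is a reasonable starting point.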


## HnswDocumentIndex

HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html).

Learn more here: https://docs.docarray.org/user_guide/storing/index_hnswlib/

In [5]:
from docarray.index import HnswDocumentIndex


# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir='hnsw_index')

# index data
db.index(
    [
        MyDoc(
            title=f'My document {i}',
            title_embedding=embeddings.embed_query(f'query {i}'),
            year=i,
            color=random.choice(['red', 'green', 'blue']),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}

In [6]:
# create a retriever
retriever = DocArrayRetriever(
    index=db, 
    embeddings=embeddings, 
    search_field='title_embedding', 
    content_field='title',
    filters=filter_query,
)

# find the relevant document
doc = retriever.get_relevant_documents('some query')
print(doc)

[Document(page_content='My document 7', metadata={'id': '00840c63e5db617570db773ad0248800', 'year': 7, 'color': 'red'})]


## WeaviateDocumentIndex

WeaviateDocumentIndex is a document index that is built upon the [Weaviate](https://weaviate.io/) vector database.

Learn more here: https://docs.docarray.org/user_guide/storing/index_weaviate/

In [7]:
# There's a small difference with the Weaviate backend compared to the others. 
# Here, you need to 'mark' the field used for vector search with 'is_embedding=True'. 
# So, let's create a new schema for Weaviate that takes care of this requirement.

from pydantic import Field 

class WeaviateDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32] = Field(is_embedding=True)
    year: int
    color: str

In [8]:
from docarray.index import WeaviateDocumentIndex


# initialize the index
dbconfig = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080"
)
db = WeaviateDocumentIndex[WeaviateDoc](db_config=dbconfig)

# index data
db.index(
    [
        WeaviateDoc(
            title=f'My document {i}',
            title_embedding=embeddings.embed_query(f'query {i}'),
            year=i,
            color=random.choice(['red', 'green', 'blue']),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"path": ["year"], "operator": "LessThanEqual", "valueInt": 90}

In [9]:
# create a retriever
retriever = DocArrayRetriever(
    index=db, 
    embeddings=embeddings, 
    search_field='title_embedding', 
    content_field='title',
    filters=filter_query,
)

# find the relevant document
doc = retriever.get_relevant_documents('some query')
print(doc)

[Document(page_content='My document 43', metadata={'id': '6c6a94aef580bb862426c703d7293219', 'year': 43, 'color': 'green'})]


## ElasticDocIndex

ElasticDocIndex is a document index that is built upon [Elasticsearch](https://github.com/elastic/elasticsearch).

Learn more here: https://docs.docarray.org/user_guide/storing/index_elastic/

In [10]:
from docarray.index import ElasticDocIndex


# initialize the index
db = ElasticDocIndex[MyDoc](
    hosts="http://localhost:9200", 
    index_name="docarray_retriever"
)

# index data
db.index(
    [
        MyDoc(
            title=f'My document {i}',
            title_embedding=embeddings.embed_query(f'query {i}'),
            year=i,
            color=random.choice(['red', 'green', 'blue']),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"range": {"year": {"lte": 90}}}

In [11]:
# create a retriever
retriever = DocArrayRetriever(
    index=db, 
    embeddings=embeddings, 
    search_field='title_embedding', 
    content_field='title',
    filters=filter_query,
)

# find the relevant document
doc = retriever.get_relevant_documents('some query')
print(doc)

[Document(page_content='My document 56', metadata={'id': '87df6ea12aa3d09ebd2a1a42ac6b4947', 'year': 56, 'color': 'blue'})]


## QdrantDocumentIndex

QdrantDocumentIndex is a document index that is built upon the [Qdrant](https://qdrant.tech/) vector database.

Learn more here: https://docs.docarray.org/user_guide/storing/index_qdrant/

In [12]:
from docarray.index import QdrantDocumentIndex
from qdrant_client.http import models as rest


# initialize the index
qdrant_config = QdrantDocumentIndex.DBConfig(path=":memory:")
db = QdrantDocumentIndex[MyDoc](qdrant_config)

# index data
db.index(
    [
        MyDoc(
            title=f'My document {i}',
            title_embedding=embeddings.embed_query(f'query {i}'),
            year=i,
            color=random.choice(['red', 'green', 'blue']),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = rest.Filter(
    must=[
        rest.FieldCondition(
            key="year",
            range=rest.Range(
                gte=10,
                lt=90,
            ),
        )
    ]
)



In [13]:
# create a retriever
retriever = DocArrayRetriever(
    index=db, 
    embeddings=embeddings, 
    search_field='title_embedding', 
    content_field='title',
    filters=filter_query,
)

# find the relevant document
doc = retriever.get_relevant_documents('some query')
print(doc)

[Document(page_content='My document 50', metadata={'id': 'f63f8581ba4107d998afcc0f47ed776b', 'year': 50, 'color': 'red'})]
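One thing to keep in mind when switching backends: the filter query uses each backend's own syntax. As a quick recap, the sketch below builds a "year less than or equal to N" filter in the three dict-based dialects used in this notebook. `year_lte_filter` and its backend labels are hypothetical helpers for illustration and are not part of DocArray or LangChain. Qdrant is left out because its filter is built from `qdrant_client` model objects rather than a plain dict.

```python
from typing import Any, Dict


def year_lte_filter(backend: str, max_year: int) -> Dict[str, Any]:
    """Hypothetical helper: build a 'year <= max_year' filter per backend."""
    if backend in ("in_memory", "hnsw"):
        # MongoDB-style operator syntax, as used for the two local indexes.
        return {"year": {"$lte": max_year}}
    if backend == "weaviate":
        # Weaviate 'where' filter; the value is passed as an integer here.
        return {"path": ["year"], "operator": "LessThanEqual", "valueInt": max_year}
    if backend == "elastic":
        # Elasticsearch range query.
        return {"range": {"year": {"lte": max_year}}}
    raise ValueError(f"unsupported backend: {backend}")


print(year_lte_filter("elastic", 90))  # {'range': {'year': {'lte': 90}}}
```

The resulting dict can then be passed as the `filters` argument of `DocArrayRetriever`, as in the cells above.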
