# Dense Index

This notebook shows how to use the dense index with Baguetter.

The embeddings are generated with our inference engine [Ofen](https://github.com/mixedbread-ai/ofen).

In [None]:
%pip install "ofen[torch]==0.0.1"

In [2]:
from ofen.models import TextEncoder
from baguetter.indices import USearchDenseIndex
from baguetter.evaluation import evaluate_retrievers, HFDataset

## 1. Load the dataset and prepare the model

In [3]:
ds = HFDataset("mteb/scidocs")

model = TextEncoder.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
# Convert model to half precision (FP16) for efficiency
model.half()


# Define the embedding function expected by the USearchDenseIndex.
# Alternatively, you can compute the embeddings yourself and add them to the index.
def embed_fn(text: list[str], is_query: bool = False, show_progress: bool = False):
    if is_query:
        text = [f"Represent this sentence for searching relevant passages: {query}" for query in text]
    return model.encode(text, batch_size=256, show_progress=show_progress).embeddings
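
The only difference between embedding a query and embedding a document in `embed_fn` is the retrieval prompt prepended to queries. The sketch below isolates that step so it can be inspected without loading the model. The helper name `apply_query_prompt` and the constant `QUERY_PROMPT` are hypothetical and introduced only for illustration; the prompt string itself is copied from the cell above.

```python
# Prompt string copied from embed_fn above; the names below are hypothetical.
QUERY_PROMPT = "Represent this sentence for searching relevant passages: "


def apply_query_prompt(texts: list[str], is_query: bool = False) -> list[str]:
    """Return the strings that would be passed to the encoder."""
    if is_query:
        # Queries get the retrieval prompt prepended.
        return [f"{QUERY_PROMPT}{text}" for text in texts]
    # Documents are passed through unchanged.
    return list(texts)


print(apply_query_prompt(["what is a dense index?"], is_query=True))
print(apply_query_prompt(["A dense index stores embeddings."]))
```

Keeping the prompt handling inside the embedding function means callers only need to indicate whether the input is a query.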

## 2. Create the dense index

In [4]:
index = USearchDenseIndex(embed_fn=embed_fn)

## 3. Add documents and search the index

In [5]:
doc_ids, docs = ds.get_corpus()

index.add_many(doc_ids[:100], docs[:100], show_progress=True)

Add: 100%|██████████| 100/100 [00:00<00:00, 10812.85vector/s]


usearch.Index
- config
-- data type: ScalarKind.F16
-- dimensions: 1024
-- metric: MetricKind.Cos
-- multi: False
-- connectivity: 16
-- expansion on addition :128 candidates
-- expansion on search: 64 candidates
- binary
-- uses OpenMP: 0
-- uses SimSIMD: 1
-- supports half-precision: 1
-- uses hardware acceleration: haswell
- state
-- size: 100 vectors
-- memory usage: 21,006,272 bytes
-- max level: 0
--- 0. 100 nodes

In [6]:
index.search(docs[0], top_k=10)

SearchResults(keys=['632589828c8b9fca2c3a59e97451fde8fa7d188d', '51317b6082322a96b4570818b7a5ec8b2e330f2f', '506172b0e0dd4269bdcfe96dda9ea9d8602bbfb6', '2a047d8c4c2a4825e0f0305294e7da14f8de6fd3', '86e87db2dab958f1bd5877dc7d5b8105d6e31e46', 'c108437a57bd8f8eaed9e26360ee100074e3f3fc', '24ff5027e7042aeead47ef3071f1a023243078bb', '8e508720cdb495b7821bf6e43c740eeb5f3a444a', '2ae40898406df0a3732acc54f147c1d377f54e2a', '55ca165fa6091973674b12ea8fa3f1a3a1e50a6d'], scores=array([0.988507  , 0.7668916 , 0.7029923 , 0.69961214, 0.68070495,
       0.6354893 , 0.62746024, 0.6235753 , 0.6081617 , 0.6014837 ],
      dtype=float32), normalized=True)
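
The index above is configured with a cosine metric, and the returned scores are sorted from highest to lowest, so they appear to be similarity scores. As a rough mental model, the following sketch performs an exact, brute-force cosine search over a toy corpus in plain Python. It is not how USearch works internally (USearch builds an HNSW graph for approximate search), and the function names and toy vectors are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def brute_force_search(query, keys, vectors, top_k=10):
    """Score every vector against the query and keep the top_k best."""
    scored = [(key, cosine(query, vec)) for key, vec in zip(keys, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional "embeddings" (hypothetical data).
toy_keys = ["doc-a", "doc-b", "doc-c"]
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

hits = brute_force_search([1.0, 0.0], toy_keys, toy_vectors, top_k=2)
print(hits)
```

Exact search like this scales linearly with the corpus size, which is the cost an approximate index is meant to avoid.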

## 4. Evaluate dense retriever

In [7]:
result = evaluate_retrievers(
    datasets=[ds],
    retriever_factories={
        "usearch": lambda: USearchDenseIndex(
            embed_fn=embed_fn
        )
    }
)
result.save("eval_results")

Evaluating  1 retrievers...
---------------------------------------------------------------
Datasets:  ['mteb/scidocs']
Top K:  100
Metrics:  ['ndcg@1', 'ndcg@5', 'ndcg@10', 'precision@1', 'precision@5', 'precision@10', 'mrr@1', 'mrr@5', 'mrr@10']
Ignore identical IDs:  True

Evaluating Dataset: mteb/scidocs
---------------------------------------------------------------
Starting Adding 25657 documents to usearch...


Encoding: 100%|██████████| 101/101 [00:46<00:00,  2.15it/s]
Add: 100%|██████████| 25657/25657 [00:00<00:00, 44663.61vector/s]


Adding 25657 documents to usearch took 47.65 seconds
Starting Searching 1000 queries with usearch...


Encoding: 100%|██████████| 4/4 [00:00<00:00,  7.39it/s]
Search: 100%|██████████| 1000/1000 [00:00<00:00, 4464.11vector/s]


Searching 1000 queries with usearch took 0.95 seconds

Report (rounded):
---------------------------------------------------------------
#    Model      NDCG@1    NDCG@5    NDCG@10    P@1    P@5    P@10    MRR@1    MRR@5    MRR@10
---  -------  --------  --------  ---------  -----  -----  ------  -------  -------  --------
a    usearch     0.254     0.193      0.231  0.254  0.173   0.121    0.254    0.362     0.376
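
To make the columns of the report easier to read, here is a minimal sketch of the three metric families for a single query, assuming binary relevance (a document is either relevant or not). It is not necessarily the implementation used by `evaluate_retrievers`; the function names and the toy ranking are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking for one query (hypothetical data).
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

print(precision_at_k(ranked, relevant, 5))
print(mrr_at_k(ranked, relevant, 5))
print(ndcg_at_k(ranked, relevant, 5))
```

Under binary relevance, all three metrics reduce to "is the top result relevant?" at k=1, which is consistent with the identical NDCG@1, P@1, and MRR@1 values in the report. The reported numbers are averages over all queries.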
