# Demo search backend

This notebook demonstrates basic search functionality using OpenSearch and the Haystack framework. Before using OpenSearch, you must have Docker Desktop installed and be a member of the [MoJ Docker org](https://user-guide.operations-engineering.service.justice.gov.uk/documentation/services/dockerhub.html#docker) so that you are covered by a licence.

To install the necessary packages, run `pip install -e '.[search_backend, dev]'`.

Before running this notebook, start an OpenSearch container (see docker-compose.yml) by running:
```
docker compose up localstack
```
Alternatively, follow the instructions in the Haystack documentation: https://docs.haystack.deepset.ai/v2.0/docs/opensearchbm25retriever

In [None]:
import json
from haystack import Document
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore
from search_backend.indexing_pipeline import IndexingPipeline
from search_backend.retrieval_pipeline import RetrievalPipeline
from search_backend.search import Search


cfg = {
    # Optional arg for the OpenSearch docstore, to prevent trying to index everything in one go
    "index_batch_size": 10,
    # Select embedding model for the semantic search. This should be a sentence-similarity
    # model available on Huggingface: https://huggingface.co/models?pipeline_tag=sentence-similarity
    "dense_embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
    # The value of the embedding dimension must match that specified for the model defined above
    "embedding_dim": 384,
    # Language model used to rank search results better than the embedding retrieval can
    "rerank_model": "cross-encoder/ms-marco-MiniLM-L-2-v2",
}
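
The config comments above require `embedding_dim` to match the chosen model. A minimal sketch of how that could be checked before indexing, assuming the `sentence-transformers` package is installed; `check_embedding_dim` is a hypothetical helper and not part of `search_backend`.

```python
def check_embedding_dim(model, expected_dim):
    """Return the model's embedding size, or raise if it differs from the config.

    `model` is assumed to expose `get_sentence_embedding_dimension()`,
    as `sentence_transformers.SentenceTransformer` does.
    """
    actual = model.get_sentence_embedding_dimension()
    if actual != expected_dim:
        raise ValueError(
            f"embedding_dim is {expected_dim}, but the model produces {actual}-dimensional vectors"
        )
    return actual


# Possible usage in this notebook (downloads the model on first run):
# from sentence_transformers import SentenceTransformer
# check_embedding_dim(SentenceTransformer(cfg["dense_embedding_model"]), cfg["embedding_dim"])
```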

## Get some text data

This dataset is based on Wikipedia introductions to the Seven Wonders of the Ancient World.

In [None]:
# Read the demo records; an explicit encoding avoids platform-dependent defaults
with open('../tests/data/demo_data.json', encoding='utf-8') as f:
    doc_list = json.load(f)

In [None]:
doc_list

In [None]:
# Put into Haystack Document instances
docs = [Document(**content) for content in doc_list]
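
Because each record is unpacked straight into `Document(**content)`, a malformed record only fails at construction time or later, when the search cells read `doc.meta["title"]`. The sketch below checks the shape first. It assumes each record carries a non-empty `content` string and a `meta` dict with a `title`, which is what the later cells read; `validate_records` and the sample record are hypothetical.

```python
def validate_records(records):
    """Check that every record has non-empty text and a title in its metadata."""
    for i, record in enumerate(records):
        content = record.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"record {i} has no usable 'content'")
        meta = record.get("meta")
        if not isinstance(meta, dict) or "title" not in meta:
            raise ValueError(f"record {i} has no 'title' in 'meta'")
    return len(records)


# Illustrative record only; the real data lives in demo_data.json
sample = [{"content": "An ancient lighthouse.", "meta": {"title": "Example title"}}]
print(validate_records(sample))
```

In the notebook, `validate_records(doc_list)` could run just before the documents are built.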

## Set up OpenSearch

In [None]:
# Connect to an existing OpenSearch container - see docker-compose.yml for OpenSearch settings
query_document_store = OpenSearchDocumentStore(
    hosts="http://0.0.0.0:4566/opensearch/eu-west-2/rd-demo",
    use_ssl=False,
    verify_certs=False,
    http_auth=("localstack", "localstack"),
    embedding_dim=cfg["embedding_dim"],
    batch_size=cfg["index_batch_size"],
)

In [None]:
# Embed the documents and write them to the OpenSearch document store
indexer = IndexingPipeline(
    query_document_store,
    dense_embedding_model=cfg["dense_embedding_model"],
    semantic=True,
)
indexer.index_docs(docs)
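
To confirm that indexing wrote the documents, the store's document count can be compared with the number of inputs. A minimal sketch, assuming the store exposes `count_documents()` as Haystack 2.x document stores do; `check_indexed` is a hypothetical helper and not part of `search_backend`.

```python
def check_indexed(store, expected_minimum):
    """Return the store's document count, or raise if fewer than expected were written."""
    count = store.count_documents()
    if count < expected_minimum:
        raise RuntimeError(
            f"expected at least {expected_minimum} documents in the store, found {count}"
        )
    return count


# Possible usage in this notebook:
# check_indexed(query_document_store, len(docs))
```

The check uses a minimum rather than equality because the index may already hold documents from an earlier run.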

## Run BM25 search

In [None]:
bm25_pipeline = RetrievalPipeline(query_document_store).setup_bm25_pipeline()
bm25_search_init = Search(bm25_pipeline)

In [None]:
test_query = "lighthouse"
# test_query = "wonder that features plants"
results = bm25_search_init.bm25_search(test_query, top_k=3)

for doc in results:
    print(f'{doc.meta["title"]} - Score: {doc.score}')
    print(doc.content)

In [None]:
results[0].meta

In [None]:
results

## Run semantic search

In [None]:
semantic_pipeline = RetrievalPipeline(
    query_document_store,
    dense_embedding_model=cfg['dense_embedding_model'],
    rerank_model=cfg['rerank_model']
).setup_semantic_pipeline()
semantic_search_init = Search(semantic_pipeline)

In [None]:
test_query = "wonder that features plants"
results = semantic_search_init.semantic_search(test_query, top_k=3, threshold=0.00001)

for doc in results:
    print(f'{doc.meta["title"]} - Score: {doc.score}')
    print(doc.content)

## Run hybrid search

In [None]:
hybrid_pipeline = RetrievalPipeline(
    query_document_store,
    dense_embedding_model=cfg['dense_embedding_model'],
    rerank_model=cfg['rerank_model']
).setup_hybrid_pipeline()
hybrid_search_init = Search(hybrid_pipeline)

In [None]:
test_query = "wonder that features plants"
results = hybrid_search_init.hybrid_search(test_query, bm25_top_k=3, semantic_top_k=3, threshold=0.000001)

for doc in results:
    print(f'{doc.meta["title"]} - Score: {doc.score}')
    print(doc.content)
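
Hybrid search has to merge a keyword ranking and a semantic ranking into one list. This notebook does not show how `setup_hybrid_pipeline` combines them, so the sketch below illustrates one common technique, reciprocal rank fusion, purely as an assumption about how such a merge can work. The function name, the document ids, and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stable output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ids standing in for BM25 and semantic results
bm25_ranking = ["lighthouse", "gardens", "colossus"]
semantic_ranking = ["gardens", "temple", "lighthouse"]
print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
# → ['gardens', 'lighthouse', 'temple', 'colossus']
```

Documents that appear near the top of both lists rise above those that appear in only one, which is the behaviour a hybrid search generally aims for.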