# Test search functions

This notebook demonstrates basic search functionality using OpenSearch and the Haystack framework. Before using OpenSearch, you must have Docker Desktop installed and be a member of the [MoJ Docker org](https://user-guide.operations-engineering.service.justice.gov.uk/documentation/services/dockerhub.html#docker) so that you are covered by a licence.

To install necessary packages, run `pip install -e '.[search_backend, dev]'`.

Before running this notebook, set up an OpenSearch container by following the instructions at https://docs.haystack.deepset.ai/v2.0/docs/opensearchbm25retriever

Alternatively, open Docker Desktop and run:
```
docker pull opensearchproject/opensearch:2.11.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m" opensearchproject/opensearch:2.11.0
```
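
The node can take a little while to accept requests after the container starts. The following is a minimal sketch of a readiness check, assuming the same HTTPS endpoint and `admin`/`admin` credentials that the document store cell later in this notebook uses. The helper names `health_is_usable` and `fetch_cluster_health` are hypothetical and not part of this repository.

```python
import base64
import json
import ssl
import urllib.request


def health_is_usable(health: dict) -> bool:
    """Return True if a _cluster/health response reports a usable cluster.

    A single-node cluster often reports "yellow" because replica shards
    cannot be assigned, so yellow is treated as usable here.
    """
    return health.get("status") in {"green", "yellow"}


def fetch_cluster_health(base_url="https://localhost:9200", user="admin", password="admin"):
    """Query the _cluster/health endpoint (assumed credentials; local use only)."""
    # Certificate verification is disabled on the assumption that the local
    # container presents a self-signed certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/_cluster/health",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

With the container running, `health_is_usable(fetch_cluster_health())` should return `True` once the node is ready.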

Hybrid search was introduced in OpenSearch v2.11. It is not yet clear whether Haystack works properly with a version this recent, and Haystack has not yet enabled OpenSearch's native hybrid search.
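
To illustrate the general idea behind hybrid search, the sketch below combines keyword (BM25) and dense-embedding scores by min-max normalising each set and taking a weighted average. This is only one common way of fusing scores. It is not a description of how OpenSearch, Haystack, or this repository's `setup_hybrid_pipeline` implement it, and the function names and the example scores are made up.

```python
def min_max_normalise(scores: dict) -> dict:
    """Rescale scores to the range [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - lowest) / (highest - lowest) for doc_id, score in scores.items()}


def combine_scores(bm25_scores: dict, dense_scores: dict, bm25_weight: float = 0.5) -> list:
    """Fuse two score dictionaries into one ranking, best match first."""
    bm25_norm = min_max_normalise(bm25_scores)
    dense_norm = min_max_normalise(dense_scores)
    combined = {}
    # A document missing from one retriever's results contributes 0 for that retriever.
    for doc_id in set(bm25_norm) | set(dense_norm):
        combined[doc_id] = (
            bm25_weight * bm25_norm.get(doc_id, 0.0)
            + (1 - bm25_weight) * dense_norm.get(doc_id, 0.0)
        )
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Invented scores for four hypothetical documents
ranked = combine_scores(
    {"a": 10.0, "b": 5.0, "c": 0.0},
    {"a": 0.2, "b": 0.8, "d": 0.5},
)
print([doc_id for doc_id, _ in ranked])
```

Normalising first matters because BM25 scores are unbounded while similarity scores usually fall in a fixed range, so raw values are not directly comparable.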

In [None]:
import json
import os
# os.chdir(os.path.expanduser('~/rd-search-backend'))  # os.chdir does not expand '~' itself

from search_backend.api.lib.config import get_config
from search_backend.api.lib.opensearchpipeline import run_indexing_pipeline, RetrievalPipeline
from search_backend.api.lib.searchfunctions import Search

cfg = get_config()

test_query = "short custody"

print(test_query)

## Read data

In [None]:
with open('ai_catalogue.json') as f:
    project_list = json.load(f)

print(project_list)

In [3]:
# Replace newlines as they interfere with the matching.
# None and any other non-string values are left unchanged.
project_list = [
    {k: v.replace('\n', ' ') if isinstance(v, str) else v for k, v in project.items()}
    for project in project_list
]

In [None]:
project_list[0]

In [4]:
# If the data contains multiple fields we'd want to search over, list them here
fields_to_search = [
    'project_name',
    'description',
    'what_does_this_initiative_do',
    'reasons_for_use',
    'problem_solved_by_the_initiative',
    'metrics_or_intended_impacts'
]

In [5]:
def format_doc_dict(doc: dict, field: str):
    """
    Reformat data into the format accepted by Haystack.
    Here we wish to search over multiple fields, so the text from each field goes into
    its own dictionary, and the dictionaries are collected in a list.

    :param doc: dictionary containing the text data to search along with accompanying metadata
    :param field: dictionary key indicating which field to index the text from
    :return: dictionary with 'meta' and 'content' keys, or None if the field is empty
    """

    content = doc[field]
    # doc.pop(field)

    if content is None:
        # We can't index None values, so returning None here allows us to skip fields where no info is provided
        return None
    else:
        meta = doc.copy()
        meta['matched_field'] = field

        doc_dict = {
            'meta': meta,
            'content': content,
        }

        return doc_dict


dataset = [y for project in project_list for field in fields_to_search if (y := format_doc_dict(project, field)) is not None]
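
To make the resulting shape concrete, the sketch below applies the same per-field logic to a single invented record. `to_field_documents` is a hypothetical stand-in that mirrors `format_doc_dict` plus the list comprehension above, except that it uses `.get()` and so also skips fields that are missing entirely.

```python
def to_field_documents(records: list, fields: list) -> list:
    """Produce one {'meta', 'content'} dictionary per non-empty field of each record."""
    docs = []
    for record in records:
        for field in fields:
            content = record.get(field)
            if content is None:
                # Nothing to index for this field
                continue
            meta = dict(record)  # shallow copy so the source record is not modified
            meta["matched_field"] = field
            docs.append({"meta": meta, "content": content})
    return docs


# Made-up record for illustration only
toy_records = [
    {
        "project_name": "Example project",
        "description": "A made-up description.",
        "reasons_for_use": None,
    }
]

toy_dataset = to_field_documents(
    toy_records, ["project_name", "description", "reasons_for_use"]
)

for doc in toy_dataset:
    print(doc["meta"]["matched_field"], "->", doc["content"])
```

One record with two populated fields yields two entries, each carrying the full record as metadata, so a search hit can report which field matched.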

In [None]:
print(len(dataset))
dataset

## Set up OpenSearch

In [8]:
# Connect to an existing OpenSearch document store
# query_document_store = SERVICES["querydocumentstore"]
import certifi
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

query_document_store = OpenSearchDocumentStore(
    hosts="https://localhost:9200",
    use_ssl=True,
    ca_certs=certifi.where(),
    # Certificate verification is switched off for the local test container only
    verify_certs=False,
    http_auth=("admin", "admin"),
    embedding_dim=cfg["embedding_dim"],
    recreate_index=True,
    index="document",
)

In [None]:
run_indexing_pipeline(dataset, query_document_store, cfg, semantic=True)

## Run BM25 search

In [9]:
bm25_pipeline = RetrievalPipeline(query_document_store)
bm25_pipeline = bm25_pipeline.setup_bm25_pipeline()

In [None]:
test_query = "improved service quality"
results = Search(test_query, bm25_pipeline, top_k=3).bm25_search()


for doc in results["bm25_retriever"]['documents']:
    print('-----------------------------------')
    print(f'{doc.meta["project_name"]} - Score: {doc.score}')
    print(doc.content)
    print("\n")

In [None]:
results["bm25_retriever"]['documents'][0].meta
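
Because each project is indexed once per searchable field, the same project can appear more than once in the results. The sketch below shows one possible way of collapsing hits to the best-scoring entry per project. `Hit` is a hypothetical stand-in for the returned document objects (only the `content`, `score` and `meta` attributes used above are assumed), and `best_hit_per_project` is not part of this repository.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Minimal stand-in for a retrieved document (hypothetical)."""
    content: str
    score: float
    meta: dict = field(default_factory=dict)


def best_hit_per_project(documents, key: str = "project_name") -> list:
    """Keep the highest-scoring hit for each project, best match first.

    Assumes every document has a numeric score.
    """
    best = {}
    for doc in documents:
        name = doc.meta.get(key)
        if name not in best or doc.score > best[name].score:
            best[name] = doc
    return sorted(best.values(), key=lambda doc: doc.score, reverse=True)


# Invented hits: project A matches on two different fields
hits = [
    Hit("text from description", 2.0, {"project_name": "A", "matched_field": "description"}),
    Hit("text from reasons_for_use", 3.5, {"project_name": "A", "matched_field": "reasons_for_use"}),
    Hit("text from description", 3.0, {"project_name": "B", "matched_field": "description"}),
]

for doc in best_hit_per_project(hits):
    print(doc.meta["project_name"], doc.meta["matched_field"], doc.score)
```

In this notebook the same function could be called as `best_hit_per_project(results["bm25_retriever"]["documents"])`. Note that requesting `top_k=3` and then de-duplicating can leave fewer than three projects.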

## Run semantic search

In [18]:
semantic_pipeline = RetrievalPipeline(query_document_store, dense_embedding_model=cfg['dense_embedding_model'], rerank_model=cfg['rerank_model'])
semantic_pipeline = semantic_pipeline.setup_semantic_pipeline()

In [None]:
test_query = "project relating to legislation"
results = Search(test_query, semantic_pipeline, top_k=3).semantic_search()


for doc in results["ranker"]['documents']:
    print('-----------------------------------')
    print(f'{doc.meta["project_name"]} - Score: {doc.score}')
    print(doc.content)
    print("\n")

## Run hybrid search

In [16]:
hybrid_pipeline = RetrievalPipeline(query_document_store, dense_embedding_model=cfg['dense_embedding_model'], rerank_model=cfg['rerank_model'])
hybrid_pipeline = hybrid_pipeline.setup_hybrid_pipeline()

In [None]:
test_query = "improved service quality"
results = Search(test_query, hybrid_pipeline, top_k=3).hybrid_search()


for doc in results["ranker"]['documents']:
    print('-----------------------------------')
    print(f'{doc.meta["project_name"]} - Score: {doc.score}')
    print(doc.content)
    print("\n")