## Tutorial

To read more about hybrid search, check out [this tutorial](https://www.elastic.co/search-labs/tutorials/search-tutorial/vector-search/hybrid-search).



## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

## Preparing the index

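We create the index with an `embedding` field of type `dense_vector` to store the embeddings. The mapping does not set `dims`, so the dimension is expected to be inferred from the first vector that gets indexed.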

In [None]:
es.indices.delete(index="my_index", ignore_unavailable=True)
es.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
            }
        }
    },
)
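If you would rather not rely on defaults, the vector field can also be declared explicitly. The sketch below only builds the mapping dictionary. The `dims`, `index`, and `similarity` values are assumptions chosen to match a 384-dimensional model compared with cosine similarity; they are not settings taken from this notebook, and `explicit_mappings` is a hypothetical name.

```python
# Hypothetical explicit mapping for the `embedding` field.
EMBEDDING_DIMS = 384  # assumed: output size of the embedding model

explicit_mappings = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": EMBEDDING_DIMS,
            "index": True,           # index the vectors so kNN search can use them
            "similarity": "cosine",  # assumed similarity metric
        }
    }
}

# It could then be passed to the same call used above:
# es.indices.create(index="my_index", mappings=explicit_mappings)
print(explicit_mappings["properties"]["embedding"])
```

Declaring `dims` up front makes a model mismatch fail at indexing time instead of going unnoticed.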

## Embedding model

I chose the `all-MiniLM-L6-v2` model because it is fast, compact, and versatile as a general-purpose model. It produces embeddings of dimension `384` and truncates input longer than `256` word pieces (tokens, not words). This model is very popular in the community, with almost `50M` downloads in one month.

To download and use this model, Hugging Face offers a Python package called `sentence-transformers`, which simplifies computing dense vector representations.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
model = model.to(device)
model

## Load documents

In [None]:
import json

# Use a context manager so the file handle is closed after loading.
with open("../data/astronomy.json") as f:
    documents = json.load(f)
documents

## Embed documents

In [None]:
from tqdm import tqdm
from pprint import pprint


def get_embedding(text):
    return model.encode(text)


operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({"index": {"_index": "my_index"}})
    operations.append(
        {
            **document,
            "embedding": get_embedding(document["content"]),
        }
    )

response = es.bulk(operations=operations)
pprint(response.body)
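A bulk request reports failures per item in its response rather than stopping at the first bad document, so it is worth checking the response before moving on. The helper below is a minimal sketch: `summarize_bulk_response` is a hypothetical name, and `fake_body` is a made-up response that only imitates the shape of a bulk response (an `errors` flag plus one entry per action in `items`).

```python
def summarize_bulk_response(body):
    """Return (number of successful items, list of (doc_id, error) failures)."""
    succeeded = 0
    failures = []
    for item in body.get("items", []):
        # Each item has a single key naming the action ("index", "create", ...).
        _action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((result.get("_id"), result["error"]))
        else:
            succeeded += 1
    return succeeded, failures


# Made-up response used only to illustrate the shape; not real output.
fake_body = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400, "error": {"type": "mapper_parsing_exception"}}},
    ],
}

succeeded, failures = summarize_bulk_response(fake_body)
print(f"{succeeded} succeeded, {len(failures)} failed")
```

With a real response you would call `summarize_bulk_response(response.body)`.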

We indexed all documents with an additional field `embedding`. Let's retrieve the documents to verify that the text was converted to a dense vector.

In [None]:
response = es.search(index="my_index", query={"match_all": {}})

pprint(response["hits"]["hits"])

Awesome! We successfully inserted the documents with the additional `embedding` field. Now, let's check the mapping to confirm that the dimension of the dense vector is 384.

In [None]:
response = es.indices.get_mapping(index="my_index")
pprint(response.body)

## kNN search

Before showing how to perform hybrid search, let's first see how to perform a kNN search using the `embedding` field.

In [None]:
query = "What is a black hole?"
embedded_query = get_embedding(query)

result = es.search(
    index="my_index",
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 3,
    },
)

n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

In [None]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*" * 100)
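To build intuition for what a kNN query returns, here is a brute-force sketch in plain Python. It assumes cosine similarity and uses tiny made-up 3-dimensional vectors; the names `cosine_similarity`, `brute_force_knn`, `toy_vectors`, and `toy_query` are hypothetical. Elasticsearch does not scan every vector like this: it uses an approximate index, which is why the query takes a `num_candidates` parameter in addition to `k`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query_vector, vectors, k):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors for illustration only.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.05, 0.0]

nearest = brute_force_knn(toy_query, toy_vectors, k=2)
for doc_id, similarity in nearest:
    print(f"{doc_id}: {similarity:.3f}")
```

The exact scan is fine for a handful of documents but grows linearly with the index size, which is the cost the approximate index avoids.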

## Full-text search

Now, let's search for the same query using full-text search.

In [None]:
query = "What is a black hole?"
result = es.search(
    index="my_index",
    query={"match": {"content": {"query": query}}},
)
n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

In [None]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*" * 100)

We can see that the first document is the only one that contains the term "black hole", which is why its score is much higher than the scores of the other documents. We can add the `minimum_should_match` parameter to require that at least a given percentage of the query terms appear in a document. This helps filter out documents that are not relevant to the query.
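As a rough illustration of how a percentage turns into a required number of terms, the sketch below mimics the idea in plain Python. It is an approximation under stated assumptions: `simple_tokenize` is a hypothetical stand-in for the analyzer (lowercase, split on non-alphanumeric characters, no stop-word removal), the computed number of required terms is rounded down, and the sample texts are made up.

```python
import re


def simple_tokenize(text):
    # Rough stand-in for the analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def required_matches(n_terms, percentage):
    # The number computed from the percentage is rounded down.
    return n_terms * percentage // 100


def satisfies(query, content, percentage):
    """True if enough distinct query terms appear in the content."""
    query_terms = set(simple_tokenize(query))
    content_terms = set(simple_tokenize(content))
    needed = required_matches(len(query_terms), percentage)
    return len(query_terms & content_terms) >= needed


query_text = "What is a black hole?"
print(required_matches(len(set(simple_tokenize(query_text))), 80))

# Made-up sample texts for illustration only.
print(satisfies(query_text, "A black hole is a region of spacetime.", 80))
print(satisfies(query_text, "What is a star made of?", 80))
```

Under these assumptions the query has five terms, so `80%` requires four of them, which a document that only shares "what", "is", and "a" cannot meet.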

In [None]:
query = "What is a black hole?"
result = es.search(
    index="my_index",
    query={"match": {"content": {"query": query, "minimum_should_match": "80%"}}},
)
n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

As you can see, the number of documents returned is now 1 instead of 6.

In [None]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*" * 100)

## Hybrid search

### Paid solution

As you saw from the previous two sections, both kNN search and full-text search have their strengths and weaknesses. kNN search is great for finding semantically similar documents, while full-text search is better for finding documents that contain specific keywords.

Each method of searching returns valuable results that the other method would miss, so combining both methods can yield better results. This is where hybrid search comes in. Hybrid search uses the [RRF](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion) (Reciprocal Rank Fusion) algorithm to combine the results of both kNN search and full-text search.
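RRF ignores the raw scores and works only with ranks: each document receives `1 / (k + rank)` from every ranked list it appears in, and those contributions are summed. The sketch below is a minimal plain-Python version of that idea, not the Elasticsearch implementation. The function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used value, assumed here), and the document IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs into (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks start at 1; a document missing from a list contributes nothing.
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings: the keyword search found one document, kNN found three.
keyword_ranking = ["doc_1"]
knn_ranking = ["doc_1", "doc_4", "doc_2"]

fused = reciprocal_rank_fusion([keyword_ranking, knn_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because only ranks matter, the lists can have different lengths and incomparable score scales, and a document ranked highly by both methods rises to the top.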

We specify `rrf` in the `rank` parameter of the search API to perform hybrid search. 

In [None]:
query = "What is a black hole?"
result = es.search(
    index="my_index",
    query={"match": {"content": {"query": query, "minimum_should_match": "80%"}}},
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 3,
    },
    rank={"rrf": {}},
)
n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

I got this exception because RRF is not available in the free version of Elasticsearch: you need a [paid subscription](https://www.elastic.co/subscriptions) to use this feature. If you don't have one, I will show you how to use [ranx](https://github.com/AmenRa/ranx) to perform the RRF fusion instead.

### Free solution

Start by getting the results from both searches.

In [None]:
query = "What is a black hole?"

keyword_results = es.search(
    index="my_index",
    query={"match": {"content": {"query": query, "minimum_should_match": "80%"}}},
    size=10,
)
keyword_hits = keyword_results.body["hits"]["hits"]

knn_results = es.search(
    index="my_index",
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 10,
        "k": 5,
    },
)
knn_hits = knn_results.body["hits"]["hits"]

len(keyword_hits), len(knn_hits)

The lists `keyword_hits` and `knn_hits` don't contain the same number of documents. This is not an issue, because RRF can fuse ranked lists of different lengths.

Now, convert the results to [Run](https://amenra.github.io/ranx/run/) objects.

In [None]:
from ranx import Run

query_id = "query_id"

keyword_run_dict = {query_id: {hit["_id"]: hit["_score"] for hit in keyword_hits}}
knn_run_dict = {query_id: {hit["_id"]: hit["_score"] for hit in knn_hits}}

keyword_run = Run.from_dict(keyword_run_dict, name="keyword_search")
knn_run = Run.from_dict(knn_run_dict, name="knn_search")

Finally, use the [fuse](https://amenra.github.io/ranx/fusion/) function from `ranx` to combine the two runs using RRF.

In [None]:
from ranx import fuse

combined_run = fuse(runs=[keyword_run, knn_run], method="rrf")

Sort the results by score in descending order.

In [None]:
from pprint import pprint

sorted_results = sorted(
    combined_run[query_id].items(), key=lambda item: item[1], reverse=True
)
pprint(sorted_results)

Notice that the final list is a fusion of both lists.

In [None]:
print(f"Length of the kNN list: {len(knn_hits)}")
print(f"Length of the keyword list: {len(keyword_hits)}\n")
print(f"Length of the final list: {len(sorted_results)}")

The scores have changed but the document that talks about black holes is still at the top.

In [None]:
combined_hits = {}
for hit in knn_hits + keyword_hits:
    hit_id = hit["_id"]
    if hit_id not in combined_hits:
        combined_hits[hit_id] = hit

for doc_id, score in sorted_results:
    # Look up the full hit directly by its document ID.
    hit = combined_hits[doc_id]
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {score}")
    print("*" * 100)