# Hybrid Search
This notebook shows how dense and sparse vector search can be combined into a hybrid search.

Please note: research on hybrid search strategies is still in progress.

In [1]:
import pysolr
from sentence_transformers import SentenceTransformer
from IPython.display import display,HTML

In [2]:
solr = pysolr.Solr('http://localhost:8983/solr/DenseIndex', always_commit=True)
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

In [3]:
import urllib.request
import urllib.parse
import json

Define a query string

In [15]:
semantic_query = "FIFA World Cup events with a toxic atmosphere"

Embed the semantic query and cast the result to a list, since a list is needed later on.

In [16]:
sematic_query_embedded = list(model.encode([semantic_query], prompt_name="query")[0])

In [17]:
keyword_query = "toxic"

Execute the semantic (K-Nearest Neighbours) query

In [18]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    q="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),
    rows = 30)
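
The `q` parameter above is assembled by string concatenation. A minimal sketch of a helper that produces the same string follows; `knn_query` is a hypothetical name and not part of pysolr, and the field name and `topK` value are simply the ones used in the cell above.

```python
def knn_query(field, vector, top_k=3):
    """Build a Solr {!knn} query string from a field name and a vector.

    Hypothetical helper, not part of pysolr. Each component is cast to a
    plain Python float so that NumPy scalar types do not end up in the string.
    """
    values = ", ".join(str(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


# Small made-up vector, only to show the resulting format.
print(knn_query("description_vector", [0.5, -0.25, 1.0]))
```

With this helper, `q=knn_query("description_vector", sematic_query_embedded)` should yield the same query string as the concatenation used in the notebook.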

## Define some queries

#### Combine Semantic and Keyword Search

In [37]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="t{!bool should=$lexicalQuery should=$vectorQuery}",
    lexicalQuery="{!type=edismax qf=description}Switzerland",
    vectorQuery="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

In [42]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="t{!bool must=$lexicalQuery must=$vectorQuery}",
    lexicalQuery="{!type=edismax qf=description}Italy",
    vectorQuery="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

#### Reranking

In [19]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="year:(1962 2022 1934)",
    rq="{!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}",
    rqq="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

In [20]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="portugal switzerland brasil",
    rq="{!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}",
    rqq="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

In [22]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="toxic atmosphere",
    rq="{!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}",
    rqq="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

Out of the box, the K-Nearest Neighbours search returns a score from -1 to 1, which is added to an unbounded score from the lexical side (that can be far above that range).
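
The effect of adding scores on different scales can be sketched without Solr. The example below assumes additive re-ranking, where the re-rank score times `reRankWeight` is added to the original score of the top `reRankDocs` hits. The function `rerank` and all scores are made up for illustration; this is not Solr's implementation.

```python
def rerank(original, rerank_scores, rerank_docs=4, rerank_weight=1.0):
    """Re-score the top `rerank_docs` hits by adding weighted re-rank scores.

    `original` is a list of (doc_id, score) pairs sorted by descending score;
    `rerank_scores` maps doc_id to the score of the re-rank query. Documents
    that the re-rank query does not match keep their original score.
    """
    head, tail = original[:rerank_docs], original[rerank_docs:]
    rescored = [
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in head
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail


# Made-up scores: lexical scores are unbounded, vector scores stay below 1.
lexical_hits = [("a", 12.4), ("b", 9.1), ("c", 8.7), ("d", 2.3), ("e", 1.9)]
vector_scores = {"d": 0.93, "c": 0.88, "e": 0.97}

result = rerank(lexical_hits, vector_scores)
for doc_id, score in result:
    print(doc_id, round(score, 2))
```

In this toy data, document `a` stays on top although the vector query does not match it at all, because its lexical score is an order of magnitude larger than any vector score.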

#### Recompute Scores

The filter component ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.

The lexical score is min-max normalised to the range 0 to 1 and then added to the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
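
As a rough client-side illustration of this combination, the sketch below min-max scales a handful of made-up lexical scores and adds the vector scores. `min_max_scale` is a hypothetical helper; which documents Solr uses to determine the minimum and maximum is not modelled here.

```python
def min_max_scale(scores, lo=0.0, hi=1.0):
    """Linearly map the values of `scores` onto the range [lo, hi]."""
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        # All scores are equal; mapping them to `lo` is an arbitrary choice.
        return {doc_id: lo for doc_id in scores}
    span = largest - smallest
    return {
        doc_id: lo + (score - smallest) * (hi - lo) / span
        for doc_id, score in scores.items()
    }


# Made-up scores; a document missing from `vector` did not match that query.
lexical = {"a": 12.0, "b": 6.0, "c": 0.0, "d": 0.0}
vector = {"b": 0.9, "c": 0.8, "d": 0.7}

scaled = min_max_scale(lexical)
combined = {doc: scaled[doc] + vector.get(doc, 0.0) for doc in lexical}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

After scaling, both signals live on a comparable range, so a document that does well on both sides (`b` here) can overtake the best purely lexical hit.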

In [50]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="t{!bool filter=$retrievalStage must=$rankingStage}",
    retrievalStage = "{!bool should=$lexicalQuery should=$vectorQuery}",
    rankingStage = "{!func}sum(query($normalisedLexialQuery),query($vectorQuery))",
    normalisedLexialQuery = "{!func}scale(query($lexicalQuery),0,1)",
    lexicalQuery="{!type=edismax qf=description}Italy",
    vectorQuery="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

As before, the filter component only builds the hybrid result set and the must clause assigns the score.

The lexical score is min-max normalised to the range 0.1 to 1 and then multiplied by the K-Nearest Neighbours score.
There is no evidence that this gives better results than the simple sum; it is always recommended to build a prototype and test your assumptions on real queries and rated datasets.
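
One plausible reason for a lower bound of 0.1 rather than 0 (an assumption, the text does not state it) is that a factor of 0 would wipe out the vector score of the document with the lowest lexical score. The sketch below compares both lower bounds on made-up scores; `scale_to` is a hypothetical helper.

```python
def scale_to(value, smallest, largest, lo, hi):
    """Map `value` from [smallest, largest] onto [lo, hi]; needs largest > smallest."""
    return lo + (value - smallest) * (hi - lo) / (largest - smallest)


# Made-up scores: "c" has the lowest lexical score but a strong vector score.
lexical = {"a": 12.0, "b": 6.0, "c": 0.0}
vector = {"a": 0.2, "b": 0.6, "c": 0.9}

for lo in (0.0, 0.1):
    product = {
        doc: scale_to(lexical[doc], 0.0, 12.0, lo, 1.0) * vector[doc]
        for doc in lexical
    }
    print(lo, {doc: round(score, 3) for doc, score in product.items()})
```

With a lower bound of 0, document `c` ends up with a product of exactly 0; with 0.1 it keeps a small, non-zero score.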

In [52]:
solr_response=solr.search(fl=['title', 'description', 'score'],
    df="description",                    
    q="t{!bool filter=$retrievalStage must=$rankingStage}",
    retrievalStage = "{!bool should=$lexicalQuery should=$vectorQuery}",
    rankingStage = "{!func}product(query($normalisedLexialQuery),query($vectorQuery))",
    normalisedLexialQuery = "{!func}scale(query($lexicalQuery),0.1,1)",
    lexicalQuery="{!type=edismax qf=description}Italy",
    vectorQuery="{!knn f=description_vector topK=3}"+str([float(w) for w in sematic_query_embedded]),                        
    rows = 30)

Build the union of the keyword search and the semantic search - duplicates eliminated

Build the intersection of both search methods
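
In the combined queries earlier in the notebook, the `should` clauses correspond to the union and the `must` clauses to the intersection. The same two set operations can be sketched on the client side with made-up document ids:

```python
# Made-up document ids returned by each search method.
keyword_hits = ["d3", "d1", "d7"]
semantic_hits = ["d1", "d4", "d3"]

# Union: every document found by either method, each listed once.
union = list(dict.fromkeys(keyword_hits + semantic_hits))

# Intersection: only documents found by both methods, in keyword order.
semantic_set = set(semantic_hits)
intersection = [doc for doc in keyword_hits if doc in semantic_set]

print(union)
print(intersection)
```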

## Show the results
First run one of the queries above

In [53]:

print(f"Found {len(solr_response)} results")


for doc in solr_response:
    display(HTML("<h3>"+doc['title']+"   Score="+str(doc['score'])+"</h3>"))
    display(HTML(doc['description']))
    print("")

Found 16 results

See also:
https://sease.io/2023/12/hybrid-search-with-apache-solr.html
