feat(python): Hybrid search & Reranker API #824

AyushExel · 2024-01-18T21:30:43Z

based on #713

The Reranker API can be plugged into vector-only or FTS-only search, but this PR doesn't do that (see example: https://txt.cohere.com/rerank/).

Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`

table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()

Available rerankers

LinearCombinationReranker

from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
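
To make the `weight` and `fill` parameters concrete, here is a minimal sketch of a linear combination over already-normalized values. It is not the code from this PR: the function name `linear_combine` and the dict inputs keyed by row id are hypothetical, and it assumes that `weight` applies to the vector component, that the FTS score is inverted so both terms are "lower is better" (as the `1 - fj["score"]` line in the review below suggests), and that `fill` stands in for a row missing from one side.

```python
def linear_combine(vector_dist, fts_score, weight=0.7, fill=1.0):
    """Blend per-row results from two searches into one distance-like value.

    vector_dist: {row_id: normalized distance in [0, 1], lower is better}
    fts_score:   {row_id: normalized score in [0, 1], higher is better}
    Rows missing from one side get `fill` for that side (assumed convention).
    """
    combined = {}
    for row_id in set(vector_dist) | set(fts_score):
        v = vector_dist.get(row_id, fill)
        # Invert the FTS score so both terms are "lower is better".
        f = 1.0 - fts_score[row_id] if row_id in fts_score else fill
        combined[row_id] = weight * v + (1.0 - weight) * f
    # Best match first: smallest combined value.
    return sorted(combined.items(), key=lambda kv: kv[1])


ranked = linear_combine({10: 0.0, 4: 0.6}, {4: 1.0, 7: 0.5})
```

With `fill=1.0`, a row found by only one of the two searches is treated as the worst possible match on the other side, so rows found by both searches tend to rise.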

Cohere Reranker

from lancedb.rerankers import CohereReranker

# Default model. English and multilingual models are supported; see the docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

CrossEncoderReranker

from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

Using custom Reranker

from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or substitute custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here

        return combined_res

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (seeing weird results from the Cohere reranker in the toy example)
- Support diverse rerankers by default:
  - [x] Cross encoding
  - [x] Cohere
  - [x] Reciprocal Rank Fusion
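
For readers unfamiliar with Reciprocal Rank Fusion, the last item in the checklist above, here is a minimal sketch of the standard formula: each document scores the sum of `1 / (k + rank)` over the ranked lists it appears in. This is not taken from this PR; the function name `reciprocal_rank_fusion` is hypothetical and `k=60` is the commonly used constant, assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; a higher fused score is better."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the first item in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return ids ordered from best to worst fused score.
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["a", "b", "c"]
fts_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
```

Because only ranks are used, this approach needs no score normalization, which is one reason it is often paired with hybrid search.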

changhiskhan

Please also add tests for the rerankers (including the simple linear combo one).
Curious to see the real world impact on search results

prrao87

The options for linear combination, cross encoder and Cohere reranker make sense from a variety standpoint and cover all the bases. I've seen others go down the road of Rank Fusion methods, but we can revisit that based on community feedback (though most people will likely go with Cohere because its quality is really good).

The weight of 0.7 for the linear combination is a sensible default (I presume the 0.7 is for the vector component). Would love to give this a spin on some larger data and see how it fares on search relevance tasks! Performance is a whole other matter and could be improved with more efficient numpy/pyarrow ops down the line, hopefully.

changhiskhan · 2024-01-20T00:56:49Z

python/lancedb/rerankers/cohere.py

+        scores = np.array([result[1] for result in results])
+        sorted_indices = np.argsort(scores)[::-1]
+        # sort the results by the sorted indices
+        combined_results = combined_results.take(sorted_indices)


I think you need to do combined_results.take(indices[np.argsort(scores)])?

Oh, the results are already sorted in descending order of score by the Cohere API, so sorting is not needed. Removed it.
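
The point of this exchange is that a reranking service hands back pairs of (original index, score), so any reordering has to go through the returned indices rather than through positions in the score array. A minimal pure-Python sketch of that idea follows. The `results` list of `(index, score)` tuples is a hypothetical stand-in for the API response (its shape is inferred from `result[1]` in the diff, not confirmed), and `order_by_score` is an illustrative helper, not part of lancedb.

```python
def order_by_score(results):
    """Return original row indices ordered by descending relevance score."""
    indices = [idx for idx, _ in results]
    scores = [score for _, score in results]
    # Same idea as indices[np.argsort(scores)[::-1]], without NumPy.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [indices[i] for i in order]


rows = ["doc-a", "doc-b", "doc-c"]
# (original row index, relevance score), deliberately not sorted by score
results = [(2, 0.91), (0, 0.40), (1, 0.75)]
reordered = [rows[i] for i in order_by_score(results)]
```

If the response is already sorted by score, as noted in the reply, taking the rows by the returned indices alone is enough.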

changhiskhan

This looks good overall, but the score inversion and fill values need to be corrected, I think.

changhiskhan · 2024-01-24T23:38:55Z

python/lancedb/rerankers/base.py

+        fts_results : pa.Table
+            The results from the FTS search
+        """
+        comnined = pa.concat_tables([vector_results, fts_results], promote=True)


What does this do with the _distance and _score columns? Does it just add nulls?

Yeah, if the values are not present it adds nulls.

For example:

fts

|   | text | vector | score | _rowid |
|---|------|--------|-------|--------|
| 0 | I've got a bad feeling about this | [0.016979132, -0.009463778, -0.0041340766, -0.... | 0.0 | 4 |

vector

|   | text | vector | _distance | _rowid |
|---|------|--------|-----------|--------|
| 0 | hello world | [-0.014907048, 0.0013432145, -0.01851529, -0.0... | 0.0 | 10 |

Will result in combined:

|   | text | vector | _distance | _rowid | score |
|---|------|--------|-----------|--------|-------|
| 0 | hello world | [-0.014907048, 0.0013432145, -0.01851529, -0.0... | 0.0 | 10 | NaN |
| 1 | I've got a bad feeling about this | [0.016979132, -0.009463778, -0.0041340766, -0.... | NaN | 4 | 0.0 |
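
The null-filling behaviour described in this reply can be mimicked without pyarrow. The sketch below is a plain-Python stand-in, not the pyarrow call itself: the helper `concat_rows` is hypothetical, tables are modelled as lists of row dicts, and the `vector` column is left out for brevity. It takes the union of the column names and fills whatever a row lacks with `None`.

```python
def concat_rows(*tables):
    """Concatenate lists of row dicts, filling absent columns with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


vector = [{"text": "hello world", "_distance": 0.0, "_rowid": 10}]
fts = [{"text": "I've got a bad feeling about this", "score": 0.0, "_rowid": 4}]
combined = concat_rows(vector, fts)
```

Each combined row then carries both `_distance` and `score`, with `None` on the side that did not produce it, which is the gap a `fill` value is later meant to cover.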

changhiskhan · 2024-01-24T23:42:34Z

python/lancedb/rerankers/linear_combination.py

+            vi = vector_list[i]
+            fj = fts_list[j]
+            # invert the fts score
+            inverted_fts_score = 1 - fj["score"]


Actually, I think this won't work if the user chose to use ranks rather than scores. Inverting ranks should be something like rank_inverted(i) = max(ranks) - rank(i)?

Discussed..
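
The difference between the two inversions raised in this comment can be sketched as follows. Both helpers are hypothetical illustrations of the formulas quoted in the thread, not code from the PR: `1 - score` only makes sense for scores already in [0, 1], while ranks need to be flipped against their maximum.

```python
def invert_scores(scores):
    """Flip scores that are already normalized to [0, 1]."""
    return [1.0 - s for s in scores]


def invert_ranks(ranks):
    """Flip ranks using rank_inverted(i) = max(ranks) - rank(i)."""
    top = max(ranks)
    return [top - r for r in ranks]


inverted_scores = invert_scores([1.0, 0.5, 0.0])
inverted_ranks = invert_ranks([1, 2, 3, 4])
```

Applying `1 - rank` to raw ranks would instead yield zero or negative values, which is the failure mode the comment points at.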

changhiskhan · 2024-01-25T00:33:24Z

python/lancedb/rerankers/linear_combination.py

+        return combined_results
+
+    def merge_results(
+        self, vector_results: pa.Table, fts_results: pa.Table, fill: float


I think the fill values for vector results and FTS results will be different, especially if the scores are not pre-normalized?

As discussed async, the scores are currently always normalized.
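
The reply relies on both score columns living on the same scale, which is what makes a single `fill` value meaningful for either side. Below is a minimal sketch of one common way to get there, min-max normalization to [0, 1]. Whether the PR uses exactly this formula is not stated in the thread, so treat it as an assumption; the helper name `min_max_normalize` and the handling of a constant column are illustrative choices.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2.0, 4.0, 10.0])
```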

AyushExel · 2024-01-25T14:22:01Z

I think I responded to all comments; please verify.
(The docs test is failing at the install step, so it shouldn't be related to this.)

changhiskhan

yolo

AyushExel · 2024-01-30T13:40:25Z

Okay, I'm gonna bite the bullet and merge this. Ran some manual tests on open reranking datasets and the results seem sane. If you guys know of some in-depth re-ranking benchmark scripts, lmk and I'll run those for the blog announcement.

changhiskhan and others added 15 commits December 16, 2023 16:52

initial code for hybrid search

cb64630

runs

ff7ba79

fix fill and normalization and ordering

147ce11

run requests in parallel

304e78f

lint

13728bb

update

4ae6bef

update

c9b0912

initial code for hybrid search

4b8e0ca

runs

d6bd97f

fix fill and normalization and ordering

a101920

run requests in parallel

9b80b47

lint

d3b5aed

update

04f4202

update

888156d

merge

8ebd5cf

changhiskhan reviewed Jan 19, 2024

add linear combination + refact hybridQueryBuilder

4c73c9a

prrao87 approved these changes Jan 19, 2024

AyushExel and others added 6 commits January 20, 2024 01:14

update

f9537a6

update

9e39ed8

Update python/lancedb/rerankers/base.py

fceea50

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update python/lancedb/rerankers/cohere.py

e3b2987

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update python/lancedb/rerankers/base.py

da969c2

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update python/lancedb/rerankers/base.py

8b84d02

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

AyushExel commented Jan 19, 2024

AyushExel added 2 commits January 20, 2024 01:56

update

6d02ba4

Merge branch 'main' into hybrid_search_updates

a4b91fd

changhiskhan reviewed Jan 20, 2024

AyushExel added 2 commits January 20, 2024 21:00

standardize safe_import as a util

8db1cd7

Merge branch 'hybrid_search_updates' of https://github.com/lancedb/lancedb into hybrid_search_updates

366e3da

AyushExel and others added 10 commits January 24, 2024 22:46

Update docs/src/hybrid_search.md

477411e

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

95ec05a

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

744b78d

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

2d6a563

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

a721845

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

6d07b53

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

001e9a2

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update docs/src/hybrid_search.md

bd1617e

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

Update python/lancedb/rerankers/linear_combination.py

e64eda6

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

docs udpate

a16365d

AyushExel requested a review from changhiskhan January 24, 2024 20:43

prrao87 approved these changes Jan 24, 2024

changhiskhan reviewed Jan 25, 2024

AyushExel added 5 commits January 25, 2024 19:12

add doctrings & refactor

d264600

ruff

0cfeaa8

docstrings

d4ccc71

update

1aeab8f

fix tests

4304a71

AyushExel requested a review from changhiskhan January 25, 2024 14:22

AyushExel changed the title ~~[WIP] Hybrid search & Reranker API~~ feat(python): Hybrid search & Reranker API Jan 25, 2024

changhiskhan approved these changes Jan 26, 2024

AyushExel added 2 commits January 28, 2024 01:47

Merge branch 'main' of https://github.com/lancedb/lancedb into hybrid_search_updates

6bc8e17

Merge branch 'main' of https://github.com/lancedb/lancedb into hybrid_search_updates

6d67e4d

AyushExel merged commit 3ffed89 into main Jan 30, 2024
11 checks passed

AyushExel deleted the hybrid_search_updates branch January 30, 2024 13:40
