feat(python): Hybrid search & Reranker API #824
- Please also add tests for the rerankers (including the simple linear combo one).
- Curious to see the real world impact on search results
The options for linear combination, cross encoder and Cohere reranker make sense from a variety standpoint and cover all the bases. I've seen others go down the road of Rank Fusion methods, but we can revisit that based on community feedback (though most people will likely go with Cohere because it's really good quality).
The weight of 0.7 for the linear combination is a sensible default (I presume the 0.7 is for the vector component). Would love to give this a spin on some larger data and see how it fares on search relevance tasks! Performance is a whole other matter and could be improved with more efficient numpy/pyarrow ops down the line, hopefully.
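To make the weighting concrete, here is a minimal sketch of a linear score combination. It assumes both scores are already normalized to [0, 1] with higher meaning more relevant, and that `weight` applies to the vector component, as presumed in the comment above. `linear_combination` is a hypothetical helper for illustration, not the PR's implementation.

```python
def linear_combination(vector_score: float, fts_score: float, weight: float = 0.7) -> float:
    """Blend two relevance scores; `weight` goes to the vector component (assumed)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * vector_score + (1.0 - weight) * fts_score


# Hypothetical, already-normalized (vector_score, fts_score) pairs.
candidates = {
    "doc_a": (0.9, 0.2),
    "doc_b": (0.4, 0.8),
    "doc_c": (0.6, 0.6),
}

# Rank documents by blended score, best first.
ranked = sorted(
    candidates,
    key=lambda doc: linear_combination(*candidates[doc]),
    reverse=True,
)
```

With `weight=0.7` the vector score dominates, so a document that is strong only on the FTS side drops below one that is moderately good on both.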
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
python/lancedb/rerankers/cohere.py
Outdated
scores = np.array([result[1] for result in results])
sorted_indices = np.argsort(scores)[::-1]
# sort the results by the sorted indices
combined_results = combined_results.take(sorted_indices)
i think you need to do combined_results.take(indices[np.argsort(scores)])?
Oh, the results are already sorted in descending order of score by the Cohere API, so sorting is not needed. Removed it.
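The exchange above turns on how reranker output maps back to the original rows. A minimal sketch, assuming the reranker returns `(original_index, score)` pairs: plain lists stand in for the Arrow table and its `take`, and `reorder_by_rerank` is a hypothetical helper, not code from this PR.

```python
def reorder_by_rerank(rows, reranked):
    """Reorder `rows` using (original_index, score) pairs from a reranker.

    Sorting by score first makes the result correct whether or not the
    pairs arrive pre-sorted; the original index (not the position in the
    score list) selects the row.
    """
    ordered = sorted(reranked, key=lambda pair: pair[1], reverse=True)
    return [rows[index] for index, _score in ordered]


rows = ["row0", "row1", "row2"]
# Hypothetical reranker output: (index into `rows`, relevance score).
reranked = [(2, 0.91), (0, 0.40), (1, 0.75)]
reordered = reorder_by_rerank(rows, reranked)
```

If the service already returns pairs in descending score order, the sort is a no-op and only the index lookup matters.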
…ncedb into hybrid_search_updates
this looks good overall but the score inversion and fill values need to be corrected i think
python/lancedb/rerankers/base.py
Outdated
fts_results : pa.Table
    The results from the FTS search
"""
comnined = pa.concat_tables([vector_results, fts_results], promote=True)
what does this do with the _distance and _score columns? does it just add nulls?
Yeah if the values are not present it adds null
For example:
fts
text vector score _rowid
0 I've got a bad feeling about this [0.016979132, -0.009463778, -0.0041340766, -0.... 0.0 4
vector
text vector _distance _rowid
0 hello world [-0.014907048, 0.0013432145, -0.01851529, -0.0... 0.0 10
Will result in combined:
text vector _distance _rowid score
0 hello world [-0.014907048, 0.0013432145, -0.01851529, -0.0... 0.0 10 NaN
1 I've got a bad feeling about this [0.016979132, -0.009463778, -0.0041340766, -0.... NaN 4 0.0
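The null-filling behaviour shown above can be mimicked without Arrow. This is a minimal sketch that uses lists of dicts as stand-ins for the two result tables; `concat_with_nulls` is a hypothetical helper that illustrates only the column union, not how pyarrow implements `concat_tables`.

```python
def concat_with_nulls(*tables):
    """Concatenate row lists, filling columns a row lacks with None."""
    # Union of column names, in first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild every row against the full column list.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


vector_rows = [{"text": "hello world", "_distance": 0.0, "_rowid": 10}]
fts_rows = [{"text": "I've got a bad feeling about this", "score": 0.0, "_rowid": 4}]
combined = concat_with_nulls(vector_rows, fts_rows)
```

Each row keeps its own values and gets `None` in the column that only the other table has, matching the `NaN` cells in the combined output above.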
vi = vector_list[i]
fj = fts_list[j]
# invert the fts score
inverted_fts_score = 1 - fj["score"]
actually i think this won't work if the user chose to use ranks rather than scores. inverting ranks should be something like rank_inverted(i) = max(ranks) - rank(i)?
Discussed..
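For reference, the rank inversion proposed in this thread, `rank_inverted(i) = max(ranks) - rank(i)`, can be sketched as follows. `invert_ranks` is a hypothetical helper; the thread does not record what the PR finally adopted.

```python
def invert_ranks(ranks):
    """Reverse the ordering of ranks: max(ranks) - rank for each entry."""
    if not ranks:
        return []
    top = max(ranks)
    return [top - rank for rank in ranks]


ranks = [1, 2, 3]
inverted = invert_ranks(ranks)

# For comparison, the score-style inversion `1 - value` applied to ranks
# also reverses the order but leaves the [0, 1] range.
naive = [1 - rank for rank in ranks]
```

Unlike `1 - rank`, the `max(ranks) - rank` form keeps every inverted value non-negative.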
return combined_results

def merge_results(
    self, vector_results: pa.Table, fts_results: pa.Table, fill: float
i think the fill values for vector results and fts results will be different, especially if the scores are not pre-normalized?
As discussed async, the scores are currently always normalized
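The concern in this thread is that a single `fill` value is only meaningful once both score columns share a scale. A minimal sketch of that idea, assuming min-max normalization and results keyed by row id; both helpers are hypothetical and are not the PR's `merge_results`.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def fill_missing(vector_scores, fts_scores, fill=1.0):
    """Pair up scores by row id, using `fill` where one side has no hit."""
    row_ids = sorted(set(vector_scores) | set(fts_scores))
    return {
        rid: (vector_scores.get(rid, fill), fts_scores.get(rid, fill))
        for rid in row_ids
    }


# Hypothetical raw results: row ids with distances and FTS scores.
distances = dict(zip([10, 11], min_max_normalize([0.2, 0.6])))
fts = dict(zip([11, 4], min_max_normalize([3.0, 9.0])))
paired = fill_missing(distances, fts, fill=1.0)
```

Without the normalization step, a fill of 1.0 would sit at very different points of the two raw ranges, which is the case the reviewer is asking about.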
I think I responded to all comments.. please verify
yolo
…_search_updates
…_search_updates
okay I'm gonna bite the bullet and merge this. Ran some manual tests on open reranking datasets and the results seem sane. If you guys know some in-depth re-ranking benchmark scripts lmk and I'll run those for the blog announcement.
based on lancedb#713

- The Reranker api can be plugged into vector only or fts only search but this PR doesn't do that (see example - https://txt.cohere.com/rerank/)

### Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```

### Available rerankers

LinearCombinationReranker

```
from lancedb.rerankers import LinearCombinationReranker

# Same as default
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=LinearCombinationReranker()
).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=reranker
).to_pandas()
```

Cohere Reranker

```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
    normalize="rank",  # score or rank
    reranker=CohereReranker()
).to_pandas()
```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
    normalize="rank",
    reranker=CrossEncoderReranker()
).to_pandas()
```

## Using custom Reranker

```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        combined_res = self.merge_results(vector_results, fts_results)  # or use custom combination logic
        # Custom rerank logic here
        return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from cohere reranker in the toy example)
- Support diverse rerankers by default:
  - [x] Cross encoding
  - [x] Cohere
  - [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
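The checklist above mentions Reciprocal Rank Fusion. As a generic illustration of the technique (not the reranker class shipped with this PR), each document is scored as the sum of `1 / (k + rank)` over every ranked list it appears in. The function name and the constant `k=60`, a commonly used default, are assumptions here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum of 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the first item in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical row ids, best first, from each search.
vector_ranking = [10, 4, 7]
fts_ranking = [4, 9, 10]
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
```

Because only ranks are used, this approach needs no score normalization or fill value, at the cost of ignoring how far apart the underlying scores are.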
* hash join interface
* build hash joiner
* use hash key
* collect from hash joiner
* take rows
* handle null
* use fragment
* merge on fragment level
* change interface
* cargo fmat
* draft remainder of implementation
* get rust working
* Add Python bindings
* cleanup
* Add more comments
* add list to supported nulls

---------

Co-authored-by: Will Jones <willjones127@gmail.com>