feat(python): Hybrid search & Reranker API #824

Merged: AyushExel merged 57 commits into main from hybrid_search_updates on Jan 30, 2024

@AyushExel AyushExel commented Jan 18, 2024

based on #713

Default reranker -- LinearCombinationReranker(weight=0.7, fill=1.0)

table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()

Available rerankers

LinearCombinationReranker

from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
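
To make the `weight` and `fill` parameters concrete, here is a minimal sketch of a linear score combination. It is not the code in this PR: the function name `linear_combination` is hypothetical, and it assumes that both inputs are already normalized to [0, 1], that `weight` applies to the vector component, that the FTS score is inverted so it behaves like a distance, and that `fill` stands in for a value missing from one of the two result sets.

```python
from typing import Optional


def linear_combination(
    vector_distance: Optional[float],
    fts_score: Optional[float],
    weight: float = 0.7,
    fill: float = 1.0,
) -> float:
    """Combine one row's normalized vector distance and FTS score.

    Assumptions (not taken from the PR's code): both inputs lie in [0, 1],
    a lower distance is better, a higher FTS score is better, and `weight`
    is applied to the vector component.
    """
    # A row missing from one result set gets the fill value (worst distance).
    vec = fill if vector_distance is None else vector_distance
    # Invert the FTS score so that it behaves like a distance.
    fts = fill if fts_score is None else 1.0 - fts_score
    combined_distance = weight * vec + (1.0 - weight) * fts
    # Flip back so that a larger value means a more relevant row.
    return 1.0 - combined_distance


# Row found by both searches vs. row found only by vector search.
both = linear_combination(vector_distance=0.2, fts_score=0.9)
vector_only = linear_combination(vector_distance=0.2, fts_score=None)
```

Under these assumptions, a row that appears in only one result set is penalized through `fill`, so `both` ranks above `vector_only`.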

Cohere Reranker

from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

CrossEncoderReranker

from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

Using custom Reranker

from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or substitute custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here

        return combined_res

  • Expand testing
  • Make sure usage makes sense
  • Run simple benchmarks for correctness (Seeing weird result from cohere reranker in the toy example)
  • Support diverse rerankers by default:
      • Cross encoding
      • Cohere
      • Reciprocal Rank Fusion
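
Reciprocal Rank Fusion, the last item in the list above, can be sketched independently of LanceDB. The following is a minimal illustration, not the implementation in this PR: the function name `reciprocal_rank_fusion` and the row-id inputs are hypothetical, and the constant `k=60` is an assumed default that is commonly used with this method.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Fuse several ranked lists of row ids into one ranking.

    Each row scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by row id for determinism.
    return sorted(scores, key=lambda row_id: (-scores[row_id], row_id))


vector_ranking = [10, 4, 7]  # hypothetical row ids, best match first
fts_ranking = [4, 9, 10]
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
```

Because only ranks are used, this approach needs no score normalization, which is one reason it is often paired with `normalize="rank"`-style pipelines.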

@changhiskhan changhiskhan left a comment

  • Please also add tests for the rerankers (including the simple linear combo one).
  • Curious to see the real world impact on search results

@prrao87 prrao87 left a comment

The options for linear combination, cross encoder and Cohere reranker make sense from a variety standpoint and cover all the bases. I've seen others go down the road of Rank Fusion methods, but we can revisit that based on community feedback (though most people will likely go with Cohere because its quality is really good).

The weight of 0.7 for the linear combination is a sensible default (I presume the 0.7 is for the vector component). Would love to give this a spin on some larger data and see how it fares on search relevance tasks! Performance is a whole other matter and could be improved with more efficient numpy/pyarrow ops down the line, hopefully.

AyushExel and others added 6 commits January 20, 2024 01:14
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
scores = np.array([result[1] for result in results])
sorted_indices = np.argsort(scores)[::-1]
# sort the results by the sorted indices
combined_results = combined_results.take(sorted_indices)
Contributor

I think you need to do combined_results.take(indices[np.argsort(scores)])?

Contributor Author

Oh, the results are already sorted in descending order of score by the Cohere API, so sorting is not needed. Removed it.
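
The exchange above is about ordering rows by a reranker's relevance scores. As an illustration only (plain Python rather than the NumPy/PyArrow calls in the diff, with hypothetical data and a hypothetical helper name `order_by_score`), sorting indices by descending score and then gathering rows in that order looks like this:

```python
from typing import List, Sequence


def order_by_score(rows: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Return rows sorted by descending score (an argsort, then a take)."""
    # Indices of `scores` from the highest score to the lowest.
    sorted_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Gather the rows in that order.
    return [rows[i] for i in sorted_indices]


rows = ["doc a", "doc b", "doc c"]  # hypothetical documents
reordered = order_by_score(rows, [0.1, 0.9, 0.5])
already_sorted = order_by_score(rows, [0.9, 0.5, 0.1])
```

When the scores already arrive in descending order, as reported for the Cohere API, the step leaves the rows unchanged, which is why it could be dropped.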

AyushExel and others added 10 commits January 24, 2024 22:46
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>

@changhiskhan changhiskhan left a comment

This looks good overall, but the score inversion and fill values need to be corrected, I think.

fts_results : pa.Table
The results from the FTS search
"""
combined = pa.concat_tables([vector_results, fts_results], promote=True)
Contributor

what does this do with the _distance and _score columns? does it just add nulls?

Contributor Author

Yeah, if the values are not present it adds nulls.

Contributor Author

For example:
fts

                                text                                             vector  score  _rowid
0  I've got a bad feeling about this  [0.016979132, -0.009463778, -0.0041340766, -0....    0.0       4

vector

          text                                             vector  _distance  _rowid
0  hello world  [-0.014907048, 0.0013432145, -0.01851529, -0.0...        0.0      10

Will result in combined:

                                text                                             vector  _distance  _rowid  score
0                        hello world  [-0.014907048, 0.0013432145, -0.01851529, -0.0...        0.0      10    NaN
1  I've got a bad feeling about this  [0.016979132, -0.009463778, -0.0041340766, -0....        NaN       4    0.0
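
The null-filling behaviour shown above can be mimicked without PyArrow. The sketch below is an illustration only: `concat_rows` is a hypothetical helper operating on lists of dicts, not a LanceDB or PyArrow function, the `vector` column is omitted for brevity, and `None` is used where the combined table above shows `NaN`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def concat_rows(first: List[Row], second: List[Row]) -> List[Row]:
    """Concatenate two row lists, filling columns a row lacks with None."""
    # Union of column names, keeping first-seen order.
    columns: List[str] = []
    for row in first + second:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{name: row.get(name) for name in columns} for row in first + second]


vector_rows = [{"text": "hello world", "_distance": 0.0, "_rowid": 10}]
fts_rows = [{"text": "I've got a bad feeling about this", "score": 0.0, "_rowid": 4}]
combined = concat_rows(vector_rows, fts_rows)
```

Each combined row carries both `_distance` and `score`, with the one that its source search did not produce left empty, so a reranker has to decide how to fill it.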

vi = vector_list[i]
fj = fts_list[j]
# invert the fts score
inverted_fts_score = 1 - fj["score"]
Contributor

actually i think this won't work if the user chose to use ranks rather than scores. inverting ranks should be something like rank_inverted(i) = max(ranks) - rank(i) ?

Contributor Author

Discussed..
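
The reviewer's suggestion, rank_inverted(i) = max(ranks) - rank(i), can be written out as a small sketch. The function name `invert_ranks` is hypothetical, and the thread does not say which approach the PR finally adopted.

```python
from typing import List


def invert_ranks(ranks: List[int]) -> List[int]:
    """Flip ranks so that the best-ranked item gets the largest value."""
    if not ranks:
        return []
    max_rank = max(ranks)
    return [max_rank - rank for rank in ranks]


inverted = invert_ranks([1, 2, 3, 4])  # rank 1 is the best match
```

Unlike `1 - score`, this stays non-negative for ranks greater than 1, which is the problem the comment points at.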

return combined_results

def merge_results(
self, vector_results: pa.Table, fts_results: pa.Table, fill: float
Contributor

i think the fill values for vector results and fts results will be different, especially if the scores are not pre-normalized?

Contributor Author

As discussed async, the scores are currently always normalized.
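
Why a single `fill` value can serve both result sets once scores are normalized is easier to see with an example. The sketch below assumes min-max normalization to [0, 1]; the thread does not state which normalization the PR uses, and `min_max_normalize` is a hypothetical name.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1]; constant input maps to all zeros."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


vector_distances = min_max_normalize([0.5, 1.5, 2.5])  # hypothetical distances
fts_scores = min_max_normalize([2.0, 12.0, 7.0])  # hypothetical FTS scores
```

After this step both columns share the range [0, 1], so 1.0 is the largest possible value in either, regardless of the original scales.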

AyushExel commented Jan 25, 2024

I think I responded to all comments.. please verify
(The docs test is failing at the install step, so it shouldn't be related to this.)

@AyushExel AyushExel changed the title [WIP] Hybrid search & Reranker API feat(python): Hybrid search & Reranker API Jan 25, 2024

@changhiskhan changhiskhan left a comment

yolo

@AyushExel

Okay, I'm gonna bite the bullet and merge this. Ran some manual tests on open reranking datasets and the results seem sane. If you guys know some in-depth re-ranking benchmark scripts, lmk and I'll run those for the blog announcement.

@AyushExel AyushExel merged commit 3ffed89 into main Jan 30, 2024
11 checks passed
@AyushExel AyushExel deleted the hybrid_search_updates branch January 30, 2024 13:40
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this pull request Apr 5, 2024
based on lancedb#713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)


### Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
```

Cohere Reranker
```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

```

## Using custom Reranker
```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or substitute custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here

        return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
  - [x] Cross encoding
  - [x] Cohere
  - [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
westonpace pushed a commit that referenced this pull request Apr 5, 2024
alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request Apr 20, 2024
* hash join interface

* build hash joiner

* use hash key

* collect from hash joiner

* take rows

* handle null

* use fragment

* merge on fragment level

* change interface

* cargo fmat

* draft remainder of implementation

* get rust working

* Add Python bindings

* cleanup

* Add more comments

* add list to supported nulls

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
3 participants