
Add Cross-Encoder support to Similarity pipeline #372

Closed
nickchomey opened this issue Oct 19, 2022 · 7 comments
nickchomey commented Oct 19, 2022

It seems that cross-encoders are the preferred models for re-ranking search results generated by another means (BM25, vector search, etc.). However, if I provide one of these models as the path, all the results just have scores of 0.5.

Sentence Transformers recommends doing this. https://www.sbert.net/examples/applications/retrieve_rerank/README.html

In particular, their msmarco-minilm models seem ideal as a default (maybe the L-6 version?) https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

Haystack's implementation uses this in its Ranker node. https://haystack.deepset.ai/pipeline_nodes/ranker


nickchomey commented Oct 21, 2022

This one might be the most promising as a default.
https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

The differences/improvements appear to be:

  • mMiniLM v2, a multilingual version of MiniLM
  • MMarco, a multilingual version of MSMarco.

Here's a paper that appears to be the basis of MMarco. It shows very competitive performance for these multilingual models vs. monolingual ones, with mMiniLM performing comparably to the far larger mT5. https://arxiv.org/pdf/2108.13897.pdf


nickchomey commented Oct 21, 2022

I'm not savvy enough to submit a PR for this, but here's how Haystack implements it: https://github.com/deepset-ai/haystack/tree/main/haystack/nodes/ranker

And sentence-transformers provided simple code on the HF model page:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# 'model_name' is a placeholder for the cross-encoder model path
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# Tokenize (query, passage) pairs: the same query against two candidate passages
features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
     'New York City is famous for the Metropolitan Museum of Art.'],
    padding=True, truncation=True, return_tensors="pt")

# Score each pair; a higher logit means higher relevance
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

@davidmezzetti (Member)

The similarity pipeline also uses sequence classification. I'll take a look at what is going wrong. The most likely reason is that the model outputs aren't being interpreted correctly.

@nickchomey (Author)

Yeah, if I'm not mistaken, cross-encoders just output a single similarity score rather than a vector. It probably just needs a bit of logic to interpret that output.
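A minimal sketch of that interpretation step, assuming a cross-encoder with a single-logit classification head (the logit values below are made up for illustration):

```python
import torch

# Hypothetical raw logits from a single-label cross-encoder head:
# one unbounded score per (query, passage) pair
logits = torch.tensor([[8.2], [-4.1]])

# A single-logit head needs a sigmoid to map scores into [0, 1];
# a softmax over labels (as for multi-class models) would not apply here
scores = torch.sigmoid(logits).squeeze(1)

print(scores)  # first pair scores near 1, second near 0
```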

@davidmezzetti davidmezzetti self-assigned this Oct 25, 2022
@davidmezzetti davidmezzetti added this to the v5.2.0 milestone Oct 25, 2022
@davidmezzetti (Member)

Added a new pipeline type for cross-encoders. The similarity pipeline can now use the cross-encoder pipeline when crossencode=True.

davidmezzetti added a commit that referenced this issue Oct 25, 2022
@nickchomey (Author)


As it turns out, I'm not getting great results with a few cross-encoders compared to valhalla/distilbart-mnli-12-3, which was used in Example 4... Perhaps I need to adjust the queries etc...

But it was probably worth adding this anyway!

@davidmezzetti (Member)


It seems that these models work best with questions paired against longer passages containing the answer (asymmetric similarity). For symmetric similarity, mnli models appear to work better. Definitely worth adding.
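The difference also shows up in how the two model families' outputs are read. An mnli-style model emits three logits, conventionally ordered (contradiction, neutral, entailment), and the entailment probability after a softmax serves as the similarity score, unlike the single sigmoid logit of a cross-encoder. A sketch with made-up logit values:

```python
import torch

# Hypothetical logits from an mnli-style model for one text pair,
# ordered (contradiction, neutral, entailment) - label order can
# vary by model, so check the model config in practice
logits = torch.tensor([[-2.5, 0.1, 3.4]])

# The entailment class probability after softmax is the similarity score
probs = torch.softmax(logits, dim=1)
score = probs[0, 2].item()

print(score)  # entailment dominates, so the score is close to 1
```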
