
Add Cross-Encoder support to Similarity pipeline #372

Closed
nickchomey opened this issue Oct 19, 2022 · 7 comments
nickchomey commented Oct 19, 2022

It seems that cross-encoders are the preferred models for re-ranking search results generated by another means (BM25, vector search, etc.). However, if I provide one of these models as the path, all the results just have scores of 0.5.

Sentence Transformers recommends doing this. https://www.sbert.net/examples/applications/retrieve_rerank/README.html

In particular, their msmarco-minilm models seem ideal as a default (maybe the L-6 version?) https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

Haystack's implementation uses this in its Ranker node. https://haystack.deepset.ai/pipeline_nodes/ranker


nickchomey commented Oct 21, 2022

This one might be the most promising as a default.
https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

The differences/improvements appear to be:

  • mMiniLM v2, a multilingual version of MiniLM
  • MMarco, a multilingual version of MSMarco.

Here's a paper that appears to be the basis of MMarco. It shows very competitive performance for these multilingual models vs. monolingual ones, with mMiniLM performing comparably to the far larger mT5. https://arxiv.org/pdf/2108.13897.pdf


nickchomey commented Oct 21, 2022

I'm not savvy enough to submit a PR for this, but here's how Haystack implements it: https://github.com/deepset-ai/haystack/tree/main/haystack/nodes/ranker

And sentence-transformers provided simple code on the HF model page:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# 'model_name' is a placeholder for the cross-encoder model path
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# Tokenize (query, passage) pairs: the same query against two candidate passages
features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
     'New York City is famous for the Metropolitan Museum of Art.'],
    padding=True, truncation=True, return_tensors="pt")

# Score each pair; a higher logit means higher relevance
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

@davidmezzetti (Member)

The similarity pipeline also uses sequence classification. I'll take a look at what is going wrong. The most likely reason is that the model outputs aren't being interpreted correctly.

@nickchomey (Author)

Yeah, if I'm not mistaken, cross-encoders just output a single similarity score rather than a vector. It probably just needs a bit of logic to interpret that output.
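A minimal sketch of that interpretation step, assuming a cross-encoder with a single-logit classification head (the logit values below are made up for illustration):

```python
import torch

# Hypothetical raw logits from a single-label cross-encoder head:
# one unbounded score per (query, passage) pair
logits = torch.tensor([[8.2], [-4.1]])

# A single-logit head needs a sigmoid to map scores into [0, 1];
# a softmax over labels (as for multi-class models) would not apply here
scores = torch.sigmoid(logits).squeeze(1)

print(scores)  # first pair scores near 1, second near 0
```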

@davidmezzetti davidmezzetti self-assigned this Oct 25, 2022
@davidmezzetti davidmezzetti added this to the v5.2.0 milestone Oct 25, 2022
@davidmezzetti (Member)

Added a new pipeline type for cross-encoders. The similarity pipeline can now use the cross-encoder pipeline when crossencode=True.

davidmezzetti added a commit that referenced this issue Oct 25, 2022
@nickchomey (Author)


As it turns out, I'm not getting great results with a few cross-encoders compared to valhalla/distilbart-mnli-12-3, which was used in Example 4... Perhaps I need to adjust the queries etc...

But it was probably worth adding this anyway!

@davidmezzetti (Member)


It seems that these models work best with questions paired against longer passages containing the answer (asymmetric similarity). For symmetric similarity, mnli models appear to work better. Definitely worth adding.
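The difference also shows up in how the two model families' outputs are read. An mnli-style model emits three logits, conventionally ordered (contradiction, neutral, entailment), and the entailment probability after a softmax serves as the similarity score, unlike the single sigmoid logit of a cross-encoder. A sketch with made-up logit values:

```python
import torch

# Hypothetical logits from an mnli-style model for one text pair,
# ordered (contradiction, neutral, entailment) - label order can
# vary by model, so check the model config in practice
logits = torch.tensor([[-2.5, 0.1, 3.4]])

# The entailment class probability after softmax is the similarity score
probs = torch.softmax(logits, dim=1)
score = probs[0, 2].item()

print(score)  # entailment dominates, so the score is close to 1
```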
