## Pretrained Models

### Model Overview

There're a lot of pretrained models can be found suitable for several tasks

In [2]:
from sentence_transformers import SentenceTransformer, util

### Semantic Search
Given a question/search query, the model can find relevant text passages.

#### Usage

In [10]:
semantic_search_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# query_embedding = semantic_search_model.encode("what's the best way to live?")
# passage_embedding = semantic_search_model.encode(["You should live in the moment. Don't dwell on the past or worry about the future. Live for today!",
#                                   "Everybody dies, but not everybody lives"])

query_embedding = semantic_search_model.encode("Get min and max string of list")
passage_embedding = semantic_search_model.encode(["java.util.Optional.get",
                                                  "java.util.stream.Collectors.maxBy",
                                                  "java.awt.Rectangle.intersects",
                                                  "java.lang.String.split",
                                                  "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[0.2438, 0.3479, 0.0650, 0.2787, 1.0000]])


<i><u>note:</u></i> The first 2 should be the most relevants

#### Multi-QA Models

There are several models have been trained on 215M QA pairs from various sources and domains.

These model perform well on many search tasks and domains.(e.g., semantic search)
- Some were tuned to be used with dot-product
- Some produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance

##### Model tuned to be used with dot product

In [25]:
mul_qa_dot_model = SentenceTransformer('multi-qa-MiniLM-L6-dot-v1')
# query_embedding = mul_qa_dot_model.encode("Get min and max string of list")
# passage_embedding = mul_qa_dot_model.encode(["java.util.Optional.get",
#                                             "java.util.stream.Collectors.maxBy",
#                                             "java.awt.Rectangle.intersects",
#                                             "java.lang.String.split",
#                                             "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

query_embedding = mul_qa_dot_model.encode("How many people live in London?")
passage_embedding = mul_qa_dot_model.encode(["Around 9 million people live in London",
                                             "London is known for its financial district",
                                             "How many people live in London?"])

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[44.9437, 36.3805, 43.5505]])


===> The answer for the query get the score better than the query itself. ===> Model was trained for q&a purpose

<-----=-----> TODO Why the dot_score's bigger than 1?

In [26]:
mul_qa_dot_model = SentenceTransformer('multi-qa-MiniLM-L6-dot-v1')
query_embedding = mul_qa_dot_model.encode("Get min and max string of list")
passage_embedding = mul_qa_dot_model.encode(["java.util.Optional.get",
                                            "java.util.stream.Collectors.maxBy",
                                            "java.awt.Rectangle.intersects",
                                            "java.lang.String.split",
                                            "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[32.6886, 36.7602, 29.2752, 32.7594, 49.1859]])


<i><u>note:</u></i> The first 2 should be the most relevants

##### Model normalized vectors of length 1

In [23]:
mul_qa_cos_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

query_embedding = mul_qa_cos_model.encode("How many people live in London?")
passage_embedding = mul_qa_cos_model.encode(["Around 9 million people live in London.",
                                                "London is known for its financial district",
                                                "How many people live in London?"])
print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[0.8800, 0.4522, 1.0000]])


===> dot_score of the query itself is always 1 ===> The model was trained for semantic comparing purpose

In [27]:
mul_qa_cos_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

query_embedding = mul_qa_dot_model.encode("Get min and max string of list")
passage_embedding = mul_qa_dot_model.encode(["java.util.Optional.get",
                                            "java.util.stream.Collectors.maxBy",
                                            "java.awt.Rectangle.intersects",
                                            "java.lang.String.split",
                                            "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[32.6886, 36.7602, 29.2752, 32.7594, 49.1859]])


<i><u>note:</u></i> The first 2 should be the most relevants

#### MSMARCO Passage Models

The MSMARCO Passage Ranking Dataset contains 500k real queries. Models also perform well on other domains.
- Some were tuned to be used with dot-product
- Some produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance

##### Model tuned to be use with dot product

In [29]:
msmarco_dot_model = SentenceTransformer('msmarco-distilbert-dot-v5')

query_embedding = msmarco_dot_model.encode("How many people live in London?")
passage_embedding = msmarco_dot_model.encode(["Around 9 million people live in London.",
                                                "London is known for its financial district",
                                                "How many people live in London?"])
print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[87.0118, 71.3875, 86.9643]])


===> The answer for the query get the score better than the query itself. ===> Model was trained for q&a purpose

In [30]:
msmarco_dot_model = SentenceTransformer('msmarco-distilbert-dot-v5')

query_embedding = msmarco_dot_model.encode("Get min and max string of list")
passage_embedding = msmarco_dot_model.encode(["java.util.Optional.get",
                                            "java.util.stream.Collectors.maxBy",
                                            "java.awt.Rectangle.intersects",
                                            "java.lang.String.split",
                                            "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[64.9078, 69.4863, 61.9641, 70.2112, 93.0185]])


<i><u>note:</u></i> The first 2 should be the most relevants

##### Model normalized vectors of length 1

In [32]:
msmarco_cos_model = SentenceTransformer('msmarco-distilbert-cos-v5')

query_embedding = msmarco_cos_model.encode("How many people live in London?")
passage_embedding = msmarco_cos_model.encode(["Around 9 million people live in London.",
                                                "London is known for its financial district",
                                                "How many people live in London?"])

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[0.9593, 0.3506, 1.0000]])


===> dot_score of the query itself is always 1 ===> The model was trained for semantic comparing purpose

In [33]:
msmarco_cos_model = SentenceTransformer('msmarco-distilbert-cos-v5')

query_embedding = msmarco_cos_model.encode("Get min and max string of list")
passage_embedding = msmarco_cos_model.encode(["java.util.Optional.get",
                                            "java.util.stream.Collectors.maxBy",
                                            "java.awt.Rectangle.intersects",
                                            "java.lang.String.split",
                                            "Get min and max string of list"]) # the model does not have the knowledge of the query, need to be fine-tuned

print("Similarity score:", util.dot_score(query_embedding, passage_embedding))

Similarity score: tensor([[0.2284, 0.3320, 0.0240, 0.3599, 1.0000]])


<i><u>note:</u></i> The first 2 should be the most relevants

#### ===> By normalized vectors of length 1, cos model always see the query itself as the most relevant, while dot model see the answer for the query as the most relevant

#### ===> All the experiment models are still not good enough for downstream task (e.g., API recommendation) (Multi-QA models are a little bit better)

### Multi-Lingual Models

### Image & Text-Models

### Other Models