## Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and paraphrases.
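To make the contrast concrete, here is a minimal sketch of purely lexical matching. The naive tokenizer, the Jaccard overlap measure, and the second document are illustrative assumptions, not something the library uses. A paraphrase that shares no words with the query scores zero, which is the gap that embedding-based search aims to close.

```python
import re

def tokens(text):
    """Lowercase and split into a set of word tokens (a deliberately naive tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Lexical overlap: size of the shared token set divided by the size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = "How to learn Python online?"
docs = [
    "How to learn Python on the web?",
    # Hypothetical paraphrase with no words in common with the query.
    "Tutorials for studying programming over the internet",
]
for doc in docs:
    print(f"{jaccard(query, doc):.2f}  {doc}")
```

The first document shares most of its words with the query and gets a non-zero score; the second is related in meaning but invisible to a word-overlap measure.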

### Symmetric vs. Asymmetric Semantic Search
A critical distinction for your setup is symmetric vs. asymmetric semantic search:

- For symmetric semantic search your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.

- For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.



In [3]:
!pip install -U sentence-transformers

In [4]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer("all-MiniLM-L6-v2")

In [5]:
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

In [6]:
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

In [7]:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]

In [8]:
top_k = min(5, len(corpus))

In [14]:
for query in queries:
  query_embedding = embedder.encode(query, convert_to_tensor=True)

  cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  print(cos_scores)
  top_results = torch.topk(cos_scores, k=top_k)
  print(top_results)

  print("======================================")
  print("Query: ", query)
  for score, idx in zip(top_results[0], top_results[1]):
    print(corpus[idx], "(Score: {:.4f})".format(score))
  print("======================================")


tensor([ 0.7035,  0.5272, -0.0898,  0.1889,  0.0336, -0.0594,  0.1047,  0.0819,
         0.0980])
torch.return_types.topk(
values=tensor([0.7035, 0.5272, 0.1889, 0.1047, 0.0980]),
indices=tensor([0, 1, 3, 6, 8]))
Query:  A man is eating pasta.
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)
tensor([ 0.0630,  0.0216,  0.0247,  0.1389,  0.2564, -0.0895,  0.1191,  0.6433,
         0.1080])
torch.return_types.topk(
values=tensor([0.6433, 0.2564, 0.1389, 0.1191, 0.1080]),
indices=tensor([7, 4, 3, 6, 8]))
Query:  Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)
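
The loop above relies on two operations: cosine similarity between the query vector and every corpus vector, and selection of the `top_k` highest scores. The following is a minimal pure-Python sketch of both, using made-up 3-dimensional vectors in place of real embeddings. The vectors and the helper names `cos_sim` and `top_k_scores` are assumptions for illustration; in the notebook, `util.cos_sim` and `torch.topk` do this work on tensors.

```python
import heapq
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_scores(scores, k):
    """Return (score, index) pairs for the k largest scores, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))

# Toy stand-ins for corpus and query embeddings (not real model output).
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [2.0, 0.0, 0.0]

scores = [cos_sim(toy_query, vec) for vec in toy_corpus]
for score, idx in top_k_scores(scores, k=2):
    print(idx, f"{score:.4f}")
```

Because cosine similarity divides by both vector lengths, the query `[2.0, 0.0, 0.0]` scores the same against the first corpus vector as `[1.0, 0.0, 0.0]` would: only direction matters, not magnitude.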

### sentence_transformers.util.semantic_search
(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>) → List[List[Dict[str, Union[int, float]]]]

This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for information retrieval and semantic search on corpora of up to about 1 million entries.

Parameters
query_embeddings – A 2 dimensional tensor with the query embeddings.

corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.

query_chunk_size – Number of queries processed simultaneously (default: 100). Increasing that value increases the speed, but requires more memory.

corpus_chunk_size – Number of corpus entries scanned at a time (default: 500,000, per the signature above). Increasing that value increases the speed, but requires more memory.

top_k – Retrieve top k matching entries.

score_function – Function for computing scores. By default, cosine similarity.

Returns
Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys ‘corpus_id’ and ‘score’, sorted by decreasing cosine similarity scores.
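
To illustrate what the chunk size and `top_k` parameters control, here is a simplified pure-Python stand-in. It is a sketch under stated assumptions, not the library's implementation: `chunked_search` is a hypothetical name, the vectors are toy data, query chunking is omitted for brevity, and the tiny `corpus_chunk_size` is chosen only so that the example scans more than one chunk. It returns the same shape as described above: one list per query, each holding dictionaries with `corpus_id` and `score`, sorted by decreasing score.

```python
import heapq
import math

def cos_sim(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def chunked_search(query_vecs, corpus_vecs, top_k=10, corpus_chunk_size=2):
    """Scan the corpus chunk by chunk, keeping only the top_k best hits per query."""
    results = [[] for _ in query_vecs]
    for start in range(0, len(corpus_vecs), corpus_chunk_size):
        chunk = corpus_vecs[start:start + corpus_chunk_size]
        for q_idx, query in enumerate(query_vecs):
            for offset, vec in enumerate(chunk):
                results[q_idx].append(
                    {"corpus_id": start + offset, "score": cos_sim(query, vec)}
                )
            # Trim after every chunk so each query never holds more than top_k hits
            # (plus one chunk) at a time; nlargest also returns them best first.
            results[q_idx] = heapq.nlargest(
                top_k, results[q_idx], key=lambda hit: hit["score"]
            )
    return results

toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_queries = [[1.0, 0.0], [0.0, 2.0]]

hits = chunked_search(toy_queries, toy_corpus, top_k=2)
for query_hits in hits:
    print([(h["corpus_id"], round(h["score"], 4)) for h in query_hits])
```

Larger chunks mean fewer passes but a larger score matrix in memory at once, which is the trade-off the parameter descriptions refer to.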

In [28]:
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus with example sentences
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]

# Encode all queries at once; the result is a 2-dimensional tensor (one row per query).
query_embeddings = embedder.encode(queries, convert_to_tensor=True)

search_results = util.semantic_search(query_embeddings=query_embeddings, corpus_embeddings=corpus_embeddings, top_k=3)

In [29]:
print(search_results)

[[{'corpus_id': 0, 'score': 0.7035486698150635}, {'corpus_id': 1, 'score': 0.5271987915039062}, {'corpus_id': 3, 'score': 0.18889561295509338}], [{'corpus_id': 7, 'score': 0.6432533264160156}, {'corpus_id': 4, 'score': 0.2564156651496887}, {'corpus_id': 3, 'score': 0.13887260854244232}], [{'corpus_id': 8, 'score': 0.8253214359283447}, {'corpus_id': 0, 'score': 0.13989517092704773}, {'corpus_id': 7, 'score': 0.12919361889362335}]]


In [30]:
len(search_results)

3

In [31]:
search_results[0]

[{'corpus_id': 0, 'score': 0.7035486698150635},
 {'corpus_id': 1, 'score': 0.5271987915039062},
 {'corpus_id': 3, 'score': 0.18889561295509338}]

In [33]:
# search_results has one list of hits per query, in the same order as queries.
for query, hits in zip(queries, search_results):
  print("Query: ", query)
  for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
  print()

Query:  A man is eating pasta.
A man is eating food. 0.7035486698150635
A man is eating a piece of bread. 0.5271987915039062
A man is riding a horse. 0.18889561295509338

Query:  Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. 0.6432533264160156
A woman is playing violin. 0.2564156651496887
A man is riding a horse. 0.13887260854244232

Query:  A cheetah chases prey on across a field.
A cheetah is running behind its prey. 0.8253214359283447
A man is eating food. 0.13989517092704773
A monkey is playing drums. 0.12919361889362335
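
The lower-ranked hits above are largely unrelated to their queries, as the scores near 0.13 suggest. One possible follow-up, sketched here as an assumption and not something `semantic_search` does for you, is to drop hits below a score threshold. The helper name `filter_hits` is hypothetical, the cutoff of 0.5 is an arbitrary illustrative value, and the scores are copied (rounded) from the output above so the block runs without a model.

```python
# Results copied from the notebook output above, rounded to four decimals.
search_results = [
    [{"corpus_id": 0, "score": 0.7035}, {"corpus_id": 1, "score": 0.5272}, {"corpus_id": 3, "score": 0.1889}],
    [{"corpus_id": 7, "score": 0.6433}, {"corpus_id": 4, "score": 0.2564}, {"corpus_id": 3, "score": 0.1389}],
    [{"corpus_id": 8, "score": 0.8253}, {"corpus_id": 0, "score": 0.1399}, {"corpus_id": 7, "score": 0.1292}],
]

def filter_hits(results, min_score):
    """Keep only hits whose score is at least min_score, preserving order per query."""
    return [[hit for hit in hits if hit["score"] >= min_score] for hits in results]

filtered = filter_hits(search_results, min_score=0.5)
for hits in filtered:
    print([hit["corpus_id"] for hit in hits])
```

A suitable threshold depends on the model and the data, so it is worth checking against a few known-good and known-bad query/result pairs before relying on one.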

