Semantic search aims to improve search accuracy by capturing the meaning of the search query rather than its exact wording. Traditional search engines retrieve documents only through lexical (keyword) matches, whereas semantic search can also match synonyms and paraphrases.
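To make the contrast concrete, here is a minimal sketch of purely lexical matching that uses only the standard library. The helpers `tokenize` and `lexical_overlap` are hypothetical names introduced for illustration, and the document string is a shortened version of the first corpus sentence used below. Because the score only counts shared word tokens, a query that says "litigant" gets no credit against a document that says "appellant".

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_overlap(query, document):
    """Fraction of query tokens that literally appear in the document."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(document)) / len(query_tokens)

# Shortened form of the first corpus sentence (illustrative only).
document = "The appellant having been convicted under Section 80 has filed the present appeal."

query_same_words = "appellant convicted under Section 80"
query_synonym = "litigant found guilty"

# Every token of the first query occurs in the document; none of the second does.
print(lexical_overlap(query_same_words, document))
print(lexical_overlap(query_synonym, document))
```

An embedding model, by contrast, maps both phrasings to nearby vectors, which is what the cells below rely on.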

In [1]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = [
    'The appellant having been convicted under Section 80 of the Karnataka Police Act, 1963 (for short, ‘the 1963 Act’) has filed the present appeal.',
    'Notice in the appeal was issued on 27.02.2023 limited to the extent of consideration as to whether the appellant can be granted benefit of probation.',
    'The brief facts of the case are that FIR dated 16.8.2007 was registered against 24 accused persons including the appellant under sections 79 and 80 of the 1963 Act as they were found to be indulging in gambling.',
    'The charge sheet was filed and the Trial Court vide order dated 21.8.2007 convicted them under Section 79 & 80 of the 1963 Act and sentenced them to undergo imprisonment for a period of one yeareach under both the provisions along with a fine of ₹ 600/- after the accused had pleaded guilty.',
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    'The litigant having been convicted under Section 70',
    'Considartions are bein made for probation for the appellant',
    'For indulging in gambling the accused were registered under the Act 79 and 80',
    'The guilty were sentenced for a period of one year and and a fine was set for ₹ 600/-',
]

# Find the top_k closest corpus sentences (at most 5; here the corpus has only 4) for each query, ranked by cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # Use cosine similarity and torch.topk to pick the top_k highest scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))


Query: The litigant having been convicted under Section 70

Top 5 most similar sentences in corpus:
The brief facts of the case are that FIR dated 16.8.2007 was registered against 24 accused persons including the appellant under sections 79 and 80 of the 1963 Act as they were found to be indulging in gambling. (Score: 0.5325)
The charge sheet was filed and the Trial Court vide order dated 21.8.2007 convicted them under Section 79 & 80 of the 1963 Act and sentenced them to undergo imprisonment for a period of one yeareach under both the provisions along with a fine of ₹ 600/- after the accused had pleaded guilty. (Score: 0.5233)
The appellant having been convicted under Section 80 of the Karnataka Police Act, 1963 (for short, ‘the 1963 Act’) has filed the present appeal. (Score: 0.4988)
Notice in the appeal was issued on 27.02.2023 limited to the extent of consideration as to whether the appellant can be granted benefit of probation. (Score: 0.3492)




Query: Considartions are bein

In [2]:

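# Scale the embeddings to unit length so that the dot product equals cosine similarity
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
query_embeddings = embedder.encode(queries, convert_to_tensor=True)
query_embeddings = util.normalize_embeddings(query_embeddings)

# Retrieve the best-matching corpus sentences for every query in one call
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score)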

In [3]:
hits

[[{'corpus_id': 2, 'score': 0.5324870347976685},
  {'corpus_id': 3, 'score': 0.5232713222503662},
  {'corpus_id': 0, 'score': 0.4987628161907196},
  {'corpus_id': 1, 'score': 0.3491813540458679}],
 [{'corpus_id': 1, 'score': 0.7081699371337891},
  {'corpus_id': 2, 'score': 0.43056827783584595},
  {'corpus_id': 0, 'score': 0.42389726638793945},
  {'corpus_id': 3, 'score': 0.41952770948410034}],
 [{'corpus_id': 2, 'score': 0.7320606708526611},
  {'corpus_id': 3, 'score': 0.493481308221817},
  {'corpus_id': 0, 'score': 0.456560343503952},
  {'corpus_id': 1, 'score': 0.27411243319511414}],
 [{'corpus_id': 3, 'score': 0.771031379699707},
  {'corpus_id': 2, 'score': 0.5865825414657593},
  {'corpus_id': 1, 'score': 0.4227883815765381},
  {'corpus_id': 0, 'score': 0.41303613781929016}]]
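These scores agree with the cosine scores printed by the first cell (for example, 0.5325 for corpus sentence 2 against the first query). That is expected: once both vectors are scaled to unit length, their dot product equals their cosine similarity. The following is a minimal pure-Python sketch of that identity. The helper names and the toy 3-dimensional vectors are assumptions made for illustration and are not real embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative values only).
query_vec = [1.0, 2.0, 2.0]
doc_vec = [2.0, 0.0, 1.0]

cos = cosine(query_vec, doc_vec)
dot_of_unit = dot(normalize(query_vec), normalize(doc_vec))

# Both routes give the same number, up to floating-point rounding.
print(cos, dot_of_unit)
```

Normalizing once up front and then using the cheaper dot product is a common choice when the same corpus is searched many times.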

In [4]:
for embedding in corpus_embeddings:
    print(embedding)

tensor([ 4.6353e-03,  1.0816e-01, -2.0048e-02, -5.4991e-02,  1.0946e-02,
         4.5126e-02,  6.1876e-02,  7.6445e-02, -5.1902e-02,  1.1494e-02,
         3.2990e-02, -4.0189e-02, -2.5777e-02,  3.0751e-02,  3.8288e-03,
         6.3561e-02,  1.2551e-02, -7.1715e-03, -4.0144e-02,  3.7604e-02,
         1.4875e-02,  7.5218e-02, -1.2694e-02, -5.1965e-02,  3.0455e-02,
        -4.1585e-02, -8.7488e-02,  1.1444e-02,  4.1083e-02, -1.3039e-02,
         3.0569e-02,  7.5474e-02,  1.3254e-02,  6.0152e-02,  2.5846e-02,
        -6.6116e-02,  2.7391e-02, -4.8090e-03, -3.9978e-03, -4.3957e-04,
         6.5380e-02,  6.7866e-03, -6.8749e-02,  3.0744e-02, -1.0328e-02,
        -1.8715e-02, -2.4261e-02, -3.0117e-02, -4.4541e-03, -6.8009e-03,
        -4.7796e-02,  2.5050e-02,  1.1413e-02,  3.3048e-02, -5.8435e-02,
        -1.2432e-01,  3.8322e-02,  1.0110e-01,  2.0420e-02,  4.1950e-02,
        -2.7463e-02, -3.5802e-03,  7.7402e-03,  4.2371e-02, -1.6991e-02,
        -5.0181e-02, -2.3276e-04, -7.6047e-02,  5.7

In [5]:
for embedding in query_embeddings:
    print(embedding)

tensor([-4.6898e-02,  6.7274e-02, -1.4832e-02,  7.2339e-03,  2.5502e-02,
         9.1274e-02,  5.6130e-02,  3.0640e-02, -1.4478e-01,  5.3751e-02,
         7.3595e-02,  4.2836e-02,  3.1493e-02,  9.1486e-02,  1.9458e-02,
         1.0364e-02, -5.8954e-02,  1.3518e-02,  5.0499e-02,  3.1287e-02,
        -5.6449e-03,  4.1618e-02,  3.3829e-02, -2.7573e-02,  2.4290e-02,
         8.5252e-04, -6.6257e-02,  3.0321e-02, -2.0771e-03,  3.8240e-03,
        -4.0213e-02,  1.7815e-02, -2.6955e-02,  8.7495e-02,  6.2875e-02,
        -7.0205e-02,  4.2816e-02,  1.1786e-02, -2.3249e-05,  3.0638e-02,
         1.3056e-02, -2.6513e-02, -5.0397e-02,  4.0152e-02, -7.4084e-02,
        -8.4030e-02, -2.9048e-02,  6.8340e-03, -3.7368e-02, -5.0403e-02,
        -5.4493e-02, -3.4893e-02,  1.1950e-02,  7.4624e-02, -2.2653e-02,
        -2.8544e-02,  2.7735e-02,  6.0948e-02,  1.0454e-02, -4.0763e-02,
        -1.0593e-02,  9.2048e-03, -6.1556e-02,  1.2075e-02, -5.4741e-02,
        -3.7355e-02, -3.8064e-04, -2.7491e-02,  1.6

In [7]:
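from sentence_transformers import SentenceTransformer, util
# Note: this model is loaded but not used in the cells below; query_embeddings and
# corpus_embeddings still come from `embedder` (all-MiniLM-L6-v2), so the scores are unchanged.
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')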

In [8]:
print("Similarity:", util.dot_score(query_embeddings, corpus_embeddings))

Similarity: tensor([[0.4988, 0.3492, 0.5325, 0.5233],
        [0.4239, 0.7082, 0.4306, 0.4195],
        [0.4566, 0.2741, 0.7321, 0.4935],
        [0.4130, 0.4228, 0.5866, 0.7710]])


In [9]:
# Scores of the first query only (row 0) against every corpus sentence
scores = util.dot_score(query_embeddings, corpus_embeddings)[0].cpu().tolist()

# Combine docs and scores
doc_score_pairs = list(zip(corpus, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages and scores
for doc, score in doc_score_pairs:
    print(score, doc)

0.5324870347976685 The brief facts of the case are that FIR dated 16.8.2007 was registered against 24 accused persons including the appellant under sections 79 and 80 of the 1963 Act as they were found to be indulging in gambling.
0.5232713222503662 The charge sheet was filed and the Trial Court vide order dated 21.8.2007 convicted them under Section 79 & 80 of the 1963 Act and sentenced them to undergo imprisonment for a period of one yeareach under both the provisions along with a fine of ₹ 600/- after the accused had pleaded guilty.
0.4987628161907196 The appellant having been convicted under Section 80 of the Karnataka Police Act, 1963 (for short, ‘the 1963 Act’) has filed the present appeal.
0.3491813540458679 Notice in the appeal was issued on 27.02.2023 limited to the extent of consideration as to whether the appellant can be granted benefit of probation.
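The ranking above covers only the first query, because the cell takes row 0 of the score matrix. A minimal sketch of ranking every query follows, using plain Python lists. The `similarity` values are copied from the matrix printed by the `util.dot_score` cell, rounded as shown there, and `rank_row` is a hypothetical helper name.

```python
# Similarity matrix copied from the printed dot-score output:
# one row per query, one column per corpus sentence.
similarity = [
    [0.4988, 0.3492, 0.5325, 0.5233],
    [0.4239, 0.7082, 0.4306, 0.4195],
    [0.4566, 0.2741, 0.7321, 0.4935],
    [0.4130, 0.4228, 0.5866, 0.7710],
]

def rank_row(row):
    """Return (corpus_id, score) pairs for one query, best match first."""
    return sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)

rankings = [rank_row(row) for row in similarity]
for query_id, ranked in enumerate(rankings):
    print(query_id, ranked)

# Index of the best-matching corpus sentence for each query.
best_match = [ranked[0][0] for ranked in rankings]
print(best_match)  # [2, 1, 2, 3]
```

The resulting order per query is the same as the `corpus_id` order in the `hits` structure returned by `util.semantic_search`.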
