# Semantic Search Engine with Transformers

In this notebook we use a pre-trained transformer embedding model to extract embeddings from sentences and measure their semantic similarity.

## Semantic Search

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and words with similar meanings.

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from the corpus are found. These entries should have a high semantic overlap with the query.

![](https://i.imgur.com/FyUCkG5.png)
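To make the "closest embedding" idea concrete, here is a minimal sketch in plain Python so the arithmetic stays visible. The three-dimensional vectors and their labels are made up for illustration; real embedding models produce vectors with hundreds of dimensions. Cosine similarity is one common way to measure closeness, and it is the measure used later in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (illustrative values only)
toy_corpus = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "stock markets": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # made-up vector standing in for a pet-related query

# Score every corpus entry against the query and keep the closest one
toy_scores = {name: cosine_similarity(toy_query, vec) for name, vec in toy_corpus.items()}
best_match = max(toy_scores, key=toy_scores.get)
print(best_match, round(toy_scores[best_match], 3))
```

The entry whose vector points in nearly the same direction as the query gets the highest score, regardless of the vectors' lengths.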

In [1]:
!pip install -U sentence-transformers


In [2]:
!nvidia-smi

Sat Feb 15 18:03:30 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### Load Dependencies

In [3]:
import torch
from sentence_transformers import SentenceTransformer, util

### Load Pre-trained Transformer Model

In [4]:
# all-MiniLM-L12-v2 is a sentence-transformers model fine-tuned from the MiniLM base model below
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base

model = SentenceTransformer('all-MiniLM-L12-v2')


In [5]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

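This model has been fine-tuned on a large collection of sentence pairs, so it places texts with similar meanings close together in the vector space. For example, it has learned very similar representations for "AI" and "artificial intelligence", and it associates them with other words that frequently occur alongside them. As the summary above shows, it reads at most 128 tokens per input and produces normalized 384-dimensional embeddings.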
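The model summary printed above lists three stages: a transformer, a pooling layer with `pooling_mode_mean_tokens` set to `True`, and a `Normalize` layer. The sketch below illustrates roughly what the last two stages do, using plain Python and made-up two-dimensional token vectors. It assumes that padding positions are excluded through an attention mask, which is the usual convention for mean pooling; the helper names are hypothetical and this is not the library's actual implementation.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical token vectors for one sentence; the last position is padding
toy_tokens = [[1.0, 2.0], [3.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

pooled = mean_pool(toy_tokens, toy_mask)   # average of the first two vectors
sentence_vector = l2_normalize(pooled)     # same direction, length 1
print(pooled, sentence_vector)
```

Because the final vector has unit length, the cosine similarity between two such vectors equals their dot product.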

### Create a corpus of documents

Let's create a corpus of documents to serve as the source for our text searches.

In [6]:
documents = [
    "The theory of relativity, proposed by Einstein, revolutionized our understanding of space and time.",
    "DNA carries the genetic instructions used in the growth and functioning of all known living organisms.",
    "The Great Wall of China is one of the largest construction projects ever completed.",
    "Climate change refers to long-term shifts in global temperatures and weather patterns.",
    "The periodic table organizes chemical elements based on their atomic number and properties.",
    "The human brain is the most complex organ, responsible for thought, memory, and emotion.",
    "Black holes are regions in space where gravity is so strong that nothing can escape.",
    "Yoga is an ancient practice that involves physical postures and meditation.",
    "Artificial Intelligence aims to create machines that can think and learn.",
    "The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones."
]

In [7]:
len(documents)

10

In [8]:
### Get document embeddings
document_embeddings = model.encode(documents)

In [9]:
document_embeddings, document_embeddings.shape

(array([[-0.02667873,  0.03832817,  0.03548458, ...,  0.07289366,
         -0.07014348, -0.01182672],
        [-0.01190928,  0.09693008, -0.02113872, ...,  0.04476817,
          0.0810886 , -0.02945544],
        [ 0.04817612,  0.01371991,  0.05322814, ...,  0.0059902 ,
          0.04432571, -0.00228396],
        ...,
        [-0.04375399, -0.00201995, -0.00192908, ...,  0.08587868,
         -0.00045801, -0.02675951],
        [ 0.01003823, -0.03451467, -0.00501188, ...,  0.08199856,
          0.06596605, -0.06212616],
        [-0.03938903, -0.01254976, -0.05345229, ...,  0.03958099,
          0.09943354,  0.00296505]], dtype=float32),
 (10, 384))

In [10]:
documents[0]

'The theory of relativity, proposed by Einstein, revolutionized our understanding of space and time.'

In [11]:
document_embeddings[0].shape

(384,)

In [12]:
document_embeddings[0]

array([-2.66787335e-02,  3.83281708e-02,  3.54845822e-02,  2.49578841e-02,
        2.60813739e-02,  4.17036936e-02,  1.48317441e-02,  2.86246762e-02,
        2.76448596e-02,  1.76539123e-02,  1.54975452e-03, -5.95488623e-02,
       -4.78711538e-02,  3.33896689e-02, -1.92171521e-02, -3.70581485e-02,
       -6.26182109e-02,  6.65221438e-02, -3.96491438e-02, -1.15649238e-01,
        5.59612885e-02, -1.19090453e-03, -3.39162201e-02,  4.16375175e-02,
       -1.95887163e-02,  2.06748694e-02,  5.47295660e-02, -2.34289207e-02,
        7.22084288e-03, -2.79319678e-02, -5.21211438e-02,  5.75540960e-02,
        1.07460789e-01,  7.02173822e-03,  4.30832105e-03,  3.51511291e-03,
        8.34457204e-02,  4.80626849e-03, -8.97591095e-03,  3.43583487e-02,
        2.93964166e-02, -5.22546880e-02,  7.15413615e-02, -2.65643653e-02,
        1.30517334e-01,  3.16096214e-03,  4.46754918e-02,  2.72026937e-02,
       -6.71746507e-02,  7.74200931e-02,  3.31745669e-03, -9.27645341e-02,
       -3.93879935e-02, -

Let's find the most similar document for a single query.

In [13]:
new_text = 'What is AI?'
new_text

'What is AI?'

In [14]:
### Get Embedding for New Query
new_text_embedding = model.encode(new_text)
new_text_embedding.shape

(384,)

In [15]:
### Get Cosine Similarity Score of Document Embeddings compared to New Query Embedding
cos_scores = util.pytorch_cos_sim(new_text_embedding, document_embeddings)[0]
cos_scores

tensor([0.1179, 0.0652, 0.0308, 0.0566, 0.0882, 0.3208, 0.0186, 0.0685, 0.6171,
        0.0518])

In [16]:
### Get Most Similar Document ID
top_results = torch.topk(cos_scores, k=1)
top_results

torch.return_types.topk(
values=tensor([0.6171]),
indices=tensor([8]))
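With `k=1` only the single best match comes back. In the notebook, raising `k` in `torch.topk` would return several matches at once. The sketch below shows the same ranking idea in plain Python, using the ten scores copied from the cosine-similarity output above; the `top_k` helper is a hypothetical name introduced only for this illustration.

```python
# Scores copied from the cosine-similarity output above
scores = [0.1179, 0.0652, 0.0308, 0.0566, 0.0882, 0.3208, 0.0186, 0.0685, 0.6171, 0.0518]

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

top_three = top_k(scores, k=3)
print(top_three)  # [(8, 0.6171), (5, 0.3208), (0, 0.1179)]
```

Index 8 is the Artificial Intelligence sentence, followed at a distance by index 5 (the human brain sentence) and index 0 (the relativity sentence).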

In [17]:
idx = top_results.indices[0]
idx

tensor(8)

In [18]:
### Get Most Similar Document
documents[idx]

'Artificial Intelligence aims to create machines that can think and learn.'

In [19]:
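## Create a function to return the most similar document for any query
def semantic_search_engine(query, embedder_model):
    # Embed the query into the same vector space as the documents
    query_embedding = embedder_model.encode(query)
    # Cosine similarity between the query and every document embedding
    cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
    # Pick the index of the highest-scoring document
    top_results = torch.topk(cos_scores, k=1)
    idx = top_results.indices[0].item()
    return documents[idx]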

In [20]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [22]:
new_sentence = 'What are black holes?'
semantic_search_engine(new_sentence, model)

'Black holes are regions in space where gravity is so strong that nothing can escape.'

In [23]:
new_sentence = 'Which is the largest monument to be ever built?'
semantic_search_engine(new_sentence, model)

'The Great Wall of China is one of the largest construction projects ever completed.'