# Semantic Search Engine with Transformers

In this notebook, we use a pre-trained transformer embedding model to extract sentence embeddings and measure their semantic similarity.

## Semantic Search

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and words with similar meanings.

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from the corpus are found. These entries should have a high semantic overlap with the query.

![](https://i.imgur.com/FyUCkG5.png)
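
The embed-then-find-the-closest idea described above can be sketched with toy numbers before bringing in a real model. The three-dimensional vectors and document names below are invented for illustration; real sentence embeddings have hundreds of dimensions and come from a trained model. Cosine similarity is used here as the closeness measure, which is also what the rest of this notebook uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
corpus_vectors = {
    "doc about machines that learn": [0.9, 0.1, 0.0],
    "doc about ancient monuments": [0.0, 0.2, 0.9],
    "doc about plants and sunlight": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of a query about AI

# Score every corpus entry against the query and keep the closest one.
best_match = max(
    corpus_vectors,
    key=lambda name: cosine_similarity(query_vector, corpus_vectors[name]),
)
print(best_match)  # doc about machines that learn
```

The query vector points in nearly the same direction as the first document vector, so that document wins even though no words are compared at all.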

Let's create a corpus of documents to run our text searches against.

In [1]:
documents = [
  "Quantum mechanics describes the behavior of very small particles.",
  "Photosynthesis is the process by which green plants make food using sunlight.",
  "Shakespeare's plays are a testament to English literature.",
  "Artificial Intelligence aims to create machines that can think and learn.",
  "The pyramids of Egypt are historical monuments that have stood for thousands of years.",
  "Biology is the study of living organisms and their interactions with the environment.",
  "Music therapy can aid in the mental well-being of individuals.",
  "The Milky Way is just one of billions of galaxies in the universe.",
  "Economic theories help understand the distribution of resources in society.",
  "Yoga is an ancient practice that involves physical postures and meditation."
]

In [2]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)


In [3]:
!nvidia-smi

Tue Jun 18 10:54:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Load Dependencies

In [4]:
from sentence_transformers import SentenceTransformer, util
import torch



### Load Pre-trained Transformer Model

In [5]:
# all-MiniLM-L12-v2 is a sentence-embedding model built on the MiniLM base model:
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base

model = SentenceTransformer('all-MiniLM-L12-v2')


In [6]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

According to its model card, this model was fine-tuned on a large and varied collection of sentence pairs rather than on Wikipedia alone. As a result, it has learned to give terms that occur in similar contexts, such as "AI" and "artificial intelligence", very similar representations.
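
The module listing printed above shows three stages: a `Transformer` that produces one vector per token, a `Pooling` step with `pooling_mode_mean_tokens` set to `True`, and a final `Normalize()` step. The last two can be sketched in plain Python as follows. The token vectors and the attention mask are invented for illustration (4 dimensions instead of 384), and the helper names `mean_pool` and `l2_normalize` are hypothetical; the library itself works on batched tensors and may differ in details.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical token vectors; the last row stands for a padding token.
token_vectors = [
    [1.0, 2.0, 0.0, 1.0],
    [3.0, 0.0, 2.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)   # [2.0, 1.0, 1.0, 1.0]
sentence_embedding = l2_normalize(pooled)           # unit-length sentence vector
print(sentence_embedding)
```

Because the resulting embeddings have unit length, the cosine similarity between two of them equals their dot product.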

### Create a corpus of documents

In [7]:
documents = ['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
 'Biology is the study of living organisms and their interactions with the environment.',
 'Music therapy can aid in the mental well-being of individuals.',
 'The Milky Way is just one of billions of galaxies in the universe.',
 'Economic theories help understand the distribution of resources in society.',
 'Yoga is an ancient practice that involves physical postures and meditation.']

In [8]:
len(documents)

10

In [9]:
### Get document embeddings
document_embeddings = model.encode(documents)

In [10]:
document_embeddings, document_embeddings.shape

(array([[ 0.01569205,  0.02793592,  0.01044163, ...,  0.05586382,
         -0.06863529,  0.04651124],
        [-0.03118527,  0.03116489,  0.01921587, ..., -0.03559601,
          0.0487383 ,  0.0257752 ],
        [ 0.01772631, -0.03875763,  0.12742487, ...,  0.06566847,
          0.05715033, -0.05221871],
        ...,
        [ 0.09988892, -0.04224407,  0.04059114, ..., -0.05507799,
          0.03537545, -0.07602306],
        [-0.01405635, -0.00320508, -0.01274013, ...,  0.01621084,
          0.0084957 ,  0.00209839],
        [-0.04375399, -0.00201995, -0.00192908, ...,  0.08587868,
         -0.00045801, -0.02675951]], dtype=float32),
 (10, 384))

In [11]:
documents[0]

'Quantum mechanics describes the behavior of very small particles.'

In [12]:
document_embeddings[0].shape

(384,)

In [13]:
document_embeddings[0]

array([ 1.56920515e-02,  2.79359184e-02,  1.04416301e-02,  5.38146012e-02,
       -6.64291531e-02, -1.10787347e-01,  3.04780379e-02,  2.54000239e-02,
       -5.17279888e-03,  3.58387157e-02,  3.20614390e-02,  5.82043594e-03,
       -8.15960299e-03,  6.18610643e-02,  5.40908948e-02,  7.69706210e-03,
        6.66749030e-02, -4.58203293e-02, -4.35077623e-02, -6.01345347e-03,
       -3.49232852e-02,  3.26351449e-02, -5.32207713e-02,  1.72806047e-02,
       -1.03645697e-01, -4.15867157e-02,  3.19176205e-02, -2.92218272e-02,
        2.84545459e-02, -1.93595272e-02, -6.80246018e-03,  3.82997468e-02,
        2.14442946e-02,  5.28521836e-02, -9.52141173e-03, -3.56447324e-02,
        2.97932494e-02, -7.39673600e-02, -5.50745577e-02,  1.35463169e-02,
        3.38974483e-02,  4.48160172e-02,  2.14401446e-02,  6.01723790e-02,
        1.87074691e-02, -1.51426755e-02,  1.95753127e-02, -1.35566080e-02,
       -5.95715567e-02, -3.25450376e-02, -1.14724800e-01,  1.16386944e-02,
       -3.54244374e-02,  

Let's find the most similar document for a single query.

In [14]:
new_text = 'What is AI?'
new_text

'What is AI?'

In [15]:
### Get Embedding for New Query
new_text_embedding = model.encode(new_text)
new_text_embedding.shape

(384,)

In [16]:
### Get Cosine Similarity Scores between the New Query Embedding and each Document Embedding
cos_scores = util.pytorch_cos_sim(new_text_embedding, document_embeddings)[0]
cos_scores

tensor([ 5.4058e-02, -4.1397e-02,  6.0208e-02,  6.1713e-01,  4.6178e-04,
         1.1132e-01,  6.8011e-02, -4.1013e-02,  1.3222e-01,  6.8518e-02])

In [17]:
### Get Most Similar Document ID
top_results = torch.topk(cos_scores, k=1)
top_results

torch.return_types.topk(
values=tensor([0.6171]),
indices=tensor([3]))
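
With `k=1`, `torch.topk` returns only the single best match. To see how the runners-up compare, the same ranking can be sketched in plain Python; in the notebook itself, passing a larger `k` to `torch.topk` should give the equivalent result directly. The scores below are copied from the `cos_scores` output above and rounded to four decimals, and the helper name `top_k` is hypothetical.

```python
# Scores copied from the cos_scores output above, rounded to 4 decimals.
cos_scores = [0.0541, -0.0414, 0.0602, 0.6171, 0.0005,
              0.1113, 0.0680, -0.0410, 0.1322, 0.0685]

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

top_three = top_k(cos_scores, k=3)
for index, score in top_three:
    print(index, score)
```

Document 3 (the Artificial Intelligence sentence) scores far above the next candidates, documents 8 and 5, which suggests the match is not a close call for this query.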

In [18]:
idx = top_results.indices[0]
idx

tensor(3)

In [19]:
### Get Most Similar Document
documents[idx]

'Artificial Intelligence aims to create machines that can think and learn.'

In [30]:
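## Create a function to return the most similar document for any query
def semantic_search_engine(query, embedder_model):
  """Return the document most similar to `query` (uses the global `documents` and `document_embeddings`)."""
  # Embed the query into the same vector space as the corpus
  query_embedding = embedder_model.encode(query)
  # Cosine similarity between the query and every document embedding
  cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
  # Pick the index of the highest-scoring document
  top_results = torch.topk(cos_scores, k=1)
  idx = int(top_results.indices[0])
  return documents[idx]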

In [31]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [32]:
new_sentence = 'Do you know about the pyramids?'
semantic_search_engine(new_sentence, model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [33]:
new_sentence = 'How do plants survive?'
semantic_search_engine(new_sentence, model)

'Photosynthesis is the process by which green plants make food using sunlight.'