<a href="https://colab.research.google.com/github/leelasankar/SalesforceMobileSDK-Samples/blob/master/1_Hands_on_Build_your_own_search_engine_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embeddings and Semantic Search Engines with Transformers

# Finding semantic similarity with Pre-trained Embeddings

Here we leverage pre-trained embedding models (deep learning models) to extract embeddings from sentences and measure their semantic similarity.

Models we will look at:

- Transformers

# Semantic Search

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and differently worded text with the same meaning.


The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
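
To make the idea concrete, here is a minimal sketch of the embed-then-compare step using plain Python. The three-dimensional vectors and the names `toy_corpus` and `toy_query` are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, but the comparison works the same way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings" standing in for real model output.
toy_corpus = {
    "doc about AI": [0.9, 0.1, 0.0],
    "doc about plants": [0.0, 0.8, 0.3],
    "doc about pyramids": [0.1, 0.0, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # made-up embedding of an AI-related query

# Score every corpus entry against the query and keep the closest one.
scores = {name: cosine_similarity(toy_query, vec) for name, vec in toy_corpus.items()}
best_match = max(scores, key=scores.get)
print(best_match)
```

The entry whose vector points in nearly the same direction as the query vector gets the highest score, which is what "closest embedding" means in this setting.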

![](https://i.imgur.com/FyUCkG5.png)

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

# Fun with Embeddings: Simple Search Engine!

Let's create a corpus of documents to run our text searches against.

In [None]:
documents = [
  "Quantum mechanics describes the behavior of very small particles.",
  "Photosynthesis is the process by which green plants make food using sunlight.",
  "Shakespeare's plays are a testament to English literature.",
  "Artificial Intelligence aims to create machines that can think and learn.",
  "The pyramids of Egypt are historical monuments that have stood for thousands of years.",
  "Biology is the study of living organisms and their interactions with the environment.",
  "Music therapy can aid in the mental well-being of individuals.",
  "The Milky Way is just one of billions of galaxies in the universe.",
  "Economic theories help understand the distribution of resources in society.",
  "Yoga is an ancient practice that involves physical postures and meditation."
]

## Building Robust Semantic Search Engines with Transformers

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (

In [None]:
!nvidia-smi

Fri Apr 19 09:26:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Load Dependencies

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

## Load Pre-trained Transformer Model

In [None]:
# all-MiniLM-L12-v2 is a sentence-transformers model fine-tuned from the MiniLM base model below
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base

model = SentenceTransformer('all-MiniLM-L12-v2')

In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
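
The printed model has three stages: a Transformer that yields one vector per token, a Pooling layer with `pooling_mode_mean_tokens` set to `True`, and a `Normalize` layer. The following is a simplified sketch of what the last two stages do to a single sentence, written in plain Python. The helper names `mean_pool` and `l2_normalize` and the toy values are assumptions for illustration; the library itself works on batched tensors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical 2-d token vectors; the last row stands in for a padding token.
toy_tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = mean_pool(toy_tokens, toy_mask)   # average of the first two rows
sentence_vector = l2_normalize(pooled)     # same direction, length 1
print(pooled, sentence_vector)
```

Because the final vector has unit length, comparing two sentence embeddings with a plain dot product gives the same value as their cosine similarity.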

## Create a corpus of documents

In [None]:
documents = ['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
 'Biology is the study of living organisms and their interactions with the environment.',
 'Music therapy can aid in the mental well-being of individuals.',
 'The Milky Way is just one of billions of galaxies in the universe.',
 'Economic theories help understand the distribution of resources in society.',
 'Yoga is an ancient practice that involves physical postures and meditation.']

In [None]:
len(documents)

10

## Get document embeddings

In [None]:
document_embeddings = model.encode(documents)

In [None]:
document_embeddings, document_embeddings.shape

(array([[ 0.01569203,  0.02793593,  0.01044162, ...,  0.05586386,
         -0.06863528,  0.04651124],
        [-0.03118527,  0.03116488,  0.01921586, ..., -0.035596  ,
          0.04873832,  0.0257752 ],
        [ 0.01772629, -0.03875764,  0.1274249 , ...,  0.06566847,
          0.05715032, -0.05221874],
        ...,
        [ 0.09988894, -0.04224412,  0.04059111, ..., -0.05507796,
          0.03537545, -0.07602305],
        [-0.01405633, -0.0032051 , -0.01274014, ...,  0.0162108 ,
          0.00849572,  0.00209841],
        [-0.04375398, -0.00201992, -0.00192907, ...,  0.08587869,
         -0.000458  , -0.02675952]], dtype=float32),
 (10, 384))

In [None]:
documents[0]

'Quantum mechanics describes the behavior of very small particles.'

In [None]:
document_embeddings[0].shape

(384,)

In [None]:
document_embeddings[0]

array([ 1.56920254e-02,  2.79359259e-02,  1.04416236e-02,  5.38145900e-02,
       -6.64291754e-02, -1.10787347e-01,  3.04780807e-02,  2.54000314e-02,
       -5.17281843e-03,  3.58387381e-02,  3.20614353e-02,  5.82040893e-03,
       -8.15960206e-03,  6.18610606e-02,  5.40908687e-02,  7.69706815e-03,
        6.66749105e-02, -4.58203107e-02, -4.35077548e-02, -6.01345208e-03,
       -3.49232629e-02,  3.26351747e-02, -5.32207489e-02,  1.72806252e-02,
       -1.03645667e-01, -4.15867232e-02,  3.19175981e-02, -2.92218029e-02,
        2.84545552e-02, -1.93595141e-02, -6.80246390e-03,  3.82997766e-02,
        2.14443114e-02,  5.28521836e-02, -9.52143781e-03, -3.56447212e-02,
        2.97932513e-02, -7.39673600e-02, -5.50745279e-02,  1.35463262e-02,
        3.38974632e-02,  4.48159762e-02,  2.14401558e-02,  6.01723716e-02,
        1.87074393e-02, -1.51426448e-02,  1.95753276e-02, -1.35566248e-02,
       -5.95715605e-02, -3.25450040e-02, -1.14724785e-01,  1.16386916e-02,
       -3.54244113e-02,  

## Let's try to find the most similar document for one query

### New Query

In [None]:
new_text = 'What is AI?'
new_text

'What is AI?'

### Get Embedding for New Query

In [None]:
new_text_embedding = model.encode(new_text)
new_text_embedding.shape

(384,)

### Get Cosine Similarity Score of Document Embeddings compared to New Query Embedding

In [None]:
cos_scores = util.pytorch_cos_sim(new_text_embedding, document_embeddings)[0]
cos_scores

tensor([ 5.4059e-02, -4.1397e-02,  6.0208e-02,  6.1713e-01,  4.6178e-04,
         1.1132e-01,  6.8011e-02, -4.1013e-02,  1.3222e-01,  6.8518e-02])

### Get Most Similar Document ID

In [None]:
top_results = torch.topk(cos_scores, k=1)
top_results

torch.return_types.topk(
values=tensor([0.6171]),
indices=tensor([3]))
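
`torch.topk` with `k=1` keeps only the single best match. As a rough illustration of what a larger `k` would return, the sketch below ranks the same scores printed in the `cos_scores` output using only the standard library. The helper name `top_k` is hypothetical and this is not how `torch.topk` is implemented; it only mirrors the idea of selecting the highest-scoring indices.

```python
import heapq

# Scores copied from the cos_scores output; list position = document index.
printed_scores = [5.4059e-02, -4.1397e-02, 6.0208e-02, 6.1713e-01, 4.6178e-04,
                  1.1132e-01, 6.8011e-02, -4.1013e-02, 1.3222e-01, 6.8518e-02]

def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

top_three = top_k(printed_scores, 3)
for doc_index, score in top_three:
    print(doc_index, round(score, 4))
```

With these scores the ranking is document 3 (the AI sentence), then document 8 (economic theories) and document 5 (biology), with a large gap between the first and second place.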

In [None]:
idx = top_results.indices[0]
idx

tensor(3)

### Get Most Similar Document

In [None]:
documents[idx]

'Artificial Intelligence aims to create machines that can think and learn.'

## Create a function to return the top similar document based on any query

In [None]:
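def semantic_search_engine(query, embedder_model):
  # Note: relies on the global `documents` and `document_embeddings` defined earlier
  # Embed the query into the same vector space as the corpus
  query_embedding = embedder_model.encode(query)
  # Cosine similarity between the query and every document embedding
  cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
  # Keep only the single highest-scoring document
  top_results = torch.topk(cos_scores, k=1)
  # Convert the index tensor to a plain int before indexing the list
  idx = int(top_results.indices[0])
  return documents[idx]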

## Try out the function

In [None]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [None]:
new_sentence = 'Do you know about the pyramids?'
semantic_search_engine(new_sentence, model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [None]:
new_sentence = 'How do plants survive?'
semantic_search_engine(new_sentence, model)

'Photosynthesis is the process by which green plants make food using sunlight.'