Notebook introduces the Embeddings and their use cases 

It all starts with Data, in this case the data is taken from the https://faq.ssa.gov/en-US/

Note, Embedding is a process of converting a word or a number into a vector of certain dimensions
Tokenizer and Embedding models are not same. They are different. 

Tokenizers are functions written in python that take a corpus of data and returns a dictionary-id map. Based on which the tokenizers, work on the sentences.

Embedding models are Neural Networks coded in Torch/TF/Jax/Flax that are used for creating vectors 

In [1]:
# we have to work with sentence transformers library. 

from sentence_transformers import SentenceTransformer, util

model_id = "sentence-transformers/all-MiniLM-L6-v2"
model_embedding = SentenceTransformer(model_id)

In [2]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans' Benefits?"]

In [3]:
model_embedding.to('cuda')

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [5]:
texts_tokenized = model_embedding.tokenize(texts=texts[0],)

In [6]:
texts_tokenized

{'input_ids': tensor([[ 101, 1044,  102],
         [ 101, 1051,  102],
         [ 101, 1059,  102],
         [ 101,  102,    0],
         [ 101, 1040,  102],
         [ 101, 1051,  102],
         [ 101,  102,    0],
         [ 101, 1045,  102],
         [ 101,  102,    0],
         [ 101, 1043,  102],
         [ 101, 1041,  102],
         [ 101, 1056,  102],
         [ 101,  102,    0],
         [ 101, 1037,  102],
         [ 101,  102,    0],
         [ 101, 1054,  102],
         [ 101, 1041,  102],
         [ 101, 1052,  102],
         [ 101, 1048,  102],
         [ 101, 1037,  102],
         [ 101, 1039,  102],
         [ 101, 1041,  102],
         [ 101, 1049,  102],
         [ 101, 1041,  102],
         [ 101, 1050,  102],
         [ 101, 1056,  102],
         [ 101,  102,    0],
         [ 101, 1049,  102],
         [ 101, 1041,  102],
         [ 101, 1040,  102],
         [ 101, 1045,  102],
         [ 101, 1039,  102],
         [ 101, 1037,  102],
         [ 101, 1054,  102],
 

In [12]:
texts_embed01 = model_embedding.encode(texts[0], convert_to_tensor=True)
texts_embed01.shape

torch.Size([384])

In [13]:
texts_embed02 = model_embedding.encode(texts[1], convert_to_tensor=True)
texts_embed02.shape

torch.Size([384])

In [14]:
from sentence_transformers.util import pytorch_cos_sim

In [15]:
similarity = pytorch_cos_sim(texts_embed01, texts_embed02)
similarity

tensor([[0.4886]], device='cuda:0')

In [5]:
embedding_texts = model_embedding.encode(texts)
embedding_texts.shape

(13, 384)

In [6]:
from pandas import DataFrame

embed_df = DataFrame(embedding_texts)

In [18]:
embed_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383
0,-0.023889,0.055259,-0.011655,-0.033414,-0.012261,-0.024873,-0.012663,0.025346,0.018508,-0.083508,-0.09302,0.014486,-0.017411,-0.088344,-0.004479,-0.046326,-0.013194,0.035382,0.062311,0.04859,-0.059118,0.054135,-0.064397,0.034024,0.006636,0.035917,-0.067838,-0.017735,-0.012722,0.046462,0.108644,0.023821,-0.026996,0.037174,0.097598,-0.02703,-0.04543,0.031817,-0.033746,-0.015198,...,-0.045291,0.118322,0.054848,-0.040015,0.098105,0.022277,-0.030813,-0.005176,0.049103,0.045938,-0.023188,-0.027573,-0.040576,0.016116,0.02501,-0.058007,0.047965,0.117957,-0.008974,-0.013361,0.020989,-0.0252,-0.006896,-0.021131,0.005462,0.064137,0.026008,-0.02985,-0.011776,0.00309,-0.161688,-0.046426,0.006004,0.005281,-0.003342,0.027754,0.020411,0.005778,0.034098,-0.006889
1,-0.012688,0.046874,-0.010502,-0.020384,-0.013361,0.042322,0.016628,-0.004099,-0.002607,-0.010188,-0.044768,0.019365,0.031505,-0.118893,0.01985,0.035861,0.034993,-0.083673,0.056933,0.057396,-0.057795,-0.005447,0.003423,0.014473,0.146743,-0.053123,0.003083,0.030637,0.055512,0.043963,0.047002,0.044337,0.020708,-0.004741,-0.008704,-0.039581,-0.063424,-0.011725,-0.090585,-0.045387,...,-0.063684,0.099501,0.002105,0.042053,0.054385,-0.017293,-0.00745,0.034746,-0.000616,-0.050755,-0.040021,0.014303,0.025885,-0.062788,0.040704,-0.028741,0.069934,-0.024656,0.06453,0.014862,0.030004,-0.010374,-0.09046,-0.062121,-0.01513,-0.003932,0.075132,0.052699,0.020436,0.024714,-0.061594,-0.020717,-0.009082,-0.02926,-0.066253,0.065257,0.013229,-0.023103,-0.002785,0.010474
2,0.000494,0.119412,0.00523,-0.092734,0.007773,-0.005325,0.034506,-0.051981,-0.006265,-0.00611,-0.079471,0.036207,-0.00971,-0.081195,-0.001876,-0.013249,-0.042756,0.004501,-0.007266,0.100785,-0.002075,0.042169,-0.023942,0.098594,0.072433,-0.002734,0.016057,0.00572,-0.026609,-0.013365,0.097391,0.01028,-0.016172,-0.003942,0.034441,-0.013009,-0.10954,-0.019242,-0.003607,-0.060187,...,-0.015841,0.088835,-0.022281,0.007992,0.04476,-0.002664,-0.015018,-0.024615,0.043037,0.046402,-0.074185,0.007321,0.012401,-0.004225,0.040887,-0.013238,0.086007,0.130728,0.009953,0.053924,0.037271,-0.037933,-0.00412,-0.041604,-0.048431,0.110611,0.038085,-0.016102,-0.011424,-0.00941,-0.108326,-0.049646,-0.073399,-0.029898,-0.102734,0.062121,0.034605,0.016877,-0.023861,0.005264
3,-0.029711,0.023298,-0.057041,-0.012183,-0.01371,0.029796,0.063739,0.001101,-0.045124,-0.040748,-0.131671,0.000674,0.032849,-0.048718,-0.016917,-0.04001,-0.003435,-0.000405,0.049092,0.057811,0.007957,-0.01472,-0.055192,0.029432,0.086543,-0.034207,-0.004638,-0.006953,-0.017902,0.089433,0.138466,-0.004411,-0.012209,0.027505,0.056866,-0.016538,-0.03082,0.005954,-0.056146,-0.004276,...,-0.02335,0.117718,0.058016,0.007543,0.053195,0.029278,-0.005433,0.046559,-0.008911,-0.013223,-0.073022,-0.018384,-0.001908,-0.026813,0.075265,-0.090822,0.035911,0.121485,0.071004,-0.025873,-0.021903,0.062796,-0.012797,-0.006417,0.017931,0.035687,0.033231,0.021569,0.100695,-0.047331,-0.117682,0.031924,0.000854,0.0202,-0.020666,-0.005167,0.03837,0.003617,0.033993,-0.010255
4,-0.025628,0.070389,-0.01738,-0.056567,0.028576,0.052823,0.067063,-0.052617,-0.054702,-0.11623,-0.126143,0.038227,0.011085,-0.027623,0.086316,0.0057,0.013502,0.001248,0.03837,0.087459,-0.060004,0.007136,-0.052758,-0.003477,0.079192,-0.030614,0.03455,0.065704,-0.011732,0.051478,0.095803,-0.019129,-0.036677,0.015641,0.036194,-0.058811,-0.035086,0.022795,-0.081846,-0.027348,...,-0.075405,0.129256,-0.058059,-0.01965,0.10145,0.003209,-0.012665,0.038677,0.021085,-0.004969,-0.021644,-0.070017,0.060121,-0.107323,0.001019,-0.093465,0.087102,0.094227,0.080545,0.032137,-0.011176,-0.064559,-0.031923,-0.051013,-0.017872,0.017034,0.061883,0.052157,0.101039,-0.056417,-0.118145,0.013343,-0.055188,-0.032723,0.008436,0.019169,0.048212,-0.040412,0.083346,0.026855
5,-0.022656,0.02116,0.005105,-0.046494,0.009074,0.041495,0.054268,-0.024185,-0.013483,-0.075966,-0.090702,-0.029076,0.045339,-0.077989,0.047003,-0.01883,-0.031521,-0.022798,0.021713,0.057836,-0.051639,-0.014933,-0.029978,0.02325,0.087391,-0.062931,-0.00042,0.062464,-0.021476,0.035335,0.125799,0.029123,-0.037065,0.013791,0.057291,-0.072491,-0.044007,0.026902,-0.039566,-0.066453,...,-0.077669,0.099516,-0.011076,-0.007306,0.062561,0.006845,-0.005897,0.007084,0.010039,0.003088,-0.000738,-0.014339,-0.00231,-0.035318,0.033689,-0.050801,0.076678,0.09998,0.07201,0.044336,0.028311,0.001274,-0.067214,-0.064206,-0.031583,0.06006,0.076265,0.012245,0.071965,-0.010519,-0.10011,0.01075,-0.031469,-0.004822,0.039657,0.026384,0.045514,0.059089,-0.017509,0.007166
6,-0.002911,0.060791,-0.009176,-0.006133,0.040493,0.036594,0.002054,-0.031345,0.031806,-0.023495,0.071992,0.048723,0.081783,-0.050864,-0.005711,-0.080416,-0.01225,-0.003741,-0.029289,0.052237,-0.010236,0.037758,-0.079403,0.124539,0.091983,-0.010715,0.034181,-0.016364,-0.023802,0.015979,-0.060006,0.040025,-0.029828,0.017246,0.017604,-0.004945,-0.012642,0.005651,-0.064422,-0.001107,...,-0.037479,0.120514,0.092009,0.150646,0.05924,0.016865,-0.015192,0.032755,0.074319,0.0063,-0.098705,-0.016977,-0.04784,-0.077831,0.031058,-0.0236,0.030114,-0.007999,0.037392,-0.022385,0.026635,-0.019759,-0.097564,0.022126,-0.026906,-0.008749,-0.033806,0.028241,-0.001251,-0.003584,-0.028763,-0.060458,-0.018598,-0.040189,-0.031486,-0.018299,0.002286,-0.07342,0.016235,-0.000244
7,-0.080526,0.059888,-0.048847,-0.040176,-0.063342,0.041848,0.119045,0.010652,-0.030095,-0.004561,-0.07515,0.081693,0.003867,-0.084236,-0.0619,-0.02171,0.010616,-0.023371,0.03094,0.093385,-0.03637,0.04271,-0.061342,0.052395,0.041366,0.008109,-0.061988,-0.035993,-0.004243,0.071631,0.100317,0.0053,0.006457,0.049251,-0.039963,0.021823,-0.021824,0.033236,-0.022382,0.009573,...,-0.036822,0.103294,0.086007,0.000951,0.036378,0.036222,-0.036158,0.012988,0.004459,0.041679,-0.089641,-0.028039,0.027155,-0.080964,0.054563,-0.134662,0.005126,0.086044,0.044157,0.023074,-0.026023,-0.024532,-0.02176,-0.052582,0.015607,0.022571,0.046028,0.050643,0.054423,-0.083213,-0.144566,0.020404,0.023088,0.005077,-0.055645,-0.007675,0.050791,-0.005989,0.134562,0.034817
8,-0.034388,0.072501,0.01444,-0.036695,0.014019,0.06307,0.034683,-0.014531,-0.059862,-0.045383,-0.055213,-0.034528,0.00927,-0.095072,0.036745,0.025977,0.013696,0.004641,-0.044114,0.063383,-0.088903,0.013146,-0.03782,0.023436,0.079054,0.02817,-0.02684,0.012249,0.032541,-0.019416,0.079922,-0.04345,-0.04865,-0.00617,0.047211,-0.0036,-0.06654,0.031916,-0.052208,-0.04867,...,-0.026549,0.120332,-0.020662,-0.007842,0.052714,0.005838,-0.021314,-0.019987,0.016647,-0.036486,-0.018713,0.007056,0.013114,-0.034846,0.019419,-0.048089,0.070016,0.015946,0.055659,0.041075,0.049812,-0.037412,-0.01456,-0.032269,-0.040533,0.04331,0.072315,0.006942,0.030646,0.013022,-0.114763,-0.035894,-0.019877,-0.033375,-0.030168,0.039412,0.044993,0.000578,-0.025124,0.034191
9,-0.005964,0.025044,-0.003182,-0.025243,-0.039823,-0.012772,0.044713,0.014535,-0.038213,-0.041149,-0.05854,0.070492,-0.029789,-0.046087,-0.016301,-0.080821,0.030458,-0.014638,0.012796,0.120223,-0.032289,0.035957,-0.018771,0.06087,0.000829,0.037492,0.004634,0.005595,-0.000582,-0.020706,0.063955,0.027098,0.031915,0.017982,0.007558,0.045427,0.023558,0.037546,-0.043077,-0.012915,...,-0.012387,0.085,0.074588,0.018098,0.027723,0.073802,-0.010719,0.027924,0.027842,-0.001941,-0.052277,0.019475,0.04263,-0.044101,0.061573,-0.064164,0.077146,-0.030594,0.061598,0.050569,0.029921,-0.06405,-0.025672,0.022948,0.001914,-0.00496,0.032083,0.061701,0.011159,-0.078794,-0.057621,0.021594,0.048983,-0.044541,-0.030137,0.006779,0.054854,0.029937,0.070214,0.041565


In [7]:
embed_df['texts'] = texts

In [8]:
embed_df.to_csv("embed_text.csv",index=False)

In [11]:
# pushing the dataset to huggingface hub

from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [12]:
from datasets import load_dataset, Dataset

In [14]:
embed_ds = Dataset.from_pandas(embed_df)
embed_ds

Dataset({
    features: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', 

In [15]:
embed_ds = embed_ds.train_test_split(train_size=0.8)

In [16]:
embed_ds

DatasetDict({
    train: Dataset({
        features: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150',

In [17]:
dataset_path = "Kamaljp/embed_texts"
embed_ds.push_to_hub(dataset_path)

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Downloading metadata:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

In [24]:
# there are some models with mandatory prompts,

allmini = "sentence-transformers/all-MiniLM-L6-v2"

prompts = {
    "classification": "Classify the following text:",
    "retrieval": "Retrieve Semantically Similar text:",
    "Clustering": "Identify the topic or theme based on the text:"
}

In [25]:
multiling_model = SentenceTransformer(
    model_name_or_path=allmini,
    prompts=prompts
)

In [26]:
# during embedding, the following process is followed

embeddings = multiling_model.encode("There are many good looking places",
                                    prompt_name='retrieval')

In [33]:
# We can use the Transformers AutoClasses, will require additional steps to access the embedding
# We cannot dismantle the model so we need to dissect the model output

from transformers import AutoTokenizer, AutoModel

In [31]:
tokenizer = AutoTokenizer.from_pretrained(allmini)
model = AutoModel.from_pretrained(allmini)

In [35]:
sentences = [
    "This framework generate embeddings for each input sentence",
    "Sentences are passed as a list of strings",
    "The quick brown fox jupms over the lazy dog.",
]

In [37]:
encode_sentence = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt') 

In [39]:
encode_sentence.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [40]:
# When getting the inference, the encoded tokens are sent into the models for processing
# first step in the model is embedding
import torch

with torch.no_grad():  # We don't want the model to calculate the gradient, when making this pass
    model_out = model(**encode_sentence)

In [42]:
type(model_out)  # observe the type of the model_out

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

In [44]:
token_embeds = model_out[0]
token_embeds.shape  # Try to explain the shape, by relooking at the earlier steps

torch.Size([3, 14, 384])

#### Next mean-pooling is not mandatory

In [47]:
# lets seperate the attn_mask
attn_mask = encode_sentence["attention_mask"]
attn_mask.shape

torch.Size([3, 14])

In [49]:
# Starting the process of Mean Pooling manually. This step is done for model inference
# expanding the attn_masks
attn_mask_expanded = attn_mask.unsqueeze(-1).expand(token_embeds.size()).float()
# attn_mask_expanded

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]]])

In [50]:
sum_embeds = torch.sum(token_embeds * attn_mask_expanded , 1)

In [51]:
sum_mask = torch.clamp(attn_mask_expanded.sum(1), min=1e-9)

In [52]:
mean_pooled = sum_embeds / sum_mask
mean_pooled

tensor([[-0.0923, -0.2241, -0.0747,  ...,  0.5802,  0.6137, -0.2178],
        [ 0.2897,  0.2838,  0.2554,  ...,  0.2671,  0.3355, -0.1703],
        [ 0.1504,  0.4607,  0.2637,  ...,  0.0908,  0.3025,  0.3073]])

#### Some Tasks with Embedding

##### text similarity

In [57]:
# Text Similarity

sentences2 = [
    'There is more to embeddings than it meets the eyes',
    'Every object in the Neural Network world have a rich and varied back story',
    'when there are instances, that means there has to be blueprints of them lying around'
]
sentences = ['This framework generate embeddings for each input sentence',
 'Sentences are passed as a list of strings',
 'The quick brown fox jupms over the lazy dog.']

In [59]:
embed = model_embedding.encode(sentences)
embed2 = model_embedding.encode(sentences2)

In [60]:
# get the cosine scores
cos_scores = util.cos_sim(embed2, embed)
cos_scores  # the scores for each sentence in one embeding list is compared with another embeding list
# So there will be a matrix

tensor([[0.3594, 0.1727, 0.1343],
        [0.3569, 0.1764, 0.1245],
        [0.1035, 0.1637, 0.1032]])

##### semantic search

Symmetric search: Query and the retrieved sentences are having the same length
[link](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)

Assymetric search: Query is smaller in size, while the sentences are longer, like paras
[link](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)

In [61]:
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# Query sentences:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]


In [72]:
from sentence_transformers.util import (
    normalize_embeddings,
    dot_score,
    semantic_search
)

In [62]:
corpus_embeddings = model_embedding.encode(corpus, convert_to_tensor=True)

In [63]:
corpus_embeddings =  corpus_embeddings.to('cuda')

In [65]:
corpus_embeddings.shape

torch.Size([9, 384])

In [70]:
# normalized corpus embeddings make it simple to calculat the dot-pdt
corpus_embeddings = normalize_embeddings(corpus_embeddings)
corpus_embeddings

tensor([[ 0.0332,  0.0044, -0.0063,  ...,  0.0692, -0.0246, -0.0376],
        [ 0.0525,  0.0552, -0.0112,  ..., -0.0162, -0.0602, -0.0412],
        [-0.0363, -0.0357, -0.0272,  ..., -0.0386,  0.1057, -0.0013],
        ...,
        [ 0.0370,  0.0226,  0.0496,  ..., -0.0031,  0.0489,  0.0167],
        [ 0.0235, -0.0585,  0.0560,  ...,  0.0584,  0.0377,  0.0410],
        [ 0.0228,  0.1041, -0.0340,  ...,  0.0029,  0.0386,  0.0438]],
       device='cuda:0')

In [71]:
query_embeddings = model_embedding.encode(queries, convert_to_tensor=True)
query_embeddings = normalize_embeddings(query_embeddings.to('cuda'))
query_embeddings

tensor([[-0.0116, -0.0508, -0.0217,  ...,  0.0822,  0.0099, -0.0394],
        [-0.0357,  0.0168,  0.0448,  ...,  0.0249,  0.0653, -0.0112],
        [ 0.0544,  0.0540, -0.0037,  ...,  0.0325,  0.0219,  0.0621]],
       device='cuda:0')

In [73]:
semantic_search = semantic_search(query_embeddings, corpus_embeddings, score_function=dot_score)

In [74]:
semantic_search

[[{'corpus_id': 0, 'score': 0.7035486698150635},
  {'corpus_id': 1, 'score': 0.5271987318992615},
  {'corpus_id': 3, 'score': 0.18889553844928741},
  {'corpus_id': 6, 'score': 0.10469923168420792},
  {'corpus_id': 8, 'score': 0.09803037345409393},
  {'corpus_id': 7, 'score': 0.08189043402671814},
  {'corpus_id': 4, 'score': 0.033593956381082535},
  {'corpus_id': 5, 'score': -0.059434838593006134},
  {'corpus_id': 2, 'score': -0.08980069309473038}],
 [{'corpus_id': 7, 'score': 0.6432533264160156},
  {'corpus_id': 4, 'score': 0.25641557574272156},
  {'corpus_id': 3, 'score': 0.1388726532459259},
  {'corpus_id': 6, 'score': 0.11909151822328568},
  {'corpus_id': 8, 'score': 0.10798682272434235},
  {'corpus_id': 0, 'score': 0.06300687044858932},
  {'corpus_id': 2, 'score': 0.02465788647532463},
  {'corpus_id': 1, 'score': 0.021566985175013542},
  {'corpus_id': 5, 'score': -0.08950325846672058}],
 [{'corpus_id': 8, 'score': 0.8253214359283447},
  {'corpus_id': 0, 'score': 0.1398952305316925}

##### Many real world problems can be solved using the above embedding models

- Using ANN to search for getting the context for the RAG

- Retrieve Similar questions, or similar problems, similar products based on something chosen by the user

- Ranking the retrieved snippets for relevancy is another interesting task that can enrich the search process-

- Clustering the sentences into topics or ideas

- Paraphrase mining of large corpus texts with similar meaning or idea

- Image Search using the embedded data of the image

##### Retrieve & Re-Rank

We will be using the CrossEncoder modlel cross-encoder/ms-marco-MiniLM-L-6-v2

In [None]:
# !pip install rank_bm25

In [75]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

In [78]:
multi_qa = "multi-qa-MiniLM-L6-cos-v1"
bi_encoder = SentenceTransformer(multi_qa)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/11.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [91]:
cross_enc = "cross-encoder/ms-marco-MiniLM-L-6-v2"

cross_encoder = CrossEncoder(cross_enc)

In [80]:
wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

  0%|          | 0.00/50.2M [00:00<?, ?B/s]

In [81]:
passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597


In [82]:
bi_encoder = bi_encoder.to("cuda")  # In CPU will take long time, items/sec rate will be 10/ 15. With 

In [83]:
# embed the entire 170K passages with bi-encoder
corpus_embeding = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True, device='cuda')

Batches:   0%|          | 0/5300 [00:00<?, ?it/s]

In [84]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


In [85]:
# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


In [86]:
tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

  0%|          | 0/169597 [00:00<?, ?it/s]

In [92]:
# This function will search all wikipedia articles for passages that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    question_embedding = question_embedding.cuda()
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=32)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking ##### Try to explain this part 
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Sort results by the cross-encoder scores
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-5 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-5 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

In [93]:
search(query="Who is the president of India")

Input question: Who is the president of India
Top-3 lexical search (BM25) hits
	14.509	The Vice President of India is the second-highest constitutional official in India, after the President.
	13.316	Fakhruddin Ali Ahmed was the fifth President of India from 1974 to 1977 and also the 2nd President of India to die in office.
	11.866	The President of India is the head of state of the Republic of India. The current president, Ram Nath Kovind, who was sworn in on 25 July 2017. He succeeded Pranab Mukherjee. The President resides in an estate known as the Rashtrapati Bhavan in New Delhi.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.119	Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".
	0.077	Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956 – October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six m

In [94]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Single list of sentences - Possible tens of thousands of sentences
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "I love pasta",
    "The new movie is awesome",
    "The cat plays in the garden",
    "A woman watches TV",
    "The new movie is so great",
    "Do you like pizza?",
]

paraphrases = util.paraphrase_mining(model, sentences)

for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

The new movie is awesome 		 The new movie is so great 		 Score: 0.8939
The cat sits outside 		 The cat plays in the garden 		 Score: 0.6788
I love pasta 		 Do you like pizza? 		 Score: 0.5096
I love pasta 		 The new movie is so great 		 Score: 0.2560
I love pasta 		 The new movie is awesome 		 Score: 0.2440
A man is playing guitar 		 The cat plays in the garden 		 Score: 0.2105
The new movie is awesome 		 Do you like pizza? 		 Score: 0.1969
The new movie is so great 		 Do you like pizza? 		 Score: 0.1692
The cat sits outside 		 A woman watches TV 		 Score: 0.1310
The cat plays in the garden 		 Do you like pizza? 		 Score: 0.0900
