# Sentence Embeddings using Sentence Transformers 

### **SBERT - a state-of-the-art sentence embedding technique.**

### Reference_links: <br>

https://www.sbert.net/# <br>
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.


Problem with BERT: <br>
 - A common way to address semantic search problems is to map each sentence into a vector space such that semantically similar sentences lie close together.
 - Researchers have experimented with feeding individual sentences into BERT to derive fixed-size sentence embeddings. The most common approaches are to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token).
 
 - The SBERT paper shows that sentence embeddings derived in either of these two ways are often worse than averaged GloVe embeddings.
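
The two pooling strategies described above can be sketched without any deep learning framework. This is only a minimal illustration on hand-written token vectors; `mean_pool`, `cls_pool` and the toy values are hypothetical and are not part of BERT or sentence-transformers.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cls_pool(token_vectors):
    """Use the vector of the first token ([CLS]) as the sentence embedding."""
    return list(token_vectors[0])


# Toy "output layer": 3 token positions, 2 dimensions; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

mean_embedding = mean_pool(token_vectors, attention_mask)
cls_embedding = cls_pool(token_vectors)
print(mean_embedding, cls_embedding)
```

Both functions return one fixed-size vector regardless of sentence length, which is what makes the result usable as a sentence embedding.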
 
SBERT Overview: <br>
 - Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
 
- According to the paper, SBERT sentence embeddings significantly outperform other state-of-the-art sentence embedding methods such as InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018).

- SBERT starts from the pre-trained BERT and RoBERTa networks and only fine-tunes them to yield useful sentence embeddings. This significantly reduces the training time needed: SBERT can be tuned in less than 20 minutes while yielding better results than comparable sentence embedding methods.
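
Cosine similarity, the comparison measure mentioned above, depends only on the angle between two vectors and not on their length. The following framework-free sketch shows the computation; `cosine_similarity` is a hypothetical helper written for illustration (the notebook itself relies on the sentence-transformers `util` module later on).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Scores therefore range from -1 to 1, with values near 1 indicating embeddings that point in almost the same direction.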



https://www.sbert.net/examples/applications/semantic-search/README.html

### Command to install sentence-transformers


In [None]:
pip install -U sentence-transformers

In [None]:
!pip install swifter # for efficient application of pd.apply()

### Importing libraries

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch
import pandas as pd
import numpy as np
import re
import bs4
import swifter
from nltk.stem import WordNetLemmatizer

### **Data Loading**

In [None]:
%%time
!wget --header="Host: 34.125.119.108:5000" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: http://34.125.119.108:5000/edit/Final_df.csv" "http://34.125.119.108:5000/files/Final_df.csv?download=1" -c -O 'Final_df.csv'

In [None]:
%%time

df = pd.read_csv('./Final_df.csv')

In [None]:
df.shape, df.columns

### **Data Preprocessing Functions**

- Remove HTML tags from the question corpus
- Convert text to lowercase
- Text decontraction (e.g. "won't" becomes "will not")
- Remove every character that is not a letter, a space, an apostrophe, or one of '+', '.' and '#'. Note that the regular expression used in the code also drops digits. These punctuation marks are kept because we have tags such as c++, c#, .net and vb.net. If '+' and '#' were removed, all c# and c++ questions would look like questions about the 'C' programming language, which would be a disaster.

- Word lemmatization - every word is converted to its lemma (dictionary base form).
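
To see why '+', '#' and '.' are kept, the sketch below compares the character filter used in this notebook with the stricter `\W` filter (which appears only as a commented-out line in the preprocessing code). The sample title is made up for illustration.

```python
import re

title = "how to use c++ and c# with vb.net"

# Pattern used in this notebook: keep letters, spaces, apostrophes, '.', '+' and '#'.
kept = re.sub(r"[^a-zA-Z'.+# ]+", "", title)

# Stricter alternative: replace every non-word character with a space.
stripped = re.sub(r"\W", " ", title)

print(kept.split())
print(stripped.split())
```

With the stricter filter, both `c++` and `c#` collapse to the single token `c`, and `vb.net` splits into two unrelated tokens.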

In [None]:
# # https://stackoverflow.com/a/47091490/4084039
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    #phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase



def text_preprocessing(text):
    '''Preprocess a piece of text.
       Steps: removal of html tags,
       conversion to lowercase,
       decontraction,
       removal of every character other than letters, spaces,
       apostrophes, '.', '+' and '#', and lemmatization.
       
       Function takes one parameter - text
       returns - preprocessed text
    '''
    
    # Some titles (~42) start with '<' but do not have a closing '>'.
    # e.g. text = '<asp: RegularExpressionValidator and RegexOptions.IgnorePatternWhitespace'
    # BeautifulSoup returns an empty string for such text, so remove '<' before removing html tags from titles.
    text = text.replace("<","")
    # Remove html tags from question corpus
    text = bs4.BeautifulSoup(text, 'lxml').get_text()
    # Convert each word to lowercase
    text = text.lower()
    # Text decontraction, e.g. "won't" to "will not", "can't" to "can not"
    text = decontracted(text)
    # Keep only letters, spaces, apostrophes, '.', '+' and '#' (digits are dropped as well)
    #text = re.sub('\W', ' ',text).strip()
    text = re.sub("[^a-zA-Z'.+# ]+", '', text) # keeping + for c++, . for .net, vb.net etc., # for C#

    # Why lemmatization is chosen over stemming:
    # https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming
    # Lemmatization   
    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    text = text.strip()
    return text
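
The order of the rules inside `decontracted` matters: the specific pattern for "won't" has to run before the general "n't" pattern. The sketch below isolates just these two rules to show the difference; `specific_then_general` and `general_only` are hypothetical helpers for illustration, not replacements for the function above.

```python
import re


def specific_then_general(phrase):
    """Handle the irregular contraction first, then the general rule."""
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    return phrase


def general_only(phrase):
    """Apply only the general rule; irregular contractions come out wrong."""
    return re.sub(r"n't", " not", phrase)


print(specific_then_general("it won't compile"))
print(general_only("it won't compile"))
print(specific_then_general("it doesn't compile"))
```

Without the specific rule, "won't" would be expanded to "wo not", which the lemmatizer and the embedding model would then see as an unknown word.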

In [None]:
%%time
df['Cleaned_Titles'] = df['Title'].swifter.apply(text_preprocessing)

In [None]:
print("Original: ",df['Title'].iloc[2])
print("Cleaned: ",df['Cleaned_Titles'].iloc[2])
print("_____________________________________________________________")
print("Original: ",df['Title'].iloc[3])
print("Cleaned: ",df['Cleaned_Titles'].iloc[3])

print("_____________________________________________________________")
print("Original: ",df['Title'].iloc[1000])
print("Cleaned: ",df['Cleaned_Titles'].iloc[1000])

In [None]:
%%time
corpus = df['Cleaned_Titles'].values.tolist()

In [None]:
len(corpus)

In [None]:
import gc
del df
gc.collect()

## **Modelling**



https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

- The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models. The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.


#### **Pre-Trained Model Selection**

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

- The all-MiniLM-L6-v2 model is selected for the following reasons:
    - It was fine-tuned on datasets relevant to our problem statement, such as Stack Exchange, Reddit comments and Yahoo Answers.
    - It is faster than the larger all-mpnet-base-v2 model.
    - It is intended for semantic search.
    
    
#### **Model Background**
- The all-MiniLM-L6-v2 model was obtained by fine-tuning the pretrained nreimers/MiniLM-L6-H384-uncased model on a dataset of 1B sentence pairs. A contrastive learning objective was used: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.

- Formally, the cosine similarity of every possible sentence pair in the batch is computed, and a cross-entropy loss is then applied by comparing against the true pairs.
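
The objective described above can be sketched in plain Python. This is a simplified illustration, not the actual training code: `in_batch_contrastive_loss` is a hypothetical name, the embeddings are toy values, and the `scale` factor applied to the similarities is an assumption made here to sharpen the softmax.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the true partner of anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_norm - logits[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = in_batch_contrastive_loss(anchors, positives)
shuffled_loss = in_batch_contrastive_loss(anchors, positives[::-1])
print(aligned_loss, shuffled_loss)
```

The loss is small when each sentence is closest to its true partner and large when the partners are swapped, which is the signal that pulls paired sentences together during fine-tuning.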

## **Model Loading**

In [None]:
%%time
embedder = SentenceTransformer('all-MiniLM-L6-v2')

### **Extracting Sentence Embedding**

In [None]:
# %%time
# corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

### Saving sentence embeddings for future use

In [None]:
# import pickle
# with open("sentence-embeddings.pkl", "wb") as fOut:
#     pickle.dump({'sentences': corpus, 'embeddings': corpus_embeddings},fOut)

In [None]:
import pickle
with open("../input/sbert-embeddings/sentence-embeddings.pkl", "rb") as f:
    corpus_dict = pickle.load(f)

In [None]:
corpus_embeddings = corpus_dict['embeddings']
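
Because `corpus` is rebuilt from the CSV while the embeddings come from a pickle saved earlier, row i of the embeddings must still correspond to `corpus[i]`. Below is a minimal sanity check, assuming the pickle holds the `'sentences'` and `'embeddings'` keys written by the commented-out save cell; `check_alignment` is a hypothetical helper and the data shown is a toy stand-in.

```python
def check_alignment(corpus, corpus_dict):
    """Return True when stored sentences and embeddings line up with the corpus."""
    sentences = corpus_dict["sentences"]
    embeddings = corpus_dict["embeddings"]
    same_length = len(sentences) == len(embeddings) == len(corpus)
    return same_length and list(sentences) == list(corpus)


# Toy stand-ins for the real corpus and the loaded pickle
toy_corpus = ["sort a dictionary", "convert date to datetime"]
toy_dict = {"sentences": list(toy_corpus), "embeddings": [[0.1, 0.2], [0.3, 0.4]]}

aligned = check_alignment(toy_corpus, toy_dict)
misaligned = check_alignment(toy_corpus[::-1], toy_dict)
print(aligned, misaligned)
```

If the check fails, the printed titles in the search results would not match the embeddings that produced the scores.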

In [None]:
type(corpus_embeddings)

In [None]:
len(corpus_embeddings[0]) # each sentence embedding is 384-dimensional

### Observation:-
- Model 'all-MiniLM-L6-v2' gives 384-dimensional sentence embeddings.
- Each question title has been converted into a 384-dimensional embedding.

### **Model Performance**

In [None]:
def get_similar_questions(query, top_k=5):
    ''' Function to accept a user query and show the top_k most similar questions along with cosine similarity scores.
        Function accepts: query (text input) and top_k (number of results, 5 by default).
        Processing: text preprocessing of query, compute sentence embedding and rank the corpus by cosine similarity.
        Returns: None, prints the titles of similar questions and their cosine similarity scores.
    '''
    # Apply the same preprocessing that was applied to the corpus titles
    preprocessed_query = text_preprocessing(query)
    query_embedding = embedder.encode(preprocessed_query, convert_to_tensor=True)
    # Cosine similarity between the query and every corpus embedding
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]

    # torch.topk returns the top_k highest scores together with their corpus indices
    top_k = min(top_k, len(corpus))
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar questions in corpus:".format(top_k))

    for i, (score, idx) in enumerate(zip(top_results[0], top_results[1]), start=1):
        print("{}) ".format(i), corpus[idx], "(Score: {:.4f})".format(score))
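
What the top-k step in the function above returns can be illustrated in plain Python. This is not how PyTorch implements `torch.topk`; `top_k_scores` is a hypothetical helper built on the standard library's `heapq.nlargest`, and the scores are toy values.

```python
import heapq


def top_k_scores(scores, k=5):
    """Return (score, index) pairs for the k largest scores, best first."""
    k = min(k, len(scores))
    return heapq.nlargest(k, ((score, idx) for idx, score in enumerate(scores)))


toy_scores = [0.12, 0.91, 0.45, 0.88, 0.30]
best_two = top_k_scores(toy_scores, k=2)
print(best_two)
```

Only the k best candidates are tracked, so the full list of scores never has to be sorted.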

In [None]:
%%time
get_similar_questions('python sort dictionary')

In [None]:
%%time
get_similar_questions('CSS Performance')

In [None]:
%%time
get_similar_questions('python convert date to datetime')

In [None]:
%%time
get_similar_questions('how to create list of lists in python')

In [None]:
%%time
get_similar_questions('use groupingby with custom logic in java')

In [None]:
%%time
get_similar_questions('str(a) giving unicode error')

In [None]:
%%time
get_similar_questions('pd.melt() not working python')

In [None]:
%%time
get_similar_questions('try: 22/0 except Exception as e:print("Error! Code: {c}, Message, {m}".format(c = e.code, m = str(e))')

In [None]:
%%time
get_similar_questions('def main(): return {a:1, b:2}')

In [None]:
%%time
get_similar_questions('import KNN \
                       knn= KNN(n=4) \
                       knn.fit(Xtrain, ytrain)')

### **Observations:-**
- For query - 'python sort dictionary'
    - The top 5 similar questions retrieved have cosine similarity scores > 0.90, which indicates a very close semantic match.
    - All the questions are very similar to the query provided.

- For query - 'python convert date to datetime'
    - The key thing to notice is that the result set did not include the opposite question, such as conversion of datetime to date, whereas other techniques like avgw2vec faced this issue.
    
    
- For query - 'how to create list of lists in python'
    - The top most similar question is about a list of lists in a pythonic way, an indirect connection to list comprehension.
    - The avgw2v method had captured this connection as well.
    
 - Code related query - 'str(a) giving unicode error'
     - Results are impressive.
     - All top 5 questions were related to unicode errors.


-  Code related query: 'pd.melt() not working python'
    - The query was very specific to Python pandas, and the corpus may not have data related to it.
    - Still, the top similar question retrieved is very similar to the query.
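
One plausible reason for the date/datetime behaviour noted above is that averaging word vectors ignores word order, so a sentence and its reversed variant get the same embedding. The sketch below demonstrates this with made-up two-dimensional word vectors; `average_word_vectors` and `toy_vectors` are hypothetical and do not come from the earlier avgw2vec experiment.

```python
def average_word_vectors(sentence, word_vectors):
    """Average the vectors of the words in a sentence (word order is ignored)."""
    words = sentence.split()
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[w][d] for w in words) / len(words) for d in range(dim)]


# Toy vectors; the values are exact binary fractions so the sums are exact.
toy_vectors = {
    "convert": [0.25, 0.75],
    "date": [1.0, 0.0],
    "to": [0.0, 0.25],
    "datetime": [0.75, 0.5],
}

forward = average_word_vectors("convert date to datetime", toy_vectors)
reverse = average_word_vectors("convert datetime to date", toy_vectors)
print(forward, reverse)
```

A transformer-based encoder such as the one used here processes tokens with their positions, so the two directions can receive different embeddings.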

# **Final Conclusion**


- The best part of this technique is the time it takes to retrieve results. Results are retrieved within 2-3 seconds, which satisfies our business constraint of low latency.

    - The main driver of this fast retrieval is the torch.topk function, which is optimized to return the k largest values of a given tensor.

    - In contrast, the W2vec technique took around 4.5 minutes to retrieve the top 5 similar questions.
    
- The model size is only around 80 MB, which makes it practical to deploy.

- The model gave fairly good results.
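
The latency figures above come from single `%%time` runs. A slightly more robust measurement averages several calls; the sketch below shows one way to do that with the standard library. `average_latency` is a hypothetical helper, and `sorted` is only a stand-in workload (in this notebook one would pass `get_similar_questions` and a query instead).

```python
import time


def average_latency(fn, *args, repeats=3):
    """Call fn(*args) `repeats` times and return the mean wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


# Stand-in workload used purely for illustration
latency = average_latency(sorted, list(range(1000)), repeats=5)
print("mean latency: {:.6f} s".format(latency))
```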

