#### Topics covered:
- Sentence Embedding
- Sentence Similarity
- Semantic Search
- Clustering

Resources:
- https://www.sbert.net/index.html
- https://www.sbert.net/docs/pretrained_models.html

In [None]:
%pip install -U sentence-transformers

# sentence-transformers (SBERT) is a Python library, built on PyTorch, for computing sentence embeddings (dense sentence vectors).

**Generate Embeddings**

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# all-MiniLM-L6-v2 is a compact sentence-embedding model based on MiniLM (6 transformer layers instead of 12).
# It maps each sentence to a 384-dimensional vector and, according to its model card, was fine-tuned on more than one billion sentence pairs.
# The SBERT documentation describes it as a fast model with good quality; larger models such as all-mpnet-base-v2 are listed as higher quality.

import numpy as np


In [4]:
sentences = ['This framework generates embeddings for each input sentence','Sentences are passed as a list of string.', 'The quick brown fox jumps over the lazy dog.']

embeddings = model.encode(sentences, convert_to_tensor=True)

for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: tensor([-1.3717e-02, -4.2852e-02, -1.5629e-02,  1.4054e-02,  3.9554e-02,
         1.2180e-01,  2.9433e-02, -3.1752e-02,  3.5496e-02, -7.9314e-02,
         1.7588e-02, -4.0437e-02,  4.9726e-02,  2.5491e-02, -7.1870e-02,
         8.1497e-02,  1.4707e-03,  4.7963e-02, -4.5034e-02, -9.9217e-02,
        -2.8177e-02,  6.4505e-02,  4.4467e-02, -4.7622e-02, -3.5295e-02,
         4.3867e-02, -5.2857e-02,  4.3304e-04,  1.0192e-01,  1.6407e-02,
         3.2700e-02, -3.4599e-02,  1.2134e-02,  7.9487e-02,  4.5834e-03,
         1.5778e-02, -9.6821e-03,  2.8763e-02, -5.0581e-02, -1.5579e-02,
        -2.8791e-02, -9.6228e-03,  3.1556e-02,  2.2735e-02,  8.7145e-02,
        -3.8503e-02, -8.8472e-02, -8.7550e-03, -2.1234e-02,  2.0892e-02,
        -9.0208e-02, -5.2573e-02, -1.0564e-02,  2.8831e-02, -1.6145e-02,
         6.1784e-03, -1.2323e-02, -1.0734e-02,  2.8335e-02, -5.2857e-02,
        -3.5862e-02, -5.9799e-02, -1.0906e-

In [6]:
text1 = """
Gradient descent is an optimization algorithm which is commonly-used to train machine learning models and neural networks. Training data helps these models learn over time, and the cost function within gradient descent specifically acts as a barometer, gauging its accuracy with each iteration of parameter updates.
"""
text2 = """
Gradient descent (GD) is not an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning(DL) to minimise a cost/loss function (e.g. in a linear regression). Due to its importance and ease of implementation, this algorithm is usually taught at the beginning of almost all machine learning courses.
"""
text3 = """
Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems. Specific applications of AI include expert systems, natural language processing, speech recognition and machine vision.
"""

text4 = """
Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language. Natural language processing has the ability to interrogate the data with natural language text or voice.
"""

text5 = """
Gradient Descent is known as one of the most commonly used optimization algorithms to train machine learning models by means of minimizing errors between actual and expected results. Further, gradient descent is also used to train Neural Networks.

In mathematical terminology, Optimization algorithm refers to the task of minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize the convex function using iteration of parameter updates. Once these machine learning models are optimized, these models can be used as powerful tools for Artificial Intelligence and various computer science applications.
"""

# This is the summarized version of text5
text6 = """
 Gradient Descent is an optimization algorithm used to train machine learning models by minimizing errors between actual and expected results. It is also used to train Neural Networks and minimize the cost function parameterized by the model's parameters.
"""

text7 = """
The Fear Nothing Blood Test is able to give you an accurate understanding of your health by checking several key health indicators. The standard Fear Nothing Blood Test can tell you about your: Vitamin levels. Hormone levels. Liver health.
"""

text8 = """
This subject only gives a brief description about different types of materials used in building construction for members like foundation, masonry, arches, lintels, balcony, roof, floor, doors, windows, stairs, plastering, painting and other general topics. Properties of various construction materials, their uses and different applications are discussed in this subject. 
"""

text9="""
Initial setting time for ideal cement mix is around 30 minutes for almost all kind of cements. For masonry cement it can be 90 minutes. Final setting time of ideal cement mix should be 10 hours at max. For masonry cement it shouldn’t exceed 24 hours.
"""

t1 = 'He likes to play.'
t2 = 'He does not like to play.'

# emd2 = model.encode("Tom deserves unbiased judgement")

In [9]:
len(text5.split())

114

In [7]:
emd1 = model.encode(t1)
emd2 = model.encode(t2)

**Cosine Similarity**

In [8]:
cos_similarity = util.cos_sim(emd1, emd2)
cos_similarity

tensor([[0.6760]])
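
For reference, `util.cos_sim` computes the standard cosine similarity: the dot product of the two vectors divided by the product of their lengths. Below is a minimal plain-Python sketch on toy 2-D vectors; the vectors and the `cosine_similarity` helper are illustrative and not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The score ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction). The 0.6760 obtained for `t1` and `t2` suggests that the embeddings mostly capture that both sentences are about the same topic, even though one negates the other.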

**Jaccard Similarity**

In [13]:
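def jaccard_similarity(list1, list2):
    # Work on sets so duplicate tokens cannot inflate the union
    set1, set2 = set(list1), set(list2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    # Two empty inputs have no tokens to compare; return 0.0 instead of dividing by zero
    return intersection / union if union else 0.0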

In [15]:
similarity_score = jaccard_similarity(set(text1.lower().split()), set(text2.lower().split()))

similarity_score

0.1891891891891892
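
To see what the score measures, here is a small worked example on the two short sentences `t1` and `t2` defined earlier. It is a sketch that defines its own helper (`jaccard`, an illustrative name) and uses naive whitespace tokenization, so punctuation stays attached to words (`play.`).

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

t1 = 'He likes to play.'
t2 = 'He does not like to play.'

# Shared tokens: {'he', 'to', 'play.'}; 7 unique tokens overall
score = jaccard(t1.lower().split(), t2.lower().split())
print(score)  # 3 / 7
```

Because Jaccard only counts exact token overlap, `likes` and `like` do not match, and the word `not` is just one more non-shared token. The measure is purely lexical, unlike the embedding-based cosine similarity above.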

#### Manhattan Distance

In [None]:
def manhattan_distance(x, y):
    return np.sum(np.abs(x - y))
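
The function above is not called anywhere in this notebook, so here is a hedged usage sketch. It uses a plain-Python equivalent (`manhattan`, an illustrative name) on toy vectors so the arithmetic can be checked by hand; with real embeddings you would pass the NumPy arrays that `model.encode` returns by default to the NumPy version.

```python
def manhattan(x, y):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, 3.0]

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
distance = manhattan(u, v)
print(distance)  # 5.0
```

Unlike cosine similarity, this is a distance: 0 means identical vectors and larger values mean less similar.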

**Compute Cosine Similarity Between all pairs**

In [None]:
sentences = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'A man is riding a horse.',
    'The girl is carrying a baby.',
    'A woman is playing violin.',
    'two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]

In [None]:
# Encode sentences to get their embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine similarity between all pairs
cosine_scores = util.cos_sim(embeddings, embeddings)

In [None]:
# add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        all_sentence_combinations.append([cosine_scores[i][j], i, j])

# Sort in decreasing order of the cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

# x[0] (the cosine score) is the sort key

In [None]:
# print the pairs according to their cosine similarity score
print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t\t {} \t\t {:.4f}".format(sentences[i], sentences[j], cosine_scores[i][j]))
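
The nested loops above visit each unordered pair once, which corresponds to the upper triangle of the similarity matrix. The same enumeration can be sketched with `itertools.combinations`; the 3x3 `scores` matrix below is made-up data standing in for `cosine_scores`.

```python
from itertools import combinations

# Toy symmetric similarity matrix (made-up values, not model output)
scores = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

# Each unordered pair (i, j) with i < j appears exactly once
pairs = [(scores[i][j], i, j) for i, j in combinations(range(len(scores)), 2)]

# Highest similarity first
pairs.sort(key=lambda p: p[0], reverse=True)
print(pairs)  # [(0.8, 0, 1), (0.3, 1, 2), (0.1, 0, 2)]
```

Skipping the diagonal and the lower triangle avoids reporting each sentence as most similar to itself and avoids listing every pair twice.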

In [None]:
#### Using Hugging Face Transformers

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
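
To make the pooling step concrete, the following sketch repeats mask-aware mean pooling and L2 normalization in plain Python on one made-up sentence with two real tokens and one padding token. The numbers and helper names are illustrative only.

```python
import math

# One "sentence": 3 token vectors of size 2; the last token is padding
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]  # 1 = real token, 0 = padding

def masked_mean(tokens, mask):
    """Average token vectors, ignoring positions where mask is 0."""
    dim = len(tokens[0])
    count = max(sum(mask), 1e-9)  # guard against an all-padding row
    return [sum(t[d] * m for t, m in zip(tokens, mask)) / count for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

pooled = masked_mean(token_embeddings, attention_mask)  # padding is ignored
unit = l2_normalize(pooled)
print(pooled)  # [2.0, 3.0]
print(unit)
```

Without the mask, the padding vector would drag the average to roughly [34.7, 35.3], which is why the attention mask is applied before averaging.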


#### Semantic Search

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('clips/mfaq')

In [None]:
question = "<Q>How many models can I host on Hugging Face?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

In [None]:
query_embedding = model.encode(question)
corpus_embeddings = model.encode([answer_1, answer_2, answer_3])

results = util.semantic_search(query_embedding, corpus_embeddings)
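
`util.semantic_search` returns one list per query, each containing dictionaries with a `corpus_id` and a `score`, sorted by decreasing score. The sketch below mimics that output shape with a plain-Python cosine ranking over made-up 2-D vectors; `toy_semantic_search` is a hypothetical helper, not the library implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_semantic_search(query_vec, corpus_vecs, top_k=2):
    """Rank corpus vectors by cosine similarity to the query (hypothetical helper)."""
    hits = [{"corpus_id": i, "score": cosine(query_vec, vec)}
            for i, vec in enumerate(corpus_vecs)]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Made-up vectors standing in for the encoded question and answers
query = [1.0, 0.0]
corpus = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

hits = toy_semantic_search(query, corpus)
best_ids = [hit["corpus_id"] for hit in hits]
print(best_ids)  # [0, 2]
```

With the real call, `results[0][0]['corpus_id']` is an index into the list passed as the corpus, so it can be mapped back to `answer_1`, `answer_2`, or `answer_3`.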

#### Question Answering

Question answering builds on the previous tasks. The sequence is:
1. Rank the existing documents by semantic similarity to the question.
2. Run semantic search over the ranked documents.
3. Take the top document and pass it as context to the QA step.
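
The sequence above is a retrieve-then-read pattern. Here is a minimal sketch, assuming a toy word-overlap scorer in place of the embedding-based search and a stub in place of the QA model; `overlap_score`, `answer_question`, and `stub_reader` are all hypothetical placeholders.

```python
def overlap_score(question, document):
    """Toy relevance score: number of shared lowercase words (stand-in for semantic search)."""
    return len(set(question.lower().split()) & set(document.lower().split()))

def answer_question(question, documents, reader):
    # Steps 1-2: rank the documents by relevance to the question
    ranked = sorted(documents, key=lambda doc: overlap_score(question, doc), reverse=True)
    # Step 3: pass the top document to the reader as context
    top_document = ranked[0]
    return reader(question=question, context=top_document)

def stub_reader(question, context):
    """Placeholder reader that returns the whole context as the answer."""
    return {"answer": context}

documents = [
    "All plans come with unlimited private models and datasets.",
    "AutoNLP is an automatic way to train and deploy NLP models.",
]
result = answer_question("How many private models come with the plans?", documents, stub_reader)
print(result["answer"])
```

In this notebook, the `reader` argument would be the `pipeline("question-answering")` object used in the next cells, which accepts `question` and `context` keyword arguments.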

In [None]:
from transformers import pipeline

In [None]:
qa_model = pipeline("question-answering")
question = "How many models can I host on HuggingFace?"
context = "All plans come with unlimited private models and datasets."
qa_model(question = question, context = context)

#### Clustering

In [None]:
from sklearn.cluster import KMeans
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'Horse is eating grass.',
          'A man is eating pasta.',
          'A Woman is eating Biryani.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'The cheetah is chasing a man who is riding the horse.',
          'man and women with their baby are watching cheetah in zoo'
          ]
corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
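
Normalizing matters because KMeans groups points by Euclidean distance. For unit-length vectors, squared Euclidean distance equals 2 - 2 * cos, so clustering normalized embeddings behaves like clustering by cosine similarity. A quick plain-Python check on made-up vectors (the `normalize` helper is illustrative):

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Made-up vectors standing in for two sentence embeddings
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

cos_ab = sum(x * y for x, y in zip(a, b))           # dot product of unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance

print(cos_ab, sq_dist, 2 - 2 * cos_ab)
```

The last two printed values agree up to floating-point rounding, which is the identity the normalization step relies on.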

In [None]:
corpus_embeddings[0]

In [None]:
# source: https://stackoverflow.com/questions/55619176/how-to-cluster-similar-sentences-using-bert

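# Fix n_init and random_state so the cluster assignment is reproducible across runs
clustering_model = KMeans(n_clusters=4, n_init=10, random_state=0)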
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

In [None]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])
clustered_sentences
