<a href="https://colab.research.google.com/github/mobarakol/3D_Attention_UNet/blob/main/Sentence_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



    Doc2Vec
    SentenceBERT
    InferSent
    Universal Sentence Encoder






Step 1:

First, import the libraries and download the NLTK 'punkt' tokenizer data.

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Step 2:

Next, we define our list of sentences. You can use a larger list; keeping the text as a list of sentences makes it easier to process each sentence individually.

In [None]:
sentences = ["I ate dinner.", 
       "We had a three-course meal.", 
       "Brad came to dinner with us.",
       "He loves fish tacos.",
       "In the end, we all felt like we ate too much.",
       "We all agreed; it was a magnificent evening."]

Step 3:

We will also keep a tokenized version of these sentences.

In [None]:
# Tokenization of each document
tokenized_sent = []
for s in sentences:
    tokenized_sent.append(word_tokenize(s.lower()))
tokenized_sent

[['i', 'ate', 'dinner', '.'],
 ['we', 'had', 'a', 'three-course', 'meal', '.'],
 ['brad', 'came', 'to', 'dinner', 'with', 'us', '.'],
 ['he', 'loves', 'fish', 'tacos', '.'],
 ['in',
  'the',
  'end',
  ',',
  'we',
  'all',
  'felt',
  'like',
  'we',
  'ate',
  'too',
  'much',
  '.'],
 ['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.']]

Step 4:

Finally, we define a function that returns the cosine similarity between two vectors.

In [None]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
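
As a quick sanity check of the function above, we can try it on a few toy vectors (the values are made up for illustration). Vectors pointing the same way should score close to 1, orthogonal vectors close to 0, and opposite vectors close to -1.

```python
import numpy as np

def cosine(u, v):
    # Same definition as in the cell above.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors chosen so the expected similarity is easy to see.
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)
```

Note that this definition divides by the product of the norms, so it is undefined when either vector is all zeros.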

**Doc2Vec**

Step 1:

We will use Gensim to demonstrate Doc2Vec. We already have a list of sentences, so we first import the model and then build a tagged sentence corpus. Each sentence is then represented as a TaggedDocument containing the list of its words and a tag associated with it.

In [None]:
# import
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
tagged_data

[TaggedDocument(words=['i', 'ate', 'dinner', '.'], tags=[0]),
 TaggedDocument(words=['we', 'had', 'a', 'three-course', 'meal', '.'], tags=[1]),
 TaggedDocument(words=['brad', 'came', 'to', 'dinner', 'with', 'us', '.'], tags=[2]),
 TaggedDocument(words=['he', 'loves', 'fish', 'tacos', '.'], tags=[3]),
 TaggedDocument(words=['in', 'the', 'end', ',', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much', '.'], tags=[4]),
 TaggedDocument(words=['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.'], tags=[5])]
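
To make the structure of the tagged corpus concrete, here is a minimal sketch that mimics it with a plain namedtuple. `TaggedDoc` is a hypothetical stand-in used only for illustration, not Gensim's class; the point is that each tag lets us map a result back to its sentence.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a list of words plus a list of tags.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Two tokenized sentences copied from the output above.
tokenized = [
    ['i', 'ate', 'dinner', '.'],
    ['he', 'loves', 'fish', 'tacos', '.'],
]

# Tag each sentence with its position in the list, as in the cell above.
tagged = [TaggedDoc(words, [i]) for i, words in enumerate(tokenized)]

# Build a lookup from tag to words so a tag can be turned back into a sentence.
by_tag = {doc.tags[0]: doc.words for doc in tagged}
print(by_tag[1])
```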

Step 2:

We then train the model with the following parameters:

In [None]:
## Train doc2vec model
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)

# vector_size = Dimensionality of the feature vectors.
# window = The maximum distance between the current and predicted word within a sentence.
# min_count = Ignores all words with total frequency lower than this.
# epochs = Number of passes over the corpus during training.
# alpha = The initial learning rate (not set above, so the Gensim default is used).

## Print model vocabulary
# Note: wv.vocab is the Gensim 3.x API; in Gensim 4.x use model.wv.key_to_index instead.
model.wv.vocab

{',': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b853d0>,
 '.': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b81f10>,
 ';': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85590>,
 'a': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85090>,
 'agreed': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85550>,
 'all': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85410>,
 'ate': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b81e90>,
 'brad': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b81fd0>,
 'came': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85110>,
 'dinner': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b81ed0>,
 'end': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85390>,
 'evening': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85690>,
 'felt': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85450>,
 'fish': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85290>,
 'had': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85050>,
 'he': <gensim.models.keyedvectors.Vocab at 0x7fd3d8b85210>,
 'i
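
The effect of min_count can be illustrated without training a model. The sketch below only counts words with the standard library; `filter_vocab` is a hypothetical helper written for this illustration, not Gensim's implementation. With min_count = 1, as used above, every word is kept.

```python
from collections import Counter

# Tokenized sentences copied from the output earlier in this notebook.
corpus = [
    ['i', 'ate', 'dinner', '.'],
    ['we', 'had', 'a', 'three-course', 'meal', '.'],
    ['brad', 'came', 'to', 'dinner', 'with', 'us', '.'],
]

def filter_vocab(tokenized_docs, min_count):
    """Return the sorted words whose total corpus frequency is >= min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return sorted(word for word, freq in counts.items() if freq >= min_count)

kept_all = filter_vocab(corpus, min_count=1)       # nothing is dropped
kept_frequent = filter_vocab(corpus, min_count=2)  # only repeated words survive

print(len(kept_all))
print(kept_frequent)
```

On a corpus this small, raising min_count removes almost the whole vocabulary, which is one reason to keep it at 1 here.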

Step 3:

We now take a new test sentence and find the top 5 most similar sentences from our data, displayed in order of decreasing similarity. The infer_vector method returns the vectorized form of the test sentence (including the paragraph vector). The most_similar method returns the tags of the most similar sentences together with their similarity scores.

In [None]:
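test_doc = word_tokenize("I had pizza and pasta".lower())
test_doc_vector = model.infer_vector(test_doc)

# positive = List of sentences (here, their vectors) that contribute positively.
# topn = Number of most similar sentences to return.
# Keep this call as the last expression so the notebook displays its result.
# Note: docvecs is the Gensim 3.x name; in Gensim 4.x use model.dv instead.
model.docvecs.most_similar(positive = [test_doc_vector], topn = 5)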
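
Conceptually, this kind of lookup ranks the stored sentence vectors by their cosine similarity to the query vector. The following is a minimal NumPy sketch of that idea with made-up 2-D vectors; this `most_similar` function is a hypothetical illustration, not Gensim's implementation.

```python
import numpy as np

def most_similar(query_vec, doc_vecs, topn=5):
    """Return (tag, cosine similarity) pairs sorted by decreasing similarity."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every row of doc_vecs at once.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Sort by descending similarity and keep the first topn tags.
    order = np.argsort(-sims)[:topn]
    return [(int(tag), float(sims[tag])) for tag in order]

# Toy document vectors (made up for illustration); the row index plays the role of the tag.
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = most_similar([1.0, 0.2], toy_doc_vecs, topn=2)
print(ranked)
```

Here the query points mostly along the first axis, so tag 0 ranks first and tag 2 second.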

**SentenceBERT**

SentenceBERT, introduced in 2019, quickly took a leading position among sentence embedding methods. At the heart of this BERT-based model are 4 key concepts:

    Attention
    Transformers
    BERT
    Siamese Network


In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.11.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596

Step 1:

We will then load the pre-trained BERT model. There are many other pre-trained models available. You can find the full list of models here (https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md).

In [None]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

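
The "mean-tokens" part of the model name refers, as we understand it, to mean pooling: the sentence vector is the average of the token embeddings produced by BERT, with padding positions left out. Below is a minimal NumPy sketch of that pooling step on made-up values; `mean_pool` is a hypothetical helper, not the library's code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    # Turn the mask into a column so it broadcasts over the embedding dimension.
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Three toy token vectors; the last one is padding and must not affect the result.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sentence_vec = mean_pool(toy_tokens, toy_mask)
print(sentence_vec)
```

Only the first two rows contribute, so the pooled vector is their average.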

Step 2:

We will then encode the provided sentences. We can also display the sentence vectors.

In [None]:
sentence_embeddings = sbert_model.encode(sentences)

print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))
print('Sample BERT embedding vector - note includes negative values', sentence_embeddings[0])

Sample BERT embedding vector - length 768
Sample BERT embedding vector - note includes negative values [ 1.71653464e-01  2.26479676e-02  1.93014610e+00 -2.18605757e-01
  3.40927206e-02  5.23479164e-01 -1.25486827e+00  9.17776406e-01
 -3.36900681e-01 -5.32600343e-01  3.54305148e-01  8.74753714e-01
  7.92069674e-01  1.30202487e-01  3.55893862e-03  5.88224716e-02
  5.33439338e-01 -2.87108332e-01  1.44994453e-01 -8.17292988e-01
 -2.01645549e-02  1.39119014e-01 -9.96068954e-01  1.77107528e-01
  4.00362909e-02  4.30928320e-01 -2.58669794e-01  4.17331129e-01
  1.20064068e+00  9.88089517e-02 -2.55319804e-01 -1.98270474e-02
  8.69255066e-01 -8.33961904e-01  1.74719855e-01 -7.85964668e-01
 -1.65306181e-01  2.84225255e-01 -4.37051684e-01  6.77007616e-01
 -3.97849768e-01  1.01804636e-01  7.46632993e-01  4.15896505e-01
 -2.42900550e-02  2.68644720e-01  1.04453206e+00  1.43222368e+00
  4.57947522e-01 -1.08110821e+00  9.75882530e-01 -1.14395499e+00
 -2.62539059e-01  6.03446484e-01 -5.55950642e-01  1.

Step 3:

Then we will define a test query and encode it as well:

In [None]:
query = "I had pizza and pasta"
query_vec = sbert_model.encode([query])[0]

Step 4:

We will then compute the cosine similarity using the cosine function defined earlier. We will retrieve the similarity values between the sentences and our test query:

In [None]:
for sent in sentences:
  sim = cosine(query_vec, sbert_model.encode([sent])[0])
  print("Sentence = ", sent, "; similarity = ", sim)

Sentence =  I ate dinner. ; similarity =  0.7173462
Sentence =  We had a three-course meal. ; similarity =  0.6371338
Sentence =  Brad came to dinner with us. ; similarity =  0.5897909
Sentence =  He loves fish tacos. ; similarity =  0.6223936
Sentence =  In the end, we all felt like we ate too much. ; similarity =  0.4198049
Sentence =  We all agreed; it was a magnificent evening. ; similarity =  0.18081607
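
To read the results as a ranking, we can sort the sentences by similarity. The sketch below reuses the scores printed above rather than recomputing them, so it runs without loading the model.

```python
# Similarity scores copied from the output above.
results = [
    ("I ate dinner.", 0.7173462),
    ("We had a three-course meal.", 0.6371338),
    ("Brad came to dinner with us.", 0.5897909),
    ("He loves fish tacos.", 0.6223936),
    ("In the end, we all felt like we ate too much.", 0.4198049),
    ("We all agreed; it was a magnificent evening.", 0.18081607),
]

# Sort by similarity, highest first.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)

for sent, sim in ranked:
    print(f"{sim:.3f}  {sent}")
```

The food-related sentences end up at the top, while the sentence about the evening, which does not mention food, comes last.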
