<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/embeddings/embeddings/Sentence%20Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [1]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/b9/46/b7d6c37d92d1bd65319220beabe4df845434930e3f30e42d3cfaecb74dc4/sentence-transformers-0.2.6.1.tar.gz (55kB)
Collecting transformers>=2.8.0
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)

## Sentence Transformer

BERT has set a new state-of-the-art
performance on sentence-pair regression tasks
like semantic textual similarity (STS). However, it requires that both sentences are fed
into the network, which causes a massive computational overhead.

Finding the pair with the highest similarity in a collection of n = 10,000 sentences requires $n(n-1)/2$ = 49,995,000 inference computations with BERT. On a modern V100 GPU, this takes about 65 hours. Similarly, finding which of the over 40 million existing questions on Quora is most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would require over 50 hours.
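The pair count quoted for 10,000 sentences is simply the number of unordered pairs. A quick check, where the helper name `num_pairs` is only illustrative:

```python
def num_pairs(n: int) -> int:
    """Number of unordered sentence pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


pairs = num_pairs(10_000)
print(pairs)  # 49995000
```

Because the count grows quadratically, doubling the collection size roughly quadruples the number of BERT forward passes.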

The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks
like clustering.

A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings. The most commonly used approaches are to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). This common practice yields rather poor sentence embeddings, often worse than averaging GloVe embeddings.
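As a rough illustration of the two pooling strategies, here is a minimal sketch on a toy matrix of token vectors. The numbers and the function names `mean_pool` and `cls_pool` are made up for illustration; real BERT output would have 768 dimensions per token.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average the token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cls_pool(token_vectors: List[List[float]]) -> List[float]:
    """Use the vector of the first token ([CLS]) as the sentence vector."""
    return list(token_vectors[0])


# Toy stand-in for BERT output: 3 tokens ([CLS] + 2 word pieces), 4 dimensions each.
toy_output = [
    [1.0, 0.0, 2.0, 4.0],  # [CLS]
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]

mean_vec = mean_pool(toy_output)  # [2.0, 2.0, 1.0, 2.0]
cls_vec = cls_pool(toy_output)    # [1.0, 0.0, 2.0, 4.0]
```

Both strategies return a vector of the same size regardless of how many tokens the sentence has, which is what makes the result usable as a fixed-size sentence embedding.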

To alleviate this issue, a new architecture called **SBERT** was introduced. Its siamese network architecture makes it possible to derive fixed-size vectors for input sentences. Using a similarity measure like cosine similarity or Manhattan / Euclidean distance, semantically similar sentences can be found. These similarity measures can be computed extremely efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as for clustering.
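The three similarity measures mentioned here are cheap to compute once the sentence vectors exist. A minimal pure-Python sketch (the function names are illustrative, and in practice a vectorized library would be used instead):

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v (L2 distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [3.0, 4.0]
v = [4.0, 3.0]
print(cosine_similarity(u, v))   # 0.96
print(manhattan_distance(u, v))  # 2.0
print(euclidean_distance(u, v))  # sqrt(2), about 1.414
```

Note that cosine similarity is higher for more similar vectors, while the two distances are lower.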

![](https://drive.google.com/uc?id=1TH-juL431ykPCe_415ezcwf4Vo_KVGw1)


The time needed to find the most similar sentence pair in a collection of 10,000 sentences is reduced from 65 hours with BERT to the computation of 10,000 sentence embeddings (~5 seconds with SBERT) plus computing cosine similarity (~0.01 seconds). By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds.
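The speed-up comes from encoding each sentence once and then comparing plain vectors. A minimal brute-force sketch of that second step, using toy 2-dimensional vectors in place of real sentence embeddings (the function `most_similar_pair` is hypothetical, and no optimized index structure is used here):

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings: List[Sequence[float]]) -> Tuple[int, int, float]:
    """Return (i, j, score) for the pair with the highest cosine similarity."""
    best = None
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if best is None or score > best[2]:
            best = (i, j, score)
    return best


# Toy vectors standing in for precomputed sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
i, j, score = most_similar_pair(toy_embeddings)
print(i, j)  # 0 2
```

The loop still visits every pair, but each step is a handful of arithmetic operations rather than a full BERT forward pass.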

- [Sentence BERT Paper](https://arxiv.org/pdf/1908.10084.pdf)
- [Sentence Transformers Repo](https://github.com/UKPLab/sentence-transformers)

There are many ways to use sentence-transformers: with pre-trained models, by training on a custom dataset, or with LSTM, CNN, or word-embedding (GloVe) encoders. We will explore a few of them.

### Using Pre-trained Models

There are many pre-trained models available, and we can also use our own custom-trained model. Some of the available pre-trained models are:

Natural Language Inference (NLI): Given two sentences, the model should classify whether they entail, contradict, or are neutral to each other.

These models were trained on the SNLI and MultiNLI datasets to create universal sentence embeddings.

- **`bert-base-nli-mean-tokens`**: BERT-base model with mean-tokens pooling. Performance: STSbenchmark: 77.12
- **`bert-large-nli-mean-tokens`**: BERT-large with mean-tokens pooling. Performance: STSbenchmark: 79.19
- **`roberta-base-nli-mean-tokens`**: RoBERTa-base with mean-tokens pooling. Performance: STSbenchmark: 77.49
- **`roberta-large-nli-mean-tokens`**: RoBERTa-large with mean-tokens pooling. Performance: STSbenchmark: 78.69
- **`distilbert-base-nli-mean-tokens`**: DistilBERT-base with mean-tokens pooling. Performance: STSbenchmark: 76.97

These models were first fine-tuned on the AllNLI dataset, then on the train set of the STS benchmark. They are especially well suited for semantic textual similarity.

- **`bert-base-nli-stsb-mean-tokens`**: Performance: STSbenchmark: 85.14
- **`bert-large-nli-stsb-mean-tokens`**: Performance: STSbenchmark: 85.29
- **`roberta-base-nli-stsb-mean-tokens`**: Performance: STSbenchmark: 85.44
- **`roberta-large-nli-stsb-mean-tokens`**: Performance: STSbenchmark: 86.39
- **`distilbert-base-nli-stsb-mean-tokens`**: Performance: STSbenchmark: 84.38

For more information, refer to this [link](https://github.com/UKPLab/sentence-transformers#english-pre-trained-models).

In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:55<00:00, 7.29MB/s]


In [0]:
sentences = ['This framework generates embeddings for each input sentence',
    'This is a hot sunny day', 
    'Machine learning is awesome. It is necessary for each and everyone to learn about it.']
sentence_embeddings = model.encode(sentences)

In [5]:
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding.shape)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: (768,)

Sentence: This is a hot sunny day
Embedding: (768,)

Sentence: Machine learning is awesome. It is necessary for each and everyone to learn about it.
Embedding: (768,)



### Training the SBERT

First, a BERT model (instantiated from bert-base-uncased) maps the tokens in a sentence to BERT output embeddings. The next layer of the model is a pooling model; in this case, mean-pooling is used. You can also perform max-pooling or use the embedding of the CLS token, and you can combine multiple pooling modes.

These two modules (word_embedding_model and pooling_model) form our SentenceTransformer. Each sentence is passed first through the word_embedding_model and then through the pooling_model to give a fixed-size sentence vector.
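To make "combining multiple poolings" concrete, here is a toy sketch that concatenates a mean-pooled and a max-pooled vector. It assumes that combined modes are joined by concatenation, so the sentence vector is twice as wide as a token vector; the function names and numbers are illustrative only.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise average over all token vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def max_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum over all token vectors."""
    return [max(column) for column in zip(*token_vectors)]


def mean_max_pool(token_vectors: List[List[float]]) -> List[float]:
    """Concatenate the mean-pooled and max-pooled vectors."""
    return mean_pool(token_vectors) + max_pool(token_vectors)


# Two toy token vectors with 2 dimensions each.
toy_tokens = [[1.0, 4.0], [3.0, 0.0]]
combined = mean_max_pool(toy_tokens)
print(combined)  # [2.0, 2.0, 3.0, 4.0]
```

Any layer that follows such a combined pooling step has to expect the larger input dimension.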

In [0]:
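# Illustrative sketch of the training flow. It needs the AllNLI and STSbenchmark
# datasets under datasets/ and is not meant to be run as-is in this notebook.
# For actual running code, refer here: https://github.com/UKPLab/sentence-transformers#training
import math
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import NLIDataReader, STSBenchmarkDataReader

# Assumed hyperparameters and output path (not given in the original snippet)
batch_size = 16
num_epochs = 1
model_save_path = 'output/training_nli'

nli_reader = NLIDataReader('datasets/AllNLI')
train_num_labels = nli_reader.get_num_labels()

train_data = SentencesDataset(nli_reader.get_examples('train.gz'), model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels)

# Warm up the learning rate over the first 10% of training steps (assumed choice)
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)

# evaluation on different dataset
sts_reader = STSBenchmarkDataReader('datasets/stsbenchmark')
dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

# training pipeline
model.fit(train_objectives=[(train_dataloader, train_loss)],
         evaluator=evaluator,
         epochs=num_epochs,
         evaluation_steps=1000,
         warmup_steps=warmup_steps,
         output_path=model_save_path
         )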

### Using Custom BERT models

In [0]:
from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.Transformer('path/to/your/BERT/model')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

### Using GloVe embeddings instead of BERT for tokens

We need to change the word_embedding_model. Instead of BERT, we specify the GloVe embeddings.

In [0]:
# Map tokens to traditional word embeddings like GloVe
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

# Add two trainable feed-forward networks (DAN)
sent_embeddings_dimension = pooling_model.get_sentence_embedding_dimension()
dan1 = models.Dense(in_features=sent_embeddings_dimension, out_features=sent_embeddings_dimension)
dan2 = models.Dense(in_features=sent_embeddings_dimension, out_features=sent_embeddings_dimension)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dan1, dan2])

### Using LSTM model 

We just need to add an LSTM model before the pooling layer.

In [0]:
# Map tokens to traditional word embeddings like GloVe
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

lstm = models.LSTM(word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(), hidden_dim=1024)

# Apply max pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(lstm.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=False,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=True)


model = SentenceTransformer(modules=[word_embedding_model, lstm, pooling_model])


### Using CNN

This is similar to the above configuration, with the LSTM model replaced by a CNN.

In [0]:
# Map tokens to vectors using BERT
word_embedding_model = models.BERT('bert-base-uncased')

cnn = models.CNN(in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(), out_channels=256, kernel_sizes=[1,3,5])

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(cnn.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)


model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model])

**NOTE**: All the above mentioned snippets are taken from [sentence-transformers](https://github.com/UKPLab/sentence-transformers/)  repository. Please refer to that if you need more details.