### What are sentence embeddings?
Sentence embeddings map sentences to vectors of real numbers, giving a representation of text that is suitable for machine learning. Similarity measures such as cosine similarity or Manhattan/Euclidean distance then score the semantic textual similarity between two embeddings, and those scores can be used for a variety of NLP tasks, including information retrieval, paraphrase identification, and text summarization.
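
To make the similarity measures concrete, here is a minimal sketch in plain Python. The three vectors are made-up toy values standing in for sentence embeddings, and the function names are my own, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy 3-dimensional "embeddings" (illustrative values, not model output)
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]     # same direction as u, twice the length
w = [-1.0, -2.0, -3.0]  # opposite direction to u

print(cosine_similarity(u, v), cosine_similarity(u, w))
print(euclidean_distance(u, v), manhattan_distance(u, v))
```

Note that cosine similarity ignores vector length (u and v point the same way, so they score as maximally similar), while the two distance measures do not.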

#### Transformers and Pre-trained Language models
The revolutionary Transformer model was introduced in 2017 by Vaswani et al. The attention-only architecture was extremely impactful in the field of NLP, and its influence is now being felt in other fields such as computer vision and graph neural networks. For a detailed look at Transformers, Harvard NLP’s Annotated Transformer is a wonderful tutorial. In 2018, building on this architecture, Devlin et al. created BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that set state-of-the-art (SOTA) records for various NLP tasks, including the Semantic Textual Similarity (STS) benchmark (Cer et al. 2017).

Following BERT, Liu et al. (2019) released RoBERTa, which uses a robustly optimized pre-training approach to improve upon the original BERT model. These pre-trained language models are powerful tools, and the paradigm has been extended to other languages.

https://medium.com/swlh/transformer-based-sentence-embeddings-cd0935b3b1e0

### Pooling Strategy
Sentence-BERT (S-BERT) adds a pooling operation to the output of a BERT/RoBERTa model to create a fixed-sized sentence embedding. The default is a MEAN pooling strategy, since this was determined to be superior to using the output of the [CLS]-token or a MAX pooling strategy. A fixed-sized sentence embedding is the key to producing embeddings that can be used efficiently in downstream tasks, such as inferring semantic textual similarity with cosine similarity scores.
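
The three strategies can be compared side by side with a small NumPy sketch for a single sentence. This is only an illustration: the `pool` helper and the toy token vectors are hypothetical, and a real model would supply the token embeddings and attention mask.

```python
import numpy as np

def pool(token_embeddings, attention_mask, strategy="mean"):
    """Collapse (seq_len, dim) token vectors into one (dim,) sentence vector."""
    mask = attention_mask.astype(bool)
    real_tokens = token_embeddings[mask]   # drop padding positions
    if strategy == "mean":
        return real_tokens.mean(axis=0)    # average over real tokens
    if strategy == "max":
        return real_tokens.max(axis=0)     # element-wise maximum over real tokens
    if strategy == "cls":
        return token_embeddings[0]         # first position holds the [CLS] token
    raise ValueError(f"unknown strategy: {strategy}")

# Toy input: 4 token positions, 2 dimensions; the last position is padding.
token_embeddings = np.array([[1.0, 0.0],
                             [3.0, 2.0],
                             [2.0, 4.0],
                             [9.0, 9.0]])
attention_mask = np.array([1, 1, 1, 0])

mean_vec = pool(token_embeddings, attention_mask, "mean")
max_vec = pool(token_embeddings, attention_mask, "max")
cls_vec = pool(token_embeddings, attention_mask, "cls")
print(mean_vec, max_vec, cls_vec)
```

Whatever the sentence length, each strategy returns a vector with the same number of dimensions, which is what makes the result a fixed-sized embedding.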

### Inference with Sentence-BERT
Once trained, S-BERT uses a regression objective function for inference within a siamese network similar to the one used for fine-tuning. As seen in the diagram below, the cosine similarity between two sentence embeddings (u and v) is computed as a score between -1 and 1.



The regression objective function is optimized with mean-squared-error loss, and concatenation is not required before calculating the cosine similarity of the sentence embeddings.
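
As a rough sketch of this objective, the snippet below computes the cosine similarity of each (u, v) pair directly from the two vectors, with no concatenation, and compares it with a gold similarity label using mean-squared error. The embeddings and gold scores are made-up values, and the helper names are my own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mse_loss(predicted, gold):
    """Mean-squared error between predicted and gold similarity scores."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.mean((predicted - gold) ** 2))

# Made-up sentence embeddings for two sentence pairs (u_i, v_i)
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # same direction
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # orthogonal
]
gold_scores = [1.0, 0.5]  # hypothetical similarity labels

predicted = [cosine_similarity(u, v) for u, v in pairs]
loss = mse_loss(predicted, gold_scores)
print(predicted, loss)
```

During training the loss would be backpropagated through the encoder; here it is only evaluated once to show how the score and the loss relate.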

##### Implementation
There are two main options for producing S-BERT or S-RoBERTa sentence embeddings: the Hugging Face transformers Python library, or sentence-transformers, a Python library maintained by UKP Lab. Both libraries rely on PyTorch, and the main difference is that using Hugging Face requires defining a function for the pooling strategy, as seen in the code snippet below.

In [1]:
# adapted from https://www.sbert.net/examples/applications/computing-embeddings/README.html
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Expand the mask over the embedding dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Clamp the token count to avoid division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

To create S-BERT sentence embeddings with Hugging Face, import AutoTokenizer and AutoModel to load the tokenizer and the model from the pre-trained S-BERT checkpoint (a BERT-base model that was fine-tuned on a natural language inference dataset). As seen in the code snippet below, PyTorch is used to compute the embeddings, and the previously defined MEAN pooling function is applied.

In [3]:
sentences = ['Once trained, S-BERT uses a regressive objective function for inference within a siamese network similar to the one used for fine-tuning.',' As seen in the diagram below, the cosine similarity between two sentence embeddings (u and v) are computed as a score between [-1…1].']

In [4]:
# https://huggingface.co/transformers/
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

### Extractive text summarization with sentence-transformers
If using sentence-transformers, there are several pre-trained S-BERT and S-RoBERTa models available for sentence embeddings. For the task of extractive text summarization, I prefer to use a distilled version of S-RoBERTa fine-tuned on a paraphrase identification dataset. As seen in the code snippet below, with sentence-transformers it is simple to create a model and embeddings, and then calculate the cosine similarity.

In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')
embeddings = model.encode(sentences, convert_to_tensor=True)

# calculate pair-wise cosine similarities (move to CPU before converting to NumPy)
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings).cpu().numpy()

After calculating cosine similarity, I use code adapted from the Python library LexRank to find the most central sentences in the document, as calculated by degree centrality. Lastly, to produce a summary, I select the five highest-ranked sentences.

In [None]:
from LexRank import degree_centrality_scores

centrality_scores = degree_centrality_scores(cosine_scores, threshold=None)

# argsort on the negated scores so the most central sentences come first
most_central_sentence_indices = np.argsort(-centrality_scores)

# Print the 5 sentences with the highest scores
print("\n\nSummary:")
for idx in most_central_sentence_indices[0:5]:
    # index into the same list that was passed to model.encode
    print(sentences[idx].strip())
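
If the LexRank helper is not at hand, the ranking idea can be approximated with a simplified stand-in. The sketch below scores each sentence by the sum of its similarities to every other sentence and keeps the top k. It is not the LexRank implementation, and the function names and toy similarity matrix are hypothetical.

```python
import numpy as np

def simple_degree_centrality(similarity_matrix):
    """Score each sentence by its total similarity to every other sentence."""
    sim = np.array(similarity_matrix, dtype=float)  # copy, so the input is untouched
    np.fill_diagonal(sim, 0.0)                      # ignore self-similarity
    return sim.sum(axis=1)

def top_k_sentences(sentences, similarity_matrix, k=5):
    """Return the k sentences with the highest centrality, best first."""
    scores = simple_degree_centrality(similarity_matrix)
    ranked = np.argsort(-scores)                    # highest score first
    return [sentences[i] for i in ranked[:k]]

# Made-up similarity matrix for three toy sentences
toy_sentences = ["A", "B", "C"]
toy_similarities = [[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.3],
                    [0.1, 0.3, 1.0]]

summary = top_k_sentences(toy_sentences, toy_similarities, k=2)
print(summary)
```

In practice the similarity matrix would be the pair-wise cosine scores of the sentence embeddings, and the selected sentences are often re-sorted into document order before being shown as a summary.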