# Language-agnostic BERT Sentence Embedding (LaBSE)

LaBSE is a transformer model developed by Google to produce language-agnostic sentence embeddings.

According to their results, LaBSE outperforms LASER on most multilingual benchmarks, with the added advantage of running well on Windows 😂.

In this notebook I will give you the basics of how to get sentence embeddings using LaBSE. I hope this will spark some ideas for the project.

[Official Google blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html)

In [None]:
# We will load the model through the transformers library. Make sure you have it installed along with PyTorch!
!pip install transformers

# Model Architecture

![image.png](attachment:image.png)

LaBSE follows a dual-encoder architecture in which the source text (the text to be translated) and the target text (its translation) are encoded separately by a shared transformer embedding network. The model is then trained on a translation ranking task, which forces the representations of paraphrases and translations to be close together.


![TranslationRanking.gif](attachment:TranslationRanking.gif)
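
To make the translation ranking idea a bit more concrete, here is a minimal sketch of an in-batch ranking loss. It is only an illustration and not LaBSE's actual training code: the function name `translation_ranking_loss`, the `scale` parameter, and the toy embeddings are all assumptions made for this example, and the published training recipe includes further details (such as an additive margin) that are left out here.

```python
import torch
import torch.nn.functional as F


def translation_ranking_loss(source_emb, target_emb, scale=10.0):
    """Illustrative in-batch ranking loss (`scale` is a hypothetical knob)."""
    # Cosine similarity between every source and every target in the batch.
    source_emb = F.normalize(source_emb, p=2, dim=1)
    target_emb = F.normalize(target_emb, p=2, dim=1)
    scores = scale * (source_emb @ target_emb.T)  # shape: (batch, batch)

    # Row i should rank column i (its true translation) above all the others.
    labels = torch.arange(scores.size(0))
    forward = F.cross_entropy(scores, labels)     # source -> target
    backward = F.cross_entropy(scores.T, labels)  # target -> source
    return forward + backward


# Toy batch: 4 "sentences" with 8-dimensional embeddings (made-up values).
source = torch.eye(4, 8)
aligned_target = source.clone()                          # every pair matches
shuffled_target = aligned_target.roll(shifts=1, dims=0)  # every pair mismatched

aligned_loss = translation_ranking_loss(source, aligned_target)
shuffled_loss = translation_ranking_loss(source, shuffled_target)
print(aligned_loss.item(), shuffled_loss.item())
```

The loss is small when each source embedding sits closest to its own translation and large when the pairs are mismatched, which is the pressure that pulls translations together in the embedding space.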

### Usage
Load the tokenizer and the model:

In [None]:
import torch
from transformers import BertModel, BertTokenizerFast

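# Load the LaBSE tokenizer and weights (downloaded on first use).
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
# Switch to evaluation mode so that dropout is disabled.
model = model.eval()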

In [None]:

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
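# Tokenize each batch, padding to the longest sentence in that batch.
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

# Run the forward pass without tracking gradients (inference only).
with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    english_outputs = model(**english_inputs)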

To get the sentence embeddings, use the pooler output:

In [None]:
italian_embeddings = italian_outputs.pooler_output
english_embeddings = english_outputs.pooler_output

To compare sentences, L2-normalize the embeddings before calculating the similarity, so that the dot product of two embeddings equals their cosine similarity:

In [None]:
import torch.nn.functional as F


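def similarity(embeddings_1, embeddings_2):
    """Return the matrix of pairwise cosine similarities between two batches."""
    # Scale every row (one embedding per row) to unit L2 norm.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2, dim=1)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2, dim=1)
    # Entry [i, j] compares row i of the first batch with row j of the second.
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )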

print(similarity(italian_embeddings, english_embeddings))
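
One way to use such a similarity matrix is to pick, for each sentence in one language, the highest-scoring sentence in the other. The sketch below shows the idea with small made-up tensors so that it runs without downloading the model; `italian_demo` and `english_demo` are hypothetical stand-ins for `italian_embeddings` and `english_embeddings`.

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings with made-up values (3 sentences, 3 dimensions each).
italian_demo = torch.tensor([[1.0, 0.1, 0.0], [0.0, 2.0, 0.2], [0.3, 0.0, 3.0]])
english_demo = torch.tensor([[0.1, 0.0, 1.5], [2.0, 0.0, 0.1], [0.0, 1.0, 0.0]])

# Pairwise cosine similarities: normalize the rows, then take dot products.
scores = F.normalize(italian_demo, p=2, dim=1) @ F.normalize(english_demo, p=2, dim=1).T

# For each Italian row, the index of the most similar English row.
best_match = scores.argmax(dim=1)
print(best_match.tolist())  # prints [1, 2, 0]
```

With the real LaBSE embeddings from this notebook, where the sentences are listed in the same order in both languages, you would hope to see each sentence matched to the one at the same index.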