# Multilingual Sentence Embedders

Let's chop that up first:

### Multilingual
At ML6, we have a lot of non-English use cases, so it's definitely handy to have some multilingual tools under our belt.

Multilingual embedders are also useful because they give you one joint model for multiple input languages. This is very convenient in a multilingual country like Belgium, for example.



### Sentence 
When trying to get a meaningful vector representation of a sentence, you can of course use Word Embeddings and try to aggregate them in some meaningful way.

However:

* for non-contextualized Word Embeddings (e.g. FastText), this throws away a bunch of information.
* for contextualized Word Embeddings (e.g. BERT), this just doesn't seem to work very well, nor is it fast.

So a better way to do things is by having a dedicated sentence model that can take a variable-length input sentence and spit out a fixed-length representation.
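
To make the aggregation baseline concrete, here is a minimal sketch of mean pooling: averaging per-word vectors into one fixed-length sentence vector. The `toy_vectors` table and the `mean_pool` helper are made up for illustration; a real setup would load FastText or similar vectors instead. The example also shows one kind of information that gets thrown away: word order.

```python
# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "dog":   [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man":   [0.0, 0.0, 1.0],
}

def mean_pool(sentence, vectors):
    """Average the vectors of all known words into one fixed-length vector."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        raise ValueError("no known words in sentence")
    # zip(*found) iterates over the dimensions; average each one.
    return [sum(dim) / len(found) for dim in zip(*found)]

a = mean_pool("dog bites man", toy_vectors)
b = mean_pool("man bites dog", toy_vectors)
print(a == b)  # True: different meanings, identical embedding
```

A dedicated sentence model is not bound to this kind of order-blind averaging, which is one reason to prefer it.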



### Embedders
These can be useful for a lot of things!

* directly, for similarity or retrieval
* indirectly, as input for a downstream classifier, for example
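
For the direct use case, retrieval boils down to ranking stored embeddings by their similarity to a query embedding. The sketch below uses plain Python and made-up 3-dimensional vectors; `cosine`, `top_k` and the `corpus` entries are hypothetical names, not part of any of the libraries used later on.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus, k=2):
    """Return the k corpus labels most similar to query_vec."""
    ranked = sorted(
        corpus,
        key=lambda label: cosine(query_vec, corpus[label]),
        reverse=True,
    )
    return ranked[:k]

# Made-up embeddings: two translations of the same dish and one unrelated item.
corpus = {
    "wraps_nl": [0.9, 0.1, 0.0],
    "wraps_fr": [0.8, 0.2, 0.1],
    "cocktail_nl": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, corpus))  # ['wraps_nl', 'wraps_fr']
```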

## Why this experiment?
Up until recently, if you needed a multilingual sentence embedding model, one of our de facto go-tos was the Universal Sentence Encoder (USE).

However, recent advances in using [BERT for sentence embeddings](https://arxiv.org/abs/1908.10084), and a subsequent [paper from Google](https://arxiv.org/pdf/2007.01852.pdf) on making this multilingual (LaBSE), made us revisit this question.


This was also the perfect opportunity to have a quick look at [LASER from Facebook](https://engineering.fb.com/ai-research/laser-multilingual-sentence-embeddings/).

Let's kick things off!

## Setup

Feel free to play around with newer versions; the ones pinned below worked for us:

In [1]:
!pip install -q beautifulsoup4==4.9.3
!pip install -q bert-for-tf2==0.14.6
!pip install -q laserembeddings==1.1.0
!pip install -q lxml==4.5.2
!pip install -q nbformat==5.0.7
!pip install -q plotly==4.4.1
!pip install -q scikit-learn==0.23.0
!pip install -q tensorflow==2.3.0
!pip install -q tensorflow-text==2.3.0
!pip install -q tensorflow-hub==0.9.0


In [None]:
!python -m laserembeddings download-models

Downloading models into /usr/local/lib/python3.6/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


## Data

We will try out the models with four types of input:

* short sentences
* normal sentences
* long sentences / short paragraphs
* full-page wikipedia articles

In [None]:
short_sentences = [
    "Mexicaanse wraps met guacamole",
    "Quesadillas met 3 kazen en tex-mex kruiden",
    "Wraps mexicains au guacamole",
    "Quesadillas aux 3 fromages et épices tex-mex",
    "Mexikanische Wraps mit Guacamole",
    "Quesadillas mit 3 Käsesorten und Tex-Mex-Gewürzen",
    "Appeltaart met een bolletje ijs en cacao-saus",
    "Appelsouflé om van te dromen, met verse chocolade",
    "Tarte aux pommes avec une boule de glace",
    "Souflé de rêve aux pommes au chocolat frais",
    "Apfelkuchen mit einer Kugel Eis",
    "Traum Apfelsouffle mit frischer Schokolade",
    "Deze Dark n' Stormy mocktail doet je versteld staan",
    "Margarita cocktails met vers limoensap",
    "Ce mocktail Dark n 'Stormy vous épatera",
    "Cocktails Margarita avec jus de citron vert frais",
    "Dieser Dark n 'Stormy Mocktail wird Sie umhauen",
    "Margarita-Cocktails mit frischem Limettensaft"
]

short_labels = [f"{kind}_{lang}_{num}" for kind in ["texmex","dessert","cocktail"] for lang in ["nl","fr","de"] for num in [1,2]]

In [None]:
mid_sentences = [
    "Met deze mexicaanse quesadillas haal je het zuiden in huis, lekker met wat hummus of guacamole",
    "Avec ces quesadillas mexicaines vous apportez le sud chez vous, délicieuses avec du houmous ou du guacamole",
    "Een appeltaart om u tegen te zeggen, heerlijk warm en zoet. Serveertip: doe er een bolletje ijs bij!",
    "Une tarte aux pommes pour le moins, délicieusement chaude et sucrée. Conseil de service: ajoutez une boule de glace!",
    "Simpel en snel: meng vodka, limoensap en triple-sec met elkaar om een verfrissende zomercocktail te krijgen",
    "Simple et rapide: mélangez la vodka, le jus de citron vert et le triple-sec pour obtenir un cocktail d'été rafraîchissant"
]

mid_labels = [f"{kind}_{lang}" for kind in ["texmex","dessert","cocktail"] for lang in ["nl","fr"]]

In [None]:
long_sentences = [
    """Uitslag van de wedstrijd
    In de eerste helft draaide de bal nog goed rond, de middenvelder en de spits vonden elkaar goed.
    Na een spijtige own-goal van Pfaff werd het 0-1 voor de tegenstanders. Anderlecht herpakte zich echter
    goed in de tweede helft, en kwam tot driemaal toe dicht bij scoren. Slechts een penalty op het einde kon soelaas bieden.""",
    """Résultat du match
    En première mi-temps, le ballon a bien tourné, le milieu de terrain et l'attaquant se sont bien trouvés.
    Après un but contre son camp regrettable de Pfaff, les adversaires ont marqué 0-1. Anderlecht, cependant, a récupéré
    bien en seconde période, se rapprochant de marquer trois fois. Seule une pénalité à la fin pourrait offrir un soulagement.""",
    """Resultaat van de rit:
    Na een vroege ontsnapping ontstond een duidelijke kopgroep en peloton. De renners moesten alles geven in de waaieretape.
    Een sprong van Boonen naar de kopgroep trok de volledige koers open. Een groep der favorieten ontstond.
    Ze draaiden allemaal goed mee. Na de laatste bevoorrading ging de rit bergop helemaal op slot. Uiteindelijk won
    Wallays de massasprint. Boonen werd derde.""",
    """Résultat du voyage:
    Après une échappée précoce, une échappée claire et un peloton ont émergé. Les coureurs ont dû tout donner dans la fan tape.
    Un saut de Boonen au groupe de tête a ouvert tout le parcours. Un groupe de favoris a surgi.
    Ils ont tous bien joué. Après le dernier réapprovisionnement, le trajet était complètement fermé en montée. En fin de compte, il a gagné
    Wallays dans le sprint du peloton. Boonen est arrivé troisième.""",
    """Splits de dooiers van de eieren. Klop deze lichtjes op tot luchtig alvorens de suiker toe te voegen.
    Indien je wenst kun je in dit stadium gerust ook de amandelen en walnoten al crushen en toevoegen.
    Warm de oven voor op 180°. Smelt intussen de chocolade au-bain-marie. Doe dit al roerend, zodat er geen
    aangebrande klonters ontstaan. Eenmaal gesmolten kun je alles samenvoegen: de bloem, de eieren, de boter en de
    rest van het beslag.
    Zet een 30-tal minuten in de oven, en voila: een heerlijke cake!""",
    """Séparez les jaunes des œufs. Battez-le légèrement jusqu'à ce qu'il soit mousseux avant d'ajouter le sucre.
    Si vous le souhaitez, vous pouvez également écraser et ajouter les amandes et les noix à ce stade.
    Préchauffez le four à 180 °. Pendant ce temps, faites fondre le chocolat au bain-marie. Faites-le en remuant afin que
    des grumeaux brûlés se forment. Une fois fondu, vous pouvez tout mettre ensemble: la farine, les œufs, le beurre et le
    reste de la pâte.
    Mettez au four environ 30 minutes et le tour est joué: un délicieux gâteau!"""
]

long_labels = [
    "sports_football_nl",
    "sports_football_fr",
    "sports_cycling_nl",
    "sports_cycling_fr",
    "recipe_cake_nl",
    "recipe_cake_fr"
]

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

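def get_wiki_text(url):
    """Download a Wikipedia page and return its paragraph text as one string."""
    source = urlopen(url).read()
    soup = BeautifulSoup(source, 'lxml')
    # Concatenate the text of all <p> elements, separated by spaces.
    text = ' '.join(paragraph.text for paragraph in soup.find_all('p'))

    # Strip bracketed markers such as footnote references, e.g. "[1]".
    text = re.sub(r'\[.*?\]+', '', text)
    # Collapse newlines and repeated whitespace into single spaces.
    text = re.sub(r'\s+', ' ', text)

    return text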
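
The clean-up part of `get_wiki_text` can be tried out in isolation, without hitting Wikipedia. This is a small sketch on a made-up sample string; `clean_text` is a hypothetical helper that removes bracketed footnote markers and collapses whitespace with a single regex.

```python
import re

def clean_text(text):
    """Remove bracketed markers like "[1]" and collapse all whitespace runs."""
    text = re.sub(r'\[.*?\]+', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "Carla Bruni[1] is a singer.\n\nShe was born  in Turin.[2][3]"
print(clean_text(sample))
# Carla Bruni is a singer. She was born in Turin.
```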

In [None]:
pages = [get_wiki_text(url) for url in [
    "https://nl.wikipedia.org/wiki/Carla_Bruni",
    "https://fr.wikipedia.org/wiki/Carla_Bruni",
    "https://nl.wikipedia.org/wiki/Paul_Simon_(artiest)",
    "https://fr.wikipedia.org/wiki/Paul_Simon_(chanteur)",
    "https://nl.wikipedia.org/wiki/Bloemkool",
    "https://fr.wikipedia.org/wiki/Chou-fleur"
]]

page_labels = [
    "carla_bruni_nl",
    "carla_bruni_fr",
    "paul_simon_nl",
    "paul_simon_fr",
    "cauliflower_nl",
    "cauliflower_fr"
]

## Let's embed!

### m-USE

16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian)


In [None]:
import tensorflow as tf
import tensorflow_hub as tfhub
import tensorflow_text

In [None]:
embed_layer = tfhub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3",
    input_shape=[],
    dtype=tf.string,
    trainable=False)

model_use = tf.keras.Sequential()
model_use.add(embed_layer)

In [None]:
def get_encoding(text):
    enc_use = embed_layer([text])
    return enc_use[0]
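
`get_encoding` embeds one sentence per call, which is fine for a handful of sentences. For larger lists it may be worth sending several texts per call. The sketch below shows only the batching logic, with a stand-in `fake_embed` function so that it runs without TensorFlow; `batched` and `embed_in_batches` are hypothetical helpers. In this notebook, `embed_fn` could plausibly be something like `lambda batch: embed_layer(batch).numpy()`, but we did not benchmark that here, so treat it as an assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn once per batch and flatten the results into one list."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder (hypothetical): maps each text to a 1-dimensional "embedding".
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(result)  # [[1.0], [2.0], [3.0]]
```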

In [None]:
%%time
emb_use_short = [get_encoding(sent) for sent in short_sentences]

CPU times: user 2.17 s, sys: 62.6 ms, total: 2.23 s
Wall time: 3.05 s


In [None]:
%%time
emb_use_mid = [get_encoding(sent) for sent in mid_sentences]

CPU times: user 95.6 ms, sys: 10.6 ms, total: 106 ms
Wall time: 82.2 ms


In [None]:
%%time
emb_use_long = [get_encoding(sent) for sent in long_sentences]

CPU times: user 107 ms, sys: 7.91 ms, total: 115 ms
Wall time: 103 ms


In [None]:
%%time
emb_use_page = [get_encoding(page) for page in pages]

CPU times: user 210 ms, sys: 59.2 ms, total: 269 ms
Wall time: 649 ms


### LASER

93 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

In [None]:
from laserembeddings import Laser

In [None]:
laser = Laser()

In [None]:
%%time
emb_laser_short = laser.embed_sentences(
    short_sentences,
    lang=['nl', 'nl', 'fr', 'fr', 'de', 'de']*3)

CPU times: user 108 ms, sys: 2.53 ms, total: 111 ms
Wall time: 181 ms
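
The `lang` list has to line up with the order of the sentences. Since the labels already encode the language, one option is to derive the list from them instead of hard-coding it. A small sketch, rebuilding the short-sentence labels so that it runs on its own (`short_langs` is a name introduced here):

```python
short_labels = [
    f"{kind}_{lang}_{num}"
    for kind in ["texmex", "dessert", "cocktail"]
    for lang in ["nl", "fr", "de"]
    for num in [1, 2]
]

# The language code is the second underscore-separated field of each label.
short_langs = [label.split("_")[1] for label in short_labels]
print(short_langs[:6])  # ['nl', 'nl', 'fr', 'fr', 'de', 'de']
```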


In [None]:
%%time
emb_laser_mid = laser.embed_sentences(
    mid_sentences,
    lang=['nl', 'fr']*3)

CPU times: user 28.8 ms, sys: 3.97 ms, total: 32.8 ms
Wall time: 33.1 ms


In [None]:
%%time
emb_laser_long = laser.embed_sentences(
    long_sentences,
    lang=['nl', 'fr']*3)

CPU times: user 92.6 ms, sys: 26.2 ms, total: 119 ms
Wall time: 120 ms


In [None]:
%%time
emb_laser_page = laser.embed_sentences(
    pages,
    lang=['nl', 'fr']*3)

CPU times: user 1.59 s, sys: 394 ms, total: 1.99 s
Wall time: 1.99 s


### m-SentenceBert
109 languages, list available here: https://github.com/facebookresearch/LASER/tree/master/data/tatoeba/v1

More info on the Tatoeba corpus can be found [here](https://en.wiki.tatoeba.org/articles/show/main), an example can be found [here](https://tatoeba.org/eng/sentences/show/33396).

A Hugging Face [model page](https://huggingface.co/sentence-transformers/LaBSE) has also been available since October last year, but it is a third-party port, and in our opinion its results just don't seem as crisp.

In [None]:
import tensorflow as tf
import bert
import numpy as np
import tensorflow_hub as hub

In [None]:
def get_model(model_url, max_seq_length):
    labse_layer = hub.KerasLayer(model_url, trainable=True)
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

    # LaBSE layer.
    pooled_output, _ = labse_layer([input_word_ids, input_mask, segment_ids])

    # The embedding is l2 normalized.
    pooled_output = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(pooled_output)

    # Define model.
    return tf.keras.Model(
        inputs=[input_word_ids, input_mask, segment_ids],
        outputs=pooled_output), labse_layer

max_seq_length = min([512, max([len(x.split(" ")) for x in pages])])
labse_model, labse_layer = get_model(model_url="https://tfhub.dev/google/LaBSE/1", max_seq_length=max_seq_length)
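
A note on the L2 normalization in `get_model`: once every embedding has unit length, the plain dot product of two embeddings already equals their cosine similarity. A minimal pure-Python sketch with made-up 2-dimensional vectors (`l2_normalize` and `dot` are helpers written for this illustration, not the TensorFlow ops):

```python
import math

def l2_normalize(vec):
    """Scale vec so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u = l2_normalize([3.0, 4.0])
v = l2_normalize([6.0, 8.0])   # same direction as u, different length
w = l2_normalize([-4.0, 3.0])  # orthogonal to u

print(dot(u, u))  # close to 1.0: unit length
print(dot(u, v))  # close to 1.0: same direction
print(dot(u, w))  # close to 0.0: orthogonal
```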

In [None]:
vocab_file = labse_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = labse_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
def create_input(input_strings, tokenizer, max_seq_length):

  input_ids_all, input_mask_all, segment_ids_all = [], [], []
  for input_string in input_strings:
    # Tokenize input.
    input_tokens = ["[CLS]"] + tokenizer.tokenize(input_string) + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
    sequence_length = min(len(input_ids), max_seq_length)

    # Padding or truncation.
    if len(input_ids) >= max_seq_length:
      input_ids = input_ids[:max_seq_length]
    else:
      input_ids = input_ids + [0] * (max_seq_length - len(input_ids))

    input_mask = [1] * sequence_length + [0] * (max_seq_length - sequence_length)

    input_ids_all.append(input_ids)
    input_mask_all.append(input_mask)
    segment_ids_all.append([0] * max_seq_length)

  return np.array(input_ids_all), np.array(input_mask_all), np.array(segment_ids_all)
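
The padding and truncation step in `create_input` is easier to follow on a tiny example. The sketch below reproduces just that logic; `pad_or_truncate` is a hypothetical helper and the token ids are made up. Note that, as in `create_input`, truncation simply cuts off the tail, so the final `[SEP]` token is dropped for inputs that are too long.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), each with exactly max_len entries."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_len - len(ids)   # 0 when the input was truncated
    return ids + [pad_id] * padding, mask + [0] * padding

# Shorter than max_len: padded with zeros, mask marks the padding.
print(pad_or_truncate([101, 7, 8, 102], max_len=6))
# ([101, 7, 8, 102, 0, 0], [1, 1, 1, 1, 0, 0])

# Longer than max_len: the tail (including the last id) is cut off.
print(pad_or_truncate([101, 7, 8, 9, 10, 102], max_len=4))
# ([101, 7, 8, 9], [1, 1, 1, 1])
```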

In [None]:
def encode(input_text):
  input_ids, input_mask, segment_ids = create_input(
    input_text, tokenizer, max_seq_length)
  return labse_model([input_ids, input_mask, segment_ids])

In [None]:
%%time
emb_mbert_short = encode(short_sentences)

CPU times: user 491 ms, sys: 33 ms, total: 524 ms
Wall time: 1.25 s


In [None]:
%%time
emb_mbert_mid = encode(mid_sentences)

CPU times: user 460 ms, sys: 6.1 ms, total: 466 ms
Wall time: 469 ms


In [None]:
%%time
emb_mbert_long = encode(long_sentences)

CPU times: user 39.1 ms, sys: 10.1 ms, total: 49.2 ms
Wall time: 241 ms


In [None]:
%%time
emb_mbert_page = encode(pages)

CPU times: user 126 ms, sys: 4.38 ms, total: 130 ms
Wall time: 239 ms


## Comparison

In [None]:
from sklearn.metrics.pairwise import cosine_similarity 
import plotly.graph_objects as go

### Short sentences

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_use_short, emb_use_short),
                   x=short_labels,
                   y=short_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="USE embeddings - short sentences",
    autosize=False,
    width=600,
    height=600,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_mbert_short, emb_mbert_short),
                   x=short_labels,
                   y=short_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="Sentence-BERT embeddings - short sentences",
    autosize=False,
    width=600,
    height=600,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_laser_short, emb_laser_short),
                   x=short_labels,
                   y=short_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="LASER embeddings - short sentences",
    autosize=False,
    width=600,
    height=600,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

### Medium sentences

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_use_mid, emb_use_mid),
                   x=mid_labels,
                   y=mid_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="USE embeddings - mid sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_mbert_mid, emb_mbert_mid),
                   x=mid_labels,
                   y=mid_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="Sentence-BERT embeddings - mid sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_laser_mid, emb_laser_mid),
                   x=mid_labels,
                   y=mid_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="LASER embeddings - mid sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

### Long sentences

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_use_long, emb_use_long),
                   x=long_labels,
                   y=long_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="USE embeddings - long sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_mbert_long, emb_mbert_long),
                   x=long_labels,
                   y=long_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="Sentence-BERT embeddings - long sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_laser_long, emb_laser_long),
                   x=long_labels,
                   y=long_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="LASER embeddings - long sentences",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

### Pages

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_use_page, emb_use_page),
                   x=page_labels,
                   y=page_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="USE embeddings - pages",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_mbert_page, emb_mbert_page),
                   x=page_labels,
                   y=page_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="Sentence-BERT embeddings - pages",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

In [None]:
fig = go.Figure(data=go.Heatmap(
                   z=cosine_similarity(emb_laser_page, emb_laser_page),
                   x=page_labels,
                   y=page_labels,
                   hoverongaps = False,
                   zmin=0,
                   zmax=1))

fig.update_layout(
    title="LASER embeddings - pages",
    autosize=False,
    width=400,
    height=400,
    yaxis={
        "autorange": "reversed"
    })

fig.show()

## Take-aways

All in all, we're very impressed by m-sentence-BERT!

* sentence-BERT seems to capture multilinguality much better than USE (shorter distances between literal translations)
* it also feels like it captures relationships and similarities quite nicely
* these effects seem to become more pronounced the longer the input gets
* yes, sentence-BERT does seem to take longer to process, since it is indeed a larger model
* both USE and sentence-BERT seemed to benefit greatly (run time divided by about 6) from using a GPU backend (Colab notebook test on a K80)
* side note: when embedding large datasets, it's definitely useful to look at tools like Dataflow
* sentence-BERT offers a MUCH wider variety of languages!
* as for LASER: either we're doing something wrong, or the results just aren't that impressive.

So we can safely say that sentence-BERT is a more than worthy alternative to USE, especially for longer sequences!
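
The heatmaps were judged by eye. If you want a single number for "how close are literal translations compared to everything else", one possible sketch is below. `translation_gap` is a hypothetical helper, the similarity values are made up, and the label convention (a language code as one underscore-separated field) follows the labels used in this notebook.

```python
LANGS = {"nl", "fr", "de"}

def content_key(label):
    """Label with its language field removed, e.g. 'texmex_nl_1' -> ('texmex', '1')."""
    return tuple(part for part in label.split("_") if part not in LANGS)

def translation_gap(sim, labels):
    """Mean similarity of translation pairs minus mean similarity of all other pairs."""
    same, other = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            is_translation = content_key(labels[i]) == content_key(labels[j])
            (same if is_translation else other).append(sim[i][j])
    return sum(same) / len(same) - sum(other) / len(other)

labels = ["texmex_nl", "texmex_fr", "dessert_nl", "dessert_fr"]
sim = [  # made-up similarity matrix
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
print(round(translation_gap(sim, labels), 2))  # 0.6
```

With the notebook's variables this could be called as, for instance, `translation_gap(cosine_similarity(emb_use_mid, emb_use_mid), mid_labels)`; a larger gap means translations sit closer together relative to unrelated sentences.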