<table class="tfo-notebook-buttons" align="left">
  <td>
    <a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/nlp/multilingual/multilingual-universal-sentence-encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>    
  </td>
  <td>
    <a href="https://github.com/martin-fabbri/colab-notebooks/blob/master/nlp/multilingual/multilingual-universal-sentence-encoder.ipynb" target="_parent"><img src="https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/assets/github.svg" alt="View On Github"/></a>  </td>
</table>

# Multilingual Universal Sentence Encoder

## Outline

- [1. Multilingual sentence visualization](#1)

- [2. Multilingual semantic-similarity search engine](#2)

## Citation

*Research papers that make use of the models explored in this colab should cite:*

### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019.
 arXiv preprint arXiv:1907.04307

### [Dataset: News-Commentary](http://www.casmacat.eu/corpus/news-commentary.html)
J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

In [1]:
%%capture
#@title Setup Environment
#@markdown Simple Neighbors is a clean and easy interface for
#@markdown performing nearest-neighbor lookups on items from a
#@markdown corpus. To install the package:
!pip install tensorflow_text
!pip install bokeh
!pip install simpleneighbors[annoy]
!pip install tqdm

In [2]:
#@title Imports & Visualization utils
import os

import bokeh
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from simpleneighbors import SimpleNeighbors
from tensorflow_text import SentencepieceTokenizer
from tqdm import tqdm, trange

def visualize_similarity(
    embeddings_1,
    embeddings_2,
    labels_1,
    labels_2,
    plot_title,
    plot_width=1200,
    plot_height=600,
    xaxis_font_size="12pt",
    yaxis_font_size="12pt",
):

    assert len(embeddings_1) == len(labels_1)
    assert len(embeddings_2) == len(labels_2)

    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    sim = (
        1
        - np.arccos(
            sklearn.metrics.pairwise.cosine_similarity(
                embeddings_1, embeddings_2
            )
        )
        / np.pi
    )

    embeddings_1_col, embeddings_2_col, sim_col = [], [], []
    for i in range(len(embeddings_1)):
        for j in range(len(embeddings_2)):
            embeddings_1_col.append(labels_1[i])
            embeddings_2_col.append(labels_2[j])
            sim_col.append(sim[i][j])
    df = pd.DataFrame(
        zip(embeddings_1_col, embeddings_2_col, sim_col),
        columns=["embeddings_1", "embeddings_2", "sim"],
    )

    mapper = bokeh.models.LinearColorMapper(
        palette=[*reversed(bokeh.palettes.YlOrRd[9])],
        low=df.sim.min(),
        high=df.sim.max(),
    )

    p = bokeh.plotting.figure(
        title=plot_title,
        x_range=labels_1,
        x_axis_location="above",
        y_range=[*reversed(labels_2)],
        plot_width=plot_width,
        plot_height=plot_height,
        tools="save",
        toolbar_location="below",
        tooltips=[("pair", "@embeddings_1 ||| @embeddings_2"), ("sim", "@sim")],
    )
    p.rect(
        x="embeddings_1",
        y="embeddings_2",
        width=1,
        height=1,
        source=df,
        fill_color={"field": "sim", "transform": mapper},
        line_color=None,
    )

    p.title.text_font_size = "12pt"
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_standoff = 16
    p.xaxis.major_label_text_font_size = xaxis_font_size
    p.xaxis.major_label_orientation = 0.25 * np.pi
    p.yaxis.major_label_text_font_size = yaxis_font_size
    p.min_border_right = 300

    bokeh.io.output_notebook()
    bokeh.io.show(p)

!pip list | grep "bokeh\|simpleneighbors\|tensorflow-hub\|tensorflow-text"

bokeh                         2.1.1          
simpleneighbors               0.1.0          
tensorflow-hub                0.11.0         
tensorflow-text               2.4.3          
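
The `visualize_similarity` helper scores each pair with an arccos-based similarity, `1 - arccos(cos_sim) / pi`, rather than raw cosine similarity. The following is a minimal pure-Python sketch of that formula for a single pair of vectors, not the notebook's implementation. The function names and the clamping step are assumptions added here for illustration. Clamping guards against cosine values that land slightly outside [-1, 1] through floating-point rounding, which would fall outside the domain of `acos`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_similarity(u, v):
    """Compute 1 - arccos(cos_sim) / pi for one pair of vectors."""
    cos_sim = cosine_similarity(u, v)
    # Assumption of this sketch: clamp to [-1, 1] so that rounding
    # error cannot push the value outside the domain of acos.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return 1.0 - math.acos(cos_sim) / math.pi


print(angular_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(angular_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(angular_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing the same way score 1.0, orthogonal vectors 0.5, and opposite vectors 0.0, so the score is linear in the angle between the two vectors.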


In [3]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
model = hub.load(module_url)

def embed_text(texts):
    """Return one embedding per input string."""
    return model(texts)

<a name="1"></a>
## 1. Multilingual sentence visualization

Compute text embeddings

In [4]:
# Some texts of different lengths in different languages.
arabic_sentences = [
    "كلب",
    "الجراء لطيفة.",
    "أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.",
]
chinese_sentences = ["狗", "小狗很好。", "我喜欢和我的狗一起沿着海滩散步。"]
english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
french_sentences = [
    "chien",
    "Les chiots sont gentils.",
    "J'aime faire de longues promenades sur la plage avec mon chien.",
]
german_sentences = [
    "Hund",
    "Welpen sind nett.",
    "Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.",
]
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
korean_sentences = ["개", "강아지가 좋다.", "나는 나의 산책을 해변을 따라 길게 산책하는 것을 즐긴다."]
russian_sentences = [
    "собака",
    "Милые щенки.",
    "Мне нравится подолгу гулять по пляжу со своей собакой.",
]
spanish_sentences = [
    "perro",
    "Los cachorros son agradables.",
    "Disfruto de dar largos paseos por la playa con mi perro.",
]

# Multilingual example
multilingual_example = [
    "Willkommen zu einfachen, aber",
    "verrassend krachtige",
    "multilingüe",
    "compréhension du langage naturel",
    "модели.",
    "大家是什么意思",
    "보다 중요한",
    "comprensión del lenguaje natural",
]
multilingual_example_in_en = [
    "Welcome to simple yet",
    "surprisingly powerful",
    "multilingual",
    "natural language understanding",
    "models.",
    "What people mean",
    "matters more than",
    "the language they speak.",
]

In [5]:
ar_result = embed_text(arabic_sentences)
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)
de_result = embed_text(german_sentences)
fr_result = embed_text(french_sentences)
it_result = embed_text(italian_sentences)
ja_result = embed_text(japanese_sentences)
ko_result = embed_text(korean_sentences)
ru_result = embed_text(russian_sentences)
zh_result = embed_text(chinese_sentences)

multilingual_result = embed_text(multilingual_example)
multilingual_in_en_result = embed_text(multilingual_example_in_en)

In [6]:
visualize_similarity(
    multilingual_in_en_result,
    multilingual_result,
    multilingual_example_in_en,
    multilingual_example,
    "Multilingual Universal Sentence Encoder",
    plot_width=900,
    plot_height=500
)

English-Spanish similarity

In [7]:
visualize_similarity(
    en_result,
    es_result,
    english_sentences,
    spanish_sentences,
    "English-Spanish similarity",
    plot_width=900,
    plot_height=500
)

Spanish-Italian similarity

In [8]:
visualize_similarity(
    es_result,
    it_result,
    spanish_sentences,
    italian_sentences,
    "Spanish-Italian similarity",
    plot_width=900,
    plot_height=500
)

<a name="2"></a>
## 2. Multilingual Semantic-Similarity Search Engine

We will build a semantic-search index of sentences from the News-Commentary corpus, half in English and half in Spanish, to demonstrate the multilingual capabilities of the Universal Sentence Encoder. The cells below keep the first 10,000 sentences per language for the index.

First, we download news sentences in both languages from the News-Commentary corpus.

In [9]:
corpus_metadata = [
    ("es", "en-es.txt.zip", "News-Commentary.en-es.es", "Spanish"),
    ("en", "en-es.txt.zip", "News-Commentary.en-es.en", "English"),
]

language_to_sentences = {}
language_to_news_path = {}

corpus_url = "http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/"

for language_code, zip_file, news_file, language_name in corpus_metadata:
    zip_path = tf.keras.utils.get_file(
        fname=zip_file, origin=f"{corpus_url}{zip_file}", extract=True
    )
    news_path = os.path.join(os.path.dirname(zip_path), news_file)
    language_to_sentences[language_code] = pd.read_csv(
        news_path, sep="\t", header=None
    )[0][:10000]  # keep the first 10,000 sentences per language
    language_to_news_path[language_code] = news_path
    print(f"{len(language_to_sentences[language_code]):,} {language_name}")

10,000 Spanish
10,000 English


In [12]:
language_to_sentences["es"], language_to_sentences["es"].shape

(0                               ¿El oro a 10.000 dólares?
 1       SAN FRANCISCO – Nunca ha resultado fácil soste...
 2       Últimamente, con los precios del oro más de un...
 3       Apenas en el pasado mes de diciembre, mis cole...
 4                                           ¿Y saben qué?
                               ...                        
 9995    La competencia global y la integración de los ...
 9996    ·&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 9997    El descenso de la competitividad de Estados Un...
 9998    El cambio tecnológico y la globalización han c...
 9999    No obstante, las opciones de políticas de EE.U...
 Name: 0, Length: 10000, dtype: object, (10000,))

## Using a pre-trained model to transform sentences into vectors

We compute embeddings in _batches_ so that they fit in the GPU's RAM.

In [10]:
batch_size = 2048
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
    print(f"Computing {language_name} embeddings")
    # The file is streamed in chunks, so the total is not known up front;
    # tqdm then reports a running sentence count instead of a percentage.
    with tqdm() as pbar:
        for batch in pd.read_csv(
            language_to_news_path[language_code],
            sep="\t",
            header=None,
            chunksize=batch_size,
        ):
            language_to_embeddings.setdefault(language_code, []).extend(
                embed_text(batch[0])
            )
            pbar.update(len(batch))

Computing Spanish embeddings
238819it [00:39, 6096.20it/s]

Computing English embeddings
238853it [00:36, 6462.35it/s]
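
The batching loop above follows a general pattern: split the input into fixed-size chunks, embed each chunk, and concatenate the results. A minimal sketch of that pattern is shown here. The names `batched`, `embed_in_batches`, and `fake_embed` are hypothetical, and `fake_embed` is a stand-in that only mimics the shape of a model call; it is not the Universal Sentence Encoder.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size):
    """Apply `embed_fn` to one batch at a time and collect all vectors."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch):
    """Hypothetical stand-in for a model: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


texts = ["dog", "perro", "chien", "Hund", "cane"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(len(vectors))  # one vector per input text
```

Only one batch is passed to the embedding function at a time, so peak memory use depends on `batch_size` rather than on the size of the corpus.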


## Building an index of semantic vectors

We use the [SimpleNeighbors](https://pypi.org/project/simpleneighbors/) library, a wrapper for the [Annoy](https://github.com/spotify/annoy) library, to efficiently look up results from the corpus.
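
An approximate index such as Annoy trades a little accuracy for speed. As a point of reference, the exact search it approximates can be sketched as a brute-force scan that ranks every stored vector by its dot product with the query (the `dot` metric is the one this notebook passes when it builds the index). The helper names and the two-dimensional toy vectors below are made up for illustration; the real embeddings in this notebook have 512 dimensions.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_nearest(query, sentences, embeddings, n=3):
    """Return the n sentences whose embeddings score highest against query."""
    best = heapq.nlargest(
        n,
        range(len(embeddings)),
        key=lambda i: dot(query, embeddings[i]),
    )
    return [sentences[i] for i in best]


# Toy data: hand-written vectors, not produced by any model.
sentences = [
    "gold prices rise",
    "el precio del oro sube",
    "the election results",
]
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
query = [1.0, 0.0]

print(exact_nearest(query, sentences, embeddings, n=2))
```

This scan touches every vector on every query, so its cost grows linearly with the corpus. That cost is the motivation for building a tree-based index instead.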

In [16]:
num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
print("Embedding dimensions:", embedding_dimensions)

Embedding dimensions: 512


In [None]:
%%time
for language_code, zip_file, news_file, language_name in corpus_metadata:
    print("===")
    print(f"Adding {language_name} embeddings to index")
    index = SimpleNeighbors(embedding_dimensions, metric="dot")
    for sentence, embedding in zip(language_to_sentences[language_code], language_to_embeddings[language_code]):
        index.add_one(sentence, embedding)

    print(f"Build {language_name} index with {num_index_trees} trees")
    index.build(n=num_index_trees)
    language_name_to_index[language_name] = index

===
Adding Spanish embeddings to index
