<table class="tfo-notebook-buttons" align="left">
  <td>
    <a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/nlp/multilingual/multilingual-semantic-search-faiss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>    
  </td>
  <td>
    <a href="https://github.com/martin-fabbri/colab-notebooks/blob/master/nlp/multilingual/multilingual-semantic-search-faiss.ipynb" target="_parent"><img src="https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/assets/github.svg" alt="View On Github"/></a>  </td>
</table>

# Multilingual Semantic Search with Faiss

## Outline
- [0. Setup](#0)

- [1. Load dataset](#1)

- [2. Build the semantic-similarity search engine](#2)

- [3. Test the semantic-similarity search engine](#3)

- [4. References](#4)

<a name="0"></a>
## 0. Setup

In [1]:
#@title Setup Environment
#@markdown Install [Faiss](https://github.com/facebookresearch/faiss), a library
#@markdown for efficient similarity search and clustering of dense vectors.
%%capture
!pip install tensorflow_text  # required: importing it registers the ops the multilingual encoder uses
!pip install bokeh
!pip install faiss-gpu
!pip install tqdm

In [2]:
#@title Imports & Visualization utils
%load_ext google.colab.data_table
import os

import bokeh
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
import faiss
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from google.colab import data_table
from tensorflow_text import SentencepieceTokenizer
from tqdm import tqdm, trange

def visualize_similarity(
    embeddings_1,
    embeddings_2,
    labels_1,
    labels_2,
    plot_title,
    plot_width=1200,
    plot_height=600,
    xaxis_font_size="12pt",
    yaxis_font_size="12pt",
):

    assert len(embeddings_1) == len(labels_1)
    assert len(embeddings_2) == len(labels_2)

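    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    # Clip to [-1, 1]: round-off can push cosine values slightly outside
    # the domain of arccos, which would produce NaNs.
    cosine = np.clip(
        sklearn.metrics.pairwise.cosine_similarity(embeddings_1, embeddings_2),
        -1.0,
        1.0,
    )
    sim = 1 - np.arccos(cosine) / np.pi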

    embeddings_1_col, embeddings_2_col, sim_col = [], [], []
    for i in range(len(embeddings_1)):
        for j in range(len(embeddings_2)):
            embeddings_1_col.append(labels_1[i])
            embeddings_2_col.append(labels_2[j])
            sim_col.append(sim[i][j])
    df = pd.DataFrame(
        zip(embeddings_1_col, embeddings_2_col, sim_col),
        columns=["embeddings_1", "embeddings_2", "sim"],
    )

    mapper = bokeh.models.LinearColorMapper(
        palette=[*reversed(bokeh.palettes.YlOrRd[9])],
        low=df.sim.min(),
        high=df.sim.max(),
    )

    p = bokeh.plotting.figure(
        title=plot_title,
        x_range=labels_1,
        x_axis_location="above",
        y_range=[*reversed(labels_2)],
        plot_width=plot_width,
        plot_height=plot_height,
        tools="save",
        toolbar_location="below",
        tooltips=[("pair", "@embeddings_1 ||| @embeddings_2"), ("sim", "@sim")],
    )
    p.rect(
        x="embeddings_1",
        y="embeddings_2",
        width=1,
        height=1,
        source=df,
        fill_color={"field": "sim", "transform": mapper},
        line_color=None,
    )

    p.title.text_font_size = "12pt"
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_standoff = 16
    p.xaxis.major_label_text_font_size = xaxis_font_size
    p.xaxis.major_label_orientation = 0.25 * np.pi
    p.yaxis.major_label_text_font_size = yaxis_font_size
    p.min_border_right = 300

    bokeh.io.output_notebook()
    bokeh.io.show(p)

!pip list | grep "bokeh\|faiss\|tensorflow-hub\|tensorflow-text"

bokeh                         2.1.1          
faiss-gpu                     1.7.0          
tensorflow-hub                0.11.0         
tensorflow-text               2.4.3          


In [3]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
model = hub.load(module_url)

def embed_text(texts):
    """Embed a string or a list of strings with the multilingual encoder."""
    return model(texts)

<a name="1"></a>
## 1. Load dataset

We will build a semantic-search index of roughly 480,000 sentences from the News Commentary corpus. About half are in English and the other half in Spanish, to demonstrate the multilingual capabilities of the Universal Sentence Encoder.

First, we will download the English and Spanish news sentences from the News Commentary corpus.

In [4]:
corpus_metadata = [
    ("es", "Spanish", "en-es.txt.zip", "News-Commentary.en-es.es"),
    ("en", "English", "en-es.txt.zip", "News-Commentary.en-es.en"),
]

sentences = []
language_to_sentences = {}
language_to_news_path = {}

corpus_url = "http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/"

for language_code, language_name, zip_file, news_file in corpus_metadata:
    zip_path = tf.keras.utils.get_file(
        fname=zip_file, origin=f"{corpus_url}{zip_file}", extract=True
    )
    news_path = os.path.join(os.path.dirname(zip_path), news_file)
    language_to_sentences[language_code] = pd.read_csv(
        news_path, sep="\t", header=None
    )[0]
    language_to_news_path[language_code] = news_path
    sentences.extend(language_to_sentences[language_code])
    print(f"{len(language_to_sentences[language_code]):,} {language_name}")

Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-es.txt.zip
238,819 Spanish
238,853 English


In [5]:
explore_df = pd.DataFrame(
    {
        "en": language_to_sentences["en"][:1000],
        "es": language_to_sentences["es"][:1000],
    }
)
data_table.DataTable(explore_df, include_index=False, num_rows_per_page=5)

Unnamed: 0,en,es
0,"$10,000 Gold?",¿El oro a 10.000 dólares?
1,SAN FRANCISCO – It has never been easy to have...,SAN FRANCISCO – Nunca ha resultado fácil soste...
2,"Lately, with gold prices up more than 300% ove...","Últimamente, con los precios del oro más de un..."
3,"Just last December, fellow economists Martin F...","Apenas en el pasado mes de diciembre, mis cole..."
4,Wouldn’t you know it?,¿Y saben qué?
...,...,...
995,"WASHINGTON, DC – Whether we like it or not, th...",Consideremos las economías avanzadas.
996,But recent economic trends suggest that this c...,"Durante las últimas dos décadas, el crecimient..."
997,Consider the advanced economies.,"Como resultado, la participación del consumo e..."
998,"During the last two decades, economic growth i...",1990


<a name="2"></a>
## 2. Build the semantic-similarity search engine

#### Using a pre-trained model to transform sentences into vectors

We compute embeddings in _batches_ so that they fit in the GPU's RAM.

In [6]:
batch_size = 2048
embeddings = []
for indx in trange(0, len(sentences), batch_size):
    # Convert each batch to NumPy right away instead of collecting row tensors.
    embeddings.append(embed_text(sentences[indx:indx+batch_size]).numpy())

embeddings = np.concatenate(embeddings).astype("float32")

100%|██████████| 234/234 [01:27<00:00,  2.69it/s]
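
The same batching pattern can be written as a reusable helper that is independent of the model. The sketch below is illustrative only: `embed_in_batches` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for the real encoder so the example runs without TensorFlow.

```python
import numpy as np


def embed_in_batches(texts, embed_fn, batch_size=2048):
    """Embed `texts` in fixed-size batches and return one float32 matrix."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # np.asarray converts whatever embed_fn returns (list, tensor) to an array.
        chunks.append(np.asarray(embed_fn(batch), dtype="float32"))
    return np.concatenate(chunks, axis=0)


def toy_embed(batch):
    """Hypothetical stand-in for the model: (character count, number of spaces)."""
    return [[len(text), text.count(" ")] for text in batch]


vectors = embed_in_batches(["a b", "hello", "one two three"], toy_embed, batch_size=2)
print(vectors.shape)  # (3, 2)
```

With the real model, `embed_fn` would be `embed_text`; only one batch of activations needs to fit in GPU memory at a time.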


#### Building an index of semantic vectors

In [8]:
num_sentences, embedding_dimensions = embeddings.shape
print(f"Number of sentences:  {num_sentences:,}")
print(f"Embedding dimensions: {embedding_dimensions}")

Number of sentences:  477,672
Embedding dimensions: 512


In [9]:
index = faiss.IndexFlatL2(embedding_dimensions)
index = faiss.IndexIDMap(index)
ids = np.arange(num_sentences, dtype="int64")  # Faiss expects 64-bit integer ids
index.add_with_ids(embeddings, ids)
print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 477672
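
A flat index does no clustering or compression: each query is compared against every stored vector. The NumPy sketch below illustrates that idea on toy data. It is not how Faiss is implemented, and `brute_force_l2_search` and the toy arrays are hypothetical names introduced for this example.

```python
import numpy as np


def brute_force_l2_search(vectors, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    # ||q - v||^2 = ||q||^2 - 2 q.v + ||v||^2, computed for all pairs at once.
    sq_dist = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ vectors.T
        + (vectors ** 2).sum(axis=1)
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    ids = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, ids, axis=1), ids


toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
toy_queries = np.array([[0.9, 0.1]], dtype="float32")
distances, ids = brute_force_l2_search(toy_vectors, toy_queries, k=2)
print(ids.tolist())  # [[1, 0]]
```

The cost grows linearly with the number of stored vectors, which is why exhaustive search is practical at this corpus size but larger collections usually call for an approximate index.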


In [12]:
D, I = index.search(np.array([embeddings[2]]), k=10)
# IndexFlatL2 reports squared L2 distances.
print(
    f"L2 distance: {D.flatten().tolist()}\n\nSentence ids: {I.flatten().tolist()}"
)

L2 distance: [0.0, 0.2768366038799286, 0.8137745261192322, 0.8345040678977966, 0.8759987354278564, 0.9003213047981262, 0.9229668974876404, 0.9233704805374146, 0.9258378744125366, 0.926464319229126]

Sentence ids: [2, 238821, 238825, 6, 99478, 16, 99477, 178450, 178448, 238835]
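
`IndexFlatL2` reports squared L2 distances. If the embeddings are unit-length (an assumption here; it can be checked with `np.linalg.norm(embeddings, axis=1)`), a squared distance `d2` corresponds to a cosine similarity of `1 - d2 / 2`, which is often easier to interpret. A small sketch that verifies the identity on random unit vectors rather than on the real embeddings; `squared_l2_to_cosine` is a hypothetical helper name:

```python
import numpy as np


def squared_l2_to_cosine(squared_l2):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - np.asarray(squared_l2) / 2.0


# Check the identity on two random unit vectors (toy data, not real embeddings).
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 512))
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)
squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
print(np.isclose(squared_l2_to_cosine(squared_l2), cosine))  # True
```

Under the same assumption, ranking by L2 distance and ranking by cosine similarity give the same order of results.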


In [13]:
pd.DataFrame({"Similar Sentences": np.array(sentences)[I].flatten()})

Unnamed: 0,Similar Sentences
0,"Últimamente, con los precios del oro más de un..."
1,"Lately, with gold prices up more than 300% ove..."
2,"Gold prices even hit a record-high $1,300 rece..."
3,Los precios del oro incluso alcanzaron recient...
4,"Y para el año 2000, cuando el IPC de EE.UU. er..."
5,"El precio de hoy, 1.300 dólares, probablemente..."
6,"Diez años después, el índice de precios al con..."
7,"En abril, el oro se vendía por cerca de 1300 d..."
8,"En su punto máximo, los llamados escarabajos d..."
9,"At $1,300, today’s price is probably more than..."


<a name="3"></a>
## 3. Test the semantic-similarity search engine

- Try entering different sentences, or select one from the samples list
- Try different languages (EN/ES)
- Try changing the number of similar sentences
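
One way to make these experiments easy to repeat is to wrap the query steps in a helper. The sketch below is a hedged illustration: `semantic_search`, `ToyIndex`, and `toy_embed` are hypothetical names, and the toy index only mimics the `search(queries, k)` call so the example runs without Faiss or the model.

```python
import numpy as np


def semantic_search(query, embed_fn, index, sentences, k=10):
    """Return the k sentences whose embeddings are closest to the query."""
    # Embed a one-element batch so the result is a (1, dim) float32 matrix.
    query_vec = np.asarray(embed_fn([query]), dtype="float32")
    distances, ids = index.search(query_vec, k)
    return [(sentences[i], float(d)) for i, d in zip(ids[0], distances[0])]


class ToyIndex:
    """Hypothetical stand-in exposing a search(queries, k) method."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype="float32")

    def search(self, queries, k):
        diff = queries[:, None, :] - self.vectors[None, :, :]
        sq_dist = (diff ** 2).sum(axis=2)
        ids = np.argsort(sq_dist, axis=1)[:, :k]
        return np.take_along_axis(sq_dist, ids, axis=1), ids


def toy_embed(batch):
    """Hypothetical embedding: (character count, number of spaces)."""
    return [[len(t), t.count(" ")] for t in batch]


toy_sentences = ["hi", "hello there", "a much longer sentence here"]
toy_index = ToyIndex(toy_embed(toy_sentences))
results = semantic_search("hello world", toy_embed, toy_index, toy_sentences, k=2)
print(results)  # [('hello there', 0.0), ('hi', 82.0)]
```

In the notebook, the call would presumably look like `semantic_search("Global warming", embed_text, index, sentences, k=10)`.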

In [17]:
sample_query = "Global warming" #@param ["Global warming", "Researchers made a surprising new discovery last week.", "The stock market fell four points.", "Lawmakers will vote on the proposal tomorrow."] {allow-input: true}
num_results = 10 #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(sample_query)[0]
_, I = index.search(np.array([query_embedding]), k=num_results)

pd.DataFrame({"Similar Sentences": np.array(sentences)[I].flatten()})

Unnamed: 0,Similar Sentences
0,Take global warming.
1,The Real Danger of Global Warming
2,Global warming is real.
3,The Diseases of Global Warming
4,Talking Sense About Global Warming
5,Scared Silly about Global Warming
6,El verdadero peligro del calentamiento global
7,El calentamiento global es real.
8,Global Warnings
9,Global warming is a long-term problem.
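
To put a score next to each result, the arccos-based similarity used in `visualize_similarity` can be applied to the query and result embeddings directly. A minimal NumPy sketch on toy vectors; `angular_similarity` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np


def angular_similarity(a, b):
    """1 - arccos(cosine similarity) / pi for every pair of rows in a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cosine = np.clip(a @ b.T, -1.0, 1.0)  # keep arccos inside its domain
    return 1.0 - np.arccos(cosine) / np.pi


# Toy vectors: identical, orthogonal, and opposite to the first row.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
sim = angular_similarity(toy[:1], toy)
print(sim.tolist())  # [[1.0, 0.5, 0.0]]
```

The score is 1 for vectors pointing the same way, 0.5 for orthogonal vectors, and 0 for opposite vectors.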


<a name="4"></a>
## 4. References

### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. arXiv preprint arXiv:1907.04307.

### [Dataset: News-Commentary](http://opus.nlpl.eu/News-Commentary.php)
J. Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).

### [Faiss: Billion-scale similarity search with GPUs](https://github.com/facebookresearch/faiss)
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. arXiv preprint arXiv:1702.08734.