# MultilingualSentencesIndexing

This notebook builds an indexing model for use by an inference API. The API takes any input text and returns the top N most probable FAQ_ids.

To support the API, we need a model that can retrieve the questions related to any input. The model:
- takes any input text
- encodes the input into an embedding
- uses the same similarity calculation method to compare the input's embedding with the embeddings of all questions
- returns the top_n closest questions, each with its FAQ_id and some other information
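
The steps above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's implementation: the 3-dimensional vectors are made-up stand-ins for real encoder output, the helper names `l2_normalize` and `top_n_indices` are hypothetical, and plain cosine similarity is used for ranking.

```python
import numpy as np


def l2_normalize(embeds):
    """Scale every row vector to unit length."""
    norms = np.linalg.norm(embeds, axis=1, keepdims=True)
    return embeds / norms


def top_n_indices(query_embed, question_embeds, top_n):
    """Return the row indices of the top_n most similar questions, best first."""
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = question_embeds @ query_embed
    order = np.argsort(-scores, kind="stable")[:top_n]
    return order, scores[order]


# Toy "embeddings" (assumption: real ones come from the sentence encoder).
faq_ids = np.array([132, 226, 273])
question_embeds = l2_normalize(np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
]))
query_embed = l2_normalize(np.array([[0.9, 0.2, 0.1]]))[0]

idx, scores = top_n_indices(query_embed, question_embeds, top_n=2)
top_faq_ids = faq_ids[idx]
print(top_faq_ids)  # the query points mostly along the first axis, so 132 ranks first
```

Normalizing once up front means each query only costs one matrix-vector product against the stored question embeddings.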

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Parameters-to-be-input-before-running" data-toc-modified-id="Parameters-to-be-input-before-running-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Parameters to be input before running</a></span></li><li><span><a href="#Custom-Indexing-Model" data-toc-modified-id="Custom-Indexing-Model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Custom Indexing Model</a></span></li><li><span><a href="#Build-model" data-toc-modified-id="Build-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Build model</a></span></li><li><span><a href="#Verify" data-toc-modified-id="Verify-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Verify</a></span></li><li><span><a href="#Save-Model" data-toc-modified-id="Save-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save Model</a></span></li><li><span><a href="#Next" data-toc-modified-id="Next-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Next</a></span></li></ul></div>

## Parameters to be input before running

In [1]:
input_file_path = '/data/extracted_data.csv'
output_file_path = '/data/closest_matches.csv'
output_model_path = '/codes/multilingual_indexing_model.pkl'

In [2]:
# Data Manipulation
import pandas as pd
import numpy as np

# Modelling
from sklearn.metrics.pairwise import cosine_similarity
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
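# Imported for its side effect: tensorflow_text registers the ops
# that the TensorFlow Hub preprocessor needs
from tensorflow_text import SentencepieceTokenizer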

# Save model
import joblib

In [3]:
# Load results built by MultilingualSentencesGrouping.ipynb 
result_df = pd.read_csv(output_file_path, index_col=0)
result_df['FAQ_id'] = result_df['FAQ_id'].astype(int)
result_df.head()



In [4]:
# Load modules from Tensorflow Hub
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2"
)
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1"
)



## Custom Indexing Model

In [5]:
class Indexing:
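    """
    Index FAQ questions by their sentence embeddings and retrieve the
    questions closest to a query

    -------------------------
    Args:
        preprocessor: loaded TensorFlow Hub preprocessing layer
        encoder: loaded TensorFlow Hub sentence encoder layer
        closest_matches_df (dataframe): FAQ questions with at least the
            columns "question" and "answer"; the remaining columns
            (e.g. FAQ_id, locale, market) are returned with each match

    Attributes:
        questions (list): unique questions found in closest_matches_df
        questions_embeds: embeddings of questions; None until they are
            assigned after construction
    """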

    def __init__(self, preprocessor, encoder, closest_matches_df):
        # Store loaded modules to avoid reloading
        self.preprocessor = preprocessor
        self.encoder = encoder

        # All questions in closest_matches files
        self.closest_matches_df = closest_matches_df
        self.questions = list(set(closest_matches_df["question"].values))
        self.questions_embeds = None

    def normalization(self, embeds):
        """
        Use l2 normalization to embeddings
        -------------------------
        Args:
            embeds (vector): embedding (high-dimensional) vectors produced by encoder

        Returns:
            norms_embeds (vector): normalized embedding (high-dimensional) vectors
        """
        norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
        return embeds / norms

    def embeddings(self, sentences):
        """
        Encode raw sentences into embedding (high-dimensional) vectors
        -------------------------
        Args:
            sentences (list of str): raw sentences to encode

        Returns:
            sentences_embeds (vector): l2-normalized embedding
            (high-dimensional) vectors, one per sentence
        """
        with tf.device('/cpu:0'):
            sentences_embeds = tf.constant(sentences)
            sentences_embeds = self.encoder(
                self.preprocessor(sentences_embeds))["default"]
            # For semantic similarity tasks, apply l2 normalization to embeddings
            sentences_embeds = self.normalization(sentences_embeds)
        return sentences_embeds

    def calculate_similarity(self, embeddings_1, embeddings_2, labels_1,
                             labels_2):
        """
        Calculate the arccos-based (angular) text similarity between
        two sets of high-dimensional vectors
        -------------------------
        Args:
            embeddings_1 (vector): embeddings produced by encoder
            embeddings_2 (vector): embeddings produced by encoder
            labels_1 (list): texts used to produce embeddings_1
            labels_2 (list): texts used to produce embeddings_2

        Returns:
            df (dataframe): a pandas dataframe with three columns
            (query, question, sim): the text from labels_1, the text from
            labels_2, and the similarity between the two texts
        """
        assert len(embeddings_1) == len(labels_1)
        assert len(embeddings_2) == len(labels_2)

        # arccos based text similarity (Yang et al. 2019; Cer et al. 2019).
        # Rounding can push a cosine slightly outside [-1, 1], where arccos
        # returns NaN, so clip before taking the arccos.
        cos = np.clip(cosine_similarity(embeddings_1, embeddings_2), -1.0, 1.0)
        sim = 1 - np.arccos(cos) / np.pi

        embeddings_1_col, embeddings_2_col, sim_col = [], [], []
        for i in range(len(embeddings_1)):
            for j in range(len(embeddings_2)):
                embeddings_1_col.append(labels_1[i])
                embeddings_2_col.append(labels_2[j])
                sim_col.append(sim[i][j])
        df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col),
                          columns=['query', 'question', 'sim'])

        # Treat any remaining NaN as a perfect match
        df = df.fillna(1)
        return df

    def get_top_n_faqs(self, query, top_n):
        """
        Get top N closest questions based on input text
        -------------------------
        Args:
            query (str): the input text we want to classify against
            top_n (int): the number of results we want to get

        Returns:
            res (dict): a dictionary keyed by row position, like
            {0: {"question": "Deposit fee", "Ranking": 1.0, "FAQ_id": 132, "locale": "en", "market": "en-de"},
             1: {"question": "Deposit fee", "Ranking": 1.0, "FAQ_id": 132, "locale": "en", "market": "en-it"}}
        """
        query = [query]

        query_embeds = self.embeddings(query)

        res = self.calculate_similarity(query_embeds, self.questions_embeds,
                                        query, self.questions).nlargest(
                                            top_n, ['sim'])

        res['Ranking'] = res['sim'].rank(ascending=False)
        res = res.merge(self.closest_matches_df, how='left',
                        on='question').drop(['query', 'sim'], axis=1)
        res = res[:top_n].drop('answer', axis=1).T.to_dict()
        return res
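
The arccos-based similarity used in `calculate_similarity` maps a cosine in [-1, 1] to a score in [0, 1]: identical directions score 1, orthogonal ones 0.5, and opposite ones 0. The standalone sketch below illustrates that mapping on toy 2-dimensional vectors. The function name `angular_similarity` is hypothetical, and the clipping step is one possible guard against the NaN that `arccos` returns when rounding pushes a cosine slightly past 1.

```python
import numpy as np


def angular_similarity(a, b):
    """Return 1 - arccos(cosine) / pi for every pair of rows in a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = a @ b.T
    # Keep cosines inside the domain of arccos despite rounding error.
    cos = np.clip(cos, -1.0, 1.0)
    return 1.0 - np.arccos(cos) / np.pi


query = np.array([[1.0, 0.0]])
candidates = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite to the query
])
sim = angular_similarity(query, candidates)
print(sim)  # approximately [[1.0, 0.5, 0.0]]
```

Because arccos is monotonic, this score ranks candidates in the same order as plain cosine similarity; only the spacing between scores changes.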

## Build model

In [6]:
# Initialize model
model = Indexing(preprocessor, encoder, result_df)

In [7]:
# Compute the question embeddings (takes about 2 minutes).
# questions_embeds is kept at notebook level so that re-running this cell
# reuses the existing embeddings instead of recomputing them.
try:
    model.questions_embeds = questions_embeds
except NameError:
    print('No questions_embeds found')
    questions_embeds = model.embeddings(model.questions)
    model.questions_embeds = questions_embeds

No questions_embeds found




## Verify

In this section we show how to retrieve sentences related to a given input. Things to try:
- Try a few different sample sentences
- Try changing the number of returned results (they are returned in order of similarity)
- Try the cross-lingual capabilities by entering text in different languages (as a sanity check, you may want to translate some results into your native language, e.g. with Google Translate)


In [8]:
%%time
query = 'deposit fee'
top_n = 10

model.get_top_n_faqs(query, top_n)




{0: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-de'},
 1: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-it'},
 2: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-fr'},
 3: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-eu'},
 4: {'question': 'Frais de dépôt',
  'Ranking': 2.0,
  'FAQ_id': 132,
  'locale': 'fr',
  'market': 'fr-fr'},
 5: {'question': 'Documentos bancarios',
  'Ranking': 3.0,
  'FAQ_id': 226,
  'locale': 'es',
  'market': 'es-es'},
 6: {'question': 'Verwahrentgelt',
  'Ranking': 4.0,
  'FAQ_id': 132,
  'locale': 'de',
  'market': 'de-de'},
 7: {'question': 'Wie lange dauert eine Überweisung?',
  'Ranking': 5.0,
  'FAQ_id': 273,
  'locale': 'de',
  'market': 'de-at'},
 8: {'question': 'Wie lange dauert eine Überweisung?',
  'Ranking': 5.0,
  'FAQ_id': 273,
  'locale': 'de'

## Save Model

In [9]:
# The TensorFlow Hub layers cannot be saved with joblib, so drop them before saving.
# The API will reload them separately.
model.preprocessor = None
model.encoder = None

In [10]:
with open(output_model_path, 'wb') as f:
    joblib.dump(model, f)

In [11]:
with open(output_model_path, 'rb') as f:
    test = joblib.load(f)

In [12]:
test.preprocessor = preprocessor
test.encoder = encoder
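
The strip, save, load, reattach cycle above can also be expressed by persisting only the plain data and rebuilding the object around freshly loaded modules. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses the standard library's `pickle` in place of joblib, `SimpleNamespace` as a stand-in for the `Indexing` object, lambdas as stand-ins for the TensorFlow Hub layers, and the hypothetical helpers `dump_state` and `load_state`.

```python
import pickle
from types import SimpleNamespace


def dump_state(index):
    """Serialize only the plain-data attributes of an index-like object."""
    state = {
        "questions": list(index.questions),
        "questions_embeds": [[float(x) for x in row]
                             for row in index.questions_embeds],
    }
    return pickle.dumps(state)


def load_state(payload, preprocessor, encoder):
    """Rebuild an index-like object and attach freshly loaded modules."""
    state = pickle.loads(payload)
    return SimpleNamespace(preprocessor=preprocessor, encoder=encoder, **state)


# Lambdas cannot be pickled, so they play the role of the TF Hub layers here.
index = SimpleNamespace(
    preprocessor=lambda text: text,
    encoder=lambda text: text,
    questions=["Deposit fee", "Frais de dépôt"],
    questions_embeds=[[1.0, 0.0], [0.8, 0.6]],  # toy values
)

payload = dump_state(index)
restored = load_state(payload, index.preprocessor, index.encoder)
print(restored.questions)
```

Saving only data keeps the file independent of where the class is defined. Pickling the object itself stores the class by reference, so the loading process must be able to import that class.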

In [13]:
%%time
query = 'deposit fee'
top_n = 10

test.get_top_n_faqs(query, top_n)


{0: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-de'},
 1: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-it'},
 2: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-fr'},
 3: {'question': 'Deposit fee',
  'Ranking': 1.0,
  'FAQ_id': 132,
  'locale': 'en',
  'market': 'en-eu'},
 4: {'question': 'Frais de dépôt',
  'Ranking': 2.0,
  'FAQ_id': 132,
  'locale': 'fr',
  'market': 'fr-fr'},
 5: {'question': 'Documentos bancarios',
  'Ranking': 3.0,
  'FAQ_id': 226,
  'locale': 'es',
  'market': 'es-es'},
 6: {'question': 'Verwahrentgelt',
  'Ranking': 4.0,
  'FAQ_id': 132,
  'locale': 'de',
  'market': 'de-de'},
 7: {'question': 'Wie lange dauert eine Überweisung?',
  'Ranking': 5.0,
  'FAQ_id': 273,
  'locale': 'de',
  'market': 'de-at'},
 8: {'question': 'Wie lange dauert eine Überweisung?',
  'Ranking': 5.0,
  'FAQ_id': 273,
  'locale': 'de'

## Next
- Put the `Indexing` class from the Custom Indexing Model section (section 2) in /codes/MultilingualSentencesIndexing.py, so we can use it to build the API