# Multilingual Universal Sentence Encoder Q&A Retrieval

**Acknowledgements:**
1. Tutorial on [Colab](https://www.tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa)  
2. Notebook on [Github](https://github.com/tensorflow/hub/blob/master/examples/colab/retrieval_with_tf_hub_universal_encoder_qa.ipynb)  

**References:**  
1. [Universal Encoder Multilingual Q&A Model](https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3)  
2. SQuAD dataset:
   1. [Home](https://rajpurkar.github.io/SQuAD-explorer/)  
   2. [Retrieval evaluation](https://github.com/google/retrieval-qa-eval)  
   3. [v1.0 paper](https://arxiv.org/abs/1606.05250)  
   4. [v2.0 paper](https://arxiv.org/abs/1806.03822)  
3. Simple Neighbors library:  
   1. [Docs](https://simpleneighbors.readthedocs.io/en/latest/)  
   2. [pypi](https://pypi.org/project/simpleneighbors/)  
4. TensorFlow components:
   1. [TF Text (GitHub guide with examples)](https://github.com/tensorflow/text)  
   2. [TF Embeddings (tutorial)](https://www.tensorflow.org/tutorials/text/word_embeddings)  
5. NLTK:
   1. [Home](https://www.nltk.org/)  
   2. [Book](https://www.nltk.org/book/)  

**Table of Contents:**  
1. [Example use of the sentence encoder model](#Example-use)  
2. [Tutorial setup](#Setup)  
3. [SQuAD utility functions](#SQuAD-utility-functions)  
4. [Visualization functions](#Visualization-functions)  
5. [SQuAD extraction](#SQuAD-extraction)  
6. [Encoder setup](#Encoder-setup)  
7. [Embedding computation](#Embedding-computation) with **`response_encoder`**
8. [Retrieval](#Retrieval) with **`question_encoder`**

## Example use

A minimal example of using the [universal-sentence-encoder-multilingual-qa](https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3) model. It does little on its own, but it serves as a canary for the environment: it encodes one question and two candidate responses, then computes the dot product of the question and response embeddings to identify the most likely response.

### TF Hub module signatures
*Signatures* are [input-output specifications for TF Hub modules](https://www.tensorflow.org/hub/common_signatures), aiming to achieve interoperability and interchangeability without knowing the internals. For the sentence encoder, they are `question_encoder` and `response_encoder`. Notice they are called as follows:
```python
module.signatures['question_encoder']()
```

In [8]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tensorflow_text  # imported for its side effect: registers custom ops the model uses

questions = ["What is your age?"]
responses = ["I am 20 years old.", "good morning"]
response_contexts = ["I will be 21 next year.", "great day."]

module = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3')

question_embeddings = module.signatures['question_encoder'](
            tf.constant(questions))
response_embeddings = module.signatures['response_encoder'](
        input=tf.constant(responses),
        context=tf.constant(response_contexts))

np.inner(question_embeddings['outputs'], response_embeddings['outputs'])

array([[0.4088399 , 0.08877401]], dtype=float32)
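Each row of the `np.inner` result holds one question's scores against every candidate response, and the highest score marks the most likely response. A minimal sketch of that last step in plain Python, reusing the scores printed above rather than re-running the model (`best_response` is a hypothetical helper, not part of the tutorial):

```python
# Scores copied from the cell output above: one row per question,
# one column per candidate response.
scores = [[0.4088399, 0.08877401]]
responses = ["I am 20 years old.", "good morning"]

def best_response(score_row, candidates):
    """Return the highest-scoring candidate together with its score."""
    best = max(range(len(candidates)), key=lambda i: score_row[i])
    return candidates[best], score_row[best]

answer, score = best_response(scores[0], responses)
print(answer, score)
```

With real model outputs, the same selection could be done with `np.argmax` over each row.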

## Setup

In [7]:
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request  # the submodule must be imported explicitly for urlopen
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ivogeorg/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## SQuAD utility functions

In [9]:
def download_squad(url):
    return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
    all_sentences = []
    for data in squad['data']:
        for paragraph in data['paragraphs']:
            sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
            all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))  # (text, context) where context is all the sentences
    return list(set(all_sentences))  # remove duplicates

def extract_questions_from_squad_json(squad):
    questions = []
    for data in squad['data']:
        for paragraph in data['paragraphs']:
            for qas in paragraph['qas']:
                if qas['answers']:
                    questions.append((qas['question'], qas['answers'][0]['text']))
    return list(set(questions))

## Visualization functions

In [11]:
def output_with_highlight(text, highlight):
    output = '<li> '
    i = text.find(highlight)
    while True:
        if i == -1:
            output += text
            break
        output += text[0:i]
        output += '<b>' + text[i:i + len(highlight)] + '</b>'
        text = text[i + len(highlight):]
        i = text.find(highlight)
    return output + '</li>\n'

def display_nearest_neighbors(query_text, answer_text=None):
    query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
    search_results = index.nearest(query_embedding, n=num_results)
    
    if answer_text:
        result_md = '''
        <p>Random Question from SQuAD:</p>
        <p>&nbsp;&nbsp;<b>%s</b></p>
        <p>Answer:</p>
        <p>&nbsp;&nbsp;<b>%s</b></p>
        ''' % (query_text, answer_text)
    else:
        result_md = '''
        <p>Random Question from SQuAD:</p>
        <p>&nbsp;&nbsp;<b>%s</b></p>
        ''' % query_text
        
    result_md += '''
        <p>Retrieved sentences:</p>
        <ol>
        '''
    
    if answer_text:
        for s in search_results:
            result_md += output_with_highlight(s, answer_text)
    else:
        for s in search_results:
            result_md += '<li>' + s + '</li>\n'
            
    result_md += '</ol>'
    display(HTML(result_md))
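The highlighting loop in `output_with_highlight` amounts to a left-to-right, non-overlapping replacement of every occurrence of the answer text. A compact sketch of the same idea, assuming the intent is to close each bold span with `</b>` (`highlight_all` is a hypothetical name, and like the original it does not HTML-escape the text):

```python
def highlight_all(text, highlight):
    """Wrap every occurrence of `highlight` in <b>...</b> inside a list item."""
    if not highlight:
        # Nothing to highlight; also avoids replacing the empty string.
        return '<li> ' + text + '</li>\n'
    bold = '<b>' + highlight + '</b>'
    return '<li> ' + text.replace(highlight, bold) + '</li>\n'

item = highlight_all('the cat saw the dog', 'the')
print(item)
```

The match is case-sensitive, so an answer that appears with different capitalization in a retrieved sentence will not be highlighted.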

## SQuAD extraction

The SQuAD dataset will be extracted into:
* **sentences** as a list of *(text, context)* tuples: each SQuAD paragraph is split into sentences, and each sentence is paired with its full paragraph to form a *(text, context)* tuple.  
* **questions** as a list of *(question, answer)* tuples.
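To make the two shapes concrete, here is a small sketch that walks a hand-written record in the SQuAD layout (`data` → `paragraphs` → `context` and `qas`). The record is a toy example, not taken from the dataset, and `split_sentences` is a naive regex stand-in for `nltk.tokenize.sent_tokenize`, used only so the snippet runs without NLTK:

```python
import re

# Hand-written toy record following the SQuAD JSON layout.
toy_squad = {
    "data": [{
        "paragraphs": [{
            "context": "Denver is the capital of Colorado. "
                       "It sits one mile above sea level.",
            "qas": [
                {"question": "What is the capital of Colorado?",
                 "answers": [{"text": "Denver"}]},
                # No answers: skipped by the `if qas['answers']` check.
                {"question": "Who founded Denver?", "answers": []},
            ],
        }]
    }]
}

def split_sentences(text):
    # Naive stand-in for sent_tokenize: split after '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())

toy_sentences = []
toy_questions = []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        # Every sentence is paired with the whole paragraph it came from.
        toy_sentences.extend((s, context) for s in split_sentences(context))
        for qa in paragraph["qas"]:
            if qa["answers"]:
                toy_questions.append((qa["question"], qa["answers"][0]["text"]))

print(toy_sentences)
print(toy_questions)
```

Note that only the first listed answer is kept for each question, mirroring `qas['answers'][0]['text']` in the utility function.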

In [13]:
squad_versions = [
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json", 
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json", 
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json",
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"]
squad_url = squad_versions[3]

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print('{} sentences, {} questions extracted from SQuAD {}'.format(len(sentences), len(questions), squad_url))

print('\nExample sentence and context: \n')
sentence = random.choice(sentences)
print('sentence:\n')
pprint.pprint(sentence[0])
print('context:\n')
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context: 

sentence:

('In GR, gravitation is not viewed as a force, but rather, objects moving '
 'freely in gravitational fields travel under their own inertia in straight '
 'lines through curved space-time – defined as the shortest space-time path '
 'between two space-time events.')
context:

('Since then, and so far, general relativity has been acknowledged as the '
 'theory that best explains gravity. In GR, gravitation is not viewed as a '
 'force, but rather, objects moving freely in gravitational fields travel '
 'under their own inertia in straight lines through curved space-time – '
 'defined as the shortest space-time path between two space-time events. From '
 'the perspective of the object, all motion occurs as if there were no '
 'gravitation whatsoever. It is only when observing the motion in a global '
 'sense that the curvature 

## Encoder setup

In [14]:
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3'
model = hub.load(module_url)

## Embedding computation

The embeddings of all the *(text, context)* tuples are computed with the **`response_encoder`** and stored in a [`simpleneighbors`](https://pypi.org/project/simpleneighbors/) index.

In [15]:
batch_size = 100

encodings = model.signatures['response_encoder'](
    input=tf.constant([sentences[0][0]]),
    context=tf.constant([sentences[0][1]])
)

index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embedding for {} sentences'.format(len(sentences)))

# Group sentences into tuples of batch_size consecutive items by zipping one
# shared iterator with itself; a trailing partial batch is dropped.
slices = zip(*(iter(sentences),) * batch_size)
num_batches = len(sentences) // batch_size  # number of full batches

for s in tqdm(slices, total=num_batches):
    response_batch = [r for r, c in s]
    context_batch = [c for r, c in s]
    encodings = model.signatures['response_encoder'](
        input=tf.constant(response_batch),
        context=tf.constant(context_batch)
    )
    for batch_index, batch in enumerate(response_batch):
        index.add_one(batch, encodings['outputs'][batch_index])
        
index.build()

print('simpleneighbors index for {} sentences built.'.format(len(sentences)))

Computing embedding for 10455 sentences



simpleneighbors index for 10455 sentences built.
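The `zip(*(iter(sentences),) * batch_size)` idiom is terse, so here is a small sketch of what it does on a short list. Because `zip` receives the same iterator `batch_size` times, each output tuple consumes `batch_size` consecutive items, and `zip` stops as soon as it cannot fill a tuple. If that reading is right, 10455 sentences with a batch size of 100 give 104 full batches and leave the last 55 sentences out of the index. The `batched` helper is a hypothetical alternative that keeps the remainder, not something the tutorial uses:

```python
items = list(range(7))
batch_size = 3

# The same iterator object appears batch_size times in the argument list,
# so zip draws consecutive items for each position of a tuple.
shared = iter(items)
full_batches = list(zip(*(shared,) * batch_size))
print(full_batches)  # [(0, 1, 2), (3, 4, 5)]  (item 6 is dropped)

def batched(seq, size):
    """Yield consecutive slices of `seq`, including a shorter final one."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

all_batches = list(batched(items, batch_size))
print(all_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Swapping in slice-based batching would also change the batch count passed to `tqdm`, which would need to round up rather than down.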


## Retrieval

Upon retrieval, the question is encoded using the **`question_encoder`**, and the resulting question embedding is used to query the [`simpleneighbors`](https://pypi.org/project/simpleneighbors/) index. **TODO:** Split the question encoding out of the neighbor display in `display_nearest_neighbors`.

In [17]:
num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])
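Conceptually, each point in the index is one sentence's embedding vector with the sentence text attached as the item, and a query returns the items whose vectors lie closest to the question embedding. A brute-force sketch of that lookup in plain Python, assuming the `angular` metric orders results the same way cosine similarity does; the library itself typically delegates to a faster, possibly approximate, backend rather than this exhaustive scan. The vectors and the `nearest` helper below are toy stand-ins:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def nearest(query_vec, corpus, n):
    """corpus: list of (item, vector) pairs; return the n most similar items."""
    ranked = sorted(corpus, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [item for item, _ in ranked[:n]]

# Toy 2-d vectors standing in for response_encoder outputs.
toy_corpus = [
    ("sentence A", [1.0, 0.0]),
    ("sentence B", [0.0, 1.0]),
    ("sentence C", [0.7, 0.7]),
]
toy_query = [0.9, 0.1]  # stands in for a question_encoder output

top = nearest(toy_query, toy_corpus, n=2)
print(top)
```

Only the sentence text is stored as the item, so the surrounding paragraph influences retrieval solely through the `context` input given to the `response_encoder`.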

**TODO:**
1. Review SQuAD: question, answer, text, context. *How is the answer supposed to be generated/retrieved?*  
2. What do the points in the `SimpleNeighbors` index represent? How is it an *index*?  
3. Are only the sentences indexed? Isn't this tantamount to *"taking them out of context"*?  