<a href="https://colab.research.google.com/github/jocelynprince1/NLP_Experiments/blob/master/Q%26A_Universal_Encoder_Tests_Magog_R%C3%A9glements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q&A With Universal Sentence Encoder - TensorFlow Hub module

#### Jocelyn Prince (Mar 2020) based on examples provided by references below
---

### Description
Find specific statements in the "Règlements de la Ville de Magog" (City of Magog bylaws) using semantic search, i.e. matching queries that do not use the same words but carry the same meaning.

References:

* ELMo: Contextual language embedding by Josh Taylor: https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604

* Prateek Joshi and his post: https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/

* https://arxiv.org/pdf/1802.05365.pdf

* https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb#scrollTo=Eg62Pmz3o83v

* Aurélien Géron’s book: *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, O’Reilly Media Inc.

-----

This is a demo of the [Universal Encoder Multilingual Q&A model](https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3) for question-answer retrieval of text, illustrating the use of the model's **question_encoder** and **response_encoder**.

The original notebook used sentences from [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the **response_encoder**. These embeddings are stored in an index built using the [simpleneighbors](https://pypi.org/project/simpleneighbors/) library for question-answer retrieval.
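
The pairing of each sentence with its surrounding context can be sketched in plain Python. This is a minimal illustration rather than the notebook's own code: `build_context_pairs` and its `window` parameter are hypothetical names, the sentences are toy data, and a sliding window stands in for the whole-paragraph context used in the SQuAD demo.

```python
def build_context_pairs(sentences, window=1):
    """Pair each sentence with the text of its neighbours.

    `window` is the number of sentences taken on each side (an assumption
    made for this sketch); the window is clipped at both ends of the list.
    """
    pairs = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        context = " ".join(sentences[start:end])
        pairs.append((sentence, context))
    return pairs


# Toy data, not taken from the Magog bylaws.
toy_sentences = [
    "A permit is required.",
    "It costs 50 dollars.",
    "It is valid for one year.",
]
pairs = build_context_pairs(toy_sentences, window=1)
for sentence, context in pairs:
    print(repr(sentence), "->", repr(context))
```

Each `(sentence, context)` pair corresponds to the `input` and `context` arguments that the **response_encoder** receives later in this notebook.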

Note: This Colab requires the **Python 3** runtime type, which can be selected from the ***Runtime->Change Runtime type*** dropdown menu above. For faster processing, select the "**GPU**" hardware accelerator. Indexing the SQuAD train 2.0 dataset (~94,000 sentences) on a GPU takes an estimated 3 minutes.

In [0]:
#%%capture
#@title Setup Environment
# Install the latest Tensorflow version.
!pip3 install tensorflow_text
!pip3 install --upgrade tensorflow-gpu
!pip3 install tensorflow-hub
!pip3 install simpleneighbors
!pip3 install nltk

Collecting tensorflow_text
  Downloading https://files.pythonhosted.org/packages/78/e7/d260e51d44bea241e8eee39d0266df9b33a3d6219ded118a1c81a872e848/tensorflow_text-2.2.0-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.2.0
Collecting tensorflow-gpu
  Downloading https://files.pythonhosted.org/packages/31/bf/c28971266ca854a64f4b26f07c4112ddd61f30b4d1f18108b954a746f8ea/tensorflow_gpu-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2MB)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-2.2.0
Collecting simpleneighbors
  Downloading https://files.pythonhosted.org/packages/f9/10/9092e15d9aa4a9e5a263416121f124e565766767e7866e11d7074ec50df5/simpleneighbors-0.1.0-py2.py3-none-any.whl
Installing collected packages: simpleneighbors
Successfully installed sim

In [0]:
#@title Setup common imports and functions
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request  # urlopen lives in the urllib.request submodule
from IPython.display import HTML, display

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

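def output_with_highlight(text, highlight):
  """Return `text` as an HTML <li>, with each occurrence of `highlight` in bold."""
  output = "<li> "
  # Guard: str.find('') always returns 0, which would make the loop below run forever.
  i = text.find(highlight) if highlight else -1
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"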

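def display_nearest_neighbors(query_text, answer_text=None):
  # Relies on the globals `model`, `index` and `num_results` defined in later cells.
  # Encode the query with the question encoder, then look up the closest response embeddings.
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)
  print(search_results)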

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>'+ s +'</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# This notebook requires Python >= 3.5 and TensorFlow >= 2.0 (only the Python version is checked here)
import sys
assert sys.version_info >= (3, 5)


# Common imports
import time
import pandas as pd
import numpy as np

In [0]:
# *!**!*!*!*!*!*!*!*
# WARNING: files must be saved as UTF-8
# -----------------------------------------------------

#TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Terms_of_use/Apple_Terms_of_Use_2019.txt"
#TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Terms_of_use/Apple_Terms_of_Use_2019v2.txt"
#TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Terms_of_use/applev3.txt"
#TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Terms_of_use/weathernetworkv1.txt"
#TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Terms_of_use/Test1.txt"
TOS_file = "/content/drive/My Drive/Colab Notebooks/NLP/Reglements/Magog_reglements3.txt"



In [0]:
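import nltk
nltk.download('punkt')
# Read the file explicitly as UTF-8. The file starts with a byte-order mark, which
# appears as '\ufeff' in the output below; encoding='utf-8-sig' would strip it.
fileObj = open(TOS_file, 'r', encoding='utf-8')
text = fileObj.read()
# sent_tokenize defaults to the English Punkt model; language='french' may split this text better.
tokens = nltk.sent_tokenize(text)
print(tokens)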

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['\ufeffCHAPITRE I\nDISPOSITIONS DÉCLARATOIRES ET INTERPRÉTATIVES .SECTION I .DISPOSITIONS DÉCLARATOIRES .Titre .Le présent règlement est intitulé « Règlement de permis et certificats ».', '.Territoire assujetti .Le présent règlement s’applique à l’ensemble du territoire de la Ville de Magog.', '.Règlements remplacés .Toute disposition incompatible avec le présent règlement contenue dans tous les règlements municipaux antérieurs est, par la présente, remplacée.', '.Sans restreindre la généralité du premier alinéa, le présent règlement remplace le règlement 1382 du territoire de l’ancienne Ville de Magog, le règlement 10-2002 du territoire de l’ancien Canton de Magog et le règlement no 2000-197 du territoire de l’ancien Village d’Omerville et leurs amendements.', '.SECTION II .DISPOSITIONS INTERPRÉTATIVES .Interprétation des tableaux .Les annexes, c

In [0]:
fileObj.close()

In [0]:
df_statement = pd.DataFrame(tokens, columns=["statement"])

In [0]:
df_statement.shape

(462, 1)

In [0]:
# Pretty-print the first 20 statements
for i in range(20):
  pprint.pprint(df_statement.statement[i])

('\ufeffCHAPITRE I\n'
 'DISPOSITIONS DÉCLARATOIRES ET INTERPRÉTATIVES .SECTION I .DISPOSITIONS '
 'DÉCLARATOIRES .Titre .Le présent règlement est intitulé « Règlement de '
 'permis et certificats ».')
('.Territoire assujetti .Le présent règlement s’applique à l’ensemble du '
 'territoire de la Ville de Magog.')
('.Règlements remplacés .Toute disposition incompatible avec le présent '
 'règlement contenue dans tous les règlements municipaux antérieurs est, '
 'par la présente, remplacée.')
('.Sans restreindre la généralité du premier alinéa, le présent '
 'règlement remplace le règlement 1382 du territoire de l’ancienne Ville de '
 'Magog, le règlement 10-2002 du territoire de l’ancien Canton de Magog et le '
 'règlement no 2000-197 du territoire de l’ancien Village d’Omerville et '
 'leurs amendements.')
('.SECTION II .DISPOSITIONS INTERPRÉTATIVES .Interprétation des tableaux '
 '.Les annexes, croquis, tableaux, diagrammes, graphiques, symboles et toutes '



The following code block loads the [Universal Encoder Multilingual Q&A model](https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3) from TensorFlow Hub with `hub.load`. The loaded model exposes the **question_encoder** and **response_encoder** signatures used in the rest of this notebook.

In [0]:
#@title Load model from tensorflow hub
%%time
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3" #@param ["https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3", "https://tfhub.dev/google/universal-sentence-encoder-qa/3"]
model = hub.load(module_url)


CPU times: user 14 s, sys: 2.51 s, total: 16.5 s
Wall time: 19.5 s


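The following code blocks compute the embeddings for all (text, context) pairs with the **response_encoder** and store them in a [simpleneighbors](https://pypi.org/project/simpleneighbors/) index. The first cell encodes a single statement to obtain the embedding dimension and creates an empty index; the next cells encode every statement, add it to the index, and build the index.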
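
To make the role of the index concrete, here is a minimal brute-force stand-in written in plain Python. It is a sketch under stated assumptions, not how simpleneighbors works internally: `TinyIndex` is a hypothetical class that only mirrors the `add_one` / `build` / `nearest` calls used in this notebook, it ranks by cosine similarity (which yields the same ordering as angular distance), and the 3-dimensional vectors are toy values rather than real embeddings.

```python
import math


class TinyIndex:
    """Hypothetical brute-force index mirroring add_one / build / nearest."""

    def __init__(self, dims):
        self.dims = dims
        self.items = []    # stored texts
        self.vectors = []  # unit-length vectors, in the same order as items

    def add_one(self, item, vector):
        if len(vector) != self.dims:
            raise ValueError("expected a vector of length %d" % self.dims)
        norm = math.sqrt(sum(x * x for x in vector))
        self.items.append(item)
        self.vectors.append([x / norm for x in vector])

    def build(self):
        # Nothing to precompute for a linear scan; kept for interface parity.
        pass

    def nearest(self, vector, n=3):
        norm = math.sqrt(sum(x * x for x in vector))
        query = [x / norm for x in vector]
        # The cosine similarity of two unit vectors is their dot product.
        scored = [(sum(q * v for q, v in zip(query, vec)), item)
                  for vec, item in zip(self.vectors, self.items)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:n]]


# Toy vectors chosen by hand for illustration only.
toy_index = TinyIndex(3)
toy_index.add_one("tree felling permit", [0.9, 0.1, 0.0])
toy_index.add_one("parking rules", [0.0, 0.2, 0.9])
toy_index.add_one("forestry authorization", [0.8, 0.3, 0.1])
toy_index.build()
top_two = toy_index.nearest([1.0, 0.0, 0.0], n=2)
print(top_two)
```

A linear scan like this is fine for a few hundred statements; an approximate index becomes worthwhile when the collection is much larger, as with the SQuAD sentences mentioned above.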


In [0]:
#@title Compute embeddings and build simpleneighbors index
%%time

encodings = model.signatures['response_encoder'](
  input=tf.constant(df_statement.statement[0]),
  context=tf.constant(df_statement.statement[0]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')



CPU times: user 2.04 s, sys: 36.5 ms, total: 2.08 s
Wall time: 3 s


In [0]:
#print(index)
print(encodings)

{'outputs': <tf.Tensor: shape=(1, 512), dtype=float32, numpy=
array([[ 3.52027640e-03,  2.99208406e-02, -2.45567746e-02,
        -1.17191207e-02, -6.64082468e-02, -5.31697161e-02,
        -2.51023490e-02, -3.14379744e-02, -5.80820739e-02,
        -4.57478389e-02,  7.23934174e-02,  4.61452715e-02,
        -4.51491140e-02,  4.13819253e-02,  1.71931032e-02,
        -6.66568950e-02,  2.79163774e-02,  2.58047599e-04,
        -3.45768183e-02, -3.75409834e-02, -1.37330894e-03,
        -5.01763485e-02,  4.86569293e-02,  5.80334179e-02,
         6.35171235e-02,  5.28913252e-02,  6.36867853e-03,
         2.30070073e-02, -3.40671092e-02, -1.51939979e-02,
         4.00387570e-02, -5.89067154e-02, -2.35501062e-02,
        -2.46386584e-02,  3.97542939e-02,  1.39056938e-03,
        -6.92113787e-02,  2.90953740e-02, -6.41708374e-02,
         6.06820248e-02,  1.02825258e-02, -5.72114885e-02,
        -7.57310074e-03,  3.74145694e-02,  4.28285860e-02,
         8.14917311e-03,  5.48161054e-03,  2.69245375

In [0]:
#print('Computing embeddings for %s sentences' % len(df_statement))
CONTEXT_SIZE = 2  # number of neighbouring statements on each side used as context
statements = df_statement.statement.tolist()
for i, statement in enumerate(statements):
  if CONTEXT_SIZE <= i < len(statements) - CONTEXT_SIZE:
    # Concatenate the statement with CONTEXT_SIZE neighbours on each side.
    context_statement = ''.join(statements[i - CONTEXT_SIZE:i + CONTEXT_SIZE + 1])
  else:
    # Too close to either end of the document: the statement is its own context.
    context_statement = statement

  encodings = model.signatures['response_encoder'](
    input=tf.constant(statement),
    context=tf.constant(context_statement))
  # Add the response embedding to the index, keyed by the statement text.
  index.add_one(statement, encodings['outputs'][0].numpy())

In [0]:
index.build()

In [0]:
#@title Semantic search
#@markdown Enter a set of words to find matching sentences.
# 'num_results' sets the number of matching sentences returned.
# The query below is French for "permit to remove an oak tree?"
query = 'permis pour enlever un chêne ?' #@param {type:"string"}
answer = 'cookies'  # substring shown in bold wherever it occurs in the retrieved sentences
num_results = 10 #@param {type:"slider", min:5, max:40, step:1}

display_nearest_neighbors(query, answer)


['.ii) \xa0localisation du terrain faisant l’objet de la demande et description des travaux d’abattage d’arbres pour lesquels une demande de certificat d’autorisation est faite.', '.iii) \xa0sauf pour la coupe d’arbre mort, malade ou dangereux, un plan d’aménagement du site doit être remis comprenant (Règlement 2534- 2015): .- \xa0les superficies visées par l’abattage d’arbre.', 'La demande d’autorisation doit comprendre les informations suivantes : .i) \xa0mention de l’entrepreneur qui procédera à la coupe, ou du titulaire du droit de coupe et de l’ingénieur forestier qui a prescrit les travaux s’il y a lieu.', 'La demande d’autorisation doit comprendre les informations suivantes (Règlement 2534-2015) : .i) localisation du terrain faisant l’objet de la demande et l’identification des arbres à abattre pour lesquels une demande de certificat d’autorisation est faite.', '.abattage d’arbres pour fins autres que l’exploitation forestière : .Toute personne désirant effectuer un a