<a href="https://colab.research.google.com/github/markshope/AI-for-Lawyers-Beginner-Course/blob/master/Cross_Lingual_Similarity_with_TF_Hub_Multilingual_Universal_Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Lingual Similarity with Multilingual Universal Sentence Encoder


This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it to measure sentence similarity between Chinese and English. The module extends the [original Universal Sentence Encoder module](https://tfhub.dev/google/universal-sentence-encoder/2).


# Getting Started

This section sets up the environment for access to the Multilingual Universal Sentence Encoder module.

In [0]:
#@title Setup Environment
!pip uninstall --quiet --yes tensorflow
!pip install --quiet tensorflow-gpu==1.13.1
!pip install --quiet tensorflow-hub
!pip install --quiet bokeh
!pip install --quiet tf-sentencepiece
!pip install --quiet simpleneighbors
!pip install --quiet tqdm

In [0]:
#@title Setup common imports and functions
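import bokeh
import bokeh.io        # Used below for output_notebook() and show().
import bokeh.models
import bokeh.palettes  # Used below for the YlOrRd color palette.
import bokeh.plotting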
import numpy as np
import os
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # Not used directly but needed to import TF ops.
import sklearn.metrics.pairwise

from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange

def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2,
                         plot_title,
                         plot_width=1200, plot_height=600,
                         xaxis_font_size='12pt', yaxis_font_size='12pt'):

  assert len(embeddings_1) == len(labels_1)
  assert len(embeddings_2) == len(labels_2)

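  # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
  # Clip to [-1, 1] so floating-point rounding cannot push arccos to NaN.
  cos_sim = np.clip(
      sklearn.metrics.pairwise.cosine_similarity(embeddings_1, embeddings_2),
      -1.0, 1.0)
  sim = 1 - np.arccos(cos_sim) / np.pi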

  embeddings_1_col, embeddings_2_col, sim_col = [], [], []
  for i in range(len(embeddings_1)):
    for j in range(len(embeddings_2)):
      embeddings_1_col.append(labels_1[i])
      embeddings_2_col.append(labels_2[j])
      sim_col.append(sim[i][j])
  # Materialize the zip so older pandas versions accept it as row data.
  df = pd.DataFrame(list(zip(embeddings_1_col, embeddings_2_col, sim_col)),
                    columns=['embeddings_1', 'embeddings_2', 'sim'])

  mapper = bokeh.models.LinearColorMapper(
      palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),
      high=df.sim.max())

  p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,
                            x_axis_location="above",
                            y_range=[*reversed(labels_2)],
                            plot_width=plot_width, plot_height=plot_height,
                            tools="save",toolbar_location='below', tooltips=[
                                ('pair', '@embeddings_1 ||| @embeddings_2'),
                                ('sim', '@sim')])
  p.rect(x="embeddings_1", y="embeddings_2", width=1, height=1, source=df,
         fill_color={'field': 'sim', 'transform': mapper}, line_color=None)

  p.title.text_font_size = '12pt'
  p.axis.axis_line_color = None
  p.axis.major_tick_line_color = None
  p.axis.major_label_standoff = 16
  p.xaxis.major_label_text_font_size = xaxis_font_size
  p.xaxis.major_label_orientation = 0.25 * np.pi
  p.yaxis.major_label_text_font_size = yaxis_font_size
  p.min_border_right = 300

  bokeh.io.output_notebook()
  bokeh.io.show(p)


The next cell loads the pre-trained multilingual encoder from TF Hub, builds a graph that takes a batch of strings as input, and starts a session that the rest of this notebook reuses to encode text.

In [0]:
#@title Import Multilingual Module
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/1'

# Set up graph.
g = tf.Graph()
with g.as_default():
  text_input = tf.placeholder(dtype=tf.string, shape=[None])
  multiling_embed = hub.Module(module_url)
  embedded_text = multiling_embed(text_input)
  init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

# Initialize session.
session = tf.Session(graph=g)
session.run(init_op)

# Visualize Text Similarity Between Chinese and English
With the encoder loaded, we can embed sentences and visualize semantic similarity between Chinese and English.

## Computing Text Embeddings

We first define a set of sentences in English and Chinese. Then, we precompute the embeddings for our sentences.

In [0]:
# Sample contract clauses: compliance with laws, then force majeure.

chinese_sentences = ['賣方和買方應遵守適用於本合約的法律。', '被告可以通過證明損失係由原告自己或不可抗力因素造成而免除自己的責任。']
english_sentences = ['Buyer shall comply with all applicable laws, regulations, and ordinances.', 'Seller shall not be liable for any failure or delay in fulfilling or performing.']


In [0]:
# Compute embeddings.

en_result = session.run(embedded_text, feed_dict={text_input: english_sentences})
zh_result = session.run(embedded_text, feed_dict={text_input: chinese_sentences})


## Visualizing Similarity

With text embeddings in hand, we compute an arccos-based (angular) similarity from their cosine similarity and plot it as a heatmap to show how similar the sentences are across the two languages. A darker color indicates that two sentences are more semantically similar.
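
As a rough illustration of the score that `visualize_similarity` plots, the sketch below computes `1 - arccos(cosine) / pi` with NumPy alone. The helper name `angular_similarity` and the two-dimensional toy vectors are assumptions made for this example; they are not output from the encoder.

```python
import numpy as np

def angular_similarity(a, b):
    """Return 1 - arccos(cosine similarity) / pi for every row pair of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize rows to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Clip to guard against rounding that would make arccos return NaN.
    cosine = np.clip(a_unit @ b_unit.T, -1.0, 1.0)
    return 1.0 - np.arccos(cosine) / np.pi

# Hypothetical toy vectors standing in for sentence embeddings.
toy_en = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_zh = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

sim = angular_similarity(toy_en, toy_zh)
print(sim)
```

Identical directions score 1, orthogonal directions score 0.5, and opposite directions score 0, so the score falls linearly with the angle between the two vectors.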

### English-Chinese Similarity

In [0]:
visualize_similarity(en_result, zh_result, english_sentences, chinese_sentences, 'English-Chinese Clause Similarity')
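
The heatmap can also be read programmatically. The sketch below, under the assumption that a similarity matrix with one row per English sentence and one column per Chinese sentence is already available, picks the highest-scoring Chinese sentence for each English one. The helper `best_matches`, the labels, and the numbers in `toy_sim` are all hypothetical placeholders, not results produced by the model.

```python
import numpy as np

def best_matches(sim, row_labels, col_labels):
    """For each row, return (row label, best column label, score)."""
    sim = np.asarray(sim)
    best = sim.argmax(axis=1)  # Index of the highest score in each row.
    return [(row_labels[i], col_labels[j], float(sim[i, j]))
            for i, j in enumerate(best)]

# Made-up scores for illustration only.
toy_sim = np.array([[0.62, 0.48],
                    [0.45, 0.58]])

matches = best_matches(toy_sim, ['en_0', 'en_1'], ['zh_0', 'zh_1'])
for en_label, zh_label, score in matches:
    print(en_label, '->', zh_label, score)
```

In the notebook itself, the same helper could be applied to a matrix computed from `en_result` and `zh_result`, with `english_sentences` and `chinese_sentences` as the labels.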

# Copyright Notices

The following copyright statements and licenses apply to various open source software components (or portions thereof) that are included with the code in this file. The code does not necessarily use all the open source software components referred to below and may also only use portions of a given component.

**Copyright 2019 The TensorFlow Hub Authors.**

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================