##### Copyright 2019 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Approximate Embeddings Similarity Matching

This tutorial illustrates how to generate embeddings from a [TensorFlow Hub](https://www.tensorflow.org/hub) (TF-Hub) module given input data, and build an approximate nearest neighbours (ANN) index using the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

The steps of this tutorial are:
1. Download sample data.
2. Generate embeddings for the data using a TF-Hub module.
3. Build an ANN index for the embeddings.
4. Use the index for similarity matching.

We use [Apache Beam](https://beam.apache.org/documentation/programming-guide/) to generate the embeddings from the TF-Hub module. We also use Spotify's [ANNOY](https://github.com/spotify/annoy) library to build the approximate nearest neighbours index.

This tutorial uses **TensorFlow 2.0**, and works only with TF2 TF-Hub modules (**SavedModel 2.0**).

## Getting Started

Install the required libraries. Make sure to **restart the runtime** after installation is completed.

In [0]:
!pip3 install -U tensorflow
!pip3 install apache_beam[gcp]
!pip3 install annoy

Import the required libraries

In [0]:
import os
import sys
import multiprocessing
import pickle
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy

In [0]:
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

TF version: 2.0.0
TF-Hub version: 0.6.0
Apache Beam version: 2.16.0


## 1. Download Sample Data

[A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) contains news headlines published over a period of 15 years, sourced from the reputable Australian news source Australian Broadcasting Corp. (ABC). The dataset is a summarised historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

It includes the entire corpus of articles published on the ABC website in the given time range. With a volume of 200 articles per day and a good focus on international news, events of significance have been captured here. Digging into the keywords, one can see the important episodes shaping the last decade and how they evolved over time, for example: financial crisis, Iraq war, multiple US elections, ecological disasters, terrorism, famous people, and Australian crimes.

**Format**: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.
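As a minimal sketch of the format described above, the snippet below pulls the headline out of one tab-separated line. The helper name `extract_headline` and the date value in the sample line are illustrative assumptions, not part of the dataset documentation; the headline text is taken from the corpus shown later in this tutorial.

```python
def extract_headline(line):
    """Return the headline (second column) of a tab-separated line, or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # malformed line: there is no headline column
    # Drop surrounding whitespace and optional double quotes.
    return fields[1].strip().strip('"')


# Hypothetical sample line: the date value is made up for illustration.
sample_line = '20030219\t"aba decides against community broadcasting licence"\n'
print(extract_headline(sample_line))
```

Returning `None` for malformed lines is a design choice of this sketch; it lets the caller decide whether to skip or report such lines.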



In [0]:
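# Quote the URL so the shell does not treat `&` as a background operator,
# and write the download directly to raw.tsv.
!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv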
!wc -l raw.tsv
!head raw.tsv

For simplicity, we only keep the headline text and remove the publication date

In [0]:
!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")

In [0]:
!tail corpus/text.txt

severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide


## 2. Generate Embeddings for the Data

In this tutorial, we use the [Neural Network Language Model (NNLM)](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level meaning similarity. We run the embedding generation process using Apache Beam.
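To make "meaning similarity" concrete, here is a small sketch of cosine similarity, one common way to compare two embedding vectors. It uses plain Python so that it runs without dependencies; the function name and the toy three-dimensional vectors are assumptions for illustration and stand in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: very similar
print(cosine_similarity(a, c))  # 0: unrelated
```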

### Embeddings extraction pipeline
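The pipeline in this section groups headlines into batches before calling the module, so that each module call embeds many sentences at once. The sketch below illustrates the idea of fixed-size batching in plain Python. It is not Beam's implementation, and `batch_elements` is a hypothetical helper name.

```python
def batch_elements(items, batch_size):
    """Yield consecutive lists of up to `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # the final batch may be smaller than batch_size


headlines = ["h1", "h2", "h3", "h4", "h5"]
for batch in batch_elements(headlines, 2):
    print(batch)
```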

In [0]:
embed_fn = None

def generate_embeddings(text, module_url, random_projection_matrix=None):
  import tensorflow_hub as hub
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(module_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding

def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }
  
    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)
  
    examples.append(example)
  
  return examples

In [0]:
def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''
  source_data_location = args['source_data_location']
  sink_data_location = args['sink_data_location']
  runner = args['runner']
  temporary_dir = args['temporary_location']
  module_url = args['module_url']
  original_dim = args['original_dim']
  projected_dim = args['projected_dim']
  batch_size = args['batch_size']

  pipeline_options = beam.options.pipeline_options.PipelineOptions(**args)
  
  random_projection_matrix = None
  if projected_dim and original_dim > projected_dim:
    # Creating a random projection matrix
    random_projection_matrix = np.random.uniform(
        low=-1, high=1, size=(original_dim, projected_dim))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix, 
                  handle, protocol=pickle.HIGHEST_PROTOCOL)

  print("Starting the Beam pipeline...")
  with beam.Pipeline(runner, options=pipeline_options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=source_data_location)
        |'Batch elements' >> util.BatchElements(
            min_batch_size=batch_size, max_batch_size=batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, module_url, random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(sink_data_location),
            file_name_suffix='.tfrecords')
    )

### Run pipeline
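With `original_dim = 128` and `projected_dim = 64`, `run_hub2emb` creates a random projection matrix and multiplies every embedding by it, which halves the size of the vectors stored in the index. The sketch below shows the same idea in plain Python rather than NumPy; the helper names and the seed are assumptions for illustration. Note that the query must later be projected with the same matrix, which is why the tutorial pickles it to disk.

```python
import random


def make_projection_matrix(original_dim, projected_dim, seed=0):
    """Matrix of shape (original_dim, projected_dim) with entries in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(projected_dim)]
            for _ in range(original_dim)]


def project(vector, matrix):
    """Vector-matrix product: maps original_dim values to projected_dim values."""
    projected_dim = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(projected_dim)]


original_dim, projected_dim = 128, 64
matrix = make_projection_matrix(original_dim, projected_dim)

rng = random.Random(42)
embedding = [rng.uniform(-1.0, 1.0) for _ in range(original_dim)]  # stand-in embedding
reduced = project(embedding, matrix)
print(len(embedding), '->', len(reduced))
```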

In [0]:
runner = 'DirectRunner'
job_name = 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
input_data = 'corpus/*.txt'
output_dir = 'embeds'
temporary_dir = 'tmp'
module_url = 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1'

original_dim = 128
projected_dim = 64

args = {
    'job_name': job_name,
    'runner': runner,
    'batch_size': 1024,
    'source_data_location': input_data,
    'sink_data_location': output_dir,
    'temporary_location': temporary_dir,
    'module_url': module_url,
    'original_dim': original_dim,
    'projected_dim': projected_dim,
}

print("Pipeline args are set.")
args

Pipeline args are set.


{'batch_size': 1024,
 'job_name': 'hub2emb-191009-163329',
 'module_url': 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1',
 'original_dim': 128,
 'projected_dim': 64,
 'runner': 'DirectRunner',
 'sink_data_location': 'embeds',
 'source_data_location': 'corpus/*.txt',
 'temporary_location': 'tmp'}

In [0]:
!rm -r {output_dir}
!rm -r {temporary_dir}
!rm random_projection_matrix

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")

rm: cannot remove 'tmp': No such file or directory
Running pipeline...
Storing random projection matrix to disk...
Starting the Beam pipeline...




CPU times: user 4min 57s, sys: 1min 54s, total: 6min 51s
Wall time: 4min 46s
Pipeline is done.


In [0]:
!ls {output_dir}

emb-00000-of-00001.tfrecords


Read some of the generated embeddings...

In [0]:
embed_file = '{}/emb-00000-of-00001.tfrecords'.format(output_dir)
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}:{}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))


headline_text:[-0.722893   -0.4011973  -0.26381597  0.3096847  -0.20064476  0.6773664
 -0.43674338 -0.08279408  0.28475836 -0.06558699]
aba decides against community broadcasting licence:[ 0.57050043  0.6861436   0.5523113  -0.22888096 -0.8650546   0.23084538
 -0.55934376 -0.03965315 -0.0914881  -0.26085043]
act fire witnesses must be aware of defamation:[-0.4060502   0.61067057 -0.18213813  0.4637944  -0.41108343  0.20381606
 -0.16592963 -1.2887975   0.19063626  0.85022455]
a g calls for infrastructure protection summit:[ 0.97178245  0.2629447   0.944849   -0.70715874 -0.45348793  0.726673
  0.17557988  0.17089605  1.0747907  -0.34547323]
air nz staff in aust strike for pay rise:[ 0.34060448  0.30002418 -0.850083    0.9558997   0.8382068  -0.02917271
  0.24227455 -0.6201514   0.59361756 -0.10875675]


## 3. Build the ANN Index for the Embeddings
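Annoy's `angular` metric is the Euclidean distance between normalized vectors, which equals sqrt(2(1 - cos(u, v))). The sketch below computes that distance and runs an exact, brute-force nearest-neighbour search over a toy set of vectors. This is the result that an ANN index approximates much faster on large collections. The function names and the toy data are assumptions for illustration, and the sketch assumes non-zero vectors.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.sqrt(2.0 * (1.0 - cos))


def exact_nearest(query, vectors, k):
    """Ids of the k vectors closest to the query, by exhaustive comparison."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: angular_distance(query, vectors[i]))
    return ranked[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
print(exact_nearest(query, vectors, 2))
```

Brute force compares the query against every item, so its cost grows linearly with the size of the collection; the tree-based index built in this section trades a little accuracy for much faster lookups.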

In [0]:
def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 200000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))

In [0]:
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)

rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
200000 items loaded to the index
400000 items loaded to the index
600000 items loaded to the index
800000 items loaded to the index
1000000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.61 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 9min 30s, sys: 36.7 s, total: 10min 7s
Wall time: 9min 37s


In [0]:
!ls

corpus	index	       random_projection_matrix  sample_data
embeds	index.mapping  raw.tsv


## 4. Use the Index for Similarity Matching

### Load the index and the mapping files

In [0]:
# Pass the metric explicitly; it must match the one used to build the index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')


Annoy index is loaded.
Mapping file is loaded.


### Similarity matching method

In [0]:
def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
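
As a hedged variation on the method above, Annoy's `get_nns_by_vector` can also return distances when called with `include_distances=True`, in which case it returns a pair of lists (ids, distances). The sketch below is an assumption-laden alternative, not part of the original tutorial: the function name is hypothetical, and it takes the index and mapping as arguments instead of reading the notebook's globals. For the `angular` metric, the distance d relates to cosine similarity as cos = 1 - d²/2.

```python
def find_similar_items_with_distances(index, mapping, embedding, num_matches=5):
    """Returns (text, angular_distance, cosine_similarity) for each match."""
    ids, distances = index.get_nns_by_vector(
        embedding, num_matches, search_k=-1, include_distances=True)
    results = []
    for item_id, distance in zip(ids, distances):
        # Angular distance is sqrt(2 * (1 - cos)), so cos = 1 - d^2 / 2.
        cosine = 1.0 - (distance * distance) / 2.0
        results.append((mapping[item_id], distance, cosine))
    return results
```

With the objects loaded in this section, a call might look like `find_similar_items_with_distances(index, mapping, query_embedding, 10)`, where `query_embedding` is an embedding of the same dimension as the index.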

### Extract embedding from a given query

In [0]:
# Load the TF-Hub module
print("Loading the TF-Hub module...")
embed_fn = hub.load(module_url)
print("TF-hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding = embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding


Loading the TF-Hub module...
TF-hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.


In [0]:
extract_embeddings("Hello Machine Learning!")[:10]

array([-0.36083189, -0.82776135, -0.65961214, -0.16514267,  1.39722511,
       -0.88492214,  0.11774054,  0.24281768,  0.41504281,  0.16867341])

In [0]:
query = "confronting global challenges" #@param {type:"string"}

In [0]:
print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)

Generating embedding for the query...
CPU times: user 144 ms, sys: 99.3 ms, total: 244 ms
Wall time: 250 ms

Finding relevant items in the index...
CPU times: user 1.29 ms, sys: 112 µs, total: 1.41 ms
Wall time: 833 µs

Results:
=========
confronting global challenges
world struggling to cope with global terrorism
experts to discuss global warming threat
diabetes increasing worldwide at alarming rates
analyst warns of looming global climate wars
global downturn helps emerging art dealer
an asian universities rising in latest global rankings
nuclear watchdog warns of new global dangers
conference examines challenges facing major cities
global financial uncertainty influencing prices at
