# Approximate Embeddings Similarity Matching

This tutorial illustrates how to generate embeddings from a [TensorFlow Hub](https://www.tensorflow.org/hub) (TF-Hub) module given input data, and how to build an approximate nearest neighbours (ANN) index using the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

The steps of this tutorial are:
1. Download sample data.
2. Generate embeddings for the data using a TF-Hub module.
3. Build an ANN index for the embeddings.
4. Use the index for similarity matching.

We use [Apache Beam](https://beam.apache.org/documentation/programming-guide/) with [TensorFlow Transform](https://www.tensorflow.org/tfx/tutorials/transform/simple) (TF-Transform) to generate the embeddings from the TF-Hub module. We also use Spotify's [ANNOY](https://github.com/spotify/annoy) library to build the approximate nearest neighbours index.

## Getting Started

Install the required libraries.

In [0]:
!pip3 install apache_beam[gcp]
!pip3 install tensorflow_transform
!pip3 install annoy

Import the required libraries.

In [0]:
import os
import pickle
from datetime import datetime
import numpy as np
import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_hub as hub
import tensorflow_transform.beam as tft_beam
import annoy


In [0]:
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('TF-Transform version: {}'.format(tft.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

TF version: 1.14.0
TF-Hub version: 0.6.0
TF-Transform version: 0.14.0
Apache Beam version: 2.15.0


## 1. Download Sample Data

[A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) contains news headlines published over a period of 15 years by the Australian Broadcasting Corporation (ABC), a reputable Australian news source. The dataset is a summarised historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

It includes the entire corpus of articles published on the ABC website in that time range. With a volume of 200 articles per day and a good focus on international news, it captures the events of significance. Digging into the keywords, one can see the important episodes that shaped the last decade and how they evolved over time, for example the financial crisis, the Iraq war, multiple US elections, ecological disasters, terrorism, famous people, and Australian crimes.

**Format**: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.
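
As a minimal sketch of what extracting the headline column involves, assuming the two-column tab-separated layout described above: the sample rows, the `publish_date` column name, and the `extract_headlines` helper are made up for illustration and are not taken from the dataset.

```python
import csv
import io

def extract_headlines(tsv_text, skip_header=True):
    """Return the second column (headline text) of tab-separated rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
    rows = list(reader)
    if skip_header and rows:
        rows = rows[1:]
    # Ignore malformed rows that do not have both columns.
    return [row[1].strip() for row in rows if len(row) >= 2]

# Hypothetical sample in the same shape as the dataset.
sample = (
    'publish_date\theadline_text\n'
    '20030219\t"first example headline"\n'
    '20030219\t"second example headline"\n'
)
headlines = extract_headlines(sample)
print(headlines)
```

The cells below use a plain `split('\t')` instead and keep the header row, which is why `headline_text` shows up as the first record in the generated embeddings.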



In [0]:
!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv

In [0]:
!wc -l raw.tsv
!head raw.tsv

For simplicity, we keep only the headline text and drop the publication date.

In [0]:
!rm -rf corpus
!mkdir -p corpus

In [0]:
with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")

In [0]:
!tail corpus/text.txt

## 2. Generate Embeddings for the Data

In this tutorial, we use the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/2) to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam and TF-Transform.
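
To make "similarity" concrete, here is a minimal sketch of cosine similarity between two embedding vectors. It uses plain Python and toy 3-dimensional vectors in place of real sentence embeddings; the `cosine_similarity` helper is illustrative and is not part of the pipeline below.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # Same direction as emb_a.
emb_c = [0.0, 0.0, 3.0]  # Orthogonal to emb_a.

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab, sim_ac)
```

Vectors that point the same way score close to 1 and unrelated (orthogonal) vectors score close to 0, regardless of their lengths.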

### Embeddings extraction pipeline

In [0]:
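encoder = None  # Module-level cache so the TF-Hub module is loaded only once.

def embed_text(text, module_url):
  """Returns embeddings for a batch of text using the TF-Hub module."""
  import tensorflow_hub as hub
  global encoder
  if encoder is None:
    encoder = hub.Module(module_url)
  embedding = encoder(text)
  return embedding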


def get_metadata():
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import schema_utils
  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}
  schema = schema_utils.schema_from_feature_spec(feature_spec)
  metadata = dataset_metadata.DatasetMetadata(schema)
  return metadata


def make_preprocess_fn(module_url, random_projection_matrix=None):

  def _preprocess_fn(input_features):
    text = input_features['text']
    embedding = embed_text(text, module_url)

    if random_projection_matrix is not None:
      embedding = tf.matmul(
          embedding, tf.cast(random_projection_matrix, embedding.dtype))

    output_features = {
        'text': text,
        'embedding': embedding
        }
    return output_features
  
  return _preprocess_fn


In [0]:
def run_hub2emb(args):

  source_data_location = args['source_data_location']
  sink_data_location = args['sink_data_location']
  runner = args['runner']
  temporary_dir = args['temporary_location']
  module_url = args['module_url']
  original_dim = args['original_dim']
  projected_dim = args['projected_dim']

  pipeline_options = beam.options.pipeline_options.PipelineOptions(**args)
  raw_metadata = get_metadata()
  converter = tft.coders.CsvCoder(
      column_names=['text'], schema=raw_metadata.schema)
  
  random_projection_matrix = None
  if projected_dim and original_dim != projected_dim:
    random_projection_matrix = np.random.uniform(
        low=-1, high=1, size=(original_dim, projected_dim))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)

  with beam.Pipeline(runner, options=pipeline_options) as pipeline:
    with tft_beam.Context(temporary_dir):

      sentences = ( 
          pipeline
          | 'Read sentences from files' >> beam.io.ReadFromText(
              file_pattern=source_data_location)
          | 'Convert to dictionary' >> beam.Map(converter.decode)
      )

      sentences_dataset = (sentences, raw_metadata)
      preprocess_fn = make_preprocess_fn(module_url, random_projection_matrix)

      embeddings_dataset, _ = (
          sentences_dataset
          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(
              preprocess_fn)
      )

      embeddings, transformed_metadata = embeddings_dataset

      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
        file_path_prefix='{}/emb'.format(sink_data_location),
        file_name_suffix='.tfrecords',
        coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))
      

### Run pipeline

In [0]:
runner = 'DirectRunner'
job_name = 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
input_data = 'corpus/*.txt'
output_dir = 'embeds'
temporary_dir = 'tmp'
module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2'

original_dim = 512
projected_dim = 128

args = {
    'job_name': job_name,
    'runner': runner,
    'source_data_location': input_data,
    'sink_data_location': output_dir,
    'temporary_location': temporary_dir,
    'module_url': module_url,
    'original_dim': original_dim,
    'projected_dim': projected_dim
}

print("Pipeline args are set.")

Pipeline args are set.
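
With `original_dim = 512` and `projected_dim = 128`, `run_hub2emb` draws a random matrix of shape `(original_dim, projected_dim)` and multiplies every embedding by it to reduce its dimensionality. The following is a minimal sketch of that step with tiny dimensions. The pipeline draws the matrix with `np.random.uniform`; this sketch uses the standard `random` module so that it runs without dependencies, and the `make_projection_matrix` and `project` helpers are illustrative names only.

```python
import random

def make_projection_matrix(original_dim, projected_dim, seed=0):
    """Random matrix with entries drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(projected_dim)]
            for _ in range(original_dim)]

def project(vector, matrix):
    """Compute vector @ matrix for a 1-D vector and a list-of-lists matrix."""
    projected_dim = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(projected_dim)]

matrix = make_projection_matrix(original_dim=8, projected_dim=3)
embedding = [0.1 * i for i in range(8)]
reduced = project(embedding, matrix)
print(len(embedding), '->', len(reduced))
```

Because the projection is only meaningful relative to one particular matrix, the same matrix has to be applied to queries later, which is why the pipeline pickles it to `random_projection_matrix`.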


In [0]:
!rm -rf {output_dir}
!rm -rf {temporary_dir}
!rm -f random_projection_matrix

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")

Running pipeline...
Storing random projection matrix to disk...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.


INFO:tensorflow:Assets added to graph.


INFO:tensorflow:No assets to write.


INFO:tensorflow:SavedModel written to: tmp/tftransform_tmp/60f511ce6d3d415c9f94fb9db1543888/saved_model.pb


CPU times: user 4min 27s, sys: 17.3 s, total: 4min 44s
Wall time: 3min 58s
Pipeline is done.


In [0]:
!ls {output_dir}

emb-00000-of-00001.tfrecords


Read some of the generated embeddings...

In [0]:
embed_file = '{}/emb-00000-of-00001.tfrecords'.format(output_dir)
sample = 5
record_iterator = tf.io.tf_record_iterator(path=embed_file)
for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)
  text = example.features.feature['text'].bytes_list.value
  embedding = np.array(example.features.feature['embedding'].float_list.value)
  print("Embedding dimensions: {}".format(embedding.shape[0]))
  print("{}:{}".format(text, embedding[:10]))
  sample -= 1
  if sample == 0:
    break


Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


Embedding dimensions: 128
[b'headline_text']:[ 0.7299096   0.82932085  0.55692822 -0.5001204  -1.04153848  0.60820484
  0.65177363 -0.27067748  0.15273833  0.8012833 ]
Embedding dimensions: 128
[b'aba decides against community broadcasting licence']:[ 0.01064706 -0.3083396  -0.01214939 -0.91801566  0.39616808  0.30477336
 -0.48459959  0.18867671 -0.3611635   0.53673756]
Embedding dimensions: 128
[b'act fire witnesses must be aware of defamation']:[ 0.23446536 -0.41275284  0.34584063 -0.81324148  0.02928767  0.75109184
 -1.0389266   0.49377581 -0.49257499  0.47597888]
Embedding dimensions: 128
[b'a g calls for infrastructure protection summit']:[ 0.06868215  0.02751701 -0.66759372 -0.59131128 -0.54670727  0.33901572
 -0.66903847  0.16438812 -1.0265044   0.57989764]
Embedding dimensions: 128
[b'air nz staff in aust strike for pay rise']:[-0.03368764 -0.22079813 -0.40127575  0.30999386  0.07844155  0.42006987
 -0.97262436  0.5886246  -0.9725759  -0.71267378]


## 3. Build the ANN Index for the Embeddings

In [0]:
def build_index(
    embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  mapping = {}

  embed_files = tf.gfile.Glob(embedding_files_pattern)
  print('Found {} embedding file(s).'.format(len(embed_files)))

  item_counter = 0
  for f, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(
      f+1, len(embed_files)))
    record_iterator = tf.io.tf_record_iterator(
      path=embed_file)

    for string_record in record_iterator:
      example = tf.train.Example()
      example.ParseFromString(string_record)
      text = example.features.feature['text'].bytes_list.value[0].decode("utf-8")
      mapping[item_counter] = text
      embedding = np.array(
        example.features.feature['embedding'].float_list.value)
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 200000 == 0:
        print('{} items loaded to the index'.format(item_counter))

    print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))

In [0]:
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm -f {index_filename}
!rm -f {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)

Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
200000 items loaded to the index
400000 items loaded to the index
600000 items loaded to the index
800000 items loaded to the index
1000000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 2.03 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 12min 9s, sys: 7.41 s, total: 12min 16s
Wall time: 12min 45s


In [0]:
!ls

corpus	index	       random_projection_matrix  sample       tmp
embeds	index.mapping  raw.tsv			 sample_data


## 4. Use the Index for Similarity Matching

### Load the index and the mapping files

In [0]:
# Pass the metric explicitly; it must match the one used in build_index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')


Annoy index is loaded.
Mapping file is loaded.


### Similarity matching method

In [0]:
def find_similar_items(embedding, num_matches=5):
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
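
For comparison, the sketch below shows the exact, brute-force search that the ANN index approximates: score every item against the query and keep the closest ones. It assumes, to our understanding of the Annoy documentation, that the `angular` metric corresponds to `sqrt(2 * (1 - cos(u, v)))`. The toy vectors, `toy_mapping`, and the `exact_nearest` helper are made up for illustration.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def exact_nearest(query, items, num_matches=2):
    """Return the ids of the num_matches items closest to the query."""
    ranked = sorted(items, key=lambda i: angular_distance(query, items[i]))
    return ranked[:num_matches]

# Toy 2-d "embeddings" keyed by item id, plus an id-to-text mapping.
toy_items = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
toy_mapping = {0: 'east', 1: 'east-north-east', 2: 'north', 3: 'west'}

ids = exact_nearest([1.0, 0.05], toy_items, num_matches=2)
matches = [toy_mapping[i] for i in ids]
print(matches)
```

This exact search touches every item for every query, which becomes expensive for the roughly 1.1 million headlines indexed above; the ANN index trades a little accuracy for much faster lookups.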

### Extract embedding from a given query

In [0]:
embed_module = hub.Module(module_url)
placeholder = tf.placeholder(dtype=tf.string)
embed = embed_module(placeholder)
session = tf.Session()
session.run([tf.global_variables_initializer(), tf.tables_initializer()])
print('Tf-Hub module is loaded.')

def _embeddings_fn(sentences):
    computed_embeddings = session.run(
        embed, feed_dict={placeholder: sentences})
    return computed_embeddings

embedding_fn = _embeddings_fn

def extract_embeddings(query):
  return embedding_fn([query])[0]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Tf-Hub module is loaded.


In [0]:
extract_embeddings("Hello Machine Learning!")[:10]

array([-0.02643181, -0.04425209, -0.0363341 ,  0.00761549, -0.03102973,
       -0.06329978,  0.0234422 ,  0.03972385, -0.00340698,  0.05722774],
      dtype=float32)

### Enter a query to find the most similar items

In [0]:
query = "confronting global challenges" #@param {type:"string"}

In [0]:
random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

print("")
print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

if random_projection_matrix is not None:
  query_embedding = query_embedding.dot(random_projection_matrix)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")

print("Results:")
print("=========")
for item in items:
  print(item)

random projection matrix is loaded.

Generating embedding for the query...
CPU times: user 4.48 ms, sys: 43 µs, total: 4.53 ms
Wall time: 5.09 ms

Finding relevant items in the index...
CPU times: user 1.1 ms, sys: 1 ms, total: 2.1 ms
Wall time: 1.06 ms

Results:
confronting global challenges
bluescope ponders global challenges
hopes for mullewa official to solve social problems
momentum against pacific leaders arguing for
nff challenges social media interpretation of
old wisdom unites to solve global dilemmas
old wisdom unites to solve global dilemmas
global credit uncertainty provides opportunity
riverland adopts suicide prevention scheme
the emerging global order
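
The results above contain the same headline twice, because the corpus itself includes repeated headlines and each copy gets its own id in the index. If repeats are unwanted, one option is to filter them after the lookup. This is a minimal sketch with a hypothetical `deduplicate` helper that keeps the first occurrence of each item and preserves ranking order.

```python
def deduplicate(items):
    """Drop repeated items while keeping the original order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

# A few of the results printed above, including the repeated headline.
results = [
    'old wisdom unites to solve global dilemmas',
    'old wisdom unites to solve global dilemmas',
    'global credit uncertainty provides opportunity',
]
unique_results = deduplicate(results)
print(unique_results)
```

Since filtering can leave fewer items than requested, it may help to ask `find_similar_items` for a few more matches than needed.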
