# Document Search with Universal Sentence Encoder

Originally by [Jeremy B. Merrill](https://twitter.com/jeremybmerrill), formerly of Quartz, for NICAR 2020. Updated lightly throughout by John Keefe.

Original GitHub repos:

- https://github.com/Quartz/aistudio-searching-data-dumps-with-use
- https://github.com/Quartz/aistudio-workshops

# Getting Started

This notebook was originally set up to run on Google's [Colaboratory](https://colab.research.google.com/) service. To try it yourself:

- Go to [colab.research.google.com](https://colab.research.google.com/)
- Choose the "GitHub" tab
- Type "jkeefe" on the top line and press enter.
- Pick the "black-lives-matter-words" repo
- Pick this notebook, `semantic_scoring.ipynb`

**IMPORTANT:** Once the notebook is open, go to the ***Runtime->Change Runtime type*** dropdown menu above and pick **GPU** _before_ running any cells, for faster execution.

This is an interactive demo. You can run all the code necessary right here.

We're using two neat pieces of technology called the *Universal Sentence Encoder* and *Annoy*.

- the *Universal Sentence Encoder* is a pre-trained machine-learning model that sorta understands human language. If you feed in a sentence, it comes out with 512 numbers that represent the approximate meaning of that sentence. What's really cool is that if you feed in a second sentence that means about the same thing, that second sentence's numbers will be very close to those of the first sentence.
- *Annoy* is a library that makes it really easy to find points in vector space that are close to each other. 

What's "vector space"? Imagine a dot plot with an x-axis and a y-axis. That's two-dimensional vector space.

This is three-dimensional vector space. Three axes: x, y, z.

![alt text](https://filedn.com/lVaAxkskVxILBoUDG3XUrm7/nicar20presentation/Screen%20Shot%202020-02-28%20at%205.43.59%20PM.png)

Now imagine 512 axes. That's what we're dealing with here.

## Okay, let's get started.

Run the cell below to install everything you need. It'll take a few minutes. Note that we're actually downgrading to an old version of TensorFlow here. (The new version would work, I'm sure, but I haven't refactored it yet!)

In [1]:
#@title Setup Environment
#latest Tensorflow that supports sentencepiece is 1.14
!pip uninstall --quiet --yes tensorflow
!pip install --quiet tensorflow-gpu==1.14
!pip install --quiet tensorflow==1.14
!pip install --quiet tensorflow-hub
!pip install --quiet bokeh
!pip install --quiet tf-sentencepiece
!pip install --quiet annoy
!pip install --quiet tqdm
!pip install --quiet w3lib
!pip install --quiet syntok



Here we load in everything we just installed.

In [2]:
#@title Setup common imports and functions
%tensorflow_version 1.x
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import os
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # Not used directly but needed to import TF ops.
import sklearn.metrics.pairwise

from tqdm import tqdm
from tqdm import trange
from annoy import AnnoyIndex

TensorFlow 1.x selected.


This is additional boilerplate code where we import the pre-trained ML model we will use to encode text throughout this notebook.

In [3]:
#@title get the machine learning stuff set up. (boilerplate!)
# this version of the Universal Sentence Encoder only "speaks" English
# but there's another version you can switch in that supports 16 different languages!
module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2'

# boilerplate, getting started with Tensorflow.
# (how to use Tensorflow is way outside the scope of this class)
g = tf.Graph()
with g.as_default():
  text_input = tf.placeholder(dtype=tf.string, shape=[None])
  multiling_embed = hub.Module(module_url)
  embedded_text = multiling_embed(text_input)
  init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

session = tf.Session(graph=g)
session.run(init_op)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




## What the heck is JSONL?

Don't worry too much about it. There's nothing special to it: a JSONL file holds one JSON document per line, and here it's just a way to get at the content of the statements. Here's one line:

```json
{"_source": {"content": "For all of us who care deeply about diversity, equality and fairness, it\u2019s been painful and heartbreaking to watch the tragic events of this past week in Minneapolis and in several other parts of America. These events are emblematic of the inequities that black and other diverse communities face on a daily basis. The need for change is evident and we know that more work needs to be done to ensure that all people in all communities feel included, equal and safe. As a company that is deeply committed to diversity, inclusion and human rights, we will strengthen our resolve in advocating for change and in doing our part so that we build a society that protects all people and values all voices."},  "_id": "MetLife.txt"}
```

I used another notebook in this repo, `jsonl_maker.ipynb`, to turn the collection of statements, each in its own text file, into a single JSONL file.
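If you want to poke at a JSONL file yourself, here is a minimal sketch of reading one. The two records are made up for illustration (the `CompanyA.txt` and `CompanyB.txt` names are hypothetical); only the `_source`/`content`/`_id` shape is taken from the example line above.

```python
import json

# Two made-up records in the same shape as the lines in docs.jsonl.
sample_jsonl = (
    '{"_source": {"content": "First statement."}, "_id": "CompanyA.txt"}\n'
    '{"_source": {"content": "Second statement."}, "_id": "CompanyB.txt"}\n'
)

def parse_jsonl(text):
    """Parse JSONL text: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl(sample_jsonl)
for record in records:
    print(record["_id"], "->", record["_source"]["content"])
```

With a real file you would loop over the open file handle line by line instead of calling `splitlines()`, which is what the processing cell further down does.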


In [4]:
# @ get the data.
# let's get our data!
# it's a JSONL file, which has each statement as its own JSON document, and each JSON document on one line.
!wget --quiet -nc -O docs.jsonl https://www.dropbox.com/s/watyjmstbsorsh4/docs.jsonl

## Chopping each page into a list of sentences

We have to do this because pages and paragraphs often cover multiple topics, which might confuse the model. Also, the Universal Sentence Encoder is built to encode sentences, so it ignores anything after the 128th word in its input.

The code below cuts the text into sentences, then groups runs of consecutive short sentences (under 150 characters each) so they can be combined into chunks of about 10 words.
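The grouping step is the least obvious part, so here is a simplified sketch of just that idea, separate from the real cell below. The `make_grouper` name, the sample sentences and the 20-character threshold are all assumptions chosen to keep the example small; the notebook's own version measures length in characters too, but with a larger threshold.

```python
from functools import reduce

def make_grouper(max_short_len):
    """Return a reducer that collects consecutive short sentences into one group."""
    def group(groups, sentence):
        if not groups:
            return [[sentence]]
        if len(sentence) < max_short_len:
            groups[-1].append(sentence)   # short: join the current group
        else:
            groups.append([sentence])     # long: gets a group of its own
            groups.append([])             # start a fresh group for what follows
        return groups
    return group

sentences = [
    "Hi.",
    "Thanks.",
    "This sentence is long enough to stand alone.",
    "Bye.",
    "See you.",
]

# Drop the empty placeholder groups the reducer can leave behind.
groups = [g for g in reduce(make_grouper(20), sentences, []) if g]
print(groups)
```

The short sentences on either side of the long one end up in their own groups, and the long sentence sits alone in the middle.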

In [None]:
# how many lines are in our document?
!wc docs.jsonl

In [6]:
import json
from bs4 import BeautifulSoup
from functools import reduce
from w3lib.html import remove_tags

import syntok.segmenter as segmenter

total_docs = 88 # get this with `wc` (only used for progress bar)

total_short_paragraphs = 0
MAX_SENT_LEN = 50

def sentenceify(text):
    return [sl for l in [[''.join([t.spacing + t.value for t in s]) for s in p if len(s) < MAX_SENT_LEN] for p in segmenter.analyze(text)] for sl in l if any(map(lambda x: x.isalpha(), sl))]


def clean_html(html):
    if "<" in html and ">" in html:
        try:
            soup = BeautifulSoup(html, features="html.parser")
            plist = soup.find('plist')
            if plist:
                plist.decompose() # remove plists because ugh
            text = soup.getText()
        except:
            text = remove_tags(html)
        return '. '.join(text.split("\r\n\r\n\r\n"))
    else:
        return '. '.join(html.split("\r\n\r\n\r\n"))

# if this sentence is short, then group it with other short sentences (so you get groups of continuous short sentences, broken up by one-element groups of longer sentences)
def short_sentence_grouper_bean_factory(target_sentence_length): # in chars
    def group_short_sentences(list_of_lists_of_sentences, next_sentence):
        if not list_of_lists_of_sentences:
            return [[next_sentence]]
        if len(next_sentence) < target_sentence_length:
            list_of_lists_of_sentences[-1].append(next_sentence)
        else:
            list_of_lists_of_sentences.append([next_sentence])
            list_of_lists_of_sentences.append([])
        return list_of_lists_of_sentences
    return group_short_sentences


def overlap(document_tokens, target_length):
    """ pseudo-paginate a document by creating lists of tokens of length `target_length` that overlap by 50%
    return a list of `target_length`-length lists of tokens, overlapping by 50%, representing all the tokens in the document
    """

    overlapped = []
    cursor = 0
    while len(' '.join(document_tokens[cursor:]).split()) >= target_length:
      overlapped.append(document_tokens[cursor:cursor+target_length])
      cursor += target_length // 2
    return overlapped


def sentences_to_short_paragraphs(group_of_sentences, target_length, min_shingle_length=10):
    """ outputting overlapping groups of shorter sentences 
    
        group_of_sentences = list of strings, where each string is a sentence
        target_length = max length IN WORDS of output sentences
        min_shingle_length = don't have sentences that differ just in the inclusion of a sentence of this size
    """
    if len(group_of_sentences) == 1:
        return [' '.join(group_of_sentences[0].split())]
    sentences_as_words = [sent.split() for sent in group_of_sentences]
    sentences_as_words = [sentence for sentence in sentences_as_words if [len(word) for word in sentence].count(1) < (len(sentence) * 0.5) ]
    paragraphs = []
    for i, sentence in enumerate(sentences_as_words[:-1]):
        if i > 0 and len(sentence) < min_shingle_length  and len(sentences_as_words[i-1]) < min_shingle_length and i % 2 == 0:
            continue # skip really short sentences if the previous one is also really short (but not so often that we lose anything )
        buff = list(sentence) # just making a copy.
        for subsequent_sentence in sentences_as_words[i+1:]:
            if len(buff) + len(subsequent_sentence) <= target_length:
                buff += subsequent_sentence
            else:
                break
        paragraphs.append(buff)
    return [' '.join(graf) for graf in paragraphs]


def to_short_paragraphs(text, paragraph_len=15, min_sentence_len=8): # paragraph_len in words, min_sentence_len in chars
    sentences = sentenceify( clean_html(text) )
    grouped_sentences = reduce(short_sentence_grouper_bean_factory(150) , sentences, [])
    return [sl for l in [sentences_to_short_paragraphs(group, paragraph_len) for group in grouped_sentences if len(group) >= 2 or (len(group) > 0 and len(group[0]) > min_sentence_len)] for sl in l if sl]

paragraph_target_length = 10

with open(f"docs-sentences{paragraph_target_length}.json", 'w') as writer: 
    with open('docs.jsonl', 'r') as reader:
        for i, line_json in tqdm(enumerate(reader), total=total_docs):
            line = json.loads(line_json)
            text = line["_source"]["content"][:1000000]
            for j, page in enumerate(to_short_paragraphs(text, paragraph_target_length)):
                total_short_paragraphs += 1
                writer.write(json.dumps({
                    "text": page, 
                    "_id": line["_id"], 
                    "chonk": j,
                    # "routing": line.get("_routing", None),
                    # "path": line["_source"]["path"]
                    }) + "\n")
print(f"total paragraphs: {total_short_paragraphs}")


100%|██████████| 88/88 [00:00<00:00, 160.97it/s]

total paragraphs: 1680





Let's take a look at the first few lines of the sentences file:

In [7]:
!head docs-sentences10.json

{"text": "I wanted to say something earlier but was afraid it would come out wrong or that I wouldn't be able to find the words to express how I really feel.", "_id": "Berkshire Hathaway.txt", "chonk": 0}
{"text": "However, I\u2019ve realized that staying silent is far worse because, like you;", "_id": "Berkshire Hathaway.txt", "chonk": 1}
{"text": "The murders of George Floyd in Minneapolis, Breonna Taylor in Kentucky, and Ahmaud Arbery in Georgia are the most recent names added to a lengthy list of horrors faced by black people over the past several hundred years.", "_id": "Berkshire Hathaway.txt", "chonk": 2}
{"text": "During this troublesome time, even when most people are craving normalcy, we must not turn a blind eye to injustices and continue to stand on the sidelines.", "_id": "Berkshire Hathaway.txt", "chonk": 3}
{"text": "Returning to the status quo will only perpetuate the damage being done.", "_id": "Berkshire Hathaway.txt", "chonk": 4}
{"text": "Everyone must do more to su

# Creating a Semantic-Similarity Search Engine

## Using a pre-trained model to transform sentences into vectors

We compute embeddings in _batches_ so that they fit in the GPU's RAM.
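Batching itself is simple. Here is a minimal sketch in plain Python, using a hypothetical `batches` helper and made-up sentences; the cell below gets the same effect from the `chunksize` argument to `pd.read_json`.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(10)]
sizes = [len(batch) for batch in batches(sentences, 4)]
print(sizes)  # [4, 4, 2]
```

Note that the last batch is usually smaller than the rest, which matters whenever you count items by batch.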

In [None]:
vector_index_chunk = AnnoyIndex(512, 'angular')  # Length of item vector that will be indexed

batch_size = 256
docs = {}

doc_counter = 0
with tqdm(total=1680) as pbar:
  for j, batch in enumerate(pd.read_json('docs-sentences10.json', lines=True, chunksize=batch_size)):
    batch_vecs = session.run(embedded_text, feed_dict={text_input: batch["text"]})
    # sentences.extend(batch["text"])
    pbar.update(len(batch))
    # number items by the actual batch length, so a short batch can't leave gaps in the ids
    doc_idxs = list(range(doc_counter, doc_counter + len(batch)))
    for vec, page_num, doc in zip(batch_vecs, doc_idxs, batch.iterrows()):
      vector_index_chunk.add_item(page_num, vec)
      docs[page_num] = doc[1]["_id"]
    doc_counter += len(batch)
    
    

## Building an index of semantic vectors

We use the [Annoy](https://github.com/spotify/annoy) library to efficiently look up results from the corpus.

In [9]:
vector_index_chunk.build(10) # 10 trees

True

In [10]:
vector_index_chunk.save('docs_annoy_small.bin') # you could save this and skip the step above, if you'd like

True

What's indexed in Annoy is a meaningless set of 512 numbers for each sentence. Computers can sort of understand this, but humans can't. So we load up into memory the list of all the sentences, so we can print those as the result.

This demo uses a fairly small (5 MB) set of documents. If you were using this in "real life" you'd probably want to use a database to hold onto these -- they'd be too big to hold in memory.
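As a rough sketch of that database idea, here is one way it could look with SQLite from Python's standard library. Everything here is an assumption for illustration: the `sentences` table, its columns, the sample rows and the `lookup` helper are hypothetical, not part of this notebook.

```python
import sqlite3

# Hypothetical layout: one row per sentence, keyed by its Annoy item id.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE sentences (item_id INTEGER PRIMARY KEY, doc_id TEXT, text TEXT)"
)

rows = [
    (0, "CompanyA.txt", "First example sentence."),
    (1, "CompanyA.txt", "Second example sentence."),
    (2, "CompanyB.txt", "Third example sentence."),
]
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)
conn.commit()

def lookup(item_ids):
    """Fetch (item_id, doc_id, text) rows for the ids a nearest-neighbor search returned."""
    placeholders = ",".join("?" for _ in item_ids)
    query = f"SELECT item_id, doc_id, text FROM sentences WHERE item_id IN ({placeholders})"
    found = {row[0]: row for row in conn.execute(query, item_ids)}
    return [found[i] for i in item_ids if i in found]  # keep the search's ranking order

results = lookup([2, 0])
print(results)
```

The point of the design is that only the handful of rows matching the search results are ever read, instead of keeping every sentence in memory.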

In [11]:
doc_texts = pd.read_json('docs-sentences10.json', lines=True);

## Verify that the semantic-similarity search engine works

Here are all the sentences in our collection:

In [12]:
doc_texts

Unnamed: 0,text,_id,chonk
0,I wanted to say something earlier but was afra...,Berkshire Hathaway.txt,0
1,"However, I’ve realized that staying silent is ...",Berkshire Hathaway.txt,1
2,"The murders of George Floyd in Minneapolis, Br...",Berkshire Hathaway.txt,2
3,"During this troublesome time, even when most p...",Berkshire Hathaway.txt,3
4,Returning to the status quo will only perpetua...,Berkshire Hathaway.txt,4
...,...,...,...
1675,It’s also why today Intel is pledging $1 milli...,Intel.txt,29
1676,I also encourage employees to consider donatin...,Intel.txt,30
1677,It’s with a heavy heart that I write this note...,Intel.txt,31
1678,"I know I speak for the leadership team, our bo...",Intel.txt,32


### Try searching for similar sentences yourself!

*   Try a few different sample sentences
*   Try changing the number of returned results (they are returned in order of similarity)

Once you've tried it out a bit, click the menu button to the left, and click Form -> Show Code to see what this is doing under the hood.


In [None]:
sample_query = "These events impact us, our customers and the communities we serve, and we are called to action. "  #@param ["We cannot tolerate it and none of us can stand by quietly if we observe it."] {allow-input: true}
num_results = 15  #@param {type:"slider", min:0, max:50, step:1}

query_embedding = session.run(embedded_text, feed_dict={text_input: [sample_query]})[0]

search_results = vector_index_chunk.get_nns_by_vector(query_embedding, n=num_results)

print('sentences similar to: "{}"\n'.format(sample_query))
# search_results

for idx, result_idx in enumerate(search_results):
  page_num = docs[result_idx]
  text = doc_texts.iloc[result_idx]["text"]
  print(f"{idx + 1}, \"{text}\", {page_num}")

Here's how I ran the sentences Sonia gave me:

In [14]:
query_sentences = ["it's more important than ever that we ground ourselves in the fundamental values that define us as a company.",
"Recent racial discrimination incidents in Minnesota, New York, Kentucky and Georgia have drawn widespread national attention",
"our focus on our values of respect, diversity and inclusion cannot waver. ",
"Supporting, encouraging and engaging everyone in our company – no matter their gender, color of their skin, sexual orientation, disability, religion, point of view or other unique qualities – are actions we must take every day. ",
"While social distancing separates us, this is not a time to be passive, but a time to reach out to colleagues, engage in a dialogue, make sure they're okay and their voices are heard. To let them know they are valued and important.",
"There is no room in our company for hate, intolerance, discrimination or harassment of any kind – either obvious or covert – toward our colleagues or customers. ",
"We cannot tolerate it and none of us can stand by quietly if we observe it.",
"These events impact us, our customers and the communities we serve, and we are called to action."]


In [15]:
for sample_query in query_sentences: 
  num_results = 16  #@param {type:"slider", min:0, max:50, step:1}

  query_embedding = session.run(embedded_text, feed_dict={text_input: [sample_query]})[0]

  search_results = vector_index_chunk.get_nns_by_vector(query_embedding, n=num_results)

  print('sentences similar to: "{}"\n'.format(sample_query))
  # search_results

  for idx, result_idx in enumerate(search_results):
    page_num = docs[result_idx]
    text = doc_texts.iloc[result_idx]["text"]
    print(f"{idx}, \"{text}\", {page_num}")

  print('\n-----')


sentences similar to: "it's more important than ever that we ground ourselves in the fundamental values that define us as a company."

0, "Dear Colleagues, During these extraordinary times, when many parts of our lives have changed, it's more important than ever that we ground ourselves in the fundamental values that define us as a company.", Exelon.txt
1, "In doing so, we have aligned ourselves with the values that you, our clients, live on a day-to-day basis through your inspiring work – values that are grounded in the dignity and worth of every human being.", TIAA.txt
2, "But, more than that, we reflect our core values to the world, and we advocate to make the places we live and work more inclusive.", Dow Chemical.txt
3, "With that in mind, now and each day, we remain guided by Our Purpose and Our Values of Integrity and Honesty, Safety and Respect, Diversity and Inclusion.", Kroger.txt
4, "One of the key ways we can continue to improve as a company and as individuals is to listen t

## Wait, how did that work?

### Nearest neighbors -- it's what it sounds like.

When is a sentence "similar" to another?

Remember those 512-dimensional vectors? We're treating two sentences as similar if their vectors are close together. Our search results are "nearest neighbors," which is what it sounds like.

Imagine the vectors were just three dimensions and we had four sentences, encoded as:

1. [1, 2, 1]
2. [100, 600, -12]
3. [5, 7, 3]
4. [-50, 1, -5798]

Which sentence is probably the most similar to sentence #1?
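If you'd rather check than guess, here is a small brute-force sketch in plain Python. It ranks the other three vectors by the distance sqrt(2 * (1 - cosine similarity)), which is how Annoy's documentation describes the "angular" metric we picked when building the index. The helper names are my own, and the search simply compares the query against every vector, which is fine for four sentences.

```python
import math

# The four toy "sentence vectors" from the list above, keyed by sentence number.
vectors = {
    1: [1, 2, 1],
    2: [100, 600, -12],
    3: [5, 7, 3],
    4: [-50, 1, -5798],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_distance(a, b):
    """Euclidean distance between the normalized vectors: sqrt(2 * (1 - cos))."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(a, b))))

query = vectors[1]
ranked = sorted(
    (n for n in vectors if n != 1),
    key=lambda n: angular_distance(query, vectors[n]),
)
nearest = ranked[0]
print(ranked)  # [3, 2, 4]
```

Sentence #3 comes out closest: it points in nearly the same direction as sentence #1 even though its numbers are bigger.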

"Annoy" is a library that makes this easier to calculate quickly for hundreds of thousands of sentences. 




-------------------



**Copyright 2019 The TensorFlow Hub Authors and Quartz.**

Licensed under the Apache License, Version 2.0 (the "License");

In [3]:
# Copyright 2019 The TensorFlow Hub Authors and Quartz All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================