In this notebook, we demonstrate how to extract contextual word embeddings with BERT and use nearest-neighbor search to look up the most contextually similar words.

This was inspired by the Stack Overflow question https://stackoverflow.com/questions/59865719/how-to-find-the-closest-word-to-a-vector-using-bert

# learn to extract embeddings from BERT

We use the `bert-embedding` package; see https://pypi.org/project/bert-embedding/

We use a GPU, so please select a GPU runtime in Colab.

In [None]:
!pip install mxnet-cu102
!pip install bert-embedding

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mxnet-cu102
  Downloading mxnet_cu102-1.9.1-py3-none-manylinux2014_x86_64.whl (380.8 MB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet-cu102
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.8.4 mxnet-cu102-1.9.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-embedding
  Downloading bert_embedding-1.0.1-py3-none-any.whl (13 kB)
Collecting typing==3.6.6
  Downloading typing-3.6.6-py3-none-any.whl (25 kB)
Collecting mxnet==1.4.0
  Downloading mxnet-1.4.0-py2.py3-none-manylinux1_x86_64.whl (29.6 MB)

In [None]:
import mxnet as mx
from bert_embedding import BertEmbedding

In [None]:
# To run on a GPU, create a context and pass it to the model:
# ctx = mx.gpu(0)
# bert = BertEmbedding(ctx=ctx)
bert = BertEmbedding()

Vocab file is not found. Downloading.
Downloading /root/.mxnet/models/book_corpus_wiki_en_uncased-a6607397.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/book_corpus_wiki_en_uncased-a6607397.zip...
Downloading /root/.mxnet/models/bert_12_768_12_book_corpus_wiki_en_uncased-75cc780f.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/bert_12_768_12_book_corpus_wiki_en_uncased-75cc780f.zip...


In [None]:
from tqdm.auto import tqdm, trange

In [None]:
bert_abstract = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
 Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
 As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. 
BERT is conceptually simple and empirically powerful. 
It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%."""

In [None]:
sentences = bert_abstract.split('\n')
result = bert(sentences)
toks, embs = result[0]
print(toks)
print(len(toks), len(embs))
print(embs[0][:10])

['we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'bert', ',', 'which', 'stands', 'for', 'bidirectional', 'encoder', 'representations', 'from', 'transformers']
18 18
[ 0.47964773  0.1824888  -0.28597528 -0.46567446  0.01248981 -0.07430505
 -0.18017295  0.37813222  0.9135139  -0.25295883]


# process a corpus

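We download a dump of Gab posts, `GABPOSTS_2018-10.xz`, from Pushshift and keep the first 100,001 posts. (This notebook originally used a 10k web-public .com corpus from https://wortschatz.uni-leipzig.de/en/download/; some of the saved outputs below appear to come from that corpus.)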


In [None]:
!wget https://files.pushshift.io/gab/GABPOSTS_2018-10.xz

--2022-11-27 04:12:34--  https://files.pushshift.io/gab/GABPOSTS_2018-10.xz
Resolving files.pushshift.io (files.pushshift.io)... 172.67.170.36, 104.21.28.11, 2606:4700:3031::6815:1c0b, ...
Connecting to files.pushshift.io (files.pushshift.io)|172.67.170.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337144496 (322M) [application/octet-stream]
Saving to: ‘GABPOSTS_2018-10.xz’


2022-11-27 04:13:03 (11.6 MB/s) - ‘GABPOSTS_2018-10.xz’ saved [337144496/337144496]



In [None]:
import lzma
import json
import pandas as pd
import re
import numpy as np
gab_posts = pd.DataFrame()
temp = []
counter = 0
# Matches @mentions, backslash sequences and http(s) URLs, each up to the next whitespace.
strip_pattern = re.compile(r"(?:\@|\\|https?\://)\S+")
with lzma.open('GABPOSTS_2018-10.xz', mode='r') as file:
    for line in file:
        # Keep only the first 100001 posts to limit memory use;
        # raw/other fields could be added if memory allows.
        if counter > 100000:
            break
        counter = counter + 1
        post = json.loads(line)  # parse each JSON line once
        temp.append({"body": strip_pattern.sub("", post['body']), "date": post['created_at']})
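
The cleaning regex can be tried on a single string to see what it removes. The snippet below is a minimal sketch: `clean_body` and the example strings are hypothetical, only the pattern is taken from the loading cell.

```python
import re

# Same pattern as in the loading cell: drop @mentions, backslash sequences
# and http(s) URLs, each up to the next whitespace character.
CLEAN_PATTERN = re.compile(r"(?:\@|\\|https?\://)\S+")


def clean_body(text):
    """Hypothetical helper: apply the cleaning regex to one post body."""
    return CLEAN_PATTERN.sub("", text)


example = "@someone look at https://example.com/page please"
cleaned = clean_body(example)
print(repr(cleaned))

# The "@" rule also truncates e-mail addresses.
print(repr(clean_body("mail me at joe@example.com")))
```

Note that the surrounding whitespace is kept, so removed items leave double spaces behind, and anything following an `@` (such as the domain of an e-mail address) is dropped as well.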


In [None]:
# DataFrame.append was removed in pandas 2.0; build the frame directly from the list of dicts.
gab_posts = pd.DataFrame(temp)

In [None]:
len(temp)

100001

In [None]:
banned_words = ['pussie', 'phonesex', 'footjob', 'horniest', 'clitoris', 'headfuck', 'areola', 'pussies', 'goddamnes', 'suicide', 'voyeurweb', 'suicide girls', 'niggarded', 'deepthroat', 'fuckbuddy', 'nigra', 'freefuck', 'boob', 'hentai', 'rentafuck', 'wanking', 'jerk off', 'molester', 'horney', 'titfuckin', 'milf', 'wrapping men', 'whorefucker', 'masturbating', 'dick', 'honkers', 'chocolate rosebuds', 'neonazi', 'vibrater', 'uptheass', 'shitdick', 'pussycat', 'naked', 'group sex', 'suckmydick', 'pussyeater', 'masturbate', 'stupidfuck', 'nig', 'rape', 'meth', 'virgin', 'livesex', 'terrorist', 'upskirt', 'shortfuck', 'genital', 'jiggaboo', 'marijuana', 'cumshots', 'koon', 'holestuffer', 'tit', 'assbagger', 'ball sack', 'sexpot', 'suckmyass', 'lovejuice', 'phukking', 'wigger', 'black cock', 'whiskeydick', 'blonde on blonde action', 'retarded', 'kunt', 'motherfuckin', 'orgy', 'ejaculation', 'fuckme', 'phone sex', 'fuckher', 'niggerhole', 'intercourse', 'pussylips', 'niggardly', 'tongethruster', 'nig nog', 'kumbullbe', 'nigger', 'wanker', 'peepshpw', 'cocks', 'omorashi', 'female squirting', 'blow job', 'bung hole', 'homicide', 'penetration', 'puddboy', 'gang bang', 'lickme', 'spermhearder', 'titties', 'rigger', 'shitblimp', 'twat', 'fag', 'gangbanger', 'orgasim', 'porno', 'assfuck', 'pussy', 'sodomy', 'cumshot', 'cock', 'jihad', 'niggaz', 'picaninny', 'bondage', 'dry hump', 'poorwhitetrash', 'whitenigger', 'nip', 'masturbation', 'peni5', 'sexed', 'escort', 'g-spot', 'muffindiver', 'fingerbang', 'shite', 'gypo', 'scrotum', 'creampie', 'goddamnmuthafucker', 'foreskin', 'titty', 'dildo', 'sexkitten', 'anus', 'niggling', 'niggerhead', 'footlicker', 'pussylover', 'limpdick', 'fucktard', 'male squirting', 'gangbang', 'nigg', 'suckdick', 'vagina', 'reestie', 'bangbros', 'givehead', 'spank', 'trailertrash', 'giant cock', 'fucktards', 'sexo', 'pussypounder', 'gaymuthafuckinwhore', 'negroid', 'lsd', 'ball gag', 'jigga', "nigger's", 'orgasm', 'nlgger', 'asskiss', 'coprolagnia', 
'boobs', 'pussylicker', 'whitetrash', 'mothafuckings', 'fingering', 'scum', 'paedophile', 'sperm', 'testicle', 'poopchute', 'wank', 'jerkoff', 'octopussy', 'pedophile', 'reverse cowgirl', 'negroes', 'suckmytit', 'big tits', 'sonofbitch', 'swastika', 'jizz', 'sexslave', 'bunghole', 'retard', 'hore', 'nipplering', 'kink', 'nipples', 'vaginal', 'tittie', 'hitler', 'jiggabo', 'pedobear', 'handjob', 'pubic', 'kkk', 'niggled', 'pthc', "negro's", 'doggystyle', 'samckdaddy', 'gangbanged', 'clit', 'hand job', 'beaners', 'ecchi', 'doggy style', 'nutten', 'bdsm', 'cunnilingus', 'killing', 'genitals', 'poop chute', 'fuckfest', 'spermherder', 'brunette action', 'motherfuck', 'cumming', 'erotic', 'splooge moose', 'foursome', 'niglet', 'nigre', 'incest', 'cunt', 'molest', 'threesome', 'kissass', 'narcotic', 'sexhouse', 'nudity', 'fudgepacker', 'snownigger', 'white power', 'jiggerboo', 'honky', 'rosy palm and her 5 sisters', 'nittit', 'horny', 'hotpussy', 'ball sucking', 'nignog', 'palesimian', 'jizjuice', 'zoophilia', 'nigga', 'asslicker', 'niggle', 'nlggor', 'pornography', 'sexing', 'slutt', 'titlicker', 'kunnilingus', 'fuckwhore', 'wet dream', 'spunk', 'pisser', 'puss', 'boner', 'skeet', 'sextoys', 'vibrator', 'manpaste', 'faggot', 'humping', 'nipple', 'double penetration', 'coons', 'assklown', 'pubes', 'fuckface', 'anal', 'nimphomania', 'blowjob', 'rimjob', 'fisting', 'niggardliness', 'sultry women', 'jizzim', 'kinkster', 'skankfuck', 'penis', 'how to kill', 'semen', 'mothafucker', 'analsex', 'niggur', 'panty', 'deep throat', 'foot fetish', 'freakyfucker', 'date rape', 'assblaster', 'bukkake', 'lesbo', 'spaghettinigger', 'beaner', 'clover clamps', 'twobitwhore', 'nigr', 'fuckfriend', 'sextoy', 'prostitute', 'pussyfucker', 'kanake', 'porchmonkey', 'testicles', 'erotism', 'pusy', 'assjockey', 'pimpjuic', 'booty call', 'kaffir', 'fuckable', 'goldenshower', 'homobangers', 'pegging', 'rapist', 'venus mound', 'raping', 'fudge packer', 'sexcam', 'timbernigger', 'viagra', 'make me 
come', 'beastiality', 'leather restraint', 'coon', 'futanari', 'fuckina', 'iblowu', 'masterbate', 'luckycammeltoe', "niggardliness's", 'fuckmehard', 'tits', 'suckme', 'intheass', 'niggarding', 'tonguetramp', 'niggor', 'schlong', 'niggah', 'raped', 'nazi', 'two girls one cup', 'huge fat', 'upthebutt', 'daterape', 'mastabater', 'cum', 'asslick', 'raghead', 'bestiality', 'golden shower', 'niggers', 'penises', 'mufflikcer', 'camel toe', 'shaved pussy', 'niggles', 'jijjiboo']


In [None]:

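# Replace line breaks with spaces, then drop posts that are empty after cleaning.
gab_posts['body'] = gab_posts['body'].replace(r'\n', ' ', regex=True)
gab_posts['body'] = gab_posts['body'].replace(r'\r', ' ', regex=True)
gab_posts['body'] = gab_posts['body'].replace('', np.nan)
gab_posts.dropna(subset=['body'], inplace=True)
# This only displays a lowercased copy; the column itself keeps its original casing.
gab_posts['body'].str.lower()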


0                                        #trade #winning   
1         o deputado arthur lira (pp/al):  ⚠️ responde a...
2                       cocaine mitch comes out swinging.  
3         #demoncrats have no redeeming value. a black c...
4         putting up the rent in one town/city/country w...
                                ...                        
99996     this is called a ‘hejira’ or migration in arab...
99997     don jr. no doubt gets updates via guild couriers.
99998     mental illness is what it is. the truth doesn'...
99999     i really hope she runs again so we call all re...
100000    israel has lined china up by stealing all the ...
Name: body, Length: 87563, dtype: object

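Collect the post bodies into an array of sentences. The commented-out line removed the row index from each line of a tab-separated corpus and is not needed for the Gab data.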

In [None]:
# all_sentences = [l.split('\t')[1] for l in lines] 
all_sentences = gab_posts['body'].to_numpy()

# create a search index

In [None]:
from sklearn.neighbors import KDTree
import numpy as np


class ContextNeighborStorage:
    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model

    def process_sentences(self):
        result = self.model(self.sentences)

        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        all_embeddings = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)
        all_embeddings = np.stack(all_embeddings)
        # We normalize embeddings to unit length, so that ranking by Euclidean distance
        # matches ranking by cosine distance (d^2 = 2 * (1 - cos) for unit vectors).
        self.normed_embeddings = all_embeddings / np.linalg.norm(all_embeddings, axis=1, keepdims=True)

    def build_search_index(self):
        # this takes some time
        self.indexer = KDTree(self.normed_embeddings)

    def query(self, query_sent, query_word, k=10, filter_same_word=False):
        toks, embs = self.model([query_sent])[0]

        found = False
        for tok, emb in zip(toks, embs):
            if tok == query_word:
                found = True
                break
        if not found:
            raise ValueError('The query word {} is not a single token in sentence {}'.format(query_word, toks))
        emb = emb / np.linalg.norm(emb)  # scale the query embedding to unit length

        if filter_same_word:
            initial_k = max(k, 100)
        else:
            initial_k = k
        di, idx = self.indexer.query(emb.reshape(1, -1), k=initial_k)
        distances = []
        neighbors = []
        contexts = []
        for i, index in enumerate(idx.ravel()):
            token = self.all_tokens[index]
            if filter_same_word and (query_word in token or token in query_word):
                continue
            distances.append(di.ravel()[i])
            neighbors.append(token)
            contexts.append(self.sentences[self.sentence_ids[index]])
            if len(distances) == k:
                break
        return distances, neighbors, contexts
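
The class normalizes embeddings so that Euclidean distance can stand in for cosine distance. The relationship behind this is that for unit vectors the squared Euclidean distance equals `2 * (1 - cos)`. The following dependency-free sketch checks this numerically on two made-up vectors (they are not real BERT embeddings, and all function names are illustrative).

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_cosine(d):
    """Convert a Euclidean distance between unit vectors to cosine similarity."""
    return 1 - d * d / 2


# Toy vectors, made up for illustration.
a = [0.5, 1.0, -2.0]
b = [1.5, 0.2, -1.0]

d = euclidean_distance(normalize(a), normalize(b))
cos = cosine_similarity(a, b)

# For unit vectors: d^2 = 2 * (1 - cos)
print(d ** 2, 2 * (1 - cos))
print(distance_to_cosine(d), cos)
```

Because the mapping from distance to cosine similarity is monotonic, the nearest neighbors by Euclidean distance on normalized vectors are also the nearest by cosine distance, and the distances returned by `query` can be converted with `distance_to_cosine` if a similarity score is preferred.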

Now let's use this indexer

In [None]:
storage = ContextNeighborStorage(sentences=all_sentences, model=bert)
storage.process_sentences()

  0%|          | 0/87563 [00:00<?, ?it/s]
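
`process_sentences` flattens the per-sentence model output into parallel lists, so that any position in the index can be mapped back to its token and source sentence. The sketch below reproduces only that bookkeeping on made-up data. `flatten_results` and `mock_results` are hypothetical names, and the 2-D "embeddings" stand in for the real 768-dimensional vectors.

```python
def flatten_results(results):
    """Flatten per-sentence (tokens, embeddings) pairs into parallel lists."""
    sentence_ids, token_ids, all_tokens, all_embeddings = [], [], [], []
    for i, (toks, embs) in enumerate(results):
        for j, (tok, emb) in enumerate(zip(toks, embs)):
            sentence_ids.append(i)   # which sentence the token came from
            token_ids.append(j)      # position of the token in that sentence
            all_tokens.append(tok)
            all_embeddings.append(emb)
    return sentence_ids, token_ids, all_tokens, all_embeddings


# Made-up model output for two sentences.
sentences = ["a river bank", "bank account"]
mock_results = [
    (["a", "river", "bank"], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),
    (["bank", "account"], [[0.7, 0.8], [0.9, 1.0]]),
]

sentence_ids, token_ids, all_tokens, all_embeddings = flatten_results(mock_results)

# Flat index 3 is the first token of the second sentence.
flat_index = 3
print(all_tokens[flat_index], "->", sentences[sentence_ids[flat_index]])
```

This is why `query` can return the context sentence for each neighbor: the index returned by the search is looked up in `sentence_ids`.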

Building the index takes some time.

In [None]:
storage.build_search_index()
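
To see what the index returns, it can help to compare it with an exhaustive search on a tiny example. The sketch below is a brute-force stand-in, not the KDTree itself: it returns distances and indices sorted from nearest to farthest, which is the same shape of result that `query` consumes. The tokens and 2-D vectors are made up.

```python
import math


def brute_force_query(index_vectors, query_vector, k):
    """Return (distances, indices) of the k nearest vectors, closest first."""
    scored = []
    for i, vec in enumerate(index_vectors):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec, query_vector)))
        scored.append((dist, i))
    scored.sort()  # sort by distance, then by index for ties
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D unit vectors standing in for normalized token embeddings.
toy_tokens = ["bank", "river", "money", "banks"]
toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.8, 0.6],
    [0.6, 0.8],
]

distances, indices = brute_force_query(toy_vectors, [1.0, 0.0], k=2)
neighbors = [toy_tokens[i] for i in indices]
print(neighbors, distances)
```

A brute-force scan costs one distance computation per indexed token for every query, which is the cost a tree-based index tries to reduce.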

In [None]:
banned_sentences = {word: [] for word in banned_words}

# Plain substring match: a word also matches inside longer words (e.g. 'meth' in 'method').
for body in gab_posts['body']:
    for word in banned_words:
        if word in body:
            # Can just include sentence if we want
            banned_sentences[word].append(body)
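
The `word in body` test is a substring match, so it also collects posts where the word only occurs inside a longer word. `storage.query` raises a `ValueError` for those posts because the word is not a token of the sentence, but only after running the model on them. The sketch below contrasts the substring test with a whole-word regex as one possible alternative. The word list, posts and function names are made up for illustration.

```python
import re

# Small made-up word list and posts, for illustration only.
words = ["meth", "scum"]
posts = [
    "this method works fine",
    "he was caught selling meth",
    "pond scum everywhere",
]


def substring_matches(word, texts):
    """Plain substring test, as used when collecting banned_sentences."""
    return [t for t in texts if word in t]


def whole_word_matches(word, texts):
    """Alternative: match only when the word is delimited by word boundaries."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [t for t in texts if pattern.search(t)]


print(substring_matches("meth", posts))
print(whole_word_matches("meth", posts))
```

Whole-word matching is still case-sensitive here, and it does not guarantee that the BERT tokenizer keeps the word as a single token, so the `ValueError` check in `query` remains necessary.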

In [None]:
print(banned_sentences)



In [None]:
final_count = {}
for banned_word, sentences in banned_sentences.items():
    ctr_passed_sentences = 0
    ctr_total_checked = 0
    final_count[banned_word] = {}
    for sentence in sentences:
        # Stop after roughly 100 successful queries or 1000 attempts per word.
        if ctr_passed_sentences > 100 or ctr_total_checked > 1000:
            break
        ctr_total_checked = ctr_total_checked + 1
        try:
            distances, neighbors, contexts = storage.query(query_sent=sentence, query_word=banned_word, k=5, filter_same_word=True)
        except Exception:
            # query raises ValueError when the banned word is not a single token of the sentence
            continue
        # Count how often each neighbor token is returned for this banned word.
        for w in neighbors:
            final_count[banned_word][w] = final_count[banned_word].get(w, 0) + 1
        ctr_passed_sentences = ctr_passed_sentences + 1

KeyboardInterrupt: ignored

In [None]:
finalized_top_10 = {}

for key, occur_dict in final_count.items():
    # Sort neighbors by occurrence count (descending) and keep the top 10;
    # a word without any neighbors keeps an empty dict.
    top_items = sorted(occur_dict.items(), key=lambda item: item[1], reverse=True)[:10]
    finalized_top_10[key] = dict(top_items)
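
The counting and top-N selection can also be expressed with `collections.Counter` from the standard library. This is a sketch of an alternative, not the code used to produce the results in this notebook, and the neighbor lists are made up.

```python
from collections import Counter

# Made-up neighbor lists, as storage.query might return for three sentences.
neighbor_lists = [
    ["death", "murder", "die"],
    ["death", "murder"],
    ["death"],
]

counts = Counter()
for neighbors in neighbor_lists:
    counts.update(neighbors)  # add 1 for every neighbor token in the list

# most_common(n) returns the n most frequent (token, count) pairs.
top_2 = dict(counts.most_common(2))
print(top_2)
```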

In [None]:
finalized_top_10

{'pussie': {},
 'phonesex': {},
 'footjob': {},
 'horniest': {},
 'clitoris': {},
 'headfuck': {},
 'areola': {},
 'pussies': {'wussie': 11,
  'wussies': 10,
  'pussified': 4,
  'titties': 3,
  'cunts': 3,
  'mussy': 3,
  'niggers': 3,
  'fags': 2,
  'coozes': 2,
  'yuppiebrats': 2},
 'goddamnes': {},
 'suicide': {'death': 34,
  'murder': 31,
  'die': 19,
  'commit': 13,
  'suicidal': 11,
  'rape': 8,
  'committing': 8,
  'killed': 8,
  'terror': 6,
  'committed': 5},
 'voyeurweb': {},
 'suicide girls': {},
 'niggarded': {},
 'deepthroat': {'bulimia': 1,
  'playthrough': 1,
  'mindboggling': 1,
  'swampcritter': 1,
  'squareclouds': 1},
 'fuckbuddy': {},
 'nigra': {},
 'freefuck': {},
 'boob': {'libs': 2,
  'booger': 2,
  'boof': 2,
  'booette': 1,
  'blowjob': 1,
  'dweeb': 1,
  'nutcase': 1},
 'hentai': {'futanari': 4,
  'hokuto': 4,
  'futa': 4,
  'bisokuzenshin': 3,
  'otaku': 3,
  'ecchi': 1,
  'meio': 1,
  'bowls': 1,
  'gen': 1,
  'noodles': 1},
 'rentafuck': {},
 'wanking': {'d

In [None]:
import json
with open('top_10_100000.json', 'w', encoding='utf-8') as f:
    json.dump(finalized_top_10, f, ensure_ascii=False, indent=4)

In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is an investment bank.', query_word='bank', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))


# query homonymous words

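Now see how it works:

* if the word "bank" is in the context of "power bank", we would expect the nearest neighbors to be "power bank" as well. In the output below they are mostly financial banks, which suggests that the indexed corpus contains few such contexts.

* if the word "bank" is in the context of "investment bank", the nearest neighbors are also "bank", in a financial context.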



In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is a power bank.', query_word='bank', k=5)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

bank 0.7777660875463489  END THE JEWISH FED, GIVE THE AMERICAN PEOPLE 50 STATE BANKS AND ONE FEDERAL BANK, OWNED BY THE PEOPLE!
bank 0.8313893178415127  ALERT! Noble Bank Implosion Will Pull Down Tether, Bitfinex, Block.one & EOS!!
bank 0.8355487440916679  Daddy was over the CIA blackop bank in DC...she’s mk ultra and graduating other little commies out of that cesspool bastion of communism institution. Clown is a dead ringer for Peter Strozk
bank 0.838885023508294  #München: #Pakistaner grapscht einer auf einer Bank tanzenden Wiesnbesucherin an Bein und Gesäß, als diese seine Hand wegschlägt wechselt er zu einer anderen Frau und stößt ihr einen Finger zwischen die Pobacken – fängt sich eine Watschen dafür #Oktoberfest
banking 0.844878896015092  Now we have countries run by a  centralized banking system controlled by old men of whom think they are akin to #Gods. #psychopaths #delusionsofgrandeur


In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is an investment bank.', query_word='bank', k=5)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

bank 0.6642010406218649  The bank also was awarded a 5-star, Superior Bauer rating for Dec. 31, 2017, financial data.
bank 0.7402058801860062  Even though the cost ratio in UBS's investment bank looks stubbornly and dangerously high at 88%.
bank 0.7738409107605938  This simplifies the handling of new issues for the fund or the custodian bank, which benefits third-party banks and minimizes administrative costs.
banks 0.7774697527263753  This is an unusual fee, as many banks don’t charge you to move money in and out of your savings account.
bank 0.7801288580435606  Pop open your business bank account and take a look at the past few months of transactions.


If we exclude neighbors that contain the word "bank" (`filter_same_word=True`), the investment context returns mostly finance-related words, while "power bank" also returns a few non-financial contexts.

With a larger corpus, we would probably find more relevant examples (like "battery").
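
With `filter_same_word=True`, `query` fetches at least 100 candidates and then skips every token that contains the query word or is contained in it. The sketch below isolates that filter on a made-up candidate list; `is_same_word` and `filter_neighbors` are illustrative names, not part of the class.

```python
def is_same_word(query_word, token):
    """Filter used by query(): True if either string contains the other."""
    return query_word in token or token in query_word


def filter_neighbors(query_word, candidates, k):
    """Keep the first k candidate tokens that do not overlap with query_word."""
    kept = []
    for token in candidates:
        if is_same_word(query_word, token):
            continue
        kept.append(token)
        if len(kept) == k:
            break
    return kept


# Made-up candidate tokens, ordered from nearest to farthest.
candidates = ["bank", "banks", "banking", "financial", "ban", "account", "funds"]
print(filter_neighbors("bank", candidates, k=3))
```

The containment test works in both directions, so short tokens that happen to be substrings of the query word (such as "ban" for "bank") are skipped too. If fewer than `k` candidates survive the filter, fewer than `k` neighbors are returned.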

In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is an investment bank.', query_word='bank', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))


financial 0.825917039570758  I've said it before and I'll say it again--creating an alternative online financial platform is the most important first step in breaking the back of these rootless, transnational, cosmopolitan, globalist, media dominating money lenders. Peter Thiel has built one before. I'll bet Elon Musk is slowly beginning to hate them, too.
financial 0.8451282058433567  One can, without needing financial institutions or banks. When one has crypto, you are already secured funding.
account 0.8858836058276792  what determines what is a qualified account?
account 0.8900929688788969  You'd think an account that reportedly "takes down MRAs" would at least provide some evidence? #laughingattakedownmras
funds 0.8918568158785994  "Donald Trump has lined up three New York hedge funds, including money from billionaire George Soros, to invest $160 million in his Chicago skyscraper"


In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is a power bank.', query_word='bank', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

account 0.9242870843320813  what determines what is a qualified account?
exchange 0.9279305347364061  So, how exactly do you withdraw coins from an exchange that has frozen all activity?
account 0.928275682636193  Verifying account...
reserve 0.9400618116385648  Wherever you turn, any rabbi hole you investigate, federal reserve, Chemtrails, GMO, Big pharma, Zionism, scientism, phornography, the plague, earth's shape, it is all an attack on our Creator and his credibility and sanity. and his limitless love for his creation..   Jan Irvin is woke and connects all.
account 0.9409660081907342  I’m New to Gab. Does the pres have an account here?


For a "river bank", there are no relevant examples in our corpus, but the result is still not completely meaningless: in the output below, the closest neighbor is the token "river".

In [None]:
distances, neighbors, contexts = storage.query(query_sent='It is a river bank.', query_word='bank', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

river 0.9818166324827107  still river mic!! so smooth in the image
still 0.987935092243826  still river mic!! so smooth in the image
mic 0.994373162983143  still river mic!! so smooth in the image
line 1.0032134728858089  Therm line fast
line 1.0032134728858089  Therm line fast


# query named entities

Now let's try a query with a named entity. We can see that the nearest neighbors of Amazon the company are companies as well, while those of Amazon the toponym include toponyms.

In [None]:
s = "Bezos announced that its two-day delivery service, Amazon Prime, had surpassed 100 million subscribers worldwide."
distances, neighbors, contexts = storage.query(query_sent=s, query_word='amazon', k=5)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

amazon 0.6034335621074037  Expanded third-party integration including Amazon Alexa, Google Assistant, and IFTTT.
amazon 0.6813754730639783  And fewer than 1 percent of e-commerce sales took place at Amazon, everyone’s favorite scapegoat for retail’s struggles.
amazon 0.6875736233539647  The Alexa Skills Kit (ASK) empowers anyone to leverage Amazon’s years of innovation in the field of voice design.
amazon 0.6956514512735816  The Republican tax overhaul that Trump signed last year could have singled out Amazon for harsh treatment, but the company still qualifies for every corporate tax break.
amazon 0.7096277774470231  One agency, for instance, is working with a real estate company and looking for data on people shopping on Amazon.com for moving boxes.


In [None]:
s = "The Atlantic has sufficient wave and tidal energy to carry most of the Amazon's sediments out to sea, thus the river does not form a true delta"
distances, neighbors, contexts = storage.query(query_sent=s, query_word='amazon', k=5)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

brazil 0.9719889877743066  And, this year our stories are the work of traveling from Brazil’s Iguassu Falls to a chicken farm in Atlanta, building a 270-degree in-car VR experience, creating new partnerships to connect our audience’s passions with experiences, and much more.
amazon 0.9849940681911138  Describes how to create, manage, and use an Amazon CloudSearch domain to implement a search solution for your website or application.
amazon 0.9903185403586894  The Alexa Skills Kit (ASK) empowers anyone to leverage Amazon’s years of innovation in the field of voice design.
amazon 1.0300692791411952  Amazon’s dynamic pricing system is often a puzzle, but adding to the mystery is the fact that these items have different prices for different sizes but were all sold and shipped via Amazon — not through different sellers.
brazilian 1.0338752062964356  In one Brazilian examining the instances of hearing loss in bus drivers, it was found that 32.7% of bus drivers experienced noise-induced hearing loss.

Moreover, we can infer that Amazon the company is related to Google, Alexa and Netflix, whereas Amazon the river is related to Brazil and the Brazilian city Belem. 

In [None]:
s = "Bezos announced that its two-day delivery service, Amazon Prime, had surpassed 100 million subscribers worldwide."
distances, neighbors, contexts = storage.query(query_sent=s, query_word='amazon', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

google 0.8355246761580234  Expanded third-party integration including Amazon Alexa, Google Assistant, and IFTTT.
alexa 0.9301979249848525  Expanded third-party integration including Amazon Alexa, Google Assistant, and IFTTT.
netflix 0.9347305391903858  It isn't just on-demand video services like Netflix and Amazon that are taking customers away from cable.
google 0.9353963024942532  That month, Google sites were accessed by 185.05 million unique mobile users.
google 0.9362379311474001  Both Google and Amazon don’t place any limits, saying that the frontend load balancers will scale up as needed to support the traffic.


In [None]:
s = "The Atlantic has sufficient wave and tidal energy to carry most of the Amazon's sediments out to sea, thus the river does not form a true delta"
distances, neighbors, contexts = storage.query(query_sent=s, query_word='amazon', k=5, filter_same_word=True)
for d, w, c in zip(distances, neighbors, contexts):
    print('{} {}  {}'.format(w, d, c.strip()))

brazil 0.9719889877743066  And, this year our stories are the work of traveling from Brazil’s Iguassu Falls to a chicken farm in Atlanta, building a 270-degree in-car VR experience, creating new partnerships to connect our audience’s passions with experiences, and much more.
brazilian 1.0338752062964356  In one Brazilian examining the instances of hearing loss in bus drivers, it was found that 32.7% of bus drivers experienced noise-induced hearing loss.
brazil 1.0511793567879462  A subset of these action figures were also released in Canada, and years later, in Brazil.
brazil 1.0534603912133236  And with Brazil making a comeback — and AES having operations there — things are looking up for this stock on the growth side of equation.
belem 1.0627748582658  Start at the Embarcadero Belem dock to experience the waterways.
