# Interactive word2vec visualization

## Intro

This notebook contains all the steps needed to train a word2vec model from a file of text sentences and then visualize that model in an interactive, galaxy-like style using the https://github.com/anvaka/pm library.

Assumptions:
1. You have a text file named "input.txt" in which sentences are separated by new lines.
2. The distance between two words in the word2vec model roughly reflects how often the two words occur together, or in similar contexts, within a sentence.
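To make the second assumption more concrete, here is a minimal sketch that counts how often two words appear close to each other in a toy corpus. It is only an illustration of the co-occurrence signal, not how word2vec is trained internally; the function name `cooccurrence_counts`, the window size of 2, and the toy sentences are all made up for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look only forward so that every pair is counted once per occurrence
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    counts[tuple(sorted((word, other)))] += 1
    return counts


# Hypothetical toy corpus, one sentence per line as in input.txt
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
counts = cooccurrence_counts(toy_corpus)
print(counts[("on", "sat")])   # pair seen in both sentences
print(counts[("cat", "dog")])  # pair never seen together
```

Pairs with higher counts are the ones a model trained on this corpus would see together most often.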

Contents of sample input.txt file:

In my real-world example (which unfortunately I cannot disclose), the file contains 131101 lines of phrases.

## Data preparation and model training

In [2]:
#input file
input_file = "input.txt"

# these files are generated during the script run
phrases_file = "phrases.file"
bin_file = 'bin.bin'
clusters_file = "clusters.file"

In [4]:
# import modules & set up logging
import gensim, logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
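from smart_open import smart_open


class MySentences(object):
    """Stream tokenized sentences from a file, one sentence per line."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated more than once
        for line in smart_open(self.filename, 'r'):
            yield line.split()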
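The following sketch shows what such a sentence iterator yields and why it is written as a class rather than a plain generator: gensim iterates over the corpus more than once (first to build the vocabulary, then to train), so the object has to be restartable. The class `LineSentences` is a hypothetical stand-in that uses the built-in `open()` instead of `smart_open`, and the sample text is invented, not the real input.txt.

```python
import os
import tempfile


class LineSentences:
    """Stand-in for MySentences that uses the built-in open()."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, "r", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Hypothetical sample data written to a temporary file
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("hello world\n\nword2vec maps words to vectors\n")
    sample_path = tmp.name

sample_sentences = LineSentences(sample_path)
first_pass = list(sample_sentences)
second_pass = list(sample_sentences)  # a second iteration re-reads the file
os.remove(sample_path)

print(first_pass)
```

A one-shot generator would be exhausted after the first pass, which is why both passes are compared here.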

Let's train the model

In [None]:
sentences = MySentences(input_file)
model = gensim.models.Word2Vec(sentences)
model.save(bin_file)

In [5]:
model = gensim.models.Word2Vec.load(bin_file)

2018-09-09 22:03:18,067 : INFO : loading Word2Vec object from bin.bin
2018-09-09 22:03:18,629 : INFO : loading wv recursively from bin.bin.wv.* with mmap=None
2018-09-09 22:03:18,630 : INFO : loading vectors from bin.bin.wv.vectors.npy with mmap=None
2018-09-09 22:03:18,726 : INFO : setting ignored attribute vectors_norm to None
2018-09-09 22:03:18,737 : INFO : loading vocabulary recursively from bin.bin.vocabulary.* with mmap=None
2018-09-09 22:03:18,746 : INFO : loading trainables recursively from bin.bin.trainables.* with mmap=None
2018-09-09 22:03:18,752 : INFO : loading syn1neg from bin.bin.trainables.syn1neg.npy with mmap=None
2018-09-09 22:03:18,868 : INFO : setting ignored attribute cum_table to None
2018-09-09 22:03:18,869 : INFO : loaded bin.bin


The model is ready. You can play with it, for example by looking at the probabilities of words occurring in a given context:

In [None]:
print(model.predict_output_word(['you']))

In [14]:
len(list(model.wv.vocab))

131101

In [15]:
!pip install annoy

Collecting annoy
  Downloading https://files.pythonhosted.org/packages/7a/9e/dbfa8bad4015b23b1c2e29f1dc6178c3eec4e004a654ea7b38ba618998ec/annoy-1.13.0.tar.gz (634kB)
Building wheels for collected packages: annoy
  Running setup.py bdist_wheel for annoy ... done
  Stored in directory: /Users/timming/Library/Caches/pip/wheels/96/af/26/f26df0a684b1e41ad8c56a13fc13e7a0a15a8a1a8b1cb0111a
Successfully built annoy
smart-open 1.6.0 requires bz2file, which is not installed.
distributed 1.21.8 requires msgpack, which is not installed.
Installing collected packages: annoy
Successfully installed annoy-1.13.0
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [21]:
model.wv['someword']


array([ 6.31395504e-02,  1.32039428e-01,  1.98137254e-01, -6.41543150e-01,
       -3.02360356e-01,  5.15950955e-02, -2.51930326e-01, -7.05299899e-03,
        2.32723579e-01, -6.66235341e-03, -2.14652702e-01, -2.47196555e-01,
        1.45640746e-01,  3.48122656e-01,  6.06271505e-01,  8.97369981e-02,
        7.57813603e-02, -8.50206241e-02, -1.97893649e-01, -3.94424856e-01,
        3.59753698e-01, -6.98147357e-01,  1.80305824e-01, -5.21175027e-01,
       -6.20052926e-02, -4.78175551e-01, -4.65428419e-02, -1.77680984e-01,
        3.55405957e-02, -2.69618422e-01, -3.99093367e-02, -7.88689435e-01,
       -2.81666100e-01,  2.89484203e-01,  2.48964489e-01,  1.79778904e-01,
       -1.17163353e-01,  5.55652194e-02, -4.98614879e-03,  1.64480507e-01,
       -8.75537470e-03,  3.27218063e-02, -2.16997966e-01, -2.19280675e-01,
        3.19542527e-01,  6.28549635e-01, -2.93996036e-01,  1.13070235e-01,
       -2.29681745e-01,  1.86282262e-01, -5.71526885e-01,  9.04795766e-01,
       -7.44460523e-02,  
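Vectors like the one above are compared by direction rather than by raw values. The sketch below, written in plain Python with made-up two-dimensional vectors, shows cosine similarity and the "angular" distance that Annoy uses, which equals `sqrt(2 * (1 - cosine))`. It also converts an angular distance threshold into the equivalent minimum cosine similarity. The helper names are hypothetical and not part of gensim or Annoy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_distance(a, b):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(a, b))))


def min_cosine_for_threshold(threshold):
    """Smallest cosine similarity whose angular distance is still below `threshold`."""
    return 1.0 - threshold ** 2 / 2.0


# Made-up example vectors
u = [1.0, 0.0]
v = [0.0, 1.0]
w = [2.0, 0.0]

print(angular_distance(u, w))  # same direction, so the distance is zero
print(angular_distance(u, v))  # orthogonal, so the distance is sqrt(2)
print(min_cosine_for_threshold(0.9))  # about 0.595
```

Under this metric, a distance threshold of 0.9 keeps only pairs whose cosine similarity is above roughly 0.595.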

Now let's build an index file

In [35]:
from __future__ import print_function

import re
# Install from https://github.com/spotify/annoy
from annoy import AnnoyIndex

# Ignore all vectors with distance larger than this:
threshold = 0.9

# This file will contain nearest neighbors, one per line:
# node [tab char] neighbor_1 neighbor_2 ...

out_file = "edges.txt"

# Number of dimensions in the vector space (must match the model's vector size; gensim's default is 100)
dimensions = 100

# How many trees to build for `AnnoyIndex`
max_trees = 50

vocab = list(model.wv.vocab)

word_id = 0
# 'angular' was the implicit default metric; newer Annoy versions ask for it explicitly
word_index = AnnoyIndex(dimensions, 'angular')
words = []

for word in vocab:
    # There are a lot of words with numbers (dates, years), and they are not very interesting to me.
    # There are also ~140K instances of words with non-word characters, so we are going to ignore them
    # as well
    #if re.search('[0-9\W]', word):
    #    continue

    words.append(word)
    vectors = model.wv[word]
    word_index.add_item(word_id, vectors)
    word_id += 1
    
word_index.build(max_trees)

# If you want to save index:
word_index.save('crawl_50_clean.ann')

# If you want to load index:
# u1 = AnnoyIndex(dimensions)
# u1.load('crawl_50_clean.ann')

# naive test:
# result = word_index.get_nns_by_item(words.index('dog'), 42, include_distances=True)
# result = zip(result[0], result[1])
# print([(words[x], dist) for x, dist in result]) # will print the 42 nearest neighbors

with open(out_file, 'w') as out:
    for idx in range(word_id):
        try:
            word = words[idx]
            # Ask Annoy for the 42 nearest neighbors of this word
            result = word_index.get_nns_by_item(idx, 42, include_distances=True)
            pairs = zip(result[0], result[1])
            # Keep neighbors other than the word itself that are closer than the threshold
            edges = [words[pair[0]] for pair in pairs if (words[pair[0]] != word) and (pair[1] < threshold)]
            if len(edges) > 0:
                out.write(word + '\t' + " ".join(edges) + '\n')

            if idx % 10000 == 0:
                print(idx)
        except Exception as e:
            print("Error:", e)
            print(idx)

print("All done")



0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
All done
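Before moving on, it can be useful to sanity-check the generated file. The sketch below parses lines in the `node [tab] neighbor_1 neighbor_2 ...` format described above into a dictionary and counts the edges. The function name `read_edges` and the sample words are hypothetical; the sample string only mimics the layout of edges.txt.

```python
import io


def read_edges(lines):
    """Parse 'node<TAB>neighbor_1 neighbor_2 ...' lines into an adjacency dict."""
    graph = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        node, _, neighbors = line.partition("\t")
        graph[node] = neighbors.split()
    return graph


# Hypothetical sample in the same layout as edges.txt
sample = io.StringIO("dog\tcat puppy\ncat\tdog kitten\n")
graph = read_edges(sample)
edge_count = sum(len(neighbors) for neighbors in graph.values())

print(len(graph), edge_count)
```

For the real file, the same function can be called on an open file handle, for example `with open("edges.txt") as f: graph = read_edges(f)`.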


## Visualisation

Our data is ready. Now we move on to the visualization part. All kudos go to anvaka and his https://github.com/anvaka/pm library.

Run `node edges2graph.js graph-data/edges.txt` (this assumes `edges.txt` has been placed in the `graph-data` folder). This saves the graph in binary format into the `graph-data` folder (`graph-data/labels.json`, `graph-data/links.bin`).

Download and compile the layout tool from https://github.com/anvaka/ngraph.native (works on Linux and macOS).
Then copy the binary into this project and run:

`./layout++ ./graph-data/links.bin`

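You will need to stop it manually (Ctrl + C) after 500-700 iterations. Then move the saved positions file (here `500.bin`) and the graph files into the `graph_data` folder: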

`mv 500.bin ./graph_data/positions.bin`

`mv graph-data/* graph_data/`

Install http-server with `npm install -g http-server`, then start it with `http-server --cors`.

Clone pm from https://github.com/anvaka/pm

In `src/config.js`, set the data URL to the address of the local server, like this:

```
export default {
  dataUrl: '//127.0.0.1:8080/'
};
```

Run `npm install`, then `npm start`.

Open the link http://127.0.0.1:8081/#/galaxy/graph_data?ml=150&s=1.75&l=1

The result is going to be something like this:

![Alt Text](https://elmiles.com/download/2019-06-06%2015-31-43.gif)