# word2vec implementation

## Original papers
* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
* [word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method](https://arxiv.org/abs/1402.3722)



## Overview

The original paper proposed using:
1. Matmul to extract word vectors from ```Win```.
2. Softmax to calculate scores with all the word vectors from ```Wout```.

<img src="image/word2vec_cbow_mechanism.png" align="left"/>
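
The two steps above can be sketched in NumPy as a CBOW-style forward pass. This is a minimal illustration, not the code from this repository: the sizes `V` and `D`, the toy `context_ids`, and the random weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 4                              # hypothetical vocabulary size and vector size
Win = rng.normal(0.0, 0.01, (V, D))      # input word vectors
Wout = rng.normal(0.0, 0.01, (V, D))     # output word vectors
context_ids = np.array([1, 3])           # hypothetical context word indices

# 1. Matmul with one-hot rows extracts the context word vectors from Win.
one_hot = np.eye(V)[context_ids]         # shape (2, V)
h = (one_hot @ Win).mean(axis=0)         # shape (D,), mean of the context vectors

# 2. Softmax over the scores against every word vector in Wout.
scores = Wout @ h                        # shape (V,)
exp = np.exp(scores - scores.max())      # subtract the max for numerical stability
probs = exp / exp.sum()                  # one probability per vocabulary word
```

Both steps touch all `V` rows, which is the cost that negative sampling avoids.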

## Negative Sampling

### Motivation

The original formulation is computationally expensive:

1. Matmul with one-hot vectors to extract word vectors from ```Win```.
2. Softmax calculates a score against every word vector in ```Wout```, so its cost grows with the vocabulary size.

<img src="image/word2vec_negative_sampling_motivation.png" align="left"/>


### Solution

1. Use the index to extract word vectors from ```Win```.
2. Instead of calculating softmax with all the word vectors from ```Wout```, sample a small number (```SL```) of negative/false word vectors from ```Wout``` and calculate the logistic log loss with each sample.

<img src="image/wors2vec_neg_sample_backprop.png" align="left"/>
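
A minimal NumPy sketch of the idea, under stated assumptions: the sizes, the toy indices, and the uniform choice of negatives are illustrative only (the notebook samples by word probability via `EventIndexing`).

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, SL = 8, 4, 3                        # hypothetical vocabulary, vector, sample sizes
Win = rng.normal(0.0, 0.01, (V, D))
Wout = rng.normal(0.0, 0.01, (V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

context_ids = np.array([1, 3])            # hypothetical context word indices
target_id = 5                             # hypothetical true (positive) word

# Draw SL negative words, never the target itself (uniformly here, for simplicity).
candidates = np.setdiff1d(np.arange(V), [target_id])
negative_ids = rng.choice(candidates, size=SL, replace=False)

h = Win[context_ids].mean(axis=0)         # index lookup replaces the one-hot matmul
positive_score = Wout[target_id] @ h      # scalar, label 1
negative_scores = Wout[negative_ids] @ h  # shape (SL,), label 0

# Logistic log loss over 1 + SL scores instead of a softmax over V scores.
loss = -np.log(sigmoid(positive_score)) - np.log(1.0 - sigmoid(negative_scores)).sum()
```

Only `1 + SL` rows of ```Wout``` take part in the loss, so the cost no longer depends on the vocabulary size.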

## Using only one Word vector space W

No reasoning or proof is given for why two word vector spaces are required. In the end, only one word vector space is used, which appears to be ```Win```.

However, if we use one vector space ```W``` for ```event```, ```context``` and ```negative samples```, then an event vector ```event=W[i]``` in one sentence can be used as a negative sample in another sentence. The weight ```W[i]``` is then updated for both positive and negative labels in the same gradient descent step on ```W```. The actual [experiment of using only one vector space](./layer/embedding_single_vector_space.py) ```W``` did not work well.

* [Why do we need 2 matrices for word2vec or GloVe](https://datascience.stackexchange.com/a/94422/68313)


<img src="image/word2vec_why_not_one_W.png" align="left" width=800/>

---
# Setups

In [1]:
import cProfile
import sys
import os
import time
import re
from itertools import islice
from typing import Dict, List
import numpy as np
import tensorflow as tf

np.set_printoptions(threshold=sys.maxsize)
np.set_printoptions(linewidth=400) 

## Setup for Google Colab environment

Colab disconnects after approximately 20 minutes, so it is not suitable for training (unless you upgrade to the Pro version).

### Clone github to Google Drive

In [2]:
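try:
    import google.colab
    IN_GOOGLE_COLAB = True
except ImportError:
    IN_GOOGLE_COLAB = False
    
if IN_GOOGLE_COLAB:
    !pip install line_profiler
    # drive.mount is a Python function, not a shell command, so it takes no "!" prefix.
    # Mount at /content/drive so that the MyDrive paths below exist.
    from google.colab import drive
    drive.mount('/content/drive')
    !rm -rf /content/drive/MyDrive/github
    !mkdir -p /content/drive/MyDrive/github
    !git clone https://github.com/oonisim/python-programs.git /content/drive/MyDrive/github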


### Clone github to local directory

In [3]:
try:
    import google.colab
    IN_GOOGLE_COLAB = True
except ImportError:
    IN_GOOGLE_COLAB = False
    
if IN_GOOGLE_COLAB:
    !pip install line_profiler
    # drive.mount is a Python function, not a shell command, so it takes no "!" prefix.
    from google.colab import drive
    drive.mount('/content/gdrive')
    !rm -rf /content/github
    !mkdir -p /content/github
    !git clone https://github.com/oonisim/python-programs.git /content/github
        
    import sys
    sys.path.append('/content/github/nlp/src')

# Jupyter notebook setups

Auto reload can cause an error in Jupyter notebooks. If the following error occurs, restart the Jupyter kernel:
```TypeError: super(type, obj): obj must be an instance or subtype of type```
See:
- https://stackoverflow.com/a/52927102/4281353
- http://thomas-cokelaer.info/blog/2011/09/382/

> The problem resides in the mechanism of reloading modules.
> Reloading a module often changes the internal object in memory which
> makes the isinstance test of super return False.

In [4]:
%load_ext line_profiler
%load_ext autoreload

## Utilities

In [5]:
%autoreload 2

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

import function.fileio as fileio
import function.text as text

---
# Data Types


In [6]:
from common.constant import (
    TYPE_INT,
    TYPE_FLOAT,
    TYPE_LABEL,
    TYPE_TENSOR,
)

# Constants

Too large a learning rate (LR) generates an unusable event vector space.

Uniform weight distribution does not work (Why?)
Weights sampled from a normal distribution with a small std (0.01) work (Why?)

In [7]:
USE_TEXT8 = False
USE_PTB = not USE_TEXT8
USE_CBOW = False
USE_SGRAM = not USE_CBOW

CORPUS_FILE = "text8_256" if USE_TEXT8 else "ptb_train"
CORPUS_URL = "https://data.deepai.org/text8.zip" \
    if USE_TEXT8 else "https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt"

TARGET_SIZE = TYPE_INT(1)       # Size of the target event (word)
CONTEXT_SIZE = TYPE_INT(10)     # Size of the context in which the target event occurs.
WINDOW_SIZE = TARGET_SIZE + CONTEXT_SIZE
SAMPLE_SIZE = TYPE_INT(5)       # Size of the negative samples

VECTOR_SIZE = TYPE_INT(100)     # Number of features in the event vector.
WEIGHT_SCHEME = "normal"
WEIGHT_PARAMS = {
    "std": 0.01
}

LR = TYPE_FLOAT(20)
NUM_SENTENCES = 10

STATE_FILE = "../models/word2vec_sgram_%s_E%s_C%s_S%s_W%s_%s_%s_V%s_LR%s_N%s.pkl" % (
    CORPUS_FILE,
    TARGET_SIZE,
    CONTEXT_SIZE,
    SAMPLE_SIZE,
    WEIGHT_SCHEME,
    "std",
    WEIGHT_PARAMS["std"],
    VECTOR_SIZE,
    LR,
    NUM_SENTENCES,
)

MAX_ITERATIONS = 100000

---

# Data
## Corpus

In [8]:
path_to_corpus = f"~/.keras/datasets/{CORPUS_FILE}"
if fileio.Function.is_file(path_to_corpus):
    pass
else:
    # text8, run "cat text8 | xargs -n 512 > text8_512" after download
    path_to_corpus = tf.keras.utils.get_file(
        fname=CORPUS_FILE,
        origin=CORPUS_URL,
        extract=True
    )
corpus = fileio.Function.read_file(path_to_corpus)
print(path_to_corpus)

/home/oonisim/.keras/datasets/ptb_train


In [9]:
examples = corpus.split('\n')[:1]
for line in examples:
    print(line)

 aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 


---
# Event (word) indexing
Index the events that have occurred in the event sequence.

In [10]:
%autoreload 2
from layer.preprocessing import (
    EventIndexing
)



In [11]:
word_indexing = EventIndexing(
    name="word_indexing",
    corpus=corpus,
    min_sequence_length=WINDOW_SIZE
)
del corpus

## EventIndexing  for the corpus

Adapts to the ```corpus``` and provides:
* event_to_index dictionary
* vocabulary of the corpus
* word occurrence probabilities
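
A minimal sketch of how these three items could be built from a corpus. It is not the `EventIndexing` implementation: the toy corpus, the first-appearance ordering, and the names `toy_corpus`, `vocabulary`, `event_to_index` and `probabilities` are assumptions. Reserving index 0 for `<nil>` and 1 for `<unk>` follows the output shown further down.

```python
from collections import Counter
import numpy as np

toy_corpus = "the cat sat on the mat the end"   # hypothetical toy corpus
tokens = toy_corpus.split()
counts = Counter(tokens)

# Index 0 is reserved for <nil> (padding) and index 1 for <unk>;
# the remaining words are indexed in order of first appearance.
vocabulary = ["<nil>", "<unk>"] + list(dict.fromkeys(tokens))
event_to_index = {word: index for index, word in enumerate(vocabulary)}

# Occurrence probability of each event; <nil> never occurs, so its probability is 0.
frequencies = np.array([counts.get(word, 0) for word in vocabulary], dtype=float)
probabilities = frequencies / frequencies.sum()
```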

In [12]:
words = word_indexing.list_events(range(10))
print(f"EventIndexing.vocabulary[10]:\n{words}\n")

indices = word_indexing.list_indices(words)
print(f"EventIndexing.event_to_index[10]:")
for item in zip(words, indices):
    print(item)

probabilities = word_indexing.list_probabilities(words)
print(f"\nEventIndexing.probabilities[10]:")
for word, p in zip(words, probabilities):
    print(f"{word:20s} : {p:.5e}")

EventIndexing.vocabulary[10]:
['<nil>' '<unk>' 'aer' 'banknote' 'berlitz' 'calloway' 'centrust' 'cluett' 'fromstein' 'gitano']

EventIndexing.event_to_index[10]:
('<nil>', 0)
('<unk>', 1)
('aer', 2)
('banknote', 3)
('berlitz', 4)
('calloway', 5)
('centrust', 6)
('cluett', 7)
('fromstein', 8)
('gitano', 9)

EventIndexing.probabilities[10]:
<nil>                : 0.00000e+00
<unk>                : 1.65308e-02
aer                  : 5.34860e-06
banknote             : 5.34860e-06
berlitz              : 5.34860e-06
calloway             : 5.34860e-06
centrust             : 5.34860e-06
cluett               : 5.34860e-06
fromstein            : 5.34860e-06
gitano               : 5.34860e-06


## Sampling using the probability

Sample events according to their probabilities.

In [13]:
sample = word_indexing.sample(size=5)
print(sample)

['shapiro', 'personal', 'estimated', 'cutbacks', 'growth']


## Negative Sampling
Sample events, excluding those already sampled.

In [14]:
negative_indices = word_indexing.negative_sample_indices(
    size=5, excludes=word_indexing.list_indices(sample)
)
print(f"negative_indices={negative_indices} \nevents={word_indexing.list_events(negative_indices)}")

negative_indices=[9254, 2566, 6284, 2188, 8245] 
events=['azt' 'series' '80s' 'third' 'ralph']
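
One way to sample while excluding given events is to zero their probabilities and renormalize before drawing. The sketch below assumes this approach and sampling without replacement; the function name `sketch_negative_sample_indices` and the toy probabilities are hypothetical, and `EventIndexing.negative_sample_indices` may work differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical occurrence probabilities; index 0 stands for <nil>.
toy_probabilities = np.array([0.0, 0.1, 0.4, 0.2, 0.2, 0.1])

def sketch_negative_sample_indices(probabilities, size, excludes, rng):
    p = probabilities.copy()
    p[np.asarray(excludes, dtype=int)] = 0.0   # excluded events can never be drawn
    p = p / p.sum()                            # renormalize the remaining mass
    return rng.choice(len(p), size=size, replace=False, p=p)

sampled = rng.choice(
    len(toy_probabilities), size=2, replace=False, p=toy_probabilities
)
negatives = sketch_negative_sample_indices(
    toy_probabilities, size=2, excludes=sampled, rng=rng
)
```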


## Sentence to Sequence

In [15]:
# sentences = "\n".join(corpus.split('\n')[5:6])
sentences = """
the asbestos fiber <unk> is unusually <unk> once it enters the <unk> 
"""
sequences = word_indexing.function(sentences)
for pair in zip(sentences.strip().split(" "), sequences[0]):
    print(f"{pair[0]:15} : {pair[1]:5}")

Sentence is empty. Skipping...
Sentence is empty. Skipping...


the             :    34
asbestos        :    63
fiber           :    86
<unk>           :     1
is              :    42
unusually       :    87
<unk>           :     1
once            :    64
it              :    80
enters          :    88
the             :    34
<unk>           :     1


---
# EventContext

EventContext layer generates ```(event, context)``` pairs from sequences.

In [16]:
%autoreload 2
from layer.preprocessing import (
    EventContext
)

In [17]:
event_context = EventContext(
    name="ev",
    window_size=WINDOW_SIZE,
    event_size=TARGET_SIZE
)

## Context of an event (word) in a sentence

In the sentence ```"a form of asbestos once used to make kent cigarette filters"```, the context window ```a form of asbestos once``` of size 5 with event size 1 has:
* ```of``` as the target word.
* ```(a, form)``` and ```(asbestos, once)``` as its context.

### Sequence of the word indices for the sentence

In [18]:
sentences = """
a form of asbestos once used to make kent cigarette filters

N years old and former chairman of consolidated gold 
"""

sequence = word_indexing.function(sentences)
sequence

Sentence is empty. Skipping...
Sentence is empty. Skipping...
Sentence is empty. Skipping...


array([[37, 62, 44, 63, 64, 65, 66, 67, 68, 69, 70],
       [29, 30, 31, 50, 51, 43, 44, 52, 53,  0,  0]])

## (event, context) pairs

For each word (event) in the sentence ```(of, asbestos, ... , kent)```, excluding the ends of the sentence, create ```(target, context)``` as:

```
[
  ['of', 'a', 'form', 'asbestos', 'once'],    # target is 'of', context is (a, form, asbestos, once)
  ['asbestos', 'form', 'of', 'once', 'used'],
  ['once', 'of', 'asbestos', 'used', 'to'],
  ...
]
```
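
The pair generation above can be sketched with a sliding window. This is an illustration under assumptions, not the `EventContext` code: the helper name `make_event_context_pairs` is hypothetical, the event is taken from the middle of each window, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np

def make_event_context_pairs(sequence, window_size, event_size=1):
    """Return rows of [event..., context...] for every full window."""
    windows = np.lib.stride_tricks.sliding_window_view(
        np.asarray(sequence), window_size
    )
    start = (window_size - event_size) // 2    # event sits in the middle of the window
    events = windows[:, start:start + event_size]
    contexts = np.concatenate(
        [windows[:, :start], windows[:, start + event_size:]], axis=1
    )
    return np.concatenate([events, contexts], axis=1)

words = "a form of asbestos once used to make kent cigarette filters".split()
pairs = make_event_context_pairs(words, window_size=5)
print(pairs[:3])
```

With 11 words and a window of 5 there are 7 windows, so the targets run from `of` to `kent`.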

### Format of the (event, context) pairs

* **E** is the target event indices
* **C** is the context indices

<img src="image/event_context_format.png" align="left"/>

In [19]:
event_context_pairs = event_context.function(sequence)
print(
    f"Event context pairs. Shape %s, Target event size %s, Window size %s." % 
    (event_context_pairs.shape, event_context.event_size, event_context.window_size)
)
event_context_pairs[:10]

Event context pairs. Shape (2, 11), Target event size 1, Window size 11.


array([[65, 37, 62, 44, 63, 64, 66, 67, 68, 69, 70],
       [43, 29, 30, 31, 50, 51, 44, 52, 53,  0,  0]], dtype=int32)

### (event, context) pairs in textual words

In [20]:
word_indexing.sequence_to_sentence(event_context_pairs[:10])

[['used',
  'a',
  'form',
  'of',
  'asbestos',
  'once',
  'to',
  'make',
  'kent',
  'cigarette',
  'filters'],
 ['chairman',
  'n',
  'years',
  'old',
  'and',
  'former',
  'of',
  'consolidated',
  'gold',
  '<nil>',
  '<nil>']]

---
# Word Embedding

Embedding trains the model to group similar events in close proximity in the event vector space. If two events, e.g. 'pencil' and 'pen', are similar concepts, then their event vectors reside close to each other in the event space.

* [Thought Vectors](https://wiki.pathmind.com/thought-vectors)

## Training process

```EL``` is the number (length) of the event words used as the targets. The number of event words can be more than 1. ```CL``` is the number of context words surrounding the event words. ```SL``` is the number of words to use for **negative sampling**.

For instance, in **know why you say good bye and I say hello**, if ```(EL=2,CL=8)```, then the event words are ```(good, bye)``` and the context words are ```(know, why, you, say, and, I, say, hello)```. ```Be``` is generated as the mean of the event word embedding vectors.

1. Calculate ```Bc```, the BoW (Bag of Words) from context event vectors.
2. Calculate ```Be```,  the BoW (Bag of Words) from target event vectors.
3. The dot product ```Ye = dot(Bc, Be)``` is given the label 1 to get them closer.
4. For each negative sample ```Ws(s)```, the dot product ```Ys = dot(Be, Ws(s))``` is given the label 0 to push them apart.
5. ```np.c_[Ye, Ys]``` is forwarded to the logistic log loss layer.
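
The five steps can be sketched in NumPy as follows. The random arrays stand in for vectors already looked up from the embedding weights, the sizes are hypothetical, and taking ```Bc``` as a mean (like ```Be```) is an assumption; this is not the `Embedding` layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, EL, CL, SL, D = 3, 2, 8, 5, 4   # hypothetical batch, event, context, sample, vector sizes

# Stand-ins for vectors already looked up from the embedding weights.
event_vectors = rng.normal(0.0, 0.01, (N, EL, D))
context_vectors = rng.normal(0.0, 0.01, (N, CL, D))
Ws = rng.normal(0.0, 0.01, (N, SL, D))     # negative sample vectors

Bc = context_vectors.mean(axis=1)          # 1. BoW of the context vectors, (N, D)
Be = event_vectors.mean(axis=1)            # 2. BoW of the target vectors, (N, D)
Ye = np.einsum("nd,nd->n", Bc, Be)         # 3. positive scores (label 1), (N,)
Ys = np.einsum("nd,nsd->ns", Be, Ws)       # 4. negative scores (label 0), (N, SL)
scores = np.c_[Ye, Ys]                     # 5. (N, 1+SL), forwarded to the loss
```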

<img src="image/word2vec_backprop_Be.png" align="left"/>


In [21]:
# Word2vec has two approaches, 1. CBOW and 2. Skipgram.
%autoreload 2
if USE_CBOW:
    # CBOW
    from layer.embedding_cbow_dual_vector_spaces import (
        Embedding
    )
else:
    # Skipgram
    from layer.embedding_sgram import (
        Embedding
    )

from optimizer import (
    SGD
)

In [22]:
embedding: Embedding = Embedding(
    name="embedding",
    num_nodes=WINDOW_SIZE,
    target_size=TARGET_SIZE,
    context_size=CONTEXT_SIZE,
    negative_sample_size=SAMPLE_SIZE,
    event_vector_size=VECTOR_SIZE,
    optimizer=SGD(lr=LR),
    dictionary=word_indexing,
    weight_initialization_scheme=WEIGHT_SCHEME,
    weight_initialization_parameters=WEIGHT_PARAMS
)

### Scores ```np.c_[Ye, Ys]``` from the Embedding

The 0th column holds the scores ```dot(Bc, Be)``` for positive labels. The remaining columns hold the scores ```dot(Bc, Ws)``` for negative labels.

The idea is that the vectors ```(Bc, Be)``` should be close (cosine similarity -> 1, or a large positive dot product), and the vectors ```(Be, Bs)``` should be far apart.

In [23]:
scores = embedding.function(event_context_pairs)
print(f"Scores:\n{scores[:5]}\n")
print(f"Scores for dot(Bc, Be): \n{scores[:5, :1]}")

Scores:
[[-2.4349254e-04 -2.1856527e-03 -8.7675056e-04 -9.7383186e-04  2.3010063e-03  1.2452344e-03  1.2126924e-03 -2.4789597e-03  7.9648133e-04 -2.6054615e-03  4.3943583e-04  4.5977812e-04  1.3538190e-03 -1.0263928e-03  1.1331388e-03]
 [ 1.0284438e-03  1.5731137e-03  2.7410622e-04  8.6363230e-04  2.6896014e-04 -1.6666343e-04 -5.3010433e-04 -1.1244978e-04  0.0000000e+00  0.0000000e+00 -6.9077860e-04 -1.6753969e-05  5.4561766e-04  9.2070817e-04 -8.8071707e-04]]

Scores for dot(Bc, Be): 
[[-0.00024349]
 [ 0.00102844]]


---
# Batch Normalization

In [24]:
from layer.normalization import (
    BatchNormalization
)

In [25]:
bn = BatchNormalization(
    name="bn",
    num_nodes=1+SAMPLE_SIZE
)

---
# Logistic Log Loss

Train the model to get:
1. BoW of the context event vectors close to the target event vector. Label 1.
2. BoW of the context event vectors away from each of the negative sample event vectors. Label 0.

This is a binary logistic classification, hence use Logistic Log Loss as the network objective function.
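
A minimal sketch of the logistic log loss on a score matrix whose 0th column is the positive score. The function names are hypothetical and this is not the `sigmoid_cross_entropy_log_loss` from `common.function`.

```python
import numpy as np

def sketch_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sketch_logistic_log_loss(scores):
    """Mean binary cross entropy: column 0 has label 1, the rest label 0."""
    labels = np.zeros_like(scores)
    labels[:, 0] = 1.0
    p = sketch_sigmoid(scores)
    eps = 1e-12                          # guards against log(0)
    losses = -(labels * np.log(p + eps) + (1.0 - labels) * np.log(1.0 - p + eps))
    return losses.mean()

# With all scores at zero, every sigmoid is 0.5 and the loss is ln(2).
initial_loss = sketch_logistic_log_loss(np.zeros((2, 6)))
```

Because the weights are initialized with a small std, the initial scores are close to zero, which is consistent with the first reported training loss of about 0.693 (ln 2).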

In [26]:
from common.function import (
    sigmoid_cross_entropy_log_loss,
    sigmoid
)
from layer.objective import (
    CrossEntropyLogLoss
)

In [27]:
loss = CrossEntropyLogLoss(
    name="loss",
    num_nodes=1,  # Logistic log loss
    log_loss_function=sigmoid_cross_entropy_log_loss
)

---
# Adapter

The logistic log loss layer expects input of shape ```(N,M=1)```, but Embedding outputs ```(N,(1+SL))``` where ```SL``` is SAMPLE_SIZE. The ```Adapter``` layer bridges the Embedding and the Loss layers.
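
One plausible way to bridge the shapes is a reshape in the forward pass and the inverse reshape in the backward pass. The sketch below is an assumption about what such an adapter does; the functions returned by `adapt_function_to_logistic_log_loss` and `adapt_gradient_to_logistic_log_loss` may differ.

```python
import numpy as np

N, SL = 2, 5                                    # hypothetical batch and sample sizes
toy_scores = np.arange(N * (1 + SL), dtype=float).reshape(N, 1 + SL)

# Forward: flatten (N, 1+SL) scores into (N*(1+SL), 1) rows for the loss layer,
# with matching labels: 1 for the 0th column, 0 for the negative samples.
Y = toy_scores.reshape(-1, 1)
T = np.zeros((N, 1 + SL), dtype=int)
T[:, 0] = 1
T = T.reshape(-1, 1)

# Backward: reshape the gradient from the loss back to the Embedding output shape.
dY = np.ones_like(Y)                            # stand-in for the loss gradient
dscores = dY.reshape(N, 1 + SL)
```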


In [28]:
from layer.adapter import (
    Adapter
)

In [29]:
adapter_function = embedding.adapt_function_to_logistic_log_loss(loss=loss)
adapter_gradient = embedding.adapt_gradient_to_logistic_log_loss()

adapter: Adapter = Adapter(
    name="adapter",
    num_nodes=1,    # Number of output M=1 
    function=adapter_function,
    gradient=adapter_gradient
)

---
# Word2vec Network

## Construct a sequential network

$ \text {Sentences} \rightarrow EventIndexing \rightarrow EventContext \rightarrow  Embedding \rightarrow Adapter \rightarrow LogisticLogLoss$

In [30]:
from network.sequential import (
    SequentialNetwork
)

In [31]:
network = SequentialNetwork(
    name="word2vec",
    num_nodes=1,
    inference_layers=[
        word_indexing,
        event_context,
        embedding,
        adapter
    ],
    objective_layers=[
        loss
    ]
)

## Run training

### Load the saved model file if it exists

In [32]:
if fileio.Function.is_file(STATE_FILE):
    print("Loading model...\nSTATE_FILE: %s" % STATE_FILE)
    state = embedding.load(STATE_FILE)

    fmt="""Model loaded.
    event_size %s
    context_size: %s
    event_vector_size: %s
    """
    print(fmt % (
        state["target_size"], 
        state["context_size"], 
        state["event_vector_size"]
    ))
else:
    print("State file does not exist. Saving the initial model to %s." % STATE_FILE)
    embedding.save(STATE_FILE)

State file does not exist. Saving the initial model to ../models/word2vec_sgram_ptb_train_E1_C10_S5_Wnormal_std_0.01_V100_LR20.0_N10.pkl.


### Continue training

In [33]:
def sentences_generator(path_to_file, num_sentences):
    stream = fileio.Function.file_line_stream(path_to_file)
    try:
        while True:
            _lines = fileio.Function.take(num_sentences, stream)
            yield np.array(_lines)
    finally:
        stream.close()

# Sentences for the training
source = sentences_generator(
    path_to_file=path_to_corpus, num_sentences=NUM_SENTENCES
)

# Restore the state if it exists.
state = embedding.load(STATE_FILE)

# Continue training
profiler = cProfile.Profile()
profiler.enable()

total_sentences = 0
epochs = 0

for i in range(MAX_ITERATIONS):
    try:
        sentences = next(source)
        total_sentences += len(sentences)

        start = time.time()
        network.train(X=sentences, T=np.array([0]))

        if i % 100 == 0:
            print(
                f"Batch {i:05d} of {NUM_SENTENCES} sentences: "
                f"Average Loss: {np.mean(network.history):10f} "
                f"Duration {time.time() - start:3f}"
            )
        if i % 1000 == 0:
            # embedding.save(STATE_FILE)
            pass
        
    except fileio.Function.GenearatorHasNoMore as e:
        source.close()
        embedding.save(STATE_FILE)

        # Next epoch
        print(f"epoch {epochs} batches {i:05d} done")
        epochs += 1
        source = sentences_generator(
            path_to_file=path_to_corpus, num_sentences=NUM_SENTENCES
        )

    except Exception as e:
        print("Unexpected error:", sys.exc_info()[0])
        source.close()
        raise e

embedding.save(STATE_FILE)

profiler.disable()
profiler.print_stats(sort="cumtime")

Batch 00000 of 10 sentences: Average Loss:   0.693131 Duration 0.285150
Batch 00100 of 10 sentences: Average Loss:   0.667208 Duration 0.245759
Batch 00200 of 10 sentences: Average Loss:   0.634716 Duration 0.089798
Batch 00300 of 10 sentences: Average Loss:   0.611182 Duration 0.200067
Batch 00400 of 10 sentences: Average Loss:   0.593024 Duration 0.223902
Batch 00500 of 10 sentences: Average Loss:   0.579920 Duration 0.279604
Batch 00600 of 10 sentences: Average Loss:   0.571002 Duration 0.216216
Batch 00700 of 10 sentences: Average Loss:   0.561259 Duration 0.336020
Batch 00800 of 10 sentences: Average Loss:   0.553482 Duration 0.352396
Batch 00900 of 10 sentences: Average Loss:   0.547468 Duration 0.237554
Batch 01000 of 10 sentences: Average Loss:   0.541967 Duration 0.271538
Batch 01100 of 10 sentences: Average Loss:   0.537164 Duration 0.265858
Batch 01200 of 10 sentences: Average Loss:   0.532819 Duration 0.206694
Batch 01300 of 10 sentences: Average Loss:   0.528261 Duration 0

Batch 11400 of 10 sentences: Average Loss:   0.437829 Duration 0.175656
Batch 11500 of 10 sentences: Average Loss:   0.437598 Duration 0.131591
Batch 11600 of 10 sentences: Average Loss:   0.437322 Duration 0.162459
Batch 11700 of 10 sentences: Average Loss:   0.437063 Duration 0.133105
Batch 11800 of 10 sentences: Average Loss:   0.436793 Duration 0.139187
Batch 11900 of 10 sentences: Average Loss:   0.436575 Duration 0.187415
Batch 12000 of 10 sentences: Average Loss:   0.436337 Duration 0.224208
Batch 12100 of 10 sentences: Average Loss:   0.436142 Duration 0.176519
Batch 12200 of 10 sentences: Average Loss:   0.435892 Duration 0.107115
Batch 12300 of 10 sentences: Average Loss:   0.435579 Duration 0.125227
Batch 12400 of 10 sentences: Average Loss:   0.435288 Duration 0.170385
Batch 12500 of 10 sentences: Average Loss:   0.435058 Duration 0.171922
Batch 12600 of 10 sentences: Average Loss:   0.434845 Duration 0.199426
epoch 2 batches 12623 done
Batch 12700 of 10 sentences: Average 

Batch 22700 of 10 sentences: Average Loss:   0.419879 Duration 0.116651
Batch 22800 of 10 sentences: Average Loss:   0.419806 Duration 0.187972
Batch 22900 of 10 sentences: Average Loss:   0.419701 Duration 0.201726
Batch 23000 of 10 sentences: Average Loss:   0.419586 Duration 0.218563
Batch 23100 of 10 sentences: Average Loss:   0.419477 Duration 0.170224
Batch 23200 of 10 sentences: Average Loss:   0.419366 Duration 0.218204
Batch 23300 of 10 sentences: Average Loss:   0.419227 Duration 0.131817
Batch 23400 of 10 sentences: Average Loss:   0.419072 Duration 0.113211
Batch 23500 of 10 sentences: Average Loss:   0.418984 Duration 0.149859
Batch 23600 of 10 sentences: Average Loss:   0.418870 Duration 0.250694
Batch 23700 of 10 sentences: Average Loss:   0.418761 Duration 0.159725
Batch 23800 of 10 sentences: Average Loss:   0.418676 Duration 0.163515
Batch 23900 of 10 sentences: Average Loss:   0.418580 Duration 0.191605
Batch 24000 of 10 sentences: Average Loss:   0.418475 Duration 0

Batch 34000 of 10 sentences: Average Loss:   0.410302 Duration 0.176282
Batch 34100 of 10 sentences: Average Loss:   0.410217 Duration 0.144981
Batch 34200 of 10 sentences: Average Loss:   0.410171 Duration 0.198678
Batch 34300 of 10 sentences: Average Loss:   0.410124 Duration 0.168010
Batch 34400 of 10 sentences: Average Loss:   0.410039 Duration 0.149536
Batch 34500 of 10 sentences: Average Loss:   0.409972 Duration 0.221091
Batch 34600 of 10 sentences: Average Loss:   0.409921 Duration 0.282004
Batch 34700 of 10 sentences: Average Loss:   0.409864 Duration 0.197598
Batch 34800 of 10 sentences: Average Loss:   0.409832 Duration 0.149493
Batch 34900 of 10 sentences: Average Loss:   0.409786 Duration 0.169722
Batch 35000 of 10 sentences: Average Loss:   0.409716 Duration 0.180265
Batch 35100 of 10 sentences: Average Loss:   0.409637 Duration 0.252486
Batch 35200 of 10 sentences: Average Loss:   0.409599 Duration 0.147384
Batch 35300 of 10 sentences: Average Loss:   0.409530 Duration 0

Batch 45400 of 10 sentences: Average Loss:   0.403682 Duration 0.210803
Batch 45500 of 10 sentences: Average Loss:   0.403645 Duration 0.189276
Batch 45600 of 10 sentences: Average Loss:   0.403578 Duration 0.232171
Batch 45700 of 10 sentences: Average Loss:   0.403527 Duration 0.224334
Batch 45800 of 10 sentences: Average Loss:   0.403489 Duration 0.210731
Batch 45900 of 10 sentences: Average Loss:   0.403418 Duration 0.215115
Batch 46000 of 10 sentences: Average Loss:   0.403353 Duration 0.154949
Batch 46100 of 10 sentences: Average Loss:   0.403277 Duration 0.187279
Batch 46200 of 10 sentences: Average Loss:   0.403223 Duration 0.149619
epoch 10 batches 46287 done
Batch 46300 of 10 sentences: Average Loss:   0.403178 Duration 0.190695
Batch 46400 of 10 sentences: Average Loss:   0.403141 Duration 0.174913
Batch 46500 of 10 sentences: Average Loss:   0.403101 Duration 0.116006
Batch 46600 of 10 sentences: Average Loss:   0.403044 Duration 0.218864
Batch 46700 of 10 sentences: Average

Batch 56700 of 10 sentences: Average Loss:   0.398509 Duration 0.148686
Batch 56800 of 10 sentences: Average Loss:   0.398470 Duration 0.295860
Batch 56900 of 10 sentences: Average Loss:   0.398411 Duration 0.207232
Batch 57000 of 10 sentences: Average Loss:   0.398350 Duration 0.145264
Batch 57100 of 10 sentences: Average Loss:   0.398317 Duration 0.221088
Batch 57200 of 10 sentences: Average Loss:   0.398267 Duration 0.168620
Batch 57300 of 10 sentences: Average Loss:   0.398213 Duration 0.202826
Batch 57400 of 10 sentences: Average Loss:   0.398179 Duration 0.150028
Batch 57500 of 10 sentences: Average Loss:   0.398140 Duration 0.281765
Batch 57600 of 10 sentences: Average Loss:   0.398109 Duration 0.180932
Batch 57700 of 10 sentences: Average Loss:   0.398059 Duration 0.197020
Batch 57800 of 10 sentences: Average Loss:   0.398033 Duration 0.254252
Batch 57900 of 10 sentences: Average Loss:   0.397993 Duration 0.260480
Batch 58000 of 10 sentences: Average Loss:   0.397949 Duration 0

Batch 68000 of 10 sentences: Average Loss:   0.394116 Duration 0.262549
Batch 68100 of 10 sentences: Average Loss:   0.394069 Duration 0.129192
Batch 68200 of 10 sentences: Average Loss:   0.394047 Duration 0.245481
Batch 68300 of 10 sentences: Average Loss:   0.394012 Duration 0.215628
Batch 68400 of 10 sentences: Average Loss:   0.393990 Duration 0.202138
Batch 68500 of 10 sentences: Average Loss:   0.393966 Duration 0.255367
Batch 68600 of 10 sentences: Average Loss:   0.393942 Duration 0.225496
Batch 68700 of 10 sentences: Average Loss:   0.393895 Duration 0.205198
Batch 68800 of 10 sentences: Average Loss:   0.393864 Duration 0.233027
Batch 68900 of 10 sentences: Average Loss:   0.393839 Duration 0.186669
Batch 69000 of 10 sentences: Average Loss:   0.393804 Duration 0.200268
Batch 69100 of 10 sentences: Average Loss:   0.393776 Duration 0.190370
Batch 69200 of 10 sentences: Average Loss:   0.393740 Duration 0.159379
Batch 69300 of 10 sentences: Average Loss:   0.393704 Duration 0

Batch 79400 of 10 sentences: Average Loss:   0.390331 Duration 0.228678
Batch 79500 of 10 sentences: Average Loss:   0.390301 Duration 0.248209
Batch 79600 of 10 sentences: Average Loss:   0.390259 Duration 0.160702
Batch 79700 of 10 sentences: Average Loss:   0.390220 Duration 0.233731
Batch 79800 of 10 sentences: Average Loss:   0.390174 Duration 0.223936
Batch 79900 of 10 sentences: Average Loss:   0.390145 Duration 0.236949
epoch 18 batches 79951 done
Batch 80000 of 10 sentences: Average Loss:   0.390116 Duration 0.267089
Batch 80100 of 10 sentences: Average Loss:   0.390089 Duration 0.236761
Batch 80200 of 10 sentences: Average Loss:   0.390070 Duration 0.223507
Batch 80300 of 10 sentences: Average Loss:   0.390027 Duration 0.189871
Batch 80400 of 10 sentences: Average Loss:   0.389990 Duration 0.207526
Batch 80500 of 10 sentences: Average Loss:   0.389970 Duration 0.199607
Batch 80600 of 10 sentences: Average Loss:   0.389946 Duration 0.207347
Batch 80700 of 10 sentences: Average

Batch 90700 of 10 sentences: Average Loss:   0.387026 Duration 0.239290
Batch 90800 of 10 sentences: Average Loss:   0.387004 Duration 0.130527
Batch 90900 of 10 sentences: Average Loss:   0.386973 Duration 0.200252
Batch 91000 of 10 sentences: Average Loss:   0.386939 Duration 0.204631
Batch 91100 of 10 sentences: Average Loss:   0.386917 Duration 0.152196
Batch 91200 of 10 sentences: Average Loss:   0.386894 Duration 0.153647
Batch 91300 of 10 sentences: Average Loss:   0.386866 Duration 0.167943
Batch 91400 of 10 sentences: Average Loss:   0.386839 Duration 0.222885
Batch 91500 of 10 sentences: Average Loss:   0.386815 Duration 0.185552
Batch 91600 of 10 sentences: Average Loss:   0.386788 Duration 0.210121
Batch 91700 of 10 sentences: Average Loss:   0.386758 Duration 0.230412
Batch 91800 of 10 sentences: Average Loss:   0.386736 Duration 0.184718
Batch 91900 of 10 sentences: Average Loss:   0.386701 Duration 0.175697
Batch 92000 of 10 sentences: Average Loss:   0.386673 Duration 0

 16614639   15.623    0.000   15.623    0.000 fromnumeric.py:71(<dictcomp>)
  1999540   14.584    0.000   14.584    0.000 {built-in method numpy.arange}
   299931    0.379    0.000   14.571    0.000 array_ops.py:805(rank)
 15814777   14.443    0.000   14.443    0.000 {method 'tolist' of 'numpy.ndarray' objects}
   299931    1.111    0.000   14.361    0.000 gen_math_ops.py:3181(equal)
   299931    2.986    0.000   14.191    0.000 array_ops.py:841(rank_internal)
  3899103   13.593    0.000   14.068    0.000 dtypes.py:604(as_dtype)
   199954    0.462    0.000   13.596    0.000 <__array_function__ internals>:2(isin)
  4398988   13.331    0.000   13.331    0.000 {method '_shape_tuple' of 'tensorflow.python.framework.ops.EagerTensor' objects}
 17672588    9.464    0.000   13.226    0.000 arraysetops.py:125(_unpack_tuple)
  1399678    2.746    0.000   12.990    0.000 fromnumeric.py:199(reshape)
   999724    2.205    0.000   12.967    0.000 numerictypes.py:599(find_common_type)
   199954    1.

  1199724    0.500    0.000    0.500    0.000 fromnumeric.py:2350(_all_dispatcher)
   799816    0.476    0.000    0.476    0.000 tensor_shape.py:858(__bool__)
  1399678    0.471    0.000    0.471    0.000 {method 'pop' of 'dict' objects}
   699839    0.435    0.000    0.435    0.000 {method 'replace' of 'str' objects}
  1999448    0.432    0.000    0.432    0.000 {method 'strip' of 'str' objects}
  1999448    0.429    0.000    0.429    0.000 fromnumeric.py:3195(_around_dispatcher)
   199954    0.255    0.000    0.424    0.000 composite.py:141(T)
   100025    0.394    0.000    0.394    0.000 {method 'join' of 'str' objects}
  1099701    0.384    0.000    0.384    0.000 preprocess.py:491(window_size)
   299931    0.368    0.000    0.368    0.000 base.py:63(lr)
  1099701    0.346    0.000    0.346    0.000 preprocess.py:503(event_size)
    99977    0.163    0.000    0.337    0.000 fromnumeric.py:1439(squeeze)
    99977    0.235    0.000    0.325    0.000 objective.py:156(J)
    99977    0

---
# Evaluate the vector space

Verify if the trained model, or the vector space W, has encoded the words in a way that **similar** words are close in the vector space.

* [How to measure the similarity among vectors](https://math.stackexchange.com/questions/4132458)
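
Cosine similarity is one common measure of closeness. The sketch below ranks the rows of a toy weight matrix by cosine similarity to a query row; the helper `sketch_most_similar` and the toy matrix are hypothetical, and `embedding.predict` may use a different measure.

```python
import numpy as np

def sketch_most_similar(W, index, n):
    """Indices of the n rows of W closest to W[index] by cosine similarity."""
    norms = np.linalg.norm(W, axis=1)
    norms[norms == 0] = 1.0                     # avoid division by zero for empty rows
    similarities = (W @ W[index]) / (norms * norms[index])
    similarities[index] = -np.inf               # never return the query itself
    return np.argsort(-similarities)[:n]

W_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
nearest = sketch_most_similar(W_toy, index=0, n=2)
print(nearest)
```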

In [34]:
n = 10
context = "cash".split()
word_indices = np.array(word_indexing.list_indices(context), dtype=TYPE_INT)

print(f"Words {context}")
print(f"Word indices {word_indices}")
print(f"prediction for {context}:\n{word_indexing.list_events([embedding.predict(word_indices, n)])}")

Words ['cash']
Word indices [413]
prediction for ['cash']:
[['dividends' 'amount' 'exceed' 'flow' 'unpaid' 'depreciation' 'emhart' 'ordinary' 'borrow' 'riskier']
 ['unpaid' 'emhart' 'flow' 'refunds' 'accumulated' 'earmarked' 'financings' 'exceed' 'depreciation' 'riskier']]




---
## Compare with [gensim word2vec](https://radimrehurek.com/gensim/models/word2vec.html)

In [35]:
from gensim.models import (
    Word2Vec
)
from gensim.models.word2vec import (
    LineSentence    
)

In [36]:
sentences = LineSentence(source=path_to_corpus)
w2v = Word2Vec(
    sentences=sentences, 
    sg=0,
    window=5, 
    negative=5,
    vector_size=100, 
    min_count=1, 
    workers=4
)
del sentences

In [37]:
w2v.wv.most_similar(context, topn=n)

[('amount', 0.8938570618629456),
 ('debt', 0.8765637874603271),
 ('value', 0.8724021911621094),
 ('payment', 0.8634955286979675),
 ('proceeds', 0.856279730796814),
 ('assets', 0.8465060591697693),
 ('payments', 0.8203557133674622),
 ('dividends', 0.8193185925483704),
 ('dividend', 0.818557858467102),
 ('face', 0.8156279921531677)]