# word2vec implementation

* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
* [word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method](https://arxiv.org/abs/1402.3722)

## Overview

The original paper proposed using:
1. Matmul to extract word vectors from ```Win```.
2. Softmax to calculate scores with all the word vectors from ```Wout```.
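
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code used in this notebook: the sizes `V` and `D`, the random weights, and the averaging of the context vectors are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 4                        # hypothetical vocabulary size and vector size
W_in = rng.normal(size=(V, D))     # input-side word vectors (Win)
W_out = rng.normal(size=(D, V))    # output-side word vectors (Wout)

context_ids = np.array([1, 3, 4, 6])   # indices of the context words
one_hot = np.eye(V)[context_ids]       # one-hot rows, shape (4, V)

# 1. Matmul with one-hot rows extracts the context word vectors from Win.
context_vectors = one_hot @ W_in       # shape (4, D)
h = context_vectors.mean(axis=0)       # averaged context vector, shape (D,)

# 2. Softmax over the scores against every word vector in Wout.
scores = h @ W_out                     # shape (V,)
exp = np.exp(scores - scores.max())    # shift by the max for numerical stability
probabilities = exp / exp.sum()        # one probability per vocabulary word
```

Both the matmul and the softmax touch all `V` rows or columns, which is the cost that negative sampling is meant to avoid.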

<img src="image/word2vec_cbow_mechanism.png" align="left"/>

## Negative Sampling

### Motivation

The original implementation was computationally expensive.

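1. Matmul to extract word vectors from ```Win``` multiplies by the whole matrix, although only a few rows are needed.
2. Softmax calculates a score against every word vector in ```Wout```, so its cost grows with the vocabulary size.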

<img src="image/word2vec_negative_sampling_motivation.png" align="left"/>


### Solution

1. Use the word indices to look up the word vectors in ```Win``` directly.
2. Instead of calculating softmax with all the word vectors from ```Wout```, sample a small number (SL) of negative/false word vectors from ```Wout``` and calculate the logistic log loss with each sample.
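
A minimal NumPy sketch of this idea is shown below. The sizes, the random weights, and the uniform choice of negative samples are assumptions for illustration only; the notebook itself samples negatives according to word probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, SL = 8, 4, 3                 # hypothetical vocabulary size, vector size, negative samples
W_in = rng.normal(size=(V, D))     # input-side word vectors (Win)
W_out = rng.normal(size=(V, D))    # output-side word vectors (Wout), stored row-wise here

context_ids = np.array([1, 3, 4, 6])
target_id = 2

# Assumption: negatives are drawn uniformly from every word except the target.
candidates = np.setdiff1d(np.arange(V), [target_id])
negative_ids = rng.choice(candidates, size=SL, replace=False)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1. Index lookup replaces the one-hot matmul.
h = W_in[context_ids].mean(axis=0)             # shape (D,)

# 2. Score only the true target and the SL negative samples.
positive_score = h @ W_out[target_id]          # scalar, label 1
negative_scores = W_out[negative_ids] @ h      # shape (SL,), label 0

# Logistic log loss: -log(sigmoid(x)) for label 1, -log(1 - sigmoid(x)) for label 0.
loss = -np.log(sigmoid(positive_score)) - np.log(sigmoid(-negative_scores)).sum()
```

Only `1 + SL` dot products are computed per target, instead of one per vocabulary word.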

<img src="image/wors2vec_neg_sample_backprop.png" align="left"/>

## Using only one Word vector space W

There is no reasoning or proof given for why two word vector spaces are required. In the end, only one word vector space is used, which appears to be ```Win```. This notebook uses a single word vector space ```W``` to test whether that works.

<img src="image/word2vec_backprop_Be.png" align="left"/>

---
# Setups

In [1]:
import cProfile
import sys
import os
import re
from itertools import islice
from typing import Dict, List
import numpy as np
import tensorflow as tf

np.set_printoptions(threshold=sys.maxsize)
np.set_printoptions(linewidth=400) 

## Setup for Google Colab environment

Colab gets disconnected within approx. 20 minutes, so it is not suitable for training (unless upgraded to the pro version).

### Clone github to Google Drive

In [2]:
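try:
    from google.colab import drive
    IN_GOOGLE_COLAB = True
except ImportError:
    IN_GOOGLE_COLAB = False
    
if IN_GOOGLE_COLAB:
    !pip install line_profiler
    # drive.mount() is a Python call, not a shell command. Mount at
    # /content/drive so that the /content/drive/MyDrive paths below exist.
    drive.mount('/content/drive')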
    !rm -rf /content/drive/MyDrive/github
    !mkdir -p /content/drive/MyDrive/github
    !git clone https://github.com/oonisim/python-programs.git /content/drive/MyDrive/github


### Clone github to local directory

In [3]:
try:
    from google.colab import drive
    IN_GOOGLE_COLAB = True
except ImportError:
    IN_GOOGLE_COLAB = False
    
if IN_GOOGLE_COLAB:
    !pip install line_profiler
    # drive.mount() is a Python call, not a shell command.
    drive.mount('/content/gdrive')
    !rm -rf /content/github
    !mkdir -p /content/github
    !git clone https://github.com/oonisim/python-programs.git /content/github
        
    import sys
    sys.path.append('/content/github/nlp/src')

# Jupyter notebook setups

Autoreload can cause the error below in Jupyter notebooks. If it occurs, restart the Jupyter kernel:
```TypeError: super(type, obj): obj must be an instance or subtype of type```
See:
- https://stackoverflow.com/a/52927102/4281353
- http://thomas-cokelaer.info/blog/2011/09/382/

> The problem resides in the mechanism of reloading modules.
> Reloading a module often changes the internal object in memory which
> makes the isinstance test of super return False.

In [4]:
%load_ext line_profiler
%load_ext autoreload

## Utilities

In [5]:
%autoreload 2

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

import function.fileio as fileio
import function.text as text

---
# Data Types


In [6]:
from common.constant import (
    TYPE_INT,
    TYPE_FLOAT,
    TYPE_LABEL,
    TYPE_TENSOR,
)

# Constants

In [7]:
USE_PTB = True
DEBUG = False
VALIDATION = True

TARGET_SIZE = 1   # Size of the target event (word)
CONTEXT_SIZE = 6  # Size of the context.
WINDOW_SIZE = TARGET_SIZE + CONTEXT_SIZE
SAMPLE_SIZE = 10   # Size of the negative samples
VECTOR_SIZE = 100  # Number of features in the event vector.

---

# Data
## Corpus

In [8]:
corpus = "To be, or not to be, that is the question that matters"
_file = "ptb.train.txt"
if USE_PTB:
    # get_file() downloads the file once and afterwards returns the cached
    # path under ~/.keras/datasets, so path_to_ptb is always defined.
    path_to_ptb = tf.keras.utils.get_file(
        _file, 
        f'https://raw.githubusercontent.com/tomsercu/lstm/master/data/{_file}'
    )
    corpus = fileio.Function.read_file(path_to_ptb)

In [9]:
examples = corpus.split('\n')[:5]
for line in examples:
    print(line)

 aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 
 pierre <unk> N years old will join the board as a nonexecutive director nov. N 
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group 
 rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate 
 a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported 


---
# Event (word) indexing
Index the events that have occurred in the event sequence.

In [10]:
%autoreload 2
from layer.preprocessing import (
    EventIndexing, 
)

In [11]:
word_indexing = EventIndexing(
    name="word_indexing_on_ptb",
    corpus=corpus
)
del corpus

## EventIndexing  for the corpus

EventIndexing adapts to the ```corpus``` and provides:
* the event_to_index dictionary
* the vocabulary of the corpus
* the word occurrence probabilities
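
The three items above can be sketched as below. This is a simplified stand-in, not the `EventIndexing` implementation: the function name `build_index` is hypothetical, and the probabilities are plain relative frequencies, which may differ from what `EventIndexing` computes.

```python
from collections import Counter

import numpy as np

NIL, UNK = "<nil>", "<unk>"   # reserved events at index 0 and 1, as in the listing below

def build_index(corpus: str):
    """Return (vocabulary, event_to_index, probabilities) for a corpus string."""
    words = corpus.lower().split()
    counts = Counter(words)
    # Reserved events first, then the words in order of first appearance.
    vocabulary = [NIL, UNK] + [w for w in dict.fromkeys(words) if w not in (NIL, UNK)]
    event_to_index = {word: index for index, word in enumerate(vocabulary)}
    total = sum(counts.values())
    # Relative frequency of each event; events absent from the corpus get 0.
    probabilities = np.array([counts.get(w, 0) / total for w in vocabulary])
    return vocabulary, event_to_index, probabilities

vocabulary, event_to_index, probabilities = build_index(
    "to be or not to be that is the question"
)
```

In this toy corpus `<nil>` and `<unk>` never occur, so both have probability 0, while `to` and `be` each occur twice out of ten words.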

In [12]:
words = word_indexing.list_events(range(10))
print(f"EventIndexing.vocabulary[10]:\n{words}\n")

indices = word_indexing.list_indices(words)
print(f"EventIndexing.event_to_index[10]:")
for item in zip(words, indices):
    print(item)

probabilities = word_indexing.list_probabilities(words)
print(f"\nEventIndexing.probabilities[10]:")
for word, p in zip(words, probabilities):
    print(f"{word:20s} : {p:.5e}")

EventIndexing.vocabulary[10]:
['<nil>' '<unk>' 'aer' 'banknote' 'berlitz' 'calloway' 'centrust' 'cluett' 'fromstein' 'gitano']

EventIndexing.event_to_index[10]:
('<nil>', 0)
('<unk>', 1)
('aer', 2)
('banknote', 3)
('berlitz', 4)
('calloway', 5)
('centrust', 6)
('cluett', 7)
('fromstein', 8)
('gitano', 9)

EventIndexing.probabilities[10]:
<nil>                : 0.00000e+00
<unk>                : 1.65308e-02
aer                  : 5.34860e-06
banknote             : 5.34860e-06
berlitz              : 5.34860e-06
calloway             : 5.34860e-06
centrust             : 5.34860e-06
cluett               : 5.34860e-06
fromstein            : 5.34860e-06
gitano               : 5.34860e-06


## Sampling using the probability

Sample events according to their probabilities.

In [13]:
sample = word_indexing.sample(size=5)
print(sample)

['subordinate', 'share', 'into', 't', 'sentences']


## Negative Sampling
Sample negative events, excluding the events passed in ```excludes``` (here, those already sampled).

In [14]:
negative_indices = word_indexing.negative_sample_indices(
    size=5, excludes=word_indexing.list_indices(sample)
)
print(f"negative_indices={negative_indices} \nevents={word_indexing.list_events(negative_indices)}")


negative_indices=[1024, 6433, 9445, 268, 44] 
events=['i' 'listening' 'fazio' 'environmental' 'of']
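
One way to sample with exclusions is to zero out the excluded probabilities and renormalise, as in the sketch below. The function here is a hypothetical stand-alone version written for illustration; it is not the method of `EventIndexing`, and its probability vector is made up.

```python
import numpy as np

def negative_sample_indices(probabilities, size, excludes, rng):
    """Draw `size` distinct indices by probability, never returning an excluded index."""
    p = np.array(probabilities, dtype=float)   # copy so the caller's array is untouched
    p[list(excludes)] = 0.0                    # excluded events can no longer be drawn
    p /= p.sum()                               # renormalise to a valid distribution
    return rng.choice(len(p), size=size, replace=False, p=p)

rng = np.random.default_rng(0)
probabilities = np.array([0.0, 0.1, 0.2, 0.3, 0.25, 0.15])   # index 0 plays the role of <nil>
negatives = negative_sample_indices(probabilities, size=2, excludes=[2, 3], rng=rng)
```

Because index 0 has probability 0 and indices 2 and 3 are excluded, the result can only contain indices from 1, 4, and 5.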


## Sentence to Sequence

In [15]:
# sentences = "\n".join(corpus.split('\n')[5:6])
sentences = """
the asbestos fiber <unk> is unusually <unk> once it enters the <unk> 
with even brief exposures to it causing symptoms that show up decades later researchers said
"""
sequences = word_indexing.function(sentences)
for pair in zip(sentences.strip().split(" "), sequences[0]):
    print(f"{pair[0]:15} : {pair[1]:5}")

Sentence is empty. Skipping...
Sentence is empty. Skipping...


the             :    34
asbestos        :    63
fiber           :    86
<unk>           :     1
is              :    42
unusually       :    87
<unk>           :     1
once            :    64
it              :    80
enters          :    88
the             :    34
<unk>           :     1

with           :     0
even            :     0
brief           :     0


---
# EventContext

EventContext layer generates ```(event, context)``` pairs from sequences.

In [16]:
%autoreload 2
from layer.preprocessing import (
    EventContext
)

In [17]:
event_context = EventContext(
    name="ev",
    window_size=WINDOW_SIZE,
    event_size=TARGET_SIZE
)

## Context of an event (word) in a sentence

In the sentence ```"a form of asbestos once used to make kent cigarette filters"```, the context window ```a form of asbestos once``` of size 5 with event size 1 has:
* ```of``` as a target word.
* ```(a, form) and (asbestos, once)``` as its context.

### Sequence of the word indices for the sentence

In [18]:
sentences = """
a form of asbestos once used to make kent cigarette filters

N years old and former chairman of consolidated gold fields plc was named a nonexecutive director
"""

sequence = word_indexing.function(sentences)
sequence

Sentence is empty. Skipping...
Sentence is empty. Skipping...
Sentence is empty. Skipping...


array([[37, 62, 44, 63, 64, 65, 66, 67, 68, 69, 70,  0,  0,  0,  0,  0],
       [29, 30, 31, 50, 51, 43, 44, 52, 53, 54, 55, 56, 57, 37, 38, 39]])

## (event, context) pairs

For each word (event) in the sentence ```(of, asbestos, ... , kent)```, excluding the ends of the sentence, create ```(target, context)``` as:

```
[
  ['of', 'a', 'form', 'asbestos', 'once'],    # target is 'of', context is (a, form, asbestos, once)
  ['asbestos', 'form', 'of', 'once', 'used'],
  ['once', 'of', 'asbestos', 'used', 'to'],
  ...
]
```
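
The pairing can be sketched with a sliding window over the index sequence. This is an assumed, simplified version for a single sentence: the function name is hypothetical, and the handling of `<nil>` padding done by the `EventContext` layer is left out.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def event_context_pairs(sequence, window_size, event_size=1):
    """Return rows of [event..., context...] for every full window in the sequence."""
    windows = sliding_window_view(np.asarray(sequence), window_size)
    start = (window_size - event_size) // 2            # the event sits in the middle
    events = windows[:, start:start + event_size]
    contexts = np.delete(windows, np.s_[start:start + event_size], axis=1)
    return np.hstack([events, contexts])               # event first, then its context

# Indices of "a form of asbestos once used to make" taken from the output above.
sequence = [37, 62, 44, 63, 64, 65, 66, 67]
pairs = event_context_pairs(sequence, window_size=5)
```

The first row is `[44, 37, 62, 63, 64]`, i.e. the target `of` followed by the context `(a, form, asbestos, once)`.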

### Format of the (event, context) pairs

* **E** is the target event indices
* **C** is the context indices

<img src="image/event_context_format.png" align="left"/>

In [19]:
event_context_pairs = event_context.function(sequence)
print(
    f"Event context pairs. Shape %s, Target event size %s, Window size %s." % 
    (event_context_pairs.shape, event_context.event_size, event_context.window_size)
)
event_context_pairs[:10]

Event context pairs. Shape (18, 7), Target event size 1, Window size 7.


array([[63, 37, 62, 44, 64, 65, 66],
       [64, 62, 44, 63, 65, 66, 67],
       [65, 44, 63, 64, 66, 67, 68],
       [66, 63, 64, 65, 67, 68, 69],
       [67, 64, 65, 66, 68, 69, 70],
       [68, 65, 66, 67, 69, 70,  0],
       [69, 66, 67, 68, 70,  0,  0],
       [70, 67, 68, 69,  0,  0,  0],
       [50, 29, 30, 31, 51, 43, 44],
       [51, 30, 31, 50, 43, 44, 52]], dtype=int32)

### (event, context) pairs in textual words

In [20]:
word_indexing.sequence_to_sentence(event_context_pairs[:10])

[['asbestos', 'a', 'form', 'of', 'once', 'used', 'to'],
 ['once', 'form', 'of', 'asbestos', 'used', 'to', 'make'],
 ['used', 'of', 'asbestos', 'once', 'to', 'make', 'kent'],
 ['to', 'asbestos', 'once', 'used', 'make', 'kent', 'cigarette'],
 ['make', 'once', 'used', 'to', 'kent', 'cigarette', 'filters'],
 ['kent', 'used', 'to', 'make', 'cigarette', 'filters', '<nil>'],
 ['cigarette', 'to', 'make', 'kent', 'filters', '<nil>', '<nil>'],
 ['filters', 'make', 'kent', 'cigarette', '<nil>', '<nil>', '<nil>'],
 ['and', 'n', 'years', 'old', 'former', 'chairman', 'of'],
 ['former', 'years', 'old', 'and', 'chairman', 'of', 'consolidated']]

---
# Word Embedding

Embedding trains the model to place similar events in close proximity in the event vector space. If two events, e.g. 'pencil' and 'pen', are similar concepts, then their event vectors reside close together in the event space.

* [Thought Vectors](https://wiki.pathmind.com/thought-vectors)

## Training process

1. Calculate ```Bc```, the BoW (Bag of Words) from context event vectors.
2. Calculate ```Be```, the BoW (Bag of Words) from target event vectors.
3. The dot product ```Ye = dot(Bc, Be)``` is given the label 1 to bring them closer.
4. For each negative sample ```Ws(s)```, the dot product ```Ys = dot(Bc, Ws(s))``` is given the label 0 to push them apart.
5. ```np.c_[Ye, Ys]``` is forwarded to the logistic log loss layer.
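
The score calculation in these steps can be sketched as below, using a single vector space `W`. The sizes, the random data, the uniform negative sampling, and the use of a sum for the BoW are assumptions for the example; the `Embedding` layer may differ in each of these.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 20, 5                  # hypothetical vocabulary size and vector size
N, E, C, SL = 3, 1, 4, 2      # batch size, target size, context size, negative samples
W = rng.normal(size=(V, D))   # the single event vector space

pairs = rng.integers(1, V, size=(N, E + C))      # rows of [target..., context...]
negative_ids = rng.integers(1, V, size=(N, SL))  # assumed: uniform negative samples

Be = W[pairs[:, :E]].sum(axis=1)       # BoW of the target vectors, shape (N, D)
Bc = W[pairs[:, E:]].sum(axis=1)       # BoW of the context vectors, shape (N, D)
Ws = W[negative_ids]                   # negative sample vectors, shape (N, SL, D)

Ye = np.einsum("nd,nd->n", Bc, Be)     # positive scores, shape (N,)
Ys = np.einsum("nd,nsd->ns", Bc, Ws)   # negative scores, shape (N, SL)
scores = np.c_[Ye, Ys]                 # shape (N, 1 + SL), column 0 is the positive score
```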

<img src="image/word2vec_backprop_Be.png" align="left"/>


In [21]:
%autoreload 2
from layer import (
    Embedding
)

In [22]:
embedding: Embedding = Embedding(
    name="embedding",
    num_nodes=WINDOW_SIZE,
    target_size=TARGET_SIZE,
    context_size=CONTEXT_SIZE,
    negative_sample_size=SAMPLE_SIZE,
    event_vector_size=VECTOR_SIZE,
    dictionary=word_indexing
)

### Scores ```np.c_[Ye, Ys]``` from the Embedding

The 0th column holds the scores ```dot(Bc, Be)``` for positive labels. The rest are the scores ```dot(Bc, Ws)``` for negative labels.

In [23]:
scores = embedding.function(event_context_pairs)
print(f"Scores:\n{scores[:5]}\n")
print(f"Scores for dot(Bc, Be): \n{scores[:5, :1]}")

Scores:
[[-11.342428    -2.555201    -3.0864296   -6.2725124    4.2037983   -2.0689092  -18.332573    -2.5910058   -4.976053    -1.3367822   -1.303802  ]
 [ -4.102366     5.4036145    9.27371     -2.9248075   17.596584    -6.2422657   -0.93544555   7.6472836    5.7966843   -6.730744   -12.388437  ]
 [ -2.2152784   10.308252    -0.31621504  -4.530749    -7.2113247   12.009228     8.913546    16.463419    -5.450465    -8.169324    -2.016971  ]
 [  7.261647    -5.6301117    3.9049284    4.096468    -3.1283627    0.78723     -0.6988249   -3.1793845   -1.8274341   -0.26620245   4.6407547 ]
 [ -4.615188    -0.40471506  17.433493    -7.4154363   -3.6721926   14.468176    11.101589     6.032888    -2.7721272    7.5408087    2.2634978 ]]

Scores for dot(Bc, Be): 
[[-11.342428 ]
 [ -4.102366 ]
 [ -2.2152784]
 [  7.261647 ]
 [ -4.615188 ]]


---
# Logistic Log Loss

Train the model to get:
1. BoW of the context event vectors close to the target event vector (label 1).
2. BoW of the context event vectors away from each of the negative sample event vectors (label 0).

This is a binary logistic classification, hence use Logistic Log Loss as the network objective function.
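
For reference, the logistic log loss for a score `x` and label `t` is `-t*log(sigmoid(x)) - (1-t)*log(1-sigmoid(x))`. The sketch below evaluates it in an algebraically equivalent, numerically stable form. It is a stand-alone illustration with a hypothetical function name, not the `sigmoid_cross_entropy_log_loss` function imported below.

```python
import numpy as np

def logistic_log_loss(scores, labels):
    """Element-wise logistic log loss, computed directly from raw scores."""
    x = np.asarray(scores, dtype=float)
    t = np.asarray(labels, dtype=float)
    # Equivalent to -t*log(sigmoid(x)) - (1-t)*log(1-sigmoid(x)),
    # rearranged so that exp() never receives a large positive argument.
    return np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x)))

scores = np.array([[2.0, -1.0, 0.5]])   # column 0: positive score, rest: negative samples
labels = np.array([[1.0, 0.0, 0.0]])    # label 1 for the target, 0 for the negatives
losses = logistic_log_loss(scores, labels)
```

The loss is small when a label-1 score is large and positive, or when a label-0 score is large and negative.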

In [24]:
from common.function import (
    sigmoid_cross_entropy_log_loss,
    sigmoid
)
from layer.objective import (
    CrossEntropyLogLoss
)

In [25]:
loss = CrossEntropyLogLoss(
    name="loss",
    num_nodes=1,  # Logistic log loss
    log_loss_function=sigmoid_cross_entropy_log_loss
)

---
# Adapter

The logistic log loss layer expects input of shape ```(N,M=1)```, but Embedding outputs ```(N,(1+SL))```, where ```SL``` is SAMPLE_SIZE. The ```Adapter``` layer bridges the Embedding and the Loss layers.
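
One plausible way to bridge the shapes is to flatten the scores into a single column and build matching labels, then reshape the gradient back on the way down. The two functions below are hypothetical and only illustrate that idea; the functions returned by `embedding.adapt_function_to_logistic_log_loss` and `embedding.adapt_gradient_to_logistic_log_loss` may work differently.

```python
import numpy as np

def adapt_function(scores):
    """Flatten (N, 1+SL) scores to (N*(1+SL), 1) and build the matching labels."""
    n, m = scores.shape
    labels = np.zeros((n, m), dtype=int)
    labels[:, 0] = 1                      # column 0 holds the positive (label 1) score
    return scores.reshape(-1, 1), labels.reshape(-1, 1)

def adapt_gradient(dY, shape):
    """Reshape the loss gradient (N*(1+SL), 1) back to the Embedding output shape."""
    return dY.reshape(shape)

scores = np.arange(6, dtype=float).reshape(2, 3)   # N=2 rows, SL=2 negative samples
Y, T = adapt_function(scores)
restored = adapt_gradient(Y, scores.shape)
```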


In [26]:
from layer.adapter import (
    Adapter
)

In [27]:
adapter_function = embedding.adapt_function_to_logistic_log_loss(loss=loss)
adapter_gradient = embedding.adapt_gradient_to_logistic_log_loss()

adapter: Adapter = Adapter(
    name="adapter",
    num_nodes=1,    # Number of output M=1 
    function=adapter_function,
    gradient=adapter_gradient
)

---
# Word2vec Network

## Construct a sequential network

$ \text {Sentences} \rightarrow EventIndexing \rightarrow EventContext \rightarrow  Embedding \rightarrow Adapter \rightarrow LogisticLogLoss$

In [28]:
from network.sequential import (
    SequentialNetwork
)

In [29]:
network = SequentialNetwork(
    name="word2vec",
    num_nodes=1,
    inference_layers=[
        word_indexing,
        event_context,
        embedding,
        adapter
    ],
    objective_layers=[
        loss
    ]
)

In [30]:
# File in which the Embedding state is saved and restored by the training cell.
STATE_FILE = "wor2vec_embedding_10MAY2021_Jupyter.pkl"
#embedding.save(STATE_FILE)

## Run training

In [None]:
NUM_SENTENCES = 50
MAX_ITERATIONS = 100000


def sentences_generator(path_to_file, num_sentences):
    stream = fileio.Function.file_line_stream(path_to_file)
    try:
        while True:
            _lines = fileio.Function.take(num_sentences, stream)
            yield np.array(_lines)
    finally:
        stream.close()

# Sentences for the training
path_to_input = path_to_ptb
source = sentences_generator(
    path_to_file=path_to_input, num_sentences=NUM_SENTENCES
)

# Restore the state if exists.
state = embedding.load(STATE_FILE)

# Continue training
profiler = cProfile.Profile()
profiler.enable()

total_sentences = 0
epochs = 0

for i in range(MAX_ITERATIONS):
    try:
        sentences = next(source)
        total_sentences += len(sentences)
        network.train(X=sentences, T=np.array([0]))
        if i % 100 == 0:
            # print(f"Batch {i:05d} of {NUM_SENTENCES} sentences: Loss: {network.history[-1]:15f}")
            print(f"Batch {i:05d}: Loss: {network.history[-1]:15f}")
        if i % 1000 == 0:
            embedding.save(STATE_FILE)
            # print("State saved.")

    except fileio.Function.GenearatorHasNoMore as e:
        # Next epoch
        print(f"epoch {epochs} done")
        epochs += 1
        source.close()
        source = sentences_generator(
            path_to_file=path_to_input, num_sentences=NUM_SENTENCES
        )

    except Exception as e:
        print("Unexpected error:", sys.exc_info()[0])
        source.close()
        raise e

embedding.save(STATE_FILE)

profiler.disable()
profiler.print_stats(sort="cumtime")

Batch 00000: Loss:        3.112713
Batch 00100: Loss:        3.133286
Batch 00200: Loss:        3.085930
Batch 00300: Loss:        3.159912
Batch 00400: Loss:        3.081688
Batch 00500: Loss:        3.074373
Batch 00600: Loss:        3.082561
Batch 00700: Loss:        2.947149
Batch 00800: Loss:        3.010769
epoch 0 done
Batch 00900: Loss:        3.096384
Batch 01000: Loss:        2.915789
Batch 01100: Loss:        3.053586
Batch 01200: Loss:        3.210108
Batch 01300: Loss:        2.984506
Batch 01400: Loss:        3.017680
Batch 01500: Loss:        2.910682
Batch 01600: Loss:        3.035512
epoch 1 done
Batch 01700: Loss:        2.962959
Batch 01800: Loss:        2.983829
Batch 01900: Loss:        2.914031
Batch 02000: Loss:        2.976863
Batch 02100: Loss:        2.835485
Batch 02200: Loss:        2.943722
Batch 02300: Loss:        2.959889
Batch 02400: Loss:        2.929062
Batch 02500: Loss:        2.870975
epoch 2 done
Batch 02600: Loss:        2.839388
Batch 02700: Los

Batch 22500: Loss:        2.415982
Batch 22600: Loss:        2.408726
Batch 22700: Loss:        2.412830
epoch 26 done
Batch 22800: Loss:        2.446558
Batch 22900: Loss:        2.340935
Batch 23000: Loss:        2.503056
Batch 23100: Loss:        2.298322
Batch 23200: Loss:        2.245181
Batch 23300: Loss:        2.398801
Batch 23400: Loss:        2.474218
Batch 23500: Loss:        2.309206
Batch 23600: Loss:        2.333836
epoch 27 done
Batch 23700: Loss:        2.380915
Batch 23800: Loss:        2.342965
Batch 23900: Loss:        2.463509
Batch 24000: Loss:        2.451466
Batch 24100: Loss:        2.429160
Batch 24200: Loss:        2.361946
Batch 24300: Loss:        2.310337
Batch 24400: Loss:        2.314754
epoch 28 done
Batch 24500: Loss:        2.312159
Batch 24600: Loss:        2.323457
Batch 24700: Loss:        2.436345
Batch 24800: Loss:        2.447668
Batch 24900: Loss:        2.407589
Batch 25000: Loss:        2.460996
Batch 25100: Loss:        2.404357
Batch 25200: 

Batch 44900: Loss:        2.185711
Batch 45000: Loss:        2.164877
Batch 45100: Loss:        2.165250
Batch 45200: Loss:        2.272878
Batch 45300: Loss:        2.218867
Batch 45400: Loss:        2.154819
Batch 45500: Loss:        2.150480
epoch 53 done
Batch 45600: Loss:        2.220457
Batch 45700: Loss:        2.219050
Batch 45800: Loss:        2.192997
Batch 45900: Loss:        2.215337
Batch 46000: Loss:        2.193352
Batch 46100: Loss:        2.241782
Batch 46200: Loss:        2.277177
Batch 46300: Loss:        2.261205
epoch 54 done
Batch 46400: Loss:        2.237511
Batch 46500: Loss:        2.078870
Batch 46600: Loss:        2.263288
Batch 46700: Loss:        2.199450
Batch 46800: Loss:        2.106005
Batch 46900: Loss:        2.232303
Batch 47000: Loss:        2.189767
Batch 47100: Loss:        2.189201
Batch 47200: Loss:        2.168115
epoch 55 done
Batch 47300: Loss:        2.154792
Batch 47400: Loss:        2.158733
Batch 47500: Loss:        2.179071
Batch 47600: 

Batch 67300: Loss:        2.028180
Batch 67400: Loss:        2.029108
epoch 79 done
Batch 67500: Loss:        2.052614
Batch 67600: Loss:        2.070246
Batch 67700: Loss:        2.023265
Batch 67800: Loss:        2.072086
Batch 67900: Loss:        2.071011
Batch 68000: Loss:        2.035030
Batch 68100: Loss:        2.078843
Batch 68200: Loss:        2.051834
epoch 80 done
Batch 68300: Loss:        2.060919
Batch 68400: Loss:        2.073912
Batch 68500: Loss:        1.996692
Batch 68600: Loss:        2.012480
Batch 68700: Loss:        2.199805
Batch 68800: Loss:        1.978501
Batch 68900: Loss:        2.168346
Batch 69000: Loss:        1.978974
Batch 69100: Loss:        2.050477
epoch 81 done
Batch 69200: Loss:        2.057402
Batch 69300: Loss:        2.032913
Batch 69400: Loss:        2.007004


---
# Evaluate the vector space

Verify if the trained model, or the vector space W, has encoded the words in a way that **similar** words are close in the vector space.

* [How to measure the similarity among vectors](https://math.stackexchange.com/questions/4132458)
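
Cosine similarity is one common way to measure closeness between word vectors. The sketch below ranks the rows of a small made-up matrix by cosine similarity to a chosen row. It is an illustration only, with a hypothetical function name, and it is not necessarily how `embedding.predict` selects its results.

```python
import numpy as np

def most_similar(W, index, n):
    """Return the indices of the n rows of W most cosine-similar to row `index`."""
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalise each row to length 1
    similarity = unit @ unit[index]                       # cosine similarity to the query row
    similarity[index] = -np.inf                           # never return the query itself
    return np.argsort(-similarity)[:n]                    # highest similarity first

W = np.array([
    [1.0, 0.0],    # query vector
    [0.9, 0.1],    # almost the same direction as the query
    [0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0],   # opposite direction
])
nearest = most_similar(W, index=0, n=2)
```

Here row 1 ranks first and the orthogonal row 2 ranks second, while the opposite vector in row 3 is never selected.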

In [195]:
n = 3
context = "Verify if the trained model, or the vector space W, has encoded the words in a way that similar words are close in the vector space.".split()
word_indices = np.array(word_indexing.list_indices(context), dtype=TYPE_INT)

In [196]:
embedding.predict(word_indices, n)

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([5841, 6830, 8028], dtype=int32)>

In [197]:
word_indexing.list_events([embedding.predict(word_indices, n)])

array(['milan', 'speeds', 'customs'], dtype='<U19')