## Original Lecture Note Link
Motivation and explanation of word2vec: [Stanford NLP Slides](http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf)

Original Mikolov et al. papers
- [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546.pdf)
- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

A simple explanation of the skip-gram model: [blog](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)

On Softmax, hierarchical softmax, negative sampling and NCE
- [On word embeddings - Part 2: Approximating the Softmax](http://ruder.io/word-embeddings-softmax/)
- [Intuition behind NCE (Noise Contrastive Estimation) for word embeddings](https://twitter.com/MShahriariNia/status/908080372298031104): Negative sampling, as the name suggests, belongs to the family of sampling-based approaches, which also includes importance sampling and target sampling. It is a simplified form of Noise Contrastive Estimation (NCE): it makes assumptions about the number of noise samples to generate (k) and the distribution of noise samples (Q), namely that kQ(w) = 1, in order to simplify computation.
- [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf): Mikolov et al. present a simplified variant of NCE (negative sampling) for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax.

## Summary

The traditional way of building word embeddings is through __counting__:
- Build a square |V| x |V| matrix (|V| is the vocabulary size) where each entry is a co-occurrence count. Then factorize the matrix to get an array of doubles for each token.

Use SVD to find a low-rank approximation: instead of keeping all |V| dimensions, keep only the top k.

This is similar to GloVe, which learns vectors that predict the entries of the co-occurrence matrix directly.
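
The counting approach can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method used in the lecture: the helper names `cooccurrence_matrix` and `svd_embeddings`, the toy sentence, and the symmetric window of 1 are all made up for the example.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                counts[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, counts

def svd_embeddings(counts, k):
    """Keep the top-k singular directions as k-dimensional word vectors."""
    u, s, _ = np.linalg.svd(counts)
    return u[:, :k] * s[:k]

# Toy corpus (hypothetical); a real corpus would give a much larger |V| x |V| matrix.
tokens = "i like deep learning i like nlp i enjoy flying".split()
vocab, counts = cooccurrence_matrix(tokens, window=1)
embeddings = svd_embeddings(counts, k=2)  # one 2-d vector per vocabulary word
```

Each row of `embeddings` is the reduced vector for the word at the same position in `vocab`. The full SVD is costly for a large vocabulary, which is one motivation for the prediction-based models described next.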

CBOW uses an overlapping sliding window of size |w|:
1. Take the dot product of the word's one-hot vector with the weight matrix to get the embedding of the word.
2. Take the dot product of the word's embedding with the vectors of all words in the vocabulary to get a similarity score (loosely, a distance) for each.
3. Apply a softmax to the scores to convert them into a probability distribution.
4. If two words are similar, the probability distributions over their surrounding text will be roughly similar; this is the idea behind training the network.
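
The first three steps can be sketched as a forward pass in NumPy. The matrix names `W_in` and `W_out`, the toy sizes, and the random initialization are assumptions made for illustration; they do not come from the lecture or from the TensorFlow code in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 6, 3

# Hypothetical parameters: input embeddings and output (context) vectors.
W_in = rng.normal(size=(vocab_size, embed_size))
W_out = rng.normal(size=(vocab_size, embed_size))

def forward(word_index):
    """Return the word's embedding and the softmax distribution over the vocabulary."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0
    embedding = one_hot @ W_in                   # step 1: one-hot times weight matrix
    scores = W_out @ embedding                   # step 2: dot product with every word
    exp_scores = np.exp(scores - scores.max())   # step 3: numerically stable softmax
    return embedding, exp_scores / exp_scores.sum()

embedding, probs = forward(2)
```

Multiplying by a one-hot vector just selects a row, so `embedding` equals `W_in[2]`; in practice the lookup is done by indexing rather than by a matrix product.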

In the basic word2vec we have a softmax, but the summation in softmax over the entire vocabulary is expensive:

- P(output | context) = exp(u_o · v_c) / sum_w exp(u_w · v_c), where v_c is the context vector, u_o is the vector of the output word, and the sum runs over every word w in the vocabulary. [Reference](https://arxiv.org/pdf/1410.8251.pdf)

Alternatives:

- Instead of the full softmax, use hierarchical softmax.
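
The reference list at the top also mentions negative sampling as a sampling-based alternative. The sketch below contrasts the full softmax loss, which touches every row of the output matrix, with a negative-sampling loss that touches only the target and k noise words. The function names, the sizes, and the uniform noise distribution are assumptions for illustration; the word2vec paper draws noise words from a unigram distribution raised to the 3/4 power.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_softmax_loss(v_c, U, target):
    """Negative log-likelihood of `target`; the sum runs over all |V| rows of U."""
    scores = U @ v_c
    scores = scores - scores.max()  # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target]

def negative_sampling_loss(v_c, U, target, negatives):
    """Binary logistic loss on the target plus k sampled noise words."""
    positive = np.log(sigmoid(U[target] @ v_c))
    negative = np.log(sigmoid(-(U[negatives] @ v_c))).sum()
    return -(positive + negative)

rng = np.random.default_rng(0)
vocab_size, embed_size, k = 1000, 16, 5
U = rng.normal(scale=0.1, size=(vocab_size, embed_size))  # output word vectors
v_c = rng.normal(scale=0.1, size=embed_size)              # context vector
target = 42

# Assumption: uniform noise over all words except the target.
candidates = np.delete(np.arange(vocab_size), target)
negatives = rng.choice(candidates, size=k, replace=False)

softmax_loss = full_softmax_loss(v_c, U, target)
sampled_loss = negative_sampling_loss(v_c, U, target, negatives)
```

The two losses are different objectives and their values are not directly comparable; the point is the cost per example, which is |V| dot products for the full softmax and k + 1 for negative sampling.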

## Dataset Loader

In [6]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from collections import Counter
import random
import os
import sys
sys.path.append('..')
import zipfile

import numpy as np
from six.moves import urllib
import tensorflow as tf

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016
DATA_FOLDER = 'data/'
FILE_NAME = 'text8.zip'

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def download(file_name, expected_bytes):
    """ Download the dataset text8 if it's not already downloaded """
    file_path = DATA_FOLDER + file_name
    if os.path.exists(file_path):
        print("Dataset ready")
        return file_path
    make_dir(DATA_FOLDER)  # urlretrieve fails if the target folder does not exist
    file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path)
    file_stat = os.stat(file_path)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded the file', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')
    return file_path

def read_data(file_path):
    """ Read data into a list of tokens 
    There should be 17,005,207 tokens
    """
    with zipfile.ZipFile(file_path) as f:
        # tf.compat.as_str() converts the raw bytes read from the archive into a str
        words = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return words

def build_vocab(words, vocab_size):
    """ Build vocabulary of VOCAB_SIZE most frequent words """
    dictionary = dict()
    count = [('UNK', -1)]  # list of (word, count) pairs; index 0 is reserved for UNK

    # Find the (vocab_size - 1) most common words and add them after UNK
    count.extend(Counter(words).most_common(vocab_size - 1))
    index = 0
    make_dir('processed')
    with open('processed/vocab_1000.tsv', "w") as f:
        for word, _ in count:
            dictionary[word] = index
            if index < 1000:
                f.write(word + "\n")
            index += 1
    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, index_dictionary

def convert_words_to_index(words, dictionary):
    """ Replace each word in the dataset with its index in the dictionary """
    return [dictionary.get(word, 0) for word in words]  # index 0 means UNK

# TODO: Check
def generate_sample(index_words, context_window_size):
    """ Form training pairs according to the skip-gram model (overlapping sliding window). """
    for index, center in enumerate(index_words):
        # sample a window size for this center word
        context = random.randint(1, context_window_size)
        # pair the center word with every word in the window before it
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # pair the center word with every word in the window after it
        for target in index_words[index + 1: index + context + 1]:
            yield center, target

# TODO: Check
def get_batch(iterator, batch_size):
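    """ Group a numerical stream into batches and yield them as Numpy arrays. """
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for index in range(batch_size):
            try:
                center_batch[index], target_batch[index] = next(iterator)
            except StopIteration:
                # Stream exhausted: drop the partial batch. Letting StopIteration
                # escape a generator is a RuntimeError on Python 3.7+ (PEP 479).
                return
        yield center_batch, target_batch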

def process_data(vocab_size, batch_size, skip_window):
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    dictionary, _ = build_vocab(words, vocab_size)
    index_words = convert_words_to_index(words, dictionary)
    del words # to save memory
    single_gen = generate_sample(index_words, skip_window)
    return get_batch(single_gen, batch_size)

def get_index_vocab(vocab_size):
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    return build_vocab(words, vocab_size)
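
To make the pair generation above concrete, here is a small self-contained sketch. It assumes a fixed window instead of the random window size drawn in `generate_sample`, so that the output is deterministic; the name `skipgram_pairs` and the toy index list are hypothetical.

```python
def skipgram_pairs(index_words, window):
    """Yield (center, target) pairs for every word within `window` of the center."""
    for index, center in enumerate(index_words):
        # words before the center
        for target in index_words[max(0, index - window): index]:
            yield center, target
        # words after the center
        for target in index_words[index + 1: index + window + 1]:
            yield center, target

pairs = list(skipgram_pairs([10, 11, 12, 13], window=1))
# pairs == [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```

Words at the edges of the corpus get fewer pairs because the slice is clipped at the boundaries.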


In [1]:
# imports and constants

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # hide TensorFlow INFO and WARNING logs

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

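# process_data() is defined in the "Dataset Loader" cell above. The import below only
# works if that cell is saved as process_data.py on the Python path; otherwise it
# fails with "ImportError: No module named process_data".
# from process_data import process_data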