# CNTK Word2Vec Part A: Data Loader

In this tutorial, we will learn word embeddings using the Word2Vec model by [Mikolov et al.](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We will use the [Text8 Corpus](http://mattmahoney.net/dc/textdata) by Matt Mahoney, which is cleaned text taken from the English Wikipedia dump of Mar. 3, 2006.

This tutorial is divided into two parts:
- Part A: We familiarize ourselves with the dataset that will be used later in the tutorial.
- Part B: We build our model to learn word embeddings from the Text8 corpus.

# Motivation

Simply put, a word embedding is a vector representation of a word. But why would one want that?

To apply machine learning (and deep learning in particular), it is very important to give our data a rich representation suited to the problem we are tackling. Think of a machine learning algorithm as a child. To help a child learn, we use books with simple, easy-to-grasp words and appropriate pictures. This makes it easier for the child to absorb the information contained in the data (the book). The same holds for a machine learning algorithm.

Image and audio processing systems already work with rich, high-dimensional data. Images can be represented as vectors of individual raw pixel intensities and audio data as vectors of power spectral density coefficients. So, for audio and image recognition tasks, all the required information is encoded in the dataset itself.

But this is not the case with NLP tasks. NLP systems traditionally treat words as discrete atomic units, so the symbol for a word says nothing by itself about how that word relates to other words. Text data encoded this way is also sparse, so we need more data to train our models successfully.

**Vector Space Models** (VSMs) represent (embed) words in a continuous vector space where **semantically similar words map to nearby points**.
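
To make "nearby points" concrete, here is a minimal sketch that compares made-up 3-dimensional vectors using cosine similarity. The vectors and the choice of cosine similarity as the closeness measure are assumptions for illustration only; real embeddings are learned and usually have many more dimensions.

```python
import math

# Hypothetical toy embeddings, chosen by hand for illustration (not learned).
toy_embeddings = {
    'cat': [0.9, 0.8, 0.1],
    'dog': [0.8, 0.9, 0.2],
    'car': [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_cat_dog = cosine_similarity(toy_embeddings['cat'], toy_embeddings['dog'])
sim_cat_car = cosine_similarity(toy_embeddings['cat'], toy_embeddings['car'])

# With these hand-picked numbers, 'cat' is closer to 'dog' than to 'car'.
print('cat vs dog:', round(sim_cat_dog, 3))
print('cat vs car:', round(sim_cat_car, 3))
```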

# Learning Techniques

There are two types of methods for learning word embeddings:

1. **Count based Methods**: Here, we compute statistics of how often each word co-occurs with its neighboring words in a large text corpus (forming a co-occurrence matrix) and then map these statistics into small, dense vectors for each word. Example: [GloVe Model](http://nlp.stanford.edu/projects/glove)

2. **Predictive Methods**: Here, we directly try to predict a word from its neighbors in terms of small, dense embedding vectors (considered as parameters of the model). Example: Word2Vec
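
As a rough illustration of the counting step behind count based methods, the sketch below tallies co-occurrences within a window of one word on each side of a toy sentence. The window size and the variable names are assumptions; methods such as GloVe go on to compress counts like these into dense vectors, which this sketch does not attempt.

```python
import collections

sentence = 'the quick brown fox jumped over the lazy dog'.split()
window = 1  # assumed window size: one neighbor on each side

# Maps (word, neighbor) pairs to the number of times they appear together.
cooccurrence = collections.Counter()
for i, word in enumerate(sentence):
    start = max(0, i - window)
    stop = min(len(sentence), i + window + 1)
    for j in range(start, stop):
        if j != i:
            cooccurrence[(word, sentence[j])] += 1

print(cooccurrence[('the', 'quick')])
print(cooccurrence[('fox', 'dog')])
```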

# Word2Vec

Word2Vec is a computationally efficient predictive model. It comes in two variants, which can be explained using the example sentence:

the quick brown fox jumped over the lazy dog

1. **CBOW**: Continuous Bag of Words model, where we predict a target word from source context words. Example: predicting **fox** from **brown** and **jumped**, given the above sentence.

2. **Skip Gram**: here, we predict source context words from the target word. Example: predicting **brown** and **jumped** from **fox**, given the above sentence.

In this tutorial, we will implement the **Skip Gram Model** of **Word2Vec**.
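
The Skip Gram setup can be sketched as generating (target, context) training pairs from a sliding window. The helper `skipgram_pairs` and the window size of 1 are hypothetical choices for illustration, not the batching code used in Part B.

```python
def skipgram_pairs(words, window=1):
    """Return (target, context) pairs for every word and its neighbors."""
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, words[j]))
    return pairs


sentence = 'the quick brown fox jumped over the lazy dog'.split()
pairs = skipgram_pairs(sentence, window=1)

# 'fox' is paired with each of its neighbors, 'brown' and 'jumped'.
print([pair for pair in pairs if pair[0] == 'fox'])
```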

In [2]:
# Import the relevant modules to be used later
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import codecs  # used by read_data below
import collections
import math
import os
import pickle
import random
import sys
import zipfile

from six.moves import urllib
from six.moves import xrange

# Initializing globals

vocab_size = 4096
data = list()
dictpickle = 'w2v-dict.pkl'
datapickle = 'w2v-data.pkl'


# Data Download

We will download the data to the local machine. The Text8 Corpus is cleaned text from Wikipedia and is widely used for training and testing word embeddings. It is a zip file of ~31MB that becomes ~100MB when uncompressed. The code below looks for the zip file in the current directory and downloads it if it is not present.

In [3]:
def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    url = 'http://mattmahoney.net/dc/'
    if not os.path.exists(filename):
        print('Downloading Sample Data..')
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

# Reading the data

Now that the example text corpus has been downloaded, we can read it into memory. Alternatively, if you wish to supply your own data file (.txt) instead of running the example case, you can read it with the 'read_data' function.

In [4]:
def read_data(filename):
    """Read the file as a list of words"""
    data = list()
    with codecs.open(filename, 'r', 'utf-8') as f:
        for line in f:
            data += line.split()
    return data


def read_data_zip(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        bdata = f.read(f.namelist()[0]).split()
    data = [x.decode() for x in bdata]
    return data

# Building the Dataset

Next, we build a vocabulary of the required size containing the most frequent words in the corpus and integerize the corpus by mapping each word to its index in the vocabulary. If a word is not in the vocabulary, we treat it as a special 'UNK' (unknown) token.

We also save the vocabulary as a pickle file so that we can visualize the learnt embeddings later.
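
Before the full function, here is a small worked example of the same idea on a toy corpus. The corpus, the vocabulary size of 3, and the `toy_` names are assumptions for illustration; index 0 is reserved for 'UNK' as described above.

```python
import collections

toy_words = 'the cat sat on the mat the cat'.split()
toy_vocab_size = 3  # hypothetical; the tutorial itself uses vocab_size = 4096

# Slot 0 is reserved for 'UNK'; the remaining slots go to the most frequent words.
toy_count = [('UNK', -1)]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocab_size - 1))
toy_dictionary = {word: index for index, (word, _) in enumerate(toy_count)}

# Words outside the vocabulary fall back to index 0 ('UNK').
toy_data = [toy_dictionary.get(word, 0) for word in toy_words]

print(toy_dictionary)
print(toy_data)
```

Here 'the' and 'cat' are the two most frequent words, so every other word is mapped to the 'UNK' index.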

In [5]:
def build_dataset(words):
    global data, vocab_size
    
    print('Building Dataset..')
    
    print('Finding the N most common words in the dataset..')
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocab_size - 1))
    print('Done')
    
    dictionary = dict()
    
    for word, _ in count:
        dictionary[word] = len(dictionary)
    
    print('Integerizing the data..')
    data = list()
    unk_count = 0
    
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    
    print('Done')
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    
    print('Saving Vocabulary..')
    with open(dictpickle, 'wb') as handle:
        pickle.dump(dictionary, handle)
    print('Done')
    
    print('Saving the processed dataset..')
    with open(datapickle, 'wb') as handle:
        pickle.dump(data, handle)
    print('Done')
    
    print('Most common words (+UNK)', count[:5])
    print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])


# Putting it all together

In [6]:
def process_text(filename):

    if filename == 'runexample':
        print('Running on the example data..')
        filename = maybe_download('text8.zip', 31344016)
        words = read_data_zip(filename)
    else:
        print('Running on the user specified data')
        words = read_data(filename)
    
    build_dataset(words)
    
# Running on Example Data (i.e. Text8 Corpus)
process_text('runexample')

Running on the example data..
Found and verified text8.zip
Building Dataset..
Finding the N most common words in the dataset..
Done
Integerizing the data..
Done
Saving Vocabulary..
Done
Saving the processed dataset..
Done
Most common words (+UNK) [['UNK', 3061524], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [0, 3082, 12, 6, 195, 2, 3137, 46, 59, 156] ['UNK', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
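
The saved pickle files can be read back later, for example when visualizing the embeddings. The sketch below shows one way this might look; `load_pickle` is a hypothetical helper, and a small stand-in file is written first so that the example runs on its own. In practice you would pass 'w2v-dict.pkl' or 'w2v-data.pkl' instead.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a single pickled object from the given file."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Stand-in vocabulary file (assumption: same word-to-index layout as w2v-dict.pkl).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo-dict.pkl')
with open(demo_path, 'wb') as handle:
    pickle.dump({'UNK': 0, 'the': 1, 'of': 2}, handle)

dictionary = load_pickle(demo_path)

# Invert the mapping to go from an index back to its word.
reverse_dictionary = {index: word for word, index in dictionary.items()}
print(reverse_dictionary[1])
```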
