In [None]:
# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------


import sys
import random
import pprint
import nltk
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from nltk.corpus import reuters
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

In [None]:


# Our corpus uses the following start and end tokens.
START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(1337)
random.seed(1337)

Word vectors are often used as a fundamental component of downstream NLP tasks, e.g. question answering, text generation, and translation. It is therefore important to build some intuition for their strengths and weaknesses. Here, we will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via GloVe.

The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower-dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".

##### Part 1: Count-Based Word Vectors
Most word vector models start from the following idea:

*You shall know a word by the company it keeps (Firth, J. R. 1957:11)*

Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices.

##### Co-Occurrence
A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window across all documents.
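The definition above can be sketched on a toy corpus. This is a minimal illustration, not a reference solution: the function name `build_cooccurrence`, the `word2ind` mapping, and the two example sentences are assumptions made for this sketch, and the window is simply truncated at document boundaries.

```python
import numpy as np

def build_cooccurrence(corpus, window_size=1):
    """Hypothetical helper: count word-word co-occurrences in a fixed window."""
    # A sorted vocabulary gives a stable row/column order.
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for i, center in enumerate(doc):
            # Clip the window so it never runs past either end of the document.
            lo = max(0, i - window_size)
            hi = min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[word2ind[center], word2ind[doc[j]]] += 1
    return M, word2ind

# Toy corpus (an assumption for illustration), already wrapped in start/end tokens.
toy_corpus = [
    ["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"],
    ["<START>", "all", "is", "well", "that", "ends", "well", "<END>"],
]
M, word2ind = build_cooccurrence(toy_corpus, window_size=1)
print(M[word2ind["all"], word2ind["<START>"]])
```

Because every pair is counted once from each side, the resulting matrix is symmetric, which matches the description of $M$ above.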

The rows (or columns) of a co-occurrence matrix provide one type of word vector (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)*, and keep only the top $k$ *principal components*. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered along the diagonal of the matrix $S$, and our new, shorter length-$k$ word vectors in $U_k$.

![Picture of an SVD](./imgs/svd.png "SVD")

This reduced-dimensionality co-occurrence representation tends to preserve semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*.
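As a rough sketch of the reduction step, the snippet below applies a plain NumPy SVD to a small symmetric matrix and keeps the top $k$ components. The helper name `reduce_with_svd` and the example matrix `A` are assumptions for illustration. The rows are scaled by the singular values (returning $U_k S_k$ rather than $U_k$ alone), which is also what scikit-learn's `TruncatedSVD.fit_transform` returns; dropping the scaling would give the $U_k$ vectors described above.

```python
import numpy as np

def reduce_with_svd(M, k=2):
    """Hypothetical helper: project the rows of M onto the top-k singular directions."""
    # np.linalg.svd returns singular values in descending order.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep the first k columns of U, scaled by the matching singular values.
    return U[:, :k] * S[:k]

# A small symmetric count matrix (made up for illustration), one row per word.
A = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

reduced = reduce_with_svd(A, k=2)
print(reduced.shape)  # (4, 2)
```

Each row of `reduced` is a length-$k$ vector for the corresponding word, so a vocabulary of any size can be plotted in two dimensions when $k = 2$.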

In [None]:
def read_corpus(category="crude"):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

In [None]:
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)