# Section 9. Word Embeddings

# 1. Overview

- Up to now we've been representing words as features in a one-hot-encoded matrix, i.e. each word gets its own feature.
- This is problematic for datasets with HUGE vocabs... A vocab of 1 million = 1 million features to train on


- **Problem:** One pair of words (car, vehicle) might be more related to each other than some other pair (cat, ocean), but one-hot vectors can't express that
    - All pairs of words are the same distance apart!!!
    - **All words are a distance of 1 from the origin, and a Manhattan distance of 2 (Euclidean distance of $\sqrt{2}$) from each other**
    - The vectors in this high dimensional space each lie along their own axis, so they are ALL orthogonal to each other (every dot product is 0)!
    - So we can't answer "how closely are two words related?"
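
To make the problem concrete, here is a minimal sketch with a made-up 4-word vocabulary (the words and helper names are illustrative, not from any dataset). Every pair of one-hot vectors should come out at the same Euclidean distance ($\sqrt{2}$), the same Manhattan distance (2) and a dot product of 0:

```python
import numpy as np

# One-hot vectors for a toy 4-word vocabulary (illustrative words only).
vocab = ["car", "vehicle", "cat", "ocean"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def euclidean(a, b):
    return np.linalg.norm(a - b)

def manhattan(a, b):
    return np.abs(a - b).sum()

# Compare every pair of distinct words.
for i in range(len(vocab)):
    for j in range(i + 1, len(vocab)):
        a, b = one_hot[i], one_hot[j]
        print(vocab[i], vocab[j], euclidean(a, b), manhattan(a, b), a.dot(b))
```

So (car, vehicle) looks exactly as unrelated as (cat, ocean), which is the whole problem.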
    
    
- **Solution:** Word embeddings! But how do we make them?

    - It's actually pretty simple: **Apply PCA to a V \* D matrix (where V is the vocab size and D is the number of documents)**.
    - **Each word becomes an observation, and each document is a feature.**
    - This gives us a vector representation of each word (PCA finds correlations and gives you a smaller vector while retaining most of the information, as you know)
    - You can also step it up a notch and use TF-IDF (instead of raw counts for your original matrix, before PCA) and non-linear t-SNE (instead of basic linear PCA).
    - **word2vec and GloVe use similar ideas, in very creative ways, to create word embeddings**
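
As a rough sketch of the PCA idea, the snippet below builds a tiny made-up word-by-document count matrix and reduces each word to a 2-D vector. The counts, the word list and the `pca_embed` helper are all assumptions for illustration; PCA is done here via an SVD of the centered matrix rather than with a library class:

```python
import numpy as np

# Toy word-by-document counts: rows are words (observations),
# columns are documents (features). All numbers are made up.
words = ["car", "vehicle", "cat", "ocean"]
counts = np.array([
    [3.0, 2.0, 0.0, 0.0],   # car
    [2.0, 3.0, 0.0, 1.0],   # vehicle
    [0.0, 0.0, 4.0, 1.0],   # cat
    [0.0, 1.0, 1.0, 3.0],   # ocean
])

def pca_embed(X, n_components=2):
    # Center each feature (document column), then project the rows
    # onto the top principal directions using the SVD.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

embeddings = pca_embed(counts, n_components=2)
for word, vec in zip(words, embeddings):
    print(word, vec)
```

Words that show up in the same documents (car, vehicle) end up closer together than words that don't (car, cat), which one-hot vectors could never tell us.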
    
### We will first look at some pretrained word embeddings (fastText vectors stored in word2vec format) and experiment with Word Analogies. We will then build our own simpler TF-IDF word embeddings (not using word2vec). In the next section we will finally get into building our own custom word2vec implementation. WOOOT!

# 2. Using Pre-trained word embeddings

Stolen from here: https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27

- Training word embeddings with word2vec or GloVe often takes a lot of computing power
- Fortunately, you can use awesome pretrained word embeddings! We can use gensim to load in pretrained vectors
- The vectors were trained by FB and can be found at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
    - from github: "We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters."
    - The word vectors come in both the binary and text default formats of fastText. In the text format, each line contains a word followed by its embedding. Each value is space separated. Words are ordered by their frequency in descending order.
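
To get a feel for the text format, here is a small sketch that parses a made-up miniature file held in a string. It assumes the word2vec-style layout that `load_word2vec_format` reads: a header line with the number of words and the dimension, then one word per line followed by its space-separated values. The sample data and the `parse_vec_text` helper are hypothetical:

```python
import numpy as np

# Made-up miniature file in the assumed layout:
# "<number of words> <dimension>", then "<word> <v1> <v2> ..." per line.
sample_vec_text = """3 4
the 0.1 0.2 0.3 0.4
of -0.5 0.0 0.5 1.0
king 0.9 -0.1 0.3 0.7
"""

def parse_vec_text(text):
    lines = text.strip().split("\n")
    n_words, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        word = parts[0]
        vectors[word] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
        assert vectors[word].shape == (dim,)   # every row must match the header
    assert len(vectors) == n_words
    return vectors

vectors = parse_vec_text(sample_vec_text)
print(list(vectors))       # words in file order (most frequent first in the real files)
print(vectors["king"])
```

For the real files gensim does all of this for us, so this is only meant to show what is inside a `.vec` file.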

In [8]:
from gensim.models import KeyedVectors
from time import time
# Create the model
t0 = time()
en_model = KeyedVectors.load_word2vec_format('wiki.simple.vec')
t1 = time()
print("Vectors load time: ", t1-t0)

Vectors load time:  44.94893479347229


In [9]:
# Getting the tokens
# NOTE: `.vocab` is the gensim < 4.0 API; in gensim 4.x the equivalent is `en_model.key_to_index`
words = []
for word in en_model.vocab:
    words.append(word)
    
# Printing out number of tokens available
print("Number of Words: {}".format(len(words)))

# Printing out the dimension of a word vector 
print("Dimension of a word vector: {}".format(len(en_model[words[0]])))

# Print out the vector of a word 
print("Example vector (first n elements of 300):",en_model[words[0]][0:10])

Number of Words: 111051
Dimension of a word vector: 300
Example vector (first n elements of 300): [ 0.28922001 -0.46075001  0.35141999 -0.41104001  0.16421001  0.17307
 -0.21562    -0.090636   -0.079495   -0.11149   ]


### Word Analogies
- Now that we have loaded our pretrained vectors, let's experiment with them using word analogies
- Generally speaking, word vectors with similar directions have similar meanings (especially if you use word2vec or GloVe)
- For instance:
        king - man ~= queen - woman
        
 We will test this out with this simple exercise:
 
 1. Convert some words to their word embeddings
     - E.g. vec("king") = WE[word2idx["king"]]
     - v0 = vec("king") - vec("man") + vec("woman")
     - v0 is just a point in a continuous vector space with infinitely many possible values
 2. Loop through all word vectors, find the one closest to v0, return that word
     - There is no way to map directly from a vector to a word: the vector space is continuous, so that would require an infinite number of words.
     - Instead, we pick a distance metric and use it to find the \*CLOSEST\* word
     
**Distance Metrics:**

There are many distance metrics we could use:
- Euclidean distance (plain old squared distance): $||a-b||^2$
- Cosine distance: $cos\_distance(a,b) = 1 - a^Tb/(||a||\space ||b||)$
    - since: $a^Tb = ||a||\space ||b|| \space cos(\theta)$, where $\theta$ is the angle between a and b
    - Parallel vectors (0 degree angle):
        - $cos(0deg) = 1$
        - 1 is max val of cos
    - Orthogonal vectors (90 degree angle)
        - $cos(90deg)=0$
    - Vectors in opposite direction (180 degree angle)
        - $cos(180deg)=-1$
        - (-1) is min val of cos
    - So essentially, the closer the vectors are, the LARGER $cos(\theta)$ will be... this is the opposite of what we want from a distance
    - So we take the negative of the cos function and add 1, which keeps the result from going negative...
        - we want our "distance" function to be $1 - cos(\theta)$
    - **For this distance, it's useful to normalize all word embeddings so their length is 1. Then ALL word embeddings lie on the UNIT SPHERE**
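
Here is a small sketch of why normalizing is handy, using random made-up embeddings (the `normalize_rows` helper is illustrative). On unit vectors the cosine distance reduces to $1 - a^Tb$, and since $||a-b||^2 = ||a||^2 + ||b||^2 - 2a^Tb = 2(1 - cos(\theta))$, squared Euclidean distance and cosine distance rank neighbours identically:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize_rows(W):
    # Divide every row by its Euclidean length so each embedding has norm 1.
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))        # 5 made-up embeddings of dimension 3
W_unit = normalize_rows(W)

a, b = W_unit[0], W_unit[1]
print(cosine_distance(a, b))         # on unit vectors this is just 1 - a.dot(b)
print(np.sum((a - b) ** 2) / 2.0)    # half the squared Euclidean distance: same value
```

Normalizing does not change the cosine distance itself, since it only depends on the angle between the vectors.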
    
**Finding closest vector matches:**

Loop through the words and keep track of distances:

    # pseudocode (assumes v0, WE, word2idx and dist are already defined)
    min_dist = float('inf')
    best_word = ''
    for word, idx in word2idx.items():
        v1 = WE[idx]
        d = dist(v0, v1)
        if d < min_dist:
            min_dist = d
            best_word = word
    print("The closest word is: ", best_word)
    
#### NOTE: gensim's `similarity` method computes the cosine similarity. We will use it as a check to make sure our function is working as expected. Gensim also has `most_similar` and `similar_by_vector` methods; we will use `similar_by_vector` instead of this loop (we have a big vocab in this test set and the loop would be inefficient).
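
For reference, the closest-word loop can also be written without a Python loop over the vocab. The sketch below uses hand-made 2-D toy embeddings (first axis roughly "royalty", second roughly "gender"); the data and the `closest_word` helper are assumptions for illustration, not part of gensim:

```python
import numpy as np

def closest_word(v0, WE, idx2word, exclude=()):
    # Cosine distance from v0 to every row of WE in one matrix product.
    WE_unit = WE / np.linalg.norm(WE, axis=1, keepdims=True)
    v0_unit = v0 / np.linalg.norm(v0)
    distances = 1.0 - WE_unit.dot(v0_unit)
    # Optionally skip the query words, which tend to be the nearest rows.
    for word in exclude:
        distances[idx2word.index(word)] = np.inf
    return idx2word[int(np.argmin(distances))]

# Hand-made toy embeddings, chosen so the analogy works out exactly.
idx2word = ["king", "queen", "man", "woman"]
WE = np.array([
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [0.1, 1.0],    # man
    [0.1, -1.0],   # woman
])
word2idx = {w: i for i, w in enumerate(idx2word)}

v0 = WE[word2idx["king"]] - WE[word2idx["man"]] + WE[word2idx["woman"]]
print(closest_word(v0, WE, idx2word, exclude=("king", "man", "woman")))
```

With these toy numbers v0 lands exactly on the queen vector, so the sketch prints `queen`. Real embeddings are never this clean.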

In [3]:
from numpy import linalg as la
#create cosine_sim function
def cosine_sim(a, b):
    numerator = a.T.dot(b)
    denominator = la.norm(a) * la.norm(b)
    return numerator / denominator

a = 'queen'
b = 'king'
vec_a = en_model[a]
vec_b = en_model[b]

print("Custom cosine similarity: ", cosine_sim(vec_a, vec_b))
print("Gensim cosine similarity: ", en_model.similarity(a,b))

Custom cosine similarity:  0.544238
Gensim cosine similarity:  0.544238451305


#### Alright, let's play around with some word analogies!
We will use gensim's `similar_by_vector` method here:
    
    similar_by_vector(vector, topn=10, restrict_vocab=None)

In [5]:
a = 'man'
b = 'king'
test = en_model[a] - en_model[b]

en_model.similar_by_vector(test)

[('man', 0.5498616695404053),
 ('spider', 0.23319827020168304),
 ('woman', 0.23259012401103973),
 ('mischief', 0.20489361882209778),
 ('ejaculate', 0.20252743363380432),
 ('spiderleg', 0.19765730202198029),
 ('topless', 0.1948021799325943),
 ('naturopathy', 0.1926197111606598),
 ('ejaculated', 0.18959881365299225),
 ('spiderman', 0.18746206164360046)]

In [6]:
a = 'king'
b = 'man'
test = en_model[a] - en_model[b]

en_model.similar_by_vector(test)

[('king', 0.6832274794578552),
 ('kingship', 0.43243950605392456),
 ('kings', 0.430464506149292),
 ('kingz', 0.4212456941604614),
 ('queen', 0.35821980237960815),
 ('kingdoms', 0.3533620238304138),
 ('kingkong', 0.353018581867218),
 ('abdicates', 0.34669357538223267),
 ('kingdome', 0.34243044257164),
 ('reigned', 0.333217054605484)]

In [7]:
a = 'queen'
b = 'woman'
test = en_model[a] - en_model[b]

en_model.similar_by_vector(test, topn=10)

[('queen', 0.6491795182228088),
 ('queene', 0.44217807054519653),
 ('queens', 0.38087162375450134),
 ('queenie', 0.37162333726882935),
 ('queenside', 0.3590025305747986),
 ('queensrÿche', 0.3425987958908081),
 ('queensway', 0.33737051486968994),
 ('consort', 0.3284097909927368),
 ('elizabeth', 0.32121044397354126),
 ('queensbury', 0.30972492694854736)]
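
The cells above only look at differences of two vectors. A sketch of the full analogy (king - man + woman) using gensim's `most_similar` could look like the following. The `analogy` wrapper is a hypothetical helper, and the call is left commented out because it needs the loaded `en_model`:

```python
def analogy(model, a, b, c, topn=5):
    """Return candidates x for 'a - b ~= x - c' (e.g. king - man ~= x - woman)."""
    # most_similar treats the `positive` words as added and the `negative` words as subtracted.
    return model.most_similar(positive=[a, c], negative=[b], topn=topn)

# Usage with the vectors loaded earlier in this notebook (not run here):
# analogy(en_model, 'king', 'man', 'woman')
```

As far as I know, `most_similar` also leaves the input words out of its results, which `similar_by_vector` does not do (note how 'king' and 'queen' top their own lists above).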

# 3. Tf-Idf & t-SNE to find word embedding

We can use TF-IDF and t-SNE to give us a low dimension word embedding. **As described above, this is a precursor to word2vec, and it's good to understand these fundamentals prior to jumping into the big algos!**

#### TF-IDF
Recall that TF-IDF discounts words that appear in many different docs (taking into account the global frequency of a word). It gives higher scores to words that appear in only a few documents.
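
As a refresher, here is a minimal numpy sketch of TF-IDF on a made-up corpus. It assumes the smoothed formula $idf(t) = ln((1 + n) / (1 + df(t))) + 1$ followed by L2-normalizing each document row, which is what sklearn's `TfidfTransformer` does with its default settings; the `tfidf` helper itself is just for illustration:

```python
import numpy as np

def tfidf(counts):
    """counts: (n_documents, n_terms) array of raw term counts."""
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    weighted = counts * idf                          # scale each term column by its idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                          # leave all-zero rows untouched
    return weighted / norms

# Toy corpus: 3 documents (rows) over the terms ["the", "king", "ocean"] (columns).
counts = np.array([
    [2.0, 1.0, 0.0],
    [3.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
X = tfidf(counts)
print(X)

# Note the orientation: documents go along the rows. For a words-by-documents
# matrix A, transposing in and out keeps that convention: tfidf(A.T).T
```

In the first document "the" appears twice as often as "king", but because "the" is in every document its weight is pulled down relative to "king".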

#### t-SNE
t-SNE is a non-linear dimensionality reduction method. Will it perform better than linear PCA? We shall see!

## 1. Import packages

In [7]:
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.manifold import TSNE
from datetime import datetime
from sklearn.feature_extraction.text import TfidfTransformer
import os
import sys
import random
from nltk.corpus import brown
import nltk
import operator

#NOTE WE DO NOT NEED REUSABLE FUNCTIONS AS WE RELOAD CODE BELOW
#import reusable function so we dont have to rewrite them in this notebook
#from LazyProgrammerGitRepos.rnn_class.util import get_wikipedia_data
#from LazyProgrammerGitRepos.rnn_class.brown import get_sentences_with_word2idx_limit_vocab, get_sentences_with_word2idx

## 2. Import old code
Used to get data from the Brown corpus

In [8]:
#GET SENTENCES FROM BROWN CORPUS
def get_sentences():
    # returns the 57340 sentences of the Brown corpus
    # each sentence is represented as a list of individual string tokens
    return brown.sents()
test = get_sentences()

#SENTENCES TO INDEX REPRESENTATION WITH LIMITED VOCAB
KEEP_WORDS = set([
  'king', 'man', 'queen', 'woman',
  'italy', 'rome', 'france', 'paris',
  'london', 'britain', 'england',
])

def get_sentences_with_word2idx_limit_vocab(n_vocab=2000, keep_words=KEEP_WORDS, print_v=False):
    #initialize sentences and index
    sentences = get_sentences()
    indexed_sentences = []
    i = 2
    #initialize start/end tags
    #capitalized so they won't get confused with actual words in corpus (tokens are lowercased below)
    word2idx = {'START': 0, 'END': 1}
    idx2word = ['START', 'END']
    
    #Set start/end token counts to inf so they don't get removed when sorting by count
    word_idx_count = {0: float('inf'), 1: float('inf')}
    
    #Count each word 
    for sentence in sentences:
        indexed_sentence = []
        for token in sentence:
            token = token.lower()
            if token not in word2idx:
                idx2word.append(token)
                word2idx[token] = i
                i += 1
            # keep track of counts for later sorting
            idx = word2idx[token]
            word_idx_count[idx] = word_idx_count.get(idx, 0) + 1

            indexed_sentence.append(idx)
        indexed_sentences.append(indexed_sentence)

    # restrict vocab size
    # set the counts of all the words I want to keep to infinity
    # so that they are included when I pick the most common words
    for word in keep_words:
        word_idx_count[word2idx[word]] = float('inf')
    #tell sorted function to use 2nd item (the count) to sort
    sorted_word_idx_count = sorted(word_idx_count.items(), key=operator.itemgetter(1), reverse=True)
    word2idx_small = {}
    new_idx = 0
    #create new dictionary from old dict
    idx_new_idx_map = {}
    for idx, count in sorted_word_idx_count[:n_vocab]:
        word = idx2word[idx]
        if print_v:
            print(word, count)
        word2idx_small[word] = new_idx
        idx_new_idx_map[idx] = new_idx
        new_idx += 1
    # let 'UNKNOWN' be the last token
    # all infrequent words are replaced with 'UNKNOWN'
    word2idx_small['UNKNOWN'] = new_idx 
    unknown = new_idx
    
    # sanity check to make sure all words we wanted to keep are still there
    assert('START' in word2idx_small)
    assert('END' in word2idx_small)
    for word in keep_words:
        assert(word in word2idx_small)

    # map old idx to new idx
    sentences_small = []
    for sentence in indexed_sentences:
        if len(sentence) > 1:
            new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown for idx in sentence]
            sentences_small.append(new_sentence)

    return sentences_small, word2idx_small

## 3. Extract data

In [9]:
sentences, word2idx = get_sentences_with_word2idx_limit_vocab(n_vocab=1500)
print('Sentences: ',len(sentences))
print('Vocab size: ',len(word2idx))

Sentences:  57013
Vocab size:  1501


## 4. Build VxN term document matrix

This is used to create the word embeddings

In [10]:
V = len(word2idx)
N = len(sentences)

# create raw counts first
A = np.zeros((V, N))
j = 0
for sentence in sentences:
    for i in sentence:
        A[i,j] += 1
    j += 1
    if j%10000 == 0:
        print("{} sentences complete".format(j))
print("finished getting raw counts")

10000 sentences complete
20000 sentences complete
30000 sentences complete
40000 sentences complete
50000 sentences complete
finished getting raw counts
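
The same counting can be done without the nested Python loops. The sketch below uses `np.add.at` on made-up toy sentences (all the `*_toy` names are illustrative). Unlike a plain `A[rows, cols] += 1`, `np.add.at` accumulates repeated (word, sentence) pairs, which matters when a word occurs more than once in a sentence:

```python
import numpy as np

# Toy indexed sentences over a 4-word vocabulary (made-up data).
toy_sentences = [[0, 2, 2], [1, 3], [0, 0, 1]]
V_toy, N_toy = 4, len(toy_sentences)

# One (row, col) pair per token: row = word id, col = sentence id.
rows = np.concatenate([np.asarray(s, dtype=int) for s in toy_sentences])
cols = np.concatenate([np.full(len(s), j, dtype=int) for j, s in enumerate(toy_sentences)])

A_toy = np.zeros((V_toy, N_toy))
np.add.at(A_toy, (rows, cols), 1)   # repeated pairs are counted every time
print(A_toy)
```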


## 5. Transform VxN matrix using TFIDF

#### Full disclosure: I'm not sure this is legit. `TfidfTransformer` expects the documents along the rows (and the terms along the columns) so that Inverse Document Frequency is calculated per term. Here the rows are words and the columns are sentences, so the roles are swapped. Anyway, let's power through for now.

In [11]:
#load TfidfTransformer again for funsies
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
A1 = transformer.fit_transform(A)

#turn to array from sklearn sparse matrix
A1 = A1.toarray()

A1.shape

(1501, 57013)

## 6. Get t-SNE word embeddings

In [18]:
#initialize (default n_components=2, so each word gets a 2-D embedding)
tsne = TSNE()
t0 = time()
#fit_transform
Z = tsne.fit_transform(A1)
t1 = time()

print("Runtime: ",t1-t0)

Runtime:  396.6635196208954


## 7. Test out some word analogies
First create the find_analogies function, which finds the word whose embedding is closest to:
    
    W1 - W2 + W3 = ?
    
Then toss in a bunch of word sets and see what happens!

In [22]:
#create the find_analogies function
def find_analogies(w1, w2, w3, We, word2idx):
    # look up the three embeddings and form the query vector w1 - w2 + w3
    vec1 = We[word2idx[w1]]
    vec2 = We[word2idx[w2]]
    vec3 = We[word2idx[w3]]
    v0 = vec1 - vec2 + vec3

    def dist1(a, b):
        # Euclidean distance
        return np.linalg.norm(a - b)
    def dist2(a, b):
        # cosine distance
        return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    for dist, name in [(dist1, 'Euclidean'), (dist2, 'cosine')]:
        min_dist = float('inf')
        best_word = ''
        for word, idx in word2idx.items():
            # skip the input words, which would otherwise tend to be the closest matches
            if word not in (w1, w2, w3):
                v1 = We[idx]
                d = dist(v0, v1)
                if d < min_dist:
                    min_dist = d
                    best_word = word
        print("closest match by", name, "distance:", best_word)
        print(w1, "-", w2, "=", best_word, "-", w3)

In [23]:
analogies_to_try = (
    ('king', 'man', 'woman'),
    ('france', 'paris', 'london'),
    ('france', 'paris', 'rome'),
    ('paris', 'france', 'italy'),
)

# test to make sure all analogy words are in the vocab
# otherwise it won't work
notfound = False
for word_list in analogies_to_try:
    for w in word_list:
        if w not in word2idx:
            # fill in the missing word (the original never formatted %s)
            print("%s not found in vocab, remove it from "
                  "analogies to try or increase vocab size" % w)
            notfound = True
if notfound:
    exit()
    
for word_list in analogies_to_try:
    w1, w2, w3 = word_list
    find_analogies(w1, w2, w3, Z, word2idx)

closest match by Euclidean distance: de
king - man = de - woman
closest match by cosine distance: boys
king - man = boys - woman
closest match by Euclidean distance: finally
france - paris = finally - london
closest match by cosine distance: quiet
france - paris = quiet - london
closest match by Euclidean distance: nothing
france - paris = nothing - rome
closest match by cosine distance: '
france - paris = ' - rome
closest match by Euclidean distance: playing
paris - france = playing - italy
closest match by cosine distance: leadership
paris - france = leadership - italy


# Conclusion
### As expected, these word embeddings suck, and so we don't get very good word analogies. Let's move on to word2vec!