# Lab 10: Word Embeddings
![itIs30MinutesBeforeLabAndThisIsTheBestICanDo](https://github.com/crowegian/memes/blob/master/iHaveHitALowPoint.png?raw=true)

## Introduction
In this lab you'll learn how [Skip-Gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) is implemented. The skip-gram model trains a single-layer neural network to predict the surrounding words given a center word, so that the network learns which words are likely to appear in the context of a given word. The training data consists of (center, context) word pairs: each center word is paired with every context word that appears within a fixed window around it. The example below shows the word pairs created for different center words with a window size of 2.

![trainingWordPairs](http://mccormickml.com/assets/word2vec/training_data.png)
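To make the windowing concrete, here is a minimal sketch of how (center, context) pairs could be built for a single tokenized sentence. The function name `skipgram_pairs` and the toy sentence are made up for illustration, and this is only one way to do it, not necessarily how the lab solution is structured.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for every token in `tokens` (hypothetical helper)."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # Clamp the window so it never reaches outside the sentence.
        lo = max(0, center_pos - window_size)
        hi = min(len(tokens), center_pos + window_size + 1)
        for context_pos in range(lo, hi):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, tokens[context_pos]))
    return pairs


example_pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window_size=2)
for pair in example_pairs:
    print(pair)
```

Words near the edges of a sentence simply get fewer pairs, because the window is clipped rather than padded.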


[Cool](https://www.youtube.com/watch?v=A8q8PXoJwVk). So how does this model actually work? The model has two main moving parts: the center word embeddings $V$ and the context word embeddings $V^{\prime}$. The two matrices have separate weights, and each is in $\mathbb{R}^{v \times e}$, where $v$ is the size of the vocabulary and $e$ is the embedding dimension (a hyperparameter you choose).

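The model learns to maximize the following objective. In practice we minimize its negative, which is why the forward pass returns a negative log probability.

$$L = \log(\sigma(v^{\prime}_{c_o}v_{c_e}^{T})) + \sum_{c_o,c_e \in \bar{D}} \log(\sigma(-v^{\prime}_{c_o}v_{c_e}^{T}))$$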

Here $c_o$ and $c_e$ are the context and center words respectively, $v$ and $v^{\prime}$ are the center and context embeddings respectively, and $\bar{D}$ is the set of word pairs in which $c_o$ is a negatively sampled context word.
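As a sanity check on the formula, the sketch below evaluates $L$ for a single (center, context) pair and a handful of negative samples using plain Python lists. The helper names (`log_sigmoid`, `dot`, `neg_sampling_objective`) and the toy vectors are assumptions for illustration; the lab itself works with batches of PyTorch tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_objective(center_vec, context_vec, negative_vecs):
    """Evaluate L for one center word, one true context word, and k negatives."""
    positive_term = log_sigmoid(dot(context_vec, center_vec))
    negative_term = sum(log_sigmoid(-dot(neg, center_vec)) for neg in negative_vecs)
    return positive_term + negative_term


# Toy 2-dimensional embeddings, invented for this example.
center = [0.5, -0.2]
context = [0.4, 0.1]
negatives = [[-0.3, 0.2], [0.1, -0.6]]
value = neg_sampling_objective(center, context, negatives)
print(value)
```

Every term is the log of a probability, so $L$ is never positive. It moves toward zero as the center vector aligns with the true context vector and points away from the negative samples.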

## Negative Sampling
Please refer to [this](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) tutorial to understand more about negative sampling. You don't have to build the unigram table but you'll need to know how it's used.
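For intuition on how a unigram table gets used, here is a small sketch that builds a table from made-up word counts and then draws negative samples from it. The names `build_unigram_table` and `sample_negatives` and the toy counts are hypothetical; the 0.75 exponent follows the tutorial linked above.

```python
import random


def build_unigram_table(word_counts, table_length=1000, power=0.75):
    """Repeat each word in proportion to count**power (hypothetical helper)."""
    total = sum(count ** power for count in word_counts.values())
    table = []
    for word, count in word_counts.items():
        slots = int((count ** power) / total * table_length)
        table.extend([word] * slots)
    return table


def sample_negatives(table, k, exclude, rng):
    """Draw k entries from the table, never returning the excluded word."""
    candidates = [entry for entry in table if entry != exclude]
    return [rng.choice(candidates) for _ in range(k)]


toy_counts = {"the": 100, "cat": 10, "sat": 1}  # invented counts
table = build_unigram_table(toy_counts)
rng = random.Random(0)
negatives = sample_negatives(table, k=5, exclude="the", rng=rng)
print(table.count("the"), table.count("cat"), table.count("sat"))
print(negatives)
```

Picking a uniformly random slot from the table is the same as sampling words by their smoothed frequency. Because `int()` rounds down, a very rare word can end up with zero slots when the table is short.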

## Data
There are two datasets extracted for you. One comes from the AP news data, and the other is a pull from PubMed. We'll train two sets of word embeddings and compare them at the end. There is also a test corpus which you can use for debugging and getting the model to run. Extra points ($\leq 0$) if you can [guess](www.google.com) where the corpus comes from.

## Installs
tqdm is a nice wrapper for loops to check your progress as you go

conda install -c conda-forge tqdm

ipywidgets makes tqdm look pretty

conda install -c conda-forge ipywidgets

Tokenization and NLP toolkit

conda install -c anaconda nltk 


## Janitorial Work
All of the data cleaning is handled for you. But please familiarize yourself with the objects created by extractVocabMappers as you'll be using these in the code.

In [1]:
testCorpus = ["First of all, quit grinnin’ like an idiot. Indians ain’t supposed to smile like that. Get stoic.",
             "No. Like this. You gotta look mean, or people won’t respect you.",
              " people will run all over you if you don’t look mean.",
             "You gotta look like a warrior. You gotta look like you just came back from killing a buffalo.",
             "But our tribe never hunted buffalo. We were fishermen.",
             "What? You wanna look like you just came back from catching a fish?",
             "This ain’t dances with salmon, you know. Thomas, you gotta look like a warrior."]

# NOTE: reduce this number if you can't get things to run quickly.
maxDocs = 1000

In [2]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ob2285/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Read in the PubMed corpus: one abstract per text file

import glob
pubMedDataFolderPath = "data/pubMed_corpus/"
pubMedDataFiles = glob.glob(pubMedDataFolderPath + "*.txt")
pubMedCorpus = [""]*len(pubMedDataFiles)
for idx, pubMedDataPath in enumerate(pubMedDataFiles):
    with open(pubMedDataPath, "r") as pubMedFile:
        text = pubMedFile.read().strip()
        pubMedCorpus[idx] = text
pubMedCorpus = pubMedCorpus[0:maxDocs]
print("{} pub med abstracts".format(len(pubMedCorpus)))

1000 pub med abstracts


In [4]:
# Read in the ap corpus
apTextFile = "data/ap.txt"
apCorpus = []
readText = False
with open(apTextFile) as apDataFile:
    for line in apDataFile:
        if readText:
            apCorpus.append(line.strip())
            readText = False
        if line == "<TEXT>\n":
            readText = True
apCorpus = apCorpus[0:maxDocs]
print("{} ap articles".format(len(apCorpus)))

1000 ap articles


In [5]:
import string
import nltk
from nltk.tokenize import word_tokenize 
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
import re
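def removePunctuation(myStr):
    # string.punctuation already contains "%" and "\", so only the curly apostrophe is added
    excludedCharacters = string.punctuation + "’"
    newStr = "".join(char for char in myStr if char not in excludedCharacters)
    return(newStr)
def removeStopWords(tokenList):
    # build the stop word set once instead of reloading the list for every token
    myStopWords = set(stopwords.words('english'))
    newTokenList = [tok for tok in tokenList if tok not in myStopWords]
    return(newTokenList)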
def cleanDocStr(docStr):
    docStr = docStr.lower()
    docStr = removePunctuation(docStr)
    docStr = re.sub(r'\d', '%d%', docStr)  # raw string avoids an invalid escape sequence
    docStrTokenized = nltk.tokenize.word_tokenize(docStr)
    myStopWords = set(stopwords.words('english'))
    docStrTokenized = [tok for tok in docStrTokenized if tok not in myStopWords]
    return(docStrTokenized)
def tokenize_corpus(corpus):
    tokens = [cleanDocStr(x) for x in corpus]
    return tokens

apCorpusTokenized = tokenize_corpus(apCorpus)
pubMedCorpusTokenized = tokenize_corpus(pubMedCorpus)
testCorpusTokenized = tokenize_corpus(testCorpus)

[nltk_data] Downloading package stopwords to /home/ob2285/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ob2285/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
import time
from tqdm import tqdm, tqdm_notebook
from collections import Counter

minVocabOccurence = 5

def extractVocabMappers(tokenizedCorpus, minVocabOccurence = 0):
    """
    Description: Builds the vocabulary for a tokenized corpus, along with the word/ID
        mappings, and drops every token that is not in the vocabulary.
    Input:
        tokenizedCorpus (list(list(str))): A list where each index is a document from the corpus.
            Each document is further tokenized into a list of tokens.
            [doc1, doc2,...] where doc1 = [tok1, tok2, ...]
        minVocabOccurence (int): A word must show up more than this many times to be
            included in the vocabulary
    Output:
        word2Idx (dict): A dictionary mapping each word to its integer ID
        idx2Word (dict): A dictionary mapping each integer ID to its word
        wordCounts (list(tuples)): A list of (word, count) tuples giving each vocabulary
            word's count in the corpus, sorted from most to least frequent
        newTokenizedCorpus (list(list(str))): Same as tokenizedCorpus but with out-of-vocabulary
            terms removed
        
    """
    UNK = "<UNK>"
    flattenedCorpus = [item for sublist in tokenizedCorpus for item in sublist]
    wordCounts = Counter(flattenedCorpus).most_common()
    wordCounts = [(w, c) for w,c in wordCounts if c > minVocabOccurence]
#     wordCounts = wordCounts.most_common(vocabSizeMax)
    vocabulary = [word for word, count in wordCounts]
    
    # below is more readable but significantly slower code
    if False:
        vocabulary = []
        for sentence in tqdm(tokenizedCorpus):
            for token in sentence:
                if token not in vocabulary:
                    vocabulary.append(token)
#     vocabulary.append(UNK)
    print("Vocab size: {}".format(len(vocabulary)))
    word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
    idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}
    newTokenizedCorpus = []  # same corpus with out-of-vocabulary words removed
    for doc in tokenizedCorpus:
        newDoc = [word for word in doc if word in word2idx]  # drop out-of-vocabulary words
        newTokenizedCorpus.append(newDoc)
    return(word2idx, idx2word, wordCounts, newTokenizedCorpus)

start = time.time()
print("Building ap corpus vocabulary")
word2Idx_ap, idx2Word_ap, vocabCount_ap, finalTokenizedCorpus_ap = extractVocabMappers(apCorpusTokenized,
                                                                                      minVocabOccurence = minVocabOccurence)
print("ap data tokenized in {} seconds\n".format(time.time() - start))
start = time.time()
print("Building pubMed corpus vocabulary")
word2Idx_pubMed, idx2Word_pubMed, vocabCount_pubMed, finalTokenizedCorpus_pubMed = extractVocabMappers(pubMedCorpusTokenized,
                                                                                                      minVocabOccurence = minVocabOccurence)
print("pubmed data tokenized in {} seconds\n".format(time.time() - start))
start = time.time()
print("Building test corpus vocabulary")
word2Idx_test, idx2Word_test, vocabCount_test, finalTokenizedCorpus_test = extractVocabMappers(testCorpusTokenized,
                                                                                              minVocabOccurence = 0)
print("test data tokenized in {} seconds".format(time.time() - start))

Building ap corpus vocabulary
Vocab size: 6510
ap data tokenized in 0.09615182876586914 seconds

Building pubMed corpus vocabulary
Vocab size: 3528
pubmed data tokenized in 0.09422731399536133 seconds

Building test corpus vocabulary
Vocab size: 37
test data tokenized in 0.00021076202392578125 seconds


## Word2Vec Implementation

In [8]:
import numpy as np
import torch
from torch import nn
import random

In [9]:
def generateObservations(tokenizedCorpus, word2Idx):
    """
    Description: Iterates through every token in the corpus and creates a (center, context)
        pair for each context word in the window on either side of the center word. Please
        refer to the first figure to understand how word pairs are created.
    Input:
        tokenizedCorpus (list(list(str))): A list where each index is a document from the corpus.
            Each document is further tokenized into a list of tokens.
            [doc1, doc2,...] where doc1 = [tok1, tok2, ...]
        word2Idx (dict): A dictionary mapping words to their integer IDs
    Output:
        idxPairs (numpy array): An array of (center word, context word) pairs.
            For example, idxPairs could look like [(w1, w2), (w1, w3), ...] where w1, w2 and w3
            are string tokens from the corpus.
    """
    window_size = 5
    idxPairs = []
    for sentence in tokenizedCorpus:
        for center_word_pos in range(len(sentence)):
            # Your code here
            # Populate idxPairs. Be sure not to jump outside the sentence bounds with your window.

            # End your code
    idxPairs = np.array(idxPairs)
    return(idxPairs)


def generateWordSamplingUnigramTable(vocabCount, word2Idx):
    """
    Description: Generates a unigram table to sample data from. The unigram table
        should contain every vocab index multiple times. The number
        of times an index appears is dictated by its sample probability. The unigram
        table can then be sampled.
    Input:
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping words to their integer IDs
    Output:
        unigram_table (list(int)): A list of integers as described above. For example
        in a 3 word vocabulary it might look something like [0,0,1,1,1,1,1,1,1,2].
        Sampling from the previous example will mean that 0 is sampled 2/10 times,
        1 is sampled 7/10 times, and 2 is sampled 1/10 times.
    """
    unigram_table = []
#     numWords = np.sum([count for word, count in vocabCount])
    numWords = np.sum([count**0.75 for word, count in vocabCount])
    tableLength = 10000
    for w,c in vocabCount:
        unigram_table.extend([word2Idx[w]] * int((((c**0.75)/numWords))*tableLength))
#         unigram_table.extend([word2Idx[w]] * int(((c/numWords)**0.75)/0.001))
    return(unigram_table)
    
class SkipGram(nn.Module):
    """
    Description: Instantiates and implements the forward pass of the skip-gram
        algorithm with negative sampling.
    Input:
        vocabSize (int): Number of words to create embeddings for
        embedSize (int): Dimension of word embeddings
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping words to their integer IDs
    Output:
    """
    def __init__(self, vocabSize, embedSize, vocabCount, word2Idx):
        super(SkipGram, self).__init__()
        self.vocabSize = vocabSize
        self.word2Idx = word2Idx
        # Your code here
        # Init the center and context embedding matrices. These are learnable parameters
        self.centerEmbeddings = 
        self.contextEmbeddings = 
        # End your code
        # Init the embeddings however you like, but this init worked well for me.
        nn.init.xavier_uniform_(self.contextEmbeddings)
        nn.init.xavier_uniform_(self.centerEmbeddings)
        
        self.unigram_table = generateWordSamplingUnigramTable(vocabCount, word2Idx)
        self.logSigmoid = nn.LogSigmoid()
    def getNegSample(self, k, centerWords):
        """
        Description: Randomly selects negative samples from the vocabulary. Uses
            self.unigram_table in order to sample words.
        Input:
            k (int): Number of negative samples to select
            centerWords (list(str)): A list of the string center words. There should
                be batchSize of these.
        Output:
            negSamples (list(numpyArray)): A list of numpy arrays where each numpy array
                contains the indices of negative samples. There are batchSize numpy arrays
        """
        negSamples = []
        for centerWord in centerWords:
            # Your code here
            # Using self.unigram_table, sample indices to use as your negative samples.
            # Be sure that the negative samples you return for each center word
            # don't contain the center word itself. Shouldn't happen often, but just to be sure.
            
            # self.unigram_table don't forget to use this.

        # End your code
        return(negSamples)
    def forward(self, center, context, negSampleIndices):
        """
        Description: Forward pass for the skip-gram model.
        Input:
            center (list(int)): A list of word integer IDs indicating all
                batchSize center words. Matches one to one with context
            context (list(int)): A list of word integer IDs indicating all
                batchSize context words. Matches one to one with center
            negSampleIndices (list(numpyArray)): A list of numpy arrays where
                each numpy array contains the indices of negative samples.
                There are batchSize numpy arrays. Returned by getNegSample()
        Output:
            negLogProb (tensor): The loss (negative log probability) over the entire batch.
        """
        # Your Code
        # implement a forward pass of the model. Be sure to allow for varying batch sizes

        # End your code
        return(negLogProb)


def train_skipgram(embeddingSize, trainingData, vocabCount, word2Idx, idx2Word,
                   k, referenceWords, batchSize = 1024):
    """
    Description: Instantiates and trains a skip-gram model. The model's forward pass
        computes the loss, so all you have to do here is backpropagate the loss and
        update the weights.
    Input:
        embeddingSize (int): Size of each word embedding
        trainingData (list(tuples)): A list of tuples generated by generateObservations()
            where each tuple is a center and context word
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping each word to its integer ID
        idx2Word (dict): A dictionary mapping each integer ID to its word
        k (int): Dictates the number of samples used during negative sampling
        referenceWords (list(str)): A list of words to compare word embeddings for
        batchSize (int): The number of (center, context) words to run through each forward
            pass of the skipgram model.
    Output:
        model (SkipGram): The final trained SkipGram model
    """
    print("training on {} observations".format(len(trainingData)))
    model = SkipGram(vocabSize = len(word2Idx), embedSize = embeddingSize,
                     vocabCount = vocabCount, word2Idx = word2Idx)
    # learning_rate and n_epoch are globals set in the cell that calls train_skipgram
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # Print the nearest words before training as a baseline
    listNearestWords(model = model, idx2Word = idx2Word,
     referenceWords = referenceWords, topN = 5)
    for epoch in tqdm_notebook(range(n_epoch), position = 0):
        total_loss = .0
        avgLoss = 0.0
        iteration = 0
        for step in tqdm_notebook(range(0, len(trainingData), batchSize), position = 1):
            # slicing past the end is safe, so the last batch may simply be smaller
            myBatch = trainingData[step:(step+batchSize)]
            centerWords = [elem[0] for elem in myBatch]
            contextWords = [elem[1] for elem in myBatch]
            negSamples = model.getNegSample(k = k, centerWords = centerWords)
            centerIDs = [word2Idx[idx] for idx in centerWords]
            contextIDs = [word2Idx[idx] for idx in contextWords]
            model.zero_grad()
            loss = model(centerIDs, contextIDs, negSampleIndices = negSamples)
            
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            avgLoss += loss.item()
            iteration += 1
            if iteration % 500 == 0:
                print("avg loss: {}".format(avgLoss / 500))
                avgLoss = 0.0  # reset so each report covers only the last 500 batches
        print("Loss at epoch {}: {}".format(epoch, total_loss/iteration))
        if epoch % 1 == 0:
            listNearestWords(model = model, idx2Word = idx2Word,
                         referenceWords = referenceWords, topN = 5)
    return(model)

In [10]:
from scipy.spatial.distance import cdist
def listNearestWords(model, idx2Word, referenceWords, topN):
    """
    Description: Lists the topN closest words by cosine distance to each word in referenceWords
    Input:
        model (SkipGram): The final trained SkipGram model
        idx2Word (dict): A dictionary mapping each integer ID to its word
        referenceWords (list(str)): A list of words in the vocabulary of the model
        topN (int): The number of closest words to print
    Output:
        None: Just prints
    """
    assert len(idx2Word) == len(model.word2Idx), "Possibly passed in two different vocabularies"
    embeddings = model.centerEmbeddings.data.numpy()
    distMat = cdist(embeddings, embeddings, metric = "cosine")
    # Your code here
    # print the topN closest words to each word in referenceWords
    for word in referenceWords:

    # End your code

## Test Your Code
Using the test corpus defined above, run your code to make sure everything executes and your loss is decreasing.

In [11]:
embd_size = 5
learning_rate = 0.001
n_epoch = 60
idxPairsTest = generateObservations(tokenizedCorpus = finalTokenizedCorpus_test, word2Idx = word2Idx_test)
sg_model = train_skipgram(embeddingSize = embd_size, trainingData = idxPairsTest, vocabCount = vocabCount_test,
                                     word2Idx = word2Idx_test, idx2Word = idx2Word_test, k = 10,
                                    referenceWords = ["thomas", "salmon"])

## Training a skip-gram model
Now we're ready to train our skip-gram model on two different corpora. Play with these hyper-parameters and check your word embeddings regularly to make sure they're learning the right thing. **For the reference words provided, and more if you like, argue that the closest word embeddings make sense. Why might these word embeddings be so close? Consider the context in the corpus. I'll base your grade for this section mostly on your argument.**

In [None]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 10
idxPairsAP = generateObservations(tokenizedCorpus = finalTokenizedCorpus_ap, word2Idx = word2Idx_ap)
sg_model_ap = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsAP,
                                     vocabCount = vocabCount_ap,
                                     word2Idx = word2Idx_ap, idx2Word = idx2Word_ap, k = 20,
                                          referenceWords = ["bush", "soviet", "president", "economy", "american"])

training on 2605610 observations
('bush', 0.0)
('shares', 0.5194150914349732)
('poland', 0.5218423178999131)
('estimate', 0.5441517338645673)
('widely', 0.5513883726368334)
**************************************************

('soviet', 1.1102230246251565e-16)
('criminal', 0.46760505696415045)
('born', 0.4898914966899621)
('persistent', 0.500935605592492)
('sentenced', 0.5165142317527203)
**************************************************

('president', 1.1102230246251565e-16)
('prestige', 0.4515375190452422)
('deicing', 0.49984655865450345)
('christians', 0.5155236051739778)
('legal', 0.5356843649976819)
**************************************************

('economy', 0.0)
('monets', 0.5346631316121371)
('robinson', 0.5358051657353012)
('certain', 0.5434795723617019)
('columbus', 0.5480942726486525)
**************************************************

('american', 0.0)
('die', 0.49618172889851486)
('german', 0.5058398432846687)
('walked', 0.5250782122010302)
('sharing', 0.537236426017595


avg loss: 14.418619115829468
avg loss: 13.099470156589508
avg loss: 11.887569530523141
avg loss: 11.060079142061307
avg loss: 11.051930679886754
Loss at epoch 0: 12.2664336380181
('bush', 0.0)
('government', 0.016436052964986403)
('one', 0.022464927321183326)
('two', 0.023364495530326934)
('us', 0.024297377591558478)
**************************************************

('soviet', 0.0)
('also', 0.016949834118185847)
('us', 0.018801936313878387)
('would', 0.021414450920200445)
('two', 0.022375601991986627)
**************************************************

('president', 0.0)
('one', 0.011309800148615712)
('year', 0.013361777755761084)
('two', 0.013461328693379349)
('would', 0.013482899247442504)
**************************************************

('economy', 0.0)
('news', 0.1221998164462742)
('united', 0.13160612208791556)
('one', 0.1325020814429606)
('bush', 0.1373247260042273)
**************************************************

('american', 0.0)
('president', 0.018977035233529005)
('on


avg loss: 10.843738045454025
avg loss: 10.595060351662159
avg loss: 10.131587166792496
avg loss: 9.44532161120416
avg loss: 9.41494197974928
Loss at epoch 1: 10.06158934011909
('bush', 0.0)
('many', 0.004564515499829724)
('united', 0.0048288900530799594)
('could', 0.004861329448970442)
('news', 0.004901564683698512)
**************************************************

('soviet', 0.0)
('states', 0.003799948786777696)
('time', 0.003918875168575475)
('officials', 0.004143978014025662)
('government', 0.00418413757657099)
**************************************************

('president', 0.0)
('two', 0.002568132564815917)
('also', 0.002768243360620315)
('us', 0.0028542633394230688)
('one', 0.002939002352398745)
**************************************************

('economy', 0.0)
('might', 0.020907855464797342)
('come', 0.021087863944941332)
('know', 0.02137706134443562)
('director', 0.021431731527101316)
**************************************************

('american', 0.0)
('president', 0.003


avg loss: 9.141483906269073
avg loss: 8.870717495448112
avg loss: 8.381149161517156
avg loss: 7.760508625246405
avg loss: 7.699830366310552
Loss at epoch 2: 8.347920382655206
('bush', 1.1102230246251565e-16)
('many', 0.002238930462963973)
('news', 0.0025811540833399205)
('federal', 0.0025909907976920943)
('house', 0.0026683929547705043)
**************************************************

('soviet', 1.1102230246251565e-16)
('states', 0.0020294505469018453)
('time', 0.002068412481474491)
('officials', 0.0021661296134071195)
('also', 0.0022847743454394998)
**************************************************

('president', 0.0)
('two', 0.0015512300660099898)
('one', 0.0015515476147457408)
('also', 0.001640716637153261)
('first', 0.0016932215615717006)
**************************************************

('economy', 2.220446049250313e-16)
('might', 0.007834931139029777)
('come', 0.007916209916335903)
('know', 0.008077843561111964)
('director', 0.008246080951527679)
***************************


avg loss: 7.46271546626091
avg loss: 7.253077980676174
avg loss: 6.853389161341982


In [None]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 10
idxPairsPubMed = generateObservations(tokenizedCorpus = finalTokenizedCorpus_pubMed, word2Idx = word2Idx_pubMed)
sg_model_pubMed = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsPubMed,
                                     vocabCount = vocabCount_pubMed,
                                     word2Idx = word2Idx_pubMed, idx2Word = idx2Word_pubMed, k = 20,
                                                  referenceWords = ["clinical", "obesity", "microbial", "microbiome"])

## How Domains Affect Word Embeddings
Choose two words that appear in both the PubMed and AP vocabularies and compare the words closest to each of them in the PubMed and AP embeddings[.](https://www.youtube.com/watch?v=Tr-WrGcexlY) **Why might the two words you chose have different representations? How might this affect downstream NLP tasks?**
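One way to set up the comparison is sketched below with tiny hand-made embeddings. The dictionaries `toy_news` and `toy_biomed`, their values, and the helper names are all invented for illustration. With your trained models you would look up rows of each model's center embedding matrix instead.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_words(word, embeddings, top_n=3):
    """Return the top_n (word, distance) pairs closest to `word`, excluding itself."""
    query = embeddings[word]
    scored = [(other, cosine_distance(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# Hypothetical 2-dimensional embeddings for the same words in two domains.
toy_news = {"cell": [1.0, 0.1], "prison": [0.9, 0.2], "tissue": [0.1, 1.0]}
toy_biomed = {"cell": [0.1, 1.0], "prison": [1.0, 0.0], "tissue": [0.2, 0.9]}

print("news:", nearest_words("cell", toy_news, top_n=1))
print("biomed:", nearest_words("cell", toy_biomed, top_n=1))
```

In this toy setup the same word has a different nearest neighbor in each domain, which is the kind of contrast the question asks you to explain.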

In [None]:
# Your code here