# Lab 10: Word Embeddings

## Introduction
In this lab you'll learn how [Skip-Gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) is implemented. The skip-gram model trains a single-layer neural network to predict the surrounding words given a center word, so that the network learns which words are likely to appear in the context of a given word. The training data consists of (center, context) word pairs, where the context words are those that appear within a fixed window around the center word. The example below shows the word pairs created using different center words and a window size of 2.

![trainingWordPairs](http://mccormickml.com/assets/word2vec/training_data.png)
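
The pairing scheme in the figure can be sketched in a few lines of plain Python. This is only an illustration: the function name `skipgram_pairs` and the toy sentence are assumptions made for the example, and the lab's own generateObservations (which uses a window of 3) is defined later.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # Clamp the window so it never reaches outside the sentence
        lo = max(0, center_pos - window_size)
        hi = min(len(tokens), center_pos + window_size + 1)
        for context_pos in range(lo, hi):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, tokens[context_pos]))
    return pairs


sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window_size=2)
for pair in pairs:
    print(pair)
```

Words near the start or end of a sentence simply get fewer pairs, because part of their window falls outside the sentence.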


[Cool](https://www.youtube.com/watch?v=A8q8PXoJwVk). So how does this model actually work? The model has two main moving parts: two sets of weights, $V$ and $V^{\prime}$, which hold the center word and context word embeddings respectively. The two matrices have separate weights, and each is in $\mathbb{R}^{v \times e}$, where $v$ is the size of the vocabulary and $e$ is the embedding dimension (a hyperparameter you choose).

The model learns to maximize the following objective for each (center, context) pair. In the code below, the loss that gets minimized is the negative of this quantity.

$$L = \log(\sigma(v^{\prime}_{c_o}v_{c_e}^{T})) + \sum_{(c_o,c_e) \in \bar{D}} \log(\sigma(-v^{\prime}_{c_o}v_{c_e}^{T}))$$

Here $c_o$ and $c_e$ are the context and center words respectively, $v$ and $v^{\prime}$ are the center and context embeddings respectively, and $\bar{D}$ is the set of word pairs in which $c_o$ is a negatively sampled context word rather than an observed one.
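
To make the expression concrete, the sketch below evaluates it by hand for one center word, one observed context word, and two negative samples. The 2-dimensional vectors and the helper names (`log_sigmoid`, `dot`, `negative_sampling_objective`) are made up for illustration and are not part of the lab code. Higher values of the expression mean a better fit, so gradient descent works with its negation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_objective(center_vec, context_vec, negative_vecs):
    # Observed pair: push the dot product up
    positive_term = log_sigmoid(dot(context_vec, center_vec))
    # Negative samples: push each dot product down
    negative_term = sum(log_sigmoid(-dot(neg, center_vec)) for neg in negative_vecs)
    return positive_term + negative_term


center = [1.0, 0.0]                     # hypothetical center embedding
context = [2.0, 0.0]                    # hypothetical observed context embedding
negatives = [[-2.0, 0.0], [0.0, 1.0]]   # hypothetical negative context embeddings

objective = negative_sampling_objective(center, context, negatives)
loss = -objective  # the quantity a gradient-descent optimizer would minimize
print(objective, loss)
```

Note that the log-sigmoid is applied to each negative sample separately and the results are then summed; applying it once to the summed scores would be a different function.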

## Negative Sampling
Please refer to [this](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) tutorial to understand more about negative sampling. You don't have to build the unigram table but you'll need to know how it's used.
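
As a rough sketch of how a unigram table is used, the example below builds a small table in which each word occupies a number of slots proportional to its count raised to the 0.75 power, then draws negative samples by picking slots uniformly at random. The function name `build_unigram_table`, the `table_size` parameter, and the toy counts are assumptions for illustration; the table in the lab code is sized differently.

```python
import random


def build_unigram_table(word_counts, table_size=1000, power=0.75):
    """Fill a table so each word's share of slots is proportional to count**power."""
    weights = {word: count ** power for word, count in word_counts.items()}
    total = sum(weights.values())
    table = []
    for word, weight in weights.items():
        table.extend([word] * round(table_size * weight / total))
    return table


counts = {"the": 100, "cat": 10, "sat": 1}  # toy corpus counts
table = build_unigram_table(counts)

rng = random.Random(0)  # seeded only to make the example repeatable
negatives = [rng.choice(table) for _ in range(5)]
print(negatives)
```

With these counts, "the" makes up about 90% of the corpus but only about 83% of the table, so the 0.75 exponent shifts some sampling probability toward the rarer words.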

## Data
There are two datasets extracted for you. One is from the AP news data, and the other is a pull from PubMed. We'll train two sets of word embeddings and compare them at the end. There is also a test corpus, which you can use for debugging and getting the model to run. Extra points ($\leq 0$) if you can [guess](www.google.com) where the corpus comes from.

## Installs
tqdm is a nice wrapper for loops to check your progress as you go

conda install -c conda-forge tqdm

ipywidgets makes tqdm look pretty

conda install -c conda-forge ipywidgets

Tokenization and NLP toolkit

conda install -c anaconda nltk 


## Janitorial Work
All of the data cleaning is handled for you, but please familiarize yourself with the objects created by extractVocabMappers, as you'll be using them in the code.
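
To get a feel for those objects before reading the real function, here is a simplified stand-in run on a toy corpus. The name `build_vocab` and the two toy documents are assumptions for this example only; the sketch mirrors the shape of the four return values (word-to-ID map, ID-to-word map, word counts, and the corpus with rare words replaced by `<UNK>`) rather than reproducing the lab code exactly.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(tokenized_corpus, min_count=0):
    """Simplified stand-in: keep words seen more than min_count times."""
    counts = Counter(tok for doc in tokenized_corpus for tok in doc).most_common()
    kept = [(word, count) for word, count in counts if count > min_count]
    vocabulary = [word for word, _ in kept] + [UNK]
    word2idx = {word: idx for idx, word in enumerate(vocabulary)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    # Replace every out-of-vocabulary token with the <UNK> placeholder
    mapped = [[tok if tok in word2idx else UNK for tok in doc]
              for doc in tokenized_corpus]
    return word2idx, idx2word, kept, mapped


docs = [["look", "mean", "look"], ["look", "warrior"]]
word2idx, idx2word, counts, mapped = build_vocab(docs, min_count=1)
print(word2idx)
print(counts)
print(mapped)
```

Here only "look" occurs more than once, so it is the single real vocabulary entry and every other token is mapped to `<UNK>`.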

In [1]:
testCorpus = ["First of all, quit grinnin’ like an idiot. Indians ain’t supposed to smile like that. Get stoic.",
              "No. Like this. You gotta look mean, or people won’t respect you.",
              " people will run all over you if you don’t look mean.",
              "You gotta look like a warrior. You gotta look like you just came back from killing a buffalo.",
              "But our tribe never hunted buffalo. We were fishermen.",
              "What? You wanna look like you just came back from catching a fish?",
              "This ain’t dances with salmon, you know. Thomas, you gotta look like a warrior."]

# NOTE: reduce this number if you can't get things to run quickly.
maxDocs = 2000

In [2]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ob2285/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Read the pubmed corpus into a list of strings, one string per abstract file

import glob
pubMedDataFolderPath = "data/pubMed_corpus/"
pubMedDataFiles = glob.glob(pubMedDataFolderPath + "*.txt")
pubMedCorpus = [""]*len(pubMedDataFiles)
for idx, pubMedDataPath in enumerate(pubMedDataFiles):
    with open(pubMedDataPath, "r") as pubMedFile:
        text = pubMedFile.read().strip()
        pubMedCorpus[idx] = text
pubMedCorpus = pubMedCorpus[0:maxDocs]
print("{} pub med abstracts".format(len(pubMedCorpus)))

1767 pub med abstracts


In [4]:
# Read in the ap corpus
apTextFile = "data/ap.txt"
apCorpus = []
readText = False
with open(apTextFile) as apDataFile:
    for line in apDataFile:
        if readText:
            apCorpus.append(line.strip())
            readText = False
        if line == "<TEXT>\n":
            readText = True
apCorpus = apCorpus[0:maxDocs]
print("{} ap articles".format(len(apCorpus)))

2000 ap articles


In [19]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

def removePunctuation(myStr):
    # Drop ASCII punctuation (which already includes "%") plus the curly apostrophe
    excludedCharacters = set(string.punctuation + "’")
    newStr = "".join(char for char in myStr if char not in excludedCharacters)
    return(newStr)
def removeStopWords(tokenList):
    # Build the stop word set once instead of re-reading the list for every token
    myStopWords = set(stopwords.words('english'))
    newTokenList = [tok for tok in tokenList if tok not in myStopWords]
    return(newTokenList)
def cleanDocStr(docStr):
    docStr = docStr.lower()
    docStr = removePunctuation(docStr)
    # Replace every digit with the placeholder %d%
    docStr = re.sub(r'\d', '%d%', docStr)
    docStrTokenized = word_tokenize(docStr)
    docStrTokenized = removeStopWords(docStrTokenized)
    return(docStrTokenized)
def tokenize_corpus(corpus):
    # Clean and tokenize every document in the corpus
    tokens = [cleanDocStr(x) for x in corpus]
    return tokens

apCorpusTokenized = tokenize_corpus(apCorpus)
pubMedCorpusTokenized = tokenize_corpus(pubMedCorpus)
testCorpusTokenized = tokenize_corpus(testCorpus)

[nltk_data] Downloading package stopwords to /home/ob2285/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ob2285/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [20]:
import time
from tqdm import tqdm, tqdm_notebook
from collections import Counter

minVocabOccurence = 5

def extractVocabMappers(tokenizedCorpus, minVocabOccurence = 0):
    """
    Description: Builds the vocabulary of a tokenized corpus and maps every
        out-of-vocabulary token to <UNK>.
    Input:
        tokenizedCorpus (list(list(str))): A list where each index is a document from the corpus.
            Each document is further tokenized into a list of tokens.
            [doc1, doc2,...] where doc1 = [tok1, tok2, ...]
        minVocabOccurence (int): A word must occur more than this many times to be kept
            in the vocabulary
    Output:
        word2idx (dict): A dictionary mapping each word to its integer ID
        idx2word (dict): A dictionary mapping each integer ID to its word
        wordCounts (list(tuples)): A list of (word, count) tuples for each vocabulary word,
            sorted from most to least frequent
        newTokenizedCorpus (list(list(str))): Same as tokenizedCorpus, but out-of-vocabulary
            terms are mapped to <UNK>
    """
    UNK = "<UNK>"
    flattenedCorpus = [item for sublist in tokenizedCorpus for item in sublist]
    wordCounts = Counter(flattenedCorpus).most_common()
    wordCounts = [(w, c) for w,c in wordCounts if c > minVocabOccurence]
#     wordCounts = wordCounts.most_common(vocabSizeMax)
    vocabulary = [word for word, count in wordCounts]
    
    # below is more readable but significantly slower code
    if False:
        vocabulary = []
        for sentence in tqdm(tokenizedCorpus):
            for token in sentence:
                if token not in vocabulary:
                    vocabulary.append(token)
    vocabulary.append(UNK)
    print("Vocab size: {}".format(len(vocabulary)))
    word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
    idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}
    newTokenizedCorpus = []# all words missing from vocab replaced with <UNK>
    for doc in tokenizedCorpus:
        newDoc = [word if word in word2idx else UNK for word in doc]
        newTokenizedCorpus.append(newDoc)
    return(word2idx, idx2word, wordCounts, newTokenizedCorpus)

start = time.time()
print("Building ap corpus vocabulary")
word2Idx_ap, idx2Word_ap, vocabCount_ap, finalTokenizedCorpus_ap = extractVocabMappers(apCorpusTokenized,
                                                                                      minVocabOccurence = minVocabOccurence)
print("ap data tokenized in {} seconds\n".format(time.time() - start))
start = time.time()
print("Building pubMed corpus vocabulary")
word2Idx_pubMed, idx2Word_pubMed, vocabCount_pubMed, finalTokenizedCorpus_pubMed = extractVocabMappers(pubMedCorpusTokenized,
                                                                                                      minVocabOccurence = minVocabOccurence)
print("pubmed data tokenized in {} seconds\n".format(time.time() - start))
start = time.time()
print("Building test corpus vocabulary")
word2Idx_test, idx2Word_test, vocabCount_test, finalTokenizedCorpus_test = extractVocabMappers(testCorpusTokenized,
                                                                                              minVocabOccurence = 0)
print("test data tokenized in {} seconds".format(time.time() - start))

Building ap corpus vocabulary
Vocab size: 9933
ap data tokenized in 0.2616081237792969 seconds

Building pubMed corpus vocabulary
Vocab size: 4994
pubmed data tokenized in 0.10646986961364746 seconds

Building test corpus vocabulary
Vocab size: 38
test data tokenized in 0.00021076202392578125 seconds


## Word2Vec Implementation

In [21]:
import numpy as np
import torch
from torch import nn
import random

In [39]:
##### BATCH VERSION ######


def generateObservations(tokenizedCorpus, word2Idx):
    """
    Description: Iterates through every token in the corpus and creates a (center, context)
        pair for each context word in the window on either side of the center word. Please
        refer to the first figure to understand how word pairs are created.
    Input:
        tokenizedCorpus (list(list(str))): A list where each index is a document from the corpus.
            Each document is further tokenized into a list of tokens.
            [doc1, doc2,...] where doc1 = [tok1, tok2, ...]
        word2Idx (dict): A dictionary mapping words to their integer IDs (not used here;
            words are mapped to IDs later, during training)
    Output:
        idxPairs (numpy array): An array with one row per pair, where each row holds the
            (center word, context word) strings
    """
    window_size = 3
    idxPairs = []
    for sentence in tokenizedCorpus:
        for center_word_pos in range(len(sentence)):
            # Your code here
            # for each offset in the window
            for w in range(-window_size, window_size + 1):
                context_word_pos = center_word_pos + w
                # skip positions outside the sentence and the center word itself
                if context_word_pos < 0 or context_word_pos >= len(sentence) or center_word_pos == context_word_pos:
                    continue
                idxPairs.append((sentence[center_word_pos], sentence[context_word_pos]))
            # End your code
    idxPairs = np.array(idxPairs)
    return(idxPairs)


def generateWordSamplingUnigramTable(vocabCount, word2Idx):
    """
    Description: Generates a unigram table to sample negative examples from. The table
        contains each vocabulary index multiple times. The number of times an index
        appears is dictated by its sample probability, so drawing uniformly from the
        table samples words with those probabilities.
    Input:
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping words to their integer IDs
    Output:
        unigram_table (list(int)): A list of integers as described above. For example
        in a 3 word vocabulary it might look something like [0,0,1,1,1,1,1,1,1,2].
        Sampling from the previous example will mean that 0 is sampled 2/10 times,
        1 is sampled 7/10 times, and 2 is sampled 1/10 times.
    """
    unigram_table = []
    numWords = np.sum([count for word, count in vocabCount])
    for w,c in vocabCount:
        # Each word gets int(((c/numWords)**0.75)/0.001) slots, so a word whose
        # weighted frequency is below 0.001 gets no slots and is never sampled
        unigram_table.extend([word2Idx[w]] * int(((c/numWords)**0.75)/0.001))
    return(unigram_table)
    
class SkipGram(nn.Module):
    """
    Description: Instantiates and implements the forward pass of the skip gram
        algorithm with negative sampling.
    Input:
        vocabSize (int): Number of words to create embeddings for
        embedSize (int): Dimension of word embeddings
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus. Used to build the unigram table
        word2Idx (dict): A dictionary mapping words to their integer IDs
    """
    def __init__(self, vocabSize, embedSize, vocabCount, word2Idx):
        super(SkipGram, self).__init__()
        self.vocabSize = vocabSize
        self.word2Idx = word2Idx
        # Your code here
        # Init the center and context embedding matrices. These are learnable parameters
        self.centerEmbeddings = nn.Parameter(torch.randn(vocabSize, embedSize))
        self.contextEmbeddings = nn.Parameter(torch.randn(vocabSize, embedSize))
        # End your code
        # Overwrite the random normal values with Xavier-uniform initialization
        nn.init.xavier_uniform_(self.contextEmbeddings)
        nn.init.xavier_uniform_(self.centerEmbeddings)

        self.unigram_table = generateWordSamplingUnigramTable(vocabCount, word2Idx)
        self.logSigmoid = nn.LogSigmoid()
    def getNegSample(self, k, centerWords):
        """
        Description: Randomly selects negative samples from the vocabulary. Uses
            self.unigram_table to sample words.
        Input:
            k (int): Number of negative samples to select per center word
            centerWords (list(str)): A list of the string center words. There should
                be batchSize of these.
        Output:
            negSamples (list(list(int))): A list of batchSize lists, each containing the
                indices of the k negative samples for one center word
        """
        negSamples = []
        for centerWord in centerWords:
            # Your code here
            # Using self.unigram_table, sample indices to use as your negative samples.
            # Be sure that the negative samples for each center word don't contain the
            # center word itself. Shouldn't happen often, but just to be sure.
            centerIdx = self.word2Idx[centerWord]
            negSample = random.sample(self.unigram_table, k)
            while centerIdx in negSample:
                negSample = random.sample(self.unigram_table, k)
            negSamples.append(negSample)
            # End your code
        return(negSamples)
    def forward(self, center, context, negSampleIndices):
        """
        Description: Forward pass for the skipgram model.
        Input:
            center (list(int)): A list of word integer IDs indicating all
                batchSize center words. Matches one to one with context
            context (list(int)): A list of word integer IDs indicating all
                batchSize context words. Matches one to one with center
            negSampleIndices (list(list(int))): A list of batchSize lists, each
                containing the indices of the negative samples for one center word.
                Returned by getNegSample()
        Output:
            logProb (tensor): The loss (the negated objective) averaged over the batch.
        """
        # Your Code
        # implement a forward pass of the model. Be sure to allow for varying batch sizes
        embedCenter = self.centerEmbeddings[center]      # (batchSize, embedSize)
        embedContext = self.contextEmbeddings[context]   # (batchSize, embedSize)
        # log-sigmoid of the dot product for each observed (center, context) pair
        posVal = self.logSigmoid(torch.sum(embedContext * embedCenter, dim = 1))
        negSampleIndices = torch.LongTensor(negSampleIndices)  # (batchSize, k)
        # dot product of each negative sample with its center word: (batchSize, k)
        negVal = torch.bmm(self.contextEmbeddings[negSampleIndices], embedCenter.unsqueeze(2)).squeeze(2)
        # sum log-sigmoid(-score) over the k negative samples, as in the objective;
        # applying log-sigmoid to the summed scores instead would not match it
        negVal = torch.sum(self.logSigmoid(-negVal), dim = 1)
        logProb = -(posVal + negVal).mean()
        # End your code
        return(logProb)


def train_skipgram(embeddingSize, trainingData, vocabCount, word2Idx, idx2Word,
                   k, referenceWords, batchSize = 1024):
    """
    Description: Instantiates and trains a skipgram model. The forward pass of the skipgram
        model computes the loss, so all you have to do here is backpropagate the loss and
        update the weights. Reads the global variables learning_rate and n_epoch.
    Input:
        embeddingSize (int): Size of each word embedding
        trainingData (numpy array): The (center word, context word) pairs generated by
            generateObservations()
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping each word to its integer ID
        idx2Word (dict): A dictionary mapping each integer ID to its word
        k (int): Dictates the number of samples used during negative sampling
        referenceWords (list(str)): A list of words to compare word embeddings for
        batchSize (int): The number of (center, context) words to run through each forward
            pass of the skipgram model.
    Output:
        model (SkipGram): The final trained SkipGram model
    """
    print("training on {} observations".format(len(trainingData)))
    model = SkipGram(vocabSize = len(word2Idx), embedSize = embeddingSize,
                     vocabCount = vocabCount, word2Idx = word2Idx)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # Show the nearest neighbours of the untrained model as a baseline
    listNearestWords(model = model, idx2Word = idx2Word,
                     referenceWords = referenceWords, topN = 5)
    for epoch in tqdm_notebook(range(n_epoch), position = 0):
        total_loss = 0.0
        avgLoss = 0.0
        iteration = 0
        for step in tqdm_notebook(range(0, len(trainingData), batchSize), position = 1):
            # Slicing past the end of the array is safe, so the last batch may be smaller
            myBatch = trainingData[step:(step+batchSize)]
            centerWords = [elem[0] for elem in myBatch]
            contextWords = [elem[1] for elem in myBatch]
            negSamples = model.getNegSample(k = k, centerWords = centerWords)
            centerIDs = [word2Idx[word] for word in centerWords]
            contextIDs = [word2Idx[word] for word in contextWords]
            model.zero_grad()
            loss = model(centerIDs, contextIDs, negSampleIndices = negSamples)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            avgLoss += loss.item()
            iteration += 1
            if iteration % 500 == 0:
                # Report the mean loss over the last 500 batches, then reset the accumulator
                print("avg loss: {}".format(avgLoss/500))
                avgLoss = 0.0
        print("Loss at epoch {}: {}".format(epoch, total_loss/iteration))
        listNearestWords(model = model, idx2Word = idx2Word,
                         referenceWords = referenceWords, topN = 5)
    return(model)

In [40]:
from scipy.spatial.distance import cdist
def listNearestWords(model, idx2Word, referenceWords, topN):
    """
    Description: Lists the topN closest words by cosine distance to each word in
        referenceWords. Each reference word is itself the first entry, at distance ~0.
    Input:
        model (SkipGram): A SkipGram model (trained or untrained)
        idx2Word (dict): A dictionary mapping each integer ID to its word
        referenceWords (list(str)): A list of words in the vocabulary of the model
        topN (int): The number of closest words to print
    Output:
        None: Just prints
    """
    assert len(idx2Word) == len(model.word2Idx), "Possibly passed in two different vocabularies"
    embeddings = model.centerEmbeddings.data.numpy()
    distMat = cdist(embeddings, embeddings, metric = "cosine")
    # Your code here
    # print the topN closest words to each word in referenceWords
    for word in referenceWords:
        wordIdx = model.word2Idx[word]
        closestIndices = np.argsort(distMat[wordIdx,:])[0:topN]
        closestWords = [(idx2Word[idx], distMat[wordIdx, idx]) for idx in closestIndices]
        for elem in closestWords:
            print(elem)
        print("*"*50 + "\n")
    # End your code
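
The ranking that listNearestWords performs can be illustrated without scipy. The sketch below computes cosine distance by hand on three made-up 2-dimensional embeddings; the helper names `cosine_distance` and `nearest_words` and the toy vectors are assumptions for this example, not part of the lab code.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the quantity cdist returns for metric='cosine'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_words(embeddings, query, top_n):
    """Rank every word by its cosine distance to the query word."""
    distances = [(word, cosine_distance(embeddings[query], vec))
                 for word, vec in embeddings.items()]
    return sorted(distances, key=lambda pair: pair[1])[:top_n]


toy = {"salmon": [1.0, 0.1], "fish": [0.9, 0.2], "warrior": [-0.2, 1.0]}
result = nearest_words(toy, "salmon", top_n=2)
for word, distance in result:
    print(word, distance)
```

The query word always ranks first because its distance to itself is zero up to floating point error, which is why the printed lists in this lab start with entries such as ('bush', 0.0).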

In [41]:
# embd_size = 100
# learning_rate = 0.001
# n_epoch = 60
# idxPairsTest = generateObservations(tokenizedCorpus = finalTokenizedCorpus_test, word2Idx = word2Idx_test)
# sg_model = train_skipgram(embeddingSize = 5, trainingData = idxPairsTest, vocabCount = vocabCount_test,
#                                      word2Idx = word2Idx_test, idx2Word = idx2Word_test, k = 10,
#                                     referenceWords = ["thomas", "salmon"])

In [None]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 10
idxPairsAP = generateObservations(tokenizedCorpus = finalTokenizedCorpus_ap, word2Idx = word2Idx_ap)
sg_model_ap = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsAP,
                                     vocabCount = vocabCount_ap,
                                     word2Idx = word2Idx_ap, idx2Word = idx2Word_ap, k = 20,
                                          referenceWords = ["bush", "soviet", "president", "economy", "american"])

training on 3514136 observations
('bush', 0.0)
('reputation', 0.4840850429783716)
('howard', 0.5417328610955479)
('letters', 0.5488030104134252)
('zinoviev', 0.5513447167183296)
**************************************************

('soviet', 2.220446049250313e-16)
('horn', 0.4431624164607766)
('requirement', 0.46700145912389945)
('grumman', 0.526616686795899)
('tools', 0.5274425114680998)
**************************************************

('president', 0.0)
('regulations', 0.4307966150103433)
('shrugged', 0.46149342045672037)
('bleeding', 0.47274395925607204)
('gunshot', 0.5145024110504706)
**************************************************

('economy', 0.0)
('applause', 0.455820987226346)
('kohl', 0.5376935100428473)
('thunderstorms', 0.5393534450014843)
('collectors', 0.5436018864247174)
**************************************************

('american', 0.0)
('onto', 0.5428843634880327)
('andre', 0.5485981542125034)
('south', 0.5600908087908752)
('sellers', 0.5607624853950721)
********


In [None]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 10
idxPairsPubMed = generateObservations(tokenizedCorpus = finalTokenizedCorpus_pubMed, word2Idx = word2Idx_pubMed)
sg_model_pubMed = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsPubMed,
                                     vocabCount = vocabCount_pubMed,
                                     word2Idx = word2Idx_pubMed, idx2Word = idx2Word_pubMed, k = 20,
                                                  referenceWords = ["clinical", "obesity", "microbial", "microbiome"])

In [None]:
##### BATCH VERSION ######


def generateObservations(tokenizedCorpus, word2Idx):
    """
    Description: Iterates through every token in the corpus and creates a (center, context)
        pair for each context word in the window on either side of the center word. Please
        refer to the first figure to understand how word pairs are created.
    Input:
        tokenizedCorpus (list(list(str))): A list where each index is a document from the corpus.
            Each document is further tokenized into a list of tokens.
            [doc1, doc2,...] where doc1 = [tok1, tok2, ...]
        word2Idx (dict): A dictionary mapping words to their integer IDs (not used here;
            words are mapped to IDs later, during training)
    Output:
        idxPairs (numpy array): An array with one row per pair, where each row holds the
            (center word, context word) strings
    """
    window_size = 3
    idxPairs = []
    for sentence in tokenizedCorpus:
        for center_word_pos in range(len(sentence)):
            # Your code here
            # for each offset in the window
            for w in range(-window_size, window_size + 1):
                context_word_pos = center_word_pos + w
                # skip positions outside the sentence and the center word itself
                if context_word_pos < 0 or context_word_pos >= len(sentence) or center_word_pos == context_word_pos:
                    continue
                idxPairs.append((sentence[center_word_pos], sentence[context_word_pos]))
            # End your code
    idxPairs = np.array(idxPairs)
    return(idxPairs)


def generateWordSamplingUnigramTable(vocabCount, word2Idx):
    """
    Description: Generates a unigram table to sample negative examples from. The table
        contains each vocabulary index multiple times. The number of times an index
        appears is dictated by its sample probability, so drawing uniformly from the
        table samples words with those probabilities.
    Input:
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping words to their integer IDs
    Output:
        unigram_table (list(int)): A list of integers as described above. For example
        in a 3 word vocabulary it might look something like [0,0,1,1,1,1,1,1,1,2].
        Sampling from the previous example will mean that 0 is sampled 2/10 times,
        1 is sampled 7/10 times, and 2 is sampled 1/10 times.
    """
    unigram_table = []
    numWords = np.sum([count for word, count in vocabCount])
    for w,c in vocabCount:
        # Each word gets int(((c/numWords)**0.75)/0.001) slots, so a word whose
        # weighted frequency is below 0.001 gets no slots and is never sampled
        unigram_table.extend([word2Idx[w]] * int(((c/numWords)**0.75)/0.001))
    return(unigram_table)
    
class SkipGram(nn.Module):
    """
    Description: Instantiates and implements the forward pass of the skip gram
        algorithm with negative sampling.
    Input:
        vocabSize (int): Number of words to create embeddings for
        embedSize (int): Dimension of word embeddings
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus. Used to build the unigram table
        word2Idx (dict): A dictionary mapping words to their integer IDs
    """
    def __init__(self, vocabSize, embedSize, vocabCount, word2Idx):
        super(SkipGram, self).__init__()
        self.vocabSize = vocabSize
        self.word2Idx = word2Idx
        # Your code here
        # Init the center and context embedding matrices. These are learnable parameters
        self.centerEmbeddings = nn.Parameter(torch.randn(vocabSize, embedSize))
        self.contextEmbeddings = nn.Parameter(torch.randn(vocabSize, embedSize))
        # End your code
        # Overwrite the random normal values with Xavier-uniform initialization
        nn.init.xavier_uniform_(self.contextEmbeddings)
        nn.init.xavier_uniform_(self.centerEmbeddings)

        self.unigram_table = generateWordSamplingUnigramTable(vocabCount, word2Idx)
        self.logSigmoid = nn.LogSigmoid()
    def getNegSample(self, k, centerWords):
        """
        Description: Randomly selects negative samples from the vocabulary. Uses
            self.unigram_table to sample words.
        Input:
            k (int): Number of negative samples to select per center word
            centerWords (list(str)): A list of the string center words. There should
                be batchSize of these.
        Output:
            negSamples (list(list(int))): A list of batchSize lists, each containing the
                indices of the k negative samples for one center word
        """
        negSamples = []
        for centerWord in centerWords:
            # Your code here
            # Using self.unigram_table, sample indices to use as your negative samples.
            # Be sure that the negative samples for each center word don't contain the
            # center word itself. Shouldn't happen often, but just to be sure.
            centerIdx = self.word2Idx[centerWord]
            negSample = random.sample(self.unigram_table, k)
            while centerIdx in negSample:
                negSample = random.sample(self.unigram_table, k)
            negSamples.append(negSample)
            # End your code
        return(negSamples)
    def forward(self, center, context, negSampleIndices):
        """
        Description: Forward pass for the skipgram model.
        Input:
            center (list(int)): A list of word integer IDs indicating all
                batchSize center words. Matches one to one with context
            context (list(int)): A list of word integer IDs indicating all
                batchSize context words. Matches one to one with center
            negSampleIndices (list(list(int))): A list of batchSize lists, each
                containing the indices of the negative samples for one center word.
                Returned by getNegSample()
        Output:
            logProb (tensor): The loss (the negated objective) averaged over the batch.
        """
        # Your Code
        # implement a forward pass of the model. Be sure to allow for varying batch sizes
        embedCenter = self.centerEmbeddings[center]      # (batchSize, embedSize)
        embedContext = self.contextEmbeddings[context]   # (batchSize, embedSize)
        # log-sigmoid of the dot product for each observed (center, context) pair
        posVal = self.logSigmoid(torch.sum(embedContext * embedCenter, dim = 1))
        negSampleIndices = torch.LongTensor(negSampleIndices)  # (batchSize, k)
        # dot product of each negative sample with its center word: (batchSize, k).
        # squeeze(2) drops only the last axis, so a batch of size 1 keeps its shape
        negVal = torch.bmm(self.contextEmbeddings[negSampleIndices], embedCenter.unsqueeze(2)).squeeze(2)
        # sum log-sigmoid(-score) over the k negative samples, as in the objective
        negVal = torch.sum(self.logSigmoid(-negVal), dim = 1)
        logProb = -(posVal + negVal).mean()
        # End your code
        return(logProb)


def train_skipgram(embeddingSize, trainingData, vocabCount, word2Idx, idx2Word,
                   k, referenceWords, batchSize = 1024):
    """
    Decription: Instantiates and trains a skipgam model. The forward pass of the skipgram mode
        handles the forward pass so all you have to do here is handle the loss, and
        updating the weights.
    Input:
        embeddingSize (int): Size of each word embedding
        trainingData (list(tuples)): A list of tuples generated by generateWordSamplingUnigramTable()
            where each tuple is a center and context word
        vocabCount (list(tuples)): A list of tuples mapping each vocab to its count in the
            corpus
        word2Idx (dict): A dictionary mapping each word to its integer ID
        idx2Word (dict): A dictionary mapping each integer ID to its word
        k (int): The number of negative samples drawn for each (center, context) pair
        referenceWords (list(str)): A list of words to compare word embeddings for
        batchSize (int): The number of (center, context) words to run through each forward
            pass of the skipgram model.
    Output:
        model (SkipGram): The final trained SkipGram model
    """
    print("training on {} observations".format(len(trainingData)))
    model = SkipGram(vocabSize = len(word2Idx), embedSize = embeddingSize,
                     vocabCount = vocabCount, word2Idx = word2Idx)
    # Plain SGD. learning_rate and n_epoch are globals set in the cell that calls this function.
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # Show the nearest neighbors before training as a baseline
    listNearestWords(model = model, idx2Word = idx2Word,
                     referenceWords = referenceWords, topN = 5)
    for epoch in tqdm_notebook(range(n_epoch), position = 0):
        total_loss = .0
        avgLoss = 0.0
        iteration = 0
        for step in tqdm_notebook(range(0, len(trainingData), batchSize), position = 1):
            # Slicing past the end of a list is safe, so the final batch may simply be smaller
            myBatch = trainingData[step:(step+batchSize)]
            centerWords = [elem[0] for elem in myBatch]
            contextWords = [elem[1] for elem in myBatch]
            negSamples = model.getNegSample(k = k, centerWords = centerWords)
            centerIDs = [word2Idx[idx] for idx in centerWords]
            contextIDs = [word2Idx[idx] for idx in contextWords]
            model.zero_grad()
            loss = model(centerIDs, contextIDs, negSampleIndices = negSamples)
            
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            avgLoss += loss.item()
            iteration += 1
            if iteration % 500 == 0:
                # Report the mean loss over the last 500 batches, then reset the running sum
                print("avg loss: {}".format(avgLoss/500))
                avgLoss = 0.0
        print("Loss at epoch {}: {}".format(epoch, total_loss/iteration))
        if epoch % 1 == 0:
            listNearestWords(model = model, idx2Word = idx2Word,
                         referenceWords = referenceWords, topN = 5)
    return(model)
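
When debugging the forward pass, it can help to cross-check the batch loss against a plain NumPy version. The sketch below is not part of the lab code: `log_sigmoid` and `neg_sampling_loss` are hypothetical helper names, and it assumes the center and context embeddings are available as `(vocab, embed)` arrays and the negative samples as a `(batch, k)` integer array.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(center_emb, context_emb, center_ids, context_ids, neg_ids):
    """Mean negative-sampling loss over a batch (hypothetical reference helper).

    center_emb, context_emb: (vocab, embed) float arrays
    center_ids, context_ids: (batch,) int arrays
    neg_ids: (batch, k) int array of negative context word IDs
    """
    v_center = center_emb[center_ids]        # (batch, embed)
    v_context = context_emb[context_ids]     # (batch, embed)
    v_neg = context_emb[neg_ids]             # (batch, k, embed)
    # Positive term: one log-sigmoid score per (center, context) pair
    pos = log_sigmoid(np.sum(v_context * v_center, axis=1))        # (batch,)
    # Negative term: dot every negative sample with its own center word
    neg_scores = np.einsum("bke,be->bk", v_neg, v_center)          # (batch, k)
    neg = np.sum(log_sigmoid(-neg_scores), axis=1)                 # (batch,)
    return -np.mean(pos + neg)

# Sanity check with all-zero embeddings: every dot product is 0, so each of the
# 1 + k terms contributes log(sigmoid(0)) = -log(2) to the per-pair sum.
vocab_size, embed_size, k = 10, 4, 5
zeros = np.zeros((vocab_size, embed_size))
rng = np.random.default_rng(0)
center_ids = np.array([0, 1, 2])
context_ids = np.array([3, 4, 5])
neg_ids = rng.integers(0, vocab_size, size=(3, k))
loss = neg_sampling_loss(zeros, zeros, center_ids, context_ids, neg_ids)
print(loss, (1 + k) * np.log(2))  # the two values should agree
```

With all-zero embeddings the loss is exactly $(1 + k)\log 2$, which for $k = 20$ is about 14.56. A model with small initial weights should start near that value, so a first reported loss far from it may point to a shape or sign error.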

In [None]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 10
idxPairsAP = generateObservations(tokenizedCorpus = finalTokenizedCorpus_ap, word2Idx = word2Idx_ap)
sg_model_ap = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsAP,
                             vocabCount = vocabCount_ap,
                             word2Idx = word2Idx_ap, idx2Word = idx2Word_ap, k = 20,
                             referenceWords = ["bush", "soviet", "president", "economy", "american"])

In [None]:
embeddingSize = 50
learning_rate = 0.01
n_epoch = 10
idxPairsAP = generateObservations(tokenizedCorpus = finalTokenizedCorpus_ap, word2Idx = word2Idx_ap)
sg_model_ap = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsAP,
                             vocabCount = vocabCount_ap,
                             word2Idx = word2Idx_ap, idx2Word = idx2Word_ap, k = 20,
                             referenceWords = ["bush", "soviet", "president", "economy", "american"])

In [35]:
embeddingSize = 50
learning_rate = 0.1
n_epoch = 60
idxPairsPubMed = generateObservations(tokenizedCorpus = finalTokenizedCorpus_pubMed, word2Idx = word2Idx_pubMed)
sg_model_pubMed = train_skipgram(embeddingSize = embeddingSize, trainingData = idxPairsPubMed,
                                 vocabCount = vocabCount_pubMed,
                                 word2Idx = word2Idx_pubMed, idx2Word = idx2Word_pubMed, k = 20,
                                 referenceWords = ["clinical", "obesity", "microbial", "microbiome"])

training on 1732584 observations
('clinical', 0.0)
('plants', 0.5420871165217046)
('testosterone', 0.5456948830163391)
('representations', 0.5462300435665581)
('kappa', 0.5473142146861234)
**************************************************

('obesity', 2.220446049250313e-16)
('caregivers', 0.531842442432588)
('later', 0.5556989785724507)
('substrate', 0.5634565534179742)
('demand', 0.5730475059978808)
**************************************************

('microbial', 0.0)
('running', 0.471812869974791)
('subgingival', 0.5096507723639621)
('probes', 0.5171420597951297)
('negex', 0.5324610490611787)
**************************************************

('microbiome', 0.0)
('ir', 0.531317384896125)
('altogether', 0.5536955875615477)
('antibacterial', 0.5603051657315941)
('review', 0.5621926293885555)
**************************************************




avg loss: 14.258970973968506
avg loss: 11.958345360752107
avg loss: 10.568374957933418
Loss at epoch 0: 11.99746179876598
('clinical', 0.0)
('data', 0.008837059314928286)
('health', 0.010145959788341208)
('activity', 0.01039178923932138)
('using', 0.011796748515758293)
**************************************************

('obesity', 0.0)
('microbial', 0.025882294899795988)
('study', 0.030091709967157043)
('used', 0.032194826166850876)
('physical', 0.03290342678705771)
**************************************************

('microbial', 0.0)
('study', 0.010835114262724876)
('microbiota', 0.013362522042441927)
('health', 0.014040121292628549)
('system', 0.015251843323560554)
**************************************************

('microbiome', 0.0)
('microbial', 0.016789383392108448)
('sedentary', 0.017026935503384744)
('system', 0.017956699076647653)
('health', 0.019610485146384415)
**************************************************




KeyboardInterrupt: 

## How Domains Affect Word Embeddings
Choose two words that appear in both the PubMed and AP vocabularies. For each word, compare its nearest neighbors in the PubMed embeddings with its nearest neighbors in the AP embeddings[.](https://www.youtube.com/watch?v=Tr-WrGcexlY) *Why might the two words you chose have different representations in the two domains? How might this affect downstream NLP tasks?*
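
One possible way to set up the comparison is sketched below. It is an illustration rather than the expected solution: `nearest_words` and `compare_domains` are hypothetical helpers, the toy vocabulary and vectors are made up, and it assumes you have already pulled each model's embedding matrix out as a NumPy array (how you do that depends on your `SkipGram` class). Distances are cosine distances, so a word is at distance 0 from itself.

```python
import numpy as np

def nearest_words(embeddings, word2idx, idx2word, word, top_n=5):
    """Return the top_n (word, cosine distance) pairs closest to `word`."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    query = word2idx[word]
    dist = 1.0 - unit @ unit[query]
    order = np.argsort(dist)
    # Skip the query word itself, which always sits at distance 0
    return [(idx2word[i], float(dist[i])) for i in order if i != query][:top_n]

def compare_domains(word, domain_a, domain_b, top_n=5):
    """Each domain is an (embeddings, word2idx, idx2word) triple."""
    neighbors_a = [w for w, _ in nearest_words(*domain_a, word, top_n)]
    neighbors_b = [w for w, _ in nearest_words(*domain_b, word, top_n)]
    shared = set(neighbors_a) & set(neighbors_b)
    return neighbors_a, neighbors_b, shared

# Toy data (made up): "cell" sits near "phone" in one space and near "tissue" in the other
vocab = ["cell", "phone", "tissue", "battery"]
w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for w, i in w2i.items()}
news_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
bio_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])

news_nn, bio_nn, shared = compare_domains(
    "cell", (news_emb, w2i, i2w), (bio_emb, w2i, i2w), top_n=1)
print(news_nn, bio_nn, shared)
```

The size of the overlap between the two neighbor lists gives a rough, informal signal of how differently the two corpora use a word.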

In [None]:
# Your code here