In [107]:
import numpy as np
import random
from utils.treebank import StanfordSentiment
from word2vec import getNegativeSamples

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## what is the dataset?

We have a dedicated class, `StanfordSentiment`, for handling the data. For simplicity we created toy files, including `datasetSentences_small.txt`, which contains the first 10 lines of the original file (a header line plus 9 reviews):

In [108]:
!head -n 3 ~/data/cs224n/stanfordSentimentTreebank/datasetSentences_small.txt

sentence_index	sentence
1	The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
2	The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .


`StanfordSentiment` has many methods. Here we briefly describe the main ones; the others are described in later chapters. The main fields are listed below (each of these fields has a corresponding method):

- `_tokens` - a dictionary that maps each unique word in our corpus to its index;
- `_sentences` - a list of lists, where each inner list is one review, normalized and split into words.

In [109]:
dataset = StanfordSentiment()
dataset.tokens()  # build dictionary of _tokens
list(dataset._tokens.items())[:5]

[('the', 0), ('rock', 1), ('is', 2), ('destined', 3), ('to', 4)]

In [110]:
dataset._sentences[0][:5]

['the', 'rock', 'is', 'destined', 'to']

We use the dataset to generate negative and positive examples, as described in detail below. Finally, we look at how the data are fed into our model.

## how do we sample negative words?

In [slp3, 6.8.2] we read: "The noise words are chosen according to their weighted unigram frequency $P_\alpha(w)$, where $\alpha$ is a weight". It is computed with $\alpha=.75$ using the formula:

$P_\alpha(w) = \frac{count(w)^\alpha}{\sum_{w'} count(w')^\alpha}$

To sample negative words we use `getNegativeSamples()`. The only purpose of this function is clearly stated in its description: "Samples K indexes which are not the outsideWordIdx". We should remember that: "... skipgram uses more negative examples than positive examples, the ratio set by a parameter k."

Internally, `getNegativeSamples()` calls `dataset.sampleTokenIdx()`, which simply picks a random token index from `sampleTable`.
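
The behaviour described above can be sketched as a small rejection loop. This is only a sketch under assumptions, not the actual code of `getNegativeSamples()`: `ToyDataset` and `get_negative_samples_sketch` are hypothetical names, and the toy table stands in for the real `sampleTable`.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for StanfordSentiment; it only offers sampleTokenIdx()."""

    def __init__(self, sample_table):
        self._sample_table = sample_table

    def sampleTokenIdx(self):
        # Pick a uniformly random slot and return the token index stored in it.
        return random.choice(self._sample_table)


def get_negative_samples_sketch(outsideWordIdx, dataset, K):
    """Draw K token indices, redrawing whenever the draw equals outsideWordIdx."""
    neg_sample_word_indices = []
    for _ in range(K):
        idx = dataset.sampleTokenIdx()
        while idx == outsideWordIdx:
            idx = dataset.sampleTokenIdx()
        neg_sample_word_indices.append(idx)
    return neg_sample_word_indices


# Toy table (an assumption): index 0 fills three slots, so it is drawn most often.
toy = ToyDataset([0, 0, 0, 1, 1, 2, 3, 3, 4])
samples = get_negative_samples_sketch(outsideWordIdx=0, dataset=toy, K=5)
print(samples)
```

In this sketch the same index may be drawn more than once; only the outside word itself is excluded.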

### what is `sampleTable`?

We have to build a table of weighted probabilities $P_\alpha$ using the formula above. We start from raw frequencies.

We can find word frequencies in `dataset._tokenfreq` (this is just the number of times a word appears in our corpus; for example, `the` appears 12 times):

In [111]:
list(dataset._tokenfreq.items())[:5]

[('the', 12), ('rock', 1), ('is', 4), ('destined', 1), ('to', 8)]

We now have to weight our frequencies (see `[slp3], 6.31`):

In [112]:
nTokens = len(dataset.tokens())
samplingFreq = np.zeros((nTokens,))
dataset.allSentences()
for i in range(nTokens):
    w = dataset._revtokens[i]  # word with index i
    if w in dataset._tokenfreq:
        # Reweight the raw count
        samplingFreq[i] = dataset._tokenfreq[w] ** 0.75
    else:
        samplingFreq[i] = 0.0

In [113]:
samplingFreq[:5]

array([6.44741959, 1.        , 2.82842712, 1.        , 4.75682846])

We may see that for `the` we have `6.44741959` instead of 12. Let's check that this is correct:

In [114]:
12 ** .75

6.4474195909412515
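
To see what the reweighting does to the probabilities, we can normalize the weighted counts as in the formula for $P_\alpha(w)$. The sketch below uses only the five counts shown earlier (`the`, `rock`, `is`, `destined`, `to`), so it normalizes over these five words rather than the whole vocabulary; that restriction is an assumption made for illustration.

```python
import numpy as np

# Raw counts for: the, rock, is, destined, to (taken from dataset._tokenfreq above)
counts = np.array([12, 1, 4, 1, 8], dtype=float)
alpha = 0.75

weighted = counts ** alpha
p_alpha = weighted / weighted.sum()  # weighted unigram probabilities
p_raw = counts / counts.sum()        # plain unigram probabilities, for comparison

print(np.round(p_raw, 3))
print(np.round(p_alpha, 3))
```

Compared with the plain unigram probabilities, the share of the frequent word `the` goes down and the share of rare words such as `rock` goes up.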

We could sample negative words directly from these weighted frequencies. Instead, we build another table that is much bigger than our vocabulary, presumably so that drawing a sample reduces to picking a uniformly random slot. For example, for our 135 words we may use a table of size 1000 (specified in the `__init__` method). This table contains a run of identical indices for each of our words (we skip the technical details of how it is built):

In [115]:
dataset.sampleTable()[30:50], dataset.sampleTable()[-10:]

([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
 [133, 133, 133, 133, 133, 134, 134, 134, 134, 134])
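
One way such a table could be built is to give each word a run of slots proportional to its weighted frequency. The code below is a minimal sketch under that assumption, not necessarily how `sampleTable()` is implemented; `build_sample_table` is a hypothetical helper and the table size of 20 is a toy value.

```python
import numpy as np


def build_sample_table(sampling_freq, table_size):
    """Fill table_size slots so that token i occupies a share of about P_alpha(i)."""
    probs = sampling_freq / sampling_freq.sum()
    # boundaries[i] is the first slot position that no longer belongs to token i.
    boundaries = np.cumsum(probs) * table_size
    table = [0] * table_size
    token = 0
    for slot in range(table_size):
        # Move on to the next token once the slot passes the current boundary.
        while token < len(boundaries) - 1 and slot >= boundaries[token]:
            token += 1
        table[slot] = token
    return table


# Weighted counts for: the, rock, is, destined, to
weighted = np.array([12, 1, 4, 1, 8], dtype=float) ** 0.75
table = build_sample_table(weighted, table_size=20)
print(table)
```

Picking a uniformly random slot of such a table then returns each index with a probability close to $P_\alpha(w)$; a larger table gives a finer approximation.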

Now we may sample from this table:

In [116]:
dataset.sampleTokenIdx()

127

In [117]:
getNegativeSamples(outsideWordIdx=0, dataset=dataset, K=5)

[132, 69, 96, 36, 4]

## how do we use the dataset?

We use the dataset to produce both positive and negative examples:  
- We produce negative examples with the helper function `getNegativeSamples()`, which in turn calls `dataset.sampleTokenIdx()`. This happens inside `negSamplingLossAndGradient`. The generation of negative samples is described above.
- We produce positive examples with `dataset.getRandomContext()` in `word2vec_sgd_wrapper`. `dataset.getRandomContext()` produces a pair `centerword, context`:

In [118]:
dataset.getRandomContext()

('mindset', ['film'])

We produce this pair at random: first we choose a sentence at random from all sentences, and then we choose a center word within it, also at random. In the example below the random draw gives `wordID=7`, so:

- `centerword=sent[7]` or `'21st'`;
- context with `window_size=2` is `['be', 'the', 'century', "'s"]`;

In [119]:
sent = dataset._sentences[0]
sent[:10]

['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s"]

In [120]:
random.seed(42)
wordID = random.randint(0, len(sent) - 1)
print(f'wordID={wordID}')
C=2  # window size
context = sent[max(0, wordID - C):wordID]
if wordID+1 < len(sent):
    context += sent[wordID+1:min(len(sent), wordID + C + 1)]

centerword = sent[wordID]
context = [w for w in context if w != centerword]

centerword, context

wordID=7


('21st', ['be', 'the', 'century', "'s"])

There are some technical details here as well. Instead of `_sentences`, the class uses `_allsentences`. What is the difference between them? This is not documented. One possible reason is to reduce the use of frequent words once more.
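
If the guess about frequent words is right, one common technique that would fit is word2vec-style subsampling, where each occurrence of a word is dropped with a probability that grows with the word's frequency. The sketch below illustrates that technique only; whether `_allsentences` is built this way is an assumption. `subsample_frequent` and `threshold_ratio` are hypothetical names, and the threshold is set unusually high so that the effect is visible on a toy corpus.

```python
import math
import random


def subsample_frequent(sentences, token_freq, threshold_ratio, seed=0):
    """Drop each word occurrence with probability max(0, 1 - sqrt(threshold / count))."""
    rng = random.Random(seed)
    threshold = threshold_ratio * sum(token_freq.values())
    reject_prob = {
        word: max(0.0, 1.0 - math.sqrt(threshold / count))
        for word, count in token_freq.items()
    }
    # Keep an occurrence when the random draw is at least its rejection probability.
    return [
        [w for w in sentence if rng.random() >= reject_prob[w]]
        for sentence in sentences
    ]


token_freq = {'the': 12, 'rock': 1, 'is': 4, 'destined': 1, 'to': 8}
sentences = [['the', 'rock', 'is', 'destined', 'to', 'the']] * 3
kept = subsample_frequent(sentences, token_freq, threshold_ratio=0.05)
print(kept)
```

With these toy numbers the rare words `rock` and `destined` have rejection probability 0 and are always kept, while occurrences of `the`, `is` and `to` are dropped at random.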

## how are data fed into our model?

So we see that:
- a positive example is a pair `centerword, context`;
- negative examples are a list of indices.

Remember that we store dictionary of the form `{word: index}` in `_tokens`.

In [121]:
list(dataset._tokens.items())[:5]

[('the', 0), ('rock', 1), ('is', 2), ('destined', 3), ('to', 4)]

In PyTorch, when we work with an `Embedding` layer, we use indices. Why is that? An explanation is given in [nlp-pytorch, chapter 5]: *"By definition, the weight matrix of a Linear layer that accepts as input this one-hot vector must have the same number of rows as the size of the one-hot vector. When you perform the matrix multiplication, as shown in Figure 5-1, the resulting vector is actually just selecting the row indicated by the non zero entry. Based on this observation, we can just skip the multiplication step and instead directly use an integer as an index to retrieve the selected row."*
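
The observation in the quote can be checked with a few lines of `numpy`. This is a small sketch with a made-up weight matrix `W` (a hypothetical 5-word vocabulary with 3-dimensional vectors), not code from the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
W = rng.standard_normal((vocab_size, dim))  # one row per word

idx = 2
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0

via_matmul = one_hot @ W  # multiply the one-hot vector by the weight matrix
via_lookup = W[idx]       # pick the row directly by its index

print(np.allclose(via_matmul, via_lookup))  # prints True
```

Both routes give the same vector, but the lookup avoids building the one-hot vector and performing the multiplication.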

So in the case of negative samples we already have a list of indices, which we use to retrieve vectors from `outsideVectors` in `negSamplingLossAndGradient`.

In [122]:
negSampleWordIndices = getNegativeSamples(outsideWordIdx=0, dataset=dataset, K=5)
negSampleWordIndices

[93, 25, 20, 18, 10]

In [133]:
np.random.seed(42)
outsideVectors = np.random.randn(len(dataset._tokens), 3)
outsideVectors.shape, outsideVectors[negSampleWordIndices[0], :]

((135, 3), array([-0.3853136 ,  0.11351735,  0.66213067]))

In [134]:
negSampledVect = outsideVectors[negSampleWordIndices, :]
negSampledVect

array([[-0.3853136 ,  0.11351735,  0.66213067],
       [ 0.8219025 ,  0.08704707, -0.29900735],
       [-0.47917424, -0.18565898, -1.10633497],
       [ 1.03099952,  0.93128012, -0.83921752],
       [-0.60170661,  1.85227818, -0.01349722]])

In the case of positive examples we just convert words to indices inside `skipgram`.

In [137]:
word2Ind = dataset._tokens
currentCenterWord = 'the'
currentCenterWordIdx = word2Ind[currentCenterWord]
currentCenterWordIdx

0
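
The same lookup applies to every word of the context. The sketch below uses a hand-written five-word dictionary in place of `dataset._tokens` and a made-up context list, so both are assumptions for illustration.

```python
# Hand-written stand-in for dataset._tokens (first five entries shown above)
word2Ind = {'the': 0, 'rock': 1, 'is': 2, 'destined': 3, 'to': 4}

context = ['rock', 'is', 'to']  # hypothetical context words
context_indices = [word2Ind[w] for w in context]
print(context_indices)  # prints [1, 2, 4]
```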