# Embeddings Lab

### Introduction 

In this lesson, we'll begin to explore how we can use embeddings.  We'll do so with the `TREC` question classification dataset.

### Loading our Data

We can begin by importing the `datasets` and `data` modules from torchtext.  Define `TEXT` as a field that tokenizes with `spacy`, and define `LABEL` as the field for our labels.

In [131]:
TEXT = None
LABEL = None

In [98]:
from torchtext import datasets, data
import torch

train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)

Ok, this time when building our vocabulary, let's use the `glove.6B.100d` vectors, and initialize the vectors of unknown words from the normal distribution.

In [99]:
# build vocab for text here
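To make the idea concrete, here is a minimal sketch in plain Python of what happens when a vocabulary is paired with pretrained vectors: words found in the pretrained table reuse their vector, and every other word gets values drawn from a normal distribution. The names `build_vectors`, `toy_pretrained`, and `toy_vocab` are hypothetical, and the 3-dimensional vectors are made up for illustration. In the legacy torchtext API this behavior is usually requested through the `vectors` and `unk_init` arguments of `build_vocab`.

```python
import random

def build_vectors(vocab_words, pretrained, dim, seed=0):
    """Return one vector per vocabulary word.

    Words found in `pretrained` reuse their pretrained vector; all other
    words get values sampled from a standard normal distribution.
    """
    rng = random.Random(seed)
    vectors = []
    for word in vocab_words:
        if word in pretrained:
            vectors.append(list(pretrained[word]))
        else:
            vectors.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return vectors

# Toy stand-in for GloVe: hypothetical 3-dimensional vectors.
toy_pretrained = {"dog": [0.1, 0.2, 0.3], "cat": [0.2, 0.1, 0.3]}
toy_vocab = ["<unk>", "<pad>", "dog", "cat", "zyzzyva"]
toy_vectors = build_vectors(toy_vocab, toy_pretrained, dim=3)
```

Note that the special tokens `<unk>` and `<pad>` are not in the pretrained table either, so they also receive random vectors in this sketch.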

Also build the vocabulary for the labels.

In [106]:
# do so here

In [133]:
LABEL.vocab.stoi
# defaultdict(None,
#             {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})

Ok, it's time to look at some of our word vectors.  Write a function called `word_to_vector` that, when given a string like `dog`, returns the related vector.

In [134]:
def word_to_vector(word):
    pass

In [10]:
word_to_vector("dog")[:5]
# tensor([ 0.3082,  0.3094,  0.5280, -0.9254, -0.7367])

tensor([ 0.3082,  0.3094,  0.5280, -0.9254, -0.7367])

In [11]:
word_to_vector("cat")[:5]
# tensor([ 0.2309,  0.2828,  0.6318, -0.5941, -0.5860])

tensor([ 0.2309,  0.2828,  0.6318, -0.5941, -0.5860])
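The lookup behind a function like `word_to_vector` has two steps: map the word to its integer index, then use that index to select a row of the vector table. The sketch below shows this with a hypothetical toy vocabulary (`toy_stoi`, `toy_vectors`) standing in for the real vocab object, and made-up 3-dimensional vectors.

```python
# Hypothetical toy vocabulary standing in for the real vocab object.
toy_stoi = {"<unk>": 0, "<pad>": 1, "dog": 2, "cat": 3}
toy_vectors = [
    [0.0, 0.0, 0.0],  # <unk>
    [0.0, 0.0, 0.0],  # <pad>
    [0.3, 0.3, 0.5],  # dog
    [0.2, 0.3, 0.6],  # cat
]

def toy_word_to_vector(word):
    """Map a word to its index, then use the index to select a row."""
    # Words that are not in the vocabulary fall back to the <unk> index.
    index = toy_stoi.get(word, toy_stoi["<unk>"])
    return toy_vectors[index]

dog_vector = toy_word_to_vector("dog")
unseen_vector = toy_word_to_vector("aardvark")
```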

Ok, use torch (or whatever you wish) to write a function that, when given two words, calculates the cosine similarity between their vectors.

In [135]:
def cosine_similarity(word_1, word_2):
    pass

In [17]:
cosine_similarity("dog", "cat")
# tensor([0.8798])

tensor([0.8798])

That looks close.

In [18]:
cosine_similarity("dog", "peanut")

tensor([0.3640])

Less close.  Remember that cosine similarity is the cosine of the angle between two vectors, and we compute it as the dot product of the two corresponding unit vectors.

$\cos(\theta) = \frac{a}{\|a\|} \cdot \frac{b}{\|b\|}$
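As a quick check of the formula, here is a small sketch in plain Python that works on lists of numbers rather than on words. The helper name `cosine` and the example vectors are made up for illustration. Vectors pointing in the same direction should score 1, and perpendicular vectors should score 0.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
perpendicular = cosine([1.0, 0.0], [0.0, 3.0])   # orthogonal vectors
```

Because each vector is divided by its own length, only the direction matters: `[1.0, 2.0]` and `[2.0, 4.0]` differ in magnitude but are treated as identical.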

### Incorporating an Embedding into a Neural Network

Now initialize a neural network that has only a single layer: an embedding layer.  The output of passing data through this layer should be returned from the `forward` method.

Set the dimensions of the embedding layer so that there is a different embedding for every word in the vocabulary, and the embedding dimension is equal to the dimension of our word vectors.

In [136]:
import torch.nn as nn

class Net(nn.Module):
    pass

Assign an instance of the neural network to `net`.

In [47]:
net = Net()
net
# Net(
#   (embed): Embedding(9343, 100)
# )

Net(
  (embed): Embedding(9343, 100)
)
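For reference, a network of this shape might look like the sketch below. It is only one possible layout: the class name `EmbeddingOnly` is hypothetical, and unlike the lab's `Net()` it takes the vocabulary size and embedding dimension as constructor arguments so that it runs without the dataset. The toy sizes (10 words, 4 dimensions) are made up.

```python
import torch
import torch.nn as nn

class EmbeddingOnly(nn.Module):
    """A network whose only layer is an embedding lookup."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # One row per word in the vocabulary, one column per vector feature.
        self.embed = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, token_ids):
        # Replace every integer id with its embedding vector.
        return self.embed(token_ids)

toy_net = EmbeddingOnly(vocab_size=10, embedding_dim=4)
toy_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
toy_output = toy_net(toy_ids)  # shape: (2, 3, 4)
```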

Ok, let's check that we have specified the dimensions correctly.  Use the `BucketIterator` to create batches of our data.  Set a batch size of 50.

In [137]:
train_iterator, test_iterator = None, None

In [110]:
for text_batch, label_batch in test_iterator:
    # stop after the first batch so we can inspect it
    break

We can see that in this batch there are only questions of length 4, although yours may be different.  The shape below is ordered as question length first, then batch size.

In [116]:
text_batch.shape

torch.Size([4, 50])
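A batch can contain only questions of one length because a bucketing iterator groups examples of similar length together, which keeps padding to a minimum. The sketch below is a simplified illustration of that idea in plain Python, not the library's actual implementation (which also shuffles batches). The helper `bucket_batches` and the toy questions are hypothetical.

```python
def bucket_batches(token_lists, batch_size):
    """Sort examples by length, then cut the sorted list into batches."""
    ordered = sorted(token_lists, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

toy_questions = [
    ["who", "is", "she", "?"],
    ["what", "is", "the", "capital", "of", "peru", "?"],
    ["where", "is", "it", "?"],
    ["how", "many", "moons", "does", "mars", "have", "?"],
]
toy_batches = bucket_batches(toy_questions, batch_size=2)
toy_lengths = [[len(question) for question in batch] for batch in toy_batches]
```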

Ok, let's add an additional dimension for the channel.

In [117]:
text_batch_with_channel = None
text_batch_with_channel.shape

# torch.Size([1, 4, 50])

torch.Size([1, 4, 50])

And now let's pass our data through our network.

In [119]:
output = None

output.shape

# torch.Size([1, 4, 50, 100])

torch.Size([1, 4, 50, 100])
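The two shape changes above can be traced on made-up data. The sketch below assumes a random batch of token ids and a fresh `nn.Embedding` with a hypothetical vocabulary of 10 words. `unsqueeze(0)` adds the leading channel dimension, and the embedding lookup appends a trailing dimension of size 100.

```python
import torch
import torch.nn as nn

# Random token ids shaped (question length, batch size), as in the batch above.
toy_batch = torch.randint(low=0, high=10, size=(4, 50))

# Add a leading dimension of size 1 for the channel.
toy_with_channel = toy_batch.unsqueeze(0)

# Each id is replaced by a 100-dimensional vector, adding a trailing dimension.
toy_embed = nn.Embedding(10, 100)
toy_out = toy_embed(toy_with_channel)
```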

We should now see another dimension representing the 100 features for each word.  Ok, now let's replace the random vectors in the neural network's embedding layer with the embeddings from our vocab object.

> First assign the vectors to `vocab_vectors`.

In [121]:
vocab_vectors = None

vocab_vectors.shape

# torch.Size([9343, 100])

torch.Size([9343, 100])

Then assign them to the embedding.

In [123]:
# do so here

# tensor([[-0.8569, -0.5389, -0.0466,  ...,  1.9608, -0.0301, -1.2217],
#         [-0.6806, -0.5269,  1.5520,  ...,  1.3152,  0.7900, -1.2911],
#         [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
#         ...,
#         [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
#         [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
#         [-1.5447, -2.9450,  0.8136,  ..., -0.5756, -0.9730,  1.1454]])

tensor([[-0.8569, -0.5389, -0.0466,  ...,  1.9608, -0.0301, -1.2217],
        [-0.6806, -0.5269,  1.5520,  ...,  1.3152,  0.7900, -1.2911],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-1.5447, -2.9450,  0.8136,  ..., -0.5756, -0.9730,  1.1454]])
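One common way to do this assignment is an in-place copy into the layer's `weight.data`, sketched below on toy tensors. The names `toy_embed` and `toy_pretrained` and the 5 by 3 sizes are made up for illustration. The shapes of the two tensors must match.

```python
import torch
import torch.nn as nn

toy_embed = nn.Embedding(5, 3)  # starts with random weights
toy_pretrained = torch.arange(15, dtype=torch.float32).reshape(5, 3)

# Overwrite the random weights in place with the pretrained values.
toy_embed.weight.data.copy_(toy_pretrained)
```

PyTorch also offers `nn.Embedding.from_pretrained`, which builds a new layer from a tensor and freezes its weights by default. The in-place copy shown here leaves the weights trainable.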

Ok, we should now see our network updated with the weights above.  Next, find the unknown vector and the pad vector and zero them out.

> This tells our model that, initially, they are irrelevant for determining a question's label.

In [125]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

In [128]:
# zero out related vectors here


In [130]:
net.embed.weight.data[UNK_IDX]

# tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#         0., 0., 0., 0.])

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])
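Zeroing a row comes down to indexing into `weight.data` and assigning zeros, as in this sketch. The layer `toy_embed` and the indices `TOY_UNK_IDX` and `TOY_PAD_IDX` are hypothetical stand-ins for the lab's network and its real token indices.

```python
import torch
import torch.nn as nn

toy_embed = nn.Embedding(5, 3)  # random initial weights
TOY_UNK_IDX, TOY_PAD_IDX = 0, 1  # assumed positions of the special tokens

# Overwrite the two special-token rows with zeros; other rows are untouched.
toy_embed.weight.data[TOY_UNK_IDX] = torch.zeros(3)
toy_embed.weight.data[TOY_PAD_IDX] = torch.zeros(3)
```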

Now if we pass our numericalized text into the neural network, it will return the corresponding vectors for that text.

### Summary

In this lesson, we practiced working with word vectors and embeddings.  We saw that we can use the vocabulary of our text field in torchtext to incorporate pretrained word vectors.  Then we practiced building a neural network that translates a numericalized document into the appropriate vectors.  In the following lessons, we'll go further to see how we can use this to train a CNN with text.