# Deep Learning for NLP with PyTorch

## Introduction to PyTorch

Omitted.

## Deep Learning with PyTorch

### Deep Learning Building Blocks: Affine maps, non-linearities and objectives

Deep learning consists of composing linearities with non-linearities in clever ways. The non-linearities are what allow for powerful models.

#### Affine Maps

An affine map is a function *f(x)* where *f(x) = Ax + b* for a matrix A and vectors x, b.

PyTorch and most other deep learning frameworks do things a little differently than traditional linear algebra: they **map the rows of the input instead of the columns**. That is, the i-th row of the output below is the mapping of the i-th row of the input under A, plus the bias term.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

<torch._C.Generator at 0x112849210>

In [2]:
lin = nn.Linear(5, 3)
data = torch.randn(2, 5)
print(lin(data))

tensor([[ 0.1755, -0.3268, -0.5069],
        [-0.6602,  0.2260,  0.1089]], grad_fn=<AddmmBackward>)
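
The row-wise convention can be checked by hand. The sketch below is only an illustration: the seed and the `manual` variable are assumptions introduced here. It compares `nn.Linear` against an explicit `x @ A.T + b`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, used only for repeatability

lin = nn.Linear(5, 3)      # weight A has shape (3, 5), bias b has shape (3,)
data = torch.randn(2, 5)   # two input rows, each of size 5

out = lin(data)
manual = data @ lin.weight.T + lin.bias  # x A^T + b, applied to every row

print(out.shape)
print(torch.allclose(out, manual, atol=1e-5))
# Row i of the output depends only on row i of the input
print(torch.allclose(out[0], lin(data[0]), atol=1e-5))
```

Because each row is mapped independently, the same layer works unchanged for a batch of any size.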


#### Non-Linearities

We know that composing affine maps gives you another affine map. So if our model is a long chain of affine compositions, it has no more power than a single affine map.

If we introduce non-linearities in between the affine layers, we can build much more powerful models.
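
A small numerical check of this claim, as a sketch: the layer sizes, seed, and the names `f`, `g`, and `merged` are arbitrary choices made here. Two stacked `nn.Linear` layers collapse into a single one, while inserting a ReLU between them generally breaks that equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

f = nn.Linear(4, 6)   # f(x) = x A1^T + b1
g = nn.Linear(6, 3)   # g(h) = h A2^T + b2
x = torch.randn(5, 4)

with torch.no_grad():
    # g(f(x)) = x (A2 A1)^T + (A2 b1 + b2), which is again one affine map
    merged = nn.Linear(4, 3)
    merged.weight.copy_(g.weight @ f.weight)
    merged.bias.copy_(g.weight @ f.bias + g.bias)

    stacked = g(f(x))
    collapsed = merged(x)
    with_relu = g(torch.relu(f(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))   # the two affine layers equal one
print(torch.allclose(stacked, with_relu, atol=1e-5))   # usually False once ReLU is inserted
```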

There are a few core non-linearities: **tanh(x), sigmoid(x), ReLU(x)**, etc. They are the most common since they have gradients that are easy to compute, and computing gradients is essential for learning.

In PyTorch, most non-linearities are in **torch.nn.functional** (imported above as `F`), and they don't have parameters (weights) that need to be updated during training.

In [3]:
data = torch.randn(2, 2)
print(data)
print(F.relu(data))

tensor([[-0.5404, -2.2102],
        [ 2.1130, -0.0040]])
tensor([[0.0000, 0.0000],
        [2.1130, 0.0000]])
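
To illustrate why these functions have "easy" gradients, the sketch below compares autograd against the well-known closed-form derivatives. The input values and the helper `grad_of` are assumptions introduced for this example.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

def grad_of(fn):
    # Summing the outputs makes autograd return d fn(x_i) / d x_i per element
    (g,) = torch.autograd.grad(fn(x).sum(), x)
    return g

tanh_grad = grad_of(torch.tanh)
sigmoid_grad = grad_of(torch.sigmoid)
relu_grad = grad_of(torch.relu)

with torch.no_grad():
    s = torch.sigmoid(x)
    print(torch.allclose(tanh_grad, 1 - torch.tanh(x) ** 2, atol=1e-6))  # 1 - tanh(x)^2
    print(torch.allclose(sigmoid_grad, s * (1 - s), atol=1e-6))          # s(x) * (1 - s(x))
    print(relu_grad)                                                      # 1 where x > 0, otherwise 0
```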


#### Softmax and Probabilities

The function Softmax(x) is also just a non-linearity, but it is special in that it is usually the last operation done in a network. This is because it takes in a vector of real numbers and returns a probability distribution.

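> 刘尧: Note! log_softmax is simply a log applied to the output of softmax, but the two are computed together in one step rather than separately. Classification usually uses a log-likelihood loss, so it is convenient to compute both at once.

> Note! PyTorch provides **torch.nn.CrossEntropyLoss** ("This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class"), which bundles the steps above. It serves the same purpose as tf.nn.softmax_cross_entropy_with_logits.

> To be explicit: if a PyTorch model uses CrossEntropyLoss as its loss function, **the last layer should not apply a softmax activation**. CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.

> In Keras, by contrast, we usually apply a softmax activation in the last layer so that the output neurons hold the class probabilities, and then pass the categorical_crossentropy loss to compile, which is the more intuitive arrangement.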

In [4]:
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())
print(F.log_softmax(data, dim=0))

tensor([ 1.3800, -1.3505,  0.3455,  0.5046,  1.8213])
tensor([0.2948, 0.0192, 0.1048, 0.1228, 0.4584])
tensor(1.)
tensor([-1.2214, -3.9519, -2.2560, -2.0969, -0.7801])
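
The relationship between `log_softmax`, `NLLLoss`, and `CrossEntropyLoss` described in the notes above can be checked numerically. This is a minimal sketch; the logits and targets are made-up values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed

logits = torch.randn(4, 3)            # 4 examples, 3 classes (raw, unnormalized scores)
target = torch.tensor([0, 2, 1, 2])   # class indices

log_probs = F.log_softmax(logits, dim=1)
manual_log_probs = torch.log(F.softmax(logits, dim=1))  # same result, computed in two steps

nll = nn.NLLLoss()(log_probs, target)        # expects log probabilities
ce = nn.CrossEntropyLoss()(logits, target)   # expects raw logits

print(torch.allclose(log_probs, manual_log_probs, atol=1e-6))
print(torch.allclose(nll, ce, atol=1e-6))
```

Feeding softmax outputs into `CrossEntropyLoss` would apply the normalization twice, which is why the last layer should emit raw scores in that setup.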


#### Objective Functions

Omitted.

### Optimization and Training

Omitted.

### Creating Network Components in PyTorch

#### Example: Logistic Regression Bag-of-Words classifier

Our model will map a sparse BoW representation to log probabilities over labels. Say our entire vocab is two words, "hello" and "world", with indices 0 and 1 respectively. In general, the BoW vector for a sentence is **[Count(hello), Count(world)]**.

Denote this BoW vector as x. The output of our network is **logSoftmax(Ax+b)**.
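
For the two-word vocab, the whole pipeline fits in a few lines. The sketch below is illustrative only: `toy_word2index`, `toy_bow`, the seed, and the choice of two output labels are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

toy_word2index = {"hello": 0, "world": 1}

def toy_bow(sentence):
    # Count how often each vocab word occurs in the sentence
    vec = torch.zeros(len(toy_word2index))
    for word in sentence.split():
        vec[toy_word2index[word]] += 1
    return vec.view(1, -1)   # shape (1, vocab_size): a batch of one

x = toy_bow("hello world hello")
print(x)  # tensor([[2., 1.]])

torch.manual_seed(0)         # arbitrary seed
affine = nn.Linear(2, 2)     # Ax + b: 2 vocab words -> 2 labels (label count assumed)
log_probs = F.log_softmax(affine(x), dim=1)
print(log_probs.exp().sum()) # the probabilities sum to 1 (up to rounding)
```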

In [5]:
data = [('me gusta comer en la cafeteria'.split(), 'SPANISH'),
        ('Give it to me'.split(), 'ENGLISH'),
        ('No creo que sea una buena idea'.split(), 'SPANISH'),
        ('No it is not a good idea to get lost at sea'.split(), 'ENGLISH')]
test_data = [('Yo creo que si'.split(), 'SPANISH'),
             ('it is lost on me'.split(), 'ENGLISH')]

In [6]:
word2index = {}
for sent, _ in data + test_data:  # the vocab is built from all data: training and test
    for word in sent:
        if word not in word2index:
            word2index[word] = len(word2index)  # neat trick: the next free index is the current dict size
print(word2index)

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [7]:
VOCAB_SIZE = len(word2index)
NUM_LABEL = 2

In [8]:
class BoWClassifier(nn.Module):
    
    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        self.linear = nn.Linear(vocab_size, num_labels)
        # Note: the non-linearity log softmax does not have parameters! So we don't need to worry about them here
        
    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [9]:
def make_bow_vector(sentence, word2index):
    vec = torch.zeros(len(word2index))
    for word in sentence:
        vec[word2index[word]] += 1
    return vec.view(1, -1)

In [10]:
label2index = {"SPANISH": 0, "ENGLISH": 1}
def make_target(label, label2index):
    return torch.LongTensor([label2index[label]])

In [11]:
model = BoWClassifier(NUM_LABEL, VOCAB_SIZE)
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.1194,  0.0609, -0.1268,  0.1274,  0.1191,  0.1739, -0.1099, -0.0323,
         -0.0038,  0.0286, -0.1488, -0.1392,  0.1067, -0.0460,  0.0958,  0.0112,
          0.0644,  0.0431,  0.0713,  0.0972, -0.1816,  0.0987, -0.1379, -0.1480,
          0.0119, -0.0334],
        [ 0.1152, -0.1136, -0.1743,  0.1427, -0.0291,  0.1103,  0.0630, -0.1471,
          0.0394,  0.0471, -0.1313, -0.0931,  0.0669,  0.0351, -0.0834, -0.0594,
          0.1796, -0.0363,  0.1106,  0.0849, -0.1268, -0.1668,  0.1882,  0.0102,
          0.1344,  0.0406]], requires_grad=True)
Parameter containing:
tensor([0.0631, 0.1465], requires_grad=True)


In [12]:
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word2index)
    log_probs = model(bow_vector)
    print(log_probs)

tensor([[-0.5378, -0.8771]])


In [13]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients. We need to clear them out before each instance.
        model.zero_grad()
        # Step 2. Make the BoW vector and wrap the target label as a tensor of class indices
        bow_vec = make_bow_vector(instance, word2index)
        target = make_target(label, label2index)
        # Step 3. Run forward pass
        log_probs = model(bow_vec)
        # Step 4. Compute the loss, gradients, and update the parameters
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

In [14]:
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word2index)
        log_probs = model(bow_vec)
        print(log_probs)

tensor([[-0.2093, -1.6669]])
tensor([[-2.5330, -0.0828]])
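
To read these outputs as labels, take the index with the highest log probability and map it back through `label2index`. The sketch below reuses the two tensors printed above as literals. The names `index2label` and `predictions` are introduced here.

```python
import torch

label2index = {"SPANISH": 0, "ENGLISH": 1}
index2label = {index: label for label, index in label2index.items()}

# Log probabilities copied from the printed test output above
test_log_probs = [torch.tensor([[-0.2093, -1.6669]]),
                  torch.tensor([[-2.5330, -0.0828]])]

predictions = []
for log_probs in test_log_probs:
    predicted_index = log_probs.argmax(dim=1).item()  # most probable label index
    predictions.append(index2label[predicted_index])
    print(index2label[predicted_index], log_probs.exp())  # exp() recovers probabilities

print(predictions)  # ['SPANISH', 'ENGLISH']
```

Both predictions agree with the labels in `test_data`.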


## Word Embeddings: Encoding Lexical Semantics

A fundamental linguistic assumption: words appearing in similar contexts are related to each other semantically. This is called the **distributional hypothesis**.

### Getting Dense Word Embeddings

Suppose we have seen the sentences in training data:

* The mathematician ran to the store.

* The physicist ran to the store.

* The mathematician solved the open problem.

Now suppose we get a new sentence never seen before in training data:

* The physicist solved the open problem.

How could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, in the sentences above, **we can give 'mathematician' and 'physicist' a high score for the 'is able to run' semantic attribute**. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this:

![](./image/word_embedding_attribute.png)

The attributes are: **'can run', 'likes coffee', 'majored in Physics'**, ..., which are dimensions.
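
Once words are vectors, similarity can be measured numerically. Cosine similarity is one common choice, though the text above does not name a specific measure. In the sketch below the attribute scores are hypothetical values, not read from the figure.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores for ('can run', 'likes coffee', 'majored in Physics')
q_mathematician = torch.tensor([2.3, 9.4, -5.5])
q_physicist = torch.tensor([2.5, 9.1, 6.4])

def cosine(a, b):
    # Cosine of the angle between a and b: 1 = same direction, -1 = opposite
    return torch.dot(a, b) / (a.norm() * b.norm())

sim = cosine(q_mathematician, q_physicist)
print(sim)

# Cross-check against PyTorch's built-in implementation
builtin = F.cosine_similarity(q_mathematician, q_physicist, dim=0)
print(torch.allclose(sim, builtin, atol=1e-6))
```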

A big pain: we could think of thousands of different semantic attributes that might be relevant to determining similarity.

Idea of deep learning: the neural network learns representations of the features, rather than requiring the programmer to design the features.

In this way, we will have some ***latent semantic attributes*** that the network can, in principle, learn. Note that these attributes will probably not be interpretable.

In summary, **word embeddings are a representation of the semantics of a word, efficiently encoding semantic information that might be relevant to the task at hand**. We can embed other things too: part of speech tags, parse trees, anything.

### Word Embeddings in PyTorch

**Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings**. These will be keys into a lookup table. In the code below, the mapping from words to indices is a dictionary named *word_to_ix*.

To index into this table, we must use ***torch.LongTensor***, since the indices are integers, not floats.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

<torch._C.Generator at 0x128689310>

In [2]:
word_to_ix = {'hello': 0, 'world': 1}
embeds = nn.Embedding(2, 5)   # V=2, D=5; weights are initialized from a standard normal distribution
lookup_tensor = torch.tensor([0, 1], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519],
        [-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]],
       grad_fn=<EmbeddingBackward>)
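
The "lookup table" view can be made concrete: an embedding call returns rows of the layer's weight matrix. This is a minimal sketch. The `try`/`except` at the end relies on the assumption that PyTorch rejects float indices with an error.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)   # table of shape (V=2, D=5)

idx = torch.tensor([word_to_ix["world"]], dtype=torch.long)
row = embeds(idx)

print(row.shape)                                 # one row of size 5
print(torch.equal(row[0], embeds.weight[1]))     # the lookup is row 1 of the weight table

# Indices must be integers; a float tensor is expected to be rejected
try:
    embeds(torch.tensor([1.0]))
except (RuntimeError, TypeError) as err:
    print("float index rejected:", type(err).__name__)
```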


### An Example: N-Gram Language Modeling

In [3]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# Build a list of tuples: ([word_i-2, word_i-1], target word). Building tuples like this is the essence of a trigram model!
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2]) for i in range(len(test_sentence) - 2)]
print(trigrams[:3])

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
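
The same construction generalizes to any context size. The helper `make_ngrams` below is a hypothetical name introduced for illustration. With `context_size=2` it reproduces the trigram tuples above.

```python
def make_ngrams(tokens, context_size):
    # Each item pairs context_size consecutive words with the word that follows them
    return [(tokens[i:i + context_size], tokens[i + context_size])
            for i in range(len(tokens) - context_size)]

tokens = "When forty winters shall besiege thy brow,".split()

print(make_ngrams(tokens, 2)[:2])
# [(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall')]
print(make_ngrams(tokens, 3)[0])
# (['When', 'forty', 'winters'], 'shall')
```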


In [4]:
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

In [5]:
print(word_to_ix)

{'mine': 0, 'treasure': 1, 'now,': 2, 'Were': 3, 'trenches': 4, 'When': 5, 'art': 6, 'be': 7, 'sum': 8, 'couldst': 9, 'held:': 10, "beauty's": 11, 'besiege': 12, 'were': 13, 'on': 14, 'blood': 15, 'to': 16, 'gazed': 17, 'Proving': 18, 'worth': 19, 'weed': 20, 'Where': 21, 'brow,': 22, 'own': 23, 'forty': 24, 'more': 25, "excuse,'": 26, "'This": 27, 'old,': 28, 'deep': 29, 'lusty': 30, 'thine': 31, 'of': 32, 'say,': 33, 'an': 34, 'old': 35, 'much': 36, 'his': 37, 'all-eating': 38, 'in': 39, 'and': 40, 'And': 41, 'How': 42, 'days;': 43, 'count,': 44, 'winters': 45, 'small': 46, 'my': 47, 'make': 48, 'fair': 49, 'Then': 50, 'thy': 51, 'all': 52, 'warm': 53, 'it': 54, 'when': 55, 'succession': 56, 'child': 57, 'beauty': 58, 'dig': 59, 'thriftless': 60, 'praise.': 61, 'so': 62, 'shame,': 63, 'thou': 64, 'Will': 65, 'where': 66, 'thine!': 67, 'Shall': 68, 'new': 69, 'shall': 70, 'eyes,': 71, 'use,': 72, 'livery': 73, 'To': 74, "youth's": 75, "deserv'd": 76, 'praise': 77, 'asked,': 78, 'being

In [6]:
class NGramLanguageModeler(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)  # input is the concatenation of the context_size word embeddings
        self.linear2 = nn.Linear(128, vocab_size)  # Why vocab_size? A language model outputs a score for every word in the vocab at each step.
        
    # Structure: Inputs -> Embedding -> Reshape(view) -> Linear -> ReLU -> Linear -> LogSoftmax (Softmax -> Log).
    # The output is log probabilities, not probabilities, which makes the log-likelihood loss easy to compute.
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))  # flatten all context_size embeddings into one concatenated row
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
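
The reshape step is easier to follow with the shapes written out. In this sketch the vocab size of 20 and the input indices are arbitrary assumptions.

```python
import torch
import torch.nn as nn

CONTEXT_SIZE, EMBEDDING_DIM, VOCAB_SIZE = 2, 10, 20   # VOCAB_SIZE is arbitrary here

embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
inputs = torch.tensor([3, 7], dtype=torch.long)       # two made-up context word indices

looked_up = embeddings(inputs)     # shape (CONTEXT_SIZE, EMBEDDING_DIM) = (2, 10)
flat = looked_up.view((1, -1))     # shape (1, CONTEXT_SIZE * EMBEDDING_DIM) = (1, 20)
print(looked_up.shape, flat.shape)

# The flattened row is the first embedding followed by the second
print(torch.equal(flat[0, :EMBEDDING_DIM], looked_up[0]))
print(torch.equal(flat[0, EMBEDDING_DIM:], looked_up[1]))
```

This is why `linear1` takes `context_size * embedding_dim` input features.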

In [8]:
losses = []
loss_function = nn.NLLLoss()  # Negative Log Likelihood Loss: typically used between the output of a network whose last layer is LogSoftmax and the labels/targets
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)   # In PyTorch, this is where the optimizer is tied to the model's parameters; the two are otherwise decoupled (in Keras this is done via model.compile(optimizer=xxx, ...))

In [10]:
for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Step1: Original Words -> Words Indices -> Tensor Inputs/Target
        context_idxs = torch.tensor([word_to_ix[word] for word in context], dtype=torch.long)
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)
        
        # Step2: Recall that torch *accumulates* gradients, so before a new instance, we need to zero out the gradients from old instance
        model.zero_grad()
        
        # Step3: Forward Pass, getting log probabilities over next words
        log_probs = model(context_idxs)
        
        # Step4: Compute Loss
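        # Computing a loss between log probabilities and target indices may look odd, but a target index effectively stands for a one-hot probability distribution
        # nn.NLLLoss is designed for exactly this case: the target indices need not be one-hot encoded, since NLLLoss handles that internally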
        loss = loss_function(log_probs, target_idx)
        
        # Step5: Backward pass, then update the parameters
        loss.backward()
        optimizer.step()
        
        # Step6: Get loss of this instance, then sum
        total_loss += loss.item()
        
    losses.append(total_loss)
    print(f'Epoch: {epoch: 2d}\tLoss: {total_loss: .4f}')

Epoch:  0	Loss:  498.0866
Epoch:  1	Loss:  495.7089
Epoch:  2	Loss:  493.3401
Epoch:  3	Loss:  490.9778
Epoch:  4	Loss:  488.6210
Epoch:  5	Loss:  486.2679
Epoch:  6	Loss:  483.9204
Epoch:  7	Loss:  481.5777
Epoch:  8	Loss:  479.2389
Epoch:  9	Loss:  476.9040
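
The loss in Step4 can be unpacked by hand: for a single instance, `nn.NLLLoss` returns the negative log probability at the target index, which equals the cross-entropy against a one-hot target. The scores below are made-up values for a four-word vocab.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores over a four-word vocab, for one instance
log_probs = F.log_softmax(torch.tensor([[1.0, 2.0, 0.5, -1.0]]), dim=1)
target_idx = torch.tensor([1], dtype=torch.long)

loss = nn.NLLLoss()(log_probs, target_idx)

# Same value, picked out by indexing
picked = -log_probs[0, target_idx.item()]

# Same value again, written as cross-entropy against a one-hot target
one_hot = F.one_hot(target_idx, num_classes=4).float()
via_one_hot = -(one_hot * log_probs).sum()

print(loss, picked, via_one_hot)
```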


In [26]:
context, target = ['forty', 'winters'], 'shall'
context_idxs = torch.tensor([word_to_ix[word] for word in context], dtype=torch.long)
target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)
log_probs = model(context_idxs)
probs = torch.exp(log_probs)    # probability of each word in the vocab
idxs_pred = [x.item() for x in probs.argsort(descending=True)[0]]  # word indices sorted from most to least probable
print(log_probs)
print(probs)
print(idxs_pred)
print(target_idx)

tensor([[-4.1285, -4.8924, -4.5670, -4.4169, -4.3237, -4.9840, -5.1074, -3.9780,
         -4.5509, -4.5927, -4.5508, -4.6368, -4.5543, -4.2217, -4.4133, -4.4977,
         -4.3662, -4.9869, -4.3270, -4.6744, -4.6586, -4.3654, -4.6573, -4.5618,
         -4.7001, -4.8383, -4.8571, -4.4534, -4.3775, -4.4940, -4.7256, -4.6027,
         -4.4307, -4.9522, -4.7060, -4.8519, -4.2616, -5.1266, -4.6863, -4.4071,
         -4.7274, -4.4504, -4.9084, -4.5227, -4.3537, -4.3968, -4.6996, -4.7482,
         -4.4920, -4.6869, -4.9353, -3.8121, -4.6192, -4.5857, -4.4281, -4.6062,
         -4.9471, -4.7977, -4.4791, -4.2306, -4.7345, -4.4488, -4.5005, -4.7399,
         -4.2529, -4.8868, -4.7573, -4.9295, -4.4227, -4.3233, -4.0720, -4.4927,
         -4.6067, -4.7999, -4.8972, -4.6717, -4.7526, -4.9750, -5.0851, -4.6859,
         -5.1751, -4.7061, -4.5185, -4.5514, -4.5556, -4.8913, -4.7191, -4.4529,
         -4.6498, -4.5292, -4.0338, -4.7052, -4.8735, -4.4604, -4.7468, -4.9042,
         -4.6064]], grad_fn=