# Deep Learning for NLP with PyTorch

## Introduction to PyTorch

(Omitted.)

## Deep Learning with PyTorch

### Deep Learning Building Blocks: Affine maps, non-linearities and objectives

Deep learning consists of composing linearities with non-linearities in clever ways. The non-linearities are what allow for powerful models.

#### Affine Maps

An affine map is a function *f(x)* where *f(x)=Ax+b* for a matrix A and vectors x, b.

PyTorch and most other deep learning frameworks do things a little differently than traditional linear algebra: they **map the rows of the input instead of the columns**. That is, the i'th row of the output below is the mapping of the i'th row of the input under A, plus the bias term.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

<torch._C.Generator at 0x112849210>

In [2]:
lin = nn.Linear(5, 3)
data = torch.randn(2, 5)
print(lin(data))

tensor([[ 0.1755, -0.3268, -0.5069],
        [-0.6602,  0.2260,  0.1089]], grad_fn=<AddmmBackward>)
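
To make the row convention concrete, the following is a minimal sketch that recomputes the output of `nn.Linear` by hand. It relies on `nn.Linear(5, 3)` storing its weight with shape (3, 5), so each input row `x` is mapped to `x @ weight.T + bias`. The variable name `manual` is introduced here purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lin = nn.Linear(5, 3)      # weight has shape (3, 5), bias has shape (3,)
data = torch.randn(2, 5)   # two input rows, five features each

# Row-wise affine map written out by hand: each row x becomes x @ A^T + b.
manual = data @ lin.weight.T + lin.bias

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(lin(data), manual, atol=1e-6))
```

Because every row is mapped independently, feeding a single row through the layer gives the same result as the corresponding row of the batched output.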


#### Non-Linearities

We know that composing affine maps gives you an affine map. So if our model is a long chain of affine compositions, it has no more power than a single affine map.

If we introduce non-linearities in between the affine layers, we can build much more powerful models.
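
The claim that stacked affine maps collapse into one can be checked numerically. The sketch below uses two hypothetical layers, `f` and `g`, with arbitrarily chosen sizes, and builds the single equivalent map from their weights. It is an illustration of the algebra, not something you would do in a real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

f = nn.Linear(4, 3)   # first affine layer:  x -> x @ A1^T + b1
g = nn.Linear(3, 2)   # second affine layer: h -> h @ A2^T + b2
x = torch.randn(5, 4)

with torch.no_grad():
    # g(f(x)) = (x A1^T + b1) A2^T + b2 = x (A2 A1)^T + (A2 b1 + b2)
    A = g.weight @ f.weight
    b = g.weight @ f.bias + g.bias
    stacked = g(f(x))
    collapsed = x @ A.T + b

print(torch.allclose(stacked, collapsed, atol=1e-6))
```

Placing a non-linearity such as `torch.relu` between `f` and `g` breaks this identity in general, which is where the extra modeling power comes from.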

There are a few core non-linearities: **tanh(x), sigmoid(x), ReLU(x)**, etc. They are the most common since they have gradients that are easy to compute, and computing gradients is essential for learning.

In PyTorch, most non-linearities are in **torch.nn.functional** (imported above as `F`), and they don't have parameters (weights) that need to be updated during training.

In [3]:
data = torch.randn(2, 2)
print(data)
print(F.relu(data))

tensor([[-0.5404, -2.2102],
        [ 2.1130, -0.0040]])
tensor([[0.0000, 0.0000],
        [2.1130, 0.0000]])
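
For comparison, here is a small sketch applying the three non-linearities named above to a fixed input, and confirming that the module form `nn.ReLU` holds no trainable weights. The input values and the names `relu_layer` and `num_params` are chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

print(torch.tanh(x))      # squashes values into (-1, 1)
print(torch.sigmoid(x))   # squashes values into (0, 1)
print(F.relu(x))          # replaces negative values with zero

# The module form wraps the same function and has no parameters to train.
relu_layer = nn.ReLU()
num_params = sum(p.numel() for p in relu_layer.parameters())
print(num_params)
```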


#### Softmax and Probabilities

The function Softmax(x) is also just a non-linearity, but it is special in that it is usually the last operation done in a network. This is because it takes in a vector of real numbers and returns a probability distribution.

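> Note (刘尧): log_softmax simply applies a log on top of softmax, but the two are computed together in one step rather than separately. Classification generally uses a log-likelihood objective, so everything is computed in one go.

> Note: PyTorch provides **torch.nn.CrossEntropyLoss (This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class)**, which bundles the steps above. This matches what tf.nn.softmax_cross_entropy_with_logits does.

> One point must be clear: in PyTorch, if the model uses CrossEntropyLoss as its loss function, you **should not apply softmax again as the activation of the last layer**. CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.

> In Keras, by contrast, we usually apply softmax in the last layer so that the output neurons hold the class probabilities, and then pass the loss function categorical_crossentropy to compile, which is the intuitive arrangement.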
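
The relationship between the two loss functions described in the notes above can be checked directly. This is a minimal sketch: `logits` and `targets` are made-up values for four examples and three classes, not data from this tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

logits = torch.randn(4, 3)            # hypothetical raw scores: 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # hypothetical class indices

# Route 1: explicit log-softmax followed by the negative log-likelihood loss.
loss_two_step = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Route 2: CrossEntropyLoss takes the raw scores and applies log-softmax itself.
loss_one_step = nn.CrossEntropyLoss()(logits, targets)

print(loss_two_step.item(), loss_one_step.item())
```

The two printed values should agree up to floating-point error. Feeding already-softmaxed outputs into `CrossEntropyLoss` would apply the normalization twice, which is why the last layer should emit raw scores in that setup.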

In [4]:
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())
print(F.log_softmax(data, dim=0))

tensor([ 1.3800, -1.3505,  0.3455,  0.5046,  1.8213])
tensor([0.2948, 0.0192, 0.1048, 0.1228, 0.4584])
tensor(1.)
tensor([-1.2214, -3.9519, -2.2560, -2.0969, -0.7801])
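
As a sanity check on the definitions, the sketch below recomputes softmax and log-softmax by hand for the values printed above and compares them with the library functions. The names `manual_softmax` and `manual_log_softmax` are introduced here for illustration.

```python
import torch
import torch.nn.functional as F

# The same values as the printed tensor above.
data = torch.tensor([1.3800, -1.3505, 0.3455, 0.5046, 1.8213])

# Softmax by hand: exponentiate, then normalize so the entries sum to 1.
manual_softmax = torch.exp(data) / torch.exp(data).sum()

# Log-softmax by hand: subtract the log of the normalizer from each entry.
manual_log_softmax = data - torch.logsumexp(data, dim=0)

print(torch.allclose(manual_softmax, F.softmax(data, dim=0), atol=1e-6))
print(torch.allclose(manual_log_softmax, F.log_softmax(data, dim=0), atol=1e-6))
```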


#### Objective Functions

(Omitted.)

### Optimization and Training

(Omitted.)

### Creating Network Components in PyTorch

#### Example: Logistic Regression Bag-of-Words classifier

Our model will map a sparse BoW representation to log probabilities over labels. Say our entire vocab is two words "hello" and "world", with indices 0 and 1 respectively. In general, the BoW vector for a sentence is **[Count(hello), Count(world)]**.

Denote this BoW vector as x. The output of our network is **logSoftmax(Ax+b)**.
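
A toy version of this mapping, using the two-word vocabulary from the text, might look like the sketch below. The helper `toy_bow` and the parameter values in `A` and `b` are hypothetical, chosen only so the example runs. In the real model they are learned.

```python
import torch
import torch.nn.functional as F

toy_vocab = {"hello": 0, "world": 1}   # the two-word vocabulary from the text

def toy_bow(sentence, vocab):
    """Count how often each vocabulary word occurs in a whitespace-split sentence."""
    vec = torch.zeros(len(vocab))
    for word in sentence.split():
        vec[vocab[word]] += 1
    return vec

x = toy_bow("hello world world hello", toy_vocab)
print(x)

# Hypothetical parameters for a 2-label problem, for illustration only.
A = torch.tensor([[0.5, -0.5], [0.0, 1.0]])
b = torch.zeros(2)
log_probs = F.log_softmax(A @ x + b, dim=0)
print(log_probs, log_probs.exp().sum())
```

Exponentiating the output recovers a probability distribution over the labels, so the entries of `log_probs.exp()` sum to 1.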

In [5]:
data = [('me gusta comer en la cafeteria'.split(), 'SPANISH'),
        ('Give it to me'.split(), 'ENGLISH'),
        ('No creo que sea una buena idea'.split(), 'SPANISH'),
        ('No it is not a good idea to get lost at sea'.split(), 'ENGLISH')]
test_data = [('Yo creo que si'.split(), 'SPANISH'),
             ('it is lost on me'.split(), 'ENGLISH')]

In [6]:
word2index = {}
for sent, _ in data + test_data:  # the vocab is built from all data: training and test
    for word in sent:
        if word not in word2index:
            word2index[word] = len(word2index)  # neat: the next free index equals the current vocab size
print(word2index)

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [7]:
VOCAB_SIZE = len(word2index)
NUM_LABEL = 2

In [8]:
class BoWClassifier(nn.Module):
    
    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        self.linear = nn.Linear(vocab_size, num_labels)
        # Note: the non-linearity log softmax does not have parameters! So we don't need to worry about them here
        
    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [9]:
def make_bow_vector(sentence, word2index):
    vec = torch.zeros(len(word2index))
    for word in sentence:
        vec[word2index[word]] += 1
    return vec.view(1, -1)

In [10]:
label2index = {"SPANISH": 0, "ENGLISH": 1}
def make_target(label, label2index):
    return torch.LongTensor([label2index[label]])

In [11]:
model = BoWClassifier(NUM_LABEL, VOCAB_SIZE)
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.1194,  0.0609, -0.1268,  0.1274,  0.1191,  0.1739, -0.1099, -0.0323,
         -0.0038,  0.0286, -0.1488, -0.1392,  0.1067, -0.0460,  0.0958,  0.0112,
          0.0644,  0.0431,  0.0713,  0.0972, -0.1816,  0.0987, -0.1379, -0.1480,
          0.0119, -0.0334],
        [ 0.1152, -0.1136, -0.1743,  0.1427, -0.0291,  0.1103,  0.0630, -0.1471,
          0.0394,  0.0471, -0.1313, -0.0931,  0.0669,  0.0351, -0.0834, -0.0594,
          0.1796, -0.0363,  0.1106,  0.0849, -0.1268, -0.1668,  0.1882,  0.0102,
          0.1344,  0.0406]], requires_grad=True)
Parameter containing:
tensor([0.0631, 0.1465], requires_grad=True)


In [12]:
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word2index)
    log_probs = model(bow_vector)
    print(log_probs)

tensor([[-0.5378, -0.8771]])


In [13]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients. We need to clear them out before each instance.
        model.zero_grad()
        # Step 2. Make the BoW vector and wrap the target label in a tensor of class indices.
        bow_vec = make_bow_vector(instance, word2index)
        target = make_target(label, label2index)
        # Step 3. Run forward pass
        log_probs = model(bow_vec)
        # Step 4. Compute the loss, gradients, and update the parameters
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()
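
Step 1 matters because `backward()` adds to any gradient already stored on a parameter. The sketch below shows this on a single hypothetical tensor `w` rather than on the model. Note that, depending on the PyTorch version, `zero_grad()` may reset gradients by setting them to `None` instead of filling them with zeros. Either way the next backward pass starts fresh.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(3 * w.sum()).backward()
first = w.grad.clone()     # gradient of 3 * (w0 + w1) with respect to w

(3 * w.sum()).backward()   # no reset in between, so the new gradient is added on top
second = w.grad.clone()

w.grad.zero_()             # manual reset, analogous to clearing gradients in the loop above
third = w.grad.clone()

print(first, second, third)
```

Without the reset, each instance in the training loop would update the parameters using the sum of all gradients seen so far.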

In [14]:
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word2index)
        log_probs = model(bow_vec)
        print(log_probs)

tensor([[-0.2093, -1.6669]])
tensor([[-2.5330, -0.0828]])
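
To read these outputs as predictions, one option is to take the index with the largest log probability and map it back to a label. The sketch below copies the two tensors printed above. The reverse lookup `index2label` is a helper introduced here, not part of the original code.

```python
import torch

label2index = {"SPANISH": 0, "ENGLISH": 1}
index2label = {i: label for label, i in label2index.items()}

# Log probabilities copied from the test output above.
outputs = [torch.tensor([[-0.2093, -1.6669]]),
           torch.tensor([[-2.5330, -0.0828]])]

predictions = []
for log_probs in outputs:
    probs = log_probs.exp()   # convert log probabilities back to probabilities
    predicted = index2label[log_probs.argmax(dim=1).item()]
    predictions.append(predicted)
    print(predicted, probs)
```

Both test sentences come out with the label they carry in `test_data`: the first as SPANISH and the second as ENGLISH.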


## Word Embeddings: Encoding Lexical Semantics

