## 递归神经网络 Recurrent Neural Networks

<img src="data/logo.png" alt="Drawing" style="width: 300px;"/>

当处理序列化数据时(时间序列，句子，等等) 输入的顺训对于任务是否能顺利完成至关重要。递归神经网络(RNN)会把从之前的输入学习到的信息和新的输入结合处理，这个课程项目中我们将会学习如何创建，并使用序列化数据用RNN建模。


<img src="data/rnn.png" alt="Drawing" style="width: 300px;"/>


## 概览

* **目标:** 从之前的输入学习到的信息和新的输入结合处理, 从而处理序列化数据

* **优点:**
	* 把序列信息和前置输入有机结合处理
	* 生成序列时进行条件判断。
* **缺点:**
		* 每一时间步输出的预测都依赖于上一步的预测输出，所以RNN很难并行计算。
		* 处理很长的序列数据时，会出现内存和运算问题。
		* 模型的可解释性比较困难，但是有一些好用的[工具和技巧](https://arxiv.org/abs/1506.02078)会检查RNN中的激活函数来推断哪一部分数据正在被处理。
* **其他:**
		* 如何进行构架变化从而让RNN变得更快，解释性更好，其实一直是一个热门的研究课题。		

<img src="data/rnn2.png" alt="Drawing" style="width: 300px;"/>

RNN 每一时间步的前馈网络 ~$X_t~$:

~$h_t = tanh(W_{hh}h_{t-1} + W_{xh}X_t+b_h)~$

~$y_t = W_{hy}h_t + b_y~$

~$P(y) = softmax(y_t) = \frac{e^y}{\sum e^y}~$

*其中*:
* ~$X_t~$ = 时间步t时刻的输入 | ~$\in \mathbb{R}^{NXE}~$ (~$N~$ 批次大小, ~$E~$ 是嵌入层维度)
* ~$W_{hh}~$ = 隐藏单元权重| ~$\in \mathbb{R}^{HXH}~$ (~$H~$ 隐藏单元维度)
* ~$h_{t-1}~$ = 上一个时间步的隐藏状态 ~$\in \mathbb{R}^{NXH}~$
* ~$W_{xh}~$ = 输入权重| $\in \mathbb{R}^{EXH}~$
* ~$b_h~$ = 隐藏单元偏差 ~$\in \mathbb{R}^{HX1}~$
* ~$W_{hy}~$ = 输出权重 | ~$\in \mathbb{R}^{HXC}~$ (~$C~$ 类别数量)
* ~$b_y~$ = 输出偏差 ~$\in \mathbb{R}^{CX1}~$

对每个时间步的输入 ~$(X_{t+1}, X_{t+2}, ..., X_{N})~$ 都进行这一套操作来得到每个时间步的预测输出。

**注意:** 在每个时间步的开始，前一个隐藏状态~$h_{t-1}~$可能是一个零值向量(没有条件)或是已被初始化(有条件)。如果RNN有前置条件，第一个隐藏状态 ~$h_0~$ 属于一个特定的条件，或者在每一个时间步将特定的条件和随机初始化的零值向量拼接。关于RNN的更多细节我们会在下一个 项目课程里进行讲解。下面来看一个假设情景，就是我们需要处理一些评价数据然后判断这些评价是好评还是差评，在这个情景里RNN的前馈网络是什么样的。

In [1]:
# 安装PyTorch
!pip install torch torchvision --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple --quiet

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already up-to-date: torch in /opt/conda/lib/python3.5/site-packages (1.0.0)
Requirement already up-to-date: torchvision in /opt/conda/lib/python3.5/site-packages (0.2.1)
[33mYou are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
batch_size = 5
seq_size = 10 # 输入最长长度 (对不等于最长序列的序列使用遮罩)
x_lengths = [8, 5, 4, 10, 5] # 每个输入数据序列的长度
embedding_dim = 100
rnn_hidden_dim = 256
output_dim = 4

In [4]:
# 初始化数据
x_in = torch.randn(batch_size, seq_size, embedding_dim)
x_lengths = torch.tensor(x_lengths)
print (x_in.size())

torch.Size([5, 10, 100])


In [5]:
# 初始化隐藏层
hidden_t = torch.zeros((batch_size, rnn_hidden_dim))
print (hidden_t.size())

torch.Size([5, 256])


In [6]:
# 初始化RNN Cell
rnn_cell = nn.RNNCell(embedding_dim, rnn_hidden_dim)
print (rnn_cell)

RNNCell(100, 256)


In [7]:
# RNN前馈网络
x_in = x_in.permute(1, 0, 2) # RNN需要把batchsize数据放在第一维

# 对时间步循环
hiddens = []
for t in range(seq_size):
    hidden_t = rnn_cell(x_in[t], hidden_t)
    hiddens.append(hidden_t)
hiddens = torch.stack(hiddens)
hiddens = hiddens.permute(1, 0, 2) # 把batchsize数据置回第0维
print (hiddens.size())

torch.Size([5, 10, 256])


In [8]:
# 其实可以用更抽象的网络层
x_in = torch.randn(batch_size, seq_size, embedding_dim)
rnn = nn.RNN(embedding_dim, rnn_hidden_dim, batch_first=True)
out, h_n = rnn(x_in) #h_n 是前一个隐藏状态
print ("out: ", out.size())
print ("h_n: ", h_n.size())

out:  torch.Size([5, 10, 256])
h_n:  torch.Size([1, 5, 256])


In [9]:
def gather_last_relevant_hidden(hiddens, x_lengths):
    x_lengths = x_lengths.long().detach().cpu().numpy() - 1
    out = []
    for batch_index, column_index in enumerate(x_lengths):
        out.append(hiddens[batch_index, column_index])
    return torch.stack(out)

In [10]:
# 获取最后一个相关的隐藏状态
z = gather_last_relevant_hidden(hiddens, x_lengths)
print (z.size())

torch.Size([5, 256])


In [11]:
# 通过全连接层向前传播
fc1 = nn.Linear(rnn_hidden_dim, output_dim)
y_pred = fc1(z)
y_pred = F.softmax(y_pred, dim=1)
print (y_pred.size())
print (y_pred)

torch.Size([5, 4])
tensor([[0.2734, 0.2128, 0.2010, 0.3129],
        [0.1772, 0.2352, 0.2647, 0.3228],
        [0.2697, 0.2114, 0.2957, 0.2231],
        [0.1680, 0.2455, 0.2652, 0.3212],
        [0.2690, 0.2353, 0.1680, 0.3277]], grad_fn=<SoftmaxBackward>)


## 序列化数据

RNN 可以帮助处理许多不同的序列化任务。

1. **一对一 (One to one)**: 有一个输入，产出一个输出。
	* 举个例子: 输入一个单词，给出类别(它是动词，名字，等等)
2. **一对多 (One to many)**: 有一个输入，产出多个输出。
	* 举个例子: 输入一种评价分类(差评还是好评)，输出一个评价文案
3. **多对一 (Many to one)**: 多个输入，产出一个输出
	* 举个例子: 输入输出一个评价文案, 输出一种评价分类(差评还是好评)
4. **多对多 (Many to many)**: 多个输入序列化处理生成多个输出
	* 举个例子: 输入一句法语，处理后输出英语翻译
	* 再举个例子: 输入一组时间序列数据，在每一时间步输出某个事件的概率(疾病风险)


<img src="data/seq2seq.jpeg" alt="Drawing" style="width: 600px;"/>


## 普通RNN的问题

到现在为止我们看到的其实都属于 普通RNN(Vanills RNN), 其中存在一些问题。

1. 当输入序列包含的时间步越多, 在处理后置的数据同时保存前置已经学习到的信息的难度就会越来越大。模型的目的确实是要保存之前已处理的数据中的有效信息，但是当时间步很多的时候这个操作就会显得异常笨重。

2. 进行反向传播时，损失梯度需要传播到最开始的第一个时间步。当梯度比1大(~${1.01}^{1000} = 20959~$)或比1小(~${0.99}^{1000} = 4.31e-5~$)，且序列中时间步很多时，反向传播就会成为灾难。

为了解决这些问题，**门控(Gating)** 的机制被引进了RNN. 这一机制可以使网络得以控制时间步之间的信息流，从而优化模型。选择性地向下一个时间步传可以递信息让模型可以轻松处理含有许多时间步的数据。类似的网络变种有长短期记忆网络 Long Short Term Memory ([LSTM](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM)), 和 门控递归神经网络 Gated Recurrent Units ([GRUs](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU))。 大家可以在[here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)学习具体信息。


<img src="data/gates.png" alt="Drawing" style="width: 800px;"/>


In [12]:
# 用PyTorch实现GRU
gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, 
             batch_first=True)

In [13]:
# 初始化人造数据
x_in = torch.randn(batch_size, seq_size, embedding_dim)
print (x_in.size())

torch.Size([5, 10, 100])


In [14]:
# 向前传播
out, h_n = gru(x_in)
print ("out:", out.size())
print ("h_n:", h_n.size())

out: torch.Size([5, 10, 256])
h_n: torch.Size([1, 5, 256])



**注意**: 数据特性和性能决定了我们到底是选择 GRU 还是 LSTM。GRU在参数较少时性能客观，LSTM 效率更高，不过性能随实际情况变化。

## 双向递归神经网络
我们会在接下来的项目中看到 RNN 的优化和提升 ([注意](https://www.oreilly.com/ideas/interpretability-via-attentional-and-memory-based-interfaces-using-tensorflow), Quasi RNNs, etc.)。顾名思义，双向递归神经网络(bidirectional RNNs, Bi-RNNs) 会从输入的两个方向来处理数据。分别从两头开始处理数据结构提供给输出层输入序列中每一个点的完整的过去和未来的上下文信息，有助于性能的提升。Bi-RNN的一种常用情景是 翻译，从两头分别开始处理整句对将一种语言译成另一种语言的任务非常有帮助。


<img src="data/birnn.png" alt="Drawing" style="width: 600px;"/>


In [52]:
# PyTorch实现Bi-GRU
bi_gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, 
                batch_first=True, bidirectional=True)

In [16]:
# 向前传播
out, h_n = bi_gru(x_in)
print ("out:", out.size()) # collection of all hidden states from the RNN for each time step
print ("h_n:", h_n.size()) # last hidden state from the RNN

out: torch.Size([5, 10, 512])
h_n: torch.Size([2, 5, 256])


注意每个样本的输出的大小是512，就是隐藏维度的两倍。因为同时包含了向前传播和反向传播的数据。

## 使用 RNN 作文本分类
让我们把RNN应用在[嵌入层](https://www.kesci.com/home/project/share/50c13cecc0bf7ba1)的项目中，我们当时想通过标题预测文章的分类。

### 配置

In [17]:
import os
from argparse import Namespace
import collections
import copy
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import torch

In [18]:
# 设置Numpy和PyTorch随机种子
def set_seeds(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        
# 创建字典
def create_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

In [19]:
# 参数
args = Namespace(
    seed=1234,
    cuda=True,
    shuffle=True,
    data_file="data/news.csv",
    split_data_file="data/split_news.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="news",
    train_size=0.7,
    val_size=0.15,
    test_size=0.15,
    pretrained_embeddings=None,
    cutoff=25, # token must appear at least <cutoff> times to be in SequenceVocabulary
    num_epochs=5,
    early_stopping_criteria=5,
    learning_rate=1e-3,
    batch_size=64,
    embedding_dim=100,
    rnn_hidden_dim=128,
    hidden_dim=100,
    num_layers=1,
    bidirectional=False,
    dropout_p=0.1,
)

# 设置种子
set_seeds(seed=args.seed, cuda=args.cuda)

# 创建保存目录
create_dirs(args.save_dir)

# 拓展路径
args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

# 检查GPU可用性
if torch.cuda.is_available():
    args.cuda = True
else:
    args.cude = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

Using CUDA: True


### 数据

In [20]:
import re
import urllib

In [21]:
# 加载数据
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/news.csv"
response = urllib.request.urlopen(url)
html = response.read()
with open(args.data_file, 'wb') as fp:
    fp.write(html)

In [22]:
# 原始数据
df = pd.read_csv(args.data_file, header=0)
df.head()

Unnamed: 0,category,title
0,Business,Wall St. Bears Claw Back Into the Black (Reuters)
1,Business,Carlyle Looks Toward Commercial Aerospace (Reu...
2,Business,Oil and Economy Cloud Stocks' Outlook (Reuters)
3,Business,Iraq Halts Oil Exports from Main Southern Pipe...
4,Business,"Oil prices soar to all-time record, posing new..."


In [23]:
# 按类别拆分数据集
by_category = collections.defaultdict(list)
for _, row in df.iterrows():
    by_category[row.category].append(row.to_dict())
for category in by_category:
    print ("{0}: {1}".format(category, len(by_category[category])))

Business: 30000
Sports: 30000
World: 30000
Sci/Tech: 30000


In [24]:
# 新建切分数据集
final_list = []
for _, item_list in sorted(by_category.items()):
    if args.shuffle:
        np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_size*n)
    n_val = int(args.val_size*n)
    n_test = int(args.test_size*n)

  # 给数据指定切分属性
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  

    # 添加到列表
    final_list.extend(item_list)

In [25]:
# 切分后的数据集dataframe
split_df = pd.DataFrame(final_list)
split_df["split"].value_counts()

train    84000
test     18000
val      18000
Name: split, dtype: int64

In [26]:
# 预处理
def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    text = text.strip()
    return text
    
split_df.title = split_df.title.apply(preprocess_text)

In [27]:
# 存为CSV文件
split_df.to_csv(args.split_data_file, index=False)
split_df.head()

Unnamed: 0,category,split,title
0,Business,train,general electric posts higher rd quarter profit
1,Business,train,lilly to eliminate up to us jobs
2,Business,train,s amp p lowers america west outlook to negative
3,Business,train,does rand walk the talk on labor policy ?
4,Business,train,housekeeper advocates for changes


### 词汇表

In [28]:
class Vocabulary(object):
    def __init__(self, token_to_idx=None):

        # 词条转换为索引
        if token_to_idx is None:
            token_to_idx = {}
        self.token_to_idx = token_to_idx

        # 索引转换为词条
        self.idx_to_token = {idx: token \
                             for token, idx in self.token_to_idx.items()}

    def to_serializable(self):
        return {'token_to_idx': self.token_to_idx}

    @classmethod
    def from_serializable(cls, contents):
        return cls(**contents)

    def add_token(self, token):
        if token in self.token_to_idx:
            index = self.token_to_idx[token]
        else:
            index = len(self.token_to_idx)
            self.token_to_idx[token] = index
            self.idx_to_token[index] = token
        return index

    def add_tokens(self, tokens):
        return [self.add_token[token] for token in tokens]

    def lookup_token(self, token):
        return self.token_to_idx[token]

    def lookup_index(self, index):
        if index not in self.idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self.idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self.token_to_idx)

In [29]:
# 词汇表实例
category_vocab = Vocabulary()
for index, row in df.iterrows():
    category_vocab.add_token(row.category)
print (category_vocab) # __str__
print (len(category_vocab)) # __len__
index = category_vocab.lookup_token("Business")
print (index)
print (category_vocab.lookup_index(index))

<Vocabulary(size=4)>
4
0
Business


### 序列词汇表

接下来我们将为文章标题创建词汇表类，它由一系列词条构成。

In [30]:
from collections import Counter
import string

In [31]:
class SequenceVocabulary(Vocabulary):
    def __init__(self, token_to_idx=None, unk_token="<UNK>",
                 mask_token="<MASK>", begin_seq_token="<BEGIN>",
                 end_seq_token="<END>"):

        super(SequenceVocabulary, self).__init__(token_to_idx)

        self.mask_token = mask_token
        self.unk_token = unk_token
        self.begin_seq_token = begin_seq_token
        self.end_seq_token = end_seq_token

        self.mask_index = self.add_token(self.mask_token)
        self.unk_index = self.add_token(self.unk_token)
        self.begin_seq_index = self.add_token(self.begin_seq_token)
        self.end_seq_index = self.add_token(self.end_seq_token)
        
        # 索引转换为词条
        self.idx_to_token = {idx: token \
                             for token, idx in self.token_to_idx.items()}

    def to_serializable(self):
        contents = super(SequenceVocabulary, self).to_serializable()
        contents.update({'unk_token': self.unk_token,
                         'mask_token': self.mask_token,
                         'begin_seq_token': self.begin_seq_token,
                         'end_seq_token': self.end_seq_token})
        return contents

    def lookup_token(self, token):
        return self.token_to_idx.get(token, self.unk_index)
    
    def lookup_index(self, index):
        if index not in self.idx_to_token:
            raise KeyError("the index (%d) is not in the SequenceVocabulary" % index)
        return self.idx_to_token[index]
    
    def __str__(self):
        return "<SequenceVocabulary(size=%d)>" % len(self.token_to_idx)

    def __len__(self):
        return len(self.token_to_idx)


In [32]:
# 得到词数
word_counts = Counter()
for title in split_df.title:
    for token in title.split(" "):
        if token not in string.punctuation:
            word_counts[token] += 1

# 创建SequenceVocabulary实例
title_vocab = SequenceVocabulary()
for word, word_count in word_counts.items():
    if word_count >= args.cutoff:
        title_vocab.add_token(word)
print (title_vocab) # __str__
print (len(title_vocab)) # __len__
index = title_vocab.lookup_token("general")
print (index)
print (title_vocab.lookup_index(index))

<SequenceVocabulary(size=4400)>
4400
641
general


### 向量化

在向量化中这次我们会引入新的操作: 计算输入序列的长度。我们会在为每个输入序列提取最新的相关隐藏状态时用到它。

In [33]:
class NewsVectorizer(object):
    def __init__(self, title_vocab, category_vocab):
        self.title_vocab = title_vocab
        self.category_vocab = category_vocab

    def vectorize(self, title):
        indices = [self.title_vocab.lookup_token(token) for token in title.split(" ")]
        indices = [self.title_vocab.begin_seq_index] + indices + \
            [self.title_vocab.end_seq_index]
        
        # 创建向量
        title_length = len(indices)
        vector = np.zeros(title_length, dtype=np.int64)
        vector[:len(indices)] = indices

        return vector, title_length
    
    def unvectorize(self, vector):
        tokens = [self.title_vocab.lookup_index(index) for index in vector]
        title = " ".join(token for token in tokens)
        return title

    @classmethod
    def from_dataframe(cls, df, cutoff):
        
        # 创建分类的词汇表实例
        category_vocab = Vocabulary()        
        for category in sorted(set(df.category)):
            category_vocab.add_token(category)

        # 获取词数
        word_counts = Counter()
        for title in df.title:
            for token in title.split(" "):
                word_counts[token] += 1
        
        # 创建标题的词汇表实例
        title_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count >= cutoff:
                title_vocab.add_token(word)
        
        return cls(title_vocab, category_vocab)

    @classmethod
    def from_serializable(cls, contents):
        title_vocab = SequenceVocabulary.from_serializable(contents['title_vocab'])
        category_vocab = Vocabulary.from_serializable(contents['category_vocab'])
        return cls(title_vocab=title_vocab, category_vocab=category_vocab)
    
    def to_serializable(self):
        return {'title_vocab': self.title_vocab.to_serializable(),
                'category_vocab': self.category_vocab.to_serializable()}

In [34]:
# 向量化实例
vectorizer = NewsVectorizer.from_dataframe(split_df, cutoff=args.cutoff)
print (vectorizer.title_vocab)
print (vectorizer.category_vocab)
vectorized_title, title_length = vectorizer.vectorize(preprocess_text(
    "Roger Federer wins the Wimbledon tennis tournament."))
print (np.shape(vectorized_title))
print ("title_length:", title_length)
print (vectorized_title)
print (vectorizer.unvectorize(vectorized_title))

<SequenceVocabulary(size=4404)>
<Vocabulary(size=4)>
(10,)
title_length: 10
[   2    1 2032  115 1075    1 2590 4387 3927    3]
<BEGIN> <UNK> federer wins the <UNK> tennis tournament . <END>


### 数据集

In [35]:
from torch.utils.data import Dataset, DataLoader

In [36]:
class NewsDataset(Dataset):
    def __init__(self, df, vectorizer):
        self.df = df
        self.vectorizer = vectorizer

        # 切分数据
        self.train_df = self.df[self.df.split=='train']
        self.train_size = len(self.train_df)
        self.val_df = self.df[self.df.split=='val']
        self.val_size = len(self.val_df)
        self.test_df = self.df[self.df.split=='test']
        self.test_size = len(self.test_df)
        self.lookup_dict = {'train': (self.train_df, self.train_size), 
                            'val': (self.val_df, self.val_size),
                            'test': (self.test_df, self.test_size)}
        self.set_split('train')

        # 类权重(用于样本失衡)
        class_counts = df.category.value_counts().to_dict()
        def sort_key(item):
            return self.vectorizer.category_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, split_data_file, cutoff):
        df = pd.read_csv(split_data_file, header=0)
        train_df = df[df.split=='train']
        return cls(df, NewsVectorizer.from_dataframe(train_df, cutoff))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, split_data_file, vectorizer_filepath):
        df = pd.read_csv(split_data_file, header=0)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer)

    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath) as fp:
            return NewsVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self.vectorizer.to_serializable(), fp)

    def set_split(self, split="train"):
        self.target_split = split
        self.target_df, self.target_size = self.lookup_dict[split]

    def __str__(self):
        return "<Dataset(split={0}, size={1})".format(
            self.target_split, self.target_size)

    def __len__(self):
        return self.target_size

    def __getitem__(self, index):
        row = self.target_df.iloc[index]
        title_vector, title_length = self.vectorizer.vectorize(row.title)
        category_index = self.vectorizer.category_vocab.lookup_token(row.category)
        return {'title': title_vector, 'title_length': title_length, 
                'category': category_index}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

    def generate_batches(self, batch_size, collate_fn, shuffle=True, 
                         drop_last=False, device="cpu"):
        dataloader = DataLoader(dataset=self, batch_size=batch_size,
                                collate_fn=collate_fn, shuffle=shuffle, 
                                drop_last=drop_last)
        for data_dict in dataloader:
            out_data_dict = {}
            for name, tensor in data_dict.items():
                out_data_dict[name] = data_dict[name].to(device)
            yield out_data_dict

In [37]:
# 数据集实例
dataset = NewsDataset.load_dataset_and_make_vectorizer(args.split_data_file,
                                                       args.cutoff)
print (dataset) # __str__
input_ = dataset[5] # __getitem__
print (input_['title'], input_['title_length'], input_['category'])
print (dataset.vectorizer.unvectorize(input_['title']))
print (dataset.class_weights)

<Dataset(split=train, size=84000)
[   2 1182 1955  816 3138 2806    3] 7 0
<BEGIN> software firm to cut jobs <END>
tensor([3.3333e-05, 3.3333e-05, 3.3333e-05, 3.3333e-05])


### 模型
input → embedding → RNN → FC

In [38]:
import torch.nn as nn
import torch.nn.functional as F

In [39]:
def gather_last_relevant_hidden(hiddens, x_lengths):
    x_lengths = x_lengths.long().detach().cpu().numpy() - 1
    out = []
    for batch_index, column_index in enumerate(x_lengths):
        out.append(hiddens[batch_index, column_index])
    return torch.stack(out)

In [40]:
class NewsModel(nn.Module):
    def __init__(self, embedding_dim, num_embeddings, rnn_hidden_dim, 
                 hidden_dim, output_dim, num_layers, bidirectional, dropout_p, 
                 pretrained_embeddings=None, freeze_embeddings=False, 
                 padding_idx=0):
        super(NewsModel, self).__init__()
        
        if pretrained_embeddings is None:
            self.embeddings = nn.Embedding(embedding_dim=embedding_dim,
                                          num_embeddings=num_embeddings,
                                          padding_idx=padding_idx)
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.embeddings = nn.Embedding(embedding_dim=embedding_dim,
                                           num_embeddings=num_embeddings,
                                           padding_idx=padding_idx,
                                           _weight=pretrained_embeddings)
        
        # 卷积层权重
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, 
                          num_layers=num_layers, batch_first=True, 
                          bidirectional=bidirectional)
     
        # 全连接层权重
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(rnn_hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        
        if freeze_embeddings:
            self.embeddings.weight.requires_grad = False

    def forward(self, x_in, x_lengths, apply_softmax=False):
        
        # 嵌入
        x_in = self.embeddings(x_in)
            
        # 传入RNN
        out, h_n = self.gru(x_in)
        
        # 获取上一个相关隐藏状态
        out = gather_last_relevant_hidden(out, x_lengths)

        # 全连接层
        z = self.dropout(out)
        z = self.fc1(z)
        z = self.dropout(z)
        y_pred = self.fc2(z)

        if apply_softmax:
            y_pred = F.softmax(y_pred, dim=1)
        return y_pred

### 训练

In [41]:
import torch.optim as optim

In [42]:
class Trainer(object):
    def __init__(self, dataset, model, model_state_file, save_dir, device, shuffle, 
               num_epochs, batch_size, learning_rate, early_stopping_criteria):
        self.dataset = dataset
        self.class_weights = dataset.class_weights.to(device)
        self.model = model.to(device)
        self.save_dir = save_dir
        self.device = device
        self.shuffle = shuffle
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.loss_func = nn.CrossEntropyLoss(self.class_weights)
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer=self.optimizer, mode='min', factor=0.5, patience=1)
        self.train_state = {
            'stop_early': False, 
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'early_stopping_criteria': early_stopping_criteria,
            'learning_rate': learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': model_state_file}
    
    def update_train_state(self):

        # 打印checkpoint信息
        print ("[EPOCH]: {0:02d} | [LR]: {1} | [TRAIN LOSS]: {2:.2f} | [TRAIN ACC]: {3:.1f}% | [VAL LOSS]: {4:.2f} | [VAL ACC]: {5:.1f}%".format(
          self.train_state['epoch_index'], self.train_state['learning_rate'], 
            self.train_state['train_loss'][-1], self.train_state['train_acc'][-1], 
            self.train_state['val_loss'][-1], self.train_state['val_acc'][-1]))

        # 至少保存一次模型
        if self.train_state['epoch_index'] == 0:
            torch.save(self.model.state_dict(), self.train_state['model_filename'])
            self.train_state['stop_early'] = False

        # 如果模型性能表现有提升，再次保存
        elif self.train_state['epoch_index'] >= 1:
            loss_tm1, loss_t = self.train_state['val_loss'][-2:]

            # 如果损失增大
            if loss_t >= self.train_state['early_stopping_best_val']:
                # 更新步数
                self.train_state['early_stopping_step'] += 1

            # 损失变小
            else:
                # 保存最优的模型
                if loss_t < self.train_state['early_stopping_best_val']:
                    torch.save(self.model.state_dict(), self.train_state['model_filename'])

                # 重置early stopping 的步数
                self.train_state['early_stopping_step'] = 0

            # 是否需要early stop?
            self.train_state['stop_early'] = self.train_state['early_stopping_step'] \
              >= self.train_state['early_stopping_criteria']
        return self.train_state
  
    def compute_accuracy(self, y_pred, y_target):
        _, y_pred_indices = y_pred.max(dim=1)
        n_correct = torch.eq(y_pred_indices, y_target).sum().item()
        return n_correct / len(y_pred_indices) * 100
    
    def pad_seq(self, seq, length):
        vector = np.zeros(length, dtype=np.int64)
        vector[:len(seq)] = seq
        vector[len(seq):] = self.dataset.vectorizer.title_vocab.mask_index
        return vector
    
    def collate_fn(self, batch):
        
        # 深度拷贝
        batch_copy = copy.deepcopy(batch)
        processed_batch = {"title": [], "title_length": [], "category": []}
        
        # 得到最长序列长度
        get_length = lambda sample: len(sample["title"])
        max_seq_length = max(map(get_length, batch))
        
        # 填充
        for i, sample in enumerate(batch_copy):
            padded_seq = self.pad_seq(sample["title"], max_seq_length)
            processed_batch["title"].append(padded_seq)
            processed_batch["title_length"].append(sample["title_length"])
            processed_batch["category"].append(sample["category"])
            
        # 转换为合适的tensor
        processed_batch["title"] = torch.LongTensor(
            processed_batch["title"])
        processed_batch["title_length"] = torch.LongTensor(
            processed_batch["title_length"])
        processed_batch["category"] = torch.LongTensor(
            processed_batch["category"])
        
        return processed_batch   
  
    def run_train_loop(self):
        for epoch_index in range(self.num_epochs):
            self.train_state['epoch_index'] = epoch_index
      
            # 遍历训练集
            # 初始化批生成器, 将损失和准确率归零，设置为训练模式
            self.dataset.set_split('train')
            batch_generator = self.dataset.generate_batches(
                batch_size=self.batch_size, collate_fn=self.collate_fn, 
                shuffle=self.shuffle, device=self.device)
            running_loss = 0.0
            running_acc = 0.0
            self.model.train()

            for batch_index, batch_dict in enumerate(batch_generator):
                # 梯度归零
                self.optimizer.zero_grad()

                # 计算输出
                y_pred = self.model(batch_dict['title'], batch_dict['title_length'])

                # 计算损失
                loss = self.loss_func(y_pred, batch_dict['category'])
                loss_t = loss.item()
                running_loss += (loss_t - running_loss) / (batch_index + 1)

                # 反向传播
                loss.backward()

                # 更新梯度
                self.optimizer.step()
                
                # 计算准确率
                acc_t = self.compute_accuracy(y_pred, batch_dict['category'])
                running_acc += (acc_t - running_acc) / (batch_index + 1)

            self.train_state['train_loss'].append(running_loss)
            self.train_state['train_acc'].append(running_acc)

            # 遍历验证集
            # 初始化批生成器, 将损失和准确率归零，设置为运行模式
            self.dataset.set_split('val')
            batch_generator = self.dataset.generate_batches(
                batch_size=self.batch_size, collate_fn=self.collate_fn, 
                shuffle=self.shuffle, device=self.device)
            running_loss = 0.
            running_acc = 0.
            self.model.eval()

            for batch_index, batch_dict in enumerate(batch_generator):

                # 计算输出
                y_pred =  self.model(batch_dict['title'], batch_dict['title_length'])

                # 计算损失
                loss = self.loss_func(y_pred, batch_dict['category'])
                loss_t = loss.to("cpu").item()
                running_loss += (loss_t - running_loss) / (batch_index + 1)

                # 计算准确率
                acc_t = self.compute_accuracy(y_pred, batch_dict['category'])
                running_acc += (acc_t - running_acc) / (batch_index + 1)

            self.train_state['val_loss'].append(running_loss)
            self.train_state['val_acc'].append(running_acc)

            self.train_state = self.update_train_state()
            self.scheduler.step(self.train_state['val_loss'][-1])
            if self.train_state['stop_early']:
                break
          
    def run_test_loop(self):
        # 初始化批生成器, 将损失和准确率归零，设置为运行模式
        self.dataset.set_split('test')
        batch_generator = self.dataset.generate_batches(
            batch_size=self.batch_size, collate_fn=self.collate_fn, 
            shuffle=self.shuffle, device=self.device)
        running_loss = 0.0
        running_acc = 0.0
        self.model.eval()

        for batch_index, batch_dict in enumerate(batch_generator):
            # 计算输出
            y_pred =  self.model(batch_dict['title'], batch_dict['title_length'])

            # 计算损失
            loss = self.loss_func(y_pred, batch_dict['category'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # 计算准确率
            acc_t = self.compute_accuracy(y_pred, batch_dict['category'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

        self.train_state['test_loss'] = running_loss
        self.train_state['test_acc'] = running_acc
    
    def plot_performance(self):
        # 设置图大小
        plt.figure(figsize=(15,5))

        # 画出损失
        plt.subplot(1, 2, 1)
        plt.title("Loss")
        plt.plot(trainer.train_state["train_loss"], label="train")
        plt.plot(trainer.train_state["val_loss"], label="val")
        plt.legend(loc='upper right')

        # 画出准确率
        plt.subplot(1, 2, 2)
        plt.title("Accuracy")
        plt.plot(trainer.train_state["train_acc"], label="train")
        plt.plot(trainer.train_state["val_acc"], label="val")
        plt.legend(loc='lower right')

        # 存图
        plt.savefig(os.path.join(self.save_dir, "performance.png"))

        # 展示图
        plt.show()
    
    def save_train_state(self):
        with open(os.path.join(self.save_dir, "train_state.json"), "w") as fp:
            json.dump(self.train_state, fp)

In [43]:
# 初始化
dataset = NewsDataset.load_dataset_and_make_vectorizer(args.split_data_file,
                                                       args.cutoff)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.vectorizer
model = NewsModel(embedding_dim=args.embedding_dim, 
                  num_embeddings=len(vectorizer.title_vocab), 
                  rnn_hidden_dim=args.rnn_hidden_dim,
                  hidden_dim=args.hidden_dim,
                  output_dim=len(vectorizer.category_vocab),
                  num_layers=args.num_layers,
                  bidirectional=args.bidirectional,
                  dropout_p=args.dropout_p, 
                  pretrained_embeddings=None, 
                  padding_idx=vectorizer.title_vocab.mask_index)
print (model.named_modules)

<bound method Module.named_modules of NewsModel(
  (embeddings): Embedding(3406, 100, padding_idx=0)
  (gru): GRU(100, 128, batch_first=True)
  (dropout): Dropout(p=0.1)
  (fc1): Linear(in_features=128, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)>


In [44]:
# 训练
trainer = Trainer(dataset=dataset, model=model, 
                  model_state_file=args.model_state_file, 
                  save_dir=args.save_dir, device=args.device,
                  shuffle=args.shuffle, num_epochs=args.num_epochs, 
                  batch_size=args.batch_size, learning_rate=args.learning_rate, 
                  early_stopping_criteria=args.early_stopping_criteria)
trainer.run_train_loop()

[EPOCH]: 00 | [LR]: 0.001 | [TRAIN LOSS]: 0.75 | [TRAIN ACC]: 70.4% | [VAL LOSS]: 0.54 | [VAL ACC]: 80.3%
[EPOCH]: 01 | [LR]: 0.001 | [TRAIN LOSS]: 0.48 | [TRAIN ACC]: 82.8% | [VAL LOSS]: 0.49 | [VAL ACC]: 82.3%
[EPOCH]: 02 | [LR]: 0.001 | [TRAIN LOSS]: 0.41 | [TRAIN ACC]: 85.1% | [VAL LOSS]: 0.46 | [VAL ACC]: 83.3%
[EPOCH]: 03 | [LR]: 0.001 | [TRAIN LOSS]: 0.37 | [TRAIN ACC]: 86.6% | [VAL LOSS]: 0.47 | [VAL ACC]: 83.3%
[EPOCH]: 04 | [LR]: 0.001 | [TRAIN LOSS]: 0.32 | [TRAIN ACC]: 88.2% | [VAL LOSS]: 0.47 | [VAL ACC]: 83.3%


In [45]:
# 画出训练过程
trainer.plot_performance()

In [46]:
# 测试集上的性能
trainer.run_test_loop()
print("Test loss: {0:.2f}".format(trainer.train_state['test_loss']))
print("Test Accuracy: {0:.1f}%".format(trainer.train_state['test_acc']))

Test loss: 0.47
Test Accuracy: 83.3%


In [47]:
# 保存结果
trainer.save_train_state()

### 预测

In [48]:
class Inference(object):
    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer
  
    def predict_category(self, title):
        # 向量化
        vectorized_title, title_length = self.vectorizer.vectorize(title)
        vectorized_title = torch.tensor(vectorized_title).unsqueeze(0)
        title_length = torch.tensor([title_length]).long()
        
        # 向前传播
        self.model.eval()
        y_pred = self.model(x_in=vectorized_title, x_lengths=title_length, 
                            apply_softmax=True)

        # 可能性最高的分类
        y_prob, indices = y_pred.max(dim=1)
        index = indices.item()

        # 预测出的分类
        category = vectorizer.category_vocab.lookup_index(index)
        probability = y_prob.item()
        return {'category': category, 'probability': probability}
    
    def predict_top_k(self, title, k):
        # 向量化
        vectorized_title, title_length = self.vectorizer.vectorize(title)
        vectorized_title = torch.tensor(vectorized_title).unsqueeze(0)
        title_length = torch.tensor([title_length]).long()
        
        # 向前传播
        self.model.eval()
        y_pred = self.model(x_in=vectorized_title, x_lengths=title_length, 
                            apply_softmax=True)
        
        # 最有可能的K种分类
        y_prob, indices = torch.topk(y_pred, k=k)
        probabilities = y_prob.detach().numpy()[0]
        indices = indices.detach().numpy()[0]

        # 结果
        results = []
        for probability, index in zip(probabilities, indices):
            category = self.vectorizer.category_vocab.lookup_index(index)
            results.append({'category': category, 'probability': probability})

        return results

In [49]:
# 加载模型
dataset = NewsDataset.load_dataset_and_load_vectorizer(
    args.split_data_file, args.vectorizer_file)
vectorizer = dataset.vectorizer
model = NewsModel(embedding_dim=args.embedding_dim, 
                  num_embeddings=len(vectorizer.title_vocab), 
                  rnn_hidden_dim=args.rnn_hidden_dim,
                  hidden_dim=args.hidden_dim,
                  output_dim=len(vectorizer.category_vocab),
                  num_layers=args.num_layers,
                  bidirectional=args.bidirectional,
                  dropout_p=args.dropout_p, 
                  pretrained_embeddings=None, 
                  padding_idx=vectorizer.title_vocab.mask_index)
model.load_state_dict(torch.load(args.model_state_file))
model = model.to("cpu")
print (model.named_modules)

<bound method Module.named_modules of NewsModel(
  (embeddings): Embedding(3406, 100, padding_idx=0)
  (gru): GRU(100, 128, batch_first=True)
  (dropout): Dropout(p=0.1)
  (fc1): Linear(in_features=128, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)>


In [50]:
# 预测
inference = Inference(model=model, vectorizer=vectorizer)
title = input("Enter a title to classify: ")
prediction = inference.predict_category(preprocess_text(title))
print("{} → {} (p={:0.2f})".format(title, prediction['category'], 
                                   prediction['probability']))

Enter a title to classify: : shanghai fashion

shanghai fashion → Sports (p=0.83)


In [51]:
# 可能性最高的K个结果
top_k = inference.predict_top_k(preprocess_text(title), k=len(vectorizer.category_vocab))
print ("{}".format(title))
for result in top_k:
    print ("{} (p={:0.2f})".format(result['category'], 
                                   result['probability']))

shanghai fashion
Sports (p=0.83)
Business (p=0.10)
World (p=0.04)
Sci/Tech (p=0.03)


## 层标准化

在 [卷积神经网络](https://www.kesci.com/home/project/share/5377343963c1a400) 中我们用了批标准化来解决 internal covariate shift 问题。RNN的激活也会遇到类似的问题，这里我们会用到 [层标准化(layer normalization)](https://arxiv.org/abs/1607.06450) 来维持0均值单位方差。

层标准化当然和批标准化不同。这里我们会分别计算每个样本(而不是每一隐藏维)的均值和方差，然后再非线性化前进行操作。PyTorch 的 [LayerNorm 类](https://pytorch.org/docs/stable/nn.html#torch.nn.LayerNorm) 已经帮我们完成了大部分工作。

~$LN = \frac{a - \mu_{L}}{\sqrt{\sigma^2_{L} + \epsilon}}  * \gamma + \beta~$

*其中* :
* ~$a~$ = 激活 | ~$\in \mathbb{R}^{NXH}~$ (~$N~$ 样本数量, ~$H~$ 隐藏维度)
* ~$\mu_{L}~$ = 输入均值| ~$\in \mathbb{R}^{NX1}~$
* ~$\sigma^2_{L}~$ = 输入方差 | ~$\in \mathbb{R}^{NX1}~$
* ~$epsilon~$ = 噪声
* ~$\gamma~$ = 规模参数 (通过学习得到)
* ~$\beta~$ = 偏移参数 (通过学习得到)


<img src="data/batchnorm.png" alt="Drawing" style="width: 600px;"/>

进行层标准化操作最好的为止是在激活时，并且在做非线性操作之前。
但是 PyTorch的[LayerNorm类](https://pytorch.org/docs/stable/nn.html#torch.nn.LayerNorm) 还并没有被默认加入 RNN 相关的代码种。所以你需要用下面的方法自己实现。

```python
# Layernorm
for t in range(seq_size):
    # Normalize over hidden dim
    layernorm = nn.LayerNorm(args.hidden_dim)
    # Activating the module
    a = layernorm(x)
```