In [11]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

## Setup

We're going to download the collected works of Nietzsche to use as our data for this class.

In [1]:
PATH='data/nietzsche/'

In [2]:
# get_data("https://s3.amazonaws.com/text-datasets/nietzsche.txt", f'{PATH}nietzsche.txt')
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))

corpus length: 600893


In [3]:
text[:400]

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

Grab the unique characters (the vocab):

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
print('total chars:', vocab_size)

total chars: 85


Sometimes it's useful to have a zero value in the dataset, e.g. for padding

In [5]:
chars.insert(0, "\0")

''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

Map from chars to indices and back again

In [6]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

*idx* will be the data we use from now on - it simply converts all the characters to their index (based on the mapping above)

In [7]:
idx = [char_indices[c] for c in text]

idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

Check that the mapping is correct:

In [8]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
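The mapping and the round-trip check can be reproduced on a tiny string. This is a minimal sketch that uses a toy corpus (`"hello world"`, an assumption for illustration) in place of the Nietzsche text:

```python
# Toy corpus; the real notebook uses the Nietzsche text instead.
text = "hello world"

chars = sorted(set(text))
chars.insert(0, "\0")  # index 0 is reserved, e.g. for padding

char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode the text as indices, then decode it again.
idx = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in idx)

print(idx)
print(decoded == text)
```

If the two dictionaries are consistent, decoding the indices gives back the original text.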

## Three char model

### Create inputs

Create four lists, each taking every 3rd character (step `cs=3`), starting at the 0th, 1st, 2nd, and 3rd characters.

We want to predict the 4th character from the three characters before it.

With `cs=3`, we skip over 3 characters at a time.

In [9]:
cs=3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

Our inputs

In [15]:
len(c1_dat[:-2])

200295

In [12]:
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

In [18]:
x1.shape

(200295,)

Our output

In [13]:
y = np.stack(c4_dat)

The first 4 inputs and outputs

In [20]:
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [21]:
y[:4]

array([30, 29,  1, 40])

The inputs are 40, 42, 29, and the target to predict is 30.
<img src="images/rnn_example.png" width="100%">


In [22]:
x1.shape, y.shape

((200295,), (200295,))
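To see what the four lists contain, here is a small sketch of the same list comprehensions on made-up indices (the values 10..22 are an assumption chosen so the pattern is easy to read):

```python
idx = list(range(10, 23))  # 13 toy "character indices": 10..22
cs = 3

c1 = [idx[i]     for i in range(0, len(idx) - cs, cs)]
c2 = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
c3 = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
c4 = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]

# Each sample is three consecutive characters plus the one that follows them.
for a, b, c, target in zip(c1, c2, c3, c4):
    print((a, b, c), "->", target)
```

Because the step equals `cs`, the target of one sample is also the first input of the next sample.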

### Create and train model

Pick a size for our hidden state
>To build this model we need to decide how many activations the hidden state has; I'm going to use 256.

<img src="images/rnn_4.png" width="80%">

I put some very important coloured arrows here. All the arrows of the **same color** use the **same matrix** (the same weight matrix).

So all of our input embeddings (character 1, character 2, character 3; **green arrows**) use the same matrix, all of the layers that go from one hidden state to the next (**orange arrows**) use the same orange weight matrix, and our output (**blue arrow**) has its own matrix.

So we have three weight matrices in total: green, orange, and blue.
>The reason for sharing arrows this way: every character should get the same kind of embedding representation, regardless of whether it is the first, second, or third character.

In [23]:
n_hidden = 256

The number of latent factors to create (i.e. the size of the embedding matrix)
>We also need to decide how big our embeddings are going to be. I decided to use 42, about half the number of characters in the vocab (85). You can play around with this to find better numbers; it's experimental.

In [24]:
n_fac = 42

 So let's create a three-character model. We're going to create:
 * one linear layer for our green arrow;
 * one linear layer for our orange arrow;
 * one linear layer for our blue arrow;
 * and one embedding.
 

The embedding takes in something of size 85 (the vocab size) and spits out `n_fac` factors, which we then put through a linear layer (green arrow).

<img src="images/rnn_4.png" width="80%">

```python
        #h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
```

`h = F.tanh(self.l_hidden(in1))`: this line creates the first circle of activations (the orange ellipse), stored in a matrix called `h`. The input goes through the embedding (the orange rectangle in the first layer) -> linear -> relu (green arrow) to give `in1`; we then apply `l_hidden` (orange arrow) to arrive at the orange ellipse at FC2 (`h`).

`h = F.tanh(self.l_hidden(h+in2))`: besides the `h` obtained from character 1 (the second ellipse), we add the character 2 input (`in2`, i.e. after the green arrow), then go through the second orange arrow to reach FC3. Finally we add the character 3 input, apply `l_hidden` once more, and pass the result through `l_out` to get the output.

**Dimensions**
* `self.e(c1)`: `vocab_size` (85) -> `n_fac` (42)
* `self.l_in(self.e(c1))`: `n_fac` -> `n_hidden`
* `self.l_hidden(in1)`: `n_hidden` -> `n_hidden`, so its weight matrix is square
* `(h+in2)` in `h = F.tanh(self.l_hidden(h+in2))`: `h` and `in2` both have size `n_hidden`, so they can be added. `h+in2` is then passed into `l_hidden`, and the resulting `h` still has size `n_hidden`. So the trick is that `l_hidden` is an `n_hidden x n_hidden` square matrix, and `n_hidden` is also the output size of `l_in`.


```python
        h = F.tanh(self.l_hidden(in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
```
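These three lines are only nearly identical: the first one lacks the `h+` term. So we start from an all-zero `h`, `h = V(torch.zeros(in1.size()).cuda())`, which makes all three lines identical in form. We can then rewrite them as a for loop, which eases the transition to an RNN.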

```python
        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
```
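The step from the unrolled lines to a loop can be checked on a toy version. This is a sketch only: scalars stand in for the activation vectors, and `l_hidden` is a hypothetical one-weight layer with made-up values, not the notebook's `nn.Linear`:

```python
import math

def l_hidden(x, w=0.5, b=0.1):
    # Stand-in for the hidden-to-hidden linear layer (scalar toy version).
    return w * x + b

in1, in2, in3 = 0.3, -0.2, 0.7  # pretend outputs of the green-arrow layer

# Unrolled form, starting from a zero hidden state.
h = 0.0
h = math.tanh(l_hidden(h + in1))
h = math.tanh(l_hidden(h + in2))
h = math.tanh(l_hidden(h + in3))
h_unrolled = h

# The same computation written as a loop over the inputs.
h = 0.0
for inp in (in1, in2, in3):
    h = math.tanh(l_hidden(h + inp))
h_loop = h

print(h_unrolled == h_loop)
```

Starting from zeros changes nothing in the first step (`0 + in1` is just `in1`), which is what lets all three steps share one loop body.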



In [25]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)

        # The 'green arrow' from our diagram - the layer operation from input to hidden
        self.l_in = nn.Linear(n_fac, n_hidden)

        # The 'orange arrow' from our diagram - the layer operation from hidden to hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # The 'blue arrow' from our diagram - the layer operation from hidden to output
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        # each character -> embedding -> linear -> relu
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h))

We use a batch size of 512: this dataset is tiny, so we can afford a bigger batch size.

Signature: `ColumnarModelData.from_arrays(path, val_idxs, xs, y, bs=64, test_xs=None, shuffle=True)`

In [26]:
md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

I'm going to create a standard PyTorch model rather than a learner. Because this is plain PyTorch, I have to remember to call `.cuda()` myself.

In [27]:
m = Char3Model(vocab_size, n_fac).cuda()

Grab an iterator over the training set.

We can then call `next` on it to grab a mini-batch, which returns all of our `x` tensors and the `y` tensor.

Then we can use the model as if it were a function by passing it the `Variable` versions of our tensors.



In [28]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

In [29]:
# 3
len(xs)

3

In [30]:
# 512
xs[0].size()

torch.Size([512])

The input isn't actually one-hot encoded; we pass integer indices and use an embedding to act as if it were.

In [31]:
xs[0]


  2
 73
 62
 58
 73
 71
 58
 61
 54
 54
 62
 42
 67
 65
 57
 61
 72
 68
 58
 10
  2
 62
  1
 62
 65
  2
 69
 73
 59
 67
  2
  2
  2
 78
 59
 78
  1
 73
  9
 72
 62
 65
 12
  8
 71
 72
  8
  2
 78
 56
 72
 71
 68
 62
  2
 67
  2
 62
 72
 61
 66
  2
 78
 58
 62
 58
 68
 71
 69
  2
 61
 73
 73
 67
 62
 73
 73
 62
  8
 58
 72
 58
 66
 57
 73
 56
 61
  9
  4
 69
 61
 62
 67
  2
 59
 58
 68
 67
 66
 68
 54
 68
 73
 73
 71
  2
 71
  2
 58
 68
 61
 72
 62
 72
 73
 74
 58
 74
  2
 54
 58
 55
 67
 68
 56
  1
 68
 54
 73
 62
 62
 56
 56
  2
 10
 67
 67
 72
 54
 32
 57
 72
 67
 73
 22
  2
 65
 71
 60
 55
 58
 59
 62
 55
 67
 73
  2
 54
 72
  2
 54
  2
 67
 61
 57
 78
 72
 73
 61
 73
 54
 58
 69
 69
  2
 58
 67
 65
 61
 67
  2
  8
 68
 54
 54
  2
 73
 68
 76
 74
 75
 58
 54
 76
 57
 54
 54
 57
  2
 68
 73
 58
 54
 72
  2
  2
 58
 10
 55
 57
 68
 62
 71
 58
 58
 68
  2
 62
 60
  2
 58
 61
 58
  1
 78
 58
 58
  2
 59
 73
 58
 58
  8
 58
 74
 61
  2
 58
 66
  2
 73
 61
 62
 62
 62
 72
 62
 61
 73
 58

In [33]:
t

Variable containing:
-4.3958 -4.4295 -4.4103  ...  -4.6557 -4.4531 -4.5174
-4.4096 -4.4601 -4.7197  ...  -4.4669 -4.3733 -4.5091
-4.5257 -4.4618 -4.5672  ...  -4.7357 -4.4730 -4.5301
          ...             ⋱             ...          
-4.3848 -4.5626 -4.4671  ...  -4.6800 -4.4856 -4.6089
-4.3360 -4.5479 -4.8019  ...  -4.3477 -4.3131 -4.2772
-4.4628 -4.5725 -4.6111  ...  -4.5197 -4.3768 -4.6698
[torch.cuda.FloatTensor of size 512x85 (GPU 0)]

`return F.log_softmax(self.l_out(h))`

So the output has 85 columns: one value for each possible vocab item, and because of `log_softmax` each value is the log of a probability.
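To make the "log of the probabilities" concrete, here is a small standard-library sketch of log-softmax on a made-up 4-item vocab (the scores are assumptions, and this is not PyTorch's implementation):

```python
import math

def log_softmax(scores):
    # Subtract the max first for numerical stability.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

scores = [2.0, 0.5, -1.0, 0.0]   # raw outputs for a 4-item toy vocab
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

# Index of the most likely item, as np.argmax would return.
best = max(range(len(scores)), key=lambda i: log_probs[i])

print(sum(probs))          # approximately 1.0
print(best)
print(math.log(1 / 85))    # log-probability of a uniform guess over 85 items
```

A uniform guess over 85 items has log-probability of roughly -4.44, which is close to the values the untrained model prints above.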


In [32]:
opt = optim.Adam(m.parameters(), 1e-2)

In [192]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       2.09627  6.52849]                                 



We don't have a learning rate finder and all that stuff because we're not using a learner, so we have to do learning rate annealing manually: set the learning rate a little lower and fit again.




In [193]:
set_lrs(opt, 0.001)

In [194]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.84525  6.52312]                                 



### Test model

In [34]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

* `idxs = T(np.array([char_indices[c] for c in inp]))`: we pass in three characters, e.g. `'y. '`, look up the index of each character in the input, and turn the resulting array (`np.array`) of integers into a tensor (using capital `T`).

* `VV(idxs)`: turn those into variables
* `p = m(*VV(idxs))`: pass them to our model
* `i = np.argmax(to_np(p))`: use `to_np` to turn the output variable into a numpy array, then take the argmax to find the index of the most likely character
* `return chars[i]`: return the character at that index

In [196]:
get_next('y. ')

'T'

In [197]:
get_next('ppl')

'e'

In [198]:
get_next(' th')

'e'

In [199]:
get_next('and')

' '

## Our first RNN!

<img src="images/new_rnn_1.png" width="100%">

Folding the special case of input 1 into the model gives:

<img src="images/new_rnn_2.png" width="100%">



### Create inputs

This is the size of our unrolled RNN.

cs=8

For each starting position in the text, create a list of the 8 consecutive characters that begin there. These overlapping windows will be the 8 inputs to our model.

In [36]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [39]:
len(c_in_dat)

600884

Our outputs will be the next character:


In [40]:
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]

And so we can stack them together:

In [45]:
xs = np.stack(c_in_dat, axis=0)

In [46]:
xs.shape

(600884, 8)

In [47]:
y = np.stack(c_out_dat)

In [48]:
y.shape

(600884,)

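So each row below is one series of 8 characters from the text.

Row 1 holds characters 0-7, row 2 characters 1-8, row 3 characters 2-9. So the last number in each row (e.g. the `1` in row 2, column 8) is the character to predict for the row above it (here, the character that follows characters 0-7).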

In [69]:
xs[:cs,:cs]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

...and this is the next character after each sequence.
`y[:cs]` is the last column of `xs[:cs,:cs]` shifted up by one row:

In [70]:
y[:cs]

array([ 1,  1, 43, 45, 40, 40, 39, 43])
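The relationship between the overlapping windows and their labels is easier to see on toy data. A minimal sketch with made-up indices 100..111 (an assumption for readability):

```python
idx = list(range(100, 112))  # 12 toy indices: 100..111
cs = 8

# One window of cs characters per starting position, plus the character after it.
xs = [[idx[i + j] for i in range(cs)] for j in range(len(idx) - cs)]
y = [idx[j + cs] for j in range(len(idx) - cs)]

for row, target in zip(xs, y):
    print(row, "->", target)
```

Each label is the last element of the following window, so consecutive samples share 7 of their 8 characters.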

### Create and train model

In [42]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [49]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

In [50]:
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

The loop replaces the repeated `h = ...` lines from before.

In [82]:
m = CharLoopModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [83]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       2.02986  1.99268]                                



In [84]:
set_lrs(opt, 0.001)

In [85]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.73588  1.75103]                                 



 We're going to try something else, a trick that Yannet hinted at before: maybe we shouldn't be adding `h` and `inp` together. The reason is that the `input state` and the `hidden state` are qualitatively different kinds of things:
 * the `input state` is the encoding of a single character
 * `h` is the encoding of the series of characters so far, so adding them together is potentially going to **lose information**
 
So below we `concat` them instead of adding them: `inp = torch.cat((h, self.e(c)), 1)`

`l_in` must change accordingly: `self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)`


In [92]:
class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
            h = F.tanh(self.l_hidden(inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

It is usually better to concatenate different types of information than to add them, because adding can lose information.
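A tiny example of how adding can lose information while concatenating does not. This is a sketch with plain lists standing in for tensors; `add` and `concat` are hypothetical helpers, and the vectors are made up:

```python
def add(h, e):
    # Element-wise sum: requires both vectors to have the same length.
    return [a + b for a, b in zip(h, e)]

def concat(h, e):
    # List concatenation: the result has length len(h) + len(e).
    return h + e

# Two different (hidden state, input) pairs...
h_a, e_a = [1.0, 0.0], [0.0, 1.0]
h_b, e_b = [0.0, 1.0], [1.0, 0.0]

print(add(h_a, e_a) == add(h_b, e_b))        # True: the sum cannot tell them apart
print(concat(h_a, e_a) == concat(h_b, e_b))  # False: concatenation keeps them distinct
```

The price of concatenation is a wider input to the next layer, which is why `l_in` takes `n_fac + n_hidden` inputs in the concat model.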

In [93]:
m = CharLoopConcatModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [94]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

In [95]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.81654  1.78501]                                



In [96]:
set_lrs(opt, 1e-4)

In [97]:
fit(m, md, 1, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.69008  1.69936]                                 



### Test model

In [98]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [99]:
get_next('for thos')

'e'

In [100]:
get_next('part of ')

't'

In [101]:
get_next('queens a')

'n'

## RNN with pytorch

In [108]:
class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        # starting point of the loop: an all-zero hidden state
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        # equivalent to the for loop: self.rnn returns outp (the hidden state at every time step, the orange ellipses) and h (the final hidden state)
        outp,h = self.rnn(inp, h)
        # outp stacks the hidden states of all time steps; we only want the final one ([-1])
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

In [109]:
m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [110]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [111]:
t = m.e(V(torch.stack(xs)))
t.size()

torch.Size([8, 512, 42])

In [112]:
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t, ht)
outp.size(), hn.size()

(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))

In [113]:
t = m(*V(xs)); t.size()

torch.Size([512, 85])

In [114]:
fit(m, md, 4, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.86065  1.84255]                                 
[ 1.       1.68014  1.67387]                                 
[ 2.       1.58828  1.59169]                                 
[ 3.       1.52989  1.54942]                                 



In [115]:
set_lrs(opt, 1e-4)

In [116]:
fit(m, md, 2, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.46841  1.50966]                                 
[ 1.       1.46482  1.5039 ]                                 



### Test model

In [117]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [118]:
get_next('for thos')

'e'

Now we can loop, say, 40 times, calling `get_next` each time. Each time we update the input by removing its first character and appending the character we just predicted, so we keep feeding in a fresh set of 8 characters again and again:



In [119]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [120]:
get_next_n('for thos', 40)

'for those the same the same the same the same th'

The output gets stuck in a repeating loop...
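One way to see why this can happen: `get_next` always picks the single most likely character, so the whole procedure is deterministic, and once an input window repeats the output must cycle. A sketch with a hypothetical lookup table (`toy_get_next`, made up for illustration) standing in for the trained model:

```python
def toy_get_next(inp):
    # Hypothetical stand-in for the trained model: always returns the
    # single most likely next character for the last two characters.
    table = {"ab": "c", "bc": "a", "ca": "b"}
    return table[inp[-2:]]

def get_next_n(inp, n):
    res = inp
    for _ in range(n):
        c = toy_get_next(inp)
        res += c
        inp = inp[1:] + c   # drop the first character, append the prediction
    return res

print(get_next_n("ab", 7))
```

After three steps the window is `"ab"` again, so the same three characters repeat from then on.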

## Multi-output model

1. Keep every output separately: the output for character 1 (`outp1`), for character 2 (`outp2`), for character 3 (`outp3`), and so on are each saved:
```python
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        res = []
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
            h = F.tanh(self.l_hidden(inp))
            res.append(F.log_softmax(self.l_out(h)))
        
        return torch.stack(res)
```

2. The earlier approach was inefficient. As shown below, the 8 characters in the second row overlap the first row in 7 characters, yet everything is recomputed for each row. So below we use `non-overlapping sets`.
<img src="images/rnn_analyz.png" width="80%">

### Setup

Let's take non-overlapping sets of characters this time

In [53]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx)-cs-1, cs)]

Then create the exact same thing, offset by 1, as our labels

In [54]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx)-cs, cs)]

In [55]:
xs = np.stack(c_in_dat)
xs.shape

(75111, 8)

In [56]:
ys = np.stack(c_out_dat)
ys.shape

(75111, 8)

In [57]:
xs[:cs,:cs]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [ 1,  1, 43, 45, 40, 40, 39, 43],
       [33, 38, 31,  2, 73, 61, 54, 73],
       [ 2, 44, 71, 74, 73, 61,  2, 62],
       [72,  2, 54,  2, 76, 68, 66, 54],
       [67,  9,  9, 76, 61, 54, 73,  2],
       [73, 61, 58, 67, 24,  2, 33, 72],
       [ 2, 73, 61, 58, 71, 58,  2, 67]])

Now the first and second rows no longer overlap; each row continues where the previous one ended.

In [58]:
ys[:cs,:cs]

array([[42, 29, 30, 25, 27, 29,  1,  1],
       [ 1, 43, 45, 40, 40, 39, 43, 33],
       [38, 31,  2, 73, 61, 54, 73,  2],
       [44, 71, 74, 73, 61,  2, 62, 72],
       [ 2, 54,  2, 76, 68, 66, 54, 67],
       [ 9,  9, 76, 61, 54, 73,  2, 73],
       [61, 58, 67, 24,  2, 33, 72,  2],
       [73, 61, 58, 71, 58,  2, 67, 68]])
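The same construction on toy data makes the offset-by-one labels explicit. A minimal sketch, assuming made-up indices 0..19 and a shorter window of 4:

```python
idx = list(range(20))  # toy indices 0..19
cs = 4

# Inputs: non-overlapping windows. Labels: the same windows shifted by one.
c_in  = [[idx[i + j] for i in range(cs)] for j in range(0, len(idx) - cs - 1, cs)]
c_out = [[idx[i + j] for i in range(cs)] for j in range(1, len(idx) - cs, cs)]

for x, y in zip(c_in, c_out):
    print(x, "->", y)
```

Every position in an input row now has its own label (the next character), so each row yields `cs` predictions instead of one.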

### Create and train model

In [59]:
val_idx = get_cv_idxs(len(xs)-cs-1)

In [60]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

Previously, `return F.log_softmax(self.l_out(outp[-1]), dim=-1)` kept only the last output.

Now we keep all of the outputs: `return F.log_softmax(self.l_out(outp), dim=-1)`

In [61]:
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

In [62]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [63]:
it = iter(md.trn_dl)
*xst,yt = next(it)

The labels are now 512x8: we're trying to predict 8 things every time through.

In [64]:
yt


    2    72    69  ...     67    60    72
   61    58     2  ...     65    62    58
   73    58    21  ...     76    61    68
       ...          ⋱          ...       
    8     2    73  ...      2    57    74
    2    62    72  ...     58    67    56
   66    55    62  ...     68    74    72
[torch.LongTensor of size 512x8]

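So we're going to create a special negative log likelihood loss function for sequences.
* `inp.size()`: sequence length (8), batch size (512), and the size of the last axis, which the code calls `nh`. Here `inp` is the model's output, so that last axis is the vocab size (85), not `n_hidden`.
* `targ.transpose(0,1).contiguous().view(-1)`: flatten our targets (`yt`)
    * `yt.size()` is `512x8`, while the first two axes of `inp` are `8x512`, so we need to transpose
    * `.contiguous()`: `transpose` returns a non-contiguous view, and `view` raises an error without this call
    * `view()`: flatten; `-1` lets PyTorch infer the size from the shape
* `inp.view(-1,nh)`: flatten our input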

In [136]:
def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size()
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1,nh), targ)
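Why the transpose matters can be checked with nested lists. This sketch uses a made-up sequence length of 3 and batch size of 2, and labels each prediction slot by `(time step, sequence)` instead of holding real values:

```python
sl, bs = 3, 2   # toy sequence length and batch size

# Targets as the data loader yields them: one row per sequence (bs x sl).
targ = [[10, 11, 12],
        [20, 21, 22]]

# Model output is ordered (sl, bs, ...); label each slot by (time step, sequence).
pred_slots = [[(t, b) for b in range(bs)] for t in range(sl)]

# Flatten predictions row-major, like inp.view(-1, nh).
flat_pred = [slot for step in pred_slots for slot in step]

# Transpose targets to (sl x bs), then flatten,
# like targ.transpose(0,1).contiguous().view(-1).
flat_targ = [targ[b][t] for t in range(sl) for b in range(bs)]

# Flattening without the transpose gives a different, misaligned order.
naive_targ = [v for row in targ for v in row]

for (t, b), y in zip(flat_pred, flat_targ):
    print(f"step {t}, sequence {b} -> target {y}")
```

With the transpose, position `k` of the flattened targets belongs to the same time step and sequence as position `k` of the flattened predictions.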

In [65]:
n_hidden

256

In [137]:
fit(m, md, 4, opt, nll_loss_seq)

A Jupyter Widget

[ 0.       2.59241  2.40251]                                
[ 1.       2.28474  2.19859]                                
[ 2.       2.13883  2.08836]                                
[ 3.       2.04892  2.01564]                                



In [138]:
set_lrs(opt, 1e-4)

In [139]:
fit(m, md, 1, opt, nll_loss_seq)

A Jupyter Widget

[ 0.       1.99819  2.00106]                               



### Identity init!

In [140]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [66]:
??m.rnn

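[Hinton's paper] Initialising the hidden-to-hidden weights with the identity matrix helps avoid vanishing and exploding gradients: the hidden state gets multiplied by that weight matrix over and over, and multiplying by I leaves a matrix unchanged.

After we've constructed our model `m`, we can just go in:
* `copy_`: replace the values in place
* `torch.eye(n_hidden)`: an identity matrix of size `n_hidden`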



In [141]:
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))


    1     0     0  ...      0     0     0
    0     1     0  ...      0     0     0
    0     0     1  ...      0     0     0
       ...          ⋱          ...       
    0     0     0  ...      1     0     0
    0     0     0  ...      0     1     0
    0     0     0  ...      0     0     1
[torch.cuda.FloatTensor of size 256x256 (GPU 0)]
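A rough intuition for why the identity is a good starting point, as a simplified sketch: it ignores the nonlinearity and the input term, and only looks at what repeated multiplication by `scale * I` does to a single component. `repeat_scale` is a hypothetical helper for illustration.

```python
def repeat_scale(scale, steps, x=1.0):
    # Multiplying a vector by scale * I multiplies each component by `scale`.
    for _ in range(steps):
        x *= scale
    return x

for scale in (0.5, 1.0, 1.5):
    print(scale, repeat_scale(scale, steps=50))
```

With a scale below 1 the value shrinks towards zero, above 1 it blows up, and at exactly 1 (the identity) it is preserved across all 50 steps.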

In [142]:
fit(m, md, 4, opt, nll_loss_seq)

A Jupyter Widget

[ 0.       2.39428  2.21111]                                
[ 1.       2.10381  2.03275]                                
[ 2.       1.99451  1.96393]                               
[ 3.       1.93492  1.91763]                                



In [143]:
set_lrs(opt, 1e-3)

In [144]:
fit(m, md, 4, opt, nll_loss_seq)

A Jupyter Widget

[ 0.       1.84035  1.85742]                                
[ 1.       1.82896  1.84887]                                
[ 2.       1.81879  1.84281]                               
[ 3.       1.81337  1.83801]                                



## Stateful model

<img src="images/rnn_output1.png" width="50%">
Every time a new sequence of 8 time steps starts, the `hidden state` is reset to zero (green arrow); now we want to keep the `hidden state`.

<img src="images/rnn_trunk.png" width="50%">
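Suppose we have corpus data of length 64 million. We split it into 64 chunks of 1 million each and line them up side by side, which gives the 64 x 1m matrix in the figure above. One mini-batch is the blue ellipse: in each mini-batch we process 64 sequences of length bptt in parallel. Each row (each parallel unit) takes bptt characters of text as input and outputs bptt characters as predictions; we then take the next mini-batch (64 x bptt) and continue in parallel.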
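The chunking scheme can be sketched on a toy corpus. The sizes here (24 tokens, 3 chunks, bptt of 4) are assumptions, and the one-chunk-per-row layout follows the description above rather than the exact tensor layout the data loader produces:

```python
corpus = list(range(24))   # 24 toy tokens
bs, bptt = 3, 4            # 3 parallel chunks, 4 time steps per mini-batch

chunk_len = len(corpus) // bs
# One chunk per row: row r holds tokens r*chunk_len .. (r+1)*chunk_len - 1.
chunks = [corpus[r * chunk_len:(r + 1) * chunk_len] for r in range(bs)]

# Each mini-batch takes the next bptt tokens from every chunk.
batches = [[row[s:s + bptt] for row in chunks] for s in range(0, chunk_len, bptt)]

for k, batch in enumerate(batches):
    print("batch", k, batch)
```

Row `r` of each mini-batch picks up exactly where row `r` of the previous mini-batch stopped, which is why carrying the hidden state from one mini-batch to the next makes sense.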

### Setup

Take the last 20% of nietzsche.txt as val.txt, and use the rest as trn.


In [69]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

nietzsche.txt  trn/  val/


In [70]:
%ls {PATH}trn

nietzsche.txt


* `Field`: a field is initially just a description of how to pre-process the text (lowercase, tokenize, ...)
* `list`: our goal is to predict characters rather than words, so we can tokenize simply with `list`
* `n_fac`: embedding size
* `md.trn_dl`: 963 batches to go through. You might expect exactly `(number of tokens: 493747) / (batch size: 64) / (bptt: 8)`, but it isn't, because bptt is not exactly 8 every time; it is randomized to be approximately 8 for each batch
* `md.nt`: number of tokens (how many unique characters)
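The expected batch count mentioned in the list above works out as follows (plain arithmetic on the numbers reported by the notebook):

```python
n_tokens, bs, bptt = 493747, 64, 8

tokens_per_chunk = n_tokens // bs          # length of each of the 64 parallel chunks
approx_batches = tokens_per_chunk / bptt   # batch count if every batch were exactly bptt long

print(tokens_per_chunk, approx_batches)
```

The result is close to, but not exactly, the 963 batches that `len(md.trn_dl)` reports, consistent with bptt varying slightly from batch to batch.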

In [73]:
list('abc')

['a', 'b', 'c']

In [18]:
TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(963, 56, 1, 493747)

In [None]:
TEXT.vocab.itos

### RNN

```python
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)
```
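* `self.init_hidden(bs)`: replaces `h = V(torch.zeros(1, bs, n_hidden))`; `h` now lives on the class and is only initialised once
* `repackage_var`: re-wraps the data of `h` in a fresh Variable. The new Variable carries no record of the operations that produced it, so backpropagation cannot go back through them; in effect gradients are only computed over one sequence of time steps (8). (It's just saying: after one for loop, throw away your history of operations and start afresh. We keep the hidden state, but not the hidden state's history.)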



In [72]:
??repackage_var

* `if self.h.size(1) != bs: self.init_hidden(bs)`: `self.h` has size `(1, batch size, n_hidden)`, so `self.h.size(1)` is the mini-batch size it was created for. We check whether that equals the size of the batch we actually received, and if not, we reset `h` to zeros at the new size. The check is needed because the last mini-batch of each epoch may be much smaller than the rest.
* Loss functions such as `F.nll_loss` are not happy receiving a rank 3 tensor, hence the `.view(-1, self.vocab_size)` at the end [lecture video, ~37 min]
* `F.log_softmax(self.l_out(outp), dim=-1)`: softmax in PyTorch 0.3 requires that we say which axis to do the softmax over. Here we clearly want the last axis (`dim=-1`), because it holds the probability per letter of the alphabet and we want those probabilities to sum to one.




In [21]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [22]:
m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [23]:
fit(m, md, 4, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.81983  1.81247]                                 
[ 1.       1.63097  1.66228]                                 
[ 2.       1.54433  1.57824]                                 
[ 3.       1.48563  1.54505]                                 



In [24]:
set_lrs(opt, 1e-4)

fit(m, md, 4, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.4187   1.50374]                                 
[ 1.       1.41492  1.49391]                                 
[ 2.       1.41001  1.49339]                                 
[ 3.       1.40756  1.486  ]                                 



### RNN loop

In [9]:
# From the pytorch source

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

In [12]:
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []
        o = self.h
        for c in cs: 
            o = self.rnn(self.e(c), o)
            # append this step's hidden state to the list
            outp.append(o)
        # stack the per-step hidden states and apply the output layer
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [13]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [8]:
fit(m, md, 4, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.81013  1.7969 ]                                 
[ 1.       1.62515  1.65346]                                 
[ 2.       1.53913  1.58065]                                 
[ 3.       1.48698  1.54217]                                 



### GRU

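In practice nobody uses the plain `RNNCell`, because even with `tanh` it still suffers from exploding gradients, so you can only train it with a very small learning rate.

In practice people use `GRUCell` instead.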

In [18]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

- WILD ML RNN Tutorial - http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- Chris Olah on LSTM http://colah.github.io/posts/2015-08-Understanding-LSTMs/


<img src="images/lstm_2.png" width="80%">


In [None]:
# From the pytorch source code - for reference

def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)
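The last line of `GRUCell` is a gated blend between the old hidden state and the new candidate. A scalar sketch of that one expression (the values are made up, and `gru_blend` is a hypothetical helper, not part of PyTorch):

```python
def gru_blend(hidden, newgate, inputgate):
    # Same expression as the last line of GRUCell, for scalar toy values.
    # Algebraically: inputgate * hidden + (1 - inputgate) * newgate.
    return newgate + inputgate * (hidden - newgate)

hidden, newgate = 0.75, -0.25

print(gru_blend(hidden, newgate, 1.0))   # gate at 1: keeps the old hidden state
print(gru_blend(hidden, newgate, 0.0))   # gate at 0: takes the new candidate
print(gru_blend(hidden, newgate, 0.5))   # halfway between the two
```

Because the gate is a sigmoid output between 0 and 1, the cell can learn per activation how much of the previous state to carry forward.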

In [27]:
m = CharSeqStatefulGRU(md.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [29]:
fit(m, md, 6, opt, F.nll_loss)

A Jupyter Widget

[ 0.       1.68409  1.67784]                                 
[ 1.       1.49813  1.52661]                                 
[ 2.       1.41674  1.46769]                                 
[ 3.       1.36359  1.43818]                                 
[ 4.       1.33223  1.41777]                                 
[ 5.       1.30217  1.40511]                                 



In [30]:
set_lrs(opt, 1e-4)

In [31]:
fit(m, md, 3, opt, F.nll_loss)

[ 0.       1.22708  1.36926]                                 
[ 1.       1.21948  1.3696 ]                                 
[ 2.       1.22541  1.36969]                                 



### Putting it all together: LSTM

In [12]:
from fastai import sgdr

n_hidden=512

Double the size of the hidden layer, since I've now added 0.5 dropout.


In [22]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

In [23]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
# fastai's LayerOptimizer class
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

The `LayerOptimizer` class supports differential learning rates and differential weight decay.

`lo.opt` gives you the optimizer:

In [None]:
lo.opt

In [18]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [19]:
fit(m, md, 2, lo.opt, F.nll_loss)

[ 0.       1.72032  1.64016]                                 
[ 1.       1.62891  1.58176]                                 



When we call `fit`, we can pass in that optimizer (`lo.opt`) along with some callbacks; here we use the cosine annealing callback. That callback requires a `LayerOptimizer` object (`lo`), and it performs cosine annealing by changing the learning rate inside `lo`.



In [20]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)

[ 0.       1.47969  1.4472 ]                                 
[ 1.       1.51411  1.46612]                                 
[ 2.       1.412    1.39909]                                 
[ 3.       1.53689  1.48337]                                 
[ 4.       1.47375  1.43169]                                 
[ 5.       1.39828  1.37963]                                 
[ 6.       1.34546  1.35795]                                 
[ 7.       1.51999  1.47165]                                 
[ 8.       1.48992  1.46146]                                 
[ 9.       1.45492  1.42829]                                 
[ 10.        1.42027   1.39028]                              
[ 11.        1.3814    1.36539]                              
[ 12.        1.33895   1.34178]                              
[ 13.        1.30737   1.32871]                              
[ 14.        1.28244   1.31518]                              
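
The epoch count `2**4-1` and the pattern in the losses above both follow from `cycle_mult=2`. The sketch below assumes the usual cosine schedule (the learning rate falls from its initial value to zero over a cycle, then restarts) and a first cycle of one epoch; `cosine_lr` and `cycle_starts` are hypothetical helpers written for this note, not fastai functions.

```python
import math

def cosine_lr(base_lr, iteration, cycle_len):
    """Cosine-annealed learning rate at a given iteration within one cycle."""
    return base_lr * (1 + math.cos(math.pi * iteration / cycle_len)) / 2

def cycle_starts(n_cycles, cycle_mult=2):
    """Starting epoch of each cycle, plus the total number of epochs."""
    starts, epoch, length = [], 0, 1
    for _ in range(n_cycles):
        starts.append(epoch)
        epoch += length          # this cycle runs for `length` epochs
        length *= cycle_mult     # the next cycle is cycle_mult times longer
    return starts, epoch

starts, total = cycle_starts(4)
print(starts, total)
```

Four cycles of 1, 2, 4 and 8 epochs add up to 15 epochs, with restarts at epochs 1, 3 and 7. Those are the epochs where the training loss jumps back up in the output above, because the learning rate has just been reset to its initial value.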



In [44]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

[ 0.       1.46053  1.43462]                                 
[ 1.       1.51537  1.47747]                                 
[ 2.       1.39208  1.38293]                                 
[ 3.       1.53056  1.49371]                                 
[ 4.       1.46812  1.43389]                                 
[ 5.       1.37624  1.37523]                                 
[ 6.       1.3173   1.34022]                                 
[ 7.       1.51783  1.47554]                                 
[ 8.       1.4921   1.45785]                                 
[ 9.       1.44843  1.42215]                                 
[ 10.        1.40948   1.40858]                              
[ 11.        1.37098   1.36648]                              
[ 12.        1.32255   1.33842]                              
[ 13.        1.28243   1.31106]                              
[ 14.        1.25031   1.2918 ]                              
[ 15.        1.49236   1.45316]                              
[ 16.   

### Test

In [45]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [46]:
get_next('for thos')

'e'
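
`get_next` does not take the most likely character. The model outputs log-probabilities, `.exp()` turns them back into probabilities, and `torch.multinomial` draws one index at random in proportion to them. A standard-library sketch of the same idea, where `sample_index` and the toy distribution are assumptions made for illustration:

```python
import math
import random

def sample_index(log_probs, rng=random):
    """Draw one index with probability proportional to exp(log_prob)."""
    weights = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy distribution over three "characters"
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_index(log_probs, rng)] += 1
print(counts)
```

Sampling instead of taking the argmax is why repeated calls with the same input can return different characters, and it keeps the generated text from looping on the single most likely continuation.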

In [47]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res
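
The loop in `get_next_n` keeps the model input at a fixed length: each step drops the oldest character and appends the one just generated. The sketch below replays that window logic with a hypothetical stand-in for the model (`echo_first`, which simply returns the first character of the window) so that it runs without a trained network.

```python
def generate(seed, n, next_char):
    """Append n characters, feeding next_char a window of constant length."""
    res, window = seed, seed
    for _ in range(n):
        c = next_char(window)
        res += c
        window = window[1:] + c   # drop the oldest char, append the new one
    return res

def echo_first(window):
    """Stand-in for the model: 'predict' the first character of the window."""
    return window[0]

print(generate('abc', 5, echo_first))
```
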

In [50]:
print(get_next_n('for thos', 400))

for those the skemps), or
imaginates, though they deceives. it should so each ourselvess and new
present, step absolutely for the
science." the contradity and
measuring, 
the whole!

293. perhaps, that every life a values of blood
of
intercourse when it senses there is unscrupulus, his very rights, and still impulse, love?
just after that thereby how made with the way anything, and set for harmless philos
