In [3]:
import torch

torch.__version__, torch.cuda.is_available()

('1.3.0', True)

# Recurrent Neural Networks

- Chris Olah [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Chris Olah, Shan Carter [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns)
- Andreas Madsen [Visualizing memorization in RNNs](https://distill.pub/2019/memorization-in-rnns)
- Chris Nicholson [A Beginner's Guide to LSTMs and Recurrent Neural Networks](https://skymind.ai/wiki/lstm)
- Michael Nguyen [Illustrated Guide to Recurrent Neural Networks](https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)
- Michael Nguyen [Illustrated Guide to LSTM’s and GRU’s: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

---

- [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/abs/1803.01271)
  > "_For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at this http URL._"
  
---

## Motivation

Natural language does not usually come in neatly packaged fixed-length sequences. The simple feed-forward neural network from part 1, however, assumed that we can cut the documents down to 100 words. For some applications and data sources this may be appropriate, for others not. For instance, cutting tweets down to 100 or 200 words would not be that unreasonable, but cutting Wikipedia articles would.

Recurrent neural networks get around this limitation on the sequence length, at least to some extent. Given infinite time they can consume an infinite sequence. The network maintains an internal state that keeps track of what is important and what isn't. There are [lots of different kinds of recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network); the main difference between them is the way in which the internal state is updated and maintained.

In [450]:
import utils

gnad_train, gnad_test = utils.load_gnad()

In [451]:
# what fraction of the GNAD training documents is longer than our 100-word threshold?
sum(len(d.split()) > 100 for d in gnad_train.text) / len(gnad_train)

0.9415900486749594

In [452]:
long_articles = [idx for idx, d in enumerate(gnad_train.text) if len(d.split()) > 100]
print(gnad_train.text[long_articles[13]])

Österreich bleibt in der EM-Qualifikation für 2016 ungeschlagen, ja, feiert in und gegen Russland den sechsten Auswärtssieg en suite. Marc Janko macht mit einem Traum von einem Tor die Reise nach Frankreich beinahe zur Gewissheit. Moskau - Sie wollten unmittelbar vor dem Urlaub die allerletzten Kräfte mobilisieren, im abschließenden Saisonspiel einen Kampf bis zum Umfallen abliefern, den Russen zeigen, wer Tabellenführer der Gruppe G ist. Und die österreichische Fußballnationalmannschaft mobilisierte, zeigte. Umgefallen ist sie nicht. Teamchef Marcel Koller konnte, vom verletzten David Alaba abgesehen (wurde durch Stefan Ilsanker ersetzt), aus dem Vollen schöpfen. Martin Harnik ist fit geworden, also lief die Einsergarnitur ein. Der Boss hatte den Auftrag erteilt, hinten kompakt zu stehen, aber in erster Linie Fußball zu spielen. Und sie spielten von Anpfiff an. Russland kam gar nicht zum Schauen, fand keine Zeit, eine Ordnung zu finden. Die Gäste pressten, drückten, kombinierten. Sie 

---

# The Idea Behind RNNs

![](./img/1280px-Recurrent_neural_network_unfold.svg.png)


<sub>Image By François Deloche, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60109157</sub>

# Elman Network

$$ h_t = \sigma{(W_hx_t + U_hh_{t-1}+b_h)} $$
$$ y_t = \sigma{(W_yh_t + b_y)}$$
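
The two equations can be written out as a short loop. The following is a minimal NumPy sketch, not the PyTorch implementation: the layer sizes, the random weights and the all-zero initial state $h_0$ are assumptions made for illustration, and $\sigma$ is taken to be the logistic sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(xs, W_h, U_h, b_h, W_y, b_y):
    """Run an Elman network over a sequence xs of shape (seq_len, input_size)."""
    h = np.zeros(U_h.shape[0])  # h_0: assumed to be all zeros
    hs, ys = [], []
    for x_t in xs:
        h = sigmoid(W_h @ x_t + U_h @ h + b_h)  # h_t = sigma(W_h x_t + U_h h_{t-1} + b_h)
        y = sigmoid(W_y @ h + b_y)              # y_t = sigma(W_y h_t + b_y)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# toy dimensions and random weights (illustrative only)
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 4, 3, 2, 5
W_h = rng.normal(size=(hidden_size, input_size))
U_h = rng.normal(size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_y = rng.normal(size=(output_size, hidden_size))
b_y = np.zeros(output_size)

xs = rng.normal(size=(seq_len, input_size))
hs, ys = elman_forward(xs, W_h, U_h, b_h, W_y, b_y)
print(hs.shape, ys.shape)  # (5, 3) (5, 2)
```

The same weights are reused at every time step; only the hidden state `h` changes. Note that PyTorch's `nn.RNN` uses `tanh` as its default non-linearity and returns only the hidden states, so an output layer such as $y_t$ has to be added separately.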

---

# An Elman Network in Practice

PyTorch has an implementation of an Elman network as the base `nn.RNN` class. The class enables uni- or bi-directional training and supports an arbitrary number of network layers.

In [None]:
from gensim.models import KeyedVectors

import numpy as np

import torch
from torch import nn


ft_vec = KeyedVectors.load_word2vec_format('./cc.de.300.vec.SMALL.gz')
word2idx = dict((w, idx+1) for (idx, w) in enumerate(ft_vec.index2word))

In [471]:
# row 0 is a random vector for unknown words (UNK); the fastText vectors follow
UNK = np.random.rand(1, ft_vec.vectors.shape[1])
vectors = np.concatenate([UNK, ft_vec.vectors], axis=0)
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
rnn = nn.RNN(ft_vec.vectors.shape[1], 10, num_layers=1)

sent = 'Die neue BER wird bald geöffnet'.split()
word_idx = [word2idx[w] for w in sent]

In [472]:
len(word2idx)

175121

In [457]:
word_idx

[88478, 91678, 142125, 84236, 251, 38088]

In [460]:
# we can feed individual documents through the embedding layer ...
emb(torch.LongTensor(word_idx))

tensor([[ 0.0138, -0.0134, -0.0165,  ..., -0.0256,  0.0201, -0.0439],
        [-0.1180, -0.0394, -0.0214,  ...,  0.0711,  0.0084, -0.0139],
        [ 0.0351, -0.0030,  0.0232,  ...,  0.0440, -0.0155, -0.0054],
        [-0.0824,  0.1310, -0.1151,  ...,  0.0324,  0.0836,  0.0936],
        [-0.0111,  0.0363, -0.0078,  ..., -0.0070,  0.0040, -0.0265],
        [ 0.0368, -0.0124,  0.0299,  ...,  0.0051,  0.0385, -0.0078]])

In [461]:
# ... and then feed the embedded document through the RNN
output, hn = rnn(emb(torch.LongTensor([word_idx])))

In [462]:
output.size()

torch.Size([1, 6, 10])

In [463]:
hn.size()

torch.Size([1, 6, 10])
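
The identical sizes of `output` and `hn` are worth a second look. By default `nn.RNN` reads its input as `(seq_len, batch, features)`, so a tensor of shape `[1, 6, 300]` is treated as a batch of six sequences of length one. The sketch below uses a random stand-in tensor instead of the real embeddings (an assumption for illustration) and contrasts the default with `batch_first=True`.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(1, 6, 300)  # random stand-in for emb(torch.LongTensor([word_idx]))

# default (batch_first=False): dimensions are read as (seq_len, batch, features),
# i.e. a batch of 6 sequences, each of length 1
rnn_default = nn.RNN(300, 10, num_layers=1)
out_default, hn_default = rnn_default(x)
print(out_default.size(), hn_default.size())

# batch_first=True: dimensions are read as (batch, seq_len, features),
# i.e. a single sequence of 6 words
rnn_bf = nn.RNN(300, 10, num_layers=1, batch_first=True)
out_bf, hn_bf = rnn_bf(x)
print(out_bf.size(), hn_bf.size())
```

With `batch_first=True` the final hidden state has size `[1, 1, 10]` and equals the last time step of `output`, which is probably what one wants when a sentence is supposed to be a single sequence.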

In [464]:
output

tensor([[[-0.1940, -0.0908, -0.1706,  0.3068,  0.4633, -0.4665,  0.5681,
           0.4381,  0.3606,  0.0855],
         [-0.3440,  0.0154, -0.2777,  0.1304,  0.3534, -0.5114,  0.4618,
           0.5423,  0.4515,  0.1751],
         [-0.1124,  0.0558, -0.1972,  0.3448,  0.3629, -0.5432,  0.5532,
           0.5761,  0.4233,  0.0113],
         [-0.2124,  0.2233,  0.1864,  0.4236,  0.0920, -0.4740,  0.2988,
           0.5289,  0.1059,  0.1892],
         [-0.3647,  0.4744, -0.2364,  0.4440,  0.5096, -0.4413,  0.5552,
           0.5197,  0.3503, -0.1133],
         [-0.1695,  0.0051, -0.1974,  0.2503,  0.4625, -0.4975,  0.5649,
           0.6053,  0.3775,  0.0334]]], grad_fn=<StackBackward>)

In [76]:
hn

tensor([[[ 0.0699,  0.4773, -0.2269,  0.0485,  0.3940, -0.6445, -0.1033,
          -0.4411, -0.2875, -0.1820],
         [ 0.4298,  0.4310,  0.2294,  0.3335, -0.0151, -0.5991, -0.5415,
          -0.2417, -0.4159, -0.5094],
         [-0.1932,  0.1373, -0.6091,  0.3863,  0.1393, -0.2915, -0.2880,
          -0.3617, -0.8717, -0.1275],
         [ 0.5143,  0.7247,  0.0362,  0.5913, -0.2482, -0.6949,  0.3882,
          -0.2663,  0.0628,  0.2348],
         [-0.0294,  0.0504, -0.2885, -0.1780,  0.2908, -0.3357, -0.2692,
          -0.0353,  0.0394, -0.2128],
         [ 0.3471,  0.2983, -0.0493,  0.2738,  0.3065, -0.1331, -0.0619,
          -0.6883, -0.6351, -0.3907],
         [-0.2459,  0.0986, -0.3187, -0.2063,  0.3427,  0.1184, -0.2180,
          -0.1361, -0.6476, -0.2689]],

        [[-0.0255, -0.2716, -0.1416,  0.3043,  0.2569, -0.1661, -0.1401,
          -0.3326,  0.1700, -0.0462],
         [-0.3487, -0.4301,  0.0090,  0.1152,  0.2997, -0.0407, -0.2095,
          -0.6039,  0.1502,  0.0897],

In [48]:
UNK = torch.FloatTensor().random_()
emb = nn.Embedding.from_pretrained(torch.FloatTensor(ft_vec.vectors))
sent = news_train.data[long_articles[12]].split()
word_idx = [word2idx[w] for w in sent]
output, hn = rnn(emb(torch.LongTensor([word_idx])))

KeyError: 'Actually,'
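
The `KeyError` comes from looking up a token that is not in the vocabulary. Because `word2idx` starts counting at 1, index 0 is free to stand for unknown words. A minimal sketch of such a lookup, using a hypothetical toy vocabulary rather than the fastText one:

```python
# toy vocabulary (hypothetical); index 0 is reserved for unknown words
toy_word2idx = {'Die': 1, 'neue': 2, 'wird': 3, 'bald': 4}
UNK_IDX = 0

def encode(tokens, vocab, unk_idx=UNK_IDX):
    """Map tokens to indices, sending out-of-vocabulary tokens to unk_idx."""
    return [vocab.get(tok, unk_idx) for tok in tokens]

print(encode('Die neue BER wird bald geöffnet'.split(), toy_word2idx))
# [1, 2, 0, 3, 4, 0]
```

For this to work the embedding matrix needs a row 0 for the unknown-word vector, which is what the `UNK` row concatenated in front of the fastText vectors provides.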

---

Load German word embeddings from fastText

In [2]:
from gensim.models import KeyedVectors
ft_vec_DE = KeyedVectors.load_word2vec_format('./cc.de.300.vec.gz')

In [3]:
word2idx = dict((w, idx+1) for (idx, w) in enumerate(ft_vec_DE.index2word))

In [4]:
from sklearn.preprocessing import LabelEncoder
import utils

gnad_train, gnad_test = utils.load_gnad()
label_encoder = LabelEncoder()

# turn all the data into integer indices
X_train = [[word2idx.get(w, 0) for w in doc.split()] for doc in gnad_train.text]
y_train = label_encoder.fit_transform(gnad_train.category)

In [5]:
import numpy as np
from torch import nn

UNK = np.random.rand(1, 300)
vectors = np.concatenate([UNK, ft_vec_DE.vectors], axis=0)

# A strange training loop (??)

Let's train an LSTM classifier that uses the last hidden state from each sequence (document) as the representation for a classification task.

In [314]:
from time import time
from torch.nn.utils import clip_grad_norm_

bidirectional = False
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
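# note: dropout only acts between stacked LSTM layers, so with num_layers=1 it has no effect
lstm = nn.LSTM(vectors.shape[1], 64, num_layers=1, bidirectional=bidirectional, dropout=0.01)
classifier = nn.Linear(lstm.hidden_size if not bidirectional else lstm.hidden_size * 2, len(label_encoder.classes_))
# optimize the classifier together with the LSTM; the pre-trained embeddings stay frozen
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(classifier.parameters()), lr=1e-2)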


In [318]:
num_epochs = 1
h, c = None, None
lossfct = nn.CrossEntropyLoss()
for _ in range(num_epochs):
    train_loss = 0
    start_time = time()
    for i_step, (X_, y_) in enumerate(zip(X_train, y_train), 1):
        X_ = torch.LongTensor(X_)
        y_ = torch.LongTensor([y_])
        
        # run the word indices through the embedding layer and then the LSTM
        embed = emb(X_).unsqueeze(dim=1)
        output, *_ = lstm(embed)
        output = classifier(output[-1, :, :])
        loss = lossfct(output, y_)
        train_loss += loss.item()
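        loss.backward()  # gradients accumulate across documents until the next optimizer step
        if i_step > 0 and i_step % 64 == 0:
            delta = time() - start_time
            avg_delta = delta // (i_step // 64)
            # every 64 documents: clip the accumulated gradients and update the weights
            clip_grad_norm_(lstm.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()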
            n_total = len(X_train) // 64
            n_remaining = n_total - (i_step // 64)
            print('iter', i_step // 64, '/', n_total, f'{delta:.2f}s', f'{avg_delta*n_remaining:.2f}s', f'{train_loss / i_step:.4f}')

iter 1 / 144 7.87s 1001.00s 2.2111
iter 2 / 144 15.06s 994.00s 2.2263
iter 3 / 144 21.86s 987.00s 2.2287
iter 4 / 144 27.39s 840.00s 2.2228
iter 5 / 144 33.10s 834.00s 2.2213
iter 6 / 144 39.65s 828.00s 2.2114
iter 7 / 144 45.28s 822.00s 2.2038
iter 8 / 144 52.30s 816.00s 2.2002
iter 9 / 144 60.54s 810.00s 2.1944
iter 10 / 144 66.95s 804.00s 2.1845
iter 11 / 144 73.32s 798.00s 2.1809
iter 12 / 144 80.35s 792.00s 2.1778
iter 13 / 144 87.38s 786.00s 2.1684
iter 14 / 144 94.31s 780.00s 2.1664
iter 15 / 144 101.28s 774.00s 2.1591
iter 16 / 144 108.15s 768.00s 2.1529
iter 17 / 144 115.09s 762.00s 2.1449
iter 18 / 144 121.91s 756.00s 2.1424
iter 19 / 144 129.18s 750.00s 2.1438
iter 20 / 144 135.88s 744.00s 2.1431
iter 21 / 144 142.04s 738.00s 2.1436
iter 22 / 144 148.78s 732.00s 2.1435
iter 23 / 144 155.89s 726.00s 2.1443
iter 24 / 144 161.91s 720.00s 2.1411


KeyboardInterrupt: 

What makes the training loop strange?

- iterate through the dataset one document at a time

Iterating through a data set one item at a time is very inefficient and unlikely to yield good performance (_Stochastic_ GD). However, batching together variable-length sequences requires a little bit of work.

---

In order to create batches, or mini-batches, from variable-sized "_tensors_" we first need to make all tensors the same length, either by padding short documents, by truncating long documents, or both.

Then, in the training loop we remove the padding before the instances are passed to the RNN.
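
As a small illustration of padding and truncating, here is a NumPy sketch. The helper `pad_or_truncate`, the toy documents and the choice of `max_len=4` are made up for this example; the notebook itself only pads.

```python
import numpy as np

def pad_or_truncate(docs, max_len, pad_idx=0):
    """Return a (n_docs, max_len) index matrix and the effective length of each document."""
    data = np.full((len(docs), max_len), pad_idx, dtype=np.int64)
    lengths = np.zeros(len(docs), dtype=np.int64)
    for i, doc in enumerate(docs):
        doc = doc[:max_len]        # truncate long documents
        data[i, :len(doc)] = doc   # short documents keep their trailing padding
        lengths[i] = len(doc)
    return data, lengths

docs = [[5, 3, 8], [7], [2, 9, 4, 6, 1, 3]]
data, lengths = pad_or_truncate(docs, max_len=4)
print(data)
print(lengths)
```

Keeping the lengths next to the padded matrix matters here, because index 0 is also used for unknown words in this notebook, so padding cannot be recognised from the index values alone.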

In [112]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# record document lengths
doc_lengths = [len(doc) for doc in X_train]
longest_doc = max(doc_lengths)

# pad all documents to the length of the longest document
# notice that this is very inefficient from a memory perspective
# (np.int was removed from recent NumPy releases, so use np.int64)
data = np.zeros((len(X_train), longest_doc), dtype=np.int64)
for i_doc, doc in enumerate(X_train):
    data[i_doc, :len(doc)] = doc

dataset = TensorDataset(torch.LongTensor(data), torch.LongTensor(doc_lengths), torch.LongTensor(y_train))
dataloader = DataLoader(dataset, batch_size=256)
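
One possible way around the memory cost of padding everything to the longest document is to pad per batch instead. The sketch below is an alternative to the cell above, not what the notebook does: the `collate` function and the toy data are assumptions, while `pad_sequence` and the `collate_fn` argument are part of PyTorch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    """Pad the documents of one batch to the longest document in that batch."""
    docs, labels = zip(*batch)
    lengths = torch.tensor([len(d) for d in docs])
    padded = pad_sequence([torch.tensor(d) for d in docs], batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

# toy data standing in for X_train / y_train
toy_X = [[4, 2, 7], [1], [3, 3, 3, 3, 3], [9, 8]]
toy_y = [0, 1, 0, 2]
loader = DataLoader(list(zip(toy_X, toy_y)), batch_size=2, collate_fn=collate)

for padded, lengths, labels in loader:
    print(padded.size(), lengths.tolist(), labels.tolist())
```

Each batch is only as wide as its own longest document, so a single very long article no longer determines the size of every batch.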

In [113]:
dataset[0][0].size()

torch.Size([3063])

In [114]:
dataset[:5]

(tensor([[132813, 130692,  82087,  ...,      0,      0,      0],
         [     0,  85372, 104301,  ...,      0,      0,      0],
         [ 28850, 148805,  88260,  ...,      0,      0,      0],
         [     0,      0, 143159,  ...,      0,      0,      0],
         [104810,  27421, 144802,  ...,      0,      0,      0]]),
 tensor([ 63, 578, 277,  97, 619]),
 tensor([5, 3, 6, 7, 1]))

In [115]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

X_tmp, len_tmp, _ = dataset[:4]
embed = emb(X_tmp)

packed = pack_padded_sequence(embed, len_tmp, enforce_sorted=False, batch_first=True)
output, *_ = lstm(packed)
X, _ = pad_packed_sequence(output, batch_first=True)

X.size()

torch.Size([4, 578, 128])

## hmmm ...

That's a little strange: we passed in something that has `3063` columns and got back something with only `578`. Why?

---

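The LSTM implementation in PyTorch takes the packed padded sequence and applies the LSTM to _only_ the non-padded part of each document in the batch. `pad_packed_sequence` then pads the result only up to the longest document in the batch, which is `578` tokens here (see the document lengths above), rather than to the original `3063` columns.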
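
The effect can be reproduced on a small scale. This is a sketch with a toy LSTM and random inputs (sizes and lengths are chosen arbitrarily), showing that the time dimension of the unpacked output follows the longest sequence in the batch.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
toy_lstm = nn.LSTM(input_size=8, hidden_size=4, batch_first=True)

# 3 sequences padded to 10 time steps, but the longest real sequence has 6 steps
x = torch.randn(3, 10, 8)
lengths = torch.tensor([2, 6, 4])

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = toy_lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.size())                     # time dimension is max(lengths) = 6, not 10
print(out_lengths)                    # the original lengths, in the original order
print(out[0, 2:].abs().sum().item())  # positions past a sequence's length are zero-filled
```

If the original width is needed, `pad_packed_sequence` accepts a `total_length` argument that pads the output back to a fixed number of time steps.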

How do we find the _last hidden state_ of each document in the batch?

In [132]:
import numpy as np

X = np.asarray([list(range(1, n)) + [0] * (10 - n) for n in [5, 6, 7, 8, 9]])
X

array([[1, 2, 3, 4, 0, 0, 0, 0, 0],
       [1, 2, 3, 4, 5, 0, 0, 0, 0],
       [1, 2, 3, 4, 5, 6, 0, 0, 0],
       [1, 2, 3, 4, 5, 6, 7, 0, 0],
       [1, 2, 3, 4, 5, 6, 7, 8, 0]])

In [135]:
doc_lengths = np.asarray([4, 5, 6, 7, 8])
X[range(0, 5), doc_lengths - 1]

array([4, 5, 6, 7, 8])
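
The same indexing trick carries over to the LSTM output. The sketch below uses a toy unidirectional LSTM with random inputs (all sizes are assumptions for illustration) and checks that the gathered vectors agree with the final hidden state `h_n` that PyTorch returns.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
toy_lstm = nn.LSTM(input_size=8, hidden_size=4, batch_first=True)  # unidirectional, 1 layer

x = torch.randn(3, 10, 8)
lengths = torch.tensor([2, 6, 4])

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = toy_lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)

# pick, for every document, the output at its last real time step
last = out[torch.arange(out.size(0)), lengths - 1, :]

# for a unidirectional LSTM this matches the final hidden state of the top layer
print(torch.allclose(last, h_n[-1]))
```

One caveat for bidirectional LSTMs: the backward direction finishes at the first time step, so the slice at `lengths - 1` holds the backward state after it has seen only the last token. In that setting `h_n`, which contains the final state of both directions, may be the more faithful summary.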

---

In [17]:
from time import time
from torch.nn.utils import clip_grad_norm_

bidirectional = True
# from_pretrained freezes the embedding weights by default; pass freeze=False to fine-tune them
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
lstm = nn.LSTM(vectors.shape[1], 64, num_layers=4, bidirectional=bidirectional, dropout=0.01)
classifier = nn.Linear(lstm.hidden_size if not bidirectional else lstm.hidden_size * 2, len(label_encoder.classes_))
optimizer_parameters = list(emb.parameters()) + list(lstm.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(optimizer_parameters, lr=1e-2)

In [18]:
from torch.nn.utils import clip_grad_norm_
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from time import time
from datetime import datetime

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

num_epochs = 100
lossfct = nn.CrossEntropyLoss().to(DEVICE)
start_time = time()
for i_epoch in range(num_epochs):
    train_loss = 0
    
    lstm.train()
    classifier.train()
    emb.to(DEVICE)
    lstm.to(DEVICE)
    classifier.to(DEVICE)
    for i_step, batch in enumerate(dataloader, 1):
        X_batch, lengths, y_batch = (b.to(DEVICE) for b in batch)
        # run the word indices through the embedding layer and then the LSTM
        embed = emb(X_batch)
        
        # run the embeddings through the LSTM
        # (recent PyTorch versions expect the lengths tensor on the CPU)
        packed = pack_padded_sequence(embed, lengths.cpu(), enforce_sorted=False, batch_first=True)
        output, *_ = lstm(packed)
        output, _ = pad_packed_sequence(output, batch_first=True)

        # run the encoded last hidden state through the classifier
        last_hidden_states = output[torch.arange(0, X_batch.size()[0]), lengths-1, :]
        output = classifier(last_hidden_states)

        loss = lossfct(output, y_batch)
        train_loss += loss.item()
        loss.backward()
        clip_grad_norm_(lstm.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
        
    delta = time() - start_time
    avg_delta = delta / (i_epoch + 1)
    # note: the %M:%S format below drops the hours, so durations wrap around after 60 minutes
    remaining = avg_delta * (num_epochs - i_epoch)
    print('epoch', i_epoch, '/', num_epochs, f'{train_loss / i_step:.4f}',
          f'avg epoch {datetime.fromtimestamp(avg_delta):%M:%S}',
          f'total elapsed {datetime.fromtimestamp(delta):%M:%S}',
          f'remaining {datetime.fromtimestamp(remaining):%M:%S}')

epoch 0 / 100 2.0936 avg epoch 00:43 total elapsed 00:43 remaining 12:42
epoch 1 / 100 2.1721 avg epoch 00:43 total elapsed 01:26 remaining 11:39
epoch 2 / 100 2.1082 avg epoch 00:43 total elapsed 02:10 remaining 10:49
epoch 3 / 100 2.0824 avg epoch 00:43 total elapsed 02:53 remaining 10:02
epoch 4 / 100 1.9883 avg epoch 00:43 total elapsed 03:36 remaining 09:17
epoch 5 / 100 1.7496 avg epoch 00:43 total elapsed 04:19 remaining 08:31
epoch 6 / 100 1.5657 avg epoch 00:43 total elapsed 05:02 remaining 07:46
epoch 7 / 100 1.3766 avg epoch 00:43 total elapsed 05:46 remaining 07:03
epoch 8 / 100 1.2049 avg epoch 00:43 total elapsed 06:29 remaining 06:19
epoch 9 / 100 1.0891 avg epoch 00:43 total elapsed 07:12 remaining 05:36
epoch 10 / 100 1.0291 avg epoch 00:43 total elapsed 07:55 remaining 04:52
epoch 11 / 100 0.9333 avg epoch 00:43 total elapsed 08:39 remaining 04:09
epoch 12 / 100 0.7869 avg epoch 00:43 total elapsed 09:22 remaining 03:25
epoch 13 / 100 0.6904 avg epoch 00:43 total elap

In [19]:
from sklearn.preprocessing import LabelEncoder
import utils

# turn all the data into integer indices
X_test = [[word2idx.get(w, 0) for w in doc.split()] for doc in gnad_test.text]
y_test = label_encoder.transform(gnad_test.category)

doc_lengths = [len(doc) for doc in X_test]
longest_doc = max(doc_lengths)

data = np.zeros((len(X_test), longest_doc), dtype=np.int64)
for i_doc, doc in enumerate(X_test):
    data[i_doc, :len(doc)] = doc

test_dataset = TensorDataset(torch.LongTensor(data), torch.LongTensor(doc_lengths), torch.LongTensor(y_test))
test_dataloader = DataLoader(test_dataset, batch_size=32)

In [20]:
from torch.nn import functional as F

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
emb.to(DEVICE)
lstm.to(DEVICE)
classifier.to(DEVICE)

# switch off dropout for evaluation
lstm.eval()
classifier.eval()

pred = []
with torch.no_grad():  # no gradients are needed for inference
    for i_step, (X_batch, lengths, y_batch) in enumerate(test_dataloader, 1):
        X_batch = X_batch.to(DEVICE)
        # run the word indices through the embedding layer
        embed = emb(X_batch)

        # run the embeddings through the LSTM (the lengths stay on the CPU)
        packed = pack_padded_sequence(embed, lengths, enforce_sorted=False, batch_first=True)
        output, *_ = lstm(packed)
        output, _ = pad_packed_sequence(output, batch_first=True)

        # run the encoded last hidden state through the classifier
        last_hidden_states = output[torch.arange(0, X_batch.size()[0]), lengths - 1, :]
        output = classifier(last_hidden_states)

        # the class with the highest log-probability is the prediction
        _, pred_ = F.log_softmax(output, dim=-1).max(dim=-1)
        pred.extend(pred_.cpu().numpy().tolist())

In [21]:
from sklearn import metrics

print(metrics.classification_report(y_test, pred, target_names=list(label_encoder.classes_)))

               precision    recall  f1-score   support

         Etat       0.87      0.69      0.77        67
       Inland       0.75      0.75      0.75       102
International       0.83      0.85      0.84       151
       Kultur       0.81      0.78      0.79        54
     Panorama       0.73      0.82      0.78       168
        Sport       0.99      0.98      0.99       120
          Web       0.92      0.93      0.92       168
   Wirtschaft       0.84      0.79      0.82       141
 Wissenschaft       0.91      0.86      0.88        57

     accuracy                           0.84      1028
    macro avg       0.85      0.83      0.84      1028
 weighted avg       0.85      0.84      0.84      1028



---
    ** ------------------------------------------------------------------------- **
    embedding + LSTM (64, 1-layer) + Linear
    100 epochs
    SGD lr=0.1 / momentum = 0.9
    ~ 25 minutes
    ** ------------------------------------------------------------------------- **

                   precision    recall  f1-score   support

             Etat       0.80      0.70      0.75        67
           Inland       0.66      0.72      0.69       102
    International       0.87      0.81      0.84       151
           Kultur       0.73      0.74      0.73        54
         Panorama       0.78      0.71      0.75       168
            Sport       0.95      0.98      0.97       120
              Web       0.90      0.89      0.89       168
       Wirtschaft       0.81      0.70      0.75       141
     Wissenschaft       0.53      0.89      0.66        57

         accuracy                           0.80      1028
        macro avg       0.78      0.79      0.78      1028
     weighted avg       0.81      0.80      0.80      1028
     
     
     

    ** ------------------------------------------------------------------------- **
    embedding + LSTM (64, 1-layer) + Linear
    100 epochs
    Adam lr=0.01
    ~ 25 minutes
    ** ------------------------------------------------------------------------- **

                   precision    recall  f1-score   support

             Etat       0.81      0.75      0.78        67
           Inland       0.71      0.74      0.72       102
    International       0.86      0.83      0.84       151
           Kultur       0.85      0.81      0.83        54
         Panorama       0.74      0.75      0.74       168
            Sport       0.97      0.96      0.97       120
              Web       0.89      0.89      0.89       168
       Wirtschaft       0.77      0.79      0.78       141
     Wissenschaft       0.81      0.88      0.84        57

         accuracy                           0.82      1028
        macro avg       0.82      0.82      0.82      1028
     weighted avg       0.82      0.82      0.82      1028




    ** ------------------------------------------------------------------------- **
    embedding + LSTM (128, 2-layer, bidir) + Linear
    fine tuned embeddings
    100 epochs
    Adam lr=0.01
    ~ 55 minutes
    ** ------------------------------------------------------------------------- **

                  precision    recall  f1-score   support

             Etat       0.86      0.76      0.81        67
           Inland       0.73      0.78      0.76       102
    International       0.88      0.86      0.87       151
           Kultur       0.85      0.83      0.84        54
         Panorama       0.79      0.76      0.77       168
            Sport       0.98      0.98      0.98       120
              Web       0.90      0.91      0.91       168
       Wirtschaft       0.81      0.80      0.81       141
     Wissenschaft       0.82      0.96      0.89        57

         accuracy                           0.85      1028
        macro avg       0.85      0.85      0.85      1028
     weighted avg       0.85      0.85      0.85      1028
     
    ** ------------------------------------------------------------------------- **
    embedding + LSTM (64, 4-layer, bidir) + Linear
    fine tuned embeddings
    100 epochs
    Adam lr=0.01
    ~ 70 minutes
    ** ------------------------------------------------------------------------- **

                   precision    recall  f1-score   support

             Etat       0.87      0.69      0.77        67
           Inland       0.75      0.75      0.75       102
    International       0.83      0.85      0.84       151
           Kultur       0.81      0.78      0.79        54
         Panorama       0.73      0.82      0.78       168
            Sport       0.99      0.98      0.99       120
              Web       0.92      0.93      0.92       168
       Wirtschaft       0.84      0.79      0.82       141
     Wissenschaft       0.91      0.86      0.88        57

         accuracy                           0.84      1028
        macro avg       0.85      0.83      0.84      1028
     weighted avg       0.85      0.84      0.84      1028
---

# Questions

---

# Contextual Embeddings, Encoder / Decoder Architecture and Transfer Learning

- Akbik et al.: [Pooled Contextualized Embeddings for Named Entity Recognition.](https://www.aclweb.org/anthology/N19-1078/) NAACL-HLT (1) 2019: 724-728

- Howard et al.: [_Universal Language Model Fine-tuning for Text Classification._](https://www.aclweb.org/anthology/P18-1031/) ACL (1) 2018: 328-339
- Jeremy Howard and Sebastian Ruder [Introducing state of the art text classification with universal language models](http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html) 15 May 2018

![ULMfit](./img/ulmfit.png)

---

# Pre-trained AWD-LSTM from fast.ai

In [137]:
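# assuming fastai v1: submodules have to be imported explicitly,
# a bare `import fastai` does not expose `fastai.datasets`
from fastai import text
from fastai.datasets import untar_data, URLs

path = untar_data(URLs.IMDB_SAMPLE)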

In [434]:
data_lm = text.data.TextLMDataBunch.from_csv(path, 'texts.csv')
awd_lstm = text.language_model_learner(data_lm, text.AWD_LSTM)

In [440]:
awd_lstm.beam_search('What made this the hugely successful triumph it was? Was it casting, music, imagination, ingenuity, or luck?',
                     n_words=25,
                     temperature=0.95,
                     top_k=25,
                     beam_sz=250)

'What made this the hugely successful triumph it was? Was it casting, music, imagination, ingenuity, or luck? What made this the xxunk successful triumph it was ? Was it casting , music , imagination , ingenuity , or luck ? No , no , no , no , no , no , no , no ! No , no , no , no'