# 3. Capture patterns with recurrent neural networks

## 3.1 Recurrent neural networks
In the previous module, we used rich semantic representations of text with a simple linear classifier on top of
the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it ignores
word order, because the aggregation operation applied to the embeddings discards that information. Because
these models cannot represent word order, they cannot solve more complex or ambiguous tasks such as text
generation or question answering.
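
To make the loss of word order concrete, here is a minimal sketch in plain Python. The embedding values and the helper `mean_embedding` are made up for illustration and are not part of the course code; the point is only that averaging gives the same vector for any ordering of the same words.

```python
# Toy 2-dimensional "embeddings"; the values are invented for illustration only.
toy_embeddings = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def mean_embedding(tokens):
    """Average the embedding vectors of the tokens, one dimension at a time."""
    vectors = [toy_embeddings[token] for token in tokens]
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]

a = mean_embedding(["dog", "bites", "man"])
b = mean_embedding(["man", "bites", "dog"])

# Two sentences with opposite meanings collapse to the same vector.
print(a == b)  # prints True
```

A classifier that only sees the averaged vector has no way to tell these two sentences apart.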

To capture the meaning of a text sequence, we need another neural network architecture, called a
**recurrent neural network**, or **RNN**. In an RNN, we pass the sentence through the network one symbol at a
time; the network produces a state, which we pass back into the network together with the next symbol.

![rnn-model](notebooks/img/sample-rnn-model-generation.png)

Given an input sequence of tokens X0,...,Xn, an RNN creates a sequence of neural network blocks and trains this
sequence end-to-end using backpropagation. Each network block takes a pair (Xi,Si) as input and produces S(i+1)
as a result. The final state Sn, or the output of the last block, goes into a linear classifier to produce the
result. All network blocks share the same weights and are trained in a single backpropagation pass.
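
The recurrence described above can be sketched without any framework. The block below is an illustration under stated assumptions, not the implementation used later in this module: it assumes a `tanh` update of the form `tanh(W_x·x + W_s·s + b)`, which the text does not specify, and the names `rnn_step` and `run_rnn` are hypothetical. It shows one set of weights being reused at every step while the state is carried forward.

```python
import math
import random

def rnn_step(x, s, w_x, w_s, b):
    """One network block: combine input x and previous state s into the next state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_x[j], x))
            + sum(w * si for w, si in zip(w_s[j], s))
            + b[j]
        )
        for j in range(len(b))
    ]

def run_rnn(inputs, state_size, seed=0):
    """Feed the inputs through the same block one at a time and return the final state."""
    rng = random.Random(seed)
    input_size = len(inputs[0])
    # A single set of weights, shared by every step (random here, learned in practice).
    w_x = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(state_size)]
    w_s = [[rng.uniform(-1, 1) for _ in range(state_size)] for _ in range(state_size)]
    b = [0.0] * state_size
    state = [0.0] * state_size  # initial state S0
    for x in inputs:
        state = rnn_step(x, state, w_x, w_s, b)
    return state

# Assumed toy inputs: three 2-dimensional token vectors.
sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(run_rnn(sequence, state_size=3))
print(run_rnn(list(reversed(sequence)), state_size=3))
```

Unlike an average of embeddings, the final state depends on the order in which the tokens arrive, so the two printed vectors differ.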

Because the state vectors S0,...,Sn are passed through the network, it can learn sequential dependencies
between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to
negate certain elements of the state vector, resulting in negation of the meaning.

In [None]:
import torch
import torchtext
from torchnlp import *
from torchtext.vocab import vocab
from collections import Counter, OrderedDict


In [None]:
# download data
def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset

path = "/tmp/pytorch/data"
train, test = load_dataset(path)


In [None]:
# build a simple tokenizer
my_tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']

# build a vocabulary from the tokens of every text in the dataset
def build_vocabulary(dataset, tokenizer, ngrams=1, min_freq=1):
    # a Counter records how often each token (or n-gram) occurs
    counter = Counter()
    # iterate over all rows, convert each text to tokens, and add them to the counter
    for (label, line) in dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))
    # sort the counted tokens by frequency, most frequent first
    sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    # keep that order in an OrderedDict, which is the input vocab() expects
    words_dict = OrderedDict(sorted_by_token_freq_tuples)
    # build the vocabulary, dropping tokens that occur fewer than min_freq times
    return vocab(words_dict, min_freq=min_freq)

# build a vocab
my_vocab = build_vocabulary(train, my_tokenizer)


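# convert a text into a list of vocabulary indices, one per token
# note: a token missing from the vocabulary raises an error unless a default index is set
def encode(text, vocabulary, tokenizer):
    return [vocabulary[word] for word in tokenizer(text)]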

vocab_size = len(my_vocab)
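
To see what the vocabulary and `encode` produce without downloading AG_NEWS, here is a small plain-Python sketch of the same idea. Everything in it is a stand-in: `toy_tokenizer`, `build_toy_vocabulary`, and `toy_encode` are hypothetical names, the tokenizer is a simplified approximation of `basic_english`, and a plain dict replaces the torchtext vocabulary object.

```python
from collections import Counter

def toy_tokenizer(text):
    """Simplified stand-in for 'basic_english': lowercase, then split on whitespace."""
    return text.lower().split()

def build_toy_vocabulary(texts, min_freq=1):
    """Map each token to an index, with the most frequent token at index 0."""
    counter = Counter()
    for text in texts:
        counter.update(toy_tokenizer(text))
    # sorted() is stable, so tokens with equal counts keep their first-seen order
    by_freq = sorted(counter.items(), key=lambda item: item[1], reverse=True)
    kept = [token for token, freq in by_freq if freq >= min_freq]
    return {token: index for index, token in enumerate(kept)}

def toy_encode(text, vocabulary):
    """Replace each token with its index in the vocabulary."""
    return [vocabulary[token] for token in toy_tokenizer(text)]

texts = ["The cat sat", "The dog sat down", "The end"]
toy_vocab = build_toy_vocabulary(texts)
print(toy_vocab)
print(toy_encode("the dog sat", toy_vocab))  # prints [0, 3, 1]
```

The encoded list keeps the tokens in their original order, which is the input an RNN consumes one element at a time.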