# Hands-On Lab: Language Understanding with Recurrent Networks

This hands-on lab shows how to implement a recurrent network to process text,
for the Air Travel Information Services (ATIS) tasks of slot tagging and intent classification.
We will start with a straightforward embedding followed by a recurrent LSTM.
We will then extend it to include neighbor words and run bidirectionally.
Lastly, we will turn this system into an intent classifier.  

The techniques you will practice include:

* model description by composing layer blocks instead of writing formulas
* creating your own layer block
* variables with different sequence lengths in the same network
* parallel training

We assume that you are familiar with the basics of deep learning, and these specific concepts:

* recurrent networks ([Wikipedia page](https://en.wikipedia.org/wiki/Recurrent_neural_network))
* text embedding ([Wikipedia page](https://en.wikipedia.org/wiki/Word_embedding))

### Prerequisites

We assume that you have already [installed CNTK](https://www.cntk.ai/pythondocs/setup.html).
This tutorial requires CNTK V2. We strongly recommend running this tutorial on a machine with
a capable CUDA-compatible GPU. Deep learning without GPUs is not fun.

Finally, you need to download the training and test sets. The following piece of code does that for you. If you get an error, please follow the manual instructions below it.

We also list the imports we will need for this tutorial.

In [1]:
import os
import math
from cntk.blocks import *  # non-layer like building blocks such as LSTM()
from cntk.layers import *  # layer-like stuff such as Linear()
from cntk.models import *  # higher abstraction level, e.g. entire standard models and also operators like Sequential()
from cntk.utils import *
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs, INFINITELY_REPEAT, FULL_DATA_SWEEP
from cntk import Trainer
from cntk.ops import cross_entropy_with_softmax, classification_error, splice
from cntk.learner import adam_sgd, learning_rate_schedule, momentum_schedule
from cntk.persist import load_model, save_model

try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda x: x  # fall back to a no-op progress wrapper if tqdm is missing
import requests

def download(data):
    # fetch atis.<data>.ctf and store it in the working directory
    url = "https://github.com/Microsoft/CNTK/blob/master/Examples/Tutorials/SLUHandsOn/atis.%s.ctf?raw=true"
    response = requests.get(url%data, stream=True)

    with open("atis.%s.ctf"%data, "wb") as handle:
        for chunk in tqdm(response.iter_content()):
            handle.write(chunk)

for t in "train","test":
    # download only if the file is not already present
    if not os.path.exists("atis.%s.ctf"%t):
        download(t)

### Fallback manual instructions
Please download the ATIS [training](https://github.com/Microsoft/CNTK/blob/master/Tutorials/SLUHandsOn/atis.train.ctf) 
and [test](https://github.com/Microsoft/CNTK/blob/master/Tutorials/SLUHandsOn/atis.test.ctf) 
files and put them in the same folder as this notebook.


## Task and Model Structure

The task we want to approach in this tutorial is slot tagging.
We use the [ATIS corpus](https://catalog.ldc.upenn.edu/LDC95S26).
ATIS contains human-computer queries from the domain of Air Travel Information Services,
and our task will be to annotate (tag) each word of a query with the
specific item of information (slot) it belongs to, if any.

The data in your working folder has already been converted into the "CNTK Text Format."
Let's look at an example from the test-set file `atis.test.ctf`:

    19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O
    19  |S0 770:1 |# show                         |S2 128:1 |# O
    19  |S0 429:1 |# flights                      |S2 128:1 |# O
    19  |S0 444:1 |# from                         |S2 128:1 |# O
    19  |S0 272:1 |# burbank                      |S2 48:1  |# B-fromloc.city_name
    19  |S0 851:1 |# to                           |S2 128:1 |# O
    19  |S0 789:1 |# st.                          |S2 78:1  |# B-toloc.city_name
    19  |S0 564:1 |# louis                        |S2 125:1 |# I-toloc.city_name
    19  |S0 654:1 |# on                           |S2 128:1 |# O
    19  |S0 601:1 |# monday                       |S2 26:1  |# B-depart_date.day_name
    19  |S0 179:1 |# EOS                          |S2 128:1 |# O

This file has 7 columns:

* a sequence id (19). There are 11 entries with this sequence id. This means that sequence 19 consists
of 11 tokens;
* column `S0`, which contains numeric word indices;
* a comment column denoted by `#`, to allow a human reader to know what the numeric word index stands for.
Comment columns are ignored by the system. `BOS` and `EOS` are special words
to denote beginning and end of sentence, respectively;
* column `S1` is an intent label, which we will only use in the last part of the tutorial;
* another comment column that shows the human-readable label of the numeric intent index;
* column `S2` is the slot label, represented as a numeric index; and
* another comment column that shows the human-readable label of the numeric label index.
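
To make the column layout concrete, here is a minimal sketch of a parser for one such line. It is not the reader CNTK uses; `parse_ctf_line` is a hypothetical helper, and it assumes the simple sparse layout shown above, where every data column holds exactly one `index:value` pair.

```python
def parse_ctf_line(line):
    """Split one CTF line into its sequence id, data columns, and comments."""
    head, *columns = line.split("|")
    seq_id = int(head.strip())
    fields = {}    # stream name (S0, S1, S2) -> (index, value)
    comments = []  # human-readable annotations; the real reader ignores these
    for col in columns:
        col = col.strip()
        if col.startswith("#"):
            comments.append(col[1:].strip())
        else:
            name, entry = col.split()
            index, value = entry.split(":")
            fields[name] = (int(index), float(value))
    return seq_id, fields, comments

seq_id, fields, comments = parse_ctf_line(
    "19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O")
print(seq_id)    # 19
print(fields)    # {'S0': (178, 1.0), 'S1': (14, 1.0), 'S2': (128, 1.0)}
print(comments)  # ['BOS', 'flight', 'O']
```

Lines after the first one in a sequence carry no `S1` column, so the returned dictionary has only the `S0` and `S2` entries for them.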

The task of the neural network is to look at the query (column `S0`) and predict the
slot label (column `S2`).
As you can see, each word in the input gets assigned either an empty label `O`
or a slot label that begins with `B-` for the first word, and with `I-` for any
additional consecutive word that belongs to the same slot.
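
The `B-`/`I-`/`O` scheme can be turned back into slot values with a few lines of plain Python. The following is only a sketch for illustration; `extract_slots` is a hypothetical helper and not part of CNTK or of this lab's code.

```python
def extract_slots(words, tags):
    """Group B-/I- tagged words into (slot_name, phrase) pairs."""
    slots = []
    current_name, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # a new slot starts; close the one that was open, if any
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            current_words.append(word)  # continuation of the open slot
        else:
            # "O", or an I- tag that does not continue the open slot
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots

words = "BOS show flights from burbank to st. louis on monday EOS".split()
tags = ("O O O O B-fromloc.city_name O B-toloc.city_name "
        "I-toloc.city_name O B-depart_date.day_name O").split()
print(extract_slots(words, tags))
# [('fromloc.city_name', 'burbank'), ('toloc.city_name', 'st. louis'),
#  ('depart_date.day_name', 'monday')]
```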

The model we will use is a recurrent model consisting of an embedding layer,
a recurrent LSTM cell, and a dense layer to compute the posterior probabilities:


    slot label   "O"        "O"        "O"        "O"  "B-fromloc.city_name"
                  ^          ^          ^          ^          ^
                  |          |          |          |          |
              +-------+  +-------+  +-------+  +-------+  +-------+
              | Dense |  | Dense |  | Dense |  | Dense |  | Dense |  ...
              +-------+  +-------+  +-------+  +-------+  +-------+
                  ^          ^          ^          ^          ^
                  |          |          |          |          |
              +------+   +------+   +------+   +------+   +------+   
         0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...
              +------+   +------+   +------+   +------+   +------+   
                  ^          ^          ^          ^          ^
                  |          |          |          |          |
              +-------+  +-------+  +-------+  +-------+  +-------+
              | Embed |  | Embed |  | Embed |  | Embed |  | Embed |  ...
              +-------+  +-------+  +-------+  +-------+  +-------+
                  ^          ^          ^          ^          ^
                  |          |          |          |          |
    w      ------>+--------->+--------->+--------->+--------->+------... 
                 BOS      "show"    "flights"    "from"   "burbank"

Or, as a CNTK network description. Please have a quick look and match it with the description above
(descriptions of these functions can be found in [the layers reference](http://cntk.ai/pythondocs/layerref.html)):

In [2]:
vocab_size = 943 ; num_labels = 129 ; num_intents = 26    # number of words in vocab, slot labels, and intent labels

model_dir = "./Models"
data_dir  = "."
# model dimensions
input_dim  = vocab_size
label_dim  = num_labels
emb_dim    = 150
hidden_dim = 300

def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            Dense(num_labels)
        ])

## CNTK Configuration

To train and test a model in CNTK, we need to create a model and specify how to read data and perform training and testing. 

In order to train we need to specify:

* how to read the data 
* the model function and its inputs and outputs
* hyper-parameters for the learner

[comment]: <> (For testing ...)

### A Brief Look at Data and Data Reading

We already looked at the data.
But how do you generate this format?
For reading text, this tutorial uses the `CNTKTextFormatReader`. It expects the input data to be
of a specific format, which is described [here](https://github.com/Microsoft/CNTK/wiki/CNTKTextFormat-Reader).

For this tutorial, we created the corpora in two steps:
* convert the raw data into a plain text file that consists of TAB-separated columns of space-separated text. For example:

  ```
  BOS show flights from burbank to st. louis on monday EOS (TAB) flight (TAB) O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name O B-depart_date.day_name O
  ```

  This is meant to be compatible with the output of the `paste` command.
* convert it to CNTK Text Format (CTF) with the following command:

  ```
  python Scripts/txt2ctf.py --map query.wl intent.wl slots.wl --annotated True --input atis.test.txt --output atis.test.ctf
  ```

  where the three `.wl` files give the vocabulary as plain text files, one line per word.
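
To illustrate what the second step does, here is a rough sketch of such a conversion. It is not the actual `txt2ctf.py`; `to_ctf_rows` is a hypothetical function, the tiny vocabularies are made up, and it assumes that a word's index is its zero-based position in the vocabulary list and that the intent is attached to the first token only, as in the excerpt shown earlier.

```python
def to_ctf_rows(seq_id, line, query_vocab, intent_vocab, slot_vocab):
    """Turn one TAB-separated text line into annotated CTF rows (one per word)."""
    query, intent, slots = line.rstrip("\n").split("\t")
    words, labels = query.split(), slots.split()
    assert len(words) == len(labels), "expected one slot label per word"
    rows = []
    for i, (word, label) in enumerate(zip(words, labels)):
        row = "%d\t|S0 %d:1 |# %s" % (seq_id, query_vocab.index(word), word)
        if i == 0:  # assumption: the intent label goes on the first token only
            row += "\t|S1 %d:1 |# %s" % (intent_vocab.index(intent), intent)
        row += "\t|S2 %d:1 |# %s" % (slot_vocab.index(label), label)
        rows.append(row)
    return rows

# made-up vocabularies standing in for query.wl, intent.wl, and slots.wl
query_vocab = ["BOS", "EOS", "flights", "show"]
intent_vocab = ["flight"]
slot_vocab = ["O"]

rows = to_ctf_rows(7, "BOS show flights EOS\tflight\tO O O O",
                   query_vocab, intent_vocab, slot_vocab)
for row in rows:
    print(row)
```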

In these CTF files, our columns are labeled `S0`, `S1`, and `S2`.
These are connected to the actual network inputs by the corresponding lines in the reader definition:

In [3]:
def create_reader(path, is_training):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
         query         = StreamDef(field='S0', shape=vocab_size,  is_sparse=True),
         intent_unused = StreamDef(field='S1', shape=num_intents, is_sparse=True),  
         slot_labels   = StreamDef(field='S2', shape=num_labels,  is_sparse=True)
     )), randomize=is_training, epoch_size = INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)

### Running it

You can find the complete recipe below.

In [4]:
def train(reader, model, max_epochs):
    # Input variables denoting the features and label data
    query       = Input(input_dim,  is_sparse=False)
    slot_labels = Input(num_labels, is_sparse=True)  # TODO: make sparse once it works

    # apply model to input
    z = model(query)

    # loss and metric
    ce = cross_entropy_with_softmax(z, slot_labels)
    pe = classification_error      (z, slot_labels)

    # training config
    epoch_size = 36000
    minibatch_size = 70
    num_mbs_to_show_result = 100
    momentum_as_time_constant = minibatch_size / -math.log(0.9)  # TODO: Change to round number. This is 664.39. 700?

    lr_per_sample = [0.003]*2+[0.0015]*12+[0.0003] # LR schedule over epochs (we don't run that many epochs, but if we did, these are good values)

    # trainer object
    lr_schedule = learning_rate_schedule(lr_per_sample, units=epoch_size)
    learner = adam_sgd(z.parameters,
                       lr_per_sample=lr_schedule, momentum_time_constant=momentum_as_time_constant,
                       low_memory=True,
                       gradient_clipping_threshold_per_sample=15, gradient_clipping_with_truncation=True)

    trainer = Trainer(z, ce, pe, [learner])

    # define mapping from reader streams to network inputs
    input_map = {
        query       : reader.streams.query,
        slot_labels : reader.streams.slot_labels
    }

    # process minibatches and perform model training
    log_number_of_parameters(z) ; print()
    progress_printer = ProgressPrinter(freq=100, first=10, tag='Training') # more detailed logging
    #progress_printer = ProgressPrinter(tag='Training')

    t = 0
    for epoch in range(max_epochs):         # loop over epochs
        epoch_end = (epoch+1) * epoch_size
        while t < epoch_end:               # loop over minibatches in the epoch
            # BUGBUG? The change of minibatch_size parameter vv has no effect.
            data = reader.next_minibatch(min(minibatch_size, epoch_end-t), input_map=input_map) # fetch minibatch
            trainer.train_minibatch(data)                                   # update model with it
            t += data[slot_labels].num_samples                              # count samples processed so far
            progress_printer.update_with_trainer(trainer, with_metric=True) # log progress
            #def trace_node(name):
            #    nl = [n for n in z.parameters if n.name() == name]
            #    if len(nl) > 0:
            #        print (name, np.asarray(nl[0].value))
            #trace_node('W')
            #trace_node('stabilizer_param')
        loss, metric, actual_samples = progress_printer.epoch_summary(with_metric=True)

    return loss, metric


In [5]:
from _cntk_py import set_computation_network_trace_level, set_fixed_random_seed, force_deterministic_algorithms
#set_computation_network_trace_level(1)  # TODO: remove debugging facilities once this all works
set_fixed_random_seed(1)  # BUGBUG: has no effect at present  # TODO: remove debugging facilities once this all works
force_deterministic_algorithms()


In [6]:
reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
model = create_model()
train(reader, model, max_epochs=8)

If you run it, you will soon see output like this:

```
Training 721479 parameters in 6 parameter tensors.

 Minibatch[   1-   1]: loss = 4.865357 * 67, metric = 100.0% * 67
 Minibatch[   2-   2]: loss = 4.845598 * 63, metric = 73.0% * 63
 Minibatch[   3-   3]: loss = 4.807514 * 68, metric = 38.2% * 68
 Minibatch[   4-   4]: loss = 4.759234 * 70, metric = 35.7% * 70
 Minibatch[   5-   5]: loss = 4.681714 * 65, metric = 30.8% * 65
 Minibatch[   6-   6]: loss = 4.566493 * 62, metric = 27.4% * 62
 Minibatch[   7-   7]: loss = 4.454796 * 58, metric = 31.0% * 58
 Minibatch[   8-   8]: loss = 4.259974 * 70, metric = 28.6% * 70
 Minibatch[   9-   9]: loss = 4.057600 * 59, metric = 33.9% * 59
 Minibatch[  10-  10]: loss = 3.580238 * 64, metric = 28.1% * 64
 Minibatch[  11- 100]: loss = 1.443393 * 5654, metric = 27.4% * 5654
 Minibatch[ 101- 200]: loss = 0.825825 * 6329, metric = 17.9% * 6329
 Minibatch[ 201- 300]: loss = 0.653802 * 6259, metric = 14.4% * 6259
 Minibatch[ 301- 400]: loss = 0.521054 * 6229, metric = 11.4% * 6229
 Minibatch[ 401- 500]: loss = 0.462903 * 6289, metric = 10.2% * 6289
Finished Epoch [1]: [Training] loss = 0.783951 * 36061, metric = 15.5% * 36061
```

This shows how learning proceeds over epochs (passes through the data).
For example, after two epochs, the cross-entropy criterion, which is the `ce` variable in
the `train` function, has reached 0.27 as measured on the 36000 samples of this epoch,
and the error rate is 5.9% on those same 36000 training samples.

The epoch size is the number of samples--counted as *word tokens*, not sentences--to
process between model checkpoints.
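
The numbers in the training configuration can be checked with a little arithmetic. The sketch below uses plain Python, not the CNTK schedule objects. It rests on two assumptions: the last value of the learning-rate list stays in effect for all later epochs, and a momentum time constant `tc` corresponds to a per-sample momentum of `exp(-1/tc)`. `lr_at` is a hypothetical helper.

```python
import math

epoch_size = 36000      # samples (word tokens) per epoch, as in train()
minibatch_size = 70     # upper bound on samples per minibatch

lr_per_sample = [0.003] * 2 + [0.0015] * 12 + [0.0003]

def lr_at(samples_seen):
    """Per-sample learning rate in effect after `samples_seen` word tokens."""
    epoch = samples_seen // epoch_size
    # assumption: the last entry is reused once the list is exhausted
    return lr_per_sample[min(epoch, len(lr_per_sample) - 1)]

# a momentum of 0.9 per minibatch of 70 samples, expressed as a time constant
momentum_as_time_constant = minibatch_size / -math.log(0.9)
per_sample_momentum = math.exp(-1.0 / momentum_as_time_constant)

print(lr_at(0), lr_at(2 * epoch_size), lr_at(20 * epoch_size))  # 0.003 0.0015 0.0003
print(round(momentum_as_time_constant, 2))                      # 664.39
print(round(per_sample_momentum ** minibatch_size, 6))          # 0.9
```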

Once the training has completed (a little less than 2 minutes on a Titan-X or a Surface Book),
you will see output like this:
```
(0.06193035719939996, 0.014038397514149373)
```
which is a tuple containing the loss (cross entropy) and the metric (classification error) averaged over the final epoch.

On a CPU-only machine, it can be 4 or more times slower.
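
The first line of the training log reports 721479 parameters in 6 parameter tensors. That figure can be reproduced by hand. The sketch below assumes an LSTM without peephole connections whose four gates are stored together as one input weight matrix, one recurrent weight matrix, and one bias vector. This is an assumption about the parameter layout, but it matches the reported count.

```python
vocab_size, emb_dim, hidden_dim, num_labels = 943, 150, 300, 129

def embedding_params(vocab, dim):
    return vocab * dim                       # 1 tensor: the embedding matrix

def lstm_params(input_dim, hidden):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_dim + hidden) * hidden + 4 * hidden   # 3 tensors

def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim        # 2 tensors: weight and bias

total = (embedding_params(vocab_size, emb_dim)
         + lstm_params(emb_dim, hidden_dim)
         + dense_params(hidden_dim, num_labels))
print(total)  # 721479 = 141450 + 541200 + 38829
```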

## Modifying the Model

In the following, you will be given tasks to practice modifying CNTK configurations.
The solutions are given at the end of this document... but please try without!

### A Word About [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential)

Before jumping to the tasks, let's have a look again at the model we just ran.
The model is described in what we call *function-composition style*.
```python
        Sequential([
            Embedding(emb_dim),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            Dense(num_labels)
        ])
```
You may be familiar with the "sequential" notation from other neural-network toolkits.
If not, [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential) is a powerful operation that,
in a nutshell, lets you compactly express a very common situation in neural networks
where an input is processed by propagating it through a progression of layers.
`Sequential()` takes a list of functions as its argument,
and returns a *new* function that invokes these functions in order,
each time passing the output of one to the next.
For example,
```python
    FGH = Sequential([F, G, H])
    y = FGH(x)
```
means the same as
```
    y = H(G(F(x))) 
```
This is known as ["function composition"](https://en.wikipedia.org/wiki/Function_composition),
and is especially convenient for expressing neural networks, which often have this form:

         +-------+   +-------+   +-------+
    x -->|   F   |-->|   G   |-->|   H   |--> y
         +-------+   +-------+   +-------+
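
The idea does not depend on CNTK. As a minimal sketch, the same composition can be written for ordinary Python functions. The lowercase `sequential` below is a hypothetical stand-in, not the CNTK implementation.

```python
from functools import reduce

def sequential(functions):
    """Return a new function that applies `functions` in order, left to right."""
    def composed(x):
        return reduce(lambda value, f: f(value), functions, x)
    return composed

F = lambda x: x + 1
G = lambda x: x * 2
H = lambda x: x - 3

FGH = sequential([F, G, H])
print(FGH(5), H(G(F(5))))  # 9 9
```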

Coming back to our model at hand, the `Sequential` expression simply
says that our model has this form:

         +-----------+   +----------------+   +------------+
    x -->| Embedding |-->| Recurrent LSTM |-->| DenseLayer |--> y
         +-----------+   +----------------+   +------------+

### Task 1: Add Batch Normalization

We now want to add new layers to the model, specifically batch normalization.

Batch normalization is a popular technique for speeding up convergence.
It is often used for image-processing setups, for example our other [hands-on lab on image
recognition](./Hands-On-Labs-Image-Recognition).
But could it work for recurrent models, too?
  
So your task will be to insert batch-normalization layers before and after the recurrent LSTM layer.
If you have completed the [hands-on labs on image processing](https://github.com/Microsoft/CNTK/blob/master/bindings/python/tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb),
you may remember that the [batch-normalization layer](https://www.cntk.ai/pythondocs/layerref.html#batchnormalization-layernormalization-stabilizer) has this form:
```
    BatchNormalization()
```
So please go ahead and modify the configuration and see what happens.

If everything went right, you will notice improved convergence speed (`loss` and `metric`)
compared to the previous configuration.

In [7]:
# TODO: Add batch normalization
def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            Dense(num_labels)
        ])

#reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
#model = create_model()
#train(reader, model, max_epochs=8)

### Task 2: Add a Lookahead 

Our recurrent model suffers from a structural deficit:
Since the recurrence runs from left to right, the decision for a slot label
has no information about upcoming words. The model is a bit lopsided.
Your task will be to modify the model such that
the input to the recurrence consists not only of the current word, but also of the next one
(lookahead).

Your solution should be in function-composition style.
Hence, you will need to write a Python function that does the following:

* takes no input arguments
* creates a placeholder sequence variable
* computes the "next value" in this sequence using the `Delay()` layer (use this specific form: `Delay(T=-1)`); and
* concatenates the current and the next value into a vector of twice the embedding dimension using `splice()`

and then insert this function into `Sequential()`'s list between the embedding and the recurrent layer.

In [8]:
# TODO: Add lookahead
def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            Dense(num_labels)
        ])

#reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
#model = create_model()
#train(reader, model, max_epochs=8)

### Task 3: Bidirectional Recurrent Model

Aha, knowledge of future words helps. So instead of a one-word lookahead,
why not look ahead all the way to the end of the sentence, through a backward recurrence?
Let us create a bidirectional model!

Your task is to implement a new layer that
performs both a forward and a backward recursion over the data, and
concatenates the output vectors.

Note, however, that this differs from the previous task in that
the bidirectional layer contains learnable model parameters.
In function-composition style,
the pattern to implement a layer with model parameters is to write a *factory function*
that creates a *function object*.

A function object, also known as a *functor*, is an object that is both a function and an object.
This simply means that it contains data yet can still be invoked as if it were a function.

For example, `Dense(outDim)` is a factory function that returns a function object that contains
a weight matrix `W`, a bias `b`, and another function to compute `W * input + b`.
E.g. saying `Dense(1024)` will create this function object, which can then be used
like any other function, also immediately: `Dense(1024)(x)`. 

Confused? Let's take an example: Let us implement a new layer that combines
a linear layer with a subsequent batch normalization. 
To allow function composition, the layer needs to be realized as a factory function,
which could look like this:

```python
def DenseLayerWithBN(dim):
    F = Dense(dim)
    G = BatchNormalization()
    x = Placeholder()
    apply_x = G(F(x))
    return apply_x
```

Invoking this factory function will create `F`, `G`, `x`, and `apply_x`. In this example, `F` and `G` are function objects themselves, and `apply_x` is the function to be applied to the data.
Thus, e.g. calling `DenseLayerWithBN(1024)` will
create an object containing a linear-layer function object called `F`, a batch-normalization function object `G`,
and `apply_x` which is the function that implements the actual operation of this layer
using `F` and `G`. It will then return `apply_x`. To the outside, `apply_x` looks and behaves
like a function. Under the hood, however, `apply_x` retains access to its specific instances of `F` and `G`.
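
The same pattern can be tried out in plain Python, where a closure plays the role of the function object. The sketch below is not CNTK code. `DenseSketch` and `DenseWithReluSketch` are hypothetical names, the weights are filled with a constant purely for illustration, and a ReLU stands in for batch normalization to keep the example short.

```python
def DenseSketch(in_dim, out_dim, init=0.1):
    # parameters are created once, when the factory is called
    W = [[init] * in_dim for _ in range(out_dim)]
    b = [0.0] * out_dim

    def apply_x(x):
        # apply_x closes over W and b, so it keeps access to this instance's data
        return [sum(w * v for w, v in zip(row, x)) + bias
                for row, bias in zip(W, b)]
    return apply_x

def relu(x):
    return [max(0.0, v) for v in x]

def DenseWithReluSketch(in_dim, out_dim, init=0.1):
    F = DenseSketch(in_dim, out_dim, init)  # function object with its own W, b
    G = relu
    def apply_x(x):
        return G(F(x))
    return apply_x

layer_a = DenseWithReluSketch(3, 2, init=0.5)
layer_b = DenseWithReluSketch(3, 2, init=-0.5)
print(layer_a([1.0, 2.0, 3.0]))  # [3.0, 3.0]
print(layer_b([1.0, 2.0, 3.0]))  # [0.0, 0.0]
```

Each call to the factory yields an independent layer: `layer_a` and `layer_b` are invoked like functions but hold different parameters.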

Now back to our task at hand. You will now need to create a factory function,
very much like the example above.
You shall create a factory function
that creates two recurrent layer instances (one forward, one backward), and then defines an `apply_x` function
which applies both layer instances to the same `x` and concatenates the two results.

All right, give it a try! To know how to realize a backward recursion in CNTK,
please take a hint from how the forward recursion is done.
Please also do the following:
* remove the one-word lookahead you added in the previous task, which we aim to replace; and
* change the `hidden_dim` parameter from 300 to 150, to keep the total number of model parameters limited.

In [9]:
# TODO: Add bidirectional recurrence
def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            Dense(num_labels)
        ])

#reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
#model = create_model()
#train(reader, model, max_epochs=8)

Works like a charm! This model achieves an error rate of 1.83%, a tiny bit better than the lookahead model above.
The bidirectional model has 40% fewer parameters than the lookahead one. However, if you go back and look closely
at the complete log output (not shown on this web page), you may find that the lookahead one trained
about 30% faster.
This is because the lookahead model has both fewer horizontal dependencies (one instead of two
recurrences) and larger matrix products, and can thus achieve higher parallelism.

### Solution 1: Adding Batch Normalization

In [10]:
def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            BatchNormalization(),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            BatchNormalization(),
            Dense(num_labels)
        ])

reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
model = create_model()
train(reader, model, max_epochs=8)

Training 722379 parameters in 10 parameter tensors.

 Minibatch[   1-   1]: loss = 5.717494 * 67, metric = 98.5% * 67
 Minibatch[   2-   2]: loss = 4.657044 * 63, metric = 73.0% * 63
 Minibatch[   3-   3]: loss = 3.498547 * 68, metric = 66.2% * 68
 Minibatch[   4-   4]: loss = 2.029608 * 70, metric = 32.9% * 70
 Minibatch[   5-   5]: loss = 2.782364 * 65, metric = 36.9% * 65
 Minibatch[   6-   6]: loss = 2.508770 * 62, metric = 27.4% * 62
 Minibatch[   7-   7]: loss = 1.439061 * 58, metric = 20.7% * 58
 Minibatch[   8-   8]: loss = 1.267150 * 70, metric = 14.3% * 70
 Minibatch[   9-   9]: loss = 0.748346 * 59, metric = 15.3% * 59
 Minibatch[  10-  10]: loss = 1.607091 * 64, metric = 21.9% * 64
 Minibatch[  11- 100]: loss = 0.562347 * 5654, metric = 9.4% * 5654
 Minibatch[ 101- 200]: loss = 0.276623 * 6329, metric = 5.2% * 6329
 Minibatch[ 201- 300]: loss = 0.189869 * 6259, metric = 4.5% * 6259
 Minibatch[ 301- 400]: loss = 0.173477 * 6229, metric = 3.8% * 6229
 Minibatch[ 401- 500]: lo

(0.014827127369880783, 0.004660969925646432)

### Solution 2: Add a Lookahead

In [11]:
def OneWordLookahead():
    x = Placeholder()
    apply_x = splice ([x, future_value(x)])
    return apply_x

def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            OneWordLookahead(),
            BatchNormalization(),
            Recurrence(LSTM(hidden_dim), go_backwards=False),
            BatchNormalization(),
            Dense(num_labels)        
        ])

reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
model = create_model()
train(reader, model, max_epochs=1)


Training 902679 parameters in 10 parameter tensors.

 Minibatch[   1-   1]: loss = 5.907517 * 67, metric = 100.0% * 67
 Minibatch[   2-   2]: loss = 4.943243 * 63, metric = 84.1% * 63
 Minibatch[   3-   3]: loss = 3.374831 * 68, metric = 58.8% * 68
 Minibatch[   4-   4]: loss = 2.692569 * 70, metric = 38.6% * 70
 Minibatch[   5-   5]: loss = 2.942126 * 65, metric = 43.1% * 65
 Minibatch[   6-   6]: loss = 2.558049 * 62, metric = 32.3% * 62
 Minibatch[   7-   7]: loss = 1.299130 * 58, metric = 20.7% * 58
 Minibatch[   8-   8]: loss = 1.375856 * 70, metric = 15.7% * 70
 Minibatch[   9-   9]: loss = 1.242764 * 59, metric = 11.9% * 59
 Minibatch[  10-  10]: loss = 1.922263 * 64, metric = 25.0% * 64
 Minibatch[  11- 100]: loss = 0.588840 * 5654, metric = 10.0% * 5654
 Minibatch[ 101- 200]: loss = 0.275335 * 6329, metric = 5.8% * 6329
 Minibatch[ 201- 300]: loss = 0.191039 * 6259, metric = 4.3% * 6259
 Minibatch[ 301- 400]: loss = 0.183137 * 6229, metric = 3.9% * 6229
 Minibatch[ 401- 500]: 

(0.30319248410476657, 0.05801281162474696)
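
To see what `OneWordLookahead` does to the data, here is a plain-Python sketch that operates on a list of vectors instead of a CNTK sequence. `one_word_lookahead` is a hypothetical helper, and padding the position after the last word with a constant is an assumption made for illustration; CNTK's own choice of boundary value may differ.

```python
def one_word_lookahead(sequence, pad_value=0.0):
    """Concatenate each vector with its successor; pad beyond the last item."""
    dim = len(sequence[0])
    padding = [pad_value] * dim
    return [current + (sequence[t + 1] if t + 1 < len(sequence) else padding)
            for t, current in enumerate(sequence)]

embedded = [[1, 2], [3, 4], [5, 6]]  # three "words", embedding dimension 2
print(one_word_lookahead(embedded))
# [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 0.0, 0.0]]
```

Every output vector has twice the embedding dimension, which is why the LSTM in this model has more input weights than before.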

### Solution 3: Bidirectional Recurrent Model

In [12]:
def BiRecurrence(fwd, bwd):
    F = Recurrence(fwd)
    G = Recurrence(bwd, go_backwards=True)
    x = Placeholder()
    apply_x = splice ([F(x), G(x)])
    return apply_x 

def create_model():
    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day
        return Sequential([
            Embedding(emb_dim),
            BatchNormalization(),
            BiRecurrence(LSTM(hidden_dim), LSTM(hidden_dim)),
            BatchNormalization(),
            Dense(num_labels)
        ])

reader = create_reader(data_dir + "/atis.train.ctf", is_training=True)
model = create_model()
train(reader, model, max_epochs=8)

Training 761679 parameters in 10 parameter tensors.

 Minibatch[   1-   1]: loss = 5.669997 * 67, metric = 95.5% * 67
 Minibatch[   2-   2]: loss = 4.440025 * 63, metric = 74.6% * 63
 Minibatch[   3-   3]: loss = 3.123526 * 68, metric = 58.8% * 68
 Minibatch[   4-   4]: loss = 2.220408 * 70, metric = 40.0% * 70
 Minibatch[   5-   5]: loss = 2.585504 * 65, metric = 36.9% * 65
 Minibatch[   6-   6]: loss = 2.259114 * 62, metric = 29.0% * 62
 Minibatch[   7-   7]: loss = 1.162399 * 58, metric = 19.0% * 58
 Minibatch[   8-   8]: loss = 1.024514 * 70, metric = 17.1% * 70
 Minibatch[   9-   9]: loss = 0.870560 * 59, metric = 11.9% * 59
 Minibatch[  10-  10]: loss = 1.675982 * 64, metric = 26.6% * 64
 Minibatch[  11- 100]: loss = 0.642226 * 5654, metric = 9.9% * 5654
 Minibatch[ 101- 200]: loss = 0.274710 * 6329, metric = 5.5% * 6329
 Minibatch[ 201- 300]: loss = 0.217240 * 6259, metric = 4.5% * 6259
 Minibatch[ 301- 400]: loss = 0.186087 * 6229, metric = 3.7% * 6229
 Minibatch[ 401- 500]: lo

(0.005203571479390112, 0.0013317056930418378)
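
The data flow of `BiRecurrence` can also be sketched without CNTK. In the example below, a running sum stands in for the LSTM step so that the result is easy to check by hand. `recurrence` and `bi_recurrence` are hypothetical helpers, not the CNTK functions.

```python
def recurrence(step, sequence, initial_state, go_backwards=False):
    """Run step(state, x) over the sequence; outputs stay aligned with inputs."""
    positions = range(len(sequence))
    if go_backwards:
        positions = reversed(positions)
    outputs = [None] * len(sequence)
    state = initial_state
    for t in positions:
        state = step(state, sequence[t])
        outputs[t] = state
    return outputs

def bi_recurrence(fwd_step, bwd_step, sequence, initial_state):
    forward = recurrence(fwd_step, sequence, initial_state)
    backward = recurrence(bwd_step, sequence, initial_state, go_backwards=True)
    # list concatenation plays the role of splice()
    return [f + b for f, b in zip(forward, backward)]

running_sum = lambda state, x: [state[0] + x]  # toy stand-in for an LSTM step
result = bi_recurrence(running_sum, running_sum, [1, 2, 3, 4], [0])
print(result)  # [[1, 10], [3, 9], [6, 7], [10, 4]]
```

At every position, the first half of the output summarizes the words up to and including that position, and the second half summarizes the words from that position to the end of the sentence.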