In [30]:
from __future__ import print_function
import cntk
import numpy as np
import scipy.sparse
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
from IPython.display import Image

# CNTK Quick Start Guide

Welcome to CNTK and the wonders of deep learning! This tutorial will give a brief overview of CNTK. It is meant for users that are new to CNTK but have some experience with deep neural networks.
The focus will be on how the basic steps of deep learning are done in CNTK.

To train a deep model, you will need to:

 * define your model structure
 * prepare your data
 * train it
 * evaluate its accuracy
 * deploy it

This tutorial is structured along these five steps.

To run this tutorial, you will need CNTK 2.0 and ideally a CUDA-capable GPU (deep learning is no fun without GPUs).

## Defining Your Model Structure

So let us dive right in. Below we will introduce CNTK's programming model (*networks are function objects*) and CNTK's data model.
We will put that into action for logistic regression and MNIST digit recognition, using CNTK's Functional API.
Lastly, we will replicate one example using CNTK's lower-level, TensorFlow/Theano-like graph API.

#### Programming Model: Networks are Function Objects

In CNTK, a neural network is a function object.
On one hand, a neural network in CNTK is just a function that you can call
to apply it to data.
On the other hand, a neural network contains learnable parameters
that can be accessed like object members.
Complicated function objects can be composed as hierarchies of simpler ones, which,
for example, represent layers.
The function-object approach is similar to Keras, Chainer, Dynet, Pytorch,
and the recent Sonnet.

The following illustrates the function-object approach with pseudo-code, using the example
of a fully-connected layer (called `Dense` in CNTK):


In [31]:
# numpy *pseudo-code* for CNTK Dense layer (vastly simplified, e.g. no back-prop)
def Dense(out_dim, activation):
    # create the learnable parameters
    b = np.zeros(out_dim)
    W = np.ndarray((0,out_dim)) # input dimension is unknown
    # define the function itself
    def dense(x):
        if len(W) == 0:         # first call: reshape and initialize W
            W.resize((x.shape[-1], W.shape[-1]), refcheck=False)
            W[:] = np.random.randn(*W.shape) * 0.05
        return activation(x @ W + b)
    # return it as a function object: can be called & holds the parameters as members
    dense.W = W
    dense.b = b
    return dense

d = Dense(5, np.tanh)    # create the function object
y = d(np.array([1, 2]))  # apply it like a function
W = d.W                  # access member like an object
print('W =', d.W)
print('y =', y)

W = [[-0.01335975 -0.07401208  0.09924505 -0.02813118  0.00424906]
 [ 0.01411086  0.03614169 -0.03891578 -0.06236706  0.01096796]]
y = [ 0.01486089 -0.0017287   0.02141022 -0.15168561  0.02617901]


Note that the real CNTK function objects are not actual Python lambdas.
Rather, they are static graphs in C++,
wrapped in the Python class `Function` that exposes `__call__()` and `__getattr__()` methods.
However, CNTK allows networks to be defined as Python expressions and functions,
as long as they can be converted into static graphs of type `Function`.

The function object is CNTK's single abstraction used to represent different levels of neural networks, which
are only distinguished by convention:

 * **basic operations** without learnable parameters (e.g. `times()`, `__add__()`, `sigmoid()`...)
 * **layers** (`Dense()`, `Embedding()`, `Convolution()`...). Layers map one input to one output.
 * **recurrent step functions** (`LSTM()`, `GRU()`, `RNNStep()`). Step functions map a previous state and a new input to a new state.
 * **loss and metric** functions (`cross_entropy_with_softmax()`, `binary_cross_entropy()`, `squared_error()`, `classification_error()`...).
   In CNTK, losses and metrics are not special, just functions.
 * **models**. Models are defined by the user, map features to predictions or scores, and are what gets deployed in the end.
 * **criterion function**. The criterion function maps (features, labels) to (loss, metric).
   The Trainer optimizes the loss by SGD, and logs the metric, which may be non-differentiable.

Higher-order layers compose objects into more complex ones, including:

 * **stacking** (`Sequential()`, `For()`)
 * **recurrence** (`Recurrence()`, `Fold()`, `Unfold()`...)

Users can use arbitrary Python expressions as CNTK function objects with `Function()`.
This is similar to Keras' `Lambda()`.
Expressions can be written as multi-line functions through decorator syntax (`@Function`).

Lastly, function objects enable parameter sharing. If you call the same
function object at multiple places, all invocations will naturally share the same learnable parameters.
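
Parameter sharing can be illustrated with a small NumPy sketch in the same simplified style as the `Dense` pseudo-code above. The factory `make_dense` is a hypothetical helper written for this illustration, not a CNTK function: calling one function object twice reuses its `W` and `b`, while a second object created separately holds independent parameters.

```python
import numpy as np

def make_dense(in_dim, out_dim, rng):
    # Hypothetical stand-in for a layer factory: creates the parameters once
    # and returns a callable that closes over them.
    W = rng.standard_normal((in_dim, out_dim)) * 0.05
    b = np.zeros(out_dim)
    def dense(x):
        return np.tanh(x @ W + b)
    dense.W, dense.b = W, b
    return dense

rng = np.random.default_rng(0)
shared = make_dense(2, 3, rng)   # one function object ...
other = make_dense(2, 3, rng)    # ... and an independent one

x = np.array([1.0, 2.0])
y1 = shared(x)                   # first invocation
y2 = shared(x)                   # second invocation reuses the same W and b

# Changing the shared parameters in place affects every later call of `shared`,
# but leaves `other` untouched.
before_other = other(x)
shared.W[:] = 0.0
y3 = shared(x)                   # tanh(0) is 0 for every output
```

In this sketch the sharing follows from the closure holding a single copy of the arrays; a training step that updates `shared.W` would be seen by every place where `shared` is applied.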

In summary, the function object is CNTK's single abstraction for conveniently defining
simple and complex models, parameter sharing, and training objectives.

(Note that under the hood, CNTK uses a static graph,
and it is possible to define CNTK networks as graphs like TensorFlow and Theano, as discussed
further below.)

### CNTK's Data model: Sequences of Tensors

CNTK can operate on two types of data:

 * **tensors** (that is, N-dimensional arrays), dense or sparse
 * **sequences** of tensors

The distinction is that the shape of a tensor is static during operation,
while the length of a sequence depends on data.
Tensors have *static axes*, while a sequence has a *dynamic axis*.

In CNTK, categorical data is represented as sparse one-hot tensors, not as integer vectors.
This allows embeddings and loss functions to be written in a unified fashion as matrix products.
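
The following NumPy sketch shows why the one-hot representation makes both operations matrix products. It uses dense one-hot arrays and made-up sizes for readability; a sparse format would store only the index of the non-zero entry. None of this is CNTK code.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))   # embedding matrix

word_ids = np.array([4, 0, 4])                     # integer representation
one_hot = np.eye(vocab_size)[word_ids]             # one-hot representation, shape (3, 6)

# Embedding: a one-hot row times E selects the corresponding row of E.
embedded = one_hot @ E
lookup = E[word_ids]

# Cross entropy: product of the one-hot label with the log-probabilities.
log_probs = np.log(np.full((3, vocab_size), 1.0 / vocab_size))  # uniform prediction
loss = -(one_hot * log_probs).sum(axis=1)
```

Here `embedded` equals the plain index lookup `lookup`, and each entry of `loss` is the negative log-probability of the labeled class.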

CNTK adopts Python's type-annotation syntax to declare CNTK types.
For example,

 * `Tensor[(13,42)]` denotes a tensor with 13 rows and 42 columns, and
 * `Sequence[SparseTensor[300000]]` a sequence of sparse vectors, each of which could, for example, represent a word out of a 300k dictionary

Note that unlike Python type hints, this works with Python 2.7. There are two more data types: 'batch of tensors' and 'batch of sequences of tensors'.
These are used internally for minibatching, but we generally try to hide batching from the user:
We want users to think in tensors and sequences, and leave mini-batching to CNTK.
To this end, unlike other toolkits, CNTK transparently batches *sequences with different lengths*
into the same minibatch, and handles all the necessary padding and packing.
Workarounds like 'bucketing' are not needed.
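
For comparison, this is roughly the bookkeeping a user would have to do by hand when a toolkit does not batch variable-length sequences transparently. It is a NumPy sketch under that assumption, with a hypothetical helper `pad_and_mask`; it does not describe how CNTK implements batching internally.

```python
import numpy as np

def pad_and_mask(sequences):
    """Pack variable-length sequences of vectors into one padded array plus a mask."""
    max_len = max(len(s) for s in sequences)
    dim = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s      # copy the real steps
        mask[i, :len(s)] = True    # remember which steps are real
    return batch, mask

# Three sequences of 2-dimensional vectors with lengths 3, 1 and 2.
seqs = [np.ones((3, 2), np.float32), np.ones((1, 2), np.float32), np.ones((2, 2), np.float32)]
batch, mask = pad_and_mask(seqs)

# Sum over time while ignoring the padded steps: one vector per sequence.
sums = (batch * mask[:, :, None]).sum(axis=1)
```

Every operation along the sequence axis then has to respect `mask`, which is the kind of detail the text above says CNTK takes off the user's hands.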

### Your First CNTK Network: Simple Logistic Regression

Let us put all of this in action for a very simple example of logistic regression.
For this example, we create a synthetic data set of 2-dimensional normal-distributed 
data points, which should be classified into belonging to one of two classes.
Note that CNTK expects the labels as one-hot encoded.


In [32]:
input_dim_lr = 2    # classify 2-dimensional data
num_classes_lr = 2  # into one of two classes

# This example uses synthetic data from normal distributions, which we generate in the following.
#  X_lr[corpus_size,input_dim] - input data
#  Y_lr[corpus_size]           - labels (0 or 1), in one-hot representation
np.random.seed(0)
def generate_synthetic_data(N):
    Y = np.random.randint(size=N, low=0, high=num_classes_lr)  # labels
    X = (np.random.randn(N, input_dim_lr)+3) * (Y[:,None]+1)   # data
    # Our model expects float32 features, and cross-entropy expects one-hot encoded labels.
    Y = scipy.sparse.csr_matrix((np.ones(N,np.float32), (range(N), Y)), shape=(N, num_classes_lr))
    X = X.astype(np.float32)
    return X, Y
X_train_lr, Y_train_lr = generate_synthetic_data(20000)
X_test_lr,  Y_test_lr  = generate_synthetic_data(1024)
# let's have a peek at the first few samples
print(X_train_lr[:4])
print(Y_train_lr[:4].todense())

[[ 2.2741797   3.56347561]
 [ 5.12873602  5.79089499]
 [ 1.3574543   5.5718112 ]
 [ 3.54340553  2.46254587]]
[[ 1.  0.]
 [ 0.  1.]
 [ 0.  1.]
 [ 1.  0.]]


We now define the model function. The model function maps input data to predictions.
It is the final product of the training process.
In this example, we use the simplest of all models: logistic regression.

In [33]:
model_lr = cntk.layers.Dense(num_classes_lr, activation=None)

Next, we define the criterion function. The criterion function is
the harness that the trainer uses to optimize the model:
It maps (input vectors, labels) to (loss, metric).
The loss is used for the SGD updates. We choose cross entropy.
Specifically, `cross_entropy_with_softmax()` first applies
the `softmax()` function to the network's output, as
cross entropy expects probabilities.
We do not include `softmax()` in the model function itself, because
it is not necessary for using the model.
As the metric, we count classification errors (this metric is not differentiable).
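
To make the two outputs of the criterion concrete, here is a NumPy sketch of what such a function computes for a batch of scores. It is an illustration with made-up numbers, not CNTK's implementation, and `criterion_numpy` is a hypothetical name.

```python
import numpy as np

def criterion_numpy(z, labels_one_hot):
    """Return (mean loss, mean classification error) for scores z of shape (N, C)."""
    # log-softmax via the log-sum-exp trick for numerical stability
    z_shift = z - z.max(axis=1, keepdims=True)
    log_p = z_shift - np.log(np.exp(z_shift).sum(axis=1, keepdims=True))
    loss = -(labels_one_hot * log_p).sum(axis=1)
    # classification error: 1 if the highest-scoring class differs from the label
    error = (z.argmax(axis=1) != labels_one_hot.argmax(axis=1)).astype(np.float64)
    return loss.mean(), error.mean()

z = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 1.0], [0.5, 0.5]])    # raw scores
y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=np.float64)  # one-hot labels
loss, err = criterion_numpy(z, y)
```

The error is computed from the raw scores with `argmax`. Since softmax does not change which class scores highest, the model can be used for prediction without it, which is why it lives in the criterion only.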

We define the criterion function as Python code and convert it to a `Function` object.
A single expression can be written as `Function(lambda x, y: `*expression of x and y*`)`,
similar to Keras' `Lambda()`.
To avoid evaluating the model twice, we use a Python function definition
with decorator syntax. Lastly, this is a good time to tell CNTK about the
data types of our inputs, which is done via the decorator `@Function.with_signature(`*argument types*`)`:

In [34]:
@cntk.Function.with_signature(cntk.layers.Tensor[input_dim_lr], cntk.layers.SparseTensor[num_classes_lr])
def criterion_lr(data, label_one_hot):
    z = model_lr(data)  # apply model. Computes a non-normalized log probability for every output class.
    loss   = cntk.cross_entropy_with_softmax(z, label_one_hot) # this applies softmax to z under the hood
    metric = cntk.classification_error(z, label_one_hot)
    return loss, metric
print(criterion_lr)

Composite(data: Tensor[2], label_one_hot: SparseTensor[2]) -> Tuple[Tensor[1], Tensor[1]]


The decorator will 'compile' the function into a static graph
by calling `criterion_lr()`, passing it input variables of the given data types as arguments.
Thus, `criterion_lr` is not a Python function but a static graph in a CNTK `Function` object.

We are now ready to train our model:

In [35]:
learner = cntk.sgd(model_lr.parameters, cntk.learning_rate_schedule(0.1, cntk.UnitType.minibatch))
progress_writer = cntk.logging.ProgressPrinter(50)
_ = criterion_lr.train((X_train_lr, Y_train_lr), parameter_learners=[learner], callbacks=[progress_writer])

Learning rate per minibatch: 0.1
 Minibatch[   1-  50]: loss = 0.607452 * 1600, metric = 31.87% * 1600;
 Minibatch[  51- 100]: loss = 0.461523 * 1600, metric = 18.81% * 1600;
 Minibatch[ 101- 150]: loss = 0.389729 * 1600, metric = 12.38% * 1600;
 Minibatch[ 151- 200]: loss = 0.379355 * 1600, metric = 13.12% * 1600;
 Minibatch[ 201- 250]: loss = 0.323799 * 1600, metric = 9.25% * 1600;
 Minibatch[ 251- 300]: loss = 0.298206 * 1600, metric = 9.44% * 1600;
 Minibatch[ 301- 350]: loss = 0.296859 * 1600, metric = 9.38% * 1600;
 Minibatch[ 351- 400]: loss = 0.277590 * 1600, metric = 8.94% * 1600;
 Minibatch[ 401- 450]: loss = 0.279621 * 1600, metric = 8.19% * 1600;
 Minibatch[ 451- 500]: loss = 0.260024 * 1600, metric = 7.81% * 1600;
 Minibatch[ 501- 550]: loss = 0.243758 * 1600, metric = 7.12% * 1600;
 Minibatch[ 551- 600]: loss = 0.242975 * 1600, metric = 8.31% * 1600;
Finished Epoch[1]: loss = 0.335261 * 20000, metric = 11.90% * 20000 7.027s (2846.2 samples/s);


Let us test how we are doing on our test set:

In [36]:
test_metric = criterion_lr.test((X_test_lr, Y_test_lr), callbacks=[progress_writer]).metric

Finished Evaluation [1]: Minibatch[1-32]: metric = 8.11% * 1024;


And lastly, let us run a few samples through our model and see how it is doing.
Oops, `criterion_lr` knew the input types, but `model_lr` does not,
so we tell it using `update_signature()`.

In [37]:
model_lr.update_signature(cntk.layers.Tensor[input_dim_lr])
print(model_lr)

Dense(x: Tensor[2]) -> Tensor[2]


Now we can call it like any Python function:

In [38]:
z = model_lr(X_test_lr[:25])
print("Label    :", [label.todense().argmax() for label in Y_test_lr[:25]])
print("Predicted:", [z[i,:].argmax() for i in range(len(z))])

Label    : [0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
Predicted: [0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0]


## Your Second CNTK Network: MNIST Digit Recognition



Let us do the same thing as above on an actual task--the MNIST benchmark, which is sort of the "hello world" of deep learning.
The MNIST task is to recognize scans of hand-written digits. We first download and prepare the data.

In [46]:
input_shape_mn = (28, 28)  # MNIST digits are 28 x 28
num_classes_mn = 10        # classify as one of 10 digits

# Fetch the MNIST data. Best done with scikit-learn.
try:
    from sklearn import datasets, utils
    mnist = datasets.fetch_mldata("MNIST original")
    X, Y = mnist.data / 255.0, mnist.target
    X_train_mn, X_test_mn = X[:60000].reshape((-1,28,28)), X[60000:].reshape((-1,28,28))
    Y_train_mn, Y_test_mn = Y[:60000].astype(int), Y[60000:].astype(int)
except Exception: # workaround if scikit-learn or fetch_mldata is not available
    import requests, io, gzip
    X_train_mn, X_test_mn = (np.frombuffer(gzip.GzipFile(fileobj=io.BytesIO(requests.get('http://yann.lecun.com/exdb/mnist/' + name + '-images-idx3-ubyte.gz').content)).read()[16:], dtype=np.uint8).reshape((-1,28,28)).astype(np.float32) / 255.0 for name in ('train', 't10k'))
    Y_train_mn, Y_test_mn = (np.frombuffer(gzip.GzipFile(fileobj=io.BytesIO(requests.get('http://yann.lecun.com/exdb/mnist/' + name + '-labels-idx1-ubyte.gz').content)).read()[8:], dtype=np.uint8).astype(int) for name in ('train', 't10k'))

# Shuffle the training data.
np.random.seed(0) # always use the same reordering, for reproducibility
idx = np.random.permutation(len(X_train_mn))
X_train_mn, Y_train_mn = X_train_mn[idx], Y_train_mn[idx]

# Further split off a cross-validation set
X_train_mn, X_cv_mn = X_train_mn[:54000], X_train_mn[54000:]
Y_train_mn, Y_cv_mn = Y_train_mn[:54000], Y_train_mn[54000:]

# Our model expects float32 features, and cross-entropy expects one-hot encoded labels.
Y_train_mn, Y_cv_mn, Y_test_mn = (scipy.sparse.csr_matrix((np.ones(len(Y),np.float32), (range(len(Y)), Y)), shape=(len(Y), 10)) for Y in (Y_train_mn, Y_cv_mn, Y_test_mn))
X_train_mn, X_cv_mn, X_test_mn = (X.astype(np.float32) for X in (X_train_mn, X_cv_mn, X_test_mn))

# here's what it looks like
(X_train_mn[0]>0.5).astype(int)[:,5:-3]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0,

Let's define the CNTK model function to map (28x28)-dimensional images to a 10-dimensional score vector.

In [54]:
with cntk.layers.default_options(activation=cntk.ops.relu, pad=False):
    model_mn = cntk.layers.Sequential([
        cntk.layers.Convolution2D((5,5), num_filters=32, reduction_rank=0, pad=True), # reduction_rank=0 for B&W images
        cntk.layers.MaxPooling((2,2), strides=(2,2)),
        cntk.layers.Convolution2D((3,3), num_filters=48),
        cntk.layers.MaxPooling((2,2), strides=(2,2)),
        cntk.layers.Convolution2D((3,3), num_filters=64),
        cntk.layers.Dense(96),
        cntk.layers.Dropout(dropout_rate=0.5),
        cntk.layers.Dense(num_classes_mn, activation=None) # no activation in final layer (softmax is done in criterion)
    ])

Whoa, this model is a bit more complicated! It consists of several convolution-pooling layers and two
fully-connected layers for classification, which is typical for MNIST. This demonstrates several aspects of CNTK's Functional API.

First, we create each layer using a function from CNTK's layers library (`cntk.layers`).

Second, the higher-order layer `Sequential()` creates a new function that applies all those layers
one after another. This is known as [forward function composition](https://en.wikipedia.org/wiki/Function_composition).
Note that unlike some other toolkits, you cannot `Add()` more layers afterwards to a sequential layer.
CNTK's `Function` objects are immutable, besides their learnable parameters (to edit a `Function` object, you can `clone()` it).
If you prefer that style, create your layers as a Python list and pass that to `Sequential()`.
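
Forward function composition can be sketched in a few lines of plain Python. The lowercase `sequential` below is a hypothetical toy, not CNTK's `Sequential()`; it only illustrates the idea of building the layer list first and then composing it once.

```python
from functools import reduce

def sequential(layers):
    """Compose callables left to right: sequential([f, g, h])(x) == h(g(f(x)))."""
    layers = tuple(layers)  # freeze the list: the composed function cannot be extended later
    def composed(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return composed

double = lambda x: 2 * x
add_one = lambda x: x + 1
relu = lambda x: max(x, 0)      # any callable can serve as a layer

layers = [double, add_one]      # build the list step by step ...
layers.append(relu)
model = sequential(layers)      # ... then compose it once

result_pos = model(3)           # relu(2*3 + 1)
result_neg = model(-3)          # relu(2*(-3) + 1)
```

Appending to `layers` after the call to `sequential` has no effect on `model`, which mirrors the "build the list first" style described above.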

Third, the context manager `default_options()` lets you specify defaults for various optional arguments to layers,
such as that the activation function is always `relu`, unless overridden.

Lastly, note that `relu` is passed as the actual function, not a string.
Any function can be an activation function.
You can also pass a Python lambda directly. For example, relu could be
realized manually by saying `activation=lambda x: cntk.ops.element_max(x, 0)`.
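
To get a feel for what `model_mn` does to the image size, the spatial dimensions can be traced by hand. The sketch below assumes stride-1 convolutions, the usual arithmetic for padded and unpadded windows, and hypothetical helper names; the shapes CNTK actually infers should be checked on the model itself.

```python
def conv_out(size, kernel, pad):
    # stride-1 convolution: padding keeps the size, otherwise kernel - 1 pixels are lost
    return size if pad else size - kernel + 1

def pool_out(size, window, stride):
    # unpadded pooling
    return (size - window) // stride + 1

size = 28                              # MNIST images are 28 x 28
size = conv_out(size, 5, pad=True)     # 5x5 convolution, padded
size = pool_out(size, 2, 2)            # 2x2 max pooling, stride 2
size = conv_out(size, 3, pad=False)    # 3x3 convolution, unpadded
size = pool_out(size, 2, 2)            # 2x2 max pooling, stride 2
size = conv_out(size, 3, pad=False)    # 3x3 convolution, unpadded
flat = 64 * size * size                # values fed into Dense(96): 64 filters per position
```

Under these assumptions the side length goes 28, 28, 14, 12, 6, 4, so the first dense layer receives 64 x 4 x 4 values per image.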

The criterion function is defined like in the previous example, to map (28x28)-dimensional features and the corresponding
labels to loss and metric.

In [52]:
@cntk.Function.with_signature(cntk.layers.Tensor[input_shape_mn], cntk.layers.SparseTensor[num_classes_mn])
def criterion(data, label_one_hot):
    z = model_mn(data)  # apply model. Computes a non-normalized log probability for every output class.
    loss   = cntk.cross_entropy_with_softmax(z, label_one_hot) # this applies softmax to z under the hood
    metric = cntk.classification_error(z, label_one_hot)
    return loss, metric

### Learning model parameters

Now that the network is set up, we would like to learn the parameters $\bf w$ and $b$ for our simple linear layer. To do so, we convert the computed evidence ($z$) into a set of predicted probabilities ($\textbf p$) using a `softmax` function.

$$ \textbf{p} = \mathrm{softmax}(z)$$ 

The `softmax` is an activation function that maps the accumulated evidences to a probability distribution over the classes (Details of the [softmax function][]). Other choices of activation function can be [found here][].

[softmax function]: https://www.cntk.ai/pythondocs/cntk.ops.html#cntk.ops.softmax

[found here]: https://github.com/Microsoft/CNTK/wiki/Activation-Functions

## Training
The output of the `softmax` is the probability of an observation belonging to each of the respective classes. For training the classifier, we need to determine what behavior the model needs to mimic. In other words, we want the generated probabilities to be as close as possible to the observed labels. The function that measures this is called the *cost* or *loss* function; it quantifies the difference between the predictions of the learned model and the labels in the training set.

[`Cross-entropy`][] is a popular function to measure the loss. It is defined as:

$$ H(p) = - \sum_{j=1}^C y_j \log (p_j) $$  

where $p$ is our predicted probability from the `softmax` function and $y$ represents the label. This label, provided with the data for training, is also called the ground-truth label. In the two-class example, the `label` variable has two dimensions (equal to `num_output_classes`, or $C$). Generally speaking, if the task at hand requires classification into $C$ different classes, the label variable will have $C$ elements, with 0 everywhere except for the class represented by the data point, where it will be 1. Understanding the [details][] of this cross-entropy function is highly recommended.
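
As a small worked example (the numbers are chosen purely for illustration, and $\log$ is taken to be the natural logarithm): with $C = 2$, label $y = (1, 0)$ and prediction $p = (0.7, 0.3)$, only the term of the true class survives:

$$ H(p) = -\left(1 \cdot \log 0.7 + 0 \cdot \log 0.3\right) = -\log 0.7 \approx 0.357 $$

A more confident correct prediction such as $p = (0.99, 0.01)$ gives $-\log 0.99 \approx 0.010$, while a confident wrong prediction $p = (0.01, 0.99)$ gives $-\log 0.01 \approx 4.605$. The loss therefore penalizes confident mistakes much more heavily than it rewards confident correct answers.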

[`cross-entropy`]: http://cntk.ai/pythondocs/cntk.ops.html#cntk.ops.cross_entropy_with_softmax
[details]: http://colah.github.io/posts/2015-09-Visual-Information/

In [None]:
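import cntk as C  # the cells below refer to CNTK through the alias `C`

# `z` (the model output) and `num_output_classes` are expected to be defined earlier
label = C.input_variable((num_output_classes), np.float32)
loss = C.cross_entropy_with_softmax(z, label)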

#### Evaluation

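To evaluate the classification, we compare the output of the network with the ground-truth label. For each observation, the network emits a vector of evidences (which can be converted into probabilities using the `softmax` function) with dimension equal to the number of classes. `classification_error` counts an observation as an error when the class with the highest evidence differs from the labeled class.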

In [None]:
eval_error = C.classification_error(z, label)

### Configure training

The trainer strives to reduce the `loss` function by different optimization approaches, [Stochastic Gradient Descent][] (`sgd`) being one of the most popular ones. Typically, one would start with random initialization of the model parameters. The `sgd` optimizer would calculate the `loss` or error between the predicted label and the corresponding ground-truth label and, using [gradient descent][gradient-decent], generate a new set of model parameters in a single iteration.

Updating the model parameters using a single observation at a time is attractive, since it does not require the entire data set (all observations) to be loaded in memory and requires gradient computation over fewer data points, thus allowing for training on large data sets. However, the updates generated using a single observation at a time can vary wildly between iterations. An intermediate ground is to load a small set of observations and use an average of the `loss` or error from that set to update the model parameters. This subset is called a *minibatch*.

With minibatches, we often sample observations from the larger training dataset. We repeat the process of updating the model parameters using different combinations of training samples and, over a period of time, minimize the `loss` (and the error). When the incremental error rates are no longer changing significantly, or after a preset maximum number of minibatches to train, we claim that our model is trained.
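
The minibatch procedure can be sketched in plain NumPy. This is an illustration under simplifying assumptions, not the CNTK trainer: it uses a binary logistic model with a single output instead of a two-output softmax, synthetic data invented for the example, and a fixed step size (`learning_rate`) that scales each parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (illustrative only): class 1 is shifted away from class 0.
N, dim = 2000, 2
y = rng.integers(0, 2, size=N)
X = rng.standard_normal((N, dim)) + 2.0 * y[:, None]

w, b = np.zeros(dim), 0.0
learning_rate, minibatch_size = 0.5, 25

def mean_loss(X, y, w, b):
    # binary cross entropy of a logistic model, computed stably via logaddexp
    z = X @ w + b
    return np.mean(np.logaddexp(0.0, z) - y * z)

loss_before = mean_loss(X, y, w, b)
for step in range(400):
    idx = rng.integers(0, N, size=minibatch_size)    # sample a minibatch
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))          # predicted probability of class 1
    grad_w = xb.T @ (p - yb) / minibatch_size        # gradient averaged over the minibatch
    grad_b = np.mean(p - yb)
    w -= learning_rate * grad_w                      # step against the gradient
    b -= learning_rate * grad_b
loss_after = mean_loss(X, y, w, b)
```

Averaging the gradient over 25 samples smooths the updates compared with single-sample steps, while each step still touches only a small part of the data.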

One of the key parameters for optimization is called the `learning_rate`. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration. We will cover more details in a later tutorial.
With this information, we are ready to create our trainer.

[optimization]: https://en.wikipedia.org/wiki/Category:Convex_optimization
[Stochastic Gradient Descent]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
[gradient-decent]: http://www.statisticsviews.com/details/feature/5722691/Getting-to-the-Bottom-of-Regression-with-Gradient-Descent.html

In [None]:
# Instantiate the trainer object to drive the model training
learning_rate = 0.5
lr_schedule = C.learning_rate_schedule(learning_rate, C.UnitType.minibatch) 
learner = C.sgd(z.parameters, lr_schedule)
trainer = C.Trainer(z, (loss, eval_error), [learner])

First, let us create some helper functions that will be needed to visualize different functions associated with training. Note that these convenience functions are only meant to help understand what goes on under the hood.

In [None]:

# Define a utility function to compute the moving average sum.
# A more efficient implementation is possible with np.cumsum() function
def moving_average(a, w=10):
    if len(a) < w: 
        return a[:]    
    return [val if idx < w else sum(a[(idx-w):idx])/w for idx, val in enumerate(a)]


# Defines a utility that prints the training progress
def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss, eval_error = "NA", "NA"

    if mb % frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose: 
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}".format(mb, training_loss, eval_error))
        
    return mb, training_loss, eval_error
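
The comment in `moving_average` mentions that `np.cumsum()` allows a more efficient implementation. One possible version is sketched below under the assumption that it should reproduce the list-based helper exactly: entries before index `w` are returned unchanged, and every later entry is the mean of the `w` preceding values. The name `moving_average_cumsum` is hypothetical, and it returns a NumPy array rather than a list.

```python
import numpy as np

def moving_average_cumsum(a, w=10):
    """Entry idx >= w is the mean of the w values before it; earlier entries are unchanged."""
    a = np.asarray(a, dtype=np.float64)
    if len(a) < w:
        return a.copy()
    c = np.concatenate(([0.0], np.cumsum(a)))   # c[k] == sum(a[:k])
    out = a.copy()                              # the first w entries stay unsmoothed
    idx = np.arange(w, len(a))
    out[w:] = (c[idx] - c[idx - w]) / w         # sum(a[idx-w:idx]) / w
    return out

values = [float(v) for v in range(25)]
smoothed = moving_average_cumsum(values, w=10)
```

The cumulative sum turns each window sum into a difference of two array entries, so the cost no longer grows with the window size.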

### Run the trainer

We are now ready to train our Logistic Regression model. We want to decide what data we need to feed into the training engine.

In this example, each iteration of the optimizer will work on 25 samples (25 dots w.r.t. the plot above), a.k.a. the `minibatch_size`. We would like to train on, say, 20000 observations. If the number of samples in the data is only 10000, the trainer will make 2 passes through the data. The resulting number of iterations is represented by `num_minibatches_to_train`. Note: In a real-world case, we would be given a certain amount of labeled data (in the context of this example, observations (age, size) and what they mean (benign / malignant)). We would use a large number of observations for training, say 70%, and set aside the remainder for evaluation of the trained model.

With these parameters we can proceed with training our simple feedforward network.

In [None]:
# Initialize the parameters for the trainer
minibatch_size = 25
num_samples_to_train = 20000
num_minibatches_to_train = int(num_samples_to_train  / minibatch_size)

In [None]:
# Run the trainer and perform model training
training_progress_output_freq = 50

plotdata = {"batchsize":[], "loss":[], "error":[]}

for i in range(0, num_minibatches_to_train):
    features, labels = generate_random_data_sample(minibatch_size, input_dim, num_output_classes)
    
    # Specify input variables mapping in the model to actual minibatch data to be trained with
    trainer.train_minibatch({feature : features, label : labels})
    batchsize, loss, error = print_training_progress(trainer, i, 
                                                     training_progress_output_freq, verbose=1)
    
    if not (loss == "NA" or error =="NA"):
        plotdata["batchsize"].append(batchsize)
        plotdata["loss"].append(loss)
        plotdata["error"].append(error)
        

In [None]:
# Compute the moving average loss to smooth out the noise in SGD
plotdata["avgloss"] = moving_average(plotdata["loss"])
plotdata["avgerror"] = moving_average(plotdata["error"])

# Plot the training loss and the training error
import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["avgloss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss')

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["avgerror"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error')
plt.tight_layout()  # keep the two subplots from overlapping
plt.show()          # show both subplots in one figure

## Evaluation / Testing 

Now that we have trained the network, let us evaluate it on data that hasn't been used for training. This is called **testing**. Let us create some new data and evaluate the average error and loss on this set. This is done using `trainer.test_minibatch`. Note that the error on this previously unseen data is comparable to the training error. This is a **key** check. Should the error be larger than the training error by a large margin, it indicates that the trained model will not perform well on data that it has not seen during training. This is known as [overfitting][]. There are several ways to address overfitting that are beyond the scope of this tutorial, but the Cognitive Toolkit provides the necessary components to address it.

Note: We are testing on a single minibatch for illustrative purposes. In practice one runs several minibatches of test data and reports the average. 
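
Reporting the average over several test minibatches could look like the sketch below. The helper name and the per-minibatch numbers are hypothetical; in practice the error values would come from repeated calls to `trainer.test_minibatch`. The average is weighted by minibatch size so that a smaller final minibatch does not count as much as a full one.

```python
def average_test_error(minibatch_errors, minibatch_sizes):
    """Sample-weighted mean of per-minibatch error rates."""
    total = sum(minibatch_sizes)
    return sum(e * n for e, n in zip(minibatch_errors, minibatch_sizes)) / total

# Hypothetical per-minibatch results: three full minibatches and a smaller final one.
errors = [0.08, 0.12, 0.04, 0.20]
sizes = [25, 25, 25, 5]
avg = average_test_error(errors, sizes)
```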

**Question** Why is this suggested? Try computing the test error over several sets of generated data samples and plot it using the plotting functions used for training. Do you see a pattern?

[overfitting]: https://en.wikipedia.org/wiki/Overfitting


In [None]:
# Run the trained model on newly generated dataset
test_minibatch_size = 25
features, labels = generate_random_data_sample(test_minibatch_size, input_dim, num_output_classes)

trainer.test_minibatch({feature : features, label : labels}) 

### Checking prediction / evaluation 
For evaluation, we map the output of the network to values between 0 and 1 and interpret them as probabilities for the two classes. These indicate the chances of each observation being malignant or benign. We use a `softmax` function to get the probabilities of each class.
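
What this mapping does to the raw network outputs can be seen in a small NumPy sketch. This is a stand-alone illustration with made-up scores, not the CNTK operator: every row is turned into non-negative values that sum to 1, and the highest-scoring class stays the same.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max to avoid overflow
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, -1.0], [0.0, 0.0], [1000.0, 1001.0]])
probs = softmax(scores)
```

Subtracting the row maximum before exponentiating leaves the result unchanged mathematically but keeps large scores, such as those in the last row, from overflowing.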

In [None]:
out = C.softmax(z)
result = out.eval({feature : features})

Let us compare the ground-truth label with the predictions. They should be in agreement.

**Question:** 
- How many predictions were mislabeled? Can you change the code below to identify which observations were misclassified? 

In [None]:
print("Label    :", [np.argmax(label) for label in labels])
print("Predicted:", [np.argmax(result[i,:]) for i in range(len(result))])

### Visualization
It is desirable to visualize the results. In this example, the data is conveniently in two dimensions and can be plotted. For data with higher dimensions, visualization can be challenging. There are advanced dimensionality reduction techniques, such as [t-sne][], that allow for such visualizations.

[t-sne]: https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

In [None]:
# Model parameters
print(mydict['b'].value)

bias_vector   = mydict['b'].value
weight_matrix = mydict['w'].value

# Plot the data 
import matplotlib.pyplot as plt

# given this is a 2 class 
colors = ['r' if l == 0 else 'b' for l in labels[:,0]]
plt.scatter(features[:,0], features[:,1], c=colors)
plt.plot([0, bias_vector[0]/weight_matrix[0][1]], 
         [ bias_vector[1]/weight_matrix[0][0], 0], c = 'g', lw = 3)
plt.xlabel("Scaled age (in yrs)")
plt.ylabel("Tumor size (in cm)")
plt.show()

**Exploration Suggestions** 
- Try exploring how the classifier behaves with different data distributions. We suggest changing the `minibatch_size` parameter from 25 to, say, 64. Why is the error increasing?
- Try exploring different activation functions
- Try exploring different learners 
- You can explore training a [multiclass logistic regression][] classifier.

[multiclass logistic regression]: https://en.wikipedia.org/wiki/Multinomial_logistic_regression