# Handwritten Digit Recognition

In this tutorial, we'll give you a step by step walk-through of how to build a hand-written digit classifier using the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. For someone new to deep learning, this exercise is arguably the "Hello World" equivalent.

MNIST is a widely used dataset for the hand-written digit classification task. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.

![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/mnist.png)

**Figure 1:** Sample images from the MNIST dataset.

## Prerequisites
To complete this tutorial, we need:  

- MXNet version 0.10 or later. See the installation instructions for your operating system in [Setup and Installation](http://mxnet.io/install/index.html).

- [Python Requests](http://docs.python-requests.org/en/master/) and [Jupyter Notebook](http://jupyter.org/index.html).

```
$ pip install requests jupyter
```

## Loading Data

Before we define the model, let's first fetch the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.

The following source code downloads and loads the images and the corresponding labels into memory.

In [27]:
import numpy
import mxnet as mx

#Faz o download do dataset mnist
mnist = mx.test_utils.get_mnist()

# Fix the seed
# Seta o valor aleatório inicial
mx.random.seed(42)

# Define o contexto computacional, se a GPU estiver disponivel vai executar nela pq é mais rápido, senão na CPU
ctx_comp = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()



INFO:root:train-labels-idx1-ubyte.gz exists, skipping download
INFO:root:train-images-idx3-ubyte.gz exists, skipping download
INFO:root:t10k-labels-idx1-ubyte.gz exists, skipping download
INFO:root:t10k-images-idx3-ubyte.gz exists, skipping download


Depois de rodar o código acima, todo o dataset MNIST deve ter sido carregado integralmente na memória. Note que sem o mxnet não é possível carregar um grande número de dados como fizemos aqui.O que é necessário é um mecanismo pelo qual possamos transmitir dados de forma rápida e eficiente diretamente da fonte. Vamos configurar o interador de dados para agrupar lotes de 100 imagens (28x28) e o seu rótulo correspondente.

After running the above source code, the entire MNIST dataset should be fully loaded into memory. Note that for large datasets it is not feasible to pre-load the entire dataset first like we did here. What is needed is a mechanism by which we can quickly and efficiently stream data directly from the source. MXNet Data iterators come to the rescue here by providing exactly that. Data iterator is the mechanism by which we feed input data into an MXNet training algorithm and they are very simple to initialize and use and are optimized for speed. During training, we typically process training samples in small batches and over the entire training lifetime will end up processing each training example multiple times. In this tutorial, we'll configure the data iterator to feed examples in batches of 100. Keep in mind that each example is a 28x28 grayscale image and the corresponding label.

Lotes de imagem são normalmente representados por um vetor 4-D com modelo `(tamanho do lote, número de canais, largura, altura)`. Para o dataset MNIST, desde que esteja em grayscale, só terá um canal de cor. Também, as imagens são 28x28 pixels, portanto todas as imagens tem a mesma largura e altura. Logom o formato de entrada fica `(tamanho do lote, 1, 28, 28)`. Outro ponto importante a ser considerado é a ordem de entrada das amostras. Quando os exemplos de treinamento são alimentados, é importante que não alimentemos as amostras com o mesmo rótulo em sequência. Podendo fazer com que o treino seja mais lento.

Image batches are commonly represented by a 4-D array with shape `(batch_size, num_channels, width, height)`. For the MNIST dataset, since the images are grayscale, there is only one color channel. Also, the images are 28x28 pixels, and so each image has width and height equal to 28. Therefore, the shape of input is `(batch_size, 1, 28, 28)`. Another important consideration is the order of input samples. When feeding training examples, it is critical that we don't feed samples with the same label in succession. Doing so can slow down training.
Data iterators take care of this by randomly shuffling the inputs. Note that we only need to shuffle the training data. The order does not matter for test data.

O código fonte que segue, inicializa o interador de dados pelo dataset mnist. Note que nós inicializamos dois interadores: um pro dado de treino e outro pro dado de teste.

The following source code initializes the data iterators for the MNIST dataset. Note that we initialize two iterators: one for train data and one for test data.

In [60]:
#lote de 100 "neurônios"
tam_lote= 100

#referencia= https://mxnet.incubator.apache.org/api/python/io/io.html
#http://mxnet.incubator.apache.org/test/versions/0.10/api/python/ndarray.html#module-mxnet.ndarray
#procurar youtube data interator
#The NDArray API, defined in the ndarray (or simply nd) package, provides imperative tensor operations on CPU/GPU. 
#An NDArray represents a multi-dimensional, fixed-size homogenous array.

#(tam_lote, num_channels, width, height)

#inicializou e embaralhou o vetor de imagens de teste e seus labels(onde diz qual é o número que está na imagem) 
dados_treino = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], tam_lote, shuffle=True)
print(dados_treino.provide_data)
print(dados_treino.provide_label)

#inicializou o vetor de teste
dados_teste = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], tam_lote)
print(dados_teste.provide_data)
print(dados_teste.provide_label)

[DataDesc[data,(100, 1, 28, 28),<class 'numpy.float32'>,NCHW]]
[DataDesc[softmax_label,(100,),<class 'numpy.float32'>,NCHW]]
[DataDesc[data,(100, 1, 28, 28),<class 'numpy.float32'>,NCHW]]
[DataDesc[softmax_label,(100,),<class 'numpy.float32'>,NCHW]]


## Training
We will cover a couple of approaches for performing the hand written digit recognition task. The first approach makes use of a traditional deep neural network architecture called Multilayer Perceptron (MLP). We'll discuss its drawbacks and use that as a motivation to introduce a second more advanced approach called Convolution Neural Network (CNN) that has proven to work very well for image classification tasks.

### Multilayer Perceptron

Nós vamos começar criando um local reservado para o dado de entrada. Quando trabalhamos com MPL, nós precisamos achatar nossas imagens 28x28 em um plano 1-D estrutura bruta de valores de pixels de 784 (28*28). A ordem dos valores dos pixels no vetor achatado não importa se nós estivermos sendo consistentes, fazendo isso com todas as imagens.

The first approach makes use of a [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) to solve this problem. We'll define the MLP using MXNet's symbolic interface. We begin by creating a place holder variable for the input data. When working with an MLP, we need to flatten our 28x28 images into a flat 1-D structure of 784 (28 * 28) raw pixel values. The order of pixel values in the flattened vector does not matter as long as we are being consistent about how we do this across all images.

In [30]:
#criando um local reservado para o dado de entrada
dado_entrada = mx.sym.var('data')

# Flatten the data from 4-D shape into 2-D (tam_lote, num_channel*width*height)
dato_entrada = mx.sym.flatten(data=data)

Você pode estar se perguntando se estamos perdendo informações valiosas achatando o dado. De fato isso é verdade e nós vamos falar sobre isso quando nós falarmos sobre redes neurais convolacionais onde nós preservamos o formato de entrada do dado. Por enquanto, nós vamos em frente com as imagens achatadas.

One might wonder if we are discarding valuable information by flattening. That is indeed true and we'll cover this more when we talk about convolutional neural networks where we preserve the input shape. For now, we'll go ahead and work with flattened images.

MPLs camadas totalmente conectadas. Uma camada totalmente conectada ou camada FC pra encurtar, é onde cada neurônio na camada está conectado com todos os neorônios da camada anterior. Pela perspectiva da algebra linear, uma camada FC aplica uma "affine transform" para a matriz de entrada *X* de *n x m* e gera uma matriz "Y' do tamanho *n x k*, onde *k* é o número de neorônios na camada FC. *k* também se refere como o tamanho oculto. A matriz Y é gerada a partir da equação Y = X W<sup>T</sup> + b. A camada FC tem dois parâmetros aprendíveis, a matriz W de peso *k x m* e o vetor de polarização b (bias) de peso *1 x k*.  soma do vetor de viés segue as regras de transmissão explicadas em [`mxnet.sym.broadcast_to()`](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.broadcast_to). 
Conceitualmente, a transmissão replica os elementos de linha do vetor de polarização para criar uma matriz * n x k * antes do somatório.

MLPs contains several fully connected layers. A fully connected layer or FC layer for short, is one where each neuron in the layer is connected to every neuron in its preceding layer. From a linear algebra perspective, an FC layer applies an [affine transform](https://en.wikipedia.org/wiki/Affine_transformation) to the *n x m* input matrix *X* and outputs a matrix *Y* of size *n x k*, where *k* is the number of neurons in the FC layer. *k* is also referred to as the hidden size. The output *Y* is computed according to the equation *Y = X W<sup>T</sup> + b*. The FC layer has two learnable parameters, the *k x m* weight matrix *W* and the *1 x k* bias vector *b*. The summation of bias vector follows the broadcasting rules explained in [`mxnet.sym.broadcast_to()`](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.broadcast_to). Conceptually, broadcasting replicates row elements of the bias vector to create an *n x k* matrix before summation.


In an MLP, the outputs of most FC layers are fed into an activation function, which applies an element-wise non-linearity. This step is critical and it gives neural networks the ability to classify inputs that are not linearly separable. Common choices for activation functions are sigmoid, tanh, and [rectified linear unit](https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29) (ReLU). In this example, we'll use the ReLU activation function which has several desirable properties and is typically considered a default choice.

The following code declares two fully connected layers with 128 and 64 neurons each. Furthermore, these FC layers are sandwiched between ReLU activation layers each one responsible for performing an element-wise ReLU transformation on the FC layer output.

In [34]:
# The first fully-connected layer and the corresponding activation function
#seta o numero de neoronios da camada 1 
camada1  = mx.sym.FullyConnected(data=dado_entrada, num_hidden=128)
#Diz a activation function usada.
resultado1 = mx.sym.Activation(data=camada1, act_type="relu")

# The second fully-connected layer and the corresponding activation function
camada2  = mx.sym.FullyConnected(data=resultado1, num_hidden = 64)
resultado2 = mx.sym.Activation(data=camada2, act_type="relu")

A última camada fully connect layer normalmente tem o mesmo tamanho das classes de saida do dataset
The last fully connected layer often has its hidden size equal to the number of output classes in the dataset. The activation function for this layer will be the softmax function. The Softmax layer maps its input to a probability score for each class of output. During the training stage, a loss function computes the [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution (softmax output) predicted by the network and the true probability distribution given by the label.

The following source code declares the final fully connected layer of size 10. 10 incidentally is the total number of digits. The output from this layer is fed into a `SoftMaxOutput` layer that performs softmax and cross-entropy loss computation in one go. Note that loss computation only happens during training.

In [37]:
# MNIST has 10 classes
#o dado da camada é a camada anterior com a função de ativação(RELU) aplicada
camada3  = mx.sym.FullyConnected(data=camada2, num_hidden=10)

# Softmax with cross entropy loss
#softmax=distribuição de probalidade entre 0 e 1
#cross-entropy=mede o número médio de bits necessários para identificar um evento 
mlp  = mx.sym.SoftmaxOutput(data=camada3, name='softmax')

![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mlp_mnist.png)

**Figure 2:** MLP network architecture for MNIST.

Now that both the data iterator and neural network are defined, we can commence training. Here we'll employ the `module` feature in MXNet which provides a high-level abstraction for running training and inference on predefined networks. The module API allows the user to specify appropriate parameters that control how the training proceeds.

The following source code initializes a module to train the MLP network we defined above. For our training, we will make use of the stochastic gradient descent (SGD) optimizer. In particular, we'll be using mini-batch SGD. Standard SGD processes train data one example at a time. In practice, this is very slow and one can speed up the process by processing examples in small batches. In this case, our batch size will be 100, which is a reasonable choice. Another parameter we select here is the learning rate, which controls the step size the optimizer takes in search of a solution. We'll pick a learning rate of 0.1, again a reasonable choice. Settings such as batch size and learning rate are what are usually referred to as hyper-parameters. What values we give them can have a great impact on training performance. For the purpose of this tutorial, we'll start with some reasonable and safe values. In other tutorials, we'll discuss how one might go about finding a combination of hyper-parameters for optimal model performance.

Typically, one runs the training until convergence, which means that we have learned a good set of model parameters (weights + biases) from the train data. For the purpose of this tutorial, we'll run training for 10 epochs and stop. An epoch is one full pass over the entire train data.

In [96]:
# TREINAMENTO

# stochastic gradient descent (SGD)= busca minimizar a loss function(normalmente a diferença entre a saída prevista e a real), 
# escolhendo os melhores parâmetros de entrada dos que tem disponível.

# cria um módulo de treino no contexto computacional atual(GPU ou CPU)
mlp_model = mx.mod.Module(symbol=mlp, context=ctx)

# parêmtros do modelo
mlp_model.fit(dados_treino,                                           # dados de treino
          eval_data=dados_teste,                                      # dados para avaliação
          optimizer='sgd',                                            # usa o método de otimização SGD
          optimizer_params={'learning_rate':0.6},                     # controla o tamanho da passo que o otimizador toma em busca de uma solução.
                                                                      # Este parâmetro informa ao otimizador até onde mover os pesos na direção do gradiente para um mini-lote.
          eval_metric='acc',                                          # printa a precisão do treino
          batch_end_callback = mx.callback.Speedometer(tam_lote,100), # mostra progresso de treino para cada 100 lotes de 100 imagens
          num_epoch=10)                                               # Passa 10 vezes o dataset inteiro

INFO:root:Epoch[0] Batch [100]	Speed: 48340.47 samples/sec	accuracy=0.148020
INFO:root:Epoch[0] Batch [200]	Speed: 53797.96 samples/sec	accuracy=0.609500
INFO:root:Epoch[0] Batch [300]	Speed: 54088.44 samples/sec	accuracy=0.873400
INFO:root:Epoch[0] Batch [400]	Speed: 52665.47 samples/sec	accuracy=0.914100
INFO:root:Epoch[0] Batch [500]	Speed: 53798.03 samples/sec	accuracy=0.923300
INFO:root:Epoch[0] Train-accuracy=0.939495
INFO:root:Epoch[0] Time cost=1.170
INFO:root:Epoch[0] Validation-accuracy=0.948200
INFO:root:Epoch[1] Batch [100]	Speed: 50032.31 samples/sec	accuracy=0.944752
INFO:root:Epoch[1] Batch [200]	Speed: 54088.44 samples/sec	accuracy=0.952900
INFO:root:Epoch[1] Batch [300]	Speed: 53226.19 samples/sec	accuracy=0.956900
INFO:root:Epoch[1] Batch [400]	Speed: 51053.36 samples/sec	accuracy=0.957100
INFO:root:Epoch[1] Batch [500]	Speed: 48340.24 samples/sec	accuracy=0.959700
INFO:root:Epoch[1] Train-accuracy=0.968283
INFO:root:Epoch[1] Time cost=1.175
INFO:root:Epoch[1] Validat

### Prediction

After the above training completes, we can evaluate the trained model by running predictions on test data. The following source code computes the prediction probability scores for each test image. *prob[i][j]* is the probability that the *i*-th test image contains the *j*-th output class.

Depois do treinamento acima dser completado, nós podemos avaliar o modelo de treinamento rodando as previsões nos dados de teste. O código que segue calcula a probabilidade prevista para cada imagem de teste. *prob[i][j]* é a probabilidade de cada imagem das 10000 ser de cada uma das 10 classes

In [93]:
teste = mx.io.NDArrayIter(mnist['test_data'], None, tam_lote) #vetor setado com as imagens, sem os labels para teste

prob = mlp_model.predict(teste) #calcula a probabilidade de cada imagem das 10000, ser de cada uma das 10 classes

assert prob.shape == (10000, 10) #verificação em tempo de execução de uma condição qualquer. Se a condição não for verdadeira, 
                                 #uma exceção AssertionError acontece e o programa para

Since the dataset also has labels for all test images, we can compute the accuracy metric as follows:

Desde que o dataser também tenha labels pra todas as imagens de teste, nós podemos calcular a precisão 

In [94]:
teste = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], tam_lote)

# predict accuracy of mlp
# previsão da precisão do modelo mlp
precisao = mx.metric.Accuracy()
mlp_model.score(teste, precisao)
print(precisao)
assert precisao.get()[1] > 0.96, "A precisão alcançada (%f) é menor do que a esperada (0.96)" % precisao.get()[1]

EvalMetric: {'accuracy': 0.9737}


If everything went well, we should see an accuracy value that is around 0.96, which means that we are able to accurately predict the digit in 96% of test images. This is a pretty good result. But as we will see in the next part of this tutorial, we can do a lot better than that.

### Convolutional Neural Network


Anteriormente, falamos brevemente sobre uma desvantagem do MLP quando dissemos que precisamos descartar a forma original da imagem de entrada e achatá-la como um vetor antes de podermos alimentá-la como entrada para a primeira camada totalmente conectada do MLP. Acontece que essa é uma questão importante, porque não aproveitamos o fato de que os pixels da imagem têm uma correlação espacial natural ao longo dos eixos horizontal e vertical. Uma rede neural convolucional (CNN) visa resolver este problema usando uma representação de peso mais estruturada. Em vez de achatar a imagem e fazer uma multiplicação simples matriz-matriz, ela emprega uma ou mais camadas convolucionais que cada uma realiza uma convolução 2-D na imagem de entrada.

Earlier, we briefly touched on a drawback of MLP when we said we need to discard the input image's original shape and flatten it as a vector before we can feed it as input to the MLP's first fully connected layer. Turns out this is an important issue because we don't take advantage of the fact that pixels in the image have natural spatial correlation along the horizontal and vertical axes. A convolutional neural network (CNN) aims to address this problem by using a more structured weight representation. Instead of flattening the image and doing a simple matrix-matrix multiplication, it employs one or more convolutional layers that each performs a 2-D convolution on the input image.


Uma única camada de convolução consiste em um ou mais filtros que desempenham cada um o papel de um detector de características. Durante o treinamento, a CNN aprende representações apropriadas (parâmetros) para esses filtros. Semelhante ao MLP, a saída da camada convolucional é transformada pela aplicação de uma não linearidade. Além da camada convolucional, outro aspecto importante de uma CNN é a camada de pooling. Uma camada de pooling serve para tornar a tradução CNN invariante: um dígito permanece o mesmo, mesmo quando é deslocado para a esquerda / direita / para cima / baixo por alguns pixels. Uma camada de pooling reduz um patch * n x m * em um valor único para tornar a rede menos sensível à localização espacial. A camada de pool é sempre incluída após cada camada conv (+ ativação) na CNN.

A single convolution layer consists of one or more filters that each play the role of a feature detector. During training, a CNN learns appropriate representations (parameters) for these filters. Similar to MLP, the output from the convolutional layer is transformed by applying a non-linearity. Besides the convolutional layer, another key aspect of a CNN is the pooling layer. A pooling layer serves to make the CNN translation invariant: a digit remains the same even when it is shifted left/right/up/down by a few pixels. A pooling layer reduces a *n x m* patch into a single value to make the network less sensitive to the spatial location. Pooling layer is always included after each conv (+ activation) layer in the CNN.

O código-fonte a seguir define uma arquitetura de rede neural convolucional chamada LeNet. LeNet é uma rede popular conhecida por funcionar bem em tarefas de classificação de dígitos. Usaremos uma versão ligeiramente diferente da implementação original do LeNet, substituindo as ativações sigmóides por ativações de tanh para os neurônios

The following source code defines a convolutional neural network architecture called LeNet. LeNet is a popular network known to work well on digit classification tasks. We will use a slightly different version from the original LeNet implementation, replacing the sigmoid activations with tanh activations for the neurons

In [63]:
data = mx.sym.var('data')

# first conv layer
conv1 = mx.sym.Convolution(data=data, kernel=(5,5), num_filter=20)
tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))

# second conv layer
conv2 = mx.sym.Convolution(data=pool1, kernel=(5,5), num_filter=50)
tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))

# first fullc layer
flatten = mx.sym.flatten(data=pool2)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")

# second fullc
fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)

# softmax loss
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/conv_mnist.png)

**Figure 3:** First conv + pooling layer in LeNet.

Now we train LeNet with the same hyper-parameters as before. Note that, if a GPU is available, we recommend using it. This greatly speeds up computation given that LeNet is more complex and compute-intensive than the previous multilayer perceptron. To do so, we only need to change `mx.cpu()` to `mx.gpu()` and MXNet takes care of the rest. Just like before, we'll stop training after 10 epochs.

In [64]:
lenet_model = mx.mod.Module(symbol=lenet, context=ctx)
# train with the same
lenet_model.fit(dados_treino,
                eval_data=dados_teste,
                optimizer='sgd',
                optimizer_params={'learning_rate':0.1},
                eval_metric='acc',
                batch_end_callback = mx.callback.Speedometer(tam_lote, 100),
                num_epoch=10)

INFO:root:Epoch[0] Batch [100]	Speed: 927.64 samples/sec	accuracy=0.106733
INFO:root:Epoch[0] Batch [200]	Speed: 977.58 samples/sec	accuracy=0.115000
INFO:root:Epoch[0] Batch [300]	Speed: 989.66 samples/sec	accuracy=0.111400
INFO:root:Epoch[0] Batch [400]	Speed: 985.28 samples/sec	accuracy=0.115700
INFO:root:Epoch[0] Batch [500]	Speed: 998.05 samples/sec	accuracy=0.110000
INFO:root:Epoch[0] Train-accuracy=0.110707
INFO:root:Epoch[0] Time cost=61.410
INFO:root:Epoch[0] Validation-accuracy=0.113500
INFO:root:Epoch[1] Batch [100]	Speed: 936.94 samples/sec	accuracy=0.139802
INFO:root:Epoch[1] Batch [200]	Speed: 986.15 samples/sec	accuracy=0.568200
INFO:root:Epoch[1] Batch [300]	Speed: 983.83 samples/sec	accuracy=0.860600
INFO:root:Epoch[1] Batch [400]	Speed: 974.63 samples/sec	accuracy=0.905400
INFO:root:Epoch[1] Batch [500]	Speed: 975.96 samples/sec	accuracy=0.921400
INFO:root:Epoch[1] Train-accuracy=0.937879
INFO:root:Epoch[1] Time cost=62.289
INFO:root:Epoch[1] Validation-accuracy=0.948

### Prediction

Finally, we'll use the trained LeNet model to generate predictions for the test data.

In [65]:
teste = mx.io.NDArrayIter(mnist['test_data'], None, tam_lote)
prob = lenet_model.predict(teste)
teste = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], tam_lote)
# predict accuracy for lenet
precisao = mx.metric.Accuracy()
lenet_model.score(teste, precisao)
print(precisao)
assert precisao.get()[1] > 0.98, "Achieved accuracy (%f) is lower than expected (0.98)" % precisao.get()[1]

EvalMetric: {'accuracy': 0.9871}


If all went well, we should see a higher accuracy metric for predictions made using LeNet. With CNN we should be able to correctly predict around 98% of all test images.

## Summary

In this tutorial, we have learned how to use MXNet to solve a standard computer vision problem: classifying images of hand written digits. You have seen how to quickly and easily build, train and evaluate models such as MLP and CNN with MXNet.


<!-- INSERT SOURCE DOWNLOAD BUTTONS -->

