# How to feed data into CNTK 

Feeding data is an integral part of training a deep neural network. While expressiveness and succinct model representation are key aspects of CNTK, it also provides efficient and flexible data reading. Deep learning relies on providing randomly sampled training data to the CNTK model trainers. These small sampled data sets are called minibatches. In this manual, you will see code samples that show how minibatches can be read from data sources and passed on to trainer objects.

One of the following three mechanisms for reading data should cover most use cases when training a model:

A. **Small data (in memory)**: Users can generate their own data with NumPy, or read data in as **NumPy** arrays or **SciPy** sparse (CSR) matrices. This is the way to go when the data set is small and can be loaded into memory.

B. **Large data (cannot be loaded in memory)**: 

> (1) Using the built-in **MinibatchSource** class is the choice when data cannot be loaded into memory and one wants to build models that leverage the distributed computing capabilities in CNTK. This is supported via built-in readers and optional parameters that reduce the amount of boilerplate code one would otherwise have to write.

> (2) Using an explicit **minibatch loop** when the use case does not fit into either of the aforementioned categories, for instance, if one needs to adjust training parameters based on certain loss conditions.


In [10]:
from __future__ import print_function
import cntk
import numpy as np
import os
import scipy.sparse

import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for CNTK internal build system)

## Small data (in memory)

The `train()` and `test()` functions accept a tuple of NumPy arrays or SciPy sparse matrices (in CSR format) for their `minibatch_source` arguments.
The tuple members must be in the same order as the arguments of the `criterion` function that `train()` or `test()` are called on.
For dense tensors, use NumPy arrays, while sparse data should have the type `scipy.sparse.csr_matrix`.

Each of the arguments should be a Python list of numpy/scipy arrays, where each list entry represents a data item. For arguments declared as a sequence, the first axis (dimension) of the numpy/scipy array is the sequence length, while the remaining axes are the shape of each element of the sequence. Arguments that are not sequences consist of a single tensor. The shapes, data types (`np.float32/float64`) and sparseness must match the argument types.

As an optimization, arguments that are not sequences can also be passed as a single large numpy/scipy array (instead of a list).

**Note:** It is the responsibility of the user to randomize the data.
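
Since randomization is left to the user, one possible approach is to reorder features and labels with a single shared permutation before passing them on. The following is a minimal sketch under that assumption; the helper name `shuffle_in_unison` and the toy data are hypothetical and not part of CNTK.

```python
import numpy as np

def shuffle_in_unison(features, labels, seed=0):
    """Return features and labels reordered by one shared random permutation."""
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels must have the same number of rows")
    rng = np.random.RandomState(seed)
    order = rng.permutation(features.shape[0])
    # Indexing both arrays with the same index array keeps row i of the
    # features paired with row i of the labels.
    return features[order], labels[order]

# Toy data (hypothetical): the label of each row equals its first feature value,
# which makes it easy to check that the pairing survives the shuffle.
X = np.arange(10, dtype=np.float32).reshape(5, 2)
Y = X[:, 0].copy()
X_shuffled, Y_shuffled = shuffle_in_unison(X, Y)
print(np.array_equal(X_shuffled[:, 0], Y_shuffled))
```

Row indexing with an integer array should also work for `scipy.sparse.csr_matrix` labels, so the same idea can likely be applied to sparse one-hot labels.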

We use a simple logistic regression model for illustration purposes.

In [2]:
# Generate your own data
input_dim_lr = 2    # classify 2-dimensional data
num_classes_lr = 2  # into one of two classes

# This example uses synthetic data from normal distributions,
# which we generate in the following.
#  X_lr[corpus_size,input_dim]   - input data
#  Y_lr[corpus_size,num_classes] - labels (0 or 1), one-hot-encoded
np.random.seed(0)
def generate_synthetic_data(N):
    Y = np.random.randint(size=N, low=0, high=num_classes_lr)  # labels
    X = (np.random.randn(N, input_dim_lr)+3) * (Y[:,None]+1)   # data
    # Our model expects float32 features, and cross-entropy
    # expects one-hot encoded labels.
    Y = scipy.sparse.csr_matrix((np.ones(N,np.float32), (range(N), Y)), shape=(N, num_classes_lr))
    X = X.astype(np.float32)
    return X, Y
X_train_lr, Y_train_lr = generate_synthetic_data(20000)
X_test_lr,  Y_test_lr  = generate_synthetic_data(1024)
print('data =\n', X_train_lr[:4])
print('labels =\n', Y_train_lr[:4].todense())

data =
 [[ 2.2741797   3.56347561]
 [ 5.12873602  5.79089499]
 [ 1.3574543   5.5718112 ]
 [ 3.54340553  2.46254587]]
labels =
 [[ 1.  0.]
 [ 0.  1.]
 [ 0.  1.]
 [ 1.  0.]]


We create a small model and train it on the data. Note: this manual does not focus on training itself, only on the different ways to pass data into the trainers.

In [3]:
## Define a small model function
x = cntk.input_variable(input_dim_lr)
y = cntk.input_variable(num_classes_lr, is_sparse=True)
model = cntk.layers.Dense(num_classes_lr, activation=None)
loss = cntk.cross_entropy_with_softmax(model(x), y) # applies softmax to the model output under the hood
print(loss)

Composite(Tensor[2], SparseTensor[2]) -> Tensor[1]


### Feed NumPy data from memory

In [7]:
learner = cntk.sgd(model.parameters,
                   cntk.learning_rate_schedule(0.1, cntk.UnitType.minibatch))
progress_writer = cntk.logging.ProgressPrinter(0)

loss.train((X_train_lr, Y_train_lr), parameter_learners=[learner],
                   callbacks=[progress_writer])

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.1
     0.17       0.17          0          0            32
    0.192      0.203          0          0            96
    0.197        0.2          0          0           224
    0.206      0.215          0          0           480
     0.21      0.213          0          0           992
    0.208      0.206          0          0          2016
    0.208      0.209          0          0          4064
    0.214      0.219          0          0          8160
    0.216      0.219          0          0         16352


{'epoch_summaries': [{'loss': 0.213730029296875,
   'metric': 0.0,
   'samples': 20000}],
 'updates': [{'loss': 0.1700250208377838, 'metric': 0.0, 'samples': 32},
  {'loss': 0.20306728780269623, 'metric': 0.0, 'samples': 64},
  {'loss': 0.19989347457885742, 'metric': 0.0, 'samples': 128},
  {'loss': 0.2151442915201187, 'metric': 0.0, 'samples': 256},
  {'loss': 0.213142529129982, 'metric': 0.0, 'samples': 512},
  {'loss': 0.20606712996959686, 'metric': 0.0, 'samples': 1024},
  {'loss': 0.20870935916900635, 'metric': 0.0, 'samples': 2048},
  {'loss': 0.21932090818881989, 'metric': 0.0, 'samples': 4096},
  {'loss': 0.21873274445533752, 'metric': 0.0, 'samples': 8192}]}

Note: We have not specified a function for the `metric`. Hence, all metric values are reported as 0.0.

## Large data: feeding data using the `MinibatchSource` class

In deep learning, we often deal with large amounts of training data that do not fit into RAM. For this case, CNTK provides the `MinibatchSource` class, which provides:

 * A **chunked randomization algorithm** that holds only part of the data in RAM at any given time.
 * **Distributed reading** where each worker reads a different subset.
 * A **transformation pipeline** for images and image augmentation.
 * **Composability** across multiple data types (e.g. image captioning).
 * Transparent **asynchronous loading** so that the GPU does not stall while a minibatch is read/prepared.
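
To give an intuition for the first two items in the list above, here is a simplified sketch in plain Python. It is an illustration of the general idea only and not CNTK's actual algorithm: the function `chunked_shuffle` and all of its parameters are hypothetical. Chunks are visited in random order, only a small window of chunks is held in memory at once, and each worker keeps a different subset of the chunks.

```python
import random

def chunked_shuffle(num_samples, chunk_size, window_chunks,
                    worker_rank=0, num_workers=1, seed=0):
    """Yield sample indices in a partially randomized order.

    At most `window_chunks` chunks are buffered at any time.
    """
    rng = random.Random(seed)  # same seed on every worker gives the same chunk order
    chunk_starts = list(range(0, num_samples, chunk_size))
    rng.shuffle(chunk_starts)  # randomize the order in which chunks are loaded

    # Distributed reading: each worker keeps every num_workers-th chunk.
    my_chunks = chunk_starts[worker_rank::num_workers]

    for first in range(0, len(my_chunks), window_chunks):
        # "Load" a few chunks and shuffle the samples inside that window.
        window = []
        for start in my_chunks[first:first + window_chunks]:
            window.extend(range(start, min(start + chunk_size, num_samples)))
        rng.shuffle(window)
        for index in window:
            yield index

# Single reader: every index appears exactly once.
order = list(chunked_shuffle(num_samples=20, chunk_size=4, window_chunks=2))

# Two workers: together they cover the data without overlap.
worker0 = list(chunked_shuffle(20, 4, 2, worker_rank=0, num_workers=2))
worker1 = list(chunked_shuffle(20, 4, 2, worker_rank=1, num_workers=2))
print(sorted(order) == list(range(20)), set(worker0).isdisjoint(worker1))
```

The trade-off in such a scheme is between randomness and memory: a larger window mixes samples more thoroughly but requires more data to be resident at once.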

At present, the `MinibatchSource` class implements a limited set of data types in the form of "deserializers":

 * **Text/Basic** (`CTFDeserializer`).
 * **Images** (`ImageDeserializer`).
 * **Speech files** (`HTKFeatureDeserializer`, `HTKMLFDeserializer`).

In this section we will illustrate the use of CNTK's **canonical text format (CTF)**, which consists of a set of named feature channels, each containing a one-dimensional sparse or dense sequence per example. The `CTFDeserializer` then associates each feature channel with an input of your model or criterion. One can convert NumPy arrays to the CTF format using the [sequence to CTF](https://www.cntk.ai/pythondocs/cntk.io.html#cntk.io.sequence_to_cntk_text_format) converter.

Here we will read MNIST data using the CTF deserializer. A detailed outline of the CTF format is available in the CNTK 103A tutorial, and the deserialization features are illustrated in the CNTK 103B tutorial. We reuse the code from those tutorials to demonstrate the use of CTF file readers.

#### MNIST data reading using CTF deserializer

In this tutorial, we use the MNIST data that you downloaded using the CNTK_103A_MNIST_DataLoader notebook. The dataset has 60,000 training images and 10,000 test images, with each image being 28 x 28 pixels. Thus the number of features is equal to 784 (= 28 x 28 pixels), 1 per pixel. The variable `mnist_num_output_classes` is set to 10, corresponding to the number of digits (0-9) in the dataset.

The data is in the following format:

    |labels 0 0 0 1 0 0 0 0 0 0 |features 0 0 0 0 ... 
                                                  (784 integers each representing a pixel)
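
The line layout shown above can be produced and read back with a few lines of plain Python. This is only a sketch for dense fields: the helper names `to_ctf_line` and `parse_ctf_line` are hypothetical, and the sketch ignores other features of the CTF format such as sparse entries.

```python
def to_ctf_line(label, pixels, num_classes=10):
    """Format one example as a CTF line with a one-hot `labels` field."""
    one_hot = ["1" if k == label else "0" for k in range(num_classes)]
    return "|labels {} |features {}".format(
        " ".join(one_hot), " ".join(str(p) for p in pixels))

def parse_ctf_line(line):
    """Split a dense CTF line into {field name: list of float values}."""
    fields = {}
    for chunk in line.split("|"):
        parts = chunk.split()
        if parts:  # skip the empty piece before the first '|'
            fields[parts[0]] = [float(v) for v in parts[1:]]
    return fields

# Only 4 pixels instead of 784, to keep the example short.
line = to_ctf_line(label=3, pixels=[0, 0, 255, 17])
print(line)
parsed = parse_ctf_line(line)
print(parsed["labels"].index(1.0))  # position of the 1 in the one-hot label
```

The field names after each `|` are what a stream definition refers to through its `field` argument.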
    
In this tutorial we are going to use the image pixels, which correspond to the integer stream named "features". We define a `create_reader` function to read the training and test data using the [CTF deserializer](https://cntk.ai/pythondocs/cntk.io.html?highlight=ctfdeserializer#cntk.io.CTFDeserializer). The labels are [1-hot encoded](https://en.wikipedia.org/wiki/One-hot). Refer to the CNTK 103A tutorial for data format visualizations.

For image data, the `ImageDeserializer` is very useful. In computer vision, one often augments the data set by transforming the input images, for example with random cropping or affine transforms. Please refer to the CNTK 201 (A & B) tutorials for an end-to-end use of the deserializers.

In [13]:
# Ensure the training and test data is generated and available for this tutorial.
# We search in two locations in the toolkit for the cached MNIST data set.
data_found = False

for data_dir in [os.path.join("..", "Examples", "Image", "DataSets", "MNIST"),
                 os.path.join("data", "MNIST")]:
    train_file = os.path.join(data_dir, "Train-28x28_cntk_text.txt")
    test_file = os.path.join(data_dir, "Test-28x28_cntk_text.txt")
    if os.path.isfile(train_file) and os.path.isfile(test_file):
        data_found = True
        break
        
if not data_found:
    raise ValueError("Please generate the data by completing CNTK 103 Part A")
    
print("Data directory is {0}".format(data_dir))

Data directory is ..\Examples\Image\DataSets\MNIST


In [14]:
# Define the data dimensions
mnist_input_dim = 784
mnist_num_output_classes = 10

# Read a CTF formatted text file (as described above) using the CTF deserializer
def create_reader(path, is_training, input_dim, num_label_classes):

    # Map the named fields in the file to dense streams of the given shapes
    label_stream = cntk.io.StreamDef(field='labels', shape=num_label_classes, is_sparse=False)
    feature_stream = cntk.io.StreamDef(field='features', shape=input_dim, is_sparse=False)

    deserializer = cntk.io.CTFDeserializer(path, cntk.io.StreamDefs(labels=label_stream, features=feature_stream))

    # Randomize and repeat indefinitely for training; make a single ordered pass for testing
    return cntk.io.MinibatchSource(deserializer,
       randomize=is_training, max_sweeps=cntk.io.INFINITELY_REPEAT if is_training else 1)

Note the use of the `cntk.io.MinibatchSource` class, which wraps the `CTFDeserializer`, enables randomization, and lets you specify the number of sweeps to make through the data.

In [55]:
# Create a MNIST LR model
## Define a small model function

x = cntk.input_variable(mnist_input_dim)
y = cntk.input_variable(mnist_num_output_classes)
model = cntk.layers.Dense(mnist_num_output_classes, activation=None)
z = model(x/255.0) #scale the input to 0-1 range
loss = cntk.cross_entropy_with_softmax(z, y)
learner = cntk.sgd(z.parameters,
                   cntk.learning_rate_schedule(0.05, cntk.UnitType.minibatch))

### Feeding data into `training_session`

In the code below, we use the `training_session` class to accept data from the reader (a `cntk.io.MinibatchSource`).

In [56]:
reader_train = create_reader(train_file, True, mnist_input_dim, mnist_num_output_classes)
minibatch_size = 64
input_map = {x : reader_train.streams.features, y : reader_train.streams.labels }
num_samples_per_sweep = 60000
num_sweeps_to_train_with = 10


progress_writer = cntk.logging.ProgressPrinter(0)
trainer = cntk.Trainer(z, (loss, None), learner, progress_writer)

cntk.train.training_session(
        trainer=trainer,
        mb_source = reader_train,
        mb_size = minibatch_size,
        model_inputs_to_streams = input_map,
        max_samples = num_samples_per_sweep * num_sweeps_to_train_with,
    ).train()

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.05
      2.4        2.4          0          0            64
     2.32       2.28          0          0           192
     2.24       2.18          0          0           448
     2.07       1.92          0          0           960
     1.81       1.57          0          0          1984
     1.49       1.17          0          0          4032
     1.16      0.845          0          0          8128
    0.902      0.643          0          0         16320
    0.704      0.506          0          0         32704
    0.563      0.422          0          0         65472
    0.466      0.369          0          0        131008
    0.398      0.331          0          0        262080
    0.351      0.303          0          0        524224


### Feeding data with full control over each minibatch update

This is the most granular method for feeding data into the trainer. Note that the `cntk.Trainer` class provides access to properties such as `previous_minibatch_loss_average` and `previous_minibatch_evaluation_average`, which give fine-grained control over how the training proceeds.

In [58]:
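# `model` is the same layer object trained above, so z1 shares its parameters
z1 = model(x/255.0) #scale the input to 0-1 range
loss = cntk.cross_entropy_with_softmax(z1, y)
learner = cntk.sgd(z1.parameters,
                   cntk.learning_rate_schedule(0.05, cntk.UnitType.minibatch))

# Integer division: the number of whole minibatches in the requested samples
num_minibatches_to_train = (num_samples_per_sweep * num_sweeps_to_train_with) // minibatch_size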

progress_writer = cntk.logging.ProgressPrinter(0)
trainer = cntk.Trainer(z1, (loss, None), learner, progress_writer)

for i in range(0, int(num_minibatches_to_train)):
    
    # Read a mini batch from the training data file
    data = reader_train.next_minibatch(minibatch_size, input_map = input_map)
    
    trainer.train_minibatch(data)

 average      since    average      since      examples
    loss       last     metric       last              
 ------------------------------------------------------
Learning rate per minibatch: 0.05
    0.177      0.177          0          0            64
    0.136      0.115          0          0           192
    0.177      0.208          0          0           448
    0.252      0.318          0          0           960
    0.256      0.259          0          0          1984
    0.262      0.268          0          0          4032
    0.268      0.274          0          0          8128
    0.271      0.274          0          0         16320
    0.272      0.273          0          0         32704
    0.272      0.272          0          0         65472
    0.272      0.272          0          0        131008
     0.27      0.269          0          0        262080
    0.269      0.267          0          0        524224
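
One thing an explicit loop makes possible is reacting to the loss after every update, for example stopping early once the loss stops improving. The sketch below shows the control flow in plain Python. The function `train_until_converged` and its arguments are hypothetical: `next_minibatch` stands in for a call such as `reader_train.next_minibatch(...)`, and `train_step` stands in for a function that calls `trainer.train_minibatch(data)` and returns the minibatch loss.

```python
def train_until_converged(next_minibatch, train_step, max_minibatches,
                          patience=3, min_improvement=1e-3):
    """Run a manual minibatch loop and stop early once the loss plateaus.

    Returns (number of minibatches processed, best loss seen).
    """
    best_loss = float("inf")
    stale = 0  # consecutive minibatches without a meaningful improvement
    for i in range(max_minibatches):
        loss = train_step(next_minibatch())
        if loss < best_loss - min_improvement:
            best_loss, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            return i + 1, best_loss
    return max_minibatches, best_loss

# Stand-in "training": the loss halves each step until it bottoms out at 0.25.
losses = iter([2.0, 1.0, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25])
steps, best = train_until_converged(lambda: None, lambda data: next(losses),
                                    max_minibatches=8)
print(steps, best)
```

Real minibatch losses are noisy, so a practical version would probably compare a running average rather than individual values.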
