# Using AWS for CRISP-DM Phases 3-5: Data Preparation, Modeling, and Evaluation

Now that you are inside a Jupyter Notebook, we assume most of you are in familiar territory, so this tutorial will not go into detail about these three phases. Instead, we'll move through them quickly, with a focus on productionizing the code in Sagemaker. The next steps provide more detail on how to deploy real-time models using Sagemaker's SDK.

Because this tutorial focuses on Sagemaker rather than on the data science itself, we'll use a common dataset, MNIST, and train an image classifier with MXNet.

## 1. Standard CRISP-DM Phases 3-5

As mentioned above, this tutorial does not focus on the data science behind the model. The training and evaluation code in this section is about 90% taken from https://mxnet.incubator.apache.org/tutorials/python/mnist.html

If you are familiar with MXNet and the standard training and evaluation code, feel free to jump ahead.

### Phase 3: Data Preparation

#### Load Data

In [1]:
import mxnet as mx

mnist = mx.test_utils.get_mnist()


#### Split training/test sets

In [2]:
ntrain = int(mnist['train_data'].shape[0]*0.8)
X_train = mnist['train_data'][:ntrain]
y_train = mnist['train_label'][:ntrain]
X_test = mnist['train_data'][ntrain:]
y_test = mnist['train_label'][ntrain:]
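
The split above keeps the first 80% of the MNIST training arrays for training and holds out the remaining 20%. As a minimal sketch of the same slicing on synthetic data (the `holdout_split` helper and the small random arrays are illustrative assumptions, not part of the original notebook):

```python
import numpy as np


def holdout_split(data, labels, train_fraction=0.8):
    """Split arrays into a leading train part and a trailing holdout part (no shuffling)."""
    n_train = int(data.shape[0] * train_fraction)
    return (data[:n_train], labels[:n_train]), (data[n_train:], labels[n_train:])


# Synthetic stand-in for MNIST: 1,000 single-channel 28x28 images with 10 classes.
rng = np.random.default_rng(0)
fake_images = rng.random((1000, 1, 28, 28), dtype=np.float32)
fake_labels = rng.integers(0, 10, size=1000)

(train_x, train_y), (holdout_x, holdout_y) = holdout_split(fake_images, fake_labels)
print(train_x.shape, holdout_x.shape)
```

Because the slices are taken in order, this approach relies on the data not being sorted by label. If that is in doubt, shuffling the indices before slicing would be a safer variant.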

### Phase 4: Model Training

#### Define Network

In [3]:
data = mx.sym.var('data')
# first conv layer
conv1 = mx.sym.Convolution(data=data, kernel=(5,5), num_filter=20)
tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))
# second conv layer
conv2 = mx.sym.Convolution(data=pool1, kernel=(5,5), num_filter=50)
tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))
# first fullc layer
flatten = mx.sym.flatten(data=pool2)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
# softmax loss
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')
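
To see how the layers above fit together, it can help to trace the spatial size of a 28x28 MNIST image through the network. The sketch below is plain Python arithmetic, not MXNet code. It assumes the convolutions use stride 1 with no padding and that pooling uses the "valid" convention, which we believe are the MXNet defaults when those arguments are omitted. The `conv_out` helper is a hypothetical name introduced for illustration.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or 'valid' pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


size = 28                                  # MNIST images are 28x28
size = conv_out(size, kernel=5)            # conv1: 5x5 kernel, stride 1
size = conv_out(size, kernel=2, stride=2)  # pool1: 2x2 max pool, stride 2
size = conv_out(size, kernel=5)            # conv2: 5x5 kernel, stride 1
size = conv_out(size, kernel=2, stride=2)  # pool2: 2x2 max pool, stride 2

num_filters_conv2 = 50                     # num_filter of the second conv layer
flatten_features = num_filters_conv2 * size * size
print(size, flatten_features)
```

Under these assumptions the feature map shrinks from 28 to 24, 12, 8, and finally 4 pixels per side, so the flatten layer passes 50 * 4 * 4 = 800 values per image to the first fully connected layer.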

#### Train Network

In [5]:
# define training batch size
batch_size = 100

# create iterators around the training and validation data
train_iter = mx.io.NDArrayIter(X_train, y_train, batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(X_test, y_test, batch_size)

# create a trainable module
# set context to mx.cpu() on an ml.c-family notebook instance or mx.gpu() on an ml.p-family instance
lenet_model = mx.mod.Module(symbol=lenet, context=mx.cpu())
# train with SGD for 10 epochs, reporting accuracy on the validation data
lenet_model.fit(train_iter,
                eval_data=val_iter,
                optimizer='sgd',
                optimizer_params={'learning_rate':0.1},
                eval_metric='acc',
                batch_end_callback = mx.callback.Speedometer(batch_size, 100),
                num_epoch=10)

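
The `eval_metric='acc'` argument in the training call asks for classification accuracy on the validation data after each epoch. The sketch below is a NumPy illustration of what that metric measures, not MXNet's own implementation. The `accuracy` helper and the sample probabilities are made up for illustration.

```python
import numpy as np


def accuracy(probabilities, labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    predicted = np.argmax(probabilities, axis=1)
    return float(np.mean(predicted == labels))


# Made-up softmax outputs for 4 samples and 3 classes (illustrative only).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

print(accuracy(probs, true_labels))
```

Here three of the four predictions match their labels, so the reported accuracy is 0.75.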

As you have probably noticed by now, this step can take a while because the model is being trained locally on the instance that hosts Jupyter.

Instead, since we know that this code works, let's go to the next step and refactor the above code to use the Sagemaker SDK. This lets us use Sagemaker's distributed training capabilities to reduce training time.

In the instructions, please move on to Part2: Model Development.