# Using AWS for CRISP-DM Phases 3-5: Data Preparation, Modeling, and Evaluation

Now that you are inside a Jupyter Notebook, we assume that most of you are in familiar territory, so this tutorial does not cover these phases in detail. Instead, we'll move through the three phases quickly, with a focus on productionizing this code in SageMaker. The next steps explain in more detail how to deploy real-time models using the SageMaker SDK.

Because this tutorial focuses on SageMaker rather than on the data science, we'll use a common dataset, MNIST, and train an image classifier using MXNet.

## 1. Standard CRISP-DM Phases 3-5

As mentioned above, this tutorial does not focus on the data science behind the model. The training and evaluation code in this section is taken almost entirely (about 99%) from https://mxnet.incubator.apache.org/tutorials/python/mnist.html

If you are familiar with MXNet and the standard training and evaluation code, feel free to jump ahead.

### Phase 3: Data Preparation

#### Load Data

In [1]:
import mxnet as mx
import numpy as np
import boto3


In [2]:
bucket_name = 'jakechenawspublic'
key_name = 'sample_data/mnist/train/mnist_train.csv'

s3 = boto3.client('s3')
s3.download_file(bucket_name, key_name, 'mnist_train.csv')

# Column 0 holds the label; the remaining 784 columns hold the 28x28 pixel values
mnist_train = np.loadtxt('mnist_train.csv', delimiter=',')
X_train = mnist_train[:, 1:].reshape(-1, 1, 28, 28)
y_train = mnist_train[:, 0]

batch_size = 100
# Hold out the last 1,000 rows for validation
train_iter = mx.io.NDArrayIter(X_train[:-1000], y_train[:-1000], batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(X_train[-1000:], y_train[-1000:], batch_size)
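
To make the slicing and reshaping in this cell easier to follow, here is a minimal sketch on a tiny stand-in array. The 2x2 "images", the sample values, and the hold-out size of 2 are assumptions chosen for illustration only; the real file has one label plus 784 pixel columns per row.

```python
import numpy as np

# Hypothetical stand-in for the CSV: 5 rows, each with a label followed by
# 4 pixel values (a 2x2 "image" instead of MNIST's 28x28).
rows = np.array([
    [7, 0, 10, 20, 30],
    [2, 1, 11, 21, 31],
    [1, 2, 12, 22, 32],
    [0, 3, 13, 23, 33],
    [4, 4, 14, 24, 34],
], dtype=float)

labels = rows[:, 0]                    # first column: class labels, shape (5,)
pixels = rows[:, 1:]                   # remaining columns: flat pixel values
images = pixels.reshape(-1, 1, 2, 2)   # (samples, channels, height, width)

holdout = 2                            # the tutorial holds out 1000 rows instead
train_images, val_images = images[:-holdout], images[-holdout:]
train_labels, val_labels = labels[:-holdout], labels[-holdout:]

print(images.shape, train_images.shape, val_images.shape)
```

The same pattern applies to the real data with `reshape(-1, 1, 28, 28)`, where the single channel reflects that MNIST images are grayscale.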

### Phase 4: Model Training

#### Define Network

In [3]:
data = mx.sym.var('data')
# first conv layer
conv1 = mx.sym.Convolution(data=data, kernel=(5,5), num_filter=20)
tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))
# second conv layer
conv2 = mx.sym.Convolution(data=pool1, kernel=(5,5), num_filter=50)
tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))
# first fully connected layer
flatten = mx.sym.flatten(data=pool2)
fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=500)
tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")
# second fully connected layer (one output per digit class)
fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
# softmax loss
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')
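
As a sanity check on the network definition, the feature-map sizes and parameter counts can be worked out by hand. The sketch below assumes the layer defaults that the cell relies on implicitly (convolution stride 1, no padding, bias terms enabled); `conv_out` is a helper introduced here for illustration and is not part of MXNet.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output height/width of a convolution or pooling layer without ceil rounding."""
    return (size + 2 * pad - kernel) // stride + 1

size = 28                                    # MNIST images are 28x28
size = conv_out(size, kernel=5)              # conv1: 28 -> 24
size = conv_out(size, kernel=2, stride=2)    # pool1: 24 -> 12
size = conv_out(size, kernel=5)              # conv2: 12 -> 8
size = conv_out(size, kernel=2, stride=2)    # pool2: 8 -> 4
flat_features = 50 * size * size             # 50 filters of 4x4 after flatten

# Weights plus biases per layer (assumes bias terms are enabled)
params = {
    "conv1": 20 * (1 * 5 * 5) + 20,
    "conv2": 50 * (20 * 5 * 5) + 50,
    "fc1": flat_features * 500 + 500,
    "fc2": 500 * 10 + 10,
}
total_params = sum(params.values())

print(flat_features, total_params)
```

Most of the parameters sit in the first fully connected layer, which is typical for this LeNet-style architecture.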

#### Train Network

In [4]:
# create a trainable module on the CPU
lenet_model = mx.mod.Module(symbol=lenet, context=mx.cpu()) # change to mx.gpu() if using ml.p2.xlarge
# train with SGD (learning rate 0.1) for 10 epochs, validating on val_iter
lenet_model.fit(train_iter,
                eval_data=val_iter,
                optimizer='sgd',
                optimizer_params={'learning_rate':0.1},
                eval_metric='acc',
                batch_end_callback = mx.callback.Speedometer(batch_size, 100),
                num_epoch=10)
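
To get a feel for how much work this `fit` call does, the batch arithmetic can be sketched as follows. The row count of 60,000 is an assumption (the size of the standard MNIST training set); the tutorial does not state how many rows `mnist_train.csv` contains.

```python
import math

num_rows = 60_000     # ASSUMPTION: rows in mnist_train.csv (standard MNIST size)
holdout = 1_000       # rows reserved for validation in the data cell
batch_size = 100
frequent = 100        # second argument passed to mx.callback.Speedometer
num_epoch = 10

train_samples = num_rows - holdout
batches_per_epoch = math.ceil(train_samples / batch_size)
samples_between_logs = batch_size * frequent   # Speedometer reports about this often
total_updates = batches_per_epoch * num_epoch  # one SGD step per batch

print(batches_per_epoch, samples_between_logs, total_updates)
```

Under that assumption, the speed report appears roughly every 10,000 training samples, and the full run performs several thousand SGD updates on a single instance.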

#### Test Network

In [5]:
bucket_name = 'jakechenawspublic'
key_name = 'sample_data/mnist/test/mnist_test.csv'

s3 = boto3.client('s3')
s3.download_file(bucket_name, key_name, 'mnist_test.csv')

mnist_test = np.loadtxt('mnist_test.csv', delimiter=',')
X_test = mnist_test[:, 1:].reshape(-1, 1, 28, 28)
y_test = mnist_test[:, 0]

test_iter = mx.io.NDArrayIter(X_test, y_test, batch_size)

# compute test-set accuracy for lenet
acc = mx.metric.Accuracy()
lenet_model.score(test_iter, acc)
print(acc)
assert acc.get()[1] > 0.98

EvalMetric: {'accuracy': 0.98870000000000002}
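
For readers less familiar with the metric, classification accuracy is the fraction of samples whose highest-scoring class matches the label. The following is a plain-Python sketch of that idea, not MXNet's implementation; the function name and the sample values are made up for illustration.

```python
def accuracy(probabilities, labels):
    """Fraction of rows whose highest-probability class equals the label."""
    correct = 0
    for row, label in zip(probabilities, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)

# Hypothetical softmax outputs for three samples and two classes
sample_probs = [
    [0.1, 0.9],   # predicts class 1
    [0.8, 0.2],   # predicts class 0
    [0.3, 0.7],   # predicts class 1
]
sample_labels = [1, 0, 0]

sample_accuracy = accuracy(sample_probs, sample_labels)
print(sample_accuracy)
```

Here two of the three predictions match their labels, so the sketch reports an accuracy of two thirds.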


As you have probably noticed by now, this step can take a while, because the model is trained locally on the instance that currently hosts Jupyter.

Since we know that this code works, let's go to the next step and refactor it to use the SageMaker SDK. This allows us to use SageMaker's distributed training capabilities to speed up training substantially.

In the [instructions](./part0_instructions.md), please move on to 2. Model Development for SageMaker.