# 1 - Configuring the Data for our Model

In this notebook, you will walk through the process to configure the data that you will be leveraging for your model. There are additional capabilities for centralizing and managing data that will be covered in future modules.

## MNIST Data Set

We will use the industry-standard training data set [MNIST](https://azure.microsoft.com/services/open-datasets/catalog/mnist/), along with Keras and TensorFlow.

The MNIST database of handwritten digits is a common data set used when learning to solve classification problems with neural networks. The Azure Machine Learning Python SDK includes the MNIST data as one of its open data sets, which makes it easy to include in our project.

> "The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting."

You can see some sample images from this data set in the image below:

![](mnist_data.png)

If you want to see which other data sets can be easily included in your project with the SDK, you can [see the list here](https://docs.microsoft.com/en-us/python/api/azureml-opendatasets/azureml.opendatasets?view=azure-ml-py#classes).


## Module

Within this module, you will:

- Set up a notebook server in Azure Machine Learning Studio
- Download and analyze the MNIST data set

## Import Python Packages

You will first need to import the following Python packages:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os

The following packages will be needed from the Azure ML Python SDK:

In [None]:
import azureml.core
from azureml.core import Workspace
from azureml.core import Dataset
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.opendatasets import MNIST
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Workspace

If you are running this notebook as a part of your Azure Machine Learning Studio workspace, you can get a reference to your workspace by simply running the following code:

In [None]:
ws = Workspace.from_config()
print("Azure ML Workspace")
print(f'Name: {ws.name}')
print(f'Location: {ws.location}')
print(f'Resource Group: {ws.resource_group}')
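
`Workspace.from_config()` reads the workspace details from a `config.json` file. If the call fails, it can help to check whether such a file exists and contains the expected keys. The sketch below is a simplified approximation of that lookup using only the standard library; the function names `find_workspace_config` and `missing_config_keys` are hypothetical, and the SDK's own search order may differ.

```python
import json
import os

# Keys that a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def find_workspace_config(start_dir="."):
    """Walk upward from start_dir and return the first config.json found, or None.

    Assumption: the file lives either directly in a directory or in its
    .azureml subdirectory. This approximates, but does not replicate, the SDK.
    """
    current = os.path.abspath(start_dir)
    while True:
        candidates = (
            os.path.join(current, "config.json"),
            os.path.join(current, ".azureml", "config.json"),
        )
        for candidate in candidates:
            if os.path.isfile(candidate):
                return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding a config file.
            return None
        current = parent


def missing_config_keys(path):
    """Return the required keys that are absent from the config file at path."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in config]
```

Calling `find_workspace_config()` from the notebook's directory and passing the result to `missing_config_keys` gives a quick check before retrying `Workspace.from_config()`.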

## Provisioning Compute Resources

We will reuse the compute provisioning logic from the previous module. This creates a cluster of Standard_NC6 instances, a GPU VM size, that can scale from 0 to 4 nodes. Note that the variable is named `cpu_cluster` even though these are GPU machines.

In [None]:
# Create a name for our new cluster
cpu_cluster_name = 'tdsp-cluster'

# Verify that cluster does not exist already
try:
    cpu_cluster = AmlCompute(workspace=ws, name=cpu_cluster_name)
    print('Cluster already exists.')
    
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6',
                                                           max_nodes=4)
    cpu_cluster = AmlCompute.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

We can use the `get_status()` method on the `cpu_cluster` instance to get the details on our compute cluster.

In [None]:
status = cpu_cluster.get_status()
print(f'Nodes: {str(status.current_node_count)}')
print(f'VM Size: {str(status.vm_size)}')
print(f'Provisioning State: {str(status.provisioning_state)}')
print(f'Scale Settings: {str(status.scale_settings.serialize())}')

For more information about GPU VM instances, see the following link:

* [GPU optimized virtual machine sizes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu)

## Data Set

We first need to set up the directory that will store our data set:

In [None]:
data_folder = os.path.join(os.getcwd(), 'data')
print(f'Data folder: {data_folder}')
os.makedirs(data_folder, exist_ok=True)

Next, we will use the open dataset for MNIST that exists within the Azure ML Python SDK.  We will get a reference to the data set, and then we will download the data into the directory that we just created:

In [None]:
mnist_file_dataset = MNIST.get_file_dataset()
mnist_file_dataset.download(data_folder, overwrite=True)
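
Before loading the data, it can be useful to confirm that the download produced the files the rest of this notebook expects. The sketch below is one way to do that with the standard library; the helper name `find_missing_files` is hypothetical, and the recursive search is an assumption to cover the case where the download places files in nested directories.

```python
import glob
import os

# File names taken from the loading cells later in this notebook.
EXPECTED_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def find_missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not found anywhere under folder."""
    missing = []
    for name in expected:
        # "**" with recursive=True matches the folder itself and any subfolder.
        pattern = os.path.join(glob.escape(folder), "**", name)
        if not glob.glob(pattern, recursive=True):
            missing.append(name)
    return missing
```

Running `find_missing_files(data_folder)` after the download should return an empty list if all four archives are present.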

## Exploring the Data

To load and visualize the data, we will use functions from two files that are included with this notebook, `utils.py` and `display.py`.

In [None]:
# make sure utils.py and display.py are in the same directory as this notebook
from utils import load_data
from display import find_sample_data
from display import plot_images

Next, we need to load the train and test data (both images and labels). We will use the `load_data` function that Microsoft provides in the `utils.py` file:

In [None]:
# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
training_images = load_data(data_folder, "train-images-idx3-ubyte.gz") / 255.0
training_images = np.reshape(training_images, (-1, 28,28))
test_images = load_data(data_folder, "t10k-images-idx3-ubyte.gz") / 255.0
test_images = np.reshape(test_images, (-1, 28,28))

training_labels = load_data(data_folder, "train-labels-idx1-ubyte.gz", True).reshape(-1)
test_labels = load_data(data_folder, "t10k-labels-idx1-ubyte.gz", True).reshape(-1)

print(f'Training Images: {training_images.shape}')
print(f'Training Labels: {training_labels.shape}')
print(f'Test Images: {test_images.shape}')
print(f'Test Labels: {test_labels.shape}')
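
The `load_data` helper is not shown in this notebook. As a rough idea of what reading these files involves, the sketch below parses the IDX layout used by the MNIST archives: a 4-byte header (two zero bytes, a type code, and the number of dimensions), then one 4-byte big-endian size per dimension, then the raw values. The names `read_idx` and `scale_pixels` are hypothetical, and this is not the implementation in `utils.py`; it uses only the standard library and handles only unsigned-byte data.

```python
import gzip
import struct


def read_idx(path):
    """Read a gzip-compressed IDX file of unsigned bytes.

    Returns (shape, payload): shape is a tuple of dimension sizes and
    payload is a bytes object holding the values in row-major order.
    """
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a type code, and the number of dimensions.
        zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # Each dimension size is a 4-byte big-endian unsigned integer.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        payload = f.read()
    expected = 1
    for size in shape:
        expected *= size
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return shape, payload


def scale_pixels(payload):
    """Map byte intensities 0-255 to floats in the range 0-1."""
    return [value / 255.0 for value in payload]
```

For an MNIST image file the header describes three dimensions, the image count followed by 28 rows and 28 columns, which is why the loading code reshapes to `(-1, 28, 28)`. For the full data set, converting the payload with `np.frombuffer(payload, dtype=np.uint8)` avoids building large Python lists.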

Next, we can visualize some of the data that we have pulled in. The following code extracts sample images for each digit category and plots them. Note that, as written, it samples from the training data:

In [None]:
# find examples of the data and visualize them
extracted_test_data = find_sample_data(training_images, training_labels, 8, randomize=False)
plot_images(extracted_test_data)
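
The `find_sample_data` function lives in `display.py` and is not shown here. As an illustration of the general idea of picking a fixed number of examples per digit, the sketch below collects the first few indices seen for each label. The name `pick_samples_per_class` and its behavior are assumptions for illustration only, not the implementation in `display.py`.

```python
from collections import defaultdict


def pick_samples_per_class(labels, per_class):
    """Return {label: [indices]} with up to per_class indices for each label.

    Indices are taken in order of appearance, so the result is deterministic.
    """
    picked = defaultdict(list)
    for index, label in enumerate(labels):
        key = int(label)  # also accepts NumPy integer labels
        if len(picked[key]) < per_class:
            picked[key].append(index)
    # Sort by label so the classes come out in ascending order.
    return dict(sorted(picked.items()))
```

The returned indices could then be used to select the matching rows from an image array, for example `training_images[indices]`, before plotting them.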