# Getting started: Preprocess the MNIST dataset to CSV or TFRecords format
---

In this notebook, you will learn how to preprocess the MNIST dataset. After finishing it, you will be ready to learn how to run distributed training with TensorFlowOnSpark.

### In this notebook we are going to learn how to:
- Download the MNIST dataset to your local machine
- Upload the MNIST dataset to your project in HopsWorks
- Make a .zip containing the MNIST dataset accessible to the notebook
- Convert the MNIST dataset to .csv or .tfrecords format



## 1. Download mnist to your machine and create a .zip

```bash
mkdir ./mnist
cd mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
```
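
A failed download can leave a truncated file or an error page under the expected file name. Before zipping, it can help to check that each file really is a gzipped IDX file. The sketch below reads the 8-byte IDX header and compares its magic number (2051 for image files, 2049 for label files). The helper names `read_idx_header` and `check_mnist_file` are hypothetical and not part of any library used in this notebook.

```python
import gzip
import struct

# Magic numbers defined by the IDX file format used for MNIST.
IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension

def read_idx_header(path):
    """Return (magic, item_count) from the first 8 bytes of a gzipped IDX file."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
    return magic, count

def check_mnist_file(path, expected_magic):
    """Hypothetical helper: True if the file starts with the expected magic number."""
    magic, _ = read_idx_header(path)
    return magic == expected_magic
```

For example, `check_mnist_file("train-images-idx3-ubyte.gz", IMAGE_MAGIC)` should return `True` for a correctly downloaded training image file. A file that is not gzip-compressed raises an exception when it is read.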

## 2. Upload zip to project in HopsWorks

First, navigate to the Datasets view in HopsWorks.

![image11-Dataset-ProjectPath.png](../images/datasets.png)


Then create a new dataset named mnist by clicking Create New DataSet, and upload the mnist.zip file to it.

![upload.png](../images/upload.png)

## 3. Path in HDFS where mnist.zip is stored

All datasets in your project are stored in HDFS, under the project's HDFS root folder.
You can find your root folder by evaluating the following Python code:

```python
from hops import hdfs
project_path = "/Projects/" + hdfs.project_name()
```


### Approach 1. Get the path programmatically

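Append the name of the mnist dataset you created, followed by the name of the mnist.zip file inside it, to the project path:

```python
from hops import hdfs
project_path = "/Projects/" + hdfs.project_name()
# Note the leading slash: project_path does not end with one
mnist_zip_path = project_path + "/mnist/mnist.zip"
```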
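
Plain string concatenation makes it easy to drop a path separator. As an alternative, the following minimal sketch builds the same kind of path with `posixpath.join`, which inserts the slashes itself. The function name `hdfs_dataset_path` is hypothetical. The project name is passed in as a plain string so that the example runs outside HopsWorks; inside a notebook you would pass `hdfs.project_name()`.

```python
import posixpath

def hdfs_dataset_path(project_name, dataset, filename):
    """Build /Projects/<project>/<dataset>/<filename> with one slash between parts."""
    return posixpath.join("/Projects", project_name, dataset, filename)

# "bonus_lab2" is the project name that appears in this notebook's output.
print(hdfs_dataset_path("bonus_lab2", "mnist", "mnist.zip"))
# /Projects/bonus_lab2/mnist/mnist.zip
```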

### Approach 2. In the HopsWorks UI

Navigate into the mnist dataset, select mnist.zip, and copy the path.

![upload.png](../images/dataset_path.png)


## 4. Restart Jupyter and upload the mnist.zip

When the Jupyter notebook server starts, you can supply dependencies that your program needs at runtime. These can be .jar files, archives such as .zip or .tgz that are unpacked automatically, or one or more .py files, optionally packaged in a .zip or .egg file.

To demonstrate this, restart Jupyter and add mnist.zip as an archive in the Jupyter configuration.
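
The code later in this notebook reads the unpacked files from a relative directory, so it helps to know what that directory is called. The sketch below assumes that archives are handed to Spark on YARN in the same way as `spark-submit --archives`. Under that convention, an optional `#alias` suffix names the directory the archive is unpacked into, and without an alias the archive's file name is used. The function `archive_link_name` is a hypothetical illustration, not a HopsWorks or Spark API.

```python
import posixpath

def archive_link_name(spec):
    """Return the directory name under which an archive spec is exposed.

    'path/to/mnist.zip#fashion' -> 'fashion'
    'path/to/mnist.zip'         -> 'mnist.zip'
    """
    path, sep, alias = spec.partition("#")
    # Use the alias when one is given, otherwise fall back to the file name.
    return alias if sep and alias else posixpath.basename(path)

print(archive_link_name("/Projects/bonus_lab2/mnist/mnist.zip#fashion"))  # fashion
print(archive_link_name("/Projects/bonus_lab2/mnist/mnist.zip"))          # mnist.zip
```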


## The conversion python code

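The following cell defines the functions that convert the MNIST files and write them to HDFS, together with a reader for verifying the written output.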

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy
import tensorflow as tf
from array import array
from tensorflow.contrib.learn.python.learn.datasets import mnist
from hops import hdfs

def toTFExample(image, label):
    """Serializes an image/label as a TFExample byte string"""
    example = tf.train.Example(
      features = tf.train.Features(
        feature = {
          'label': tf.train.Feature(int64_list=tf.train.Int64List(value=label.astype("int64"))),
          'image': tf.train.Feature(int64_list=tf.train.Int64List(value=image.astype("int64")))
        }
      )
    )
    return example.SerializeToString()

def fromTFExample(bytestr):
    """Deserializes a TFExample from a byte string"""
    example = tf.train.Example()
    example.ParseFromString(bytestr)
    return example

def toCSV(vec):
    """Converts a vector/array into a CSV string"""
    return ','.join([str(i) for i in vec])

def fromCSV(s):
    """Converts a CSV string to a vector/array"""
    return [float(x) for x in s.split(',') if len(x) > 0]

def writeMNIST(sc, input_images, input_labels, output, format, num_partitions):
    """Writes MNIST image/label vectors into parallelized files on HDFS"""
    # load MNIST gzip into memory
    with open(input_images, 'rb') as f:
        images = numpy.array(mnist.extract_images(f))

    with open(input_labels, 'rb') as f:
        if format == "csv2":
            labels = numpy.array(mnist.extract_labels(f, one_hot=False))
        else:
            labels = numpy.array(mnist.extract_labels(f, one_hot=True))

    shape = images.shape
    print("images.shape: {0}".format(shape))          # e.g. (60000, 28, 28, 1)
    print("labels.shape: {0}".format(labels.shape))   # (60000, 10) when one-hot, (60000,) otherwise

    # create RDDs of vectors
    imageRDD = sc.parallelize(images.reshape(shape[0], shape[1] * shape[2]), num_partitions)
    labelRDD = sc.parallelize(labels, num_partitions)

    output_images = output + "/images"
    output_labels = output + "/labels"

    # save RDDs as specific format
    if format == "pickle":
        imageRDD.saveAsPickleFile(output_images)
        labelRDD.saveAsPickleFile(output_labels)
    elif format == "csv":
        imageRDD.map(toCSV).saveAsTextFile(output_images)
        labelRDD.map(toCSV).saveAsTextFile(output_labels)
    elif format == "csv2":
        imageRDD.map(toCSV).zip(labelRDD).map(lambda x: str(x[1]) + "|" + x[0]).saveAsTextFile(output)
    else: # format == "tfr":
        tfRDD = imageRDD.zip(labelRDD).map(lambda x: (bytearray(toTFExample(x[0], x[1])), None))
        # requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar
        tfRDD.saveAsNewAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
                                    keyClass="org.apache.hadoop.io.BytesWritable",
                                    valueClass="org.apache.hadoop.io.NullWritable")
#  Note: this creates TFRecord files w/o requiring a custom Input/Output format
#  else: # format == "tfr":
#    def writeTFRecords(index, iter):
#      output_path = "{0}/part-{1:05d}".format(output, index)
#      writer = tf.python_io.TFRecordWriter(output_path)
#      for example in iter:
#        writer.write(example)
#      return [output_path]
#    tfRDD = imageRDD.zip(labelRDD).map(lambda x: toTFExample(x[0], x[1]))
#    tfRDD.mapPartitionsWithIndex(writeTFRecords).collect()

def readMNIST(sc, output, format):
    """Reads/verifies previously created output"""

    output_images = output + "/images"
    output_labels = output + "/labels"
    imageRDD = None
    labelRDD = None

    if format == "pickle":
        imageRDD = sc.pickleFile(output_images)
        labelRDD = sc.pickleFile(output_labels)
    elif format == "csv":
        imageRDD = sc.textFile(output_images).map(fromCSV)
        labelRDD = sc.textFile(output_labels).map(fromCSV)
    else: # format.startswith("tf"):
        # requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar
        tfRDD = sc.newAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileInputFormat",
                              keyClass="org.apache.hadoop.io.BytesWritable",
                              valueClass="org.apache.hadoop.io.NullWritable")
        # bytes() yields the raw record on both Python 2 and Python 3
        imageRDD = tfRDD.map(lambda x: fromTFExample(bytes(x[0])))

    num_images = imageRDD.count()
    num_labels = labelRDD.count() if labelRDD is not None else num_images
    samples = imageRDD.take(10)
    print("num_images: ", num_images)
    print("num_labels: ", num_labels)
    print("samples: ", samples)

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 1120 | application_1511276242554_0456 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.
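
To make the data layout used by `writeMNIST` concrete, here is a small local sketch that needs no Spark. It flattens fake images of shape 28 x 28 x 1 into rows of 784 pixels, builds one-hot labels, and serializes them with the same `toCSV` helper. The sample arrays and the `uint8` dtype are assumptions chosen for readable output. The real label extractor may return a different dtype, such as floats.

```python
import numpy

def toCSV(vec):
    """Same helper as in the conversion code: join values with commas."""
    return ','.join([str(i) for i in vec])

# Two fake 28x28x1 "images" laid out like the arrays that writeMNIST prints.
images = numpy.zeros((2, 28, 28, 1), dtype=numpy.uint8)
images[1, 0, 0, 0] = 255
shape = images.shape

# Flatten each image into one row of 784 pixels, as writeMNIST does.
flat = images.reshape(shape[0], shape[1] * shape[2])

# One-hot labels: digit d becomes a length-10 vector with a 1 at index d.
digits = numpy.array([3, 7])
one_hot = numpy.eye(10, dtype=numpy.uint8)[digits]

print(flat.shape)          # (2, 784)
print(toCSV(one_hot[0]))   # 0,0,0,1,0,0,0,0,0,0
print(toCSV(flat[1])[:7])  # 255,0,0
```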


## Setting arguments and running the conversion

In [2]:
import argparse

from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from hops import hdfs

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--format", help="output format", choices=["csv","csv2","pickle","tf","tfr"], default="csv")
parser.add_argument("-n", "--num-partitions", help="Number of output partitions", type=int, default=10)
parser.add_argument("-o", "--output", help="HDFS directory to save examples in parallelized format", default="/Projects/" + hdfs.project_name() + '/mnist')
parser.add_argument("-r", "--read", help="read previously saved examples", action="store_true")
parser.add_argument("-v", "--verify", help="verify saved examples after writing", action="store_true")

args = parser.parse_args()
print("args:",args)

sc = spark.sparkContext

if not args.read:
    # Note: these files are inside the mnist.zip archive.
    # See https://stackoverflow.com/questions/41498365/upload-zip-file-using-archives-option-of-spark-submit-on-yarn#41544273
    # The archive was added with the alias "#fashion", so its contents are unpacked under fashion/.
    mnist_path = "fashion/"
    writeMNIST(sc, mnist_path+"train-images-idx3-ubyte.gz",mnist_path+"train-labels-idx1-ubyte.gz", args.output + "/train", args.format, args.num_partitions)
    writeMNIST(sc, mnist_path+"t10k-images-idx3-ubyte.gz", mnist_path+"t10k-labels-idx1-ubyte.gz", args.output + "/test", args.format, args.num_partitions)

if args.read or args.verify:
    readMNIST(sc, args.output + "/train", args.format)


('args:', Namespace(format='csv', num_partitions=10, output='/Projects/bonus_lab2/mnist', read=False, verify=False))
Extracting fashion/train-images-idx3-ubyte.gz
Extracting fashion/train-labels-idx1-ubyte.gz
images.shape: (60000, 28, 28, 1)
labels.shape: (60000, 10)
Extracting fashion/t10k-images-idx3-ubyte.gz
Extracting fashion/t10k-labels-idx1-ubyte.gz
images.shape: (10000, 28, 28, 1)
labels.shape: (10000, 10)

<open file 'fashion/t10k-images-idx3-ubyte.gz', mode 'rb' at 0x7fac486c26f0>
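
With the default `csv` format, the images and labels directories each contain text files with one comma-separated vector per line. As a rough sketch of how a consumer might decode such lines, the helpers below turn an image line back into a 28 x 28 array and a one-hot label line back into a digit. The names `decode_image_line` and `decode_label_line` are hypothetical, and the sample lines are made up for illustration rather than taken from the real output.

```python
import numpy

def decode_image_line(line):
    """Parse one line of the images output into a 28x28 array."""
    pixels = numpy.array([float(x) for x in line.split(',') if x])
    return pixels.reshape(28, 28)

def decode_label_line(line):
    """Parse one line of the one-hot labels output into a digit."""
    one_hot = [float(x) for x in line.split(',') if x]
    return int(numpy.argmax(one_hot))

# Made-up sample lines in the same layout as the csv output.
image_line = ','.join(['0'] * 783 + ['255'])
label_line = '0,0,0,0,0,1,0,0,0,0'

print(decode_image_line(image_line).shape)  # (28, 28)
print(decode_label_line(label_line))        # 5
```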