# Digit Recognizer Project
## Part 1. Data Wrangling

This project applies supervised machine learning techniques to classify images from the MNIST database of handwritten digits, available at http://yann.lecun.com/exdb/mnist/. Each image is a 28 x 28 matrix that represents a handwritten digit from 0 to 9. The database is divided into a train set of 60,000 examples and a test set of 10,000 examples.

### 1. Load data files and convert from idx to numpy arrays

The database is split into four gzip files, with the images and labels stored separately for the train and test sets. The files are encoded in the IDX format (.idx3-ubyte for the images, .idx1-ubyte for the labels). The **idx2numpy** package can be used to convert the images into 28x28 numpy arrays in which the integers 0 to 255 represent the grayscale intensity of each pixel.
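
To make the format concrete, here is a minimal sketch of what a reader for it has to do. It assumes the IDX layout described on the MNIST page: a 4-byte magic number whose third byte is the data type code (0x08 for unsigned bytes) and whose fourth byte is the number of dimensions, followed by one 4-byte big-endian size per dimension, and then the raw data. The function name `parse_idx_ubyte` is hypothetical and it only handles unsigned-byte data; **idx2numpy** covers the general case and is what this project actually uses.

```python
import struct
import numpy as np

def parse_idx_ubyte(raw):
    """Parse IDX-formatted bytes holding unsigned-byte data (hypothetical helper)."""
    # Magic number: two zero bytes, a data type code, and the number of dimensions
    zero, dtype_code, ndim = struct.unpack('>HBB', raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError('not an unsigned-byte IDX file')
    # One big-endian 32-bit size per dimension
    shape = struct.unpack('>' + 'I' * ndim, raw[4:4 + 4 * ndim])
    # Everything after the header is the pixel (or label) data
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

# Tiny synthetic example: two 2x3 "images" instead of real MNIST data
header = struct.pack('>HBBIII', 0, 0x08, 3, 2, 2, 3)
demo = parse_idx_ubyte(header + bytes(range(12)))
print(demo.shape)  # (2, 2, 3)
```

For the real image files the same header would describe three dimensions (number of images, 28 rows, 28 columns), and for the label files a single dimension.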

In [1]:
import gzip
import numpy as np
import pandas as pd
import idx2numpy

def load_idx_gz(path):
    """Read a gzip-compressed idx file into a numpy array."""
    # The with block closes the file once the array has been read
    with gzip.open(path, 'rb') as f:
        return idx2numpy.convert_from_file(f)

train_image_array = load_idx_gz('./data/train-images-idx3-ubyte.gz')
train_label_array = load_idx_gz('./data/train-labels-idx1-ubyte.gz')
test_image_array = load_idx_gz('./data/t10k-images-idx3-ubyte.gz')
test_label_array = load_idx_gz('./data/t10k-labels-idx1-ubyte.gz')

### 2. Convert numpy arrays from 2d to 1d and save as pandas df

The data must next be converted into a **tidy** format table with one row per observation. 

First, I will flatten each 28x28 image array into a 1d array of 784 features. Next, I will load the flattened numpy arrays into pandas dataframes and save them as compressed CSV files to save time in future steps.
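
As a small illustration of the flattening step, the sketch below uses a synthetic array rather than the MNIST data (`images` is a made-up stand-in). It relies on numpy's default row-major order, under which the pixel at position (row, col) of an image lands in position row * 28 + col of the flattened row, so no information is lost and the image can be recovered by reshaping.

```python
import numpy as np

# Synthetic stand-in for the image array: 3 "images" of 28x28 pixels
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# Flatten every image into one row of 784 values (row-major order)
flat = images.reshape(len(images), -1)

# The pixel at (row, col) of an image ends up at index row * 28 + col
row, col = 5, 10
assert flat[1, row * 28 + col] == images[1, row, col]

# Reshaping a flat row back to 28x28 recovers the original image
restored = flat[1].reshape(28, 28)
print(flat.shape, np.array_equal(restored, images[1]))  # (3, 784) True
```

This also explains the column names used in the dataframe: column `p0` is the top-left pixel and `p783` is the bottom-right one.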

In [None]:
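def array_to_pandas(array):
    """
    Function to convert MNIST data from nparray to pandas dataframe
    :param array: numpy array of images with shape (n_images, 28, 28)
    :return df: dataframe with one row per image and one column per pixel
    """
    # Flatten each 2d image into a single 1d row of pixel values
    features = array.reshape(len(array), -1)

    # The default index already runs from 0 to n_images - 1
    df = pd.DataFrame(data=features,
                      columns=['p' + str(i) for i in range(features.shape[1])]
                     )
    return df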

# Convert arrays to pandas dfs
train_images = array_to_pandas(train_image_array)
test_images = array_to_pandas(test_image_array)

# Write pandas dfs to .csv files
train_images_out = './data/mnist_train_images.gz'
test_images_out = './data/mnist_test_images.gz'
train_images.to_csv(train_images_out, compression = 'gzip')
test_images.to_csv(test_images_out, compression = 'gzip')

train_images.head()

For simplicity, I will also write the labels to compressed CSV files.

In [None]:
# Convert arrays to pandas dfs
train_labels = pd.DataFrame(data=train_label_array,
                      index=[i for i in range(len(train_label_array))],
                      columns=['label']
                     )
test_labels = pd.DataFrame(data=test_label_array,
                      index=[i for i in range(len(test_label_array))],
                      columns=['label']
                     )

# Write pandas dfs to .csv files
train_labels_out = './data/mnist_train_labels.gz'
test_labels_out = './data/mnist_test_labels.gz'
train_labels.to_csv(train_labels_out, compression = 'gzip')
test_labels.to_csv(test_labels_out, compression = 'gzip')

train_labels.head()
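
Later steps can then load these files directly instead of repeating the conversion. The following is a minimal sketch of that round trip, assuming the files are read back with `pd.read_csv`; it uses a small stand-in dataframe and a temporary, hypothetical file name rather than the real `./data` files.

```python
import os
import tempfile
import pandas as pd

# Small stand-in dataframe; the real one would be train_labels
labels = pd.DataFrame({'label': [5, 0, 4, 1, 9]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_labels.gz')  # hypothetical file name
    labels.to_csv(path, compression='gzip')

    # index_col=0 restores the row index that to_csv wrote as the first column
    restored = pd.read_csv(path, compression='gzip', index_col=0)

print(restored['label'].tolist())
```

Without `index_col=0`, the saved index would come back as an extra unnamed column, which would matter for the image files where every column is expected to be a pixel.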