# Dataset Wrapper
This document explains the reasoning behind the implementation of the dataset wrapper for the vectorcardiograms (VCGs) and dyssynchrony indices.

## Goal
The goal is to iterate through a given dataset with minimal effort when training a neural network in batches. Ideally, calling ```next_batch``` returns the next batch of a specified size from the dataset.

## Next Batch
The ```next_batch``` function must meet the following additional requirements:
* Deliver a specified number of examples upon calling ```next_batch``` for both the dyssynchrony index and the vectorcardiogram.
* Deliver the batches sequentially. For example, the first batch should contain example numbers 1 through 10, and the second batch example numbers 11 through 20. There are exceptions for corner cases, however, such as when the specified batch size is greater than the dataset size, or when we have reached the end of the dataset and need to pull from the beginning.
* If we have iterated through the entire dataset, start pulling batches from the beginning again. This is reasonable because neural networks are usually trained for more than one epoch (the network sees the entire dataset more than once).
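
The requirements above can be sketched in a few lines of NumPy. This is only an illustration, not the final wrapper: the class name ```BatchIterator``` and its attributes are hypothetical, and indices are 0-based here, whereas the example numbers above are 1-based.

```python
import numpy as np


class BatchIterator(object):
    """Hypothetical helper: serves sequential batches and wraps around."""

    def __init__(self, inputs, targets):
        assert len(inputs) == len(targets)
        self.inputs = inputs
        self.targets = targets
        self.num_examples = len(inputs)
        self.position = 0  # index of the next example to serve

    def next_batch(self, batch_size):
        # Modular indexing covers both corner cases: reaching the end of
        # the dataset and a batch size larger than the dataset.
        idx = (self.position + np.arange(batch_size)) % self.num_examples
        self.position = (self.position + batch_size) % self.num_examples
        return self.inputs[idx], self.targets[idx]


# Toy data standing in for the VCGs and their targets
data = BatchIterator(np.arange(25), np.arange(25) * 10)
x, y = data.next_batch(10)  # examples 0 through 9
x, y = data.next_batch(10)  # examples 10 through 19
x, y = data.next_batch(10)  # examples 20 through 24, then 0 through 4
```

Wrapping with the modulo operator keeps the logic short, at the cost of letting a single batch mix the end of one epoch with the start of the next.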

## Implementation Steps
### *Step 1: Rename Files*
We wish to rename the files for three reasons:
* Impose an ordering on the example numbers
* Make the filenames more readable
* Keep the filenames predictable and in a well-defined format

Thus, we will execute the following bash script to rename all the ```.txt``` files in the current directory:
```
a=1
for file in allParams-1_ECG_VCG_{1..608}_dump.txt; do 
    
    # Require a 3 digit padding
    new=$(printf "version%03d.txt" "$a")
    
    # Change the name
    mv "$file" "$new"
    let a=a+1
done
```
Because the original filenames were not zero-padded, ```ls``` lists ```allParams-1_ECG_VCG_109_dump.txt``` lexicographically before ```allParams-1_ECG_VCG_10_dump.txt```. Thus, instead of iterating over the output of ```ls *.txt```, we iterate using the brace expansion ```{1..608}```, so that ```allParams-1_ECG_VCG_9_dump.txt``` becomes ```version009.txt``` and another file (such as ```allParams-1_ECG_VCG_109_dump.txt```) does not take its place.
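
The ordering problem can be reproduced with a small Python illustration. Note the assumption: Python's ```sorted``` compares strings by character code, while the exact order printed by ```ls``` depends on the locale, so this is a sketch of the effect rather than a reproduction of ```ls```.

```python
# Three example numbers are enough to show the effect
numbers = (9, 10, 109)

# Original-style names (not zero-padded) and renamed ones (zero-padded)
original = ["allParams-1_ECG_VCG_{}_dump.txt".format(i) for i in numbers]
renamed = ["version{:03d}.txt".format(i) for i in numbers]

# Without padding, string order does not match numeric order
print(sorted(original))

# With three-digit padding, string order and numeric order agree
print(sorted(renamed))
```

In the first list, the file for example 109 sorts before the file for example 10, which in turn sorts before the file for example 9. In the second list, the order is 9, 10, 109.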

To show that this has preserved the original ordering, look at the content of the first file before and after the renaming.

#### Before:
```
>>> head allParams-1_ECG_VCG_1_dump.txt
1.14859698e-06	-8.52689793e-07	-1.62738886e-07
6.27637865e-03	8.56158099e-04	-2.80092680e-05
1.73977577e-02	2.37706707e-03	-8.70847085e-05
0.03220872	0.00439809	-0.00017971
0.04663505	0.00636597	-0.00028655
0.05990819	0.00821321	-0.00043461
0.07573148	0.01035879	-0.00061512
0.09897242	0.01347624	-0.0008311
0.11859204	0.01606282	-0.00100526
0.13539736	0.01838106	-0.00139427
```

#### After:
```
>>> head version001.txt
1.14859698e-06	-8.52689793e-07	-1.62738886e-07
6.27637865e-03	8.56158099e-04	-2.80092680e-05
1.73977577e-02	2.37706707e-03	-8.70847085e-05
0.03220872	0.00439809	-0.00017971
0.04663505	0.00636597	-0.00028655
0.05990819	0.00821321	-0.00043461
0.07573148	0.01035879	-0.00061512
0.09897242	0.01347624	-0.0008311
0.11859204	0.01606282	-0.00100526
0.13539736	0.01838106	-0.00139427
```
It works!

### *Step 2: Convert To NumPy Arrays*
We wish to read in each VCG as a 2D NumPy matrix and stack them all into a 3D matrix. Similarly, we wish to read in each dyssynchrony index, a scalar value, and store them all as a vector. We read them in with a simple Python script, provided below (we provide the script as Markdown because it only needs to be executed once, and we have done it for you):

#### Read in Vectorcardiogram Simulations with Python
```
# read_vcg.py

import numpy as np

# Initialize python list containing the vcg lengths and input 
vcg_length = []
vcg = []

for index in range(608):

	# Create filename with zero pad
	filename = 'version{:03d}.txt'.format(index + 1)

	# Read in the text file as numpy matrix
	x = np.loadtxt(filename, delimiter="\t")
	vcg.append(x)

	# Grab and store the vcg length
	vcg_length.append(x.shape[0])

# Convert Python list to NumPy array and save
np_vcg_length = np.asarray(vcg_length, dtype=np.int32)
np.save("vcg_length.npy", np_vcg_length)

np_vcg = np.asarray(vcg, dtype=np.float32)
np.save("vcg.npy", np_vcg)

```

The result should be two files, ```vcg.npy``` and ```vcg_length.npy```, saved in the current directory. We save the length of each VCG simulation because the TensorFlow function ```tf.nn.dynamic_rnn``` accepts the optional parameter ```sequence_length```, an int32/int64 vector of size ```[batch_size]``` that tells the RNN how many timesteps of each example are valid.
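
Stacking the list with ```np.asarray``` only yields a 3D matrix when every simulation has the same number of timesteps. If the lengths differed, the sequences would first have to be padded to a common length, with the stored lengths marking where the real data ends. Below is a minimal sketch of that idea under this assumption; the helper ```pad_sequences``` and the toy data are hypothetical and not part of the scripts above.

```python
import numpy as np


def pad_sequences(sequences):
    """Zero-pad a list of (length_i, channels) arrays to a common length."""
    lengths = np.asarray([s.shape[0] for s in sequences], dtype=np.int32)
    max_len = int(lengths.max())
    n_channels = sequences[0].shape[1]
    padded = np.zeros((len(sequences), max_len, n_channels), dtype=np.float32)
    for i, s in enumerate(sequences):
        # Copy the real timesteps; the remainder stays zero
        padded[i, :s.shape[0], :] = s
    return padded, lengths


# Toy data standing in for VCG files of different lengths
toy = [np.ones((4, 3)), np.ones((2, 3))]
batch, lengths = pad_sequences(toy)
print(batch.shape)  # (2, 4, 3)
print(lengths)      # [4 2]
```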

#### Read in Dyssynchrony Indices With Python
We do the same for the corresponding dyssynchrony indices, which is easier because they all lie in a single file. We run the following script (we provide the script as Markdown because it only needs to be executed once, and we have done it for you):
```
# read_dyssync.py

import numpy as np 

dyssync = np.loadtxt("dyssync.txt")
np.save("dyssync.npy", dyssync)
```

#### Converting the Dyssynchrony Indices to Class Indices
In the file we have just saved, ```dyssync.npy```, each entry is a real value, typically between ```0.5``` and ```1```. However, we need to map each entry to a class index, namely, an integer between ```0``` and ```4```. The mapping is as follows:
* Multiply by 10.
* Floor the result.
* Subtract 5.
* Corner case: if the dyssynchrony index is less than 0.5 (0 is common), then we set those entries to ```0``` (instead of ```0*10 - 5 = -5```).
* Corner case: if the dyssynchrony index is exactly 1, then we set those entries to ```4``` (instead of ```1*10 - 5 = 5```).
* Convert to int datatype.

We execute the following Python script to implement the mapping and save it as a new ```.npy``` file:
```
# mapping.py

import numpy as np 

# Load the column vector containing the dyssynchrony indices
init_x = np.load("dyssync.npy")

# Multiply elementwise by 10, floor result, subtract 5
x_scaled = np.multiply(init_x, 10)
x_floor = np.floor(x_scaled)
x = np.subtract(x_floor, 5)

# Corner case: dyssynchrony index is between [0, 0.5)
x[x < 0] = 0

# Corner case: dyssynchrony index is 1.0
x[x > 4] = 4

# Convert each element to int 
x = x.astype(int)

# Save to file 
np.save("target.npy", x)
```
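
To make the mapping concrete, here is a compact equivalent applied to a few hand-picked values. It is a sketch only: the function name ```to_class_index``` is hypothetical, and ```np.clip``` is used in place of the two boolean-mask assignments above.

```python
import numpy as np


def to_class_index(dyssync):
    """Map dyssynchrony indices in [0, 1] to class indices 0 through 4."""
    # Multiply by 10, floor the result, subtract 5
    shifted = np.floor(np.asarray(dyssync, dtype=np.float64) * 10) - 5
    # Clipping handles both corner cases: values below 0.5 and exactly 1.0
    return np.clip(shifted, 0, 4).astype(int)


samples = [0.0, 0.5, 0.55, 0.73, 0.99, 1.0]
print(to_class_index(samples))  # [0 0 0 2 4 4]
```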

#### Playing with the NumPy Dataset
We will play around with the NumPy matrices that we have just created. The files of interest are as follows:
* ```vcg.npy``` The VCG itself
* ```vcg_length.npy``` The length of each VCG sequence
* ```target.npy``` The class indices that the corresponding VCG lands in

```
import numpy as np

# Import our length and vcg sequence and class index
vcg = np.load("vcg.npy")
vcg_length = np.load("vcg_length.npy")
target = np.load("target.npy")

print("VCG Dimensions: " + str(vcg.shape))
print("VCG Sequence Length Dimensions: " + str(vcg_length.shape))
print("VCG Sequence Class Indices: " + str(target.shape))
```

```
VCG Dimensions: (608, 170, 3)
VCG Sequence Length Dimensions: (608,)
VCG Sequence Class Indices: (608,)
```


```
# Print the first five timesteps of the 0th simulation
vcg[0][:5]
```

```
array([[  1.14859699e-06,  -8.52689766e-07,  -1.62738885e-07],
       [  6.27637887e-03,   8.56158091e-04,  -2.80092681e-05],
       [  1.73977576e-02,   2.37706699e-03,  -8.70847070e-05],
       [  3.22087184e-02,   4.39808983e-03,  -1.79709998e-04],
       [  4.66350503e-02,   6.36596978e-03,  -2.86549999e-04]], dtype=float32)
```

```
# Print the first five VCG sequence lengths
vcg_length[:5]
```

```
array([170, 170, 170, 170, 170], dtype=int32)
```

Note that ```vcg.npy``` stores 32-bit floats, whereas ```vcg_length.npy``` stores 32-bit ints to match the requirements of the ```sequence_length``` parameter of ```tf.nn.dynamic_rnn```.

```
# Print the first five class indices
print("VCG Sequences #1-5 falls in classes: " + str(target[:5]))

# VCG Sequence number 187 has a dyssynchrony index of 0.0
# It should be mapped to class 0 instead of -5
print("VCG Sequence #187 falls in class: " + str(target[186]))
```

```
VCG Sequences #1-5 falls in classes: [2 2 2 2 3]
VCG Sequence #187 falls in class: 0
```


We have the necessary data in the correct format; we are now ready to create our wrapper class!

### *Step 3: Create a Wrapper Class*
Now that we've saved the dataset as NumPy files, we create a Python class that can:
* Provide the ```next_batch``` function.
* Randomize the dataset, and store how we randomized it.
* Store the sequence lengths of each VCG.
* Allow the user to specify how to split the data set into training, validation, and testing.

This class will be called ```Simulations```. It contains three members, ```train```, ```validate```, and ```test```, which correspond to the three subsets into which we divide our dataset. For clarification, the purposes of the three sets are as follows:
* Training set: a set of examples used for learning the weights of the classifier.
* Validation set: a set of examples used to tune the hyperparameters (architectures, not weights) of a classifier. We can use this to determine the optimum number of hidden units in a neural network.
* Testing set: a set of examples used ONLY to assess the performance of a fully specified classifier.
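
One way to randomize the dataset, remember how it was randomized, and split it into the three subsets is sketched below. This is an illustration under stated assumptions, not the ```Simulations``` class itself: the function name ```shuffle_and_split```, the default fractions, and the seed are all hypothetical choices.

```python
import numpy as np


def shuffle_and_split(num_examples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return the permutation used and index arrays for the three subsets."""
    rng = np.random.RandomState(seed)
    # Keeping the permutation records how the dataset was randomized
    permutation = rng.permutation(num_examples)
    n_train = int(fractions[0] * num_examples)
    n_validate = int(fractions[1] * num_examples)
    train_idx = permutation[:n_train]
    validate_idx = permutation[n_train:n_train + n_validate]
    # The testing set takes whatever remains
    test_idx = permutation[n_train + n_validate:]
    return permutation, train_idx, validate_idx, test_idx


perm, train_idx, validate_idx, test_idx = shuffle_and_split(608)
print("{} / {} / {}".format(len(train_idx), len(validate_idx), len(test_idx)))
```

The same index arrays can then be applied to ```vcg```, ```vcg_length```, and ```target``` (for example, ```vcg[train_idx]```) so that every VCG stays aligned with its length and class index.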