# Dataset Wrapper
This document explains the reasoning behind the implementation of the dataset wrapper for the vectorcardiograms (VCGs) and dyssynchrony indices.

## Goal
The goal is to iterate easily through a given dataset while training a neural network. Ideally, calling ```next_batch``` returns the next batch of a specified size from the dataset.

## Next Batch
The ```next_batch``` function must meet the following additional requirements:
* Each call to ```next_batch``` delivers a specified number of examples for the vectorcardiogram sequences, the vectorcardiogram lengths, and the dyssynchrony indices.
* The batches are delivered sequentially. For example, with a batch size of 10, the first batch should contain example numbers 0 through 9 and the second batch should contain example numbers 10 through 19.
* Once we have iterated through the entire dataset, start pulling batches from the beginning again. This is reasonable because neural networks are usually trained for more than one epoch (the network sees the entire dataset more than once).
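
The requirements above can be sketched with a small helper. This is a minimal illustration and not the actual wrapper: the class name ```SequentialBatcher``` and the toy arrays are assumptions made for the example, and it assumes the dataset size is a multiple of the batch size.

```python
import numpy as np


class SequentialBatcher(object):
    """Hypothetical helper: serves fixed-size batches in order, wrapping around."""

    def __init__(self, vcg, vcg_length, target, batch_size):
        self.vcg = vcg
        self.vcg_length = vcg_length
        self.target = target
        self.batch_size = batch_size
        self.index = 0  # position of the first example in the next batch

    def next_batch(self):
        start = self.index
        end = start + self.batch_size
        # Start over once the next batch would run past the end of the set
        if end > len(self.vcg):
            start, end = 0, self.batch_size
        self.index = end
        return self.vcg[start:end], self.vcg_length[start:end], self.target[start:end]


# Toy data: 20 examples, 130 timesteps, 3 channels
vcg = np.zeros((20, 130, 3), dtype=np.float32)
vcg_length = np.full(20, 130, dtype=np.int32)
target = np.arange(20)

batcher = SequentialBatcher(vcg, vcg_length, target, batch_size=10)
first = batcher.next_batch()[2]   # examples 0 through 9
second = batcher.next_batch()[2]  # examples 10 through 19
third = batcher.next_batch()[2]   # wraps around to examples 0 through 9
```

Keeping a single ```index``` member is enough to satisfy all three requirements, which is why the sketch stores nothing else about its position.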

## Implementation Steps
### *Step 0: Setup*
To download the dataset, please email Chris Villongco for the link to the Dropbox. Once you have access, download the folder ```BiV2``` that corresponds to Patient 2 (our choice of Patient 2 is arbitrary, and we can use any of the patients available). Within that folder, we find the following files and folders:
* ```vcg_measured```* The actual (recorded) VCG (not simulated)
* ```vcg_model/``` The VCG simulations
* ```BiV2_LVdyssync_opt.txt```* The calculated dyssynchrony index (not from simulations)
* ```BiV2_LVdyssync.txt``` The dyssynchrony indices resulting from the simulations
* ```pts_to_eval.txt```* The parameters of the simulations

\* We are not interested in these files.

Once you have downloaded and extracted it (it usually comes as a .zip file), run the following:
```
>>> cd BiV2/vcg_model
```
and we are ready to begin!

### *Step 1: Rename Files*
We wish to rename the files for three reasons:
* Impose an ordering on the example numbers that is clearly visible in the file name
* Make the filenames more readable
* *Maintain* that the filenames are predictable and follow a well-defined format

Thus, we will execute the following bash script to rename all the ```.txt``` files in the current directory into something more readable:
```
a=1
for file in allParams-1_ECG_VCG_{1..608}_dump.txt; do 
    
    # Require a 3 digit padding
    new=$(printf "version%03d.txt" "$a")
    
    # Change the name
    mv "$file" "$new"
    let a=a+1
done
```
The result will be 608 files renamed to ```version%%%.txt```, where ```%%%``` is the example number zero-padded to three digits.


Since the original filenames were not zero-padded, ```ls``` would list ```allParams-1_ECG_VCG_109_dump.txt``` lexicographically before ```allParams-1_ECG_VCG_10_dump.txt```. Thus, instead of iterating through each ```file in ls *.txt```, we iterate using the brace expansion ```{1..608}```. This guarantees that ```version009.txt``` comes from ```allParams-1_ECG_VCG_9_dump.txt``` and not from a file that merely sorts earlier, such as ```allParams-1_ECG_VCG_109_dump.txt```.
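
The ordering problem can be reproduced in a few lines of Python. This is only an illustration: the three filenames are a hand-picked subset, and the helper ```simulation_number``` is a hypothetical name introduced here.

```python
# Hand-picked subset of the original filenames
names = ["allParams-1_ECG_VCG_{}_dump.txt".format(i) for i in (9, 10, 109)]

# Plain string sorting compares character by character, so "109" sorts before "10_"
lexicographic = sorted(names)


def simulation_number(name):
    """Extract the integer embedded in the filename (hypothetical helper)."""
    return int(name.split("_")[3])


# Sorting on the embedded integer recovers the simulation order
numeric = sorted(names, key=simulation_number)
```

Zero-padding the new names makes the two orderings agree, so later steps can rely on plain string order.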

To show that the renaming preserved the original ordering, compare the contents of file number ```1``` before and after the renaming.

#### Before:
```
>>> head allParams-1_ECG_VCG_1_dump.txt
1.14859698e-06	-8.52689793e-07	-1.62738886e-07
6.27637865e-03	8.56158099e-04	-2.80092680e-05
1.73977577e-02	2.37706707e-03	-8.70847085e-05
0.03220872	0.00439809	-0.00017971
0.04663505	0.00636597	-0.00028655
0.05990819	0.00821321	-0.00043461
0.07573148	0.01035879	-0.00061512
0.09897242	0.01347624	-0.0008311
0.11859204	0.01606282	-0.00100526
0.13539736	0.01838106	-0.00139427
```

#### After:
```
>>> head version001.txt
1.14859698e-06	-8.52689793e-07	-1.62738886e-07
6.27637865e-03	8.56158099e-04	-2.80092680e-05
1.73977577e-02	2.37706707e-03	-8.70847085e-05
0.03220872	0.00439809	-0.00017971
0.04663505	0.00636597	-0.00028655
0.05990819	0.00821321	-0.00043461
0.07573148	0.01035879	-0.00061512
0.09897242	0.01347624	-0.0008311
0.11859204	0.01606282	-0.00100526
0.13539736	0.01838106	-0.00139427
```
The first ten lines appear to match. It works!

### *Step 2: Convert To NumPy Arrays*
We wish to read in each VCG as a 2D NumPy array and store them all in a list (a list of 2D arrays, which becomes a 3D array). Similarly, we wish to read in each dyssynchrony index, a scalar value, and store them all as a vector. We read them in with the simple Python scripts provided below (we provide the scripts as Markdown because they only need to be executed once, and we have already done it for you):

#### Read in Vectorcardiogram Simulations with Python
```
# read_vcg.py

import numpy as np

# Initialize python list containing the vcg lengths and input 
vcg_length = []
vcg = []

for index in range(608):

	# Create filename with zero pad
	filename = 'version{:03d}.txt'.format(index + 1)

	# Read in the text file as numpy matrix
	x = np.loadtxt(filename, delimiter="\t")
	vcg.append(x)

	# Grab and store the vcg length
	vcg_length.append(x.shape[0])

# Convert Python list to NumPy array and save
np_vcg_length = np.asarray(vcg_length, dtype=np.int32)
np.save("vcg_length.npy", np_vcg_length)

np_vcg = np.asarray(vcg, dtype=np.float32)
np.save("vcg.npy", np_vcg)

```

The result should be two files, ```vcg.npy``` and ```vcg_length.npy```, saved in the current directory. We save the length of each VCG simulation because the TensorFlow function ```tf.nn.dynamic_rnn``` accepts an optional parameter ```sequence_length```, an int32/int64 vector sized ```[batch_size]```, which tells it how many timesteps of each sequence are valid.
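
Note that ```read_vcg.py``` relies on every simulation having the same number of rows; otherwise the list cannot be converted into a regular 3D array. If the lengths did differ, one possible workaround (an assumption on our part, not something the scripts above do) is to zero-pad each sequence and keep the true lengths alongside. The toy sequences below are made up for illustration.

```python
import numpy as np

# Made-up ragged input: three VCGs with 4, 2, and 3 timesteps, 3 channels each
sequences = [np.ones((4, 3)), np.ones((2, 3)), np.ones((3, 3))]

# True length of each sequence, in the dtype expected for sequence lengths
lengths = np.asarray([s.shape[0] for s in sequences], dtype=np.int32)
max_len = int(lengths.max())

# Zero-pad every sequence to max_len so they stack into one 3D array
padded = np.zeros((len(sequences), max_len, 3), dtype=np.float32)
for i, s in enumerate(sequences):
    padded[i, :s.shape[0], :] = s
```

With padding in place, the stored lengths are what distinguish real timesteps from filler.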

#### Read in Dyssynchrony Indices With Python
We do the same with the corresponding dyssynchrony indices, which is easier because they all lie in a single file. We run the following script (again provided as Markdown because it only needs to be executed once, and we have already done it for you):
```
# read_dyssync.py

import numpy as np 

dyssync = np.loadtxt("dyssync.txt")
np.save("dyssync.npy", dyssync)
```

#### Converting the Dyssynchrony Indices to Class Indices
In the file we just saved, ```dyssync.npy```, each entry is a real number that nominally lies between ```0.5``` and ```1``` (with exceptions, handled as corner cases below). We need to map each entry to a class index (the class each VCG belongs to), namely an integer from ```0``` through ```4```. The mapping is as follows:
* Multiply by 10.
* Floor the result.
* Subtract 5.
* Corner case: if the dyssynchrony index is less than 0.5 (0 is common), then we set those entries to ```0``` (instead of ```0*10 - 5 = -5```).
* Corner case: if the dyssynchrony index is exactly 1, then we set those entries to ```4``` (instead of ```1*10 - 5 = 5```).
* Convert to int datatype.

We execute the following python script to implement the mapping and save it as a new ```.npy``` file:
```
# mapping.py

import numpy as np 

# Load the column vector containing the dyssynchrony indices
init_x = np.load("dyssync.npy")

# Multiply elementwise by 10, floor result, subtract 5
x_scaled = np.multiply(init_x, 10)
x_floor = np.floor(x_scaled)
x = np.subtract(x_floor, 5)

# Corner case: dyssynchrony index is in [0, 0.5)
x[x < 0] = 0

# Corner case: dyssynchrony index is 1.0
x[x > 4] = 4

# Convert each element to int 
x = x.astype(int)

# Save to file 
np.save("target.npy", x)
```
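
As a quick sanity check, the same mapping can be written as a single function and applied to a few hand-picked values. The function name ```to_class_index``` and the sample values are ours, and ```np.clip``` is swapped in for the two explicit corner-case assignments.

```python
import numpy as np


def to_class_index(dyssync):
    """Map dyssynchrony indices to class indices 0 through 4."""
    classes = np.floor(np.asarray(dyssync, dtype=np.float64) * 10) - 5
    # np.clip covers both corner cases: values below 0.5 and a value of exactly 1
    return np.clip(classes, 0, 4).astype(int)


samples = [0.0, 0.5, 0.604, 0.75, 0.99, 1.0]
classes = to_class_index(samples)
```

Here ```0.604``` lands in class ```1``` (```floor(6.04) - 5```), while ```0.0``` and ```1.0``` are pulled to the boundary classes ```0``` and ```4```.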

#### Playing with the NumPy Dataset
We will play around with the NumPy matrices that we have just created. The files of interest are as follows:
* ```vcg.npy``` The VCG itself
* ```vcg_length.npy``` The length of each VCG sequence
* ```target.npy``` The class indices that the corresponding VCG lands in

In [1]:
```
import numpy as np

# Import our length and vcg sequence and class index
vcg = np.load("dataset/vcg.npy")
vcg_length = np.load("dataset/vcg_length.npy")
target = np.load("dataset/target.npy")

print "VCG Dimensions: " + str(vcg.shape)
print "VCG Sequence Length Dimensions: " + str(vcg_length.shape)
print "VCG Sequence Class Indices: " + str(target.shape)
```
```
VCG Dimensions: (1817, 130, 3)
VCG Sequence Length Dimensions: (1817,)
VCG Sequence Class Indices: (1817,)
```


In [2]:
```
# Print the first five timesteps of the 0th simulation
vcg[0][:5]
```
```
array([[ -4.17658654e-07,  -2.90673141e-06,   4.57174694e-06],
       [  5.85916000e-03,   8.50800000e-04,  -6.00100000e-04],
       [  1.88238800e-02,   2.74487000e-03,  -1.94500000e-03],
       [  3.68574500e-02,   5.37931000e-03,  -3.82863000e-03],
       [  5.84938100e-02,   8.55723000e-03,  -6.13386000e-03]])
```

In [3]:
```
# Print the first five VCG sequence lengths
vcg_length[:5]
```
```
array([130, 130, 130, 130, 130])
```

Note that each entry of the ```vcg.npy``` array is a 32-bit float, whereas each entry of the ```vcg_length.npy``` array is a 32-bit int, to match the requirements of the ```sequence_length``` parameter of ```tf.nn.dynamic_rnn```.

In [4]:
```
# Check that the length of each VCG is exactly 130 for Patient 6/7
if np.any(vcg_length != 130):
    print "Exists a VCG with a length not equal to 130 timesteps."
else:
    print "All VCGs have 130 timesteps."
```
```
All VCGs have 130 timesteps.
```


Note that all VCGs for Patient 6/7 have exactly 130 timesteps. This property may not hold for other patients.

In [5]:
```
# Print the first five class indices
print "VCG Sequences #1-5 falls in classes: " + str(target[:5])

# VCG Sequence number 187 has a dyssynchrony index of 0.604
# It should be mapped to class 1
print "VCG Sequence #187 falls in class: " + str(target[186])
```
```
VCG Sequences #1-5 falls in classes: [4 2 2 2 3]
VCG Sequence #187 falls in class: 1
```


We have the necessary data in the correct format; we are now ready to create our wrapper class!

### *Step 3: Create a Wrapper Class*
Now that we have saved the dataset as NumPy files, we create a Python class that can:
* Provide the ```next_batch``` function.
* Randomize the dataset, and store how we randomized it.
* Store the sequence lengths of each VCG.

This class will be called ```Simulations```. It contains three members, ```train, validate,``` and ```test```, which correspond to the three subsets into which we divide our dataset. For clarification, the purposes of the three sets are as follows:
* Training set: a set of examples used for learning the weights of the classifier.
* Validation set: a set of examples used to tune the hyperparameters (architectures, not weights) of a classifier. We can use this to determine the optimum number of hidden units in a neural network.
* Testing set: a set of examples used ONLY to assess the performance of a fully specified classifier.

To simplify the implementation, we pre-partition the training, validation, and testing sets and fix the batch size at ```32``` examples. We aim for something close to a ```60%, 20%, 20%``` split while keeping each set size divisible by ```32```. Thus, our set sizes are as follows:
* Training set: 416 examples, 13 batches (~68%)
* Validation set: 96 examples, 3 batches (~16%)
* Testing set: 96 examples, 3 batches (~16%)

These add up to 608 examples and 19 batches.
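
The partition sizes above could be produced with a single shuffled index array, as in the following sketch. The constant names and the fixed seed are assumptions made for illustration; the actual wrapper may split the data differently.

```python
import numpy as np

BATCH_SIZE = 32
N_TRAIN, N_VALIDATE, N_TEST = 416, 96, 96

num_examples = N_TRAIN + N_VALIDATE + N_TEST  # 608

# Shuffle example indices once; the fixed seed is an arbitrary choice for repeatability
rng = np.random.RandomState(0)
order = rng.permutation(num_examples)

# Carve the shuffled indices into three disjoint subsets
train_idx = order[:N_TRAIN]
validate_idx = order[N_TRAIN:N_TRAIN + N_VALIDATE]
test_idx = order[N_TRAIN + N_VALIDATE:]

# Number of whole batches in each subset
batches = [len(idx) // BATCH_SIZE for idx in (train_idx, validate_idx, test_idx)]
```

Splitting index arrays rather than the data itself keeps a record of which original example ended up in which subset.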

#### Playing Around with the Dataset Wrapper
This is our final product.

In [6]:
```
from dataset import Patient

# Instantiate wrapper
patient_dataset = Patient("dataset/vcg.npy", "dataset/vcg_length.npy", "dataset/target.npy")

# Sizes of sets
print "Training set size: " + str(patient_dataset.train.randomize.max() + 1)
print "Testing set size: " + str(patient_dataset.test.randomize.max() + 1)
```
```
Training set size: 1472
Testing set size: 345
```


In [7]:
```
# get the first batch
batch_vcg, batch_vcg_length, batch_target = patient_dataset.train.next_batch()

# Index of first example given in batch
print "Index of first example: " + str(patient_dataset.train.index)

# Shape of next batch
print "Dimensions of VCG: " + str(batch_vcg.shape)
print "Dimensions of VCG lengths: " + str(batch_vcg_length.shape)
print "Dimensions of targets: " + str(batch_target.shape)
```
```
Index of first example: 23
Dimensions of VCG: (23, 130, 3)
Dimensions of VCG lengths: (23,)
Dimensions of targets: (23,)
```


In [8]:
```
# We can see how the set was randomized by accessing the "randomize" member
print patient_dataset.train.randomize
```
```
[1375  885  524 ..., 1158  298 1191]
```
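
One reason to store the randomization is that it lets us trace a shuffled example back to its original position. The sketch below assumes that ```randomize``` is a permutation of the example indices, which the output above suggests but which we have not confirmed against the wrapper's source; the toy ```data``` array is made up.

```python
import numpy as np

data = np.arange(100, 110)  # stand-in for ten examples

rng = np.random.RandomState(0)
randomize = rng.permutation(len(data))  # order in which examples are served

# shuffled[k] is the example that originally sat at position randomize[k]
shuffled = data[randomize]

# argsort of a permutation gives its inverse, which undoes the shuffle
inverse = np.argsort(randomize)
restored = shuffled[inverse]
```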
