## Importing Data into TensorFlow

This is an exploration of how to load external data into TensorFlow, especially data stored as .mat files.


With the newer TensorFlow API, the ```tf.data``` module is the preferred way to load data into TensorFlow. A ```tf.data.Dataset``` object represents a sequence of elements, each containing one or more tensors; in an image pipeline, for example, an element might be a single training example holding an image tensor and a label tensor. A ```tf.data.Iterator``` provides the main way to extract elements from a dataset. The operation returned by ```Iterator.get_next``` yields the next element of a ```Dataset``` when executed, and typically acts as the interface between input pipeline code and your model.

Let us first think about creating a dataset and then worry about the later parts like iterators.

Datasets can be created from tensors or tensor slices. If we have a NumPy array, we can call ```Dataset.from_tensor_slices``` to turn it into a dataset. So it seems fair to first get our .mat file into a NumPy array. Luckily, ```scipy.io``` has a function that does exactly that. Let's go ahead and import the libraries.

In [None]:
import tensorflow as tf
import numpy as np
import scipy.io as spio
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib notebook

SciPy provides the function ```scipy.io.loadmat```, which loads a .mat file. It takes various arguments, but we don't need to concern ourselves with those here; you can look them up in the docs if you wish. It returns a dictionary from which we will get our NumPy array.

In [None]:
path = "Dropbox/Deep Learning Microscope Code/data/thin_smear/raw_data/examples_with_malaria/all_with_malaria_combo_328.mat"
mal_dict = spio.loadmat(path)
mal_dict

As can be seen, our data is inside the dict. Now we just get it out.

In [None]:
withMalaria = mal_dict['all_with_malaria_combo']
withMalaria

In [None]:
withMalaria.shape
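For a feel of what ```loadmat``` hands back, here is a minimal round-trip sketch. The file name `demo.mat` and the variable name `demo_var` are made up for illustration and have nothing to do with the malaria files above; the point is only that the returned dict mixes metadata keys with the saved variables.

```python
import os
import tempfile

import numpy as np
import scipy.io as spio

# Hypothetical toy array standing in for the real image stack.
demo = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mat")
    spio.savemat(demo_path, {"demo_var": demo})  # write a throwaway .mat file
    loaded = spio.loadmat(demo_path)             # read it back as a dict

# Keys wrapped in double underscores are file metadata, not MATLAB variables.
user_keys = sorted(k for k in loaded if not k.startswith("__"))
restored = loaded["demo_var"]
print(user_keys, restored.shape)
```

Filtering out the double-underscore keys is a quick way to find the variable name when you do not know what is stored in a .mat file.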

So, the positive set is done. We will repeat the procedure for the negative samples.

In [None]:
path = "Dropbox/Deep Learning Microscope Code/data/thin_smear/raw_data/examples_without_malaria/all_without_malaria_combo_693.mat"
wmal_dict = spio.loadmat(path)
wmal_dict

In [None]:
withoutMalaria = wmal_dict['all_without_malaria_combo']
withoutMalaria.shape

Great! Now we have to concatenate these two arrays along the sample axis, which is the fourth dimension here, to get the complete image dataset, and we are about a quarter of the way there.

In [None]:
imageData = np.concatenate((withMalaria, withoutMalaria), axis=3)
imageData.shape
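As a small sanity check of the concatenation step, the sketch below uses made-up toy shapes (not the real data) and assumes the same layout of height, width, channels, samples. Only the last axis should grow.

```python
import numpy as np

# Hypothetical small stacks laid out as (height, width, channels, samples).
pos = np.ones((4, 4, 3, 5))
neg = np.zeros((4, 4, 3, 7))

# Joining along axis 3 appends the samples of `neg` after those of `pos`.
combined = np.concatenate((pos, neg), axis=3)
print(combined.shape)
```

The first five samples along the last axis come from `pos` and the remaining seven from `neg`, which is why the labels can later be assigned by position.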

Cool. Now we need an array for the labels. It's better to create one-hot labels straight away since we will need them later anyway. We know that the first 328 samples are positive, so we will initialise the first 328 labels as ```[1 0]``` and the rest as ```[0 1]```.

In [None]:
labels = np.zeros([2,1021])
labels[0,0:328] = 1
labels[1,328:1021] = 1
labels
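An alternative way to get the same one-hot labels is to index an identity matrix with integer class ids. The sketch below uses hypothetical small counts in place of 328 and 693 and checks that both constructions agree; the names `class_ids`, `one_hot_rows` and `manual` are illustrative only.

```python
import numpy as np

n_pos, n_neg = 3, 5  # hypothetical counts, standing in for 328 and 693

# Class 0 = positive, class 1 = negative, in the same order as the image stack.
class_ids = np.concatenate((np.zeros(n_pos, dtype=int), np.ones(n_neg, dtype=int)))
one_hot_rows = np.eye(2)[class_ids]  # shape (n_pos + n_neg, 2), one row per sample

# Same labels built by slice assignment, with one column per sample.
manual = np.zeros([2, n_pos + n_neg])
manual[0, 0:n_pos] = 1
manual[1, n_pos:] = 1
```

The identity-matrix version already has one row per sample, so it would not need the later axis swap; the slice-assignment version is its transpose.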

Now we will reorder the axes so that the sample axis, which is currently the fourth dimension, becomes the first. Then we will shuffle the arrays because right now they are perfectly ordered (all positives first), which is undesirable.

In [None]:
imageData = imageData.swapaxes(0,3).swapaxes(2,3).swapaxes(1,2)
labels = labels.swapaxes(0,1)
s = np.arange(labels.shape[0])
np.random.shuffle(s)
imageData = imageData[s,:,:,:]
labels = labels[s,:]
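The chain of three ```swapaxes``` calls moves the last axis to the front while keeping the other axes in order, which ```np.moveaxis``` can do in one call. The sketch below checks this on a made-up toy stack and also shows why indexing images and labels with the same permutation keeps each pair aligned. The integer `tags` array is a hypothetical stand-in for the labels.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Hypothetical toy stack: (height, width, channels, samples).
stack = rng.random((4, 5, 3, 6))
chained = stack.swapaxes(0, 3).swapaxes(2, 3).swapaxes(1, 2)
moved = np.moveaxis(stack, 3, 0)  # samples first: (6, 4, 5, 3)

# Tag each sample with its index so alignment can be checked after shuffling.
tags = np.arange(moved.shape[0])
perm = rng.permutation(moved.shape[0])
shuffled_images, shuffled_tags = moved[perm], tags[perm]
```

After shuffling, `shuffled_tags[i]` still names the original sample that sits at `shuffled_images[i]`, which is the property the image/label pairs rely on.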

Okay! We now have the raw materials for building datasets and iterators in TensorFlow. We will construct our dataset using the ```from_tensor_slices``` method and use an initializable iterator because we want to switch between training and testing sets. A one-shot iterator can iterate once through a dataset, and you cannot feed any value to it. An initializable iterator lets you change the data dynamically by running its initializer operation and passing the new data with ```feed_dict```. There are also reinitializable and feedable iterators, but I think we don't need them here. Feedable is definitely too complex.

In [None]:
# Build the dataset from placeholders so that an initializable iterator can later be fed training or test data.
EPOCHS = 10
# Create a placeholder to dynamically switch between batch sizes
batch_size = tf.placeholder(tf.int64)
BATCH_SIZE = 10

x, y = tf.placeholder(tf.float32, shape=[None, 28, 28, 96]), tf.placeholder(tf.float32, shape=[None, 2])
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size).repeat()

# train-test split
train_images, test_images, train_labels, test_labels = train_test_split(imageData, labels, test_size=0.25)

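# make an iterator to get the next batch
iter = dataset.make_initializable_iterator()
# note: from here on, `labels` is the label tensor from the iterator, not the NumPy array built earlier
features, labels = iter.get_next()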
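To see what ```.batch(batch_size).repeat()``` yields, here is a plain NumPy analogue. It is not TensorFlow code: `batch_then_repeat` is a hypothetical helper, and it assumes the default behaviour of keeping the final partial batch rather than dropping it.

```python
import itertools

import numpy as np


def batch_then_repeat(data, batch_size):
    """Hypothetical helper: yield consecutive batches, then start over, forever."""
    while True:
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]


samples = np.arange(7)  # hypothetical 7-element dataset
first_five = list(itertools.islice(batch_then_repeat(samples, 3), 5))
sizes = [len(b) for b in first_five]
print(sizes)
```

With 7 elements and a batch size of 3, every third batch holds a single element before the data starts over. The same thing happens whenever the number of training samples is not a multiple of the batch size, so a loop that runs `N // batch_size` steps per epoch will drift slightly relative to true epoch boundaries.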

Now suppose there is a model which is already defined elsewhere, and ```pred = model(features)```, where ```features``` is the first value returned by ```iter.get_next()```. Then the training will look like:

In [None]:
# pass the second value from iter.get_next() as the labels
loss = tf.losses.mean_squared_error(labels=labels, predictions=pred)
train_op = tf.train.AdamOptimizer().minimize(loss)

n_batches = train_images.shape[0] // BATCH_SIZE

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # initialise iterator with train data
    sess.run(iter.initializer, feed_dict={x: train_images, y: train_labels, batch_size: BATCH_SIZE})
    print('Training...')
    for i in range(EPOCHS):
        tot_loss = 0
        for _ in range(n_batches):
            _, loss_value = sess.run([train_op, loss])
            tot_loss += loss_value
        print("Epoch: {}, Loss: {:.4f}".format(i, tot_loss / n_batches))
    # initialise iterator with test data
    sess.run(iter.initializer, feed_dict={x: test_images, y: test_labels, batch_size: test_images.shape[0]})
    print('Test Loss: {:.4f}'.format(sess.run(loss)))

Let us try to visualize the data to check that it is correct. We will create a session, run the iterator's initialiser, and plot one image using matplotlib.

In [None]:
with tf.Session() as sess:
    idx = 9
    sess.run(iter.initializer, feed_dict={x: train_images, y: train_labels, batch_size: 10})
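    # fetch both in a single run call; two separate calls would each advance the
    # iterator, pairing the images with the labels of the following batch
    image, label = sess.run([features, labels])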
    io = image[idx][:,:,1]
    print(label[idx])
    plt.imshow(io)

Yup, it seems legit. So, that is it.