# CS375 - Tutorial 1
### Welcome to tutorial 1! This tutorial will introduce you to basic MongoDB, Tensorflow, and tfutils functions and commands. We will learn how to set up a MongoDB database and how to connect to it, read from it, and write to it. We will then give you a brief refresher on Tensorflow, build a simple Multi-Layer Perceptron (MLP), and show you how to read data not only from feed dicts but also from TFRecords. We will also show you how to create TFRecord files out of your own data. Finally, we will show you how to use tfutils, which puts all the pieces together into one coherent framework.

## 1.) Using MongoDB with pymongo
### First we will learn how to use MongoDB. We will start with setting up a MongoDB database. We assume that you have installed MongoDB following any of the instructions that you can find online. A principled way to set up a MongoDB database is to use a config file. A mongodb.conf file can look as follows:

\# Where to store the data.  
dbpath=/mongodb/mongodb  
\# Where to log  
logpath=/mongodb/mongodb/mongodb.log  
\# IP address and port your database listens on (127.0.0.1 is localhost).  
bind_ip = 127.0.0.1  
port = 24444

### That's it! That's all we need to set up and start a MongoDB database. If any of these settings is not defined, MongoDB will fall back to its default. We can start the MongoDB database with the following command:
`sudo mongod --config=mongodb.conf --quiet&`

### Now that we have set up our database, let's try to connect to it using the Python interface "pymongo". To do so, we need to specify the port and the host of the database we want to connect to:

In [None]:
import pymongo as pm
import numpy as np

port = 24444
host = 'localhost'
connection = pm.MongoClient(port = port, host = host)

### To test if we have successfully connected to our database, let's print out the database names, skipping the first one.

In [None]:
print(connection.database_names()[1:])

### Now let's say we are interested in the contents of "imagenet". Since pymongo lets us index databases and collections like a dictionary, we can access "imagenet" as follows and look up its collections:

In [None]:
database = connection['imagenet']
print(database.collection_names())

#database2 = connection['task']
#print(database2.collection_names())

### Let's dig deeper and have a look at "alexnet.files". As we have learned in the last lecture we have reached the level where we store our experiments uniquely identified by "exp_id". So let's find all unique "exp_id" in the "alexnet.files" collection.

In [None]:
collection = database['alexnet.files']
collection.distinct('exp_id')

#collection2 = database2['polyfunction']
#collection2.distinct('exp_id')

### In order to find all entries that belong to, e.g., "experiment_1", we need to query the database by "exp_id". So we formulate a search query and use it to retrieve the entries as follows:

In [None]:
query = {'exp_id': 'experiment_1'}
entries = collection.find(query)
print(entries.count())

### We have found 2760 entries with "exp_id" "experiment_1", some of which belong to training data, and some of which belong to validation data, which I know as I have created those entries with tfutils. Let's have a look at the contents of the first entry:

In [None]:
entries[0]

### We see that the entry consists of a bunch of keys that store information about the trained model and training results. Now let's say we want to access only those entries that actually contain training results. In order to do this we have to formulate a new database query where we use the "$exists" keyword to signal that "train_results" has to exist in any entry that gets returned:

In [None]:
query_train = {'exp_id': 'experiment_1', 'train_results': {'$exists' : True}}
train_entries = collection.find(query_train)
train_entries[0]

### Again, we obtain a list of entries, all of which contain "train_results". We printed the content of the first returned entry and can observe that, although we requested only entries that contain "train_results", all of the entry's data was returned and not only "train_results". If we want Mongo to return only "train_results" in each entry we have to use a projection as follows: 

In [None]:
train_entries = collection.find(query_train, projection = ['train_results'])
train_entries[1]

#duration_entries = collection.find(query_train,projection = ['duration'])
#duration_entries[0]

### As we can see we have now successfully retrieved "train_results" only, except for the unique identifier "_id" which always gets returned. These "train_results" entries were saved every couple of iterations while training a model in tfutils and for instance contain the training loss. Using this training data we can now easily construct a vector that contains the training loss over time by using a simple list comprehension:

In [None]:
# materialize the cursor into a list: a cursor can only be iterated once,
# so the second comprehension below would otherwise come back empty
train_entries = list(collection.find(query_train, projection = ['train_results']))
loss = np.array([results['loss'] for entry in train_entries for results in entry['train_results']])
learning_rate = np.array([results['learning_rate'] for entry in train_entries for results in entry['train_results']])
print(loss.size)
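
### When one query result feeds several list comprehensions, as in the cell above, it matters that a pymongo cursor is consumed while it is iterated, much like a Python generator. The sketch below uses a plain generator as a stand-in; "fake_find" and its records are made up for illustration and are not real query results:

```python
def fake_find(records):
    """Hypothetical stand-in for collection.find(): yields records one by one."""
    for record in records:
        yield record

# Made-up entries that mimic the structure of the 'train_results' field
records = [
    {'train_results': [{'loss': 2.0, 'learning_rate': 0.5}]},
    {'train_results': [{'loss': 1.5, 'learning_rate': 0.5}]},
]

# Iterating a one-shot iterator twice: the second pass sees nothing
cursor = fake_find(records)
loss_once = [r['loss'] for e in cursor for r in e['train_results']]
lr_once = [r['learning_rate'] for e in cursor for r in e['train_results']]

# Storing the results in a list first makes repeated passes safe
entries = list(fake_find(records))
loss_list = [r['loss'] for e in entries for r in e['train_results']]
lr_list = [r['learning_rate'] for e in entries for r in e['train_results']]

print(loss_once, lr_once)
print(loss_list, lr_list)
```

### With a real cursor, calling "list(...)" on the result of "find()" (or calling the cursor's "rewind()" method before the second pass) avoids the empty second result.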

### We can now go ahead and plot the loss:

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt
plt.plot(loss)
plt.grid()

In [None]:
plt.plot(learning_rate)

### Inserting records into MongoDB is as easy as reading data from the database. You can simply use the "insert_one()" method as follows:

In [None]:
# entry has to be a dictionary
entry_1 = {'value': 10.1}
coll = connection['test_db']['test_coll']
coll.insert_one(entry_1)

#entry_GR = {'Sarah': 999}
#coll.insert_one(entry_GR)


### And if we now query the "test_coll" collection we can see that "entry_1" has been stored in the database:

In [None]:
coll = connection['test_db']['test_coll']
coll.find({'value': {'$exists' : True}})[0]
#coll.find({'Sarah':{'$exists' : True}})[0]

### Now let's insert two more entries and sort them by the "value" field:

In [None]:
entry_2 = {'value': 4.2}
entry_3 = {'value': 8.3}

coll = connection['test_db']['test_coll']
coll.insert_many([entry_2, entry_3])

entries = coll.find({'value': {'$exists' : True}}, projection=['value'])
entries = [entry for entry in entries]
print('Not sorted:')
for entry in entries:
    print(entry['value'])

entries = coll.find({'value': {'$exists' : True}}, projection=['value']).sort([('value',pm.ASCENDING)])
entries = [entry for entry in entries]
print('Sorted:')
for entry in entries:
    print(entry['value'])
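
### To make the semantics of the query features used so far ("$exists", projection, and sorting) concrete, here is a minimal in-memory sketch on a plain list of dictionaries. The "find" helper and the documents are hypothetical and only mimic the behavior; this is not how MongoDB or pymongo implement queries:

```python
def find(docs, required_field, projection=None, sort_key=None):
    """Hypothetical in-memory analogue of find({field: {'$exists': True}})."""
    # '$exists': keep only documents that contain the field at all
    matched = [d for d in docs if required_field in d]
    # sort before projecting, since the projection may drop the sort key
    if sort_key is not None:
        matched = sorted(matched, key=lambda d: d[sort_key])
    if projection is not None:
        # like MongoDB, always return '_id' next to the projected fields
        keep = set(projection) | {'_id'}
        matched = [{k: v for k, v in d.items() if k in keep} for d in matched]
    return matched

# Made-up documents; the third one has no 'value' field
docs = [
    {'_id': 1, 'value': 10.1, 'note': 'first'},
    {'_id': 2, 'value': 4.2},
    {'_id': 3, 'other': 'no value field'},
    {'_id': 4, 'value': 8.3},
]

result = find(docs, 'value', projection=['value'], sort_key='value')
values = [d['value'] for d in result]
print(values)
```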

### Finally, let's clean up and delete some entries, the collection and database to conclude the tutorial on pymongo and MongoDB:

In [None]:
# remove entry with field 'value' = 10.1
#coll.remove({'value' : 10.1}, {'justOne': True})
coll.delete_one({'value' : 10.1})
coll.delete_one({'Sarah' : 999})

# remove collection
connection['test_db'].drop_collection('test_coll')

# remove database
connection.drop_database('test_db')

## 2.) Creating a simple model in Tensorflow
### We will now move on to a brief exercise on Tensorflow. We will keep this section short since most of you should be familiar with Tensorflow by now. In the following, we will talk about how to create a Tensorflow graph, execute the graph in a session, variables and variable scopes, placeholders and feed dicts.

### First we will define a 3 layer MLP with 3 input neurons, 20 hidden neurons, and 3 output neurons in Tensorflow:

In [None]:
import tensorflow as tf
import numpy as np

# input placeholder
input_placeholder = tf.placeholder(tf.float32, shape=[None, 3])
# trainable variables randomly initialized
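# note: dtype must be passed by keyword; the second positional argument of tf.Variable is "trainable"
W1 = tf.Variable(tf.random_normal(shape=[3,20]), dtype=tf.float32)
b1 = tf.Variable(tf.random_normal(shape=[20,]), dtype=tf.float32)
W2 = tf.Variable(tf.random_normal(shape=[20,3]), dtype=tf.float32)
b2 = tf.Variable(tf.random_normal(shape=[3,]), dtype=tf.float32)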
# hidden layer
h = tf.nn.sigmoid(tf.matmul(input_placeholder, W1) + b1)
# output
out = tf.matmul(h, W2) + b2

### To run one forward pass of this randomly initialized MLP with some random data we first have to create a session and initialize all variables:

In [None]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1"
# creates a session and allocates memory on GPU 1
sess = tf.Session()
# initialize all variables
init_op = tf.global_variables_initializer()
sess.run(init_op)

### Then we can execute the forward pass with a call to the session's "run()" method, while specifying the output targets and feeding the input data through the feed_dict parameter to the placeholders:

In [None]:
# create some random input and run forward pass on it
inp = np.random.rand(1, 3).astype(np.float32)
result = sess.run(out, feed_dict={input_placeholder: inp})
print(result)

In [None]:
tf.size(W2)

### Now let's train this MLP to regress some arbitrary function "y". In order to do that we need to define a loss and an optimizer that minimizes that loss:

In [None]:
# the input x and output y
x = np.random.rand(1000,3).astype(np.float32)
y = x ** 2 + 3
# labels placeholder
labels_placeholder = tf.placeholder(tf.float32, shape=[None, 3])
# simple l2-loss
loss = tf.nn.l2_loss(out - labels_placeholder)
# Gradient descent optimizer
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
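
### A short worked example of the loss: "tf.nn.l2_loss(t)" computes half the sum of the squared entries of "t", summed over the whole batch rather than averaged. The plain-Python sketch below reproduces that arithmetic on made-up numbers; the helper name "l2_loss" and the values are for illustration only:

```python
def l2_loss(predictions, labels):
    """Plain-Python equivalent of tf.nn.l2_loss(predictions - labels)."""
    total = 0.0
    for pred_row, label_row in zip(predictions, labels):
        for p, t in zip(pred_row, label_row):
            total += (p - t) ** 2
    # half the sum of squares, with no division by the batch size
    return total / 2.0

# Made-up batch of two examples with three outputs each
predictions = [[3.0, 3.5, 4.0], [3.0, 3.0, 3.0]]
labels = [[3.0, 3.0, 3.0], [4.0, 3.0, 3.0]]

# squared errors: 0 + 0.25 + 1 + 1 + 0 + 0 = 2.25, half of that is 1.125
loss_value = l2_loss(predictions, labels)
print(loss_value)
```

### Because the loss is a sum, its magnitude (and hence the effective step size of gradient descent) grows with the batch size.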

### Let's train for 1,000 steps...

In [None]:
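from tqdm import trange
# create the random generator once; re-seeding it inside the loop
# would select the same 32 examples at every step
rng = np.random.RandomState(0)
for i in trange(1000):
    idx = rng.permutation(1000)[:32]
    inputs = x[idx]
    labels = y[idx]
    sess.run(train_step, feed_dict={input_placeholder: inputs, labels_placeholder: labels})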

### ... and evaluate the result:

In [None]:
inp = np.random.rand(1, 3).astype(np.float32)
ground_truth = inp ** 2 + 3
prediction = sess.run(out, feed_dict={input_placeholder: inp})
print('Ground truth', ground_truth) 
print('Prediction', prediction)

### Looking good! Our model approximates the function "y" pretty well! However, there's a better way to feed data to your Tensorflow model, which comes in particularly handy when working with large-scale datasets: TFRecords!

## 3.) Using TFRecords

### TFRecord is a file format for Tensorflow that stores data in a flattened, serialized byte format. Any type of data can be stored in TFRecords as long as it is serializable. Tensorflow implements a TFRecordReader that reads this file format into a Tensorflow graph. In Tensorflow, queues are used to decouple data reading from model execution so that both can run asynchronously. Queues also enable the use of multiple threads for data reading. TFRecords store data sequentially and only allow it to be read in sequential order; random access is not possible. This makes loading data from disk very fast but rules out randomizing the data while reading it. Thus, in order to randomize the input data, as is usually necessary when training machine learning models, a separate postprocessing stage has to take care of the randomization. The nice thing about using queues is that one can simply shuffle the data after loading and before handing it over to the model for the forward pass. 

### In summary, TFRecords together with file and data queues provide an elegant framework to load large amounts of data in a short period of time. This is crucial at a time when GPU compute is often no longer the bottleneck for training large-scale models and training speed instead depends on how fast data can be loaded from hard drives. 
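
### The idea of shuffling sequentially read data with a queue can be sketched in plain Python. The generator below is a simplified, single-threaded analogue of a random shuffle queue with a fixed buffer size; "shuffle_stream" and "buffer_size" are hypothetical names, and Tensorflow's actual queue additionally runs in separate threads:

```python
import random

def shuffle_stream(stream, buffer_size, seed=0):
    """Yield the items of a sequential stream in partially shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        # once the buffer is full, hand out one random element to make room
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # the stream is exhausted: drain what is left in random order
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

# A sequential "file" of 100 records, read in order but emitted shuffled
shuffled = list(shuffle_stream(range(100), buffer_size=10))
print(shuffled[:10])
```

### Note that the shuffling is only local: a record can never be emitted before it has been read, so a small buffer gives weak randomization and a larger buffer trades memory for better mixing.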

### In the following, we will first show you how to write TFRecords. So let's construct some arbitrary data and save it on disk in the tfrecords format:

In [None]:
# Arbitrary data with some arbitrary attributes
data = {
    'idx': np.array(range(100)).astype(np.int32),
    'values': np.array(range(100)).astype(np.int32)[::-1].astype(np.float32),
    'labels': np.array(['large', 'scale', 'neural', 'network', 'modeling', 'for', 'neuroscience', 'is', 'the', 'best!'] * 10)
    }

# Make sure all attributes are of the same length
for k in data:
    assert len(data[k]) == len(data[data.keys()[0]])

# We are storing our data as byte strings
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Create tfrecord file
file_path = '/mnt/fs0/mrowca/notebooks/mrowca/myfile.tfrecords'
writer = tf.python_io.TFRecordWriter(file_path)
for i in range(len(data[data.keys()[0]])):
    # Construct the datum to be written
    feature = {}
    for k in data:
        # serialize the data
        feature[k] = _bytes_feature(data[k][i].tostring())
    datum = tf.train.Example(features = tf.train.Features(feature = feature))
    # Write the data
    writer.write(datum.SerializeToString())
# Close the writer
writer.close()

### Now let's read the data that we have written. There are two ways to do that. One that requires you to use a session as the "tf.TFRecordReader()" is part of the Tensorflow graph, and one that you can use to view the data stored inside of TFRecords without constructing a graph: "tf.python_io.tf_record_iterator()". While the first method is actually used when running large-scale experiments as it can be easily parallelized using multiple queue runners which enables faster loading, the second method is useful for debugging purposes. So let's use "tf.python_io.tf_record_iterator()" to have a brief look at our newly written TFRecords file to see if it has been written correctly.

In [None]:
# Open the tfrecord file
file_path = '/mnt/fs0/mrowca/notebooks/mrowca/myfile.tfrecords'
tfrec = tf.python_io.tf_record_iterator(path=file_path)
datum = tf.train.Example()
# Go to the first entry and decode it
datum.ParseFromString(tfrec.next())
# Print the attribute names in this record
print('Attributes:')
print(datum.features.feature.keys())

In [None]:
# Access the attributes and decode them to the correct data type
idx = np.fromstring(datum.features.feature['idx'].bytes_list.value[0], dtype=np.int32)
value = np.fromstring(datum.features.feature['values'].bytes_list.value[0], dtype=np.float32)
label = datum.features.feature['labels'].bytes_list.value[0]
print('idx:', idx)
print('value:', value)
print('label:', label)

### Everything looks correct for the first entry. If we wanted to examine the second entry, we would simply call "tfrec.next()" again, and so on. We won't do that here though and instead assume the tfrecord has been correctly written. 
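
### One thing to keep in mind when decoding: the serialized bytes carry no type information, so the data type used for decoding has to match the one used for writing. The sketch below illustrates this with Python's "struct" module as a stand-in for the NumPy calls used above; the values are made up:

```python
import struct

# Serialize one int32 and one float32 value to raw little-endian bytes
idx_bytes = struct.pack('<i', 7)
value_bytes = struct.pack('<f', 92.0)

# Decoding with the matching format recovers the original values
idx_decoded = struct.unpack('<i', idx_bytes)[0]
value_decoded = struct.unpack('<f', value_bytes)[0]

# Decoding the float32 bytes as an int32 raises no error,
# it just silently produces a different number
value_misread = struct.unpack('<i', value_bytes)[0]

print(idx_decoded, value_decoded, value_misread)
```

### This is why the decoding code passes an explicit data type for 'idx' and 'values': the reader cannot infer it from the bytes.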

### Now let's use the first method "tf.TFRecordReader()" to read data into a Tensorflow graph that simply outputs the data. We will need "tf.TFRecordReader()" as well as a file queue for that which we can use as follows:

In [None]:
import tensorflow as tf

batch_size = 1
# construct the filename queue that holds all file names in our case only "myfile.tfrecords"
file_path = '/mnt/fs0/mrowca/notebooks/mrowca/myfile.tfrecords'
filename_queue = tf.train.string_input_producer([file_path])

# setup the reader and read the first batch_size examples
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, batch_size)

# decode the example
features = tf.parse_example(
        serialized_example, features={
        'idx': tf.FixedLenFeature([], tf.string),
        'values': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.string),
        })
idx_record = tf.decode_raw(features['idx'], tf.int32)
values_record = tf.decode_raw(features['values'], tf.float32)
labels_record = features['labels']

# enqueue the example onto a random shuffle queue for shuffling and add a queue runner that takes care of this
queue = tf.RandomShuffleQueue(dtypes=[tf.int32, tf.float32, tf.string],
                              capacity=10, 
                              min_after_dequeue=5, 
                              seed = 5)
enqueue_op = queue.enqueue([idx_record, values_record, labels_record])
tf.train.queue_runner.add_queue_runner(tf.train.queue_runner.QueueRunner(queue, [enqueue_op]))

# define the output as dequeuing from the random shuffle queue, so we can use it as a target in sess.run
output = queue.dequeue()

# initialize variables and start queue runners
sess = tf.Session()
init_op = tf.global_variables_initializer()
sess.run(init_op)                                             
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

# run forward pass 
data = sess.run(output)

# print results
print('idx:', data[0][0])
print('value:', data[1][0])
print('label:', data[2][0])

# stop the queue runners
coord.request_stop()
coord.join(threads)

### To summarize, we first created a filename_queue that holds the filenames of all TFRecords we want to load data from. We then used a TFRecordReader to read one example off a TFRecords file on the filename_queue. We decoded this example into its original attributes. We enqueued those attributes onto another queue that we used to shuffle our data using another queue runner. We then dequeued an example from the queue as output, which concluded the construction of our graph. To run the graph, we initialized all variables, started the queue runners, and called the "run()" method, which allowed us to retrieve the outputs and print them out. Finally, after we were done, we had to stop the queue runner threads. TFUtils implements exactly this procedure, and we will have a closer look at how to use it in the next section.

## 4.) Using TFUtils

### TFUtils, in one way or another, uses everything we have discussed so far. It is a great framework to keep track of your experiments in a database. You can train a model, restore its weights, test it, modify it, extract intermediate features, or do all at once while TFUtils is keeping track of the developmental trajectory of your network. 

### In the following, we will first show you how to train a model in TFUtils. For that, let's use the standard MNIST example:

In [None]:
from __future__ import division
from tfutils import base, data, optimizer, utils

# delete exp1
connection['mnist']['simple.files'].delete_many({'exp_id' : 'exp1'})

def mnist_model(inputs, train=True, **kwargs):
    # trainable variables randomly initialized
    with tf.variable_scope("mnist"):
        W1 = tf.get_variable('W1', [784,128], tf.float32, tf.random_normal_initializer(stddev=0.1))
        b1 = tf.get_variable('b1', [128], tf.float32, tf.random_normal_initializer(stddev=0.1))
        W2 = tf.get_variable('W2', [128,32], tf.float32, tf.random_normal_initializer(stddev=0.1))
        b2 = tf.get_variable('b2', [32], tf.float32, tf.random_normal_initializer(stddev=0.1))
        W3 = tf.get_variable('W3', [32,10], tf.float32, tf.random_normal_initializer(stddev=0.1))
        b3 = tf.get_variable('b3', [10], tf.float32, tf.random_normal_initializer(stddev=0.1))
        
        # hidden layers
        h1 = tf.nn.sigmoid(tf.matmul(inputs['images'], W1) + b1, name='hidden1')
        h2 = tf.nn.sigmoid(tf.matmul(h1, W2) + b2, name='hidden2')
        # output
        output = tf.matmul(h2, W3) + b3

    return output, {}

params = {}

params['load_params'] = {
    'do_restore': False}

params['save_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'save_valid_freq': 200,
    'save_filters_freq': 100,
    'cache_filters_freq': 100}

params['train_params'] = {
    'data_params': {'func': data.MNIST,
                    'batch_size': 256,
                    'group': 'train',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'random',
                     'batch_size': 256},
    'num_steps': 100}

params['model_params'] = {
    'func': mnist_model} 

params['learning_rate_params'] = {
    'learning_rate': 0.5,
    'decay_steps': 500,
    'decay_rate': 0.95,
    'staircase': True}

params['optimizer_params'] = {
    'func': optimizer.ClipOptimizer,
    'optimizer_class': tf.train.MomentumOptimizer,
    'momentum': 0.9,
    'clip': True,
}

params['loss_params'] = {
    'targets': ['labels'],
    'loss_per_case_func': tf.nn.sparse_softmax_cross_entropy_with_logits,
    'agg_func': tf.reduce_mean
}

params['skip_check'] = True

base.train_from_params(**params)
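
### A note on "learning_rate_params": the keys match the arguments of "tf.train.exponential_decay". Assuming TFUtils forwards them to that function (an assumption, not something shown in this tutorial), the resulting schedule can be sketched in plain Python; "decayed_learning_rate" is a hypothetical helper name:

```python
def decayed_learning_rate(step, learning_rate=0.5, decay_steps=500,
                          decay_rate=0.95, staircase=True):
    """Exponential decay: learning_rate * decay_rate ** (step / decay_steps)."""
    exponent = step / float(decay_steps)
    if staircase:
        # integer division makes the schedule piecewise constant
        exponent = step // decay_steps
    return learning_rate * decay_rate ** exponent

# With staircase=True the rate only changes every decay_steps steps
for step in (0, 499, 500, 1000, 3000):
    print(step, decayed_learning_rate(step))
```

### Under this reading, the learning rate stays at 0.5 for the first 500 steps and is then multiplied by 0.95 after every further 500 steps.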

### As you can see, all you need to train a model in TFUtils is to define "params" and call "base.train_from_params(**params)". TFUtils then executes the specified experiment, in this case training a model on MNIST, a 10-way digit recognition task. As you can see, the loss decreases, and at the end of the experiment the trained model is saved in the database as specified by "save_params". Now let's load the trained model from the database and test it on the validation set using TFUtils:

In [None]:
def top1_func(inputs, outputs): 
    return {'top1': tf.nn.in_top_k(outputs, inputs['labels'], 1)}

def online_agg(agg_res, res, step):
    """
    Appends the value for each key
    """
    if agg_res is None:
        agg_res = {k: [] for k in res}
    for k, v in res.items():
        agg_res[k].append(v)
    return agg_res

def agg_mean(x):
    return {k: np.mean(v) for k, v in x.items()}

params = {}

params['load_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'do_restore': True}

params['validation_params'] = {'valid0': {
    'data_params': {'func': data.MNIST,
                    'batch_size': 100,
                    'group': 'test',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'fifo',
                     'batch_size': 100},
    'targets': {'func': top1_func},
    'num_steps': 100,
    'agg_func': agg_mean,
    'online_agg_func': online_agg,}}

params['model_params'] = {
    'func': mnist_model}

params['skip_check'] = True

base.test_from_params(**params)

# Extract record from database
q_val = {'exp_id' : 'exp1', 'validation_results' : {'$exists' : True}, 'validates': {'$exists': True}}
val_steps = connection['mnist']['simple.files'].find(q_val, projection = ['validation_results'])
top1 = [val_steps[i]['validation_results']['valid0']['top1'] 
        for i in range(val_steps.count())]
print(top1)
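
### To see how the two aggregation functions work together, here is a plain-Python sketch that assumes "online_agg_func" is called once per validation batch and "agg_func" once at the end (our reading of the parameter names, not confirmed here). The batch results are made up, and "agg_mean_flat" is a hypothetical stand-in for "agg_mean" that avoids NumPy:

```python
def online_agg(agg_res, res, step):
    """Append the value of each key to a running list (same logic as above)."""
    if agg_res is None:
        agg_res = {k: [] for k in res}
    for k, v in res.items():
        agg_res[k].append(v)
    return agg_res

def agg_mean_flat(agg_res):
    """Average all collected values per key (stand-in for agg_mean)."""
    out = {}
    for k, batches in agg_res.items():
        flat = [float(x) for batch in batches for x in batch]
        out[k] = sum(flat) / len(flat)
    return out

# Made-up per-batch top-1 results: True where the prediction was correct
batch_results = [
    {'top1': [True, True, False, True]},
    {'top1': [True, False, False, True]},
]

agg = None
for step, res in enumerate(batch_results):
    agg = online_agg(agg, res, step)

# 5 correct predictions out of 8 examples
accuracy = agg_mean_flat(agg)
print(accuracy)
```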

### Of course you can also put train and validation together. So let's restore the training, train some more and validate every 100 steps:

In [None]:
params = {}

params['load_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'do_restore': True}

params['save_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'save_valid_freq': 100,
    'save_filters_freq': 100,
    'cache_filters_freq': 100}

params['train_params'] = {
    'data_params': {'func': data.MNIST,
                    'batch_size': 256,
                    'group': 'train',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'random',
                     'batch_size': 256},
    'num_steps': 3000}

params['learning_rate_params'] = {
    'learning_rate': 0.5,
    'decay_steps': 500,
    'decay_rate': 0.95,
    'staircase': True}

params['optimizer_params'] = {
    'func': optimizer.ClipOptimizer,
    'optimizer_class': tf.train.MomentumOptimizer,
    'momentum': 0.9,
    'clip': True,
}

params['loss_params'] = {
    'targets': ['labels'],
    'loss_per_case_func': tf.nn.sparse_softmax_cross_entropy_with_logits,
    'agg_func': tf.reduce_mean
}

params['validation_params'] = {'valid0': {
    'data_params': {'func': data.MNIST,
                    'batch_size': 100,
                    'group': 'test',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'fifo',
                     'batch_size': 100},
    'targets': {'func': top1_func},
    'num_steps': 100,
    'agg_func': agg_mean,
    'online_agg_func': online_agg,}}

params['model_params'] = {
    'func': mnist_model}

params['skip_check'] = True

base.train_from_params(**params)

# Extract record from database
q_val = {'exp_id' : 'exp1', 'validation_results' : {'$exists' : True}}
val_steps = connection['mnist']['simple.files'].find(q_val, projection = ['validation_results'])
top1 = [val_steps[i]['validation_results']['valid0']['top1'] 
        for i in range(val_steps.count())]
print(top1)

In [None]:
plt.plot(top1)
plt.grid()
plt.title('MNIST Accuracy')

### Great! We have reached 98 % accuracy. Let's add another hidden layer and see if we can improve the result by keeping our already trained layers fixed and only training the new layer:

In [None]:
def mnist_model_with_additional_layer(inputs, train=True, **kwargs):
    # not trainable variables; Note how we set trainable=False
    with tf.variable_scope("mnist"):
        W1 = tf.get_variable('W1', [784,128], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False)
        b1 = tf.get_variable('b1', [128], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False)
        W2 = tf.get_variable('W2', [128,32], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False)
        b2 = tf.get_variable('b2', [32], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False)
        W3 = tf.get_variable('W3', [32,10], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False)
        b3 = tf.get_variable('b3', [10], tf.float32, tf.random_normal_initializer(stddev=0.1), trainable=False) 
        
        # hidden layers
        h1 = tf.nn.sigmoid(tf.matmul(inputs['images'], W1) + b1, name='hidden1')
        h2 = tf.nn.sigmoid(tf.matmul(h1, W2) + b2, name='hidden2')
    
        output = tf.nn.sigmoid(tf.matmul(h2, W3) + b3, name='hidden3')
    
        # new layer, trainable, randomly initialized
        W4 = tf.get_variable('W4', [10,10], tf.float32, tf.random_normal_initializer(stddev=0.1))
        b4 = tf.get_variable('b4', [10], tf.float32, tf.random_normal_initializer(stddev=0.1))
        output = tf.matmul(output, W4) + b4

    return output, {}

params = {}

params['load_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'do_restore': True}

params['save_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'save_valid_freq': 100,
    'save_filters_freq': 100,
    'cache_filters_freq': 100}

params['train_params'] = {
    'data_params': {'func': data.MNIST,
                    'batch_size': 256,
                    'group': 'train',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'random',
                     'batch_size': 256},
    'num_steps': 5000}

params['learning_rate_params'] = {
    'learning_rate': 0.5,
    'decay_steps': 500,
    'decay_rate': 0.95,
    'staircase': True}

params['optimizer_params'] = {
    'func': optimizer.ClipOptimizer,
    'optimizer_class': tf.train.MomentumOptimizer,
    'momentum': 0.9,
    'clip': True,
}

params['loss_params'] = {
    'targets': ['labels'],
    'loss_per_case_func': tf.nn.sparse_softmax_cross_entropy_with_logits,
    'agg_func': tf.reduce_mean
}

params['validation_params'] = {'valid0': {
    'data_params': {'func': data.MNIST,
                    'batch_size': 100,
                    'group': 'test',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'fifo',
                     'batch_size': 100},
    'targets': {'func': top1_func},
    'num_steps': 100,
    'agg_func': agg_mean,
    'online_agg_func': online_agg,}}

params['model_params'] = {
    'func': mnist_model_with_additional_layer}

params['skip_check'] = True

base.train_from_params(**params)

# Extract record from database
q_val = {'exp_id' : 'exp1', 'validation_results' : {'$exists' : True}}
val_steps = connection['mnist']['simple.files'].find(q_val, projection = ['validation_results'])
top1 = [val_steps[i]['validation_results']['valid0']['top1'] 
        for i in range(val_steps.count())]
print(top1)

In [None]:
plt.plot(top1)
plt.grid()
plt.title('MNIST Accuracy')

### This clearly did not work. Although the performance keeps increasing after evaluation step 30, when the new layer was added, it seems to level off at 94 %. This example was only meant to illustrate how you can reload weights from your trained model, keep the trained layers fixed, add new layers, and train only the newly added layers. So let's not do it again. 

### The last thing that we will show you in TFUtils is how to extract features from the model graph.


In [None]:
import re
from collections import defaultdict

def get_extraction_target(inputs, outputs, to_extract, **loss_params):
    names = [[x.name for x in op.values()] for op in tf.get_default_graph().get_operations()]
    names = [y for x in names for y in x]

    #print('nodes in graph: ', names)
    r = re.compile(r'__GPU__\d/')
    _targets = defaultdict(list)

    for name in names:
        name_without_gpu_prefix = r.sub('', name)
        for save_name, actual_name in to_extract.items():
            if actual_name in name_without_gpu_prefix:
                tensor = tf.get_default_graph().get_tensor_by_name(name)
                _targets[save_name].append(tensor)

    targets = {k: tf.concat(v, axis=0) for k, v in _targets.items()}
    return targets

def online_agg_no_mean(agg_res, res, step):
    """
    Appends the value for each key
    """
    if agg_res is None:
        agg_res = {k: [] for k in res}
    for k, v in res.items():
        agg_res[k].append(v)
    return agg_res

def agg_no_mean(x):
    return {k: v for k, v in x.items()}

params = {}

params['load_params'] = {
    'host': 'localhost',
    'port': 24444,
    'dbname': 'mnist',
    'collname': 'simple',
    'exp_id': 'exp1',
    'do_restore': True}

params['validation_params'] = {'valid0': {
    'data_params': {'func': data.MNIST,
                    'batch_size': 1,
                    'group': 'test',
                    'n_threads': 1},
    'queue_params': {'queue_type': 'fifo',
                     'batch_size': 1},
    'targets': {'func': get_extraction_target,
               'to_extract': {
                   'W2': 'mnist/W2:0',
                   'hidden2': 'model_0/validation/valid0/mnist/hidden2:0'}},
    'num_steps': 1,
    'agg_func': agg_no_mean, # we do not want to mean the features that we extract
    'online_agg_func': online_agg_no_mean,}}

params['model_params'] = {
    'func': mnist_model}

params['skip_check'] = True

base.test_from_params(**params)

# print shape of retrieved weights and activations
q_val = {'exp_id' : 'exp1', 'validation_results' : {'$exists' : True}, 'validates': {'$exists' : True}}
val_steps = connection['mnist']['simple.files'].find(q_val, projection = ['validation_results'])
idx = val_steps.count() - 1
print('W2', np.array(val_steps[idx]['validation_results']['valid0']['W2'])[0].shape)
print('hidden2', np.array(val_steps[idx]['validation_results']['valid0']['hidden2'])[0].shape)
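
### The name matching inside "get_extraction_target" can be tried out without a graph. The sketch below applies the same prefix stripping and substring test to a list of tensor names; "match_names" is a hypothetical helper and the names are made up to resemble what a graph might contain:

```python
import re
from collections import defaultdict

def match_names(names, to_extract):
    """Group tensor names by the save name whose pattern they contain."""
    gpu_prefix = re.compile(r'__GPU__\d/')
    matches = defaultdict(list)
    for name in names:
        # remove the per-GPU prefix before comparing
        stripped = gpu_prefix.sub('', name)
        for save_name, actual_name in to_extract.items():
            if actual_name in stripped:
                matches[save_name].append(name)
    return dict(matches)

# Hypothetical tensor names
names = [
    '__GPU__0/mnist/W1:0',
    '__GPU__0/mnist/W2:0',
    '__GPU__0/model_0/validation/valid0/mnist/hidden1:0',
    '__GPU__0/model_0/validation/valid0/mnist/hidden2:0',
]
to_extract = {
    'W2': 'mnist/W2:0',
    'hidden2': 'model_0/validation/valid0/mnist/hidden2:0',
}

matched = match_names(names, to_extract)
print(matched)
```

### Because the test is a substring match, a pattern that is too short can pick up more tensors than intended, which is why the patterns include the scope and the ":0" output suffix.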

### This concludes the tutorial on TFUtils. We will now diverge a little bit from deep learning and introduce some useful tools to analyze neural data, such as tabular, scikit-learn and dldata.

## 5.) Working with the tabular meta data from dldata
### During your analysis of the neural data, you will be working with our lab-internal python package "dldata". "dldata" contains stimuli in the form of images and the neural responses of a macaque to those images, accompanied by some meta data. The meta data contains information about the composition of the stimuli, such as which category is depicted or where the object is located. The meta data itself is a tabarray, which allows for an easy way of subsetting the data. So let's first load the meta data and the IT features from a pickle file that we have created and put on your instances, and then learn how to subsample the data using the meta data:

In [None]:
# load the neural data from a pickle file
import cPickle
data_path = '/home/mrowca/neural_data.pkl' # CHANGE THIS TO '/datasets/neural_data/neural_data.pkl'
# pickle files are binary, so open the file in 'rb' mode
with open(data_path, 'rb') as f:
    data = cPickle.load(f)
meta = data['meta']
IT_features = data['IT']

### Now, let's list all the fields within meta...

In [None]:
meta.dtype.names

### ...and have a look at the contents of 'obj':

In [None]:
meta['obj']

### To get all unique entries we can use np.unique:

In [None]:
np.unique(meta['obj'])

### Let's see how many unique 'obj' there are:

In [None]:
len(np.unique(meta['obj']))

### Now let's look at the unique 'categories':

In [None]:
np.unique(meta['category'])

### Let's say we now want to find all objects ('obj') that belong to the category 'Cars':

In [None]:
np.unique(meta[meta['category'] == 'Cars']['obj'])

### We can also use intersections and unions to subselect our data. Let's say we want to find all objects ('obj') that belong to either 'Cars' or 'Faces':

In [None]:
np.unique(meta[(meta['category'] == 'Cars') | (meta['category'] == 'Faces')]['obj'])

### Now let's find out how many images have an object on the left side of the image:

In [None]:
len((meta['ty'] < 0).nonzero()[0])

### We can also combine 'category' with 'ty' and look for, e.g., all 'Cars' objects with at least one instance on the left side of the screen:

In [None]:
np.unique(meta[(meta['category'] == 'Cars') & (meta['ty'] < 0)]['obj'])

### To conclude, you can combine the fields of a tabarray with logical operators in any way you like to subselect exactly the data you need. This enables us to save and query data in a structured way.
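
### The same masking pattern can be tried without access to dldata. The following is a minimal sketch that uses a plain NumPy structured array as a stand-in for the tabarray; the array "toy_meta" and all of its values are made up for illustration and only the field names mirror the real meta data:

```python
import numpy as np

# Hypothetical toy meta data; field names mirror the tutorial, values are made up.
toy_meta = np.array(
    [('car_a', 'Cars', -0.3),
     ('car_b', 'Cars', 0.2),
     ('face_a', 'Faces', -0.1),
     ('face_b', 'Faces', 0.4),
     ('car_a', 'Cars', 0.5)],
    dtype=[('obj', 'U10'), ('category', 'U10'), ('ty', 'f8')])

# Build each condition as a boolean mask, then combine the masks
# with & (and), | (or) and ~ (not).
is_car = toy_meta['category'] == 'Cars'
is_left = toy_meta['ty'] < 0

# Cars that appear on the left at least once
left_cars = np.unique(toy_meta[is_car & is_left]['obj'])
# Everything that is either not a car or appears on the left
not_cars_or_left = np.unique(toy_meta[~is_car | is_left]['obj'])
# Number of rows with an object on the left
n_left = int(is_left.sum())
```

### Naming the masks first avoids a common pitfall: when conditions are written inline, each comparison needs its own parentheses, because & and | bind more tightly than == and <.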

## 6.) Training and testing a classifier with scikit-learn and dldata
### During the analysis of neural responses it is useful to see how well the recorded features can be used to disentangle object categories using a simple linear classifier. In the following we will thus show you how to train a linear support vector classifier on top of IT features to perform a basic 8-way classification task using scikit-learn:

In [None]:
from sklearn.svm import LinearSVC

# Create a random train / test split and
# use 75 % of the V0 data for training and the rest for testing
v0_inds = (meta['var'] == 'V0').nonzero()[0]
n = len(v0_inds)
# Draw the permutation once and cut it into a train and a test part
perm = np.random.RandomState(0).permutation(n)
n_train = 3 * n // 4
train_inds = v0_inds[perm[:n_train]]
test_inds = v0_inds[perm[n_train:]]

# Subselect the train and test IT features
train_features = IT_features[train_inds]
test_features = IT_features[test_inds]

# Subselect the train and test category labels
train_labels = meta['category'][train_inds]
test_labels = meta['category'][test_inds]

# Convert labels to integers
cat_id = dict([(category, i) for i, category in enumerate(np.unique(meta['category']))])
train_labels = np.array([cat_id[label] for label in train_labels])
test_labels = np.array([cat_id[label] for label in test_labels])

# Train LinearSVC (Linear Support Vector Classifier)
cls = LinearSVC(C = 5e-3)
cls.fit(train_features, train_labels)

# Test LinearSVC
prediction = cls.predict(test_features)
# np.mean avoids the integer division that '/' performs on integers in Python 2
accuracy = np.mean(prediction == test_labels) * 100

# Print results
print('Overall accuracy of IT features on 8-way classification task: %.2f%%' % accuracy)

### As you can see, a significant amount of code is needed to perform a classification experiment on only one split! All of this code is abstracted away in dldata into a single function called "compute_metric_base", which takes care of everything we need, including the creation of multiple train / test splits. This means that we can set up the same experiment with a simple spec that we then pass to "compute_metric_base" to perform our classification experiment with multiple splits as follows:

In [None]:
from dldata.metrics.utils import compute_metric_base
# Definition of classification experiment
category_eval_spec = {
    'npc_train': None,
    'npc_test': 2,
    'num_splits': 20,
    'npc_validate': 0,
    'metric_screen': 'classifier',
    'metric_labels': None,
    'metric_kwargs': {'model_type': 'svm.LinearSVC',
                      'model_kwargs': {'C':5e-3}
                     },
    'labelfunc': 'category',
    'train_q': {'var': ['V0']},
    'test_q': {'var': ['V0']},
    'split_by': 'obj'
}
# Execute classification experiment
res = compute_metric_base(IT_features, meta, category_eval_spec)

# Print results
print('Overall accuracy of IT features on 8-way classification task: %.2f%%' % \
      ((np.array(res['result_summary']['accbal']).mean(0) - 0.5) * 2.0 * 100.0))
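
### To make the idea of multiple train / test splits concrete, here is a minimal sketch that repeats a random split several times and collects the accuracy of each split. It is not the dldata implementation: it does not split by 'obj', the data is synthetic, and a simple nearest-centroid classifier stands in for LinearSVC so that the sketch only depends on NumPy. The function names "nearest_centroid_accuracy" and "accuracy_over_splits" are hypothetical:

```python
import numpy as np

def nearest_centroid_accuracy(features, labels, train_inds, test_inds):
    """Fit class centroids on the training rows and score the test rows."""
    train_x, train_y = features[train_inds], labels[train_inds]
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test row to every centroid
    diffs = features[test_inds][:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return np.mean(predictions == labels[test_inds])

def accuracy_over_splits(features, labels, num_splits=20, train_fraction=0.75):
    """Repeat a random train / test split num_splits times, one seed per split."""
    n = len(labels)
    n_train = int(train_fraction * n)
    accuracies = []
    for seed in range(num_splits):
        perm = np.random.RandomState(seed).permutation(n)
        accuracies.append(nearest_centroid_accuracy(
            features, labels, perm[:n_train], perm[n_train:]))
    return np.array(accuracies)

# Synthetic, well-separated data: 3 classes, 40 samples each, 5 features
rng = np.random.RandomState(0)
toy_labels = np.repeat(np.arange(3), 40)
toy_features = rng.randn(120, 5) + 10.0 * toy_labels[:, None]

accs = accuracy_over_splits(toy_features, toy_labels, num_splits=5)
print('Mean accuracy over %d splits: %.2f%%' % (len(accs), accs.mean() * 100))
```

### Averaging over several splits gives a more stable estimate than a single split, and the spread of the per-split accuracies indicates how much the result depends on the particular split.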

### 'compute_metric_base' also returns a number of other evaluation metrics such as dprime or the confusion matrix. Feel free to have a look at res['result_summary'] to see which metrics are returned. To plot the confusion matrix, for instance, we can execute the following straightforward commands:

In [None]:
# Confusion matrix
import matplotlib.pyplot as plt  # needed if not already imported earlier in the notebook
fig = plt.figure(figsize=(8, 8))
m = fig.gca().matshow(np.array(res['result_summary']['cms']).mean(2))
plt.xticks(range(8), res['result_summary']['labelset'])
plt.yticks(range(8), res['result_summary']['labelset'])
plt.colorbar(m)
plt.title('8-way categorization task - across category confusion matrix')
ax = plt.gca()
ax.xaxis.tick_bottom()

### This example hopefully made it clear why it is simpler to use "compute_metric_base" for simple classification or regression experiments. Nonetheless, we wanted to show you the inner workings of "compute_metric_base", so you can implement your own machine learning tools with scikit-learn in case the particular functionality you are looking for is not implemented in "compute_metric_base".
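
### As one small example of such a hand-written tool, a confusion matrix can be computed directly from integer class labels with NumPy. This is a minimal sketch with made-up labels. The convention used here (rows are true classes, columns are predicted classes) is an assumption of this sketch and not necessarily the layout of 'cms' in dldata, and the function name "confusion_matrix" is our own:

```python
import numpy as np

def confusion_matrix(true_ids, predicted_ids, num_classes):
    """Count how often each true class (rows) is predicted as each class (columns)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats
    np.add.at(cm, (true_ids, predicted_ids), 1)
    return cm

# Made-up integer labels for a 3-way task
true_ids = np.array([0, 0, 1, 1, 2, 2, 2])
predicted_ids = np.array([0, 1, 1, 1, 2, 0, 2])
cm = confusion_matrix(true_ids, predicted_ids, num_classes=3)

# Per-class recall is the diagonal divided by the row sums
recall = cm.diagonal() / cm.sum(axis=1).astype(float)
```

### The diagonal holds the correctly classified samples, so off-diagonal entries show which categories are confused with each other.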

## 7.) Putting everything together
### At this point, you should have learned everything needed to set up a large-scale deep learning experiment and be able to understand most of the code in assignment 1, which is a great example of how everything that we have discussed fits together. Please have a look at the code of assignment 1. Conceptually, we have set up assignment 1 in the following way:  
### 1.) We generated TFRecord files for the ImageNet dataset and the neural data including all of the meta data needed and wrote dataproviders that read those TFRecords into TFUtils in "dataprovider.py".  
### 2.) We then wrote the "train_imagenet.py" script, which specifies a TFUtils training experiment that trains AlexNet on ImageNet and saves the training results into our database. While the experiment is running, the accuracy on the ImageNet classification task is evaluated and stored in the database to see if the model is training correctly.
### 3.) We then created "test_imagenet.py", which pulls the pretrained AlexNet model from the database and evaluates it using, among other tools, "compute_metric_base" from dldata with the help of the meta data that is read from the TFRecords and piped through the network. Essentially, we aggregate all of our network activations for all neural stimuli in our "agg_func" "neural_analysis" and perform our classification and regression experiments on the aggregated data. We store the evaluation results in the database.
### 4.) Finally, we pull the evaluation results from the database in our Jupyter notebook, and evaluate and visualize them. 
### In general, this is how most of the experiments are done in our lab. TFRecords allow us to read data fast. TFUtils helps us to keep organized. And Jupyter notebooks are a great and interactive way to visualize results.
### This concludes this tutorial on CNNs and Neural data. Good luck on assignment 1!