# Creating a Base Benchmark for Motivating an ASIC DNN Framework

I'm calling this benchmarking for idiots, since reading the wall clock and comparing execution times is a simplistic approach to hardware benchmarking. If we were to put serious effort into this study, we'd get closer to the metal and see what the "production" execution time looks like. In addition, I'm currently working on the same computer, using both its CPU and GPU, so we're going to need error bars to show the uncertainty in our measurements.
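
As a rough illustration of what those error bars could be built from, here is a minimal sketch that reduces a list of repeated timings to a mean and a standard error. The helper name `summarize_timings` and the sample numbers are made up for illustration; they are not part of this notebook or its results.

```python
import math
import statistics

def summarize_timings(samples):
    """Return (mean, standard error of the mean) for timings in seconds."""
    mean = statistics.mean(samples)
    # A spread estimate needs at least two measurements.
    if len(samples) < 2:
        return mean, 0.0
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem

# Illustrative, made-up timings (seconds); not measurements from this study.
example = [0.010, 0.012, 0.011, 0.013]
mean, sem = summarize_timings(example)
print(f"{mean:.4f} s +/- {sem:.4f} s")
```

The standard error shrinks as more repetitions are collected, which is one reason to repeat each measurement many times on a machine that is busy with other work.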

First, the libraries we're going to use:

* mvnc is the Movidius interface library. In the future, I intend to wrap some of this up
* matplotlib is being used in case we need fine control of the plotting
* subprocess is necessary to call the movidius kernel compiler
* tensorflow is our NN library du jour
* pandas is being used for data processing
* numpy is being used to generate fake data
* time is being used to benchmark the execution time
* csv is being used to record incremental results
* Finally, seaborn is being used to make the plots... pretty.


In [1]:
from mvnc import mvncapi as mvnc
import matplotlib.pyplot as plt
from subprocess import call
import tensorflow as tf
import pandas as pd
import numpy as np
import time
import csv

import seaborn as sns

%matplotlib notebook
sns.set_context("notebook")
sns.set_style("whitegrid")
sns.set_palette("Set1",8, 0.75)


## The Movidius Device
Right off the bat, I'm going to make sure that the Movidius device is initialized properly.

In [2]:
devices = mvnc.EnumerateDevices()
if len(devices) == 0:
    # Stop here: indexing devices[0] below would fail on an empty list
    raise RuntimeError('No devices found')
    
device = mvnc.Device(devices[0])
device.OpenDevice()

## Network Definition

This is the perceptron network model. The idea is to do 2-class classification on 100 inputs, with intermediate layers that are just 100x100 perceptron layers. To test latency as a function of depth, we increase the number of hidden layers (and thus parameters) and measure the execution time.

In [3]:
def multilayer_perceptron( n_layers):
    """ Create a multilayer perceptron network with one 100x100 input
    layer, n_layers further 100x100 layers and a 2-class softmax output,
    and return the input placeholder and the output node.
    """
    # First Layer definition
    X = tf.placeholder("float", [None, 100], name='input')
    x = tf.add(tf.matmul(X, tf.Variable(tf.random_normal([100, 100]))),
                                  tf.Variable(tf.random_normal([100])))
    # Add in layers up to n_layers
    for _ in range(n_layers):
        x = tf.add(tf.matmul(x, tf.Variable(tf.random_normal([100, 100]))),
                                tf.Variable(tf.random_normal([100])))
        
    # Output fully connected layer with a neuron for each class
    x = tf.add(tf.matmul(x, tf.Variable(tf.random_normal([100, 2]))),
                                  tf.Variable(tf.random_normal([2])))
    x = tf.nn.softmax(x, name='output')
    return X, x
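
The parameter count of this network follows directly from the layer shapes, and it can be checked without TensorFlow. The helper below is a hypothetical sketch (the name `count_parameters` and its `width` and `n_classes` arguments are not part of the notebook); it assumes the input size and the hidden width are both 100, as in the definition above.

```python
def count_parameters(n_layers, width=100, n_classes=2):
    """Weights plus biases for the network built by multilayer_perceptron(n_layers)."""
    first = width * width + width                 # first layer: 100x100 weights + 100 biases
    hidden = n_layers * (width * width + width)   # the n_layers layers added in the loop
    output = width * n_classes + n_classes        # output layer: 100x2 weights + 2 biases
    return first + hidden + output

for n in (0, 1, 100):
    print(n, count_parameters(n))
```

Each extra hidden layer adds 10,100 parameters, so the count grows linearly with depth.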

## Experiment Definition

Furthermore, I'm going to wrap the network calls so that the CPU and GPU runs don't need redundant code. The wrapper returns the number of parameters in the network and a list of execution times.

In [None]:
save_path = None
def run_for_layers(n_layers=1, save=False, n_iterations=100):
    """
    :type n_layers: int The number of hidden layers in the network -1
    :type save: bool Whether or not to save the checkpoint of the model
    :type n_iterations: int The number of experiments to repeat
    :rtype: int, list The number of parameters and a list of the execution times for the model
    """
    global save_path
    X, logits = multilayer_perceptron(n_layers)
    init = tf.global_variables_initializer()
    # If we're saving the model, create a saver
    saver = None
    if save:
        saver = tf.train.Saver()
    with tf.Session() as sess:
        # Compute the number of parameters in this network
        x=100*100+100+100*2+2 + n_layers*(100*100+100)
        y=[]
        for _ in range(n_iterations):        
            sess.run(init)
            # Start the clock, run the network, and then stop the clock.
            start = time.time()
            sess.run([logits, ], feed_dict={X: np.ndarray(shape=(100, 100))})
            end = time.time()
            
            y.append(end-start)
        if save:
            save_path = saver.save(sess, "./model")
        return x, y
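
The timing loop itself does not depend on TensorFlow, so the same pattern can be sketched around any callable. The version below is a variation rather than the code used in this study: it uses `time.perf_counter` instead of `time.time`, and it discards a warm-up call, which is an assumption of mine and not something the wrapper above does. The name `time_calls` is hypothetical.

```python
import time

def time_calls(fn, n_iterations=100, n_warmup=1):
    """Call fn repeatedly and return the per-call wall-clock times in seconds."""
    # Warm-up calls are executed but not recorded.
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical stand-in for sess.run(...) on a real network.
timings = time_calls(lambda: sum(range(1000)), n_iterations=50)
print(len(timings), min(timings))
```

Passing a callable keeps the clock handling in one place, so the CPU, GPU and ASIC paths could in principle share it.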

## Script to run the full sequence

In [None]:

with open('results.csv', 'w') as csvfile:
    # Record the data to a csv file
    fieldnames = ['N Params', 'Exec Time', "Device"]
    writer = csv.DictWriter(csvfile, fieldnames)
    writer.writeheader()
    for i in range(11):
        # Run the CPU computation
        with tf.device('/cpu:0'):
            x, y = run_for_layers(i*100, False, 1000)
            for j in range(len(y)):
                output = {'N Params':x, 
                          'Exec Time': y[j],
                          "Device":"CPU"}
                writer.writerow(output)
        # Run the GPU computation
        with tf.device('/gpu:0'):
            x, y = run_for_layers(i*100, False, 1000)
            for j in range(len(y)):
                output = {'N Params':x, 
                          'Exec Time': y[j],
                          "Device":"GPU"}
                writer.writerow(output)
            csvfile.flush()
        # Run the ASIC Computation
        with tf.device('/cpu:0'):
            # This is required to save the network checkpoint
            run_for_layers(i*100, True, 1)
        # Compile the checkpoint into something the Movidius stick can understand
        call(['/usr/local/bin/mvNCCompile','model.meta','-w','model',
              '-s','12','-in','input','-on','output','-o','model.graph'])
        
        with open('./model.graph', mode='rb') as f:    
            # Load the model onto the stick
            graphfile = f.read()
            graph = device.AllocateGraph(graphfile)
            x=[]
            y=[]
            for j in range(1000):
                # Start the clock, move data over, then get the results and stop the clock.
                start = time.time()
                graph.LoadTensor(np.ndarray(shape=(100, 100)), 'user object')
                output, userobj = graph.GetResult()
                end = time.time()
                output = {'N Params': 100*100+100+100*2+2 + 100*i*(100*100+100), 
                          'Exec Time': end-start,
                          "Device":"ASIC"}
                writer.writerow(output)
            # Don't forget to cleanup!
            graph.DeallocateGraph()
            
            csvfile.flush()


## Plotting 

We're going to plot the execution time as a function of the number of parameters, separate the results by device type, and prettify the plot.

To that end, we'll use seaborn, since this is the 1-line solution.

In [None]:
df = pd.read_csv("results.csv")
ax = sns.pointplot(x="N Params", y="Exec Time", hue="Device", data=df)
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.xlabel("Number of Network Parameters")
plt.ylabel("Execution Time [s]")
plt.tight_layout()
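
By default, `sns.pointplot` draws the mean of each group together with a bootstrapped confidence interval. For readers who want the numbers behind the points, a minimal sketch of a comparable per-group summary in pandas follows. The small `df_demo` frame holds made-up values purely to show the shape of the computation; with real data, the frame read from `results.csv` would take its place.

```python
import pandas as pd

# Made-up timings to demonstrate the computation; not measured results.
df_demo = pd.DataFrame({
    "N Params": [10302, 10302, 10302, 10302],
    "Exec Time": [0.010, 0.014, 0.002, 0.004],
    "Device": ["CPU", "CPU", "GPU", "GPU"],
})

# One row per (device, network size) with the mean, spread and sample count.
summary = (
    df_demo.groupby(["Device", "N Params"])["Exec Time"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
print(summary)
```

Note that the standard deviation here is not the same quantity as the bootstrapped interval seaborn draws, so the two will not match exactly.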

For completeness, the following devices were used in this study.

CPU: Intel® Core™ i7-5500U CPU @ 2.40GHz × 4 

GPU: GeForce GTX 950M/PCIe/SSE2

ASIC: Intel Movidius Neural Compute Stick