## On the use of TF layers in the building of a Convolutional Neural Network
#### *(draft)*

### Overview
A *convolutional neural network (CNN)* usually contains three kinds of component.

1. A *convolutional layer* applies a set number of filters to the input. Each filter has a specific patch size and slides over subregions of the input. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output **feature map**. Convolutional layers usually then apply a *non-linear* activation function to introduce non-linearities into the model; a *ReLU* (rectified linear unit) is the standard choice.
2. A *pooling layer* downsamples the data extracted by the convolutional layer, reducing the dimensionality of the feature map and therefore the processing time. A common strategy is *max pooling*, which takes each subregion of size *n x m* of the feature map, keeps its maximum value and discards all the others.
3. A *dense* (fully connected) layer comes after the stack of *conv + pool* layers; each of its neurons (or *nodes*) is connected to every neuron in the previous layer.
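
The first two components can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the way `tf.layers` implements them: the helper names (`conv1d_valid`, `relu`, `max_pool1d`) and the toy signal and filter are assumptions made for illustration. As in most deep learning libraries, the "convolution" here is a sliding dot product, with no kernel flipping.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (no padding); one output value per position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectified linear unit: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size, stride):
    """Keep only the maximum of each window of `pool_size` values."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, stride)
    ]


signal = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0]  # toy 1D input
kernel = [1.0, -1.0]                      # hypothetical filter with patch size 2

feature_map = relu(conv1d_valid(signal, kernel))
pooled = max_pool1d(feature_map, pool_size=2, stride=2)

print(feature_map)  # [0.0, 3.0, 0.0, 0.0, 2.0]
print(pooled)       # [3.0, 0.0]
```

With `pool_size` equal to `stride` the windows do not overlap, which is the situation described for the first pooling layer further down.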

A typical CNN is made of a stack of convolutional modules that perform *feature extraction*, followed by one or more dense (fully connected) layers that perform *classification*. The final layer contains one neuron per *target*, e.g. a label; its raw outputs are called *logits*, and a *softmax* activation turns them into values between *0* and *1* that sum to 1, answering the question *how likely is my input to be described by each of my targets?*
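
To make the softmax step concrete, here is a small sketch in plain Python. The function name and the example logits are assumptions chosen for illustration; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical logits for three targets
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest logit always receives the largest probability, so the predicted target is simply the index of the maximum.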

In [1]:
# code skeleton
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import numpy as np

# imports needed to build and evaluate the CNN
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib

# set threshold for logging output produced
tf.logging.set_verbosity(tf.logging.INFO)

### CNN structure

##### - convolutional modules -
- conv layer #1: applies 256 filters, each spanning a subregion of 128x6, with ReLU activation function
- pool layer #1: performs max pooling with a 128x4 filter and a stride of 4 (not overlapped)
- conv layer #2: applies 256 filters, each spanning a subregion of 256x6, with ReLU activation function
- pool layer #2: performs max pooling with a 256x4 filter and a stride of 2 (overlapped)
- conv layer #3: applies 512 filters, each spanning a subregion of 256x6, with ReLU activation function
- pool layer #3: performs max pooling with a 512x4 filter and a stride of 2 (overlapped)

##### - dense modules - 
- dense layer #1: 2048 neurons
- dense layer #2: 2048 neurons with dropout regularisation
- dense layer #3: readout layer (logits) with as many neurons as there are targets
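
One way to sanity-check the convolutional modules is to count their trainable parameters. The sketch below reads a subregion such as "128x6" as 128 input channels by 6 timesteps, which is an interpretation based on the code cells that follow. The helper `conv1d_params` is hypothetical, and it assumes a bias term per filter.

```python
def conv1d_params(kernel_size, in_channels, filters, use_bias=True):
    """Trainable parameters of a 1D convolution.

    Each filter holds a kernel_size x in_channels patch of weights,
    plus one bias value when use_bias is True.
    """
    weights = kernel_size * in_channels * filters
    return weights + (filters if use_bias else 0)


# (kernel_size, in_channels, filters) for the three conv layers of this notebook
layers = {
    "conv1": (6, 128, 256),
    "conv2": (6, 256, 256),
    "conv3": (6, 256, 512),
}

param_counts = {name: conv1d_params(*cfg) for name, cfg in layers.items()}
for name, count in param_counts.items():
    print(name, count)
# conv1 196864
# conv2 393472
# conv3 786944
```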

In [2]:
# A. prepare the input
# (3D-tensor input required for 1D convolution, e.g. temporal convolution)
batch_size = 3 # an example

# - to check
# input length: 10-sec long spectrogram = 862 timesteps
# input dimension: mel-spectrogram bins = 128
input_layer = tf.placeholder(tf.float32, [batch_size, 862, 128])

# dimensions check
print(input_layer.shape)

(3, 862, 128)


In [6]:
# B. Convolutional layer #1
# This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs.
# obs: kernel size has only one value since the convolution is only temporal
conv1 = tf.layers.conv1d(
    inputs=input_layer, 
    filters=256, 
    kernel_size=[6], # the filter is spanning 6 timesteps
    padding="same", 
    activation=tf.nn.relu
)

# dimensions check
print(conv1.shape)

# get activation values

(3, 862, 256)



In [19]:
# C. Pooling layer #1
pool1 = tf.layers.max_pooling1d(
    inputs=conv1, 
    pool_size=4, 
    strides=4, # none of the subregions will overlap
    padding="same"
)

# dimensions check - length expected to shrink by ~75% (ceil(862 / 4) = 216)
print(pool1.shape)

(3, 216, 256)
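
The length 216 follows from how `padding="same"` works: the output length is the input length divided by the stride, rounded up. A small sketch of that arithmetic for the time axis of the whole stack (the helper name `same_padding_length` is an assumption, not a TensorFlow function):

```python
import math


def same_padding_length(length, stride):
    """Output length along the time axis with padding="same"."""
    return math.ceil(length / stride)


length = 862                                   # input timesteps
after_conv = same_padding_length(length, 1)    # conv layers use stride 1
after_pool1 = same_padding_length(after_conv, 4)
after_pool2 = same_padding_length(after_pool1, 2)
after_global = same_padding_length(after_pool2, 108)

print(after_conv, after_pool1, after_pool2, after_global)  # 862 216 108 1
```

Because 862 is not a multiple of 4, the last pooling window of the first pooling layer is only partially filled with real data.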


In [21]:
# D. Convolutional + pooling layer #2
conv2 = tf.layers.conv1d(
    inputs=pool1, 
    filters=256, 
    kernel_size=[6], # the filter is spanning 6 timesteps
    padding="same", 
    activation=tf.nn.relu
)

# dimensions check
print(conv2.shape)


pool2 = tf.layers.max_pooling1d(
    inputs=conv2, 
    pool_size=2, 
    strides=2, # none of the subregions will overlap
    padding="same"
)

# dimensions check - length expected to halve (216 / 2 = 108)
print(pool2.shape)

(3, 216, 256)
(3, 108, 256)


In [24]:
# D. Convolutional layer #3
conv3 = tf.layers.conv1d(
    inputs=pool2, 
    filters=512, 
    kernel_size=[6], # the filter is spanning 6 timesteps
    padding="same", 
    activation=tf.nn.relu
)

# dimensions check
print(conv3.shape)

(3, 108, 512)


In [35]:
# E. Global temporal pooling 
# It pools across the entire time axis, i.e. one value per filter

# using max pooling 
pool3Max = tf.layers.max_pooling1d(
    inputs=conv3, 
    pool_size=108, 
    strides=108,
    padding="same"
)

# using average pooling
pool3Avg = tf.layers.average_pooling1d(
    inputs=conv3,
    pool_size=108,
    strides=108,
    padding="same"
)


# dimensions check - expected same for both tensors
print(pool3Max.shape)
print(pool3Avg.shape)

(3, 1, 512)
(3, 1, 512)


In [43]:
# stack the two pooling outputs along the time axis
globalPool = tf.concat([pool3Max, pool3Avg], axis=1)
print(globalPool.shape)

# reshape into a tensor of shape [batch_size, features]
pool_flat = tf.reshape(globalPool, [-1, 2 * 512])
print(pool_flat.shape)

(3, 2, 512)
(3, 1024)
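
What the global pooling and the reshape compute for a single example can be sketched in plain Python: one maximum and one average per filter, taken over all timesteps, with the two vectors placed side by side. The function name and the toy feature map (3 timesteps, 2 filters) are assumptions for illustration.

```python
def global_max_avg(feature_map):
    """feature_map: list of timesteps, each a list with one value per filter.

    Returns the per-filter maxima followed by the per-filter averages.
    """
    columns = list(zip(*feature_map))  # regroup values by filter
    max_pooled = [max(col) for col in columns]
    avg_pooled = [sum(col) / len(col) for col in columns]
    return max_pooled + avg_pooled


toy = [
    [1.0, 4.0],
    [3.0, 2.0],
    [2.0, 0.0],
]
features = global_max_avg(toy)
print(features)  # [3.0, 4.0, 2.0, 2.0]
```

In TensorFlow the same reductions could presumably also be written as `tf.reduce_max(conv3, axis=1)` and `tf.reduce_mean(conv3, axis=1)`, which would avoid hard-coding the length 108.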


In [44]:
# F. Dense layers
# They are mainly used to perform classification on the features extracted by the convolutional module.

# functional interface for densely-connected layers
# (implements the operation: outputs = activation(matmul(inputs, kernel) + bias))
# dense layer #1
dense1 = tf.layers.dense(
    inputs=pool_flat,
    units=2048,
    activation=tf.nn.relu
)

# dense layer #2
dense2 = tf.layers.dense(
    inputs=dense1,
    units=2048,
    activation=tf.nn.relu
)

# dropout regularisation for dense layer #2
dropout = tf.layers.dropout(
    inputs=dense2,
    rate=0.5, # 50% of the elements are randomly dropped during training
    training=True # hard-coded for this draft; must be False at inference time
)

# dimensions check
print(dense1.shape)
print(dense2.shape)

(3, 2048)
(3, 2048)
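
The effect of the `rate` and `training` arguments can be sketched in plain Python. This is a minimal illustration of inverted dropout, not TensorFlow's implementation: the function signature and the seeded generator are assumptions. Kept elements are scaled by `1 / (1 - rate)` so that the expected sum of the activations is unchanged.

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout: zero each element with probability `rate` while training."""
    if not training:
        return list(values)  # at inference time the layer is the identity
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # seeded only to make the sketch reproducible
activations = [1.0] * 8

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False)

print(train_out)  # each element is either 0.0 or 2.0
print(eval_out)   # unchanged
```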


In [45]:
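# G. logits layer
# units number chosen by heuristics (to be changed?): it should match the number of targets
# the layer is fed by the dropout output, so that the regularisation is actually applied
logits = tf.layers.dense(
    inputs=dropout,
    units=10
)

# softmax turns the raw logits into per-target probabilities
probabilities = tf.nn.softmax(logits)

# dimensions check
print(logits.shape)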

(3, 10)
