# Alternate Configurations

In the paper, we presented the best-performing network configurations. Here we also present some alternative network configurations and their results relative to those in the paper.

# Visual Network Architecture

In the paper, we presented the Tensor-Train Gated Recurrent Unit (TT-GRU) since it performed the best. The second-best-performing architecture is a 3D CNN for video classification. A similar architecture is used in "Learning Spatiotemporal Features With 3D Convolutional Networks" by Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri.
We tested it on crush_120x90_testset1.bin and obtained 43.3% accuracy, whereas the TT-GRU achieves 49.4%.
Here's the implementation of the CNN architecture:

In [1]:
import tensorflow as tf  # TensorFlow 1.x API (tf.layers)

def model(data_placeholder):
    # `num_classes` is expected to be defined earlier in the notebook.
    with tf.name_scope('Model'):
        # Convolution stages, each followed by 2x max pooling over depth (time), height, and width.
        net = tf.layers.conv3d(inputs=data_placeholder, filters=32, kernel_size=3, padding='SAME', activation=tf.nn.relu)
        net = tf.layers.max_pooling3d(inputs=net, pool_size=2, strides=2, padding='SAME')
        net = tf.layers.conv3d(inputs=net, filters=64, kernel_size=3, padding='SAME', activation=tf.nn.relu)
        net = tf.layers.max_pooling3d(inputs=net, pool_size=2, strides=2, padding='SAME')
        net = tf.layers.conv3d(inputs=net, filters=128, kernel_size=3, padding='SAME', activation=tf.nn.relu)
        net = tf.layers.conv3d(inputs=net, filters=128, kernel_size=3, padding='SAME', activation=tf.nn.relu)
        net = tf.layers.max_pooling3d(inputs=net, pool_size=2, strides=2, padding='SAME')
        # Two fully connected layers, then a linear classification layer.
        net = tf.layers.flatten(net)
        net = tf.layers.dense(inputs=net, units=512, activation=tf.nn.relu)
        net = tf.identity(net, name='fc1')
        net = tf.layers.dense(inputs=net, units=512, activation=tf.nn.relu)
        net = tf.identity(net, name='fc2')
        net = tf.layers.dense(inputs=net, units=num_classes, activation=None)
        logits = tf.identity(net, name='logits')

    return logits
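
To make the size of the flattened feature vector in the CNN above concrete, the following is a minimal sketch of the shape arithmetic. It relies on the fact that a `'SAME'`-padded pooling layer with stride 2 maps a length `n` to `ceil(n / 2)`, while the `'SAME'`-padded convolutions (stride 1) leave the spatial size unchanged. The clip length of 16 frames is a hypothetical value chosen for illustration, and reading `120x90` in the file name as width x height is an assumption.

```python
from math import prod

def same_pool_out(size, stride=2):
    """Output length along one axis of a 'SAME'-padded pooling layer."""
    return -(-size // stride)  # integer ceil(size / stride)

def flattened_size(frames, height, width, n_pools=3, last_filters=128):
    """Dimensions after n_pools pooling layers and the flattened feature count."""
    dims = [frames, height, width]
    for _ in range(n_pools):
        dims = [same_pool_out(d) for d in dims]
    return dims, prod(dims) * last_filters

# Hypothetical input: 16 frames of 90 x 120 pixels (frame count is an assumption).
dims, n_features = flattened_size(frames=16, height=90, width=120)
print(dims, n_features)
```

Under these assumed input dimensions, the three pooling layers reduce the volume to 2 x 12 x 15, so the first dense layer receives 46,080 features.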

# Auditory Network Architecture

In the paper, we used a CNN for sound classification, but the Tensor-Train Long Short-Term Memory (TT-LSTM) also provides comparable results. TT-LSTM enables end-to-end training on high-dimensional sequential data. The proposed CNN method achieves 51.1% accuracy and TT-LSTM achieves 45.0% on push_60Freq_50TimePerSec_testset1.bin. Here's the implementation of the TT-LSTM architecture:

In [2]:
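def model(x):
    # `frames`, `freq_height`, `channel`, `tt_input_shape`, `tt_output_shape`, `tt_ranks`,
    # `alpha`, and `num_classes` are expected to be defined earlier in the notebook.
    with tf.name_scope('Model'):
        # Assuming x is [batch, frames, freq_height, channel], this groups every
        # 5 consecutive frames into one time step.
        net = tf.reshape(x, [-1, frames//5, 5*freq_height*channel])
        rnn_layer = TT_LSTM(tt_input_shape=tt_input_shape, tt_output_shape=tt_output_shape,
                            tt_ranks=tt_ranks,
                            return_sequences=False,
                            dropout=0.25, recurrent_dropout=0.25, activation='tanh')
        h = rnn_layer(net)
        # Note: with activation='softmax', the returned tensor holds class probabilities, not raw logits.
        logits = Dense(output_dim=num_classes, activation='softmax', kernel_regularizer=l2(alpha))(h)
    return logits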
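
The reason a tensor-train layer can handle high-dimensional inputs is its parameter count. As a minimal sketch: a weight matrix in TT format with input modes `m_k`, output modes `n_k`, and ranks `r_0, ..., r_d` (with `r_0 = r_d = 1`) stores one core of shape `r_{k-1} x m_k x n_k x r_k` per mode. The shapes and ranks below are hypothetical values chosen for illustration, not the ones used in our experiments, and the sketch counts a single weight matrix only; the actual `TT_LSTM` layer may arrange its gate weights differently.

```python
from math import prod

def tt_num_params(in_modes, out_modes, ranks):
    """Number of parameters in the cores of a TT-format weight matrix."""
    if not (len(in_modes) == len(out_modes) == len(ranks) - 1):
        raise ValueError("need one rank more than the number of modes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary ranks must be 1")
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(len(in_modes)))

# Hypothetical factorization: 300 inputs (4*5*5*3) to 256 outputs (4*4*4*4).
in_modes, out_modes, ranks = [4, 5, 5, 3], [4, 4, 4, 4], [1, 4, 4, 4, 1]

tt_params = tt_num_params(in_modes, out_modes, ranks)
dense_params = prod(in_modes) * prod(out_modes)
print(tt_params, dense_params)
```

For these assumed shapes the TT cores hold 752 parameters, compared with 76,800 for the equivalent dense matrix.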

# Haptic Network Architecture

In the paper, we used a CNN for haptic classification, but we also tried a CNN+LSTM architecture. The CNN achieves 51.3%, whereas the CNN+LSTM achieves 46.0% accuracy on crush_Freq_50TimePerSec_testset1.bin. Here's the implementation of the CNN+LSTM architecture:

In [3]:
def model():
    with tf.name_scope("Model"):
        data_placeholder = tf.placeholder('float', [None, frames, channel], name='InputData')
        net = tf.reshape(data_placeholder, [-1, frames, channel, 1])
        
        net = tf.layers.conv2d(inputs=net, filters=32, kernel_size=[20, 1], padding="same", activation=tf.nn.relu)
        net = tf.layers.max_pooling2d(inputs=net, pool_size=[10, 1], strides=2)
        net = tf.layers.conv2d(inputs=net, filters=64, kernel_size=[20, 1], padding="same", activation=tf.nn.relu)
        net = tf.layers.max_pooling2d(inputs=net, pool_size=[10, 1], strides=2)
        # Flatten to rows of 128 features, then split into a list of 54 time steps for static_rnn.
        net = tf.reshape(net, [-1, 128])
        net = tf.split(net, 54, 0)
        lstm = tf.contrib.rnn.BasicLSTMCell(128, forget_bias=1.0, state_is_tuple=True)
        outputs, _states = tf.contrib.rnn.static_rnn(lstm, net, dtype=tf.float32)
        logits = tf.layers.dense(outputs[-1], num_classes, name='logits')
        
    return logits
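
The pooling layers in the haptic network above use the default `'valid'` padding of `tf.layers.max_pooling2d`, so their output length follows `floor((n - pool) / stride) + 1` rather than a simple halving. The sketch below works through that arithmetic for the two `[10, 1]` pools with `strides=2`; note that a scalar stride applies to both axes, so the channel axis shrinks as well. The input size of 250 frames by 3 channels is a hypothetical value chosen for illustration and is not meant to reproduce the dimensions of our data.

```python
def valid_pool_out(size, pool, stride):
    """Output length along one axis of a 'valid'-padded pooling layer."""
    return (size - pool) // stride + 1

def haptic_pooled_shape(frames, channels, n_pools=2):
    """Apply n_pools pooling layers with pool_size=[10, 1] and strides=2."""
    h, w = frames, channels
    for _ in range(n_pools):
        h = valid_pool_out(h, pool=10, stride=2)  # time axis
        w = valid_pool_out(w, pool=1, stride=2)   # channel axis
    return h, w

# Hypothetical input: 250 time frames, 3 sensor channels.
pooled = haptic_pooled_shape(frames=250, channels=3)
print(pooled)
```

With these assumed inputs, the time axis goes 250 -> 121 -> 56 and the channel axis goes 3 -> 2 -> 1.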

# Multimodal Network Architecture

Several networks can be combined in many ways. The second-best-performing network in our experiments is similar to the one used in "Deep learning for tactile understanding from visual and haptic data" by Yang Gao, Lisa Anne Hendricks, Katherine J. Kuchenbecker, and Trevor Darrell. In that network, the activations from the second-to-last layers of the haptic and visual CNNs were concatenated.

We implemented this by including 512 neurons in the video network's last fully connected layer and removing the last fully connected layer from the audio and haptic networks, leaving outputs of 256 and 1024 neurons, respectively. We then concatenated all three outputs to get a layer of 1792 (512+256+1024) neurons. Finally, we added two dense layers: one consisting of 512 neurons and one consisting of 20 neurons for the outputs.
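
The layer widths described above can be checked with a short sketch. It only does the bookkeeping: the width of the concatenated layer and the number of weights and biases in the two dense layers that follow. The helper name `dense_params` is introduced here for illustration, and the sketch assumes standard fully connected layers with one bias per output unit.

```python
# Feature widths of the three branches, as described in the text.
branch_widths = {"video": 512, "audio": 256, "haptic": 1024}

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (hypothetical helper)."""
    return n_in * n_out + n_out

concat_width = sum(branch_widths.values())
hidden_params = dense_params(concat_width, 512)  # first dense layer after concatenation
output_params = dense_params(512, 20)            # output layer with 20 classes

print(concat_width, hidden_params, output_params)
```

Most of the fusion head's parameters sit in the first dense layer, since it maps the full 1792-wide concatenation down to 512 units.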

The method proposed in the paper achieves 77.2% accuracy, whereas this concatenation-based method achieves 73.5% on hold_testset1.bin. Here's the implementation of this method:

In [4]:
def model(video_data_placeholder):
    with tf.name_scope("Model"):
        # Video
        XR = tf.reshape(video_data_placeholder, [batch, video_frames, video_height*video_width*video_channel])
        rnn_layer = TT_GRU(tt_input_shape=tt_input_shape, tt_output_shape=tt_output_shape,
                               tt_ranks=tt_ranks,
                                return_sequences=False,
                                dropout=0.25, recurrent_dropout=0.25, activation='tanh')
        h = rnn_layer(XR)
        video_logits = Dense(output_dim=512, activation='relu', kernel_regularizer=l2(alpha))(h)
        
        # Audio
        audio_data_placeholder = tf.placeholder('float', [None, audio_frames, audio_freq_height, audio_channel], name='audio_InputData')
        
        net = tf.layers.conv2d(inputs=audio_data_placeholder, filters=20, kernel_size=[57, 6], strides=[1, 1], padding="same", activation=tf.nn.relu)
        net = tf.layers.max_pooling2d(inputs=net, pool_size=[4, 4], strides=[4, 4])
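        # Note: `rate` is the fraction of units dropped (not kept), and tf.layers.dropout
        # acts as the identity unless training=True is passed.
        net = tf.layers.dropout(inputs=net, rate=audio_keep_prob)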
        net = tf.layers.conv2d(inputs=net, filters=40, kernel_size=[1, 3], strides=[1, 1], padding="same", activation=tf.nn.relu)
        if (db_file_name.split("_")[0]) == "hold":
            net = tf.layers.max_pooling2d(inputs=net, pool_size=[1, 4], strides=[1, 4])
        else:
            net = tf.layers.max_pooling2d(inputs=net, pool_size=[4, 4], strides=[4, 4])
        net = tf.layers.flatten(net)
        # Dense Layer
        net = tf.layers.dense(inputs=net, units=256, activation=tf.nn.relu)
        net = tf.layers.dropout(inputs=net, rate=audio_keep_prob)
        net = tf.layers.dense(inputs=net, units=256, activation=tf.nn.relu)
        audio_logits = tf.layers.dropout(inputs=net, rate=audio_keep_prob)
        
        # Haptic
        haptic_data_placeholder = tf.placeholder('float', [None, haptic_frames, haptic_channel], name='haptic_InputData')
        net = tf.reshape(haptic_data_placeholder, [-1, haptic_frames, haptic_channel, 1])
        net = tf.layers.conv2d(inputs=net, filters=32, kernel_size=[20, 5], padding="same", activation=tf.nn.relu)
        net = tf.layers.max_pooling2d(inputs=net, pool_size=[10, 1], strides=2)
        net = tf.layers.conv2d(inputs=net, filters=64, kernel_size=[1, 3], padding="same", activation=tf.nn.relu)
        if (db_file_name.split("_")[0]) in haptic_skip_2nd_maxpool:
            net = tf.layers.max_pooling2d(inputs=net, pool_size=[1, 1], strides=[1, 2])
        else:
            net = tf.layers.max_pooling2d(inputs=net, pool_size=[10, 1], strides=2)
                
        net = tf.layers.flatten(net)
        # Dense Layer
        net = tf.layers.dense(inputs=net, units=1024, activation=tf.nn.relu)
        haptic_logits = tf.layers.dropout(inputs=net, rate=haptic_keep_prob)
        
        # Concatenate 
        logits = tf.concat([video_logits, audio_logits, haptic_logits], axis=1)
        logits = tf.nn.relu(logits)
        logits = tf.layers.dense(inputs=logits, units=512, activation=tf.nn.relu)
        logits = tf.layers.dense(inputs=logits, units=num_classes)
        
    return logits