# Convnet feature extractor with Keras 2.0


2017-12-06, Keunwoo Choi (keunwoo.choi@qmul.ac.uk)

The previous release, built with Keras 1.2.x, produced inconsistent features when the batch size varied, so I trained a similar but slightly different convnet.

### What's similar
 * 5-layer convnet
 * Average-pooling based feature computation
 * Trained for music tagging

#### What's the same but actually bad
 * The current version of kapre has a slight bug in computing the mel-spectrogram.

### What's better
 * Stable (or consistent) feature prediction regardless of the batch size
 * Works for any signal longer than 5.12 s
 * Uses per-data-sample normalization as an audio-level normalizer, just in case.

### What might be worse
 * The weights are different, so the features are different. I guess they'd be pretty similar, though.
 * Tagging performance is 0.825 AUC, which is not as good as the previous one (0.845 AUC)
 

# USAGE

## Load model

In [17]:
from keras import backend as K
from keras.models import Model
from keras.layers import GlobalAveragePooling2D as GAP2D
from keras.layers import concatenate as concat

In [11]:
import keras
import kapre

print(keras.backend.image_data_format())
print(keras.backend.backend())
print(keras.__version__)
print(kapre.__version__)  # 0c37638

channels_last
tensorflow
2.0.6
0.1.2.1


In [12]:
model = keras.models.load_model('model_best.hdf5', 
                                custom_objects={'Melspectrogram':kapre.time_frequency.Melspectrogram,
                                                'Normalization2D':kapre.utils.Normalization2D})

int_axis=0 passed but is ignored, str_axis is used instead.


In [13]:
model.summary()
# MP: [(2, 4), (3, 4), (2, 5), (2, 4), (4, 4)]
# --> Melspectrogram: should have more than 320 time frames.
# --> Input audio should be >= 81,920 [samples] with sampling rate=16000.
#     I.e., longer than 5.12 [seconds].

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
melgram (Melspectrogram)     (None, 96, None, 1)       287840    
_________________________________________________________________
normalization2d_1 (Normaliza (None, 96, None, 1)       0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 96, None, 32)      320       
_________________________________________________________________
batch_normalization_1 (Batch (None, 96, None, 32)      128       
_________________________________________________________________
elu_1 (ELU)                  (None, 96, None, 32)      0         
_________________________________________________________________
MP_1 (MaxPooling2D)          (None, 48, None, 32)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 48, None, 32)      9248      
__________
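
The minimum-length comment in the cell above can be turned into a small guard. This is a minimal sketch, assuming NumPy input shaped `(batch, 1, n_samples)` at 16 kHz as in the rest of this notebook. `pad_to_min_length` is a hypothetical helper, and zero-padding is an assumption made here, not something the trained model or kapre provides. The sketch also checks that the time-axis pooling sizes of the first four layers multiply to the 320 frames quoted above.

```python
import numpy as np

SR = 16000           # sampling rate stated in the notebook
MIN_SAMPLES = 81920  # minimum input length stated in the notebook (5.12 s)

# (frequency, time) max-pooling sizes, copied from the comment above
POOL_SIZES = [(2, 4), (3, 4), (2, 5), (2, 4), (4, 4)]

# Time-axis reduction applied before the fifth conv block: 4 * 4 * 5 * 4
frames_before_last_block = int(np.prod([t for _, t in POOL_SIZES[:4]]))


def pad_to_min_length(x, min_samples=MIN_SAMPLES):
    """Zero-pad the sample axis so the clip is long enough (hypothetical helper)."""
    n_missing = min_samples - x.shape[-1]
    if n_missing <= 0:
        return x  # already long enough, return unchanged
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, n_missing)]
    return np.pad(x, pad_width, mode='constant')


short_clip = np.random.random((2, 1, SR * 3))  # two 3-second mono clips
padded = pad_to_min_length(short_clip)
print(frames_before_last_block, padded.shape, MIN_SAMPLES / SR)
```

Padding with silence changes the averaged features, so it is probably better treated as a workaround for short clips than as a default preprocessing step.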

## Build a feature extractor based on the trained model

In [19]:
feat_layer1 = GAP2D()(model.get_layer('elu_1').output)
feat_layer2 = GAP2D()(model.get_layer('elu_2').output)
feat_layer3 = GAP2D()(model.get_layer('elu_3').output)
feat_layer4 = GAP2D()(model.get_layer('elu_4').output)
feat_layer5 = GAP2D()(model.get_layer('elu_5').output)

In [24]:
feat_all = concat([feat_layer1, feat_layer2, feat_layer3, feat_layer4, feat_layer5])

In [27]:
feat_extractor = Model(inputs=model.input, outputs=feat_all)
# You have just built the feature extractor.
# It is a Keras model, so you can use its .predict() method to get the features, as shown below.
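
To make the pooling-and-concatenation step concrete, here is a NumPy-only sketch of what `GlobalAveragePooling2D` followed by `concatenate` computes with `channels_last` data. The activation shapes are made up for illustration and `global_average_pool_2d` is a hypothetical stand-in, not part of Keras.

```python
import numpy as np


def global_average_pool_2d(activation):
    """Average a (batch, freq, time, channels) array over freq and time."""
    return activation.mean(axis=(1, 2))


# Stand-in activations with made-up shapes, for illustration only
rng = np.random.RandomState(0)
act_a = rng.rand(3, 96, 40, 32)  # e.g. an early layer with a large feature map
act_b = rng.rand(3, 48, 10, 32)  # e.g. a deeper layer with a smaller feature map

# One vector per layer, joined along the channel axis
feat = np.concatenate(
    [global_average_pool_2d(act_a), global_average_pool_2d(act_b)], axis=-1)
print(feat.shape)
```

Because every feature map is averaged down to one value per channel, the length of the concatenated vector depends only on the channel counts, not on the duration of the input audio.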

In [32]:
# model.summary() is always a reliable way to check the architecture.
feat_extractor.summary(line_length=90)

__________________________________________________________________________________________
Layer (type)                 Output Shape        Param #    Connected to                  
melgram_input (InputLayer)   (None, 1, None)     0                                        
__________________________________________________________________________________________
melgram (Melspectrogram)     (None, 96, None, 1) 287840     melgram_input[0][0]           
__________________________________________________________________________________________
normalization2d_1 (Normaliza (None, 96, None, 1) 0          melgram[0][0]                 
__________________________________________________________________________________________
conv2d_1 (Conv2D)            (None, 96, None, 32 320        normalization2d_1[0][0]       
__________________________________________________________________________________________
batch_normalization_1 (Batch (None, 96, None, 32 128        conv2d_1[0][0]                

In [50]:
# Let's check whether the features are really stable across batch sizes
import numpy as np

In [51]:
inp = np.random.random((12, 1, 16000*6))  # 12 mono, 6-second input signals (batch_size=12)
feat = feat_extractor.predict(inp, batch_size=12)

In [52]:
feat2 = feat_extractor.predict(inp[0:1]) # what if the batch size is 1?

In [53]:
np.isclose(feat[0], feat2[0], rtol=1e-5, atol=1e-4)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
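
The element-wise check above can be condensed into a single verdict plus the largest deviation. A minimal sketch, using stand-in vectors so that it runs without the trained model; `summarize_match` is a hypothetical helper, and the tolerances simply mirror the `np.isclose` call above.

```python
import numpy as np


def summarize_match(a, b, rtol=1e-5, atol=1e-4):
    """Return (all_close, max_abs_diff) for two feature vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    all_close = bool(np.allclose(a, b, rtol=rtol, atol=atol))
    return all_close, float(np.max(np.abs(a - b)))


# Stand-in vectors of arbitrary length; with the real model use feat[0] and feat2[0]
rng = np.random.RandomState(0)
ref = rng.rand(100)
noisy = ref + 1e-6 * rng.randn(100)  # tiny perturbation, well inside atol

ok, max_diff = summarize_match(ref, noisy)
print(ok, max_diff)
```

Reporting the maximum absolute difference alongside the boolean makes it easier to see how close the two predictions are to the tolerance.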