# How to use Kapre - example

In [13]:
import librosa
import kapre
import tensorflow as tf
from tensorflow.keras.models import Sequential
import numpy as np

from datetime import datetime
now = datetime.now()

print('%s/%s/%s' % (now.year, now.month, now.day))
print('Tensorflow: {}'.format(tf.__version__))
print('Librosa: {}'.format(librosa.__version__))
print('Image data format: {}'.format(tf.keras.backend.image_data_format()))
print('Kapre: {}'.format(kapre.__version__))

2020/8/14
Tensorflow: 2.3.0
Librosa: 0.8.0
Image data format: channels_last
Kapre: 0.3.0-rc


# Loading an mp3 file

In [19]:
src, sr = librosa.load('../srcs/bensound-cute.mp3', sr=None, mono=True)
print('Audio length: %d samples, %04.2f seconds. \n' % (len(src), len(src) / sr) +
      'Audio sample rate: %d Hz' % sr)

Audio length: 453888 samples, 10.29 seconds. 
Audio sample rate: 44100 Hz




# Trim it and make it 2D

If your file is mono, `librosa.load` returns a 1D array. Kapre expects each item to be a 2D array (time and channel), so add a channel axis.

On my computer, I use the default `image_data_format == 'channels_last'`. I'll keep it that way for the audio data, too.

In [20]:
len_second = 1.0 # Let's trim it to make it quick
src = src[:int(sr*len_second)]
src = np.expand_dims(src, axis=1)
input_shape = src.shape
print('The shape of an item', input_shape)

The shape of an item (44100, 1)
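If you load with `mono=False` instead, librosa returns stereo audio as a `(channels, samples)` array, so `expand_dims` alone isn't enough. Here is a minimal sketch that covers both cases. The helper name `to_channels_last` is my own (hypothetical, not part of Kapre or librosa), and the zero arrays only stand in for real audio.

```python
import numpy as np


def to_channels_last(audio):
    """Return audio shaped (samples, channels).

    Hypothetical helper: accepts a 1D mono array or a 2D
    (channels, samples) array, as librosa.load would return them.
    """
    audio = np.asarray(audio)
    if audio.ndim == 1:
        # Mono: add a trailing channel axis -> (samples, 1)
        return audio[:, np.newaxis]
    if audio.ndim == 2:
        # Multi-channel: (channels, samples) -> (samples, channels)
        return audio.T
    raise ValueError('Expected a 1D or 2D array, got %d dimensions' % audio.ndim)


# Synthetic one-second signals at 44100 Hz, standing in for loaded audio
mono = np.zeros(44100, dtype=np.float32)
stereo = np.zeros((2, 44100), dtype=np.float32)

mono_cl = to_channels_last(mono)
stereo_cl = to_channels_last(stereo)
print(mono_cl.shape, stereo_cl.shape)
```

For stereo input, the `input_shape` passed to the first layer would then end in `2` instead of `1`.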


# Let's make it a batch of 4 items

This makes it look more like a proper dataset. In practice, you would have many different files.

In [21]:
x = np.array([src] * 4)
print('The shape of a batch: ',x.shape)

The shape of a batch:  (4, 44100, 1)
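Repeating one item four times is only a placeholder. One way to get distinct items from a single longer recording is to cut it into non-overlapping one-second segments. This is a rough sketch under my own assumptions: the random signal stands in for a real recording, and the names `long_src`, `seg_len` and `segments` are made up for the illustration.

```python
import numpy as np

sr = 44100
len_second = 1.0

# Synthetic 4.5-second mono signal standing in for a longer recording
rng = np.random.default_rng(0)
long_src = rng.uniform(-1.0, 1.0, size=int(sr * 4.5)).astype(np.float32)

seg_len = int(sr * len_second)
n_segments = len(long_src) // seg_len  # the incomplete tail is dropped

# (n_segments * seg_len,) -> (batch, time, channel)
segments = long_src[:n_segments * seg_len].reshape(n_segments, seg_len, 1)
print('The shape of a batch: ', segments.shape)
```

The result has the same `(batch, time, channel)` layout as `x` above, but each item holds different audio.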


# A Keras model

A simple model for 10-class, single-label classification.

In [33]:
from kapre.time_frequency import STFT, Magnitude, MagnitudeToDecibel


model = Sequential()
# A STFT layer
model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,
               window_fn=None, pad_end=False,
               input_data_format='channels_last', output_data_format='channels_last',
               input_shape=input_shape,
               name='stft-layer'))
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
stft-layer (STFT)            (None, 42, 1025, 1)       0         
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________


- The model has no trainable parameters because the `STFT` layer uses `tf.signal.stft()`, which is just an FFT-based implementation of the short-time Fourier transform.
- The output shape is `(batch, time, frequency, channels)`.
  - `42` (time) is the number of STFT frames. A shorter hop length increases it in (nearly) inverse proportion. With `pad_end=True`, the input signal is padded at the end and becomes a little longer, so the number of frames increases, too.
  - `1025` is the number of STFT frequency bins, determined as `n_fft // 2 + 1`.
  - `1` channel: because the input signal was single-channel.
- The output of the `STFT` layer is complex-valued.
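
The `42` and `1025` above can be checked by hand. The sketch below is not Kapre code. It assumes the usual framing rule (one frame per hop once a full window fits, and a frame count of `ceil(n_samples / hop_length)` when the end is padded), which is my understanding of how `tf.signal.stft()` frames its input. The function name `stft_output_shape` is hypothetical.

```python
def stft_output_shape(n_samples, n_fft, win_length, hop_length, pad_end=False):
    """Estimate (n_frames, n_bins) of an STFT. Hypothetical helper."""
    if pad_end:
        # Assumed rule with end padding: ceil(n_samples / hop_length)
        n_frames = -(-n_samples // hop_length)
    else:
        # Only frames that fit a full window are kept
        n_frames = max(0, 1 + (n_samples - win_length) // hop_length)
    n_bins = n_fft // 2 + 1
    return n_frames, n_bins


# Same settings as the STFT layer above, for 1 second at 44100 Hz
print(stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=1024))
# Halving the hop length nearly doubles the number of frames
print(stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=512))
# With end padding, a few more frames appear
print(stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=1024, pad_end=True))
```

The first call gives `(42, 1025)`, matching the model summary.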

Let's add more layers like a real model!

In [34]:
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax

In [35]:
model.add(Magnitude())
model.add(MagnitudeToDecibel())
model.add(Conv2D(32, (3, 3), strides=(2, 2)))
model.add(BatchNormalization())
model.add(ReLU())
model.add(GlobalAveragePooling2D())
model.add(Dense(10))
model.add(Softmax())

# Compile the model
model.compile('adam', 'categorical_crossentropy') # if single-label classification
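
With `categorical_crossentropy`, the targets need to be one-hot vectors with one entry per class. Here is a small sketch of matching targets for a batch of 4 items. The label values are made up for the illustration, and `n_classes`, `labels` and `y` are my own names.

```python
import numpy as np

n_classes = 10

# Made-up integer class labels for a batch of 4 items
labels = np.array([3, 0, 7, 3])

# One-hot encode: row i is all zeros except a 1 at index labels[i]
y = np.eye(n_classes, dtype=np.float32)[labels]
print('The shape of the targets:', y.shape)
```

`tf.keras.utils.to_categorical` does the same conversion. If you would rather keep integer labels, `sparse_categorical_crossentropy` is the usual alternative loss.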


In [36]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
stft-layer (STFT)            (None, 42, 1025, 1)       0         
_________________________________________________________________
magnitude (Magnitude)        (None, 42, 1025, 1)       0         
_________________________________________________________________
magnitude_to_decibel (Magnit (None, 42, 1025, 1)       0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 20, 512, 32)       320       
_________________________________________________________________
batch_normalization (BatchNo (None, 20, 512, 32)       128       
_________________________________________________________________
re_lu (ReLU)                 (None, 20, 512, 32)       0         
_________________________________________________________________
global_average_pooling2d (Gl (None, 32)               
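
The shapes and parameter counts in the summary above can be reproduced with a bit of arithmetic. This is a sketch that assumes the Keras default `padding='valid'` for `Conv2D` and four parameters per channel for `BatchNormalization` (scale, offset, moving mean, moving variance). The helper name `conv_valid_output` is hypothetical.

```python
def conv_valid_output(size, kernel, stride):
    """Output length along one axis for a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1


# Input to Conv2D: 42 frames x 1025 bins x 1 channel
time_out = conv_valid_output(42, kernel=3, stride=2)
freq_out = conv_valid_output(1025, kernel=3, stride=2)

# 3x3 kernel, 1 input channel, 32 filters, plus one bias per filter
conv_params = 3 * 3 * 1 * 32 + 32

# 4 parameters per channel, of which 2 (scale and offset) are trainable
bn_params = 4 * 32

print((time_out, freq_out, 32), conv_params, bn_params)
```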

- I added `Magnitude()`, which is a simple `abs()` operation on the complex numbers.
- `MagnitudeToDecibel` maps the magnitudes to a decibel scale.
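
To make these two steps concrete, here is a NumPy sketch of the same idea. It is not Kapre's implementation: the `10 * log10` convention and the floor value `amin=1e-5` are assumptions on my side, and the real layer has its own handling of the reference value and dynamic range that I leave out. The function names are hypothetical.

```python
import numpy as np


def to_magnitude(stft):
    """Absolute value of complex STFT coefficients."""
    return np.abs(stft)


def to_decibel(mag, amin=1e-5):
    """Map magnitudes to decibels, flooring at `amin` to avoid log(0).

    Assumed convention: 10 * log10(max(mag, amin)).
    """
    return 10.0 * np.log10(np.maximum(mag, amin))


# A few made-up complex STFT values
stft = np.array([3 + 4j, 0j, 10 + 0j])
mag = to_magnitude(stft)
db = to_decibel(mag)
print(mag)
print(db)
```

The zero-valued bin ends up at the floor (`-50` dB with this `amin`) instead of minus infinity, which keeps the input to the following layers finite.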