<a href="https://colab.research.google.com/github/ithingv/AudioProcessing/blob/main/Kapre_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
from google.colab import drive
drive.mount('/content/MyDrive')

Mounted at /content/MyDrive


In [1]:
!pip install kapre



# How to use Kapre - example

In [5]:
import librosa
import kapre
import tensorflow as tf
from tensorflow.keras.models import Sequential
import numpy as np

from datetime import datetime
now = datetime.now()

print('%s/%s/%s' % (now.year, now.month, now.day))
print('Tensorflow: {}'.format(tf.__version__))
print('Librosa: {}'.format(librosa.__version__))
print('Image data format: {}'.format(tf.keras.backend.image_data_format()))
print('Kapre: {}'.format(kapre.__version__))

2022/1/6
Tensorflow: 2.7.0
Librosa: 0.8.1
Image data format: channels_last
Kapre: 0.3.6


# Loading an audio file

In [8]:
audio_path = '/content/MyDrive/MyDrive/[해커톤]DeepASMR/Dataset/asmr/wave/0.wav'
src, sr = librosa.load(audio_path, sr=None, mono=True)
print('Audio length: %d samples, %04.2f seconds. \n' % (len(src), len(src) / sr) +
      'Audio sample rate: %d Hz' % sr)

Audio length: 220500 samples, 5.00 seconds. 
Audio sample rate: 44100 Hz


In [9]:
src.shape

(220500,)

# Trim it and make it 2D
If your file is mono, `librosa.load` returns a 1D array. Kapre always expects a 2D array per item, so add a channel axis.

On my computer, I use the default `image_data_format == channels_last`. I'll keep that layout for the audio data too, so each item has the shape (time, channel).
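The same idea extends to multi-channel audio. The sketch below assumes the (channels, samples) layout that `librosa.load(..., mono=False)` returns for multi-channel files. `to_channels_last` is a hypothetical helper name used only for illustration; it is not part of Kapre or librosa.

```python
import numpy as np

def to_channels_last(audio):
    """Return audio shaped (time, channels). Hypothetical helper for illustration."""
    audio = np.asarray(audio)
    if audio.ndim == 1:
        # Mono: (time,) -> (time, 1)
        return audio[:, np.newaxis]
    if audio.ndim == 2:
        # Assumed input layout: (channels, time) -> (time, channels)
        return audio.T
    raise ValueError('expected a 1D or 2D array, got %d dimensions' % audio.ndim)

mono = np.zeros(44100, dtype=np.float32)         # stand-in for a mono signal
stereo = np.zeros((2, 44100), dtype=np.float32)  # stand-in for a stereo signal
print(to_channels_last(mono).shape)    # (44100, 1)
print(to_channels_last(stereo).shape)  # (44100, 2)
```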

In [10]:
len_second = 1.0 
src = src[:int(sr*len_second)]
src = np.expand_dims(src, axis=1)
input_shape = src.shape
print('The shape of an item', input_shape)

The shape of an item (44100, 1)


# Let's make it a batch of 4 items

In [11]:
x = np.array([src] * 4)
print('The shape of a batch: ',x.shape)

The shape of a batch:  (4, 44100, 1)


# A Keras Model

A simple model for 10-class, single-label classification.

In [14]:
from kapre.time_frequency import STFT, Magnitude, MagnitudeToDecibel


model = Sequential()
# A STFT layer
model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,
               window_name=None, pad_end=False,
               input_data_format='channels_last', output_data_format='channels_last',
               input_shape=input_shape,
               name='stft-layer'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 stft-layer (STFT)           (None, 42, 1025, 1)       0         
                                                                 
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________


- The model has no trainable parameters because the STFT layer uses `tf.signal.stft()`, which is just an FFT-based implementation of the short-time Fourier transform.
- The output shape is (batch, time, frequency, channels).
    - 42 (time) is the number of STFT frames. A shorter hop length would make it (nearly) proportionally larger. If pad_end=True, the input audio signal is padded to be a little longer, so the number of frames increases too.
    - 1025 is the number of STFT frequency bins, determined by n_fft // 2 + 1.
    - 1 channel, because the input signal was single-channel.
- The output of the STFT layer is complex-valued.
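The frame and bin counts above can be checked with a little arithmetic. This is a minimal sketch for the pad_end=False case only, assuming the input has at least win_length samples; `stft_output_shape` is a hypothetical helper, not a Kapre function.

```python
def stft_output_shape(n_samples, n_fft, win_length, hop_length):
    """(n_frames, n_bins) for an STFT without end padding. Illustrative helper."""
    # Each frame needs win_length samples; successive frames start hop_length apart.
    n_frames = 1 + (n_samples - win_length) // hop_length
    # A real-valued signal yields n_fft // 2 + 1 non-redundant frequency bins.
    n_bins = n_fft // 2 + 1
    return n_frames, n_bins

# Settings used in the model above: 1 second of audio at 44100 Hz.
print(stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=1024))  # (42, 1025)
# Halving the hop length nearly doubles the number of frames.
print(stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=512))   # (83, 1025)
```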

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax
from kapre import STFT, Magnitude, MagnitudeToDecibel
from kapre.composed import get_melspectrogram_layer, get_log_frequency_spectrogram_layer

In [16]:
model.add(Magnitude())
model.add(MagnitudeToDecibel())
model.add(Conv2D(32, (3, 3), strides=(2, 2)))
model.add(BatchNormalization())
model.add(ReLU())
model.add(GlobalAveragePooling2D())
model.add(Dense(10))
model.add(Softmax())

# Compile the model
model.compile('adam', 'categorical_crossentropy') # if single-label classification
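
With the `'categorical_crossentropy'` loss, Keras expects one-hot encoded targets. A minimal sketch of building such targets for the batch of 4 follows; the class indices in `labels` are made up for illustration.

```python
import numpy as np

num_classes = 10
labels = np.array([3, 0, 7, 3])  # made-up class indices, one per batch item
y = np.eye(num_classes, dtype=np.float32)[labels]  # one one-hot row per item

print(y.shape)        # (4, 10)
print(y.sum(axis=1))  # every row contains exactly one 1
```

If the targets are kept as integer class indices instead, `'sparse_categorical_crossentropy'` is the matching Keras loss.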

In [17]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 stft-layer (STFT)           (None, 42, 1025, 1)       0         
                                                                 
 magnitude (Magnitude)       (None, 42, 1025, 1)       0         
                                                                 
 magnitude_to_decibel (Magni  (None, 42, 1025, 1)      0         
 tudeToDecibel)                                                  
                                                                 
 conv2d (Conv2D)             (None, 20, 512, 32)       320       
                                                                 
 batch_normalization (BatchN  (None, 20, 512, 32)      128       
 ormalization)                                                   
                                                                 
 re_lu (ReLU)                (None, 20, 512, 32)      

- I added `Magnitude()`, which is a simple abs() operation on the complex numbers.
- `MagnitudeToDecibel()` maps the magnitudes to a decibel scale.
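
The two steps can be illustrated in plain NumPy. This is only a sketch of the general idea, not Kapre's implementation: the `10 * log10` scaling and the `amin` floor are assumptions, and any reference value or dynamic-range clipping that the layer applies is left out.

```python
import numpy as np

def magnitude_to_db(mag, amin=1e-5):
    """10 * log10 of a magnitude, floored at amin to avoid log(0). Illustrative only."""
    return 10.0 * np.log10(np.maximum(mag, amin))

stft_values = np.array([3 + 4j, 1 + 0j, 0j])  # made-up complex STFT values
mag = np.abs(stft_values)                      # magnitudes 5, 1 and 0
db = magnitude_to_db(mag)                      # the zero magnitude is floored at amin

print(mag)
print(db)
```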