In this notebook, we try to reproduce [the paper](https://pdfs.semanticscholar.org/edff/b62b32ffcc2b5cc846e26375cb300fac9ecc.pdf) on speaker change detection.

## Review

**Sequence Labelling** 

The authors treat this task as binary classification: each frame that contains a speaker change is labelled **1** and every other frame is labelled **0**. Accordingly, they train with the _binary cross-entropy loss function_.
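To make the loss concrete, here is a minimal NumPy sketch of binary cross-entropy averaged over frames. The function name, the clipping constant `eps`, and the example arrays are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs (eps is an assumed constant).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_frame = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_frame))

# Four hypothetical frames: only the third one is a change frame.
labels = [0, 0, 1, 0]
scores = [0.1, 0.2, 0.9, 0.3]
loss = binary_cross_entropy(labels, scores)
```

The loss shrinks as the scores of change frames approach 1 and those of all other frames approach 0.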

**Network Architecture**
- 2 Bi-LSTM
    - 64 and 32 outputs respectively.
- Multi Layer Perceptron
    - 3 Fully Connected Feedforward Layers
        - 40, 20, 1 dimensional respectively.
    - Tanh activation for the first 2 layers
    - Sigmoid activation for last layer
    
**Feature Extraction**
- "35-dimensional acoustic features are extracted every 16ms on a 32ms window using [Yaafe toolkit](http://yaafe.sourceforge.net)."
    - 11 Mel-Frequency Cepstral Coefficients (MFCC), 
    - Their first and second derivatives,
    - First and second derivatives of the energy.

**Class Imbalance**

- _"The number of positive labels is increased artificially by labeling as positive every frame in the direct neighborhood of the manually annotated change point."_
- A positive neighborhood of 100ms (50ms on both sides) is used around each change point, to partially solve the class imbalance problem.
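The labelling rule above can be sketched as follows. This is only an illustration: it assumes each frame is timestamped by its start time with a 16 ms step, and the names `frame_labels`, `step` and `collar` are hypothetical.

```python
import numpy as np

def frame_labels(change_points, n_frames, step=0.016, collar=0.05):
    """Return one 0/1 label per frame.

    A frame is positive when its start time lies within `collar` seconds
    (50 ms on each side) of an annotated change point.
    """
    times = np.arange(n_frames) * step          # frame start times in seconds
    labels = np.zeros(n_frames, dtype=np.int64)
    for cp in change_points:
        labels[np.abs(times - cp) <= collar] = 1
    return labels

# One hypothetical change point at 0.16 s in a 20-frame excerpt.
labels = frame_labels([0.16], n_frames=20)
```

With a 16 ms step, a single change point turns 7 consecutive frames positive instead of 1.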

**Subsequences**

- _"The long audio sequences are split into short fixed-length overlapping sequences."_

**Prediction**

- _"Finally, local score maxima exceeding a pre-determined threshold θ are marked as speaker change points."_
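A minimal sketch of this peak-picking step is shown below. The paper does not say how wide the neighbourhood of a "local maximum" is, so comparing each score only with its two immediate neighbours is an assumption, as are the function name and the example scores.

```python
import numpy as np

def pick_change_points(scores, threshold):
    """Return indices of frames whose score is a local maximum above `threshold`."""
    scores = np.asarray(scores, dtype=float)
    peaks = []
    for i in range(1, len(scores) - 1):
        # Local maximum: strictly higher than the previous frame and
        # at least as high as the next one (assumed definition).
        is_local_max = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_local_max and scores[i] > threshold:
            peaks.append(i)
    return peaks

# Hypothetical per-frame scores from the network.
scores = [0.1, 0.3, 0.8, 0.4, 0.2, 0.45, 0.3, 0.6, 0.9, 0.7]
peaks = pick_change_points(scores, threshold=0.5)
```

Frame 5 is a local maximum too, but its score of 0.45 stays below the threshold, so it is not reported.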

**Training**

- Subsequences for training are 3.2s long with a step of 800ms (i.e. two adjacent sequences overlap by 75%).
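The overlap figure follows directly from the window length and the step. The short sketch below works in integer milliseconds to avoid floating-point drift; the helper name and the 10 s example duration are illustrative assumptions.

```python
def window_starts_ms(duration_ms, window_ms=3200, step_ms=800):
    """Start times (in ms) of every full window that fits in the recording."""
    return list(range(0, duration_ms - window_ms + 1, step_ms))

# Fraction of a window shared with the next one: 1 - 800 / 3200.
overlap = 1 - 800 / 3200

# A hypothetical 10-second recording.
starts = window_starts_ms(10_000)
```

A recording shorter than one window yields an empty list, so such files would have to be padded or skipped.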

## Code

### Feature Extraction

We will use the Yaafe toolkit. (To list all available features, run _!yaafe -l_.) To learn how to use it, start with http://yaafe.github.io/Yaafe/manual/quickstart.html#quick-start-using-yaafe


In [1]:
# You can view a description of each feature (or output format) with the -d option:

!yaafe -d MFCC


    Compute the Mel-frequencies cepstrum coefficients [DM1980]_.

    Mel filter bank is built as 40 log-spaced filters according to
    the following mel-scale:

    .. math:: melfreq = 1127 * log(1 + \frac{freq}{700})

    Each filter is a triangular filter with height :math:`2/(f_{max}-f_{min})`.
    Then MFCCs are computed as following, using DCT II:

    .. math:: mfcc = dct(log(abs(fft(hanning(N).x)).MelFilterBank))

    .. [DM1980] S.B. Davis and P.Mermelstrin, *Comparison of parametric
                representations for monosyllabic word recognition in
                continuously spoken sentences.* IEEE Transactions on
                Acoustics, Speech and Signal Processing,
                28 :357-366, 1980.

    
Parameters are :
- CepsIgnoreFirstCoeff (default=1): 0 keeps the first cepstral coeffcient, 1 ignore it
- CepsNbCoeffs (default=13): Number of cepstral coefficient to keep.
- FFTWindow (default=Hanning): Weighting window to apply before fft. Hanning|Hamming|None
-

In [2]:
!yaafe -d Energy


    Compute energy as root mean square of an audio Frame.

    .. math:: en = \sqrt\frac{\sum_{i=0}^{N-1}x(i)^2}{N}

    
Parameters are :
- blockSize (default=1024): output frames size
- stepSize (default=512): step between consecutive frames


Let's determine blockSize and stepSize. 

The AMI audio is sampled at 16 kHz, i.e. 16 samples per millisecond. A 32 ms block therefore needs 16 × 32 = 512 samples, and a 16 ms step needs 16 × 16 = 256 samples.
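The same arithmetic as a small helper, in case another sample rate is used. The function name is a hypothetical choice.

```python
def ms_to_samples(duration_ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000))

block_size = ms_to_samples(32, 16000)  # window length in samples
step_size = ms_to_samples(16, 16000)   # hop between windows in samples
```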

We need these features:

- mfcc: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11
- mfcc_d1: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=1
- mfcc_d2: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=2
- energy_d1: Energy blockSize=512 stepSize=256  > Derivate DOrder=1
- energy_d2: Energy blockSize=512 stepSize=256  > Derivate DOrder=2

To extract all of these at once, we will use [this technique](http://yaafe.github.io/Yaafe/manual/quickstart.html#extract-several-features): in short, we write all the feature definitions into a single text file (a feature plan).

In [3]:
# Write the feature plan; the context manager closes the file for us.
with open("featureplan.txt", "w") as f:
    f.write("mfcc: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 \n"
            "mfcc_d1: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=1 \n"
            "mfcc_d2: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=2 \n"
            "energy_d1: Energy blockSize=512 stepSize=256  > Derivate DOrder=1 \n"
            "energy_d2: Energy blockSize=512 stepSize=256  > Derivate DOrder=2")

In [4]:
cat featureplan.txt

mfcc: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 
mfcc_d1: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=1 
mfcc_d2: MFCC blockSize=512 stepSize=256 CepsNbCoeffs=11 > Derivate DOrder=2 
energy_d1: Energy blockSize=512 stepSize=256  > Derivate DOrder=1 
energy_d2: Energy blockSize=512 stepSize=256  > Derivate DOrder=2

In [5]:
#!ls

In [6]:
!yaafe -c featureplan.txt -r 16000 a2002011001-e02-16kHz.wav -p Precision=8 -p Metadata=False

process file a2002011001-e02-16kHz.wav
ERROR: cannot open audio file a2002011001-e02-16kHz.wav: System error : No such file or directory.
ERROR: cannot initialize reader AudioFileReader for file a2002011001-e02-16kHz.wav


**With Librosa**

In [49]:
import librosa
import os
import glob
import numpy as np
import sys

def wav_to_matrix(filename, hop, win_len):
    """Return a (35, n_frames) feature matrix; hop and win_len are in milliseconds."""
    # NOTE: librosa.load resamples to 22050 Hz by default; pass sr=16000 to keep the AMI rate.
    audio, sr = librosa.load(filename)
    hop_length = int(hop * sr / 1000)        # hop in samples
    frame_length = int(win_len * sr / 1000)  # window in samples
    # https://github.com/librosa/librosa/issues/584
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=11, hop_length=hop_length, n_fft=frame_length)
    mfccs_d1 = librosa.feature.delta(mfccs)
    mfccs_d2 = librosa.feature.delta(mfccs, order=2)
    # librosa.feature.rms was named rmse before librosa 0.7.
    energy = librosa.feature.rms(y=audio, hop_length=hop_length, frame_length=frame_length)
    energy_d1 = librosa.feature.delta(energy)
    energy_d2 = librosa.feature.delta(energy, order=2)
    print(mfccs.shape)
    print(mfccs_d1.shape)
    print(mfccs_d2.shape)
    print(energy_d1.shape)
    print(energy_d2.shape)

    # 11 MFCCs + 11 deltas + 11 delta-deltas + 2 energy derivatives = 35 rows.
    return np.vstack((mfccs, mfccs_d1, mfccs_d2, energy_d1, energy_d2))

In [50]:
!ls

amicorpus
audio_files
audio_files_full
BiLSTM.ipynb
featureplan.txt
How to Read a Research Paper.mp3
ISCI_extractor.ipynb
Kafa Ayarı #5 - Hızlı Trenler ile Saatte 8000 Kilometre!.mp3_prediction.txt
LICENSE
npy_extractor.ipynb
npy_files
pyannote-audio
pyannote_reproduce_dev.ipynb
pyannote_reproduce.ipynb
README.md
Speaker2Vec_dev.ipynb
Speaker2Vec.ipynb


In [51]:
k = wav_to_matrix("How to Read a Research Paper.mp3", 10, 25)

(11, 52490)
(11, 52490)
(11, 52490)
(1, 52490)
(1, 52490)


In [52]:
k.shape

(35, 52490)

### Create Subsequences with Label

At this point, we need to build the training and test data together with their labels.
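One possible way to do this is sketched below. It is not the authors' implementation: it assumes a 10 ms hop, so that a 3.2 s subsequence spans 320 frames (the input length used by the model in the next section) and an 800 ms step spans 80 frames. The helper `make_subsequences` and the random stand-in data are hypothetical.

```python
import numpy as np

def make_subsequences(features, labels, window=320, step=80):
    """Cut a recording into overlapping, labelled subsequences.

    features: (n_features, n_frames), the layout returned by wav_to_matrix
    labels:   (n_frames,) array of 0/1 frame labels
    Returns X with shape (n_windows, window, n_features)
    and y with shape (n_windows, window, 1).
    """
    n_frames = features.shape[1]
    if n_frames < window:
        raise ValueError("recording is shorter than one window")
    starts = range(0, n_frames - window + 1, step)
    # Transpose each slice so that time becomes the first axis.
    X = np.stack([features[:, s:s + window].T for s in starts])
    y = np.stack([labels[s:s + window, None] for s in starts])
    return X, y

# Stand-in data: 1000 frames of 35 random features, one labelled change region.
rng = np.random.default_rng(0)
features = rng.normal(size=(35, 1000))
labels = np.zeros(1000, dtype=np.float32)
labels[495:506] = 1
X, y = make_subsequences(features, labels)
```

Because adjacent windows overlap by 75%, the same labelled frames appear in several subsequences. To avoid leaking overlapping audio between the two sets, it is safer to split into training and test data by recording rather than by window.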



### Deep Learning Architecture

We can load the model architecture directly from the .yml file provided by the paper's author.

However, I also want to write out every layer explicitly.

In [37]:
# Author's .yml files

!wget https://raw.githubusercontent.com/yinruiqing/change_detection/master/model/architecture.yml

--2018-07-16 10:39:09--  https://raw.githubusercontent.com/yinruiqing/change_detection/master/model/architecture.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3434 (3.4K) [text/plain]
Saving to: 'architecture.yml.4'


2018-07-16 10:39:09 (27.1 MB/s) - 'architecture.yml.4' saved [3434/3434]



In [38]:
# Load the architecture into a Keras model

from keras.models import model_from_yaml
with open('architecture.yml', 'r') as yaml_file:
    loaded_model_yaml = yaml_file.read()
model = model_from_yaml(loaded_model_yaml)



In [39]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
labeling_input (InputLayer)  (None, 320, 35)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 320, 64)           17408     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 320, 40)           13600     
_________________________________________________________________
timedistributed_1 (TimeDistr (None, 320, 40)           1640      
_________________________________________________________________
timedistributed_2 (TimeDistr (None, 320, 10)           410       
_________________________________________________________________
timedistributed_3 (TimeDistr (None, 320, 1)            11        
Total params: 33,069
Trainable params: 33,069
Non-trainable params: 0
_________________________________________________________________


In [42]:
import keras
import tensorflow as tf
from keras import layers
from keras.models import Model

# Each subsequence holds 320 frames of 35 features.
frame_shape = (320, 35)

# Network architecture
input_frame = keras.Input(frame_shape, name='main_input')

# Bidirectional concatenates the forward and backward outputs by default, so
# LSTM(64) and LSTM(40) yield 128 and 80 features per frame. The author's
# summary above shows 64 and 40, i.e. half as many units per direction.
bidirectional_1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(input_frame)
bidirectional_2 = layers.Bidirectional(layers.LSTM(40, activation='tanh', return_sequences=True))(bidirectional_1)

tdistributed_1 = layers.TimeDistributed(layers.Dense(40, activation='tanh'))(bidirectional_2)
tdistributed_2 = layers.TimeDistributed(layers.Dense(20, activation='tanh'))(tdistributed_1)
tdistributed_3 = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(tdistributed_2)

# Source: https://stackoverflow.com/questions/37743574/hard-limiting-threshold-activation-function-in-tensorflow
def step_activation(x):
    """Hard threshold: 0 where x < 0.4, 1 elsewhere."""
    threshold = 0.4
    cond = tf.less(x, tf.fill(value=threshold, dims=tf.shape(x)))
    out = tf.where(cond, tf.zeros(tf.shape(x)), tf.ones(tf.shape(x)))

    return out

# NOTE: this Dense layer applies its own weight and bias before thresholding, and
# no gradient flows through the hard threshold, so it only makes sense at inference time.
thresholded = layers.Dense(1, activation=step_activation, name='threshold_activation')(tdistributed_3)

model = Model(input_frame, thresholded)

In [41]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
main_input (InputLayer)      (None, 320, 35)           0         
_________________________________________________________________
bidirectional_51 (Bidirectio (None, 320, 128)          51200     
_________________________________________________________________
bidirectional_52 (Bidirectio (None, 320, 80)           54080     
_________________________________________________________________
time_distributed_76 (TimeDis (None, 320, 40)           3240      
_________________________________________________________________
time_distributed_77 (TimeDis (None, 320, 20)           820       
_________________________________________________________________
time_distributed_78 (TimeDis (None, 320, 1)            21        
_________________________________________________________________
threshold_activation (Dense) (None, 320, 1)            2         
Total para

In [None]:
# Save our model's architecture

model_yaml = model.to_yaml()
with open("model.yaml", "w") as yaml_file:
    yaml_file.write(model_yaml)

In [None]:
# Inspect the saved architecture

!cat model.yaml