In this notebook, we try to reproduce [the Speaker2Vec paper](http://scuba.usc.edu/pdf/jati2017_Speaker2Vec.pdf).

For a review of the paper, see [this post](https://hedonistrh.github.io/2018-07-09-Literature-Review-for-Speaker-Change-Detection/).

## Feature Extraction

_"We have adopted 40 dimensional high definition MFCC features extracted from 40 mel-spaced filters over a 25ms hamming window with a shift of 10ms using Kaldi toolkit"_. However, I will use [Librosa](https://librosa.github.io) instead of Kaldi.
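
To make the window and shift sizes concrete, here is a minimal sketch of the millisecond-to-sample conversion that Librosa's `hop_length` and `n_fft` arguments require. The helper names `ms_to_samples` and `num_frames` and the 16 kHz sampling rate are assumptions made for illustration. The frame count assumes no padding, whereas Librosa pads the signal by default and so returns a slightly different count.

```python
def ms_to_samples(ms, sr):
    """Convert a duration in milliseconds to a whole number of samples."""
    return ms * sr // 1000


def num_frames(n_samples, win, hop):
    """Number of full windows that fit in a signal when no padding is applied."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


sr = 16000                      # assumed sampling rate, for illustration only
win = ms_to_samples(25, sr)     # 25 ms window
hop = ms_to_samples(10, sr)     # 10 ms shift
frames_per_second = num_frames(sr, win, hop)
print(win, hop, frames_per_second)
```

With a 10 ms shift, one second of audio yields roughly 100 frames, which is why the paper treats 100 frames as about one second.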



## Prepare Data

_"We have used d = 100 frames (1s) for all training scenarios. This makes the size of input and output layers of the DNN models to be 4000."_

_"40 mel-spaced filters over a 25ms hamming window with a shift of 10ms"_
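
The 4000-dimensional input follows from 40 coefficients times 100 frames. The sketch below shows one way to cut a `(40, n_frames)` feature matrix into flattened window pairs, where each window is paired with the window that follows it. The function name `make_pairs` and the random stand-in features are assumptions for illustration, not the notebook's data.

```python
import numpy as np

N_MFCC = 40   # MFCC coefficients per frame
D = 100       # frames per window (about 1 s at a 10 ms shift)


def make_pairs(features, d=D):
    """Split a (n_mfcc, n_frames) matrix into flattened (input, target) windows."""
    inputs, targets = [], []
    start = 0
    while start + 2 * d <= features.shape[1]:
        inputs.append(features[:, start:start + d].ravel())
        targets.append(features[:, start + d:start + 2 * d].ravel())
        start += d
    return np.array(inputs), np.array(targets)


rng = np.random.default_rng(0)
features = rng.normal(size=(N_MFCC, 450))  # random stand-in for real MFCCs
x, y = make_pairs(features)
print(x.shape, y.shape)
```

Because the windows advance by `d` frames, the target of one pair is the input of the next pair.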

In [None]:
# This folder stores the wav files.
!mkdir wav_files
%cd wav_files

We will use some data from YouTube. To download the files, we use [youtube-dl](https://rg3.github.io/youtube-dl/index.html).

In [None]:
# https://askubuntu.com/questions/564567/how-to-download-playlist-from-youtube-dl
# https://www.slashgeek.net/2016/06/24/5-youtube-dl-tips-might-not-know/
!youtube-dl -f 140 -o "%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=PLImzRKDoJEgGR5Tv0zQ1SIS2rZOXOcCNB"

In [None]:
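# "!cd" runs in a subshell and does not change the notebook's directory, so use the %cd magic.
%cd smwav
# --audio-format takes a codec name, not a format code; wav matches the '*wav' pattern used below.
!youtube-dl --extract-audio --audio-format wav -o "%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=PLLJkwRbZqtbkv6mliNJpEAsMQVMFd8P9N"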

In [None]:
!ls

In [1]:
import librosa
import os
import glob
import numpy as np
import sys

def wav_to_matrix(filename, hop, win_len):
    """Return a (40, n_frames) MFCC matrix; hop and win_len are in milliseconds."""
    audio, sr = librosa.load(filename)  # resamples to librosa's default rate
    # Convert milliseconds to samples, see https://github.com/librosa/librosa/issues/584
    hop_length = int(hop / 1000 * sr)
    n_fft = int(win_len / 1000 * sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40,
                                 hop_length=hop_length, n_fft=n_fft)
    return mfccs

root_dir = './wav_files'

def create_data_for_unsupervised(root_dir, hop, win_len):
    """Build (input, output) pairs: each 100-frame window is paired with the next one."""
    all_wav_paths = glob.glob(os.path.join(root_dir, '*wav'))
    matrices = []
    print(all_wav_paths)

    for single_wav_path in all_wav_paths:
        print(single_wav_path)
        matrix_of_single_wav = wav_to_matrix(single_wav_path, hop, win_len)
        print(matrix_of_single_wav.shape)
        matrices.append(matrix_of_single_wav)
        print(single_wav_path + " is done.")

    # Concatenate along the time axis. Flattening each file and reshaping the
    # joined vector to (40, -1) would mix coefficients across rows.
    wav_array = np.concatenate(matrices, axis=1)
    print(wav_array.shape)

    input_array = []
    output_array = []
    start_point = 0

    while start_point + 200 < wav_array.shape[1]:
        single_input = wav_array[:40, start_point:start_point + 100]
        single_output = wav_array[:40, start_point + 100:start_point + 200]
        input_array.append(single_input)
        output_array.append(single_output)
        start_point += 100

    # Flatten every (40, 100) window into a 4000-dimensional vector.
    input_array = np.asarray(input_array)
    input_array = input_array.reshape((len(input_array), -1))
    print(input_array.shape)

    output_array = np.asarray(output_array)
    output_array = output_array.reshape((len(output_array), -1))
    print(output_array.shape)

    return (input_array, output_array)

In [3]:
input_array, output_array = create_data_for_unsupervised('./smwav', 10, 25)

['./smwav/Kezzo - Bekliyorken Ölmeyi [Official Audio] #Kafamınİçi.wav', './smwav/Kezzo - Çare Değil (Ft. EYPİO & No.1) [Official Audio] #Kafamınİçi.wav']
./smwav/Kezzo - Bekliyorken Ölmeyi [Official Audio] #Kafamınİçi.wav
(40, 17093)
./smwav/Kezzo - Bekliyorken Ölmeyi [Official Audio] #Kafamınİçi.wav is done.
./smwav/Kezzo - Çare Değil (Ft. EYPİO & No.1) [Official Audio] #Kafamınİçi.wav
(40, 24055)
./smwav/Kezzo - Çare Değil (Ft. EYPİO & No.1) [Official Audio] #Kafamınİçi.wav is done.
(40, 41148)
(410, 4000)
(410, 4000)


**Let's focus on the AMI Corpus**

Each line of [dev.mdtm](https://github.com/pyannote/pyannote-db-odessa-ami/tree/master/AMI/data/speaker_diarization) contains:
    - file name, channel, offset, duration, the fixed fields "speaker NA unknown", and the speaker ID

We should focus on this. We can download the corresponding .wav files easily. 
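
As a minimal sketch of how one such line can be read, the snippet below splits a line into named fields and derives the segment end time. The `Segment` tuple, the function name `parse_mdtm_line`, and the example line are made up for illustration; the values are not taken from dev.mdtm.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "filename channel offset duration speaker")


def parse_mdtm_line(line):
    """Parse one whitespace-separated MDTM line into a Segment."""
    fields = line.split()
    return Segment(
        filename=fields[0],
        channel=fields[1],
        offset=float(fields[2]),    # start time in seconds
        duration=float(fields[3]),  # length in seconds
        speaker=fields[-1],         # speaker ID is the last field
    )


# Made-up line in the MDTM layout, not a real entry of dev.mdtm.
example = "meeting01 1 12.50 3.25 speaker NA unknown spk_A"
seg = parse_mdtm_line(example)
end_time = seg.offset + seg.duration
print(seg.filename, seg.speaker, end_time)
```

The end time of a segment (offset plus duration) is the quantity we need as ground truth, since a speaker turn ends there.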

First, I will create the ground-truth database for speaker change detection. The code will stay unoptimized until testing is finished.

In [None]:
# !ls

In [None]:
from keras.datasets import mnist
import numpy as np
(x_train, _), (x_test, _) = mnist.load_data()

In [None]:
print (x_train.shape)
print (x_test.shape)

x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
print (x_train.shape)

In [None]:
mkdir txt_files

In [4]:
import sys
import numpy as np

def create_ground_truth(main_set):
    """Write one '<filename>_end_time.txt' per recording listed in an MDTM file.

    Each output file holds the end time (offset + duration, in seconds)
    of every segment of that recording.
    """
    with open(main_set) as f:
        # Strip the trailing newline and skip empty lines.
        content = [x.strip() for x in f if x.strip()]

    end_time_array = []
    tmp_filename = content[0].split(' ')[0]

    for single_line in content:
        fields = single_line.split(' ')
        filename = fields[0]
        if filename != tmp_filename:
            # A new recording starts: save the previous one and reset.
            np.savetxt(fname=tmp_filename + "_end_time.txt", X=end_time_array, delimiter=' ', fmt='%1.3f')
            end_time_array = []
            tmp_filename = filename

        offset = float(fields[2])
        duration = float(fields[3])
        end_time_array.append(offset + duration)

    # Save the last recording too; the loop only saves when the filename changes.
    np.savetxt(fname=tmp_filename + "_end_time.txt", X=end_time_array, delimiter=' ', fmt='%1.3f')

In [6]:
cd txt_files

/home/herdogan/Desktop/SpChangeDetect/txt_files


In [5]:
!wget https://raw.githubusercontent.com/pyannote/pyannote-db-odessa-ami/master/AMI/data/speaker_diarization/dev.mdtm

--2018-07-18 15:30:24--  https://raw.githubusercontent.com/pyannote/pyannote-db-odessa-ami/master/AMI/data/speaker_diarization/dev.mdtm
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 580919 (567K) [text/plain]
Saving to: ‘dev.mdtm’


2018-07-18 15:30:24 (16,5 MB/s) - ‘dev.mdtm’ saved [580919/580919]



In [7]:
create_ground_truth('dev.mdtm')

In [8]:
cd ..

/home/herdogan/Desktop/SpChangeDetect


## Deep Learning Architectures

- The paper uses two different autoencoders.
    - The smaller one is 4000 → 2000 → 40 → 2000 → 4000
    - The larger one is 4000 → 6000 → 2000 → 40 → 2000 → 6000 → 4000
![Image of Autoencoder](https://docs.google.com/uc?id=1epse9ba1fRTdmyN3pF0XoECEHcaiLsa3)
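
To get a feeling for the size of these two networks, the sketch below counts the weights and biases of a stack of fully connected layers. The helper name `dense_param_count` is an assumption for illustration, and the count assumes plain dense layers with one bias per unit.

```python
def dense_param_count(layer_sizes):
    """Total weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


smaller = [4000, 2000, 40, 2000, 4000]
larger = [4000, 6000, 2000, 40, 2000, 6000, 4000]

print(f"smaller: {dense_param_count(smaller):,}")
print(f"larger:  {dense_param_count(larger):,}")
```

Under these assumptions the smaller network has about 16.2 million parameters and the larger one about 72.2 million.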


In [9]:
import keras
from keras import layers
from keras.models import Model


input_frame = keras.Input(shape=(4000,), name='main_input')
# Note: this extra 4000-unit layer is not part of the smaller architecture listed above.
encoded_1 = layers.Dense(4000, activation='relu', name='encoded_1')(input_frame) # not quite sure about the activation
encoded_2 = layers.Dense(2000, activation='relu', name='encoded_2')(encoded_1)

embedding = layers.Dense(40, activation='relu', name='embedding')(encoded_2)

decoded_1 = layers.Dense(2000, activation='relu', name='decoded_1')(embedding)
decoded_2 = layers.Dense(4000, activation='linear', name='decoded_2')(decoded_1)

autoencoder = Model(input_frame, decoded_2)
# Note: MFCCs can be negative, and Keras clips negative values inside this loss,
# so 'mean_squared_error' may be a better fit here.
autoencoder.compile(optimizer='adadelta', loss='mean_squared_logarithmic_error')

autoencoder.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
main_input (InputLayer)      (None, 4000)              0         
_________________________________________________________________
encoded_1 (Dense)            (None, 4000)              16004000  
_________________________________________________________________
encoded_2 (Dense)            (None, 2000)              8002000   
_________________________________________________________________
embedding (Dense)            (None, 40)                80040     
_________________________________________________________________
decoded_1 (Dense)            (None, 2000)              82000     
_________________________________________________________________
decoded_2 (Dense)            (None, 4000)              8004000   
Total params: 32,172,040
Trainable params: 32,172,040
Non-trainable params: 0
________________________________________________________________

In [10]:
autoencoder.fit(input_array, output_array,
               epochs=2,
               batch_size=32,
               shuffle=True)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f4ed532f588>

**Let's try to detect speaker change points. We will use the encoder part of the autoencoder.**
- First, extract the embeddings,
- Compare neighboring embeddings with a distance measure, trying different parameters,
- If the result of the comparison exceeds a chosen threshold, it marks a speaker change point.
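
The steps above can be sketched on toy data as follows. This sketch uses a normalized, symmetric variant of the KL divergence, which is an assumption made here for illustration and not necessarily the measure used in the paper. The names `symmetric_kl` and `detect_changes` and the toy embeddings are likewise hypothetical stand-ins for real encoder outputs.

```python
import numpy as np


def symmetric_kl(p, q, eps=1e-5):
    """Symmetric KL divergence between two non-negative vectors.

    Both vectors are smoothed with eps and normalized to sum to 1 first,
    so they can be treated as discrete probability distributions.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def detect_changes(embeddings, threshold):
    """Return indices i where embeddings[i] and embeddings[i + 1] differ strongly."""
    return [i for i in range(len(embeddings) - 1)
            if symmetric_kl(embeddings[i], embeddings[i + 1]) > threshold]


# Toy embeddings standing in for encoder outputs: two "speakers", A then B.
speaker_a = np.array([1.0, 0.0, 0.0, 1.0])
speaker_b = np.array([0.0, 1.0, 1.0, 0.0])
embeddings = [speaker_a, speaker_a, speaker_b, speaker_b]
print(detect_changes(embeddings, threshold=0.6))
```

An index `i` in the result means the change lies between segment `i` and segment `i + 1`. The threshold has to be tuned on real data.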

In [11]:
embedding_system = Model(input_frame, embedding) # encoder-only model that extracts the embedding

Now we have the embedding system. We should feed input frames to it.

In [12]:
def KL(P, Q):
    """KL divergence sum(P * log(P / Q)) between two non-negative arrays.

    Epsilon is added to avoid conditional code for checking that neither
    P nor Q is equal to 0. P and Q are not normalized here.
    """
    epsilon = 0.00001

    # P + epsilon creates new arrays, so the caller's arrays are not modified.
    P = P + epsilon
    Q = Q + epsilon

    divergence = np.sum(P * np.log(P / Q))
    return divergence

def create_prediction(wav_file, hop, win_len, threshold, embedding_system):
    """Return the predicted speaker change points (in seconds) and save them to a text file."""
    prediction_array = []
    audio, sr = librosa.load(wav_file)
    # https://github.com/librosa/librosa/issues/584
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40, hop_length=int(hop/1000*sr), n_fft=int(win_len/1000*sr))
    # Now we have the MFCCs of the whole audio file.
    # Compare each 100-frame segment with the next, non-overlapping one.
    offset = 0
    while (offset+200 < mfccs.shape[1]):
        first_frame = np.ravel(mfccs[0:40,offset:offset+100])
        first_frame = np.expand_dims(first_frame, axis=0)
        first_embed = embedding_system.predict(first_frame)

        second_frame = np.ravel(mfccs[0:40,offset+100:offset+200])
        second_frame = np.expand_dims(second_frame, axis=0)
        second_embed = embedding_system.predict(second_frame)

        if (KL(first_embed, second_embed) > threshold):
            # End point of the first segment, in milliseconds.
            ms_version = float(win_len + ((offset+100) * hop))
            prediction_array.append(ms_version/1000) # stored in seconds
        offset += 100
    prediction_array = np.asarray(prediction_array)
    print ("cx") # debug marker
    np.savetxt(fname=wav_file + "_prediction.txt", X=prediction_array, delimiter=' ')

    return prediction_array

**Now, implement the transfer learning part. It serves as unsupervised domain adaptation.**
- Find the speaker change points by trained DNN,
- Get all possible speaker homogeneous regions,
- Retrain the same DNN again on these homogeneous segments of speech.
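
The second step can be sketched without any audio: given the predicted change points in seconds, keep only the 200-frame spans (input window plus output window) that contain none of them. The helper names `frame_to_seconds` and `homogeneous_starts` and the example change point are assumptions for illustration. The time formula mirrors the one used for the predictions, window length plus frame index times shift.

```python
def frame_to_seconds(frame_idx, hop_ms, win_len_ms):
    """Approximate time (in seconds) at frame index `frame_idx`."""
    return (win_len_ms + frame_idx * hop_ms) / 1000.0


def homogeneous_starts(n_frames, change_points_s, hop_ms=10, win_len_ms=25, d=100):
    """Start frames of 2*d-frame spans that contain no predicted change point."""
    starts = []
    start = 0
    while start + 2 * d <= n_frames:
        begin_s = frame_to_seconds(start, hop_ms, win_len_ms)
        end_s = frame_to_seconds(start + 2 * d, hop_ms, win_len_ms)
        if not any(begin_s <= cp < end_s for cp in change_points_s):
            starts.append(start)
        start += d
    return starts


# Hypothetical change point at 3.0 s in a 600-frame recording.
print(homogeneous_starts(600, change_points_s=[3.0]))
```

Only the spans starting at the returned frames would be used to retrain the network.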

In [13]:
change_points = create_prediction('test.wav', hop=10, win_len=25, threshold=0.6, embedding_system=embedding_system)

# This is for unsupervised adaptation.

# Not optimized at all: every file is run through the network window by window.
def create_data_for_unsuper_adaptation(root_dir, hop, win_len, threshold, embedding_system):
    """Build training pairs only from spans without a predicted speaker change."""
    all_wav_paths = glob.glob(os.path.join(root_dir, '*wav'))

    input_array = []
    output_array = []

    for single_wav_path in all_wav_paths:
        # Predicted change points, in seconds.
        prediction_array = create_prediction(single_wav_path, hop, win_len, threshold, embedding_system)
        matrix_of_single_wav = wav_to_matrix(single_wav_path, hop, win_len)

        start_point = 0
        while (start_point+200 < matrix_of_single_wav.shape[1]):
            # Time span (in seconds) covered by the input and output windows.
            span_start = (win_len + start_point * hop) / 1000
            span_end = (win_len + (start_point+200) * hop) / 1000
            # Keep the pair only if no predicted change point falls inside the span.
            change = any(span_start <= single_change < span_end
                         for single_change in prediction_array)
            if not change:
                single_input = matrix_of_single_wav[0:40,start_point:start_point+100]
                single_output = matrix_of_single_wav[:40, start_point+100:start_point+200]
                input_array.append(single_input)
                output_array.append(single_output)
            start_point += 100

    # Flatten every (40, 100) window into 4000 values; this also works if no window was kept.
    input_array = np.asarray(input_array).reshape((-1, 40 * 100))
    output_array = np.asarray(output_array).reshape((-1, 40 * 100))

    print(output_array.shape)

    return (input_array, output_array)



    

cx
