I will use two different methods to build a voice activity detector (VAD).

- CNN architecture with MFCC features which is based on [this paper.](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8278160)

- Pitch estimation which is based on [CREPE library.](https://github.com/marl/crepe)

Before these, we need to create our dataset.

------

## Create Dataset

We need to download speech and noise files. I will use:
- Snips Validation Dataset for speech
- [ESC-50 for noise](https://github.com/karoldvl/ESC-50)

    - **For Speech:** We could also use a different corpus such as [AMI Corpus](http://groups.inf.ed.ac.uk/ami/corpus/). However, with limited resources I need a small dataset, so I chose the Snips Validation Dataset.
    - **For Noise:** This dataset includes many different types of noise, so the model should be more robust than one trained on only a few noise types.

In [None]:
!wget https://s3.amazonaws.com/snips/challenge/vad_data.tar.gz
!wget https://github.com/karoldvl/ESC-50/archive/master.zip

In [None]:
# Uncompress files.

!tar xzf vad_data.tar.gz 

import zipfile

with zipfile.ZipFile("master.zip","r") as zip_noise:
    zip_noise.extractall(".")
    
!mkdir outputs_audio # to store combinations of noise and speech files.
!mkdir outputs_txt

- I will use speech from the Snips Validation set and noise from ESC-50. First, I want to create a noisy audio file that includes both speech and non-speech segments. To create the non-speech segments, I will insert a silent segment between two speech segments.
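The bookkeeping behind this step can be sketched without any audio library. The helper below is only an illustration: `build_timeline` is a hypothetical name, and a fixed `silence_ms` stands in for the random silence length. It computes the label timeline in milliseconds for the concatenated output, assuming the original gaps between segments are dropped and a silent block is inserted wherever the gap exceeds the threshold.

```python
def build_timeline(segments, threshold, silence_ms):
    """Return (label, start_ms, end_ms) tuples for the output audio.

    segments: list of (start_s, end_s) speech segments, in seconds.
    threshold: gap in seconds above which silence is inserted.
    silence_ms: length of each inserted silent block (fixed here for clarity).
    """
    labels = []
    cursor = 0  # current position in the output, in milliseconds
    for ix, (start, end) in enumerate(segments):
        duration = int(round((end - start) * 1000))
        labels.append(("Speech", cursor, cursor + duration))
        cursor += duration
        is_last = ix == len(segments) - 1
        if not is_last and segments[ix + 1][0] - end > threshold:
            labels.append(("Noise", cursor, cursor + silence_ms))
            cursor += silence_ms
    return labels


timeline = build_timeline([(0.0, 1.0), (1.5, 2.0), (2.02, 3.0)],
                          threshold=0.1, silence_ms=700)
for entry in timeline:
    print(entry)
```

Only the first gap (0.5 s) exceeds the 0.1 s threshold, so a single "Noise" entry appears between the first two speech entries.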

In [None]:
from pydub import AudioSegment
import json
import IPython.display as ipd
import numpy as np
import matplotlib.pyplot as pp
import math
import glob, os

def create_single_entry(speech_file, noise_file, threshold,
                       noise_supression):
    """With this function, we can create a noisy wav file which
    includes both speech and non-speech segments. The segment times
    will also be saved into a txt file.

    Arguments:
    speech_file: Which speech file will be used
    noise_file: Which noise file will be used
    threshold: if the gap between two consecutive segments is bigger than
        this value (in seconds), we will add more silence (non-speech segment)
    noise_supression: by how many dB we will attenuate the noise"""
    
    speech_filename=speech_file.split('/')[-1].split('.')[0]
    noise_filename=noise_file.split('/')[-1].split('.')[0]
    
    # Read the wav file 
    song = AudioSegment.from_wav(speech_file) 

    # Read the segments from json file and create array to store
    # end times and start times

    with open(speech_file[:-3] + 'json') as f:
        data = json.load(f)

    length = len(data["speech_segments"])

    start_time_array = []
    end_time_array = []

    for ix in range(0, length):
        start_time = (data["speech_segments"][ix]["start_time"])
        end_time = (data["speech_segments"][ix]["end_time"])
        start_time_array.append(start_time)
        end_time_array.append(end_time)

    # Add more silence between two segments whose gap is bigger
    # than our threshold. Also create the txt file to store
    # the times of the speech and non-speech segments.

    output = AudioSegment.empty() # to hold output audio segment

    f = open("./outputs_txt/" + speech_filename + "_" + noise_filename + ".txt", "w")

    prev_end_time = 0

    # Gap (in seconds) between the end of each segment and the start of the next.
    # There is one gap per pair of consecutive segments, so the loop below
    # processes every segment except the last one.
    difference_between_segments = (np.asarray(start_time_array[1:]) -
                                   np.asarray(end_time_array[:-1]))

    total_silence_duration = 0
    offset = start_time_array[0]

    for ix_diff in range(0, len(difference_between_segments)):

        if (difference_between_segments[ix_diff] > threshold):
            silence_duration = np.random.randint(700, high=1800)

            random_silence = AudioSegment.silent(duration=silence_duration, frame_rate = 16000)

            start_time = start_time_array[ix_diff]
            end_time = end_time_array[ix_diff] 

            output += (song[int(start_time*1000):int(end_time*1000)] + random_silence)

            f.write('Speech: ' + str(total_silence_duration + int((start_time-offset)*1000)) +
                " " + str(total_silence_duration + int((end_time-offset)*1000)) + "\n")
            f.write('Noise: ' + str(total_silence_duration + int((end_time-offset)*1000)) + 
                    " " + str(total_silence_duration + int((end_time-offset)*1000) + silence_duration) + "\n")

            total_silence_duration += silence_duration

        else: 
            start_time = start_time_array[ix_diff]
            end_time = end_time_array[ix_diff] 
            output += (song[int(start_time*1000):int(end_time*1000)])
            f.write('Speech: ' + str(total_silence_duration + int((start_time-offset)*1000)) +
                    " " + str(total_silence_duration + int((end_time-offset)*1000)) + "\n")

        offset += difference_between_segments[ix_diff]
    

    f.close()

    # Add noise to whole file

    noise = AudioSegment.from_wav(noise_file) 
    noise = noise - noise_supression

    played_together = output.overlay(noise, loop=True)
    played_together = played_together.set_frame_rate(16000) # https://github.com/jiaaro/pydub/issues/232

    # Save the file as wav

    file_handle = played_together.export("./outputs_audio/" + speech_filename + "_" + noise_filename + ".wav", format="wav")
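The `noise - noise_supression` line lowers the noise level by that many decibels. As a rough guide to what the two values used in this notebook (8 and 4) mean, the sketch below converts a gain in dB to a linear amplitude ratio using the standard relation ratio = 10^(dB/20). The function name `db_to_amplitude_ratio` is only illustrative.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to a linear amplitude ratio."""
    return 10 ** (gain_db / 20)


for suppression_db in (4, 8):
    ratio = db_to_amplitude_ratio(-suppression_db)
    print(f"-{suppression_db} dB -> amplitude x {ratio:.3f}")
```

Suppressing the noise by 8 dB leaves it at roughly 40% of its original amplitude, while 4 dB leaves it at roughly 63%.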

Now, we can test this function.

In [None]:
create_single_entry("125-121342-0019.wav", "5-178997-A-24.wav", 0.1, 8)

ipd.Audio("./outputs_audio/125-121342-0019_5-178997-A-24.wav")

Our function works for one speech file and one noise file. Now, we need to create the whole dataset.

In [None]:
root_dir_speech = glob.glob(os.path.join("./vad_data/", '*wav'))

root_dir_noise = glob.glob(os.path.join('./ESC-50-master/audio/', '*wav'))

In [None]:
print ('How many speech: ', len(root_dir_speech))
print ('How many noise: ', len(root_dir_noise))

Before creating the dataset, I want to check the speech data.

In [None]:
def investigate_folder(folder):
    """Return the length of every wav file in folder, in milliseconds."""
    # Keep the list local so that repeated calls do not accumulate results.
    array_for_length = []
    
    for file in folder:
        
        # Read the wav file 
        audio = AudioSegment.from_wav(file)

        # len() of an AudioSegment is its duration in milliseconds
        length = len(audio)
        # Store the length in array
        array_for_length.append(length)
        
    return array_for_length

In [None]:
length_of_speechs = investigate_folder(root_dir_speech)

pp.plot(length_of_speechs)

We can remove the files that are shorter than 5 seconds.

In [None]:
def remove_short_speech_files(folder, threshold):
    total_num = 0
    for file in folder:
    
        filename=file.split('/')[-1].split('.')[0]    
        
        # Read the wav file 
        audio = AudioSegment.from_wav(file)

        # Learn the length of audio in milliseconds
        length = len(audio)
        
        if (length<threshold):
            
            os.remove(file)
            total_num += 1

            print (filename, " has been deleted: ", length, " milliseconds.")
            
    print (total_num, " files have been deleted.")

In [None]:
remove_short_speech_files(root_dir_speech, 5000)

Now we can create our dataset by mixing each speech file with several randomly chosen noise files.

In [None]:
def create_dataset(speech_folder, noise_folder, how_many_noise, threshold,
                  noise_supression):
    """This function takes a speech folder and a noise folder and creates
    new files by combining files from these folders.

    Arguments:
    speech_folder: list of speech wav paths (each with a matching json file)
    noise_folder: list of noise wav paths
    how_many_noise: how many noise files will be mixed with each speech file
    threshold: if the gap between two consecutive segments is bigger than
        this value, we will add more silence (non-speech segment)
    noise_supression: by how many dB we will attenuate the noise"""
    
    
    # Use at most 750 speech files; fewer if some were removed as too short.
    for ix in range(min(750, len(speech_folder))):
        speech_file = speech_folder[ix]
        
        # Pick how_many_noise random noise files (ESC-50 has 2000 clips).
        
        which_noise_files_ix = np.random.randint(0, len(noise_folder), how_many_noise)
        
        for single_noise_file_ix in which_noise_files_ix:
        
            noise_file = noise_folder[single_noise_file_ix]
            
            create_single_entry(speech_file, noise_file, threshold,
                               noise_supression)

In [None]:
root_dir_speech = glob.glob(os.path.join("./vad_data/", '*wav'))

root_dir_noise = glob.glob(os.path.join('./ESC-50-master/audio/', '*wav'))

In [None]:
create_dataset(root_dir_speech, root_dir_noise,
              10, 0.03, 4)

-------

# Deep Learning Architectures

### CNN which is based on MFCC feature.

- Create MFCC feature array.

In [None]:
!mkdir mfcc_storage # to store mfcc files.

In [None]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

from numpy import genfromtxt
import numpy as np
import os
import glob
import math
import sys

def create_numpy_for_audio(audio_file, featureplan, sr=16000):
    """This function is based on YAAFE. It returns a 2D array which holds the features of the audio file. 
    It also saves the numpy array.
    
    Its arguments:
    featureplan: Text file which specifies which features will be extracted. 
    audio_file: Path of audio file, it can be wav, mp3, ogg etc.
    sr: Sample rate for audio file. Default as 16000."""
    
    !yaafe -c $featureplan -r $sr $audio_file -p Precision=6 -p Metadata=False -n
    filename = (audio_file.split("/")[-1]).split(".")[0]

    my_data = genfromtxt(audio_file + ".mfcc.csv", delimiter=',')
    my_data = np.append(my_data, genfromtxt(audio_file + ".mfcc_d1.csv", delimiter=','), axis=1)
    my_data = np.append(my_data, genfromtxt(audio_file + ".mfcc_d2.csv", delimiter=','), axis=1)

    my_data = np.append(my_data, np.expand_dims(genfromtxt(audio_file + ".energy_d1.csv", delimiter=','), axis=1), axis=1)
    my_data = np.append(my_data, np.expand_dims(genfromtxt(audio_file + ".energy_d2.csv", delimiter=','), axis=1), axis=1)

    # Previous codes creates csv file for features to load numpy array. After that, we can 
    # remove them.
    
    os.remove(audio_file + ".mfcc.csv")
    os.remove(audio_file + ".mfcc_d1.csv")
    os.remove(audio_file + ".mfcc_d2.csv")
    os.remove(audio_file + ".energy_d1.csv")
    os.remove(audio_file + ".energy_d2.csv")
    np.save('./mfcc_storage/' + filename, my_data)

    return my_data

In [None]:
# Write the YAAFE feature plan: a mel spectrum with 400-sample blocks and 160-sample steps.
with open("featureplan_new.txt", "w") as f:
    f.write("mels: MelSpectrum blockSize=400 stepSize=160")
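The `blockSize` and `stepSize` values in the feature plan are given in samples. Assuming the 16 kHz sample rate used throughout this notebook, the short calculation below converts them to milliseconds. The constant and function names are only illustrative.

```python
SAMPLE_RATE = 16000  # Hz, the rate used for the audio files in this notebook
BLOCK_SIZE = 400     # samples per analysis window (blockSize)
STEP_SIZE = 160      # samples between consecutive windows (stepSize)


def samples_to_ms(samples, sample_rate):
    """Convert a number of samples to a duration in milliseconds."""
    return 1000 * samples / sample_rate


window_ms = samples_to_ms(BLOCK_SIZE, SAMPLE_RATE)
hop_ms = samples_to_ms(STEP_SIZE, SAMPLE_RATE)
print("window:", window_ms, "ms, hop:", hop_ms, "ms")
```

These are the same 25 ms window and 10 ms hop that appear as the `win_len` and `hop` defaults when the labels are built.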

In [None]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

from numpy import genfromtxt
import numpy as np
import os
import glob
import math
import sys

def create_numpy_for_audio_new(audio_file, featureplan, sr=16000):
    """This function is based on YAAFE. It returns a 2D array which holds the features of the audio file. 
    It also saves the numpy array.
    
    Its arguments:
    featureplan: Text file which specifies which features will be extracted. 
    audio_file: Path of audio file, it can be wav, mp3, ogg etc.
    sr: Sample rate for audio file. Default as 16000."""
    
    !yaafe -c $featureplan -r $sr $audio_file -p Precision=6 -p Metadata=False -n
    filename = (audio_file.split("/")[-1]).split(".")[0]

    my_data = genfromtxt(audio_file + ".mels.csv", delimiter=',')
   
    os.remove(audio_file + ".mels.csv")
    np.save('./mfcc_storage/' + filename, my_data)

    return my_data

This function saves the extracted features of a single audio file. We need to apply it to the whole dataset.

In [None]:
def create_dataset_mfcc(root_dir):
    """With this function, we will store each file's extracted features
    (the mel spectrum defined in featureplan_new.txt) as a numpy array.
    Arguments:
    root_dir: list of paths of the audio files."""
    
    for file in root_dir:
        create_numpy_for_audio_new(file, featureplan="featureplan_new.txt")

In [None]:
root_dir_out_speech = glob.glob(os.path.join("./outputs_audio/", '*wav'))

In [None]:
create_dataset_mfcc(root_dir_out_speech)

We also have txt files which hold the times of the speech and non-speech segments. We need to transform this information into a per-frame label array.

In [None]:
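def create_output_mfcc(filename, shape, win_len=25, hop=10):
    """With this function, we will create the per-frame output (labels)
    of a file for the mfcc feature extraction type of input.

    Arguments:
    filename: Which txt file includes the information
    for the speech and non-speech parts.
    shape: Number of frames, taken from the shape of the feature array.
    win_len: Length of the analysis window in milliseconds.
    hop: Step between consecutive windows in milliseconds."""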

    output_array = np.zeros(shape)
    
    with open(filename) as f:
        content = f.readlines()
    
    content = [x.strip() for x in content]

    for i in content:
        type_of, start, end = i.split(" ")
        if (type_of == 'Speech:'):
            which_start_hop = (int(start) - win_len) / hop
            which_end_hop = (int(end) - win_len) / hop

            start_location = math.ceil(which_start_hop)

            if (start_location<0): # to get rid confusion at first frame
                start_location=0

            end_location = math.ceil(which_end_hop)

            output_array[start_location:end_location] = 1.0
    return output_array
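The frame arithmetic used above can be checked on a few values. In this sketch, frame `k` is assumed to cover the interval from `k * hop` to `k * hop + win_len` milliseconds, so `ceil((t - win_len) / hop)` is the first frame whose window reaches time `t`. The helper name `ms_to_frame` is hypothetical.

```python
import math


def ms_to_frame(time_ms, win_len=25, hop=10):
    """Map a time in milliseconds to a frame index, clamped at 0."""
    frame = math.ceil((time_ms - win_len) / hop)
    return max(frame, 0)


# A speech segment from 300 ms to 1000 ms
start_frame = ms_to_frame(300)
end_frame = ms_to_frame(1000)
print("frames", start_frame, "to", end_frame - 1, "are labelled as speech")
```

With the defaults, a segment from 300 ms to 1000 ms maps to the slice `[28:98]`, that is, 70 frames of 10 ms each.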

Now, we can create our Deep Learning Architecture.

In [None]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
import keras

from keras import layers
from keras import models
from keras import optimizers
from keras.models import Model
import tensorflow as tf
from keras.utils.generic_utils import get_custom_objects
from keras.models import load_model
import time

frame_shape = (240, 40, 1)

input_frame = keras.Input(frame_shape, name='main_input')

conv1 = layers.Conv2D(40, kernel_size=(5, 5), strides=(1, 1), padding="same",
                     kernel_initializer="TruncatedNormal",
                     bias_initializer="TruncatedNormal")(input_frame)
conv1 = layers.LeakyReLU()(conv1)
conv1_BN = layers.BatchNormalization()(conv1)


conv2 = layers.Conv2D(20, kernel_size=(5, 5), strides=(1, 1), padding="same",
                     kernel_initializer="TruncatedNormal",
                     bias_initializer="TruncatedNormal")(conv1_BN)
conv2 = layers.LeakyReLU()(conv2)
conv2_BN = layers.BatchNormalization()(conv2)


conv3 = layers.Conv2D(10, kernel_size=(5, 5), strides=(1, 1), padding="same",
                     kernel_initializer="TruncatedNormal",
                     bias_initializer="TruncatedNormal")(conv2_BN)
conv3 = layers.LeakyReLU()(conv3)
conv3_BN = layers.BatchNormalization()(conv3)


xx = layers.TimeDistributed(layers.Flatten())(conv3_BN)


tdistributed_1 = layers.TimeDistributed(layers.Dense(100, kernel_initializer='TruncatedNormal',
                bias_initializer='TruncatedNormal'))(xx)
tdistributed_1 = layers.LeakyReLU()(tdistributed_1)
tdistributed_1_BN = layers.BatchNormalization()(tdistributed_1)
tdistributed_1_drop = layers.Dropout(0.5)(tdistributed_1_BN)

tdistributed_2 = layers.TimeDistributed(layers.Dense(1, activation='sigmoid', 
                                                     kernel_initializer='TruncatedNormal',
                                            bias_initializer='TruncatedNormal'))(tdistributed_1_drop)

model = Model(input_frame, tdistributed_2)

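# Two optimizer options; only the one passed to compile() is used.
adagrad = keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)

nadam = keras.optimizers.Nadam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=0.00001, schedule_decay=0.0002)

# Pass the optimizer object; the string "Nadam" would ignore the settings above.
model.compile(loss='binary_crossentropy', optimizer=nadam, metrics=['accuracy'])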

In [None]:
model.summary()

Now, we need to train this system. Before training, we need to define a function to load our data.

In [None]:
def prepare_data_to_train(root_dir_speech, from_ix, to_ix):
    
    input_array = []
    output = []
    
    for ix in range(from_ix, to_ix):
        feature_vector = np.load(root_dir_speech[ix]) 
        feature_array = np.ravel(feature_vector)
        input_array.extend(feature_array)

        filename_txt = (str(root_dir_speech[ix]).split("/")[-1])[:-3] + "txt"
        output_array = create_output_mfcc("./outputs_txt/" + filename_txt, 
                                          shape=feature_vector.shape[0])
        
        # print ('Input File: ', root_dir_speech[ix])
        # print ('Output File: ', filename_txt)


        output.extend(output_array)
        
    input_array = np.reshape(input_array, (-1, 40))
        
    output_array = np.asarray(output)
    output_array = np.expand_dims(output_array, axis=1)
    
    return (input_array, output_array)

In [None]:
how_many_step = 1
how_many_repeat = 1

root_dir_out_mfcc = glob.glob(os.path.join("./mfcc_storage/", '*npy'))

ix_repeat = 0

while (ix_repeat < how_many_repeat):
    ix_repeat += 1
    
    print ("REPEAT:", ix_repeat)
    
    ix_step = 0
    from_ix = 1
    
    while (ix_step < how_many_step):
        ix_step += 1

        print ("STEP:", ix_step)

        input_array, output_array = prepare_data_to_train (root_dir_out_mfcc, from_ix, from_ix+1000)
        
        print (np.mean(output_array))
        
        max_len = 240 # how many frame will be taken
        step = 240 # step size.

        input_array_specified = []
        output_array_specified = []

        for i in range (0, input_array.shape[0]-max_len, step):
            single_input_specified = (input_array[i:i+max_len,:])
            single_output_specified = (output_array[i:i+max_len,:])

            input_array_specified.append(single_input_specified)
            output_array_specified.append(single_output_specified)

        output_array_specified = np.asarray(output_array_specified)
        input_array_specified = np.asarray(input_array_specified)

        # Add the channel axis: (windows, 240, 40) -> (windows, 240, 40, 1)
        input_array_specified = np.expand_dims(input_array_specified, axis=3)
        history = model.fit(input_array_specified, output_array_specified,
                           epochs=10,
                           batch_size=4,
                           validation_split=0.2,
                           shuffle=False)


        model.save_weights('bilstm_weights_2DCNN.h5')    

        from_ix += 1000
        
    model.save_weights("bilstm_weights_2DCNN" + str(ix_repeat) + ".h5")    
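The training loop above cuts the long feature sequence into windows with `range(0, input_array.shape[0]-max_len, step)`. The sketch below lists the window start indices this produces. `window_starts` is a hypothetical helper, and the `keep_last_full_window` option is an assumption added for comparison, not part of the original loop.

```python
def window_starts(num_frames, max_len, step, keep_last_full_window=False):
    """Start indices of the windows cut from a sequence of num_frames frames."""
    stop = num_frames - max_len + (1 if keep_last_full_window else 0)
    return list(range(0, stop, step))


print(window_starts(1000, 240, 240))
print(window_starts(960, 240, 240))
print(window_starts(960, 240, 240, keep_last_full_window=True))
```

When the number of frames is an exact multiple of the window length, the original bound skips the final full window. Whether that matters depends on how much data is available.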

Now, we need to evaluate our method on the test data.

In [None]:
def minimum_speech_time(prediction_array, how_many_frame, threshold_mean):
    """It takes a prediction array (it contains 0 and 1; 0 means
    non-speech, 1 means speech). When people speak, the speech lasts
    for at least some frames. So, if we see a 1 without enough neighbouring 1s,
    we need to discard it.
    
    Arguments:
    prediction_array: Which array will be considered.
    how_many_frame: Minimum speech duration as a number of frames.
    threshold_mean: Minimum mean of the next how_many_frame predictions
        for a frame to keep its speech label."""
    
    for ix in range(0, len(prediction_array)-how_many_frame):
        prediction_segment = prediction_array[ix:ix+how_many_frame]
        if (np.mean(prediction_segment) < threshold_mean):
            prediction_array[ix] = 0
    
    return prediction_array

In [None]:
def hangover(prediction_array, threshold_hangover):
    """We can use a hangover time, such that after a speech segment
    we keep the label as speech for a while until we are sure that
    speech has ended. (resource: 
    https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf)
    
    Arguments:
    prediction_array: Which array will be evaluated.
    threshold_hangover: If the gap between two consecutive speech frames is shorter
    than this value (in frames), the frames in between are labelled as speech."""
    
    # np.where returns a tuple; take the index array of the first axis.
    speech_segment_locations = np.where(prediction_array == 1)[0]
    
    for ix in range(0, len(speech_segment_locations)-1):
        if (speech_segment_locations[ix+1] - speech_segment_locations[ix] < threshold_hangover):
            prediction_array[speech_segment_locations[ix]:speech_segment_locations[ix+1]] = 1
            
    return prediction_array
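The hangover idea can be tried on a toy label sequence. The sketch below is a plain-Python variant written for illustration, not the function above: `fill_short_gaps` is a hypothetical name, and it relabels any run of zeros that sits between two speech frames and is no longer than `max_gap` frames.

```python
def fill_short_gaps(labels, max_gap):
    """Relabel runs of 0 no longer than max_gap that lie between two 1s."""
    filled = list(labels)
    last_speech = None  # index of the most recent speech frame
    for ix, label in enumerate(labels):
        if label == 1:
            gap = ix - last_speech - 1 if last_speech is not None else 0
            if 0 < gap <= max_gap:
                for gap_ix in range(last_speech + 1, ix):
                    filled[gap_ix] = 1
            last_speech = ix
    return filled


labels = [1, 0, 0, 1, 0, 0, 0, 0, 1]
print(fill_short_gaps(labels, max_gap=3))
```

The two-frame gap is filled, while the four-frame gap is left as non-speech because it is longer than `max_gap`.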

In [None]:
def smooth(feature_vector, dl_model, window=240, hop=60):
    """Predict on overlapping windows and average the overlapping outputs.
    
    Arguments:
    feature_vector: 2D array of features with shape (frames, 40).
    dl_model: Trained model which maps (1, window, 40, 1) to (1, window, 1).
    window: How many frames the model sees at once.
    hop: How many frames the window advances at each step."""
    
    total_frames = feature_vector.shape[0]
    
    # Sum of predictions and number of windows covering each frame.
    prediction_sum = np.zeros(total_frames)
    prediction_count = np.zeros(total_frames)
    
    ix_frame = 0
    
    while ((ix_frame+window-1)<total_frames):        
           
        # Add batch and channel axes: (window, 40) -> (1, window, 40, 1)
        segment = feature_vector[ix_frame:ix_frame+window]
        segment = np.expand_dims(np.expand_dims(segment, axis=0), axis=3)
        
        prediction = dl_model.predict(segment)
        prediction = prediction.squeeze(axis=2)
        prediction = prediction.squeeze(axis=0)

        prediction_sum[ix_frame:ix_frame+window] += prediction
        prediction_count[ix_frame:ix_frame+window] += 1
            
        ix_frame += hop
        
    # Frames after the last full window get no prediction and stay 0.
    prediction_count[prediction_count == 0] = 1
    
    prediction_array = prediction_sum / prediction_count
        
    return prediction_array
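Averaging the predictions of overlapping windows can be illustrated with a stand-in model. In this sketch, `overlap_average` and `fake_model` are hypothetical names: the fake model simply returns the window's start index as the score of every frame, which makes the averaging easy to follow by hand.

```python
def overlap_average(num_frames, window, hop, predict_window):
    """Average per-frame scores over all windows that cover each frame."""
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    start = 0
    while start + window <= num_frames:
        scores = predict_window(start)
        for offset, score in enumerate(scores):
            sums[start + offset] += score
            counts[start + offset] += 1
        start += hop
    # Frames not covered by any window keep a score of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def fake_model(start):
    """Stand-in for a real model: every frame gets the window start as score."""
    return [float(start)] * 4


averaged = overlap_average(num_frames=6, window=4, hop=2,
                           predict_window=fake_model)
print(averaged)
```

Frames 2 and 3 are covered by both windows (starting at 0 and 2), so their score is the mean of the two, while the other frames are covered by one window only.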

In [None]:
def evaluate_mfcc(filename, dl_model, use_smoothing, threshold):
    
    """We evaluate 1 as speech, 0 as non-speech.
    
    Arguments:
    filename: Path of the npy feature file to evaluate.
    dl_model: Trained model used for prediction.
    use_smoothing: If True, average the predictions of overlapping windows.
    threshold: Frames with a prediction above this value are labelled as speech."""
    
    feature_vector = np.load(filename) 

    filename_txt = (str(filename).split("/")[-1])[:-3] + "txt"
    ground_truth = create_output_mfcc("./outputs_txt/" + filename_txt, 
                                      shape=feature_vector.shape[0])
    
    prediction_vector = []
    ix_frame = 0
    
    if (use_smoothing):
        
        prediction_array = smooth(feature_vector, dl_model)
        
    else:
        
        while (ix_frame+239<feature_vector.shape[0]):        
        
            # Add batch and channel axes: (240, 40) -> (1, 240, 40, 1)
            segment = feature_vector[ix_frame:ix_frame+240]
            segment = np.expand_dims(np.expand_dims(segment, axis=0), axis=3)
            
            prediction = dl_model.predict(segment)
            prediction = prediction.squeeze(axis=2)
            prediction = prediction.squeeze(axis=0)

            prediction_vector.append(prediction)
            ix_frame += 240
        
        prediction_vector = np.asarray(prediction_vector)
        print (prediction_vector.shape)

        prediction_array = np.ravel(prediction_vector)
        
    # Label a frame as speech (1) when its prediction exceeds the threshold.
    prediction = (prediction_array > threshold).astype(int)
    
    # Without smoothing, the frames after the last full window have no
    # prediction, so compare only the frames that were predicted.
    ground_truth = ground_truth[:len(prediction)]
    
    """METRICS"""
    num_speech = 0
    num_noise = 0
    
    num_speech_speech = 0
    num_speech_noise = 0
    num_noise_noise = 0
    num_noise_speech = 0
    
    for ix in range(0, len(ground_truth)):
        if (ground_truth[ix]==1):
            num_speech += 1
            if (prediction[ix]==1):
                num_speech_speech += 1
            else:
                num_speech_noise += 1
        else:
            num_noise += 1
            if (prediction[ix]==1):
                num_noise_speech += 1
            else:
                num_noise_noise += 1
    
    print ('Truth is Speech, Prediction is Speech: ', num_speech_speech,
          '\n Truth is Speech, Prediction is Noise: ', num_speech_noise,
          '\n Truth is Noise, Prediction is Speech: ', num_noise_speech,
          '\n Truth is Noise, Prediction is Noise: ', num_noise_noise)
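The four counts printed above form a confusion matrix, and the usual summary metrics follow directly from them. The sketch below treats speech as the positive class. `precision_recall_f1` is a hypothetical helper and the counts in the example are made up for illustration, not results from this notebook.

```python
def precision_recall_f1(speech_speech, speech_noise, noise_speech):
    """Summary metrics with speech as the positive class.

    speech_speech: truth is speech, prediction is speech (true positives)
    speech_noise: truth is speech, prediction is noise (false negatives)
    noise_speech: truth is noise, prediction is speech (false positives)
    """
    predicted_speech = speech_speech + noise_speech
    true_speech = speech_speech + speech_noise
    precision = speech_speech / predicted_speech if predicted_speech else 0.0
    recall = speech_speech / true_speech if true_speech else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts, for illustration only
precision, recall, f1 = precision_recall_f1(80, 20, 20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```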
    

**TO-DO**

Create the prediction and transform it into a suitable format. (I can do this by looking at the BiLSTM code.)

### Pitch Estimation Based 

I will use the CREPE method directly. This will be a quick solution to the problem. We could also feed the outputs of this method as features to another deep learning method. For now, however, I will use the outputs of CREPE without further deep learning.

In [None]:
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read(rastgele)
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

Let's look at some examples.

**TO-DO**

- ADD EXAMPLES
- [Cyclic Learning Rates](https://github.com/bckenstler/CLR)

In [None]:
!mkdir crepe_storage

In [None]:
def create_numpy_crepe(file, viterbi, step_size):
    """With this function, we will store a file's confidence and frequency outputs
    as numpy arrays. These outputs will be created by CREPE.
    
    Arguments:
    file: path of the audio file.
    viterbi: If it is true, CREPE will apply viterbi algorithm.
    step_size: Which step size will be used."""
    
    # File name without folder and extension, used for the output names.
    filename = (file.split("/")[-1]).split(".")[0]
    
    sr, audio = wavfile.read(file)
    time, frequency, confidence, activation = crepe.predict(audio, sr, 
                                                            viterbi=viterbi, 
                                                            step_size=step_size)
    np.save('./crepe_storage/' + filename + '_frequency', frequency)
    np.save('./crepe_storage/' + filename + '_confidence', confidence)

In [None]:
def create_dataset_crepe(root_dir, viterbi, step_size):
    """With this function, we will store the folder's confidence and frequency outputs
    as a numpy array. These outputs will be created by CREPE.
    
    Arguments:
    root_dir: which folder includes the audio files.
    viterbi: If it is true, CREPE will apply viterbi algorithm.
    step_size: Which step size will be used."""
    
    for file in root_dir:
        create_numpy_crepe(file, viterbi, step_size)

We need npy files for the frequency and confidence produced by CREPE. However, this takes a long time on my CPU, so I used Google Colab to produce the outputs.

In [None]:
"""Idea: Speech sounds can be efficiently modelled by linear prediction.
Source: https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf"""

def create_prediction(confidence, step, threshold):
    length_of_matrix = len(confidence)
    
    prediction_matrix = np.zeros(length_of_matrix)
    
    for ix in range(0, length_of_matrix-step):
        mean = np.mean(confidence[ix:ix+step])
        if(mean > threshold):
            prediction_matrix[ix:ix+step] = 1.0
            
    return prediction_matrix

In [None]:
def create_prediction_freq(freq, step, threshold):
    length_of_matrix = len(freq)
    
    prediction_matrix = np.zeros(length_of_matrix)
    
    for ix in range(0, length_of_matrix-(2*step)):
        mean = np.mean(freq[ix:ix+step])
        mean_next = np.mean(freq[ix+step:ix+step+step])
        if((mean-mean_next) < threshold):
            prediction_matrix[ix+int(step/2):ix+step] = 1.0
            
    return prediction_matrix

In [None]:
root_dir_out = glob.glob(os.path.join('./outputs_audio/', '*wav'))
rastgele = root_dir_out[777]

sr, audio = wavfile.read(rastgele)
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

In [None]:
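# The label file has the same name as the audio file but lives in outputs_txt.
rastgele_txt = "./outputs_txt/" + rastgele.split("/")[-1][:-3] + "txt"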

In [None]:
%config InlineBackend.figure_format = 'retina'

# Change the color of axis to better look with dark theme.
# https://stackoverflow.com/questions/1982770/matplotlib-changing-the-color-of-an-axis
pp.tick_params(axis='x', colors='green')
pp.tick_params(axis='y', colors='green')


pp.plot(frequency)
pp.plot(confidence*380, color='k')
pp.plot(create_output_crepe(rastgele_txt)*400, color='cyan')
pp.plot(create_prediction(confidence, 30, 0.3)*500, color='pink')
ipd.Audio(rastgele)
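The plot above relies on `create_output_crepe`, which is not defined in this notebook. As a rough idea of what such a helper could do, the sketch below turns label lines in the `Speech: start end` format into one label per CREPE frame. It is a hypothetical stand-in, not the author's function: the name `labels_from_lines`, taking the lines directly instead of a file path, and the assumption that frame `i` sits at `i * step_ms` milliseconds (10 ms being CREPE's default step size) are all illustrative choices.

```python
def labels_from_lines(lines, num_frames, step_ms=10):
    """Return a 0/1 label per frame from 'Speech: start end' lines (ms)."""
    labels = [0] * num_frames
    for line in lines:
        kind, start, end = line.strip().split(" ")
        if kind != "Speech:":
            continue  # "Noise:" lines stay labelled as 0
        for frame in range(num_frames):
            if int(start) <= frame * step_ms < int(end):
                labels[frame] = 1
    return labels


example_lines = ["Speech: 0 50", "Noise: 50 80", "Speech: 80 100"]
example_labels = labels_from_lines(example_lines, num_frames=10)
print(example_labels)
```

The returned list can be converted with `np.asarray` before scaling it for plotting next to the frequency and confidence curves.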

TO-DO

- Add examples.
- Then explain why things like create prediction were done.
- Then write up the evaluation metrics.

In [None]:
def evaluate_crepe(ground_truth,
                  prediction):
    
    """We evaluate 1 as speech, 0 as non-speech."""
    
    num_speech = 0
    num_noise = 0
    
    num_speech_speech = 0
    num_speech_noise = 0
    num_noise_noise = 0
    num_noise_speech = 0
    
    for ix in range(0, len(ground_truth)):
        if (ground_truth[ix]==1):
            num_speech += 1
            if (prediction[ix]==1):
                num_speech_speech += 1
            else:
                num_speech_noise += 1
        else:
            num_noise += 1
            if (prediction[ix]==1):
                num_noise_speech += 1
            else:
                num_noise_noise += 1
    
    print ('Truth is Speech, Prediction is Speech: ', num_speech_speech,
          '\n Truth is Speech, Prediction is Noise: ', num_speech_noise,
          '\n Truth is Noise, Prediction is Speech: ', num_noise_speech,
          '\n Truth is Noise, Prediction is Noise: ', num_noise_noise)
    