Text Independent Speaker Identification is the task of identifying and separating individual speakers in a given audio sample. This technique has been applied in various fields, for example it is used in certain banking and home security services as a way to verify the identity of users beyond traditional methods. Unlabeled voice registrations are used as input for a model whose objective is to identify the different voices based on individual characteristics. This dictionary of voices can then be compared with new samples to identify the speaker.

One of the ways this kind of task has been completed is by extracting the frequency spectrum for the audio sample and then using a Gaussian Mixture Model (GMM) to distinguish different people.

The data set used was sourced from the openslr project at the link: https://www.openslr.org/45/. It was uploaded by the Surfingtech Company that extracted it from a private bigger data set. It is made up of samples of a few seconds each for 10 speakers. For each speaker there are about 350 recordings in the form of .wav files. To decrease the computing time, the first 20 recordings for each speaker where used for training, for a total of 200 training samples, and the next 5 for testing, so 50 testing samples.

To obtain an effective modeling of the audio samples, it isn't possible to just pass the samples to a GMM for fitting and then complete the prediction, it is instead necessary to process the signals and extract from them certain features which can than be used to successfully train a predictor model, in this case a GMM.

This process of feature extraction is done in this case through the Mel-Frequency Cepstrum (MFC). A cepstrum is sometimes described as the spectrum of a spectrum, because it refers to the representation of a signal obtained by first applying the Fourier Transform to such signal and then reapplying it to the log of the transformed signal. This is also equivalent, ignoring a scaling factor, to calculating the inverse transform:
$$\begin{equation}
C=\left|\mathcal{F}^{-1}\left\{\log \left(|\mathcal{F}\{f(t)\}|^2\right)\right\}\right|^2
\quad \sim \quad
C=\left|\mathcal{F}\left\{\log \left(|\mathcal{F}\{f(t)\}|^2\right)\right\}\right|^2
\end{equation}$$

This representation in a new time domain, different from the original one of the signal  𝑓(𝑡) , is particularly useful when analyzing the human voice because characteristics like the voice pitch and the so called formants (peaks in the spectrum caused by resonances in the human vocal tract) are more easily distinguishable.

The Mel-Frequency Cepstrum is a variant of this process, where the powers of the spectrum obtained with the first transform are mapped on the mel (melody) scale, which is a pitch scale based on the human ear.
$$m = 2595 \log_{10} \left( 1 + \frac{f}{700} \right)$$


Than a Discrete Cosine Transform is applied and the amplitudes of the resulting spectrum are the Mel-Frequency Cepstrum Coefficients or MFCCs. The "delta" are than values computed as differences between coefficients as in the following formula. They should offer a description of how the sounds vary in time and are considered important for voice analysis.
$$\begin{equation}
d_t=\frac{\sum_{n=1}^N n\left(c_{t+n}-c_{t-n}\right)}{2 \sum_{n=1}^N n^2}
\end{equation}$$

In [1]:
import numpy as np
import sklearn
from sklearn import preprocessing
import python_speech_features as mfcc

The computation of the MFC coefficients and of the delta happens through two functions, calculate_delta and extract_features. calculate_delta accepts in input the array of coefficients extracted for each time frame by the extract_features function and, for each time frame (each row), extracts the +1/-1 and +2/-2 indices to use to obtain the correct coefficients for each delta. For every coefficient of every time window, we have 1 delta.

In [2]:
def calculate_delta(array):
    """Calculate and returns the delta of given feature vector matrix"""

    rows,cols = array.shape     #Number of cols and rows in the input array, representing respectively the coefficients and the time frames.
    deltas = np.zeros((rows,20)) #Initialize array to store the delta
    N = 2 #Temporal distance of the coefficients to be considered, so t-1, t+1 and t-2, t+2
    for i in range(rows):
        index = []
        j = 1
        while j <= N:
            if i-j < 0: #Expect for the first frame...
              first =0
            else:
              first = i-j
            if i+j > rows-1:  #... and the last one
                second = rows-1
            else:
                second = i+j 
            index.append((second,first)) #first adding n=1, then n=2
            j+=1
        deltas[i] = ( array[index[0][0]]-array[index[0][1]] + (2 * (array[index[1][0]]-array[index[1][1]])) ) / 10 #computing delta for the i-th row
    return deltas

The extract_features function takes as input the raw audio file with its sampling rate. It then applies the mfcc method with a window size of 25ms and a hop lenght of 10ms. The Fast-Fourier Transform is executed with a precision of 1200 points (for example, if the file sample rate is 44100Hz, then this would mean a frequency resolution of about 37 Hz). After the second (Discrete Cosine) Transform is applied to the log of the data mapped on the mel scale, the first 20 coefficients are kept. This should be sufficient to keep enough details for our analysis.

In [3]:
def extract_features(audio,rate):
    """extract 20 dim mfcc features from an audio, performs CMS and combines 
    delta to make it 40 dim feature vector"""    
    
    mfcc_feature = mfcc.mfcc(audio,rate, 0.025, 0.01,20,nfft = 1200, appendEnergy = True)    
    mfcc_feature = preprocessing.scale(mfcc_feature)
    delta = calculate_delta(mfcc_feature)
    combined = np.hstack((mfcc_feature,delta)) 
    return combined


In [4]:
import _pickle as cPickle
import numpy as np
from scipy.io.wavfile import read
from sklearn.mixture import GaussianMixture as GMM
import warnings
warnings.filterwarnings("ignore")

As we said the training data comes from a data set sourced by openslr. We have taken 25 sample for each speaker and used 20 for training and 5 for testing. The "source" variable contains the path to the folder with the training samples, "train_file" is a .txt file with the name of each training sample.

In [5]:
#path to training data
source   = "/Users/chiaracangelosi/Desktop/Speaker-Recognition-Using-GMM-MFCC-Python3-master/Voice_Samples_Training/"   

#path where training speakers will be saved
dest = "Trained_Speech_Models/"
train_file = "Voice_Samples_Training_Path.txt"        
file_paths = open(train_file,'r')

In the next cell we loop through the audio samples of each speaker (20 per speaker) and extract coefficients+deltas using the previously defined functions. The feature array for each sample is stacked row-wise. This array becomes the training data for a GMM witch will then be a representation of that specific person voice. The GMM function is the one from sklearn.mixture. 5 gaussian distributions are used for each model, which proved to be an effective number. Each model is initialized three different times in a tentative to avoid local optima. Each speaker model is saved separately.

In [6]:
#counter for how many files have been processed per speaker
count = 1
# Extracting features for each speaker (20 files per speakers)
features = np.asarray(()) #extracted features
for path in file_paths:    
    path = path.strip() #Removes whitespaces
    print (path)
    
    # read the audio
    sr,audio = read(source+path) #sr is the file sampling rate, which is automatically sourced by the scipy read function
    
    # extract 40 dimensional MFCC & delta MFCC features
    vector   = extract_features(audio,sr)
    
    if features.size == 0:
        features = vector
    else:
        features = np.vstack((features, vector)) #stacks new features below existing ones (row-wise)
        
    # After features of 5 files of a single speaker have been concatenated, then we are ready to train the model
    if count == 20:    
        gmm = GMM(n_components = 5, covariance_type='diag',n_init = 3) #Creates a GMM with 5 components, initializing it 3 times to search for a good fit
        gmm.fit(features)
        
        #saving the trained model using the python module Pickle, used for serializing (converting an object into a file so it can be saved and loaded later)
        #and deserializing objects (reading that file and reconstructing the object in memory)
        picklefile = path.split("-")[0]+".gmm" #Taking the name and adding .gmm
        cPickle.dump(gmm,open(dest + picklefile,'wb'))  #Saving the GMM parameters for each speaker
        print ('+ modeling completed for speaker:',picklefile," with data point = ",features.shape) 

        #reset variables
        features = np.asarray(())

        #reset counter
        count = 0
        
        #increment counter
        count = count + 1

f1-001/f1_1.wav
f1-001/f1_2.wav
f1-001/f1_3.wav
f1-001/f1_4.wav
f1-001/f1_5.wav
f1-001/f1_6.wav
f1-001/f1_7.wav
f1-001/f1_8.wav
f1-001/f1_9.wav
f1-001/f1_10.wav
f1-001/f1_11.wav
f1-001/f1_12.wav
f1-001/f1_13.wav
f1-001/f1_14.wav
f1-001/f1_15.wav
f1-001/f1_16.wav
f1-001/f1_17.wav
f1-001/f1_18.wav
f1-001/f1_19.wav
f1-001/f1_20.wav
f2-002/f2_1.wav
f2-002/f2_2.wav
f2-002/f2_3.wav
f2-002/f2_4.wav
f2-002/f2_5.wav
f2-002/f2_6.wav
f2-002/f2_7.wav
f2-002/f2_8.wav
f2-002/f2_9.wav
f2-002/f2_10.wav
f2-002/f2_11.wav
f2-002/f2_12.wav
f2-002/f2_13.wav
f2-002/f2_14.wav
f2-002/f2_15.wav
f2-002/f2_16.wav
f2-002/f2_17.wav
f2-002/f2_18.wav
f2-002/f2_19.wav
f2-002/f2_20.wav
f3-003/f3_1.wav
f3-003/f3_2.wav
f3-003/f3_3.wav
f3-003/f3_4.wav
f3-003/f3_5.wav
f3-003/f3_6.wav
f3-003/f3_7.wav
f3-003/f3_8.wav
f3-003/f3_9.wav
f3-003/f3_10.wav
f3-003/f3_11.wav
f3-003/f3_12.wav
f3-003/f3_13.wav
f3-003/f3_14.wav
f3-003/f3_15.wav
f3-003/f3_16.wav
f3-003/f3_17.wav
f3-003/f3_18.wav
f3-003/f3_19.wav
f3-003/f3_20.wav
f4-004/

In the final section "source" becomes the folder with the audio samples we want to test. We created two possibilities, to either check a single file against our models to check the best correspondence or to pass a whole library of files to assess the overall performances of voice classifier.

In [7]:
import os
import time


#path to training data
source   = "/Users/chiaracangelosi/Desktop/Speaker-Recognition-Using-GMM-MFCC-Python3-master/Testing_Audio/"

#path where training speakers will be saved 
#path with the previously trained models to check against
modelpath = "/Users/chiaracangelosi/Desktop/Speaker-Recognition-Using-GMM-MFCC-Python3-master/Trained_Speech_Models/"

gmm_files = [os.path.join(modelpath,fname) for fname in 
              os.listdir(modelpath) if fname.endswith('.gmm')]

#Load the Gaussian gender Models
models    = [cPickle.load(open(fname,'rb')) for fname in gmm_files]  #list containing all loaded models
speakers   = [fname.split("/")[-1].split(".gmm")[0] for fname        #it extracts speaker names from .gmm filenames
              in gmm_files]

error = 0
total_sample = 0.0

The user can choose single audio testing or batch audio testing:

In [19]:
print("Press '1' for checking a single Audio or Press '0' for testing a complete set of audio with Accuracy?")
take=int(input().strip())

#This is the single audio testing case. The system first reads the user-provided audio file and extracts MFC features, which capture the characteristics of 
#the speaker's voice. It then computes log-likelihood scores for each trained GMM (Gaussian Mixture Model), measuring how well the extracted features match 
#each stored speaker profile. The speaker with the highest likelihood score is selected as the best match. Finally, the system prints the identified speaker’s 
#name, providing the recognition result.

if take == 1:
    print ("Enter the File name from the sample with .wav notation :") #Specifying the file to be tested
    path =input().strip()
    print (("Testing Audio : ",path))
    sr,audio = read(source + path)
    vector   = extract_features(audio,sr) #Extracting the file features
    
    log_likelihood = np.zeros(len(models)) 
    
    for i in range(len(models)):
        gmm    = models[i]  #checking with each model one by one
        scores = np.array(gmm.score(vector))
        log_likelihood[i] = scores.sum() #Summing the log-likelihoods is equivalent to multiplying the probabilities of each interval sample (25ms)
    
    winner = np.argmax(log_likelihood)
    print ("\tThe person in the given audio sample is detected as - ", speakers[winner])

    time.sleep(1.0)

#This is the batch audio testing case. The system reads all test audio files listed in Testing_audio_Path.txt and extracts MFCC features for each file. 
#It then computes log-likelihood scores for all trained GMM (Gaussian Mixture Model) models, determining how well each model matches the extracted features. 
#The detected speaker is compared with the expected speaker name to check for correctness. Finally the accuracy is computed.

elif take == 0:
    test_file = "/Users/chiaracangelosi/Desktop/Speaker-Recognition-Using-GMM-MFCC-Python3-master/Testing_audio_Path.txt"        
    file_paths = open(test_file,'r')
    # Read the test directory and get the list of test audio files 
    total_sample = 0.0  
    error = 0  
    for path in file_paths:   
        total_sample+= 1.0
        path=path.strip()
        print("Testing Audio : ", path)
        sr,audio = read(source + path) #read audio file
        vector   = extract_features(audio,sr) #extract MFCC features
        log_likelihood = np.zeros(len(models)) 
        for i in range(len(models)):
            gmm    = models[i]  #checking with each model one by one
            scores = np.array(gmm.score(vector)) #compute the likelihood score
            log_likelihood[i] = scores.sum() 
        winner=np.argmax(log_likelihood) #best matching speaker
        print ("\tdetected as - ", speakers[winner])
        checker_name = path.split("_")[0]
        if speakers[winner] != checker_name:  #check if the name is correct
            error += 1
        time.sleep(1.0)
    print (error, total_sample)
    accuracy = ((total_sample - error) / total_sample) * 100

    print ("The Accuracy Percentage for the current testing Performance with MFCC + GMM is : ", accuracy, "%")

Press '1' for checking a single Audio or Press '0' for testing a complete set of audio with Accuracy?


 0


Testing Audio :  f1_21.wav
	detected as -  f1
Testing Audio :  f1_22.wav
	detected as -  f4
Testing Audio :  f1_23.wav
	detected as -  f1
Testing Audio :  f1_24.wav
	detected as -  f1
Testing Audio :  f1_25.wav
	detected as -  f1
Testing Audio :  f2_21.wav
	detected as -  f2
Testing Audio :  f2_22.wav
	detected as -  f3
Testing Audio :  f2_23.wav
	detected as -  f2
Testing Audio :  f2_24.wav
	detected as -  f2
Testing Audio :  f2_25.wav
	detected as -  f2
Testing Audio :  f3_21.wav
	detected as -  f3
Testing Audio :  f3_22.wav
	detected as -  f4
Testing Audio :  f3_23.wav
	detected as -  f3
Testing Audio :  f3_24.wav
	detected as -  f3
Testing Audio :  f3_25.wav
	detected as -  m5
Testing Audio :  f4_21.wav
	detected as -  f4
Testing Audio :  f4_22.wav
	detected as -  f4
Testing Audio :  f4_23.wav
	detected as -  f4
Testing Audio :  f4_24.wav
	detected as -  f4
Testing Audio :  f4_25.wav
	detected as -  f4
Testing Audio :  f5_21.wav
	detected as -  f4
Testing Audio :  f5_22.wav
	detect