## Gender Recognition from Audio Data

The dataset can be downloaded from: https://research.google.com/audioset/dataset/index.html

First, let's install the libraries that might not be present yet. Remove the leading hash characters (#) to uncomment the lines you need so that the installation runs.

In [1]:
# !pip install librosa
# !pip install python_speech_features
# !pip install sounddevice
# !pip install soundfile

Importing Libraries

In [8]:
# Data manipulation
import numpy as np
import matplotlib.pyplot as plt
import random

# Feature extraction
import scipy
import librosa
import python_speech_features as mfcc
import os
from scipy.io.wavfile import read

# Model training
from sklearn import preprocessing
import pickle
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping
from sklearn.model_selection import train_test_split

# Live recording
import sounddevice as sd
import soundfile as sf

MFCC — Mel-Frequency Cepstral Coefficients

The first step in any automatic speech recognition system is to extract features, i.e. to identify the components of the audio signal that are useful for recognizing the linguistic content and to discard everything else, such as background noise and emotion.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to represent this envelope accurately. Printing the shape of the MFCC array shows how many coefficients were computed for how many frames.
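To make the frame count concrete, here is a small sketch that estimates the shape of the MFCC array for a clip. It assumes a 25 ms window, a 10 ms step and 13 coefficients (the values used later in this notebook). It also assumes that the last partial frame is padded rather than dropped, which is my understanding of how python_speech_features frames a signal. The helper name expected_mfcc_shape is hypothetical and not part of any library.

```python
import math

def expected_mfcc_shape(num_samples, sr, winlen=0.025, winstep=0.01, numcep=13):
    """Estimate (frames, coefficients) for a clip with num_samples samples."""
    frame_len = int(round(winlen * sr))    # samples per analysis window
    frame_step = int(round(winstep * sr))  # samples between window starts
    if num_samples <= frame_len:
        num_frames = 1
    else:
        # Assumption: the final partial frame is padded, so we round up
        num_frames = 1 + math.ceil((num_samples - frame_len) / frame_step)
    return num_frames, numcep

# A 3-second clip sampled at 16 kHz
print(expected_mfcc_shape(3 * 16000, 16000))
```

Each row of the array is one frame and each column is one coefficient, so a longer clip adds rows but never columns.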

Extracting the MFCCs of an audio file is straightforward:

In [11]:
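def get_MFCC(sr, audio):
    # 13 MFCCs per frame, using 25 ms windows with a 10 ms step
    features = mfcc.mfcc(audio, sr, winlen=0.025, winstep=0.01, numcep=13, appendEnergy=False)
    # Standardize each coefficient to zero mean and unit variance within this clip
    features = preprocessing.scale(features)

    return features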
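The scaling step in get_MFCC standardizes every coefficient column of a clip independently. The sketch below reproduces that idea with plain NumPy so the effect is visible on a toy array. The helper standardize_columns is a hypothetical name used only for illustration, and the guard for constant columns is my own addition.

```python
import numpy as np

def standardize_columns(features):
    """Scale each column to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - mean) / std

# Toy "feature matrix": 3 frames, 2 coefficients on very different scales
demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = standardize_columns(demo)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because the scaling is applied per clip, differences in overall loudness between recordings are largely removed before the features reach the model.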

The data is in a folder called AudioSet, which contains two sub-folders: male_clips and female_clips. We can extract the training features simply by running the function above on every file in the training set. For the moment, however, the training, test and validation files all sit in the same folder.

We must therefore split the files into three groups and run get_MFCC on each file in turn:

In [4]:
def get_features(source):

    # Collect all .wav files in the source folder
    files = [os.path.join(source, f) for f in os.listdir(source) if f.endswith('.wav')]

    # Split files: 80% for training, 10% for testing and the remainder (about 10%) for validation
    len_train = int(len(files) * 0.8)
    len_valortest = int(len(files) * 0.1)
    train_files = files[:len_train]
    testval_files = files[len_train:]
    test_files = testval_files[:len_valortest]
    val_files = testval_files[len_valortest:]

    def extract(file_list):
        # Compute the MFCCs of every file with get_MFCC, then stack them vertically (row-wise)
        vectors = []
        for f in file_list:
            sr, audio = read(f)
            vectors.append(get_MFCC(sr, audio))
        return np.vstack(vectors) if vectors else []

    features_train = extract(train_files)
    features_test = extract(test_files)
    features_val = extract(val_files)

    return features_train, features_test, features_val
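The split above takes the files in whatever order the folder listing returns them. As a hedged alternative, the sketch below shuffles the file list with a fixed seed before applying the same 80/10/10 proportions, so the split is reproducible and does not depend on file naming. The helper split_files, its parameters and the demo file names are assumptions made for illustration; they are not part of the notebook.

```python
import random

def split_files(files, train_frac=0.8, test_frac=0.1, seed=42):
    """Shuffle a copy of files and split it into train, test and validation lists."""
    files = sorted(files)                # fixed starting order
    random.Random(seed).shuffle(files)   # reproducible shuffle
    n_train = int(len(files) * train_frac)
    n_test = int(len(files) * test_frac)
    train = files[:n_train]
    test = files[n_train:n_train + n_test]
    val = files[n_train + n_test:]       # whatever remains
    return train, test, val

demo_files = [f"clip_{i}.wav" for i in range(10)]
train, test, val = split_files(demo_files)
print(len(train), len(test), len(val))
```

Splitting at the file level, as the notebook does, keeps all frames of one clip in the same subset.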

Extracting the features of the male clips:

In [5]:
source = "male_clips"
features_train_male, features_test_male, features_val_male = get_features(source)

Similarly, for the female clips:

In [6]:
source = "female_clips"
features_train_female, features_test_female, features_val_female =  get_features(source)

Building the training, validation and test sets as NumPy arrays

In [7]:
X = []
y = []
for i in features_train_male:
    X.append(i)
    y.append(1)
for i in features_train_female:
    X.append(i)
    y.append(0)

In [8]:
X_val = []
y_val = []
for i in features_val_male:
    X_val.append(i)
    y_val.append(1)
for i in features_val_female:
    X_val.append(i)
    y_val.append(0)
    


In [15]:
X_test = []
y_test = []
for i in features_test_male:
    X_test.append(i)
    y_test.append(1)
for i in features_test_female:
    X_test.append(i)
    y_test.append(0)

In [9]:
X = np.array(X)
y = np.array(y)

X_val = np.array(X_val)
y_val = np.array(y_val)

X_test = np.array(X_test)
y_test = np.array(y_test)
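The loops above append one frame at a time. The same arrays can be built in a vectorized way, sketched below under the notebook's labelling convention (1 for male, 0 for female). The helper build_dataset and the random demo arrays are hypothetical and only stand in for the real feature matrices.

```python
import numpy as np

def build_dataset(features_male, features_female):
    """Stack male and female frames and create matching labels (1 = male, 0 = female)."""
    X = np.vstack((features_male, features_female))
    y = np.concatenate((
        np.ones(len(features_male), dtype=int),
        np.zeros(len(features_female), dtype=int),
    ))
    return X, y

# Stand-in data: 4 male frames and 3 female frames with 13 coefficients each
rng = np.random.default_rng(0)
demo_male = rng.normal(size=(4, 13))
demo_female = rng.normal(size=(3, 13))
X_demo, y_demo = build_dataset(demo_male, demo_female)
print(X_demo.shape, y_demo)
```

Comparing the number of ones and zeros in the label array is also a quick way to check how balanced the two classes are.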

Building the Model
We are going to use a deep feed-forward neural network with 6 hidden layers. It isn't the perfect architecture, but it does the job so far:

In [251]:
def create_model():
    """6 hidden dense layers from 512 units to 64, not the best model."""
    model = Sequential()
    model.add(Dense(512, input_shape=(None,13)))
    model.add(Dropout(0.3))
    model.add(Dense(512, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    # one output neuron with sigmoid activation function, 0 means female, 1 means male
    model.add(Dense(1, activation="sigmoid"))
    # using binary crossentropy as it's male/female classification (binary)
    model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="adam")
    # print summary of the model
    model.summary()
    return model


We're using a 30% dropout rate after each fully connected layer. This type of regularization should help prevent overfitting on the training dataset.

An important thing to note here is that the output layer has a single unit (neuron) with a sigmoid activation function. The model outputs a value close to 1 when the speaker is male and a value close to 0 when the speaker is female.
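As a minimal sketch of how a single sigmoid output is turned into a label, the snippet below squashes a raw score into the range 0 to 1 and applies a 0.5 threshold. The threshold and the helper names are assumptions for illustration; the model itself only produces the probability.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def label_from_probability(p, threshold=0.5):
    """Interpret the output as 'male' above the threshold, 'female' otherwise."""
    return "male" if p > threshold else "female"

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(z, round(p, 3), label_from_probability(p))
```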

Also, we're using binary cross-entropy as the loss function, since it is the special case of categorical cross-entropy for exactly 2 classes. Let's use create_model to build our model:

In [252]:
# construct the model

model = create_model()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_35 (Dense)             (None, None, 512)         7168      
_________________________________________________________________
dropout_30 (Dropout)         (None, None, 512)         0         
_________________________________________________________________
dense_36 (Dense)             (None, None, 512)         262656    
_________________________________________________________________
dropout_31 (Dropout)         (None, None, 512)         0         
_________________________________________________________________
dense_37 (Dense)             (None, None, 256)         131328    
_________________________________________________________________
dropout_32 (Dropout)         (None, None, 256)         0         
_________________________________________________________________
dense_38 (Dense)             (None, None, 128)        
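As a side note on the loss chosen in create_model, the sketch below computes binary cross-entropy for a single prediction and checks that it matches categorical cross-entropy over the two classes. The function names and the clipping constant eps are assumptions made for this illustration, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample, where p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs):
    """Loss for one sample given a one-hot target and class probabilities."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

p_male = 0.8
# True label is male (1): both formulas give the same value
print(binary_cross_entropy(1, p_male))
print(categorical_cross_entropy([1, 0], [p_male, 1 - p_male]))
```

A confident wrong prediction is penalized much more heavily than a hesitant one, which is what pushes the sigmoid output towards 0 or 1 during training.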

Two callbacks will be executed at the end of each epoch:

The first is TensorBoard, which we are going to use to see how the loss and accuracy evolve during training.
The second is early stopping, which stops training when the model stops improving (by default it monitors the validation loss). A patience of 5 means that training stops after 5 epochs without improvement. Setting restore_best_weights to True restores the best weights recorded during training and assigns them to the model.
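The patience rule described above can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Keras implementation: it assumes the monitored quantity is a validation loss that should decrease, and the helper name and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: the best value is reached at epoch 3
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=5)
print(stop_epoch, best_epoch)
```

With restore_best_weights enabled, the model would end up with the weights from the best epoch rather than those from the epoch at which training stopped.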

In [22]:
# use tensorboard to view metrics
tensorboard = TensorBoard(log_dir="logs")
# define early stopping to stop training after 5 epochs of not improving
early_stopping = EarlyStopping(mode="min", patience=5, restore_best_weights=True)

batch_size = 500
epochs = 50
# train the model using the training set and validating using validation set
model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_data=(X_val, y_val),
          callbacks=[tensorboard, early_stopping])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


<tensorflow.python.keras.callbacks.History at 0x1e406d43f70>

Now that the model is trained, let's save it and then evaluate it on the test set we created earlier:

In [23]:
# save the model to a file (create the results folder first if it does not exist)
os.makedirs("results", exist_ok=True)
model.save("results/model.h5")

In [25]:
# evaluating the model using the testing set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Loss: {loss:.4f}")
print(f"Accuracy: {accuracy*100:.2f}%")

Loss: 0.5982
Accuracy: 66.70%
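A single accuracy figure does not show whether both classes are recognized equally well. The sketch below computes the accuracy separately for each class from true labels and predicted probabilities. The helper name, the 0.5 threshold and the demo values are assumptions for illustration and are not results from this notebook.

```python
def per_class_accuracy(y_true, probs, threshold=0.5):
    """Return {label: accuracy} for labels 0 (female) and 1 (male)."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for label, p in zip(y_true, probs):
        pred = 1 if p > threshold else 0
        total[label] += 1
        correct[label] += int(pred == label)
    return {label: correct[label] / total[label] for label in total if total[label]}

# Made-up labels and probabilities, only to show the calculation
demo_true = [1, 1, 1, 1, 0, 0]
demo_probs = [0.9, 0.6, 0.4, 0.7, 0.2, 0.8]
print(per_class_accuracy(demo_true, demo_probs))
```

With the notebook's arrays, the probabilities would come from model.predict(X_test), flattened to one value per frame.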


Testing the Model with your own Voice

In [4]:
from tensorflow.keras.models import load_model
model = load_model("results/model.h5")

In [5]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, None, 512)         7168      
_________________________________________________________________
dropout_12 (Dropout)         (None, None, 512)         0         
_________________________________________________________________
dense_15 (Dense)             (None, None, 512)         262656    
_________________________________________________________________
dropout_13 (Dropout)         (None, None, 512)         0         
_________________________________________________________________
dense_16 (Dense)             (None, None, 256)         131328    
_________________________________________________________________
dropout_14 (Dropout)         (None, None, 256)         0         
_________________________________________________________________
dense_17 (Dense)             (None, None, 128)        

In [12]:
def record_and_predict(sr=16000, channels=1, duration=3, filename='pred_record.wav'):
    print("-----------------------------------------------------------------------------------")
    print("Recording Started...")
    print("-----------------------------------------------------------------------------------")
    recording = sd.rec(int(duration * sr), samplerate=sr, channels=channels)
    sd.wait()  # block until the recording has finished
    recording = recording.reshape(-1)  # flatten to a 1-D signal (assumes channels=1)
    print("Recording ended...")
    print("-----------------------------------------------------------------------------------")
    features = get_MFCC(sr, recording)
    # The model predicts one probability per frame; average them over the whole clip
    m = model.predict(features)
    m = np.mean(m)
    f = (1 - m) * 100  # percentage assigned to "female"
    m = m * 100        # percentage assigned to "male"
    if m > f:
        print("THE SPEAKER IS MALE")
    else:
        print("THE SPEAKER IS FEMALE")
    print("-----------------------------------------------------------------------------------")
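The function above turns many per-frame probabilities into one decision by averaging them. The sketch below isolates that step and contrasts it with a majority vote over frames, an alternative that is not used in this notebook. Both helper names and the demo values are hypothetical.

```python
def decide_by_mean(frame_probs, threshold=0.5):
    """Average the per-frame probabilities, then threshold once."""
    mean_prob = sum(frame_probs) / len(frame_probs)
    return "male" if mean_prob > threshold else "female"

def decide_by_majority(frame_probs, threshold=0.5):
    """Threshold every frame, then pick the label most frames agree on."""
    male_votes = sum(1 for p in frame_probs if p > threshold)
    return "male" if male_votes > len(frame_probs) / 2 else "female"

# One very confident frame and two slightly "female" frames
demo_frames = [0.95, 0.45, 0.45]
print(decide_by_mean(demo_frames), decide_by_majority(demo_frames))
```

Averaging lets a few very confident frames outweigh many uncertain ones, while a majority vote treats every frame equally. Which behaviour is preferable depends on how noisy the per-frame predictions are.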


In [13]:
record_and_predict()

-----------------------------------------------------------------------------------
Recording Started...
-----------------------------------------------------------------------------------
Recording ended...
-----------------------------------------------------------------------------------
THE SPEAKER IS MALE
-----------------------------------------------------------------------------------
