# Gender and Age Classification From Audio
## Cheolmin Oh (Jason Oh)
### Project Description: I plan to classify a speaker's gender and age from features extracted from audio recordings. These features feed a convolutional neural network (CNN) that predicts gender and age. Finally, I will evaluate the model on actual audio recordings to see how well it works in real-world scenarios.

#### Quick overall summary:

1. I used the Common Voice dataset. I did not use its API; I downloaded the data to my Google Drive.

2. `common_test.tsv` contains the gender and age labels and the path to each MP3 file. Each path from the TSV is joined as `/{your-folder}/en_test_0/{path}` to read the MP3 file from that folder.

3. Initially, I experimented with MFCCs, but I found it hard to understand how they work, so I switched to spectrogram-based methods. I explored both plain spectrograms and Mel spectrograms. With plain spectrograms, the extracted features were too large, and shrinking them lost too much information. As a result, I settled on Mel-spectrogram feature extraction for my CNN model.
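
As a rough sketch of step 2, the snippet below reads a tiny in-memory TSV and joins each `path` value onto the clip folder. The two sample rows reuse file names and labels that appear in this notebook's `df.head()` output; `build_audio_paths` and `CLIP_DIR` are hypothetical names, and `your-folder` stays a placeholder for wherever the dataset was downloaded.

```python
import csv
import io
import posixpath

# Placeholder folder: replace "your-folder" with your own dataset location.
CLIP_DIR = "/your-folder/en_test_0"

# Two rows in the same column layout as common_test.tsv (only the columns used here).
SAMPLE_TSV = (
    "path\tage\tgender\n"
    "common_voice_en_18295850.mp3\ttwenties\tmale\n"
    "common_voice_en_22338655.mp3\ttwenties\tfemale\n"
)

def build_audio_paths(tsv_text, clip_dir=CLIP_DIR):
    """Return (full_path, gender, age) tuples for every row of the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (posixpath.join(clip_dir, row["path"]), row["gender"], row["age"])
        for row in reader
    ]

rows = build_audio_paths(SAMPLE_TSV)
for full_path, gender, age in rows:
    print(full_path, gender, age)
```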

# Mel-Spectrogram Implementation
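
To give some intuition for what a Mel spectrogram changes, here is a small sketch of the Hz-to-mel mapping. It uses the common HTK-style formula `m = 2595 * log10(1 + f / 700)`; librosa's default uses a slightly different (Slaney) variant, so the numbers are illustrative rather than exactly what the code in this section computes. `hz_to_mel`, `mel_to_hz`, and `mel_band_centers` are hypothetical helper names.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style Hz -> mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

# 128 bands is the n_mels default of librosa.feature.melspectrogram;
# the 0-8000 Hz range is an arbitrary choice for this illustration.
centers = mel_band_centers(128, 0.0, 8000.0)
print([round(c) for c in centers[:3]], [round(c) for c in centers[-3:]])
```

The bands are packed closely at low frequencies and spread out at high frequencies, so the 128 rows of the Mel spectrogram spend most of their resolution on the lower part of the spectrum.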

In [None]:
'''
I have mounted Google Drive for this project.

The first part of this code focuses on gender classification.
Age classification is implemented in a separate section.

I removed rows whose gender is missing (None) or labeled 'other',
leaving around 2,300 samples.

For age classification, I first tried a softmax over all age labels,
but this approach gave very low accuracy for age prediction.
I then turned it into a binary classification problem:
predicting whether the voice belongs to the under-30 group (teens and twenties) or is older.
However, the results were not very promising.
'''

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Change audio-project to your folder name
file_path = './drive/MyDrive/audio-project/common_test.tsv'

# load the full TSV data
df_raw = pd.read_csv(file_path, sep='\t')
print(len(df_raw))

# gender parse
# drop data with None values
df_include_other = df_raw.dropna(subset=['gender'])
print(len(df_include_other))
# drop data with 'other' gender label
df = df_include_other[df_include_other['gender'] != 'other']
print(len(df))
# drop age with None values
df = df.dropna(subset=['age'])
print(len(df))

# age parse and display age types
df_age_data = df_raw.dropna(subset=['age'])
print(len(df_age_data))
age_counts = df_age_data.groupby('age').size().reset_index(name='count')
print(age_counts)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
16372
2356
2320
2308
2373
         age  count
0   eighties      5
1    fifties    109
2   fourties    206
3  seventies     37
4    sixties     39
5      teens    354
6   thirties    465
7   twenties   1158
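
The counts above are heavily skewed toward younger speakers, which matters for the binary age split used in the age section of this notebook. The sketch below groups the printed counts into under-30 and 30-plus and computes the accuracy of always guessing the larger group. `age_counts`, `UNDER_30`, and `majority_baseline` are illustrative names; the numbers are copied from the output above.

```python
# Counts copied from the age_counts output above (label spelling as in the dataset).
age_counts = {
    "eighties": 5, "fifties": 109, "fourties": 206, "seventies": 37,
    "sixties": 39, "teens": 354, "thirties": 465, "twenties": 1158,
}
UNDER_30 = {"teens", "twenties"}

under_30 = sum(n for label, n in age_counts.items() if label in UNDER_30)
over_30 = sum(n for label, n in age_counts.items() if label not in UNDER_30)
total = under_30 + over_30

# Accuracy of a "model" that always predicts the larger group.
majority_baseline = max(under_30, over_30) / total

print(f"under 30: {under_30}, 30 and over: {over_30}, total: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

These counts come from all rows with an age label, so the split in the filtered training data may differ slightly. Still, an age model needs to beat roughly 64% accuracy to do better than always guessing the majority group.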


In [None]:
df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
16,00c3f0e7c691ef30257d1bfa9adc410535b7ba3f48e344...,common_voice_en_18295850.mp3,The long-lived bridge still stands today.,2,0,twenties,male,,,en,
61,030d0b51d96c93d1db9e4ba94dceaf341d98d51eb36820...,common_voice_en_22338655.mp3,The prints are then delivered to the customer.,3,1,twenties,female,Hong Kong English,,en,
79,040595ac714a98d21fe0c2f36d96997900085115175065...,common_voice_en_18277778.mp3,We should not take for granted how fortunate w...,2,1,fourties,male,United States English,,en,
84,043a451f648097c1a200f7e966289233e234f4e35ee00f...,common_voice_en_21943181.mp3,eight,4,3,twenties,male,,,en,Benchmark
86,0446e65032f30acdda12c87fef9d1de14d34946a4d2430...,common_voice_en_20586574.mp3,Geils began playing jazz trumpet but eventuall...,4,0,twenties,male,"United States English,wolof",,en,


In [None]:
# This code is for feature extraction and model creation for gender only

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten, Conv2D, MaxPooling2D
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
from scipy.ndimage import zoom
from tensorflow import keras

def resize_spectrogram_interpolation(spectrogram, new_size):
    return zoom(spectrogram, (new_size[0] / spectrogram.shape[0], new_size[1] / spectrogram.shape[1]))

def create_spectrogram(input_mp3_file):
    # I tried a 44.1 kHz sample rate but didn't see a difference,
    # probably because the audio quality is not good enough.
    # audio_signal, sample_rate = librosa.load(input_mp3_file, sr=44100)
    # sr=None keeps the file's native sample rate (librosa's default would resample to 22.05 kHz)
    audio_signal, sample_rate = librosa.load(input_mp3_file, sr=None)
    # https://librosa.org/doc/main/generated/librosa.effects.time_stretch.html
    # rate equals the clip duration in seconds, so every clip is squeezed (or stretched) to about 1 second
    audio_signal = librosa.effects.time_stretch(y=audio_signal, rate=len(audio_signal)/sample_rate)

    # https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html
    melspectrogram = librosa.feature.melspectrogram(y=audio_signal, sr=sample_rate, n_fft=2048, hop_length=512)
    spectrogram_db = librosa.power_to_db(S=melspectrogram, ref=1.0)

    # # display the spectrogram
    # plt.figure(figsize=(14, 5))
    # librosa.display.specshow(spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel')
    # plt.colorbar()
    # print(spectrogram_db.shape)

    # # zero-padding shorter spectrograms gave about the same accuracy as interpolation
    # # pad smaller spectrograms with a constant (0)
    # target_size = 94
    # if spectrogram_db.shape[1] < target_size:
    #     pad_width = target_size - spectrogram_db.shape[1]
    #     spectrogram_db = np.pad(spectrogram_db, ((0, 0), (0, pad_width)), mode='constant')
    # spectrogram_db = spectrogram_db.reshape((128, 94, 1))

    spectrogram_db = resize_spectrogram_interpolation(spectrogram_db, (128, 94))
    return spectrogram_db

features = []  # one 2D Mel-spectrogram array per audio clip
labels_gender = []  # gender labels (0 = male, 1 = female), in the same order as features

# the counter makes it easy to test on a smaller portion of the data
counter = 0
for _, row in df.iterrows():
    if counter < 2320:  # upper bound; df has 2308 rows after filtering
        file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']

        spectrogram = create_spectrogram(file_name)

        # get the gender labels
        if spectrogram is not None:
            features.append(spectrogram)
            if row['gender'].lower() == 'male':
                labels_gender.append(0)
            elif row['gender'].lower() == 'female':
                labels_gender.append(1)
            else:
                print("Can't handle this gender label!")
        counter += 1
    else:
        break

# convert features and labels_gender to NumPy arrays
X = np.array(features)
y = np.array(labels_gender)

# split into train and test sets (default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# # test code to check what is passed in train or test
# # mainly used to check how spectrogram looked like
# # display X_train spectrogram
# for i in range(len(X_train)):
#     plt.figure(figsize=(8, 8))
#     plt.imshow(X_train[i])
#     plt.show()

print("X_train dimensions:", X_train.shape)

X_train dimensions: (1731, 128, 94)
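
The 94-column width is what a one-second clip would produce at 48 kHz, which may be where the resize target comes from. With `rate` set to the clip's duration in seconds, `time_stretch` brings each clip to about one second. The sketch below works through the arithmetic, assuming a 48 kHz sample rate (an assumption, since the notebook never prints `sample_rate`) and librosa's default centered framing, where the frame count is `1 + n_samples // hop_length`. `stretched_length` and `n_frames` are hypothetical helpers.

```python
def stretched_length(n_samples, sample_rate):
    """Length after librosa.effects.time_stretch with rate = duration in seconds."""
    rate = n_samples / sample_rate          # clip duration in seconds
    return int(round(n_samples / rate))     # stretching by `rate` divides the length by it

def n_frames(n_samples, hop_length=512):
    """Number of STFT frames with centered framing (center=True)."""
    return 1 + n_samples // hop_length

sample_rate = 48_000                      # assumed, not confirmed by the notebook
clip = 5 * sample_rate + 1234             # a made-up clip of a bit over 5 seconds
one_second = stretched_length(clip, sample_rate)

print(one_second, n_frames(one_second))   # samples after stretching, frames
print(n_frames(clip))                     # frames if the clip were left unstretched
```

Under that assumption a stretched clip already yields 94 frames, so the `zoom` resize to `(128, 94)` barely changes it. Note that `create_spectrogram_from_audio`, used at prediction time, does not call `time_stretch`, so its 5-second segments rely on `zoom` alone to reach 94 columns.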


In [None]:
# The models below follow the structure of https://keras.io/examples/vision/mnist_convnet/

# I tried different approaches to find the best model.
# I believe dropout is important, but too much dropout is not great.
# I tried different things up to version 4. From version 5, I first let the model overfit,
# then improved it from there.

# # version 1
# # without Dropout(0.5) > epoch 9, batch_size=32
# # Test Loss: 0.4199875295162201, Test Accuracy: 0.8965517282485962
# model = Sequential()
# model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
# model.add(MaxPooling2D((2, 2)))
# model.add(Conv2D(64, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Conv2D(128, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Flatten())
# model.add(Dense(64, activation='relu'))
# model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

# # version 2
# # with Dropout(0.5) > epoch 9, batch_size=32
# Test Loss: 0.28963503241539, Test Accuracy: 0.9051724076271057
# model = Sequential()
# model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
# model.add(MaxPooling2D((2, 2)))
# model.add(Conv2D(64, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Conv2D(128, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Flatten())
# model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.5))
# model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

# # version 3
# # epoch 14/20 batch_size=32
# # Test Loss: 0.24547772109508514, Test Accuracy: 0.9172413945198059
# model = Sequential()
# model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Flatten())
# model.add(Dropout(0.5))
# model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

# # version 4
# epoch 15/20 batch_size=32
# Test Loss: 0.26435738801956177, Test Accuracy: 0.8982758522033691
# model = Sequential()
# model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Conv2D(64, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Conv2D(128, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Dropout(0.2))
# model.add(Flatten())
# model.add(Dropout(0.5))
# model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

# version 5 - it overfit without dropout, so I added some dropout and adjusted the dense layers until the accuracies were about the same
# I initially started with Conv2D(16, ...); it worked well but was somewhat less accurate for longer audio,
# so I increased it to 64, 128, and 256. 128 seems to give the most accurate results on random audio tests
model = Sequential()
model.add(Conv2D(128, (3, 3), padding='same', activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(512, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(1024, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

# model.summary() # show model's summary

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# https://keras.io/api/callbacks/early_stopping/
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.2, callbacks=[early_stopping])

loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

# save the model
model.save('gender_classification_model.keras')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Test Loss: 0.23981405794620514, Test Accuracy: 0.9150779843330383
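
Training stopped at epoch 15 of 20 because of the `EarlyStopping` callback. As a simplified sketch of that rule (not Keras's actual implementation), the function below stops once the validation loss has failed to improve for `patience` consecutive epochs and reports which epoch held the best loss. The loss values are made up for illustration, since per-epoch losses were not kept in the output above.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops after `patience` epochs in a row without a new best
    validation loss; if that never happens, it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: the best value is at epoch 4.
fake_val_losses = [0.60, 0.45, 0.38, 0.30, 0.33, 0.31, 0.35, 0.29]
stop_epoch, best_epoch = early_stopping_epoch(fake_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

If Keras's rule matches this sketch, a stop at epoch 15 with `patience=3` means the best validation loss was at epoch 12, and `restore_best_weights=True` rolls the model back to that epoch before evaluation.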


In [None]:
'''
My project's primary objective was to develop a gender/age classification model
and then test it on various audio recordings.
For testing, I used some personal recordings and WAV files
that I found online.
'''

# load model
model = tf.keras.models.load_model('gender_classification_model.keras')

def create_spectrogram_from_audio(audio, sr):
    melspectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048, hop_length=512)
    spectrogram_db = librosa.power_to_db(S=melspectrogram, ref=1.0)

    spectrogram_db = resize_spectrogram_interpolation(spectrogram_db, (128, 94))
    return spectrogram_db

# split the audio into 5-second segments, predict each segment,
# and average the predictions over all segments
def predict_gender(audio_file, segment_duration=5, model=model):
    audio, sr = librosa.load(audio_file, sr=None)
    samples_per_segment = int(segment_duration * sr)

    predictions = []

    if len(audio) < samples_per_segment:
        segment = audio
        spectrogram = create_spectrogram_from_audio(segment, sr)
        features_reshaped = np.expand_dims(spectrogram, axis=0)
        prediction = model.predict(features_reshaped)
        predictions.append(prediction[0][0])
    else:
        # Calculate the number of full segments that can fit in the audio file
        num_segments = len(audio) // samples_per_segment

        for i in range(num_segments):
            start = i * samples_per_segment
            end = start + samples_per_segment
            segment = audio[start:end]

            spectrogram = create_spectrogram_from_audio(segment, sr)
            features_reshaped = np.expand_dims(spectrogram, axis=0)
            prediction = model.predict(features_reshaped)
            predictions.append(prediction[0][0])

    avg_prediction = np.mean(predictions)
    print(predictions)

    if avg_prediction >= 0.5:
        gender = "Female"
    else:
        gender = "Male"

    return gender, avg_prediction

# sample audio: any audio format that librosa can load should work here.
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_18277778.mp3') # male
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_22338655.mp3') # female
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Recording.m4a') # me
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/sally.m4a') # female
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker_0000_00000.wav') # male
gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker_0000_00001.wav') # male
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker_0000_00002.wav') # male
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker_0001_00000.wav') # male
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker0048_000.wav') # female
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/Speaker0048_028.wav') # female

print(f"Predicted Gender: {gender_prediction}")

[0.43388322, 0.6385415, 0.21062468, 0.7336307, 0.2577917, 0.3255142, 0.2294937, 0.3939412, 0.58805007, 0.30647746, 0.09952997, 0.1183857]
Predicted Gender: ('Male', 0.36132202)
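
The averaging step can be checked by hand from the per-segment scores printed above. The sketch below recomputes the mean and, as an alternative that this notebook does not use, a simple majority vote over segments. `segment_scores` is copied from the output; `label_from_score` and `majority_vote` are hypothetical helpers.

```python
from statistics import mean

# Per-segment sigmoid outputs copied from the printed list above.
segment_scores = [
    0.43388322, 0.6385415, 0.21062468, 0.7336307, 0.2577917, 0.3255142,
    0.2294937, 0.3939412, 0.58805007, 0.30647746, 0.09952997, 0.1183857,
]

def label_from_score(score, threshold=0.5):
    """Map a sigmoid output to the notebook's labels (0 = male, 1 = female)."""
    return "Female" if score >= threshold else "Male"

def majority_vote(scores, threshold=0.5):
    """Label chosen by most segments; ties go to 'Female', matching >= above."""
    female_votes = sum(1 for s in scores if s >= threshold)
    return "Female" if female_votes * 2 >= len(scores) else "Male"

avg = mean(segment_scores)
print(f"{avg:.4f}", label_from_score(avg), majority_vote(segment_scores))
```

Three of the twelve segments score above 0.5 even though the clip is labeled male, so individual segments are fairly noisy and combining them is what gives the correct answer here.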


## **Start of age implementation (Mel-Spectrogram)**

I reused the code above and the version 5 CNN model.

In [None]:
'''
This part is very similar to what I have done above.

The results were not very good. I believe this is due to the lack of data.

I will need to find a way to use more than 2,300 samples. Anything
larger probably can't be done in Google Colab, as this already uses up most of its resources.
'''

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten, Conv2D, MaxPooling2D, Input
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
from scipy.ndimage import zoom
from tensorflow import keras

def resize_spectrogram_interpolation(spectrogram, new_size):
    return zoom(spectrogram, (new_size[0] / spectrogram.shape[0], new_size[1] / spectrogram.shape[1]))

def create_spectrogram(input_mp3_file):
    audio_signal, sample_rate = librosa.load(input_mp3_file, sr=None)
    audio_signal = librosa.effects.time_stretch(y=audio_signal, rate=len(audio_signal)/sample_rate)

    melspectrogram = librosa.feature.melspectrogram(y=audio_signal, sr=sample_rate, n_fft=2048, hop_length=512)
    spectrogram_db = librosa.power_to_db(S=melspectrogram, ref=1.0)

    # # display the spectrogram
    # plt.figure(figsize=(14, 5))
    # librosa.display.specshow(spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel')
    # plt.colorbar()
    # print(spectrogram_db.shape)

    # # add constant 0 to smaller spectrograms
    # target_size = 94
    # if spectrogram_db.shape[1] < target_size:
    #     pad_width = target_size - spectrogram_db.shape[1]
    #     spectrogram_db = np.pad(spectrogram_db, ((0, 0), (0, pad_width)), mode='constant')
    # spectrogram_db = spectrogram_db.reshape((128, 94, 1))

    spectrogram_db = resize_spectrogram_interpolation(spectrogram_db, (128, 94))
    return spectrogram_db

# split into under 30 (teens, twenties) and 30+
# map the labels to 0/1 so I can do binary classification
age_mapping = {
    'teens': 0,
    'twenties': 0,
    'thirties': 1,
    'fourties': 1,
    'fifties': 1,
    'sixties': 1,
    'seventies': 1,
    'eighties': 1
}

features = []  # one 2D Mel-spectrogram array per audio clip
labels_gender = []  # gender labels (0 = male, 1 = female), in the same order as features
labels_age = []  # age labels (0 = under 30, 1 = 30+), in the same order as features

counter = 0
for _, row in df.iterrows():
    if counter < 2308:  # max 2308 for age parsed
        file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']

        spectrogram = create_spectrogram(file_name)

        # get the gender labels
        if spectrogram is not None:
            features.append(spectrogram)
            if row['gender'].lower() == 'male':
                labels_gender.append(0)
            elif row['gender'].lower() == 'female':
                labels_gender.append(1)
            else:
                print("Can't handle this gender label!")
            labels_age.append(age_mapping[row['age']])
        counter += 1
    else:
        break

# convert features and both label lists to NumPy arrays
X = np.array(features)
y_gender = np.array(labels_gender)
y_age = np.array(labels_age)

# split train and test
X_train, X_test, y_train_gender, y_test_gender, y_train_age, y_test_age = train_test_split(X, y_gender, y_age, test_size=0.2)

# # display X_train spectrogram
# for i in range(len(X_train)):
#     plt.figure(figsize=(8, 8))
#     plt.imshow(X_train[i])
#     plt.show()

print("X_train dimensions:", X_train.shape)

X_train dimensions: (1846, 128, 94)


In [None]:
# copy of version 5 model created previously
model = Sequential()
model.add(Conv2D(128, (3, 3), padding='same', activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(256, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(512, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(1024, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dropout(0.3))

# Got some help from https://stackoverflow.com/questions/44036971/multiple-outputs-in-keras
input_layer = Input(shape=(X_train.shape[1], X_train.shape[2], 1))
x = model(input_layer)
gender = Dense(1, activation='sigmoid', name='gender')(x)
age = Dense(1, activation='sigmoid', name='age')(x)
model = Model(inputs=input_layer, outputs=[gender, age])

# model.summary() # show model's summary

model.compile(loss={'gender': 'binary_crossentropy', 'age': 'binary_crossentropy'},
              optimizer='adam',
              metrics={'gender': 'accuracy', 'age': 'accuracy'})

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, {'gender': y_train_gender, 'age': y_train_age},
                    validation_data=(X_test, {'gender': y_test_gender, 'age': y_test_age}),
                    epochs=20, batch_size=64, callbacks=[early_stopping])

eval_results = model.evaluate(X_test, {'gender': y_test_gender, 'age': y_test_age})
# evaluate returns [total loss, gender loss, age loss, gender accuracy, age accuracy]
print(f'Total Test Loss: {eval_results[0]}, Gender Test Loss: {eval_results[1]}, Gender Accuracy: {eval_results[3]}, Age Accuracy: {eval_results[4]}')

# save the model
model.save('gender_age_classification_model.keras')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Total Test Loss: 0.9156877398490906, Gender Test Loss: 0.2924714982509613, Gender Accuracy: 0.8939393758773804, Age Accuracy: 0.6385281682014465
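
For a model with two outputs, the first number Keras returns from `evaluate` is the combined loss, which with default loss weights is the sum of the per-output losses. Under that reading, the 0.9157 printed above is the total, and the age loss can be recovered by subtracting the gender loss. The sketch also compares it with `ln 2`, the binary cross-entropy of always predicting 0.5. Variable names are illustrative.

```python
import math

total_loss = 0.9156877398490906    # first value returned by model.evaluate
gender_loss = 0.2924714982509613   # second value returned by model.evaluate

# With default loss weights of 1.0 the combined loss is the plain sum,
# so the age loss is what remains after removing the gender loss.
age_loss = total_loss - gender_loss

# Binary cross-entropy of a model that always outputs 0.5.
chance_loss = math.log(2)

print(f"age loss ~ {age_loss:.4f}, coin-flip loss = {chance_loss:.4f}")
```

An age loss of about 0.62 is only slightly below 0.69, which fits the weak age accuracy reported above.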


In [None]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten, Conv2D, MaxPooling2D
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
from scipy.ndimage import zoom
from tensorflow import keras
model = tf.keras.models.load_model('gender_age_classification_model.keras')

# I asked GPT to adapt my original code to the gender + age version
def create_spectrogram_from_audio(audio, sr):
    melspectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048, hop_length=512)
    spectrogram_db = librosa.power_to_db(S=melspectrogram, ref=1.0)

    spectrogram_db = resize_spectrogram_interpolation(spectrogram_db, (128, 94))
    return spectrogram_db

def predict_gender_age(audio_file, segment_duration=5, model=model):
    audio, sr = librosa.load(audio_file, sr=None)
    samples_per_segment = int(segment_duration * sr)

    gender_predictions = []
    age_predictions = []

    if len(audio) < samples_per_segment:
        segment = audio
        spectrogram = create_spectrogram_from_audio(segment, sr)
        features_reshaped = np.expand_dims(spectrogram, axis=0)
        prediction = model.predict(features_reshaped)
        gender_predictions.append(prediction[0][0])
        age_predictions.append(prediction[1][0])
    else:
        num_segments = len(audio) // samples_per_segment

        for i in range(num_segments):
            start = i * samples_per_segment
            end = start + samples_per_segment
            segment = audio[start:end]

            spectrogram = create_spectrogram_from_audio(segment, sr)
            features_reshaped = np.expand_dims(spectrogram, axis=0)
            prediction = model.predict(features_reshaped)
            gender_predictions.append(prediction[0][0])
            age_predictions.append(prediction[1][0])

    avg_gender_prediction = np.mean(gender_predictions)
    avg_age_prediction = np.mean(age_predictions)

    gender = "Female" if avg_gender_prediction >= 0.5 else "Male"
    age_group = "Below 30" if avg_age_prediction < 0.5 else "Above 30"

    return gender, avg_gender_prediction, age_group, avg_age_prediction

# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/en_test_0/common_voice_en_18277778.mp3') # male above 30 (labeled 'fourties')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/en_test_0/common_voice_en_22338655.mp3') # female below 30
gender, gender_confidence, age_group, age_confidence = predict_gender_age('./Recording.m4a') # male below 30
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/sally.m4a') # female above 30
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker_0000_00000.wav')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker_0000_00001.wav')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker_0000_00002.wav')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker_0001_00000.wav')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker0048_000.wav')
# gender, gender_confidence, age_group, age_confidence = predict_gender_age('./drive/MyDrive/audio-project/Speaker0048_028.wav')
print(f"Predicted Gender: {gender}, Result: {gender_confidence}")
print(f"Predicted Age Group: {age_group}, Result: {age_confidence}")

### **Below are the MFCC and Spectrogram Implementations**

## These are here only for future reference. The code is incomplete and does not work well.

I need to research MFCCs more before using them.

The spectrogram implementation is something I tried before the Mel spectrogram; it didn't work well.

# MFCC Implementation

I think this is better than the Mel spectrogram when used correctly. However, I am not sure how MFCC features are calculated, so I have not used it.
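
For reference, MFCCs are usually obtained by taking the log-scaled Mel spectrogram and applying a discrete cosine transform (DCT) across the Mel bands of each frame, keeping only the first few coefficients. The sketch below implements an orthonormal DCT-II in plain Python for a single frame. This should correspond to the defaults of `librosa.feature.mfcc` (`dct_type=2`, `norm='ortho'`), but treat that as an assumption. `dct_ii_ortho` and `mfcc_from_log_mel_frame` are hypothetical names, and the input frame is made up.

```python
import math

def dct_ii_ortho(values):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(values)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(values)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs

def mfcc_from_log_mel_frame(log_mel_frame, n_mfcc=40):
    """Keep the first n_mfcc DCT coefficients of one log-Mel frame."""
    return dct_ii_ortho(log_mel_frame)[:n_mfcc]

# Made-up log-Mel frame with 128 bands: a smooth downward slope in dB.
frame = [-20.0 - 0.25 * band for band in range(128)]
mfcc = mfcc_from_log_mel_frame(frame, n_mfcc=40)
print(len(mfcc), [round(c, 2) for c in mfcc[:4]])
```

The low-order coefficients describe the broad shape of the spectrum, which is why a few of them can summarize 128 Mel bands. The `extract_features` function in this section additionally averages each coefficient over time, giving one 40-value vector per clip.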

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

file_path = './drive/MyDrive/audio-project/common_test.tsv'
# file_path = './drive/MyDrive/audio-project/common_train.tsv'
df_raw = pd.read_csv(file_path, sep='\t')
print(len(df_raw))
df_include_other = df_raw.dropna(subset=['gender'])
print(len(df_include_other))
df = df_include_other[df_include_other['gender'] != 'other']
print(len(df))

16372
2356
2320


In [None]:
df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
16,00c3f0e7c691ef30257d1bfa9adc410535b7ba3f48e344...,common_voice_en_18295850.mp3,The long-lived bridge still stands today.,2,0,twenties,male,,,en,
61,030d0b51d96c93d1db9e4ba94dceaf341d98d51eb36820...,common_voice_en_22338655.mp3,The prints are then delivered to the customer.,3,1,twenties,female,Hong Kong English,,en,
79,040595ac714a98d21fe0c2f36d96997900085115175065...,common_voice_en_18277778.mp3,We should not take for granted how fortunate w...,2,1,fourties,male,United States English,,en,
84,043a451f648097c1a200f7e966289233e234f4e35ee00f...,common_voice_en_21943181.mp3,eight,4,3,twenties,male,,,en,Benchmark
86,0446e65032f30acdda12c87fef9d1de14d34946a4d2430...,common_voice_en_20586574.mp3,Geils began playing jazz trumpet but eventuall...,4,0,twenties,male,"United States English,wolof",,en,


In [None]:
# !tar -xvf './drive/MyDrive/audio-project/en_test_0.tar' -C './drive/MyDrive/audio-project/' # unzip already done!

In [None]:
import pandas as pd
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten

# feature extraction function
def extract_features(file_path):
    try:
        audio, sample_rate = librosa.load(file_path)
        # https://librosa.org/doc/main/generated/librosa.feature.mfcc.html
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40) # can increase n_mfcc for longer array
        mfccs_processed = np.mean(mfccs.T, axis=0)
        return mfccs_processed
    except Exception as e:
        # report which file failed and why, then skip it
        print("Error encountered while parsing file:", file_path, "-", e)
        return None

features = [] # contains audio features in 1D array
labels_gender = [] # gender labels (0 = male, 1 = female), one per entry in features

# for _, row in df.iterrows():
#     file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']
#     mfccs = extract_features(file_name)
#     if mfccs is not None:
#         features.append(mfccs)
#         if row['gender'].lower() == 'male':
#             labels_gender.append(0)
#         elif row['gender'].lower() == 'female':
#             labels_gender.append(1)
#         else:
#             print(f"Unexpected gender label: {row['gender']}")
counter = 0
for _, row in df.iterrows():
    if counter < 2000:  # limit to the first 2000 rows
        file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']
        mfccs = extract_features(file_name)
        if mfccs is not None:
            features.append(mfccs)
            if row['gender'].lower() == 'male':
                labels_gender.append(0)
            elif row['gender'].lower() == 'female':
                labels_gender.append(1)
            else:
                print(f"Unexpected gender label: {row['gender']}")
        counter += 1
    else:
        break

# convert features and labels_gender to NumPy arrays
X = np.array(features) # 2D array
y = np.array(labels_gender)

# split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# reshape this to 3D to use Conv1D
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)

print("X_train dimensions:", X_train.shape)

# CNN Model using Conv1D and MaxPooling1D
# I chose this method because many people on Stack Overflow seemed to use Conv1D and MaxPooling1D for speech recognition
# https://keras.io/api/layers/convolution_layers/convolution1d/
# https://keras.io/keras_core/api/layers/pooling_layers/max_pooling1d/
model = Sequential()
model.add(Conv1D(32, 3, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(MaxPooling1D(2))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(2))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

# save the model (note: this reuses the file name of the Mel-spectrogram gender model and overwrites it)
model.save('gender_classification_model.keras')

X_train dimensions: (1600, 40, 1)
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test Loss: 0.24523252248764038, Test Accuracy: 0.8999999761581421


In [None]:
# load model
model = tf.keras.models.load_model('gender_classification_model.keras')

# def extract_features(file_path, sr=44100):
#     max_length = 120
#     try:
#         audio, sample_rate = librosa.load(file_path, sr=sr)
#         stft = np.abs(librosa.stft(audio, n_fft=255, hop_length=2048))
#         print(stft.shape)
#         if stft.shape[1] < max_length:
#             pad_width = max_length - stft.shape[1]
#             stft = np.pad(stft, pad_width=((0, 0), (0, pad_width)), mode='constant')
#         else:
#             stft = stft[:, 720:max_length+720]

#         return stft
#     except Exception as e:
#         print("Error encountered while parsing file: ", file_path)
#         return None

def predict_gender(audio_file):
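    features = extract_features(audio_file)
    # extract_features returns None when the file cannot be parsed
    if features is None:
        raise ValueError(f"Could not extract features from {audio_file}")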
    # features = resize_spectrogram(features, target_shape=(128, 128))

    # reshape feature for the model
    features_reshaped = np.expand_dims(features, axis=0)
    features_reshaped = np.expand_dims(features_reshaped, axis=-1)

    # get the prediction
    prediction = model.predict(features_reshaped)
    print(prediction)

    if prediction[0][0] >= 0.5:
        return "Female"
    else:
        return "Male"

# sample audio
# gender_prediction = predict_gender('Speaker_0000_00000.wav')
# gender_prediction = predict_gender('Recording.m4a')
# gender_prediction = predict_gender('Speaker0048_000.wav')
# gender_prediction = predict_gender('Speaker0048_028.wav')
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_18277778.mp3') # male
gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_22338655.mp3') # female

print(f"Predicted Gender: {gender_prediction}")



[[0.74120694]]
Predicted Gender: Female
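
The printed value is the raw sigmoid output, which `predict_gender` turns into a label with a 0.5 threshold. Below is a minimal sketch of that mapping that also reports how far the score sits from the threshold. The function name `label_from_probability` is hypothetical, and the returned score is only the sigmoid output, not a calibrated probability.

```python
def label_from_probability(p_female, threshold=0.5):
    """Map a sigmoid output to a label and the score assigned to that label.

    Convention assumed from the training code: 0 = male, 1 = female.
    """
    if not 0.0 <= p_female <= 1.0:
        raise ValueError("sigmoid output must lie in [0, 1]")
    label = "Female" if p_female >= threshold else "Male"
    score = p_female if label == "Female" else 1.0 - p_female
    return label, score


label, score = label_from_probability(0.74120694)
print(f"{label} (score {score:.2f})")
```

A score close to 0.5 means the model barely prefers one class, so such predictions are worth treating with caution.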


# Spectrogram Implementation

This approach did not work: the extracted spectrogram features were too large, and resizing them down lost too much information.
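
To make the size problem concrete, here is a rough sketch of the arithmetic. It assumes the settings used in the extraction code of this section (`n_fft=512`, frames padded or cut to 250) and the 40 features per clip reported by the Mel-spectrogram run (`X_train dimensions: (1600, 40, 1)`). The helper `stft_shape` is a hypothetical name.

```python
def stft_shape(n_fft, n_frames):
    """Shape of a magnitude STFT: (frequency bins, time frames)."""
    return (n_fft // 2 + 1, n_frames)


n_bins, n_frames = stft_shape(n_fft=512, n_frames=250)
stft_values = n_bins * n_frames   # values per clip for the raw spectrogram
mel_values = 40                   # values per clip in the Mel-spectrogram model

print("STFT shape per clip:", (n_bins, n_frames))
print("STFT values per clip:", stft_values)            # 64250
print("Ratio to Mel features:", stft_values / mel_values)  # 1606.25
```

With roughly 2,300 labeled clips, each spectrogram input carries far more values than the 40 Mel features, which makes the model harder to fit without overfitting.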

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

file_path = './drive/MyDrive/audio-project/common_test.tsv'
# file_path = './drive/MyDrive/audio-project/common_train.tsv'
df_raw = pd.read_csv(file_path, sep='\t')
print(len(df_raw))
df_include_other = df_raw.dropna(subset=['gender'])
print(len(df_include_other))
df = df_include_other[df_include_other['gender'] != 'other']
print(len(df))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
16372
2356
2320


In [None]:
import pandas as pd
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten, Conv2D, MaxPooling2D
from scipy.interpolate import interp2d

# extract a spectrogram with the Short-Time Fourier Transform (STFT)
# result shape: (frequency bins, time frames)
def extract_features(file_path):
    max_length = 250
    # target_shape=(40, 40)
    try:
        audio, sample_rate = librosa.load(file_path)
        # https://librosa.org/doc/main/generated/librosa.stft.html
        # use stft. n_fft sets the number of frequency bins (rows): n_fft/2 + 1
        # https://stackoverflow.com/questions/62584184/understanding-the-shape-of-spectrograms-and-n-mels
        stft = np.abs(librosa.stft(audio, n_fft=512))

        if stft.shape[1] < max_length:
            pad_width = max_length - stft.shape[1]
            stft = np.pad(stft, pad_width=((0, 0), (0, pad_width)), mode='constant')
        else:
            stft = stft[:, :max_length]

        # print part of the spectrogram
        print("spectrogram:")
        print(f"The sampling rate of the audio file is: {sample_rate} Hz")
        print(stft[:6, :6])
        print("Shape of stft array:", stft.shape)
        print("Number of rows:", stft.shape[0])
        print("Number of columns:", stft.shape[1])

        return stft
    except Exception as e:
        print(f"Error encountered while parsing file: {file_path} ({e})")
        return None

features = [] # one 2D spectrogram (frequency bins x time frames) per audio file
labels_gender = [] # gender label for each entry in features, in the same order

# for _, row in df.iterrows():
#     file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']
#     mfccs = extract_features(file_name)
#     if mfccs is not None:
#         features.append(mfccs)
#         if row['gender'].lower() == 'male':
#             labels_gender.append(0)
#         elif row['gender'].lower() == 'female':
#             labels_gender.append(1)
#         else:
#             print(f"Unexpected gender label: {row['gender']}")
counter = 0
for _, row in df.iterrows():
    if counter < 1000:  # only process the first 1000 rows
        file_name = './drive/MyDrive/audio-project/en_test_0/' + row['path']
        spectrogram = extract_features(file_name)
        if spectrogram is not None:
            features.append(spectrogram)
            if row['gender'].lower() == 'male':
                labels_gender.append(0)
            elif row['gender'].lower() == 'female':
                labels_gender.append(1)
            else:
                print(f"Unexpected gender label: {row['gender']}")
        counter += 1
    else:
        break

# make features, labels_gender to np array
X = np.array(features) # 3D array: (samples, frequency bins, time frames)
y = np.array(labels_gender)

# split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# add a channel axis so the input is 4D, as Conv2D expects
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)

print("X_train dimensions:", X_train.shape)

# model = Sequential()
# model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
# model.add(MaxPooling2D((2, 2)))
# model.add(Conv2D(64, (3, 3), activation='relu'))
# model.add(MaxPooling2D((2, 2)))
# model.add(Flatten())
# model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.5))
# model.add(Dense(1, activation='sigmoid')) # using binary because determining 0 - male or 1 - female
model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=16, validation_split=0.2)
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

# save the model (same filename as the Mel-spectrogram model, so it overwrites that file)
model.save('gender_classification_model.keras')
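
The fixed-length step inside `extract_features` (pad short clips with zeros, cut long ones) can be tried in isolation on toy arrays. This is a minimal sketch of that step only; `pad_or_truncate` is a hypothetical helper name and no audio is loaded.

```python
import numpy as np


def pad_or_truncate(spec, max_length):
    """Force a (bins, frames) array to exactly max_length frames.

    Shorter inputs are zero-padded on the right; longer inputs keep
    only their first max_length frames.
    """
    n_frames = spec.shape[1]
    if n_frames < max_length:
        pad_width = max_length - n_frames
        return np.pad(spec, pad_width=((0, 0), (0, pad_width)), mode="constant")
    return spec[:, :max_length]


short_clip = np.ones((3, 2))
long_clip = np.ones((3, 9))
print(pad_or_truncate(short_clip, 5).shape)  # (3, 5)
print(pad_or_truncate(long_clip, 5).shape)   # (3, 5)
```

Truncating from the start keeps whatever is at the beginning of the recording, including leading silence, so a different window may be worth trying.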

In [None]:
# load model
model = tf.keras.models.load_model('gender_classification_model.keras')

def predict_gender(audio_file):

    spectrogram = create_spectrogram(audio_file)

    # spectrogram = resize_spectrogram_interpolation(spectrogram)

    # spectrogram = max_pooling_spectrogram(spectrogram)

    # reshape feature for the model
    features_reshaped = np.expand_dims(spectrogram, axis=0)

    # get the prediction
    prediction = model.predict(features_reshaped)
    print(prediction)

    if prediction[0][0] >= 0.5:
        return "Female"
    else:
        return "Male"

# sample audio
# gender_prediction = predict_gender('Speaker_0000_00000.wav')
# gender_prediction = predict_gender('Speaker_0000_00001.wav')
# gender_prediction = predict_gender('Speaker_0000_00002.wav')
# gender_prediction = predict_gender('Speaker_0001_00000.wav')
# gender_prediction = predict_gender('Recording.m4a')
# gender_prediction = predict_gender('sally.m4a')
# gender_prediction = predict_gender('Speaker0048_000.wav')
# gender_prediction = predict_gender('Speaker0048_028.wav')
# gender_prediction = predict_gender('Speaker0049_000.wav')
# gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_18277778.mp3') # male
gender_prediction = predict_gender('./drive/MyDrive/audio-project/en_test_0/common_voice_en_22338655.mp3') # female


print(f"Predicted Gender: {gender_prediction}")

[[0.52720743]]
Predicted Gender: Female
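
The Conv2D model in this section was built with `input_shape=(X_train.shape[1], X_train.shape[2], 1)`, so a single clip needs both a batch axis and a channel axis before `model.predict`. The sketch below shows that reshaping on a dummy array. It assumes the feature helper returns a 2D array of shape (257, 250), which follows from `n_fft=512` and `max_length=250`. `to_model_input` is a hypothetical name.

```python
import numpy as np


def to_model_input(spectrogram):
    """Add a leading batch axis and a trailing channel axis to a 2D array."""
    if spectrogram.ndim != 2:
        raise ValueError(f"expected a 2D array, got shape {spectrogram.shape}")
    batched = np.expand_dims(spectrogram, axis=0)   # (1, bins, frames)
    return np.expand_dims(batched, axis=-1)         # (1, bins, frames, 1)


dummy = np.zeros((257, 250), dtype=np.float32)
model_input = to_model_input(dummy)
print(model_input.shape)  # (1, 257, 250, 1)
```

If the helper already returns an array with a channel axis, only the batch axis is needed.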


In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_waveform_spectrogram_melspectrogram(file_path):
    audio, sample_rate = librosa.load(file_path)

    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(audio, sr=sample_rate)
    plt.title('My Audio Waveform')
    plt.xlabel('Time (seconds)')
    plt.ylabel('Amplitude')
    plt.show()

    stft = np.abs(librosa.stft(audio))
    db_stft = librosa.amplitude_to_db(stft, ref=np.max)

    plt.figure(figsize=(12, 6))
    librosa.display.specshow(db_stft, sr=sample_rate, x_axis='time', y_axis='hz')
    plt.colorbar(format='%+2.0f dB')
    plt.title('My Spectrogram')
    plt.xlabel('Time (seconds)')
    plt.ylabel('Frequency (Hz)')
    plt.show()

    mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sample_rate)
    db_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

    plt.figure(figsize=(12, 6))
    librosa.display.specshow(db_mel_spectrogram, sr=sample_rate, x_axis='time', y_axis='mel')
    plt.colorbar(format='%+2.0f dB')
    plt.title('My Mel-Spectrogram')
    plt.xlabel('Time (seconds)')
    plt.ylabel('Mel')
    plt.show()

file_path = './drive/MyDrive/audio-project/en_test_0/common_voice_en_18277778.mp3'
plot_waveform_spectrogram_melspectrogram(file_path)
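
The Mel-spectrogram plot differs from the plain spectrogram mainly in its frequency axis, which is warped so that low frequencies get more resolution than high ones. As a rough illustration, the sketch below uses one common formula for the mel scale (the HTK variant). This is an assumption made for illustration: librosa's default conversion uses a different (Slaney-style) variant, so its numbers will not match exactly.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers far more mels at low frequencies than at high ones.
low_step = hz_to_mel_htk(200.0) - hz_to_mel_htk(100.0)
high_step = hz_to_mel_htk(4100.0) - hz_to_mel_htk(4000.0)
print(f"100 -> 200 Hz:   {low_step:.1f} mel")
print(f"4000 -> 4100 Hz: {high_step:.1f} mel")
```

This compression of the upper frequency range is why a Mel-spectrogram can describe a clip with far fewer bands than the raw STFT.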