# Age and Gender Classification

This notebook demonstrates the process of loading a pretrained Wav2Vec2 model, performing dynamic quantization, and evaluating the model on an audio file for age and gender classification.

## Imports and Model Definition

First, we import the necessary libraries and define the ModelHead and AgeGenderModel classes, which extend the pretrained Wav2Vec2 model with custom heads for age and gender classification.

In [1]:
import os
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor, Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2PreTrainedModel
import soundfile as sf
import torchaudio

class ModelHead(nn.Module):
    def __init__(self, config, num_labels):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

class AgeGenderModel(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.age = ModelHead(config, 1)
        self.gender = ModelHead(config, 3)
        self.init_weights()

    def forward(self, input_values):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits_age = self.age(hidden_states)
        logits_gender = self.gender(hidden_states)
        return logits_age, logits_gender

## Loading and Saving the Original Model

Next, we load the pretrained Wav2Vec2 model and save its state dictionary. This step ensures we have a backup of the original model before applying quantization.

In [2]:
# Load model from hub
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
cpu_device = torch.device('cpu')
model_name = 'audeering/wav2vec2-large-robust-24-ft-age-gender'
processor = Wav2Vec2Processor.from_pretrained(model_name)
config = Wav2Vec2Config.from_pretrained(model_name)

# Load the original model
model = AgeGenderModel.from_pretrained(model_name, config=config)
model.to(cpu_device)  # Move to CPU for quantization

# Save the original model - We will not push this to the git repo as it's too large
original_model_path = "age_gender_model.pth"
torch.save(model.state_dict(), original_model_path)

Some weights of AgeGenderModel were not initialized from the model checkpoint at audeering/wav2vec2-large-robust-24-ft-age-gender and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Dynamic Quantization

We apply dynamic quantization to the model to reduce its size. Quantization converts the model weights from floating-point to integer representation, which can lead to significant reductions in model size and potential improvements in inference speed.

In [3]:
# Ensure the quantization engine is set
torch.backends.quantized.engine = 'qnnpack'

# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the quantized model
quantized_model_path = "quantized_age_gender_model.pth"
torch.save(quantized_model.state_dict(), quantized_model_path)

# Verify model size reduction
original_model_size = os.path.getsize(original_model_path) / (1024 * 1024)
quantized_model_size = os.path.getsize(quantized_model_path) / (1024 * 1024)
print(f"Original model size: {original_model_size:.2f} MB")
print(f"Quantized model size: {quantized_model_size:.2f} MB")

Original model size: 1211.49 MB
Quantized model size: 340.10 MB


## Audio Preprocessing

We define helper functions to resample and normalize audio signals to ensure they are in the correct format for the model.

In [4]:
# Ensure sampling rate is 16,000 Hz
TARGET_SAMPLING_RATE = 16000

def resample_audio(signal, orig_sr, target_sr):
    if orig_sr != target_sr:
        resampler = torchaudio.transforms.Resample(orig_sr, target_sr)
        signal = resampler(torch.tensor(signal).float())
    return signal.numpy()

def normalize_audio(signal):
    return (signal - np.mean(signal)) / np.std(signal)

## Model Inference

This function processes an audio file and uses the model to predict the age and gender. The audio is resampled, normalized, and passed through the Wav2Vec2 processor before being fed into the model.

In [5]:
def process_func(model, file_path: str):
    signal, sr = sf.read(file_path)
    if len(signal.shape) > 1:
        signal = np.mean(signal, axis=1)  # Convert to mono
    signal = resample_audio(signal, sr, TARGET_SAMPLING_RATE)
    signal = normalize_audio(signal)
    inputs = processor(signal, sampling_rate=TARGET_SAMPLING_RATE, return_tensors="pt", padding=True)
    inputs = inputs.to(cpu_device)  # Ensure processing on CPU
    
    with torch.no_grad():
        logits_age, logits_gender = model(inputs['input_values'])
        
        # Apply scaling to the age logits
        age = round(logits_age.item() * 100)  # Assuming a scale factor of 100 for interpretation
        
        gender_probs = torch.softmax(logits_gender, dim=1).cpu().numpy()[0]
        gender = ['female', 'male', 'child'][np.argmax(gender_probs)]
        
    return age, gender, gender_probs

## Testing the Model

Finally, we test the model using an audio file. We perform inference using both the original and quantized models to compare their outputs.

In [6]:
# Path to the specific file to test
file_path = 'all_recordings/01.wav'

# Predict age and gender using the original model
age_original, gender_original, gender_probs_original = process_func(model, file_path)
print(f"Original Model - File: {os.path.basename(file_path)}, Age: {age_original}, Gender: {gender_original} (Probs: {gender_probs_original})")

# Load the quantized model for prediction
quantized_model = AgeGenderModel.from_pretrained(model_name, config=config)
quantized_model.load_state_dict(torch.load(quantized_model_path, map_location=cpu_device), strict=False)
quantized_model.to(cpu_device)  # Ensure quantized model is on CPU

# Predict age and gender using the quantized model
age_quantized, gender_quantized, gender_probs_quantized = process_func(quantized_model, file_path)
print(f"Quantized Model - File: {os.path.basename(file_path)}, Age: {age_quantized}, Gender: {gender_quantized} (Probs: {gender_probs_quantized})")

Original Model - File: 01.wav, Age: 37, Gender: female (Probs: [9.944589e-01 4.861043e-03 6.799646e-04])


Some weights of AgeGenderModel were not initialized from the model checkpoint at audeering/wav2vec2-large-robust-24-ft-age-gender and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  device=storage.device,


Quantized Model - File: 01.wav, Age: 37, Gender: female (Probs: [9.944589e-01 4.861043e-03 6.799646e-04])
