# MFCC Feature Extraction

This notebook extracts Mel-Frequency Cepstral Coefficients (MFCCs) from the selected audio samples to create a numerical feature representation suitable for machine learning.

# Install Required Libraries

Installs librosa for audio processing, pandas for data manipulation, and numpy for numerical operations.

In [None]:
# Install librosa, pandas, and numpy for audio processing and data manipulation
%pip install librosa pandas numpy

In [None]:
# Import necessary libraries for audio processing and data manipulation
import librosa
import numpy as np
import pandas as pd
import os

# Load Filtered Dataset

This cell loads the previously created 1000-sample dataset subset from disk and displays the first few rows to verify successful loading. 

In [None]:
# Load subset metadata
subset_path = r"D:/mcv-scripted-en-v23.0/cv-corpus-23.0-2025-09-05/en/subset_1000.csv"
df_small = pd.read_csv(subset_path)

# Display start of data
df_small.head()

# Inspect Gender Labels

This cell displays the values in the gender column to identify available label categories and check for inconsistencies or missing values.

In [None]:
# Check the unique values in the "gender" column
df_small["gender"].unique()
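
Because `unique()` lists categories without counts, a frequency table that also counts missing entries can make inconsistencies easier to spot. The sketch below is a minimal illustration on a made-up four-row frame; the frame `demo` and the placeholder labels `label_a` and `label_b` are assumptions for demonstration, not values from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_small with one missing label
demo = pd.DataFrame({"gender": ["label_a", "label_b", None, "label_a"]})

# Count each category, keeping missing values as their own row
counts = demo["gender"].value_counts(dropna=False)

# Count missing labels explicitly
n_missing = int(demo["gender"].isna().sum())

print(counts)
print(f"Missing labels: {n_missing}")
```

The same two calls could be applied to `df_small["gender"]` to see how many rows carry each label before any filtering.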

# Define MFCC Extraction Function

This cell defines a helper function that loads an audio file, extracts MFCC features, and returns a fixed-length feature vector by averaging each coefficient across time frames. Files that cannot be processed are reported and skipped.

In [None]:
# Function to extract MFCC features from an audio file
def extract_mfcc(audio_path, n_mfcc=13):
    try:
        # Load audio file and extract MFCC features
        signal, sr = librosa.load(audio_path, sr=22050) # Standardize sample rate
        mfcc_result = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc) # Extract MFCCs from audio
        mfcc_mean = np.mean(mfcc_result, axis=1) # Compute mean MFCCs across time frames
        # Return MFCC features
        return mfcc_mean 
    # Handle exceptions during audio processing and return None if an error occurs
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None
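
The time-averaging step is what makes the output length independent of clip duration. The following is a minimal sketch of that idea, using random arrays as stand-ins for real MFCC matrices; the helper name `pool_mfcc` and the frame counts are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pool_mfcc(mfcc_matrix):
    """Average an (n_mfcc, n_frames) matrix over time, giving an (n_mfcc,) vector."""
    return np.mean(mfcc_matrix, axis=1)

rng = np.random.default_rng(0)
short_clip = rng.normal(size=(13, 40))    # stand-in for a short clip: 40 frames
long_clip = rng.normal(size=(13, 400))    # stand-in for a longer clip: 400 frames

short_vec = pool_mfcc(short_clip)
long_vec = pool_mfcc(long_clip)

# Both vectors have one entry per coefficient, regardless of frame count
print(short_vec.shape, long_vec.shape)
```

Mean pooling discards the ordering of frames, so it suits models that expect one fixed-size row per clip rather than a sequence.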


# Define Pitch Feature Extraction Function

This cell defines a function that extracts basic pitch statistics from an audio file using fundamental frequency estimation. Unvoiced frames are removed, and summary statistics are returned as a fixed-length feature vector.

In [None]:
def extract_pitch_features(audio_path, sr=16000):
    y, sr = librosa.load(audio_path, sr=sr)

    # Estimate fundamental frequency with pYIN over a 50-400 Hz range.
    # pYIN marks unvoiced frames as NaN; librosa.yin returns a value for every
    # frame, so the NaN filter below would never remove anything with it.
    # Note that pYIN is slower than YIN.
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=400, sr=sr)

    # Remove unvoiced frames
    f0 = f0[~np.isnan(f0)]

    # If no pitch is detected, return zeros
    if len(f0) == 0:
        return [0, 0, 0, 0, 0]

    # Return pitch features
    return [
        np.mean(f0),        # Average pitch
        np.std(f0),         # Pitch variation
        np.min(f0),         # Lowest pitch
        np.max(f0),         # Highest pitch
        np.median(f0)       # Median pitch
    ]
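
The summary step can be illustrated on a hand-made pitch track in which NaN marks frames without a pitch estimate. This is a minimal sketch under that assumption; the helper name `summarize_pitch` and the example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def summarize_pitch(f0):
    """Return [mean, std, min, max, median] over non-NaN frames, or zeros if none."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[~np.isnan(f0)]
    if voiced.size == 0:
        return [0.0] * 5
    return [
        float(np.mean(voiced)),
        float(np.std(voiced)),
        float(np.min(voiced)),
        float(np.max(voiced)),
        float(np.median(voiced)),
    ]

# Hypothetical track in Hz: two frames without a pitch estimate, three with one
track = [np.nan, 100.0, 120.0, np.nan, 140.0]
stats = summarize_pitch(track)
print(stats)

# A track with no pitched frames falls back to zeros
empty_stats = summarize_pitch([np.nan, np.nan])
print(empty_stats)
```

Returning zeros keeps the vector length fixed, but a downstream model cannot tell those rows apart from genuine values, so it may be worth counting how often the fallback occurs.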

# Extract Combined Audio Features

This cell iterates through the dataset subset and extracts MFCC and pitch features for each audio file. The extracted features are combined with the age, gender, and sentence identifier of each sample and stored for further processing.

In [None]:
# Extract MFCCs AND pitch features for all samples in the subset
mfcc_features = []

# Iterate through each row in the DataFrame and extract features
for i, row in df_small.iterrows():
    # Get the audio file path from the DataFrame
    audio_file = row["audio_path"]

    # Extract MFCC features using the defined function
    mfcc_feats = extract_mfcc(audio_file)

    # Skip files that could not be loaded, before attempting pitch extraction
    if mfcc_feats is None:
        continue

    # Extract pitch features
    pitch_feats = extract_pitch_features(audio_file)

    # Append the combined feature vector and labels
    mfcc_features.append([
        *mfcc_feats,
        *pitch_feats,
        row["age"],
        row["gender"],
        row["sentence_id"]
    ])

# Display the number of successfully extracted feature sets
print(f"Extracted features for {len(mfcc_features)} samples")
if mfcc_features:
    print(f"Feature vector size: {len(mfcc_features[0]) - 3} (excluding age, gender, sentence_id)")

# Construct Feature DataFrame

This cell organises the extracted MFCC and pitch features into a structured DataFrame with labelled columns and displays the first few rows for verification. 

In [None]:
# Column names: 13 MFCCs, 5 pitch statistics, then the 3 label/identifier columns
columns = (
    [f"mfcc_{i}" for i in range(13)] +
    ["pitch_mean", "pitch_std", "pitch_min", "pitch_max", "pitch_median"] +
    ["age", "gender", "sentence_id"]
)

# Create a DataFrame with the extracted features
df_features = pd.DataFrame(mfcc_features, columns=columns)
df_features.head()
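
Column labels only stay aligned with the data if they follow the order in which features were appended. The sketch below derives the names from the number of MFCCs and checks the total width, assuming 13 MFCCs followed by five pitch statistics as in the extraction functions above; the helper name `build_feature_columns` is hypothetical.

```python
def build_feature_columns(n_mfcc=13):
    """Build column names in the same order the features are appended."""
    mfcc_cols = [f"mfcc_{i}" for i in range(n_mfcc)]
    pitch_cols = ["pitch_mean", "pitch_std", "pitch_min", "pitch_max", "pitch_median"]
    meta_cols = ["age", "gender", "sentence_id"]
    return mfcc_cols + pitch_cols + meta_cols

feature_columns = build_feature_columns()

# 13 MFCCs + 5 pitch statistics + 3 label/identifier columns
print(len(feature_columns))
print(feature_columns[12], feature_columns[13])
```

Comparing `len(feature_columns)` against the length of one extracted row before building the DataFrame would catch a mismatch in count, though not a mismatch in order.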

# Save Extracted Features

This cell saves the extracted audio feature dataset to a CSV file for use in subsequent model training and evaluation.

In [None]:
# Save the extracted MFCC and pitch features to a CSV file
features_path = "D:/mcv-scripted-en-v23.0/cv-corpus-23.0-2025-09-05/features.csv"
df_features.to_csv(features_path, index=False)
print(f"Saved features to {features_path}")

# Prepare Feature Arrays and Labels

This cell converts the extracted feature data into NumPy arrays, separates the input features from the gender and age labels, performs basic sanity checks, and saves the arrays for use in model training.

In [None]:
# Convert to numpy and separate features from labels
mfcc_features = np.array(mfcc_features, dtype=object)
X = np.array([row[:-3] for row in mfcc_features], dtype=float)
y_gender = np.array([row[-2] for row in mfcc_features])
y_age = np.array([row[-3] for row in mfcc_features])

print(f"Feature matrix shape BEFORE saving: {X.shape}")  # (1000, 18) if every file was processed: 13 MFCCs + 5 pitch statistics

# Check that pitch is there
print(f"First sample last 5 features (pitch): {X[0, -5:]}")

# Save the NEW features
np.save('X_features.npy', X)
np.save('y_gender.npy', y_gender)
np.save('y_age.npy', y_age)

print("\n✓ Features with pitch saved successfully!")
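
A quick way to confirm that saved arrays are usable later is to load them back and compare. The following is a minimal round-trip sketch that writes to a temporary directory rather than the notebook's working directory; the arrays `X_demo` and `y_demo` and their placeholder labels are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins: 2 samples x 18 features, plus string labels
X_demo = np.arange(36, dtype=float).reshape(2, 18)
y_demo = np.array(["label_a", "label_b"])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_features.npy")
    y_path = os.path.join(tmp_dir, "y_gender.npy")

    np.save(x_path, X_demo)
    np.save(y_path, y_demo)

    # Load the arrays back while the temporary files still exist
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, y_loaded.shape)
print(np.array_equal(X_loaded, X_demo))
```

If a label array ends up with `dtype=object`, `np.load` needs `allow_pickle=True` to read it back, so checking the dtype of `y_gender` and `y_age` before saving may avoid a surprise at training time.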