# Extracting Paralinguistic Features from Audio Files

Paralinguistic features of audio files refer to the aspects of speech that convey information about the speaker’s emotional state, personality, social identity, and other non-verbal cues that accompany spoken language. These features are distinct from the linguistic content (the words themselves) and focus more on how something is said rather than what is said. Paralinguistic features can be used to understand nuances like tone, emotion, and intent in speech, which are important for applications in emotion recognition, sentiment analysis, and speaker profiling.


This notebook extracts three types of paralinguistic features:
* MFCC
* eGeMAPS
* ComParE

The code assumes that the raw audio signals have already been preprocessed. For preprocessing, refer to the other notebook.

## Setup

In [1]:
import opensmile
import audiofile
from tqdm import tqdm
import librosa
import pandas as pd  
import numpy as np
import os 
import soundfile as sf
import torch

## Load data

In [2]:
# load raw data
train_y = pd.read_csv("original/train_labels.csv", index_col=0)
ss = pd.read_csv("original/submission_format.csv", index_col=0)

# MFCC

MFCCs (Mel-Frequency Cepstral Coefficients) are a widely used feature representation for speech and audio processing. They give a compact description of the short-term power spectrum of an audio signal and are used extensively in speech recognition, speaker identification, and other audio processing tasks.

MFCCs are derived from the short-time Fourier transform of the signal. The resulting spectrum is warped onto the mel scale, which approximates how humans perceive pitch, then log-compressed and decorrelated with a discrete cosine transform. This perceptual weighting is what makes MFCCs effective for audio and speech-related tasks.
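
The pipeline behind MFCCs can be sketched with NumPy alone: frame the signal, take the power spectrum, pool it with triangular mel filters, take the log, and apply a DCT. This is a simplified illustration, not librosa's implementation. The helper names (`hz_to_mel`, `mel_to_hz`, `mel_filterbank`, `simple_mfcc`) and the frame settings are assumptions made for this example, so the values will not match `librosa.feature.mfcc` exactly.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one of several conventions in use)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def simple_mfcc(signal, sr, n_mfcc=20, n_fft=512, hop=256, n_mels=40):
    # 1. Slice the signal into overlapping, windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel filterbank energies, then log compression
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # 4. DCT-II decorrelates the log-mel energies; keep the first n_mfcc
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return (log_mel @ dct.T).T  # shape: (n_mfcc, n_frames)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
mfcc = simple_mfcc(tone, sr)
mfcc_mean = mfcc.mean(axis=1)          # one vector per file, as in the cells below
print(mfcc.shape, mfcc_mean.shape)
```

Averaging over the frame axis discards all temporal information, which is a deliberate trade-off here: every file ends up with a fixed-length vector regardless of its duration.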

In [24]:
INPUT_FOLDER = "raw_preprocessed"
FILE_FORMAT = "wav" # mp3
n_mfcc = 20  # Number of MFCC features to extract

In [25]:
train_mfcc = pd.DataFrame(index=train_y.index, columns=[f"mfcc_{i+1}" for i in range(n_mfcc)])

for uid in tqdm(train_mfcc.index):
    if "wraw" in uid:
        continue
    signal, sr = librosa.load(f"{INPUT_FOLDER}/train_audios/{uid}.{FILE_FORMAT}", sr=None)
    signal = signal / np.max(np.abs(signal))  # Normalizing the audio
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Take the mean of MFCCs across frames (to get a single feature vector per audio file)
    mfcc_mean = np.mean(mfcc, axis=1)  # This will give you a 1D array with n_mfcc features
    train_mfcc.loc[uid] = mfcc_mean

test_mfcc = pd.DataFrame(index=ss.index, columns=[f"mfcc_{i+1}" for i in range(n_mfcc)])
for uid in tqdm(test_mfcc.index):
    signal, sr = librosa.load(f"{INPUT_FOLDER}/test_audios/{uid}.{FILE_FORMAT}", sr=None)
    signal = signal / np.max(np.abs(signal))  # Normalizing the audio
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Take the mean of MFCCs across frames (to get a single feature vector per audio file)
    mfcc_mean = np.mean(mfcc, axis=1)  # This will give you a 1D array with n_mfcc features
    test_mfcc.loc[uid] = mfcc_mean

100%|██████████| 1646/1646 [00:29<00:00, 55.95it/s]
100%|██████████| 412/412 [00:07<00:00, 58.69it/s]


In [26]:
train_mfcc.to_csv("paralinguistic/train_mfcc_features_v2.csv", index=True)
test_mfcc.to_csv("paralinguistic/test_mfcc_features_v2.csv", index=True)

# eGeMAPS

eGeMAPSv02 (extended Geneva Minimalistic Acoustic Parameter Set, version 02) is a compact set of audio features designed for voice and speech analysis. These features include:

* Low-level descriptors (LLDs) such as pitch, energy, formants, and spectral features.
* Statistical functionals to summarize these LLDs over time.

These features are typically extracted from the entire audio signal, including both speech and silence segments. Silence carries little information for most of these descriptors, but long pauses or background noise can still skew energy-based and formant-based statistics. The `strip_silence` helper below can remove silent segments before extraction; it is kept for experimentation and is commented out in the extraction loops.
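
To make the split between LLDs and functionals concrete, the sketch below applies a few functionals to a made-up frame-level pitch contour. The contour values and the `functionals` helper are hypothetical and only illustrate the idea; eGeMAPSv02 defines its own fixed set of functionals, which openSMILE computes for you.

```python
import numpy as np

def functionals(lld):
    """Summarize a 1-D frame-level contour with a few statistics."""
    lld = np.asarray(lld, dtype=float)
    p20, p50, p80 = np.percentile(lld, [20, 50, 80])
    return {
        "mean": lld.mean(),
        "std": lld.std(),
        "p20": p20,
        "p50": p50,
        "p80": p80,
        "range_20_80": p80 - p20,
    }

# Hypothetical pitch contour in Hz, one value per voiced frame
f0 = np.array([110.0, 115.0, 120.0, 118.0, 112.0, 108.0, 105.0])
summary = functionals(f0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

However long the recording is, each contour collapses to the same small set of numbers, which is why the functional level yields exactly one row per file.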

In [9]:
INPUT_FOLDER = "raw_preprocessed"
FILE_FORMAT = "wav" # mp3

In [5]:
smile1 = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
print(len(smile1.feature_names))

88


In [None]:
def strip_silence(y, sr, silence_threshold=0.02, frame_length=2048, hop_length=512, output_path=None):
    """
    Strip silence from an audio signal.
    
    Parameters:
        y (numpy.ndarray): Audio signal.
        sr (int): Sampling rate of the signal in Hz.
        silence_threshold (float): RMS energy threshold below which a frame is considered silent.
        frame_length (int): Length of each frame to analyze in samples.
        hop_length (int): The number of samples to shift for each frame.
        output_path (str, optional): If given, the stripped audio is also written to this path.
    
    Returns:
        numpy.ndarray: Audio signal with silence removed.
    """
    # Compute the frame-wise energy (Root Mean Square) of the audio signal
    energy = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    
    # Identify non-silent frames where energy is above the threshold
    non_silent_frames = energy > silence_threshold
    
    # Extract the non-silent portions of the audio
    non_silent_audio = []
    start_idx = None
    
    for i in range(len(non_silent_frames)):
        if non_silent_frames[i] and start_idx is None:
            # Start of non-silent region
            start_idx = librosa.frames_to_samples(i, hop_length=hop_length)
        elif not non_silent_frames[i] and start_idx is not None:
            # End of non-silent region
            end_idx = librosa.frames_to_samples(i, hop_length=hop_length)
            non_silent_audio.append(y[start_idx:end_idx])
            start_idx = None
    
    # Handle case where the audio ends with a non-silent region
    if start_idx is not None:
        non_silent_audio.append(y[start_idx:])
    
    # Concatenate all non-silent audio parts (empty array if everything was silent)
    if non_silent_audio:
        stripped_audio = np.concatenate(non_silent_audio)
    else:
        stripped_audio = np.array([], dtype=y.dtype)
    
    # Optionally save the result, then return it
    if output_path is not None:
        sf.write(output_path, stripped_audio, sr)
    return stripped_audio
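
As a quick check of the thresholding idea, the sketch below computes frame-wise RMS with NumPy on a synthetic signal (half a second of silence followed by half a second of a tone) and counts the frames that pass the 0.02 threshold. The `frame_rms` helper is a simplified stand-in written for this example; it does not pad the signal the way `librosa.feature.rms` does by default, so frame counts will differ slightly from those inside `strip_silence`.

```python
import numpy as np

def frame_rms(y, frame_length=2048, hop_length=512):
    # RMS energy of each frame, without edge padding
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return np.sqrt(np.mean(y[idx] ** 2, axis=1))

sr = 16000
t = np.arange(sr // 2) / sr
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
y = np.concatenate([silence, tone])

rms = frame_rms(y)
keep = rms > 0.02  # same value as the default silence_threshold
print(f"{keep.sum()} of {len(keep)} frames kept")
```

Because the threshold is an absolute amplitude, the result depends on the recording level. Peak-normalizing the signal first, as the MFCC cells do, is one way to make a fixed threshold behave more consistently across files.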

In [None]:
train_gemaps = pd.DataFrame(index=train_y.index, columns=smile1.feature_names)
for uid in tqdm(train_gemaps.index):
    try:
        signal, sr = librosa.load(f"{INPUT_FOLDER}/train_audios/{uid}.{FILE_FORMAT}", sr=None)
        # signal = signal / np.max(np.abs(signal))
        # signal_nosilence = strip_silence(signal, sr)
        output = smile1.process_signal(signal, sampling_rate=sr)
        train_gemaps.loc[uid] = np.asarray(output[smile1.feature_names])
    except Exception as e:
        # Report which file failed so it can be inspected later
        print(f"{uid}: {e}")

test_gemaps = pd.DataFrame(index=ss.index, columns=smile1.feature_names)
for uid in tqdm(test_gemaps.index):
    signal, sr = librosa.load(f"{INPUT_FOLDER}/test_audios/{uid}.{FILE_FORMAT}", sr=None)
    # signal = signal / np.max(np.abs(signal))
    # signal_nosilence = strip_silence(signal, sr)
    output = smile1.process_signal(signal, sampling_rate=sr)
    test_gemaps.loc[uid] = np.asarray(output[smile1.feature_names])

100%|██████████| 1646/1646 [00:29<00:00, 55.95it/s]
100%|██████████| 412/412 [00:07<00:00, 58.69it/s]


In [14]:
train_gemaps.to_csv("paralinguistic/train_gemaps_features_v2.csv", index=True)
test_gemaps.to_csv("paralinguistic/test_gemaps_features_v2.csv", index=True)

# ComParE

ComParE (Computational Paralinguistics ChallengE) is the acoustic feature set used as the baseline in the INTERSPEECH Computational Paralinguistics Challenge series. It is a large, brute-force set of low-level descriptors (LLDs) covering energy, spectral, cepstral (MFCC), and voicing-related properties such as pitch, jitter, and shimmer, together with statistical functionals computed over them.

ComParE features are particularly useful for analyzing prosodic and voice-quality aspects of speech, including emotional tone, stress, and intonation patterns, and for tasks such as emotion recognition and speaker-state classification.

This notebook does not use the full functional set. It extracts the frame-level deltas of the 65 LLDs (`LowLevelDescriptors_Deltas`) and averages them over time to obtain one 65-dimensional vector per audio file.
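
The cells below request `LowLevelDescriptors_Deltas`, that is, the frame-to-frame rate of change of each descriptor. As a rough illustration of what a delta is, here is the standard regression-based delta formula in NumPy. The `delta` helper, the window size `N=2`, and the edge padding are assumptions made for this example; openSMILE's own configuration may differ.

```python
import numpy as np

def delta(contour, N=2):
    """Regression-based delta of a 1-D contour, with edge padding."""
    c = np.asarray(contour, dtype=float)
    padded = np.pad(c, N, mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(c)
    for n in range(1, N + 1):
        # padded[N + t + n] - padded[N + t - n] for every frame t
        ahead = padded[N + n : N + n + len(c)]
        behind = padded[N - n : N - n + len(c)]
        d += n * (ahead - behind)
    return d / denom

ramp = 3.0 * np.arange(10)   # a contour rising by 3 units per frame
d = delta(ramp)
print(d)         # interior frames recover the slope; edge frames are damped by padding
print(d.mean())  # the per-file summary used in the cells below
```

One thing to keep in mind: for a contour that rises and falls over the course of a recording, positive and negative deltas largely cancel, so the mean of the deltas can end up close to zero. If the averaged features turn out to be uninformative, other functionals such as the standard deviation may be worth trying.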

In [15]:
INPUT_FOLDER = "raw_preprocessed"
FILE_FORMAT = "wav" # mp3

In [16]:
smile2 = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors_Deltas
)
print(len(smile2.feature_names))

65


In [None]:
train_compare = pd.DataFrame(index=train_y.index, columns=smile2.feature_names)
for uid in tqdm(train_compare.index):
    try:
        signal, sr = librosa.load(f"{INPUT_FOLDER}/train_audios/{uid}.{FILE_FORMAT}", sr=None)
        # signal = signal / np.max(np.abs(signal))
        # signal_nosilence = strip_silence(signal, sr)
        output = smile2.process_signal(signal, sampling_rate=sr)
        train_compare.loc[uid] = np.asarray(output.mean(axis=0)[smile2.feature_names])
    except Exception as e:
        # Report which file failed so it can be inspected later
        print(f"{uid}: {e}")
test_compare = pd.DataFrame(index=ss.index, columns=smile2.feature_names)
for uid in tqdm(test_compare.index):
    signal, sr = librosa.load(f"{INPUT_FOLDER}/test_audios/{uid}.{FILE_FORMAT}", sr=None)
    # signal = signal / np.max(np.abs(signal))
    # signal = strip_silence(signal, sr)
    output = smile2.process_signal(signal, sampling_rate=sr)
    test_compare.loc[uid] = np.asarray(output.mean(axis=0)[smile2.feature_names])

100%|██████████| 1646/1646 [00:29<00:00, 55.95it/s]
100%|██████████| 412/412 [00:07<00:00, 58.69it/s]


In [20]:
train_compare.to_csv("paralinguistic/train_compare_features_v2.csv", index=True)
test_compare.to_csv("paralinguistic/test_compare_features_v2.csv", index=True)