# Audiofeatures for AI-TD
We need a reasonably large feature vector derived directly from a recorded waveform. Specifically, we compute the audiofeatures for each **1 s** frame of a recording, which may be between **1** and **5** seconds long, and then take the mean over all frames. Some features are tied to a peak in the signal; others produce an array that can be averaged. For now, we want exactly one scalar value per feature per frame. After averaging, that yields one value per feature per recording, which keeps the inputs for the neural net easy to handle. The features are listed here and described and computed below:

- $T_{1A}$ : Analytically derived time constant for the rise time of the most significant event in the frame.
- $T_{2A}$ : Analytically derived time constant for the fall time of the most significant event in the frame.
- $G_{1H}$ : Attack-gain setting.
- $G_{2H}$ : Sustain-gain setting.
- $F_e$ : Spectral flatness.
- $C_f$ : Crest Factor.
- $S_c$ : Spectral centroid.
- $BPM$ : Beats per minute (beat detection).
- $P_{band}$ : 4-band EQ, multiplication in the spectrum.

*Abbreviations:*  
*$H$* : human-generated  
*$A$* : analytically generated  

The audiofeatures computed below are purposefully **NOT** pythonic, because they have to be ported into C/C++ for an embedded system and writing them C-like makes that process easier, as we do not rely on magic library functions.
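
The framing scheme described above can be sketched as follows. This is a minimal, C-style sketch under assumptions: `frame_rms`, `mean_feature_over_frames`, and `FRAME_LEN_S` are hypothetical names that do not appear in the project, and RMS merely stands in for any scalar per-frame feature.

```python
import numpy as np

FRAME_LEN_S = 1  # frame length in seconds (hypothetical name for this sketch)

def frame_rms(sig, start, stop):
    """Example scalar feature: RMS of the samples sig[start:stop]."""
    acc = 0.0
    for i in range(start, stop):
        acc += sig[i] * sig[i]
    return (acc / (stop - start)) ** 0.5

def mean_feature_over_frames(sig, fs, feature):
    """Evaluate a scalar feature on every full frame and average the results."""
    frame_size = FRAME_LEN_S * fs
    n_frames = len(sig) // frame_size  # residual samples are ignored
    if n_frames == 0:
        return 0.0  # recording shorter than one frame
    total = 0.0
    for k in range(n_frames):
        total += feature(sig, k * frame_size, (k + 1) * frame_size)
    return total / n_frames

# Tiny synthetic "recording": two full frames plus three residual samples.
fs = 8  # unrealistically low rate, only to keep the example small
sig = np.concatenate([np.full(fs, 0.5), np.full(fs, 1.0), np.full(3, 9.0)])
print(mean_feature_over_frames(sig, fs, frame_rms))
```

The two full frames have an RMS of 0.5 and 1.0, so the recording-level value is their mean; the three residual samples do not contribute.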

In [1]:
"""
General imports and loading of the audio file for development and explorative research.
The audio signal is normalized in the time domain to its highest value.
"""

import numpy as np
import matplotlib.pyplot as plt
import os
import scipy as sp
import scipy.io.wavfile  # makes sp.io.wavfile available explicitly
%run transient_shaper_lib.ipynb

SAMPLE_LENGTH = 10

def read_audio_files(directory):
  """
  Reads all .wav files from a directory and returns their data with file name labels.

  Args:
      directory: The directory path (string).

  Returns:
      A list of tuples, where each tuple contains:
          - The audio data as a NumPy array (first channel, peak-normalized).
          - The filename (without extension).
          - The sample rate in Hz.
  """

  audio_data_list = []
  for filename in os.listdir(directory):
    if filename.endswith(".wav"):  # Check for .wav files
        filepath = os.path.join(directory, filename)
        sample_rate, audio_data = sp.io.wavfile.read(filepath)
        if audio_data.ndim > 1:  # Multi-channel file: keep only the first channel
          audio_data = audio_data[:, 0]
        n_bits = 32  # Assuming 32-bit audio
        audio_data = audio_data / (2**(n_bits - 1))  # Convert to float, range -1 to 1
        peak = np.max(np.abs(audio_data))  # Largest absolute sample value
        if peak > 0:  # Skip normalization for silent files
          audio_data /= peak
        audio_data = audio_data[: sample_rate * SAMPLE_LENGTH]
        label = os.path.splitext(filename)[0]  # Extract filename without extension
        audio_data_list.append((audio_data, label, sample_rate))

  return audio_data_list

audio_data_with_labels = read_audio_files(os.getcwd())

print("Audio data and labels:")
for audio_data, label, sample_rate in audio_data_with_labels:
  print(f"- Label: {label}, Audio data shape: {audio_data.shape}, Sample rate: {sample_rate}")

FRAME_LEN = 1   # in s




Audio data and labels:
- Label: 1 - looperman-l-5151565-0354397-spicy-drums, Audio data shape: (441000,), Sample rate: 44100
- Label: 10 - 484656__yellowtree__gloomy-guitar-loop, Audio data shape: (480000,), Sample rate: 48000
- Label: 11 - Vocal A, Audio data shape: (351744,), Sample rate: 44100
- Label: 12 - Vocal B, Audio data shape: (441000,), Sample rate: 44100
- Label: 13 - 20240208_Eli Preiss - Alles (und nichts) ｜ A COLORS SHOW, Audio data shape: (480000,), Sample rate: 48000
- Label: 2 - looperman-l-2379402-0354276-aftershock-hard-trap-drums-x-808-x-percs-kb, Audio data shape: (441000,), Sample rate: 44100
- Label: 3 - looperman-l-3066414-0354301-boom-bap-classic-hip-hop-drums, Audio data shape: (441000,), Sample rate: 44100
- Label: 4 - 244392__insidebeat__hip-hop-3-mpc500, Audio data shape: (348923,), Sample rate: 44100
- Label: 5 - 345289__50fps__4-beat-14-upbeat, Audio data shape: (195049,), Sample rate: 44100
- Label: 6 - 367962__trngle__175bpm-db-drum-sequence, Audio dat

In [2]:
# Call the C implementation (AFInC) through ctypes
from ctypes import CDLL, c_double

# Load the shared library
lib = CDLL("./AFInC.so")  # Adjust path accordingly

# Declare the argument and return types of the C functions
lib.BeatDetectionInit.argtypes = []
lib.BeatDetectionInit.restype = None

lib.AFInCAppend.argtypes = [c_double]
lib.AFInCAppend.restype = None

lib.AFInCProcess.argtypes = []
lib.AFInCProcess.restype = None

lib.getTempo.argtypes = []
lib.getTempo.restype = c_double

lib.BeatDetectionInit()

print(audio_data_with_labels[10][1])
for sample in audio_data_with_labels[10][0]:
    lib.AFInCAppend(sample)

lib.AFInCProcess()

result = lib.getTempo()
print(result)  # Tempo estimate returned by the C library

7 - 330744__alonnaallen__90s-beat-loop-140bpm
93.53789592760181


# $T_{1A}$ & $T_{2A}$
$T_{1A}$ and $T_{2A}$ are the time constants describing the duration of the attack ($T_1$) and the release ($T_2$). We define $T_1$ as the time the smoothed envelope $x_e$ takes to rise from the minimum found within a search interval before the frame's peak up to that peak. $x_e$ is obtained by applying cascaded exponential envelope filters to the audio signal; the filter parameters are listed in the code below. $T_2$ is the time between the frame's peak and the minimum found within the search interval after it. Once $T_1$ and $T_2$ have been found for every frame, averaging reduces them to one scalar value each per audio file.
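
To make the idea of cascaded exponential envelope filters concrete, here is a minimal sketch of one common form: a chain of one-pole smoothers with separate attack and release coefficients. It is an assumption about how such a filter may look, not the project's `ExpSmooth` class, whose internals are not shown in this notebook. `OnePoleEnvelope` and `cascaded_envelope` are hypothetical names, and rectifying the input with `abs` is also an assumption.

```python
import math

class OnePoleEnvelope:
    """Hypothetical single smoothing stage (not the project's ExpSmooth)."""

    def __init__(self, fs, attack_ms, release_ms):
        # Per-sample coefficient for a time constant tau: exp(-1 / (tau * fs))
        self.a_att = math.exp(-1.0 / (attack_ms * 0.001 * fs))
        self.a_rel = math.exp(-1.0 / (release_ms * 0.001 * fs))
        self.state = 0.0

    def process(self, x):
        # Use the fast coefficient while the input is above the state, else the slow one
        coeff = self.a_att if x > self.state else self.a_rel
        self.state = coeff * self.state + (1.0 - coeff) * x
        return self.state

def cascaded_envelope(sig, fs, order, attack_ms, release_ms):
    """Run the rectified signal through `order` identical stages in series."""
    stages = [OnePoleEnvelope(fs, attack_ms, release_ms) for _ in range(order)]
    env = [0.0] * len(sig)
    for i in range(len(sig)):
        y = abs(sig[i])  # rectify (assumption of this sketch)
        for stage in stages:
            y = stage.process(y)
        env[i] = y
    return env

# A 100 ms burst followed by silence, sampled at 1 kHz to keep the example small.
fs = 1000
burst = [1.0] * 100 + [0.0] * 900
env = cascaded_envelope(burst, fs, order=4, attack_ms=2, release_ms=200)
print(f"peak {max(env):.3f}, tail {env[-1]:.3f}")
```

The order, attack, and release values mirror `ENV_SMOOTH_ORDER`, `ENV_SMOOTH_ATTACK`, and `ENV_SMOOTH_RELEASE`. With these settings the envelope rises quickly during the burst and decays slowly afterwards, which is what makes the minima around a peak easy to locate.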

In [5]:
"""
Constants for the envelope followers
"""
ENV_SMOOTH_ORDER = 4            # in n
ENV_SMOOTH_ATTACK = 2           # in ms
ENV_SMOOTH_RELEASE = 200        # in ms
EXTREMA_SEARCH_INTERVAL = 4000  # in samples

"""
Apply the envelope followers onto a given signal.
"""
def getEnvelope(sig: np.ndarray, fs: int) -> np.ndarray:
    # ExpSmooth is not defined in this notebook; it is expected to come from transient_shaper_lib.ipynb
    smooth_fast = ExpSmooth(ENV_SMOOTH_ORDER)
    smooth_fast.reset(fs)
    smooth_fast.set_attack(ENV_SMOOTH_ATTACK)
    smooth_fast.set_release(ENV_SMOOTH_RELEASE)
    num_samples = len(sig)
    env_smooth = np.zeros(num_samples)
    # Process sample by sample, as the embedded implementation would
    for i, sample in enumerate(sig):
        env_smooth[i] = smooth_fast.process(sample)
    return env_smooth

"""
Get the index of the peak value of a given signal within a range of samples.
"""
def getIdxOfMax(sig: np.ndarray, from_idx: int, to_idx: int) -> int:
    idx_max = from_idx
    for i in range(from_idx, to_idx):
        if(sig[i] > sig[idx_max]):
            idx_max = i
    return idx_max


"""
Get the index of the smallest value of a given signal within a range of samples.
"""
def getIdxOfMin(sig: np.ndarray, from_idx: int, to_idx: int) -> int:
    idx_min = from_idx
    for i in range(from_idx, to_idx):
        if(sig[i] < sig[idx_min]):
            idx_min = i
    return idx_min

"""
Calculate T_1 and T_2 per frame and return them as both mean values and one value per frame.
"""
def getTA(sig: np.ndarray, search_interval: int, fs: int) -> tuple:
    T1As = []
    T2As = []
    num_samples = len(sig)
    frame_size = FRAME_LEN * fs
    frames = num_samples // frame_size # ignore residual samples which don't form a full frame
    if frames == 0:
        return 0.0, T1As, 0.0, T2As  # recording shorter than one frame
    for i in range(0, frames):
        l_bound = i*frame_size
        u_bound = (i+1)*frame_size-1
        idx_max = getIdxOfMax(sig, l_bound, u_bound)
        # Search for the minimum before the peak, clamped to the start of the frame
        if idx_max - search_interval < l_bound:
            start = l_bound
        else:
            start = idx_max - search_interval
        idx_min_pre = getIdxOfMin(sig, start, idx_max)

        # Search for the minimum after the peak, clamped to the end of the frame
        if idx_max + search_interval > u_bound:
            stop = u_bound
        else:
            stop = idx_max + search_interval
        idx_min_post = getIdxOfMin(sig, idx_max, stop)  # search the passed-in signal
        T1As.append((idx_max - idx_min_pre) / fs)
        T2As.append((idx_min_post - idx_max) / fs)
    T1A = np.sum(np.asarray(T1As))
    T2A = np.sum(np.asarray(T2As))
    # Mean and per-frame values for T_1, then mean and per-frame values for T_2
    return T1A / frames, T1As, T2A / frames, T2As

for audio_data, label, sample_rate in audio_data_with_labels:
  env_smooth = getEnvelope(audio_data, sample_rate)
  T1A, T1As, T2A, T2As = getTA(env_smooth, EXTREMA_SEARCH_INTERVAL, sample_rate)
  print(f"- Label: {label}, Attack time (mean): {T1A:.5f} s, Release time (mean): {T2A:.5f} s")


- Label: 1 - looperman-l-5151565-0354397-spicy-drums, Attack time (mean): 0.00964 s, Release time (mean): 0.09068 s
- Label: 10 - 484656__yellowtree__gloomy-guitar-loop, Attack time (mean): 0.01117 s, Release time (mean): 0.02496 s
- Label: 11 - Vocal A, Attack time (mean): 0.05912 s, Release time (mean): 0.05810 s
- Label: 12 - Vocal B, Attack time (mean): 0.08676 s, Release time (mean): 0.09068 s
- Label: 13 - 20240208_Eli Preiss - Alles (und nichts) ｜ A COLORS SHOW, Attack time (mean): 0.06098 s, Release time (mean): 0.08331 s
- Label: 2 - looperman-l-2379402-0354276-aftershock-hard-trap-drums-x-808-x-percs-kb, Attack time (mean): 0.04213 s, Release time (mean): 0.07249 s
- Label: 3 - looperman-l-3066414-0354301-boom-bap-classic-hip-hop-drums, Attack time (mean): 0.02308 s, Release time (mean): 0.09068 s
- Label: 4 - 244392__insidebeat__hip-hop-3-mpc500, Attack time (mean): 0.03937 s, Release time (mean): 0.09068 s
- Label: 5 - 345289__50fps__4-beat-14-upbeat, Attack time (mean): 0.

# $G_{1H}$ & $G_{2H}$
$G_{1H}$ and $G_{2H}$ represent the given gain value for the attack or release control voltage in the AI-TD. This is a human parameter and can't be derived from the signal, as it is purely determined by taste, hence the $H$ for "human". In the embedded program, this value is derived from the position of potentiometers that the user dialed in.

# $F_e$
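Tonality describes how "tonal" the audio is. For now, we treat this feature as the complement of the spectral flatness: a signal with a spectral flatness of 0.7 has a tonality of 0.3. Note that the code below currently returns the spectral flatness itself, in dB, rather than this complement.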

Other ideas include:
- [link](https://community.sw.siemens.com/s/article/Tonality) for definitions
- [link](https://github.com/cocosci/pam-nac) for a python implementation (looks messy)
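
The "complement of spectral flatness" definition of tonality can be sketched as follows. This is a minimal sketch under assumptions: it uses the linear flatness in the range 0 to 1 (not the dB value), flooring very small bins at `1e-20` is a choice made only for this example, and `spectral_flatness_linear` and `tonality` are hypothetical names.

```python
import numpy as np

def spectral_flatness_linear(sig):
    """Geometric mean divided by arithmetic mean of the power spectrum (0..1)."""
    power = np.abs(np.fft.rfft(sig)) ** 2
    n = len(power)
    arithmetic_mean = 0.0
    sum_of_logs = 0.0
    for i in range(n):
        arithmetic_mean += power[i]
        # Floor tiny bins instead of skipping them (assumption of this sketch)
        sum_of_logs += np.log(max(power[i], 1e-20))
    arithmetic_mean /= n
    if arithmetic_mean == 0.0:
        return 0.0  # silent input: flatness is undefined, report 0 here
    return np.exp(sum_of_logs / n) / arithmetic_mean

def tonality(sig):
    """Tonality as the complement of the linear spectral flatness."""
    return 1.0 - spectral_flatness_linear(sig)

fs = 8000
t = np.arange(fs) / fs
sine = np.sin(2 * np.pi * 440 * t)                     # strongly tonal
noise = np.random.default_rng(0).standard_normal(fs)   # noise-like
print(f"sine: {tonality(sine):.3f}, noise: {tonality(noise):.3f}")
```

A pure sine concentrates its power in one bin, so its flatness is close to 0 and its tonality close to 1. White noise spreads power across all bins and ends up with a clearly lower tonality.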


In [6]:
# TODO: inspect the iterative rolling approach

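def getSpectralFlatness(sig, fs: int):
    # Power spectrum of the whole signal (fs is currently unused)
    spectrum = np.fft.fft(sig)
    power_spectrum = np.abs(spectrum)**2
    arithmetic_mean = 0
    sum_of_logs = 0

    # Accumulate the sum for the arithmetic mean and the log-sum for the geometric mean
    for i in range(len(power_spectrum)):
        arithmetic_mean += power_spectrum[i]
        if power_spectrum[i] > 0:  # Avoid log(0)
            sum_of_logs += np.log(power_spectrum[i])

    arithmetic_mean /= len(power_spectrum)
    geometric_mean = np.exp(sum_of_logs / len(power_spectrum))

    if arithmetic_mean == 0:
        return np.inf

    # Flatness = geometric mean / arithmetic mean, reported in dB
    # Note: for a ratio of powers, 10 * log10 is the usual dB convention
    flatness = geometric_mean / arithmetic_mean
    return 20 * np.log10(flatness)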

for audio_data, label, sample_rate in audio_data_with_labels:
  print(f"- Label: {label}, Spectral Flatness: {getSpectralFlatness(audio_data, sample_rate):.2f}")

- Label: 1 - looperman-l-5151565-0354397-spicy-drums, Spectral Flatness: -41.88
- Label: 10 - 484656__yellowtree__gloomy-guitar-loop, Spectral Flatness: -70.72
- Label: 11 - Vocal A, Spectral Flatness: -48.97
- Label: 12 - Vocal B, Spectral Flatness: -56.01
- Label: 13 - 20240208_Eli Preiss - Alles (und nichts) ｜ A COLORS SHOW, Spectral Flatness: -82.33
- Label: 2 - looperman-l-2379402-0354276-aftershock-hard-trap-drums-x-808-x-percs-kb, Spectral Flatness: -67.40
- Label: 3 - looperman-l-3066414-0354301-boom-bap-classic-hip-hop-drums, Spectral Flatness: -59.15
- Label: 4 - 244392__insidebeat__hip-hop-3-mpc500, Spectral Flatness: -50.30
- Label: 5 - 345289__50fps__4-beat-14-upbeat, Spectral Flatness: -42.75
- Label: 6 - 367962__trngle__175bpm-db-drum-sequence, Spectral Flatness: -14.62
- Label: 7 - 330744__alonnaallen__90s-beat-loop-140bpm, Spectral Flatness: -33.79
- Label: 8 - 652462__yellowtree__midwest-clean-guitar, Spectral Flatness: -87.09
- Label: 9 - 584282__yellowtr

# $C_f$
The crest factor measures the dynamic variation of a signal as the ratio of its peak absolute amplitude to its RMS value. Idea for later: combine the number of peaks and valleys with their deltas.
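
As a sanity check for the peak-to-RMS ratio computed in the next cell, two reference signals are useful: a sine wave has a crest factor of $\sqrt{2}$ and a square wave a crest factor of 1. The sketch below is a compact stand-in written for this check; `crest_factor` is a hypothetical name, not the notebook's function.

```python
import math

def crest_factor(buffer):
    """Peak absolute value divided by RMS; returns 0.0 for empty or silent input."""
    n_samples = len(buffer)
    if n_samples == 0:
        return 0.0
    peak = 0.0
    sum_sq = 0.0
    for x in buffer:
        if abs(x) > peak:
            peak = abs(x)
        sum_sq += x * x
    rms = math.sqrt(sum_sq / n_samples)
    if rms == 0.0:
        return 0.0
    return peak / rms

# Five full periods of a sine and the matching square wave
n = 1000
sine = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
square = [1.0 if s >= 0 else -1.0 for s in sine]
print(crest_factor(sine))    # close to sqrt(2)
print(crest_factor(square))  # 1.0
```

Percussive material with short, high peaks therefore scores well above these reference values, while heavily compressed material moves towards 1.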


In [7]:
def calculate_crest_factor(buffer, fs: int):
    """Calculates the crest factor of a given audio buffer.

    Args:
        buffer: A list or array containing audio samples.
        fs: The sampling rate of the audio data (in Hz); currently unused.

    Returns:
        The crest factor as a float.
    """

    max_sample = 0.0
    rms = 0.0
    n_samples = len(buffer)

    # Find the maximum absolute sample value
    for i in range(n_samples):
        abs_sample = abs(buffer[i])
        if abs_sample > max_sample:
            max_sample = abs_sample

    # Calculate RMS (Root Mean Square)
    for i in range(n_samples):
        rms += buffer[i] * buffer[i]
    rms = rms / n_samples  # Divide by number of samples
    rms = pow(rms, 0.5)   # Take the square root

    # Avoid division by zero
    if rms == 0.0:
        return 0.0

    crest_factor = max_sample / rms
    return crest_factor

for audio_data, label, sample_rate in audio_data_with_labels:
  print(f"- Label: {label}, Crest Factor: {calculate_crest_factor(audio_data, sample_rate):.2f}")


- Label: 1 - looperman-l-5151565-0354397-spicy-drums, Crest Factor: 7.36
- Label: 10 - 484656__yellowtree__gloomy-guitar-loop, Crest Factor: 3.85
- Label: 11 - Vocal A, Crest Factor: 3.71
- Label: 12 - Vocal B, Crest Factor: 4.02
- Label: 13 - 20240208_Eli Preiss - Alles (und nichts) ｜ A COLORS SHOW, Crest Factor: 4.09
- Label: 2 - looperman-l-2379402-0354276-aftershock-hard-trap-drums-x-808-x-percs-kb, Crest Factor: 1.86
- Label: 3 - looperman-l-3066414-0354301-boom-bap-classic-hip-hop-drums, Crest Factor: 4.66
- Label: 4 - 244392__insidebeat__hip-hop-3-mpc500, Crest Factor: 6.98
- Label: 5 - 345289__50fps__4-beat-14-upbeat, Crest Factor: 3.54
- Label: 6 - 367962__trngle__175bpm-db-drum-sequence, Crest Factor: 4.54
- Label: 7 - 330744__alonnaallen__90s-beat-loop-140bpm, Crest Factor: 6.30
- Label: 8 - 652462__yellowtree__midwest-clean-guitar, Crest Factor: 3.78
- Label: 9 - 584282__yellowtree__clean-guitar-loop, Crest Factor: 6.10
- Label: bass, Crest Factor: 4.71
- Label: congas_82_2

# $S_c$
The spectral centroid is the magnitude-weighted mean frequency of the spectrum, in Hz.

In [8]:
def calculate_spectral_centroid(buffer, fs):
  """Calculates the spectral centroid of a given audio buffer.

  Args:
      buffer: A list or array containing audio samples.
      fs: The sampling rate of the audio data (in Hz).

  Returns:
      The spectral centroid as a float (in Hz).
  """

  fft = np.fft.rfft(buffer)
  fft_abs = np.abs(fft)
  fft_freqs = np.fft.rfftfreq(len(buffer), d=1/fs)  # Keep only positive frequencies

  # Calculate weighted mean frequency
  numerator = 0.0
  denominator = 0.0
  for i in range(len(fft_freqs)):
      numerator += fft_freqs[i] * fft_abs[i]
      denominator += fft_abs[i]

  # Avoid division by zero
  if denominator == 0.0:
      return 0.0

  centroid = numerator / denominator
  return centroid

for audio_data, label, sample_rate in audio_data_with_labels:
  print(f"- Label: {label}, Spectral Centroid: {calculate_spectral_centroid(audio_data, sample_rate):.2f}")


- Label: 1 - looperman-l-5151565-0354397-spicy-drums, Spectral Centroid: 3353.50
- Label: 10 - 484656__yellowtree__gloomy-guitar-loop, Spectral Centroid: 1807.38
- Label: 11 - Vocal A, Spectral Centroid: 4500.27
- Label: 12 - Vocal B, Spectral Centroid: 4030.68
- Label: 13 - 20240208_Eli Preiss - Alles (und nichts) ｜ A COLORS SHOW, Spectral Centroid: 1301.27
- Label: 2 - looperman-l-2379402-0354276-aftershock-hard-trap-drums-x-808-x-percs-kb, Spectral Centroid: 3538.25
- Label: 3 - looperman-l-3066414-0354301-boom-bap-classic-hip-hop-drums, Spectral Centroid: 2693.29
- Label: 4 - 244392__insidebeat__hip-hop-3-mpc500, Spectral Centroid: 2634.09
- Label: 5 - 345289__50fps__4-beat-14-upbeat, Spectral Centroid: 5944.02
- Label: 6 - 367962__trngle__175bpm-db-drum-sequence, Spectral Centroid: 8200.64
- Label: 7 - 330744__alonnaallen__90s-beat-loop-140bpm, Spectral Centroid: 5415.12
- Label: 8 - 652462__yellowtree__midwest-clean-guitar, Spectral Centroid: 1545.31
- Label: 9 - 584282__yellowtr