
# **Normalization for audio analysis**

Normalization in audio analysis refers to the process of adjusting audio signals or feature vectors to a standard scale or range. The goal is to make the audio data more manageable, comparable, and suitable for various analysis tasks. Normalization techniques can be applied to the entire audio signal or specific features extracted from the audio.

The appropriate normalization method depends on the analysis objectives, the characteristics of the audio data, and the desired outcome.

In [None]:
# Audio data preprocessing
# If librosa is not installed, run: pip install librosa
import librosa
import numpy as np

In [None]:
# Specify the path to the audio file
audio_path = "D:\\RACE\\Speech Analytics\\Session 1\\Data\\session1_violin-origional.wav"

# Load the audio file (by default librosa resamples to 22050 Hz and converts to mono)
audio, sr = librosa.load(audio_path)


**Min-Max normalization**

Min-max normalization, a form of feature scaling, is a common technique used in audio analysis to normalize audio feature vectors within a specific range. This normalization method scales the feature values to fit within a predetermined minimum and maximum value, typically 0 and 1, respectively. All feature values are adjusted proportionally, so their relative order is preserved.

The process of min-max normalization can be described using mathematical notation as follows:

    Find the minimum and maximum values in the feature vector:
        Let v = [v1, v2, ..., vn] be the feature vector of length n.
        Find the minimum value min(v) and the maximum value max(v) within the feature vector.

    Normalize the feature vector using the min-max formula:
        Let u = [u1, u2, ..., un] be the normalized feature vector.
        Compute each normalized component ui as:
        ui = (vi - min(v)) / (max(v) - min(v)).

By applying the min-max normalization, the feature values will be scaled linearly to the range between 0 and 1. Values that were originally equal to the minimum value will be transformed to 0, and values that were equal to the maximum value will be transformed to 1. All other values will be proportionally adjusted within this range based on their original values.

Min-max normalization is useful in audio analysis for various reasons, such as ensuring that features with different scales or ranges are comparable, preparing data for machine learning algorithms that require normalized inputs, or facilitating visualizations of feature distributions.

In [None]:
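def min_max_normalization(audio):
    min_val = np.min(audio)
    max_val = np.max(audio)
    # Guard against a constant signal, where max - min would be zero
    if max_val == min_val:
        return np.zeros_like(audio)
    normalized_audio = (audio - min_val) / (max_val - min_val)
    return normalized_audio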

In [None]:
mmn_data = min_max_normalization(audio)
print(mmn_data[:10])

[0.42699417 0.42719498 0.42727232 0.42714185 0.42698434 0.42713115
 0.42727783 0.42729017 0.4273099  0.42720184]


**Z-score normalization**

Z-score normalization, also known as standardization, is a widely used technique in audio analysis to normalize audio feature vectors by transforming them to have zero mean and unit variance. This normalization method ensures that the feature values are centered around 0 and have a consistent scale, making them suitable for various statistical analyses.

The process of z-score normalization can be described using mathematical notation as follows:

    Calculate the mean and standard deviation of the feature vector:
        Let v = [v1, v2, ..., vn] be the feature vector of length n.
        Compute the mean μ and standard deviation σ of the feature vector v:
        μ = mean(v)
        σ = std(v)

    Normalize the feature vector using the z-score formula:
        Let u = [u1, u2, ..., un] be the normalized feature vector.
        Compute each normalized component ui as:
        ui = (vi - μ) / σ.

By applying the z-score normalization, the feature values will be transformed to have a mean of 0 and a standard deviation of 1. Values that were originally above the mean will have positive z-scores, and values below the mean will have negative z-scores. The magnitude of the z-score indicates the number of standard deviations the original value is from the mean.

Z-score normalization is advantageous in audio analysis as it facilitates comparison and analysis across different features with varying scales and distributions. It allows for meaningful interpretation of feature values based on their deviation from the mean. Moreover, it can be particularly useful in machine learning algorithms where standardized inputs can improve convergence and model performance.

In [None]:
def z_score_normalization(audio):
    mean = np.mean(audio)
    std = np.std(audio)
    normalized_audio = (audio - mean) / std
    return normalized_audio

In [None]:
zsn_data = z_score_normalization(audio)
print(zsn_data[:10])

[0.00098683 0.00339753 0.00432611 0.00275946 0.00086878 0.00263122
 0.00439207 0.0045404  0.00477674 0.00347998]


**Peak amplitude normalization**

Peak amplitude normalization is a technique used in audio analysis to normalize the amplitude or level of an audio signal to a specified maximum peak value. The goal is to ensure that the audio signal reaches its maximum dynamic range without distortion or clipping. This normalization method scales the entire audio signal by a factor that brings the highest peak to the desired level.

The process of peak amplitude normalization can be described using mathematical notation as follows:

    Find the maximum absolute value in the audio signal:
        Let x(t) be the audio signal at time t.
        Find the maximum absolute value max(|x(t)|) within the signal.

    Normalize the audio signal by scaling it to the desired peak level:
        Let y(t) be the normalized audio signal.
        Compute the normalization factor f as the ratio between the desired peak level L_peak and the maximum absolute value in the signal:
        f = L_peak / max(|x(t)|).
        Normalize the audio signal by multiplying it by the factor f:
        y(t) = x(t) * f.

By performing peak amplitude normalization, the highest peak in the audio signal is scaled to the desired level, while maintaining the relative amplitudes and dynamics of the original signal. The remaining samples in the signal are proportionally adjusted based on the scaling factor.

Peak amplitude normalization is commonly used in audio production, mastering, and broadcasting to ensure that audio signals are optimized for playback on different systems while maintaining their original characteristics. It helps avoid distortion and ensures that the audio signal makes full use of the available dynamic range.
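The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `peak_normalization`, the `target_peak` parameter (standing in for L_peak), and the example array are assumptions introduced here.

```python
import numpy as np

def peak_normalization(audio, target_peak=1.0):
    """Scale `audio` so that its largest absolute sample equals `target_peak`."""
    peak = np.max(np.abs(audio))          # max(|x(t)|)
    if peak == 0:
        # A silent signal has no peak to scale; return it unchanged
        return audio.copy()
    factor = target_peak / peak           # f = L_peak / max(|x(t)|)
    return audio * factor                 # y(t) = x(t) * f

# Hypothetical example signal, used only for illustration
peak_example = np.array([0.1, -0.5, 0.25, 0.05])
peak_normalized = peak_normalization(peak_example, target_peak=0.9)
print(peak_normalized)
```

Because every sample is multiplied by the same factor, the ratios between samples stay the same; only the overall level changes.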

**RMS (Root Mean Square) normalization**

RMS (Root Mean Square) normalization is a technique used in audio analysis to normalize audio signals based on their average power level. It scales the audio signal by dividing it by the square root of the mean of the squared values of the samples. The purpose of RMS normalization is to bring different audio signals to a similar average power level, making them more comparable and suitable for analysis.

The process of RMS normalization can be described using mathematical notation as follows:

    Calculate the RMS value of the audio signal:
        Let x(t) be the audio signal at time t.
        Compute the RMS value RMS(x) as the square root of the mean of the squared values of the samples:
        RMS(x) = sqrt(mean(x(t)^2)).

    Normalize the audio signal by dividing it by the RMS value:
        Let y(t) be the normalized audio signal.
        Normalize the audio signal by dividing each sample by the RMS value:
        y(t) = x(t) / RMS(x).

By applying RMS normalization, the audio signal is scaled to have an RMS value, and therefore an average power, of 1. The normalized signal keeps the waveform shape of the original but has a consistent level across different signals. Multiplying the result by a target RMS value sets a different level.

RMS normalization is useful in audio analysis tasks where the average power or energy of the signal is of interest. It can be particularly beneficial when comparing or combining audio signals, such as in audio mixing, speech recognition, or feature extraction, where it is important to account for differences in signal levels and energy.

In [None]:
def rms_normalization(audio, target_rms):
    rms = np.sqrt(np.mean(audio ** 2))
    normalized_audio = (audio / rms) * target_rms
    return normalized_audio

In [None]:
target_rms = 0.9
rmsn_data = rms_normalization(audio, target_rms)
print(rmsn_data)

[ 4.0330013e-04  2.5729311e-03  3.4086527e-03 ... -1.0171641e+00
 -1.0378585e+00 -1.0789999e+00]


**Energy Normalization**

In audio analysis, energy is a fundamental concept that quantifies the magnitude or intensity of a signal over time. It provides information about the overall "strength" or "loudness" of the signal. Mathematically, energy is calculated as the sum of the squared amplitude values of the signal within a given time frame.

Let's consider an example of a simple audio signal represented as a discrete-time sequence, denoted as x[n]. Each sample of the signal, x[n], represents the amplitude at a particular point in time.

For a given time frame or segment of the signal, the energy can be calculated using the following equation:

E(x) = ∑(|x[n]|^2)

where E(x) represents the energy of the signal x[n], and the summation ∑ is taken over all samples within the time frame.

Let's illustrate this with a concrete example. Suppose we have a 1-second audio signal with a sampling rate of 44100 Hz (meaning there are 44100 samples per second). We'll consider a small segment of this signal, consisting of 100 samples.

x = [0.1, 0.2, 0.3, ..., 0.1]

To calculate the energy of this segment, we square each sample and sum them up:

E(x) = (0.1^2 + 0.2^2 + 0.3^2 + ... + 0.1^2)

E(x) = 0.01 + 0.04 + 0.09 + ... + 0.01

E(x) = 0.55

The resulting value, 0.55, represents the energy of the signal within that specific time frame. A higher energy value indicates a louder or more intense signal, while a lower value suggests a quieter or less intense signal.

Energy is a crucial metric in audio analysis and has various applications. For instance, it can be used to detect changes in the overall loudness of a signal, identify significant events or transitions, or determine the presence of particular sound patterns or characteristics. Additionally, energy-based features, such as the short-term energy or energy contour, are commonly used in tasks like speech recognition, music analysis, and sound classification.
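As a rough sketch of the short-term energy mentioned above, the signal can be cut into overlapping frames and the energy formula applied to each one. The function name, the frame and hop lengths, and the synthetic two-level tone are all assumptions chosen for illustration.

```python
import numpy as np

def short_term_energy(signal, frame_length=1024, hop_length=512):
    """Return the energy (sum of squared samples) of each frame."""
    energies = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        energies.append(np.sum(np.abs(frame) ** 2))   # E(x) = sum |x[n]|^2
    return np.array(energies)

# Hypothetical test signal: one second of a quiet tone followed by a louder one
sr_demo = 8000
t = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 440 * t)
demo_signal = np.concatenate([0.1 * tone, 0.8 * tone])

frame_energies = short_term_energy(demo_signal)
print(frame_energies[0], frame_energies[-1])
```

The frames taken from the louder half have a much larger energy than those from the quiet half, which is the kind of change in overall loudness an energy contour makes visible.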

Energy normalization is a technique used in audio analysis to adjust the energy levels of audio signals. It aims to make different audio signals comparable by removing variations in their overall energy. By normalizing the energy, the focus can be shifted to other aspects of the audio signal, such as the spectral content or temporal patterns.

Mathematically, energy normalization involves scaling the audio signal such that its energy reaches a desired level. The energy of a discrete-time signal x[n] is typically calculated as the sum of its squared samples over a given time frame. Let's denote the energy of the signal as E(x):

E(x) = ∑(|x[n]|^2)

where n represents the sample index.

To normalize the energy of an audio signal, we can perform the following steps:

    Calculate the energy of the input signal, E(x).

    Determine the desired energy level, which is often set to a fixed value such as 1 or some other reference level.

    Calculate the scaling factor. Because energy grows with the square of the amplitude, the factor is the square root of the ratio between the desired energy level and the actual energy of the signal:

    scaling_factor = sqrt(desired_energy / E(x))

    Multiply each sample of the input signal by the scaling factor to normalize its energy:

    normalized_signal[n] = scaling_factor * x[n]

After energy normalization, the normalized signal will have a consistent energy level, which facilitates comparisons and analysis across different audio signals. It is important to note that energy normalization only adjusts the overall energy level and does not alter the relative amplitudes or shape of the signal.

It's worth mentioning that there are different variations and considerations in energy normalization depending on the specific context or application. For example, in some cases, a logarithmic scale may be used to normalize the energy, such as applying a logarithmic transformation to the signal before calculating the energy and scaling factors. Additionally, the normalization can be performed over different time frames, such as short-term or long-term normalization, depending on the analysis requirements.

In [None]:
def energy_normalization(audio, target_energy):
    energy = np.sum(audio ** 2)
    normalized_audio = (audio / np.sqrt(energy)) * np.sqrt(target_energy)
    return normalized_audio

In [None]:
energy = np.sum(audio ** 2)
target_energy = 0.9 * energy

In [None]:
# Pass the audio signal itself (not its scalar energy) to the normalizer
enrgn = energy_normalization(audio, target_energy)
print(enrgn)


**Logarithmic normalization**

Logarithmic normalization, also known as logarithmic scaling or logarithmic compression, is a technique used in audio analysis to adjust the dynamic range of a signal. It maps the original signal to a new representation using a logarithmic function, which compresses the range of amplitudes. This normalization method is particularly useful when dealing with signals that have a wide range of amplitudes, such as audio signals that contain both soft and loud sounds.

The goal of logarithmic normalization is to improve the perceptual representation of the signal by reducing the perceived differences between low-amplitude and high-amplitude portions. It helps to preserve the details in soft sounds while preventing clipping or distortion in loud sounds.

Mathematically, logarithmic normalization is achieved by applying a logarithmic function to the original signal. The logarithm function compresses the dynamic range by scaling down higher values more than lower values. One commonly used logarithmic function is the decibel (dB) scale, which is a logarithmic scale commonly used in audio and acoustics. The decibel scale is defined as:

dB = 20 * log10(x / x0)

where x represents the original signal amplitude, and x0 is a reference amplitude or threshold value.
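The decibel formula can be tried out directly with NumPy. In this small sketch the function name `amplitude_to_db`, the `reference` argument (playing the role of x0), and the `floor` used to avoid taking the logarithm of zero are assumptions made for illustration.

```python
import numpy as np

def amplitude_to_db(amplitude, reference=1.0, floor=1e-10):
    """Convert linear amplitudes to decibels relative to `reference`."""
    # Clamp very small magnitudes so that log10(0) is never evaluated
    magnitude = np.maximum(np.abs(amplitude), floor)
    return 20.0 * np.log10(magnitude / reference)

levels = np.array([1.0, 0.5, 0.1, 0.01])
db_levels = amplitude_to_db(levels)
print(db_levels)
```

Halving the amplitude lowers the level by about 6 dB, and each factor of 10 in amplitude corresponds to 20 dB, which shows how the logarithm compresses a wide amplitude range into a small numeric range.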

To perform logarithmic normalization on an audio signal, we typically follow these steps:

    Calculate the magnitude or absolute value of the signal samples, |x[n]|.

    Apply the logarithmic function to each sample, either using the decibel scale or another logarithmic transformation. The choice of the logarithmic function may depend on the specific application or context.

    Optionally, apply a scaling factor or adjust the parameters of the logarithmic function to control the amount of compression and the desired dynamic range.

    The resulting transformed signal represents the logarithmically normalized representation of the original signal.

Logarithmic normalization is commonly used in audio processing tasks such as audio visualization, dynamic range compression, volume control, and audio perception studies. It helps to enhance the perceptual quality of the audio signal and improve the representation of soft and loud sounds in a more balanced manner.

In [None]:
def logarithmic_normalization(audio):
    # log(1 + x) is only defined for x > -1, so samples are assumed to lie in (-1, 1]
    normalized_audio = np.log(1 + audio)
    return normalized_audio

In [None]:
logn = logarithmic_normalization(audio)
print(logn)

[ 3.82654471e-05  2.43753282e-04  3.22885840e-04 ... -1.01334415e-01
 -1.03506498e-01 -1.07838795e-01]
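The function above applies log(1 + x) directly to signed samples, which is undefined for a sample at or below -1. One possible variant, following the magnitude step listed earlier, compresses |x[n]| and then restores the sign. The function name and the `scale` parameter are assumptions for this sketch, not part of the original notebook.

```python
import numpy as np

def signed_log_normalization(audio, scale=1.0):
    """Compress magnitudes with log(1 + scale * |x|) and restore the sign."""
    magnitude = np.abs(audio)                    # step 1: |x[n]|
    compressed = np.log1p(scale * magnitude)     # step 2: logarithmic mapping
    return np.sign(audio) * compressed           # keep the original polarity

# Hypothetical samples covering the full [-1, 1] range
log_samples = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
compressed_samples = signed_log_normalization(log_samples)
print(compressed_samples)
```

With this form, positive and negative samples of equal magnitude are treated symmetrically, and a larger `scale` gives stronger compression.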


**Power normalization**

In audio analysis, power is another important metric that characterizes the strength or intensity of a signal. While energy quantifies the overall magnitude of a signal over a specific time frame, power measures the average rate of energy consumption or distribution over time. It provides information about the signal's strength per unit of time.

Mathematically, power is calculated as the average value of the squared amplitudes of a signal within a given time frame. The formula to calculate power is as follows:

P(x) = (1/N) * ∑(|x[n]|^2)

where P(x) represents the power of the signal x[n], N is the number of samples in the time frame, and the summation ∑ is taken over all samples within the time frame.

The key difference between power and energy lies in their temporal characteristics:

    Energy (E): Energy measures the total accumulated strength or magnitude of a signal over a specific time frame. It is the sum of the squared amplitudes, without dividing by the length of the time frame. Energy is calculated using the formula:

    E(x) = ∑(|x[n]|^2)

    Energy is useful for analyzing the overall strength or loudness of a signal, but it does not provide information about the rate of energy consumption or distribution over time.

    Power (P): Power represents the average rate at which energy is distributed or consumed over time. It accounts for the temporal dimension by dividing the energy by the number of samples N in the time frame. Power is calculated using the formula:

    P(x) = (1/N) * ∑(|x[n]|^2)

    Power is a useful metric for analyzing how the energy of a signal is distributed over time. It provides information about the average intensity or strength of the signal per unit of time.

In summary, energy quantifies the overall magnitude of a signal, while power measures the average rate of energy consumption or distribution over time. Energy represents the total strength of a signal, while power captures the average strength per unit of time. Both energy and power are fundamental metrics in audio analysis and are used for various applications such as loudness measurement, signal processing, and audio feature extraction.
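A small numeric check can make the relationship between the two formulas concrete. The four-sample segment below is a made-up example, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical segment, used only for illustration
segment = np.array([0.1, -0.2, 0.3, -0.4])

energy_demo = np.sum(np.abs(segment) ** 2)    # E(x) = sum |x[n]|^2
power_demo = np.mean(np.abs(segment) ** 2)    # P(x) = (1/N) * sum |x[n]|^2
rms_demo = np.sqrt(power_demo)                # RMS is the square root of the power

print(energy_demo, power_demo, rms_demo)
```

Here the energy is N times the power, so a longer segment of the same signal has more energy but roughly the same power.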

In [None]:
def power_normalization(audio, target_power):
    power = np.mean(audio ** 2)
    normalized_audio = (audio / np.sqrt(power)) * np.sqrt(target_power)
    return normalized_audio

In [None]:
power = np.mean(audio ** 2)
target_power = 0.9*power
pwrn = power_normalization(audio, target_power)
print(pwrn)

[ 3.6249061e-05  2.3125789e-04  3.0637346e-04 ... -9.1423832e-02
 -9.3283854e-02 -9.6981697e-02]


**Piecewise linear normalization**

Piecewise linear normalization, also known as linear scaling or amplitude scaling, is a technique used in audio analysis to adjust the amplitude range of a signal. It involves linearly mapping the amplitude values of the signal from one range to another, effectively stretching or compressing the signal's dynamic range.

The purpose of piecewise linear normalization is to ensure that the signal occupies the full available range of amplitudes, making it easier to analyze or process. It is particularly useful when working with signals that have a limited dynamic range or when the amplitudes are not distributed optimally for a specific application.

To perform piecewise linear normalization on an audio signal, the following steps are typically followed:

    Identify the minimum and maximum amplitude values of the original signal. This can be done by scanning the entire signal or a specific segment of interest.

    Define the desired minimum and maximum amplitudes for the normalized signal. These values determine the range in which the normalized signal will be mapped.

    Calculate the scaling parameters to map the original amplitudes to the desired range. This is done using a linear scaling function:

    normalized_value = (original_value - min_original) * (max_normalized - min_normalized) / (max_original - min_original) + min_normalized

    where original_value is the original amplitude value, min_original and max_original are the minimum and maximum amplitudes of the original signal, min_normalized and max_normalized are the desired minimum and maximum amplitudes for the normalized signal, and normalized_value is the resulting normalized amplitude value.

    Apply the scaling function to each sample of the original signal to obtain the normalized signal.

Piecewise linear normalization allows the signal to be stretched or compressed in a controlled manner, making it occupy the desired range of amplitudes. This normalization technique is commonly used in various audio processing tasks, such as volume normalization, dynamic range compression, and equalization.

By adjusting the amplitude range of the signal, piecewise linear normalization can improve the signal's perceptual quality, facilitate further analysis or processing, and ensure compatibility with other audio signals or systems.
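The linear scaling function from the steps above can be written directly in NumPy. This is a sketch under stated assumptions: the function name `linear_range_normalization`, the default target range of [-1, 1], and the handling of a constant signal are choices made here for illustration.

```python
import numpy as np

def linear_range_normalization(audio, min_normalized=-1.0, max_normalized=1.0):
    """Map samples linearly from [min(audio), max(audio)] to the target range."""
    min_original = np.min(audio)
    max_original = np.max(audio)
    if max_original == min_original:
        # Constant signal: there is no range to stretch, so use the lower bound
        return np.full_like(audio, min_normalized, dtype=float)
    scale = (max_normalized - min_normalized) / (max_original - min_original)
    return (audio - min_original) * scale + min_normalized

# Hypothetical input occupying only part of the available range
range_example = np.array([0.2, 0.5, 0.8])
mapped = linear_range_normalization(range_example)
print(mapped)
```

The smallest input sample lands on `min_normalized`, the largest on `max_normalized`, and everything in between is placed proportionally. With a target range of [0, 1] this is the same as min-max normalization.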

In [None]:
def piecewise_linear_normalization(audio):
    # Scale by the peak absolute value so that all samples fall within [-1, 1]
    max_val = np.max(np.abs(audio))
    normalized_audio = audio / max_val
    return normalized_audio

In [None]:
pln = piecewise_linear_normalization(audio)
print(pln)

[ 6.5139291e-05  4.1556871e-04  5.5055082e-04 ... -1.6428794e-01
 -1.6763040e-01 -1.7427538e-01]


**Exponential normalization**

Exponential normalization, also known as exponential scaling or gain adjustment, is a technique used in audio analysis to modify the amplitude values of a signal using an exponential function. It provides a way to adjust the dynamic range of the signal by applying an exponential gain factor to the amplitudes.

Mathematically, exponential normalization multiplies the original amplitudes by the exponential of a gain value. The general formula for exponential normalization is:

normalized_value = original_value * exp(gain)

where normalized_value is the resulting normalized amplitude value, original_value is the original amplitude value, gain represents the desired gain on a logarithmic scale (a gain of 0 leaves the signal unchanged), and exp() denotes the exponential function.

The gain factor, often expressed in decibels (dB), determines the amount of amplification or attenuation applied to the signal. Positive gain values amplify the signal, while negative gain values attenuate the signal. The specific calculation of the gain factor may vary depending on the desired normalization behavior.

To perform exponential normalization on an audio signal, the following steps are typically followed:

    Determine the desired gain factor or amplification level in dB.

    Convert the gain factor from dB to a linear scale using the following formula:

    gain_linear = 10^(gain_dB / 20)

    where gain_dB is the gain factor in decibels and gain_linear is the corresponding linear gain factor.

    Multiply each sample of the original signal by gain_linear to obtain the normalized signal:

    normalized_value = original_value * gain_linear

    This matches the exponential form above, because gain_linear = exp(gain_dB * ln(10) / 20).

Exponential normalization allows for controlled amplification or attenuation of the signal based on the desired gain factor. It can be used to adjust the overall loudness or bring the signal to a desired level for further processing or analysis. Because the same gain is applied to every sample, it does not change the frequency balance of the signal.
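A gain specified in decibels could be applied as sketched below. The sketch assumes the linear gain is used as a plain multiplier, so that 0 dB leaves the signal unchanged; the function name `apply_gain_db` and the example values are assumptions for illustration.

```python
import numpy as np

def apply_gain_db(audio, gain_db):
    """Scale `audio` by a gain given in decibels."""
    # 10^(gain_dB / 20) is the same as exp(gain_dB * ln(10) / 20)
    gain_linear = 10.0 ** (gain_db / 20.0)
    return audio * gain_linear

# Hypothetical samples, used only for illustration
gain_example = np.array([0.1, -0.2, 0.4])
boosted = apply_gain_db(gain_example, 6.0)      # roughly doubles the amplitude
attenuated = apply_gain_db(gain_example, -6.0)  # roughly halves the amplitude
unchanged = apply_gain_db(gain_example, 0.0)    # 0 dB leaves the signal as-is
print(boosted, attenuated, unchanged)
```

Positive values amplify and negative values attenuate. Boosting a signal whose peak is already close to full scale can push samples beyond the valid range, so a peak check after applying gain is a sensible precaution.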

It's important to note that the specific application of exponential normalization and the choice of gain factor depend on the context and desired outcome. Additionally, variations of exponential normalization techniques may exist to address specific requirements or constraints in different audio analysis applications.

**Unit vector normalization**

Unit vector normalization is a technique used in audio analysis to normalize audio feature vectors. In this context, a feature vector represents a set of audio features, such as spectral coefficients, mel-frequency cepstral coefficients (MFCCs), or any other numerical representation of audio characteristics.

The goal of unit vector normalization is to scale the feature vector so that its magnitude becomes equal to 1, while preserving the relative proportions of the individual feature values. This normalization technique is particularly useful when comparing feature vectors or when the absolute magnitude of the feature values is not relevant, but their relative directions and proportions are important.

The process of unit vector normalization can be described using mathematical notation as follows:

    Calculate the magnitude (or Euclidean norm) of the feature vector:
        Let v = [v1, v2, ..., vn] be the feature vector of length n.
        Compute the magnitude ||v|| as the square root of the sum of the squares of its components:
        ||v|| = sqrt(v1^2 + v2^2 + ... + vn^2).

    Normalize the feature vector by dividing each component by its magnitude:
        Let u = [u1, u2, ..., un] be the normalized feature vector.
        Compute each normalized component ui as:
        ui = vi / ||v||.

By performing unit vector normalization, the resulting feature vector u will have a magnitude of 1 while maintaining the direction and proportions of the original feature vector v. This normalization is useful in various audio analysis tasks, such as classification, clustering, or similarity comparisons, where the emphasis is on the relationships between the feature values rather than their absolute magnitudes.

In [None]:
def unit_vector_normalization(audio):
    norm = np.linalg.norm(audio)
    normalized_audio = audio / norm
    return normalized_audio

In [None]:
uvn_data = unit_vector_normalization(audio)
print(uvn_data[:10])

[1.3495738e-06 8.6098671e-06 1.1406464e-05 6.6882103e-06 9.9405281e-07
 6.3019947e-06 1.1605115e-05 1.2051857e-05 1.2763629e-05 8.8581737e-06]
