## Feature Extraction

Breebaart et al. (2004) compared features including mel-frequency cepstral coefficients (MFCC), auditory filter-bank temporal envelope (AFTE), standard low-level (SLL), and psychoacoustic (PA) features when training a standard Gaussian framework for both genre and general audio classification.

The SLL features included root-mean-square (RMS) level, spectral centroid, spectral bandwidth, spectral roll-off, zero-crossing rate, band energy ratio, delta spectrum magnitude, and pitch. The PA features included roughness (the perception of temporal envelope modulations in the range of about 20-150 Hz), loudness (the sensation of signal strength), and sharpness (the spectral density and the relative strength of high-frequency energy). The AFTE features were made using gammatone filter banks to create gammatone-frequency cepstral coefficients (GFCC).

**Spectral Centroid** "indicates at which frequency the energy of a spectrum is centred upon" (Chauhan, 2020).

$$
\mathrm{SC}(m)=\frac{\sum_k f_k|X(m, k)|}{\sum_k|X(m, k)|}
$$
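
As a minimal sketch of the centroid formula, the function below computes the magnitude-weighted mean frequency for a single frame. It assumes that $|X(m, k)|$ is the magnitude of frequency bin $k$ in frame $m$ and that $f_k$ is the centre frequency of that bin; the function name and the four-bin example values are hypothetical.

```python
def spectral_centroid(magnitudes, frequencies):
    """Magnitude-weighted mean frequency of one spectral frame."""
    total = sum(magnitudes)
    if total == 0:
        # Silent frame: the centroid is undefined; returning 0.0 is a chosen convention.
        return 0.0
    return sum(f * m for f, m in zip(frequencies, magnitudes)) / total


# Hypothetical frame with energy split equally between 100 Hz and 200 Hz.
freqs = [0.0, 100.0, 200.0, 300.0]
mags = [0.0, 1.0, 1.0, 0.0]
centroid = spectral_centroid(mags, freqs)
```

With equal magnitudes at 100 Hz and 200 Hz, the centroid falls midway between the two bins.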

**Spectral Flatness** measures the ratio of the geometric mean of the spectrum to the arithmetic mean of the spectrum (Johnston, 1988).

$$
\mathrm{SF}(m) = \frac{\left(\prod_k|X(m, k)|\right)^{\frac{1}{K}}}{\frac{1}{K} \sum_k|X(m, k)|}
$$
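
The flatness ratio can be sketched for a single frame as follows, assuming $K$ is the number of frequency bins. The geometric mean is computed in the log domain to avoid underflow in the product, and the small `eps` offset is an assumption added here to keep the logarithm defined for empty bins; neither detail comes from the cited source.

```python
import math


def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean divided by arithmetic mean of one magnitude spectrum."""
    k = len(magnitudes)
    if k == 0:
        raise ValueError("magnitudes must contain at least one bin")
    # Geometric mean via the mean of logs: exp((1/K) * sum(log|X|)).
    log_mean = sum(math.log(m + eps) for m in magnitudes) / k
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(magnitudes) / k
    return geometric_mean / (arithmetic_mean + eps)


# Hypothetical frames: a flat (noise-like) spectrum and a single-peak (tone-like) spectrum.
flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])
peaky = spectral_flatness([0.0, 0.0, 4.0, 0.0])
```

A flat spectrum gives a value close to 1, while a spectrum dominated by one bin gives a value close to 0.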

**Spectral Roll-off** is a metric that helps determine the bandwidth of an audio signal. It identifies the frequency bin below which a certain percentage of the total energy of the signal is concentrated (Scheirer and Slaney, 1997).

$$
\mathrm{SR}(m) = \min \left\{ f_{k_r} \;:\; \sum_{k \le k_r} |X(m, k)| \;\ge\; \text{roll\_off\_percentage} \times \sum_k |X(m, k)| \right\}
$$
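
The roll-off definition in the text (the frequency below which a given percentage of the total spectral magnitude lies) can be sketched as a cumulative sum over the bins of one frame. The default of 0.85 for `roll_percent` is an assumption chosen for illustration, as are the function name and the example spectrum.

```python
def spectral_rolloff(magnitudes, frequencies, roll_percent=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches
    roll_percent of the frame's total magnitude."""
    total = sum(magnitudes)
    if total == 0:
        # Silent frame: no roll-off point exists; 0.0 is a chosen convention.
        return 0.0
    threshold = roll_percent * total
    cumulative = 0.0
    for f, m in zip(frequencies, magnitudes):
        cumulative += m
        if cumulative >= threshold:
            return f
    # Guard against rounding error leaving the threshold unreached.
    return frequencies[-1]


# Hypothetical frame whose magnitude decays with frequency.
freqs = [0.0, 100.0, 200.0, 300.0, 400.0]
mags = [4.0, 3.0, 2.0, 1.0, 0.0]
rolloff = spectral_rolloff(mags, freqs, roll_percent=0.85)
```

Here the total magnitude is 10, so the threshold is 8.5, and the cumulative sum first reaches it at the 200 Hz bin.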


AFTE features provided the highest accuracy in Breebaart's study, with MFCCs a close second. This corresponds with Liu's (2018) study, which showed 79% accuracy when training an attention-based long short-term memory (LSTM+A) network with GFCCs.

**GFCCs** and gammatone filter banks typically offer enhanced resolution in the lower frequency spectrum and generate a cochleagram, a time-frequency representation named after the cochlea, an inner ear component.

```flow
     Gammatone Filter bank
               ↓
Audio Input → FFT → Decimation → Non-Linear Rectification → DCT → GFCC Feature
```


The literature above reports results for general audio classification, whereas Deng et al. (2008) and Racharla et al. (2020) showed that MFCC and perception-based features dominated in SVMs for instrument classification, with up to 79% accuracy.

**MFCCs** represent a non-linear 'spectrum-of-a-spectrum' where "the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response" (Sdour, 2020).
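
To illustrate what "equally spaced on the mel scale" means, the sketch below converts between hertz and mels and places band centres at equal mel intervals. It assumes the common $2595 \log_{10}(1 + f/700)$ form of the mel formula, which is only one of several variants in use; the function names, the 0-8000 Hz range, and the band count are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Hertz to mels, using the 2595 * log10(1 + f / 700) variant (an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(f_min, f_max, n_bands):
    """Centre frequencies in Hz of n_bands bands equally spaced in mels."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_bands)]


# Hypothetical configuration: 10 bands between 0 Hz and 8 kHz.
centres = mel_band_centres(0.0, 8000.0, 10)
```

Equal steps in mels become progressively wider steps in hertz, so the bands are dense at low frequencies and sparse at high frequencies.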

Rathikarani (2020) shows 98% accuracy with SVM and 95% with kNN classifiers using MFCC features in instrument classification.

Principal Component Analysis (PCA) is a common method for reducing the number of features in a dataset while maintaining key information and patterns. This is performed by:
1. Normalising the continuous initial variables' range
2. Calculating the covariance matrix to detect correlations
3. Computing the eigenvectors and eigenvalues of the covariance matrix to identify the main components
4. Forming a feature vector to determine which principal components to retain
5. Transforming the data along the axes of principal components
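
The five steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the method used in any of the cited studies: the function name `pca`, the z-score normalisation, and the synthetic three-feature dataset are all hypothetical choices.

```python
import numpy as np


def pca(features, n_components):
    """Project features (n_samples x n_features) onto the top principal components."""
    X = np.asarray(features, dtype=float)

    # 1. Normalise each variable to zero mean and unit variance.
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid division by zero
    Z = (X - X.mean(axis=0)) / std

    # 2. Covariance matrix of the normalised variables.
    cov = np.atleast_2d(np.cov(Z, rowvar=False))

    # 3. Eigen-decomposition; eigh is intended for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Feature vector: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]

    # 5. Transform the data onto the retained principal axes.
    projected = Z @ components
    explained = eigenvalues[order] / eigenvalues.sum()
    return projected, explained


# Synthetic data: the third feature is a linear mix of the first two,
# so two components should capture nearly all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.1 * base[:, 1]])
projected, explained = pca(X, n_components=2)
```

The `explained` array gives the fraction of total variance carried by each retained component, which is one common way to decide how many components to keep in step 4.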

**t-SNE** (t-distributed stochastic neighbour embedding) is another dimensionality-reduction method. Van der Maaten (2014) presents a Barnes-Hut implementation of t-SNE that drastically accelerates the process using tree-based algorithms.

## References
Zaman, K. et al. (2023) ‘A Survey of Audio Classification Using Deep Learning’, IEEE Access, 11, pp. 106620–106649. doi: 10.1109/ACCESS.2023.3318015.


Racharla, K. et al. (2020) ‘Predominant Musical Instrument Classification based on Spectral Features’, in 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN). IEEE. doi: 10.1109/spin48934.2020.9071125.

