# Instrument Classification
This Python script performs classification using 100 audio samples with 5 classes.
To run the script, first install the required Python packages (all should be specified under `packages` in setup.py).

## Feature Extraction

Breebaart et al. (2004) compared various features such as mel-frequency cepstral coefficients (MFCC), auditory filter-bank temporal envelope (AFTE), standard low-level (SLL) and psychoacoustic (PA) features when training a standard Gaussian framework for both genre and general audio classification.

Some of the SLL features included root-mean-square (RMS) level, spectral centroid, bandwidth and roll-off, zero-crossing rate, band energy ratio, delta spectrum magnitude, and pitch. The PA features included roughness (the perception of temporal envelope modulations in the range of about 20-150 Hz), loudness (the sensation of signal strength) and sharpness (the spectral density and the relative strength of high-frequency energy). The AFTE features were made using gammatone filter banks to create gammatone-frequency cepstral coefficients (GFCC).

**Spectral Centroid** "indicates at which frequency the energy of a spectrum is centred upon" (Chauhan, 2020).

$$
\mathrm{SC}(m)=\frac{\sum_k f_k|X(m, k)|}{\sum_k|X(m, k)|}
$$

**Spectral Flatness** measures the ratio of the geometric mean of the spectrum to the arithmetic mean of the spectrum (Johnston, 1988).

$$
\mathrm{SF}(m) = \frac{\left(\prod_k|X(m, k)|\right)^{\frac{1}{K}}}{\frac{1}{K} \sum_k|X(m, k)|}
$$

**Spectral Roll-off** is a metric that helps determine the bandwidth of an audio signal. It identifies the frequency bin below which a certain percentage of the total energy of the signal is concentrated (Scheirer and Slaney, 1997).

$$
\mathrm{SR}(m) = \min\left\lbrace f_R : \sum_{k \le R} |X(m, k)| \ge \text{roll\_off\_percentage} \times \sum_k |X(m, k)| \right\rbrace
$$
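
The three spectral features above can be sketched with NumPy alone. This is a minimal illustration, not the script's implementation: the function name `spectral_features`, the small `eps` guard and the default `roll_off_percentage=0.85` are assumptions made here, and roll-off is read as the lowest bin frequency at which the cumulative magnitude reaches the chosen fraction of the total.

```python
import numpy as np

def spectral_features(magnitude, freqs, roll_off_percentage=0.85):
    """Per-frame spectral centroid, flatness and roll-off.

    magnitude: array of shape (n_bins, n_frames) holding |X(m, k)|.
    freqs:     array of shape (n_bins,) holding the bin frequencies f_k in Hz.
    """
    eps = 1e-10  # assumed guard against division by zero and log(0)
    total = magnitude.sum(axis=0) + eps

    # Centroid: magnitude-weighted mean frequency.
    centroid = (freqs[:, None] * magnitude).sum(axis=0) / total

    # Flatness: geometric mean over arithmetic mean (geometric mean via logs).
    geometric_mean = np.exp(np.log(magnitude + eps).mean(axis=0))
    flatness = geometric_mean / (magnitude.mean(axis=0) + eps)

    # Roll-off: lowest bin frequency whose cumulative magnitude reaches
    # roll_off_percentage of the frame total.
    cumulative = np.cumsum(magnitude, axis=0)
    threshold = roll_off_percentage * cumulative[-1]
    roll_off_bin = np.argmax(cumulative >= threshold, axis=0)
    roll_off = freqs[roll_off_bin]

    return centroid, flatness, roll_off


# Synthetic one-frame example: a flat spectrum over 0-4000 Hz.
freqs = np.linspace(0, 4000, 5)   # 0, 1000, 2000, 3000, 4000 Hz
flat = np.ones((5, 1))
centroid, flatness, roll_off = spectral_features(flat, freqs)
```

For a perfectly flat spectrum the centroid sits at the middle of the band and the flatness is close to 1, which is a quick sanity check for the definitions.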


AFTE features provided the highest accuracy in Breebaart's study, with MFCCs a close second. This corresponds with Liu's (2018) study, which reported 79% accuracy when training an attention-based long short-term memory (LSTM+A) network with GFCCs.

**GFCCs** and gammatone filter banks typically offer enhanced resolution in the lower frequency spectrum and generate a cochleagram, a time-frequency representation named after the cochlea, the inner-ear component whose frequency response the filters model.

```flow
     Gammatone Filter bank
               ↓
Audio Input → FFT → Decimation → Non-Linear Rectification → DCT → GFCC Feature
```


The literature above shows the results for general audio classification, whereas Deng et al. (2008) and Racharla et al. (2020) showed that MFCC and perception-based features dominated in support vector machines (SVM) for instrument classification models with up to 79% accuracy. Rathikarani (2020) also shows 98% accuracy with SVM and 95% with kNN classifiers using MFCC features in instrument classification.

**MFCCs** represent a non-linear 'spectrum-of-a-spectrum' where "the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response" (Sdour, 2020). They are calculated similarly to GFCCs but using mel filter banks.

Dupont et al. (2013) used k-nearest neighbour (kNN) classification to compare dimensionality reduction techniques such as Isomap and t-SNE. They used the mean, standard deviation, skewness and kurtosis of the MFCC features. Following that approach, this demonstration uses the mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ and standard deviation $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$ for all features to partially reduce the dimensions.
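
As a rough sketch of that summarisation step, the snippet below collapses a frame-wise feature matrix into one fixed-length vector per clip. The function name `summarise_frames` and the random matrix standing in for 13 MFCCs over 200 frames are assumptions for illustration only.

```python
import numpy as np

def summarise_frames(features):
    """Collapse a (n_features, n_frames) matrix into one vector per clip.

    Returns the per-feature mean followed by the per-feature standard
    deviation, so the output has length 2 * n_features.
    """
    mean = features.mean(axis=1)
    std = features.std(axis=1)   # population form (divides by N)
    return np.concatenate([mean, std])


# Hypothetical stand-in for 13 MFCCs over 200 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 200))
clip_vector = summarise_frames(mfcc)   # length 26
```

Every clip then yields a vector of the same length regardless of its duration, which is what the classifiers need.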

Two feature selection techniques suit this classification scenario: Analysis of Variance (ANOVA) and Kendall's rank correlation. Both score how strongly each feature relates to the targets (Brownlee, 2020). ANOVA is used here to select the top 10 features from the dataset. It calculates an F-value for each feature, which measures the ratio of between-group variance to within-group variance. The F-value for a single feature is:

$$ F = \frac{{\text{Between-group variance}}}{{\text{Within-group variance}}} = \frac{{\text{MSB}}}{{\text{MSW}}} $$

- Between-group variance is the variance between different groups or classes.
- Within-group variance is the variance within each group or class.
- MSB is the Mean Square Between (or among) groups, calculated as the between-group sum of squares divided by the degrees of freedom between groups.
- MSW is the Mean Square Within groups, calculated as the within-group sum of squares divided by the degrees of freedom within groups.
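
The F-value can be computed directly from these definitions. The sketch below is an illustration under assumptions: the helper names `anova_f_value` and `select_top_k` are hypothetical and are not taken from the script.

```python
import numpy as np

def anova_f_value(x, y):
    """One-way ANOVA F-value of a single feature x grouped by class labels y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = x.size, classes.size
    grand_mean = x.mean()

    ss_between = 0.0
    ss_within = 0.0
    for c in classes:
        group = x[y == c]
        ss_between += group.size * (group.mean() - grand_mean) ** 2
        ss_within += ((group - group.mean()) ** 2).sum()

    msb = ss_between / (n_classes - 1)          # degrees of freedom between groups
    msw = ss_within / (n_samples - n_classes)   # degrees of freedom within groups
    return msb / msw


def select_top_k(X, y, k=10):
    """Indices of the k columns of X with the largest F-values."""
    scores = np.array([anova_f_value(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]


# Toy check: two classes, one well-separated feature.
f = anova_f_value([1.0, 2.0, 3.0, 5.0, 6.0, 7.0], [0, 0, 0, 1, 1, 1])
```

In the toy check the between-group sum of squares is 24 with 1 degree of freedom and the within-group sum of squares is 4 with 4 degrees of freedom, so F = 24. scikit-learn's `SelectKBest` with `f_classif` computes the same statistic, though whether the script relies on it is not stated here.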


## Dimensionality Reduction

### PCA
Principal Component Analysis (PCA) is a common method for reducing the features in a dataset while maintaining key information and patterns. This is performed by:
1. Normalising the range of the continuous initial variables and calculating the mean vector
$$\quad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$
2. Calculating the covariance matrix to detect correlations
$$\quad \mathbf{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T$$
3. Computing the eigenvectors and eigenvalues of the covariance matrix to identify the main components
$$\quad \mathbf{\Sigma v}_i = \lambda_i \mathbf{v}_i \quad \text{for } i = 1, 2, ..., D$$
4. Forming a feature vector (the projection matrix $\mathbf{W}$) from the principal components to retain
5. Transforming the data along the axes of principal components
$$\quad \mathbf{Y} = \mathbf{X} \mathbf{W}$$
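
The five steps can be sketched in NumPy as follows. This is a minimal illustration rather than the script's code: the function name `pca`, the z-score normalisation and the random input matrix are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Step 1: standardise each feature (zero mean, unit variance).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # leave constant features unscaled
    Z = (X - mu) / sigma

    # Step 2: covariance matrix of the standardised data (1/N form).
    cov = Z.T @ Z / Z.shape[0]

    # Step 3: eigen-decomposition; eigh suits symmetric matrices and
    # returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: keep the eigenvectors with the largest eigenvalues as W.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    W = eigenvectors[:, order]

    # Step 5: transform the standardised data, Y = Z W.
    return Z @ W, eigenvalues[order]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))         # hypothetical: 100 clips, 20 features
Y, explained = pca(X, n_components=2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues indicate how much information each component retains.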

### t-SNE
t-distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised method for dimensionality reduction on non-linear data which, unlike PCA, primarily preserves small pairwise distances. Van Der Maaten (2014) drastically accelerates t-SNE with tree-based algorithms, including a Barnes-Hut implementation. Pál et al. (2020) show that t-SNE outperforms PCA, Isomap and Self-Organizing Maps (SOM) when using MFCC features for general audio dimensionality reduction (DR) in machine learning (ML).

### LDA
As stated by Dupont et al. (2013), Linear Discriminant Analysis (LDA) is a common DR technique used in classification algorithms. This supervised DR method finds the linear combinations of features that best separate multiple classes by maximizing their between-class distance and minimizing their within-class variance. LDA is performed as follows:
1. Calculate the mean vectors for each class by averaging the feature vectors belonging to that class.
$$\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} \mathbf{x}_{ki} \quad \text{for } k = 1, 2, ..., K$$
2. Determine the within-class scatter matrix by computing the scatter matrix for each class and summing them up.
$$\quad \mathbf{S}_W = \sum_{k=1}^{K} \sum_{i=1}^{N_k} (\mathbf{x}_{ki} - \boldsymbol{\mu}_k) (\mathbf{x}_{ki} - \boldsymbol{\mu}_k)^T$$
3. Derive the between-class scatter matrix by computing the mean vector of all data points across all classes and then calculating the scatter matrix between classes.
$$\quad \mathbf{S}_B = \sum_{k=1}^{K} N_k (\boldsymbol{\mu}_k - \boldsymbol{\mu}) (\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$$
4. Find the eigenvectors and eigenvalues of the matrix formed by the product of the inverse of the within-class scatter matrix and the between-class scatter matrix.
$$\quad \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w}_i = \lambda_i \mathbf{w}_i \quad \text{for } i = 1, 2, ..., D$$
5. Select the top $d$ eigenvectors corresponding to the largest eigenvalues (at most $K - 1$ eigenvalues are non-zero) to form the projection matrix $\mathbf{W}$.
6. Project the original data onto the subspace formed by the selected eigenvectors to obtain the lower-dimensional representation of the data.
$$\quad \mathbf{Y} = \mathbf{X} \mathbf{W}$$
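
The LDA steps translate into NumPy roughly as below. Treat it as a sketch under assumptions: the function name `lda`, the use of a pseudo-inverse for $\mathbf{S}_W$ and the synthetic three-class data are choices made here, not details of the script.

```python
import numpy as np

def lda(X, y, n_components=2):
    """Project X (n_samples, n_features) onto the top LDA directions."""
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)                  # step 1: class mean
        centred = X_c - mean_c
        S_W += centred.T @ centred                 # step 2: within-class scatter
        diff = (mean_c - overall_mean)[:, None]
        S_B += X_c.shape[0] * (diff @ diff.T)      # step 3: between-class scatter

    # Step 4: eigen-decomposition of S_W^{-1} S_B. The pseudo-inverse is an
    # assumption made here so that a singular S_W does not raise an error.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    eigenvalues, eigenvectors = eigenvalues.real, eigenvectors.real

    # Step 5: keep the directions with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    W = eigenvectors[:, order]

    # Step 6: project the data, Y = X W.
    return X @ W


# Hypothetical data: 3 well-separated classes in 4 dimensions.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 4)) + 5.0 * y[:, None]
Y = lda(X, y, n_components=2)
```

Unlike PCA, the projection depends on the labels `y`, so the first component is the direction along which the class means are furthest apart relative to the spread within each class.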

As this dataset provides targets, supervised learning is most likely the best approach. As mentioned above, common supervised learning models for audio classification include support vector classification (SVC) and kNN. The script uses k-fold cross-validation to provide an accuracy score for each model with each dimensionality reduction technique.
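
To make the evaluation idea concrete, here is a small NumPy-only stand-in for k-fold scoring of a kNN classifier. It is not the script itself: the function names, the choice of 5 folds and 5 neighbours, and the synthetic two-dimensional data are all assumptions, and SVC is left out to keep the example short.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(distances, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


def k_fold_accuracy(X, y, n_folds=5, k=5, seed=0):
    """Mean accuracy of kNN over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predictions = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        scores.append((predictions == y[test_idx]).mean())
    return float(np.mean(scores))


# Hypothetical stand-in for the reduced features: 100 samples, 5 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 20)
X = rng.normal(size=(100, 2)) + 6.0 * y[:, None]
accuracy = k_fold_accuracy(X, y)
```

Running the same function on the output of each dimensionality reduction technique gives one comparable accuracy score per technique.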

## References
Zaman, K. et al. (2023) ‘A Survey of Audio Classification Using Deep Learning’, IEEE Access, 11, pp. 106620–106649. doi: 10.1109/ACCESS.2023.3318015.


Racharla, K. et al. (2020) ‘Predominant Musical Instrument Classification based on Spectral Features’, in 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN). IEEE. doi: 10.1109/spin48934.2020.9071125.

