# <center>Genre Classification Using Extracted Audio Features</center>

<center> Levi Davis <br> ljd3frf@virginia.edu <br> CS 6501: Digital Signal Processing, Spring 2023 </center>

# Introduction

The aim of this project is to 1) analyze the effectiveness of audio features extracted from songs for genre classification, and 2) benchmark the performance of various classification models.

I use the librosa package to calculate 12 distinct audio features, such as Mel-frequency cepstral coefficients (MFCCs), spectral bandwidth, and tempogram ratio, for each song. After extracting features, I calculate statistical measures such as mean and standard deviation for each feature. I use five statistical machine learning classification algorithms to explore which feature or feature pair most accurately predicts genre. I identify the top individual features, feature pairs, and classifiers to create 12 final models, using cross-validation to optimize hyperparameters. Finally, trying a different approach, I build a convolutional neural network (CNN) to classify songs by genre using slices of mel-spectrogram images.

# Data

I use data from the [Free Music Archive](https://github.com/mdeff/fma) (FMA) song dataset located on GitHub. I use the fma_small dataset, which comprises 8,000 .wav song files of 30 seconds each, with 8 balanced genres: Electronic, Experimental, Folk, Hip-Hop, Instrumental, International, Pop, and Rock. The archive contains metadata files that include some audio features extracted using librosa; however, I modify the provided FMA feature extraction code to calculate additional audio features and to create a complete data pipeline. I compute the following statistical measures for each feature: mean, standard deviation, minimum, maximum, median, skewness, and kurtosis. Note: most audio features are multi-dimensional, so the calculated statistics are not single numbers but rather arrays or matrices. Below is an overview of the extracted audio features.

Features computed from the raw audio waveform:
 - Tempogram Ratio: This feature captures the rhythmic content of a song by calculating the tempo and beat positions.  
 - ZCR (Zero-Crossing Rate): This feature represents the number of times a signal crosses the zero axis and captures the temporal characteristics of a song.
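To make the zero-crossing rate concrete, here is a minimal sketch that counts sign changes between adjacent samples in one frame and divides by the number of adjacent pairs. The helper `zero_crossing_rate` and the toy frames are hypothetical and only illustrate the idea; librosa's own implementation additionally handles framing and padding of the full signal.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in `frame` whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)  # a sample equal to 0 counts as non-negative
    )
    return crossings / (len(frame) - 1)


# Toy frames (made up): one alternates sign on every sample, one never changes sign.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]
positive = [0.2, 0.5, 0.9, 0.4]
```

Noisy or percussive audio tends to produce higher values than smooth, low-pitched audio, which is why the rate is useful as a coarse temporal descriptor.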
 
Features computed from the Constant-Q Transform (CQT):
 - Chroma_cqt (Constant-Q Transform): This feature represents the chromatic scale of a song and captures the pitch class distribution in each audio frame.
 - Chroma_cens (Chroma Energy Normalized Statistics): This feature also captures the pitch class distribution of a song but is more robust to variations in timbre and dynamics.
 - Tonnetz: This feature represents the tonal hierarchy of a song and captures the relationships between musical chords.

Features computed from the Short-Time Fourier Transform (STFT):
 - Chroma_stft (Short-Time Fourier Transform): This feature is similar to chroma_cqt but is computed using the short-time Fourier transform.
 - RMS (Root Mean Square): This feature represents the energy of a signal and captures the overall loudness of a song.
 - Spectral Bandwidth: This feature represents the bandwidth of a signal and captures the spread of frequencies in each audio frame.
 - Spectral Centroid: This feature represents the center of mass of the frequency distribution in each audio frame and captures the "brightness" of a song.
 - Spectral Contrast: This feature captures the difference in spectral energy between adjacent frequency bands and provides information about the spectral texture of a song.
 - Spectral Rolloff: This feature represents the frequency below which a specified percentage of the total spectral energy lies.
 
Feature computed from mel-spectrograms (using STFT):
  - MFCC (Mel-Frequency Cepstral Coefficients): This feature represents the spectral envelope of a song and captures the variations in timbre.
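The seven statistics listed at the start of this section are computed per dimension of each feature. The sketch below shows one way this could look using only the Python standard library. The function `summarize_feature` and the toy input are hypothetical, and the skewness and kurtosis here use the population (biased) formulas with excess kurtosis, which is an assumption and may differ from the definitions used in the actual pipeline.

```python
import statistics


def summarize_feature(frames_by_dim):
    """Return {statistic name: [one value per dimension]} for a 2-D feature.

    frames_by_dim: list of rows; row i holds the per-frame values of dimension i
    (for example, one MFCC coefficient tracked over time).
    """
    stat_names = ("mean", "std", "min", "max", "median", "skew", "kurtosis")
    stats = {name: [] for name in stat_names}
    for row in frames_by_dim:
        mean = statistics.fmean(row)
        std = statistics.pstdev(row)  # population standard deviation
        if std > 0:
            z = [(x - mean) / std for x in row]
            skew = statistics.fmean([v ** 3 for v in z])
            kurt = statistics.fmean([v ** 4 for v in z]) - 3.0  # excess kurtosis
        else:
            # Assumption: report 0 for both moments when the row is constant.
            skew, kurt = 0.0, 0.0
        values = (mean, std, min(row), max(row), statistics.median(row), skew, kurt)
        for name, value in zip(stat_names, values):
            stats[name].append(value)
    return stats


# Toy 2-dimensional feature with 5 frames per dimension (made-up values).
toy_feature = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
summary = summarize_feature(toy_feature)
```

Each entry of `summary` is a list with one value per dimension, which is why the statistics of a multi-dimensional feature form an array rather than a single number.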

## ML classification algorithms

I select five popular statistical machine learning algorithms to serve as benchmarks for genre classification. Using multiple algorithms allows me to calculate an average score over all models to explore the performance of different audio features. Additionally, by averaging model accuracy scores across all audio features, I can identify which algorithms perform best for this problem. Below are short descriptions of each classification algorithm.
  
 - Support Vector Classification (SVC): SVC uses a kernel function to implicitly map the input data into a higher-dimensional space, which allows it to separate data that are not linearly separable. It then finds the hyperplane that best separates the classes by maximizing the margin between the hyperplane and the closest data points. Despite its high accuracy and ability to handle high-dimensional data, SVC can become computationally expensive and slow with large datasets.

 - Random Forest Classifier: Random Forest is an ensemble learning algorithm that creates multiple decision trees and combines their results to make a final prediction. It is based on the bagging method: each tree is trained on a random sample of the data points and considers random subsets of the features, which reduces overfitting and improves generalization. The final prediction is made by aggregating the predictions of all trees. The algorithm is known for its ability to handle high-dimensional data and to rank feature importance. Its runtime is typically shorter than that of other ensemble methods, such as boosting algorithms.

 - K-Nearest Neighbors Classifier: K-Nearest Neighbors (KNN) is a non-parametric method that classifies a new data point by the majority class of its K nearest training examples. It is based on the assumption that similar data points are likely to belong to the same class. The choice of K affects the bias-variance trade-off of the model: a smaller K leads to high variance and low bias, while a larger K leads to low variance and high bias. The main disadvantages of this algorithm are that it is sensitive to the choice of K and can become computationally expensive for large datasets.

 - Decision Tree Classifier: Decision Tree is a simple and intuitive classification algorithm that builds a tree-like structure representing the decision-making process. The algorithm recursively partitions the feature space into subsets based on feature values and assigns each subset the majority class of its training examples. The complexity of the model depends on the depth and width of the tree, and training can become slow with large datasets. The main advantage of this algorithm is that it can handle nonlinear relationships between the input features and the output classes.

 - Gradient Boosting Classifier: Gradient Boosting is an ensemble learning algorithm that combines multiple weak classifiers to create a strong classifier. It is based on the boosting method, which adds decision trees to the model one at a time, fitting each new tree to the errors made by the ensemble built so far. The algorithm is known for its ability to handle complex datasets with high accuracy. However, it can become computationally expensive for large datasets and may require a longer runtime than other algorithms.

In summary, the performance and processing time of each machine learning model depend on several factors, such as the size and intricacy of the selected features, the selected hyperparameters, and the precise implementation of the algorithm.
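As a small illustration of the simplest of these algorithms, the sketch below implements the K-Nearest Neighbors majority vote in plain Python. It is not the scikit-learn implementation used for the benchmarks; the function `knn_predict`, the 2-D points, and their labels are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors with made-up genre labels (illustrative only).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["Folk", "Folk", "Folk", "Rock", "Rock", "Rock"]
```

Every prediction scans all training points, which is the source of the computational cost on large datasets noted above.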

# Modeling

First, I explore the effectiveness of various audio features and pinpoint the most effective audio features and classifiers; I then build final models with hyperparameter tuning.


The modeling part of this project is split into three phases. The first phase pairs five classification models with all 12 individual features and with all combinations of feature pairs, giving 60 single-feature models and 330 two-feature models. This two-dimensional exploration (multiple algorithms and multiple features) serves as an informal cross-validation process, which allows for a broad analysis of combinations of features and classification algorithms. For the second phase, I select the top two classification algorithms, the top two single features, and the top four feature pairs, for a total of 12 unique models, and for each model I optimize hyperparameters using 5-fold cross-validation. In phase three, I implement the CNN model and show the best recorded model.
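The model counts quoted above follow directly from the number of features and classifiers. The short sketch below reproduces the arithmetic; the identifier strings are only illustrative labels for the 12 features and 5 classifiers, not names taken from the project code.

```python
from itertools import combinations

# Illustrative labels for the 12 features and 5 classifiers described earlier.
features = [
    "tempogram_ratio", "zcr", "chroma_cqt", "chroma_cens", "tonnetz",
    "chroma_stft", "rms", "spectral_bandwidth", "spectral_centroid",
    "spectral_contrast", "spectral_rolloff", "mfcc",
]
classifiers = ["SVC", "RandomForest", "KNN", "DecisionTree", "GradientBoosting"]

feature_pairs = list(combinations(features, 2))  # unordered pairs of distinct features

n_single = len(features) * len(classifiers)      # phase one, single-feature models
n_pairs = len(feature_pairs) * len(classifiers)  # phase one, two-feature models
n_phase_two = 2 * (2 + 4)                        # 2 classifiers x (2 features + 4 pairs)

print(n_single, n_pairs, n_phase_two)  # 60 330 12
```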

## Phase one

Hyperparameter optimization is not employed in this phase; instead, I use the default values in scikit-learn. While this is not the recommended way to obtain an optimally performing model, it is acceptable here since all models use the same dataset and the purpose is to gather benchmark scores. Each model has an accuracy score for each feature or feature combination. Consequently, I can average across models for each feature to obtain an average feature score, and average across features to obtain an average model score. Although it is not a factor in feature or algorithm selection in this project, I also record the runtime of each model (shown in seconds).
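The two kinds of averages described above can be sketched as a simple group-and-average over a table of accuracies. The scores, classifier names, and feature names below are placeholders chosen for easy arithmetic; they are not results from this project, and `average_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import fmean

# Placeholder accuracies keyed by (classifier, feature); NOT results from this project.
scores = {
    ("clf_a", "feat_x"): 0.40,
    ("clf_a", "feat_y"): 0.20,
    ("clf_b", "feat_x"): 0.30,
    ("clf_b", "feat_y"): 0.10,
}


def average_by(scores, key_index):
    """Average accuracy grouped by classifier (key_index=0) or feature (key_index=1)."""
    groups = defaultdict(list)
    for key, accuracy in scores.items():
        groups[key[key_index]].append(accuracy)
    return {name: fmean(values) for name, values in groups.items()}


feature_scores = average_by(scores, key_index=1)     # averaged across classifiers
classifier_scores = average_by(scores, key_index=0)  # averaged across features
```

Ranking `feature_scores` identifies the strongest features regardless of classifier, and ranking `classifier_scores` identifies the strongest classifiers regardless of feature.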

### Single Feature

First, I create a separate model for each combination of classifier and feature, using the computed statistics for that feature (mean, std, skew, etc.) as input, resulting in 60 models. Below I show the top 10 performing models.

![My Images](Images/single_feature_table_head.png)

Next, I average across classifiers to get an average accuracy score for each feature, and then average across features to get an average accuracy score for each classifier.

![My Image](Images/single_feature_table_features.png)

![My Image](Images/single_feature_table_algos.png)

These results show that mfcc is the best predicting feature by an average of 8%, followed by spectral contrast, which is 7% better than the third-best feature. The Random Forest and Gradient Boosting classifiers have extremely similar average accuracies, SVC falls 6% below them, and KNN and Decision Tree have significantly lower scores. The average duration of these algorithms is also worth noting: Random Forest and Gradient Boosting have nearly the same accuracy, yet Random Forest runs 20x faster.

### Two features

Next, I use all possible combinations of two features to explore if expanding the feature space will result in greater model performance. Below are the top 10 performing feature combinations. These top 10 performing models all out-score the top single-feature model, yet only by about 2%.

![My Image](Images/two_feature_table_head.png)

Again, I average across classifiers to get an average accuracy score for each feature pair, and then average across feature pairs to get an average accuracy score for each classifier.

![My Image](Images/two_feature_table_features.png)

![My Image](Images/two_feature_table_algos.png)

The top 10 feature pairs all include mfcc, which is clearly the best audio feature for this problem. The top feature-pair model (mfcc and spectral contrast) scores 48%, 2% higher than the top single-feature model (mfcc). Having two features increases average performance by classifier, boosting the average accuracies by 8% for the top two algorithms; however, this increase doesn't matter much in terms of choosing the best feature. It just demonstrates that using two arbitrary features tends to achieve better accuracy than using one.

### All Features

I create models using all features to examine whether group effects will boost performance. Adding all features generates higher accuracies for all algorithms. Gradient Boosting scores an average of 3% higher than the best two-feature model and almost 6% better than the best single-feature model.

![My Image](Images/all_feature_table.png)

## Phase two

The two best classification algorithms are Gradient Boosting and Random Forest, the two best individual features are mfcc and spectral contrast, and the four best feature pairs are mfcc/spectral_contrast, mfcc/chroma_cqt, mfcc/zcr, and mfcc/tempogram_ratio. I use 5-fold cross-validation to automatically optimize the hyperparameters of each model and record the test accuracy of each final model. Duration includes the cross-validation time and the final model fit time.
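For readers unfamiliar with cross-validated hyperparameter search, here is a minimal standard-library sketch of the idea: split the training data into 5 folds, score every candidate setting on each held-out fold, and keep the setting with the best mean score. This is not the code used in the project (scikit-learn provides `GridSearchCV` for this kind of search, and the write-up does not state which tool was used). The helpers `k_fold_indices` and `grid_search_cv`, and the `evaluate` callback they expect, are hypothetical. The folds here are contiguous and unshuffled, which is a simplification.

```python
from statistics import fmean


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def grid_search_cv(candidates, evaluate, n_samples, k=5):
    """Return (best_params, best_mean_score) over the candidate settings.

    evaluate(params, train_idx, val_idx) is supplied by the caller; it should
    fit a model on train_idx and return its accuracy on val_idx.
    """
    best_params, best_score = None, float("-inf")
    for params in candidates:
        fold_scores = [
            evaluate(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        ]
        mean_score = fmean(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

After the search, the chosen setting is typically refit on the full training set and scored once on a separate test set, which matches the "cross-validation time plus final model fit time" accounting above.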

![My Image](Images/tm__table.png)

The results of the models with tuned hyperparameters are underwhelming: all models have test accuracies extremely similar to those of the same models without hyperparameter optimization. Classifying music genre solely from a couple of extracted features is no simple task, and 54% accuracy using 'simple' statistical algorithms is respectable compared to random guessing (12.5%). Nevertheless, I am not quite satisfied and now turn to deep learning. Instead of using statistical measures of extracted audio features as input, I use mel-spectrogram images as input to a convolutional neural network (CNN).

## Phase three

I use the librosa package to make the mel-spectrograms, with n_mels = 128 and a maximum frequency of 8000 Hz, and clip each image into 3-second slices. Each resulting image is 256 x 256, although I resize down to 128 x 128 because the larger size offered no boost in accuracy and slowed down training considerably. Shown below is the CNN architecture:

![My Image](Images/model_arch.png)
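Returning to the slicing step that prepares the CNN input: one way to cut a 30-second mel-spectrogram into 3-second pieces is to compute frame ranges from the sample rate and hop length. The sketch below assumes librosa's default values (`sr=22050`, `hop_length=512`), since the project's actual settings are not stated; `slice_bounds` is a hypothetical helper.

```python
def slice_bounds(n_frames, sr=22050, hop_length=512, slice_seconds=3.0):
    """Return [(start, end), ...] frame ranges for non-overlapping slices.

    sr and hop_length are assumed defaults, not values confirmed by the project.
    Trailing frames that do not fill a whole slice are dropped.
    """
    frames_per_slice = int(slice_seconds * sr / hop_length)
    n_slices = n_frames // frames_per_slice
    return [
        (i * frames_per_slice, (i + 1) * frames_per_slice)
        for i in range(n_slices)
    ]


# A 30-second clip at these settings has roughly 1292 spectrogram frames.
bounds = slice_bounds(n_frames=1292)
print(len(bounds), bounds[0], bounds[-1])  # 10 (0, 129) (1161, 1290)
```

Each `(start, end)` pair would select columns of the spectrogram array, so one 30-second song yields ten training images under these assumptions.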

I spent a while adjusting the model architecture and hyperparameters; here I show the final model, trained for 60 epochs with the following parameters: batch_size=128, validation split=0.3, and a custom learning rate that starts at 0.001 and decreases by a factor of 0.5 every 10 epochs. The final test accuracy is 0.7249, and below are plots showing training/validation accuracy and loss metrics.

![My Image](Images/model_test_accuracy.png)

![My Image](Images/model_history.png)
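The learning-rate schedule described above (start at 0.001, halve every 10 epochs) is a step decay. A minimal sketch follows; the function name `step_decay` and the zero-based epoch indexing are assumptions, since the training code itself is not shown.

```python
def step_decay(epoch, initial_lr=0.001, drop=0.5, epochs_per_drop=10):
    """Learning rate for a zero-based epoch index under step decay."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


# Learning rate for each of the 60 training epochs.
schedule = [step_decay(epoch) for epoch in range(60)]
```

Over 60 epochs the rate is halved five times, so the last ten epochs train at 0.001 * 0.5^5 = 0.00003125.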

The CNN model surpasses the top statistical machine learning model by about 15 percentage points, well worth the trade-off of more data pre-processing and longer training time. For future work, I would experiment with altering the CNN architecture and hyperparameters along with applying transformations to the image data. It would also be interesting to create a mixed model classifier by combining the CNN model with a Gradient Boosting or Random Forest model trained with all statistical measures.

# Conclusion

I explored various types of input data and algorithms for classifying songs by music genre. I first extracted 12 audio features and computed 7 statistical measures for each feature. I chose five statistical machine learning classification algorithms to create models for each feature, each possible pair of features, and all features together. From these results, I conclude that mfcc is by far the best feature, and spectral contrast is a strong second. For the classifiers, Gradient Boosting and Random Forest tie for first place, significantly outperforming the other three models. Finally, I used mel-spectrograms as input to a convolutional neural network, which had a final test accuracy of 72.72%. The CNN significantly outperforms the best non-neural-network model, Gradient Boosting trained on all 12 features, which scored 57.38%. This project explores the efficacy of extracted audio features for classifying music genre and demonstrates how deep learning methods can significantly boost performance for signal processing problems.