## Music Genre Classification using CNN or classic ML

The idea of this mini project is to deploy a music genre classifier as a web application. Music genre classification (MGC) has long been an open problem in music information retrieval (MIR), starting with the question of what a genre actually is. For this project, however, I assume that genre boundaries are reasonably well defined, and use a labeled dataset (GTZAN, Free Music Archive (FMA) or AudioSet) to perform classification with either a CNN or classic ML techniques, depending on time constraints.

For the CNN implementation, transfer learning can be done using the VGG-16 architecture, either with its pretrained weights kept fixed or with those weights used only as a starting point for fine-tuning. The network consists of 5 convolutional blocks, and the output layer would use a softmax activation. I would need to download the architecture, replace the output layer, use regularization and dropout to reduce overfitting (reported in the reference) and feed the songs' spectrograms as inputs.
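
The transfer-learning setup described above could be sketched roughly as follows with Keras. This is only an illustration under assumptions: the function name `build_genre_cnn`, the 224x224x3 input size, the 256-unit dense layer, the dropout rate and the L2 strength are placeholders that are not taken from the reference, and the spectrograms would have to be resized and stacked to three channels to match the VGG-16 input.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_genre_cnn(num_classes=10, weights="imagenet", freeze_base=True,
                    input_shape=(224, 224, 3), dropout_rate=0.5, l2_strength=1e-4):
    """VGG-16 convolutional base with a new softmax head (illustrative sketch)."""
    base = tf.keras.applications.VGG16(
        weights=weights,          # "imagenet" for transfer learning, None for random init
        include_top=False,        # drop the original 1000-class ImageNet head
        input_shape=input_shape,
    )
    # Frozen base = fixed weights; unfrozen base = weights used as a starting point
    base.trainable = not freeze_base

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_strength))(x)
    x = layers.Dropout(dropout_rate)(x)   # dropout against overfitting
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # integer-encoded genre labels
                  metrics=["accuracy"])
    return model
```

Calling `build_genre_cnn(10)` would download the ImageNet weights on first use; passing `freeze_base=False` corresponds to the "starting point" variant, where the convolutional blocks are fine-tuned as well.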

For the ML implementation, I would have to extract the features manually. This would be done using the audio library librosa to extract frequency-domain features (Mel-Frequency Cepstral Coefficients (MFCC), spectral centroid, spectral roll-off...) and time-domain features (central moments, Zero Crossing Rate (ZCR), Root Mean Square Energy (RMSE)...). Those would then be fed into one or several classifiers for comparison (logistic regression, random forest, gradient boosting, support vector machines...).
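
To make a few of these features concrete, here is a minimal NumPy-only sketch of what ZCR, RMS energy and the spectral centroid measure. It is a stand-in for illustration, not the librosa implementation: librosa computes these per short frame, while the hypothetical helpers below summarize a whole signal at once, and the 440 Hz test tone is synthetic.

```python
import numpy as np


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))


def rms_energy(signal):
    """Root mean square of the samples."""
    return float(np.sqrt(np.mean(np.square(signal))))


def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency of the spectrum, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))


# Synthetic 1-second, 440 Hz sine wave sampled at 22050 Hz (not project data)
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

features = {
    "zcr": zero_crossing_rate(tone),
    "rms": rms_energy(tone),
    "spectral_centroid": spectral_centroid(tone, sr),
}
print(features)
```

For a pure tone the centroid sits at the tone's frequency and the ZCR is roughly twice the frequency divided by the sample rate; real music spreads energy over many frequencies, which is what makes these values discriminative between genres.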

Evaluation uses standard metrics: accuracy, F-score and AUC.
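
As a small worked example of these three metrics, the sketch below evaluates a toy 3-class problem with scikit-learn. The labels and probabilities are made up for illustration and are not project results; macro averaging and one-vs-rest AUC are one reasonable choice among several.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy 3-class example (made-up labels and probabilities)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],   # true class 2, predicted class 1
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])
y_pred = y_proba.argmax(axis=1)  # predicted class = highest probability

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
auc_ovr = roc_auc_score(y_true, y_proba, multi_class="ovr")

print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}, AUC (OvR)={auc_ovr:.3f}")
```

Accuracy and F-score only look at the predicted class, while AUC looks at how well the probabilities rank the true class against the others, so the three numbers can disagree on the same predictions.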

Other ideas could be speech classification or noise reduction, but both of those seem considerably more complex and less flexible (solely based on DL with RNN, LSTM and CNN with millions of parameters and extensive literature background).

Reference:
Bahuleyan, Hareesh. Music Genre Classification using Machine Learning Techniques. University of Waterloo, 2018.

Ideal: mid-sized problems

- version control
- work with real world dataset
- clean dataset
- maybe extract audio features
- optimize machine learning algorithm
- deploy

work with DL, expandable projects (CNN, like MGC)

- python
- librosa
- tensorflow/keras

In [50]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display
import IPython.display as ipd
import os
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve, roc_curve, roc_auc_score
import seaborn as sns


In [33]:
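# Loading an audio file for listening and analysis

audio_file = "../MLME_MiniProject/classical.00007.wav"
classical, sr = librosa.load(audio_file)  # librosa resamples to 22050 Hz by default
ipd.Audio(audio_file)  # last expression in the cell, so the audio player is displayed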

In [34]:
# Importing the data set CSV

data = pd.read_csv("../MusicGenreClassifier_MiniProject/Data/features_3_sec.csv")
print(data.head())  # Visualizing the first rows of the data


            filename  length  chroma_stft_mean  chroma_stft_var  rms_mean  \
0  blues.00000.0.wav   66149          0.335406         0.091048  0.130405   
1  blues.00000.1.wav   66149          0.343065         0.086147  0.112699   
2  blues.00000.2.wav   66149          0.346815         0.092243  0.132003   
3  blues.00000.3.wav   66149          0.363639         0.086856  0.132565   
4  blues.00000.4.wav   66149          0.335579         0.088129  0.143289   

# Data Preprocessing

In [35]:
# Dropping the length and filename columns
data = data.drop(['length', 'filename'], axis=1)  

In [36]:
y = data['label'] # Getting the label column
X = data.loc[:, data.columns != 'label'] # All columns except for label as input

print(f"X shape: {X.shape}, y shape: {y.shape}")

X shape: (9990, 57), y shape: (9990,)


In [37]:
print(X.head())

   chroma_stft_mean  chroma_stft_var  rms_mean   rms_var  \
0          0.335406         0.091048  0.130405  0.003521   
1          0.343065         0.086147  0.112699  0.001450   
2          0.346815         0.092243  0.132003  0.004620   
3          0.363639         0.086856  0.132565  0.002448   
4          0.335579         0.088129  0.143289  0.001701   

      spectral_centroid_mean  spectral_centroid_var  spectral_bandwidth_mean  \
0                1773.065032          167541.630869              1972.744388   
1                1816.693777           90525.690866  

In [38]:
# Using LabelEncoder to transform genres into numerical values

le = preprocessing.LabelEncoder()
le.fit(y)
# le.classes_ # outputs the labels
y_le = le.transform(y) # transforms the labels into numerical values
# le.inverse_transform(y) # retrieves the original labels afterwards

print(f"The label encoded output y_le is: {y_le}, original classes: {le.classes_}")

The label encoded output y_le is: [0 0 0 ... 9 9 9], original classes: ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']


# Data Normalization?

In [39]:
# from https://www.kaggle.com/code/andradaolteanu/work-w-audio-data-visualise-classify-recommend

#### NORMALIZE X ####

# Normalizing increased the accuracy by almost 20%

# Normalize so everything is on the same scale. 

cols = X.columns
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)

# new data frame with the new scaled data. 
X = pd.DataFrame(np_scaled, columns = cols)

# Data splitting

In [40]:
# Splitting the data into train, cross-validation and test using sklearn

X_train, X_, y_train, y_ = train_test_split(X, y_le, test_size=0.3, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_, y_, test_size=0.5, random_state=1)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_cv.shape", X_cv.shape, "y_cv.shape:", y_cv.shape)
print("X_test.shape:", X_test.shape, "y_test.shape", y_test.shape)

X_train.shape: (6993, 57) y_train.shape: (6993,)
X_cv.shape (1498, 57) y_cv.shape: (1498,)
X_test.shape: (1499, 57) y_test.shape (1499,)
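
One caveat with the cells above: `MinMaxScaler` is fit on all of `X` before the split, so the minimum and maximum of the validation and test rows leak into the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic stand-in data; the `_demo` arrays and their sizes are assumptions, not the project's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix (assumption: 200 samples, 5 features)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 5))
y_demo = rng.integers(0, 10, size=200)

# Same 70/15/15 split scheme as above
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_val_scaled = scaler.transform(X_val)    # reuse the training min/max
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this order the validation and test values may fall slightly outside [0, 1], which is expected: they are scaled with statistics the model could legitimately have seen during training.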


# Function to assess different models

In [45]:
def model_assess(model, title="Default"):
    """Fit the model on the training set and print its accuracy on the CV set."""
    model = model.fit(X_train, y_train)
    accuracy = model.score(X_cv, y_cv)
    print(f"Accuracy for {title} is {accuracy}")
    return model  # fitted estimator, so later cells can reuse it


# Logistic Regression

In [49]:
# Testing logistic regression
# From Microsoft 4.2

# Problems with the multinomial setting: scikit-learn asks for scaling
# the data or increasing max_iter. The second doesn't work, the first
# seemed out of scope, so 'ovr' (one-vs-rest) was used instead at first.
# The call below uses 'multinomial' on the min-max-scaled X.

# Similar results with the liblinear and lbfgs solvers
# (acc for lbfgs = 0.53)

lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=100000)
model_assess(lr, "Logistic Regression")
model = lr  # fit() trains lr in place; keep a handle for the cells below

# Support Vector Machine
svm = SVC(decision_function_shape="ovo")
model_assess(svm, "Support Vector Machine")

# Cross Gradient Booster - to implement or not?


Accuracy for Logistic Regression is 0.7002670226969292
Accuracy for Support Vector Machine is 0.7576769025367156
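
The same fit-and-score pattern extends to the other classifiers mentioned in the project description (random forest, gradient boosting). The sketch below loops over a dictionary of candidates on synthetic data from `make_classification`; the data, the hyperparameters and the `candidates` dictionary are illustrative assumptions, so the printed accuracies say nothing about the genre dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 600 samples, 20 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=4, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Fit each candidate on the training split and score it on the validation split
scores = {name: clf.fit(X_tr, y_tr).score(X_val, y_val) for name, clf in candidates.items()}

# Print the candidates from best to worst validation accuracy
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Keeping the models in a dictionary makes it easy to add or drop candidates without touching the evaluation loop.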


In [47]:
# Checking per-sample class probabilities, using Microsoft's function

# Single row test
cv_test = X_cv.iloc[50].values.reshape(1, -1)

y_cv_scores = model.predict_proba(X_cv) #cv_test for the single row test
classes = model.classes_
resultdf = pd.DataFrame(data=y_cv_scores, columns=classes)

# After the transpose, rows are classes and columns are CV samples;
# sort the classes by their probability for sample 0
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.764226 | 0.038262 | 0.01376 | 0.000898 | 0.007683 | 0.027945 | 0.000385 | 0.043317 | 0.046884 | 0.08365 | ... | 0.003461 | 0.108869 | 0.042488 | 0.011247 | 0.000562 | 0.124271 | 0.004905 | 0.124806 | 0.024727 | 0.070758 |
| 1 | 0.069429 | 0.001957 | 0.018166 | 0.000183 | 0.001234 | 0.000167 | 4.6e-05 | 0.074895 | 0.000856 | 0.011016 | ... | 0.001176 | 0.739149 | 0.006904 | 0.009639 | 0.00075 | 0.005049 | 0.000158 | 0.001985 | 0.713253 | 0.01076 |
| 3 | 0.050003 | 0.045113 | 0.019869 | 0.448341 | 0.013455 | 0.030274 | 0.101188 | 0.243108 | 0.051523 | 0.050641 | ... | 0.428619 | 0.006189 | 0.043928 | 0.018508 | 0.045488 | 0.055118 | 0.278171 | 0.000494 | 0.0343 | 0.006032 |
| 2 | 0.048035 | 0.030282 | 0.086804 | 0.003918 | 0.009612 | 0.432675 | 0.001077 | 0.114442 | 0.020938 | 0.359239 | ... | 0.070628 | 0.084852 | 0.587163 | 0.785164 | 0.022565 | 0.083361 | 0.036835 | 0.077173 | 0.036326 | 0.239054 |
| 9 | 0.034165 | 0.118028 | 0.043396 | 0.023867 | 0.020496 | 0.227226 | 0.001754 | 0.333696 | 0.094231 | 0.342078 | ... | 0.211242 | 0.014148 | 0.063749 | 0.008412 | 0.082185 | 0.358675 | 0.093537 | 0.009081 | 0.012271 | 0.083908 |
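
To read such probabilities as genre names instead of encoded integers, the top-k classes of one sample can be mapped back through the label encoder. The sketch below is self-contained: the probability row is made up, `top_k_genres` is a hypothetical helper, and it assumes that the probability columns follow the encoder's class order (which holds when the model was trained on the encoded labels 0 to 9).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
le_demo = LabelEncoder().fit(genres)

# Made-up probability row for a single sample (sums to 1)
proba_row = np.array([0.02, 0.05, 0.10, 0.03, 0.01, 0.60, 0.01, 0.08, 0.04, 0.06])


def top_k_genres(proba, encoder, k=3):
    """Return the k most probable (genre, probability) pairs, best first."""
    top_idx = np.argsort(proba)[::-1][:k]           # indices of the k largest values
    names = encoder.inverse_transform(top_idx)      # encoded integers back to genre names
    return [(str(name), float(p)) for name, p in zip(names, proba[top_idx])]


top3 = top_k_genres(proba_row, le_demo, k=3)
print(top3)
```

In the notebook, the same helper could be applied to a row of `y_cv_scores` together with the fitted `le`.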


In [None]:
y_cv_scores[:,1].shape

(1498,)

In [None]:
# Calculating the Receiver Operating Characteristic (ROC)
# 'Area Under the Curve' (AUC) score using the previous y_cv_scores

auc = roc_auc_score(y_cv, y_cv_scores, multi_class='ovr') # y_cv_scores[:, 1]
print(auc)


0.9496903919404678


In [None]:
# Classification report, also from scikit-learn

y_pred = model.predict(X_cv)
print(classification_report(y_cv, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.68      0.65       155
           1       0.88      0.90      0.89       155
           2       0.72      0.59      0.65       174
           3       0.60      0.62      0.61       141
           4       0.80      0.56      0.66       151
           5       0.73      0.80      0.77       148
           6       0.77      0.88      0.82       159
           7       0.74      0.82      0.78       142
           8       0.62      0.68      0.65       143
           9       0.48      0.43      0.45       130

    accuracy                           0.70      1498
   macro avg       0.70      0.70      0.69      1498
weighted avg       0.70      0.70      0.70      1498
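
In the report above, class 9 (rock, according to the label encoder output) has the lowest F1-score at 0.45. A confusion matrix is one way to see which genres a weak class is mistaken for, and `target_names` makes the report show genre names instead of integers. The sketch below uses a made-up 3-class example, so its numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions for three genres (illustration only)
class_names = ["blues", "classical", "rock"]
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred_demo = np.array([0, 2, 1, 1, 2, 0, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Same report as above, but labeled with genre names
report = classification_report(y_true_demo, y_pred_demo, target_names=class_names)
print(report)
```

In the notebook, the equivalent calls would be `confusion_matrix(y_cv, y_pred)` and `classification_report(y_cv, y_pred, target_names=le.classes_)`; the matrix could also be passed to `sns.heatmap`, since seaborn is already imported.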

