### EDM Music Classification
For this project I chose to create a model that predicts the specific subgenre an EDM song belongs to. 
I am using the [EDM Music Genres](https://www.kaggle.com/datasets/sivadithiyan/edm-music-genres/data) dataset from Kaggle.<br>

Since the data was already split into training and test sets, I created a class to read and merge the data files. After creating a dataframe, I also split the data into a set of labels `y` and a set of features `x`.

In [129]:
# import scipy as sp
# import scipy.stats as stats
import pandas as pd
import re
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import copy
# # Set color map to have light blue background
# sns.set()
# import statsmodels.formula.api as smf
# import statsmodels.api as sm
# %matplotlib inline

In [130]:
class import_data:
    def __init__(self):

        # Merge datasets so they can be split into test and train sets manually
        df = pd.concat([pd.read_csv('data/test_data_final.csv'),pd.read_csv('data/train_data_final.csv')])
        df = df.drop(['rmse_mean', 'rmse_std'],axis=1)

        # Just use the mean of each feature
        for c in df.columns:
            if 'std' in c:
                df = df.drop(c, axis=1)

        self.df = df

        # Split data into features (x) and labels (y)
        self.y = df['label']
        self.x = df.drop('label',axis=1)

    # Update x and y if df is updated
    def update_data(self, df):
        self.df = df
        self.y = df['label']
        self.x = df.drop('label',axis=1)
        
data = import_data()
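
The loop that drops the `std` columns can also be written as a single column filter. The sketch below is only an illustration: it runs on a small made-up dataframe (column names modeled on the dataset, values invented) instead of the real CSV files, and the helper name `keep_mean_features` is hypothetical.

```python
import pandas as pd


def keep_mean_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every column whose name contains 'std' (hypothetical helper)."""
    std_columns = [c for c in df.columns if 'std' in c]
    return df.drop(columns=std_columns)


# Toy data: invented values, column names modeled on the dataset
toy = pd.DataFrame({
    'mfcc1_mean': [-53.6, -52.1],
    'mfcc1_std': [10.2, 11.0],
    'rolloff_mean': [2227.5, 2189.9],
    'label': ['ambient', 'techno'],
})

features_and_label = keep_mean_features(toy)
print(list(features_and_label.columns))
```

Building the list of columns first and dropping them in one call avoids reassigning the dataframe inside the loop, but the result should be the same as the loop in the class.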

## Data Overview
### Features
Each row in the dataset corresponds to a 3-second audio clip taken from an EDM song. Each clip has 64 features, each representing the mean of a different audio measurement, such as spectral centroid (which measures how 'bright' a piece of music sounds).
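
To make the idea of 'brightness' concrete, here is a minimal sketch of a spectral centroid: the average frequency of a signal, weighted by how much energy each frequency carries. It is not how the dataset's features were produced. The sample rate, the two test tones, and the function name are assumptions chosen for illustration, and the whole signal is treated as one frame.

```python
import numpy as np


def spectral_centroid(signal: np.ndarray, sample_rate: int) -> float:
    """Magnitude-weighted mean frequency of a signal, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    frequencies = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(frequencies * magnitudes) / np.sum(magnitudes))


sample_rate = 8000                        # assumed sample rate for the illustration
t = np.arange(sample_rate) / sample_rate  # one second of sample times

low_tone = np.sin(2 * np.pi * 200 * t)    # "dark" 200 Hz tone
high_tone = np.sin(2 * np.pi * 2000 * t)  # "bright" 2000 Hz tone

print(spectral_centroid(low_tone, sample_rate))
print(spectral_centroid(high_tone, sample_rate))
```

The higher-pitched tone gives the larger centroid, which is the sense in which a clip with a higher `spectral_centroid_mean` sounds brighter.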

In [131]:
print(data.x.columns)
data.x.head()

Index(['spectral_centroid_mean', 'spectral_bandwidth_mean', 'rolloff_mean',
       'zero_crossing_rate_mean', 'mfcc1_mean', 'mfcc2_mean', 'mfcc3_mean',
       'mfcc4_mean', 'mfcc5_mean', 'mfcc6_mean', 'mfcc7_mean', 'mfcc8_mean',
       'mfcc9_mean', 'mfcc10_mean', 'mfcc11_mean', 'mfcc12_mean',
       'mfcc13_mean', 'mfcc14_mean', 'mfcc15_mean', 'mfcc16_mean',
       'mfcc17_mean', 'mfcc18_mean', 'mfcc19_mean', 'mfcc20_mean',
       'mfcc21_mean', 'mfcc22_mean', 'mfcc23_mean', 'mfcc24_mean',
       'mfcc25_mean', 'mfcc26_mean', 'mfcc27_mean', 'mfcc28_mean',
       'mfcc29_mean', 'mfcc30_mean', 'mfcc31_mean', 'mfcc32_mean',
       'mfcc33_mean', 'mfcc34_mean', 'mfcc35_mean', 'mfcc36_mean',
       'mfcc37_mean', 'mfcc38_mean', 'mfcc39_mean', 'mfcc40_mean',
       'chroma1_mean', 'chroma2_mean', 'chroma3_mean', 'chroma4_mean',
       'chroma5_mean', 'chroma6_mean', 'chroma7_mean', 'chroma8_mean',
       'chroma9_mean', 'chroma10_mean', 'chroma11_mean', 'chroma12_mean',
       'tonnetz1_mea

| | spectral_centroid_mean | spectral_bandwidth_mean | rolloff_mean | zero_crossing_rate_mean | mfcc1_mean | mfcc2_mean | mfcc3_mean | mfcc4_mean | mfcc5_mean | mfcc6_mean | ... | chroma11_mean | chroma12_mean | tonnetz1_mean | tonnetz2_mean | tonnetz3_mean | tonnetz4_mean | tonnetz5_mean | tonnetz6_mean | chroma_cqt_mean | spectral_contrast_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1102.736704 | 1566.536901 | 2227.527043 | 0.0412 | -53.611938 | 164.518143 | -7.087708 | 38.057152 | -16.607727 | 10.361456 | ... | 0.929799 | 0.477323 | -0.070403 | 0.147618 | 0.064746 | -0.113985 | 0.061702 | -0.012916 | 0.519082 | 22.407045 |
| 1 | 1132.941821 | 1562.966911 | 2189.926758 | 0.045035 | -52.066601 | 164.440659 | -11.349409 | 31.933271 | -17.954233 | 13.940317 | ... | 0.799746 | 0.443811 | -0.064409 | 0.145052 | 0.092985 | -0.135469 | 0.048009 | -0.030921 | 0.471938 | 22.588178 |
| 2 | 1096.436587 | 1570.798792 | 2146.943172 | 0.042616 | -51.251637 | 164.457367 | -5.636142 | 36.922642 | -17.254499 | 11.659356 | ... | 0.592361 | 0.458213 | -0.034737 | 0.180186 | 0.120014 | 0.050019 | 0.052763 | 0.010471 | 0.462152 | 22.300065 |
| 3 | 1139.86836 | 1609.516321 | 2273.07805 | 0.042533 | -48.97261 | 162.229477 | -7.740341 | 35.910172 | -15.405805 | 13.718477 | ... | 0.966027 | 0.431601 | -0.078172 | 0.126693 | 0.080715 | -0.222283 | 0.070434 | -0.039719 | 0.46648 | 21.888805 |
| 4 | 1107.523113 | 1549.296915 | 2076.960261 | 0.04639 | -53.172951 | 165.3694 | -8.298407 | 30.948669 | -20.563414 | 12.666272 | ... | 0.557843 | 0.490463 | -0.044983 | 0.156088 | 0.078666 | 0.014288 | 0.016941 | 0.006308 | 0.493172 | 22.513137 |


### Datatype
Calling `df.info()` shows that each feature is stored as a `float64` value, while the label is a string stored with the `object` dtype.

In [132]:
data.df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 0 to 31999
Data columns (total 65 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   spectral_centroid_mean   40000 non-null  float64
 1   spectral_bandwidth_mean  40000 non-null  float64
 2   rolloff_mean             40000 non-null  float64
 3   zero_crossing_rate_mean  40000 non-null  float64
 4   mfcc1_mean               40000 non-null  float64
 5   mfcc2_mean               40000 non-null  float64
 6   mfcc3_mean               40000 non-null  float64
 7   mfcc4_mean               40000 non-null  float64
 8   mfcc5_mean               40000 non-null  float64
 9   mfcc6_mean               40000 non-null  float64
 10  mfcc7_mean               40000 non-null  float64
 11  mfcc8_mean               40000 non-null  float64
 12  mfcc9_mean               40000 non-null  float64
 13  mfcc10_mean              40000 non-null  float64
 14  mfcc11_mean              40
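
Because the `info()` listing is long, a shorter check of which columns are numeric can be handy. The sketch below uses `select_dtypes` on a small made-up dataframe (invented values, column names modeled on the dataset); on the real data the same two calls would apply to `data.df`.

```python
import pandas as pd

# Toy data: invented values, column names modeled on the dataset
toy = pd.DataFrame({
    'spectral_centroid_mean': [1102.7, 1132.9],
    'mfcc1_mean': [-53.6, -52.1],
    'label': ['ambient', 'techno'],
})

# Columns holding numeric values (the features)
numeric_columns = toy.select_dtypes(include='number').columns.tolist()
# Everything else (here, only the label)
other_columns = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)
print(other_columns)
```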

### Labels
Each audio clip is labeled as one of 16 EDM subgenres.

In [133]:
print('genres: ', data.y.unique())
print('length:', len(data.y.unique()))

genres:  ['ambient' 'big_room_house' 'dnb' 'dubstep' 'future_garage_wave_trap'
 'hardcore' 'hardstyle' 'house' 'lofi' 'moombahton_reggaeton' 'phonk'
 'psytrance' 'synthwave' 'techno' 'trance' 'trap']
length: 16
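
Beyond listing the unique labels, it can be useful to see how many clips carry each label. The sketch below shows `nunique()` and `value_counts()` on a made-up series of labels; the counts are invented and say nothing about the real distribution in the dataset.

```python
import pandas as pd

# Toy labels: invented for illustration only
toy_labels = pd.Series(
    ['ambient', 'techno', 'ambient', 'trap', 'techno', 'ambient'],
    name='label',
)

n_genres = toy_labels.nunique()     # number of distinct labels
counts = toy_labels.value_counts()  # clips per label, most frequent first

print(n_genres)
print(counts)
```

On the real data the equivalent calls would be `data.y.nunique()` and `data.y.value_counts()`.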


## Data Cleaning
On Kaggle, the creator of the dataset noted that some of the clips that were used might be identical, which would lead to duplicate rows in the dataframe. These can be removed with the `.drop_duplicates()` method. Passing in `keep='first'` tells the method to keep the first instance of each duplicated row. Calling `.info()` again shows 39,660 remaining rows, so 340 duplicate rows were removed.

In [134]:
no_duplicates = data.df.drop_duplicates(keep='first')
data.update_data(no_duplicates)
data.x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39660 entries, 0 to 31999
Data columns (total 64 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   spectral_centroid_mean   39660 non-null  float64
 1   spectral_bandwidth_mean  39660 non-null  float64
 2   rolloff_mean             39660 non-null  float64
 3   zero_crossing_rate_mean  39660 non-null  float64
 4   mfcc1_mean               39660 non-null  float64
 5   mfcc2_mean               39660 non-null  float64
 6   mfcc3_mean               39660 non-null  float64
 7   mfcc4_mean               39660 non-null  float64
 8   mfcc5_mean               39660 non-null  float64
 9   mfcc6_mean               39660 non-null  float64
 10  mfcc7_mean               39660 non-null  float64
 11  mfcc8_mean               39660 non-null  float64
 12  mfcc9_mean               39660 non-null  float64
 13  mfcc10_mean              39660 non-null  float64
 14  mfcc11_mean              39
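
The number of removed rows can also be checked directly instead of comparing two `info()` outputs. This is a minimal sketch on a made-up dataframe (invented values) in which the third row repeats the first; `duplicated(keep='first')` marks every repeat after the first occurrence.

```python
import pandas as pd

# Toy data: the third row is an exact copy of the first
toy = pd.DataFrame({
    'mfcc1_mean': [-53.6, -52.1, -53.6],
    'rolloff_mean': [2227.5, 2189.9, 2227.5],
    'label': ['ambient', 'techno', 'ambient'],
})

# Count the rows that repeat an earlier row
n_duplicates = int(toy.duplicated(keep='first').sum())

# Drop them, keeping the first occurrence of each row
deduplicated = toy.drop_duplicates(keep='first')

print(n_duplicates)
print(len(toy) - len(deduplicated))
```

Both printed numbers should match, since each flagged row is exactly one row that `drop_duplicates` removes.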

### Check for null values
It is also a good idea to check for any null values in the dataframe. Calling `.isnull().sum()` gives the number of null values in each column. Calling `.sum()` again on that result returns the total number of null values in the dataframe. The output below shows that there are no null values that need to be removed.

In [135]:
data.df.isnull().sum().sum()

0
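
To show what the two `.sum()` calls do when nulls are present, here is a small sketch on a made-up dataframe with three missing entries (values invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column
toy = pd.DataFrame({
    'mfcc1_mean': [-53.6, np.nan, -51.3],
    'rolloff_mean': [2227.5, 2189.9, np.nan],
    'label': ['ambient', 'techno', None],
})

# First sum: null count per column
nulls_per_column = toy.isnull().sum()

# Second sum: total null count across the whole dataframe
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(total_nulls)
```

The per-column view is the more useful one when the total is nonzero, because it shows which columns need attention.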