This notebook performs the following preliminary tasks:
    - Unzipping data folder
    - Investigating categories/subfolders
    - Investigating metadata
    - Suggesting points of action related to Task 1, 2 and 3

Unzip the music tribe ml test zip folder

In [None]:
!unzip music_tribe_ml_test.zip

The work is split across two IPython notebooks: this preliminary one, which investigates the metadata to make the tasks easier, and a second one for the tasks themselves. Below is a sanity check of the working directory:

In [1]:
!ls

Initial investigation.ipynb
Tasks 2 and 3.ipynb
music_tribe_ml_test
report.pdf
util.py


There are nine categories in the dataset, from bass to vocals

In [2]:
!ls music_tribe_ml_test/data/

bass
guitar
hihat
kick
piano
saxophone
snare
tom
vocals


In [3]:
# store categories for later as a list of strings
categories = !ls music_tribe_ml_test/data/ 

In [4]:
categories

['bass',
 'guitar',
 'hihat',
 'kick',
 'piano',
 'saxophone',
 'snare',
 'tom',
 'vocals']

As a test example, look at one JSON file from the bass category

In [5]:
!cat music_tribe_ml_test/data/bass/ml_test_data_bass_1.json

{
"metadata": {
    "version": {
        "essentia": "2.1-dev"
    }
},
"frameSize": [0],
"hopSize": [0],
"local": [0],
"numSamples": [16],
"sampleRate": [44100],
"sample_1": {
    "centroid": [3366.9440918],
    "end": [61440],
    "energy": [432.869293213],
    "flatness": [0.0673862621188],
    "flux": [0.200650200248],
    "spectralComplexity": [11],
    "start": [39936],
    "zeroCrossingRate": [0.00995163712651],
    "gfcc": [[-41.4613838196, 51.7674179077, -14.6619844437, 48.881816864, 7.87215137482, 5.79177284241, -16.4220085144, 8.27585315704, -16.4037017822, -0.0769567489624, -3.21247673035, 2.71528244019, -7.92222976685]],
    "lpc": [[1, -2.66367602348, 3.01420617104, -1.86377108097, 0.441763997078, 0.236569702625, -0.247302025557, 0.0417397618294, 0.0533112436533, 0.0510247573256, -0.0635518655181]],
    "mfcc": [[-886.197509766, 145.798126221, 19.5697784424, 56.4283218384, 39.3673744202, 44.4896240234, 9.14548110962, 20.0837497711, 11.7885322571, 8.00518989563, 6.09222030

Points to consider:
    
There are many samples, each keyed as sample_<num>. At a quick glance, each sample has the following features: centroid, end, energy, flatness, flux, spectral complexity, start, zero-crossing rate, gfcc (gammatone-frequency cepstral coefficients), lpc (linear predictive coding coefficients), mfcc (mel-frequency cepstral coefficients) and the onset time.

It appears that gfcc, lpc and mfcc have 13, 11 and 13 entries respectively, with the other features having 1 entry each.
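The entry counts can also be read programmatically rather than by eye. The following is a minimal sketch that assumes the layout seen in the bass example; the JSON string is a synthetic stand-in with placeholder numbers, and `feature_lengths` is a hypothetical helper name.

```python
import json

# Synthetic stand-in for one file: same layout as the bass example,
# but with placeholder numbers instead of the real coefficients.
raw = json.dumps({
    "sampleRate": [44100],
    "sample_1": {
        "centroid": [3366.9],
        "gfcc": [[0.0] * 13],
        "lpc": [[0.0] * 11],
        "mfcc": [[0.0] * 13],
    },
})


def feature_lengths(sample):
    """Map each feature name to its number of scalar entries."""
    lengths = {}
    for key, value in sample.items():
        first = value[0]
        # Vector features are wrapped twice ([[...]]); scalar features once ([x]).
        lengths[key] = len(first) if isinstance(first, list) else len(value)
    return lengths


data = json.loads(raw)
lengths = feature_lengths(data["sample_1"])
print(lengths)
```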

I had not heard of gfcc, lpc and mfcc before this time

MFCC: cepstral coefficients computed from the log spectrum after it has been warped onto the mel scale, a non-linear frequency scale that approximates how humans perceive pitch

GFCC: analogous to MFCC but computed with a gammatone filterbank; often reported to be more robust to noise than MFCC

LPC: coefficients of a linear predictor that models each audio sample as a linear combination of the preceding samples
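To make the non-linearity of the mel scale concrete, here is a small sketch using one widely used conversion formula, mel = 2595 * log10(1 + f / 700). Other conventions exist, and this is not necessarily the one Essentia applied when these files were generated; the function names are illustrative.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mel with the 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz shrink on the mel scale as the frequency grows.
for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```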


Need to check:

Whether the sample rate, frame size, hop size and local are fixed across all JSON files in each of the 9 categories, so that comparisons are consistent. If they are not, further preprocessing may be needed.

In [6]:
import os, json
import pandas as pd

# initialise list of json files across all categories
num_files_per_category = []
json_files = [] 

for category in categories:
    # this finds the json files in a given category
    path_to_json = 'music_tribe_ml_test/data/'+category+'/'
    
    
    # list files in path
    # check if file ends with a .json extension
    # if so add to list
    category_files = [category+'/'+json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
    
    num_files_per_category.append(len(category_files))
    
    # add the elements of category files to the end of json files
    json_files.extend(category_files) 

In [7]:
num_files_per_category

[400, 400, 400, 400, 8, 23, 400, 400, 400]

There are only 8 JSON files in the piano category and 23 in the saxophone category, against 400 in each of the others. This suggests a class imbalance, so accuracy alone could be misleading and 2 classification metrics will probably be needed: the confusion matrix can be analysed alongside the overall accuracy score. This class imbalance will be further investigated in the exploratory data analysis section of the IPython notebook titled Tasks 2 and 3.
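As a rough way to quantify the imbalance, the file counts above can be turned into class shares and a largest-to-smallest ratio. This is only a sketch: it works at file level, while the sample-level balance may differ because each file holds a varying number of samples.

```python
# File counts per category, copied from the output above.
categories = ['bass', 'guitar', 'hihat', 'kick', 'piano',
              'saxophone', 'snare', 'tom', 'vocals']
num_files_per_category = [400, 400, 400, 400, 8, 23, 400, 400, 400]

total = sum(num_files_per_category)
shares = {c: n / total for c, n in zip(categories, num_files_per_category)}
imbalance_ratio = max(num_files_per_category) / min(num_files_per_category)

for category, share in shares.items():
    print(f"{category:10s} {share:6.1%}")
print("largest / smallest class:", imbalance_ratio)
```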

In [8]:
len(json_files)

2831

In total there are 2,831 JSON files

In [9]:
json_files

['bass/ml_test_data_bass_1.json',
 'bass/ml_test_data_bass_10.json',
 'bass/ml_test_data_bass_100.json',
 'bass/ml_test_data_bass_101.json',
 'bass/ml_test_data_bass_102.json',
 'bass/ml_test_data_bass_103.json',
 'bass/ml_test_data_bass_104.json',
 'bass/ml_test_data_bass_105.json',
 'bass/ml_test_data_bass_106.json',
 'bass/ml_test_data_bass_107.json',
 'bass/ml_test_data_bass_108.json',
 'bass/ml_test_data_bass_109.json',
 'bass/ml_test_data_bass_11.json',
 'bass/ml_test_data_bass_110.json',
 'bass/ml_test_data_bass_111.json',
 'bass/ml_test_data_bass_112.json',
 'bass/ml_test_data_bass_113.json',
 'bass/ml_test_data_bass_114.json',
 'bass/ml_test_data_bass_115.json',
 'bass/ml_test_data_bass_116.json',
 'bass/ml_test_data_bass_117.json',
 'bass/ml_test_data_bass_118.json',
 'bass/ml_test_data_bass_119.json',
 'bass/ml_test_data_bass_12.json',
 'bass/ml_test_data_bass_120.json',
 'bass/ml_test_data_bass_121.json',
 'bass/ml_test_data_bass_122.json',
 'bass/ml_test_data_bass_123.json

In [10]:
# DataFrame with one row per JSON file, holding the fields I want to extract
jsons_data = pd.DataFrame(columns=['frameSize', 'hopSize', 'local', 'numSamples', 'sampleRate', 'gfcc_length', 'lpc_length', 'mfcc_length'])

path_to_dataFolder = 'music_tribe_ml_test/data/'

# need both the json path and an index number
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_dataFolder, js)) as json_file:
        json_text = json.load(json_file)

    # fresh lists for every file, so each row only records that file's
    # unexpected lengths (a single shared list would leak into every row)
    gfcc_length = []
    lpc_length = []
    mfcc_length = []

    # only one entry in these features hence [0]
    frameSize = json_text['frameSize'][0]
    hopSize = json_text['hopSize'][0]
    local = json_text['local'][0]
    numSamples = json_text['numSamples'][0]
    sampleRate = json_text['sampleRate'][0]

    for name in json_text:
        # 'sample_' (with underscore) does not match the 'sampleRate' key
        if name.startswith("sample_"):

            check1 = len(json_text[name]['gfcc'][0])
            if check1 != 13:  # TEST: expect an empty list if all are 13
                gfcc_length.append(check1)

            check2 = len(json_text[name]['lpc'][0])
            if check2 != 11:  # TEST: expect an empty list if all are 11
                lpc_length.append(check2)

            check3 = len(json_text[name]['mfcc'][0])
            if check3 != 13:  # TEST: expect an empty list if all are 13
                mfcc_length.append(check3)

    # push the extracted values into the DataFrame at the row given by 'index'
    jsons_data.loc[index] = [frameSize, hopSize, local, numSamples, sampleRate, gfcc_length, lpc_length, mfcc_length]

In [11]:
# jsons_data is already a pandas DataFrame; keep it under a clearer name
metaData = pd.DataFrame(jsons_data)

In [12]:
metaData

   frameSize  hopSize  local  numSamples  sampleRate  gfcc_length  lpc_length  mfcc_length
0          0        0      0          16       44100           []          []           []
1          0        0      0          10       44100           []          []           []
2          0        0      0           3       44100           []          []           []
3          0        0      0           4       44100           []          []           []
4          0        0      0           5       44100           []          []           []
5          0        0      0           2       44100           []          []           []
6          0        0      0           4       44100           []          []           []
7          0        0      0           5       44100           []          []           []
8          0        0      0           7       44100           []          []           []
9          0        0      0           1       44100           []          []           []


Sanity check to see whether each of these fields, apart from the number of samples, is constant across all files

In [13]:
metaData['frameSize'].min() == metaData['frameSize'].max()

True

In [14]:
metaData['hopSize'].min() == metaData['hopSize'].max()

True

In [15]:
metaData['local'].min() == metaData['local'].max()

True

In [16]:
# result is expected, the number of samples can change per json file
metaData['numSamples'].min() == metaData['numSamples'].max()

False

In [17]:
metaData['sampleRate'].min() == metaData['sampleRate'].max()

True

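The three length columns are expected to hold only empty lists, which means that every sample has 13 GFCC and 13 MFCC coefficients and 11 LPC coefficients as features. The checks below confirm that the entries are identical across all files; since the first rows shown above are empty, every entry is empty.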

In [18]:
metaData['gfcc_length'].min() == metaData['gfcc_length'].max() 

True

In [19]:
metaData['lpc_length'].min() == metaData['lpc_length'].max() 

True

In [20]:
metaData['mfcc_length'].min() == metaData['mfcc_length'].max() 

True

To summarise, every file is constructed using the same sample rate, local, hop size and frame size. In addition, the sample features gfcc (13), mfcc (13) and lpc (11) have the same length regardless of the file, which enables consistent comparisons.

Finally, I check the total number of samples present in the dataset. This is not the same as the number of JSON files, which is 2,831 as seen above.

In [21]:
metaData['numSamples'].sum()

8869

There are 8869 samples across the data files provided by Music Tribe
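The total above relies on the declared numSamples field. A possible extra sanity check, sketched here under the assumption that every sample is stored under a key starting with "sample_", is to compare the declared value with the number of such keys in each file. The helper names and the example dictionary are hypothetical.

```python
def count_sample_entries(json_text):
    """Count the keys that hold a sample; 'sampleRate' lacks the underscore."""
    return sum(1 for name in json_text if name.startswith("sample_"))


def declared_matches_actual(json_text):
    """True when the numSamples field equals the number of sample_ keys."""
    return json_text["numSamples"][0] == count_sample_entries(json_text)


# Synthetic stand-in for a parsed file, not real data.
example = {"numSamples": [2], "sampleRate": [44100], "sample_1": {}, "sample_2": {}}
print(declared_matches_actual(example))
```

Run over every parsed file, any False result would flag a file whose metadata disagrees with its contents.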

Ideas moving forward:

Check the degree of class imbalance across samples in each of the 9 categories

Collect the features into a dataframe: 13 for gfcc, 13 for mfcc, 11 for lpc and 1 each for centroid, end, energy, flatness, flux, spectral complexity, start, zero crossing rate and onset time.

This should result in (13 + 13 + 11 + 9 =) 46 features for each of the 8,869 samples. The label of each sample corresponds to one of the 9 categories.
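One way the flattening into 46 columns might look is sketched below. The sample is synthetic with placeholder values, `flatten_sample` is a hypothetical helper, and the key name "onset" is an assumption, since the onset-time key is not visible in the truncated output above.

```python
def flatten_sample(sample):
    """Flatten one sample dict into (names, values), one entry per scalar."""
    names, values = [], []
    for key in sorted(sample):
        entries = sample[key]
        # Unwrap the extra list level used by vector features: [[...]] -> [...]
        if len(entries) == 1 and isinstance(entries[0], list):
            entries = entries[0]
        if len(entries) == 1:
            names.append(key)
            values.append(entries[0])
        else:
            for i, value in enumerate(entries):
                names.append(f"{key}_{i}")
                values.append(value)
    return names, values


# Synthetic sample with placeholder numbers.
sample = {
    "centroid": [3366.9], "end": [61440], "energy": [432.9],
    "flatness": [0.067], "flux": [0.2], "spectralComplexity": [11],
    "start": [39936], "zeroCrossingRate": [0.0099],
    "onset": [0.0],  # hypothetical key name for the onset time
    "gfcc": [[0.0] * 13], "lpc": [[0.0] * 11], "mfcc": [[0.0] * 13],
}
names, values = flatten_sample(sample)
print(len(values))
```

Each flattened row, together with its category label, would then become one row of the feature dataframe.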

The split under consideration is a double hold-out split, giving a 60:20:20 ratio for the training, validation and test sets. In addition, stratified sampling will be used so that each set keeps the class proportions of the full dataset.
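A stratified 60:20:20 split can be sketched with the standard library alone, as below. This is an illustrative implementation rather than the one used in the Tasks notebook; the labels are synthetic and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions.

    fractions holds the train and validation shares; test gets the remainder.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


labels = ["kick"] * 10 + ["piano"] * 5  # synthetic labels
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

With very small classes, rounding can leave the validation or test portion of a class empty, which is worth checking for the rarest categories.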

The exploratory data analysis on the training dataset will gauge, through visualisation and some statistical analysis, whether extra preprocessing of the features, such as normalisation, should be done before they are fed into a classification model. I will note any interesting observations made.
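If normalisation does turn out to be needed, one common option is z-score standardisation with the statistics fitted on the training split only and then reused for validation and test data. The sketch below assumes that choice; the rows are made-up numbers and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardiser(rows):
    """Compute per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, which would otherwise divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardise(rows, means, stds):
    """Apply previously fitted statistics to any split."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]  # made-up values
val_rows = [[2.0, 300.0]]

means, stds = fit_standardiser(train_rows)
train_scaled = standardise(train_rows, means, stds)
val_scaled = standardise(val_rows, means, stds)
print(val_scaled)
```

Fitting on the training split alone keeps information from the validation and test sets out of the preprocessing step.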

Because of the nature of the data, for example that it is not traditional text data, I decided against using a simple naive Bayes model. I initially planned to use a support vector machine, but its greater complexity and lower interpretability compared with a decision tree made me choose the decision tree instead. I also decided to make use of an instance-based model, hence the K nearest neighbours model. As a result, the models in mind are the K nearest neighbours and decision tree classifiers. Depending on the performance of these models on the validation dataset, an extra model might be investigated before the final 2 models (including the baseline) are decided.
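To illustrate what "instance-based" means here, the following is a bare-bones K nearest neighbours sketch using Euclidean distance and a majority vote. It is not the implementation planned for the tasks (a library classifier would normally be used); the points and labels are synthetic.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Synthetic two-feature points forming two well-separated groups.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
train_y = ["kick", "kick", "kick", "hihat", "hihat", "hihat"]

print(knn_predict(train_x, train_y, [0.15, 0.1]))
print(knn_predict(train_x, train_y, [5.0, 5.1]))
```

Because the prediction depends directly on distances, features on very different scales would dominate the vote, which is one reason the normalisation question matters for this model.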

As alluded to earlier, the generalisation performance of the final chosen model will be evaluated in 2 ways: accuracy, and a confusion matrix showing how each class or category is labelled by the chosen model. In addition, a simple baseline will help in judging how well the model performs given the class imbalance. The baseline could be a classifier that predicts classes at random in the proportions seen in the training split (stratified classifier) or one that always outputs the most frequent class (majority class classifier). The models will be compared and evaluated against these baselines.
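The two metrics and the majority class baseline can be sketched in a few lines, shown here with the standard library on synthetic labels. The function names are illustrative and the numbers are not results from this dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def majority_class_baseline(y_train, n_predictions):
    """Always predict the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n_predictions


# Synthetic, imbalanced labels.
y_train = ["kick"] * 8 + ["piano"] * 2
y_test = ["kick", "kick", "kick", "kick", "piano"]
classes = ["kick", "piano"]

baseline_pred = majority_class_baseline(y_train, len(y_test))
print(accuracy(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred, classes))
```

In this toy case the baseline scores a high accuracy while never predicting the rare class, which the second row of the confusion matrix makes visible.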