# Preparing the MagnaTagATune Dataset for Music Genre Classification

**Arun Das**    
Research Fellow,    
Open Cloud Institute,    
The University of Texas at San Antonio.    
arun.das@my.utsa.edu

A little bit about me: I am a computer engineer by trade, with a research concentration in cloud computing and deep learning. 
I started researching deep learning only a year and a half ago, with an emphasis on computer vision. 
So I'm still on the learning curve when it comes to advanced DL topics in other areas. 
This notebook is the first step in the deep learning pipeline for an interesting music genre classification problem: the intense data science part, where you prepare the dataset the way you want it.   

The dataset used for the project is [MagnaTagATune](http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset). It has more than 25K mp3 files. The aim of the project is to use a deep neural network to predict the genre of a piece of music, given the mp3 as input. This is achieved through a combination of convolutional and recurrent neural networks working together.

I used pandas to work with the annotations file, which describes each of the 25K mp3 files. The annotations contain the tags (including genre), the clip id, the mp3 file location, etc. Pandas is an easy, flexible and powerful tool with many data structures and functions for data analysis, time series analysis and statistics. After the annotations are processed, each raw mp3 file needs to be converted into a Mel-scaled power spectrogram. We use librosa to do it. You can see an example [here](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html). 

Let's do it then.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import shutil
import librosa
# Set number of columns to show in the notebook
pd.set_option('display.max_columns', 200)
# Set number of rows to show in the notebook
pd.set_option('display.max_rows', 50) 
# Import MatPlotLib Package
import matplotlib.pyplot as plt

# Make the graphs a bit prettier.
# The pandas option 'display.mpl_style' is deprecated, so set a matplotlib style instead.
plt.style.use('ggplot')

# Display pictures within the notebook itself
%matplotlib inline


In [2]:
# Read the annotations file
newdata = pd.read_csv('annotations_final.csv', sep="\t")

In [3]:
# Display the top 5 rows
newdata.head(5)

Unnamed: 0,clip_id,no voice,singer,duet,plucking,hard rock,world,bongos,harpsichord,female singing,clasical,sitar,chorus,female opera,male vocal,vocals,clarinet,heavy,silence,beats,men,woodwind,funky,no strings,chimes,foreign,no piano,horns,classical,female,no voices,soft rock,eerie,spacey,jazz,guitar,quiet,no beat,banjo,electric,solo,violins,folk,female voice,wind,happy,ambient,new age,synth,funk,no singing,middle eastern,trumpet,percussion,drum,airy,voice,repetitive,birds,space,strings,bass,harpsicord,medieval,male voice,girl,keyboard,acoustic,loud,classic,string,drums,electronic,not classical,chanting,no violin,not rock,no guitar,organ,no vocal,talking,choral,weird,opera,soprano,fast,acoustic guitar,electric guitar,male singer,man singing,classical guitar,country,violin,electro,reggae,tribal,dark,male opera,no vocals,irish,electronica,horn,operatic,arabic,lol,low,instrumental,trance,chant,strange,drone,synthesizer,heavy metal,modern,disco,bells,man,deep,fast beat,industrial,hard,harp,no flute,jungle,pop,lute,female vocal,oboe,mellow,orchestral,viola,light,echo,piano,celtic,male vocals,orchestra,eastern,old,flutes,punk,spanish,sad,sax,slow,male,blues,vocal,indian,no singer,scary,india,woman,woman singing,rock,dance,piano solo,guitars,no drums,jazzy,singing,cello,calm,female vocals,voices,different,techno,clapping,house,monks,flute,not opera,not english,oriental,beat,upbeat,soft,noise,choir,female singer,rap,metal,hip hop,quick,water,baroque,women,fiddle,english,mp3_path
0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
1,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
2,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
3,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
4,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...


In [4]:
# Get to know the data better
newdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25863 entries, 0 to 25862
Columns: 190 entries, clip_id to mp3_path
dtypes: int64(189), object(1)
memory usage: 37.5+ MB


In [5]:
# What columns are there?
newdata.columns

Index([u'clip_id', u'no voice', u'singer', u'duet', u'plucking', u'hard rock',
       u'world', u'bongos', u'harpsichord', u'female singing',
       ...
       u'rap', u'metal', u'hip hop', u'quick', u'water', u'baroque', u'women',
       u'fiddle', u'english', u'mp3_path'],
      dtype='object', length=190)

In [6]:
# Extract the clip_id and mp3_path
newdata[["clip_id", "mp3_path"]]

Unnamed: 0,clip_id,mp3_path
0,2,f/american_bach_soloists-j_s__bach_solo_cantat...
1,6,f/american_bach_soloists-j_s__bach_solo_cantat...
2,10,f/american_bach_soloists-j_s__bach_solo_cantat...
3,11,f/american_bach_soloists-j_s__bach_solo_cantat...
4,12,f/american_bach_soloists-j_s__bach_solo_cantat...
5,14,c/lvx_nova-lvx_nova-01-contimune-30-59.mp3
6,19,c/lvx_nova-lvx_nova-01-contimune-175-204.mp3
7,21,c/lvx_nova-lvx_nova-01-contimune-233-262.mp3
8,23,c/lvx_nova-lvx_nova-01-contimune-291-320.mp3
9,25,0/american_bach_soloists-j_s__bach__cantatas_v...


In [7]:
# The previous command returned a DataFrame. We need plain arrays to iterate over later.
# Extract clip_id and mp3_path as NumPy arrays (.values is used because .as_matrix() is deprecated).
clip_id, mp3_path = newdata["clip_id"].values, newdata["mp3_path"].values

In [8]:
# Some of the tags in the dataset are near-synonyms of each other. Let's merge them.
synonyms = [['beat', 'beats'],
            ['chant', 'chanting'],
            ['choir', 'choral'],
            ['classical', 'clasical', 'classic'],
            ['drum', 'drums'],
            ['electro', 'electronic', 'electronica', 'electric'],
            ['fast', 'fast beat', 'quick'],
            ['female', 'female singer', 'female singing', 'female vocals', 'female vocal', 'female voice', 'woman', 'woman singing', 'women'],
            ['flute', 'flutes'],
            ['guitar', 'guitars'],
            ['hard', 'hard rock'],
            ['harpsichord', 'harpsicord'],
            ['heavy', 'heavy metal', 'metal'],
            ['horn', 'horns'],
            ['india', 'indian'],
            ['jazz', 'jazzy'],
            ['male', 'male singer', 'male vocal', 'male vocals', 'male voice', 'man', 'man singing', 'men'],
            ['no beat', 'no drums'],
            ['no singer', 'no singing', 'no vocal','no vocals', 'no voice', 'no voices', 'instrumental'],
            ['opera', 'operatic'],
            ['orchestra', 'orchestral'],
            ['quiet', 'silence'],
            ['singer', 'singing'],
            ['space', 'spacey'],
            ['string', 'strings'],
            ['synth', 'synthesizer'],
            ['violin', 'violins'],
            ['vocal', 'vocals', 'voice', 'voices'],
            ['strange', 'weird']]

In [9]:
# Merge each synonym group into its first tag and drop the other columns of the group.
"""
Example:
Merge 'beat', 'beats' and save it to 'beat'.
Merge 'classical', 'clasical', 'classic' and save it to 'classical'.
"""
for synonym_list in synonyms:
    newdata[synonym_list[0]] = newdata[synonym_list].max(axis=1)
    newdata.drop(synonym_list[1:], axis=1, inplace=True)
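
To see what this loop does on a small scale, here is a minimal sketch on a made-up three-clip frame. The `toy` frame, its values and the `toy_synonyms` list are invented for illustration and are not taken from the dataset. Because the tag columns only hold 0 or 1, the row-wise maximum behaves like a logical OR across the synonyms.

```python
import pandas as pd

# Hypothetical toy annotations: three clips, two synonym tags plus one unrelated tag.
toy = pd.DataFrame({
    "clip_id": [1, 2, 3],
    "beat":    [0, 1, 0],
    "beats":   [1, 0, 0],
    "piano":   [0, 0, 1],
})

toy_synonyms = [["beat", "beats"]]

for synonym_list in toy_synonyms:
    # Row-wise max over 0/1 columns is 1 if any synonym was tagged (logical OR).
    toy[synonym_list[0]] = toy[synonym_list].max(axis=1)
    # Keep only the first name of each group.
    toy = toy.drop(columns=synonym_list[1:])

print(toy)
```

Clip 1 was only tagged `beats`, yet it ends up with `beat == 1` after the merge, which is the behaviour we want from the real loop.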

In [10]:
# Did it work?
newdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25863 entries, 0 to 25862
Columns: 136 entries, clip_id to mp3_path
dtypes: int64(135), object(1)
memory usage: 26.8+ MB


In [11]:
# Let's view it.
newdata.head()

Unnamed: 0,clip_id,singer,duet,plucking,world,bongos,harpsichord,sitar,chorus,female opera,clarinet,heavy,woodwind,funky,no strings,chimes,foreign,no piano,classical,female,soft rock,eerie,jazz,guitar,quiet,no beat,banjo,solo,folk,wind,happy,ambient,new age,synth,funk,middle eastern,trumpet,percussion,drum,airy,repetitive,birds,space,bass,medieval,girl,keyboard,acoustic,loud,string,not classical,no violin,not rock,no guitar,organ,talking,opera,soprano,fast,acoustic guitar,electric guitar,classical guitar,country,violin,electro,reggae,tribal,dark,male opera,irish,horn,arabic,lol,low,trance,chant,strange,drone,modern,disco,bells,deep,industrial,hard,harp,no flute,jungle,pop,lute,oboe,mellow,viola,light,echo,piano,celtic,orchestra,eastern,old,punk,spanish,sad,sax,slow,male,blues,vocal,no singer,scary,india,rock,dance,piano solo,cello,calm,different,techno,clapping,house,monks,flute,not opera,not english,oriental,beat,upbeat,soft,noise,choir,rap,hip hop,water,baroque,fiddle,english,mp3_path
0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
1,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
2,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
3,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
4,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...


In [12]:
# Drop the mp3_path column from the dataframe
newdata.drop('mp3_path', axis=1, inplace=True)
# Sum each column: for a 0/1 tag column this is the number of clips carrying that tag
data = newdata.sum(axis=0)

In [13]:
# Find the distribution of tags.
data

clip_id         736770326
singer               1308
duet                   74
plucking               69
world                  65
bongos                 52
harpsichord          1123
sitar                 926
chorus                241
female opera           85
clarinet               49
heavy                 583
woodwind               38
funky                 195
no strings             56
chimes                 77
foreign               275
no piano              324
classical            4358
female               2067
soft rock              54
eerie                  65
jazz                  555
guitar               4861
quiet                1072
                  ...    
rock                 2371
dance                 649
piano solo             45
cello                 575
calm                  131
different             118
techno               2954
clapping               59
house                  90
monks                  51
flute                1035
not opera              61
not english 

In [14]:
# Sort the tags by their counts, in ascending order.
data.sort_values(axis=0, inplace=True)

In [15]:
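# Find the top tags: after the ascending sort, the last 51 entries have the largest sums.
topindex, topvalues = list(data.index[84:]), data.values[84:]
# The very last entry is clip_id (its sum is the largest). It is not a tag, so remove it.
del topindex[-1]
topvalues = np.delete(topvalues, -1)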

In [16]:
# Get the top column names
topindex

['no beat',
 'folk',
 'trance',
 'foreign',
 'orchestra',
 'baroque',
 'chant',
 'hard',
 'no piano',
 'modern',
 'bass',
 'eastern',
 'country',
 'jazz',
 'cello',
 'heavy',
 'harp',
 'strange',
 'dance',
 'new age',
 'choir',
 'solo',
 'sitar',
 'soft',
 'pop',
 'flute',
 'quiet',
 'loud',
 'harpsichord',
 'opera',
 'singer',
 'india',
 'synth',
 'violin',
 'ambient',
 'piano',
 'female',
 'beat',
 'male',
 'fast',
 'rock',
 'no singer',
 'drum',
 'electro',
 'vocal',
 'string',
 'techno',
 'slow',
 'classical',
 'guitar']

In [17]:
# Get only the top column values
topvalues

array([ 242,  243,  253,  275,  296,  297,  312,  323,  324,  327,  337,
        406,  541,  555,  575,  583,  623,  640,  649,  650,  791,  826,
        926,  985,  995, 1035, 1072, 1086, 1123, 1298, 1308, 1402, 1734,
       1907, 1956, 2056, 2067, 2123, 2169, 2331, 2371, 2550, 2698, 2764,
       2813, 2842, 2954, 3547, 4358, 4861])
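
The slicing above depends on knowing that exactly 84 columns fall below the cut. As an alternative, here is a minimal sketch that asks pandas for the N most frequent tags directly with `Series.nlargest`. The tag counts are a handful copied from the output above, the `clip_id` sum is a placeholder, and `N = 3` is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical per-column sums; clip_id is an identifier, not a tag.
counts = pd.Series({"clip_id": 999999, "guitar": 4861, "classical": 4358,
                    "duet": 74, "slow": 3547, "woodwind": 38})

# Drop the identifier first so it cannot be mistaken for a frequent tag.
tag_counts = counts.drop("clip_id")

# Keep the N most frequent tags; everything else is a candidate for removal.
N = 3
top_tags = tag_counts.nlargest(N)
rare_tags = tag_counts.index.difference(top_tags.index)

print(top_tags)
print(list(rare_tags))
```

With this approach the number of kept tags is stated explicitly, and there is no need to delete the `clip_id` entry afterwards.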

In [18]:
# Get a list of columns to remove
rem_cols = data.index[:84]

In [19]:
# Cross-check: How many columns are we removing ?
len(rem_cols)

84

In [20]:
# Drop the columns that need to be removed
newdata.drop(rem_cols, axis=1, inplace=True)

In [21]:
newdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25863 entries, 0 to 25862
Data columns (total 51 columns):
clip_id        25863 non-null int64
singer         25863 non-null int64
harpsichord    25863 non-null int64
sitar          25863 non-null int64
heavy          25863 non-null int64
foreign        25863 non-null int64
no piano       25863 non-null int64
classical      25863 non-null int64
female         25863 non-null int64
jazz           25863 non-null int64
guitar         25863 non-null int64
quiet          25863 non-null int64
no beat        25863 non-null int64
solo           25863 non-null int64
folk           25863 non-null int64
ambient        25863 non-null int64
new age        25863 non-null int64
synth          25863 non-null int64
drum           25863 non-null int64
bass           25863 non-null int64
loud           25863 non-null int64
string         25863 non-null int64
opera          25863 non-null int64
fast           25863 non-null int64
country        25863 non-nu

In [22]:
# Create a backup of the dataframe (copy() so that later in-place changes do not touch the backup)
backup_newdata = newdata.copy()

In [23]:
# Shuffle the dataframe
from sklearn.utils import shuffle
newdata = shuffle(newdata)

In [24]:
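# Preview the shuffled rows with a fresh 0..N-1 index. reset_index returns a new DataFrame,
# so newdata itself keeps its shuffled index labels.
newdata.reset_index(drop=True)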

Unnamed: 0,clip_id,singer,harpsichord,sitar,heavy,foreign,no piano,classical,female,jazz,guitar,quiet,no beat,solo,folk,ambient,new age,synth,drum,bass,loud,string,opera,fast,country,violin,electro,trance,chant,strange,modern,hard,harp,pop,piano,orchestra,eastern,slow,male,vocal,no singer,india,rock,dance,cello,techno,flute,beat,soft,choir,baroque
0,29918,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
1,42456,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,51371,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4107,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,9085,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0
5,11923,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
6,58476,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
7,13163,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,50609,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,26199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0
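
A minimal sketch of the shuffle-then-renumber step on a tiny made-up frame. It uses `DataFrame.sample(frac=1)` as a stand-in for `sklearn.utils.shuffle`, and the `toy` frame and `random_state=0` are assumptions for the example. The point it illustrates is that `reset_index(drop=True)` hands back a new frame and leaves the original untouched unless you assign the result.

```python
import pandas as pd

toy = pd.DataFrame({"clip_id": [2, 6, 10, 11, 12]})

# Shuffle rows with pandas itself (a stand-in for sklearn.utils.shuffle here).
shuffled = toy.sample(frac=1, random_state=0)

# reset_index returns a new DataFrame; `shuffled` keeps its shuffled labels.
renumbered = shuffled.reset_index(drop=True)

print(list(shuffled.index))    # original row labels, in shuffled order
print(list(renumbered.index))  # consecutive labels starting at 0
```

This is why the `info()` output further down still reports the shuffled index range for `newdata`.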


In [26]:
# One final check
newdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25863 entries, 13631 to 23146
Data columns (total 51 columns):
clip_id        25863 non-null int64
singer         25863 non-null int64
harpsichord    25863 non-null int64
sitar          25863 non-null int64
heavy          25863 non-null int64
foreign        25863 non-null int64
no piano       25863 non-null int64
classical      25863 non-null int64
female         25863 non-null int64
jazz           25863 non-null int64
guitar         25863 non-null int64
quiet          25863 non-null int64
no beat        25863 non-null int64
solo           25863 non-null int64
folk           25863 non-null int64
ambient        25863 non-null int64
new age        25863 non-null int64
synth          25863 non-null int64
drum           25863 non-null int64
bass           25863 non-null int64
loud           25863 non-null int64
string         25863 non-null int64
opera          25863 non-null int64
fast           25863 non-null int64
country        25863 no

In [27]:
# Let us save the final columns
final_columns_names = list(newdata.columns)

In [28]:
# Delete the clip_id entry. Run this cell only once, otherwise a tag gets deleted as well.
del(final_columns_names[0])

In [29]:
# Verified
final_columns_names

['singer',
 'harpsichord',
 'sitar',
 'heavy',
 'foreign',
 'no piano',
 'classical',
 'female',
 'jazz',
 'guitar',
 'quiet',
 'no beat',
 'solo',
 'folk',
 'ambient',
 'new age',
 'synth',
 'drum',
 'bass',
 'loud',
 'string',
 'opera',
 'fast',
 'country',
 'violin',
 'electro',
 'trance',
 'chant',
 'strange',
 'modern',
 'hard',
 'harp',
 'pop',
 'piano',
 'orchestra',
 'eastern',
 'slow',
 'male',
 'vocal',
 'no singer',
 'india',
 'rock',
 'dance',
 'cello',
 'techno',
 'flute',
 'beat',
 'soft',
 'choir',
 'baroque']

In [30]:
# Create the dataframe to be saved (you could skip this and apply similar steps to the previous dataframe).
# The binary 0's and 1's in each tag column are changed to False and True by applying the '==' operator to the dataframe.
final_matrix = pd.concat([newdata['clip_id'], newdata[final_columns_names]==1], axis=1)
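
Here is a small sketch of the same conversion on a made-up two-clip frame (the `toy` frame and `tag_columns` list are invented for the example). Comparing the tag columns with 1 yields boolean columns, while `clip_id` is concatenated back unchanged.

```python
import pandas as pd

toy = pd.DataFrame({"clip_id": [2, 6], "guitar": [1, 0], "slow": [0, 1]})
tag_columns = ["guitar", "slow"]

# Comparing with 1 turns each 0/1 entry into True/False; clip_id stays an integer.
toy_final = pd.concat([toy["clip_id"], toy[tag_columns] == 1], axis=1)

print(toy_final.dtypes)
```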

The following steps convert the mp3 files into their respective mel-spectrograms. This is compute intensive. If it takes a long time, copy the code into a text file and run it as a Python script so that you don't depend on the Jupyter notebook session. I've already run these once, so I'm not running them again.

In [None]:
# Copy every mp3 file, renamed to its clip_id, into one folder named 'dataset_clip_id_mp3' in the working directory.
# Get the current working directory
root = os.getcwd()
os.mkdir(os.path.join(root, "dataset_clip_id_mp3"), 0o755)

# Iterate over the mp3 files and copy each one to the new folder under the name <clip_id>.mp3.
for i in range(len(clip_id)):
    src = os.path.join(root, mp3_path[i])
    dest = os.path.join(root, "dataset_clip_id_mp3", str(clip_id[i]) + ".mp3")
    shutil.copy2(src, dest)
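
The copy-and-rename step can be tried safely on a throwaway directory first. The sketch below is an illustration only: the two file names, the ids and the `missing` list are hypothetical, and the "mp3" files are just placeholder bytes. It also records source files that are absent instead of stopping the run, which is an addition to what the cell above does.

```python
import os
import shutil
import tempfile

# Hypothetical miniature dataset: two fake "mp3" files in nested folders.
root = tempfile.mkdtemp()
clip_ids = [2, 14]
mp3_paths = ["f/track_a.mp3", "c/track_b.mp3"]
for rel_path in mp3_paths:
    os.makedirs(os.path.dirname(os.path.join(root, rel_path)))
    with open(os.path.join(root, rel_path), "wb") as handle:
        handle.write(b"not real audio")

dest_dir = os.path.join(root, "dataset_clip_id_mp3")
os.mkdir(dest_dir)

missing = []
for cid, rel_path in zip(clip_ids, mp3_paths):
    src = os.path.join(root, rel_path)
    if not os.path.isfile(src):
        # Record the gap instead of stopping the whole run.
        missing.append(rel_path)
        continue
    shutil.copy2(src, os.path.join(dest_dir, "{}.mp3".format(cid)))

print(sorted(os.listdir(dest_dir)))
```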

In [None]:
# Convert all the mp3 files into their corresponding mel-spectrograms (melgrams).

# Audio preprocessing function
def compute_melgram(audio_path):
    ''' Compute a mel-spectrogram and return it with shape (1,1,96,1366), where
    96 == #mel-bins and 1366 == #time frames.
    parameters
    ----------
    audio_path: path for the audio file.
                Any format supported by audioread will work.
    More info: http://librosa.github.io/librosa/generated/librosa.core.load.html#librosa.core.load
    '''

    # mel-spectrogram parameters
    SR = 12000
    N_FFT = 512
    N_MELS = 96
    HOP_LEN = 256
    DURA = 29.12  # to make it 1366 frames

    src, sr = librosa.load(audio_path, sr=SR)  # whole signal
    n_sample = src.shape[0]
    n_sample_fit = int(DURA*SR)

    if n_sample < n_sample_fit:  # if too short, pad with zeros
        src = np.hstack((src, np.zeros((n_sample_fit - n_sample,))))
    elif n_sample > n_sample_fit:  # if too long, keep the middle part
        # Integer division (//) keeps the slice indices integers on Python 3 as well
        src = src[(n_sample-n_sample_fit)//2:(n_sample+n_sample_fit)//2]
    # librosa.logamplitude(..., ref_power=1.0) was replaced by librosa.power_to_db(..., ref=1.0)
    melgram = librosa.feature.melspectrogram(y=src, sr=SR, hop_length=HOP_LEN,
                                             n_fft=N_FFT, n_mels=N_MELS)
    ret = librosa.power_to_db(melgram**2, ref=1.0)
    ret = ret[np.newaxis, np.newaxis, :]
    return ret


# Variable to save the mp3 files that don't work
files_that_dont_work = []
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/')
root = os.getcwd()
melgram_dir = os.path.join(root, 'dataset_clip_id_melgram')
# Create the output folder if it does not exist yet (np.save will not create it)
if not os.path.isdir(melgram_dir):
    os.makedirs(melgram_dir)
os.chdir(os.path.join(root, 'dataset_clip_id_mp3'))
for audio_path in os.listdir('.'):
    name, ext = os.path.splitext(audio_path)
    dest = os.path.join(melgram_dir, name)
    # Skip anything that is not an mp3 or that has already been converted
    if ext != ".mp3" or os.path.isfile(dest + '.npy'):
        continue
    try:
        melgram = compute_melgram(os.path.abspath(audio_path))
        np.save(dest, melgram)
    except EOFError:
        # Remember the files that could not be decoded
        files_that_dont_work.append(audio_path)
                
"""
NOTE: I've run this and created all the mel-spectrograms, saved each one separately,
and then concatenated them into the train, test and validation sets in the ratio that I wanted.
Taken as a whole, this adds a significant overhead to the computation time.

For example, concatenating the individual files into train, test and
validation splits in turn requires more time and memory. If we decided the splits
beforehand and converted the mp3 files to mel-spectrograms based on those splits, it would make
life much easier (and take less time).

However, I want each of the mel-spectrograms separate, as I might need to create datasets
based on different genres, numbers of files, splits etc. in the future. So this is the way
to go for me now. Please note that this requires a significant amount of system memory.
"""
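
The 1366 time frames mentioned in the `compute_melgram` docstring follow from the parameters. The sketch below is plain arithmetic and assumes librosa's default centered framing, under which a signal of `n_samples` samples yields `1 + n_samples // hop_length` frames.

```python
# Parameters copied from compute_melgram above.
SR = 12000       # samples per second after resampling
HOP_LEN = 256    # samples between successive frames
DURA = 29.12     # seconds kept from each clip

n_samples = int(DURA * SR)

# Assumption: centered framing, which yields one frame per hop plus one.
n_frames = 1 + n_samples // HOP_LEN

print(n_samples, n_frames)
```

So 29.12 seconds at 12 kHz is 349440 samples, which is exactly 1365 hops of 256 samples, giving 1366 frames.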

In [33]:
# Get lists of the clip_ids for which mp3 files and melgrams are available
mp3_available = []
melgram_available = []
for mp3 in os.listdir('/home/cc/notebooks/MusicProject/MagnaTagATune/dataset_clip_id_mp3/'):
    mp3_available.append(int(os.path.splitext(mp3)[0]))

for melgram in os.listdir('/home/cc/notebooks/MusicProject/MagnaTagATune/dataset_clip_id_melgram/'):
    melgram_available.append(int(os.path.splitext(melgram)[0]))

In [34]:
# The latest clip_id
new_clip_id = final_matrix['clip_id']

In [35]:
# Let us see which files have not been converted into mel-spectrograms.
set(list(new_clip_id)).difference(melgram_available)

{35644, 55753, 57881}
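
A set difference only reports mismatches in one direction, so it can be worth checking both. Here is a minimal sketch with made-up ids (none of these values come from the dataset): one pass finds annotated clips without a spectrogram, the other finds spectrograms without an annotation row.

```python
# Hypothetical ids for illustration only.
annotated_ids = [2, 6, 10, 11]
melgram_ids = [2, 10, 11, 99]

# Annotated clips with no spectrogram on disk.
missing_melgrams = sorted(set(annotated_ids) - set(melgram_ids))
# Spectrograms on disk with no annotation row.
orphan_melgrams = sorted(set(melgram_ids) - set(annotated_ids))

print(missing_melgrams, orphan_melgrams)
```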

In [36]:
# These three clips (35644, 55753, 57881) have no mel-spectrogram. Remove them from the dataframe.
final_matrix = final_matrix[~final_matrix['clip_id'].isin([35644, 55753, 57881])]

In [37]:
# Check again
final_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25860 entries, 13631 to 23146
Data columns (total 51 columns):
clip_id        25860 non-null int64
singer         25860 non-null bool
harpsichord    25860 non-null bool
sitar          25860 non-null bool
heavy          25860 non-null bool
foreign        25860 non-null bool
no piano       25860 non-null bool
classical      25860 non-null bool
female         25860 non-null bool
jazz           25860 non-null bool
guitar         25860 non-null bool
quiet          25860 non-null bool
no beat        25860 non-null bool
solo           25860 non-null bool
folk           25860 non-null bool
ambient        25860 non-null bool
new age        25860 non-null bool
synth          25860 non-null bool
drum           25860 non-null bool
bass           25860 non-null bool
loud           25860 non-null bool
string         25860 non-null bool
opera          25860 non-null bool
fast           25860 non-null bool
country        25860 non-null bool
violin     

In [38]:
# Save the matrix
final_matrix.to_pickle('final_Dataframe.pkl')

In [39]:
# Separate the training, validation and test dataframes.
training_with_clip = final_matrix[:19773]

In [40]:
validation_with_clip = final_matrix[19773:21294]

In [41]:
testing_with_clip = final_matrix[21294:]
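
As a quick sanity check on the three slices, the sketch below computes the size of each split from the slice boundaries and confirms that together they cover every row. `split_sizes` is a hypothetical helper written for this example. The row count (25860) and the boundaries (19773 and 21294) are the ones used in the cells above.

```python
def split_sizes(n_rows, train_end, val_end):
    """Return (train, validation, test) sizes for the slices
    [:train_end], [train_end:val_end] and [val_end:]."""
    if not 0 <= train_end <= val_end <= n_rows:
        raise ValueError("boundaries must satisfy 0 <= train_end <= val_end <= n_rows")
    return train_end, val_end - train_end, n_rows - val_end

# Row count and boundaries taken from the cells above.
sizes = split_sizes(25860, 19773, 21294)
fractions = [round(size / 25860.0, 3) for size in sizes]

print(sizes, fractions)
```

The split works out to roughly 76% training, 6% validation and 18% test.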

In [42]:
# Quick peek
training_with_clip

Unnamed: 0,clip_id,singer,harpsichord,sitar,heavy,foreign,no piano,classical,female,jazz,guitar,quiet,no beat,solo,folk,ambient,new age,synth,drum,bass,loud,string,opera,fast,country,violin,electro,trance,chant,strange,modern,hard,harp,pop,piano,orchestra,eastern,slow,male,vocal,no singer,india,rock,dance,cello,techno,flute,beat,soft,choir,baroque
13631,29918,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False
19311,42456,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
23221,51371,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1910,4107,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4180,9085,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False
5473,11923,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
25749,58476,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False
6010,13163,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
22934,50609,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
11925,26199,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False


In [43]:
# Quick peek
testing_with_clip

Unnamed: 0,clip_id,singer,harpsichord,sitar,heavy,foreign,no piano,classical,female,jazz,guitar,quiet,no beat,solo,folk,ambient,new age,synth,drum,bass,loud,string,opera,fast,country,violin,electro,trance,chant,strange,modern,hard,harp,pop,piano,orchestra,eastern,slow,male,vocal,no singer,india,rock,dance,cello,techno,flute,beat,soft,choir,baroque
523,1230,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
11543,25407,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5513,12015,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
12026,26425,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
20579,45192,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
19803,43533,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
25222,56856,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False
22001,48349,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
21113,46362,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4057,8860,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False


In [44]:
# Quick peek
validation_with_clip

Unnamed: 0,clip_id,singer,harpsichord,sitar,heavy,foreign,no piano,classical,female,jazz,guitar,quiet,no beat,solo,folk,ambient,new age,synth,drum,bass,loud,string,opera,fast,country,violin,electro,trance,chant,strange,modern,hard,harp,pop,piano,orchestra,eastern,slow,male,vocal,no singer,india,rock,dance,cello,techno,flute,beat,soft,choir,baroque
4921,10796,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
18976,41616,True,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False,False,False,False,False
14132,31057,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,True,False,False,False,False,False,False,False,False
104,199,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False
3565,7845,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
22976,50706,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
17574,38542,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4645,10150,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
19187,42117,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3866,8480,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False


In [45]:
# Extract the corresponding clip_id's
training_clip_id = training_with_clip['clip_id'].values
validation_clip_id = validation_with_clip['clip_id'].values
testing_clip_id = testing_with_clip['clip_id'].values

In [46]:
# Check !
training_clip_id

array([29918, 42456, 51371, ..., 23425,  5838, 37311])
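
Before saving anything, it can be worth confirming that no clip ended up in more than one split. The following is a minimal sketch with small made-up arrays standing in for `training_clip_id`, `validation_clip_id` and `testing_clip_id`; the helper name `splits_are_disjoint` is hypothetical and not part of the original notebook.

```python
import numpy as np

def splits_are_disjoint(*splits):
    """Return True if no clip_id occurs more than once across all splits."""
    combined = np.concatenate(splits)
    # If every id is unique, deduplicating does not shrink the array
    return np.unique(combined).size == combined.size

# Toy stand-ins for the real clip_id arrays
train_ids = np.array([29918, 42456, 51371])
valid_ids = np.array([10796, 41616])
test_ids = np.array([199, 7845])

print(splits_are_disjoint(train_ids, valid_ids, test_ids))   # True
print(splits_are_disjoint(train_ids, np.array([42456])))     # False
```

With the real arrays, the same call would be `splits_are_disjoint(training_clip_id, validation_clip_id, testing_clip_id)`.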

In [47]:
# Change to the directory where the label arrays and clip_id lists will be saved
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/final_dataset/')

In [48]:
# Save the 'y' values.
np.save('train_y.npy', training_with_clip[final_columns_names].values)

In [49]:
np.save('valid_y.npy', validation_with_clip[final_columns_names].values)

In [50]:
np.save('test_y.npy', testing_with_clip[final_columns_names].values)

In [51]:
# Save the 'x' clip_id's. We will make the numpy array using this.
np.savetxt('train_x_clip_id.txt', training_with_clip['clip_id'].values, fmt='%i')

In [52]:
np.savetxt('test_x_clip_id.txt', testing_with_clip['clip_id'].values, fmt='%i')

In [53]:
np.savetxt('valid_x_clip_id.txt', validation_with_clip['clip_id'].values, fmt='%i')
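
A quick way to gain confidence in the saved files is to read them back and check that the labels and the clip_id list still line up row by row. This is only a sketch: it writes toy data to a temporary directory rather than to `final_dataset/`, and the 3 x 50 label array is made up for illustration.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins: 3 clips, 50 boolean tag columns
clip_ids = np.array([10796, 41616, 31057])
labels = np.zeros((3, 50), dtype=bool)
labels[1, [0, 3, 9]] = True

out_dir = tempfile.mkdtemp()  # temporary directory, not the notebook's final_dataset/
y_path = os.path.join(out_dir, 'valid_y.npy')
ids_path = os.path.join(out_dir, 'valid_x_clip_id.txt')

# Same calls as in the cells above
np.save(y_path, labels)
np.savetxt(ids_path, clip_ids, fmt='%i')

# Read both files back
loaded_y = np.load(y_path)
loaded_ids = np.loadtxt(ids_path, dtype=int)

# Row i of loaded_y should describe clip loaded_ids[i]
assert loaded_y.shape[0] == loaded_ids.shape[0]
assert np.array_equal(loaded_ids, clip_ids)
assert np.array_equal(loaded_y, labels)
```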

This is the most compute-intensive part: concatenating the numpy arrays to form the train, test and validation splits. For the training split, I have included two different ways to build it: either load and concatenate the precomputed numpy arrays, or convert the corresponding mp3 files directly to mel-spectrograms.

```
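# Option 1: compute the mel-spectrogram directly from the mp3
melgram = compute_melgram(str(train_clip) + '.mp3')
# Option 2: load a mel-spectrogram that was precomputed and saved as .npy
melgram = np.load(str(train_clip) + '.npy')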
```

Use whichever suits you. I had a cloud instance with plenty of RAM, so I concatenated the numpy arrays. It took about 6 hours.
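
To get a feel for how much RAM "plenty" means, here is a rough back-of-the-envelope sketch. It assumes each clip contributes one array of shape (1, 96, 1366), matching the shapes used in the cell below, and that the arrays are stored as float64, which is the default for `np.zeros`. The clip count of 15000 is a hypothetical figure, not the actual size of any split.

```python
import numpy as np

def split_size_gib(n_clips, shape=(1, 96, 1366), dtype=np.float64):
    """Approximate in-memory size, in GiB, of an array of shape (n_clips,) + shape."""
    bytes_per_clip = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return n_clips * bytes_per_clip / float(2 ** 30)

bytes_per_clip = 96 * 1366 * np.dtype(np.float64).itemsize
print(bytes_per_clip)  # 1049088, i.e. about 1 MiB per clip

n_clips = 15000  # hypothetical split size
print(round(split_size_gib(n_clips), 1))                    # 14.7
print(round(split_size_gib(n_clips, dtype=np.float32), 1))  # 7.3
```

Note that `np.concatenate` builds a new array on every call, so while a split is being assembled the old and new copies briefly coexist and peak usage can approach twice these figures.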

In [None]:
# Now combine the melgrams according to the clip_id.
# (In the future, the melgrams could be generated straight into train, test and validation splits by clip_id.)

# Arrays that accumulate the melgrams for each split.
train_x = np.zeros((0, 1, 96, 1366))
test_x = np.zeros((0, 1, 96, 1366))
valid_x = np.zeros((0, 1, 96, 1366))

root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + 'dataset_clip_id_melgram/')
for i, valid_clip in enumerate(validation_clip_id):
    # Skip clips whose melgram file is missing
    if os.path.isfile(str(valid_clip) + '.npy'):
        # print(i, valid_clip)
        melgram = np.load(str(valid_clip) + '.npy')
        valid_x = np.concatenate((valid_x, melgram), axis=0)
os.chdir(root)
np.save('valid_x.npy', valid_x)
print("Validation file created")


root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + 'dataset_clip_id_melgram/')
for i, test_clip in enumerate(testing_clip_id):
    # Skip clips whose melgram file is missing
    if os.path.isfile(str(test_clip) + '.npy'):
        # print(i, test_clip)
        melgram = np.load(str(test_clip) + '.npy')
        test_x = np.concatenate((test_x, melgram), axis=0)
os.chdir(root)
np.save('test_x.npy', test_x)
print("Testing file created")

root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + 'dataset_clip_id_melgram/')
for i, train_clip in enumerate(training_clip_id):
    #if os.path.isfile(str(train_clip) + '.npy'):
        # print(i, train_clip)
    # Option 1: compute the melgram directly from the mp3
    melgram = compute_melgram(str(train_clip) + '.mp3')
    # Option 2: load a precomputed melgram instead (swap with the line above)
    #melgram = np.load(str(train_clip) + '.npy')
    train_x = np.concatenate((train_x, melgram), axis=0)
os.chdir(root)
np.save('train_x.npy', train_x)
print("Training file created.")
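
Two things about the loops above are worth keeping in mind. First, the validation and test loops skip any clip whose `.npy` file is missing, while the saved `*_y.npy` files contain a row for every clip, so the x and y arrays can end up with different lengths if a file is absent. Second, calling `np.concatenate` once per clip copies the whole array each time. The sketch below shows one possible alternative: preallocate the array once and return a mask of the clips that were actually found, so that the labels can be filtered to match. The helper `stack_melgrams` is hypothetical, it stores float32 rather than the float64 used above, it assumes every file holds an array of shape (1, 1, 96, 1366), and the demo uses tiny fake melgrams.

```python
import os
import tempfile
import numpy as np

def stack_melgrams(clip_ids, melgram_dir, shape=(1, 96, 1366)):
    """Load <clip_id>.npy files into one preallocated array.

    Returns (x, found): `found` is a boolean mask over clip_ids marking the
    clips whose file existed; `x` holds only those clips, in the same order.
    """
    x = np.zeros((len(clip_ids),) + shape, dtype=np.float32)
    found = np.zeros(len(clip_ids), dtype=bool)
    n = 0
    for i, clip_id in enumerate(clip_ids):
        path = os.path.join(melgram_dir, str(clip_id) + '.npy')
        if os.path.isfile(path):
            # Each file is assumed to hold an array of shape (1,) + shape
            x[n] = np.load(path)[0]
            found[i] = True
            n += 1
    return x[:n], found

# Toy demonstration: clip 102 has no melgram file
demo_dir = tempfile.mkdtemp()
small = (1, 4, 6)
for cid in (101, 103):
    np.save(os.path.join(demo_dir, str(cid) + '.npy'), np.ones((1,) + small))

ids = np.array([101, 102, 103])
labels = np.array([[True, False], [False, True], [True, True]])

x, found = stack_melgrams(ids, demo_dir, shape=small)
y = labels[found]  # drop the label rows of clips that had no melgram
print(x.shape)
print(y.shape)
```

After this, `x` and `y` have the same number of rows, and `ids[found]` lists the clips they correspond to.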

That's it for now!