# [FMA: A Dataset For Music Analysis](https://github.com/mdeff/fma)

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

## Usage

1. Go through the [paper] to understand what the data is about.
1. Download some datasets from <https://github.com/mdeff/fma>.
1. Uncompress the archives, e.g. with `unzip fma_small.zip`.
1. Load and play with the data in this notebook.

[paper]: https://arxiv.org/abs/1612.01840

In [1]:
%matplotlib inline
import os
import glob
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import utils
import librosa
import time
import sklearn
from sklearn.model_selection import train_test_split
plt.rcParams['figure.figsize'] = (17, 5)

## Data Preparation

The FMA dataset comes with metadata that can be used to format the data. Four metadata files are available: `tracks.csv`, `genres.csv`, `features.csv` and `echonest.csv`. Each file contains several columns describing its subject; for example, `tracks.csv` contains the duration, bit rate, license and recording date of each track. Our data preparation has four key steps:

- Load tracks metadata 
- Select a subset of the data (small/medium/large)
- Clean the selected subset 
- Audio mapping and data formatting

### Load tracks metadata

In [3]:
tracks = utils.load('data/FMA/fma_metadata/tracks.csv')
tracks.shape

(106574, 52)

1. The index is the ID of the song, taken from the website, used as the name of the audio file.
2. Per-track, per-album and per-artist metadata from the Free Music Archive website.
3. Two columns to indicate the subset (small, medium, large)
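
The column layout described above is a two-level header, which is why the cells below use both `tracks['track']` and `tracks['set', 'subset']`. The following is a minimal sketch on made-up data (the IDs and values are illustrative, not taken from `tracks.csv`) showing how both kinds of selection behave in pandas.

```python
import pandas as pd

# Toy stand-in for the tracks table: columns are (group, field) pairs.
# All values below are invented for illustration.
columns = pd.MultiIndex.from_tuples([
    ("track", "genre_top"),
    ("track", "duration"),
    ("set", "subset"),
])
toy_tracks = pd.DataFrame(
    [["Hip-Hop", 168, "small"], ["Pop", 161, "small"], ["Rock", 311, "large"]],
    index=pd.Index([1, 2, 3], name="track_id"),
    columns=columns,
)

# A top-level key returns the sub-frame holding every field of that group.
track_info = toy_tracks["track"]
print(list(track_info.columns))   # ['genre_top', 'duration']

# A (group, field) pair returns a single column as a Series.
subset = toy_tracks["set", "subset"]
print(subset.loc[3])              # large
```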

In [4]:
ipd.display(tracks['track'].head())

| track_id | bit_rate | comments | composer | date_created | date_recorded | duration | favorites | genre_top | genres | genres_all | information | interest | language_code | license | listens | lyricist | number | publisher | tags | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 256000 | 0 | | 2008-11-26 01:48:12 | 2008-11-26 | 168 | 2 | Hip-Hop | [21] | [21] | | 4656 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1293 | | 3 | | [] | Food |
| 3 | 256000 | 0 | | 2008-11-26 01:48:14 | 2008-11-26 | 237 | 1 | Hip-Hop | [21] | [21] | | 1470 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 514 | | 4 | | [] | Electric Ave |
| 5 | 256000 | 0 | | 2008-11-26 01:48:20 | 2008-11-26 | 206 | 6 | Hip-Hop | [21] | [21] | | 1933 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1151 | | 6 | | [] | This World |
| 10 | 192000 | 0 | Kurt Vile | 2008-11-25 17:49:06 | 2008-11-26 | 161 | 178 | Pop | [10] | [10] | | 54881 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 50135 | | 1 | | [] | Freeway |
| 20 | 256000 | 0 | | 2008-11-26 01:48:56 | 2008-01-01 | 311 | 0 | | [76, 103] | [17, 10, 76, 103] | | 978 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 361 | | 3 | | [] | Spiritual Level |


### Subset selection

We select the small subset of the FMA dataset. It consists of 8,000 music excerpts of 30 seconds each, with 1,000 excerpts in each of 8 genres.

In [5]:
small_tracks=tracks[tracks['set', 'subset'] <= 'small']
small_tracks.shape

(8000, 52)
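
The `<=` comparison above only makes sense if the subset column is an ordered categorical (small < medium < large), which is presumably how `utils.load` prepares it. The following sketch reproduces that behaviour on a toy frame; the dtype definition and the data are assumptions made for illustration.

```python
import pandas as pd

# Assumed ordering of the subsets: small < medium < large.
subset_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)

# Toy data, not taken from tracks.csv.
toy = pd.DataFrame(
    {"subset": pd.Categorical(["small", "large", "medium", "small"],
                              dtype=subset_dtype)},
    index=pd.Index([1, 2, 3, 4], name="track_id"),
)

# Ordered categoricals support comparisons against one of their categories.
small_only = toy[toy["subset"] <= "small"]
up_to_medium = toy[toy["subset"] <= "medium"]

print(list(small_only.index))     # [1, 4]
print(list(up_to_medium.index))   # [1, 3, 4]
```

With this ordering, `<= 'medium'` would select the small and medium tracks together, which matches the nested structure of the FMA subsets.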

In [6]:
small_tracks=small_tracks["track"]
small_tracks=small_tracks[["genre_top"]]
small_tracks.head()

| track_id | genre_top |
|---|---|
| 2 | Hip-Hop |
| 5 | Hip-Hop |
| 10 | Pop |
| 140 | Folk |
| 141 | Folk |


### Data cleaning
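We want every track to have its top genre as a string, and there should be 8 unique genres. We check the unique values of `genre_top`; any value that is missing or not a genre name would have to be dropped. Here the column contains exactly 8 genre names and no missing values, so nothing needs to be removed.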

In [7]:
small_tracks["genre_top"].unique()

[Hip-Hop, Pop, Folk, Experimental, Rock, International, Electronic, Instrumental]
Categories (8, object): [Hip-Hop, Pop, Folk, Experimental, Rock, International, Electronic, Instrumental]
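
If missing or non-string values did appear in `genre_top`, the dropping step described above could look like the following sketch. The toy frame and the name `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one non-string value.
toy = pd.DataFrame(
    {"genre_top": ["Rock", np.nan, "Folk", 21]},
    index=pd.Index([1, 2, 3, 4], name="track_id"),
)

# Drop rows whose genre is missing.
cleaned = toy.dropna(subset=["genre_top"])

# Keep only rows whose genre is a string.
is_string = cleaned["genre_top"].map(lambda g: isinstance(g, str))
cleaned = cleaned[is_string]

print(list(cleaned.index))              # [1, 3]
print(cleaned["genre_top"].nunique())   # 2
```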

### Audio mapping and data formatting

In [8]:
audio_path="data/FMA/audio/*"
genres=small_tracks["genre_top"].values
genres=np.unique(genres)
# Map each genre name to an integer ID, in alphabetical order.
genres={genre:genre_id for genre_id,genre in enumerate(genres)}
print(genres)

{'International': 5, 'Experimental': 1, 'Instrumental': 4, 'Pop': 6, 'Electronic': 0, 'Folk': 2, 'Hip-Hop': 3, 'Rock': 7}


In [9]:
def data_formatting():
    # Build a table with one row per audio file: path, genre name and genre ID.
    dataframe={"filename":[],"genre":[],"genre id":[]}
    folders=glob.glob(audio_path)
    for folder in folders:
        for file in glob.glob(folder+"/*"):
            # The file name without its extension is the track ID, e.g. 069830.mp3 -> 69830.
            # Using os.path avoids depending on the number of directories in the path.
            track_id=int(os.path.splitext(os.path.basename(file))[0])
            genre=small_tracks.loc[track_id,"genre_top"]
            dataframe["genre"].append(genre)
            dataframe["genre id"].append(genres[genre])
            dataframe["filename"].append(file)
    dataframe=pd.DataFrame(dataframe)
    # Shuffle the rows so the files are no longer grouped by folder.
    dataframe=sklearn.utils.shuffle(dataframe)
    return dataframe

In [10]:
small_tracks_formated=data_formatting()
print(small_tracks_formated.shape)
small_tracks_formated.to_csv("small_tracks.csv",index=False)
small_tracks_formated.head()

(8000, 3)


| | filename | genre | genre id |
|---|---|---|---|
| 5216 | data/FMA/audio/069/069830.mp3 | Hip-Hop | 3 |
| 4059 | data/FMA/audio/130/130689.mp3 | Rock | 7 |
| 6370 | data/FMA/audio/055/055711.mp3 | Rock | 7 |
| 5774 | data/FMA/audio/054/054297.mp3 | Folk | 2 |
| 5164 | data/FMA/audio/069/069746.mp3 | Pop | 6 |
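
The file names above suggest a layout of `<audio dir>/<first three digits>/<six-digit ID>.mp3`. Assuming that layout holds for every track, the mapping between track IDs and paths can be sketched in both directions. Both helper names below are hypothetical and are not functions from the notebook.

```python
import os

def audio_path_for(audio_dir, track_id):
    """Hypothetical helper: build a track's path from its integer ID."""
    tid = "{:06d}".format(track_id)          # zero-pad to six digits
    return os.path.join(audio_dir, tid[:3], tid + ".mp3")

def track_id_from(path):
    """Hypothetical helper: recover the integer ID from a file path."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem)

path = audio_path_for("data/FMA/audio", 69830)
print(os.path.basename(path))   # 069830.mp3
print(track_id_from(path))      # 69830
```

Building the path from the ID avoids globbing the audio folders, at the cost of relying on the assumed naming scheme.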


### Data splitting
- Training set: 90% of the data
- Testing set: 10% of the data
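
A plain random split does not guarantee that each genre keeps the same share in both sets. The sketch below shows one way to enforce that with the `stratify` argument of `train_test_split`; the toy data and the `test_size` value are illustrative assumptions, and the notebook itself does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table with two equally sized genres (invented file names).
toy = pd.DataFrame({
    "filename": ["{:03d}.mp3".format(i) for i in range(20)],
    "genre": ["Rock"] * 10 + ["Folk"] * 10,
})

# stratify keeps the genre proportions identical in both parts.
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["genre"]
)

print(train["genre"].value_counts().to_dict())
print(test["genre"].value_counts().to_dict())
```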

In [11]:
training_data,testing_data=train_test_split(small_tracks_formated, test_size=0.1, random_state=42)

#### Training set details

In [12]:
print("size :",len(training_data))
for genre_name in genres.keys():
    print(genre_name,":",len(training_data.loc[training_data["genre"]==genre_name]))

size : 7200
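
A per-genre count like the one in the loop above can also be computed in one step with `value_counts`. This is a sketch on a toy frame with a `genre` column; the data and the name `toy_training` are assumptions made for illustration.

```python
import pandas as pd

# Toy training table with a "genre" column (invented rows).
toy_training = pd.DataFrame({"genre": ["Rock", "Folk", "Rock", "Pop", "Rock"]})
genre_names = ["Folk", "Pop", "Rock", "Electronic"]

# Count rows per genre; reindex so genres with no rows show up as 0.
counts = toy_training["genre"].value_counts().reindex(genre_names, fill_value=0)

print(counts.to_dict())   # {'Folk': 1, 'Pop': 1, 'Rock': 3, 'Electronic': 0}
```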