# 01. Machine Learning for music playlists: Data preparation

This is the first post in a series devoted to building music playlists with Scikit-Learn tools.   
This notebook covers gathering, preparing, and cleaning the datasets I'm going to use in my analysis.
For an overview of the analysis, its goals, methods, and installation notes, please go to [00_Overview](https://github.com/Tykovka/itunes-music-analysis/blob/master/00_Overview.ipynb). 

#### Contents of the notebook
* [Machine Learning Intro](#Machine-Learning-Intro)
* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)
* [Data overview](#Data-overview)
* [Summary](#Summary)

<a id='Machine-Learning-Intro'></a>

## Machine Learning Intro
In this analysis, I'm interested in three classes of music ("cycling", "ballet", "yoga") and I want to find tracks in my iTunes music library that fit these classes. This is a multiclass classification problem. Classification is the task of predicting the value of a categorical variable given some input variables (the features). 

To solve that problem I use *supervised machine learning classification algorithms*.  

Supervised machine learning is about creating models from data: a model learns from training data (data with class labels), and can be used to predict the result of test data (data without class labels). Thus, the task of supervised learning is to construct an estimator which is able to predict the label of an object given the set of features. One can also think of classification as a function estimation problem where the function that we want to estimate separates the classes.
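
To make the idea of "learning from labelled data, then predicting unlabelled data" concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is only an illustration, not the method used later in this series, and the feature values and function names are made up for the example.

```python
from collections import defaultdict
from math import dist


def fit_centroids(X, y):
    """Average the feature vectors of each class (the 'training' step)."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Made-up training data: (energy, acousticness) per track, with class labels.
X_train = [(0.9, 0.05), (0.8, 0.10), (0.2, 0.90), (0.1, 0.95)]
y_train = ["cycling", "cycling", "yoga", "yoga"]

centroids = fit_centroids(X_train, y_train)
prediction = predict(centroids, (0.85, 0.2))  # an unlabelled "test" track
```

The estimator here is the pair `fit_centroids` / `predict`: the first builds the model from labelled examples, the second applies it to a track without a label.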

<a id='Preliminaries'></a>

## Preliminaries
One of the main goals of this analysis is to explore the basics of Scikit-Learn tools. **[Scikit-Learn](http://scikit-learn.org)** is a popular Python package designed to give access to well-known machine learning algorithms within Python code. 

Scikit-Learn is built upon Python's **[NumPy (Numerical Python)](http://www.numpy.org/)** and **[SciPy (Scientific Python)](http://scipy.org/)** libraries, which enable efficient in-core numerical and scientific computation within Python. 

I also use the **[pandas](http://pandas.pydata.org/)** library in my analysis. pandas is a Python package providing fast, flexible, and expressive data structures. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

The hero and the foundation of my analysis is the **[Echo Nest API](http://the.echonest.com/)**, which provides broad and deep data on millions of artists and songs. **[Pyechonest](https://github.com/echonest/pyechonest)** is an open source Python library for the Echo Nest API that I use in this analysis. To use The Echo Nest API, an API key is required. More about the API key [here](http://developer.echonest.com/raw_tutorials/register.html).

I start by importing the modules required in this notebook. 

In [1]:
from IPython.display import display
import pandas as pd
import numpy as np

# format floating point numbers
# within pandas data structures
pd.set_option('float_format', '{:.2f}'.format)

# import pyechonest
from pyechonest import config

# pass my API key
config.ECHO_NEST_API_KEY="MY_API_KEY"
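
If the notebook is going to be shared, it can be safer to keep the key out of the code. The sketch below reads it from an environment variable instead. The variable name `ECHO_NEST_API_KEY` and the helper `load_api_key` are assumptions made for this example, not part of Pyechonest.

```python
import os


def load_api_key(env_var="ECHO_NEST_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            "Set the {} environment variable first.".format(env_var))
    return key


# Usage, once the variable is set in the shell:
# config.ECHO_NEST_API_KEY = load_api_key()
```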

<a id='Data-preparation'></a>

## Data preparation
In the analysis I use two datasets:
1. My iTunes music library serves as the *test dataset*, i.e. the unlabelled data;
2. For the *training dataset* I made a CSV file ('./labeled_tracks.csv') with hand-picked tracks from outside my iTunes library. I labelled each track with one of the three classes: "cycling", "yoga", "ballet". 

iTunes library files track the media in iTunes. The iTunes library file, called 'iTunes Music Library.xml', is created automatically when you launch iTunes. It contains information stored in the iTunes database about the songs in the library. On Mac OS X, it can be found in the directory '/Users/username/Music/iTunes'. More information about iTunes library files can be found [here](https://support.apple.com/en-us/HT201610).
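
The library file is an XML property list, so it can also be read with Python's standard `plistlib` module. The sketch below builds a tiny stand-in library in memory (the tracks are made up) and extracts title and artist pairs while skipping podcasts. It assumes the usual layout with a top-level `Tracks` dictionary and `Name`, `Artist`, `Genre`, and `Kind` keys per track.

```python
import plistlib

# A tiny stand-in for 'iTunes Music Library.xml' (made-up tracks).
sample_library = {
    "Tracks": {
        "101": {"Name": "Odessa", "Artist": "Caribou",
                "Genre": "Electronic", "Kind": "MPEG audio file"},
        "102": {"Name": "Episode 1", "Artist": "Some Host",
                "Genre": "Podcast", "Kind": "MPEG audio file"},
    }
}
xml_bytes = plistlib.dumps(sample_library)  # XML plist by default

# Parsing works the same way on the bytes of a real library file.
library = plistlib.loads(xml_bytes)
tracks = [
    (t.get("Name"), t.get("Artist"))
    for t in library["Tracks"].values()
    if t.get("Genre") != "Podcast" and t.get("Artist")
]
```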

Throughout the analysis, I use the pandas DataFrame (DF) data structure, which one can think of as an Excel-like table of values. DataFrames have various methods that make it easy to learn about the data they contain. 

To store the data and the results of my computations I use the **[HDF5](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5)** format. HDF5 lets me treat a local file as a hash and work directly with pandas DataFrames. Very cool. Reading from and writing to this file with pandas is trivial. 

### Test dataset
I will start by processing the test data.  

To parse the iTunes XML file I use the **[pyItunes](https://github.com/liamks/pyitunes)** module, which makes it easier to access tracks in the XML file. 

HDF5 is not suitable for writing data row by row. Although the test set is not big (hundreds, not millions, of songs), I need to append rows as data arrives from the Echo Nest API. After several failed attempts to write all the data straight into a pandas DF, I switched to a sqlite3 database (DB).

I use the **[sqlitedict](https://github.com/piskvorky/sqlitedict)** library to access it. sqlitedict is a lightweight wrapper around Python's sqlite3 DB with a simple, Pythonic dict-like interface. For this particular problem, I think using a DB is more convenient than a CSV or XML file.
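
To show what "a dict-like interface on top of sqlite3" means in practice, here is a toy version built only on the standard library. It is a sketch of the idea, not sqlitedict's actual implementation. The class name `TinySqliteDict`, the table layout, and the JSON serialisation are all choices made for this example.

```python
import json
import sqlite3


class TinySqliteDict:
    """A toy dict-like store on top of sqlite3 (not sqlitedict's real code)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def __setitem__(self, key, value):
        # Values are stored as JSON text; every write is committed at once.
        self.conn.execute("REPLACE INTO kv (key, value) VALUES (?, ?)",
                          (str(key), json.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (str(key),)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def __delitem__(self, key):
        self[key]  # raises KeyError if the key is missing
        self.conn.execute("DELETE FROM kv WHERE key = ?", (str(key),))
        self.conn.commit()

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]


db = TinySqliteDict()
db[0] = {"song": "Odessa", "artist": "Caribou"}
db[1] = {"song": "Jolie Coquine", "artist": "Caravan Palace"}
del db[1]
```

Because each assignment is a single committed row write, appending tracks one at a time is cheap, which is the property I need while collecting data from a rate-limited API.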

In [2]:
from sqlitedict import SqliteDict
# create a DB to store data from iTunes library
test_db = SqliteDict('./itunes_tracks_db', autocommit=True)

In [3]:
def get_itunes_track_data(song):
    """Check the validity of the track, 
    exclude podcasts and tracks 
    missing artist's name.
    """
    if (song.genre == 'Podcast' or 
        song.genre == u'iTunes U' or 
        song.kind != 'MPEG audio file' or 
        not song.artist): 
        return None 
    else:
        return song.name, song.artist

def parse_itunes_xml(xml_file, db):
    """Parse xml, get song's title
    and artist's name, save to a DB.
    """
    from pyItunes import Library
    l = Library(xml_file)
    
    # index will serve as a key for the dict
    ind = 0
    
    for _, song in l.songs.items():
        track = get_itunes_track_data(song)
        # skip podcasts and other invalid tracks
        if track is None:
            continue
        song_title, artist = track
        db[ind] = {'song': song_title, 
                   'artist': artist}
        ind += 1

In [4]:
# path to the iTunes xml file (I copied it)
xml_file = 'iTunes Music Library copy.xml'

# write the data into the DB
parse_itunes_xml(xml_file, test_db)

In [5]:
# review the result
print ("Test dataset contains {} tracks."
       .format(len(test_db)))

Test dataset contains 669 tracks.


The next step is to get song attributes from the Echo Nest API and to write them in the DB.

In the analysis I use the following track attributes, which are available through the Echo Nest API:

* **Acousticness** represents the likelihood a recording was created by solely acoustic means such as voice and acoustic instruments as opposed to electronically such as with synthesized, amplified or effected instruments;
* **Danceability** describes how suitable a track is for dancing using a number of musical elements (tempo, rhythm stability, beat strength, and overall regularity);
* **Energy** represents a perceptual measure of intensity and powerful activity released throughout the track;
* **Instrumentalness** is a measure of how likely a song is to be instrumental;
* **Key** identifies the tonic triad, the chord, major or minor. Key signatures start at 0 (C) and ascend the chromatic scale;
* **Loudness** measures the overall loudness of a track in decibels (dB);
* **Mode** indicates the modality (major (1) or minor (0)) of a track, the type of scale from which its melodic content is derived;
* **Speechiness** detects the presence of spoken words in a track;
* **Tempo** is the speed or pace of a given piece (in beats per minute);
* **Time signature** specifies how many beats are in each measure;
* **Valence** describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
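
The numeric encoding of **Key** and **Mode** can be turned back into a readable label. The sketch below follows the description above (0 is C, values ascend the chromatic scale, mode 1 is major and 0 is minor). Spelling the black keys as sharps rather than flats is an assumption of this example, and `key_name` is a hypothetical helper.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]


def key_name(key, mode):
    """Turn a numeric key (0-11) and mode (1 major, 0 minor) into a label."""
    scale = "major" if mode == 1 else "minor"
    return "{} {}".format(PITCH_CLASSES[key % 12], scale)


label = key_name(4, 1)  # a track with key 4 and mode 1
```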

In [6]:
# list of features to use in the analysis
features = ['acousticness',
            'danceability',
            'energy',
            'instrumentalness',
            'key',
            'loudness',
            'mode',
            'speechiness',
            'tempo',
            'time_signature',
            'valence']

Using the Echo Nest Python library Pyechonest is easy and straightforward. The Echo Nest database doesn't provide data for every artist or song, so I handle missing items with a "try-except" block.

In [7]:
def get_song_data(artist_name, song_title):
    """Get track features data from 
    the Echo Nest database.
    """
    from pyechonest import song
    try: 
        result = song.search(artist=artist_name, 
                             title=song_title)
        song_result = result[0]
        song_data = song_result.audio_summary
        
        # returns a dictionary of song features
        return song_data
    
    except IndexError:
        # an empty result list means there is no match
        return None
         
def add_features_to_db(db, features):
    """Request track features from
    the Echo Nest database and add to the DB.
    """
    from time import sleep
    
    for key, value in db.iteritems():
        # check if attributes have been already added
        if value.get('tempo'):
            pass
        else:
            song_data = get_song_data(value['artist'],
                                     value['song'])
            # if the song is in the Echo Nest DB, 
            # I add data to the DB
            if song_data:
                for f in features:
                    value.update({f : song_data[f]})
                db[key] = value
            
            # if not, I delete the track
            else:
                del db[key]
        # the Echo Nest limits number of requests to 20 per minute
        sleep(7)
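
The limit quoted in the code comment, 20 requests per minute, works out to one request every 3 seconds, so a fixed `sleep(7)` is on the safe side. As an alternative, here is a sketch of a small throttle that waits only as long as needed since the previous call. The `Throttle` class is hypothetical. Its clock and sleep functions are parameters so that it can be tried out without actually waiting.

```python
import time


class Throttle:
    """Space out calls so at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls, period,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = float(period) / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(max_calls=20, period=60)  # 20 requests per minute
```

Calling `throttle.wait()` right before each API request would replace the fixed pause, and tracks that are skipped without a request would no longer cost any waiting time.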

I call the function to add track features from the Echo Nest data to the DB.

In [13]:
# add features
add_features_to_db(test_db, features)

In [15]:
# review the result
print ("Test dataset contains {} tracks."
       .format(len(test_db)))

Test dataset contains 537 tracks.


To inspect the data I load the DB into a pandas DF and transpose it so that songs are rows. 

In [17]:
# read DB
test_df = pd.DataFrame(dict(test_db)).T

# view the result
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 537 entries, 0 to 98
Data columns (total 13 columns):
acousticness        537 non-null object
artist              537 non-null object
danceability        537 non-null object
energy              537 non-null object
instrumentalness    536 non-null object
key                 537 non-null object
loudness            537 non-null object
mode                537 non-null object
song                537 non-null object
speechiness         536 non-null object
tempo               537 non-null object
time_signature      537 non-null object
valence             537 non-null object
dtypes: object(13)
memory usage: 58.7+ KB


The table above gives an overview of the data in the DF columns. There are 537 tracks, or rows, in the test set. Tracks with no data in the Echo Nest database were removed from the test set by the *add_features_to_db* function.  

However, one track in the resulting set has no data for the instrumentalness feature. Pandas provides various tools for handling [missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html). I simply remove the row with no data for the feature.

In [18]:
# drop the row
test_df.dropna(inplace=True)

I have 11 numeric features (the data from the Echo Nest) and 2 columns with text data in the test set. I notice, however, that all columns were identified as objects. I convert the feature columns to a numerical format. I also reorder the columns to move the "artist" and "song" columns to the end.

In [19]:
# convert to numerical format
test_df = test_df.convert_objects(convert_numeric=True)

# list of columns
cols = test_df.columns.tolist()

# reorder the list
cols = cols[0:1] + cols[2:8] + cols[9:] + cols[1:2] + cols[8:9]
cols

['acousticness',
 'danceability',
 'energy',
 'instrumentalness',
 'key',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'time_signature',
 'valence',
 'artist',
 'song']

In [20]:
# apply the new column order to the DF
test_df = test_df.ix[:, cols]
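
Recent pandas releases (1.0 and later) no longer provide `convert_objects` or the `.ix` indexer used in the cells above. The sketch below shows one way to get the same effect with `pd.to_numeric` and `.loc`. It runs on a made-up two-track frame with only two numeric columns, so the records and column list are illustrative only.

```python
import pandas as pd

# Made-up records shaped like the values stored in the DB above.
records = {
    "0": {"artist": "Caribou", "song": "Odessa", "tempo": 118.01, "key": 0},
    "1": {"artist": "Junip", "song": "Turn to the Assassin",
          "tempo": 151.01, "key": 2},
}
df = pd.DataFrame(records).T  # mixed values leave every column as object

# Convert each numeric column explicitly (stands in for convert_objects).
for col in ["tempo", "key"]:
    df[col] = pd.to_numeric(df[col])

# Label-based column selection (stands in for .ix[:, cols]).
df = df.loc[:, ["tempo", "key", "artist", "song"]]
```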

In [21]:
# view the result
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 535 entries, 0 to 98
Data columns (total 13 columns):
acousticness        535 non-null float64
danceability        535 non-null float64
energy              535 non-null float64
instrumentalness    535 non-null float64
key                 535 non-null int64
loudness            535 non-null float64
mode                535 non-null int64
speechiness         535 non-null float64
tempo               535 non-null float64
time_signature      535 non-null int64
valence             535 non-null float64
artist              535 non-null object
song                535 non-null object
dtypes: float64(8), int64(3), object(2)
memory usage: 58.5+ KB


I reordered the columns in the test set, fixed the data types, and removed rows with missing values. As a result, I have a dataframe with 536 observations (rows) and 13 features (columns). 

In [22]:
test_df.head(5)

| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | artist | song |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.71 | 0.95 | 0.0 | 4 | -5.08 | 1 | 0.37 | 125.12 | 4 | 0.6 | Caravan Palace | Jolie Coquine |
| 1 | 0.0 | 0.67 | 0.5 | 0.03 | 0 | -20.45 | 1 | 0.05 | 118.01 | 4 | 0.59 | Caribou | Odessa |
| 10 | 0.01 | 0.57 | 0.45 | 0.04 | 0 | -7.88 | 1 | 0.05 | 115.45 | 4 | 0.56 | Jenny Wilson | Like A Fading Rainbow |
| 100 | 0.01 | 0.8 | 0.69 | 0.01 | 2 | -5.8 | 0 | 0.04 | 117.0 | 4 | 0.41 | Heartbreak | Loving The Alien |
| 102 | 0.16 | 0.57 | 0.82 | 0.76 | 10 | -5.4 | 0 | 0.06 | 177.57 | 4 | 0.54 | High Contrast | Kiss Kiss Bang Band |


I notice that the index of the dataframe has "gaps". The reason is that, when writing features to the test database, I deleted tracks with no data in the Echo Nest database. To fix this I reset the index.

In [23]:
# reset index
test_df.reset_index(drop=True, inplace=True)
test_df.head(5)

| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | artist | song |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.71 | 0.95 | 0.0 | 4 | -5.08 | 1 | 0.37 | 125.12 | 4 | 0.6 | Caravan Palace | Jolie Coquine |
| 1 | 0.0 | 0.67 | 0.5 | 0.03 | 0 | -20.45 | 1 | 0.05 | 118.01 | 4 | 0.59 | Caribou | Odessa |
| 2 | 0.01 | 0.57 | 0.45 | 0.04 | 0 | -7.88 | 1 | 0.05 | 115.45 | 4 | 0.56 | Jenny Wilson | Like A Fading Rainbow |
| 3 | 0.01 | 0.8 | 0.69 | 0.01 | 2 | -5.8 | 0 | 0.04 | 117.0 | 4 | 0.41 | Heartbreak | Loving The Alien |
| 4 | 0.16 | 0.57 | 0.82 | 0.76 | 10 | -5.4 | 0 | 0.06 | 177.57 | 4 | 0.54 | High Contrast | Kiss Kiss Bang Band |
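
Before the reset, the index labels ran 0, 1, 10, 100, 102, which is the order that sorting string labels would produce. The sketch below illustrates this under the assumption that the integer keys come back from the DB as strings. I have not verified that assumption against sqlitedict itself.

```python
# Assumption: the integer keys come back from the DB as strings.
keys = [str(i) for i in (0, 1, 2, 10, 98, 100, 102)]

# Strings sort character by character, so "10" and "100" precede "2".
as_strings = sorted(keys)

# Sorting on the numeric value restores the natural order.
as_numbers = sorted(keys, key=int)
```

Either way, resetting the index replaces the labels with a clean 0..n-1 range, so the original label order no longer matters afterwards.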


The test set is now ready for further analysis. I move to the training set.

### Training dataset

For the training dataset, I made a CSV file ('./labeled_tracks.csv') with hand-picked tracks and labelled each track with one of the three classes: "cycling", "ballet", "yoga".

I use the pandas *read_csv* function to read the CSV file into a DataFrame.

In [24]:
# path to the csv file
csv_file = './labeled_tracks.csv'

# transform the csv file to a DF
train_data = pd.read_csv(csv_file, encoding='utf_8', 
                       header=0)

# add features as empty columns
for f in features:
    train_data[f] = np.nan

In [25]:
# review the result
print ("Training dataset contains {} tracks."
       .format(len(train_data)))
print 
# info() prints its report itself and returns None
train_data.info()

Training dataset contains 163 tracks.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 14 columns):
song                163 non-null object
artist              163 non-null object
category            163 non-null object
acousticness        0 non-null float64
danceability        0 non-null float64
energy              0 non-null float64
instrumentalness    0 non-null float64
key                 0 non-null float64
loudness            0 non-null float64
mode                0 non-null float64
speechiness         0 non-null float64
tempo               0 non-null float64
time_signature      0 non-null float64
valence             0 non-null float64
dtypes: float64(11), object(3)
memory usage: 19.1+ KB


I make a function to write data from the Echo Nest to the dataframe.

In [26]:
def add_features_to_df(df, features):
    """Request track features from
    the Echo Nest database and add to the DF.
    """
    from time import sleep
    
    for i in df.index.tolist():
        # check if attributes have been already added
        if pd.notnull(df.loc[i, 'tempo']): 
            pass
        else:
            song_data = get_song_data(df.loc[i, 'artist'], 
                                      df.loc[i, 'song'])
            
            # if the song is in the Echo Nest DB, 
            # I add data to the DF.
            if song_data:
                for f in features:
                    df.loc[i, f] = song_data[f]
            # if not, I drop the track. 
            else:
                df = df.drop(i)
        # the Echo Nest limits number of requests to 20 per minute
        sleep(7) 
    return df

In [31]:
# add features to the training DF
train_df = add_features_to_df(train_data, features)

In [32]:
# get info about data
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
song                143 non-null object
artist              143 non-null object
category            143 non-null object
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null float64
loudness            143 non-null float64
mode                143 non-null float64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null float64
valence             143 non-null float64
dtypes: float64(11), object(3)
memory usage: 16.8+ KB


The table above gives an overview of the data in the training DF columns. There are 143 rows in the training set. Tracks with no data in the Echo Nest database were removed from the set by the *add_features_to_df* function.  

However, two tracks in the resulting set have no data for the speechiness feature. Being familiar with the features, I know that this particular feature is not very valuable for the analysis. At the same time, I want to keep as many observations as possible, so I simply leave the missing data as is.

I know that three features, "key", "mode", and "time_signature", are discrete rather than continuous like the other features, and take a limited number of values. E.g. "time_signature" has 4 values: 1, 3, 4 or 5. I change the type of these three features from float to integer.

In [33]:
train_df[['key', 'mode', 'time_signature']] = train_df[['key', 'mode', 'time_signature']].astype(int)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
song                143 non-null object
artist              143 non-null object
category            143 non-null object
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null int64
loudness            143 non-null float64
mode                143 non-null int64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null int64
valence             143 non-null float64
dtypes: float64(8), int64(3), object(3)
memory usage: 16.8+ KB


Next, I reorder columns in the training set to keep the order consistent with the test dataset. 

In [34]:
cols = train_df.columns.tolist()
cols = cols[3:] + cols[:3]
cols

['acousticness',
 'danceability',
 'energy',
 'instrumentalness',
 'key',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'time_signature',
 'valence',
 u'song',
 u'artist',
 u'category']

In [35]:
# change order of columns
train_df = train_df.ix[:, cols]

In [36]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null int64
loudness            143 non-null float64
mode                143 non-null int64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null int64
valence             143 non-null float64
song                143 non-null object
artist              143 non-null object
category            143 non-null object
dtypes: float64(8), int64(3), object(3)
memory usage: 16.8+ KB


The training set is now ready for further analysis. I save both sets on disk in HDF5 format.

### Save datasets in HDF5

In [37]:
# save test df 
test_df.to_hdf('music_data.h5', 'test_df', min_itemsize = {'values': 50})

In [38]:
# save training df
train_df.to_hdf('music_data.h5', 'train_df', min_itemsize = {'values': 50})

In [39]:
# check the result
print pd.HDFStore('music_data.h5')

<class 'pandas.io.pytables.HDFStore'>
File path: music_data.h5
/test_df             frame        (shape->[536,13])
/train_df            frame        (shape->[143,12])


<a id='Data-overview'></a>

## Data overview  
  
I have two datasets with track attributes from the Echo Nest API. The next step is to take a look at what I'm working with.

### Test data overview

In [41]:
print ("There are {} tracks in the test dataset.\n"
       .format(len(test_df)))
print "Below is a random sample of the dataset."
test_df.sample(n=3)

There are 536 tracks in the test dataset.

Below is a random sample of the dataset.


| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | artist | song |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98 | 0.03 | 0.62 | 0.9 | 0.0 | 7 | -7.11 | 1 | 0.04 | 100.0 | 4 | 0.54 | LCMDF | Future Me |
| 376 | 0.0 | 0.46 | 0.69 | 0.0 | 6 | -6.13 | 0 | 0.08 | 108.97 | 4 | 0.62 | The Black Keys | Tighten Up |
| 275 | 0.62 | 0.49 | 0.56 | 0.76 | 2 | -8.7 | 1 | 0.03 | 151.01 | 3 | 0.27 | Junip | Turn to the Assassin |


### Training data overview

In [42]:
print ("There are {} tracks in the CSV file.\n"
       "{} tracks have no data available "
       "in the Echo Nest API.\n"
       "We are left with {} tracks to use as training data.\n"
       .format(len(train_data), 
               (len(train_data) - len(train_df)),
              len(train_df)))
print "Below is a random sample of the dataset."
train_df.sample(n=3)

There are 163 tracks in the CSV file.
20 tracks have no data available in the Echo Nest API.
We are left with 143 tracks to use as training data.

Below is a random sample of the dataset.


| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | song | artist | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 156 | 0.94 | 0.17 | 0.07 | 0.96 | 4 | -31.64 | 1 | 0.03 | 88.79 | 4 | 0.15 | Mental boy | Thomas Newman | yoga |
| 72 | 0.55 | 0.64 | 0.19 | 0.84 | 6 | -19.37 | 1 | 0.04 | 120.0 | 4 | 0.07 | they move on tracks of never-ending light | this will destroy you | yoga |
| 39 | 0.0 | 0.47 | 0.88 | 0.01 | 1 | -5.09 | 1 | 0.06 | 176.97 | 4 | 0.64 | Five Seconds | Twin Shadow | cycling |


In [43]:
# list of categories
categories = list(pd.unique(train_df.category.ravel()))

print ("Tracks in the dataset belong " 
       "to {} classes: {}."
       .format(len(categories), ", ".join(categories)))

# count tracks in each category
cat_count = pd.value_counts(train_df.category.ravel())

# print categories
for category in categories:
    print ("{} tracks represent \'{}\' class."
           .format(cat_count[category], category))

Tracks in the dataset belong to 3 classes: ballet, cycling, yoga.
49 tracks represent 'ballet' class.
45 tracks represent 'cycling' class.
49 tracks represent 'yoga' class.
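
The counts above show that the three classes are roughly balanced. A quick way to put a number on that is to compute each class's share and the accuracy of always guessing the most frequent class. This is a small sketch using the counts printed above, and the variable names are my own.

```python
from collections import Counter

# Class counts copied from the output above.
counts = Counter({"ballet": 49, "cycling": 45, "yoga": 49})
total = sum(counts.values())

# Share of each class in the training set.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(counts.values()) / total
```

Any classifier trained on this data should beat that baseline (about one in three correct) to be worth using.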


<a id='Summary'></a>

## Summary
In this notebook I described the data gathering and cleaning process for further analysis. I parsed the iTunes music library XML file to create a dataframe as the test dataset. I also transformed the CSV file with labelled tracks into a dataframe as the training set. Using the Echo Nest API I got track attributes for both sets.

As a result of the above manipulations I created two pandas dataframes: 
* the training dataframe contains 143 tracks, each labelled with one of the three classes: "ballet", "cycling", "yoga";
* the test dataframe contains 536 non-labelled tracks. 

Each set has 11 music attributes for every song which I'm going to analyse in the following posts.

The next step in my analysis is to visualize both datasets and examine track attributes, which I cover in the next post [02_Data_visualisation](https://github.com/Tykovka/itunes-music-analysis/blob/master/02_Data_Visualisation.ipynb).