# 01. Machine Learning for music playlists: Data preparation

This is the first post in a series devoted to building music playlists with Scikit-Learn tools.
This notebook covers gathering, preparing, and cleaning the datasets I'm going to use in my analysis.
For an overview of this analysis, its goals, methods, and installation notes, please go to [00_Overview](http://localhost:8888/notebooks/00_Overview.ipynb).

#### Contents of the notebook
* [Machine Learning Intro](http://localhost:8888/notebooks/01_Data_preparation.ipynb#Machine-Learning-Intro)
* [Preliminaries](http://localhost:8888/notebooks/01_Data_preparation.ipynb#Preliminaries)
* [Data preparation](http://localhost:8888/notebooks/01_Data_preparation.ipynb#Data-preparation)
* [Data overview](http://localhost:8888/notebooks/01_Data_preparation.ipynb#Data-overview)
* [Summary](http://localhost:8888/notebooks/01_Data_preparation.ipynb#Summary)

## Machine Learning Intro
In this analysis I'm interested in three classes of music ("cycling", "ballet", "yoga") and I want to find tracks in my iTunes music library that fit these classes. This is a multiclass classification problem. Classification is the task of predicting the value of a categorical variable given some input variables (the features). 

To solve that problem I use *supervised machine learning classification algorithms*.  

Supervised machine learning is about creating models from data: a model learns from training data (data with class labels) and can then be used to predict the labels of test data (data without class labels). Thus the task of supervised learning is to construct an estimator that is able to predict the label of an object given its set of features. One can also think of classification as a function estimation problem, where the function that we want to estimate separates the classes.
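
To make the train-then-predict idea concrete, here is a minimal sketch in plain Python 3. It is not the model used in this series: the nearest-centroid rule, the two features, and all the numbers are made up for illustration only.

```python
def fit_centroids(features, labels):
    """Learn one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids,
               key=lambda label: squared_distance(centroids[label], x))


# Hypothetical labeled training data: two made-up features per track.
train_X = [[0.9, 0.70], [0.8, 0.65], [0.2, 0.35], [0.3, 0.40]]
train_y = ["cycling", "cycling", "yoga", "yoga"]

model = fit_centroids(train_X, train_y)

# Unlabeled "test" tracks receive a predicted label.
print(predict(model, [0.85, 0.68]))
print(predict(model, [0.25, 0.30]))
```

The split into a fitting step and a predicting step mirrors how the training and test datasets are used later in the series.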

## Preliminaries
One of the main goals of this analysis is to explore the basics of Scikit-Learn tools. **[Scikit-Learn](http://scikit-learn.org)** is a popular Python package designed to give access to well-known machine learning algorithms within Python code. 

Scikit-Learn is built upon Python's **[NumPy (Numerical Python)](http://www.numpy.org/)** and **[SciPy (Scientific Python)](http://scipy.org/)** libraries, which enable efficient in-core numerical and scientific computation within Python. 

I also use the **[pandas](http://pandas.pydata.org/)** library in my analysis. Pandas is a Python package providing fast, flexible, and expressive data structures. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

The hero and the foundation of my analysis is the **[Echo Nest API](http://the.echonest.com/)**, which provides broad and deep data on millions of artists and songs. **[Pyechonest](https://github.com/echonest/pyechonest)** is an open source Python library for the Echo Nest API that I use in this analysis. To use The Echo Nest API, an API key is required. More about the API key [here](http://developer.echonest.com/raw_tutorials/register.html).

In the analysis I use the following track attributes, which are available through the Echo Nest API:

* **Acousticness** represents the likelihood a recording was created by solely acoustic means such as voice and acoustic instruments as opposed to electronically such as with synthesized, amplified, or effected instruments;
* **Danceability** describes how suitable a track is for dancing using a number of musical elements (tempo, rhythm stability, beat strength, and overall regularity);
* **Energy** represents a perceptual measure of intensity and powerful activity released throughout the track;
* **Instrumentalness** is a measure of how likely a song is to be instrumental;
* **Key** identifies the tonic triad (the chord, major or minor) of a track. Key values start at 0 (C) and ascend the chromatic scale;
* **Loudness** measures the overall loudness of a track in decibels (dB);
* **Mode** indicates the modality (major (1) or minor (0)) of a track, the type of scale from which its melodic content is derived;
* **Speechiness** detects the presence of spoken words in a track;
* **Tempo** is the speed or pace of a given piece (in beats per minute);
* **Time signature** specifies how many beats are in each measure;
* **Valence** describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
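
Since "key" and "mode" are stored as integer codes, a small helper can make them readable. The sketch below (Python 3) assumes twelve key values, 0 to 11, named with sharps; the function name `describe_key` and that naming choice are mine, not part of the Echo Nest API.

```python
# Pitch classes in ascending chromatic order, starting from C (key = 0).
# Naming the black keys with sharps is an assumption made for this sketch.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]


def describe_key(key, mode):
    """Turn integer key and mode codes into a name such as 'A minor'."""
    if not 0 <= key < len(PITCH_CLASSES):
        raise ValueError("key must be between 0 and 11")
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (minor) or 1 (major)")
    scale = "major" if mode == 1 else "minor"
    return "{} {}".format(PITCH_CLASSES[key], scale)


print(describe_key(0, 1))
print(describe_key(9, 0))
```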


I start by importing the modules required in this notebook.

In [1]:
from IPython.display import display
import pandas as pd
import numpy as np

# format floating point numbers
# within pandas data structures
pd.set_option('float_format', '{:.2f}'.format)

# import pyechonest
from pyechonest import config

# pass my API key
config.ECHO_NEST_API_KEY="1RNDIJ5SITBKZFDCT"
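
As an alternative to writing the key into the notebook, it could be read from the environment. This is only a sketch (Python 3): the helper `load_api_key` is hypothetical, and using an environment variable named `ECHO_NEST_API_KEY` is my assumption.

```python
import os


def load_api_key(env=os.environ, name="ECHO_NEST_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {} environment variable first".format(name))
    return key


# Demonstration with a stand-in mapping instead of the real environment.
# In the notebook this would become:
#     config.ECHO_NEST_API_KEY = load_api_key()
print(load_api_key({"ECHO_NEST_API_KEY": "MY-KEY"}))
```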

## Data preparation
In the analysis I use two datasets:
1. My iTunes music library serves as the *test dataset*, i.e. non-labeled data;
2. For the *training dataset* I made a csv file ('./labeled_tracks.csv') with hand-picked tracks from outside my iTunes library. I labeled each track with one of the three classes: "cycling", "yoga", "ballet".

The iTunes library file keeps track of the media in iTunes. This file, called 'iTunes Music Library.xml', is created automatically when you launch iTunes and contains the information stored in the iTunes database about the songs in the library. On Mac OS X, it can be found in the directory '/Users/username/Music/iTunes'. More information about iTunes library files can be found [here](https://support.apple.com/en-us/HT201610).

Throughout the analysis I use the pandas DataFrame (DF) data structure, which one can think of as an Excel-like table of values. DataFrames have various methods that can be called to easily learn about the data contained in them.
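
For readers new to pandas, here is a minimal sketch (Python 3) of what working with a DataFrame looks like. The rows are made up purely to show the shape of the data structure.

```python
import pandas as pd

# Made-up rows, only to illustrate the table-like structure.
tracks = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist C"],
    "song": ["Song 1", "Song 2", "Song 3"],
    "tempo": [120.0, 84.5, 148.1],
})

print(tracks.shape)             # (number of rows, number of columns)
print(tracks.columns.tolist())  # column names
print(tracks["tempo"].mean())   # a column behaves like a numeric series

# Rows can be filtered with a boolean condition on a column.
fast_tracks = tracks[tracks["tempo"] > 100]
print(fast_tracks)
```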

To store the data and the results of my computations I use the **[HDF5](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5)** format. HDF5 lets me treat a local file as a hash (a key-value store) and work directly with pandas DataFrames. Very cool. It's trivial to read from and write to this file using pandas.

### Test dataset
I will start by processing the test data.  

To parse the iTunes xml file I use the **[pyItunes](https://github.com/liamks/pyitunes)** module, which makes it easier to access tracks in the xml file.

In [2]:
def get_itunes_track_data(song):
    """Check the validity of the track, 
    exclude podcasts and tracks 
    missing artist's name.
    """
    if (song.genre == 'Podcast' or 
        song.genre == u'iTunes U' or 
        song.kind != 'MPEG audio file' or 
        not song.artist): 
        return None 
    else:
        return song.name, song.artist

def parse_itunes_xml(xml_file, features):
    """Parse xml, get song's title
    and artist's name. Return DataFrame.
    """
    from pyItunes import Library
    l = Library(xml_file)
    
    # create empty df with feature names as columns
    df = pd.DataFrame(columns=features)
    
    for _, song in l.songs.items():
        track = get_itunes_track_data(song)
        # skip podcasts, iTunes U items, non-MPEG files
        # and tracks without an artist
        if track is None:
            continue
        song_title, artist = track
        df = df.append({'artist' : unicode(artist), 
                        'song' : unicode(song_title)}, 
                       ignore_index=True)
    return df
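
The library file is an XML property list, so the same filtering can also be sketched with only the Python 3 standard library. This is not the pyItunes approach used above: the sample XML below is hand-written for illustration and `valid_tracks` is a hypothetical helper.

```python
import plistlib

# A tiny hand-written stand-in for 'iTunes Music Library.xml'.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>1</key>
    <dict>
      <key>Name</key><string>Song 1</string>
      <key>Artist</key><string>Artist A</string>
      <key>Kind</key><string>MPEG audio file</string>
    </dict>
    <key>2</key>
    <dict>
      <key>Name</key><string>Episode 5</string>
      <key>Artist</key><string>Some Show</string>
      <key>Genre</key><string>Podcast</string>
      <key>Kind</key><string>MPEG audio file</string>
    </dict>
    <key>3</key>
    <dict>
      <key>Name</key><string>Untitled</string>
      <key>Kind</key><string>MPEG audio file</string>
    </dict>
  </dict>
</dict>
</plist>
"""


def valid_tracks(xml_bytes):
    """Return (song, artist) pairs, using the same filters as above."""
    library = plistlib.loads(xml_bytes)
    pairs = []
    for track in library.get("Tracks", {}).values():
        if (track.get("Genre") in ("Podcast", "iTunes U")
                or track.get("Kind") != "MPEG audio file"
                or not track.get("Artist")):
            continue
        pairs.append((track.get("Name"), track["Artist"]))
    return pairs


print(valid_tracks(SAMPLE_XML))
```

Only the first sample track passes the filters: the second is a podcast and the third has no artist.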

In [3]:
# path to the iTunes xml file (I copied it)
xml_file = 'iTunes Music Library copy.xml'

# list of features to use in the analysis
features = ['acousticness',
            'danceability',
            'energy',
            'instrumentalness',
            'key',
            'loudness',
            'mode',
            'speechiness',
            'tempo',
            'time_signature',
            'valence']

# create a DF for data from iTunes library
test_data = parse_itunes_xml(xml_file, features)

In [4]:
# review the result
print ("Test dataset contains {} tracks."
       .format(len(test_data)))
print 
print test_data.info()

Test dataset contains 669 tracks.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 669 entries, 0 to 668
Data columns (total 13 columns):
acousticness        0 non-null object
danceability        0 non-null object
energy              0 non-null object
instrumentalness    0 non-null object
key                 0 non-null object
loudness            0 non-null object
mode                0 non-null object
speechiness         0 non-null object
tempo               0 non-null object
time_signature      0 non-null object
valence             0 non-null object
artist              669 non-null object
song                669 non-null object
dtypes: object(13)
memory usage: 73.2+ KB
None


There are 13 columns in the DataFrame (11 track attributes plus "artist" and "song"), but only two of them have data: "song" and "artist". The next step is to get song attributes from the Echo Nest API to fill in the other columns in the DF.

Using the Echo Nest Python library Pyechonest is super easy and straightforward.  

The Echo Nest database doesn't provide data for every artist or song. I handle missing items with a "try-except" block.

In [5]:
def get_track_attr_data(artist_name, song_title):
    """Get track attributes data from 
    the Echo Nest database.
    """
    from pyechonest import song
    try: 
        result = song.search(artist=artist_name, 
                             title=song_title)
        song_result = result[0]
        song_data = song_result.audio_summary
        
        # returns a dictionary of song attributes
        return song_data
    
    except IndexError as e:
        return None
         
def add_features_from_echonest(df, features):
    """Request track features from
    the Echo Nest database and add to the DF.
    Return DF.
    """
    from time import sleep
    
    for i in df.index.tolist():
        # Skip tracks whose attributes have already been added
        # (no request is made, so there is no need to wait)
        if pd.notnull(df.loc[i, 'tempo']): 
            continue

        song_data = get_track_attr_data(df.loc[i, 'artist'], 
                                        df.loc[i, 'song'])
        
        # If the song is in the Echo Nest DB, 
        # I add data to the DF.
        if song_data:
            for f in features:
                df.loc[i, f] = song_data[f]
        # If not, I drop the track. 
        else:
            df = df.drop(i)
        # Echo Nest limits number of requests to 20 per minute,
        # so I pause after every request
        sleep(6) 
    return df
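
The two ideas in this cell, pausing between requests and tolerating missing results, can be tried without network access. The sketch below (Python 3) uses a fake lookup in place of the Echo Nest; `fetch_all`, `fake_lookup`, and the sample data are hypothetical.

```python
import time


def fetch_all(keys, lookup, pause_seconds=6, sleep=time.sleep):
    """Call lookup(key) for each key, pausing between calls.

    Keys for which lookup raises IndexError (no search results)
    are collected separately instead of stopping the loop.
    """
    found, missing = {}, []
    for n, key in enumerate(keys):
        if n:  # no need to wait before the very first call
            sleep(pause_seconds)
        try:
            found[key] = lookup(key)
        except IndexError:
            missing.append(key)
    return found, missing


# Hypothetical stand-in for a search that returns a list of hits.
FAKE_DB = {("Artist A", "Song 1"): [{"tempo": 120.0}]}


def fake_lookup(key):
    """Return the first hit; raises IndexError when nothing matches."""
    return FAKE_DB.get(key, [])[0]


found, missing = fetch_all(
    [("Artist A", "Song 1"), ("Artist B", "Song 2")],
    fake_lookup,
    sleep=lambda seconds: None,  # skip the real pause in this demo
)
print(found)
print(missing)
```

Passing the sleep function as a parameter keeps the pause in real runs while letting a quick check run instantly.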

Next I call the function to add track features from the Echo Nest data to the DF.

In [6]:
# add features
test_df = add_features_from_echonest(test_data, features)

In [8]:
# view the resulting DF
print ("Test dataset contains {} tracks.\n"
       .format(len(test_df))) 
test_df.info()

Test dataset contains 536 tracks.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536 entries, 0 to 668
Data columns (total 13 columns):
acousticness        536 non-null float64
danceability        536 non-null float64
energy              536 non-null float64
instrumentalness    536 non-null float64
key                 536 non-null int64
loudness            536 non-null float64
mode                536 non-null int64
speechiness         536 non-null float64
tempo               536 non-null float64
time_signature      536 non-null int64
valence             536 non-null float64
artist              536 non-null object
song                536 non-null object
dtypes: float64(8), int64(3), object(2)
memory usage: 58.6+ KB


The table above gives an overview of the data in the DF columns. There are 536 tracks, or rows, in the test set. We have 11 numeric features (the data from the Echo Nest) and 2 columns with text data. Three of the features, "key", "mode", and "time_signature", are discrete rather than continuous like the others, and take a limited number of values. E.g. "time_signature" has 4 values: 1, 3, 4 or 5. pandas can store such columns as floats (as happens in the training set), so I explicitly cast these three features to integer.

In [9]:
# change dtype to integer
test_df[['key', 'mode', 'time_signature']] = test_df[['key', 'mode', 'time_signature']].astype(int)

test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536 entries, 0 to 668
Data columns (total 13 columns):
acousticness        536 non-null float64
danceability        536 non-null float64
energy              536 non-null float64
instrumentalness    536 non-null float64
key                 536 non-null int64
loudness            536 non-null float64
mode                536 non-null int64
speechiness         536 non-null float64
tempo               536 non-null float64
time_signature      536 non-null int64
valence             536 non-null float64
artist              536 non-null object
song                536 non-null object
dtypes: float64(8), int64(3), object(2)
memory usage: 58.6+ KB


### Training dataset

For the training dataset I made a csv file ('./labeled_tracks.csv') with hand-picked tracks and labeled each track with one of the three classes: "cycling", "ballet", "yoga".

I use the pandas read_csv function to read the csv file into a DataFrame.

In [10]:
# path to the csv file
csv_file = './labeled_tracks.csv'

# transform the csv file to a DF
train_data = pd.read_csv(csv_file, encoding='utf_8', 
                       header=0)

# add features as empty columns
for f in features:
    train_data[f] = np.nan

In [11]:
# review the result
print ("Training dataset contains {} tracks."
       .format(len(train_data)))
print 
print train_data.info()

Training dataset contains 163 tracks.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 14 columns):
song                163 non-null object
artist              163 non-null object
category            163 non-null object
acousticness        0 non-null float64
danceability        0 non-null float64
energy              0 non-null float64
instrumentalness    0 non-null float64
key                 0 non-null float64
loudness            0 non-null float64
mode                0 non-null float64
speechiness         0 non-null float64
tempo               0 non-null float64
time_signature      0 non-null float64
valence             0 non-null float64
dtypes: float64(11), object(3)
memory usage: 19.1+ KB
None


In [12]:
# add features to the training DF
train_df = add_features_from_echonest(train_data, features)

In [13]:
# get info about data
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
song                143 non-null object
artist              143 non-null object
category            143 non-null object
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null float64
loudness            143 non-null float64
mode                143 non-null float64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null float64
valence             143 non-null float64
dtypes: float64(11), object(3)
memory usage: 16.8+ KB


I want to change the order of columns in the training set to keep it consistent with the test dataset. 

In [14]:
cols = train_df.columns.tolist()
cols = cols[3:] + cols[:3]
cols

['acousticness',
 'danceability',
 'energy',
 'instrumentalness',
 'key',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'time_signature',
 'valence',
 u'song',
 u'artist',
 u'category']

In [15]:
# change order of columns
train_df = train_df[cols]

In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null float64
loudness            143 non-null float64
mode                143 non-null float64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null float64
valence             143 non-null float64
song                143 non-null object
artist              143 non-null object
category            143 non-null object
dtypes: float64(11), object(3)
memory usage: 16.8+ KB


As in the test set, I convert dtype in columns "key", "mode", and "time_signature" to integer.

In [17]:
train_df[['key', 'mode', 'time_signature']] = train_df[['key', 'mode', 'time_signature']].astype(int)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 162
Data columns (total 14 columns):
acousticness        143 non-null float64
danceability        143 non-null float64
energy              143 non-null float64
instrumentalness    143 non-null float64
key                 143 non-null int64
loudness            143 non-null float64
mode                143 non-null int64
speechiness         141 non-null float64
tempo               143 non-null float64
time_signature      143 non-null int64
valence             143 non-null float64
song                143 non-null object
artist              143 non-null object
category            143 non-null object
dtypes: float64(8), int64(3), object(3)
memory usage: 16.8+ KB


After a quick glimpse of the data, I save both DataFrames to disk in HDF5 format.

In [19]:
# save test df 
test_df.to_hdf('music_data.h5', 'test_df', min_itemsize = {'values': 50})

In [19]:
# save training df
train_df.to_hdf('music_data.h5', 'train_df', min_itemsize = {'values': 50})

In [20]:
# check the result
print pd.HDFStore('music_data.h5')

<class 'pandas.io.pytables.HDFStore'>
File path: music_data.h5
/test_df             frame        (shape->[536,13])
/train_df            frame        (shape->[143,12])


## Data overview

I have two datasets with track attributes from the Echo Nest API. The next step is to take a look at what I'm working with.

### Test data overview

In [21]:
print ("There are {} tracks in the xml file.\n"
       "{} tracks have no data available "
       "in the Echo Nest API.\n"
       "We are left with {} tracks to use as test data.\n"
       .format(len(test_data), 
               (len(test_data) - len(test_df)),
              len(test_df)))
print "Below is a random sample of the dataset."
test_df.sample(n=3)

There are 669 tracks in the xml file.
133 tracks have no data available in the Echo Nest API.
We are left with 536 tracks to use as test data.

Below is a random sample of the dataset.


| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | artist | song |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 115 | 0.53 | 0.68 | 0.77 | 0.83 | 0 | -10.61 | 0 | 0.03 | 130.39 | 4 | 0.39 | Junip | Without you |
| 355 | 0.73 | 0.81 | 0.37 | 0.23 | 5 | -13.66 | 1 | 0.04 | 81.62 | 4 | 0.38 | Moby | The Sky Is Broken |
| 157 | 0.37 | 0.66 | 0.34 | 0.01 | 4 | -13.84 | 1 | 0.04 | 104.98 | 4 | 0.38 | Phoebe Killdeer and the Short Straws | Let Me |


### Training data overview

In [22]:
print ("There are {} tracks in the csv file.\n"
       "{} tracks have no data available "
       "in the Echo Nest API.\n"
       "We are left with {} tracks to use as training data.\n"
       .format(len(train_data), 
               (len(train_data) - len(train_df)),
              len(train_df)))
print "Below is a random sample of the dataset."
train_df.sample(n=3)

There are 163 tracks in the csv file.
20 tracks have no data available in the Echo Nest API.
We are left with 143 tracks to use as training data.

Below is a random sample of the dataset.


| | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence | song | artist | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.96 | 0.37 | 0.28 | 0.0 | 4 | -8.69 | 1 | 0.04 | 84.39 | 4 | 0.21 | Skinny Love | Birdy | ballet |
| 145 | 0.03 | 0.5 | 0.69 | 0.0 | 7 | -3.91 | 1 | 0.04 | 148.13 | 4 | 0.46 | Light of love | Music Go Music | cycling |
| 49 | 0.01 | 0.62 | 0.93 | 0.0 | 9 | -4.82 | 1 | 0.07 | 122.0 | 4 | 0.88 | Lost on me | Peace | cycling |


In [23]:
# list of categories
categories = list(pd.unique(train_df.category.ravel()))

print ("Tracks in the dataset belong " 
       "to {} classes: {}."
       .format(len(categories), ", ".join(categories)))

# count tracks in each category
cat_count = pd.value_counts(train_df.category.ravel())

# print categories
for category in categories:
    print ("{} tracks represent \'{}\' class."
           .format(cat_count[category], category))

Tracks in the dataset belong to 3 classes: ballet, cycling, yoga.
49 tracks represent 'ballet' class.
45 tracks represent 'cycling' class.
49 tracks represent 'yoga' class.
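
To see how balanced the classes are, the counts can be turned into shares. This is a small standard-library sketch (Python 3) that rebuilds the label list from the counts printed above; `class_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: (count, share of all labels)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Label list rebuilt from the counts reported above.
labels = ["ballet"] * 49 + ["cycling"] * 45 + ["yoga"] * 49

shares = class_shares(labels)
for label, (n, share) in sorted(shares.items()):
    print("{:8s} {:3d} tracks ({:.1%})".format(label, n, share))
```

With roughly a third of the tracks in each class, the training set is close to balanced.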


## Summary
In this notebook I described the data gathering and cleaning process for further analysis. I parsed the iTunes music library xml file to create a dataframe as a test dataset. I also transformed the csv file with labeled tracks into a dataframe as a training set. Using the Echo Nest API I got track attributes for both sets.

As a result of the above manipulations I created two pandas dataframes:
* the training dataframe contains 143 tracks, each labeled with one of the three classes: "ballet", "cycling", "yoga";
* the test dataframe contains 536 non-labeled tracks.
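
The share of tracks that survived the Echo Nest lookup follows directly from the counts reported in this notebook. A short sketch of the arithmetic (Python 3; `coverage` is a hypothetical helper):

```python
def coverage(total, kept):
    """Return (number of dropped tracks, share of tracks kept)."""
    return total - kept, kept / total


# (dataset name, tracks in the source file, tracks with Echo Nest data)
for name, total, kept in [("test", 669, 536), ("training", 163, 143)]:
    dropped, share = coverage(total, kept)
    print("{}: {} of {} tracks kept, {} dropped ({:.1%} kept)"
          .format(name, kept, total, dropped, share))
```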

Each set has 11 music attributes for every song, which I'm going to analyse in the following posts.

The next step in my analysis is to visualize both datasets and examine track attributes, which I cover in the next post [02_Data_visualisation](http://localhost:8888/notebooks/02_Data_Visualisation.ipynb).