# iTunes music library analysis: Data preparation

This is the first post in a series devoted to analyzing an iTunes music library with Scikit-Learn tools.   
This notebook covers gathering, preparing and cleaning the datasets I'm going to use in the analysis.
For a summary of the analysis, its goals, methods and installation notes, please go to [link]. 

## Preliminaries

One of the main goals of this analysis is to explore the basics of Scikit-Learn tools. **[Scikit-Learn](http://scikit-learn.org)** is a popular Python package designed to give access to well-known machine learning algorithms within Python code.

Scikit-Learn is built upon Python's **[NumPy (Numerical Python)](http://www.numpy.org/)** and **[SciPy (Scientific Python)](http://scipy.org/)** libraries, which enable efficient in-core numerical and scientific computation within Python. 

I also use the **[pandas](http://pandas.pydata.org/)** library in my analysis. Pandas is a Python package providing fast, flexible, and expressive data structures. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

**[Matplotlib](http://matplotlib.org/)** and **[Seaborn](https://stanford.edu/~mwaskom/software/seaborn/)** are used for data visualization. 

The hero and the foundation of my analysis is the **[Echo Nest API](http://the.echonest.com/)**, which provides broad and deep data on millions of artists and songs. **[Pyechonest](https://github.com/echonest/pyechonest)** is an open source Python library for the Echo Nest API that I use in this analysis. To use The Echo Nest API, an API key is required. More about the API key [here](http://developer.echonest.com/raw_tutorials/register.html).

I start by importing all the modules. 

In [278]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display
import pandas as pd
import numpy as np

# set seaborn plot defaults
import seaborn as sns; sns.set(palette="husl")

# import pyechonest
from pyechonest import config
# pass my API key (replace MY_API_KEY with your own key)
config.ECHO_NEST_API_KEY = "MY_API_KEY"

## Data preparation

In this analysis I'm dealing with a classification problem. I'm interested in three classes of music ("cycling", "yoga", "ballet") and I want to find the tracks in my iTunes music library that fit these classes. To solve this problem I use *machine learning classification algorithms*.  

Machine learning is about creating models from data: a model learns from training data (data with class labels) and can then be used to predict the class labels of test data (data without class labels).
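
To make the split between training and test data concrete, here is a minimal sketch that uses a 1-nearest-neighbour rule in plain Python. The rule is chosen only because it fits in a few lines; it is not necessarily the algorithm used later in this series. The feature values (acousticness, energy) are copied from tracks shown in the outputs of this notebook, and `predict_label` is a hypothetical helper.

```python
import math

# Training data: (acousticness, energy) -> class label.
# Values are copied from tracks shown in this notebook's outputs.
training = [
    ((0.001899, 0.879714), "cycling"),  # Twin Shadow, "Five Seconds"
    ((0.75, 0.35), "yoga"),             # Benn Jordan, "I"
]


def predict_label(features, labeled_points):
    """Return the label of the closest training point (1-NN)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    closest = min(labeled_points,
                  key=lambda point: distance(features, point[0]))
    return closest[1]


# Test data: an unlabeled track from the iTunes library
moon_river = (0.853775, 0.248916)       # Andrea Ross, "Moon river"
print(predict_label(moon_river, training))
```

The training points carry labels, the test point does not, and the model's only job is to supply the missing label.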

For the analysis I use two datasets:
1. My iTunes music library serves as the *test dataset*;
2. For the *training dataset* I made a csv file ('./labeled_tracks.csv') of hand-picked tracks from outside my iTunes library. I labeled each track with one of the three classes: "cycling", "yoga", "ballet". 

iTunes keeps track of its media in library files. One of them, 'iTunes Music Library.xml', is created automatically when you launch iTunes and contains information about the songs in the library, as stored in the iTunes database. On Mac OS X, it can be found in the directory '/Users/username/Music/iTunes'. More information about iTunes library files can be found [here](https://support.apple.com/en-us/HT201610).
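
The library file is an Apple property list, so its structure can be inspected with the standard library alone. Below is a minimal sketch for Python 3 (`plistlib.loads` is not available on Python 2). The key names "Tracks", "Name" and "Artist" are assumptions based on the usual layout of an iTunes export, `list_tracks` is a hypothetical helper, and the tiny hand-made library stands in for the real file.

```python
import plistlib


def list_tracks(xml_bytes):
    """Return (name, artist) pairs from iTunes library XML content."""
    library = plistlib.loads(xml_bytes)
    # Assumed layout: a top-level "Tracks" dict keyed by track ID
    tracks = library.get("Tracks", {})
    return [(t.get("Name"), t.get("Artist")) for t in tracks.values()]


# A tiny hand-made library standing in for the real file
sample = plistlib.dumps({
    "Tracks": {
        "101": {"Track ID": 101, "Name": "Moon river",
                "Artist": "Andrea Ross", "Kind": "MPEG audio file"},
    }
})
print(list_tracks(sample))
```

For a file on disk, `plistlib.load` accepts a file object opened in binary mode.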

For both datasets I use a sqlite3 database (DB), which I access through the **[sqlitedict](https://github.com/piskvorky/sqlitedict)** library. sqlitedict is a lightweight wrapper around Python's sqlite3 module with a simple, Pythonic dict-like interface. For this particular problem I find a DB more convenient than a CSV or XML file. 
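
To illustrate what a dict-like interface over sqlite3 means, here is a simplified sketch built only on the standard library. It is not sqlitedict's actual implementation: `TinySqliteDict` is a hypothetical class, and it serializes values as JSON purely to keep the example short. It does show one practical consequence of this kind of storage: a value read from the DB is a copy, so changes must be assigned back under the key.

```python
import json
import sqlite3


class TinySqliteDict(object):
    """Hypothetical, simplified dict-like wrapper around sqlite3."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(key TEXT PRIMARY KEY, value TEXT)")

    def __setitem__(self, key, value):
        # Values are serialized on every write
        self.conn.execute(
            "REPLACE INTO items (key, value) VALUES (?, ?)",
            (key, json.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM items WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # Each read returns a fresh copy of the stored value
        return json.loads(row[0])

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def __len__(self):
        return self.conn.execute(
            "SELECT COUNT(*) FROM items").fetchone()[0]


db = TinySqliteDict()
db["Moon river"] = {"artist": "Andrea Ross"}

# Changing a fetched value does not touch the database...
value = db["Moon river"]
value["tempo"] = 155.866
print("tempo" in db["Moon river"])

# ...until the value is assigned back under its key
db["Moon river"] = value
print("tempo" in db["Moon river"])
```

This read, update, write-back pattern is the one the functions in this notebook follow when they add attributes to a track.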

### Test dataset
I will start by processing the test data.  

To parse the iTunes XML file I use the **[pyItunes](https://github.com/liamks/pyitunes)** module, which makes it easier to access the tracks in the file. 

In [None]:
def get_itunes_track_data(song):
    """Return (title, artist) for a valid track, or None.

    Podcasts, iTunes U items, files that are not
    MPEG audio and tracks missing the artist's
    name are excluded.
    """
    if (song.genre in ('Podcast', 'iTunes U') or
            song.kind != 'MPEG audio file' or
            not song.artist):
        return None
    return song.name, song.artist

def parse_itunes_xml(db, xml_file):
    """Parse the xml file, get each song's title
    and artist's name, and save them to a database.
    """
    from pyItunes import Library
    library = Library(xml_file)

    for track_id, song in library.songs.items():
        track = get_itunes_track_data(song)
        # skip tracks that did not pass the validity check
        if track is None:
            continue
        song_title, artist = track
        # titles are the DB keys, so only the first
        # track with a given title is stored
        if not db.get(song_title):
            db[song_title] = {'artist': artist}

In [2]:
from sqlitedict import SqliteDict

# path to the iTunes xml file
xml_file = 'iTunes Music Library copy.xml'

# create a DB to store data from iTunes library
test_db = SqliteDict('./itunes_tracks', 
                     autocommit=True)

# call the parse_itunes_xml function and
# write the data into the DB.
parse_itunes_xml(test_db, xml_file)

In [4]:
# view the resulting DB
print ("The DB contains {} tracks."
       .format(len(test_db)))
print "Example of values for the track \"Moon river\": "
test_db['Moon river']

The DB contains 554 tracks.
Example of values for the track "Moon river": 


{'acousticness': 0.853775,
 'artist': 'Andrea Ross',
 'danceability': 0.204634,
 'energy': 0.248916,
 'instrumentalness': 0.00198,
 'key': 8,
 'loudness': -11.155,
 'mode': 1,
 'speechiness': 0.03465,
 'tempo': 155.866,
 'time_signature': 3,
 'valence': 0.185901}

The next step is to get song attributes from the Echo Nest API.  

In the analysis I use the following track attributes:

* **Acousticness** represents the likelihood a recording was created by solely acoustic means such as voice and acoustic instruments as opposed to electronically such as with synthesized, amplified, or effected instruments;
* **Danceability** describes how suitable a track is for dancing using a number of musical elements (tempo, rhythm stability, beat strength, and overall regularity);
* **Energy** represents a perceptual measure of intensity and powerful activity released throughout the track;
* **Instrumentalness** is a measure of how likely a song is to be instrumental;
* **Key** identifies the tonic triad, the chord (major or minor) on which the piece comes to rest; it is stored as an integer;
* **Loudness** measures the overall loudness of a track in decibels (dB);
* **Mode** indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived;
* **Speechiness** detects the presence of spoken words in a track;
* **Tempo** is the speed or pace of a given piece (in beats per minute);
* **Time signature** specifies how many beats are in each bar (or measure);
* **Valence** describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
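
Key and mode are stored as integers, which are easier to read once decoded. A minimal sketch, assuming the common pitch-class convention (0 = C, 1 = C#, up to 11 = B) and mode 1 = major, 0 = minor. Both mappings and the `describe_key` helper are assumptions, so check them against the API documentation before relying on them.

```python
# Pitch-class names for key values 0-11 (assumed convention: 0 = C)
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

# Assumed encoding of the mode attribute
MODES = {0: "minor", 1: "major"}


def describe_key(key, mode):
    """Return a readable name such as 'G# major'."""
    return "{} {}".format(PITCH_CLASSES[key], MODES[mode])


# 'key': 8 and 'mode': 1, as in the "Moon river" record
print(describe_key(8, 1))
```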

Using Pyechonest, the Echo Nest Python library, is easy and straightforward.  

The Echo Nest database doesn't provide data for every artist or song. I handle missing items with a "try-except" block.
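
The pattern is small enough to show on its own. In the sketch below, `fake_search` is a hypothetical stand-in for a remote search that returns a list of hits, and `first_hit_or_none` is a hypothetical helper; neither is part of Pyechonest. An empty result list raises `IndexError` on `[0]`, which is turned into `None`.

```python
def first_hit_or_none(search, **query):
    """Return the first search result, or None if there are none."""
    try:
        return search(**query)[0]
    except IndexError:
        return None


def fake_search(artist, title):
    """Hypothetical stand-in for a remote search: returns a list of hits."""
    catalog = {("Andrea Ross", "Moon river"): {"tempo": 155.866}}
    hit = catalog.get((artist, title))
    return [hit] if hit is not None else []


print(first_hit_or_none(fake_search,
                        artist="Andrea Ross", title="Moon river"))
print(first_hit_or_none(fake_search,
                        artist="Nobody", title="Nothing"))
```

Returning `None` instead of raising lets the caller record the miss and move on to the next track.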

In [6]:
def get_track_attr_data(artist_name, song_title):
    """Get track attributes data from 
    the Echo Nest database.
    """
    from pyechonest import song
    try: 
        result = song.search(artist=artist_name, 
                             title=song_title)
        song_result = result[0]
        song_data = song_result.audio_summary
        
        # returns a dictionary of song attributes
        return song_data
    
    except IndexError as e:
        print 'No data for the song', song_title
        return None
    
def pick_track_attr(song_data):
    """Pick required track attributes from a dict
    I got from the Echo Nest library.
    """
    song_val = {'time_signature' : song_data['time_signature'],
                'energy' : song_data['energy'],
                'tempo' : song_data['tempo'], 
                'speechiness' : song_data['speechiness'],
                'acousticness' : song_data['acousticness'], 
                'danceability' : song_data['danceability'],
                'instrumentalness' : song_data['instrumentalness'],
                'key' : song_data['key'],
                'loudness' : song_data['loudness'],
                'valence' : song_data['valence'],
                'mode' : song_data['mode']}
    
    return song_val
         
def set_echonest_attributes(db):
    """Look for new tracks in the DB, request  
    track attributes from the Echo Nest library 
    and save to the db.
    """
    from time import sleep

    for song_title, value in db.iteritems():
        # Skip tracks that have already been processed.
        # I test for the keys themselves, because a
        # tempo of 0.0 would count as false.
        if 'tempo' in value or 'No_data' in value:
            continue

        song_data = get_track_attr_data(value['artist'],
                                        song_title)

        # If the song is in the Echo Nest DB,
        # I add its attributes to the record.
        if song_data:
            value.update(pick_track_attr(song_data))
        # If not, I add a 'No_data' key to the record.
        else:
            value['No_data'] = True
        db[song_title] = value

        # Echo Nest limits the number of requests to 20 per minute
        sleep(8)

In [None]:
# call set_echonest_attributes function and 
# write the Echo Nest data into the iTunes DB.
# TODO: uncomment
# set_echonest_attributes(test_db)

In [5]:
# view the resulting DB
print ("The DB contains {} tracks."
       .format(len(test_db)))
print "Example of values for the track \"Moon river\":"
test_db['Moon river']

The DB contains 554 tracks.
Example of values for the track "Moon river":


{'acousticness': 0.853775,
 'artist': 'Andrea Ross',
 'danceability': 0.204634,
 'energy': 0.248916,
 'instrumentalness': 0.00198,
 'key': 8,
 'loudness': -11.155,
 'mode': 1,
 'speechiness': 0.03465,
 'tempo': 155.866,
 'time_signature': 3,
 'valence': 0.185901}

### Training dataset

For the training dataset I made a csv file ('./labeled_tracks.csv') of hand-picked tracks and labeled each track with one of the three classes: "cycling", "yoga", "ballet".

I use the pandas read_csv function to read the csv file and then write its rows into the sqlite3-backed dictionary.

In [None]:
def parse_tracks_from_csv(csv_file, db):
    """Read the csv file into a pandas dataframe and
    save its rows to a new DB for training data.
    """
    data = pd.read_csv(csv_file, index_col='song',
                       encoding='utf_8', header=0)
    # the index holds song titles, each row holds one track
    for song_title, row in data.iterrows():
        if not db.get(song_title):
            # the category column contains the class label
            db[song_title] = {'artist': row['artist'],
                              'category': row['category']}

In [6]:
# path to the csv file
# csv_file = './labeled_tracks.csv'

#TODO: change name of the db to './labeled_tracks'
# create a DB to store training data from the csv file
train_db = SqliteDict('./chosen_tracks', autocommit=True)

# call parse_tracks_from_csv function
# parse_tracks_from_csv(csv_file, train_db)

# write the Echo Nest data into the training DB
#TODO: uncomment
# set_echonest_attributes(train_db)

In [7]:
# view the resulting DB
print ("The DB contains {} tracks."
       .format(len(train_db)))
print "Example of values for the track \"Five Seconds\":"
train_db['Five Seconds']

The DB contains 128 tracks.
Example of values for the track "Five Seconds":


{'acousticness': 0.001899,
 'artist': u'Twin Shadow',
 'category': u'cycling',
 'danceability': 0.467563,
 'energy': 0.879714,
 'instrumentalness': 0.007045,
 'key': 1,
 'loudness': -5.086,
 'mode': 1,
 'speechiness': 0.057028,
 'tempo': 176.972,
 'time_signature': 4,
 'valence': 0.64461}

## Data overview  
  
I have two DBs with track attributes data from the Echo Nest API. The next step is to read in the data from both DBs and take a look at what I'm working with.

I transform the DBs into pandas dataframes (DF), which one can think of as Excel-like tables of values. Dataframes have various methods that make it easy to learn about the data they contain. For further analysis I keep in the DF only the tracks that have attributes data. 
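
The cells in this notebook were written for an older pandas release; `convert_objects` and `.ix`, which they rely on, were removed in pandas 1.0. As a hedged sketch of the same cleaning steps on a current pandas version, the hypothetical `db_to_dataframe` below takes any mapping of titles to attribute dicts (a plain dict stands in for the DB here) and uses `pd.to_numeric` for the conversion.

```python
import pandas as pd


def db_to_dataframe(db):
    """Turn a {title: {attribute: value}} mapping into a clean DF."""
    # transpose so that each track is a row
    df = pd.DataFrame(dict(db)).T

    # keep only tracks that were not flagged with 'No_data'
    if 'No_data' in df.columns:
        df = df[df['No_data'].isna()].drop(columns='No_data')
    df = df.copy()

    # convert columns to numbers where possible
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # text columns such as 'artist' stay as they are

    # turn the index (track titles) into a regular column
    return df.rename_axis('song_title').reset_index()


# a plain dict standing in for the DB
sample_db = {
    'Moon river': {'artist': 'Andrea Ross', 'tempo': 155.866, 'key': 8},
    'Unknown track': {'artist': 'Nobody', 'No_data': True},
}
clean = db_to_dataframe(sample_db)
print(clean)
```

Unlike a direct `df['No_data']` lookup, this version also works when every track has data and the 'No_data' column does not exist.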

In [8]:
def read_db_in_pandas(db):
    """ Read the DB and return a DF.
    """
    # transpose data to have tracks as rows
    df = pd.DataFrame(dict(db)).T
    
    # remove rows with no data for a song
    df_clean = df[df['No_data'] != 1]
    
    # convert columns into numbers
    df_clean = df_clean.convert_objects(convert_numeric=True)
    
    # convert index into a column
    df_clean.reset_index(level=0, inplace=True)
    df_clean.rename(columns = {'index': 'song_title'}, 
                    inplace=True)

    # remove the 'No_data' column
    df_clean.drop('No_data', 1, 
                  inplace=True)
    return df_clean

### Training data overview

In [9]:
# create a df with training data
train_df = read_db_in_pandas(train_db)

# format floating point numbers 
# within pandas data structures
pd.set_option('float_format', '{:.2f}'.format)

In [10]:
# rearrange the order of columns 
cols = train_df.columns.tolist()
cols = cols[0:1] + cols[2:4] + cols[1:2] + cols[4:]
train_df = train_df.ix[:, cols]

In [11]:
print ("There are {0} tracks in the dataset."
       .format(len(train_db)))
print ("{0} tracks have no data available "
       "in the Echo Nest API."
       .format(len(train_db) - len(train_df)))
print ("We are left with {0} tracks to use as training data."
       .format(len(train_df)))
print "\nBelow is a random sample of the dataset."
train_df.sample(n=4)

There are 128 tracks in the dataset.
37 tracks have no data available in the Echo Nest API.
We are left with 91 tracks to use as training data.

Below is a random sample of the dataset.


| | song_title | artist | category | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | I | Benn Jordan | yoga | 0.75 | 0.27 | 0.35 | 0.83 | 10 | -18.87 | 1 | 0.03 | 144.31 | 4 | 0.39 |
| 71 | TV Queen | Wild nothing | cycling | 0.0 | 0.44 | 0.86 | 0.21 | 9 | -5.9 | 1 | 0.05 | 134.95 | 4 | 0.72 |
| 47 | Lounge Me | Sunsphere | focus | 0.01 | 0.68 | 0.64 | 0.89 | 10 | -11.41 | 0 | 0.05 | 97.97 | 4 | 0.54 |
| 38 | In the fog III | Tim Hecker | focus | 0.46 | 0.17 | 0.32 | 0.94 | 5 | -9.03 | 0 | 0.04 | 118.82 | 4 | 0.04 |


In [12]:
# list of categories
categories = list(pd.unique(train_df.category.ravel()))

print ("Tracks in the dataset belong " 
       "to {} categories: {}."
       .format(len(categories), ", ".join(categories)))

# count tracks in each category
cat_count = pd.value_counts(train_df.category.ravel())

# print categories
for category in categories:
    print ("{} tracks represent \'{}\' category."
           .format(cat_count[category], category))

Tracks in the dataset belong to 3 categories: cycling, focus, yoga.
26 tracks represent 'cycling' category.
34 tracks represent 'focus' category.
31 tracks represent 'yoga' category.


### Test data overview

Now I move on to the test dataset.

In [292]:
# create a df with test data
test_df = read_db_in_pandas(test_db)

# rearrange the order of columns 
test_cols = test_df.columns.tolist()
test_cols = test_cols[0:1] + test_cols[2:3] + test_cols[1:2] + test_cols[3:]
test_df = test_df.ix[:, test_cols]

print ("There are {0} tracks in the dataset."
       .format(len(test_db)))
print ("{0} tracks have no data available " 
       "in the Echo Nest API."
       .format(len(test_db) - len(test_df)))
print ("We are left with {0} tracks to use as a test set."
       .format(len(test_df)))
print "\nBelow is a random sample of the dataset."
test_df.sample(n=4)

There are 554 tracks in the dataset.
223 tracks have no data available in the Echo Nest API.
We are left with 331 tracks to use as a test set.

Below is a random sample of the dataset.


| | song_title | artist | acousticness | danceability | energy | instrumentalness | key | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 308 | White Rain | Junip | 0.15 | 0.58 | 0.58 | 0.77 | 5 | -11.42 | 0 | 0.03 | 149.55 | 4 | 0.61 |
| 175 | Nanny Nanny Boo Boo | Le Tigre | 0.04 | 0.85 | 0.89 | 0.0 | 1 | -3.55 | 1 | 0.05 | 126.05 | 4 | 0.81 |
| 133 | LIAR LIAR | CASTAWAYS | 0.83 | 0.47 | 0.48 | 0.78 | 7 | -13.88 | 0 | 0.03 | 146.01 | 4 | 0.81 |
| 315 | Yeti's Lament | Berry Weight | 0.84 | 0.55 | 0.47 | 0.88 | 4 | -8.46 | 0 | 0.03 | 90.07 | 4 | 0.23 |


### Summary
In this notebook I described the data gathering and cleaning process that precedes further analysis. I parsed the iTunes music library XML file to create a database of tracks for the test dataset. I also transformed the csv file of labeled tracks into a sqlite3 database for the training set. Using the Echo Nest API I got track attributes for both sets.

As a result of the above manipulations I have two pandas dataframes: 
* the training dataframe contains 91 labeled tracks in three classes: "cycling", "yoga", "ballet";
* the test dataframe contains 331 unlabeled tracks. 

The next step in my analysis is to visualize both datasets and examine track attributes. 