# Business Understanding

This dataset is from the Free Music Archive, a collection of legally available audio files.  The numerical data consists of features extracted with librosa, a Python package for music analysis that quantifies characteristics of an mp3; each feature is summarized by statistics such as mean, skew, and kurtosis.  The dataset also includes unique codes for genres of music.  We will use these features to predict the genre of a piece from its librosa feature extraction.  This classification would be useful for a music streaming application such as Spotify that wants to integrate new music into its platform quickly, especially if the genre defined in the audio file's metadata doesn't match one of the genres defined in the application's database.  It would also help the application's recommendation system: with the main genre categories broadly defined, users could receive recommendations that sound similar.  For this use case, the model would be deployed to a production

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl

In [2]:
feature_df = pd.read_csv("data/features.csv", skiprows=range(1,4))
feature_df.rename(columns={'feature':'track_id'}, inplace=True)
genre_df = pd.read_csv("data/genres.csv")
track_df = pd.read_csv("data/tracks.csv", skiprows=[0,2])
track_df.rename(columns={'Unnamed: 0':'track_id'}, inplace=True)
feature_df.head()

Unnamed: 0,track_id,chroma_cens,chroma_cens.1,chroma_cens.2,chroma_cens.3,chroma_cens.4,chroma_cens.5,chroma_cens.6,chroma_cens.7,chroma_cens.8,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,2,7.180653,5.230309,0.249321,1.34762,1.482478,0.531371,1.481593,2.691455,0.866868,...,0.054125,0.012226,0.012111,5.75889,0.459473,0.085629,0.071289,0.0,2.089872,0.061448
1,3,1.888963,0.760539,0.345297,2.295201,1.654031,0.067592,1.366848,1.054094,0.108103,...,0.063831,0.014212,0.01774,2.824694,0.466309,0.084578,0.063965,0.0,1.716724,0.06933
2,5,0.527563,-0.077654,-0.27961,0.685883,1.93757,0.880839,-0.923192,-0.927232,0.666617,...,0.04073,0.012691,0.014759,6.808415,0.375,0.053114,0.041504,0.0,2.193303,0.044861
3,10,3.702245,-0.291193,2.196742,-0.234449,1.367364,0.998411,1.770694,1.604566,0.521217,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.0,3.542325,0.0408
4,20,-0.193837,-0.198527,0.201546,0.258556,0.775204,0.084794,-0.289294,-0.81641,0.043851,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993
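
The `skiprows=range(1,4)` call above discards three rows that follow the first header row. As a hedged alternative, assuming `features.csv` carries a three-row column header (feature name, statistic, number) followed by a row naming the index, pandas can keep that information as a column `MultiIndex` instead. The sketch below uses a tiny made-up CSV in place of the real file; its values and exact layout are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny stand-in for features.csv (made-up values; the layout is an assumption):
# three header rows (feature, statistic, number), a row naming the index, then data.
csv_text = """feature,chroma_cens,chroma_cens,zcr
statistics,kurtosis,mean,mean
number,01,01,01
track_id,,,
2,7.18,0.25,0.06
3,1.89,0.35,0.07
"""

# header=[0, 1, 2] builds a three-level column index; index_col=0 uses track_id as the row index.
features = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1, 2])

# Select every "mean" statistic across all features (level 1 is the statistic).
means = features.xs("mean", axis=1, level=1)
print(means)
```

Keeping the statistic as its own column level makes it possible to select, say, only the means, which the flattened names (`chroma_cens.1`, `zcr.6`, and so on) do not allow directly.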


In [3]:
genre_df.head()

Unnamed: 0,genre_id,#tracks,parent,title,top_level
0,1,8693,38,Avant-Garde,38
1,2,5271,0,International,2
2,3,1752,0,Blues,3
3,4,4126,0,Jazz,4
4,5,4106,0,Classical,5


In [4]:
track_df.columns

Index(['track_id', 'comments', 'date_created', 'date_released', 'engineer',
       'favorites', 'id', 'information', 'listens', 'producer', 'tags',
       'title', 'tracks', 'type', 'active_year_begin', 'active_year_end',
       'associated_labels', 'bio', 'comments.1', 'date_created.1',
       'favorites.1', 'id.1', 'latitude', 'location', 'longitude', 'members',
       'name', 'related_projects', 'tags.1', 'website', 'wikipedia_page',
       'split', 'subset', 'bit_rate', 'comments.2', 'composer',
       'date_created.2', 'date_recorded', 'duration', 'favorites.2',
       'genre_top', 'genres', 'genres_all', 'information.1', 'interest',
       'language_code', 'license', 'listens.1', 'lyricist', 'number',
       'publisher', 'tags.2', 'title.1'],
      dtype='object')

There are several options for picking out a genre label: genre_top and genres seem to be good contenders.  Let's look at a sample of these two columns.

In [5]:
track_df[['genre_top', 'genres']].head()

Unnamed: 0,genre_top,genres
0,Hip-Hop,[21]
1,Hip-Hop,[21]
2,Hip-Hop,[21]
3,Pop,[10]
4,,"[76, 103]"


genre_top is categorical and is missing for some tracks.  The genres column has no missing values, but each entry is a list of genre IDs, so we would have to decide how to pick a single label from it.  Let's try a different approach using the genre dataframe.
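
For reference, here is a minimal sketch of what using the genres column would involve. When read with `pd.read_csv`, each cell arrives as a string such as `"[76, 103]"`, so it has to be parsed before any label can be chosen. The "first listed ID" rule below is a hypothetical choice for illustration, not something the dataset prescribes.

```python
import ast
import pandas as pd

# Stand-in for track_df['genres'] as read from CSV: each cell is a string.
genres = pd.Series(["[21]", "[21]", "[21]", "[10]", "[76, 103]"], name="genres")

# Safely convert each string into a real Python list of ints.
genre_lists = genres.apply(ast.literal_eval)

# Hypothetical labelling rule: take the first listed genre ID (None for an empty list).
first_genre = genre_lists.apply(lambda ids: ids[0] if ids else None)
print(first_genre.tolist())
```

Any such rule is arbitrary for tracks with several genres, which is one reason to prefer a single-valued column like genre_top.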

In [6]:
top_ten_genres = genre_df.sort_values(by='#tracks', ascending=False)[:10]
top_ten_genres

Unnamed: 0,genre_id,#tracks,parent,title,top_level
31,38,38154,0,Experimental,38
14,15,34413,0,Electronic,15
11,12,32923,0,Rock,12
162,1235,14938,0,Instrumental,1235
9,10,13845,0,Pop,10
16,17,12706,0,Folk,17
22,25,9261,12,Punk,12
0,1,8693,38,Avant-Garde,38
20,21,8389,0,Hip-Hop,21
27,32,7268,38,Noise,38


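We'll only look at the top ten genres listed and, if the track's top genre is in this list, include that track in our reduced dataframe.  Because Punk, Avant-Garde, and Noise are subgenres (their parent is nonzero) and genre_top appears to hold only top-level genres, this filter keeps seven genres rather than ten.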

In [7]:
track_df = track_df[track_df['genre_top'].isin(top_ten_genres['title'].values)]
track_df['genre_top'].value_counts()

Rock            14182
Experimental    10608
Electronic       9372
Hip-Hop          3552
Folk             2803
Pop              2332
Instrumental     2079
Name: genre_top, dtype: int64
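
Only seven genres appear in the counts above, although ten were selected. A likely explanation, assuming genre_top only ever holds top-level genres, is that three of the ten are subgenres. The sketch below re-types the ten rows shown earlier and checks the `parent` column; treating `parent == 0` as "top-level" is an interpretation of that column.

```python
import pandas as pd

# The ten rows shown in top_ten_genres above, re-typed by hand.
top_ten_genres = pd.DataFrame({
    "genre_id": [38, 15, 12, 1235, 10, 17, 25, 1, 21, 32],
    "parent":   [0, 0, 0, 0, 0, 0, 12, 38, 0, 38],
    "title":    ["Experimental", "Electronic", "Rock", "Instrumental", "Pop",
                 "Folk", "Punk", "Avant-Garde", "Hip-Hop", "Noise"],
})

# Interpretation: a genre is top-level when it has no parent (parent == 0).
is_top_level = top_ten_genres["parent"] == 0
top_level_titles = top_ten_genres.loc[is_top_level, "title"].tolist()
subgenre_titles = top_ten_genres.loc[~is_top_level, "title"].tolist()

print(len(top_level_titles), "top-level:", top_level_titles)
print(len(subgenre_titles), "subgenres:", subgenre_titles)
```

Filtering `genre_df` on `parent == 0` before taking the top ten would be one way to select ten genres that can all match genre_top.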

In [8]:
feature_df = feature_df.set_index('track_id').join(track_df[['track_id', 'genre_top']].set_index('track_id'))
feature_df.dropna(how='any', axis=0, inplace=True)
feature_df.head()

track_id,chroma_cens,chroma_cens.1,chroma_cens.2,chroma_cens.3,chroma_cens.4,chroma_cens.5,chroma_cens.6,chroma_cens.7,chroma_cens.8,chroma_cens.9,...,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6,genre_top
2,7.180653,5.230309,0.249321,1.34762,1.482478,0.531371,1.481593,2.691455,0.866868,1.341231,...,0.012226,0.012111,5.75889,0.459473,0.085629,0.071289,0.0,2.089872,0.061448,Hip-Hop
3,1.888963,0.760539,0.345297,2.295201,1.654031,0.067592,1.366848,1.054094,0.108103,0.619185,...,0.014212,0.01774,2.824694,0.466309,0.084578,0.063965,0.0,1.716724,0.06933,Hip-Hop
5,0.527563,-0.077654,-0.27961,0.685883,1.93757,0.880839,-0.923192,-0.927232,0.666617,1.038546,...,0.012691,0.014759,6.808415,0.375,0.053114,0.041504,0.0,2.193303,0.044861,Hip-Hop
10,3.702245,-0.291193,2.196742,-0.234449,1.367364,0.998411,1.770694,1.604566,0.521217,1.982386,...,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.0,3.542325,0.0408,Pop
134,0.918445,0.674147,0.577818,1.281117,0.933746,0.078177,1.199204,-0.175223,0.925482,1.438509,...,0.016322,0.015819,4.731087,0.419434,0.06437,0.050781,0.0,1.806106,0.054623,Hip-Hop


With our label determined, let's prepare the data for our model.

In [9]:
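# Map each genre name to an integer code, in order of first appearance
label_mapping = {genre: code for code, genre in enumerate(feature_df['genre_top'].unique())}
y = feature_df['genre_top'].map(label_mapping, na_action='ignore').values
# Everything except the label column is a feature
X = feature_df.drop(columns=['genre_top'])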
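
As an aside, pandas offers a built-in way to produce the same kind of integer encoding. The sketch below uses `pd.factorize` on a small made-up series; it is an alternative to the hand-built mapping, not the approach used in this notebook.

```python
import pandas as pd

# Made-up labels standing in for feature_df['genre_top'].
genre_top = pd.Series(["Hip-Hop", "Hip-Hop", "Pop", "Rock", "Pop"])

# factorize assigns integer codes in order of first appearance and also
# returns the unique labels, so codes can be mapped back to names later.
codes, uniques = pd.factorize(genre_top)

print(codes.tolist())
print(list(uniques))

# Decoding: index the unique labels with the codes.
decoded = uniques[codes]
```

Keeping `uniques` around is useful when reporting predictions, since the model will output codes rather than genre names.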

There are a lot of numerical columns in the feature_df, so let's use PCA to reduce the dimensionality.

In [10]:
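from sklearn.decomposition import PCA
# Fit a 5-component PCA on the raw (unscaled) features
pca = PCA(n_components=5)
pca.fit(X)
# Total share of variance captured by the 5 components;
# this cell only inspects that share, X itself is left unchanged
sum(pca.explained_variance_ratio_)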

0.9622430561826366

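It only takes 5 components to reach 96% explained variance, most likely because the columns are different summary statistics computed over the same underlying values.  The features were also not standardized before fitting, so columns with large ranges may dominate the leading components.  Now to split the training and test data.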
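
PCA is sensitive to the scale of its inputs, so the explained-variance figure depends on whether the features are standardized first. The sketch below illustrates this on synthetic data (not the FMA features): one column is put on a much larger scale, and PCA is fit with and without a `StandardScaler`. All names ending in `_demo` are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: five independent columns, one on a much larger scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
X_demo[:, 0] *= 1000

# PCA on raw values: the large-scale column dominates the first component.
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: variance is spread across components.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print("raw, first component:   ", raw_ratio[0])
print("scaled, first component:", scaled_ratio[0])
```

If PCA is used as a modeling step, one common practice is to fit such a pipeline on the training split only and then apply it to the test split, so that no information from the test data influences the transformation.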

In [11]:
from sklearn.model_selection import train_test_split
# Fix random_state so the shuffle (and therefore the split) is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print(len(X_train))
print(len(X_test))

35942
8986


A random 80/20 train/test split is appropriate for our dataset because the data points aren't ordered in time, so there is no need to hold out the most recent tracks.  Shuffling is especially important because the data lists multiple tracks from the same album sequentially; without it, the test set could be drawn from a narrow, contiguous slice of albums.
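
Since tracks from the same album sit next to each other, a random shuffle can also place tracks from one album on both sides of the split. If that is a concern, a group-aware split is one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data; `album_id` is a hypothetical grouping column that would have to be built from the track metadata, and this is not the split used in the notebook.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 tracks from 5 hypothetical albums (two tracks each).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0])
album_id = np.repeat(np.arange(5), 2)  # hypothetical grouping column

# test_size applies to groups (albums), not to individual tracks.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=album_id))

# No album contributes tracks to both sides of the split.
shared = set(album_id[train_idx]) & set(album_id[test_idx])
print(len(train_idx), len(test_idx), shared)
```

The trade-off is that the test fraction is only approximate in terms of tracks, because whole albums are assigned to one side or the other.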