# Spotify Genre Classification 

Before we get started, let's define our workflow. A workflow keeps the work organized and helps you do the job effectively and efficiently. There are plenty of workflows, and it's entirely up to you which one to use. In this case, let's use the most common ML workflow, shown below.
![](res/workflow-oreilly.png)
*source: https://www.oreilly.com/ideas/how-graph-algorithms-improve-machine-learning*

We will call the second step "Exploratory Data Analysis", and we will not loop back for further analysis.


# Data Load

In [2]:
import pandas as pd
data = pd.read_csv('data/SpotifyFeatures.csv')
data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


# Exploratory Data Analysis

## Inspect Structure
The first thing to do with our data is to take a peek at its structure. Let's use `info`, `describe`, and `head` to see it.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232725 entries, 0 to 232724
Data columns (total 18 columns):
genre               232725 non-null object
artist_name         232725 non-null object
track_name          232725 non-null object
track_id            232725 non-null object
popularity          232725 non-null int64
acousticness        232725 non-null float64
danceability        232725 non-null float64
duration_ms         232725 non-null int64
energy              232725 non-null float64
instrumentalness    232725 non-null float64
key                 232725 non-null object
liveness            232725 non-null float64
loudness            232725 non-null float64
mode                232725 non-null object
speechiness         232725 non-null float64
tempo               232725 non-null float64
time_signature      232725 non-null object
valence             232725 non-null float64
dtypes: float64(9), int64(2), object(7)
memory usage: 32.0+ MB


In [4]:
data.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0
mean,41.127502,0.36856,0.554364,235122.3,0.570958,0.148301,0.215009,-9.569885,0.120765,117.666585,0.454917
std,18.189948,0.354768,0.185608,118935.9,0.263456,0.302768,0.198273,5.998204,0.185518,30.898907,0.260065
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,29.0,0.0376,0.435,182857.0,0.385,0.0,0.0974,-11.771,0.0367,92.959,0.237
50%,43.0,0.232,0.571,220427.0,0.605,4.4e-05,0.128,-7.762,0.0501,115.778,0.444
75%,55.0,0.722,0.692,265768.0,0.787,0.0358,0.264,-5.501,0.105,139.054,0.66
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


In [5]:
data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


If you have already read the previous notebook, you might notice that most of the features are the same. The only feature that hasn't been introduced is `genre`. Since it is clear that `genre` contains a string denoting the genre of the music, we will not discuss it any further. If you are not familiar with the features, or haven't read the previous notebook, please take a look at [Spotify Data Collection and Analysis]()

## Feature Drop

In this notebook, the goal is to classify a track's genre by its audio features. So we don't need `artist_name`, `track_name`, and `track_id`, as they should have nothing to do with the genre. What about `duration_ms`? In theory, there is a relation between genre and duration. You can check whether this feature affects the model's performance in the feature importance section.

Now, let's drop them.

In [7]:
unused_col = ['artist_name', 'track_name', 'track_id']
df = data.drop(columns=unused_col).reset_index(drop=True)

In [8]:
df.head()

Unnamed: 0,genre,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


## Feature Engineering

Now, we need our data in a form that mathematical operations can be applied to (since that is how machine learning works). So, let's convert several feature values into numerical ones.

In [9]:
df.select_dtypes(exclude='number').head()

Unnamed: 0,genre,key,mode,time_signature
0,Movie,C#,Major,4/4
1,Movie,F#,Minor,4/4
2,Movie,C,Minor,5/4
3,Movie,C#,Major,4/4
4,Movie,F,Major,4/4


In [10]:
df['time_signature'].unique().tolist()

['4/4', '5/4', '3/4', '1/4', '0/4']

If you remember from the previous notebook, `time_signature` ranged from 3 to 7. I assume that replacing each value with its numerator (i.e. 4/4 -> 4) won't lose information, because all the signatures have, and are supposed to have, the same denominator.
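A minimal sketch of that conversion, assuming every value has the form `numerator/denominator`. Splitting on `/` instead of taking the first character would also handle a two-digit numerator such as `12/8`, which does not occur in this dataset. The helper name `beats_per_bar` is hypothetical.

```python
import pandas as pd

def beats_per_bar(signature: str) -> int:
    """Return the numerator of a time signature string such as '4/4'."""
    numerator, _denominator = signature.split('/')
    return int(numerator)

# The five values that appear in the dataset
signatures = pd.Series(['4/4', '5/4', '3/4', '1/4', '0/4'])
beats = signatures.apply(beats_per_bar)
print(beats.tolist())  # [4, 5, 3, 1, 0]
```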

In [11]:
df['mode'].unique().tolist()

['Major', 'Minor']

Let's map Major -> 1 and Minor -> 0. Even though it's not numerical, it has only 2 values, so we can encode it as binary (0, 1): 0 means the absence of Major (which means Minor), and 1 means the presence of Major.

In [12]:
df['key'].unique().tolist()

['C#', 'F#', 'C', 'F', 'G', 'E', 'D#', 'G#', 'D', 'A#', 'A', 'B']

>In music theory, the key of a piece is the group of pitches, or scale, that forms the basis of a music composition in classical, Western art, and Western pop music - Wikipedia

Since the keys follow a natural order, we can transform them into a numerical representation, say 1 to 12.

In [13]:
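mode_dict = {'Major' : 1, 'Minor' : 0}
# Number the keys 1..12 in chromatic order so that every key gets a distinct value
key_dict = {'C' : 1, 'C#' : 2, 'D' : 3, 'D#' : 4, 'E' : 5, 'F' : 6, 
        'F#' : 7, 'G' : 8, 'G#' : 9, 'A' : 10, 'A#' : 11, 'B' : 12}

# Keep only the numerator of the time signature (e.g. '4/4' -> 4)
df['time_signature'] = df['time_signature'].apply(lambda x : int(x[0]))
# Assign the result back instead of calling replace(..., inplace=True) on a column
df['mode'] = df['mode'].map(mode_dict).astype(int)
df['key'] = df['key'].map(key_dict).astype(int)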

df.head()

Unnamed: 0,genre,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,0,0.611,0.389,99373,0.91,0.0,2,0.346,-1.828,1,0.0525,166.969,4,0.814
1,Movie,1,0.246,0.59,137373,0.737,0.0,7,0.151,-5.559,0,0.0868,174.003,4,0.816
2,Movie,3,0.952,0.663,170267,0.131,0.0,1,0.103,-13.879,0,0.0362,99.488,5,0.368
3,Movie,0,0.703,0.24,152427,0.326,0.0,2,0.0985,-12.178,1,0.0395,171.758,4,0.227
4,Movie,4,0.95,0.331,82625,0.225,0.123,6,0.202,-21.15,1,0.0456,140.576,4,0.39


Let's save this into a module so that we can reuse it later on the same dataset (saved in `scripts/preprocessing.py`).
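The contents of `scripts/preprocessing.py` are not shown in this notebook. The following is only a sketch of what such a module could look like, assuming it wraps the three conversions from the cell above and numbers the keys 1 to 12 in chromatic order. The names `preprocess`, `MODE_MAP`, `KEY_ORDER`, and `KEY_MAP` are hypothetical.

```python
import pandas as pd

MODE_MAP = {'Major': 1, 'Minor': 0}
KEY_ORDER = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
KEY_MAP = {key: i for i, key in enumerate(KEY_ORDER, start=1)}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with time_signature, mode and key cast to integers."""
    out = df.copy()
    # '4/4' -> 4: keep the numerator only
    out['time_signature'] = out['time_signature'].str.split('/').str[0].astype(int)
    out['mode'] = out['mode'].map(MODE_MAP).astype(int)
    out['key'] = out['key'].map(KEY_MAP).astype(int)
    return out

# Toy rows in the same shape as the dataset columns
sample = pd.DataFrame({
    'key': ['C#', 'F'],
    'mode': ['Major', 'Minor'],
    'time_signature': ['4/4', '5/4'],
})
processed = preprocess(sample)
print(processed)
```

Returning a copy keeps the raw frame untouched, so the function can be called repeatedly on the same input.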

## Correlation
The easiest way to see correlations in the data is to look at its pairplot. A pairplot gives you a quick view of the relationship between each pair of features and of each feature's distribution.

In [14]:
import seaborn as sns
# sns.pairplot(df) #it may take a while since the data is pretty large

![](res/pairplot.png)

If we pairplot the features (not the genre), we can see that most of the features are only weakly correlated (based on the plot). We can also spot several outliers. But that's good enough for now.
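Since a pairplot is slow on a dataset of this size, a numeric correlation matrix can complement it. The sketch below runs on a small synthetic frame so that it stays self-contained: the column names mirror the dataset, but the values are made up. On the real data the same `select_dtypes(...).corr()` call could be applied to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, size=500)
toy = pd.DataFrame({
    'energy': energy,
    # loudness is constructed to rise with energy, plus some noise
    'loudness': -20 + 15 * energy + rng.normal(0, 2, size=500),
    # tempo is drawn independently of the other two columns
    'tempo': rng.uniform(60, 180, size=500),
})

# Pearson correlation between every pair of numeric columns
corr = toy.select_dtypes(include='number').corr()
print(corr.round(2))
```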

## Check for missing values

In [15]:
df.isna().sum().sum()

0

# Data Preparation

## Train/Validation Split

In [16]:
from sklearn.model_selection import train_test_split
import time

In [17]:
X = df.drop(columns=['genre'])
y = df['genre']
random_state = 11
test_size = 0.2
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=test_size, random_state=random_state)

In [18]:
y_train.value_counts().sort_index()

A Capella             94
Alternative         7372
Anime               7145
Blues               7260
Children's Music    4356
Children’s Music    7530
Classical           7373
Comedy              7742
Country             6982
Dance               6938
Electronic          7547
Folk                7439
Hip-Hop             7390
Indie               7637
Jazz                7495
Movie               6263
Opera               6663
Pop                 7477
R&B                 7232
Rap                 7458
Reggae              6986
Reggaeton           7090
Rock                7479
Ska                 7053
Soul                7318
Soundtrack          7598
World               7263
Name: genre, dtype: int64

In [19]:
y_valid.value_counts().sort_index()

A Capella             25
Alternative         1891
Anime               1791
Blues               1763
Children's Music    1047
Children’s Music    1823
Classical           1883
Comedy              1939
Country             1682
Dance               1763
Electronic          1830
Folk                1860
Hip-Hop             1905
Indie               1906
Jazz                1946
Movie               1543
Opera               1617
Pop                 1909
R&B                 1760
Rap                 1774
Reggae              1785
Reggaeton           1837
Rock                1793
Ska                 1821
Soul                1771
Soundtrack          2048
World               1833
Name: genre, dtype: int64
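The counts above list `Children's Music` twice because the two labels differ only in the apostrophe character (`'` versus `’`). One possible cleanup, shown here on a toy Series and not part of the original run, is to normalize the apostrophe before splitting, which would merge the two classes.

```python
import pandas as pd

genres = pd.Series(["Children's Music", "Children\u2019s Music", "Movie"])

# Replace the typographic apostrophe (U+2019) with the ASCII one
normalized = genres.str.replace("\u2019", "'", regex=False)

print(genres.nunique(), normalized.nunique())  # 3 2
```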

# Modelling and Training

In this case, let's use LogisticRegression. Even though it's not the best fit for a multi-class problem, let's just see how it works.

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
model = LogisticRegression(multi_class = 'multinomial', solver='lbfgs', max_iter=200)

Setting `multi_class = 'multinomial'` means that in the multi-class case (like our data) the loss is the cross-entropy loss. `solver='lbfgs'` merely selects the optimizer, playing the same role as gradient descent. For further information, don't hesitate to go to [Sklearn's Page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
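To make the cross-entropy loss concrete, here is a small NumPy sketch for a single sample with three classes. The logits are made-up numbers, and the function names `softmax` and `cross_entropy` are illustrative helpers, not scikit-learn API.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])   # one score per class
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
print(probs.round(3), round(float(loss), 3))
```

The loss shrinks towards 0 as the probability of the true class approaches 1, and grows without bound as that probability approaches 0.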

# Evaluation

In [22]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
y_pred = model.predict(X_valid)

In [24]:
from sklearn.metrics import classification_report
print(classification_report(y_valid, y_pred))



                  precision    recall  f1-score   support

       A Capella       0.00      0.00      0.00        25
     Alternative       0.00      0.00      0.00      1891
           Anime       0.11      0.24      0.15      1791
           Blues       0.00      0.00      0.00      1763
Children's Music       0.00      0.00      0.00      1047
Children’s Music       0.04      0.02      0.02      1823
       Classical       0.29      0.17      0.22      1883
          Comedy       0.12      0.01      0.03      1939
         Country       0.00      0.00      0.00      1682
           Dance       0.00      0.00      0.00      1763
      Electronic       0.07      0.27      0.12      1830
            Folk       0.04      0.00      0.00      1860
         Hip-Hop       0.00      0.00      0.00      1905
           Indie       0.01      0.00      0.00      1906
            Jazz       0.06      0.12      0.08      1946
           Movie       0.19      0.01      0.01      1543
           Op

The performance is remarkably poor: many genres get an F1-score of 0.00. Let's try another baseline.

In [25]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=30, random_state=42)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [26]:
y_rfc = rfc.predict(X_valid)

In [27]:
print(classification_report(y_valid, y_rfc))

                  precision    recall  f1-score   support

       A Capella       1.00      0.20      0.33        25
     Alternative       0.15      0.12      0.13      1891
           Anime       0.62      0.58      0.60      1791
           Blues       0.36      0.38      0.37      1763
Children's Music       0.74      0.76      0.75      1047
Children’s Music       0.03      0.04      0.03      1823
       Classical       0.60      0.59      0.60      1883
          Comedy       0.97      0.95      0.96      1939
         Country       0.34      0.39      0.36      1682
           Dance       0.11      0.10      0.11      1763
      Electronic       0.50      0.53      0.51      1830
            Folk       0.18      0.20      0.19      1860
         Hip-Hop       0.12      0.13      0.13      1905
           Indie       0.04      0.04      0.04      1906
            Jazz       0.34      0.31      0.33      1946
           Movie       0.58      0.53      0.55      1543
           Op

Now it's clear that the data needs a more complex model and more data engineering. Well done, data.
We will come back to this later.

You might be wondering why it is such a disaster. Well, first of all, it's a multi-class problem, so don't compare it with a binary problem. Second, we didn't do any deep feature selection or extraction. And the third is what you should have guessed from the start: duplicates. The data contains duplicates, because one track can be labeled with multiple genres. The same data point appearing in different classes will confuse our classifier and ruin our metric scores unless we use a **top5** approach.
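Top-5 accuracy counts a prediction as correct when the true label is among the five classes with the highest predicted probability. Below is a minimal NumPy sketch using a hypothetical helper `top_k_accuracy` and made-up probabilities for three classes. With a fitted scikit-learn classifier, the probability matrix would come from `predict_proba` and the column order from `classes_`.

```python
import numpy as np

def top_k_accuracy(proba, y_true_idx, k=5):
    """Fraction of rows whose true class index is among the k largest probabilities."""
    # argsort is ascending, so the last k columns hold the k most probable classes
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == np.asarray(y_true_idx)[:, None]).any(axis=1)
    return hits.mean()

proba = np.array([
    [0.10, 0.60, 0.30],   # true class 2 is the second most probable
    [0.70, 0.20, 0.10],   # true class 0 is the most probable
    [0.50, 0.40, 0.10],   # true class 2 is the least probable
])
y_true_idx = [2, 0, 2]

top1 = top_k_accuracy(proba, y_true_idx, k=1)  # 1 hit out of 3
top2 = top_k_accuracy(proba, y_true_idx, k=2)  # 2 hits out of 3
print(top1, top2)
```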

In [28]:
duplicated_all = df[data.duplicated(subset = 'track_id', keep=False)]

In [29]:
duplicated = df[data.duplicated(subset = 'track_id', keep='first')]

In [30]:
print(f'''Unique Duplicates: {duplicated.shape[0]}
Total Duplicates: {duplicated_all.shape[0]}
Total Data: {data.shape[0]}
Duplicates %: {round(duplicated_all.shape[0]/data.shape[0]*100, 2)}''')

Unique Duplicates: 55951
Total Duplicates: 91075
Total Data: 232725
Duplicates %: 39.13


# Feature Importance

In [55]:
from sklearn.feature_selection import RFE
selector = RFE(model, n_features_to_select=1)
selector.fit(X_train, y_train)



RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                 fit_intercept=True, intercept_scaling=1,
                                 l1_ratio=None, max_iter=200,
                                 multi_class='multinomial', n_jobs=None,
                                 penalty='l2', random_state=None,
                                 solver='lbfgs', tol=0.0001, verbose=0,
                                 warm_start=False),
    n_features_to_select=1, step=1, verbose=0)

In [56]:
import numpy as np

print("Model's Feature Importance")
# ranking_[i] is the rank of column i (1 = best), so sort the column
# indices by rank to list the features from most to least important
for rank, col_idx in enumerate(np.argsort(selector.ranking_), start=1):
    print(f"#{rank}: {X.columns[col_idx]} ")

Model's Feature Importance
#1: speechiness 
#2: danceability 
#3: acousticness 
#4: instrumentalness 
#5: energy 
#6: valence 
#7: liveness 
#8: time_signature 
#9: popularity 
#10: mode 
#11: loudness 
#12: key 
#13: tempo 
#14: duration_ms 


# Summary
Restructure the data as a multi-label problem to find out whether the duplicates contribute most to the low performance.
Switching the evaluation to a top-5 approach might improve the measured performance. This metric is also used in the famous computer vision competition, ImageNet.
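One possible way to restructure the data, sketched on a toy frame with made-up track ids and values: group the rows by `track_id` and collect each track's genres into a list, so that every track appears once with a set of labels. The sketch assumes that duplicated rows share identical audio features and keeps the first row of each track.

```python
import pandas as pd

toy = pd.DataFrame({
    'track_id': ['a1', 'a1', 'b2', 'c3', 'c3', 'c3'],
    'genre':    ['Pop', 'Dance', 'Jazz', 'Rap', 'Hip-Hop', 'Pop'],
    'energy':   [0.8, 0.8, 0.3, 0.7, 0.7, 0.7],
})

# Collect all genres of each track into one sorted list
genres_per_track = (
    toy.groupby('track_id')['genre']
       .apply(lambda g: sorted(set(g)))
       .rename('genres')
)

# Keep one row of features per track, then attach the label lists
features = (
    toy.drop(columns='genre')
       .drop_duplicates(subset='track_id')
       .set_index('track_id')
)
multi_label = features.join(genres_per_track).reset_index()
print(multi_label)
```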