# Preprocessing | Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the third step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will complete the process of preparing the data for model development. Categorical features will be encoded numerically with the use of dummy variables, and continuous variables will be scaled for consistency. I will also create my train/test data splits, culminating in the creation of cleaned and processed data with which I can train classifiers.

First, I import libraries and retrieve my most up-to-date data:

In [8]:
import numpy as np
import pandas as pd

# load the post-EDA data, discard the saved index, and drop the track identifier (not a feature)
df = pd.read_csv('../data/post_EDA.csv', index_col=0).reset_index(drop=True).drop('track_id', axis=1)

In [9]:
df.head()

Unnamed: 0,popularity_class,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,time_signature
0,1,3,0,0.471,0.924,0.0,0,0,177,4.0
1,1,1,0,0.46,0.326,1e-05,0,0,283,3.0
2,1,3,0,0.772,0.826,9e-06,0,0,221,4.0
3,0,2,1,0.805,0.417,0.0201,0,0,133,4.0
4,1,1,0,0.495,0.856,0.0,1,0,154,4.0


In [10]:
df.nunique()

popularity_class       2
track_number         109
single                 2
danceability         843
energy              1074
instrumentalness    2987
explicit               2
collab                 2
duration_s           596
time_signature         5
dtype: int64

Fortunately, in previous notebooks I already did much of my preprocessing legwork. The important categorical features are already encoded as binary values, with the exception of `time_signature`.

However, I notice that the values in that column are floats, which is unnecessary and would only produce cluttered dummy names such as `timesig_4.0`. I will convert the column to integer, create dummies, and merge them back into the data.

In [11]:
df['time_signature'] = df.time_signature.astype('int')
dummies = pd.get_dummies(df.time_signature, prefix='timesig')
df = pd.concat([df, dummies], axis=1).drop('time_signature', axis=1)
df.head()

Unnamed: 0,popularity_class,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5
0,1,3,0,0.471,0.924,0.0,0,0,177,0,0,0,1,0
1,1,1,0,0.46,0.326,1e-05,0,0,283,0,0,1,0,0
2,1,3,0,0.772,0.826,9e-06,0,0,221,0,0,0,1,0
3,0,2,1,0.805,0.417,0.0201,0,0,133,0,0,0,1,0
4,1,1,0,0.495,0.856,0.0,1,0,154,0,0,0,1,0
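
To illustrate why the integer conversion matters for column names, here is a minimal sketch on a toy series (the values are made up, not taken from the dataset). It also passes `dtype=int`, which is an addition of mine and not part of the cell above: in recent pandas versions `get_dummies` returns boolean columns by default, and `dtype=int` keeps the 0/1 encoding shown in the output.

```python
import pandas as pd

# Toy time signatures stored as floats (hypothetical values)
time_signature = pd.Series([4.0, 3.0, 4.0, 5.0], name="time_signature")

# Dummies built from floats carry the decimal point in their names
float_dummies = pd.get_dummies(time_signature, prefix="timesig", dtype=int)

# Casting to int first gives cleaner names
int_dummies = pd.get_dummies(time_signature.astype(int), prefix="timesig", dtype=int)

print(list(float_dummies.columns))
print(list(int_dummies.columns))
```

Each row has exactly one dummy set to 1, so the dummy columns sum to 1 across every row.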


After giving further thought to `track_number`, I am no longer convinced it belongs in the feature space. At best, from an artist's perspective, knowing how track number relates to popularity could guide decisions about where on an album to place the tracks you most want streamed. But track number assignments are often incidental, and singles are frequently (though not always) assigned track number 1. This is particularly the case for singles that are never included on an album, as when an artist releases a standalone track. As such, I fear that including `track_number` is more likely to create illusory patterns, or to yield a predictive model that cannot reasonably guide decisions by music creators or Spotify's business side, than to add real insight.

In [12]:
df = df.drop('track_number', axis=1)

Now it's time to address the numerical features. Because of the way Spotify defines them, the audio-feature columns `danceability`, `energy`, and `instrumentalness` already fall between 0 and 1. So the only feature on a markedly different scale is `duration_s`. Note that the scaling call further down standardizes all four continuous columns, not only `duration_s`.

I must not scale any column until after I've made the train-test split. Since my test data will be a stand-in for unseen data, I must fit the `StandardScaler` to the training data only. Then I can use the fitted scaler to transform the test data.
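
As a worked example of that principle, the sketch below standardizes a toy column by hand with NumPy. The durations are hypothetical, and the arithmetic (subtract the training mean, divide by the training population standard deviation) is meant to mirror what `StandardScaler` computes, not to replace it.

```python
import numpy as np

# Hypothetical durations in seconds (toy values, not from the dataset)
train_duration = np.array([100.0, 200.0, 300.0])
test_duration = np.array([200.0, 400.0])

# "Fit": learn the mean and standard deviation from the training rows only
train_mean = train_duration.mean()  # 200.0
train_std = train_duration.std()    # ddof=0, the convention StandardScaler uses

# "Transform": apply the same training statistics to both splits
train_scaled = (train_duration - train_mean) / train_std
test_scaled = (test_duration - train_mean) / train_std

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and unit variance by construction. The scaled test values generally do not, which is expected: they are measured against the training distribution, just as genuinely unseen data would be.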

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [34]:
y = df.popularity_class
X = df.drop('popularity_class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)

In [30]:
def scale_columns(cols_to_scale, X_train, X_test):
    '''Accepts a single column label or a list of column labels to scale with StandardScaler;
    returns copies of X_train and X_test with those columns standardized and moved to the end.
    The scaler is fit on X_train only and then applied to X_test.'''
    # wrap a single label in a list so both cases share one code path (no reshape needed)
    if isinstance(cols_to_scale, str):
        cols_to_scale = [cols_to_scale]

    scaling = StandardScaler()
    columns_scaled_train = scaling.fit_transform(X_train[cols_to_scale])
    columns_scaled_test = scaling.transform(X_test[cols_to_scale])

    # drop() returns a new DataFrame, so the caller's inputs are left untouched
    X_train = X_train.drop(cols_to_scale, axis=1)
    X_train[cols_to_scale] = columns_scaled_train
    X_test = X_test.drop(cols_to_scale, axis=1)
    X_test[cols_to_scale] = columns_scaled_test

    return X_train, X_test

In [35]:
# pass the splits explicitly; default arguments would be frozen at function-definition time
X_train, X_test = scale_columns(['duration_s', 'danceability', 'energy', 'instrumentalness'], X_train, X_test)

In [37]:
X_train.head()

Unnamed: 0,single,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s,danceability,energy,instrumentalness
9671,0,0,0,0,0,1,0,0,-0.552569,1.162971,-2.758509,2.300176
9212,1,1,1,0,0,0,1,0,0.061334,2.020965,0.287754,-0.547864
1875,0,1,0,0,0,0,1,0,0.353669,0.438847,1.019239,-0.547864
1424,0,0,0,0,0,0,1,0,-0.883882,-0.912038,1.439085,2.30319
8488,0,0,0,0,0,0,1,0,0.246479,-0.279191,-1.28775,-0.547864


In [38]:
X_test.head()

Unnamed: 0,single,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s,danceability,energy,instrumentalness
8486,0,0,0,0,0,0,1,0,0.056462,-0.376552,0.270441,2.300176
6700,0,0,0,0,0,0,1,0,-0.075089,1.016929,-0.551937,-0.54785
7099,0,0,0,0,0,0,1,0,-0.167662,1.290757,-2.291918,2.040989
8559,1,0,0,0,0,0,1,0,0.056462,-0.084469,0.595064,-0.547864
7431,0,0,0,0,0,0,1,0,-0.196895,-0.376552,-0.456715,-0.547864


Now that the data are fully preprocessed, I will export them.

In [39]:
X_train.to_csv('../data/X_train.csv')
y_train.to_csv('../data/y_train.csv')
X_test.to_csv('../data/X_test.csv')
y_test.to_csv('../data/y_test.csv')
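
Because `to_csv` writes the row index as the first column, the files need `index_col=0` when they are read back. The sketch below shows one possible round trip on toy stand-ins, written to a temporary directory; the names `X_demo` and `y_demo` are hypothetical, and the sketch assumes a pandas version in which `Series.to_csv` writes a header row by default.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the exported objects (hypothetical values)
X_demo = pd.DataFrame({"single": [0, 1], "duration_s": [-0.55, 0.06]}, index=[9671, 9212])
y_demo = pd.Series([1, 0], index=[9671, 9212], name="popularity_class")

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    X_demo.to_csv(tmp / "X_demo.csv")
    y_demo.to_csv(tmp / "y_demo.csv")

    # index_col=0 restores the original row labels instead of creating an "Unnamed: 0" column
    X_back = pd.read_csv(tmp / "X_demo.csv", index_col=0)

    # the target comes back as a one-column DataFrame; squeeze it into a Series
    y_back = pd.read_csv(tmp / "y_demo.csv", index_col=0).squeeze("columns")

print(X_back)
print(y_back)
```

Keeping the original row labels makes it possible to check that features and targets are still aligned row for row after loading.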