# Preprocessing | Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the third step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will complete the process of preparing the data for model development. Categorical features will be encoded numerically with the use of dummy variables, and continuous variables will be scaled for consistency. I will also create my train/test data splits, culminating in the creation of cleaned and processed data with which I can train classifiers.

First, I import libraries and retrieve my most up-to-date data:

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/post_EDA.csv', index_col=0).reset_index().drop(['index','track_id'], axis=1)

In [2]:
df.head()

Unnamed: 0,popularity_class,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,time_signature
0,1,3,0,0.471,0.924,0.0,0,0,177,4.0
1,1,1,0,0.46,0.326,1e-05,0,0,283,3.0
2,1,3,0,0.772,0.826,9e-06,0,0,221,4.0
3,0,2,1,0.805,0.417,0.0201,0,0,133,4.0
4,1,1,0,0.495,0.856,0.0,1,0,154,4.0


In [3]:
df.nunique()

popularity_class       2
track_number         109
single                 2
danceability         843
energy              1074
instrumentalness    2987
explicit               2
collab                 2
duration_s           596
time_signature         5
dtype: int64

Fortunately, in previous notebooks I already did much of my preprocessing legwork. The important categorical features are already encoded as binary values, with the exception of `time_signature`.

However, I notice that this column's values are floats, which is unnecessary and would only lead to less clean-looking dummy names. I will convert the column to integer, create dummies, and merge them back into the DataFrame.

In [4]:
df['time_signature'] = df.time_signature.astype('int')
dummies = pd.get_dummies(df.time_signature, prefix='timesig')
df = pd.concat([df, dummies], axis=1).drop('time_signature', axis=1)
df.head()

Unnamed: 0,popularity_class,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5
0,1,3,0,0.471,0.924,0.0,0,0,177,0,0,0,1,0
1,1,1,0,0.46,0.326,1e-05,0,0,283,0,0,1,0,0
2,1,3,0,0.772,0.826,9e-06,0,0,221,0,0,0,1,0
3,0,2,1,0.805,0.417,0.0201,0,0,133,0,0,0,1,0
4,1,1,0,0.495,0.856,0.0,1,0,154,0,0,0,1,0
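
To illustrate why the integer conversion matters for the dummy names, here is a minimal sketch on a toy Series. The values are made up for illustration and are not drawn from the project data.

```python
import pandas as pd

# Toy time-signature values stored as floats (hypothetical, not project data)
toy = pd.Series([3.0, 4.0, 4.0, 5.0], name='time_signature')

# Dummies built from floats carry the decimal point into the column names
float_names = list(pd.get_dummies(toy, prefix='timesig').columns)

# Casting to int first yields the cleaner names used in this notebook
int_names = list(pd.get_dummies(toy.astype('int'), prefix='timesig').columns)

print(float_names)
print(int_names)
```

The float version produces names ending in `.0`, while the integer version matches the `timesig_3`, `timesig_4` pattern seen in the output above.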


After giving further thought to `track_number`, I am no longer convinced it belongs in the feature space. At best, from an artist's perspective, knowing the role of track number in predicting popularity could guide decisions about where in an album to place the tracks you most want streamed. But track number assignments are often incidental, and singles are frequently (though not always) assigned track number 1 automatically. This is particularly the case for singles that are never included on an album, as when an artist releases a standalone track. As such, I fear that including `track_number` in the model risks either creating illusory patterns or producing a predictive model that cannot reasonably guide decisions by music creators or Spotify's business side.

In [5]:
df = df.drop('track_number', axis=1)

Now, it's time to address the numerical features. Fortunately, due to the way Spotify defines them, the audio-feature columns `danceability`, `energy`, and `instrumentalness` already range from 0 to 1. So, the only remaining feature to scale is `duration_s`.

I must not scale this column until after I've made the train-test split. Since my test data will be a stand-in for unseen data, I must fit the `StandardScaler` to the training data only. Then, I can use the fitted scaler to transform the test data.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [7]:
y = df.popularity_class
X = df.drop('popularity_class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=17)

In [8]:
scaler = StandardScaler()

In [9]:
X_train['duration_s_scaled'] = scaler.fit_transform(X_train['duration_s'].to_numpy().reshape(-1,1))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['duration_s_scaled'] = scaler.fit_transform(X_train['duration_s'].to_numpy().reshape(-1,1))


In [10]:
X_train.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,duration_s,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s_scaled
9671,0,0.783,0.0262,0.945,0,0,110,0,0,1,0,0,-0.552569
9212,1,0.924,0.73,0.0,1,1,236,0,0,0,1,0,0.061334
1875,0,0.664,0.899,0.0,1,0,296,0,0,0,1,0,0.353669
1424,0,0.442,0.996,0.946,0,0,42,0,0,0,1,0,-0.883882
8488,0,0.546,0.366,0.0,0,0,274,0,0,0,1,0,0.246479


In [11]:
X_test['duration_s_scaled'] = scaler.transform(X_test['duration_s'].to_numpy().reshape(-1,1))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['duration_s_scaled'] = scaler.transform(X_test['duration_s'].to_numpy().reshape(-1,1))
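
The warning shown above can usually be avoided by taking an explicit copy of each split before adding a column. The following is a minimal sketch of that pattern on a toy DataFrame; the data and the names `toy`, `X_tr`, and `X_te` are hypothetical, and this is one common approach rather than the only one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values (not the project data)
toy = pd.DataFrame({
    'duration_s': [110, 236, 296, 42, 274, 235, 208, 189, 235, 183],
    'energy': [0.03, 0.73, 0.90, 0.99, 0.37, 0.73, 0.54, 0.13, 0.80, 0.56],
})

X_tr, X_te = train_test_split(toy, test_size=0.3, random_state=17)

# Explicit copies make it unambiguous that we are modifying standalone frames
X_tr = X_tr.copy()
X_te = X_te.copy()

scaler = StandardScaler()
# Double brackets keep the input 2-D, so no reshape is needed;
# ravel() flattens the (n, 1) result into a single column of values
X_tr['duration_s_scaled'] = scaler.fit_transform(X_tr[['duration_s']]).ravel()
X_te['duration_s_scaled'] = scaler.transform(X_te[['duration_s']]).ravel()
```

As in the cells above, the scaler is fit on the training split only and then reused to transform the test split.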


In [12]:
X_test.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,duration_s,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s_scaled
8486,0,0.53,0.726,0.945,0,0,235,0,0,0,1,0,0.056462
6700,0,0.759,0.536,5e-06,0,0,208,0,0,0,1,0,-0.075089
7099,0,0.804,0.134,0.859,0,0,189,0,0,0,1,0,-0.167662
8559,1,0.578,0.801,0.0,0,0,235,0,0,0,1,0,0.056462
7431,0,0.53,0.558,0.0,0,0,183,0,0,0,1,0,-0.196895


Rather than transforming the column in place, I created an additional column in each dataset. This allows for flexibility in model testing in the next notebook. For decision tree-based models, I can (and, it seems, should) use the unscaled data, whereas for distance-based classifiers, I must use the scaled data. With the data structured this way, I can select the appropriate columns for each model I try.
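
As a rough sketch of how that selection might look in the next notebook, the helper below keeps exactly one of the two duration columns. The function name `select_features` and the toy values are hypothetical, introduced only for illustration.

```python
import pandas as pd

def select_features(X, scaled):
    """Return a copy of X containing only one of the two duration columns."""
    drop_col = 'duration_s' if scaled else 'duration_s_scaled'
    return X.drop(columns=[drop_col])

# Toy frame with made-up values, mirroring the two duration columns
toy = pd.DataFrame({
    'energy': [0.9, 0.3],
    'duration_s': [177, 283],
    'duration_s_scaled': [-0.2, 0.3],
})

tree_X = select_features(toy, scaled=False)     # for tree-based models
distance_X = select_features(toy, scaled=True)  # for distance-based models
```

Keeping both columns out of the same feature matrix avoids feeding a model two perfectly correlated copies of the same information.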

When I created the scaled columns, pandas raised a `SettingWithCopyWarning`, but inspecting the head of each dataset suggests that the assignments worked. I believe these warnings were false positives. I can verify that the transformations worked as intended:

In [13]:
X_train.duration_s_scaled.mean(), X_train.duration_s_scaled.std(), X_test.duration_s_scaled.mean(), X_test.duration_s_scaled.std()

(6.385875329431095e-17,
 1.000072513687059,
 -0.01515163208391563,
 0.9436401991180441)

The mean and standard deviation of the scaled test data are not quite as close to zero and one, but this is to be expected, since the scaler was fit to the training data only.
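
The training standard deviation is also very slightly above 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`), while the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). A small sketch with made-up values shows the same effect, exaggerated by the tiny sample size:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five made-up durations, shaped as a single feature column
values = np.array([110.0, 236.0, 296.0, 42.0, 274.0]).reshape(-1, 1)
scaled = StandardScaler().fit_transform(values).ravel()

pop_std = scaled.std(ddof=0)     # the convention StandardScaler uses
sample_std = scaled.std(ddof=1)  # the convention pandas .std() uses by default

print(pop_std, sample_std)
```

For n rows the two differ by a factor of sqrt(n / (n - 1)), which shrinks toward 1 as the training set grows.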

With preprocessing complete, I will export the data.

In [14]:
X_train.to_csv('../data/X_train.csv')
y_train.to_csv('../data/y_train.csv')
X_test.to_csv('../data/X_test.csv')
y_test.to_csv('../data/y_test.csv')
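
Because `to_csv` writes the index as the first column by default, reading these files back calls for `index_col=0`, as in the first cell of this notebook. The sketch below round-trips toy data through in-memory buffers rather than the real files; the values and the names `toy_X` and `toy_y` are hypothetical.

```python
import io
import pandas as pd

# Toy stand-ins for X_train and y_train, with a non-default index
toy_X = pd.DataFrame({'energy': [0.9, 0.3]}, index=[9671, 9212])
toy_y = pd.Series([1, 0], index=[9671, 9212], name='popularity_class')

# Write to in-memory buffers instead of files on disk
X_buf, y_buf = io.StringIO(), io.StringIO()
toy_X.to_csv(X_buf)
toy_y.to_csv(y_buf)
X_buf.seek(0)
y_buf.seek(0)

# index_col=0 restores the original row labels
X_back = pd.read_csv(X_buf, index_col=0)
# A Series comes back as a one-column DataFrame, so select the column
y_back = pd.read_csv(y_buf, index_col=0)['popularity_class']
```

Restoring the index this way keeps the features and labels aligned on the same row labels after reloading.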