# Preprocessing | Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the third step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will complete the process of preparing the data for model development. Categorical features will be encoded numerically with the use of dummy variables, and continuous variables will be scaled for consistency. I will also create my train/test data splits, culminating in the creation of cleaned and processed data with which I can train classifiers.

First, I import libraries and retrieve my most up-to-date data:

In [1]:
import numpy as np
import pandas as pd

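# Load the post-EDA data, reset to a fresh integer index, and drop the old index and track ID columns
df = pd.read_csv('../data/post_EDA.csv', index_col=0).reset_index().drop(['index','track_id'], axis=1)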

In [2]:
df.head()

Unnamed: 0,popularity_class,single,danceability,energy,instrumentalness,explicit,collab,duration_s,time_signature
0,1,0,0.471,0.924,0.0,0,0,177,4.0
1,1,0,0.46,0.326,1e-05,0,0,283,3.0
2,1,0,0.772,0.826,9e-06,0,0,221,4.0
3,0,1,0.805,0.417,0.0201,0,0,133,4.0
4,1,0,0.495,0.856,0.0,1,0,154,4.0


In [3]:
df.nunique()

popularity_class       2
single                 2
danceability         843
energy              1074
instrumentalness    2987
explicit               2
collab                 2
duration_s           596
time_signature         5
dtype: int64

Fortunately, in previous notebooks I already did much of my preprocessing legwork. The important categorical features are already encoded as binary values, with the exception of `time_signature`.

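However, I notice that the values in that column are floats, which is unnecessary and would only produce messier dummy names (for example, `timesig_4.0` rather than `timesig_4`). I will convert the column to integers, create the dummies, and concatenate them with the rest of the data.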

In [4]:
df['time_signature'] = df.time_signature.astype('int')
dummies = pd.get_dummies(df.time_signature, prefix='timesig')
df = pd.concat([df, dummies], axis=1).drop('time_signature', axis=1)
df.head()

Unnamed: 0,popularity_class,single,danceability,energy,instrumentalness,explicit,collab,duration_s,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5
0,1,0,0.471,0.924,0.0,0,0,177,0,0,0,1,0
1,1,0,0.46,0.326,1e-05,0,0,283,0,0,1,0,0
2,1,0,0.772,0.826,9e-06,0,0,221,0,0,0,1,0
3,0,1,0.805,0.417,0.0201,0,0,133,0,0,0,1,0
4,1,0,0.495,0.856,0.0,1,0,154,0,0,0,1,0
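
As a small illustration of why the integer cast comes first, the sketch below runs `pd.get_dummies` on a toy series before and after casting. The series `toy` and its values are made up for this example and are not the project data.

```python
import pandas as pd

# Toy time-signature values stored as floats (illustrative only)
toy = pd.Series([4.0, 3.0, 4.0, 5.0], name='time_signature')

# Dummies built from floats carry the decimal point into the column names
float_dummies = pd.get_dummies(toy, prefix='timesig')
print(list(float_dummies.columns))  # ['timesig_3.0', 'timesig_4.0', 'timesig_5.0']

# Casting to int first yields cleaner names
int_dummies = pd.get_dummies(toy.astype('int'), prefix='timesig')
print(list(int_dummies.columns))  # ['timesig_3', 'timesig_4', 'timesig_5']
```

Note that recent pandas versions return boolean dummy columns by default; passing `dtype=int` to `pd.get_dummies` should keep the 0/1 encoding shown in the output above.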


Now, it's time to address the numerical features. Because of the way Spotify defines them, the audio-feature columns `danceability`, `energy`, and `instrumentalness` already fall within the range 0 to 1.

However, `duration_s` has a much greater magnitude, with values in the hundreds. To bring this column into alignment with the others, I will apply scikit-learn's `MinMaxScaler` and rescale the duration values to between 0 and 1.

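Since my test data will be a stand-in for unseen data, I must fit the scaler to the training data only; fitting it on the full dataset would leak information about the test set's range into preprocessing. I can then use the fitted scaler to transform the test data.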
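
To make the train-only fit concrete, here is a minimal sketch on made-up durations (the arrays `train` and `test` are hypothetical, not the project data). `MinMaxScaler` maps each value x to (x - min) / (max - min), where min and max come from the data it was fitted on, so test values outside the training range land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up durations in seconds (illustrative only)
train = np.array([100.0, 200.0, 300.0]).reshape(-1, 1)
test = np.array([50.0, 250.0, 400.0]).reshape(-1, 1)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=100 and max=300 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

# Applying the formula by hand with the training statistics gives the same values
manual = (test - train.min()) / (train.max() - train.min())

print(train_scaled.ravel())  # spans 0 to 1 exactly
print(test_scaled.ravel())   # roughly -0.25, 0.75, 1.5: not confined to 0..1
print(np.allclose(manual, test_scaled))
```

Out-of-range test values like these are expected and harmless; they simply reflect that the test set was not consulted when the scaler was fitted.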

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [6]:
y = df.popularity_class
X = df.drop('popularity_class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=17)
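
As a quick sanity check on what `test_size=0.3` and `random_state` do, the sketch below splits a made-up ten-row frame. The names `toy_X` and `toy_y` and their contents are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and binary labels (illustrative only)
toy_X = pd.DataFrame({'feature': range(10)})
toy_y = pd.Series([0, 1] * 5, name='label')

# 30% of the 10 rows are held out for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=17)
print(len(Xa_train), len(Xa_test))  # 7 3

# Reusing the same random_state reproduces the same partition
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=17)
print(Xa_test.index.equals(Xb_test.index))  # True

# The original row labels are kept, so features and labels stay aligned
print(Xa_train.index.equals(ya_train.index))  # True
```

Because the rows are shuffled but keep their original labels, the index of each split is a non-consecutive sample of the original index.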

In [7]:
scaling = MinMaxScaler()

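# Fit the scaler on the training durations only, then scale them
duration_scaled_train = scaling.fit_transform(X_train[['duration_s']]).ravel()
# Dropping and re-adding the column moves duration_s to the last position
X_train = X_train.drop('duration_s', axis=1)
X_train['duration_s'] = duration_scaled_train

# Reuse the training-set minimum and maximum to scale the test durations
duration_scaled_test = scaling.transform(X_test[['duration_s']]).ravel()
X_test = X_test.drop('duration_s', axis=1)
X_test['duration_s'] = duration_scaled_test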

In [8]:
X_train.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s
9671,0,0.783,0.0262,0.945,0,0,0,0,1,0,0,0.019288
9212,1,0.924,0.73,0.0,1,1,0,0,0,1,0,0.045994
1875,0,0.664,0.899,0.0,1,0,0,0,0,1,0,0.058711
1424,0,0.442,0.996,0.946,0,0,0,0,0,1,0,0.004875
8488,0,0.546,0.366,0.0,0,0,0,0,0,1,0,0.054048


In [9]:
X_test.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,duration_s
8486,0,0.53,0.726,0.945,0,0,0,0,0,1,0,0.045782
6700,0,0.759,0.536,5e-06,0,0,0,0,0,1,0,0.040059
7099,0,0.804,0.134,0.859,0,0,0,0,0,1,0,0.036032
8559,1,0.578,0.801,0.0,0,0,0,0,0,1,0,0.045782
7431,0,0.53,0.558,0.0,0,0,0,0,0,1,0,0.03476


With preprocessing complete, I will export the data.

In [10]:
X_train.to_csv('../data/X_train.csv')
y_train.to_csv('../data/y_train.csv')
X_test.to_csv('../data/X_test.csv')
y_test.to_csv('../data/y_test.csv')
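
When these files are loaded in a later notebook, the saved row labels need to be read back as the index rather than as a feature. Below is a minimal sketch of that round trip, using a temporary directory and made-up rows; the file names `X_demo.csv` and `y_demo.csv` and all values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with non-consecutive row labels, like a shuffled split
X_demo = pd.DataFrame({'single': [0, 1], 'duration_s': [0.25, 0.5]},
                      index=[9671, 9212])
y_demo = pd.Series([1, 0], index=[9671, 9212], name='popularity_class')

with tempfile.TemporaryDirectory() as tmp:
    X_path = os.path.join(tmp, 'X_demo.csv')
    y_path = os.path.join(tmp, 'y_demo.csv')
    X_demo.to_csv(X_path)
    y_demo.to_csv(y_path)

    # index_col=0 restores the saved row labels as the index
    X_back = pd.read_csv(X_path, index_col=0)
    # A one-column frame is squeezed back into a Series
    y_back = pd.read_csv(y_path, index_col=0).squeeze('columns')

# Both checks raise an AssertionError if the round trip changed anything
pd.testing.assert_frame_equal(X_demo, X_back)
pd.testing.assert_series_equal(y_demo, y_back)
```

Without `index_col=0`, the row labels would come back as an extra column named `Unnamed: 0`, which a model could then mistake for a feature.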