# Preprocessing |  Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the third step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will complete the process of preparing the data for model development. Categorical features will be encoded numerically with the use of dummy variables, and continuous variables will be scaled for consistency. I will also create my train/test data splits, culminating in the creation of cleaned and processed data with which I can train classifiers.

## 1. Load libraries and data

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/post_EDA.csv', index_col=0).reset_index().drop(['index','track_id'], axis=1)

In [4]:
df.head()

Unnamed: 0,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,time_signature,popularity_class
0,3,0,0.471,0.924,0.0,0,0,177,4.0,1
1,1,0,0.46,0.326,1e-05,0,0,283,3.0,1
2,3,0,0.772,0.826,9e-06,0,0,221,4.0,1
3,2,1,0.805,0.417,0.0201,0,0,133,4.0,0
4,1,0,0.495,0.856,0.0,1,0,154,4.0,1


## 2. Preprocess

To help guide preprocessing decisions, let's look at the number of unique values for each feature:

In [5]:
df.nunique()

track_number         109
single                 2
danceability         843
energy              1074
instrumentalness    2987
explicit               2
collab                 2
duration_s           596
time_signature         5
popularity_class       2
dtype: int64

### i. time_signature

Fortunately, in previous notebooks I already did much of my preprocessing legwork. The important categorical features are already encoded as binary values, with the exception of `time_signature`.

However, I notice that the column's values are floats, which is unnecessary and will only lead to less clean-looking dummy names. I will convert to integer, create dummies, and merge.

In [6]:
df['time_signature'] = df.time_signature.astype('int')
dummies = pd.get_dummies(df.time_signature, prefix='timesig')
df = pd.concat([df, dummies], axis=1).drop('time_signature', axis=1)
df.head()

Unnamed: 0,track_number,single,danceability,energy,instrumentalness,explicit,collab,duration_s,popularity_class,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5
0,3,0,0.471,0.924,0.0,0,0,177,1,0,0,0,1,0
1,1,0,0.46,0.326,1e-05,0,0,283,1,0,0,1,0,0
2,3,0,0.772,0.826,9e-06,0,0,221,1,0,0,0,1,0
3,2,1,0.805,0.417,0.0201,0,0,133,0,0,0,0,1,0
4,1,0,0.495,0.856,0.0,1,0,154,1,0,0,0,1,0
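
To illustrate why the integer cast matters for the dummy names, here is a minimal sketch on a toy column. The values and the `toy` frame are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column standing in for time_signature (values are made up for illustration)
toy = pd.DataFrame({'time_signature': [4.0, 3.0, 4.0, 5.0]})

# Dummies built directly from floats carry the decimal point into the column names
float_names = list(pd.get_dummies(toy.time_signature, prefix='timesig').columns)

# Casting to int first gives the cleaner names used in this notebook
int_names = list(pd.get_dummies(toy.time_signature.astype('int'), prefix='timesig').columns)

print(float_names)
print(int_names)
```

The float version produces names with a trailing `.0`, while the integer version produces names such as `timesig_4`.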


Now it's time to address the quantitative features. Fortunately, due to the way these features are defined by Spotify, the audio-feature columns of `danceability`, `energy`, and `instrumentalness` already range from 0 to 1, making them well-suited to all kinds of models.

However, `duration_s` and `track_number` have much broader ranges, and so will need to be rescaled. Each requires a different approach.

### ii. track_number

There are 109 unique values for `track_number`. But from previous notebooks, I know that there are relatively few tracks with high track numbers. Furthermore, I know that those tracks with high track numbers tend to have low popularity, with very few exceptions. As such, it would not be unreasonable to reduce the cardinality of `track_number` by imposing a maximum value of, say, 25 (judging by the distribution shown in the EDA notebook).

Why do this? Two reasons:

1. If we rescale the track numbers as is, the wide range would create a large cluster of values on the very low end (close to zero) and a sparse set of values ranging up to 1. But we know that the relationship between track number and popularity appears only in that lower range of track numbers (i.e., where the majority of the data exist). Setting an artificial maximum of 25 before rescaling allows the important range of values to map onto the 0-to-1 scale in a more proportional fashion, allowing the model to learn the genuine relationship between this feature and our target.
2. The vast majority of songs with a track number higher than 25 have a popularity classification of 0. So, even though we are losing some data validity by imposing a maximum value, we are, for the most part, preserving the relationship between `track_number` and popularity. The model will learn that rows with a track number of 25 will typically not be popular, which will be true of both the raw data and the preprocessed data. It would be within the bounds of responsibility to apply such a maximum to new data in the context of generating predictions.

In [14]:
df['track_number'] = df['track_number'].clip(upper=25)
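
The first reason above can be seen in a small sketch. The track numbers below are made up for illustration (mostly low values plus one extreme value); only the cap of 25 comes from this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up track numbers: mostly low values plus one extreme value
track_numbers = np.array([1, 2, 3, 5, 10, 109]).reshape(-1, 1)

# Scaling the raw values squeezes the common range toward zero
raw_scaled = MinMaxScaler().fit_transform(track_numbers).ravel()

# Capping at 25 first spreads the common range across the 0-to-1 interval
capped = np.minimum(track_numbers, 25)
capped_scaled = MinMaxScaler().fit_transform(capped).ravel()

print(raw_scaled.round(3))
print(capped_scaled.round(3))
```

In this toy case, track 10 maps to roughly 0.083 without the cap and to 0.375 with it, so the range where most of the data sit occupies far more of the scaled interval.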

The standard procedure is to fit scalers or normalizers to the training data only, then transform both the training and testing data with that fitted scaler. This avoids information "leaking" from the testing set into the training process. I will use the `scikit-learn` `MinMaxScaler()`.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [16]:
y = df.popularity_class
X = df.drop('popularity_class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=17)

In [17]:
scaling = MinMaxScaler()

# Fit the scaler on the training data only
track_num_scaled_train = scaling.fit_transform(np.array(X_train['track_number']).reshape(-1,1))
X_train = X_train.drop('track_number', axis=1)
X_train['track_number'] = track_num_scaled_train

# Apply the training-set scaling to the test data
track_num_scaled_test = scaling.transform(np.array(X_test['track_number']).reshape(-1,1))
X_test = X_test.drop('track_number', axis=1)
X_test['track_number'] = track_num_scaled_test

### iii. duration_s

`duration_s` has much greater magnitude, with values in the hundreds. To bring this column into alignment with the others, I will apply a fresh MinMaxScaler().

In [19]:
scaling = MinMaxScaler()

duration_scaled_train = scaling.fit_transform(np.array(X_train['duration_s']).reshape(-1,1))
X_train = X_train.drop('duration_s', axis=1)
X_train['duration_s'] = duration_scaled_train

duration_scaled_test = scaling.transform(np.array(X_test['duration_s']).reshape(-1,1))
X_test = X_test.drop('duration_s', axis=1)
X_test['duration_s'] = duration_scaled_test
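
As a possible alternative (not what this notebook does), `MinMaxScaler` scales each column independently, so both columns could be handled by one scaler in a single call. The sketch below uses a made-up toy frame; `toy_X`, `toy_train`, `toy_test`, and `cols_to_scale` are hypothetical names introduced only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up rows standing in for the real feature matrix
toy_X = pd.DataFrame({
    'track_number': [1, 4, 9, 25, 2, 13, 7, 20],
    'danceability': [0.47, 0.46, 0.77, 0.80, 0.49, 0.53, 0.61, 0.72],
    'duration_s': [177, 283, 221, 133, 154, 240, 198, 305],
})
toy_train, toy_test = train_test_split(toy_X, test_size=0.25, random_state=17)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

cols_to_scale = ['track_number', 'duration_s']

# Fit on the training rows only; each column gets its own min and max
scaler = MinMaxScaler()
toy_train[cols_to_scale] = scaler.fit_transform(toy_train[cols_to_scale])

# Reuse the training-set min and max on the test rows
toy_test[cols_to_scale] = scaler.transform(toy_test[cols_to_scale])

print(toy_train)
print(toy_test)
```

Assigning back into the same columns keeps the original column order, whereas dropping and re-adding moves the scaled columns to the end. Test values can fall outside the 0-to-1 range whenever they lie beyond the training minimum or maximum.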

In [20]:
X_train.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,track_number,duration_s
9671,0,0.783,0.0262,0.945,0,0,0,0,1,0,0,0.333333,0.019288
9212,1,0.924,0.73,0.0,1,1,0,0,0,1,0,0.0,0.045994
1875,0,0.664,0.899,0.0,1,0,0,0,0,1,0,0.208333,0.058711
1424,0,0.442,0.996,0.946,0,0,0,0,0,1,0,0.625,0.004875
8488,0,0.546,0.366,0.0,0,0,0,0,0,1,0,0.041667,0.054048


In [21]:
X_test.head()

Unnamed: 0,single,danceability,energy,instrumentalness,explicit,collab,timesig_0,timesig_1,timesig_3,timesig_4,timesig_5,track_number,duration_s
8486,0,0.53,0.726,0.945,0,0,0,0,0,1,0,0.791667,0.045782
6700,0,0.759,0.536,5e-06,0,0,0,0,0,1,0,0.541667,0.040059
7099,0,0.804,0.134,0.859,0,0,0,0,0,1,0,0.875,0.036032
8559,1,0.578,0.801,0.0,0,0,0,0,0,1,0,0.0,0.045782
7431,0,0.53,0.558,0.0,0,0,0,0,0,1,0,0.708333,0.03476


## 3. Export

All of my features are now either binary or continuous from 0 to 1. While decision-tree-based models can handle features of almost any range, many other model types, particularly distance-based and gradient-based ones, work best with features on a comparable scale like the one created here. Since I don't yet know which type of model will work best with this dataset, this approach will leave the most doors open.

In [22]:
X_train.to_csv('../data/X_train.csv')
y_train.to_csv('../data/y_train.csv')
X_test.to_csv('../data/X_test.csv')
y_test.to_csv('../data/y_test.csv')
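
Because `to_csv` writes the row index as the first column by default, a later notebook would likely need `index_col=0` when reading these files back. The following is a minimal sketch of that round trip on toy data written to a temporary directory; the toy frames are made up, and the real files would be read from `../data/` instead.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for X_train and y_train, with a shuffled index like the real splits
toy_X_train = pd.DataFrame({'single': [0, 1], 'duration_s': [0.019, 0.046]},
                           index=[9671, 9212])
toy_y_train = pd.Series([1, 0], index=[9671, 9212], name='popularity_class')

with tempfile.TemporaryDirectory() as tmp:
    X_path = os.path.join(tmp, 'X_train.csv')
    y_path = os.path.join(tmp, 'y_train.csv')
    toy_X_train.to_csv(X_path)
    toy_y_train.to_csv(y_path)

    # index_col=0 restores the row labels instead of adding an 'Unnamed: 0' column
    X_loaded = pd.read_csv(X_path, index_col=0)
    # Selecting the single data column turns the one-column frame back into a Series
    y_loaded = pd.read_csv(y_path, index_col=0)['popularity_class']

print(X_loaded)
print(y_loaded)
```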