# ML Spotify project
### By Juliana Varela

The data this week comes from Spotify via the [`spotifyr` package](https://www.rcharlie.com/spotifyr/). [Charlie Thompson](https://twitter.com/_RCharlie), [Josiah Parry](https://twitter.com/JosiahParry), Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify's API.

## Data description

|variable                 |class     |description |
|:---|:---|:-----------|
|track_id                 |character | Song unique ID|
|track_name               |character | Song Name|
|track_artist             |character | Song Artist|
|track_popularity         |double    | Song Popularity (0-100) where higher is better |
|track_album_id           |character | Album unique ID|
|track_album_name         |character | Song album name |
|track_album_release_date |character | Date when album released |
|playlist_name            |character | Name of playlist |
|playlist_id              |character | Playlist ID|
|playlist_genre           |character | Playlist genre |
|playlist_subgenre        |character | Playlist subgenre|
|danceability             |double    | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
|energy                   |double    | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
|key                      |double    | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
|loudness                 |double    | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.|
|mode                     |double    | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|speechiness              |double    | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
|acousticness             |double    | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.|
|instrumentalness         |double    | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
|liveness                 |double    | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
|valence                  |double    | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
|tempo                    |double    | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
|duration_ms              |double    | Duration of song in milliseconds |
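
As a small illustration of the `key` and `mode` encodings described in the table, the sketch below decodes them into a readable label. The helper name `describe_key` is hypothetical (it is not part of `spotifyr` or the dataset), and the pitch names simply follow standard pitch class notation.

```python
# Pitch class notation: index 0 is C and each step is one semitone up.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Turn the dataset's key/mode pair into a label (hypothetical helper)."""
    key = int(key)  # the column may be stored as a float
    if key == -1:
        return "unknown key"  # the table states -1 means no key was detected
    scale = "major" if int(mode) == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


# key=1, mode=0 are the values of the first training row shown further down.
print(describe_key(1, 0))
print(describe_key(2, 1))
```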


* Task 1: Predict ’popularity’ with a Machine Learning model
    * 13 points - for the Machine Learning model, you can use any model covered in lecture.
    * 2 points will be based on the model performance on the test set – metric=MeanSquaredLogarithmicError

* Task 2: Build a recommender system with a Machine Learning approach.
    * Suggest 5 tracks to listen to based on 5 tracks that a user has liked
    * 5 points - ML: As we don't have a course dedicated to recommender systems using machine learning models, you'll have to do a bit of research
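
The evaluation metric named in Task 1 can be written out in a few lines. This is a plain-Python sketch of mean squared logarithmic error in its log(1 + x) form, which is the definition scikit-learn's `mean_squared_log_error` uses. The function name `msle` and the toy numbers are illustrative assumptions, not part of the assignment.

```python
import math


def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean of (log(1 + y) - log(1 + y_hat))^2."""
    if any(v < 0 for v in list(y_true) + list(y_pred)):
        raise ValueError("MSLE is only defined for non-negative values")
    errors = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Toy popularity scores (made up for illustration).
y_true = [0, 50, 100]
raw_preds = [-3.0, 50.0, 80.0]

# A regressor can output negative values, so clip at 0 before scoring.
clipped = [max(p, 0.0) for p in raw_preds]
print(msle(y_true, clipped))
```

Because the error is measured on a log scale, missing a true popularity of 0 by a few points costs far more than the same miss at 80.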



## Imports

In [10]:
import pandas as pd
import numpy as np
import torch

In [11]:
# No resampling steps are used, so scikit-learn's own Pipeline is sufficient
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.metrics import mean_squared_log_error

## Data loading

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [14]:
TRAIN = pd.read_csv("spotify_songs_train.csv")

In [15]:
x_test = pd.read_csv('spotify_songs_X_test.csv')

## TASK 1: Predict popularity

## 1. Data Cleaning

In [16]:
y_train =  TRAIN.pop('track_popularity')
X_train = TRAIN

In [17]:
X_train.head()

| | Unnamed: 0 | track_id | track_name | track_artist | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 21973 | 0Xw5Jg9aFglPqfH163thrA | Tell Me | Krept & Konan | 2p93gdjjBJo51WF4YV3L3d | Tell Me | 2019-10-31 | Chixtape 5 - Tory Lanez | 0UXwwVDipbBQeEX7h4YuKU | r&b | ... | 1 | -5.76 | 0 | 0.386 | 0.136 | 0.0 | 0.089 | 0.633 | 163.563 | 215047 |
| 1 | 16980 | 5A1ttHJNuGEoPLhhoHpzFA | Pearls | Epifania | 0H7yNFgaPSzmji4ts1s3EN | Gems from Japan, Vol II | 2019-07-09 | Sunny Beats | 37i9dQZF1DXbtuVQL4zoey | latin | ... | 0 | -7.782 | 0 | 0.453 | 0.388 | 0.647 | 0.0799 | 0.677 | 85.054 | 83294 |
| 2 | 18464 | 3QHMxEOAGD51PDlbFPHLyJ | Vivir Mi Vida | Marc Anthony | 6vBpLg3T8bojcqzoKI6m0R | 3.0 | 2013-07-23 | Fiesta Latina Mix 🎈🎉💃🏻🕺🏻☀️🏖 | 2kY6lVc5EcVfI5WNKmPQQG | latin | ... | 0 | -3.23 | 0 | 0.0344 | 0.344 | 0.0 | 0.349 | 0.893 | 105.017 | 252347 |
| 3 | 2992 | 48bSfSZaq9Aizbu4AWn4st | Febreze (feat. 2 Chainz) | Jack Ü | 6bfkwBrGYKJFk6Z4QVyjxd | Skrillex and Diplo present Jack Ü | 2015-02-24 | ELECTROPOP🐹 | 44p8nNLe4fGfUeArS3MaIX | pop | ... | 2 | -3.51 | 1 | 0.333 | 0.0184 | 0.0 | 0.289 | 0.263 | 149.829 | 214400 |
| 4 | 16751 | 2kJIiIqbzYVtv2iTpbQts9 | As Far as Feelings Go | Alle Farben | 7wN2FvcizhjkzuT3MvAGZI | As Far as Feelings Go | 2019-10-11 | Tropical House 🏝 2020 Hits | 2SRbIs0eBQwHeTP7kErjwo | latin | ... | 0 | -4.399 | 0 | 0.0788 | 0.106 | 0.0 | 0.191 | 0.596 | 105.948 | 210827 |


In [18]:
X_train.shape

(26266, 23)

We check for linear dependencies between the numeric columns with a correlation matrix.

In [21]:
numeric_cols= X_train.select_dtypes(include=np.number).columns
correlation_mtx= X_train[numeric_cols].corr()
display(correlation_mtx)

| | Unnamed: 0 | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| Unnamed: 0 | 1.0 | 0.021414 | 0.10537 | 0.00673 | 0.06715 | -0.050205 | -0.074575 | -0.065035 | 0.138629 | 0.036713 | -0.096217 | 0.005361 | 0.069688 |
| danceability | 0.021414 | 1.0 | -0.083847 | 0.009946 | 0.030101 | -0.058093 | 0.179714 | -0.02733 | -0.009332 | -0.121986 | 0.332161 | -0.185594 | -0.098304 |
| energy | 0.10537 | -0.083847 | 1.0 | 0.014452 | 0.676546 | -0.001786 | -0.028591 | -0.539987 | 0.031149 | 0.161741 | 0.154975 | 0.150171 | 0.007118 |
| key | 0.00673 | 0.009946 | 0.014452 | 1.0 | 0.005139 | -0.17129 | 0.022213 | 0.005787 | 0.008063 | 0.009782 | 0.019065 | -0.012996 | 0.014893 |
| loudness | 0.06715 | 0.030101 | 0.676546 | 0.005139 | 1.0 | -0.016831 | 0.013066 | -0.363087 | -0.147197 | 0.076412 | 0.059333 | 0.094786 | -0.122133 |
| mode | -0.050205 | -0.058093 | -0.001786 | -0.17129 | -0.016831 | 1.0 | -0.06339 | 0.009448 | -0.003981 | -0.008699 | 0.002027 | 0.016414 | 0.013203 |
| speechiness | -0.074575 | 0.179714 | -0.028591 | 0.022213 | 0.013066 | -0.06339 | 1.0 | 0.027622 | -0.105351 | 0.055498 | 0.06819 | 0.044772 | -0.089282 |
| acousticness | -0.065035 | -0.02733 | -0.539987 | 0.005787 | -0.363087 | 0.009448 | 0.027622 | 1.0 | -0.01065 | -0.08002 | -0.014545 | -0.111893 | -0.074379 |
| instrumentalness | 0.138629 | -0.009332 | 0.031149 | 0.008063 | -0.147197 | -0.003981 | -0.105351 | -0.01065 | 1.0 | -0.010316 | -0.18025 | 0.021827 | 0.065622 |
| liveness | 0.036713 | -0.121986 | 0.161741 | 0.009782 | 0.076412 | -0.008699 | 0.055498 | -0.08002 | -0.010316 | 1.0 | -0.019025 | 0.020626 | 0.00988 |


No pair of columns is correlated strongly enough to justify dropping one of them (the strongest relationship is between energy and loudness, at about 0.68), so we keep all columns.
Next we check for missing data.
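
For reference, the coefficients in the matrix above are Pearson correlations, which is the default method of pandas' `DataFrame.corr`. The sketch below computes one by hand in plain Python. The function name `pearson` is an illustrative assumption, and the five input values are taken from the rows printed by `X_train.head()`, which is far too few to draw any conclusion from.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values from the five rows shown by X_train.head().
loudness = [-5.76, -7.782, -3.23, -3.51, -4.399]
speechiness = [0.386, 0.453, 0.0344, 0.333, 0.0788]
print(round(pearson(loudness, speechiness), 3))
```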

In [23]:
check_NAs= X_train.isna().sum()

cols_with_NAs = check_NAs[check_NAs > 0]
print(cols_with_NAs)

track_name          5
track_artist        5
track_album_name    5
dtype: int64


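Only five rows have missing values, all in text columns (`track_name`, `track_artist`, `track_album_name`) that help identify the songs. We next collect the categorical features. Note that the cells shown here do not actually remove the rows with missing values, so they remain in the training data.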

In [25]:
categorical_cols = [f for f in X_train.columns if f not in numeric_cols]
categorical_cols

['track_id',
 'track_name',
 'track_artist',
 'track_album_id',
 'track_album_name',
 'track_album_release_date',
 'playlist_name',
 'playlist_id',
 'playlist_genre',
 'playlist_subgenre']

## 2. Feature engineering

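For the numeric features we use `SimpleImputer` to replace any null values with the most frequent value, and then we standardize the data. The check above found no missing numeric values, so the imputer acts as a safeguard for unseen data.

For the categorical features we one-hot encode them and set `sparse_output=True` so the encoded matrix stays sparse, which saves memory and speeds up computation.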
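
To make the three transformations concrete, here is a minimal plain-Python sketch of what they do to a single column. It is independent of the scikit-learn classes used in the notebook; the helper names and the toy columns are assumptions made for illustration.

```python
from collections import Counter
import math


def impute_most_frequent(values):
    """Replace None with the most common non-missing value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def standardize(values):
    """Scale to zero mean and unit variance (population std, as StandardScaler does)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def one_hot(values, categories):
    """One 0/1 column per known category; unseen categories become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


tempo = [120.0, None, 90.0, 120.0]            # made-up numeric column
genre_train = ["pop", "latin", "pop", "r&b"]  # made-up categorical column
categories = sorted(set(genre_train))         # learned from training data only

scaled = standardize(impute_most_frequent(tempo))
encoded_test = one_hot(["latin", "rock"], categories)  # "rock" was never seen
print(scaled)
print(encoded_test)
```

The all-zero row for the unseen category mirrors what `handle_unknown='ignore'` does in `OneHotEncoder`.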

In [27]:
numeric_transformer = Pipeline(steps=[
   ('imputer', SimpleImputer(strategy='most_frequent')),
   ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])

In [28]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols),
    ],
    remainder='drop')

In [50]:
train_tabular= preprocessor.fit_transform(X_train)
test_tabular= preprocessor.transform(x_test)

## 3. Model Selection

This is a regression problem since popularity is a continuous numeric value, so we apply the main model families for this type of problem, covering both linear and more complex non-linear relationships.
We start with plain linear regression, add an L2 penalty to it (Ridge), then try a k-nearest-neighbors regressor, a random forest regressor, and finally a gradient boosting regressor.

In [44]:
models = {
'lr': Pipeline([('pre', preprocessor), ('model', LinearRegression())]),
'ridge': Pipeline([('pre', preprocessor), ('model', Ridge(random_state=42))]),
'knn': Pipeline([('pre', preprocessor), ('model', KNeighborsRegressor())]),
'rf': Pipeline([('pre', preprocessor), ('model', RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1))]),
'gbr': Pipeline([('pre', preprocessor), ('model', GradientBoostingRegressor(n_estimators=200, random_state=42))])
}

## 4. Training

We use 10-fold cross-validation: in each fold the training set is split into a training part and a validation part, every model is fit on the former and scored on the latter, and we average the per-fold MSLE to get one score per model.

In [48]:
n_splits = 10
kFold = KFold(n_splits=n_splits, shuffle=True, random_state=42)

results = {}

for name, pipe in models.items():

  msles = []
  # Split the raw frame; each pipeline refits its own preprocessing per fold
  for train_id, val_id in kFold.split(X_train):

    new_X_train, X_val = X_train.iloc[train_id], X_train.iloc[val_id]
    new_y_train, y_val = y_train.iloc[train_id], y_train.iloc[val_id]

    pipe.fit(new_X_train, new_y_train)
    preds = pipe.predict(X_val)

    # MSLE is undefined for negative values, so clip predictions at 0
    preds = np.clip(preds, 0, None)
    msle = mean_squared_log_error(y_val, preds)
    msles.append(msle)

  # One score per model: the mean MSLE over all folds
  results[name] = np.mean(msles)
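
The mechanics of the loop above can be sketched without any library. This is a simplified illustration under stated assumptions: the fold assignment is strided rather than the contiguous blocks scikit-learn's `KFold` produces, the "model" just predicts the training mean, and all names and numbers are made up.

```python
import math
import random


def k_fold_indices(n_samples, n_splits, seed=42):
    """Yield (train_idx, val_idx) pairs; every sample is in validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def msle(y_true, y_pred):
    """Mean squared logarithmic error in its log(1 + x) form."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


y = [0, 10, 35, 40, 42, 55, 60, 71, 80, 90]  # made-up popularity scores

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(y), n_splits=5):
    # "Model": always predict the mean of the training targets.
    mean_pred = sum(y[i] for i in train_idx) / len(train_idx)
    fold_scores.append(msle([y[i] for i in val_idx], [mean_pred] * len(val_idx)))

cv_score = sum(fold_scores) / len(fold_scores)
print(f"mean MSLE over {len(fold_scores)} folds: {cv_score:.4f}")
```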

## 5. Results

In [50]:
print("CV MSLE (lower is better):")
for k, v in results.items():
    print(f"{k}: {v:.5f}")

CV MSLE (lower is better):
lr: 2.31968
ridge: 1.85150
knn: 1.82347
rf: 1.79143
gbr: 1.76732


In [52]:
best_name = min(results, key=results.get)
best_model = models[best_name]
best_model

Now that we have the best model, we refit it on the full training set and predict on the actual x_test data.

In [76]:
best_model.fit(X_train, y_train)

# --- Predict on hold-out test set ---
test_preds = best_model.predict(x_test)
test_preds = np.clip(test_preds, 0, None)

print(f"\nBest model: {best_name}")
print("First 10 predictions on test set:", test_preds[:10])


Best model: gbr
First 10 predictions on test set: [36.52646683 37.03475541 51.61518524 39.49213774 38.61442819 17.74718324
 46.70703865 16.64265407 22.97052475 40.16044377]


We print the first few predictions, then save all of them to a CSV file. We cannot score them ourselves because we do not have y_test.

In [None]:
predictionsML_df = pd.DataFrame({
    'Unnamed: 0': np.arange(len(test_preds)), 
    'ML prediction': test_preds,                  
})

predictionsML_df.to_csv('predictions_ml_template.csv', index=False)
print('Saved predictions_ml_template.csv')
print(predictionsML_df.head())

Gradient boosting gave the lowest cross-validated MSLE (1.767), so it is our best model. A plausible reason is that boosting builds its trees sequentially, each one fit to the errors of the ensemble so far, which helps it capture patterns that are hard to predict. It also edged out random forest (1.791), although the gap is small and may not hold on other data.
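
The "correcting its own mistakes" idea can be shown with a toy version of gradient boosting for squared error, where each new one-split tree (a stump) is fit to the current residuals. This is only a sketch of the principle, not scikit-learn's `GradientBoostingRegressor`; the names `fit_stump` and `boost` and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the ensemble built so far."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    def mse():
        return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

    history = [mse()]  # training error before any stump is added
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(mse())
    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]        # made-up feature
ys = [5, 7, 20, 22, 21, 40, 43, 41]  # made-up target
predict, history = boost(xs, ys)
print(f"training MSE: baseline = {history[0]:.2f}, after 20 rounds = {history[-1]:.4f}")
```

The training error shrinks round by round here, but that says nothing about generalization, which is why the notebook compares models on held-out folds.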

## TASK 2: Recommend 5 songs

## 1. Data preprocessing

We apply the same preprocessing pipeline as before, this time to the whole training data.

In [113]:
train_matrix= preprocessor.fit_transform(TRAIN)

## 2. Training

To recommend songs we use nearest neighbors to find songs that are similar to each other: if a user already likes a song, they are likely to enjoy a similar one.

In [46]:
from sklearn.neighbors import NearestNeighbors

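We use cosine distance as the metric because we want to compare the pattern of feature values rather than their magnitude, and it works well with high-dimensional sparse data such as our one-hot encoded matrix. To choose `n_neighbors` we try several candidate values of k and keep the one with the smallest average neighbor distance. This is a heuristic rather than cross-validation, and because the average distance to the k nearest neighbors cannot decrease as k grows, it will always select the smallest candidate.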
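
A short plain-Python sketch of the metric may help. Cosine distance is 1 minus the cosine similarity of two vectors, which is also how scikit-learn defines its `'cosine'` metric. The vectors and helper names below are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance, shown only for contrast."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


song = [0.8, 0.6, 0.1]           # made-up feature vector
scaled = [2 * v for v in song]   # same direction, twice the magnitude
other = [0.1, 0.2, 0.9]          # a differently shaped profile

print(cosine_distance(song, scaled), euclidean_distance(song, scaled))
print(cosine_distance(song, other))
```

The invariance only covers multiplying a whole vector by a positive constant. Rescaling individual features changes the direction of the vectors, so the standardization step still affects which songs count as neighbors.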

In [125]:
possible_k = [3, 5, 8, 10, 12]
av_distances= []

for k in possible_k:
    nn = NearestNeighbors(n_neighbors=k+1, metric='cosine')
    nn.fit(train_matrix)
    
    # Query with the same matrix the model was fit on
    distances, _ = nn.kneighbors(train_matrix)
    
    # Column 0 is each song's distance to itself, so exclude it
    av_distances.append(np.mean(distances[:, 1:])) 

best_k = possible_k[np.argmin(av_distances)]
print(best_k)

3
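
The result is the smallest candidate, which follows from the selection rule itself rather than from the data: adding a farther neighbor can never lower the mean of the nearest distances. The sketch below shows this on one made-up list of sorted neighbor distances.

```python
# Sorted distances from one song to its nearest neighbors (made-up values).
sorted_distances = [0.02, 0.05, 0.05, 0.11, 0.20, 0.21,
                    0.34, 0.40, 0.41, 0.55, 0.60, 0.72]

possible_k = [3, 5, 8, 10, 12]

# Mean distance to the k nearest neighbors, for each candidate k.
mean_by_k = {k: sum(sorted_distances[:k]) / k for k in possible_k}

for k, mean_dist in mean_by_k.items():
    print(k, round(mean_dist, 4))

# The minimum is always reached at the smallest k.
best_k = min(mean_by_k, key=mean_by_k.get)
print("selected k:", best_k)
```

A criterion that can actually prefer a larger k would need some external signal, for example whether users liked the recommended tracks.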


In [126]:
# We fit NearestNeighbors with the best_k
nn = NearestNeighbors(n_neighbors=best_k, metric='cosine', algorithm='auto')
nn.fit(train_matrix)

We create a function that recommends 5 tracks based on previous liked songs.

The function takes *liked_indices* (the row positions of the previously liked songs), the *dataframe* containing the song information, the fitted *nn* model, the *feature matrix* (the preprocessed dataframe), and *top_k*, the number of songs to recommend.

In [129]:
from collections import Counter

def recommend_from_likes(liked_indices, df, nn, feature_matrix, top_k):
    # Collect the neighbors of every liked song
    neighs = []

    for idx in liked_indices:
        vect = feature_matrix[idx].reshape(1, -1)
        # Ask for top_k + 1 neighbors because the closest one is the song itself
        dists, inds = nn.kneighbors(vect, n_neighbors=top_k + 1)
        neighs.extend(inds[0][1:])

    # Count how often each candidate appears across the liked songs
    counts = Counter(neighs)

    # Exclude tracks the user already likes
    for li in liked_indices:
        counts.pop(li, None)

    # Counter keys are unique, so the top_k most common are distinct tracks
    recommended_indices = [i for i, _ in counts.most_common(top_k)]
    recommended = df.iloc[recommended_indices].reset_index(drop=True)
    return recommended
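
The voting logic of the function can be tried on a toy example without scikit-learn. In this sketch the neighbor search is a brute-force cosine ranking, and the vectors, the helper names `nearest` and `recommend`, and the default `n_neighbors=3` are assumptions made for illustration.

```python
from collections import Counter
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest(idx, vectors, n_neighbors):
    """Indices of the n_neighbors closest rows to row idx, excluding idx itself."""
    others = [j for j in range(len(vectors)) if j != idx]
    others.sort(key=lambda j: cosine_distance(vectors[idx], vectors[j]))
    return others[:n_neighbors]


def recommend(liked, vectors, top_k, n_neighbors=3):
    """Each liked song votes for its neighbors; return the most voted candidates."""
    votes = Counter()
    for idx in liked:
        votes.update(nearest(idx, vectors, n_neighbors))
    for idx in liked:  # never recommend something already liked
        votes.pop(idx, None)
    return [idx for idx, _ in votes.most_common(top_k)]


# Made-up 2-D feature vectors: rows 0-3 point one way, rows 4-6 another.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.3], [0.8, 0.1],
           [0.1, 1.0], [0.2, 0.9], [0.3, 1.0]]

print(recommend(liked=[0, 1], vectors=vectors, top_k=2))
```

Candidates that are close to several liked songs collect more votes and rank first, which is the same aggregation the `Counter` performs in `recommend_from_likes`.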

## 3. Results

We apply the function to our data, selecting 5 indices of liked songs.

In [131]:
liked_indices = [0,20,100,266,470]

In [133]:
songs_recommended= recommend_from_likes(liked_indices, TRAIN, nn, train_matrix, 5)

Finally, we add a column recording which liked indices the recommendations came from and save the recommended songs as a CSV file.

In [136]:
songs_recommended['recommended_from_liked_indices'] = [liked_indices]*len(songs_recommended)
songs_recommended.to_csv('recommended_songs.csv', index=False)
print("Recommended tracks saved to recommended_songs.csv")
print(songs_recommended)

Recommended tracks saved to recommended_songs.csv
   Unnamed: 0                track_id  \
0       21889  2XESFfYekQ0mRau2GXjT3E   
1       21896  3FskQrDXcY24ur2fCvz35O   
2       25738  63Ly2sEzloc9s0yAXlMi6r   
3       21901  5RUrVSdaXFVXjYK4lr2xf3   
4       21962  5WkyhmqrQT8iGq0Y5VMa4a   

                                    track_name track_artist  \
0                                      Air Max        Rim'K   
1                                           Ye    Burna Boy   
2                 Warm Water - Snakehips Remix        BANKS   
3                              Time After Time     Lil Baby   
4  Pose To Do (feat. French Montana and Quavo)     Lil Pump   

           track_album_id                             track_album_name  \
0  4iRAKma59A97OMcac2nsOa                                      Air Max   
1  26du6obYLeY1vf6xIJ1l0D                                      Outside   
2  41xJklJV7uqDzg9teggeR6                 Warm Water (Snakehips Remix)   
3  1ho0cNe552yTcBHXbzfozB   