## Spotify playlist prediction [Link](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)

*Link to the dataset and competition page:*

In [None]:
# https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge

### How to solve: (Item-based collaborative filtering) - *Background detail*

*Complex Approach:*

In [None]:
# COMPLEX APPROACH: https://towardsdatascience.com/part-iii-building-a-song-recommendation-system-with-spotify-cf76b52705e7

#        ^This website has a whole breakdown of a way to make a similar model, the problem is this is going
#         to be very very complex and outside the scope of what we learned. Which is fine, but also makes the report quite hard.

*Simplier approach:*

In [None]:
# A simplier approach: https://surpriselib.com/
# ^ this has pre made recommender system algorithms

In [None]:
# https://surprise.readthedocs.io/en/stable/prediction_algorithms.html
# ^ will be needed for report

Going with the above approach of using the simplier package will work as you can specify the type of collaborative filtering. In our case, we want item-based collaborative filtering. The difference between user-based and item-based is the level at which similarities are computed. So for user based it will target the song suggestions for a particular user, whereas item based will taget the song suggestions based off of other songs - *we want this*.

This will work as long as we make sure we understand the data and make modelling and data manipulation decisions with the data in mind.

*This guy explains some of the decisions we will have to make very well (and it's with super similar data) [video of song suggestion model](https://www.youtube.com/watch?v=zcPifvgECOw)*

**IF WE GO THIS ROUTE WHAT DO WE NEED TO DO:**

I. We need to pick a simularity method to use. From my research I think we should use cosign, which I can explain in more detail in person, but the basic reasoning is it should work better with common items (playlists/songs) that are further apart in the feature space - and not show (*large*) bias towards one particular song or another. In more detail: The model works by "plotting" the features of each song (or the average of a playlist) on a multi-dimensional graph (*p features*). The model uses a "similarity" model to calculate the difference between those features. The math behind the distance depends on the "similarity" model. The advantage of each similarity model is noted in complete detail here ([similarities](https://surprise.readthedocs.io/en/stable/similarities.html#module-surprise.similarities)).

II. After picking a simularity method, we then need to standardize the variables. The method of standardization varies, but I think most methods will work here. The important thing is we are standardizing so that the model will not show bias toward larger values - likely logistic standardizing/transformation will work best. This way playlists with large followers and songs with lots of plays don't scew the data from volume bias.

III. One of the most important decisions we need to make is whether to include song meta data or not. Each song in a playlist has a song uri which can be used in collaboration with spotify's api to get song data such as temp, bpm, and other meaures. This would add an aditional layer of complexity but make the model work way better (meaning, we could maybe win). In my opinion we start with the simplier option and if our model works, we then copy it and modify the copied version to see if we can add the metadata.

IV. We also need to determine our predictive algorithm structure ([Custom Algo](https://surprise.readthedocs.io/en/stable/building_custom_algo.html)). This will go through each playlist in our test set. It is basically the same thing as if we ran it once, but for a large scale.

V. Lastly, we should break down the meaning of the *fit method* and the *trainset attribute* ([The fit method (trainset attribute right after)](https://surprise.readthedocs.io/en/stable/building_custom_algo.html#the-fit-method)) to better understand our predictions and model.

VI. Construct the model...



---



---



*(hidden) import and setup:*

# I. Data Exploration

### **NOTE DO NOT PRINT THE CONTENTS OF A JSON FILE UNLESS NEEDED (TAKE A SAMPLE) ~ will 10% of the time crash colab**

*Subsection: Loading Data*

In [None]:
import pandas as pd
import os
import json
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Becuase of RAM limitations, we have to load the data in subsets, and then combine the subsets:

In [None]:
# subset 1
i = 0
spotifydata_s1 = pd.DataFrame()
path = "drive/MyDrive/data/subset1"
for mpd in os.listdir("drive/MyDrive/data/subset1"):
  if i >= 100:
    break
  with open(os.path.join(path, mpd), 'r') as j:
     contents = json.loads(j.read())
  spotifydata_s1 = pd.concat([spotifydata_s1, pd.json_normalize(contents["playlists"])], ignore_index=True)
  i += 1

In [None]:
spotifydata_s1

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,tracks,num_edits,duration_ms,num_artists,description
0,litty titty,false,115000,1508371200,212,126,1,"[{'pos': 0, 'artist_name': 'Travis Scott', 'tr...",11,51022342,75,
1,calm,false,115001,1447977600,165,29,2,"[{'pos': 0, 'artist_name': 'Albert Hammond, Jr...",19,28591224,21,
2,jams,false,115002,1470960000,37,34,2,"[{'pos': 0, 'artist_name': 'Post Malone', 'tra...",11,8509329,31,
3,Halloween,false,115003,1504224000,133,73,1,"[{'pos': 0, 'artist_name': 'Aurelio Voltaire',...",7,27064505,77,
4,Energetic,false,115004,1430265600,8,8,1,"[{'pos': 0, 'artist_name': 'Lana Del Rey', 'tr...",4,1915178,8,
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,Chill,false,195995,1481846400,47,46,3,"[{'pos': 0, 'artist_name': 'Vance Joy', 'track...",40,11211960,45,
99996,POP,false,195996,1436054400,25,20,1,"[{'pos': 0, 'artist_name': 'Jason Derulo', 'tr...",6,5429232,15,
99997,2017 Playlist,false,195997,1506384000,50,29,3,"[{'pos': 0, 'artist_name': 'Rihanna', 'track_u...",5,10329101,24,
99998,Chance The Rapper,false,195998,1508457600,14,6,1,"[{'pos': 0, 'artist_name': 'Chance The Rapper'...",4,3478665,6,


*Exploring the data: Seeing what the playlist track dictionaries consist of*

In [None]:
tracklist = spotifydata_s1["tracks"][0]
tracklist[1]

{'pos': 1,
 'artist_name': 'Travis Scott',
 'track_uri': 'spotify:track:6gBFPUFcJLzWGx4lenP6h2',
 'artist_uri': 'spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY',
 'track_name': 'goosebumps',
 'album_uri': 'spotify:album:42WVQWuf1teDysXiOupIZt',
 'duration_ms': 243836,
 'album_name': 'Birds In The Trap Sing McKnight'}

*Normalizing the tracklist to get an easier look*
- This playlist has 212 songs

In [None]:

pd.json_normalize(tracklist)

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name
0,0,Travis Scott,spotify:track:0ESJlaM8CE1jRWaNtwSNj8,spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY,beibs in the trap,spotify:album:42WVQWuf1teDysXiOupIZt,213863,Birds In The Trap Sing McKnight
1,1,Travis Scott,spotify:track:6gBFPUFcJLzWGx4lenP6h2,spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY,goosebumps,spotify:album:42WVQWuf1teDysXiOupIZt,243836,Birds In The Trap Sing McKnight
2,2,Travis Scott,spotify:track:2c2csx4OTYtbkzvbSTXlGY,spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY,guidance,spotify:album:42WVQWuf1teDysXiOupIZt,207107,Birds In The Trap Sing McKnight
3,3,Travis Scott,spotify:track:1yxgsra98r3qAtxqiGZPiX,spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY,Butterfly Effect,spotify:album:4fOw7xSDwqb58Z2Qia5j81,190677,Butterfly Effect
4,4,Travis Scott,spotify:track:1SGt65i9AnXYdDQt1AtDRH,spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY,3500,spotify:album:4PWBTB6NYSKQwfo79I3prg,461840,Rodeo
...,...,...,...,...,...,...,...,...
207,207,Russ,spotify:track:3pndPhlQWjuSoXhcIIdBjv,spotify:artist:1z7b1Pr1rSlvWRzsW3HOrS,What They Want,spotify:album:0lUL92det7mZ4DaHYmiUEC,165853,There's Really A Wolf
208,208,Jeremih,spotify:track:0PJIbOdMs3bd5AT8liULMQ,spotify:artist:3KV3p5EY4AvKxOlhGHORLg,oui,spotify:album:7DMyQuDPe8xzjC0UDSDa96,238320,Late Nights: The Album
209,209,Jeremih,spotify:track:08zJpaUQVi9FrKv2e32Bah,spotify:artist:3KV3p5EY4AvKxOlhGHORLg,Planez,spotify:album:7DMyQuDPe8xzjC0UDSDa96,240320,Late Nights: The Album
210,210,Rihanna,spotify:track:3DZQ6mzUkAdHqZWzqxBKIK,spotify:artist:5pKCCKE2ajJHZ9KAiaK11H,Loveeeeeee Song,spotify:album:4eddbruVtOqw8khwxSH6H2,256320,Unapologetic


---

# API: requests *(exploring the api)*

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
import requests
import base64

client_id = "f07c1447c8384d3d9f5731543d6a4ba6"
client_secret = "747df0293f3c4a39a45da2ceacfd5708"

encoded = base64.b64encode((client_id + ":" + client_secret).encode("ascii")).decode("ascii")

headers = {
     "Content-Type": "application/x-www-form-urlencoded",
     "Authorization": "Basic " + encoded
}

payload = {
     "grant_type": "client_credentials"
}

response = requests.post("https://accounts.spotify.com/api/token", data=payload, headers=headers)
print(response.text)

In [None]:
api_token = "BQCbWuhhU0Y7OxYoW_tIpDsbb1Kuuw9XpcwPSvmEyQyIkdjHimU1IW4b7uD9eHNPC6K0ycGanRREcfUhjMHmp-4hhqeD_Q2OZam3AsBQ7wywj26xH-4"

In [None]:
response = requests.get("https://api.spotify.com/v1/artists/31TPClRtHm23RisEBtV3X7", headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_token}"})

In [None]:
temp =response.json()

In [None]:
temp

---

---

# PLAYLIST PREDICTION MODEL:

In [None]:
import gc # attempt to limit ram
gc.collect()

0

*Searches through all 100k rows and creates a new col called tracknames - corresponding to each track dictionary*

In [None]:
list_trackname_col = []
for row in spotifydata_s1.iloc():
  tracknames = list()
  tracklist = row['tracks']
  for track in tracklist:
    tracknames.append(track["track_name"])
  list_trackname_col.append(tracknames)

spotifydata_s1["tracknames"] = list_trackname_col
gc.collect()

0

*RAM was crashing the colab script when running machine learning model - so a subset was selected at a time*

In [None]:
subdata = spotifydata_s1.loc[:1000]

In [None]:
subdata

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,tracks,num_edits,duration_ms,num_artists,description,tracknames
0,litty titty,false,115000,1508371200,212,126,1,"[{'pos': 0, 'artist_name': 'Travis Scott', 'tr...",11,51022342,75,,"[beibs in the trap, goosebumps, guidance, Butt..."
1,calm,false,115001,1447977600,165,29,2,"[{'pos': 0, 'artist_name': 'Albert Hammond, Jr...",19,28591224,21,,"[Spooky Couch, The Heroic Weather-Conditions o..."
2,jams,false,115002,1470960000,37,34,2,"[{'pos': 0, 'artist_name': 'Post Malone', 'tra...",11,8509329,31,,"[Go Flex, Never Be Like You, Look Alive, Lockj..."
3,Halloween,false,115003,1504224000,133,73,1,"[{'pos': 0, 'artist_name': 'Aurelio Voltaire',...",7,27064505,77,,[Brains! (From The Grim Adventures of Billy an...
4,Energetic,false,115004,1430265600,8,8,1,"[{'pos': 0, 'artist_name': 'Lana Del Rey', 'tr...",4,1915178,8,,[Summertime Sadness [Lana Del Rey vs. Cedric G...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,Chaos,true,115996,1408406400,20,19,1,"[{'pos': 0, 'artist_name': 'Kormac', 'track_ur...",16,5660735,17,,"[Superhero - Original Mix, Artichaut, Dragons,..."
997,Spring 2014,false,115997,1417305600,14,12,1,"[{'pos': 0, 'artist_name': 'Disclosure', 'trac...",8,3765841,9,,"[Latch, Rather Be (feat. Jess Glynne), Nirvana..."
998,autumn,false,115998,1507161600,35,32,1,"[{'pos': 0, 'artist_name': 'Hayley Kiyoko', 't...",8,8322008,31,,"[Gravel To Tempo, 101, V. 3005, Blue Denim, Se..."
999,ChIlL,false,115999,1509321600,136,118,2,"[{'pos': 0, 'artist_name': 'Drake', 'track_uri...",96,31046977,99,,"[Right Hand, Broken, Say You Won't Let Go, Rid..."


*Inorder to buid a cooccurrence matrix first we must get a list of every unique track name - did this through a set*

In [None]:
tset = set()
for tlist in subdata["tracknames"]:
  for t in tlist:
    tset.add(t)

*This creates the cooccurrence matrix called 'playlist_matrix' using the MultiLabelBinarizer package. It then takes that cooccurrence matrix and constructs a similarity matrix using cosine similarity for distance. This similarity matrix will become the backbone of our ML model.*

In [None]:
# song encoding for playlist id
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

playlist_matrix = None
mlb = MultiLabelBinarizer()
playlist_matrix = pd.DataFrame(mlb.fit_transform(subdata['tracknames']), columns=mlb.classes_, index=subdata.index)

# Calculate the cosine similarity matrix
similarity_matrix = cosine_similarity(playlist_matrix.T, dense_output=False)

gc.collect()

2

### This function gives the recomendations based off the input track name and cooccurance matrix:
*Will be used later to tune NN parameters*

In [None]:
def make_recommendations(track_name, similarity_matrix, num_recommendations, tset):
    # Convert track_name to a hashable type (e.g., string) if needed
    track_name_hashable = track_name if isinstance(track_name, str) else str(track_name)

    # Check if the track_name is in the set
    if track_name_hashable not in tset:
        raise ValueError(f"Track name '{track_name}' not found in the set of track names.")

    # Convert the set to a list for indexing
    track_names = list(tset)

    # Get the index of the specified track name
    track_index = track_names.index(track_name_hashable)

    # Get similarity scores for the specified track
    track_similarity_scores = similarity_matrix[track_index, :]

    # Get indices of top recommendations based on similarity scores
    top_recommendations_indices = np.argsort(track_similarity_scores)[-num_recommendations:][::-1]

    return top_recommendations_indices

*The function outputs an array of track indexes, which then can be converted to track names.*

In [None]:
make_recommendations("45", similarity_matrix, 5, tset)

array([27457,  1053, 14052, 27589, 25922])

In [None]:
def indices_to_track_names(indices, track_names):
    return [track_names[index] for index in indices]

In [None]:
temp = make_recommendations("45", similarity_matrix, 5, tset)
indices_to_track_names(temp, list(tset))

['45',
 'Sweet Dreams (Are Made of This) - Remastered',
 'A Start',
 'Losing My Religion',
 'Event: Confrontation With Iblis']

We just made a cooccurance matrix, which we used to make a cosigne similarity matrix of track to tracks. And then we used that matrix to define an item to item collaborative filtering function. Now we will make a multilabel classification machine learning model to intake a list of tracks and output a seed track - except in our case the seed track is the outcome variable. We will take the top 500 seed tracks, with the highest probability, and output that as our solution to the competion.

**NOTE:** The model accuracy score is useless for our business case. The model accuracy will be checking if the top recomended track is the right corresponding track, but rather we only care if it is one of the top 500 recomended tracks - so we will tune the parameters using a custom loop to output the top 500 and top 100 accuracy score.

In [34]:
similarity_matrix.shape

(30752, 30752)

In [None]:
similarity_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 1.]])

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

*Building the model*

In [None]:
# Assuming similarity_matrix and tset are defined
X = similarity_matrix
y = np.array(list(tset))

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Combine labels from both training and test sets
all_labels = np.concatenate((y_train, y_test), axis=None)

# Use LabelEncoder to fit on the combined labels
label_encoder = LabelEncoder()
label_encoder.fit(all_labels)



# Encode the labels using the LabelEncoder
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(len(label_encoder.classes_), activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
model.fit(X_train, y_train_encoded, epochs=20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x792328811840>

In [None]:
# Evaluate the model on the test set
accuracy = model.evaluate(X_test, y_test_encoded)
print(f'Test Accuracy: {accuracy[1]}')

Test Accuracy: 0.0


In [None]:
y_pred = model.predict(X_test)
pred_labels = label_encoder.inverse_transform(np.argmax(y_pred, axis=1))



*Custom loop to check top 500 and top 100 accuracy score:*

In [None]:
i = 0
acc_list = []
for label in y_test:
  #print(f'Actual: {label}, Predicted: {pred_labels[i]}')
  recLists = make_recommendations(label, similarity_matrix, 500, tset)
  if pred_labels[i] in indices_to_track_names(recLists, list(tset)):
    acc_list.append(1)
  else:
    acc_list.append(0)
  i += 1


In [None]:
sum(acc_list)/len(acc_list)

0.962119980490977

Again the model accuracy score is predicting the top song of all the 30k+ possibilities, so obviously it is not performing well. We only care about it predicting the correct top 500 - the order is ambiguous. So in this case our model is actually performing at 96.21% accuracy with the test set - which is really good.


---


# Formatting Submission
*Creating submission with the proper requested formatting.*

In [None]:
import csv
import gzip

In [None]:
data = [["team_info", "Mad City Metrics", "epanderson6@wisc.edu", "ldesmet@wisc.edu"]]
for label in y_test:
  #print(f'Actual: {label}, Predicted: {pred_labels[i]}')
  recLists = make_recommendations(label, similarity_matrix, 500, tset)
  data.append([indices_to_track_names(recLists, list(tset))])

In [None]:
output_file_path = 'madCityMetrics_submission.csv.gz'

# Writing to a gzipped CSV file
with gzip.open(output_file_path, 'wt', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file)

    # Writing the data to the CSV file
    csv_writer.writerows(data)

print(f'Data has been written to {output_file_path}')

Data has been written to madCityMetrics_submission.csv.gz
