<a href="https://colab.research.google.com/github/mpedraza98/spotify_recommender/blob/main/code/retrieve_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we create our datasets. The original JSON files for the one million playlists were retrieved from https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files

In [None]:
import json
from tqdm import tqdm
import pandas as pd

In [None]:
import os, fnmatch
from requests.exceptions import ReadTimeout
import gc
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
from operator import itemgetter

# Inspecting the data

We want to look at the dataset and its format, and compare them with the information provided in the metadata file.

### `playlists` field 
This is an array that typically contains 1,000 playlists. Each playlist is a dictionary that contains the following fields:


* ***pid*** - integer - playlist id - the MPD ID of this playlist. This is an integer between 0 and 999,999.
* ***name*** - string - the name of the playlist 
* ***description*** - optional string - if present, the description given to the playlist. Note that user-provided playlist descriptions are a relatively new feature of Spotify, so most playlists do not have descriptions.
* ***modified_at*** - seconds - timestamp (in seconds since the epoch) when this playlist was last updated. Times are rounded to midnight GMT of the date when the playlist was last updated.
* ***num_artists*** - the total number of unique artists for the tracks in the playlist.
* ***num_albums*** - the number of unique albums for the tracks in the playlist
* ***num_tracks*** - the number of tracks in the playlist
* ***num_followers*** - the number of followers this playlist had at the time the MPD was created. (Note that the follower count does not include the playlist creator.)
* ***num_edits*** - the number of separate editing sessions. Tracks added in a two-hour window are considered to be added in a single editing session.
* ***duration_ms*** - the total duration of all the tracks in the playlist (in milliseconds)
* ***collaborative*** -  boolean - if true, the playlist is a collaborative playlist. Multiple users may contribute tracks to a collaborative playlist.
* ***tracks*** - an array of information about each track in the playlist. Each element in the array is a dictionary with the following fields:
   * ***track_name*** - the name of the track
   * ***track_uri*** - the Spotify URI of the track
   * ***album_name*** - the name of the track's album
   * ***album_uri*** - the Spotify URI of the album
   * ***artist_name*** - the name of the track's primary artist
   * ***artist_uri*** - the Spotify URI of the track's primary artist
   * ***duration_ms*** - the duration of the track in milliseconds
   * ***pos*** - the position of the track in the playlist (zero-based)
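
The nested layout described above (a list of playlists, each with a list of tracks) can be flattened into one row per track. The following is a minimal sketch, not the code used to produce the CSV files in this notebook: the column order is an assumption inferred from the headerless track CSVs read further down, and `playlist_track_rows` and the playlist name are hypothetical.

```python
import csv
import io

# Column order assumed from the headerless track CSVs read later in this notebook
TRACK_FIELDS = ['track_name', 'track_uri', 'album_name', 'album_uri',
                'artist_name', 'artist_uri', 'duration_ms', 'pos']


def playlist_track_rows(playlists):
    """Yield one row per track: the playlist's pid followed by the track fields."""
    for playlist in playlists:
        for track in playlist['tracks']:
            yield [playlist['pid']] + [track[field] for field in TRACK_FIELDS]


# Hand-made slice in the documented format; the playlist name is a placeholder
example_slice = {'playlists': [{
    'pid': 10000,
    'name': 'example playlist',
    'tracks': [{
        'track_name': 'Magic',
        'track_uri': 'spotify:track:23khhseCLQqVMCIT1WMAns',
        'album_name': 'Ghost Stories',
        'album_uri': 'spotify:album:2G4AUqfwxcV1UdQjm2ouYr',
        'artist_name': 'Coldplay',
        'artist_uri': 'spotify:artist:4gzpq5DPGxSnKTe4SA8HAU',
        'duration_ms': 285014,
        'pos': 0,
    }],
}]}

rows = list(playlist_track_rows(example_slice['playlists']))

# Write the rows without a header line, matching how the CSVs are read below
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Keeping `pid` as the first column lets every track row be traced back to its playlist after flattening.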


Initially, we will try to classify playlists based on the playlist name (the `name` field).

Let's create a table for every playlist, so that the information from all of them is available in tabular form.

In [None]:
# Change this path to your own directory to run the code
path = '/home/mapedrazaj/Desktop/math_project/spotify_million_playlist_dataset/data_csv/'
path_extra = '/home/mapedrazaj/Desktop/math_project/spotify_million_playlist_dataset/data_csv_extra/'

In [None]:
df = pd.read_csv(path + 'slice_1_tracks.csv', header=None)

In [None]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,10000,Magic,spotify:track:23khhseCLQqVMCIT1WMAns,Ghost Stories,spotify:album:2G4AUqfwxcV1UdQjm2ouYr,Coldplay,spotify:artist:4gzpq5DPGxSnKTe4SA8HAU,285014,0
1,10000,A Sky Full of Stars,spotify:track:0FDzzruyVECATHXKHFs9eJ,Ghost Stories,spotify:album:2G4AUqfwxcV1UdQjm2ouYr,Coldplay,spotify:artist:4gzpq5DPGxSnKTe4SA8HAU,268466,1
2,10000,Every Little Thing She Does Is Magic,spotify:track:5DnUFzGSrLiiAJRxKoiwFv,Symphonicities,spotify:album:1dpyonY9ev2z5a7rwfERZh,Sting,spotify:artist:0Ty63ceoRnnJKVEYP0VQpk,296826,2
3,10000,I Wanna Be Your Lover - Single Version,spotify:track:4gi2ioQwGOBXTrXlBR9RfQ,The Hits 2,spotify:album:2E5Jr8tcyqKrGzGPmNA3il,Prince,spotify:artist:5a2EaR3hamoenG9rDuVn8j,180080,3
4,10000,Raspberry Beret,spotify:track:5jSz894ljfWE0IcHBSM39i,Around The World In A Day,spotify:album:5FbrTPPlaNSOsChhKUZxcu,Prince,spotify:artist:5a2EaR3hamoenG9rDuVn8j,215173,4
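
Because the file is read with `header=None`, the columns are only numbered. A possible way to make the table easier to read is to pass column names when loading. The names below are an assumption, inferred from the field list above and the sample rows; the files themselves carry no header. The sketch reads one sample row from an in-memory string instead of from `path`.

```python
import io

import pandas as pd

# Assumed names, inferred from the MPD field list and the sample rows above
TRACK_COLUMNS = ['pid', 'track_name', 'track_uri', 'album_name', 'album_uri',
                 'artist_name', 'artist_uri', 'duration_ms', 'pos']

# One row copied from the output above, used in place of a file on disk
sample_csv = io.StringIO(
    '10000,Magic,spotify:track:23khhseCLQqVMCIT1WMAns,Ghost Stories,'
    'spotify:album:2G4AUqfwxcV1UdQjm2ouYr,Coldplay,'
    'spotify:artist:4gzpq5DPGxSnKTe4SA8HAU,285014,0\n'
)

named_df = pd.read_csv(sample_csv, header=None, names=TRACK_COLUMNS)
print(named_df[['pid', 'track_name', 'artist_name']])
```

With named columns, the track URI can be selected as `named_df['track_uri']` rather than by position.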


In [None]:
# These are the Spotify API credentials. You can obtain your own at
# https://developer.spotify.com/documentation/web-api/tutorials/client-credentials-flow
# They are read from environment variables so that secrets are not stored in the notebook.

client_credentials_manager = SpotifyClientCredentials(
    client_id=os.environ['SPOTIPY_CLIENT_ID'],
    client_secret=os.environ['SPOTIPY_CLIENT_SECRET'],
)
spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager, requests_timeout=10, retries=10)

In [None]:
# These are the features we want to retrieve from the API
track_features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                  'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']


In [None]:
spotify.audio_features('spotify:track:4gi2ioQwGOBXTrXlBR9RfQ')[0]

{'danceability': 0.793,
 'energy': 0.442,
 'key': 8,
 'loudness': -11.293,
 'mode': 0,
 'speechiness': 0.0564,
 'acousticness': 0.236,
 'instrumentalness': 0.00163,
 'liveness': 0.0662,
 'valence': 0.833,
 'tempo': 115.995,
 'type': 'audio_features',
 'id': '4gi2ioQwGOBXTrXlBR9RfQ',
 'uri': 'spotify:track:4gi2ioQwGOBXTrXlBR9RfQ',
 'track_href': 'https://api.spotify.com/v1/tracks/4gi2ioQwGOBXTrXlBR9RfQ',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/4gi2ioQwGOBXTrXlBR9RfQ',
 'duration_ms': 180080,
 'time_signature': 3}
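
Only a subset of the keys in this response is needed. The sketch below picks the selected features out of such a dictionary and falls back to a sentinel row when no dictionary is available. It assumes the endpoint may return `None` for a track without audio features; `select_features` is a hypothetical helper, and the sample values are copied from the output above.

```python
track_features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                  'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

MISSING = -999  # sentinel, the same value the retrieval loop in this notebook uses


def select_features(features, names=track_features):
    """Return the requested feature values in order, or sentinels if features is None."""
    if features is None:
        return [MISSING] * len(names)
    return [features[name] for name in names]


# Values copied from the sample response above (unused keys omitted)
sample = {
    'danceability': 0.793, 'energy': 0.442, 'key': 8, 'loudness': -11.293,
    'mode': 0, 'speechiness': 0.0564, 'acousticness': 0.236,
    'instrumentalness': 0.00163, 'liveness': 0.0662, 'valence': 0.833,
    'tempo': 115.995, 'duration_ms': 180080, 'time_signature': 3,
}

print(select_features(sample))
print(select_features(None))
```

Returning a row of fixed length in both cases keeps the feature table aligned with the track table when the two are concatenated column-wise.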

In [None]:
%%time
filenames = fnmatch.filter(os.listdir(path), '*_tracks.csv')
failed_tracks = []  # URIs whose audio features could not be retrieved
for file in filenames[:10]:
    print(file)
    extra_features = []
    df = pd.read_csv(path + file, header=None)
    # Column 2 holds the Spotify track URI
    for uri in tqdm(df[2].values):
        try:
            temp_dict = spotify.audio_features(uri)[0]
            temp_features = itemgetter(*track_features)(temp_dict)
        except (ReadTimeout, TypeError):
            # ReadTimeout: the request timed out.
            # TypeError: the API returned None instead of a feature dictionary.
            # Record the failure and fill the row with a sentinel value.
            failed_tracks.append(uri)
            temp_features = [-999] * len(track_features)
        extra_features.append(temp_features)
    # Append the audio features as extra columns and save the enriched slice
    temp_df = pd.DataFrame(extra_features)
    df = pd.concat([df, temp_df], axis=1)
    df.to_csv(path_extra + file, header=False, index=False)
    del df, temp_df
    gc.collect()
    print('done')
    print('-------------')


slice_2_tracks.csv


100%|███████████████████████████████████| 64939/64939 [1:46:15<00:00, 10.18it/s]


done
-------------
slice_1_tracks.csv


100%|███████████████████████████████████| 66648/66648 [1:49:20<00:00, 10.16it/s]


done
-------------
slice_43_tracks.csv


100%|███████████████████████████████████| 66948/66948 [1:50:22<00:00, 10.11it/s]


done
-------------
slice_15_tracks.csv


100%|███████████████████████████████████| 67125/67125 [1:50:09<00:00, 10.16it/s]


done
-------------
slice_49_tracks.csv


100%|███████████████████████████████████| 65803/65803 [1:48:10<00:00, 10.14it/s]


done
-------------
slice_38_tracks.csv


100%|███████████████████████████████████| 68859/68859 [1:49:59<00:00, 10.43it/s]


done
-------------
slice_3_tracks.csv


100%|███████████████████████████████████| 69441/69441 [1:50:22<00:00, 10.49it/s]


done
-------------
slice_36_tracks.csv


100%|███████████████████████████████████| 67382/67382 [1:44:32<00:00, 10.74it/s]


done
-------------
slice_48_tracks.csv


100%|███████████████████████████████████| 64845/64845 [1:41:32<00:00, 10.64it/s]


done
-------------
slice_70_tracks.csv


100%|███████████████████████████████████| 63292/63292 [1:40:29<00:00, 10.50it/s]


done
-------------
CPU times: user 1h 20min 25s, sys: 6min 1s, total: 1h 26min 27s
Wall time: 17h 51min 32s
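
The run above issues one request per track, at roughly 10 requests per second. A possible way to reduce the number of requests is to send several track URIs per call. The sketch below assumes that `spotify.audio_features` accepts a list of up to 100 URIs and returns one entry per URI in the same order; this should be checked against the spotipy documentation before use. `chunked`, `fetch_features_batched`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the API call so the example runs offline with placeholder URIs.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features_batched(uris, fetch, batch_size=100):
    """Call `fetch` once per batch and return one result per URI, in order."""
    results = []
    for batch in chunked(list(uris), batch_size):
        results.extend(fetch(batch))
    return results


# Hypothetical stand-in for spotify.audio_features; records the size of each batch
calls = []


def fake_fetch(batch):
    calls.append(len(batch))
    return [{'uri': uri, 'tempo': 120.0} for uri in batch]


# Placeholder URIs, not real tracks
uris = ['spotify:track:%d' % n for n in range(250)]
features = fetch_features_batched(uris, fake_fetch)
print(len(features), calls)
```

In the notebook, `fetch` would be `spotify.audio_features`. Tracks that appear in several playlists could also be fetched once and reused, which would reduce the number of calls further.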


In [None]:
temp_df = pd.DataFrame(extra_features)

In [None]:
temp_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.674,0.413,2,-7.816,1,0.0274,0.836,1.9e-05,0.098,0.503,124.893
1,0.63,0.53,0,-7.259,1,0.0434,0.4,0.0,0.177,0.417,108.038
2,0.456,0.636,1,-6.552,1,0.0432,0.462,0.000189,0.252,0.492,183.866
3,0.586,0.128,7,-9.297,1,0.0496,0.963,0.0,0.0858,0.371,123.498
4,0.651,0.663,0,-5.569,0,0.0281,0.228,0.0,0.0994,0.465,102.0


In [None]:
temp_df.to_csv(path_extra+'test.csv')

In [None]:
pd.concat([temp_df, temp_df], axis=1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1.1,2.1,3.1,4.1,5.1,6.1,7.1,8.1,9.1,10
0,0.738,0.482,6,-8.917,0,0.0402,0.0266,0.621,0.106,0.352,...,0.482,6,-8.917,0,0.0402,0.0266,0.621,0.106,0.352,93.457
1,0.545,0.675,6,-6.474,1,0.0279,0.00617,0.00197,0.209,0.162,...,0.675,6,-6.474,1,0.0279,0.00617,0.00197,0.209,0.162,124.97
2,0.432,0.358,2,-9.97,1,0.0309,0.815,0.0103,0.192,0.166,...,0.358,2,-9.97,1,0.0309,0.815,0.0103,0.192,0.166,81.119
3,0.793,0.442,8,-11.293,0,0.0564,0.236,0.00163,0.0662,0.833,...,0.442,8,-11.293,0,0.0564,0.236,0.00163,0.0662,0.833,115.995
