#### Instructions 


To move forward with the project, you need to create a collection of songs with their audio features - as large as possible! 

These are the songs that we will cluster. And, later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster.
The more songs you have, the more accurate and diverse recommendations you'll be able to give. Although... you might want to make sure the collected songs are "curated" in a certain way. Try to find playlists that are diverse but still meet certain standards.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

An idea for collecting as many songs as possible is to start with all the songs of a big, diverse playlist and then go to every artist present in the playlist and grab every song of every album of that artist. The number of songs you collect per playlist will grow very quickly this way!
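
The artist-by-artist expansion described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `collect_artist_track_ids` is a hypothetical helper name, and `client` is assumed to be an authenticated `spotipy.Spotify` instance whose `artist_albums`, `album_tracks` and `next` methods are used to page through the results.

```python
def collect_artist_track_ids(client, artist_id):
    """Return the unique track IDs found on all albums of one artist.

    Assumption: `client` behaves like spotipy.Spotify, i.e. it offers
    artist_albums(), album_tracks() and next() for paginated results.
    """
    track_ids = set()

    # Page through the artist's albums.
    albums = client.artist_albums(artist_id, limit=50)
    while albums:
        for album in albums['items']:
            # Page through the tracks of the current album.
            tracks = client.album_tracks(album['id'], limit=50)
            while tracks:
                for track in tracks['items']:
                    track_ids.add(track['id'])
                tracks = client.next(tracks) if tracks['next'] else None
        albums = client.next(albums) if albums['next'] else None

    # A set removes tracks that appear on several albums.
    return sorted(track_ids)
```

Calling this once per artist ID from the playlist and combining the results would give the enlarged collection. Every album costs at least one request, so the run time grows with the number of artists.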

In [38]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import math

In [2]:
# Initialize Spotipy with client credentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="af3a4e21d9974f798b0ddef081728f2b",
                                                           client_secret="99a65d20eff04d64bcf24b11824dffc4"))

#### I am using one of my own playlists since I consider it highly curated ;) and it is rather diverse in audio features.

In [55]:
playlist = sp.user_playlist_tracks("spotify", "4PuzKqMQjuIvn3Arkgaek2")

In [56]:
results = sp.user_playlist(user=None, playlist_id="4PuzKqMQjuIvn3Arkgaek2", fields="name")
results

{'name': 'WorldBeatz'}

In [5]:
def get_playlist_tracks(username, playlist_id):
    
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    
    return tracks

tracks = get_playlist_tracks("spotify", "4PuzKqMQjuIvn3Arkgaek2")

In [13]:
playlist['total']

116

#### Getting the artist names, track names and track IDs as three lists.

In [17]:
def get_ids_and_names_from_playlist(playlist_id):
    tracks_from_playlist = get_playlist_tracks("spotify", playlist_id)
    
    artists = []
    track_ids = []
    track_names = []
    
    for track in tracks_from_playlist:
        track_artists = track['track']['artists']
        track_id = track['track']['id']
        track_name = track['track']['name']
        
        for artist in track_artists:
            artist_name = artist['name']
            artists.append(artist_name)
        
        track_ids.append(track_id)
        track_names.append(track_name)
    
    return artists, track_ids, track_names

In [57]:
artists, track_ids, track_names = get_ids_and_names_from_playlist("4PuzKqMQjuIvn3Arkgaek2")

#### Let's check the length of each list. There might be more artists since some songs are associated with more than one artist registered on Spotify.

In [19]:
print("Number of artists:", len(artists))  
print("Number of track IDs:", len(track_ids)) 
print("Number of track names:", len(track_names)) 

Number of artists: 146
Number of track IDs: 116
Number of track names: 116


#### I want to have the column `Artist` in my final dataframe, so every artist of a track gets its own row, with the track ID and track name repeated.

In [48]:
def get_ids_and_names_from_playlist(playlist_id):
    tracks_from_playlist = get_playlist_tracks("spotify", playlist_id)
    
    artists = []
    track_ids = []
    track_names = []
    
    for track in tracks_from_playlist:
        track_artists = track['track']['artists']
        track_id = track['track']['id']
        track_name = track['track']['name']
        
        for artist in track_artists:
            artist_name = artist['name']
            artists.append(artist_name)
            track_ids.append(track_id)
            track_names.append(track_name)
    
    return artists, track_ids, track_names

artists, track_ids, track_names = get_ids_and_names_from_playlist("4PuzKqMQjuIvn3Arkgaek2")
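
To make the one-row-per-artist idea concrete, here is a small sketch on a hand-built item that mimics the structure the function reads (`item['track']['artists']`, `['id']`, `['name']`). The sample uses the track "Ballare" from this playlist, which lists two artists; `sample_items` and `rows` are names made up for this illustration.

```python
# Hand-built stand-in for one element of the playlist's 'items' list.
sample_items = [
    {
        'track': {
            'id': '7qDYeY8zWoD7RsOLSk0cIy',
            'name': 'Ballare',
            'artists': [{'name': 'Crucchi Gang'}, {'name': 'Clueso'}],
        }
    }
]

# One (artist, track name, track ID) tuple per artist of each track.
rows = [
    (artist['name'], item['track']['name'], item['track']['id'])
    for item in sample_items
    for artist in item['track']['artists']
]

for row in rows:
    print(row)
```

The single track yields two rows that share the same track ID, which is why the three lists now have equal length.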

In [49]:
df1 = pd.DataFrame({'Artist': artists, 'Song': track_names, 'Song ID': track_ids})
df1

Unnamed: 0,Artist,Song,Song ID
0,Bitori,Munana,2r3OCxXToV0DoXi0Cwf5LZ
1,Orchestra Baobab,Sey,7MmuoqMXkyUvzkUtR6pH5N
2,Amr Diab,Qusad Einy,3ZLz817NgIAr4IxZxvd9hU
3,Magic System,1er Gaou,0LpNLPdGvZLATY6JgdbQzk
4,Jul De Grenelle,Ya Ghaly - Danai Remix,3UhMrPQkV70OWjhRPZnqi9
...,...,...,...
155,The Notwist,Exit Strategy To Myself,5yOT1pfZeSCaJXYjPdWNIK
156,Camélia Jordana,Femmes,2ciDUdQR3JoUnjFJpdF9Cg
157,Crucchi Gang,Ballare,7qDYeY8zWoD7RsOLSk0cIy
158,Clueso,Ballare,7qDYeY8zWoD7RsOLSk0cIy


#### To get the audio features of more than 100 tracks, I split the track IDs into batches of 100 and append the results. (With the help of ChatGPT)

In [50]:
track_ids = track_ids[:160]

# Split track IDs into batches of 100
batch_size = 100
num_batches = math.ceil(len(track_ids) / batch_size)
audio_features = []

for i in range(num_batches):
    start = i * batch_size
    end = (i + 1) * batch_size
    batch_track_ids = track_ids[start:end]
    
    # Request audio features for the current batch of track IDs
    batch_audio_features = sp.audio_features(batch_track_ids)
    audio_features.extend(batch_audio_features)
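
The batching arithmetic can be checked without calling the API. The sketch below uses made-up placeholder IDs and a hypothetical `chunked` helper; with 159 IDs and a batch size of 100 it yields one batch of 100 and one batch of 59.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs, only used to illustrate the slicing.
dummy_ids = [f"id_{n}" for n in range(159)]

batch_lengths = [len(batch) for batch in chunked(dummy_ids, 100)]
print(batch_lengths)
```

In the loop above, each such batch is what gets passed to `sp.audio_features`.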

In [51]:
df2 = pd.json_normalize(audio_features)
df2

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.867,0.6870,6,-9.804,1,0.0942,0.0306,0.040900,0.1130,0.847,88.614,audio_features,2r3OCxXToV0DoXi0Cwf5LZ,spotify:track:2r3OCxXToV0DoXi0Cwf5LZ,https://api.spotify.com/v1/tracks/2r3OCxXToV0D...,https://api.spotify.com/v1/audio-analysis/2r3O...,258777,4
1,0.595,0.6670,8,-5.007,0,0.0305,0.3730,0.003050,0.2020,0.938,97.790,audio_features,7MmuoqMXkyUvzkUtR6pH5N,spotify:track:7MmuoqMXkyUvzkUtR6pH5N,https://api.spotify.com/v1/tracks/7MmuoqMXkyUv...,https://api.spotify.com/v1/audio-analysis/7Mmu...,274840,4
2,0.657,0.4920,0,-9.375,0,0.0382,0.4250,0.000018,0.3650,0.409,77.887,audio_features,3ZLz817NgIAr4IxZxvd9hU,spotify:track:3ZLz817NgIAr4IxZxvd9hU,https://api.spotify.com/v1/tracks/3ZLz817NgIAr...,https://api.spotify.com/v1/audio-analysis/3ZLz...,264255,4
3,0.850,0.9250,7,-4.209,0,0.0862,0.1500,0.000000,0.0600,0.882,119.042,audio_features,0LpNLPdGvZLATY6JgdbQzk,spotify:track:0LpNLPdGvZLATY6JgdbQzk,https://api.spotify.com/v1/tracks/0LpNLPdGvZLA...,https://api.spotify.com/v1/audio-analysis/0LpN...,294200,4
4,0.780,0.6080,7,-11.356,0,0.0533,0.5550,0.200000,0.1060,0.664,129.964,audio_features,3UhMrPQkV70OWjhRPZnqi9,spotify:track:3UhMrPQkV70OWjhRPZnqi9,https://api.spotify.com/v1/tracks/3UhMrPQkV70O...,https://api.spotify.com/v1/audio-analysis/3UhM...,223304,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,0.349,0.8920,2,-1.919,1,0.0435,0.0146,0.828000,0.0743,0.319,164.932,audio_features,5yOT1pfZeSCaJXYjPdWNIK,spotify:track:5yOT1pfZeSCaJXYjPdWNIK,https://api.spotify.com/v1/tracks/5yOT1pfZeSCa...,https://api.spotify.com/v1/audio-analysis/5yOT...,188667,4
156,0.771,0.5850,4,-6.498,0,0.0571,0.1110,0.012200,0.1590,0.464,97.065,audio_features,2ciDUdQR3JoUnjFJpdF9Cg,spotify:track:2ciDUdQR3JoUnjFJpdF9Cg,https://api.spotify.com/v1/tracks/2ciDUdQR3JoU...,https://api.spotify.com/v1/audio-analysis/2ciD...,161880,4
157,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.1170,0.836,99.982,audio_features,7qDYeY8zWoD7RsOLSk0cIy,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4
158,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.1170,0.836,99.982,audio_features,7qDYeY8zWoD7RsOLSk0cIy,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4


In [52]:
df2 = df2.rename(columns=lambda x: x.capitalize())
df2 = df2.rename(columns={'Id': 'Song ID'})

In [53]:
songs_and_features = pd.merge(df1, df2, on="Song ID", how="outer")
songs_and_features

Unnamed: 0,Artist,Song,Song ID,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Type,Uri,Track_href,Analysis_url,Duration_ms,Time_signature
0,Bitori,Munana,2r3OCxXToV0DoXi0Cwf5LZ,0.867,0.6870,6,-9.804,1,0.0942,0.0306,0.040900,0.113,0.847,88.614,audio_features,spotify:track:2r3OCxXToV0DoXi0Cwf5LZ,https://api.spotify.com/v1/tracks/2r3OCxXToV0D...,https://api.spotify.com/v1/audio-analysis/2r3O...,258777,4
1,Orchestra Baobab,Sey,7MmuoqMXkyUvzkUtR6pH5N,0.595,0.6670,8,-5.007,0,0.0305,0.3730,0.003050,0.202,0.938,97.790,audio_features,spotify:track:7MmuoqMXkyUvzkUtR6pH5N,https://api.spotify.com/v1/tracks/7MmuoqMXkyUv...,https://api.spotify.com/v1/audio-analysis/7Mmu...,274840,4
2,Amr Diab,Qusad Einy,3ZLz817NgIAr4IxZxvd9hU,0.657,0.4920,0,-9.375,0,0.0382,0.4250,0.000018,0.365,0.409,77.887,audio_features,spotify:track:3ZLz817NgIAr4IxZxvd9hU,https://api.spotify.com/v1/tracks/3ZLz817NgIAr...,https://api.spotify.com/v1/audio-analysis/3ZLz...,264255,4
3,Magic System,1er Gaou,0LpNLPdGvZLATY6JgdbQzk,0.850,0.9250,7,-4.209,0,0.0862,0.1500,0.000000,0.060,0.882,119.042,audio_features,spotify:track:0LpNLPdGvZLATY6JgdbQzk,https://api.spotify.com/v1/tracks/0LpNLPdGvZLA...,https://api.spotify.com/v1/audio-analysis/0LpN...,294200,4
4,Jul De Grenelle,Ya Ghaly - Danai Remix,3UhMrPQkV70OWjhRPZnqi9,0.780,0.6080,7,-11.356,0,0.0533,0.5550,0.200000,0.106,0.664,129.964,audio_features,spotify:track:3UhMrPQkV70OWjhRPZnqi9,https://api.spotify.com/v1/tracks/3UhMrPQkV70O...,https://api.spotify.com/v1/audio-analysis/3UhM...,223304,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,Crucchi Gang,Ballare,7qDYeY8zWoD7RsOLSk0cIy,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.117,0.836,99.982,audio_features,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4
280,Crucchi Gang,Ballare,7qDYeY8zWoD7RsOLSk0cIy,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.117,0.836,99.982,audio_features,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4
281,Clueso,Ballare,7qDYeY8zWoD7RsOLSk0cIy,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.117,0.836,99.982,audio_features,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4
282,Clueso,Ballare,7qDYeY8zWoD7RsOLSk0cIy,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.117,0.836,99.982,audio_features,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4


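#### There are many duplicates now, as expected: for tracks with several artists the `Song ID` is repeated in both dataframes, so the merge pairs every repetition with every other. I want every song only once, and I don't care which of the collaborating artists is kept, since the audio features belong to the song, not the artist. Hence I drop duplicates of `Song ID`.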

In [54]:
songs_and_features = songs_and_features.drop_duplicates(subset="Song ID")
songs_and_features

Unnamed: 0,Artist,Song,Song ID,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Type,Uri,Track_href,Analysis_url,Duration_ms,Time_signature
0,Bitori,Munana,2r3OCxXToV0DoXi0Cwf5LZ,0.867,0.6870,6,-9.804,1,0.0942,0.0306,0.040900,0.1130,0.847,88.614,audio_features,spotify:track:2r3OCxXToV0DoXi0Cwf5LZ,https://api.spotify.com/v1/tracks/2r3OCxXToV0D...,https://api.spotify.com/v1/audio-analysis/2r3O...,258777,4
1,Orchestra Baobab,Sey,7MmuoqMXkyUvzkUtR6pH5N,0.595,0.6670,8,-5.007,0,0.0305,0.3730,0.003050,0.2020,0.938,97.790,audio_features,spotify:track:7MmuoqMXkyUvzkUtR6pH5N,https://api.spotify.com/v1/tracks/7MmuoqMXkyUv...,https://api.spotify.com/v1/audio-analysis/7Mmu...,274840,4
2,Amr Diab,Qusad Einy,3ZLz817NgIAr4IxZxvd9hU,0.657,0.4920,0,-9.375,0,0.0382,0.4250,0.000018,0.3650,0.409,77.887,audio_features,spotify:track:3ZLz817NgIAr4IxZxvd9hU,https://api.spotify.com/v1/tracks/3ZLz817NgIAr...,https://api.spotify.com/v1/audio-analysis/3ZLz...,264255,4
3,Magic System,1er Gaou,0LpNLPdGvZLATY6JgdbQzk,0.850,0.9250,7,-4.209,0,0.0862,0.1500,0.000000,0.0600,0.882,119.042,audio_features,spotify:track:0LpNLPdGvZLATY6JgdbQzk,https://api.spotify.com/v1/tracks/0LpNLPdGvZLA...,https://api.spotify.com/v1/audio-analysis/0LpN...,294200,4
4,Jul De Grenelle,Ya Ghaly - Danai Remix,3UhMrPQkV70OWjhRPZnqi9,0.780,0.6080,7,-11.356,0,0.0533,0.5550,0.200000,0.1060,0.664,129.964,audio_features,spotify:track:3UhMrPQkV70OWjhRPZnqi9,https://api.spotify.com/v1/tracks/3UhMrPQkV70O...,https://api.spotify.com/v1/audio-analysis/3UhM...,223304,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252,A.R. Rahman,The Humma Song,7heMX7gyHP0mhTlNgd7Lxd,0.721,0.6500,0,-6.950,0,0.1350,0.0115,0.000140,0.3400,0.392,130.487,audio_features,spotify:track:7heMX7gyHP0mhTlNgd7Lxd,https://api.spotify.com/v1/tracks/7heMX7gyHP0m...,https://api.spotify.com/v1/audio-analysis/7heM...,179658,5
277,The Notwist,Exit Strategy To Myself,5yOT1pfZeSCaJXYjPdWNIK,0.349,0.8920,2,-1.919,1,0.0435,0.0146,0.828000,0.0743,0.319,164.932,audio_features,spotify:track:5yOT1pfZeSCaJXYjPdWNIK,https://api.spotify.com/v1/tracks/5yOT1pfZeSCa...,https://api.spotify.com/v1/audio-analysis/5yOT...,188667,4
278,Camélia Jordana,Femmes,2ciDUdQR3JoUnjFJpdF9Cg,0.771,0.5850,4,-6.498,0,0.0571,0.1110,0.012200,0.1590,0.464,97.065,audio_features,spotify:track:2ciDUdQR3JoUnjFJpdF9Cg,https://api.spotify.com/v1/tracks/2ciDUdQR3JoU...,https://api.spotify.com/v1/audio-analysis/2ciD...,161880,4
279,Crucchi Gang,Ballare,7qDYeY8zWoD7RsOLSk0cIy,0.766,0.6540,9,-7.861,0,0.0619,0.5320,0.000042,0.1170,0.836,99.982,audio_features,spotify:track:7qDYeY8zWoD7RsOLSk0cIy,https://api.spotify.com/v1/tracks/7qDYeY8zWoD7...,https://api.spotify.com/v1/audio-analysis/7qDY...,208601,4
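
Why the merge returned more rows than either input can be reproduced on a toy example. This is a sketch with made-up IDs (`a`, `b`) and hypothetical frame names `left` and `right`: when a key is repeated in both frames, `pd.merge` returns every pairing of the repeated rows. Dropping duplicate keys from the feature frame before merging would be one way to avoid the blow-up while still keeping one row per artist.

```python
import pandas as pd

# Track 'b' has two artists, so its ID appears twice in both frames.
left = pd.DataFrame({'Artist': ['X', 'Y', 'Z'], 'Song ID': ['a', 'b', 'b']})
right = pd.DataFrame({'Song ID': ['a', 'b', 'b'], 'Tempo': [90.0, 120.0, 120.0]})

# Many-to-many merge: 'b' contributes 2 x 2 = 4 rows, 5 rows in total.
blown_up = pd.merge(left, right, on='Song ID', how='outer')

# De-duplicating the feature frame first keeps one row per (artist, song).
deduped = pd.merge(left, right.drop_duplicates(subset='Song ID'),
                   on='Song ID', how='outer')

print(len(blown_up), len(deduped))
```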
