![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Unsupervised learning intro

#### Instructions 


It's the moment to perform clustering on the songs you collected. Remember that the ultimate goal of this little project is to improve the recommendations of artists. Clustering the songs will allow the recommendation system to limit the scope of the recommendations to only songs that belong to the same cluster - songs with similar audio features.

The experiments you did with the `Spotify` API and the Billboard web scraping will allow you to create a pipeline such that when the user enters a song, you:

1. Check whether or not the song is in the Billboard Hot 200.
2. Collect the audio features from the `Spotify` API.

After that, you want to send the `Spotify` audio features of the submitted song to the clustering model, which should return a cluster number.

We want to have as many songs as possible to create the clustering model, so we will add the songs you collected to a bigger dataset available on Kaggle containing 160 thousand songs.


In [1]:
# 📚 Basic libraries
import pandas as pd # dataframe managment
import warnings
warnings.filterwarnings('ignore') # ignore warnings

# 🧩 New libraries
import spotipy # Spotify API for developers
from spotipy.oauth2 import SpotifyClientCredentials # Spotify user credentiials

# Machine Learning
from sklearn.cluster import KMeans # To make the clusters
from sklearn.preprocessing import StandardScaler # data scaler
from sklearn.metrics import pairwise_distances_argmin_min # sklearn metrics

In [2]:
# Specific function for this lab.~ From Xisca's script
def recommend_song():
    
    # get song id
    song_name = input('Choose a song: ')
    results = sp.search(q=f'track:{song_name}', limit=1)
    track_id = results['tracks']['items'][0]['id']
    
    # get song features with the obtained id
    audio_features = sp.audio_features(track_id)
    
    # create dataframe
    df_ = pd.DataFrame(audio_features)
    new_features = df_[x.columns]
    
    # scale features
    scaled_x = scaler.transform(new_features)
    
    # predict cluster
    cluster = kmeans.predict(scaled_x)
    
    # filter dataset to predicted cluster
    filtered_df = scaled_df[scaled_df['cluster'] == cluster[0]][x.columns]
    
    # get closest song from filtered dataset
    closest, _ = pairwise_distances_argmin_min(scaled_x, filtered_df)
    
    # return it in a readable way
    print('\n [You might like this song:]')
    return ' - '.join([scaled_df.loc[closest]['song'].values[0], scaled_df.loc[closest]['artist'].values[0]])

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

<h3 style="color: #008080;">API acces</h3>

In [3]:
# Don't copy this, those are my credientials :P
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="d565de5f22d343be949f165e317b3448",
                                                           client_secret="9ec0e5075f784243b25dedcf5d91f149"))

In [4]:
df = pd.read_csv('playlist.csv') # extracting my old playlist df
df.head(3)

Unnamed: 0,artist,album,song,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,Marea,Coces al Aire 1997-2007,Corazon de mimbre,0.521,0.928,9,-3.548,0,0.0837,0.0235,0.0,0.0961,0.323,90.555
1,Passion Pit,Gossamer,Take a Walk,0.566,0.755,11,-5.526,1,0.0368,0.0338,0.0,0.315,0.445,101.006
2,Fleetwood Mac,Tango In the Night (Deluxe Edition),Everywhere - 2017 Remaster,0.73,0.487,4,-10.991,1,0.0303,0.258,0.01,0.0852,0.731,114.965


In [5]:
# We will make the previous dataframe bigger with another 2 more playlist:

<h3 style="color: #00BFFF;">Playlist track</h3>

In [6]:
# Selecting the playlists; "Noodles Quarantine Playlist"
playlist = sp.user_playlist_tracks("spotify", "7uot6g7fHhwHEgRByIC94m")

<h3 style="color: #008080;">Getting artists from the playlist</h3>

In [7]:
artists = []
for i in range(len(playlist["items"])):
    artist = playlist["items"][i]["track"]["artists"][0]["name"]
    if artist not in artists:
        artists.append(artist)

<h3 style="color: #008080;">Getting all albums for each artist</h3>

In [8]:
albums = []
for i in range(len(playlist["items"])):
    album = playlist["items"][i]["track"]["album"]["name"]
    if album not in albums:
        albums.append(album)

<h3 style="color: #008080;">Getting all songs from the playlist</h3>

In [9]:
tracks_title = []
for i in range(0,len(playlist["items"])):
    title = playlist["items"][i]["track"]["name"]
    tracks_title.append(title)

<h3 style="color: #008080;">Getting song features from the playlist</h3>

In [10]:
tracks = playlist['items']

while playlist['next']:
    playlist = sp.next(playlist)
    tracks.extend(playlist['items'])

track_ids = [track['track']['id'] for track in tracks]

In [11]:
features = sp.audio_features(track_ids[:100]) # getting features for each track
features_df = pd.DataFrame(features) # making it a df

features_df = features_df.drop(["type", "id", "uri", "track_href", "analysis_url", "duration_ms", "time_signature"], axis=1) # dropping some ugly columns

<h3 style="color: #008080;">Dictionaries rule</h3>

In [12]:
# Now, we will make a dictionary with our lists, assigning the future column name as key
ids = {
    'artist': artists,
    'album': albums,
    'song': tracks_title,
}

ids = pd.DataFrame(ids)

<h3 style="color: #008080;">Pandas rule the world -> making it a dataframe</h3>

In [13]:
df2 = pd.concat([ids, features_df], axis=1)

<h3 style="color: #00BFFF;">Playlist track</h3>

In [14]:
# Selecting the playlists; "Biopsias"
playlist2 = sp.user_playlist_tracks("spotify", "7nnB4G2MgfyucLZlxctqjw")

<h3 style="color: #008080;">Getting artists from the playlist</h3>

In [15]:
artists2 = []
for i in range(len(playlist2["items"])):
    artist = playlist2["items"][i]["track"]["artists"][0]["name"]
    if artist not in artists:
        artists2.append(artist)

<h3 style="color: #008080;">Getting all albums for each artist</h3>

In [16]:
albums2 = []
for i in range(len(playlist2["items"])):
    album = playlist2["items"][i]["track"]["album"]["name"]
    if album not in albums:
        albums2.append(album)

<h3 style="color: #008080;">Getting all songs from the playlist</h3>

In [17]:
tracks_title2 = []
for i in range(0,len(playlist2["items"])):
    title = playlist2["items"][i]["track"]["name"]
    tracks_title2.append(title)

<h3 style="color: #008080;">Getting song features from the playlist</h3>

In [18]:
tracks2 = playlist2['items']

while playlist2['next']:
    playlist2 = sp.next(playlist2)
    tracks2.extend(playlist2['items'])

track_ids2 = [track['track']['id'] for track in tracks2]

In [19]:
features2 = sp.audio_features(track_ids2[:100]) # getting features for each track
features_df2 = pd.DataFrame(features2) # making it a df

features_df2 = features_df2.drop(["type", "id", "uri", "track_href", "analysis_url", "duration_ms", "time_signature"], axis=1) # dropping some ugly columns

<h3 style="color: #008080;">Dictionaries rule</h3>

In [20]:
ids2 = {
    'artist': artists,
    'album': albums,
    'song': tracks_title,
}

ids2 = pd.DataFrame(ids2)

<h3 style="color: #008080;">Pandas rule the world -> making it a dataframe</h3>

In [21]:
df3 = pd.concat([ids2, features_df2], axis=1)

<h3 style="color: #00BFFF;">Playlist track (from Kaggle dataset)</h3>

In [22]:
df4 = pd.read_csv('data.csv') # extracting the songs from kaggle datasets
df4.head(3)

Unnamed: 0,id,name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit
0,0gNNToCW3qjabgTyBSjt3H,!Que Vida! - Mono Version,['Love'],220560,11/1/66,1966,0.525,0.6,0.54,0.00305,0.1,-11.803,0.0328,125.898,0.547,1,9,26,0
1,0tMgFpOrXZR6irEOLNWwJL,"""40""",['U2'],157840,2/28/83,1983,0.228,0.368,0.48,0.707,0.159,-11.605,0.0306,150.166,0.338,1,8,21,0
2,2ZywW3VyVx6rrlrX75n3JB,"""40"" - Live",['U2'],226200,8/20/83,1983,0.0998,0.272,0.684,0.0145,0.946,-9.728,0.0505,143.079,0.279,1,8,41,0


In [23]:
df4.columns # getting it's columns to change it to t

Index(['id', 'name', 'artists', 'duration_ms', 'release_date', 'year',
       'acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'mode',
       'key', 'popularity', 'explicit'],
      dtype='object')

In [24]:
df4 = df4[['artists', 'name', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
df4 = df4.rename(columns={'name': 'song', 'artists': 'artist'})

In [25]:
df4.head(3)

Unnamed: 0,artist,song,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,['Love'],!Que Vida! - Mono Version,0.6,0.54,9,-11.803,1,0.0328,0.525,0.00305,0.1,0.547,125.898
1,['U2'],"""40""",0.368,0.48,8,-11.605,1,0.0306,0.228,0.707,0.159,0.338,150.166
2,['U2'],"""40"" - Live",0.272,0.684,8,-9.728,1,0.0505,0.0998,0.0145,0.946,0.279,143.079


<h3 style="color: #00BFFF;">Pandas rule the world -> making it a big dataframe</h3>

In [26]:
data = pd.concat([df, df2, df3, df4])
data.shape

(170022, 14)

<h1 style="color: #00BFFF;">02 | Data Pre-processing</h1>

<h3 style="color: #008080;">Scaling</h3>

In [27]:
# selecting numericals
x = data[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
x.head(3)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0.521,0.928,9,-3.548,0,0.0837,0.0235,0.0,0.0961,0.323,90.555
1,0.566,0.755,11,-5.526,1,0.0368,0.0338,0.0,0.315,0.445,101.006
2,0.73,0.487,4,-10.991,1,0.0303,0.258,0.01,0.0852,0.731,114.965


In [28]:
# standarize the data
scaler = StandardScaler()
x_prep = scaler.fit_transform(x)

<h3 style="color: #008080;">Train and Predict</h3>

In [29]:
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(x_prep)
clusters = kmeans.predict(x_prep)

<h3 style="color: #008080;">New master dataframe</h3>

In [30]:
scaled_df = pd.DataFrame(x_prep, columns=x.columns)
scaled_df = scaled_df.reset_index(drop=True)

data = data.reset_index(drop=True)


scaled_df['song'] = data['song']
scaled_df['artist'] = data['artist']
scaled_df['cluster'] = clusters
scaled_df.head(3)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,artist,cluster
0,-0.09773,1.642075,1.080778,1.379646,-1.558937,-0.069045,-1.246305,-0.523426,-0.625565,-0.796846,-0.859234,Corazon de mimbre,Marea,1
1,0.158926,0.99526,1.649709,1.030621,0.641463,-0.381925,-1.218961,-0.523426,0.61265,-0.331871,-0.519101,Take a Walk,Passion Pit,1
2,1.094296,-0.006744,-0.341549,0.0663,0.641463,-0.425288,-0.623754,-0.491093,-0.687222,0.758152,-0.064799,Everywhere - 2017 Remaster,Fleetwood Mac,1


<h3 style="color: #008080;">Dropping duplicates</h3>

In [31]:
scaled_df = scaled_df.drop_duplicates()

<h1 style="color: #00BFFF;">03 | Reporting</h1>

In [32]:
recommend_song()

Choose a song:  Toxic



 [You might like this song:]


'Everything Dies - Mariachi El Bronx'