<a href="https://colab.research.google.com/github/izzahalzahri/musicrecommenderp2/blob/main/spotify_recommendation_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dependencies

In [2]:
import pandas as pd
import numpy as np
import json
import re
import sys
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt


import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util
from IPython.display import Image

import warnings
warnings.filterwarnings("ignore")

In [3]:
%matplotlib inline

In [4]:
#If you're not familiar with this, save it! It makes using Jupyter Notebook on laptops much easier
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

## Summary:

## 1. Data Exploration/Preparation

Download datasets here:
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [5]:
spotify_df = pd.read_csv('/content/data.csv')

In [6]:
spotify_df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,['Dennis Day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


Observations:
1. This data is at a **song level**
2. There are many numerical values that I'll be able to use to compare songs (liveness, tempo, valence, etc.)
3. Release date will be useful, but I'll need to create an OHE variable for release date in 5-year increments
4. Similar to 3, I'll need to create OHE variables for the popularity. I'll use 5-point increments here
5. There is nothing here related to the genre of the song, which would be useful. This data alone won't help us find relevant content since this is a content-based recommendation system. Fortunately, there is a `data_w_genres.csv` file that should have some useful information

In [7]:
data_w_genre = pd.read_csv('/content/data_w_genres.csv')
data_w_genre.head()

Unnamed: 0,genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
0,['show tunes'],"""Cats"" 1981 Original London Cast",0.590111,0.467222,250318.555556,0.394003,0.0114,0.290833,-14.448,0.210389,117.518111,0.3895,38.333333,5,1,9
1,[],"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,30.576923,5,1,26
2,[],"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.857143,0,1,7
3,[],"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.851852,0,1,27
4,[],"""Joseph And The Amazing Technicolor Dreamcoat""...",0.510714,0.467143,270436.142857,0.488286,0.0094,0.195,-10.236714,0.098543,122.835857,0.482286,43.0,5,1,7


Observations:
1. This data is at an **artist level**
2. There are similar continuous variables to those in our initial dataset, but I won't use them. I'll just use the values in the previous dataset.
3. The genres are going to be really useful here, and I'll need to use them moving forward. The genre column appears to be in a list format, but my past experience tells me that it's likely not. Let's investigate this further.

In [8]:
data_w_genre.dtypes

genres               object
artists              object
acousticness        float64
danceability        float64
duration_ms         float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
speechiness         float64
tempo               float64
valence             float64
popularity          float64
key                   int64
mode                  int64
count                 int64
dtype: object

This checks whether or not `genres` is actually in a list format:

In [9]:
data_w_genre['genres'].values[0]

"['show tunes']"

In [10]:
#To check if this is actually a list, let me index it and see what it returns
data_w_genre['genres'].values[0][0]

'['

As we can see, it's actually a string that looks like a list. Based on the example above, I'm going to put together a regex statement that extracts each genre, replaces spaces with underscores, and collects the results in a real list.

In [11]:
data_w_genre['genres_upd'] = data_w_genre['genres'].apply(lambda x: [re.sub(' ','_',i) for i in re.findall(r"'([^']*)'", x)])

In [12]:
data_w_genre['genres_upd'].values[0][0]

'show_tunes'

Voila, now we have the genre column in a format we can actually use. If you go down, you'll see how we use it.

Now, if you recall, this data is at an artist level and the previous dataset is at a song level. So here's what we need to do:
1. Explode the artists column in the previous dataset so each artist within a song has their own row
2. Merge `data_w_genre` onto the exploded dataset from Step 1 so that the previous dataset is now enriched with the genre data

Before I go further, let's complete these two steps.
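As a rough sketch of these two steps on toy data (the frames `songs` and `artist_genres` and all of their values are made up for illustration, not taken from the dataset), `explode` gives each artist its own row and a left merge attaches the artist-level genres:

```python
import pandas as pd

# Toy song-level data: one row per song, artists stored as a real list.
songs = pd.DataFrame({
    "id": ["s1", "s2"],
    "artists_upd": [["Artist A", "Artist B"], ["Artist C"]],
})

# Toy artist-level data: one row per artist with a list of genres.
artist_genres = pd.DataFrame({
    "artists": ["Artist A", "Artist B"],
    "genres_upd": [["pop"], ["rock", "pop"]],
})

# Step 1: one row per (song, artist) pair.
exploded = songs.explode("artists_upd")

# Step 2: a left merge keeps every pair, even artists with no genre record.
enriched = exploded.merge(
    artist_genres, how="left", left_on="artists_upd", right_on="artists"
)

# Artists missing from the genre table end up with NaN and can be filtered out.
matched = enriched[enriched["genres_upd"].notna()]
print(matched[["id", "artists_upd", "genres_upd"]])
```

In this toy case "Artist C" has no genre record, so song `s2` drops out of `matched`, which mirrors what the non-null filter does in the real merge.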

Step 1.
Similar to before, we will need to extract the artists from the string list.

In [13]:
spotify_df['artists_upd_v1'] = spotify_df['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))


In [14]:
spotify_df['artists'].values[0]

"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']"

In [15]:
spotify_df['artists_upd_v1'].values[0][0]

'Sergei Rachmaninoff'

This looks good, but did this work for every artist string format? Let's double check.

In [16]:
spotify_df[spotify_df['artists_upd_v1'].apply(lambda x: not x)].head(5)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,artists_upd_v1
143,0.3,1921,0.772,"[""Scarlet D'Carpio""]",0.56,249370,0.313,0,7b4eHImKQ51DYaQvNTdtEp,5e-06,6,0.115,-8.346,0,Himno Nacional del Perú,0,1921-09-23,0.0376,107.501,[]
234,0.902,1923,0.994,"[""King Oliver's Creole Jazz Band""]",0.708,194533,0.361,0,1xEEYhWxT4WhDQdxfPCT8D,0.883,0,0.103,-11.764,0,Snake Rag,20,1923,0.0441,105.695,[]
238,0.554,1923,0.996,"[""King Oliver's Creole Jazz Band""]",0.546,170827,0.189,0,3rauXVLOOM5BlxWqUcDpkg,0.908,0,0.339,-15.984,1,Chimes Blues,13,1923,0.0581,80.318,[]
244,0.319,1923,0.995,"[""Clarence Williams' Blue Five""]",0.52,197493,0.153,0,1UdqHVRFYMZKU2Q7xkLtYc,0.131,0,0.353,-14.042,1,Pickin' On Your Baby,11,1923,0.044,102.937,[]
249,0.753,1923,0.994,"[""King Oliver's Creole Jazz Band""]",0.359,187227,0.357,0,5SvyP1ZeJX1jA7AOZD08NA,0.819,3,0.29,-11.81,1,Tears,10,1923,0.0511,205.053,[]


So, it looks like it didn't catch all of them. You can quickly see why: these artists have an apostrophe in their name, so the name is enclosed in double quotes instead of single quotes. I'll write another regex to handle this and then combine the two.

In [17]:
spotify_df['artists_upd_v2'] = spotify_df['artists'].apply(lambda x: re.findall('\"(.*?)\"',x))
spotify_df['artists_upd'] = np.where(spotify_df['artists_upd_v1'].apply(lambda x: not x), spotify_df['artists_upd_v2'], spotify_df['artists_upd_v1'] )
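An alternative that avoids maintaining two patterns is to parse each string as a Python literal. This is only a minimal sketch, assuming every value in `artists` is a valid Python list literal (the samples shown suggest so, but I have not verified that for the full file); `parse_list` is a hypothetical helper, not part of this notebook.

```python
import ast

def parse_list(raw):
    """Parse a string such as "['a', 'b']" into a real list of strings."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed input: fall back to an empty list rather than failing.
        return []
    return list(value) if isinstance(value, (list, tuple)) else []

print(parse_list("['Sergei Rachmaninoff', 'James Levine']"))
print(parse_list("[\"King Oliver's Creole Jazz Band\"]"))
print(parse_list("[\"Lil' Flip\", 'Lea']"))
```

It could be applied with `spotify_df['artists'].apply(parse_list)`. Rows that mix both quote styles, like the third example, are the ones where a single-quote pattern can return fragments instead of names.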

In [18]:
#need to create my own song identifier because there are duplicates of the same song with different ids
spotify_df['artists_song'] = spotify_df.apply(lambda row: row['artists_upd'][0]+row['name'],axis = 1)

In [19]:
spotify_df.sort_values(['artists_song','release_date'], ascending = False, inplace = True)

In [20]:
spotify_df[spotify_df['name']=='Lover']

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,...,mode,name,popularity,release_date,speechiness,tempo,artists_upd_v1,artists_upd_v2,artists_upd,artists_song
44320,0.323,1955,0.793,['The Dave Brubeck Quartet'],0.531,307667,0.451,0,3hGmBRWRSKfAW8qLJ2to4Y,0.163,...,1,Lover,8,1955,0.0361,115.415,[The Dave Brubeck Quartet],[],[The Dave Brubeck Quartet],The Dave Brubeck QuartetLover
19564,0.453,2019,0.492,['Taylor Swift'],0.359,221307,0.543,0,1dGr1c8CrMLDpV6mPbImSI,1.6e-05,...,1,Lover,79,2019-08-23,0.0919,68.534,[Taylor Swift],[],[Taylor Swift],Taylor SwiftLover
111658,0.665,1954,0.834,['Tal Farlow Quartet'],0.527,246120,0.571,0,59Dk8xqFJVl4MyltHELqIO,0.559,...,1,Lover,3,1954-01-01,0.0473,80.701,[Tal Farlow Quartet],[],[Tal Farlow Quartet],Tal Farlow QuartetLover
128035,0.274,1956,0.641,['Stan Kenton'],0.393,153427,0.511,0,6sjsrOD8fgYg1K4oWrlXH3,0.337,...,1,Lover,7,1956-01-01,0.0454,90.661,[Stan Kenton],[],[Stan Kenton],Stan KentonLover
109183,0.423,1930,0.553,['Stan Kenton & His Orchestra'],0.394,172693,0.646,0,4UO72fGipUaKRRmhOoMaGB,0.00814,...,1,Lover,0,1930,0.139,89.035,[Stan Kenton & His Orchestra],[],[Stan Kenton & His Orchestra],Stan Kenton & His OrchestraLover
42693,0.755,1947,0.833,"['Roy Eldridge', 'Flip Phillips', 'Mel Tormé']",0.413,263616,0.813,0,1YCYUBnLPePpo51uY1uDzt,0.744,...,1,Lover,0,1947-10-17,0.145,133.695,"[Roy Eldridge, Flip Phillips, Mel Tormé]",[],"[Roy Eldridge, Flip Phillips, Mel Tormé]",Roy EldridgeLover
95658,0.506,1954,0.987,['Oscar Peterson'],0.515,434022,0.2,0,4HBR8Kw7eL3apDgBJFLtZ7,0.928,...,1,Lover,2,1954-09-08,0.0671,80.569,[Oscar Peterson],[],[Oscar Peterson],Oscar PetersonLover
61353,0.503,1951,0.964,['Max Miller'],0.54,142133,0.375,0,0D6VK5Hsf55YqN6bZrivw5,0.855,...,1,Lover,0,1951-07-01,0.0293,86.432,[Max Miller],[],[Max Miller],Max MillerLover
44385,0.566,1955,0.0073,['Les Paul'],0.308,168773,0.653,0,5nU3Hgb07jpc7AZb7SsQkM,0.0149,...,0,Lover,10,1955-01-01,0.036,169.558,[Les Paul],[],[Les Paul],Les PaulLover
143648,0.481,1957,0.577,"['Gerry Mulligan', 'Paul Desmond Quartet']",0.504,419000,0.128,0,359npfHgnSaldyFXRENr9u,0.000606,...,1,Lover,9,1957-12-01,0.0649,179.788,"[Gerry Mulligan, Paul Desmond Quartet]",[],"[Gerry Mulligan, Paul Desmond Quartet]",Gerry MulliganLover


In [21]:
spotify_df.drop_duplicates('artists_song',inplace = True)
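Because the frame is sorted in descending order before this call, and `drop_duplicates` keeps the first occurrence by default, the row that survives for each `artists_song` key is the one with the latest `release_date`. A small sketch on made-up rows (the ids and names below are placeholders):

```python
import pandas as pd

# Made-up rows: the same artist/song key appears twice with different ids.
toy = pd.DataFrame({
    "id": ["old_id", "new_id", "other_id"],
    "artists_song": ["Artist ASong X", "Artist ASong X", "Artist BSong Y"],
    "release_date": ["1999", "2005-06-01", "2010"],
})

# Descending sort puts the latest release first within each key.
# The dates are strings here, so they are compared lexicographically.
toy = toy.sort_values(["artists_song", "release_date"], ascending=False)

# keep="first" is the default; it is spelled out here for clarity.
deduped = toy.drop_duplicates("artists_song", keep="first")
print(deduped)
```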

In [22]:
spotify_df[spotify_df['name']=='Lover']

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,...,mode,name,popularity,release_date,speechiness,tempo,artists_upd_v1,artists_upd_v2,artists_upd,artists_song
44320,0.323,1955,0.793,['The Dave Brubeck Quartet'],0.531,307667,0.451,0,3hGmBRWRSKfAW8qLJ2to4Y,0.163,...,1,Lover,8,1955,0.0361,115.415,[The Dave Brubeck Quartet],[],[The Dave Brubeck Quartet],The Dave Brubeck QuartetLover
19564,0.453,2019,0.492,['Taylor Swift'],0.359,221307,0.543,0,1dGr1c8CrMLDpV6mPbImSI,1.6e-05,...,1,Lover,79,2019-08-23,0.0919,68.534,[Taylor Swift],[],[Taylor Swift],Taylor SwiftLover
111658,0.665,1954,0.834,['Tal Farlow Quartet'],0.527,246120,0.571,0,59Dk8xqFJVl4MyltHELqIO,0.559,...,1,Lover,3,1954-01-01,0.0473,80.701,[Tal Farlow Quartet],[],[Tal Farlow Quartet],Tal Farlow QuartetLover
128035,0.274,1956,0.641,['Stan Kenton'],0.393,153427,0.511,0,6sjsrOD8fgYg1K4oWrlXH3,0.337,...,1,Lover,7,1956-01-01,0.0454,90.661,[Stan Kenton],[],[Stan Kenton],Stan KentonLover
109183,0.423,1930,0.553,['Stan Kenton & His Orchestra'],0.394,172693,0.646,0,4UO72fGipUaKRRmhOoMaGB,0.00814,...,1,Lover,0,1930,0.139,89.035,[Stan Kenton & His Orchestra],[],[Stan Kenton & His Orchestra],Stan Kenton & His OrchestraLover
42693,0.755,1947,0.833,"['Roy Eldridge', 'Flip Phillips', 'Mel Tormé']",0.413,263616,0.813,0,1YCYUBnLPePpo51uY1uDzt,0.744,...,1,Lover,0,1947-10-17,0.145,133.695,"[Roy Eldridge, Flip Phillips, Mel Tormé]",[],"[Roy Eldridge, Flip Phillips, Mel Tormé]",Roy EldridgeLover
95658,0.506,1954,0.987,['Oscar Peterson'],0.515,434022,0.2,0,4HBR8Kw7eL3apDgBJFLtZ7,0.928,...,1,Lover,2,1954-09-08,0.0671,80.569,[Oscar Peterson],[],[Oscar Peterson],Oscar PetersonLover
61353,0.503,1951,0.964,['Max Miller'],0.54,142133,0.375,0,0D6VK5Hsf55YqN6bZrivw5,0.855,...,1,Lover,0,1951-07-01,0.0293,86.432,[Max Miller],[],[Max Miller],Max MillerLover
44385,0.566,1955,0.0073,['Les Paul'],0.308,168773,0.653,0,5nU3Hgb07jpc7AZb7SsQkM,0.0149,...,0,Lover,10,1955-01-01,0.036,169.558,[Les Paul],[],[Les Paul],Les PaulLover
143648,0.481,1957,0.577,"['Gerry Mulligan', 'Paul Desmond Quartet']",0.504,419000,0.128,0,359npfHgnSaldyFXRENr9u,0.000606,...,1,Lover,9,1957-12-01,0.0649,179.788,"[Gerry Mulligan, Paul Desmond Quartet]",[],"[Gerry Mulligan, Paul Desmond Quartet]",Gerry MulliganLover


Now I can explode this column and merge as I planned to in `Step 2`

In [23]:
artists_exploded = spotify_df[['artists_upd','id']].explode('artists_upd')

In [24]:
artists_exploded_enriched = artists_exploded.merge(data_w_genre, how = 'left', left_on = 'artists_upd',right_on = 'artists')
artists_exploded_enriched_nonnull = artists_exploded_enriched[~artists_exploded_enriched.genres_upd.isnull()]

In [25]:
artists_exploded_enriched_nonnull[artists_exploded_enriched_nonnull['id'] =='1dGr1c8CrMLDpV6mPbImSI']

Unnamed: 0,artists_upd,id,genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count,genres_upd
33856,Taylor Swift,1dGr1c8CrMLDpV6mPbImSI,"['dance pop', 'pop', 'pop dance', 'post-teen p...",Taylor Swift,0.225236,0.60248,237971.727273,0.615055,0.000423,0.147943,-6.481543,0.050815,122.845527,0.426964,59.177273,7.0,1.0,440.0,"[dance_pop, pop, pop_dance, post-teen_pop]"


Alright, we're almost there. Now we need to:
1. Group by the song `id` and essentially create lists of lists
2. Consolidate these lists and output the unique values

In [26]:
artists_genres_consolidated = artists_exploded_enriched_nonnull.groupby('id')['genres_upd'].apply(list).reset_index()

In [27]:
artists_genres_consolidated['consolidates_genre_lists'] = artists_genres_consolidated['genres_upd'].apply(lambda x: list(set(list(itertools.chain.from_iterable(x)))))
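To make the consolidation step concrete, here is a tiny standalone sketch with made-up genre lists for one song that has two artists. `sorted` is used only so the printed order is predictable; the notebook's `list(set(...))` leaves the order unspecified.

```python
import itertools

# Genre lists for one song with two artists (made-up values).
per_artist_genres = [["pop", "dance_pop"], ["pop", "rock"]]

# chain.from_iterable flattens one level; set() removes repeats.
flat = itertools.chain.from_iterable(per_artist_genres)
unique_genres = sorted(set(flat))

print(unique_genres)  # ['dance_pop', 'pop', 'rock']
```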

In [28]:
artists_genres_consolidated.head()

Unnamed: 0,id,genres_upd,consolidates_genre_lists
0,000G1xMMuwxNHmwVsBdtj1,"[[candy_pop, dance_rock, new_wave, new_wave_po...","[new_wave_pop, power_pop, rock, new_wave, perm..."
1,000GyYHG4uWmlXieKLij8u,"[[alternative_hip_hop, conscious_hip_hop, minn...","[conscious_hip_hop, alternative_hip_hop, pop_r..."
2,000Npgk5e2SgwGaIsN3ztv,"[[classic_bollywood, classic_pakistani_pop, fi...","[sufi, ghazal, classic_pakistani_pop, classic_..."
3,000ZxLGm7jDlWCHtcXSeBe,"[[boogie-woogie, piano_blues, ragtime, stride]]","[stride, ragtime, piano_blues, boogie-woogie]"
4,000jBcNljWTnyjB4YO7ojf,[[]],[]


In [29]:
spotify_df = spotify_df.merge(artists_genres_consolidated[['id','consolidates_genre_lists']], on = 'id',how = 'left')

## 2. Feature Engineering

### - Normalize float variables
### - OHE Year and Popularity Variables
### - Create TF-IDF features off of artist genres

In [30]:
spotify_df.tail()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,...,name,popularity,release_date,speechiness,tempo,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,consolidates_genre_lists
156602,0.768,1997,0.282,"[""Lil' Kim"", ""Lil' Cease""]",0.748,275947,0.693,0,2LP2uDQQ7eLMcUVE4aOpAV,0.0,...,Crush on You (feat. Lil' Cease) - Remix,56,1997-06-30,0.278,88.802,"[ Kim"", ""Lil]","[Lil' Kim, Lil' Cease]","[ Kim"", ""Lil]","Kim"", ""LilCrush on You (feat. Lil' Cease) - R...",
156603,0.792,2004,0.0248,"[""Lil' Flip"", 'Lea']",0.814,225173,0.387,1,4s0o8TJHfX9LLHa0umnOzT,0.0,...,Sunshine (feat. Lea),62,2004-03-30,0.0945,93.961,"[ Flip"", ]",[Lil' Flip],"[ Flip"", ]","Flip"", Sunshine (feat. Lea)",
156604,0.697,1999,0.0516,"[""Ol' Dirty Bastard"", 'Kelis', 'Rich Travali']",0.934,239547,0.459,1,6YYd5MLpu45J0uLrMdivF7,0.0,...,Got Your Money (feat. Kelis),66,1999,0.189,103.04,"[ Dirty Bastard"", , , ]",[Ol' Dirty Bastard],"[ Dirty Bastard"", , , ]","Dirty Bastard"", Got Your Money (feat. Kelis)",
156605,0.429,1994,0.0249,"[""World Class Wreckin' Cru"", ""Michel 'Le""]",0.715,351040,0.49,0,3hoiinUc5VA9xUEJID7R8V,0.00017,...,Turn Off The Lights - Rap,36,1994-04-06,0.0479,129.309,"[ Cru"", ""Michel ]","[World Class Wreckin' Cru, Michel 'Le]","[ Cru"", ""Michel ]","Cru"", ""Michel Turn Off The Lights - Rap",
156606,0.273,1996,0.0113,"[""Rappin' 4-Tay"", 'MC Breed', 'Too $hort']",0.897,337973,0.414,1,78859Af0fmA9VTlgnOHTAP,0.00011,...,Never Talk Down,35,1996,0.246,96.039,"[ 4-Tay"", , , ]",[Rappin' 4-Tay],"[ 4-Tay"", , , ]","4-Tay"", Never Talk Down",


In [31]:
spotify_df['year'] = spotify_df['release_date'].apply(lambda x: x.split('-')[0])

In [32]:
float_cols = spotify_df.dtypes[spotify_df.dtypes == 'float64'].index.values

In [33]:
ohe_cols = 'popularity'

In [34]:
spotify_df['popularity'].describe()

count    156607.000000
mean         31.307215
std          21.712234
min           0.000000
25%          11.000000
50%          33.000000
75%          48.000000
max         100.000000
Name: popularity, dtype: float64

In [35]:
# create 5 point buckets for popularity
spotify_df['popularity_red'] = spotify_df['popularity'].apply(lambda x: int(x/5))
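A quick worked example of the bucketing: since popularity runs from 0 to 100 (see the `describe()` output), integer division by 5 yields bucket indices 0 through 20, with a score of exactly 100 landing alone in bucket 20. The helper name `popularity_bucket` is just for illustration.

```python
def popularity_bucket(popularity):
    """Map a 0-100 popularity score to a 5-point bucket index."""
    return int(popularity / 5)

for score in (0, 4, 5, 31, 35, 100):
    print(score, "->", popularity_bucket(score))
```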

In [36]:
# tfidf can't handle nulls so fill any null values with an empty list
spotify_df['consolidates_genre_lists'] = spotify_df['consolidates_genre_lists'].apply(lambda d: d if isinstance(d, list) else [])

In [37]:
spotify_df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,...,popularity,release_date,speechiness,tempo,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,consolidates_genre_lists,popularity_red
0,0.177,1989,0.568,['조정현'],0.447,237688,0.215,0,2ghebdwe2pNXT4eL34T7pW,1e-06,...,31,1989-06-15,0.0272,71.979,[조정현],[],[조정현],조정현그아픔까지사랑한거야,[classic_korean_pop],6
1,0.352,1992,0.381,['黑豹'],0.353,316160,0.686,0,3KIuCzckjdeeVuswPo20mC,0.0,...,35,1992-12-22,0.0395,200.341,[黑豹],[],[黑豹],黑豹DON'T BREAK MY HEART,"[chinese_indie_rock, chinese_indie]",7
2,0.458,1963,0.987,['黃國隆'],0.241,193480,0.0437,0,4prhqrLXYMjHJ6vpRAlasx,0.000453,...,23,1963-05-28,0.0443,85.936,[黃國隆],[],[黃國隆],黃國隆藝旦調,[],4
3,0.796,1963,0.852,"['黃國隆', '王秋玉']",0.711,145720,0.111,0,5xFXTvnEe03SyvFpo6pEaE,0.0,...,23,1963-05-28,0.0697,124.273,"[黃國隆, 王秋玉]",[],"[黃國隆, 王秋玉]",黃國隆草螟弄雞公,[],4
4,0.704,1963,0.771,['黃國隆'],0.61,208760,0.175,0,6Pqs2suXEqCGx7Lxg5dlrB,0.0,...,23,1963-05-28,0.0419,124.662,[黃國隆],[],[黃國隆],黃國隆思想起,[],4


In [38]:
#simple function to create OHE features
#this gets passed later on
def ohe_prep(df, column, new_name):
    """
    Create one-hot encoded (OHE) features from a single column

    Parameters:
        df (pandas dataframe): Spotify Dataframe
        column (str): Column to be processed
        new_name (str): new column name to be used

    Returns:
        tf_df: One hot encoded features
    """

    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)
    return tf_df


In [39]:
#function to build entire feature set
def create_feature_set(df, float_cols):
    """
    Process spotify df to create a final set of features that will be used to generate recommendations

    Parameters:
        df (pandas dataframe): Spotify Dataframe
        float_cols (list(str)): List of float columns that will be scaled

    Returns:
        final: final set of features
    """

    #tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['consolidates_genre_lists'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.reset_index(drop = True, inplace=True)

    #explicity_ohe = ohe_prep(df, 'explicit','exp')
    year_ohe = ohe_prep(df, 'year','year') * 0.5
    popularity_ohe = ohe_prep(df, 'popularity_red','pop') * 0.15

    #scale float columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    #concatenate all features
    final = pd.concat([genre_df, floats_scaled, popularity_ohe, year_ohe], axis = 1)

    #add song id
    final['id']=df['id'].values

    return final
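One detail worth checking is how `TfidfVectorizer` tokenizes the joined genre strings. Its default `token_pattern` is `(?u)\b\w\w+\b`, so a hyphen still acts as a separator even though spaces were replaced with underscores. The sketch below applies that same pattern with `re` (it does not call scikit-learn) to genres taken from the Taylor Swift row shown earlier:

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

joined = " ".join(["dance_pop", "pop", "post-teen_pop"])
tokens = TOKEN_PATTERN.findall(joined)
print(tokens)  # ['dance_pop', 'pop', 'post', 'teen_pop']
```

If hyphenated genres should stay as single features, one option is to pass a custom `token_pattern` (for example `r"\S+"`) to the vectorizer. That would change the resulting columns, so treat it as a suggestion rather than what this notebook does.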

In [40]:
complete_feature_set = create_feature_set(spotify_df, float_cols=float_cols)#.mean(axis = 0)

In [41]:
complete_feature_set.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|a_cappella,genre|abstract,genre|abstract_beats,genre|abstract_hip_hop,genre|accordeon,genre|accordion,genre|acid_house,...,year|2012,year|2013,year|2014,year|2015,year|2016,year|2017,year|2018,year|2019,year|2020,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2ghebdwe2pNXT4eL34T7pW
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3KIuCzckjdeeVuswPo20mC
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4prhqrLXYMjHJ6vpRAlasx
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5xFXTvnEe03SyvFpo6pEaE
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6Pqs2suXEqCGx7Lxg5dlrB


## 3. Connect to Spotify API

Useful links:
1. https://developer.spotify.com/dashboard/
2. https://spotipy.readthedocs.io/en/2.16.1/

In [43]:
#client id and secret for my application
#read them from environment variables so the credentials are not stored in the notebook
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']

In [44]:
scope = 'user-library-read'

if len(sys.argv) > 1:
    username = sys.argv[1]
else:
    print("Usage: %s username" % (sys.argv[0],))
    sys.exit()

In [45]:
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager)


In [46]:
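#pass scope by keyword: the first positional argument of prompt_for_user_token is the username
token = util.prompt_for_user_token(scope=scope, client_id=client_id, client_secret=client_secret, redirect_uri='http://localhost')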

In [47]:
sp = spotipy.Spotify(auth=token)

In [48]:
#gather playlist names and images.
#images aren't going to be used until I start building a UI
id_name = {}
list_photo = {}
for i in sp.current_user_playlists()['items']:

    id_name[i['name']] = i['uri'].split(':')[2]
    list_photo[i['uri'].split(':')[2]] = i['images'][0]['url']

In [49]:
id_name

{'Slow sad ones': '5A4vyRuAsrqcjMCo0E3114',
 'Scream Out': '2yQS9qQHsK28uVEYeSWNqb',
 'Of beaches and Sunsets': '3gAc2y1S8wQpncwNR8kCWT',
 'Always': '3RjJVLycIuxkAigmYKGEMe',
 'Indo': '5FcJMlctkxhryPboz4WQLk'}

In [50]:
def create_necessary_outputs(playlist_name,id_dic, df):
    """
    Pull songs from a specific playlist.

    Parameters:
        playlist_name (str): name of the playlist you'd like to pull from the spotify API
        id_dic (dic): dictionary that maps playlist_name to playlist_id
        df (pandas dataframe): spotify dataframe

    Returns:
        playlist: all songs in the playlist THAT ARE AVAILABLE IN THE KAGGLE DATASET
    """

    #generate playlist dataframe
    playlist = pd.DataFrame()

    for ix, i in enumerate(sp.playlist(id_dic[playlist_name])['tracks']['items']):
        #print(i['track']['artists'][0]['name'])
        playlist.loc[ix, 'artist'] = i['track']['artists'][0]['name']
        playlist.loc[ix, 'name'] = i['track']['name']
        playlist.loc[ix, 'id'] = i['track']['id'] # ['uri'].split(':')[2]
        playlist.loc[ix, 'url'] = i['track']['album']['images'][1]['url']
        playlist.loc[ix, 'date_added'] = i['added_at']

    playlist['date_added'] = pd.to_datetime(playlist['date_added'])

    playlist = playlist[playlist['id'].isin(df['id'].values)].sort_values('date_added',ascending = False)

    return playlist
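The row-by-row `.loc` assignment works, but collecting plain dictionaries first is often simpler to follow. The sketch below runs on a hand-written stand-in for the API response, so it needs no network access. `fake_items`, `known_ids`, and `extract_rows` are hypothetical names that only mirror the keys used in the function above.

```python
# Hand-written stand-in for sp.playlist(...)['tracks']['items'].
fake_items = [
    {
        "added_at": "2018-02-05T15:54:45Z",
        "track": {
            "id": "track_1",
            "name": "Song One",
            "artists": [{"name": "Artist A"}],
            "album": {"images": [{"url": "big"}, {"url": "medium"}]},
        },
    },
    {
        "added_at": "2018-02-05T15:50:59Z",
        "track": {
            "id": "track_2",
            "name": "Song Two",
            "artists": [{"name": "Artist B"}],
            "album": {"images": [{"url": "big"}, {"url": "medium"}]},
        },
    },
]

def extract_rows(items, known_ids):
    """Keep only tracks whose id appears in known_ids, newest first."""
    rows = [
        {
            "artist": item["track"]["artists"][0]["name"],
            "name": item["track"]["name"],
            "id": item["track"]["id"],
            "url": item["track"]["album"]["images"][1]["url"],
            "date_added": item["added_at"],
        }
        for item in items
        if item["track"]["id"] in known_ids
    ]
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return sorted(rows, key=lambda row: row["date_added"], reverse=True)

rows = extract_rows(fake_items, known_ids={"track_2"})
print(rows)
```

The `known_ids` filter plays the role of `playlist['id'].isin(df['id'].values)`: tracks that are missing from the Kaggle dataset are dropped because there are no features for them.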

In [51]:
id_name

{'Slow sad ones': '5A4vyRuAsrqcjMCo0E3114',
 'Scream Out': '2yQS9qQHsK28uVEYeSWNqb',
 'Of beaches and Sunsets': '3gAc2y1S8wQpncwNR8kCWT',
 'Always': '3RjJVLycIuxkAigmYKGEMe',
 'Indo': '5FcJMlctkxhryPboz4WQLk'}

In [52]:
playlist_EDM = create_necessary_outputs('Always', id_name,spotify_df)
#playlist_chill = create_necessary_outputs('chill',id_name, spotify_df)
#playlist_classical = create_necessary_outputs('Epic Classical',id_name, spotify_df)

In [53]:
from skimage import io
import math
import matplotlib.pyplot as plt

def visualize_songs(df, max_images=50, img_size=(100, 100)):
    """
    Visualize cover art of the songs in the inputted dataframe

    Parameters:
        df (pandas dataframe): Playlist Dataframe
        max_images (int): Maximum number of images to display
        img_size (tuple): Size to resize images to (width, height); currently unused
    """

    temp = df['url'].values[:max_images]
    num_images = len(temp)
    plt.figure(figsize=(15, max(1, int(0.625 * num_images))))
    columns = 5
    #number of grid rows needed to fit every image
    rows = math.ceil(num_images / columns)

    for i, url in enumerate(temp):
        #subplot needs integer grid dimensions
        plt.subplot(rows, columns, i + 1)

        image = io.imread(url)
        plt.imshow(image)
        plt.xticks(color = 'w', fontsize = 0.1)
        plt.yticks(color = 'w', fontsize = 0.1)
        plt.xlabel(df['name'].values[i], fontsize = 12)
        plt.tight_layout(h_pad=0.4, w_pad=0)
        plt.subplots_adjust(wspace=None, hspace=None)

    plt.show()

In [54]:
playlist_EDM

Unnamed: 0,artist,name,id,url,date_added
94,James Arthur,Say You Won't Let Go,5uCax9HTNlzGybIStD3vDh,https://i.scdn.co/image/ab67616d00001e0220beb6...,2018-02-05 15:54:45+00:00
93,James Arthur,Can I Be Him,0VhgEqMTNZwYL1ARDLLNCX,https://i.scdn.co/image/ab67616d00001e0220beb6...,2018-02-05 15:54:39+00:00
92,James Arthur,Safe Inside,5ooilrQAnOJbUjq7IDm8lY,https://i.scdn.co/image/ab67616d00001e0220beb6...,2018-02-05 15:54:34+00:00
83,Twenty One Pilots,Cancer,19W5OTEcQI3ZoRW1HERMyy,https://i.scdn.co/image/ab67616d00001e020fde79...,2018-02-05 15:50:59+00:00
74,Coldplay,Yellow,3AJwUDP919kvQ9QcozQPxg,https://i.scdn.co/image/ab67616d00001e029164ba...,2018-02-05 15:48:55+00:00
59,The Click Five,Jenny,3iT4vWUWxqsn4hFTkEaJCi,https://i.scdn.co/image/ab67616d00001e0288868e...,2018-02-05 15:44:59+00:00
56,A Rocket To The Moon,Like We Used To,1fkYmLPG2Oi2AkUmcspWKl,https://i.scdn.co/image/ab67616d00001e02ccda01...,2018-02-05 15:44:08+00:00
52,Sleeping With Sirens,Scene Two - Roger Rabbit,7la8N6YLMUDAXl2iAEe9Sy,https://i.scdn.co/image/ab67616d00001e02667873...,2018-02-05 15:42:37+00:00
50,One Direction,Love You Goodbye,1ZWLWVqeEMWMKTlteS0yLH,https://i.scdn.co/image/ab67616d00001e02241e4f...,2018-02-05 15:40:24+00:00
45,The Red Jumpsuit Apparatus,Your Guardian Angel,2Guz1b911CbpG8L92cnglI,https://i.scdn.co/image/ab67616d00001e02f98edb...,2018-02-05 15:38:52+00:00


In [55]:
visualize_songs(playlist_EDM)

## 4. Create Playlist Vector

In [56]:
def generate_playlist_feature(complete_feature_set, playlist_df, weight_factor):
    """
    Summarize a user's playlist into a single vector

    Parameters:
        complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
        playlist_df (pandas dataframe): playlist dataframe
        weight_factor (float): float value that represents the recency bias. The larger the recency bias, the more priority recent songs get. Value should be close to 1.

    Returns:
        playlist_feature_set_weighted_final (pandas series): single feature that summarizes the playlist
        complete_feature_set_nonplaylist (pandas dataframe): feature set of the songs that are not in the playlist
    """

    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]#.drop('id', axis = 1).mean(axis =0)
    complete_feature_set_playlist = complete_feature_set_playlist.merge(playlist_df[['id','date_added']], on = 'id', how = 'inner')
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]#.drop('id', axis = 1)

    playlist_feature_set = complete_feature_set_playlist.sort_values('date_added',ascending=False)

    most_recent_date = playlist_feature_set.iloc[0,-1]

    for ix, row in playlist_feature_set.iterrows():
        playlist_feature_set.loc[ix,'months_from_recent'] = int((most_recent_date.to_pydatetime() - row.iloc[-1].to_pydatetime()).days / 30)

    playlist_feature_set['weight'] = playlist_feature_set['months_from_recent'].apply(lambda x: weight_factor ** (-x))

    playlist_feature_set_weighted = playlist_feature_set.copy()
    #print(playlist_feature_set_weighted.iloc[:,:-4].columns)
    playlist_feature_set_weighted.update(playlist_feature_set_weighted.iloc[:,:-4].mul(playlist_feature_set_weighted.weight,0))
    playlist_feature_set_weighted_final = playlist_feature_set_weighted.iloc[:, :-4]
    #playlist_feature_set_weighted_final['id'] = playlist_feature_set['id']

    return playlist_feature_set_weighted_final.sum(axis = 0), complete_feature_set_nonplaylist
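The recency weighting boils down to `weight = weight_factor ** (-months_from_recent)`, where months are counted in 30-day blocks back from the newest song in the playlist. Here is a minimal standalone sketch of that formula; the dates are made up and `recency_weight` is a hypothetical helper name.

```python
from datetime import datetime

def recency_weight(added, most_recent, weight_factor):
    """Weight a song by how many 30-day months it trails the newest addition."""
    months = int((most_recent - added).days / 30)
    return weight_factor ** (-months)

most_recent = datetime(2018, 2, 5)
for added in (datetime(2018, 2, 5), datetime(2017, 8, 5), datetime(2017, 2, 5)):
    print(added.date(), round(recency_weight(added, most_recent, 1.09), 3))
```

The newest song always gets weight 1, and each additional month divides the weight by `weight_factor`. A `weight_factor` of exactly 1 would treat all songs equally.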

In [57]:
complete_feature_set_playlist_vector_EDM, complete_feature_set_nonplaylist_EDM = generate_playlist_feature(complete_feature_set, playlist_EDM, 1.09)
#complete_feature_set_playlist_vector_chill, complete_feature_set_nonplaylist_chill = generate_playlist_feature(complete_feature_set, playlist_chill, 1.09)

In [58]:
complete_feature_set_playlist_vector_EDM.shape

(3070,)

## 5. Generate Recommendations

In [59]:
from IPython.display import Image
Image("/Users/thakm004/Documents/Spotify/cosine_sim_2.png")

FileNotFoundError: No such file or directory: '/Users/thakm004/Documents/Spotify/cosine_sim_2.png'

<IPython.core.display.Image object>

In [60]:
def generate_playlist_recos(df, features, nonplaylist_features):
    """
    Rank songs that are not in the playlist by cosine similarity to the playlist vector.

    Parameters:
        df (pandas dataframe): spotify dataframe
        features (pandas series): summarized playlist feature
        nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist

    Returns:
        non_playlist_df_top_40: Top 40 recommendations for that playlist
    """

    #copy so that adding the 'sim' column does not modify a slice of the original dataframe
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    non_playlist_df_top_40['url'] = non_playlist_df_top_40['id'].apply(lambda x: sp.track(x)['album']['images'][1]['url'])

    return non_playlist_df_top_40
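The score used above is the cosine of the angle between the playlist vector and each candidate song vector: `a · b / (||a|| * ||b||)`. Below is a dependency-free sketch with made-up two-dimensional vectors. It is not the scikit-learn implementation, only the same formula written out, and returning 0.0 for all-zero vectors is a convention chosen for this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for all-zero vectors in this sketch
    return dot / (norm_a * norm_b)

playlist_vector = [1.0, 1.0]
print(cosine(playlist_vector, [2.0, 2.0]))   # same direction
print(cosine(playlist_vector, [1.0, 0.0]))   # 45 degrees apart
print(cosine(playlist_vector, [1.0, -1.0]))  # orthogonal
```

Because cosine similarity ignores vector length, summing the weighted playlist rows (as `generate_playlist_feature` does) instead of averaging them does not change the ranking.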

In [None]:
edm_top40 = generate_playlist_recos(spotify_df, complete_feature_set_playlist_vector_EDM, complete_feature_set_nonplaylist_EDM)

In [None]:
from IPython.display import Image
Image("/Users/thakm004/Documents/Spotify/spotify_results.png")

In [None]:
edm_top40

In [None]:
visualize_songs(edm_top40)

In [None]:
#requires the playlist_chill outputs, which are commented out above
#chill_top40 = generate_playlist_recos(spotify_df, complete_feature_set_playlist_vector_chill, complete_feature_set_nonplaylist_chill)