# Spotify Recommender System

Dataset: <https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs>

If you're working in Google Colab, uploading this dataset will take a few minutes. Working in a local Jupyter notebook will be faster. (VS Code can also run notebooks: save a file with the `.ipynb` extension and open it in VS Code.)

You might want to subset this dataset to something like 100k rows right off the bat so that it's easier to work with. You can do all of your modeling on the 100k-row version and then, once things work the way you want, run the notebook once on the entire dataset.
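One way to take that subset is sketched below, assuming pandas. The in-memory CSV and its column names are made up for illustration; with the real file you would pass its path instead and ask for something like `nrows=100_000` or `sample(n=100_000, random_state=42)`. `nrows` reads only the first rows, which is fast but not random, while `sample` draws a reproducible random subset after a full read.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical data and column names)
csv_text = "name,energy\n" + "\n".join(f"song_{i},{i / 10}" for i in range(10))

# Option 1: read only the first rows of the file (fast, but not a random sample)
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=4)

# Option 2: read everything, then draw a reproducible random sample
sampled = (
    pd.read_csv(io.StringIO(csv_text))
    .sample(n=4, random_state=42)
    .reset_index(drop=True)  # renumber rows 0..n-1 and discard the old index
)

print(first_rows.shape, sampled.shape)
```

If the file is ordered in some meaningful way (for example by album or year), the random sample is probably the safer choice for modeling.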

In [4]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from joblib import dump

In [9]:
# Load dataset and sample it down to 8% of the original size
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False:
# malformed rows are skipped and a warning is printed for each one
# Reset index after sampling to make indices easier to reason about
df = pd.read_csv('/Users/jasongersing/PycharmProjects/pythonProject/pythonProject4/JazzySpot/tracks_features.csv', on_bad_lines='warn')
df = df.sample(frac=.08, random_state=42).reset_index()





Skipping line 10001: expected 12 fields, saw 13
Skipping line 10002: expected 12 fields, saw 13
Skipping line 10003: expected 12 fields, saw 13
[the same message repeats for lines 10004 through 10017; the rest of the output is cut off]

### Usable Columns?

These are the columns I can use with minimal data cleaning to build a simple recommender system. Adding more columns or doing some feature engineering would likely improve it, but I want to get to a working prototype as fast as possible.

- explicit
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- time_signature
- year 

Todo:

- Check for null values (none found)
- Categorically encode the `explicit` column

In [10]:
df.shape

(800, 13)

In [11]:
df.columns

Index(['index', '{"_id":{"$oid":"62dc989c18756c54b18307e3"}',
       'track_uri:"spotify:track:79ch1KhwRkS6aRHqcY3uST"',
       'danceability:0.0722', 'energy:0.318', 'key:7', 'loudness:-17.988',
       'speechiness:0.0577', 'acousticness:0.793', 'instrumentalness:0.753',
       'tempo:75.008', 'valence:0.136', 'duration:202560}'],
      dtype='object')

In [12]:
df.head()

Unnamed: 0,index,"{""_id"":{""$oid"":""62dc989c18756c54b18307e3""}","track_uri:""spotify:track:79ch1KhwRkS6aRHqcY3uST""",danceability:0.0722,energy:0.318,key:7,loudness:-17.988,speechiness:0.0577,acousticness:0.793,instrumentalness:0.753,tempo:75.008,valence:0.136,duration:202560}
0,5344,"{""_id"":{""$oid"":""62dd5d10423d988a20cce9ff""}","track_uri:""spotify:track:5Ioi3gRqmXoaUT1OkwBSQd""",danceability:0.644,energy:0.841,key:3,loudness:-5.026,speechiness:0.0444,acousticness:0.0623,instrumentalness:2.35E-05,tempo:93.08,valence:0.785,duration:206760}
1,7444,"{""_id"":{""$oid"":""62dd661e014c329ee2b38a3c""}","track_uri:""spotify:track:3oGRjCpV07tCM5mrYv6iQA""",danceability:0.391,energy:0.456,key:9,loudness:-9.679,speechiness:0.0425,acousticness:0.725,instrumentalness:0.0653,tempo:143.646,valence:0.36,duration:340627}
2,1731,"{""_id"":{""$oid"":""62dcd68c606cb6b415bcdbe0""}","track_uri:""spotify:track:50r1EUDpmSZRPo5aIZpmWi""",danceability:0.602,energy:0.863,key:5,loudness:-5.423,speechiness:0.0773,acousticness:0.0178,instrumentalness:0,tempo:150.183,valence:0.732,duration:206373}
3,8719,"{""_id"":{""$oid"":""62dd70a3e64e50e103e6c4d3""}","track_uri:""spotify:track:1XHjU0TGIgl5lMFKAF25Y3""",danceability:0.863,energy:0.609,key:9,loudness:-5.231,speechiness:0.0564,acousticness:0.000349,instrumentalness:3.97E-06,tempo:140.02,valence:0.326,duration:252520}
4,4521,"{""_id"":{""$oid"":""62dd58e092800660dfb57275""}","track_uri:""spotify:track:3eg0lzGXWHtJZHRsB6P90E""",danceability:0.391,energy:0.28,key:7,loudness:-12.452,speechiness:0.0308,acousticness:0.76,instrumentalness:0.00868,tempo:128.768,valence:0.234,duration:280467}


In [11]:
# no null values
df.isnull().sum()

id                  0
name                0
album               0
album_id            0
artists             0
artist_ids          0
track_number        0
disc_number         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
year                0
release_date        0
dtype: int64

In [12]:
# any column that contains True and False will automatically
# change to 1s and 0s when cast to the `int` datatype
df['explicit'] = df['explicit'].astype(int)

df.head()

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
0,1aGS6nf2xgv3Xzdob4eOO3,Smokin' Sticky Sticky,Beat'n Down Yo Block,5ZO72kl3xMRRzlpod55k1Q,['Unk'],['0PGtMx1bsqoCHCy3MB3gXA'],15,1,1,0.623,...,0.402,0.0021,0.0,0.0691,0.422,87.988,380427,4.0,2006,2006-10-03
1,0fJfoqHIIiET2EcgjOfntG,Holding Back the Years,Holding Back The Years,7sV4kCqQYt8agM5TjkdOYU,['Norm Douglas'],['4kxKyoiYhldUlnfeCZtD0D'],1,1,0,0.585,...,0.0333,0.316,0.775,0.0993,0.88,170.082,266520,4.0,2008,2008-06-13
2,0V2R2LC8dR7S0REieXRaGt,All Along The Watchtower - Live - 1991,"Back On The Bus, Y'All",3jmmx4jRkul3POEhn1cgwF,['Indigo Girls'],['4wM29TDTr3HI0qFY3KoSFG'],7,1,0,0.331,...,0.0379,0.709,0.0,0.939,0.43,90.648,383773,4.0,1991,1991-06-04
3,4VUHYLocWOJ2GfvP78AmSs,Windmills,Total Folklore,5PyLkzuxmT6EoVNZCg8Iya,['Dan Friel'],['4HKTPJw50BFASrfhJEHIVP'],2,1,0,0.193,...,0.109,4.9e-05,0.838,0.285,0.594,113.345,82493,4.0,2013,2013-02-19
4,4m8a1AtmCnoeRzSYoQ0oX0,Overnite Flite,Normal Human Feelings,623VIdYR6Y0NCN9yPbMAC6,['Little Suns'],['5OLcAqMbHpecNOIQyTduQ7'],2,1,0,0.546,...,0.0323,0.427,0.000105,0.197,0.424,127.941,230667,1.0,2013,2013-10-08


## Create X Matrix of numeric song attributes

In [13]:
usable_columns = ['explicit', 'danceability', 'energy', 'key', 'loudness',
                  'mode', 'speechiness', 'acousticness', 'time_signature', 'year']

X = df[usable_columns]

X.head()

Unnamed: 0,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,time_signature,year
0,1,0.623,0.736,11,-3.657,0,0.402,0.0021,4.0,2006
1,0,0.585,0.639,2,-9.641,0,0.0333,0.316,4.0,2008
2,0,0.331,0.466,9,-14.287,0,0.0379,0.709,4.0,1991
3,0,0.193,0.856,4,-2.97,1,0.109,4.9e-05,4.0,2013
4,0,0.546,0.373,3,-13.929,1,0.0323,0.427,1.0,2013


## Use Nearest Neighbors to get the 5 most similar songs




In [14]:
neigh = NearestNeighbors(n_neighbors=5, n_jobs=-1)
neigh.fit(X)

NearestNeighbors(n_jobs=-1)
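
With its default settings, `NearestNeighbors` ranks rows by Euclidean distance (Minkowski with `p=2`). The pure-Python sketch below mimics that ranking on three made-up songs with only two features, to show one consequence worth knowing about: columns with large values, such as `year`, contribute far more to the distance than columns bounded between 0 and 1. The song names and values are hypothetical, and `nearest` is an illustrative helper, not part of scikit-learn.

```python
import math

# Hypothetical two-feature songs: (danceability, year)
songs = {
    "query": (0.585, 2008),
    "same_year_different_feel": (0.100, 2008),
    "similar_feel_three_years_later": (0.585, 2011),
}


def nearest(query_name, k):
    """Return the k closest other songs as (distance, name), nearest first."""
    query = songs[query_name]
    distances = [
        (math.dist(query, features), name)  # Euclidean distance
        for name, features in songs.items()
        if name != query_name
    ]
    return sorted(distances)[:k]


ranked = nearest("query", k=2)
for distance, name in ranked:
    print(f"{name}: {distance:.3f}")
```

Here the song with very different danceability ranks first simply because it shares the release year. If that behavior is not what you want, scaling the features before fitting (for example with scikit-learn's `StandardScaler`) is one option to consider.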

In [15]:
# Track name needs to be exact match of spelling, punctuation and capitalization
track_name = "Holding Back the Years"

# Look at the song that we want to find recommendations for
df[df['name'] == track_name]

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
1,0fJfoqHIIiET2EcgjOfntG,Holding Back the Years,Holding Back The Years,7sV4kCqQYt8agM5TjkdOYU,['Norm Douglas'],['4kxKyoiYhldUlnfeCZtD0D'],1,1,0,0.585,...,0.0333,0.316,0.775,0.0993,0.88,170.082,266520,4.0,2008,2008-06-13
82635,5F9WGLNnZRRwVyiCt1nHDr,Holding Back the Years,The Lost and Found,4fZx2cNk1Vod8jZkPSWBpv,['Gretchen Parlato'],['76Gi1qoWLrIerL5FcL0TZb'],1,1,0,0.541,...,0.0427,0.778,0.13,0.115,0.127,92.624,226587,4.0,2011,2011-04-05


In [16]:
# We may have multiple tracks that match this title; we'll just select the first one.
# We'll grab only its row index and then use that to select the corresponding song's
# data from our X matrix (the index was reset after sampling, so label and position agree).
track_index = df[df['name'] == track_name].index[0]

track_data = X.iloc[track_index]

track_data

explicit             0.0000
danceability         0.5850
energy               0.6390
key                  2.0000
loudness            -9.6410
mode                 0.0000
speechiness          0.0333
acousticness         0.3160
time_signature       4.0000
year              2008.0000
Name: 1, dtype: float64

In [17]:
# Input to model must be a 2D array
# .reshape(1,-1) turns a 1D array into a 2D array
# (basically just adds an extra set of square brackets at
# the beginning and end of the array.)
track_data = track_data.values.reshape(1,-1)

track_data

array([[ 0.000e+00,  5.850e-01,  6.390e-01,  2.000e+00, -9.641e+00,
         0.000e+00,  3.330e-02,  3.160e-01,  4.000e+00,  2.008e+03]])

In [18]:
# Since the selected song is also in the training data,
# the most similar song is itself 
# We will ask for 6 songs to get back 5 songs in addition to the one provided
distances, song_indexes = neigh.kneighbors(track_data, 6)

song_indexes



array([[    1, 69557, 34753,  7222, 50380, 37367]])

In [19]:
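# The query song itself (first row) followed by its 5 most similar songs
# song_indexes is 2D (one row per query), so this loop runs exactly once
for index in song_indexes:
  print(df.iloc[index][['name', 'artists']])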

                            name                  artists
1         Holding Back the Years         ['Norm Douglas']
69557                 Ma Mélodie           ['Feet Peals']
34753     Minha Bênção (Ao Vivo)  ['Padre Marcelo Rossi']
7222   Stuck In A Glass Elevator           ['The Myriad']
50380        Club Hip Hop Beat 2       ['Jorge Quintero']
37367           Like...monk-like    ['The Reese Project']
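
The lookup, query, and printing steps above could be folded into one helper that also leaves the query song out of its own results. The sketch below is one possible shape for it, run on a four-row toy DataFrame so it is self-contained; `toy_df`, `feature_columns`, and `recommend` are hypothetical names, not part of this notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data standing in for df and usable_columns
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "artists": ["['w']", "['x']", "['y']", "['z']"],
    "energy": [0.10, 0.12, 0.90, 0.15],
    "danceability": [0.20, 0.22, 0.80, 0.30],
})
feature_columns = ["energy", "danceability"]
toy_X = toy_df[feature_columns]
model = NearestNeighbors(n_neighbors=2).fit(toy_X)


def recommend(track_name, n_recommendations=2):
    """Return the nearest tracks to the first title match, excluding itself."""
    matches = toy_df.index[toy_df["name"] == track_name]
    if len(matches) == 0:
        raise ValueError(f"No track named {track_name!r}")
    position = toy_df.index.get_loc(matches[0])
    query = toy_X.iloc[[position]]  # double brackets keep the query 2D
    # Ask for one extra neighbor because the query song is in the training data
    _, indexes = model.kneighbors(query, n_neighbors=n_recommendations + 1)
    neighbors = [i for i in indexes[0] if i != position][:n_recommendations]
    return toy_df.iloc[neighbors][["name", "artists"]]


print(recommend("A"))
```

Filtering by position rather than always dropping the first result is a small safeguard: if two songs have identical features, the query is not guaranteed to come back first.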


In [21]:
# dump the model
dump(neigh, filename="shop_rec_model.joblib")

['shop_rec_model.joblib']
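
To use the saved model later, it can be read back with `joblib.load`. The sketch below round-trips a tiny model through a temporary file; the points and the `toy_model.joblib` filename are made up for illustration. Two things to keep in mind: the model returns row positions, so the same DataFrame is still needed to turn them back into song names, and joblib files should only be loaded from sources you trust, ideally with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import NearestNeighbors

# Hypothetical training points standing in for the X matrix
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
model = NearestNeighbors(n_neighbors=2).fit(points)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(model, path)      # write the fitted model to disk
    restored = load(path)  # read it back into a new object

# The restored model answers queries like the original one
distances, indexes = restored.kneighbors([[0.9, 0.9]])
print(indexes)
```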