# Spotify Recommender

## Project Inspiration
I've always loved listening to music and exploring all the different songs that are available. I've also recently been very captivated by artifical intelligence and the many ways it can be used to create applications that can better the world. I figured, why not combine the two into a project? 

As someone who loves getting recommendations for songs from friends and family, I decided to create this application that recommends songs so that I, and anyone who uses this application, can get recommended new songs to enjoy.

## Starting the Project
**1. Learning about machine learning:** I first learned about many of the basics of machine learning such as exploratory data analysis and feature engineering, and then I learned more about the different libraries that can be used for machine learning, such as scikit-learn and TensorFlow.

**2. Learning about Spotify API:** I then learned about Spotify's API to learn how to extract information on tracks, artists, etc.

## Begin Code

#### Setting up the environment

In [1]:
# import necessary libraries
from dotenv import load_dotenv
import os
import base64
from requests import post, get
import json
import pandas as pd
import numpy as np
import matplotlib as plt

# to suppress warnings that I may get
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# set up the environment
load_dotenv()
client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET") 




#### Functions to set up token authorization

In [2]:
# function to generate the token
def get_token():
    auth_string = client_id + ":" + client_secret
    auth_bytes = auth_string.encode("utf-8")
    auth_base64 = str(base64.b64encode(auth_bytes), "utf-8")

    url = "https://accounts.spotify.com/api/token"
    headers = {
        "Authorization": "Basic " + auth_base64,
        "Content-Type": "application/x-www-form-urlencoded"
    }
    data = {"grant_type": "client_credentials"}
    result = post(url, headers=headers, data=data)
    json_result = json.loads(result.content)
    token = json_result["access_token"]
    return token

# function to generate the authorization header
def get_auth_header(token):
    return {"Authorization": "Bearer " + token}

#### Functions getting info from Spotify's Web API

In [3]:
# search for artist by name
def search_for_artist(token, artist_name):
    url = "https://api.spotify.com/v1/search"
    headers = get_auth_header(token)
    query = f"?q={artist_name}&type=artist&limit=1"

    query_url = url + query
    result = get(query_url, headers=headers)
    json_result = json.loads(result.content)["artists"]["items"]
    if len(json_result) == 0:
        print("No artist with this name exists...")
        return None
    return json_result[0]

# search for track by name
def search_for_track(token, track_name):
    url = "https://api.spotify.com/v1/search"
    headers = get_auth_header(token)
    query = f"?q={track_name}&type=track&limit=1"

    query_url = url + query
    result = get(query_url, headers=headers)
    json_result = json.loads(result.content)["tracks"]["items"]
    if len(json_result) == 0:
        print("No track with this name exists...")
        return None
    return json_result[0]

# get artist by artist id
def get_artist(token, artist_id):
    url = f"https://api.spotify.com/v1/artists/{artist_id}"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)
    return json_result

# get songs by artist from artist id
def get_songs_by_artist(token, artist_id):
    url = f"https://api.spotify.com/v1/artists/{artist_id}/top-tracks?country=CA"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)["tracks"]
    return json_result

# get track info from track id
def get_track(token, track_id):
    url = f"https://api.spotify.com/v1/tracks/{track_id}"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)
    return json_result

# get track audio features from track id
def get_track_features(token, track_id):
    url = f"https://api.spotify.com/v1/audio-features/{track_id}"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)
    return json_result

# get all available markets on Spotify (CA, US, etc)
def get_markets(token):
    url = "https://api.spotify.com/v1/markets"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)
    return json_result

# get all available genres in the database
def get_genres(token):
    url = "https://api.spotify.com/v1/recommendations/available-genre-seeds"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)
    return json_result

# get recommendations based on tracks
def get_recommendations(token, seed_tracks):
    url = f"https://api.spotify.com/v1/recommendations?market=CA&limit=1&seed_tracks={seed_tracks}"
    headers = get_auth_header(token)
    result = get(url, headers=headers)
    json_result = json.loads(result.content)['tracks']
    return json_result


#### Set up token and play around with Spotify's API

In [4]:
# set up token that allows us to use the Spotify API
token = get_token()

# getting familiar with Spotify API, printing top songs of an artist given their artist name
result = search_for_artist(token, "Of Monsters and Men")
artist_id = result["id"]
print(result['name'], "Top Tracks")
# by default this gives top 10 tracks, but we'll use top 3 for the sake of smaller output
songs = get_songs_by_artist(token, artist_id)[:3]
for idx, song in enumerate(songs):
    print(f"{idx + 1}. {song['name']}")

Of Monsters and Men Top Tracks
1. Little Talks
2. Dirty Paws
3. Mountain Sound


## PROJECT PLAN
**Problem:** Given a song, recommend a new song.<br><br>
**Solution:**
1. Load in the dataset
2. Perform exploratory data analysis to gain info on the dataset
3. Perform feature engineering to remove irrelevant data, mutate data, etc
4. Get the user's song/track
5. Feature engineer the user's track
6. Use cosine similarity to find similarity between user's track, and the tracks in the dataset
7. Reveal to the user the track that is the most similar
8. Enjoy the recommended song :)

## EDA and Feature Engineering

#### Load in data from Kaggle dataset

In [5]:
# load in the data and display first track
all_song_data = pd.read_csv("spotify_data.csv")

# we will be modifying all_song_data so raw_data is to keep original info on tracks in the dataset
raw_data = all_song_data
all_song_data.head(1)

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,1,0.0429,0.694,0.0,0.115,0.139,133.406,240166,3


In [6]:
# drop the first column since it is just the index
all_song_data = all_song_data.drop(all_song_data.columns[0], axis=1)
all_song_data.head(1)

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,1,0.0429,0.694,0.0,0.115,0.139,133.406,240166,3


In [7]:
# check the format of genres in Spotify's API (since it could be different from the format in the dataset)

# the artist stored in artist_id is Of Monsters and Men from above
artist_genre = get_artist(token, artist_id)['genres']
# print their genres
print('Of Monsters and Men Genres:', artist_genre)

# showing how to get an artist's genres given the artist name
jason_mraz = search_for_artist(token, "Jason Mraz")['id']
jason_genre = get_artist(token, jason_mraz)['genres']
print('Jason Mraz Genres:', jason_genre)

# get all unique genres in the dataset
kaggle_genres = list(all_song_data['genre'].unique())
# we store in a variable to show reduction in genres size which we will perform later
orig_len_genres = len(kaggle_genres)

Of Monsters and Men Genres: ['folk-pop', 'metropopolis', 'modern rock', 'stomp and holler']
Jason Mraz Genres: ['acoustic pop', 'dance pop', 'neo mellow', 'pop']


#### Manually reduce amount of genres

In [8]:
# list of genres to remove
remove_genres = []
# loop through all genres
for genre in kaggle_genres:
    # there are 4 subgenres of metal that can be reduced to just metal
    if 'metal' in genre and genre != 'metal':
        remove_genres.append(genre)
    # there are 3 subgenres of house that can be reduced to just house
    if 'house' in genre and genre != 'house':
        remove_genres.append(genre)
    # there are 5 subgenres of rock that can be reduced to just rock
    if 'rock' in genre and genre != 'rock':
        remove_genres.append(genre)

# all these fit into pop and edm
pop_edm_subs = ['pop-film', 'power-pop', 'techno', 'minimal-techno', 'electronic', 'electro', 'dubstep']
for genre in pop_edm_subs:
    remove_genres.append(genre)

# remove the genres in remove_genres
for genre in remove_genres:
    kaggle_genres.remove(genre)

print('We\'ve reduced the number of genres by', str(orig_len_genres - len(kaggle_genres)) + '.')


We've reduced the number of genres by 19.


#### Map values in dataset to remaining genres

In [9]:
# map the subgenres that we removed above to their associated genres
map_genre = {
    'alt-rock': 'rock', 'hard-rock': 'rock', 'psych-rock': 'rock', 'punk-rock': 'rock', 'rock-n-roll': 'rock',
    'pop-film': 'pop', 'power-pop': 'pop',
    'black-metal': 'metal', 'death-metal': 'metal', 'heavy-metal': 'metal', 'metalcore': 'metal',
    'chicago-house': 'house', 'deep-house': 'house', 'progressive-house': 'house',
    'techno': 'edm', 'minimal-techno': 'edm', 'electronic': 'edm', 'electro': 'edm', 'dubstep': 'edm'}
# map, and keep genre name if it doesn't exist in the map
all_song_data['genre'] = all_song_data['genre'].map(map_genre).fillna(all_song_data['genre'])
print('Number of genres in the dataset:', len(all_song_data['genre'].unique()))


Number of genres in the dataset: 63


#### One Hot Encode the Genres in Kaggle Dataset

In [10]:
from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder()
# one hot encode the genre column of song data
genre_array = ohe.fit_transform(all_song_data[['genre']]).toarray()
# get the labels (genre names) from the one hot encoder
genre_labels = ohe.categories_[0]
# make into a dataframe
genres = pd.DataFrame(genre_array, columns=genre_labels)
genres.head(1)

Unnamed: 0,acoustic,afrobeat,ambient,blues,breakbeat,cantopop,chill,classical,club,comedy,...,singer-songwriter,ska,sleep,songwriter,soul,spanish,swedish,tango,trance,trip-hop
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# concatenate the original DataFrame to the One Hot Encoded genres
all_song_data = pd.concat([all_song_data, genres], axis=1)

# drop the genres column
all_song_data = all_song_data.drop('genre', axis=1)

In [12]:
# dropping columns that don't help with machine learning to suggest songs (note that artist contributes already through their genre)
all_song_data = all_song_data.drop(['track_id', 'track_name', 'artist_name', 'popularity', 'duration_ms'], axis=1)

In [13]:
# checking data to see make sure we've removed all useless columns, to index 13 because after is just genres
all_song_data.iloc[0:1, 0:13]

Unnamed: 0,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,2012,0.483,0.303,4,-10.058,1,0.0429,0.694,0.0,0.115,0.139,133.406,3


#### Scale the features
We use the StandardScaler from scikit-learn to scale the features so that each feature has the same affect on the similarities between songs when we later use cosine similarity to calculate similarities between two tracks. The StandardScaler makes the mean of each column 0 and the standard deviation of each column 1.

Note: if we were to use a model to create the parameters then we would not need to scale since the model would do it for us.

In [14]:
from sklearn.preprocessing import StandardScaler


# keep the original data in a variable so that we can later use it to calculate mean and std so that we can scale the new user's track
original_data = all_song_data

scaler = StandardScaler()
scaled_songs = scaler.fit_transform(all_song_data)
all_song_data = pd.DataFrame(scaled_songs, columns=all_song_data.columns, index=all_song_data.index)
all_song_data.head(1)


Unnamed: 0,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,...,singer-songwriter,ska,sleep,songwriter,soul,spanish,swedish,tango,trance,trip-hop
0,0.006614,-0.295093,-1.244617,-0.362224,-0.189477,0.758725,-0.393523,1.04923,-0.691229,-0.537219,...,-0.110679,-0.109641,-0.12503,-0.022542,-0.088307,-0.129601,-0.100584,-0.114735,-0.091322,-0.099126


#### Load in the user's track based on ID

In [15]:
# allow user to enter in name of song to use
user_track = input('What is the name of your song? ')

# Here we search for song by name and then use the id given to us to get the track info
track_name = search_for_track(token, user_track)
track_id = track_name['id']

track = get_track(token, track_id)
track_features = get_track_features(token, track_id)

# only taking the first artist on the track for now
track_artist = track['artists']
first_artist = track_artist[0]
artist_id = first_artist['id']

artist = get_artist(token, artist_id)
first_genre = artist['genres'][0]

# we see here that the genres do not perfectly line up
genres = get_genres(token)['genres']
print(first_genre, 'in genres:', first_genre in genres)


alt z in genres: False


#### Match up columns of existing tracks data with the new track

In [16]:
# note: release date might be slightly off due to this being the release date of the album rather than the track
release_year = track['album']['release_date'][0:4]

# get the audio features of the track
danceability = track_features['danceability']
energy = track_features['energy']
key = track_features['key']
loudness = track_features['loudness']
mode = track_features['mode']
speechiness = track_features['speechiness']
acousticness = track_features['acousticness']
instrumentalness = track_features['instrumentalness']
liveness = track_features['liveness']
valence = track_features['valence']
tempo = track_features['tempo']
time_signature = track_features['time_signature']

# put the info into an array
track_info = [release_year, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time_signature]

# set up the genres of the given track
tracks_genres = []

# note: we will have to use the artists' genres since individual tracks do not have genres
artist_genres = artist['genres']

# if the genre name is somewhere in the artist's genres then append to the tracks genres
for genre in kaggle_genres:
    for tGenre in artist_genres:
        if genre in tGenre:
            tracks_genres.append(genre)

# special case for alternative music, which I manually fit into the indie-pop genre
for tGenre in artist_genres:
    if 'indie-pop' not in tracks_genres and 'alt' in tGenre:
        tracks_genres.append('indie-pop')
curr_artist = artist['name']
print('Genres for', curr_artist + ":", tracks_genres)

tracks_genres = []

# example to show that a genre can fit into multiple genre columns
artist_genres = ['french-pop']
for genre in kaggle_genres:
    for tGenre in artist_genres:
        if genre in tGenre:
            tracks_genres.append(genre)

print('Genres that encapsulate \'french-pop\':', tracks_genres)

# set genre to 1 if it is the track's genre, else 0
for genre in kaggle_genres:
    if genre in tracks_genres:
        track_info.append(1)
    else:
        track_info.append(0)

Genres for Sasha Alex Sloan: ['indie-pop']
Genres that encapsulate 'french-pop': ['french', 'pop']


#### Shift and rescale the track's feature values based on how we shifted database

In [17]:
# recall we still have the original data which is not shifted or scaled
original_data.head(1)

Unnamed: 0,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,...,singer-songwriter,ska,sleep,songwriter,soul,spanish,swedish,tango,trance,trip-hop
0,2012,0.483,0.303,4,-10.058,1,0.0429,0.694,0.0,0.115,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# use the original values to get mean and std of each feature so that we can shift and rescale the user's track features
features_mean = original_data.mean()
features_std = original_data.std()
num_features = len(track_info)

# shift and rescale each feature of the new track based on dataset's mean and std
for i in range(num_features):
    track_info[i] = float(track_info[i])
    track_info[i] = (track_info[i] - features_mean[i]) / features_std[i]

#### Use cosine similarity to find track in database most similar to given track

In [19]:
# import function that computes cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity


# convert the list to an array so that we can reshape
track_info_arr = np.array(track_info, dtype='float64')
# calculate the cosine similarity between the new track and all tracks in the dataset, reshape so that everything lines up
cos_sim = cosine_similarity(all_song_data, track_info_arr.reshape((1, -1))).reshape(-1)
# find the index with the largest cosine similarity, which would be the track that is most similar
sim_index = np.argmax(cos_sim)

# get the info about the recommended song including the artist name and track name
sim_song_row = raw_data.iloc[sim_index]
rec_artist = sim_song_row['artist_name']
rec_track = sim_song_row['track_name']

#### TIME TO GIVE RECOMMENDATION :)

In [20]:
new_track = track['name']
new_artist = first_artist['name']
print('Song Recommendation: If you like', new_track, 'by', new_artist + ',', 'then we would recommend', rec_track, 'by', rec_artist + '.')

Song Recommendation: If you like Dancing With Your Ghost by Sasha Alex Sloan, then we would recommend Ivory - Please by Joe Bel.


## Drawbacks
1. We have to use album release dates rather than track release dates
2. We use the artist's genres rather than the track's genres
3. We only have a dataset of 1 million songs to recommend from, when there exists much more songs out there
4. May have weird genre names that don't end up matching to any one hot encoded genres even though they may be same genre category
5. Our features currently all have the same weight, realistically a model would discover that certain features play more into similarity than others
6. Can only get recommendations for tracks that are the most popular for their given name
7. For more unknown songs, there might not be enough info, like no genres, which may cause errors

## Conclusion
Overall, this application succesfully returns a song recommendation that is similar to the user's given track, but there are definitely many improvements that could be made to this project in the future to increase the accuracy of the recommendations.