# Testing the Viability of a Song Recommender

<hr>

Using the various attributes we were able to pull, we would like to see whether it is possible to create a song recommender. It probably will not be very accurate, for two reasons:

 - The Data
      - We have one major hole in our data, and that is genre. Sadly, genre wasn't a track feature in the API; it was only tied to artists, so we weren't able to collect it.
 - The Algorithm
      - One algorithm alone isn't nearly enough to make an effective recommendation system
      - Netflix ran a recommendation algorithm challenge, the Netflix Prize, from 2006 to 2009
          - The 2007 Progress Prize winners used a combination of 107 different algorithms to create an effective system
      - So even if ours is only slightly effective, genre issue aside, I would call that a win
      
## Outcome (Spoilers)
<hr>
I took an approach similar to how one might test document similarity, using scikit-learn's `cosine_similarity` function. I was unsure that applying it to this problem would actually work, but...

It worked surprisingly well. The matches on the top100 set seemed better, and I have hypotheses on why that might be the case, but I won't speculate.

For the larger set it can be hit or miss, but around half of the songs it recommends "feel" similar. It definitely outperformed my expectations, given that it had no genre or other information to guide the output.
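For readers unfamiliar with the measure, here is a minimal sketch of what cosine similarity computes for two feature vectors. The three songs and their values are made up for illustration and are not rows from either table; the `cosine` helper is a hand-rolled stand-in, not scikit-learn's implementation.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, already-scaled features: (danceability, energy, valence)
song_a = [0.9, 0.8, 0.7]
song_b = [0.8, 0.9, 0.6]   # similar profile to song_a
song_c = [0.1, 0.2, 0.9]   # quite different profile

sim_ab = cosine(song_a, song_b)
sim_ac = cosine(song_a, song_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

A value close to 1 means the two songs point in nearly the same direction in feature space, which is the sense of "similar" used in the rest of this notebook.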

In [1]:
### This block grabs the dataframes for you ###
### This block doesn't need any modification ###

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

hostname=''
dbname=''
uname=''
pwd=''

engine = create_engine("mysql+pymysql://{user}:{pw}@{host}/{db}".format(host=hostname, db=dbname, user=uname, pw=pwd))

top100 = pd.read_sql('SELECT * FROM top100', con=engine)
allSongs = pd.read_sql('SELECT * FROM final_Table', con=engine)

In [2]:
print("----- Top 100 Songs -----")
display(top100.head(3))
print("----- All Songs -----")
display(allSongs.head(3))

----- Top 100 Songs -----


Unnamed: 0,year,rank,artist,title,album,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,keys,valence,ids
0,1990,1,Wilson Phillips,Hold On,Wilson Phillips,1990-01-01,266866,68,0.679,0.4,0.657,0.0,0.0497,-9.897,0.0255,97.8,4,5,0.546,4VZDv8sASBS8UruUBGTFdk
1,1990,2,Roxette,It Must Have Been Love,It Must Have Been Love,1990-05-20,258786,76,0.52,0.34,0.652,5.5e-05,0.256,-6.655,0.0274,80.609,4,5,0.722,6kvoHl80mfCVTv7XnZkjQn
2,1990,3,Sinéad O'Connor,Nothing Compares 2 U,I Do Not Want What I Haven't Got,1990-07-01,280040,72,0.511,0.0425,0.574,2.3e-05,0.105,-7.016,0.0273,119.917,4,5,0.161,3nvuPQTw2zuFAVuLsC9IYQ


----- All Songs -----


Unnamed: 0,ids,name,album,artist,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,keys,valence
0,2LKBDF6u2QNrzlUPzvpNAS,Tellin' Me Lies,Nature Of The Beast,April Wine,1981,3.02555,23,0.706,0.0543,0.637,4.6e-05,0.0616,-11.825,0.034,137.752,3.0,9,0.965
1,2Y0HDXWfE6KifNzo3GEScQ,Dancer,Hot Space,Queen,1982-05-03,3.818883,25,0.745,0.0435,0.336,9e-06,0.0766,-11.97,0.0398,100.093,4.0,9,0.625
2,2lRETJsdBygk2oVWbPpSRV,In Your Eyes,Emotions In Motion,Billy Squier,1982,3.777333,25,0.526,0.22,0.558,0.0,0.0589,-6.918,0.0291,137.122,4.0,2,0.566


# Start of Testing Below

In [68]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import sklearn_recommender as skr
from sklearn import preprocessing

In [50]:
# define the features we want to use to make a recommendation
# for the sake of testing, we'll look at the two sets separately,
# since top100 is smaller and likely contains only songs aimed at an American audience

features = ['danceability','acousticness','energy','liveness','speechiness','tempo','loudness','valence']

recDF = top100[features]
# NOTE: `tf` is only defined in the next cell, so this line relies on
# that cell having been run earlier in the session
items = tf.transform(recDF)

In [51]:
tf = skr.transformer.SimilarityTransformer(cols=(0,-1),normalize=True)
sim_mat = tf.transform(items)

In [57]:
rec = skr.recommender.SimilarityRecommender(5).fit(sim_mat)

In [107]:
# to test, I want to use a song I am familiar with
# after shuffling through the set, the first one I knew well enough
# was "When I'm Gone", sadly...

top100.loc[top100['title'].str.contains("When I'm Gone")]

Unnamed: 0,year,rank,artist,title,album,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,keys,valence,ids
1302,2003,5,3 Doors Down,When I'm Gone,Away From The Sun,2002-11-12,260333,68,0.53,0.00385,0.768,0.0,0.103,-5.611,0.0284,148.095,4,7,0.374,3WbphvawbMZ8FyqDxYGdSQ
1587,2006,90,Eminem,When I'm Gone,When I'm Gone,2005-12-06,281120,62,0.608,0.0551,0.745,0.0,0.27,-5.496,0.365,150.068,4,10,0.725,2lRI7n5b9jlPYsEpRuots6
2218,2013,21,Anna Kendrick,Cups (Pitch Perfect's When I'm Gone),Ultimate Pitch Perfect (Original Motion Pictur...,2015-02-10,128013,59,0.878,0.0462,0.455,0.00484,0.0353,-8.845,0.077,129.953,4,0,0.79,6mH3qVIeOsnQIAho5eWwhH


In [88]:
# ohh, I know the em song too, so I can test both

print(rec.predict([1302,1587]))

[[163 277  51   5 289]
 [163  51 277   5 289]]


In [66]:
# for some reason this is the top recommendation for both
print(top100.iloc[163].artist + " - " + top100.iloc[163].title)

LL Cool J - Around the Way Girl


In [69]:
# 3 Doors Down and Eminem don't seem too close, so something
# has to be wrong. This looked like the easy approach, but
# sadly I'm going to have to do a little more work.
# I think I want to preprocess some of the info myself,
# and then reapproach with another method


x = recDF.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized = pd.DataFrame(x_scaled)

In [81]:
cosined = cosine_similarity(normalized)
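One plausible reason the scaling step matters, sketched here with made-up numbers: when one column (such as tempo) is on a much larger scale than the 0-1 features, it dominates the cosine and nearly every pair looks identical. The rows below are hypothetical, and the scaling is done in plain NumPy to mirror what `MinMaxScaler` does with its default settings.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical rows: (danceability, valence, tempo)
songs = np.array([
    [0.90, 0.90, 120.0],   # upbeat
    [0.10, 0.10, 118.0],   # downbeat, similar tempo
    [0.85, 0.80,  90.0],   # upbeat, slower tempo
])

# Raw features: tempo dwarfs the 0-1 columns, so both pairs score close to 1
raw_01 = cos(songs[0], songs[1])
raw_02 = cos(songs[0], songs[2])

# Min-max scale each column to [0, 1]
mins, maxs = songs.min(axis=0), songs.max(axis=0)
scaled = (songs - mins) / (maxs - mins)
scaled_01 = cos(scaled[0], scaled[1])
scaled_02 = cos(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

After scaling, the two upbeat songs score as closer to each other than the upbeat and downbeat pair, which the raw values could not distinguish.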

In [108]:
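def top100Searcher(songID, matrix):
    # pair each row index with its similarity to songID
    similar = list(enumerate(matrix[songID]))
    # sort by similarity, highest first; [1:7] skips the top hit
    # (assumed to be the song itself) and keeps the next six
    sortedSimilar = sorted(similar, key=lambda x: x[1], reverse=True)[1:7]
    for idx, _ in sortedSimilar:
        print(top100.iloc[idx].artist + " - " + top100.iloc[idx].title)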

top100Searcher(1302,cosined)

OneRepublic - Secrets
Nickelback - Photograph
Kelly Clarkson - My Life Would Suck Without You
Trace Adkins - You're Gonna Miss This
Miley Cyrus - Malibu
Hanson - I Will Come to You


In [109]:
# okay wow. I'm surprised. The first two seem kinda close actually.
# since that Eminem song came up too, we'll try that one as well

top100Searcher(1587,cosined)

Ludacris - Splash Waterfalls
Lou Bega - Mambo No. 5
Eminem Featuring Juice WRLD - Godzilla
Luther Vandross and Janet Jackson - The Best Things in Life Are Free
DJ Khaled Featuring Justin Bieber, Chance The Rapper & Quavo - No Brainer
Lil Wayne - A Milli


In [110]:
# searching Party In The U.S.A. (it returns itself because there's a duplicate in the df)
top100Searcher(1826,cosined)

Miley Cyrus - Party In The U.S.A.
Harry Styles - Adore You
TLC - Unpretty
Moby featuring Gwen Stefani - South Side
Adele - Set Fire To The Rain
Ella Henderson - Ghost
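
Dropping the first sorted element assumes the queried song always ranks first. When an exact duplicate ties with it at the top, the element that gets dropped may be the duplicate rather than the query. A minimal sketch of an alternative that removes the query by index instead; `top_k_similar` and the similarity row are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=5):
    """Return indices of the k highest scores in sim_row, skipping query_idx."""
    order = np.argsort(sim_row)[::-1]      # highest similarity first
    order = order[order != query_idx]      # drop the query itself, wherever it ranks
    return order[:k].tolist()

# Hypothetical similarity row for song 2; song 4 is an exact duplicate of it
row = np.array([0.40, 0.95, 1.00, 0.10, 1.00, 0.80])
result = top_k_similar(row, query_idx=2, k=3)
print(result)
```

This still returns the duplicate (index 4) as a match, so it only guarantees the query index itself is excluded.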


In [151]:
# I'm surprised how well it works. Now to attempt it with
# the bigger list; first let's clean up and make functions

def createCosMatrix(df, features):
    # creates the df with the features we want
    recDF = df[features]
    
    # normalize all the features
    x = recDF.values
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    normalized = pd.DataFrame(x_scaled)
    
    # creates an NxN matrix of cosine similarities
    cosined = cosine_similarity(normalized)
    
    return cosined

def idSearch(df, songname):
    # our naming conventions were off for titles ('title' in top100,
    # 'name' in allSongs), so pick whichever column the df has
    col = 'title' if 'title' in df.columns else 'name'
    display(df.loc[df[col].str.contains(songname)])

In [None]:
test = createCosMatrix(allSongs,features)
idSearch(allSongs,"Shake")

In [179]:
# oof, that's a lot of numbers
# I tried scaling it down to a less precise float,
# but we're still at 441 GB of memory needed
# we only technically need the row we query though,
# so maybe a tweak will fix it

def getSongCosine(df, features, songID):
    # creates the df with the features we want
    recDF = df[features]
    
    # probably don't need this anymore because we're not saving
    # the full matrix
    #recDF = recDF.astype(np.float32)
    
    # normalize all the features
    x = recDF.values
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    normalized = pd.DataFrame(x_scaled)
    
    # compares only the queried row against every song,
    # giving a 1xN matrix instead of the full NxN one
    cosined = cosine_similarity(normalized[songID:songID+1], normalized)
    
    return cosined


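# now we also need a way to search songs in allSongs

def allSongSearcher(matrix):
    # matrix is the 1xN output of getSongCosine, so row 0 holds all the scores
    similar = list(enumerate(matrix[0]))
    # highest similarity first; [1:7] skips the top hit (assumed to be
    # the queried song) and keeps the next six
    sortedSimilar = sorted(similar, key=lambda x: x[1], reverse=True)[1:7]
    for idx, _ in sortedSimilar:
        print(allSongs.iloc[idx].artist + " - " + allSongs['name'].iloc[idx])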
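
`getSongCosine` refits the scaler on the whole table for every query. One possible refinement, sketched here in plain NumPy under the assumption that the feature table fits in memory: scale and unit-normalize the rows once, after which each query is a single matrix-vector product. The `prepare` and `query_row` helpers and the demo array are hypothetical.

```python
import numpy as np

def prepare(features_2d):
    """Min-max scale columns, then L2-normalize rows so dot product == cosine."""
    x = np.asarray(features_2d, dtype=np.float64)
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0] = 1.0                  # avoid dividing by zero on constant columns
    scaled = (x - x.min(axis=0)) / span
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                # all-zero rows stay all-zero
    return scaled / norms

def query_row(unit_rows, song_idx):
    """Cosine similarity of one song against every row, as a length-N vector."""
    return unit_rows @ unit_rows[song_idx]

# Hypothetical stand-in for allSongs[features].values
demo = np.array([[0.9, 0.8, 120.0],
                 [0.1, 0.2, 118.0],
                 [0.8, 0.9,  95.0],
                 [0.5, 0.5, 100.0]])
unit = prepare(demo)        # done once
sims = query_row(unit, 0)   # cheap per query
print(sims)
```

The result is a flat vector rather than a 1xN matrix, so a searcher built on it would index `sims` directly instead of `matrix[0]`.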

In [180]:
# we'll test with taylor swift's shake it off (2906)

test = getSongCosine(allSongs,features,2906)

In [181]:
# all right, there seem to be A LOT of
# duplicates in this, so I want to try
# this again after deleting the duplicates

# EDIT: never mind, apparently Spotify has
# 4 different releases of this song, all
# with unique IDs, so we'll do a different song

allSongSearcher(test)

Taylor Swift - Shake It Off
Taylor Swift - Shake It Off
Taylor Swift - Shake It Off
Taylor Swift - Shake It Off
Cage The Elephant - Ain't No Rest for the Wicked
Charli XCX - Break the Rules - ODESZA Remix
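
Because the repeats have distinct IDs, dropping duplicate IDs would not remove them. A minimal sketch of filtering the ranked output by artist and title instead; `dedupe_matches` is a hypothetical helper, and the ranked list is copied from the output above.

```python
def dedupe_matches(ranked, query_key, k=5):
    """Keep the first k ranked (artist, title) pairs, skipping the query and repeats."""
    seen = {query_key}
    kept = []
    for key in ranked:
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
        if len(kept) == k:
            break
    return kept

# Ranked output copied from the run above
ranked = [
    ("Taylor Swift", "Shake It Off"),
    ("Taylor Swift", "Shake It Off"),
    ("Taylor Swift", "Shake It Off"),
    ("Taylor Swift", "Shake It Off"),
    ("Cage The Elephant", "Ain't No Rest for the Wicked"),
    ("Charli XCX", "Break the Rules - ODESZA Remix"),
]
matches = dedupe_matches(ranked, query_key=("Taylor Swift", "Shake It Off"))
for artist, title in matches:
    print(artist + " - " + title)
```

Exact string matching will miss variants such as remixes or remasters with altered titles, so this is only a first pass.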


In [185]:
idSearch(allSongs,"Hey Look Ma, I Made It")

Unnamed: 0,ids,name,album,artist,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,keys,valence
121576,22oEJW6r2rMb9z4IntfyEa,"Hey Look Ma, I Made It",Pray for the Wicked,Panic! At The Disco,2018-06-22,0.283,71,0.577,0.0137,0.833,0.0,0.121,-3.337,0.0695,107.936,4.0,5,0.58


In [186]:
# I could see the first one making sense: not
# the same vibe, but the vocalization and band
# presence are kind of similar

allSongSearcher(getSongCosine(allSongs,features,121576))

Tenille Arts - Somebody Like That
Moby - Extreme Ways
Poom - Les voiles - Christine Remix
BØRNS - Dopamine
Moby - Extreme Ways
Punk Goes - In My Head


In [187]:
idSearch(allSongs,"Me Like Yuh")

Unnamed: 0,ids,name,album,artist,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,keys,valence
285419,40wb1DugiG4c4ztCt9oaWp,Me Like Yuh (feat. Hoody) [Korean Version],Everything You Wanted,Jay Park,2016-10-20,0.373,57,0.69,0.305,0.82,2e-06,0.138,-6.763,0.113,99.965,4.0,2,0.75
318662,5S5gGOsUarzRXKoaSstwba,Me Like Yuh,Everything You Wanted,Jay Park,2016-10-20,0.373,0,0.715,0.223,0.831,4e-06,0.132,-6.161,0.0729,100.006,4.0,2,0.805


In [188]:
# The first song feels REALLY similar, but
# the second one is waaaay off.
# The third seems to be like a podcast? And
# the 4th is a different genre, but it feels similar

allSongSearcher(getSongCosine(allSongs,features,285419))

MC Rich - Down the Coast
Billy Paul - Let 'Em In
Onyx - Hold Up
Andra - Camarero
Keko Salata - Pari kilometriä (feat. Diandra)
Sofiane - Nouveaux parrains
