# 4.D. Recommender -- Combined Approach

So far, we have made the following versions of recommendation systems, each with their own respective pros and cons:

 - **Collaborative**, using Genius.com User/Artist follower data
  - *Pros*: Accurate and effective recommendations based on user "behaviors." With music, it can be hard to determine what a user wants to listen to. But, there is latent information in follower data.
  - *Cons*: Artist level
  - **Novelty Level: MED**
  
  
 - **Item-to-Item, Album Based**, using our combined album review data
  - *Pros*: Let professionals do the talking. These are the folks that consider themselves to be rap afficionados. The way that they dissect each album can be compared across albums to reveal similarities (i.e. politics in the case of Eminem and Dead Prez). 
  - *Cons*: Album level, the approach does not take into account the songs themselves (for example, if I want a sad song, or a loud song); Often ends up being the case that multiple albums from an established artist crowd the recs
  - **Novelty Level: LOW**
  
  
  
 - **Item-to-Item, Track Based**, using our combined lyric and audio feature data
  - *Pros*: A very deep recommendation approach which delivers a track recommendation based on a variety of features including sentiment of lyrics, tempo, how much of a track is vocal vs instrumentals, production, etc.
  - *Cons*: High Dimensionality can obscure some of the features. In an ideal world, we would be able to determine that User A cares more about lyrics than User B, and weight these features accordingly. In addition to this, we found this approach had extremely high novelty. This is a great option for uncovering new tracks and songs aligned to our tastes, but should not be our sole approach.

  - **Novelty Level: HIGH**
  
In this workbook, we'll attempt to combine these approaches and get "the best of both (or in this case three) worlds" out of our recommender. We want to retain the novelty we are seeing with our Track based approach, while also grounding our recommendations in the artists and albums that users and critics deem similar. Workflow:

 1. Build Collaborative Recommender
 2. Build Album-Recommender
 3. Build Track Recommender
 4. For a given song, generate track-level recommendations
 5. Overlay track recommendations with artists recs from collaborative filter
 6. Overlay track recommendations with albums recs from album-to-album recommender


In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import lyricsgenius
import re
import sys
import spotipy
import spotipy.util as util
from sklearn.compose import ColumnTransformer, make_column_transformer
from pyjarowinkler import distance
import nltk

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns
import warnings
import matplotlib.pyplot as plt

import nltk
from nltk.stem import PorterStemmer
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import recall_score
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from gensim import corpora, models



### Collaborative Recommender

In [2]:
#Read in our user data
users = pd.read_csv('follower_genius_500k.csv')
users = users.drop(columns=['follower_role'])

#clean up our artist_names
users['artist_clean'] = users['artist_name'].str.replace('$', 'S')
users['artist_clean'] = users['artist_clean'].str.replace("’", "")
users['artist_clean'] = users['artist_clean'].str.replace('&', 'and')
users['artist_clean'] = users['artist_clean'].str.replace('!','') 
users['artist_clean'] = users['artist_clean'].str.replace('([ ]{2,})', ' ', regex = True) 
users['artist_clean'] = users['artist_clean'].str.strip()
users['artist_clean'] = users['artist_clean'].str.lower()

#drop missing rows
users = users.dropna()

In [3]:
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances

#pivot out our user data
user_pivot = users.pivot_table(index='artist_clean', columns = ['follower_name'], values = 'follow')

#convert our data into a sparse matrix
user_pivot_sparse = sparse.csr_matrix(user_pivot.fillna(0))

#Use parwise distance / cosine distance on our sparse pivot; take results and push to a Datframe
collab_recommender = pairwise_distances(user_pivot_sparse, metric='cosine')
collab_recommender_df = pd.DataFrame(collab_recommender, columns=user_pivot.index, index=user_pivot.index)

In [21]:
user_pivot.shape

(1166, 150996)

In [376]:
user_pivot.head()

follower_name,$hamyr,$haz,$mokeyyy,$wag Hip-Hop,",KKKKKKKLLLKK",.z3r0.,0-0-J_D_O_N-0-0,0-MXXL-0,0-Schwa-0,0-efray-0,...,‌bluii,‍ lyra messier,⁠kartashov,☀️ ✼ ☆ KayneTheDog ☆ ✼ ☀️,☣̭̰♜͙̟͇͓́ɤ͏̪̦̬̻͚̗̲Վ̘̰̥̫̜͔a̻͕͞ņ̠̟̩̼̰̲̰͞ ̖̘͇͍͎ͅ'̠̠̻̹M͚̠̣̺̬͕͞.̡̫̗O͎̠̺̪̻̺͚͟.̖̖͍̟͍̜͔M̘̖͙̝͠.̯͙͎͈̗̯͜T̺.̟̮̫̜̘͞'͕̜͢ ̛̳̫̗̬͈V͚̰̬̦༏̢̝̭̰ͅc̴̩̰̺̤̞͇͖e̹̫̮̳̥̣̘♜͙̟͇͓́☣̭̰,✞☭ Perc Nowitzki ☭✞,✰MAGZEN✰,🌴TruSwag🌴,💎 YBN Mystique 💎,🚀👑ASTROWORLD
artist_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
(hed) p.e.,,,,,,,,,,,...,,,,,,,,,,
03 greedo,,,,,,,,,,,...,,,,,,,,,,
070 shake,,,,,,,,,,,...,,,,,,,,,,
100s,,,,,,,,,,,...,,,,,,,,,,
2 chainz,,,,,,,,,,,...,,,,,,,,,,1.0


### Album Recommender

In [4]:
#read in our album data, drop nulls, reset index
albums = pd.read_csv('./combining/albumdata/review_data_for_recommender.csv')
albums = albums[albums.combined_reviews_clean.notnull()]
albums = albums.reset_index()

#read our TfidfVectorizer
tfidf = TfidfVectorizer(
    min_df = 2,
    max_features = 1000,
    ngram_range=(1, 2),
)

#build our set of Tfidf Features
tfidf_matrix = tfidf.fit_transform(albums.combined_reviews_clean)

In [22]:
tfidf_matrix.shape

(7523, 1000)

In [262]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

#general cosine similarities from tfidf features
cosine_similarities = linear_kernel(tfidf_matrix,tfidf_matrix) 

#ready a dict to storoue our results
#for all entries, grab the first 100 most similar entries and store them in a list
album_results = {}
for idx, row in albums.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-200:-1] 
    similar_items = [(cosine_similarities[idx][i], albums['artist_album_clean_key'][i]) for i in similar_indices] 
    album_results[row['artist_album_clean_key']] = similar_items[1:]

In [6]:
#This function will return the track information when provided an id
def item(id):  
    return albums.loc[albums['artist_album_clean_key'] == id]['artist_album_clean_key']

def recommend_album(item_id, num):   
    recs = album_results[item_id][:num]   
    return(recs)

### Track Data

In [23]:
#Read in our song data
songs = pd.read_csv('cleaned_lyrics_audio_topics.csv')
songs.columns
songs = songs.drop(columns=['album', 'artist','lyrics','album_id', 'album_name', 'artist_id',
                           'duration_ms', 'id','mode', 'preview_url','time_signature',
                           'track_href', 'track_name', 'uri','split_lyrics',
                           'cleaned_lyrics','cleaned_lyrics_for_sentiment', ])

In [24]:
#We want to avoid overfitting our recommender based on sentiment data. 
#Our exploration of this data has shown some potential inaccuracy. Drop the extra columns, in addition
#to some other numerical features we don't believe are necessary
cols_to_drop = ['track_sad_words', 'track_angry_words', 'track_joy_words', 'track_ant_words', 'track_trust_words',
'track_fear_words', 'track_disgust_words',
'track_surprise_words', 'track_unique_words', 
'track_total_words','artist_vocab_size','number_lines', 'tempo_x', 'valence']


songs=songs.drop(columns=cols_to_drop)

In [25]:
#Let's see what we're working with.
songs.select_dtypes(exclude='object').columns

Index(['date', 'acousticness', 'danceability', 'energy', 'instrumentalness',
       'key', 'liveness', 'loudness', 'speechiness', 'pop', 'follower',
       'track_unique_words_pct', 'track_complexity', 'artist_vocab_complexity',
       'track_rhyme_density', 'sentiment_track_neg', 'sentiment_track_pos',
       'sentiment_track_neu', 'sentiment_track_comp', 'tempo_y', 'chroma_stft',
       'spec_cent', 'spec_bw', 'rolloff', 'zcr', 'mfcc', 'hustle_pct',
       'roots_pct', 'lust_pct', 'fun_pct', 'reflection_pct',
       'storytelling_pct', 'drugs_pct', 'skill_pct', 'aspirational_pct',
       'love_pct'],
      dtype='object')

In [28]:
#ready a unique key for our tracks
songs['artist_album_track'] = songs['artist_clean'] +'_' + songs['album_name_clean'] +'_' + songs['track_clean']

#drop non numericals
songs_for_rec = songs.select_dtypes(exclude='object')

#We're going to scale our data. do this prior to assessing similarity
sc = StandardScaler()
songs_for_rec_sc = sc.fit_transform(songs_for_rec)

In [29]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity 

#Same general approach here for tracks. However, with our hyrbid filtering approach, we
#will take a much bigger candidate list of tracks. We'll filter on top of it.
cosine_similarities = cosine_similarity(songs_for_rec_sc, songs_for_rec_sc) 
track_results = {}
for idx, row in songs.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-1000:-1] 
    similar_items = [(cosine_similarities[idx][i], songs['artist_album_track'][i]) for i in similar_indices] 
    track_results[row['artist_album_track']] = similar_items[1:]

In [30]:
def item(id):  
    return songs.loc[songs['artist_album_track'] == id]['artist_album_track'] 
def recommend_track(item_id, num):   
    for rec in track_results[item_id][:num]: 

SyntaxError: unexpected EOF while parsing (<ipython-input-30-86c035fde309>, line 4)

### Simulating Track Recommender -- Kendrick

Let's take a look at our output for Kendrick Lamar. We'll begin by building our candidate track list from our track features. We'll then read in our results from our album review recommender and our collaborative filter. We'll flag where we see overlap and sort accordingly.

In [106]:
artist_album_track = 'kendrick lamar_to pimp a butterfly_alright'

In [263]:
#build empty lists to store results
rec_arts =[]
rec_albs = []
rec_songs = []

#grab our results for our track
for rec in track_results[artist_album_track]:
    
    #split and store our general information
    artist=rec[1].split("_")[0]
    album=rec[1].split("_")[1]
    track=rec[1].split("_")[2]
    rec_arts.append(artist)
    rec_albs.append(album)
    rec_songs.append(track)

#build a dataframe from the results
temp_track_results = pd.DataFrame(zip(rec_arts,rec_albs,rec_songs)).rename(columns={0: "artist", 1: "album", 2: "track"})

#Try grabbing a list of the top 10 related artists. It may be that we don't have an exact string match, so 
#list will be empty if this is the case. We'll then flag if we found that artist
try:
    top_related_artists = list(collab_recommender_df[artist].sort_values(ascending=True)[:20].index)
except:
    top_related_artists = []
    
temp_track_results['artist_based_rec'] = 0
temp_track_results.loc[temp_track_results.artist.isin(top_related_artists),'artist_based_rec'] = 1

album_based_album_recs = []
album_based_artist_recs = []

#Try grabbing a list of the top related albums. It may be that we don't have an exact string match, so 
#list will be empty if this is the case. We'll then flag if we found that artist
try:
    for entry in recommend_album(artist + '|' + album,200):
        album_based_artist_recs.append(entry[1].split('|')[0])
        album_based_album_recs.append(entry[1].split('|')[1])
except:
    pass

#Overlay our results in the dataframe
temp_track_results['album_based_artist_rec'] = 0
temp_track_results['album_based_album_rec'] = 0
temp_track_results.loc[temp_track_results.artist.isin(album_based_artist_recs),'album_based_artist_rec'] = 1
temp_track_results.loc[temp_track_results.album.isin(album_based_album_recs),'album_based_album_rec'] = 1


In [264]:
#See the results
temp_track_results[(temp_track_results['artist_based_rec'] == 1) | (temp_track_results['album_based_artist_rec'] == 1) | (temp_track_results['album_based_album_rec'] == 1)].drop_duplicates().head(20)

Unnamed: 0,artist,album,track,artist_based_rec,album_based_artist_rec,album_based_album_rec
7,sheck wes,mudboy,kyrie,1,0,0
10,lil pump,lil pump,back,0,1,0
24,smokepurpp,deadstar,nose,1,1,1
29,megan thee stallion,tina snow,cocky af,0,1,0
34,lil keed,keed talk to em,red hot,1,0,0
36,rico nasty,sugar trap 2,phone,0,1,0
38,megan thee stallion,tina snow,cognac queen,0,1,0
39,danny brown,xxx,i will,0,1,0
46,megan thee stallion,make it hot,geekin,0,1,0
47,comethazine,bawskee,sticks out the window,1,0,0


We're still dealing with a LOT of repetition at the artist level. We're going to balance our output across recommenders to avoid this

### Balance Recommender and Build Random Tester

Here we're going to do two things:
 - There are pros and cons to our output across recommender. We'll ensure that 50% of our recs are coming from a collaborative filter, and that none of the artists here are being repeated. Next, we'll grab 30% of our tracks from our album review data, and ensure that no duplication is occuring here. Finally, we'll slow in the remaining 20% with the most similar tracks based on lyrical and audio data which shows the greatest promise for novelty.
 
 - We'll build a function to randomly select a track and generate a test for us.

In [148]:
def current_track():
    #We're going to randomly grab a track this time, so we can iterate and check the results. 
    #this will inform our presentation
    random_song = np.random.choice(list(songs['artist_album_track'].values))
    random_track_id = songs[songs.artist_album_track == random_song]['track_id']

    song = random_song.split('_')[2]
    artist = random_song.split('_')[0]
    album = random_song.split('_')[1]
    track_id= random_track_id
    artist_album_track = artist + '_' + album + '_' + song
    return(artist_album_track)

In [319]:
songs[songs.artist_album_track == artist_album_track]['artist_clean'].str.upper()
songs[songs.artist_album_track == artist_album_track]['album_name_clean'].str.upper()
songs[songs.artist_album_track == artist_album_track]['song'].str.upper()

15803    MARY JANE
Name: song, dtype: object

In [370]:
def generate_recs():
    
    #build empty lists to store results
    rec_arts =[]
    rec_albs = []
    rec_songs = []

    #grab our current track info
    artist_album_track = current_track()
    artist_name_for_print = songs[songs.artist_album_track == artist_album_track]['artist_clean'].str.upper().head(1).values
    album_name_for_print = songs[songs.artist_album_track == artist_album_track]['album_name_clean'].str.upper().head(1).values
    track_name_for_print = songs[songs.artist_album_track == artist_album_track]['song'].str.upper().head(1).values
    
    #howdy user
    print('Hello! You are currently listening to...')
    print('')
    print(f'ARTIST: {artist_name_for_print}')
    print(f'ALBUM: {album_name_for_print}')
    print(f'TRACK: {track_name_for_print}')
    print('')
    print('We recommend checking out...')

    
    #grab our results for our track
    for rec in track_results[artist_album_track]:

        #split and store our general information
        artist=rec[1].split("_")[0]
        album=rec[1].split("_")[1]
        track=rec[1].split("_")[2]
        rec_arts.append(artist)
        rec_albs.append(album)
        rec_songs.append(track)

    #build a dataframe from the results
    temp_track_results = pd.DataFrame(zip(rec_arts,rec_albs,rec_songs)).rename(columns={0: "artist", 1: "album", 2: "track"})

    #Try grabbing a list of the top 10 related artists. It may be that we don't have an exact string match, so 
    #list will be empty if this is the case. We'll then flag if we found that artist
    try:
        top_related_artists = list(collab_recommender_df[artist].sort_values(ascending=True)[:20].index)
    except:
        top_related_artists = []

    temp_track_results['artist_based_rec'] = 0
    temp_track_results.loc[temp_track_results.artist.isin(top_related_artists),'artist_based_rec'] = 1

    album_based_album_recs = []
    album_based_artist_recs = []

    #Try grabbing a list of the toprelated albums. It may be that we don't have an exact string match, so 
    #list will be empty if this is the case. We'll then flag if we found that artist
    try:
        for entry in recommend_album(artist + '|' + album,200):
            album_based_artist_recs.append(entry[1].split('|')[0])
            album_based_album_recs.append(entry[1].split('|')[1])
    except:
        pass

    #Overlay our results in the dataframe
    temp_track_results['album_based_artist_rec'] = 0
    temp_track_results['album_based_album_rec'] = 0
    temp_track_results.loc[temp_track_results.artist.isin(album_based_artist_recs),'album_based_artist_rec'] = 1
    temp_track_results.loc[temp_track_results.album.isin(album_based_album_recs),'album_based_album_rec'] = 1

    #Our final recs will be combined here
    final_recs = {'artist':[], 'album':[], 'track':[], 'artist_based_rec':[], 'album_based_artist_rec':[], 'album_based_album_rec':[]} 
    final_recs = pd.DataFrame(final_recs)
    
    #grab a variety of tracks from our recommenders and deduplicate at the artist level to increase diversity
    final_recs = final_recs.append(temp_track_results[(temp_track_results['artist_based_rec'] == 1)].drop_duplicates(subset='artist').head(5))
    final_recs = final_recs.append(temp_track_results[(temp_track_results['album_based_artist_rec'] == 1)].drop_duplicates(subset='artist').head(3))
    final_recs = final_recs.append(temp_track_results.drop_duplicates(subset='artist').head(20))

    final_recs = final_recs.drop_duplicates().head(10).sample(frac=1).reset_index(drop=True).drop(columns=['album_based_album_rec'])
    final_recs['combined_flag'] = final_recs['album_based_artist_rec'] + final_recs['artist_based_rec']
    final_recs = final_recs.sort_values(by='artist_based_rec', ascending=False).sort_values(by='combined_flag', ascending=False).drop(columns=['combined_flag'])
    final_recs.columns = ['artist', 'album', 'track', 'follower flag', 'review flag']
    final_recs = final_recs.apply(lambda x: x.astype(str).str.upper())
    final_recs = final_recs.apply(lambda x: x.astype(str).str.upper())
    
    display(final_recs)

## Rec Simulator

In [424]:
warnings.simplefilter('ignore')
generate_recs()

Hello! You are currently listening to...

ARTIST: ['KUTT CALHOUN']
ALBUM: ['FEATURE PRESENTATION']
TRACK: ['COLORS']

We recommend checking out...


Unnamed: 0,artist,album,track,follower flag,review flag
2,SWINGS,"#1 MIXTAPE, VOL. 2",NO MERCY,1.0,0.0
3,CRUCIAL STAR,MAZE GARDEN,SINGER SONGWRITER,1.0,0.0
5,THE QUIETT,MILLIONAIRE POETRY,PRIME TIME,1.0,0.0
7,BEWHY,THE BLIND STAR,WRIGHT BROTHERS,1.0,0.0
0,TRAGEDY KHADAFI,THUG MATRIX II,WHATS POPPIN,0.0,0.0
1,ANT BANKS,DO OR DIE,BAY AREA MASSACRE,0.0,0.0
4,BIG MELLO,THE GIFT,KMJ KILLAS,0.0,0.0
6,INDO G,ANGEL DUST,AINT NO BITCH IN MY BLOOD,0.0,0.0
8,QWEL,SO BE IT,WHITE ELEPHANT,0.0,0.0
9,MR. CAPONE-E,LOVE JAMS,STILL MISSING YOU,0.0,0.0
