# Content-based recommendation system

*Draft*

Video 1: https://www.youtube.com/watch?v=hOQg2LQM4ec&ab_channel=MiningMassiveDatasets \
Notes: 
- It would be interesting to mention the "long tail" phenomenon to explain why recommendation systems matter
- The key to recommendations is the utility matrix: the goal is to predict the values that are missing from it
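
To make the utility matrix concrete, here is a minimal sketch that builds one from (user, song, play count) triplets with pandas. The IDs and counts are invented for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy (user, song, play_count) triplets; the IDs are made up for illustration.
triplets = pd.DataFrame(
    {
        "userID": ["u1", "u1", "u2", "u3"],
        "songID": ["s1", "s2", "s2", "s3"],
        "play_count": [3, 1, 5, 2],
    }
)

# Rows are users, columns are songs. NaN marks the unknown entries,
# i.e. the values a recommender tries to predict.
utility = triplets.pivot_table(index="userID", columns="songID", values="play_count")
print(utility)
```

Most entries of a real utility matrix are missing, which is why the full triplets file is usually kept in its long (triplet) form rather than pivoted.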

Video 2 (Content based): https://www.youtube.com/watch?v=IlqnNWuqToo&ab_channel=MiningMassiveDatasets \

- Main idea of content-based filtering: recommend to customer x items similar to the items x has previously rated highly

- Plan of action: start with a user and find the set of items the user likes, using explicit and implicit data -> build item profiles -> infer the user profile -> match -> recommend

1. For each item, create an item profile. A profile is a set of features; it is convenient to think of it as a vector
2. User profile: the (weighted) average of the rated item profiles
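
The two steps above can be sketched on toy data as follows. The item names, feature vectors, ratings, and the helper `build_user_profile` are all hypothetical and only illustrate the weighted average.

```python
import numpy as np

# Hypothetical item profiles: one binary feature vector per item.
item_profiles = {
    "item_a": np.array([1.0, 0.0, 1.0]),
    "item_b": np.array([0.0, 1.0, 1.0]),
}
# Hypothetical ratings (or play counts) used as weights.
ratings = {"item_a": 3.0, "item_b": 1.0}


def build_user_profile(item_profiles, ratings):
    """Weighted average of the profiles of the items the user rated."""
    items = list(ratings)
    weights = np.array([ratings[item] for item in items])
    profiles = np.stack([item_profiles[item] for item in items])
    return np.average(profiles, axis=0, weights=weights)


toy_profile = build_user_profile(item_profiles, ratings)
print(toy_profile)  # weighted average: 0.75, 0.25, 1.0
```

A feature shared by every rated item (the third one here) ends up with weight 1, while features of rarely played items contribute little.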

### Read the data

Useful features:
- artist_terms (tags)
- similar_artists?
- time_signature
- year
- artist_mbtags
- mode (whether the track is estimated to be in a major or minor key)

In [99]:
# Imports

import pandas as pd
import os
import hdf5_getters as hdf5_getters
import numpy as np
from tqdm import tqdm

In [82]:
path = 'MillionSongSubset'
songs_list = []

for root, dirs, files in os.walk(path):
    for f in files:
        # Keep the file name itself; os.path.dirname(f) on a bare name returns ''
        songs_list.append(f)

print(len(songs_list))

10000


In [83]:
## READ DATA PATH FROM FILE
songs_file_paths = []

for root, dirs, files in os.walk(os.path.abspath(path)):
    for file in files:        
        strs = os.path.join(root, file)
        new_strs = strs.replace('\\','/')
        songs_file_paths.append(new_strs)

In [84]:
print(songs_file_paths[0])

/Users/alejandranavarrocastillo/Documents/GitHub/comp_tools/MillionSongSubset/A/R/R/TRARRZU128F4253CA2.h5


In [85]:
len(songs_file_paths)

10000

In [86]:
### CREATE PANDAS TABLE

N = len(songs_file_paths)
data = []

for i in range(N):
    record = []
    
    # Open specific song path
    h5 = hdf5_getters.open_h5_file_read(songs_file_paths[i])
    
    #getters
    #artist_id = hdf5_getters.get_artist_id(h5)
    #artist_id = artist_id.decode("utf-8")     
    
    #artist_name  = hdf5_getters.get_artist_name(h5)
    #artist_name = artist_name.decode("utf-8") 
    
    #artist_location  = hdf5_getters.get_artist_location(h5)
    #artist_location = artist_location.decode("utf-8") 
    
    song_id = hdf5_getters.get_song_id(h5)
    song_id = song_id.decode("utf-8")
    
    #song_name = hdf5_getters.get_title(h5)
    #song_name = song_name.decode("utf-8")
    
    #song_hottness = hdf5_getters.get_song_hotttnesss(h5)
    
    #time_signature = hdf5_getters.get_time_signature(h5)
    
    artist_terms_ = hdf5_getters.get_artist_terms(h5)
    artist_terms = []
    for j in range(len(artist_terms_)):
        artist_terms.append(artist_terms_[j].decode("utf-8"))
    
    #artist_mbtags = hdf5_getters.get_artist_mbtags(h5)
    
    #mode = hdf5_getters.get_mode(h5)
    
    #year = hdf5_getters.get_year(h5)
    
    #latitude = hdf5_getters.get_artist_latitude(h5)
    #longitude = hdf5_getters.get_artist_longitude(h5)
    
    # Close file
    h5.close()
    
    #record.append(artist_id)
    #record.append(artist_name)
    #record.append(artist_location)
    record.append(song_id)
    #record.append(song_name)
    #record.append(song_hottness)
    #record.append(time_signature)
    record.append(artist_terms)
    #record.append(artist_mbtags)
    #record.append(mode)
    #record.append(year)
    #record.append(latitude)
    #record.append(longitude)
    
    # Add
    data.append(record)

In [87]:
#df = pd.DataFrame(data, columns=['artist_id', 'artist_name', 'artist_location', 'song_id', 'song_name', 'song_hottness','time_signature','artist_terms','artist_mbtags','mode','year','latitude','longitude'])
df = pd.DataFrame(data, columns=['song_id','artist_terms'])
df

| | song_id | artist_terms |
|---|---|---|
| 0 | SOGSMXL12A81C23D88 | [chanson, visual kei, hip hop, pop rock, briti... |
| 1 | SOMBCOW12AAF3B229F | [chanson, dance pop, pop rock, soft rock, fema... |
| 2 | SOEYIHF12AB017B5F4 | [early music, celtic, mediaeval, folk, christm... |
| 3 | SODJYEC12A8C13D757 | [post-hardcore, doomcore, metalcore, screamo, ... |
| 4 | SOGSOUE12A58A76443 | [orchestra, musical theater, british, brazil, ... |
| ... | ... | ... |
| 9995 | SOWJQRH12AB0186761 | [country gospel, ccm, country, aor, adult cont... |
| 9996 | SOIFMVY12A8AE467B1 | [dance rock, pop rock, british pop, ballad, cl... |
| 9997 | SOGVDLQ12A58A7E3C5 | [hard rock, modern rock, glam metal, rock, hea... |
| 9998 | SOVVGSH12A8C14085F | [frevo, samba, banda, rockabilly, bossa nova, ... |


In [103]:
### GET USERS TASTE
#triples

user_plays = pd.read_csv('train_triplets.txt', sep='\t', names = ['userID','songID', 'play_count'])

### Compute similarity between songs

I'm going to try computing the similarity from the artist_terms of the songs. One likely problem: two songs by the same artist share the same terms, so their vectors will be identical and the songs will look exactly the same. We will see...

In [89]:
### REPRESENT SONGS AS VECTORS IN ORDER TO COMPUTE SIMILARITY
### We will use the feature artist_terms and apply a one-hot style (multi-hot) encoding.
### Each distinct term becomes its own binary column, set to 1 if the song
### has that term and 0 otherwise.

# First, extract the artist_terms
all_terms = []
for row in range(len(df)):
    all_terms.append(df['artist_terms'][row])
    
all_terms = np.concatenate(all_terms)
all_terms = sorted(set(all_terms)) # sorted list of the distinct artist_terms, so the column order is the same on every run

d = len(all_terms) # dimension of the vectors we are representing

In [90]:
# Now, we intend to create a binary vector (length = d) that represents a song, 
# with 1s if the song has this term and 0s if it hasn't.

def vectorize(song):

    index = int(df.index[df['song_id'] == song][0])
    vector = np.zeros(len(all_terms))

    for i in range(len(vector)):
        if all_terms[i] in df['artist_terms'][index]:
            vector[i] = 1
            
    return vector

In [91]:
# Vectorize ALL the songs and save it into a dictionary

vector_representation = {}

for song in df['song_id']:
    vector_representation[song] = vectorize(song)
    
#vector_representation # we end up with a dictionary of songs with their vector representation
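
As a possible alternative to scanning the whole vocabulary for every song, the sketch below looks up each term's column in a dictionary, so the work per song depends only on its number of terms. The song IDs and tags are toy values, and `term_to_col` and `multi_hot` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Toy stand-in for df['artist_terms']; the tags are invented for illustration.
song_terms = {
    "song_1": ["rock", "pop rock"],
    "song_2": ["folk", "celtic"],
    "song_3": ["rock", "folk"],
}

# Sorted vocabulary so the column order is reproducible between runs.
vocabulary = sorted({term for terms in song_terms.values() for term in terms})
term_to_col = {term: col for col, term in enumerate(vocabulary)}


def multi_hot(terms, term_to_col):
    """Binary vector with a 1 in the column of every term the song carries."""
    vector = np.zeros(len(term_to_col))
    for term in terms:
        vector[term_to_col[term]] = 1.0
    return vector


vectors = {song: multi_hot(terms, term_to_col) for song, terms in song_terms.items()}
print(vocabulary)
print(vectors["song_1"])
```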

In [92]:
# Compute the distance between every pair of songs and store it in a matrix

# Euclidean distance: a smaller value means the two songs are more similar
# (Cosine similarity would be an alternative)
def euclidean_distance(song1 , song2):
    a = vector_representation[song1]
    b = vector_representation[song2]
    dist = np.linalg.norm(a-b)
    return dist


# Pairwise distance matrix (entry [i, j] is the distance between song i and song j)
similarities = np.zeros((len(df),len(df)))
i = 0
for song1 in df['song_id']:
    j = 0
    for song2 in df['song_id']:
        dist = euclidean_distance(song1,song2)
        similarities[i,j] = dist
        j += 1
    i += 1
    
#print(similarities)
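
The nested loop above calls the distance function once per pair, which is 10,000 x 10,000 = 10^8 calls for this subset. A hedged sketch of a vectorized version, together with the cosine alternative mentioned in the comment, is shown below on a toy matrix. `pairwise_euclidean` and `pairwise_cosine` are hypothetical helpers, and the matrix `X` is made up.

```python
import numpy as np

# Toy multi-hot matrix: one row per song, one column per term (made-up values).
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],  # same tags as the first song
    [0.0, 0.0, 1.0, 1.0],
])


def pairwise_euclidean(X):
    """All-pairs Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding


def pairwise_cosine(X):
    """All-pairs cosine similarity; all-zero rows get similarity 0."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
    return unit @ unit.T


distances = pairwise_euclidean(X)
cosines = pairwise_cosine(X)
print(distances)
print(cosines)
```

The first two rows share the same tags, so their distance is 0 and their cosine similarity is 1. This is the same-artist effect anticipated earlier: songs by one artist cannot be told apart using artist_terms alone.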

### Get the user profile

In [209]:
# Get the user profile: a vector of dimension d computed as the play-count-weighted average of the songs the user has played

user = '5a905f000fc1ff3df7ca807d57edb608863db05d'

song_counts = list(user_plays[user_plays['userID'] == user]['play_count'])
song_list = list(user_plays[user_plays['userID'] == user]['songID'])


# Check which of the user's songs are in the song dataset
# and collect their positions in song_list
known_songs = set(df['song_id'])  # build the lookup once instead of on every iteration
indices = []
for position, song in enumerate(song_list):
    if song in known_songs:
        print()
        print('yes', song, vector_representation[song])
        indices.append(position)
        print(indices)


# Get the User profile (i.e. Compute the (weighted) average of the songs of a user)
a = [song_counts[i] for i in indices]
b = [vector_representation[song_list[i]] for i in indices]

numerator = np.zeros(d)
for i in range(len(a)):
    numerator = numerator + ( a[i] * np.asarray(b[i]) )

user_profile = numerator / sum(a)
user_profile


yes SOEBCBI12AF72A154F [0. 0. 0. ... 0. 0. 0.]
[72]

yes SOFKTPP12A8C1385CA [0. 0. 0. ... 0. 0. 0.]
[72, 100]

yes SOGJPMB12A8C13A9DB [0. 0. 0. ... 0. 0. 0.]
[72, 100, 112]

yes SOUCKDH12A8C138FF5 [0. 0. 0. ... 0. 0. 0.]
[72, 100, 112, 361]

yes SOZOEYP12AB0188C9D [0. 0. 0. ... 0. 0. 0.]
[72, 100, 112, 361, 447]


array([0., 0., 0., ..., 0., 0., 0.])

### Recommendation

We will recommend the songs in our dataset that are most similar to the user profile, i.e. those at the smallest distance from it.

In [242]:
scores_dict = {}
for song in df['song_id']:
    dist = np.linalg.norm(user_profile - vector_representation[song])
    scores_dict[song] = dist

scores_dict_key = list(scores_dict.keys())
scores_dict_val = list(scores_dict.values())

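# R best recommendations (lowest distance first)
R = 10
for score in np.sort(scores_dict_val)[0:R]:
    # Note: list.index returns the first match, so songs with tied scores
    # print the same song ID more than once
    indices = scores_dict_val.index(score) # position of this score among the dictionary values
    print("Song:", scores_dict_key[indices], 'with score', score)
    
# TODO: find the titles for these songs in the initial dataset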

Song: SOEBCBI12AF72A154F with score 2.689127174136189
Song: SOPEEDH12A58A7C407 with score 3.587167210354338
Song: SOPEEDH12A58A7C407 with score 3.587167210354338
Song: SOHYWLZ12A6D4FBB22 with score 3.59981634058606
Song: SOHYWLZ12A6D4FBB22 with score 3.59981634058606
Song: SOHYWLZ12A6D4FBB22 with score 3.59981634058606
Song: SOFIRFY12AC909716A with score 3.662406741542488
Song: SONCEJF12AB018581C with score 3.6624067415424886
Song: SONCEJF12AB018581C with score 3.6624067415424886
Song: SONCEJF12AB018581C with score 3.6624067415424886
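
Several song IDs are repeated in the output above because tied distances all resolve to the first matching song. The top result is also a song the user has already played. One way to handle both, sketched here on toy data, is to sort the (song, distance) pairs directly and skip played songs. `top_recommendations` and the toy IDs are hypothetical, not part of the notebook.

```python
import numpy as np


def top_recommendations(user_profile, vectors, already_played=(), r=10):
    """Return the r (song, distance) pairs closest to the user profile."""
    played = set(already_played)
    scored = [
        (song, float(np.linalg.norm(user_profile - vec)))
        for song, vec in vectors.items()
        if song not in played
    ]
    # Sorting the pairs keeps each song exactly once, even when distances tie.
    scored.sort(key=lambda pair: pair[1])
    return scored[:r]


# Toy data with made-up song IDs; "b" and "c" are tied on purpose.
toy_vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([1.0, 1.0]),
}
toy_user = np.array([1.0, 0.0])
recs = top_recommendations(toy_user, toy_vectors, already_played=["a"], r=3)
for song, dist in recs:
    print("Song:", song, "with score", dist)
```

In the notebook this could be called with `user_profile`, `vector_representation`, and the user's `song_list`, assuming those names are still in scope.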
