# Music Recommendation System

This is a project for the Aplica- course.

Year 2017, first period.

Students:
- Diego Vargas
- Andre Pando
- Ronie Arauco

## Getting familiar with the dataset

In [1]:
import numpy as np
import pandas as pd
import codecs
# import matplotlib.pyplot as plt
# %matplotlib inline

artists = pd.read_table("./lastfm-data/artists.dat", encoding = 'latin1')
tags = pd.read_table("./lastfm-data/tags.dat", encoding = 'latin1')
user_artists = pd.read_table("./lastfm-data/user_artists.dat", encoding = 'latin1')
user_taggedartists = pd.read_table("./lastfm-data/user_taggedartists.dat",encoding = 'latin1', usecols=['userID', 'artistID', 'tagID'])
user_friends = pd.read_table("./lastfm-data/user_friends.dat",encoding = 'latin1')


# Information taken from
#    Last.fm website, http://www.lastfm.com
#
#    @inproceedings{Cantador:RecSys2011,
#       author = {Cantador, Iv\'{a}n and Brusilovsky, Peter and Kuflik, Tsvi},
#       title = {2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)},
#       booktitle = {Proceedings of the 5th ACM conference on Recommender systems},
#       series = {RecSys 2011},
#       year = {2011},
#       location = {Chicago, IL, USA},
#       publisher = {ACM},
#       address = {New York, NY, USA},
#       keywords = {information heterogeneity, information integration, recommender systems},
#    } 

In [None]:
# Information about the artists that users have listened to
# and tagged
# id \t name \t url \t pictureURL
artists.sample(3)

In [None]:
# The tags available in the dataset
# tagID \t tagValue
# tags.shape
tags.sample(3)

In [None]:
# Artists listened to by each user, together with
# the listening count for each [user, artist] pair
# userID \t artistID \t weight
# user_artists.shape
user_artists.sample(3)
#user_artists

In [None]:
# Tag assignments of artists made by each user,
# together with the date on which each tag was assigned
# userID \t artistID \t tagID \t day \t month \t year
# user_taggedartists.shape
user_taggedartists.sample(3)

In [None]:
# Contains the friend relations between users in the database
# userID \t friendID
# user_friends.shape
user_friends.sample(3)

Object | Shape
--- | ---
artists | (17632, 4)
tags | (11946, 2)
user_artists | (92834, 3)
user_taggedartists | (186479, 6)
user_friends | (25434, 2)

In [None]:
# Which artists have the most and the fewest listeners?

# - Most listeners
listeners_agg = user_artists[['artistID','userID']].groupby('artistID', sort=False).agg(['count'])
print("artists with least followers")
print(listeners_agg['userID'].sort_values('count').head(3)) #-- least 9201
print("--------------------")
print("artists with most followers")
print(listeners_agg['userID'].sort_values('count').tail(3)) #-- most 89

# And how many plays do they have?
listens_agg = user_artists[['artistID', 'weight']].groupby(['artistID']).agg(['sum'])
print("--------------------")
print("Amount of plays for the artist with least followers")
print(listens_agg.filter(regex='^9201$',axis=0)) # -- least 139 plays
print("--------------------")
print("Amount of plays for the artist with most followers")
print(listens_agg.filter(regex='^89$',axis=0)) # -- most 1291387 plays
# What are the tags made by those users?



# Which artists have the most and the fewest plays?
# (the minimum can't be 0, according to the description of the artist dataset)
print("artist with least plays")
print(listens_agg['weight'].sort_values('sum').head(3)) # -- least 14371
print("--------------------")
print("artist with most plays")
print(listens_agg['weight'].sort_values('sum').tail(3)) # -- most 2393140

# And how many users account for those plays?
print("--------------------")
print("Amount of users for the artist with least plays")
print(listeners_agg.filter(regex='^14371$',axis=0)) # -- artist with the fewest plays
print("--------------------")
print("Amount of users for the artist with most plays")
print(listeners_agg.filter(regex='^289$',axis=0)) # -- artist with the most plays

# Which tags are used the most and the least?
# Which artists are tagged the most and the least?
# Which users tag the most and the least?

### The problem
The dataset doesn't contain a rating column. Instead, it has a _weight_ for each artist and user, which works as a _listen_ counter. As a result, some artists have a high number of plays but few listeners, and vice versa.

So, for this solution, the number of plays has to be expressed relative to the number of users.
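
One possible reading of "relative to the number of users" is plays per listener. The sketch below shows that idea on a made-up table; `toy_user_artists` and the column `plays_per_listener` are illustrative names, not part of the dataset or of the notebook.

```python
import pandas as pd

# Toy stand-in for user_artists (all values are invented for illustration).
toy_user_artists = pd.DataFrame({
    "userID":   [1, 1, 2, 2, 3],
    "artistID": [10, 20, 10, 30, 10],
    "weight":   [100, 900, 50, 30, 150],
})

# Total plays and number of distinct listeners for each artist.
per_artist = toy_user_artists.groupby("artistID").agg(
    plays=("weight", "sum"),
    listeners=("userID", "nunique"),
)

# Plays expressed relative to the number of listeners.
per_artist["plays_per_listener"] = per_artist["plays"] / per_artist["listeners"]
print(per_artist)
```

In this toy table, artist 20 has more total plays than artist 10 but only one listener, which is exactly the situation described above.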

The following graph shows what the data looks like.

![Graph](graph.png)

One option is **content-based filtering**, since the dataset we have consists of a set of users and a set of categories (keywords or tags). Users and artists are compared through the keywords extracted from the artists' tags. Each user should have a degree of interest in certain tags, which can be derived from the tags most often applied to the artists the user listens to most (see Table 1). That said, we can only recommend artists to the users already in the dataset.


| Tag  | $U_1$ | $U_2$ | $U_3$ | $U_x$ |
|------|-----|-----|-----|-----|
| $Tag_1$ |  3  |  2  |     |     |
| $Tag_2$ |  5  |  3  |  3  |     |
| $Tag_3$ |     |  3  |  5  |  4  |
| $Tag_4$ |  1  |     |  5  |  4  |


#### How to retrieve the interest (ideas)
The interest can be retrieved from the following table (which belongs to a single user):

<table>
    <thead>
        <tr>
            <th>Plays $P_i$</th>
            <th>Artist $A_i$</th>
            <th>Tag $T_j$</th>
            <th>Weight $W_{ij}$</th>
        </tr>
    </thead>
    <tbody>
        <tr>
             <td rowspan="2">$P_1$</td>
             <td rowspan="2">$A_1$</td>
             <td>$T_1$</td>
             <td>$W_{11}$</td>
        </tr>
        <tr>
             <td>$T_2$</td>
             <td>$W_{12}$</td>
        </tr>
        <tr>
             <td rowspan="2">$P_2$</td>
             <td rowspan="2">$A_2$</td>
             <td>$T_2$</td>
             <td>$W_{22}$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$W_{23}$</td>
        </tr>
        <tr>
             <td rowspan="4">$P_3$</td>
             <td rowspan="4">$A_3$</td>
             <td>$T_2$</td>
             <td>$W_{32}$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$W_{33}$</td>
        </tr>
        <tr>
             <td>$T_4$</td>
             <td>$W_{34}$</td>
        </tr>
        <tr>
             <td>$T_5$</td>
             <td>$W_{35}$</td>
        </tr>
        
    </tbody>
</table>

Here $P_i$ is the number of times the user has played artist $A_i$ (the _weight_ column), and $W_{ij}$ is the number of users who have tagged artist $A_i$ with tag $T_j$.

From the table we know what share of the user's plays goes to each artist:

$$listenShare_i = \frac{P_i}{\sum_{k = 1}^{N}P_k}$$

And for each tag:

$$tagShare_j = \sum_{i = 1}^{N}\frac{W_{ij} \cdot listenShare_i}{\sum_{z=1}^{M}W_{iz}}$$

We can then express the interest on a scale from 0 to 5 by rescaling $tagShare_j$ so that the maximum value, $\max_j(tagShare_j)$, maps to 5. $N$ is the number of artists and $M$ is the number of tags.
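
Written out, that rescaling could look as follows. This is a sketch: the symbol $interest_j$ is introduced here for illustration and is not part of the notation above.

$$interest_j = 5 \cdot \frac{tagShare_j}{\max_{k} tagShare_k}$$

With this definition the tag with the largest share gets exactly 5, and every other tag gets a proportionally smaller value.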

As an example, say we have the following data for user $U_x$:

<table>
    <thead>
        <tr>
            <th>Plays $P_i$</th>
            <th>Artist $A_i$</th>
            <th>Tag $T_j$</th>
            <th>Weight $W_{ij}$</th>
        </tr>
    </thead>
    <tbody>
        <tr>
             <td rowspan="2">$150$</td>
             <td rowspan="2">$A_1$</td>
             <td>$T_1$</td>
             <td>$30$</td>
        </tr>
        <tr>
             <td>$T_2$</td>
             <td>$15$</td>
        </tr>
        <tr>
             <td rowspan="2">$45$</td>
             <td rowspan="2">$A_2$</td>
             <td>$T_2$</td>
             <td>$13$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$7$</td>
        </tr>
        <tr>
             <td rowspan="4">$15$</td>
             <td rowspan="4">$A_3$</td>
             <td>$T_2$</td>
             <td>$45$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$15$</td>
        </tr>
        <tr>
             <td>$T_4$</td>
             <td>$16$</td>
        </tr>
        <tr>
             <td>$T_5$</td>
             <td>$6$</td>
        </tr>
        
    </tbody>
</table>


Using the $tagShare$ formula, we get the interest of user $U_x$ in each tag:

| Tag | $tagShare$ | Interest |
| -- | -- | -- |
| $T_1$ | $0.476$ | $5.000$ |
| $T_2$ | $0.417$ | $4.374$ |
| $T_3$ | $0.088$ | $0.925$ |
| $T_4$ | $0.014$ | $0.146$ |
| $T_5$ | $0.005$ | $0.055$ |
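
The numbers in this table can be reproduced with a few lines of plain Python. The sketch below hard-codes the example data for $U_x$; the names `plays`, `tag_weights`, `example_tag_share` and `interest` are only illustrative.

```python
# Example data for user U_x: plays per artist and tag weights per artist.
plays = {"A1": 150, "A2": 45, "A3": 15}
tag_weights = {
    "A1": {"T1": 30, "T2": 15},
    "A2": {"T2": 13, "T3": 7},
    "A3": {"T2": 45, "T3": 15, "T4": 16, "T5": 6},
}

total_plays = sum(plays.values())
example_tag_share = {}
for artist, p in plays.items():
    listen_share = p / total_plays                   # listenShare_i
    total_w = sum(tag_weights[artist].values())      # sum of W_iz over the artist's tags
    for tag, w in tag_weights[artist].items():
        contribution = w * listen_share / total_w
        example_tag_share[tag] = example_tag_share.get(tag, 0.0) + contribution

# Rescale so that the largest tagShare maps to an interest of 5.
scale = 5 / max(example_tag_share.values())
interest = {tag: share * scale for tag, share in example_tag_share.items()}

for tag in sorted(example_tag_share):
    print(f"{tag}: tagShare={example_tag_share[tag]:.3f} interest={interest[tag]:.3f}")
```

Because every artist's tag weights are normalized to 1 and the listen shares also sum to 1, the $tagShare$ values of a user always add up to 1.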

     
We are given a user ($U_x$), their interests represented by tags ($T_1, T_2, ..., T_n$), and the relevance of each tag as a value from 0 to 5. We are also given artists ($A_1, A_2, ..., A_n$), which are described by the tags users have assigned to them. The procedure is then:

1. Given a user ($U_1$), calculate the interest associated with each of their tags.
2. For each artist, weigh the user's top 3 tags that also appear on the artist ($A_1$): taking all the tags assigned to $A_1$ as the unit, the user's tags 1, 2, and 3 might account for, say, 40% of the total.
3. Finally, recommend the artist for which the weight of the user's 3 main tags is the highest in the dataset.

Note: if the whole set of artists has been searched and no weight is high enough (for example, the 40% above does not reach the minimum required for a suggestion, say 50%), suggest instead the artists whose tags are closely related to the user's top 3 tags.

How would this be calculated?
With cosine similarity or the Pearson correlation, we can measure, for example, the similarity between tag 1 and tag 2. If tag 1 is in the user's top 3 tags and tag 2 is not, but the two are closely related, we recommend the artists that have tag 2, following steps 2 and 3.
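
The procedure above, including the fallback to a related tag, can be sketched as follows. Everything here is an assumption made for illustration: the helper names, the toy data, the choice to expand from the user's single strongest tag, and describing each tag by a vector of per-artist counts. It is not the notebook's implementation.

```python
from collections import Counter
import numpy as np

def top_tags(user_interest, k=3):
    """The k tags with the highest interest for one user."""
    return sorted(user_interest, key=user_interest.get, reverse=True)[:k]

def coverage(wanted_tags, artist_tags):
    """Step 2: share of an artist's tag assignments that fall in wanted_tags."""
    counts = Counter(artist_tags)
    total = sum(counts.values())
    return sum(counts[t] for t in wanted_tags) / total if total else 0.0

def best_artist(wanted_tags, tags_by_artist):
    """Step 3: the artist with the highest coverage, plus that coverage."""
    scores = {a: coverage(wanted_tags, tags) for a, tags in tags_by_artist.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_related_tag(tag, tag_vectors, exclude):
    """Fallback: the tag outside `exclude` whose vector is closest to `tag`."""
    candidates = [t for t in tag_vectors if t not in exclude]
    return max(candidates, key=lambda t: cosine_similarity(tag_vectors[tag], tag_vectors[t]))

# Hypothetical toy data.
user_interest = {"rock": 5.0, "indie": 4.4, "pop": 0.9, "jazz": 0.1}
tags_by_artist = {
    "A1": ["rock", "alternative", "alternative", "alternative"],
    "A2": ["jazz", "jazz", "blues", "blues", "pop"],
}
# Each tag described by how often it was applied to four made-up artists.
tag_vectors = {
    "rock":        np.array([5.0, 3.0, 0.0, 0.0]),
    "alternative": np.array([4.0, 2.0, 0.0, 1.0]),
    "jazz":        np.array([0.0, 0.0, 6.0, 2.0]),
}

THRESHOLD = 0.5  # the "say 50%" minimum mentioned in the note
wanted = top_tags(user_interest)
artist, score = best_artist(wanted, tags_by_artist)
if score < THRESHOLD:
    # No artist is covered well enough, so add the tag closest to the top tag.
    related = most_related_tag(wanted[0], tag_vectors, exclude=wanted)
    artist, score = best_artist(wanted + [related], tags_by_artist)
print(artist, score)
```

In the toy data no artist reaches the threshold with the user's own top 3 tags, so the sketch adds the tag most similar to "rock" and scores the artists again.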

In [2]:
# Collect, for every artist, the list of tag IDs users have assigned to it
# (one entry per assignment, so a tag can appear several times).
tags_by_artist = user_taggedartists.groupby('artistID')['tagID'].apply(list)
tagsXArtist = {}
for idArt in artists['id']:
    # keys are the artist IDs as strings; untagged artists get an empty list
    tagsXArtist[str(idArt)] = tags_by_artist.get(idArt, [])

In [3]:
# Now turn tagsXArtist into lists of tuples whose first element is the tag
# and whose second element is the number of occurrences
import collections
counter = collections.Counter(tagsXArtist['1'])
print(counter.most_common(5))
newList = counter.most_common(5)
# newList[0][0] gives the tag ID and newList[0][1] its frequency;
# len(tagsXArtist['289']) gives the total number of tags in the list
frecTagsXArtist = {}
for key, value in tagsXArtist.items():
    counter = collections.Counter(value)
    frecTagsXArtist[int(key)] = counter.most_common(5)

[(139, 5), (141, 3), (179, 2), (541, 2), (552, 1)]


In [4]:
def tag_share(user_artist_plays, all_artist_tags):
    """Return the user's interest in each tag, scaled so the top tag gets 5.

    user_artist_plays: rows of user_artists for a single user.
    all_artist_tags: dict mapping artistID to a list of (tagID, count) tuples.
    """
    # collect (artistID, plays) pairs and the user's total number of plays
    artists = []
    total_weight = 0
    for index, row in user_artist_plays.iterrows():
        artistID = row['artistID']
        weight = row['weight']
        total_weight += weight
        artists.append((artistID, weight))

    tag_sum = {}
    for artistID, weight in artists:
        artist_tags = all_artist_tags[artistID]
        listen_share = weight/total_weight
        # find the total tag weight for this artist
        total_tag_weight = 0
        for tagid, tag_weight in artist_tags:
            total_tag_weight += tag_weight

        # accumulate each tag's share, weighted by the artist's listen share
        for tagid, tag_weight in artist_tags:
            if tagid not in tag_sum:
                tag_sum[tagid] = 0
            tag_sum[tagid] += tag_weight*listen_share/total_tag_weight

    # rescale so that the tag with the largest share gets an interest of 5
    max_weight = max(tag_sum.values())
    ratio = 5/max_weight
    tagshare = {}
    for i, x in tag_sum.items():
        tagshare[i] = x*ratio

    return tagshare
        

In [5]:
from sklearn.feature_extraction import DictVectorizer
import operator 

vec = DictVectorizer()
K = 30
users_tag_interest = {}
users_tag_interest_novec = {}
# compute the tag interests once per user and keep only the 5 strongest tags
for userid, tagdata in user_artists.groupby('userID', sort=False):
    v = tag_share(tagdata, frecTagsXArtist)
    sorted_v = sorted(v.items(), key=operator.itemgetter(1), reverse=True)
    sorted_dic = dict(sorted_v[:5])
    users_tag_interest_novec[int(userid)] = sorted_dic
    users_tag_interest[int(userid)] = vec.fit_transform(sorted_dic)

In [41]:
from scipy.spatial.distance import cosine as cos
import numpy as np
def cosine(s1, s2):
    s1 = np.nan_to_num(s1)
    s2 = np.nan_to_num(s2)
    return 1 - cos(s1, s2)

In [6]:
def pearson(s1, s2):
    # take two pd.Series objects and return their Pearson correlation
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2)) 
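
One detail worth keeping in mind: pandas skips missing values in `mean()` and in sums, so in the function above the numerator only covers the rows where both series have a value, while each mean and each sum of squares is taken over that series' own non-missing rows. A variant that restricts every term to the shared rows could look like the sketch below; `pearson_common` and the toy series are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def pearson_common(s1, s2):
    """Pearson correlation computed only on rows where both series have a value."""
    mask = s1.notna() & s2.notna()
    a, b = s1[mask], s2[mask]
    if len(a) < 2:
        return np.nan  # not enough shared rows to correlate
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    return (a_c * b_c).sum() / denom if denom > 0 else np.nan

# Two toy columns that share only their first three rows.
toy_s1 = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0])
toy_s2 = pd.Series([2.0, 4.0, 7.0, 1.0, np.nan])
toy_result = pearson_common(toy_s1, toy_s2)
print(toy_result)
```

On the shared rows this agrees with `np.corrcoef`, which makes it easy to check against a known implementation.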

In [27]:
def get_recs(artist_id, M, num, f=pearson):
    review = []
    for tit in M.columns:
        title = int(tit)
        if (title == artist_id):
            continue
        cor = f(M[artist_id], M[title])
        if np.isnan(cor):
            continue
        else:
            review.append((title, cor))
    review.sort(key = lambda tup: tup[1], reverse = True)
    return review[:num]

In [8]:
# build the user x artist matrix of play counts (used to correlate artists by their listeners)
pM = user_artists.pivot_table(index = ['userID'], columns = ['artistID'], values = 'weight')

In [9]:
# build the tag x artist matrix of tag counts (used to correlate artists by their tags)
pM1 = user_taggedartists.pivot_table(index = ['tagID'], columns = ['artistID'], aggfunc = 'count')
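
Because no `values` argument is passed here, `pivot_table` counts the only remaining column (`userID`) and keeps its name as the top level of the column index, which is why the later cell reads `pM1['userID']`. A small sketch on made-up data (`toy_tagged` and `toy_pM1` are illustrative names):

```python
import pandas as pd

# Toy stand-in for user_taggedartists (values are invented).
toy_tagged = pd.DataFrame({
    "userID":   [1, 2, 3, 1],
    "artistID": [10, 10, 20, 20],
    "tagID":    [7, 7, 7, 8],
})

# Count how many users applied each tag to each artist.
toy_pM1 = toy_tagged.pivot_table(index=["tagID"], columns=["artistID"], aggfunc="count")

print(toy_pM1.columns.tolist())  # two-level column labels: (value column, artistID)
print(toy_pM1["userID"])         # plain tag x artist count table
```

Tag and artist pairs that never occur stay as NaN in the result.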

In [10]:
# Normalize each artist column so that its maximum is 5
y = {}
for x in pM:
    z = pM[x]
    y[x] = []
    ratio = 5/z.max()
    for i in z:
        y[x].append(i*ratio)
M = pd.DataFrame.from_dict(y)

y = None
pM = None


In [11]:
y = {}
for x  in pM1['userID']:
    z = pM1['userID'][x]
    y[x] = []
    ratio = 5/z.max()
    for i in z:
        y[x].append(i*ratio)
M1 = pd.DataFrame.from_dict(y)

y = None
pM1 = None
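
The same column-wise scaling can also be written without explicit loops. The sketch below uses a made-up matrix (`toy_pM`, `toy_M`); note that, unlike the loops above, it keeps the original row index instead of resetting it to 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd

# Toy user x artist matrix with missing entries (values are invented).
toy_pM = pd.DataFrame(
    {1: [10.0, np.nan, 40.0], 2: [np.nan, 3.0, 6.0]},
    index=pd.Index([100, 101, 102], name="userID"),
)

# Divide every column by its own maximum and scale to the 0-5 range.
# toy_pM.max() is a Series indexed by column, so the division aligns per column.
toy_M = toy_pM * 5 / toy_pM.max()
print(toy_M)
```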

### Using artists' tags

In [12]:
artists.query('name == "Radiohead"')

Unnamed: 0,id,name,url,pictureURL
148,154,Radiohead,http://www.last.fm/music/Radiohead,http://userserve-ak.last.fm/serve/252/8461967.jpg


In [13]:
M.sample(3)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18736,18737,18738,18739,18740,18741,18742,18743,18744,18745
43,,,,,,,,,,,...,,,,,,,,,,
897,,,,,,,,,,,...,,,,,,,,,,
1042,,,,,,,,,,,...,,,,,,,,,,


In [42]:
recs_cosine = get_recs(154, M, 10, cosine)

In [43]:
for aid, corr  in recs_cosine[:10]:
    print(artists.query('id == ' + str(aid))['name'], corr)

231    Thom Yorke
Name: name, dtype: object 0.556959444543
850    Pixies
Name: name, dtype: object 0.436143766665
5656    Buffy the Vampire Slayer
Name: name, dtype: object 0.432447427387
15795    Stereo MC's
Name: name, dtype: object 0.40102188014
15794    Laurent Garnier
Name: name, dtype: object 0.394170796892
5236    Vibrasphere
Name: name, dtype: object 0.391843553238
211    Death Cab for Cutie
Name: name, dtype: object 0.391493852705
6167    Saves the Day
Name: name, dtype: object 0.383218185105
4613    2 Many DJ's
Name: name, dtype: object 0.383110741624
613    A Tribe Called Quest
Name: name, dtype: object 0.354358682201


In [28]:
# Using the users
recs = get_recs(154, M, 10)



In [29]:
for aid, corr  in recs[:10]:
    print(artists.query('id == ' + str(aid))['name'], corr)

850    Pixies
Name: name, dtype: object 0.451411847745
5656    Buffy the Vampire Slayer
Name: name, dtype: object 0.418140600067
231    Thom Yorke
Name: name, dtype: object 0.416842468234
6167    Saves the Day
Name: name, dtype: object 0.400121421881
5236    Vibrasphere
Name: name, dtype: object 0.384463456348
613    A Tribe Called Quest
Name: name, dtype: object 0.382826116835
4613    2 Many DJ's
Name: name, dtype: object 0.376340437614
211    Death Cab for Cutie
Name: name, dtype: object 0.344340926194
15794    Laurent Garnier
Name: name, dtype: object 0.319337230193
205    The Decemberists
Name: name, dtype: object 0.318860962842


In [14]:
M1.sample(3)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18724,18732,18734,18735,18736,18737,18739,18740,18741,18744
7225,,,,,,,,,,,...,,,,,,,,,,
1456,,,,,,,,,,,...,,,,,,,,,,
5546,,,,,,,,,,,...,,,,,,,,,,


In [44]:
recs2_cosine = get_recs(154, M1, 10, cosine)

In [45]:
for aid, corr  in recs2_cosine[:10]:
    print(artists.query('id == ' + str(aid))['name'], corr)

167    Placebo
Name: name, dtype: object 0.911509677531
184    Muse
Name: name, dtype: object 0.885113499491
59    Coldplay
Name: name, dtype: object 0.87869052822
1230    Weezer
Name: name, dtype: object 0.839739930028
200    Beck
Name: name, dtype: object 0.832150205922
1461    Lifehouse
Name: name, dtype: object 0.830713459715
960    Kasabian
Name: name, dtype: object 0.828507045567
1366    Snow Patrol
Name: name, dtype: object 0.827678675225
1371    Blue October
Name: name, dtype: object 0.82686087133
223    The Killers
Name: name, dtype: object 0.825055373386


In [17]:
# Using the tags
recs2 = get_recs(154, M1, 10)



In [18]:
for aid, corr  in recs2[:10]:
    print(artists.query('id == ' + str(aid))['name'], corr)

167    Placebo
Name: name, dtype: object 0.862249130145
184    Muse
Name: name, dtype: object 0.840094065558
59    Coldplay
Name: name, dtype: object 0.832750596664
1461    Lifehouse
Name: name, dtype: object 0.821615879583
1230    Weezer
Name: name, dtype: object 0.791971563078
200    Beck
Name: name, dtype: object 0.785395075126
850    Pixies
Name: name, dtype: object 0.78477199244
185    OneRepublic
Name: name, dtype: object 0.775699872911
223    The Killers
Name: name, dtype: object 0.773118014989
1368    The Fray
Name: name, dtype: object 0.768632219755


### Using user's interests

In [19]:
f = pd.DataFrame.from_dict(users_tag_interest_novec)
f.sample(3)

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,2090,2091,2092,2093,2094,2095,2096,2097,2099,2100
2503,,,,,,,,,,,...,,,,,,,,,,
445,,,,,,,,,,,...,,,,,,,,,,
550,,,,,,,,,,,...,,,,,,,,,,


In [47]:
def get_recs2(user_interests, M, num, f=pearson):
    review = []
    for tit in M.columns:
        title = int(tit)
        cor = f(user_interests, M[title])
        if np.isnan(cor):
            continue
        else:
            review.append((title, cor))
    review.sort(key = lambda tup: tup[1], reverse = True)
    return review[:num]

In [49]:
recs3 = get_recs2(f[238], M1, 10)



In [50]:
for aid, corr  in recs3[:10]:
    print(artists.query('id == ' + str(aid))['name'], corr)

604    Miles Davis
Name: name, dtype: object 0.8676761267
619    Thelonious Monk
Name: name, dtype: object 0.859709602533
2999    Louis Armstrong
Name: name, dtype: object 0.852553266226
2443    Nat King Cole
Name: name, dtype: object 0.843150305948
3975    Morphine
Name: name, dtype: object 0.842567804855
7960    Harry Connick, Jr.
Name: name, dtype: object 0.836160352132
11279    Cannonball Adderley
Name: name, dtype: object 0.831969045778
2983    Duke Ellington
Name: name, dtype: object 0.826705768613
2995    Benny Goodman
Name: name, dtype: object 0.811918652466
10570    Charles Mingus
Name: name, dtype: object 0.810269828255


In [51]:
x = user_artists.query('userID == 238').sort_values('weight',ascending=False)[:10]
for i in x.iterrows():
    aid = i[1]['artistID'] 
    print(artists.query('id == ' + str(aid))['name'])

4531    Tribraco
Name: name, dtype: object
4532    Unnaddarè
Name: name, dtype: object
4533    Neo
Name: name, dtype: object
4534    Mini k Bros
Name: name, dtype: object
4535    Neilos
Name: name, dtype: object
4536    Pink Puffers
Name: name, dtype: object
4537    Snakeprint
Name: name, dtype: object
4538    Spiral69
Name: name, dtype: object
4539    Tubax
Name: name, dtype: object
4540    Goodphellas
Name: name, dtype: object


### Bibliography

1. Robillard, M., Maalej, W., Walker, R. J., & Zimmermann, T. (Eds.). (2014). Recommendation Systems in Software Engineering. Springer Berlin Heidelberg. Ch. 2, pp. 20-21. https://doi.org/10.1007/978-3-642-45135-5
2. Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2011). Recommender systems: an introduction. Cambridge University Press (Vol. 40). https://doi.org/10.1017/CBO9780511763113