#Building a song recommender


#Fire up GraphLab Create

In [1]:
import graphlab

#Load music data

In [3]:
song_data = graphlab.SFrame('song_data.gl/')

#Explore data

Each row of the music data records how many times a user listened to a song, along with the song's title and artist.

In [4]:
song_data.head()

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...,Heaven's gonna burn your eyes - Thievery ...


##Showing the most popular songs in the dataset

In [5]:
graphlab.canvas.set_target('ipynb')

In [None]:
song_data['song'].show()

In [None]:
len(song_data)

##Count number of unique users in the dataset

In [14]:
# All distinct user IDs in the dataset
users = song_data['user_id'].unique()

# Number of distinct users who listened to each artist
for artist in ['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']:
    print(len(song_data.filter_by([artist], 'artist')['user_id'].unique()))


2522
2055
3246
2928
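
To make the per-artist count concrete, here is a minimal pure-Python sketch of the same idea: collect the distinct users who have at least one row for an artist. The toy rows and the helper `unique_listeners` are hypothetical and are not taken from `song_data`.

```python
# Hypothetical toy rows of (user_id, artist); not real data from song_data.
rows = [
    ("u1", "Kanye West"),
    ("u1", "Kanye West"),   # same user, another song by the same artist
    ("u2", "Kanye West"),
    ("u2", "Foo Fighters"),
    ("u3", "Taylor Swift"),
]


def unique_listeners(rows, artist):
    """Count distinct users with at least one row for `artist`."""
    return len({user for user, row_artist in rows if row_artist == artist})


artists = ["Kanye West", "Foo Fighters", "Taylor Swift", "Lady GaGa"]
counts = {artist: unique_listeners(rows, artist) for artist in artists}
print(counts)
```

Using a set means a user who played several songs by the same artist is counted once, which is what `unique()` does in the cell above.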


In [31]:
import graphlab.aggregate as agg
artist_list = ['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']
# 'num_songs' holds the total listen count per artist (a sum, not a song count)
song_count = song_data.groupby(key_columns='artist', operations={'num_songs': agg.SUM('listen_count')})
# Ascending sort: the least-listened artists come first
song_count.sort('num_songs', ascending=True)

artist,num_songs
William Tabbert,14
Reel Feelings,24
Beyoncé feat. Bun B and Slim Thug ...,26
Diplo,30
Boggle Karaoke,30
harvey summers,31
Nâdiya,36
Kanye West / Talib Kweli / Q-Tip / Common / ...,38
Aneta Langerova,38
Jody Bernal,38
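
The group-and-sum step can be sketched without GraphLab. The example below assumes toy `(artist, listen_count)` rows and a hypothetical helper, `total_listens_by_artist`; it only illustrates the aggregation and is not how GraphLab implements `groupby`.

```python
from collections import defaultdict

# Hypothetical toy rows of (artist, listen_count).
rows = [("Artist A", 3), ("Artist B", 1), ("Artist A", 2), ("Artist C", 10)]


def total_listens_by_artist(rows):
    """Sum listen counts per artist and return (artist, total) pairs, ascending."""
    totals = defaultdict(int)
    for artist, listen_count in rows:
        totals[artist] += listen_count
    return sorted(totals.items(), key=lambda pair: pair[1])


ranking = total_listens_by_artist(rows)
least_listened = ranking[0]   # first entry after an ascending sort
most_listened = ranking[-1]   # last entry after an ascending sort
print(ranking)
```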


In [None]:
len(users)

#Create a song recommender

In [19]:
train_data,test_data = song_data.random_split(.8,seed=0)
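
As a rough illustration of a seeded 80/20 split, the sketch below assigns each row to the first part with probability `fraction`. The function name mirrors the call above, but this is a stand-in written for this example; whether GraphLab samples rows in exactly this way is an assumption.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)   # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))        # hypothetical stand-in for the rows of song_data
train, test = random_split(rows, 0.8, seed=0)
print(len(train), len(test))
```

The split sizes are only approximately 80% and 20%, because each row is assigned independently. Fixing the seed means rerunning the notebook trains and evaluates on the same rows.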

##Simple popularity-based recommender

In [20]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')

PROGRESS: Recsys training: model = popularity
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.01406s
PROGRESS: 893580 observations to process; with 9952 unique items.


###Use the popularity model to make some predictions

A popularity model makes the same predictions for every user, so it provides no personalization.

In [None]:
popularity_model.recommend(users=[users[0]])

In [None]:
popularity_model.recommend(users=[users[1]])
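
To see why popularity gives no personalization, consider this minimal sketch. It ranks songs by how many observations they have and returns the same top-k list whoever asks. The data and function names are hypothetical, and a real library may also filter out songs a user has already heard; that step is omitted here.

```python
from collections import Counter

# Hypothetical (user_id, song) observations.
observations = [
    ("u1", "Song A"), ("u2", "Song A"), ("u3", "Song A"),
    ("u1", "Song B"), ("u2", "Song B"),
    ("u3", "Song C"),
]


def popularity_ranking(observations):
    """Order songs by observation count, most popular first."""
    counts = Counter(song for _, song in observations)
    # Break ties by title so the ordering is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ordered]


def recommend(ranking, user, k):
    """The user argument is ignored: everyone gets the global top k."""
    return ranking[:k]


ranking = popularity_ranking(observations)
for user in ["u1", "u2", "u3"]:
    print(user, recommend(ranking, user, k=2))
```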

##Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user. 

In [24]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')

PROGRESS: Recsys training: model = item_similarity
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.12169s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 0.952033        |
PROGRESS: | 2000            | 1.02286         |
PROGRESS: | 3000            | 1.09115         |
PROGRESS: | 4000            | 1.15695         |
PROGRESS: | 5000            | 1.22074         |
PROGRESS: | 6000            | 1.28187         |
PROGRESS: | 7000            | 1.34389         |
PROGRESS: | 8000            | 1.4292          |
PROGRESS: | 9000          

###Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.

In [None]:
personalized_model.recommend(users=[users[0]])

In [None]:
personalized_model.recommend(users=[users[1]])
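
One common way to turn item-item similarities into per-user recommendations is to score each unheard song by its average similarity to the songs in the user's history. The sketch below assumes that scheme and a made-up similarity table; it is not a description of GraphLab's internals.

```python
# Hypothetical item-item similarity scores (made up for illustration).
SIMILARITY = {
    ("Song A", "Song B"): 0.6,
    ("Song A", "Song C"): 0.1,
    ("Song B", "Song C"): 0.2,
    ("Song B", "Song D"): 0.1,
    ("Song C", "Song D"): 0.7,
}
CATALOG = ["Song A", "Song B", "Song C", "Song D"]


def sim(a, b):
    """Symmetric lookup; pairs missing from the table count as 0."""
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def recommend(history, k):
    """Rank unheard songs by mean similarity to the songs in `history`."""
    if not history:
        return []
    scores = {
        song: sum(sim(song, heard) for heard in history) / float(len(history))
        for song in CATALOG
        if song not in history
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


print(recommend({"Song A"}, k=1))   # a user who has only heard Song A
print(recommend({"Song C"}, k=1))   # a user who has only heard Song C
```

Because the score depends on each user's own history, the two users above end up with different top recommendations.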

###We can also apply the model to find songs similar to any song in the dataset

In [None]:
personalized_model.get_similar_items(['With Or Without You - U2'])

In [None]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])
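
A similar-items lookup can be sketched with Jaccard similarity over the sets of users who listened to each song. Jaccard is an assumption here (it may or may not be the measure the model above uses), and the listener sets are invented.

```python
# Hypothetical listener sets: song -> users who played it.
listeners = {
    "Song A": {"u1", "u2", "u3"},
    "Song B": {"u2", "u3"},
    "Song C": {"u3", "u4"},
    "Song D": {"u5"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def similar_items(song, listeners, k):
    """Return the k songs whose listener sets overlap most with `song`."""
    base = listeners[song]
    scored = [
        (other, jaccard(base, users))
        for other, users in listeners.items()
        if other != song
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


top = similar_items("Song A", listeners, k=2)
print(top)
```

Two songs score highly when largely the same people listen to both, regardless of genre or artist.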

#Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves. 

In [None]:
# Compare (major, minor) numerically; a string comparison would misorder versions such as "1.10"
if tuple(int(part) for part in graphlab.version.split('.')[:2]) >= (1, 6):
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance, [popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=0.05)

The precision-recall curves show that the personalized model performs much better than the popularity model.
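
For readers unfamiliar with the metric, here is a minimal sketch of precision and recall at a cutoff k for a single user. The helper name and toy data are hypothetical; a precision-recall curve comes from computing such pairs for several values of k and averaging over users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision = hits / k; recall = hits / number of relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k) if k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


# Hypothetical ranked recommendations and held-out songs the user actually played.
recommended = ["Song A", "Song B", "Song C", "Song D"]
relevant = {"Song B", "Song D", "Song E"}

for k in (1, 2, 4):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Raising k can only keep recall the same or increase it, while precision may fall as weaker recommendations are included.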

In [21]:
item_similarity_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='song_id')

PROGRESS: Recsys training: model = item_similarity
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 10000 items.
PROGRESS:     Data prepared in: 1.05537s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 10000 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 2.03817         |
PROGRESS: | 2000            | 2.1218          |
PROGRESS: | 3000            | 2.19806         |
PROGRESS: | 4000            | 2.27082         |
PROGRESS: | 5000            | 2.35087         |
PROGRESS: | 6000            | 2.43565         |
PROGRESS: | 7000            | 2.49528         |
PROGRESS: | 8000            | 2.55454         |
PROGRESS: | 9000        

In [22]:
# Up to 10,000 distinct users drawn from the test set
subset_test_users = test_data['user_id'].unique()[0:10000]

In [25]:
personalized_model.recommend(subset_test_users,k=1)

PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1687.09
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1784.36
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1804.03
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1793.74
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1824.35
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1841.98
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1842.3
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1840.09
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1842.02
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1832.4


user_id,song,score,rank
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Cuando Pase El Temblor - Soda Stereo ...,0.0194504525792,1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424433009,1
f6c596a519698c97f1591ad89 f540d76f6a04f1a ...,Hey_ Soul Sister - Train,0.0249503418855,1
696787172dd3f5169dc94deef 97e427cee86147d ...,Senza Una Donna (Without A Woman) - Zucchero / ...,0.0170265780731,1
3a7111f4cdf3c5a85fd4053e3 cc2333562e1e0cb ...,Heartbreak Warfare - John Mayer ...,0.0320961822239,1
532e98155cbfd1e1a474a28ed 96e59e50f7c5baf ...,Jive Talkin' (Album Version) - Bee Gees ...,0.0118288659232,1
ee43b175ed753b2e2bce806c9 03d4661ad351a91 ...,Ricordati Di Noi - Valerio Scanu ...,0.0305171277997,1
e372c27f6cb071518ae500589 ae02c126954c148 ...,Fall Out - The Police,0.0819672131148,1
83b1428917b47a6b130ed471b 09033820be78a8c ...,Clocks - Coldplay,0.0440427234059,1
39487deef9345b1e22881245c abf4e7c53b6cf6e ...,Black Mirror - Arcade Fire ...,0.0417737699321,1


In [28]:
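# Top-1 recommendation for each sampled test user
recommendations = item_similarity_model.recommend(subset_test_users, k=1)
# Count how many users were recommended each song ('num_songs' is that per-song count)
recommendations_count = recommendations.groupby(key_columns='song_id', operations={'num_songs': agg.COUNT()})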

PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1526.1
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1619.81
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1663.06
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1686.1
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1704.57
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1709.18
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1690.25
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1702.24
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1710.82
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1696.68


In [33]:
recommendations_count.sort('num_songs', ascending=False)

song_id,num_songs
SOAUWYT12A81C206F1,430
SONYKOW12AB01849C9,384
SOSXLTC12AF72A7F54,233
SOBONKR12A58A7A7E0,168
SOLFXKT12AB017E3E0,126
SODJWHY12A8C142CCE,102
SOEGIYH12A6D4FC0E3,98
SOFRQTD12A81C233C0,74
SOUSMXX12AB0185C24,60
SOAXGDH12A8C13F8A1,55
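
The tally above, how often each song appears as a user's top recommendation, can be reproduced in plain Python with `collections.Counter`. The user and song IDs below are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical top-1 recommendation per user: user_id -> song_id.
top1 = {
    "u1": "S1", "u2": "S1", "u3": "S2",
    "u4": "S1", "u5": "S3", "u6": "S2",
}

# Count how many users received each song as their top recommendation.
counts = Counter(top1.values())
most_common = counts.most_common(2)

# Share of users whose top recommendation is the most frequent song.
top_song, top_count = most_common[0]
share = top_count / float(len(top1))
print(most_common, share)
```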


In [35]:
song_data.filter_by(['SOAUWYT12A81C206F1'], 'song_id')

user_id,song_id,listen_count,title,artist,song
e006b1a48f466bf59feefed32 bec6494495a4436 ...,SOAUWYT12A81C206F1,2,Undo,Björk,Undo - Björk
0afaa5d9d04bf85af720fe8cc 566a41ca3e41c97 ...,SOAUWYT12A81C206F1,23,Undo,Björk,Undo - Björk
2b6c2f33bc0e887ea7c4411f5 8106805a1923280 ...,SOAUWYT12A81C206F1,6,Undo,Björk,Undo - Björk
73d0a0c725c9b2c541635672b b0572bfcb7eb2b4 ...,SOAUWYT12A81C206F1,1,Undo,Björk,Undo - Björk
62f2f9b881dc320d745a90c0c 10528d18e10deb1 ...,SOAUWYT12A81C206F1,2,Undo,Björk,Undo - Björk
f47116f998e030f2dab275b81 fb2a04a9dc06c33 ...,SOAUWYT12A81C206F1,3,Undo,Björk,Undo - Björk
179b2286bb4eea7193bcfa0c3 6fcfa4eade2b34d ...,SOAUWYT12A81C206F1,6,Undo,Björk,Undo - Björk
b1269307f2ae8c17062c6aea2 502b099aad517b6 ...,SOAUWYT12A81C206F1,11,Undo,Björk,Undo - Björk
4192e443e37ffa08f1cc02b10 b42b4a178a09004 ...,SOAUWYT12A81C206F1,6,Undo,Björk,Undo - Björk
ed3664f9cd689031fe4d0ed6c 66503bdc3ad7cb6 ...,SOAUWYT12A81C206F1,1,Undo,Björk,Undo - Björk
