### Setup & load data

In [1]:
import graphlab
song_data = graphlab.SFrame('song_data.gl/')
graphlab.canvas.set_target('ipynb')


### Create training and test data

In [2]:
train_data, test_data = song_data.random_split(.8, seed = 0)

In [3]:
song_data.head(5)

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...


# Q1: Counting Unique Users
The method .unique() selects the unique elements in a column of data. In this question, you will compute the number of unique users who have listened to songs by various artists. For example, to find the number of unique users who listened to songs by 'Kanye West', select the rows of the song data where the artist is 'Kanye West', then count the number of unique entries in the 'user_id' column.

Compute the number of unique users for each of these artists: 'Kanye West', 'Foo Fighters', 'Taylor Swift' and 'Lady GaGa'.

In [5]:
kw_count = len(song_data[song_data['artist'] == 'Kanye West']['user_id'].unique())
ff_count = len(song_data[song_data['artist'] == 'Foo Fighters']['user_id'].unique())
ts_count = len(song_data[song_data['artist'] == 'Taylor Swift']['user_id'].unique())
lg_count = len(song_data[song_data['artist'] == 'Lady GaGa']['user_id'].unique())
print 'Unique users that listened to Kanye: %s' % kw_count
print 'Unique users that listened to Foo Fighters: %s' % ff_count
print 'Unique users that listened to Taylor Swift: %s' % ts_count
print 'Unique users that listened to Lady GaGa: %s' % lg_count

Unique users that listened to Kanye: 2522
Unique users that listened to Foo Fighters: 2055
Unique users that listened to Taylor Swift: 3246
Unique users that listened to Lady GaGa: 2928
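
For readers without GraphLab Create installed, the filter-then-unique logic used above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows in toy_listen_rows are made up, and count_unique_users is a hypothetical helper, not part of GraphLab or the original notebook.

```python
# Toy rows standing in for song_data; the values are invented for illustration.
toy_listen_rows = [
    {'user_id': 'u1', 'artist': 'Kanye West'},
    {'user_id': 'u1', 'artist': 'Kanye West'},    # same user, second song
    {'user_id': 'u2', 'artist': 'Kanye West'},
    {'user_id': 'u2', 'artist': 'Foo Fighters'},
    {'user_id': 'u3', 'artist': 'Taylor Swift'},
]


def count_unique_users(rows, artist):
    """Count distinct user_id values among the rows for one artist."""
    # Keep only rows for the artist, then let a set drop duplicate users.
    return len({row['user_id'] for row in rows if row['artist'] == artist})


kanye_users = count_unique_users(toy_listen_rows, 'Kanye West')
```

The set comprehension plays the role of .unique(): a user who appears in several rows for the same artist is counted once.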


# Q2: Most & Least Popular Artists

Use groupby-aggregate to find the most popular and least popular artists. Each row of song_data contains the number of times a user listened to a particular song by a particular artist. To find out how many times any song by 'Kanye West' was listened to, select all the rows where 'artist' == 'Kanye West' and sum the 'listen_count' column. Grouping by artist performs this sum for every artist at once.

In [8]:
summ_artist = song_data.groupby(key_columns = 'artist', operations = {'total_count': graphlab.aggregate.SUM('listen_count')})
summ_artist = summ_artist.sort('total_count', ascending = False)
print summ_artist.head(5)
print summ_artist.tail(5)

+------------------------+-------------+
|         artist         | total_count |
+------------------------+-------------+
|     Kings Of Leon      |    43218    |
|     Dwight Yoakam      |    40619    |
|         Björk          |    38889    |
|        Coldplay        |    35362    |
| Florence + The Machine |    33387    |
+------------------------+-------------+
[5 rows x 2 columns]

+-------------------------------+-------------+
|             artist            | total_count |
+-------------------------------+-------------+
|             Diplo             |      30     |
|         Boggle Karaoke        |      30     |
| Beyoncé feat. Bun B and Sl... |      26     |
|         Reel Feelings         |      24     |
|        William Tabbert        |      14     |
+-------------------------------+-------------+
[5 rows x 2 columns]
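
The groupby-aggregate step in Q2 amounts to summing listen_count per artist and sorting the totals. A minimal plain-Python sketch of that idea follows; the data in toy_play_rows is invented, and total_listens_by_artist is a hypothetical helper rather than a GraphLab function.

```python
from collections import defaultdict

# Toy rows standing in for song_data; the values are invented for illustration.
toy_play_rows = [
    {'artist': 'Kanye West', 'listen_count': 1},
    {'artist': 'Kanye West', 'listen_count': 4},
    {'artist': 'Foo Fighters', 'listen_count': 2},
    {'artist': 'Jack Johnson', 'listen_count': 1},
]


def total_listens_by_artist(rows):
    """Sum listen_count per artist and return (artist, total) pairs, largest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row['artist']] += row['listen_count']
    # Sort by total in descending order, like sort('total_count', ascending = False).
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


ranked = total_listens_by_artist(toy_play_rows)
most_popular = ranked[0]    # first entry after the descending sort
least_popular = ranked[-1]  # last entry after the descending sort
```

Taking the first and last entries of the sorted list corresponds to reading head() and tail() of the sorted SFrame.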



# Q3: [OPTIONAL] Using groupby-aggregate to find the most recommended songs

In [9]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id = 'user_id',
                                                                item_id = 'song')

We are going to make recommendations for the users in the test data, but the test set contains over 200,000 rows (58,628 unique users). Computing recommendations for this many users can be slow on some computers, so this question uses only the first 10,000 users. Use this command to select the subset:

In [10]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [12]:
# Compute one recommended song for each of these test users
top_song = personalized_model.recommend(subset_test_users, k=1)

In [13]:
top_song.head(5)

user_id,song,score,rank
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Cuando Pase El Temblor - Soda Stereo ...,0.0194504536115,1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424376488,1
f6c596a519698c97f1591ad89 f540d76f6a04f1a ...,Hey_ Soul Sister - Train,0.0238929539919,1
696787172dd3f5169dc94deef 97e427cee86147d ...,Senza Una Donna (Without A Woman) - Zucchero / ...,0.017026577677,1
3a7111f4cdf3c5a85fd4053e3 cc2333562e1e0cb ...,Heartbreak Warfare - John Mayer ...,0.0298416515191,1


In [14]:
summ_song = top_song.groupby(key_columns='song', 
                             operations={'count': graphlab.aggregate.COUNT()}).sort('count', ascending = False)
summ_song.head(5)

song,count
Undo - Björk,436
Secrets - OneRepublic,384
Revelry - Kings Of Leon,225
You're The One - Dwight Yoakam ...,163
Fireflies - Charttraxx Karaoke ...,123
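
The counting step in Q3 (group the top recommendation per user by song, then count) can also be sketched without GraphLab. The example below assumes each user has exactly one recommended song, as with k=1; toy_top_song and most_recommended are hypothetical names with made-up data.

```python
from collections import Counter

# (user_id, recommended song) pairs, one per user; invented for illustration.
toy_top_song = [
    ('u1', 'Song A'),
    ('u2', 'Song B'),
    ('u3', 'Song A'),
    ('u4', 'Song C'),
    ('u5', 'Song A'),
    ('u6', 'Song B'),
]


def most_recommended(pairs, n):
    """Return the n songs recommended to the most users, with their counts."""
    # Count how many users received each song as their top recommendation.
    counts = Counter(song for _user, song in pairs)
    return counts.most_common(n)


top_two = most_recommended(toy_top_song, 2)
```

Counter.most_common returns the entries already sorted by count in descending order, which mirrors the COUNT() aggregate followed by the descending sort.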
