Assignment: Recommending Songs
This assignment focuses on building recommender systems to find products, specifically songs, that interest users. You'll see how to build two kinds of song recommenders—one based on song popularity and another that is personalized. Then, you'll compare the models.

Learning outcomes
Load and transform real song data
Build a song recommender model
Use the model to recommend songs to individual users
Compute aggregate statistics of the data using the groupby method
Instructions
This assignment has three tasks. You will need several of the results for the quiz that accompanies this module.

Task 1: Count unique users
The method unique selects the unique elements in a column of data. You will compute the number of unique users who listened to songs by various artists. To find out the number of unique users who listened to songs by a particular artist, for example, Kanye West, you simply select the rows of the song data where the artist is 'Kanye West'. Then count the number of unique entries in the user_id column.

Compute the number of unique users for each of these artists: Kanye West, Foo Fighters, Taylor Swift, and Lady GaGa.
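The select-then-count procedure described above can be sketched in plain Python, independent of Turi Create. The rows below are invented for illustration and are not taken from the assignment data, and count_unique_users is a hypothetical helper name.

```python
# Toy (user_id, artist) rows, invented for illustration only.
rows = [
    ("u1", "Kanye West"),
    ("u1", "Kanye West"),   # same user, a second Kanye West song
    ("u2", "Kanye West"),
    ("u2", "Foo Fighters"),
    ("u3", "Taylor Swift"),
]


def count_unique_users(rows, artist):
    """Count distinct user_ids among the rows for one artist."""
    # Step 1: keep only the rows for this artist.
    # Step 2: a set drops repeated user_ids, so its size is the unique count.
    return len({user_id for user_id, row_artist in rows if row_artist == artist})


for artist in ["Kanye West", "Foo Fighters", "Taylor Swift", "Lady GaGa"]:
    print(artist, count_unique_users(rows, artist))
```

With an SFrame, the same two steps correspond to filtering on the artist column and calling unique on the user_id column.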

Save these results to answer the quiz for this module.

Task 2: Find the most popular and least popular artists
Use the groupby method to find the most popular and least popular artist. Each row of song_data contains the number of times a user listened to a particular song by a particular artist. To find out how many times users listened to any song by Kanye West, you need to select all the rows where artist == 'Kanye West' and sum the listen_count column.

To find the most popular artist, you would need to follow this procedure for each artist, which would be very slow. Instead, you can use a powerful method, groupby.

Take a moment to read the groupby documentation.

The groupby method computes an aggregate (in this case, the sum of the listen_count column) for each distinct value in a column (in this case, the artist column).

The groupby method has two important parameters:

key_column_names, which takes the column to group, in this case, the artist column
operations, which maps each output column name to the aggregation to compute, in this case, the sum of the listen_count column.
With this in mind, the following command computes the sum of the listen_count column for each artist and returns an SFrame data structure with the results:

song_data.groupby(key_column_names='artist', operations={'total_count':turicreate.aggregate.SUM('listen_count')})   

The total number of listens for each artist is stored in the total_count column.

Sort the resulting SFrame on the total_count column, and find the most popular and least popular artists in the dataset.
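To make the grouping step concrete, here is a minimal plain-Python sketch of what a group-by with a SUM aggregate computes, followed by the sort. It is an analogue for illustration, not Turi Create's implementation; the rows and the sum_by_key helper are invented for this example.

```python
from collections import defaultdict

# Toy (artist, listen_count) rows, invented for illustration only.
rows = [
    ("Artist A", 3),
    ("Artist B", 1),
    ("Artist A", 4),
    ("Artist C", 10),
    ("Artist B", 1),
]


def sum_by_key(rows):
    """Return {key: sum of values}, a plain-Python analogue of groupby with SUM."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value  # one pass over the data, no per-artist filtering
    return dict(totals)


totals = sum_by_key(rows)

# Sort (artist, total) pairs by total, largest first.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
most_popular = ranked[0]
least_popular = ranked[-1]
print(most_popular, least_popular)
```

The single pass over the rows is the reason grouping is faster than repeating the filter-and-sum procedure once per artist.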

Save these results to answer the quiz for this module.

Task 3: Find the most recommended songs
Now that you know how to use the groupby method to compute aggregates for each value in a column, you will find the song that is most recommended by the personalized model shown in the video.

Follow these steps to find the most recommended song:

Split the data into 80% training, 20% testing, using a seed of 0, as shown in the video.
Train an item_similarity_recommender model, as shown in the video, using the training data.
Next, make recommendations for the users in the test data. The test set contains more than 200,000 rows (58,628 unique users). Computing recommendations for this many users can be slow. Thus, you should use only the first 10,000 users.
To select this subset of users:

subset_test_users = test_data['user_id'].unique()[0:10000]

Compute one recommended song for each test user.

Use the following command to compute the recommendations:

personalized_model.recommend(subset_test_users,k=1)

Finally, use the groupby method to find the most recommended song. When you used the groupby method in the previous task, you summed the listen_count for each artist by using the SUM aggregator:

operations={'total_count': turicreate.aggregate.SUM('listen_count')}

For this task, you need only to count how often each song is recommended, so you can use the COUNT aggregator instead of SUM, and store the results in a column called count by using:

operations={'count': turicreate.aggregate.COUNT()}

And, since you want to use the song column (title and artist) as the key to the aggregator instead of the artist, use:

key_column_names='song'

Sort the results to find the song most often recommended to the first 10,000 users in the test data.
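Note that the grouping in this task applies to the recommendations returned by the model, not to the original listening data. As a minimal sketch of the counting step, the plain-Python example below uses collections.Counter on invented recommendation rows; it illustrates what a COUNT aggregate keyed on song produces and is not Turi Create code.

```python
from collections import Counter

# Toy recommendation rows (user_id, song), one per user as with k=1.
# Invented for illustration only.
recommendations = [
    ("u1", "Song X - Artist A"),
    ("u2", "Song Y - Artist B"),
    ("u3", "Song X - Artist A"),
    ("u4", "Song Z - Artist C"),
    ("u5", "Song X - Artist A"),
]

# Count how many times each song appears; no column is summed.
song_counts = Counter(song for _user, song in recommendations)

# most_common() returns (song, count) pairs sorted by count, descending.
top_song, top_count = song_counts.most_common(1)[0]
print(top_song, top_count)
```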

Due to randomness in the train-test split, the most recommended song may come out differently for different people. This is why there is no quiz question for this task. But note your result so you can share it during the class discussion time.
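For intuition about how a seed interacts with a random split, here is a small sketch in plain Python. It is not Turi Create's splitting algorithm; the random_split function below is a hypothetical stand-in that only shows that a seeded pseudo-random split is repeatable for a given implementation and changes when the seed changes.

```python
import random


def random_split(rows, fraction, seed):
    """Assign each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))  # stand-in for dataset rows
train_a, test_a = random_split(rows, 0.8, seed=0)
train_b, test_b = random_split(rows, 0.8, seed=0)
train_c, test_c = random_split(rows, 0.8, seed=1)

print(train_a == train_b)         # same seed and same code give the same split
print(len(train_a), len(test_a))  # roughly 80% / 20%, not exactly
```

Because each row is assigned independently, the split sizes are only approximately 80% and 20%.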

In [1]:
import turicreate

## Import data

In [2]:
song_data=turicreate.SFrame('/Users/eunheelim/my-env/data/song_data.sframe/')

## Explore data

In [3]:
song_data

user_id,song_id,listen_count,title,artist
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...

song
The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia ...
Stronger - Kanye West
Constellations - Jack Johnson ...
Learn To Fly - Foo Fighters ...
Apuesta Por El Rock 'N' Roll - Héroes del ...
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters ...
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery ...


## Task 1: Count unique users

In [11]:
users = song_data['user_id'].unique()

In [14]:
len(song_data[song_data['artist']=='Kanye West']['user_id'].unique())

2522

In [15]:
len(song_data[song_data['artist']=='Foo Fighters']['user_id'].unique())

2055

In [18]:
len(song_data[song_data['artist']=='Taylor Swift']['user_id'].unique())

3246

In [20]:
len(song_data[song_data['artist']=='Lady GaGa']['user_id'].unique())

2928

## Task 2: Find the most popular and least popular artists

In [24]:
total_count_data=song_data.groupby(key_column_names='artist', operations={'total_count':turicreate.aggregate.SUM('listen_count')})

In [25]:
total_count_data.sort('total_count',ascending=False)

artist,total_count
Kings Of Leon,43218
Dwight Yoakam,40619
Björk,38889
Coldplay,35362
Florence + The Machine,33387
Justin Bieber,29715
Alliance Ethnik,26689
OneRepublic,25754
Train,25402
The Black Keys,22184


In [26]:
total_count_data.sort('total_count')

artist,total_count
William Tabbert,14
Reel Feelings,24
Beyoncé feat. Bun B and Slim Thug ...,26
Boggle Karaoke,30
Diplo,30
harvey summers,31
Nâdiya,36
Jody Bernal,38
Aneta Langerova,38
Kanye West / Talib Kweli / Q-Tip / Common / ...,38


## Task 3: Find the most recommended songs

In [27]:
train_data,test_data = song_data.random_split(.8,seed=0)

In [38]:
item_similarity_recommender_model = turicreate.item_similarity_recommender.create(
    train_data,
    user_id='user_id',
    item_id='song',
)
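
As a rough illustration of the idea behind an item-similarity recommender, the sketch below scores each unseen song for a user by its average Jaccard similarity to the songs the user has already played, then keeps the top k. This is a simplified stand-in under stated assumptions, not Turi Create's implementation: the toy listening history, the jaccard and recommend helpers, and the averaging rule are all chosen for illustration.

```python
# Toy listening history: user -> set of songs played. Invented for illustration.
history = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"A", "D"},
}

# Invert to song -> set of users who played it.
listeners = {}
for user, songs in history.items():
    for song in songs:
        listeners.setdefault(song, set()).add(user)


def jaccard(song_a, song_b):
    """Jaccard similarity of two songs based on their listener sets."""
    users_a, users_b = listeners[song_a], listeners[song_b]
    return len(users_a & users_b) / len(users_a | users_b)


def recommend(user, k=1):
    """Return the top-k unseen songs as (song, score) pairs, best first."""
    seen = history[user]
    scores = {}
    for candidate in listeners:
        if candidate in seen:
            continue  # never recommend a song the user already played
        # Score = average similarity between the candidate and the user's songs.
        scores[candidate] = sum(jaccard(candidate, s) for s in seen) / len(seen)
    # Sort by score descending; break ties by song name for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


print(recommend("u1", k=1))
```

Here user u1 is recommended song C rather than D, because C shares more listeners with the songs u1 already played.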

In [32]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [35]:
subset_test_users

dtype: str
Rows: 10000
['c067c22072a17d33310d7223d7b79f819e48cf42', '696787172dd3f5169dc94deef97e427cee86147d', '532e98155cbfd1e1a474a28ed96e59e50f7c5baf', '18325842a941bc58449ee71d659a08d1c1bd2383', '507433946f534f5d25ad1be302edb9a2376f503c', '18fafad477f9d72ff86f7d0bd838a6573de0f64a', 'fe85b96ba1983219b296f6b4869dd29eb2b72ff9', '225ea420b4bede50919d1bfe24a599691522d176', '95dc7e2b188b1148b2d25f4e6b6e94afacc4efc3', '4a3a1ae2748f12f7ab921a47d6d79abf82e3e325', 'a2c1a593432f5e19a9174eb1b3b57e02d3212eb6', 'f9958d5c8e88f53bbbf6a5b82d3062b369497f64', 'c6d5086d22ba5a9c205877770f29bf97e3a5993b', '62c4bf887b7b1e5cf6ab62723481099c7f98377e', '9fd403cf953d4bc8f77980f2bd9dbb174a567d15', '92337c69d4ff1c13f5c9ad4e9c62a0654be9d230', 'e8813fe73c90d69b4f744251c68fce896ec2aede', '5eb761c242ec9d014a4c3f79d1496c342b2bf4f6', '1c4735ef0f9e2b95524f8c92a70be8b355eeb651', '486fc2c57f6a4213bee54f9323a6668a7a306cc3', '11e266d0d2d7a841bd1a2604b14cc05f4bcecd8e', '332c6355122179f10786c61fff699057b36a15c0', '4886fb3

In [42]:
item_similarity_recommender_model.recommend(subset_test_users, k=1)

user_id,song,score,rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424376487731,1
696787172dd3f5169dc94deef 97e427cee86147d ...,Senza Una Donna (Without A Woman) - Zucchero / ...,0.0170265776770455,1
532e98155cbfd1e1a474a28ed 96e59e50f7c5baf ...,Jive Talkin' (Album Version) - Bee Gees ...,0.0118288653237479,1
18325842a941bc58449ee71d6 59a08d1c1bd2383 ...,Goodnight And Goodbye - Jonas Brothers ...,0.0159257985651493,1
507433946f534f5d25ad1be30 2edb9a2376f503c ...,Find The Cost Of Freedom - Crosby_ Stills_ Nash & ...,0.0165806589303193,1
18fafad477f9d72ff86f7d0bd 838a6573de0f64a ...,Rabbit Heart (Raise It Up) - Florence + The ...,0.0799399726092815,1
fe85b96ba1983219b296f6b48 69dd29eb2b72ff9 ...,Secrets - OneRepublic,0.0788827141125996,1
225ea420b4bede50919d1bfe2 4a599691522d176 ...,Clocks - Coldplay,0.0271030251796428,1
95dc7e2b188b1148b2d25f4e6 b6e94afacc4efc3 ...,Bust a Move - Infected Mushroom ...,0.0534738540649414,1
4a3a1ae2748f12f7ab921a47d 6d79abf82e3e325 ...,Isis (Spam Remix) - Alaska Y Dinarama ...,0.0418030211800023,1


In [43]:
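# Caution: this groups song_data, so 'count' is the number of rows per song in the full dataset, not how often the model recommended it; for Task 3, group the SFrame returned by recommend() instead.
popularity_count_data=song_data.groupby(key_column_names='song', operations={'count': turicreate.aggregate.COUNT()})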

In [44]:
popularity_count_data

song,count
Your Star - The All- American Rejects ...,113
Swing - Zero 7,49
Your Love - The Outfield,64
Outside (Original LP Version) - Staind ...,219
Vitamin - Incubus,56
(I Just) Died In Your Arms - Cutting Crew ...,136
Cold Gin - Kiss,39
Californication (Album Version) - Red Hot Chili ...,354
You Sang To Me - Marc Anthony ...,48
Arco Arena - Cake,55


In [45]:
popularity_count_data.sort('count')

song,count
Younger Than Springtime - William Tabbert ...,12
Hubcap - Sleater-kinney,12
Trahison - Vitalic,15
Marching Theme - Neutral Milk Hotel ...,15
Accidntel Deth (Album Version) - Rilo Kiley ...,16
Made In The Dark - Hot Chip ...,16
Bendable Poseable - Hot Chip ...,16
Jumpers (Album) - Sleater-kinney ...,16
I Dont Want To See You - Camera Obscura ...,17
Music Now - Frightened Rabbit ...,17


In [46]:
popularity_count_data.sort('count',ascending=False)

song,count
Sehr kosmisch - Harmonia,5970
Undo - Björk,5281
You're The One - Dwight Yoakam ...,4806
Dog Days Are Over (Radio Edit) - Florence + The ...,4536
Revelry - Kings Of Leon,4339
Horn Concerto No. 4 in E flat K495: II. Romance ...,3949
Secrets - OneRepublic,3916
Tive Sim - Cartola,3185
Fireflies - Charttraxx Karaoke ...,3171
Hey_ Soul Sister - Train,3132
