In [1]:
# Modules

import turicreate as tc

In [2]:
# Load data
song_data = tc.SFrame('song_data.sframe')
song_data.head(5)

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...


**Counting unique users:** The method .unique() can be used to select the unique elements in a column of data. In this question, you will compute the number of unique users who have listened to songs by various artists. For example, to find out the number of unique users who listened to songs by 'Kanye West', all you need to do is select the rows of the song data where the artist is 'Kanye West', and then count the number of unique entries in the ‘user_id’ column. Compute the number of unique users for each of these artists: 'Kanye West', 'Foo Fighters', 'Taylor Swift' and 'Lady GaGa'. Save these results to answer the quiz at the end. 
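
As a rough illustration of what the select-then-`.unique()` step computes, here is a plain-Python sketch on a tiny invented table. The rows, the user ids `u1` to `u3`, and the helper name `count_unique_users` are hypothetical; the sketch does not use Turi Create or the real song data.

```python
# Toy rows standing in for song_data; all values are invented for illustration.
rows = [
    {"user_id": "u1", "artist": "Kanye West"},
    {"user_id": "u1", "artist": "Kanye West"},   # same user, a second song
    {"user_id": "u2", "artist": "Kanye West"},
    {"user_id": "u2", "artist": "Foo Fighters"},
    {"user_id": "u3", "artist": "Foo Fighters"},
]


def count_unique_users(rows, artist):
    """Keep only the rows for one artist, then count distinct user_id values."""
    users = {row["user_id"] for row in rows if row["artist"] == artist}
    return len(users)


kanye_users = count_unique_users(rows, "Kanye West")
foo_users = count_unique_users(rows, "Foo Fighters")
taylor_users = count_unique_users(rows, "Taylor Swift")  # no rows for this artist
```

A user who appears in several rows for the same artist is counted once, which is why the count of unique users can be smaller than the number of selected rows.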

In [3]:
artists = ['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']

for artist in artists:
    artist_data = song_data[song_data['artist'] == artist]
    artist_users = artist_data['user_id'].unique()
    print(f'Number of unique users for {artist}: {len(artist_users)}')

Number of unique users for Kanye West: 2522
Number of unique users for Foo Fighters: 2055
Number of unique users for Taylor Swift: 3246
Number of unique users for Lady GaGa: 2928


**Using groupby-aggregate to find the most popular and least popular artist:** Each row of song_data contains the number of times a user listened to a particular song by a particular artist. If we would like to know how many times any song by 'Kanye West' was listened to, we need to select all the rows where `artist == 'Kanye West'` and sum the `listen_count` column. If we would like to find the most popular artist, we would need to repeat this procedure for each artist, which would be very slow. Instead, you will learn about a very important method: `.groupby()`.

The .groupby method computes an aggregate (in our case, the sum of the `listen_count`) for each distinct value in a column (in our case, the `artist` column).

Follow these steps to find the most popular artist in the dataset:

+ The .groupby method has two important parameters:
  + key_columns, which takes the column we want to group by, in our case, ‘artist’
  + operations, where we define the aggregation operation we are using; in our case, we want to sum over the ‘listen_count’.
+ With this in mind, the following command will compute the sum of listen_count for each artist and return an SFrame with the results:
```Python
song_data.groupby(key_columns='artist',
                  operations={'total_count': tc.aggregate.SUM('listen_count')})
```
The total number of listens for each artist will be stored in ‘total_count’.

+ Sort the resulting SFrame according to ‘total_count’, and find the most popular and least popular artists in the dataset. Save these results to answer the quiz at the end.
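
The group, sum, and sort steps above can be sketched in plain Python to show what the aggregate computes. This is only an illustration under stated assumptions: the artist names and listen counts are invented, and a dictionary stands in for the SFrame that `.groupby()` returns.

```python
from collections import defaultdict

# Invented (artist, listen_count) pairs; not taken from the real dataset.
listens = [
    ("Artist A", 3),
    ("Artist B", 1),
    ("Artist A", 4),
    ("Artist C", 10),
    ("Artist B", 1),
]

# Group by artist and sum listen_count, mirroring a SUM aggregate.
total_count = defaultdict(int)
for artist, listen_count in listens:
    total_count[artist] += listen_count

# Sort by total in descending order, like sort('total_count', ascending=False).
ranked = sorted(total_count.items(), key=lambda item: item[1], reverse=True)
most_popular = ranked[0]    # first row after the descending sort
least_popular = ranked[-1]  # last row after the descending sort
```

The loop touches each row once, which is the reason a single grouped aggregate is preferable to filtering and summing separately for every artist.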

In [8]:
popular_artist = song_data.groupby('artist',
                                   operations={'total_count': tc.aggregate.SUM('listen_count')})

In [11]:
popular_artist = popular_artist.sort('total_count', ascending=False)

In [18]:
print(f"The most popular artist is {popular_artist['artist'][0]} with {popular_artist['total_count'][0]} listens")
print(f"The least popular artist is {popular_artist['artist'][-1]} with {popular_artist['total_count'][-1]} listens")

The most popular artist is Kings Of Leon with 43218 listens
The least popular artist is William Tabbert with 14 listens


**[OPTIONAL] Using groupby-aggregate to find the most recommended songs:** Now that we have learned how to use .groupby() to compute aggregates for each value in a column, let’s use it to find the song that is most recommended by the personalized_model we learned in the Jupyter notebook above. Follow these steps to find the most recommended song:

+ Split the data into 80% training, 20% testing, using seed=0, as was done in the Jupyter notebook above.
+ Train an item_similarity_recommender, as done in the Jupyter notebook, using the training data.
+ Next, we are going to make recommendations for the users in the test data, but there are over 200,000 rows (58,628 unique users) in the test set. Computing recommendations for this many users can be slow on some computers. Thus, we will use only the first 10,000 users in this question. Use this command to select this subset of users:
```Python
subset_test_users = test_data['user_id'].unique()[0:10000]
```
+ Let’s compute one recommended song for each of these test users. Use this command to compute these recommendations:
```Python
personalized_model.recommend(subset_test_users,k=1)
```
+ Finally, we can use .groupby() to find the most recommended song! :) When we used .groupby() in the previous question, we summed up the total ‘listen_count’ for each artist by using the SUM aggregator:
```Python
operations={'total_count': tc.aggregate.SUM('listen_count')}
```
For this question, we simply want to count how often each song is recommended, so we will use the COUNT aggregator instead of SUM, and store the results in a column we will call ‘count’ by using:
```Python
operations={'count': tc.aggregate.COUNT()}
```
And, since we want to use the song titles as the key to the aggregator instead of the ‘artist’, we use:
```Python
key_columns='song'
```
+ By sorting the results, you will find the song most often recommended to the first 10,000 users in the test data! Due to randomness in the train-test split, the most recommended song may come out differently for different people. This is why we chose not to assign a quiz question for this section.
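
The counting step can be sketched in plain Python. Note that the count is taken over the recommendation output (one row per user when k=1), not over the original listening data. The user ids and song names below are invented, and `Counter` is used here only as a stand-in for a COUNT aggregate keyed on ‘song’.

```python
from collections import Counter

# Invented stand-in for the output of recommend(..., k=1):
# one (user_id, song) row per user.
recommendations = [
    ("u1", "Song X"),
    ("u2", "Song Y"),
    ("u3", "Song X"),
    ("u4", "Song Z"),
    ("u5", "Song X"),
]

# Count rows per song, mirroring a COUNT aggregate with 'song' as the key.
song_counts = Counter(song for _user, song in recommendations)

# most_common(1) returns the single highest-count entry.
top_song, top_count = song_counts.most_common(1)[0]
```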

In [19]:
train_data, test_data = song_data.random_split(0.8, seed=0)

In [20]:
personalized_model = tc.item_similarity_recommender.create(train_data,
                                                           user_id = 'user_id',
                                                           item_id = 'song')

In [21]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [23]:
personalized_model.recommend(subset_test_users,k=1).head(5)

user_id,song,score,rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424376487731,1
696787172dd3f5169dc94deef 97e427cee86147d ...,Senza Una Donna (Without A Woman) - Zucchero / ...,0.0170265776770455,1
532e98155cbfd1e1a474a28ed 96e59e50f7c5baf ...,Jive Talkin' (Album Version) - Bee Gees ...,0.0118288653237479,1
18325842a941bc58449ee71d6 59a08d1c1bd2383 ...,Goodnight And Goodbye - Jonas Brothers ...,0.0159257985651493,1
507433946f534f5d25ad1be30 2edb9a2376f503c ...,Find The Cost Of Freedom - Crosby_ Stills_ Nash & ...,0.0165806589303193,1


In [25]:
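# Note: this groups song_data (the full listening data), not the output of
# personalized_model.recommend, so 'count' is the number of rows per song in
# the dataset rather than the number of times each song was recommended.
most_recommended_song = song_data.groupby('song',
                                          operations={'count': tc.aggregate.COUNT()})
most_recommended_song = most_recommended_song.sort('count', ascending=False)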

In [27]:
most_recommended_song.head(5)

song,count
Sehr kosmisch - Harmonia,5970
Undo - Björk,5281
You're The One - Dwight Yoakam ...,4806
Dog Days Are Over (Radio Edit) - Florence + The ...,4536
Revelry - Kings Of Leon,4339
