## Dataset Information

The Million Song Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen_count. The metadata_file contains song_id, title, release, year and artist_name. The dataset is a mixture of songs from various websites, together with the ratings that users gave after listening to them; in the files used here, that feedback is recorded as listen_count.

There are three types of recommendation systems: content-based, collaborative and popularity-based.

## Import modules

In [7]:
import pandas as pd
import numpy as np
import Recommenders  # provides popularity_recommender_py and item_similarity_recommender_py

## Loading the dataset

In [8]:
song_df_1 = pd.read_csv('triplets_file.csv')
song_df_1.head()

Unnamed: 0,user_id,song_id,listen_count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1


In [9]:
song_df_2 = pd.read_csv('song_data.csv')
song_df_2.head()

Unnamed: 0,song_id,title,release,artist_name,year
0,SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
1,SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
2,SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
3,SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
4,SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0


In [10]:
song_df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   song_id      1000000 non-null  object
 1   title        999983 non-null   object
 2   release      999993 non-null   object
 3   artist_name  1000000 non-null  object
 4   year         1000000 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 38.1+ MB


In [11]:
# merge listen records with song metadata, keeping one metadata row per song_id
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999


In [12]:
print(len(song_df_1), len(song_df_2))

2000000 1000000


In [13]:
len(song_df)

2000000
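
The merged frame has the same 2,000,000 rows as the triplet file because the metadata is de-duplicated on song_id before the left merge. The sketch below illustrates why that step matters, using small made-up frames (the names `triplets`, `metadata`, `naive` and `merged` and all values are hypothetical). The `validate="many_to_one"` argument is an optional safeguard that is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the triplet and metadata frames (hypothetical values).
triplets = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "listen_count": [1, 2, 5],
})
metadata = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],  # s1 appears twice on purpose
    "title": ["Song A", "Song A (dup)", "Song B"],
    "artist_name": ["Artist X", "Artist X", "Artist Y"],
})

# Without de-duplication, every listen of s1 matches two metadata rows.
naive = pd.merge(triplets, metadata, on="song_id", how="left")

# De-duplicating first keeps one metadata row per song. validate= raises
# pandas.errors.MergeError if song_id is still not unique on the right side.
merged = pd.merge(
    triplets,
    metadata.drop_duplicates(["song_id"]),
    on="song_id",
    how="left",
    validate="many_to_one",
)

print(len(triplets), len(naive), len(merged))  # 3 5 3
```

Comparing `len(merged)` with `len(triplets)`, as the notebook does, is a quick check that the merge did not multiply rows.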

## Data Preprocessing

In [14]:
# creating new feature combining title and artist name
song_df['song'] = song_df['title']+' - '+song_df['artist_name']
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,title,release,artist_name,year,song
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0,The Cove - Jack Johnson
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976,Entre Dos Aguas - Paco De Lucia
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007,Stronger - Kanye West
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005,Constellations - Jack Johnson
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999,Learn To Fly - Foo Fighters
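
The `info()` output above shows a few missing titles in the metadata. With plain string concatenation, a missing title makes the combined `song` label missing as well. The sketch below shows this on made-up rows; the placeholder text and the `song_filled` column are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the second one has no title.
df = pd.DataFrame({
    "title": ["Stronger", None, "Undo"],
    "artist_name": ["Kanye West", "Unknown Artist", "Björk"],
})

# Plain concatenation: a missing title makes the whole label missing.
df["song"] = df["title"] + " - " + df["artist_name"]

# One possible alternative: substitute a placeholder before concatenating.
df["song_filled"] = df["title"].fillna("Unknown title") + " - " + df["artist_name"]

print(df[["song", "song_filled"]])
```

Rows with a missing `song` label are skipped by `groupby`, so whether to fill or drop them depends on whether those listens should count toward the statistics.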


In [15]:
# taking the first 900k rows for the results
song_df = song_df.head(900000)

In [16]:
# number of listen records per song (counts rows, not the sum of listen_count)
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()

Unnamed: 0,song,listen_count
0,#!*@ You Tonight [Featuring R. Kelly] (Explici...,31
1,#40 - DAVE MATTHEWS BAND,139
2,& Down - Boys Noize,182
3,' Cello Song - Nick Drake,47
4,'97 Bonnie & Clyde - Eminem,39


In [17]:
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])

Unnamed: 0,song,listen_count,percentage
7127,Sehr kosmisch - Harmonia,3762,0.418000
9084,Undo - Björk,3214,0.357111
2068,Dog Days Are Over (Radio Edit) - Florence + Th...,3188,0.354222
9880,You're The One - Dwight Yoakam,2872,0.319111
6774,Revelry - Kings Of Leon,2743,0.304778
...,...,...,...
6632,Radioactive Toy - Porcupine Tree,19,0.002111
6269,Para No Verte Más - La Mosca Tse-Tse,18,0.002000
9544,While You Were Sleeping - Elvis Perkins,18,0.002000
1759,Crying Like A Church On Monday - New Radicals,17,0.001889
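
The aggregation above uses `'count'`, so the listen_count column in the result is the number of listen records per song, not the total number of plays. The sketch below contrasts the two on made-up data; the column names `records` and `total_plays` are hypothetical.

```python
import pandas as pd

# Hypothetical listen records: song B has one record but five plays.
plays = pd.DataFrame({
    "song": ["A", "B", "A", "C"],
    "listen_count": [1, 5, 2, 1],
})

# Named aggregation: count rows and sum plays side by side.
per_song = (
    plays.groupby("song")["listen_count"]
    .agg(records="count", total_plays="sum")
    .reset_index()
)

# Share of all listen records, computed the same way as in the notebook.
per_song["percentage"] = per_song["records"] / per_song["records"].sum() * 100

print(per_song.sort_values(["records", "song"], ascending=[False, True]))
```

Ranking by `records` favours songs that many users listened to, while ranking by `total_plays` can be dominated by a few heavy listeners.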


## Popularity Recommendation Engine

In [18]:
pr = Recommenders.popularity_recommender_py()

In [19]:
pr.create(song_df, 'user_id', 'song')

In [20]:
# display the top 10 popular songs
pr.recommend(song_df['user_id'][5])

Unnamed: 0,user_id,song,score,Rank
7127,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Sehr kosmisch - Harmonia,3762,1.0
9084,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Undo - Björk,3214,2.0
2068,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Dog Days Are Over (Radio Edit) - Florence + Th...,3188,3.0
9880,b80344d063b5ccb3212f76538f3d9e43d87dca9e,You're The One - Dwight Yoakam,2872,4.0
6774,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Revelry - Kings Of Leon,2743,5.0
7115,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Secrets - OneRepublic,2618,6.0
3613,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Horn Concerto No. 4 in E flat K495: II. Romanc...,2434,7.0
2717,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Fireflies - Charttraxx Karaoke,2186,8.0
3485,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Hey_ Soul Sister - Train,2128,9.0
8847,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Tive Sim - Cartola,2082,10.0


In [21]:
pr.recommend(song_df['user_id'][100])

Unnamed: 0,user_id,song,score,Rank
7127,e006b1a48f466bf59feefed32bec6494495a4436,Sehr kosmisch - Harmonia,3762,1.0
9084,e006b1a48f466bf59feefed32bec6494495a4436,Undo - Björk,3214,2.0
2068,e006b1a48f466bf59feefed32bec6494495a4436,Dog Days Are Over (Radio Edit) - Florence + Th...,3188,3.0
9880,e006b1a48f466bf59feefed32bec6494495a4436,You're The One - Dwight Yoakam,2872,4.0
6774,e006b1a48f466bf59feefed32bec6494495a4436,Revelry - Kings Of Leon,2743,5.0
7115,e006b1a48f466bf59feefed32bec6494495a4436,Secrets - OneRepublic,2618,6.0
3613,e006b1a48f466bf59feefed32bec6494495a4436,Horn Concerto No. 4 in E flat K495: II. Romanc...,2434,7.0
2717,e006b1a48f466bf59feefed32bec6494495a4436,Fireflies - Charttraxx Karaoke,2186,8.0
3485,e006b1a48f466bf59feefed32bec6494495a4436,Hey_ Soul Sister - Train,2128,9.0
8847,e006b1a48f466bf59feefed32bec6494495a4436,Tive Sim - Cartola,2082,10.0
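
Both users receive the same ten songs because a popularity recommender ranks songs by overall score and does not personalise the list. The source of the `Recommenders` module is not shown here, so the following is only a minimal sketch of how such a recommender might work; the function `popularity_recommend` and its parameters are hypothetical.

```python
import pandas as pd

def popularity_recommend(train, user_id, user_col="user_id", item_col="song", top_n=10):
    """Return the top_n items by number of listen records, labelled with user_id."""
    scores = (
        train.groupby(item_col)[user_col]
        .count()                      # listen records per item
        .rename("score")
        .reset_index()
        .sort_values(["score", item_col], ascending=[False, True])
        .head(top_n)
    )
    # Rank 1.0 is the most popular item; ties are broken by sort order.
    scores["Rank"] = scores["score"].rank(ascending=False, method="first")
    scores.insert(0, user_col, user_id)
    return scores

# Hypothetical training data.
train = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})

rec_u1 = popularity_recommend(train, "u1", top_n=2)
rec_u3 = popularity_recommend(train, "u3", top_n=2)
print(rec_u1)
print(rec_u3)
```

Only the user_id column differs between the two results, which mirrors the two outputs above.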


## Item Similarity Recommendation

In [22]:
ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song')

In [23]:
user_items = ir.get_user_items(song_df['user_id'][5])

In [24]:
# display the user's listening history
for user_item in user_items:
    print(user_item)

The Cove - Jack Johnson
Entre Dos Aguas - Paco De Lucia
Stronger - Kanye West
Constellations - Jack Johnson
Learn To Fly - Foo Fighters
Apuesta Por El Rock 'N' Roll - Héroes del Silencio
Paper Gangsta - Lady GaGa
Stacked Actors - Foo Fighters
Sehr kosmisch - Harmonia
Heaven's gonna burn your eyes - Thievery Corporation feat. Emiliana Torrini
Let It Be Sung - Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version) - Puff Daddy
Love Shack - The B-52's
Clarity - John Mayer
I?'m A Steady Rollin? Man - Robert Johnson
The Old Saloon - The Lonely Island
Behind The Sea [Live In Chicago] - Panic At The Disco
Champion - Kanye West
Breakout - Foo Fighters
Ragged Wood - Fleet Foxes
Mykonos - Fleet Foxes
Country Road - Jack Johnson / Paula Fuga
Oh No - Andrew Bird
Love Song For No One - John Mayer
Jewels And Gold - Angus & Julia Stone
83 - John Mayer
Neon - John Mayer
The Middle - Jimmy Eat World
High and dry - Jorge Drexle

In [None]:
# recommend songs for that user based on item similarity
ir.recommend(song_df['user_id'][5])

No. of unique songs for the user: 45
no. of unique songs in the training set: 9953


In [None]:
# find songs similar to the given songs
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])
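
The internals of `item_similarity_recommender_py` are not shown in this notebook. A common way to build an item-similarity recommender is to compare the sets of users who listened to each song, for example with the Jaccard index. The sketch below assumes that approach; the function `jaccard_similar_items`, its parameters and the toy data are hypothetical and may differ from what the `Recommenders` module actually does.

```python
import pandas as pd

def jaccard_similar_items(train, seed_items, user_col="user_id", item_col="song", top_n=5):
    """Rank items by their mean Jaccard similarity to the seed items."""
    # Map each item to the set of users who listened to it.
    listeners = {}
    for user, item in zip(train[user_col], train[item_col]):
        listeners.setdefault(item, set()).add(user)

    # Ignore seed items that never appear in the training data.
    seeds = [s for s in seed_items if s in listeners]
    scores = {}
    for candidate, cand_users in listeners.items():
        if candidate in seeds:
            continue
        sims = []
        for seed in seeds:
            seed_users = listeners[seed]
            union = len(seed_users | cand_users)
            sims.append(len(seed_users & cand_users) / union if union else 0.0)
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0

    # Highest similarity first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical listening history.
train = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "song": ["A", "B", "A", "B", "A", "C", "D"],
})

similar = jaccard_similar_items(train, ["A"])
print(similar)
```

In this toy data, song B shares two of song A's three listeners, so it ranks above C and D. Recommending for a user then amounts to using that user's listening history as the seed items.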