<a href="https://colab.research.google.com/github/parsa-abbasi/intro-to-nlp/blob/main/NLP_songs_recommandation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Song Embeddings - Skipgram Recommender

**Note:** This notebook is based on [this implementation](https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/nlp/02_Song_Embeddings.ipynb#scrollTo=Qn0kiKeiJzR3) by [Jay Alammar](https://jalammar.github.io/).

Suppose you're building a music streaming service. You want to recommend songs to your users. One way to do that is to recommend songs that are similar to the ones they've listened to. But how do you know which songs are similar to each other? You could use the song's metadata (artist, genre, etc.) to find similar songs. But that's not always accurate. For example, two songs could be in the same genre but sound nothing alike. So how do you find similar songs?

One creative way is to learn embeddings for songs. Embeddings are a way to represent a song as a vector of numbers. The idea is that similar songs will have similar embeddings. So if you want to find similar songs, you can find the ones with the closest embeddings.
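
To make "closest embeddings" concrete, here is a minimal sketch based on cosine similarity, the measure that gensim's `most_similar` reports. The three-dimensional vectors and the song names are made up for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (hand-written, not learned from data)
toy_embeddings = {
    "metal_song_a": [0.9, 0.1, 0.0],
    "metal_song_b": [0.8, 0.2, 0.1],
    "country_song": [0.1, 0.9, 0.3],
}

def nearest_songs(query, embeddings, topn=2):
    """Rank every other song by cosine similarity to the query song."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = nearest_songs("metal_song_a", toy_embeddings)
print(ranked)
```

The two "metal" vectors point in nearly the same direction, so `metal_song_b` ranks ahead of `country_song`.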

Since we already have listening histories (in this notebook, playlists), we can learn the embeddings from them with the word2vec algorithm. Word2vec was originally designed to learn embeddings for words, but it works on any sequence of items: here each playlist plays the role of a sentence and each song the role of a word.

This technique is used by Spotify, Airbnb, Alibaba, and others, where recommendations of this kind account for a large share of user activity, media consumption, and/or sales (in the case of Alibaba).

## Libraries

In [None]:
import numpy as np
import pandas as pd
import gensim
from gensim.models import Word2Vec
from urllib import request
import warnings
warnings.filterwarnings('ignore')

## Dataset

The [dataset we'll use](https://www.cs.cornell.edu/~shuochen/lme/data_page.html) was collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US, retrieved from the Yes.com website.

### Playlists data

The playlist dataset is a `txt` file where every line represents a playlist. That playlist is basically a series of song IDs.

Format of the playlist data:
* The first line of the data file lists the songs' external IDs (IDs from other sources that identify the songs, not the integer IDs), separated by spaces.
* The second line gives the number of appearances of each song in the file, also separated by spaces.
* The playlists start on the third line, with each song represented by its integer ID in this file (from 0 to the total number of songs minus one).
* Note that every line in the playlist data file ends with a space.
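
The layout described above can be parsed along these lines. This is a sketch on a made-up in-memory sample, not on the real file; the function name `parse_playlist_file` and the sample IDs and counts are assumptions for illustration.

```python
def parse_playlist_file(text):
    """Split raw file text into (external_ids, counts, playlists)."""
    lines = text.split("\n")
    external_ids = lines[0].split()              # line 1: external song IDs
    counts = [int(c) for c in lines[1].split()]  # line 2: appearances per song
    # split() with no argument ignores the trailing space on each line
    playlists = [line.split() for line in lines[2:]]
    # Keep only playlists with at least two songs (this also drops empty lines)
    playlists = [p for p in playlists if len(p) > 1]
    return external_ids, counts, playlists

# Made-up sample in the described layout (three songs, three playlists)
sample = "idA idB idC \n2 3 1 \n0 1 2 \n1 \n0 1 \n"
ids, counts, sample_playlists = parse_playlist_file(sample)
print(sample_playlists)
```

The single-song playlist `1` and the empty string after the final newline are both filtered out, which mirrors the filtering done in the cell that loads the real data.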

You can download the dataset from [here](https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt).

In [None]:
# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

In [None]:
print('Playlist #1:\n', playlists[0], '\n')
print('Playlist #2:\n', playlists[1])

Playlist #1:
 ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
 ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '

In [None]:
print('Total number of playlists:', len(playlists))

Total number of playlists: 11088


### Songs information data

The title and artist information for each song is stored in a separate file named `song_hash.txt`. Each line corresponds to one song, and has the format `Integer_ID \t Title \t Artist \n` (The spaces here are only for making it easy to read. They do not exist in the real data file.)
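
As a sketch of how lines in this format could be parsed defensively, the helper below strips whitespace from each field and skips blank lines. The function name `parse_song_lines` and the sample rows are hypothetical; the trailing space after the ID in the sample mimics what the raw file appears to contain.

```python
def parse_song_lines(text):
    """Return {integer_id: (title, artist)} from tab-separated lines."""
    songs = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, e.g. the one after the final newline
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 3:
            continue  # ignore lines that do not have exactly three fields
        song_id, title, artist = fields
        songs[int(song_id)] = (title, artist)
    return songs

# Hypothetical sample rows (placeholder titles and artists)
sample = "0 \tTitle A\tArtist A\n1 \tTitle B\tArtist B\n"
song_lookup = parse_song_lines(sample)
print(song_lookup[1])
```

Stripping the fields yields clean integer IDs, and skipping blank lines avoids an empty trailing row.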

You can download the dataset from [here](https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt).

In [None]:
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')

songs_file = songs_file.read().decode("utf-8").split('\n')

songs = [s.rstrip().split('\t') for s in songs_file]

In [None]:
songs[:3]

[['0 ', 'Gucci Time (w\\/ Swizz Beatz)', 'Gucci Mane'],
 ['1 ', 'Aston Martin Music (w\\/ Drake & Chrisette Michelle)', 'Rick Ross'],
 ['2 ', 'Get Back Up (w\\/ Chris Brown)', 'T.I.']]

In [None]:
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
songs_df

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow
...,...,...
75258,USA Today,Alan Jackson
75259,Superstar,Raul Malo
75260,Romancin' The Blues,Giacomo Gates
75261,Inner Change,The Jazzmasters


In [None]:
# The last row is just None values
songs_df = songs_df[:-1]
songs_df

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow
...,...,...
75257,Dearest (I'm So Sorry),Picture Me Broken
75258,USA Today,Alan Jackson
75259,Superstar,Raul Malo
75260,Romancin' The Blues,Giacomo Gates


In [None]:
# Let's see the songs in the first playlist
print('Playlist #1:\n')
songs_df.iloc[np.array(playlists[0], dtype=np.int32)]

Playlist #1:



id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow
...,...,...
76,Get It All (w\/ Nicki Minaj),Sean Garrett
77,You Be Killin Em,Fabolous
59,"Monster (w\/ Rick Ross, Jay-Z, Nicki Minaj & B...",Kanye West
20,Your Love,Nicki Minaj


In [None]:
print('Playlist #2:\n')
songs_df.iloc[np.array(playlists[1], dtype=np.int32)]

Playlist #2:



id,title,artist
78,Soca Bhangra,Bunji Garlin
79,Fettin On (w\/ Machel Montano),Skinny Fabulous
80,Ants In Yuh Sugar Pan,Jamesy P
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
62,Runaway (w\/ Pusha T),Kanye West
...,...,...
207,Same Ol' G,Ginuwine
32,You Make Me Wanna...,Usher
208,My Love Is Your Love,Whitney Houston
209,Can't Be Friends,Trey Songz


## Training the Word2Vec Model

Our dataset is now in the shape that the Word2Vec model expects as input. We pass the dataset to the model and set the following key parameters:

 * **vector_size**: Embedding size for the songs.
 * **window**: word2vec algorithm parameter -- maximum distance between the current and predicted word (song) within a sentence
 * **negative**: word2vec algorithm parameter -- Number of negative examples to use at each training step that the model needs to identify as noise
 * **min_count**: word2vec algorithm parameter -- Ignores all words with total frequency lower than this
 * **epochs**: Number of iterations (epochs) over the corpus. More epochs mean longer training and often, though not always, better embeddings.
 * **workers**: Number of worker threads to train the model (faster training with multicore machines)

You can find more information about the parameters [here](https://radimrehurek.com/gensim/models/word2vec.html).
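
To illustrate what `window` controls, the sketch below lists the (center, context) song pairs that fall within a given window of a toy playlist. It is a simplified picture under stated assumptions: the helper `context_pairs` and the toy IDs are made up, and gensim's actual training differs in details (for example, it may sample a smaller effective window at random, and whether it trains skip-gram or CBOW depends on the `sg` parameter, whose gensim default is CBOW).

```python
def context_pairs(playlist, window):
    """Return (center, context) pairs for songs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(playlist):
        start = max(0, i - window)
        end = min(len(playlist), i + window + 1)
        for j in range(start, end):
            if j != i:  # a song is not its own context
                pairs.append((center, playlist[j]))
    return pairs

toy_playlist = ["10", "11", "12", "13"]  # hypothetical song IDs
pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

With `window=1` only adjacent songs are paired. With a window at least as long as the playlist, every song is paired with every other song in that playlist.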

In [None]:
import time
start = time.time()
model = Word2Vec(playlists, vector_size=50, window=20, negative=50, min_count=1, workers=4, epochs=5)
end = time.time()
print('Time to build the model: {} mins'.format(round((end - start) / 60, 2)))

Time to build the model: 1.95 mins


## Recommending Similar Songs

Let's now pick a song, and see what similar songs the model recommends:

### Get similar songs to a specific song

In [None]:
song_id = 2172

songs_df.iloc[song_id]

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object

In [None]:
# Ask the model for songs similar to the selected song
most_similar = model.wv.most_similar(positive=str(song_id))
most_similar

[('3167', 0.9984625577926636),
 ('2976', 0.997577965259552),
 ('11517', 0.996544361114502),
 ('3094', 0.9961346387863159),
 ('2849', 0.9955669641494751),
 ('6624', 0.9952481985092163),
 ('1922', 0.9951440691947937),
 ('5586', 0.9951029419898987),
 ('2014', 0.9946932792663574),
 ('5549', 0.9943324327468872)]

In [None]:
# Get the song information using the song IDs
# (most_similar returns the IDs as strings, so cast them to integers for iloc)
similar_songs = np.array(most_similar)[:,0].astype(np.int32)
songs_df.iloc[similar_songs]

id,title,artist
3167,Unchained,Van Halen
2976,I Don't Know,Ozzy Osbourne
11517,Mary Had A Little Lamb,Stevie Ray Vaughan & Double Trouble
3094,Breaking The Law,Judas Priest
2849,Run To The Hills,Iron Maiden
6624,Everybody Wants Some!!!,Van Halen
1922,One,Metallica
5586,The Last In Line,Dio
2014,Youth Gone Wild,Skid Row
5549,November Rain,Guns N' Roses


In [None]:
# Let's create a function that prints the selected song and returns its recommendations
def print_recommendations(song_id):
    print(songs_df.iloc[song_id])
    # most_similar returns the IDs as strings, so cast them to integers for iloc
    similar_songs = np.array(model.wv.most_similar(positive=str(song_id)))[:,0].astype(np.int32)
    return songs_df.iloc[similar_songs]

### Compute similarity between two songs

In [None]:
# Find artists with the most songs in the dataset
songs_df['artist'].value_counts()[:20]

-                       1812
The Beatles              201
Frank Sinatra            166
Vicente Fernandez        166
Metallica                141
The Rolling Stones       127
Los Tigres Del Norte     125
Miles Davis              120
Bob Dylan                105
Led Zeppelin             101
Ray Charles               96
George Strait             95
Pink Floyd                93
Joan Sebastian            92
Kenny G                   92
Billie Holiday            86
Johnny Cash               85
Ella Fitzgerald           83
U2                        83
AC\/DC                    80
Name: artist, dtype: int64

In [None]:
# Find songs by Johnny Cash
# John R. Cash was an American country singer-songwriter. Most of Cash's music contained themes of sorrow, moral tribulation, and redemption, especially in the later stages of his career.
songs_df[songs_df['artist']=='Johnny Cash']

id,title,artist
1383,Folsom Prison Blues,Johnny Cash
1454,Home Of The Blues,Johnny Cash
1641,Port Of Lonely Hearts,Johnny Cash
1642,God's Gonna Cut You Down,Johnny Cash
8921,The Preacher Said 'Jesus Said',Johnny Cash
...,...,...
71738,The Christmas Guest,Johnny Cash
73248,I Will Rock And Roll With You,Johnny Cash
73336,Train Of Love,Johnny Cash
73844,Country Boy,Johnny Cash


In [None]:
# Find songs by Nicki Minaj
# Onika Tanya Maraj-Petty, known professionally as Nicki Minaj, is a Trinidadian-born rapper, singer, and songwriter based in the United States. Often referred to as the "Queen of Rap", she is known for her musical versatility, animated flow in her rapping, alter egos, and influence in popular music.
songs_df[songs_df['artist']=='Nicki Minaj']

id,title,artist
20,Your Love,Nicki Minaj
25,Right Thru Me,Nicki Minaj
43,I Get Crazy (w\/ Lil Wayne),Nicki Minaj
147,Massive Attack (w\/ Sean Garrett),Nicki Minaj
20065,Moment 4 Life (w\/ Drake),Nicki Minaj
20075,Roman's Revenge (w\/ Eminem),Nicki Minaj
20634,Did It On' Em,Nicki Minaj
20644,Save Me,Nicki Minaj
23900,Blazin' (w\/ Kanye West),Nicki Minaj
25865,Here I Am,Nicki Minaj


In [None]:
# Find songs by Pink Floyd
# Pink Floyd are an English rock band formed in London in 1965. Gaining an early following as one of the first British psychedelic groups, they were distinguished by their extended compositions, sonic experimentation, philosophical lyrics and elaborate live shows.
songs_df[songs_df['artist']=='Pink Floyd']

id,title,artist
1872,Hey You,Pink Floyd
1960,Comfortably Numb,Pink Floyd
2072,Money,Pink Floyd
2089,Wish You Were Here,Pink Floyd
2579,Have A Cigar,Pink Floyd
...,...,...
72046,Hey You,Pink Floyd
72478,A Saucerful Of Secrets,Pink Floyd
72516,Summer '68,Pink Floyd
73326,Take Up Thy Stethoscope And Walk,Pink Floyd


In [None]:
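# Similarity between Folsom Prison Blues - Johnny Cash and Your Love - Nicki Minaj
# (the model's keys are strings; gensim treats an integer argument as a vocabulary index, not a song ID)
model.wv.similarity('1383', '20')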

0.3566965

In [None]:
# Similarity between Folsom Prison Blues - Johnny Cash and Home Of The Blues - Johnny Cash
# (song IDs are passed as strings because the model's keys are strings)
model.wv.similarity('1383', '1454')

0.7779931

In [None]:
# Similarity between Folsom Prison Blues - Johnny Cash and Hey You - Pink Floyd
# (song IDs are passed as strings because the model's keys are strings)
model.wv.similarity('1383', '1872')

0.43616733

In [None]:
# Get the most similar songs to Folsom Prison Blues - Johnny Cash
print_recommendations(1383)

title     Folsom Prison Blues
artist            Johnny Cash
Name: 1383 , dtype: object


id,title,artist
10918,I Walk The Line,Johnny Cash
9727,Don't It Make My Brown Eyes Blue,Crystal Gayle
9872,I Wouldn't Have Missed It For The World,Ronnie Milsap
10654,Unwound,George Strait
9787,I'll Still Be Loving You,Restless Heart
6489,The Gambler,Kenny Rogers
10823,On The Road Again,Willie Nelson
10724,Why'd You Come In Here Lookin' Like That,Dolly Parton
10726,Big City - (Newly Recorded Version),Merle Haggard
10993,I'm A Ramblin Man,Waylon Jennings


## Recommending Similar Artists

Let's now pick an artist, and see what similar artists the model recommends. How do we do that?

One way is to take every song by the artist, find the $k$ most similar songs to each one, and record the artists of those songs. We then count how many times each artist appears in the resulting list. The artists with the highest counts are the most similar artists.

Suppose we want to find artists similar to Pink Floyd. First, we find all songs by Pink Floyd.

In [None]:
pink_floyd_songs = songs_df[songs_df['artist']=='Pink Floyd']
pink_floyd_songs

id,title,artist
1872,Hey You,Pink Floyd
1960,Comfortably Numb,Pink Floyd
2072,Money,Pink Floyd
2089,Wish You Were Here,Pink Floyd
2579,Have A Cigar,Pink Floyd
...,...,...
72046,Hey You,Pink Floyd
72478,A Saucerful Of Secrets,Pink Floyd
72516,Summer '68,Pink Floyd
73326,Take Up Thy Stethoscope And Walk,Pink Floyd


In the first iteration, we find the $k=5$ songs most similar to Pink Floyd's first song (with `ID=1872`).



In [None]:
model.wv.most_similar(positive='1872', topn=5)

[('3050', 0.9951059818267822),
 ('3093', 0.9941794872283936),
 ('1845', 0.994076669216156),
 ('2637', 0.9928159713745117),
 ('2811', 0.9921959042549133)]

Next, we extract the IDs of these songs so that we can look up their artists. We convert the result to a numpy array of song IDs.

In [None]:
similar_songs = np.array(model.wv.most_similar(positive=str(1872), topn=5))[:,0]
similar_songs

array(['3050', '3093', '1845', '2637', '2811'], dtype='<U32')

In [None]:
similar_songs.astype(np.int32)

array([3050, 3093, 1845, 2637, 2811], dtype=int32)

Now we can find the artist of each song in the list of similar songs.

In [None]:
songs_df.iloc[similar_songs.astype(np.int32)]['artist'].values

array(['Joe Walsh', 'Blue Oyster Cult', 'The Jimi Hendrix Experience',
       'The Who', 'Stevie Ray Vaughan & Double Trouble'], dtype=object)

The full code for finding similar artists is shown below:

In [None]:
artist_name = 'Pink Floyd'
selected_songs = songs_df[songs_df['artist']==artist_name]

most_similar_artists = []
for song_id in selected_songs.index:
    # most_similar returns the IDs as strings, so cast them to integers for iloc
    similar_songs = np.array(model.wv.most_similar(positive=str(int(song_id)), topn=5))[:,0].astype(np.int32)
    most_similar_artists.extend(songs_df.iloc[similar_songs]['artist'].values)

pd.Series(most_similar_artists).value_counts()[:5]


Pink Floyd      25
Led Zeppelin    16
The Beatles     15
AC\/DC          14
ZZ Top          10
dtype: int64

In [None]:
artist_name = 'Metallica'
selected_songs = songs_df[songs_df['artist']==artist_name]

most_similar_artists = []
for song_id in selected_songs.index:
    # most_similar returns the IDs as strings, so cast them to integers for iloc
    similar_songs = np.array(model.wv.most_similar(positive=str(int(song_id)), topn=5))[:,0].astype(np.int32)
    most_similar_artists.extend(songs_df.iloc[similar_songs]['artist'].values)

pd.Series(most_similar_artists).value_counts()[:5]

Metallica        56
Pink Floyd       11
Guns N' Roses    11
Godsmack         10
Motley Crue       9
dtype: int64
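
In both results the query artist tops its own list, which is expected because songs by the same artist tend to be played together. A possible variant, sketched below on toy data, skips the query artist when counting. The function `similar_artists` and both dictionaries are hypothetical: `song_artist` stands in for the `artist` column of `songs_df`, and `neighbors` stands in for the song IDs returned by `model.wv.most_similar`.

```python
from collections import Counter

def similar_artists(artist, song_artist, neighbors, topn=5, exclude_self=True):
    """Count which artists appear among the neighbors of `artist`'s songs."""
    votes = Counter()
    for song_id, name in song_artist.items():
        if name != artist:
            continue  # only look at songs by the query artist
        for neighbor_id in neighbors.get(song_id, []):
            neighbor_artist = song_artist[neighbor_id]
            if exclude_self and neighbor_artist == artist:
                continue  # do not let the artist vote for itself
            votes[neighbor_artist] += 1
    return votes.most_common(topn)

# Hypothetical toy data: song ID -> artist, and song ID -> nearest song IDs
song_artist = {1: "A", 2: "A", 3: "B", 4: "C", 5: "B"}
neighbors = {1: [2, 3, 4], 2: [1, 3, 5]}
print(similar_artists("A", song_artist, neighbors))
```

Setting `exclude_self=False` reproduces the counting used in the cells above, where the query artist is included in the tally.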