<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/recommendation/applications/recommendations/Song%20Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Song Recommendation Using Word2Vec

## Recommendation systems fall under two broad categories:

- **`Content-based systems`** base their recommendations on the features of the items themselves. For music, this includes, for example, the genre of a song or its number of beats per minute.

- **`Collaborative filtering-based systems`** rely on historical usage data to recommend items that similar users have previously interacted with. These systems are oblivious to the features of the content itself and base their recommendations on the principle that people who have many songs or artists in common will generally like the same styles of music.

With enough data, collaborative filtering systems turn out to be effective at recommending relevant items. The basic idea behind collaborative filtering is that if user 1 likes artists A & B, and user 2 likes artists A, B & C, then it is likely that user 1 will also be interested in artist C.
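
The user 1 / user 2 example can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not the method used in the rest of this notebook; the `likes` data and the `recommend` helper are made up for the example.

```python
from collections import Counter

# Hypothetical listening history: user -> set of liked artists.
likes = {
    "user1": {"A", "B"},
    "user2": {"A", "B", "C"},
    "user3": {"D"},
}

def recommend(user, likes):
    """Rank artists the user has not liked yet by overlap with other users."""
    own = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(own & theirs)  # number of shared artists
        if overlap == 0:
            continue  # users with nothing in common contribute nothing
        for artist in theirs - own:
            scores[artist] += overlap
    return [artist for artist, _ in scores.most_common()]

print(recommend("user1", likes))  # ['C']
```

User 3 shares no artists with user 1, so artist D is never suggested.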

I highly recommend going through the following Medium post:

- [Word2Vec for music recommendation](https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484)

## Word2Vec 

Don't know what Word2Vec is? Check out my [repo](https://github.com/graviraja/100-Days-of-NLP/tree/master/embeddings).

The Word2vec Skip-gram model is a shallow neural network with a single hidden layer that takes in a word as input and tries to predict the context of words around it as output.
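
To make "predict the context of words around it" concrete, here is a minimal sketch of how (input word, context word) training pairs could be generated from a sentence. The function name `skipgram_pairs` is hypothetical, and real implementations typically add refinements such as subsampling and randomly shrinking the window, which are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

With `window=1`, each word is paired only with its direct neighbours, for example `('quick', 'the')` and `('quick', 'brown')`.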

But how does that relate to music recommendations? Well, we can think of a user’s listening queue as a sentence, with each word in that sentence being a song that the user has listened to. So then, training the Word2vec model on those sentences essentially means that for each song the user has listened to in the past, we’re using the songs they have listened to before and after to teach our model that those songs somehow belong to the same context. Here’s an idea of what the neural network would look like with songs instead of words:

![song_word2vec](https://drive.google.com/uc?id=1pA62ssUL_883vMYEj83PDox-b5T8mU1H)

This is the same approach as for the text described above, except that instead of textual words we now have a unique identifier for each song.
What we get at the end of the training phase is a model where each song is represented by a vector of weights in a high-dimensional space. What’s interesting about those vectors is that similar songs have vectors that lie closer together than those of unrelated songs.
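
"Closer together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors and hypothetical song names purely for illustration; the vectors learned later in this notebook have 32 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not taken from a trained model.
song_vectors = {
    "metal_song_1": [0.9, 0.1, 0.0],
    "metal_song_2": [0.8, 0.2, 0.1],
    "pop_song": [0.0, 0.2, 0.9],
}

same_style = cosine_similarity(song_vectors["metal_song_1"], song_vectors["metal_song_2"])
other_style = cosine_similarity(song_vectors["metal_song_1"], song_vectors["pop_song"])
print(same_style > other_style)  # True
```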

### Resources


- [Word2vec for music recommendations](https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484)

- [Intuition & Uses-cases of Embeddings in NLP](https://www.youtube.com/watch?v=4-QoMdSqG_I)

# Let's Code !!!

## Initial Setup

In [0]:
import gensim
import warnings

import numpy as np
import pandas as pd

from gensim.models import Word2Vec
from urllib import request

warnings.filterwarnings('ignore')

## Dataset

The dataset was collected by Shuo Chen from Cornell University and can be found [here](https://www.cs.cornell.edu/~shuochen/lme/data_page.html).

In [4]:
!wget https://www.cs.cornell.edu/~shuochen/lme/dataset.tar.gz

--2020-05-21 16:02:54--  https://www.cs.cornell.edu/~shuochen/lme/dataset.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15344424 (15M) [application/x-gzip]
Saving to: ‘dataset.tar.gz’


2020-05-21 16:02:56 (8.51 MB/s) - ‘dataset.tar.gz’ saved [15344424/15344424]



In [5]:
!tar -xvf dataset.tar.gz

dataset/
dataset/._.DS_Store
dataset/.DS_Store
dataset/README
dataset/yes_big/
dataset/yes_complete/
dataset/yes_small/
dataset/yes_small/song_hash.txt
dataset/yes_small/tag_hash.txt
dataset/yes_small/tags.txt
dataset/yes_small/test.txt
dataset/yes_small/train.txt
dataset/yes_complete/song_hash.txt
dataset/yes_complete/tag_hash.txt
dataset/yes_complete/tags.txt
dataset/yes_complete/test.txt
dataset/yes_complete/train.txt
dataset/yes_big/song_hash.txt
dataset/yes_big/tag_hash.txt
dataset/yes_big/tags.txt
dataset/yes_big/test.txt
dataset/yes_big/train.txt


In [6]:
!ls dataset

README	yes_big  yes_complete  yes_small
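
The `!wget` and `!tar` cells above need shell access from the notebook. A minimal Python sketch of the same two steps, using the `request` module that the setup cell already imports, might look like this. The helper name `download_and_extract` and its parameters are hypothetical.

```python
import tarfile
from pathlib import Path
from urllib import request

DATASET_URL = "https://www.cs.cornell.edu/~shuochen/lme/dataset.tar.gz"

def download_and_extract(url, archive_path="dataset.tar.gz", target_dir="."):
    """Download a .tar.gz archive unless it already exists, then unpack it."""
    archive = Path(archive_path)
    if not archive.exists():
        request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target_dir)
    # Return the names found in the target directory, like `ls` would.
    return sorted(p.name for p in Path(target_dir).iterdir())

# Uncomment to fetch the real dataset (about 15 MB):
# download_and_extract(DATASET_URL)
```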


## Inspecting dataset


> The collection lasted from December 2010 to May 2011. This lead to a dataset of 75,262 songs and 2,840,553 transitions. To get datasets of various sizes, we pruned the raw data so that only the songs with a number of appearances above a certain threshold are kept. We then divide the pruned set into a training set and a testing set, making sure that each song has appeared at least once in the training set. We name them as yes_small, yes_big and yes_complete, whose basic statistics are shown below.


| Property | yes_small | yes_big | yes_complete |
| --- | ---: | ---: | ---: |
| Appearance Threshold | 20 | 5 | 0 |
| Number of Songs | 3,168 | 9,775 | 75,262 |
| Number of Train Transitions | 134,431 | 172,510 | 1,542,372 |
| Number of Test Transitions | 1,191,279 | 1,602,079 | 1,298,181 |

We will use `yes_complete` for the recommendations.


In [0]:
# Get the playlist dataset file
data_file = 'dataset/yes_complete/train.txt'
songs_file = 'dataset/yes_complete/song_hash.txt'

In [0]:
with open(data_file, 'r', encoding='utf-8') as f:
    data = f.read().split('\n')
 
with open(songs_file, 'r', encoding='utf-8') as f:
    songs = f.read().split('\n')


In [37]:
# The first line of the data file is the IDs (not the integer ID, 
# but IDs from other sources for identifying the songs) for the songs, separated by a space.
data[0][:99]

'17430147 17277121 17767569 17352501 17567841 17650342 17572001 17646522 17451245 17451162 17706101 '

In [11]:
len(data[0].split())

75262

In [39]:
# The second line holds the number of appearances of each song in the file, also separated by spaces.
data[1][:99]

'138 2833 297 502 700 5041 3235 72 1004 2 1 116 448 2300 2684 2 5 612 171 864 295 33 106 87 239 1974'

In [13]:
len(data[1].split())

75262

In [15]:
# Starting from the third line are the playlists
# with each song represented by its integer ID in this file (from 0 to the total number of songs minus one).
# Note that in the playlist data file, each line is ended with a space.
data[2]

'0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 2 42 43 44 45 46 47 48 20 49 8 50 51 52 53 54 55 56 57 25 58 59 60 61 62 3 63 64 65 66 46 47 67 2 48 68 69 70 57 50 71 72 53 73 25 74 59 20 46 75 76 77 59 20 43 '

In [25]:
# num of playlists
len(data[2:])

11138
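
The three parts of the file format inspected above (external IDs, appearance counts, playlists) can be pulled apart in one place. This is a sketch on a tiny made-up sample; `parse_playlist_file` is a hypothetical helper, not part of the dataset's tooling.

```python
def parse_playlist_file(text):
    """Split raw file text into external song IDs, appearance counts, and playlists."""
    lines = text.split("\n")
    external_ids = lines[0].split()
    counts = [int(c) for c in lines[1].split()]
    # split() without arguments ignores the trailing space on each line;
    # blank lines, if any, are skipped.
    playlists = [line.split() for line in lines[2:] if line.strip()]
    return external_ids, counts, playlists

# Made-up sample in the same layout as train.txt.
sample = "111 222 333 \n5 2 7 \n0 1 2 1 \n2 0 \n"
ids, counts, playlists = parse_playlist_file(sample)
print(ids)        # ['111', '222', '333']
print(counts)     # [5, 2, 7]
print(playlists)  # [['0', '1', '2', '1'], ['2', '0']]
```

Song IDs inside playlists are kept as strings because Word2Vec treats each one as a token.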

In [17]:
# Each line corresponds to one song and has the format
# Integer_ID \t Title \t Artist \n
# (The spaces here are only for readability. They do not exist in the real data file.)
songs = [s.rstrip().split('\t') for s in songs]
songs[:3]

[['0 ', 'Gucci Time (w\\/ Swizz Beatz)', 'Gucci Mane'],
 ['1 ', 'Aston Martin Music (w\\/ Drake & Chrisette Michelle)', 'Rick Ross'],
 ['2 ', 'Get Back Up (w\\/ Chris Brown)', 'T.I.']]

In [20]:
len(songs)

75263
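
The length of 75263 is one more than the 75,262 songs in the dataset. A likely cause, assuming the file ends with a newline, is the empty string that `split('\n')` leaves at the end. The sketch below parses the same tab-separated format while skipping blank lines and stripping the trailing space from the ID; `parse_song_hash` is a hypothetical helper and the sample text is abbreviated.

```python
def parse_song_hash(text):
    """Return {integer_id: (title, artist)} from tab-separated lines, skipping blanks."""
    songs = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # e.g. the empty string left after a trailing newline
        song_id, title, artist = (field.strip() for field in line.split("\t"))
        songs[int(song_id)] = (title, artist)
    return songs

sample = "0 \tGucci Time\tGucci Mane\n1 \tAston Martin Music\tRick Ross\n"
parsed = parse_song_hash(sample)
print(len(sample.split("\n")))  # 3, one more than the number of songs
print(len(parsed))              # 2
print(parsed[0])                # ('Gucci Time', 'Gucci Mane')
```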

In [18]:
# Let's convert the songs to dataframe
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
songs_df.head()

| id | title | artist |
| --- | --- | --- |
| 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | Whip My Hair | Willow |


In [19]:
# let's look at the songs of a certain artist
songs_df[songs_df.artist == "Usher"].head()

| id | title | artist |
| --- | --- | --- |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 14 | DJ Got Us Fallin' In Love (w\/ Pitbull) | Usher |
| 32 | You Make Me Wanna... | Usher |
| 51 | There Goes My Baby | Usher |
| 88 | OMG (w\/ Will.I.Am) | Usher |


## Training

Let's keep only the playlists that contain more than one song.

In [0]:
# first two lines are metadata
playlists = [s.rstrip().split() for s in data[2:] if len(s.split()) > 1]

In [26]:
print(f"Initial Playlists: {len(data[2:])}")
print(f"After: {len(playlists)}")

Initial Playlists: 11138
After: 11088


Word2Vec training has some hyper-parameters. I am using the following:

- `size`: embedding dimension of each song vector (renamed `vector_size` in gensim 4.0 and later).
- `window`: maximum distance between the current and predicted song within a playlist.
- `negative`: if > 0, negative sampling is used, and the value specifies how many “noise songs” are drawn. If set to 0, no negative sampling is used.
- `min_count`: ignores all songs with a total frequency lower than this.
- `workers`: number of worker threads used for training (faster training on multicore machines).

For the other parameters, check out the [gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html).

In [0]:
# Note: on gensim 4.0 and later, pass `vector_size=32` instead of `size=32`.
model = Word2Vec(playlists, size=32, window=20, negative=50, min_count=1, workers=4)
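
Because the name of the embedding-dimension argument depends on the installed gensim version, one option is to build the keyword arguments from the version string. This is only a sketch; `word2vec_kwargs` is a hypothetical helper, and the other values simply mirror the training call above.

```python
def word2vec_kwargs(gensim_version, dim=32):
    """Build Word2Vec keyword arguments, naming the dimension argument per gensim version."""
    major = int(gensim_version.split(".")[0])
    dim_key = "vector_size" if major >= 4 else "size"
    return {dim_key: dim, "window": 20, "negative": 50, "min_count": 1, "workers": 4}

print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))
```

It could then be used as `Word2Vec(playlists, **word2vec_kwargs(gensim.__version__))`.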

## Recommending Similar Songs

In [31]:
song_id = 2172

songs_df.iloc[song_id]

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object

In [32]:
# Ask the model for songs similar to song
model.wv.most_similar(positive=str(song_id))

[('2849', 0.9995048642158508),
 ('2976', 0.9990255832672119),
 ('2987', 0.9985707998275757),
 ('2886', 0.9984464049339294),
 ('3094', 0.9984420537948608),
 ('3167', 0.9982445240020752),
 ('6624', 0.9982098340988159),
 ('2715', 0.9978221654891968),
 ('2640', 0.9975029826164246),
 ('5549', 0.9972248077392578)]

In [0]:
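def get_recommendations(song_id):
    # Show the query song first
    print(songs_df.iloc[song_id])
    # most_similar returns (song_id_as_string, cosine_similarity) pairs;
    # convert the IDs to integers so they can be used as positional indices
    similar_songs = [int(sid) for sid, _ in model.wv.most_similar(positive=[str(song_id)])]
    return songs_df.iloc[similar_songs]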

In [34]:
get_recommendations(2172)

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


| id | title | artist |
| --- | --- | --- |
| 2849 | Run To The Hills | Iron Maiden |
| 2976 | I Don't Know | Ozzy Osbourne |
| 2987 | Ready For Love | Bad Company |
| 2886 | The Zoo | Scorpions |
| 3094 | Breaking The Law | Judas Priest |
| 3167 | Unchained | Van Halen |
| 6624 | Everybody Wants Some!!! | Van Halen |
| 2715 | Rainbow In The Dark | Dio |
| 2640 | Red Barchetta | Rush |
| 5549 | November Rain | Guns N' Roses |
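
`get_recommendations` drops the similarity scores. If they are useful, a small helper can keep them next to the titles. The sketch below is independent of the trained model: `attach_titles` is a hypothetical helper, and the three results and titles are copied from the outputs above.

```python
def attach_titles(similar, catalog):
    """Pair each (song_id, score) result with its (title, artist) entry."""
    rows = []
    for song_id, score in similar:
        title, artist = catalog[int(song_id)]  # model keys are strings
        rows.append((title, artist, round(score, 4)))
    return rows

# First three results for "Fade To Black", copied from the outputs above.
similar = [
    ("2849", 0.9995048642158508),
    ("2976", 0.9990255832672119),
    ("2987", 0.9985707998275757),
]
catalog = {
    2849: ("Run To The Hills", "Iron Maiden"),
    2976: ("I Don't Know", "Ozzy Osbourne"),
    2987: ("Ready For Love", "Bad Company"),
}

for row in attach_titles(similar, catalog):
    print(row)
```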


# Visualizations will be added soon !!