### Content-Based Song Recommender

In [1]:
# import the required libraries
import numpy as np
import pandas as pd
from typing import Dict, List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# load the songs data
df = pd.read_csv('songLyrics.csv')
df.head(5)

Unnamed: 0,artist,song,link,lyrics
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [3]:
print('The shape of the dataset used: {}'.format(len(df)))

The shape of the dataset used: 57650


The full dataset is too large for my machine to compute the cosine similarity matrix: the computation fails with a 'too large nnz result' error. I therefore work with a random sample of 15,000 songs.


Note: There are probably ways to compute cosine similarities for large sparse matrices, but I have yet to explore them. To get usable results quickly, I only consider a sample.
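One possible way around the memory limit, shown here only as a minimal sketch and not something this notebook actually runs, is to compute the similarities in row blocks and keep just the top few neighbours of each song. The function name `top_k_similar` and its parameters `k` and `chunk_size` are hypothetical choices for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize


def top_k_similar(matrix, k=5, chunk_size=1000):
    """Return (indices, scores) of the k most similar rows for every row.

    Hypothetical helper: works block by block, so the full n x n
    similarity matrix is never held in memory at once.
    """
    # L2-normalise the rows so that a dot product equals cosine similarity.
    normed = normalize(sparse.csr_matrix(matrix), norm='l2', axis=1)
    n_rows = normed.shape[0]
    top_idx = np.empty((n_rows, k), dtype=np.int64)
    top_sim = np.empty((n_rows, k), dtype=np.float64)

    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Dense block of shape (stop - start, n_rows).
        block = (normed[start:stop] @ normed.T).toarray()
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition finds the k largest per row without a full sort.
        part = np.argpartition(-block, k - 1, axis=1)[:, :k]
        part_sim = np.take_along_axis(block, part, axis=1)
        # Order the k candidates from most to least similar.
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)

    return top_idx, top_sim
```

Under these assumptions, a call such as `top_k_similar(lyric_matrix, k=5)` on the TF-IDF matrix would need memory proportional to `chunk_size` times the number of songs per block, instead of the number of songs squared.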

In [4]:
df = df.sample(n=15000).drop('link', axis=1).reset_index(drop=True)
print('The shape of the dataset used: {}'.format(len(df)))

The shape of the dataset used: 15000


In [5]:
df['lyrics'][0]

"Coke machines on the ho chi minh trail  \nRussian spies shopping at bloomingdales  \nKiller bees swarming up the rio grande  \nHey baby where'd you get that tan  \nRome burns as we dance cheek to cheek  \nTitanic's sinking but oooh do we look chic  \nNew york city's got the hong kong flu  \nAnd I can't take my eyes off you  \nMx gi joe  \nRsvp ufo  \nAbcia  \nI still want to know who shot jfk  \nLet's dance  \nTo the rhythm of romance  \nLet's swing  \n'neath the stars above  \nC'mon let's dance  \nTo the rhythm of romance  \nLet's kiss  \nLet's fall in love  \nJohnny woke up from his american dream  \nRich man poor man and a desert in between  \nCrystal chandeliers hangin' by a thread  \nWe're in the pink baby just two trillon in the red  \nAnd I hear gabriel blowing his saxophone  \nThrough a great big hole in the ozone  \nWho cares it the sun don't rise  \nWhen I look in to your ultraviolet eyes  \nThat's life, this is war  \nI took it on faith, can't take it anymore  \nOld, young,

**Remove the "\n" characters and extra spaces to convert the lyrics into plain text:**

In [6]:
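# match the literal newline character; regex=False keeps this correct across pandas versions
df['lyrics'] = df['lyrics'].str.replace('\n', '', regex=False)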

In [7]:
df['lyrics'][0]

"Coke machines on the ho chi minh trail  Russian spies shopping at bloomingdales  Killer bees swarming up the rio grande  Hey baby where'd you get that tan  Rome burns as we dance cheek to cheek  Titanic's sinking but oooh do we look chic  New york city's got the hong kong flu  And I can't take my eyes off you  Mx gi joe  Rsvp ufo  Abcia  I still want to know who shot jfk  Let's dance  To the rhythm of romance  Let's swing  'neath the stars above  C'mon let's dance  To the rhythm of romance  Let's kiss  Let's fall in love  Johnny woke up from his american dream  Rich man poor man and a desert in between  Crystal chandeliers hangin' by a thread  We're in the pink baby just two trillon in the red  And I hear gabriel blowing his saxophone  Through a great big hole in the ozone  Who cares it the sun don't rise  When I look in to your ultraviolet eyes  That's life, this is war  I took it on faith, can't take it anymore  Old, young, yin and yang  We'll all go together in the next big bang  B

In [8]:
# collapse every run of two or more spaces into a single space in one pass
df['lyrics'] = df['lyrics'].str.replace(r' {2,}', ' ', regex=True)

In [9]:
df['lyrics'][0]

"Coke machines on the ho chi minh trail Russian spies shopping at bloomingdales Killer bees swarming up the rio grande Hey baby where'd you get that tan Rome burns as we dance cheek to cheek Titanic's sinking but oooh do we look chic New york city's got the hong kong flu And I can't take my eyes off you Mx gi joe Rsvp ufo Abcia I still want to know who shot jfk Let's dance To the rhythm of romance Let's swing 'neath the stars above C'mon let's dance To the rhythm of romance Let's kiss Let's fall in love Johnny woke up from his american dream Rich man poor man and a desert in between Crystal chandeliers hangin' by a thread We're in the pink baby just two trillon in the red And I hear gabriel blowing his saxophone Through a great big hole in the ozone Who cares it the sun don't rise When I look in to your ultraviolet eyes That's life, this is war I took it on faith, can't take it anymore Old, young, yin and yang We'll all go together in the next big bang Boom shug a lug a lug a boom Let'

**Scikit-learn's TF-IDF vectorizer computes a TF-IDF score for every word in every song:**

In [10]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

**Fitting the vectorizer on the lyrics produces a matrix that stores all the scores:**

In [11]:
import time
start = time.time()
print('Computing a tfidf vector matrix between lyrics ---')
lyric_matrix = tfidf.fit_transform(df['lyrics'])
end = time.time()
print(end - start)
print('Completed!!!!')

Computing a tfidf vector matrix between lyrics ---
1.730647325515747
Completed!!!!


In [12]:
lyric_matrix.shape

(15000, 43323)
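To make the shape above concrete: each row is one song and each column is one distinct term left after stop-word removal. The small sketch below applies the same vectorizer settings to three made-up lyric lines (`toy_lyrics` and the other `toy_` names are illustrative and not part of the dataset).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented lyric snippets, used only for illustration.
toy_lyrics = [
    "love me tender love me true",
    "dance with me tonight",
    "we dance and we love",
]

toy_tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_lyrics)

# One row per lyric, one column per distinct non-stop-word term.
print(toy_matrix.shape)
# The learned vocabulary maps each term to its column index.
print(sorted(toy_tfidf.vocabulary_))
# By default every row is scaled to unit L2 length.
print(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())
```

Common words such as "me" are dropped by `stop_words='english'`, so only the more distinctive terms contribute columns.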

To measure the similarity between lyrics, I use a similarity measure called __Cosine Similarity__.

The only argument needed is the TF-IDF matrix produced by the vectorizer. The function computes the similarity of each song (row) with every other song in the matrix, so the result has one row and one column per song.

In [13]:
import time
start = time.time()
print('Computing Cosine similarities for the lyric_matrix ---')
cosine_similarities = cosine_similarity(lyric_matrix)
end = time.time()
print(end - start)
print('Completed!!!!')

Computing Cosine similarities for the lyric_matrix ---
6.581965923309326
Completed!!!!
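As a rough illustration of what `cosine_similarity` computes for each pair of rows, the sketch below compares it with the textbook formula cos(a, b) = (a · b) / (‖a‖ ‖b‖) on three made-up vectors. The helper `cosine_by_hand` and the array `vectors` are hypothetical names used only here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up "songs", each described by three term weights.
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # points in the same direction as the first row
    [0.0, 0.0, 3.0],   # shares no terms with the first two rows
])


def cosine_by_hand(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# One row and one column per song: entry [i, j] compares song i with song j.
pairwise = cosine_similarity(vectors)

print(pairwise.shape)
print(cosine_by_hand(vectors[0], vectors[1]), pairwise[0, 1])
print(cosine_by_hand(vectors[0], vectors[2]), pairwise[0, 2])
```

Rows pointing in the same direction score close to 1 regardless of their length, and rows with no terms in common score 0.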


**I used the computed similarities to build a Recommender!**

In [14]:
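# map each song title to its most similar songs as (score, song, artist) tuples
similar_songs = {}

for i in range(len(cosine_similarities)):
    # indices of the 49 highest scores in row i, best first
    indices_similar = cosine_similarities[i].argsort()[:-50:-1]
    # drop the first entry, which is normally the song itself (score 1.0)
    similar_songs[df['song'].iloc[i]] = [(cosine_similarities[i][x], df['song'][x], df['artist'][x]) for x in indices_similar][1:]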
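The slice `argsort()[:-50:-1]` can be hard to read at first, so here is a small sketch of the same idiom on a made-up row of five scores (the array `scores` is illustrative). With `-50` the slice returns 49 indices, and dropping the first one leaves 48 neighbours per song.

```python
import numpy as np

# Toy similarity row for song 1; position 1 is its similarity with itself.
scores = np.array([0.10, 1.00, 0.35, 0.80, 0.05])

# argsort is ascending, so walking backwards from the end yields the largest first.
# [:-4:-1] takes the last three entries of the sorted order.
top = scores.argsort()[:-4:-1]
print(top.tolist())        # [1, 3, 2]

# Dropping the first entry removes the self-match (similarity 1.0).
print(top[1:].tolist())    # [3, 2]
```

This assumes the self-match really is the single highest score in its row; if another song has identical lyrics, the tie may be ordered either way.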

In [15]:
class Recommender:
    def __init__(self, matrix):
        self.similar_matrix = matrix
    
    def _print(self, song, recommended):
        recommended_length = len(recommended)
        print(f'The {recommended_length} recommended songs for {song} are: ')
        for i in range(recommended_length):
            print(f'{i+1}: ')
            print(f'{recommended[i][1]} by {recommended[i][2]} with similarity score of {round(recommended[i][0], 3)}')
            print('***********************************************')
    
    def recommend(self, recommendation):
        # `recommendation` is a dict with the song title ('song') and how many results to show ('number')
        song = recommendation['song']
        number = recommendation['number']
        recommended = self.similar_matrix[song][:number]
        self._print(song=song, recommended=recommended)

In [16]:
# instantiate the class
recommendations = Recommender(similar_songs)

Finally, get recommendations:

In [18]:
song1 = {
    "song": df['song'].iloc[1977],
    "number": 5
}

recommendations.recommend(song1)

The 5 recommended songs for I Like Death are: 
1: 
Cheese Cake by Aerosmith with similarity score of 0.581
***********************************************
2: 
The Corner Grocery Store by Raffi with similarity score of 0.205
***********************************************
3: 
Duke Of Prunes by Frank Zappa with similarity score of 0.168
***********************************************
4: 
I'm Your Daddy by Weezer with similarity score of 0.166
***********************************************
5: 
I Guess I Like It Like That by Kylie Minogue with similarity score of 0.147
***********************************************


In [20]:
song2 = {
    "song": df['song'].iloc[4415],
    "number": 5
}

recommendations.recommend(song2)

The 5 recommended songs for Love The Way You Lie are: 
1: 
Love The Way You Lie by Eminem with similarity score of 0.632
***********************************************
2: 
Would I Lie To You by Whitesnake with similarity score of 0.442
***********************************************
3: 
Love Don't Lie by Def Leppard with similarity score of 0.382
***********************************************
4: 
Lie Down by Whitesnake with similarity score of 0.37
***********************************************
5: 
Any Way You Want It by Kiss with similarity score of 0.362
***********************************************
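In the last result, the top recommendation has the same title as the query song. This can happen because the lookup dictionary is keyed by title alone, so different recordings that share a title end up under one key. The sketch below, on a made-up three-row table (`toy` and its artist names are illustrative), shows the collision and one possible alternative that keys on the (title, artist) pair.

```python
import pandas as pd

# Invented example: two different songs share the title 'Hello'.
toy = pd.DataFrame({
    'song':   ['Hello', 'Hello', 'Goodbye'],
    'artist': ['Artist A', 'Artist B', 'Artist C'],
})

# Keyed by title only: the second 'Hello' overwrites the first.
by_title = {toy['song'].iloc[i]: i for i in range(len(toy))}

# Keyed by (title, artist): every row keeps its own entry.
by_title_artist = {
    (toy['song'].iloc[i], toy['artist'].iloc[i]): i for i in range(len(toy))
}

print(len(by_title), len(by_title_artist))   # 2 3
print(by_title['Hello'])                     # 1
```

Keying on the pair would also require passing the artist along with the title when asking for recommendations.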
