# Music recommender system

Recommender systems are among the most widely used applications of machine learning. A **recommender** (or recommendation) **system** (or engine) is a filtering system whose aim is to predict the rating or preference a user would give to an item, e.g. a film, a product, or a song.

Which types of recommender are there?

There are two main types of recommender systems: 
- Content-based filters
- Collaborative filters
  
> Content-based filters predict what a user likes based on what that particular user has liked in the past. Collaborative filters, on the other hand, predict what a user likes based on what other, similar users have liked.

### 1) Content-based filters

Recommendations done using content-based recommenders can be seen as a user-specific classification problem. This classifier learns the user's likes and dislikes from the features of the song.

The most straightforward approach is **keyword matching**.

In short, the idea is to extract meaningful keywords from the description of a song the user likes, search for those keywords in other song descriptions to estimate how similar they are, and recommend the most similar songs to the user.
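To make the keyword-matching idea concrete, here is a minimal sketch that scores two descriptions by the share of keywords they have in common (Jaccard similarity). The stop-word list, the helper names `extract_keywords` and `keyword_overlap`, and the toy descriptions are all assumptions made for illustration; the notebook itself uses TF-IDF rather than this simple overlap.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "me", "you", "i"}


def extract_keywords(description: str) -> set:
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", description.lower())
    return {w for w in words if w not in STOP_WORDS}


def keyword_overlap(desc_a: str, desc_b: str) -> float:
    """Jaccard similarity: shared keywords divided by all distinct keywords."""
    a, b = extract_keywords(desc_a), extract_keywords(desc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy data (made up): one liked description and two candidates.
liked = "dance with me in the moonlight"
candidates = {
    "song_1": "dance in the moonlight tonight",
    "song_2": "the rain is falling down",
}

# Rank candidates from most to least similar to the liked description.
ranked = sorted(
    candidates,
    key=lambda name: keyword_overlap(liked, candidates[name]),
    reverse=True,
)
print(ranked)
```

Plain overlap treats every keyword as equally important, which is one reason to prefer a weighted scheme such as TF-IDF.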

*How is this performed?*

In our case, because we are working with text and words, **Term Frequency-Inverse Document Frequency (TF-IDF)** can be used for this matching process.
  
We'll go through the steps for generating a **content-based** music recommender system.

### Importing required libraries

First, we'll import all the required libraries.

In [72]:
import numpy as np
import pandas as pd

In [73]:
from typing import List, Dict

We have already used the **TF-IDF score before** when performing Twitter sentiment analysis. 

Likewise, we are going to use TfidfVectorizer from the Scikit-learn package again.

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial import distance  # cdist lives in SciPy, not in scikit-learn's public API

### Dataset

So imagine that we have the [following dataset](https://www.kaggle.com/mousehead/songlyrics/data#). 

This dataset contains name, artist, and lyrics for *57650 songs in English*. The data has been acquired from LyricsFreak through scraping.

In [75]:
songs = pd.read_csv('songdata.csv', nrows=5000)

In [76]:
songs.head()

| Unnamed: 0 | artist | song | link | text |
|---|---|---|---|---|
| 0 | ABBA | Ahe's My Kind Of Girl | /a/abba/ahes+my+kind+of+girl_20598417.html | Look at her face, it's a wonderful face \nAnd... |
| 1 | ABBA | Andante, Andante | /a/abba/andante+andante_20002708.html | Take it easy with me, please \r\nTouch me gen... |
| 2 | ABBA | As Good As New | /a/abba/as+good+as+new_20003033.html | I'll never know why I had to go \nWhy I had t... |
| 3 | ABBA | Bang | /a/abba/bang_20598415.html | Making somebody happy is a question of give an... |
| 4 | ABBA | Bang-A-Boomerang | /a/abba/bang+a+boomerang_20002668.html | Making somebody happy is a question of give an... |


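Because the dataset is so big, we work with only 5000 songs. Here we keep the first 5000 rows loaded above (the random-sampling alternative is left commented out) and drop the `link` column, which we do not need.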

In [77]:
# songs = songs.sample(n=5000).drop('link', axis=1).reset_index(drop=True)
songs = songs.drop('link', axis=1)

We can also notice the presence of `\n` in the text, so we are going to remove it.

In [78]:
songs['text'] = songs['text'].str.replace(r'\n', '', regex=True)

In [79]:
songs['text']

0       Look at her face, it's a wonderful face  And i...
1       Take it easy with me, please  \rTouch me gentl...
2       I'll never know why I had to go  Why I had to ...
3       Making somebody happy is a question of give an...
4       Making somebody happy is a question of give an...
                              ...                        
4995    You won't take my love for tender  \rYou can p...
4996    I've looked at it every way I can  \rFrom unde...
4997    I won't walk with my head bowed  \r(Be on) Bey...
4998    Dressed up like a dog's dinner  \rButter would...
4999    Now there's newsprint all over your face  \rWe...
Name: text, Length: 5000, dtype: object
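The output above still contains `\r` characters, because only `\n` was removed. A minimal sketch of a stricter cleanup, assuming we want every line break (with its surrounding whitespace) collapsed into a single space; the helper name `clean_lyrics` is hypothetical and not part of the original notebook.

```python
import re

# Any run of carriage returns / newlines, plus the whitespace around it.
LINE_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def clean_lyrics(text: str) -> str:
    """Replace each line break (and surrounding spaces) with one space."""
    return LINE_BREAKS.sub(" ", text).strip()


print(clean_lyrics("Take it easy with me, please  \r\nTouch me gently"))
```

The same pattern could be applied to the whole column with `songs['text'].str.replace(r'\s*[\r\n]+\s*', ' ', regex=True)`.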

After that, we use the TF-IDF vectorizer, which calculates the TF-IDF score for each song lyric, word by word.

Here, we pay particular attention to the arguments we can specify.

In [80]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [81]:
lyrics_matrix = tfidf.fit_transform(songs['text'])

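In [103]:
# Printing the sparse matrix lists each stored (row, column) pair with its TF-IDF score.
# np.array(lyrics_matrix) does not convert it to a dense array;
# use lyrics_matrix.toarray() if a dense array is really needed.
print(lyrics_matrix)
# print(tfidf.vocabulary_)  # dictionary of words and their index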


  (0, 13612)	0.14426211585703372
  (0, 18535)	0.08193734815347754
  (0, 18198)	0.12377179609942188
  (0, 8774)	0.13662249282824912
  (0, 20023)	0.11702773810586717
  (0, 10534)	0.053061945396677815
  (0, 8132)	0.09211566628984386
  (0, 17289)	0.237566856563542
  (0, 8617)	0.16377780427488617
  (0, 13054)	0.14834215090454939
  (0, 20018)	0.09380471149390261
  (0, 10268)	0.2655820946681044
  (0, 1981)	0.19507670721087733
  (0, 1634)	0.18059738027083114
  (0, 6712)	0.2196470434421451
  (0, 6570)	0.1391315848631338
  (0, 10952)	0.2090553517993387
  (0, 7576)	0.3292555462476679
  (0, 9920)	0.4240399935128574
  (0, 9772)	0.18814964017738572
  (0, 6588)	0.19423747679701758
  (0, 10743)	0.13859327906897498
  (0, 15964)	0.14785705566619378
  (0, 16778)	0.16051052032357432
  (0, 20158)	0.06193913808495802
  :	:
  (4999, 10659)	0.06471177496103261
  (4999, 7698)	0.05844985120541044
  (4999, 11221)	0.0632354956813946
  (4999, 14606)	0.07529524589665534
  (4999, 7532)	0.06965011383607916
  (4999, 8

*How do we use this matrix for a recommendation?* 

We now need to calculate the similarity of one lyric to another. We are going to use **cosine similarity**.

We want to calculate the cosine similarity of each item with every other item in the dataset. So we just pass the lyrics_matrix as argument.

In [83]:
cosine_similarities = cosine_similarity(lyrics_matrix) 
print(cosine_similarities)


[[1.         0.00169743 0.00936841 ... 0.03331108 0.03366429 0.08968344]
 [0.00169743 1.         0.0044533  ... 0.00275294 0.01246284 0.00367714]
 [0.00936841 0.0044533  1.         ... 0.00674108 0.00608736 0.0225969 ]
 ...
 [0.03331108 0.00275294 0.00674108 ... 1.         0.01257674 0.01331615]
 [0.03366429 0.01246284 0.00608736 ... 0.01257674 1.         0.07131912]
 [0.08968344 0.00367714 0.0225969  ... 0.01331615 0.07131912 1.        ]]
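For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand for small toy vectors; the function name `cosine` and the example vectors are assumptions for illustration, and in practice `cosine_similarity` from scikit-learn does this for every pair of rows at once.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [0, 1]))  # no shared terms
print(cosine([1, 2], [2, 4]))  # same direction, different length
print(cosine([1, 1], [1, 0]))  # partial overlap
```

This also explains the diagonal of ones in the matrix above: every song has similarity 1 with itself.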


In [89]:
lyrics_matrix

<5000x20849 sparse matrix of type '<class 'numpy.float64'>'
	with 250920 stored elements in Compressed Sparse Row format>
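The numbers in this summary show why the matrix is kept sparse. A quick back-of-the-envelope calculation, using only the shape and the count of stored elements reported above (the variable names are illustrative):

```python
n_rows, n_cols = 5000, 20849   # songs x vocabulary size, from the repr above
stored = 250920                # non-zero entries, from the repr above

total_cells = n_rows * n_cols
density = stored / total_cells   # fraction of cells that are non-zero
dense_bytes = total_cells * 8    # float64 uses 8 bytes per value

print(f"density: {density:.4%}")
print(f"dense size: {dense_bytes / 1e6:.0f} MB")
```

Fewer than 1 in 400 cells are non-zero, so converting the matrix to a dense array costs a lot of memory for very little information.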

In [105]:
# distance.cdist expects dense 2-D arrays, so passing the sparse matrix raises
# "ValueError: XA must be a 2-dimensional array."
# euclidean_distances from scikit-learn accepts sparse input directly.
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(lyrics_matrix)
print(distance_matrix)

Once we have the similarities, we'll store in a dictionary the names of the 50 most similar songs for each song in our dataset.

In [87]:
similarities = {}

In [88]:
for i in range(len(cosine_similarities)):
    # Sort the similarities of song i and keep the indexes of the 51 highest scores, best first.
    similar_indices = cosine_similarities[i].argsort()[:-52:-1]
    # Store the score, name and artist of the 50 most similar songs,
    # skipping the first entry, which is normally the song itself (similarity 1).
    similarities[songs['song'].iloc[i]] = [(cosine_similarities[i][x], songs['song'].iloc[x], songs['artist'].iloc[x]) for x in similar_indices][1:]
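A negative-step slice such as `[:-k:-1]` returns `k - 1` items, not `k`, which is easy to miss when counting neighbours. The sketch below shows this and offers a hypothetical helper, `top_k_similar`, that excludes the query song by index rather than by position; it works on plain lists as well as on a row of the similarity matrix.

```python
def top_k_similar(row, self_index, k):
    """Indexes of the k highest scores in `row`, excluding `self_index`."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [j for j in order if j != self_index][:k]


# The slice [:-50:-1] walks back from the last item and stops before index -50.
print(len(list(range(100))[:-50:-1]))

# Toy similarity row for song 0 (made-up scores).
row = [1.0, 0.2, 0.9, 0.5]
print(top_k_similar(row, self_index=0, k=2))
```

Excluding by index also stays correct when another song has exactly the same lyrics and ties with the query at similarity 1.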

After that, all the magic happens. We can use those similarity scores to access the most similar items and give a recommendation.

For that, we'll define our content-based recommender class.

In [None]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        self.matrix_similar = matrix

    def _print_message(self, song, recom_song):
        rec_items = len(recom_song)
        
        print(f'The {rec_items} recommended songs for {song} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_song[i][1]} by {recom_song[i][2]} with {round(recom_song[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Get the requested number of most similar songs from the similarity dictionary
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(song=song, recom_song=recom_song)

Now, instantiate the class.

In [None]:
recommedations = ContentBasedRecommender(similarities)

Then, we are ready to pick a song from the dataset and make a recommendation.

In [None]:
recommendation = {
    "song": songs['song'].iloc[10],
    "number_songs": 4 
}

In [None]:
recommedations.recommend(recommendation)

The 4 recommended songs for Dance are:
Number 1:
Life Is A Dance by Chaka Khan with 0.568 similarity score
--------------------
Number 2:
Do You Wanna Dance? by Beach Boys with 0.504 similarity score
--------------------
Number 3:
Do You Wanna Dance? by Cliff Richard with 0.472 similarity score
--------------------
Number 4:
Let's Dance by Chris Rea with 0.445 similarity score
--------------------


And we can pick another random song and recommend again:

In [None]:
recommendation2 = {
    "song": songs['song'].iloc[120],
    "number_songs": 4 
}

In [None]:
recommedations.recommend(recommendation2)

The 4 recommended songs for Life Is A Flower are:
Number 1:
World Where You Live by Crowded House with 0.386 similarity score
--------------------
Number 2:
All Over The World by Arlo Guthrie with 0.334 similarity score
--------------------
Number 3:
The World Is Mine by David Guetta with 0.313 similarity score
--------------------
Number 4:
World Of Two by Cake with 0.276 similarity score
--------------------


In [None]:
corpus = [
     'This is the first document is.',
     'This document is the second document.'
 ]

vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

In [None]:
X = vectorizer.fit_transform(corpus)
print(X)
print(vectorizer.vocabulary_) # dictionary of words and their index

  (0, 0)	1.0
  (1, 1)	0.5749618667993135
  (1, 0)	0.8181802073667197
{'document': 0, 'second': 1}
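The scores above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term count, and then scales each document vector to unit length. The sketch below applies that to the second sentence, where "document" appears twice and "second" once after stop-word removal; the variable names are illustrative.

```python
import math

n_docs = 2
doc_freq = {"document": 2, "second": 1}   # number of documents containing each term

# Smoothed IDF, as used by TfidfVectorizer by default (smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw term counts in the second sentence once stop words are removed.
counts = {"document": 2, "second": 1}

# tf * idf, then L2 normalisation so the vector has length 1.
raw = {t: counts[t] * idf[t] for t in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_second_doc = {t: v / norm for t, v in raw.items()}

print(tfidf_second_doc)
```

The first sentence keeps only the term "document", so after normalisation its single score is 1.0, matching the `(0, 0)` entry.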


In [None]:
print(vectorizer.get_stop_words())

frozenset({'sixty', 'whose', 'top', 'anything', 'our', 'on', 'find', 'about', 'the', 'among', 'over', 'none', 'whereupon', 'interest', 'became', 'into', 'amongst', 'thereafter', 'by', 'has', 'elsewhere', 'further', 'four', 'anyhow', 'below', 'should', 'is', 'together', 're', 'six', 'fifteen', 'he', 'third', 'must', 'in', 'no', 'behind', 'done', 'whereas', 'would', 'whenever', 'yourselves', 'detail', 'name', 'him', 'please', 'fill', 'though', 'already', 'former', 'twenty', 'again', 'side', 'could', 'often', 'cant', 'because', 'against', 'its', 'hasnt', 'much', 'nothing', 'several', 'being', 'less', 'between', 'yourself', 'show', 'what', 'but', 'noone', 'few', 'ie', 'thin', 'found', 'whom', 'and', 'also', 'etc', 'am', 'eg', 'moreover', 'full', 'such', 'she', 'wherever', 'whereby', 'been', 'ten', 'any', 'whether', 'they', 'somewhere', 'more', 'serious', 'part', 'everywhere', 'almost', 'namely', 'indeed', 'onto', 'have', 'five', 'whole', 'con', 'hundred', 'next', 'can', 'which', 'this', 't