## Introduction:
According to a report released by Nielsen Music, Americans now spend slightly more than 32 hours a week, on average, listening to music. This is a 36% increase in 2 years, made possible by the popularity of music streaming services such as Spotify, Pandora and Apple Music.  
With such tremendous growth in the music industry, it becomes crucial to deliver personalized music recommendations to listeners. This piqued our curiosity about the process behind a music recommendation engine and led us to work on this project.  

Last.fm is one of the oldest music streaming companies to provide personalized recommendations to its listeners. Their website shows how much analytics they apply to their data to extract insights, so we decided to do the same with the open dataset available for Last.fm users. Although this dataset is about 10 years old, the algorithms and methodologies applied are still relevant for the latest songs and albums.

### What do we want to do?
To build a recommendation engine that provides personalized recommendations to listeners based on their listening history, and to visualize the results using advanced visualization tools.  

### Data:

    The data is formatted one entry per line as follows (tab separated):

    usersha1-artmbid-artname-plays.tsv:
      user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays

    usersha1-profile.tsv:
      user-mboxsha1 \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)

Example:

    usersha1-artmbid-artname-plays.tsv:
      000063d3fe1cf2ba248b9e3c3f0334845a27a6bf    af8e4cc5-ef54-458d-a194-7b210acf638f    cannibal corpse    48
      000063d3fe1cf2ba248b9e3c3f0334845a27a6bf    eaaee2c2-0851-43a2-84c8-0198135bc3a8    elis    31
      ...

    usersha1-profile.tsv
      000063d3fe1cf2ba248b9e3c3f0334845a27a6bf    m    19    Mexico    Apr 28, 2008
      
For further details, please refer to the readme file in the folder.

**Implicit vs Explicit data** 

The data that we have is implicit data.  
To elaborate, explicit data is direct preference data from customers, such as ratings and likes, and is often used in collaborative recommendation systems. Implicit data is indirect preference data, such as the number of views by a customer, the number of times a customer listened to a song, or the number of times a customer purchased a particular type of product. In general, implicit data is noisier, and it takes more effort to make relevant recommendations from it.  

### What does the data look like?

In [2]:
# Importing all the required libraries
import sys
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import random

from sklearn.preprocessing import MinMaxScaler

import implicit

# Load the data
raw_data = pd.read_table('usersha1-artmbid-artname-plays.tsv')
raw_data = raw_data.drop(raw_data.columns[1], axis=1)
raw_data.columns = ['user', 'artist', 'plays']


In [4]:
# Preview of the data
raw_data.head()

Unnamed: 0,user,artist,plays
0,00000c289a1829a808ac09c00daf10bc3c4e223b,die Ärzte,1099
1,00000c289a1829a808ac09c00daf10bc3c4e223b,melissa etheridge,897
2,00000c289a1829a808ac09c00daf10bc3c4e223b,elvenking,717
3,00000c289a1829a808ac09c00daf10bc3c4e223b,juliette & the licks,706
4,00000c289a1829a808ac09c00daf10bc3c4e223b,red hot chili peppers,691


In [7]:
print('Total number of users in the data:', len(raw_data['user'].unique()))
print('Total number of artists in the data:', len(raw_data['artist'].unique()))

Total number of users in the data: 358868
Total number of artists in the data: 292364


There are about 204 nulls in the artist column. Let's drop the null rows from the dataset.

In [8]:
# Drop rows that contain NaN values; copy so later column assignments
# do not trigger pandas chained-assignment warnings
data = raw_data.dropna().copy()

Key links for the below code:  
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix  

**Advantages of the CSR format:**
* efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
* efficient row slicing
* fast matrix vector products
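
To make the advantages above concrete, here is a minimal sketch on a made-up 3x3 matrix of play counts. The values and variable names are illustrative only and are not taken from the Last.fm data.

```python
import numpy as np
import scipy.sparse as sparse

# Toy play counts given as (row, column, value) triplets.
# All numbers here are invented for illustration.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 2])
plays = np.array([5.0, 1.0, 3.0, 2.0, 4.0])

# Build a 3x3 CSR matrix; positions that are not listed are implicit zeros.
m = sparse.csr_matrix((plays, (rows, cols)), shape=(3, 3))

row_0 = m[0]             # row slicing
doubled = m + m          # CSR + CSR arithmetic
totals = m @ np.ones(3)  # matrix-vector product: total plays per row

print(m.nnz)             # number of explicitly stored entries
print(row_0.toarray())
print(totals)
```

Only the five non-zero entries are stored, which is what makes this format practical for a users-by-artists matrix where most entries are zero.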

### How to deal with implicit data?
(You can skip this part if you only want to look at the code.)

There are 2 main families of recommendation system algorithms:
1. Content based approach
2. Collaborative filtering approach  

Collaborative filtering is an approach based on past user behavior that does not require the creation of explicit profiles, as the content-based approach does. Historically, recommendation systems depended on information-rich explicit datasets such as ratings and thumbs up/down indications. However, explicit feedback is not always available, so it becomes crucial to make use of the vast amount of implicit data, such as user transaction data, browsing data or play data, which indirectly reflects opinion through observed user behavior.   
Also, it is important to note that once the user gives permission to access usage data, recommendations can be made from implicit data alone, without collecting explicit data.  

**Coming to our dataset**  
Some unique characteristics of implicit data prevent the use of conventional algorithms that were successful for explicit data. Let's look at those characteristics briefly.  
1. No negative feedback:
    In an explicit dataset, the user directly states his/her likes or dislikes, and any user-item pairs without that information are treated as missing data. With implicit data, we don't know for sure whether the user disliked the item or simply did not know about it. Considering only the data with positive feedback will not capture the entire picture and will lead to incorrect conclusions. Hence there is a need to address missing data.  
    
2. Implicit data is noisy:
    A user might be passively watching a movie or listening to a song, and this adds noise as far as the model is concerned.  
    
3. Frequency might not always represent the true user opinion, and vice versa.  

**Neighborhood models vs Latent factor models:**  
The user-oriented and item-oriented approaches in collaborative filtering are the traditional neighborhood models, as they consider similarities between users or between items while making a recommendation. The item-oriented approach became popular because it performed better than the user-oriented approach in terms of accuracy and is easier to interpret: it is easier to say that 2 items are similar than to talk about 2 like-minded individuals.  

Calculating similarities between items is a relatively simple task for ratings data, as we can use the Pearson correlation coefficient. With implicit datasets it becomes more complicated, because the scale of the frequencies differs from user to user and the same count might mean different things for different users.

This takes us to latent factor models, in which we use matrix factorization to uncover hidden feature vectors for users and items. Refer to the original paper for a complete understanding of how this algorithm is used for implicit data: http://yifanhu.net/PUB/cf.pdf

**Matrix factorization**  
The main idea behind matrix factorization is that we estimate a user matrix and an item matrix from the original sparse matrix by minimizing a cost function that involves 2 important metrics:  
Preference: Binary metric that indicates whether the user has listened to the music at least once  
Confidence: Metric that indicates how confident we are that the user likes it; it grows with the number of plays  
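
As a minimal sketch of these two metrics, the snippet below follows the definitions in the paper linked above: preference is 1 when the play count is positive and 0 otherwise, and confidence is 1 + alpha * plays. The toy matrix and the value of `alpha` are assumptions chosen for illustration.

```python
import numpy as np

# Toy play counts: rows are users, columns are artists (made-up numbers).
plays = np.array([[10, 0, 3],
                  [0, 0, 1]], dtype=float)

alpha = 15.0  # illustrative scaling factor for the confidence

# Preference: 1 if the user played the artist at least once, else 0.
preference = (plays > 0).astype(float)

# Confidence: every pair starts at 1 and grows linearly with the play count,
# so unplayed pairs still carry a small weight as weak negative evidence.
confidence = 1.0 + alpha * plays

print(preference)
print(confidence)
```

The cost function then penalizes the squared gap between each preference and the dot product of the corresponding user and item vectors, weighted by the confidence, plus a regularization term on both sets of vectors.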

**Alternating least squares**  
Optimizing the cost function with stochastic gradient descent becomes computationally expensive, because the cost covers every user-item combination, including the unobserved ones, and that total easily reaches billions for real-world datasets. Hence an alternative approach called alternating least squares (ALS) is used to optimize the cost function. In this approach, the user vectors and item vectors are estimated in an alternating manner: one matrix is held constant while the other is solved for, and the process repeats until the estimates converge. For example, in a particular iteration, we hold the user vectors constant while optimizing the item vectors. This is explained very clearly in the paper mentioned above.  

If you have read this brief and the paper, you should now have enough background on the theory behind alternating least squares for implicit datasets.  
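
The alternating updates can be sketched on a tiny dense matrix as follows. This is a teaching sketch under simplifying assumptions (dense NumPy arrays, a hypothetical helper named `als_implicit`, invented play counts) and is not the implementation used by the library in this notebook, which works on sparse matrices and is heavily optimized.

```python
import numpy as np

def als_implicit(plays, factors=2, alpha=15.0, reg=0.1, iterations=10, seed=0):
    """Toy dense ALS for implicit feedback; plays has shape (users, items)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = plays.shape
    P = (plays > 0).astype(float)   # preference matrix
    C = 1.0 + alpha * plays         # confidence matrix
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user vectors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item vectors
    reg_eye = reg * np.eye(factors)

    def solve(fixed, conf, pref):
        # For each row k, solve the regularized weighted least squares
        # (F^T C_k F + reg * I) v = F^T C_k p_k with F held fixed.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for k in range(conf.shape[0]):
            weighted = fixed * conf[k][:, None]
            A = weighted.T @ fixed + reg_eye
            b = weighted.T @ pref[k]
            out[k] = np.linalg.solve(A, b)
        return out

    def loss(X, Y):
        err = P - X @ Y.T
        return float((C * err ** 2).sum() + reg * ((X ** 2).sum() + (Y ** 2).sum()))

    history = []
    for _ in range(iterations):
        X = solve(Y, C, P)       # item vectors fixed, update user vectors
        Y = solve(X, C.T, P.T)   # user vectors fixed, update item vectors
        history.append(loss(X, Y))
    return X, Y, history

# Made-up play counts for 4 users and 4 artists.
plays = np.array([[5, 3, 0, 0],
                  [4, 0, 0, 0],
                  [0, 0, 2, 6],
                  [0, 0, 7, 1]], dtype=float)

X, Y, history = als_implicit(plays)
print(np.round(X @ Y.T, 2))      # reconstructed preference scores
print(history[0], history[-1])   # cost after the first and last iteration
```

Because each half-step solves its least squares problem exactly, the cost should not increase from one iteration to the next, which is a useful sanity check when experimenting with the hyperparameters.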

**Shout-out to Ben Frederickson**  
Thanks to Ben Frederickson, the entire algorithm is available as a Python module called 'implicit', which provides a much cleaner way to get recommendations than writing everything from scratch. We used it to build this recommendation engine.  



### Prerequisites 
If you are running this on your personal system, make sure to install the implicit package before starting the analysis.  
Run the following command in your command prompt:  

    conda install -c conda-forge implicit

### Let's move on to the execution part of the algorithm

In [9]:
# Convert the user hashes and artist names to categories so that we can use
# compact integer codes instead of long hash keys and strings
data['user'] = data['user'].astype("category")
data['artist'] = data['artist'].astype("category")

# cat.codes gives each user and each artist an integer id
data['user_id'] = data['user'].cat.codes
data['artist_id'] = data['artist'].cat.codes

# The implicit library expects data as an item-user matrix, so we
# create two matrices: one for fitting the model (item-user)
# and one for recommendations (user-item)

sparse_item_user = sparse.csr_matrix((data['plays'].astype(float), (data['artist_id'], data['user_id'])))
sparse_user_item = sparse.csr_matrix((data['plays'].astype(float), (data['user_id'], data['artist_id'])))
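
To see what the categorical codes and the sparse matrix look like, here is a small sketch on an invented four-row table. The frame `toy` and the lookup dictionary `artist_names` are hypothetical names introduced only for this illustration.

```python
import pandas as pd
import scipy.sparse as sparse

# Made-up listening data in the same shape as the notebook's dataframe.
toy = pd.DataFrame({
    'user':   ['u_b', 'u_a', 'u_b', 'u_c'],
    'artist': ['muse', 'muse', 'nirvana', 'coldplay'],
    'plays':  [10, 3, 7, 1],
})

toy['user'] = toy['user'].astype('category')
toy['artist'] = toy['artist'].astype('category')
toy['user_id'] = toy['user'].cat.codes      # codes follow sorted category order
toy['artist_id'] = toy['artist'].cat.codes

# Rows are artists and columns are users (item-user layout).
item_user = sparse.csr_matrix(
    (toy['plays'].astype(float).to_numpy(),
     (toy['artist_id'].to_numpy(), toy['user_id'].to_numpy()))
)

# The categories give a direct id -> name lookup.
artist_names = dict(enumerate(toy['artist'].cat.categories))

print(item_user.toarray())
print(artist_names)
```

A lookup such as `artist_names` could replace the repeated dataframe filtering used later to turn ids back into names, which tends to be slow on a table with millions of rows.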

### Training the model

In [11]:

# Initialize the als model and fit it using the sparse item-user matrix
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=20)

# Calculate the confidence by multiplying it by our alpha value.(alpha value corresponds to the confidence metric 
# that we discussed earlier)

alpha_val = 15
data_conf = (sparse_item_user * alpha_val).astype('double')

#Fit the model
model.fit(data_conf)

100%|████████████████████████████████████████████████████████████████████████████████| 20.0/20 [01:33<00:00,  4.70s/it]


### Finding similar artists

In [18]:
# Let's start with red hot chili peppers. Earlier in our exploration, we saw that the artist_id for red hot chili peppers
# is 220128
data[data['artist_id'] == 220128].head()

Unnamed: 0,user,artist,plays,user_id,artist_id
4,00000c289a1829a808ac09c00daf10bc3c4e223b,red hot chili peppers,691,0,220128
1422,000429493d9716b66b02180d208d09b5b89fbe64,red hot chili peppers,234,29,220128
2139,0007e26aafcfc0b6dcb87d7041583fbb7cced88a,red hot chili peppers,159,44,220128
3284,000b0bb32f149504e1df3cce85b6bfd20cef3dd0,red hot chili peppers,46,68,220128
3322,000b2ee840cbda56e0f41c8f248c4fb7ee275db3,red hot chili peppers,87,69,220128


In [22]:
# Find the 10 artists most similar to red hot chili peppers
artist_id = 220128
n_similar = 10  # number of similar items to return

# Use implicit to get similar items.
similar = model.similar_items(artist_id, n_similar)
# Print the names of our most similar artists
for artist in similar:
    idx, score = artist
    print(data.artist.loc[data.artist_id == idx].iloc[0])

red hot chili peppers
muse
nirvana
coldplay
placebo
foo fighters
the killers
pink floyd
nine inch nails
the beatles


#### Insights:  
* When you look for red hot chili peppers (220128), all the top recommendations are rock bands, which makes sense.

In [23]:
# Find the 10 artists most similar to die Ärzte
artist_id = 90933
n_similar = 10  # number of similar items to return

# Use implicit to get similar items.
similar = model.similar_items(artist_id, n_similar)
# Print the names of our most similar artists
for artist in similar:
    idx, score = artist
    print(data.artist.loc[data.artist_id == idx].iloc[0])

die Ärzte
die toten hosen
koЯn
rammstein
him
limp bizkit
nightwish
guano apes
the offspring
audioslave


#### Insights:
* Another example we took was die Ärzte (90933), a punk rock band from Berlin according to Wikipedia: https://en.wikipedia.org/wiki/Die_%C3%84rzte  
* The top recommendations for their music include die toten hosen (another German punk rock band) and the offspring (punk rock).  


**It's really amazing how math works. Without being given any other features, the algorithm figured out the feature vectors for each artist based only on the number of times users played their songs.**  
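
One common way to compare artists in the latent space is cosine similarity between their feature vectors. The sketch below uses invented 3-factor vectors and a hypothetical helper `most_similar`; it illustrates the idea and is not a description of the library's internals.

```python
import numpy as np

# Hypothetical feature vectors for four artists (made-up numbers).
artist_names = ['rock_a', 'rock_b', 'jazz_a', 'jazz_b']
item_factors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.0, 0.8, 0.4],
])

def most_similar(item_id, factors, n=3):
    """Return the n items with the highest cosine similarity to item_id."""
    norms = np.linalg.norm(factors, axis=1)
    scores = factors @ factors[item_id] / (norms * norms[item_id])
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

similar_to_rock_a = most_similar(0, item_factors)
for idx, score in similar_to_rock_a:
    print(artist_names[idx], round(score, 3))
```

The queried artist comes first because a vector is always most similar to itself, which matches the outputs above where each artist heads its own list.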

### Creating user recommendations

Let's look at users who have rock music at the top of their list by doing a simple EDA, and find recommendations for those users using the algorithm.

In [30]:
data['rank'] = data.groupby(['user_id'])['plays'].rank(ascending = False)

# filtering for their first choice
data_1  = data[data['rank'] == 1]
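
One detail worth checking in the ranking step: pandas' `rank` uses `method='average'` by default, so two artists tied for the most plays both receive rank 1.5 and a filter on `rank == 1` skips that user. The sketch below shows the difference on an invented table; switching to `method='min'` is a suggestion, not what the notebook does.

```python
import pandas as pd

# Made-up data: user 1 has two artists tied for the most plays.
toy = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1],
    'artist':  ['muse', 'nirvana', 'coldplay', 'placebo', 'him'],
    'plays':   [50, 20, 5, 30, 30],
})

# Default method='average': tied rows share the mean of their ranks (1.5).
toy['rank_avg'] = toy.groupby('user_id')['plays'].rank(ascending=False)

# method='min': tied rows all receive the best rank in the tie (1.0).
toy['rank_min'] = toy.groupby('user_id')['plays'].rank(ascending=False, method='min')

top_avg = toy[toy['rank_avg'] == 1]
top_min = toy[toy['rank_min'] == 1]

print(top_avg['artist'].tolist())
print(top_min['artist'].tolist())
```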

In [32]:
# Users with red hot chili peppers as their first choice
data_1[data_1['artist_id'] == 220128].head()

Unnamed: 0,user,artist,plays,user_id,artist_id,rank
61447,00df0ce0bf1c2eecf846662a567a114b9af72c1f,red hot chili peppers,3254,1246,220128,1.0
80764,01260f09011b3d7c3a7c70b805a366023958e20e,red hot chili peppers,772,1644,220128,1.0
95902,015b1447b3edfa2eb5406dcee6f54fb009c22da2,red hot chili peppers,587,1949,220128,1.0
121928,01bc707b65903d4782ddce4516b0b48f799fd5bc,red hot chili peppers,213,2487,220128,1.0
146656,02197338f4446ed62055d0e03bbdb3755540d304,red hot chili peppers,472,2994,220128,1.0


In [52]:
data[data['user_id'] == 1949]

Unnamed: 0,user,artist,plays,user_id,artist_id,rank
95902,015b1447b3edfa2eb5406dcee6f54fb009c22da2,red hot chili peppers,587,1949,220128,1.0
95903,015b1447b3edfa2eb5406dcee6f54fb009c22da2,radiohead,497,1949,217883,2.0
95904,015b1447b3edfa2eb5406dcee6f54fb009c22da2,pink floyd,404,1949,212496,3.0
95905,015b1447b3edfa2eb5406dcee6f54fb009c22da2,peter tosh,341,1949,210962,4.0
95906,015b1447b3edfa2eb5406dcee6f54fb009c22da2,placebo,327,1949,212967,5.0
95907,015b1447b3edfa2eb5406dcee6f54fb009c22da2,noir désir,323,1949,200543,6.0
95908,015b1447b3edfa2eb5406dcee6f54fb009c22da2,jack johnson,293,1949,144238,7.0
95909,015b1447b3edfa2eb5406dcee6f54fb009c22da2,tarmac,276,1949,249552,8.0
95910,015b1447b3edfa2eb5406dcee6f54fb009c22da2,death in vegas,203,1949,86664,9.0
95911,015b1447b3edfa2eb5406dcee6f54fb009c22da2,erykah badu,200,1949,109735,10.0


In [55]:
# Create recommendations for the user with id 1949
user_id = 1949

# Use the implicit recommender.
recommended = model.recommend(user_id, sparse_user_item, N=20, filter_already_liked_items=True)

artists = []
scores = []

# Get artist names from ids
for item in recommended:
    idx, score = item
    artists.append(data.artist.loc[data.artist_id == idx].iloc[0])
    scores.append(score)

# Create a dataframe of artist names and scores
recommendations = pd.DataFrame({'artist': artists, 'score': scores})

print(recommendations)

                     artist     score
0           balkan beat box  1.179114
1                 mc solaar  1.167469
2                   shantel  1.099876
3                   orishas  1.098003
4               hocus pocus  1.094345
5                wax tailor  1.084923
6           amadou & mariam  1.077355
7                     nneka  1.075789
8       easy star all-stars  1.072060
9      le peuple de l'herbe  1.064388
10      le peuple de lherbe  1.062161
11            ojos de brujo  1.061382
12            bran van 3000  1.059684
13              2 many dj's  1.057671
14                   fugees  1.056557
15             sofa surfers  1.044766
16                jazzanova  1.042040
17           babylon circus  1.040971
18  buena vista social club  1.039951
19                      ayo  1.039596
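
In a latent factor model, the score for a user-artist pair is the dot product of the user's vector and the artist's vector, and artists the user has already played are filtered out before ranking. The sketch below illustrates that idea with invented 2-factor vectors and a hypothetical helper `recommend_top_n`; it is not the library's code.

```python
import numpy as np

def recommend_top_n(user_id, user_factors, item_factors, played, n=2):
    """Rank unplayed items for one user by user-item dot product."""
    scores = (item_factors @ user_factors[user_id]).astype(float)
    scores[played[user_id] > 0] = -np.inf  # drop items the user already played
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

# Made-up factors: 2 users, 4 artists, 2 latent dimensions.
user_factors = np.array([[1.0, 0.2],
                         [0.1, 0.9]])
item_factors = np.array([[0.9, 0.0],
                         [0.7, 0.3],
                         [0.2, 0.8],
                         [0.0, 1.0]])
# played[u, i] > 0 means user u has already listened to artist i.
played = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 1]])

recs = recommend_top_n(0, user_factors, item_factors, played)
print(recs)
```

Artist 0 has the highest raw score for user 0 but is excluded because the user has already played it, which mirrors the `filter_already_liked_items=True` setting used above.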
