# **Music Recommendation System**

# **Milestone 2**

Now that we have explored the data, let's apply different algorithms to build recommendation systems.

**Note:** Use the shorter version of the data, i.e., the data after the cutoffs as used in Milestone 1.

## **Load the dataset**

In [None]:
# Load the dataset you have saved at the end of milestone 1
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary output that does not raise a key error
from collections import defaultdict

# A performance metric in sklearn
from sklearn.metrics import mean_squared_error

In [None]:
#importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/song_data.csv')

In [None]:
df_final = pd.merge(count_df, song_df.drop_duplicates(['song_id']), on="song_id", how="left")
#df = df.drop(['Unnamed: 0'],axis=1)
df_final

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999
...,...,...,...,...,...,...,...
1048570,6b4d326415bae31ca8e7c25d9f966f72d09802cc,SODCIML12A6D4FADF9,1,Axel F (Radio Edit),Axel F,Crazy Frog,2005
1048571,6b4d326415bae31ca8e7c25d9f966f72d09802cc,SOFRQTD12A81C233C0,4,Sehr kosmisch,Musik von Harmonia,Harmonia,0
1048572,6b4d326415bae31ca8e7c25d9f966f72d09802cc,SOLFXKT12AB017E3E0,3,Fireflies,Karaoke Monthly Vol. 2 (January 2010),Charttraxx Karaoke,2009
1048573,6b4d326415bae31ca8e7c25d9f966f72d09802cc,SONIFJR12A6702187A,1,Every Planet We Reach Is Dead,Demon Days,Gorillaz,2005


In [None]:
#label encoding code
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_final['user_id'] = le.fit_transform(df_final['user_id']) 

df_final['song_id'] = le.fit_transform(df_final['song_id'])

In [None]:
# Get the column containing the users
users = df_final.user_id
# Create a dictionary from users to their number of songs
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1    

In [None]:
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
df_final = df_final.loc[~df_final.user_id.isin(remove_users)]

In [None]:
# Get the column containing the songs
songs = df_final.song_id
# Create a dictionary from songs to their number of users
ratings_count = dict()
for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1    

In [None]:
# Drop records with play_count more than(>) 5
df_final = df_final.loc[df_final["play_count"] <= 5]

### **Popularity-Based Recommendation Systems**

Let's compute the average and the sum of play counts for each song, and build the popularity-based recommendation system on the sum of play counts.

In [None]:
# Calculating average play_count
average_count = df_final.groupby('song_id')['play_count'].mean()        # Hint: Use groupby function on the song_id column

# Calculating the frequency a song is played
play_freq = df_final.groupby('song_id')['play_count'].sum()       # Hint: Use groupby function on the song_id column

print('Average count:\n',average_count)
print('\n\nPlay frequency:\n',play_freq)

Average count:
 song_id
0       1.000000
1       1.560000
2       2.250000
3       2.222222
4       1.375000
          ...   
9995    1.909091
9996    1.785714
9997    1.700000
9998    1.187500
9999    1.200000
Name: play_count, Length: 9970, dtype: float64


Play frequency:
 song_id
0        5
1       39
2        9
3       20
4       33
        ..
9995    42
9996    25
9997    17
9998    19
9999     6
Name: play_count, Length: 9970, dtype: int64


In [None]:
# Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count':average_count, 'play_freq':play_freq})

# Let us see the first five records of the final_play dataset
final_play.head()

song_id,avg_count,play_freq
0,1.0,5
1,1.56,39
2,2.25,9
3,2.222222,20
4,1.375,33


Now, let's create a function to find the top n songs for a recommendation based on the play frequency of each song (the sum of its play counts). We can also set a threshold for the minimum play frequency a song needs in order to be considered for recommendation.
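
As a minimal sketch of the same idea without pandas, the snippet below ranks a made-up dictionary of songs by play frequency after applying a minimum threshold. The name `toy_play` and all of its values are illustrative assumptions, not data from this notebook.

```python
# Toy stand-in for final_play: song_id -> (avg_count, play_freq).
# All values are invented for illustration.
toy_play = {
    101: (1.2, 40),
    102: (2.5, 8),
    103: (1.8, 25),
    104: (1.1, 60),
}


def top_n_by_play_freq(stats, n, min_inter):
    """Return the n song ids with the highest play_freq above min_inter."""
    # Keep only songs whose play frequency exceeds the threshold
    eligible = {sid: freq for sid, (_, freq) in stats.items() if freq > min_inter}
    # Sort the remaining song ids by play frequency, highest first
    return sorted(eligible, key=eligible.get, reverse=True)[:n]


top_songs = top_n_by_play_freq(toy_play, n=2, min_inter=15)
print(top_songs)
```

Song 102 has the highest average play count but is dropped by the threshold, which is the reason for filtering on a minimum number of interactions before ranking.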

In [None]:
# Build the function to find top n songs
def top_n_songs(data, n, min_inter):

  # Keep only the songs whose play_freq is above the min_inter threshold
  reco = data[data['play_freq'] > min_inter]

  # Sort by play_freq in descending order
  reco = reco.sort_values(by='play_freq', ascending=False)

  return reco.index[:n]

In [None]:
# Recommend top 10 songs using the function defined above
top_n_songs(final_play, 10, 15)

Int64Index([352, 2220, 8582, 5531, 4152, 4448, 1118, 1334, 8092, 6189], dtype='int64', name='song_id')

### **User User Similarity-Based Collaborative Filtering**

To build the user-user-similarity-based and subsequent models we will use the "surprise" library.

In [None]:
# Install the surprise package using pip. Uncomment and run the below code to do the same
!pip install surprise 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.2 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633955 sha256=d60a08b3d86337dd835872945a2677b1f85f34daa49b46f245aeafc0e74218bb
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [None]:
# Import necessary libraries

# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

### Some useful functions

Below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.

**Think About It:** Which metric should be used for this problem to compare different models?

In [None]:
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Print RMSE and the mean precision@k, recall@k, and F_1 score over all test-set users"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    # Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x : x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    # Mean of all the predicted precisions are calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
    
    accuracy.rmse(predictions)

    # Command to print the overall precision
    print('Precision: ', precision)

    # Command to print the overall recall
    print('Recall: ', recall)
    
    # Formula to compute the F-1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3) if (precision + recall) != 0 else 0)

**Think About It:** In the function precision_recall_at_k above, the threshold value used is 1.5. How are precision and recall affected by changing the threshold? What is the intuition behind using a threshold value of 1.5?
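
To make the role of the threshold concrete, here is a small sketch that applies the same per-user logic as `precision_recall_at_k` to a handful of made-up (estimated, true) play counts. The helper name `precision_recall_for_user` and the toy values are assumptions for illustration only.

```python
def precision_recall_for_user(pairs, k, threshold):
    """pairs: list of (estimated, true) play counts for a single user."""
    # Rank the user's songs by estimated play count, highest first
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    top_k = ranked[:k]

    # Relevant songs: true play count at or above the threshold
    n_rel = sum(true >= threshold for _, true in ranked)
    # Recommended songs: estimate at or above the threshold, within the top k
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # Songs that are both relevant and recommended within the top k
    n_rel_and_rec_k = sum(est >= threshold and true >= threshold for est, true in top_k)

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall


# Invented (estimated, true) pairs for one user
toy_pairs = [(2.4, 3), (1.9, 1), (1.7, 2), (1.4, 2), (1.2, 1)]

low = precision_recall_for_user(toy_pairs, k=3, threshold=1.5)
high = precision_recall_for_user(toy_pairs, k=3, threshold=2.0)
print(low, high)
```

In this toy case, raising the threshold from 1.5 to 2.0 leaves fewer songs counted as recommended, so precision rises while recall falls. The direction of the change depends on the data, so this is an illustration rather than a general rule.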

In [None]:
# Instantiating Reader scale with expected rating scale 
reader = Reader(rating_scale= (0,5)) #use rating scale (0, 5)

# Loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) # Take only "user_id","song_id", and "play_count"

# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.4, random_state = 42) # Take test_size = 0.4

**Think About It:** How would changing the test size change the results and outputs?

In [None]:
# Build the default user-user-similarity model
sim_options = {'name': 'cosine',
               'user_based': True}

# KNN algorithm is used to find similar users
sim_user_user = KNNBasic(sim_options=sim_options, verbose = False, random_state=1) # Use random_state = 1 

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user) # Use sim_user_user model

RMSE: 1.1235
Precision:  0.38
Recall:  0.588
F_1 score:  0.462


**Observations and Insights:**

- RMSE: ~1.12. It indicates how far the predicted play counts are from the true play counts. Tuning the hyperparameters (GridSearchCV) may lower it.

- Precision: its value is ~0.38 (low), which means that out of all the recommended songs, 38% are relevant.

- Recall: its value is ~0.59, which means that out of all the relevant songs, 59% are recommended.

- F_1 score: its value is ~0.46. It is the harmonic mean of precision and recall, so it is pulled down by the low precision. The baseline model is not yet doing a good job.

NOTE: k = 30 is the cutoff used for precision@k and recall@k in `precision_recall_at_k`. It is not the K of KNN: `KNNBasic` was trained here with its default number of neighbors (k = 40).
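
The F_1 score printed above can be reproduced by hand from the reported precision and recall. The sketch below uses a hypothetical helper `f1_score` that also guards against the case where both values are zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the baseline user-user model
baseline_f1 = round(f1_score(0.38, 0.588), 3)
print(baseline_f1)
```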

In [None]:
# Predicting play_count for a sample user with a listened song
sim_user_user.predict(6958, 1671, verbose = True) # Use user id 6958 and song_id 1671

user: 6958       item: 1671       r_ui = None   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=1671, r_ui=None, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [None]:
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict(6958, 3232, verbose = True) # Use user_id 6958 and song_id 3232

user: 6958       item: 3232       r_ui = None   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:**

Both predictions are flagged `was_impossible: True` with the reason 'User and/or item is unknown'. This means that user 6958 and/or the song is not present in the training set under that id, so the user-user model could not compute a neighbor-based estimate. In that case Surprise returns its default prediction, the mean play count of the training set, which is why the listened song (1671) and the not-listened song (3232) both get the same estimate of 1.64. A likely cause is the filtering applied above (the cutoffs on users and play counts), together with the train-test split.
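
The `was_impossible` flag in the outputs above can be pictured with a simplified stand-in for the library's behavior: if the user or the song was never seen during training, return the global mean. The function `predict_with_fallback` and the toy ratings are assumptions for illustration and are not part of Surprise.

```python
def predict_with_fallback(train_ratings, user_id, song_id, estimate_fn):
    """Return (estimate, was_impossible) for a user-song pair.

    train_ratings: list of (user_id, song_id, play_count) seen in training.
    estimate_fn:   hypothetical model-specific estimator for known pairs.
    """
    users = {u for u, _, _ in train_ratings}
    songs = {s for _, s, _ in train_ratings}
    global_mean = sum(r for _, _, r in train_ratings) / len(train_ratings)

    # Unknown user and/or unknown song: fall back to the global mean
    if user_id not in users or song_id not in songs:
        return global_mean, True
    return estimate_fn(user_id, song_id), False


# Invented training data
toy_train = [(1, 10, 1), (1, 11, 2), (2, 10, 3)]

unknown = predict_with_fallback(toy_train, 6958, 10, lambda u, s: 1.5)
known = predict_with_fallback(toy_train, 1, 10, lambda u, s: 1.5)
print(unknown, known)
```

With Surprise itself, `trainset.to_inner_uid(raw_id)` and `trainset.to_inner_iid(raw_id)` raise a `ValueError` for ids the training set has not seen, which gives a direct way to check which of the two ids is unknown.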

Now, let's try to tune the model and see if we can improve the model performance.

In [None]:
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [True], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv=3) # n_jobs=-1 throws a memory error

# Fitting the data
gs.fit(data) # Use entire data for GridSearch

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...

In [None]:
# Train the best model found in above gridsearch

# Using optimal similarity measures for user-user collaborative filtering
sim_options = {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}

# Creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k = 30, min_k = 9, verbose = False, random_state=1)

# Training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

precision_recall_at_k(sim_user_user_optimized)

RMSE: 1.0657
Precision:  0.374
Recall:  0.683
F_1 score:  0.483


**Observations and Insights:**

Now the model is trained using the best hyperparameters among those analyzed: k = 30, min_k = 9, name: pearson_baseline, user_based: True, and min_support: 2.

The model does not improve very much, but it is better than the first one: RMSE drops from 1.1235 to 1.0657, recall rises from 0.588 to 0.683, and the F_1 score rises from 0.462 to 0.483, while precision is almost unchanged (0.38 to 0.374). The limited gain could be due to the sparsity of the data.

The model has a low precision (fraction of recommended songs that are relevant to the user) in comparison to recall (fraction of relevant songs that are recommended to the user). For this problem, recall should be more important than precision, so this trade-off is acceptable.



In [None]:
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
sim_user_user_optimized.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [None]:
# Predict the play count for a song that is not listened to by the user (with user_id 6958)
sim_user_user_optimized.predict(6958, 3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:**

The predictions do not change compared to the baseline model: both are again flagged `was_impossible` and both return the same estimate of ~1.64.

**Think About It:** Along with making predictions on listened and unknown songs, can we get the 5 nearest neighbors (most similar) to a certain song?

In [None]:
# Use inner id 0
sim_user_user_optimized.get_neighbors(0, k=5)

[649, 501, 736, 426, 354]
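
Note that `get_neighbors` takes and returns inner ids, and because `sim_user_user_optimized` is user-based, the ids above refer to users rather than songs. As a rough sketch of what a nearest-neighbor lookup does, the snippet below ranks toy user profiles by cosine similarity. Computing the similarity over shared songs only is an assumption of this sketch, and all names and values are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {song_id: play_count} profiles,
    computed over the songs both profiles share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[s] * b[s] for s in shared)
    norm_a = math.sqrt(sum(a[s] ** 2 for s in shared))
    norm_b = math.sqrt(sum(b[s] ** 2 for s in shared))
    return dot / (norm_a * norm_b)


def nearest_neighbors(target, profiles, k):
    """Return the k profile names most similar to the target profile."""
    sims = [(name, cosine(profiles[target], profiles[name]))
            for name in profiles if name != target]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in sims[:k]]


# Invented play-count profiles
toy_profiles = {
    "A": {1: 1, 2: 2},
    "B": {1: 2, 2: 4},   # same proportions as A
    "C": {1: 2, 2: 1},   # opposite preference
    "D": {3: 5},         # no songs in common with A
}

neighbors = nearest_neighbors("A", toy_profiles, k=2)
print(neighbors)
```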

Below we will be implementing a function where the input parameters are:

- data: A **song** dataset
- user_id: The user id **for which we want the recommendations**
- top_n: The **number of songs we want to recommend**
- algo: The algorithm we want to use **for predicting the play_count**

The output of the function is a **list of the top_n songs** recommended for the given user_id based on the given algorithm, each paired with its predicted play count.

In [None]:
def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store the recommended song ids
    recommendations = []
    
    # creating a user-item interactions matrix 
    user_item_interactions_matrix = data.pivot(index='user_id', columns='song_id', values='play_count')
    
    # extracting the song ids which the user_id has not listened to yet
    non_interacted_songs = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_songs:
        
        # predicting the play count of those songs for this user
        est = algo.predict(user_id, item_id).est
        
        # appending the predicted play counts
        recommendations.append((item_id, est))

    # sorting the predicted play counts in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning the top n songs with the highest predicted play count for this user

In [None]:
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_user_user_optimized)

# This raises a KeyError, apparently because user 6958 is missing from df_final

KeyError: ignored
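
The `KeyError` above is consistent with the user id not being present in the index of the pivoted matrix. One possible way to avoid the crash is to check for the user before ranking, as in the sketch below. It uses a plain dictionary instead of a DataFrame, and `recommend_unseen` and `toy_predict` are hypothetical names with made-up scores.

```python
def recommend_unseen(interactions, user_id, top_n, predict_fn):
    """interactions: dict of user_id -> {song_id: play_count}."""
    # Unknown user: return an empty list instead of raising
    if user_id not in interactions:
        return []

    # Songs that appear in the data but not in this user's history
    all_songs = {song for plays in interactions.values() for song in plays}
    unseen = all_songs - set(interactions[user_id])

    # Score each unseen song and keep the highest predictions
    scored = [(song, predict_fn(user_id, song)) for song in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_predict(user_id, song_id):
    """Stand-in predictor with invented scores."""
    return song_id / 10


toy_interactions = {1: {10: 2, 11: 1}, 2: {11: 3, 12: 1, 13: 2}}

recs = recommend_unseen(toy_interactions, 1, 1, toy_predict)
missing = recommend_unseen(toy_interactions, 6958, 5, toy_predict)
print(recs, missing)
```

Returning an empty list lets the caller decide what to do next, for example falling back to the popularity-based ranking for users the filtered data no longer contains.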

In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['song_id','predicted_ratings'])

NameError: ignored

**Observations and Insights:______________**

### Correcting the play_counts and Ranking the above songs

In [None]:
def ranking_songs(recommendations, final_rating):
  # Sort the songs based on play counts
  ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()

  # Merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = ['song_id', 'predicted_ratings']), on = 'song_id', how = 'inner')

  # Rank the songs based on corrected play_counts
  ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # Sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_ratings', ascending = False)
  
  return ranked_songs

**Think About It:** In the above function, a quantity 1/np.sqrt(n) is subtracted to correct the predicted play_count. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting it?
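
A quick numeric check helps with the intuition: the penalty 1/sqrt(n) is large for songs with few plays and shrinks as the play frequency grows. The helper `corrected_rating` below mirrors the formula in `ranking_songs`; the input values are invented.

```python
import math


def corrected_rating(predicted, play_freq):
    """Subtract a penalty that shrinks as play_freq grows."""
    return predicted - 1 / math.sqrt(play_freq)


# Same predicted play count, very different amounts of evidence
few_plays = corrected_rating(2.0, 4)      # penalty of 1/2
many_plays = corrected_rating(2.0, 100)   # penalty of 1/10
print(few_plays, many_plays)
```

Two songs with the same predicted play count are therefore ranked differently, with the more frequently played song placed first.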

In [None]:
# Applying the ranking_songs function on the final_play data
ranking_songs(recommendations, final_play)

NameError: ignored

**Observations and Insights:______________**

### Item Item Similarity-based collaborative filtering recommendation systems 

In [None]:
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance

#Specifying user_based = False
sim_options = {'name':'cosine', 'user_based': False}

# Finding similar songs with the KNNBasic algorithm
sim_item_item = KNNBasic(sim_options = sim_options, random_state = 1, verbose = False)

# Training the algorithm on the trainset
sim_item_item.fit(trainset)

#Computing metrics with k=30
precision_recall_at_k(sim_item_item)


RMSE: 1.0419
Precision:  0.377
Recall:  0.49
F_1 score:  0.426


**Observations and Insights:**

- The F_1 score of the baseline model is about 0.43. This is low, and tuning the hyperparameters with GridSearchCV may improve it.

In [None]:
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
sim_item_item.predict(6958, 1671, r_ui = 2, verbose = False)

Prediction(uid=6958, iid=1671, r_ui=2, est=1.8008578431372548, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [None]:
# Predict the play count for a song the user has not listened to (song_id 3232)
sim_item_item.predict(6958, 3232, verbose = False)

Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:**

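- For user id 6958 and song id 1671, the estimated play count is about 1.80.
- For the same user and a song the user has not listened to (song id 3232), the estimated play count is about 1.64.
- Both predictions are flagged `was_impossible`, so they should not be read as neighbor-based estimates.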

In [None]:
# Apply grid search for enhancing model performance

# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [False], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3)

# Fitting the data
gs.fit(data)

# Find the best RMSE score
print(gs.best_score['rmse'])

# Extract the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...

**Think About It:** How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameters [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html).
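
Before adding more hyperparameters, it can help to count how many models the search has to fit. The sketch below enumerates the item-item grid used above with `itertools.product`, under the assumption that the nested `sim_options` lists are expanded into every combination in the same way.

```python
from itertools import product

# The same grid as in the item-item search above
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [False], "min_support": [2, 4]}
              }

# Expand the nested similarity options into a list of concrete dicts
sim_grid = param_grid['sim_options']
sim_combos = [dict(zip(sim_grid, values)) for values in product(*sim_grid.values())]

# Expand the outer grid, treating each similarity dict as one candidate value
outer = {**param_grid, 'sim_options': sim_combos}
combos = [dict(zip(outer, values)) for values in product(*outer.values())]

n_folds = 3
n_fits = len(combos) * n_folds
print(len(sim_combos), len(combos), n_fits)
```

Each fit also builds a similarity matrix, which is one reason the search is slow and memory hungry on this dataset.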

In [None]:
# Apply the best model found in the grid search

# Item-item collaborative filtering
sim_options = {'name': 'cosine', 'user_based': False, 'min_support': 2}

# Instantiating KNNBasic with optimal hyperparameters
# TODO: replace with the best hyperparameters reported by the grid search
sim_item_item_optimized = KNNBasic(sim_options = sim_options, k = 30, min_k = 3, random_state = 1, verbose = False)

#Training algorithm on trainset
sim_item_item_optimized.fit(trainset)

#Computing metrics for tuned hyperparameters
precision_recall_at_k(sim_item_item_optimized)



RMSE: 1.0689
Precision:  0.4
Recall:  0.545
F_1 score:  0.461


**Observations and Insights:**

- After tuning the hyperparameters, precision (0.377 to 0.40), recall (0.49 to 0.545), and the F_1 score (0.426 to 0.461) are all better than in the baseline model, while RMSE is slightly worse (1.0419 to 1.0689). Taking the F_1 score as the main metric for this model, performance has improved overall.



In [None]:
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
sim_item_item_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [None]:
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
sim_item_item_optimized.predict(6958, 3232, verbose = True)

user: 6958       item: 3232       r_ui = None   est = 1.64   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:**

- Using the optimized model to predict the play count for user id 6958 and song id 1671, the estimate is 1.64, which is lower than the baseline estimate (1.80).

- For user id 6958 and a song the user has not listened to (song id 3232), the optimized estimate is the same as the baseline one (1.64).

In [None]:
# Find five most similar items to the item with inner id 0
sim_item_item_optimized.get_neighbors(0, k = 5)

[24, 118, 177, 86, 158]

In [None]:
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_item_item_optimized)

KeyError: ignored

In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['song_id','predicted_ratings'])

NameError: ignored

In [None]:
# Applying the ranking_songs function
ranking_songs(recommendations, final_play)

NameError: ignored

**Observations and Insights:_________**

### Model Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a **personalized recommendation system**: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use **latent features** to find recommendations for each user.
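
To show how latent features turn into a prediction, here is a minimal sketch of the SVD estimate: the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, clipped to the rating scale. The function `svd_estimate` and every number in it are illustrative assumptions, not values learned in this notebook.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 low=0.0, high=5.0):
    """Estimate a play count from biases and latent factors."""
    # Interaction term: dot product of the two latent vectors
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    # Keep the estimate inside the rating scale
    return min(high, max(low, raw))


# Invented parameters with two latent factors
est = svd_estimate(1.6, 0.1, -0.2, [0.5, 1.0], [0.2, 0.1])

# With zero biases and no factors, the estimate is just the global mean
cold_start = svd_estimate(1.6, 0.0, 0.0, [], [])
print(est, cold_start)
```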

In [None]:
#Build baseline model using svd

#SVD - matrix factorization
svd = SVD(random_state=1)

#Training algorithm on trainset
svd.fit(trainset)

#Computing metrics for svd
precision_recall_at_k(svd)

RMSE: 1.0362
Precision:  0.392
Recall:  0.529
F_1 score:  0.45


In [None]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
svd.predict(6958, 1671, r_ui = 2, verbose = True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.35   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.349708259405322, details={'was_impossible': False})

In [None]:
# Making a prediction for the user who has not listened to the song (song_id 3232)
svd.predict(6958, 3232, verbose = True)

user: 6958       item: 3232       r_ui = None   est = 1.61   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.607241192222509, details={'was_impossible': False})

#### Improving matrix factorization based recommendation system by tuning its hyperparameters

In [None]:
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Perform 3-fold grid-search cross-validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv = 3)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.0209764532994463
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.2}


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).
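
In the grid above, `n_epochs` is the number of passes over the training data, `lr_all` is the learning rate, and `reg_all` is the regularization term. The sketch below applies one textbook stochastic gradient descent update for a single observed play count to show how the last two interact. It is a simplified illustration with invented starting values, not the library's implementation.

```python
def sgd_step(true_r, mu, b_u, b_i, p_u, q_i, lr, reg):
    """One SGD update for a single (user, song, play_count) observation."""
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    err = true_r - pred

    # Biases move toward the error and are shrunk by the regularization term
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)

    # Both factor vectors are updated from their old values
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return err, new_b_u, new_b_i, new_p_u, new_q_i


# Invented starting point, with lr and reg taken from the grid above
err, b_u, b_i, p_u, q_i = sgd_step(
    true_r=2, mu=1.5, b_u=0.0, b_i=0.0, p_u=[0.1], q_i=[0.1], lr=0.005, reg=0.2
)
print(err, b_u, p_u)
```

A larger `lr` takes bigger steps per observation, while a larger `reg` pulls the biases and factors more strongly toward zero, which limits overfitting on sparse data.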

In [None]:
# Building the optimized SVD model using optimal hyperparameters
svd_optimized = SVD(n_epochs = 10, lr_all = 0.005, reg_all = 0.2, random_state = 1)

#Training algorithm on trainset
svd_optimized.fit(trainset)

#Computing metrics for svd_optimized
precision_recall_at_k(svd_optimized)

RMSE: 1.0232
Precision:  0.393
Recall:  0.54
F_1 score:  0.455


**Observations and Insights:**


- RMSE: ~1.02 after tuning (1.0232, down from 1.0362 for the baseline SVD). It indicates how far the predicted play counts are from the true play counts.

- Recall: its value is ~0.54, which means that out of all the relevant songs, 54% are recommended. This is acceptable.

- Precision: its value is ~0.39 (low), which means that out of all the recommended songs, 39% are relevant.

- F_1 score: its value is ~0.455. It is the harmonic mean of precision and recall and remains low, so the model is still not doing a good job.

- With the baseline SVD, user 6958 is predicted to play song 1671 (which the user has already listened to) about 1.35 times.

- With the baseline SVD, user 6958 is predicted to play song 3232 (which the user has not listened to) about 1.61 times.

- Tuning the hyperparameters has barely improved the SVD model: precision is essentially unchanged (0.392 to 0.393) and still low, while RMSE and recall (0.529 to 0.54) have improved a little.



In [None]:
# Using svd_optimized model to predict for user_id 6958 and song_id 1671
svd_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.40   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.39925167928939, details={'was_impossible': False})

In [None]:
# Using svd_optimized model to predict for user_id 6958 and song_id 3232 (not listened to by the user)
svd_optimized.predict(6958, 3232, verbose = True)

user: 6958       item: 3232       r_ui = None   est = 1.62   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.618245499191109, details={'was_impossible': False})

**Observations and Insights:**

- User 6958, who has interacted with song 1671, has a predicted play count of ~1.40, which is not far from the real value.

- User 6958, who has NOT interacted with song 3232, has a predicted play count of ~1.62.

- The optimized predictions are very close to the non-optimized ones.
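
In Surprise, the `r_ui` argument of `predict` is the true rating for the pair and is only reported alongside the estimate. The sketch below shows one way to look up the observed play count before calling `predict`, using a tiny made-up interaction list in place of `df_final`. The rows and the helper `true_play_count` are hypothetical and do not reflect the real data.

```python
# Hypothetical (user_id, song_id, play_count) rows standing in for df_final
interactions = [
    (6958, 447, 1),
    (6958, 1671, 2),   # made-up play count, for illustration only
    (27018, 1671, 1),
]


def true_play_count(rows, user_id, song_id):
    """Return the observed play count for the pair, or None if it was never played."""
    for user, song, count in rows:
        if user == user_id and song == song_id:
            return count
    return None


r_ui_known = true_play_count(interactions, 6958, 1671)    # value to pass as r_ui
r_ui_unknown = true_play_count(interactions, 6958, 3232)  # None: pair not observed
print(r_ui_known, r_ui_unknown)
```

Passing the looked-up value as `r_ui` makes the verbose output directly comparable with `est`.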

In [None]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized)

KeyError: ignored

In [None]:
# Ranking songs based on above recommendations
ranking_songs(svd_recommendations, df_final)

NameError: ignored

**Observations and Insights:_________**

### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.
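
As a sketch of how a co-clustering prediction is formed, the Surprise documentation describes the estimate as the average of the co-cluster the pair falls in, corrected by how far the user and the item sit from their own cluster averages. The function below writes that formula in plain Python; all numbers are made up and `cocluster_estimate` is a hypothetical helper.

```python
def cocluster_estimate(avg_cocluster, avg_user_cluster, avg_item_cluster,
                       user_mean, item_mean):
    """Co-cluster average plus user and item offsets from their cluster averages."""
    user_offset = user_mean - avg_user_cluster
    item_offset = item_mean - avg_item_cluster
    return avg_cocluster + user_offset + item_offset


# Hypothetical averages of play counts
estimate = cocluster_estimate(
    avg_cocluster=1.5,      # mean play count in the (user cluster, item cluster) block
    avg_user_cluster=1.6,   # mean play count over the user's cluster
    avg_item_cluster=1.7,   # mean play count over the item's cluster
    user_mean=1.8,          # this user's own mean play count
    item_mean=1.9,          # this song's own mean play count
)
print(round(estimate, 2))
```

Because the estimate is built from a handful of averages, predictions for different songs can end up very close to each other.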

In [None]:
# Make baseline clustering model

#Algorithm for clustering
cluster_baseline = CoClustering(random_state = 1)

#Training algorithm (on trainset)
cluster_baseline.fit(trainset)

#Computing metrics
precision_recall_at_k(cluster_baseline)

RMSE: 1.1018
Precision:  0.386
Recall:  0.517
F_1 score:  0.442


In [None]:
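# Making prediction for user_id 6958 and song_id 1671
# Note: r_ui is meant to be the true play_count for this pair; the user id was passed here instead.
# r_ui is only echoed in the output and does not affect the estimate (est).
cluster_baseline.predict(6958, 1671, r_ui = 6958, verbose = True)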

user: 6958       item: 1671       r_ui = 6958.00   est = 1.64   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=6958, est=1.6378550945070267, details={'was_impossible': False})

In [None]:
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
cluster_baseline.predict(6958, 3232, verbose = True)

user: 6958       item: 3232       r_ui = None   est = 1.64   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': False})

#### Improving clustering-based recommendation system by tuning its hyper-parameters

In [None]:
# Set the parameter space to tune
param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}

# Performing 3-fold grid search cross-validation
gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1228578449811095
{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/co_clustering.html).
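
To get a feel for the cost of the search above, the sketch below enumerates the parameter grid with `itertools.product` and counts the model fits implied by 3-fold cross-validation. It only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}

# Every combination of the listed values, as a list of dicts
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)  # 48 combinations, 144 fits
```

The best combination found sits at the lower edge of every range, so extending the grid toward smaller values might be worth trying.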

In [None]:
# Train the tuned Coclustering algorithm
cluster_tuned = CoClustering(n_cltr_u = 5, n_cltr_i = 5, n_epochs = 10, random_state = 1)

#Training algorithm (on trainset)
cluster_tuned.fit(trainset)

#Computing metrics
precision_recall_at_k(cluster_tuned)

RMSE: 1.1272
Precision:  0.388
Recall:  0.496
F_1 score:  0.435


**Observations and Insights:**

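The tuned model does not improve on the baseline CoClustering model. Precision is essentially unchanged (0.386 to 0.388), while RMSE gets slightly worse (1.1018 to 1.1272) and recall and F_1 score drop slightly (0.517 to 0.496 and 0.442 to 0.435).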
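
To make the comparison concrete, the short sketch below subtracts the baseline CoClustering metrics from the tuned ones, using the numbers printed in the cells above.

```python
# Metrics printed by precision_recall_at_k for each model
baseline = {'RMSE': 1.1018, 'Precision': 0.386, 'Recall': 0.517, 'F_1': 0.442}
tuned = {'RMSE': 1.1272, 'Precision': 0.388, 'Recall': 0.496, 'F_1': 0.435}

# Tuned minus baseline; for RMSE a positive delta means a larger error
deltas = {name: round(tuned[name] - baseline[name], 4) for name in baseline}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Only precision moves in the desired direction, and by a very small amount.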

In [None]:
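# Using the cluster_tuned model to predict for user_id 6958 and song_id 1671
# Note: r_ui is meant to be the true play_count for this pair; the user id was passed here instead.
# r_ui is only echoed in the output and does not affect the estimate (est).
cluster_tuned.predict(6958, 1671, r_ui = 6958, verbose = True)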

user: 6958       item: 1671       r_ui = 6958.00   est = 1.64   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=6958, est=1.6378550945070267, details={'was_impossible': False})

In [None]:
# Using the cluster_tuned model to predict for user_id 6958 and song_id 3232, whose true play_count is unknown
cluster_tuned.predict(6958, 3232, verbose = True)

user: 6958       item: 3232       r_ui = None   est = 1.64   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6378550945070267, details={'was_impossible': False})

**Observations and Insights:**

The predictions for user 6958 do not change after tuning: the estimate is ~1.64 for both song 1671 (listened) and song 3232 (not listened), exactly as in the baseline model.

#### Implementing the recommendation algorithm based on optimized CoClustering model

In [None]:
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = get_recommendations(df_final, 6958, 5, cluster_tuned)

KeyError: ignored

### Correcting the play_count and Ranking the above songs

In [None]:
# Ranking songs based on the above recommendations
ranking_songs(clustering_recommendations, df_final)

**Observations and Insights:_________**

### Content Based Recommendation Systems

**Think About It:** So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?

In [None]:
# Work on a copy so that adding columns below does not modify df_final
df_small = df_final.copy()

In [None]:
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
df_small['text'] = df_small.title + ' ' + df_small.release + ' ' + df_small.artist_name

In [None]:
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]
df_small = df_small.drop_duplicates(subset=['title'])
df_small = df_small.set_index('title')
df_small.head()

Unnamed: 0_level_0,user_id,song_id,play_count,text
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aunt Eggma Blowtorch,3605,12,1,Aunt Eggma Blowtorch Everything Is Neutral Mil...
Full Circle,3605,40,1,Full Circle Breakout Miley Cyrus
Poor Jackie,3605,151,2,Poor Jackie Rabbit Habits Man Man
Hot N Cold (Manhattan Clique Remix Radio Edit),3605,326,1,Hot N Cold (Manhattan Clique Remix Radio Edit)...
Daisy And Prudence,3605,447,1,Daisy And Prudence Distillation Erin McKeown


In [None]:
df_small.shape

(9539, 4)

In [None]:
indices = pd.Series(df_small.index)
indices[:5]

0                              Aunt Eggma Blowtorch
1                                       Full Circle
2                                       Poor Jackie
3    Hot N Cold (Manhattan Clique Remix Radio Edit)
4                                Daisy And Prudence
Name: title, dtype: object

In [None]:
# Importing necessary packages to work with text data
import nltk

# Download punkt library
nltk.download("punkt")

# Download stopwords library
nltk.download("stopwords")

# Download wordnet 
nltk.download("wordnet")

nltk.download('omw-1.4')

# Import regular expression
import re

# Import word_tokenizer
from nltk import word_tokenize

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Import stopwords
from nltk.corpus import stopwords

# Import CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


We will create a **function to pre-process the text data:**

In [None]:
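# Build the stopword set and the lemmatizer once, instead of once per token
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Function to tokenize the text
def tokenize(text):
    
    # Lowercase the text and replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]"," ", text.lower())
    
    # Split the text into word tokens
    tokens = word_tokenize(text)
    
    # Remove English stopwords
    words = [word for word in tokens if word not in STOP_WORDS]
    
    # Lemmatize the remaining words
    text_lems = [LEMMATIZER.lemmatize(word).strip() for word in words]

    return text_lems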

In [None]:
# Create tfidf vectorizer 
tfidf = TfidfVectorizer(tokenizer=tokenize)

# Fit_transform the above vectorizer on the text column and then convert the output into an array
tfidf_songs = tfidf.fit_transform(df_small['text'].values).toarray()

In [None]:
#Creating Dataframe
pd.DataFrame(tfidf_songs)

# Compute the pairwise cosine similarity between the tfidf vectors of all songs
songs_similarity = cosine_similarity(tfidf_songs, tfidf_songs)

songs_similarity

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.03909159],
       [0.        , 0.        , 0.        , ..., 0.        , 0.03909159,
        1.        ]])
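
Each entry of the matrix above is the cosine similarity between the text vectors of two songs. As a minimal sketch of what `cosine_similarity` computes for one pair, the plain-Python function below uses made-up term-weight vectors over a four-word vocabulary; the vectors are hypothetical and are not taken from `tfidf_songs`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors over a 4-word vocabulary
song_a = [1.0, 1.0, 0.0, 0.0]
song_b = [0.0, 0.0, 1.0, 1.0]   # shares no words with song_a
song_c = [1.0, 0.0, 1.0, 0.0]   # shares one word with song_a

print(cosine(song_a, song_a))   # approximately 1: a song compared with itself
print(cosine(song_a, song_b))   # 0.0: no words in common
print(cosine(song_a, song_c))   # approximately 0.5: partial overlap
```

This is why the diagonal of `songs_similarity` is 1 and most off-diagonal entries are 0: two songs with no shared words in their text have zero similarity.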

Finally, let's create a function to find the most similar songs to recommend for a given song.

In [None]:
# Function that takes a song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
    
    recommended_songs = []
    
    # Getting the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # Getting the indexes of the 10 most similar songs
    # (position 0 is skipped because it is expected to be the song itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)
    
    # Populating the list with the titles of the 10 best matching songs
    for i in top_10_indexes:
        recommended_songs.append(df_small.index[i])
        
    return recommended_songs

Recommending 10 songs similar to 'Learn To Fly':

In [None]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', songs_similarity)

[1956, 5606, 1939, 4528, 3453, 3507, 1954, 1918, 5138, 1934]


['Generator',
 'Stacked Actors',
 'Big Me',
 'For All The Cows',
 'Exhausted',
 'Floaty',
 'Wattershed',
 'Oh_ George',
 'X-Static',
 "I'll Stick Around"]

**Observations and Insights:**

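We obtain 10 songs similar to "Learn To Fly" based on each song's text information (title, release, artist name), using natural language processing techniques (TF-IDF vectors compared by cosine similarity). The recommended titles all appear to be Foo Fighters songs, like "Learn To Fly" itself, which suggests that the shared artist name and release words dominate the similarity.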
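
The `recommendations` function skips the first sorted entry on the assumption that it is the query song itself. When another song has an identical text vector, that assumption may not hold, so a slightly more defensive variant excludes the query by index. The sketch below shows this in plain Python on a made-up similarity row; `top_k_similar` is a hypothetical helper, not part of the notebook.

```python
def top_k_similar(similarity_row, query_index, k=10):
    """Indexes of the k highest scores in the row, excluding the query itself."""
    candidates = [
        (score, idx)
        for idx, score in enumerate(similarity_row)
        if idx != query_index
    ]
    # Stable sort: ties keep their original order
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in candidates[:k]]


# Hypothetical similarity row for song 0; song 2 has exactly the same text vector
row = [1.0, 0.2, 1.0, 0.0, 0.7]
print(top_k_similar(row, query_index=0, k=3))  # [2, 4, 1]
```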

## **Conclusion and Recommendations:** 

- **Refined Insights -** What are the most meaningful insights from the data relevant to the problem?

- With very little information (song id, user id, play frequency, song title, etc.), an interesting recommendation system can be built.

- The downside is that the datasets needed to build a recommendation system are very big (a million rows!) and most of the rows are not used, so the process is not optimized.

- The data is very sparse: not every user listens to every song, and not every song is listened to by every user. For performance reasons, a cutoff is also needed to keep only the songs that have been listened to a certain number of times (here, 90).

- Ideally, the recommendation system should be dynamic, so that it recommends the latest songs and catches local or global trends. Working with the last.fm or Spotify APIs could therefore be a good idea. This dataset is not dynamic and only recommends songs from the past.
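
The sparsity mentioned above can be quantified as the share of user-song pairs that have no observed play count. The sketch below computes it on a tiny made-up set of interactions; the ids and counts are hypothetical and only illustrate the calculation.

```python
# Hypothetical play counts: {(user_id, song_id): play_count}
play_counts = {
    ("u1", "s1"): 3,
    ("u1", "s4"): 1,
    ("u2", "s2"): 2,
    ("u3", "s1"): 1,
    ("u3", "s3"): 5,
}

users = {user for user, _ in play_counts}
songs = {song for _, song in play_counts}

# Fraction of the user-song matrix that actually holds an observation
density = len(play_counts) / (len(users) * len(songs))
sparsity = 1 - density
print(f"density = {density:.3f}, sparsity = {sparsity:.3f}")
```

On real listening data the density is typically far lower than in this toy example, which is what makes cutoffs and sparse representations necessary.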



- **Comparison of various techniques and their relative performance -** How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

In this notebook, several recommendation techniques have been used to recommend content to users:

- User-User similarity-based collaborative filtering:

    RMSE: 1.0657
    Precision:  0.374
    Recall:  0.683
    F_1 score:  0.483

- Item-Item similarity-based collaborative filtering:

    RMSE: 1.0689
    Precision:  0.4
    Recall:  0.545
    F_1 score:  0.461

- Model-based collaborative filtering (matrix factorization):

    RMSE: 1.0232
    Precision:  0.393
    Recall:  0.54
    F_1 score:  0.455

- Cluster based recommendation:

    RMSE: 1.1272
    Precision:  0.388
    Recall:  0.496
    F_1 score:  0.435

- Natural language processing content based recommendations.

- The best performance in terms of F_1 score is achieved by the user-user recommendation system (0.483), while the matrix factorization model has the lowest RMSE (1.0232).

- The cluster-based recommendation could be used to recommend certain types of music to clusters of users (for example, by genre).

- For a new user without historical data, the function top_n_songs would be interesting to use.

- I think the item-item similarity-based collaborative filtering does not work as well as user-user because of the cutoff that was set (songs listened to at least 90 times). Without this handicap I would use it as well, but I would work with a larger database of listened songs.
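
The comparison above can be reproduced with a few lines of Python. The sketch below stores the metrics reported for each technique and picks the best model by F_1 score and by RMSE; the short model names are labels chosen for this illustration.

```python
# Metrics reported above for each technique
results = {
    'User-User': {'RMSE': 1.0657, 'Precision': 0.374, 'Recall': 0.683, 'F_1': 0.483},
    'Item-Item': {'RMSE': 1.0689, 'Precision': 0.4, 'Recall': 0.545, 'F_1': 0.461},
    'SVD': {'RMSE': 1.0232, 'Precision': 0.393, 'Recall': 0.54, 'F_1': 0.455},
    'CoClustering': {'RMSE': 1.1272, 'Precision': 0.388, 'Recall': 0.496, 'F_1': 0.435},
}

# Higher F_1 is better, lower RMSE is better
best_f1 = max(results, key=lambda name: results[name]['F_1'])
best_rmse = min(results, key=lambda name: results[name]['RMSE'])
print(best_f1, best_rmse)  # User-User SVD
```

The two criteria point to different models, so the choice depends on whether ranking quality (F_1) or play count accuracy (RMSE) matters more for the application.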

- **Proposal for the final solution design -** What model do you propose to be adopted? Why is this the best solution to adopt?

- The model I would use is the user-user similarity-based collaborative filtering, not only because it has given the best results but also because I think it is important to follow recommendations from people with the same musical tastes as you. It seems natural that if you like a certain type of music, you will find more interesting recommendations by listening to the bands that peers with similar tastes listen to.

- I would also use item-item similarity-based collaborative filtering and cluster-based recommendations to filter by genre, clustering together people who listen to similar types of music.

- Finally, I would use natural language processing content-based recommendations on lyrics, recommending songs with similar lyrics to users who find this important.

- The data is very sparse, so to reduce storage problems I would use Spotify play counts.

- A crucial aspect of the music recommendation system would be response time, so I would serialize the trained models with pickle so that they do not have to be retrained each time.
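
As a minimal sketch of the pickle idea, the code below saves an object to disk and loads it back with the standard library. A plain dictionary stands in for a trained model, and its contents and the file name are made up for the example.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (hypothetical contents)
model = {'algorithm': 'user-user similarity', 'k': 30}

# Save the model once, after training
path = os.path.join(tempfile.mkdtemp(), 'recommender.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it later, without retraining
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Surprise also provides a `dump` module for saving and loading algorithms, which may be more convenient than calling pickle directly. Pickle files should only be loaded from trusted sources.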
