## Recommender System via Surprise library

### Importing data, libraries & prepping the data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import surprise
from surprise.model_selection import cross_validate
from surprise import Dataset
from surprise import Reader
from surprise import SVD
import heapq

In [27]:
path1 = "/Users/miraslavats/artsist recommender system/usersha1-artmbid-artname-plays.tsv"
user_art_plays = pd.read_csv(path1, sep = "\t", header = None)
user_art_plays.drop(user_art_plays.tail(48).index,inplace = True)
user_art_plays.dropna(inplace = True)
user_art_plays.columns = ['User id', 'Artist id', 'Artist', 'No plays']

In the cell below I normalize the values in the "No plays" column to the range 0-1 with min-max scaling. Without this, the values would range from 0 to nearly 450000, which is a very large scale for analysis. I also create a new dataframe with only three columns, "User id", "Artist id", and "No plays", in that order, because the surprise library expects a dataframe of user ids, item ids, and ratings.

In [3]:
normalized_=(user_art_plays['No plays']-user_art_plays['No plays'].min())/(user_art_plays['No plays'].max()-user_art_plays['No plays'].min())
user_art_plays['No plays'] = normalized_
user_art_plays1 = user_art_plays.drop(['Artist'], axis = 1)
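
The min-max scaling used above can be checked on a small list. This is a minimal sketch in plain Python; the play counts and the helper name `min_max_scale` are made up for illustration and are not taken from the dataset.

```python
# Toy play counts (hypothetical values, not from the dataset)
plays = [0, 5, 50, 450000]

def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale(plays)
print(scaled)
```

With a maximum this large, typical play counts land very close to 0 after scaling, so errors measured on this scale are small in absolute terms.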

### Training, testing and evaluating SVD model with surprise

In [5]:
# creating a surprise Dataset object from the stored data
# specifying the rating scale because the default is (1, 5)
not_trainset = Dataset.load_from_df(user_art_plays1, reader = Reader(rating_scale=(0, 1)))

In [8]:
# specifying the number of latent factors in which we want the matrix to be broken down
# and the number of iterations of the Stochastic Gradient Descent algorithm 
svd = SVD(n_factors = 10, n_epochs = 10)
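
To make the role of the latent factors concrete: as described in the surprise documentation, the SVD algorithm estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below reproduces that formula in plain Python with 3 factors; the function name `predict_rating` and all numbers are hypothetical.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Baseline terms plus the dot product of user and item factor vectors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

# Hypothetical values with 3 latent factors
mu = 0.5                 # global mean rating
b_u = 0.1                # user bias
b_i = -0.2               # item (artist) bias
p_u = [0.1, 0.2, 0.0]    # user factor vector
q_i = [0.5, -0.5, 1.0]   # item factor vector

estimate = predict_rating(mu, b_u, b_i, p_u, q_i)
print(estimate)
```

With n_factors = 10, each user and each artist would be described by a vector of 10 such numbers, learned over n_epochs passes of stochastic gradient descent.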

In [9]:
# training, testing and evaluating how the SVD algorithm performs on unseen data in testing
# svd factors = 10, n_epochs = 10
cross_validate(svd, not_trainset, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0102  0.0102  0.0102  0.0102  0.0000  
MAE (testset)     0.0042  0.0042  0.0042  0.0042  0.0000  
Fit time          111.64  138.08  183.79  144.50  29.80   
Test time         149.44  77.69   92.06   106.40  31.00   


{'test_rmse': array([0.0102349, 0.0102309, 0.0102291]),
 'test_mae': array([0.00418698, 0.00417983, 0.00417592]),
 'fit_time': (111.64483308792114, 138.07686710357666, 183.79237580299377),
 'test_time': (149.43848395347595, 77.69325280189514, 92.05556321144104)}

In [7]:
# svd factors = 100, n_epochs = 10
# using a separate object so that svd stays the 10-factor model
svd_100 = SVD(n_factors = 100, n_epochs = 10)
cross_validate(svd_100, not_trainset, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0322  0.0322  0.0323  0.0322  0.0000  
MAE (testset)     0.0129  0.0129  0.0129  0.0129  0.0000  
Fit time          199.56  269.55  311.09  260.07  46.03   
Test time         104.91  155.69  97.30   119.30  25.92   


{'test_rmse': array([0.03220468, 0.03215626, 0.03225419]),
 'test_mae': array([0.01290115, 0.01288369, 0.01294568]),
 'fit_time': (199.55753993988037, 269.54899191856384, 311.09470796585083),
 'test_time': (104.90773606300354, 155.69017481803894, 97.30340313911438)}

From the above experiments, we see that SVD with hyperparameters n_factors = 10 and n_epochs = 10 performs better than the model with n_factors = 100: its mean RMSE is 0.0102 versus 0.0322, roughly a third of the error, and its mean MAE is 0.0042 versus 0.0129. A likely explanation is that the model with fewer latent factors is less prone to overfitting the training data (fitting its noise) and therefore generalizes better to unseen data.
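
For reference, the two metrics reported by cross_validate can be computed by hand. The sketch below uses plain Python and made-up actual and predicted values; only the two mean RMSE figures in the last lines come from the tables above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical normalized play counts and predictions
actual = [0.0, 0.5, 1.0, 0.25]
predicted = [0.1, 0.5, 0.8, 0.25]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)

# Ratio of the mean RMSE values reported above (100 factors vs 10 factors)
ratio = 0.0322 / 0.0102
print(toy_rmse, toy_mae, ratio)
```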

### Making predictions and recommendations

In [17]:
# testing out the prediction for a single user-artist pair
svd.predict(uid = '00000c289a1829a808ac09c00daf10bc3c4e223b', iid = '3bd73256-3905-4f3a-97e2-8b341527f805')

Prediction(uid='00000c289a1829a808ac09c00daf10bc3c4e223b', iid='3bd73256-3905-4f3a-97e2-8b341527f805', r_ui=None, est=0.1383394962446166, details={'was_impossible': False})

The get_recommendations(user, items, n) function uses the trained SVD algorithm to predict a score for every artist in the list for a specific user. These predictions (scores and artist ids) are then passed to heapq.nlargest, which returns the n highest-scored artists without fully sorting the list.

In [11]:
def get_recommendations(user, items, n):
    """
    Get the n most highly rated artists for a given user based on predicted scores.

    Args:
        user: The user for whom recommendations are being generated.
        items (list): The list of artists to consider for recommendations.
        n (int): The number of recommendations to return.

    Returns:
        list: A list of n top recommendations based on predicted scores.

    """
    # predict once per unique artist and keep (score, artist_id) pairs
    predictions = []
    for item in set(items):
        pred = svd.predict(user, item)
        predictions.append((pred.est, pred.iid))
    # nlargest accepts any iterable, so no heapify call is needed
    return heapq.nlargest(n, predictions)
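
The top-n selection step can be tried in isolation. This is a small sketch with made-up (score, artist id) tuples; it shows that heapq.nlargest works directly on a plain list and returns the same result as sorting in descending order and slicing.

```python
import heapq

# Hypothetical (score, artist_id) pairs
scored = [
    (0.12, "artist-a"),
    (0.90, "artist-b"),
    (0.45, "artist-c"),
    (0.77, "artist-d"),
]

# Tuples compare by their first element first, so the score decides the order
top_two = heapq.nlargest(2, scored)
same_as_sorted = sorted(scored, reverse=True)[:2]
print(top_two)
```

For small n relative to the number of candidates, nlargest avoids sorting the whole list, which is where the time advantage comes from.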

The get_artist_name(predictions, id_name) function takes the artist ids (output from the function above) and connects them with the corresponding artist names. Specifically, it goes through the list of recommended ids and a list of (artist id, artist name) tuples to find matches.

In [12]:
def get_artist_name(predictions, id_name):
    """
    Get the artist names corresponding to the predicted IDs in the given predictions.

    Args:
        predictions (list): A list of predictions in the form of (score, artist_id) tuples.
        id_name (list): A list of tuples mapping artist IDs to their corresponding names.

    Returns:
        list: A list of artist names corresponding to the predicted IDs.

    """
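    ids = [p[1] for p in predictions]  # getting the ids from the (score, id) tuples
    names = []
    for artist_id in ids:
        for pair in id_name:
            if pair[0] == artist_id:  # compare against the id only, not the name
                names.append(pair[1])
                break  # stop at the first match to save time
    return names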
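
As an alternative to the nested loop, the id-to-name pairs could be turned into a dictionary once, so that each lookup takes constant time on average. This is a minimal sketch with hypothetical ids and names; `name_by_id` is an illustrative variable, not part of the notebook.

```python
# Hypothetical (artist_id, artist_name) pairs
id_name_pairs = [
    ("id-1", "artist one"),
    ("id-2", "artist two"),
    ("id-3", "artist three"),
]

# Build the lookup table once
name_by_id = dict(id_name_pairs)

# Hypothetical (score, artist_id) recommendations
recommended = [(0.9, "id-3"), (0.8, "id-1")]

# Keep the recommendation order and skip ids without a known name
names = [name_by_id[artist_id] for _, artist_id in recommended if artist_id in name_by_id]
print(names)
```

One caveat of this approach: if the same id appears with several name spellings, the dictionary keeps only the last pair it sees.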

The get_rec(user_id, items_list, n_rec, id_names_) function combines the two functions defined above and formats the output for readability.

In [24]:
def get_rec(user_id, items_list, n_rec, id_names_):
    """
    Get the top recommended artists for a user and format the output.

    Args:
        user_id: The ID of the user for whom recommendations are being generated.
        items_list (list): The list of items to consider for recommendations.
        n_rec (int): The number of recommendations to return.
        id_names_ (list): A list of tuples mapping artist IDs to their corresponding names.

    Returns:
        str: A formatted string listing the top recommended artists for the user.

    """
    prediction = get_recommendations(user_id, items_list, n_rec)
    names_ = get_artist_name(prediction, list(id_names_))
    output = "\n".join(names_)
    n = "\n\n"
    return f"The {n_rec} artists recommended for user {user_id} are:{n}{output}"

Below I am testing out my SVD-based recommendation system on a sample user.

In [26]:
id_names = set(zip(user_art_plays["Artist id"], user_art_plays["Artist"]))
print(get_rec('00000c289a1829a808ac09c00daf10bc3c4e223b',user_art_plays['Artist id'], 10, id_names))

The 10 artists recommended for user 00000c289a1829a808ac09c00daf10bc3c4e223b are:

qloaca letal
the chairs
ultra violet
simmonds and cristopher
borut krisnik
mad tea party
liaison
kalibas
dead flesh fashion
six red carpets
