## Final: Song List Collaborative Filtering
#### Haris Sumra

### Our task is to build a basic recommender system from a Last.fm dataset made available through GroupLens, on behalf of Lab41. The dataset, collected from Last.fm in 2011, contains the play counts of 17,632 artists as tracked across 1,892 users. In this project we use collaborative filtering, which relies on users' interactions and preferences to generate recommendations by identifying patterns and similarities among users and items.

In [89]:
!pip3 install fuzzywuzzy



Importing Dependencies

In [101]:
# check kernel environment
import sys
print("Kernel: {}".format(sys.executable))

# Core data analysis packages
import pandas as pd
import numpy as np

# Compressed Sparse Row ("CSR") matrix
from scipy.sparse import csr_matrix 

# use K-Nearest Neighbors to find cosine distance amongst artists
from sklearn.neighbors import NearestNeighbors

# fuzzy string matching to allow for differing spelling of artist names
from fuzzywuzzy import fuzz, process

# display floats with two decimals
pd.set_option('display.float_format',lambda x: '%.2f' %x)

# set seed for reproducibility of random number initializations
seed = np.random.RandomState(seed=42)

Kernel: /Users/harisx91/anaconda3/bin/python


Importing Datasets

In [102]:
# import our files
plays = pd.read_csv('../data/user_artists.dat',sep='\t')
artists = pd.read_csv('../data/artists.dat',sep='\t',usecols=['id','name'])

# we import to understand what datapoints we have, but do not use these 
# in our collaborative engine
tags = pd.read_csv('../data/tags.dat', sep='\t',encoding='latin-1')
uta = pd.read_csv('../data/user_taggedartists.dat', sep='\t')
utat = pd.read_csv('../data/user_taggedartists-timestamps.dat', sep="\t")
friends = pd.read_csv('../data/user_friends.dat', sep='\t')

In [103]:
# create a function to provide various statistics on our data files
def print_info(df_list, df_name):
    
    # assertion to ensure our two lists are equal in length (ie we didn't make any mistakes)
    assert len(df_list) == len(df_name)

    for i in range(len(df_list)):
        print(df_name[i],'\n')
        print("Shape: {}\n".format(df_list[i].shape))
        print("Info:")
        print(df_list[i].info(),'\n')
        print("Unique:\n{}\n".format(df_list[i].nunique()))     
        
        # This returns True if no duplicates are dropped (ie duplicates do not exist)
        print("No duplicates: {}\n".format(len(df_list[i])==len(df_list[i].drop_duplicates())))

This function takes in two lists: "df_list" containing the dataframes you want to analyze, and "df_name" containing corresponding names or labels for each dataframe. It then provides various statistics and information about each dataframe, such as shape, info, number of unique values, and whether duplicates exist. 

In [104]:
df_list = [plays, artists, tags, uta, utat, friends]
df_name = ['Plays',
           'Artists',
           'Tags',
           'User Tagged Artists (Date)',
           'User Tagged Artists (Timestamp)',
           'Friends']

print_info(df_list, df_name)

Plays 

Shape: (92834, 3)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92834 entries, 0 to 92833
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    92834 non-null  int64
 1   artistID  92834 non-null  int64
 2   weight    92834 non-null  int64
dtypes: int64(3)
memory usage: 2.1 MB
None 

Unique:
userID       1892
artistID    17632
weight       5436
dtype: int64

No duplicates: True

Artists 

Shape: (17632, 2)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17632 entries, 0 to 17631
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      17632 non-null  int64 
 1   name    17632 non-null  object
dtypes: int64(1), object(1)
memory usage: 275.6+ KB
None 

Unique:
id      17632
name    17632
dtype: int64

No duplicates: True

Tags 

Shape: (11946, 2)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11946 entries, 0 to 11945
Data co

#### As previously mentioned, our data contains 17,632 artists and 1,892 users.

## Exploratory Data Analysis (EDA)

To have artist IDs and names alongside user interactions and play counts in a single table, we now combine our two main data files into one cohesive dataset:

In [105]:
ap = pd.merge(artists, 
                plays, 
                how='inner',
                left_on='id',
                right_on='artistID')

ap = ap.rename(columns={"weight":"userArtistPlays"})

ap.head()

Unnamed: 0,id,name,userID,artistID,userArtistPlays
0,1,MALICE MIZER,34,1,212
1,1,MALICE MIZER,274,1,483
2,1,MALICE MIZER,785,1,76
3,2,Diary of Dreams,135,2,1021
4,2,Diary of Dreams,257,2,152


Here we perform an inner merge between the 'artists' and 'plays' dataframes, matching the 'id' column of 'artists' to the 'artistID' column of 'plays'. We then rename the 'weight' column of the merged dataframe to 'userArtistPlays' and display its first few rows.
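As a quick check on the merge semantics, the toy example below (made-up rows, not taken from the dataset) shows that an inner merge keeps only artists that appear in both frames. The names `toy_artists`, `toy_plays`, and `toy_ap` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the artists and plays frames (values are made up)
toy_artists = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
toy_plays = pd.DataFrame({"userID": [10, 11, 10],
                          "artistID": [1, 1, 2],
                          "weight": [5, 7, 3]})

toy_ap = pd.merge(toy_artists, toy_plays, how="inner",
                  left_on="id", right_on="artistID")
toy_ap = toy_ap.rename(columns={"weight": "userArtistPlays"})

# Artist "C" has no plays, so the inner merge drops it
print(toy_ap)
```

In the notebook itself, the merged frame later reports 92,834 rows, the same count as `plays`, which suggests every play row found a matching artist.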

For each user-artist combination:

1) User Artist Plays: This data represents the count of artist plays for a specific user-artist pairing.

For each artist:

1) Total Artist Plays: This indicates the collective play count of an individual artist across all users.
2) Total Unique Users: This denotes the cumulative count of distinct users who have listened to a particular artist at least once.
3) Average User Plays: Computed as the ratio of Total Artist Plays to Total Unique Users.

For each user:

1) Total User Plays: This aggregates the play counts of all artists for a given user.
2) Total Unique Artists: This signifies the count of unique artists associated with a single user (the dataset apparently imposes a limit of 50 unique artists).


In essence, we will be extracting these meaningful metrics to gain deeper insights from our dataset.

In [106]:
artist_rank = (ap.groupby(['name']).agg({'userID':'count','userArtistPlays':'sum'}).
    rename(columns={"userID":'totalUniqueUsers',"userArtistPlays":"totalArtistPlays"}).
              sort_values(['totalArtistPlays'],ascending=False))
artist_rank['avgUserPlays'] = artist_rank['totalArtistPlays']/artist_rank['totalUniqueUsers']
user_rank = (ap.groupby(['userID']).agg({'name':'count','userArtistPlays':'sum'}).
    rename(columns={"name":'totalUniqueArtists',"userArtistPlays":"totalUserPlays"}).
            sort_values(['totalUserPlays'],ascending=False))

For the artist rankings (artist_rank), the code groups the merged dataframe 'ap' by artist names. It calculates the count of unique users ('totalUniqueUsers') who have interacted with each artist and the sum of their play counts ('totalArtistPlays'). It then sorts the results based on total artist plays in descending order. The 'avgUserPlays' column is calculated as the ratio of total artist plays to total unique users.

For the user rankings (user_rank), the code groups the merged dataframe 'ap' by user IDs. It calculates the count of unique artists ('totalUniqueArtists') each user has interacted with and the sum of their play counts ('totalUserPlays'). It then sorts the results based on total user plays in descending order.

These calculations provide insights into artist popularity and user engagement within the dataset.

In [107]:
# Britney Spears is the most played at 2.4 million plays
artist_rank.head()

name,totalUniqueUsers,totalArtistPlays,avgUserPlays
Britney Spears,522,2393140,4584.56
Depeche Mode,282,1301308,4614.57
Lady Gaga,611,1291387,2113.56
Christina Aguilera,407,1058405,2600.5
Paramore,399,963449,2414.66


According to the figures in the "Total Artist Plays" column, the average number of plays per artist is nearly 4,000, while the maximum count reaches 2.4 million, exemplified by Britney Spears.

In [108]:
artist_rank.describe()

Unnamed: 0,totalUniqueUsers,totalArtistPlays,avgUserPlays
count,17632.0,17632.0,17632.0
mean,5.27,3923.77,423.78
std,20.62,34099.34,785.38
min,1.0,1.0,1.0
25%,1.0,113.0,97.0
50%,1.0,350.0,246.0
75%,3.0,1234.25,496.88
max,611.0,2393140.0,35323.0


Presented here are the leading users based on their cumulative play counts. Additionally, it's evident that there is a limit of 50 artists per user, seemingly imposed as a parameter during the initial dataset query.

In [109]:
user_rank.head()

userID,totalUniqueArtists,totalUserPlays
757,50,480039
2000,50,468409
1418,50,416349
1642,50,388251
1094,50,379125


To streamline the process, we will combine our artist and user data into a unified table. Consequently, we must exercise caution when extracting insights from the merged columns.

In [110]:
ap2 = ap.join(artist_rank,on='name',how='inner')
ap3 = ap2.join(user_rank,on='userID',how='inner').sort_values(['userArtistPlays'],ascending=False)

# confirm no duplicated rows
assert ap3[ap3.duplicated(['userID','name'])].empty

This combines the artist and user rankings with the 'ap' dataframe, sorts the result based on user-artist play counts, and then asserts that there are no duplicate rows in terms of user and artist combinations.

In [111]:
ap3.head()

Unnamed: 0,id,name,userID,artistID,userArtistPlays,totalUniqueUsers,totalArtistPlays,avgUserPlays,totalUniqueArtists,totalUserPlays
2800,72,Depeche Mode,1642,72,352698,282,1301308,4614.57,50,388251
35843,792,Thalía,2071,792,324663,26,350035,13462.88,50,338400
27302,511,U2,1094,511,320725,185,493024,2664.99,50,379125
8152,203,Blur,1905,203,257978,114,318221,2791.41,50,276295
26670,498,Paramore,1664,498,227829,399,963449,2414.66,50,251560


In [112]:
print_info([ap3[['userID','artistID']]],['***Artist Plays***'])

***Artist Plays*** 

Shape: (92834, 2)

Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 92834 entries, 2800 to 88660
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    92834 non-null  int64
 1   artistID  92834 non-null  int64
dtypes: int64(2)
memory usage: 2.1 MB
None 

Unique:
userID       1892
artistID    17632
dtype: int64

No duplicates: True



Noteworthy statistics encompass:

17,632 distinct artist IDs and names

1,892 unique user IDs

A cumulative total of 92,834 data points, representing artist-user pairs.

In [113]:
# 69 million total plays
print("{:,}".format(ap3['userArtistPlays'].sum()))

69,183,975


### Data Insight

In [114]:
artist_rank.sort_values(['totalUniqueUsers'],ascending=False).head()

name,totalUniqueUsers,totalArtistPlays,avgUserPlays
Lady Gaga,611,1291387,2113.56
Britney Spears,522,2393140,4584.56
Rihanna,484,905423,1870.71
The Beatles,480,662116,1379.41
Katy Perry,473,532545,1125.89


Among the complete count of 1,892 users contained within this dataset, 611 individuals engaged with Lady Gaga's content at least once. This constitutes 32.3% of the entire user population.

In [115]:
len(ap3['userArtistPlays'][ap3['name']=='Lady Gaga'])

611

In [116]:
artist_rank.sort_values(['avgUserPlays'],ascending=False).head()

name,totalUniqueUsers,totalArtistPlays,avgUserPlays
Viking Quest,1,35323,35323.0
Tyler Adam,1,30614,30614.0
Rytmus,1,23462,23462.0
Johnny Hallyday,2,32995,16497.5
Dicky Dixon,1,15345,15345.0



Now, let's examine what we can term as "user loyalty," quantified by the average frequency at which a user listens to a particular artist. Once more, this is calculated as the ratio of total artist plays to total unique users.

A case in point is "Viking Quest," which has a solitary unique user responsible for 35,000 plays. Consequently, on average, this artist achieves the highest number of plays per user, making it the frontrunner in terms of user engagement.
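Because artists with a single listener can dominate this ranking, one option is to require a minimum number of listeners before sorting by average plays. The sketch below uses made-up numbers and a hypothetical cutoff `min_listeners`; it is not part of the original analysis.

```python
import pandas as pd

# Toy artist-level table using the same column names as artist_rank
toy_rank = pd.DataFrame(
    {"totalUniqueUsers": [1, 2, 300, 500],
     "totalArtistPlays": [35000, 33000, 600000, 2000000]},
    index=pd.Index(["Niche A", "Niche B", "Popular A", "Popular B"], name="name"),
)
toy_rank["avgUserPlays"] = toy_rank["totalArtistPlays"] / toy_rank["totalUniqueUsers"]

min_listeners = 10  # hypothetical cutoff, chosen only for illustration
loyal = (toy_rank[toy_rank["totalUniqueUsers"] >= min_listeners]
         .sort_values("avgUserPlays", ascending=False))
print(loyal)
```

Without the cutoff, the single-listener artist tops the list; with it, only artists with a broader audience are compared.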

In [117]:
ap3[ap3['name']=='Viking Quest']

Unnamed: 0,id,name,userID,artistID,userArtistPlays,totalUniqueUsers,totalArtistPlays,avgUserPlays,totalUniqueArtists,totalUserPlays
80046,8388,Viking Quest,596,8388,35323,1,35323,35323.0,50,101469


## Utilizing K-Nearest Neighbors for Item Similarity in Scikit Learn

For our first, basic collaborative recommender, we build a sparse matrix of play counts with artists as rows and users as columns. We then fit a K-nearest neighbors model that uses cosine distance to compare artists by their listening patterns across users. Artists whose play-count vectors point in similar directions have a small cosine distance and are treated as similar. For instance, if the Beatles and the Rolling Stones are played by many of the same users, the distance between them should be small, whereas a less related artist such as Snoop Dogg should sit farther away.
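To make the distance measure concrete: cosine distance is one minus the cosine of the angle between two vectors, so it depends on which users listen to an artist and in what proportion, not on the overall volume of plays. The sketch below uses made-up play counts for four hypothetical users; `cosine_distance` is a helper written for illustration, not a function from the notebook.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two non-zero play-count vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Each vector holds one artist's plays across four hypothetical users
artist_a = [10, 0, 5, 0]
artist_b = [20, 0, 10, 0]   # same listeners as artist_a, twice the plays
artist_c = [0, 7, 0, 3]     # no listeners in common with artist_a

print(cosine_distance(artist_a, artist_b))  # close to 0: same direction
print(cosine_distance(artist_a, artist_c))  # 1.0: no shared listeners
```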


### Creating the Sparse Matrix

In this step, we arrange the data within a sparse matrix format, where artist names correspond to rows and users to columns. This matrix comprehensively encapsulates the interactions between artists and users, with each cell containing the count of plays.

In [118]:
# fit data into a sparse matrix of artist name (row) vs user (column)
# in terms of number of plays by artist/user
def data_to_sparse(data,index,columns,values):
    pivot = data.pivot(index=index,columns=columns,values=values).fillna(0)
    sparse = csr_matrix(pivot.values)
    print(sparse.shape)
    return pivot,sparse

# Use K Nearest Neighbors to determine cosine distance amongst artists
def fit_knn(sparse):
    knn = NearestNeighbors(metric='cosine')
    knn.fit(sparse)
    print(knn)
    return knn

There are two functions. The first function converts data into a sparse matrix, and the second function fits K Nearest Neighbors to determine cosine distance among artists. Here are the functions and their explanations:

1) data_to_sparse(data, index, columns, values): This function transforms data into a sparse matrix format. It takes the following arguments:

        data: The data to be transformed.
        index: The column to be used as the index of the resulting pivot table.
        columns: The column to be used as the columns of the resulting pivot table.
        values: The column containing the values for the resulting pivot table.

The function performs a pivot operation on the data using the provided index, columns, and values. It fills missing values with zeros and converts the resulting pivot table into a sparse matrix using the csr_matrix function from scipy.sparse. The shape of the sparse matrix is printed, and both the pivot table and the sparse matrix are returned.

2) fit_knn(sparse): This function fits K Nearest Neighbors to determine cosine distance among artists. It takes the following argument:

        sparse: The sparse matrix representing artist-play relationships.

The function creates a NearestNeighbors model with cosine distance metric and fits it with the provided sparse matrix. The NearestNeighbors model is printed, and the fitted model is returned.
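The pivot step in data_to_sparse first materializes a dense artist-by-user table. As an alternative sketch, and only under the assumption that each (name, userID) pair occurs once, the sparse matrix can be built directly from the long-format rows. The toy frame and variable names below are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format plays table (made-up values)
toy = pd.DataFrame({"name": ["A", "A", "B", "C"],
                    "userID": [10, 11, 10, 12],
                    "userArtistPlays": [5, 7, 3, 9]})

# Map artist names and user IDs to consecutive row/column positions
rows = toy["name"].astype("category")
cols = toy["userID"].astype("category")

sparse = csr_matrix(
    (toy["userArtistPlays"].to_numpy(dtype=float),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Same result as the dense pivot used by data_to_sparse
pivot = toy.pivot(index="name", columns="userID",
                  values="userArtistPlays").fillna(0)
print((sparse.toarray() == pivot.to_numpy()).all())
```

The notebook still needs the pivot's index to translate row positions back into artist names; in this sketch `rows.cat.categories` would play that role.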

In [119]:
pivot_ap3,sparse_ap3 = data_to_sparse(ap3,index='name',columns='userID',values='userArtistPlays')


(17632, 1892)


We used the data_to_sparse function to transform the ap3 dataframe into a pivot table and then into a sparse matrix. The pivot table and sparse matrix are stored in pivot_ap3 and sparse_ap3, respectively.

In [120]:
knn = fit_knn(sparse_ap3)

NearestNeighbors(metric='cosine')


This code call creates a K Nearest Neighbors model with a cosine distance metric and fits it using the provided sparse matrix sparse_ap3. The fitted model is then stored in the variable knn for future use.

In [121]:
pivot_ap3.head()

userID,2,3,4,5,6,7,8,9,10,11,...,2090,2091,2092,2093,2094,2095,2096,2097,2099,2100
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!!!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
!DISTAIN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
!deladap,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#####,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#2 Orchestra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Here, we investigate artist similarities by referencing their index numbers within the sparse matrix, indicated alongside their cosine distances enclosed in parentheses.

In [122]:
def idx_recommend(data,idx,model,k):
    distances, indices = (model.kneighbors(data.
                                     iloc[idx,:].
                                     values.reshape(1,-1),
                                     n_neighbors=k+1))

    for i in range(0,len(distances.flatten())):
        if i == 0:
            print(('Recommendations for {}:\n'.
                  format(data.index[idx])))
        else:
            print(('{}: {} ({:.3f})'.
                  format(i,
                         data.index[indices.flatten()[i]],
                         distances.flatten()[i])))
    return ''

This function queries the K Nearest Neighbors model for the k+1 nearest neighbors of the artist at the given index. The closest neighbor (position 0) is normally the queried artist itself at a distance of 0, so it is not printed as a recommendation; that position is used to print the header instead. The remaining k neighbors are printed in order of increasing cosine distance. Note that when another artist has an identical play pattern (distance 0), the queried artist is not guaranteed to occupy position 0, so the skipped entry may be that other artist.
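The shape of what `kneighbors` returns can be seen on a small dense matrix. This is a minimal sketch with made-up play counts; `algorithm="brute"` is set explicitly here as an assumption for clarity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy artist-by-user matrix (made-up play counts); rows 0 and 1 point the same way
X = np.array([[10.0, 0.0, 5.0],
              [20.0, 0.0, 10.0],
              [0.0, 7.0, 3.0],
              [1.0, 1.0, 1.0]])

knn_toy = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)
distances, indices = knn_toy.kneighbors(X[[2]], n_neighbors=3)

# Both arrays have shape (1, 3); position 0 holds the closest row,
# which here is the query row itself
print(indices[0], distances[0].round(3))
```

Rows 0 and 1 are equally distant from the query, so which of them fills the last position depends on tie handling. This is the same effect that can place an identical-pattern artist at distance 0.000 in the real output.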

In [123]:
# we query a random artist
query_index = seed.choice(pivot_ap3.shape[0])
idx_recommend(pivot_ap3,query_index,knn,6)

Recommendations for VINILOVERSUS:

1: Don Davis (0.000)
2: Taking Back Sunday (0.792)
3: Ana Johnsson (0.856)
4: Lacuna Coil (0.864)
5: Breaking Benjamin (0.867)
6: Trapt (0.868)


''

The output lists the artists most similar to the randomly selected artist according to the K Nearest Neighbors model; the queried artist itself appears only in the header.

In [124]:
# lookup index number for select artists
query_index = pivot_ap3.index.get_loc('Britney Spears')
idx_recommend(pivot_ap3,query_index,knn,6)

Recommendations for Britney Spears:

1: Lindsay Lohan (0.504)
2: RuPaul (0.567)
3: Sarah Michelle Gellar (0.568)
4: mclusky (0.568)
5: Анастасия Приходько (0.568)
6: †‡† (0.570)


''

The output lists the artists most similar to "Britney Spears" according to the K Nearest Neighbors model; the queried artist itself appears only in the header.

In [125]:
query_index = pivot_ap3.index.get_loc('Oasis')
idx_recommend(pivot_ap3,query_index,knn,6)

Recommendations for Oasis:

1: Fuel (0.349)
2: The Perishers (0.355)
3: Vertical Horizon (0.386)
4: The Wreckers (0.406)
5: The Vines (0.471)
6: Calogero (0.534)


''

The output lists the artists most similar to "Oasis" according to the K Nearest Neighbors model; the queried artist itself appears only in the header.


### Artist Query (Using Fuzzy Matching)

In this section, we incorporate a direct artist lookup feature, employing fuzzy matching to accommodate partial name matches.

These functions are designed to provide recommendations for artists, even if the query artist's name is not spelled exactly as in the dataset, by using fuzzy matching to find partial name matches.
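To illustrate the idea of scoring partial matches without depending on fuzzywuzzy, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. This is an assumption made for illustration: its scores need not equal those of `fuzz.ratio`, and `simple_ratio` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity score in 0..100, in the spirit of fuzz.ratio (stdlib stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# A few artist names that appear in the notebook outputs
names = ["Red Hot Chili Peppers", "The Chills", "Nirvana"]
query = "red hot chillis"

# Score every candidate and keep the best match
scored = sorted(((simple_ratio(name, query), name) for name in names), reverse=True)
print(scored[0])
```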

In [126]:
# this will help us to locate partial matches of our query
def fuzzy_match(query_subject,data):
    ratio_tuples = []
    
    # each artist name is the subject in the data index
    # each artist in the data is compared against our query subject to determine percentage match
    # enumerate yields each artist's positional index directly, avoiding a list search per artist
    for current_query_idx, subject in enumerate(data.index):
        ratio = fuzz.ratio(subject.lower(),query_subject.lower())
        ratio_tuples.append((subject, ratio,current_query_idx))
        
    # our findings are then sorted based on match rating, and top two are kept
    ratio_tuples = sorted(ratio_tuples, key=lambda tup: tup[1],reverse=True)[:2]
    
    print('Top matches: {}\n'.format([(x[0],x[1]) for x in ratio_tuples]))  
    
    match = ratio_tuples[0][0]
    
    return match, ratio_tuples

def artist_recommend(query_artist, data, model,k):
    
    # determine artist matches using fuzzy matching
    
    match, ratio_tuples = fuzzy_match(query_artist,data)
    
    # look up artist by query index
    idx_recommend(data, ratio_tuples[0][2],model,k)   
    return ''

Judged subjectively against general music knowledge, the recommendations below are broadly in the expected range. We aim to make them more precise, however. In the following sections we reduce noise by keeping only popular artists and engaged users, and we consider grouping play counts into buckets or converting them to binary values.

In [127]:
artist_recommend('britney spears',pivot_ap3,knn,8)

Top matches: [('Britney Spears', 100), ('Britney Spears⊼', 97)]

Recommendations for Britney Spears:

1: Lindsay Lohan (0.504)
2: RuPaul (0.567)
3: Sarah Michelle Gellar (0.568)
4: mclusky (0.568)
5: Анастасия Приходько (0.568)
6: †‡† (0.570)
7: Nadia Oh (0.570)
8: Rachel Stevens (0.571)


''

The function will use fuzzy matching to find partial name matches for "britney spears," determine the best match, and then provide recommendations based on the K Nearest Neighbors model for the chosen artist. The recommendations will include similar artists along with their cosine distances.

In [128]:
artist_recommend('nirvana',pivot_ap3,knn,10)

Top matches: [('Nirvana', 100), ('Nina', 73)]

Recommendations for Nirvana:

1: Nullset (0.130)
2: SoundGarden | www.CdsCompletos.net (0.130)
3: Humberto Gessinger Trio (0.130)
4: Green River (0.130)
5: Infectious Grooves (0.130)
6: 4 Non Blondes (0.135)
7: Puddle of Mudd (0.144)
8: Meat Puppets (0.152)
9: Institute (0.302)
10: Living Colour (0.309)


''

The function will use fuzzy matching to find partial name matches for "nirvana," determine the best match, and then provide recommendations based on the K Nearest Neighbors model for the chosen artist. The recommendations will include similar artists along with their cosine distances.

In [129]:
artist_recommend('red hot chillis',pivot_ap3,knn,10)

Top matches: [('Red Hot Chili Peppers', 78), ('The Chills', 64)]

Recommendations for Red Hot Chili Peppers:

1: The Offspring (0.568)
2: Kreator (0.630)
3: Bloodhound Gang (0.660)
4: 5'nizza (0.689)
5: Steppenwolf (0.691)
6: Beatallica (0.692)
7: Ногу Свело! (0.692)
8: Mercyful Fate (0.692)
9: John Frusciante (0.692)
10: Ленинград (0.692)


''

The function will use fuzzy matching to find partial name matches for "red hot chillis," determine the best match ("Red Hot Chili Peppers"), and then provide recommendations based on the K Nearest Neighbors model for the chosen artist. The recommendations will include similar artists along with their cosine distances.

### Feature Scaling: Implementing Thresholds

In this phase, we incorporate threshold criteria to retain exclusively popular artists and engaged users. This approach is intended to diminish extraneous fluctuations within our data, ultimately enhancing the quality of our recommendations.

Our filtration criteria will be as follows:

For Users:

1) Minimum Plays per User: A stipulated threshold for the aggregate play count per user.
2) Minimum Unique Artist Plays: Users need to have played a minimum number of distinct artists to be considered.

For Artists:

1) Minimum Artist Plays: A designated threshold indicating the minimum number of times an artist must be played.
2) Minimum Listeners: A predefined threshold for the minimum count of unique users per artist.

In [130]:
minPlaysPerUser = 1000 # minimum aggregate play count per user
minUniqueArtistPlays = 10 # minimum different artists that need values per user to be counted
minArtistPlays = 10000 # minimum times an artist must be played
minListeners = 10 # minimum unique listeners an artist needs to be included

def apply_threshold(data,
                    minPlaysPerUser,
                    minUniqueArtistPlays,
                    minArtistPlays,
                    minListeners):
    
    filtered = (data[(data['totalUserPlays']>=minPlaysPerUser) & 
               (data['totalUniqueArtists']>=minUniqueArtistPlays) & 
               (data['totalArtistPlays']>=minArtistPlays) & 
               (data['totalUniqueUsers']>=minListeners)])

    # confirm our min thresholds have been applied
    print('MINIMUM VALUES')
    print('totalUserPlays from {} to {}'.format(min(data.totalUserPlays),min(filtered.totalUserPlays)))
    print('totalUniqueArtists from {} to {}'.format(min(data.totalUniqueArtists),min(filtered.totalUniqueArtists)))
    print('totalArtistPlays from {} to {}'.format(min(data.totalArtistPlays),min(filtered.totalArtistPlays)))
    print('totalUniqueUsers from {} to {}'.format(min(data.totalUniqueUsers),min(filtered.totalUniqueUsers)))

    print('\nFILTER IMPACT')
    print("FILTERED Users: {} Artists: {}".format(len(filtered['userID'].unique()),
                                                           len(filtered['name'].unique())))
    print("ORIGINAL Users: {} Artists: {}".format(len(data['userID'].unique()),
                                                           len(data['name'].unique())))
    print("FILTERED % ORIGINAL Users: {:.1f}% Artists: {:.1f}%".format(100*(len(filtered['userID'].unique())/len(data['userID'].unique())),
                                                           100*(len(filtered['name'].unique())/len(data['name'].unique()))))
    return filtered

ap4 = apply_threshold(ap3,minPlaysPerUser,minUniqueArtistPlays,minArtistPlays,minListeners)

ap4[['totalUserPlays','totalUniqueArtists','totalArtistPlays','totalUniqueUsers']].describe()

MINIMUM VALUES
totalUserPlays from 3 to 1001
totalUniqueArtists from 1 to 10
totalArtistPlays from 1 to 10007
totalUniqueUsers from 1 to 10

FILTER IMPACT
FILTERED Users: 1790 Artists: 871
ORIGINAL Users: 1892 Artists: 17632
FILTERED % ORIGINAL Users: 94.6% Artists: 4.9%


Unnamed: 0,totalUserPlays,totalUniqueArtists,totalArtistPlays,totalUniqueUsers
count,52298.0,52298.0,52298.0,52298.0
mean,41467.4,49.87,192603.24,144.96
std,50438.77,1.7,336597.43,135.1
min,1001.0,10.0,10007.0,10.0
25%,13094.0,50.0,27175.0,45.0
50%,25511.0,50.0,64596.0,89.0
75%,48311.0,50.0,188634.0,208.0
max,480039.0,50.0,2393140.0,611.0


In [131]:
pivot_ap4,sparse_ap4 = data_to_sparse(ap4,index='name',columns='userID',values='userArtistPlays')
knn = fit_knn(sparse_ap4)

(871, 1790)
NearestNeighbors(metric='cosine')
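One way to see what the thresholds change is to compare how densely each matrix is filled, using only the shapes and row counts already reported in the outputs above. The helper `density` is written for illustration.

```python
# Shapes and user-artist pair counts reported by the notebook outputs above
full_pairs, full_shape = 92834, (17632, 1892)
filtered_pairs, filtered_shape = 52298, (871, 1790)

def density(nonzero, shape):
    """Fraction of cells in an artist-by-user matrix that hold a play count."""
    return nonzero / (shape[0] * shape[1])

print("{:.2%}".format(density(full_pairs, full_shape)))
print("{:.2%}".format(density(filtered_pairs, filtered_shape)))
```

The filtered matrix is roughly twelve times denser (about 3.4% of cells filled versus about 0.3%), so each artist vector has more overlapping users to compare against.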


After applying the thresholds, our results improve, though there is still room for refinement. We can also standardize the artist play data by grouping play counts into buckets (explored below) or by binarizing them.

In [132]:
artist_recommend('red hot chillis',pivot_ap4,knn,10)

Top matches: [('Red Hot Chili Peppers', 78), ('Hot Chip', 61)]

Recommendations for Red Hot Chili Peppers:

1: The Offspring (0.568)
2: Kreator (0.630)
3: John Frusciante (0.692)
4: Ennio Morricone (0.694)
5: In Extremo (0.696)
6: Riverside (0.697)
7: Katie Melua (0.701)
8: Ben Folds (0.702)
9: Bush (0.718)
10: Mylène Farmer (0.721)


''

### Feature Scaling: Transforming Plays into Buckets

In this stage, we will proceed to transform our original play counts into a categorical scale of five buckets. A rating of five will indicate a dedicated fan, while a rating of one suggests that the user has engaged with the artist but displays a preference for other options. A rating of zero continues to signify that the user has not interacted with the artist at all.

In [133]:
ap4['userArtistPlays'].describe()

count    52298.00
mean       992.30
std       4885.03
min          1.00
25%        139.00
50%        323.00
75%        771.00
max     352698.00
Name: userArtistPlays, dtype: float64

In [134]:
# convert our play counts into ratings buckets
# rank 1 covers play counts up to the first threshold (the 0.5% quantile)
# ranks 2-5 cover successively higher ranges; 0 (no plays) is filled in later by the pivot

b = ap4['userArtistPlays']
buckets = np.linspace(b.quantile(.005),b.quantile(.995),5)
print("Bucket thresholds: {}".format([int(b) for b in buckets]))
print("For instance, if value is {}, then the rank would be {}.".
      format(int(buckets[0]+1),len(buckets[:1])+1))

def bucketize(x):
    cur_bucket = 0
    for i in range(0,5):
        cur_bucket += 1
        if x <= buckets[i]:
            break
    return cur_bucket

Bucket thresholds: [6, 4621, 9237, 13853, 18469]
For instance, if value is 7, then the rank would be 2.
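The loop in `bucketize` can also be expressed as a single vectorized lookup. The sketch below uses the integer thresholds printed above rather than the exact float values, so results may differ for values that fall right at a boundary; `bucketize_vectorized` is a hypothetical helper.

```python
import numpy as np

# Thresholds as printed above (rounded to integers for illustration)
thresholds = np.array([6, 4621, 9237, 13853, 18469])

def bucketize_vectorized(plays, thresholds):
    """Rank 1..5: one plus the number of the first four thresholds below each value."""
    return np.searchsorted(thresholds[:4], plays, side="left") + 1

plays = np.array([1, 6, 7, 4621, 4627, 9275, 13860, 352698])
print(bucketize_vectorized(plays, thresholds))
```

As in the original loop, the fifth threshold never changes the outcome: any play count above the fourth threshold receives rank 5.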


In [135]:
# work on a copy so that ap4 is left unchanged and pandas does not raise a copy warning
ap5 = ap4.copy()
ap5['rank'] = ap5['userArtistPlays'].apply(bucketize)


In [136]:
ap5.groupby(['rank'])['userArtistPlays'].describe().T

rank,1,2,3,4,5
count,304.0,50493.0,872.0,254.0,375.0
mean,3.49,585.26,6230.25,11294.57,37443.69
std,1.73,737.34,1246.27,1272.97,42039.23
min,1.0,7.0,4627.0,9275.0,13860.0
25%,2.0,137.0,5186.0,10205.5,17133.5
50%,4.0,311.0,5889.5,11114.5,23830.0
75%,5.0,710.0,7138.75,12318.0,37931.5
max,6.0,4621.0,9199.0,13831.0,352698.0


In [137]:
pivot_ap5,sparse_ap5 = data_to_sparse(ap5,index='name',columns='userID',values='rank')
knn = fit_knn(sparse_ap5)

(871, 1790)
NearestNeighbors(metric='cosine')


Although the cosine distances are slightly larger (the closest match for Red Hot Chili Peppers moves from 0.568 to 0.650), bucketing appears to have improved our recommendations considerably.

In [138]:
artist_recommend('red hot chillis',pivot_ap5,knn,10)

Top matches: [('Red Hot Chili Peppers', 78), ('Hot Chip', 61)]

Recommendations for Red Hot Chili Peppers:

1: Nirvana (0.650)
2: Oasis (0.672)
3: Foo Fighters (0.685)
4: The Beatles (0.692)
5: Muse (0.706)
6: U2 (0.720)
7: Radiohead (0.721)
8: Aerosmith (0.722)
9: System of a Down (0.723)
10: Green Day (0.724)


''

The function will use fuzzy matching to find partial name matches for "red hot chillis," determine the best match, and then provide recommendations based on the K Nearest Neighbors model for the chosen artist.
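The binarization option mentioned earlier is not carried out in this notebook. A minimal sketch, assuming a play-count pivot shaped like the ones built above (artists as rows, users as columns), could look like this; the toy values and the name `binary_pivot` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy artist-by-user pivot of play counts (made-up values)
toy_pivot = pd.DataFrame([[120, 0, 3], [0, 45, 0]],
                         index=["Artist A", "Artist B"],
                         columns=[10, 11, 12])

# 1 if the user played the artist at all, otherwise 0
binary_pivot = (toy_pivot > 0).astype(np.int8)
print(binary_pivot)
```

A binarized pivot could then be converted with `csr_matrix` and passed to `fit_knn` in the same way as the bucketed data. Whether it would improve the recommendations is not something this notebook tests.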

#### Resources: 

https://www.last.fm/api/webauth

https://cseweb.ucsd.edu/classes/wi15/cse255-a/reports/fa15/007.pdf

https://vincentmai.com/Last-fm-Music-Recommender

https://ansegura7.github.io/RS_CF_LastFm/

https://towardsdatascience.com/music-artist-recommender-system-using-stochastic-gradient-descent-machine-learning-from-scratch-5f2f1aae972c