# Use NMF features to find similar musical artists

Given a list of users, the artists they listened to, and the number of times each user played each artist, we will use NMF (non-negative matrix factorization) to derive features that group artists with similar listeners, on the assumption that such artists are similar. Then we will choose 'Bruce Springsteen' and find the other artists who are most similar to him.

In [1]:
# Import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix

In [2]:
scrobbler_df = pd.read_csv("data/scrobbler-small-sample.csv")

In [3]:
scrobbler_df.head()

| | user_offset | artist_offset | playcount |
|---|---|---|---|
| 0 | 1 | 79 | 58 |
| 1 | 1 | 84 | 80 |
| 2 | 1 | 86 | 317 |
| 3 | 1 | 89 | 64 |
| 4 | 1 | 96 | 159 |


In [4]:
# Number of users: offsets start at 0, so it is the largest offset plus one
num_rows = int(scrobbler_df['user_offset'].max()) + 1

In [5]:
num_rows

500

In [6]:
# Read the artist names, one per line, into a list named artist_list
with open('data/artists.csv') as f:
    artist_list = f.read().splitlines()

In [7]:
scrobbler = scrobbler_df.values

In [8]:
scrobbler

array([[  1,  79,  58],
       [  1,  84,  80],
       [  1,  86, 317],
       ...,
       [  0,  52,  58],
       [  0,  54,  53],
       [  0,   1, 128]], dtype=int64)

In [9]:
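# Empty user-by-artist table: one row per user, one column per artist
df = pd.DataFrame(columns=artist_list, index=np.arange(num_rows))

In [10]:
# Fill in the play count for each (user offset, artist offset) pair
for row, col, val in scrobbler:
    df.iloc[row, col] = val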

In [11]:
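# Artists a user never played are NaN; replace them with 0 and store as integers
df = df.fillna(0).astype(np.int64)

In [12]:
# Transpose so rows are artists and columns are users, and store sparsely
artists = csr_matrix(df.values.T)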

In [13]:
artists

<111x500 sparse matrix of type '<class 'numpy.int64'>'
	with 2894 stored elements in Compressed Sparse Row format>
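
The dense DataFrame above is only an intermediate step, so the same artist-by-user matrix could also be built directly from the (user, artist, play count) triples. The following is a minimal sketch of that alternative, not the method used in this notebook; the toy `triples` array and the names `toy_artists`, `n_users` and `n_artists` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy triples in the same (user_offset, artist_offset, playcount) layout
triples = np.array([
    [0, 1, 128],
    [1, 0, 58],
    [1, 2, 80],
    [2, 2, 317],
])
users, artist_ids, counts = triples.T

n_users = int(users.max()) + 1
n_artists = int(artist_ids.max()) + 1

# Rows are artists and columns are users, matching the orientation of `artists`
toy_artists = csr_matrix((counts, (artist_ids, users)), shape=(n_artists, n_users))
dense = toy_artists.toarray()
print(dense)
```

One difference to keep in mind: this constructor sums duplicate (artist, user) pairs, whereas the `iloc` loop keeps only the last value it sees for a pair.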

## Recommend musical artists using NMF
Using the sparse array `artists`, whose rows correspond to artists and whose columns correspond to users, we will use the number of times each user listened to each artist to create NMF features of 'similar artists', and then recommend artists similar to 'Bruce Springsteen'.

First we will build a pipeline and transform the array into normalized NMF features.

In [14]:
# Import dependencies
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

The first step in the pipeline, `MaxAbsScaler`, divides each user's column by its largest play count, so that all users have the same influence on the model, regardless of how often they listened or how many different artists they've listened to.
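
To make the effect of the scaling concrete, here is a small sketch on made-up play counts (the `toy` matrix is hypothetical, not taken from the dataset). Each column, that is each user, is divided by its own maximum, so a heavy listener and a light listener both end up with values between 0 and 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical play counts: rows are artists, columns are users
toy = csr_matrix(np.array([
    [10.0, 200.0],
    [ 5.0,   0.0],
    [ 0.0, 100.0],
]))

# Each column is divided by its largest absolute value; zeros stay zero
scaled = MaxAbsScaler().fit_transform(toy).toarray()
print(scaled)
```

Because the scaler only divides and never shifts the data, the matrix stays sparse and non-negative, which is what the NMF step needs.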

In [15]:
# Create a MaxAbsScaler instance: scaler
scaler = MaxAbsScaler()

In [16]:
# Create an NMF model: nmf
nmf = NMF(n_components=20)
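
As a rough illustration of what the NMF step produces, the sketch below factorizes a small made-up matrix `V` into two non-negative factors `W` and `H` whose product approximates `V`. The toy data and the `init`, `random_state` and `max_iter` values are assumptions chosen for the example, not settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative play counts: 4 artists (rows) x 3 users (columns)
V = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 4.0],
])

# init, random_state and max_iter are illustrative choices only
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)  # one row of NMF features per artist
H = model.components_       # one row per component, one column per user

print(W.shape, H.shape)
print(np.round(W @ H, 1))   # approximate reconstruction of V
```

In the pipeline, the rows of `W` play the role of the artist features: artists whose rows point in a similar direction were listened to by similar groups of users.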

In [17]:
# Create a Normalizer: normalizer
normalizer = Normalizer()

In [18]:
# Create a pipeline using the above
pipeline = make_pipeline(scaler, nmf, normalizer)

In [19]:
# Apply fit_transform to artists sparse array: norm_features
norm_features = pipeline.fit_transform(artists)



In [20]:
# Create a DataFrame of the norm_features
norm_df = pd.DataFrame(norm_features, index=artist_list)

In [21]:
# Select the row of 'Bruce Springsteen': bruce
bruce = norm_df.loc['Bruce Springsteen']

In [22]:
# Rows have unit length, so each dot product with bruce is a cosine similarity
similar_artists = norm_df.dot(bruce)

In [23]:
# Display the 10 artists with the highest cosine similarity
# (nlargest(11) because the first entry is 'Bruce Springsteen' himself)
print(f"Artists most similar to 'Bruce Springsteen' are:\n {similar_artists.nlargest(11)}")

Artists most similar to 'Bruce Springsteen' are:
 Bruce Springsteen              1.000000
Neil Young                     0.958146
Leonard Cohen                  0.915517
Van Morrison                   0.881271
Bob Dylan                      0.863172
Simon & Garfunkel              0.850389
Ryan Adams                     0.846713
Tom Waits                      0.821058
The Beach Boys                 0.815280
Phish                          0.754951
Nick Cave and the Bad Seeds    0.734114
dtype: float64
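
The ranking relies on the fact that, once every row has been scaled to unit length by `Normalizer`, a plain dot product between two rows equals their cosine similarity. A minimal check of that identity on made-up feature vectors (the `features` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical feature rows; the first row is the reference artist
features = np.array([
    [3.0, 4.0],
    [6.0, 8.0],
    [4.0, 3.0],
])

# Normalizer scales each row to unit L2 norm by default
unit = Normalizer().fit_transform(features)
sims = unit @ unit[0]

# Cosine similarity computed from its definition, for comparison
norms = np.linalg.norm(features, axis=1)
manual = (features @ features[0]) / (norms * norms[0])

print(sims)
print(np.allclose(sims, manual))
```

The second row points in the same direction as the first and so scores 1, just as 'Bruce Springsteen' scores 1 against himself in the output above.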
