# Recommend musical artists
<p> <img src="files/img/mic.jpg" style="width:500px">

## 1. Introduction
* Recommender systems are useful in many situations, such as recommending articles on topics similar to what the reader is currently reading. An e-commerce website or a retailer like Target can recommend similar items to customers based on what they have browsed or already bought. Recommender systems also benefit targeted advertising: platforms such as Facebook, YouTube, or Instagram can deliver targeted ads based on customers' preferences.
* The goal of this project is to build a recommender system that recommends musical artists to users based on their preferences.
* To achieve this goal, we build an NMF (non-negative matrix factorization) model and compute cosine similarities. The steps are as follows:
    1. Preprocess the dataset, including pivoting the table and filling missing values
    2. Build a pipeline that applies scaling, an NMF model, and normalization
    3. Fit and transform the dataset
    4. Compute cosine similarities
* The result is a musical artist recommender that suggests artists similar to one the user already likes

In [1]:
import pandas as pd
import numpy as np

from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import Pipeline

## 2. Musical Artists Dataset

### Load data
In the original dataset, the columns are:
* Column user_offset is the user ID
* Column artist_offset is the artist ID
* Column playcount is the number of times the user listened to the artist

In [2]:
artists_data = pd.read_csv('datasets/Musical artists/scrobbler-small-sample.csv')
artists_data.head()

,user_offset,artist_offset,playcount
0,1,79,58
1,1,84,80
2,1,86,317
3,1,89,64
4,1,96,159


#### Artists name
* The second file contains the names of all the artists we are working with

In [3]:
artists_name_df= pd.read_csv('datasets/Musical artists/artists.csv', header=None)
artists_name_df.head()

,0
0,Massive Attack
1,Sublime
2,Beastie Boys
3,Neil Young
4,Dead Kennedys


#### Convert the artist names to an array

In [4]:
# Convert the artist names to a NumPy array
artists_name = artists_name_df[0].values
# Print out the first 5 names
artists_name[:5]

array(['Massive Attack', 'Sublime', 'Beastie Boys', 'Neil Young',
       'Dead Kennedys'], dtype=object)

### Pivoting dataframe
* After pivoting, the rows correspond to artists and the columns correspond to users
* Each entry gives the number of times a user listened to an artist; pairs that never occur in the data are missing (NaN)

In [5]:
artists = artists_data.pivot(index='artist_offset',
                             columns = 'user_offset',
                             values = 'playcount')
artists.head()

user_offset,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
artist_offset,,,,,,,,,,,,,,,,,,,,,
0,,,105.0,,,,,,,,...,,,,,,,,,,
1,128.0,211.0,,,,,,,,,...,,,,270.0,,105.0,97.0,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [6]:
artists.shape

(111, 500)

### Fill missing data with 0

In [7]:
artists.fillna(0, inplace=True)
artists.head()

user_offset,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
artist_offset,,,,,,,,,,,,,,,,,,,,,
0,0.0,0.0,105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128.0,211.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,270.0,0.0,105.0,97.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
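
The pivot and zero-fill steps above can be checked on a toy table. The following is a minimal sketch with made-up play counts; only the column names come from the dataset, and `toy` and `toy_wide` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy long-format play counts (made-up values, same column names as the dataset)
toy = pd.DataFrame({
    "user_offset":   [0, 0, 1, 2],
    "artist_offset": [0, 1, 1, 2],
    "playcount":     [5, 3, 7, 2],
})

# Rows become artists, columns become users; unseen (artist, user) pairs are NaN
toy_wide = toy.pivot(index="artist_offset",
                     columns="user_offset",
                     values="playcount")

# Treat a missing pair as zero plays
toy_wide = toy_wide.fillna(0)
print(toy_wide)
```

Assigning the result of `fillna` back to the variable is an alternative to `inplace=True` that produces the same table.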


## 3. Build pipeline

### MaxAbsScaler
* Scale each feature by its maximum absolute value
* This estimator scales each feature (here, each user column) individually so that the maximal absolute value of each feature in the training set is 1.0. It does not shift or center the data, and thus does not destroy any sparsity
* This scaler can also be applied to sparse CSR or CSC matrices

#### Create MaxAbsScaler
*  **The first step in the pipeline, MaxAbsScaler, transforms the data so that all users have the same influence on the model, regardless of how many different artists they've listened to**
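
To make the column-wise scaling concrete, here is a small sketch on a made-up matrix (the values and the names `X`, `X_scaled`, and `X_sparse_scaled` are assumptions for illustration). Rows stand for artists and columns for users, as in the pivoted table.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

# Toy matrix: rows are artists, columns are users (made-up play counts)
X = np.array([[1.0, 10.0],
              [2.0,  0.0],
              [4.0,  5.0]])

# Each column is divided by its own maximum absolute value (4 and 10 here)
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled)

# The same transformation on a sparse CSR matrix keeps the result sparse
X_sparse_scaled = MaxAbsScaler().fit_transform(csr_matrix(X))
print(issparse(X_sparse_scaled))
```

After scaling, a heavy listener and a light listener both have a maximum entry of 1.0, and the zeros stay zero.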

In [8]:
from sklearn.preprocessing import Normalizer, MaxAbsScaler

In [9]:
# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()

### Create NMF model

In [10]:
# Create an NMF model: nmf
nmf = NMF(n_components=20)
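
As a rough illustration of what NMF produces, the sketch below factorizes a small random non-negative matrix. The matrix, `n_components=2`, `init="nndsvda"`, `max_iter=1000`, and `random_state=0` are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small non-negative matrix standing in for the artist-by-user table
rng = np.random.default_rng(0)
X = rng.random((6, 5))

# Illustrative settings only; the notebook itself uses n_components=20
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(X)   # one row of component weights per sample (artist)
H = model.components_        # one row per component, one column per feature (user)

print(W.shape, H.shape)
print(np.linalg.norm(X - W @ H))  # how far W @ H is from X
```

Both factors are non-negative, and `W` is the low-dimensional artist representation that the later steps work with. Passing `random_state` makes the factorization reproducible between runs.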

### Create normalizer

In [11]:
# Create a Normalizer: normalizer
normalizer = Normalizer()
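
A quick sketch of what `Normalizer` does with its default L2 norm: each row is rescaled to unit length. The matrix and the names `features` and `unit_rows` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up feature rows
features = np.array([[3.0, 4.0],
                     [0.0, 2.0]])

# Default norm="l2": each row is divided by its Euclidean length
unit_rows = Normalizer().fit_transform(features)
print(unit_rows)
print(np.linalg.norm(unit_rows, axis=1))
```

Unlike `MaxAbsScaler`, which works column by column, `Normalizer` works row by row, so here it acts on each artist's NMF features.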

### Create pipeline

In [12]:
steps = [('scaler', scaler), 
         ('nmf', nmf), 
         ('normalizer', normalizer)]

pipeline = Pipeline(steps)

## 4. Fit and transform

### Fit and Transform

In [13]:
normed_features = pipeline.fit_transform(artists)

### Create dataframe of normed_features

In [14]:
df = pd.DataFrame(normed_features, index=artists_name)
df.head()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Massive Attack,0.0,0.0,0.0,0.0,0.005673,0.0,0.0,0.055281,0.0,0.0,0.005338,0.0,0.001389,0.998066,0.0,0.0,0.0,0.0,0.027314,0.0
Sublime,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005949,0.0,0.0,0.0,0.0,0.999982,0.0
Beastie Boys,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Neil Young,0.272682,0.0,0.0,0.058885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.957824,0.066779,0.0,0.017032,0.0
Dead Kennedys,0.0,0.013712,0.0,0.59091,0.0,0.0,0.0,0.0,0.0,0.73631,0.0,0.0,0.139216,0.0,0.0,0.078726,0.0,0.287934,0.0,0.0


## 5. Recommend artists

### Recommend artists similar to 'Dr. Dre'

#### Select row of 'Dr. Dre'

In [15]:
# Select row of 'Dr. Dre': artist
artist = df.loc['Dr. Dre']

#### Compute cosine similarities

In [16]:
# Rows have unit L2 norm, so the dot product equals the cosine similarity: similarities
similarities = df.dot(artist)
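
The plain dot product works as a cosine similarity here because the pipeline's final `Normalizer` step gives every row unit length. A minimal NumPy check on two made-up vectors (all names below are hypothetical):

```python
import numpy as np

# Two made-up feature vectors
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

# Cosine similarity from its definition
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(cosine, dot_of_units)
```

The two printed values agree up to floating-point rounding.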

#### Display the highest similarities

In [17]:
# Display those with highest cosine similarity
print(similarities.nlargest())

Dr. Dre     1.000000
50 Cent     0.927136
Ludacris    0.900073
Eminem      0.888960
2Pac        0.876205
dtype: float64
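
The selected artist always tops its own list with a similarity of 1.0. One possible way to package the lookup and drop that self-match is sketched below; `recommend_similar` and `toy_features` are hypothetical names, and the sketch assumes unit-norm rows and unique artist names in the index.

```python
import pandas as pd

def recommend_similar(features, name, n=5):
    """Return the n artists most similar to `name`, excluding `name` itself.

    Assumes each row of `features` has unit L2 norm and index labels are unique.
    """
    similarities = features.dot(features.loc[name])
    return similarities.drop(name).nlargest(n)

# Toy unit-norm feature rows with made-up artist labels
toy_features = pd.DataFrame(
    [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]],
    index=["A", "B", "C"],
)
result = recommend_similar(toy_features, "A", n=2)
print(result)
```

With the notebook's data, the call would look like `recommend_similar(df, 'Dr. Dre')`.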


### Conclusion: The top matches for 'Dr. Dre' (50 Cent, Ludacris, Eminem, and 2Pac) are all hip-hop artists, which suggests that the recommender system successfully finds artists whose style is similar to the one we picked.