## Sample: Recommender System using Last.fm data
The data set contains information about users: gender, age, and which artists they have listened to on Last.fm. Only songs in Germany are analyzed, and the data has been transformed into an item frequency matrix, in which each row represents a user and each column represents an artist.
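
As a rough illustration of what an item frequency matrix looks like, the sketch below builds one from a made-up listening log. The `plays` table, its column names, and the user ids are assumptions for the example and not the real Last.fm schema. It also assumes the matrix is binary, since the sample output in this notebook only shows 0 and 1.

```python
import pandas as pd

# Hypothetical listening log: one row per (user, artist) pair.
# The layout and values are illustrative, not the real Last.fm data.
plays = pd.DataFrame({
    "user": [1, 1, 33, 42, 42],
    "artist": ["abba", "u2", "tool", "abba", "tool"],
})

# One row per user, one column per artist.
# crosstab counts occurrences; clip caps the counts at 1 (listened or not).
freq = pd.crosstab(plays["user"], plays["artist"]).clip(upper=1)
print(freq)
```

Each row of `freq` is a user and each column an artist, which is the shape the rest of the notebook expects after the `user` column is dropped.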

Code updated by Jess: http://www.salemmarafi.com/code/collaborative-filtering-with-python/

In [1]:
# --- Import Libraries --- #
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

In [2]:
# --- Read Data --- #
data = pd.read_csv('lastfm_data.csv')

In [3]:
# --- Quick View -- #
data.head(6).iloc[:,2:8]

Unnamed: 0,abba,ac/dc,adam green,aerosmith,afi,air
0,0,0,0,0,0,0
1,0,0,1,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
5,0,0,0,0,0,0


### Item-based collaborative filtering

In [4]:
# --- Start Item Based Recommendations --- #
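# Drop the "user" column so that only artist columns remain
# (the positional axis argument, data.drop('user', 1), is rejected by recent pandas versions)
item_matrix = data.drop(columns='user')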

Use NumPy to scale each artist column to unit length, then compute all pairwise cosine similarities with a single matrix product.

In [5]:
# Vectorized implementation of cosine similarities

# Normalize dataframe
norm_data = item_matrix / np.sqrt(np.square(item_matrix).sum(axis=0))

# Compute cosine similarities
cos_data = norm_data.transpose().dot(norm_data)

# View results
cos_data.head(5)

Unnamed: 0,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
a perfect circle,1.0,0.0,0.017917,0.051554,0.062776,0.0,0.051755,0.060718,0.0,0.0,...,0.047338,0.0812,0.394709,0.125553,0.030359,0.111154,0.024398,0.06506,0.052164,0.0
abba,0.0,1.0,0.052279,0.025071,0.061056,0.0,0.016779,0.029527,0.0,0.0,...,0.0,0.0,0.0,0.061056,0.029527,0.0,0.094916,0.0,0.025367,0.0
ac/dc,0.017917,0.052279,1.0,0.113154,0.177153,0.067894,0.07573,0.038076,0.0,0.088333,...,0.044529,0.067894,0.058241,0.039367,0.0,0.087131,0.122398,0.0204,0.130849,0.0
adam green,0.051554,0.025071,0.113154,1.0,0.056637,0.0,0.093386,0.0,0.0,0.025416,...,0.0,0.146516,0.083789,0.056637,0.082169,0.025071,0.022011,0.0,0.023531,0.088045
aerosmith,0.062776,0.061056,0.177153,0.056637,1.0,0.0,0.113715,0.100056,0.0,0.061898,...,0.052005,0.029735,0.025507,0.068966,0.033352,0.0,0.214423,0.0,0.057307,0.0
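
To see why dividing each column by its length and then taking dot products gives cosine similarity, the sketch below repeats the same two steps on a small made-up matrix and compares one entry against the textbook formula `a·b / (|a||b|)`. The `toy` values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy user-by-artist matrix (made-up values)
toy = pd.DataFrame(
    {"abba": [1, 0, 1, 1], "u2": [1, 1, 0, 1], "tool": [0, 1, 0, 0]}
)

# Scale each column to unit length, then take all pairwise dot products
toy_norm = toy / np.sqrt(np.square(toy).sum(axis=0))
toy_cos = toy_norm.transpose().dot(toy_norm)

# Cosine similarity of one pair computed directly from the definition
a = toy["abba"].to_numpy()
b = toy["u2"].to_numpy()
direct = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(toy_cos.loc["abba", "u2"], direct)
```

Both numbers agree up to floating-point rounding, and the diagonal of `toy_cos` is 1 because every column is perfectly similar to itself.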


Next, we identify each artist's nearest neighbours by looping through the cosine similarity matrix, sorting each column in descending order, and keeping the names of the top 10 artists. The first entry of each sorted column is normally the artist itself, since its similarity with itself is 1.

In [6]:
# Create a placeholder for the closest neighbours of each item
data_neighbours = pd.DataFrame(index=cos_data.index,columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(cos_data.columns)):
    data_neighbours.iloc[i,:10] = cos_data.iloc[0:,i].sort_values(ascending=False)[:10].index

In [7]:
# Display the 3 most similar artists (according to cosine similarity)
# Column 1 holds the artist itself, so columns 2-4 are shown
data_neighbours.head(6).iloc[:6,1:4]

Unnamed: 0,2,3,4
a perfect circle,tool,dredg,deftones
abba,madonna,robbie williams,elvis presley
ac/dc,red hot chili peppers,metallica,iron maiden
adam green,the libertines,the strokes,babyshambles
aerosmith,u2,led zeppelin,metallica
afi,funeral for a friend,rise against,fall out boy
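
The neighbour lookup can also be written so that the artist's own entry is dropped before ranking, instead of being stored in column 1 and skipped later. The sketch below is one possible variant on a made-up similarity matrix. The names `sim`, `k`, and `neighbours` are hypothetical, and it is not the notebook's own method.

```python
import pandas as pd

# Made-up symmetric similarity matrix for four artists
artists = ["abba", "madonna", "tool", "dredg"]
sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.0],
     [0.6, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.7],
     [0.0, 0.1, 0.7, 1.0]],
    index=artists, columns=artists,
)

k = 2  # neighbours to keep per artist, excluding the artist itself
neighbours = pd.DataFrame(index=sim.index, columns=range(1, k + 1))
for artist in sim.index:
    # Drop the artist's own entry first so the self-match does not use a slot
    ranked = sim[artist].drop(artist).sort_values(ascending=False)
    neighbours.loc[artist] = list(ranked.index[:k])

print(neighbours)
```

With this layout, column 1 is already the closest other artist, so later code would not need an offset such as `[1:10]`.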


In [8]:
# --- End Item Based Recommendations --- #

### User-based collaborative filtering

Basic Logic:

- Create an Item Based similarity matrix
- Check which items the user has consumed
- For each item the user has consumed, get the top X neighbours
- Get the consumption record of the user for each neighbour.
- Calculate a score as the similarity-weighted average of those consumption records
- Recommend the items with the highest score
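
The scoring step in the list above can be worked through by hand for a single user and a single candidate artist. The numbers below are made up, and the names `neighbour_sims` and `user_history` are hypothetical; the formula mirrors the `getScore` helper defined in the next cell.

```python
import pandas as pd

# Made-up values for one user and one candidate artist.
# neighbour_sims: similarity of the candidate to each of its neighbours
# user_history:   1 if the user listened to that neighbour, else 0
neighbour_sims = pd.Series({"tool": 0.4, "dredg": 0.3, "deftones": 0.3})
user_history = pd.Series({"tool": 1, "dredg": 0, "deftones": 1})

# Similarity-weighted average of the user's history over the neighbours
score = (user_history * neighbour_sims).sum() / neighbour_sims.sum()
print(score)
```

The user listened to two of the three neighbours, which together carry about 70% of the total similarity, so the score comes out at roughly 0.7.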

In [9]:
# --- Start User Based Recommendations --- #
 
# Helper function to get similarity scores:
# the similarity-weighted average of a user's history over an item's neighbours
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)

In [10]:
# Create a placeholder matrix for similarity scores, and fill in the user column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.iloc[:,:1] = data.iloc[:,:1]

# quick check (view)
data_sims.head(5)

Unnamed: 0,user,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,,,,,,,,,,...,,,,,,,,,,
1,33,,,,,,,,,,...,,,,,,,,,,
2,42,,,,,,,,,,...,,,,,,,,,,
3,51,,,,,,,,,,...,,,,,,,,,,
4,62,,,,,,,,,,...,,,,,,,,,,


NOTE: `user` must come from `data.index` and `product` from `data.columns`. Taking them from `cos_data` does not work: its index holds artist names, and looking one up as a row label in `data` raises `KeyError: 'the label [a perfect circle] is not in the [index]'`. The loop is not endless, but it runs once per user-artist pair, so it can take a long time on the full data set.

In [11]:
# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(0, len(data.index)):
    # Row label of the current user in `data`
    user = data.index[i]

    for j in range(1, len(data.columns)):
        # Column j of `data` is an artist; column 0 is the user id
        product = data.columns[j]

        # Positions 1-9 are the nearest neighbours; position 0 is the artist itself
        product_top_names = data_neighbours.loc[product].iloc[1:10]
        product_top_sims = cos_data.loc[product].sort_values(ascending=False).iloc[1:10]
        user_purchases = data.loc[user, product_top_names]

        # Assign with a single indexer; chained iloc[i][j] may write to a copy
        data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)
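
The nested loop above evaluates `getScore` once for every user-artist pair. A possible alternative, sketched here on made-up data and not part of the original notebook, is to put the neighbour similarities into a weight matrix and score all pairs with one matrix product. The names `items`, `sims`, `weights`, and `k` are hypothetical, and the sketch assumes every artist has at least one neighbour with non-zero similarity, since otherwise the normalization divides by zero.

```python
import numpy as np
import pandas as pd

# Made-up user-by-artist matrix and its cosine similarities
items = pd.DataFrame(
    {"abba": [1, 0, 1, 0], "u2": [1, 1, 0, 0],
     "tool": [0, 1, 0, 1], "dredg": [0, 0, 1, 1]},
    index=[1, 33, 42, 51],
)
unit = items / np.sqrt(np.square(items).sum(axis=0))
sims = unit.T.dot(unit)

k = 2  # neighbours per artist, excluding the artist itself

# weights.loc[n, p] holds sim(p, n) if n is one of p's top-k neighbours, else 0
weights = pd.DataFrame(0.0, index=sims.index, columns=sims.columns)
for artist in sims.columns:
    top = sims[artist].drop(artist).sort_values(ascending=False).iloc[:k]
    weights.loc[top.index, artist] = top

# Normalize each column so the product below is a weighted average
weights = weights / weights.sum(axis=0)

# One matrix product scores every user against every artist
scores = items.dot(weights)
print(scores)
```

The loop over artists remains, but the much larger loop over users disappears, which is where most of the time goes when there are many more users than artists.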

In [None]:
# Get the top artists
print("getting top artists")
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[0:,0] = data_sims.iloc[:,0]

In [None]:
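# Instead of top artist scores, we want to see names
for i in range(0,len(data_sims.index)):
    # Rank only the score columns; column 0 holds the user id, not a score
    user_scores = data_sims.iloc[i,1:].astype(float)
    data_recommend.iloc[i,1:] = user_scores.sort_values(ascending=False).iloc[:6].index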
    
# Print a sample
data_recommend.iloc[:10,:4]
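
Turning a score table into a table of names can be checked on a tiny example. The sketch below uses a made-up `toy_scores` frame and `Series.nlargest` to pick the best-scoring artists per user; the names and values are illustrative assumptions, not results from the Last.fm data.

```python
import pandas as pd

# Made-up score table: one row per user, one column per artist
toy_scores = pd.DataFrame(
    {"abba": [0.1, 0.8], "u2": [0.7, 0.2], "tool": [0.4, 0.5]},
    index=[1, 33],
)

n = 2  # recommendations per user

# nlargest returns the n highest scores in a row; its index holds the artist names
top_names = {
    user: list(row.nlargest(n).index) for user, row in toy_scores.iterrows()
}

# Same layout as data_recommend: one row per user, columns 1..n
toy_recommend = pd.DataFrame.from_dict(
    top_names, orient="index", columns=range(1, n + 1)
)
print(toy_recommend)
```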