# An oversimplified example of SVD for recommendation

**Note: Before running this, have a read through. Some of the discussion refers to my particular randomly generated data, which will change as soon as you re-run the cells.**

In [1]:
import pandas as pd
import numpy as np

First, let's generate a really simple data set. Each row is a user, and each column is an item. If the value at a user:item location is 1, that user has "liked" that item on _SocialMediaster-SellingMaster<sup>TM</sup>_, my new social media platform for people who like items and things and stuff.

In [26]:
num_users = 10
num_items = 5
def generate_users(num_users, num_items):
    data = []
    for i in range(num_users):
        user = [np.random.randint(2) for _ in range(num_items)]
        data.append(user)
    return data

user_item_mat = pd.DataFrame(generate_users(num_users,num_items))
user_item_mat

Unnamed: 0,0,1,2,3,4
0,0,1,0,1,0
1,1,0,1,0,1
2,1,1,1,0,1
3,0,0,0,1,0
4,0,0,0,0,0
5,1,1,1,0,1
6,0,0,1,0,0
7,1,1,0,0,1
8,1,0,1,1,0
9,0,0,1,0,0
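
Because the discussion depends on one particular random draw, it can help to make the draw repeatable. Here is a minimal sketch of a seeded, vectorized generator. The helper name `generate_users_seeded` and the `seed` parameter are my own additions for illustration, and a seeded draw will not reproduce the exact matrix shown above.

```python
import numpy as np


def generate_users_seeded(num_users, num_items, seed=0):
    """Hypothetical variant of generate_users: same 0/1 likes, but repeatable."""
    rng = np.random.default_rng(seed)  # the seed value is an arbitrary assumption
    # One call draws the whole users x items grid of 0s and 1s.
    return rng.integers(0, 2, size=(num_users, num_items))


demo_likes = generate_users_seeded(10, 5, seed=0)
print(demo_likes)
```

Wrapping the result in `pd.DataFrame(...)` gives the same kind of table as above.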


Now, let's do some SVD. This is a small enough dataset that I really don't need to truncate... but in most large-scale recommenders you WILL need to truncate, so we're going to keep only the top 3 components of the SVD.

In [35]:
from sklearn.utils.extmath import randomized_svd

U, Sigma, VT = randomized_svd(user_item_mat, 
                              n_components=3,
                              n_iter=5,
                              random_state=None)

Great, so now what do we have? In this case **VT** is a 3 x 5 matrix where each column represents one of the items in the new vector space, and each row is one of the 3 components of that space.

In [53]:
pd.DataFrame(VT)

Unnamed: 0,0,1,2,3,4
0,0.557986,0.403996,0.532586,0.12859,0.4746117
1,-0.06227,-0.267219,0.463976,0.731194,-0.4180885
2,-0.0,0.57735,-0.57735,0.57735,6.747348e-16


If I transpose this, the rows are items, and the columns are the components of the "hidden" vector space created by the truncated SVD. Each row is now that item's vector in the hidden space.

In [54]:
pd.DataFrame(VT.T)

Unnamed: 0,0,1,2
0,0.557986,-0.06227,-0.0
1,0.403996,-0.267219,0.5773503
2,0.532586,0.463976,-0.5773503
3,0.12859,0.731194,0.5773503
4,0.474612,-0.418089,6.747348e-16


**U** is a matrix where each row is a user and each column is that user's coordinate along one component of the hidden vector space created by the SVD.

In [38]:
pd.DataFrame(U)

Unnamed: 0,0,1,2
0,0.139275,0.260024,0.6666667
1,0.409308,-0.009181,-0.3333333
2,0.514956,-0.158938,1.247782e-16
3,0.033627,0.40978,0.3333333
4,0.0,0.0,-0.0
5,0.514956,-0.158938,1.246846e-16
6,0.139275,0.260024,-0.3333333
7,0.375681,-0.418962,0.3333333
8,0.318821,0.634907,-1.056328e-15
9,0.139275,0.260024,-0.3333333


**Sigma** holds the singular values of the decomposition, largest first. In this case, we're not particularly interested in **Sigma**.

In [40]:
pd.DataFrame(Sigma)

Unnamed: 0,0
0,3.823972
1,1.784356
2,1.732051
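
Even though we set **Sigma** aside here, it is what scales the factors back into the original matrix. The sketch below rebuilds a rank-3 approximation from the three factors. To keep it self-contained I've swapped in NumPy's exact `np.linalg.svd` for `randomized_svd` and hardcoded my matrix from above. The names `A`, `k` and `A_k` are just illustrative.

```python
import numpy as np

# The user-item matrix shown earlier (rows are users, columns are items).
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
])

# Exact SVD (an assumption: the notebook itself uses randomized_svd).
U_full, s, VT_full = np.linalg.svd(A, full_matrices=False)

k = 3  # number of components to keep
A_k = U_full[:, :k] @ np.diag(s[:k]) @ VT_full[:k, :]

# Frobenius-norm error of the rank-k approximation ...
err = np.linalg.norm(A - A_k)
# ... equals the root of the summed squares of the discarded singular values.
discarded = np.sqrt(np.sum(s[k:] ** 2))

print("kept singular values:", np.round(s[:k], 6))
print("reconstruction error:", err, "discarded energy:", discarded)
```

The smaller the discarded singular values, the less the truncation costs you.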


Now, let's take a look at the matrix again. 

Let's note a few things (assuming you are still looking at my data and haven't re-run anything yet):
 * Items 0 and 4 have a lot of overlapping users. Users who like Item 0 tend to also like Item 4.
 * Users 2 and 5 like exactly the same items.

In [49]:
user_item_mat

Unnamed: 0,0,1,2,3,4
0,0,1,0,1,0
1,1,0,1,0,1
2,1,1,1,0,1
3,0,0,0,1,0
4,0,0,0,0,0
5,1,1,1,0,1
6,0,0,1,0,0
7,1,1,0,0,1
8,1,0,1,1,0
9,0,0,1,0,0


So, if we look in our new hidden vector space and take the dot products of the item vectors (an unnormalized stand-in for cosine similarity), we expect items 0 & 4 to be the most similar.

In [55]:
compare_item = 0
for item in range(num_items):
    if item != compare_item:
        print("Item %s & %s: "%(compare_item,item), np.dot(VT.T[compare_item],VT.T[item]))  

Item 0 & 1:  0.242063787485
Item 0 & 2:  0.268283426776
Item 0 & 3:  0.0262196392909
Item 0 & 4:  0.290861073319
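
A plain dot product also rewards long vectors, so strictly speaking cosine similarity needs each dot product divided by the two vector lengths. Here is a rough sketch of that, using the rows of `VT.T` from my run (rounded, with the near-zero entries set to 0). `item_vecs` and `cosine_similarity` are names I made up for this example.

```python
import numpy as np

# Rows of VT.T from the run above: one 3-component vector per item.
item_vecs = np.array([
    [0.557986, -0.062270, 0.0],
    [0.403996, -0.267219, 0.577350],
    [0.532586, 0.463976, -0.577350],
    [0.128590, 0.731194, 0.577350],
    [0.474612, -0.418089, 0.0],
])


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


compare_item = 0
cosines = {
    item: cosine_similarity(item_vecs[compare_item], item_vecs[item])
    for item in range(len(item_vecs))
    if item != compare_item
}
for item, score in cosines.items():
    print("Item %s & %s: %.4f" % (compare_item, item, score))
```

With my data, item 4 still comes out on top for item 0 after normalizing.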


If we compare users, we expect that users 2 & 5 should be the most similar.

In [56]:
compare_user = 2
for user in range(num_users):
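    # The self-comparison filter (user != compare_user) is disabled here,
    # so user 2 vs. user 2 is printed too, as a reference score.
    print("User %s & %s: "%(compare_user,user), np.dot(U[compare_user],U[user]))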

User 2 & 0:  0.0303931556646
User 2 & 1:  0.212235169716
User 2 & 2:  0.290441348893
User 2 & 3:  -0.0478130235128
User 2 & 4:  0.0
User 2 & 5:  0.290441348893
User 2 & 6:  0.0303931556646
User 2 & 7:  0.260048193229
User 2 & 8:  0.0632680613133
User 2 & 9:  0.0303931556646


Let's make a function that returns recommendations for a given item input (this user likes item 0... so she'll probably also like items X, Y, Z).

In [81]:
def get_recommends(itemID, VT, num_recom=2):
    recs = []
    for item in range(VT.T.shape[0]):
        if item != itemID:
            recs.append([item,np.dot(VT.T[itemID],VT.T[item])])
    final_rec = [i[0] for i in sorted(recs,key=lambda x: x[1],reverse=True)]
    return final_rec[:num_recom]
print(get_recommends(0,VT,num_recom=2))

[4, 2]
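
The same ranking can be done without the Python loop: one matrix-vector product scores every item at once. This is only a sketch of an alternative, not a replacement for the function above. `get_recommends_vectorized` is a name I've invented, and `VT` is hardcoded from my run (rounded, near-zero entries set to 0).

```python
import numpy as np

# VT from the run above: 3 components (rows) x 5 items (columns).
VT = np.array([
    [0.557986, 0.403996, 0.532586, 0.128590, 0.474612],
    [-0.062270, -0.267219, 0.463976, 0.731194, -0.418089],
    [0.0, 0.577350, -0.577350, 0.577350, 0.0],
])


def get_recommends_vectorized(item_id, VT, num_recom=2):
    """Rank the other items by dot product with item_id, highest first."""
    item_vecs = VT.T                         # one row per item
    scores = item_vecs @ item_vecs[item_id]  # dot product with every item
    scores[item_id] = -np.inf                # never recommend the item itself
    ranked = np.argsort(-scores)             # indices sorted by descending score
    return ranked[:num_recom].tolist()


print(get_recommends_vectorized(0, VT, num_recom=2))  # [4, 2]
```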


We could also find the user who is most similar to a given user and then recommend the items that similar user likes. In my example, user 3 is most similar to user 0, and user 0 likes item 1 while user 3 doesn't. So for user 3, this function finds user 0 as the closest match and suggests item 1 as something user 3 would probably like too.

In [113]:
def get_recommends_user(userID, U, df):
    userrecs = []
    for user in range(U.shape[0]):
        if user!= userID:
            userrecs.append([user,np.dot(U[userID],U[user])])
    final_rec = [i[0] for i in sorted(userrecs,key=lambda x: x[1],reverse=True)]
    comp_user = final_rec[0]
    print("User #%s's most similar user is User #%s "% (userID, comp_user))
    rec_likes = df.iloc[comp_user]
    current = df.iloc[userID]
    recs = []
    # Recommend items the similar user likes that the current user hasn't liked yet.
    for i,item in enumerate(current):
        if item != rec_likes.iloc[i] and rec_likes.iloc[i]!=0:
            recs.append(i)
    return recs

user_to_rec = 3
print("Items for User %s to check out: "% user_to_rec, get_recommends_user(user_to_rec,U,user_item_mat))

User #3's most similar user is User #0 
Items for User 3 to check out:  [1]
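
For comparison, here is a loop-free sketch of the same idea: score every user with one matrix-vector product, take the best match, and mask out items the target user already likes. I'm using NumPy's exact `np.linalg.svd` in place of `randomized_svd` and hardcoding my matrix so the block runs on its own. `recommend_from_nearest_user` is a hypothetical name.

```python
import numpy as np

# The user-item matrix shown earlier (rows are users, columns are items).
likes = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
])

# Exact SVD, truncated to 3 components (assumption: stands in for randomized_svd).
U_full, s, VT_full = np.linalg.svd(likes, full_matrices=False)
U3 = U_full[:, :3]


def recommend_from_nearest_user(user_id, U, likes):
    """Return (most similar user, items they like that user_id does not)."""
    scores = U @ U[user_id]        # dot product with every user
    scores[user_id] = -np.inf      # skip the user themselves
    neighbor = int(np.argmax(scores))
    new_items = np.flatnonzero((likes[neighbor] == 1) & (likes[user_id] == 0))
    return neighbor, new_items.tolist()


print(recommend_from_nearest_user(3, U3, likes))  # (0, [1])
```

Relying on a single nearest neighbor is fragile; averaging over the top few similar users is a common next step, but that goes beyond this toy example.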


That's pretty much what a collaborative-filtering recommender does, just on a slightly **huge-er** scale than this. So now let's look into some "real-life" recommenders using some beer and movie data sets.