# Matrix Factorization for recommender systems

**Introduction**

Matrix factorization is a recommender-system technique that discovers and uses the latent features underlying the interactions between users and movies. As the name suggests, the idea is to factorize a matrix into two or more matrices whose product approximates the original matrix.
In our project, we have a group of users and a set of movies. Given that users have rated some movies, we would like to predict how they would rate the movies they have not yet rated, so that we can recommend movies to them. Mathematically, we have a matrix whose rows are user IDs and whose columns are movie IDs. Each cell is either empty (a missing entry) or holds the rating given by that user to that movie. We would like to build a model that fills in the missing entries with values consistent with the existing entries in the matrix.

The idea behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates a particular movie. For example, two users may both give a high rating to a movie because it is an action movie, a genre both of them prefer, or because it features actors or actresses they both like.
All the data we have in the matrix are the ratings given by users to movies, with no other features about either the user or the movie. If we can discover these latent features, we should be able to predict the rating a user would give a movie, because the features associated with the user should match the features associated with the movie.

When discovering latent features, we assume that the number of features is smaller than both the number of users and the number of movies. If the number of features were the same as the number of users, each user could have a unique feature, and there would be no point in making predictions, since a user would not be interested in movies rated highly by other users. The same holds for movies if the number of features equals the number of movies.

**Mathematical Explanation**

We have a set of $|N|$ users and $|M|$ movies. Let $R$, of size $|N| \times |M|$, be the matrix that contains the ratings given by the users to the movies. We assume that there are $K$ latent features. Our task is to find two matrices $P$ ($|N| \times K$) and $Q$ ($|M| \times K$) such that their product approximates $R$:

$$ R \approx P \times Q^T = \hat{R} $$

Each row of $P$ represents the strength of the associations between a user and the features. Similarly, each row of $Q$ represents the strength of the associations between a movie and the features. To predict the rating of movie $d_j$ by user $u_i$, we calculate the dot product of the two vectors corresponding to $u_i$ and $d_j$ (here $q_{kj}$ denotes the $(k, j)$ entry of $Q^T$):

$$\hat{r_{ij}} = p_i^Tq_j = \sum\limits_{k=1}^{K}p_{ik}q_{kj} $$
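
As a minimal sketch of the prediction rule above, the following uses small hand-picked factor matrices (hypothetical values, not learned from the project data) to show that a single predicted rating is the dot product of one row of `P` and one row of `Q`, and that it matches the corresponding entry of the full product.

```python
import numpy as np

# Hypothetical toy factors: 3 users, 4 movies, K = 2 latent features.
P = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [0.9, 0.1]])   # shape (N, K): one row per user
Q = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0],
              [1.5, 0.5]])   # shape (M, K): one row per movie

# Full matrix of predicted ratings, shape (N, M).
R_hat = P @ Q.T

# Predicted rating of movie j by user i as a single dot product.
i, j = 1, 2
r_ij = P[i] @ Q[j]
print(r_ij, R_hat[i, j])
```

Both printed values agree, since entry $(i, j)$ of $P \times Q^T$ is exactly the sum over $k$ in the formula above.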


**Model** Our aim is to find such matrices $P$ and $Q$, and we use gradient descent to do so. We initialize the two matrices with random values, calculate how much their product differs from $R$, and iteratively minimize that difference, aiming to reach a local minimum. The difference is the squared error between the actual rating and the estimated rating for each user-movie pair that has a non-empty value:

$$ e_{ij}^2 = (r_{ij} - \hat{r_{ij}})^2 =  (r_{ij} - \sum\limits_{k=1}^{K}p_{ik}q_{kj}) ^2 $$

We differentiate the above equation with respect to $p_{ik}$ and $q_{kj}$ since we need to update these values:

$$ \frac{\partial }{\partial p_{ik}} e_{ij}^2 =  -2 (r_{ij} - \hat{r_{ij}})* (q_{kj}) = -2e_{ij}q_{kj}$$
$$ \frac{\partial }{\partial q_{kj}} e_{ij}^2 =  -2 (r_{ij} - \hat{r_{ij}})* (p_{ik}) = -2e_{ij}p_{ik}$$

Once we have the gradient, we can update the values of $p_{ik}$ and $q_{kj}$ using the  following update rules:
$$ p_{ik}^{'} =  p_{ik} - \alpha *\frac{\partial }{\partial p_{ik}} e_{ij}^2 = p_{ik}  + 2\alpha e_{ij}q_{kj}$$
$$ q_{kj}^{'} =  q_{kj} - \alpha *\frac{\partial }{\partial q_{kj}} e_{ij}^2 = q_{kj}  + 2\alpha e_{ij}p_{ik}$$

Here $\alpha$ is the step size with which we approach the minimum. The step size is an important parameter to tune in gradient descent. If the step size is too large, there is a risk of oscillating around the minimum; if it is too small, the rate of convergence is low. Our data is huge, so tuning this parameter with cross-validation takes a long time.
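
The effect of the step size can be illustrated on a much simpler problem than ours. The sketch below runs plain gradient descent on the one-dimensional function $f(x) = x^2$ (an assumption chosen for illustration only; `gradient_descent` is a hypothetical helper, not part of the notebook) with three different values of $\alpha$.

```python
def gradient_descent(alpha, x0=1.0, steps=20):
    """Minimise f(x) = x**2 (gradient 2x) and return all visited points."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * 2 * xs[-1])
    return xs

too_small = gradient_descent(alpha=0.01)  # creeps slowly towards 0
moderate = gradient_descent(alpha=0.4)    # converges quickly
too_large = gradient_descent(alpha=1.0)   # jumps between +1 and -1 forever

print(too_small[-1], moderate[-1], too_large[-3:])
```

With the smallest step the iterate is still far from the minimum after 20 steps, while the largest step keeps bouncing between two points without ever getting closer.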

Using the above update rules, we can then iterate until the error converges to its minimum. To decide when to stop, we check the overall error over the set $T$ of observed (user, movie, rating) triples:

$$ E = \sum\limits_{u_i,d_j, r_{ij} \in T}e_{ij}^2  = \sum\limits_{u_i,d_j, r_{ij} \in T} (r_{ij} - \sum\limits_{k=1}^{K}p_{ik}q_{kj}) ^2$$
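
The overall error can be computed without explicit loops. The sketch below assumes, as the notebook code does with its `R[i][j] > 0` check, that a zero in `R` marks a missing rating; `total_squared_error` is a hypothetical helper name.

```python
import numpy as np

def total_squared_error(R, P, Q):
    """Sum of squared errors over the observed (non-zero) entries of R."""
    mask = R > 0                    # True where a rating exists
    diff = (R - P @ Q.T) * mask     # zero out the missing entries
    return float(np.sum(diff ** 2))

# Toy example: two observed ratings, each off by exactly 1.
R = np.array([[5.0, 0.0],
              [0.0, 3.0]])
P = np.array([[1.0], [1.0]])
Q = np.array([[4.0], [2.0]])
E = total_squared_error(R, P, Q)
print(E)
```

Only the two observed cells contribute, so the unobserved predictions do not affect the stopping criterion.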

**Regularization**

Overfitting is a common problem in data science, so we introduce regularization into this method to avoid it. To do so, we add a parameter $\beta$ and modify the squared error as

$$e_{ij}^2 = (r_{ij} - \sum\limits_{k=1}^{K}p_{ik}q_{kj}) ^2  + \frac{\beta}{2}\sum\limits_{k=1}^{K}(p_{ik}^2 + q_{kj}^2)$$

The new update rules are:
$$ p_{ik}^{'} =  p_{ik} - \alpha *\frac{\partial }{\partial p_{ik}} e_{ij}^2 = p_{ik}  + \alpha (2e_{ij}q_{kj} - \beta p_{ik})$$
$$ q_{kj}^{'} =  q_{kj} - \alpha *\frac{\partial }{\partial q_{kj}} e_{ij}^2 = q_{kj}  + \alpha (2e_{ij}p_{ik} - \beta q_{kj})$$
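
One way to sanity-check these update rules is to compare the gradient they imply with a numerical derivative. The sketch below reads the penalty as $\frac{\beta}{2}\sum_k (p_{ik}^2 + q_{kj}^2)$, which is the form the update rules correspond to, and uses random toy vectors; the helper names are hypothetical.

```python
import numpy as np

def regularized_error(r, p, q, beta):
    """e_ij^2 + (beta/2) * sum_k (p_ik^2 + q_kj^2) for a single rating r."""
    e = r - p @ q
    return e ** 2 + (beta / 2) * (p @ p + q @ q)

def analytic_grad_p(r, p, q, beta):
    """Gradient w.r.t. p implied by the update rule: -(2 e q - beta p)."""
    e = r - p @ q
    return -(2 * e * q - beta * p)

rng = np.random.default_rng(0)
p, q = rng.random(3), rng.random(3)
r, beta, h = 4.0, 0.02, 1e-6

# Central finite differences, one coordinate of p at a time.
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = h
    numeric[k] = (regularized_error(r, p + step, q, beta)
                  - regularized_error(r, p - step, q, beta)) / (2 * h)

max_gap = np.max(np.abs(numeric - analytic_grad_p(r, p, q, beta)))
print(max_gap)
```

A gap close to zero indicates that subtracting $\alpha$ times this gradient gives the update $p_{ik} + \alpha(2e_{ij}q_{kj} - \beta p_{ik})$; the check for $q_{kj}$ is symmetric.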



In [None]:
import pandas as pd
from scipy.spatial.distance import cosine

In [2]:
data = pd.read_csv('subset.csv')[["movieID", "reviewerID", "rating"]]
data.head()

Unnamed: 0,movieID,reviewerID,rating
0,5019281,ADZPIG9QOCDG5,4.0
1,5019281,A35947ZP82G7JH,3.0
2,5019281,A3UORV8A9D5L2E,3.0
3,5019281,A1VKW06X1O2X7V,5.0
4,5019281,A3R27T4HADWFFJ,4.0


In [3]:
reviewer_count = data.groupby('reviewerID').count()
# check that each reviewerID indeed corresponds to a single reviewer
reviewer_count[0:5]

Unnamed: 0_level_0,movieID,rating
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1
A00295401U6S2UG3RAQSZ,6,6
A00348066Q1WEW5BMESN,5,5
A0040548BPHKXMHH3NTI,10,10
A00438023NNXSDBGXK56L,5,5
A0048168OBFNFN7WW8XC,9,9


In [4]:
data.shape

(1697533, 3)

Reducing the dataset

In [5]:
# Keep the 1000 reviewers with the most ratings
data_order = data.groupby('reviewerID').count()
data_order = data_order.sort_values('movieID', ascending=False)
users_selected = data_order.index[:1000].values
data_users_selected = data[data['reviewerID'].isin(users_selected)]
data_users_selected.shape


(317908, 3)

In [6]:
# Keep the 1000 most-rated movies
data_order2 = data.groupby('movieID').count()
data_order2 = data_order2.sort_values('reviewerID', ascending=False)
movies_selected = data_order2.index[:1000].values

print(data_users_selected.shape)
# Build the mask from data_users_selected itself so it aligns with the frame being filtered
data_movies_users_selected = data_users_selected[data_users_selected['movieID'].isin(movies_selected)]
print(data_movies_users_selected.shape)
data_movies_users_selected.head(n=10)

(317908, 3)
(61542, 3)




Unnamed: 0,movieID,reviewerID,rating
522,310263662,A3EE0H0NWQ9QVL,5.0
526,310263662,A32JKNQ6BABMQ2,3.0
531,310263662,A1VQBHHXIKHIGS,1.0
535,310263662,A2PV6GK1HV54Y9,4.0
540,310263662,A1GSR7RGCG1QYZ,3.0
546,310263662,AQ8DU6XVA3USJ,5.0
547,310263662,A20LY8E9NGYA4M,4.0
555,310263662,A3NQU1649SH0Q4,5.0
571,310263662,A1VHK9A4VLJTHC,5.0
573,310263662,A3TNM3C9ENUCFW,4.0


In [7]:
import numpy as np

# Collect reviewers with fewer than 3 ratings in the reduced dataset
grouped = data_movies_users_selected.groupby('reviewerID')
users = []
for user in data_movies_users_selected['reviewerID'].unique():
    if grouped.get_group(user).shape[0] < 3:
        users.append(user)




In [8]:
# Drop the reviewers collected above (those with fewer than 3 ratings)
criterion = ~data_movies_users_selected['reviewerID'].isin(users)
data_movies_users_selected_new = data_movies_users_selected[criterion]
print(data_movies_users_selected_new.shape)

(61525, 3)


In [9]:
# Test data: we will randomly mark the rating of one movie for each user as NaN and then predict it
data_movies_users_selected_new_test = data_movies_users_selected_new.copy()
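
The hold-out step described in the comment above could be sketched as follows. This is only one possible approach, shown on a hypothetical toy frame with the same column names; it assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as the notebook's data.
ratings = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "movieID":    ["m1", "m2", "m3", "m1", "m3", "m4"],
    "rating":     [4.0, 5.0, 3.0, 2.0, 5.0, 4.0],
})

# Draw one rating per reviewer at random as the held-out test set.
test = ratings.groupby("reviewerID").sample(n=1, random_state=0)

# Everything that was not held out stays in the training set.
train = ratings.drop(test.index)

print(len(train), len(test))
```

Because `test` keeps the original row index, dropping those rows from `ratings` guarantees that the training and test sets do not overlap.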

In [10]:
rp = data_movies_users_selected_new_test.pivot_table(columns=['movieID'],index=['reviewerID'],values='rating')
rp.head(n= 10)


movieID,0310263662,0767002652,076780192X,0767802470,0767802519,0767802624,0767802659,0767805267,0767811100,0767824571,...,B00H9KKGTO,B00H9L26AA,B00HEPC0TS,B00HEPDGKA,B00HEPE6MM,B00HHYF570,B00HLTD3ZW,B00HNGZHDE,B00JA3RPAG,B00JAQJMJ0
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A10175AMUHOQC4,,,,,,,,,,,...,,,5.0,,,,,,,
A103KNDW8GN92L,4.0,,,,,,,,,,...,,,,,,,,,,
A106016KSI0YQ,,,,,,,,,,,...,,,,,,,,,,
A106YXO3EHVD3J,,,,,,,,,,,...,,,4.0,,,3.0,3.0,,,
A10H47FMW8NHII,,,,,,,,,,,...,3.0,5.0,4.0,,4.0,,4.0,4.0,,5.0
A10ODC971MDHV8,,,5.0,5.0,,,,,5.0,5.0,...,,,,,,,,,,
A10Q8NIFOVOHFV,,,,,,,,,,,...,,,,,,,,,,
A11ED8O95W2103,,,,,,,,,5.0,,...,,,,,,,,,,
A11PTCZ2FM2547,3.0,,4.0,,,,4.0,,,5.0,...,,,,,,,,,,
A11XKY4EIU2KNR,,,,,,,,,,,...,,,,,,,,,,


In [11]:
# Treat missing ratings as 0; the factorization below only trains on entries > 0
rp = rp.fillna(0)


In [12]:
#
# An implementation of matrix factorization
#
import numpy
###############################################################################

"""
@INPUT:
    R     : a matrix to be factorized, dimension N x M
    P     : an initial matrix of dimension N x K
    Q     : an initial matrix of dimension M x K
    K     : the number of latent features (rank of the matrix)
    steps : the maximum number of steps to perform the optimisation
    alpha : the learning rate
    beta  : the regularization parameter
@OUTPUT:
    the final matrices P and Q
"""
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    Q = Q.T
    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = numpy.dot(P,Q)
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001:
            break
    return P, Q.T

###############################################################################

if __name__ == "__main__":
    R = rp

    R = np.array(R)
    R = R[:1000,:1000]
    N = len(R)
    M = len(R[0])
    K = 3

    P = numpy.random.rand(N,K)
    Q = numpy.random.rand(M,K)

    nP, nQ = matrix_factorization(R, P, Q, K)

In [13]:
Final =  numpy.dot(nP, nQ.T)
print(Final)
R = np.array(rp)
print(R[:100,:100])

[[ 4.32423785  5.19036694  3.76404225 ...,  2.33315206  1.85646726
   2.56547494]
 [ 4.3051052   4.43526362  4.666825   ...,  2.91501147  2.55514738
   4.48462933]
 [ 3.84089598  4.13216478  4.12899284 ...,  2.3280845   1.96158674
   3.86134005]
 ..., 
 [ 4.78832461  4.98735763  4.83261257 ...,  3.40312515  3.02251345
   4.34110157]
 [ 4.28913143  3.95220125  4.38307872 ...,  3.8838613   3.68120646
   4.21688434]
 [ 4.90828161  5.1319599   4.57764881 ...,  3.72071564  3.36160299
   3.78739644]]
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  4.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 5.  0.  0. ...,  0.  0.  0.]]


In [14]:
R2 = 0
N = 0
R2_list=[]
for i in range(985):
    for j in range(1000):
        if(R[i][j]!=0):
            R2+= np.square((R[i][j]- Final[i][j]))
            N+=1
results = np.sqrt(R2/N)
print("Results k=3", results,"N", N)
print(R2, R2/100)
R2_list.append(R2)

Results k=3 0.795988583072 N 61525
38982.1061451 389.821061451


In [67]:
R2 = 0
N = 0
R2_list=[]
for i in range(985):
    for j in range(1000):
        if(R[i][j]!=0):
            R2+= np.square((R[i][j]- Final[i][j]))
            N+=1
results = np.sqrt(R2/N)
print("Results", results,"N", N)
print(R2, R2/100)
R2_list.append(R2)

Results 0.83384562304 N 61525
42778.2416314 427.782416314


In [68]:
users_test = []
movies_for_users = []
for user in data_movies_users_selected_new_test['reviewerID'].unique():
    a = grouped.get_group(user)
    reviewer = a['reviewerID'].iloc[0]
    movie = a['movieID'].iloc[-1]
    #rating = a['rating'].iloc[0]
    print(movie)
    print(rp.ix[reviewer,movie])
    rp.ix[reviewer,movie]=float('NaN')
    #print(rp.ix[movie,reviewer])
    users_test.append(reviewer)
    movies_for_users.append(movie)

B00FZM8Z7I
4.0
B00BEIYP1W
4.0
B00104QSOM
5.0
B00DL47RQ2
5.0
B0060ZJ74O
2.0
B000ARTN3I
5.0
B0001ZX0OC
4.0
B00CIXVAN8
4.0
B000MMMT9G
5.0
B008JFUN1O
3.0
B001MYIXAC
5.0
B000JLTR8Q
2.0
B00C8CQRQ4
4.0
B001TOD92C
4.0
B000VBJEEG
2.0
B0002ZDVEU
5.0
B0000VAFNQ
1.0
B00027SIUK
4.0
B00H83EUL2
4.0
B000E33VWW
5.0
B00H9KKGTO
3.0
B000J10EQU
2.0
B000BW7QWW
1.0
B005IZLPKQ
4.0
B000BHZ2BO
5.0
B00GST8U4U
5.0
B000VKL6Z2
5.0
B000C3L27K
4.0
B002VPE1AW
4.0
B0051MKMNC
5.0
B00104QSOM
2.0
B002VPE1AW
5.0
B0006SSOHC
3.0
B0007RUSGW
5.0
B00JAQJMJ0
2.0
B008JFUPFI
4.0
B0002234LS
3.0
B0009PLLN6
2.0
B00H1RMOI6
5.0
B004G6009K
4.0
B00CZB9BCU
2.0
B00080ZG10
4.0
B00H83EUL2
5.0
B001SGEUYW
2.0
B00JAQJMJ0
5.0
B0007Y08I8
5.0
B00DL48BM6
3.0
B002RD55JE
5.0
B0067EKYS6
2.0
B002VKE0XA
5.0
B00H7LINKE
4.0
B000F7CMRM
3.0
B002ZG997C
4.0
B00JA3RPAG
4.0
B00BUADSMQ
1.0
B00JAQJMJ0
3.0
B000F7CMRM
2.0
B00H83EUL2
5.0
B00JAQJMJ0
5.0
B005CMSDKA
3.0
B00H83EUL2
4.0
B005LAIHW2
4.0
B000CQLZ0Q
4.0
B00JAQJMJ0
3.0
B00H83EUL2
5.0
B0064NTZJO
4.0
B002VPE1AW

In [69]:
# take the sum of movie liking for each movie for respective reviewer
#user_history = rp.groupby('reviewerID').sum()
#user_history.head()
data_movies_users_selected_new_test.head()

Unnamed: 0,movieID,reviewerID,rating
522,310263662,A3EE0H0NWQ9QVL,5.0
526,310263662,A32JKNQ6BABMQ2,3.0
531,310263662,A1VQBHHXIKHIGS,1.0
535,310263662,A2PV6GK1HV54Y9,4.0
540,310263662,A1GSR7RGCG1QYZ,3.0


In [70]:
data_ibs = pd.DataFrame(index=rp.columns,columns=rp.columns)
data_ibs.columns
rp = rp.fillna(0)

In [None]:
# Similarity between movies
for i in range(0, len(data_ibs.columns)):
    # Loop through the columns for each column
    for j in range(0, len(data_ibs.columns)):
        # Fill in the placeholder with cosine similarities
        # (.iloc replaces the positional use of .ix, which was removed in pandas 1.0)
        data_ibs.iloc[i, j] = 1 - cosine(rp.iloc[:, i], rp.iloc[:, j])
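
The double loop above calls `cosine` once per pair of movies, which is slow for 1000 movies. A vectorised alternative is sketched below; `item_cosine_similarity` is a hypothetical helper, and it assumes a user-by-movie DataFrame in which 0 marks a missing rating. Movies with no ratings at all get similarity 0, which mirrors the `fillna(0)` applied to the similarity matrix in the notebook.

```python
import numpy as np
import pandas as pd

def item_cosine_similarity(ratings):
    """Cosine similarity between the columns of a user x item DataFrame."""
    X = ratings.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=0)   # one norm per item column
    dots = X.T @ X                      # all pairwise dot products
    denom = np.outer(norms, norms)
    # Columns with no ratings have zero norm; report similarity 0 for them.
    sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    return pd.DataFrame(sims, index=ratings.columns, columns=ratings.columns)

toy = pd.DataFrame({"m1": [5.0, 0.0, 3.0],
                    "m2": [5.0, 0.0, 3.0],
                    "m3": [0.0, 0.0, 0.0]})
sim = item_cosine_similarity(toy)
print(sim)
```

In this toy example `m1` and `m2` have identical rating columns and so have similarity 1, while the unrated `m3` has similarity 0 with everything.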

In [17]:
data_ibs_save = data_ibs.copy()
data_ibs = data_ibs.fillna(0)
data_ibs.head(n = 10)

movieID,0310263662,0767002652,076780192X,0767802470,0767802519,0767802624,0767802659,0767805267,0767811100,0767824571,...,B00H9KKGTO,B00H9L26AA,B00HEPC0TS,B00HEPDGKA,B00HEPE6MM,B00HHYF570,B00HLTD3ZW,B00HNGZHDE,B00JA3RPAG,B00JAQJMJ0
movieID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0310263662,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0767002652,0,1.0,0.0,0.0,0.0,0.0,0.0,0.114125,0.0,0.16639,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.196267
076780192X,0,0.0,1.0,0.31566,0.176503,0.090549,0.182064,0.259683,0.152539,0.13409,...,0.036769,0.050396,0.046326,0.0,0.0,0.0,0.0,0.0,0.0,0.065128
0767802470,0,0.0,0.31566,1.0,0.223362,0.128479,0.211357,0.253315,0.074632,0.214878,...,0.029812,0.027458,0.0,0.0,0.0,0.0,0.028877,0.0,0.042805,0.079207
0767802519,0,0.0,0.176503,0.223362,1.0,0.053394,0.0,0.197151,0.212152,0.04186,...,0.046461,0.0,0.0,0.0,0.0,0.0,0.0,0.04372,0.0,0.098753
0767802624,0,0.0,0.090549,0.128479,0.053394,1.0,0.175994,0.234144,0.146008,0.10453,...,0.031357,0.050542,0.026199,0.0,0.0,0.065561,0.0,0.0,0.045023,0.09664
0767802659,0,0.0,0.182064,0.211357,0.0,0.175994,1.0,0.166028,0.059726,0.152499,...,0.019344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027775,0.051395
0767805267,0,0.114125,0.259683,0.253315,0.197151,0.234144,0.166028,1.0,0.119597,0.086085,...,0.06323,0.0,0.07044,0.052754,0.038288,0.0,0.034026,0.0,0.040349,0.097063
0767811100,0,0.0,0.152539,0.074632,0.212152,0.146008,0.059726,0.119597,1.0,0.200625,...,0.056011,0.056621,0.022828,0.051291,0.0,0.0,0.0,0.064277,0.0,0.076948
0767824571,0,0.16639,0.13409,0.214878,0.04186,0.10453,0.152499,0.086085,0.200625,1.0,...,0.044249,0.075001,0.025674,0.022151,0.033493,0.0,0.0,0.048193,0.0,0.159365


In [30]:
# Create a placeholder for the closest neighbours of each item
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 11))

# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0, len(data_ibs.columns)):
    # Position 0 is the item itself, so take positions 1..10 as the neighbours
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False).iloc[1:11].index.values

# NOTE (unresolved): data_neighbours was coming out all NaN here even though data_ibs has finite values; the cause was not found.


In [31]:
data_neighbours.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10
movieID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0310263662,B00005JKC3,B00005JL3T,B00005JL3A,B00005JKZY,B00005JKVZ,B00005JKQZ,B00005JKNF,B00005JKN9,B00005JKMY,B00005JKJM
0767002652,B003Y5HWJU,B000OY8NII,B002BWP2IK,B001O4C6NA,B000QUEQ4U,6303965415,B000CEXG0U,B003ZSJ212,1424819253,6304383827
076780192X,0767802470,0780623134,B00000K3AM,0783216084,0780628799,B00003CXJC,B00003CXRP,0783211856,0790729385,B00003CWT6
0767802470,076780192X,6300214826,6300216500,0790729628,6302135621,0800141709,0790743132,0790701022,6302760046,0780623134
0767802519,0780622545,B00003CXZ3,0767830520,6302760046,0767802470,B008RV5K4U,B00003CXBK,0767811100,0780623134,B00003CXQA


In [41]:
# --- Start User Based Recommendations --- #

# Helper function to get similarity scores
# (evaluates to NaN when the similarities sum to zero)
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)

In [42]:
# Create a placeholder matrix for the predicted scores, initialised with the current ratings
data_sims = pd.DataFrame(index=rp.index, columns=rp.columns)
data_sims.loc[:, :] = rp.loc[:, :]
data_sims.head()

movieID,0310263662,0767002652,076780192X,0767802470,0767802519,0767802624,0767802659,0767805267,0767811100,0767824571,...,B00H9KKGTO,B00H9L26AA,B00HEPC0TS,B00HEPDGKA,B00HEPE6MM,B00HHYF570,B00HLTD3ZW,B00HNGZHDE,B00JA3RPAG,B00JAQJMJ0
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A10175AMUHOQC4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KNDW8GN92L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A106016KSI0YQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A106YXO3EHVD3J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0
A10H47FMW8NHII,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,5.0,4.0,0.0,4.0,0.0,4.0,4.0,0.0,5.0


In [50]:
data_ibs.loc['0310263662']

movieID
0310263662    0.0
0767002652    0.0
076780192X    0.0
0767802470    0.0
0767802519    0.0
0767802624    0.0
0767802659    0.0
0767805267    0.0
0767811100    0.0
0767824571    0.0
0767830520    0.0
0767834739    0.0
0767849493    0.0
0780020685    0.0
0780022300    0.0
0780618556    0.0
0780619412    0.0
0780622545    0.0
0780623134    0.0
0780625129    0.0
078062565X    0.0
0780628799    0.0
0780631153    0.0
0782002064    0.0
0783211856    0.0
0783216084    0.0
0783219695    0.0
0783221088    0.0
0783221487    0.0
0783222734    0.0
             ... 
B00DV1XYTO    0.0
B00DZP1BZ0    0.0
B00E3UN44W    0.0
B00E8RK5OC    0.0
B00EV4EUT8    0.0
B00FPPQYXM    0.0
B00FRILRL6    0.0
B00FZ4KR4U    0.0
B00FZM8Z7I    0.0
B00G15MDI0    0.0
B00G2P79BU    0.0
B00G4Q3KOC    0.0
B00GMV8LIO    0.0
B00GSBMNOQ    0.0
B00GST8U4U    0.0
B00GUO2SKA    0.0
B00H1RMOI6    0.0
B00H7LINKE    0.0
B00H83EUL2    0.0
B00H9HZGQ0    0.0
B00H9KKGTO    0.0
B00H9L26AA    0.0
B00HEPC0TS    0.0
B00HEPDGKA    0.0
B0

In [52]:
print(len(users_test), len(movies_for_users))
for i in range(1, 20):
    user = users_test[i]
    product = movies_for_users[i]
    print("user", user, "product", product)
    # Positional slice: selects neighbour columns 2..10 (9 items)
    product_top_names = data_neighbours.loc[product].iloc[1:10]
    print("top 10 names", product_top_names)
    product_top_sims = data_ibs.loc[product].sort_values(ascending=False).iloc[1:10]
    print("similarity score= ", product_top_sims)
    user_purchases = rp.loc[user, product_top_names.values]
    print("user purchases =", user_purchases)

    # A single .loc call avoids chained assignment, which may write to a copy
    data_sims.loc[user, product] = getScore(user_purchases, product_top_sims)
    print(data_sims.loc[user, product])
# We just need to compare this matrix with the pivot table created from data_movies_users_selected to get the accuracy.

985 985
user A32JKNQ6BABMQ2 product 0310263662
top 10 names 2     B00005JL3T
3     B00005JL3A
4     B00005JKZY
5     B00005JKVZ
6     B00005JKQZ
7     B00005JKNF
8     B00005JKN9
9     B00005JKMY
10    B00005JKJM
Name: 0310263662, dtype: object
similarity score=  movieID
B00005JKC3    0.0
B00005JL3T    0.0
B00005JL3A    0.0
B00005JKZY    0.0
B00005JKVZ    0.0
B00005JKQZ    0.0
B00005JKNF    0.0
B00005JKN9    0.0
B00005JKMY    0.0
Name: 0310263662, dtype: float64
user purchases = movieID
B00005JL3T    0.0
B00005JL3A    0.0
B00005JKZY    0.0
B00005JKVZ    0.0
B00005JKQZ    0.0
B00005JKNF    0.0
B00005JKN9    0.0
B00005JKMY    0.0
B00005JKJM    0.0
Name: A32JKNQ6BABMQ2, dtype: float64
nan
user A1VQBHHXIKHIGS product 0310263662
top 10 names 2     B00005JL3T
3     B00005JL3A
4     B00005JKZY
5     B00005JKVZ
6     B00005JKQZ
7     B00005JKNF
8     B00005JKN9
9     B00005JKMY
10    B00005JKJM
Name: 0310263662, dtype: object
similarity score=  movieID
B00005JKC3    0.0
B00005JL3T    0.0
B0000

We used a groupby statement above to ensure that each reviewerID indeed indicates a person.

Right now, each row of the dataframe contains a single movie that has been reviewed by one particular user ID. I would like to transform the dataframe such that each row of the dataframe contains a single reviewerID and the movies that the reviewer likes (defined to be of rating 4 and above).
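
A minimal sketch of that transformation is shown below on a hypothetical toy frame with the same three columns. It filters to ratings of 4 and above and collects the liked movies per reviewer into a list; the column name `liked_movies` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as the notebook's data.
ratings = pd.DataFrame({
    "movieID":    ["m1", "m2", "m3", "m1", "m2"],
    "reviewerID": ["u1", "u1", "u1", "u2", "u2"],
    "rating":     [5.0, 2.0, 4.0, 3.0, 4.0],
})

# Keep only liked movies (rating >= 4), then gather them per reviewer.
liked = (ratings[ratings["rating"] >= 4]
         .groupby("reviewerID")["movieID"]
         .apply(list)
         .reset_index(name="liked_movies"))

print(liked)
```

Note that a reviewer with no rating of 4 or above would not appear in `liked` at all, since the filter removes their rows before grouping.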