In [2]:
import numpy as np

## Initial Setup
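We first create the utility matrix. Each row holds the ratings for one movie and each column corresponds to one user, which is the layout the similarity and rating code below relies on. A 0 means the user has not rated that movie. You can also load the ratings from a file.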

In [3]:
utility = np.array([[1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
                    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
                    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
                    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
                    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
                    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
                   ])

In [4]:
utility.mean(axis=1, keepdims=True)

array([[ 1.5       ],
       [ 1.58333333],
       [ 2.        ],
       [ 1.41666667],
       [ 1.66666667],
       [ 1.08333333]])

## Masked Numpy Matrix
When calculating the average rating for each row, we want to ignore the zero (missing) cells in the data set. We assume that a zero in a cell means the user has not watched that movie.
By default, NumPy includes the zero cells when it computes an average (or any other reduction), which gives an incorrect value, as the row means above show. To overcome this problem, we use a NumPy masked array. Operations on a masked array skip the cells whose mask is True and use only the cells whose mask is False. For more details, refer to the NumPy masked array documentation: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
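
As a small illustration of the mask semantics, the sketch below uses a made-up four-element array (the name `ratings` and its values are assumptions for this example, not part of the data set) and compares the plain mean with the masked mean.

```python
import numpy as np

# Hypothetical ratings for illustration; 0 stands for "not rated"
ratings = np.array([4, 0, 2, 0])

# Mask every cell that equals 0; mask=True marks a cell as excluded
masked = np.ma.masked_where(ratings == 0, ratings)

plain_mean = ratings.mean()    # zeros are counted: (4 + 0 + 2 + 0) / 4
masked_mean = masked.mean()    # only unmasked cells: (4 + 2) / 2
valid_cells = masked.count()   # number of unmasked cells

print(masked.mask)
print(plain_mean, masked_mean, valid_cells)
```

The mask is True exactly where the rating was 0, so the masked mean averages only the two real ratings.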

In [5]:
utility1 = np.ma.masked_where(utility == 0, utility)

In [6]:
utility1

masked_array(data =
 [[1 -- 3 -- -- 5 -- -- 5 -- 4 --]
 [-- -- 5 4 -- -- 4 -- -- 2 1 3]
 [2 4 -- 1 2 -- 3 -- 4 3 5 --]
 [-- 2 4 -- 5 -- -- 4 -- -- 2 --]
 [-- -- 4 3 4 2 -- -- -- -- 2 5]
 [1 -- 3 -- 3 -- -- 2 -- -- 4 --]],
             mask =
 [[False  True False  True  True False  True  True False  True False  True]
 [ True  True False False  True  True False  True  True False False False]
 [False False  True False False  True False  True False False False  True]
 [ True False False  True False  True  True False  True  True False  True]
 [ True  True False False False False  True  True  True  True False False]
 [False  True False  True False  True  True False  True  True False  True]],
       fill_value = 999999)

In [7]:
averages = np.ma.mean(utility1, axis=1, keepdims=True).filled(0)
averages

array([[ 3.6       ],
       [ 3.16666667],
       [ 3.        ],
       [ 3.4       ],
       [ 3.33333333],
       [ 2.6       ]])

In [8]:
intermediate = utility1 - averages

## Calculation of Cosine Similarity

To calculate the sum of squares, we use NumPy's Einstein summation function, np.einsum. It is a very powerful function; for more details, refer to the documentation linked below.

The subscript string names the axes of each operand and of the result. For a matrix A with axes (i, j), the sum of squares of each row is written as

```python
np.einsum('ij,ij->i', A, A)
```

As another example, the matrix product of two matrices A(i, j) and B(j, k) is written as

```python
np.einsum('ij,jk->ik', A, B)
```

https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
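
To make the two subscript strings concrete, here is a minimal sketch on small made-up matrices (the values of `A` and `B` are assumptions chosen for easy arithmetic). It checks each einsum call against the equivalent plain NumPy expression.

```python
import numpy as np

# Small example matrices, chosen only for illustration
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Row-wise sum of squares: multiply A by itself element-wise, then sum over j
row_sq = np.einsum('ij,ij->i', A, A)

# Matrix product: sum over the shared index j
prod = np.einsum('ij,jk->ik', A, B)

# Both agree with the conventional spellings
assert np.allclose(row_sq, (A * A).sum(axis=1))
assert np.allclose(prod, A @ B)
print(row_sq)
print(prod)
```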

In [9]:
def similarity(util, person1, person2):
    #person1 and person2 are 1-based row numbers of util.
    #Each row is extracted as a 1 x n matrix; masked (unrated) cells are
    #filled with 0 so that they contribute nothing to the sums below.
    p1m = np.ma.filled(util[person1-1:person1,], 0)
    p2m = np.ma.filled(util[person2-1:person2,], 0)

    #Cosine similarity of the two rows: inner product divided by the
    #product of the row norms. np.asscalar has been removed from recent
    #NumPy releases, so .item() is used to extract the scalar values.
    dot = np.inner(p1m, p2m).item()
    norm1 = np.sqrt(np.einsum('ij,ij->i', p1m, p1m)).item()
    norm2 = np.sqrt(np.einsum('ij,ij->i', p2m, p2m)).item()
    similar = dot / (norm1 * norm2)
    return similar
    

In [10]:
similarity(intermediate, 1,3)

0.41403933560541256

In [11]:
#Create a zero-filled array that will hold the similarity of the base movie to every movie.
count = intermediate.shape[0]
out = np.zeros(shape=(count))

In [12]:
out

array([ 0.,  0.,  0.,  0.,  0.,  0.])

In [13]:
baseMovie = 1
for i in range(1,count+1):
    out[i-1] = similarity(intermediate,baseMovie, i)

In [14]:
#Let's see what the out array contains
out

array([ 1.        , -0.17854212,  0.41403934, -0.10245014, -0.30895719,
        0.58703951])
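
As a cross-check, the same similarities can be computed for every pair of rows at once. This is an alternative sketch, not the approach used in the cells above; the names `masked`, `centered`, `norms` and `sim_matrix` are introduced here for illustration only.

```python
import numpy as np

utility = np.array([[1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
                    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
                    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
                    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
                    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
                    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0]])

# Mask the unrated cells, subtract each row's mean, then fill the gaps with 0
masked = np.ma.masked_where(utility == 0, utility)
centered = (masked - masked.mean(axis=1, keepdims=True)).filled(0.0)

# Norm of each centered row, using the same einsum pattern as above
norms = np.sqrt(np.einsum('ij,ij->i', centered, centered))

# All pairwise cosine similarities: Gram matrix divided by the norm products
sim_matrix = (centered @ centered.T) / np.outer(norms, norms)

# The first row should match the out array computed by the loop
print(sim_matrix[0])
```

Computing the full matrix costs little for a data set this small and avoids repeated calls to the similarity function when more than one base movie is of interest.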

In [15]:
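def calculateRating(ratingMatrix, similarityArray, userId=4, numberOfSimilarMovies=2):
    #userId is a 0-based column index into ratingMatrix
    sumRating = 0
    norm = 0
    
    #Calculate the similar movie indices (smi): the rows with the highest
    #similarity, skipping the last sorted entry, which is the base movie
    #itself (similarity 1)
    smi = np.argsort(a=similarityArray)[-(numberOfSimilarMovies+1):-1]

    #Similarity-weighted average of the ratings in column userId for those rows
    for i in smi:
        sumRating += ratingMatrix[i, userId] * similarityArray[i]
        norm += similarityArray[i]
    
    return (sumRating/norm)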

In [16]:
np.argsort(a=out)[-3:-1]

array([2, 5], dtype=int64)

In [17]:
calculateRating(utility, out,4)

2.5864068669348175
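
The result can be checked by hand. The two selected rows are 2 and 5 (0-based), whose entries in column 4 of the utility matrix are 2 and 3. The sketch below repeats the weighted average with the similarity values copied from the out array above; the names `sims`, `ratings` and `predicted` are introduced here for illustration.

```python
import numpy as np

# Similarities of the base movie to rows 2 and 5, copied from the out array
sims = np.array([0.41403934, 0.58703951])

# Ratings in column 4 of the utility matrix for rows 2 and 5
ratings = np.array([2, 3])

# Similarity-weighted average: sum(sim * rating) / sum(sim)
predicted = np.average(ratings, weights=sims)
print(predicted)  # approximately 2.5864
```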