# Sweep Weighted User-User Params
In this notebook, we sweep the parameters w1 and w2 of our weighted user-user model over the following ranges:
1. w1 = 0.0 - 5.0 (using step of 0.1)
2. w2 = 0.0 - 5.0 (using step of 0.1)

Output: a JSON file that maps each key (w1, w2) to the MSE on the test data.
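The grid and the key format described above can be sketched as follows. The names `w1_grid`, `w2_grid` and `make_key` are hypothetical, and rounding the grid values is an assumption added here to keep floating-point noise such as `0.30000000000000004` out of the keys; the sweep further down does not round.

```python
import json
import numpy as np

# 0.0 to 5.0 in steps of 0.1; the stop value 5.1 is excluded by np.arange
w1_grid = np.round(np.arange(0.0, 5.1, 0.1), 1)
w2_grid = np.round(np.arange(0.0, 5.1, 0.1), 1)

def make_key(w1, w2):
    """Hypothetical helper: JSON keys must be strings, so stringify the tuple."""
    return str((float(w1), float(w2)))

# placeholder value for illustration only, not a real MSE
results = {make_key(w1_grid[0], w2_grid[3]): 1.0}
print(json.dumps(results))
```

Each axis has 51 values, so a full sweep evaluates 51 x 51 = 2601 parameter pairs.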

In [10]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
import gc
import json
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
%matplotlib inline
plt.rcParams['figure.figsize'] = 15,10

In [11]:
rating = sp.load_npz("rating_matrix_shrunk.npz")
isRead = sp.load_npz("isRead_matrix_shrunk.npz")
shelved = sp.load_npz("shelved_matrix_shrunk.npz")

In [12]:
#Change params here
w1_range = np.arange(4.7, 5.1, 0.1)
w2_range = np.arange(0, 5.1, 0.1 )

## Functions Needed

This function linearly combines the three matrices into the weighted matrix that we feed into the weighted user-user model. For exactly how we sum the weights and matrices, please see our final report.

In [13]:
# combines the three matrices into one, with a different weight on each signal
def threeMatricesWeight (w1, w2, shelved, isRead, rating):
    '''
    w1 : weight applied to shelved entries that are not marked as read
    w2 : weight applied to isRead entries
    Non-zero ratings override the weighted value.
    '''
    if (not isinstance(isRead, np.ndarray)):
        isRead = np.asarray(isRead.todense())
        shelved = np.asarray(shelved.todense())
        rating = np.asarray(rating.todense())
    
    isRead_zero_indices = (isRead == 0)
    
    intermediate = shelved * w1 * isRead_zero_indices
    
    intermediate = isRead * w2 + intermediate
    
    rating_zero_indices = (rating == 0)
    
    return intermediate * rating_zero_indices + rating
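As a quick sanity check of the combination rule, the sketch below applies the same arithmetic to one toy user and four books. The function name `combine_weighted` and the toy values are illustrative assumptions; the logic mirrors `threeMatricesWeight` for inputs that are already dense.

```python
import numpy as np

def combine_weighted(w1, w2, shelved, is_read, rating):
    """Hypothetical dense-only mirror of threeMatricesWeight."""
    # shelved-only entries get weight w1; entries marked as read get weight w2
    intermediate = shelved * w1 * (is_read == 0) + is_read * w2
    # where an explicit rating exists, it replaces the weighted value
    return intermediate * (rating == 0) + rating

# one user, four books:
# shelved only, read but unrated, read and rated 4, never touched
shelved = np.array([[1, 1, 1, 0]])
is_read = np.array([[0, 1, 1, 0]])
rating = np.array([[0, 0, 4, 0]])

combined = combine_weighted(0.5, 2.0, shelved, is_read, rating)
print(combined)
```

With w1 = 0.5 and w2 = 2.0, the row becomes 0.5 (shelved only), 2.0 (read, unrated), 4.0 (rated) and 0.0 (no signal).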

This function computes the Pearson similarity between every pair of users. Pairs with fewer than `min_common_items` items in common get a similarity of 0.

In [14]:
def all_pearson(X, nnz_indices, user_means, min_common_items=5):
    X_norm = (X - user_means[:,None]) * nnz_indices
    X_col_norm = (X_norm**2) @ (X_norm != 0).T
    common_items = nnz_indices.astype(float) @ nnz_indices.T
    return (X_norm @ X_norm.T) / (np.sqrt(X_col_norm*X_col_norm.T)+1e-12) * (common_items >= min_common_items)
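Reading the code above term by term, the similarity it computes can be written as follows. This is a transcription of the code, not a formula taken from our report. Here $m_{uj}$ is the non-zero indicator (`nnz_indices`), $\bar{x}_u$ is the user mean, $c_{\min}$ is `min_common_items`, and $[\cdot]$ is 1 when the condition holds and 0 otherwise.

$$
\tilde{x}_{uj} = (x_{uj} - \bar{x}_u)\, m_{uj}, \qquad c_{uv} = \sum_j m_{uj}\, m_{vj}
$$

$$
W_{uv} = \frac{\sum_j \tilde{x}_{uj}\, \tilde{x}_{vj}}
{\sqrt{\Big(\sum_j \tilde{x}_{uj}^2\, [\tilde{x}_{vj} \neq 0]\Big)\Big(\sum_j \tilde{x}_{vj}^2\, [\tilde{x}_{uj} \neq 0]\Big)} + 10^{-12}}
\; [c_{uv} \geq c_{\min}]
$$

Note that the sums in the denominator skip items where the other user's centered rating is exactly zero, which includes items the other user rated exactly at their mean.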

This function predicts a single user's rating of a single book, i.e. the entry X_(ij) for user i and book j.

In [15]:
def predict_single_user_user(X, nnz_indices, W, user_means, diff, i, j):
    """ Return prediction of X_(ij). """
    return user_means[i] + (
        np.sum(diff[:,j] * nnz_indices[:,j] * W[i,:]) / 
        (np.sum(nnz_indices[:,j] * np.abs(W[i,:])) + 1e-12)
    )
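To make the prediction rule concrete, here is a small worked example with three users and two books. The helper `predict_one` is a hypothetical copy of the rule above, and the similarity matrix `W` is made up for illustration (in the notebook it comes from `all_pearson`).

```python
import numpy as np

def predict_one(nnz, W, user_means, diff, i, j):
    """Hypothetical copy of the prediction rule, for illustration."""
    num = np.sum(diff[:, j] * nnz[:, j] * W[i, :])
    den = np.sum(nnz[:, j] * np.abs(W[i, :])) + 1e-12
    return user_means[i] + num / den

# toy data: 3 users x 2 books, 0 means "no rating"
X = np.array([[4.0, 0.0],
              [5.0, 3.0],
              [0.0, 1.0]])
nnz = (X != 0)
user_means = np.array([X[u, nnz[u]].mean() for u in range(X.shape[0])])
diff = X - user_means[:, None]

# made-up similarities (assumption, not computed from X)
W = np.array([[1.0, 0.5, -0.5],
              [0.5, 1.0, 0.0],
              [-0.5, 0.0, 1.0]])

pred = predict_one(nnz, W, user_means, diff, i=0, j=1)
print(pred)
```

User 0 has mean 4. Users 1 and 2 both rated book 1, so both weights enter the denominator (0.5 + 0.5). Only user 1 contributes to the numerator (0.5 x (3 - 4) = -0.5), because user 2 rated the book exactly at their mean. The prediction is therefore approximately 4 - 0.5 / 1.0 = 3.5.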

This is a simple MSE function.

In [16]:
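def error(prediction_array, original_array):
    """
        Mean squared error between predictions and true ratings.
        prediction_array: np array of predicted ratings
        original_array: np array of true ratings, same length
    """
    return np.sum((prediction_array - original_array) ** 2) / len(original_array)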

## Train Test Split

Note that we seed the random number generator. This ensures that every model uses the same train/test split of the matrices.

In [17]:
#Set up train data
dense_rating_matrix = np.asarray(rating.todense())
X_tr = dense_rating_matrix.copy()
X_tr = X_tr.flatten()

nonzero_pairs = np.nonzero(X_tr)[0]
num_non_zero_pairs = len(nonzero_pairs)

total_num_pairs = X_tr.shape[0]
num_testing_pairs = int(0.1 * num_non_zero_pairs)

# seeds the random generator
np.random.seed(0)

# indices of 1d array X_tr
testing_pair_indices = np.random.choice(nonzero_pairs, num_testing_pairs, replace=False)
training_pair_indices = list(set(np.arange(total_num_pairs)) - set(testing_pair_indices))

#set up test data
X_te = X_tr.copy()

# sets testing pairs in training set to be 0
X_tr[testing_pair_indices] = 0

# sets training pairs in testing set to be 0
X_te[training_pair_indices] = 0

# takes X_tr and X_te back to shape of dense_rating_matrix
X_tr = X_tr.reshape((dense_rating_matrix.shape[0], dense_rating_matrix.shape[1]))
X_te = X_te.reshape((dense_rating_matrix.shape[0], dense_rating_matrix.shape[1]))

#find where indices are nonzero
nonzero_rating_list_te = sp.find(X_te)
users_te, books_te, nonzero_ratings_te = nonzero_rating_list_te
nnz_indices_tr = (X_tr != 0)
nnz_indices_te = (X_te != 0)

#find user_means
user_means = np.array([X_tr[i,nnz_indices_tr[i,:]].mean() for i in range(X_tr.shape[0])])
# users with no training ratings yield NaN means; replace those with 0
user_means = np.nan_to_num(user_means)
diff_tr = X_tr - user_means[:,None]
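The effect of the seed can be checked on a toy matrix. The sketch below is a compact variant of the split above, not the exact code we ran: `split_nonzero` is a hypothetical helper, and it uses a local `np.random.RandomState` instead of the global seed so that the split does not depend on other random calls.

```python
import numpy as np

def split_nonzero(X, test_frac=0.1, seed=0):
    """Hypothetical helper: hold out a fraction of the non-zero entries of X."""
    rng = np.random.RandomState(seed)   # local generator, assumption of this sketch
    flat = X.flatten()
    nonzero = np.nonzero(flat)[0]
    n_test = int(test_frac * len(nonzero))
    test_idx = rng.choice(nonzero, n_test, replace=False)

    X_tr = flat.copy()
    X_tr[test_idx] = 0                  # held-out entries are hidden from training
    X_te = np.zeros_like(flat)
    X_te[test_idx] = flat[test_idx]     # test matrix keeps only held-out entries
    return X_tr.reshape(X.shape), X_te.reshape(X.shape)

X = np.arange(1, 21).reshape(4, 5)      # toy matrix with 20 non-zero entries
tr_a, te_a = split_nonzero(X, seed=0)
tr_b, te_b = split_nonzero(X, seed=0)
print(np.array_equal(tr_a, tr_b), np.count_nonzero(te_a))
```

Building the test matrix from the held-out indices directly avoids constructing a Python set over every index of the flattened matrix, which may matter for large matrices.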



## Sweep Through w1 and w2
Here we sweep through the w1 and w2 ranges that we specified at the top of the notebook. We save the test MSE results to a JSON file every time the inner w2 loop finishes.

In [18]:
param_dict = {}
print ("(w1, w2)" + "    " + "error")
for w1 in w1_range:
    last_w2 = w2_range[-1]
    for w2 in w2_range:
        #Create combined matrix
        combined_rating_matrix = threeMatricesWeight(w1, w2, shelved, isRead, rating)
        combined_rating_matrix = combined_rating_matrix.flatten()
        combined_rating_matrix[testing_pair_indices] = 0
        combined_rating_matrix = combined_rating_matrix.reshape((dense_rating_matrix.shape[0], dense_rating_matrix.shape[1]))

        nnz_indices_combined = (combined_rating_matrix != 0)
        user_means_combined = np.array([combined_rating_matrix[i,nnz_indices_combined[i,:]].mean() for i in range(combined_rating_matrix.shape[0])])
        user_means_combined = np.nan_to_num(user_means_combined)

        W_pearson_combined = all_pearson(combined_rating_matrix, nnz_indices_combined, user_means_combined)
        predictions_te_combined = []
        for index in range(len(users_te)):
            user = users_te[index]
            book = books_te[index]
            predictions_te_combined.append(predict_single_user_user(X_tr, nnz_indices_tr, W_pearson_combined,
                                                           user_means, diff_tr, user, book))
        #get the error 
        param_dict[str((w1, w2))] = error(np.asarray(predictions_te_combined), nonzero_ratings_te)
        print (str((w1, w2)) + "  " + str(param_dict[str((w1, w2))]))        
        
        #save the file when w2 finishes looping
        if w2 == last_w2:
            with open("user_user_sweep7" + ".json", "w+") as f:
                json.dump(param_dict, f)
                print ("w1 = " + str((w1)) + " saved")
                print ("")


(w1, w2)    error




(4.7, 0.0)  0.9888967951673145
(4.7, 0.1)  0.993100240989649
(4.7, 0.2)  0.9929983734457559
(4.7, 0.30000000000000004)  0.9928544633629018
(4.7, 0.4)  0.9927440537217554
(4.7, 0.5)  0.9926372309072099
(4.7, 0.6000000000000001)  0.9925004360360011
(4.7, 0.7000000000000001)  0.9924342834842992
(4.7, 0.8)  0.9922733381336448
(4.7, 0.9)  0.9922127319708536
(4.7, 1.0)  0.9920852414854237
(4.7, 1.1)  0.9919441965067892
(4.7, 1.2000000000000002)  0.9920498740629606
(4.7, 1.3)  0.9919704091075656
(4.7, 1.4000000000000001)  0.9918455451231729
(4.7, 1.5)  0.9917451411813873
(4.7, 1.6)  0.9916914967205419
(4.7, 1.7000000000000002)  0.9915261832408542
(4.7, 1.8)  0.9914758882619864
(4.7, 1.9000000000000001)  0.9913768099807765
(4.7, 2.0)  0.9913103610195539
(4.7, 2.1)  0.9911777537231365
(4.7, 2.2)  0.9911296880127536
(4.7, 2.3000000000000003)  0.9910418063094953
(4.7, 2.4000000000000004)  0.9911087120493296
(4.7, 2.5)  0.9910165811913014
(4.7, 2.6)  0.9910189620126777
(4.7, 2.7)  0.99093791172865

(4.999999999999999, 3.7)  0.9980838921142946
(4.999999999999999, 3.8000000000000003)  0.9981731357267258
(4.999999999999999, 3.9000000000000004)  0.9981748365274856
(4.999999999999999, 4.0)  0.9983290191798014
(4.999999999999999, 4.1000000000000005)  0.9985190739354101
(4.999999999999999, 4.2)  0.9984764517144764
(4.999999999999999, 4.3)  0.9984642389418885
(4.999999999999999, 4.4)  0.9984489194753521
(4.999999999999999, 4.5)  0.9984434488820503
(4.999999999999999, 4.6000000000000005)  0.998341918676403
(4.999999999999999, 4.7)  0.9983072509852264
(4.999999999999999, 4.800000000000001)  0.9983165672272336
(4.999999999999999, 4.9)  0.9984017214724886
(4.999999999999999, 5.0)  0.9971025535420045
w1 = 4.999999999999999 saved
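Once a sweep file has been written, the best pair can be recovered from it. The sketch below assumes the keys look like the tuples printed above (plain floats inside parentheses); `best_params` is a hypothetical helper, and the sample dictionary holds three entries copied from the printed output rather than the full results.

```python
import ast

def best_params(results):
    """Hypothetical helper: return ((w1, w2), mse) with the lowest MSE."""
    key = min(results, key=results.get)
    # keys were written with str((w1, w2)), so parse them back into tuples
    return ast.literal_eval(key), results[key]

# three entries copied from the printed output above
sample = {
    "(4.7, 0.0)": 0.9888967951673145,
    "(4.7, 0.1)": 0.993100240989649,
    "(4.999999999999999, 5.0)": 0.9971025535420045,
}
params, mse = best_params(sample)
print(params, mse)
```

For a real run, `sample` would be replaced by the dictionary loaded with `json.load` from the saved file, for example `user_user_sweep7.json`. If the keys were written with a different scalar representation, the parsing step would need adjusting.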



## Unweighted Rating Matrix
This is the benchmark: a user-user model on the unweighted rating matrix, which our weighted user-user model aims to beat.

In [19]:
W_pearson = all_pearson(X_tr, nnz_indices_tr, user_means)
predictions_te = []

for index in range(len(users_te)):
    user = users_te[index]
    book = books_te[index]
    predictions_te.append(predict_single_user_user(X_tr, nnz_indices_tr, W_pearson,
                                                   user_means, diff_tr, user, book))
    if (index % 1000 == 0):
        print (str(index) + " done")

0 done
1000 done
2000 done
3000 done
4000 done
5000 done
6000 done
7000 done
8000 done
9000 done
10000 done
11000 done
12000 done
13000 done
14000 done
15000 done
16000 done


### MSE

In [20]:
np.sum((np.asarray(predictions_te) - nonzero_ratings_te) ** 2) / len(nonzero_ratings_te)

0.9756388191656609