# Final Term) Recommender System - Singularities

### Measure similarity between users based on the Singularity

Measuring similarity is the first task when building a recommender system. There are many ways to measure similarity between users or items. The traditional measures include Pearson correlation, cosine similarity, and constrained Pearson correlation. However, a weakness of these methods is that they lack a contextual factor: they measure similarity only from the users' overall preferences, and any contextual information has to be inferred through those preferences. If contextual information were used directly when calculating similarity, would the recommender system be more precise? Fortunately, there is a method called "Singularity" that not only measures similarity between users but also takes contextual factors into account. This term paper focuses on how the Singularity method is implemented on a real-life dataset and analyses its effectiveness compared with the traditional measures.

Before starting, we hypothesize that a recommender system based on Singularity is more accurate than one based on the traditional measures. The reason is that people tend to behave as a group, and this phenomenon is especially common in Korea. The contextual factor includes the group preference, which makes it more critical. To verify the hypothesis, ratings are predicted with user-based, memory-based CF (specifically the basic user-based algorithm), and the result is compared with the traditional measures using Root Mean Squared Error (RMSE). RMSE measures how spread out the residuals are, so it is suitable for measuring prediction error.
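For reference, RMSE is commonly written as below. The symbols are notation introduced here, not taken from the paper: $N$ is the number of (user, item) cells being evaluated, $r_{u,i}$ is the stored rating, and $\hat{r}_{u,i}$ is the predicted rating.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)}\left(r_{u,i}-\hat{r}_{u,i}\right)^{2}}
$$

A lower value means the predictions are closer to the stored ratings on average.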

In [1]:
import numpy as np
import pandas as pd

# 1. Load the dataset

In [2]:
ui_matrix=np.load("ui_matrix.npy")
ui_matrix.shape

(380, 21601)

In [3]:
#The maximum and minimum rating value of the matrix
x = ui_matrix[np.logical_not(np.isnan(ui_matrix))]
print(np.amax(x))
print(np.amin(x))

5.0
1.0


The imported matrix has users as rows and stores as columns, and each entry is the rating a user gave to a store. The maximum rating value is 5.0 and the minimum is 1.0. As the cell above shows, the maximum rating M is only 5, which is not high, so we use d = 2 (R1 = irrelevant set, R2 = relevant set) and do not carry out the '4.3 Extended model'.

### Define threshold

In [7]:
print(np.count_nonzero(ui_matrix >= 3.0))
print(np.count_nonzero(ui_matrix < 3.0))

44273
5404




In [8]:
print(np.count_nonzero(ui_matrix >= 4.0))
print(np.count_nonzero(ui_matrix < 4.0))

29151
20526




Because the rating values range from 1.0 to 5.0, the "relevant" threshold must be clearly defined. If the dataset distributor had stated what each rating stands for (for example, 1.0: not relevant at all, 5.0: very relevant), the threshold could be defined explicitly, but this dataset gives no such description. Comparing the two cases above, k = 3.0 and k = 4.0, the gap between the number of ratings at or above the threshold and the number below it is smaller when k = 4.0 (29151 vs. 20526) than when k = 3.0 (44273 vs. 5404). A large gap would make the singularity factors too big or too small. Because k = 4.0 gives the more balanced split, we define the "relevant" threshold as 4.0.

r(u, i)>=4,  relevant

r(u, i)<4,   irrelevant
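The threshold comparison above can be sketched as a small helper that counts only observed ratings, so NaN entries never enter the comparison. This is a minimal illustration: `relevance_split` is a hypothetical name and `toy` is made-up data, not the Yelp matrix.

```python
import numpy as np

def relevance_split(ratings, threshold):
    """Count observed ratings at/above and below `threshold`, ignoring NaN."""
    observed = ratings[~np.isnan(ratings)]
    relevant = int(np.count_nonzero(observed >= threshold))
    irrelevant = int(observed.size - relevant)
    return relevant, irrelevant

# Hypothetical toy matrix; NaN marks a missing rating.
toy = np.array([[5.0, np.nan, 3.0],
                [4.0, 2.0, np.nan],
                [np.nan, 1.0, 4.0]])

for k in (3.0, 4.0):
    rel, irr = relevance_split(toy, k)
    # The smaller the gap, the more balanced the relevant/irrelevant split.
    print(k, rel, irr, abs(rel - irr))
```

On this toy data the split is 4 vs. 2 for k = 3.0 and 3 vs. 3 for k = 4.0, which mirrors the kind of comparison used to pick the threshold.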

# 2. Calculating Singularity factor

In this part, every calculation is run on a small sample matrix instead of ui_matrix so that the results are easier to verify by hand.

In [9]:
#The sample matrix is from the paper (bottom of P. 207); some values were intentionally changed to np.nan
#because the yelp dataset contains many NaN values.
trial=np.array([[4, np.nan, 5, 2, 2], [2, np.nan, 4, 4, 1], [1, 5, 4, np.nan, 2], [np.nan, 5, 5, 5, 2], [2, 5, 4, 4, 5],
               [5, 5, 2, 4, 4], [2, 4, 5, 5, 5], [1, 5, 4, np.nan, 5]] )
print(trial)

[[ 4. nan  5.  2.  2.]
 [ 2. nan  4.  4.  1.]
 [ 1.  5.  4. nan  2.]
 [nan  5.  5.  5.  2.]
 [ 2.  5.  4.  4.  5.]
 [ 5.  5.  2.  4.  4.]
 [ 2.  4.  5.  5.  5.]
 [ 1.  5.  4. nan  5.]]


In [10]:
def singularity_factor(matrix, k):
    #Counting NaN entries in the denominator would distort spi and sni,
    #so only the observed (non-NaN) ratings of each item are counted.
    spi=1-(np.count_nonzero(matrix >= k, axis=0)/np.count_nonzero(~np.isnan(matrix), axis=0))
    sni=1-(np.count_nonzero(matrix < k, axis=0)/np.count_nonzero(~np.isnan(matrix), axis=0))
    return list(spi), list(sni)

In [11]:
spi, sni=singularity_factor(trial, 4)
print(spi)
print(sni)

[0.7142857142857143, 0.0, 0.125, 0.16666666666666663, 0.5]
[0.2857142857142857, 1.0, 0.875, 0.8333333333333334, 0.5]
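The first pair of values printed above can be checked by hand. The sketch below recomputes the factors for item 0 (the first column of `trial`) step by step; the variable names are introduced here for illustration only.

```python
import numpy as np

# First column of the sample matrix `trial` (item 0); NaN = not rated.
item0 = np.array([4, 2, 1, np.nan, 2, 5, 2, 1], dtype=float)

rated = item0[~np.isnan(item0)]            # 7 users rated this item
n_relevant = np.count_nonzero(rated >= 4)  # ratings 4 and 5 -> 2 users
n_irrelevant = rated.size - n_relevant     # remaining 5 users

spi_0 = 1 - n_relevant / rated.size        # 1 - 2/7
sni_0 = 1 - n_irrelevant / rated.size      # 1 - 5/7
print(spi_0, sni_0)
```

Few users rate item 0 as relevant, so a relevant rating on it is "singular" and gets the larger factor (5/7), while an irrelevant rating is common and gets the smaller one (2/7).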




In [12]:
trial[np.isnan(trial)] = 0
print(trial)

[[4. 0. 5. 2. 2.]
 [2. 0. 4. 4. 1.]
 [1. 5. 4. 0. 2.]
 [0. 5. 5. 5. 2.]
 [2. 5. 4. 4. 5.]
 [5. 5. 2. 4. 4.]
 [2. 4. 5. 5. 5.]
 [1. 5. 4. 0. 5.]]


# Singularity Calculation

In [13]:
def singularity(ui, spi, sni):
    #Create a matrix filled with 0.0 to store all the similarities (the result is a symmetric matrix)
    Num_Users=np.size(ui, 0)
    Sim=np.full((Num_Users, Num_Users), 0.0)
    #For normalization (rui should be between 0, 1)
    ui=ui*(1/5)
    
    for i in range(Num_Users):
        for j in range(i, Num_Users):
            a, a_cnt, b, b_cnt, c, c_cnt=0, 0, 0, 0, 0, 0
            
            #Missing ratings were set to 0 beforehand; an item is skipped when either user's rating is 0
            for k in range(np.size(ui, 1)):
                if ui[i][k]>=0.8 and ui[j][k]>=0.8:
                    a+=((spi[k]**2)*(1-((ui[i][k]-ui[j][k])**2)))
                    a_cnt+=1
                elif (ui[i][k]<0.8 and 0<ui[i][k]) and (ui[j][k]<0.8 and 0<ui[j][k]):
                    b+=((sni[k]**2)*(1-(ui[i][k]-ui[j][k])**2))
                    b_cnt+=1
                elif ui[i][k]>=0.8 and (ui[j][k]>0 and ui[j][k]<0.8):
                    c+=((spi[k]*sni[k])*(1-(ui[i][k]-ui[j][k])**2))
                    c_cnt+=1
                elif ui[j][k]>=0.8 and (ui[i][k]>0 and ui[i][k]<0.8):
                    c+=((spi[k]*sni[k])*(1-(ui[i][k]-ui[j][k])**2))
                    c_cnt+=1
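            #Note: this is a single if/elif/else chain, so the sums are divided by their
            #counts only when all three counts are nonzero; otherwise they stay as raw sums.
            if a_cnt==0: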
                a=0
            elif b_cnt==0:
                b=0
            elif c_cnt==0:
                c=0
            else:
                a=a/a_cnt
                b=b/b_cnt
                c=c/c_cnt
            Sim[i][j]=(1/3)*(a+b+c)
            Sim[j][i]=Sim[i][j]
    return Sim

In [14]:
def singularity_2(ui, spi, sni):
    #Create a matrix filled with 0.0 to store all the similarities (the result is a symmetric matrix)
    Num_Users=np.size(ui, 0)
    Sim=np.full((Num_Users, Num_Users), 0.0)
    #For normalization (rui should be between 0, 1)
    ui=ui*(1/5)
    
    for i in range(Num_Users):
        for j in range(i, Num_Users):
            a, a_cnt, b, b_cnt, c, c_cnt=0, 0, 0, 0, 0, 0
            
            #Unlike singularity, zeros (missing ratings) are kept and treated as irrelevant ratings
            for k in range(np.size(ui, 1)):
                if ui[i][k]>=0.8 and ui[j][k]>=0.8:
                    a+=((spi[k]**2)*(1-((ui[i][k]-ui[j][k])**2)))
                    a_cnt+=1
                elif (ui[i][k]<0.8 and 0<=ui[i][k]) and (ui[j][k]<0.8 and 0<=ui[j][k]):
                    b+=((sni[k]**2)*(1-(ui[i][k]-ui[j][k])**2))
                    b_cnt+=1
                elif ui[i][k]>=0.8 and (ui[j][k]>=0 and ui[j][k]<0.8):
                    c+=((spi[k]*sni[k])*(1-(ui[i][k]-ui[j][k])**2))
                    c_cnt+=1
                elif ui[j][k]>=0.8 and (ui[i][k]>=0 and ui[i][k]<0.8):
                    c+=((spi[k]*sni[k])*(1-(ui[i][k]-ui[j][k])**2))
                    c_cnt+=1
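            #Note: this is a single if/elif/else chain, so the sums are divided by their
            #counts only when all three counts are nonzero; otherwise they stay as raw sums.
            if a_cnt==0: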
                a=0
            elif b_cnt==0:
                b=0
            elif c_cnt==0:
                c=0
            else:
                a=a/a_cnt
                b=b/b_cnt
                c=c/c_cnt
            Sim[i][j]=(1/3)*(a+b+c)
            Sim[j][i]=Sim[i][j]
    return Sim
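Read as a formula, the two functions above appear to target the quantity below. This is a reading of the code, not a quotation from the paper. Here $r$ is a rating divided by 5, $s_P^i$ and $s_N^i$ are the entries of `spi` and `sni` for item $i$, and the sets $A$, $B$, $C$ (named after the variables `a`, `b`, `c`) contain the items that both users rate as relevant, both rate as irrelevant, and rate differently, respectively.

$$
\mathrm{sim}(u,v)=\frac{1}{3}\left[
\frac{1}{|A|}\sum_{i\in A}\left(s_P^i\right)^2\left(1-(r_{u,i}-r_{v,i})^2\right)
+\frac{1}{|B|}\sum_{i\in B}\left(s_N^i\right)^2\left(1-(r_{u,i}-r_{v,i})^2\right)
+\frac{1}{|C|}\sum_{i\in C}s_P^i\,s_N^i\left(1-(r_{u,i}-r_{v,i})^2\right)
\right]
$$

A term whose set is empty is taken as 0.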

In [15]:
s=singularity(trial, spi, sni)
dataframe=pd.DataFrame.from_records(s)
print(dataframe)

          0         1         2         3         4         5         6  \
0  0.490091  0.133016  0.131871  0.118171  0.154365  0.295488  0.145314   
1  0.133016  0.125012  0.111331  0.093889  0.064445  0.136755  0.064155   
2  0.131871  0.111331  0.115753  0.088333  0.082060  0.125115  0.081956   
3  0.118171  0.093889  0.088333  0.097801  0.067222  0.102222  0.067801   
4  0.154365  0.064445  0.082060  0.067222  0.125012  0.163422  0.124433   
5  0.295488  0.136755  0.125115  0.102222  0.163422  0.517869  0.155760   
6  0.145314  0.064155  0.081956  0.067801  0.124433  0.155760  0.125012   
7  0.101871  0.061331  0.083148  0.058333  0.114664  0.135115  0.114456   

          7  
0  0.101871  
1  0.061331  
2  0.083148  
3  0.058333  
4  0.114664  
5  0.135115  
6  0.114456  
7  0.115753  


In [16]:
ss=singularity_2(trial, spi, sni)
dataframe=pd.DataFrame.from_records(ss)
print(dataframe)

          0         1         2         3         4         5         6  \
0  0.823425  0.259683  0.165658  0.106581  0.154365  0.295488  0.145314   
1  0.259683  0.458345  0.066603  0.058373  0.049445  0.136755  0.049155   
2  0.165658  0.066603  0.347234  0.057228  0.063727  0.141781  0.055289   
3  0.106581  0.058373  0.057228  0.125012  0.080820  0.102222  0.081013   
4  0.154365  0.049445  0.063727  0.080820  0.125012  0.163422  0.124433   
5  0.295488  0.136755  0.141781  0.102222  0.163422  0.517869  0.155760   
6  0.145314  0.049155  0.055289  0.081013  0.124433  0.155760  0.125012   
7  0.231735  0.046886  0.185284  0.055289  0.072303  0.151781  0.055567   

          7  
0  0.231735  
1  0.046886  
2  0.185284  
3  0.055289  
4  0.072303  
5  0.151781  
6  0.055567  
7  0.347234  


The obvious difference between Singularity and the traditional measures is that the diagonal elements are not 1. The similarity no longer depends on preference alone, because the singularity factors are applied.
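In `singularity` and `singularity_2`, the three sums are divided by their counts only when all three counts are nonzero. The sketch below shows a variant that averages each term on its own, assuming that is the intended behaviour. It is a hypothetical alternative (`singularity_independent_avg` and its parameters are names introduced here), it expects NaN rather than 0 for missing ratings, and it will not reproduce the numbers printed above.

```python
import numpy as np

def singularity_independent_avg(ui, spi, sni, k=4.0, max_rating=5.0):
    """Hypothetical variant: each of the three terms is averaged separately."""
    ui = np.asarray(ui, dtype=float)
    spi = np.asarray(spi, dtype=float)
    sni = np.asarray(sni, dtype=float)

    observed = ~np.isnan(ui)
    filled = np.where(observed, ui, 0.0)   # placeholder so comparisons never see NaN
    rel = observed & (filled >= k)         # relevant ratings
    irr = observed & (filled < k)          # irrelevant ratings (missing excluded)
    r = filled / max_rating                # normalise ratings to [0, 1]

    n_users = ui.shape[0]
    sim = np.zeros((n_users, n_users))
    for u in range(n_users):
        for v in range(u, n_users):
            closeness = 1 - (r[u] - r[v]) ** 2
            terms = [
                (rel[u] & rel[v], spi ** 2),                         # both relevant
                (irr[u] & irr[v], sni ** 2),                         # both irrelevant
                ((rel[u] & irr[v]) | (irr[u] & rel[v]), spi * sni),  # one of each
            ]
            total = 0.0
            for mask, weight in terms:
                if mask.any():             # an empty term contributes 0
                    total += np.mean(weight[mask] * closeness[mask])
            sim[u, v] = sim[v, u] = total / 3
    return sim
```

With this normalisation every entry stays within [0, 1], because each term is an average of values in that range.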

In [17]:
# Basic CF
def basic_CF(mat, Sim, k):
    predicted_rating = np.array([[0.0 for col in range(np.size(mat, 1))] for row in range(np.size(mat, 0))])
        
    k_neighbors = np.argsort(-Sim)
    k_neighbors = np.delete(k_neighbors,np.s_[k:],1)
    
    NumUsers = np.size(mat,axis=0)
    
    for u in range(NumUsers):
        list_sim = Sim[u,k_neighbors[u,]]
        list_rating = mat[k_neighbors[u,],].astype('float64')
        
        predicted_rating[u,] = np.sum(list_sim.reshape(-1,1)*list_rating,axis=0)/np.sum(list_sim)
        
    return predicted_rating
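Written as a formula, `basic_CF` computes a similarity-weighted average of the neighbours' ratings. The notation is introduced here for illustration: $N_k(u)$ is the set of the $k$ users with the largest similarity to $u$ in `Sim`, and $r_{v,i}$ is the entry of `mat` for user $v$ and item $i$.

$$
\hat{r}_{u,i}=\frac{\sum_{v\in N_k(u)}\mathrm{sim}(u,v)\,r_{v,i}}{\sum_{v\in N_k(u)}\mathrm{sim}(u,v)}
$$

Two details follow from the code: $N_k(u)$ may contain $u$ itself, because the diagonal of `Sim` is not excluded, and a missing rating stored as 0 enters the numerator as a rating of 0.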

In [18]:
pre=basic_CF(trial, s, 3)
print(pre)

[[3.98591125 2.39297613 3.8926703  2.95719045 3.12141844]
 [3.71308617 1.73202892 3.64412284 3.32613118 2.37615176]
 [3.40402427 3.23105352 3.6824611  2.05023498 2.67132819]
 [3.09180853 3.14309617 4.03622872 3.56460061 2.64251419]
 [3.80442135 3.2569379  3.61048142 3.30277516 3.58509724]
 [4.19556703 3.48743788 3.24215097 3.39497515 3.562282  ]
 [3.7787675  3.00138147 3.90332151 3.61130699 3.61130699]
 [2.79224873 5.         3.26072139 2.73332325 4.63036069]]


In [19]:
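pre_1=basic_CF(trial, ss, 3)
#Note: this prints `pre` (from In [18]) rather than `pre_1`, so the output below repeats the previous cell.
print(pre)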

[[3.98591125 2.39297613 3.8926703  2.95719045 3.12141844]
 [3.71308617 1.73202892 3.64412284 3.32613118 2.37615176]
 [3.40402427 3.23105352 3.6824611  2.05023498 2.67132819]
 [3.09180853 3.14309617 4.03622872 3.56460061 2.64251419]
 [3.80442135 3.2569379  3.61048142 3.30277516 3.58509724]
 [4.19556703 3.48743788 3.24215097 3.39497515 3.562282  ]
 [3.7787675  3.00138147 3.90332151 3.61130699 3.61130699]
 [2.79224873 5.         3.26072139 2.73332325 4.63036069]]


In [20]:
a_total=0
for i in range(8):
    for j in range(5):
        a=(trial[i][j]-pre[i][j])**2
        a_total+=a
total=a_total/40
total=np.sqrt(total)
print(total)

1.4332249946804914


In [21]:
a_total=0
for i in range(8):
    for j in range(5):
        a=(trial[i][j]-pre_1[i][j])**2
        a_total+=a
total=a_total/40
total=np.sqrt(total)
print(total)

1.1282283221107103


The paper does not clearly state how to deal with NaN values, which make up more than half of ui_matrix. Two experiments were conducted to compare the results of including and excluding NaN values. The NaN values were converted to 0; `singularity` excludes the zeros from the calculation and `singularity_2` includes them. Ratings were predicted with the basic CF algorithm, and 'pre' (from singularity) and 'pre_1' (from singularity_2) were evaluated with RMSE to measure their accuracy. As the cells above show, the RMSE of pre_1 (1.128) is smaller than that of pre (1.433), so singularity_2 is used from this stage forward.

# Singularity Calculation - yelp dataset

In [22]:
spi, sni=singularity_factor(ui_matrix, 4.0)



In [23]:
print(min(spi))
print(max(spi))
print(sum(spi)/len(spi))
print()
print(min(sni))
print(max(sni))
print(sum(sni)/len(sni))

0.0
1.0
0.4250973946446734

0.0
1.0
0.5749026053553248


In [24]:
ui_matrix[np.isnan(ui_matrix)] = 0
print(ui_matrix)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [25]:
Sim=singularity_2(ui_matrix, spi, sni)

In [26]:
np.save("singularity_2", Sim)

In [27]:
a=np.load("singularity_2.npy")
print(a)

[[3.80649525e+03 3.78487058e+03 2.12497331e-01 ... 1.92102008e-01
  3.80305098e+03 3.80491255e+03]
 [3.78487058e+03 3.79615034e+03 1.92945180e-01 ... 2.03280864e-01
  2.52001853e-01 3.79437808e+03]
 [2.12497331e-01 1.92945180e-01 3.80125542e+03 ... 1.95895293e-01
  3.79806323e+03 3.79982147e+03]
 ...
 [1.92102008e-01 2.03280864e-01 1.95895293e-01 ... 3.74172558e+03
  3.73481787e+03 3.73666216e+03]
 [3.80305098e+03 2.52001853e-01 3.79806323e+03 ... 3.73481787e+03
  3.81316893e+03 3.81284013e+03]
 [3.80491255e+03 3.79437808e+03 3.79982147e+03 ... 3.73666216e+03
  3.81284013e+03 3.81479126e+03]]


In [28]:
df=pd.DataFrame.from_records(a)
print(df)

             0            1            2            3            4    \
0    3806.495248  3784.870582     0.212497     0.209656     0.214731   
1    3784.870582  3796.150339     0.192945  3780.663561     0.215027   
2       0.212497     0.192945  3801.255423  3786.036402     0.213271   
3       0.209656  3780.663561  3786.036402  3803.141447     0.194515   
4       0.214731     0.215027     0.213271     0.194515  3809.450983   
5    3803.321424     0.206492  3798.108042     0.190036     0.209266   
6       0.225973     0.215411     0.192541     0.316078     0.216563   
7    3802.152810  3791.591662  3797.035047     0.241817     0.199487   
8    3802.228394  3791.768137  3797.190979  3798.268859  3805.999834   
9       0.205240  3791.254493  3796.594533  3797.672413     0.205075   
10   3804.905071  3794.382347  3799.798334  3800.876214  3808.630674   
11      0.183254  3785.551106  3790.994492  3792.072372  3799.803347   
12   3801.172665     0.188308  3796.159902  3797.237782     0.21

In [29]:
#Check the shape of the singularity matrix
df.shape

(380, 380)

In [30]:
predicted=basic_CF(ui_matrix, a, 10)
print(predicted)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [31]:
ui_matrix[np.isnan(ui_matrix)] = 0

In [32]:
df_2=pd.DataFrame.from_records(predicted)
print(df_2)

     0      1      2      3      4      5         6      7      8      9      \
0      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
1      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
2      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
3      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
4      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
5      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
6      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
7      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
8      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
9      0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
10     0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0    0.0    0.0   
11     0.0    0.0    0.0    0.0    0.0  

In [33]:
print(np.amax(predicted))
print(np.amin(predicted))

2.050559632082257
0.0


In [34]:
a_total=0
for i in range(380):
    for j in range(21601):
        a=(ui_matrix[i][j]-predicted[i][j])**2
        a_total+=a
total=a_total/(380*21601)
total=np.sqrt(total)
print(total)

0.26400356577204825


The RMSE of the Singularity-based predictions on the Yelp dataset is about 0.264.
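The RMSE above is averaged over all 380 × 21601 cells, including the cells where a missing rating was replaced by 0. A possible alternative, sketched here under the assumption that 0 always marks a missing rating, restricts the error to cells that hold an actual rating. `rmse_observed` and `missing_value` are hypothetical names, and the arrays are toy data, not the Yelp matrix.

```python
import numpy as np

def rmse_observed(actual, predicted, missing_value=0.0):
    """RMSE over the cells where `actual` holds a real rating."""
    mask = actual != missing_value
    if not mask.any():
        return float("nan")                 # nothing to evaluate
    diff = actual[mask] - predicted[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: 0 marks a missing rating.
actual = np.array([[5.0, 0.0], [0.0, 3.0]])
predicted = np.array([[4.0, 0.0], [0.0, 1.0]])

print(rmse_observed(actual, predicted))             # error on the 2 rated cells
print(np.sqrt(np.mean((actual - predicted) ** 2)))  # error over all 4 cells
```

On the toy data the all-cells value is smaller than the observed-only value, because the zero-filled cells match the zero predictions exactly and dilute the average.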

# Comparing with the traditional similarity measure

The traditional similarity matrices come from the earlier assignments.

In [99]:
cos=np.load("COS_similarity.npy")
pcc=np.load("PCC_similiarity.npy")
jac=np.load("JAC_similarity.npy")

In [119]:
cos_pre=basic_CF(ui_matrix, cos, 10)
print(cos_pre)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.31436986]
 [0.         0.         0.         ... 0.         0.         1.07110232]
 ...
 [0.         0.         0.         ... 0.27355056 0.04521172 0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [120]:
pcc_pre=basic_CF(ui_matrix, pcc, 10)
print(pcc_pre)

[[0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.5 ... 0.  0.  0.5]
 ...
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0.4]
 [0.  0.  0.  ... 0.  0.  0. ]]


In [121]:
jac_pre=basic_CF(ui_matrix, jac, 10)
print(jac_pre)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.39611398]
 ...
 [0.         0.         0.         ... 0.         0.0286196  0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [123]:
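def RMSE(ui, pre):
    #Root mean squared error over every cell of ui (zeros for missing ratings included)
    a_total=0
    for i in range(np.size(ui, 0)):
        for j in range(np.size(ui, 1)):
            #Use the ui argument rather than the global ui_matrix
            a=(ui[i][j]-pre[i][j])**2
            a_total+=a
    #Divide by the number of cells of ui instead of a hard-coded 380*21601
    total=a_total/np.size(ui)
    total=np.sqrt(total)
    return total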

In [124]:
print(RMSE(ui_matrix, pcc_pre))
print(RMSE(ui_matrix, cos_pre))
print(RMSE(ui_matrix, jac_pre))

0.2907707842254746
0.15984225940824584
0.1120672480612189


# Conclusion:

The RMSE of the Singularity-based predictions (0.264) is lower than that of PCC (0.291) but higher than those of cosine (0.160) and Jaccard (0.112). We hypothesized that Singularity would be more accurate than the traditional measures; however, the hypothesis cannot be confirmed, because some of the traditional measures are more accurate than Singularity. From these results, we conclude that Singularity is not the best method, but it is good enough to be used for measuring similarity.