# Collaborative Filtering

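Prepare the dataset. We use the Small [dataset](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) that MovieLens recommends for education and development. It contains roughly 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.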

In [1]:
import os

# working directory
BASEDIR = os.getcwd()
print(BASEDIR)

/Users/hailin/Git/MLFM


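Read the MovieLens dataset and build the rating matrix. Rows correspond to users and columns to movies; an entry of 0 means the user has not rated that movie.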

In [2]:
import pandas as pd
import numpy as np

# load the (userId, movieId, rating, timestamp) records
dataframe = pd.read_csv(os.path.join(BASEDIR, 'assets', 'datasets', 'ml-latest-small', 'ratings.csv'))

userId_unique = dataframe.userId.unique()
movieId_unique = dataframe.movieId.unique()

# map raw ids to contiguous row / column indices
userId_dict = {userId: idx for idx, userId in enumerate(userId_unique)}
movieId_dict = {movieId: idx for idx, movieId in enumerate(movieId_unique)}

# rows are users, columns are movies; 0 means "not rated"
ratings = np.zeros(shape=(len(userId_dict), len(movieId_dict)))

for row in dataframe.itertuples():
    ratings[userId_dict[row.userId], movieId_dict[row.movieId]] = row.rating


## UserCF 

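Based on the users' movie ratings, for any user $u$ we find the $N$ users most similar to $u$ (Top 10 here) and predict the score of $u$ for an unseen movie $m$ as: $\mathrm{Rate}(u, m)=\frac{\sum_{s \in S} w_{u, s}\,\mathrm{Rate}(s, m)}{\sum_{s \in S} w_{u,s}}$, where $S$ is the set of those $N$ similar users and $w_{u,s}$ is the similarity between $u$ and $s$.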
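
As a quick sanity check of the formula, here is a minimal sketch with made-up numbers. The helper name `predict_rating` and the toy weights and ratings are hypothetical and only illustrate the weighted average; they are not part of the notebook.

```python
import numpy as np


def predict_rating(weights, neighbour_ratings):
    """Similarity-weighted average of the neighbours' ratings for one movie."""
    weights = np.asarray(weights, dtype=float)
    neighbour_ratings = np.asarray(neighbour_ratings, dtype=float)
    return float(np.dot(weights, neighbour_ratings) / weights.sum())


# Hypothetical toy values: three neighbours s of user u
w = [0.5, 0.25, 0.25]  # similarities w_{u,s}
r = [4.0, 2.0, 4.0]    # their ratings Rate(s, m)

# (0.5*4 + 0.25*2 + 0.25*4) / (0.5 + 0.25 + 0.25) = 3.5
print(predict_rating(w, r))
```

The most similar neighbour pulls the prediction towards its own rating, which is the point of weighting by $w_{u,s}$.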

So we first need to compute the user-user similarity matrix, using the cosine similarity between the users' rating vectors.

In [3]:
# first construct user-user collaborative matrix
# (cosine similarity between every pair of user rating vectors)

user_norms = np.linalg.norm(ratings, axis=1)
user_user_cm = np.dot(ratings, ratings.T) / np.outer(user_norms, user_norms)

# set diagonal value to exactly 1.0 (every user is identical to itself)
np.fill_diagonal(user_user_cm, 1.0)


On top of the user-user similarity matrix, we find the Top N most similar users for each user and then use those N users to fill in the missing values of that user's rating vector.

In [4]:
# create a new rating matrix
filled_ratings = np.zeros(shape=(len(userId_dict), len(movieId_dict)))


In [5]:
# for the user at row index 1, fill its rating vector
N = 10

# top-N neighbours; skip position 0, which is the user itself (similarity 1.0)
top_n_idx = user_user_cm[1, :].argsort()[::-1][1:N + 1]
sum_weights = user_user_cm[1, top_n_idx].sum()

for movieIdx in range(len(movieId_unique)):
    if ratings[1, movieIdx] == 0.0:
        # unrated movie: similarity-weighted average of the neighbours' ratings
        weighted_sum = np.dot(user_user_cm[1, top_n_idx], ratings[top_n_idx, movieIdx])
        filled_ratings[1, movieIdx] = weighted_sum / sum_weights
    else:
        # already rated: keep the user's own rating
        filled_ratings[1, movieIdx] = ratings[1, movieIdx]


print(ratings[1, 0:10])
print(filled_ratings[1, 0:10])


Fill the entire user rating matrix.

In [6]:
N = 5
for i in range(len(userId_dict)):
    # top-N neighbours of user i; skip position 0, which is user i itself
    top_n_idx = user_user_cm[i, :].argsort()[::-1][1:N + 1]
    sum_weights = user_user_cm[i, top_n_idx].sum()

    for movieIdx in range(len(movieId_unique)):
        if ratings[i, movieIdx] == 0.0:
            # unrated movie: similarity-weighted average of the neighbours' ratings
            weighted_sum = np.dot(user_user_cm[i, top_n_idx], ratings[top_n_idx, movieIdx])
            filled_ratings[i, movieIdx] = weighted_sum / sum_weights
        else:
            # already rated: keep the user's own rating
            filled_ratings[i, movieIdx] = ratings[i, movieIdx]


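In practice, UserCF runs into two problems:

- If the original rating matrix is sparse, the computed similarities can be heavily biased.
- As the number of users $N$ grows, the similarity matrix needs $O(N^2)$ storage and computation, which is unacceptable at scale.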
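
To make the $O(N^2)$ storage cost concrete, the following back-of-the-envelope sketch estimates the size of a dense user-user similarity matrix. The helper `dense_similarity_bytes` and the user counts other than 600 are hypothetical and chosen only for illustration; `float64` storage is assumed because that is NumPy's default for `np.zeros`.

```python
import numpy as np


def dense_similarity_bytes(n_users, dtype=np.float64):
    """Bytes needed to store a dense n_users x n_users similarity matrix."""
    return n_users * n_users * np.dtype(dtype).itemsize


# 600 matches the Small dataset; the larger counts are illustrative only
for n in (600, 100_000, 10_000_000):
    gigabytes = dense_similarity_bytes(n) / 1e9
    print(f"{n:>10,} users -> {gigabytes:,.2f} GB")
```

With 8 bytes per entry, 100,000 users already require 80 GB for the matrix alone, before any computation.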

## ItemCF

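Based on the users who have watched each movie, we compute the similarity between movies and obtain a movie-movie similarity matrix. Then, based on a user's movie ratings, we recommend the Top N most similar movies to that user.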

First compute the movie-movie similarity matrix.

In [7]:
# first construct movie-movie collaborative matrix
# (cosine similarity between every pair of movie rating columns)

movie_norms = np.linalg.norm(ratings, axis=0)
movie_movie_cm = np.dot(ratings.T, ratings) / np.outer(movie_norms, movie_norms)

# set diagonal value to exactly 1.0 (every movie is identical to itself)
np.fill_diagonal(movie_movie_cm, 1.0)

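
The notebook stops at the similarity matrix and does not show the recommendation step itself. Below is one possible sketch on toy data. The function `recommend_items`, the toy matrix, and the scoring rule (similarity-weighted sum of the user's existing ratings) are assumptions made for illustration, not the author's method.

```python
import numpy as np


def recommend_items(ratings, item_sim, user_idx, n=10):
    """Return the indices of up to n unrated items with the highest scores.

    Assumed scoring rule: score(m) = sum over items k of sim(m, k) * rating(user, k).
    """
    user_ratings = ratings[user_idx, :]
    scores = item_sim @ user_ratings
    # never recommend items the user has already rated
    scores[user_ratings > 0] = -np.inf
    order = np.argsort(scores)[::-1]
    n_unrated = int((user_ratings == 0).sum())
    return order[:min(n, n_unrated)]


# Toy data: 3 users x 4 movies, 0 means "not rated"
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
])

# cosine similarity between movie columns
norms = np.linalg.norm(toy_ratings, axis=0)
toy_sim = (toy_ratings.T @ toy_ratings) / np.outer(norms, norms)

print(recommend_items(toy_ratings, toy_sim, user_idx=0, n=2))
```

For user 0, movie 2 ranks ahead of movie 3 because movie 2 shares a rater with the movies that user 0 already rated, while movie 3 does not. On the real data, `ratings` and `movie_movie_cm` would take the place of the toy arrays.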