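Datawhale October Team Study: Recommender Systems

Task02: Collaborative Filtering

**"Collaborative filtering" is a recommendation process that pools everyone's feedback, ratings, and opinions to filter a huge volume of information and pick out what the target user is likely to be interested in.**

* User-based collaborative filtering (UserCF): users are similar, so recommend to a user the items liked by other users with similar interests.
* Item-based collaborative filtering (ItemCF): items are similar, so recommend to a user items similar to the ones they liked before.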


### 1. Similarity Measures

Take user similarity as an example. For users $u$ and $v$, let $N(u)$ be the set of items user $u$ has interacted with, and $N(v)$ the set of items user $v$ has interacted with.

* Jaccard similarity:
<div align=center><font size=3.5 width="50%" height="50%">$W_{uv}=\frac{|N(u)\cap N(v)|}{|N(u)\cup N(v)|}$</font></div>

* Cosine similarity:
<div align=center><font size=3.5 width="50%" height="50%">$W_{uv}=\frac{|N(u)\cap N(v)|}{\sqrt{|N(u)||N(v)|}}$</font></div>

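A drawback of cosine similarity is that it ignores each user's average rating, that is, the offset between different users' rating scales. For example, a user with demanding taste who is picky about items gives low scores to nearly everything recommended, which makes that user hard to match with others. The remedy is to account for this per-user bias.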
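
The two set-based measures above can be sketched in a few lines. This is a minimal illustration, not the code used later in these notes; the function names and the toy interaction sets are made up for the example.

```python
import math

def jaccard(n_u: set, n_v: set) -> float:
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|; taken as 0.0 when both sets are empty."""
    union = n_u | n_v
    return len(n_u & n_v) / len(union) if union else 0.0

def cosine(n_u: set, n_v: set) -> float:
    """|N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|); taken as 0.0 if either set is empty."""
    if not n_u or not n_v:
        return 0.0
    return len(n_u & n_v) / math.sqrt(len(n_u) * len(n_v))

# Hypothetical interaction sets for two users
n_u = {"a", "b", "c", "d"}
n_v = {"c", "d", "e"}

print(jaccard(n_u, n_v))  # 2 shared items out of 5 distinct items
print(cosine(n_u, n_v))   # 2 / sqrt(4 * 3)
```

Both measures only look at which items were touched, not at the rating values, which is why the rating-scale bias has to be handled separately.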

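* Pearson correlation coefficient:

<div align=center><font size=3.5 width="50%" height="50%">$sim(u, v)=\frac{\sum_{i \in I}(r_{ui}-\bar{r}_u)(r_{vi}-\bar{r}_v)}{\sqrt{\sum_{i \in I}(r_{ui}-\bar{r}_u)^2} \sqrt{\sum_{i \in I}(r_{vi}-\bar{r}_v)^2} }$</font></div>

Here $r_{ui}$ is user $u$'s rating of item $i$, $\bar{r}_u$ is user $u$'s average rating, and $I$ is the set of items over which the two users are compared (usually the items both have rated). By subtracting each user's average from their individual ratings, the Pearson correlation coefficient reduces the effect of per-user rating bias.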
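
A small sketch of the Pearson formula on rating dictionaries may help. It assumes that $I$ is the set of co-rated items and that each mean is taken over those co-rated items (one common convention; averaging over all of a user's ratings is another). The function name and the toy users are hypothetical.

```python
import math

def pearson(r_u: dict, r_v: dict) -> float:
    """Pearson correlation over the items both users rated (assumed meaning of I)."""
    common = r_u.keys() & r_v.keys()
    if len(common) < 2:
        return 0.0
    mean_u = sum(r_u[i] for i in common) / len(common)
    mean_v = sum(r_v[i] for i in common) / len(common)
    num = sum((r_u[i] - mean_u) * (r_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((r_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((r_v[i] - mean_v) ** 2 for i in common))
    den = den_u * den_v
    return num / den if den else 0.0

# A harsh rater and a generous rater with the same ordering of preferences
harsh = {"m1": 1, "m2": 2, "m3": 3}
generous = {"m1": 3, "m2": 4, "m3": 5}
opposite = {"m1": 5, "m2": 4, "m3": 3}

print(pearson(harsh, generous))  # close to 1: same taste, different scale
print(pearson(harsh, opposite))  # close to -1: reversed taste
```

The harsh and generous raters never give the same score, yet after mean-centering they come out as almost perfectly correlated, which is the bias correction described above.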

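### 2. User-Based Collaborative Filtering

UserCF recommends to a user the items liked by the K users whose interests are most similar to theirs, ranked by predicted interest and cut to the top N. (The detailed theory is skipped here; the point is mainly to practice the code.)

The dataset is MovieLens 1M, which contains about 1 million ratings from more than 6,000 users on roughly 4,000 movies. It is a rating dataset: users rate movies on five levels (1 to 5). The dataset has been manually cleaned, with much of the sparse data removed. Dataset URL: [https://grouplens.org/datasets/movielens/1m/](https://grouplens.org/datasets/movielens/1m/)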

In [1]:
import os
import math
import random
from pandas import DataFrame

class UserBasedCF:
    def __init__(self, path):
        self.train={}     # user-item rating table, training set
        self.test={}      # user-item rating table, test set
        self.generate_dataset(path)
    
    def loadfile(self, path):
        with open(path, 'r', encoding='utf-8') as fp:
            for i, line in enumerate(fp):
                yield line.strip('\r\n')
    
    def generate_dataset(self, path, pivot=0.7):
        # Read the file and build the user-item rating tables for the training and test sets
        i = 0
        for line in self.loadfile(path):
            user, movie, rating, _ = line.split('::')
            if i <= 10:
                print('{}, {}, {}, {}'.format(user, movie, rating, _))
            i += 1
            if random.random() < pivot:
                self.train.setdefault(user, {})
                self.train[user][movie] = int(rating)
            else:
                self.test.setdefault(user, {})
                self.test[user][movie] = int(rating)
                
    def UserSimilarity(self):
        # Build the item-to-users inverted index (the data starts as a user-to-items table and is flipped here)
        self.item_users = dict()
        for user, items in self.train.items():
            for i in items.keys():
                if i not in self.item_users:
                    self.item_users[i] = set()
                self.item_users[i].add(user)
        
        # Compute the user-user co-occurrence matrix
        C = dict()     # user-user co-occurrence matrix
        N = dict()     # number of items each user has interacted with
        for i, users in self.item_users.items():
            for u in users:
                N.setdefault(u, 0)
                N[u] += 1
                C.setdefault(u, {})
                for v in users:                  # this nested loop is still fairly slow; is there a better way to compute it?
                    if u == v:
                        continue
                    C[u].setdefault(v, 0)
                    C[u][v] += 1
        
        # Compute user-user similarity (cosine similarity)
        self.W = dict()    # similarity matrix
        for u, related_users in C.items():    # .items() returns an iterable view of (key, value) pairs
            self.W.setdefault(u, {})
            for v, cuv in related_users.items():
                self.W[u][v] = cuv / math.sqrt(N[u] * N[v])    # plain cosine similarity
        return self.W, C, N
    
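    # Recommend items to user u based on the K most similar users
    def Recommend(self, u, K = 3, N = 10):
        rank = dict()
        action_item = self.train[u].keys()   # items that user u has interacted with
        # v: a neighbouring user
        # wuv: similarity between user u and user v
        for v, wuv in sorted(self.W[u].items(), key = lambda x : x[1], reverse = True)[0:K]:
            # iterate over the K users most similar to u
            # i: an item that user v has interacted with
            # rvi: user v's rating of item i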
            for i, rvi in self.train[v].items():
                if i in action_item:
                    continue
                rank.setdefault(i, 0)
                # interest of user u in item i: similarity of u and v times v's rating of i
                rank[i] += wuv * rvi
        return dict(sorted(rank.items(), key=lambda x : x[1], reverse=True)[0:N])    # keep the top N recommendations
    
    # Compute recall and precision
    def recallAndPrecision(self, k=8, nitem=10):
        hit = 0
        recall = 0
        precision = 0
        for user, items in self.test.items():
            rank = self.Recommend(user, K=k, N=nitem)
            hit += len(set(rank.keys()) & set(items.keys()))
            recall += len(items)
            precision += nitem
        return (hit / (recall * 1.0), hit / (precision * 1.0))
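
On the question raised in the comment inside `UserSimilarity` (whether the nested loop can be avoided): one possible answer is to express the co-occurrence counts as a matrix product. The sketch below assumes NumPy is available and uses a dense matrix on toy data; the function name is hypothetical, and at the scale of a real dataset a sparse matrix representation would likely be needed instead.

```python
import numpy as np

def cosine_user_similarity(train):
    """train: {user: {item: rating}} -> (sorted user list, dense similarity matrix)."""
    users = sorted(train)
    items = sorted({i for rated in train.values() for i in rated})
    col = {item: j for j, item in enumerate(items)}

    # Binary user-item matrix: M[a, j] = 1 if user a interacted with item j
    M = np.zeros((len(users), len(items)))
    for a, u in enumerate(users):
        for i in train[u]:
            M[a, col[i]] = 1.0

    C = M @ M.T                      # C[a, b] = |N(u_a) ∩ N(u_b)|
    n = np.diag(C).copy()            # n[a] = |N(u_a)|
    W = C / np.sqrt(np.outer(n, n))  # cosine similarity, same formula as the loop version
    np.fill_diagonal(W, 0.0)         # drop self-similarity, as the loop version skips u == v
    return users, W

# Hypothetical toy data in the same shape as self.train
train = {
    "u1": {"a": 5, "b": 3},
    "u2": {"b": 4, "c": 2},
    "u3": {"c": 1},
}
users, W = cosine_user_similarity(train)
print(users)
print(W)
```

The product `M @ M.T` counts shared items for every user pair at once, so the Python-level double loop over users disappears; the trade-off is the memory needed to hold the matrices.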

In [2]:
def print_2_dim_dic(dic, n=3):
    # Print at most n inner entries for each of the first n outer keys
    for row, (u, v_cnt) in enumerate(dic.items()):
        if row >= n:
            break
        for col, (v, cnt) in enumerate(v_cnt.items()):
            if col >= n:
                break
            print(u, v, cnt)

def print_1_dim_dic(dic, n=3):
    # Print the first n entries
    for row, (u, i_cnt) in enumerate(dic.items()):
        if row >= n:
            break
        print(u, i_cnt)
        
def sort_2_dim_dic(dic, k, n=5):
    return sorted(dic[k].items(), key=lambda x : x[1], reverse=True)[:n]

def sort_1_dim_dic(dic, n=5):
    return sorted(dic.items(), key=lambda x : x[1], reverse=True)[:n]

def trans_dic_2_matrix(dic):
    return DataFrame(dic).T.fillna(0)

In [4]:
#user, movie, rating, _

path = os.path.join('ml-1m', 'ratings.dat')
ucf = UserBasedCF(path)

1, 1193, 5, 978300760
1, 661, 3, 978302109
1, 914, 3, 978301968
1, 3408, 4, 978300275
1, 2355, 5, 978824291
1, 1197, 3, 978302268
1, 1287, 5, 978302039
1, 2804, 5, 978300719
1, 594, 4, 978302268
1, 919, 4, 978301368
1, 595, 5, 978824268


In [5]:
# W: user-user similarity matrix
W, C, N = ucf.UserSimilarity()

In [6]:
# user-user co-occurrence matrix
df_c = trans_dic_2_matrix(C)

df_c.shape

(6040, 6040)

In [7]:
df_c.iloc[:10, :10]

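| | 2975 | 3588 | 2110 | 620 | 3888 | 550 | 297 | 1293 | 4256 | 440 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1124 | 19.0 | 53.0 | 44.0 | 29.0 | 23.0 | 132.0 | 29.0 | 15.0 | 5.0 | 5.0 |
| 2975 | 0.0 | 19.0 | 7.0 | 9.0 | 26.0 | 45.0 | 12.0 | 9.0 | 5.0 | 2.0 |
| 3588 | 19.0 | 0.0 | 23.0 | 17.0 | 28.0 | 80.0 | 21.0 | 7.0 | 8.0 | 6.0 |
| 2110 | 7.0 | 23.0 | 0.0 | 18.0 | 17.0 | 74.0 | 10.0 | 10.0 | 1.0 | 4.0 |
| 620 | 9.0 | 17.0 | 18.0 | 0.0 | 9.0 | 57.0 | 19.0 | 11.0 | 3.0 | 4.0 |
| 3888 | 26.0 | 28.0 | 17.0 | 9.0 | 0.0 | 58.0 | 22.0 | 5.0 | 6.0 | 5.0 |
| 550 | 45.0 | 80.0 | 74.0 | 57.0 | 58.0 | 0.0 | 36.0 | 24.0 | 11.0 | 12.0 |
| 297 | 12.0 | 21.0 | 10.0 | 19.0 | 22.0 | 36.0 | 0.0 | 13.0 | 6.0 | 6.0 |
| 1293 | 9.0 | 7.0 | 10.0 | 11.0 | 5.0 | 24.0 | 13.0 | 0.0 | 3.0 | 2.0 |
| 4256 | 5.0 | 8.0 | 1.0 | 3.0 | 6.0 | 11.0 | 6.0 | 3.0 | 0.0 | 2.0 |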


In [8]:
sort_2_dim_dic(C, '1')

[('2073', 28), ('1680', 28), ('6016', 28), ('1150', 28), ('1088', 27)]

In [9]:
# user-user similarity matrix
df_w = trans_dic_2_matrix(W)

In [10]:
df_w.shape

(6040, 6040)

In [11]:
df_w.iloc[:10, :10]

| | 2975 | 3588 | 2110 | 620 | 3888 | 550 | 297 | 1293 | 4256 | 440 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1124 | 0.129615 | 0.25566 | 0.21429 | 0.159853 | 0.130045 | 0.308978 | 0.213235 | 0.135582 | 0.071458 | 0.069552 |
| 2975 | 0.0 | 0.170064 | 0.063258 | 0.092053 | 0.272779 | 0.195451 | 0.163724 | 0.150946 | 0.132593 | 0.051623 |
| 3588 | 0.170064 | 0.0 | 0.146972 | 0.12295 | 0.207721 | 0.245697 | 0.202599 | 0.083016 | 0.150012 | 0.109508 |
| 2110 | 0.063258 | 0.146972 | 0.0 | 0.131436 | 0.127331 | 0.229459 | 0.097405 | 0.119737 | 0.018932 | 0.073708 |
| 620 | 0.092053 | 0.12295 | 0.131436 | 0.0 | 0.076296 | 0.200042 | 0.209463 | 0.149071 | 0.064282 | 0.083424 |
| 3888 | 0.272779 | 0.207721 | 0.127331 | 0.076296 | 0.0 | 0.208794 | 0.248782 | 0.069505 | 0.131876 | 0.106966 |
| 550 | 0.195451 | 0.245697 | 0.229459 | 0.200042 | 0.208794 | 0.0 | 0.168534 | 0.138116 | 0.100091 | 0.106278 |
| 297 | 0.163724 | 0.202599 | 0.097405 | 0.209463 | 0.248782 | 0.168534 | 0.0 | 0.235008 | 0.171499 | 0.166924 |
| 1293 | 0.150946 | 0.083016 | 0.119737 | 0.149071 | 0.069505 | 0.138116 | 0.235008 | 0.0 | 0.105409 | 0.068399 |
| 4256 | 0.132593 | 0.150012 | 0.018932 | 0.064282 | 0.131876 | 0.100091 | 0.171499 | 0.105409 | 0.0 | 0.108148 |


In [12]:
# Find the n users most similar to a given user (n=5)
sort_2_dim_dic(W, '102')

[('2988', 0.3062819459158444),
 ('657', 0.294174202707276),
 ('2019', 0.27796849374791294),
 ('3851', 0.27761261601152887),
 ('4067', 0.2693862292201827)]

In [13]:
recomend = ucf.Recommend('102')

recomend

{'2712': 2.9456538387797395,
 '2622': 2.929448129820376,
 '2686': 2.3047764947801186,
 '2598': 1.8013684458693613,
 '2881': 1.7164280893655668,
 '2434': 1.47087101353638,
 '3114': 1.47087101353638,
 '3174': 1.47087101353638,
 '2908': 1.47087101353638,
 '2762': 1.47087101353638}

In [14]:
ucf.recallAndPrecision()

(0.06916018112825027, 0.34339403973509935)
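
The pair returned above is (recall, precision): recall is the number of hits divided by the number of test-set items, and precision is the number of hits divided by the number of recommendations made. A standalone sketch on toy data, with a hypothetical function name and made-up inputs, shows the arithmetic. Unlike the method in the class, it divides by the number of items actually recommended rather than by a fixed `nitem`.

```python
def recall_and_precision(recommended, relevant):
    """recommended: {user: list of recommended items}
    relevant:    {user: set of items the user has in the test set}"""
    hit = n_relevant = n_recommended = 0
    for user, rel in relevant.items():
        rec = recommended.get(user, [])
        hit += len(set(rec) & rel)      # recommendations that appear in the test set
        n_relevant += len(rel)
        n_recommended += len(rec)
    recall = hit / n_relevant if n_relevant else 0.0
    precision = hit / n_recommended if n_recommended else 0.0
    return recall, precision

# Hypothetical recommendations and test-set items
recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "x"}, "u2": {"d", "e", "y", "z"}}

print(recall_and_precision(recommended, relevant))  # 3 hits / 6 test items, 3 hits / 6 recommendations
```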

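#### Improving User Similarity

Penalize popular items in the list of interests shared by users $u$ and $v$, where $N(i)$ is the set of users who have interacted with item $i$:

$$
W_{uv}=\frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1+|N(i)|)}}{\sqrt{|N(u)||N(v)|}}
$$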

In [15]:
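    # Revised UserBasedCF.UserSimilarity with the popularity penalty
    def UserSimilarity(self):
        # Build the item-to-users inverted index (the data starts as a user-to-items table and is flipped here)
        self.item_users = dict()
        for user, items in self.train.items():
            for i in items.keys():
                if i not in self.item_users:
                    self.item_users[i] = set()
                self.item_users[i].add(user)
        
        # Compute the user-user co-occurrence matrix
        C = dict()     # user-user co-occurrence matrix (now weighted)
        N = dict()     # number of items each user has interacted with
        for i, users in self.item_users.items():
            for u in users:
                N.setdefault(u, 0)
                N[u] += 1
                C.setdefault(u, {})
                for v in users:                  # this nested loop is still fairly slow; is there a better way to compute it?
                    if u == v:
                        continue
                    C[u].setdefault(v, 0)
                    C[u][v] += 1 / math.log(1 + len(users))     # the only change: weight by 1/log(1+|N(i)|), with users = N(i)
        
        # Compute user-user similarity (cosine similarity)
        self.W = dict()    # similarity matrix
        for u, related_users in C.items():    # .items() returns an iterable view of (key, value) pairs
            self.W.setdefault(u, {})
            for v, cuv in related_users.items():
                self.W[u][v] = cuv / math.sqrt(N[u] * N[v])    # cosine similarity with the popularity penalty
        return self.W, C, N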
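
To see what the penalty does, here is a self-contained sketch on toy data. The function name and the data are hypothetical; the weight is $1/\log(1+|N(i)|)$ with $|N(i)|$ the number of users who interacted with item $i$, as in the formula above.

```python
import math
from collections import defaultdict

def user_similarity_penalized(train):
    """train: {user: {item: rating}} -> {u: {v: similarity}}."""
    item_users = defaultdict(set)
    for user, items in train.items():
        for item in items:
            item_users[item].add(user)

    overlap = defaultdict(lambda: defaultdict(float))
    for users in item_users.values():
        weight = 1 / math.log(1 + len(users))  # popular item -> smaller weight
        for u in users:
            for v in users:
                if u != v:
                    overlap[u][v] += weight

    return {
        u: {v: c / math.sqrt(len(train[u]) * len(train[v])) for v, c in row.items()}
        for u, row in overlap.items()
    }

# Hypothetical data: "hit" is rated by everyone, "niche" only by A and B
train = {
    "A": {"hit": 5, "niche": 4},
    "B": {"hit": 4, "niche": 5},
    "C": {"hit": 3, "other": 2},
}
sim = user_similarity_penalized(train)
print(sim["A"]["B"], sim["A"]["C"])
```

The shared niche item adds more to the A-B similarity than the shared hit does, because the hit's weight is divided by the logarithm of a larger audience.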

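**User-based collaborative filtering (UserCF)** matches the intuitive idea that "I will probably like what friends with similar interests like", but **from a technical standpoint it also has some drawbacks**:

1. In Internet applications the number of users is usually much larger than the number of items, and UserCF has to maintain a user similarity matrix so that the top-n similar users can be found quickly. Storing this matrix is very expensive: the space complexity is $O(n^2)$ in the number of users $n$.
2. A user's history vector is usually very sparse. For users with only a handful of purchases or clicks, similar users are found with very low accuracy, which makes UserCF unsuitable for scenarios where positive feedback is hard to obtain (low-frequency applications such as hotel booking or big-ticket purchases).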
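
A quick back-of-the-envelope calculation makes the $O(n^2)$ storage point concrete. The sketch assumes a dense matrix with 8 bytes per similarity value; the helper name and the one-million-user figure are illustrative assumptions, not numbers from the source.

```python
def dense_similarity_bytes(n_users: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n x n matrix of similarity values."""
    return n_users * n_users * bytes_per_value

# 6,040 is the user count seen in the matrix shapes above; 1,000,000 is hypothetical
for n in (6_040, 1_000_000):
    gib = dense_similarity_bytes(n) / 2**30
    print(f"{n:>9,} users -> {gib:,.2f} GiB")
```

Under these assumptions the MovieLens-sized matrix fits in well under 1 GiB, while a million users would need thousands of GiB, which is why a full user similarity matrix becomes impractical as the user base grows.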

### 3. Item-Based Collaborative Filtering

ItemCF recommends to a user the items most similar to the items in their interaction history.

In [16]:
class ItemBasedCF:
    def __init__(self, path):
        self.train = {} # user-item rating table, training set
        self.test = {} # user-item rating table, test set
        self.generate_dataset(path)

    def loadfile(self, path):
        with open(path, 'r', encoding='utf-8') as fp:
            for i, line in enumerate(fp):     # enumerate() pairs each line with its index (the index is unused here)
                yield line.strip('\r\n')

    
    def generate_dataset(self, path, pivot=0.7):
        # Read the file and build the user-item rating tables for the training and test sets
        i = 0
        for line in self.loadfile(path):
            user, movie, rating, _ = line.split('::')
            if i <= 10:
                print('{},{},{},{}'.format(user, movie, rating, _))
            i += 1
            if random.random() < pivot:
                self.train.setdefault(user, {})
                self.train[user][movie] = int(rating)
            else:
                self.test.setdefault(user, {})
                self.test[user][movie] = int(rating)


    def ItemSimilarity(self):
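        # Build the item-item co-occurrence matrix
        C = dict()  # item-item co-occurrence matrix; entry c_ij counts the users who interacted with both item i and item j
        N = dict()  # number of distinct users who interacted with each item
        for user,items in self.train.items():     # three nested loops, slow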
            for i in items.keys():
                N.setdefault(i,0)
                N[i] += 1
                C.setdefault(i,{})
                for j in items.keys():
                    if i == j: 
                        continue
                    C[i].setdefault(j,0)
                    C[i][j] += 1
        # Compute the similarity matrix
        self.W = dict()
        for i,related_items in C.items():
            self.W.setdefault(i,{})
            for j,cij in related_items.items():
                self.W[i][j] = cij / (math.sqrt(N[i] * N[j]))   
        return self.W, C, N

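    # Recommend items to user u, using the K items most similar to each item in u's history
    def Recommend(self,u,K=3,N=10):
        rank = dict()
        action_item = self.train[u]     # items user u has interacted with, and their ratings
        for i,score in action_item.items():
            # j: a candidate item
            # wj: similarity between item i and item j
            for j,wj in sorted(self.W[i].items(),key=lambda x:x[1],reverse=True)[0:K]:    # iterate over the K items most similar to item i
                if j in action_item.keys():     # skip j if the user has already interacted with it
                    continue
                rank.setdefault(j,0)
                # interest of user u in item j: u's rating of item i times the similarity of i and j
                rank[j] += score * wj
        return dict(sorted(rank.items(),key=lambda x:x[1],reverse=True)[0:N])     # return the N items most similar to the user's past interests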
    
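    # Compute recall and precision
    # recall = number of correct recommendations / number of items in the test set
    # precision = number of correct recommendations / total number of recommendations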
    def recallAndPrecision(self,k=8,nitem=10):
        hit = 0
        recall = 0
        precision = 0
        for user, items in self.test.items():
            rank = self.Recommend(user, K=k, N=nitem)
            hit += len(set(rank.keys()) & set(items.keys()))
            recall += len(items)
            precision += nitem
        return (hit / (recall * 1.0),hit / (precision * 1.0))

In [17]:
path = os.path.join('ml-1m', 'ratings.dat')
icf = ItemBasedCF(path)

1,1193,5,978300760
1,661,3,978302109
1,914,3,978301968
1,3408,4,978300275
1,2355,5,978824291
1,1197,3,978302268
1,1287,5,978302039
1,2804,5,978300719
1,594,4,978302268
1,919,4,978301368
1,595,5,978824268


In [18]:
# Compute the item similarity matrix W and the co-occurrence matrix C
i_W, i_C, i_N = icf.ItemSimilarity()

In [19]:
# item-item co-occurrence matrix
df_ic = trans_dic_2_matrix(i_C)

In [20]:
df_ic.iloc[:10, :10]

| | 661 | 914 | 2355 | 1197 | 1287 | 2804 | 938 | 2398 | 2918 | 2791 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1193 | 102.0 | 150.0 | 261.0 | 403.0 | 188.0 | 308.0 | 51.0 | 115.0 | 319.0 | 352.0 |
| 661 | 0.0 | 74.0 | 165.0 | 160.0 | 58.0 | 116.0 | 36.0 | 49.0 | 116.0 | 146.0 |
| 914 | 74.0 | 0.0 | 147.0 | 213.0 | 125.0 | 128.0 | 84.0 | 93.0 | 145.0 | 154.0 |
| 2355 | 165.0 | 147.0 | 0.0 | 440.0 | 145.0 | 271.0 | 48.0 | 89.0 | 321.0 | 339.0 |
| 1197 | 160.0 | 213.0 | 440.0 | 0.0 | 250.0 | 446.0 | 70.0 | 125.0 | 507.0 | 516.0 |
| 1287 | 58.0 | 125.0 | 145.0 | 250.0 | 0.0 | 139.0 | 43.0 | 89.0 | 159.0 | 168.0 |
| 2804 | 116.0 | 128.0 | 271.0 | 446.0 | 139.0 | 0.0 | 40.0 | 91.0 | 387.0 | 373.0 |
| 938 | 36.0 | 84.0 | 48.0 | 70.0 | 43.0 | 40.0 | 0.0 | 43.0 | 48.0 | 45.0 |
| 2398 | 49.0 | 93.0 | 89.0 | 125.0 | 89.0 | 91.0 | 43.0 | 0.0 | 106.0 | 100.0 |
| 2918 | 116.0 | 145.0 | 321.0 | 507.0 | 159.0 | 387.0 | 48.0 | 106.0 | 0.0 | 469.0 |


In [21]:
df_ic.shape

(3660, 3660)

In [22]:
# item-item similarity matrix
df_iw = trans_dic_2_matrix(i_W)

In [23]:
df_iw.iloc[:10, :10]

| | 661 | 914 | 2355 | 1197 | 1287 | 2804 | 938 | 2398 | 2918 | 2791 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1193 | 0.152647 | 0.20403 | 0.220538 | 0.294378 | 0.241912 | 0.290301 | 0.125297 | 0.203795 | 0.291531 | 0.295454 |
| 661 | 0.0 | 0.176995 | 0.245162 | 0.205517 | 0.131236 | 0.192258 | 0.155525 | 0.152693 | 0.186414 | 0.21549 |
| 914 | 0.176995 | 0.0 | 0.198518 | 0.248668 | 0.257069 | 0.192818 | 0.329831 | 0.263402 | 0.211788 | 0.20659 |
| 2355 | 0.245162 | 0.198518 | 0.0 | 0.319105 | 0.185246 | 0.2536 | 0.117083 | 0.156591 | 0.291259 | 0.282506 |
| 1197 | 0.205517 | 0.248668 | 0.319105 | 0.0 | 0.276107 | 0.360804 | 0.147607 | 0.190128 | 0.397686 | 0.371737 |
| 1287 | 0.131236 | 0.257069 | 0.185246 | 0.276107 | 0.0 | 0.198084 | 0.159727 | 0.238465 | 0.219699 | 0.213203 |
| 2804 | 0.192258 | 0.192818 | 0.2536 | 0.360804 | 0.198084 | 0.0 | 0.108835 | 0.178597 | 0.391689 | 0.346731 |
| 938 | 0.155525 | 0.329831 | 0.117083 | 0.147607 | 0.159727 | 0.108835 | 0.0 | 0.219975 | 0.126632 | 0.109036 |
| 2398 | 0.152693 | 0.263402 | 0.156591 | 0.190128 | 0.238465 | 0.178597 | 0.219975 | 0.0 | 0.201713 | 0.174776 |
| 2918 | 0.186414 | 0.211788 | 0.291259 | 0.397686 | 0.219699 | 0.391689 | 0.126632 | 0.201713 | 0.0 | 0.42272 |


In [24]:
recomend = icf.Recommend('102')  

recomend     # {item j: interest of user u in item j}

{'1247': 5.629173286537089,
 '3623': 5.497684149325655,
 '1284': 5.327839764160483,
 '608': 5.030842753593304,
 '1214': 4.930347397226823,
 '2997': 4.731527455099225,
 '2858': 4.701981263345619,
 '1200': 4.48482552884945,
 '2571': 3.767368208823568,
 '2186': 3.7584592628297715}

In [25]:
icf.recallAndPrecision()

(0.07561056094609864, 0.37556291390728475)

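#### Improving Item-Based Collaborative Filtering (ItemCF)

1. Effect of user activity on item similarity: an active user should contribute less to item similarity than an inactive one. The **IUF** (Inverse User Frequency, the reciprocal of the logarithm of user activity) term is added to correct the item similarity formula, where $N(i)$ is the set of users who interacted with item $i$ and $N(u)$ is the set of items user $u$ interacted with:

$$
W_{ij}=\frac{\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(1+|N(u)|)}}{\sqrt{|N(i)||N(j)|}}
$$

2. **Normalizing the ItemCF similarity matrix by its row maximum** can improve recommendation accuracy, and can also improve coverage and diversity. Given the item similarity matrix $w$, the normalized matrix $w'$ is:

$$
w'_{ij}=\frac{w_{ij}}{\max\limits_j w_{ij}}
$$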
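
Both adjustments can be sketched together in one function. This is a minimal illustration on toy data rather than a change to the `ItemBasedCF` class above; the function name, the `normalize` flag, and the data are hypothetical.

```python
import math
from collections import defaultdict

def item_similarity_iuf(train, normalize=True):
    """train: {user: {item: rating}} -> {i: {j: similarity}}."""
    n_item = defaultdict(int)                          # |N(i)|: number of users per item
    overlap = defaultdict(lambda: defaultdict(float))  # IUF-weighted co-occurrence
    for user, items in train.items():
        weight = 1 / math.log(1 + len(items))          # active user -> smaller weight
        for i in items:
            n_item[i] += 1
            for j in items:
                if i != j:
                    overlap[i][j] += weight

    W = {
        i: {j: c / math.sqrt(n_item[i] * n_item[j]) for j, c in row.items()}
        for i, row in overlap.items()
    }
    if normalize:
        # w'_ij = w_ij / max_j w_ij, applied row by row
        for row in W.values():
            top = max(row.values())
            for j in row:
                row[j] /= top
    return W

# Hypothetical data: u2 is the most active user
train = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 3, "b": 5, "c": 2, "d": 1},
    "u3": {"c": 4, "d": 5},
}
raw = item_similarity_iuf(train, normalize=False)
norm = item_similarity_iuf(train)
print(raw["a"]["b"], raw["a"]["c"])
print(norm["a"]["b"], norm["a"]["c"])
```

In the raw matrix the pair (a, c) is supported only by the very active user u2, so it scores lower than (a, b); after normalization the largest entry in every row becomes 1, which puts rows of popular and niche items on the same scale.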

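**How UserCF and ItemCF differ in application scenarios:**

* UserCF ranks by user similarity, which gives it **stronger social characteristics**: a user can quickly learn what people with similar interests like. This suits news recommendation very well, because interest in news tends to be scattered, and compared with a user's preferences for individual stories, timeliness and popularity are often the more important attributes. **UserCF is well suited to discovering hot topics and tracking their trends.**
* **ItemCF is better suited to applications where interests are relatively stable.** In e-commerce, for example, a user tends to look for one category of product over a period of time; in Netflix's video recommendation, users' interests in films and TV series are usually fairly stable, so using ItemCF to recommend videos of similar style and genre is the more reasonable choice.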

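**Limitations of collaborative filtering:**

Collaborative filtering is an intuitive and highly interpretable model, but it **does not generalize well**. It cannot carry the information that two items are similar over to the similarity computation of other items (?), which leads to a fairly serious problem: **popular items show a strong head effect and easily end up similar to a large number of items** (the fix is to add a popularity penalty when computing item similarity), while **tail items have sparse feature vectors, are rarely similar to other items, and are therefore rarely recommended**. These are inherent weaknesses of collaborative filtering: a pronounced head effect in the recommendations and a weak ability to handle sparse vectors. (Improvement: matrix factorization.)

**Collaborative filtering uses only the interactions between users and items.** It **cannot effectively incorporate user features, item features, and context features** such as user age, gender, product description, product category, and current time, which clearly **loses useful information**. (Improvement: machine learning models built around logistic regression that can combine different types of features.)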

********************************

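References:

1. Xiang Liang, *Recommender System Practice* (《推荐系统实践》), Chapter 2: Using User Behavior Data
2. Wang Zhe, *Deep Learning Recommender Systems* (《深度学习推荐系统》), Section 2.2: Collaborative Filtering, the Classic Recommendation Algorithm
3. [Recommender Systems (2): Collaborative Filtering](https://zhuanlan.zhihu.com/p/94024379)
4. [Building a Recommender System with User-Based Collaborative Filtering](https://mp.weixin.qq.com/s/ZtnaQrVIpVOPJpqMdLWOcw)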