###### P

> [W1_017陈佳豪](https://www.kesci.com/home/project/5a73e21659774204c69f2559)

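# <center>Table of Contents</center>

# I. Preparation

## 1. Project Setup
- Data cleaning
- To make the results more appealing, a crawler fetches the poster of each recommended movie. **This is the most important part**: users can tell whether they want to watch a movie when they see what it actually looks like, rather than reading a plain text title. It is a nice touch.

## 2. Data Integration and Variable Initialization

## 3. Approach
When implementing MF, the search space is too large, so a user-user step is used to narrow it down first. To avoid heavy computation, I rewrote the core code myself: the basic idea is similar to CF, but the implementation differs.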

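# II. Code Refactoring

## 1. Function Definitions

### 1.1 Parameters
- **userId:** ID of the user to recommend for; 0 denotes an unknown user
- **top:** number of movies to recommend
- **poster:** whether to display the posters of the recommended movies
- **show:** whether to print the results

### 1.2 Function Signatures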

```python
def always_most_popular_recommendation(userId=0,top=10,poster=False,show=True):
    pass
def content_based_exact_knn_recommendation(userId=1,top=10,poster=False,show=True):
    pass
def content_based_approximate_knn_recommendation(userId=1,top=10,poster=False,show=True):
    pass
def user_to_User_cf_recommendation(userId=1,top=10,poster=False,show=True):
    pass
def item_to_item_cf_recommendation(userId=1,top=10,poster=False,show=True):
    pass
def mf_recommendation(userId=1,top=10,poster=False,show=True):
    pass
```
## 2. Function Implementations

### 2.0 [Always recommend the most popular item](#Always-recommend-the-most-popular-item)

### 2.1 [Content based recommendation by using exact KNN search](#Content-based-recommendation-by-using-exact-KNN-search)
### 2.2 Content based approximate knn Recommendation (faiss could not be made to work)
### 2.3 [User to User CF Recommendation](#User-to-User-CF-Recommendation)
### 2.4 [Item to Item CF Recommendation](#Item-to-Item-CF-Recommendation)
### 2.5 [Matrix Factorization Recommendation](#Matrix-Factorization-Recommendation)


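# III. Improving the Algorithms

- 1. Combining multiple indicators could improve the results.
- 2. Among the movies a user rated >= 4.5, the user's five top-ranked genres account for **75%** of the total across all genres, so using genres as a signal may give good results.
- 3. A movie's overall rating is not the biggest factor in whether a specific user watches it (who says nobody watches bad movies?).
- 4. There are many candidate indicators here. The best approach is probably to find a way to measure all of them and then combine the top ones. Time and ability are limited, so I will do as much as I can today.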
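
Point 2 could be checked with something like the sketch below. It assumes ratings and movies tables in the MovieLens layout (pipe-separated `genres`); the helper name `top_genre_share` and the toy data are hypothetical and are not part of the notebook.

```python
import pandas as pd

def top_genre_share(ratings, movies, user_id, min_rating=4.5, n_genres=5):
    """Share of a user's highly rated genre tags covered by their top-n genres."""
    liked = ratings[(ratings.userId == user_id) & (ratings.rating >= min_rating)]
    liked = liked.merge(movies, on='movieId', how='inner')
    # One entry per (movie, genre) pair; MovieLens separates genres with '|'
    genre_counts = liked.genres.str.split('|').explode().value_counts()
    if genre_counts.empty:
        return 0.0
    return genre_counts.head(n_genres).sum() / genre_counts.sum()

# Toy data (hypothetical), only to show the call
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Comedy|Drama', 'Comedy', 'Horror'],
})
ratings_toy = pd.DataFrame({
    'userId': [1, 1, 1],
    'movieId': [1, 2, 3],
    'rating': [5.0, 4.5, 3.0],
})
share = top_genre_share(ratings_toy, movies_toy, user_id=1, n_genres=1)
```

Averaging this share over all users would give a figure comparable to the 75% quoted above, provided the same definition is used.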

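# IV. Error Estimate

The best recall is 28% (KNN), which is still unsatisfying because it is very low. Using more dimensions is one direction for improvement.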
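
The notebook does not show how this figure was measured. One common definition is recall@k, sketched below; the function name and the toy lists are assumptions for illustration only.

```python
def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)

# Toy example: only movie 20 of the four relevant movies is in the top 3
recall = recall_at_k([10, 20, 30, 40], relevant=[20, 40, 50, 60], k=3)
```

In practice `relevant` would be the movies a user rated highly in a held-out test split, and `recommended` the `movieId` column returned by one of the recommendation functions.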

# V. Recommendation Results [Elegant display](#Recommendation-Results)

## 1. Project Setup

In [None]:
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# List the currently mounted dataset directory
!ls ml-20m/

### 1.1 Load common data analysis libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### 1.2 Set display options

In [None]:
pd.set_option('display.max_columns', 8)
pd.set_option('display.max_rows', 8)

### 1.3 Load the data

In [None]:
ratings = pd.read_csv('ml-20m/ratings.csv')
movies = pd.read_csv('ml-20m/movies.csv')
links = pd.read_csv('ml-20m/links.csv')
genome_tags = pd.read_csv('ml-20m/genome-tags.csv')
tags = pd.read_csv('ml-20m/tags.csv')
genome_scores = pd.read_csv('ml-20m/genome-scores.csv')

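### 1.4 Select a subset of the data for analysis

**Variables**

- **users_num**: keep only ratings whose `userId` is less than or equal to `users_num`
- **movies_num**: keep only movies whose `movieId` is less than or equal to `movies_num`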

In [None]:
users_num = 500
movies_num = 3000

ratings = ratings[(ratings.movieId <= movies_num) & (ratings.userId <= users_num)]
genome_scores = genome_scores[genome_scores.movieId <= movies_num]
tags = tags[tags.movieId <= movies_num]
tagIds_num = genome_tags.index.size
movies = movies[movies.movieId <= movies_num]
links = links[links.movieId <= movies_num]

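## 2. Data Integration and Variable Initialization

> Merge the data used frequently later on into a single table

### 2.1 Variables

- **rating_count**: number of users who rated each movie
- **rating_sum**: sum of the ratings each movie received
- **n_users**: total number of users
- **n_items**: total number of movies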

In [None]:
# Number of users who rated each movie
rating_count = pd.merge(movies, ratings, on='movieId', how='right').groupby('movieId').size()
rating_count = pd.DataFrame({'rating count': rating_count.values, 'movieId': rating_count.index})
# rating_count
# Sum of the ratings each movie received (aggregate only the numeric rating column)
rating_sum = pd.merge(movies, ratings, on='movieId', how='left').groupby('movieId')['rating'].sum()
rating_sum = pd.DataFrame({'rating sum': rating_sum.values, 'movieId': rating_sum.index})
# rating_sum

In [None]:
# Merge the statistics into the movies table
movies = pd.merge(movies, rating_count, on='movieId', how='left')
movies = pd.merge(movies, rating_sum, on='movieId', how='left')
movies = movies.fillna(0)
movies
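
The count and sum computed above can also be obtained in a single aggregation. The following is a minimal sketch on toy data (hypothetical values), assuming the same column names as the notebook.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    'userId': [1, 2, 1],
    'movieId': [10, 10, 20],
    'rating': [4.0, 5.0, 3.0],
})
movies_toy = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['A', 'B', 'C']})

# Count and sum of ratings per movie in one pass
stats = (ratings_toy.groupby('movieId')['rating']
         .agg(['count', 'sum'])
         .rename(columns={'count': 'rating count', 'sum': 'rating sum'})
         .reset_index())

# Movies without any rating get 0 for both statistics
merged = movies_toy.merge(stats, on='movieId', how='left').fillna(0)
```

Movie 30 has no ratings in the toy data, so the left merge followed by `fillna(0)` leaves it with a count and sum of zero, which mirrors what the notebook does for unrated movies.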

In [None]:
## Count the users and movies in the data
n_users = ratings.userId.unique().shape[0]
n_items = ratings.movieId.unique().shape[0]
print (str(n_users) + ' users')
print (str(n_items) + ' movies')

### 2.2 Relevance matrix

In [None]:
# Rows are indexed by movieId - 1, columns by tagId - 1
relevance = np.zeros((movies_num, tagIds_num))
relevance[genome_scores.movieId.values - 1, genome_scores.tagId.values - 1] = \
        genome_scores.relevance.values
relevance.shape

In [None]:
# Replace user-applied tag strings with their tagId to simplify similarity computation
tags = pd.merge(tags, genome_tags, on='tag', how='left')

### 2.3 Build the tag matrix: an entry is 1 if the movie has that tag

In [None]:
tag_mat = np.zeros((movies_num, tagIds_num), dtype=np.int8)  # np.int8 keeps memory consumption low
# Drop rows whose tagId is NaN (tags that are not in the genome)
tags_new = tags[tags.tagId.notnull()].copy()
tags_new[['tagId']] = tags_new[['tagId']].astype(np.int16)
tag_mat[tags_new.movieId.values - 1, tags_new.tagId.values - 1] = 1

### 2.4 Build the IDF matrix

In [None]:
# tag_num is the number of movies each tag appears in (0.01 avoids division by zero)
tag_num = tag_mat.sum(axis=0) + 0.01
IDF_mat = np.log(movies_num / tag_num)
TF_IDF_mat = relevance * IDF_mat
TF_IDF_mat
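
To make the weighting concrete, here is a small worked example with made-up numbers (two tags, four movies). The `_toy` names are hypothetical; the formulas are the same as in the cell above.

```python
import numpy as np

n_movies = 4
# Rows are movies, columns are tags; 1 means a user applied the tag to the movie
tag_mat_toy = np.array([[1, 0],
                        [1, 0],
                        [1, 1],
                        [1, 0]], dtype=np.int8)
# Relevance scores in [0, 1], same shape as the tag matrix
relevance_toy = np.array([[0.9, 0.1],
                          [0.8, 0.0],
                          [0.7, 0.6],
                          [0.5, 0.2]])

tag_num_toy = tag_mat_toy.sum(axis=0) + 0.01   # smoothing avoids log(n / 0)
idf_toy = np.log(n_movies / tag_num_toy)       # shape (2,), broadcast over rows
tf_idf_toy = relevance_toy * idf_toy
```

The first tag appears on every movie, so its IDF is close to zero and it contributes almost nothing to the weighted matrix. The second tag appears on one movie only and receives a much larger weight.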

### 2.5 Display movie posters

In [None]:
#!pip install requests

from IPython.display import Image
from IPython.display import display
import requests
import re

def get_poster(recom_item):
    posters = []
    # Merge first so that titles and IMDb IDs stay aligned row by row
    merged = pd.merge(recom_item, links, on='movieId')
    host = "https://images-na.ssl-images-amazon.com/images/M/"
    pattern = r"https://images-na\.ssl-images-amazon\.com/images/M/[\S]*?jpg"
    for title, imdbId in zip(merged.title, merged.imdbId):
        # IMDb title IDs are zero-padded to 7 digits
        url = "http://www.imdb.com/title/tt{:0>7}/".format(imdbId)
        wb_data = requests.get(url, timeout=10)
        urls = re.findall(pattern, wb_data.text)
        if not urls:
            # No poster URL found on the page; skip this movie
            continue
        # Keep the image ID and request a fixed-size (182x268) rendition
        image_id = urls[0].split('/')[5].split('.')[0]
        poster_url = host + image_id + '._V1_UX182_CR0,0,182,268_AL_.jpg'
        posters.append((title, Image(url=poster_url)))

    for index, (title, poster) in enumerate(posters):
        print(str(index + 1) + '. ' + title)
        display(poster)

### 2.6 Print the recommendation results

In [None]:
def show_user_movies(recom_movies, poster=False):
    if poster:
        get_poster(recom_movies)
    else:
        # Print a numbered list of titles
        for index, title in enumerate(recom_movies.title):
            print(str(index + 1) + '. ' + title)

[Back to index](#P)

### 2.7 Always-recommend-the-most-popular-item

In [None]:
def always_most_popular_recommendation(userId=0, top=10, poster=False, show=True):
    recom_movies=movies.sort_values('rating sum', ascending=False)[0:top]
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

[Back to index](#P)

### 2.8 Content-based-recommendation-by-using-exact-KNN-search

In [None]:
# Build the user-by-movie ratings matrix first
user_ratings = np.zeros((users_num, movies_num))
user_ratings[ratings.userId.values - 1, ratings.movieId.values - 1] = ratings.rating.values
new_ratings = np.zeros((users_num, movies_num))
for i in range(user_ratings.shape[0]):
    where = user_ratings[i, :].nonzero()
    if len(where[0]) > 0:
        average = np.mean(user_ratings[i, :][where])
        # The centered ratings sum to 0 for each user; adding 1 / len(where[0])
        # makes each user's weights sum to 1
        new_ratings[i, :][where] = user_ratings[i, :][where] - average + 1 / len(where[0])
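
A worked example of this normalization for a single user may help. The row below is made up: the user rated three of four movies.

```python
import numpy as np

row = np.array([5.0, 0.0, 3.0, 4.0])   # 0 means "not rated"
rated = row.nonzero()[0]
weights = np.zeros_like(row)
# Centering makes the rated entries sum to 0; adding 1/n shifts the total to 1
weights[rated] = row[rated] - row[rated].mean() + 1 / len(rated)
```

The user's mean rating is 4, so the centered values are 1, -1 and 0. After adding 1/3 to each, the weights sum to 1, and movies rated below the user's average can receive a negative weight.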

#### 2.8.1 Compute the user_tag matrix by matrix multiplication (the tags each user's "favorite movie" should have)

In [None]:
user_tag_mat = new_ratings.dot(TF_IDF_mat)
user_tag_mat[0:4,0:4]

In [None]:
def content_based_exact_knn_recommendation(userId=1, top=10, poster=False, show=True):
    from sklearn.neighbors import NearestNeighbors
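    # Rows of user_tag_mat are indexed by userId - 1; kneighbors expects a 2D array
    user_tag = user_tag_mat[userId - 1, :].reshape(1, -1)
    nbrs = NearestNeighbors(n_neighbors=top, algorithm='ball_tree').fit(TF_IDF_mat)
    distances, indices = nbrs.kneighbors(user_tag)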
    recom_movies = pd.DataFrame({'movieId':indices[0] + 1})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies
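
For intuition, an exact nearest-neighbour search under Euclidean distance can be written by brute force in a few lines. This is only a sketch of what the search computes, not the ball-tree implementation used above; `exact_knn` and the toy vectors are hypothetical.

```python
import numpy as np

def exact_knn(query, items, k):
    """Indices of the k rows of `items` closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(items - query, axis=1)
    return np.argsort(dists)[:k]

# Toy item vectors (one row per movie) and one user profile vector
item_vectors = np.array([[1.0, 0.0],
                         [0.9, 0.1],
                         [0.0, 1.0]])
user_profile = np.array([1.0, 0.05])
nearest = exact_knn(user_profile, item_vectors, k=2)
```

The returned indices are row positions, so in the notebook's layout 1 has to be added to turn them back into `movieId` values.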

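#### 2.8.2 Preparation for the CF algorithms (2.3 and 2.4 in the index)

Remap the movie IDs to contiguous indices: the range of `movieId` values is much larger than the number of distinct movies, so allocating a column for every possible ID would be wasteful.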

In [None]:
new_rating = ratings.copy()
new_n_users = users_num
new_n_items = new_rating.movieId.unique().shape[0]
movie_map = pd.Series(index=new_rating.movieId.unique(), data=range(new_n_items))

new_rating.movieId = movie_map[new_rating.movieId.values].values
new_rating.movieId.describe()
new_rating.head()
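
The remapping idea can be seen on a tiny table. The sketch below uses made-up ratings and hypothetical names (`to_index`, `to_movie_id`); it also builds the reverse lookup needed to translate results back to real movie IDs.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [50, 2571, 50],
    'rating': [5.0, 4.0, 3.5],
})
unique_ids = toy.movieId.unique()
# Original movieId -> dense column index 0..n-1
to_index = pd.Series(range(len(unique_ids)), index=unique_ids)
toy['movieIdx'] = toy.movieId.map(to_index)
# Dense column index -> original movieId, for translating results back
to_movie_id = pd.Series(to_index.index, index=to_index.values)
```

Because the dense indices start at 0, they can be used directly as column positions in a NumPy matrix without subtracting 1.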

#### 2.8.3 Compute the rating matrix

In [None]:
rating_matrix = np.zeros((new_n_users, new_n_items))
for row in new_rating.itertuples():
    # row[1] is the 1-based userId; row[2] is the remapped movie index, already 0-based
    rating_matrix[row[1] - 1, row[2]] = row[3]
    
rating_matrix[0:4, 0:4]

In [None]:
sparsity = float(len(rating_matrix.nonzero()[0]))
sparsity /= (rating_matrix.shape[0] * rating_matrix.shape[1])
sparsity *= 100
print ('Sparsity: {:4.2f}%'.format(sparsity))

In [None]:
def similarity(new_rating, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling divide-by-zero errors
    if kind == 'user':
        sim = new_rating.dot(new_rating.T) + epsilon
    elif kind == 'item':
        sim = new_rating.T.dot(new_rating) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return (sim / norms / norms.T)
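
The function above computes cosine similarity between rows (users) or columns (items). A small worked example with made-up ratings, using a hypothetical row-only variant of the same formula:

```python
import numpy as np

def cosine_similarity_rows(mat, epsilon=1e-9):
    """Pairwise cosine similarity between the rows of `mat`."""
    sim = mat.dot(mat.T) + epsilon
    norms = np.sqrt(np.diagonal(sim))
    return sim / norms[:, None] / norms[None, :]

toy_ratings = np.array([[5.0, 0.0, 1.0],
                        [4.0, 0.0, 1.0],
                        [0.0, 3.0, 0.0]])
sim = cosine_similarity_rows(toy_ratings)
```

Users 0 and 1 rated the same movies with similar scores, so their similarity is close to 1. User 2 shares no rated movie with them, so the similarity is essentially 0.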

#### 2.8.4 Compute the user-user and item-item similarity matrices

In [None]:
user_similarity = similarity(rating_matrix, kind='user')
item_similarity = similarity(rating_matrix, kind='item')

[Back to index](#P)

### 2.9 User-to-User-CF-Recommendation

In [None]:
def user_to_User_cf_recommendation(userId=1, top=10, poster=False, show=True):
    # 1. Find the users most similar to this user (matrix rows are userId - 1)
    k = 10  # neighborhood size (assumed; the original left k undefined)
    order = np.argsort(user_similarity[userId - 1, :])[::-1]
    similar_user = order[order != userId - 1][:k] + 1  # drop the user, back to userId
    similar_user = pd.DataFrame({"userId": similar_user})
    similar_user_movies = pd.merge(similar_user, ratings, on='userId')
    
    # 2. Collect the candidate movies: those rated >= 3.5 by the most neighbors
    similar_user_movies = similar_user_movies[similar_user_movies.rating >= 3.5]
    recom_movie_ids = similar_user_movies.groupby('movieId').size().sort_values(ascending=False).head(top).index
    recom_movies = pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    # 3. Optionally display the result
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies
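
The neighbor-selection step can be isolated as follows. This is a sketch with a made-up similarity matrix; `top_k_neighbors` is a hypothetical helper that excludes the query row, since every user is most similar to themselves.

```python
import numpy as np

def top_k_neighbors(sim_matrix, row, k):
    """Row indices of the k most similar entries to `row`, excluding `row` itself."""
    order = np.argsort(sim_matrix[row])[::-1]   # most similar first
    order = order[order != row]                 # drop the query row
    return order[:k]

sim_toy = np.array([[1.0, 0.2, 0.9],
                    [0.2, 1.0, 0.4],
                    [0.9, 0.4, 1.0]])
neighbors = top_k_neighbors(sim_toy, row=0, k=2)
```

The result contains row positions; with the notebook's layout, adding 1 converts them to `userId` values.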

[Back to index](#P)

### 2.10 Item-to-Item-CF-Recommendation

In [None]:
def item_to_item_cf_recommendation(userId=1, top=10, poster=False, show=True):
    # 1. Get the (remapped) indices of the movies the user loves
    user_love_movie_ids = new_rating[(new_rating.userId == userId) & (new_rating.rating >= 4.5)].movieId.values
    
    # 2. Use the similarity matrix to find the movies closest to them
    k = top  # neighbors per loved movie (assumed; the original left k undefined)
    recom_movie_ids = np.argsort(item_similarity[user_love_movie_ids,:])[:,:-k-1:-1].flatten()
    recom_movie_ids = pd.Series(recom_movie_ids).value_counts().head(top).index
    recom_movie_ids = list(movie_map.iloc[recom_movie_ids].index)
    recom_movies = pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    # 3. Optionally display the result
    if(show):
        show_user_movies(recom_movies,poster)
    return recom_movies

[Back to index](#P)

### 2.11 Matrix-Factorization-Recommendation

#### 2.11.1 Matrix Factorization

In [None]:
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    Q = Q.T
    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = np.dot(P,Q)
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        if e < 0.001:
            break
    return P, Q.T
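
To see the factorization at work without the full dataset, here is a compact variant on a 3x3 toy matrix. It is a sketch rather than the function above: it updates whole factor rows at once using the pre-update value of `P[i]`, and the learning rate, step count, and seed are assumptions chosen for the toy data.

```python
import numpy as np

def factorize(R, K=2, steps=2000, alpha=0.01, beta=0.02, seed=0):
    """Factor R ~ P @ Q.T with SGD over the observed (non-zero) entries only."""
    rng = np.random.default_rng(seed)
    P = rng.random((R.shape[0], K))
    Q = rng.random((R.shape[1], K))
    rows, cols = R.nonzero()
    for _ in range(steps):
        for i, j in zip(rows, cols):
            err = R[i, j] - P[i] @ Q[j]
            p_old = P[i].copy()
            # Gradient step with L2 regularisation on both factor rows
            P[i] += alpha * (2 * err * Q[j] - beta * P[i])
            Q[j] += alpha * (2 * err * p_old - beta * Q[j])
    return P, Q

# Toy rating matrix; 0 marks a missing rating
R_toy = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [1.0, 1.0, 5.0]])
P, Q = factorize(R_toy)
pred = P @ Q.T
mask = R_toy > 0
rmse = np.sqrt(np.mean((R_toy[mask] - pred[mask]) ** 2))
```

The entries of `pred` where `R_toy` is 0 are the predicted ratings for unseen movies; `rmse` measures how well the observed entries are reproduced.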

In [None]:
def mf_recommendation(userId=1,top=10,poster=False,show=True):
    # 1. Movies the user rated >= 4
    user_high_score_movies=ratings[(ratings.userId==userId) &  (ratings.rating>=4)]
    user_high_score=pd.DataFrame({"movieId":user_high_score_movies.movieId})
    # 2. Find other users who rated those movies >= 4.5
    find_similar=pd.merge(user_high_score,ratings,on='movieId')
    find_similar=find_similar[(find_similar.userId != userId) & (find_similar.rating>=4.5)]
    # 3. Pick the ten users most similar to this user
    similar_users=find_similar.groupby('userId').apply(lambda x:len(x['movieId'].unique())).sort_values(ascending=False)[0:10]
    similar_users=pd.DataFrame({"userId":similar_users.index})
    # 4. Keep the movies those ten users rated >= 4.5
    similar_users_movies=pd.merge(similar_users,ratings,on='userId')
    similar_users_movies=similar_users_movies[similar_users_movies.rating>=4.5]
    # 5. Rank by how many of the ten users rated each movie and keep the top 50
    similar_users_movies=similar_users_movies.groupby('movieId').apply(lambda x:len(x['userId'].unique()))
    similar_users_movies=similar_users_movies.sort_values(ascending=False)[0:50]
    similar_users_movies=pd.DataFrame({"movieId":similar_users_movies.index})
    # 6. Append the target user as row 10 of the user table
    similar_users.loc[10,]=userId
    # 7. Join users, candidate movies and ratings
    c=pd.merge(similar_users,ratings,on='userId',how='left')
    c=pd.merge(similar_users_movies,c,on='movieId',how='left')
    c=c.loc[:,['movieId','userId','rating']]
    # 8. Build the user-item matrix (11 users x 50 movies; 0 means no rating)
    R=np.zeros((11,50))
    for i in range(11):
        for j in range(50):
            temp_i,temp_j=similar_users.iloc[i,0],similar_users_movies.iloc[j,0]
            rate=c[(c.movieId==temp_j) & (c.userId==temp_i)]
            if(rate.size==0):
                R[i][j]=0
            else:
                R[i][j]=rate.iloc[0,2]
    # 9. Use MF to estimate the missing ratings for the 50 candidate movies
    N ,M= len(R),len(R[0])
    K = 2
    P = np.random.rand(N,K)
    Q = np.random.rand(M,K)
    nP, nQ = matrix_factorization(R, P, Q, K,300)
    nR = np.dot(nP, nQ.T)
    # 10. Take the movies with the highest predicted rating for the target user (row 10)
    result=pd.DataFrame(nR).loc[[10]].T.sort_values(10,ascending=False)[0:top]
    recom_movie_ids=list(similar_users_movies.loc[list(result.index),].movieId)
    recom_movies=pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies=pd.merge(recom_movies,movies,how='left',on='movieId')
    # 11. Optionally display the result
    if(show):
        show_user_movies(recom_movies,poster)
    return recom_movies

[Back to index](#P)

## 3. Recommendation Results

In [None]:
# Use a separate name so the global `movies` table is not overwritten
recom_movies = always_most_popular_recommendation(top=5, poster=True)