## 电影推荐系统

> [https://cloud.tencent.com/developer/article/1058235](https://cloud.tencent.com/developer/article/1058235)

> [当推荐系统遇上深度学习](https://blog.csdn.net/somTian/article/details/71516613)

> [鉴于IMDB电影ID，我如何以编程方式获取其海报图像？](https://oomake.com/question/17663)

> [W1_017陈佳豪](https://www.kesci.com/home/project/5a73e21659774204c69f2559) -- 有实现获取海报的代码

> [TMDB电影数据分析](https://www.kesci.com/home/project/5afa3eaf3878b214d9ca9fff)

> [趣味项目 episode 2——IMDB电影数据分析](https://www.kesci.com/home/project/5ad5a5687238515d80b55cba)

> [一个电影推荐系统，毕业设计](https://github.com/JaniceWuo/MovieRecommend)

> [电影推荐实例--基于协同过滤和DL特征提取的比较](https://blog.csdn.net/qq_32453673/article/details/72593675)

> [TF-IDF与余弦相似性的应用（一）：自动提取关键词](http://www.ruanyifeng.com/blog/2013/03/tf-idf.html)

> [Python Numpy计算各类距离](https://blog.csdn.net/qq_19707521/article/details/78479532)

> [推荐系统算法学习（一）——协同过滤(CF) MF FM FFM](https://blog.csdn.net/qq_23269761/article/details/81355383)

GitHub：

> [yjm930504/Recommender-System](https://github.com/yjm930504/Recommender-System)

> [用MovieLens数据集做推荐](https://github.com/m-L-0/18a-RecSys-zhouhaiyang-2018)

> [爬取豆瓣 48233 条数据, 与 movielens ml-latest 数据集取交集获取共同数据 15752 条](https://github.com/jlshix/movielens-douban-dataset)

> [jadianes/spark-movie-lens](https://github.com/jadianes/spark-movie-lens/tree/c161b9305e5df0c41aed62f7aa883c47e1960481)

# 项目准备

1、项目准备

- 数据清洗

- 为了让结果显得更美好，这里使用爬虫，在推荐的结果显示了海报，这是最主要的，因为用户能看到真实的样子，才会知道自己会不会去看，而不是简单的文字。这是一个很优雅的地方。

2、数据整合和变量初始化

3、思路

- 在实现 MF 中，因为搜索的空间太大了，这里使用了 user-user 来转化，但是为了避免大量的计算，这里我自己重写了核心代码，基本思想和 CF 类似，但是实现方式有所不同。

4、参数说明

- userId：需要推荐的用户 Id,0 则表示未知用户
- top：推荐多少部电影
- poster：是否显示推荐电影的海报
- show：是否显示结果

5、具体函数

    def always_most_popular_recommendation(userId=0,top=10,poster=False,show=True):
        pass
    def content_based_exact_knn_recommendation(userId=1,top=10,poster=False,show=True):
        pass
    def content_based_approximate_knn_recommendation(userId=1,top=10,poster=False,show=True):(未能实现faiss)
        pass
    def user_to_User_cf_recommendation(userId=1,top=10,poster=False,show=True):
        pass
    def item_to_item_cf_recommendation(userId=1,top=10,poster=False,show=True):
        pass
    def mf_recommendation(userId=1,top=10,poster=False,show=True):
        pass

6、改进算法

- 多指标可以改进

- 在对用户评分 >=4.5 的电影中，按照流派排名的前五的占到用户其它流派的 75%，用流派可能会有好的结果

- 总体的评分对于用户具体的电影影响不是最大的因素（谁说烂片没人看了~）

- 我觉得这里有很多指标，但是应该找个办法去度量所有指标的办法，来找到 top 指标来综合考虑，这应该是最好的方案了。时间和能力有限，今天能做多少做多少吧！

7、误差估计

最好的召回率是 28%（KNN）但是感觉还是不很满足，这是很低！多维度是一个方向



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
import tensorflow as tf

import os
import pickle
import re
from tensorflow.python.ops import math_ops

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile
import hashlib


class DLProgress(tqdm):
    """
    Handle Progress Bar while Downloading
    """
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        """
        A hook function that will be called once on establishment of the network connection and
        once after each block read thereafter.
        :param block_num: A count of blocks transferred so far
        :param block_size: Block size in bytes
        :param total_size: The total size of the file. This may be -1 on older FTP servers which do not return
                            a file size in response to a retrieval request.
        """
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

def _unzip(save_path, _, database_name, data_path):
    """
    Unzip wrapper with the same interface as _ungzip
    :param save_path: The path of the gzip files
    :param database_name: Name of database
    :param data_path: Path to extract to
    :param _: HACK - Used to have to same interface as _ungzip
    """
    print('Extracting {}...'.format(database_name))
    with zipfile.ZipFile(save_path) as zf:
        zf.extractall(data_path)

def download_extract(database_name, data_path):
    """
    Download and extract database
    :param database_name: Database name
    """
    DATASET_ML1M = database_name

    if database_name == DATASET_ML1M:
        url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
        hash_code = 'cd245b17a1ae2cc31bb14903e1204af3'
        extract_path = os.path.join(data_path, database_name)
        save_path = os.path.join(data_path, database_name + '.zip')
        extract_fn = _unzip

    if os.path.exists(extract_path):
        print('Found {} Data'.format(database_name))
        return

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(save_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Downloading {}'.format(database_name)) as pbar:
            urlretrieve(
                url,
                save_path,
                pbar.hook)

    assert hashlib.md5(open(save_path, 'rb').read()).hexdigest() == hash_code, \
        '{} file is corrupted.  Remove the file and try again.'.format(save_path)

    os.makedirs(extract_path)
    try:
        extract_fn(save_path, extract_path, database_name, data_path)
    except Exception as err:
        shutil.rmtree(extract_path)  # Remove extraction folder if there is an error
        raise err

    print('Done.')
    # Remove compressed data
    os.remove(save_path)

In [3]:
data_dir = './'
download_extract('ml-20m', data_dir)

Found ml-20m Data


## 1. 数据来源

[MovieLens ml-20m](https://grouplens.org/datasets/movielens/20m/)

MovieLens 20M数据集


由138,000名用户向27,000部电影应用了2000万个评级和465,000个标签应用程序。包括标签基因组数据，在1,100个标签上有1200万个相关性分数。

In [4]:
# 项目准备

# 关闭警告信息
import warnings
warnings.filterwarnings('ignore')

In [5]:
# 查看当前挂载的数据集目录
!ls ml-20m/

genome-scores.csv  links.csv   ratings.csv  tags.csv
genome-tags.csv    movies.csv  README.txt


In [6]:
# 加载数据分析常用库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline

# 设置显示选项
pd.set_option('display.max_columns', 8)
pd.set_option('display.max_rows', 8)

In [7]:
ratings = pd.read_csv('ml-20m/ratings.csv')
movies = pd.read_csv('ml-20m/movies.csv')
links = pd.read_csv('ml-20m/links.csv')
genome_tags = pd.read_csv('ml-20m/genome-tags.csv')
tags = pd.read_csv('ml-20m/tags.csv')
genome_scores = pd.read_csv('ml-20m/genome-scores.csv')

### 1.1 选取部分数据进行分析

变量

- users_num: userId小于等于users_num的评分
- movies_num: movieId小于等于movies_num的电影

In [8]:
users_num=500
movies_num=3000

ratings = ratings[(ratings.movieId <= movies_num) & (ratings.userId <= users_num)]
genome_scores = genome_scores[genome_scores.movieId <= movies_num]
tags = tags[tags.movieId <= movies_num]
tagIds_num = genome_tags.index.size
movies = movies[movies.movieId <= movies_num]
links = links[links.movieId <= movies_num]

In [9]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
...,...,...,...,...
71505,500,2858,4.5,1337181077
71506,500,2881,3.0,1337179627
71507,500,2959,4.5,1337180948
71508,500,3000,4.5,1337181745


In [10]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
...,...,...,...
2911,2997,Being John Malkovich (1999),Comedy|Drama|Fantasy
2912,2998,Dreaming of Joseph Lees (1999),Drama|Romance
2913,2999,Man of the Century (1999),Comedy
2914,3000,Princess Mononoke (Mononoke-hime) (1997),Action|Adventure|Animation|Drama|Fantasy


In [11]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
...,...,...,...
2911,2997,120601,492.0
2912,2998,144178,108346.0
2913,2999,154827,98480.0
2914,3000,119698,128.0


In [12]:
genome_tags

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
...,...,...
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie
1127,1128,zombies


In [13]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078
...,...,...,...,...
465545,138446,918,halloween scene,1358984062
465546,138446,918,quirky,1358984051
465547,138446,2396,topless scene,1358973995
465563,138472,923,rise to power,1194037967


In [14]:
genome_scores

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02500
1,1,2,0.02500
2,1,3,0.05775
3,1,4,0.09675
...,...,...,...
3004988,3000,1125,0.04050
3004989,3000,1126,0.04525
3004990,3000,1127,0.08525
3004991,3000,1128,0.02475


## 2. 理论背景

粗略地说，推荐系统有三种类型（不包括简单的评级方法）：

- 基于内容的推荐

- 协同过滤

- 混合模型

“基于内容的推荐”是一个回归问题，我们把电影内容作为特征，对用户对电影的评分做预测。

“协同过滤”中，一般无法提前获得内容特征。是通过用户之间的相似度（用户们给了用一个电影相同的评级）和电影之间的相似度（有相似用户评级的电影）来学习潜在特征，同时预测用户对电影的评分。在学习了电影的特征之后，我们便可以衡量电影之间的相似度，并根据用户历史观影信息，向他/她推荐最相似的电影。

“基于内容的推荐”和“协同过滤”是10多年前最先进的技术。很显然，现在有很多模型和算法可以提高预测效果。比如，针对事先缺乏用户电影评分信息的情况，可以使用隐式矩阵分解，用偏好和置信度取代用户电影打分——比如用户对电影推荐有多少次点击，以此进行协同过滤。另外，我们还可以将“内容推荐”与“协同过滤”的方法结合起来，将内容作为侧面信息来提高预测精度。这种混合方法，可以用“学习进行排序”（”Learning to Rank” ）算法来实现。

在该项目中，采用的方法是“协同过滤”。首先，用电影和用户相似度来找出相似度最高的海报，并基于相似度做电影推荐。然后，我将讨论如何Deep Learning学习潜在特征、做电影推荐。最后会谈谈如何在推荐系统中使用深度学习。

## 3. 数据整合和变量的初始化

> 将后面常用的数据整合到一张表中

变量

- rating_count: 每一部电影评分的人数
- rating_sum: 每一部电影获得评分的总数
- n_users: 总用户数
- n_items: 总电影数

In [15]:
# 每一部电影评分的人数
# how='right' 除了含两个表里都有的行，也会将第二个表中独有的行合并。
# groupby() size跟count的区别： size计数时包含NaN值，而count不包含NaN值
rating_count = pd.merge(movies, ratings, on='movieId', how='right').groupby('movieId').size()
rating_count = pd.DataFrame({'rating count':rating_count.values, "movieId":rating_count.index})
rating_count

Unnamed: 0,movieId,rating count
0,1,166
1,2,78
2,3,45
3,4,10
...,...,...
2393,2996,2
2394,2997,86
2395,2998,1
2396,3000,30


In [16]:
# 每一部电影获得评分的总数
# how='left' 除了含两个表里都有的行，也会将第一个表中独有的行合并。
# groupby().sum -- 非NA值的和
rating_sum = pd.merge(movies, ratings, on='movieId', how='left').groupby('movieId').sum()
rating_sum = pd.DataFrame({'rating sum':rating_sum.rating, "movieId":rating_sum.index})
rating_sum

Unnamed: 0_level_0,movieId,rating sum
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,663.5
2,2,251.0
3,3,147.5
4,4,30.0
...,...,...
2997,2997,347.0
2998,2998,4.0
2999,2999,0.0
3000,3000,118.0


In [17]:
# 融合电影表
movies = pd.merge(movies, rating_count, on='movieId', how='left')
movies = pd.merge(movies, rating_sum, on='movieId', how='left')
movies = movies.fillna(0)
movies

Unnamed: 0,movieId,title,genres,rating count,rating sum
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,166.0,663.5
1,2,Jumanji (1995),Adventure|Children|Fantasy,78.0,251.0
2,3,Grumpier Old Men (1995),Comedy|Romance,45.0,147.5
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,10.0,30.0
...,...,...,...,...,...
2911,2997,Being John Malkovich (1999),Comedy|Drama|Fantasy,86.0,347.0
2912,2998,Dreaming of Joseph Lees (1999),Drama|Romance,1.0,4.0
2913,2999,Man of the Century (1999),Comedy,0.0,0.0
2914,3000,Princess Mononoke (Mononoke-hime) (1997),Action|Adventure|Animation|Drama|Fantasy,30.0,118.0


In [18]:
## 统计数据中的用户量和电影量
# .value_counts() 统计不同值的个数，不包括 NaN；unique() 用来展示每个不同的值，包括 NaN。
n_users = ratings.userId.unique().shape[0]
n_items = ratings.movieId.unique().shape[0]

print (str(n_users) + ' users')
print (str(n_items) + ' movies')

500 users
2397 movies


## 4. relevance 矩阵

In [19]:
# as_blocks() 和 as_matrix()，分別用於將 DataFrame 轉化為以數據類型為鍵值的字典和將 DataFrame 轉化為二維數組
relevance = np.zeros((movies_num, tagIds_num))
relevance[genome_scores.movieId.as_matrix() - 1,genome_scores.tagId.as_matrix() - 1] = \
        genome_scores.relevance.as_matrix()
relevance.shape

(3000, 1128)

In [20]:
# 将用户打的标签改成标签的 tagId 来方便做相似度计算
tags = pd.merge(tags, genome_tags, on='tag', how='left')
tags

Unnamed: 0,userId,movieId,tag,timestamp,tagId
0,65,208,dark hero,1368150078,288.0
1,65,353,dark hero,1368150079,288.0
2,65,521,noir thriller,1368149983,712.0
3,65,592,dark hero,1368150078,288.0
...,...,...,...,...,...
133701,138446,918,halloween scene,1358984062,
133702,138446,918,quirky,1358984051,829.0
133703,138446,2396,topless scene,1358973995,
133704,138472,923,rise to power,1194037967,


## 5. 构建标签矩阵,对应电影有该tag标记为1

In [21]:
tag_mat = np.zeros((movies_num, tagIds_num), dtype=np.int0) # 这里数据类型改成 np.int0 可以减少内存的消耗
# 去除 tagId 中的NAN
tags_new = tags[tags.tagId.isnull().values == False].copy()
tags_new[['tagId']] = tags_new[['tagId']].astype(np.int16)
tag_mat[tags_new.movieId.as_matrix() - 1, tags_new.tagId.as_matrix() - 1] = 1

## 6. 构建IDF矩阵

> [TF-IDF与余弦相似性的应用（一）：自动提取关键词](http://www.ruanyifeng.com/blog/2013/03/tf-idf.html)

> [机器学习：生动理解TF-IDF算法](https://zhuanlan.zhihu.com/p/31197209)

> [TFIDF算法及应用](https://blog.csdn.net/zhangpinghao/article/details/20881553)

<img src="_asset/IDF.png">

如果一个词越常见，那么分母就越大，逆文档频率就越小越接近 0。分母之所以要加 1，是为了避免分母为 0（即所有文档都不包含该词）。log 表示对得到的值取对数。

TF-IDF 的主要思想是：

> 如果某个词或短语在一篇文章中出现的频率 TF 高，并且在其他文章中很少出现，则认为此词或者短语具有很好的类别区分能力，适合用来分类。

> TF-IDF 实际上是：`TF * IDF`，TF 词频(Term Frequency)，IDF 逆向文件频率(Inverse Document Frequency)。
> - TF 表示词条在文档 d 中出现的频率。
> - IDF 的主要思想是：如果包含词条 t 的文档越少，也就是 n 越小，IDF 越大，则说明词条 t 具有很好的类别区分能力。

> 如果某一类文档 C 中包含词条 t 的文档数为 m，而其它类包含 t 的文档总数为 k，显然所有包含 t 的文档数 n=m+k，当 m 大的时候，n 也大，按照 IDF 公式得到的 IDF 的值会小，就说明该词条 t 类别区分能力不强。但是实际上，如果一个词条在一个类的文档中频繁出现，则说明该词条能够很好代表这个类的文本的特征，这样的词条应该给它们赋予较高的权重，并选来作为该类文本的特征词以区别与其它类文档。这就是 IDF 的不足之处. 在一份给定的文件里，词频（term frequency，TF）指的是某一个给定的词语在该文件中出现的频率。这个数字是对词数(term count)的归一化，以防止它偏向长的文件。（同一个词语在长文件里可能会比短文件有更高的词数，而不管该词语重要与否。）


pandas 对数据进行操作时，经常需要在横轴方向或者数轴方向对数据进行操作，这时需要设定参数axis的值：

- axis = 0 代表对横轴操作，也就是第0轴；
- axis = 1 代表对纵轴操作，也就是第1轴；

<img src="_asset/pandas_axis.png">

In [22]:
# tag_num 表示每个 tag 共出现在多少部电影中
tag_num = tag_mat.sum(axis=0) + 0.01
IDF_mat = np.log(movies_num / tag_num)
TF_IDF_mat = relevance * IDF_mat
TF_IDF_mat

array([[ 0.14256957,  0.14256957,  0.30992093, ...,  0.23181627,
         0.47108957,  0.12697444],
       [ 0.22668562,  0.24949676,  0.20258901, ...,  0.09432524,
         0.54682745,  0.10351177],
       [ 0.24807106,  0.31222737,  0.15026469, ...,  0.12470117,
         0.5877259 ,  0.10213162],
       ..., 
       [ 0.14114388,  0.14399527,  0.2079556 , ...,  0.07673918,
         0.3953517 ,  0.08556973],
       [ 0.14827236,  0.17536058,  0.18514757, ...,  0.1358923 ,
         0.41958782,  0.0897102 ],
       [ 0.14542097,  0.1340154 ,  0.27503806, ...,  0.28937066,
         0.5165323 ,  0.13663554]])

## 7. 显示电影海报

In [104]:
from IPython.display import Image
from IPython.display import display
import requests
import json
import re

def get_poster(recom_item):
    posters = []
    titles = recom_item.title
    imdbIds = pd.merge(recom_item, links).imdbId
    for imdbId in imdbIds:
        file_path = ""
        #Get posters from Movie Database by API
        headers = {'Accept':'application/json'}
        payload = {'api_key':'20047cd838219fb54d1f8fc32c45cda4'}
        response = requests.get('http://api.themoviedb.org/3/configuration', params=payload, headers=headers)
        response = json.loads(response.text)

        base_url = response['images']['base_url'] + 'w185'
        
        #query themovie.org API for movie poster path.
        imdb_id = 'tt0{0}'.format(imdbId)
        movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(imdb_id)
        response = requests.get(movie_url, params=payload, headers=headers)
        try:
            file_path = json.loads(response.text)['posters'][0]['file_path']
        except:
            print('Something wrong, cannot get the poster for imdb id: {0}!'.format(imdbId))
        
        if file_path == "":
            posters.append("Failed：imdbId({}) image get failed!".format(imdbId))
        else:
            url = base_url + file_path
            posters.append(Image(url=url))
        
    for index, poster, title in zip(range(len(posters)), posters, titles):
        #print(str(index + 1) + '、' + title)
        display(poster)

## 8. 输出推荐的结果

In [24]:
def show_user_movies(recom_movies, poster=False):
    if(poster):
        get_poster(recom_movies)
    else:
        titles = recom_movies.title
        for index, title in zip(range(titles.size), titles):
            print(str(index + 1) + '、' + title)

### 8.1 推荐最受欢迎的电影

根据每一部电影获得评分的总数排序取 Top K

In [25]:
# Always recommend the most popular item（推荐最受欢迎的 item）
def always_most_popular_recommendation(userId=0, top=10, poster=False, show=True):
    recom_movies = movies.sort_values('rating sum', ascending=False)[0:top]
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

### 8.2 使用精确的 KNN 搜索进行基于内容的推荐


In [26]:
# 先得到 ratings 矩阵
user_ratings = np.zeros((users_num,movies_num))
user_ratings[ratings.userId.as_matrix()-1, ratings.movieId.as_matrix()-1] = ratings.rating.as_matrix()
new_ratings = np.zeros((users_num,movies_num))
for i in range(user_ratings.shape[0]):
    average = np.mean(user_ratings[i,:][user_ratings[i,:].nonzero()])
    where = user_ratings[i,:].nonzero()
    if len(where[0]) > 0:
        ## 每个人的 ratings[i,:][where]-average 和为 0，+1/len(where) 使他们的加权和为 1
        new_ratings[i,:][where] = user_ratings[i,:][where] - average + 1 / len(where[0]) 

进行矩阵运算，计算出 user_tag 矩阵(各用户‘最喜爱的电影’应具备的标签)

In [27]:
user_tag_mat = new_ratings.dot(TF_IDF_mat)
user_tag_mat[0:4,0:4]

array([[ -1.01631945,  -0.94051276,   2.5041473 ,  -1.24897375],
       [  6.78722826,   6.52602043,  -3.03289351,   5.93997794],
       [ -5.80529842,  -5.51091601,  -9.28510587,  16.3869413 ],
       [  3.65818252,   3.06275088,  -0.8300016 ,  -0.90054341]])

> [sklearn文档 — 1.6. 最近邻](http://sklearn.apachecn.org/cn/latest/modules/neighbors.html)

无监督的最近邻学习，算法的选择通过过关键字 'algorithm' 来控制， 并必须是 ['auto', 'ball_tree', 'kd_tree', 'brute'] 其中的一个。

In [86]:
def content_based_exact_knn_recommendation(userId=1, top=10, poster=False, show=True):
    from sklearn.neighbors import NearestNeighbors
    user_tag = user_tag_mat[userId,:]
    nbrs = NearestNeighbors(n_neighbors=top, algorithm='ball_tree').fit(TF_IDF_mat)
    distances, indices = nbrs.kneighbors(user_tag)
    recom_movies = pd.DataFrame({'movieId':indices[0] + 1})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')

    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

### 8.3 基于用户对用户，电影对电影的协同过滤推荐

映射数据，因为电影的 id 和电影的比例很大，所以全部有点浪费

In [29]:
new_rating = ratings.copy()
new_n_users = users_num
new_n_items = new_rating.movieId.unique().shape[0]
movie_map = pd.Series(index=new_rating.movieId.unique(), data=range(new_n_items))

new_rating.movieId = movie_map[new_rating.movieId.values].values
new_rating.movieId.describe()
new_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,0,3.5,1112486027
1,1,1,3.5,1112484676
2,1,2,3.5,1112484819
3,1,3,3.5,1112484727
4,1,4,3.5,1112484580


In [115]:
new_rating.T.head()

Unnamed: 0,0,1,2,3,...,71505,71506,71507,71508
userId,1.0,1.0,1.0,1.0,...,500.0,500.0,500.0,500.0
movieId,0.0,1.0,2.0,3.0,...,129.0,433.0,96.0,98.0
rating,3.5,3.5,3.5,3.5,...,4.5,3.0,4.5,4.5
timestamp,1112486000.0,1112485000.0,1112485000.0,1112485000.0,...,1337181000.0,1337180000.0,1337181000.0,1337182000.0


计算评分矩阵

In [30]:
rating_matrix = np.zeros((new_n_users, new_n_items))
for row in new_rating.itertuples():
    rating_matrix[row[1]-1, row[2]-1] = row[3]
rating_matrix[0:4,0:4]

array([[ 3.5,  3.5,  3.5,  3.5],
       [ 0. ,  0. ,  0. ,  0. ],
       [ 0. ,  4. ,  0. ,  5. ],
       [ 0. ,  1. ,  0. ,  0. ]])

In [31]:
# 计算矩阵稀疏性
sparsity = float(len(rating_matrix.nonzero()[0]))
sparsity /= (rating_matrix.shape[0] * rating_matrix.shape[1])
sparsity *= 100
print ('Sparsity: {:4.2f}%'.format(sparsity))

Sparsity: 3.77%


In [32]:
def similarity(new_rating, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling dived-by-zero errors
    if kind == 'user':
        sim = new_rating.dot(new_rating.T) + epsilon
    elif kind == 'item':
        sim = new_rating.T.dot(new_rating) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return (sim / norms / norms.T)

计算 user-user 和 item-item 相似矩阵

In [33]:
user_similarity = similarity(rating_matrix, kind='user')
item_similarity = similarity(rating_matrix, kind='item')

#### 8.3.1 用户对用户 CF 推荐


In [34]:
def user_to_User_cf_recommendation(userId=1, top=10, poster=False, show=True):
    # 1、获取和用户相似度高的用户
    similar_user = np.argsort(user_similarity[userId,:])[:-k-1:-1].flatten()
    similar_user = pd.DataFrame({"userId":similar_user})
    similar_user_movies = pd.merge(similar_user, ratings, on='userId')
    
    # 2、获取待推荐的电影
    similar_user_movies = similar_user_movies[similar_user_movies.rating >= 3.5]
    recom_movie_ids = similar_user_movies.groupby('movieId').size().sort_values(ascending=False)[0:top].index
    recom_movies = pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    # 3、是否显示
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

In [65]:
def user_to_User_cf_recommendation_2(userId=1, top=10, poster=False, show=True):
    # 1、 用户 1 评分大于等于 4 的电影
    user_high_score_movies = ratings[(ratings.userId == userId) &  (ratings.rating >= 4)]
    user_high_score = pd.DataFrame({"movieId":user_high_score_movies.movieId})
    
    # 2、 在评分中寻找，看过上面电影，并打分超过 4.5 的项目
    find_similar = pd.merge(user_high_score,ratings,on='movieId')
    find_similar = find_similar[(find_similar.userId != userId) & (find_similar.rating >= 4.5)]
    
    # 3、 寻找和用户最相似的十个人
    similar_users = find_similar.groupby('userId').apply(lambda x:len(x['movieId'].unique())).sort_values(ascending=False)[0:10]
    similar_users = pd.DataFrame({"userId":similar_users.index})
    
    # 4、 筛选对十个人看的所有电影中打分超过 4.5 的电影
    similar_users_movies = pd.merge(similar_users,ratings,on='userId')
    similar_users_movies = similar_users_movies[similar_users_movies.rating >= 4.5]
    
    # 5、 按照十个人中看的数目排名取前 50
    similar_users_movies = similar_users_movies.groupby('movieId').apply(lambda x:len(x['userId'].unique()))
    similar_users_movies = similar_users_movies.sort_values(ascending=False)[0:50]
    similar_users_movies = pd.DataFrame({"movieId":similar_users_movies.index})
    
    # 6、 用户映射向量表
    similar_users.loc[10,] = userId
    
    # 7、 融合电影和评分
    c = pd.merge(similar_users, ratings, on='userId', how='left')
    c = pd.merge(similar_users_movies, c, on='movieId', how='left')
    c = c.loc[:,['movieId', 'userId', 'rating']]
    
    # 8、 user-item 矩阵
    R = np.zeros((11,50))
    for i in range(11):
        for j in range(50):
            temp_i,temp_j = similar_users.iloc[i,0],similar_users_movies.iloc[j,0]
            rate = c[(c.movieId == temp_j) & (c.userId == temp_i)]
            if(rate.size == 0):
                R[i][j] = 0
            else:
                R[i][j] = rate.iloc[0,2]
    
    # 9、 MF 计算 50 部电影中缺少的评分，并取前
    N ,M = len(R), len(R[0])
    K = 2
    P = np.random.rand(N, K)
    Q = np.random.rand(M, K)
    nP, nQ = matrix_factorization(R, P, Q, K, 300)
    nR = np.dot(nP, nQ.T)
    
    # 10、 计算对应用户 1 评分高的电影，取前十
    result = pd.DataFrame(nR).loc[[10]].T.sort_values(10, ascending=False)[0:10]
    recom_movie_ids = list(similar_users_movies.loc[list(result.index),].movieId)
    recom_movies = pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    # 11、是否显示
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

#### 8.3.2 电影对电影 CF 推荐


In [36]:
def item_to_item_cf_recommendation(userId=1, top=10, poster=False, show=True):
    # 1、获取用户的电影 id
    user_love_movie_ids = new_rating[(new_rating.userId == userId) & (new_rating.rating >= 4.5)].movieId
    
    # 2、通过相似矩阵来获得相近的电影
    recom_movie_ids = np.argsort(item_similarity[user_love_movie_ids,:])[:,:-k-1:-1].flatten()
    recom_movie_ids = pd.value_counts(c)[0:k].index
    recom_movie_ids = list(movie_map.iloc[recom_movie_ids].index)
    recom_movies = pd.DataFrame({'movieId':recom_movie_ids})
    recom_movies = pd.merge(recom_movies, movies, how='left', on='movieId')
    
    # 3、是否显示
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies

### 8.4 基于矩阵分解的推荐

矩阵分解

In [37]:
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    Q = Q.T
    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = np.dot(P,Q)
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta / 2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        if e < 0.001:
            break
    return P, Q.T

## 9. 推荐结果

### 9.1 推荐最受欢迎的电影

In [101]:
movies = always_most_popular_recommendation(top=5, poster=True)

### 9.2  使用精确的 KNN 搜索进行基于内容的推荐

In [102]:
content_based_exact_knn_recommendation(userId=100, top=5, poster=True)

Something wrong, cannot get the poster for imdb id: 99685!


'Failed：99685 image get failed!'

Unnamed: 0,movieId,title,genres,rating count,rating sum
0,50,,,,
1,318,"Shawshank Redemption, The (1994)",Crime|Drama,218.0,970.0
2,1213,,,,
3,2858,,,,
4,2329,,,,


### 9.3 用户对用于的 CF 推荐

In [105]:
user_to_User_cf_recommendation_2(userId=100, top=5, poster=True)

Something wrong, cannot get the poster for imdb id: 75686!
Something wrong, cannot get the poster for imdb id: 47296!
Something wrong, cannot get the poster for imdb id: 86190!


'Failed：imdbId(75686) image get failed!'

'Failed：imdbId(47296) image get failed!'

'Failed：imdbId(86190) image get failed!'

Unnamed: 0,movieId,title,genres,rating count,rating sum
0,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,248.0,1009.5
1,2571,,,,
2,1230,,,,
3,527,,,,
...,...,...,...,...,...
6,1945,,,,
7,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,213.0,872.5
8,2858,,,,
9,1210,,,,
