## Movie Recommendation System

> [https://cloud.tencent.com/developer/article/1058235](https://cloud.tencent.com/developer/article/1058235)

> [When Recommender Systems Meet Deep Learning](https://blog.csdn.net/somTian/article/details/71516613)

> [Given an IMDB movie ID, how do I get its poster image programmatically?](https://oomake.com/question/17663)

> [W1_017陈佳豪](https://www.kesci.com/home/project/5a73e21659774204c69f2559) -- includes code for fetching posters

> [TMDB movie data analysis](https://www.kesci.com/home/project/5afa3eaf3878b214d9ca9fff)

> [Fun project, episode 2: IMDB movie data analysis](https://www.kesci.com/home/project/5ad5a5687238515d80b55cba)

> [A movie recommendation system (graduation project)](https://github.com/JaniceWuo/MovieRecommend)

> [Movie recommendation example: comparing collaborative filtering with DL feature extraction](https://blog.csdn.net/qq_32453673/article/details/72593675)

> [Applications of TF-IDF and cosine similarity (1): automatic keyword extraction](http://www.ruanyifeng.com/blog/2013/03/tf-idf.html)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
import tensorflow as tf

import os
import pickle
import re
from tensorflow.python.ops import math_ops

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile
import hashlib
import shutil  # used by download_extract to clean up after a failed extraction


class DLProgress(tqdm):
    """
    Handle Progress Bar while Downloading
    """
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        """
        A hook function that will be called once on establishment of the network connection and
        once after each block read thereafter.
        :param block_num: A count of blocks transferred so far
        :param block_size: Block size in bytes
        :param total_size: The total size of the file. This may be -1 on older FTP servers which do not return
                            a file size in response to a retrieval request.
        """
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

def _unzip(save_path, _, database_name, data_path):
    """
    Unzip wrapper with the same interface as _ungzip
    :param save_path: The path of the gzip files
    :param database_name: Name of database
    :param data_path: Path to extract to
    :param _: HACK - Used to have to same interface as _ungzip
    """
    print('Extracting {}...'.format(database_name))
    with zipfile.ZipFile(save_path) as zf:
        zf.extractall(data_path)

def download_extract(database_name, data_path):
    """
    Download and extract database
    :param database_name: Database name
    """
    DATASET_ML20M = 'ml-20m'

    if database_name == DATASET_ML20M:
        url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
        hash_code = 'cd245b17a1ae2cc31bb14903e1204af3'
        extract_path = os.path.join(data_path, database_name)
        save_path = os.path.join(data_path, database_name + '.zip')
        extract_fn = _unzip
    else:
        # Fail early instead of hitting undefined variables further down
        raise ValueError('Unknown database: {}'.format(database_name))

    if os.path.exists(extract_path):
        print('Found {} Data'.format(database_name))
        return

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(save_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Downloading {}'.format(database_name)) as pbar:
            urlretrieve(
                url,
                save_path,
                pbar.hook)

    assert hashlib.md5(open(save_path, 'rb').read()).hexdigest() == hash_code, \
        '{} file is corrupted.  Remove the file and try again.'.format(save_path)

    os.makedirs(extract_path)
    try:
        extract_fn(save_path, extract_path, database_name, data_path)
    except Exception as err:
        shutil.rmtree(extract_path)  # Remove extraction folder if there is an error
        raise err

    print('Done.')
    # Remove compressed data
    os.remove(save_path)
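
The integrity check in `download_extract` reads the whole archive into memory before hashing it. A minimal sketch of a chunked alternative follows; the helper name `md5_of_file` and its `chunk_size` parameter are assumptions for illustration and are not part of the original notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    `md5_of_file` and `chunk_size` are illustrative names, not notebook code.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could be compared against `hash_code` in place of the one-shot `hashlib.md5(open(...).read())` call, which also closes the file handle reliably.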

In [3]:
data_dir = './'
download_extract('ml-20m', data_dir)

Downloading ml-20m: 199MB [02:50, 1.17MB/s]                           


Extracting ml-20m...
Done.


## 1. Data source

[MovieLens ml-20m](https://grouplens.org/datasets/movielens/20m/)

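MovieLens 20M dataset


The dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. It also includes tag genome data with 12 million relevance scores across 1,100 tags.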

In [4]:
# Load the libraries commonly used for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Set display options
pd.set_option('display.max_columns', 8)
pd.set_option('display.max_rows', 8)

In [5]:
ratings = pd.read_csv('ml-20m/ratings.csv')
movies = pd.read_csv('ml-20m/movies.csv')
links = pd.read_csv('ml-20m/links.csv')
genome_tags = pd.read_csv('ml-20m/genome-tags.csv')
tags = pd.read_csv('ml-20m/tags.csv')
genome_scores = pd.read_csv('ml-20m/genome-scores.csv')

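### 1.1 Selecting a subset of the data for analysis

Variables

- users_num: keep only ratings whose userId is less than or equal to users_num
- movies_num: keep only movies whose movieId is less than or equal to movies_num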

In [6]:
users_num=500
movies_num=3000

ratings = ratings[(ratings.movieId <= movies_num) & (ratings.userId <= users_num)]
genome_scores = genome_scores[genome_scores.movieId <= movies_num]
tags = tags[tags.movieId <= movies_num]
tagIds_num = genome_tags.index.size
movies = movies[movies.movieId <= movies_num]
links = links[links.movieId <= movies_num]

In [7]:
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 1112486027 |
| 1 | 1 | 29 | 3.5 | 1112484676 |
| 2 | 1 | 32 | 3.5 | 1112484819 |
| 3 | 1 | 47 | 3.5 | 1112484727 |
| ... | ... | ... | ... | ... |
| 71505 | 500 | 2858 | 4.5 | 1337181077 |
| 71506 | 500 | 2881 | 3.0 | 1337179627 |
| 71507 | 500 | 2959 | 4.5 | 1337180948 |
| 71508 | 500 | 3000 | 4.5 | 1337181745 |


In [8]:
movies

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| ... | ... | ... | ... |
| 2911 | 2997 | Being John Malkovich (1999) | Comedy\|Drama\|Fantasy |
| 2912 | 2998 | Dreaming of Joseph Lees (1999) | Drama\|Romance |
| 2913 | 2999 | Man of the Century (1999) | Comedy |
| 2914 | 3000 | Princess Mononoke (Mononoke-hime) (1997) | Action\|Adventure\|Animation\|Drama\|Fantasy |


In [9]:
links

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| ... | ... | ... | ... |
| 2911 | 2997 | 120601 | 492.0 |
| 2912 | 2998 | 144178 | 108346.0 |
| 2913 | 2999 | 154827 | 98480.0 |
| 2914 | 3000 | 119698 | 128.0 |


In [10]:
genome_tags

| | tagId | tag |
|---|---|---|
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| ... | ... | ... |
| 1124 | 1125 | wuxia |
| 1125 | 1126 | wwii |
| 1126 | 1127 | zombie |
| 1127 | 1128 | zombies |


In [11]:
tags

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 1 | 65 | 208 | dark hero | 1368150078 |
| 2 | 65 | 353 | dark hero | 1368150079 |
| 3 | 65 | 521 | noir thriller | 1368149983 |
| 4 | 65 | 592 | dark hero | 1368150078 |
| ... | ... | ... | ... | ... |
| 465545 | 138446 | 918 | halloween scene | 1358984062 |
| 465546 | 138446 | 918 | quirky | 1358984051 |
| 465547 | 138446 | 2396 | topless scene | 1358973995 |
| 465563 | 138472 | 923 | rise to power | 1194037967 |


In [12]:
genome_scores

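| | movieId | tagId | relevance |
|---|---|---|---|
| 0 | 1 | 1 | 0.02500 |
| 1 | 1 | 2 | 0.02500 |
| 2 | 1 | 3 | 0.05775 |
| 3 | 1 | 4 | 0.09675 |
| ... | ... | ... | ... |
| 3004988 | 3000 | 1125 | 0.04050 |
| 3004989 | 3000 | 1126 | 0.04525 |
| 3004990 | 3000 | 1127 | 0.08525 |
| 3004991 | 3000 | 1128 | 0.02475 |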


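## 2. Theoretical background

Roughly speaking, there are three types of recommender systems (not counting simple rating-based approaches):

- Content-based recommendation

- Collaborative filtering

- Hybrid models

"Content-based recommendation" is a regression problem: movie content serves as the features, and the model predicts the rating a user would give a movie.

In "collaborative filtering", content features are generally not available in advance. Instead, latent features are learned from the similarity between users (users who gave the same movie the same rating) and the similarity between movies (movies rated alike by similar users), while user ratings are predicted at the same time. Once the movie features have been learned, we can measure the similarity between movies and, based on a user's viewing history, recommend the movies most similar to what he or she has watched.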
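
To make the idea of movies with similar rating patterns concrete, here is a minimal sketch on a made-up rating matrix of 4 users by 3 movies. The matrix values and the function name `item_cosine_similarity` are illustrative assumptions; the notebook itself does not define them.

```python
import numpy as np


def item_cosine_similarity(ratings_matrix):
    """Cosine similarity between the columns (movies) of a user x movie matrix."""
    norms = np.linalg.norm(ratings_matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated movies
    normalized = ratings_matrix / norms
    return normalized.T @ normalized


# Toy data: rows are users, columns are movies, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

sim = item_cosine_similarity(toy_ratings)
# Movies 0 and 1 are rated alike by the same users, so they score higher
# with each other than either does with movie 2.
print(np.round(sim, 2))
```

Treating a missing rating as 0 is a simplification; it keeps the sketch short but biases the similarity for sparsely rated movies.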

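"Content-based recommendation" and "collaborative filtering" were state of the art more than 10 years ago. Many models and algorithms can now improve on their predictions. For example, when explicit user ratings are not available, implicit matrix factorization replaces the ratings with a preference and a confidence level (for instance, how many times a user clicked on a recommended movie) and runs collaborative filtering on those. The two approaches can also be combined, using content as side information to improve prediction accuracy. Such a hybrid approach can be implemented with "Learning to Rank" algorithms.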
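
The preference and confidence idea can be sketched as follows. The formulation used here (preference is 1 if the user interacted at all, confidence is `1 + alpha * count`) is one common choice from the implicit-feedback matrix factorization literature, not something the notebook specifies; the click counts and the value of `alpha` are made up for illustration.

```python
import numpy as np


def preference_and_confidence(click_counts, alpha=40.0):
    """Turn raw interaction counts into binary preferences and confidence weights."""
    preference = (click_counts > 0).astype(np.float64)
    confidence = 1.0 + alpha * click_counts
    return preference, confidence


# Toy click counts: rows are users, columns are movies.
clicks = np.array([
    [0, 3, 1],
    [2, 0, 0],
])
pref, conf = preference_and_confidence(clicks, alpha=40.0)
```

A factorization model would then fit `pref` while weighting each squared error by the matching entry of `conf`, so that frequently clicked movies count for more than unobserved ones.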

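This project uses collaborative filtering. First, movie and user similarity is used to find the most similar posters, and movies are recommended based on that similarity. Then I discuss how deep learning can learn latent features and make recommendations. Finally, I touch on how deep learning can be used in recommender systems.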

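## 3. Data consolidation and variable initialization

> Consolidate the data used frequently later on into a single table

Variables

- rating_count: number of users who rated each movie
- rating_sum: sum of all ratings each movie received
- n_users: total number of users
- n_items: total number of movies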

In [13]:
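# Number of users who rated each movie
# how='right' keeps the rows present in both tables plus the rows found only in the second table.
# groupby(): size vs. count: size includes NaN values when counting, count does not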
rating_count = pd.merge(movies, ratings, on='movieId', how='right').groupby('movieId').size()
rating_count = pd.DataFrame({'rating count':rating_count.values, "movieId":rating_count.index})
rating_count

| | movieId | rating count |
|---|---|---|
| 0 | 1 | 166 |
| 1 | 2 | 78 |
| 2 | 3 | 45 |
| 3 | 4 | 10 |
| ... | ... | ... |
| 2393 | 2996 | 2 |
| 2394 | 2997 | 86 |
| 2395 | 2998 | 1 |
| 2396 | 3000 | 30 |
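
The difference between `size` and `count` noted in the cell's comments is easiest to see on a tiny frame with missing ratings. The data below is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, np.nan, 3.0, 5.0, np.nan],
})

grouped = toy.groupby('movieId')
sizes = grouped.size()               # rows per group, NaN included
counts = grouped['rating'].count()   # non-NaN ratings per group
print(sizes.to_dict(), counts.to_dict())
```

With a right merge onto `ratings`, every row carries a rating, so the two agree there; they diverge as soon as a left merge introduces movies without ratings.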


In [14]:
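# Sum of all ratings each movie received
# how='left' keeps the rows present in both tables plus the rows found only in the first table.
# groupby().sum() adds up the non-NA values; selecting `rating` first avoids summing the text columns
rating_sum = pd.merge(movies, ratings, on='movieId', how='left').groupby('movieId').rating.sum()
rating_sum = pd.DataFrame({'rating sum':rating_sum, "movieId":rating_sum.index})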
rating_sum

| movieId | movieId | rating sum |
|---|---|---|
| 1 | 1 | 663.5 |
| 2 | 2 | 251.0 |
| 3 | 3 | 147.5 |
| 4 | 4 | 30.0 |
| ... | ... | ... |
| 2997 | 2997 | 347.0 |
| 2998 | 2998 | 4.0 |
| 2999 | 2999 | 0.0 |
| 3000 | 3000 | 118.0 |


In [15]:
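# Merge the aggregates into the movies table
movies = pd.merge(movies, rating_count, on='movieId', how='left')
# rating_sum has movieId both as its index name and as a column; drop the index to avoid an ambiguous merge key
movies = pd.merge(movies, rating_sum.reset_index(drop=True), on='movieId', how='left')
movies = movies.fillna(0)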
movies

| | movieId | title | genres | rating count | rating sum |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 166.0 | 663.5 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 78.0 | 251.0 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 45.0 | 147.5 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 10.0 | 30.0 |
| ... | ... | ... | ... | ... | ... |
| 2911 | 2997 | Being John Malkovich (1999) | Comedy\|Drama\|Fantasy | 86.0 | 347.0 |
| 2912 | 2998 | Dreaming of Joseph Lees (1999) | Drama\|Romance | 1.0 | 4.0 |
| 2913 | 2999 | Man of the Century (1999) | Comedy | 0.0 | 0.0 |
| 2914 | 3000 | Princess Mononoke (Mononoke-hime) (1997) | Action\|Adventure\|Animation\|Drama\|Fantasy | 30.0 | 118.0 |


In [16]:
## Count the users and movies in the data
# .value_counts() counts how often each distinct value occurs, excluding NaN; unique() lists each distinct value, including NaN.
n_users = ratings.userId.unique().shape[0]
n_items = ratings.movieId.unique().shape[0]

print (str(n_users) + ' users')
print (str(n_items) + ' movies')

500 users
2397 movies
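
A small check of how `unique()`, `value_counts()` and `nunique()` treat missing values, on an invented Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan])

distinct_incl_nan = s.unique()       # distinct values, NaN kept as its own entry
counts_excl_nan = s.value_counts()   # occurrences per value, NaN dropped by default
n_distinct = s.nunique()             # number of distinct non-NaN values
```

Since `userId` and `movieId` contain no missing values in `ratings`, `ratings.userId.nunique()` would likely give the same number as `unique().shape[0]` here.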


## 4. Relevance matrix

In [17]:
# .values returns each column as a NumPy array (as_matrix() was removed in pandas 1.0)
# movieId and tagId are 1-based, so subtract 1 to get 0-based row and column positions
relevance = np.zeros((movies_num, tagIds_num))
relevance[genome_scores.movieId.values - 1, genome_scores.tagId.values - 1] = \
        genome_scores.relevance.values
relevance.shape

(3000, 1128)
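
The assignment above scatters a long (movieId, tagId, relevance) table into a dense matrix with NumPy integer-array indexing. A minimal sketch on invented data, with the 1-based ids shifted to 0-based positions:

```python
import numpy as np
import pandas as pd

toy_scores = pd.DataFrame({
    'movieId': [1, 1, 3],
    'tagId': [1, 2, 2],
    'relevance': [0.9, 0.1, 0.5],
})

dense = np.zeros((3, 2))  # 3 movies x 2 tags
rows = toy_scores.movieId.values - 1
cols = toy_scores.tagId.values - 1
dense[rows, cols] = toy_scores.relevance.values
print(dense)
```

Movie 2 has no scores in the toy table, so its row stays all zeros; the same happens in the notebook for any movieId without genome scores.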

In [18]:
# Map each user-applied tag to its tagId to make the similarity computation easier
tags = pd.merge(tags, genome_tags, on='tag', how='left')
tags

| | userId | movieId | tag | timestamp | tagId |
|---|---|---|---|---|---|
| 0 | 65 | 208 | dark hero | 1368150078 | 288.0 |
| 1 | 65 | 353 | dark hero | 1368150079 | 288.0 |
| 2 | 65 | 521 | noir thriller | 1368149983 | 712.0 |
| 3 | 65 | 592 | dark hero | 1368150078 | 288.0 |
| ... | ... | ... | ... | ... | ... |
| 133701 | 138446 | 918 | halloween scene | 1358984062 | NaN |
| 133702 | 138446 | 918 | quirky | 1358984051 | 829.0 |
| 133703 | 138446 | 2396 | topless scene | 1358973995 | NaN |
| 133704 | 138472 | 923 | rise to power | 1194037967 | NaN |


## 5. Building the tag matrix (1 where a movie carries the tag)

In [19]:
# int8 is enough for 0/1 flags and uses far less memory than the default integer type
tag_mat = np.zeros((movies_num, tagIds_num), dtype=np.int8)
# Drop the rows whose tagId is NaN (user tags that are not in the genome tag list)
tags_new = tags[tags.tagId.notnull()].copy()
tags_new['tagId'] = tags_new['tagId'].astype(np.int16)
tag_mat[tags_new.movieId.values - 1, tags_new.tagId.values - 1] = 1

## 6. Building the IDF matrix

In [20]:
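# tag_num is the number of movies each tag appears on;
# adding 0.01 avoids division by zero for tags that appear on no movie
tag_num = tag_mat.sum(axis=0) + 0.01
IDF_mat = np.log(movies_num / tag_num)
# Use the genome relevance score as the term-frequency part
TF_IDF_mat = relevance * IDF_mat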
TF_IDF_mat

array([[ 0.14256957,  0.14256957,  0.30992093, ...,  0.23181627,
         0.47108957,  0.12697444],
       [ 0.22668562,  0.24949676,  0.20258901, ...,  0.09432524,
         0.54682745,  0.10351177],
       [ 0.24807106,  0.31222737,  0.15026469, ...,  0.12470117,
         0.5877259 ,  0.10213162],
       ..., 
       [ 0.14114388,  0.14399527,  0.2079556 , ...,  0.07673918,
         0.3953517 ,  0.08556973],
       [ 0.14827236,  0.17536058,  0.18514757, ...,  0.1358923 ,
         0.41958782,  0.0897102 ],
       [ 0.14542097,  0.1340154 ,  0.27503806, ...,  0.28937066,
         0.5165323 ,  0.13663554]])
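
A worked toy example of the weighting above, assuming 4 movies and 3 tags. The numbers are invented; the `0.01` smoothing term mirrors the notebook and keeps the logarithm finite for a tag that appears on no movie.

```python
import numpy as np

n_movies = 4
# 1 where a movie carries a tag (rows: movies, columns: tags)
toy_tag_mat = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
])
toy_relevance = np.full((n_movies, 3), 0.5)

tag_freq = toy_tag_mat.sum(axis=0) + 0.01   # movies per tag, smoothed
idf = np.log(n_movies / tag_freq)           # rarer tags get larger weights
tf_idf = toy_relevance * idf                # idf is broadcast across movie rows
print(np.round(idf, 3))
```

The tag carried by every movie ends up with a weight close to zero (slightly negative because of the smoothing), while the tag that no movie carries gets the largest weight.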

## 7. Displaying movie posters

In [21]:
from IPython.display import Image
from IPython.display import display
import requests
import json

def get_poster(recom_item):
    headers = {'Accept':'application/json'}
    payload = {'api_key':'20047cd838219fb54d1f8fc32c45cda4'}

    # Fetch the image base URL once instead of once per movie
    response = requests.get('http://api.themoviedb.org/3/configuration', params=payload, headers=headers)
    base_url = json.loads(response.text)['images']['base_url'] + 'w185'

    posters = []
    # Merging on movieId keeps each title aligned with its imdbId
    merged = pd.merge(recom_item, links)
    for title, imdbId in zip(merged.title, merged.imdbId):
        # IMDb ids are 'tt' followed by the number zero-padded to 7 digits
        imdb_id = 'tt{:07d}'.format(int(imdbId))
        # Query the TMDB API for the movie's poster path
        movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(imdb_id)
        response = requests.get(movie_url, params=payload, headers=headers)
        try:
            file_path = json.loads(response.text)['posters'][0]['file_path']
        except (KeyError, IndexError, ValueError):
            print('Something wrong, cannot get the poster for imdb id: {0}!'.format(imdb_id))
            continue  # skip this movie rather than reuse a stale file_path

        url = base_url + file_path
        posters.append((title, Image(url=url)))

    for index, (title, poster) in enumerate(posters):
        print(str(index + 1) + '、' + title)
        display(poster)

## 8. Displaying the recommendations

In [22]:
def show_user_movies(recom_movies, poster=False):
    if(poster):
        get_poster(recom_movies)
    else:
        titles = recom_movies.title
        for index, title in zip(range(titles.size), titles):
            print(str(index + 1) + '、' + title)

In [23]:
# Always-recommend-the-most-popular-item
def always_most_popular_recommendation(userId=0, top=10, poster=False, show=True):
    recom_movies = movies.sort_values('rating sum', ascending=False)[0:top]
    if(show):
        show_user_movies(recom_movies, poster)
    return recom_movies
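
The baseline above ranks movies by their total rating sum, which favors movies with many ratings. A minimal sketch on an invented table showing the same ranking with `DataFrame.nlargest`; the helper name `top_by_rating_sum` is hypothetical and not part of the notebook.

```python
import pandas as pd


def top_by_rating_sum(movie_table, top=10):
    """Return the `top` rows with the largest 'rating sum' (non-personalized baseline)."""
    return movie_table.nlargest(top, 'rating sum')


toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'rating count': [100, 2, 50],
    'rating sum': [350.0, 10.0, 240.0],
})
print(top_by_rating_sum(toy_movies, top=2).title.tolist())
```

Movie C has a higher average rating than A in this toy table (4.8 against 3.5) yet ranks below it, which illustrates how a sum-based ranking rewards volume.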

## 9. Recommendation results

In [24]:
# Store the result under a new name so the global `movies` table is not overwritten
recom_movies = always_most_popular_recommendation(top=5, poster=True)

1、Pulp Fiction (1994)


2、Forrest Gump (1994)


3、Shawshank Redemption, The (1994)


4、Silence of the Lambs, The (1991)


5、Jurassic Park (1993)
