**第一步：收集和清洗数据**

数据链接：https://grouplens.org/datasets/movielens/

下载文件：ml-latest-small

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

ratings_df = pd.read_csv('./ml-latest-small/ratings.csv')
ratings_df.tail()
#tail命令用于输入文件中的尾部内容。tail命令默认在屏幕上显示指定文件的末尾5行。

Unnamed: 0,userId,movieId,rating,timestamp
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352
100835,610,170875,3.0,1493846415


In [2]:
movies_df = pd.read_csv('./ml-latest-small/movies.csv')
movies_df.tail()

Unnamed: 0,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [3]:
movies_df['movieRow'] = movies_df.index
#生成一列‘movieRow’，等于索引值index
movies_df.tail()

Unnamed: 0,movieId,title,genres,movieRow
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,9737
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,9738
9739,193585,Flint (2017),Drama,9739
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,9740
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy,9741


In [4]:
movies_df = movies_df[['movieRow', 'movieId', 'title']]
#筛选三列出来
movies_df.to_csv('./ml-latest-small/moviesProcessed.csv', index=False, header=True, encoding='utf-8')
#生成一个新的文件moviesProcessed.csv
movies_df.tail()

Unnamed: 0,movieRow,movieId,title
9737,9737,193581,Black Butler: Book of the Atlantic (2017)
9738,9738,193583,No Game No Life: Zero (2017)
9739,9739,193585,Flint (2017)
9740,9740,193587,Bungo Stray Dogs: Dead Apple (2018)
9741,9741,193609,Andrew Dice Clay: Dice Rules (1991)


In [5]:
ratings_df = pd.merge(ratings_df, movies_df, on='movieId')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,movieRow,title
0,1,1,4.0,964982703,0,Toy Story (1995)
1,5,1,4.0,847434962,0,Toy Story (1995)
2,7,1,4.5,1106635946,0,Toy Story (1995)
3,15,1,2.5,1510577970,0,Toy Story (1995)
4,17,1,4.5,1305696483,0,Toy Story (1995)


In [6]:
ratings_df = ratings_df[['userId', 'movieRow', 'rating']]
#筛选出三列
ratings_df.to_csv('./ml-latest-small/ratingsProcessed.csv', index=False, header=True, encoding='utf-8')
#导出一个新的文件ratingsProcessed.csv
ratings_df.head()

Unnamed: 0,userId,movieRow,rating
0,1,0,4.0
1,5,0,4.0
2,7,0,4.5
3,15,0,2.5
4,17,0,4.5


**第二步：创建电影评分矩阵rating和评分纪录矩阵record**

In [7]:
userNo = ratings_df['userId'].max() + 1
#userNo的最大值
movieNo = ratings_df['movieRow'].max() + 1
#movieNo的最大值

In [8]:
rating = np.zeros((movieNo,userNo))
#创建一个值都是0的数据
flag = 0
ratings_df_length = np.shape(ratings_df)[0]
#查看矩阵ratings_df的第一维度是多少
for index,row in ratings_df.iterrows():
    #interrows（），对表格ratings_df进行遍历
    rating[int(row['movieRow']),int(row['userId'])] = row['rating']
    #将ratings_df表里的'movieRow'和'userId'列，填上row的‘评分’
    flag += 1

In [9]:
record = rating > 0
record
record = np.array(record, dtype = int)
#更改数据类型，0表示用户没有对电影评分，1表示用户已经对电影评分
record

array([[0, 1, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

**第三步：构建模型**

In [10]:
def normalizeRatings(rating, record):
    m, n =rating.shape
    #m代表电影数量，n代表用户数量
    rating_mean = np.zeros((m,1))
    #每部电影的平均得分
    rating_norm = np.zeros((m,n))
    #处理过的评分
    for i in range(m):
        idx = record[i,:] !=0
        #每部电影的评分，[i，:]表示每一行的所有列
        rating_mean[i] = np.mean(rating[i,idx])
        #第i行，评过份idx的用户的平均得分；
        #np.mean() 对所有元素求均值
        rating_norm[i,idx] -= rating_mean[i]
        #rating_norm = 原始得分-平均得分
    return rating_norm, rating_mean

In [11]:
rating_norm, rating_mean = normalizeRatings(rating, record)
# 注：如果数据出现较多的NaNN，对后面的运算影响较大

  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)


In [12]:
rating_norm = np.nan_to_num(rating_norm)
#对值为NaNN进行处理，改成数值0
rating_norm

array([[ 0.        , -3.92093023,  0.        , ..., -3.92093023,
        -3.92093023, -3.92093023],
       [ 0.        ,  0.        ,  0.        , ..., -3.43181818,
         0.        ,  0.        ],
       [ 0.        , -3.25961538,  0.        , ..., -3.25961538,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [13]:
rating_mean = np.nan_to_num(rating_mean)
#对值为NaNN进行处理，改成数值0
rating_mean

array([[3.92093023],
       [3.43181818],
       [3.25961538],
       ...,
       [3.5       ],
       [3.5       ],
       [4.        ]])

In [14]:
num_features = 10
X_parameters = tf.Variable(tf.random_normal([movieNo, num_features],stddev = 0.35))
Theta_parameters = tf.Variable(tf.random_normal([userNo, num_features],stddev = 0.35))
#tf.Variables()初始化变量
#tf.random_normal()函数用于从服从指定正太分布的数值中取出指定个数的值，mean: 正态分布的均值。stddev: 正态分布的标准差。dtype: 输出的类型

In [15]:
loss = 1/2 * tf.reduce_sum(((tf.matmul(X_parameters, Theta_parameters, transpose_b = True) - rating_norm) * record) ** 2) + 1/2 * (tf.reduce_sum(X_parameters ** 2) + tf.reduce_sum(Theta_parameters ** 2))
#基于内容的推荐算法模型

**函数解释：**

reduce_sum() 就是求和，reduce_sum( input_tensor, axis=None,  keep_dims=False, name=None, reduction_indices=None)

reduce_sum() 参数解释：

1) input_tensor：输入的张量。

2) axis：沿着哪个维度求和。对于二维的input_tensor张量，0表示按列求和，1表示按行求和，[0, 1]表示先按列求和再按行求和。

3) keep_dims：默认值为False，表示默认要降维。若设为True，则不降维。

4) name：名字。

5) reduction_indices：默认值是None，即把input_tensor降到 0维，也就是一个数。对于2维input_tensor，reduction_indices=0时，按列；reduction_indices=1时，按行。

6) 注意，reduction_indices与axis不能同时设置。

tf.matmul（a,b）,将矩阵 a 乘以矩阵 b，生成a * b

tf.matmul（a,b）参数解释：

1) a：类型为 float16，float32，float64，int32，complex64，complex128 和 rank > 1的张量。

2) b：与 a 具有相同类型和 rank。

3) transpose_a：如果 True，a 在乘法之前转置。

4) transpose_b：如果 True，b 在乘法之前转置。

5) adjoint_a：如果 True，a 在乘法之前共轭和转置。

6) adjoint_b：如果 True，b 在乘法之前共轭和转置。

7) a_is_sparse：如果 True，a 被视为稀疏矩阵。

8) b_is_sparse：如果 True，b 被视为稀疏矩阵。

9) name：操作名称（可选）

In [16]:
optimizer = tf.train.AdamOptimizer(1e-4)
# https://blog.csdn.net/lenbow/article/details/52218551
train = optimizer.minimize(loss)
# Optimizer.minimize对一个损失变量基本上做两件事
# 它计算相对于模型参数的损失梯度。
# 然后应用计算出的梯度来更新变量。

W0903 05:38:52.370069 140108146329408 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


**第四步：训练模型**

In [17]:
# tf.summary的用法 https://www.cnblogs.com/lyc-seu/p/8647792.html
tf.summary.scalar('loss',loss)
#用来显示标量信息

<tf.Tensor 'loss:0' shape=() dtype=string>

In [18]:
summaryMerged = tf.summary.merge_all()
#merge_all 可以将所有summary全部保存到磁盘，以便tensorboard显示。
filename = './movie_tensorborad'
writer = tf.summary.FileWriter(filename)
#指定一个文件用来保存图。
sess = tf.Session()
#https://www.cnblogs.com/wuzhitj/p/6648610.html
init = tf.global_variables_initializer()
sess.run(init)
#运行

In [19]:
for i in range(1000): # origin is 5000, but since this is a transferred connection, timeout will make 5000 not finishing, here, use 1000 instead
    _, movie_summary = sess.run([train, summaryMerged])
    # 把训练的结果summaryMerged存在movie里
    writer.add_summary(movie_summary, i)
    # 把训练的结果保存下来

- 查看训练结果:在终端输入 tensorboard --logir=./

**第五步：评估模型**

In [20]:
Current_X_parameters, Current_Theta_parameters = sess.run([X_parameters, Theta_parameters])
# Current_X_parameters为用户内容矩阵，Current_Theta_parameters用户喜好矩阵
predicts = np.dot(Current_X_parameters,Current_Theta_parameters.T) + rating_mean
# dot函数是np中的矩阵乘法，np.dot(x,y) 等价于 x.dot(y)
errors = np.sqrt(np.sum((predicts - rating)**2))
# sqrt(arr) ,计算各元素的平方根
errors

8124.794265811395

**第六步：构建完整的电影推荐系统**

In [23]:
user_id = input('您要想哪位用户进行推荐？请输入用户编号：')
sortedResult = predicts[:, int(user_id)].argsort()[::-1]
# argsort()函数返回的是数组值从小到大的索引值; argsort()[::-1] 返回的是数组值从大到小的索引值
idx = 0
print('为该用户推荐的评分最高的20部电影是：'.center(80,'='))
# center() 返回一个原字符串居中,并使用空格填充至长度 width 的新字符串。默认填充字符为空格。
for i in sortedResult:
    print('评分: %.2f, 电影名: %s' % (predicts[i,int(user_id)],movies_df.iloc[i]['title']))
    # .iloc的用法：https://www.cnblogs.com/harvey888/p/6006200.html
    idx += 1
    if idx == 20:break

您要想哪位用户进行推荐？请输入用户编号：123
评分: 6.71, 电影名: Happy Go Lovely (1951)
评分: 6.46, 电影名: Hellbenders (2012)
评分: 6.38, 电影名: Galaxy of Terror (Quest) (1981)
评分: 6.37, 电影名: Cheburashka (1971)
评分: 6.35, 电影名: National Lampoon's Bag Boy (2007)
评分: 6.35, 电影名: Mr. Skeffington (1944)
评分: 6.35, 电影名: Sisters (Syostry) (2001)
评分: 6.33, 电影名: The Fox and the Hound 2 (2006)
评分: 6.32, 电影名: Bossa Nova (2000)
评分: 6.31, 电影名: Investigation Held by Kolobki (1986)
评分: 6.29, 电影名: Alien Contamination (1980)
评分: 6.24, 电影名: Thin Line Between Love and Hate, A (1996)
评分: 6.24, 电影名: Tom Segura: Mostly Stories (2016)
评分: 6.24, 电影名: Enter the Void (2009)
评分: 6.21, 电影名: Rififi (Du rififi chez les hommes) (1955)
评分: 6.18, 电影名: I Am Not Your Negro (2017)
评分: 6.18, 电影名: 7 Faces of Dr. Lao (1964)
评分: 6.12, 电影名: Lumberjack Man (2015)
评分: 6.11, 电影名: American History X (1998)
评分: 6.10, 电影名: Babes in Toyland (1934)
