## Fastai deep learning course lesson 4 study notes (part 2)
Shared by: 胡智豪
email: justinhochn@gmail.com

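## Overview
These notes are the second half of lesson 4. The first half covered the StateFarm task; this half introduces collaborative filtering for recommender systems and starts to touch on NLP-related material.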

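## How it works
I recommend downloading Jeremy's Excel spreadsheet first, to see how collaborative filtering produces movie recommendations.

I color-coded the tables in Figure 1 and Figure 2 to make each block easier to explain:
1. Figure 1: this table holds the **actual ratings** that users gave to the movies they have watched.
2. Figure 2, blue region: the **predicted rating** of every user for every movie.
3. Figure 2, green regions: the two green regions hold the user features and the movie features. Each user and each movie is represented by 5 numbers.
4. Figure 2, yellow regions: the two yellow regions hold the **user bias** and the **movie bias**. The user bias compensates for differences between users: some are enthusiastic movie fans and some rarely watch movies, and without it these two extremes would pull the ratings too far apart. The movie bias compensates for differences between movies: some ride on star power but are not actually good, while others are very good but obscure because the cast is not famous, and without it these two extremes would also pull the ratings too far apart.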

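**Computation flow of the collaborative filtering algorithm**
1. Take the **matrix product** of the **user features** and the **movie features**, then add the user and movie **bias** terms to get the user's **predicted rating** for the movie. In terms of the figure: multiply the user's and the movie's green regions, then add the numbers in the yellow regions.
2. Subtract the **actual rating** from the **predicted rating** to get the rating error.
3. Run **gradient descent**, repeatedly **updating the user and movie feature values**, until the rating error is as small as possible.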
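
The three steps above can be sketched in plain Python, without Keras. This is only an illustrative sketch under stated assumptions: the toy ratings, the learning rate, the number of epochs and the helper names `predict` and `train` are all made up for the example and do not come from the course notebook. It also updates the bias terms together with the features, which the Keras models later in these notes do as well.

```python
import random

def predict(user_vec, movie_vec, user_bias, movie_bias):
    """Step 1: dot product of the two feature vectors plus both bias terms."""
    return sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias

def train(ratings, n_users, n_movies, n_factors=5, lr=0.05, epochs=200, seed=0):
    """Fit features and biases with plain SGD on the squared rating error."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    M = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
    ub = [0.0] * n_users   # user bias terms (yellow region)
    mb = [0.0] * n_movies  # movie bias terms (yellow region)
    history = []
    for _ in range(epochs):
        total = 0.0
        for u, m, r in ratings:
            # Step 2: error = predicted rating - actual rating
            err = predict(U[u], M[m], ub[u], mb[m]) - r
            total += err * err
            # Step 3: gradient step (the constant factor 2 is folded into lr)
            for k in range(n_factors):
                grad_u, grad_m = err * M[m][k], err * U[u][k]
                U[u][k] -= lr * grad_u
                M[m][k] -= lr * grad_m
            ub[u] -= lr * err
            mb[m] -= lr * err
        history.append(total / len(ratings))  # mean squared error seen this epoch
    return U, M, ub, mb, history

# Hypothetical toy data: (user index, movie index, actual rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.5)]
U, M, ub, mb, history = train(toy_ratings, n_users=3, n_movies=3)
print("first-epoch MSE: %.4f, last-epoch MSE: %.4f" % (history[0], history[-1]))
```

The error printed for the last epoch should be lower than for the first one, which is the "minimize the rating error" loop that the Keras code below runs on the real data.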

## Code walkthrough
This post only explains the core code from the course; the complete code can be downloaded here.

In [1]:
from theano.sandbox import cuda



In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

ImportError: No module named 'cPickle'

In [2]:
import pandas as pd
import numpy as np
import os

In [3]:
path = 'F:/ml-latest-small/'
model_path = path + 'model/'
if not os.path.exists(model_path):
    os.mkdir(model_path)

In [4]:
batch_size = 64

## Setting up the dataset

In [5]:
ratings = pd.read_csv(path + 'ratings.csv')
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [6]:
len(ratings)

100004

In [7]:
movie_names = pd.read_csv(path+ 'movies.csv').set_index('movieId')['title'].to_dict()

In [8]:
pd.read_csv(path+ 'movies.csv').set_index('movieId')['title']

movieId
1                                          Toy Story (1995)
2                                            Jumanji (1995)
3                                   Grumpier Old Men (1995)
4                                  Waiting to Exhale (1995)
5                        Father of the Bride Part II (1995)
6                                               Heat (1995)
7                                            Sabrina (1995)
8                                       Tom and Huck (1995)
9                                       Sudden Death (1995)
10                                         GoldenEye (1995)
11                           American President, The (1995)
12                       Dracula: Dead and Loving It (1995)
13                                             Balto (1995)
14                                             Nixon (1995)
15                                  Cutthroat Island (1995)
16                                            Casino (1995)
17                             S

In [9]:
movie_names

{1: 'Toy Story (1995)',
 2: 'Jumanji (1995)',
 3: 'Grumpier Old Men (1995)',
 4: 'Waiting to Exhale (1995)',
 5: 'Father of the Bride Part II (1995)',
 6: 'Heat (1995)',
 7: 'Sabrina (1995)',
 8: 'Tom and Huck (1995)',
 9: 'Sudden Death (1995)',
 10: 'GoldenEye (1995)',
 11: 'American President, The (1995)',
 12: 'Dracula: Dead and Loving It (1995)',
 13: 'Balto (1995)',
 14: 'Nixon (1995)',
 15: 'Cutthroat Island (1995)',
 16: 'Casino (1995)',
 17: 'Sense and Sensibility (1995)',
 18: 'Four Rooms (1995)',
 19: 'Ace Ventura: When Nature Calls (1995)',
 20: 'Money Train (1995)',
 21: 'Get Shorty (1995)',
 22: 'Copycat (1995)',
 23: 'Assassins (1995)',
 24: 'Powder (1995)',
 25: 'Leaving Las Vegas (1995)',
 26: 'Othello (1995)',
 27: 'Now and Then (1995)',
 28: 'Persuasion (1995)',
 29: 'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
 30: 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
 31: 'Dangerous Minds (1995)',
 32: 'Twelve Monkeys (a.k.a. 12 Monkeys) (199

In [10]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()

In [11]:
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}

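Map each userId and movieId in ratings to a contiguous integer index (0, 1, 2, ... in order of first appearance), so that they can be used as indices into the embedding layers later.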

In [12]:
ratings.movieId = ratings.movieId.apply(lambda x : movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x : userid2idx[x])
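
To see what this remapping does, here is a minimal sketch on a made-up list of IDs. It uses plain Python instead of pandas; `raw_movie_ids` is hypothetical sample data, and `dict.fromkeys` stands in for `.unique()`, which also returns values in order of first appearance.

```python
raw_movie_ids = [31, 1029, 1061, 31, 1129, 1029]  # made-up sample with repeats

# Unique IDs in order of first appearance (dicts keep insertion order in Python 3.7+)
unique_ids = list(dict.fromkeys(raw_movie_ids))

# Same construction as movieid2idx in the notebook
id2idx = {o: i for i, o in enumerate(unique_ids)}

# Replace every raw ID by its contiguous index
remapped = [id2idx[x] for x in raw_movie_ids]

print(id2idx)    # {31: 0, 1029: 1, 1061: 2, 1129: 3}
print(remapped)  # [0, 1, 2, 0, 3, 1]
```

The point of the remapping is that an embedding layer with `n` rows only accepts indices from `0` to `n - 1`, while the raw MovieLens IDs are sparse and much larger than the number of distinct movies.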

In [13]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 0 | 2.5 | 1260759144 |
| 1 | 0 | 1 | 3.0 | 1260759179 |
| 2 | 0 | 2 | 3.0 | 1260759182 |
| 3 | 0 | 3 | 2.0 | 1260759185 |
| 4 | 0 | 4 | 4.0 | 1260759205 |


In [14]:
user_min, user_max, movie_min, movie_max = (ratings.userId.min(), 
    ratings.userId.max(), ratings.movieId.min(), ratings.movieId.max())
user_min, user_max, movie_min, movie_max

(0, 670, 0, 9065)

In [15]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_users,n_movies

(671, 9066)

Set the number of latent factors

In [16]:
n_factors = 50

In [17]:
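# Call seed(); assigning `np.random.seed = 42` would overwrite the function without seeding anything
np.random.seed(42)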

Randomly split the data into a training set and a validation set

In [18]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]

In [19]:
len(trn),len(val)

(79869, 20135)
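
The mask-based split draws one random number per row, so the training share is only approximately 80%, which is why the counts above are not exactly 80,003 and 20,001. A minimal sketch of the same idea in plain Python follows; the helper name `split_mask`, the row count of 1000 and the use of the standard `random` module instead of NumPy are assumptions made for the example.

```python
import random

def split_mask(n_rows, train_fraction=0.8, seed=42):
    """Return one boolean per row: True means the row goes to the training set."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    return [rng.random() < train_fraction for _ in range(n_rows)]

rows = list(range(1000))  # stand-in for the rows of the ratings table
mask = split_mask(len(rows))

trn_rows = [r for r, keep in zip(rows, mask) if keep]
val_rows = [r for r, keep in zip(rows, mask) if not keep]

# The two counts always add up to 1000, but the training count is only close to 800
print(len(trn_rows), len(val_rows))
```

Calling `split_mask` again with the same seed returns the same mask, so the training and validation sets stay the same between runs.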

## Dot Product

In [78]:
from keras.layers import Input, Dense, merge, Flatten, Activation,  Dropout
from keras.models import Model
from keras.layers import Embedding
from keras import regularizers
from keras import optimizers

In [21]:
user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=regularizers.l2(1e-4))(user_in)
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=regularizers.l2(1e-4))(movie_in)

In [22]:
x = merge([u,m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(optimizers.Adam(0.001), loss='mse')

In [70]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 80099 samples, validate on 19905 samples
Epoch 1/1


<keras.callbacks.History at 0x11dd3ef0>

In [71]:
model.optimizer.lr=0.01

In [72]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=3, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80099 samples, validate on 19905 samples
Epoch 1/3
  192/80099 [..............................] - ETA: 55s - loss: 1.5782

Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0xd3143c8>

## Bias

In [26]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=regularizers.l2(reg))(inp)

In [27]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)

In [28]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [29]:
ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

In [30]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
x = merge([x, ub], mode='sum')
x = merge([x, mb], mode='sum')
model = Model([user_in, movie_in], x)
model.compile(optimizers.Adam(0.001), loss='mse')

In [31]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/1


<keras.callbacks.History at 0xfb3fe48>

In [33]:
model.optimizer.lr = 0.01

In [34]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/6
  320/79869 [..............................] - ETA: 48s - loss: 2.7224

Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x104d42b0>

In [35]:
model.optimizer.lr = 0.001

In [36]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/6
  320/79869 [..............................] - ETA: 50s - loss: 1.5429

Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x104d4550>

In [37]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/10
  320/79869 [..............................] - ETA: 47s - loss: 1.0309

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x100bb4a8>

In [38]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/10
  320/79869 [..............................] - ETA: 46s - loss: 0.6587

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x104d42e8>

In [42]:
model.optimizer.lr = 0.001

In [43]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=5, 
         validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/5
  320/79869 [..............................] - ETA: 51s - loss: 0.4990

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x100bbe80>

## Analyzing the results

In [81]:
g = ratings.groupby('movieId')['rating'].count()
topMovies = g.sort_values(ascending=False)[:2000]
topMovies = np.array(topMovies.index)

In [83]:
get_movie_bias = Model(movie_in, mb)
movie_bias = get_movie_bias.predict(topMovies)
movie_ratings = [(b[0], movie_names[movies[i]]) for i,b in zip(topMovies,movie_bias)]

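RuntimeError: Graph disconnected: cannot obtain value for tensor Tensor("movie_in_1:0", shape=(?, 1), dtype=int64) at layer "movie_in". The following previous layers were accessed without issue: []

Judging by the cell numbers, this error most likely comes from re-running the cell out of order: In [83] ran after In [75] in the neural network section had rebound `movie_in` to a new input, so `mb` was no longer connected to it. The sorted results that follow appear to come from an earlier run, made right after training the bias model.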

In [51]:
import operator

In [53]:
sorted(movie_ratings, key=operator.itemgetter(0))[:15]

[(-1.0340273, 'Battlefield Earth (2000)'),
 (-0.66565764, 'Super Mario Bros. (1993)'),
 (-0.63319349, 'Police Academy 6: City Under Siege (1989)'),
 (-0.56280315, 'Police Academy 4: Citizens on Patrol (1987)'),
 (-0.55983746, 'Speed 2: Cruise Control (1997)'),
 (-0.55840671, 'Jaws 3-D (1983)'),
 (-0.53708786, 'Spice World (1997)'),
 (-0.53677666, 'Howard the Duck (1986)'),
 (-0.51937348, 'Police Academy 5: Assignment: Miami Beach (1988)'),
 (-0.49318901, 'Blade: Trinity (2004)'),
 (-0.46237037, 'Police Academy 3: Back in Training (1986)'),
 (-0.45498112, 'Mighty Morphin Power Rangers: The Movie (1995)'),
 (-0.41894358, 'House on Haunted Hill (1999)'),
 (-0.41134694, 'Superman IV: The Quest for Peace (1987)'),
 (-0.39329076, 'Anaconda (1997)')]

In [54]:
sorted(movie_ratings, key=operator.itemgetter(0), reverse=True)[:15]

[(1.9498818, 'Wings of Desire (Himmel über Berlin, Der) (1987)'),
 (1.9369031, 'African Queen, The (1951)'),
 (1.9152941, 'All About Eve (1950)'),
 (1.8604825, 'It Happened One Night (1934)'),
 (1.8471717, 'Grand Illusion (La grande illusion) (1937)'),
 (1.8248702, 'Shawshank Redemption, The (1994)'),
 (1.7931011, 'Tom Jones (1963)'),
 (1.7909009, 'Ran (1985)'),
 (1.7738168, 'Godfather, The (1972)'),
 (1.7656829, 'Mister Roberts (1955)'),
 (1.7546382, 'Modern Times (1936)'),
 (1.7382654, 'Thin Man, The (1934)'),
 (1.7310246, 'Big Night (1996)'),
 (1.7272288, 'Diva (1981)'),
 (1.7227252, 'Grand Day Out with Wallace and Gromit, A (1989)')]
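
The analysis above boils down to three steps: count ratings per movie, keep the most-rated movies, then sort them by learned bias. A minimal sketch on made-up data follows; `toy_ratings`, `toy_bias` and `toy_names` are hypothetical values standing in for the ratings table, the output of `get_movie_bias.predict` and `movie_names`, and `Counter` replaces the pandas `groupby`.

```python
from collections import Counter
import operator

# Hypothetical toy data: (movie index, rating) pairs and made-up learned biases
toy_ratings = [(0, 4.0), (1, 2.0), (0, 5.0), (2, 3.0), (1, 1.5), (0, 4.5), (3, 3.0)]
toy_bias = {0: 0.9, 1: -0.6, 2: 0.1, 3: 0.3}
toy_names = {0: 'Movie A', 1: 'Movie B', 2: 'Movie C', 3: 'Movie D'}

# Count how many ratings each movie received and keep the 2 most-rated ones
counts = Counter(movie for movie, _ in toy_ratings)
top_movies = [movie for movie, _ in counts.most_common(2)]

# Pair each bias with its title, then sort by the bias value (tuple item 0)
movie_ratings = [(toy_bias[i], toy_names[i]) for i in top_movies]
worst_first = sorted(movie_ratings, key=operator.itemgetter(0))
best_first = sorted(movie_ratings, key=operator.itemgetter(0), reverse=True)

print(worst_first)  # [(-0.6, 'Movie B'), (0.9, 'Movie A')]
print(best_first)   # [(0.9, 'Movie A'), (-0.6, 'Movie B')]
```

Restricting the list to frequently rated movies matters because a bias learned from only a handful of ratings is noisy and would otherwise dominate both ends of the ranking.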

Predict the rating that user 1 would give to movie 2

In [72]:
pred = model.predict([np.array([1]), np.array([2])])

In [73]:
pred

array([[ 3.5491395]], dtype=float32)

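## Neural network
Even with the bias terms added above, I put in a lot of effort and still could not reach Jeremy's 0.8. The network below, with a single hidden layer, gets to state-of-the-art in no time...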

In [75]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)

In [79]:
x = merge([u, m], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(optimizers.Adam(0.001), loss='mse')

In [80]:
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=8, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79869 samples, validate on 20135 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x11d3acc0>

In [85]:
pred = nn.predict([np.array([1]), np.array([2])])
pred

array([[ 3.50317764]], dtype=float32)
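
What the network computes at prediction time can be sketched in plain Python: the user and movie embeddings are concatenated instead of multiplied, passed through one ReLU hidden layer, then through a single linear output unit. The Dropout layers are left out because they are only active during training. All weights and vector sizes below are made-up small numbers chosen so the arithmetic is easy to follow, and the helper names `dense` and `nn_predict` are hypothetical.

```python
def relu(x):
    """Rectified linear unit: negative values become 0."""
    return x if x > 0.0 else 0.0

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer; weights[j] holds the input weights of output unit j."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        out.append(activation(z) if activation else z)
    return out

def nn_predict(user_vec, movie_vec, hidden_w, hidden_b, out_w, out_b):
    x = user_vec + movie_vec                 # list concatenation, like mode='concat'
    h = dense(x, hidden_w, hidden_b, relu)   # hidden layer with ReLU
    return dense(h, out_w, out_b)[0]         # single linear output: the rating

# Made-up embeddings (2 factors each) and weights (3 hidden units)
user_vec = [1.0, 2.0]
movie_vec = [0.5, -1.0]
hidden_w = [[1.0, 0.0, 1.0, 0.0],    # 1.0 + 0.5 = 1.5
            [0.0, 1.0, 0.0, 1.0],    # 2.0 - 1.0 = 1.0
            [-1.0, 0.0, 0.0, 0.0]]   # -1.0, clipped to 0.0 by ReLU
hidden_b = [0.0, 0.0, 0.0]
out_w = [[2.0, 1.0, 5.0]]
out_b = [0.5]

rating = nn_predict(user_vec, movie_vec, hidden_w, hidden_b, out_w, out_b)
print(rating)  # 2.0*1.5 + 1.0*1.0 + 5.0*0.0 + 0.5 = 4.5
```

Unlike the dot-product model, the hidden layer can combine any user factor with any movie factor, which gives the network more freedom than pairing factor k of the user only with factor k of the movie.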

In [87]:
nn.save_weights(model_path+'nn.h5')