https://zhuanlan.zhihu.com/p/126282487

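# Background
With deep learning now widespread, more and more deep learning algorithms are being deployed in industry. Since graduating and joining a company last year, I have been fortunate to take part in building the recommender system for a new business line and in optimizing its user experience and business metrics. In the recall (candidate retrieval) stage I also explored embedding-based retrieval and obtained some gains.

During my master's studies I developed, out of personal interest, a deep-learning-based click-through rate prediction library, [DeepCTR](https://github.com/shenweichen/DeepCTR). Over successive iterations it gained support and recognition from the community, and I have since applied its algorithms to my own business with significant gains.

Compared with the various CTR prediction models used in ranking, my understanding of the recall module still has many gaps. Taking this opportunity, and in the spirit of learning, I built the DeepMatch project together with several enthusiastic and talented collaborators. I hope it proves useful to you!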

https://github.com/shenweichen/DeepMatch


The following briefly describes how to install and use it.
## Installation and usage

In [2]:
# !pip install -U deepmatch
!pip freeze | grep deepmatch

deepmatch==0.1.2


# Example 1: YoutubeDNN-ml_1m
https://github.com/shenweichen/DeepMatch/blob/master/examples/colab_MovieLen1M_YoutubeDNN.ipynb
    
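Below, taking the familiar YoutubeDNN as an example, we show how to use deepmatch to train a recall model, export user and item vectors, and run approximate nearest neighbor search with faiss.

The whole script is under 100 lines, so it is easy to study and reuse.

__Runtime environment: tf == 1.14.0. Running under TF2 raises errors!__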
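
Because the notebook was only run under TensorFlow 1.14, a small guard at the top can fail fast on other versions. The sketch below is an illustration rather than part of DeepMatch; `is_tf1` is a hypothetical helper that only inspects a version string such as `tf.__version__`.

```python
def is_tf1(version: str) -> bool:
    """Return True when the version string has major version 1."""
    major = version.split(".")[0]
    return major == "1"


# In the notebook this would be called as is_tf1(tf.__version__).
for v in ("1.14.0", "2.3.1"):
    print(v, "ok" if is_tf1(v) else "unsupported by this example")
```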

## Import the required libraries

In [11]:
import pandas as pd
from deepctr.inputs import SparseFeat, VarLenSparseFeat
from preprocess import gen_data_set, gen_model_input
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model

from deepmatch.models import *
from deepmatch.utils import sampledsoftmaxloss


## Load the data

In [13]:
data_path = "/Users/luoyonggui/PycharmProjects/mayiexamples/RecommendatIon_System/"

# '::' is a multi-character separator, which the C parser does not support,
# so request the python engine explicitly to avoid a ParserWarning.
unames = ['user_id','gender','age','occupation','zip']
user = pd.read_csv(data_path+'ml_1m/users.dat',sep='::',header=None,names=unames,engine='python')
rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_csv(data_path+'ml_1m/ratings.dat',sep='::',header=None,names=rnames,engine='python')
mnames = ['movie_id','title','genres']
movies = pd.read_csv(data_path+'ml_1m/movies.dat',sep='::',header=None,names=mnames,engine='python')

data = pd.merge(pd.merge(ratings,movies),user)#.iloc[:10000]




In [23]:
data.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres,gender,age,occupation,zip
456790,6040,803,4,956703932,"Godfather, The (1972)",Action|Crime|Drama,2,3,7,467
456672,6040,580,5,956703954,"Silence of the Lambs, The (1991)",Drama|Thriller,2,3,7,467
456732,6040,2192,4,956703954,Babe: Pig in the City (1998),Children's|Comedy,2,3,7,467
456641,6040,1782,4,956703977,Rain Man (1988),Drama,2,3,7,467
456842,6040,1840,5,956703977,Seven Samurai (The Magnificent Seven) (Shichin...,Action|Drama,2,3,7,467


## Build feature columns, train the model, export embeddings

In [None]:
# data = pd.read_csv("./movielens_sample.txt")
sparse_features = ["movie_id", "user_id",
                    "gender", "age", "occupation", "zip", ]
SEQ_LEN = 50
negsample = 0

# 1. Label-encode the sparse features, and process sequence features with `gen_data_set` and `gen_model_input`

features = ['user_id', 'movie_id', 'gender', 'age', 'occupation', 'zip']
feature_max_idx = {}
for feature in features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature]) + 1
    feature_max_idx[feature] = data[feature].max() + 1

user_profile = data[["user_id", "gender", "age", "occupation", "zip"]].drop_duplicates('user_id')

item_profile = data[["movie_id"]].drop_duplicates('movie_id')

user_profile.set_index("user_id", inplace=True)

user_item_list = data.groupby("user_id")['movie_id'].apply(list)
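
The loop above shifts every encoded id by 1, presumably so that index 0 stays free for sequence padding (an assumption based on `value=0` in the padding step), and stores `max + 1` as the vocabulary size. A minimal pure-Python sketch of the same idea, where `encode_with_offset` is a hypothetical helper that mimics `LabelEncoder` by ranking the sorted unique values:

```python
def encode_with_offset(values):
    """Map raw values to 1..K by sorted order, keeping 0 free for padding."""
    classes = sorted(set(values))
    mapping = {v: i + 1 for i, v in enumerate(classes)}
    encoded = [mapping[v] for v in values]
    vocab_size = max(encoded) + 1  # ids 0..K, so K + 1 embedding rows
    return encoded, vocab_size


raw_movie_ids = [3408, 1193, 3408, 661]
encoded, vocab_size = encode_with_offset(raw_movie_ids)
print(encoded)     # [3, 2, 3, 1]
print(vocab_size)  # 4
```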

In [24]:
user_item_list.head(2)

user_id
1    [1105, 640, 854, 3178, 2163, 1108, 1196, 2600,...
2    [1105, 2890, 2129, 1783, 1118, 1849, 1155, 126...
Name: movie_id, dtype: object

In [26]:
train_set, test_set = gen_data_set(data, negsample)

100%|██████████| 6040/6040 [00:14<00:00, 410.24it/s]


6 6


In [29]:
train_set[:1]

[(3101, [2780, 1164, 3399], 336, 1, 3, 4)]
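
`gen_data_set` lives in the example's `preprocess.py`, which is not shown here. Judging only from the tuple above, each sample appears to hold (user id, history, target item, label, history length, rating). The sketch below is an assumption about that layout, not the actual implementation: it builds one sample per prefix of a user's time-ordered items and holds out the last item for testing. `split_user_history` is a hypothetical name, and the rating field is omitted for brevity.

```python
def split_user_history(user_id, items):
    """Build prefix/target samples from one user's time-ordered items.

    Every position after the first yields a sample; the final one goes to
    the test set so evaluation predicts the user's last interaction.
    """
    train, test = [], []
    for i in range(1, len(items)):
        hist = items[:i]
        sample = (user_id, hist, items[i], 1, len(hist))
        if i == len(items) - 1:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


train, test = split_user_history(7, [10, 20, 30, 40])
print(train)  # [(7, [10], 20, 1, 1), (7, [10, 20], 30, 1, 2)]
print(test)   # [(7, [10, 20, 30], 40, 1, 3)]
```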

In [30]:
train_seq = [line[1] for line in train_set]

In [31]:
train_seq[:2]

[[2780, 1164, 3399],
 [880,
  1136,
  2720,
  2874,
  1831,
  909,
  853,
  2880,
  1169,
  1115,
  1130,
  3547,
  3216,
  844,
  1123,
  803,
  862,
  1161,
  3204,
  1784,
  2447,
  3131,
  890,
  1014,
  203,
  1779,
  1211,
  2427,
  860]]

In [33]:
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

In [35]:
pad_sequences(train_seq[:2], maxlen=50, padding='post', truncating='post', value=0)

array([[2780, 1164, 3399,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [ 880, 1136, 2720, 2874, 1831,  909,  853, 2880, 1169, 1115, 1130,
        3547, 3216,  844, 1123,  803,  862, 1161, 3204, 1784, 2447, 3131,
         890, 1014,  203, 1779, 1211, 2427,  860,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)
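
To make the `padding='post'` and `truncating='post'` arguments concrete, here is a plain-Python sketch of the same behavior. `pad_post` is a hypothetical stand-in for illustration, not the Keras function.

```python
def pad_post(seqs, maxlen, value=0):
    """Truncate each sequence to maxlen from the end, then right-pad."""
    out = []
    for seq in seqs:
        kept = list(seq[:maxlen])  # truncating='post' drops trailing items
        out.append(kept + [value] * (maxlen - len(kept)))
    return out


print(pad_post([[2780, 1164, 3399], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[2780, 1164, 3399, 0, 0], [1, 2, 3, 4, 5]]
```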

In [14]:
train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)

100%|██████████| 6040/6040 [00:13<00:00, 436.76it/s]


6 6


In [36]:
train_model_input

{'user_id': array([2215, 1447,   92, ..., 4680, 5915, 4344]),
 'movie_id': array([1018,   15, 3476, ..., 2914, 2427, 2277]),
 'hist_movie_id': array([[3044,  414,  703, ..., 3445,  610, 1512],
        [2618, 1367,  703, ...,  529,  402,  181],
        [2405, 3015, 2404, ..., 1282,  854, 1207],
        ...,
        [2243,  477,  519, ..., 1803, 2780, 3266],
        [ 368, 1903, 1306, ..., 2167, 2502, 2111],
        [1277, 2224, 3246, ..., 1098,  579, 3167]], dtype=int32),
 'hist_len': array([128, 691, 279, ..., 271, 210, 333]),
 'gender': array([1, 2, 1, ..., 2, 2, 2]),
 'age': array([3, 2, 2, ..., 4, 2, 3]),
 'occupation': array([1, 5, 5, ..., 1, 5, 2]),
 'zip': array([3029, 2563, 1440, ...,  242, 1976, 1438])}

In [37]:
SparseFeat('user_id', feature_max_idx['user_id'], 16)

SparseFeat(name='user_id', vocabulary_size=6041, embedding_dim=16, use_hash=False, dtype='int32', embedding_name='user_id', group_name='default_group')

In [15]:
# 2.count #unique features for each sparse field and generate feature config for sequence feature

embedding_dim = 32

user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], 16),
                        SparseFeat("gender", feature_max_idx['gender'], 16),
                        SparseFeat("age", feature_max_idx['age'], 16),
                        SparseFeat("occupation", feature_max_idx['occupation'], 16),
                        SparseFeat("zip", feature_max_idx['zip'], 16),
                        VarLenSparseFeat(SparseFeat('hist_movie_id', feature_max_idx['movie_id'], embedding_dim,
                                                    embedding_name="movie_id"), SEQ_LEN, 'mean', 'hist_len'),
                        ]

item_feature_columns = [SparseFeat('movie_id', feature_max_idx['movie_id'], embedding_dim)]
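
The `VarLenSparseFeat` above is configured with the `'mean'` combiner and `'hist_len'` as its length input, so padded positions should not contribute to the pooled history vector. As a rough illustration in plain Python (not DeepCTR's actual layer), masked mean pooling over a padded sequence of embeddings looks like this; `masked_mean` is a hypothetical helper.

```python
def masked_mean(seq_embs, length):
    """Average the first `length` embedding vectors, ignoring padding rows."""
    if length == 0:
        return [0.0] * len(seq_embs[0])
    valid = seq_embs[:length]
    dim = len(valid[0])
    return [sum(vec[d] for vec in valid) / length for d in range(dim)]


# Three real items followed by one padding row (all zeros).
history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
print(masked_mean(history, length=3))  # [3.0, 4.0]
```

Without the length mask, the padding row would pull the average toward zero, which is why the sequence feature carries its true length alongside the padded ids.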

In [None]:
# 3.Define Model and train

K.set_learning_phase(True)

model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=200, user_dnn_hidden_units=(128,64, embedding_dim))
# model = MIND(user_feature_columns,item_feature_columns,dynamic_k=True,p=1,k_max=2,num_sampled=5,user_dnn_hidden_units=(64,16),init_std=0.001)

model.compile(optimizer="adam", loss=sampledsoftmaxloss)  # "binary_crossentropy")
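
`num_sampled=200` together with `sampledsoftmaxloss` suggests that training scores the positive item against a sample of negatives instead of the full catalog. The sketch below shows only that core idea in plain Python. It omits the sampling-probability correction that sampled-softmax implementations usually apply, and `sampled_softmax_loss` is a hypothetical function, not the DeepMatch one.

```python
import math


def sampled_softmax_loss(pos_score, neg_scores):
    """Cross-entropy of the positive item against sampled negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - pos_score


# A well-separated positive gives a small loss, a buried one a large loss.
print(sampled_softmax_loss(5.0, [0.0, 0.0, 0.0]))
print(sampled_softmax_loss(0.0, [5.0, 5.0, 5.0]))
```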

In [20]:
history = model.fit(train_model_input, train_label,  # train_label,
                    batch_size=512, epochs=20, verbose=1, validation_split=0.0, )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


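Once training is complete: in a real deployment we would generate the user-side vector in real time from the current user features, and build an index over the item-side vectors for approximate nearest neighbor search. Since this is an offline simulation, we simply export the representation vectors of all test users and of all items.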


In [21]:
# 4. Generate user features for testing and full item features for retrieval
test_user_model_input = test_model_input
all_item_model_input = {"movie_id": item_profile['movie_id'].values,}

# The next two lines are the general deepmatch pattern for obtaining the user-vector model and the item-vector model
user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)

# Feed in the corresponding inputs to get the corresponding vectors
user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)
# user_embs = user_embs[:, i, :]  i in [0,k_max) if MIND
item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)

print(user_embs.shape)
print(item_embs.shape)



(6040, 32)
(3706, 32)


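## Nearest neighbor search with faiss and evaluation
[Optional] If you have the faiss library installed, you can try the following: build an index from the item vectors exported in the previous step, then query it with the user vectors to run the nearest neighbor search and evaluate the results.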
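
For intuition, a flat inner-product index such as `faiss.IndexFlatIP` scores every item vector against the query exhaustively and keeps the best matches. A minimal pure-Python sketch of that search, assuming small in-memory lists (`top_k_inner_product` is a hypothetical helper, and real indexes are far faster):

```python
def top_k_inner_product(query, items, k):
    """Return (scores, indices) of the k items with the largest dot product."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(items)
    ]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep index order
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]


item_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
scores, ids = top_k_inner_product([1.0, 0.5], item_vecs, k=2)
print(ids)  # [2, 0]
```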


In [18]:
! pip install faiss-cpu

Collecting faiss-cpu
  Using cached faiss_cpu-1.6.3-cp37-cp37m-macosx_10_9_x86_64.whl (1.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.6.3


In [26]:
test_true_label = {line[0]:[line[2]] for line in test_set}

In [22]:


import numpy as np
import faiss
from tqdm import tqdm
from deepmatch.utils import recall_N

index = faiss.IndexFlatIP(embedding_dim)
# faiss.normalize_L2(item_embs)
index.add(item_embs)
# faiss.normalize_L2(user_embs)
D, I = index.search(user_embs, 50)
s = []
hit = 0
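for i, uid in tqdm(enumerate(test_user_model_input['user_id'])):
    try:
        # Map faiss row indices back to encoded movie ids
        pred = [item_profile['movie_id'].values[x] for x in I[i]]
        recall_score = recall_N(test_true_label[uid], pred, N=50)
        s.append(recall_score)
        # test_true_label[uid] is a one-element list; compare the item itself
        if test_true_label[uid][0] in pred:
            hit += 1
    except KeyError:
        # No held-out label for this user
        print(i)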
print("")
print("recall", np.mean(s))
print("hit rate", hit / len(test_user_model_input['user_id']))

6040it [00:01, 4215.70it/s]


recall 0.29238410596026493
hit rate 0.29238410596026493
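
Recall and hit rate coincide above because `test_true_label` holds exactly one held-out movie per user, so each user's recall is either 0 or 1. A small sketch of the two metrics under that reading (`recall_at_n` and `hit_rate` are hypothetical helpers, not the `recall_N` from `deepmatch.utils`):

```python
def recall_at_n(true_items, predicted, n):
    """Fraction of true items that appear in the top-n predictions."""
    return len(set(predicted[:n]) & set(true_items)) / len(true_items)


def hit_rate(true_by_user, pred_by_user, n):
    """Share of users with at least one true item in their top-n list."""
    hits = sum(
        1 for uid, true_items in true_by_user.items()
        if set(pred_by_user[uid][:n]) & set(true_items)
    )
    return hits / len(true_by_user)


truth = {1: [10], 2: [20], 3: [30]}
preds = {1: [10, 11], 2: [21, 22], 3: [31, 30]}
recalls = [recall_at_n(truth[u], preds[u], n=2) for u in truth]
print(sum(recalls) / len(recalls))   # mean recall
print(hit_rate(truth, preds, n=2))   # equal when each user has one true item
```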



