### Vector Nearest-Neighbor Search in Python with Faiss

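Nearest-neighbor search over embeddings is one of the most important recall (candidate generation) methods in today's recommender systems. Approaches such as item2vec, matrix factorization, and two-tower DNNs all produce trained user embeddings and item embeddings, and these embeddings can be used very flexibly: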

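* Query with a user embedding and search the item embeddings to recommend items the user may like
* Query with a user embedding and search the user embeddings to recommend users the user may be interested in
* Query with an item embedding and search the item embeddings to recommend related items for an item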

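There is an engineering problem, however: once the number of user and item embeddings grows large enough, nearest-neighbor search over them becomes very slow. Precomputing the results offline and storing them in a fast cache such as Redis works, but the results are then far from real time. Ideally the search runs online and returns within a few tens of milliseconds.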

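Faiss is a library for similarity search and clustering, open-sourced by the Facebook AI team. It provides efficient similarity search and clustering for dense vectors, supports search over billions of vectors, and is one of the most mature approximate nearest-neighbor search libraries currently available.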

Installation command:
```
conda install -c pytorch faiss-cpu 
```

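Demo steps:
1. Load the trained embedding data
2. Build a Faiss index and add the embeddings to be searched
3. Take the target embedding and search to get a list of IDs
4. Look up the movie titles by ID and return the results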

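Tips for using Faiss:
1. To use your own IDs, wrap `faiss.IndexFlatL2` in `faiss.IndexIDMap`
2. All embedding data must be converted to `np.float32`, both the embeddings stored in the index and the query embeddings
3. The ids must be converted to `int64`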
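
The dtype requirements in the tips above can be enforced in one place before anything is handed to Faiss. The following is a minimal sketch that uses only NumPy; `prepare_for_faiss` is a hypothetical helper name introduced for illustration, not part of Faiss.

```python
import numpy as np


def prepare_for_faiss(vectors, ids):
    """Coerce vectors and ids to the dtypes and layout the tips above call for.

    Hypothetical helper, not a Faiss function.
    """
    # Embeddings: a C-contiguous float32 matrix of shape (n, d).
    x = np.ascontiguousarray(vectors, dtype=np.float32)
    if x.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {x.shape}")
    # IDs: one int64 value per vector.
    id_arr = np.ascontiguousarray(ids, dtype=np.int64)
    if id_arr.shape != (x.shape[0],):
        raise ValueError("need exactly one id per vector")
    return x, id_arr


x, id_arr = prepare_for_faiss([[0.1, 0.2], [0.3, 0.4]], [10, 20])
print(x.dtype, x.shape, id_arr.dtype)
```

The returned arrays are in the form that `add_with_ids` is given later in this notebook.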

### 1. Prepare the data

In [104]:
import pandas as pd
import numpy as np
import json

In [105]:
df_question = pd.read_csv("./my_datas/tensorflow_question_embedding.csv")
df_user = pd.read_csv("./my_datas/tensorflow_user_embedding.csv")
df_user.head()

Unnamed: 0,user_id,user_embedding
0,0,"[0.06441626697778702, 0.0, 0.0, 0.297404974699..."
1,1,"[0.05123215168714523, 0.0, 0.0, 0.229109406471..."
2,2,"[0.09596045315265656, 0.0, 0.0, 0.447525322437..."
3,3,"[0.2353876680135727, 0.0, 0.0, 1.1839507818222..."
4,4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0180229507386..."


In [106]:
df_question.head()

Unnamed: 0,question_id,question_embedding
0,0,"[0.6998521089553833, 0.0, 0.0, 3.8853187561035..."
1,1,"[0.8288981318473816, 0.0, 0.0, 4.8776364326477..."
2,2,"[0.5633597373962402, 0.0, 0.0, 2.8337125778198..."
3,3,"[0.5317284464836121, 0.0, 0.0, 2.56494140625, ..."
4,4,"[0.373773992061615, 0.0, 0.0, 1.36089992523193..."


#### Build the ids

In [107]:
ids_question = df_question["question_id"].values.astype(np.int64)
ids_user = df_user["user_id"].values.astype(np.int64)
type(ids_question), ids_question.shape,type(ids_user),ids_user.shape

(numpy.ndarray, (82,), numpy.ndarray, (12662,))

In [108]:
ids_user.dtype

dtype('int64')

In [109]:
ids_question_size = ids_question.shape[0]
ids_user_size = ids_user.shape[0]
ids_question_size,ids_user_size

(82, 12662)

#### Build the data arrays

In [110]:
import json
import numpy as np

In [111]:
type(df_question["question_embedding"][0]),type(df_user["user_embedding"][0])

(str, str)

In [112]:
# Parse each embedding from its JSON string form into a NumPy array
df_question["question_embedding"] = df_question["question_embedding"].map(lambda x : np.array(json.loads(x)))
df_user["user_embedding"] = df_user["user_embedding"].map(lambda x : np.array(json.loads(x)))
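
As an alternative to converting the column element by element and stacking the rows later, the JSON strings can be parsed straight into one float32 matrix. This is a sketch under the assumption that every string encodes a vector of the same length; `parse_embeddings` is a hypothetical helper name.

```python
import json

import numpy as np


def parse_embeddings(strings):
    """Turn an iterable of JSON-encoded vectors into one (n, d) float32 matrix."""
    return np.array([json.loads(s) for s in strings], dtype=np.float32)


# Two made-up embedding strings, in the same format as the CSV columns.
sample = ["[0.5, 0.0, 1.5]", "[0.25, 2.0, 0.0]"]
matrix = parse_embeddings(sample)
print(matrix.shape, matrix.dtype)
```

Applied to the raw string column (before the `map` call above rewrites it), this would produce the same kind of array as `datas_question` and `datas_user` in a single step.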

In [113]:
df_question.head()#,df_user.head()

Unnamed: 0,question_id,question_embedding
0,0,"[0.6998521089553833, 0.0, 0.0, 3.8853187561035..."
1,1,"[0.8288981318473816, 0.0, 0.0, 4.8776364326477..."
2,2,"[0.5633597373962402, 0.0, 0.0, 2.8337125778198..."
3,3,"[0.5317284464836121, 0.0, 0.0, 2.56494140625, ..."
4,4,"[0.373773992061615, 0.0, 0.0, 1.36089992523193..."


In [119]:
df_question["question_embedding"][0].shape

(8,)

In [130]:
datas_question = []
for y in df_question["question_embedding"]:
    datas_question.append(y)
len(datas_question)

82

In [131]:
datas_user = []
for x in df_user["user_embedding"]:
    datas_user.append(x)
len(datas_user)

12662

In [132]:
datas_user = np.array(datas_user).astype(np.float32)
datas_question = np.array(datas_question).astype(np.float32)

In [133]:
datas_question.dtype,datas_user.dtype

(dtype('float32'), dtype('float32'))

In [134]:
datas_question.shape,datas_user.shape

((82, 8), (12662, 8))

In [136]:
datas_user[0],datas_question[0]

(array([0.06441627, 0.        , 0.        , 0.29740497, 0.        ,
        0.        , 0.01814273, 0.1700981 ], dtype=float32),
 array([0.6998521 , 0.        , 0.        , 3.8853188 , 0.        ,
        0.03113749, 0.        , 2.0728223 ], dtype=float32))

In [137]:
# Embedding dimensionality
dimension_question = datas_question.shape[1]
dimension_user = datas_user.shape[1]
dimension_question,dimension_user

(8, 8)

### 2. Build the index

In [138]:
import faiss

In [142]:
index_question = faiss.IndexFlatL2(dimension_question)
index_user = faiss.IndexFlatL2(dimension_user)

In [143]:
index2_user = faiss.IndexIDMap(index_user)
index2_question = faiss.IndexIDMap(index_question)

In [144]:
ids_question.dtype,ids_user.dtype

(dtype('int64'), dtype('int64'))

In [146]:
index2_question.add_with_ids(datas_question, ids_question)
index2_user.add_with_ids(datas_user, ids_user)

In [147]:
index2_question.ntotal,index2_user.ntotal

(82, 12662)
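
`IndexFlatL2` performs an exhaustive (exact) search and reports squared L2 distances. For intuition, or to cross-check results on a small index such as the 82 questions, the same computation can be sketched in plain NumPy. `brute_force_l2` is a hypothetical name and the sample points are made up; this is not how Faiss is implemented internally, only what it computes.

```python
import numpy as np


def brute_force_l2(xb, ids, xq, k):
    """Exact k-nearest neighbours by squared L2 distance.

    xb: (n, d) indexed vectors, ids: (n,) their ids, xq: (q, d) queries.
    Returns (distances, neighbour ids), each of shape (q, k), closest first.
    """
    # Pairwise differences, shape (q, n, d); memory grows with q * n * d,
    # so this is only suitable for small inputs.
    diff = xq[:, None, :] - xb[None, :, :]
    dist = np.einsum("qnd,qnd->qn", diff, diff)
    # Positions of the k smallest distances per query.
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, order, axis=1), ids[order]


xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
ids = np.array([100, 200, 300], dtype=np.int64)
xq = np.array([[0.9, 0.0]], dtype=np.float32)
D, I = brute_force_l2(xb, ids, xq, k=2)
print(D, I)
```

The `D` and `I` names mirror the pair that `search` returns in the cells that follow.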

### 3. Search

#### Similar-question list for each question

In [192]:
question_question_list = []
topk = 20
for index, row in df_question.iterrows():
    question_id = row["question_id"]
    question_embedding = row["question_embedding"]
    question_embedding = np.expand_dims(question_embedding, axis=0).astype(np.float32)
    D, I = index2_question.search(question_embedding, topk)
    question_sim_list = " ".join([str(x) for x in I[0]])
#     print(question_sim_list)
    question_question_list.append([question_id, question_sim_list])
question_question_rec_list = pd.DataFrame(question_question_list, columns = ["question_id", "question_sim_list"])
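
Faiss `search` accepts a 2-D batch of query vectors, so the per-row loop above could be replaced by one call on the whole `(n, d)` matrix followed by a post-processing step. The sketch below shows only that post-processing; `join_neighbours` is a hypothetical helper, and `I_demo` is a hand-written stand-in for the id matrix a batched search would return.

```python
import numpy as np


def join_neighbours(query_ids, I):
    """Pair each query id with its neighbour ids joined by spaces."""
    return [
        [int(q), " ".join(str(n) for n in row)]
        for q, row in zip(query_ids, I)
    ]


query_ids = np.array([0, 1], dtype=np.int64)
# Stand-in for the (n_queries, topk) id matrix returned by a batched search.
I_demo = np.array([[0, 5, 7], [1, 3, 5]], dtype=np.int64)
rows = join_neighbours(query_ids, I_demo)
print(rows)
```

The resulting rows have the same `[id, "id id id ..."]` layout as `question_question_list`, so they can be passed to `pd.DataFrame` in the same way.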

In [193]:
question_question_rec_list = question_question_rec_list.drop_duplicates()
question_question_rec_list.to_csv("./my_datas/question_question_list.csv", index=False)
# question_question_rec_list.head(19)

#### Question list for each user

In [194]:
user_question_list = []
topk = 20
for index, row in df_user.iterrows():
    user_id = row["user_id"]
    user_embedding = row["user_embedding"]
    user_embedding = np.expand_dims(user_embedding, axis=0).astype(np.float32)
    D, I = index2_question.search(user_embedding, topk)
    user_sim_list = " ".join([str(x) for x in I[0]])
#     print(question_sim_list)
    user_question_list.append([user_id, user_sim_list])
user_question_rec_list = pd.DataFrame(user_question_list, columns = ["user_id", "user_question_sim_list"])

In [195]:
len(user_question_list)

12662

In [196]:
user_question_rec_list = user_question_rec_list.drop_duplicates()
user_question_rec_list.to_csv("./my_datas/user_question_list.csv", index=False)
user_question_rec_list.head()

Unnamed: 0,user_id,user_question_sim_list
0,0,62 32 65 21 14 60 49 33 4 61 38 75 73 17 12 78...
1,1,62 32 65 21 14 60 49 33 4 61 38 75 73 17 12 78...
2,2,65 32 62 21 14 60 49 33 4 61 38 75 73 17 12 78...
3,3,49 33 60 4 61 38 14 21 75 73 17 12 78 65 47 32...
4,4,62 32 65 21 14 60 49 33 4 61 38 75 73 17 12 78...


In [197]:
len(user_question_rec_list)

7457
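
The drop from 12662 rows to 7457 after `drop_duplicates` suggests that the user table contains repeated rows. If that is the case, one option is to deduplicate by id before indexing and searching, so that repeated users are neither stored nor queried more than once. A minimal NumPy sketch with made-up data follows; `dedupe_by_id` is a hypothetical helper.

```python
import numpy as np


def dedupe_by_id(ids, vectors):
    """Keep the first vector seen for each id, preserving the original order."""
    # return_index gives the position of each id's first occurrence.
    _, first = np.unique(ids, return_index=True)
    keep = np.sort(first)  # restore the original row order
    return ids[keep], vectors[keep]


ids = np.array([7, 3, 7, 9, 3], dtype=np.int64)
vectors = np.arange(10, dtype=np.float32).reshape(5, 2)
uniq_ids, uniq_vectors = dedupe_by_id(ids, vectors)
print(uniq_ids, uniq_vectors.shape)
```

This keeps the first embedding for each id; whether that is the right choice depends on how the duplicates arose, which the data shown here does not reveal.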

#### Similar-user list for each user

In [198]:
user_user_list = []
topk = 20
for index, row in df_user.iterrows():
    user_id = row["user_id"]
    user_embedding = row["user_embedding"]
    user_embedding = np.expand_dims(user_embedding, axis=0).astype(np.float32)
    D, I = index2_user.search(user_embedding, topk)
    user_sim_list = " ".join([str(x) for x in I[0]])
#     print(question_sim_list)
    user_user_list.append([user_id, user_sim_list])
user_user_rec_list = pd.DataFrame(user_user_list, columns = ["user_id", "user_sim_list"])

In [199]:
user_user_rec_list = user_user_rec_list.drop_duplicates()
user_user_rec_list.to_csv("./my_datas/user_user_list.csv", index=False)
user_user_rec_list.head()

Unnamed: 0,user_id,user_sim_list
0,0,0 0 4018 4018 1719 7118 3691 5486 5486 5486 63...
1,1,1 2193 2361 2361 2361 597 597 370 2575 3806 35...
2,2,2 2 3684 5451 4260 4260 3790 593 593 593 593 5...
3,3,3 3 3967 666 666 666 666 666 6552 6552 2750 27...
4,4,4 4 713 713 4216 3628 7026 6954 553 553 553 55...
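
In the output above each user's list starts with the user's own id, and several ids appear more than once. If the list is meant as a recommendation of other users, it can be cleaned after the search. The sketch below is one possible approach; `clean_neighbours` is a hypothetical helper, and in practice `topk` would need to be larger than the number of neighbours finally wanted so that enough ids survive the filtering.

```python
def clean_neighbours(query_id, neighbour_ids, k):
    """Drop the query itself, repeated ids, and -1 padding; keep at most k ids."""
    # Faiss pads with -1 when fewer than topk results are available.
    seen = {int(query_id), -1}
    cleaned = []
    for n in neighbour_ids:
        n = int(n)
        if n in seen:
            continue
        seen.add(n)
        cleaned.append(n)
        if len(cleaned) == k:
            break
    return cleaned


# Prefix of the first row shown above, for user 0.
cleaned = clean_neighbours(0, [0, 0, 4018, 4018, 1719, 7118], k=3)
print(cleaned)  # [4018, 1719, 7118]
```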


### 4. Look up information by ID

In [157]:
target_ids = pd.Series(I[0], name="MovieID")
target_ids.head()

0    65
1    32
2    62
3    21
4    14
Name: MovieID, dtype: int64

In [158]:
type(target_ids)

pandas.core.series.Series

In [30]:
df_movie = pd.read_csv("./datas/ml-1m/movies.dat",
                     sep="::", header=None, engine="python",
                     names = "MovieID::Title::Genres".split("::"))
df_movie.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [31]:
df_result = pd.merge(target_ids, df_movie)
df_result.head()

Unnamed: 0,MovieID,Title,Genres
0,439,Dangerous Game (1993),Drama
1,3147,"Green Mile, The (1999)",Drama|Thriller
2,985,Small Wonders (1996),Documentary
3,1290,Some Kind of Wonderful (1987),Drama|Romance
4,3408,Erin Brockovich (2000),Drama
