## Training item2vec in Python for Related-Movie Recommendations

Background:
* word2vec: takes (doc, words) pairs as input and produces word embeddings
* item2vec: takes (userid, itemids) pairs as input and produces item embeddings

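Notes:
* Recommending from embeddings of tokenized titles/content is content-similarity recommendation
* Recommending from embeddings trained on behavior lists is behavior-based recommendation, which often works better than content-similarity recommendation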

Extensions:
* Summing or averaging word embeddings gives a document embedding;
* Summing or averaging item embeddings gives a user embedding;
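
The averaging idea can be sketched as follows. This is a minimal illustration on made-up 3-dimensional vectors; the names `item_embeddings` and `user_embedding` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np

# Toy item embeddings keyed by item id (values are made up for illustration).
item_embeddings = {
    "1": np.array([1.0, 0.0, 2.0]),
    "3": np.array([3.0, 2.0, 0.0]),
    "6": np.array([2.0, 4.0, 4.0]),
}

def user_embedding(item_ids, embeddings):
    """Average the embeddings of the items a user liked, skipping unknown ids."""
    vectors = [embeddings[i] for i in item_ids if i in embeddings]
    if not vectors:
        return None  # no known items, so no embedding can be formed
    return np.mean(vectors, axis=0)

print(user_embedding(["1", "3", "6"], item_embeddings))  # [2. 2. 2.]
```

The same function would apply to word embeddings and a document's tokens; only the lookup table changes.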

### 1. Get the data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
df["rating"].mean()

3.501556983616962

In [4]:
# Keep only ratings above the mean and treat them as the user's liked movies
df = df[df["rating"] > df["rating"].mean()].copy()
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
# Aggregate into one row per userId with a space-separated list of movieIds
df_group = df.groupby(['userId'])['movieId'].apply(lambda x: ' '.join([str(m) for m in x])).reset_index()
df_group.head()

Unnamed: 0,userId,movieId
0,1,1 3 6 47 50 101 110 151 157 163 216 231 235 26...
1,2,333 1704 3578 6874 46970 48516 58559 60756 681...
2,3,849 1587 2288 2851 3024 3703 4518 5181 5746 57...
3,4,106 125 162 176 215 232 260 265 319 342 345 34...
4,5,1 21 34 36 50 58 110 232 247 261 290 296 367 4...


In [6]:
df_group.to_csv("./datas/movielens_uid_movieids.csv", index=False)
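
Word2vec learns from co-occurrence inside a sliding window, so the order of the IDs within each user's list can influence the result. The cells above keep the row order of the ratings file. One possible variant, sketched here on made-up rows, sorts each user's ratings by `timestamp` before joining. Whether this improves the recommendations on MovieLens is not evaluated here, and the `toy` and `toy_group` names are only for illustration.

```python
import pandas as pd

# Toy ratings (made-up rows) with the same columns as ratings.csv.
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [50, 1, 47, 333, 6],
    "rating":    [5.0, 4.0, 5.0, 4.0, 4.5],
    "timestamp": [30, 10, 20, 5, 1],
})

# Sort by user and time, then join each user's movie ids into one string.
toy_group = (
    toy.sort_values(["userId", "timestamp"])
       .groupby("userId")["movieId"]
       .apply(lambda x: " ".join(str(m) for m in x))
       .reset_index()
)
print(toy_group)
```

Here user 1 ends up with `1 47 50` and user 2 with `6 333`, i.e. the movies in the order they were rated.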

### 3. Train item2vec with PySpark

In [7]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("PySpark Item2vec") \
    .getOrCreate()

sc = spark.sparkContext

#### Read the CSV data with PySpark

In [8]:
df = spark.read.csv("./datas/movielens_uid_movieids.csv", header=True)
df.show(5)

+------+--------------------+
|userId|             movieId|
+------+--------------------+
|     1|1 3 6 47 50 101 1...|
|     2|333 1704 3578 687...|
|     3|849 1587 2288 285...|
|     4|106 125 162 176 2...|
|     5|1 21 34 36 50 58 ...|
+------+--------------------+
only showing top 5 rows



In [9]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Split the space-separated string into a list of movie IDs
df = df.withColumn('movie_ids', F.split(df.movieId, " "))

#### Train word2vec and extract the vectors

In [10]:
# https://spark.apache.org/docs/2.4.6/ml-features.html#word2vec

from pyspark.ml.feature import Word2Vec

word2Vec = Word2Vec(
    vectorSize=5, 
    minCount=0, 
    inputCol="movie_ids", 
    outputCol="movie_2vec")

model = word2Vec.fit(df)

In [11]:
# Rather than computing an embedding per user, fetch the embedding of each item
model.getVectors().show(3, truncate=False)

+-----+-----------------------------------------------------------------------------------------------------------+
|word |vector                                                                                                     |
+-----+-----------------------------------------------------------------------------------------------------------+
|26985|[-0.08157788217067719,0.04485902935266495,-0.03560459613800049,0.07710414379835129,0.002518109977245331]   |
|5451 |[-0.01672695204615593,0.045234885066747665,0.023883186280727386,-0.02078450843691826,-0.019449032843112946]|
|4018 |[-0.1291561871767044,0.063893623650074,0.04831916466355324,-0.0976295918226242,-0.1921783685684204]        |
+-----+-----------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [12]:
model.getVectors().select("word", "vector") \
           .toPandas() \
           .to_csv('./datas/movielens_movie_embedding.csv', index=False)

### 4. Find the 10 most similar movies for a given movie

In [13]:
df_embedding = pd.read_csv("./datas/movielens_movie_embedding.csv")
df_embedding.head(3)

Unnamed: 0,word,vector
0,26985,"[-0.08157788217067719,0.04485902935266495,-0.0..."
1,5451,"[-0.01672695204615593,0.045234885066747665,0.0..."
2,4018,"[-0.1291561871767044,0.063893623650074,0.04831..."


In [14]:
df_movie = pd.read_csv("./datas/ml-latest-small/movies.csv")
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
df_merge = pd.merge(left=df_embedding, 
                    right=df_movie,
                    left_on="word",
                    right_on="movieId")
df_merge.head()

Unnamed: 0,word,vector,movieId,title,genres
0,26985,"[-0.08157788217067719,0.04485902935266495,-0.0...",26985,Nirvana (1997),Action|Sci-Fi
1,5451,"[-0.01672695204615593,0.045234885066747665,0.0...",5451,Pumpkin (2002),Comedy|Drama|Romance
2,4018,"[-0.1291561871767044,0.063893623650074,0.04831...",4018,What Women Want (2000),Comedy|Romance
3,4056,"[-0.16579893231391907,0.06356438249349594,-0.1...",4056,"Pledge, The (2001)",Crime|Drama|Mystery|Thriller
4,32584,"[-0.03422517329454422,0.002282997127622366,-0....",32584,"Ballad of Jack and Rose, The (2005)",Drama


In [16]:
import numpy as np
import json
df_merge["vector"] = df_merge["vector"].map(lambda x : np.array(json.loads(x)))
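
The CSV stores each vector as a string such as `[-0.129...,0.063...,...]`, which happens to be valid JSON, so `json.loads` can parse it before conversion to a NumPy array. A minimal sketch of that step on a single string (the `raw` and `vec` names are only for illustration):

```python
import json
import numpy as np

# One vector string in the same format as the "vector" column of the CSV.
raw = "[-0.1291561871767044,0.063893623650074,0.04831916466355324,-0.0976295918226242,-0.1921783685684204]"

# json.loads yields a list of floats; np.array turns it into a 1-D float array.
vec = np.array(json.loads(raw))
print(vec.shape, vec.dtype)  # (5,) float64
```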

In [17]:
# Pick an arbitrary movie: 4018 What Women Want (2000)
movie_id = 4018
df_merge.loc[df_merge["movieId"]==movie_id]

Unnamed: 0,word,vector,movieId,title,genres
2,4018,"[-0.1291561871767044, 0.063893623650074, 0.048...",4018,What Women Want (2000),Comedy|Romance


In [18]:
movie_embedding = df_merge.loc[df_merge["movieId"]==movie_id, "vector"].iloc[0]
movie_embedding

array([-0.12915619,  0.06389362,  0.04831916, -0.09762959, -0.19217837])

In [19]:
# Cosine similarity
from scipy.spatial import distance
df_merge["sim_value"] = df_merge["vector"].map(lambda x : 1 - distance.cosine(movie_embedding, x))
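
`distance.cosine` returns the cosine *distance*, so `1 - distance.cosine(u, v)` is the cosine similarity `u·v / (|u||v|)`. The same quantity can be computed for all rows at once with NumPy. The sketch below uses made-up 2-dimensional vectors, and `cosine_similarities` is a hypothetical helper, not part of SciPy or this notebook.

```python
import numpy as np

def cosine_similarities(query, matrix):
    """Cosine similarity between `query` (d,) and each row of `matrix` (n, d)."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Product of the row norms and the query norm, one value per row.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

# Toy vectors (made up): same direction, orthogonal, opposite direction.
query = np.array([1.0, 0.0])
matrix = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
sims = cosine_similarities(query, matrix)
print(sims)  # similarities of 1, 0 and -1
```

In this notebook the matrix could presumably be built with `np.vstack(df_merge["vector"].tolist())`. Note that an all-zero vector would cause a division by zero and needs separate handling.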

In [20]:
df_merge[["movieId", "title", "genres", "sim_value"]].head(3)

Unnamed: 0,movieId,title,genres,sim_value
0,26985,Nirvana (1997),Action|Sci-Fi,0.110415
1,5451,Pumpkin (2002),Comedy|Drama|Romance,0.745538
2,4018,What Women Want (2000),Comedy|Romance,1.0


In [21]:
# Sort by similarity in descending order and show the top 10
df_merge.sort_values(by="sim_value", ascending=False)[["movieId", "title", "genres", "sim_value"]].head(10)

Unnamed: 0,movieId,title,genres,sim_value
2,4018,What Women Want (2000),Comedy|Romance,1.0
4381,2261,One Crazy Summer (1986),Comedy,0.984198
4600,188833,The Man Who Killed Don Quixote (2018),Adventure|Comedy|Fantasy,0.972859
1641,69406,"Proposal, The (2009)",Comedy|Romance,0.970813
5522,75341,Remember Me (2010),Drama|Romance,0.97014
2704,45431,Over the Hedge (2006),Adventure|Animation|Children|Comedy,0.969787
5571,2702,Summer of Sam (1999),Drama,0.966559
4513,40962,"Yours, Mine and Ours (2005)",Comedy|Romance,0.95987
1228,3948,Meet the Parents (2000),Comedy,0.957473
3455,1631,"Assignment, The (1997)",Action|Thriller,0.955127
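
The top-10 list above includes the query movie itself (similarity 1.0). A possible way to wrap the lookup into a reusable function that excludes the query is sketched below. `most_similar` is a hypothetical helper, it uses a plain NumPy cosine instead of `scipy.spatial.distance`, and the data is a made-up three-row frame with the same `movieId` / `title` / `vector` columns as `df_merge`.

```python
import numpy as np
import pandas as pd

def most_similar(df, movie_id, k=10):
    """Return the k rows most similar to `movie_id`, excluding the movie itself."""
    query = df.loc[df["movieId"] == movie_id, "vector"].iloc[0]
    others = df.loc[df["movieId"] != movie_id].copy()
    # Cosine similarity between the query vector and every other vector.
    others["sim_value"] = others["vector"].map(
        lambda v: float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
    )
    return others.sort_values("sim_value", ascending=False).head(k)

# Toy data (made up) mirroring the columns of df_merge.
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
    "vector": [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([0.0, 1.0])],
})
print(most_similar(toy, movie_id=1, k=2)[["movieId", "title", "sim_value"]])
```

On the real data, a call such as `most_similar(df_merge, 4018, k=10)` would be expected to return ten movies other than the query.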
