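**Thinking1: In practice, which is used more, FM or MF, and why?**

- MF only models two dimensions, user and item. FM can take arbitrary additional features, and MF is a special case of FM (FM restricted to one-hot user and item features).
- FM is generally used more: it can incorporate more features and models the pairwise interactions between them, so its predictions are usually more accurate.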
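
The claim that MF is a special case of FM can be checked numerically. The sketch below is a minimal pure-Python illustration, not any library's implementation; the sizes, random parameters, and function names are all made up for the example. With a feature vector that contains only a one-hot user and a one-hot item, the FM score equals the biased MF score $\mu + b_u + b_i + \langle p_u, q_i \rangle$.

```python
import random

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def mf_predict(u, i, mu, b_user, b_item, P, Q):
    """Biased MF: mu + b_u + b_i + <p_u, q_i>."""
    return mu + b_user[u] + b_item[i] + sum(a * b for a, b in zip(P[u], Q[i]))

random.seed(0)
n_users, n_items, k = 3, 4, 2          # hypothetical sizes
w0 = 3.5                               # global bias, plays the role of mu
w = [random.uniform(-1, 1) for _ in range(n_users + n_items)]
V = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n_users + n_items)]

def one_hot(u, i):
    """Feature vector with exactly two active entries: the user and the item."""
    x = [0.0] * (n_users + n_items)
    x[u] = 1.0
    x[n_users + i] = 1.0
    return x

u, i = 1, 2
fm_score = fm_predict(one_hot(u, i), w0, w, V)
# Reuse the FM parameters as MF parameters: the first n_users rows belong to users.
mf_score = mf_predict(u, i, w0, w[:n_users], w[n_users:], V[:n_users], V[n_users:])
print(abs(fm_score - mf_score) < 1e-9)
```

Adding further columns to `x` (age, gender, genre, and so on) is what FM allows and MF does not.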

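**Thinking2: How does FFM differ from FM?**

- FFM is a field-aware FM: each feature has several latent vectors, one for each field it can interact with, instead of a single one.
- FM is a special case of FFM (FFM with all features placed in a single field).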
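
The difference is easiest to see in the pairwise term. The following is a small sketch under assumed toy parameters (the names `field_of`, `W`, and `W_single` are illustrative, not from any library): FM uses $\langle v_i, v_j \rangle$, while FFM uses $\langle w_{i,f_j}, w_{j,f_i} \rangle$, where $f_j$ is the field of feature $j$. With a single field the two coincide.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fm_pairwise(x, V):
    """FM: feature i has a single latent vector V[i]."""
    n = len(x)
    return sum(dot(V[i], V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))

def ffm_pairwise(x, field_of, W):
    """FFM: W[i][f] is the latent vector feature i uses against field f."""
    n = len(x)
    return sum(dot(W[i][field_of[j]], W[j][field_of[i]]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))

random.seed(0)
n_features, k = 4, 3
x = [1.0, 0.0, 1.0, 1.0]

# One field only: every feature has exactly one vector, so FFM collapses to FM.
V = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n_features)]
single_field = [0] * n_features
W_single = [[V[i]] for i in range(n_features)]
same = abs(ffm_pairwise(x, single_field, W_single) - fm_pairwise(x, V)) < 1e-12

# Two fields: every feature keeps two vectors, so parameters grow from n*k to n*f*k.
field_of = [0, 0, 1, 1]
W = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(2)]
     for _ in range(n_features)]
print(same, ffm_pairwise(x, field_of, W))
```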

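**Thinking3: What problems does DeepFM solve compared with FM, and how does it work?**

- DeepFM uses a deep component (a DNN) to learn third-order and higher feature interactions, and an FM component for first- and second-order interactions. The two components share the same feature embeddings and are trained jointly.
- It keeps FM's strengths while avoiding the high computational cost FM incurs when it is extended to higher-order interactions.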
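
A forward pass makes the structure concrete. This is a simplified pure-Python sketch, not the DeepCTR implementation: the field sizes, layer widths, and random weights are arbitrary, bias terms are omitted, and the DNN has a single hidden layer. The FM part and the DNN part read the same embeddings, and their outputs are added.

```python
import random

random.seed(0)
field_sizes = [5, 4, 3]      # hypothetical vocabulary size of each sparse field
k, hidden = 4, 8             # embedding size and hidden-layer width (arbitrary)

def rand_vec(n):
    return [random.gauss(0, 0.1) for _ in range(n)]

emb = [[rand_vec(k) for _ in range(size)] for size in field_sizes]   # shared embeddings
first_order = [rand_vec(size) for size in field_sizes]               # first-order weights
W1 = [rand_vec(len(field_sizes) * k) for _ in range(hidden)]         # DNN hidden layer
w2 = rand_vec(hidden)                                                # DNN output layer

def fm_second_order(vectors):
    """sum_{i<j} <v_i, v_j>, computed in O(m*k) as 0.5 * ((sum v)^2 - sum v^2)."""
    total = 0.0
    for f in range(k):
        s = sum(v[f] for v in vectors)
        total += 0.5 * (s * s - sum(v[f] ** 2 for v in vectors))
    return total

def brute_force_second_order(vectors):
    """Same quantity, computed pair by pair, to check the identity above."""
    m = len(vectors)
    return sum(sum(a * b for a, b in zip(vectors[i], vectors[j]))
               for i in range(m) for j in range(i + 1, m))

def deepfm_forward(sample):
    """sample holds one category index per field, e.g. [2, 0, 1]."""
    vectors = [emb[f][idx] for f, idx in enumerate(sample)]
    y_fm = (sum(first_order[f][idx] for f, idx in enumerate(sample))
            + fm_second_order(vectors))
    flat = [value for vec in vectors for value in vec]          # concatenated embeddings
    h = [max(0.0, sum(w * a for w, a in zip(row, flat))) for row in W1]  # ReLU layer
    y_deep = sum(w * a for w, a in zip(w2, h))
    return y_fm + y_deep     # regression task: no sigmoid on the output

sample = [2, 0, 1]
vectors = [emb[f][idx] for f, idx in enumerate(sample)]
print(deepfm_forward(sample))
```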

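**Thinking4: How does the baseline algorithm in Surprise work? What is the difference between BaselineOnly and KNNBaseline?**

- Baseline algorithm: a statistics-based baseline estimate; prediction = global mean + user bias + item bias
 - Prediction: $\hat{r}_{ui} = b_{ui}$
 - $b_{ui} = \mu + b_u + b_i$: the global mean rating, plus the user's deviation from it, plus the item's deviation from it
 - The biases are estimated with ALS (Surprise's default; SGD is also available)
- KNNBaseline algorithm: KNN + Baseline
 - Prediction: the baseline plus a similarity-weighted average of the neighbors' deviations from their own baselines, taken over a user neighborhood or an item neighborhood. In the user-based case: $\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$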
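
The ALS estimation of the biases can be sketched in a few lines. This follows the update rule as I understand it from the Surprise documentation (item biases first, then user biases, with regularization terms `reg_i` and `reg_u`); the default values used below are assumptions, and the toy ratings and function names are invented for the example.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate mu, b_u and b_i from (user, item, rating) triples by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Fix the user biases and solve for each item bias.
        for i, rs in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rs) / (reg_i + len(rs))
        # Fix the item biases and solve for each user bias.
        for u, rs in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rs) / (reg_u + len(rs))
    return mu, dict(bu), dict(bi)

def knn_baseline_predict(u, i, mu, bu, bi, neighbors):
    """neighbors: list of (similarity, neighbor_user, neighbor_rating_of_item_i)."""
    b_ui = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    den = sum(s for s, _, _ in neighbors)
    if den == 0:
        return b_ui                      # no usable neighbors: fall back to the baseline
    num = sum(s * (r - (mu + bu.get(v, 0.0) + bi.get(i, 0.0))) for s, v, r in neighbors)
    return b_ui + num / den

toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 3), ("b", "y", 2)]
mu, bu, bi = als_baselines(toy)
print(mu, bu, bi)
```

BaselineOnly stops at `mu + bu[u] + bi[i]`; KNNBaseline adds the neighborhood correction computed in `knn_baseline_predict`.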

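**Thinking5: Which neighborhood-based collaborative filtering algorithms are there? Briefly describe how they work.**

- User-based collaborative filtering (UserCF): recommend items that users similar to the target user are interested in
- Item-based collaborative filtering (ItemCF): recommend items similar to those the target user already likes  

The two methods suit different scenarios: use UserCF when the item catalog changes quickly, and ItemCF when the catalog is relatively stable.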
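
As a minimal illustration of UserCF, the sketch below predicts a missing rating as the similarity-weighted average of the ratings given by the most similar users. The toy ratings, user names, and the choice of cosine similarity over co-rated items are assumptions made for the example. ItemCF is the same computation with the roles of users and items swapped.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def cosine(r1, r2):
    """Cosine similarity computed over the items both users have rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    num = sum(r1[i] * r2[i] for i in common)
    den = sqrt(sum(r1[i] ** 2 for i in common)) * sqrt(sum(r2[i] ** 2 for i in common))
    return num / den

def predict_user_cf(user, item, k=2):
    """Weighted average of the item's ratings from the k most similar users."""
    neighbors = [(cosine(ratings[user], ratings[v]), ratings[v][item])
                 for v in ratings if v != user and item in ratings[v]]
    neighbors = sorted(neighbors, reverse=True)[:k]
    den = sum(s for s, _ in neighbors)
    if den == 0:
        return None                      # nobody comparable has rated the item
    return sum(s * r for s, r in neighbors) / den

prediction = predict_user_cf("alice", "D")
print(round(prediction, 3))
```

Here bob's tastes are closer to alice's than carol's are, so the prediction lands nearer to bob's rating of 4 than to carol's rating of 2.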

**Action1: Use the libfm tool to predict ratings on movielens, with the SGD optimizer**

Data conversion:
>./triple_format_to_libfm.pl -in ratings.dat -target 2 -delete_column 3 -separator "::"

Training with SGD, learning rate = 0.01:
> ./libFM -task r -train ratings.dat.libfm -test ratings.dat.libfm -dim '1,1,8' -iter 100 -method sgd -learn_rate 0.01 -regular '0,0,0.01' -init_stdev 0.1 -out movielens_out.txt

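Result after the last iteration:
> #Iter= 99       Train=0.778209  Test=0.778209   
> Final   Train=0.778209  Test=0.778209

Note that `-train` and `-test` point to the same file, so the reported Test RMSE is identical to the Train RMSE and is not a held-out estimate.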
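
Because the command above passes the same file to `-train` and `-test`, one way to get a held-out number is to split the converted file first. The sketch below is only a suggestion: the helper name, the 80/20 ratio, and the in-memory example lines are assumptions. It relies only on the libFM text format having one `target index:value ...` example per line.

```python
import random

def split_libfm_lines(lines, test_ratio=0.2, seed=42):
    """Shuffle the examples and return (train_lines, test_lines)."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Tiny in-memory stand-in for the contents of ratings.dat.libfm
example = [f"{1 + n % 5} {n}:1 {100 + n}:1" for n in range(10)]
train_lines, test_lines = split_libfm_lines(example)
print(len(train_lines), len(test_lines))
```

In practice the lines would be read from `ratings.dat.libfm`, the two parts written to separate files (for instance the hypothetical names `ratings.train.libfm` and `ratings.test.libfm`), and those files passed to `-train` and `-test`. Splitting after conversion keeps the feature indices consistent between the two parts.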

**Action2: Use DeepFM to predict ratings on movielens**

1. Rating prediction on movielens_sample

In [5]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat,get_feature_names

In [6]:
# Load the data
path = 'DeepCTR/'
data = pd.read_csv(path+"movielens_sample.txt")
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
target = ['rating']

In [8]:
# Label-encode each sparse feature
for feature in sparse_features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature])
# Count the number of distinct values of each feature (its vocabulary size)
fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) for feature in sparse_features]
# print(fixlen_feature_columns)
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

In [10]:
# Split the dataset into a training set and a test set
train, test = train_test_split(data, test_size=0.2)
train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

In [30]:
# Train with DeepFM
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
model.compile("adam", "mse", metrics=['mse'], )
history = model.fit(train_model_input, train[target].values, batch_size=16, epochs=6, verbose=True, validation_split=0.2, )

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


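**Because the training set is very small (only 128 samples), the batch size should not be 256; otherwise every epoch would be a single full-batch gradient descent step. batch size=16 is used here and gives better results.**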
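
The effect of the batch size on the number of parameter updates is simple arithmetic. The short sketch below uses a made-up helper name and the 128-sample figure quoted above, and counts the updates per epoch for both settings.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches, and therefore gradient updates, in one epoch."""
    return math.ceil(n_samples / batch_size)

n_train = 128  # training-set size quoted in the note above
for batch_size in (16, 256):
    print(batch_size, updates_per_epoch(n_train, batch_size))
```

With 6 epochs, batch size 16 gives 48 updates in total, while batch size 256 gives only 6.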

In [33]:
# Predict with DeepFM
pred_ans = model.predict(test_model_input, batch_size=16)
# Report the RMSE (square root of the MSE)
mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = mse ** 0.5
print("test RMSE", rmse)

test RMSE 1.0165136496870075


2. Rating prediction on the movielens 100K dataset

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat,get_feature_names, DenseFeat
from datetime import datetime

In [2]:
# Load the data
path = 'DeepCTR/ml-100k/'
# load origin data of movielens 100K
u_data = pd.read_csv('DeepCTR/ml-100k/u.data', header=None, sep='\t')
u_user = pd.read_csv('DeepCTR/ml-100k/u.user', header=None, sep='|')
u_item = pd.read_csv('DeepCTR/ml-100k/u.item', header=None, sep='|', encoding='unicode_escape')
# get the columns name 
u_data.columns = 'user_id | item_id | rating | timestamp'.split(' | ')
u_user.columns = 'user_id | age | gender | occupation | zip_code'.split(' | ')
item_columns = 'movie_id | movie_title | release_date | video_release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children_s | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western'
u_item.columns = item_columns.split(' | ')
# merge the three tables 
ml_data = pd.merge(u_data, u_user, on="user_id")
ml_data = pd.merge(ml_data, u_item, left_on='item_id', right_on='movie_id')

In [3]:
# Data processing
# timestamp convert to date form same as release date
def timestamp2date(ts):
    return datetime.utcfromtimestamp(ts).strftime('%d-%b-%Y')

def timestamp2hour(ts):
    return datetime.utcfromtimestamp(ts).hour

def str2date(s):
    return datetime.strptime(s, '%d-%b-%Y')

ml_data['rate_hour'] = ml_data['timestamp'].map(timestamp2hour)
ml_data['rate_hour'] = pd.cut(ml_data['rate_hour'], 3, labels=['morning', 'afternoon', 'night'])
ml_data['rate_date'] = ml_data['timestamp'].map(timestamp2date)

# Fill missing values
ml_data['release_date'] = ml_data['release_date'].fillna('')
# Number of days between the release date and the rating date
def delta_days(s1, s2):
    if not s1 or not s2:
        return -1
    return (str2date(s1)-str2date(s2)).days
ml_data['delta_days'] = ml_data.apply(lambda x: delta_days(x.rate_date, x.release_date), axis=1)

# Bucket the age values
# ml_data['age_label'] = pd.cut(ml_data['age'], 3, labels=['young', 'middle', 'old'])
ml_data['age_label'] = pd.cut(ml_data['age'], 6)

In [4]:
# Label-encode the sparse categorical features
sparse_features = ["movie_id", "user_id", "gender", "occupation", "zip_code", "age_label", "rate_hour"]
for feature in sparse_features:
    lbe = LabelEncoder()
    ml_data[feature] = lbe.fit_transform(ml_data[feature])
    
fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=ml_data[feat].nunique(),embedding_dim=8)
                       for feat in sparse_features]
# Encode rate_date and release_date separately, with a shared vocabulary
lbe = LabelEncoder()
lbe.fit(pd.concat([ml_data['rate_date'],ml_data['release_date']]))
ml_data['rate_date'] = lbe.transform(ml_data['rate_date'])
ml_data['release_date'] = lbe.transform(ml_data['release_date'])
vocabulary_size = pd.concat([ml_data['rate_date'],ml_data['release_date']]).nunique()

sparse_features += ['rate_date', 'release_date'] 
fixlen_feature_columns += [SparseFeat(feat, vocabulary_size=vocabulary_size,embedding_dim=8)
                       for feat in ['rate_date', 'release_date']]
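# Normalize the dense features
# The gap between the rating date and the release date is used as a dense feature
dense_features = ["delta_days"]
mms = MinMaxScaler(feature_range=(0,1))
ml_data[dense_features] = mms.fit_transform(ml_data[dense_features])
# Add the u_item columns from index 5 onward (the 0/1 genre indicators) as dense features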
dense_features += list(u_item.columns[5:])
fixlen_feature_columns += [DenseFeat(feat, 1,) for feat in dense_features]
# Target label
target = ['rating']

In [5]:
# Build the feature columns
dnn_feature_columns = fixlen_feature_columns
linear_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

In [6]:
# Split the dataset into a training set and a test set
train, test = train_test_split(ml_data, test_size=0.2)

train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

In [7]:
# Train with DeepFM
model = DeepFM(linear_feature_columns,dnn_feature_columns,task='regression', 
               dnn_hidden_units=(64, 32, 64), dnn_dropout=0.6,
               l2_reg_embedding=0.3, l2_reg_dnn=10)
model.compile("adam", "mse", metrics=['mse'], )

history = model.fit(train_model_input, train[target].values,
                    batch_size=256, epochs=20, verbose=1, validation_split=0.2, )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [8]:
# Predict with DeepFM
pred_ans = model.predict(test_model_input, batch_size=256)
# Report the RMSE (square root of the MSE)
mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = mse ** 0.5
print("test RMSE", rmse)

test RMSE 0.936162379077476


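**Here the number of hidden units was adjusted, the regularization strength was increased, and dropout was added, all to prevent overfitting.**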

**Action3: Apply neighborhood-based collaborative filtering (any one of KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline) to the MovieLens dataset, using k-fold cross-validation (k=3), and report the RMSE and MAE of each fold**

In [6]:
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import KFold
from surprise import KNNBasic, KNNWithMeans, KNNBaseline, KNNWithZScore
# Load the data
path = 'L6-code/knn_cf/'
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file(path+'ratings.csv', reader=reader)
train_set = data.build_full_trainset()

**KNNBasic with UserCF**

In [7]:
# Define the K-fold cross-validation iterator, K=3
kf = KFold(n_splits=3)
# Store the K models
algos = []

for trainset, testset in kf.split(data):
    algo = KNNBasic() #use default setting
    algos.append(algo)
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9030
MAE:  0.6904
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9030
MAE:  0.6912
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9057
MAE:  0.6921


**KNNBasic with ItemCF**

In [8]:
#KNNBasic with item CF
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
# Store the K models
algos = []
for trainset, testset in kf.split(data):
    algo = KNNBasic(sim_options=sim_options)
    algos.append(algo)
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9620
MAE:  0.7427
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9591
MAE:  0.7404
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9621
MAE:  0.7425


**KNNWithMeans**

In [10]:
#KNNWithMeans
# Store the K models
algos = []
for trainset, testset in kf.split(data):
    algo = KNNWithMeans()
    algos.append(algo)
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8818
MAE:  0.6794
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8809
MAE:  0.6784
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8785
MAE:  0.6773


**KNNWithZScore**

In [11]:
# Store the K models
algos = []
for trainset, testset in kf.split(data):
    algo = KNNWithZScore()
    algos.append(algo)
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8771
MAE:  0.6722
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8809
MAE:  0.6749
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8784
MAE:  0.6730


**KNNBaseline**

In [12]:
# Store the K models
algos = []
for trainset, testset in kf.split(data):
    algo = KNNBaseline()
    algos.append(algo)
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8589
MAE:  0.6588
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8565
MAE:  0.6579
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8582
MAE:  0.6585


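**KNNBaseline performs best on this dataset, with the lowest RMSE (about 0.858) and MAE (about 0.658) across the three folds.**

Note that the ItemCF run also switches the similarity measure from the default MSD to cosine, so its gap to the UserCF run cannot be attributed to the item-based setting alone.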
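
To make the comparison easier to read, the per-fold numbers printed in the logs above can be averaged per algorithm. The values below are copied from those logs; the dictionary layout and helper name are just one convenient way to organize them.

```python
# Per-fold RMSE and MAE copied from the cross-validation logs in this section
fold_metrics = {
    "KNNBasic (UserCF, msd)":    {"rmse": [0.9030, 0.9030, 0.9057], "mae": [0.6904, 0.6912, 0.6921]},
    "KNNBasic (ItemCF, cosine)": {"rmse": [0.9620, 0.9591, 0.9621], "mae": [0.7427, 0.7404, 0.7425]},
    "KNNWithMeans":              {"rmse": [0.8818, 0.8809, 0.8785], "mae": [0.6794, 0.6784, 0.6773]},
    "KNNWithZScore":             {"rmse": [0.8771, 0.8809, 0.8784], "mae": [0.6722, 0.6749, 0.6730]},
    "KNNBaseline":               {"rmse": [0.8589, 0.8565, 0.8582], "mae": [0.6588, 0.6579, 0.6585]},
}

def mean(values):
    return sum(values) / len(values)

# Mean RMSE and MAE over the three folds for each algorithm
summary = {name: (mean(m["rmse"]), mean(m["mae"])) for name, m in fold_metrics.items()}

# Print the algorithms from best to worst mean RMSE
for name, (rmse, mae) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{name:27s} RMSE={rmse:.4f}  MAE={mae:.4f}")

best = min(summary, key=lambda name: summary[name][0])
```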