**Thinking1: In CTR (click-through rate) prediction, what is the principle behind GBDT+LR?**

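It is a stacking approach: GBDT does the feature construction and LR does the classification. Each sample is passed through the trained trees, the index of the leaf it lands in for each tree is one-hot encoded, and these encoded leaf features are the input to LR.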
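
The two stages can be sketched with scikit-learn as follows. This is a minimal illustration, not a production pipeline: the synthetic data from `make_classification` stands in for a real click log, and the tree count, depth and the half/half split between the GBDT and LR stages are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary "click" data (assumption: stands in for a real CTR log)
X, y = make_classification(n_samples=4000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Fit GBDT and LR on different halves so LR does not see leaves that were fit on its own rows
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Stage 1: GBDT learns the feature transformation
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)


def leaf_indices(model, X):
    # apply() returns (n_samples, n_estimators, 1) for binary classification
    return model.apply(X)[:, :, 0]


# One-hot encode the leaf index of every tree
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaf_indices(gbdt, X_gbdt))

# Stage 2: LR is trained on the one-hot leaf features
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(leaf_indices(gbdt, X_lr)), y_lr)

leaf_features_test = encoder.transform(leaf_indices(gbdt, X_test))
auc = roc_auc_score(y_test, lr.predict_proba(leaf_features_test)[:, 1])
print("GBDT+LR test AUC:", round(auc, 4))
```

Each tree contributes exactly one active feature per sample, so every row of `leaf_features_test` has 30 ones.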

**Thinking2: What is the structure of the Wide & Deep model, and why does it have both memorization and generalization ability?**

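**Structure**: the Wide part is a linear model (LR); the Deep part is a DNN. The two parts are trained jointly and their outputs are summed before the final activation.  
The LR part memorizes well: with cross-product features it learns feature co-occurrences that appear frequently in the historical data. The DNN part extracts deep features from low-dimensional embeddings, so it generalizes well to feature combinations that were rarely or never seen.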
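
A minimal NumPy sketch of the forward pass, to make the structure concrete. The sizes, the single hidden layer and the random weights are all illustrative assumptions; a real model learns both parts jointly by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, emb_dim, hidden = 5, 7, 4, 8  # toy sizes (assumption)

# Wide part: one weight per (user, item) cross feature -> memorization
wide_w = rng.normal(scale=0.1, size=n_users * n_items)
wide_b = 0.0

# Deep part: embeddings followed by a small MLP -> generalization
user_emb = rng.normal(scale=0.1, size=(n_users, emb_dim))
item_emb = rng.normal(scale=0.1, size=(n_items, emb_dim))
W1 = rng.normal(scale=0.1, size=(2 * emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=hidden)


def wide_deep_forward(user_ids, item_ids):
    # Wide logit: look up the weight of the user x item cross feature
    cross_ids = user_ids * n_items + item_ids
    wide_logit = wide_w[cross_ids] + wide_b
    # Deep logit: concatenate embeddings, apply ReLU hidden layer, project to a scalar
    x = np.concatenate([user_emb[user_ids], item_emb[item_ids]], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)
    deep_logit = h @ W2
    # The two logits are summed, then passed through a sigmoid
    return 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))


probs = wide_deep_forward(np.array([0, 1, 4]), np.array([2, 6, 0]))
print(probs)
```

For a regression task such as the rating prediction in Action1, the sigmoid would be dropped and the summed output used directly.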

**Thinking3: In CTR prediction, in what ways can FM be combined with a DNN, and what are the representative models?**

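**Serial**: the FM-style interaction output is fed into the DNN; the representative model is NFM.  
**Parallel**: FM and the DNN run side by side on shared embeddings and their outputs are added; the representative model is DeepFM.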
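
The difference between the two wirings can be sketched in NumPy for a single sample. This is a simplified illustration: first-order (linear) terms and biases are omitted, the MLP has one hidden layer, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, emb_dim, hidden = 3, 4, 6  # toy sizes (assumption)

# One embedding vector per active feature of a single sample
emb = rng.normal(size=(n_fields, emb_dim))


def bi_interaction(e):
    # 0.5 * ((sum_i e_i)^2 - sum_i e_i^2), element-wise: a vector of size emb_dim
    return 0.5 * (e.sum(axis=0) ** 2 - (e ** 2).sum(axis=0))


def mlp(x, W1, W2):
    # One ReLU hidden layer followed by a linear projection to a scalar
    return np.maximum(x @ W1, 0.0) @ W2


# Serial (NFM style): the pooled second-order interactions are the DNN input
W1_serial = rng.normal(size=(emb_dim, hidden))
W2_serial = rng.normal(size=hidden)
nfm_logit = mlp(bi_interaction(emb), W1_serial, W2_serial)

# Parallel (DeepFM style): FM term and DNN term are computed separately and added
W1_par = rng.normal(size=(n_fields * emb_dim, hidden))
W2_par = rng.normal(size=hidden)
fm_logit = bi_interaction(emb).sum()          # sum over pairs of <e_i, e_j>
dnn_logit = mlp(emb.reshape(-1), W1_par, W2_par)
deepfm_logit = fm_logit + dnn_logit

print(nfm_logit, deepfm_logit)
```

Summing the bi-interaction vector gives the usual FM second-order term, which is why the same helper serves both variants.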

**Thinking4: GBDT and random forest are both tree-based algorithms. How do they differ?**

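GBDT: each new tree fits the residuals (more generally, the negative gradient of the loss) left by the trees before it. This is the Boosting idea: trees are built sequentially and mainly reduce bias, so predictions have relatively low bias and relatively high variance.  
RF: many trees are trained independently on bootstrap samples and combined into one strong model by averaging or voting. This is the Bagging idea: it mainly reduces variance, so predictions have relatively high bias and relatively low variance.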
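
The two training loops can be sketched side by side with scikit-learn decision trees. This is a toy regression under assumed settings (50 trees, learning rate 0.1, one input feature); a real random forest also samples a random feature subset at each split, which has no effect here because there is only one feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

n_trees, learning_rate = 50, 0.1

# Boosting: trees are built sequentially, each one fits the current residuals
boost_pred = np.full_like(y, y.mean())
boost_mse = []
for _ in range(n_trees):
    residual = y - boost_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    boost_pred += learning_rate * tree.predict(X)
    boost_mse.append(np.mean((y - boost_pred) ** 2))

# Bagging: trees are built independently on bootstrap samples, then averaged
bag_preds = []
for seed in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx])
    bag_preds.append(tree.predict(X))
bag_pred = np.mean(bag_preds, axis=0)

print("boosting train MSE, first vs last round:", boost_mse[0], boost_mse[-1])
print("bagging train MSE:", np.mean((y - bag_pred) ** 2))
```

The boosting loop uses shallow trees and depends on the order of the rounds, while the bagging loop uses fully grown trees that could be trained in parallel.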

**Thinking5: How is item popularity used in recommender systems?**

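1. Popular-item recommendation, which addresses the cold-start problem.
2. As a weight in personalized recommendation: lowering the weight of highly popular items makes the results more personalized.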
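
Both uses can be sketched in a few lines of plain Python. The interaction log, the helper names `top_popular` and `penalize_popularity`, and the `1 / log(2 + count)` weighting are illustrative assumptions; other damping functions are equally plausible.

```python
import math
from collections import Counter

# Toy interaction log of (user, item) pairs (assumption)
interactions = [("u1", "A"), ("u2", "A"), ("u3", "A"), ("u1", "B"), ("u2", "C")]
popularity = Counter(item for _, item in interactions)


def top_popular(k):
    # Use 1: fallback list for a cold-start user with no history
    return [item for item, _ in popularity.most_common(k)]


def penalize_popularity(scores):
    # Use 2: damp personalized scores of popular items with a log weight
    return {item: s / math.log(2 + popularity[item]) for item, s in scores.items()}


print(top_popular(1))
adjusted = penalize_popularity({"A": 1.0, "B": 1.0})
print(adjusted)
```

With equal raw scores, the less popular item `B` ends up ranked above `A` after the adjustment.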

**Action1: Use the Wide&Deep model for rating prediction on MovieLens**

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from deepctr.models import WDL
from deepctr.feature_column import SparseFeat,get_feature_names, DenseFeat
from datetime import datetime

In [2]:
# Load data
path = 'WideDeep/ml-100k/'
# Load the original MovieLens 100K data
u_data = pd.read_csv(path+'u.data', header=None, sep='\t')
u_user = pd.read_csv(path+'u.user', header=None, sep='|')
u_item = pd.read_csv(path+'u.item', header=None, sep='|', encoding='unicode_escape')
# Assign column names
u_data.columns = 'user_id | item_id | rating | timestamp'.split(' | ')
u_user.columns = 'user_id | age | gender | occupation | zip_code'.split(' | ')
item_columns = 'movie_id | movie_title | release_date | video_release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children_s | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western'
u_item.columns = item_columns.split(' | ')
# Merge the three tables
ml_data = pd.merge(u_data, u_user, on="user_id")
ml_data = pd.merge(ml_data, u_item, left_on='item_id', right_on='movie_id')

In [3]:
# Data processing
# Convert timestamps to the same date format as release_date
def timestamp2date(ts):
    return datetime.utcfromtimestamp(ts).strftime('%d-%b-%Y')

def timestamp2hour(ts):
    return datetime.utcfromtimestamp(ts).hour

def str2date(s):
    return datetime.strptime(s, '%d-%b-%Y')

ml_data['rate_hour'] = ml_data['timestamp'].map(timestamp2hour)
ml_data['rate_hour'] = pd.cut(ml_data['rate_hour'], 3, labels=['morning', 'afternoon', 'night'])
ml_data['rate_date'] = ml_data['timestamp'].map(timestamp2date)

# Handle missing values
ml_data['release_date'] = ml_data['release_date'].fillna('')
# Days between the rating date and the release date (-1 if either is missing)
def delta_days(s1, s2):
    if not s1 or not s2:
        return -1
    return (str2date(s1)-str2date(s2)).days
ml_data['delta_days'] = ml_data.apply(lambda x: delta_days(x.rate_date, x.release_date), axis=1)

# Bin the age values
# ml_data['age_label'] = pd.cut(ml_data['age'], 3, labels=['young', 'middle', 'old'])
ml_data['age_label'] = pd.cut(ml_data['age'], 6)

In [4]:
# Label-encode the sparse categorical features
sparse_features = ["movie_id", "user_id", "gender", "occupation", "zip_code", "age_label", "rate_hour"]
for feature in sparse_features:
    lbe = LabelEncoder()
    ml_data[feature] = lbe.fit_transform(ml_data[feature])
    
fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=ml_data[feat].nunique(),embedding_dim=8)
                       for feat in sparse_features]
# Encode rate_date and release_date separately, with one shared encoder
lbe = LabelEncoder()
lbe.fit(pd.concat([ml_data['rate_date'],ml_data['release_date']]))
ml_data['rate_date'] = lbe.transform(ml_data['rate_date'])
ml_data['release_date'] = lbe.transform(ml_data['release_date'])
vocabulary_size = pd.concat([ml_data['rate_date'],ml_data['release_date']]).nunique()

sparse_features += ['rate_date', 'release_date'] 
fixlen_feature_columns += [SparseFeat(feat, vocabulary_size=vocabulary_size,embedding_dim=8)
                       for feat in ['rate_date', 'release_date']]
# Normalize the dense features
# The gap between rating date and release date is used as a dense feature
dense_features = ["delta_days"]
mms = MinMaxScaler(feature_range=(0,1))
# mms = StandardScaler()
ml_data[dense_features] = mms.fit_transform(ml_data[dense_features])
# The genre indicator columns are added as dense features without scaling
dense_features += list(u_item.columns[5:])
fixlen_feature_columns += [DenseFeat(feat, 1,) for feat in dense_features]
# Target label
target = ['rating']

In [5]:
# Build the feature columns
dnn_feature_columns = fixlen_feature_columns
linear_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

In [6]:
# Split the dataset into a training set and a test set
train, test = train_test_split(ml_data, test_size=0.2)

train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

In [31]:
# Train with Wide&Deep
model = WDL(linear_feature_columns,dnn_feature_columns,task='regression', 
               dnn_hidden_units=(16, 16, 16), dnn_dropout=0.6,
               l2_reg_embedding=1e-5, l2_reg_dnn=0)
model.compile("adam", "mse", metrics=['mse'], )

history = model.fit(train_model_input, train[target].values,
                    batch_size=256, epochs=15, verbose=1, validation_split=0.2, )

Epoch 1/15
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15

In [32]:
# Predict with Wide&Deep
pred_ans = model.predict(test_model_input, batch_size=256)
# Report RMSE and MSE (RMSE is computed from the rounded MSE)
mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = mse ** 0.5
print("test RMSE", rmse, mse)

test RMSE 0.9411694852681954 0.8858


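**The training process shows some overfitting, but in my tests increasing the L2 regularization did not reduce it. Instead, the results on the test set got worse, which is a bit puzzling...**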