This notebook uses GBDT+LR for recommendation. The dataset comes from a 2014 Kaggle competition.

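Dataset:
- `train.csv`: a portion of Criteo's traffic over a period of 7 days. Each row corresponds to an ad served by Criteo. Rows are sorted chronologically.
- `test.csv`: the day following the training period, in the same format as `train.csv`.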

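Fields:
- `Label`: the target. 0 means not clicked, 1 means clicked.
- `I1-I13`: 13 numeric feature columns, most of them count features.
- `C1-C26`: 26 categorical feature columns, with values represented as 32-bit data.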

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.preprocessing import MinMaxScaler,\
    OneHotEncoder, LabelEncoder
import lightgbm



# 1. Data loading and preprocessing

In [5]:
import os
from pathlib import Path

In [9]:
path = "../datasets/criteo's-traffic-data"
path = Path(path)
path.exists()

True

## 1.1 Loading the data

In [19]:
train_path = path / 'train.csv'
test_path = path / 'test.csv'
train_path.exists() and test_path.exists()

True

In [20]:
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

In [24]:
df_train.shape, df_test.shape

((1599, 41), (400, 40))

In [29]:
set(df_train.columns) - set(df_test.columns)

{'Label'}

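## 1.2 Preprocessing

The `Id` column is not needed and can be dropped. The `train` data has labels while the `test` data does not, so `test` gets a placeholder `Label` of -1. This lets the two sets be concatenated, processed together, and separated again later.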

In [36]:
# `columns=` already names the axis, so `axis=1` is redundant
df_train.drop(columns='Id', inplace=True)
df_test.drop(columns='Id', inplace=True)

In [39]:
df_test['Label'] = -1

In [53]:
data = pd.concat([df_train, df_test], axis=0)

In [55]:
data.fillna(-1, inplace=True)

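The data contains two kinds of features, numeric and categorical, and they need to be processed separately. So first define two collections holding the column names of each kind.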

In [41]:
df_train.columns

Index(['Label', 'I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10',
       'I11', 'I12', 'I13', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8',
       'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18',
       'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26'],
      dtype='object')

In [46]:
integer_features = df_train.columns[1:14]
category_features = df_train.columns[14:]
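
Positional slicing as above depends on the column order staying exactly as shown. As a hedged alternative, the two groups can also be selected by name pattern. The sketch below uses a small made-up frame (the values and the reduced column set are assumptions for illustration) and keeps an `Id` column to show that the pattern does not pick it up.

```python
import pandas as pd

# Toy frame following the same naming scheme as the Criteo columns (values are made up).
df = pd.DataFrame({
    "Id": [1, 2],
    "Label": [0, 1],
    "I1": [1.0, 2.0],
    "I2": [3.0, None],
    "C1": ["68fd1e64", "05db9164"],
    "C2": ["80e26c9b", None],
})

# Select columns whose name is "I" or "C" followed only by digits.
integer_features = df.filter(regex=r"^I\d+$").columns.tolist()
category_features = df.filter(regex=r"^C\d+$").columns.tolist()

print(integer_features)
print(category_features)
```

Because the pattern requires digits after the prefix, `Id` and `Label` are left out regardless of where they sit in the frame.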

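# 2. Building the models

Before combining GBDT and LR, each of the two models is first built on its own.

Notes:
- LR: needs feature processing (numeric features are normalized, categorical features are one-hot encoded)
- GBDT: needs feature processing (categorical features are one-hot encoded)
- GBDT+LR: the LR input is not only the GBDT output features, it **also includes the original input data**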
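
To make the last point concrete, here is a minimal sketch of the GBDT+LR idea on synthetic data. It is not the notebook's own pipeline: it assumes scikit-learn's `GradientBoostingClassifier` in place of LightGBM, and the data from `make_classification` and all hyperparameters are made up for illustration. Each sample is mapped to the leaf it lands in for every tree, the leaf indices are one-hot encoded, and the result is stacked next to the original features before fitting LR.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: fit the GBDT.
gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# Step 2: leaf index of every sample in every tree.
# apply() returns shape (n_samples, n_estimators, 1) for binary problems.
leaves_train = gbdt.apply(X_train)[:, :, 0]
leaves_val = gbdt.apply(X_val)[:, :, 0]

# Step 3: one-hot encode the leaf indices (fit on training leaves only).
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_onehot_train = encoder.fit_transform(leaves_train)
leaf_onehot_val = encoder.transform(leaves_val)

# Step 4: LR input = original features + one-hot leaf features.
lr_input_train = hstack([csr_matrix(X_train), leaf_onehot_train]).tocsr()
lr_input_val = hstack([csr_matrix(X_val), leaf_onehot_val]).tocsr()

lr = LogisticRegression(max_iter=1000)
lr.fit(lr_input_train, y_train)
val_proba = lr.predict_proba(lr_input_val)[:, 1]
print(lr_input_train.shape, val_proba[:5])
```

The LR input has one column per original feature plus one column per distinct leaf seen in training, which is why the combined matrix is kept sparse here.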

## 2.1 LR

### 2.1.1 Normalize all numeric features (`MinMaxScaler`)

In [56]:
scaler = MinMaxScaler()
for col in integer_features:
    data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
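
For reference, min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. `MinMaxScaler` treats every column independently, so the per-column loop can also be written as a single call over all numeric columns. The sketch below checks both points on a made-up two-column frame (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up numeric columns.
df = pd.DataFrame({"I1": [0.0, 5.0, 10.0], "I2": [-1.0, 1.0, 3.0]})

# Manual formula, applied column by column through broadcasting.
manual = (df - df.min()) / (df.max() - df.min())

# One call scales every column independently.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df), columns=df.columns, index=df.index
)

print(scaled)
print(np.allclose(manual.values, scaled.values))
```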

### 2.1.2 One-hot encode the categorical features (either `OneHotEncoder` or `pd.get_dummies` works)

In [62]:
for col in category_features:
    onehot_features = pd.get_dummies(data[col], prefix=col)
    data.drop([col], axis=1, inplace=True)
    data = pd.concat([data, onehot_features], axis=1)

In [64]:
data.shape

(1999, 13105)
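
With more than 13,000 mostly-zero columns, it may be worth encoding all categorical columns in one call and storing the result sparsely. The sketch below shows `pd.get_dummies` with its `columns` and `sparse` arguments on a tiny made-up frame. The column values are assumptions for illustration, and whether sparse storage helps in practice depends on the downstream code.

```python
import pandas as pd

# Tiny stand-in for the combined frame (values are made up).
df = pd.DataFrame({
    "I1": [0.1, 0.5, 0.9],
    "C1": ["a", "b", "a"],
    "C2": ["x", "x", "y"],
})

# Encode every listed column at once; untouched columns are kept in front.
encoded = pd.get_dummies(df, columns=["C1", "C2"], sparse=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

The column prefixes default to the source column names, which matches what `prefix=col` does in the loop above.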

### 2.1.3 Splitting the dataset

Separate the training and test sets

In [65]:
# Take an explicit copy so that pop() does not modify a slice of `data`
train = data[data['Label'] != -1].copy()
target = train.pop('Label')
# drop() without inplace returns a new frame, which avoids SettingWithCopyWarning
test = data[data['Label'] == -1].drop(columns='Label')


In [70]:
train.shape, test.shape

((1599, 13104), (400, 13104))

Split into training and validation sets

In [71]:
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2)
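
The split above is random and unseeded, so the log-loss values reported below will vary from run to run. A hedged sketch of a reproducible split that also keeps the click rate the same in both parts is shown here. The data, the 25% positive rate, and the seed are made-up assumptions. Since the source data is ordered by time, a time-based split could be another option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with roughly 25% positives (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.25).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_val.shape)
print(y_train.mean(), y_val.mean())
```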

### 2.1.4 Building the LR model

In [77]:
model_lr = LogisticRegression()
model_lr.fit(x_train, y_train)

# predict_proba returns the probabilities of classes 0 and 1; the target is the positive class, so take column 1
train_logloss = log_loss(y_train, model_lr.predict_proba(x_train)[:, 1])
val_logloss = log_loss(y_val, model_lr.predict_proba(x_val)[:, 1])

In [85]:
print('train_logloss: ', train_logloss)
print('val_logloss: ', val_logloss)

train_logloss:  0.12532530444217038
val_logloss:  0.4388871134754885
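
As a reminder of what these numbers measure, binary log loss is the mean of `-(y*log(p) + (1-y)*log(1-p))` over all samples, where `p` is the predicted probability of the positive class. The sketch below computes it by hand on four made-up predictions and compares the result with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted positive-class probabilities.
y_true = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])

# Mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(manual, log_loss(y_true, p))
```

Lower is better. A training loss far below the validation loss, as in the LR run above, usually points to overfitting.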


In [86]:
y_pred = model_lr.predict_proba(test)[:, 1]

In [91]:
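print('Predicted positive-class probabilities for the first 10 test samples:\n', y_pred[:10])

Predicted positive-class probabilities for the first 10 test samples: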
 [0.64830175 0.83173134 0.27290936 0.03009931 0.12878579 0.16314081
 0.55666631 0.0678317  0.02816935 0.26834223]


## 2.2 GBDT model

In [94]:
data = pd.concat([df_train, df_test], axis=0)

### 2.2.1 Only one-hot encoding of the categorical features is needed

In [96]:
for col in category_features:
    onehot_features = pd.get_dummies(data[col], prefix=col)
    data.drop(col, axis=1, inplace=True)
    data = pd.concat([data, onehot_features], axis=1)

### 2.2.2 Splitting the dataset

Separate the training and test sets

In [98]:
# Take an explicit copy so that pop() does not modify a slice of `data`
train = data[data['Label'] != -1].copy()
target = train.pop('Label')
# drop() without inplace returns a new frame, which avoids SettingWithCopyWarning
test = data[data['Label'] == -1].drop(columns='Label')


In [100]:
train.shape, test.shape

((1599, 13092), (400, 13092))

Split into training and validation sets

In [101]:
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2)

### 2.2.3 Building the GBDT model

In [103]:
gbdt = lightgbm.LGBMClassifier(boosting_type='gbdt', 
                              objective='binary', 
                              subsample=0.8,  # row subsampling only takes effect when subsample_freq > 0
                              min_child_weight=0.5, 
                              colsample_bytree=0.7,
                              num_leaves=100,
                              max_depth=12,
                              learning_rate=0.01,
                              n_estimators=10000
                              )
gbdt.fit(x_train, y_train, eval_set=[(x_train, y_train),
                                    (x_val, y_val)],
         eval_names=['train', 'val'],
         eval_metric='binary_logloss',
         early_stopping_rounds=100)

[1]	train's binary_logloss: 0.510716	val's binary_logloss: 0.50848
Training until validation scores don't improve for 100 rounds
[2]	train's binary_logloss: 0.50825	val's binary_logloss: 0.507845
[3]	train's binary_logloss: 0.505972	val's binary_logloss: 0.507147
[4]	train's binary_logloss: 0.503759	val's binary_logloss: 0.506615
[5]	train's binary_logloss: 0.50127	val's binary_logloss: 0.506343
[6]	train's binary_logloss: 0.498882	val's binary_logloss: 0.506083
[7]	train's binary_logloss: 0.496621	val's binary_logloss: 0.505833
[8]	train's binary_logloss: 0.494249	val's binary_logloss: 0.50551
[9]	train's binary_logloss: 0.491893	val's binary_logloss: 0.505215
[10]	train's binary_logloss: 0.489866	val's binary_logloss: 0.504489
[11]	train's binary_logloss: 0.487545	val's binary_logloss: 0.504268
[12]	train's binary_logloss: 0.485274	val's binary_logloss: 0.503703
[13]	train's binary_logloss: 0.48304	val's binary_logloss: 0.503277
[14]	train's binary_logloss: 0.480842	(rest of the training log is truncated)

LGBMClassifier(colsample_bytree=0.7, learning_rate=0.01, max_depth=12,
               min_child_weight=0.5, n_estimators=10000, num_leaves=100,
               objective='binary', subsample=0.8)

In [105]:
train_logloss = log_loss(y_train, gbdt.predict_proba(x_train)[:, 1])   # log loss: mean of -(y*log(p) + (1-y)*log(1-p))
val_logloss = log_loss(y_val, gbdt.predict_proba(x_val)[:, 1])
y_pred = gbdt.predict_proba(test)[:, 1]

In [107]:
print('train_logloss: ', train_logloss)
print('val_logloss: ', val_logloss)
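print('Predicted positive-class probabilities for the first 10 test samples:\n', y_pred[:10])

train_logloss:  0.38494440600725727
val_logloss:  0.491378914243493
Predicted positive-class probabilities for the first 10 test samples: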
 [0.32814256 0.27857479 0.18389449 0.1756473  0.17750179 0.32951145
 0.18588754 0.15332876 0.14050926 0.20831841]


## 2.3 GBDT+LR

In [127]:
data = pd.concat([df_train, df_test], axis=0)
data.fillna(-1, inplace=True)

In [128]:
for col in category_features:
    onehot_features = pd.get_dummies(data[col], prefix=col)
    data.drop(col, axis=1, inplace=True)
    data = pd.concat([data, onehot_features], axis=1)

# Take an explicit copy so that pop() does not modify a slice of `data`
train = data[data['Label'] != -1].copy()
target = train.pop('Label')
# drop() without inplace returns a new frame, which avoids SettingWithCopyWarning
test = data[data['Label'] == -1].drop(columns='Label')


In [129]:
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.2)

In [130]:
gbdt = lightgbm.LGBMClassifier(boosting_type='gbdt',
                        objective='binary',
                        subsample= 0.8,  # row subsampling only takes effect when subsample_freq > 0
                        min_child_weight= 0.5,
                        colsample_bytree= 0.7,
                        num_leaves=100,
                        max_depth = 12,
                        learning_rate=0.01,
                        n_estimators=1000,
                        )
gbdt.fit(x_train, y_train, 
        eval_set=[(x_train, y_train), (x_val, y_val)], 
        eval_names=['train', 'val'],
        eval_metric='binary_logloss',
        early_stopping_rounds=100,
       )

[1]	train's binary_logloss: 0.514356	val's binary_logloss: 0.49594
Training until validation scores don't improve for 100 rounds
[2]	train's binary_logloss: 0.51211	val's binary_logloss: 0.495314
[3]	train's binary_logloss: 0.509881	val's binary_logloss: 0.494604
[4]	train's binary_logloss: 0.507836	val's binary_logloss: 0.493983
[5]	train's binary_logloss: 0.505763	val's binary_logloss: 0.493119
[6]	train's binary_logloss: 0.503774	val's binary_logloss: 0.492661
[7]	train's binary_logloss: 0.501896	val's binary_logloss: 0.491738
[8]	train's binary_logloss: 0.499895	val's binary_logloss: 0.49112
[9]	train's binary_logloss: 0.497814	val's binary_logloss: 0.490488
[10]	train's binary_logloss: 0.49592	val's binary_logloss: 0.489721
[11]	train's binary_logloss: 0.493976	val's binary_logloss: 0.489026
[12]	train's binary_logloss: 0.491928	val's binary_logloss: 0.488457
[13]	train's binary_logloss: 0.490179	val's binary_logloss: 0.487714
[14]	train's binary_logloss: 0.48831	(rest of the training log is truncated)

LGBMClassifier(colsample_bytree=0.7, learning_rate=0.01, max_depth=12,
               min_child_weight=0.5, n_estimators=1000, num_leaves=100,
               objective='binary', subsample=0.8)

In [131]:
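model = gbdt.booster_

# pred_leaf=True returns, for every sample, the index of the leaf it falls into in each tree
gbdt_feats_train = model.predict(train, pred_leaf=True)
gbdt_feats_test = model.predict(test, pred_leaf=True)
gbdt_feats_name = ['gbdt_leaf_' + str(i) for i in range(gbdt_feats_train.shape[1])]
# Reuse the original row indices so the column-wise concat in the next cell aligns row by row
df_train_gbdt_feats = pd.DataFrame(gbdt_feats_train, columns=gbdt_feats_name, index=train.index)
df_test_gbdt_feats = pd.DataFrame(gbdt_feats_test, columns=gbdt_feats_name, index=test.index)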

In [132]:
train = pd.concat([train, df_train_gbdt_feats], axis = 1)
test = pd.concat([test, df_test_gbdt_feats], axis = 1)
train_len = train.shape[0]
data = pd.concat([train, test])

In [133]:
scaler = MinMaxScaler()
for col in integer_features:
    data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))

for col in gbdt_feats_name:
    onehot_feats = pd.get_dummies(data[col], prefix = col)
    data.drop([col], axis = 1, inplace = True)
    data = pd.concat([data, onehot_feats], axis = 1)

train = data[: train_len]
test = data[train_len:]
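
Calling `pd.get_dummies` once per tree builds a wide dense frame. A hedged alternative is to encode the whole leaf-index matrix with `OneHotEncoder`, which returns a sparse matrix and can ignore leaf indices that were never seen when it was fitted. The sketch below uses made-up leaf indices for 3 trees, and the variable names are illustrative, not the notebook's.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Made-up leaf indices: 4 training rows and 2 test rows, 3 trees each.
leaves_train = np.array([[0, 2, 1], [1, 2, 0], [0, 0, 1], [1, 1, 1]])
leaves_test = np.array([[0, 2, 5], [1, 0, 0]])  # leaf 5 of the third tree is unseen

# One column per (tree, leaf) pair seen during fit; unseen leaves encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot_train = encoder.fit_transform(leaves_train)
onehot_test = encoder.transform(leaves_test)

# Stack a (made-up) scaled numeric column next to the leaf features for LR.
dense_train = np.array([[0.1], [0.4], [0.7], [1.0]])
lr_train = hstack([csr_matrix(dense_train), onehot_train]).tocsr()

print(onehot_train.shape, lr_train.shape)
print(onehot_test.toarray().sum(axis=1))
```

Fitting the encoder on the training leaves only also means the test rows do not influence which columns exist.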

In [134]:
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.3)

In [137]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
tr_logloss = log_loss(y_train, lr.predict_proba(x_train)[:, 1])
print('tr-logloss: ', tr_logloss)
val_logloss = log_loss(y_val, lr.predict_proba(x_val)[:, 1])
print('val-logloss: ', val_logloss)
y_pred = lr.predict_proba(test)[:, 1]
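print('Predicted positive-class probabilities for the first 10 test samples:\n', y_pred[:10])

tr-logloss:  0.009539733478411328
val-logloss:  0.23676042150340368
Predicted positive-class probabilities for the first 10 test samples: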
 [8.45006675e-01 1.71889779e-02 3.96333726e-01 3.70922173e-03
 1.06322840e-02 1.98127020e-01 4.46584975e-03 6.12527110e-03
 8.19511513e-05 3.08945681e-01]


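The comparison shows that GBDT+LR reaches a much lower log loss than either model used alone, on both the training set and the validation set: 0.0095 / 0.2368 for GBDT+LR, versus 0.1253 / 0.4389 for LR and 0.3849 / 0.4914 for GBDT. Note that each run uses a different random split, and the GBDT in section 2.3 was trained on rows that the final `train_test_split` may place in the LR validation set, so the size of the validation improvement is probably overstated.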