# MarTech Challenge: Click Fraud Prediction, Approach and Implementation

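## 1 Background
Ad fraud is one of the major challenges in digital marketing. Click fraud wastes large amounts of advertisers' money and also distorts click data. This competition provides about 500,000 click records. Note that the organizers generated the data by simulation, hid the meaning of some features, and anonymized the data.
The task is to predict whether a user's click is a normal click or a fraudulent one. Click fraud prediction applies to feed ads, banner ads, and the Baidu ad network (百度网盟), where it helps merchants identify click fraud and reach real users.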

[Competition page](https://aistudio.baidu.com/aistudio/competition/detail/52)

This write-up covers three parts: data analysis, data exploration & feature engineering, and modeling.

## 2 Data Analysis

### Loading the data

In [None]:
import pandas as pd
train = pd.read_csv('data/data97586/train.csv')
test1 = pd.read_csv('data/data97586/test1.csv')
train

Unnamed: 0.1,Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,label,lan,media_id,...,os,osv,package,sid,timestamp,version,fea_hash,location,fea1_hash,cus_type
0,0,316361,1199,46000.0,0.0,0.0,0.0,1,,104,...,android,9,18,1438873,1.559893e+12,8,2135019403,0,2329670524,601
1,1,135939,893,0.0,0.0,0.0,0.0,1,,19,...,android,8.1,0,1185582,1.559994e+12,4,2782306428,1,2864801071,1000
2,2,399254,821,0.0,760.0,0.0,360.0,1,,559,...,android,8.1.0,0,1555716,1.559837e+12,0,1392806005,2,628911675,696
3,3,68983,1004,46000.0,2214.0,0.0,1080.0,0,,129,...,android,8.1.0,0,1093419,1.560042e+12,0,3562553457,3,1283809327,753
4,4,288999,1076,46000.0,2280.0,0.0,1080.0,1,zh-CN,64,...,android,8.0.0,0,1400089,1.559867e+12,5,2364522023,4,1510695983,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,499995,392477,1028,46000.0,1920.0,3.0,1080.0,1,zh-CN,144,...,Android,7.1.2,25,1546078,1.559834e+12,7,861755946,79,140647032,373
499996,499996,346134,1001,0.0,1424.0,0.0,720.0,0,,29,...,android,8.1.0,0,1480612,1.559814e+12,3,1714444511,23,2745131047,525
499997,499997,499635,761,46000.0,1280.0,0.0,720.0,0,,54,...,android,6.0.1,9,1698442,1.559676e+12,0,3843262581,25,1326115882,810
499998,499998,239786,917,46001.0,960.0,0.0,540.0,0,zh_CN,109,...,android,5.1.1,0,1331155,1.559840e+12,0,1984296118,225,1446741112,772


### Field descriptions

![](https://ai-studio-static-online.cdn.bcebos.com/c5a7a8f10ce44593a6dd3310cda0352efea701c63a854ee395a2be52d0fec0ab)

**`label` indicates whether the click is fraudulent: 0 is normal, 1 is fraud.**

### Initial feature selection

In [None]:
features = train.drop(['Unnamed: 0','label'],axis = 1)
labels = train['label']
features.columns

Index(['android_id', 'apptype', 'carrier', 'dev_height', 'dev_ppi',
       'dev_width', 'lan', 'media_id', 'ntt', 'os', 'osv', 'package', 'sid',
       'timestamp', 'version', 'fea_hash', 'location', 'fea1_hash',
       'cus_type'],
      dtype='object')

## 3 Data Exploration & Feature Engineering

### Building a function to find key feature values

In [None]:
# Data exploration: find the feature values most strongly associated with label 1
def find_key_feature(train, selected):
    temp = pd.DataFrame(columns = [0,1])
    temp0 = train[train['label'] == 0]
    temp1 = train[train['label'] == 1]
    # Share (in %) of each value within each class
    temp[0] = temp0[selected].value_counts() / len(temp0) * 100
    temp[1] = temp1[selected].value_counts() / len(temp1) * 100
    # Ratio of the share under label 1 to the share under label 0
    temp[2] = temp[1] / temp[0]
    # Keep the values whose ratio exceeds 10
    result = temp[temp[2] > 10].sort_values(2, ascending = False).index
    return result
key_feature = {}
key_feature['osv'] = find_key_feature(train, 'osv')
key_feature


{'osv': Index(['7.7.7', '7.2.1', '7.7.5', '7.8.5', '7.8.7', '3.8.0', '7.6.7', '3.9.0',
        '2.3', '8.0.1', '7.9.0', '7.6.4', '3.8.4', '7.8.9', '21100', '7.9.2',
        '4.1', '7.7.2', '7.8.2', 'Android_8.0.0', '7.8.0', '3.8.6', '7.7.0',
        '7.8.4', '8', '7.6.8', '21000', '7.8.6', '5', '6.1', '7.7.3', '9.0.0',
        '3.8.3', '3.7.8', '9.0', '8.0', 'Android_9', '7.7.4', '6.1.0'],
       dtype='object')}
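To make the ratio concrete, here is a minimal sketch on a made-up toy frame. The values and the `share_ratio` helper are illustrative assumptions, not the competition data or the notebook's own code. It computes each value's share within each class and divides the two, which is the quantity `find_key_feature` compares against 10.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "osv":   ["9", "9", "9", "8.1", "7.7.7", "7.7.7", "7.7.7", "9"],
    "label": [0,   0,   0,   0,     1,       1,       1,       1],
})

def share_ratio(df, col):
    """Share of each value among label==1 rows divided by its share among label==0 rows."""
    share0 = df.loc[df["label"] == 0, col].value_counts(normalize=True)
    share1 = df.loc[df["label"] == 1, col].value_counts(normalize=True)
    # A value missing from either class yields NaN and is dropped here.
    return (share1 / share0).dropna()

ratios = share_ratio(toy, "osv")
print(ratios)
```

In this toy frame `7.7.7` occurs only in fraudulent rows, so its ratio is undefined and it is dropped. The thresholding in `find_key_feature` appears to skip such values in the same way, which may be worth handling separately.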

### Choosing which fields to search for key values, based on feature type and meaning

In [None]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 19 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   android_id  500000 non-null  int64  
 1   apptype     500000 non-null  int64  
 2   carrier     500000 non-null  float64
 3   dev_height  500000 non-null  float64
 4   dev_ppi     500000 non-null  float64
 5   dev_width   500000 non-null  float64
 6   lan         316720 non-null  object 
 7   media_id    500000 non-null  int64  
 8   ntt         500000 non-null  float64
 9   os          500000 non-null  object 
 10  osv         493439 non-null  object 
 11  package     500000 non-null  int64  
 12  sid         500000 non-null  int64  
 13  timestamp   500000 non-null  float64
 14  version     500000 non-null  object 
 15  fea_hash    500000 non-null  object 
 16  location    500000 non-null  int64  
 17  fea1_hash   500000 non-null  int64  
 18  cus_type    500000 non-null  int64  
dtypes:

In [None]:
features.columns

Index(['android_id', 'apptype', 'carrier', 'dev_height', 'dev_ppi',
       'dev_width', 'lan', 'media_id', 'ntt', 'os', 'osv', 'package', 'sid',
       'timestamp', 'version', 'fea_hash', 'location', 'fea1_hash',
       'cus_type'],
      dtype='object')

### Finding the key values for the chosen fields

In [None]:
selected_cols = ['osv','apptype', 'carrier', 'dev_height', 'dev_ppi',
       'dev_width', 'media_id', 'package', 'version', 'fea_hash', 'location', 'fea1_hash',
       'cus_type']
for selected in selected_cols:
    key_feature[selected] = find_key_feature(train, selected)
key_feature

{'osv': Index(['7.7.7', '7.2.1', '7.7.5', '7.8.5', '7.8.7', '3.8.0', '7.6.7', '3.9.0',
        '2.3', '8.0.1', '7.9.0', '7.6.4', '3.8.4', '7.8.9', '21100', '7.9.2',
        '4.1', '7.7.2', '7.8.2', 'Android_8.0.0', '7.8.0', '3.8.6', '7.7.0',
        '7.8.4', '8', '7.6.8', '21000', '7.8.6', '5', '6.1', '7.7.3', '9.0.0',
        '3.8.3', '3.7.8', '9.0', '8.0', 'Android_9', '7.7.4', '6.1.0'],
       dtype='object'),
 'apptype': Int64Index([1139, 716, 941, 851, 1034, 1067], dtype='int64'),
 'carrier': Float64Index([], dtype='float64'),
 'dev_height': Float64Index([2242.0, 1809.0, 1500.0, 2385.0,  918.0, 1546.0,  895.0, 1521.0,
                816.0,  830.0, 1540.0, 2219.0,  676.0, 1480.0,  818.0,  694.0,
                665.0, 2287.0, 2281.0,  851.0, 1560.0, 2131.0, 2320.0, 2248.0,
                846.0,  748.0, 2312.0, 2240.0,  770.0, 2406.0, 2223.0, 2244.0,
                749.0,  772.0, 2277.0, 3040.0,  892.0, 1493.0, 2310.0, 2466.0,
               1460.0, 1496.0, 1441.0, 2268.0,  747.0

### Constructing new feature fields

In [None]:
# Build new features; new field name = original field name + '1'
def f(x, selected):
    # 1 if the value is one of the key values for this field, otherwise 0
    if x in key_feature[selected]:
        return 1
    else:
        return 0

for selected in selected_cols:
    # Only create the flag when at least one value has a ratio above 10
    if len(key_feature[selected]) > 0:
        features[selected+'1'] = features[selected].apply(f, args = (selected,))
        test1[selected+'1'] = test1[selected].apply(f, args = (selected,))
        print(selected+'1 created')

osv1 created
apptype1 created
dev_height1 created
dev_ppi1 created
dev_width1 created
media_id1 created
package1 created
fea_hash1 created
fea1_hash1 created
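
The row-wise `apply` above can be slow on 500,000 rows. As a possible alternative, `Series.isin` builds the same 0/1 flag in a vectorized way. The sketch below assumes a `key_feature` dictionary shaped like the notebook's, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's objects.
key_feature = {"apptype": pd.Index([1139, 716, 941])}
features = pd.DataFrame({"apptype": [1199, 716, 893, 1139]})

for selected, keys in key_feature.items():
    if len(keys) > 0:
        # True where the value is one of the key values, then cast to 0/1.
        features[selected + "1"] = features[selected].isin(keys).astype(int)

print(features)
```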


### Inspecting the new feature field osv1

In [None]:
features['osv1'].value_counts()

0    444656
1     55344
Name: osv1, dtype: int64

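### Further feature selection

The `os` feature only takes the values `Android` and `android`. They mean the same thing, so the column is effectively constant and is dropped.

`sid` is unique for every row, so it is also excluded from modeling.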

In [None]:
remove_list = ['os','sid']
col = features.columns.tolist()
for i in remove_list:
    col.remove(i)
col

['android_id',
 'apptype',
 'carrier',
 'dev_height',
 'dev_ppi',
 'dev_width',
 'lan',
 'media_id',
 'ntt',
 'osv',
 'package',
 'timestamp',
 'version',
 'fea_hash',
 'location',
 'fea1_hash',
 'cus_type',
 'osv1',
 'apptype1',
 'dev_height1',
 'dev_ppi1',
 'dev_width1',
 'media_id1',
 'package1',
 'fea_hash1',
 'fea1_hash1']
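
The two observations behind this removal (a column that is constant after case normalization, and a column that is unique per row) can be checked programmatically. The following is a minimal sketch on a made-up frame; the toy values and the `drop_cols` name are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "os":      ["android", "Android", "android", "android"],
    "sid":     [1438873, 1185582, 1555716, 1093419],
    "apptype": [1199, 893, 893, 1004],
})

# Normalize case before counting so 'Android' and 'android' collapse into one value.
os_unique = toy["os"].str.lower().nunique()
# A column whose distinct count equals the row count behaves like an identifier.
sid_is_id = toy["sid"].nunique() == len(toy)

drop_cols = []
if os_unique == 1:
    drop_cols.append("os")
if sid_is_id:
    drop_cols.append("sid")
print(drop_cols)
```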

In [None]:
features = features[col]
# features

### Extracting multi-scale time features

In [None]:
from datetime import datetime

def get_date(features):
    # Timestamps are in milliseconds: divide by 1000, then convert to datetime
    features['timestamp'] = features['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
    
    # Build a DatetimeIndex to extract the date parts
    temp = pd.DatetimeIndex(features['timestamp'])
    features['year'] = temp.year
    features['month'] = temp.month
    features['day'] = temp.day
    features['week_day'] = temp.weekday
    features['hour'] = temp.hour
    features['minute'] = temp.minute
    
    # Add time_diff: time elapsed since the earliest click
    start_time = features['timestamp'].min()
    features['time_diff'] = features['timestamp'] - start_time
    # Convert time_diff to hours
    features['time_diff'] = features['time_diff'].dt.days * 24 + features['time_diff'].dt.seconds / 3600
    # Keep only day and time_diff
    features.drop(['timestamp','year','month','week_day','hour','minute'], axis = 1, inplace = True)
    
    return features

# Extract time features for the training set
features = get_date(features)
# Extract time features for the test set
test1 = get_date(test1)
features

Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,lan,media_id,ntt,osv,...,apptype1,dev_height1,dev_ppi1,dev_width1,media_id1,package1,fea_hash1,fea1_hash1,day,time_diff
0,316361,1199,46000.0,0.0,0.0,0.0,,104,6.0,9,...,0,0,0,0,0,0,0,0,7,111.535278
1,135939,893,0.0,0.0,0.0,0.0,,19,6.0,8.1,...,0,0,0,0,0,0,0,0,8,139.671944
2,399254,821,0.0,760.0,0.0,360.0,,559,0.0,8.1.0,...,0,1,0,0,0,0,0,0,6,95.971111
3,68983,1004,46000.0,2214.0,0.0,1080.0,,129,2.0,8.1.0,...,0,0,0,0,0,0,0,0,9,152.993333
4,288999,1076,46000.0,2280.0,0.0,1080.0,zh-CN,64,2.0,8.0.0,...,0,0,0,0,0,0,0,0,7,104.472222
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,392477,1028,46000.0,1920.0,3.0,1080.0,zh-CN,144,6.0,7.1.2,...,0,0,0,0,0,0,0,0,6,95.238056
499996,346134,1001,0.0,1424.0,0.0,720.0,,29,2.0,8.1.0,...,0,0,0,0,0,0,0,0,6,89.681111
499997,499635,761,46000.0,1280.0,0.0,720.0,,54,6.0,6.0.1,...,0,0,0,0,0,0,0,0,4,51.248889
499998,239786,917,46001.0,960.0,0.0,540.0,zh_CN,109,2.0,5.1.1,...,0,0,0,0,0,0,0,0,6,96.990556
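
Since only `day` and `time_diff` survive, the same two columns can be derived without the intermediate date parts. The sketch below uses `pd.to_datetime(..., unit="ms")` on made-up timestamps. Note that it interprets the epoch values as UTC, whereas `datetime.fromtimestamp` uses the machine's local time zone, so `day` may differ for clicks near midnight.

```python
import pandas as pd

# Millisecond epoch timestamps, made up for illustration.
base = 1.559676e12
toy = pd.DataFrame({"timestamp": [base, base + 3_600_000, base + 90_000_000]})

ts = pd.to_datetime(toy["timestamp"], unit="ms")  # naive UTC datetimes
toy["day"] = ts.dt.day
# Hours elapsed since the earliest click.
toy["time_diff"] = (ts - ts.min()).dt.total_seconds() / 3600
print(toy[["day", "time_diff"]])
```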


### Applying LabelEncoder to osv and lan

In [None]:
# Apply LabelEncoder to osv
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Concatenate the training and test sets so both share one encoding
all_df = pd.concat([train, test1])
all_df['osv'] = all_df['osv'].astype('str')
all_df['osv'] = le.fit_transform(all_df['osv'])
# Apply LabelEncoder to lan
all_df['lan'] = all_df['lan'].astype('str')
all_df['lan'] = le.fit_transform(all_df['lan'])
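
The reason for concatenating before encoding is that `LabelEncoder` raises an error on values it did not see during `fit`. A minimal sketch with made-up `osv` values (the variable names are illustrative) shows both the failure and the shared mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values; "10" occurs only in the test split.
train_osv = ["9", "8.1"]
test_osv = ["8.1", "10"]

# Fitting on the training split alone cannot encode the unseen test value.
train_only = LabelEncoder().fit(train_osv)
try:
    train_only.transform(test_osv)
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitting on both splits gives one shared mapping.
le = LabelEncoder().fit(train_osv + test_osv)
train_codes = le.transform(train_osv)
test_codes = le.transform(test_osv)
print(list(le.classes_), train_codes, test_codes, unseen_failed)
```

The resulting integer codes follow the sorted order of the strings, so they carry no version ordering. Tree models can still split on them, but the codes should not be read as "older" or "newer".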

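### Processing fea_hash, fea1_hash and version

In [None]:
# Feature transformation: treat overly long values (more than 16 characters) as outliers and set them to 0
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
# Data cleaning: set non-numeric version values to 0
features['version'] = features['version'].map(lambda x: int(x) if str(x).isdigit() else 0)
# Take the encoded osv back from all_df (training rows have a non-null label)
features['osv'] = all_df[all_df['label'].notnull()]['osv']
# Take the encoded lan back from all_df
features['lan'] = all_df[all_df['label'].notnull()]['lan']


# For prediction, the test set only needs the same columns as features.
# .copy() makes test_fea an independent frame, so the assignments below do not act on a slice of test1.
test_fea = test1[features.columns].copy()
# Feature transformation: treat overly long values (more than 16 characters) as outliers and set them to 0
test_fea['fea_hash'] = test_fea['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_fea['fea1_hash'] = test_fea['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
# Data cleaning: set non-numeric version values to 0
test_fea['version'] = test_fea['version'].map(lambda x: int(x) if str(x).isdigit() else 0)
# Take the encoded osv back from all_df (test rows have a null label)
test_fea['osv'] = all_df[all_df['label'].isnull()]['osv']
# Take the encoded lan back from all_df
test_fea['lan'] = all_df[all_df['label'].isnull()]['lan']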

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
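
The cleaning rules used in this cell are small enough to write as two plain helpers, which makes them easy to test in isolation. The names `clean_hash` and `clean_version` and the sample values below are assumptions for illustration, not part of the original notebook.

```python
def clean_hash(x, max_len=16):
    """Return 0 when the string form of x is longer than max_len characters, else int(x)."""
    return 0 if len(str(x)) > max_len else int(x)

def clean_version(x):
    """Return int(x) when x consists only of digits, else 0."""
    s = str(x)
    return int(s) if s.isdigit() else 0

# Made-up sample values.
hashes = [clean_hash(v) for v in ["2135019403", 2782306428, "12345678901234567890"]]
versions = [clean_version(v) for v in ["8", 4, "v1", "P_Final_6"]]
print(hashes)
print(versions)
```

With helpers like these, each column can be cleaned through `Series.map(clean_hash)` or `Series.map(clean_version)` on both the training and test frames.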

## 4 Modeling

### Ensemble model with 5-fold cross-validation

In [None]:
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import accuracy_score

def ensemble_model(clf, train_x, train_y, test):
    # Ensemble model built with 5-fold stratified cross-validation
    sk = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 2021)
    prob = []  # test-set probabilities from each fold
    mean_acc = 0  # running mean of validation accuracy
    
    for k, (train_index, val_index) in enumerate(sk.split(train_x, train_y)):
        train_x_real = train_x.iloc[train_index]
        train_y_real = train_y.iloc[train_index]
        val_x = train_x.iloc[val_index]
        val_y = train_y.iloc[val_index]
        # Train the sub-model on this fold
        clf = clf.fit(train_x_real, train_y_real)
        val_y_pred = clf.predict(val_x)
        # Evaluate the sub-model on the held-out fold
        acc_val = accuracy_score(val_y, val_y_pred)
        # Prints "sub-model k acc <value>"
        print('第{}个子模型acc{}'.format(k+1, acc_val))
        mean_acc += acc_val / 5
        # Predict on the test set with the sub-model
        test_y_pred = clf.predict_proba(test)[:, -1]  # soft voting: keep the positive-class probability
        prob.append(test_y_pred)
    print(mean_acc)
    mean_prob = sum(prob) / 5
    return mean_prob
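
The same fold-averaging idea can be tried quickly on synthetic data. The sketch below swaps in `LogisticRegression` and `make_classification` purely so it runs in seconds without a GPU; these are stand-ins, not the model or data used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 200 training rows and 100 "test" rows.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
fold_probs = []
for train_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)  # fresh model for every fold
    model.fit(X_train[train_idx], y_train[train_idx])
    # Positive-class probability on the test rows from this fold's model.
    fold_probs.append(model.predict_proba(X_test)[:, 1])

# Soft voting: average the probabilities across the five fold models.
mean_prob = np.mean(fold_probs, axis=0)
print(mean_prob.shape)
```

Averaging probabilities rather than hard labels lets folds that are confident outweigh folds that are near 0.5, which is the effect the `[:, -1]` column selection in `ensemble_model` relies on.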

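### Training and predicting with XGBoost

In [None]:
import xgboost as xgb
# tree_method='gpu_hist' needs a GPU-enabled XGBoost build.
# min_child_samples is a LightGBM parameter name; XGBoost does not use it
# (see the warning in the output below). The reported scores were obtained with it in place.
clf = xgb.XGBClassifier(
            max_depth=12, learning_rate=0.001, n_estimators=20000, 
            objective='binary:logistic', tree_method='gpu_hist', 
            subsample=0.8, colsample_bytree=0.7, 
            min_child_samples=3, eval_metric='auc', reg_lambda=0.5
        )
result = ensemble_model(clf, features, labels, test_fea)
result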



Parameters: { "min_child_samples" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


第1个子模型acc0.89041




第2个子模型acc0.89114




第3个子模型acc0.89041




第4个子模型acc0.8904




第5个子模型acc0.8909
0.890652


array([0.10067499, 0.75248444, 0.02351505, ..., 0.9336721 , 0.9753353 ,
       0.98109853], dtype=float32)

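### Saving the results in the submission format

In [None]:
# Save the results
a = pd.DataFrame(test1['sid'])
a['label'] = result
# Convert probabilities to binary labels with a 0.5 threshold
a['label'] = a['label'].apply(lambda x:0 if x<0.5 else 1)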
a.to_csv('xgb_0.001_20000.csv', index = False)
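
The thresholding step can also be done without a Python-level lambda. The sketch below uses made-up probabilities and `sid` values; `>= 0.5` matches the rule above, where a probability of exactly 0.5 maps to 1.

```python
import numpy as np
import pandas as pd

# Made-up averaged probabilities and ids, for illustration only.
prob = np.array([0.10, 0.75, 0.50, 0.49], dtype=np.float32)
sub = pd.DataFrame({"sid": [1440682, 1606824, 1774642, 1742535]})

# Vectorized comparison, then cast the booleans to 0/1.
sub["label"] = (prob >= 0.5).astype(int)
print(sub)
```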

In [None]:
a

Unnamed: 0,sid,label
0,1440682,0
1,1606824,1
2,1774642,0
3,1742535,0
4,1689686,1
...,...,...
149995,1165373,1
149996,1444115,1
149997,1134378,1
149998,1700238,1


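## 5 Takeaways & Acknowledgements

This entry does not use a deep learning framework; model iteration relies mainly on feature engineering plus XGBoost. With 2,000 boosting rounds the score reaches 88.8, and 20,000 rounds gives this solution's best score of 89.1413. PaddlePaddle could be used for further improvement. Some of the data cleaning and feature transformation steps follow a [trick](https://aistudio.baidu.com/aistudio/projectdetail/461026?channelType=0&channel=0) shared publicly by another project. Thanks to the developers who share their work openly, and to Baidu PaddlePaddle for the competition opportunity and compute support. Discussion is welcome.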