In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

In [2]:
# Build the test data
df, label = make_classification(
    n_samples=5000,
    n_features=300,
    n_informative=12,
    n_redundant=7,
    random_state=134985745,
)
df = pd.DataFrame(df, columns=[f"f{i}" for i in range(df.shape[1])])

print("df shape:", df.shape)
print("label: ", label)
df.head()

df shape: (5000, 300)
label:  [1 1 0 ... 1 1 0]


| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f290 | f291 | f292 | f293 | f294 | f295 | f296 | f297 | f298 | f299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.994557 | -1.136878 | 0.169768 | 0.768031 | 0.014296 | 0.524148 | 1.023236 | -0.172799 | 0.01204 | -0.310725 | ... | 1.074111 | -0.392414 | -0.380007 | 1.463322 | 0.95008 | -0.374694 | 0.024485 | -0.036063 | -0.910353 | -0.137185 |
| 1 | -0.601184 | -0.470369 | -1.054326 | 0.352207 | -0.431754 | -0.186422 | 1.362683 | -0.126976 | -2.522448 | 0.626738 | ... | 0.34633 | -2.16588 | 2.037716 | -0.715749 | -0.328964 | 0.091847 | -1.090315 | -0.632209 | 0.886497 | 0.914958 |
| 2 | -1.597741 | 0.777435 | 0.267835 | -1.18658 | -0.048285 | 0.638156 | 0.627173 | -0.367807 | -2.083738 | 1.277261 | ... | -2.057728 | 1.391847 | -0.760141 | -0.700443 | -3.04624 | 0.323482 | 1.074514 | -1.064665 | -1.457375 | 0.504237 |
| 3 | 0.909784 | 0.910874 | 0.435565 | -1.951355 | 0.577979 | -0.603668 | -0.209334 | 0.787032 | -2.518156 | -0.189924 | ... | 0.052121 | 0.144603 | 0.474985 | 0.024548 | 0.138589 | 1.339473 | 0.518763 | 0.312907 | 0.992612 | -1.121993 |
| 4 | 0.121554 | -1.045082 | -1.070315 | -0.441583 | -1.718309 | 0.138502 | -0.534873 | 2.212596 | 0.001896 | -1.008905 | ... | 1.267492 | 2.010374 | -1.055319 | 0.85212 | -2.115959 | 1.801926 | -0.32789 | -1.539215 | 0.214853 | 2.003376 |


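## Feature selection

The overall idea in this part is to generate feature subsets according to some strategy, evaluate each subset with cross-validation or a validation set, and finally pick the best subset.   
Four algorithms are currently implemented: random search, model-importance selection, model-importance top list, and LVW.
- Random search   
Each feature subset is a random selection of some of the features.

- Model-importance selection   
Similar to random search, except that features are drawn according to their importance: the more important a feature, the higher its probability of being selected.

- Model-importance top list   
Features are ranked by importance, and the top-ranked features given by the top list form the feature subset.

- LVW   
Las Vegas Wrapper. Each new feature subset is drawn from the current best subset, which amounts to continually narrowing the feature range. For the algorithm itself, see Zhou Zhihua, *Machine Learning* (《机器学习》), pp. 250-252.

The first three search algorithms work on much the same principle and differ only in how each feature subset is generated. Their functions take essentially the same parameters, so they are introduced first.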
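The shared loop behind these strategies can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the `model_helper` implementation: `evaluate_subset`, `wrapper_search`, and `propose_random` are hypothetical names, and the data set is a small synthetic one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def evaluate_subset(X, y, columns, k_fold=3):
    """Mean cross-validated accuracy of a fresh decision tree on the given columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=k_fold).mean()


def wrapper_search(X, y, propose, max_iter=5, k_fold=3):
    """Score each proposed subset and keep the best one seen so far."""
    all_columns = np.arange(X.shape[1])
    best_columns = all_columns
    best_score = evaluate_subset(X, y, all_columns, k_fold)  # baseline: all features
    for _ in range(max_iter):
        candidate = propose(all_columns)
        score = evaluate_subset(X, y, candidate, k_fold)
        if score > best_score:
            best_columns, best_score = candidate, score
    return best_columns, best_score


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
rng = np.random.default_rng(0)


def propose_random(columns):
    """Random-search strategy: draw half of the columns uniformly at random."""
    return np.sort(rng.choice(columns, size=len(columns) // 2, replace=False))


best_columns, best_score = wrapper_search(X, y, propose_random)
print(best_score, best_columns)
```

Swapping `propose_random` for an importance-weighted or top-ranked proposal function gives the other two strategies without touching the evaluation loop.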

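## 1. Random search

### 1.1 Method 1
Feature subsets are evaluated with cross-validation. The `sample` parameter controls the size of each feature subset. It defaults to None, in which case a random number is drawn and used as the subset size. A float must lie in (0, 1) and gives the fraction of features selected each time; a positive integer gives the exact number of features selected each time.
> The input data may also be a numpy.ndarray, in which case the returned subset contains the index of each feature.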
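The three accepted forms of `sample` can be pictured with a small helper. This is a sketch only: `resolve_sample_size` is a hypothetical function, and the rounding rule and the range used for `None` are assumptions rather than the library's documented behaviour.

```python
import numpy as np


def resolve_sample_size(sample, n_features, rng):
    """Translate a `sample` setting into a concrete number of features."""
    if sample is None:
        # Assumed behaviour: any size from 1 to n_features, chosen at random.
        return int(rng.integers(1, n_features + 1))
    if isinstance(sample, float):
        if not 0 < sample < 1:
            raise ValueError("a float `sample` must lie in (0, 1)")
        return max(1, int(round(n_features * sample)))
    if isinstance(sample, int) and sample > 0:
        return min(sample, n_features)
    raise ValueError("`sample` must be None, a float in (0, 1), or a positive int")


rng = np.random.default_rng(666)
print(resolve_sample_size(0.8, 300, rng))   # fraction of the features
print(resolve_sample_size(81, 300, rng))    # exact count
print(resolve_sample_size(None, 300, rng))  # random count
```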

In [3]:
from model_helper.feature_selection.wrapper import random_search

clf = DecisionTreeClassifier()
subset, effect = random_search(df, label, clf, k_fold=3, sample=0.8, max_iter=10, random_state=666)
print(effect, subset)

initialize effect 0.7664000213162648, cost time 5, with feat_dim 300, with param None
round 1/10 start...
round 1/10 end, effect subset is 0.7725987815642154, cost time 4, with feature dim is 240, with param None
round 2/10 start...
round 2/10 end, effect subset is 0.7722004218604059, cost time 5, with feature dim is 240, with param None
round 3/10 start...
round 3/10 end, effect subset is 0.7453988193957848, cost time 4, with feature dim is 240, with param None
round 4/10 start...
round 4/10 end, effect subset is 0.7648032218286236, cost time 3, with feature dim is 240, with param None
round 5/10 start...
round 5/10 end, effect subset is 0.7641997410841962, cost time 5, with feature dim is 240, with param None
round 6/10 start...
round 6/10 end, effect subset is 0.7711982213401257, cost time 4, with feature dim is 240, with param None
round 7/10 start...
round 7/10 end, effect subset is 0.7631963403237719, cost time 4, with feature dim is 240, with param None
round 8/10 start...
round

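### 1.2 Method 2
This is an upgraded version of Method 1: it lets you pass custom parameters through to `cross_val_score`. For the `cross_val_score` API, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html).

The example below sets two `cross_val_score` parameters, `scoring` and `n_jobs`.
> The `X`, `y`, and `cv` parameters of `cross_val_score` correspond to `train_x`, `train_y`, and `cv` in param_search, so they do not need to be defined again.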
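For reference, this is roughly what forwarding such a dict to `cross_val_score` looks like when scoring a single subset by hand. The column indices in `subset` are made up for illustration, and the built-in `"roc_auc"` scorer string is used in place of a custom callable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
subset = [0, 3, 5, 8, 13]  # hypothetical feature subset, given as column indices

# Extra keyword arguments, in the same shape as `cross_val_param`.
cross_val_param = {"scoring": "roc_auc", "n_jobs": None}

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X[:, subset], y, cv=3, **cross_val_param
)
effect = scores.mean()  # one score per fold, averaged into a single effect value
print(effect)
```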

In [4]:
from model_helper.feature_selection.wrapper import random_search


clf = DecisionTreeClassifier()
# Parameters passed to `cross_val_score`
cross_val_param = {"scoring": lambda clf, X, y: roc_auc_score(y_true=y, y_score=clf.predict_proba(X)[:, 1]),
                   "n_jobs": None}

subset, effect = random_search(df, label, clf, k_fold=3, sample=0.8, max_iter=10, random_state=666, cross_val_param=cross_val_param)
print(effect, subset)

initialize effect 0.7664016731110735, cost time 5, with feat_dim 300, with param None
round 1/10 start...
round 1/10 end, effect subset is 0.7725984269010402, cost time 4, with feature dim is 240, with param None
round 2/10 start...
round 2/10 end, effect subset is 0.7722006543774653, cost time 3, with feature dim is 240, with param None
round 3/10 start...
round 3/10 end, effect subset is 0.7453891492977888, cost time 3, with feature dim is 240, with param None
round 4/10 start...
round 4/10 end, effect subset is 0.7647955104364229, cost time 3, with feature dim is 240, with param None
round 5/10 start...
round 5/10 end, effect subset is 0.7642036744855613, cost time 4, with feature dim is 240, with param None
round 6/10 start...
round 6/10 end, effect subset is 0.7711966602443517, cost time 3, with feature dim is 240, with param None
round 7/10 start...
round 7/10 end, effect subset is 0.7631965656171517, cost time 3, with feature dim is 240, with param None
round 8/10 start...
round

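### 1.3 Method 3
Feature subsets are screened on a randomly generated validation set.
The two approaches above both evaluate feature subsets with cross-validation, which becomes slow when the data set is large. The approach below instead evaluates each subset on a randomly split validation set.   
Three parameters must be specified: `create_valid`, `valid_ratio`, and `metric_func`. `create_valid` indicates that a validation set should be split off from the input data, `valid_ratio` is the fraction of the data used for validation, and `metric_func` is the function that scores the validation result (its inputs are y_true and y_pred, and it should return a single value).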
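Evaluating one subset on a random hold-out split might look like the sketch below. `holdout_effect` is a hypothetical helper, and scoring with the positive-class probability is an assumption; this document does not say whether the library passes probabilities or hard labels to `metric_func`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_effect(X, y, columns, valid_ratio=0.2, metric_func=roc_auc_score, random_state=0):
    """Fit on a random training split and score the subset on the held-out part."""
    train_x, valid_x, train_y, valid_y = train_test_split(
        X[:, columns], y, test_size=valid_ratio, random_state=random_state
    )
    model = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
    # Assumption: the metric receives the probability of the positive class.
    return metric_func(valid_y, model.predict_proba(valid_x)[:, 1])


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
effect = holdout_effect(X, y, columns=[0, 3, 5, 8, 13])
print(effect)
```

Each subset costs a single fit here instead of `k_fold` fits, which is where the speed-up over cross-validation comes from.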

In [5]:
from model_helper.feature_selection.wrapper import random_search

clf = DecisionTreeClassifier()
subset, effect = random_search(df, label, clf, create_valid=True, valid_ratio=0.2, metric_func=roc_auc_score, sample=0.8, max_iter=10, random_state=666)
print(effect, subset)

initialize effect 0.7970273245921329, cost time 1, with feat_dim 300, with param None
round 1/10 start...
round 1/10 end, effect subset is 0.7835551996797118, cost time 1, with feature dim is 240, with param None
round 2/10 start...
round 2/10 end, effect subset is 0.8078170353317987, cost time 1, with feature dim is 240, with param None
round 3/10 start...
round 3/10 end, effect subset is 0.758862976679011, cost time 1, with feature dim is 240, with param None
round 4/10 start...
round 4/10 end, effect subset is 0.7792413171854669, cost time 1, with feature dim is 240, with param None
round 5/10 start...
round 5/10 end, effect subset is 0.7663296967270544, cost time 1, with feature dim is 240, with param None
round 6/10 start...
round 6/10 end, effect subset is 0.7920528475628066, cost time 1, with feature dim is 240, with param None
round 7/10 start...
round 7/10 end, effect subset is 0.7711840656590931, cost time 1, with feature dim is 240, with param None
round 8/10 start...
round 

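### 1.4 Method 4
A user-supplied validation set serves as the standard for evaluating feature subsets.   
Method 3 evaluates feature subsets on a randomly split validation set. When the training samples are time-ordered (that is, generated strictly in chronological order), this is inappropriate, because the model may be trained on future samples and then used to predict past ones.    
The approach below accepts a custom validation set. It requires the parameters `valid_x`, `valid_y`, and `metric_func`.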
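When the rows are not already in time order, a chronological split can be prepared before calling the search. The sketch below assumes a timestamp column named `event_time`; both that column and the `chronological_split` helper are hypothetical.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, labels, time_column, valid_ratio=0.2):
    """Use the most recent rows for validation and everything earlier for training."""
    order = np.argsort(frame[time_column].to_numpy(), kind="stable")
    n_valid = max(1, int(len(frame) * valid_ratio))
    split = len(frame) - n_valid
    train_idx, valid_idx = order[:split], order[split:]
    features = frame.drop(columns=[time_column])
    return (
        features.iloc[train_idx], labels[train_idx],
        features.iloc[valid_idx], labels[valid_idx],
    )


rng = np.random.default_rng(0)
frame = pd.DataFrame({
    # Ten daily timestamps in shuffled order.
    "event_time": pd.date_range("2024-01-01", periods=10, freq="D")[rng.permutation(10)],
    "f0": rng.normal(size=10),
})
labels = rng.integers(0, 2, size=10)

train_x, train_y, valid_x, valid_y = chronological_split(frame, labels, "event_time")
print(len(train_x), len(valid_x))
```

The resulting `train_x`, `train_y`, `valid_x`, and `valid_y` can then be passed in the same positions as the slices used in the cell below.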

In [6]:
from model_helper.feature_selection.wrapper import random_search

clf = DecisionTreeClassifier()
subset, effect = random_search(df[:-1000], label[:-1000], clf, valid_x=df[-1000:], valid_y=label[-1000:], metric_func=roc_auc_score, sample=None, max_iter=10, random_state=666)
print(effect, subset)

initialize effect 0.8003533710315792, cost time 1, with feat_dim 300, with param None
round 1/10 start...
round 1/10 end, effect subset is 0.745788961857539, cost time 0, with feature dim is 71, with param None
round 2/10 start...
round 2/10 end, effect subset is 0.6519195290520612, cost time 0, with feature dim is 115, with param None
round 3/10 start...
round 3/10 end, effect subset is 0.7617526883011377, cost time 2, with feature dim is 284, with param None
round 4/10 start...
round 4/10 end, effect subset is 0.6468070546144334, cost time 0, with feature dim is 106, with param None
round 5/10 start...
round 5/10 end, effect subset is 0.7778484786636732, cost time 1, with feature dim is 189, with param None
round 6/10 start...
round 6/10 end, effect subset is 0.7958131735760108, cost time 1, with feature dim is 228, with param None
round 7/10 start...
round 7/10 end, effect subset is 0.580643031227114, cost time 0, with feature dim is 69, with param None
round 8/10 start...
round 8/1

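### 1.5 Method 5
When feature subsets are evaluated on a validation set, this method offers a more advanced mode of use. The original motivation was to bring `early_stopping_rounds` into the feature selection process, so that the best number of training rounds is adjusted dynamically in each round.

This mode takes one additional parameter (`valid_set_param` in the example below). It is a dict that may contain the keys `model_fit_param`, `set_eval_set`, and `update_param_func`:
- `model_fit_param`  
A dict of the arguments passed to the model's `fit` method during training.
- `set_eval_set`   
Whether to pass a validation set during training. If True, the argument `eval_set = [(valid_x, valid_y)]` is added to `fit`; otherwise it is not. This is usually required when using early_stopping_rounds.
- `update_param_func`   
A function that updates the parameters. It takes the model and param as input and returns param. Enable it when early_stopping_rounds is used and the best training round needs to be returned.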
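How the three keys could fit together during one evaluation is sketched below. This is an illustration built only from the descriptions above, not the library's code: `fit_with_valid_set`, `StubModel`, and `keep_first_best_iteration` are hypothetical, and the stub stands in for an estimator with early stopping so the example runs without LightGBM.

```python
def fit_with_valid_set(model, train_x, train_y, valid_x, valid_y, valid_set_param, param=None):
    """Hypothetical illustration of applying the three keys around one `fit` call."""
    fit_kwargs = dict(valid_set_param.get("model_fit_param", {}))
    if valid_set_param.get("set_eval_set", False):
        # Hand the validation data to `fit`, as early stopping needs it.
        fit_kwargs["eval_set"] = [(valid_x, valid_y)]
    model.fit(train_x, train_y, **fit_kwargs)
    update = valid_set_param.get("update_param_func")
    if update is not None:
        param = update(model, param)
    return model, param


class StubModel:
    """Stand-in for an estimator with early stopping; it only records its inputs."""

    best_iteration_ = 51

    def fit(self, X, y, **kwargs):
        self.fit_kwargs = kwargs
        return self


def keep_first_best_iteration(model, param):
    """Record the best iteration once, then leave `param` unchanged."""
    if param is None and model.best_iteration_ is not None:
        return {"best_iteration": model.best_iteration_}
    return param


valid_set_param = {
    "model_fit_param": {"eval_metric": "auc"},
    "set_eval_set": True,
    "update_param_func": keep_first_best_iteration,
}
model, param = fit_with_valid_set(
    StubModel(), [[0.0], [1.0]], [0, 1], [[0.5]], [1], valid_set_param
)
print(param, sorted(model.fit_kwargs))
```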


In [7]:
from model_helper.feature_selection.wrapper import random_search

clf = LGBMClassifier()

def _update(model, param):
    if param is None and model.best_iteration_ is not None:
        return {"best_iteration": model.best_iteration_}
    elif param is not None:
        return param
    else:
        return None
valid_set_param = {"model_fit_param": {"eval_metric": "auc", "verbose": False, "early_stopping_rounds": 5},
                   "set_eval_set": True,
                   "update_param_func": _update}

subset, effect, param = random_search(df, label, clf, sample=81, random_state=666,
                                      create_valid=True, valid_ratio=0.2, valid_set_param=valid_set_param,
                                      metric_func=roc_auc_score, max_iter=10)
print(param, effect, subset)

initialize effect 0.9689800820738664, cost time 3, with feat_dim 300, with param {'best_iteration': 51}
round 1/10 start...
round 1/10 end, effect subset is 0.8561004904413972, cost time 0, with feature dim is 81, with param {'best_iteration': 16}
round 2/10 start...
round 2/10 end, effect subset is 0.9007186467821039, cost time 0, with feature dim is 81, with param {'best_iteration': 29}
round 3/10 start...
round 3/10 end, effect subset is 0.6921169052146932, cost time 0, with feature dim is 81, with param {'best_iteration': 6}
round 4/10 start...
round 4/10 end, effect subset is 0.8434671204083676, cost time 0, with feature dim is 81, with param {'best_iteration': 22}
round 5/10 start...
round 5/10 end, effect subset is 0.8152677409668702, cost time 0, with feature dim is 81, with param {'best_iteration': 18}
round 6/10 start...
round 6/10 end, effect subset is 0.937445701131018, cost time 0, with feature dim is 81, with param {'best_iteration': 30}
round 7/10 start...
round 7/10 end

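### 1.6 Method 6
All of the approaches above run on a single machine and evaluate the candidates one after another. If memory allows, multiprocessing can be enabled. The example below uses cross-validation as the criterion for feature subsets; with validation-set evaluation it is used in exactly the same way.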
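Because random-search candidates are independent of one another, they can be scored in parallel worker processes. The sketch below uses joblib for this, which is an assumption made for illustration; this document does not say how `enable_multiprocess` is implemented. `score_subset` is a hypothetical helper.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def score_subset(X, y, columns, k_fold=3):
    """Mean cross-validated accuracy of a decision tree on the given columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=k_fold).mean()


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
rng = np.random.default_rng(666)
# Draw all candidate subsets up front so the result does not depend on scheduling.
candidates = [np.sort(rng.choice(20, size=16, replace=False)) for _ in range(8)]

# Each candidate is scored in one of two worker processes.
scores = Parallel(n_jobs=2)(delayed(score_subset)(X, y, cols) for cols in candidates)
best = int(np.argmax(scores))
print(scores[best], candidates[best])
```

Every worker receives its own copy of the data, which is why available memory is the limiting factor.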

In [8]:
from model_helper.feature_selection.wrapper import random_search


# Multiprocessing disabled: measure the elapsed time
s = datetime.now()
clf = DecisionTreeClassifier()
subset, effect = random_search(df, label, clf, k_fold=3, sample=0.9, max_iter=100, random_state=666, verbose=False)
e = datetime.now()
print(effect, subset)
print(f"do not use multiprocess cost time {(e - s).total_seconds():.0f}")

# Multiprocessing enabled: measure the elapsed time
s = datetime.now()
clf = DecisionTreeClassifier()
subset, effect = random_search(df, label, clf, k_fold=3, sample=0.9, max_iter=100,
                               enable_multiprocess=True, n_jobs=2, random_state=666, verbose=False)
e = datetime.now()
print(effect, subset)
print(f"use multiprocess cost time {(e - s).total_seconds():.0f}")

0.7864010631247099 ['f176', 'f131', 'f278', 'f106', 'f59', 'f81', 'f12', 'f77', 'f219', 'f128', 'f230', 'f216', 'f277', 'f250', 'f287', 'f122', 'f268', 'f108', 'f209', 'f152', 'f133', 'f125', 'f103', 'f32', 'f180', 'f3', 'f284', 'f161', 'f147', 'f226', 'f89', 'f15', 'f227', 'f291', 'f123', 'f177', 'f225', 'f11', 'f159', 'f151', 'f191', 'f48', 'f61', 'f14', 'f110', 'f85', 'f137', 'f214', 'f155', 'f68', 'f66', 'f247', 'f16', 'f211', 'f114', 'f141', 'f127', 'f295', 'f189', 'f50', 'f296', 'f23', 'f251', 'f67', 'f239', 'f279', 'f22', 'f273', 'f150', 'f290', 'f259', 'f91', 'f275', 'f157', 'f39', 'f289', 'f27', 'f179', 'f236', 'f202', 'f37', 'f84', 'f156', 'f224', 'f121', 'f70', 'f129', 'f112', 'f217', 'f34', 'f149', 'f25', 'f169', 'f270', 'f282', 'f113', 'f79', 'f132', 'f38', 'f174', 'f288', 'f135', 'f257', 'f292', 'f283', 'f175', 'f201', 'f45', 'f107', 'f2', 'f1', 'f104', 'f192', 'f263', 'f249', 'f69', 'f252', 'f148', 'f145', 'f213', 'f100', 'f98', 'f158', 'f281', 'f160', 'f154', 'f232', 'f

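## 2. Model-importance selection
This method first trains a model from the estimator you pass in; the fitted model must expose a `feature_importances_` attribute. The importances are normalized into a probability distribution with `softmax`, and every feature subset is then drawn according to that distribution. This amounts to adding prior knowledge: features with higher importance should be selected with higher probability.   
The function takes exactly the same parameters as random search above; only the function name differs, so the details are not repeated here.

### 2.1 Method 1
Feature subsets are evaluated with cross-validation. For the remaining parameters, see the **Random search** section.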
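The importance-weighted draw can be sketched with NumPy. This is a minimal illustration on synthetic data; the subset size of 16 and the local `softmax` helper are assumptions, not the library's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def softmax(values):
    """Turn arbitrary scores into a probability distribution."""
    shifted = values - values.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
importances = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_
probabilities = softmax(importances)

rng = np.random.default_rng(667)
# Draw 16 distinct columns; more important columns are more likely to be picked.
subset = np.sort(rng.choice(X.shape[1], size=16, replace=False, p=probabilities))
print(subset)
```

Tree importances lie in [0, 1], so after softmax the most likely feature is at most e (about 2.7) times as likely as the least likely one. The prior is therefore a gentle bias rather than a hard filter.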

In [9]:
from model_helper.feature_selection.wrapper import weight_search

clf = DecisionTreeClassifier()
subset, effect = weight_search(df, label, clf, k_fold=3, sample=0.8, random_state=667)
print(effect, subset)

initialize effect 0.7702002216603497, cost time 4, with feat_dim 300, with param None
round 1/10 start...
round 1/10 end, effect subset is 0.7695979411560665, cost time 3, with feature dim is 240, with param None
round 2/10 start...
round 2/10 end, effect subset is 0.7667975408519737, cost time 4, with feature dim is 240, with param None
round 3/10 start...
round 3/10 end, effect subset is 0.7615997808841594, cost time 3, with feature dim is 240, with param None
round 4/10 start...
round 4/10 end, effect subset is 0.7712000217003419, cost time 3, with feature dim is 240, with param None
round 5/10 start...
round 5/10 end, effect subset is 0.7865988627004491, cost time 4, with feature dim is 240, with param None
round 6/10 start...
round 6/10 end, effect subset is 0.7536015005882377, cost time 3, with feature dim is 240, with param None
round 7/10 start...
round 7/10 end, effect subset is 0.7267998981236166, cost time 3, with feature dim is 240, with param None
round 8/10 start...
round

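## 3. Model-importance top list

As with the method above, a model is first pre-trained to estimate feature importance, and the top-ranked features for each entry of the top list are then used as candidate feature subsets.  
The parameters are essentially the same as above, except that the top list must be specified (`top_ratio_list` in the example below).

### 3.1 Method 1
Feature subsets are evaluated with cross-validation. For the remaining parameters, see the **Random search** section.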
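Building the candidates from a list of ratios might look like the sketch below. The `top_subsets` helper and its rounding rule are assumptions made for illustration; only the ratio values are taken from the example in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def top_subsets(importances, top_ratio_list):
    """For each ratio, keep that fraction of the features, most important first."""
    order = np.argsort(importances)[::-1]  # column indices, descending importance
    subsets = []
    for ratio in top_ratio_list:
        k = max(1, int(round(len(importances) * ratio)))
        subsets.append(np.sort(order[:k]))
    return subsets


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
importances = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_

candidates = top_subsets(importances, [0.95, 0.9, 0.85, 0.8, 0.75, 0.7])
print([len(columns) for columns in candidates])
```

Each candidate is then scored like any other subset, so the number of rounds equals the length of the ratio list.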

In [10]:
from model_helper.feature_selection.wrapper import top_feat_search


clf = DecisionTreeClassifier()
subset, effect = top_feat_search(df, label, top_ratio_list=[0.95, 0.9, 0.85, 0.8, 0.75, 0.7],
                                   model=clf, k_fold=3, random_state=666)
print(effect, subset)

initialize effect 0.7664000213162648, cost time 4, with feat_dim 300, with param None
round 1/6 start...
round 1/6 end, effect subset is 0.7771977417241641, cost time 4, with feature dim is 285, with param None
round 2/6 start...
round 2/6 end, effect subset is 0.778797901884209, cost time 4, with feature dim is 270, with param None
round 3/6 start...
round 3/6 end, effect subset is 0.7805979020282378, cost time 3, with feature dim is 255, with param None
round 4/6 start...
round 4/6 end, effect subset is 0.7769988619322955, cost time 3, with feature dim is 240, with param None
round 5/6 start...
round 5/6 end, effect subset is 0.7789973817961497, cost time 3, with feature dim is 225, with param None
round 6/6 start...
round 6/6 end, effect subset is 0.7743998619243738, cost time 3, with feature dim is 210, with param None
all round end.
best effect is 0.7805979020282378, best feature dim 255, input feat dim is 300
0.7805979020282378 ['f231', 'f107', 'f100', 'f145', 'f29', 'f294', 'f17

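## 4. LVW

LVW (Las Vegas Wrapper) draws each new feature subset from the current best subset, so it is inherently serial and no multiprocess version is implemented. Its parameters are used in the same way as in **Random search**.

### 4.1 Method 1
Feature subsets are evaluated with cross-validation.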
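The serial narrowing can be sketched as follows. This is a simplified variant written from the description above, with a fixed iteration budget as the stopping rule; `lvw_search` and `score_subset` are hypothetical names, and sizing each draw relative to the current best subset is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def score_subset(X, y, columns, k_fold=3):
    """Mean cross-validated accuracy of a decision tree on the given columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=k_fold).mean()


def lvw_search(X, y, sample=0.8, max_iter=10, k_fold=3, seed=667):
    """Repeatedly sample from the current best subset and keep improvements."""
    rng = np.random.default_rng(seed)
    best_columns = np.arange(X.shape[1])
    best_score = score_subset(X, y, best_columns, k_fold)
    for _ in range(max_iter):
        size = max(1, int(round(len(best_columns) * sample)))
        candidate = np.sort(rng.choice(best_columns, size=size, replace=False))
        candidate_score = score_subset(X, y, candidate, k_fold)
        # Accept a better score, or an equal score achieved with fewer features.
        fewer = len(candidate) < len(best_columns)
        if candidate_score > best_score or (candidate_score == best_score and fewer):
            best_columns, best_score = candidate, candidate_score
    return best_columns, best_score


X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
best_columns, best_score = lvw_search(X, y)
print(best_score, best_columns)
```

Each round depends on the result of the previous one, which is why the rounds cannot be spread across processes the way random-search candidates can.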

In [11]:
from model_helper.feature_selection.wrapper import lvw

clf = DecisionTreeClassifier()

subset, effect = lvw(df, label, clf, k_fold=3, sample=0.8, max_iter=40, random_state=667)
print(effect, subset)

initialize effect 0.7702002216603497, cost time 4, with feat_dim 300, with param None
round 1 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 2 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 3 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 4 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 5 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 6 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 7 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 8 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 9 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 10 start...
effect subset is 0.7671971007959272, feature dim is 240, cost 3 seconds.
round 11 star

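## 5. Distributed feature selection

The distributed version implements three of the approaches above (random search, model-importance selection, and model-importance top list) and combines them in a single method. LVW runs serially, so it is not implemented here.   

Usage is essentially the same as above, except that a Spark session must be passed in. Because no distributed environment is set up locally, no example is run here; for usage, see `../tests/test_distribute_feature_select.py`.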