### 1. Background notes

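* There are three ways to split a dataset: hold-out, cross-validation and bootstrapping (k-fold cross-validation is the usual choice; are the other two rarely used?)
  1. When there is plenty of data, hold-out or k-fold cross-validation is normally used to split it into training and test sets
  2. When the dataset is small and hard to split effectively into training and test sets, use bootstrapping
  3. When the dataset is small but can still be split effectively, leave-one-out is preferable: almost all of the data is used for training in every round, so its estimate is usually the most accurate, at the cost of one model fit per sample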
  
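* Studied how XGBoost works and tuned it with xgb.cv on a small dataset. The competition data is fairly large, so it was only run once with the baseline parameters.
  >Study notes on how XGBoost works: [XGBoost原理简述](https://nekomoon404.github.io/2020/09/22/XGBoost%E5%8E%9F%E7%90%86%E7%AE%80%E8%BF%B0/)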

* Studied how LightGBM improves on XGBoost, and tuned lgb step by step with grid search.
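
The three splitting strategies from the first note can be illustrated with index lists alone. This is a minimal standard-library sketch, not how scikit-learn implements them; the function names `holdout_split`, `kfold_splits` and `bootstrap_split` are made up for this illustration. Leave-one-out is the special case of k-fold with `k` equal to the number of samples.

```python
import random

def holdout_split(n, test_ratio=0.2, seed=0):
    """Hold-out: shuffle once and keep the last part as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_ratio)
    return idx[:n - n_test], idx[n - n_test:]

def kfold_splits(n, k=5, seed=0):
    """k-fold: every sample lands in the validation fold exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = parts[i]
        train = [j for part in parts[:i] + parts[i + 1:] for j in part]
        yield train, valid

def bootstrap_split(n, seed=0):
    """Bootstrap: draw n indices with replacement; the indices that were
    never drawn (out-of-bag) form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [j for j in range(n) if j not in drawn]
    return train, test

train, test = holdout_split(100)
folds = list(kfold_splits(100, k=5))
boot_train, boot_test = bootstrap_split(100)
print(len(train), len(test))
print([len(valid) for _, valid in folds])
print(len(boot_train), len(boot_test))
```

The bootstrap training set keeps the original size but contains duplicates, which is why it is mostly of interest when the data is too small to give up a separate test portion.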

### 2. Modeling

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
from time import time

In [25]:
# reduce_mem_usage shrinks the memory a DataFrame occupies by downcasting column dtypes
def reduce_mem_usage(df):
    # memory_usage() returns bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type that holds the column's range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # same idea for floats; note that float16 also loses precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns become categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [26]:
# load the data produced by the feature-engineering step
data_train = pd.read_csv('data_train_for_model-09-21.csv')
data_train = reduce_mem_usage(data_train)

data_test = pd.read_csv('data_test_for_model-09-21.csv')
data_test = reduce_mem_usage(data_test)

Memory usage of dataframe is 683.59 MB
Memory usage after optimization is: 122.07 MB
Decreased by 82.1%
Memory usage of dataframe is 169.37 MB
Memory usage after optimization is: 30.33 MB
Decreased by 82.1%


In [27]:
# prepare the training / test data
features = [f for f in data_train.columns if f not in ['id','isDefault']]

x_train = data_train[features]
x_test = data_test[features]

y_train = data_train['isDefault']

In [30]:
# handle class imbalance (left disabled)
#from imblearn.combine import SMOTETomek    
#smote_tomek = SMOTETomek(random_state=0)
#x_train, y_train = smote_tomek.fit_resample(x_train, y_train)

In [31]:
# 5-fold cross-validation
folds = 5
seed = 0
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

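Modeling with LightGBM

>For how to tune LightGBM, see: [LightGBM调参方法（具体操作）](https://www.cnblogs.com/bjwu/p/9307344.html)

In [None]:
# Model with LightGBM, starting from the parameters provided in the baseline. The training results show that the model overfits, so there is still room for tuning
# model training: lgb

1. First use the lightgbm library's native .cv to settle the number of boosting iterations, num_boost_round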

In [18]:
params = {'objective' : 'binary',
         #is_unbalance = True,
         'metric' : 'auc',
         'max_depth' : 6,
         'num_leaves' : 40,
         'learning_rate' : 0.1,
         'feature_fraction' : 0.8,
         'bagging_fraction' : 0.8,
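         'min_child_samples' : 21,     # alias of min_data_in_leaf; a larger value keeps the tree from growing too deep, but may cause underfitting
         'min_child_weight': 0.001,    # alias of min_sum_hessian_in_leaf: the minimum sum of Hessians (second derivatives of the loss) a leaf must hold for a split to be allowed
         
         'bagging_freq': 2,   # bagging frequency: k means the rows are re-sampled every k iterations (0 disables bagging)
         'reg_alpha': 0,      # L1 regularization
         'reg_lambda' : 0      # L2 regularization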
         }

lgb_data_train = lgb.Dataset(x_train, y_train, silent=True)

time0 = time()
cv_results = lgb.cv(
    params, lgb_data_train, num_boost_round=1200, nfold=5, stratified=False, shuffle=True, metrics='auc',
    early_stopping_rounds=100, verbose_eval=50, show_stdv=True, seed=0)
print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

print('best n_estimators:', len(cv_results['auc-mean']))
print('best cv score:', cv_results['auc-mean'][-1])

[50]	cv_agg's auc: 0.727105 + 0.0014403
[100]	cv_agg's auc: 0.730844 + 0.00148981
[150]	cv_agg's auc: 0.73261 + 0.00145055
[200]	cv_agg's auc: 0.733577 + 0.00136686
[250]	cv_agg's auc: 0.734204 + 0.00134081
[300]	cv_agg's auc: 0.734521 + 0.00128866
[350]	cv_agg's auc: 0.734812 + 0.00130706
[400]	cv_agg's auc: 0.73493 + 0.00136635
[450]	cv_agg's auc: 0.73491 + 0.00139679
[500]	cv_agg's auc: 0.734941 + 0.00145647
[550]	cv_agg's auc: 0.734876 + 0.001415
01:36:888110
best n_estimators: 490
best cv score: 0.7349676565422735


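lgb.cv puts the number of iterations at 490.

Next, grid search with sklearn's GridSearchCV is used to tune the other parameters, starting with the ones that affect the model most. The order followed below is: max_depth and num_leaves, then min_child_samples and min_child_weight, then feature_fraction, then bagging_fraction and bagging_freq, then reg_alpha and reg_lambda, then min_split_gain, and finally a lower learning_rate.

Note that the best values found at each step have to be carried into the next step by hand. Grid search is really time-consuming; what tuning tricks are there?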
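
One reason for tuning in stages rather than all at once: the cost of a grid search is the product of the grid sizes times the number of folds. The sketch below counts model fits for grids taken from the steps in this section; `n_fits` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
def n_fits(param_grid, n_folds=5):
    """Number of model fits a full grid search performs,
    not counting the final refit on the whole training set."""
    n_combos = 1
    for values in param_grid.values():
        n_combos *= len(list(values))
    return n_combos * n_folds

coarse = {'num_leaves': range(10, 80, 5), 'max_depth': range(3, 10, 2)}
fine = {'num_leaves': range(30, 40, 1), 'max_depth': range(3, 7, 1)}
min_child = {'min_child_samples': [18, 19, 20, 21, 22, 23],
             'min_child_weight': [0.001, 0.002]}

print('coarse tree-shape search:', n_fits(coarse))
print('fine tree-shape search:', n_fits(fine))

# Two stages run one after the other versus one joint grid over all four parameters
staged = n_fits(coarse) + n_fits(min_child)
joint = n_fits({**coarse, **min_child})
print('staged:', staged, 'joint:', joint)
```

Tuning in stages trades some optimality (parameters tuned earlier are not revisited) for roughly a tenfold reduction in fits in this example.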

In [6]:
"""Find the best parameters with grid search"""
from sklearn.model_selection import GridSearchCV

def get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=40, max_depth=6, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=21, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # set up stratified 5-fold cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
    
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   
                                   min_child_samples=min_child_samples,
                                   min_child_weight=min_child_weight,
                                   
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   
                                   min_split_gain=min_split_gain,
                                   
                                   n_jobs= -1,
                                  )
    grid_search = GridSearchCV(estimator=model_lgb, 
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                              )
    grid_search.fit(x_train, y_train)

    print('模型当前最优参数为:{}'.format(grid_search.best_params_))   # label reads "current best parameters"
    print('模型当前最优得分为:{}'.format(grid_search.best_score_))    # label reads "current best score"

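2. Tune max_depth and num_leaves, with a coarse search first and a fine search afterwards

* max_depth: the maximum tree depth; the deeper the tree, the more likely it is to overfit
* num_leaves: because LightGBM grows trees leaf-wise, tree complexity is controlled mainly through num_leaves rather than max_depth. A tree of depth max_depth has at most 2^(max_depth) leaves, and num_leaves should be set below that bound, otherwise the model may overfit.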
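
The relationship between the two parameters can be checked before launching a search. The sketch below, with the hypothetical helpers `max_leaves` and `binding_combinations`, keeps only the grid points where num_leaves does not exceed what the depth limit allows; for the other points the depth limit, not num_leaves, is what constrains the tree.

```python
def max_leaves(max_depth):
    """A binary tree of depth d has at most 2**d leaves."""
    return 2 ** max_depth

def binding_combinations(num_leaves_values, max_depth_values):
    """Keep (max_depth, num_leaves) pairs where num_leaves
    is within the bound implied by max_depth."""
    return [(d, n)
            for d in max_depth_values
            for n in num_leaves_values
            if n <= max_leaves(d)]

# The coarse grid used in this step
pairs = binding_combinations(range(10, 80, 5), range(3, 10, 2))
print(len(pairs), 'of', 14 * 4, 'grid points have num_leaves <= 2**max_depth')
print('bound for max_depth=5:', max_leaves(5))
```

With max_depth=3 every num_leaves value in this grid exceeds the bound of 8, so those grid points probably all describe the same model and could be dropped to save time.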


In [22]:
lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3,10,2)}

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=None, max_depth=None, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=21, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'max_depth': 5, 'num_leaves': 35}
模型当前最优得分为:0.7354909075787674
time:  17:32:796533
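
The timing line above formats elapsed seconds through `datetime.fromtimestamp(...).strftime("%M:%S:%f")`, which prints only the minute, second and microsecond fields, so a run longer than an hour would show wrapped minutes. A small alternative, assuming a hypothetical helper name `format_elapsed`, is to go through `datetime.timedelta`:

```python
import datetime

def format_elapsed(seconds):
    """Format a duration given in seconds as H:MM:SS[.ffffff], keeping whole hours."""
    return str(datetime.timedelta(seconds=seconds))

print(format_elapsed(96.5))      # a bit over a minute and a half
print(format_elapsed(4652.5))    # more than an hour
```

In the cells of this section it would be used as `print('time: ', format_elapsed(time() - time0))`.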


Fine-tune max_depth and num_leaves further

In [23]:
lgb_params = {'num_leaves': range(30, 40, 1), 'max_depth': range(3,7,1)}

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=None, max_depth=None, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=21, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'max_depth': 5, 'num_leaves': 31}
模型当前最优得分为:0.7355489458951782
time:  52:46:757414


3. Tune min_child_samples and min_child_weight

In [24]:
lgb_params={
    'min_child_samples': [18, 19, 20, 21, 22, 23],    # range copied from how others set it; not sure why
    'min_child_weight':[0.001, 0.002]
}

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=31, max_depth=5, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=None, min_child_weight=None, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'min_child_samples': 23, 'min_child_weight': 0.001}
模型当前最优得分为:0.7356996328240119
time:  16:59:609068


4. Tune feature_fraction

In [25]:
lgb_params = {#'bagging_fraction': [i/10 for i in range(5,10,1)], 
              'feature_fraction': [i/10 for i in range(5,10,1)],
              #'bagging_freq': range(0,81,10)
             }

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=31, max_depth=5, bagging_fraction=0.8, 
                       feature_fraction=None, bagging_freq=2, min_child_samples=23, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'feature_fraction': 0.8}
模型当前最优得分为:0.7356996328240119
time:  06:54:594004


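5. Tune bagging_fraction and bagging_freq

These two parameters must be set together. bagging_fraction is equivalent to subsample (row sampling); it makes each iteration faster and also reduces overfitting. bagging_freq defaults to 0 and is the bagging frequency: 0 means bagging is not used, and k means bagging is performed every k iterations.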
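
How the two parameters interact can be illustrated with a toy schedule: a new row subset is drawn every `bagging_freq` iterations and reused in between. This is only a standard-library sketch of the semantics described above, not LightGBM's actual sampling code; `bagging_schedule` is a made-up name.

```python
import random

def bagging_schedule(n_rows, n_iters, bagging_fraction, bagging_freq, seed=0):
    """Return the list of row indices used at each boosting iteration."""
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current = all_rows
    subsets = []
    # Bagging is only active when both parameters ask for it
    use_bagging = bagging_freq > 0 and bagging_fraction < 1.0
    for it in range(n_iters):
        if use_bagging and it % bagging_freq == 0:
            # Draw a fresh subset without replacement
            current = sorted(rng.sample(all_rows, int(n_rows * bagging_fraction)))
        subsets.append(current)
    return subsets

subsets = bagging_schedule(n_rows=10, n_iters=4, bagging_fraction=0.8, bagging_freq=2)
for it, rows in enumerate(subsets):
    print(it, rows)

# With bagging_freq=0 the fraction is ignored and every row is used
no_bagging = bagging_schedule(n_rows=10, n_iters=4, bagging_fraction=0.8, bagging_freq=0)
```

This is why setting only bagging_fraction while leaving bagging_freq at 0 has no effect.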

In [26]:
lgb_params = {'bagging_fraction': [i/10 for i in range(5,10,1)], 
              #'feature_fraction': [i/10 for i in range(5,10,1)],
              'bagging_freq': range(2,6,1)    # not sure about this range
             }

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=31, max_depth=5, bagging_fraction=None, 
                       feature_fraction=0.8, bagging_freq=None, min_child_samples=23, min_child_weight=0.001, 
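                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))
# these last two rounds of searching changed nothing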

模型当前最优参数为:{'bagging_fraction': 0.8, 'bagging_freq': 2}
模型当前最优得分为:0.7356996328240119
time:  24:11:655037


6. Tune the L1 regularization coefficient reg_alpha and the L2 regularization coefficient reg_lambda

In [28]:
lgb_params = {
    'reg_alpha': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5],
    'reg_lambda': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]
}
time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=31, max_depth=5, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=23, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)     # the parameters being tuned are passed as None; GridSearchCV sets them from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'reg_alpha': 0.5, 'reg_lambda': 0.3}
模型当前最优得分为:0.7357520426784678
time:  07:44:422930


7. Tune min_split_gain

In [7]:
lgb_params = {'min_split_gain': [i/10 for i in range(0,11,1)]}

time0 = time()
get_best_cv_params(learning_rate=0.1, n_estimators=490, num_leaves=31, max_depth=5, bagging_fraction=0.8, 
                       feature_fraction=0.8, bagging_freq=2, min_child_samples=23, min_child_weight=0.001, 
                       min_split_gain=None, reg_lambda=0.3, reg_alpha=0.5, param_grid=lgb_params)     # the parameter being tuned is passed as None; GridSearchCV sets it from param_grid
print('time: ', datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

模型当前最优参数为:{'min_split_gain': 0.0}
模型当前最优得分为:0.7357520426784678
time:  18:51:284698


8. Lower learning_rate

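A fairly high learning rate was used so far because it makes convergence faster. As the last step, a lower learning rate together with more trees (n_estimators) is tried to see whether the score can be improved further. This can go back to LightGBM's cv function, plugging in the parameters tuned above.

Lowering learning_rate has a clear effect: the best AUC from the previous tuning step was 0.735752, and this run reaches that value by about iteration 7300. After iteration 12650, each further gain is basically below 0.000005.

The AUC keeps increasing. Why?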
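
One likely explanation: with a small learning rate each new tree moves the predictions only slightly, so the validation AUC keeps improving in tiny increments, and early stopping only fires when there has been no improvement at all for the configured number of rounds. The sketch below mimics that rule on a plain list of scores. `best_iteration` and its `min_delta` threshold are hypothetical names for this illustration, not LightGBM's implementation.

```python
def best_iteration(scores, patience, min_delta=0.0):
    """Scan a higher-is-better metric and apply a patience rule.

    Returns (index of the best score, whether training stopped early).
    An improvement only counts if it beats the best score by more than min_delta.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx] + min_delta:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, True
    return best_idx, False

# A metric that peaks and then degrades: stopping triggers
peaked = [0.70, 0.72, 0.73, 0.729, 0.728, 0.727]
print(best_iteration(peaked, patience=3))

# A metric that creeps up by 1e-6 per round: stopping never triggers ...
creeping = [0.70 + 1e-6 * i for i in range(50)]
print(best_iteration(creeping, patience=3))

# ... unless improvements below a threshold are ignored
print(best_iteration(creeping, patience=3, min_delta=1e-4))
```

Under this reading, the late gains of under 0.000005 keep resetting the patience counter, so capping the number of rounds or requiring a minimum improvement would be reasonable ways to end training sooner.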

In [None]:
''' 
params = {'objective' : 'binary',
         'metric' : 'auc',
         'max_depth' : 5,
         'num_leaves' : 31,
          
         'learning_rate' : 0.005,
          
         'feature_fraction' : 0.8,
         'bagging_fraction' : 0.8,
         'bagging_freq': 2, 
          
         'min_child_samples' : 23,     
         'min_child_weight': 0.001,    
         
         'reg_alpha': 0.5,      
         'reg_lambda' : 0.3,     
         'min_split_gain' : 0.0,
         'n_jobs' : -1
         }

lgb_data_train = lgb.Dataset(x_train, y_train, silent=True)

time0 = time()
cv_results = lgb.cv(
    params, lgb_data_train, num_boost_round=50000, nfold=5, stratified=False, shuffle=True, metrics='auc',
    early_stopping_rounds=200, verbose_eval=200, show_stdv=True, seed=0)
print(datetime.datetime.fromtimestamp(time()-time0).strftime("%M:%S:%f"))

print('best n_estimators:', len(cv_results['auc-mean']))
print('best cv score:', cv_results['auc-mean'][-1])
'''

Once the parameters are settled, should the final model be trained on all of the training data?

In [6]:
test = np.zeros(x_test.shape[0])
test_pred = np.zeros(x_test.shape[0])

cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(x_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
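    # positional indexing for both features and labels, so this also works if the index is not 0..n-1
    X_train_split, y_train_split = x_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = x_train.iloc[valid_index], y_train.iloc[valid_index]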
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    #test_matrix = xgb.DMatrix(x_test)

    params = {'objective' : 'binary',
         'metric' : 'auc',
         'max_depth' : 5,
         'num_leaves' : 31,
          
         'learning_rate' : 0.01,    #0.005
          
         'feature_fraction' : 0.8,
         'bagging_fraction' : 0.8,
         'bagging_freq': 2, 
          
         'min_child_samples' : 23,     
         'min_child_weight': 0.001,    
         
         'reg_alpha': 0.5,      
         'reg_lambda' : 0.3,     
         'min_split_gain' : 0.0,
         'n_jobs' : -1
         }
    #watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
    
    model = lgb.train(params, train_matrix, 25000, valid_sets=[train_matrix, valid_matrix], verbose_eval=100, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    test_pred += model.predict(x_test, num_iteration=model.best_iteration)
    
    #train[valid_index] = val_pred
    #test = test_pred / kf.n_splits
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))

# still strongly overfitting

************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.729581	valid_1's auc: 0.727319
[1000]	training's auc: 0.736593	valid_1's auc: 0.731597
[1500]	training's auc: 0.741263	valid_1's auc: 0.733631
[2000]	training's auc: 0.745084	valid_1's auc: 0.734973
[2500]	training's auc: 0.748338	valid_1's auc: 0.735933
[3000]	training's auc: 0.751314	valid_1's auc: 0.736646
[3500]	training's auc: 0.754162	valid_1's auc: 0.73718
[4000]	training's auc: 0.756799	valid_1's auc: 0.737558
[4500]	training's auc: 0.759248	valid_1's auc: 0.737894
[5000]	training's auc: 0.761683	valid_1's auc: 0.738199
[5500]	training's auc: 0.763988	valid_1's auc: 0.738432
[6000]	training's auc: 0.766212	valid_1's auc: 0.738604
[6500]	training's auc: 0.768361	valid_1's auc: 0.738735
[7000]	training's auc: 0.770535	valid_1's auc: 0.738872
[7500]	training's auc: 0.772686	valid_1's auc: 0.738981
[8000]	training's auc: 0

In [8]:
test_pred

array([0.34949527, 1.54133269, 3.22277792, ..., 0.65466506, 1.32097488,
       0.12891482])

In [9]:
test = test_pred / kf.n_splits    # test_pred holds the sum over the 5 fold models, so divide to get the average
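
The `test_pred` array printed above contains values greater than 1 because it is the running sum of the probabilities predicted by each fold's model; dividing by the number of folds turns it back into a probability. A minimal sketch of the same averaging on plain lists, with made-up numbers and a hypothetical helper name:

```python
def average_fold_predictions(fold_preds):
    """Element-wise mean of the per-fold prediction lists."""
    n_folds = len(fold_preds)
    return [sum(values) / n_folds for values in zip(*fold_preds)]

# Made-up probabilities from three fold models for two test rows
fold_preds = [
    [0.10, 0.90],
    [0.20, 0.80],
    [0.30, 0.70],
]
summed = [sum(values) for values in zip(*fold_preds)]
averaged = average_fold_predictions(fold_preds)
print(summed)
print(averaged)
```

Deriving the divisor from the number of folds rather than hard-coding it keeps the result a valid probability if the fold count changes.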

In [11]:
data_test['isDefault'] = test

In [12]:
data_test[['id','isDefault']].to_csv('test_sub-09-23-lgb.csv', index=False)

After submitting, the leaderboard score turned out to be only 0.0005 higher than with the baseline parameters. Well...