<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#3.1-学习目标" data-toc-modified-id="3.1-学习目标-1">3.1 Learning objectives</a></span></li><li><span><a href="#3.2-内容介绍" data-toc-modified-id="3.2-内容介绍-2">3.2 Content overview</a></span></li><li><span><a href="#3.3-代码示例" data-toc-modified-id="3.3-代码示例-3">3.3 Code examples</a></span><ul class="toc-item"><li><span><a href="#3.3.1-导入包并读取数据" data-toc-modified-id="3.3.1-导入包并读取数据-3.1">3.3.1 Importing packages and reading the data</a></span></li><li><span><a href="#3.3.2-数据预处理" data-toc-modified-id="3.3.2-数据预处理-3.2">3.3.2 Data preprocessing</a></span></li><li><span><a href="#对测试集做训练集同样的操作" data-toc-modified-id="对测试集做训练集同样的操作-3.3">Applying the same steps to the test set</a></span></li><li><span><a href="#3.3.3-使用tsfresh-进行时间序列特征处理" data-toc-modified-id="3.3.3-使用tsfresh-进行时间序列特征处理-3.4">3.3.3 Time-series feature processing with tsfresh</a></span></li><li><span><a href="#3.3.4-特征选择" data-toc-modified-id="3.3.4-特征选择-3.5">3.3.4 Feature selection</a></span></li></ul></li><li><span><a href="#特征筛选(总的)" data-toc-modified-id="特征筛选(总的)-4">Feature screening (overall)</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#修改all_features的列名" data-toc-modified-id="修改all_features的列名-4.0.1">Renaming the columns of all_features</a></span></li></ul></li></ul></li><li><span><a href="#3.4-模型训练" data-toc-modified-id="3.4-模型训练-5">3.4 Model training</a></span><ul class="toc-item"><li><span><a href="#训练数据/测试数据准备" data-toc-modified-id="训练数据/测试数据准备-5.1">Preparing the training and test data</a></span></li></ul></li></ul></div>

# 3.1 Learning objectives
- Learn feature preprocessing methods for time-series data
- Learn to use tsfresh (TimeSeries Fresh), a feature processing tool for time series

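# 3.2 Content overview
Data preprocessing
- Formatting the time-series data
- Adding the time-step feature `time`

Feature engineering
- Constructing time-series features
- Feature screening
- Using tsfresh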

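# 3.3 Code examples
## 3.3.1 Importing packages and reading the data
tsfresh is a feature engineering tool for **time-series** data that can automatically extract more than 100 features from a time series.    
The package provides a range of feature extraction methods and a robust feature selection algorithm, along with methods for evaluating the explanatory power and importance of those features for regression or classification  
tasks.  
https://zhuanlan.zhihu.com/p/93310900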

In [3]:
# Package imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features,select_features
from tsfresh.utilities.dataframe_functions import impute

In [2]:
import os 
import gc 
import math

import pandas as pd
import numpy as np
 
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import StratifiedKFold,KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)

(100000, 3)
(20000, 2)


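## 3.3.2 Data preprocessing
 - Reshape the heartbeat signals from one row per sample to one row per time step, and add the time-step feature `time` to each signal value
 - Use of reset_index() and set_index()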

In [4]:
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()

In [5]:
train_heartbeat_df

0      0      0.9912297987616655
       1      0.9435330436439665
       2      0.7646772997256593
       3      0.6185708990212999
       4      0.3796321642826237
                     ...        
99999  200                   0.0
       201                   0.0
       202                   0.0
       203                   0.0
       204                   0.0
Length: 20500000, dtype: object

- Reset the index; the result is now a DataFrame

In [6]:
train_heartbeat_df = train_heartbeat_df.reset_index()  

In [7]:
train_heartbeat_df

Unnamed: 0,level_0,level_1,0
0,0,0,0.9912297987616655
1,0,1,0.9435330436439665
2,0,2,0.7646772997256593
3,0,3,0.6185708990212999
4,0,4,0.3796321642826237
...,...,...,...
20499995,99999,200,0.0
20499996,99999,201,0.0
20499997,99999,202,0.0
20499998,99999,203,0.0


- Set level_0 as the index

In [8]:
train_heartbeat_df =  train_heartbeat_df.set_index("level_0")

In [9]:
train_heartbeat_df

Unnamed: 0_level_0,level_1,0
level_0,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,0.9912297987616655
0,1,0.9435330436439665
0,2,0.7646772997256593
0,3,0.6185708990212999
0,4,0.3796321642826237
...,...,...
99999,200,0.0
99999,201,0.0
99999,202,0.0
99999,203,0.0


- Set the index name to None, which removes the `level_0` label from the index header

In [10]:
train_heartbeat_df.index.name = None

In [11]:
train_heartbeat_df

Unnamed: 0,level_1,0
0,0,0.9912297987616655
0,1,0.9435330436439665
0,2,0.7646772997256593
0,3,0.6185708990212999
0,4,0.3796321642826237
...,...,...
99999,200,0.0
99999,201,0.0
99999,202,0.0
99999,203,0.0


- Use rename() to change the column names; `inplace=True` modifies the DataFrame directly instead of returning a new one

In [12]:
train_heartbeat_df.rename(columns={"level_1":"time",0:"heartbeat_signals"},inplace=True)

In [13]:
train_heartbeat_df

Unnamed: 0,time,heartbeat_signals
0,0,0.9912297987616655
0,1,0.9435330436439665
0,2,0.7646772997256593
0,3,0.6185708990212999
0,4,0.3796321642826237
...,...,...
99999,200,0.0
99999,201,0.0
99999,202,0.0
99999,203,0.0


In [14]:
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

In [15]:
train_heartbeat_df

Unnamed: 0,time,heartbeat_signals
0,0,0.991230
0,1,0.943533
0,2,0.764677
0,3,0.618571
0,4,0.379632
...,...,...
99999,200,0.000000
99999,201,0.000000
99999,202,0.000000
99999,203,0.000000


- Add the processed heartbeat features to the training data, and store the label column of the training data separately

In [16]:
data_train_label = data_train["label"]

In [17]:
data_train_label

0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64

 - Drop the label column from data_train

In [18]:
data_train = data_train.drop('label',axis=1)

In [19]:
data_train

Unnamed: 0,id,heartbeat_signals
0,0,"0.9912297987616655,0.9435330436439665,0.764677..."
1,1,"0.9714822034884503,0.9289687459588268,0.572932..."
2,2,"1.0,0.9591487564065292,0.7013782792997189,0.23..."
3,3,"0.9757952826275774,0.9340884687738161,0.659636..."
4,4,"0.0,0.055816398940721094,0.26129357194994196,0..."
...,...,...
99995,99995,"1.0,0.677705342021188,0.22239242747868546,0.25..."
99996,99996,"0.9268571578157265,0.9063471198026871,0.636993..."
99997,99997,"0.9258351628306013,0.5873839035878395,0.633226..."
99998,99998,"1.0,0.9947621698382489,0.8297017704865509,0.45..."


In [20]:
data_train = data_train.drop("heartbeat_signals", axis=1)

In [21]:
data_train

Unnamed: 0,id
0,0
1,1
2,2
3,3
4,4
...,...
99995,99995
99996,99996
99997,99997
99998,99998


In [22]:
data_train = data_train.join(train_heartbeat_df)

In [23]:
data_train

Unnamed: 0,id,time,heartbeat_signals
0,0,0,0.991230
0,0,1,0.943533
0,0,2,0.764677
0,0,3,0.618571
0,0,4,0.379632
...,...,...,...
99999,99999,200,0.000000
99999,99999,201,0.000000
99999,99999,202,0.000000
99999,99999,203,0.000000


In [24]:
data_train[data_train["id"]==1]

Unnamed: 0,id,time,heartbeat_signals
1,1,0,0.971482
1,1,1,0.928969
1,1,2,0.572933
1,1,3,0.178457
1,1,4,0.122962
...,...,...,...
1,1,200,0.000000
1,1,201,0.000000
1,1,202,0.000000
1,1,203,0.000000


As shown, the heartbeat feature of each sample consists of signal values at 205 time steps.
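
The preprocessing steps above can be collected into one reusable function. The following is a minimal sketch rather than the notebook's own code: the function name `signals_to_long` and the toy data are hypothetical, and it assumes the input has an `id` column and a comma-separated `heartbeat_signals` column, as in `train.csv`.

```python
import pandas as pd


def signals_to_long(df, signal_col="heartbeat_signals", id_col="id"):
    """Expand a comma-separated signal column into one row per time step.

    Hypothetical helper: chains the split / stack / reset_index / rename /
    astype steps used in the cells above.
    """
    long_df = (
        df[signal_col]
        .str.split(",", expand=True)           # one column per time step
        .stack()                               # MultiIndex: (row, time step)
        .astype(float)                         # strings -> floats
        .rename(signal_col)                    # name the value column
        .reset_index(level=1)                  # time-step level becomes a column
        .rename(columns={"level_1": "time"})
    )
    # Joining on the row index repeats each id once per time step.
    return df[[id_col]].join(long_df)


# Toy data with 3 time steps per sample (the real signals have 205).
toy = pd.DataFrame(
    {"id": [0, 1], "heartbeat_signals": ["0.5,0.25,0.0", "1.0,0.5,0.0"]}
)
long_toy = signals_to_long(toy)
print(long_toy)
```

Because the same function can be applied to both the training and the test frame, the two sets are guaranteed to go through identical steps.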

## Applying the same steps to the test set

In [25]:
# Reshape the heartbeat signals from wide to long format and add the time-step feature time to each signal value
test_A_heartbeat_df = data_test_A["heartbeat_signals"].str.split(",", expand=True).stack()
test_A_heartbeat_df = test_A_heartbeat_df.reset_index()
test_A_heartbeat_df = test_A_heartbeat_df.set_index("level_0")
test_A_heartbeat_df.index.name = None
test_A_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
test_A_heartbeat_df["heartbeat_signals"] = test_A_heartbeat_df["heartbeat_signals"].astype(float)

test_A_heartbeat_df

Unnamed: 0,time,heartbeat_signals
0,0,0.991571
0,1,1.000000
0,2,0.631816
0,3,0.136230
0,4,0.041420
...,...,...
19999,200,0.000000
19999,201,0.000000
19999,202,0.000000
19999,203,0.000000


In [26]:
# Add the processed heartbeat features to the test data
data_test_A = data_test_A.drop("heartbeat_signals", axis=1)
data_test_A = data_test_A.join(test_A_heartbeat_df)

In [27]:
data_test_A

Unnamed: 0,id,time,heartbeat_signals
0,100000,0,0.991571
0,100000,1,1.000000
0,100000,2,0.631816
0,100000,3,0.136230
0,100000,4,0.041420
...,...,...,...
19999,119999,200,0.000000
19999,119999,201,0.000000
19999,119999,202,0.000000
19999,119999,203,0.000000


In [28]:
data_train

Unnamed: 0,id,time,heartbeat_signals
0,0,0,0.991230
0,0,1,0.943533
0,0,2,0.764677
0,0,3,0.618571
0,0,4,0.379632
...,...,...,...
99999,99999,200,0.000000
99999,99999,201,0.000000
99999,99999,202,0.000000
99999,99999,203,0.000000


- Concatenate the data

In [29]:
# all_data = pd.concat([data_train,data_test_A],axis=0,join='inner')

In [30]:
#all_data

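## 3.3.3 Time-series feature processing with tsfresh
1. Feature extraction

**tsfresh (TimeSeries Fresh)** is a third-party Python package that automatically computes a large number of features from time-series data. It also includes methods for assessing feature importance and for selecting features, so it is a good choice for feature extraction in both classification and regression problems on time-series data. Official documentation: [Introduction — tsfresh 0.17.1.dev24+g860c4e1 documentation](https://tsfresh.readthedocs.io/en/latest/text/introduction.html)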
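
To make the extracted columns less opaque, the sketch below recomputes four of the simplest features by hand with NumPy. The feature names match the column suffixes in the feature table (`sum_values`, `abs_energy`, `mean_abs_change`, `mean_change`); the function `basic_features` and the toy signal are hypothetical, and this is a plain re-derivation from the documented definitions, not tsfresh's own implementation.

```python
import numpy as np


def basic_features(x):
    """Recompute four simple per-series features (illustrative only)."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)  # differences between consecutive time steps
    return {
        "sum_values": x.sum(),                          # sum of all values
        "abs_energy": np.sum(x ** 2),                   # sum of squared values
        "mean_abs_change": np.mean(np.abs(diffs)),      # mean |x[t+1] - x[t]|
        "mean_change": (x[-1] - x[0]) / (len(x) - 1),   # mean of x[t+1] - x[t]
    }


feats = basic_features([1.0, 0.5, 0.0, 0.5])
print(feats)
```

As a sanity check against the notebook's data: sample 0 starts at 0.9912297987616655, ends at 0.0, and has 205 time steps, so its mean change is -0.9912298 / 204 ≈ -0.004859, which agrees with the `heartbeat_signals__mean_change` value in the feature table.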

In [31]:
# # Feature extraction
# train_features = extract_features(data_train,column_id = 'id',column_sort='time')
# train_features

In [32]:
#all_features = extract_features(all_data,column_id='id',column_sort='time')

- Save the extracted features (`all_features`) to `./all_data.pkl` so they can be loaded directly next time

In [2]:
#all_features.to_pickle('./all_data.pkl')
import pandas as pd
all_features = pd.read_pickle('./all_data.pkl')
all_baseline_data = pd.read_pickle('./all_baseline_data.pkl')

In [3]:
all_features.to_csv('./all_data.csv')

In [7]:
all_baseline_data.to_csv('./all_baseline_data.csv')

In [2]:
# all_features['label'] = data_train['label']
# all_features['label'] = all_features['label'].fillna(-1)
# all_features.to_pickle('./ts_test.pkl')

In [35]:
all_features

Unnamed: 0,heartbeat_signals__variance_larger_than_standard_deviation,heartbeat_signals__has_duplicate_max,heartbeat_signals__has_duplicate_min,heartbeat_signals__has_duplicate,heartbeat_signals__sum_values,heartbeat_signals__abs_energy,heartbeat_signals__mean_abs_change,heartbeat_signals__mean_change,heartbeat_signals__mean_second_derivative_central,heartbeat_signals__median,...,heartbeat_signals__permutation_entropy__dimension_5__tau_1,heartbeat_signals__permutation_entropy__dimension_6__tau_1,heartbeat_signals__permutation_entropy__dimension_7__tau_1,heartbeat_signals__query_similarity_count__query_None__threshold_0.0,"heartbeat_signals__matrix_profile__feature_""min""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""max""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""mean""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""median""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""25""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""75""__threshold_0.98"
0,0.0,0.0,1.0,1.0,38.927945,18.216197,0.019894,-0.004859,0.000117,0.125531,...,2.184420,2.500658,2.722686,0.0,6.445546,12.165525,10.246524,10.746992,8.388625,11.484910
1,0.0,0.0,1.0,1.0,19.445634,7.705092,0.019952,-0.004762,0.000105,0.030481,...,2.710933,3.065802,3.224835,0.0,3.209140,12.649111,9.031069,9.437545,6.723180,12.094899
2,0.0,0.0,1.0,1.0,21.192974,9.140423,0.009863,-0.004902,0.000101,0.000000,...,1.263370,1.406001,1.509478,0.0,3.054539,8.246211,7.370478,8.246211,5.966122,8.246211
3,0.0,0.0,1.0,1.0,42.113066,15.757623,0.018743,-0.004783,0.000103,0.241397,...,2.986728,3.534354,3.854177,0.0,3.010557,9.797959,6.331360,6.406440,5.266743,7.091706
4,0.0,0.0,1.0,1.0,69.756786,51.229616,0.014514,0.000000,-0.000137,0.000000,...,1.914511,2.165627,2.323993,0.0,9.181236,13.429784,9.959913,9.516290,9.286013,10.270925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,0.0,0.0,1.0,1.0,43.175130,18.967833,0.016106,-0.004902,0.000411,0.205399,...,3.150910,3.625398,3.843586,0.0,3.687770,8.700294,5.991330,6.323450,4.155558,7.191577
119996,0.0,0.0,1.0,1.0,31.030782,14.413244,0.021473,-0.004902,0.000429,0.000000,...,1.732287,1.955659,2.081946,0.0,10.456465,12.982197,11.338307,11.244766,10.763332,11.762948
119997,0.0,0.0,1.0,1.0,31.648623,13.083992,0.017566,-0.004665,0.000087,0.010807,...,2.248241,2.497097,2.663404,0.0,6.037870,11.661904,9.312119,8.973721,8.064338,10.409977
119998,0.0,0.0,1.0,1.0,19.305442,6.700835,0.019937,-0.004547,0.000617,0.000000,...,2.538456,2.912829,3.021449,0.0,10.350940,15.065584,12.961223,12.887409,12.118259,13.558463


In [36]:
# all_features = pd.concat([all_features,all_baseline_data],axis=1)

In [37]:
# all_features

- Load the precomputed features (stored in pkl format) directly, instead of spending the time to regenerate them on every run
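
The load-if-cached pattern described here can be written as a small helper. This is a sketch under assumptions: `load_or_compute` is a hypothetical name, and `expensive` stands in for a slow step such as a feature extraction call.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute_fn):
    """Return the pickled DataFrame at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    result = compute_fn()
    result.to_pickle(path)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive():
    # Stand-in for a slow computation (hypothetical).
    calls.append(1)
    return pd.DataFrame({"feature": [1.0, 2.0]})


cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl")
first = load_or_compute(cache_path, expensive)   # computes and writes the cache
second = load_or_compute(cache_path, expensive)  # reads the cache
print(len(calls))
```

If the inputs or the extraction settings change, the cached file has to be deleted by hand, since the helper only checks whether the file exists.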

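2. Feature selection   
`train_features` contains 779 common time-series features of `heartbeat_signals` (see the official documentation for an explanation of each one). Some of these features may be NaN because the data does not support computing them. Remove the NaN values as follows: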

In [38]:
# # Remove NaN values from the extracted features
impute(all_features)

Unnamed: 0,heartbeat_signals__variance_larger_than_standard_deviation,heartbeat_signals__has_duplicate_max,heartbeat_signals__has_duplicate_min,heartbeat_signals__has_duplicate,heartbeat_signals__sum_values,heartbeat_signals__abs_energy,heartbeat_signals__mean_abs_change,heartbeat_signals__mean_change,heartbeat_signals__mean_second_derivative_central,heartbeat_signals__median,...,heartbeat_signals__permutation_entropy__dimension_5__tau_1,heartbeat_signals__permutation_entropy__dimension_6__tau_1,heartbeat_signals__permutation_entropy__dimension_7__tau_1,heartbeat_signals__query_similarity_count__query_None__threshold_0.0,"heartbeat_signals__matrix_profile__feature_""min""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""max""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""mean""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""median""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""25""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""75""__threshold_0.98"
0,0.0,0.0,1.0,1.0,38.927945,18.216197,0.019894,-0.004859,0.000117,0.125531,...,2.184420,2.500658,2.722686,0.0,6.445546,12.165525,10.246524,10.746992,8.388625,11.484910
1,0.0,0.0,1.0,1.0,19.445634,7.705092,0.019952,-0.004762,0.000105,0.030481,...,2.710933,3.065802,3.224835,0.0,3.209140,12.649111,9.031069,9.437545,6.723180,12.094899
2,0.0,0.0,1.0,1.0,21.192974,9.140423,0.009863,-0.004902,0.000101,0.000000,...,1.263370,1.406001,1.509478,0.0,3.054539,8.246211,7.370478,8.246211,5.966122,8.246211
3,0.0,0.0,1.0,1.0,42.113066,15.757623,0.018743,-0.004783,0.000103,0.241397,...,2.986728,3.534354,3.854177,0.0,3.010557,9.797959,6.331360,6.406440,5.266743,7.091706
4,0.0,0.0,1.0,1.0,69.756786,51.229616,0.014514,0.000000,-0.000137,0.000000,...,1.914511,2.165627,2.323993,0.0,9.181236,13.429784,9.959913,9.516290,9.286013,10.270925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,0.0,0.0,1.0,1.0,43.175130,18.967833,0.016106,-0.004902,0.000411,0.205399,...,3.150910,3.625398,3.843586,0.0,3.687770,8.700294,5.991330,6.323450,4.155558,7.191577
119996,0.0,0.0,1.0,1.0,31.030782,14.413244,0.021473,-0.004902,0.000429,0.000000,...,1.732287,1.955659,2.081946,0.0,10.456465,12.982197,11.338307,11.244766,10.763332,11.762948
119997,0.0,0.0,1.0,1.0,31.648623,13.083992,0.017566,-0.004665,0.000087,0.010807,...,2.248241,2.497097,2.663404,0.0,6.037870,11.661904,9.312119,8.973721,8.064338,10.409977
119998,0.0,0.0,1.0,1.0,19.305442,6.700835,0.019937,-0.004547,0.000617,0.000000,...,2.538456,2.912829,3.021449,0.0,10.350940,15.065584,12.961223,12.887409,12.118259,13.558463
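
For intuition about what the imputation step does, here is a rough pandas equivalent. It assumes the column-wise rule described in the tsfresh documentation (NaN becomes the column median, +inf the column maximum, -inf the column minimum, and a column without any finite value is filled with zeros). The function `impute_like` is hypothetical and returns a copy, whereas the notebook's `impute` call updates the frame it is given.

```python
import numpy as np
import pandas as pd


def impute_like(df):
    """Column-wise replacement of NaN and infinities (illustrative sketch)."""
    out = df.astype(float).copy()
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0  # no finite value to derive a replacement from
            continue
        values = values.replace(np.inf, finite.max())
        values = values.replace(-np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out


toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.inf], "b": [-np.inf, 2.0, 4.0, np.nan]}
)
clean = impute_like(toy)
print(clean)
```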


In [39]:
all_features_columns = all_features.columns
all_features

Unnamed: 0,heartbeat_signals__variance_larger_than_standard_deviation,heartbeat_signals__has_duplicate_max,heartbeat_signals__has_duplicate_min,heartbeat_signals__has_duplicate,heartbeat_signals__sum_values,heartbeat_signals__abs_energy,heartbeat_signals__mean_abs_change,heartbeat_signals__mean_change,heartbeat_signals__mean_second_derivative_central,heartbeat_signals__median,...,heartbeat_signals__permutation_entropy__dimension_5__tau_1,heartbeat_signals__permutation_entropy__dimension_6__tau_1,heartbeat_signals__permutation_entropy__dimension_7__tau_1,heartbeat_signals__query_similarity_count__query_None__threshold_0.0,"heartbeat_signals__matrix_profile__feature_""min""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""max""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""mean""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""median""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""25""__threshold_0.98","heartbeat_signals__matrix_profile__feature_""75""__threshold_0.98"
0,0.0,0.0,1.0,1.0,38.927945,18.216197,0.019894,-0.004859,0.000117,0.125531,...,2.184420,2.500658,2.722686,0.0,6.445546,12.165525,10.246524,10.746992,8.388625,11.484910
1,0.0,0.0,1.0,1.0,19.445634,7.705092,0.019952,-0.004762,0.000105,0.030481,...,2.710933,3.065802,3.224835,0.0,3.209140,12.649111,9.031069,9.437545,6.723180,12.094899
2,0.0,0.0,1.0,1.0,21.192974,9.140423,0.009863,-0.004902,0.000101,0.000000,...,1.263370,1.406001,1.509478,0.0,3.054539,8.246211,7.370478,8.246211,5.966122,8.246211
3,0.0,0.0,1.0,1.0,42.113066,15.757623,0.018743,-0.004783,0.000103,0.241397,...,2.986728,3.534354,3.854177,0.0,3.010557,9.797959,6.331360,6.406440,5.266743,7.091706
4,0.0,0.0,1.0,1.0,69.756786,51.229616,0.014514,0.000000,-0.000137,0.000000,...,1.914511,2.165627,2.323993,0.0,9.181236,13.429784,9.959913,9.516290,9.286013,10.270925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,0.0,0.0,1.0,1.0,43.175130,18.967833,0.016106,-0.004902,0.000411,0.205399,...,3.150910,3.625398,3.843586,0.0,3.687770,8.700294,5.991330,6.323450,4.155558,7.191577
119996,0.0,0.0,1.0,1.0,31.030782,14.413244,0.021473,-0.004902,0.000429,0.000000,...,1.732287,1.955659,2.081946,0.0,10.456465,12.982197,11.338307,11.244766,10.763332,11.762948
119997,0.0,0.0,1.0,1.0,31.648623,13.083992,0.017566,-0.004665,0.000087,0.010807,...,2.248241,2.497097,2.663404,0.0,6.037870,11.661904,9.312119,8.973721,8.064338,10.409977
119998,0.0,0.0,1.0,1.0,19.305442,6.700835,0.019937,-0.004547,0.000617,0.000000,...,2.538456,2.912829,3.021449,0.0,10.350940,15.065584,12.961223,12.887409,12.118259,13.558463


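Next, features are selected according to their relevance to the response variable. This takes two steps:  
- First, the relevance of each feature to the response variable is computed individually
- Then the Benjamini-Yekutieli procedure [1] decides which features are kept.  

Some common feature selection methods are shown in section 3.3.4.   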
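
The second step can be sketched in NumPy. Given one p-value per feature, the Benjamini-Yekutieli procedure sorts them and keeps every feature up to the largest rank k whose p-value is at most k·q / (m·c(m)), where m is the number of features and c(m) = 1 + 1/2 + ... + 1/m. The function name, the example p-values, and the level q = 0.05 are assumptions for illustration; this is not tsfresh's code.

```python
import numpy as np


def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a boolean mask of the hypotheses rejected (features kept)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                             # ranks, smallest p first
    harmonic = np.sum(1.0 / np.arange(1, m + 1))      # c(m)
    thresholds = np.arange(1, m + 1) * fdr_level / (m * harmonic)
    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])             # largest passing rank (0-based)
        keep[order[: k + 1]] = True                   # keep everything up to that rank
    return keep


p_vals = [0.041, 0.001, 0.6, 0.008, 0.039]            # hypothetical p-values
mask = benjamini_yekutieli(p_vals)
print(mask)
```

The correction factor c(m) makes the procedure valid under arbitrary dependence between the tests, at the cost of being more conservative than the plain Benjamini-Hochberg rule.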


- Features after the preliminary feature selection

In [40]:
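import pickle
# Open the pickled feature file in a context manager so it is closed after loading
with open("./HeartbeatClassification/train_features_file.pkl","rb") as feature_file:
    train_features = pickle.load(feature_file)

# Remove NaN values from the extracted features
impute(train_features)
#train_features.head()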

Unnamed: 0,heartbeat_signals__variance_larger_than_standard_deviation,heartbeat_signals__has_duplicate_max,heartbeat_signals__has_duplicate_min,heartbeat_signals__has_duplicate,heartbeat_signals__sum_values,heartbeat_signals__abs_energy,heartbeat_signals__mean_abs_change,heartbeat_signals__mean_change,heartbeat_signals__mean_second_derivative_central,heartbeat_signals__median,...,heartbeat_signals__fourier_entropy__bins_2,heartbeat_signals__fourier_entropy__bins_3,heartbeat_signals__fourier_entropy__bins_5,heartbeat_signals__fourier_entropy__bins_10,heartbeat_signals__fourier_entropy__bins_100,heartbeat_signals__permutation_entropy__dimension_3__tau_1,heartbeat_signals__permutation_entropy__dimension_4__tau_1,heartbeat_signals__permutation_entropy__dimension_5__tau_1,heartbeat_signals__permutation_entropy__dimension_6__tau_1,heartbeat_signals__permutation_entropy__dimension_7__tau_1
0,0.0,0.0,1.0,1.0,38.927945,18.216197,0.019894,-0.004859,0.000117,0.125531,...,0.095763,0.109222,0.109222,0.356175,0.940492,1.180828,1.734917,2.184420,2.500658,2.722686
1,0.0,0.0,1.0,1.0,19.445634,7.705092,0.019952,-0.004762,0.000105,0.030481,...,0.248333,0.409767,0.567944,0.913016,1.791964,1.360828,2.118249,2.710933,3.065802,3.224835
2,0.0,0.0,1.0,1.0,21.192974,9.140423,0.009863,-0.004902,0.000101,0.000000,...,0.054659,0.054659,0.150231,0.204601,0.542013,0.712221,1.031064,1.263370,1.406001,1.509478
3,0.0,0.0,1.0,1.0,42.113066,15.757623,0.018743,-0.004783,0.000103,0.241397,...,0.054659,0.109222,0.186062,0.258874,1.426345,1.389686,2.206088,2.986728,3.534354,3.854177
4,0.0,0.0,1.0,1.0,69.756786,51.229616,0.014514,0.000000,-0.000137,0.000000,...,0.054659,0.109222,0.109222,0.163690,0.517722,1.045339,1.543338,1.914511,2.165627,2.323993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0.0,0.0,1.0,1.0,63.323449,28.742238,0.023588,-0.004902,0.000794,0.388402,...,0.054659,0.054659,0.109222,0.109222,1.405361,1.326208,2.137411,2.873602,3.391830,3.679969
99996,0.0,0.0,1.0,1.0,69.657534,31.866323,0.017373,-0.004543,0.000051,0.421138,...,0.095763,0.095763,0.109222,0.163690,0.749555,1.408284,2.244166,3.085504,3.728881,4.095457
99997,0.0,0.0,1.0,1.0,40.897057,16.412857,0.019470,-0.004538,0.000834,0.213306,...,0.164224,0.186062,0.299588,0.353661,0.995174,1.305626,2.005282,2.601062,2.996962,3.293562
99998,0.0,0.0,1.0,1.0,42.333303,14.281281,0.017032,-0.004902,0.000013,0.264974,...,0.095763,0.109222,0.163690,0.218060,1.321241,1.460980,2.387132,3.236950,3.793512,4.018302


In [41]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 779 entries, heartbeat_signals__variance_larger_than_standard_deviation to heartbeat_signals__permutation_entropy__dimension_7__tau_1
dtypes: float64(779)
memory usage: 595.1 MB


In [42]:
sum(train_features.isnull().sum())

0

In [43]:
train_features

Unnamed: 0,heartbeat_signals__variance_larger_than_standard_deviation,heartbeat_signals__has_duplicate_max,heartbeat_signals__has_duplicate_min,heartbeat_signals__has_duplicate,heartbeat_signals__sum_values,heartbeat_signals__abs_energy,heartbeat_signals__mean_abs_change,heartbeat_signals__mean_change,heartbeat_signals__mean_second_derivative_central,heartbeat_signals__median,...,heartbeat_signals__fourier_entropy__bins_2,heartbeat_signals__fourier_entropy__bins_3,heartbeat_signals__fourier_entropy__bins_5,heartbeat_signals__fourier_entropy__bins_10,heartbeat_signals__fourier_entropy__bins_100,heartbeat_signals__permutation_entropy__dimension_3__tau_1,heartbeat_signals__permutation_entropy__dimension_4__tau_1,heartbeat_signals__permutation_entropy__dimension_5__tau_1,heartbeat_signals__permutation_entropy__dimension_6__tau_1,heartbeat_signals__permutation_entropy__dimension_7__tau_1
0,0.0,0.0,1.0,1.0,38.927945,18.216197,0.019894,-0.004859,0.000117,0.125531,...,0.095763,0.109222,0.109222,0.356175,0.940492,1.180828,1.734917,2.184420,2.500658,2.722686
1,0.0,0.0,1.0,1.0,19.445634,7.705092,0.019952,-0.004762,0.000105,0.030481,...,0.248333,0.409767,0.567944,0.913016,1.791964,1.360828,2.118249,2.710933,3.065802,3.224835
2,0.0,0.0,1.0,1.0,21.192974,9.140423,0.009863,-0.004902,0.000101,0.000000,...,0.054659,0.054659,0.150231,0.204601,0.542013,0.712221,1.031064,1.263370,1.406001,1.509478
3,0.0,0.0,1.0,1.0,42.113066,15.757623,0.018743,-0.004783,0.000103,0.241397,...,0.054659,0.109222,0.186062,0.258874,1.426345,1.389686,2.206088,2.986728,3.534354,3.854177
4,0.0,0.0,1.0,1.0,69.756786,51.229616,0.014514,0.000000,-0.000137,0.000000,...,0.054659,0.109222,0.109222,0.163690,0.517722,1.045339,1.543338,1.914511,2.165627,2.323993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0.0,0.0,1.0,1.0,63.323449,28.742238,0.023588,-0.004902,0.000794,0.388402,...,0.054659,0.054659,0.109222,0.109222,1.405361,1.326208,2.137411,2.873602,3.391830,3.679969
99996,0.0,0.0,1.0,1.0,69.657534,31.866323,0.017373,-0.004543,0.000051,0.421138,...,0.095763,0.095763,0.109222,0.163690,0.749555,1.408284,2.244166,3.085504,3.728881,4.095457
99997,0.0,0.0,1.0,1.0,40.897057,16.412857,0.019470,-0.004538,0.000834,0.213306,...,0.164224,0.186062,0.299588,0.353661,0.995174,1.305626,2.005282,2.601062,2.996962,3.293562
99998,0.0,0.0,1.0,1.0,42.333303,14.281281,0.017032,-0.004902,0.000013,0.264974,...,0.095763,0.109222,0.163690,0.218060,1.321241,1.460980,2.387132,3.236950,3.793512,4.018302


- Concatenate the training data from the baseline with these features

In [44]:
# train = pd.read_pickle('./all_baseline_data.pkl').iloc[:100000,1:]
# train_features = pd.concat([train_features,train],axis=1)

In [45]:
# Select features according to their relevance to the data label
train_features_filtered = select_features(train_features,data_train_label)

train_features = train_features_filtered

In [46]:
train_features.head()

Unnamed: 0,heartbeat_signals__sum_values,"heartbeat_signals__fft_coefficient__attr_""abs""__coeff_35","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_34","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_33","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_32","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_31","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_30","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_29","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_28","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_27",...,"heartbeat_signals__fft_coefficient__attr_""abs""__coeff_84","heartbeat_signals__fft_coefficient__attr_""imag""__coeff_97","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_90","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_94","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_92","heartbeat_signals__fft_coefficient__attr_""real""__coeff_97","heartbeat_signals__fft_coefficient__attr_""abs""__coeff_75","heartbeat_signals__fft_coefficient__attr_""real""__coeff_88","heartbeat_signals__fft_coefficient__attr_""real""__coeff_92","heartbeat_signals__fft_coefficient__attr_""real""__coeff_83"
0,38.927945,1.168685,0.982133,1.223496,1.2363,1.104172,1.497129,1.358095,1.704225,1.745158,...,0.531883,-0.047438,0.55437,0.307586,0.564596,0.56296,0.591859,0.504124,0.52845,0.473568
1,19.445634,1.460752,1.924501,1.925485,1.715938,2.079957,1.818636,2.49045,1.673244,2.821067,...,0.56359,-0.109579,0.697446,0.398073,0.640969,0.270192,0.224925,0.645082,0.635135,0.297325
2,21.192974,1.787166,2.146987,1.68619,1.540137,2.291031,2.403422,1.765422,1.993213,2.756081,...,0.712487,-0.074042,0.321703,0.390386,0.716929,0.316524,0.422077,0.722742,0.68059,0.383754
3,42.113066,2.071539,1.00034,2.728281,1.391727,2.017176,2.610492,0.747448,2.900299,1.294779,...,0.601499,-0.184248,0.564669,0.623353,0.46698,0.651774,0.308915,0.550097,0.466904,0.494024
4,69.756786,0.653924,0.231422,1.080003,0.711244,1.357904,1.237998,1.346404,1.64587,0.941866,...,0.015292,0.070505,0.065835,0.05178,0.09294,0.103773,0.179405,-0.089611,0.091841,0.056867


## 3.3.4 Feature selection
![jupyter](./image/1.png) 

- Consult the source code and the documentation for the underlying methodology.

- Variance threshold method

In [47]:
# Filter using the features themselves: variance threshold method
# First compute the variance of each feature, then keep the features whose variance exceeds the chosen threshold
from sklearn.feature_selection import VarianceThreshold
# The threshold parameter is the variance cutoff
sel_var = VarianceThreshold(threshold=0.5)
train_features_var = sel_var.fit_transform(train_features)#, data_train_label)
train_features_var =  pd.DataFrame(train_features_var)

In [48]:
train_features_var

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,228,229,230,231,232,233,234,235,236,237
0,38.927945,1.168685,0.982133,1.223496,1.236300,1.104172,1.497129,1.358095,1.704225,1.745158,...,-2.161341,-1.546245,-34.890840,-2.055638,3.471266,-58.620451,-10.792015,-4.123715,0.667119,-1.921796
1,19.445634,1.460752,1.924501,1.925485,1.715938,2.079957,1.818636,2.490450,1.673244,2.821067,...,-1.549588,-1.269734,-64.457188,-2.408836,-18.270162,-64.798866,-20.534880,13.841282,0.667119,-0.458652
2,21.192974,1.787166,2.146987,1.686190,1.540137,2.291031,2.403422,1.765422,1.993213,2.756081,...,-1.780844,-1.383843,-73.590789,-1.169188,1.662887,-53.995417,-8.598951,14.418712,0.667119,-0.878152
3,42.113066,2.071539,1.000340,2.728281,1.391727,2.017176,2.610492,0.747448,2.900299,1.294779,...,-1.252030,0.058345,-65.921705,0.617548,10.407576,-76.584465,-8.344192,15.132761,0.667119,-2.362533
4,69.756786,0.653924,0.231422,1.080003,0.711244,1.357904,1.237998,1.346404,1.645870,0.941866,...,-0.573069,-0.681231,-39.619402,0.425593,86.081037,-79.000978,18.806833,19.472772,0.667119,-0.310125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,63.323449,0.417221,2.036034,1.659054,0.500584,1.693545,0.859932,1.963009,1.524831,1.344715,...,-2.258430,-2.458467,6.742953,-1.797246,-33.523814,88.956889,-35.317854,22.821690,0.667119,-0.159357
99996,69.657534,1.611333,1.793044,1.092325,0.507138,1.763940,2.677643,2.640827,1.128049,0.856280,...,-0.413565,-0.485189,-81.910362,-2.249474,-24.439460,-27.042323,2.612480,22.338030,0.667119,-0.869863
99997,40.897057,1.190514,0.674603,1.632769,0.229008,2.027802,0.302457,2.016243,0.352602,1.836034,...,0.589696,-0.255260,71.062444,0.746519,-17.257643,168.595352,-41.006511,31.310401,0.667119,-1.075225
99998,42.333303,1.237608,1.325212,2.785515,1.918571,0.814167,2.613950,2.083409,1.330934,2.801509,...,-2.528968,-1.802000,-67.639185,-1.811680,10.811580,-44.313454,8.045512,-7.708835,0.667119,-0.139487


In [49]:
# Check which features were kept (238 in this run)
train_features_var_columns = train_features.columns[sel_var.variances_ > sel_var.threshold]

In [50]:
len(train_features_var_columns)

238
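
Wrapping the output of `fit_transform` in `pd.DataFrame` loses the feature names, which is why the columns above are numbered 0 to 237. A small sketch on toy data (the column names here are made up) shows how `get_support()` keeps the names attached:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame(
    {
        "constant": [1.0, 1.0, 1.0, 1.0],  # variance 0
        "small": [0.1, 0.2, 0.1, 0.2],     # variance 0.0025
        "large": [0.0, 2.0, 4.0, 6.0],     # variance 5
    }
)

selector = VarianceThreshold(threshold=0.5)
selector.fit(toy)

# get_support() is a boolean mask aligned with the input columns.
kept_columns = toy.columns[selector.get_support()]
toy_var = toy[kept_columns]
print(list(kept_columns))
```

Note that a variance cutoff depends on the scale of each feature, so a fixed threshold such as 0.5 treats features with different units very differently.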

- Chi-square test   
The input X must be non-negative. 

In [51]:
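# Caution: min() on a DataFrame iterates over its column labels, so this returns the smallest label (0), not the smallest value; train_features_var.min().min() would check the values
min(train_features_var)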

0

In [52]:
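# 1. The classic chi-square test measures the dependence between an independent variable and the dependent variable.
# Suppose the independent variable takes N distinct values and the dependent variable takes M; consider the gap between the
# observed and the expected frequency of samples where the independent variable equals i and the dependent variable equals j.
# The statistic is χ² = Σ (A − T)² / T, where A is the observed value and T is the theoretical (expected) value.
# 2. (Note: chi2 only accepts non-negative feature values; otherwise it raises "Input X must be non-negative".)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# The parameter k is the number of features to select
sel_chi2 = SelectKBest(chi2,k=200)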

In [53]:
# train_features_var_chi = sel_chi2.fit_transform(train_features_var, data_train['label'])
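
Since `chi2` rejects negative inputs and the variance-filtered features contain negative values, one common workaround is to rescale each feature to [0, 1] first. The sketch below uses `MinMaxScaler`, which the notebook already imports; the synthetic data and column names are assumptions, and rescaling is a choice made for this illustration rather than something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)

# One feature that tracks the label (with negative values) and one pure-noise feature.
X = pd.DataFrame(
    {
        "informative": y * 2.0 - 1.0 + rng.normal(scale=0.1, size=100),
        "noise": rng.normal(size=100),
    }
)

# Map every feature to [0, 1] so that chi2 accepts the input.
X_scaled = MinMaxScaler().fit_transform(X)

selector = SelectKBest(chi2, k=1).fit(X_scaled, y)
selected = X.columns[selector.get_support()]
print(list(selected))
```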

- Tree-based feature selection

In [54]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

In [55]:
clf = ExtraTreesClassifier()

In [56]:
data_train

Unnamed: 0,id,time,heartbeat_signals
0,0,0,0.991230
0,0,1,0.943533
0,0,2,0.764677
0,0,3,0.618571
0,0,4,0.379632
...,...,...,...
99999,99999,200,0.000000
99999,99999,201,0.000000
99999,99999,202,0.000000
99999,99999,203,0.000000


In [57]:
data_train = pd.read_csv("train.csv")
clf.fit(train_features,data_train['label'])

ExtraTreesClassifier()

In [58]:
train_features_fi = pd.DataFrame({'col':train_features.columns,'fi':clf.feature_importances_})

In [59]:
train_features_fi_columns = list(train_features_fi.sort_values('fi',ascending=False)['col'].iloc[:250]) # how many to keep is a tunable choice

In [60]:
len(train_features_fi_columns)

250
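
The same top-k selection can also be expressed with `SelectFromModel`, which is imported above but not used for the tree model. This is a sketch on synthetic data: the dataset, `n_estimators=100`, and the choice of 5 features are assumptions. Setting `threshold=-np.inf` makes `max_features` the only selection criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    max_features=5,        # keep the 5 most important features
    threshold=-np.inf,     # disable the importance threshold
).fit(X, y)

mask = selector.get_support()   # boolean mask over the input columns
X_top = selector.transform(X)
print(mask.sum(), X_top.shape)
```

Fixing `random_state` matters here: extremely randomized trees give slightly different importances on each run, so the selected set can change between runs otherwise.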

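- L1-based feature selection   
For SVMs and logistic regression, **the parameter C controls sparsity: the smaller C is, the fewer features are selected.** For Lasso, the larger the parameter alpha is, the fewer features are selected.   
C is the inverse of the regularization strength, so a smaller C means stronger regularization.  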
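
The effect of C on sparsity can be checked on synthetic data by counting the coefficients that stay non-zero. This is a sketch: the dataset, the two C values, and the `1e-5` cutoff for treating a coefficient as zero are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)


def count_nonzero_features(C):
    """Fit an L1-penalized linear SVM and count the non-zero coefficients."""
    model = LinearSVC(
        C=C, penalty="l1", dual=False, max_iter=5000, random_state=0
    ).fit(X, y)
    return int(np.sum(np.abs(model.coef_) > 1e-5))


counts = {C: count_nonzero_features(C) for C in (1e-4, 1.0)}
print(counts)
```

A very small C drives most or all coefficients to zero, while a larger C lets more features through, which is the behaviour described above.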

In [61]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [62]:
lsvc = LinearSVC(C=0.5,penalty = "l1",dual=False)

In [63]:
lsvc.fit(train_features,data_train['label'])

LinearSVC(C=0.5, dual=False, penalty='l1')

In [64]:
model_l1 = SelectFromModel(lsvc)

In [65]:
x_l1 = model_l1.fit_transform(train_features,data_train['label'])

In [66]:
train_features_l1_columns = train_features.columns[model_l1.get_support()]

In [67]:
len(train_features_l1_columns)

578

In [68]:
# # Save the column names chosen by the L1 feature selection
# name = ['train_features_l1_columns']
# tem = pd.DataFrame(columns=name,data=list(train_features_l1_columns))
# tem.to_csv('./train_features_l1_columns.csv')

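# Feature screening (overall)
<1> Take the intersection of the features selected by the relevance, variance, tree-model, and L1 filters  
<2> Filter all_features using feature_columns      
   First collect the column names that appear in all_features_columns but not in feature_columns, and store them as train_all_cha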

In [69]:
# <1>                                   intersection (use .intersection, not .union)
# Intersection: list(set(train_features_var_columns).intersection(set(train_features_fi_columns)))
feature_columns = list(set(train_features_var_columns).intersection(set(train_features_fi_columns)))
feature_columns = list(set(feature_columns).intersection(set(train_features_l1_columns)))

In [70]:
len(feature_columns)

42

In [71]:
feature_columns

['heartbeat_signals__sum_of_reoccurring_values',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_18',
 'heartbeat_signals__absolute_sum_of_changes',
 'heartbeat_signals__fft_coefficient__attr_"angle"__coeff_5',
 'heartbeat_signals__fft_coefficient__attr_"real"__coeff_8',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_0',
 'heartbeat_signals__fft_aggregated__aggtype_"kurtosis"',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_5',
 'heartbeat_signals__longest_strike_above_mean',
 'heartbeat_signals__fft_coefficient__attr_"imag"__coeff_3',
 'heartbeat_signals__fft_coefficient__attr_"angle"__coeff_2',
 'heartbeat_signals__sum_of_reoccurring_data_points',
 'heartbeat_signals__fft_coefficient__attr_"angle"__coeff_3',
 'heartbeat_signals__cid_ce__normalize_True',
 'heartbeat_signals__count_above_mean',
 'heartbeat_signals__sum_values',
 'heartbeat_signals__fft_coefficient__attr_"real"__coeff_3',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_1',
 'heartbeat_s

In [72]:
# <2>
train_all_cha = list(set(all_features_columns) - set(feature_columns))

In [73]:
train_all_cha

['heartbeat_signals__energy_ratio_by_chunks__num_segments_10__segment_focus_7',
 'heartbeat_signals__fft_coefficient__attr_"angle"__coeff_17',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_60',
 'heartbeat_signals__cwt_coefficients__coeff_7__w_10__widths_(2, 5, 10, 20)',
 'heartbeat_signals__agg_linear_trend__attr_"rvalue"__chunk_len_10__f_agg_"min"',
 'heartbeat_signals__fft_coefficient__attr_"real"__coeff_13',
 'heartbeat_signals__fft_coefficient__attr_"imag"__coeff_79',
 'heartbeat_signals__cwt_coefficients__coeff_6__w_5__widths_(2, 5, 10, 20)',
 'heartbeat_signals__fft_coefficient__attr_"angle"__coeff_63',
 'heartbeat_signals__fft_coefficient__attr_"abs"__coeff_56',
 'heartbeat_signals__large_standard_deviation__r_0.8',
 'heartbeat_signals__change_quantiles__f_agg_"var"__isabs_False__qh_0.8__ql_0.2',
 'heartbeat_signals__cwt_coefficients__coeff_8__w_5__widths_(2, 5, 10, 20)',
 'heartbeat_signals__fft_coefficient__attr_"imag"__coeff_14',
 'heartbeat_signals__last_location_

In [74]:
# # Alternative: keep the result of the correlation filter only
# train_all_cha = list(set(all_features.columns) - set(train_features.columns))

In [75]:
all_features = all_features.drop(train_all_cha,axis=1)

In [76]:
all_features

Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__skewness,heartbeat_signals__kurtosis,heartbeat_signals__absolute_sum_of_changes,heartbeat_signals__longest_strike_below_mean,heartbeat_signals__longest_strike_above_mean,heartbeat_signals__count_above_mean,heartbeat_signals__count_below_mean,heartbeat_signals__sum_of_reoccurring_values,heartbeat_signals__sum_of_reoccurring_data_points,...,"heartbeat_signals__fft_coefficient__attr_""angle""__coeff_2","heartbeat_signals__fft_coefficient__attr_""angle""__coeff_3","heartbeat_signals__fft_coefficient__attr_""angle""__coeff_5","heartbeat_signals__fft_aggregated__aggtype_""centroid""","heartbeat_signals__fft_aggregated__aggtype_""variance""","heartbeat_signals__fft_aggregated__aggtype_""kurtosis""",heartbeat_signals__value_count__value_0,"heartbeat_signals__augmented_dickey_fuller__attr_""usedlag""__autolag_""AIC""",heartbeat_signals__number_crossing_m__m_0,heartbeat_signals__permutation_entropy__dimension_7__tau_1
0,38.927945,1.349485,1.908603,4.058359,92.0,72.0,95.0,110.0,6.827155,17.140083,...,66.094067,-100.176273,-66.537168,20.048266,640.867764,5.659187,93.0,15.0,3.0,2.722686
1,19.445634,3.663488,15.174346,4.070173,98.0,44.0,82.0,123.0,2.312404,8.627914,...,17.747878,-153.030423,-83.722326,24.557446,657.447951,6.283027,84.0,4.0,3.0,3.224835
2,21.192974,1.841456,3.868159,2.012112,148.0,51.0,55.0,150.0,4.634713,10.362727,...,-101.623335,-169.149004,-41.923965,25.010491,723.128806,5.731839,149.0,0.0,3.0,1.509478
3,42.113066,1.401586,4.354385,3.823527,60.0,102.0,121.0,84.0,8.769288,22.319213,...,-42.149320,17.948936,-59.680542,21.766021,672.013235,5.526573,61.0,4.0,3.0,3.854177
4,69.756786,0.254199,-1.761625,2.960919,105.0,97.0,97.0,108.0,0.000000,0.000000,...,-143.204956,-97.503099,-96.259369,8.833336,276.246018,11.708390,106.0,15.0,2.0,2.323993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,43.175130,0.873972,-0.034405,3.285724,89.0,84.0,100.0,105.0,10.839504,29.034663,...,-59.848403,-89.599293,-82.318506,21.318383,755.820492,4.619303,71.0,7.0,1.0,3.843586
119996,31.030782,1.446583,1.709305,4.380414,118.0,59.0,76.0,129.0,1.023693,2.385302,...,-163.057245,26.244576,56.435375,23.354989,695.770289,5.519805,119.0,9.0,3.0,2.081946
119997,31.648623,1.542174,3.199598,3.583429,99.0,76.0,83.0,122.0,5.786418,14.988460,...,74.514008,-70.381354,-40.750021,22.138171,679.671776,5.573726,100.0,3.0,3.0,2.663404
119998,19.305442,3.144576,13.361822,4.067139,104.0,72.0,77.0,128.0,2.681428,8.135710,...,91.700297,-47.164106,-19.203040,27.000844,718.096237,6.228419,105.0,6.0,3.0,3.021449


In [77]:
all_features.to_pickle('./205_0.5_250_0.5.pkl')

### Rename the columns of all_features
(This seems to be needed because special characters such as `"` in the column names cause the model to fail.)

In [78]:
# Replace every feature name with its integer position in all_features_columns (single rename call)
all_features.rename(columns={name: i for i, name in enumerate(all_features_columns)}, inplace=True)
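
Once the names are replaced by integers, the original feature names are no longer visible in the frame. A minimal sketch of keeping a reverse lookup is shown here, for instance so that feature importances can later be mapped back to readable names. The toy frame and the names `name_to_id` and `id_to_name` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame whose first column name contains double quotes
toy = pd.DataFrame({
    'sig__fft_coefficient__attr_"abs"__coeff_0': [1.0, 2.0],
    'sig__sum_values': [3.0, 4.0],
})

# Map each original name to its position, and keep the inverse for later lookup
name_to_id = {name: i for i, name in enumerate(toy.columns)}
id_to_name = {i: name for name, i in name_to_id.items()}

renamed = toy.rename(columns=name_to_id)
```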

In [79]:
all_features

Unnamed: 0,4,15,16,18,19,20,21,22,29,30,...,566,567,569,664,665,667,668,739,740,779
0,38.927945,1.349485,1.908603,4.058359,92.0,72.0,95.0,110.0,6.827155,17.140083,...,66.094067,-100.176273,-66.537168,20.048266,640.867764,5.659187,93.0,15.0,3.0,2.722686
1,19.445634,3.663488,15.174346,4.070173,98.0,44.0,82.0,123.0,2.312404,8.627914,...,17.747878,-153.030423,-83.722326,24.557446,657.447951,6.283027,84.0,4.0,3.0,3.224835
2,21.192974,1.841456,3.868159,2.012112,148.0,51.0,55.0,150.0,4.634713,10.362727,...,-101.623335,-169.149004,-41.923965,25.010491,723.128806,5.731839,149.0,0.0,3.0,1.509478
3,42.113066,1.401586,4.354385,3.823527,60.0,102.0,121.0,84.0,8.769288,22.319213,...,-42.149320,17.948936,-59.680542,21.766021,672.013235,5.526573,61.0,4.0,3.0,3.854177
4,69.756786,0.254199,-1.761625,2.960919,105.0,97.0,97.0,108.0,0.000000,0.000000,...,-143.204956,-97.503099,-96.259369,8.833336,276.246018,11.708390,106.0,15.0,2.0,2.323993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,43.175130,0.873972,-0.034405,3.285724,89.0,84.0,100.0,105.0,10.839504,29.034663,...,-59.848403,-89.599293,-82.318506,21.318383,755.820492,4.619303,71.0,7.0,1.0,3.843586
119996,31.030782,1.446583,1.709305,4.380414,118.0,59.0,76.0,129.0,1.023693,2.385302,...,-163.057245,26.244576,56.435375,23.354989,695.770289,5.519805,119.0,9.0,3.0,2.081946
119997,31.648623,1.542174,3.199598,3.583429,99.0,76.0,83.0,122.0,5.786418,14.988460,...,74.514008,-70.381354,-40.750021,22.138171,679.671776,5.573726,100.0,3.0,3.0,2.663404
119998,19.305442,3.144576,13.361822,4.067139,104.0,72.0,77.0,128.0,2.681428,8.135710,...,91.700297,-47.164106,-19.203040,27.000844,718.096237,6.228419,105.0,6.0,3.0,3.021449


In [80]:
# Note: an empty mapping renames nothing, so this call is a no-op.
all_features.rename(columns={},inplace = True)

# 3.4 Model training

## Preparing the training and test data

In [81]:
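# The first 100,000 rows of all_features are the training samples; the remaining rows are the test samples
x_train = all_features.iloc[:100000]
data_train = pd.read_csv("train.csv")
y_train = data_train['label']
x_test = all_features.iloc[100000:]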

In [82]:
# Sanity check: gather every character that occurs in the column names.
# After the renaming step, only digits should remain.
l = ''.join(str(c) for c in x_test.columns)
set(l)

{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}

In [83]:
def abs_sum(y_pre,y_tru):
    # Sum of absolute differences between the predicted probabilities and the one-hot labels
    y_pre=np.array(y_pre)
    y_tru=np.array(y_tru)
    loss=np.abs(y_pre-y_tru).sum()
    return loss
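
A small worked example of this metric, using made-up numbers: two samples, four classes, one-hot labels against predicted probabilities. The first row contributes 0.1 + 0.1 and the second 0.2 + 0.3 + 0.1, so the total is 0.8. A perfect prediction scores 0, and a fully confident wrong prediction costs 2 per sample.

```python
import numpy as np

def abs_sum(y_pre, y_tru):
    # Sum of absolute differences over all samples and classes
    return np.abs(np.array(y_pre) - np.array(y_tru)).sum()

y_true = [[1, 0, 0, 0], [0, 1, 0, 0]]
y_pred = [[0.9, 0.1, 0.0, 0.0], [0.2, 0.7, 0.1, 0.0]]

loss = abs_sum(y_pred, y_true)
worst = abs_sum([[0, 1, 0, 0]], [[1, 0, 0, 0]])  # confident but wrong
```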

In [84]:
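# Imports used by cv_model and its caller (harmless if already imported earlier in the notebook)
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder
import lightgbm as lgb
# clf is short for "classifier" (here the lightgbm module is passed in)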
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    
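    # Output matrix for the test set: one row of 4 class probabilities per sample, initialised to [0, 0, 0, 0]
    test = np.zeros((test_x.shape[0],4))

    # Training-set losses, one per fold
    train_scores = []
    # Cross-validation (validation-set) losses, one per fold
    cv_scores = []
    # Note: scikit-learn >= 1.2 renames this argument to sparse_output
    onehot_encoder = OneHotEncoder(sparse=False)
    
    # Split the training set into k folds; i is the zero-based index, so this is fold i+1.
    # Because shuffle=True, samples are assigned to folds at random.
    # train_index: indices of the samples in the k-1 folds used for training
    # valid_index: indices of the samples in the remaining fold, used to compute the validation error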
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):

        # Print a banner for fold i+1
        print('************************************ {} ************************************'.format(str(i+1)))

        # Split the training set into the part actually used for training (k-1 folds) and the held-out validation part (1 fold)
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]

        # LightGBM model
        if clf_name == "lgb":

            # Training samples
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            # Held-out validation samples taken from the training set
            valid_matrix = clf.Dataset(val_x, label=val_y)

            # Parameter settings

            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',       # task type: multiclass classification
                'num_class': 4,                  # number of classes
                'num_leaves': 2**6,               # maximum number of leaves  (138)
                'feature_fraction': 0.93,        # originally 0.8
                'bagging_fraction': 0.64,        # originally 0.8
                'bagging_freq': 5,              # perform bagging every 5 iterations
                'learning_rate': 0.01,           # learning rate, originally 0.1
                'seed': seed,
                'nthread': 10,                   # number of threads
                'verbose': -1,
                #'lambda_L1':0.4,                # newly added L1
                #'lambda_L2':0.5,                # newly added L2
                'min_data_in_leaf':43 ,           # minimum number of records a leaf may have
                'max_depth': 11,                # best value found: 11
                'min_child_weight':6.5,
                'reg_lambda': 7,
                'reg_alpha': 0.21,
                'min_split_gain': 0.288,
            }

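            # Model
            # Note: LightGBM >= 4.0 removes verbose_eval and early_stopping_rounds from train();
            # there, pass callbacks=[clf.log_evaluation(1000), clf.early_stopping(200)] instead.
            model = clf.train(params, 
                      train_set=train_matrix,    # training samples
                      valid_sets=valid_matrix,   # validation samples
                      num_boost_round=4833,      # number of boosting rounds, originally 2000
                      verbose_eval=1000,          # print the evaluation result every 1000 rounds
                      early_stopping_rounds=200) # stop if the validation score has not improved for 200 rounds
            
            # Predictions for the training, validation, and test sets follow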
        # Predictions on the training folds
        train_pred = model.predict(trn_x, num_iteration=model.best_iteration)
        # Predictions on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predictions on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration) 

        # Reshape the training labels into a single column, then one-hot encode them
        trn_y=np.array(trn_y).reshape(-1, 1)
        trn_y = onehot_encoder.fit_transform(trn_y)

        # Reshape the validation labels into a single column, then one-hot encode them
        val_y=np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        print(test_pred)

        # Add this fold's test predictions to test; the predictions of all folds accumulate here
        test += test_pred

        # Training-set loss for this fold
        train_score=abs_sum(trn_y, train_pred)
        train_scores.append(train_score)
        print('Training-set loss of each fold so far:')
        print(train_scores)
        # Evaluation metric on the validation fold
        score=abs_sum(val_y, val_pred)
        cv_scores.append(score)
        t = i+1
        print('Validation-set losses up to fold %s:' % t,cv_scores)
        print()
        
        # Training-set loss of each fold so far
        print("%s_train_score_list:" % clf_name, train_scores)   
        # Validation-set loss of each fold so far
        print("%s_valid_score_list:" % clf_name, cv_scores)

        # Running mean of the losses
        print("%s_train_score_mean:" % clf_name, np.mean(train_scores))
        print("%s_valid_score_mean:" % clf_name, np.mean(cv_scores))
    # Standard deviation of the per-fold losses, printed once after all folds
    print("%s_train_score_std:" % clf_name, np.std(train_scores))
    print("%s_valid_score_std:" % clf_name, np.std(cv_scores))
    
    # Average the accumulated predictions of the k fold models
    test=test/kf.n_splits

    return test

In [85]:
def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test
lgb_test = lgb_model(x_train, y_train, x_test)

************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[1000]	valid_0's multi_logloss: 0.0596633
[2000]	valid_0's multi_logloss: 0.0508416
[3000]	valid_0's multi_logloss: 0.0496905
[4000]	valid_0's multi_logloss: 0.049496
Did not meet early stopping. Best iteration is:
[4830]	valid_0's multi_logloss: 0.0493642
Predicted probability matrix:
[[9.96802305e-01 2.99739793e-03 5.98794431e-05 1.40417862e-04]
 [6.87207730e-04 1.57173813e-03 9.97717661e-01 2.33928784e-05]
 [1.36013335e-05 8.98495441e-06 8.66717883e-05 9.99890742e-01]
 ...
 [8.87399778e-01 1.11311558e-02 1.01195180e-01 2.73885785e-04]
 [9.98655682e-01 1.16171297e-03 1.46743700e-04 3.58613760e-05]
 [7.84694438e-01 2.54469621e-03 4.09107853e-02 1.71850081e-01]]
Training-set loss of each fold so far:
[1631.1934838518962]
Validation-set losses up to fold 1: [963.2090574292487]

lgb_train_score_list: [1631.1934838518962]
lgb_valid_score_list: [963.2090574292487]
lgb_train_score_mean: 1631.1934838518962
lgb_valid_sc
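
The fold-averaging scheme used in `cv_model` can be sketched without LightGBM. The version below is an illustration under stated assumptions, not the notebook's model: it swaps in scikit-learn's `LogisticRegression` on synthetic four-class data, and the names `X_tr`, `y_tr`, `X_te`, `test_proba`, and `fold_losses` are hypothetical. Each fold trains on k-1 parts, scores the held-out part with the absolute-sum loss, and adds its test-set probabilities to a running total that is divided by k at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic 4-class data (hypothetical stand-in for the heartbeat features)
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, y_tr, X_te = X[:300], y[:300], X[300:]

kf = KFold(n_splits=5, shuffle=True, random_state=2021)
test_proba = np.zeros((X_te.shape[0], 4))
fold_losses = []

for train_index, valid_index in kf.split(X_tr):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[train_index], y_tr[train_index])

    # One-hot encode the validation labels so they can be compared with probabilities
    val_onehot = np.eye(4)[y_tr[valid_index]]
    val_proba = model.predict_proba(X_tr[valid_index])
    fold_losses.append(np.abs(val_proba - val_onehot).sum())

    # Accumulate this fold's test-set probabilities
    test_proba += model.predict_proba(X_te)

# Average over the k fold models
test_proba /= kf.n_splits
```

Because each fold's output rows sum to 1, the averaged rows also sum to 1, so the result can be used directly as a probability submission.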

In [86]:
temp=pd.DataFrame(lgb_test)
result=pd.read_csv('sample_submit.csv')
result['label_0']=temp[0]
result['label_1']=temp[1]
result['label_2']=temp[2]
result['label_3']=temp[3]
result.to_csv('./submit25.csv',index=False)

In [87]:
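# Post-processing: when a row's largest class probability exceeds 0.9,
# snap that row to a hard 0/1 prediction.
data = result   # alias, not a copy: result is modified as well
prob_cols = data.columns[1:5]
confident = data[prob_cols].max(axis=1) > 0.9
data.loc[confident, prob_cols] = (data.loc[confident, prob_cols] > 0.9).astype(float)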

In [88]:
data.to_csv('./submit25_youhua.csv',index = False)
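
The thresholding step can also be written on a plain NumPy array, which makes its effect easy to inspect. This is a sketch with made-up probabilities. The names `proba`, `threshold`, and `snapped` are assumptions for illustration. Under the absolute-sum metric, snapping a confident row such as [0.96, 0.02, 0.01, 0.01] lowers its loss from 0.08 to 0 when the top class is correct and raises it from 1.96 to 2 when it is wrong, so the step should pay off only if confident predictions are nearly always right.

```python
import numpy as np

proba = np.array([
    [0.96, 0.02, 0.01, 0.01],   # confident: snapped to one-hot
    [0.50, 0.30, 0.15, 0.05],   # not confident: left unchanged
])
threshold = 0.9

# Rows whose largest probability exceeds the threshold
confident = proba.max(axis=1) > threshold

snapped = proba.copy()
snapped[confident] = (proba[confident] > threshold).astype(float)
```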