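The five steps of data mining:
1. Acquire the data
2. Preprocess the data (adapt the data to the model's requirements)
3. Feature engineering (reduce computational cost and raise the model's performance ceiling)
4. Build the model, test it, and make predictions
5. Deploy the model and validate its performance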

I. Preprocessing & Impute

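1. Feature scaling (making data dimensionless) can be linear or nonlinear. Linear scaling consists of centering and rescaling. Centering subtracts a fixed value from every record, which shifts the data to a new position. Rescaling divides by a fixed value, which confines the data to a given range. Taking the logarithm is also a form of scaling, a nonlinear one.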

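2. When data x is centered on its minimum and then scaled by its range (maximum - minimum), the data is shifted by the minimum and confined to [0, 1]. This is called normalization (min-max scaling). It should not be confused with regularization, which is not a data preprocessing technique. Normalization does not make the data normally distributed: it is a linear transformation, so the shape of the distribution stays the same.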
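A minimal sketch of why min-max normalization leaves the shape of a distribution untouched. The sample is hypothetical and deliberately right-skewed, and the `skewness` helper is defined here for illustration only.

```python
import numpy as np

# Hypothetical, clearly right-skewed sample (not taken from the notebook's data).
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 50.0])

# Min-max normalization: shift by the minimum, divide by the range.
x_nor = (x - x.min()) / (x.max() - x.min())

def skewness(v):
    """Third standardized moment; 0 for a symmetric distribution."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

print(x_nor.min(), x_nor.max())        # the values now span [0, 1]
print(skewness(x), skewness(x_nor))    # the two skewness values agree
```

Because the transformation is linear, a skewed input stays exactly as skewed after scaling; only the location and the spread change.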

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [2]:
data = [[-1,2],[-0.5,6],[0,10],[1,18]]
pd.DataFrame(data)

Unnamed: 0,0,1
0,-1.0,2
1,-0.5,6
2,0.0,10
3,1.0,18


In [4]:
# Normalize the data
scaler = MinMaxScaler()  # instantiate
scaler = scaler.fit(data)  # fit computes min(x) and max(x) for each column
result = scaler.transform(data)  # export the result through the transform interface

In [5]:
result

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [6]:
# result_ = scaler.fit_transform(data)  # fit and transform in one step

In [7]:
scaler.inverse_transform(result)  # invert the normalization

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

In [8]:
# Use the feature_range parameter of MinMaxScaler to normalize into a range other than [0, 1]
data = [[-1,2],[-0.5,6],[0,10],[1,18]]
scaler = MinMaxScaler(feature_range=(5,10))  # instantiate again; feature_range is passed as a tuple
result = scaler.fit_transform(data)

In [9]:
result

array([[ 5.  ,  5.  ],
       [ 6.25,  6.25],
       [ 7.5 ,  7.5 ],
       [10.  , 10.  ]])

In [16]:
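# When X contains a very large number of samples, fit may fail because the data cannot be processed in one go
# In that case, use partial_fit as the training interface
# scaler = scaler.partial_fit(data)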
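A small sketch of how `partial_fit` can be used in practice, assuming the data arrives in chunks. `X_big` is a hypothetical stand-in for a dataset that would not fit in memory; here it is small enough to compare against a single `fit`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_big = rng.normal(size=(1000, 3))   # hypothetical stand-in for a large dataset

# Reference: fit on everything at once.
full = MinMaxScaler().fit(X_big)

# Incremental: feed the same rows in 10 chunks.
incremental = MinMaxScaler()
for chunk in np.array_split(X_big, 10):
    incremental.partial_fit(chunk)

print(np.allclose(full.data_min_, incremental.data_min_))
print(np.allclose(full.data_max_, incremental.data_max_))
```

Both scalers end up with the same per-column minimum and maximum, so they transform data identically.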

Normalization with NumPy

In [12]:
# Implement normalization with NumPy
import numpy as np
X = np.array([[-1,2],[-0.5,6],[0,10],[1,18]])

In [13]:
X

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

In [12]:
# Normalize
X_nor = (X - X.min(axis=0))/(X.max(axis=0) - X.min(axis=0))  # no real speed advantage over sklearn

In [13]:
X_nor

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [14]:
# Invert the normalization
X_returned = X_nor * (X.max(axis=0) - X.min(axis=0)) + X.min(axis=0)  

In [15]:
X_returned

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

StandardScaler

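When data x is centered on its mean and then scaled by its standard deviation, the result has mean 0 and variance 1. This process is called standardization (Z-score normalization). Standardization does not change the shape of the distribution: the result is standard normal only if the original data was normally distributed.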
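The definition can be written directly in NumPy. This is a sketch on the same four-row toy matrix used in this notebook; it relies on `np.std` using the population formula (`ddof=0`) by default.

```python
import numpy as np

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Standardization: subtract the column mean, divide by the column standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std)
print(X_std.mean(axis=0), X_std.std(axis=0))   # per-column mean and std after scaling
```

Each column ends up with mean 0 and standard deviation 1, up to floating-point error.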

In [17]:
from sklearn.preprocessing import StandardScaler
data = [[-1,2],[-0.5,6],[0,10],[1,18]]

In [18]:
scaler = StandardScaler()
scaler.fit(data)  # fit computes the mean and variance of each column

StandardScaler(copy=True, with_mean=True, with_std=True)

In [19]:
scaler.mean_  # view the per-column means

array([-0.125,  9.   ])

In [20]:
scaler.var_  # view the per-column variances

array([ 0.546875, 35.      ])

In [21]:
x_std = scaler.transform(data)  # export the result through the transform interface

In [22]:
x_std  # the standardized result

array([[-1.18321596, -1.18321596],
       [-0.50709255, -0.50709255],
       [ 0.16903085,  0.16903085],
       [ 1.52127766,  1.52127766]])

In [23]:
x_std.mean()

0.0

In [24]:
x_std.std()

1.0

In [25]:
scaler.fit_transform(data)  

array([[-1.18321596, -1.18321596],
       [-0.50709255, -0.50709255],
       [ 0.16903085,  0.16903085],
       [ 1.52127766,  1.52127766]])

In [26]:
scaler.inverse_transform(x_std)

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

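For both StandardScaler and MinMaxScaler, NaN is treated as a missing value: it is ignored during fit and kept as NaN in the output of transform.
For most machine learning algorithms, such as PCA, clustering, SVM, and logistic regression, StandardScaler is the usual choice because MinMaxScaler is very sensitive to outliers. MinMaxScaler is mainly used when no distance metrics, gradients, or covariance computations are involved and the data has to be compressed into a specific interval.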
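A short sketch of the NaN behavior described above, on a hypothetical toy matrix (`X_nan` is not part of the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy matrix with one missing entry in the second column.
X_nan = np.array([[1.0, np.nan],
                  [2.0, 4.0],
                  [3.0, 6.0]])

std_scaler = StandardScaler().fit(X_nan)
print(std_scaler.mean_)               # the second mean is computed from 4 and 6 only

X_scaled = MinMaxScaler().fit_transform(X_nan)
print(X_scaled)                       # the missing entry is still NaN
```

The scalers do not impute anything, so a separate imputation step is still needed before most estimators will accept the data.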

In [1]:
import pandas as pd
data = pd.read_csv(r"D:\titanic\Narrativedata.csv",index_col=0)  # index_col=0 uses the first column as the index
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,No
1,38.0,female,C,Yes
2,26.0,female,S,Yes
3,35.0,female,S,Yes
4,35.0,male,S,No


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         714 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null object
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


In [3]:
data.isnull().sum()

Age         177
Sex           0
Embarked      2
Survived      0
dtype: int64

In [4]:
Age = data.loc[:,"Age"].values.reshape(-1,1)  

In [5]:
Age[:20]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.],
       [nan],
       [54.],
       [ 2.],
       [27.],
       [14.],
       [ 4.],
       [58.],
       [20.],
       [39.],
       [14.],
       [55.],
       [ 2.],
       [nan],
       [31.],
       [nan]])

In [18]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()  # the default strategy fills with the mean
imp_median = SimpleImputer(strategy="median")
imp_0 = SimpleImputer(strategy="constant",fill_value=0)

In [19]:
imp_mean = imp_mean.fit_transform(Age)
imp_median = imp_median.fit_transform(Age)
imp_0 = imp_0.fit_transform(Age)

In [21]:
# imp_mean[:20]

In [22]:
# imp_median[:20]

In [23]:
# imp_0[:20]

In [21]:
# Fill Age with the median here
data.loc[:,"Age"] = imp_median
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         891 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null object
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


In [22]:
Embarked = data.loc[:,"Embarked"].values.reshape(-1,1)

In [23]:
imp_mode = SimpleImputer(strategy="most_frequent")
data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         891 non-null float64
Sex         891 non-null object
Embarked    891 non-null object
Survived    891 non-null object
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


In [35]:
# # Fill missing values with NumPy & pandas
# import pandas as pd
# data_ = pd.read_csv(r"D:\titanic\Narrativedata.csv",index_col=0)
# data_.head()

In [36]:
# data_.loc[:,"Age"] = data_.loc[:,"Age"].fillna(data_.loc[:,"Age"].median())
# # .fillna fills missing values directly in the DataFrame

# data_.dropna(axis=0,inplace=True)
# # .dropna(axis=0) drops every row that has a missing value; .dropna(axis=1) drops every column that has one
# # inplace=True modifies the original dataset; inplace=False returns a modified copy and leaves the original untouched (default: False)

Encoding: preprocessing.LabelEncoder is for labels only; it converts class labels into integer codes.

In [25]:
from sklearn.preprocessing import LabelEncoder
y = data.iloc[:,-1]  # the input is the label, not the feature matrix, so a 1-D array is allowed

In [26]:
le = LabelEncoder()
le = le.fit(y)
label = le.transform(y)

In [56]:
# label

In [27]:
le.classes_  # the classes_ attribute lists the distinct classes found in the label

array(['No', 'Unknown', 'Yes'], dtype=object)

In [48]:
# le.fit_transform(y)  # fit and transform in one step

In [57]:
# le.inverse_transform(label)  # invert the encoding

In [28]:
data.iloc[:,-1] = label

In [29]:
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,2
2,26.0,female,S,2
3,35.0,female,S,2
4,35.0,male,S,0


In [61]:
# # Shorthand for the steps above
# from sklearn.preprocessing import LabelEncoder
# data.iloc[:,-1] = LabelEncoder().fit_transform(data.iloc[:,-1])

preprocessing.OrdinalEncoder is for features only; it converts categorical features into integer codes.

In [32]:
from sklearn.preprocessing import OrdinalEncoder

In [33]:
data_ = data.copy()
# data_.head()

In [64]:
OrdinalEncoder().fit(data_.iloc[:,1:-1]).categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

In [34]:
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])

In [35]:
data_.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,1.0,2.0,0
1,38.0,0.0,0.0,2
2,26.0,0.0,2.0,2
3,35.0,0.0,2.0,2
4,35.0,1.0,2.0,0


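preprocessing.OneHotEncoder: one-hot encoding creates dummy variables; it is meant for nominal categories that are independent of one another and have no inherent order.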

In [30]:
from sklearn.preprocessing import OneHotEncoder
X = data.iloc[:,1:-1]

In [37]:
enc = OneHotEncoder(categories="auto").fit(X)
result = enc.transform(X).toarray()

In [81]:
# enc.transform(X).toarray()

In [38]:
result

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])

In [83]:
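# OneHotEncoder(categories="auto").fit_transform(X).toarray()  # fit and transform in one step

In [85]:
# pd.DataFrame(enc.inverse_transform(result))  # restore the original categories

In [86]:
enc.get_feature_names()  # removed in scikit-learn 1.2; newer versions use enc.get_feature_names_out()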

array(['x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S'], dtype=object)

In [39]:
# axis=1 concatenates column-wise, joining the two tables side by side; axis=0 would stack them vertically
newdata = pd.concat([data,pd.DataFrame(result)],axis=1)

In [40]:
newdata.head()

Unnamed: 0,Age,Sex,Embarked,Survived,0,1,2,3,4
0,22.0,male,S,0,0.0,1.0,0.0,0.0,1.0
1,38.0,female,C,2,1.0,0.0,1.0,0.0,0.0
2,26.0,female,S,2,1.0,0.0,0.0,0.0,1.0
3,35.0,female,S,2,1.0,0.0,0.0,0.0,1.0
4,35.0,male,S,0,0.0,1.0,0.0,0.0,1.0


In [41]:
newdata.drop(["Sex","Embarked"],axis=1,inplace=True)

In [42]:
newdata.columns = ["Age","Survived","Female","Male","Embarked_C","Embarked_Q","Embarked_S"]
newdata.head()

Unnamed: 0,Age,Survived,Female,Male,Embarked_C,Embarked_Q,Embarked_S
0,22.0,0,0.0,1.0,0.0,0.0,1.0
1,38.0,2,1.0,0.0,1.0,0.0,0.0
2,26.0,2,1.0,0.0,0.0,0.0,1.0
3,35.0,2,1.0,0.0,0.0,0.0,1.0
4,35.0,0,0.0,1.0,0.0,0.0,1.0


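sklearn.preprocessing.LabelBinarizer can create dummy variables for labels, but labels are usually not transformed this way.

Handling continuous features: binarization and binning. sklearn.preprocessing.Binarizer binarizes data according to a threshold (values above the threshold become 1, all others become 0) and is used on continuous variables.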

In [49]:
# Binarize Age
data_2 = newdata.copy()

In [50]:
from sklearn.preprocessing import Binarizer  # feature transformers do not accept 1-D arrays

In [51]:
X = data_2.iloc[:,0].values.reshape(-1,1)  # the class is for features only, so reshape the column to 2-D

In [52]:
transformer = Binarizer(threshold=30).fit_transform(X)

In [53]:
data_2.iloc[:,0] = transformer

In [54]:
data_2.head()

Unnamed: 0,Age,Survived,Female,Male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,0,0.0,1.0,0.0,0.0,1.0
1,1.0,2,1.0,0.0,1.0,0.0,0.0
2,0.0,2,1.0,0.0,0.0,0.0,1.0
3,1.0,2,1.0,0.0,0.0,0.0,1.0
4,1.0,0,0.0,1.0,0.0,0.0,1.0


In [55]:
# preprocessing.KBinsDiscretizer(n_bins  encode  strategy  )

In [56]:
from sklearn.preprocessing import KBinsDiscretizer  # bins a continuous variable into categories and encodes the bins in order

In [57]:
X = data.iloc[:,0].values.reshape(-1,1)

In [58]:
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,0
1,38.0,female,C,2
2,26.0,female,S,2
3,35.0,female,S,2
4,35.0,male,S,0


In [2]:
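est = KBinsDiscretizer(n_bins=3,encode='ordinal',strategy='uniform')  # must be defined before the next cell uses it
# est.fit_transform(X)

In [62]:
set(est.fit_transform(X).ravel())  # deduplicate
# Inspect the bins after the transform: a single column holding three bins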

{0.0, 1.0, 2.0}

In [63]:
est = KBinsDiscretizer(n_bins=3,encode='onehot',strategy='uniform')

In [64]:
est.fit_transform(X).toarray()

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       ...,
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])
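To see where the `uniform` strategy actually places its cut points, the fitted `bin_edges_` attribute can be inspected. The ages below are hypothetical values chosen so the edges are easy to verify by hand; `est_demo` is a separate instance so it does not overwrite `est`.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages, chosen so the bin edges are easy to check by hand.
ages = np.array([2.0, 14.0, 22.0, 38.0, 80.0]).reshape(-1, 1)

est_demo = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = est_demo.fit_transform(ages).ravel()

print(est_demo.bin_edges_[0])   # equal-width edges between the minimum and maximum
print(codes)                    # bin index assigned to each age
```

With equal-width bins, a single large value stretches every bin, so most samples can land in the first bin. The `quantile` strategy is the usual alternative when roughly equal bin counts are wanted.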

Data preprocessing:
preprocessing:    
   
MinMaxScaler     
StandardScaler     
   
LabelEncoder   
OrdinalEncoder   
Binarizer   
OneHotEncoder   
KBinsDiscretizer   
   
impute.SimpleImputer(strategy="mean/median/most_frequent/constant")   


Feature engineering: feature selection, feature extraction, and feature creation

In [None]:
import pandas as pd
data = pd.read_csv(r"D:\titanic\digit_recognizor.csv")

In [70]:
data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
X = data.iloc[:,1:]
y = data.iloc[:,0]

Filter methods (all features --> best feature subset --> algorithm --> model evaluation)


In [8]:
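# Variance filtering with VarianceThreshold (removes features whose variance is 0 or very small)
from sklearn.feature_selection import VarianceThreshold

In [9]:
selector = VarianceThreshold()  # instantiate; with no argument the threshold defaults to 0

In [10]:
X_var0 = selector.fit_transform(X)  # new feature matrix with the unqualified features removed

In [76]:
# This can also be written directly as X = VarianceThreshold().fit_transform(X)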
X_var0.shape

(42000, 708)

In [77]:
pd.DataFrame(X_var0).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,698,699,700,701,702,703,704,705,706,707
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
import numpy as np

In [82]:
np.median(X.var().values)  # median of the feature variances

1352.286703180131

In [12]:
X_fsvar = VarianceThreshold(np.median(X.var().values)).fit_transform(X)

In [84]:
X_fsvar.shape 

(42000, 392)
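One detail worth keeping in mind when a threshold is taken from `X.var()`: pandas computes the sample variance (`ddof=1`) by default, while `VarianceThreshold`, to my understanding, compares against the population variance (`ddof=0`). With 42,000 rows the difference is negligible, but on small data it can matter. The sketch below uses a hypothetical four-row frame and a manual NumPy filter rather than sklearn.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-sample, 3-feature frame; column "c" is constant.
df_demo = pd.DataFrame({"a": [0, 0, 0, 4], "b": [1, 2, 3, 4], "c": [7, 7, 7, 7]})

sample_var = df_demo.var()                        # pandas default: ddof=1
population_var = df_demo.to_numpy().var(axis=0)   # NumPy default: ddof=0

# Manual variance filter: keep the columns whose variance exceeds the threshold.
threshold = 0.0
kept = df_demo.loc[:, population_var > threshold]

print(sample_var.values, population_var)
print(list(kept.columns))
```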

In [85]:
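# If a feature is a Bernoulli random variable, its variance is p(1-p); with p=0.8, a binary feature is removed when one category accounts for more than 80% of the samples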
X_bvar = VarianceThreshold(0.8*(1-0.8)).fit_transform(X)

In [86]:
X_bvar.shape

(42000, 685)
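The threshold `0.8*(1-0.8)` comes from the variance of a Bernoulli variable, p(1-p). A quick check on two hypothetical binary columns:

```python
import numpy as np

p = 0.8
threshold = p * (1 - p)                      # about 0.16

# Hypothetical binary features: 90% ones vs. 60% ones.
mostly_ones = np.array([1] * 9 + [0] * 1)
mixed = np.array([1] * 6 + [0] * 4)

for name, col in [("mostly_ones", mostly_ones), ("mixed", mixed)]:
    var = col.var()                          # equals q * (1 - q) for a 0/1 column
    print(name, var, "keep" if var > threshold else "drop")
```

The column dominated by one value falls below the threshold and would be dropped; the more balanced column stays.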

In [87]:
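# Effect of variance filtering on the model (KNN & RandomForest): compare accuracy and running time
# KNN has to go through every feature of every sample, so it takes a long time

In [91]:
# KNN vs. random forest under different variance filters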
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import cross_val_score
import numpy as np

In [88]:
# X = data.iloc[:,1:]
# y = data.iloc[:,0]

# X_fsvar = VarianceThreshold(np.median(X.var().values)).fit_transform(X)

# cross_val_score(KNN(),X,y,cv=5).mean()  # takes about 35 mins to run
# # 0.9658569700264943

# # IPython magic command: %%timeit measures how long the code in this cell takes
# # To measure the time, the cell is run several times (7 runs) and averaged, so %%timeit takes far longer than a single run of the cell

# %%timeit
# cross_val_score(KNN(),X,y,cv=5).mean()   # 4 hours

In [89]:
# # KNN after variance filtering
# cross_val_score(KNN(),X_fsvar,y,cv=5).mean() # 20 mins
# # 0.96599974
# %%timeit
# cross_val_score(KNN(),X_fsvar,y,cv=5).mean() # 2 hours

In [92]:
# Random forest before variance filtering
cross_val_score(RFC(n_estimators=10,random_state=0),X,y,cv=5).mean()

0.9380003861799541

In [95]:
%%timeit
cross_val_score(RFC(n_estimators=10,random_state=0),X,y,cv=5).mean()

15.6 s ± 683 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [96]:
# Random forest after variance filtering
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsvar,y,cv=5).mean()

0.9388098166696807

In [97]:
%%timeit
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsvar,y,cv=5).mean()

15.7 s ± 260 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


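Summary: the random forest is slightly less accurate than KNN, but its running time is less than 1% of KNN's. After variance filtering, the random forest's accuracy rises marginally while its running time barely changes.

Filter methods therefore mainly target algorithms that have to traverse all features or raise dimensionality. The goal is to reduce computational cost while keeping the algorithm's performance.

Correlation filtering: chi-square, F-test, mutual information

1. Chi-square test: correlation filtering for discrete labels, that is, classification problems. sklearn's chi2 requires non-negative features. Variance filtering is usually applied first; the features are then ranked by chi-square score from high to low, and the chi-square filter is kept when it improves model performance.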

In [2]:
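# NOTE: this imports a regressor under the alias RFC, so cross_val_score in In [15] returns R^2, not accuracy
# For a like-for-like comparison with the earlier scores, import RandomForestClassifier instead
from sklearn.ensemble import RandomForestRegressor as RFC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest  # keeps the k highest-scoring features
from sklearn.feature_selection import chi2   # chi-square test

In [13]:
# Suppose we already know that we need 300 features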
X_fschi = SelectKBest(chi2,k=300).fit_transform(X_fsvar,y)

In [14]:
X_fschi.shape

(42000, 300)

In [15]:
cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()

0.8494929643973098

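The score is lower than before. Taken at face value, this suggests that k=300 removed features the model needs, so k is too small: either adjust k or drop the chi-square filter. Note, however, that RFC in In [2] is an alias for RandomForestRegressor, so this score is an R² value and is not directly comparable with the accuracies reported earlier.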

In [16]:
# Learning curve
# import matplotlib.pyplot as plt
# score_ = []
# for i in range(200,400,10):
#     X_fschi = SelectKBest(chi2,k=i).fit_transform(X_fsvar,y)
#     score = cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()
#     score_.append(score)
    
# plt.plot(range(200,400,10),score_)
# plt.show()

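The chi-square test assesses the difference between two groups of data. Its null hypothesis is that the two groups are independent of each other. The test returns two statistics, the chi-square value and the p-value. When the chi-square value is large and p <= 0.05 (or 0.01), we reject the null hypothesis and accept the alternative: the two groups are related.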
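A hand computation of the chi-square statistic for a single feature, as a sketch of what the score measures. It follows the observed-versus-expected scheme that, to my understanding, sklearn's `chi2` also uses. The data is hypothetical, and the closed-form p-value below holds only for one degree of freedom (two classes).

```python
import math
import numpy as np

# Hypothetical toy data: one non-negative count feature, two classes.
x = np.array([3.0, 1.0, 0.0, 0.0])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
observed = np.array([x[y == c].sum() for c in classes])     # feature total per class
class_prob = np.array([(y == c).mean() for c in classes])   # share of samples per class
expected = class_prob * x.sum()                             # totals if x were independent of y

chi2_stat = ((observed - expected) ** 2 / expected).sum()

# With 2 classes there is 1 degree of freedom, for which the chi-square
# survival function has the closed form erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(chi2_stat, p_value)
```

Here all of the feature's mass sits in class 0, so the observed totals are far from the expected even split and the p-value falls just below 0.05.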

In [17]:
chivalue,pvalues_chi = chi2(X_fsvar,y)

In [23]:
chivalue.shape[0]

392

In [24]:
(pvalues_chi > 0.05).sum()

0

In [3]:
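# pvalues_chi  # all p-values are 0, which indicates that every feature is related to the label

In [25]:
# How to choose k? We want to eliminate every feature whose p-value exceeds a chosen level such as 0.05 or 0.01:
k = chivalue.shape[0] - (pvalues_chi > 0.05).sum()  # number of features to keep

In [26]:
# X_fschi = SelectKBest(chi2,k=<chosen k>).fit_transform(X_fsvar,y)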
# cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()

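The F-test (ANOVA, analysis of variance) captures the linear relationship between each feature and the label. It works for both regression and classification (feature_selection.f_classif & feature_selection.f_regression).
The null hypothesis is that there is no significant linear relationship in the data. As with chi-square, features with p <= 0.05 (or 0.01) are considered significantly linearly related to the label; the others are removed.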
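For the classification case, the F statistic is the ratio of between-class to within-class variance of a feature. A minimal one-feature sketch on hypothetical values:

```python
import numpy as np

# Hypothetical feature values for two classes.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
grand_mean = x.mean()

# Between-class and within-class sums of squares.
ss_between = sum((y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)

df_between = len(classes) - 1
df_within = len(x) - len(classes)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F means the class means differ a lot relative to the spread inside each class, which is why such a feature is kept.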

In [27]:
from sklearn.feature_selection import f_classif

In [28]:
F,pvalues_f = f_classif(X_fsvar,y)

In [30]:
# F

In [32]:
# pvalues_f

In [33]:
k = F.shape[0] - (pvalues_f > 0.05).sum()

In [34]:
k

392

In [35]:
# X_fsF = SelectKBest(f_classif,k=<chosen k>).fit_transform(X_fsvar,y)
# cross_val_score(RFC(n_estimators=10,random_state=0),X_fsF,y,cv=5).mean()

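Mutual information: a filter method that captures any kind of relationship (linear or nonlinear) between each feature and the label. It works for both regression and classification
(feature_selection.mutual_info_classif & feature_selection.mutual_info_regression).
Parameters and usage are the same as for the F-test. Mutual information does not return a p-value or an F-like statistic; it returns an estimate of the mutual information between each feature and the target. The estimate is non-negative: 0 means the two variables are independent, and larger values mean stronger dependence.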
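A plug-in computation of mutual information for discrete variables, as a sketch of what the quantity means. sklearn's estimators use a different, nearest-neighbor based estimate for continuous features, so this `mutual_info` helper is an illustration only, and the arrays are hypothetical.

```python
import numpy as np

def mutual_info(a, b):
    """Plug-in mutual information (in nats) between two discrete arrays."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

labels = np.array([0, 1, 2, 3] * 5)          # hypothetical 4-class label
copy_of_labels = labels.copy()               # fully determined by the label
unrelated = np.repeat([0, 1, 0, 1, 0], 4)    # independent of the label by construction

print(mutual_info(copy_of_labels, labels))   # close to ln(4), which is above 1
print(mutual_info(unrelated, labels))        # close to 0
```

The first value shows that mutual information measured in nats is not capped at 1, so scores are best compared with each other rather than against a fixed scale.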

In [36]:
from sklearn.feature_selection import mutual_info_classif as MIC

In [37]:
result = MIC(X_fsvar,y)

In [39]:
(result > 0).sum()

392

In [40]:
k = result.shape[0] - sum(result <= 0)

In [41]:
# X_fsF = SelectKBest(MIC,k=<chosen k>).fit_transform(X_fsvar,y)
# cross_val_score(RFC(n_estimators=10,random_state=0),X_fsF,y,cv=5).mean()

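Embedded methods (feature_selection.SelectFromModel)
    Let the algorithm itself decide which features to use, so feature selection and model training happen at the same time.
    They are an evolution of filter methods and can do everything filter methods do.
    They bring in an algorithm to pick features and use all features on every run, so they are slow and computationally expensive.

estimator (the model to use; any model with a feature_importances_ or coef_ attribute, or with an L1/L2 penalty, can be used)
threshold (the feature-importance threshold; features whose importance falls below it are removed)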

In [42]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC

In [43]:
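RFC_ = RFC(n_estimators=10,random_state=0)  # instantiate the random forest

In [46]:
X_embedded = SelectFromModel(RFC_,threshold=0.005).fit_transform(X,y)  # instantiate the embedded selector and apply it

In [None]:
# For data with 784 features, 0.005 is a very high threshold, because each feature gets only about 0.0013 (1/784) of the importance on average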
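The per-feature average follows from the fact that a random forest's importances are normalized to sum to 1. A quick check on hypothetical synthetic data (20 features, not the digit dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data with 20 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_

print(importances.sum())                         # importances sum to 1
print(importances.mean(), 1 / X_demo.shape[1])   # so the mean is 1 / n_features
```

Any threshold well above 1 / n_features therefore keeps only the features that are much more important than average.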

In [45]:
X_embedded.shape

(42000, 47)

In [47]:
# The dimensionality drops sharply
# As before, we can plot a learning curve to find the best threshold

In [48]:
# Learning curve (10 mins)
# import numpy as np
# import matplotlib.pyplot as plt

# RFC_.fit(X,y).feature_importances_  # show the feature importances

# threshold = np.linspace(0,(RFC_.fit(X,y).feature_importances_).max(),20)  # 20 evenly spaced values between 0 and the largest importance

# score = []
# for i in threshold:
#     X_embedded = SelectFromModel(RFC_,threshold=i).fit_transform(X,y)
#     once = cross_val_score(RFC_,X_embedded,y,cv=5).mean()
#     score.append(once)
    
# plt.plot(threshold,score)
# plt.show()

In [49]:
# Based on the learning curve, try threshold=0.00067
X_embedded = SelectFromModel(RFC_,threshold=0.00067).fit_transform(X,y)
X_embedded.shape

(42000, 324)

In [50]:
cross_val_score(RFC_,X_embedded,y,cv=5).mean()

0.939905083368037

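The embedded method keeps fewer features than variance filtering (324 vs. 392), takes less time, and its cross-validation score is better than the score after variance filtering (0.9399 vs. 0.9388).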

In [51]:
# Sweep 20 values around the chosen threshold to find the best one (10 mins)
# score_2 = []
# for i in np.linspace(0,0.00134,20):
#     X_embedded = SelectFromModel(RFC_,threshold=i).fit_transform(X,y)
#     once = cross_val_score(RFC_,X_embedded,y,cv=5).mean()
#     score_2.append(once)
# plt.figure(figsize=[20,5])
# plt.plot(np.linspace(0,0.00134,20),score_2)
# plt.xticks(np.linspace(0,0.00134,20))
# plt.show()

In [52]:
X_embedded = SelectFromModel(RFC_,threshold=0.000564).fit_transform(X,y)

In [53]:
X_embedded.shape

(42000, 340)

In [54]:
cross_val_score(RFC_,X_embedded,y,cv=5).mean()

0.9408335415056387

In [55]:
cross_val_score(RFC(n_estimators=100,random_state=0),X_embedded,y,cv=5).mean()

0.9639525817795566

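The number of selected features is still smaller than with variance filtering, and the model performs better than before any selection. With 100 trees it is on par with KNN. Embedded methods thus make it easy to reach the goal of feature selection: less computation and better model performance. Compared with filter methods, which require a lot of deliberation, embedded methods are more effective. Filter methods, however, compute far faster, so on large datasets prefer filter methods, or the wrapper methods described next, which combine ideas from both.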

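Wrapper methods: feature_selection.RFE
    Feature selection and model training happen together; each round removes some features until the desired number remains.
    The time cost lies between that of embedded methods and filter methods.
    The best feature subset is chosen by a dedicated algorithm: recursive feature elimination (RFE).
    Good results can be reached with very few features. The results rival embedded methods, and it is faster than embedded methods.
    estimator (an instantiated estimator)
    n_features_to_select (the number of features to select)
    step (the number of features to remove in each iteration)
    .support_ (boolean mask of the selected features)
    .ranking_ (feature ranking; selected features have rank 1)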
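A small, fast sketch of these parameters and attributes on hypothetical synthetic data (10 features, 3 informative). The variable names `X_demo`, `y_demo`, and `rfe` are illustrative and separate from the digit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical synthetic data: 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

estimator = RandomForestClassifier(n_estimators=10, random_state=0)
rfe = RFE(estimator, n_features_to_select=4, step=2).fit(X_demo, y_demo)

print(rfe.support_)                  # boolean mask, True for the 4 retained features
print(rfe.ranking_)                  # 1 for retained features; larger ranks were dropped earlier
print(rfe.transform(X_demo).shape)   # reduced feature matrix
```

With `step=2` the selector goes from 10 to 8 to 6 to 4 features, refitting the forest after each round.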

In [1]:
from sklearn.feature_selection import RFE

In [4]:
# RFC_ = RFC(n_estimators=10,random_state=0)

In [5]:
# selector = RFE(RFC_,n_features_to_select=340,step=50).fit(X,y)

In [8]:
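# selector.support_.sum() # 340  # support_ is a boolean mask of whether each feature was selected; summing it counts them

In [9]:
# selector.ranking_  # ranking of the features across the elimination rounds; the more important, the smaller the rank

In [10]:
# X_wrapper = selector.transform(X)  # feature matrix produced by the wrapper method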

In [6]:
# cross_val_score(RFC_,X_wrapper,y,cv=5).mean()

Learning curve for the wrapper method

In [7]:
# score = []
# for i in range(1,751,50):
#     X_wrapper = RFE(RFC_,n_features_to_select=i,step=50).fit_transform(X,y)
#     once = cross_val_score(RFC_,X_wrapper,y,cv=5).mean()
#     score.append(once)
    
# plt.figure(figsize=[20,5])
# plt.plot(range(1,751,50),score)
# plt.xticks(range(1,751,50))
# plt.show()

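The wrapper method finds the best model performance with the fewest features most easily, and in that sense it is much more efficient than embedded and filter methods: it finds a feature combination that keeps the score above 90% after removing 94% of the features.

Summary:
    Filter methods are faster but cruder; wrapper and embedded methods are more precise. When you have no clear starting point, begin with filter methods and adapt to the data at hand.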

feature_selection:   
VarianceThreshold    
chi2---SelectKBest   
f_classif---f_regression ----SelectKBest  
mutual_info_classif---mutual_info_regression ---SelectKBest  
SelectFromModel   
RFE


## Binning

In [4]:
import numpy as np
import pandas as pd
score_list = np.random.randint(25,100,size=20)
score_list

array([56, 37, 83, 91, 25, 45, 48, 77, 59, 88, 43, 36, 28, 79, 56, 76, 35,
       27, 50, 32])

In [3]:
bins = [1,59,70,80,100]
score_cut = pd.cut(score_list,bins)

In [5]:
pd.value_counts(score_cut)

(1, 59]      9
(80, 100]    7
(70, 80]     2
(59, 70]     2
dtype: int64

In [7]:
df = pd.DataFrame()
df['score'] = score_list
df

Unnamed: 0,score
0,56
1,37
2,83
3,91
4,25
5,45
6,48
7,77
8,59
9,88


In [8]:
df['student'] = [pd.util.testing.rands(3) for i in range(20)]  # random 3-character names; pd.util.testing is deprecated since pandas 1.0

In [9]:
df

Unnamed: 0,score,student
0,56,6ti
1,37,EHa
2,83,sI3
3,91,e9o
4,25,2A9
5,45,LTA
6,48,F4o
7,77,1XT
8,59,vag
9,88,CG8


In [11]:
pd.cut(df['score'],bins)

0       (1, 59]
1       (1, 59]
2     (80, 100]
3     (80, 100]
4       (1, 59]
5       (1, 59]
6       (1, 59]
7      (70, 80]
8       (1, 59]
9     (80, 100]
10      (1, 59]
11      (1, 59]
12      (1, 59]
13     (70, 80]
14      (1, 59]
15     (70, 80]
16      (1, 59]
17      (1, 59]
18      (1, 59]
19      (1, 59]
Name: score, dtype: category
Categories (4, interval[int64]): [(1, 59] < (59, 70] < (70, 80] < (80, 100]]
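The intervals above are closed on the right, which decides where edge values such as 59 end up. A short sketch with hypothetical scores placed on the bin edges, comparing the default with `right=False`:

```python
import pandas as pd

bins = [1, 59, 70, 80, 100]
scores = [59, 60, 80, 100]   # hypothetical scores sitting on or near the bin edges

right_closed = pd.cut(scores, bins)               # default right=True: (1, 59], (59, 70], ...
left_closed = pd.cut(scores, bins, right=False)   # [1, 59), [59, 70), ...

print(list(right_closed.codes))
print(list(left_closed.codes))
```

With `right=False` the score 59 moves up one bin, and 100 falls outside the last interval `[80, 100)`, which pandas marks with the code -1 (a missing value).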

In [12]:
df['categories'] = pd.cut(df['score'],bins,labels = ['low','ok','good','great'])

In [13]:
df

Unnamed: 0,score,student,categories
0,56,6ti,low
1,37,EHa,low
2,83,sI3,great
3,91,e9o,great
4,25,2A9,low
5,45,LTA,low
6,48,F4o,low
7,77,1XT,good
8,59,vag,low
9,88,CG8,great


In [14]:
pd.value_counts(df['categories'])

low      14
great     3
good      3
ok        0
Name: categories, dtype: int64