# Feature Selection

[Common feature selection methods, illustrated with Scikit-learn](http://blog.csdn.net/bryan__/article/details/51607215)

[Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)

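- [Removing low-variance features](#id1)

- [Univariate feature selection](#id2)

- [Model-based feature ranking](#id3)

- [Linear models and regularization](#id4)

- [Random forests](#rf)

- [Two top-level feature selection methods](#id5)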

<h2 id="id1">Removing low-variance features</h2>

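If most samples share the same value for a feature, that feature carries little information and can be removed. `VarianceThreshold` drops every feature whose variance does not exceed the given threshold.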

In [1]:
from sklearn.feature_selection import VarianceThreshold
X = [[0,0,1],[0,1,0],[1,0,0],[0,1,1],[0,1,0],[0,1,1]]
sel = VarianceThreshold(threshold=(.8*(1-0.8)))
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
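
The threshold `.8*(1-0.8)` is the variance of a Bernoulli variable, `p*(1-p)` with `p = 0.8`, so a 0/1 feature is dropped when it takes the same value in more than 80% of the samples. The sketch below recomputes the column variances in plain Python to show why only the first column is removed. The helper `column_variances` is an illustrative name, not part of scikit-learn.

```python
# Plain-Python check of the VarianceThreshold example above.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

def column_variances(rows):
    """Population variance of each column (hypothetical helper)."""
    n = len(rows)
    variances = []
    for col in zip(*rows):
        mean = sum(col) / float(n)
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    return variances

threshold = 0.8 * (1 - 0.8)   # Bernoulli variance p * (1 - p) with p = 0.8
variances = column_variances(X)
# keep only the columns whose variance exceeds the threshold
kept = [i for i, v in enumerate(variances) if v > threshold]

print(variances)  # roughly [0.139, 0.222, 0.25]
print(kept)       # [1, 2]
```

The first column is 0 in five of six rows, so its variance (5/36) falls below 0.16 and it is the one removed.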

<h2 id="id2">Univariate feature selection</h2>

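- Pearson correlation coefficient

The Pearson correlation coefficient measures the linear correlation between two variables and takes values in [-1, 1]: -1 means a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation.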


In [2]:
import numpy as np
from scipy.stats import pearsonr
X = [1,2,3]
y = [1,2,2]
pearsonr(X,y)

(0.86602540378443871, 0.33333333333333331)
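
A coefficient of 0 only rules out a *linear* relationship. As a minimal sketch (the `pearson` helper is written out here for illustration; `scipy.stats.pearsonr` computes the same statistic), the code below reproduces the value above from the definition and then shows that `y = x**2` on a symmetric range has a correlation of essentially zero, even though `y` is fully determined by `x`.

```python
import math

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / float(n), sum(y) / float(n)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_linear = pearson([1, 2, 3], [1, 2, 2])   # same data as the cell above

# y depends on x exactly, but not linearly
xs = [i / 50.0 for i in range(-50, 51)]
ys = [v ** 2 for v in xs]
r_quadratic = pearson(xs, ys)

print(round(r_linear, 4))         # 0.866
print(abs(r_quadratic) < 1e-9)    # True
```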

<h2 id="id3">Model-based feature ranking</h2>

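The Pearson correlation coefficient only captures linear relationships. For non-linear relationships, tree-based methods (decision trees, random forests) or linear models with basis expansion can be used instead. These models overfit easily, so keep the trees shallow and evaluate them with cross-validation.

The following example applies sklearn's [random forest regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) to the [Boston housing dataset](https://archive.ics.uci.edu/ml/datasets/Housing), scoring each feature on its own: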

In [8]:
from sklearn.model_selection import cross_val_score,ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
boston = load_boston()

In [10]:
boston['data']

array([[  6.32000000e-03,   1.80000000e+01,   2.31000000e+00, ...,
          1.53000000e+01,   3.96900000e+02,   4.98000000e+00],
       [  2.73100000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.96900000e+02,   9.14000000e+00],
       [  2.72900000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.92830000e+02,   4.03000000e+00],
       ..., 
       [  6.07600000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   5.64000000e+00],
       [  1.09590000e-01,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.93450000e+02,   6.48000000e+00],
       [  4.74100000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   7.88000000e+00]])

In [11]:
boston['target']

array([ 24. ,  21.6,  34.7,  33.4,  36.2,  28.7,  22.9,  27.1,  16.5,
        18.9,  15. ,  18.9,  21.7,  20.4,  18.2,  19.9,  23.1,  17.5,
        20.2,  18.2,  13.6,  19.6,  15.2,  14.5,  15.6,  13.9,  16.6,
        14.8,  18.4,  21. ,  12.7,  14.5,  13.2,  13.1,  13.5,  18.9,
        20. ,  21. ,  24.7,  30.8,  34.9,  26.6,  25.3,  24.7,  21.2,
        19.3,  20. ,  16.6,  14.4,  19.4,  19.7,  20.5,  25. ,  23.4,
        18.9,  35.4,  24.7,  31.6,  23.3,  19.6,  18.7,  16. ,  22.2,
        25. ,  33. ,  23.5,  19.4,  22. ,  17.4,  20.9,  24.2,  21.7,
        22.8,  23.4,  24.1,  21.4,  20. ,  20.8,  21.2,  20.3,  28. ,
        23.9,  24.8,  22.9,  23.9,  26.6,  22.5,  22.2,  23.6,  28.7,
        22.6,  22. ,  22.9,  25. ,  20.6,  28.4,  21.4,  38.7,  43.8,
        33.2,  27.5,  26.5,  18.6,  19.3,  20.1,  19.5,  19.5,  20.4,
        19.8,  19.4,  21.7,  22.8,  18.8,  18.7,  18.5,  18.3,  21.2,
        19.2,  20.4,  19.3,  22. ,  20.3,  20.5,  17.3,  18.8,  21.4,
        15.7,  16.2,

In [12]:
boston['feature_names']

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='|S7')

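In [9]:
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import cross_val_score,ShuffleSplit
X = boston['data']
Y = boston['target']
names = boston['feature_names']

rf = RandomForestRegressor(n_estimators=20,max_depth=4)
scores = []
# score each feature on its own: 3 random splits, 30% held out for testing
for i in range(X.shape[1]):
    score = cross_val_score(rf,X[:,i:i+1],Y,scoring='r2',
                           cv=ShuffleSplit(n_splits=3,test_size=.3))
    scores.append((round(np.mean(score),3),names[i]))
print(sorted(scores,reverse=True))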

[(0.65, 'LSTAT'), (0.569, 'RM'), (0.415, 'INDUS'), (0.404, 'NOX'), (0.329, 'TAX'), (0.302, 'PTRATIO'), (0.208, 'CRIM'), (0.195, 'RAD'), (0.188, 'ZN'), (0.141, 'B'), (0.072, 'AGE'), (0.041, 'DIS'), (-0.028, 'CHAS')]


<h2 id="id4">Linear models and regularization</h2>

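The more important a feature, the larger the absolute value of its coefficient in the model, provided the features are on comparable scales.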

In [13]:
from sklearn.linear_model import LinearRegression
import numpy as np
np.random.seed(0)
size = 500

# dataset with 3 features
X = np.random.normal(0,1,(size,3))  # shape: size x 3
X

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721, -0.10321885],
       ..., 
       [-0.74013679, -0.56549781,  0.47603138],
       [-2.15806856,  1.31855102, -0.23929659],
       [-0.24679356, -1.07934317, -0.11422555]])

In [23]:
# Y = X0 + 2*X1 +noise
Y = X[:,0] + 2*X[:,1] + np.random.normal(0,2,size) # size*1

lr = LinearRegression()
lr.fit(X,Y)

# helper that prints a linear model in readable form
def pretty_print(coefs,names=None,sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs,names)
    if sort:
        lst = sorted(lst,key=lambda x:-np.abs(x[0]))
    return " + ".join("%s*%s" % (round(coef,3),name) for coef,name in lst)

print("Linear model: %s" % pretty_print(lr.coef_,sort=True))

Linear model: 1.975*X1 + 1.141*X0 + -0.061*X2
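
Coefficient size only reflects importance when the features share a scale. The sketch below is an illustration with NumPy's least squares rather than `LinearRegression`; `fit_coefs` is a hypothetical helper. It regenerates data of the same form and shows that rescaling one feature changes its coefficient by the inverse factor while the other coefficients stay put.

```python
import numpy as np

rng = np.random.RandomState(0)
size = 500
X = rng.normal(0, 1, (size, 3))
Y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 2, size)

def fit_coefs(features, target):
    """Least-squares coefficients with an intercept column appended."""
    design = np.column_stack([features, np.ones(len(features))])
    coefs = np.linalg.lstsq(design, target, rcond=None)[0]
    return coefs[:-1]   # drop the intercept

coefs = fit_coefs(X, Y)

# Store feature 1 with values 100 times larger (e.g. metres -> centimetres)
X_scaled = X.copy()
X_scaled[:, 1] *= 100
coefs_scaled = fit_coefs(X_scaled, Y)

print(np.round(coefs, 3))
print(np.round(coefs_scaled, 3))   # coefficient of feature 1 shrinks 100-fold
```

This is why features are usually standardized before coefficients are compared.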


- L1 regularization (Lasso)

In [24]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston['data'])
X

array([[-0.41771335,  0.28482986, -1.2879095 , ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41526932, -0.48772236, -0.59338101, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41527165, -0.48772236, -0.59338101, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ..., 
       [-0.41137448, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40568883, -0.48772236,  0.11573841, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41292893, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.66905833]])

In [25]:
Y = boston['target']
names = boston['feature_names']

# The larger alpha is, the sparser the model becomes, i.e. more and more feature coefficients become exactly 0
lasso = Lasso(alpha=.3)
lasso.fit(X,Y)
print("Lasso model: %s" % pretty_print(lasso.coef_,names,sort=True))

Lasso model: -3.707*LSTAT + 2.992*RM + -1.757*PTRATIO + -1.081*DIS + -0.7*NOX + 0.631*B + 0.54*CHAS + -0.236*CRIM + 0.081*ZN + -0.0*INDUS + -0.0*AGE + 0.0*RAD + -0.0*TAX
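
Why does a larger `alpha` produce exact zeros rather than merely small coefficients? In the special case of orthonormal features, the Lasso solution is the least-squares solution passed through a soft-thresholding operator, which sets every coefficient whose magnitude is below the threshold to exactly 0. The sketch below applies that operator to made-up coefficients; it is not how the Boston model above was fitted, since those features are correlated.

```python
import numpy as np

def soft_threshold(coefs, alpha):
    """Shrink each coefficient toward zero by alpha, clipping at zero."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

# Hypothetical least-squares coefficients, for illustration only
ols_coefs = np.array([-3.0, 2.5, -1.2, 0.6, 0.3, -0.1])

zero_counts = []
for alpha in [0.0, 0.2, 0.5, 1.0, 2.0]:
    shrunk = soft_threshold(ols_coefs, alpha)
    zero_counts.append(int(np.sum(shrunk == 0)))
    print(alpha, shrunk)

print(zero_counts)   # [0, 1, 2, 3, 4]
```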




- L2 regularization (Ridge)

In [28]:
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
size = 100

for i in range(5):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    X_seed = np.random.normal(0,1,size)
    # three strongly correlated features: noisy copies of X_seed
    X1 = X_seed + np.random.normal(0,.1,size)
    X2 = X_seed + np.random.normal(0,.1,size)
    X3 = X_seed + np.random.normal(0,.1,size)
    Y = X1 + X2 + X3 + np.random.normal(0,1,size)
    X = np.array([X1,X2,X3]).T

    lr = LinearRegression()
    lr.fit(X,Y)
    print("Linear model1: %s" % pretty_print(lr.coef_))

    ridge = Ridge(alpha=10)
    ridge.fit(X,Y)
    print("Ridge model: %s" % pretty_print(ridge.coef_))

Random seed 0
Linear model1: 0.728*X0 + 2.309*X1 + -0.082*X2
Ridge model: 0.938*X0 + 1.059*X1 + 0.877*X2
Random seed 1
Linear model1: 1.152*X0 + 2.366*X1 + -0.599*X2
Ridge model: 0.984*X0 + 1.068*X1 + 0.759*X2
Random seed 2
Linear model1: 0.697*X0 + 0.322*X1 + 2.086*X2
Ridge model: 0.972*X0 + 0.943*X1 + 1.085*X2
Random seed 3
Linear model1: 0.287*X0 + 1.254*X1 + 1.491*X2
Ridge model: 0.919*X0 + 1.005*X1 + 1.033*X2
Random seed 4
Linear model1: 0.187*X0 + 0.772*X1 + 2.189*X2
Ridge model: 0.964*X0 + 0.982*X1 + 1.098*X2
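
With strongly correlated features the unregularized coefficients swing widely from seed to seed, while the L2 penalty keeps them close to one another. A minimal sketch of the closed-form ridge solution `w = (XᵀX + αI)⁻¹ Xᵀy` follows. It assumes no intercept term (sklearn's `Ridge` fits one by default, so its numbers will differ slightly), and `ridge_coefs` is an illustrative helper name.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution without an intercept (alpha=0 gives OLS)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + alpha * np.eye(n_features), X.T.dot(y))

rng = np.random.RandomState(0)
size = 100
X_seed = rng.normal(0, 1, size)
# Three noisy copies of the same underlying signal
X = np.array([X_seed + rng.normal(0, .1, size) for _ in range(3)]).T
Y = X.sum(axis=1) + rng.normal(0, 1, size)

w_ols = ridge_coefs(X, Y, alpha=0.0)
w_ridge = ridge_coefs(X, Y, alpha=10.0)

print("OLS:   %s" % np.round(w_ols, 3))
print("Ridge: %s" % np.round(w_ridge, 3))
```

Adding `alpha` to the diagonal of `XᵀX` makes the nearly singular system well conditioned, which is what stabilizes the coefficients.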


<h2 id="rf">Random forests</h2>

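Random forests are accurate, robust, and easy to use.

The example below ranks features by Gini importance (mean decrease in impurity). This score has two known drawbacks: (1) it is biased in favor of variables with more categories; (2) when several features are correlated, any one of them can serve as the predictor, and once one has been picked the importance of the others drops sharply.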

In [40]:
X = boston['data']
Y = boston['target']
names = boston['feature_names']
rf = RandomForestRegressor()
rf.fit(X,Y)
print('Features sorted by their score:')
print(sorted(zip(map(lambda x:round(x,4),rf.feature_importances_),names),reverse=True))

Features sorted by their score:
[(0.5027, 'LSTAT'), (0.3232, 'RM'), (0.0611, 'DIS'), (0.0249, 'NOX'), (0.0207, 'CRIM'), (0.0192, 'PTRATIO'), (0.0146, 'B'), (0.0133, 'TAX'), (0.0118, 'AGE'), (0.0057, 'INDUS'), (0.0015, 'RAD'), (0.0008, 'ZN'), (0.0006, 'CHAS')]


<h2 id="id5">Two top-level feature selection methods</h2>

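- Stability selection

Stability selection runs a feature selection algorithm on different subsets of the data and of the features, repeats this many times, and aggregates the results. Ideally, important features score close to 100%, somewhat weaker features get lower but non-zero scores, and the least useful features score close to 0.

Older sklearn releases implement stability selection in [RandomizedLasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLasso.html) and [RandomizedLogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html).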

In [41]:
# Boston dataset
from sklearn.linear_model import RandomizedLasso
rlasso = RandomizedLasso(alpha=.025)
rlasso.fit(X,Y)

print('Features sorted by their score:')
print(sorted(zip(map(lambda x:round(x,4),rlasso.scores_),names),reverse=True))

Features sorted by their score:
[(1.0, 'RM'), (1.0, 'PTRATIO'), (1.0, 'LSTAT'), (0.635, 'B'), (0.56, 'CHAS'), (0.42, 'CRIM'), (0.36, 'TAX'), (0.205, 'NOX'), (0.175, 'DIS'), (0.155, 'INDUS'), (0.07, 'ZN'), (0.045, 'RAD'), (0.03, 'AGE')]
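
The resampling idea itself can be sketched without `RandomizedLasso`. The code below is an illustration on synthetic data, not the library's algorithm: the selector `select` is a deliberately simple stand-in (least squares plus a coefficient cutoff of 1.0, both chosen for this example) where a Lasso would normally be used. Each round draws a random half of the rows and a random subset of the columns, and a feature's score is the fraction of rounds in which it was selected when it was available.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 200, 5
X = rng.normal(0, 1, (n_samples, n_features))
# Only the first two features carry signal
Y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, n_samples)

def select(features, target, min_abs_coef=1.0):
    """Stand-in selector: keep features whose least-squares |coef| is large."""
    coefs = np.linalg.lstsq(features, target, rcond=None)[0]
    return np.abs(coefs) > min_abs_coef

n_rounds = 100
selected = np.zeros(n_features)
considered = np.zeros(n_features)
for _ in range(n_rounds):
    rows = rng.choice(n_samples, n_samples // 2, replace=False)  # data subset
    cols = rng.choice(n_features, 3, replace=False)              # feature subset
    keep = select(X[np.ix_(rows, cols)], Y[rows])
    considered[cols] += 1
    selected[cols] += keep

# Fraction of rounds in which each feature was selected when it was available
scores = selected / np.maximum(considered, 1)
print(np.round(scores, 2))
```

The two informative features should score near 1 and the three noise features near 0.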


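- Recursive feature elimination

Recursive feature elimination repeatedly builds a model, picks the best (or worst) feature, sets it aside, and repeats the process on the remaining features until every feature has been processed. The order in which features are eliminated gives their ranking.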

In [42]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
rfe = RFE(lr,n_features_to_select=1)
rfe.fit(X,Y)

print('Features sorted by their rank:')
print(sorted(zip(map(lambda x:round(x,4),rfe.ranking_),names)))

Features sorted by their rank:
[(1.0, 'NOX'), (2.0, 'RM'), (3.0, 'CHAS'), (4.0, 'PTRATIO'), (5.0, 'DIS'), (6.0, 'LSTAT'), (7.0, 'RAD'), (8.0, 'CRIM'), (9.0, 'INDUS'), (10.0, 'ZN'), (11.0, 'TAX'), (12.0, 'B'), (13.0, 'AGE')]
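
The elimination loop can be written out in a few lines. The sketch below mirrors the idea behind `RFE` with a linear model, not sklearn's exact implementation: it fits least squares on the remaining features, drops the one with the smallest absolute coefficient, and repeats. `rfe_ranking` is an illustrative helper, and the data are synthetic with features on the same scale, since ranking by coefficient size is sensitive to feature scale.

```python
import numpy as np

def rfe_ranking(X, y):
    """Rank features by repeatedly dropping the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    ranking = np.zeros(X.shape[1], dtype=int)
    while remaining:
        coefs = np.linalg.lstsq(X[:, remaining], y, rcond=None)[0]
        weakest = remaining[int(np.argmin(np.abs(coefs)))]
        ranking[weakest] = len(remaining)   # last one standing gets rank 1
        remaining.remove(weakest)
    return ranking

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (300, 4))
# True coefficients 3, 2, 0.5 and 0: importance decreases left to right
Y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 300)

ranking = rfe_ranking(X, Y)
print(ranking)   # rank 1 = most important
```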
