# Dimensionality Reduction

## 1. Four Approaches to Dimensionality Reduction

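<1> Experience-based: decide from the experience of business or data experts, the actual state of the data, and the depth of business understanding;

<2> Trial-based: repeatedly run the computation with different combinations of dimensions, use the results to validate and adjust, and eventually settle on the best feature set;

<3> Statistical analysis: use correlation analysis to measure the linear correlation between dimensions, then manually remove or filter dimensions that are highly correlated; alternatively, compute the mutual information between dimensions, find the groups of features with high mutual information, and keep only one feature from each group;

<4> Machine learning algorithms: use a machine learning algorithm to obtain an importance score or weight for each feature, then select the features with the largest weights.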
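
As an illustration of approach <3>, the correlation-based filter can be sketched roughly as follows. This is a minimal sketch, not part of the original notebook: the helper name `drop_correlated_features`, the 0.9 threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def drop_correlated_features(x, threshold=0.9):
    """Return the column indices to keep (hypothetical helper).

    A column is dropped when its absolute Pearson correlation with any
    column that has already been kept exceeds `threshold`.
    """
    corr = np.abs(np.corrcoef(x, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        # Keep column j only if it is not highly correlated with a kept column
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic data (assumption): column 2 is almost a copy of column 0
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 3))
demo[:, 2] = demo[:, 0] + 0.01 * rng.normal(size=200)

kept = drop_correlated_features(demo, threshold=0.9)
print(kept)
```

The filter keeps the first column of each correlated group it encounters, so the column order decides which feature survives. In practice that choice would normally come from business knowledge.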

In [2]:
import numpy as np
# Load the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# Load PCA
from sklearn.decomposition import PCA

In [3]:
# Load the data with np.loadtxt
data = np.loadtxt('./data/data_dimension_reduction_1.txt')
data[0]

array([ 1.88622997,  1.31785876, -0.16480621,  0.56536882, -1.11934542,
       -0.53218995, -0.6843102 ,  1.24149827,  1.00579225,  0.45485041,
        0.        ])

In [4]:
# Features: every column except the last
x = data[:, :-1]
# Target: the last column (kept as a 2-D column vector)
y = data[:, -1:]
x[0]

array([ 1.88622997,  1.31785876, -0.16480621,  0.56536882, -1.11934542,
       -0.53218995, -0.6843102 ,  1.24149827,  1.00579225,  0.45485041])

In [6]:
help(DecisionTreeClassifier)

Help on class DecisionTreeClassifier in module sklearn.tree._classes:

class DecisionTreeClassifier(sklearn.base.ClassifierMixin, BaseDecisionTree)
 |  DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)
 |  
 |  A decision tree classifier.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : {"gini", "entropy"}, default="gini"
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |  
 |  splitter : {"best", "random"}, default="best"
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 

In [10]:
# Use sklearn's DecisionTreeClassifier to estimate the importance of each variable
model_tree = DecisionTreeClassifier(random_state=0)
model_tree.fit(x, y)
feature_importances = model_tree.feature_importances_
print(feature_importances)

[0.03331054 0.01513967 0.02199713 0.119727   0.47930312 0.04776297
 0.17111746 0.02585441 0.02012725 0.06566044]
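
One possible way to turn these importances into a feature selection is to rank the features and keep the smallest set whose cumulative importance reaches a target. The sketch below copies the importances printed above; the 0.8 target is an assumed value for illustration, not something the notebook prescribes.

```python
import numpy as np

# Importances copied from the printed output above
feature_importances = np.array([
    0.03331054, 0.01513967, 0.02199713, 0.119727, 0.47930312,
    0.04776297, 0.17111746, 0.02585441, 0.02012725, 0.06566044,
])

# Rank features from most to least important
order = np.argsort(feature_importances)[::-1]
cumulative = np.cumsum(feature_importances[order])

# Smallest number of top-ranked features whose importance sums to the target
target = 0.8  # assumed threshold
n_keep = int(np.searchsorted(cumulative, target) + 1)

# Column indices of the selected features, in original column order
selected = np.sort(order[:n_keep])
print(n_keep, selected)  # 4 [3 4 6 9]
```

With these indices, `x[:, selected]` would give the reduced feature matrix.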


In [11]:
help(PCA)

Help on class PCA in module sklearn.decomposition._pca:

class PCA(sklearn.decomposition._base._BasePCA)
 |  PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
 |  
 |  Principal component analysis (PCA).
 |  
 |  Linear dimensionality reduction using Singular Value Decomposition of the
 |  data to project it to a lower dimensional space. The input data is centered
 |  but not scaled for each feature before applying the SVD.
 |  
 |  It uses the LAPACK implementation of the full SVD or a randomized truncated
 |  SVD by the method of Halko et al. 2009, depending on the shape of the input
 |  data and the number of components to extract.
 |  
 |  It can also use the scipy.sparse.linalg ARPACK implementation of the
 |  truncated SVD.
 |  
 |  Notice that this class does not support sparse input. See
 |  :class:`TruncatedSVD` for an alternative with sparse data.
 |  
 |  Read more in the :ref:`User Guide <PCA>`.
 |  
 |  Pa

In [13]:
# Extract principal components with sklearn's PCA; the number of components to keep is chosen from the explained variance ratio
model_pca = PCA()
model_pca.fit(x)
model_pca.transform(x)
components = model_pca.components_
explained_variance = model_pca.explained_variance_
explained_variance_ratio = model_pca.explained_variance_ratio_

In [16]:
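# Variance explained by the first two principal components
print(explained_variance[:2])
# Share of the total variance explained by the first five components
print(explained_variance_ratio[:5].sum())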

[4.22602937 2.21149972]
0.774389008950177
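
The choice of component count from the cumulative explained variance ratio can be sketched as below. This is a hedged example on synthetic data, since the original data file is not reproduced here; the 0.95 target and the data-generating setup (10 features driven by 3 latent factors) are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 10 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit a full PCA and accumulate the explained variance ratio
pca = PCA().fit(demo)
cumulative_ratio = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # assumed threshold
n_components = int(np.argmax(cumulative_ratio >= target) + 1)

# Refit with that many components and project the data
reduced = PCA(n_components=n_components).fit_transform(demo)
print(n_components, reduced.shape)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver='full'`), which selects the component count by explained variance directly. The explicit version above simply makes the cumulative ratio visible.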
