#### The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

1. **In averaging methods**, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, …

2. By contrast, **in boosting methods**, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Examples: AdaBoost, Gradient Tree Boosting, …

#### 1.11.1. Bagging meta-estimator

In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator (resp. BaggingRegressor), taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets. In particular, max_samples and max_features control the size of the subsets (in terms of samples and features), while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement.

In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),max_samples=0.5,max_features=0.5)
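
The cell above only constructs the meta-estimator. As a minimal sketch of how it might be used, the following fits it on a synthetic dataset and inspects the random subsets it drew. The dataset from `make_classification`, the value `n_estimators=10`, and the `random_state` values are illustrative assumptions, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, assumed here only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each base estimator sees 50% of the samples and 50% of the features.
bagging = BaggingClassifier(
    KNeighborsClassifier(),
    n_estimators=10,
    max_samples=0.5,
    max_features=0.5,
    random_state=0,
)

# Cross-validated accuracy of the whole ensemble.
scores = cross_val_score(bagging, X, y, cv=5)
print(scores.mean())

# After fitting, the feature subset used by each base estimator is recorded.
bagging.fit(X, y)
print(bagging.estimators_features_[0])  # indices of the 5 features given to the first estimator
```

With `max_features=0.5` and 10 input features, each base estimator is trained on 5 feature columns; setting `bootstrap_features=True` would draw those columns with replacement instead.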

#### 1.11.2. Forests of randomized trees

The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method.

In [2]:
from sklearn.ensemble import RandomForestClassifier
X = [[0.,0.],
    [1.,1.]]
Y = [0,1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X,Y)

In random forests (see RandomForestClassifier and RandomForestRegressor classes), **each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample)** (note: random sampling with replacement) from the training set. In addition, **when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.** (Note: both the samples and the features are randomized.) As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

In contrast to the original publication, the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
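
The averaging of probabilistic predictions can be checked directly. The sketch below compares the forest's `predict_proba` with the mean of the per-tree `predict_proba` outputs. The synthetic dataset and `max_depth=3` (chosen so that leaves are impure and the per-tree probabilities are not all 0 or 1) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem, assumed only for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Shallow trees so that individual trees output non-trivial class probabilities.
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Stack the per-tree probability estimates: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output.
averaged = tree_probas.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X))
print(matches)
```

The predicted class is then the one with the highest averaged probability, which can differ from a majority vote when individual trees are uncertain.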

##### 1.11.2.2. Extremely Randomized Trees
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), **randomness goes one step further in the way splits are computed**. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.

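>Building on the bagging and random-subspace ideas used by RF, extra-trees go one step further: when each decision tree is trained, the split value is generated at random. An ordinary decision tree computes candidate split values for each continuous feature (a single feature can yield several candidate split values; all of them are evaluated, and the node is split on the feature and split value that score best overall). Here, by contrast, a single split value is drawn at random within each feature's range of values, and the tree then evaluates which feature to split on (so the tree tends to end up a level deeper).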
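
The difference between the two split rules can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper names `gini`, `split_impurity`, `best_split`, and `random_split` are hypothetical, Gini impurity is assumed as the criterion, and the toy data is made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity of the children produced by x <= threshold."""
    left = x <= threshold
    return (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / y.size

def best_split(X, y, candidate_features):
    """Random-forest style: scan every midpoint between sorted unique values."""
    best = (np.inf, None, None)  # (impurity, feature index, threshold)
    for j in candidate_features:
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:
            score = split_impurity(X[:, j], y, t)
            if score < best[0]:
                best = (score, j, t)
    return best

def random_split(X, y, candidate_features, rng):
    """Extra-trees style: one uniformly drawn threshold per candidate feature."""
    best = (np.inf, None, None)
    for j in candidate_features:
        t = rng.uniform(X[:, j].min(), X[:, j].max())
        score = split_impurity(X[:, j], y, t)
        if score < best[0]:
            best = (score, j, t)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] > 0.3).astype(int)  # toy labels that depend on feature 1 only

# Both rules look at the same random subset of candidate features.
features = rng.choice(4, size=2, replace=False)
exhaustive = best_split(X, y, features)
randomized = random_split(X, y, features, rng)
print(exhaustive, randomized)
```

The exhaustive search never does worse on the training node than the random threshold, which is why extra-trees give up a little bias; in exchange, each split is cheaper to compute and the trees are less correlated.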

In [4]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X,y = make_blobs(n_samples=10000,n_features=10,centers=100,random_state=0)

In [5]:
clf = DecisionTreeClassifier(max_depth=None,min_samples_split=2,random_state=0)
scores = cross_val_score(clf,X,y)  # evaluate by cross-validation with the default cv (3-fold in older scikit-learn releases, 5-fold since version 0.22)
scores.mean()

0.97940879382055857

In [7]:
clf = RandomForestClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores = cross_val_score(clf,X,y)
scores.mean()

0.99960784313725493

In [8]:
clf = ExtraTreesClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores = cross_val_score(clf,X,y)
scores.mean()

0.99989898989898995

##### 1.11.2.3. Parameters
The main parameters to adjust when using these methods are **n_estimators** and **max_features**. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the value, the greater the reduction of variance, but also the greater the increase in bias. **Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number of features in the data)**. Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees; an integer min_samples_split below 2 is rejected by current scikit-learn releases). **Bear in mind though that these values are usually not optimal, and might result in models that consume a lot of RAM. The best parameter values should always be cross-validated**. In addition, **note that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False)**. When using bootstrap sampling, the generalization accuracy can be estimated on the left-out, or out-of-bag, samples. This can be enabled by setting oob_score=True.
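
As a small sketch of the out-of-bag estimate mentioned above, the following fits a forest with `oob_score=True` and reads the resulting `oob_score_` attribute. The synthetic dataset and `n_estimators=100` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed only for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True is the default for random forests; it is spelled out here
# because the out-of-bag estimate only exists when bootstrap sampling is used.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

# Accuracy estimated on the samples each tree did not see during training.
print(forest.oob_score_)
```

Because each sample is scored only by the trees that did not train on it, this estimate needs no separate validation split, although it is still worth comparing against cross-validation when tuning parameters.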

##### 1.11.2.4. Parallelization
Finally, this module also features the parallel construction of the trees and the parallel computation of the predictions through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of the machine. If n_jobs=-1 then all cores available on the machine are used. Note that because of inter-process communication overhead, the speedup might not be linear (i.e., using k jobs will unfortunately not be k times as fast). Significant speedup can still be achieved though when building a large number of trees, or when building a single tree requires a fair amount of time (e.g., on large datasets).
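
A minimal sketch of the `n_jobs` parameter follows. It assumes a synthetic dataset and fixes `random_state` so that the serial and parallel forests can be compared; no timings are shown because the speedup depends on the machine.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed only for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Same seed, different degree of parallelism.
serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# n_jobs changes how the work is scheduled, not which trees are built.
same = np.array_equal(serial.predict(X), parallel.predict(X))
print(same)
```

With a fixed `random_state`, the number of jobs should affect only the wall-clock time, so `n_jobs=-1` is usually a safe choice when many trees are built.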

##### 1.11.2.5. Feature importance evaluation
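>The importance of a feature X in a random forest is computed as follows:
1. For each decision tree in the forest, use that tree's out-of-bag (OOB) data to compute its out-of-bag error, denoted errOOB1.
2. Randomly add noise to feature X in all of the OOB samples (that is, randomly change the samples' values for feature X), then compute the out-of-bag error again, denoted errOOB2.
3. If the forest contains Ntree trees, the importance of feature X is ∑(errOOB2-errOOB1)/Ntree. This expression works as an importance measure because if adding random noise to a feature sharply lowers the out-of-bag accuracy, that feature has a large influence on the classification result, which means it is highly important.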
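
The three steps above can be sketched by hand. This is an illustration under stated assumptions rather than scikit-learn's own feature-importance code: a `BaggingClassifier` of decision trees stands in for the random forest because it exposes each tree's in-bag indices through `estimators_samples_`, the "noise" in step 2 is implemented as a random permutation of the feature column, and the synthetic dataset is made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# With shuffle=False the 2 informative features are columns 0 and 1;
# the remaining 4 columns are pure noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

# Bagged trees with random feature subsets per split, as a stand-in for a forest.
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=50, bootstrap=True, random_state=0,
).fit(X, y)

rng = np.random.default_rng(0)
n_samples, n_features = X.shape
importances = np.zeros(n_features)

for tree, in_bag, feats in zip(
    bag.estimators_, bag.estimators_samples_, bag.estimators_features_
):
    # Out-of-bag samples: those this tree never saw during training.
    oob = np.setdiff1d(np.arange(n_samples), in_bag)
    X_oob, y_oob = X[oob], y[oob]
    err_oob1 = 1.0 - tree.score(X_oob[:, feats], y_oob)      # step 1
    for j in range(n_features):
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])         # step 2: perturb feature j
        err_oob2 = 1.0 - tree.score(X_perm[:, feats], y_oob)
        importances[j] += err_oob2 - err_oob1

importances /= len(bag.estimators_)                          # step 3: average over trees
print(importances)
```

The informative columns should receive clearly larger values than the noise columns. Note that the `feature_importances_` attribute of scikit-learn's forests is impurity-based, so its values are not expected to match this permutation-style measure.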

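##### 1.11.2.6. Totally Random Trees Embedding
>TRTE is an unsupervised data transformation method. It maps a low-dimensional dataset into a high-dimensional space so that the mapped data can be used more effectively by classification and regression models. Support vector machines use kernel methods to map low-dimensional data into a high-dimensional space; TRTE offers another way of doing so.

>To transform the data, TRTE uses an approach similar to RF: it builds T decision trees to fit the data. Once the trees are built, the leaf that each data point falls into in each of the T trees is also fixed. For example, suppose we have 3 decision trees with 5 leaf nodes each, and a sample x falls into the 2nd leaf of the first tree, the 3rd leaf of the second tree, and the 5th leaf of the third tree. The encoding of x after the mapping is then (0,1,0,0,0,     0,0,1,0,0,     0,0,0,0,1), a 15-dimensional feature vector. The extra spaces between the groups of dimensions are there to emphasize each tree's own sub-encoding.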
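
The encoding in the example above can be reproduced by hand, and scikit-learn provides the transformation as `RandomTreesEmbedding`. The sketch below shows both; the blob dataset and the values `n_estimators=3` and `max_depth=2` are assumptions chosen to keep the output small.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomTreesEmbedding

# Hand-built encoding for the example: 3 trees with 5 leaves each,
# and x lands in leaf 2, leaf 3 and leaf 5 respectively.
leaf_positions = [2, 3, 5]
code = np.concatenate([np.eye(5, dtype=int)[p - 1] for p in leaf_positions])
print(code.tolist())  # → [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

# The same idea with scikit-learn: an unsupervised one-hot encoding of leaf indices.
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
embedding = RandomTreesEmbedding(n_estimators=3, max_depth=2, random_state=0)
X_transformed = embedding.fit_transform(X)  # sparse matrix, one column per leaf

# Every sample activates exactly one leaf per tree.
active_per_sample = np.asarray(X_transformed.sum(axis=1)).ravel()
print(X_transformed.shape, active_per_sample[:5])
```

The number of output columns equals the total number of leaves across all trees, so it grows with both `n_estimators` and `max_depth`; the sparse output keeps this manageable.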

#### 1.11.3. AdaBoost