# Voting classifier

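Using the moons dataset (`make_moons`), create and train a voting classifier made up of three different classifiers: a random forest, a logistic regression, and an SVC.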

In [2]:
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [7]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
x,y = make_moons(n_samples=10000,noise=0.4)
print(len(x))
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state=42)
print(len(x_train))

10000
8000


In [17]:
rf_clf = RandomForestClassifier()
lr_clf = LogisticRegression(solver='lbfgs')
svc_clf = SVC()
voting_clf = VotingClassifier(estimators = [('rf_clf',rf_clf),('lr_clf',lr_clf),('svc_clf',svc_clf)],voting='hard')

from sklearn.metrics import accuracy_score
for clf in (lr_clf,svc_clf,rf_clf,voting_clf):
    clf.fit(x_train,y_train)
    accuracy = accuracy_score(y_test,clf.predict(x_test))
    print('classifier:{},accuracy:{}'.format(clf.__class__.__name__,accuracy))

classifier:LogisticRegression,accuracy:0.8195
classifier:SVC,accuracy:0.854
classifier:RandomForestClassifier,accuracy:0.8125
classifier:VotingClassifier,accuracy:0.848
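
To make `voting='hard'` concrete, here is a minimal sketch of majority voting in plain Python. The predictions are made-up values for illustration, not output from the classifiers above, and `hard_vote` is a hypothetical helper rather than part of scikit-learn.

```python
from collections import Counter

# Hypothetical label predictions from three classifiers for five samples.
predictions = {
    "rf_clf":  [0, 1, 1, 0, 1],
    "lr_clf":  [0, 1, 0, 0, 1],
    "svc_clf": [1, 1, 1, 0, 0],
}

def hard_vote(predictions):
    """Return the most frequent label for each sample."""
    # zip(*...) regroups the per-classifier lists into per-sample tuples of votes.
    per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

majority = hard_vote(predictions)
print(majority)
```

With three classifiers and two classes there can be no tie, so the majority label is always well defined in this sketch.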


Now try switching from hard voting to soft voting. SVC does not estimate class probabilities by default, so `probability` must be set to `True`.

In [18]:
rf_clf = RandomForestClassifier()
lr_clf = LogisticRegression(solver='lbfgs')
svc_clf = SVC(probability=True)
voting_clf = VotingClassifier(estimators = [('rf_clf',rf_clf),('lr_clf',lr_clf),('svc_clf',svc_clf)],voting='soft')

from sklearn.metrics import accuracy_score
for clf in (lr_clf,svc_clf,rf_clf,voting_clf):
    clf.fit(x_train,y_train)
    accuracy = accuracy_score(y_test,clf.predict(x_test))
    print('classifier:{},accuracy:{}'.format(clf.__class__.__name__,accuracy))

classifier:LogisticRegression,accuracy:0.8195
classifier:SVC,accuracy:0.854
classifier:RandomForestClassifier,accuracy:0.8315
classifier:VotingClassifier,accuracy:0.841


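The book says that switching to soft voting raises the accuracy to 91%, but so far I have not been able to reproduce that. Here the soft-voting ensemble scores 0.841, slightly below the 0.848 obtained with hard voting.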
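
As a rough illustration of how soft voting differs from hard voting, the sketch below averages class probabilities for a single sample. The probability values are invented for the example, and `soft_vote` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical class-probability estimates [P(class 0), P(class 1)] for one sample.
probas = {
    "rf_clf":  [0.55, 0.45],
    "lr_clf":  [0.60, 0.40],
    "svc_clf": [0.10, 0.90],
}

def soft_vote(probas):
    """Average the per-class probabilities and return (winning class, averages)."""
    n_models = len(probas)
    n_classes = len(next(iter(probas.values())))
    averages = [
        sum(p[k] for p in probas.values()) / n_models for k in range(n_classes)
    ]
    winner = max(range(n_classes), key=averages.__getitem__)
    return winner, averages

soft_label, averages = soft_vote(probas)

# For comparison: the label each classifier would cast under hard voting.
individual_votes = [p.index(max(p)) for p in probas.values()]

print("soft vote:", soft_label, averages)
print("individual votes:", individual_votes)
```

Two of the three classifiers lean weakly toward class 0, so a majority vote picks class 0, while the single confident classifier pulls the averaged probability toward class 1. Soft voting gives more weight to confident predictions, which can help or hurt depending on how well calibrated the probabilities are.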

# Bagging and pasting

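Train an ensemble of 500 decision tree classifiers, each fitted on 100 training instances sampled at random from the training set. With `bootstrap=True` the instances are sampled with replacement (bagging); otherwise they are sampled without replacement (pasting).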

In [23]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,max_samples=100,bootstrap=True,n_jobs=-1,oob_score=True)
bag_clf.fit(x_train,y_train)
y_pred = bag_clf.predict(x_test)
print('oob_score:',bag_clf.oob_score_)
print('accuracy:',accuracy_score(y_test,y_pred))

oob_score: 0.859125
accuracy: 0.85
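
The difference between the two sampling schemes can be sketched with the standard library alone. This is only an illustration of sampling with and without replacement, not how `BaggingClassifier` is implemented internally; the index list stands in for the 8000 training rows.

```python
import random

rng = random.Random(42)
train_indices = list(range(8000))  # stand-in for the 8000 training rows
max_samples = 100

# Bagging: sample with replacement, so the same index may be drawn more than once.
bagging_sample = rng.choices(train_indices, k=max_samples)

# Pasting: sample without replacement, so every drawn index is distinct.
pasting_sample = rng.sample(train_indices, k=max_samples)

print("distinct rows with bagging:", len(set(bagging_sample)))
print("distinct rows with pasting:", len(set(pasting_sample)))
```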


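Note the `oob_score` parameter, which enables the out-of-bag (OOB) evaluation score. By default, BaggingClassifier draws m training instances with replacement (`bootstrap=True`), where m is the size of the training set. On average each predictor then sees only about 63% of the distinct training instances, and the instances it never sampled can be used for out-of-bag evaluation. In the cell above `max_samples=100`, so each tree sees at most 100 of the 8000 training instances and the out-of-bag share is far larger than 37%.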
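
The 63% figure follows from the probability that a given instance is missed by every draw. A short sketch of the arithmetic, where `expected_in_bag_fraction` is a hypothetical helper name:

```python
import math

def expected_in_bag_fraction(m, n_draws=None):
    """Expected fraction of m instances drawn at least once
    when sampling n_draws times with replacement."""
    if n_draws is None:
        n_draws = m  # the default case: draw as many instances as the training set holds
    # (1 - 1/m) ** n_draws is the chance that one given instance is never drawn.
    return 1 - (1 - 1 / m) ** n_draws

full_bootstrap = expected_in_bag_fraction(8000)      # approaches 1 - 1/e for large m
small_bootstrap = expected_in_bag_fraction(8000, 100)  # the max_samples=100 setting

print("m draws:", full_bootstrap)
print("100 draws:", small_bootstrap)
print("limit 1 - 1/e:", 1 - math.exp(-1))
```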

# Inspecting feature importance with a random forest

Looking at a single decision tree, important features tend to appear close to the root, while unimportant features usually appear close to the leaves.

In [30]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
x,y = iris['data'],iris['target']
rfc_clf = RandomForestClassifier(n_estimators = 500, criterion='gini',oob_score=True,n_jobs = -1)
rfc_clf.fit(x,y)
for feature_name,importance in zip(iris['feature_names'],rfc_clf.feature_importances_):
    print(feature_name,importance)

sepal length (cm) 0.0938533541376908
sepal width (cm) 0.024136087222109918
petal length (cm) 0.44146052727464014
petal width (cm) 0.44055003136555876


In [31]:
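print('n_features:',rfc_clf.n_features_in_)  # n_features_ was deprecated in scikit-learn 1.0 and later removed; n_features_in_ replaces it
print('oob_score:',rfc_clf.oob_score_)  # only available when the classifier was built with oob_score=True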

n_features: 4
oob_score: 0.9533333333333334
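
One way to check the claim that important features tend to sit near the root is to count which feature each tree in a forest uses for its root split. The sketch below assumes scikit-learn is installed and fits a fresh forest; `n_estimators=200` and `random_state=0` are arbitrary choices for the illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(iris.data, iris.target)

root_features = Counter()
for estimator in forest.estimators_:
    # tree_.feature[0] is the index of the feature used at the root node;
    # a negative value means the root is a leaf, which is skipped here.
    root_index = estimator.tree_.feature[0]
    if root_index >= 0:
        root_features[iris.feature_names[root_index]] += 1

for name, count in root_features.most_common():
    print(name, count)
```

Each split only considers a random subset of the features, so less important features still appear at the root in some trees.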
