In this notebook, we run a random forest classifier on the iris dataset and then look at the feature importances that scikit-learn's random forest classifier exposes.

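A feature's importance is calculated by looking at how much the tree nodes that split on that feature reduce impurity, averaged across all trees in the forest. It is a weighted average: each node's weight is the number of training samples associated with it. scikit-learn then scales the results so that the importances sum to 1.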

Remember, the Gini impurity measures how mixed the classes are at a decision tree node. A score of 0 is perfect: every training instance in the node belongs to the same class. It is calculated as 1 minus the sum of the squared class ratios in the node.

$$
G_i = 1 - \sum^{n}_{k=1} {P_{i,k}}^2
$$

where $P_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$th node, and $n$ is the number of classes.
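To make the formula concrete, here is a minimal sketch that computes the Gini impurity of a single node from the class labels of its training instances. The function name `gini_impurity` and the three example nodes are made up for illustration; they are not part of scikit-learn, and treating an empty node as pure is an assumption of this sketch.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node, given the class labels of its instances."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty node as pure
    counts = Counter(labels)
    # 1 minus the sum of squared class ratios, as in the formula above
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Hypothetical nodes, chosen only to illustrate the range of values
pure_node = ["setosa"] * 50
mixed_node = ["versicolor"] * 49 + ["virginica"] * 5
even_node = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

print(gini_impurity(pure_node))             # 0.0
print(round(gini_impurity(mixed_node), 3))  # 0.168
print(round(gini_impurity(even_node), 3))   # 0.667
```

A node holding a single class scores 0, while a node split evenly across three classes scores about 0.667, the maximum for three classes.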

In [8]:
# Import the classifier and dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

In [9]:
# Load Iris Data
iris = load_iris()

In [10]:
# Build the classifier model
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)

In [11]:
# Train the classifier
rnd_clf.fit(iris['data'], iris['target'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [12]:
# Inspect the feature names and their importance scores.
print(iris['feature_names'])
print(rnd_clf.feature_importances_)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0.09713884 0.02366421 0.41395356 0.4652434 ]


In [13]:
# The importance scores are in the same order as the feature names, so we can zip them together
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.0971388390410611
sepal width (cm) 0.023664206089290044
petal length (cm) 0.41395355786021987
petal width (cm) 0.465243397009429
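The averaging across trees mentioned at the start can be checked directly. The sketch below rebuilds the forest-level scores from the individual trees in `estimators_` and compares them with `feature_importances_`. It trains its own smaller forest so that it runs on its own; the names `forest`, `per_tree` and `averaged`, and the choices `n_estimators=50` and `random_state=0`, are assumptions made for illustration, so the scores will not match the ones printed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# A smaller forest with a fixed seed (illustrative choices, not from the notebook)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(iris["data"], iris["target"])

# One row of importances per fitted tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then rescale so that the scores sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print(per_tree.shape)
print(np.allclose(averaged, forest.feature_importances_))
print(np.isclose(forest.feature_importances_.sum(), 1.0))
```

If both checks print `True`, the forest's scores are the mean of the per-tree scores and they sum to 1. Because training is randomized, the exact values change from run to run unless `random_state` is fixed.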
