# Forest Cover Analysis - EXAMPLE

## 1 - Prep Stuff

First, we need some libraries:

In [1]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier

Next, we load the dataset into the variable _**df**_ (*we are using __df__ as shorthand for "DataFrame"*):

In [2]:
df = pd.read_csv("data.csv")

In [3]:
df.head()

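|  | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |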


The **Cover_Type** entries are our labels; all other columns are our features.

See how many entries we have (*first number*) and how many features we have (*second number minus 1, since one column holds the label*):

In [4]:
df.shape

(581012, 55)
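
As a small illustration of how to read `shape`, the sketch below builds a tiny stand-in frame (only three rows and two of the feature columns, with values copied from the preview above) and derives the row and feature counts the same way. The name `toy` is made up for this example.

```python
import pandas as pd

# Toy stand-in for df: two feature columns plus the label column.
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Slope": [3, 2, 9],
    "Cover_Type": [5, 5, 2],
})

n_rows, n_cols = toy.shape   # shape is (rows, columns)
n_features = n_cols - 1      # every column except the label
print(n_rows, n_features)
```

Applied to the real frame, the same arithmetic gives 581012 entries and 55 - 1 = 54 features.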

See how many unique labels we have and what those labels are:

In [5]:
len(df['Cover_Type'].unique())

7

In [6]:
df['Cover_Type'].unique()

array([5, 2, 1, 7, 3, 6, 4])

We have seven possible labels (*the integers 1 through 7*).
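
Beyond listing the unique labels, it can help to see how often each one occurs. A minimal sketch using `value_counts`, run here on a toy series holding only the five labels from the preview above (the name `toy_labels` is an assumption for this example):

```python
import pandas as pd

# Toy stand-in for df['Cover_Type']: the five labels shown by df.head().
toy_labels = pd.Series([5, 5, 2, 2, 5], name="Cover_Type")

# Count how many rows carry each label, ordered by label value.
counts = toy_labels.value_counts().sort_index()
print(counts)

# The distinct labels themselves, as plain Python ints.
distinct = sorted(toy_labels.unique().tolist())
print(distinct)
```

On the full dataset the same call would be `df['Cover_Type'].value_counts()`.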

The last thing we need to do is split our data into features and labels (*for clarity in what we're doing*).  First, let's split out the features:

In [7]:
feats = df.drop(['Cover_Type'], axis=1)
feats.shape

(581012, 54)

In [8]:
feats.head()

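|  | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type31 | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |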


The second thing we're going to do is split out the labels:

In [9]:
labels = df['Cover_Type']
labels.shape

(581012,)

In [10]:
labels.head()

0    5
1    5
2    2
3    2
4    5
Name: Cover_Type, dtype: int64

Note that we have also shown the shapes of the objects holding the features and labels, respectively, as well as the first few entries of each.

## 2 - Choose some algorithms

For this case, I'm going to use the following algorithms to calculate the importance of each feature in determining the label:

* Decision Trees
* Random Forests
* AdaBoost
* Gradient Boosting
* XGBoost
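
To give a feel for how these estimators can be queried in a uniform way, here is a minimal sketch, assuming scikit-learn's ensemble classifiers and small synthetic stand-in data from `make_classification` (not the forest cover data). XGBoost ships as a separate package and is left out of this sketch; the `n_estimators=20` setting is only there to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself would pass feats and labels.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # Each fitted estimator exposes one importance score per feature.
    importances[name] = model.feature_importances_
    print(name, importances[name].round(3))
```

Because every model exposes `feature_importances_` after fitting, the rankings can be compared with the same code regardless of the algorithm.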

### 2.1 - Decision Trees

We first create the model. Since decision trees can be built using either the **Gini** impurity measure or the information gain (**Entropy**), we create two models: the first uses Gini and the second uses Entropy.
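
To make the two criteria concrete, the sketch below computes Gini impurity (1 minus the sum of squared class proportions) and Shannon entropy (in bits) for the class counts at a single node. The helper names `gini` and `entropy` are made up for this illustration and are not scikit-learn functions.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy (base 2) of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # skip empty classes so log2(0) is never evaluated
    return -np.sum(p * np.log2(p))

# An evenly mixed node versus a mostly pure node.
print(gini([5, 5]), entropy([5, 5]))
print(gini([9, 1]), entropy([9, 1]))
```

Both measures are highest for the evenly mixed node and shrink as one class dominates, which is why either can be used to score candidate splits.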

Creating and then training the first decision tree model:

In [11]:
dTree_gini = DecisionTreeClassifier(random_state=0, criterion="gini")
dTree_gini.fit(feats, labels);

and then the second:

In [12]:
dTree_ent = DecisionTreeClassifier(random_state=0, criterion="entropy")
dTree_ent.fit(feats, labels);

#### Results

Now let's get our results! First, from **Decision Trees using Gini**:

In [13]:
# Pair each feature name with its importance, then sort from most to least important
res_dTree_gini = sorted(
    zip(feats.columns, dTree_gini.feature_importances_),
    key=lambda x: x[1],
    reverse=True,
)
for row in res_dTree_gini:
    print(row)

('Elevation', 0.3355156449563527)
('Horizontal_Distance_To_Roadways', 0.1517793067932444)
('Horizontal_Distance_To_Fire_Points', 0.14334439967046675)
('Horizontal_Distance_To_Hydrology', 0.062457634921759896)
('Vertical_Distance_To_Hydrology', 0.04408045488878572)
('Hillshade_Noon', 0.033247085162902826)
('Hillshade_9am', 0.02986512860434374)
('Aspect', 0.026133021303965703)
('Hillshade_3pm', 0.021994512573794523)
('Slope', 0.017144924010747527)
('Wilderness_Area3', 0.013514764573181242)
('Soil_Type32', 0.012577568085688955)
('Soil_Type4', 0.01159510367273196)
('Soil_Type2', 0.010086355463854075)
('Soil_Type23', 0.009973041152741256)
('Wilderness_Area1', 0.009384731799897069)
('Soil_Type22', 0.007721544549167265)
('Soil_Type29', 0.007433454516058362)
('Soil_Type31', 0.005168908485206785)
('Soil_Type24', 0.00504022149899362)
('Soil_Type33', 0.004911952700481703)
('Wilderness_Area2', 0.004012570408032134)
('Soil_Type39', 0.0035334248336770834)
('Soil_Type30', 0.0032445532184575804)
('Soi

And then from **Decision Trees using Entropy**:

In [14]:
# Pair each feature name with its importance, then sort from most to least important
res_dTree_ent = sorted(
    zip(feats.columns, dTree_ent.feature_importances_),
    key=lambda x: x[1],
    reverse=True,
)
for row in res_dTree_ent:
    print(row)

('Elevation', 0.4156200238385435)
('Horizontal_Distance_To_Fire_Points', 0.1293779762112096)
('Horizontal_Distance_To_Roadways', 0.12707525618127335)
('Horizontal_Distance_To_Hydrology', 0.05370760323633776)
('Vertical_Distance_To_Hydrology', 0.040278883675037694)
('Hillshade_Noon', 0.02580366895038672)
('Aspect', 0.024919807680558133)
('Hillshade_9am', 0.023399465135678723)
('Wilderness_Area1', 0.019856997630487483)
('Hillshade_3pm', 0.018390709922365998)
('Slope', 0.012700106586081391)
('Wilderness_Area3', 0.011783293607673474)
('Soil_Type32', 0.008839190547693835)
('Soil_Type4', 0.008287173509378087)
('Soil_Type23', 0.00714431681175647)
('Soil_Type22', 0.006633197541515513)
('Soil_Type29', 0.005709589051130517)
('Soil_Type2', 0.005482861665056215)
('Soil_Type39', 0.0052271549944714205)
('Soil_Type24', 0.005053822039222676)
('Wilderness_Area2', 0.005035830355800437)
('Soil_Type31', 0.004499399652461103)
('Soil_Type38', 0.004141559517816544)
('Soil_Type10', 0.003856903538800462)
('Soi
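
The two rankings are easier to compare side by side. One possible way, sketched here on small synthetic stand-in data with hypothetical feature names `f0` to `f4` (not the forest cover columns), is to collect both sets of importances in a single DataFrame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself would use feats and labels.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(X.shape[1])]  # hypothetical feature names
X = pd.DataFrame(X, columns=cols)

# One column of importances per split criterion, indexed by feature name.
table = pd.DataFrame(
    {
        crit: DecisionTreeClassifier(random_state=0, criterion=crit)
        .fit(X, y)
        .feature_importances_
        for crit in ("gini", "entropy")
    },
    index=cols,
)
print(table.sort_values("gini", ascending=False))
```

Sorting by one column keeps each feature's two scores on the same row, so differences in rank between the criteria are visible at a glance.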

In [18]:
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

In [17]:
!pip search pydotplus

Traceback (most recent call last):
  File "/usr/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
ImportError: No module named pkg_resources


In [23]:
names = list(feats)

In [24]:
decTreePartial1 = DecisionTreeClassifier(random_state=0, max_depth=3, criterion='gini').fit(feats, labels)

In [25]:
Source(tree.export_graphviz(decTreePartial1, out_file=None, feature_names=names, class_names=['1','2','3','4','5','6','7'], filled=True))

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x12373e940>
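
Rendering through `Source` needs the Graphviz `dot` executable on the PATH, which is what the error above reports as missing. If installing Graphviz is not an option, one alternative is scikit-learn's `export_text`, which prints the tree as plain text and needs no external program. A minimal sketch on synthetic stand-in data with hypothetical feature names (not the forest cover columns):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the notebook itself would use feats and labels.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Same settings as the partial tree above: Gini criterion, depth capped at 3.
small_tree = DecisionTreeClassifier(
    random_state=0, max_depth=3, criterion="gini"
).fit(X, y)

# Plain-text rendering of the split rules and leaf classes.
text_view = export_text(small_tree, feature_names=names)
print(text_view)
```

`sklearn.tree.plot_tree` is another option when matplotlib is available and a graphical view is preferred.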