# Basics of scikit-learn with Decision Trees

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">


In this notebook we present some basics of scikit-learn: preparing training data, fitting a model, predicting against it, and exporting it.

We illustrate this with Decision Tree models and also show some features specific to trees.

## 1. Loading data

### Reading from CSV

We use the Pandas library to easily load CSV files:

In [1]:
from pandas import read_csv
# path = "https://oml-data.s3.amazonaws.com/" # load data from http location; you can also load from local path
path = 'data/'
data = read_csv(path + "boston-housing.csv", index_col=0)
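
As a minimal sketch of what `index_col=0` does, the example below reads a tiny in-memory CSV instead of the real file. The two rows and the reduced column set are made up for illustration; only `read_csv` and `index_col` come from the cell above.

```python
import io
from pandas import read_csv

# Hypothetical two-row CSV whose first column ("id") identifies each row.
csv_text = "id,crim,medv\n1,0.00632,24.0\n2,0.02731,21.6\n"

# index_col=0 turns the first column into the row index instead of a data column.
data = read_csv(io.StringIO(csv_text), index_col=0)

print(data.index.tolist())    # row labels taken from the "id" column
print(data.columns.tolist())  # remaining columns are regular data columns
```

Without `index_col=0`, the `id` column would be kept as a regular column and would end up among the model inputs later on.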

### Inspecting data

I recommend inspecting the data in a spreadsheet program and in a data visualization tool. Pandas can also be used to some extent. Here's a quick way to make sure the data was read correctly:

In [96]:
data.head()

| id | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |


### Define inputs and outputs

Pandas uses its own data structures. We need to select which columns are inputs and which are outputs, then convert them to NumPy arrays that scikit-learn can understand. The usual convention is to call the set of inputs `X` and the set of outputs `y`.

In [97]:
target_column = 'medv'
outputs = data[target_column]
y = outputs.values

features = data.drop(target_column, axis=1)
X = features.values
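
The same split can be sketched on a miniature DataFrame to check the resulting types and shapes. The three rows and the reduced set of feature columns below are illustrative, not the full dataset.

```python
import pandas as pd

# Hypothetical miniature dataset: two feature columns and one target column.
data = pd.DataFrame({
    "rm": [6.575, 6.421, 7.185],
    "lstat": [4.98, 9.14, 4.03],
    "medv": [24.0, 21.6, 34.7],
})

target_column = "medv"
y = data[target_column].values                 # 1-D array of outputs
X = data.drop(target_column, axis=1).values    # 2-D array of inputs

print(type(X).__name__, X.shape)  # one row per example, one column per feature
print(type(y).__name__, y.shape)  # one value per example
```

`X` has one row per example and one column per feature, while `y` has one entry per example; scikit-learn expects the two to have the same number of rows.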

We can use `print` to see the contents of variables:

In [98]:
print(X)

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]


In [99]:
print(y)

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22.  20.3 20.5 17.3 18.8 21.4
 15.7 16.2 18.  14.3 19.2 19.6 23.  18.4 15.6 18.1 17.4 17.1 13.3 17.8
 14.  14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
 17.  15.6 13.1 41.3 24.3 23.3 27.  50.  50.  50.  22.7 25.  50.  23.8
 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
 37.9 32.5 26.4 29.6 50.  32.  29.8 34.9 37.  30.5 36.4 31.1 29.1 50.
 33.3 3

## 2. Initializing an estimator

* Implementations of learning algorithms reside in _estimator_ objects in scikit.
* An estimator is an object that can "fit" a model to data. More info [here](http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects).
* The estimator's (hyper)parameters are set upon initialization — explicitly, or implicitly to default values.
* Estimator variables are often given names like `model`, but the actual model is "empty" at first, since no data has been seen yet.

Here is how to initialize a `DecisionTreeRegressor` estimator (the target `medv` is a continuous value, so this is a regression problem):

In [100]:
from sklearn import tree
regressor = tree.DecisionTreeRegressor(max_depth=None)

* Check out the other arguments that this constructor can take, from the inline documentation (via Shift + Tab) or the [online documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)
* Scikit's online documentation on [trees](http://scikit-learn.org/stable/modules/tree.html) also contains practical tips and theoretical explanations
* Also try using a kNN model, with `neighbors.KNeighborsRegressor` (or `neighbors.KNeighborsClassifier` for a classification problem)
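
The points above about hyperparameters and the initially "empty" model can be checked with a short sketch. It assumes only that scikit-learn is installed; `get_params` is the standard estimator method for listing hyperparameters.

```python
from sklearn import tree

regressor = tree.DecisionTreeRegressor(max_depth=None)

# Hyperparameters are fixed at initialization, explicitly or via defaults.
params = regressor.get_params()
print(params["max_depth"])          # the value passed explicitly above
print(params["min_samples_split"])  # a default filled in by scikit-learn

# No data has been seen yet, so there is no fitted tree attached to the estimator.
print(hasattr(regressor, "tree_"))
```

The `tree_` attribute only appears once `fit` has been called, which is what makes the estimator an actual model.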

## 3. Learning a model and predicting

It's time to actually train the model from the training inputs and outputs:

In [101]:
model = regressor.fit(X, y)

Let's make a prediction with our new model, using the first row of the training data as input:

In [102]:
new_x = [ [0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98] ]
model.predict(new_x)

array([24.])
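
The prediction above matches the `medv` value of the first training row exactly. A likely reason is that a tree with `max_depth=None` keeps splitting until it reproduces the training outputs. The sketch below illustrates this on synthetic random data (not the housing data), and contrasts it with a depth-limited tree.

```python
import numpy as np
from sklearn import tree

# Synthetic data for illustration only: 20 examples, 3 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)

# Unrestricted depth: the tree can isolate every training example in its own leaf.
full_model = tree.DecisionTreeRegressor(max_depth=None).fit(X, y)
full_matches = np.allclose(full_model.predict(X), y)

# Depth 1: only two leaves, so at most two distinct predicted values.
shallow_model = tree.DecisionTreeRegressor(max_depth=1).fit(X, y)
shallow_matches = np.allclose(shallow_model.predict(X), y)

print(full_matches, shallow_matches)
```

Predicting on training rows therefore says little about how the model behaves on new inputs.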

## 4. Exporting a model

The standard thing to do to persist the `model` object is to save it to a file using the pickle library:

In [59]:
import pickle
# The with block closes the file once the model has been written
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

You can then load it back:

In [60]:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

To use the loaded model object, you'll need a version of scikit-learn that is compatible with the one used to save it.
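
One way to keep track of the version requirement is to store the scikit-learn version next to the model. This is only a sketch: the `payload` dictionary layout is an assumption made for this example, not a scikit-learn convention, and the tiny training set is made up.

```python
import pickle
import sklearn
from sklearn import tree

# Tiny made-up training set, just to have a fitted model to persist.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
model = tree.DecisionTreeRegressor().fit(X, y)

# Hypothetical payload layout: the model plus the version that produced it.
payload = {"sklearn_version": sklearn.__version__, "model": model}
blob = pickle.dumps(payload)  # in-memory equivalent of pickle.dump to a file

restored = pickle.loads(blob)
if restored["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with scikit-learn", restored["sklearn_version"])

print(restored["model"].predict([[2.0]]))
```

The restored model should give the same predictions as the original one when both are used with the same scikit-learn version.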

### Exporting a tree

The structure of a scikit Decision Tree is completely "open". We can navigate through it and generate another representation of it in our language of choice.

Here we export a description of the tree in a format that can be read by the popular D3.js visualization library in JavaScript (see [source](http://planspace.org/20151129-see_sklearn_trees_with_d3/)).

In [61]:
def rules(clf, features, labels, node_index=0):
    """Recursively convert a fitted tree classifier into nested dicts for D3.js."""
    node = {}
    if clf.tree_.children_left[node_index] == -1:  # indicates leaf
        # Pair each per-class value stored at this leaf with its class label
        count_labels = zip(clf.tree_.value[node_index, 0], labels)
        node['name'] = ', '.join(('{} of {}'.format(int(count), label)
                                  for count, label in count_labels))
    else:
        # Internal node: look up the feature name and threshold of the split
        feature = features[clf.tree_.feature[node_index]]
        threshold = clf.tree_.threshold[node_index]
        node['name'] = '{} > {}'.format(feature, threshold)
        left_index = clf.tree_.children_left[node_index]
        right_index = clf.tree_.children_right[node_index]
        # The right child holds samples with feature > threshold, so it comes first
        node['children'] = [rules(clf, features, labels, right_index),
                            rules(clf, features, labels, left_index)]
    return node
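
The function above relies on a few arrays exposed by the fitted `tree_` object. As a rough sketch of what they contain, the example below fits a depth-1 classifier on scikit-learn's bundled Iris data (an assumption of this example, chosen because it needs no CSV file) and walks over its nodes.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
leaves = []
for node_index in range(t.node_count):
    if t.children_left[node_index] == -1:  # -1 marks a leaf, as in rules()
        leaves.append(node_index)
        print(node_index, "leaf")
    else:
        # Internal node: index of the feature tested and the threshold used
        print(node_index, "split on feature", t.feature[node_index],
              "at threshold", t.threshold[node_index],
              "-> children", t.children_left[node_index],
              t.children_right[node_index])
```

Each node is identified by its position in these arrays, which is why `rules` can recurse simply by passing child indices around.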

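Apply that function to a tree model. The call and output below come from a Decision Tree classifier trained on the Iris dataset, hence the Iris feature and class names: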

In [62]:
r = rules(model, ['sepal L', 'sepal W', 'petal L', 'petal W'], ['setosa', 'versicolor', 'virginica'])
print(r)

{'name': 'petal L > 2.450000047683716', 'children': [{'name': 'petal W > 1.75', 'children': [{'name': 'petal L > 4.850000381469727', 'children': [{'name': '0 of setosa, 0 of versicolor, 43 of virginica'}, {'name': 'sepal W > 3.0999999046325684', 'children': [{'name': '0 of setosa, 1 of versicolor, 0 of virginica'}, {'name': '0 of setosa, 0 of versicolor, 2 of virginica'}]}]}, {'name': 'petal L > 4.949999809265137', 'children': [{'name': 'petal W > 1.5499999523162842', 'children': [{'name': 'petal L > 5.449999809265137', 'children': [{'name': '0 of setosa, 0 of versicolor, 1 of virginica'}, {'name': '0 of setosa, 2 of versicolor, 0 of virginica'}]}, {'name': '0 of setosa, 0 of versicolor, 3 of virginica'}]}, {'name': 'petal W > 1.6500000953674316', 'children': [{'name': '0 of setosa, 0 of versicolor, 1 of virginica'}, {'name': '0 of setosa, 47 of versicolor, 0 of virginica'}]}]}]}, {'name': '50 of setosa, 0 of versicolor, 0 of virginica'}]}
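
`print` shows the Python representation of the dictionary, which uses single quotes and is not valid JSON. To hand the structure to D3.js, it can be serialized with the standard `json` module. The small dictionary below is a made-up stand-in with the same shape as the output of `rules`.

```python
import json

# Hypothetical rule dictionary in the same nested shape as the output of rules().
r = {
    "name": "petal L > 2.45",
    "children": [
        {"name": "0 of setosa, 50 of versicolor, 50 of virginica"},
        {"name": "50 of setosa, 0 of versicolor, 0 of virginica"},
    ],
}

# Serialize to a JSON string; json.dump(r, f) would write it to an open file instead.
text = json.dumps(r, indent=2)
print(text)

# Reading the string back gives an identical structure.
round_trip = json.loads(text)
```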


## 5. Creating a simpler tree

First, look at feature importances:

In [103]:
model.feature_importances_

array([3.75389548e-02, 5.27234900e-04, 4.83924045e-03, 7.71886017e-04,
       6.27727806e-02, 5.75983675e-01, 1.09311768e-02, 7.29891043e-02,
       1.33387525e-03, 1.15587186e-02, 7.12232927e-03, 6.47109221e-03,
       2.07159932e-01])
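
The importances are listed in the same order as the feature columns. The sketch below pairs the values printed above with the column names from `data.head()` and ranks them; the array is copied from the output rather than recomputed.

```python
import numpy as np

# Feature columns in the order they appear in the data (target "medv" removed).
feature_names = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                 "dis", "rad", "tax", "ptratio", "black", "lstat"]

# Values copied from the feature_importances_ output above.
importances = np.array([
    3.75389548e-02, 5.27234900e-04, 4.83924045e-03, 7.71886017e-04,
    6.27727806e-02, 5.75983675e-01, 1.09311768e-02, 7.29891043e-02,
    1.33387525e-03, 1.15587186e-02, 7.12232927e-03, 6.47109221e-03,
    2.07159932e-01])

# Sort feature indices from most to least important.
order = np.argsort(importances)[::-1]
top_two = [feature_names[i] for i in order[:2]]
print(top_two)  # ['rm', 'lstat']
```

For this regressor, `rm` and `lstat` (columns 5 and 12) account for most of the total importance.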

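Only keep the two most important features. For the Iris classifier used in the export example, these are the last two features, petal length and petal width (columns 2 and 3):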

In [70]:
model = tree.DecisionTreeClassifier(max_depth=3)
model = model.fit(X[:, [2, 3]], y)

In [73]:
model.feature_importances_

array([0.58561555, 0.41438445])
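
To see how much simpler the restricted tree is, its depth and node count can be compared with those of an unrestricted tree. This sketch assumes scikit-learn's bundled Iris data in place of the CSV file, and fixes `random_state` only to make runs repeatable.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Unrestricted tree on all four features.
full = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Depth-limited tree on the last two features only (columns 2 and 3).
small = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
small = small.fit(X[:, [2, 3]], y)

print("full: ", full.get_depth(), "levels,", full.tree_.node_count, "nodes")
print("small:", small.get_depth(), "levels,", small.tree_.node_count, "nodes")
```

A tree of depth 3 has at most 15 nodes, which keeps the exported rule set short enough to read.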

Export tree as rule-set:

In [65]:
r = rules(model, ['petal L', 'petal W'], ['setosa', 'versicolor', 'virginica'])
print(r)

{'name': 'petal L > 2.450000047683716', 'children': [{'name': 'petal W > 1.75', 'children': [{'name': 'petal L > 4.850000381469727', 'children': [{'name': '0 of setosa, 0 of versicolor, 43 of virginica'}, {'name': '0 of setosa, 1 of versicolor, 2 of virginica'}]}, {'name': 'petal L > 4.949999809265137', 'children': [{'name': '0 of setosa, 2 of versicolor, 4 of virginica'}, {'name': '0 of setosa, 47 of versicolor, 1 of virginica'}]}]}, {'name': '50 of setosa, 0 of versicolor, 0 of virginica'}]}


The same initialize, fit, predict and export steps work with a kNN classifier:

In [66]:
from sklearn import neighbors
knn_classifier = neighbors.KNeighborsClassifier()

In [67]:
knn_model = knn_classifier.fit(X, y)

In [68]:
new_x = [ [1.2,  3.0,  5.4,  4.2], [1.2,  3.0,  5.4,  4.2] ]
knn_model.predict(new_x)

array(['Iris-virginica', 'Iris-virginica'], dtype=object)

In [69]:
import pickle
# The with block closes the file once the model has been written
with open('knn_model.pkl', 'wb') as f:
    pickle.dump(knn_model, f)