### Classification with decision trees

| ID | Refund | Marital Status | Income | Cheat
| --- | --- | --- | --- | ---
| 1 | Yes | Single | 125K | No
| 2 | No | Married | 100K | No
| 3 | No | Single | 70K | No
| 4 | Yes | Married | 120K | No
| 5 | No | Divorced | 95K | Yes
| 6 | No | Married | 60K | No
| 7 | Yes | Divorced | 220K | No
| 8 | No | Single | 85K | Yes
| 9 | No | Married | 75K | No
| 10 | No | Single | 90K | Yes


**Impurity Measures:**
* Gini index $I(S) = 1 - \sum\limits_k (p_k)^2$
* Entropy  $I(S) = -\sum\limits_k p_k \log(p_k)$
* Misclassification error  $I(S) = 1 - \max\limits_k p_k$

$p_k$ is the proportion of class $k$ among the samples in tree node $S$
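
The three measures can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names are made up for this example, the input is a list of per-class counts, and the natural logarithm is assumed for entropy.

```python
import math

def proportions(counts):
    """Turn class counts into proportions p_k, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    """Gini index: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in proportions(counts))

def entropy(counts):
    """Entropy with the natural logarithm: -sum_k p_k * log(p_k)."""
    return -sum(p * math.log(p) for p in proportions(counts))

def misclassification(counts):
    """Misclassification error: 1 - max_k p_k."""
    return 1.0 - max(proportions(counts))

# Class counts from the table: 3 rows with Cheat = Yes, 7 with Cheat = No
counts = [3, 7]
print(gini(counts), entropy(counts), misclassification(counts))
```

All three are zero for a pure node and largest when the classes are evenly mixed.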

**The information gain:** <br/>
$$ Gain(S, A) = I(S) - \sum\limits_v\frac{|S_v|}{|S|}\cdot I(S_v),$$ where $A$ is a feature, $v$ ranges over its values, and $S_v$ is the subset of $S$ in which $A = v$

For our example, using entropy with the natural logarithm:
$$I(S) = -(\frac{3}{10}\log(\frac{3}{10}) + \frac{7}{10}\log(\frac{7}{10})) = 0.61$$

For the feature *Marital Status*:

$$Gain(S, \text{`Marital Status`}) = I(S) - (\frac{4}{10}\cdot I(S_{single}) + \frac{2}{10}\cdot I(S_{divorced}) + \frac{4}{10}\cdot I(S_{married})) =  0.19$$
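
The intermediate terms can be read off the table, assuming entropy with the natural logarithm as in $I(S)$: the Single group has 2 "Yes" and 2 "No", the Divorced group has 1 "Yes" and 1 "No", and the Married group has 4 "No" only.

$$I(S_{single}) = I(S_{divorced}) = -(\frac{1}{2}\log(\frac{1}{2}) + \frac{1}{2}\log(\frac{1}{2})) = \log 2 \approx 0.693, \qquad I(S_{married}) = 0$$

$$Gain(S, \text{`Marital Status`}) \approx 0.611 - (\frac{4}{10}\cdot 0.693 + \frac{2}{10}\cdot 0.693 + \frac{4}{10}\cdot 0) \approx 0.611 - 0.416 = 0.195$$

This is consistent with the value of 0.19 quoted above.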

In [None]:
# Your code here

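### Regression with decision trees

The impurity of a node is the mean squared error (MSE) around the node's mean target value $c$, and splits are chosen to minimize it: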
$$I(S) = \frac{1}{|S|} \sum\limits_{i \in S} (y_i - c)^2 $$ 
$$ c = \frac{1}{|S|}\sum\limits_{i \in S} y_i $$
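
The sketch below shows how this impurity can drive the choice of a split on a single feature. It is an illustration under simplifying assumptions, not how any particular library implements it: the helper names and the toy data are made up, candidate thresholds are taken as midpoints between consecutive sorted feature values, and a split is scored by the size-weighted MSE of the two children.

```python
def mse(ys):
    """Impurity I(S): mean squared deviation from the node mean c."""
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Return (threshold, score) minimizing the weighted MSE of the children."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        # No threshold can separate two equal feature values
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each child's impurity by its share of the samples
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (hypothetical): two clusters with clearly different target levels
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(xs, ys)
print(threshold, score)
```

On this toy data the chosen threshold falls between the two clusters, because that split leaves each child with targets close to its own mean.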

### How to visualize trees?

We will use the `graphviz` library. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/tree.html) has an example of a rendered tree.

To print the decision rules, you can adapt the code from [this Stack Overflow question](http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree):

In [None]:
def get_code(tree, feature_names):
    """Print the rules of a fitted scikit-learn tree as nested if/else pseudocode."""
    left = tree.tree_.children_left
    right = tree.tree_.children_right
    threshold = tree.tree_.threshold
    feature = tree.tree_.feature
    value = tree.tree_.value

    def recurse(node):
        # Leaves have no children: children_left is -1 (and feature is -2)
        if left[node] == -1:
            print("return " + str(value[node]))
            return
        # Look the name up here, not up front: leaves store feature index -2
        print("if ( " + feature_names[feature[node]] + " <= " + str(threshold[node]) + " ) {")
        recurse(left[node])
        print("} else {")
        recurse(right[node])
        print("}")

    # Start the traversal from the root node
    recurse(0)
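
To see what this traversal relies on, here is a small standalone sketch that mimics the parallel arrays of `tree_` with hand-written lists. The arrays, values and the `rules` helper are hypothetical and only illustrate the encoding: each node is an index, leaves have `-1` as children and `-2` as feature and threshold.

```python
# Hypothetical 3-node tree: node 0 splits on x1 <= 0.5, nodes 1 and 2 are leaves
children_left = [1, -1, -1]
children_right = [2, -1, -1]
feature = [0, -2, -2]
threshold = [0.5, -2.0, -2.0]
value = [[5, 5], [5, 0], [0, 5]]  # class counts reaching each node
feature_names = ["x1", "x2"]

def rules(node=0, depth=0):
    """Collect the rules below `node` as indented lines of pseudocode."""
    indent = "    " * depth
    if children_left[node] == -1:  # leaf: report the stored value
        return [f"{indent}return {value[node]}"]
    name = feature_names[feature[node]]
    lines = [f"{indent}if {name} <= {threshold[node]}:"]
    lines += rules(children_left[node], depth + 1)
    lines.append(f"{indent}else:")
    lines += rules(children_right[node], depth + 1)
    return lines

rule_lines = rules()
print("\n".join(rule_lines))
```

Samples satisfying the condition go to the left child, all others go to the right child.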

### Let's start

In [None]:
import pandas as pd
import numpy as np
import subprocess

from sklearn.datasets import make_circles
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import matplotlib.pyplot as plt

plt.style.use('ggplot')

%matplotlib inline

#### Classification

In [None]:
X, y = make_circles(n_samples=500, noise=0.1, factor=0.2)

In [None]:
# Your code here

In [None]:
with open('tree.dot', 'w') as fout:
    export_graphviz(tree, out_file=fout, feature_names=['x1', 'x2'], class_names=['0', '1'])
command = ['dot', '-Tpng', 'tree.dot', '-o', 'tree.png']
subprocess.check_call(command)
plt.figure(figsize=(10, 10))
plt.imshow(plt.imread('tree.png'))
plt.axis('off')

#### Regression

In [None]:
X = np.linspace(-5, 5, 100)
X += np.random.randn(X.shape[0])

y = 5*np.sin(2*X) + X**2
y += 2*np.random.randn(y.shape[0])

In [None]:
plt.scatter(X, y)

In [None]:
# Your code here

In [None]:
with open('tree.dot', 'w') as fout:
    export_graphviz(tree, out_file=fout, feature_names=['x1'])
command = ['dot', '-Tpng', 'tree.dot', '-o', 'tree.png']
subprocess.check_call(command)
plt.figure(figsize=(20, 20))
plt.imshow(plt.imread('tree.png'))
plt.axis('off')

### Practice

Download the [data](https://www.dropbox.com/s/3t1moa1wpflx2u9/california.dat?dl=0) and load it.

**Task 1:** Find the optimal depth of the tree.<br/>
Split the data into train and test sets in a 70/30 proportion.<br/>
Train a decision tree for each depth from `1` to `30`. For each depth, evaluate the MSE on both the train and the test set.<br/>
Plot both errors against the depth.

In [None]:
# Your code here

**Task 2:** Inspect the feature importances via `DecisionTreeRegressor.feature_importances_`.

In [None]:
# Your code here