# Machine Learning in Julia: Decision Trees (DecisionTree) and Their Advanced Version, RandomForest
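- decision tree: classifies a sample by applying a sequence of feature-threshold tests, ending at a leaf that gives the predicted class
- random forest: an ensemble of decision trees that mitigates the overfitting a single tree is prone to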
> [Simple explanation](http://notebookpage1005.blogspot.com/2018/03/random-forest.html)

This example requires the DecisionTree and ScikitLearn packages. Install them before running the examples below.

```
] add DecisionTree
] add ScikitLearn
```

In [2]:
using Pkg
Pkg.add(["DecisionTree","ScikitLearn"])

[32m[1m   Updating[22m[39m registry at `C:\Users\HSI\.julia\registries\General`


[?25l

[32m[1m   Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`


[2K[?25h

[32m[1m  Resolving[22m[39m package versions...
[32m[1m  Installed[22m[39m ElasticArrays ─────── v1.2.0
[32m[1m  Installed[22m[39m ArrayLayouts ──────── v0.2.6
[32m[1m  Installed[22m[39m ProgressMeter ─────── v1.2.0
[32m[1m  Installed[22m[39m IRTools ───────────── v0.3.2
[32m[1m  Installed[22m[39m PyPlot ────────────── v2.9.0
[32m[1m  Installed[22m[39m JLD2 ──────────────── v0.1.13
[32m[1m  Installed[22m[39m Adapt ─────────────── v1.0.1
[32m[1m  Installed[22m[39m ZygoteRules ───────── v0.2.0
[32m[1m  Installed[22m[39m NBInclude ─────────── v2.2.0
[32m[1m  Installed[22m[39m LaTeXStrings ──────── v1.1.0
[32m[1m  Installed[22m[39m ElasticPDMats ─────── v0.2.1
[32m[1m  Installed[22m[39m FastGaussQuadrature ─ v0.4.2
[32m[1m  Installed[22m[39m GaussianMixtures ──── v0.3.1
[32m[1m  Installed[22m[39m Documenter ────────── v0.24.11
[32m[1m  Installed[22m[39m DecisionTree ──────── v0.10.1
[32m[1m  Installed[22m[39m NNlib ────────

In [3]:
using DecisionTree

┌ Info: Precompiling DecisionTree [7806a523-6efd-50cb-b5f6-3fa6f1930dbb]
└ @ Base loading.jl:1260


In [4]:
using ScikitLearn.CrossValidation: cross_val_score

┌ Info: Precompiling ScikitLearn [3646fa90-6ef7-5e7e-9f22-8aca16db6324]
└ @ Base loading.jl:1260


## Loading the data

In [5]:
features, labels = DecisionTree.load_data("iris");
# features: 150*4 Array{Any,2}
# labels: 150-element Array{Any,1}

## Casting

In [23]:
features = float.(features);
labels = string.(labels);

## Decision tree model

In [24]:
model = DecisionTree.DecisionTreeClassifier(max_depth=2)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

Available models:

* `DecisionTreeClassifier`
* `DecisionTreeRegressor`
* `RandomForestClassifier`
* `RandomForestRegressor`
* `AdaBoostStumpClassifier`

## Training

In [25]:
DecisionTree.fit!(model, features, labels)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
root:                     Decision Tree
Leaves: 3
Depth:  2

## Printing the decision tree

In [26]:
DecisionTree.print_tree(model, 5)

Feature 3, Threshold 2.45
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Iris-versicolor : 49/54
    R-> Iris-virginica : 45/46
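
The printed tree can be read as nested threshold tests. Below is a minimal sketch, assuming that a sample goes to the left branch (`L->`) when its feature value is below the threshold and that the columns follow the standard iris order (sepal length, sepal width, petal length, petal width). The function name `classify_iris` is hypothetical, and the thresholds are copied from the output above.

```julia
# Hand-written equivalent of the depth-2 tree printed above.
# `classify_iris` is a hypothetical helper, not part of DecisionTree.jl.
function classify_iris(x::AbstractVector{<:Real})
    if x[3] < 2.45            # Feature 3 (petal length): left branch
        return "Iris-setosa"
    elseif x[4] < 1.75        # Feature 4 (petal width): left branch
        return "Iris-versicolor"
    else                      # right branch of both tests
        return "Iris-virginica"
    end
end

classify_iris([5.9, 3.0, 5.1, 1.9])
```

Writing the tree out this way makes it easy to trace by hand which leaf a given sample reaches.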


## Prediction

In [27]:
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

In [28]:
DecisionTree.predict_proba(model, new_iris)

3-element Array{Float64,1}:
 0.0
 0.021739130434782608
 0.9782608695652174
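
These values appear to be the class proportions in the leaf that the sample reaches. The printed tree shows that leaf as `Iris-virginica : 45/46`, meaning 45 of its 46 training samples are virginica. Given the output above, the remaining one is versicolor. A quick check of the arithmetic (the variable names are illustrative only):

```julia
# Training-sample counts in the leaf `R-> Iris-virginica : 45/46`,
# ordered as setosa, versicolor, virginica.
leaf_counts = [0, 1, 45]

# Class proportions within that leaf.
leaf_proba = leaf_counts ./ sum(leaf_counts)
```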

## Classes corresponding to `predict_proba`

In [29]:
DecisionTree.get_classes(model)

3-element Array{String,1}:
 "Iris-setosa"
 "Iris-versicolor"
 "Iris-virginica"
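
Since the probabilities and the class labels come back as two separate arrays in the same order, it can help to pair them up. A small sketch using the values copied from the outputs above (with a fitted model they would come from `get_classes` and `predict_proba`); the variable names are illustrative:

```julia
# Values copied from the outputs above.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
proba = [0.0, 0.021739130434782608, 0.9782608695652174]

# Map each class label to its probability.
proba_by_class = Dict(zip(classes, proba))

# The most probable class matches what `predict` returned.
best_class = classes[argmax(proba)]
```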

## Random forest model

In [30]:
model = DecisionTree.RandomForestClassifier(n_trees=50, max_depth=2)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             nothing
ensemble:            nothing

## Training

In [31]:
DecisionTree.fit!(model, features, labels)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
ensemble:            Ensemble of Decision Trees
Trees:      50
Avg Leaves: 3.2
Avg Depth:  2.0

## Prediction

In [32]:
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

## Cross-validation
- `cross_val_score` comes from ScikitLearn.
- `cv = n` means n-fold cross-validation: the data are split into n parts. In each round, one part is held out for testing and the other n-1 parts are used for training. This repeats until every part has served as the test set once. (ZK 2020-05-16). See [k-fold cv](https://www.google.com/search?client=firefox-b-d&q=k+fold+cross+validation).
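
To make the splitting concrete, here is a minimal sketch of how indices could be divided into folds. The helper `kfold_indices` is hypothetical and is not how ScikitLearn implements it; real implementations may also shuffle or stratify the samples, which this sketch does not do.

```julia
# Hypothetical helper: split the indices 1:n into k contiguous folds and
# return a vector of (train, test) index pairs.
function kfold_indices(n::Integer, k::Integer)
    # Fold sizes; the first rem(n, k) folds get one extra sample.
    sizes = [div(n, k) + (i <= rem(n, k) ? 1 : 0) for i in 1:k]
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    return [(setdiff(1:n, starts[i]:stops[i]), collect(starts[i]:stops[i]))
            for i in 1:k]
end

# 150 samples, 5 folds: each round trains on 120 samples and tests on 30.
folds = kfold_indices(150, 5)
for (train, test) in folds
    println("train: ", length(train), "  test: ", length(test))
end
```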

In [35]:
accuracy = cross_val_score(model, features, labels, cv=5) # 5-fold
# cross_val_score comes from ScikitLearn

5-element Array{Float64,1}:
 0.9333333333333333
 0.9666666666666667
 0.9
 0.9
 1.0
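
The per-fold scores are usually summarized by their mean and spread. A short sketch using the values copied from the output above (about 0.94 on average); `Statistics` is part of Julia's standard library:

```julia
using Statistics

# Fold accuracies copied from the output above.
accuracy = [0.9333333333333333, 0.9666666666666667, 0.9, 0.9, 1.0]

mean_accuracy = mean(accuracy)   # average accuracy over the 5 folds
std_accuracy = std(accuracy)     # spread of accuracy across folds
```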