[Link to tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles-3/)

In [1]:
using Pkg
Pkg.activate(".")
Pkg.instantiate()

  Activating project at `~/Documents/Personal Stuff/Repos/mike_scratch/mlj_tutorial/A-ensembles-3`


└ @ nothing /Users/michaelherman/Documents/Personal Stuff/Repos/mike_scratch/mlj_tutorial/A-ensembles-3/Manifest.toml:0


Precompiling project...
  ✓ DecisionTree
  ✓ MLJDecisionTreeInterface
  2 dependencies successfully precompiled in 2 seconds. 94 already precompiled.


This tutorial creates a homogeneous ensemble using learning networks.

No bagging is used, so every atomic model learns the same parameters unless its training algorithm is itself randomized (e.g. DecisionTree, which can randomly subsample features at each node).

Note that MLJ has a built-in model wrapper called `EnsembleModel` for creating bagged ensembles.
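
To make the point about randomness concrete, here is a minimal sketch in plain Julia, with no MLJ involved. The `fit_predict` function is a hypothetical stand-in for an atomic model: it predicts the training mean and can optionally add noise to mimic a randomized learner. It only illustrates why averaging identical deterministic models gains nothing, while averaging randomized ones does produce distinct members to combine.

```julia
using Random
using Statistics

# Hypothetical stand-in for an atomic model: predict the training mean,
# optionally perturbed by noise from `rng` to mimic a randomized learner.
fit_predict(y; rng=nothing, noise=0.0) =
    mean(y) + (rng === nothing ? 0.0 : noise * randn(rng))

y = [1.0, 2.0, 3.0, 4.0]

# Deterministic atoms: all 100 predictions coincide, so the ensemble
# average is just the single-model prediction.
deterministic = [fit_predict(y) for _ in 1:100]
ensemble_det = mean(deterministic)

# Randomized atoms: each prediction differs, so averaging them is meaningful.
rng = MersenneTwister(0)
randomized = [fit_predict(y; rng=rng, noise=1.0) for _ in 1:100]
ensemble_rand = mean(randomized)
```

With a real randomized learner such as a decision tree that subsamples features, the variation comes from the training algorithm itself rather than from injected noise.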

In [2]:
using MLJ
using PyPlot
import Statistics

In [3]:
Xs = source()
ys = source()
DecisionTreeRegressor = @load DecisionTreeRegressor pkg=DecisionTree
atom = DecisionTreeRegressor()

machines = (machine(atom, Xs, ys) for i in 1:100)

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/michaelherman/.julia/packages/MLJModels/tMgLW/src/loading.jl:168


import MLJDecisionTreeInterface ✔


Base.Generator{UnitRange{Int64}, var"#9#10"}(var"#9#10"(), 1:100)

Overloading `mean` for nodes.

In [6]:
Statistics.mean(v...) = mean(v)
Statistics.mean(v::AbstractVector{<:AbstractNode}) = node(mean, v...)

yhat = mean([predict(m, Xs) for m in machines])

Node{Nothing}
  args:
    1:	Node{Machine{DecisionTreeRegressor,…}}
    2:	Node{Machine{DecisionTreeRegressor,…}}
    3:	Node{Machine{DecisionTreeRegressor,…}}
    4:	Node{Machine{DecisionTreeRegressor,…}}
    5:	Node{Machine{DecisionTreeRegressor,…}}
    6:	Node{Machine{DecisionTreeRegressor,…}}
    7:	Node{Machine{DecisionTreeRegressor,…}}
    8:	Node{Machine{DecisionTreeRegressor,…}}
    9:	Node{Machine{DecisionTreeRegressor,…}}
    10:	Node{Machine{DecisionTreeRegressor,…}}
    11:	Node{Machine{DecisionTreeRegressor,…}}
    12:	Node{Machine{DecisionTreeRegressor,…}}
    13:	Node{Machine{DecisionTreeRegressor,…}}
    14:	Node{Machine{DecisionTreeRegressor,…}}
    15:	Node{Machine{DecisionTreeRegressor,…}}
    16:	Node{Machine{DecisionTreeRegressor,…}}
    17:	Node{Machine{DecisionTreeRegressor,…}}
    18:	Node{Machine{DecisionTreeRegressor,…}}
    19:	Node{Machine{DecisionTreeRegressor,…}}
    20:	Node{Machine{DecisionTreeRegressor,…}}
    21:	Node{Machine{DecisionTreeRegressor,…}}
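
The idea behind overloading `mean` for nodes can be sketched without MLJ. The `LazyNode` type and `leaf` helper below are hypothetical stand-ins, not MLJ's actual `Node` implementation. They show only the general pattern: averaging a vector of deferred computations returns another deferred computation, which is evaluated when called.

```julia
using Statistics

# Hypothetical stand-in for a learning-network node: a function plus the
# upstream nodes whose values it will be applied to.
struct LazyNode
    f::Function
    args::Tuple
end

# Calling a node evaluates its upstream nodes first, then applies `f`.
(n::LazyNode)() = n.f((a() for a in n.args)...)

# A constant leaf, loosely analogous to a source node wrapping data.
leaf(x) = LazyNode(() -> x, ())

# Extend `mean` so that averaging a vector of nodes builds a new node
# rather than attempting arithmetic on the nodes themselves.
Statistics.mean(v::AbstractVector{LazyNode}) =
    LazyNode((xs...) -> sum(xs) ./ length(xs), Tuple(v))

preds = [leaf([1.0, 2.0]), leaf([3.0, 4.0]), leaf([5.0, 9.0])]
yhat = mean(preds)   # still a LazyNode; nothing has been computed yet
result = yhat()      # evaluates the three leaves and averages elementwise
```

In the notebook cell above, `node(mean, v...)` plays the role of the `LazyNode` constructor here, and the `predict(m, Xs)` nodes play the role of the leaves.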


Defining a new composite model type and instance.

In [8]:
surrogate = Deterministic()
mach = machine(surrogate, Xs, ys; predict=yhat)

@from_network mach begin
    mutable struct OneHundredModels
        atom=atom
    end
end

one_hundred_models = OneHundredModels()

OneHundredModels(
    atom = DecisionTreeRegressor(
            max_depth = -1,
            min_samples_leaf = 5,
            min_samples_split = 2,
            min_purity_increase = 0.0,
            n_subfeatures = 0,
            post_prune = false,
            merge_purity_threshold = 1.0,
            rng = Random._GLOBAL_RNG()))

Application to data

In [9]:
X, y = @load_boston;

In [10]:
r = range(atom, :min_samples_split, lower=2, upper=100, scale=:log)
mach = machine(atom, X, y)
figure()
curve = learning_curve!(mach, range=r, measure=mav, resampling=CV(nfolds=9), verbosity=0)
plot(curve.parameter_values, curve.measurements)
xlabel(curve.parameter_name)

│   caller = npyinitialize() at numpy.jl:67
└ @ PyCall /Users/michaelherman/.julia/packages/PyCall/L0fLP/src/numpy.jl:67


PyObject Text(0.5, 0, 'min_samples_split')

![curve](https://juliaai.github.io/DataScienceTutorials.jl/assets/getting-started/ensembles-3/code/output/e1.svg)
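
The `scale=:log` option in the range above asks for candidate values spread evenly on a logarithmic axis. As a rough illustration only (this is not MLJ's internal code, and `log_grid` is a hypothetical helper), a grid of integer candidates between 2 and 100 could be produced like this:

```julia
# Hypothetical helper: up to `n` integer candidates between `lower` and
# `upper`, evenly spaced on a log10 scale, rounded, with duplicates dropped.
function log_grid(lower::Integer, upper::Integer, n::Integer)
    exps = range(log10(lower), log10(upper); length=n)
    return unique(round.(Int, 10 .^ exps))
end

grid = log_grid(2, 100, 10)
```

A log-spaced grid places more candidates at small values of `min_samples_split`, which is where each unit change alters the tree the most.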

Tune the regularization parameter for all trees in the ensemble simultaneously.

In [11]:
r = range(one_hundred_models, :(atom.min_samples_split), lower=2, upper=100, scale=:log)

mach = machine(one_hundred_models, X, y)

figure()
curve = learning_curve!(mach, range=r, measure=mav, resampling=CV(nfolds=9), verbosity=0)
plot(curve.parameter_values, curve.measurements)
xlabel(curve.parameter_name)

PyObject Text(0.5, 0, 'atom.min_samples_split')

![curve2](https://juliaai.github.io/DataScienceTutorials.jl/assets/getting-started/ensembles-3/code/output/e2.svg)