[Link to tutorial page](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles/)

# Ensemble models

## Preliminary steps

Activate the project environment, load the required packages, and generate dummy data.

In [5]:
using Pkg
Pkg.activate(".")
Pkg.instantiate()

  Activating project at `~/Repos/mike_scratch/mlj_tutorial/A-ensembles`

└ @ nothing /Users/mph/Repos/mike_scratch/mlj_tutorial/A-ensembles/Manifest.toml:0

   Installed PooledArrays ────────── v1.4.0
   Installed NearestNeighbors ────── v0.4.9
   Installed StaticArrays ────────── v1.3.2
   Installed NearestNeighborModels ─ v0.1.6
   Installed DataFrames ──────────── v1.3.1
Precompiling project...
  ✓ PooledArrays
  ✗ PyCall
  ✗ PyPlot
  ✓ StaticArrays
  ✓ NearestNeighbors
  ✓ NearestNeighborModels
  ✓ DataFrames
  5 dependencies successfully precompiled in 10 seconds. 91 already precompiled.
  2 dependencies errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the packages


In [6]:
using MLJ
import DataFrames: DataFrame
using PrettyPrinting
using StableRNGs

In [7]:
rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng,300))
X = DataFrame(Xraw, :auto)

train, test = partition(eachindex(y), 0.7)

([1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  201, 202, 203, 204, 205, 206, 207, 208, 209, 210], [211, 212, 213, 214, 215, 216, 217, 218, 219, 220  …  291, 292, 293, 294, 295, 296, 297, 298, 299, 300])
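For reference, the data-generating process in the cell above can be written out as follows. The symbols $x_{ij}$ and $\varepsilon_i$ are notation introduced here for illustration: $x_{ij}$ is entry $(i, j)$ of `Xraw`, and $\varepsilon_i$ is the extra uniform draw added inside the exponent.

$$
y_i = \exp\left(x_{i1} - x_{i2} - 2x_{i3} + 0.1\,\varepsilon_i\right),
\qquad x_{ij},\ \varepsilon_i \sim U[0, 1), \quad i = 1, \dots, 300.
$$

The target is therefore a smooth, strictly positive function of the three features, with a small multiplicative noise term.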

In [8]:
# load a simple model
KNNRegressor = @load KNNRegressor
knn_model = KNNRegressor(K=10)

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/tMgLW/src/loading.jl:168


import NearestNeighborModels ✔




KNNRegressor(
    K = 10,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = NearestNeighborModels.Uniform())
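To give a feel for what a model like this computes, here is a minimal brute-force sketch of K-nearest-neighbour regression with uniform weights. It is an illustration under stated assumptions, not the `NearestNeighborModels` implementation (which uses a k-d tree, per `algorithm = :kdtree`). The helper `knn_predict` and the toy data `Xtoy`, `ytoy` are hypothetical names introduced here.

```julia
# Hypothetical helper, not part of MLJ: brute-force KNN regression for one query point.
function knn_predict(Xtrain::AbstractMatrix, ytrain::AbstractVector,
                     xnew::AbstractVector; K::Int=10)
    # Squared Euclidean distance from xnew to every training row
    d2 = vec(sum(abs2, Xtrain .- xnew'; dims=2))
    # Indices of the K closest training points
    nearest = partialsortperm(d2, 1:K)
    # Uniform weights: plain average of the neighbours' targets
    return sum(ytrain[nearest]) / K
end

# Toy data: three points near the origin and one far away
Xtoy = [0.0 0.0; 1.0 0.0; 0.0 1.0; 5.0 5.0]
ytoy = [1.0, 2.0, 3.0, 100.0]

# With K = 3 the distant point is ignored and the three nearby targets are averaged
prediction = knn_predict(Xtoy, ytoy, [0.1, 0.1]; K=3)
```

A larger `K` averages over more neighbours, which smooths the predictions at the cost of local detail.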

In [9]:
knn = machine(knn_model, X, y)

Machine{KNNRegressor,…} trained 0 times; caches data
  model: NearestNeighborModels.KNNRegressor
  args: 
    1:	Source @255 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @220 ⏎ `AbstractVector{Continuous}`


In [10]:
fit!(knn, rows=train)
ŷ = predict(knn, rows=test)
rms(ŷ, y[test])

┌ Info: Training Machine{KNNRegressor,…}.
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/MuLnJ/src/machines.jl:464


0.06389980172436367
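The `rms` measure used above is the root mean squared error. Writing $n$ for the number of test rows (notation introduced here), it is

$$
\mathrm{rms}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2}.
$$

It is expressed in the same units as the target, so the value above can be read directly against the scale of `y`.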

In [11]:
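# Alternatively, `evaluate!` does the split, fit, and scoring in one call.
# Because `rng` is passed, `Holdout` shuffles the rows before splitting,
# so the score differs from the unshuffled manual split above.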
evaluate!(knn, resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)), measure=rms)

PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬──────────┐
│ measure                │ measurement │ operation │ per_fold │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ RootMeanSquaredError() │ 0.124       │ predict   │ [0.124]  │
└────────────────────────┴─────────────┴───────────┴──────────┘
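The split that a shuffled holdout performs can be sketched roughly as below. This is an illustration only, not MLJ's internal code: `holdout_split` is a hypothetical helper, and `MersenneTwister` from the standard library stands in for `StableRNG`, so the rows selected will not match those used by `evaluate!` above.

```julia
using Random

# Hypothetical helper: shuffle the row indices, then cut at `fraction_train`.
function holdout_split(n::Int; fraction_train=0.7, rng=Random.default_rng())
    idx = shuffle(rng, 1:n)                  # random permutation of 1:n
    ntrain = round(Int, fraction_train * n)  # number of training rows
    return idx[1:ntrain], idx[ntrain+1:end]
end

# 300 rows, as in the dummy data; the seed is arbitrary
train_rows, test_rows = holdout_split(300; rng=MersenneTwister(666))
```

Each row lands in exactly one of the two sets, and a different seed gives a different split, which is one reason a single holdout estimate can vary noticeably.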


## Homogeneous ensembles