In [1]:
using Pkg
Pkg.activate(".")

  Activating project at `~/Repos/mike_scratch/mlj_tutorial`


# Julia Tutorial

[Link to main page](https://juliaai.github.io/DataScienceTutorials.jl/)

## Choosing and evaluating a model

[Link](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/choosing-a-model/)

In [2]:
using RDatasets, MLJ, DecisionTree, NearestNeighborModels, MLJScikitLearnInterface, GLM


In [3]:
iris = dataset("datasets", "iris")
first(iris, 3) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┬─────────────────────────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species                         │
│ Float64     │ Float64    │ Float64     │ Float64    │ CategoricalValue{String, UInt8} │
│ Continuous  │ Continuous │ Continuous  │ Continuous │ Multiclass{3}                   │
├─────────────┼────────────┼─────────────┼────────────┼─────────────────────────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa                          │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa                          │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa                          │
└─────────────┴────────────┴─────────────┴────────────┴─────────────────────────────────┘


In [4]:
iris2 = coerce(iris, :PetalWidth => OrderedFactor)
first(iris2[:, [:PetalLength, :PetalWidth]], 1) |> pretty

┌─────────────┬───────────────────────────────────┐
│ PetalLength │ PetalWidth                        │
│ Float64     │ CategoricalValue{Float64, UInt32} │
│ Continuous  │ OrderedFactor{22}                 │
├─────────────┼───────────────────────────────────┤
│ 1.4         │ 0.2                               │
└─────────────┴───────────────────────────────────┘
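
To see what `coerce` changed, it can help to inspect the scientific types directly. The following is a minimal sketch, assuming `schema` and `scitype` as made available by `using MLJ`; the variable names are the ones used above.

```julia
using MLJ, RDatasets

iris = dataset("datasets", "iris")

# `schema` summarises column names, scientific types and machine types.
schema(iris)

# `coerce` returns a new table; `iris` itself is left unchanged.
iris2 = coerce(iris, :PetalWidth => OrderedFactor)

# The coerced column is now an ordered categorical vector ...
@assert scitype(iris2.PetalWidth) <: AbstractVector{<:OrderedFactor}
# ... while the original column is still continuous.
@assert scitype(iris.PetalWidth) <: AbstractVector{Continuous}
```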


In [5]:
y, X = unpack(iris, ==(:Species))
first(X, 1) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ Float64     │ Float64    │ Float64     │ Float64    │
│ Continuous  │ Continuous │ Continuous  │ Continuous │
├─────────────┼────────────┼─────────────┼────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │
└─────────────┴────────────┴─────────────┴────────────┘


In [6]:
y, X = unpack(iris, ==(:Species), !=(:PetalLength))
first(X,1) |> pretty

┌─────────────┬────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalWidth │
│ Float64     │ Float64    │ Float64    │
│ Continuous  │ Continuous │ Continuous │
├─────────────┼────────────┼────────────┤
│ 5.1         │ 3.5        │ 0.2        │
└─────────────┴────────────┴────────────┘
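
Each predicate passed to `unpack` selects columns without replacement, and whatever is left over comes back as a final extra value. The sketch below assumes the same MLJ version that produced the outputs above, where `unpack` returns one more value than there are predicates; `petal_length` is a name introduced here for illustration.

```julia
using MLJ, RDatasets

iris = dataset("datasets", "iris")

# Two predicates, three return values: the last one collects the columns
# that no predicate selected (here only :PetalLength).
y, X, petal_length = unpack(iris, ==(:Species), !=(:PetalLength))

# A single leftover column comes back as a plain vector.
@assert petal_length == iris.PetalLength
@assert !(:PetalLength in schema(X).names)
```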


In [7]:
X, y = @load_iris

((sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9  …  6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1  …  3.1, 3.1, 2.7, 3.2, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0], petal_length = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5  …  5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1], petal_width = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1  …  2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8]), CategoricalArrays.CategoricalValue{String, UInt32}["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])

In [8]:
# `models()` lists every model in the MLJ registry.
# `models(matching(X, y))` keeps only those compatible with the scitypes of X and y.
# Here we print just the probabilistic ones.
for m in models(matching(X, y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end

AdaBoostClassifier            (MLJScikitLearnInterface)
AdaBoostStumpClassifier       (DecisionTree)
BaggingClassifier             (MLJScikitLearnInterface)
BayesianLDA                   (MLJScikitLearnInterface)
BayesianLDA                   (MultivariateStats)
BayesianQDA                   (MLJScikitLearnInterface)
BayesianSubspaceLDA           (MultivariateStats)
CatBoostClassifier            (CatBoost)
ConstantClassifier            (MLJModels)
DecisionTreeClassifier        (BetaML)
DecisionTreeClassifier        (DecisionTree)
DummyClassifier               (MLJScikitLearnInterface)
EvoTreeClassifier             (EvoTrees)
ExtraTreesClassifier          (MLJScikitLearnInterface)
GaussianNBClassifier          (MLJScikitLearnInterface)
GaussianNBClassifier          (NaiveBayes)
GaussianProcessClassifier     (MLJScikitLearnInterface)
GradientBoostingClassifier    (MLJScikitLearnInterface)
KNNClassifier                 (NearestNeighborModels)
KNeighborsClassifier          (MLJScikitLearnI
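
The filter inside the loop above can also be folded into the query itself. A sketch, assuming the `models() do ... end` form and the `is_pure_julia` metadata field of the MLJ model registry:

```julia
using MLJ

X, y = @load_iris

# Keep models that match the data, predict probabilities,
# and are implemented in pure Julia.
pure = models() do m
    matching(m, X, y) &&
        m.prediction_type == :probabilistic &&
        m.is_pure_julia
end

for m in pure
    println(rpad(m.name, 30), "($(m.package_name))")
end
```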

In [9]:
# Load a model.
knc = @load KNeighborsClassifier

import MLJScikitLearnInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/UM8fF/src/loading.jl:159


KNeighborsClassifier

In [10]:
# If a model is provided by multiple packages, use the `pkg` argument to specify which one to load.
linreg = @load LinearRegressor pkg=GLM

import MLJGLMInterface ✔

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/UM8fF/src/loading.jl:159





MLJGLMInterface.LinearRegressor
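
`@load` returns a model *type*, not an instance, so the result is usually called like a constructor. A small sketch, using the `verbosity=0` option mentioned in the log message above; the hyperparameter `K` is assumed from NearestNeighborModels' `KNNClassifier`, and `KNN` and `knn` are names chosen here.

```julia
using MLJ

# Load the type silently, then build an instance with a chosen hyperparameter.
KNN = @load KNNClassifier pkg=NearestNeighborModels verbosity=0
knn = KNN(K=3)

@assert knn.K == 3
```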

## Fit, predict, transform

In [46]:
using MLJ
import Statistics
using PrettyPrinting
using StableRNGs

In [12]:
X, y = @load_iris;

In [23]:
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
tree_model = DecisionTreeClassifier()

import MLJDecisionTreeInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/UM8fF/src/loading.jl:159


DecisionTreeClassifier(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  display_depth = 5, 
  feature_importance = :impurity, 
  rng = Random._GLOBAL_RNG())

Some important definitions.

A "model", like the `tree_model` instantiated above, is just a container for hyperparameters. It holds no data and no learned parameters.

In [27]:
tree_model

DecisionTreeClassifier(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  display_depth = 5, 
  feature_importance = :impurity, 
  rng = Random._GLOBAL_RNG())
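
Because a model only stores hyperparameters, it can be configured at construction or mutated afterwards without touching any data. A minimal sketch using the `max_depth` field shown in the printout above; `Tree` and `shallow` are names chosen here.

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

shallow = Tree(max_depth=2)   # set at construction
shallow.max_depth = 3         # or change in place later

@assert shallow.max_depth == 3
```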

A "machine" is an object that wraps a model together with data and, once training has happened, stores information on the _trained_ model. Constructing a machine does _not_ fit the model. It does, however, check that the model is compatible with the scientific type of the data and warns you otherwise.

### MLJ Machine

In [26]:
tree = machine(tree_model, X, y)

untrained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = -1, …)
  args: 
    1:	Source @049 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @751 ⏎ AbstractVector{Multiclass{3}}


### Training and testing a supervised model

Splitting the data

In [43]:
rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)
test[1:3]

3-element Vector{Int64}:
 39
 54
  9
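
`partition` returns index vectors rather than copies of the data. A quick sanity check, sketched with the same seed as above, is that the two sets are disjoint and together cover every row:

```julia
using MLJ, StableRNGs

X, y = @load_iris
rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)

# No row is in both sets, and no row is lost.
@assert isempty(intersect(train, test))
@assert sort(vcat(train, test)) == collect(eachindex(y))
```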

In [44]:
# Fit the machine.
fit!(tree, rows=train)

UndefVarError: UndefVarError: fit! not defined

In [47]:
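# `fit!` is probably exported by more than one of the packages loaded above
# (e.g. MLJ and DecisionTree), which leaves the unqualified name unresolved.
# Qualifying the call with the module sidesteps the clash.
MLJ.fit!(tree, rows=train)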
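
For reference, here is a hedged sketch of the whole cycle: build a machine, fit on the training rows, predict on the held-out rows, and score. It assumes a fresh session and module-qualifies the MLJ calls so they resolve even if another loaded package exports the same names; `Tree`, `mach`, `yhat` and `acc` are names chosen here.

```julia
using MLJ, StableRNGs

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
mach = machine(Tree(), X, y)

train, test = partition(eachindex(y), 0.7, shuffle=true, rng=StableRNG(566))

# Train on the training rows only.
MLJ.fit!(mach, rows=train, verbosity=0)

# Point predictions: the most probable class for each held-out row.
yhat = MLJ.predict_mode(mach, rows=test)
acc = accuracy(yhat, y[test])

@assert 0 <= acc <= 1
```

Scoring on `test` rather than `train` matters here: an unpruned tree can fit its training rows almost perfectly, so only the held-out score says anything about generalisation.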