## Fit, predict, transform

In [1]:
using MLJ
import Statistics
using PrettyPrinting
using StableRNGs

In [2]:
X, y = @load_iris;

In [3]:
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
tree_model = DecisionTreeClassifier()

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/tMgLW/src/loading.jl:168


import MLJDecisionTreeInterface ✔


DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

Some important definitions.

A "model", like the `tree_model` we just constructed, is just a container for the hyperparameters of the model. It holds nothing learned from data.

In [4]:
tree_model

DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

A "machine" is an object that wraps a model together with data and, once trained, stores information about the _trained_ model. Constructing a machine does _not_ fit the model. It does, however, check that the model is compatible with the scientific type of the data, and it warns you otherwise.
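The split between a model and a machine can be illustrated with a toy sketch in plain Julia. The names `ToyModel`, `ToyMachine`, `toy_machine` and `toy_fit!` are hypothetical and are not part of MLJ; the "training" step is a placeholder that only stores a mean. The point is only that hyperparameters live in one object and learned state in another.

```julia
# Toy illustration only: these are hypothetical types, not MLJ's.

# A "model" holds hyperparameters and nothing learned from data.
Base.@kwdef mutable struct ToyModel
    max_depth::Int = -1
    min_samples_leaf::Int = 1
end

# A "machine" binds a model to data and stores whatever training produces.
mutable struct ToyMachine
    model::ToyModel
    X::Vector{Float64}
    y::Vector{Float64}
    fitresult::Union{Nothing, Float64}   # empty until trained
    n_trained::Int
end

# Constructing the machine does not train anything.
toy_machine(model, X, y) = ToyMachine(model, X, y, nothing, 0)

# Placeholder "training": store the mean of the selected targets.
function toy_fit!(mach::ToyMachine; rows = eachindex(mach.y))
    mach.fitresult = sum(mach.y[rows]) / length(rows)
    mach.n_trained += 1
    return mach
end

model = ToyModel(max_depth = 3)
mach = toy_machine(model, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
println(mach.n_trained)      # 0: nothing has been fitted yet
toy_fit!(mach, rows = 1:2)
println(mach.fitresult)      # 15.0
```

The model object is unchanged by training; only the machine's state is updated.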

### MLJ Machine

In [5]:
tree = machine(tree_model, X, y)

Machine{DecisionTreeClassifier,…} trained 0 times; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
  args: 
    1:	Source @210 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @496 ⏎ `AbstractVector{Multiclass{3}}`


### Training and testing a supervised model

Split the row indices into a training set (70%) and a test set (30%).

In [6]:
rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)
test[1:3]

3-element Vector{Int64}:
 39
 54
  9
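For intuition, a shuffled split of row indices can be sketched in plain Julia as follows. The helper `shuffled_split` is hypothetical and is not MLJ's `partition`; its rounding rule and its use of `MersenneTwister` instead of `StableRNG` are assumptions, so the indices it produces will differ from those above.

```julia
using Random

# Hypothetical helper, not MLJ's `partition`: shuffle the indices, then cut
# them at the given fraction. The rounding rule here is an assumption.
function shuffled_split(indices, fraction; rng = Random.default_rng())
    shuffled = shuffle(rng, collect(indices))
    n_train = round(Int, fraction * length(shuffled))
    return shuffled[1:n_train], shuffled[n_train+1:end]
end

rng = MersenneTwister(566)
train_idx, test_idx = shuffled_split(1:150, 0.7; rng = rng)
println(length(train_idx), " ", length(test_idx))   # 105 45
```

Passing an explicitly seeded generator keeps the split reproducible between runs.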

In [7]:
# Fit the machine.
fit!(tree, rows=train)

┌ Info: Training Machine{DecisionTreeClassifier,…}.
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/MuLnJ/src/machines.jl:464


Machine{DecisionTreeClassifier,…} trained 1 time; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
  args: 
    1:	Source @210 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @496 ⏎ `AbstractVector{Multiclass{3}}`


Running `fit!` modified the machine, which now contains the trained parameters.

In [8]:
fitted_params(tree) |> pprint

(tree = Decision Tree
Leaves: 5
Depth:  4,
 encoding =
     Dict(CategoricalArrays.CategoricalValue{String, UInt32} "virginica" =>
              0x00000003,
          CategoricalArrays.CategoricalValue{String, UInt32} "setosa" =>
              0x00000001,
          CategoricalArrays.CategoricalValue{String, UInt32} "versicolor" =>
              0x00000002))

Making predictions on the test set.

In [9]:
ŷ = predict(tree, rows=test)    # type ŷ as y\hat<TAB>
@show ŷ[1]

ŷ[1] = UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)


                    UnivariateFinite{Multiclass{3}}
              ┌                                        ┐
       setosa ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1.0
   versicolor ┤ 0.0
    virginica ┤ 0.0
              └                                        ┘

Get the mode (the most probable class) of each predicted distribution.

In [10]:
ȳ = predict_mode(tree, rows=test)
@show ȳ[1]
@show mode(ŷ[1])

ȳ[1] = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
mode(ŷ[1]) = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"


CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
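Conceptually, taking the mode of a predicted distribution means picking the label with the largest probability. A minimal sketch in plain Julia, using a pair of vectors as a hypothetical stand-in for a probabilistic prediction (this is not MLJ's `UnivariateFinite` type, and the probabilities are made up):

```julia
# Hypothetical stand-in for one probabilistic prediction.
labels = ["setosa", "versicolor", "virginica"]
probs  = [0.1, 0.7, 0.2]

# The mode is the label whose probability is largest.
most_likely = labels[argmax(probs)]
println(most_likely)   # versicolor
```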

Measure the discrepancy between the predicted distributions and the observed test labels. We'll use the average cross entropy.

In [11]:
mce = cross_entropy(ŷ, y[test]) |> mean
round(mce, digits=4)

2.4029
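To see how such a value can arise, here is a rough sketch of per-row cross entropy in plain Julia: the loss for a row is the negative log of the probability assigned to the observed class. The helper `per_row_cross_entropy` is hypothetical and is not MLJ's `cross_entropy`; clipping probabilities at machine epsilon is an assumption made here to keep the logarithm finite, and the input numbers are illustrative rather than taken from the fitted tree.

```julia
using Statistics

# Hypothetical helper, not MLJ's `cross_entropy`. `p_true[i]` is the probability
# the model assigned to the class actually observed for row i.
# Clipping at machine epsilon is an assumption to keep the log finite.
function per_row_cross_entropy(p_true; tol = eps())
    clipped = clamp.(p_true, tol, 1 - tol)
    return -log.(clipped)
end

# Illustrative numbers only: 45 rows, of which 42 are predicted with certainty
# and correct, and 3 are predicted with certainty and wrong.
p_true = vcat(ones(42), zeros(3))
losses = per_row_cross_entropy(p_true)
println(round(mean(losses), digits = 4))   # 2.4029
```

Under these assumptions, each confidently wrong prediction costs about 36, so a handful of them dominates the average. A model that only emits probabilities of 0 or 1, as an unpruned tree tends to, is penalized heavily by this measure.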

## Unsupervised models

Unsupervised models have a `transform` method and may have an `inverse_transform` method. As with supervised models, wrap the unsupervised model and data in a machine.

In [12]:
v = [1,2,3,4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)

Machine{UnivariateStandardizer,…} trained 0 times; caches data
  model: UnivariateStandardizer
  args: 
    1:	Source @838 ⏎ `AbstractVector{Count}`


In [13]:
fit!(stand)
w = transform(stand, v)
@show round.(w, digits=2)
@show mean(w)
@show std(w)

┌ Info: Training Machine{UnivariateStandardizer,…}.
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/MuLnJ/src/machines.jl:464


round.(w, digits = 2) = [-1.16, -0.39, 0.39, 1.16]
mean(w) = 0.0
std(w) = 1.0


1.0

Note that this has applied the usual standardization: subtract the mean and divide by the standard deviation.

In [22]:
@show round.((v .- mean(v)) ./ std(v), digits=2)

round.((v .- mean(v)) ./ std(v), digits = 2) = [-1.16, -0.39, 0.39, 1.16]


4-element Vector{Float64}:
 -1.16
 -0.39
  0.39
  1.16

The inverse transform recovers the original values.

In [24]:
vv = inverse_transform(stand, w)
vv ≈ v

true
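The round trip can also be written out by hand with `Statistics` alone. This is a sketch of the arithmetic, assuming the standardizer uses the sample mean and the sample (n - 1) standard deviation, which matches the manual check above; the variable names are illustrative.

```julia
using Statistics

v = [1, 2, 3, 4]
μ, σ = mean(v), std(v)     # std uses the sample (n - 1) definition

w = (v .- μ) ./ σ          # forward: centre, then scale
v_back = w .* σ .+ μ       # inverse: undo the scale, then the shift

println(v_back ≈ v)        # true
```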