In [1]:
using MLJ

Based on the tutorial [here](https://github.com/ablaom/MachineLearningInJulia2020/blob/for-MLJ-version-0.16/tutorials.md#part-3-transformers-and-pipelines).

# Scientific Types

MLJ models specify the *scientific type* of the data they expect, which makes it easy to focus on the intended *purpose* of your data rather than its machine representation. For example, one scientific type is `OrderedFactor`.

In [2]:
scitype(3.141)

Continuous

In [3]:
time = [2.3, 4.5, 4.2, 1.8, 7.1]
scitype(time)

AbstractVector{Continuous} (alias for AbstractArray{Continuous, 1})

Usually MLJ does a good job of inferring the scientific type you want, but you can force a particular type on a variable with `coerce()`.

In [4]:
height = [185, 153, 163, 114, 180]
scitype(height)

AbstractVector{Count} (alias for AbstractArray{Count, 1})

In [6]:
height = coerce(height, Continuous)
scitype(height)

AbstractVector{Continuous} (alias for AbstractArray{Continuous, 1})

Here's an example of an `OrderedFactor`:

In [7]:
exam_mark = ["rotten", "great", "bla", missing, "great"]
scitype(exam_mark)

AbstractVector{Union{Missing, Textual}} (alias for AbstractArray{Union{Missing, Textual}, 1})

In [8]:
exam_mark = coerce(exam_mark, OrderedFactor)
scitype(exam_mark)

┌ Info: Trying to coerce from `Union{Missing, String}` to `OrderedFactor`.
│ Coerced to `Union{Missing,OrderedFactor}` instead.
└ @ ScientificTypes /home/john/.julia/packages/ScientificTypes/Vswzn/src/convention/coerce.jl:174


AbstractVector{Union{Missing, OrderedFactor{3}}} (alias for AbstractArray{Union{Missing, OrderedFactor{3}}, 1})

In [9]:
# see the ordering of the factors 
levels(exam_mark)

3-element Vector{String}:
 "bla"
 "great"
 "rotten"

In [11]:
# you can fix the ordering too 
levels!(exam_mark, ["rotten", "bla", "great"])

5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "rotten"
 "great"
 "bla"
 missing
 "great"

In [13]:
exam_mark[1] < exam_mark[2]  # we have an ordering via the "<" relation

true

If we take a slice, the information about all the levels is preserved:

In [14]:
levels(exam_mark[1:2])

3-element Vector{String}:
 "rotten"
 "bla"
 "great"

# Two-dimensional Data

MLJ models generally expect two-dimensional data to be *tabular*. This means that any object implementing the `Tables.jl` interface should work; its scitype is then reported as a `Table`.
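As a quick check of what counts as tabular, the following minimal sketch asks `Tables.jl` directly. It assumes `Tables.jl` is installed in the active environment (it is a dependency of MLJ, but may need to be added before it can be imported), and the variable names are made up for illustration.

```julia
using MLJ
import Tables  # assumption: Tables.jl can be imported in the active environment

columns = (h = [185.0, 153.0], t = [2.3, 4.5])  # named tuple of equal-length vectors
A = rand(2, 3)                                  # a plain matrix

is_columns_table = Tables.istable(columns)       # a named tuple of vectors is a table
is_matrix_table = Tables.istable(A)              # a bare matrix is not
is_wrapped_table = Tables.istable(MLJ.table(A))  # wrapping the matrix makes it one
```

The same check can be applied to any other container before handing it to a model.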

Simple example: *column table*

In [15]:
column_table = (h=height, e=exam_mark, t=time)

(h = [185.0, 153.0, 163.0, 114.0, 180.0],
 e = Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}["rotten", "great", "bla", missing, "great"],
 t = [2.3, 4.5, 4.2, 1.8, 7.1],)

In [16]:
scitype(column_table)

Table{Union{AbstractVector{Union{Missing, OrderedFactor{3}}}, AbstractVector{Continuous}}}

To inspect the scitype of each column, we use `schema()`

In [17]:
schema(column_table)

┌─────────┬──────────────────────────────────────────────────┬──────────────────
│[22m _.names [0m│[22m _.types                                          [0m│[22m _.scitypes     [0m ⋯
├─────────┼──────────────────────────────────────────────────┼──────────────────
│ h       │ Float64                                          │ Continuous      ⋯
│ e       │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing,  ⋯
│ t       │ Float64                                          │ Continuous      ⋯
└─────────┴──────────────────────────────────────────────────┴──────────────────
                                                                1 column omitted
_.nrows = 5


Example 2: Table from a dictionary 

In [18]:
dict_table = Dict(:h => height, :e => exam_mark, :t => time)
schema(dict_table)

┌─────────┬──────────────────────────────────────────────────┬──────────────────
│[22m _.names [0m│[22m _.types                                          [0m│[22m _.scitypes     [0m ⋯
├─────────┼──────────────────────────────────────────────────┼──────────────────
│ e       │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing,  ⋯
│ h       │ Float64                                          │ Continuous      ⋯
│ t       │ Float64                                          │ Continuous      ⋯
└─────────┴──────────────────────────────────────────────────┴──────────────────
                                                                1 column omitted
_.nrows = 5


In [23]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/gitRepos/ml-demos/Project.toml`
[32m[1m  No Changes[22m[39m to `~/gitRepos/ml-demos/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m InlineStrings ─ v1.1.1
[32m[1m    Updating[22m[39m `~/gitRepos/ml-demos/Project.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.9.11[39m
[32m[1m    Updating[22m[39m `~/gitRepos/ml-demos/Manifest.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.9.11[39m
 [90m [842dd82b] [39m[92m+ InlineStrings v1.1.1[39m
 [90m [91c51154] [39m[92m+ SentinelArrays v1.3.8[39m
 [90m [ea10d353] [39m[92m+ WeakRefStrings v1.4.1[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mInlineStrings[39m
[32m  ✓ [39m[90mWeakRefStrings[39m
[32m  ✓ [39mCSV
  3 dependencies successfully precompiled in 9 seconds (180 already precompiled)


In [20]:
using DataFrames

In [22]:
df = DataFrame(column_table)
schema(df)

┌─────────┬──────────────────────────────────────────────────┬──────────────────
│[22m _.names [0m│[22m _.types                                          [0m│[22m _.scitypes     [0m ⋯
├─────────┼──────────────────────────────────────────────────┼──────────────────
│ h       │ Float64                                          │ Continuous      ⋯
│ e       │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing,  ⋯
│ t       │ Float64                                          │ Continuous      ⋯
└─────────┴──────────────────────────────────────────────────┴──────────────────
                                                                1 column omitted
_.nrows = 5


Most MLJ models will **not** accept a Matrix in place of a table. You must wrap it:

In [25]:
matrix_table = MLJ.table(rand(2,3))
schema(matrix_table)

┌─────────┬─────────┬────────────┐
│[22m _.names [0m│[22m _.types [0m│[22m _.scitypes [0m│
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 2


NOTE: the matrix is wrapped, *not* copied
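The wrapping can also be reversed. The sketch below, using a small hand-written matrix as an illustrative assumption, converts a wrapped table back with `MLJ.matrix` and shows the `names` keyword of `MLJ.table` for choosing column names instead of the default `x1`, `x2`, `x3`.

```julia
using MLJ

A = [1.0 2.0 3.0; 4.0 5.0 6.0]

wrapped = MLJ.table(A)            # default column names x1, x2, x3
recovered = MLJ.matrix(wrapped)   # convert the table back into a matrix

# the column names here are made up for illustration
named = MLJ.table(A; names=[:a, :b, :c])
named_columns = schema(named).names
```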

# Fixing Scientific Types in Tabular Data

In [26]:
using CSV

In [30]:
file = CSV.File(joinpath("data", "horse.csv"))
horse = DataFrame(file)
first(horse, 4) # view the first 4 rows 

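┌─────┬─────────┬───────┬────────────────────┬───────┬──────────────────┬─────────────────────────┐
│ Row │ surgery │ age   │ rectal_temperature │ pulse │ respiratory_rate │ temperature_extremities │
│     │ Int64   │ Int64 │ Float64            │ Int64 │ Int64            │ Int64                   │
├─────┼─────────┼───────┼────────────────────┼───────┼──────────────────┼─────────────────────────┤
│ 1   │ 2       │ 1     │ 38.5               │ 66    │ 66               │ 3                       │
│ 2   │ 1       │ 1     │ 39.2               │ 88    │ 88               │ 3                       │
│ 3   │ 2       │ 1     │ 38.3               │ 40    │ 40               │ 1                       │
│ 4   │ 1       │ 9     │ 39.1               │ 164   │ 164              │ 4                       │
└─────┴─────────┴───────┴────────────────────┴───────┴──────────────────┴─────────────────────────┘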


Examine the scientific type of each column:

In [31]:
schema(horse)

┌─────────────────────────┬─────────┬────────────┐
│ _.names                 │ _.types │ _.scitypes │
├─────────────────────────┼─────────┼────────────┤
│ surgery                 │ Int64   │ Count      │
│ age                     │ Int64   │ Count      │
│ rectal_temperature      │ Float64 │ Continuous │
│ pulse                   │ Int64   │ Count      │
│ respiratory_rate        │ Int64   │ Count      │
│ temperature_extremities │ Int64   │ Count      │
│ mucous_membranes        │ Int64   │ Count      │
│ capillary_refill_time   │ Int64   │ Count      │
│ pain                    │ Int64   │ Count      │
│ peristalsis             │ Int64   │ Count      │
│ abdominal_distension    │ Int64   │ Count      │
│ packed_cell_volume      │ Float64 │ Continuous │
│ total_protein           │ Float64 │ Continuous │
│ outcome                 │ Int64   │ Count      │
│ surgical_lesion         │ Int64   │ Count      │
│ cp_data                 │ Int64   │ Count      │
└───

Let MLJ guess the appropriate fix via `autotype()`:

In [32]:
autotype(horse)

Dict{Symbol, Type} with 11 entries:
  :abdominal_distension    => OrderedFactor
  :pain                    => OrderedFactor
  :surgery                 => OrderedFactor
  :mucous_membranes        => OrderedFactor
  :surgical_lesion         => OrderedFactor
  :outcome                 => OrderedFactor
  :capillary_refill_time   => OrderedFactor
  :age                     => OrderedFactor
  :temperature_extremities => OrderedFactor
  :peristalsis             => OrderedFactor
  :cp_data                 => OrderedFactor
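The suggestions from `autotype()` depend on which rules are applied. As a hedged sketch, the `only_changes` and `rules` keywords can be used to see every column or to apply more than the default `:few_to_finite` rule. The toy table and its column names below are hypothetical and stand in for the horse data.

```julia
using MLJ

# hypothetical toy table; the column names are made up for illustration
toy = (grade = repeat([1, 2, 3], 20), weight = rand(60), id = collect(1:60))

# default behaviour: only the columns whose suggested scitype differs
suggested = autotype(toy)

# list a suggestion for every column, changed or not
everything = autotype(toy; only_changes=false)

# apply an additional rule after :few_to_finite
stricter = autotype(toy; rules=(:few_to_finite, :discrete_to_continuous))
```

Inspecting the returned dictionary before passing it to `coerce!` is a cheap way to catch suggestions that do not match the meaning of a column.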

In [34]:
# accept the changes
coerce!(horse, autotype(horse))
schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Int64                           │ Count            ⋯
│ respiratory_rate        │ Int64                           │ Count            ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ OrderedFactor{6} ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ OrderedFactor{3} ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│

We want the remaining `Count` columns to be `Continuous`:

In [36]:
coerce!(horse, Count => Continuous)
schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Float64                         │ Continuous       ⋯
│ respiratory_rate        │ Float64                         │ Continuous       ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ OrderedFactor{6} ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ OrderedFactor{3} ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│

In [37]:
# correct remainder manually
coerce!(horse,
    :surgery => Multiclass,
    :age => Multiclass,
    :mucous_membranes => Multiclass,
    :capillary_refill_time => Multiclass,
    :outcome => Multiclass,
    :cp_data => Multiclass
)

schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ Multiclass{2}    ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ Multiclass{2}    ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Float64                         │ Continuous       ⋯
│ respiratory_rate        │ Float64                         │ Continuous       ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ Multiclass{6}    ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ Multiclass{3}    ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│