In [1]:
using Pkg;
Pkg.activate(".");
Pkg.instantiate();

[32m[1m  Activating[22m[39m project at `~/Repos/DoingRightNow-Analysis`


In [2]:
using DataFrames, Arrow, CategoricalArrays, ScientificTypes, MLJ

In [3]:
DATA_FILE_PATH = "./data/model_data.arrow";
df = DataFrame(Arrow.Table(DATA_FILE_PATH));
df = copy(df);

In [4]:

function clean_data!(df)
    
    # Fix machine types.
    HEFAMINC_ordered_set = [
        "Less than 5,000",
        "5,000 to 7,499",
        "7,500 to 9,999",
        "10,000 to 12,499",
        "12,500 to 14,999",
        "15,000 to 19,999",
        "20,000 to 24,999",
        "25,000 to 29,999",
        "30,000 to 34,999",
        "35,000 to 39,999",
        "40,000 to 49,999",
        "50,000 to 59,999",
        "60,000 to 74,999",
        "75,000 to 99,999",
        "100,000 to 149,999",
        "150,000 and over"
    ]

    df.TRTIER2 = categorical(df.TRTIER2)
    df.GESTFIPS_label = categorical(df.GESTFIPS_label)
    df.HEFAMINC_label = categorical(df.HEFAMINC_label; levels=HEFAMINC_ordered_set, ordered=true)
    df.PEMARITL_label = categorical(df.PEMARITL_label)
    df.HETENURE_label = categorical(df.HETENURE_label)
    df.TUDIARYDAY_label = categorical(df.TUDIARYDAY_label)

    # drop columns and disallow missing.
    drop_cols = [
        :TUCASEID,:TUACTIVITY_N,:TUSTARTTIM,:TUSTOPTIME,
        :start_time_int,:stop_time_int,:TULINENO, :TUDIARYDAY
        ]
    select!(df, Not(drop_cols))
    disallowmissing!(df)

    # Define scientific types.
    coerce!(df, :snap_time_int => Continuous, :PRTAGE => Continuous)
end

clean_data!(df);

In [5]:
y, X = unpack(df, ==(:TRTIER2));

In [6]:
train, test = partition(eachindex(y), 0.8)

([1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  2093636, 2093637, 2093638, 2093639, 2093640, 2093641, 2093642, 2093643, 2093644, 2093645], [2093646, 2093647, 2093648, 2093649, 2093650, 2093651, 2093652, 2093653, 2093654, 2093655  …  2617047, 2617048, 2617049, 2617050, 2617051, 2617052, 2617053, 2617054, 2617055, 2617056])

In [7]:
X_test = X[test, :]
X = X[train,:]

y_test = y[test]
y = y[train];

## Find the right model to use

We'll take a look at what type of models are available to MLJ to predict on our target.

In [None]:
for m in models(matching(X,y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end

The only models showing are tree-based models. We're predicting a multi-class categorical target, and that is how it is encoded in the data. Tree-based models can handle this directly.

But we _should_ be able to use something like a multinomial logistic regression, shouldn't we? The likely reason it is missing is typing: a regression isn't going to work on predictors that haven't been encoded numerically. According to the documentation, it _should_ handle the multiclass target properly, though.

In [None]:
# One Hot Encode X into a new object called X2.
ohe = OneHotEncoder(drop_last=true)
mach = fit!(machine(ohe, X))
X2 = MLJ.transform(mach, X)

# Search for the available models.
for m in models(matching(X2,y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end

That's a big variety of models to choose from.

We'll start from the smaller list of tree-based models. The random forest is a good one. We can do this two ways -- by using the default `RandomForestClassifier` or by composing our own bagging of a set of `DecisionTreeClassifier` models.

The easy, fast thing to do would be to use the default. But I'd like to get some practice in. So I'm going to do the bagging from scratch.
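
As a reminder of what bagging involves, here is a minimal, library-free sketch of its two ingredients: drawing a bootstrap sample of rows for each tree, and averaging the per-tree class probabilities. The helper names `bootstrap_rows` and `average_probs` are made up for illustration, and this is the textbook with-replacement form. MLJ's `EnsembleModel` does its own subsampling (controlled by `bagging_fraction`), which this sketch does not try to reproduce.

```julia
using Random

# Draw a bootstrap sample: n row indices sampled with replacement from 1:n.
bootstrap_rows(rng::AbstractRNG, n::Integer) = rand(rng, 1:n, n)

# Combine per-model class-probability vectors by simple averaging.
average_probs(probs) = sum(probs) ./ length(probs)

rng = MersenneTwister(123)
rows = bootstrap_rows(rng, 10)    # indices one tree would train on

# Three hypothetical trees voting on a two-class problem.
p = average_probs([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
```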

The data has already been partitioned into training and test sets above, so `X` and `y` now hold only the training rows.

## Random Forest Classifier

Note, a lot of this is adapted from [this MLJ documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/#Tuning-multiple-nested-hyperparameters) and to a lesser extent from [this slightly outdated tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles-2/).

The `DecisionTreeClassifier` from the `BetaML` package works with no encoding or transformation. But it takes a very long time to run. We'll try setting up a pipeline to transform the data and run the `DecisionTreeClassifier` from the `DecisionTree` package.

In [None]:
# Load models from packages.
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree

### Step 1 -- Define a new model struct.

Likely this is a probabilistic model. We'll need to confirm this and define a probabilistic network composite model.

In [None]:
supertype(typeof(DecisionTreeClassifier()))    # Should be "Probabilistic"

In [None]:
# Define a new model struct.
mutable struct CompositeA <: ProbabilisticNetworkComposite
    preprocessor    # This part does the pre-processing.
    classifier    # This part does the classifying
end

### Step 2 -- Create and wrap the learning network in `prefit`

In [None]:
# Wrap the above steps into a function called `prefit`
import MLJBase    # We need to import in order to overload `MLJBase.prefit`
function MLJBase.prefit(composite::CompositeA, verbosity, X, y)
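    # Define data input nodes. Use the data passed in by `fit!` as-is: `X` and `y`
    # were already restricted to the training rows above, and resampling (e.g. in
    # `evaluate!`) selects its own rows.
    Xs = source(X)
    ys = source(y)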

    # First machine -- We substitute the symbols in the struct defined above for the model objects.
    mach1 = machine(:preprocessor, Xs)
    x = MLJ.transform(mach1, Xs)    # `transform` has duplicated namespace. So we specify `MLJ.transform`
    mach2 = machine(:classifier, x, ys)
    ŷ = predict(mach2, x)

    verbosity > 0 && @info "I'm a noisy fellow!"

    #return "learning network interface":
    return (; predict=ŷ)
end

`prefit` always returns a _learning network interface_. Here, the interface dictates that calling `predict(mach, Xnew)` on a machine `mach` bound to some instance of `CompositeA` should internally call `ŷ(Xnew)`.


This means we can use the above like any other model.

In [None]:
using MLJ

one_hot_encoder = OneHotEncoder()
tree = DecisionTreeClassifier(n_subfeatures=3)
ensemble_model = EnsembleModel(model=tree, n=20)

composite_a = CompositeA(one_hot_encoder,ensemble_model)

In [None]:
mach = machine(composite_a, X, y)
#fit!(mach, rows=train, verbosity=0)
estimates = evaluate!(mach, measure=cross_entropy)    # Fits and predicts on each resampling fold, then applies the measure.

### Tuning Hyperparameters

Let's start by tuning the `tree.n_subfeatures` parameter.

In [None]:
r_n_subfeatures = range(composite_a, :(classifier.model.n_subfeatures),lower=1, upper=6)
tuned_composite_a = TunedModel(
    composite_a,
    range=r_n_subfeatures,
    tuning=RandomSearch(rng=123),
    measure=cross_entropy,
    resampling=CV(nfolds=6),
    n=100,
)
mach = machine(tuned_composite_a, X, y) |> fit!
report(mach).best_model
# estimates2 = evaluate!(mach, measure=cross_entropy)    # Fits and predicts on each resampling fold, then applies the measure.

That takes far too long. I even tried it on my gaming PC, and throwing more compute at it doesn't fix the problem.
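
A rough count of the work requested explains the run time. The sketch below only multiplies out the settings from the cells above (`n=100`, `CV(nfolds=6)`, `EnsembleModel(..., n=20)`) and the 2,093,645 training rows. It assumes every sampled candidate is evaluated, even when `RandomSearch` draws the same value twice.

```julia
n_candidates = 100        # `n` in TunedModel
n_folds      = 6          # CV(nfolds=6)
n_trees      = 20         # `n` in EnsembleModel
n_rows       = 2_093_645  # rows in the training partition

ensemble_fits = n_candidates * n_folds                 # one ensemble per candidate per fold
tree_fits     = ensemble_fits * n_trees                # individual decision trees
rows_per_fold = div((n_folds - 1) * n_rows, n_folds)   # rows in each training fold

# `n_subfeatures` is an integer in 1:6, so there are only 6 distinct candidates.
distinct_candidates = length(1:6)

println("ensemble fits:       ", ensemble_fits)
println("tree fits:           ", tree_fits)
println("rows per fold:       ", rows_per_fold)
println("distinct candidates: ", distinct_candidates)
```

Since the range only has six distinct values, a `Grid` search over them (or a much smaller `n`) would likely cut the number of fits by more than an order of magnitude.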

Let's try the out-of-the-box RandomForest model.

# Out-of-the-box Random Forest

In [None]:
# Load models from packages.
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree

In [None]:
# Define a new model struct.
mutable struct ATUSRandomForest <: ProbabilisticNetworkComposite
    preprocessor    # This part does the pre-processing.
    classifier    # This part does the classifying
end

In [10]:
# Create prefit
import MLJBase
function MLJBase.prefit(composite::ATUSRandomForest, verbosity, X, y)

    # Learning network
    Xs = source(X)
    ys = source(y)
    mach1 = machine(:preprocessor, Xs)
    x = MLJ.transform(mach1, Xs)
    mach2 = machine(:classifier, x, ys)
    yhat = predict(mach2, x)

    verbosity > 0 && @info "I sure am noisy"

    # return "learning network interface":
    return (; predict=yhat)

end

UndefVarError: UndefVarError: ATUSRandomForest not defined

In [None]:
one_hot_encoder = OneHotEncoder()
forest = RandomForestClassifier(
    n_subfeatures=12,
    sampling_fraction=0.3,    # We have lots of data. Only use 30%.
    max_depth=10,
    rng=71
    )

atus_random_forest = ATUSRandomForest(one_hot_encoder,forest)

In [None]:
mach = machine(atus_random_forest, X, y)
fit!(mach)

In [None]:
ŷ = predict(mach, X_test)
# MulticlassFScore()(mode.(ŷ), y_test)    # `y` now holds only training rows, so compare against `y_test`.

In [None]:
mean(cross_entropy(ŷ, y_test))

The mean cross-entropy for this model is 1.98179.

# Clustering and RandomForest

In [12]:
# Load models from packages.
import MLJBase
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
KMeans = @load KMeans pkg=Clustering verbosity=0

# Define a new model struct.
mutable struct ATUSClusterClassifier <: ProbabilisticNetworkComposite
    one_hot_encoder    # This part does the pre-processing.
    continuous_encoder    # Force any remaining non-continuous to continuous.
    clusterer    # This part clusters the predictors.
    classifier    # This part does the classifying
end

# Create prefit
function MLJBase.prefit(composite::ATUSClusterClassifier, verbosity, X, y)

    verbosity > 0 && @info "Running ATUSClusterClassifier composite model."

    # Learning network
    Xs = source(X)
    ys = source(y)

    ## Transform categoricals using one-hot-encoding.
    mach_ohe = machine(:one_hot_encoder, Xs)
    X_ohe = MLJ.transform(mach_ohe, Xs)
    
    ## Transform everything else to continuous.
    mach_cont = machine(:continuous_encoder, X_ohe)
    X_cont = MLJ.transform(mach_cont, X_ohe)
    
    ## Cluster predictors. Produces a table of k columns with distances.
    mach_clust = machine(:clusterer, X_cont)
    X_clust = MLJ.transform(mach_clust, X_cont)
    
    ## Run the classifier and predict.
    mach_class = machine(:classifier, X_clust, ys)
    yhat = predict(mach_class, X_clust)

    # return "learning network interface":
    return (; predict=yhat)

end

one_hot_encoder = OneHotEncoder(ordered_factor=false, drop_last=true)
continuous_encoder = ContinuousEncoder()
kmeans = KMeans(k=100)    # k=100 is a little arbitrary. For scale, each individual has 288 observations.
forest = RandomForestClassifier(
    n_subfeatures=12,
    sampling_fraction=0.3,    # We have lots of data. Only use 30%.
    max_depth=10,
    rng=71
    )

atus_cluster_classifier = ATUSClusterClassifier(
    one_hot_encoder, 
    continuous_encoder, 
    kmeans,
    forest)

import MLJDecisionTreeInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/mph/.julia/packages/MLJModels/UM8fF/src/loading.jl:159


ATUSClusterClassifier(
  one_hot_encoder = OneHotEncoder(
        features = Symbol[], 
        drop_last = true, 
        ordered_factor = false, 
        ignore = false), 
  continuous_encoder = ContinuousEncoder(
        drop_last = false, 
        one_hot_ordered_factors = false), 
  clusterer = KMeans(
        k = 100, 
        metric = Distances.SqEuclidean(0.0), 
        init = :kmpp), 
  classifier = RandomForestClassifier(
        max_depth = 10, 
        min_samples_leaf = 1, 
        min_samples_split = 2, 
        min_purity_increase = 0.0, 
        n_subfeatures = 12, 
        n_trees = 100, 
        sampling_fraction = 0.3, 
        feature_importance = :impurity, 
        rng = 71))

In [13]:
mach = machine(atus_cluster_classifier, X, y)
fit!(mach)

┌ Info: Training machine(ATUSClusterClassifier(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …), …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492
┌ Info: Running ATUSClusterClassifier composite model.
└ @ Main /Users/mph/Repos/DoingRightNow-Analysis/atus_ml_model.ipynb:17


┌ Info: Training machine(:one_hot_encoder, …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492
┌ Info: Spawning 50 sub-features to one-hot encode feature :GESTFIPS_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878


┌ Info: Spawning 1 sub-features to one-hot encode feature :PEMARITL_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 2 sub-features to one-hot encode feature :HETENURE_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 6 sub-features to one-hot encode feature :TUDIARYDAY_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878


┌ Info: Training machine(:continuous_encoder, …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492


┌ Info: Training machine(:clusterer, …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492


┌ Info: Training machine(:classifier, …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492


trained Machine; does not cache data
  model: ATUSClusterClassifier(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …)
  args: 
    1:	Source @287 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{51}}, AbstractVector{Multiclass{2}}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{7}}, AbstractVector{OrderedFactor{16}}}}
    2:	Source @483 ⏎ AbstractVector{Multiclass{99}}


In [14]:
ŷ = predict(mach, X_test)

523411-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{99}, Int64, UInt32, Float64}:
 UnivariateFinite{Multiclass{99}}(101=>0.99, 102=>0.0, 103=>0.0, 104=>0.0, 105=>0.0, 201=>0.0, 202=>0.0, 203=>0.0, 204=>0.0, 205=>0.0, 206=>0.0, 207=>0.0, 208=>0.0, 209=>0.0, 299=>0.0, 301=>0.0, 302=>0.0, 303=>0.0, 304=>0.0, 305=>0.0, 399=>0.0, 401=>0.0, 402=>0.0, 403=>0.0, 404=>0.0, 405=>0.0, 499=>0.0, 501=>0.0, 502=>0.0, 503=>0.0, 504=>0.0, 599=>0.0, 601=>0.0, 602=>0.0, 603=>0.0, 604=>0.0, 699=>0.0, 701=>0.0, 702=>0.0, 801=>0.0, 802=>0.0, 803=>0.0, 804=>0.0, 805=>0.0, 806=>0.0, 807=>0.0, 899=>0.0, 901=>0.0, 902=>0.0, 903=>0.0, 904=>0.0, 905=>0.0, 999=>0.0, 1001=>0.0, 1002=>0.0, 1003=>0.0, 1004=>0.0, 1101=>0.0, 1102=>0.0, 1201=>0.01, 1202=>0.0, 1203=>0.0, 1204=>0.0, 1205=>0.0, 1301=>0.0, 1302=>0.0, 1303=>0.0, 1401=>0.0, 1499=>0.0, 1501=>0.0, 1502=>0.0, 1503=>0.0, 1504=>0.0, 1505=>0.0, 1506=>0.0, 1507=>0.0, 1508=>0.0, 1599=>0.0, 1601=>0.0, 1602=>0.0, 1801=>0.0, 1802=>0.0, 1803=>0.0, 

In [16]:
mean(cross_entropy(ŷ, y_test))

1.6856045053589397

The mean cross-entropy is 1.6856, an improvement over the 1.98179 from the random forest without clustering.

It's worth noting that fitting this took about 30 minutes.
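
To put these numbers in context, the per-observation cross-entropy is the negative log of the probability the model assigned to the class that actually occurred, so `exp(-mean_loss)` is the geometric mean of those probabilities. The sketch below is plain Julia rather than MLJ's implementation; `cross_entropy_by_hand` and `typical_prob` are illustrative names, and the uniform baseline assumes all 99 levels of the target are equally likely.

```julia
# Per-observation cross-entropy: -log of the probability given to the true class.
cross_entropy_by_hand(p_true) = -log(p_true)

# Geometric-mean probability of the true class implied by a mean cross-entropy.
typical_prob(mean_loss) = exp(-mean_loss)

uniform_baseline = cross_entropy_by_hand(1 / 99)   # guessing uniformly over 99 classes

forest_only    = typical_prob(1.98179)   # out-of-the-box random forest
cluster_forest = typical_prob(1.6856)    # k-means features + random forest

println("uniform baseline loss: ", uniform_baseline)
println("forest only:           ", forest_only)
println("cluster + forest:      ", cluster_forest)
```

Both models do far better than uniform guessing (a loss of about 4.6), and the clustered version raises the typical probability placed on the true class from roughly 0.14 to roughly 0.19.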

# Cluster & Random Forest: Tuning the k-means parameter

Now we try to tune the k-means parameter.
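
One way this could look, reusing the `TunedModel` pattern from earlier, is sketched below. The helper name `tune_k`, the candidate values for `k`, and the choice of a single `Holdout` split (instead of cross-validation) are all assumptions made to keep the number of 30-minute fits small. None of this has been run on the ATUS data.

```julia
using MLJ

# Hypothetical helper: grid-search the number of clusters in a composite model
# that stores its clusterer in a field named `clusterer`.
function tune_k(model, X, y; ks=[25, 50, 100, 200])
    r_k = range(model, :(clusterer.k); values=ks)
    tuned = TunedModel(
        model=model,
        range=r_k,
        tuning=Grid(),                             # try every value in `ks` once
        resampling=Holdout(fraction_train=0.8),    # one split per candidate
        measure=cross_entropy,
    )
    mach = machine(tuned, X, y)
    fit!(mach)
    return report(mach).best_model
end

# Usage (slow: one fit per value in `ks`, plus a final refit of the best model):
# best = tune_k(atus_cluster_classifier, X, y)
```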

# TESTING AND SCRATCH

In [13]:
# NOTE: this reuses the name `X_test` and overwrites the held-out test set from above.
test_mach1 = machine(OneHotEncoder(drop_last=true), X) |> fit!
X_test = MLJ.transform(test_mach1, X);

┌ Info: Training machine(OneHotEncoder(features = Symbol[], …), …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492
┌ Info: Spawning 50 sub-features to one-hot encode feature :GESTFIPS_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 15 sub-features to one-hot encode feature :HEFAMINC_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 1 sub-features to one-hot encode feature :PEMARITL_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 2 sub-features to one-hot encode feature :HETENURE_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878
┌ Info: Spawning 6 sub-features to one-hot encode feature :TUDIARYDAY_label.
└ @ MLJModels /Users/mph/.julia/packages/MLJModels/UM8fF/src/builtins/Transformers.jl:878


In [14]:
test_mach2 = machine(ContinuousEncoder(), X_test) |> fit!
X_test2 = MLJ.transform(test_mach2, X_test)

┌ Info: Training machine(ContinuousEncoder(drop_last = false, …), …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492


Row,snap_time_int,GESTFIPS_label__AK,GESTFIPS_label__AL,GESTFIPS_label__AR,GESTFIPS_label__AZ,GESTFIPS_label__CA,GESTFIPS_label__CO,GESTFIPS_label__CT,GESTFIPS_label__DC,GESTFIPS_label__DE,GESTFIPS_label__FL,GESTFIPS_label__GA,GESTFIPS_label__HI,GESTFIPS_label__IA,GESTFIPS_label__ID,GESTFIPS_label__IL,GESTFIPS_label__IN,GESTFIPS_label__KS,GESTFIPS_label__KY,GESTFIPS_label__LA,GESTFIPS_label__MA,GESTFIPS_label__MD,GESTFIPS_label__ME,GESTFIPS_label__MI,GESTFIPS_label__MN,GESTFIPS_label__MO,GESTFIPS_label__MS,GESTFIPS_label__MT,GESTFIPS_label__NC,GESTFIPS_label__ND,GESTFIPS_label__NE,GESTFIPS_label__NH,GESTFIPS_label__NJ,GESTFIPS_label__NM,GESTFIPS_label__NV,GESTFIPS_label__NY,GESTFIPS_label__OH,GESTFIPS_label__OK,GESTFIPS_label__OR,GESTFIPS_label__PA,GESTFIPS_label__RI,GESTFIPS_label__SC,GESTFIPS_label__SD,GESTFIPS_label__TN,GESTFIPS_label__TX,GESTFIPS_label__UT,GESTFIPS_label__VA,GESTFIPS_label__VT,GESTFIPS_label__WA,GESTFIPS_label__WI,GESTFIPS_label__WV,"HEFAMINC_label__Less than 5,000","HEFAMINC_label__5,000 to 7,499","HEFAMINC_label__7,500 to 9,999","HEFAMINC_label__10,000 to 12,499","HEFAMINC_label__12,500 to 14,999","HEFAMINC_label__15,000 to 19,999","HEFAMINC_label__20,000 to 24,999","HEFAMINC_label__25,000 to 29,999","HEFAMINC_label__30,000 to 34,999","HEFAMINC_label__35,000 to 39,999","HEFAMINC_label__40,000 to 49,999","HEFAMINC_label__50,000 to 59,999","HEFAMINC_label__60,000 to 74,999","HEFAMINC_label__75,000 to 99,999","HEFAMINC_label__100,000 to 149,999",PEMARITL_label__Married,HETENURE_label__Non-pay,HETENURE_label__Own,PRTAGE,TUDIARYDAY_label__Friday,TUDIARYDAY_label__Monday,TUDIARYDAY_label__Saturday,TUDIARYDAY_label__Sunday,TUDIARYDAY_label__Thursday,TUDIARYDAY_label__Tuesday
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,60.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,30.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,43.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.0,1.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,0.0,0.0,1.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,16.0,0.0,0.0,0.0,1.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,1.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
test_mach3 = machine(KMeans(k=100),X_test2) |> fit!


┌ Info: Training machine(KMeans(k = 100, …), …).
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:492


trained Machine; caches model-specific representations of data
  model: KMeans(k = 100, …)
  args: 
    1:	Source @902 ⏎ Table{AbstractVector{Continuous}}


In [37]:
X_test3 = MLJ.transform(test_mach3, X_test2)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,4836.09,6.18537e5,7.92463e5,89036.1,4.28988e5,4.5536e5,2.46699e5,8932.75,1.18816e6,23593.0,1.2972e6,9.60838e5,1.74541e5,73597.3,53528.6,60069.6,1.66417e5,1.27894e5,1.06411e6,232.134,42638.2,1.13637e6,274804.0,3.41917e5,1.29649e6,2.97217e5,357142.0,36161.1,7.06387e5,19341.0,82512.4,1.07581e6,5.1829e5,5.99046e5,8.38221e5,3.83558e5,7673.07,8.49167e5,902.922,66972.2,47360.5,18504.3,9.40322e5,1.19556e5,533620.0,1.47811e5,1.85388e5,14246.8,2.73165e5,1.60524e5,6.17517e5,1.91734e5,1.18665e6,2.17441e5,2.00719e5,4.57095e5,3.28098e5,5272.79,3.90561e5,2.56798e5,1.13274e6,7.4828e5,5.73884e5,242031.0,2.16117e5,3672.07,27045.8,4.17383e5,1.1066e6,3.04408e5,1.39316e5,4.94446e5,255.124,6.62631e5,1.03974e6,90771.7,1924.74,3.24048e5,3.62047e5,9.09821e5,1.24328e6,1.02424e5,4.84672e5,7.48844e5,1.01299e6,12429.8,7.01373e5,1.24384e6,6.5385e5,32414.8,2.28874e5,5.57215e5,7.968e5,2055.16,8.83381e5,4.21296e5,1.07546e5,1312.81,9.93549e5,8.91807e5
2,3898.88,619041.0,7.91867e5,90668.4,4.29376e5,4.56806e5,2.4579e5,8270.94,1.18949e6,22745.9,1.29657e6,9.62151e5,174937.0,74843.4,52927.9,61355.7,1.68013e5,1.29213e5,1.06316e6,1839.8,42042.9,1.13778e6,2.76237e5,3.43588e5,1.29779e6,296755.0,3.5629e5,37467.8,7.05786e5,20962.8,81740.0,1.07733e6,5.19592e5,6.00633e5,8.37628e5,3.85037e5,9160.27,8.50592e5,53.2605,66327.3,48689.4,18711.9,9.39785e5,1.18973e5,5.33012e5,1.49048e5,1.84432e5,13496.9,2.72435e5,1.59849e5,6.16597e5,1.93277e5,1.18612e6,2.16435e5,2.00784e5,4.56443e5,328513.0,5597.66,3.90045e5,2.57265e5,1.13183e6,7.49614e5,5.73309e5,2.43615e5,217602.0,5228.08,28289.7,4.18924e5,1.10681e6,3.05835e5,1.38691e5,4.93795e5,638.227,6.62067e5,1.03992e6,91085.1,1001.74,3.23116e5,3.62634e5,9.11285e5,1.24461e6,1.01688e5,4.85888e5,7.48255e5,1.01443e6,13659.4,7.02697e5,1.24325e6,6.55264e5,31826.5,2.29039e5,5.5849e5,7.98128e5,2245.93,8.83675e5,4.20392e5,1.08864e5,2835.82,992886.0,8.90801e5
3,4084.2,6.18602e5,7.91904e5,89740.2,4.28986e5,4.55958e5,2.45963e5,8336.78,1.18869e6,22892.1,1.29662e6,9.61361e5,1.74544e5,74082.5,52967.2,60577.4,1.67101e5,1.2842e5,1.06335e6,922.275,42079.9,1.13695e6,2.75395e5,3.42643e5,1.29701e6,2.96734e5,3.56438e5,36680.6,7.05825e5,20039.2,81853.8,1.07645e6,5.18807e5,5.99725e5,837664.0,3.84175e5,8294.93,8.49754e5,200.569,66385.8,47892.6,18400.8,9.39797e5,119005.0,5.33055e5,1.48291e5,1.84625e5,13601.0,2.7253e5,1.59921e5,6.16775e5,1.92387e5,1.18613e6,2.1665e5,2.00534e5,4.56505e5,3.28112e5,5235.73,3.90048e5,2.56841e5,1.132e6,7.48815e5,5.73337e5,2.42708e5,2.16738e5,4332.93,27529.7,4.18035e5,1.1065e6,304996.0,138741.0,4.93856e5,251.076,6.62091e5,1.03962e6,90728.1,1180.89,3.23299e5,3.62159e5,9.1043e5,1.24382e6,1.01786e5,4.8514e5,7.48289e5,1.01358e6,12905.6,7.01903e5,1.24329e6,6.5443e5,31860.4,2.28746e5,5.57717e5,7.97331e5,1942.09,8.83326e5,4.20563e5,1.08072e5,1954.96,9.92952e5,8.91016e5
4,3939.66,6.19418e5,7.91988e5,91309.3,4.29726e5,4.57403e5,245838.0,8376.08,1.19006e6,22807.7,1.29669e6,962717.0,1.75289e5,75394.1,53047.3,61915.7,1.68646e5,1.2978e5,1.06319e6,2474.9,42163.6,1.13837e6,2.76832e5,3.44238e5,1.29836e6,2.96907e5,3.5635e5,38032.6,7.05906e5,21601.2,81819.3,1.07795e6,5.20156e5,601263.0,8.37749e5,3.85642e5,9767.25,8.51185e5,114.509,66436.5,49259.4,19020.1,9.3992e5,1.19097e5,5.3313e5,1.49596e5,184468.0,13581.5,2.72524e5,1.59951e5,6.16642e5,1.93897e5,1.18625e6,2.1646e5,2.01059e5,4.5655e5,3.2887e5,5933.28,3.90184e5,2.57634e5,1.13188e6,7.50186e5,5.73434e5,2.44245e5,2.18208e5,5851.12,28839.9,4.19543e5,1.10712e6,306428.0,1.38805e5,4.93903e5,987.438,6.62195e5,1.04022e6,91418.0,1045.85,323158.0,3.63031e5,9.11887e5,1.24519e6,1.01776e5,4.86431e5,7.48377e5,1.01503e6,14206.3,7.03266e5,1.24338e6,6.55853e5,31948.9,2.29338e5,5.59047e5,7.98698e5,2550.26,8.84003e5,4.2044e5,1.09432e5,3451.15,9.92991e5,890826.0
5,3986.77,6.19609e5,7.92069e5,91613.3,4.29906e5,4.57689e5,245888.0,8450.71,1.19033e6,22863.9,1.29677e6,962989.0,1.7547e5,75659.5,53128.0,62185.2,1.68946e5,1.30053e5,1.06324e6,2776.51,42244.9,1.13865e6,2.77116e5,3.44546e5,1.29863e6,2.97001e5,3.56406e5,38304.1,7.05986e5,21904.3,81882.9,1.07824e6,5.20427e5,6.01562e5,8.3783e5,3.8593e5,10056.8,8.51468e5,170.367,66512.8,49533.1,19181.6,9.40007e5,1.19179e5,5.3321e5,1.49861e5,1.84513e5,13647.3,2.72592e5,1.60025e5,6.16691e5,1.94192e5,1.18634e6,2.165e5,201206.0,4.56626e5,3.29052e5,6106.56,3.90273e5,2.57821e5,1.13193e6,7.5046e5,5.73517e5,2.44544e5,2.18498e5,6147.55,29105.1,4.19838e5,1.10728e6,3.06712e5,1.38883e5,4.93978e5,1166.54,6.6228e5,1.04038e6,91590.2,1094.37,3.23206e5,3.63231e5,9.12174e5,1.24546e6,1.01843e5,4.86694e5,7.48459e5,1.01531e6,14470.1,7.0354e5,1.24346e6,6.56136e5,32030.8,229495.0,5.59315e5,7.98971e5,2710.12,8.84173e5,4.2049e5,1.09704e5,3744.28,9.93065e5,8.90866e5
6,4105.48,6.19968e5,7.92245e5,92160.3,4.30245e5,4.58205e5,2.46011e5,8615.33,1.19083e6,22997.6,1.29694e6,9.63483e5,1.75811e5,76142.1,53302.9,62674.4,1.69487e5,1.30548e5,1.06336e6,3319.38,42420.6,1.13916e6,2.77629e5,3.451e5,1.29912e6,2.97199e5,3.56539e5,38796.8,7.06161e5,22449.5,82029.1,1.07877e6,5.20919e5,6.02102e5,838006.0,3.86452e5,10579.6,8.51981e5,303.678,66680.2,50029.5,19491.2,9.40192e5,1.19357e5,5.33384e5,1.50342e5,1.84629e5,13797.3,2.72746e5,160187.0,6.16812e5,1.94724e5,1.18652e6,2.16608e5,2.01492e5,4.56792e5,3.29396e5,6435.61,3.90462e5,2.58174e5,1.13205e6,750957.0,5.73696e5,245083.0,2.1902e5,6681.81,29587.3,4.2037e5,1.10759e6,3.07224e5,1.39053e5,4.94145e5,1505.29,6.62461e5,1.04069e6,91917.3,1215.46,3.23325e5,3.63603e5,9.12693e5,1.24596e6,1.01996e5,4.87171e5,7.48636e5,1.01583e6,14949.9,7.04035e5,1.24364e6,6.56646e5,32207.7,2.29798e5,5.59803e5,7.99467e5,3016.83,8.84497e5,4.20615e5,1.10199e5,4273.04,9.9323e5,8.90974e5
7,3927.32,6.19358e5,7.91964e5,91210.8,4.29669e5,457311.0,2.45825e5,8354.48,1.18997e6,22792.4,1.29666e6,9.62629e5,1.75232e5,75308.5,53023.7,61828.8,1.68548e5,1.29692e5,1.06318e6,2377.26,42139.7,1.13828e6,2.7674e5,3.44139e5,1.29827e6,2.96879e5,3.56335e5,37945.0,7.05882e5,21503.1,81801.4,1.07785e6,5.20069e5,601166.0,8.37725e5,3.85548e5,9673.63,8.51093e5,99.2259,66414.3,49171.1,18969.3,9.39894e5,1.19072e5,5.33107e5,1.49511e5,1.84456e5,13562.9,2.72505e5,1.5993e5,6.16629e5,1.93802e5,1.18623e6,216450.0,2.01013e5,4.56528e5,328812.0,5878.54,3.90157e5,2.57574e5,1.13186e6,7.50097e5,5.73409e5,244148.0,2.18115e5,5755.2,28754.4,4.19448e5,1.10707e6,3.06336e5,1.38782e5,4.93881e5,930.747,6.62171e5,1.04017e6,91363.7,1033.03,3.23145e5,3.62968e5,9.11794e5,1.2451e6,101757.0,4.86347e5,7.48353e5,1.01494e6,14121.3,7.03178e5,1.24335e6,6.55762e5,31924.8,2.29288e5,5.58961e5,7.98609e5,2500.02,8.83949e5,4.20427e5,1.09344e5,3356.34,9.92969e5,8.90816e5
8,4077.28,6.19892e5,7.92206e5,92046.2,4.30173e5,4.58097e5,2.45982e5,8577.86,1.19073e6,22966.4,1.2969e6,9.63379e5,1.75738e5,76040.8,53263.3,62571.8,1.69374e5,1.30444e5,1.06333e6,3206.05,42380.9,1.13906e6,2.77522e5,3.44984e5,1.29902e6,2.97155e5,3.56508e5,38693.5,7.06122e5,22335.7,81995.3,1.07866e6,520816.0,6.01989e5,8.37966e5,3.86343e5,10470.3,8.51873e5,272.516,66642.2,49925.5,19424.5,9.40151e5,1.19317e5,5.33345e5,1.50241e5,1.84601e5,13762.7,2.7271e5,160150.0,6.16783e5,1.94613e5,1.18648e6,2.16582e5,2.0143e5,4.56754e5,3.29323e5,6365.05,3.90419e5,2.58099e5,1.13202e6,7.50853e5,5.73656e5,2.44971e5,2.18911e5,6570.2,29486.1,4.20259e5,1.10753e6,307117.0,1.39015e5,494107.0,1432.79,662420.0,1.04062e6,91847.1,1186.78,3.23297e5,3.63524e5,9.12585e5,1.24585e6,1.01961e5,487071.0,748596.0,1.01572e6,14849.2,7.03931e5,1.2436e6,6.56539e5,32167.8,2.29732e5,5.59701e5,7.99363e5,2950.74,8.84427e5,4.20585e5,1.10095e5,4162.53,9.93192e5,8.90948e5
9,4105.14,6.19968e5,7.92245e5,92159.6,4.30245e5,4.58204e5,2.46011e5,8614.92,1.19083e6,22997.2,1.29694e6,963482.0,175810.0,76141.4,53302.4,62673.8,1.69486e5,1.30547e5,1.06335e6,3318.65,42420.1,1.13916e6,2.77629e5,3.45099e5,1.29912e6,297199.0,3.56538e5,38796.1,7.06161e5,22448.8,82028.7,1.07877e6,5.20918e5,602101.0,8.38006e5,386451.0,10578.9,8.5198e5,303.304,66679.8,50028.8,19490.6,9.40192e5,1.19356e5,5.33383e5,150341.0,1.84628e5,13796.9,2.72745e5,1.60187e5,6.16812e5,1.94724e5,1.18652e6,2.16607e5,2.01491e5,4.56791e5,3.29396e5,6435.01,3.90461e5,2.58173e5,1.13205e6,7.50956e5,5.73696e5,2.45082e5,2.19019e5,6681.09,29586.6,4.20369e5,1.10759e6,3.07224e5,1.39053e5,4.94144e5,1504.7,6.6246e5,1.04068e6,91916.7,1215.11,3.23325e5,3.63603e5,9.12693e5,1.24596e6,1.01995e5,4.87171e5,7.48635e5,1.01583e6,14949.2,7.04034e5,1.24364e6,6.56646e5,32207.3,2.29797e5,5.59802e5,7.99467e5,3016.25,8.84496e5,4.20614e5,1.10198e5,4272.32,9.93229e5,8.90973e5
10,4076.68,6.19891e5,7.92205e5,92045.5,430172.0,458096.0,2.45982e5,8577.27,1.19073e6,22965.8,1.2969e6,9.63379e5,1.75737e5,76040.2,53262.7,62571.2,1.69374e5,1.30444e5,1.06333e6,3205.42,42380.3,1.13906e6,2.77521e5,3.44984e5,1.29902e6,2.97155e5,3.56507e5,38692.9,706121.0,22335.1,81994.7,1.07865e6,5.20815e5,6.01989e5,8.37966e5,386342.0,10469.7,8.51873e5,271.931,66641.6,49924.9,19424.0,940150.0,1.19316e5,533344.0,1.5024e5,1.84601e5,13762.2,2.7271e5,1.60149e5,6.16783e5,1.94613e5,1.18648e6,2.16581e5,2.01429e5,4.56754e5,3.29322e5,6364.5,3.90419e5,2.58098e5,1.13202e6,7.50852e5,5.73655e5,2.4497e5,2.1891e5,6569.58,29485.5,4.20258e5,1.10753e6,3.07116e5,1.39014e5,4.94106e5,1432.24,6.62419e5,1.04062e6,91846.6,1186.18,3.23296e5,3.63524e5,9.12584e5,1.24585e6,1.0196e5,4.8707e5,7.48595e5,1.01572e6,14848.6,7.03931e5,1.2436e6,6.56539e5,32167.2,2.29732e5,559700.0,7.99363e5,2950.2,8.84427e5,4.20585e5,1.10095e5,4161.92,9.93192e5,8.90947e5
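
The `x1` to `x100` columns above are what the k-means `transform` hands to the forest: one distance per cluster center, which with the `SqEuclidean` metric shown in the model printout is a squared Euclidean distance. The toy example below recomputes such distances by hand. The centroids and the observation are invented, and `sq_distances` is an illustrative helper, not part of Clustering.jl.

```julia
# Squared Euclidean distance from one observation to each centroid (one per column).
sq_distances(x::AbstractVector, centroids::AbstractMatrix) =
    [sum(abs2, x .- c) for c in eachcol(centroids)]

# Hypothetical toy data: rows are (time of day in minutes, one 0/1 dummy variable).
centroids = [0.0 600.0 1200.0;
             0.0   1.0    0.0]
x = [590.0, 0.0]

d = sq_distances(x, centroids)
nearest = argmin(d)    # index of the closest centroid

println("squared distances: ", d)
println("nearest centroid:  ", nearest)
```

In this toy case the time-like feature decides the nearest centroid almost entirely, and the magnitudes in the table above (up to about 1e6) hint that the same may be happening with the unscaled `snap_time_int` column. Standardizing the continuous columns before clustering might be worth trying.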


In [23]:
X_test3

2093645-element CategoricalArray{Int64,1,UInt32}:
 20
 39
 39
 39
 39
 39
 39
 39
 39
 39
 ⋮
 11
 25
 11
 11
 11
 25
 25
 11
 25

In [29]:
Table(X_test3)

MethodError: MethodError: no method matching Table(::CategoricalVector{Int64, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}})
Closest candidates are:
  Table(!Matched::Type...) at ~/.julia/packages/ScientificTypesBase/N7myy/src/ScientificTypesBase.jl:107

In [24]:
test_mach4 = machine(OneHotEncoder(drop_last=true),X_test3)
X_test4 = MLJ.transform(test_mach4, X_test3)

│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc MLJModels.OneHotEncoder` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{AbstractVector{Multiclass{100}}}
│ 
│ fit_data_scitype(model) = Tuple{Table}
└ @ MLJBase /Users/mph/.julia/packages/MLJBase/g5E7V/src/machines.jl:230


ErrorException: machine(OneHotEncoder(features = Symbol[], …), …) has not been trained. 

In [None]:
X_test3 = MLJ.transform(test_mach3, X_test2)

In [34]:
report(test_mach3).assignments

2093645-element Vector{Int64}:
 20
 39
 39
 39
 39
 39
 39
 39
 39
 39
  ⋮
 11
 25
 11
 11
 11
 25
 25
 11
 25