# Introduction to Machine Learning with Julia

> Disclaimer: This Jupyter notebook is inspired by the Getting Started section of MLJ's documentation.

One of the packages for machine learning in Julia is [MLJ](https://github.com/alan-turing-institute/MLJ.jl).
If you are new to this library, as we are(!), MLJ has good [documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/).


First, we have to install this package. If you have already installed it, you can of course skip this step.

In [1]:
using Pkg;
Pkg.activate("my_MLJ_env", shared=true);

Pkg.add(["MLJ", "DataFrames", "MLJDecisionTreeInterface", "Distributions"])

[32m[1m Activating[22m[39m environment at `~/.julia/environments/my_MLJ_env/Project.toml`
[32m[1m   Updating[22m[39m registry at `~/.julia/registries/General`


[?25l    

[32m[1m   Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`




[32m[1m  Resolving[22m[39m package versions...
[32m[1m   Updating[22m[39m `~/.julia/environments/my_MLJ_env/Project.toml`
[90m [no changes][39m
[32m[1m   Updating[22m[39m `~/.julia/environments/my_MLJ_env/Manifest.toml`
[90m [no changes][39m


Then we can load the package:

In [2]:
using MLJ

MLJ (Machine Learning in Julia) is a toolbox providing a common interface for selecting, tuning, evaluating, composing, and comparing over 160 ML models. You can take a look at all supported models [here](https://alan-turing-institute.github.io/MLJ.jl/dev/list_of_supported_models/).

You can also take a look at the list of models (model registry) by calling the function `models()`:

In [3]:
models()

184-element Array{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = ABODDetector, package_name = OutlierDetectionPython, ... )
 (name = AEDetector, package_name = OutlierDetectionNetworks, ... )
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, pa

What we see here is that MLJ wraps models from other packages such as ScikitLearn.

We will now proceed with the "Hello World" example for Machine Learning: The iris dataset.

## Obtain Dataset

In [4]:
iris = load_iris()
selectrows(iris, 1:3) |> pretty

┌──────────────┬─────────────┬──────────────┬─────────────┬─────────────────────────────────┐
│ sepal_length │ sepal_width │ petal_length │ petal_width │ target                          │
│ Float64      │ Float64     │ Float64      │ Float64     │ CategoricalValue{String,UInt32} │
│ Continuous   │ Continuous  │ Continuous   │ Continuous  │ Multiclass{3}                   │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────────────────────────────┤
│ 5.1          │ 3.5         │ 1.4          │ 0.2         │ setosa                          │
│ 4.9          │ 3.0         │ 1.4          │ 0.2         │ setosa                          │
│ 4.7          │ 3.2         │ 1.3          │ 0.2         │ setosa                          │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────────────────────────────┘


The piping operator `|>` allows for function chaining: the value on the left is passed as the argument to the function on the right. In the example above, we first select the first three rows and then pretty-print them.
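As a small illustration in plain Julia (independent of MLJ, with made-up numbers), the following sketch shows that piping is just another way of writing nested function calls:

```julia
# Made-up example data, not part of the iris dataset.
numbers = [-1, 2, -3]

# Nested call: read from the inside out.
nested = sum(abs.(numbers))

# Piped version: read from left to right.
# `.|>` applies `abs` to each element, `|>` passes the result on to `sum`.
piped = numbers .|> abs |> sum

println(nested == piped)  # both are 6
```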

What data type is our `iris` variable?

In [5]:
typeof(iris)

NamedTuple{(:sepal_length, :sepal_width, :petal_length, :petal_width, :target),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalValue{String,UInt32},Union{}}}}
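The type above is long, but the idea is simple: `iris` is a NamedTuple whose fields are the columns. Here is a minimal sketch with a hypothetical two-column table (the names `toy`, `length_cm`, and `label` are made up for illustration):

```julia
# A toy column table: each field holds one column as a vector.
toy = (length_cm = [5.1, 4.9, 4.7], label = ["a", "a", "b"])

column_names = keys(toy)        # (:length_cm, :label)
first_column = toy.length_cm    # access a column by name
n_rows = length(first_column)   # number of rows

println(column_names)
println(n_rows)
```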

We can also take a look at the schema of `iris` like this:

In [6]:
schema(iris)

┌──────────────┬─────────────────────────────────┬───────────────┐
│ _.names      │ _.types                         │ _.scitypes    │
├──────────────┼─────────────────────────────────┼───────────────┤
│ sepal_length │ Float64                         │ Continuous    │
│ sepal_width  │ Float64                         │ Continuous    │
│ petal_length │ Float64                         │ Continuous    │
│ petal_width  │ Float64                         │ Continuous    │
│ target       │ CategoricalValue{String,UInt32} │ Multiclass{3} │
└──────────────┴─────────────────────────────────┴───────────────┘
_.nrows = 150


## Split Dataset
We first split our dataset into training and test sets. To do so, we convert this NamedTuple to a [DataFrame](https://dataframes.juliadata.org/stable/), which is similar to Python's `pandas` in both design and functionality.

In [7]:
import DataFrames

iris = DataFrames.DataFrame(iris)
size(iris)  # get the shape of this DataFrame

(150, 5)

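Now we split features and target. Passing `rng=123` also shuffles the rows reproducibly, which matters here because the rows of the iris dataset are ordered by class: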

In [8]:
target, features = unpack(iris, ==(:target), colname -> true, rng=123)
first(features, 3) |> pretty

┌──────────────┬─────────────┬──────────────┬─────────────┐
│ sepal_length │ sepal_width │ petal_length │ petal_width │
│ Float64      │ Float64     │ Float64      │ Float64     │
│ Continuous   │ Continuous  │ Continuous   │ Continuous  │
├──────────────┼─────────────┼──────────────┼─────────────┤
│ 6.7          │ 3.3         │ 5.7          │ 2.1         │
│ 5.7          │ 2.8         │ 4.1          │ 1.3         │
│ 7.2          │ 3.0         │ 5.8          │ 1.6         │
└──────────────┴─────────────┴──────────────┴─────────────┘


The function `first()` is similar to the `head()` function in `pandas`: it lets you display or select the first few rows (three in our example). You can also take a look at the last few rows by calling `last()`. Try it!
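A minimal sketch of both functions on a small, made-up DataFrame (the column names `id` and `value` are hypothetical):

```julia
import DataFrames

# A small made-up DataFrame, unrelated to the iris data.
df = DataFrames.DataFrame(id = 1:5, value = [10.0, 20.0, 30.0, 40.0, 50.0])

head_rows = first(df, 2)  # first two rows
tail_rows = last(df, 2)   # last two rows

println(head_rows.id)  # [1, 2]
println(tail_rows.id)  # [4, 5]
```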

In [9]:
target[1:3]

3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "virginica"
 "versicolor"
 "virginica"

Now we create our train/test split:

In [10]:
train_features = features[1:100, :]
train_target = target[1:100]

size(train_features)

(100, 4)

In [11]:
test_features = features[101:150, :]
test_target = target[101:150]
size(test_features)

(50, 4)
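MLJ also provides a `partition` function that produces such row indices for you. The sketch below assumes a 70/30 split and the seed `123`. Both are arbitrary choices for illustration, not values used elsewhere in this notebook:

```julia
using MLJ

# Split the row indices 1:150 into two disjoint groups.
# `shuffle=true` with a fixed `rng` makes the split random but reproducible.
train_rows, test_rows = partition(1:150, 0.7, shuffle=true, rng=123)

# Every row ends up in exactly one of the two groups.
println(length(train_rows) + length(test_rows))  # 150
println(isempty(intersect(train_rows, test_rows)))
```

The resulting index vectors can be used like the ranges above, for example `features[train_rows, :]` and `target[train_rows]`.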

## Train a Model
Before we can train a model, we have to select one first. Previously we just took a look at what models are available in general. However, we can also check which models are suitable for our dataset:

In [13]:
models(matching(train_features, train_target))

47-element Array{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BaggingClassifier, package_name = ScikitLearn, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianLDA, package_name = ScikitLearn, ... )
 (name = BayesianQDA, package_name = ScikitLearn, ... )
 (name = BayesianSubspaceLDA, package_name = MultivariateS

As we can see, multiple models are suitable. For now, we will stick to a simple model: a decision tree.

In order to use this classifier, we have to load it and bind it to some name:

In [None]:
Tree = @load DecisionTreeClassifier pkg = DecisionTree

(In this case, we need to specify `pkg` because multiple packages provide a model type with the name `DecisionTreeClassifier`.) Now we can instantiate a model with default hyperparameters:

In [None]:
tree = Tree()
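Hyperparameters can also be set explicitly instead of relying on the defaults. The following sketch uses the `max_depth` hyperparameter of the decision tree. The values `3` and `2` are arbitrary and only serve as an illustration:

```julia
using MLJ

# Load the model type (requires MLJDecisionTreeInterface to be installed).
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# `max_depth=3` is an arbitrary value chosen for illustration.
shallow_tree = Tree(max_depth=3)

# Models are mutable, so a hyperparameter can also be changed afterwards.
shallow_tree.max_depth = 2
println(shallow_tree.max_depth)  # 2
```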

Remember the very first code cell in this Jupyter notebook? It installed some packages like the MLJ library itself, but also `MLJDecisionTreeInterface`. Why did we do that? Because MLJ is just a wrapper, and `DecisionTree.jl` is not a dependency of MLJ. Therefore, we have to install it ourselves.

When you want to use a model (i.e., load and bind it like we did with the decision tree classifier) and this model or its package is not installed in your environment, you'll encounter an error. However, there is no need to panic! The error message will tell you what to do.

Now that we have instantiated our decision tree, we can train it.

In [None]:
# First we have to create a machine
mach = machine(tree, train_features, train_target)

# And then we can train
fit!(mach)

**What is a machine?**

A machine binds a model to data and it stores the model's learned parameters. For more information, take a look at the [documentation on machines](https://alan-turing-institute.github.io/MLJ.jl/stable/machines/).
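To make the "learned parameters" part concrete, here is a sketch that trains a machine and then inspects it with `fitted_params` and `report`. It repeats the data preparation from above so that it runs on its own. For brevity it trains on the full dataset, which is a simplification and not what the notebook does:

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# Same data preparation as above.
iris = load_iris()
target, features = unpack(iris, ==(:target), colname -> true, rng=123)

# Bind model and data, then train quietly.
mach = machine(Tree(), features, target)
fit!(mach, verbosity=0)

# Learned parameters (for a decision tree, this includes the fitted tree).
params = fitted_params(mach)
println(keys(params))

# Additional by-products of training, if the model provides any.
println(report(mach))
```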

### Evaluate

Once the model is trained, we can evaluate it:

In [None]:
predictions = predict(mach, test_features)
predictions[1:5]

In [None]:
log_loss(predictions, test_target) |> mean

Our `predictions` contain the probabilities of all possible classes. For each element, we now have to fetch the class with the highest probability. We can do that with:

In [None]:
clean_predictions = mode.(predictions)
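To see what `mode` does on a single prediction, here is a minimal sketch with a hand-built distribution. The probabilities are made up, and the sketch assumes that `UnivariateFinite` (the distribution type MLJ uses for probabilistic class predictions) can be constructed from raw labels with `pool=missing`, as shown in the MLJ documentation:

```julia
using MLJ

# A made-up probabilistic prediction over the three iris classes.
d = UnivariateFinite(["setosa", "versicolor", "virginica"],
                     [0.1, 0.7, 0.2], pool=missing)

# Probability assigned to a single class.
println(pdf(d, "versicolor"))

# `mode` picks the class with the highest probability.
println(mode(d))
```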

Now we can compute a confusion matrix and calculate the accuracy:

In [None]:
confusion_matrix(clean_predictions, test_target)

In [None]:
accuracy(clean_predictions, test_target)
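Accuracy is simply the fraction of predictions that match the true labels. A sketch of the same calculation by hand, using made-up labels rather than the notebook's actual predictions:

```julia
using Statistics

# Made-up labels for illustration only.
predicted = ["setosa", "virginica", "versicolor", "virginica"]
actual    = ["setosa", "virginica", "virginica",  "virginica"]

# Compare element-wise, then average the matches (true counts as 1).
manual_accuracy = mean(predicted .== actual)
println(manual_accuracy)  # 0.75
```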

Are you wondering which measures are available?

In [None]:
measures()

Instead of fetching the most likely predicted class manually, we can automate it with:

In [None]:
clean_predictions = predict_mode(mach, test_features)

On top of that (and maybe because we want to manage our resources efficiently), we can train, predict, and evaluate a model of our choice with a single function call: `evaluate()`. This function performs all the steps we did above.

In [None]:
evaluate(tree, train_features, train_target,
        resampling=CV(shuffle=true),
        measures=[log_loss, accuracy],
        verbosity=0)

Furthermore, we do not even need to split the data into training and test, because we can also automate that:

In [None]:
mach2 = machine(tree, features, target)
evaluate!(mach2, resampling=Holdout(fraction_train=0.7),
    measures=[log_loss, accuracy],
    verbosity=0)

## Exercise
Play around a little bit and train a different model.

In [None]:
# insert your code here

## Next Steps
The getting started guide from MLJ states:

"To learn a little more about what MLJ can do, browse [Common MLJ Workflows](https://alan-turing-institute.github.io/MLJ.jl/dev/common_mlj_workflows/) or [Data Science Tutorials in Julia](https://alan-turing-institute.github.io/DataScienceTutorials.jl/) or try the [JuliaCon2020 Workshop](https://github.com/ablaom/MachineLearningInJulia2020) on MLJ (recorded [here](https://www.youtube.com/watch?time_continue=27&v=qSWbCn170HU&feature=emb_title)) returning to the manual as needed."

So have fun checking them out if you're interested!