# Configuration Tutorial

This tutorial will show how to use Tribuo's configuration and provenance systems to build models on MNIST (because we wouldn't be doing ML without an MNIST demo).
We'll focus on logistic regression, show how many different trainers can be stored in the same configuration, and how the provenance system allows the configuration for a specific run to be regenerated.
We'll also briefly look at Tribuo's feature transformation system and see how that integrates into configuration and provenance.

## Setup
You'll need to get a copy of the MNIST dataset in the original IDX format.

First the training data:

`wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz`

Then the test data:

`wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz`

Tribuo's IDX loader natively reads gzipped files so you don't need to unzip them.
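
To make the file format a little less opaque, here is a minimal sketch of reading an IDX header straight from a gzipped stream using only the JDK. It relies on the published IDX layout (two zero bytes, a type code, a dimension count, then big-endian 32-bit dimension sizes). The class `IdxHeaderSketch` and its methods are hypothetical names for illustration, not Tribuo's loader, and the tiny in-memory payload stands in for the real MNIST files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative helper (hypothetical, not part of Tribuo). */
final class IdxHeaderSketch {

    /** Reads the dimension sizes from the header of a gzipped IDX stream. */
    static int[] readShape(InputStream gzipped) {
        // DataInputStream reads big-endian values, which matches the IDX layout.
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped))) {
            if (in.readUnsignedShort() != 0) {
                throw new IllegalArgumentException("Not an IDX stream");
            }
            in.readUnsignedByte();               // data type code, 0x08 is unsigned byte
            int numDims = in.readUnsignedByte(); // number of dimensions
            int[] shape = new int[numDims];
            for (int i = 0; i < numDims; i++) {
                shape[i] = in.readInt();
            }
            return shape;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny gzipped two-dimensional IDX payload in memory. */
    static byte[] tinyIdx(int rows, int cols) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeShort(0);                 // two leading zero bytes
            out.writeByte(0x08);               // unsigned byte data
            out.writeByte(2);                  // two dimensions
            out.writeInt(rows);
            out.writeInt(cols);
            out.write(new byte[rows * cols]);  // all-zero body
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        int[] shape = readShape(new ByteArrayInputStream(tinyIdx(3, 4)));
        System.out.println(Arrays.toString(shape)); // prints [3, 4]
    }
}
```

Pointing `readShape` at a `FileInputStream` over one of the downloaded `.gz` files should report that file's dimensions without unzipping it first.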

It's Java, so first we load in the necessary Tribuo jars. Here we're using the classification experiments jar, along with the JSON interop jar to read and write the provenance information.

In [1]:
%jars ./tribuo-core-4.1.0-SNAPSHOT.jar
%jars ./tribuo-classification-experiments-4.0.0-jar-with-dependencies.jar
%jars ./tribuo-json-4.0.0-jar-with-dependencies.jar

Now let's import the packages we need. We'll use a few file manipulation things from Java, and then Tribuo's core packages, the transformation packages, the classification package, the classification evaluation package, and then a few things that relate to the provenance system.

In [2]:
import java.nio.file.Files;
import java.nio.file.Paths;

In [3]:
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.transform.*;
import org.tribuo.transform.transformations.LinearScalingTransformation;
import org.tribuo.classification.*;
import org.tribuo.classification.evaluation.*;
import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;
import com.oracle.labs.mlrg.olcut.provenance.*;
import com.oracle.labs.mlrg.olcut.provenance.primitives.*;
import com.oracle.labs.mlrg.olcut.config.json.JsonConfigFactory;

By default OLCUT's `ConfigurationManager` only understands XML files. The snippet below adds JSON support to all `ConfigurationManager`s in the running JVM. If you're using OLCUT's CLI options processing, the same support can be added dynamically on the command line by supplying `--config-file-format <fully-qualified-class-name>`, where the class name is, for example, `com.oracle.labs.mlrg.olcut.config.json.JsonConfigFactory`.

In [4]:
ConfigurationManager.addFileFormatFactory(new JsonConfigFactory())

## Using a configuration file
We're going to read in an example configuration file, in JSON format. This configuration knows about a bunch of different trainers, and also the training and testing MNIST data sources. In the tutorials directory we supply both the JSON and XML versions of this file, and the remainder of this tutorial is completely agnostic to which one is used.

In [5]:
String configFile = "example-config.json";
String.join("\n",Files.readAllLines(Paths.get(configFile)))

{
  "config" : {
    "components" : [ {
      "name" : "mnist-test",
      "type" : "org.tribuo.datasource.IDXDataSource",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "featuresPath" : "t10k-images-idx3-ubyte.gz",
        "outputPath" : "t10k-labels-idx1-ubyte.gz",
        "outputFactory" : "label-factory"
      }
    }, {
      "name" : "mnist-train",
      "type" : "org.tribuo.datasource.IDXDataSource",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "featuresPath" : "train-images-idx3-ubyte.gz",
        "outputPath" : "train-labels-idx1-ubyte.gz",
        "outputFactory" : "label-factory"
      }
    }, {
      "name" : "adagrad",
      "type" : "org.tribuo.math.optimisers.AdaGrad",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "epsilon" : "0.01",
        "initialLearningRate" : "0.5"
      }
    }, {
      "name" : "log",
      "type" : "org.tribuo.classification.sgd.object

Now we'll make a `ConfigurationManager` and hand it the configuration file to load. Our configuration system also supports CLI options which can load things out of the supplied configuration files. We have examples of this in each of the simple `TrainTest` demo classes in each prediction backend.

In [6]:
var cm = new ConfigurationManager(configFile);

First we'll load in the training and testing `DataSource`s (as instances of `IDXDataSource`), pass them into two `Dataset`s to aggregate the appropriate metadata, and we'll make the evaluator for later use.

In [7]:
DataSource<Label> mnistTrain = (DataSource<Label>) cm.lookup("mnist-train");
DataSource<Label> mnistTest = (DataSource<Label>) cm.lookup("mnist-test");
var trainData = new MutableDataset<>(mnistTrain);
var testData = new MutableDataset<>(mnistTest);
var evaluator = new LabelEvaluator();
System.out.println(String.format("Training data size = %d, number of features = %d, number of classes = %d",trainData.size(),trainData.getFeatureMap().size(),trainData.getOutputInfo().size()));
System.out.println(String.format("Testing data size = %d, number of features = %d, number of classes = %d",testData.size(),testData.getFeatureMap().size(),testData.getOutputInfo().size()));

Training data size = 60000, number of features = 717, number of classes = 10
Testing data size = 10000, number of features = 668, number of classes = 10


## Loading in trainers from the configuration
Our configuration file contains a number of different trainers, so let's pull them out and take a look.

The first one we'll see is a CART decision tree, with a max tree depth of 6.

In [8]:
var cart = (Trainer<Label>) cm.lookup("cart");
cart

CARTClassificationTrainer(maxDepth=6,minChildWeight=5.0,fractionFeaturesInSplit=0.5,impurity=GiniIndex,seed=12345)

Next we'll load an XGBoost trainer, using 10 trees, 6 computation threads, and some regularisation parameters.

In [9]:
var xgb = (Trainer<Label>) cm.lookup("xgboost");
xgb

XGBoostTrainer(numTrees=10,parameters{colsample_bytree=1.0, silent=1, seed=1, max_depth=4, booster=gbtree, objective=multi:softprob, lambda=1.0, eta=0.5, nthread=6, alpha=1.0, subsample=1.0, gamma=0.1, min_child_weight=1.0})

Finally we'll load in a logistic regression trainer, using AdaGrad as the gradient optimizer.

In [10]:
var logistic = (Trainer<Label>) cm.lookup("logistic");
logistic

LinearSGDTrainer(objective=LogMulticlass,optimiser=AdaGrad(initialLearningRate=0.5,epsilon=0.01,initialValue=0.0),epochs=2,minibatchSize=1,seed=1)

We can also load in a list containing all the `Trainer` implementations in this config file. Note: by default the config system returns the same instance whenever it's queried for the same named config, so the list contains references to the objects we've already loaded.

In [11]:
var trainers = (List<Trainer>) cm.lookupAll(Trainer.class);
System.out.println("Loaded " + trainers.size() + " trainers.");

Loaded 3 trainers.
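
The "same instance for the same name" behaviour noted above can be pictured with a toy registry. This is only a sketch of the idea: `NamedInstanceRegistry` and its methods are hypothetical, and OLCUT's real `ConfigurationManager` is considerably more involved (for instance, this toy instantiates every component to check its type).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Toy stand-in for a configuration manager (hypothetical, for illustration only). */
final class NamedInstanceRegistry {
    private final Map<String, Supplier<?>> factories = new LinkedHashMap<>();
    private final Map<String, Object> instances = new HashMap<>();

    /** Records how to build the component with the given name. */
    void register(String name, Supplier<?> factory) {
        factories.put(name, factory);
    }

    /** Builds the named object on first request, then returns the cached instance. */
    Object lookup(String name) {
        Supplier<?> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown component: " + name);
        }
        return instances.computeIfAbsent(name, n -> factory.get());
    }

    /** Returns every registered object that is an instance of the requested type. */
    <T> List<T> lookupAll(Class<T> type) {
        List<T> found = new ArrayList<>();
        for (String name : factories.keySet()) {
            Object candidate = lookup(name);
            if (type.isInstance(candidate)) {
                found.add(type.cast(candidate));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NamedInstanceRegistry registry = new NamedInstanceRegistry();
        registry.register("buffer", () -> new StringBuilder());
        Object first = registry.lookup("buffer");
        // The list holds a reference to the object we already looked up.
        System.out.println(registry.lookupAll(StringBuilder.class).get(0) == first); // prints true
    }
}
```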


## Training the model and extracting configuration
We're going to focus on the logistic regression trainer now, so let's train a logistic regression model on our MNIST training set.

In [12]:
var lrStartTime = System.currentTimeMillis();
var lrModel = logistic.train(trainData);
var lrEndTime = System.currentTimeMillis();
System.out.println("Training logistic regression took " + Util.formatDuration(lrStartTime,lrEndTime));

Training logistic regression took (00:00:05:097)


We can inspect the trained model for its provenance, as we saw in the Classification tutorial.

The new step is extracting a configuration from that provenance. The `ProvenanceUtil.extractConfiguration()` call returns a `List<ConfigurationData>`, which is the object representation of a configuration file. It has extracted configurations for 5 objects from our single model; we'll look at those after we've written out the file.

In [13]:
var provenance = lrModel.getProvenance();
var provConfig = ProvenanceUtil.extractConfiguration(provenance);
provConfig.size()

5

The `ConfigurationManager` is the way we can generate a configuration file from the object representation.
We create a new `ConfigurationManager`, add the configuration we extracted from the provenance, and then write
it out to a new JSON file.

In [14]:
var outputFile = "mnist-logistic-config.json";
var newCM = new ConfigurationManager();
newCM.addConfiguration(provConfig);
newCM.save(new File(outputFile),true);
String.join("\n",Files.readAllLines(Paths.get(outputFile)))

{
  "config" : {
    "components" : [ {
      "name" : "idxdatasource-1",
      "type" : "org.tribuo.datasource.IDXDataSource",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "outputPath" : "/Users/apocock/Development/Tribuo/tutorials/train-labels-idx1-ubyte.gz",
        "outputFactory" : "labelfactory-4",
        "featuresPath" : "/Users/apocock/Development/Tribuo/tutorials/train-images-idx3-ubyte.gz"
      }
    }, {
      "name" : "linearsgdtrainer-0",
      "type" : "org.tribuo.classification.sgd.linear.LinearSGDTrainer",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "seed" : "1",
        "minibatchSize" : "1",
        "shuffle" : "true",
        "epochs" : "2",
        "optimiser" : "adagrad-2",
        "objective" : "logmulticlass-3",
        "loggingInterval" : "10000"
      }
    }, {
      "name" : "adagrad-2",
      "type" : "org.tribuo.math.optimisers.AdaGrad",
      "export" : "false",
      "import" :

The five elements of the configuration are: the training data "idxdatasource-1", the logistic regression "linearsgdtrainer-0", the training log loss function "logmulticlass-3", the AdaGrad gradient optimizer "adagrad-2", and the label factory "labelfactory-4". The only unexpected part is the `LabelFactory` which is the factory that converts `String`s into `Label` instances.

## Rebuilding a model from its configuration

Now to reconstruct our model, we can load in the Trainer and DataSource from the new `ConfigurationManager`, pass the source into a `Dataset`, and finally call train on the new trainer supplying the new dataset.

In [15]:
var newTrainer = (Trainer<Label>) newCM.lookup("linearsgdtrainer-0");
var newSource = (DataSource<Label>) newCM.lookup("idxdatasource-1");
var newDataset = new MutableDataset<>(newSource);
var newModel = newTrainer.train(newDataset, Collections.singletonMap("reconfigured-model",new BooleanProvenance("reconfigured-model",true)));

First we'll confirm that the old and new models aren't equal (as they have different timestamps, among other provenance checks).

In [16]:
lrModel.equals(newModel)

false

Now we'll evaluate the first model:

In [17]:
var lrEvaluator = evaluator.evaluate(lrModel,testData);
System.out.println(lrEvaluator.toString());
System.out.println(lrEvaluator.getConfusionMatrix().toString());

Class                           n          tp          fn          fp      recall        prec          f1
0                             980         904          76          21       0.922       0.977       0.949
1                           1,135       1,072          63          18       0.944       0.983       0.964
2                           1,032         856         176          56       0.829       0.939       0.881
3                           1,010         844         166          84       0.836       0.909       0.871
4                             982         888          94          72       0.904       0.925       0.915
5                             892         751         141         143       0.842       0.840       0.841
6                             958         938          20         139       0.979       0.871       0.922
7                           1,028         963          65         133       0.937       0.879       0.907
8                             974         892 

It's about what we'd expect for a linear model on MNIST. Not SOTA, but it'll do for now.
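
For readers who want to check the columns by hand, the per-class recall, precision and F1 follow directly from the `tp`, `fn` and `fp` counts. The sketch below recomputes the class 0 row using the standard definitions; `PerClassMetrics` is a hypothetical helper, since Tribuo's evaluation classes already compute these values for you.

```java
import java.util.Locale;

/** Illustrative arithmetic only (hypothetical helper, not a Tribuo class). */
final class PerClassMetrics {

    /** Fraction of the true members of the class that were found. */
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    /** Fraction of the predictions for the class that were correct. */
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** Harmonic mean of precision and recall, written in terms of the counts. */
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        // Counts for class 0 taken from the table above.
        int tp = 904, fn = 76, fp = 21;
        System.out.printf(Locale.ROOT, "recall=%.3f prec=%.3f f1=%.3f%n",
                recall(tp, fn), precision(tp, fp), f1(tp, fp, fn));
        // prints recall=0.922 prec=0.977 f1=0.949
    }
}
```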

Now let's check the new model:

In [18]:
var newEvaluator = evaluator.evaluate(newModel,testData);
System.out.println(newEvaluator.toString());
System.out.println(newEvaluator.getConfusionMatrix().toString());

Class                           n          tp          fn          fp      recall        prec          f1
0                             980         904          76          21       0.922       0.977       0.949
1                           1,135       1,072          63          18       0.944       0.983       0.964
2                           1,032         856         176          56       0.829       0.939       0.881
3                           1,010         844         166          84       0.836       0.909       0.871
4                             982         888          94          72       0.904       0.925       0.915
5                             892         751         141         143       0.842       0.840       0.841
6                             958         938          20         139       0.979       0.871       0.922
7                           1,028         963          65         133       0.937       0.879       0.907
8                             974         892 

We can see that both models perform identically. This is because our provenance system records the RNG seeds used at all points, and Tribuo is scrupulous about how and when it uses PRNGs. If you find a model reconstruction that gives a different answer (unless you're using XGBoost, which has some non-determinism beyond our control) then file an issue on our GitHub as that's a bug.
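
The role the recorded seed plays can be seen in a small standalone sketch: two generators built from the same seed shuffle the training order identically, which is the property that makes a rebuilt model match the original. This uses `java.util.Random` purely for illustration and makes no claim about which generator Tribuo uses internally; `SeededShuffleSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: shows that a fixed seed gives a repeatable shuffle. */
final class SeededShuffleSketch {

    /** Returns the indices 0..n-1 shuffled by a generator built from the given seed. */
    static List<Integer> shuffledIndices(int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }

    public static void main(String[] args) {
        List<Integer> firstRun = shuffledIndices(10, 1L);
        List<Integer> secondRun = shuffledIndices(10, 1L);
        System.out.println(firstRun.equals(secondRun)); // prints true
    }
}
```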

## What else lives in the Provenance?

These evaluations have provenance in the same way the models do, and we can use a pretty printer in OLCUT to make it a little more human readable.

In addition to the configuration information like the gradient optimiser and RNG seed, the provenance includes run specific information like the "reconfigured-model" flag we added, along with a hash of the data, timestamps for the various data files involved, and a timestamp for the model creation and dataset creation.

In [19]:
var evalProvenance = newEvaluator.getProvenance();
System.out.println(ProvenanceUtil.formattedProvenanceString(evalProvenance));

EvaluationProvenance(
	class-name = org.tribuo.provenance.EvaluationProvenance
	model-provenance = LinearSGDModel(
			class-name = org.tribuo.classification.sgd.linear.LinearSGDModel
			dataset = MutableDataset(
					class-name = org.tribuo.MutableDataset
					datasource = IDXDataSource(
							class-name = org.tribuo.datasource.IDXDataSource
							outputPath = /Users/apocock/Development/Tribuo/tutorials/train-labels-idx1-ubyte.gz
							outputFactory = LabelFactory(
									class-name = org.tribuo.classification.LabelFactory
								)
							featuresPath = /Users/apocock/Development/Tribuo/tutorials/train-images-idx3-ubyte.gz
							features-file-modified-time = 2000-07-21T14:20:24-04:00
							output-resource-hash = 3552534A0A558BBED6AED32B30C495CCA23D567EC52CAC8BE1A0730E8010255C
							datasource-creation-time = 2020-08-31T17:22:56.081381-04:00
							output-file-modified-time = 2000-07-21T14:20:27-04:00
							idx-feature-type = UBYTE
							features-resource-hash = 440FCABF73CC5
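
The resource hashes in this output are 64 hexadecimal characters long, which is consistent with SHA-256. Assuming that algorithm (the output itself does not name it), a digest in the same uppercase hex style can be sketched with the JDK alone. `ResourceHashSketch` is a hypothetical helper, not the code Tribuo runs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: SHA-256 is an assumption based on the hash length. */
final class ResourceHashSketch {

    /** Returns the SHA-256 digest of the data as uppercase hexadecimal. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String hash = sha256Hex("example bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println(hash.length()); // prints 64
    }
}
```

Passing the result of `Files.readAllBytes(...)` for a data file gives a value you could compare against the hash recorded in a provenance, if the assumption about the algorithm holds.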

## Feature Transformations

We can take the new trainer, wrap it programmatically in a `TransformTrainer` which rescales the input features into the range `[0,1]`, and still generate provenance and configuration automatically as the model is trained.

In [20]:
var transformations = new TransformationMap(List.of(new LinearScalingTransformation(0,1)));
var transformed = new TransformTrainer(newTrainer,transformations);
var transformStart = System.currentTimeMillis();
var transformedModel = transformed.train(newDataset);
var transformEnd = System.currentTimeMillis();
System.out.println("Training transformed logistic regression took " + Util.formatDuration(transformStart,transformEnd));

Training transformed logistic regression took (00:00:08:740)
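
As a rough picture of what rescaling a feature into a target range means, the sketch below applies plain min-max scaling to one feature's values. It assumes the scaling is fitted per feature from the observed minimum and maximum, and it ignores details such as sparse features. `MinMaxScalingSketch` is a hypothetical helper, not the implementation behind `LinearScalingTransformation`.

```java
import java.util.Arrays;

/** Illustrative min-max scaling of a single feature (hypothetical helper). */
final class MinMaxScalingSketch {

    /** Maps values linearly so the observed minimum lands on targetMin and the maximum on targetMax. */
    static double[] rescale(double[] values, double targetMin, double targetMax) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature has no range; pinning it to targetMin is a choice made for this sketch.
            scaled[i] = range == 0.0
                    ? targetMin
                    : targetMin + (values[i] - min) / range * (targetMax - targetMin);
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Pixel intensities in an unsigned byte image run from 0 to 255.
        double[] pixels = {0.0, 51.0, 255.0};
        System.out.println(Arrays.toString(rescale(pixels, 0.0, 1.0))); // prints [0.0, 0.2, 1.0]
    }
}
```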


Now we'll evaluate the rescaled model. Here we see that rescaling the data into the zero-one range improves the linear model's performance by a couple of percent, as all the data is now on the same scale. As expected it's still not SOTA, but we're not using a huge CNN or some other complex model. For that you can try out our TensorFlow interface, or use the XGBoost trainer we loaded in from the original configuration file.

In [21]:
LabelEvaluation transformedEvaluator = evaluator.evaluate(transformedModel,testData);
System.out.println(transformedEvaluator.toString());
System.out.println(transformedEvaluator.getConfusionMatrix().toString());

Class                           n          tp          fn          fp      recall        prec          f1
0                             980         957          23          40       0.977       0.960       0.968
1                           1,135       1,109          26          36       0.977       0.969       0.973
2                           1,032         940          92          90       0.911       0.913       0.912
3                           1,010         927          83         141       0.918       0.868       0.892
4                             982         914          68          73       0.931       0.926       0.928
5                             892         813          79         183       0.911       0.816       0.861
6                             958         892          66          45       0.931       0.952       0.941
7                           1,028         918         110          54       0.893       0.944       0.918
8                             974         753 

We can emit a configuration which includes both the transformation trainer and the original trainer pulled from the old configuration.

In [22]:
var transformedProvConfig = ProvenanceUtil.extractConfiguration(transformedModel.getProvenance());
var transformedOutputFile = "mnist-transformed-logistic-config.json";
newCM = new ConfigurationManager();
newCM.addConfiguration(transformedProvConfig);
newCM.save(new File(transformedOutputFile),true);
String.join("\n",Files.readAllLines(Paths.get(transformedOutputFile)))

{
  "config" : {
    "components" : [ {
      "name" : "linearscalingtransformation-4",
      "type" : "org.tribuo.transform.transformations.LinearScalingTransformation",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "targetMax" : "1.0",
        "targetMin" : "0.0"
      }
    }, {
      "name" : "labelfactory-7",
      "type" : "org.tribuo.classification.LabelFactory",
      "export" : "false",
      "import" : "false"
    }, {
      "name" : "adagrad-5",
      "type" : "org.tribuo.math.optimisers.AdaGrad",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "epsilon" : "0.01",
        "initialLearningRate" : "0.5",
        "initialValue" : "0.0"
      }
    }, {
      "name" : "linearsgdtrainer-2",
      "type" : "org.tribuo.classification.sgd.linear.LinearSGDTrainer",
      "export" : "false",
      "import" : "false",
      "properties" : {
        "seed" : "1",
        "minibatchSize" : "1",
        "shuffle" : "t

Aside from the names (which have different tag numbers), this configuration is identical to the previous one, but with the addition of `transformtrainer-0` and its dependents.

## Conclusion
We've taken a closer look at Tribuo's configuration and provenance systems, showing how to train a model using a configuration file, how to inspect the model's provenance, how to extract its configuration, and finally how to combine that extracted configuration with other programmatic elements of the Tribuo library (in this case the feature transformation system). We saw that the provenance combines the configuration of the trainer and the datasource with runtime information extracted from the dataset itself (e.g. timestamps and file hashes).

Tribuo's configuration system is integrated into a CLI options/arguments parsing system, which can be used to override elements from the configuration file. The values from the options are then stored in the `ConfigurationManager` and appear in the provenance and downstream configuration objects as expected. Tribuo also provides a redaction system for configuration files (e.g. to ensure a password isn't stored in the provenance) and for provenance objects themselves (e.g. to remove the data provenance from a trained model), which aids model deployment to untrusted or less trusted systems.