# Classification Tutorial

This tutorial shows how to use Tribuo's classification models to predict Iris species using Fisher's well-known Irises dataset (it's 2020 and we're still using a dataset from 1936 in demos, but not to worry, we'll use MNIST from the 90s next time). We'll focus on a simple logistic regression, and investigate the provenance and metadata that Tribuo stores inside each model.

## Setup
You'll need to get a copy of the irises dataset.

`wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data`

It's Java, so first we load in the necessary Tribuo jars. Here we're using the classification experiments jar, along with the JSON interop jar to read and write the provenance information.

In [1]:
%jars ./tribuo-classification-experiments-4.0.0-jar-with-dependencies.jar
%jars ./tribuo-json-4.0.0-jar-with-dependencies.jar

In [2]:
import java.nio.file.Paths;

We import everything from the base org.tribuo package, along with the simple CSV loader, and the classification packages. We're going to build a logistic regression, so we'll need that too.

In [3]:
import org.tribuo.*;
import org.tribuo.evaluation.TrainTestSplitter;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.classification.*;
import org.tribuo.classification.evaluation.*;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;

These imports are for the provenance system, which we'll get to in a minute.

In [4]:
import com.fasterxml.jackson.databind.*;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;
import com.oracle.labs.mlrg.olcut.config.json.*;

## Loading the data
In Tribuo, all the prediction types have an associated `OutputFactory` implementation, which can create the appropriate `Output` subclasses from an input. Here we're going to use `LabelFactory` as we're performing multi-class classification. We then pass the `labelFactory` into the simple `CSVLoader` which reads all the columns into a `DataSource`.

In [5]:
var labelFactory = new LabelFactory();
var csvLoader = new CSVLoader<>(labelFactory);

Our copy of irises doesn't have any column headers, so we create the headers and supply them to the load method along with the path, and which variable is the output (in this case "species"). Irises doesn't have a pre-defined train/test split, so we're going to create one, with 70% of the data used for training.

In [6]:
var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
var irisesSource = csvLoader.loadDataSource(Paths.get("bezdekIris.data"),"species",irisHeaders);
var irisSplitter = new TrainTestSplitter<>(irisesSource,0.7,1L);
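
To make the seeded 70/30 split concrete, here's a plain-Java sketch of a shuffle-and-cut split over example indices. It isn't Tribuo's `TrainTestSplitter` implementation: `SplitSketch` is a made-up name, and both the shuffle and the rounding of the training size are assumptions, so the particular examples it picks won't match Tribuo's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative seeded train/test split (hypothetical helper, not Tribuo code). */
final class SplitSketch {
    /** Returns two index lists: element 0 is the training split, element 1 the test split. */
    static List<List<Integer>> split(int n, double trainProportion, long seed) {
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // A fixed seed makes the shuffle, and therefore the split, repeatable.
        Collections.shuffle(indices, new Random(seed));
        // Rounding is an assumption made here to avoid floating point surprises.
        int trainSize = (int) Math.round(n * trainProportion);
        List<Integer> train = new ArrayList<>(indices.subList(0, trainSize));
        List<Integer> test = new ArrayList<>(indices.subList(trainSize, n));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(150, 0.7, 1L);
        // prints "105 train, 45 test"
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " test");
    }
}
```

Because the split is a function of the seed and the proportion, recording those two values is enough to recreate it later, which is the idea behind the splitter provenance we'll look at below.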

We feed the training datasource and the test datasource into their respective datasets. These datasets compute all the necessary metadata, like the feature domain and the output domain. For training datasets it's best to use a `MutableDataset` as it can have transformations applied to it, and the domains grow as more examples are added. Now that we have datasets, we're ready to train some models.

In [7]:
var trainingDataset = new MutableDataset<>(irisSplitter.getTrain());
var testingDataset = new MutableDataset<>(irisSplitter.getTest());
System.out.println(String.format("Training data size = %d, number of features = %d, number of classes = %d",trainingDataset.size(),trainingDataset.getFeatureMap().size(),trainingDataset.getOutputInfo().size()));
System.out.println(String.format("Testing data size = %d, number of features = %d, number of classes = %d",testingDataset.size(),testingDataset.getFeatureMap().size(),testingDataset.getOutputInfo().size()));

Training data size = 105, number of features = 4, number of classes = 3
Testing data size = 45, number of features = 4, number of classes = 3


## Training the model
Now let's instantiate the trainer, and see what its default hyperparameters are. For full control over these parameters you can directly use `LinearSGDTrainer`, which is fully configurable.

In [8]:
Trainer<Label> trainer = new LogisticRegressionTrainer();
System.out.println(trainer.toString());

LinearSGDTrainer(objective=LogMulticlass,optimiser=AdaGrad(initialLearningRate=1.0,epsilon=0.1,initialValue=0.0),epochs=5,minibatchSize=1,seed=12345)


So that's a linear model, using a logistic loss, trained with `AdaGrad` for 5 epochs.
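
For some intuition about what `initialLearningRate`, `epsilon` and `initialValue` do, here's a minimal sketch of the textbook AdaGrad update on a plain `double[]`. `AdaGradSketch` and its methods are made up for illustration. Tribuo's own optimiser works on its tensor types and may differ in detail.

```java
import java.util.Arrays;

/** Textbook AdaGrad on a plain array (hypothetical helper, not Tribuo's optimiser class). */
final class AdaGradSketch {
    private final double initialLearningRate;
    private final double epsilon;
    // Running sum of squared gradients, one entry per weight.
    private final double[] gradsSquared;

    AdaGradSketch(int size, double initialLearningRate, double epsilon, double initialValue) {
        this.initialLearningRate = initialLearningRate;
        this.epsilon = epsilon;
        this.gradsSquared = new double[size];
        Arrays.fill(this.gradsSquared, initialValue);
    }

    /** Applies one gradient descent step to the weights in place. */
    void step(double[] weights, double[] gradient) {
        for (int i = 0; i < weights.length; i++) {
            gradsSquared[i] += gradient[i] * gradient[i];
            // Weights with a large gradient history get a smaller effective learning rate.
            double rate = initialLearningRate / (epsilon + Math.sqrt(gradsSquared[i]));
            weights[i] -= rate * gradient[i];
        }
    }

    public static void main(String[] args) {
        double[] weights = {0.0, 0.0};
        AdaGradSketch optimiser = new AdaGradSketch(2, 1.0, 0.1, 0.0);
        optimiser.step(weights, new double[]{2.0, 0.0});
        System.out.println(Arrays.toString(weights));
    }
}
```

The per-weight scaling is the point: a feature that keeps producing large gradients takes progressively smaller steps, while `epsilon` keeps the division well defined before any gradient has been seen.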

Now let's train the model. As with other packages, training is pretty simple when you have the training algorithm and training data.

In [9]:
Model<Label> irisModel = trainer.train(trainingDataset);

## Evaluating the model
Once we've trained a model, it's time to figure out how good it is. For this we ask the `labelFactory` what the appropriate `Evaluator` is (or instantiate it directly), then pass the evaluator the model and the test dataset. You can also supply a datasource instead of the dataset. The `LabelEvaluator` class implements all the common classification metrics, each of which can be individually inspected. `LabelEvaluation.toString()` produces a nicely formatted summary of the metrics.

In [10]:
var evaluator = new LabelEvaluator();
var evaluation = evaluator.evaluate(irisModel,testingDataset);
System.out.println(evaluation.toString());

Class                           n          tp          fn          fp      recall        prec          f1
Iris-versicolor                16          16           0           1       1.000       0.941       0.970
Iris-virginica                 15          14           1           0       0.933       1.000       0.966
Iris-setosa                    14          14           0           0       1.000       1.000       1.000
Total                          45          44           1           1
Accuracy                                                                    0.978
Micro Average                                                               0.978       0.978       0.978
Macro Average                                                               0.978       0.980       0.978
Balanced Error Rate                                                         0.022


[Precision, recall, and F1](https://en.wikipedia.org/wiki/Precision_and_recall) are standard metrics used when evaluating multiclass classifiers.
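
As a sanity check on the table above, the per-class numbers follow directly from the `tp`, `fn` and `fp` counts. This is a plain-Java sketch of the standard definitions. `PrfSketch` is a made-up helper rather than part of Tribuo.

```java
import java.util.Locale;

/** Standard precision, recall and F1 from raw counts (hypothetical helper). */
final class PrfSketch {
    /** Fraction of predicted positives that were correct; NaN if nothing was predicted. */
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** Fraction of true positives that were found; NaN if the class has no examples. */
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    /** Harmonic mean of precision and recall, written in terms of the counts. */
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        // Iris-versicolor row from the evaluation output: tp=16, fn=0, fp=1.
        // prints "1.000 0.941 0.970"
        System.out.println(String.format(Locale.ROOT, "%.3f %.3f %.3f",
                recall(16, 0), precision(16, 1), f1(16, 1, 0)));
    }
}
```

The macro average in the table is then just the unweighted mean of these per-class values across the three species.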

We can also print the confusion matrix.

In [11]:
System.out.println(evaluation.getConfusionMatrix().toString());

                   Iris-versicolor   Iris-virginica      Iris-setosa
Iris-versicolor                 16                0                0
Iris-virginica                   1               14                0
Iris-setosa                      0                0               14
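
The accuracy and balanced error rate from the earlier summary can be recovered from this matrix. The sketch below assumes rows are true labels and columns are predictions, which is consistent with the `fn` and `fp` counts shown earlier, and it takes the balanced error rate to be one minus the mean per-class recall. `ConfusionSketch` is a made-up helper working on a plain `int[][]`, not Tribuo's `ConfusionMatrix` type.

```java
/** Summary metrics from a square confusion matrix (hypothetical helper). */
final class ConfusionSketch {
    /** Fraction of all examples that sit on the diagonal. */
    static double accuracy(int[][] cm) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) {
                    correct += cm[i][j];
                }
            }
        }
        return correct / (double) total;
    }

    /** One minus the unweighted mean of per-class recall; NaN if any row is empty. */
    static double balancedErrorRate(int[][] cm) {
        double recallSum = 0.0;
        for (int i = 0; i < cm.length; i++) {
            int rowTotal = 0;
            for (int count : cm[i]) {
                rowTotal += count;
            }
            recallSum += cm[i][i] / (double) rowTotal;
        }
        return 1.0 - recallSum / cm.length;
    }

    public static void main(String[] args) {
        // Rows and columns in the order versicolor, virginica, setosa.
        int[][] cm = {{16, 0, 0}, {1, 14, 0}, {0, 0, 14}};
        System.out.println(accuracy(cm));
        System.out.println(balancedErrorRate(cm));
    }
}
```

With the matrix above this gives 44/45 for accuracy and roughly 0.022 for the balanced error rate, matching the evaluation summary.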



## Model Metadata
Tribuo tracks the feature and output domains of all constructed models. This means it's possible to run techniques like LIME without access to the original training data, and also to add checks that a particular input is within the bounds seen by the trained model.

Let's look at the feature domain from our Irises model.

In [12]:
var featureMap = irisModel.getFeatureIDMap();
for (var v : featureMap) {
    System.out.println(v.toString());
    System.out.println();
}

CategoricalFeature(name=petalLength,id=0,count=105,map={1.2=1, 6.9=1, 3.6=1, 3.0=1, 1.7=4, 4.9=4, 4.4=3, 3.5=2, 5.9=2, 5.4=1, 4.0=4, 1.4=12, 4.5=4, 5.0=2, 5.5=3, 6.7=2, 3.7=1, 1.9=1, 6.0=2, 5.2=1, 5.7=2, 4.2=2, 4.7=2, 4.8=4, 1.6=4, 5.8=2, 3.8=1, 6.3=1, 3.3=1, 1.0=1, 5.6=4, 5.1=5, 4.6=3, 4.1=2, 1.5=9, 1.3=4, 3.9=3, 6.6=1, 6.1=2})

CategoricalFeature(name=petalWidth,id=1,count=105,map={2.0=3, 0.5=1, 1.2=3, 0.3=6, 1.6=2, 0.1=3, 0.4=5, 2.5=3, 2.3=4, 1.7=2, 1.1=3, 2.1=4, 0.6=1, 1.4=6, 1.0=5, 2.4=1, 1.8=12, 0.2=20, 1.9=4, 1.5=7, 1.3=8, 2.2=2})

CategoricalFeature(name=sepalLength,id=2,count=105,map={6.9=3, 6.4=3, 7.4=1, 4.9=4, 4.4=1, 5.9=3, 5.4=5, 7.2=3, 7.7=3, 5.0=8, 6.2=2, 5.5=5, 6.7=7, 6.0=3, 5.2=2, 6.5=3, 5.7=4, 4.7=2, 4.8=3, 5.8=4, 5.3=1, 6.8=3, 6.3=5, 7.3=1, 5.6=6, 5.1=7, 4.6=4, 7.6=1, 7.1=1, 6.6=2, 6.1=5})

CategoricalFeature(name=sepalWidth,id=3,count=105,map={2.0=1, 2.8=10, 3.6=4, 2.3=3, 2.5=5, 3.1=8, 3.8=4, 3.0=19, 2.6=4, 4.4=1, 3.3=4, 3.5=4, 2.4=2, 3.2=10, 2.9=5, 3.7=3, 3.4=6, 2.2

We can see the 4 features, along with a histogram of their values. This information can be used to sample from each feature, to build candidate examples for local explainers like LIME, or to check the range. The feature information is frozen at model training time, so it can also be used to check the number of times a feature occurred in the training set, when the feature set is sparse (as is commonly the case in NLP problems).
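
Here's a rough sketch of those two uses, range checking and sampling, over a value histogram like the ones printed above. It uses a plain `Map` rather than Tribuo's feature classes, `FeatureDomainSketch` is a made-up name, and the three-entry histogram is a small subset of the `petalWidth` counts chosen for brevity.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Range check and weighted sampling over a value histogram (hypothetical helper). */
final class FeatureDomainSketch {
    private final TreeMap<Double, Integer> counts;
    private final int total;

    FeatureDomainSketch(Map<Double, Integer> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("histogram must not be empty");
        }
        this.counts = new TreeMap<>(counts);
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        this.total = sum;
    }

    /** True if the value lies between the smallest and largest values seen in training. */
    boolean inRange(double value) {
        return value >= counts.firstKey() && value <= counts.lastKey();
    }

    /** Draws a value with probability proportional to its training count. */
    double sample(Random rng) {
        int target = rng.nextInt(total);
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            target -= e.getValue();
            if (target < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable when all counts are positive");
    }

    public static void main(String[] args) {
        // A subset of the petalWidth histogram shown above.
        FeatureDomainSketch petalWidth = new FeatureDomainSketch(Map.of(0.2, 20, 1.3, 8, 1.8, 12));
        System.out.println(petalWidth.inRange(1.0)); // prints "true"
        System.out.println(petalWidth.inRange(2.6)); // prints "false"
        System.out.println(petalWidth.sample(new Random(1L)));
    }
}
```

A deployed service could run a check like `inRange` on each incoming feature and log or reject inputs the model never saw anything like during training.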

## Model Provenance

Modern applications deploy many different kinds of ML models, helping with many different aspects of the application. However, most ML packages don't provide good support for tracking and rebuilding models. In Tribuo each model tracks its provenance. It knows how it was created, when it was created, and what data was involved. Let's look at the data provenance for our irises model. By default Tribuo prints the provenance in a moderately human-readable format in each provenance object's `toString()`, but all the information is accessible programmatically.

In [13]:
var provenance = irisModel.getProvenance();
System.out.println(ProvenanceUtil.formattedProvenanceString(provenance.getDatasetProvenance().getSourceProvenance()));

TrainTestSplitter(
	class-name = org.tribuo.evaluation.TrainTestSplitter
	source = CSVLoader(
			class-name = org.tribuo.data.csv.CSVLoader
			outputFactory = LabelFactory(
					class-name = org.tribuo.classification.LabelFactory
				)
			response-name = species
			separator = ,
			quote = "
			path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
			file-modified-time = 2020-07-06T10:52:01.938-04:00
			resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
		)
	train-proportion = 0.7
	seed = 1
	size = 150
	is-train = true
)


We can see the model was trained on a datasource which was split in two, using a specific random seed and split percentage. The original datasource was a CSV file, and the file modified time and SHA-256 hash are recorded too.
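
A recorded hash is handy for checking later that a data file hasn't changed. Here's a minimal standard-library sketch that computes an uppercase hex SHA-256 digest in the same style as the `resource-hash` field. `HashSketch` is a made-up helper, and we're assuming the recorded value is a digest of the raw file bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** SHA-256 digests as uppercase hex strings (hypothetical helper). */
final class HashSketch {
    /** Hashes a byte array. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : digest.digest(data)) {
                sb.append(String.format("%02X", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Hashes the full contents of a file; fine for small files like irises. */
    static String sha256Hex(Path path) throws IOException {
        return sha256Hex(Files.readAllBytes(path));
    }

    public static void main(String[] args) {
        // Well-known SHA-256 test vector for the ASCII string "abc".
        // prints "BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD"
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Under that assumption, calling `sha256Hex(Paths.get("bezdekIris.data"))` and comparing the result with the stored `resource-hash` would tell you whether the file on disk is still the one the model was trained on.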

We can similarly inspect the trainer provenance to find out about the training algorithm.

In [14]:
System.out.println(ProvenanceUtil.formattedProvenanceString(provenance.getTrainerProvenance()));

LogisticRegressionTrainer(
	class-name = org.tribuo.classification.sgd.linear.LogisticRegressionTrainer
	seed = 12345
	minibatchSize = 1
	shuffle = true
	epochs = 5
	optimiser = AdaGrad(
			class-name = org.tribuo.math.optimisers.AdaGrad
			epsilon = 0.1
			initialLearningRate = 1.0
			initialValue = 0.0
			host-short-name = StochasticGradientOptimiser
		)
	objective = LogMulticlass(
			class-name = org.tribuo.classification.sgd.objectives.LogMulticlass
			host-short-name = LabelObjective
		)
	loggingInterval = 1000
	train-invocation-count = 0
	is-sequence = false
	host-short-name = Trainer
)


Here we see as expected that our model was trained using a `LogisticRegressionTrainer` which used `AdaGrad` as the gradient descent algorithm.

Provenance can be extracted from models and stored as JSON files, if you wish to keep a separate record (or redact the provenance from a deployed model).

In [15]:
ObjectMapper objMapper = new ObjectMapper();
objMapper.registerModule(new JsonProvenanceModule());
objMapper = objMapper.enable(SerializationFeature.INDENT_OUTPUT);

The JSON provenance is verbose, but provides an alternative human-readable serialization format.

In [16]:
String jsonProvenance = objMapper.writeValueAsString(ProvenanceUtil.marshalProvenance(provenance));
System.out.println(jsonProvenance);

[ {
  "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.ObjectMarshalledProvenance",
  "object-name" : "linearsgdmodel-0",
  "object-class-name" : "org.tribuo.classification.sgd.linear.LinearSGDModel",
  "provenance-class" : "org.tribuo.provenance.ModelProvenance",
  "map" : {
    "instance-values" : {
      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.MapMarshalledProvenance",
      "map" : { }
    },
    "tribuo-version" : {
      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.SimpleMarshalledProvenance",
      "key" : "tribuo-version",
      "value" : "4.0.0",
      "provenance-class" : "com.oracle.labs.mlrg.olcut.provenance.primitives.StringProvenance",
      "additional" : "",
      "is-reference" : false
    },
    "trainer" : {
      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.SimpleMarshalledProvenance",
      "key" : "trainer",
      "value" : "logisticregressiontrainer-2",
      "provenance-class" : "org.tribuo

      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.SimpleMarshalledProvenance",
      "key" : "response-name",
      "value" : "species",
      "provenance-class" : "com.oracle.labs.mlrg.olcut.provenance.primitives.StringProvenance",
      "additional" : "",
      "is-reference" : false
    },
    "outputFactory" : {
      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.SimpleMarshalledProvenance",
      "key" : "outputFactory",
      "value" : "labelfactory-7",
      "provenance-class" : "org.tribuo.classification.LabelFactory$LabelFactoryProvenance",
      "additional" : "",
      "is-reference" : true
    },
    "separator" : {
      "marshalled-class" : "com.oracle.labs.mlrg.olcut.provenance.io.SimpleMarshalledProvenance",
      "key" : "separator",
      "value" : ",",
      "provenance-class" : "com.oracle.labs.mlrg.olcut.provenance.primitives.CharProvenance",
      "additional" : "",
      "is-reference" : false
    },
    "class-name" : {
      

Alternatively, the model provenance is present in the output of `Model.toString()`, though this format is not machine readable.

In [17]:
System.out.println(irisModel.toString());

linear-sgd-model - Model(class-name=org.tribuo.classification.sgd.linear.LinearSGDModel,dataset=Dataset(class-name=org.tribuo.MutableDataset,datasource=SplitDataSourceProvenance(className=org.tribuo.evaluation.TrainTestSplitter,innerSourceProvenance=CSV(class-name=org.tribuo.data.csv.CSVLoader,outputFactory=OutputFactory(class-name=org.tribuo.classification.LabelFactory),response-name=species,separator=,,quote=",path=file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data,file-modified-time=2020-07-06T10:52:01.938-04:00,resource-hash=SHA-256[36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0]),trainProportion=0.7,seed=1,size=150,isTrain=true),transformations=[],is-sequence=false,is-dense=false,num-examples=105,num-features=4,num-outputs=3,tribuo-version=4.0.0),trainer=Trainer(class-name=org.tribuo.classification.sgd.linear.LogisticRegressionTrainer,seed=12345,minibatchSize=1,shuffle=true,epochs=5,optimiser=StochasticGradientOptimiser(class-name=org.tribuo.math.op

Evaluations also have a provenance that records the model provenance along with the test data provenance. We're using an alternate form of the JSON provenance that's easier to read, though a little less precise. This form is suitable for reference but can't be used to reconstruct the original provenance object, as it converts everything into Strings.

In [18]:
String jsonEvaluationProvenance = objMapper.writeValueAsString(ProvenanceUtil.convertToMap(evaluation.getProvenance()));
System.out.println(jsonEvaluationProvenance);

{
  "tribuo-version" : "4.0.0",
  "dataset-provenance" : {
    "num-features" : "4",
    "num-examples" : "45",
    "num-outputs" : "3",
    "tribuo-version" : "4.0.0",
    "datasource" : {
      "train-proportion" : "0.7",
      "seed" : "1",
      "size" : "150",
      "source" : {
        "resource-hash" : "36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0",
        "path" : "file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data",
        "file-modified-time" : "2020-07-06T10:52:01.938-04:00",
        "quote" : "\"",
        "response-name" : "species",
        "outputFactory" : {
          "class-name" : "org.tribuo.classification.LabelFactory"
        },
        "separator" : ",",
        "class-name" : "org.tribuo.data.csv.CSVLoader"
      },
      "class-name" : "org.tribuo.evaluation.TrainTestSplitter",
      "is-train" : "false"
    },
    "transformations" : [ ],
    "is-sequence" : "false",
    "is-dense" : "false",
    "class-name" : "org.tribuo.Mu

We can see that this provenance includes all the fields from the model's provenance, along with the test data, its split, and the CSV it came from.

This provenance information is useful on its own for tracking models, but when combined with the config system described in the configuration tutorial it becomes a powerful way of rebuilding models and experiments, allowing near-perfect replicability of any ML model.

## Conclusion
We looked at Tribuo's CSV loading mechanism, how to train a simple classifier, how to evaluate a classifier on test data, and also what metadata and provenance information is stored inside Tribuo's `Model` and `Evaluation` objects.