# Document Classification
This tutorial will show how to perform document classification in Tribuo, using a variety of different methods to extract features from the text. We'll use the venerable [20-newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) where the task is to predict what newsgroup a particular post is from, though this tutorial would be equally applicable to any document classification task (including tasks like sentiment analysis). We're going to train a simple logistic regression with fixed hyperparameters using a variety of feature extraction methods. The aim is to show how to extract features from text rather than to focus on performance, as using a more powerful model like XGBoost, or performing hyperparameter optimization on the logistic regression, will likely improve the performance of all the feature extraction techniques.

# Setup

You'll need a copy of the 20 newsgroups dataset, so first download and unpack it:

```
wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
mkdir 20news
cd 20news
tar -zxf ../20news-bydate.tar.gz
```

This leaves you with two directories `20news-bydate-train` and `20news-bydate-test`, which contain the standard train and test split for this data.

20 newsgroups comes in a fairly standard format: the dataset is represented by a set of directories where the directory name is the class label, and each directory contains a collection of documents with one document in each file. Each file is a single Usenet post. For the purposes of this tutorial, we'll use the subject and body of the post as the input text for classification.

Here's an example:

```
$ ls 20news-bydate-train/
alt.atheism/               comp.sys.mac.hardware/  rec.motorcycles/     sci.electronics/         talk.politics.guns/
comp.graphics/             comp.windows.x/         rec.sport.baseball/  sci.med/                 talk.politics.mideast/
comp.os.ms-windows.misc/   misc.forsale/           rec.sport.hockey/    sci.space/               talk.politics.misc/
comp.sys.ibm.pc.hardware/  rec.autos/              sci.crypt/           soc.religion.christian/  talk.religion.misc/
$ ls 20news-bydate-train/comp.graphics/
37261  37949  38233  38270  38305  38344  38381  38417  38454  38489  38525  38562  38598  38633  38668  38703  38739
37913  37950  38234  38271  38306  38346  38382  38418  38455  38490  38526  38563  38599  38634  38669  38704  38740
37914  37951  38235  38272  38307  38347  38383  38420  38456  38491  38527  38564  38600  38635  38670  38705  38741
37915  37952  38236  38273  38308  38348  38384  38421  38457  38492  38528  38565  38601  38636  38671  38706  38742
...
```

As this is a pretty common format, Tribuo has a specific `DataSource` which can be used to read in this sort of data, `org.tribuo.data.text.DirectoryFileSource`.
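To make the directory layout concrete, here is a minimal plain Java sketch of reading it by hand. It is not how `DirectoryFileSource` is implemented; the class name `LabelledDirectoryReader`, its methods and the choice of ISO-8859-1 decoding are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration, not part of Tribuo.
final class LabelledDirectoryReader {
    /** The class label is the name of the directory that contains the file. */
    static String labelOf(Path file) {
        return file.getParent().getFileName().toString();
    }

    /** Lists the entries of a directory in sorted order. */
    static List<Path> children(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.sorted().collect(Collectors.toList());
        }
    }

    /** Reads root/<label>/<file> into {label, text} pairs. */
    static List<String[]> read(Path root) throws IOException {
        List<String[]> docs = new ArrayList<>();
        for (Path labelDir : children(root)) {
            if (!Files.isDirectory(labelDir)) {
                continue;
            }
            for (Path file : children(labelDir)) {
                // ISO-8859-1 maps every byte to a character, so decoding cannot fail
                // (the encoding itself is an assumption of this sketch).
                String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                docs.add(new String[] {labelOf(file), text});
            }
        }
        return docs;
    }
}
```

In practice the Tribuo data source does this work for us and also runs the feature extraction, so the sketch is only meant to show where the labels come from.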

We're going to use the classification experiments jar, along with the ONNX jar which provides support for loading in contextual word embedding models like [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html).

In [1]:
%jars ./tribuo-classification-experiments-4.1.0-SNAPSHOT-jar-with-dependencies.jar
%jars ./tribuo-onnx-4.1.0-SNAPSHOT-jar-with-dependencies.jar

We'll also need a selection of imports from the `org.tribuo.data.text` package, along with the usual imports from `org.tribuo` and `org.tribuo.classification` we use when working with classification tasks. We'll load in the BERT support from the `org.tribuo.interop.onnx.extractors` package. Tribuo's BERT support loads in models and tokenizers from [HuggingFace's Transformers](https://huggingface.co/transformers/) package, and can be easily extended to support non-BERT models.

In [2]:
import java.util.Collections;
import java.nio.file.Paths;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;
import com.oracle.labs.mlrg.olcut.util.Pair;
import org.tribuo.*;
import org.tribuo.transform.*;
import org.tribuo.transform.transformations.IDFTransformation;
import org.tribuo.data.text.*;
import org.tribuo.data.text.impl.*;
import org.tribuo.classification.*;
import org.tribuo.classification.evaluation.*;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.interop.onnx.extractors.BERTFeatureExtractor;
import org.tribuo.util.tokens.universal.UniversalTokenizer;

We'll instantiate a few objects that we'll use throughout this tutorial: the label factory, the evaluator and the paths to the train and test data.

In [3]:
var labelFactory = new LabelFactory();
var labelEvaluator = new LabelEvaluator();
var trainPath = Paths.get("./20news/20news-bydate-train");
var testPath = Paths.get("./20news/20news-bydate-test");

# Extracting features from text
Much of the work of machine learning is in presenting an appropriate representation of the data to the model. This is especially true when working with text data, as there is a plethora of approaches for converting text into the numbers that ML algorithms operate on. The `DirectoryFileSource` allows the user to choose the feature extraction, as it requires a `TextFeatureExtractor` which converts the `String` representing the input text into a Tribuo `Example`. We'll cover several different implementations of the `TextFeatureExtractor` interface in this tutorial, and we expect that users will implement it in their own classes to cope with specific feature extraction requirements.

We'll start with the simplest approach, a "bag of words", where each document is represented by the counts of the words in that document. This means the dimension of the feature space is equal to the number of unique words, and most documents only have a positive value for a small number of words (as most words don't appear in any given document). This is particularly well suited to Tribuo's sparse vector representation of examples, and this suitability for NLP tasks is the reason that Tribuo is designed this way.

Of course, first we'll need to tell the extractor what a word is, and for this we use a `Tokenizer`. Tokenizers split up a `String` into a stream of tokens. Tribuo provides several basic tokenizers, and an interface for tokenization. We're going to use Tribuo's `UniversalTokenizer` which is descended from tokenizers developed at Sun Labs in the 90s, and used in a variety of Sun products since that time. First we'll use a strict bag of words where each feature takes the value `1` if that word is present in the document, and `0` otherwise. We'll use Tribuo's `BasicPipeline` which can convert `String`s into features, and pass it to the basic `TextFeatureExtractor` implementation, helpfully called `TextFeatureExtractorImpl`.

In [4]:
var tokenizer = new UniversalTokenizer();
var bowPipeline = new BasicPipeline(tokenizer,1);
var bowExtractor = new TextFeatureExtractorImpl<Label>(bowPipeline);
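To see what a strict bag of words looks like without any library support, here is a small plain Java sketch. It is not Tribuo's `BasicPipeline`: the class name `BinaryBagOfWords`, the lowercasing and the regex split are assumptions standing in for a real tokenizer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a binary bag of words, not Tribuo code.
final class BinaryBagOfWords {
    /** Maps each distinct lowercased token to 1.0; absent words are implicitly 0. */
    static Map<String, Double> extract(String text) {
        Map<String, Double> features = new TreeMap<>();
        // Splitting on runs of non-letter, non-digit characters is a crude stand-in for a tokenizer.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                features.put(token, 1.0);
            }
        }
        return features;
    }
}
```

Only the words that appear are stored, which is the same sparse idea described above: a document with a handful of distinct words has a handful of entries, however large the vocabulary is.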

We're now almost ready to make our train and test data sources, and load in the data. The `DirectoryFileSource` also accepts an array of `DocumentPreprocessor`s which can be used to transform the text before feature extraction takes place. We're going to use a specific preprocessor (`NewsPreprocessor`) which standardises the 20 newsgroups data by stripping out the mail headers and returning only the subject and the body of the email. In general the preprocessors are dataset and task specific, which is why Tribuo doesn't ship with many implementations as in most cases users will need to write one from scratch for their specific task.

In [5]:
var newsProc = new NewsPreprocessor();
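As a rough idea of what such a preprocessor does, the following plain Java sketch keeps the subject line and the body of a post and drops the other headers. It is an assumption about the general shape of the task, not the implementation of `NewsPreprocessor`; the class name `SubjectAndBody` is hypothetical.

```java
// Hypothetical sketch of a news preprocessor, not Tribuo's NewsPreprocessor.
final class SubjectAndBody {
    /** Returns the subject followed by the body, discarding all other headers. */
    static String extract(String rawPost) {
        String[] lines = rawPost.split("\\r?\\n", -1);
        String subject = "";
        int i = 0;
        // Headers run until the first blank line.
        for (; i < lines.length && !lines[i].trim().isEmpty(); i++) {
            if (lines[i].regionMatches(true, 0, "Subject:", 0, 8)) {
                subject = lines[i].substring(8).trim();
            }
        }
        StringBuilder out = new StringBuilder(subject);
        // Skip the blank separator line, then keep every body line.
        for (i++; i < lines.length; i++) {
            out.append('\n').append(lines[i]);
        }
        return out.toString();
    }
}
```

A preprocessor for a different corpus would look quite different, which is the point made above about these classes being dataset specific.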

We'll make a helper function to load the data sources and create the datasets. We're also going to restrict the test dataset so it only contains valid examples, as 20 newsgroups has some test examples that share no words with the train examples (and so have no features we could use to make predictions with).

Let's check our datasets and see if everything has loaded in correctly.

In [6]:
public Pair<Dataset<Label>,Dataset<Label>> mkDatasets(String name, TextFeatureExtractor<Label> extractor) {
    var trainSource = new DirectoryFileSource<>(trainPath,labelFactory,extractor,newsProc);
    var testSource = new DirectoryFileSource<>(testPath,labelFactory,extractor,newsProc);
    var trainDS = new MutableDataset<>(trainSource);
    var testDS = new ImmutableDataset<>(testSource,trainDS.getFeatureIDMap(),trainDS.getOutputIDInfo(),true);
    System.out.println(String.format(name + " training data size = %d, number of features = %d, number of classes = %d",trainDS.size(),trainDS.getFeatureMap().size(),trainDS.getOutputInfo().size()));
    System.out.println(String.format(name + " testing data size = %d, number of features = %d, number of classes = %d",testDS.size(),testDS.getFeatureMap().size(),testDS.getOutputInfo().size()));
    return new Pair<>(trainDS,testDS);
}

var bowPair = mkDatasets("bow",bowExtractor);

bow training data size = 11314, number of features = 146037, number of classes = 20
bow testing data size = 7531, number of features = 146037, number of classes = 20


We've loaded in 11,314 training documents containing 146,037 unique words and 7,531 test documents, each with the expected 20 classes.
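The restriction to valid test examples can be pictured with a small plain Java sketch: a test document is kept only if at least one of its features was seen in training. This is an illustration under assumed names (`KnownFeatureFilter`, `keepValid`), not the logic inside `ImmutableDataset`.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of dropping test documents with no known features.
final class KnownFeatureFilter {
    /** Keeps only the documents that contain at least one feature from the training vocabulary. */
    static List<Map<String, Double>> keepValid(List<Map<String, Double>> docs, Set<String> trainVocabulary) {
        return docs.stream()
                .filter(doc -> doc.keySet().stream().anyMatch(trainVocabulary::contains))
                .collect(Collectors.toList());
    }
}
```

A document that shares no words with the training data gives the model nothing to work with, so removing it changes the test set size slightly rather than the quality of the predictions.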

Now we're ready to train a model. Let's start with a simple logistic regression.

In [7]:
var lrTrainer = new LogisticRegressionTrainer();
var bowModel = lrTrainer.train(bowPair.getA());
var bowEval = labelEvaluator.evaluate(bowModel,bowPair.getB());
System.out.println(bowEval);

Class                                n          tp          fn          fp      recall        prec          f1
soc.religion.christian             398         320          78          96       0.804       0.769       0.786
rec.autos                          396         326          70         100       0.823       0.765       0.793
talk.religion.misc                 251         145         106         118       0.578       0.551       0.564
comp.windows.x                     394         299          95          76       0.759       0.797       0.778
rec.sport.baseball                 397         345          52          77       0.869       0.818       0.842
comp.graphics                      389         264         125         150       0.679       0.638       0.658
talk.politics.mideast              376         299          77          31       0.795       0.906       0.847
comp.sys.ibm.pc.hardware           392         244         148         130       0.622       0.652       0.637
...
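The recall, precision and F1 columns follow from the `tp`, `fn` and `fp` counts using the standard definitions. The sketch below recomputes the `soc.religion.christian` row from the table above; the class name `ClassMetrics` is hypothetical, and we are assuming the evaluator uses these textbook formulas.

```java
import java.util.Locale;

// Hypothetical helper that recomputes per-class metrics from the confusion counts.
final class ClassMetrics {
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** F1 is the harmonic mean of precision and recall, which simplifies to 2tp / (2tp + fn + fp). */
    static double f1(int tp, int fn, int fp) {
        return 2.0 * tp / (2.0 * tp + fn + fp);
    }

    public static void main(String[] args) {
        // soc.religion.christian row from the bag of words model: tp=320, fn=78, fp=96
        System.out.println(String.format(Locale.ROOT, "recall=%.3f prec=%.3f f1=%.3f",
                recall(320, 78), precision(320, 96), f1(320, 78, 96)));
        // prints: recall=0.804 prec=0.769 f1=0.786
    }
}
```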

## Term counting
The binary bag of words approach discards a lot of information about the documents, as we're ignoring how many times the word or n-gram appears in the document (also known in information retrieval circles as the Term Frequency or TF). Let's swap the `BasicPipeline` for a `TokenPipeline` which supports term counting via a constructor flag.

In [8]:
var unigramPipeline = new TokenPipeline(tokenizer, 1, true);
var unigramExtractor = new TextFeatureExtractorImpl<Label>(unigramPipeline);
var unigramPair = mkDatasets("unigram",unigramExtractor);

unigram training data size = 11314, number of features = 146037, number of classes = 20
unigram testing data size = 7531, number of features = 146037, number of classes = 20


We can see the number of documents and number of features are still the same; all that's different is the feature values. Let's build another logistic regression.

In [9]:
var unigramModel = lrTrainer.train(unigramPair.getA());
var unigramEval = labelEvaluator.evaluate(unigramModel,unigramPair.getB());
System.out.println(unigramEval);

Class                                n          tp          fn          fp      recall        prec          f1
soc.religion.christian             398         316          82          62       0.794       0.836       0.814
rec.autos                          396         312          84          73       0.788       0.810       0.799
talk.religion.misc                 251         158          93         170       0.629       0.482       0.546
comp.windows.x                     394         275         119          88       0.698       0.758       0.727
rec.sport.baseball                 397         321          76          50       0.809       0.865       0.836
comp.graphics                      389         240         149         107       0.617       0.692       0.652
talk.politics.mideast              376         275         101          34       0.731       0.890       0.803
comp.sys.ibm.pc.hardware           392         252         140         176       0.643       0.589       0.615
...

We see that the logistic regression trained on unigrams gets about 74% accuracy.
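The difference between the binary features and term counts is easy to see in a plain Java sketch. As before this is not Tribuo's `TokenPipeline`; the class name `TermCounts` and the regex tokenization are assumptions for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of term counting, not Tribuo code.
final class TermCounts {
    /** Maps each lowercased token to the number of times it occurs in the text. */
    static Map<String, Double> extract(String text) {
        Map<String, Double> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                // merge adds 1.0 to the existing count, or inserts 1.0 for a new token.
                counts.merge(token, 1.0, Double::sum);
            }
        }
        return counts;
    }
}
```

The set of features is identical to the binary case, which matches the unchanged feature count reported above; only the values differ.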


## N-grams as features
Let's try a little more complicated feature extractor. The natural step from unigrams is to include word pairs (or bigrams) and count the occurrence of those. This allows us to get simple negations (e.g., "not bad" rather than "not" and "bad") along with places like "New York" rather than "new" and "york". In Tribuo this is as straightforward as telling the token pipeline we'd like bigrams.

In [10]:
var bigramPipeline = new TokenPipeline(tokenizer, 2, true);
var bigramExtractor = new TextFeatureExtractorImpl<Label>(bigramPipeline);
var bigramPair = mkDatasets("bigram",bigramExtractor);

bigram training data size = 11314, number of features = 1253665, number of classes = 20
bigram testing data size = 7531, number of features = 1253665, number of classes = 20


We can see the feature space has massively increased due to the presence of bigram features: we've now got roughly 1.25 million features from the same 11,314 documents.

Now to train another logistic regression.

In [11]:
var bigramModel = lrTrainer.train(bigramPair.getA());
var bigramEval = labelEvaluator.evaluate(bigramModel,bigramPair.getB());
System.out.println(bigramEval);

Class                                n          tp          fn          fp      recall        prec          f1
soc.religion.christian             398         329          69          87       0.827       0.791       0.808
rec.autos                          396         318          78         103       0.803       0.755       0.778
talk.religion.misc                 251         139         112         110       0.554       0.558       0.556
comp.windows.x                     394         294         100         104       0.746       0.739       0.742
rec.sport.baseball                 397         330          67          66       0.831       0.833       0.832
comp.graphics                      389         243         146         161       0.625       0.601       0.613
talk.politics.mideast              376         305          71          99       0.811       0.755       0.782
comp.sys.ibm.pc.hardware           392         276         116         165       0.704       0.626       0.663
...

Our performance only improved a little bit, from 74.2% to 74.5%. This is because despite there being more information in the features, there are also many, many more features making it easier to confuse this simple linear model. As we increase the number of n-gram features we'll start to see diminishing returns as the model complexity increases without a commensurate increase in training data.
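The growth in the feature space is easier to appreciate with a small n-gram generator. This plain Java sketch emits every n-gram up to a maximum length; the class name `NGrams` and the `/` separator are made up for illustration, and whether Tribuo's pipeline keeps the shorter n-grams alongside the bigrams is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical n-gram generator for illustration, not Tribuo code.
final class NGrams {
    /** Returns all n-grams of length 1..maxN, joining the tokens of each n-gram with '/'. */
    static List<String> extract(List<String> tokens, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                grams.add(String.join("/", tokens.subList(start, start + n)));
            }
        }
        return grams;
    }
}
```

For the tokens `not`, `bad`, `at`, `all` this produces the four unigrams plus `not/bad`, `bad/at` and `at/all`, so even a four word document nearly doubles its number of features.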

## TFIDF vectors

One other factor is that the count of some words isn't usually that helpful, as most documents include "a", "the", "and" many times. A popular way to deal with this is to scale the term frequencies (i.e. the n-gram counts) by the Inverse Document Frequency (or IDF), producing TF-IDF vectors. In Tribuo the IDF is a transformation which is applied separately to the dataset after it's constructed, as it uses aggregate information from the whole dataset which isn't available until all the examples have been loaded in. Let's see how that affects performance.

In [16]:
// Create a transformation map that contains a single IDFTransformation to apply to every feature
var trMap = new TransformationMap(Collections.singletonList(new IDFTransformation()));
// Copy out the datasets.
var tfidfTrain = MutableDataset.createDeepCopy(bigramPair.getA());
var tfidfTest = MutableDataset.createDeepCopy(bigramPair.getB());
// Fit the IDF transformation and apply it to the data
var transformers = tfidfTrain.createTransformers(trMap);
tfidfTrain.transform(transformers);
tfidfTest.transform(transformers);
// Train and evaluate a logistic regression
var tfidfModel = lrTrainer.train(tfidfTrain);
var tfidfEval = labelEvaluator.evaluate(tfidfModel,tfidfTest);
System.out.println(tfidfEval);

Class                                n          tp          fn          fp      recall        prec          f1
soc.religion.christian             398         330          68          82       0.829       0.801       0.815
rec.autos                          396         288         108          76       0.727       0.791       0.758
talk.religion.misc                 251         160          91         108       0.637       0.597       0.617
comp.windows.x                     394         313          81          70       0.794       0.817       0.806
rec.sport.baseball                 397         326          71          33       0.821       0.908       0.862
comp.graphics                      389         266         123         117       0.684       0.695       0.689
talk.politics.mideast              376         296          80          24       0.787       0.925       0.851
comp.sys.ibm.pc.hardware           392         285         107         227       0.727       0.557       0.631
...

Using TFIDF features has improved the accuracy to 76%.
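The fit-then-apply pattern used above can be sketched in plain Java. This version uses the textbook weighting idf(t) = ln(N / df(t)), which is an assumption and may differ from the exact formula or smoothing inside `IDFTransformation`; the class name `TfIdf` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical TF-IDF sketch using idf(t) = ln(N / df(t)); not Tribuo's IDFTransformation.
final class TfIdf {
    /** Fits the IDF weights on a collection of term-count vectors (the training data). */
    static Map<String, Double> fitIdf(List<Map<String, Double>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Double> doc : docs) {
            for (String term : doc.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log(docs.size() / (double) e.getValue()));
        }
        return idf;
    }

    /** Scales each term count by its fitted IDF; terms unseen when fitting are dropped. */
    static Map<String, Double> transform(Map<String, Double> doc, Map<String, Double> idf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            Double weight = idf.get(e.getKey());
            if (weight != null) {
                out.put(e.getKey(), e.getValue() * weight);
            }
        }
        return out;
    }
}
```

As in the cell above, the weights are fitted on the training data only and then applied to both the training and test documents. A term that appears in every training document gets a weight of zero under this formula, which is how words like "the" lose their influence.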

## Feature hashing

A popular technique for reducing the feature space when dealing with such large problems is feature hashing. This is where the features are mapped back down to a smaller space using a hash function. It induces collisions between the features, so the model might treat "New York" and "San Francisco" as the same feature, but the collisions are generated essentially at random based on the hash function, and so provide a strong regularising effect which frequently improves performance.

To use feature hashing in Tribuo simply pass a hash dimension to the `TokenPipeline` on construction. We'll map everything down to 100,000 features and see how that affects the model.

In [19]:
var hashPipeline = new TokenPipeline(tokenizer, 2, true, 100000);
var hashExtractor = new TextFeatureExtractorImpl<Label>(hashPipeline);
var hashPair = mkDatasets("hash-100k",hashExtractor);

hash-100k training data size = 11314, number of features = 100000, number of classes = 20
hash-100k testing data size = 7532, number of features = 100000, number of classes = 20


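We have the same number of training examples, but now there are only 100,000 features. The test set has one more example than before (7,532 rather than 7,531), presumably because hashing maps words unseen in training into the shared feature space, so no test document is left without features. Let's build another logistic regression.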

In [20]:
var hashModel = lrTrainer.train(hashPair.getA());
var hashEval = labelEvaluator.evaluate(hashModel,hashPair.getB());
System.out.println(hashEval);

Class                                n          tp          fn          fp      recall        prec          f1
soc.religion.christian             398         344          54         127       0.864       0.730       0.792
rec.autos                          396         295         101          72       0.745       0.804       0.773
talk.religion.misc                 251         140         111         113       0.558       0.553       0.556
comp.windows.x                     395         258         137          88       0.653       0.746       0.696
rec.sport.baseball                 397         314          83          73       0.791       0.811       0.801
comp.graphics                      389         243         146         166       0.625       0.594       0.609
talk.politics.mideast              376         308          68          82       0.819       0.790       0.804
comp.sys.ibm.pc.hardware           392         259         133         191       0.661       0.576       0.615
...

The performance dropped a little here, but the model has less than a tenth of the parameters compared to the bigram model, making it much smaller and faster. In many cases dropping a couple of points of accuracy for something 10x smaller and significantly faster is a worthwhile tradeoff, but as with most machine learning tasks this is problem dependent. Tuning the hashing dimension and the trainer parameters will likely produce a model with similar accuracy at greatly reduced computational cost.
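The hashing trick itself fits in a few lines of plain Java. This sketch uses `String.hashCode` as the hash function purely for convenience; that choice, and the class name `FeatureHasher`, are assumptions and not a description of the hash function Tribuo uses.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical feature hashing sketch; the hash function is an assumption.
final class FeatureHasher {
    /** Maps a feature name to a bucket in [0, dimension). */
    static int bucket(String feature, int dimension) {
        // floorMod keeps the bucket non-negative even when hashCode is negative.
        return Math.floorMod(feature.hashCode(), dimension);
    }

    /** Projects named features into the hashed space, summing the values of colliding features. */
    static Map<Integer, Double> hash(Map<String, Double> features, int dimension) {
        Map<Integer, Double> hashed = new TreeMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            hashed.merge(bucket(e.getKey(), dimension), e.getValue(), Double::sum);
        }
        return hashed;
    }
}
```

With this particular hash function the strings `Aa` and `BB` have the same hash code, so they always share a bucket and their counts are added together. That is exactly the kind of collision described above.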

## Word embeddings

All the approaches described above have no notion of word similarity: they rely upon exactly the same words appearing in the training and test documents, when in practice word similarity is likely to be very useful information for the classifier because no two documents use exactly the same phrasing. For example, the unigrams "excellent" and "fantastic" are just as dissimilar as any other pair of words to an n-gram model, when in fact those words are quite similar in meaning. Adding notions of word similarity to ML models usually means embedding each word into some vector space, so that words with similar meanings are close together in the vector space and words with dissimilar or opposite meanings are far apart. There are many popular word embedding algorithms, like [Word2Vec](https://arxiv.org/abs/1301.3781), [GloVe](https://nlp.stanford.edu/projects/glove/) or [FastText](https://fasttext.cc/) which build embeddings on a corpus of text that can then be used in downstream tasks. Tribuo doesn't have a class which can directly load those word vectors, as they all come in different file formats, but it's pretty straightforward to build a `TextFeatureExtractor` that will tokenize the input text, look up each word or n-gram in the vector space and then average them across the input (I know because we've built one for our internal word vector file format and it took an afternoon). If there is interest from the community in supporting a specific word vector file format, we're happy to accept PRs that add the support.
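The look-up-and-average step can be sketched in plain Java as follows. This is not the internal extractor mentioned above and does not implement Tribuo's `TextFeatureExtractor` interface; the class name `AverageEmbedding`, the in-memory `Map` of vectors and the policy of skipping unknown words are all assumptions.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of averaging pre-trained word vectors over a document.
final class AverageEmbedding {
    /** Averages the vectors of the tokens that have an embedding; unknown tokens are skipped. */
    static double[] embed(List<String> tokens, Map<String, double[]> vectors, int dim) {
        double[] sum = new double[dim];
        int found = 0;
        for (String token : tokens) {
            double[] v = vectors.get(token);
            if (v == null) {
                continue; // out-of-vocabulary token
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= found;
            }
        }
        return sum; // all zeros when no token was found
    }
}
```

Every document ends up as a dense vector of the same length, whatever its word count, which is what lets a standard classifier consume it.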

While these more traditional forms of word vector are very powerful, as they are precomputed they treat each word the same no matter the context it appears in. For example "bank" could mean a river bank, or a financial institution, but a word2vec vector has both meanings because it can't understand the whole sentence. This led to the rise of *contextual* word embeddings, which produce a vector for each word based on the whole input sequence. The most popular of these embeddings are based on the [Transformer](https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) architecture, usually a variant of Google's [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) model.

## Using BERT embeddings

BERT is a multilayer transformer network, which reads in a sentence and produces both an embedding of the sentence and embeddings for each wordpiece. A wordpiece is the token that BERT operates on, which is either a whole word, or a chunk of a word, emitted by the wordpiece tokenizer. This word chunking is trained on a large corpus and allows common prefixes & suffixes (e.g. "un", "ing") to be split off the words and to share state. We can use BERT to produce a single vector which represents the sentence or document and then use that vector as features in a downstream Tribuo classifier.
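To show how a word is broken into wordpieces, here is a plain Java sketch of the greedy longest-match-first splitting that BERT style tokenizers use, with `##` marking pieces that continue a word. The toy vocabulary in the usage note is made up, the class name `WordPieceSketch` is hypothetical, and details of Tribuo's own implementation may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy longest-match wordpiece splitting for a single word.
final class WordPieceSketch {
    /** Splits a word into the longest matching vocabulary pieces, or [UNK] if that is impossible. */
    static List<String> split(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, shrinking from the right.
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position, so the whole word is unknown.
                return Collections.singletonList("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a toy vocabulary containing `un`, `##happy`, `play` and `##ing`, the word `unhappy` becomes `un`, `##happy` and `playing` becomes `play`, `##ing`, so the shared prefix and suffix get their own embeddings.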

Tribuo works with BERT models that are stored in [ONNX format](https://onnx.ai), and can load in tokenizers produced by [HuggingFace Transformers](https://huggingface.co/transformers/) which helpfully provides a Python script to convert BERT models from HuggingFace format into ONNX format for deployment. We provide a `TextFeatureExtractor` implementation called `BERTFeatureExtractor` which can produce sentence embeddings by passing the text through a BERT model. Tribuo uses Microsoft's [ONNX Runtime](https://www.onnxruntime.ai/) to load the model, and has its own implementation of the Wordpiece tokenization algorithm, along with the necessary glue to produce tokens in the format that BERT expects. One downside of BERT models is that they have a maximum document length that they can process, usually 512 wordpieces. Tribuo's feature extractor automatically truncates documents longer than this to 512 wordpieces, though it emits a warning when it does so.

To follow along with this part of the tutorial you'll need to produce a BERT model in ONNX format. To do that you'll need access to a Python environment with HuggingFace Transformers and PyTorch installed to export the model. Running the following snippet will produce a `bert-base-uncased.onnx` file that we can use for the rest of the tutorial.

```
python convert_graph_to_onnx.py --framework pt --model bert-base-uncased bert-base-uncased.onnx
```

You'll also need to download the `tokenizer.json` that goes with the BERT variant you are using; for `bert-base-uncased` that file is [here](https://huggingface.co/bert-base-uncased/blob/main/tokenizer.json). Assuming both of those files are now in the same directory as this tutorial, we can create the `BERTFeatureExtractor`. We're going to start with just the `[CLS]` token which provides an embedding for the whole sentence.

In [None]:
var bertPath = Paths.get("./bert-base-uncased.onnx");
var tokenizerPath = Paths.get("./tokenizer.json");
var bertCLS = new BERTFeatureExtractor(labelFactory,bertPath,tokenizerPath,BERTFeatureExtractor.OutputPooling.CLS);
var bertCLSPair = mkDatasets("bert-cls",bertCLS);

Now we build a logistic regression.

In [None]:
var clsModel = lrTrainer.train(bertCLSPair.getA());
var clsEval = labelEvaluator.evaluate(clsModel,bertCLSPair.getB());
System.out.println(clsEval);

The CLS token usually needs to be fine-tuned on the specific task, so it's not too surprising that its performance is lacking, but the option is available in Tribuo to allow working with fine-tuned BERT models. We can also use the average of the token embeddings as the document vector and see if that improves performance.

In [None]:
var bertAve = new BERTFeatureExtractor(labelFactory,bertPath,tokenizerPath,BERTFeatureExtractor.OutputPooling.MEAN);
var bertAvePair = mkDatasets("bert-ave",bertAve);
var aveModel = lrTrainer.train(bertAvePair.getA());
var aveEval = labelEvaluator.evaluate(aveModel,bertAvePair.getB());
System.out.println(aveEval);

We can see that using BERT ...

Using different BERT versions can change the accuracy, and there are smaller versions like DistilBERT and TinyBERT which are useful for deploying models in constrained environments.
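The two pooling strategies used in this section differ only in how the per-token output of BERT is reduced to one vector. The plain Java sketch below assumes the model output is a `[tokens][dimensions]` array whose first row is the `[CLS]` embedding; the class name `Pooling` is hypothetical, and which rows Tribuo includes in its average is not something this sketch tries to reproduce.

```java
// Hypothetical sketch of CLS and mean pooling over per-token embeddings.
final class Pooling {
    /** Uses the embedding of the first token, assumed to be [CLS], as the document vector. */
    static double[] cls(double[][] tokenEmbeddings) {
        return tokenEmbeddings[0].clone();
    }

    /** Averages all the rows it is given, dimension by dimension. */
    static double[] mean(double[][] tokenEmbeddings) {
        double[] avg = new double[tokenEmbeddings[0].length];
        for (double[] token : tokenEmbeddings) {
            for (int i = 0; i < avg.length; i++) {
                avg[i] += token[i];
            }
        }
        for (int i = 0; i < avg.length; i++) {
            avg[i] /= tokenEmbeddings.length;
        }
        return avg;
    }
}
```

Either way the classifier sees a fixed-length dense vector per document, so the rest of the training and evaluation code is unchanged.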

# Deploying the feature extractors

Similarly to when working with columnar data, the feature extractor used is recorded in the model provenance. We can see that for the BERT model here.

In [None]:
var sourceProvenance = aveModel.getProvenance().getDatasetProvenance().getSourceProvenance();
System.out.println(ProvenanceUtil.formattedProvenanceString(sourceProvenance));

This means that the model has recorded how the features were extracted, but the extraction process itself isn't part of the serialized model (which we wouldn't really want anyway as BERT models are hundreds of megabytes). So to use one of these models at inference time the feature extraction pipeline needs to be rebuilt from the configuration, in the same way we rebuilt the `RowProcessor` in the columnar tutorial.

# Conclusion

We looked at a document classification task in Tribuo. As most of the work in NLP tends to be on featurising the data, we discussed several different ways of converting text into features for use in machine learning. We looked at Bag of Words models, using n-grams, term frequencies, TFIDF vectors and finally feature hashing. We also discussed word vector approaches, and showed how to use the popular contextual word embedding model, BERT, to extract features to use in document classification. All the models trained were simple logistic regressions, without any parameter tuning. Using a more powerful classifier like XGBoost, or performing hyperparameter tuning on the logistic regression will likely improve performance quite a bit over the baselines presented here.