
Document classification at inference time #283

Closed
christianwaldmann opened this issue Sep 26, 2022 · 4 comments
Labels
question General question

Comments

@christianwaldmann

Ask the question

In the columnar tutorial it is explained how to rebuild the RowProcessor in order to generate an Example at inference time.
How would this work for the document classification tutorial which doesn't use CSV but text files as input (for a regular model, not the BERT model)?

Assuming I have extracted the bigrams and applied a TFIDF transformation and a minimum cardinality for the dataset.
My current approach looks like this (sketched in code below the steps):

  1. Build ConfigurationManager from model provenance
  2. Get TextFeatureExtractor, OutputFactory and DocumentPreprocessor from ConfigurationManager with lookup()
  3. Preprocess the new text file with the DocumentPreprocessor
  4. Create Example with TextFeatureExtractor::extract()
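
Roughly, in code (a sketch of what I have today; the lookup names are just what my current configuration happens to contain, "preprocessor-1" is invented for illustration, and rawDocumentText is the loaded contents of the new text file):

// 1. Rebuild the configuration from the model provenance
List<ConfigurationData> provConfig = ProvenanceUtil.extractConfiguration(model.getProvenance());
ConfigurationManager cm = new ConfigurationManager();
cm.addConfiguration(provConfig);

// 2. Pull the components back out by their generated config names (the fragile part)
TextFeatureExtractor<Label> extractor = (TextFeatureExtractor<Label>) cm.lookup("bigramextractor-2");
DocumentPreprocessor preprocessor = (DocumentPreprocessor) cm.lookup("preprocessor-1");

// 3. + 4. Preprocess the raw document and build an Example for prediction
String processed = preprocessor.processDoc(rawDocumentText);
Example<Label> example = extractor.extract(new LabelFactory().getUnknownOutput(), processed);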

My problems and questions for this approach are:

  1. To get the feature extractor in step 2 I have to know the exact name and number (e.g. cm.lookup("bigramextractor-2")). Is there a way to do this without hardcoding, as this isn't robust when the extractor (or its number) changes?
  2. Do I have to extract the TFIDF transformation manually and then apply to the Example?
  3. Is there a way to rebuild the whole feature extraction pipeline automatically?

Is your question about a specific ML algorithm or approach?
Please describe the ML algorithm, with appropriate links or references.

Is your question about a specific Tribuo class?
List the classes involved.

System details

  • Tribuo version 4.2
  • Java version (if appropriate)
  • OS/Architecture (if appropriate)

Additional context
Add any other context or screenshots about the question

christianwaldmann added the question General question label Sep 26, 2022
@Craigacp
Member

I hear the point that this is kinda ugly, and we'll look at making it less so after the 4.3 release (which is coming around JavaOne).

For point 1, you can use cm.lookupSingleton(TextFeatureExtractor.class, true) which will pull out the feature extractor. In the case of columnar data you can use cm.lookupSingleton(RowProcessor.class, false) as there may be multiple text extractors.
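
i.e. once you've rebuilt the ConfigurationManager from the model provenance (as in your step 1), it's roughly this (sketch):

// Looked up by type rather than by the generated config name, so no hardcoded "bigramextractor-2"
TextFeatureExtractor<Label> extractor = (TextFeatureExtractor<Label>) cm.lookupSingleton(TextFeatureExtractor.class, true);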

For 2, TF-IDF is split slightly oddly in Tribuo. We compute TF during feature extraction, and IDF as a transformation on the dataset. To roll it into a single pipeline the easiest thing is to use TransformTrainer which produces TransformedModel. That will automatically apply the IDF transformation. You'll still need to pull out the text feature extractor and apply that separately. This split between feature extraction and transformation is because the feature extraction step in a DataSource is explicitly one row at a time, and the only place where whole dataset statistics are computed are in transforms. It's a bit of a mess though for TF-IDF, so we'll take a look at it.
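
At prediction time that pipeline looks roughly like this (a sketch; extractor and preprocessor are the components pulled back out of the configuration, and model is the TransformedModel that TransformTrainer's train() returned):

String processed = preprocessor.processDoc(rawText);
Example<Label> example = extractor.extract(new LabelFactory().getUnknownOutput(), processed);
Prediction<Label> prediction = model.predict(example); // the stored IDF transformation is applied inside predict()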

For 3, not yet, we'll look at making it smoother. You can always save out the configuration for the text processing and store it as a config file (in xml, json, edn or pb) next to the model, then you can inspect the names, but that's not fully programmatic. We have tended to drive Tribuo via configuration files for complex text processing tasks, so we have the config file around at inference time too.
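
For example, something along these lines writes the text processing configuration out next to the model (a sketch; double check the importConfigurable/save signatures in the OLCUT version you're using):

ConfigurationManager cm = new ConfigurationManager();
cm.importConfigurable(extractor, "text-extractor"); // register the extractor (and preprocessor) under names you choose
cm.save(new File("text-pipeline-config.xml"), true); // writes the configuration as xml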

@christianwaldmann
Author

Thanks for your answer. I have a follow-up question about point 2: how would I go about implementing it using the TransformTrainer?

I want to fit the TF-IDF transformation on the training data only, but apply it everywhere (train, test, and at inference time).
Would the following code achieve this behaviour?

Trainer<Label> trainer = ...
// IDF applied as a single global transformation; the IDF statistics are fitted when train() is called
TransformationMap trMap = new TransformationMap(Collections.singletonList(new IDFTransformation()));
Trainer<Label> trainerTfidf = new TransformTrainer<Label>(trainer, trMap);

@Craigacp
Member

Craigacp commented Dec 8, 2022

Yes, that will apply the IDFTransformation to all the features that the trainer sees. If the trainer only sees n-gram features then you're ok; if you're adding other features in then you probably don't want to IDF those, so you should use the regex expansion argument to apply the IDF only to the n-grams.
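
For example, something like this (a sketch; the "2-N=.*" regex is a placeholder, check the feature names your n-gram pipeline actually produces and adjust the pattern):

Map<String, List<Transformation>> featureTransforms = new HashMap<>();
// The key is a regex matched against feature names; only matching features get the IDF transformation
featureTransforms.put("2-N=.*", Collections.singletonList(new IDFTransformation()));
TransformationMap trMap = new TransformationMap(Collections.emptyList(), featureTransforms);
Trainer<Label> trainerTfidf = new TransformTrainer<>(trainer, trMap);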

@christianwaldmann
Author

Ok, thank you very much. That answered all my questions.
