
Document classification at inference time #283

Closed
christianwaldmann opened this issue Sep 26, 2022 · 4 comments
Labels
question General question

Comments

@christianwaldmann

Ask the question

In the columnar tutorial it is explained how to rebuild the RowProcessor in order to generate an Example at inference time.
How would this work for the document classification tutorial which doesn't use CSV but text files as input (for a regular model, not the BERT model)?

Assuming I have extracted the bigrams and applied a TFIDF transformation and a minimum cardinality for the dataset.
My current approach looks like this (sketched in code below the steps):

  1. Build ConfigurationManager from model provenance
  2. Get TextFeatureExtractor, OutputFactory and DocumentPreprocessor from ConfigurationManager with lookup()
  3. Preprocess the new text file with the DocumentPreprocessor
  4. Create Example with TextFeatureExtractor::extract()
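
Roughly, in code (a sketch of what I have today; the lookup names are just what my current configuration happens to contain, "preprocessor-1" is invented for illustration, and rawDocumentText is the loaded contents of the new text file):

// 1. Rebuild the configuration from the model provenance
List<ConfigurationData> provConfig = ProvenanceUtil.extractConfiguration(model.getProvenance());
ConfigurationManager cm = new ConfigurationManager();
cm.addConfiguration(provConfig);

// 2. Pull the components back out by their generated config names (the fragile part)
TextFeatureExtractor<Label> extractor = (TextFeatureExtractor<Label>) cm.lookup("bigramextractor-2");
DocumentPreprocessor preprocessor = (DocumentPreprocessor) cm.lookup("preprocessor-1");

// 3. + 4. Preprocess the raw document and build an Example for prediction
String processed = preprocessor.processDoc(rawDocumentText);
Example<Label> example = extractor.extract(new LabelFactory().getUnknownOutput(), processed);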

My problems and questions for this approach are:

  1. To get the feature extractor in step 2 I have to know the exact name and number (e.g. cm.lookup("bigramextractor-2")). Is there a way to do this without hardcoding, as this isn't robust when the extractor (or its number) changes?
  2. Do I have to extract the TFIDF transformation manually and then apply to the Example?
  3. Is there a way to rebuild the whole feature extraction pipeline automatically?

Is your question about a specific ML algorithm or approach?
Please describe the ML algorithm, with appropriate links or references.

Is your question about a specific Tribuo class?
List the classes involved.

System details

  • Tribuo version 4.2
  • Java version (if appropriate)
  • OS/Architecture (if appropriate)

Additional context
Add any other context or screenshots about the question

christianwaldmann added the question General question label Sep 26, 2022
@Craigacp
Member

I hear the point that this is kinda ugly, and we'll look at making it less so after the 4.3 release (which is coming around JavaOne).

For point 1, you can use cm.lookupSingleton(TextFeatureExtractor.class, true) which will pull out the feature extractor. In the case of columnar data you can use cm.lookupSingleton(RowProcessor.class, false) as there may be multiple text extractors.
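
i.e. once you've rebuilt the ConfigurationManager from the model provenance (as in your step 1), it's roughly this (sketch):

// Looked up by type rather than by the generated config name, so no hardcoded "bigramextractor-2"
TextFeatureExtractor<Label> extractor = (TextFeatureExtractor<Label>) cm.lookupSingleton(TextFeatureExtractor.class, true);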

For 2, TF-IDF is split slightly oddly in Tribuo. We compute TF during feature extraction, and IDF as a transformation on the dataset. To roll it into a single pipeline the easiest thing is to use TransformTrainer which produces TransformedModel. That will automatically apply the IDF transformation. You'll still need to pull out the text feature extractor and apply that separately. This split between feature extraction and transformation is because the feature extraction step in a DataSource is explicitly one row at a time, and the only place where whole dataset statistics are computed are in transforms. It's a bit of a mess though for TF-IDF, so we'll take a look at it.
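
At prediction time that pipeline looks roughly like this (a sketch; extractor and preprocessor are the components pulled back out of the configuration, and model is the TransformedModel that TransformTrainer's train() returned):

String processed = preprocessor.processDoc(rawText);
Example<Label> example = extractor.extract(new LabelFactory().getUnknownOutput(), processed);
Prediction<Label> prediction = model.predict(example); // the stored IDF transformation is applied inside predict()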

For 3, not yet, we'll look at making it smoother. You can always save out the configuration for the text processing and store it as a config file (in xml, json, edn or pb) next to the model, then you can inspect the names, but that's not fully programmatic. We have tended to drive Tribuo via configuration files for complex text processing tasks, so we have the config file around at inference time too.
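
For example, something along these lines writes the text processing configuration out next to the model (a sketch; double check the importConfigurable/save signatures in the OLCUT version you're using):

ConfigurationManager cm = new ConfigurationManager();
cm.importConfigurable(extractor, "text-extractor"); // register the extractor (and preprocessor) under names you choose
cm.save(new File("text-pipeline-config.xml"), true); // writes the configuration as xml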

@christianwaldmann
Author

Thanks for your answer. I have a follow-up question about point 2: how would I go about implementing it using the TransformTrainer?

I want to fit the TF-IDF transformation on the training data only, but apply it everywhere (train, test, and at inference time).
Would the following code achieve this behaviour?

Trainer<Label> trainer = ...
// IDF applied as a single global transformation; the IDF statistics are fitted when train() is called
TransformationMap trMap = new TransformationMap(Collections.singletonList(new IDFTransformation()));
Trainer<Label> trainerTfidf = new TransformTrainer<Label>(trainer, trMap);

@Craigacp
Member

Craigacp commented Dec 8, 2022

Yes, that will apply the IDFTransformation to all the features that the trainer sees. If the trainer only sees n-gram features then you're ok; if you're adding other features in then you probably don't want to IDF those, so you should use the regex expansion argument to apply the IDF only to the n-grams.
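
For example, something like this (a sketch; the "2-N=.*" regex is a placeholder, check the feature names your n-gram pipeline actually produces and adjust the pattern):

Map<String, List<Transformation>> featureTransforms = new HashMap<>();
// The key is a regex matched against feature names; only matching features get the IDF transformation
featureTransforms.put("2-N=.*", Collections.singletonList(new IDFTransformation()));
TransformationMap trMap = new TransformationMap(Collections.emptyList(), featureTransforms);
Trainer<Label> trainerTfidf = new TransformTrainer<>(trainer, trMap);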

@christianwaldmann
Author

Ok, thank you very much. That answered all my questions.
