Document classification at inference time #283
Comments
I hear the point that this is kinda ugly, and we'll look at making it less so after the 4.3 release (which is coming around JavaOne). For point 1, you can use For 2, TF-IDF is split slightly oddly in Tribuo. We compute TF during feature extraction, and IDF as a transformation on the dataset. To roll it into a single pipeline the easiest thing is to use For 3, not yet, we'll look at making it smoother. You can always save out the configuration for the text processing and store it as a config file (in xml, json, edn or pb) next to the model, then you can inspect the names, but that's not fully programmatic. We have tended to drive Tribuo via configuration files for complex text processing tasks, so we have the config file around at inference time too.
Thanks for your answer. I have a follow-up question for point 2. How would I go about implementing it using the TransformTrainer? I want to fit the TF-IDF transformation on the training data only. That is, fit should use only the train data, but apply should run on everything (train, test, or inference time).
Yes, that will apply the
Ok, thank you very much. That answered all my questions.
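For anyone landing here later, the fit-on-train, apply-everywhere behaviour discussed above can be sketched in plain Java. This is only an illustration of the idea, not Tribuo's API: `IdfSketch`, `fit` and `transform` are made-up names, and the smoothed IDF formula is just one common variant, so don't assume it matches what Tribuo computes internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration class; not part of Tribuo. */
final class IdfSketch {
    // IDF weights learned from the training documents only.
    private final Map<String, Double> idf = new HashMap<>();

    /** Fit step: count document frequencies on the training documents. */
    static IdfSketch fit(List<List<String>> trainDocs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : trainDocs) {
            // Each term counts at most once per document.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        IdfSketch model = new IdfSketch();
        int n = trainDocs.size();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            // Assumed formula (one common smoothed variant): log((1 + n) / (1 + df)) + 1
            model.idf.put(e.getKey(), Math.log((1.0 + n) / (1.0 + e.getValue())) + 1.0);
        }
        return model;
    }

    /** Apply step: works on train, test or inference-time documents alike. */
    Map<String, Double> transform(List<String> doc) {
        // TF is computed per document, independent of any other data.
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            Double weight = idf.get(e.getKey());
            // Terms never seen during fit have no IDF weight and are dropped.
            if (weight != null) {
                out.put(e.getKey(), e.getValue() * weight);
            }
        }
        return out;
    }
}
```

The point of the split is that `fit` never sees test or inference data, while `transform` reuses the stored weights for any document, which is the behaviour asked about for the TransformTrainer.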
Ask the question
In the columnar tutorial it is explained how to rebuild the RowProcessor in order to generate an Example at inference time.
How would this work for the document classification tutorial which doesn't use CSV but text files as input (for a regular model, not the BERT model)?
Assuming I have extracted the bigrams and applied a TFIDF transformation and a minimum cardinality for the dataset.
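To make that setup concrete, here is a rough plain-Java sketch of the bigram and minimum cardinality part (the TF-IDF part is left out). It is not Tribuo code and not the approach from this issue: `BigramVocabSketch` and its methods are hypothetical names, whitespace tokenisation is a simplification, and counting a bigram once per document for the cardinality threshold is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration class; not part of Tribuo. */
final class BigramVocabSketch {

    /** Splits on whitespace and joins adjacent tokens into bigram feature names. */
    static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + "/" + tokens[i + 1]);
        }
        return out;
    }

    /** Training time: keep bigrams that occur in at least minCount documents (assumed counting rule). */
    static Set<String> fitVocabulary(List<String> trainTexts, int minCount) {
        Map<String, Integer> docCounts = new HashMap<>();
        for (String text : trainTexts) {
            for (String bigram : new HashSet<>(bigrams(text))) {
                docCounts.merge(bigram, 1, Integer::sum);
            }
        }
        Set<String> vocab = new HashSet<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (e.getValue() >= minCount) {
                vocab.add(e.getKey());
            }
        }
        return vocab;
    }

    /** Inference time: same extraction, restricted to the vocabulary fixed at training time. */
    static Map<String, Integer> features(String text, Set<String> vocab) {
        Map<String, Integer> counts = new HashMap<>();
        for (String bigram : bigrams(text)) {
            if (vocab.contains(bigram)) {
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The key property is that the inference path runs exactly the same extraction as training and only looks features up in the stored vocabulary; nothing is re-fit on the new document.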
My current approach looks like this:
My problems and questions for this approach are:
cm.lookup("bigramextractor-2"). Is there a way to do this without hardcoding, as this is not robust when the extractor changes? (Or the number)
Is your question about a specific ML algorithm or approach?
Please describe the ML algorithm, with appropriate links or references.
Is your question about a specific Tribuo class?
List the classes involved.
System details
Additional context
Add any other context or screenshots about the question