Refactor evaluation metrics to support decoded generated text metrics like BLEU and ROUGE. #3539
Conversation
LGTM, nice work @justinxzhao! I just left a few comments asking for additional code comments to clarify some pieces, but this is a good change overall.
ludwig/models/llm.py
Outdated
```python
    _predictions,
    _decoded_predictions,
) = realign_target_and_prediction_tensors_for_inference(targets, predictions, of_name, self.tokenizer)
of_obj.update_metrics(_targets[of_name], _predictions[of_name], _decoded_targets, _decoded_predictions)
```
do we strictly need to pass all this stuff? Plus, isn't there a more elegant way than having an `if text feature` check and having `update_metrics` take a different signature only in that case? It's not an amazing abstraction, and it requires us to check for text output everywhere we call update_metrics (it may be happening only here, but still, it's a red flag that something is wrong with the abstraction). One option could be to have kwargs in the superclass update_metrics and pass the additional things through kwargs. It would still need an if for determining the inputs, but at least we would not have text outputs needing a different signature for update_metrics
Having `TextOutputFeature` own its tokenizer (for decoding predictions) would enable the original `update_metrics()` signature to work, though `TextOutputFeature` would still need to override the base class's implementation of `update_metrics()` to decode texts with the tokenizer and use the decoded texts to compute `RESPONSE` metrics.
However, constructing `TextOutputFeature` with the tokenizer would require a more invasive change:
- Output feature construction code paths are still fully unified under `BaseModel.build_outputs()`.
- We would need to change the `build_outputs()` signature to take an additional `training_set_metadata` argument (which contains tokenizer information) so that it can be passed through the output feature constructors; `TextOutputFeature` is the only output feature that would need `training_set_metadata` in this way (see the sketch after this list).
- Output feature construction call sites would need to be updated as well -- there are 3 (`ecd.py`, `gbm.py`, and `llm.py`).
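For reference, the more invasive signature change would look roughly like this (a sketch only; parameter names and types are assumptions, and this is the alternative the PR avoids):

```python
# Sketch of the rejected, more invasive alternative (not what this PR does).
@classmethod
def build_outputs(
    cls,
    output_feature_configs,
    input_size: int,
    training_set_metadata: dict,  # new argument: carries tokenizer info
):
    # training_set_metadata would be threaded through every output feature
    # constructor, even though only TextOutputFeature needs it.
    ...
```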
This refactoring seems worthwhile to do if other output features need training set metadata. For now, keeping ownership of the tokenizer at the model level, and relying on the model to decode predictions and provide them for `TextOutputFeature`'s metrics calculations, seems like a cleaner, less invasive change -- WDYT?
I made one simplification, which is to do the decoding within `update_metrics()` instead of outside at the model level. This should reduce some duplication.
sounds good to me!
Adding new decoded text metrics
New evaluation metrics for text output features can be added to the metrics registry under the `RESPONSE` keyword, as sketched below. The new metrics added in this PR include decoded text metrics such as BLEU and ROUGE.
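For example, registering a BLEU metric might look like the following minimal sketch (assumptions: the `register_metric` decorator arguments, import paths, and the `LudwigMetric` mixin follow Ludwig's metric-module conventions and may differ from the exact PR code):

```python
from torchmetrics.text import BLEUScore

from ludwig.constants import MAXIMIZE, RESPONSE, TEXT
from ludwig.modules.metric_modules import LudwigMetric, register_metric


# Registering under RESPONSE marks this metric as operating on decoded
# (string) predictions rather than token tensors.
@register_metric("bleu", [TEXT], MAXIMIZE, RESPONSE)
class BLEUScoreMetric(BLEUScore, LudwigMetric):
    def __init__(self, **kwargs):
        super().__init__()
```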
Changes to `TextOutputFeature.update_metrics()`
Since we need to pass in additional decoded inputs to text features, we explicitly define `update_metrics()` for `TextOutputFeature` instead of relying on the base `OutputFeature` implementation. The decoded texts are optional arguments; if they are not provided, decoded text metrics are skipped (see the sketch below). Some torchmetrics text metrics return a dictionary of sub-metrics. These are individually unpacked in `model.get_metrics()`.
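A sketch of the overridden method, assuming Ludwig's `OutputFeature` base class; `_response_metric_functions` is a hypothetical attribute name for the collection of `RESPONSE` metrics, not the exact PR code:

```python
from typing import Dict, List, Optional

import torch


class TextOutputFeature(OutputFeature):
    def update_metrics(
        self,
        targets: torch.Tensor,
        predictions: Dict[str, torch.Tensor],
        decoded_targets: Optional[List[str]] = None,
        decoded_predictions: Optional[List[str]] = None,
    ) -> None:
        # Token-level metrics run as in the base OutputFeature implementation.
        super().update_metrics(targets, predictions)

        # Decoded text metrics (BLEU, ROUGE, ...) run only when decoded
        # strings are supplied; otherwise they are skipped entirely.
        if decoded_targets is not None and decoded_predictions is not None:
            for metric in self._response_metric_functions.values():  # hypothetical attribute
                metric.update(decoded_predictions, decoded_targets)
```

Dict-valued results from metrics like torchmetrics' `ROUGEScore` are then flattened into individual sub-metric entries when `model.get_metrics()` collects them.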
Changes to the LLM class
The LLM class has two separate methods for evaluation metrics: 1) `update_metrics()` for zero-shot/few-shot (ZS/FS) evaluation, and 2) `update_metrics_finetune_llm()` for fine-tuning LLMs. These appear to have slightly different treatments for aligning tensors and handling pad tokens. Once predictions and targets are computed, they are decoded and passed to `TextOutputFeature.update_metrics()`, as sketched below.
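The decode-then-update handoff looks roughly like this (a sketch: `target_ids` and `prediction_ids` are placeholder names, and the `batch_decode` usage assumes a standard Hugging Face tokenizer):

```python
# After aligning targets and predictions (see the diff excerpt above),
# decode token ids to strings so RESPONSE metrics can be computed.
decoded_targets = self.tokenizer.batch_decode(target_ids, skip_special_tokens=True)
decoded_predictions = self.tokenizer.batch_decode(prediction_ids, skip_special_tokens=True)

of_obj.update_metrics(targets[of_name], predictions[of_name], decoded_targets, decoded_predictions)
```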
Refactoring the `MetricsPrintedTable`
This PR also replaces the `MetricsPrintedTable` with a simpler, stateless method. Instead of:
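```python
# Hedged reconstruction of the previous, stateful pattern; the method names
# on MetricsPrintedTable are assumptions for illustration.
printed_table = MetricsPrintedTable(output_features)
printed_table.add_metrics_to_printed_table(train_metrics_log, TRAIN)
printed_table.add_metrics_to_printed_table(validation_metrics_log, VALIDATION)
printed_table.add_metrics_to_printed_table(test_metrics_log, TEST)
printed_table.log_info()
```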
We consolidate printed metrics into one call:
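```python
# Sketch of the consolidated, stateless call; the helper name and its
# parameters are illustrative, not the exact PR code.
print_metrics_table(output_features, train_metrics_log, validation_metrics_log, test_metrics_log)
```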
I've also changed the metrics table to print metrics in a transposed layout, with splits as columns. This looks substantially better when there are many additional metrics.
Combined loss is also merged into this table as the last column instead of being printed as a separate table, which has frequently confused users.
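For illustration, the transposed layout looks roughly like this (metric names and numbers are made up):

```
metric             train    validation    test
loss               1.203    1.311         1.297
bleu               0.412    0.388         0.391
rouge1_fmeasure    0.566    0.540         0.538
```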