Implement CalibrationAnalysis #417

Closed
neubig opened this issue Aug 29, 2022 · 10 comments

@neubig (Contributor) commented Aug 29, 2022

Calibration refers to how well a system's confidence correlates with whether the system actually got the answer right. It would be nice if we could do analyses related to calibration, such as calculating the expected calibration error: https://arxiv.org/abs/1706.04599

I think this should probably be implemented as an additional variety of analysis, which would be simple and self-contained: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/analyses.py#L45

@pfliu-nlp (Collaborator)

@neubig, this is also a feature I have been hoping for. The only complication is that we need the predicted probability as an additional feature.
We could have a rule like this: if a system output file for a classification task contains a probability feature, then the processor performs calibration analysis adaptively (somewhat similar to training-set-dependent features).
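A rough sketch of the kind of check I mean, in case it helps (the `confidence` feature name and the function are hypothetical, not the actual processor API):

```python
def should_run_calibration(sample: dict) -> bool:
    """Hypothetical check: run calibration only if the output has a confidence in [0, 1]."""
    conf = sample.get("confidence")  # hypothetical feature name
    return isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0
```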

@neubig (Contributor, Author) commented Aug 29, 2022

At the moment, if an analysis is not applicable it returns None, so we could do a similar thing here.

@pfliu-nlp (Collaborator)

Yes, I noticed that. But this can also lead to bugs when deploying the web platform, which I spent quite a long time debugging.
Specifically, there is a potential schema validation bug that can occur here: https://github.com/neulab/explainaboard_web/blob/c711cf8277c4f19d0c12e18b008b7ec2b8779d00/backend/src/impl/default_controllers_impl.py#L447

@neubig (Contributor, Author) commented Aug 29, 2022

I don't think returning None is necessarily a bad thing if we know it's expected behavior. But we could definitely discuss ways to rectify this if it's a problem.

@odashi (Contributor) commented Aug 30, 2022

I just created #418 around the topic of None.

Basically, I think None is not informative: it provides no fine-grained information, and users have no way to react to invalid operations. I prefer either of the following (a minimal Python sketch of both styles is at the end of this comment):

  • Raising exceptions to carry the appropriate information out of the process. Since Python is designed to work with exceptions (even for ordinary control flow, e.g. for loops and StopIteration), exceptions should be the first choice before inventing our own semantics.
  • Adopting the result semantics used in other languages:
    • Rust: provides a native Result type that owns either data or an error.
    • Go: functions may return a tuple of data and error; if the second value (error) is non-nil, it indicates that an error occurred.
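For illustration only, a minimal Python sketch of both styles (all names here are made up for the example, not a proposal for the actual ExplainaBoard API):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


# Style 1: raise an exception carrying the reason the analysis could not run.
class AnalysisNotApplicableError(Exception):
    """Raised when an analysis cannot be applied to the given system output."""


def run_analysis_or_raise(has_confidence: bool) -> float:
    if not has_confidence:
        raise AnalysisNotApplicableError("system output has no confidence values")
    return 0.0  # placeholder result


# Style 2: a Result-like container that owns either data or an error, as in Rust/Go.
@dataclass
class Result(Generic[T]):
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_analysis_result(has_confidence: bool) -> Result[float]:
    if not has_confidence:
        return Result(error="system output has no confidence values")
    return Result(value=0.0)  # placeholder result
```

Either way, the caller gets a reason for the failure instead of a bare None.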

@odashi (Contributor) commented Aug 30, 2022

I think it is better that None is used only to signal "no information." If there is information that would be useful to report externally, it is better to adopt one of the approaches described above.

@qjiang002 (Collaborator) commented Sep 21, 2022

Here are some draft ideas for implementing calibration analysis:

  • When to perform calibration analysis: (1) the task has an accuracy metric; (2) the user provides both predicted labels and confidence values in the output file. We should check that the confidence values are in the range [0, 1], not logits.
  • Bucketing: divide the samples into K bins, where K is a hyper-parameter, by splitting the confidence range [0, 1] into K intervals.
  • Compute the accuracy and average confidence of each bin, then calculate ECE and MCE according to formulas (3) and (5) in this paper (a rough sketch of this computation is below).
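A rough sketch of the bucketing and the ECE/MCE computation described above (equal-width bins; the function is only illustrative, not the proposed interface):

```python
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bin_totals = [0] * num_bins
    bin_correct = [0] * num_bins
    bin_conf_sum = [0.0] * num_bins
    for conf, is_correct in zip(confidences, correct):
        # confidence 1.0 falls into the last bin
        b = min(int(conf * num_bins), num_bins - 1)
        bin_totals[b] += 1
        bin_correct[b] += int(is_correct)
        bin_conf_sum[b] += conf
    ece, mce = 0.0, 0.0
    for b in range(num_bins):
        if bin_totals[b] == 0:
            continue  # empty bins contribute nothing
        acc = bin_correct[b] / bin_totals[b]
        avg_conf = bin_conf_sum[b] / bin_totals[b]
        gap = abs(acc - avg_conf)
        ece += (bin_totals[b] / n) * gap  # weighted average gap (formula 3)
        mce = max(mce, gap)               # worst-case gap (formula 5)
    return ece, mce
```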

@neubig (Contributor, Author) commented Sep 21, 2022

One comment: we may want to view a CalibrationAnalysisResult as either a subclass of a BucketAnalysisResult, or a BucketAnalysisResult with some auxiliary information (namely ECE and MCE). That would make it easy to, for example, display the calibration diagram using (nearly) the same code that we normally use for displaying buckets.
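Roughly something like this, for example (just a sketch; it assumes BucketAnalysisResult in explainaboard/analysis/analyses.py is a dataclass, which may not match its actual definition):

```python
from dataclasses import dataclass

from explainaboard.analysis.analyses import BucketAnalysisResult


@dataclass
class CalibrationAnalysisResult(BucketAnalysisResult):
    # Auxiliary calibration summaries on top of the usual per-bucket results.
    expected_calibration_error: float = 0.0
    maximum_calibration_error: float = 0.0
```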

@odashi (Contributor) commented Sep 21, 2022

@neubig Subclassing and composition should be used when there is a semantically meaningful relationship between the two classes. If we only need to reuse the same code, implementing them separately is better for keeping the overall design organized.

@odashi (Contributor) commented Oct 24, 2022

This issue was resolved.

odashi closed this as completed on Oct 24, 2022.