Implement CalibrationAnalysis #417

Closed
neubig opened this issue Aug 29, 2022 · 10 comments

@neubig (Contributor) commented Aug 29, 2022

Calibration refers to how well a system's confidence correlates with whether the system actually got the answer right. It would be nice if we could do analyses related to calibration, such as calculating the expected calibration error: https://arxiv.org/abs/1706.04599

I think this should probably be implemented as an additional variety of analysis, which would be simple and self-contained: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/analyses.py#L45

@pfliu-nlp (Collaborator)

@neubig, this is also a feature I have been hoping for. The only complication is that we need the predicted probability as an additional feature.
We could have a rule like this: if a system output file for a classification task contains a probability feature, then the processor performs calibration analysis adaptively (somewhat similar to training-set-dependent features).
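A rough sketch of the kind of check I mean, in case it helps (the `confidence` feature name and the function are hypothetical, not the actual processor API):

```python
def should_run_calibration(sample: dict) -> bool:
    """Hypothetical check: run calibration only if the output has a confidence in [0, 1]."""
    conf = sample.get("confidence")  # hypothetical feature name
    return isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0
```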

@neubig (Contributor, Author) commented Aug 29, 2022

At the moment, if an analysis is not applicable it returns None, so we could do a similar thing here.

@pfliu-nlp (Collaborator)

Yes, I noticed that. But this can also lead to bugs when deploying the web platform, which I spent quite a long time debugging.
Specifically, there is a potential schema validation bug that can occur here: https://github.com/neulab/explainaboard_web/blob/c711cf8277c4f19d0c12e18b008b7ec2b8779d00/backend/src/impl/default_controllers_impl.py#L447

@neubig (Contributor, Author) commented Aug 29, 2022

I don't think returning None is necessarily a bad thing if we know it's expected behavior. But we could definitely discuss ways to rectify this if it's a problem.

@odashi (Contributor) commented Aug 30, 2022

I just created #418 around the topic of None.

Basically, I think None is not informative: it provides no fine-grained information, and users have no way to react to invalid operations. I prefer either of the following (a minimal Python sketch of both styles is at the end of this comment):

  • Raising exceptions to carry the appropriate information out of the process. Since Python is designed to work with exceptions (even for ordinary control flow, e.g. for loops and StopIteration), exceptions should be the first choice before inventing our own semantics.
  • Adopting the result semantics used in other languages:
    • Rust: provides a native Result type that owns either data or an error.
    • Go: functions may return a tuple of data and error; if the second value (error) is non-nil, it indicates that an error occurred.
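For illustration only, a minimal Python sketch of both styles (all names here are made up for the example, not a proposal for the actual ExplainaBoard API):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


# Style 1: raise an exception carrying the reason the analysis could not run.
class AnalysisNotApplicableError(Exception):
    """Raised when an analysis cannot be applied to the given system output."""


def run_analysis_or_raise(has_confidence: bool) -> float:
    if not has_confidence:
        raise AnalysisNotApplicableError("system output has no confidence values")
    return 0.0  # placeholder result


# Style 2: a Result-like container that owns either data or an error, as in Rust/Go.
@dataclass
class Result(Generic[T]):
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_analysis_result(has_confidence: bool) -> Result[float]:
    if not has_confidence:
        return Result(error="system output has no confidence values")
    return Result(value=0.0)  # placeholder result
```

Either way, the caller gets a reason for the failure instead of a bare None.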

@odashi (Contributor) commented Aug 30, 2022

I think it is better that None is used only to signal "no information." If there is information that would be useful to report externally, it is better to adopt one of the approaches described above.

@qjiang002 (Collaborator) commented Sep 21, 2022

Here are some draft ideas for implementing calibration analysis:

  • When to perform calibration analysis: (1) the task has an accuracy metric; (2) the user provides both predicted labels and confidence values in the output file. We should check that the confidence values are in the range [0, 1], not logits.
  • Bucketing: divide the samples into K bins, where K is a hyper-parameter, by splitting the confidence range [0, 1] into K intervals.
  • Compute the accuracy and average confidence of each bin, then calculate ECE and MCE according to formulas (3) and (5) in this paper (a rough sketch of this computation is below).
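A rough sketch of the bucketing and the ECE/MCE computation described above (equal-width bins; the function is only illustrative, not the proposed interface):

```python
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bin_totals = [0] * num_bins
    bin_correct = [0] * num_bins
    bin_conf_sum = [0.0] * num_bins
    for conf, is_correct in zip(confidences, correct):
        # confidence 1.0 falls into the last bin
        b = min(int(conf * num_bins), num_bins - 1)
        bin_totals[b] += 1
        bin_correct[b] += int(is_correct)
        bin_conf_sum[b] += conf
    ece, mce = 0.0, 0.0
    for b in range(num_bins):
        if bin_totals[b] == 0:
            continue  # empty bins contribute nothing
        acc = bin_correct[b] / bin_totals[b]
        avg_conf = bin_conf_sum[b] / bin_totals[b]
        gap = abs(acc - avg_conf)
        ece += (bin_totals[b] / n) * gap  # weighted average gap (formula 3)
        mce = max(mce, gap)               # worst-case gap (formula 5)
    return ece, mce
```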

@neubig (Contributor, Author) commented Sep 21, 2022

One comment: we may want to view a CalibrationAnalysisResult as either a subclass of a BucketAnalysisResult, or a BucketAnalysisResult with some auxiliary information (namely ECE and MCE). That would make it easy to, for example, display the calibration diagram using (nearly) the same code that we normally use for displaying buckets.
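Roughly something like this, for example (just a sketch; it assumes BucketAnalysisResult in explainaboard/analysis/analyses.py is a dataclass, which may not match its actual definition):

```python
from dataclasses import dataclass

from explainaboard.analysis.analyses import BucketAnalysisResult


@dataclass
class CalibrationAnalysisResult(BucketAnalysisResult):
    # Auxiliary calibration summaries on top of the usual per-bucket results.
    expected_calibration_error: float = 0.0
    maximum_calibration_error: float = 0.0
```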

@odashi (Contributor) commented Sep 21, 2022

@neubig Subclassing and composition should be used when there is a semantically meaningful relationship between the two classes. If we only need to reuse the same code, implementing them separately is better for keeping the overall design organized.

@odashi (Contributor) commented Oct 24, 2022

This issue was resolved.

odashi closed this as completed on Oct 24, 2022.