ExplainaBoard: An Explainable Leaderboard for NLP
ExplainaBoard is an interpretable, interactive, and reliable leaderboard with seven (so far) new features (F) compared with generic leaderboards.
- F1: Single-system Analysis: What is a system good or bad at?
- F2: Pairwise Analysis: Where is one system better (worse) than another?
- F3: Data Bias Analysis: What are the characteristics of different evaluated datasets?
- F5: Common errors: What are common mistakes that the top-5 systems make?
- F6: Fine-grained errors: Where will errors occur?
- F7: System Combination: Is there potential complementarity between different systems?
We provide not only a web-based interactive toolkit but also an API so that users can flexibly evaluate their systems offline. This means you can use ExplainaBoard at the following levels:
- U1: Just playing with it: You can walk around, track NLP progress, and understand the relative merits of different top-performing systems.
- U2: We help you analyze your model: You submit your model outputs and we deploy them into the online ExplainaBoard.
- U3: Do it by yourself: You can process your model outputs by yourself using our API.
API-based Toolkit: Quick Installation
Method 1: Simple installation from PyPI (Python 3 only)
```
pip install interpret-eval
```
Method 2: Install from the source and develop locally (Python 3 only)
```
# Clone current repo
git clone https://github.com/neulab/ExplainaBoard.git
cd ExplainaBoard
# Requirements
pip install -r requirements.txt
# Install the package
python setup.py install
```
Then, you can run the following example from the command line:
```
interpret-eval --task chunk --systems ./interpret_eval/example/test-conll00.tsv --output out.json
```
test-conll00.tsv denotes your system output file, whose format depends on the task. For each task we have provided one example output file to show how it is formatted.
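Output formats vary by task, so the provided example files are the authoritative reference. As an illustration only, the sketch below assumes a chunking output with three tab-separated columns per token (token, gold tag, predicted tag); this layout and the helper `token_accuracy` are assumptions for the sketch, not part of the toolkit.

```python
import csv
import io

# Hypothetical system-output snippet for the chunking task, assuming the
# columns are: token, gold chunk tag, predicted chunk tag.
tsv_output = """\
Confidence\tB-NP\tB-NP
in\tB-PP\tB-PP
the\tB-NP\tI-NP
pound\tI-NP\tI-NP
"""

def token_accuracy(tsv_text):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    correct = sum(1 for token, gold, pred in rows if gold == pred)
    return correct / len(rows)

print(token_accuracy(tsv_output))  # 3 of the 4 tags match -> 0.75
```

Checking a file this way before submitting can catch column-order or delimiter mistakes early.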
The above command will generate a detailed report (saved in out.json) for your input system.
Specifically, the following statistics are included:
- Fine-grained performance
- Confidence intervals
- Error cases
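Because the report is plain JSON, it is easy to inspect programmatically. The sketch below is a minimal illustration: the key names (`fine_grained`, `confidence_interval`, `error_cases`) are assumptions mirroring the bullet list above, not the tool's actual schema, so it builds a mock report rather than running interpret-eval.

```python
import json
from pathlib import Path

# Hypothetical report mimicking the structure of out.json; the real keys
# produced by interpret-eval may differ.
mock_report = {
    "fine_grained": {"sentence_length": {"short": 0.93, "long": 0.87}},
    "confidence_interval": [0.88, 0.92],
    "error_cases": [{"token": "the", "true": "B-NP", "pred": "I-NP"}],
}
Path("out.json").write_text(json.dumps(mock_report))

# Load the report and summarize each bucketed score.
report = json.loads(Path("out.json").read_text())
for bucket, score in report["fine_grained"]["sentence_length"].items():
    print(f"{bucket}: {score:.2f}")
low, high = report["confidence_interval"]
print(f"confidence interval: [{low}, {high}]")
```

Reading the JSON yourself is useful when you want to feed the fine-grained scores into your own plots or comparisons.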
Web-based Toolkit: Quick Learning
We deploy ExplainaBoard as a web toolkit, which includes 9 NLP tasks, 40 datasets, and 300 systems. Detailed information is as follows.
So far, ExplainaBoard covers the following tasks:

| Task | Sub-task | Datasets | Systems | Attributes |
|------|----------|----------|---------|------------|
| Text-Span Classification | Aspect Sentiment | 4 | 20 | 4 |
| Text-Pair Classification | NLI | 2 | 6 | 7 |
| Structure Prediction | Semantic Parsing | 4 | 12 | 4 |
Submit Your Results
Download System Outputs
We haven't released the datasets or corresponding system outputs that require licenses. But if you have the licenses, please fill in this form and we will send them to you privately. (A description of the output format can be found here.) If these system outputs are useful to you, please cite our work.
Currently Covered Systems
We thank all authors who shared their system outputs with us: Ikuya Yamada, Stefan Schweter, Colin Raffel, Yang Liu, and Li Dong. We also thank Vijay Viswanathan, Yiran Chen, and Hiroaki Hayashi for useful discussion and feedback about ExplainaBoard.