Added sumeval stuff #2

Merged · 5 commits · Aug 1, 2022
150 changes: 150 additions & 0 deletions SumEval/.gitignore
@@ -0,0 +1,150 @@
Baselines/*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class


# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
lang2vec/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.patch
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm/vim files
.idea
*.swp

src/averaging_baseline.py
Baselines/combined_preds/*.json
data/combined_gold/*.json
67 changes: 67 additions & 0 deletions SumEval/README.md
@@ -0,0 +1,67 @@
# SumEval

Helper Code and Datasets for the 1st Workshop on Scaling Up Multilingual Evaluation (SUMEval).

## Submission Instructions
The test files containing the training configurations and languages for which predictions are to be made are located in [`data/test_release/`](data/test_release/).

We have three types of test files (in most cases) for every dataset-model pair:
- Test sets containing new configurations but the same languages as seen in the training data. These have no suffix, e.g. [`data/test_release/XNLI_XLMR.json`](data/test_release/XNLI_XLMR.json)
- Test sets containing new (surprise) languages but the same configurations as seen during training. These carry the suffix `_surprise_langs_same_configs`, e.g. [`data/test_release/XNLI_XLMR_surprise_langs_same_configs.json`](data/test_release/XNLI_XLMR_surprise_langs_same_configs.json)
- Test sets containing surprise languages as well as new configurations. These carry the suffix `_surprise_langs_diff_configs`, e.g. [`data/test_release/XNLI_XLMR_surprise_langs_diff_configs.json`](data/test_release/XNLI_XLMR_surprise_langs_diff_configs.json)

All the test files are of the following format:

```json
[
    {
        "train_config": {
            "<train_lang_1>": "<Size(train_lang_1)>",
            ...
            "<train_lang_n>": "<Size(train_lang_n)>"
        },
        "eval_results": {
            "<eval_lang_1>": "x",
            ...
            "<eval_lang_m>": "x"
        }
    }
]
```

We ask participants to predict the `"x"` values in these files by training predictor models on the training data and replacing each `"x"` with the predicted value. For instance, one can generate predictions using the LITMUS predictor baseline by running:

```bash
python -m src.sum_eval_predict --data_dir ./data --out_dir ./Baselines
```

This will generate predictions for each file in the `./Baselines/preds` directory.
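
If you prefer to build your own predictor instead of the LITMUS baseline, a minimal sketch of filling in the `"x"` values for one test file could look like the following. The `my_predictor` function and the constant score it returns are placeholders, not part of this repository; only the file locations follow the layout described above.

```python
import json

def my_predictor(train_config, target_lang):
    # Placeholder predictor: returns a constant score.
    # Replace this with a regression model trained on the released training data.
    return 50.0

# Load one of the released test files.
with open("data/test_release/XNLI_XLMR.json") as f:
    test_data = json.load(f)

for entry in test_data:
    train_config = entry["train_config"]  # {pivot language: amount of training data}
    for lang in entry["eval_results"]:
        # Replace the "x" placeholder with the predicted performance.
        entry["eval_results"][lang] = my_predictor(train_config, lang)

# Write the filled-in file to the directory read by the combination script below.
with open("Baselines/preds/XNLI_XLMR.json", "w") as f:
    json.dump(test_data, f, indent=2)
```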

Once the predictions are generated, they can be combined into a format compatible with [Explainaboard](https://explainaboard.inspiredco.ai/) by running:

```bash
python src/combine_predictions.py --pred_dir Baselines/preds --out_dir Baselines/combined_pred --value_name predicted_value
```

This will generate two files, `Baselines/combined_pred/sumeval_test.json` and `Baselines/combined_pred/sumeval_surprise.json`, which can be uploaded to Explainaboard for evaluation. The former combines predictions for the test files without any surprise languages, while the latter combines predictions for the test files with surprise languages (both the same-configs and diff-configs versions).

### Explainaboard Submission Instructions

The two files generated above should be uploaded to Explainaboard (as separate submissions) by following the steps below:
1. Visit the Explainaboard [link](https://explainaboard.inspiredco.ai/) and sign up or log in
2. Go to "Systems" and click "New"
3. Under Task, select "tabular-regression" and select "sumeval2022" as the dataset
4. Select "test" as the split when submitting `sumeval_test.json` and "surprise" for `sumeval_surprise.json`
5. Select "RMSE" and "AbsoluteError" as the metrics
6. Upload `sumeval_test.json` or `sumeval_surprise.json` (according to the split selected in step 4) by clicking on "Select File"
7. For your final submissions that you want to be considered, uncheck "Make it private?"

This will upload your submissions to Explainaboard, where they should appear [here](https://explainaboard.inspiredco.ai/leaderboards?dataset=sumeval2022). The AbsoluteError and RMSE columns show the average errors across all the test sets. For a fine-grained analysis, click the "Analysis" button.

Contact Kabir (t-kabirahuja@microsoft.com) if you have any questions.
38 changes: 38 additions & 0 deletions SumEval/data/README.md
@@ -0,0 +1,38 @@
## Data for SumEval-2022 Shared-Task

The `train/` directory contains a `.json` file for each task and MMLM (XLM-R and T-ULRv6). Each file consists of a list of dictionaries, where each dictionary is of the form:

```json
{
    "train_config" : {
        "pivot_lang1" : "<Amount of Training Data in pivot_lang1>",
        "pivot_lang2" : "<Amount of Training Data in pivot_lang2>",
        ...
    },

    "eval_results" : {
        "tgt_lang1" : "<Performance on tgt_lang1>",
        "tgt_lang2" : "<Performance on tgt_lang2>",
        ...
    }
}
```

Each dictionary contains two data structures, stored under the keys `"train_config"` and `"eval_results"`. The value of `"train_config"` gives the amount of data used in each language (called pivot languages) to fine-tune the MMLM, and `"eval_results"` gives the performance of the corresponding model on a set of target languages. Note that the sets of pivot and target languages may or may not be the same.

The task is to use this data to train a regression model which, given the amount of data used in different languages to fine-tune a given MMLM and a target language, predicts the performance of the fine-tuned model on that target language. For more details, refer to the Shared-Task [page](https://www.microsoft.com/en-us/research/event/sumeval-2022/shared-task/).
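
As a rough illustration only, the sketch below frames one training file as a regression problem with two very simple hand-crafted features (total fine-tuning data and data in the target language). The feature choice, the use of scikit-learn, and the file path are assumptions for the example, not a prescribed baseline.

```python
import json
from sklearn.ensemble import GradientBoostingRegressor

# Load one training file (path is an example; adjust to where the data lives).
with open("train/XNLI_XLMR.json") as f:
    records = json.load(f)

X, y = [], []
for rec in records:
    config = {lang: float(size) for lang, size in rec["train_config"].items()}
    total = sum(config.values())
    for lang, score in rec["eval_results"].items():
        # Two toy features: total fine-tuning data and data in the target language.
        X.append([total, config.get(lang, 0.0)])
        y.append(float(score))

model = GradientBoostingRegressor().fit(X, y)

# Predict performance for a hypothetical configuration and target language.
new_config = {"en": 10000.0, "hi": 5000.0}
print(model.predict([[sum(new_config.values()), new_config.get("sw", 0.0)]]))
```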


The description of the performance data files is given below:
- `XNLI_XLMR.json`: Performance of [XLM-R](https://arxiv.org/abs/1911.02116) (large variant) based classifiers fine-tuned for the [XNLI](https://arxiv.org/abs/1809.05053) task on different amounts of training data in different languages and evaluated on the 15 languages supported by XNLI.
- `XNLI_TULRv6Large.json`: Performance of [T-ULRv6](https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv5-tops-xtreme-leaderboard-and-trains-100x-faster/) based models on the XNLI data.
- `WikiANN_XLMR.json`: Performance of XLM-R (large variant) based classifiers fine-tuned and evaluated on different languages for the [WikiANN](https://aclanthology.org/P17-1178/) dataset.
- `UDPOS_XLMR.json`: Performance of XLM-R (large variant) based classifiers fine-tuned and evaluated on different languages for the [UDPOS](https://universaldependencies.org/) dataset for part-of-speech tagging.
- `TyDiQA_XLMR_ID.json`: Performance of XLM-R based models fine-tuned on the [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) dataset and also evaluated on TyDiQA-GoldP (ID for in-distribution).
- `TyDiQA_XLMR_OOD.json`: Performance of XLM-R based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and **evaluated on [XQuAD](https://arxiv.org/abs/1910.11856) (OOD for out-of-distribution)**.
- `TyDiQA_TULRv6Large_ID.json`: Performance of T-ULRv6 based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and also evaluated on TyDiQA-GoldP (ID for in-distribution).
- `TyDiQA_TULRv6Large_OOD.json`: Performance of T-ULRv6 based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and **evaluated on [XQuAD](https://arxiv.org/abs/1910.11856)** (OOD for out-of-distribution).

As a baseline, you may use the [LITMUS tool](https://github.com/microsoft/Litmus) and build your solutions on top of it. Note that the repo currently supports only mBERT and XLM-R as pre-trained multilingual models; support for T-ULR models will be released soon.