Added sumeval stuff #2

Merged · 5 commits · Aug 1, 2022
150 changes: 150 additions & 0 deletions SumEval/.gitignore
@@ -0,0 +1,150 @@
Baselines/*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class


# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
lang2vec/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.patch
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm/vim files
.idea
*.swp

src/averaging_baseline.py
Baselines/combined_preds/*.json
data/combined_gold/*.json
67 changes: 67 additions & 0 deletions SumEval/README.md
@@ -0,0 +1,67 @@
# SumEval

Helper Code and Datasets for the 1st Workshop on Scaling Up Multilingual Evaluation (SUMEval).

## Submission Instructions
The test files containing the training configurations and languages for which predictions are to be made are located in [`data/test_release/`](data/test_release/).

We have three types of test files (in most cases) for every dataset-model pair:
- Test sets containing new configurations but the same languages as seen in the training data. These have no suffix, e.g. [`data/test_release/XNLI_XLMR.json`](data/test_release/XNLI_XLMR.json)
- Test sets containing new (surprise) languages but the same configurations as seen during training. These carry the suffix `_surprise_langs_same_configs`, e.g. [`data/test_release/XNLI_XLMR_surprise_langs_same_configs.json`](data/test_release/XNLI_XLMR_surprise_langs_same_configs.json)
- Test sets containing surprise languages as well as new configurations. These carry the suffix `_surprise_langs_diff_configs`, e.g. [`data/test_release/XNLI_XLMR_surprise_langs_diff_configs.json`](data/test_release/XNLI_XLMR_surprise_langs_diff_configs.json)

All the test files are of the following format:

```json
[
    {
        "train_config": {
            "<train_lang_1>": "<Size(train_lang_1)>",
            ...
            "<train_lang_n>": "<Size(train_lang_n)>"
        },
        "eval_results": {
            "<eval_lang_1>": "x",
            ...
            "<eval_lang_m>": "x"
        }
    }
]
```

We ask participants to predict the `"x"` values in these files by training predictor models on the training data and replacing each `"x"` with the predicted value. For instance, one can generate predictions using the LITMUS predictor baseline by running:

```bash
python -m src.sum_eval_predict --data_dir ./data --out_dir ./Baselines
```

This will generate predictions for each file in the `./Baselines/preds` directory.
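
If you prefer to build your own predictor instead of the LITMUS baseline, a minimal sketch of filling in the `"x"` values for one test file could look like the following. The `my_predictor` function and the constant score it returns are placeholders, not part of this repository; only the file locations follow the layout described above.

```python
import json

def my_predictor(train_config, target_lang):
    # Placeholder predictor: returns a constant score.
    # Replace this with a regression model trained on the released training data.
    return 50.0

# Load one of the released test files.
with open("data/test_release/XNLI_XLMR.json") as f:
    test_data = json.load(f)

for entry in test_data:
    train_config = entry["train_config"]  # {pivot language: amount of training data}
    for lang in entry["eval_results"]:
        # Replace the "x" placeholder with the predicted performance.
        entry["eval_results"][lang] = my_predictor(train_config, lang)

# Write the filled-in file to the directory read by the combination script below.
with open("Baselines/preds/XNLI_XLMR.json", "w") as f:
    json.dump(test_data, f, indent=2)
```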

Once the predictions are generated, they can be combined into a format compatible with [Explainaboard](https://explainaboard.inspiredco.ai/) by running:

```bash
python src/combine_predictions.py --pred_dir Baselines/preds --out_dir Baselines/combined_pred --value_name predicted_value
```

This will generate two files, `Baselines/combined_pred/sumeval_test.json` and `Baselines/combined_pred/sumeval_surprise.json`, which can be uploaded to Explainaboard for evaluation. The former combines predictions for the test files without any surprise languages, while the latter combines predictions for the test files with surprise languages (both the same-configs and diff-configs versions).

### Explainaboard Submission Instructions

The two files generated above should be uploaded to Explainaboard (as separate submissions) by following the steps below:
1. Visit the Explainaboard [link](https://explainaboard.inspiredco.ai/) and sign up or log in
2. Go to "Systems" and click "New"
3. Under Task, select "tabular-regression" and select "sumeval2022" as the dataset
4. Select "test" as the split when submitting `sumeval_test.json` and "surprise" for `sumeval_surprise.json`
5. Select "RMSE" and "AbsoluteError" as the metrics
6. Upload `sumeval_test.json` or `sumeval_surprise.json` (according to the split selected in step 4) by clicking on "Select File"
7. For your final submissions that you want to be considered, uncheck "Make it private?"

This will upload your submissions to Explainaboard, where they should appear [here](https://explainaboard.inspiredco.ai/leaderboards?dataset=sumeval2022). The AbsoluteError and RMSE columns show the average errors across all the test sets. For a fine-grained analysis, click the "Analysis" button.

Contact Kabir (t-kabirahuja@microsoft.com) if you have any questions.
38 changes: 38 additions & 0 deletions SumEval/data/README.md
@@ -0,0 +1,38 @@
## Data for SumEval-2022 Shared-Task

The `train/` directory contains a `.json` file for each task and MMLM (XLM-R and T-ULRv6). Each file consists of a list of dictionaries, where each dictionary is of the form:

```json
{
    "train_config" : {
        "pivot_lang1" : "<Amount of Training Data in pivot_lang1>",
        "pivot_lang2" : "<Amount of Training Data in pivot_lang2>",
        ...
    },

    "eval_results" : {
        "tgt_lang1" : "<Performance on tgt_lang1>",
        "tgt_lang2" : "<Performance on tgt_lang2>",
        ...
    }
}
```

Each dictionary contains two data structures, stored under the keys `"train_config"` and `"eval_results"`. The value of `"train_config"` gives the amount of data used in each language (called pivot languages) to fine-tune the MMLM, and `"eval_results"` gives the performance of the corresponding model on a set of target languages. Note that the sets of pivot and target languages may or may not be the same.

The task is to use this data to train a regression model which, given the amount of data used in different languages to fine-tune a given MMLM and a target language, predicts the performance of the fine-tuned model on that target language. For more details, refer to the Shared-Task [page](https://www.microsoft.com/en-us/research/event/sumeval-2022/shared-task/).
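
As a rough illustration only, the sketch below frames one training file as a regression problem with two very simple hand-crafted features (total fine-tuning data and data in the target language). The feature choice, the use of scikit-learn, and the file path are assumptions for the example, not a prescribed baseline.

```python
import json
from sklearn.ensemble import GradientBoostingRegressor

# Load one training file (path is an example; adjust to where the data lives).
with open("train/XNLI_XLMR.json") as f:
    records = json.load(f)

X, y = [], []
for rec in records:
    config = {lang: float(size) for lang, size in rec["train_config"].items()}
    total = sum(config.values())
    for lang, score in rec["eval_results"].items():
        # Two toy features: total fine-tuning data and data in the target language.
        X.append([total, config.get(lang, 0.0)])
        y.append(float(score))

model = GradientBoostingRegressor().fit(X, y)

# Predict performance for a hypothetical configuration and target language.
new_config = {"en": 10000.0, "hi": 5000.0}
print(model.predict([[sum(new_config.values()), new_config.get("sw", 0.0)]]))
```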


The description of the performance data files is given below:
- `XNLI_XLMR.json`: Performance of [XLM-R](https://arxiv.org/abs/1911.02116) (large variant) based classifiers fine-tuned for the [XNLI](https://arxiv.org/abs/1809.05053) task on different amounts of training data in different languages and evaluated on the 15 languages supported by XNLI.
- `XNLI_TULRv6Large.json`: Performance of [T-ULRv6](https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv5-tops-xtreme-leaderboard-and-trains-100x-faster/) based models on the XNLI data.
- `WikiANN_XLMR.json`: Performance of XLM-R (large variant) based classifiers fine-tuned and evaluated on different languages for the [WikiANN](https://aclanthology.org/P17-1178/) dataset.
- `UDPOS_XLMR.json`: Performance of XLM-R (large variant) based classifiers fine-tuned and evaluated on different languages for the [UDPOS](https://universaldependencies.org/) dataset for part-of-speech tagging.
- `TyDiQA_XLMR_ID.json`: Performance of XLM-R based models fine-tuned on the [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) dataset and also evaluated on TyDiQA-GoldP (ID for in-distribution).
- `TyDiQA_XLMR_OOD.json`: Performance of XLM-R based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and **evaluated on [XQuAD](https://arxiv.org/abs/1910.11856) (OOD for out-of-distribution)**.
- `TyDiQA_TULRv6Large_ID.json`: Performance of T-ULRv6 based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and also evaluated on TyDiQA-GoldP (ID for in-distribution).
- `TyDiQA_TULRv6Large_OOD.json`: Performance of T-ULRv6 based models fine-tuned on [TyDiQA-GoldP](https://arxiv.org/abs/2003.05002) and **evaluated on [XQuAD](https://arxiv.org/abs/1910.11856)** (OOD for out-of-distribution).

As a baseline, you may use the [LITMUS tool](https://github.com/microsoft/Litmus) and build your solutions on top of it. Note that the repo currently supports only mBERT and XLM-R as pre-trained multilingual models; support for T-ULR models will be released soon.