We present the Corpus of Linguistic Acceptability in Chinese (CoLAC), the first large-scale acceptability dataset in a non-Indo-European language, handcrafted by linguists to evaluate the grammatical proficiency of language models. The dataset consists of 7,495 sentences collected from one syntax textbook, one linguistics handbook, and 68 linguistics journal articles, all verified by native speakers of Mandarin.
Every example sentence has two labels:
- label0 (linguist label): the judgement given by the linguist who proposed the example. Because the examples are drawn from one syntax textbook, one linguistics handbook, and 68 journal articles, these labels reflect the judgements of many different theoretical syntacticians, not a single annotator.
- label1 (crowd label): a label mapped from the mean ratings given by other native speakers of Mandarin Chinese. This label is used in all our experiments.
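As a rough illustration of the crowd-label mapping above, the sketch below thresholds mean native-speaker ratings into a binary label. The rating scale and the threshold value here are assumptions for illustration only, not CoLAC's documented procedure.

```python
# Hypothetical sketch: map mean acceptability ratings to a binary
# crowd label. The 1-4 rating scale and the 2.5 threshold are
# assumptions, not the actual CoLAC annotation protocol.

def crowd_label(ratings, threshold=2.5):
    """Return 1 (acceptable) if the mean rating reaches the
    threshold, else 0 (unacceptable)."""
    mean = sum(ratings) / len(ratings)
    return 1 if mean >= threshold else 0

print(crowd_label([4, 3, 4, 3]))  # high ratings -> 1
print(crowd_label([1, 2, 1, 1]))  # low ratings -> 0
```

In practice any such mapping must also decide how to treat borderline means near the threshold; see the paper for how CoLAC actually aggregates ratings.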
Statistics of CoLAC:
We ran several baselines, including XLM-R, Chinese RoBERTa, variants of InstructGPT, ChatGPT, and mTk. Results are shown below.
For details of the experiments, see our paper. If you use CoLAC, please cite:
```bibtex
@misc{hu2023revisiting,
  title={Revisiting Acceptability Judgements},
  author={Hai Hu and Ziyin Zhang and Weifang Huang and Jackie Yan-Ki Lai and Aini Li and Yina Patterson and Jiahui Huang and Peng Zhang and Chien-Jer Charles Lin and Rui Wang},
  year={2023},
  eprint={2305.14091},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2305.14091}
}
```