Universal Semantic Tagging (UST; Bjerva et al., 2016) aims to provide a lightweight, unified word-level analysis for all languages. This moderately sized corpus provides high-quality manual UST annotations for English and Chinese.
The data consist of 1100 English–Chinese parallel sentences from the Wall Street Journal (WSJ) section of the Penn Treebank (PTB; Marcus et al., 1993) and 1000 sentences from the Chinese Treebank (CTB; Xue et al., 2005). The Chinese counterparts of the original English WSJ sentences were translated literally by English–Chinese bilinguals.
The first column contains the tokens; the second contains POS tags automatically predicted by the Stanford CoreNLP tool; the last contains the manual UST annotations. The observed inter-annotator agreement is 92.9% for English and 91.2% for Chinese.
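As a rough illustration of how the three-column layout described above might be read, here is a minimal Python sketch. It assumes that columns are separated by whitespace and that sentences are separated by blank lines; neither detail is stated in this description, so adjust as needed for the actual files. The function name read_sentences and the sample tags (TAG_A, TAG_B, ...) are hypothetical placeholders, not real UST labels or corpus content.

```python
from typing import Iterable, List, Tuple

# One row of the corpus: (token, POS tag, UST tag).
Token = Tuple[str, str, str]


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group three-column lines into sentences.

    Assumption: columns are whitespace-separated and a blank line
    ends a sentence. The real files may use a different layout.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {raw!r}")
        current.append((fields[0], fields[1], fields[2]))
    # Keep the final sentence when the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout; the tags are placeholders.
sample = (
    "The\tDT\tTAG_A\n"
    "cat\tNN\tTAG_B\n"
    "sleeps\tVBZ\tTAG_C\n"
    ".\t.\tTAG_D\n"
    "\n"
    "It\tPRP\tTAG_E\n"
    "rains\tVBZ\tTAG_C\n"
    ".\t.\tTAG_D\n"
)

sentences = read_sentences(sample.splitlines())
for sentence in sentences:
    print(" ".join(f"{tok}/{pos}/{ust}" for tok, pos, ust in sentence))
```

To read an actual file, pass an open file handle instead of the sample, for example read_sentences(open(path, encoding="utf-8")), where path is whichever corpus file you want to load.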
Further details about this corpus can be found in the paper "Universal Semantic Tagging for English and Mandarin Chinese", which is to be published in the proceedings of NAACL 2021. A link to the article will be added once it is available.