A dataset for studying human evaluation of Machine Translation (MT). It contains a sample of 100 machine-translated sentences from the MTC4 Chinese-English corpus, together with the corresponding source sentences and four reference translations. The MT outputs were judged by several groups of annotators, each group using a different reference translation. The data was used to analyze the effect of reference bias on monolingual MT evaluation. See the following paper for a detailed description of the human annotation and the experimental results:
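One way to picture the layout described above is a record per evaluated segment: source, MT output, the four references, and a score per annotator group. This is only an illustrative sketch, not the actual release format; all field names and the toy strings below are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    """One of the 100 evaluated segments (illustrative structure only)."""
    source: str                # Chinese source sentence
    mt_output: str             # machine translation being judged
    references: List[str]      # four reference translations from MTC4
    scores: Dict[str, float]   # annotator-group id -> judgement score

def mean_score(seg: Segment) -> float:
    """Average judgement over all annotator groups for one segment."""
    return sum(seg.scores.values()) / len(seg.scores)

# Toy example with fabricated strings, for illustration only.
seg = Segment(
    source="<zh source>",
    mt_output="<mt output>",
    references=["<ref 1>", "<ref 2>", "<ref 3>", "<ref 4>"],
    scores={"group_ref_1": 4.0, "group_ref_2": 3.0},
)
print(mean_score(seg))  # 3.5
```

Keeping the group identifier alongside each score makes it easy to study how judgements shift with the reference shown to the annotators, which is the question the paper investigates.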
Marina Fomicheva and Lucia Specia (2016). Reference Bias in Monolingual Machine Translation Evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
The small web tool we used to collect the judgements is also available here.