rzanoli edited this page May 26, 2015 · 31 revisions

A number of annotated data sets can be used with the EOP. Some of them are distributed as part of the EOP itself (e.g., the RTE-3 English data set), while others have to be downloaded separately because of their restrictive licences. All the data sets are distributed in an RTE-3 style format that is fully compatible with the EOP, as in the following example:

<?xml version="1.0" encoding="UTF-8"?>
<entailment-corpus lang="EN">
  <pair id="1" entailment="ENTAILMENT" task="IE" >
    <t>The sale was made to pay Yukos' US$ 27.5 billion tax bill, Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known company Baikalfinansgroup which was later bought by the Russian state-owned oil company Rosneft .</t>
    <h>Baikalfinansgroup was sold to Rosneft.</h>
  </pair>
  ................
  ................
  <pair id="800" entailment="ENTAILMENT" task="SUM" >
    <t>US Steel could even have a technical advantage over Nucor since one new method of steel making it is considering, thin strip casting, may produce higher quality steel than the Nucor thin slab technique.</t>
    <h>US Steel may invest in strip casting.</h>
  </pair>
</entailment-corpus>
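
As a rough illustration of the format above, the following Python sketch reads a corpus in this style with the standard library's `xml.etree.ElementTree`. The `Pair` type and the `parse_rte_corpus` function are hypothetical names introduced only for this example and are not part of the EOP; the sample text is abbreviated from the first pair shown above.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One T-H pair; field names are illustrative, not an EOP type."""
    pair_id: str
    entailment: str
    task: str
    text: str
    hypothesis: str


def parse_rte_corpus(xml_text: str) -> List[Pair]:
    """Collect every <pair> element of an RTE-3 style corpus."""
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.iter("pair"):
        pairs.append(Pair(
            pair_id=node.get("id", ""),
            entailment=node.get("entailment", ""),
            task=node.get("task", ""),
            # <t> holds the text, <h> the hypothesis
            text=(node.findtext("t") or "").strip(),
            hypothesis=(node.findtext("h") or "").strip(),
        ))
    return pairs


# Abbreviated sample; the XML declaration is omitted for brevity.
SAMPLE = """<entailment-corpus lang="EN">
  <pair id="1" entailment="ENTAILMENT" task="IE" >
    <t>Yuganskneftegaz was sold to Baikalfinansgroup, which was later bought by Rosneft.</t>
    <h>Baikalfinansgroup was sold to Rosneft.</h>
  </pair>
</entailment-corpus>"""

if __name__ == "__main__":
    for pair in parse_rte_corpus(SAMPLE):
        print(pair.pair_id, pair.entailment, pair.task, pair.hypothesis)
```

For a file on disk, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`.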

Data Sets distributed with the EOP

  • RTE-3 for English, German and Italian
  • RTE-3 for Bulgarian (thanks to Iliana Simova and the BulTreeBank team)

Data Sets that have to be downloaded separately

The EXCITEMENT data sets (Kotlerman et al, 2015) contain negative customer feedback in which customers state their reasons for dissatisfaction with a given company. The data sets are available for English and Italian. The original data set does not provide an explicit division between training and test data. To show how different RTE systems perform under different conditions (e.g., one system might perform better on balanced data sets, while another might be insensitive to balance), and to let EOP users compare their results on the same data, 4 data sets are provided for English and 2 for Italian. All of them are derived from the Kotlerman et al. data set, and each consists of a training set for training the system and a test set for evaluating it. The splits differ along two orthogonal dimensions: balanced-unbalanced and mixed-pure. Balanced-unbalanced indicates whether the data set contains a comparable number of positive and negative examples (balanced) or not (unbalanced). Mixed-pure indicates whether the T-H pairs on a specific topic (e.g., food, Internet) are equally distributed between the training and test sets (mixed) or appear only in the training set or only in the test set (pure). The split data sets used for evaluating the EOP can be downloaded here:

  • English data sets (balanced-mix, balanced-pure, unbalanced-mix, unbalanced-pure). download
  • Italian data sets (balanced-mix, balanced-pure). download

The original data set from which the splits were produced is available at the link below:

  • EXCITEMENT data sets (Kotlerman et al, 2015). download
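
The two dimensions along which the EXCITEMENT splits differ can be expressed as simple checks. The Python sketch below is only an illustration under stated assumptions: the function names, the 0.1 tolerance used to decide what counts as "comparable", and the treatment of any label other than `ENTAILMENT` as negative are choices made for this example, and the way topics are recorded in the actual files is not specified here.

```python
from collections import Counter
from typing import Iterable, Sequence


def positive_fraction(labels: Sequence[str], positive: str = "ENTAILMENT") -> float:
    """Fraction of examples carrying the positive label."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels)[positive] / len(labels)


def is_balanced(labels: Sequence[str], positive: str = "ENTAILMENT",
                tolerance: float = 0.1) -> bool:
    """Balanced: positives and negatives are comparable in number.

    The tolerance is an assumed threshold, not a value from the EOP.
    """
    return abs(positive_fraction(labels, positive) - 0.5) <= tolerance


def is_pure(train_topics: Iterable[str], test_topics: Iterable[str]) -> bool:
    """Pure: no topic occurs in both the training and the test set."""
    return set(train_topics).isdisjoint(test_topics)


if __name__ == "__main__":
    labels = ["ENTAILMENT", "NONENTAILMENT"] * 5
    print("balanced:", is_balanced(labels))
    print("pure:", is_pure(["food"], ["Internet"]))
```

A split that fails `is_pure` shares at least one topic between training and test data, which corresponds to the mixed setting described above.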

SICK (Marelli et al, 2014) is the data set that was used at SemEval-2014 for the two subtasks of (i) Relatedness (i.e., predicting the degree of semantic similarity between two sentences), and (ii) Entailment (i.e., detecting the entailment relation holding between two sentences). It consists of 9,840 English sentence pairs, built from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. download

The OMQ data set is an RTE-style data set semi-automatically created from manually categorized German customer requests. download

References

Kotlerman, Lili, Ido Dagan, Bernardo Magnini, and Luisa Bentivogli. 2015. Textual entailment graphs. Journal of Natural Language Engineering, Special Issue on Graphs for NLP, 12.

Marelli, Marco, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, May 26-31, 2014, pages 216-223.
