phosseini/CREST

CREST: A Causal Relation Schema for Text 🚀

CREST is a machine-readable format/schema created to help researchers working on causal/counterfactual relation extraction and commonsense causal reasoning use the scattered data resources on these topics more easily. CREST-formatted data are stored as a pandas DataFrame.

How to convert dataset(s) to CREST:

  • Clone this repository and go to the /CREST directory.
  • Install the requirements: pip install -r requirements.txt
  • Download spaCy's model: python -m spacy download en_core_web_sm
  • Run /crest/convert.py:
    • python convert.py -i: prints the full list of currently supported datasets
    • python convert.py [DATASET_ID_0] ... [DATASET_ID_n] [OUTPUT_FILE_NAME]
      • DATASET_ID_*: id of a dataset.
      • OUTPUT_FILE_NAME: name of the output file, which should be in .xlsx format.
  • Examples:
    • Converting datasets 1 and 2: python convert.py 1 2 output.xlsx
    • Converting dataset 5: python convert.py 5 output.xlsx
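
When the conversion is driven from a script rather than typed by hand, the command line described above can be assembled and checked first. The following is only a sketch: `build_convert_command` and `SUPPORTED_IDS` are hypothetical names, and the id range 1 to 13 is an assumption taken from the table of resources in this README (`python convert.py -i` prints the authoritative list).

```python
import sys

# Assumption: ids 1-13, as listed in this README's table of converted resources.
SUPPORTED_IDS = range(1, 14)


def build_convert_command(dataset_ids, output_file):
    """Return the argv list for `python convert.py ID ... OUTPUT.xlsx`."""
    if not dataset_ids:
        raise ValueError("at least one dataset id is required")
    for dataset_id in dataset_ids:
        if dataset_id not in SUPPORTED_IDS:
            raise ValueError(f"unknown dataset id: {dataset_id}")
    if not output_file.endswith(".xlsx"):
        raise ValueError("output file name should end with .xlsx")
    # sys.executable is the Python interpreter running this script.
    return [sys.executable, "convert.py", *map(str, dataset_ids), output_file]


cmd = build_convert_command([1, 2], "output.xlsx")
print(" ".join(cmd[1:]))  # convert.py 1 2 output.xlsx
```

The resulting list can be passed to `subprocess.run(cmd, cwd=..., check=True)`, with `cwd` pointing at the directory that contains convert.py.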

The Excel file of all converted datasets: crest_v2.xlsx

  • PDTB is not included in this file due to copyright restrictions. However, you can still use CREST to convert this dataset if you have access to PDTB.

CREST format

Each relation in a CREST-formatted DataFrame has the following fields/values:

  • original_id: the id of a relation in the original dataset, if such an id exists.
  • span1: a list of strings of the first span/argument of the relation.
  • span2: a list of strings of the second span/argument of the relation.
  • signal: a list of strings of signals/markers of the relation in context, if any.
  • context: a text string of the context in which the relation appears.
  • idx: indices of the span1, span2, and signal tokens/spans in context, stored as three lines, each of the form span_type start_1:end_1 ... start_n:end_n. For example, if span1 has two tokens/spans with start:end indices 2:5 and 10:13, span1's line in idx is span1 2:5 10:13. Indices are sorted by the start index of each token/span.
  • label: label of the relation, 0: non-causal, 1: causal
  • direction: direction of the relation between span1 and span2. 0: span1 => span2, 1: span1 <= span2, -1: not specified
  • source: id of the source dataset (ids are listed in a table below)
  • split: 0: train, 1: dev, 2: test. This is the split to which the relation belongs in the original dataset. If there is no split specified for a relation in the original dataset, we assign the relation to the train split by default.

Note: The reason we save a list of strings instead of a single string for span1, span2, and signal is that these text spans may contain multiple non-consecutive sub-spans in context.
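
To make the idx layout concrete, here is a minimal sketch of a parser for it. It is not code from this repository: `parse_idx` and `extract_text` are hypothetical helpers, and the sketch assumes that the indices are character offsets into context with an exclusive end (as in Python slices) and that a span type with no occurrences appears as a line with no start:end pairs.

```python
def parse_idx(idx):
    """Parse a CREST `idx` value into {span_type: [(start, end), ...]}."""
    spans = {"span1": [], "span2": [], "signal": []}
    for line in idx.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        span_type, pairs = parts[0], parts[1:]
        # Each pair looks like "2:5"; keep pairs sorted by start index.
        spans[span_type] = sorted(
            tuple(int(n) for n in pair.split(":")) for pair in pairs
        )
    return spans


def extract_text(context, offsets):
    """Return the sub-strings of `context` for a list of (start, end) pairs.

    Assumes character offsets with an exclusive end, like Python slices.
    """
    return [context[start:end] for start, end in offsets]


context = "The storm caused severe flooding."
spans = parse_idx("span1 4:9\nspan2 24:32\nsignal 10:16")
print(extract_text(context, spans["span1"]))   # ['storm']
print(extract_text(context, spans["span2"]))   # ['flooding']
print(extract_text(context, spans["signal"]))  # ['caused']
```

Returning a list of strings per span type mirrors the note above: a span with several non-consecutive sub-spans simply yields several list entries.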

Available Data Resources

List of data resources already converted to CREST format:

| Id | Data resource | Samples | Causal | Non-causal | Document | Year |
|----|---------------|---------|--------|------------|----------|------|
| 1 | SemEval 2007 Task 4 | 1,529 | 114 | 1,415 | Paper | 2007 |
| 2 | SemEval 2010 Task 8 | 10,717 | 1,331 | 9,386 | Paper | 2010 |
| 3 | EventCausality | 583 | 583 | - | Paper | 2011 |
| 4 | Causal-TimeBank | 318 | 318 | - | Paper | 2014 |
| 5 | EventStoryLine v1.5 | 2,608 | 2,608 | - | Paper | 2016 |
| 6 | CaTeRS | 2,502 | 308 | 2,194 | Paper | 2016 |
| 7 | BECauSE v2.1 ⚠️ | 729 | 554 | 175 | Paper | 2017 |
| 8 | Choice of Plausible Alternatives (COPA) | 2,000 | 1,000 | 1,000 | Paper | 2011 |
| 9 | The Penn Discourse Treebank (PDTB) 3.0 ⚠️ | 7,991 | 7,991 | - | Manual | 2019 |
| 10 | BioCause Corpus | 844 | 844 | - | Paper | 2013 |
| 11 | Temporal and Causal Reasoning (TCR) | 172 | 172 | - | Paper | 2018 |
| 12 | Benchmark Corpus for Adverse Drug Effects | 5,671 | 5,671 | - | Paper | 2012 |
| 13 | SemEval 2020 Task 5 :atom: | 5,501 | 5,501 | - | Paper | 2020 |

⚠️ The data is either not publicly available or only partially available. You can still use CREST for conversion if you have full access to this dataset.

:atom:  Counterfactual Relations
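
The integer codes used in the label, direction, split, and source fields can be turned back into readable names with plain lookup tables. This is an illustrative sketch, not part of the CREST package: the dictionaries copy the codes and ids documented in this README, and `describe_relation` is a hypothetical helper that accepts any mapping with those four keys (a dict, or a row of a DataFrame).

```python
LABELS = {0: "non-causal", 1: "causal"}
DIRECTIONS = {0: "span1 => span2", 1: "span1 <= span2", -1: "not specified"}
SPLITS = {0: "train", 1: "dev", 2: "test"}
SOURCES = {
    1: "SemEval 2007 Task 4",
    2: "SemEval 2010 Task 8",
    3: "EventCausality",
    4: "Causal-TimeBank",
    5: "EventStoryLine v1.5",
    6: "CaTeRS",
    7: "BECauSE v2.1",
    8: "Choice of Plausible Alternatives (COPA)",
    9: "The Penn Discourse Treebank (PDTB) 3.0",
    10: "BioCause Corpus",
    11: "Temporal and Causal Reasoning (TCR)",
    12: "Benchmark Corpus for Adverse Drug Effects",
    13: "SemEval 2020 Task 5",
}


def describe_relation(row):
    """Return readable names for the coded fields of one relation."""
    # int() normalizes values that arrive as NumPy integers or floats.
    return {
        "label": LABELS[int(row["label"])],
        "direction": DIRECTIONS[int(row["direction"])],
        "split": SPLITS[int(row["split"])],
        "source": SOURCES[int(row["source"])],
    }


print(describe_relation({"label": 1, "direction": 0, "split": 2, "source": 6}))
```

With a DataFrame `df` of CREST relations, the same helper could be applied row by row, for example `df.apply(describe_relation, axis=1)`.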

CREST conversion

We provide helper methods to convert CREST-formatted data to popular formats and annotation schemes, mainly those used across relation extraction/classification tasks. Formats for which we have already developed CREST converter methods:

  • brat: we provide helper methods for two-way conversion between CREST DataFrames and brat (see example here). brat is a popular web-based annotation tool that has been used for a variety of relation extraction tasks in NLP. We use brat for two main reasons: 1) better visualization of causal and non-causal relations and their arguments, and 2) modifying annotations if needed and adding new annotations to the provided context. Below is a sample of a CREST-formatted relation converted to brat (the example is taken from the CaTeRS dataset):

    [Image: a CaTeRS relation rendered in brat]

  • TACRED: TACRED is a large-scale relation extraction dataset. We convert samples from CREST to TACRED because TACRED-formatted data can easily be used as input to many transformer-based language models (e.g., for relation classification/extraction). You can find an example of converting CREST-formatted data to TACRED in this notebook.
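
To give a feel for what the brat side of the conversion involves, the sketch below writes one relation in brat's standoff (.ann) notation. It is not the repository's converter. `to_brat_ann` is a hypothetical function, the type names (Span1, Span2, Signal, Causal, NonCausal) are made up for illustration, and the sketch assumes character offsets with an exclusive end, that span1 and span2 are always present, and that direction 1 means span2 should become Arg1.

```python
def to_brat_ann(context, spans, label, direction):
    """Build brat standoff (.ann) lines for one relation.

    `spans` maps "span1", "span2", "signal" to lists of (start, end)
    character offsets into `context` (assumed, end exclusive).
    """
    lines, ids = [], {}
    for name in ("span1", "span2", "signal"):
        offsets = spans.get(name) or []
        if not offsets:
            continue  # e.g. a relation without an explicit signal
        ids[name] = f"T{len(ids) + 1}"
        # brat writes discontinuous spans as "s1 e1;s2 e2".
        offset_str = ";".join(f"{s} {e}" for s, e in offsets)
        text = " ".join(context[s:e] for s, e in offsets)
        lines.append(f"{ids[name]}\t{name.capitalize()} {offset_str}\t{text}")
    # Assumption: the relation's Arg1 is the span the arrow starts from.
    arg1, arg2 = ("span2", "span1") if direction == 1 else ("span1", "span2")
    rel_type = "Causal" if label == 1 else "NonCausal"
    lines.append(f"R1\t{rel_type} Arg1:{ids[arg1]} Arg2:{ids[arg2]}")
    return lines


context = "The storm caused severe flooding."
spans = {"span1": [(4, 9)], "span2": [(24, 32)], "signal": [(10, 16)]}
print("\n".join(to_brat_ann(context, spans, label=1, direction=0)))
```

In brat, these lines would go into an .ann file stored next to a .txt file holding the context, since the offsets refer to that text.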

How you can contribute:

  • Are there any related datasets you don’t see in the list? Let us know, or feel free to submit a Pull Request (PR). We actively check PRs and appreciate them ☺️
  • Is there a well-known or widely used machine-readable format you think should be added? We can add the helper methods for conversion, and PRs are welcome.

How to cite CREST?

For now, please cite our arXiv paper:

@article{hosseini2021predicting,
  title={Predicting Directionality in Causal Relations in Text},
  author={Hosseini, Pedram and Broniatowski, David A and Diab, Mona},
  journal={arXiv preprint arXiv:2103.13606},
  year={2021}
}