## Reading Causal Relations Corpora
By: Pedram Hosseini (phosseini@gwu.edu)

Several causal relation corpora have been created, with different levels of granularity and different annotation schemes. These efforts, while admirable, are fairly scattered, which makes it hard for people in the NLP community to use the resulting resources. As a step toward alleviating this problem, we wrote methods in a **Converter** class that convert all of these resources into a simple, unified format so that anyone can easily use the samples. We try to keep most of the annotations from the original sources in the new schema so that as little information as possible is lost during conversion. The data sets currently covered in our collection are:

- **SemEval 2007 task 4** - Public (source: **1**)
- **SemEval 2010 task 8** - Public (source: **2**)
- **EventCausality** - Public (source: **3**)
- **Causal-TimeBank** - Not public (source: **4**)
- **EventStoryLine (v0.9, v1.0, v1.5)** - Public (source: **5**)
- **CaTeRS** - Public (source: **6**)
- **BECAUSE v2.1** - Public (source: **7**)
- **Choice of Plausible Alternatives (COPA)** - Public (source: **8**)
- **Penn Discourse Treebank (PDTB) 3.0** - Not public (source: **9**)
- **BioCause** - Public (source: **10**)
- **Your data set?**

#### JOIN US
We invite everyone in the ML/NLP/NLU community, and groups of researchers who work on extracting causal/counterfactual relations from language, to contribute to this repository, and to **crest/converter.py** in particular, so that together we can improve the quality of available data resources and alleviate the data sparsity issue.
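
As a rough illustration of what a contribution might involve, the sketch below builds one record with the same column names that appear in the converted data frames in this notebook (`original_id`, `span1`, `span2`, `signal`, `context`, `idx`, `label`, `direction`, `source`, `ann_file`, `split`). The function names `find_span` and `make_crest_record`, their parameters, and the example values (including the source id 99) are hypothetical and are not part of **crest/converter.py**; the actual converter methods may be organized differently.

```python
def find_span(context, text):
    """Return [start, end] character offsets (end exclusive) of text in context."""
    start = context.find(text)
    if start < 0:
        raise ValueError("span not found in context: {!r}".format(text))
    return [start, start + len(text)]


def make_crest_record(original_id, context, span1, span2, label,
                      direction=0, source=99, signal=None, ann_file="", split=0):
    """Hypothetical helper: assemble one row in the notebook's column layout."""
    signal = signal or []
    return {
        "original_id": original_id,
        "span1": [span1],
        "span2": [span2],
        "signal": signal,
        "context": context,
        "idx": {
            "span1": [find_span(context, span1)],
            "span2": [find_span(context, span2)],
            "signal": [find_span(context, s) for s in signal],
        },
        "label": label,          # 1 = causal, 0 = non-causal (as counted in this notebook)
        "direction": direction,  # value semantics are defined by the converter, not here
        "source": source,        # 99 is a made-up id for a new data set
        "ann_file": ann_file,
        "split": split,          # 0 = train, 1 = dev, 2 = test (as counted in this notebook)
    }


# Toy sentence, not taken from any of the corpora
record = make_crest_record(1, "Heavy rains caused landslides.", "rains", "landslides", label=1)
print(record["idx"])
```

Computing the offsets from the span text, rather than typing them by hand, keeps `idx` consistent with `context` for simple cases; a span that occurs more than once in the context would need its position supplied explicitly.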

In [1]:
import os
import sys

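# "__file__" is a plain string here (notebooks do not define __file__), so
# root_path resolves to the parent of the current working directory.
root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)

from crest.converter import Converter
from crest.utils import crest2brat, min_avg_max

# Bind the instance without shadowing the crest.converter module
converter = Converter()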
total_samples = 0

## Convert all datasets

In [2]:
df, mis = converter.convert2crest(dataset_ids=[1, 2, 3, 4, 5, 6, 7, 8, 9], save_file=True)

print("samples: " + str(len(df)))
print("+ causal: {}".format(len(df.loc[df["label"] == 1])))
print("- non-causal: {}".format(len(df.loc[df["label"] == 0])))
print("train: {}".format(len(df.loc[df["split"] == 0])))
print("dev: {}".format(len(df.loc[df["split"] == 1])))
print("test: {}".format(len(df.loc[df["split"] == 2])))

samples: 28879
+ causal: 14709
- non-causal: 14170
train: 18798
dev: 1586
test: 4840
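
The three split counts above add up to 25,224, fewer than the 28,879 samples reported. The `split` column is empty in several of the previews further down (for example Causal-TimeBank, EventStoryLine, and BECAUSE), which would explain rows that fall into none of the three buckets. A minimal sketch of a tally that makes such rows visible, using plain Python records rather than the real data frame (the helper name `count_splits` and the toy rows are assumptions):

```python
import math
from collections import Counter

SPLIT_NAMES = {0: "train", 1: "dev", 2: "test"}  # codes as used in the prints above


def count_splits(rows):
    """Tally rows per split, counting missing values under 'unassigned'."""
    counts = Counter()
    for row in rows:
        split = row.get("split")
        missing = (
            split is None
            or split == ""
            or (isinstance(split, float) and math.isnan(split))
        )
        name = "unassigned" if missing else SPLIT_NAMES.get(int(split), "unknown")
        counts[name] += 1
    return counts


# Toy rows: two train, one dev, one test, two without a split
toy_rows = [{"split": 0}, {"split": 0}, {"split": 1}, {"split": 2},
            {"split": float("nan")}, {"split": None}]
print(dict(count_splits(toy_rows)))
```

On the data frame itself, `df["split"].value_counts(dropna=False)` should give a comparable breakdown.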


## SemEval 2007 task 4 

In [2]:
data, mis = converter.convert_semeval_2007_4()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

# crest2brat(data, '../data/crest_brat/1')

samples: 1529
mismatch: 0
+ causal: 114
- non-causal: 1415


In [3]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,1,[tumor shrinkage],[radiation therapy],[],The period of tumor shrinkage after radiation ...,"{'span1': [[14, 29]], 'span2': [[36, 53]], 'si...",1,1,1,,0
1,2,[Habitat degradation],[stream channels],[],Habitat degradation from within stream channel...,"{'span1': [[0, 19]], 'span2': [[32, 47]], 'sig...",0,1,1,,0
2,3,[discomfort],[traveling],[],Earplugs relieve the discomfort from traveling...,"{'span1': [[21, 31]], 'span2': [[37, 46]], 'si...",1,1,1,,0
3,4,[daily terror],[antipersonnel land mines],[],We continue to see progress toward a world fre...,"{'span1': [[55, 67]], 'span2': [[71, 95]], 'si...",1,1,1,,0
4,5,[segment],[anecdotes],[],The Global Warming segment starts off with two...,"{'span1': [[19, 26]], 'span2': [[53, 62]], 'si...",0,1,1,,0


In [4]:
min_avg_max(data)

Avg. length: 17.521909744931328
+++++++++++++++
min length: 3
min context: Trees grow seeds.
+++++++++++++++
max length: 82
max context: Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its growth from various roots, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers.
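
The lengths reported by `min_avg_max` look like counts of space-separated pieces of `context`: "Trees grow seeds." is reported as 3, and in later sections contexts that end in a space (such as "SEACOM downtime explained ", reported as 4) come out one higher than their visible word count. The sketch below reproduces that behaviour under this assumption; `length_stats` is a hypothetical stand-in, not the code in `crest.utils`.

```python
def length_stats(contexts):
    """Return (min, average, max) of the number of space-separated pieces."""
    # str.split(" ") keeps empty strings, so a trailing space adds one piece.
    lengths = [len(text.split(" ")) for text in contexts]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)


samples = ["Trees grow seeds.", "SEACOM downtime explained "]
print(length_stats(samples))  # (3, 3.5, 4)
```

If whitespace-insensitive token counts are preferred, `text.split()` without an argument ignores leading, trailing, and repeated whitespace.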


## SemEval 2010 task 8

In [5]:
data, mis = converter.convert_semeval_2010_8()
total_samples += len(data)
print("samples: {}".format(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

# crest2brat(data, '../data/crest_brat/2')

samples: 10717
mismatch: 0
+ causal: 1331
- non-causal: 9386


In [6]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,1,[configuration],[elements],[],The system as described above has its greatest...,"{'span1': [[73, 86]], 'span2': [[98, 106]], 's...",0,1,2,,0
1,2,[child],[cradle],[],The child was carefully wrapped and bound into...,"{'span1': [[4, 9]], 'span2': [[51, 57]], 'sign...",0,-1,2,,0
2,3,[author],[disassembler],[],The author of a keygen uses a disassembler to ...,"{'span1': [[4, 10]], 'span2': [[30, 42]], 'sig...",0,1,2,,0
3,4,[ridge],[surge],[],A misty ridge uprises from the surge.,"{'span1': [[8, 13]], 'span2': [[31, 36]], 'sig...",0,-1,2,,0
4,5,[student],[association],[],The student association is the voice of the un...,"{'span1': [[4, 11]], 'span2': [[12, 23]], 'sig...",0,0,2,,0
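
The `idx` column holds character offsets into `context`. Judging from the rows above, each offset pair is `[start, end]` with an exclusive end, so slicing the context recovers the span text. The sketch below checks this on the fourth row shown (index 3); the helper name `spans_from_idx` is hypothetical, and parsing `idx` with `ast.literal_eval` is only needed under the assumption that the column was read back from a saved file as a string.

```python
import ast


def spans_from_idx(context, idx):
    """Slice context with the [start, end) offsets stored under each key of idx."""
    if isinstance(idx, str):
        # Assumption: a saved file may store the dict as its string representation.
        idx = ast.literal_eval(idx)
    return {key: [context[start:end] for start, end in pairs]
            for key, pairs in idx.items()}


context = "A misty ridge uprises from the surge."
# The 'signal' entry is truncated in the preview; an empty list is assumed here,
# matching the empty signal column of that row.
idx = "{'span1': [[8, 13]], 'span2': [[31, 36]], 'signal': []}"
print(spans_from_idx(context, idx))
```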


In [7]:
min_avg_max(data)

Avg. length: 17.21246617523561
+++++++++++++++
min length: 3
min context: Trees grow seeds.
+++++++++++++++
max length: 85
max context: It was formerly known as How Park, possibly through the early connexion of William de Ow with the parish, and had its origin in the charter of 1200 granting William Briwere the elder chase of hare, fox, cat and wolf through all the king's land (per totam terram nostram) and warren of hares, pheasants and partridges throughout all his own lands, as also licence to inclose two coppices, one of which was situated between King's Somborne and Stockbridge and the other was called How Wood.


## EventCausality data set

In [8]:
data, mis = converter.convert_event_causality()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

crest2brat(data, '../data/crest_brat/3')

samples: 485
mismatch: 0
+ causal: 485
- non-causal: 0
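
`crest2brat` appears to export the converted samples to the given directory for inspection in brat. Its exact output is not shown in this notebook; as a rough picture of the brat standoff format itself, the sketch below renders one record as text-bound annotations (`T` lines) plus a relation (`R` line). The function `record_to_ann`, the type names `Span1`, `Span2`, and `Causal`, and the toy record are assumptions for illustration and may not match what `crest2brat` writes.

```python
def record_to_ann(record):
    """Render one record as brat standoff lines (type names are assumed)."""
    lines = []
    for n, key in enumerate(("span1", "span2"), start=1):
        start, end = record["idx"][key][0]
        text = record["context"][start:end]
        # Text-bound annotation: ID <TAB> Type start end <TAB> covered text
        lines.append("T{}\t{} {} {}\t{}".format(n, key.capitalize(), start, end, text))
    if record["label"] == 1:
        # Binary relation between the two text-bound annotations
        lines.append("R1\tCausal Arg1:T1 Arg2:T2")
    return "\n".join(lines)


# Toy record, not taken from any of the corpora
toy = {
    "context": "Heavy rains caused landslides.",
    "idx": {"span1": [[6, 11]], "span2": [[19, 29]], "signal": []},
    "label": 1,
}
print(record_to_ann(toy))
```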


In [9]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,C_6_4_6_102010.01.13.google.china.exit,[attacks],[conclude],[],"The company says the attacks "" have led us to ...","{'span1': [[21, 28]], 'span2': [[46, 54]], 'si...",1,0,3,2010.01.13.google.china.exit,1
1,C_6_10_6_142010.01.13.google.china.exit,[conclude],[review],[],"The company says the attacks "" have led us to ...","{'span1': [[46, 54]], 'span2': [[70, 76]], 'si...",1,0,3,2010.01.13.google.china.exit,1
2,C_12_5_12_182010.01.13.google.china.exit,[deliveries],[interpreted],[],A large number of flower deliveries were made ...,"{'span1': [[25, 35]], 'span2': [[105, 116]], '...",1,0,3,2010.01.13.google.china.exit,1
3,C_14_3_14_142010.01.13.google.china.exit,[leaves],[advancement],[],""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[57, 68]], 'si...",1,0,3,2010.01.13.google.china.exit,1
4,C_14_3_14_192010.01.13.google.china.exit,[leaves],[success],[],""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[87, 94]], 'si...",1,0,3,2010.01.13.google.china.exit,1


In [10]:
data.iloc[35].context

"A major earthquake struck southern Haiti on Tuesday , knocking down buildings and power lines and inflicting what its ambassador to the United States called a catastrophe for the Western Hemisphere 's poorest nation . "

In [11]:
min_avg_max(data)

Avg. length: 42.3979381443299
+++++++++++++++
min length: 13
min context: Those shortages are already the source of local and regional conflict . 
+++++++++++++++
max length: 202
max context: French police responded to reports of car theft in a town near Paris late Tuesday and a shootout ensued with a group of alleged thieves . Most of them escaped but police captured one and he was later identified as a suspected ETA member , said the spokeswoman , who by custom is not identified .  Spanish media reported that the shootout occurred in the town of Dammarie-les-Lys .  The dead French policeman was wearing a bullet-proof vest but bullets struck fatally elsewhere on his body .  He was reported to be in his 50s , and the father of four children .  ETA has traditionally used France as its rearguard logistics and planning base to prepare attacks across the border in Spain , officials say .  But in recent years as Spain has enlisted increased cooperation from France in cracking down on ETA hi

## Causal-TimeBank

In [12]:
data, mis = converter.convert_causal_timebank()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

crest2brat(data, '../data/crest_brat/4')

samples: 318
mismatch: 0
+ causal: 318
- non-causal: 0


In [13]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,10,[was],[thought],[So],"Not that long ago , before the Chinese takeove...","{'span1': [[82, 85]], 'span2': [[215, 222]], '...",1,0,4,ABC19980108.1830.0711.xml,
1,27,[downturn],[spending],[],"But in the past three months , stocks have plu...","{'span1': [[88, 96]], 'span2': [[139, 147]], '...",1,0,4,ABC19980108.1830.0711.xml,
2,36,[change],[reposition],[So],"I think that the mood is fairly gloomy , and I...","{'span1': [[72, 78]], 'span2': [[174, 184]], '...",1,0,4,ABC19980108.1830.0711.xml,
3,6,[rains],[landslides],[],Officials in California are warning residents ...,"{'span1': [[60, 65]], 'span2': [[105, 115]], '...",1,0,4,PRI19980213.2000.0313.xml,
4,22,[rains],[get],[],Forecasters say the picture will get worse bec...,"{'span1': [[56, 61]], 'span2': [[33, 36]], 'si...",1,0,4,PRI19980213.2000.0313.xml,


In [14]:
min_avg_max(data)

Avg. length: 32.79874213836478
+++++++++++++++
min length: 13
min context: Iraq said the roundup was to protect them from unspecified threats ; 
+++++++++++++++
max length: 107
max context: WASHINGTON _ Following are statements made Friday and Thursday by Lawrence Wechsler , a lawyer for the White House secretary , Betty Currie ; the White House ; White House spokesman Mike McCurry , and President Clinton in response to an article in The New York Times on Friday about her statements regarding a meeting with the president : Wechsler on Thursday " Without commenting on the allegations raised in this article , to the extent that there is any implication or suggestion that Mrs. Currie was aware of any legal or ethical impropriety by anyone , that implication or suggestion is entirely inaccurate . " 


## EventStoryLine

In [2]:
data, mis = converter.convert_eventstorylines_v1(version="1.5")
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal (PRECONDITION and FALLING_ACTION): {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

# crest2brat(data, '../data/crest_brat/5')

samples: 2608
mismatch: 0
+ causal (PRECONDITION and FALLING_ACTION): 2608
- non-causal: 0


In [3]:
data.head(2)

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,246682,[double murder],[killing],[],Cumbria double murder : Son suspected of killi...,"{'span1': [[8, 21]], 'span2': [[41, 48]], 'sig...",1,1,5,32_11ecbplus.xml.xml,
1,246683,[sectioned],[suicide attempt],[],"John Jenkin , 23 , had been sectioned after an...","{'span1': [[28, 37]], 'span2': [[56, 71]], 'si...",1,1,5,32_11ecbplus.xml.xml,


In [7]:
a = data.loc[data['context'] == 'SEACOM downtime explained']
b = data.loc[data['original_id'] == '245609']

In [8]:
b

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
1382,245609,[downtime],[explained],[],SEACOM downtime explained,"{'span1': [[7, 15]], 'span2': [[16, 25]], 'sig...",1,0,5,30_6ecbplus.xml.xml,


In [6]:
min_avg_max(data)

Avg. length: 43.20782208588957
+++++++++++++++
min length/id: {'len_min': 4, 'original_id': '245609'}
min context: SEACOM downtime explained 
+++++++++++++++
max length/id: {'len_max': 839, 'original_id': '241458'}
max context: The Athens protest march marking the zenith of the general strike called for the 5th of May was attended by an approximate 200 , 000 ( 20 , 000 which is the foreign broadcast number referring to the PAME march alone ) , although because of lack of media coverage due to the media participation in the general strike no concrete estimates can be made . After the PAME ( Communist Party union ) protesters left Syntagma square , the first lines of the main march started arriving before the Parliament with the first clashes erupting at the end of Stadiou street . The march then walked on the Unknown Soldier grounds leading the Presidential Guard to retreat , and attempted to storm the Parliament but was pushed back by riot police forces which today demonstrated a parti

## CaTeRS

In [33]:
data, mis = converter.convert_caters()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

crest2brat(data, '../data/crest_brat/6')

samples: 2502
mismatch: 0
+ causal: 308
- non-causal: 2194


In [34]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,R1,[resuscitation],[passed away],[],There was a man in the alley named Bill\nBill ...,"{'span1': [[194, 207]], 'span2': [[166, 177]],...",0,0,6,test_15Oct.ann,2
1,R2,[shot],[passed away],[],There was a man in the alley named Bill\nBill ...,"{'span1': [[136, 140]], 'span2': [[166, 177]],...",1,0,6,test_15Oct.ann,2
2,R3,[intoxicated],[insulted],[],There was a man in the alley named Bill\nBill ...,"{'span1': [[49, 60]], 'span2': [[91, 99]], 'si...",1,0,6,test_15Oct.ann,2
3,R4,[insulted],[shot],[],There was a man in the alley named Bill\nBill ...,"{'span1': [[91, 99]], 'span2': [[136, 140]], '...",1,0,6,test_15Oct.ann,2
4,R6,[ruined],[sad],[],Grayson wanted to bake his brother a birthday ...,"{'span1': [[192, 198]], 'span2': [[232, 235]],...",1,0,6,test_15Oct.ann,2


In [35]:
min_avg_max(data)

Avg. length: 42.48760991207035
+++++++++++++++
min length: 20
min context: Billy felt lonely in school.
He had no friends.
One day, a new kid came to school.
They instantly became friends.
They became inseparable.
+++++++++++++++
max length: 65
max context: My mother's cat was ill, so my brother took it to the vet for her.
They said the cat needed to stay there for some tests.
They called my brother an hour later, telling him to come pick it up.
When he got there, the girl at the front desk said the cat had died!
Later that day, she called to say she was mistaken, the cat was fine.


## BECAUSE v2.1

Since the raw text files for **PTB** and **NYT** require an LDC subscription, these files are not yet covered by our data reader. Once we have access to the raw files from these resources, we will write the proper data readers for them.

In [36]:
data, mis = converter.convert_because()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

crest2brat(data, '../data/crest_brat/7')

samples: 729
mismatch: 0
+ causal: 554
- non-causal: 175


In [37]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,E1,[that has arisen],[the past few years],[over],"And second, we should address the issue that h...","{'span1': [[40, 55]], 'span2': [[76, 94]], 'si...",0,1,7,CHRG-111shrg61651.ann,
1,E4,[these banks are too big to fail],"[they have lower funding costs, they are able ...",[Because],"Because these banks are too big to fail, they ...","{'span1': [[8, 39]], 'span2': [[41, 179]], 'si...",1,0,7,CHRG-111shrg61651.ann,
2,E5,[they make more money],[the cycle],[over],"Because these banks are too big to fail, they ...","{'span1': [[111, 131]], 'span2': [[137, 146]],...",0,1,7,CHRG-111shrg61651.ann,
3,E6,[too big],[fail],"[too, to]","Because these banks are too big to fail, they ...","{'span1': [[24, 31]], 'span2': [[35, 39]], 'si...",1,0,7,CHRG-111shrg61651.ann,
4,E7,[you look at the European situation today],[it is much worse than what we have in this co...,[If],"If you look at the European situation today, f...","{'span1': [[3, 43]], 'span2': [[58, 153]], 'si...",0,0,7,CHRG-111shrg61651.ann,


In [38]:
min_avg_max(data)

Avg. length: 32.358024691358025
+++++++++++++++
min length: 3
min context: so why
not? 
+++++++++++++++
max length: 84
max context: I think this is an important task, and there's a great deal of agreement, that we should be moving to empower the Federal Reserve to have regulatory authority over a wide range of financial institutions in recognition in part of the fact that they have a systemic impact and that the current situation puts the Fed in an untenable position of being given a set of expectations to respond when it doesn't have the full panoply of tools to respond.
    


## Choice of Plausible Alternatives (COPA)

In [2]:
data, mis = converter.convert_copa(dataset_code=2)
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

{'original_id': 1887, 'span1': ['The boy flinched'], 'span2': ['He turned and ran away'], 'signal': [], 'context': 'The boy flinched. He turned and ran away', 'idx': {'span1': [[0, 16]], 'span2': [[18, 39]], 'signal': []}, 'label': 0, 'direction': 1, 'source': 8, 'ann_file': 'BCOPA-CE.xml', 'split': 2}
{'original_id': 2387, 'span1': ['The boy flinched'], 'span2': ['He turned and ran away'], 'signal': [], 'context': 'The boy flinched. He turned and ran away', 'idx': {'span1': [[0, 16]], 'span2': [[18, 39]], 'signal': []}, 'label': 1, 'direction': 0, 'source': 8, 'ann_file': 'BCOPA-CE.xml', 'split': 2}
samples: 1998
mismatch: 2
+ causal: 999
- non-causal: 999
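
The two records printed before the counts are presumably the two reported mismatches. In both, the `span2` offsets `[18, 39]` stop one character short of the span text "He turned and ran away", which is 22 characters long. A minimal sketch of such a consistency check (the name `find_mismatches` is hypothetical, and the converter's own check may differ):

```python
def find_mismatches(record):
    """Return (key, expected, sliced) for each span whose offsets disagree with its text."""
    problems = []
    for key in ("span1", "span2", "signal"):
        for text, (start, end) in zip(record[key], record["idx"][key]):
            sliced = record["context"][start:end]
            if sliced != text:
                problems.append((key, text, sliced))
    return problems


# Fields copied from the first record printed above
record = {
    "span1": ["The boy flinched"],
    "span2": ["He turned and ran away"],
    "signal": [],
    "context": "The boy flinched. He turned and ran away",
    "idx": {"span1": [[0, 16]], "span2": [[18, 39]], "signal": []},
}
print(find_mismatches(record))
```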


In [5]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,1,[My body cast a shadow over the grass],[The sun was rising],[],My body cast a shadow over the grass. The sun ...,"{'span1': [[0, 36]], 'span2': [[38, 56]], 'sig...",1,1,8,copa-dev.xml,1
1,1,[My body cast a shadow over the grass],[The grass was cut],[],My body cast a shadow over the grass. The gras...,"{'span1': [[0, 36]], 'span2': [[38, 55]], 'sig...",0,1,8,copa-dev.xml,1
2,2,[The woman tolerated her friend's difficult be...,[The woman knew her friend was going through a...,[],The woman tolerated her friend's difficult beh...,"{'span1': [[0, 51]], 'span2': [[53, 108]], 'si...",1,1,8,copa-dev.xml,1
3,2,[The woman tolerated her friend's difficult be...,[The woman felt that her friend took advantage...,[],The woman tolerated her friend's difficult beh...,"{'span1': [[0, 51]], 'span2': [[53, 114]], 'si...",0,1,8,copa-dev.xml,1
4,3,[The women met for coffee],[They wanted to catch up with each other],[],The women met for coffee. They wanted to catch...,"{'span1': [[0, 24]], 'span2': [[26, 65]], 'sig...",1,1,8,copa-dev.xml,1


## Penn Discourse Treebank (PDTB 3.0)

In [2]:
data, mis = converter.convert_pdtb3()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 7991
mismatch: 0
+ causal: 7991
- non-causal: 0


In [3]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,9,[that 150 million shares of MiniScribe common ...,[there's a tremendous amount of exposure],[],Mr. Rifenburgh also noted that 150 million sha...,"{'span1': [[26, 116]], 'span2': [[122, 161]], ...",1,0,9,,0
1,20,[but questioning whether the company can survi...,[It's a wait-and-see attitude],[],Analysts and consultants had mixed reactions t...,"{'span1': [[109, 192]], 'span2': [[195, 223]],...",1,0,9,,0
2,39,"[At first glance, gold and utilities seem stra...","[After all, gold prices usually soar when infl...",[],"At first glance, gold and utilities seem stran...","{'span1': [[0, 59]], 'span2': [[61, 119]], 'si...",1,1,9,,0
3,42,"[Utility stocks, on the other hand, thrive on ...",[the fat dividends utilities pay look more att...,[],"Utility stocks, on the other hand, thrive on d...","{'span1': [[0, 57]], 'span2': [[67, 162]], 'si...",1,1,9,,0
4,46,[But the two groups have something very import...,"[It's as if investors, the past few days, are ...",[],But the two groups have something very importa...,"{'span1': [[0, 132]], 'span2': [[134, 254]], '...",1,0,9,,0


## BioCause

In [2]:
data, mis = converter.convert_biocause()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[data["label"] == 1])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

[crest-log] Error in converting BioCause. Detail: 
samples: 844
mismatch: 0
+ causal: 844
- non-causal: 0


In [3]:
data.head(3)

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,direction,source,ann_file,split
0,,[Each paired reaction set (TB/PA14) resulted i...,[there is little difference between the two is...,[These results show that],Characterization of the cheB2 Mutant \nWe chos...,"{'span1': [[836, 954]], 'span2': [[997, 1131]]...",1,0,10,PMC2714965-02-Results-05.ann,0
1,,"[As shown in Figure 3A, the newly engineered c...",[a delayed C. elegans killing comparable to th...,[showed],Characterization of the cheB2 Mutant \nWe chos...,"{'span1': [[1377, 1467]], 'span2': [[1468, 155...",1,0,10,PMC2714965-02-Results-05.ann,0
2,,"[In addition, we engineered a similar cheB2 mu...",[the virulence phenotype of a cheB2 mutant and...,[This further confirmed],Characterization of the cheB2 Mutant \nWe chos...,"{'span1': [[1692, 1882]], 'span2': [[1919, 201...",1,0,10,PMC2714965-02-Results-05.ann,0


In [8]:
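# original_id is empty for BioCause, so use the row index instead.
# A single column assignment avoids chained indexing (data.iloc[idx][...] = ...),
# which can write to a temporary copy and leave `data` unchanged.
data["original_id"] = data.index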

In [10]:
crest2brat(data, 'biocause')