## Reading Causal Relations Corpora
By: Pedram Hosseini (pdr.hosseini@gmail.com)

Several causal relation corpora have been created with different levels of granularity and different annotation schemes. These efforts, though admirable, are fairly sparse, which makes it hard for people in the NLP community to use the resulting knowledge. As a step toward alleviating this problem, we wrote methods in a **Converter** class that convert all of these resources into a simple, friendly format so that anyone can easily use the samples. We try to keep most of the annotations from the original sources in the new schema so that as little information as possible is lost during conversion. The data sets currently covered in our collection are:

- **SemEval 2007 task 4** - Public (source: **1**)
- **SemEval 2010 task 8** - Public (source: **2**)
- **EventCausality data set** - Public (source: **3**)
- **Causal-TimeBank** - Not public (source: **4**)
- **EventStoryLine (v0.9, v1.0, v1.5)** - Public (source: **5**)
- **CaTeRS** - Public (source: **6**)
- **BECAUSE v2.1** - Public (source: **7**)
- **Choice of Plausible Alternatives (COPA)** - Public (source: **8**)
- **Your data set?**

#### JOIN US
We invite everyone in the ML/NLP/NLU community, and research groups working on causal/counterfactual relation extraction in language, to contribute to this repository, **crest/converter.py** in particular, so that together we can improve the quality of available data resources and alleviate the sparseness issue.

In [1]:
import os
import sys

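# "__file__" is a string literal here (notebooks do not define __file__),
# so root_path resolves to the parent of the current working directory.
root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)

from crest import converter
# Note: this rebinds the name `converter` from the module to a Converter instance.
converter = converter.Converter()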
total_samples = 0

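def find_len_max(df_data):
    """Print the length and text of the longest context in df_data."""
    len_max = 0
    i_max = -1
    for index, row in df_data.iterrows():
        if len(row.context) > len_max:
            len_max = len(row.context)
            i_max = index
    print("max length = " + str(len_max))
    # Read from the frame passed in, not the global `data`;
    # `index` is a label from iterrows(), so look it up with .loc.
    if i_max != -1:
        print(df_data.loc[i_max].context)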
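
The same lookup can also be written without an explicit loop. The sketch below is a minimal alternative, assuming a `context` column of strings as in the converted frames. The function name `longest_context` and the toy frame are illustrative and are not part of `crest`.

```python
import pandas as pd


def longest_context(df_data):
    """Return (length, text) of the longest `context` value in df_data."""
    lengths = df_data["context"].str.len()  # character count per row
    i_max = lengths.idxmax()                # index label of the first maximum
    return int(lengths.loc[i_max]), df_data.loc[i_max, "context"]


# Toy frame standing in for a converted data set (values are made up).
toy = pd.DataFrame({"context": ["short", "a somewhat longer context", "mid-size"]})
length, text = longest_context(toy)
print("max length = {}".format(length))
print(text)
```

Returning the values instead of printing them makes the helper easier to reuse, for example when choosing a maximum sequence length for a model.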

## Convert all datasets

In [2]:
df, mis = converter.convert2crest(dataset_ids=[1, 2, 3, 4, 5, 6, 7, 8], save_file=True)
print("samples: " + str(len(df)))
print("+ causal: {}".format(len(df.loc[(df["label"] == 1) | (df["label"] == 2)])))
print("- non-causal: {}".format(len(df.loc[df["label"] == 0])))

samples: 18419
+ causal: 6639
- non-causal: 11780
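
To see how the causal and non-causal samples are distributed across data sets, the combined frame can be grouped by its `source` column. The following is a minimal sketch on a toy frame with invented values; it assumes only the `label` and `source` columns shown in this notebook, and treats labels 1 and 2 as causal, as the counts above do.

```python
import pandas as pd

# Toy rows mimicking the `label` and `source` columns of the converted frame;
# the values are invented for illustration.
toy = pd.DataFrame({
    "source": [1, 1, 2, 2, 2, 5],
    "label":  [0, 2, 0, 0, 1, 1],
})

# Labels 1 and 2 are counted as causal, label 0 as non-causal.
toy["is_causal"] = toy["label"].isin([1, 2]).astype(int)

# One row per data set id with causal, total, and non-causal sample counts.
counts = toy.groupby("source")["is_causal"].agg(causal="sum", total="size")
counts["non_causal"] = counts["total"] - counts["causal"]
print(counts)
```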


## SemEval 2007 task 4 

In [2]:
data, mis = converter.convert_semeval_2007_4()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 220
mismatch: 0
+ causal: 114
- non-causal: 106


In [4]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,1,[tumor shrinkage],[radiation therapy],[],The period of tumor shrinkage after radiation ...,"{'span1': [[14, 29]], 'span2': [[36, 53]], 'si...",2,1,,0
1,2,[Habitat degradation],[stream channels],[],Habitat degradation from within stream channel...,"{'span1': [[0, 19]], 'span2': [[32, 47]], 'sig...",0,1,,0
2,3,[discomfort],[traveling],[],Earplugs relieve the discomfort from traveling...,"{'span1': [[21, 31]], 'span2': [[37, 46]], 'si...",2,1,,0
3,4,[daily terror],[antipersonnel land mines],[],We continue to see progress toward a world fre...,"{'span1': [[55, 67]], 'span2': [[71, 95]], 'si...",2,1,,0
4,5,[segment],[anecdotes],[],The Global Warming segment starts off with two...,"{'span1': [[19, 26]], 'span2': [[53, 62]], 'si...",0,1,,0


In [5]:
find_len_max(data)

max length = 493
Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its growth from various roots, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers.



## SemEval 2010 task 8

In [3]:
data, mis = converter.convert_semeval_2010_8()
total_samples += len(data)
print("samples: {}".format(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 10717
mismatch: 0
+ causal: 1331
- non-causal: 9386


In [7]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,1,[configuration],[elements],[],The system as described above has its greatest...,"{'span1': [[73, 86]], 'span2': [[98, 106]], 's...",0,2,,0
1,2,[child],[cradle],[],The child was carefully wrapped and bound into...,"{'span1': [[4, 9]], 'span2': [[51, 57]], 'sign...",0,2,,0
2,3,[author],[disassembler],[],The author of a keygen uses a disassembler to ...,"{'span1': [[4, 10]], 'span2': [[30, 42]], 'sig...",0,2,,0
3,4,[ridge],[surge],[],A misty ridge uprises from the surge.\n,"{'span1': [[8, 13]], 'span2': [[31, 36]], 'sig...",0,2,,0
4,5,[student],[association],[],The student association is the voice of the un...,"{'span1': [[4, 11]], 'span2': [[12, 23]], 'sig...",0,2,,0


In [8]:
find_len_max(data)

max length = 493
Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its growth from various roots, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers.



## EventCausality data set

In [4]:
data, mis = converter.convert_event_causality()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 485
mismatch: 0
+ causal: 485
- non-causal: 0


In [10]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,C_6_4_6_10,[attacks],[conclude],[],"The company says the attacks "" have led us to ...","{'span1': [[21, 28]], 'span2': [[46, 54]], 'si...",1,3,2010.01.13.google.china.exit,1
1,C_6_10_6_14,[conclude],[review],[],"The company says the attacks "" have led us to ...","{'span1': [[46, 54]], 'span2': [[70, 76]], 'si...",1,3,2010.01.13.google.china.exit,1
2,C_12_5_12_18,[deliveries],[interpreted],[],A large number of flower deliveries were made ...,"{'span1': [[25, 35]], 'span2': [[105, 116]], '...",1,3,2010.01.13.google.china.exit,1
3,C_14_3_14_14,[leaves],[advancement],[],""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[57, 68]], 'si...",1,3,2010.01.13.google.china.exit,1
4,C_14_3_14_19,[leaves],[success],[],""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[87, 94]], 'si...",1,3,2010.01.13.google.china.exit,1


In [11]:
find_len_max(data)

max length = 1083
French police responded to reports of car theft in a town near Paris late Tuesday and a shootout ensued with a group of alleged thieves . Most of them escaped but police captured one and he was later identified as a suspected ETA member , said the spokeswoman , who by custom is not identified .  Spanish media reported that the shootout occurred in the town of Dammarie-les-Lys .  The dead French policeman was wearing a bullet-proof vest but bullets struck fatally elsewhere on his body .  He was reported to be in his 50s , and the father of four children .  ETA has traditionally used France as its rearguard logistics and planning base to prepare attacks across the border in Spain , officials say .  But in recent years as Spain has enlisted increased cooperation from France in cracking down on ETA hideouts , there have been various exchanges of gunfire between ETA suspects and French police , wounding some officers .  Almost all of ETA 's fatal shootings and car bombings

## Causal-TimeBank

In [5]:
data, mis = converter.convert_causal_timebank()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 318
mismatch: 0
+ causal: 318
- non-causal: 0


In [13]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,10,[was],[thought],[So],"Not that long ago , before the Chinese takeove...","{'span1': [[82, 85]], 'span2': [[215, 222]], '...",1,4,ABC19980108.1830.0711.xml,
1,27,[downturn],[spending],[],"But in the past three months , stocks have plu...","{'span1': [[88, 96]], 'span2': [[139, 147]], '...",1,4,ABC19980108.1830.0711.xml,
2,36,[change],[reposition],[So],"I think that the mood is fairly gloomy , and I...","{'span1': [[72, 78]], 'span2': [[174, 184]], '...",1,4,ABC19980108.1830.0711.xml,
3,6,[rains],[landslides],[],Officials in California are warning residents ...,"{'span1': [[60, 65]], 'span2': [[105, 115]], '...",1,4,PRI19980213.2000.0313.xml,
4,22,[get],[rains],[],Forecasters say the picture will get worse bec...,"{'span1': [[33, 36]], 'span2': [[56, 61]], 'si...",2,4,PRI19980213.2000.0313.xml,


In [14]:
find_len_max(data)

max length = 616
WASHINGTON _ Following are statements made Friday and Thursday by Lawrence Wechsler , a lawyer for the White House secretary , Betty Currie ; the White House ; White House spokesman Mike McCurry , and President Clinton in response to an article in The New York Times on Friday about her statements regarding a meeting with the president : Wechsler on Thursday " Without commenting on the allegations raised in this article , to the extent that there is any implication or suggestion that Mrs. Currie was aware of any legal or ethical impropriety by anyone , that implication or suggestion is entirely inaccurate . " 


## EventStoryLine

In [6]:
data, mis = converter.convert_event_storylines(version="1.5")
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal (FALLING_ACTION): {}".format(len(data.loc[data["label"] == 2])))
print("+ causal (PRECONDITION): {}".format(len(data.loc[data["label"] == 1])))

samples: 2608
mismatch: 0
+ causal (FALLING_ACTION): 1269
+ causal (PRECONDITION): 1339


In [7]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,246682,[double murder],[killing],[],Cumbria double murder : Son suspected of killi...,"{'span1': [[8, 21]], 'span2': [[41, 48]], 'sig...",2,5,32_11ecbplus.xml.xml,
1,246683,[sectioned],[suicide attempt],[],"John Jenkin , 23 , had been sectioned after an...","{'span1': [[28, 37]], 'span2': [[56, 71]], 'si...",2,5,32_11ecbplus.xml.xml,
2,246684,[double murder],[dead],[],"John Jenkin , 23 , had been sectioned after an...","{'span1': [[95, 108]], 'span2': [[165, 169]], ...",2,5,32_11ecbplus.xml.xml,
3,246685,[double murder],[dead],[],Cumbria double murder : Son suspected of killi...,"{'span1': [[8, 21]], 'span2': [[303, 307]], 's...",2,5,32_11ecbplus.xml.xml,
4,246686,[killing],[dead],[],Cumbria double murder : Son suspected of killi...,"{'span1': [[41, 48]], 'span2': [[303, 307]], '...",1,5,32_11ecbplus.xml.xml,


In [8]:
find_len_max(data)

max length = 4674
The Athens protest march marking the zenith of the general strike called for the 5th of May was attended by an approximate 200 , 000 ( 20 , 000 which is the foreign broadcast number referring to the PAME march alone ) , although because of lack of media coverage due to the media participation in the general strike no concrete estimates can be made . After the PAME ( Communist Party union ) protesters left Syntagma square , the first lines of the main march started arriving before the Parliament with the first clashes erupting at the end of Stadiou street . The march then walked on the Unknown Soldier grounds leading the Presidential Guard to retreat , and attempted to storm the Parliament but was pushed back by riot police forces which today demonstrated a particularly staunch attitude and resolve against the demonstrators . Soon battles erupted around the Parliament with protesters throwing Molotov cocktails and rocks , with one riot police armored van torched , and 

## CaTeRS

In [2]:
data, mis = converter.convert_caters()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 2502
mismatch: 0
+ causal: 308
- non-causal: 2194


In [19]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,R1,[lost],[fit],[],Kay lost 50 pounds and her clothes no longer f...,"{'span1': [[4, 8]], 'span2': [[45, 48]], 'sign...",1,6,part_12.ann,1
1,R2,[decided],[tried on],[],Kay lost 50 pounds and her clothes no longer f...,"{'span1': [[54, 61]], 'span2': [[139, 147]], '...",0,6,part_12.ann,1
2,R3,[loving],[tried on],[],Kay lost 50 pounds and her clothes no longer f...,"{'span1': [[110, 116]], 'span2': [[139, 147]],...",0,6,part_12.ann,1
3,R4,[tried on],[bought],[],Kay lost 50 pounds and her clothes no longer f...,"{'span1': [[139, 147]], 'span2': [[220, 226]],...",0,6,part_12.ann,1
4,R5,[impressed],[bought],[],Kay lost 50 pounds and her clothes no longer f...,"{'span1': [[206, 215]], 'span2': [[220, 226]],...",1,6,part_12.ann,1


In [20]:
find_len_max(data)

max length = 330
My mother's cat was ill, so my brother took it to the vet for her.
They said the cat needed to stay there for some tests.
They called my brother an hour later, telling him to come pick it up.
When he got there, the girl at the front desk said the cat had died!
Later that day, she called to say she was mistaken, the cat was fine.


## BECAUSE v2.1

Since the raw text files for **PTB** and **NYT** require an LDC subscription, these files are not yet covered by our data reader. Once we have access to the raw files from these resources, we will write proper data readers for them.

In [2]:
data, mis = converter.convert_because()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 729
mismatch: 0
+ causal: 554
- non-causal: 175


In [3]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,E1,[that has arisen],[the past few years],[over],"And second, we should address the issue that h...","{'span1': [[40, 55]], 'span2': [[76, 94]], 'si...",0,7,CHRG-111shrg61651.ann,
1,E4,[these banks are too big to fail],"[they have lower funding costs, they are able ...",[Because],"Because these banks are too big to fail, they ...","{'span1': [[8, 39]], 'span2': [[41, 179]], 'si...",1,7,CHRG-111shrg61651.ann,
2,E5,[they make more money],[the cycle],[over],"Because these banks are too big to fail, they ...","{'span1': [[111, 131]], 'span2': [[137, 146]],...",0,7,CHRG-111shrg61651.ann,
3,E6,[too big],[fail],"[too, to]","Because these banks are too big to fail, they ...","{'span1': [[24, 31]], 'span2': [[35, 39]], 'si...",1,7,CHRG-111shrg61651.ann,
4,E7,[you look at the European situation today],[it is much worse than what we have in this co...,[If],"If you look at the European situation today, f...","{'span1': [[3, 43]], 'span2': [[58, 153]], 'si...",0,7,CHRG-111shrg61651.ann,


In [4]:
find_len_max(data)

max length = 576
These include formalizing the current informal coordination among U.S. financial regulators by amending and enhancing the Executive Order which created the President's Working Group on Financial Markets, and while retaining State level regulation of mortgage origination practices, creating a new Federal level commission, the Mortgage Origination Commission, to establish minimum standards for among other things personal conduct and disciplinary history, minimum educational requirements, testing criteria and procedures, and appropriate licensing revocation standards.
    


## Choice of Plausible Alternatives (COPA)

In [5]:
data, mis = converter.convert_copa()
total_samples += len(data)
print("samples: " + str(len(data)))
print("mismatch: " + str(mis))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 1000
mismatch: 0
+ causal: 1000
- non-causal: 0


In [6]:
data.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,1,[My body cast a shadow over the grass],[The sun was rising],[],My body cast a shadow over the grass. The sun ...,"{'span1': [[0, 36]], 'span2': [[38, 56]], 'sig...",1,8,copa-dev.xml,1
1,2,[The woman tolerated her friend's difficult be...,[The woman knew her friend was going through a...,[],The woman tolerated her friend's difficult beh...,"{'span1': [[0, 51]], 'span2': [[53, 108]], 'si...",1,8,copa-dev.xml,1
2,3,[The women met for coffee],[They wanted to catch up with each other],[],The women met for coffee. They wanted to catch...,"{'span1': [[0, 24]], 'span2': [[26, 65]], 'sig...",1,8,copa-dev.xml,1
3,4,[The runner wore shorts],[The forecast predicted high temperatures],[],The runner wore shorts. The forecast predicted...,"{'span1': [[0, 22]], 'span2': [[24, 64]], 'sig...",1,8,copa-dev.xml,1
4,5,[The guests of the party hid behind the couch],[It was a surprise party],[],The guests of the party hid behind the couch. ...,"{'span1': [[0, 44]], 'span2': [[46, 69]], 'sig...",1,8,copa-dev.xml,1
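
The `idx` column stores character offsets into `context`, so the span text can be recovered by slicing. The sketch below illustrates this on the first COPA row, assuming end-exclusive `[start, end]` pairs (which matches the offsets shown) and assuming `idx` may arrive as a string after being saved to disk. The helper name `span_text`, the `signal` key with an empty list, and joining multi-part spans with a space are illustrative assumptions, not part of `crest`.

```python
import ast


def span_text(context, idx, key):
    """Join the substrings of `context` selected by the offsets under idx[key]."""
    if isinstance(idx, str):
        # After a round trip through a file, the dict may come back as a string.
        idx = ast.literal_eval(idx)
    return " ".join(context[start:end] for start, end in idx[key])


# First COPA row from the table; the full context is reassembled from its two
# spans, and the empty `signal` entry is assumed from the empty signal column.
context = "My body cast a shadow over the grass. The sun was rising."
idx = "{'span1': [[0, 36]], 'span2': [[38, 56]], 'signal': []}"

print(span_text(context, idx, "span1"))
print(span_text(context, idx, "span2"))
```

Comparing the sliced text with the `span1` and `span2` columns is a quick sanity check that offsets survived conversion.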
