## Reading Causal Relations Corpora
By: Pedram Hosseini (pdr.hosseini@gmail.com)

Several causal relation corpora have been created, with different levels of granularity and based on multiple annotation schemes. These efforts, though admirable, are fairly sparse, which makes it hard for people in the NLP community to use the knowledge these resources contain. To help alleviate this problem and make it easier to benefit from these data sets and their rich human annotations, we wrote methods in a **CausalDataReader** class that convert all of these resources into a simple, friendly format so that anyone can easily use the samples.
We try to keep most of the annotations from the original sources in the new schema so that as little information as possible is lost during conversion. The collection currently covers the following data sets:

- **SemEval 2007 task 4** - Public (source: **1**)
- **SemEval 2010 task 8** - Public (source: **2**)
- **EventCausality data set** - Public (source: **3**)
- **Causal-TimeBank** - Not public (source: **4**)
- **Crowdsourcing-StoryLines 1.2** - Public (source: **5**)
- **CaTeRS** - Public (source: **6**)
- **BECAUSE v2.1** - Public (source: **7**)
- **Your data set?**
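
As a rough illustration of the unified format, the sketch below builds a tiny hand-made table using column names that appear later in this notebook (`text`, `arg1`, `arg2`, `label`, `split`, `source`). The rows are invented for illustration and are not taken from any of the corpora. Storing `arg1` and `arg2` as plain strings is an assumption; the frames returned by **CausalDataReader** may encode them differently and may contain more columns.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from any corpus).
# Column names follow the ones used later in this notebook; keeping
# arg1/arg2 as plain strings is an assumption about the schema.
toy = pd.DataFrame(
    [
        {"text": "The fire was caused by a short circuit.",
         "arg1": "short circuit", "arg2": "fire",
         "label": 1, "split": 0, "source": 2},
        {"text": "The cat sat on the mat.",
         "arg1": "cat", "arg2": "mat",
         "label": 0, "split": 0, "source": 2},
        {"text": "Heavy rain led to flooding.",
         "arg1": "rain", "arg2": "flooding",
         "label": 1, "split": 2, "source": 1},
    ]
)

# Select causal training samples: label 1 = causal, split 0 = train.
causal_train = toy[(toy["label"] == 1) & (toy["split"] == 0)]
print(causal_train[["text", "arg1", "arg2"]])
```

The same boolean filtering should work on any frame that exposes `label` and `split` columns with this encoding.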

#### JOIN US
We invite everyone in the ML/NLP/NLU community, and groups of researchers who work on causal relation extraction in language, to contribute to this repository, **data_reader.py** in particular, so that together we can improve the quality of available data resources and alleviate the sparseness issue.

In [None]:
import data_reader as dr
obj = dr.CausalDataReader()
total_samples = 0

## SemEval 2010 task 8

In [None]:
data = obj.read_semeval_2010_8()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)
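
The longest-text search above is repeated for each data set in this notebook. A possible way to factor it out is sketched below. `longest_text` is a hypothetical helper and is not part of **data_reader.py**; it assumes a non-empty frame with a `text` column.

```python
import pandas as pd

def longest_text(df, column="text"):
    """Return (length, text) for the longest string in `column`.

    Hypothetical helper, not part of data_reader.py.
    Assumes `df` is non-empty.
    """
    lengths = df[column].astype(str).str.len()
    # Positional index of the first maximum, independent of index labels.
    pos = int(lengths.to_numpy().argmax())
    return int(lengths.iloc[pos]), df[column].iloc[pos]

# Toy frame with a non-default index, invented for illustration.
demo = pd.DataFrame(
    {"text": ["short", "a much longer sentence", "mid size"]},
    index=[10, 20, 30],
)
length, text = longest_text(demo)
print("max length = " + str(length))
print(text)
```

Because the lookup is positional, it does not depend on the frame having a default integer index, whereas combining the labels yielded by `iterrows()` with `iloc` does.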

## SemEval 2007 task 4 

In [None]:
data = obj.read_semeval_2007_4()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

## EventCausality data set

In [None]:
data = obj.read_event_causality()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

## Causal-TimeBank

In [None]:
data = obj.read_causal_timebank()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

## Crowdsourcing-StoryLines

In [None]:
data = obj.read_story_lines()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

## CaTeRS

In [None]:
data = obj.read_CaTeRS()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
i = 200
print(data.iloc[i]['text'])
print(data.iloc[i]['arg1'])
print(data.iloc[i]['arg2'])
print(data.iloc[i]['ann_file'])

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

## BECAUSE v2.1

Since the raw text files for **PTB** and **NYT** require an LDC subscription, these files are not yet covered by our data reader. Once we have access to the raw files from these resources, we will write the proper data readers for them.

In [None]:
data = obj.read_because()
total_samples += len(data)
print("samples: " + str(len(data)))

In [None]:
data.head()

In [None]:
len_max = 0
i_max = -1
for index, row in data.iterrows():
    print(row.text)
    print("------")
    if len(row.text) > len_max:
        len_max = len(row.text)
        i_max = index
print("max length = " + str(len_max))
print(data.iloc[i_max].text)

### Data set statistics

In [None]:
import os
import pandas as pd

data_path = 'data/causal/gold_causal.csv'
if os.path.exists(data_path):
    # reuse the cached merged file
    data = pd.read_csv(data_path)
else:
    # build the merged data set once and cache it without the index column
    data = obj.read_all()
    data.to_csv(data_path, index=False)
train = data.loc[data.split == 0]
dev = data.loc[data.split == 1]
test = data.loc[data.split == 2]
pos = data.loc[data.label == 1]
neg = data.loc[data.label == 0]
print("Total samples: " + str(len(data)))
print("----------------------------")
print("# Causal: " + str(len(pos)))
print("# Non-Causal: " + str(len(neg)))
print("----------------------------")
print("# train: " + str(len(train)))
print("# dev: " + str(len(dev)))
print("# test: " + str(len(test)))
print("----------------------------")
for i in range(7):
    print("# source [" + str(i+1) + "]: " + str(len(data.loc[data.source == (i+1)])))

In [None]:
# checking if all samples have labels
print("# unlabeled samples: " + str(len(data.loc[(data.label != 0) & (data.label != 1)])))
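
The counts above can also be collected with pandas aggregations instead of one filter per line. The sketch below runs on a small invented frame rather than on the real merged data. It assumes the same encoding used above: `split` 0/1/2 for train/dev/test and `label` 0/1, with anything else treated as unlabeled.

```python
import pandas as pd

# Toy stand-in for the merged data; values are invented for illustration.
toy = pd.DataFrame({
    "source": [1, 1, 2, 2, 2, 3],
    "label":  [1, 0, 1, 1, 0, None],
    "split":  [0, 0, 0, 1, 2, 2],
})

# Number of samples per source.
per_source = toy.groupby("source").size()

# Number of samples per split, with readable names.
split_names = {0: "train", 1: "dev", 2: "test"}
per_split = toy["split"].map(split_names).value_counts()

# Rows whose label is neither 0 nor 1 (missing values included).
unlabeled = toy[~toy["label"].isin([0, 1])]

print(per_source)
print(per_split)
print("# unlabeled samples: " + str(len(unlabeled)))
```

Grouping by `source` avoids hard-coding the number of data sets, so the summary would not need to change when a new source is added.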