## Converting CREST to popular formats
In this notebook, we show how CREST can be converted to popular and useful formats such as BRAT (for relation annotation) and TACRED, a popular dataset for relation extraction.

In [1]:
import os
import sys

root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)

import ast
import json
import pandas as pd
import numpy as np

from crest.utils import crest2tacred

from collections import Counter
from sklearn.metrics import f1_score, confusion_matrix, classification_report

## Loading CREST data

In [2]:
# loading the CREST-formatted data
df = pd.read_excel('../data/causal/crest.xlsx', index_col=[0])

# rows with a missing (nan) split value are assigned to train (split 0)
df.loc[np.isnan(df['split']),'split'] = 0

# verify that no nan split values remain
assert len(df.loc[np.isnan(df['split'])]) == 0

In [3]:
df.head()

| | original_id | span1 | span2 | signal | context | idx | label | source | ann_file | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ['tumor shrinkage'] | ['radiation therapy'] | [] | The period of tumor shrinkage after radiation ... | {'span1': [[14, 29]], 'span2': [[36, 53]], 'si... | 2 | 1 | | 0.0 |
| 1 | 2 | ['Habitat degradation'] | ['stream channels'] | [] | Habitat degradation from within stream channel... | {'span1': [[0, 19]], 'span2': [[32, 47]], 'sig... | 0 | 1 | | 0.0 |
| 2 | 3 | ['discomfort'] | ['traveling'] | [] | Earplugs relieve the discomfort from traveling... | {'span1': [[21, 31]], 'span2': [[37, 46]], 'si... | 2 | 1 | | 0.0 |
| 3 | 4 | ['daily terror'] | ['antipersonnel land mines'] | [] | We continue to see progress toward a world fre... | {'span1': [[55, 67]], 'span2': [[71, 95]], 'si... | 2 | 1 | | 0.0 |
| 4 | 5 | ['segment'] | ['anecdotes'] | [] | The Global Warming segment starts off with two... | {'span1': [[19, 26]], 'span2': [[53, 62]], 'si... | 0 | 1 | | 0.0 |
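
The `idx` column stores character offsets into `context`. In the first row above, `context[14:29]` is `tumor shrinkage`, so the offsets behave like end-exclusive Python slices. The following is a minimal sketch of recovering span text from `idx`, assuming the column is read from Excel as the string form of a dict. The toy row, the `'signal'` key, and the helper name `span_text` are made up for illustration and are not part of CREST.

```python
import ast

def span_text(context, idx, key):
    """Return the text of each [start, end) character span stored under `key`."""
    # idx may arrive as a string such as "{'span1': [[0, 7]], ...}"; parse it safely
    if isinstance(idx, str):
        idx = ast.literal_eval(idx)
    return [context[start:end] for start, end in idx[key]]

# toy row (hypothetical; same layout as the rows shown above)
context = "Smoking causes cancer ."
idx = "{'span1': [[0, 7]], 'span2': [[15, 21]], 'signal': [[8, 14]]}"

print(span_text(context, idx, 'span1'))  # ['Smoking']
print(span_text(context, idx, 'span2'))  # ['cancer']
```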


## Preparing data splits
In this section, we prepare the train and dev/test splits for the experiments. We first separate the positive and negative samples, then split each group by `frac_val` (e.g. 80%/20% for train and dev/test, respectively). Finally, we combine the train portions of the positive and negative samples, and likewise the dev/test portions of the positive and negative samples. Splitting each class separately ensures that neither train nor dev/test is dominated by one class.

To include a dataset in this operation, add its CREST source code to the `source_codes` list.
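
The per-class split described above can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's own procedure: the helper name `split_per_class` and the toy frame are hypothetical, and it relies on `DataFrameGroupBy.sample` (pandas 1.1 or later) and on a unique index.

```python
import pandas as pd

def split_per_class(df, label_col, frac_train, seed=42):
    """Sample `frac_train` of every class for train; the remaining rows form dev."""
    train = df.groupby(label_col).sample(frac=frac_train, random_state=seed)
    # drop() removes the sampled rows by index label, so the index must be unique
    dev = df.drop(train.index)
    return train, dev

# toy data: 10 non-causal (0) and 5 causal (1) rows
toy = pd.DataFrame({
    'label': [0] * 10 + [1] * 5,
    'context': ['sentence {}'.format(i) for i in range(15)],
})
train, dev = split_per_class(toy, 'label', frac_train=0.8)

print(train['label'].value_counts().to_dict())  # {0: 8, 1: 4}
print(dev['label'].value_counts().to_dict())    # {0: 2, 1: 1}
```

Both splits keep the 2:1 class ratio of the toy data, which is the property the per-class procedure is meant to guarantee.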

### Creating train and dev

In [4]:
# [1] selecting SemEval sub-data
source_codes = [3, 4, 5, 6, 7]
# source_codes = [1, 2]
sem_df = df[df['source'].isin(source_codes)]

# [1.1] getting number of causal (positive) and non-causal (negative) samples
n_pos = len(sem_df[sem_df['label'].isin([1, 2])])
n_neg = len(sem_df[sem_df['label'].isin([0])])

print("(original data) positive: {}, negative: {}".format(n_pos, n_neg))

# args = {'frac_val': 0.8, 'n_neg': 3 * n_pos, 'n_pos': n_pos}
args = {'frac_val': 0.9, 'n_neg': n_neg, 'n_pos': n_pos}

# [2] train and dev splits
# [2.1] non-causal train and dev
sem_neg = sem_df.loc[sem_df['label'] == 0].sample(n=args['n_neg'], random_state=42)
neg_train = sem_neg.sample(frac=args['frac_val'], random_state=42)
neg_dev = sem_neg.drop(neg_train.index)

# [2.2] causal train and dev
sem_pos = sem_df[sem_df['label'].isin([1, 2])].sample(n=args['n_pos'], random_state=42)
pos_train = sem_pos.sample(frac=args['frac_val'], random_state=42)
pos_dev = sem_pos.drop(pos_train.index)

# concatenating causal and non-causal samples 
train_frames = [neg_train, pos_train]
dev_frames = [neg_dev, pos_dev]

train_df = pd.concat(train_frames)
dev_df = pd.concat(dev_frames)

# shuffling samples
train_df = train_df.sample(frac=1, random_state=42)
dev_df = dev_df.sample(frac=1, random_state=42)

# changing the ternary classes to binary (1 and 2 are both causal)
train_df.loc[train_df['label'] == 2, 'label'] = 1
dev_df.loc[dev_df['label'] == 2, 'label'] = 1

print("train: {}, dev: {}".format(len(train_df), len(dev_df)))

(original data) positive: 4273, negative: 2369
train: 5978, dev: 664


## Convert CREST to TACRED
TACRED is a well-known dataset for relation extraction. It has been used in many studies as a benchmark for evaluating relation extraction models, including but not limited to popular language models such as BERT. That is why we include TACRED as one of the formats that CREST can be converted to. You can find TACRED here: https://nlp.stanford.edu/projects/tacred/

In [5]:
splits = {'train': train_df, 'dev': dev_df}

for key, value in splits.items():
    data = crest2tacred(value, key, save_json=True)
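
For orientation, a TACRED-style record stores the sentence as a `token` list, a `relation` label, and inclusive token indices for the subject and object (`subj_start`, `subj_end`, `obj_start`, `obj_end`). The sketch below shows one way to map end-exclusive character spans to such indices using whitespace tokens. It is not the implementation of `crest2tacred`: the helper `to_token_span`, the whitespace tokenization, and the mapping of span1 to subject and span2 to object are all assumptions made for illustration.

```python
def to_token_span(context, start, end):
    """Map a [start, end) character span to inclusive (first, last) token indices."""
    tokens, offsets, pos = [], [], 0
    for tok in context.split():
        begin = context.index(tok, pos)  # character offset where this token starts
        tokens.append(tok)
        offsets.append((begin, begin + len(tok)))
        pos = begin + len(tok)
    # tokens whose character range intersects the requested span
    hit = [i for i, (b, e) in enumerate(offsets) if b < end and e > start]
    if not hit:
        raise ValueError('span does not cover any token')
    return tokens, hit[0], hit[-1]

# toy sentence (hypothetical): span1 = "Smoking" [0, 7), span2 = "cancer" [15, 21)
context = "Smoking causes cancer ."
tokens, subj_start, subj_end = to_token_span(context, 0, 7)
_, obj_start, obj_end = to_token_span(context, 15, 21)

record = {
    'token': tokens,
    'relation': 1,  # binary label, as in the splits above
    'subj_start': subj_start, 'subj_end': subj_end,
    'obj_start': obj_start, 'obj_end': obj_end,
}
print(record)
```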

### Checking context overlaps between splits
To make sure that the context values in train and dev overlap as little as possible, we compare the contexts of the two splits and drop every record with an overlapping context from the split named in `resolve_overlap`.

In [6]:
resolve_overlap = 'train'

dev_path = '../data/causal/splits/dev.json'
train_path = '../data/causal/splits/train.json'

dev = pd.read_json(dev_path)
train = pd.read_json(train_path)

split_info = {'dev_data': dev, 'dev_path': dev_path, 'train_data': train, 'train_path': train_path}

train_context = []
dev_context = []

for index, row in train.iterrows():
    train_context.append(' '.join(row['token']))
for index, row in dev.iterrows():
    dev_context.append(' '.join(row['token']))

print('==== with overlap ====')
print('train: {}, unique: {}'.format(len(train_context), len(set(train_context))))
print('dev: {}, unique: {}'.format(len(dev_context), len(set(dev_context))))

# finding overlaps (kept as a set for fast membership tests below)
overlaps = set(train_context) & set(dev_context)

records = []    
for index, row in split_info[resolve_overlap + '_data'].iterrows():
    if ' '.join(row['token']) not in overlaps:
        records.append(row.to_dict())

# removing the old split file
os.remove(split_info[resolve_overlap + '_path'])

# saving records into a JSON file
with open(split_info[resolve_overlap + '_path'], 'w') as fout:
    json.dump(records, fout)

print('\n==== without overlap ====')
print('{} size: {}'.format(resolve_overlap, len(records)))

==== with overlap ====
train: 5526, unique: 2381
dev: 621, unique: 519

==== without overlap ====
train size: 3858
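
The counts above show far fewer unique contexts than records, presumably because one sentence can yield several records with different span pairs. The filter therefore removes every record that shares a context with the other split, not just one copy. A compact sketch of the same filtering on plain lists of records, where the helper name `drop_overlaps` and the toy records are hypothetical:

```python
def drop_overlaps(records, other_records):
    """Keep the records whose joined tokens do not appear in `other_records`."""
    other_contexts = {' '.join(r['token']) for r in other_records}
    return [r for r in records if ' '.join(r['token']) not in other_contexts]

# toy records: the first train sentence also occurs in dev
train_records = [
    {'token': ['Smoking', 'causes', 'cancer', '.'], 'relation': 1},
    {'token': ['Rain', 'made', 'the', 'road', 'wet', '.'], 'relation': 1},
    {'token': ['Rain', 'made', 'the', 'road', 'wet', '.'], 'relation': 0},
]
dev_records = [{'token': ['Smoking', 'causes', 'cancer', '.'], 'relation': 1}]

kept = drop_overlaps(train_records, dev_records)
print(len(kept))  # 2
```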


## Computing the baseline

In [7]:
dev = pd.read_json('../data/causal/splits/dev.json')
train = pd.read_json('../data/causal/splits/train.json')

labels = []
preds = []
train_labels = {}
dev_labels = {}

for index, row in train.iterrows():
    if row['relation'] in train_labels:
        train_labels[row['relation']] += 1
    else:
        train_labels[row['relation']] = 1

for index, row in dev.iterrows():
    labels.append(row['relation'])

c = Counter(labels)
major_label, count = c.most_common()[0]
    
for i in range(len(labels)):
    if labels[i] in dev_labels:
        dev_labels[labels[i]] += 1
    else:
        dev_labels[labels[i]] = 1
                
    preds.append(major_label)

print("train labels: {}".format(train_labels))
print("dev labels: {}".format(dev_labels))

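# note: scikit-learn metrics expect (y_true, y_pred). The calls below pass (preds, labels),
# so support, precision and recall are computed with the predictions treated as gold.
# Micro- and macro-F1 are unaffected by the order; weighted-F1 and the report columns are.
print("micro-f1: {}".format(f1_score(preds, labels, average='micro')))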
print("macro-f1: {}".format(f1_score(preds, labels, average='macro')))
print("weighted-f1: {}".format(f1_score(preds, labels, average='weighted')))

print(confusion_matrix(preds, labels))
print(classification_report(preds, labels))

train labels: {1: 2957, 0: 901}
dev labels: {1: 404, 0: 217}
micro-f1: 0.6505636070853462
macro-f1: 0.39414634146341465
weighted-f1: 0.7882926829268293
[[  0   0]
 [217 404]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.65      0.79       621

    accuracy                           0.65       621
   macro avg       0.50      0.33      0.39       621
weighted avg       1.00      0.65      0.79       621



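
As a cross-check, the majority-class baseline can be recomputed by hand from the dev label counts reported above. This is a sketch, assuming that weighted-F1 weights each class by its gold support; the helper name `majority_baseline_f1` is illustrative.

```python
from collections import Counter

def majority_baseline_f1(gold):
    """Per-class F1 when every example is assigned the most frequent gold label."""
    counts = Counter(gold)
    major, n_major = counts.most_common(1)[0]
    f1 = {}
    for label in counts:
        # only the majority label is ever predicted
        tp = n_major if label == major else 0
        fp = len(gold) - n_major if label == major else 0
        fn = 0 if label == major else counts[label]
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    return major, f1

# dev label counts reported above: 404 causal (1), 217 non-causal (0)
gold = [1] * 404 + [0] * 217
major, f1 = majority_baseline_f1(gold)

accuracy = gold.count(major) / len(gold)
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[l] * gold.count(l) for l in f1) / len(gold)

print(round(accuracy, 4), round(macro_f1, 4), round(weighted_f1, 4))  # 0.6506 0.3941 0.5128
```

Accuracy and macro-F1 agree with the printed micro- and macro-F1. The weighted-F1 differs from the printed 0.79 because the report above takes its support from the predictions, where all 621 examples fall in class 1.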
