## Converting CREST to popular formats
In this notebook, we show how CREST can be converted to popular and useful formats such as BRAT, which is used for relation annotation, or TACRED, a popular dataset for relation extraction.

In [1]:
import os
import sys

root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)

import ast
import json
import pandas as pd
import numpy as np

from crest.utils import crest2tacred

## Loading CREST data

In [2]:
# loading the CREST-formatted data
df = pd.read_excel('../data/causal/crest.xlsx', index_col=[0])

# rows with a missing split value are assigned to split 0 (train)
df.loc[df['split'].isna(), 'split'] = 0

# make sure no missing split values remain
assert df['split'].isna().sum() == 0

In [3]:
df.head()

Unnamed: 0,original_id,span1,span2,signal,context,idx,label,source,ann_file,split
0,1,['tumor shrinkage'],['radiation therapy'],[],The period of tumor shrinkage after radiation ...,"{'span1': [[14, 29]], 'span2': [[36, 53]], 'si...",2,1,,0.0
1,2,['Habitat degradation'],['stream channels'],[],Habitat degradation from within stream channel...,"{'span1': [[0, 19]], 'span2': [[32, 47]], 'sig...",0,1,,0.0
2,3,['discomfort'],['traveling'],[],Earplugs relieve the discomfort from traveling...,"{'span1': [[21, 31]], 'span2': [[37, 46]], 'si...",2,1,,0.0
3,4,['daily terror'],['antipersonnel land mines'],[],We continue to see progress toward a world fre...,"{'span1': [[55, 67]], 'span2': [[71, 95]], 'si...",2,1,,0.0
4,5,['segment'],['anecdotes'],[],The Global Warming segment starts off with two...,"{'span1': [[19, 26]], 'span2': [[53, 62]], 'si...",0,1,,0.0
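The `idx` column in the preview holds character offsets into `context`. A minimal sketch of how those offsets can be used to recover the span text is given here. It assumes that `idx` is stored as the string form of a Python dict and that each pair is a `[start, end)` character range; the row is a shortened, hypothetical version of row 2 in the preview, and `span_text` is an illustrative helper, not part of CREST.

```python
import ast

# Hypothetical, shortened row modeled on the preview (only span1/span2 are kept).
context = "Earplugs relieve the discomfort from traveling"
idx_raw = "{'span1': [[21, 31]], 'span2': [[37, 46]]}"

# Assumption: the cell holds the string form of a dict, so it is parsed
# with ast.literal_eval (safe for literals) instead of eval.
idx = ast.literal_eval(idx_raw)


def span_text(text, offsets):
    """Join the substrings covered by a list of [start, end) pairs."""
    return " ".join(text[start:end] for start, end in offsets)


span1 = span_text(context, idx["span1"])
span2 = span_text(context, idx["span2"])
print(span1, "|", span2)
```

Comparing the recovered text with the `span1` and `span2` columns is a quick sanity check that the offsets and the context are still aligned.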


## Preparing data splits
In this section, we prepare the train and dev/test splits for our experiments. We first separate the positive (causal) and negative (non-causal) samples, then split each group by `frac_val` (e.g., `frac_val = 0.8` gives 80%/20% for train and dev/test, respectively). Finally, we combine the train portions of the positive and negative samples, and likewise the dev/test portions. This keeps the ratio of positive to negative samples the same in train and dev/test, so neither split ends up dominated by one class.

To include a dataset in this operation, add its CREST source code to the `source_codes` list.
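The per-class split described above can be illustrated on toy data. This is a sketch under stated assumptions: the DataFrame, its values, and the helper `split_class` are made up for illustration, and only the column name `label` and the variable `frac_val` follow the notebook.

```python
import pandas as pd

# Toy data: 10 non-causal rows (label 0) and 20 causal rows (labels 1 and 2).
toy = pd.DataFrame({
    "label": [0] * 10 + [1] * 12 + [2] * 8,
    "context": ["sentence {}".format(i) for i in range(30)],
})
frac_val = 0.8


def split_class(frame, frac, seed=42):
    """Sample `frac` of the rows for train; the remaining rows go to dev."""
    train = frame.sample(frac=frac, random_state=seed)
    dev = frame.drop(train.index)
    return train, dev


# Split each class on its own so both splits keep the same class ratio.
neg_train, neg_dev = split_class(toy[toy["label"] == 0], frac_val)
pos_train, pos_dev = split_class(toy[toy["label"].isin([1, 2])], frac_val)

# Recombine and shuffle.
toy_train = pd.concat([neg_train, pos_train]).sample(frac=1, random_state=42)
toy_dev = pd.concat([neg_dev, pos_dev]).sample(frac=1, random_state=42)

print("train: {}, dev: {}".format(len(toy_train), len(toy_dev)))
```

Because each class is sampled separately, the negative-to-positive ratio (1:2 in this toy data) is the same in both splits.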

### Creating train and dev

In [4]:
# [1] selecting SemEval sub-data
source_codes = [3, 4, 5, 6, 7]
sem_df = df[df['source'].isin(source_codes)]
frac_val = 0.8

# [1.1] getting number of causal (positive) and non-causal (negative) samples
n_pos = len(sem_df[sem_df['label'].isin([1, 2])])
n_neg = len(sem_df[sem_df['label'].isin([0])])
print("(original data) positive: {}, negative: {}".format(n_pos, n_neg))

# [2] train and dev splits
# [2.1] non-causal train and dev
sem_neg = sem_df.loc[sem_df['label'] == 0].sample(n=n_neg, random_state=42)
neg_train = sem_neg.sample(frac=frac_val, random_state=42)
neg_dev = sem_neg.drop(neg_train.index)

# [2.2] causal train and dev
sem_pos = sem_df.loc[sem_df['label'].isin([1, 2])].sample(n=n_pos, random_state=42)
pos_train = sem_pos.sample(frac=frac_val, random_state=42)
pos_dev = sem_pos.drop(pos_train.index)

# concatenating causal and non-causal samples 
train_frames = [neg_train, pos_train]
dev_frames = [neg_dev, pos_dev]

train_df = pd.concat(train_frames)
dev_df = pd.concat(dev_frames)

# shuffling samples
train_df = train_df.sample(frac=1, random_state=42)
dev_df = dev_df.sample(frac=1, random_state=42)

# changing the ternary classes to binary (1 and 2 are both causal)
train_df.loc[train_df['label'] == 2, 'label'] = 1
dev_df.loc[dev_df['label'] == 2, 'label'] = 1

print("train: {}, dev: {}".format(len(train_df), len(dev_df)))

(original data) positive: 4273, negative: 2369
train: 5313, dev: 1329


## Convert CREST to TACRED
TACRED is a well-known and popular dataset for relation extraction. It has been used in many studies as a benchmark for evaluating relation extraction models, including, but not limited to, popular language models such as BERT. That is why we decided to include TACRED as one of the formats that CREST can be converted to. You can find TACRED here: https://nlp.stanford.edu/projects/tacred/
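To give a rough idea of what such a conversion involves, here is a hypothetical sketch that maps character spans to token indices and builds a TACRED-style record. It is not the implementation of `crest2tacred`, whose output may differ. The assumptions are: whitespace tokenization, `span1` treated as the subject and `span2` as the object, inclusive end indices, and only a subset of TACRED's fields (`id`, `relation`, `token`, `subj_start`, `subj_end`, `obj_start`, `obj_end`). All function names are made up for illustration.

```python
def tokens_with_offsets(text):
    """Whitespace-tokenize `text`, returning (token, start, end) triples."""
    triples, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        triples.append((tok, start, end))
        pos = end
    return triples


def char_to_token_span(triples, start, end):
    """Inclusive indices of the first and last token overlapping [start, end).

    Assumes the character span overlaps at least one token.
    """
    hits = [i for i, (_, s, e) in enumerate(triples) if s < end and e > start]
    return hits[0], hits[-1]


def to_tacred_like(sample_id, context, span1, span2, label):
    """Build a TACRED-style dict from one context and two character spans."""
    triples = tokens_with_offsets(context)
    subj_start, subj_end = char_to_token_span(triples, *span1)
    obj_start, obj_end = char_to_token_span(triples, *span2)
    return {
        "id": sample_id,
        "relation": label,
        "token": [tok for tok, _, _ in triples],
        "subj_start": subj_start,
        "subj_end": subj_end,
        "obj_start": obj_start,
        "obj_end": obj_end,
    }


# Shortened, hypothetical example sentence with binary label 1 (causal).
record = to_tacred_like(
    "toy-1", "Earplugs relieve the discomfort from traveling", (21, 31), (37, 46), 1
)
print(record)
```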

In [6]:
from sklearn.metrics import f1_score, confusion_matrix, classification_report

splits = {'train': train_df, 'dev': dev_df}

for key, value in splits.items():
    data = crest2tacred(value, key, save_json=True)
    
    ### Computing baseline performance (majority class)
    if key == "dev":
        major_label = int(value['label'].mode()[0])
        relations = {}
        preds = []
        labels = []
        for x in data:
            if x['relation'] in relations:
                relations[x['relation']] += 1
            else:
                relations[x['relation']] = 1

            preds.append(major_label)
            labels.append(int(x['relation']))

        print("total samples: {}".format(relations))
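        # NOTE: scikit-learn metrics expect (y_true, y_pred); the calls below pass
        # (preds, labels), so the support and weighted average in the output are
        # computed over the predictions rather than the gold labels.
        print("macro-f1: {}".format(f1_score(preds, labels, average='macro')))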
        print("micro-f1: {}".format(f1_score(preds, labels, average='micro')))
        print("weighted-f1: {}".format(f1_score(preds, labels, average='weighted')))
        print(confusion_matrix(preds, labels))
        print(classification_report(preds, labels))

total samples: {1: 801, 0: 439}
macro-f1: 0.39245467907888293
micro-f1: 0.6459677419354839
weighted-f1: 0.7849093581577659
[[  0   0]
 [439 801]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.65      0.78      1240

    accuracy                           0.65      1240
   macro avg       0.50      0.32      0.39      1240
weighted avg       1.00      0.65      0.78      1240
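scikit-learn's metric functions take the gold labels first (`f1_score(y_true, y_pred)`), whereas the cell above passes `(preds, labels)`. The following dependency-free sketch recomputes the majority-class baseline with the gold labels as the reference, using only the class counts from the `total samples` line above (801 causal, 439 non-causal). The helper `per_class_f1` is illustrative. Under these assumptions, macro-F1 and accuracy (micro-F1) are unchanged, while weighted-F1 comes out at about 0.51, because class 0 now contributes its 439 gold samples with an F1 of 0.

```python
# Class counts taken from the "total samples" line of the output above.
labels = [1] * 801 + [0] * 439   # gold labels
preds = [1] * len(labels)        # majority-class baseline: always predict 1


def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, returning 0.0 when the class has no TP, FP, or FN."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


classes = sorted(set(labels))
f1 = {c: per_class_f1(labels, preds, c) for c in classes}
support = {c: labels.count(c) for c in classes}  # counted on gold labels

macro_f1 = sum(f1.values()) / len(classes)
weighted_f1 = sum(f1[c] * support[c] for c in classes) / len(labels)
accuracy = sum(t == p for t, p in zip(labels, preds)) / len(labels)

print("macro-f1: {:.4f}".format(macro_f1))
print("accuracy (micro-f1): {:.4f}".format(accuracy))
print("weighted-f1: {:.4f}".format(weighted_f1))
```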

