## Converting CREST to popular formats
In this notebook, we show how CREST can be converted to popular and useful formats such as BRAT, for relation annotation, or TACRED, a popular dataset for relation extraction.

In [1]:
import os
import sys

root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)

import ast
import json
import pandas as pd
import numpy as np

from crest.utils import crest2tacred

## Loading CREST data

In [2]:
# loading the CREST-formatted data
df = pd.read_excel('../data/causal/crest.xlsx', index_col=[0])

# if a sample has no split (NaN), assign it to split 0 (train)
df.loc[np.isnan(df['split']), 'split'] = 0

# make sure no NaN split values remain
assert len(df.loc[np.isnan(df['split'])]) == 0

In [3]:
df.head()

| Unnamed: 0 | original_id | span1 | span2 | signal | context | idx | label | source | ann_file | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ['tumor shrinkage'] | ['radiation therapy'] | [] | The period of tumor shrinkage after radiation ... | {'span1': [[14, 29]], 'span2': [[36, 53]], 'si... | 2 | 1 | | 0.0 |
| 1 | 2 | ['Habitat degradation'] | ['stream channels'] | [] | Habitat degradation from within stream channel... | {'span1': [[0, 19]], 'span2': [[32, 47]], 'sig... | 0 | 1 | | 0.0 |
| 2 | 3 | ['discomfort'] | ['traveling'] | [] | Earplugs relieve the discomfort from traveling... | {'span1': [[21, 31]], 'span2': [[37, 46]], 'si... | 2 | 1 | | 0.0 |
| 3 | 4 | ['daily terror'] | ['antipersonnel land mines'] | [] | We continue to see progress toward a world fre... | {'span1': [[55, 67]], 'span2': [[71, 95]], 'si... | 2 | 1 | | 0.0 |
| 4 | 5 | ['segment'] | ['anecdotes'] | [] | The Global Warming segment starts off with two... | {'span1': [[19, 26]], 'span2': [[53, 62]], 'si... | 0 | 1 | | 0.0 |
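
The span and `idx` columns above look like Python literals. If they are stored as strings (an assumption, since the display alone does not tell), they need to be decoded before use. The sketch below uses a toy row built only from the visible part of row 2. The real `context` and `idx` values are truncated in the output, so the `signal` key is left out here rather than guessed.

```python
import ast

# Toy row modeled on the visible part of row 2 (hypothetical, for illustration).
row = {
    "span1": "['discomfort']",
    "span2": "['traveling']",
    "context": "Earplugs relieve the discomfort from traveling",
    "idx": "{'span1': [[21, 31]], 'span2': [[37, 46]]}",
}

# Decode the string-encoded list and dict values into Python objects.
decoded = {
    key: ast.literal_eval(value) if key != "context" else value
    for key, value in row.items()
}

# Each entry in `idx` is a [start, end) character offset into `context`.
span1_text = [decoded["context"][s:e] for s, e in decoded["idx"]["span1"]]
span2_text = [decoded["context"][s:e] for s, e in decoded["idx"]["span2"]]
print(span1_text, span2_text)
```

On the full DataFrame, the same decoding could be applied per column with `df['idx'].apply(ast.literal_eval)`, again assuming the values are strings.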


In [8]:
# a: all samples from source 7; b: the subset of those samples in split 0
a = df.loc[df['source'] == 7]
b = a.loc[a['split'] == 0]
len(a), len(b)

(729, 729)

## Convert CREST to TACRED
TACRED is a well-known dataset for relation extraction. It has been used in many studies as a benchmark for evaluating relation extraction models, including but not limited to popular language models such as BERT. That is why we decided to include TACRED as one of the formats that CREST can be converted to. You can find TACRED here: https://nlp.stanford.edu/projects/tacred/

In [11]:
crest_tacred = crest2tacred(df, 'dev', [0], [7], save_json=True)
len(crest_tacred)

405

In [5]:
crest_tacred[100]

{'span1_start': 0,
 'span1_end': 0,
 'span2_start': 2,
 'span2_end': 3,
 'id': '1011',
 'token': ['Inhibition',
  'through',
  'synaptic',
  'depression',
  'is',
  'unlike',
  'the',
  'previous',
  'forms',
  'of',
  'inhibition',
  'in',
  'that',
  'it',
  'turns',
  'on',
  'more',
  'slowly',
  'and',
  'thus',
  'acts',
  'as',
  'delayed',
  'negative',
  'feedback.'],
 'relation': 1}
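
The record above stores spans as token indices, while CREST stores them as character offsets. The sketch below shows one way such a mapping could work. It is not the actual `crest2tacred` implementation: whitespace tokenization, inclusive end indices, and the helper name `char_span_to_token_span` are assumptions made for illustration. The sentence is rebuilt by joining the tokens shown above, and the character offsets were computed from it by hand.

```python
def char_span_to_token_span(text, start, end):
    """Map a [start, end) character span to inclusive token indices.

    Tokens come from whitespace splitting (an assumption for this sketch).
    """
    tokens, first, last = [], None, None
    pos = 0
    for i, tok in enumerate(text.split()):
        tok_start = text.index(tok, pos)  # character offset where this token begins
        tok_end = tok_start + len(tok)
        pos = tok_end
        tokens.append(tok)
        # A token belongs to the span if the two character ranges overlap.
        if tok_start < end and tok_end > start:
            if first is None:
                first = i
            last = i
    return tokens, first, last


sentence = ("Inhibition through synaptic depression is unlike the previous "
            "forms of inhibition in that it turns on more slowly and thus "
            "acts as delayed negative feedback.")

# "Inhibition" covers characters [0, 10); "synaptic depression" covers [19, 38).
tokens, s1, e1 = char_span_to_token_span(sentence, 0, 10)
_, s2, e2 = char_span_to_token_span(sentence, 19, 38)
print((s1, e1), (s2, e2))
```

For this sentence the sketch prints `(0, 0) (2, 3)`, which agrees with the `span1_start`/`span1_end` and `span2_start`/`span2_end` values in the record above.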