# Preprocessing of TAC-based Relation Extraction dataset

This notebook shows how to preprocess data stored in CoNLL format, a popular format for NLP datasets, for use with the Knodle framework.

To show how it works, we use a relation extraction dataset based on the TAC KBP corpora (Surdeanu (2013)), also used in Roth (2014). The TAC dataset was annotated with entity pairs extracted from Freebase (Google (2014)), where the corresponding relations have been mapped to the 41 TAC relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members).

To show the whole process of weak annotation, we have reconstructed the entity pairs and used them to annotate the dataset from scratch. As development and test sets we use the gold corpus annotated via crowdsourcing and human labeling from KBP (Zhang et al. (2017)).

Importantly, in this dataset we preserve the samples where no rule matched as __negative samples__, which is considered good practice in many NLP tasks, e.g. relation extraction.

The steps are the following:
- the input data files are downloaded from the MinIO storage:
    - raw train data saved in .conll format
    - gold-annotated dev data saved in .conll format
    - gold-annotated test data saved in .conll format
    - list of rules (namely, Freebase entity pairs) with corresponding classes
    - list of classes
- the list of rules with corresponding classes is transformed into the mapping_rules_labels_t matrix
- the unlabelled train data are read from the .conll file and annotated with entity pairs. Based on these annotations, the rule_matches_z matrix and a DataFrame with train samples are generated
- the already annotated dev and test data are read from the .conll files together with their gold labels and stored as DataFrames.

## Imports

First, let's make some basic imports.

In [1]:
import argparse
import sys
import os
from pathlib import Path
import logging
from typing import Dict, Union, Tuple
from minio import Minio
import random
from IPython.display import HTML
import csv

import numpy as np
import pandas as pd
import scipy.sparse as sp
from joblib import dump
from tqdm.auto import tqdm

from knodle.trainer.utils import log_section

pd.set_option('display.max_colwidth', None)  # None = do not truncate columns (-1 is deprecated in recent pandas)
np.set_printoptions(threshold=sys.maxsize)

In [2]:
# define the files names
Z_MATRIX_OUTPUT_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_OUTPUT_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_OUTPUT_TEST = "test_rule_matches_z.lib"

T_MATRIX_OUTPUT_TRAIN = "mapping_rules_labels_t.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# file names for .csv files
TRAIN_SAMPLES_OUTPUT_CSV = "df_train.csv"
DEV_SAMPLES_OUTPUT_CSV = "df_dev.csv"
TEST_SAMPLES_OUTPUT_CSV = "df_test.csv"

# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/TAC"
os.path.join(data_path)

'../../../data_from_minio/TAC'

## Download the dataset

This dataset, like all datasets provided with Knodle, can be downloaded from the MinIO storage with the MinIO client.

In [4]:
client = Minio("knodle.cc", secure=False)
files = ["train.conll", "dev.conll", "test.conll", "labels.txt", "rules.csv"]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/tac", file),
        file_path=os.path.join(data_path, file),
    )


In [5]:
# set paths to input data
path_labels = os.path.join(data_path, "labels.txt")
path_rules = os.path.join(data_path, "rules.csv")
path_train_data = os.path.join(data_path, "train.conll")
path_dev_data = os.path.join(data_path, "dev.conll")
path_test_data = os.path.join(data_path, "test.conll")

## Labels & Rules Data Preprocessing

### Get labels

First, let's read labels from the file with the corresponding label ids.

In [6]:
labels2ids = {}
with open(path_labels, encoding="UTF-8") as file:
    for line in file.readlines():
        relation, relation_enc = line.replace("\n", "").split(",")
        labels2ids[relation] = int(relation_enc)

num_classes = len(labels2ids)

In [7]:
print(labels2ids)

{'per:alternate_names': 0, 'per:date_of_birth': 1, 'per:age': 2, 'per:country_of_birth': 3, 'per:stateorprovince_of_birth': 4, 'per:city_of_birth': 5, 'per:origin': 6, 'per:date_of_death': 7, 'per:country_of_death': 8, 'per:stateorprovince_of_death': 9, 'per:city_of_death': 10, 'per:cause_of_death': 11, 'per:countries_of_residence': 12, 'per:stateorprovinces_of_residence': 13, 'per:cities_of_residence': 14, 'per:schools_attended': 15, 'per:title': 16, 'per:employee_of': 17, 'per:religion': 18, 'per:spouse': 19, 'per:children': 20, 'per:parents': 21, 'per:siblings': 22, 'per:other_family': 23, 'per:charges': 24, 'org:alternate_names': 25, 'org:members': 26, 'org:member_of': 27, 'org:subsidiaries': 28, 'org:political/religious_affiliation': 29, 'org:top_members/employees': 30, 'org:number_of_employees/members': 31, 'org:parents': 32, 'org:founded_by': 33, 'org:founded': 34, 'org:country_of_headquarters': 35, 'org:stateorprovince_of_headquarters': 36, 'org:city_of_headquarters': 37, 'org:

### Get rules

Second, the rules (in our case, entity pairs extracted from Freebase) are read from a separate csv file, where each rule is stored with its corresponding _label_ and _label_id_ (the _label_ to _label_id_ correspondence is the same as in the labels file).

In [8]:
rules = pd.read_csv(path_rules)
num_rules_from_file = len(rules)
rules

Unnamed: 0,rule,rule_id,label,label_id
0,ATG Art_Technology_Group,0,org:alternate_names,25
1,Union_Cycliste_Internationale UCI,1,org:alternate_names,25
2,UCI Union_Cycliste_Internationale,2,org:alternate_names,25
3,Hanwha 한화,3,org:alternate_names,25
4,Radio_Free_Europe Radio_Liberty,4,org:alternate_names,25
...,...,...,...,...
247315,Ginger_Baker drums,212032,per:title,16
247316,painter Qi_Baishi,212033,per:title,16
247317,Qi_Baishi painter,212034,per:title,16
247318,engineer David_Lennox,212035,per:title,16


Most rules correspond to exactly one class. However, a rule can also correspond to several classes. For example, the _"Oracle, New_York"_ entity pair can stand for both the _org:stateorprovince_of_headquarters_ and the _org:city_of_headquarters_ relations. In such cases the information about all corresponding classes is kept and reflected in the _mapping_rules_labels_t_ matrix we are going to build in the next section.

### Get rules to classes correspondence matrix

Before that, let's build two dictionaries from this DataFrame that we are going to use later:
- rule to rule id mapping
- rule id to label ids mapping

In [9]:
rule2rule_id = dict(zip(rules["rule"], rules["rule_id"]))

rules_n_label_ids = rules[["rule_id", "label_id"]].groupby('rule_id')
rule2label = rules_n_label_ids['label_id'].apply(lambda s: s.tolist()).to_dict()

num_rules = max(rules.rule_id.values) + 1
print(f"Number of rules: {num_rules}")

Number of rules: 212037


Finally, let's build the `mapping_rules_labels_t` matrix with the information about which rule corresponds to which class.

In [10]:
def get_mapping_rules_labels_t(rule2label: Dict, num_classes: int) -> np.ndarray:
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    mapping_rules_labels_t = np.zeros([len(rule2label), num_classes])
    for rule, labels in rule2label.items():
        mapping_rules_labels_t[rule, labels] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, num_classes)
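
To make the layout of this matrix concrete, here is a minimal sketch on toy data. The three rules and three classes are made up for illustration and are not taken from the TAC files. Rule 2 maps to two classes, as in the "Oracle, New_York" case, so its row contains two ones.

```python
import numpy as np

# Toy mapping (assumption for illustration): rule id -> list of label ids.
toy_rule2label = {0: [1], 1: [0], 2: [1, 2]}
toy_num_classes = 3

toy_t = np.zeros((len(toy_rule2label), toy_num_classes))
for rule_id, label_ids in toy_rule2label.items():
    # An integer row index combined with a list of column indices
    # sets all cells of that row listed in label_ids at once.
    toy_t[rule_id, label_ids] = 1

print(toy_t)
```

Note that sizing the matrix with `len(rule2label)` only works when the rule ids are contiguous and start at 0, which is the case for the toy mapping.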

## Train data preprocessing

The train data should be annotated with the rules we already have. Remember, there are no gold labels for it (as opposed to the development and test data). To preserve samples without rule matches as negative samples in the training set, we do not eliminate them but add them to the preprocessed data with empty rule and rule_id values.

So, the annotation is done in the following way:
- the sentences are extracted from the .conll file
- the pair of token spans tagged as subject and object is looked up in the rules list
- if the pair forms a rule from the rules list, the sentence is added to the train set together with the matched rule and its rule id
- if it does not, the sentence is added to the train set with an empty rule match
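
The lookup step can be sketched in isolation as follows. This is a simplified illustration, not the notebook's full implementation: `build_rule_key`, `lookup_rule` and `toy_rules` are hypothetical names, and the rule list contains a single entry.

```python
from typing import Dict, List, Optional, Tuple


def build_rule_key(subj: List[str], obj: List[str], subj_start: int, obj_start: int) -> str:
    """Join each entity with "_" and order the two entities by their position in the sentence."""
    first, second = (subj, obj) if subj_start < obj_start else (obj, subj)
    return "_".join(first) + " " + "_".join(second)


def lookup_rule(key: str, rule2rule_id: Dict[str, int]) -> Tuple[Optional[str], Optional[int]]:
    """Return (rule, rule_id) for a match and (None, None) for a negative sample."""
    if key in rule2rule_id:
        return key, rule2rule_id[key]
    return None, None


toy_rules = {"ATG Art_Technology_Group": 0}  # toy rule list (assumption)

# The object "ATG" is token 5, the subject "Art Technology Group" starts at token 7.
key_match = build_rule_key(["Art", "Technology", "Group"], ["ATG"], subj_start=7, obj_start=5)
print(lookup_rule(key_match, toy_rules))

# A pair that is absent from the rule list is kept with an empty match.
key_miss = build_rule_key(["Douglas", "Flint"], ["chairman"], subj_start=1, obj_start=5)
print(lookup_rule(key_miss, toy_rules))
```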

In [16]:
def count_file_lines(file_name: str) -> int:
    """ Count the number of lines in a file """
    with open(file_name, encoding='utf-8') as f:
        return len(f.readlines())

In [17]:
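# peek at the first 30 lines; the context manager closes the file afterwards
with open(path_train_data, encoding="utf-8") as train_file:
    for i in range(30):
        line = train_file.readline()
        print(line)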

#	index	token	subj	subj_type	obj	obj_type	stanford_pos	stanford_ner	stanford_deprel	stanford_head

# id=E0065795:0-pos docid=E0065795:0 reln=org:alternate_names

1	Profile	_	_	_	_	VB	O	advmod	16

2	,	_	_	_	_	,	O	punct	16

3	basic	_	_	_	_	JJ	O	amod	4

4	information	_	_	_	_	NN	O	compound	5

5	ATG	_	_	OBJECT	ORG	NNP	ORG	nsubj	16

6	(	_	_	_	_	-LRB-	O	punct	11

7	Art	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

8	Technology	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

9	Group	SUBJECT	ORGANIZATION	_	_	NNP	ORG	nmod	11

10	,	_	_	_	_	,	O	punct	11

11	Inc.	_	_	_	_	NNP	GPE	appos	5

12	,	_	_	_	_	,	O	punct	11

13	NASDAQ	_	_	_	_	NNP	ORG	npadvmod	11

14	:	_	_	_	_	:	O	punct	13

15	)	_	_	_	_	-RRB-	O	punct	11

16	makes	_	_	_	_	VBZ	O	ROOT	16

17	software	_	_	_	_	NN	O	dobj	16

18	and	_	_	_	_	CC	O	cc	16

19	delivers	_	_	_	_	VBZ	O	conj	16

20	e	_	_	_	_	NN	O	nmod	22

21	-	_	_	_	_	HYPH	O	punct	22

22	commerce	_	_	_	_	NN	O	nmod	26

23	and	_	_	_	_	CC	O	cc	22

24	Web	_	_	_	_	NN	O	compound	25

25	marketing	_	_	_	

In [18]:
def extract_subj_obj_middle_words(line: str, subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, sample: str):
    splitted_line = line.split("\t")
    token = splitted_line[1]
    if splitted_line[2] == "SUBJECT":
        if not subj_min_token_id:
            subj_min_token_id = int(splitted_line[0])
        subj.append(token)
        sample += " " + token
    elif splitted_line[4] == "OBJECT":
        if not obj_min_token_id:
            obj_min_token_id = int(splitted_line[0])
        obj.append(token)
        sample += " " + token
    else:
        if (bool(subj) and not bool(obj)) or (not bool(subj) and bool(obj)):
            sample += " " + token
    return subj, obj, subj_min_token_id, obj_min_token_id, sample

def get_rule_n_rule_id(subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, rule2rule_id: dict) -> Union[Tuple[str, int], Tuple[None, None]]:
    if subj_min_token_id < obj_min_token_id:
        rule = "_".join(subj) + " " + "_".join(obj)
    else:
        rule = "_".join(obj) + " " + "_".join(subj)
    if rule in rule2rule_id.keys():
        return rule, rule2rule_id[rule]
    return None, None

def encode_labels(label: str, label2id: dict) -> Union[int, None]:
    """ Encodes a label with the corresponding label id. If the relation is unknown, prints a warning and returns None """
    if label in label2id:
        return label2id[label]
    else:
        print(f"Warning! There is a {label} label found which is not in the list of TAC relations! The sentence will be skipped.")
        return None

def verbose(processed_lines: int, num_lines: int) -> None:
    step = max(1, int(round(num_lines / 10)))  # avoid a zero step (ZeroDivisionError) for very short files
    if processed_lines % step == 0:
        print(f"Processed {processed_lines / num_lines * 100 :0.0f}%")


def annotate_conll_data_with_lfs(conll_data: str, rule2rule_id: Dict, labels2ids: Dict = None) -> pd.DataFrame:
    num_lines = count_file_lines(conll_data)
    processed_lines = 0
    samples, rules, enc_rules, labels, enc_labels = [], [], [], [], []
    with open(conll_data, encoding='utf-8') as f:
        for line in f:
            processed_lines += 1
            line = line.strip()
            if line.startswith("# id="):  # Instance starts
                sample = ""
                subj, obj = [], []
                subj_min_token_id, obj_min_token_id = None, None
                if labels2ids:
                    label = line.split(" ")[3][5:]
                    label_id = encode_labels(label, labels2ids)
            elif line == "":  # Instance ends
                if len(subj) == 0 or len(obj) == 0:  # annotation mistake: no token was annotated as subj/obj
                    continue
                if labels2ids and label_id is None:  # unknown gold label: skip the sentence, as the warning says
                    continue
                rule, rule_id = get_rule_n_rule_id(subj, obj, subj_min_token_id, obj_min_token_id, rule2rule_id)
                samples.append(sample.lstrip())
                rules.append(rule)
                enc_rules.append(rule_id)
                if labels2ids:  # keep the label lists aligned with the sample list
                    labels.append(label)
                    enc_labels.append(label_id)
            elif line.startswith("#"):  # comment
                continue
            else:
                subj, obj, subj_min_token_id, obj_min_token_id, sample = extract_subj_obj_middle_words(line, subj, obj, subj_min_token_id, obj_min_token_id, sample)
            verbose(processed_lines, num_lines)
            
    print(f"Preprocessing of {conll_data.split('/')[-1]} file is finished.")
    if labels2ids:
        return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules, "labels": labels, "enc_labels": enc_labels}) 
    return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules})

In [19]:
train_data = annotate_conll_data_with_lfs(path_train_data, rule2rule_id)

Processed 10%
Processed 20%
Processed 30%
Processed 40%
Processed 50%
Processed 60%
Processed 70%
Processed 80%
Processed 90%
Preprocessing of train.conll file is finished.


In [15]:
train_data.head()

Unnamed: 0,samples,rules,enc_rules
0,ATG ( Art Technology Group,ATG Art_Technology_Group,0.0
1,) makes software and delivers e - commerce,,
2,Union Cycliste Internationale ( UCI,Union_Cycliste_Internationale UCI,1.0
3,1987 by CTCA and has been awarded the distinguished level,,
4,Union Cycliste Internationale ( UCI,Union_Cycliste_Internationale UCI,1.0


After that we can build the rule_matches_z matrix for the train data and store it as a sparse matrix.

In [14]:
def get_rule_matches_z_matrix(data: pd.DataFrame, num_rules: int) -> sp.csr_matrix:
    """
    Function calculates the z matrix (samples x rules)
    data: pd.DataFrame (samples, matched rules, matched rules id )
    output: sparse z matrix
    """
    # keep only the samples with a matched rule; rows without a match stay all-zero
    matched = data.reset_index().dropna(subset=["enc_rules"])
    row_ids = matched["index"].values
    rule_ids = matched["enc_rules"].values.astype(int)  # NaN values force a float column, so cast back to int
    rule_matches_z_matrix_sparse = sp.csr_matrix(
        (np.ones(len(row_ids)), (row_ids, rule_ids)),
        shape=(len(data.index), num_rules)
    )
    return rule_matches_z_matrix_sparse

In [15]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data, num_rules)
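
As a small illustration of the resulting structure, the sketch below builds a z matrix for five toy samples that mirror the first rows of the train DataFrame (matches in rows 0, 2 and 4). The values and the `toy_` names are assumptions for illustration, and plain NumPy arrays are used instead of a DataFrame.

```python
import numpy as np
import scipy.sparse as sp

# Toy data (assumption): matched rule id per sample, NaN where no rule matched.
toy_enc_rules = np.array([0.0, np.nan, 1.0, np.nan, 1.0])
toy_num_rules = 3

matched = ~np.isnan(toy_enc_rules)
rows = np.flatnonzero(matched)             # indices of the samples with a match
cols = toy_enc_rules[matched].astype(int)  # matched rule ids as integer column indices

toy_z = sp.csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(toy_enc_rules), toy_num_rules),
)
print(toy_z.toarray())
```

Samples without a match (rows 1 and 3) remain all-zero rows, which is how negative samples are represented in the matrix.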

## Dev & Test data preprocessing

The validation and test data are read from the corresponding input files. Although the gold labels are known and can simply be read from the same input CoNLL data, we still annotate the dev and test data with the same rules we used to annotate the train data (namely, Freebase entity pairs). This allows us to evaluate the rules later and to get a baseline result by comparing the known gold labels with the weak labels. However, because the rules are very specific, only a small number of them match in the dev and test data. That is why in the final DataFrame the _rules_ and _enc_rules_ values are None for most of the samples.

Apart from the 41 "meaningful" relations, the validation and test data also contain samples annotated as _"no_relation"_. That's why we need to add one more class to the _labels2ids_ dictionary.

In [16]:
labels2ids["no_relation"] = max(labels2ids.values()) + 1

Now we can process the development and test data. We use the same function as for the training data, with one difference: the labels are also read and stored for each sample.

In [17]:
dev_data = annotate_conll_data_with_lfs(path_dev_data, rule2rule_id, labels2ids)
test_data = annotate_conll_data_with_lfs(path_test_data, rule2rule_id, labels2ids)

Processed 10%
Processed 20%
Processed 30%
Processed 40%
Processed 50%
Processed 60%
Processed 70%
Processed 80%
Processed 90%
Preprocessing of dev.conll file is finished.
Processed 10%
Processed 20%
Processed 30%
Processed 40%
Processed 50%
Processed 60%
Processed 70%
Processed 80%
Processed 90%
Processed 100%
Preprocessing of test.conll file is finished.
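
The gold labels come from the `reln=` field of each instance header in the .conll files. The notebook reads this field positionally with `line.split(" ")[3][5:]`. The sketch below is an alternative that searches for the field by name instead; `parse_relation` is a hypothetical helper and not part of the notebook or of Knodle.

```python
from typing import Optional


def parse_relation(header_line: str) -> Optional[str]:
    """Return the value of the `reln=` field of an instance header, or None if it is absent."""
    for field in header_line.split():
        if field.startswith("reln="):
            return field[len("reln="):]
    return None


header = "# id=E0065795:0-pos docid=E0065795:0 reln=org:alternate_names"
print(parse_relation(header))
```

Looking the field up by name keeps working if the order or the number of header fields changes, at the cost of a short loop per header.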


In [18]:
dev_data.head()

Unnamed: 0,samples,rules,enc_rules,labels,enc_labels
0,Douglas Flint will become chairman,,,per:title,16
1,Jeffrey White in mid-February issued an injunction against Wikileaks after the Zurich-based Bank Julius Baer,,,no_relation,41
2,PARIS 2009-07-07 11:07:32 UTC French media earlier reported that Montcourt,,,per:city_of_death,10
3,"current holdings of Blackstone-operated funds include Universal Orlando , Cadbury Schweppes , Freedom Communications",,,no_relation,41
4,Nepali government and the guerrillas reached in an understanding during summit talks held on July 16 at Prime Minister Girija Prashad Koirala,,,no_relation,41


We also provide `rule_matches_z` matrices for dev and test data in order to calculate the simple majority baseline. They won't be used in any of the denoising algorithms provided in Knodle.

In [19]:
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data, num_rules)
test_rule_matches_z = get_rule_matches_z_matrix(test_data, num_rules)
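
One common way to turn the z and t matrices into a majority vote baseline is sketched below on toy data. This is an illustration under stated assumptions, not necessarily how Knodle computes its baseline: the matrices are made up, and `other_class_id` is a hypothetical id assigned to samples without any rule match.

```python
import numpy as np

# Toy matrices (assumption): 3 samples x 3 rules and 3 rules x 2 classes.
toy_z = np.array([[1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
toy_t = np.array([[1, 0],
                  [1, 0],
                  [0, 1]])
other_class_id = 2  # hypothetical id for samples where no rule matched

votes = toy_z @ toy_t               # per-sample number of votes for each class
predictions = votes.argmax(axis=1)  # class with the most votes (ties go to the lowest id)
predictions[votes.sum(axis=1) == 0] = other_class_id
print(predictions)
```

Comparing such predictions with the gold `enc_labels` column gives a rough picture of how well the rules alone label the data.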

## Statistics

Let's look at some statistics of the data we have collected.

In [20]:
print(f"Number of rules: {num_rules}")
print(f"Dimension of t matrix: {mapping_rules_labels_t.shape}")
print(f"Number of samples in train set: {len(train_data)}")

Number of rules: 212037
Dimension of t matrix: (212037, 41)
Number of samples in train set: 1937211


In [21]:
print(f"Number of samples in dev set: {len(dev_data)}")
dev_stat = dev_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(dev_stat.to_html(index=False))

Number of samples in dev set: 5368


enc_labels,labels,count
41,no_relation,4015
16,per:title,217
30,org:top_members/employees,139
17,per:employee_of,101
25,org:alternate_names,87
2,per:age,60
12,per:countries_of_residence,59
7,per:date_of_death,54
6,per:origin,49
14,per:cities_of_residence,46


In [22]:
print(f"Number of samples in test set: {len(test_data)}")
test_stat = test_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(test_stat.to_html(index=False))

Number of samples in test set: 18660


enc_labels,labels,count
41,no_relation,14517
16,per:title,626
30,org:top_members/employees,409
17,per:employee_of,351
2,per:age,269
25,org:alternate_names,242
14,per:cities_of_residence,233
12,per:countries_of_residence,199
6,per:origin,164
24,per:charges,125


## Save files

Finally, we save all the data we have produced.

In [23]:
Path(os.path.join(data_path, "processed")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed", T_MATRIX_OUTPUT_TRAIN))

dump(train_data["samples"], os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT))
train_data["samples"].to_csv(os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT_CSV), header=True)
dump(train_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TRAIN))

dump(dev_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT))
dev_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT_CSV), header=True)
dump(dev_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_DEV))

dump(test_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT))
test_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT_CSV), header=True)
dump(test_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TEST))

['../../../data_from_minio/TAC/processed/test_rule_matches_z.lib']
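
The saved files can be read back with `joblib.load`. The sketch below shows the round trip on a toy sparse matrix written to a temporary directory, so it does not touch the real output files. The matrix content is an assumption for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from joblib import dump, load

# Toy z matrix (assumption): 3 samples x 2 rules.
toy_z = sp.csr_matrix(np.array([[1, 0], [0, 0], [0, 1]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_rule_matches_z.lib")
    dump(toy_z, path)      # same call as used for the real matrices
    restored = load(path)  # returns the object that was dumped

print(restored.shape)
```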

## Finish

Congrats! Now we have all the data we need to launch Knodle on weakly-annotated TAC-based data.