# TAC-based Relation Extraction dataset

This notebook shows how to preprocess data in CoNLL format for use in the Knodle framework.

The dataset preprocessed here is a weakly supervised dataset built on the Knowledge Base Population (KBP) challenges of the Text Analysis Conference (TAC). For development and testing, we use the KBP corpus annotated via crowdsourcing and human labeling (Zhang et al. (2017)). Training is done on a weakly supervised, noisy dataset based on the TAC KBP corpora (Surdeanu (2013)), which was also used in Roth (2014). The TAC dataset was annotated with entity pairs extracted from Freebase (Google (2014)), whose relations were mapped to the 41 TAC relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members). The number of entity pairs per relation was limited to 10,000, and each entity pair may be mentioned in no more than 500 sentences.

Additionally, if no rule matched a sentence, it was added to the dataset with the no_relation label.

## Imports

First, let's make some basic imports.

In [122]:
import argparse
import sys
import os
from pathlib import Path
import logging
from typing import Dict
from minio import Minio

import numpy as np
import pandas as pd
import scipy.sparse as sp
from joblib import dump
from tqdm.auto import tqdm

from knodle.trainer.utils import log_section

logger = logging.getLogger(__name__)
PRINT_EVERY = 1000000

In [128]:
# define the file names
Z_MATRIX_OUTPUT_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_OUTPUT_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_OUTPUT_TEST = "test_rule_matches_z.lib"

T_MATRIX_OUTPUT_TRAIN = "mapping_rules_labels.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

data_path = "../../data_from_minio"

## Download the dataset

This dataset, like all datasets provided in Knodle, is stored on a MinIO server and can be easily downloaded with the MinIO client.

In [127]:
def get_conll_config():
    config = {
        "minio_url": "knodle.dm.univie.ac.at",
        "minio_user": "UnM_LN*jSYK74Iz4",
        "minio_pw": "cQOs4|9Dr2_+HuFKneC8@dRgAtrV21i4Dumy",
        "minio_bucket": "knodle",
        "minio_prefix": "datasets/conll",
        "minio_files": [
            "train.conll",
            "dev.conll",
            "test.conll",
            "rules.csv",
            "labels.txt"
        ],
        "data_dir": data_path,
        "num_features": 400,
        "num_classes": 2,
    }
    return config

config = get_conll_config()
client = Minio(config.get("minio_url"), secure=False)

for file in tqdm(config.get("minio_files")):
    client.fget_object(
        bucket_name=config.get("minio_bucket"),
        object_name=os.path.join(config.get("minio_prefix"), file),
        file_path=os.path.join(data_path, "conll_data", file),
    )

2021-03-10 17:27:19,113 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): knodle.dm.univie.ac.at:80
2021-03-10 17:27:19,246 urllib3.connectionpool DEBUG    http://knodle.dm.univie.ac.at:80 "HEAD /knodle/datasets/conll/train.conll HTTP/1.1" 200 0
2021-03-10 17:27:19,317 urllib3.connectionpool DEBUG    http://knodle.dm.univie.ac.at:80 "GET /knodle/datasets/conll/train.conll HTTP/1.1" 200 2209656369





KeyboardInterrupt: 

In [104]:
# set paths to input data
path_labels = os.path.join(data_path, "conll_data", "labels.txt")
path_rules = os.path.join(data_path, "conll_data", "rules.csv")
path_train_data = os.path.join(data_path, "conll_data", "train.conll")
path_dev_data = os.path.join(data_path, "conll_data", "dev.conll")
path_test_data = os.path.join(data_path, "conll_data", "test.conll")

## Preview dataset

In [97]:
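def count_file_lines(file_name: str) -> int:
    """ Count the number of lines in a file """
    # stream the file line by line instead of loading it into memory (train.conll is about 2 GB)
    with open(file_name, encoding="utf-8") as f:
        return sum(1 for _ in f)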

In [102]:
# print the first 30 lines of the training file; the context manager closes the file afterwards
with open(path_train_data, encoding="utf-8") as f:
    for _ in range(30):
        print(f.readline())

#	index	token	subj	subj_type	obj	obj_type	stanford_pos	stanford_ner	stanford_deprel	stanford_head

# id=E0065795:0-pos docid=E0065795:0 reln=org:alternate_names

1	Profile	_	_	_	_	VB	O	advmod	16

2	,	_	_	_	_	,	O	punct	16

3	basic	_	_	_	_	JJ	O	amod	4

4	information	_	_	_	_	NN	O	compound	5

5	ATG	_	_	OBJECT	ORG	NNP	ORG	nsubj	16

6	(	_	_	_	_	-LRB-	O	punct	11

7	Art	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

8	Technology	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

9	Group	SUBJECT	ORGANIZATION	_	_	NNP	ORG	nmod	11

10	,	_	_	_	_	,	O	punct	11

11	Inc.	_	_	_	_	NNP	GPE	appos	5

12	,	_	_	_	_	,	O	punct	11

13	NASDAQ	_	_	_	_	NNP	ORG	npadvmod	11

14	:	_	_	_	_	:	O	punct	13

15	)	_	_	_	_	-RRB-	O	punct	11

16	makes	_	_	_	_	VBZ	O	ROOT	16

17	software	_	_	_	_	NN	O	dobj	16

18	and	_	_	_	_	CC	O	cc	16

19	delivers	_	_	_	_	VBZ	O	conj	16

20	e	_	_	_	_	NN	O	nmod	22

21	-	_	_	_	_	HYPH	O	punct	22

22	commerce	_	_	_	_	NN	O	nmod	26

23	and	_	_	_	_	CC	O	cc	22

24	Web	_	_	_	_	NN	O	compound	25

25	marketing	_	_	_	

### Get labels

First, let's read labels from the file and encode them with ids. 

In [105]:
labels2ids = {}
with open(path_labels, encoding="UTF-8") as file:
    for line in file.readlines():
        relation, relation_enc = line.replace("\n", "").split(",")
        labels2ids[relation] = int(relation_enc)

num_labels = len(labels2ids)

Samples that match no rule should serve as negative samples, so we give them an additional class whose id is one larger than the largest existing label id.

In [106]:
other_class_id = max(labels2ids.values()) + 1

### Get rules

Second, we need the rules. In our case, they are entity pairs extracted from Freebase and stored in a separate CSV file.

In [108]:
rules = pd.read_csv(path_rules)
rules.head()

| Unnamed: 0 | rule | rule_id | label | label_id |
|---|---|---|---|---|
| 0 | Art_Technology_Group ATG | 0 | org:alternate_names | 25 |
| 1 | Union_Cycliste_Internationale UCI | 1 | org:alternate_names | 25 |
| 2 | Hanwha 한화 | 2 | org:alternate_names | 25 |
| 3 | Radio_Free_Europe Radio_Liberty | 3 | org:alternate_names | 25 |
| 4 | Hermès Hermes | 4 | org:alternate_names | 25 |


In [109]:
rule2rule_id = dict(zip(rules.rule, rules.rule_id))
num_rules = max(rules.rule_id.values) + 1

### Get rules to classes correspondence matrix

Lastly, we need to know which rule corresponds to which class. This information is encoded in the t_matrix, which can be built from the rules DataFrame.

In [116]:
def get_t_matrix(rules: pd.DataFrame, num_labels: int) -> np.ndarray:
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    # start from zeros: np.empty would leave arbitrary values in the cells that are never assigned
    rule_assignments_t = np.zeros([rules.rule_id.max() + 1, num_labels])
    for index, row in rules.iterrows():
        rule_assignments_t[row["rule_id"], row["label_id"]] = 1
    return rule_assignments_t

rule_assignments_t = get_t_matrix(rules, num_labels)
dump(sp.csr_matrix(rule_assignments_t), os.path.join(data_path, T_MATRIX_OUTPUT_TRAIN))

['data/mapping_rules_labels.lib']
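
To make the shape of the t matrix concrete, here is a minimal sketch on a made-up rules table. The rule strings, ids and the number of labels are illustrative assumptions, not values from rules.csv, and the vectorized assignment is just one possible way to fill the matrix.

```python
import numpy as np
import pandas as pd

# Toy rules table (hypothetical values): three rules mapped to two classes
toy_rules = pd.DataFrame({
    "rule": ["A_Corp AC", "B_Inc BI", "John_Doe Some_University"],
    "rule_id": [0, 1, 2],
    "label_id": [0, 0, 1],
})
toy_num_labels = 2

# One row per rule, one column per class; a 1 marks the class a rule votes for
toy_t = np.zeros((toy_rules.rule_id.max() + 1, toy_num_labels))
toy_t[toy_rules.rule_id.values, toy_rules.label_id.values] = 1

print(toy_t)
```

Each row contains exactly one 1, because every rule in this toy table corresponds to a single class.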

## Train data preprocessing

Train data should be annotated with the rules we already have. Remember, there are no gold labels (as opposed to the dev and test data).

The annotation is done in the following way: 
- the sentences are extracted from the CoNLL format
- the tokens labelled as subject and object are checked against the list of rules
- if the entity pair is in the list, the sentence is labelled with the corresponding relation
- if not, the sentence is kept without a matched rule, i.e. as a no_relation sample (the filter_out_other parameter is set to False)
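
The steps above can be sketched on a single, hand-written instance. The column order (index, token, subj, subj_type, obj, obj_type) follows the header shown in the preview; the shortened instance and the one-entry rule dictionary are assumptions made for illustration only.

```python
# A shortened toy instance in the same tab-separated layout as the preview
toy_conll = "\n".join([
    "# id=toy-1 docid=toy-1 reln=org:alternate_names",
    "1\tATG\t_\t_\tOBJECT\tORG",
    "2\t(\t_\t_\t_\t_",
    "3\tArt\tSUBJECT\tORGANIZATION\t_\t_",
    "4\tTechnology\tSUBJECT\tORGANIZATION\t_\t_",
    "5\tGroup\tSUBJECT\tORGANIZATION\t_\t_",
    "6\t)\t_\t_\t_\t_",
])
toy_rule2rule_id = {"Art_Technology_Group ATG": 0}  # hypothetical one-rule mapping

tokens, subj, obj = [], [], []
for line in toy_conll.split("\n"):
    if line.startswith("#"):  # instance header or comment
        continue
    fields = line.split("\t")
    tokens.append(fields[1])
    if fields[2] == "SUBJECT":
        subj.append(fields[1])
    elif fields[4] == "OBJECT":
        obj.append(fields[1])

# A rule is the subject tokens followed by the object tokens
rule = "_".join(subj) + " " + "_".join(obj)
matched_rule_id = toy_rule2rule_id.get(rule)  # None would mean that no rule matched

print(" ".join(tokens))
print(rule, matched_rule_id)
```

Note that the object precedes the subject in this sentence, yet the rule string still starts with the subject.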

In [118]:
def annotate_conll_data_with_lfs(conll_data: str, rule2rule_id: Dict, filter_out_other: bool = True) -> pd.DataFrame:
    """
    Reads the .conll input file and annotates each sentence with the rule (entity pair) it matches.
    :param conll_data: path to the input data in .conll format
    :param rule2rule_id: mapping of rules to rule ids
    :param filter_out_other: if True, sentences that match no rule are dropped
    :return: DataFrame with columns "samples", "rules" and "enc_rules"
    """
    num_lines = count_file_lines(conll_data)
    processed_lines = 0
    samples, rules, enc_rules = [], [], []
    with open(conll_data, encoding='utf-8') as f:
        for line in f:
            processed_lines += 1
            line = line.strip()
            if line.startswith("# id="):  # Instance starts
                sample = ""
                subj, obj = {}, {}
            elif line == "":  # Instance ends
                if len(subj) == 0 or len(obj) == 0:
                    continue
                # a rule is always "subject object", whatever their order in the sentence
                rule = "_".join(subj.values()) + " " + "_".join(obj.values())
                if rule in rule2rule_id:
                    samples.append(sample)
                    rules.append(rule)
                    enc_rules.append(rule2rule_id[rule])
                elif not filter_out_other:
                    # keep the sentence as a sample without a matched rule
                    samples.append(sample)
                    rules.append(None)
                    enc_rules.append(None)
            elif line.startswith("#"):  # comment
                continue
            else:
                splitted_line = line.split("\t")
                token = splitted_line[1]
                if splitted_line[2] == "SUBJECT":
                    subj[splitted_line[0]] = token
                elif splitted_line[4] == "OBJECT":
                    obj[splitted_line[0]] = token
                sample += " " + token
            if processed_lines % PRINT_EVERY == 0:
                logger.info("Processed {:0.2f}% of {} file".format(100 * processed_lines / num_lines,
                                                                   os.path.basename(conll_data)))

    logger.info("Data preprocessing is finished")
    return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules})

train_data = annotate_conll_data_with_lfs(path_train_data, rule2rule_id, False)

2021-03-10 17:18:23,628 __main__     INFO     Processed 1.50% of train.conll file
2021-03-10 17:18:25,088 __main__     INFO     Processed 3.00% of train.conll file
2021-03-10 17:18:26,503 __main__     INFO     Processed 4.51% of train.conll file
2021-03-10 17:18:27,940 __main__     INFO     Processed 6.01% of train.conll file
2021-03-10 17:18:29,312 __main__     INFO     Processed 7.51% of train.conll file
2021-03-10 17:18:30,653 __main__     INFO     Processed 9.01% of train.conll file
2021-03-10 17:18:32,011 __main__     INFO     Processed 10.51% of train.conll file
2021-03-10 17:18:33,374 __main__     INFO     Processed 12.02% of train.conll file
2021-03-10 17:18:34,737 __main__     INFO     Processed 13.52% of train.conll file
2021-03-10 17:18:36,103 __main__     INFO     Processed 15.02% of train.conll file
2021-03-10 17:18:37,479 __main__     INFO     Processed 16.52% of train.conll file
2021-03-10 17:18:38,844 __main__     INFO     Processed 18.02% of train.conll file
2021-03-10




After that, we can build the z_matrix for the train data and save it as a sparse matrix.

In [119]:
def get_z_matrix(data: pd.DataFrame, num_rules: int) -> sp.csr_matrix:
    """
    Function calculates the z matrix (samples x rules)
    data: pd.DataFrame (samples, matched rules, matched rules id )
    output: sparse z matrix
    """
    # keep only the samples that matched a rule; their original row positions stay in the "index" column
    data_without_nan = data.reset_index().dropna(subset=["enc_rules"])
    rows = data_without_nan['index'].values
    # the rule ids may be stored as floats once the column contains missing values, so cast them to int
    cols = data_without_nan['enc_rules'].values.astype(int)
    z_matrix_sparse = sp.csr_matrix(
        (np.ones(len(rows)), (rows, cols)),
        shape=(len(data.index), num_rules)
    )
    return z_matrix_sparse
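
As a small illustration of what the z matrix looks like, the sketch below builds one for three made-up samples and two rules, and then multiplies it by a toy t matrix. The multiplication is shown only as one common way to turn rule matches into per-class votes; all values and the `toy_` names are assumptions for this example.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Three toy samples; the second one matched no rule
toy_data = pd.DataFrame({
    "samples": ["sentence 0", "sentence 1", "sentence 2"],
    "enc_rules": [1, None, 0],
})
toy_num_rules = 2

matched = toy_data.reset_index().dropna(subset=["enc_rules"])
rows = matched["index"].to_numpy()
cols = matched["enc_rules"].to_numpy().astype(int)

# samples x rules: a 1 marks the rule that matched a sample
toy_z = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(toy_data), toy_num_rules))

# rules x classes: rule 0 votes for class 0, rule 1 for class 1 (hypothetical mapping)
toy_t = np.array([[1, 0], [0, 1]])

# samples x classes: the row of the unmatched sample stays all zeros
votes = toy_z @ toy_t

print(toy_z.toarray())
print(votes)
```

The all-zero row is exactly the case that is later treated as the no_relation class.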

In [120]:
train_rule_matches_z = get_z_matrix(train_data, num_rules)

# save the results to data_path, the directory the t matrix was written to
dump(train_data, os.path.join(data_path, TRAIN_SAMPLES_OUTPUT))
dump(train_rule_matches_z, os.path.join(data_path, Z_MATRIX_OUTPUT_TRAIN))

## Dev & Test data preprocessing

The dev and test data are simply read from the files without any additional annotation, since their gold labels are known.
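
The gold relation of an instance is stored in its header line. A minimal sketch of extracting and encoding it, using the header from the preview above; the one-entry label dictionary and the id of the extra class are assumptions made for illustration.

```python
# Header line as shown in the dataset preview
header = "# id=E0065795:0-pos docid=E0065795:0 reln=org:alternate_names"

toy_labels2ids = {"org:alternate_names": 25}  # hypothetical excerpt of the label mapping
toy_other_class_id = 26                       # hypothetical id of the extra class

# The fourth space-separated field is "reln=<relation>"; strip the "reln=" prefix
relation = header.split(" ")[3][len("reln="):]

# Relations missing from the mapping fall back to the extra class
label = toy_labels2ids.get(relation, toy_other_class_id)

print(relation, label)
```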

In [None]:
def get_conll_data_with_labels(
        conll_data: str, rule2rule_id: Dict, labels2ids: Dict, other_class_id: int = None
) -> pd.DataFrame:
    """
    Processing of TACRED dataset. The function reads the .conll input file, extracts the samples and the labels as well
    as argument pairs, which are saved as decision rules.
    :param conll_data: input data in .conll format
    :param rule2rule_id: mapping of rules to rule ids
    :param labels2ids: mapping of labels to label ids
    :param other_class_id: id of other_class_label
    :return: DataFrame with columns "samples" (extracted sentences), "rules" (entity pairs), "enc_rules" (entity pairs
            ids), "labels" (original labels)
    """

    num_lines = count_file_lines(conll_data)
    processed_lines = 0

    samples, labels, rules, enc_rules = [], [], [], []
    with open(conll_data, encoding='utf-8') as f:
        for line in f:
            processed_lines += 1
            line = line.strip()
            if line.startswith("# id="):  # Instance starts
                sample = ""
                subj, obj = {}, {}
                # the header looks like "# id=... docid=... reln=<relation>"; relations missing from
                # labels2ids (assumed here to be no_relation) are mapped to other_class_id
                label = labels2ids.get(line.split(" ")[3][5:], other_class_id)
            elif line == "":  # Instance ends
                # a rule is always "subject object", whatever their order in the sentence
                rule = "_".join(subj.values()) + " " + "_".join(obj.values())
                # every dev/test sample is kept, with or without a matched rule
                samples.append(sample)
                labels.append(label)
                if rule in rule2rule_id:
                    rules.append(rule)
                    enc_rules.append(rule2rule_id[rule])
                else:
                    rules.append(None)
                    enc_rules.append(None)
            elif line.startswith("#"):  # comment
                continue
            else:
                splitted_line = line.split("\t")
                token = splitted_line[1]
                if splitted_line[2] == "SUBJECT":
                    subj[splitted_line[0]] = token
                elif splitted_line[4] == "OBJECT":
                    obj[splitted_line[0]] = token
                sample += " " + token
            if processed_lines % PRINT_EVERY == 0:
                logger.info("Processed {:0.2f}% of {} file".format(100 * processed_lines / num_lines,
                                                                   os.path.basename(conll_data)))

    return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules, "labels": labels})

dev_data = get_conll_data_with_labels(path_dev_data, rule2rule_id, labels2ids, other_class_id)
test_data = get_conll_data_with_labels(path_test_data, rule2rule_id, labels2ids, other_class_id)

The z_matrix can be built with the same function we used for the train data.

In [None]:
# the z matrix has one column per rule, so its width is num_rules
dev_rule_matches_z = get_z_matrix(dev_data, num_rules)
test_rule_matches_z = get_z_matrix(test_data, num_rules)

... and we save all the data we got. 

In [18]:
# save the results to data_path, the directory the t matrix was written to
dump(dev_data, os.path.join(data_path, DEV_SAMPLES_OUTPUT))
dump(dev_rule_matches_z, os.path.join(data_path, Z_MATRIX_OUTPUT_DEV))

dump(test_data, os.path.join(data_path, TEST_SAMPLES_OUTPUT))
dump(test_rule_matches_z, os.path.join(data_path, Z_MATRIX_OUTPUT_TEST))
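
Files written with joblib's `dump` can be read back with `load`. The following sketch round-trips a toy sparse matrix through a temporary directory; the file name and the matrix are made up for the example and stand in for the real outputs.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from joblib import dump, load

# Toy z matrix standing in for one of the saved outputs
toy_z = sp.csr_matrix(np.array([[0, 1], [1, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_rule_matches_z.lib")  # hypothetical file name
    dump(toy_z, toy_path)
    restored_z = load(toy_path)

# The restored object is again a sparse matrix with the same content
print(type(restored_z).__name__, restored_z.shape)
```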