# Preprocessing of TAC-based Relation Extraction dataset

This notebook shows how to preprocess data in CoNLL format, a popular format for storing NLP datasets, for use with the Knodle framework.

To show how it works, we use a relation extraction dataset based on the TAC KBP corpora (Surdeanu (2013)), also used in Roth (2014). The TAC dataset was annotated with entity pairs extracted from Freebase (Google (2014)), where the corresponding relations have been mapped to the 41 TAC relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members).

In order to show the whole process of weak annotation, we have reconstructed the entity pairs and used them to annotate the dataset from scratch. As development and test sets, we use the gold corpus annotated via crowdsourcing and human labeling from KBP (Zhang et al. (2017)).

Importantly, in this dataset we preserve the samples where no rule matched as __negative samples__, which is considered good practice in many NLP tasks, e.g. relation extraction.

The steps are the following:
- the input data files are downloaded from the Minio database:
    - raw train data saved in .conll format
    - gold-annotated dev data saved in .conll format
    - gold-annotated test data saved in .conll format
    - list of rules (namely, Freebase entity pairs) with corresponding classes
    - list of classes
- the list of rules with corresponding classes is transformed into the mapping_rules_labels_t matrix
- the unlabelled train data are read from the .conll file and annotated with entity pairs. Based on these annotations, the rule_matches_z matrix and a DataFrame with train samples are generated
- the already annotated dev and test data are read from the .conll files together with their gold labels and stored as DataFrames.

## Imports

First, let's make some basic imports.

In [None]:
import argparse
import sys
import os
from pathlib import Path
import logging
from typing import Dict, Union, Tuple
from minio import Minio
import random
from IPython.display import HTML
import csv

import numpy as np
import pandas as pd
import scipy.sparse as sp
from joblib import dump
from tqdm.auto import tqdm

from knodle.trainer.utils import log_section

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
np.set_printoptions(threshold=sys.maxsize)

In [None]:
# define the file names
Z_MATRIX_OUTPUT_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_OUTPUT_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_OUTPUT_TEST = "test_rule_matches_z.lib"

T_MATRIX_OUTPUT_TRAIN = "mapping_rules_labels_t.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# file names for .csv files
TRAIN_SAMPLES_OUTPUT_CSV = "df_train.csv"
DEV_SAMPLES_OUTPUT_CSV = "df_dev.csv"
TEST_SAMPLES_OUTPUT_CSV = "df_test.csv"

# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio_old/TAC"
os.makedirs(data_path, exist_ok=True)  # make sure the target folder exists

## Download the dataset

This dataset, like all datasets provided in Knodle, can be easily downloaded from the Minio database with the Minio client.

In [None]:
client = Minio("knodle.dm.univie.ac.at", secure=False)
files = ["train.conll", "dev.conll", "test.conll", "labels.txt", "rules.csv"]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=f"datasets/conll/{file}",  # object names use "/" regardless of the local OS
        file_path=os.path.join(data_path, file),
    )

In [None]:
# set paths to input data
path_labels = os.path.join(data_path, "labels.txt")
path_rules = os.path.join(data_path, "rules.csv")
path_train_data = os.path.join(data_path, "train.conll")
path_dev_data = os.path.join(data_path, "dev.conll")
path_test_data = os.path.join(data_path, "test.conll")

## Labels & Rules Data Preprocessing

### Get labels

First, let's read the labels and their corresponding label ids from the file.

In [None]:
labels2ids = {}
with open(path_labels, encoding="UTF-8") as file:
    for line in file.readlines():
        relation, relation_enc = line.replace("\n", "").split(",")
        labels2ids[relation] = int(relation_enc)

num_classes = len(labels2ids)

In [None]:
print(labels2ids)
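
As a small illustration of the parsing step above, here is a sketch that runs on an in-memory file instead of labels.txt. The one-pair-per-line `relation,id` layout is inferred from the parsing code, and the two entries and their ids are made up for the example.

```python
import io

# Hypothetical content of a labels file: one "relation,label_id" pair per line
toy_labels_file = io.StringIO("per:schools_attended,0\norg:members,1\n")

toy_labels2ids = {}
for line in toy_labels_file:
    line = line.strip()
    if not line:  # skip empty lines, e.g. a trailing newline at the end of the file
        continue
    relation, relation_enc = line.split(",")
    toy_labels2ids[relation] = int(relation_enc)

print(toy_labels2ids)
```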

### Get rules

Second, we read the rules (in our case, entity pairs extracted from Freebase). They are stored in a separate csv file together with the corresponding label and label_id (the label to label_id correspondence is the same as in the labels file).

In [None]:
rules = pd.read_csv(path_rules)
num_rules_from_file = len(rules)
rules

Most rules and classes have a one-to-one correspondence. However, a rule can correspond to several classes. For example, the "Oracle, New_York" entity pair can indicate both the org:stateorprovince_of_headquarters and the org:city_of_headquarters relations. In such cases, information about all corresponding classes is saved and reflected in the mapping_rules_labels_t matrix we are going to build in the next section.

### Get rules to classes correspondence matrix

Before that, based on this DataFrame, let's build two dictionaries that we are going to use later:
- rule to rule id correspondence
- rule id to label ids correspondence

In [None]:
rule2rule_id = dict(zip(rules["rule"], rules["rule_id"]))

rules_n_label_ids = rules[["rule_id", "label_id"]].groupby('rule_id')
rule2label = rules_n_label_ids['label_id'].apply(lambda s: s.tolist()).to_dict()

num_rules = max(rules.rule_id.values) + 1
print(f"Number of rules: {num_rules}")
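
To make the two dictionaries more tangible, here is a minimal sketch on a toy rules table. The rows, ids and the `toy_` names are invented for illustration; only the column names follow the ones used above.

```python
import pandas as pd

# Toy rules table (hypothetical values); the first rule belongs to two classes
toy_rules = pd.DataFrame({
    "rule": ["Oracle New_York", "Oracle New_York", "Barack_Obama Harvard"],
    "rule_id": [0, 0, 1],
    "label": [
        "org:stateorprovince_of_headquarters",
        "org:city_of_headquarters",
        "per:schools_attended",
    ],
    "label_id": [0, 1, 2],
})

# rule -> rule id
toy_rule2rule_id = dict(zip(toy_rules["rule"], toy_rules["rule_id"]))

# rule id -> list of all label ids the rule corresponds to
toy_rule2label = (
    toy_rules.groupby("rule_id")["label_id"].apply(lambda s: s.tolist()).to_dict()
)

print(toy_rule2rule_id)
print(toy_rule2label)
```

Note that the rule with two classes keeps a single rule id, while its entry in the second dictionary holds both label ids.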

Finally, let's build the mapping_rules_labels_t matrix with the information about which rule corresponds to which class.

In [None]:
def get_mapping_rules_labels_t(rule2label: Dict, num_classes: int) -> np.ndarray:
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    mapping_rules_labels_t = np.zeros([len(rule2label), num_classes])
    for rule, labels in rule2label.items():
        mapping_rules_labels_t[rule, labels] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, num_classes)
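
For intuition, the following sketch builds the same kind of matrix for two toy rules and three toy classes. All values are hypothetical. A rule that corresponds to two classes simply gets two ones in its row.

```python
import numpy as np

# Hypothetical mapping: rule 0 corresponds to classes 0 and 1, rule 1 to class 2
toy_rule2label = {0: [0, 1], 1: [2]}
toy_num_classes = 3

toy_t = np.zeros((len(toy_rule2label), toy_num_classes))
for rule_id, label_ids in toy_rule2label.items():
    # set a 1 in every column (class) this rule corresponds to
    toy_t[rule_id, label_ids] = 1

print(toy_t)
```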

## Train data preprocessing

Train data should be annotated with the rules we already have. Remember, there are no gold labels (as opposed to the dev and test data). To preserve samples without rule matches as negative samples in the training set, we do not eliminate them but add them to the preprocessed data with empty rule and rule_id values.

So, the annotation is done in the following way:
- the sentences are extracted from the .conll file
- the pair of tokens tagged as subject and object is looked up in the rules list
- if the pair forms a rule from the rules list, the sentence is added to the train set together with the matched rule and its rule id
- if it does not, the sentence is added to the train set with an empty rule match
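
The lookup described in the steps above can be sketched on a single toy sentence. This is a simplified illustration, not the function used in this notebook: the sentence, the tuple layout `(token id, token, subject tag, object tag)` and the helper name `match_rule` are assumptions made for the example.

```python
# Hypothetical rule dictionary with one entity pair
toy_rule2rule_id = {"Oracle New_York": 0}

# Simplified token layout: (token id, token, subject tag, object tag)
toy_sentence = [
    (1, "Oracle", "SUBJECT", "_"),
    (2, "opened", "_", "_"),
    (3, "an", "_", "_"),
    (4, "office", "_", "_"),
    (5, "in", "_", "_"),
    (6, "New", "_", "OBJECT"),
    (7, "York", "_", "OBJECT"),
]


def match_rule(sentence, rule2rule_id):
    """Build the entity-pair string in order of appearance and look it up."""
    subj = [tok for _, tok, s, _ in sentence if s == "SUBJECT"]
    obj = [tok for _, tok, _, o in sentence if o == "OBJECT"]
    subj_start = min(i for i, _, s, _ in sentence if s == "SUBJECT")
    obj_start = min(i for i, _, _, o in sentence if o == "OBJECT")
    # the entity that appears first in the sentence comes first in the rule string
    first, second = (subj, obj) if subj_start < obj_start else (obj, subj)
    rule = "_".join(first) + " " + "_".join(second)
    rule_id = rule2rule_id.get(rule)
    return (rule, rule_id) if rule_id is not None else (None, None)


print(match_rule(toy_sentence, toy_rule2rule_id))
print(match_rule(toy_sentence, {}))  # no rule matches: the sample is kept with an empty match
```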

In [None]:
def count_file_lines(file_name: str) -> int:
    """ Count the number of lines in a file """
    with open(file_name) as f:
        return len(f.readlines())

In [None]:
# peek at the first 30 lines of the raw train file
with open(path_train_data, encoding="utf-8") as train_file:
    for _ in range(30):
        print(train_file.readline(), end="")

In [None]:
def extract_subj_obj_middle_words(line: str, subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, sample: str):
    """ Collects subject tokens, object tokens and the words between them from one CoNLL token line """
    splitted_line = line.split("\t")
    token = splitted_line[1]
    if splitted_line[2] == "SUBJECT":
        if subj_min_token_id is None:  # remember the position of the first subject token
            subj_min_token_id = int(splitted_line[0])
        subj.append(token)
        sample += " " + token
    elif splitted_line[4] == "OBJECT":
        if obj_min_token_id is None:  # remember the position of the first object token
            obj_min_token_id = int(splitted_line[0])
        obj.append(token)
        sample += " " + token
    else:
        # keep a token only while exactly one of the two entities has been seen, i.e. between them
        if bool(subj) != bool(obj):
            sample += " " + token
    return subj, obj, subj_min_token_id, obj_min_token_id, sample

def get_rule_n_rule_id(subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, rule2rule_id: dict) -> Union[Tuple[str, int], Tuple[None, None]]:
    if subj_min_token_id < obj_min_token_id:
        rule = "_".join(subj) + " " + "_".join(obj)
    else:
        rule = "_".join(obj) + " " + "_".join(subj)
    if rule in rule2rule_id.keys():
        return rule, rule2rule_id[rule]
    return None, None

def encode_labels(label: str, label2id: dict) -> int:
    """ Encodes labels with corresponding labels id. If relation is unknown, adds it to the dict with new label id """
    if label in label2id:
        label_id = label2id[label]
    else:
        # todo: warning and 
        label_id = len(label2id)
        label2id[label] = label_id
    return label_id

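def verbose(processed_lines: int, num_lines: int) -> None:
    """ Prints the progress roughly every 10% of the processed lines """
    step = max(1, int(round(num_lines / 10)))  # avoid a zero step for very small files
    if processed_lines % step == 0:
        print(f"Processed {processed_lines / num_lines * 100 :0.0f}%")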


def annotate_conll_data_with_lfs(conll_data: str, rule2rule_id: Dict, labels2ids: Dict = None) -> pd.DataFrame:
    num_lines = count_file_lines(conll_data)
    processed_lines = 0
    samples, rules, enc_rules, labels, enc_labels = [], [], [], [], []
    with open(conll_data, encoding='utf-8') as f:
        for line in f:
            processed_lines += 1
            line = line.strip()
            if line.startswith("# id="):  # Instance starts
                sample = ""
                subj, obj = [], []
                subj_min_token_id, obj_min_token_id = None, None
                if labels2ids:
                    label = line.split(" ")[3][5:]
                    label_id = encode_labels(label, labels2ids)
            elif line == "":  # Instance ends
                if len(subj) == 0 or len(obj) == 0:      # there is a mistake in sample annotation, and no token was annotated as subj/obj 
                    continue
                rule, rule_id = get_rule_n_rule_id(subj, obj, subj_min_token_id, obj_min_token_id, rule2rule_id)
                samples.append(sample.lstrip())
                rules.append(rule)
                enc_rules.append(rule_id)
                if labels2ids:
                    labels.append(label)
                    enc_labels.append(label_id)
            elif line.startswith("#"):  # comment
                continue
            else:
                subj, obj, subj_min_token_id, obj_min_token_id, sample = extract_subj_obj_middle_words(line, subj, obj, subj_min_token_id, obj_min_token_id, sample)
            verbose(processed_lines, num_lines)
            
    print(f"Preprocessing of {os.path.basename(conll_data)} file is finished.")
    if labels2ids:
        return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules, "labels": labels, "enc_labels": enc_labels}) 
    return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules})

In [None]:
train_data = annotate_conll_data_with_lfs(path_train_data, rule2rule_id)

In [None]:
train_data.head()

After that, we can build a rule_matches_z matrix for the train data and store it as a sparse matrix.

In [None]:
def get_rule_matches_z_matrix(data: pd.DataFrame, num_rules: int) -> sp.csr_matrix:
    """
    Function calculates the z matrix (samples x rules)
    data: pd.DataFrame (samples, matched rules, matched rules id )
    output: sparse z matrix
    """
    # keep only the samples with a matched rule; unmatched samples stay as all-zero rows
    matched = data.reset_index().dropna(subset=["enc_rules"])
    rule_matches_z_matrix_sparse = sp.csr_matrix(
        (
            np.ones(len(matched)),
            (matched['index'].values, matched['enc_rules'].values.astype(int))
        ),
        shape=(len(data.index), num_rules)
    )
    return rule_matches_z_matrix_sparse

In [None]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data, num_rules)
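
The following sketch shows the structure of such a matrix on three toy samples and two toy rules. The DataFrame content and the `toy_` names are hypothetical. The sample without a rule match ends up as an all-zero row, which is how negative samples are represented.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy annotated data: the second sample matched no rule
toy_train = pd.DataFrame({
    "samples": ["Oracle opened an office in New York", "some unmatched text", "Barack Obama studied at Harvard"],
    "enc_rules": [0, None, 1],
})

# rows = sample positions, columns = ids of the matched rules
matched = toy_train.reset_index().dropna(subset=["enc_rules"])
rows = matched["index"].to_numpy()
cols = matched["enc_rules"].to_numpy().astype(int)

toy_z = sp.csr_matrix(
    (np.ones(len(matched)), (rows, cols)),
    shape=(len(toy_train), 2),
)

print(toy_z.toarray())
```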

## Dev & Test data preprocessing

The validation and test data are read from the corresponding input files. Although the gold labels are known and can simply be taken from the same input CoNLL data, we still annotate the dev and test data with the same rules we used to annotate the train data (namely, Freebase entity pairs). This lets us later evaluate the rules and get a baseline result by comparing the known gold labels with the weak labels. However, because the rules are very specific, only a small number of rules match in the dev and test data. That is why, in the final DataFrame, the "rules" and "enc_rules" values are empty for most samples.

Apart from the 41 "meaningful" relations, the validation and test data also contain samples annotated as "no_relation". That's why we need to add one more class to our labels2ids dictionary.

In [None]:
labels2ids["no_relation"] = max(labels2ids.values()) + 1

Now we can process the development and test data. We use the same function as for the training data, with one difference: the labels are also read and stored for each sample.

In [None]:
dev_data = annotate_conll_data_with_lfs(path_dev_data, rule2rule_id, labels2ids)
test_data = annotate_conll_data_with_lfs(path_test_data, rule2rule_id, labels2ids)

In [None]:
dev_data.head()

We also provide rule_matches_z matrices for dev and test data in order to calculate the simple majority baseline. They won't be used in any of the denoising algorithms provided in Knodle.

In [None]:
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data, num_rules)
test_rule_matches_z = get_rule_matches_z_matrix(test_data, num_rules)
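
One possible way to compute such a majority baseline from the z and t matrices is sketched below on toy data. This is an illustration under stated assumptions, not Knodle's own implementation: the matrices, the gold labels and `no_relation_id` are made up, ties are resolved by taking the first class with the highest vote count, and samples without any rule match are assigned the "no_relation" class.

```python
import numpy as np
import scipy.sparse as sp

# Toy matrices: 3 samples x 2 rules, and 2 rules x 3 classes
toy_z = sp.csr_matrix(np.array([[1, 0], [0, 0], [0, 1]]))
toy_t = np.array([[1, 1, 0], [0, 0, 1]])
no_relation_id = 3  # hypothetical id of the extra "no_relation" class

# votes[i, c] = number of matched rules of sample i that point to class c
votes = np.asarray(toy_z @ toy_t)

# majority vote; samples without any match fall back to "no_relation"
weak_labels = np.where(votes.sum(axis=1) == 0, no_relation_id, votes.argmax(axis=1))

toy_gold = np.array([1, 3, 2])  # made-up gold labels
accuracy = (weak_labels == toy_gold).mean()

print(weak_labels, accuracy)
```

In this toy case the first sample has a tie between two classes, so the result depends on the tie-breaking choice.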

## Statistics

Let's look at some statistics of the data we have prepared.

In [None]:
print(f"Number of rules: {num_rules}")
print(f"Dimension of t matrix: {mapping_rules_labels_t.shape}")
print(f"Number of samples in train set: {len(train_data)}")

In [None]:
print(f"Number of samples in dev set: {len(dev_data)}")
dev_stat = dev_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(dev_stat.to_html(index=False))

In [None]:
print(f"Number of samples in test set: {len(test_data)}")
test_stat = test_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(test_stat.to_html(index=False))

## Save files

Finally, we save all the data we have produced.

In [None]:
Path(os.path.join(data_path, "processed")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed", T_MATRIX_OUTPUT_TRAIN))

dump(train_data["samples"], os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT))
train_data["samples"].to_csv(os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT_CSV), header=True)
dump(train_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TRAIN))

dump(dev_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT))
dev_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT_CSV), header=True)
dump(dev_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_DEV))

dump(test_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT))
test_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT_CSV), header=True)
dump(test_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TEST))
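
Files written with joblib's `dump` can be read back with `load`. The sketch below does a round trip for a toy sparse matrix in a temporary folder. The matrix is a made-up stand-in, and the file name only mirrors the one defined at the top of this notebook.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from joblib import dump, load

toy_z = sp.csr_matrix(np.eye(2))  # toy stand-in for a rule_matches_z matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "train_rule_matches_z.lib")
    dump(toy_z, toy_path)
    restored_z = load(toy_path)

# the restored object is a sparse matrix with the same shape and entries
print(restored_z.shape, (restored_z != toy_z).nnz == 0)
```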

## Finish

Congrats! Now we have all the data we need to launch Knodle on weakly-annotated TAC-based data.