# Preprocessing of TAC-based Relation Extraction dataset

This notebook shows how to preprocess data in CoNLL format for use in the Knodle framework.

To show how it works, we have taken a relation extraction dataset based on TAC KBP corpora (Surdeanu (2013)), also used in Roth (2014). The TAC dataset was annotated with entity pairs extracted from Freebase (Google (2014)), where the corresponding relations have been mapped to the 41 TAC relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members).

In order to show the whole process of weak annotation, we have reconstructed the entity pairs and used them to annotate the dataset from scratch. As development and test sets we used the gold corpus annotated via crowdsourcing and human labeling from KBP (Zhang et al. (2017)).

Importantly, in this dataset we preserve the samples where no rule matched as negative samples, which is considered good practice for the relation extraction task.

The steps are the following:
- the input data files are downloaded from the MinIO database:
    - raw train data in .conll format
    - gold-annotated dev data in .conll format
    - gold-annotated test data in .conll format
    - list of rules (namely, Freebase entity pairs) with corresponding classes
    - list of classes
- the list of rules with corresponding classes is transformed into the mapping_rules_labels_t matrix
- the unlabelled train data are read from the .conll file and annotated with entity pairs. Based on them, the rule_matches_z matrix and a DataFrame with train samples are generated
- the already annotated dev and test data are read from the .conll files together with gold labels and stored as DataFrames.

## Imports

First, let's make some basic imports.

In [1]:
import argparse
import sys
import os
from pathlib import Path
import logging
from typing import Dict, Union, Tuple
from minio import Minio
import random
from IPython.display import HTML

import numpy as np
import pandas as pd
import scipy.sparse as sp
from joblib import dump
from tqdm.auto import tqdm

from knodle.trainer.utils import log_section

pd.set_option('display.max_colwidth', None)  # None means "no limit"; a negative value is deprecated in recent pandas versions
np.set_printoptions(threshold=sys.maxsize)

In [2]:
# define the files names
Z_MATRIX_OUTPUT_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_OUTPUT_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_OUTPUT_TEST = "test_rule_matches_z.lib"

T_MATRIX_OUTPUT_TRAIN = "mapping_rules_labels.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/TAC"
os.path.join(data_path)

'../../../data_from_minio/TAC'

## Download the dataset

This dataset, like all datasets provided in Knodle, can be downloaded from the MinIO database with the MinIO client.

In [3]:
def get_conll_config():
    config = {
        "minio_url": "knodle.dm.univie.ac.at",
        "minio_bucket": "knodle",
        "minio_prefix": "datasets/conll",
        "minio_files": [
            "labels.txt",
            # "train.conll",
            "dev.conll",
            "test.conll",
            "rules.csv"
        ],
        "data_dir": data_path
    }
    return config

config = get_conll_config()
client = Minio(config.get("minio_url"), secure=False)

for file in tqdm(config.get("minio_files")):
    client.fget_object(
        bucket_name=config.get("minio_bucket"),
        object_name=os.path.join(config.get("minio_prefix"), file),
        file_path=os.path.join(data_path, file),
    )

In [4]:
!wget --recursive --no-parent http://knodle.dm.univie.ac.at/minio/knodle/datasets

--2021-03-15 19:00:52--  http://knodle.dm.univie.ac.at/minio/knodle/datasets
Resolving knodle.dm.univie.ac.at (knodle.dm.univie.ac.at)... 131.130.125.86
Connecting to knodle.dm.univie.ac.at (knodle.dm.univie.ac.at)|131.130.125.86|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-03-15 19:00:52 ERROR 403: Forbidden.



In [10]:
# set paths to input data
path_labels = os.path.join(data_path, "labels.txt")
path_rules = os.path.join(data_path, "rules_updated.csv")
path_train_data = os.path.join(data_path, "train.conll")
path_dev_data = os.path.join(data_path, "dev.conll")
path_test_data = os.path.join(data_path, "test.conll")

## Labels & Rules Data Preprocessing

### Get labels

First, let's read labels from the file with the corresponding label ids.

In [11]:
labels2ids = {}
with open(path_labels, encoding="UTF-8") as file:
    for line in file.readlines():
        relation, relation_enc = line.replace("\n", "").split(",")
        labels2ids[relation] = int(relation_enc)

num_classes = len(labels2ids)

In [12]:
print(labels2ids)

{'per:alternate_names': 0, 'per:date_of_birth': 1, 'per:age': 2, 'per:country_of_birth': 3, 'per:stateorprovince_of_birth': 4, 'per:city_of_birth': 5, 'per:origin': 6, 'per:date_of_death': 7, 'per:country_of_death': 8, 'per:stateorprovince_of_death': 9, 'per:city_of_death': 10, 'per:cause_of_death': 11, 'per:countries_of_residence': 12, 'per:stateorprovinces_of_residence': 13, 'per:cities_of_residence': 14, 'per:schools_attended': 15, 'per:title': 16, 'per:employee_of': 17, 'per:religion': 18, 'per:spouse': 19, 'per:children': 20, 'per:parents': 21, 'per:siblings': 22, 'per:other_family': 23, 'per:charges': 24, 'org:alternate_names': 25, 'org:members': 26, 'org:member_of': 27, 'org:subsidiaries': 28, 'org:political/religious_affiliation': 29, 'org:top_members/employees': 30, 'org:number_of_employees/members': 31, 'org:parents': 32, 'org:founded_by': 33, 'org:founded': 34, 'org:country_of_headquarters': 35, 'org:stateorprovince_of_headquarters': 36, 'org:city_of_headquarters': 37, 'org:

### Get rules

Second, the rules (in our case, entity pairs extracted from Freebase), which are stored in a separate csv file, are read into a DataFrame.

In [58]:
rules = pd.read_csv(path_rules)
rules

Unnamed: 0,rule,rule_id,label,label_id
0,ATG Art_Technology_Group,0,org:alternate_names,25
1,Union_Cycliste_Internationale UCI,1,org:alternate_names,25
2,UCI Union_Cycliste_Internationale,2,org:alternate_names,25
3,Hanwha 한화,3,org:alternate_names,25
4,Radio_Free_Europe Radio_Liberty,4,org:alternate_names,25
...,...,...,...,...
247315,Ginger_Baker drums,212032,per:title,16
247316,painter Qi_Baishi,212033,per:title,16
247317,Qi_Baishi painter,212034,per:title,16
247318,engineer David_Lennox,212035,per:title,16


In [59]:
# todo: delete
rules2label_ids = dict(zip(rules.rule, rules.label_id))

There can also be entity pairs that correspond to several relations. This is not a problem for Knodle, since this information can easily be preserved in the mapping_rules_labels_t matrix we are going to build in the next section.

In [60]:
# df1 = rules[rules.duplicated('rule', keep=False)].sort_values('rule')
# print(df1[['rule', 'rule_id', 'label']])

There can also be cases where one rule corresponds to different classes (e.g., the "Oracle, New_York" entity pair can reflect both the org:stateorprovince_of_headquarters and the org:city_of_headquarters relations). However, Knodle currently doesn't support multi-label classification, so in such cases the label class is chosen randomly.

In [61]:
idx = np.random.permutation(np.arange(len(rules)))
rules = rules.iloc[idx].drop_duplicates(subset=["rule_id"])
rules

Unnamed: 0,rule,rule_id,label,label_id
176322,Mahidol_Adulyadej Chulalongkorn,164124,per:parents,21
1695,Royal_Northern_College_of_Music RNCM,1695,org:alternate_names,25
183090,Wu_Zetian Buddhism,169071,per:religion,18
153602,WCW Steve_Regal,143493,per:employee_of,17
209349,Niklas_Isfeldt Sweden,109944,per:stateorprovince_of_birth,4
...,...,...,...,...
199174,Diego_Silang Gabriela_Silang,184312,per:spouse,19
66119,David_Black Eastern_University,62157,org:top_members/employees,30
35957,Olde_Towne_Brewing_Company 2004,35934,org:founded,34
37987,Knoxville_College 1875,37964,org:founded,34
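
The random choice above changes from run to run because the permutation is not seeded. Below is a minimal sketch of the same shuffle-then-deduplicate idea on a toy table, assuming a seeded NumPy generator is acceptable for reproducibility. The rows of `toy_rules` and the seed value are hypothetical and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rules table (hypothetical rows): rule_id 0 is listed under two labels.
toy_rules = pd.DataFrame({
    "rule": ["Oracle New_York", "Oracle New_York", "ATG Art_Technology_Group"],
    "rule_id": [0, 0, 1],
    "label_id": [36, 37, 25],
})

# A seeded generator makes the random choice repeatable between runs.
rng = np.random.default_rng(seed=12345)
shuffled = toy_rules.iloc[rng.permutation(len(toy_rules))]

# After shuffling, keep the first row encountered for every rule_id.
deduplicated = shuffled.drop_duplicates(subset=["rule_id"])
print(deduplicated.sort_values("rule_id"))
```

Each rule_id ends up with exactly one label, and which of the competing labels survives depends only on the seed.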


Now we are ready to transform this DataFrame into a dictionary that maps each rule to its rule id.

In [62]:
rule2rule_id = dict(zip(rules.rule, rules.rule_id))
num_rules = max(rules.rule_id.values) + 1

In [63]:
print(random.sample(list(rule2rule_id.items()), 10))

[('Fukushima Hideyo_Noguchi', 196261), ('2004 Toltecalli_Academy', 37529), ('Chennai Janaki_Ramachandran', 194647), ('Abd Cairo', 101490), ('Bruce Linda_Lee_Cadwell', 187769), ('Holly_McNarland singer', 205522), ('L.A. Dutch', 104450), ('West_Jessamine_High_School Kentucky', 51884), ('Wainwright Washington', 58928), ('Kruidvat Renswoude', 10345)]


### Get rules to classes correspondence matrix

Lastly, build mapping_rules_labels_t to get the information about which rule corresponds to which class. 

In [64]:
def get_mapping_rules_labels_t(rules: pd.DataFrame, num_classes: int) -> np.ndarray:
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    # start from zeros: np.empty would leave arbitrary values in the cells that are never assigned
    mapping_rules_labels_t = np.zeros([rules.rule_id.max() + 1, num_classes])
    for index, row in rules.iterrows():
        mapping_rules_labels_t[row["rule_id"], row["label_id"]] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rules, num_classes)
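
A quick way to sanity-check such a matrix is to verify that every rule row contains exactly one 1. The sketch below does this on a toy table, assuming that each rule id occurs once (as after the deduplication step) and that the matrix starts zero-filled. The names `toy_rules`, `num_toy_classes` and `toy_t` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical): three rules mapped onto two classes.
toy_rules = pd.DataFrame({"rule_id": [0, 1, 2], "label_id": [1, 0, 1]})
num_toy_classes = 2

# Zero-filled matrix of shape (rules x classes) with a 1 at each (rule, class) pair.
toy_t = np.zeros((toy_rules["rule_id"].max() + 1, num_toy_classes))
toy_t[toy_rules["rule_id"].to_numpy(), toy_rules["label_id"].to_numpy()] = 1

print(toy_t)

# Every rule should vote for exactly one class.
assert (toy_t.sum(axis=1) == 1).all()
```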

## Train data preprocessing

The train data should be annotated with the rules we already have. Remember, there are no gold labels (as opposed to the dev and test data). To preserve samples without rule matches as negative samples in the training set, we do not eliminate them but add them to the preprocessed data with empty rule and rule_id values.

So, the annotation is done in the following way:
- the sentences are extracted from the CoNLL format
- the tokens labelled as subject and object are checked for whether they form an entity pair from the rules list
- if yes, the sentence is added to the train set with the corresponding matched rule and matched rule id
- if not, the sentence is added to the train set with an empty rule match
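
The lookup step from the list above can be sketched in isolation as follows. This is a simplified illustration, not the notebook's own function: `match_entity_pair` and `toy_rule2rule_id` are hypothetical names, and the sketch assumes that the rule string is built from the underscore-joined entity tokens in their order of appearance in the sentence.

```python
from typing import Dict, List, Optional, Tuple


def match_entity_pair(
    subj_tokens: List[str],
    obj_tokens: List[str],
    subj_first_pos: int,
    obj_first_pos: int,
    rule2rule_id: Dict[str, int],
) -> Tuple[Optional[str], Optional[int]]:
    """Build the rule string in order of appearance and look it up in the rules."""
    subj, obj = "_".join(subj_tokens), "_".join(obj_tokens)
    rule = f"{subj} {obj}" if subj_first_pos < obj_first_pos else f"{obj} {subj}"
    rule_id = rule2rule_id.get(rule)
    # No match: the sample is kept, but with an empty rule and rule id.
    return (rule, rule_id) if rule_id is not None else (None, None)


toy_rule2rule_id = {"ATG Art_Technology_Group": 0}

# The object "ATG" (token 5) precedes the subject "Art Technology Group" (tokens 7-9).
matched = match_entity_pair(["Art", "Technology", "Group"], ["ATG"], 7, 5, toy_rule2rule_id)
unmatched = match_entity_pair(["Bob"], ["Paris"], 1, 4, toy_rule2rule_id)
print(matched)
print(unmatched)
```

The first call finds the entity pair in the toy dictionary, the second one returns an empty match.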

In [65]:
def count_file_lines(file_name: str) -> int:
    """ Count the number of lines in a file without loading the whole file into memory """
    with open(file_name, encoding="utf-8") as f:
        return sum(1 for _ in f)

In [66]:
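# peek at the first 30 lines of the raw train file; the context manager closes the file afterwards
with open(path_train_data, encoding="utf-8") as f:
    for i in range(30):
        line = f.readline()
        print(line)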

#	index	token	subj	subj_type	obj	obj_type	stanford_pos	stanford_ner	stanford_deprel	stanford_head

# id=E0065795:0-pos docid=E0065795:0 reln=org:alternate_names

1	Profile	_	_	_	_	VB	O	advmod	16

2	,	_	_	_	_	,	O	punct	16

3	basic	_	_	_	_	JJ	O	amod	4

4	information	_	_	_	_	NN	O	compound	5

5	ATG	_	_	OBJECT	ORG	NNP	ORG	nsubj	16

6	(	_	_	_	_	-LRB-	O	punct	11

7	Art	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

8	Technology	SUBJECT	ORGANIZATION	_	_	NNP	ORG	compound	9

9	Group	SUBJECT	ORGANIZATION	_	_	NNP	ORG	nmod	11

10	,	_	_	_	_	,	O	punct	11

11	Inc.	_	_	_	_	NNP	GPE	appos	5

12	,	_	_	_	_	,	O	punct	11

13	NASDAQ	_	_	_	_	NNP	ORG	npadvmod	11

14	:	_	_	_	_	:	O	punct	13

15	)	_	_	_	_	-RRB-	O	punct	11

16	makes	_	_	_	_	VBZ	O	ROOT	16

17	software	_	_	_	_	NN	O	dobj	16

18	and	_	_	_	_	CC	O	cc	16

19	delivers	_	_	_	_	VBZ	O	conj	16

20	e	_	_	_	_	NN	O	nmod	22

21	-	_	_	_	_	HYPH	O	punct	22

22	commerce	_	_	_	_	NN	O	nmod	26

23	and	_	_	_	_	CC	O	cc	22

24	Web	_	_	_	_	NN	O	compound	25

25	marketing	_	_	_	

In [67]:
def extract_subj_obj_middle_words(line: str, subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, sample: str):
    splitted_line = line.split("\t")
    token = splitted_line[1]
    if splitted_line[2] == "SUBJECT":
        if not subj_min_token_id:
            subj_min_token_id = int(splitted_line[0])
        subj.append(token)
        sample += " " + token
    elif splitted_line[4] == "OBJECT":
        if not obj_min_token_id:
            obj_min_token_id = int(splitted_line[0])
        obj.append(token)
        sample += " " + token
    else:
        if (bool(subj) and not bool(obj)) or (not bool(subj) and bool(obj)):
            sample += " " + token
    return subj, obj, subj_min_token_id, obj_min_token_id, sample

def get_rule_n_rule_id(subj: list, obj: list, subj_min_token_id: int, obj_min_token_id: int, rule2rule_id: dict) -> Union[Tuple[str, int], Tuple[None, None]]:
    # print(f"new rule? Subj: {subj}, obj: {obj}")
    if subj_min_token_id < obj_min_token_id:
        rule = "_".join(subj) + " " + "_".join(obj)
    else:
        rule = "_".join(obj) + " " + "_".join(subj)
    if rule in rule2rule_id.keys():
        return rule, rule2rule_id[rule]
    return None, None

def encode_labels(label: str, label2id: dict) -> int:
    """ Encodes labels with corresponding labels id. If relation is unknown, adds it to the dict with new label id """
    if label in label2id:
        label_id = label2id[label]
    else:
        label_id = len(label2id)
        label2id[label] = label_id
    return label_id

def print_progress(processed_lines: int, num_lines: int) -> None:
    # for very short files the step would round to 0 and the modulo would fail
    step = max(1, int(round(num_lines / 10)))
    if processed_lines % step == 0:
        print(f"Processed {processed_lines / num_lines * 100 :0.0f}%")


def annotate_conll_data_with_lfs(conll_data: str, rule2rule_id: Dict, labels2ids: Dict = None) -> pd.DataFrame:
    num_lines = count_file_lines(conll_data)
    processed_lines = 0
    samples, rules, enc_rules, labels, enc_labels = [], [], [], [], []
    # todo: delete the following line
    relation_train_to_delete = []
    with open(conll_data, encoding='utf-8') as f:
        for line in f:
            processed_lines += 1
            line = line.strip()
            if line.startswith("# id="):  # Instance starts
                sample = ""
                subj, obj = [], []
                subj_min_token_id, obj_min_token_id = None, None
                if labels2ids:
                    label = line.split(" ")[3][5:]
                    label_id = encode_labels(label, labels2ids)
            elif line == "":  # Instance ends
                if len(subj) == 0 or len(obj) == 0:      # there is a mistake in sample annotation, and no token was annotated as subj/obj 
                    continue
                rule, rule_id = get_rule_n_rule_id(subj, obj, subj_min_token_id, obj_min_token_id, rule2rule_id)
                samples.append(sample.lstrip())
                rules.append(rule)
                enc_rules.append(rule_id)
                if labels2ids:
                    labels.append(label)
                    enc_labels.append(label_id)
            elif line.startswith("#"):  # comment
                continue
            else:
                subj, obj, subj_min_token_id, obj_min_token_id, sample = extract_subj_obj_middle_words(line, subj, obj, subj_min_token_id, obj_min_token_id, sample)
            print_progress(processed_lines, num_lines)
            
    print(f"Preprocessing of {conll_data.split('/')[-1]} file is finished.")
    if labels2ids:
        return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules, "labels": labels, "enc_labels": enc_labels}) 
    return pd.DataFrame.from_dict({"samples": samples, "rules": rules, "enc_rules": enc_rules})

In [68]:
train_data = annotate_conll_data_with_lfs(path_train_data, rule2rule_id, labels2ids)
# uncomment
# train_data = annotate_conll_data_with_lfs(path_train_data, rule2rule_id)

Processed 10%
Processed 20%
Processed 30%
Processed 40%
Processed 50%
Processed 60%
Processed 70%
Processed 80%
Processed 90%
Preprocessing of train.conll file is finished.


In [69]:
train_data.head()

Unnamed: 0,samples,rules,enc_rules,labels,enc_labels
0,ATG ( Art Technology Group,ATG Art_Technology_Group,0.0,org:alternate_names,25
1,) makes software and delivers e - commerce,,,no_relation,41
2,Union Cycliste Internationale ( UCI,Union_Cycliste_Internationale UCI,1.0,org:alternate_names,25
3,1987 by CTCA and has been awarded the distinguished level,,,no_relation,41
4,Union Cycliste Internationale ( UCI,Union_Cycliste_Internationale UCI,1.0,org:alternate_names,25


In [70]:
rules.head()

Unnamed: 0,rule,rule_id,label,label_id
176322,Mahidol_Adulyadej Chulalongkorn,164124,per:parents,21
1695,Royal_Northern_College_of_Music RNCM,1695,org:alternate_names,25
183090,Wu_Zetian Buddhism,169071,per:religion,18
153602,WCW Steve_Regal,143493,per:employee_of,17
209349,Niklas_Isfeldt Sweden,109944,per:stateorprovince_of_birth,4


In [71]:
# todo: delete!
train_data["label_by_rule_to_delete"] = train_data["rules"].apply(
    lambda x: rules2label_ids[x] if x is not None else 41
)



In [72]:
not_same = train_data[train_data["enc_labels"] != train_data["label_by_rule_to_delete"]]
print(len(not_same))
not_same_without_nr = not_same[not_same["enc_labels"] != 41]
print(len(not_same_without_nr))

352873
349745


In [36]:
not_same.head(10)

Unnamed: 0,samples,rules,enc_rules,labels,enc_labels,label_by_rule_to_delete
1040,Trans World Airlines ( TWA,Trans_World_Airlines TWA,57.0,no_relation,41,25
1327,United States Military Academy in West Point,United_States_Military_Academy West_Point,72.0,no_relation,41,25
5310,Organization of American States ( OAS,Organization_of_American_States OAS,209.0,no_relation,41,25
6179,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6181,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6183,"Automatic Data Processing , Inc. ( NYSE : ) ADP",Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6185,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6187,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6189,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
6191,Automatic Data Processing ( ADP,Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32


In [46]:
not_same_without_nr.drop_duplicates(
    subset=["enc_rules", "labels", "enc_labels", "label_by_rule_to_delete"], keep='last'
).head(50)

Unnamed: 0,samples,rules,enc_rules,labels,enc_labels,label_by_rule_to_delete
6193,"Automatic Data Processing , Inc. ( ADP",Automatic_Data_Processing ADP,255.0,org:alternate_names,25,32
16832,Koch Records was formed in May 1999 by Alex Ameen to serve as the North American umbrella for Koch Entertainment,Koch_Records Koch_Entertainment,567.0,org:alternate_names,25,32
31449,"Universal Studios producer , and former vice chairman of Worldwide Production for Universal Pictures",Universal_Studios Universal_Pictures,1192.0,org:alternate_names,25,32
31451,Universal Pictures movie studio and operates Universal Studios,Universal_Pictures Universal_Studios,1193.0,org:alternate_names,25,32
35503,"Hærens Jegerkommando , Marinejegerkommandoen and NORSOF",Hærens_Jegerkommando NORSOF,1315.0,org:alternate_names,25,32
47379,Western Electric Company in New York,Western_Electric New_York,1757.0,org:city_of_headquarters,37,36
47440,"Humanscale Corporation , based in New York",Humanscale New_York,1771.0,org:city_of_headquarters,37,36
47748,"Dailymotion is a video hosting service website , based in Paris",Dailymotion Paris,1834.0,org:city_of_headquarters,37,36
47749,"Paris called the Metropolis , but there are signs that it is spreading thanks to videos on file- sharing websites YouTube and Dailymotion",Paris Dailymotion,1833.0,org:city_of_headquarters,37,36
47751,"Dailymotion is a video hosting service website , based in Paris , France","Dailymotion Paris_,_France",1835.0,org:city_of_headquarters,37,36


In [47]:
len(not_same_without_nr.drop_duplicates(
    subset=["enc_rules", "labels", "enc_labels", "label_by_rule_to_delete"], keep='last'
))

77643

In [38]:
print(len(not_same_without_nr))

349745


In [33]:
count = len(train_data.dropna().query('enc_labels != label_by_rule_to_delete'))  # 2
print(f"Number of labels that are not the same: {count}")
print(f"Total number of rows: {len(train_data)}")

Number of labels that are not the same: 352873
Total number of rows: 1937211


After that, we can build the rule_matches_z matrix for the train data as a sparse matrix.

In [None]:
def get_rule_matches_z_matrix(data: pd.DataFrame, num_rules: int) -> sp.csr_matrix:
    """
    Function calculates the z matrix (samples x rules)
    data: pd.DataFrame (samples, matched rules, matched rules id )
    output: sparse z matrix
    """
    # keep only the samples with a matched rule; all other rows stay all-zero in the matrix
    data_with_match = data.reset_index().dropna(subset=["enc_rules"])
    rows = data_with_match['index'].values
    # missing values turn the column into floats, so cast the rule ids back to integers
    cols = data_with_match['enc_rules'].values.astype(int)
    rule_matches_z_matrix_sparse = sp.csr_matrix(
        (np.ones(len(rows)), (rows, cols)),
        shape=(len(data.index), num_rules)
    )
    return rule_matches_z_matrix_sparse

In [None]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data, num_rules)
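
To make the shape of the result concrete, here is a small sketch of the same construction on a toy DataFrame. The names `toy_data`, `toy_num_rules` and `toy_z` are hypothetical. Samples without a matched rule have a missing `enc_rules` value and end up as all-zero rows.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy samples (hypothetical): the second one matched no rule.
toy_data = pd.DataFrame({
    "samples": ["ATG ( Art Technology Group", ") makes software", "Union Cycliste Internationale ( UCI"],
    "enc_rules": [0.0, np.nan, 1.0],
})
toy_num_rules = 3

# Row positions and integer rule ids of the samples that matched a rule.
matched = toy_data.reset_index().dropna(subset=["enc_rules"])
rows = matched["index"].to_numpy()
cols = matched["enc_rules"].to_numpy().astype(int)

# Sparse (samples x rules) matrix with a 1 at every (sample, matched rule) position.
toy_z = sp.csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(toy_data), toy_num_rules),
)
print(toy_z.toarray())
```

Only two cells are stored explicitly, which is why the sparse format pays off for a matrix with millions of samples and hundreds of thousands of rules.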

## Dev & Test data preprocessing

The validation and test data are read from the corresponding input files. Although the gold labels are known and could simply be taken from the same input CoNLL data, we still annotate the dev and test data with the same rules we used to annotate the train data (namely, Freebase entity pairs). This allows us to evaluate the rules later and to get a baseline result by comparing the known gold labels with the weak labels. However, because the rules are very specific, only a small number of rules match in the dev and test data. That is why, in the final DataFrame, the "rules" and "enc_rules" values are None for most of the samples.

Apart from the 41 "meaningful" relations, there are also samples which are annotated as "no_relation" samples in validation and test data. That's why we need to add one more class to our labels2ids dictionary. 

In [None]:
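# add the class only if it is not there yet, so that an id assigned earlier is not overwritten
labels2ids.setdefault("no_relation", max(labels2ids.values()) + 1)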

Now we can process the development and test data. We use the same function as for the training data, with one difference: the gold labels are also read and stored for each sample.

In [None]:
dev_data = annotate_conll_data_with_lfs(path_dev_data, rule2rule_id, labels2ids)
test_data = annotate_conll_data_with_lfs(path_test_data, rule2rule_id, labels2ids)

In [None]:
dev_data.head()

We also provide rule_matches_z matrices for dev and test data in order to calculate the simple majority baseline. They won't be used in any of the denoising algorithms provided in Knodle.

In [None]:
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data, num_rules)
test_rule_matches_z = get_rule_matches_z_matrix(test_data, num_rules)
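
How such matrices could feed a simple majority baseline is sketched below on toy data. This is an illustration under assumptions, not Knodle's own baseline implementation: the variable names and the toy matrices are hypothetical, ties are resolved by `argmax` in favour of the lowest class id, and samples without any rule match are assigned to the "no_relation" class.

```python
import numpy as np
import scipy.sparse as sp

# Toy inputs (hypothetical): 3 samples x 3 rules and 3 rules x 2 classes.
toy_z = sp.csr_matrix(np.array([[1, 0, 0], [0, 0, 0], [0, 1, 1]]))
toy_t = np.array([[1, 0], [0, 1], [0, 1]])
no_relation_id = 2  # assumed id of the extra "no_relation" class
gold_labels = np.array([0, 2, 1])

# Number of rule votes per class for every sample.
votes = np.asarray(toy_z @ toy_t)

# Majority vote; argmax resolves ties in favour of the lowest class id.
predictions = votes.argmax(axis=1)

# Samples where no rule matched are treated as negative samples.
predictions[votes.sum(axis=1) == 0] = no_relation_id

accuracy = (predictions == gold_labels).mean()
print(predictions, accuracy)
```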

## Statistics

Let's look at some statistics of the data we have prepared.

In [None]:
print(f"Number of rules: {num_rules}")
print(f"Dimension of t matrix: {mapping_rules_labels_t.shape}")
print(f"Number of samples in train set: {len(train_data)}")

In [None]:
print(f"Number of samples in dev set: {len(dev_data)}")
dev_stat = dev_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(dev_stat.to_html(index=False))

In [None]:
print(f"Number of samples in test set: {len(test_data)}")
test_stat = test_data.groupby(['enc_labels','labels'])['samples'].count().sort_values(ascending=False).reset_index(name='count')
HTML(test_stat.to_html(index=False))

## Save files

Finally, we save all the data we have produced.

In [None]:
Path(os.path.join(data_path, "processed")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed", T_MATRIX_OUTPUT_TRAIN))

dump(train_data, os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT))
dump(train_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TRAIN))

dump(dev_data, os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT))
dump(dev_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_DEV))

dump(test_data, os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT))
dump(test_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_OUTPUT_TEST))
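
The saved files can be read back with `joblib.load`. The sketch below shows the round trip on a toy matrix written to a temporary directory, so that it does not touch the real output folder. The toy matrix and the temporary path are hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from joblib import dump, load

# Toy sparse matrix standing in for one of the saved z matrices.
toy_z = sp.csr_matrix(np.eye(2))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_rule_matches_z.lib")
    dump(toy_z, path)
    restored = load(path)

# The restored object is a sparse matrix with the same content.
print(restored.shape)
```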

## Finish

Congrats! Now we have all the data we need to launch Knodle on weakly-annotated TAC-based data.