## Dependencies

In [None]:
import sys

sys.path.append('../lib')

## Prepare training and test datasets

In [9]:
# DRKG data
!cp ../graph_data/formatted_relations/drkg/formatted_drkg.tsv ./rawdata/drkg/formatted_drkg.tsv
!cp ../graph_data/formatted_relations/drkg/unformatted_drkg.tsv ./rawdata/drkg/unformatted_drkg.tsv

We need to prepare the training and external test datasets. The training dataset is used to train every KGE model, and the external test dataset is used to evaluate them. Many DRKG relations cannot be formatted; these are stored in `unformatted_drkg.tsv`. We therefore split only the relations in `formatted_drkg.tsv` into training and external test datasets.

Next, we split `formatted_drkg.tsv` into a training dataset, `formatted_drkg_train.tsv`, and a test dataset, `formatted_drkg_test.tsv`.
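
For readers who want to see what such a split amounts to, here is a minimal sketch on a toy table. It is not the implementation in `../lib/data.py`: the function name `split_relations`, the `seed` parameter, and the reading of `-r 0.95` as "about 95% of the rows go to the first output" are all assumptions made for illustration.

```python
import pandas as pd


def split_relations(df: pd.DataFrame, ratio: float = 0.95, seed: int = 42):
    """Randomly split df in two; the first part holds about `ratio` of the rows.

    Hypothetical helper for illustration only. It assumes df has a unique index.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    # Sample the first part without replacement; the remaining rows form the second part.
    first = df.sample(frac=ratio, random_state=seed)
    second = df.drop(first.index)
    return first.reset_index(drop=True), second.reset_index(drop=True)


# Toy relations table; the values are made up and only mimic the column layout.
toy = pd.DataFrame(
    {
        "relation_type": ["treats"] * 100,
        "source_id": [f"drug_{i}" for i in range(100)],
        "target_id": [f"disease_{i}" for i in range(100)],
    }
)
train_df, test_df = split_relations(toy, ratio=0.95)
print(len(train_df), len(test_df))
```

Fixing the random seed keeps the split reproducible across runs, which matters when several KGE models are compared on the same external test dataset.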

In [14]:
!python ../lib/data.py split -i ./rawdata/drkg/formatted_drkg.tsv -o1 ./rawdata/drkg/formatted_drkg_train.tsv -o2 ./rawdata/drkg/formatted_drkg_test.tsv -r 0.95

In [15]:
# HSDN data
!cp ../graph_data/formatted_relations/hsdn/formatted_hsdn.tsv ./rawdata/optional/formatted_hsdn.tsv

## Prepare DRKG dataset

Merge `formatted_drkg_train.tsv` and `unformatted_drkg.tsv`, then split the merged relations into `train.tsv` and `valid.tsv`, which are used to train the KGE models.

In [16]:
datadir = './rawdata/drkg'

In [17]:
import os
import pandas as pd

selected_columns = [
    "relation_type",
    "source_type",
    "source_id",
    "target_type",
    "target_id",
    "resource",
]

formatted_drkg_data = pd.read_csv(
    os.path.join(datadir, "formatted_drkg_train.tsv"), sep="\t"
)
formatted_drkg_data = formatted_drkg_data[selected_columns]
print("Formatted DRKG data shape: ", formatted_drkg_data.shape)

unformatted_drkg_data = pd.read_csv(
    os.path.join(datadir, "unformatted_drkg.tsv"), sep="\t"
)
unformatted_drkg_data = unformatted_drkg_data[selected_columns]
print("Unformatted DRKG data shape: ", unformatted_drkg_data.shape)

relations = pd.concat(
    [
        formatted_drkg_data,
        unformatted_drkg_data,
    ]
)

# Save the merged data (create the output directory first if it does not exist)
os.makedirs("drkg", exist_ok=True)
relations.to_csv(os.path.join("drkg", "relations.tsv"), sep="\t", index=False)

Formatted DRKG data shape:  (5394641, 6)
Unformatted DRKG data shape:  (194412, 6)

In [18]:
!python ../lib/data.py hrt -i ./drkg/relations.tsv -o ./drkg/relations_hrt.tsv

In [20]:
!python ../lib/data.py split -i ./drkg/relations_hrt.tsv -o1 ./drkg/train.tsv -o2 ./drkg/valid.tsv -r 0.95

In [None]:
!python ../lib/data.py hrt -i ./rawdata/drkg/formatted_drkg_test.tsv -o ./drkg/test.tsv