# Create an Open-World Split

This notebook shows how dataset splits are created, using IRT-CDE as
the example. The algorithm that determines *concept entities* and
subsequently selects *open-world* entities is described in Section 3
of the paper. An implementation of that algorithm can be found in
`irt/graph/split.py:Splitter.create`. We first create a
`split.Dataset` and then, after adding textual information, a
`text.Dataset`. Together, these two form an IRT dataset.

First, a knowledge graph needs to be loaded. We use CoDEx and the
loader defined in `irt/graph/loader.py`. Each loader function returns
an `irt.graph.GraphImport` instance that is used to instantiate an
`irt.graph.Graph`.


In [1]:
%load_ext autoreload
%autoreload 2

import irt

name = 'irt.cde-ipynb'

The CoDEx repository must be cloned into `lib/codex` first:

``` bash
mkdir -p lib
git clone https://github.com/tsafavi/codex lib/codex
```

In [None]:
# create a graph import

from irt.graph import loader as graph_loader

data_dir = irt.ENV.LIB_DIR / 'codex/data'

source = graph_loader.load_codex(
    data_dir / 'triples/codex-m/train.txt',
    data_dir / 'triples/codex-m/valid.txt',
    data_dir / 'triples/codex-m/test.txt',
    f_ent2id=data_dir / 'entities/en/entities.json',
    f_rel2id=data_dir / 'relations/en/relations.json',
)
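
To make the input to the loader more concrete, the following is a minimal sketch of reading tab-separated `head relation tail` lines and mapping entities to integer ids. It is not the code of `graph_loader.load_codex`; the sample data, the `read_triples` helper, and the id assignment are assumptions made purely for illustration.

```python
import io

# Hypothetical sample in a tab-separated "head<TAB>relation<TAB>tail" layout
sample = io.StringIO("Q1\tP31\tQ2\nQ3\tP31\tQ2\nQ1\tP27\tQ4\n")


def read_triples(fd):
    """Yield (head, relation, tail) tuples from a tab-separated stream."""
    for line in fd:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        h, r, t = line.split("\t")
        yield h, r, t


triples = list(read_triples(sample))

# ASSUMPTION: ids are assigned in sorted order; a real loader may differ
entities = sorted({e for h, _, t in triples for e in (h, t)})
ent2id = {e: i for i, e in enumerate(entities)}
id_triples = [(ent2id[h], r, ent2id[t]) for h, r, t in triples]

print(id_triples)
```

Whatever the actual loader does internally, its result is passed on as the `source` of the graph in the next cell.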

In [None]:
# instantiate and persist a graph instance

from irt.graph import graph
g = graph.Graph(name=name, source=source)

print(str(g))
print(g.description)
g.save(irt.ENV.DATASET_DIR / name / 'graph')

## Determine the relation ratio

Each relation is assigned a ratio. We sort the relations by this ratio and use the ones with the lowest values to determine concept entities.

In [None]:
from irt.graph import split
from tabulate import tabulate

rels = split.Relation.from_graph(g)
rels.sort(key=lambda rel: rel.ratio)


def show_relations(rels, N: int = 10):
    # one row per relation: rank, id, ratio, #heads, #tails, name
    rows = [
        (i, r.r, r.ratio, len(r.hs), len(r.ts), r.name)
        for i, r in enumerate(rels, 1)
    ]

    print(f'first {N}')
    print(tabulate(rows[:N]))


print(f'got {len(rels)} relations')
show_relations(rels)
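
The table lists the ratio next to the number of distinct head and tail entities of each relation. As a rough illustration of how such a ratio could be derived, here is a small sketch on toy triples. It assumes that the ratio is the size of the smaller entity set divided by the size of the larger one; that definition, the toy data, and the `ratio` helper are assumptions, so consult `split.Relation` for the actual computation.

```python
from collections import defaultdict

# Toy triples (head, relation, tail); all values are made up
triples = [
    ("berlin", "capital_of", "germany"),
    ("paris", "capital_of", "france"),
    ("alice", "occupation", "physicist"),
    ("bob", "occupation", "physicist"),
    ("carol", "occupation", "physicist"),
    ("dave", "occupation", "chemist"),
]

# collect the distinct heads and tails of every relation
heads, tails = defaultdict(set), defaultdict(set)
for h, r, t in triples:
    heads[r].add(h)
    tails[r].add(t)


def ratio(r):
    # ASSUMPTION: smaller entity set size divided by the larger one
    sizes = len(heads[r]), len(tails[r])
    return min(sizes) / max(sizes)


ratios = {r: ratio(r) for r in heads}

# ascending order, mirroring rels.sort(key=lambda rel: rel.ratio)
for r in sorted(ratios, key=ratios.get):
    print(f"{r}: ratio={ratios[r]:.2f} heads={len(heads[r])} tails={len(tails[r])}")
```

Under this assumption, a relation such as `occupation`, where many heads share few tails, receives a low ratio, and the small side of the relation is a natural candidate set for concept entities.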

In [None]:
import matplotlib.pyplot as plt

def plot_relations(g, rels):
    fig = plt.figure()
    ax = fig.add_subplot(111)

    ax.set_title(f'Relation Distribution {name}')
    ax.set_xlabel('Relation')
    ax.set_ylabel('Ratio')

    ax.plot(range(len(rels)), [r.ratio for r in rels], color='#333')

plot_relations(g, rels)

After examining the distribution, we apply a threshold at relation 27 and exclude some of the relations selected this way. No additional relations are included, although this is possible and was done for IRT-FB.

In [None]:
# define the configuration

cfg = split.Config(
    # make it deterministic
    seed=30061990,
    # select concept entities from the first 27 relations
    threshold=27,
    # retain around 60% of all triples for the cw split
    ow_split=0.6,
    # retain around 50% of all ow triples for testing
    ow_train_split=0.5,
    # exclude some relations
    excludelist={
        'P551:residence',
        'P407:language of work or name',
        'P530:diplomatic relation',
    },
    # do not include additional relations
    includelist=set(),
)

print(cfg)
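
How `threshold`, `excludelist`, and `includelist` might interact can be sketched as follows. This is not the logic of `split.Splitter`; the relation names, the ratios, and the `select_relations` helper are hypothetical and only illustrate the idea of "take the first N relations by ratio, drop the excluded ones, add the included ones".

```python
# Hypothetical (name, ratio) pairs, already sorted by ascending ratio
relations = [
    ("R1:occupation", 0.01),
    ("R2:residence", 0.02),
    ("R3:genre", 0.05),
    ("R4:spouse", 0.90),
]


def select_relations(relations, threshold, excludelist, includelist):
    """Pick the first `threshold` relations, then apply both lists."""
    names = [name for name, _ in relations]
    selected = set(names[:threshold])
    selected -= set(excludelist)  # drop unwanted relations
    selected |= set(includelist) & set(names)  # add known extra relations
    return selected


chosen = select_relations(
    relations,
    threshold=3,
    excludelist={"R2:residence"},
    includelist=set(),
)

print(sorted(chosen))
```

With an empty `includelist`, as in the configuration above, the selection can only shrink relative to the thresholded set.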

In [None]:
# based on this configuration, a split is created

from irt.common import helper
helper.seed(cfg.seed)

path = helper.path(irt.ENV.DATASET_DIR / name / 'split', create=True)
splitter = split.Splitter(g=g, cfg=cfg, name=name, path=path)
splitter.create()
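
Seeding before the split is what makes the result reproducible. The following sketch shows the general idea with Python's standard `random` module: a seeded shuffle followed by a cut at a given ratio. It is not the algorithm in `Splitter.create`, and the `split_by_ratio` helper as well as the interpretation of the ratio are assumptions for illustration only.

```python
import random


def split_by_ratio(items, ratio, seed):
    """Shuffle a copy of `items` deterministically and cut it at `ratio`."""
    rng = random.Random(seed)  # private generator, no global state touched
    shuffled = sorted(items)  # fixed starting order before shuffling
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


entities = range(10)
first, second = split_by_ratio(entities, ratio=0.6, seed=30061990)
again, _ = split_by_ratio(entities, ratio=0.6, seed=30061990)

print(len(first), len(second))
print(first == again)
```

Running the function twice with the same seed yields the same partition, which is the property a fixed `seed` in the configuration is meant to provide.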

In [None]:
# the raw text data is stored in an SQLite database:
# create the matching loader and pass it to the text selector

from irt.text import loader as text_loader

database = irt.ENV.SRC_DIR / 'text' / 'cde' / 'contexts-v7-2020-12-31.db'
loader = text_loader.SQLite(database=database)

from irt.text import selector

# this creates the text files in DATASET_DIR / <name> / text

path = helper.path(irt.ENV.DATASET_DIR / name, create=True)
selector.create(loader=loader, path=path, seed=cfg.seed, contexts=30, mask=True, mark=True)
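
The `mask` and `mark` flags suggest that, besides the plain text, variants are written in which the entity mention is hidden or delimited. The sketch below illustrates what such transformations could look like. The token strings and both helper functions are hypothetical and are not taken from `irt.text.selector`.

```python
import re

MASK = "[MASK]"  # hypothetical placeholder token
START, END = "[MENTION_START]", "[MENTION_END]"  # hypothetical marker tokens


def mask_mention(sentence, mention):
    """Replace every occurrence of the mention with a mask token."""
    pattern = re.compile(re.escape(mention), flags=re.IGNORECASE)
    return pattern.sub(MASK, sentence)


def mark_mention(sentence, mention):
    """Wrap every occurrence of the mention in start and end markers."""
    pattern = re.compile(re.escape(mention), flags=re.IGNORECASE)
    return pattern.sub(lambda m: f"{START} {m.group(0)} {END}", sentence)


sentence = "Berlin is the capital of Germany."
masked = mask_mention(sentence, "Berlin")
marked = mark_mention(sentence, "Berlin")

print(masked)
print(marked)
```

Masking forces a model to rely on the surrounding context, while marking keeps the mention visible and signals where it is located.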

In [2]:
ds = irt.Dataset(irt.ENV.DATASET_DIR / name)
print(str(ds))

RecursionError: maximum recursion depth exceeded

In [4]:
# verbose description
print(ds.description)

RecursionError: maximum recursion depth exceeded while calling a Python object