# INK: explanatory notebook

This notebook gives a simple example of how the INK library can be used to extract the neighbourhood of subjects of interest.<br>
We also show its rule mining capabilities for both task-specific and task-agnostic cases.


The example dataset can be found in the /datasets folder.<br>
We use the well-known animals dataset, which describes several animals and their properties.

In [1]:
from ink.base.connectors import RDFLibConnector
from ink.base.structure import InkExtractor
from ink.miner.rulemining import RuleSetMiner

To start, three components are imported:
* A connector, which is used to load the original dataset.
Here we use an RDFLib connector, but other connectors are available.
* The INK extractor, which transforms the neighbourhood of given nodes into a binary representation.
* The Rule Set Miner, which is our rule mining module.

### 1. create connector

In [2]:
connector = RDFLibConnector('ink/datasets/animals.owl', 'xml')

The RDFLib connector simply requires the filename of the knowledge graph (KG) to be loaded and the serialization format of that file (here 'xml').<br>
Querying is the most important function of such a connector. SPARQL queries can be executed as follows:

In [3]:
connector.query("Select ?s where {?s a <http://dl-learner.org/benchmark/dataset/animals/T-Rex>.}")

[{'s': {'type': 'uri',
   'value': 'http://dl-learner.org/benchmark/dataset/animals#trex01'}}]
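
The query returns a list of bindings, one dictionary per result row. As a minimal sketch, the plain URIs can be pulled out of that structure as shown below. The helper name `binding_values` is our own and is not part of INK. The result list is copied from the output above.

```python
# Result copied from the cell above: one dict per result row,
# keyed by the SPARQL variable name (without the leading '?').
results = [
    {"s": {"type": "uri",
           "value": "http://dl-learner.org/benchmark/dataset/animals#trex01"}}
]

def binding_values(rows, variable):
    """Return the plain values bound to `variable` in each result row."""
    # Hypothetical helper, not an INK function.
    return [row[variable]["value"] for row in rows if variable in row]

subjects = binding_values(results, "s")
print(subjects)
```

Rows that do not bind the requested variable are skipped, so asking for a variable that is not in the query yields an empty list.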

### 2. create the extractor
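The INK extractor takes a connector as its argument. We also provide a dictionary of prefixes, which maps each namespace to the short form used in the extracted relations and objects.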

In [4]:
prefix={
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf#",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs#",
    "http://www.w3.org/2002/07/owl#": "owl#",
    "http://dl-learner.org/benchmark/dataset/animals/": "animals/"
}
extractor = InkExtractor(connector, prefixes=prefix)
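
To make the role of the prefix dictionary concrete, here is a small sketch of how such a mapping can shorten URIs. This is our own illustration, assuming a simple "replace the matching namespace" strategy. The `shorten` function is hypothetical and INK's internal handling may differ.

```python
prefix = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf#",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs#",
    "http://www.w3.org/2002/07/owl#": "owl#",
    "http://dl-learner.org/benchmark/dataset/animals/": "animals/",
}

def shorten(uri, prefixes):
    """Replace the longest matching namespace at the start of `uri`."""
    # Hypothetical helper for illustration only.
    for namespace in sorted(prefixes, key=len, reverse=True):
        if uri.startswith(namespace):
            return prefixes[namespace] + uri[len(namespace):]
    return uri  # no known namespace: leave the URI untouched

print(shorten("http://www.w3.org/2002/07/owl#NamedIndividual", prefix))
print(shorten("http://dl-learner.org/benchmark/dataset/animals/Penguin", prefix))
print(shorten("http://dl-learner.org/benchmark/dataset/animals#penguin01", prefix))
```

The last URI stays unchanged because no key in the dictionary matches the `animals#` namespace, which is consistent with the individuals appearing as full URIs in the outputs of this notebook.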

### 3. create the rule miner
All arguments are optional here. Take a look at the documentation for more information on how they influence the mining capabilities.

In [5]:
miner = RuleSetMiner(chains=100,max_len_rule_set=3, forest_size=10)

## <ins>Task specific rule mining</ins>
We define two different sets of animals:
* one containing only mammals
* one containing only non-mammals

In [6]:
pos = set([ "http://dl-learner.org/benchmark/dataset/animals#dog01",
            "http://dl-learner.org/benchmark/dataset/animals#dolphin01",
            "http://dl-learner.org/benchmark/dataset/animals#platypus01",
            "http://dl-learner.org/benchmark/dataset/animals#bat01"])

neg = set(["http://dl-learner.org/benchmark/dataset/animals#trout01",
            "http://dl-learner.org/benchmark/dataset/animals#herring01",
            "http://dl-learner.org/benchmark/dataset/animals#shark01",
            "http://dl-learner.org/benchmark/dataset/animals#lizard01",
            "http://dl-learner.org/benchmark/dataset/animals#croco01",
            "http://dl-learner.org/benchmark/dataset/animals#trex01",
            "http://dl-learner.org/benchmark/dataset/animals#turtle01",
            "http://dl-learner.org/benchmark/dataset/animals#eagle01",
            "http://dl-learner.org/benchmark/dataset/animals#ostrich01",
            "http://dl-learner.org/benchmark/dataset/animals#penguin01"])

The goal for INK is to learn a rule that separates the pos set (mammals) from the neg set (non-mammals).
First, we extract the neighbourhoods, up to a certain depth, of the nodes in both the pos set and the neg set.

In [7]:
X_train, y_train = extractor.create_dataset(4, pos, neg, jobs=4)

X_train stores, for each node in the pos set and neg set, its neighbourhood as a dictionary that maps each relation (a path of predicates) to the list of objects reached through it.<br>
An example animal is given below:

In [8]:
print("animal:", X_train[0][0])
for x in X_train[0][1]:
    print("relation:",x)
    print("objects:",X_train[0][1][x])

animal: http://dl-learner.org/benchmark/dataset/animals#penguin01
relation: rdf#type
objects: ['animals/Penguin', 'owl#NamedIndividual']
relation: rdf#type.rdfs#subClassOf
objects: ['N1319a72957e34f7e858ecded1c8131b2', 'Neacb0a4fc914498abe3ca6174aeffa68', 'animals/Homeothermic', 'animals/Animal', 'N964bdc9614854844aba1bf9b44671ea7', 'animals/HasEggs']
relation: rdf#type.rdf#type
objects: ['owl#Class']
relation: rdf#type.rdfs#subClassOf.rdfs#subClassOf
objects: ['animals/Animal', 'animals/Animal']
relation: rdf#type.rdfs#subClassOf.rdf#type
objects: ['owl#Class', 'owl#Class', 'owl#Class']
relation: rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#type
objects: ['owl#Class', 'owl#Class']
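
To show how such a dictionary of dotted relation paths can arise, the sketch below walks a tiny hand-made graph level by level. It is an illustration only, not INK's implementation. The triples are made up (loosely modelled on the penguin output above) and the function `neighbourhood` is hypothetical.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object), made up for illustration
# and already written with short prefixes.
triples = [
    ("penguin01", "rdf#type", "animals/Penguin"),
    ("animals/Penguin", "rdfs#subClassOf", "animals/Homeothermic"),
    ("animals/Penguin", "rdfs#subClassOf", "animals/HasEggs"),
    ("animals/Homeothermic", "rdfs#subClassOf", "animals/Animal"),
]

def neighbourhood(node, triples, depth):
    """Collect objects reachable from `node`, keyed by the dotted predicate path."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    result = defaultdict(list)
    frontier = [("", node)]  # pairs of (path so far, current node)
    for _ in range(depth):
        next_frontier = []
        for path, current in frontier:
            for p, o in outgoing.get(current, []):
                new_path = f"{path}.{p}" if path else p
                result[new_path].append(o)  # duplicates are kept on purpose
                next_frontier.append((new_path, o))
        frontier = next_frontier
    return dict(result)

hood = neighbourhood("penguin01", triples, depth=3)
for relation, objects in hood.items():
    print("relation:", relation)
    print("objects:", objects)
```

Because every route to an object is recorded, the same object can appear more than once under a relation, as with 'animals/Animal' in the output above.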


y_train contains the class vector indicating to which group each animal belongs.

In [9]:
y_train

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0])
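
As a minimal sketch of what this vector encodes, we assume here that a node receives label 1 when it is in the pos set and 0 otherwise. The five nodes below are the first five animals in the order used elsewhere in this notebook's outputs. How INK orders the nodes internally is not shown in this notebook.

```python
base = "http://dl-learner.org/benchmark/dataset/animals#"
pos = {base + name for name in ("dog01", "dolphin01", "platypus01", "bat01")}

# First five nodes, in the order they appear in this notebook's outputs.
nodes = [base + name
         for name in ("penguin01", "eagle01", "trout01", "shark01", "bat01")]

# Assumed labelling convention: 1 for the pos set, 0 for everything else.
labels = [1 if node in pos else 0 for node in nodes]
print(labels)
```

Under that assumption the first five labels match the start of the array above, and the four ones in the full array correspond to the four mammals in the pos set.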

Next, we construct the binary INK representation from these dictionaries.


In [10]:
X_train = extractor.fit_transform(X_train, counts=True, levels=True)

Both the counts and levels options are enabled here. They are responsible for 1) counting the number of objects related to a subject and 2) adding extra features for greater-than and less-than comparisons when all objects are numbers.
<br>
The transformed X_train variable is a tuple consisting of three parts:
* a sparse boolean matrix
* a list of indices (here the URIs of the animals)
* a list of column names
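
The three-part structure can be illustrated with a simplified sketch that turns neighbourhood dictionaries into a boolean matrix. It only generates "relation" and "relation§object" features and leaves out the count and level features. The toy neighbourhoods are made up, a plain list of lists stands in for the sparse matrix, and `features_of` is a hypothetical helper, so the features INK actually generates may differ.

```python
# Toy neighbourhoods (made up): node -> {relation path: [objects]}.
hoods = {
    "penguin01": {"rdf#type": ["animals/Penguin"],
                  "rdf#type.rdfs#subClassOf": ["animals/HasEggs"]},
    "bat01": {"rdf#type": ["animals/Bat"],
              "rdf#type.rdfs#subClassOf": ["animals/HasMilk"]},
}

def features_of(hood):
    """Active feature names of one node: each relation and each relation§object pair."""
    feats = set()
    for path, objects in hood.items():
        feats.add(path)
        feats.update(f"{path}§{obj}" for obj in objects)
    return feats

index = list(hoods)                                   # row labels (nodes)
per_node = [features_of(hoods[node]) for node in index]
columns = sorted(set().union(*per_node))              # column names (features)
matrix = [[feat in feats for feat in columns] for feats in per_node]

X = (matrix, index, columns)  # same three-part layout as described above
col = columns.index("rdf#type.rdfs#subClassOf§animals/HasMilk")
print(index[1], matrix[1][col])
```

Each row is a node, each column is a feature, and a cell is True when that feature occurs in the node's neighbourhood.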

This X_train tuple can be transformed into a pandas dataframe:

In [11]:
import pandas as pd
df_train = pd.DataFrame.sparse.from_spmatrix(X_train[0], index=X_train[1], columns=X_train[2])
df_train.head()

Unnamed: 0,count.rdf#type,count.rdf#type.rdfs#subClassOf,count.rdf#type.rdfs#subClassOf.rdf#type,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class<=2,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class<=3,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class<=4,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class>=2,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class>=3,count.rdf#type.rdfs#subClassOf.rdf#type.owl#Class>=4,...,rdf#type§animals/Herring,rdf#type§animals/Lizard,rdf#type§animals/Ostrich,rdf#type§animals/Penguin,rdf#type§animals/Platypus,rdf#type§animals/Shark,rdf#type§animals/T-Rex,rdf#type§animals/Trout,rdf#type§animals/Turtle,rdf#type§owl#NamedIndividual
http://dl-learner.org/benchmark/dataset/animals#penguin01,True,True,True,True,0,True,True,True,True,0,...,0,0,0,1,0,0,0,0,0,True
http://dl-learner.org/benchmark/dataset/animals#eagle01,True,True,True,True,0,True,True,True,True,0,...,0,0,0,0,0,0,0,0,0,True
http://dl-learner.org/benchmark/dataset/animals#trout01,True,True,True,True,0,True,True,True,True,0,...,0,0,0,0,0,0,0,1,0,True
http://dl-learner.org/benchmark/dataset/animals#shark01,True,True,True,True,0,True,True,True,True,0,...,0,0,0,0,0,1,0,0,0,True
http://dl-learner.org/benchmark/dataset/animals#bat01,True,True,True,True,0,True,True,True,True,0,...,0,0,0,0,0,0,0,0,0,True


To mine rules, we can simply provide the X_train tuple to the rule mining module. <br>
Together with the labels (y_train), the rule miner will now try to find a rule that discriminates the pos set from the neg set.



In [12]:
acc, rules = miner.fit(X_train, y_train)

The output of this miner is both the achieved accuracy and the found rules. <br>
We can print the rules as follows:

In [13]:
print(acc)
miner.print_rules(rules)

1.0
['rdf#type.rdfs#subClassOf§animals/HasMilk']


With 100% accuracy, we can separate the positive examples from the negative ones using the rule above.
This rule states that an individual whose type ?x is a subclass of animals/HasMilk is a mammal.
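
To make the reading of such a rule concrete, the sketch below applies it to a few hand-made feature sets. We assume that a rule is a list of features that must all be active for a node to be classified as positive. The data and the function `rule_accuracy` are hypothetical and only mimic the INK feature names.

```python
rule = ["rdf#type.rdfs#subClassOf§animals/HasMilk"]

# Toy feature sets (made up): the active features of four animals.
node_features = [
    {"rdf#type§animals/Bat", "rdf#type.rdfs#subClassOf§animals/HasMilk"},
    {"rdf#type§animals/Dog", "rdf#type.rdfs#subClassOf§animals/HasMilk"},
    {"rdf#type§animals/Penguin", "rdf#type.rdfs#subClassOf§animals/HasEggs"},
    {"rdf#type§animals/Trout", "rdf#type.rdfs#subClassOf§animals/HasEggs"},
]
labels = [1, 1, 0, 0]  # 1 = mammal, 0 = non-mammal

def rule_accuracy(rule, node_features, labels):
    """Predict 1 when every feature in `rule` is active, then score against labels."""
    correct = 0
    for feats, label in zip(node_features, labels):
        predicted = 1 if all(f in feats for f in rule) else 0
        correct += predicted == label
    return correct / len(labels)

accuracy = rule_accuracy(rule, node_features, labels)
print(accuracy)
```

On this toy data the single HasMilk feature is enough to classify every animal correctly, which mirrors the accuracy reported by the miner above.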



## <ins>Task agnostic rule mining</ins>

A similar setup can be used to mine task agnostic rules. 
<br>The only difference is that no training labels need to be provided.

We now define a general SPARQL query to select our nodes of interest:

In [14]:
query = """
SELECT ?s WHERE {
    ?s a ?o.
    ?o rdfs:subClassOf <http://dl-learner.org/benchmark/dataset/animals/Animal>.
}
"""

This query can be used directly in our neighbourhood extractor.

In [15]:
X_train, _ = extractor.create_dataset(10, query, None)

As you can see, we provide no negative set and do not store the y_train values (because they are all positive). <br>
Next, we transform these neighbourhoods into the INK binary representation.

In [16]:
X_train = extractor.fit_transform(X_train)

X_train can now be used to mine task-agnostic rules. Before fitting, we set the miner's support attribute to 10.

In [17]:
miner.support = 10
miner.fit(X_train)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Unnamed: 10
0,(rdf#type.rdfs#subClassOf.rdf#type),(rdf#type.rdf#type),(rdf#type.rdfs#subClassOf.rdf#type),20.0,20.0,20.0,1.0,0.05,-380.0,inf
1,(rdf#type.rdf#type),20.0,20.0,20.0,1.0,0.05,-380.0,inf,,
2,(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,(rdf#type.rdf#type),(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,20.0,17.0,17.0,1.0,0.05,-323.0,inf
3,(rdf#type.rdf#type),(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,(rdf#type.rdf#type),17.0,20.0,17.0,1.0,0.05,-323.0,-106.666667
4,(rdf#type.rdfs#subClassOf.rdf#type),(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,(rdf#type.rdfs#subClassOf.rdf#type),17.0,20.0,17.0,0.85,0.05,-323.0,-106.666667
5,(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,(rdf#type.rdfs#subClassOf.rdf#type),(rdf#type.rdfs#subClassOf.rdfs#subClassOf.rdf#...,20.0,17.0,17.0,1.0,0.05,-323.0,inf
6,(rdf#type.rdfs#subClassOf.rdfs#subClassOf),(rdf#type.rdfs#subClassOf),22.0,99.0,22.0,1.0,0.010101,-2156.0,inf,
7,(rdf#type.rdfs#subClassOf),(rdf#type.rdfs#subClassOf.rdfs#subClassOf),99.0,22.0,22.0,0.222222,0.010101,-2156.0,-27.0,


The resulting table shows all mined rules for this simple example. <br>
The confidence, lift and leverage measures can be used to further evaluate these mined rules.
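
For reference, the sketch below computes these measures with their standard definitions. The function `rule_metrics` is our own and is not part of INK. These formulas are normally applied to relative frequencies (fractions of all nodes). The values in rows 6 and 7 of the table above are reproduced when the absolute counts are plugged in instead, which would explain why the lift is below 1 and the leverage strongly negative there.

```python
import math

def rule_metrics(antecedent_support, consequent_support, support):
    """Standard association rule measures for a rule: antecedent -> consequent."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence of 1).
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Supports taken from row 7 of the table above (absolute counts).
row = rule_metrics(antecedent_support=99.0, consequent_support=22.0, support=22.0)
for name, value in row.items():
    print(name, round(value, 6))
```

When comparing rules, it can therefore be useful to first divide the supports by the total number of nodes, so that lift, leverage and conviction keep their usual interpretation.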