# AgilAI - CPS Informatik
## Creating Knowledge Graph for Knowledge Base using Transformers - © TU Kaiserslautern
##### Khushnood Adil Rafique (MSc, MCA) (khushnood.rafique@cs.uni-kl.de),
##### Frank Wawrzik (MSc) (wawrzik@cs.uni-kl.de),
##### Vishwanath Tarikere Sathyanarayana (MSc) (sathyana@rhrk.uni-kl.de)
* In this work, we implement a full pipeline that extracts relations from text, builds a knowledge graph, and visualizes it.
* The main idea is based on named entity recognition (NER) and relation classification.
* The classes, or labels, are based on an ISO 26262 ontology.
* The model used here is REBEL (Relation Extraction By End-to-end Language generation).


##### What are a Knowledge Base and a Knowledge Graph?
A Knowledge Base (KB) is information stored as structured data, ready to be used for analysis or inference. Usually, a KB is stored as a graph (i.e. a Knowledge Graph), where nodes are entities and edges are relations between entities.
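
As a minimal sketch of this idea, a knowledge graph can be held as a list of (head, relation type, tail) triples and viewed as an adjacency list. The triples below are illustrative only and are not taken from the dataset.

```python
from collections import defaultdict

# Illustrative (head, relation type, tail) triples; not taken from the dataset.
triples = [
    ("HDD", "has property", "areal density"),
    ("system", "has part directly", "component"),
    ("system", "implements", "function"),
]

# Adjacency-list view of the graph: node -> list of (relation type, neighbour).
graph = defaultdict(list)
for head, rel_type, tail in triples:
    graph[head].append((rel_type, tail))

# Every entity that appears as a head or a tail is a node.
nodes = {entity for head, _, tail in triples for entity in (head, tail)}

print(sorted(nodes))
print(graph["system"])
```

Nodes are the entities, and each triple contributes one directed, labelled edge from its head to its tail.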


##### How does REBEL work?
REBEL is a text2text model trained by Babelscape by fine-tuning BART to translate a raw input sentence containing entities and implicit relations into a set of triplets that explicitly state those relations. It has been trained on more than 200 different relation types.

##### Import all the necessary libraries and classes.
* Transformers: Load the REBEL model.
* pyvis: Graph visualizations.

In [None]:
!pip install transformers pyvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting pyvis
  Downloading pyvis-0.3.0.tar.gz (592 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting jsonpickle>=1.4.1
  Downloading jsonpickle-2.2.0-py2.py3-none-any.whl (39 kB)
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
Building wheels for collecte

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import math
import torch
import IPython
from pyvis.network import Network

#### Load the Relation Extraction model
With the transformers library, we can load the pre-trained REBEL model and tokenizer with a few lines of code.

In [None]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

#### From short text to Knowledge Base
The function below parses the strings generated by REBEL and transforms them into relation triplets (e.g. <HDD, has property, areal density>). It must take into account the special tokens (<triplet>, <subj>, and <obj>) that were added when training the model. Fortunately, the REBEL model card provides a complete code example for this function.

In [None]:
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations

##### The above function outputs a list of relations, where each relation is represented as a dictionary with the following keys:

* head : The subject of the relation (e.g. “HDD”).
* type : The relation type (e.g. “has property”).
* tail : The object of the relation (e.g. “areal density”).
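
To make the linearized format concrete, here is a compact sketch of the same parsing idea. The helper name `parse_linearized` and the sample string are hypothetical; the sample only imitates the token layout that the function above expects (`<triplet>` head `<subj>` tail `<obj>` relation type) and is not real model output.

```python
import re

def parse_linearized(text):
    """Hypothetical compact parser for REBEL-style linearized triplets."""
    # Drop the sequence-level special tokens, keeping <triplet>/<subj>/<obj>.
    text = re.sub(r"</?s>|<pad>", " ", text)
    triples = []
    # Each <triplet> chunk starts with the head, followed by one or more
    # "<subj> tail <obj> relation" groups that share that head.
    for chunk in text.split("<triplet>")[1:]:
        head, *groups = chunk.split("<subj>")
        for group in groups:
            if "<obj>" not in group:
                continue  # incomplete group, e.g. a truncated generation
            tail, relation = group.split("<obj>", 1)
            triples.append({
                "head": head.strip(),
                "type": relation.strip(),
                "tail": tail.strip(),
            })
    return triples

# Made-up sample string in the linearized layout.
sample = ("<s><triplet> HDD <subj> areal density <obj> has property "
          "<subj> storage device <obj> instance of</s>")
for triple in parse_linearized(sample):
    print(triple)
```

A single `<triplet>` block can carry several `<subj> ... <obj> ...` groups, in which case every group reuses the same head.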

#### Implement knowledge base class. 
Our KB class is made of a list of relations and has several methods to deal with adding new relations to the knowledge base or printing them.

In [None]:
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

#### The from_small_text_to_kb function
This function returns a KB object with the relations extracted from a short text. It does the following:

* Initialize an empty knowledge base KB object.
* Tokenize the input text.
* Use REBEL to generate relations from the text.
* Parse REBEL output and store relation triplets into the knowledge base object.
* Return the knowledge base object.

In [None]:
def from_small_text_to_kb(text, verbose=False):
    kb = KB()

    # Tokenize text
    model_inputs = tokenizer(text, max_length=512, padding=True, truncation=True,
                            return_tensors='pt')
    if verbose:
        print(f"Num tokens: {len(model_inputs['input_ids'][0])}")

    # Generate
    gen_kwargs = {
        "max_length": 216,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": 3
    }
    generated_tokens = model.generate(
        **model_inputs,
        **gen_kwargs,
    )
    decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

    # create kb
    for sentence_pred in decoded_preds:
        relations = extract_relations_from_model_output(sentence_pred)
        for r in relations:
            kb.add_relation(r)

    return kb

#### Import Pandas
* The pandas read_csv() function reads a CSV file into a DataFrame.
* We use it to import our dataset (the knowledge base).

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("combined_data_v5.csv")



In [None]:
df.tail()

Unnamed: 0,Sentence #,Word,POS,Agila_DB_tag
53229,1936,functions,NOUN,O
53230,1936,with,ADP,O
53231,1936,data,NOUN,O
53232,1936,tables,NOUN,O
53233,1936,.,PUNCT,O
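
Because the dataset stores one token per row, sentences have to be rebuilt before they can be passed to the model. The sketch below shows one way to do this with a pandas `groupby` on a toy frame. The values in `toy` are made up; only the column names `Sentence #` and `Word` follow the dataset.

```python
import pandas as pd

# Toy frame mimicking the token-per-row layout (the values are made up).
toy = pd.DataFrame({
    "Sentence #": [1, 1, 1, 2, 2],
    "Word": ["The", "HDD", "spins", "Data", "tables"],
})

# Join the tokens of each sentence with spaces. astype(str) guards against
# tokens that pandas may have parsed as missing values.
sentences = toy["Word"].astype(str).groupby(toy["Sentence #"]).agg(" ".join)

for sentence_id, text in sentences.items():
    print(sentence_id, text)
```

Grouping once avoids filtering the whole DataFrame separately for every sentence number.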


#### From long text to Knowledge Base
The model works better with shorter inputs. Intuitively, relations in raw text are often expressed within a single sentence or a few contiguous sentences, so it may not be necessary to consider many sentences at once to extract a specific relation. Additionally, extracting a few relations is a simpler task than extracting many.

We can therefore split a long input, for example 1000 tokens, into eight shorter overlapping spans of 128 tokens each and extract relations from each span. While doing so, we also attach metadata to each extracted relation recording the boundaries of the span it came from. With this information, we can see which span of the text produced a given relation stored in our knowledge base.
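
The span arithmetic can be sketched in isolation as follows. The helper name `compute_span_boundaries` is hypothetical; the calculation mirrors the one used inside from_text_to_kb.

```python
import math

def compute_span_boundaries(num_tokens, span_length=128):
    """Return [start, end) token boundaries of equally sized overlapping spans."""
    num_spans = math.ceil(num_tokens / span_length)
    # Spread the surplus (num_spans * span_length - num_tokens) over the
    # gaps between consecutive spans, rounding the overlap up.
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    boundaries = []
    start = 0
    for i in range(num_spans):
        boundaries.append([start + span_length * i,
                           start + span_length * (i + 1)])
        start -= overlap  # shift every later span back by one more overlap
    return boundaries

spans = compute_span_boundaries(1000)
print(len(spans))
print(spans[:2], spans[-1])
```

For 1000 tokens this gives eight spans with an overlap of four tokens. Since the overlap is rounded up, the last span can end a few tokens before the end of the input (here at token 996).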

Let’s modify the KB methods so that span boundaries are saved as well. The relation dictionary now has the following keys:

* head : The subject of the relation (e.g. “Fabio”).
* type : The relation type (e.g. “lives in”).
* tail : The object of the relation (e.g. “Italy”).
* meta : A dictionary containing meta information about the relation. This dictionary has a spans key, whose value is the list of span boundaries.

In [None]:
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
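
The merging behaviour can also be sketched without the class, using a dictionary keyed on (head, type, tail). This is an alternative illustration under that assumption, not the implementation above, and the function name `merge_relation_list` is hypothetical.

```python
def merge_relation_list(relations):
    """Collapse duplicate relations and union their span boundaries."""
    merged = {}
    for r in relations:
        key = (r["head"], r["type"], r["tail"])
        if key not in merged:
            # Copy the span list so the input relations are not mutated.
            merged[key] = {
                "head": r["head"],
                "type": r["type"],
                "tail": r["tail"],
                "meta": {"spans": list(r["meta"]["spans"])},
            }
        else:
            spans = merged[key]["meta"]["spans"]
            for span in r["meta"]["spans"]:
                if span not in spans:
                    spans.append(span)
    return list(merged.values())

# The same relation found in two overlapping spans (illustrative values).
found = [
    {"head": "HDD", "type": "has property", "tail": "areal density",
     "meta": {"spans": [[0, 128]]}},
    {"head": "HDD", "type": "has property", "tail": "areal density",
     "meta": {"spans": [[124, 252]]}},
]
print(merge_relation_list(found))
```

The dictionary lookup replaces the linear scan done by exists_relation, which may matter once the knowledge base grows large.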

#### from_text_to_kb
The from_text_to_kb function is similar to from_small_text_to_kb, but it handles longer texts by splitting them into spans.

In [None]:
def from_text_to_kb(text, span_length=128, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) / 
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                "spans": [spans_boundaries[current_span_index]]
            }
            kb.add_relation(relation)
        i += 1

    return kb

In [None]:
relations = []
for i in range(1, 1937):
  print(f"sentence {i}")
  text = " ".join(list(df["Word"][df["Sentence #"] == i]))
  kb = from_text_to_kb(text, verbose=True)
  relations += kb.relations

Streaming output truncated to the last 5000 lines.
sentence 286
Input has 37 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 287
Input has 36 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 288
Input has 64 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 289
Input has 45 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 290
Input has 19 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 291
Input has 19 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 292
Input has 31 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 293
Input has 28 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 294
Input has 29 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 295
Input has 21 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 296
Input has 27 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 297
Input has 18 tokens


Token indices sequence length is longer than the specified maximum sequence length for this model (1745 > 1024). Running this sequence through the model will result in indexing errors


sentence 1536
Input has 1745 tokens
Input has 14 spans
Span boundaries are [[0, 128], [124, 252], [248, 376], [372, 500], [496, 624], [620, 748], [744, 872], [868, 996], [992, 1120], [1116, 1244], [1240, 1368], [1364, 1492], [1488, 1616], [1612, 1740]]
sentence 1537
Input has 10 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 1538
Input has 400 tokens
Input has 4 spans
Span boundaries are [[0, 128], [90, 218], [180, 308], [270, 398]]
sentence 1539
Input has 1107 tokens
Input has 9 spans
Span boundaries are [[0, 128], [122, 250], [244, 372], [366, 494], [488, 616], [610, 738], [732, 860], [854, 982], [976, 1104]]
sentence 1540
Input has 14 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 1541
Input has 28 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 1542
Input has 35 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 1543
Input has 11 tokens
Input has 1 spans
Span boundaries are [[0, 128]]
sentence 1544
Input has 39 toke

### Store in Pickle File 

In [None]:
import pickle

with open('relations.pkl', 'wb') as f:
  pickle.dump(relations, f)

#### Load Pickle File

In [None]:
import pickle

# relations.pkl already contains the pre-extracted relations (subject and object nodes) for our sentences
with open("relations.pkl", "rb") as f:
    kb = pickle.load(f)


In [None]:
nodes = []
for dp in kb:
    nodes.extend([dp["head"], dp["tail"]])

In [None]:
#getting the unique nodes from the extracted pickle file 
nodes = list(set(nodes))

In [None]:
tags = dict()

In [None]:
import pandas as pd
df = pd.read_csv("combined_data_v5.csv")


In [None]:
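# Label each node with the most frequent Agila_DB_tag of its first word
for node in nodes:
    try:
        tag = df[df.Word == node.split()[0]]["Agila_DB_tag"].value_counts().index[0]
        tags[node] = tag
    except IndexError:
        # Empty node text, or first word not found in the dataset
        pass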

In [None]:
# Remove the nodes tagged "O" and split each remaining tag to keep only its class label
filtered_tags = dict()

for key, value in tags.items():
    if value != "O":
        filtered_tags[key] = value.split("-")[-1]
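
The two labelling steps can be sketched together on plain Python data. The rows below are made up, the `B-`/`I-` prefixes are an assumption about the tag format (suggested by the split on "-"), and the helper name `class_of` is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up (word, tag) rows standing in for the Word and Agila_DB_tag columns.
# The "B-"/"I-" prefixes are an assumption about the tag format.
rows = [
    ("HDD", "B-hwc"), ("HDD", "B-hwc"), ("HDD", "O"),
    ("areal", "B-qt"), ("density", "I-qt"),
    ("the", "O"),
]

# Count the tags seen for every word, then keep the most frequent one.
tag_counts = defaultdict(Counter)
for word, tag in rows:
    tag_counts[word][tag] += 1
majority_tag = {word: counts.most_common(1)[0][0]
                for word, counts in tag_counts.items()}

def class_of(node):
    """Class label of a node, based on the majority tag of its first word."""
    words = node.split()
    if not words:
        return None
    tag = majority_tag.get(words[0])
    if tag is None or tag == "O":
        return None  # unknown word or outside any class
    return tag.split("-")[-1]

print(class_of("HDD"), class_of("areal density"), class_of("the"))
```

Building the majority-tag table once means each node needs only a dictionary lookup instead of a scan over the whole dataset.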

In [None]:
tag_nodes = list(filtered_tags.keys())

In [None]:
# Keep only relations whose subject and object are both tagged, and drop those where subject and object are the same
filtered_kb = []
for dp in kb:
    f_kb = dict()
    if (dp["head"] in tag_nodes) and (dp["tail"] in tag_nodes):
        if dp["head"] != dp["tail"]:
            f_kb["head"] = dp["head"]
            f_kb["tail"] = dp["tail"]
            f_kb["type"] = dp["type"]
            filtered_kb.append(f_kb)

In [None]:
set(filtered_tags.values())

{'comp', 'func', 'hwc', 'hwp', 'hwsp', 'mea', 'qt', 'sw', 'sys', 'unit'}

| Class I (Subject)| Relation (Predicate) |Class II (Object) |
|:-----|:---------------|:---------------|
|system | has part directly |component|
|hardware component| has part directly | hardware part |
|element (here: component, <br> hardware component <br> hardware part <br> hardware subpart <br> software <br> system) | implements |function |
|processing unit| executes | software |
|hardware subpart| part of directly | hardware part |
|element (here: component, <br> hardware component <br> hardware part <br> hardware subpart <br> software <br> system)|has property|quantity |
|quantity| has value | measure |
|measure| has unit | unit |
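
One way to read this table is as a lookup keyed on the pair (subject class, object class). The sketch below encodes only the forward direction of each row and assumes the tag abbreviations map to the class names as follows: sys = system, comp = component, hwc = hardware component, hwp = hardware part, hwsp = hardware subpart, sw = software, func = function, qt = quantity, mea = measure. The "processing unit" row is left out because no tag in the set above corresponds to it. The names `RULES` and `relation_for` are hypothetical.

```python
# Classes that count as an "element" in the table (assumed abbreviations).
ELEMENTS = ("sys", "comp", "hwc", "hwp", "hwsp", "sw")

# (subject class, object class) -> relation type, forward direction only.
RULES = {
    ("sys", "comp"): "has part directly",
    ("hwc", "hwp"): "has part directly",
    ("hwsp", "hwp"): "part of directly",
    ("qt", "mea"): "has value",
    ("mea", "unit"): "has unit",
}
for element in ELEMENTS:
    RULES[(element, "func")] = "implements"
    RULES[(element, "qt")] = "has property"

def relation_for(head_class, tail_class):
    """Return the relation type allowed between two classes, or None."""
    return RULES.get((head_class, tail_class))

print(relation_for("sys", "comp"))
print(relation_for("sw", "func"))
print(relation_for("unit", "sys"))
```

A table of this kind keeps the rules in one place, so adding an inverse relation or a new class pair is a single new entry.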

In [None]:
# forming relations between the nodes
def get_newKB(kb, tags):
    new_kb = []
    for dp in kb:
        d_ = dict()
        d_["head"] = dp["head"]
        d_["tail"] = dp["tail"]
        
        if tags[dp["head"]] == "sys" and tags[dp["tail"]] == "comp":
            d_["type"] = "has part directly"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "comp" and tags[dp["tail"]] == "sys":
            d_["type"] = "direct part of"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] == "hwc" and tags[dp["tail"]] == "hwp":
            d_["type"] = "has part directly"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "hwp" and tags[dp["tail"]] == "hwc":
            d_["type"] = "direct part of"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] not in ["mea", "qt", "unit","func"] and tags[dp["tail"]] == "func":
            d_["type"] = "implements"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "func" and tags[dp["tail"]] not in ["mea", "qt", "unit", "func"]:
            d_["type"] = "implemented by"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] in ["sys", "comp", "hwc", "hwp", "hwsp"] and tags[dp["tail"]] == "sw":
            d_["type"] = "executes"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "sw" and tags[dp["tail"]] in ["sys", "comp", "hwc", "hwp", "hwsp"]:
            d_["type"] = "executed by"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] == "hwsp" and tags[dp["tail"]] == "hwp":
            d_["type"] = "part of directly"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "hwp" and tags[dp["tail"]] == "hwsp":
            d_["type"] = "has part"
            if d_ not in new_kb:
                new_kb.append(d_)
        
        elif  tags[dp["head"]] not in ["mea", "func", "unit", "qt"] and tags[dp["tail"]] == "qt":
            d_["type"] = "has property"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "qt" and tags[dp["tail"]] not in ["mea", "func", "unit", "qt"]:
            d_["type"] = "property of"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] == "qt" and tags[dp["tail"]] == "mea":
            d_["type"] = "has value"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "mea" and tags[dp["tail"]] == "qt":
            d_["type"] = "value of"
            if d_ not in new_kb:
                new_kb.append(d_)
            
        elif  tags[dp["head"]] == "mea" and tags[dp["tail"]] == "unit":
            d_["type"] = "has unit"
            if d_ not in new_kb:
                new_kb.append(d_)
        elif  tags[dp["head"]] == "unit" and tags[dp["tail"]] == "mea":
            d_["type"] = "unit of"
            if d_ not in new_kb:
                new_kb.append(d_)
            
    return new_kb
            
    

In [None]:
new_kb = get_newKB(filtered_kb, filtered_tags)

In [None]:
nodes = []
for r in new_kb:
    nodes.extend([r["head"], r["tail"]])

#unique nodes to plot in the knowledge base
nodes = list(set(nodes))

##### Pyvis Network
To visualize the output of our work, we plot the knowledge base. Since a knowledge base is a graph, we can use the pyvis library, which creates interactive network visualizations.

We define a save_network_html function that:

* Initializes an empty directed pyvis network.
* Adds the knowledge base entities as nodes.
* Adds the knowledge base relations as edges.
* Saves the network to an HTML file.

In [None]:
import IPython
from pyvis.network import Network
def save_network_html(kb, nodes):
    # create network
    net = Network(directed=True, width="1500px", height="900px", bgcolor="#eeeeee")

    # nodes
    color_entity = "#00FF00"
    for e in nodes:
        net.add_node(e, shape="circle", color=color_entity)

    # edges
    for r in kb:
        net.add_edge(r["head"], r["tail"],
                    title=r["type"], label=r["type"])
        
    # configure layout physics, then save network
    net.repulsion(
        node_distance=200,
        central_gravity=0.2,
        spring_length=200,
        spring_strength=0.05,
        damping=0.09
    )
    net.set_edge_smooth('dynamic')
    net.show("file.html")

In [None]:

# Build the knowledge base graph and write it to file.html
save_network_html(new_kb, nodes)

In [None]:
from IPython.display import IFrame

IFrame(src='./file.html', width=1000, height=600)