# Constructing a Knowledge Graph from Maintenance Work Order Data

In this notebook we are going to construct a simple knowledge graph using Python, and run some queries on the graph in Neo4j. We have broken the notebook into several steps:

1. Reading in the data
2. Cleaning the data via Lexical Normalisation
3. Extracting entities via Named Entity Recognition (NER)
4. Creating relations between entities via Relation Extraction (RE)
5. Combining NER + RE
6. Creating the graph
7. Querying the graph in Neo4j


## Installing required packages

To run this notebook you will need to install the following via pip:

- `py2neo`: A library for working with Neo4j in Python.
- `gqvis`: Our simple tool for visualising graph queries in Jupyter.
- `flair`: A deep learning library for natural language processing. Note that this library is quite large (roughly a couple of GB). If you don't wish to install it, we have provided alternatives that do not rely on deep learning, so you can still follow along.

You will also need to have Neo4j installed for the last part of the tutorial. You can download and install Neo4j Desktop [here](https://neo4j.com/).

We will be running through the code during the tutorial so there is no need to install anything unless you would also like to try the code out yourself and run some graph queries.



In [None]:
!pip install py2neo
!pip install gqvis
!pip install flair

# 1. Read in the data

Here is a description of the datasets we are working with in this notebook.

First of all, the datasets for the NER model:

- `ner_dataset/train.txt`: The dataset we will use to *train* the NER model to predict the entities appearing in each work order.
- `ner_dataset/dev.txt`: The dataset we will use to *validate* the quality of the model during training.
- `ner_dataset/test.txt`: The dataset we will use to *evaluate* the final performance of the NER model after training.

We also have three datasets for the Relation Extraction (RE) model:

- `re_dataset/train.csv`
- `re_dataset/dev.csv`
- `re_dataset/test.csv`

We are going to be building a knowledge graph on a small sample set of work orders. This will not be seen by the NER or RE models prior to constructing the graph - the idea is to get our models to run *inference* over this dataset to automatically predict the entities, and relationships between the entities, to build a graph.

- `sample_work_orders.csv`: A csv file containing a set of work orders.

Here is an example of what the first few rows of each dataset look like:

![alt text](images/example-data.png "Example datasets")

We are using Python's built-in `csv` module to read in the data, though this can also be done using `pandas`.

## 1.1. Inspecting the data

Let's start by inspecting the `sample_work_orders.csv` CSV dataset. This is the dataset we will be building the graph from.

In [74]:
from csv import DictReader

work_order_file = "data/sample_work_orders.csv"

# A simple function to read in a csv file and return a list,
# where each element in the list is a dictionary of {heading : value}
def load_csv(filename):
    data = []
    with open(filename, 'r') as f:
        reader = DictReader(f)
        for row in reader:
            data.append(row)
    return data

        
work_order_data = load_csv(work_order_file)

# Let's have a look at the first 10 rows
for row in work_order_data[:10]:
    print(row)

    


OrderedDict([('StartDate', '10/07/2005'), ('FLOC', '1234.1.1'), ('ShortText', 'repair cracked hyd tank')])
OrderedDict([('StartDate', '14/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine wont start')])
OrderedDict([('StartDate', '17/07/2005'), ('FLOC', '1234.1.3'), ('ShortText', 'a/c blowing hot air')])
OrderedDict([('StartDate', '20/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engin u/s')])
OrderedDict([('StartDate', '21/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'fix engine')])
OrderedDict([('StartDate', '22/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'pump service')])
OrderedDict([('StartDate', '23/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'pump leak')])
OrderedDict([('StartDate', '24/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'fix leak on pump')])
OrderedDict([('StartDate', '25/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine not running')])
OrderedDict([('StartDate', '26/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine has problems starting')])
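
For comparison, here is a minimal sketch of the `pandas` route mentioned earlier. It assumes `pandas` is installed (it is not in the list of required packages) and reads two rows copied from the output above out of an in-memory string rather than from `data/sample_work_orders.csv`, so it runs without the data folder.

```python
import io

import pandas as pd

# Two rows copied from the output above, standing in for data/sample_work_orders.csv.
sample_csv = io.StringIO(
    "StartDate,FLOC,ShortText\n"
    "10/07/2005,1234.1.1,repair cracked hyd tank\n"
    "14/07/2005,1234.1.2,engine wont start\n"
)

# dtype=str keeps every field as a string, matching what DictReader returns.
df = pd.read_csv(sample_csv, dtype=str)

# One dictionary per row, in the same {heading: value} shape as load_csv().
records = df.to_dict("records")
print(records[0])
```

Passing a filename instead of the `StringIO` object should give the same list-of-dictionaries structure for the full file.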

# 2. Cleaning the data via Lexical Normalisation

Before we start extracting entities from the short text, it's a good idea to do some text cleaning, i.e. "lexical normalisation". This is important as we would prefer to have a single node for a single concept, e.g. one node for "engine" as opposed to two nodes for "engin" and "engine". We can also take this opportunity to normalise different variations of the same failure mode ("overheating", "blowing hot air", etc.) to a single failure mode, "overheating".

In the interest of time/simplicity we are not going to use a neural model here, but instead we will use a simple lexicon-based normaliser. This model will simply replace a misspelled phrase with its correct form. This is not practical in the real world (as there's no way we could possibly build a lexicon of all possible misspellings) but it is good enough for our small example.



In [56]:
""" A lexicon-based normaliser. Normalises sentences by replacing any ngrams 
(sequences of 1 or more words) with their replacement as per a predefined
lexicon."""

import itertools

class LexiconNormaliser:
    """ A lexicon-based normaliser.
    
    Args:
        lexicon_file: The filename of the lexicon.
        max_ngram_size: The largest number of words in an ngram (default 3).
    
    """
    def __init__(self, lexicon_file, max_ngram_size = 3):
        
        lexicon_data = load_csv(lexicon_file)
        self.max_ngram_size = max_ngram_size
        
        # Convert the loaded csv into a dictionary mapping incorrect form -> correct form
        self.lexicon = {}
        for row in lexicon_data:
            self.lexicon[row["key"]] = row["value"]      
    
    def normalise(self, sentence: str):
        """ 
            Normalise the given sentence via the lexicon.
            
            Args:
                sentence(str): The sentence to normalise.
            
            Returns:
                str: The normalised sentence.
        """
        words = sentence.split()
        ngrams = self._get_ngrams(words)
        
        # Reversing ngrams ensures the larger ngrams are normalised first.
        for ngram in reversed(ngrams):
            if ngram in self.lexicon:
                sentence = sentence.replace(ngram, self.lexicon[ngram])
        
        return sentence
    
    
    def _get_ngrams(self, sentence):        
        """
            Given a sentence, return a list of all combinations of ngrams
            up to a certain size.
            
            Args:
                sentence: A list of words, e.g. ["fix", "broken", "pump"].
                
            Returns:
                ngrams: A list of ngrams containing up to max_ngram_size words.
                        For example, given the input ["fix", "broken", "pump"],
                        return ["fix", "broken", "pump", "fix broken", "broken pump", "fix broken pump"] 
        
        """
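        ngrams = []
        for n in range(1, self.max_ngram_size + 1):
            # Slide a window of n consecutive words across the sentence, so that
            # only contiguous ngrams are produced (as described in the docstring).
            for start in range(len(sentence) - n + 1):
                ngrams.append(" ".join(sentence[start:start + n]))
        return ngrams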
    


Now that the LexiconNormaliser has been defined, let's run it over all of the ShortText fields in our dataset.

In [75]:
lexicon_file = "data/lexicon_normalisation.csv"
lexicon_normaliser = LexiconNormaliser(lexicon_file)

work_order_data = load_csv(work_order_file)

for i, row in enumerate(work_order_data):
    before = row['ShortText']    
    row['ShortText'] = lexicon_normaliser.normalise(row['ShortText'])
    
    # Let's print the first 6 to have a look at the difference
    if i <= 5:
        print(before)
        print(row['ShortText'])
        print()
    

repair cracked hyd tank
repair cracked hydraulic tank

engine wont start
engine failure to start

a/c blowing hot air
air conditioner overheating

engin u/s
engine breakdown

fix engine
fix engine

pump service
pump service
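
One thing to be aware of: `normalise()` applies each match with `str.replace`, which works on substrings, so a lexicon key such as "engin" would also match inside the correctly spelled word "engine". Below is a minimal sketch of a token-level alternative that only replaces whole words or whole phrases. The function name `normalise_tokens` and the lexicon entries are hypothetical (the entries are inferred from the output above, not read from `data/lexicon_normalisation.csv`).

```python
def normalise_tokens(sentence, lexicon, max_ngram_size=3):
    """Replace lexicon phrases token by token, longest match first."""
    words = sentence.split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest window first so "blowing hot air" wins over shorter matches.
        for n in range(min(max_ngram_size, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lexicon:
                output.append(lexicon[phrase])
                i += n
                break
        else:
            # No lexicon entry starts at this word, so keep it unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


# Hypothetical lexicon entries, inferred from the output above.
lexicon = {
    "engin": "engine",
    "u/s": "breakdown",
    "a/c": "air conditioner",
    "blowing hot air": "overheating",
}

print(normalise_tokens("engin u/s", lexicon))
print(normalise_tokens("a/c blowing hot air", lexicon))

# Substring replacement corrupts the correctly spelled second word ...
print("engin engine".replace("engin", "engine"))
# ... whereas token-level replacement leaves it alone.
print(normalise_tokens("engin engine", lexicon))
```

Unlike the class above, this sketch makes a single left-to-right pass and does not re-normalise text that has already been replaced.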



# 3. Named Entity Recognition

Our first task is to extract the entities in the short text descriptions and construct nodes from those entities. This is how we are able to unlock the knowledge captured within the short text and combine it with the structured fields.

![alt text](images/extracting-entities-v2.png "Extracting entities")

## 3.1. Loading and inspecting the data

Let's start by defining some functions for loading the CONLL-formatted data. The CONLL format is a widely used format for Named Entity Recognition, and looks like this:

    Michael B-PER
    works O
    at O
    The B-ORG
    University I-ORG
    of I-ORG
    Western I-ORG
    Australia I-ORG
    
It's a bit tricky to work with it in this format, so we are going to define some functions to parse it into something like this:

    { 'tokens': ['Michael', 'works', 'at', 'The', 'University', 'of', 'Western', 'Australia'],
      'labels': ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG'] }
      
Note that many NLP libraries also have this functionality (NLTK for example) - but we will do it in pure Python in the interest of keeping our dependencies minimal.

In [80]:
import os

def to_conll_document(s: str):
    """Parse a CONLL-formatted document into a dictionary of
    tokens and labels.

    Args:
        s (str): A string, separated by newlines, where each
        line is a token, then a space, then a label.

    Returns:
        dict: A dict of tokens and labels.
    """
    tokens, labels = [], []
    for line in s.split("\n"):
        if len(line.strip()) == 0:
            continue
        token, label = line.split()

        tokens.append(token)
        labels.append(label)
    return {'tokens': tokens, 'labels': labels}


def load_conll_dataset(filename: str) -> list:
    """Load a list of documents from the given CONLL-formatted dataset.

    Args:
        filename (str): The filename to load from.

    Returns:
        list: A list of documents, where each document is a dict of tokens and labels.
    """
    documents = []
    with open(filename, "r") as f:
        docs = f.read().split("\n\n")
        for d in docs:
            if len(d) == 0:
                continue
            document = to_conll_document(d)
            documents.append(document)
    print(f"Loaded {len(documents)} documents from {filename}.")
    return documents



Let's take a quick look at the first row of our training dataset to make sure it loads OK:

In [62]:
NER_DATASET_PATH = "data/ner_dataset"
train_dataset = load_conll_dataset(os.path.join(NER_DATASET_PATH, 'train.txt'))

print(train_dataset[0])

Loaded 3200 documents from data/ner_dataset\train.txt.
{'tokens': ['ram', 'on', 'cup', 'rod', 'support', 'broken'], 'labels': ['B-Item', 'B-Location', 'B-Item', 'B-Item', 'I-Item', 'B-Observation']}


## 3.2. Define an abstract base class for NER Models

Seeing as we would like to be able to work with a range of NER models, it's a good idea to create an 'abstract base class' to represent an NER model. This way, we can create classes for our NER models that inherit from this base class. Every model we create must have these three functions:

- `train`: Train the model on the datasets in the given path.
- `inference`: Run inference over the given sentence.
- `load`: Load the model from the given path.

If we try to instantiate an NER model class that is missing one of these methods, Python will raise a `TypeError`.

In [63]:
""" Abstract base class for the NER Model. """

from abc import ABC, abstractmethod


class NERModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def train(self, datasets_path: str):
        pass

    @abstractmethod
    def inference(self, sent: list):
        pass

    @abstractmethod
    def load(self, model_path):
        pass
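
To see the abstract base class doing its job, here is a small self-contained sketch. It repeats a trimmed copy of `NERModel` so that it runs on its own, and `IncompleteNERModel` is a hypothetical class that deliberately leaves out `load()`. Note that the error appears when the class is instantiated, not when it is defined.

```python
from abc import ABC, abstractmethod


class NERModel(ABC):
    """Trimmed copy of the base class, with the same three abstract methods."""

    @abstractmethod
    def train(self, datasets_path: str): ...

    @abstractmethod
    def inference(self, sent: list): ...

    @abstractmethod
    def load(self, model_path): ...


class IncompleteNERModel(NERModel):
    """Hypothetical model that forgets to implement load()."""

    def train(self, datasets_path: str):
        pass

    def inference(self, sent: list):
        return {"tokens": sent, "labels": ["O"] * len(sent)}


try:
    IncompleteNERModel()
    error_message = None
except TypeError as err:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    error_message = str(err)

print(error_message)
```

The exact wording of the message varies between Python versions, but it names the missing method.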


## 3.3. Define our NER models

### 3.3.1. Flair-based NER Model

In this tutorial we will use [Flair](https://github.com/flairNLP/flair), which simplifies the process of building a deep learning model for a variety of NLP tasks.

The code below is a class representing a `FlairNERModel`, which is based on the `NERModel` class above. It has the same three methods, i.e. `train()`, `inference()`, and `load()`.

In [65]:
"""A Flair-based Named Entity Recognition model. Learns to predict entity
classes via deep learning."""


# TODO: Tidy up, fix this code as it does not work atm in this notebook


import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from typing import List
from flair.visual.training_curves import Plotter
import torch


HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairNERModel(NERModel):

    """A Flair-based Named Entity Recognition model.
    """

    model_name: str = "Flair"

    def __init__(self):
        super(FlairNERModel, self).__init__()

        self.model = None

    def train(self, datasets_path: str, trained_model_path: str):
        """ Train the Flair model on the given conll datasets.

        Args:
            datasets_path (str): The folder containing the
              train, dev and test CONLL-formatted datasets.
            trained_model_path (str): The folder to save the trained
              model to.
        """

        columns = {0: "text", 1: "ner"}
        corpus: Corpus = ColumnCorpus(
            datasets_path,
            columns,
            train_file="train.txt",
            dev_file="dev.txt",
            test_file="test.txt",
        )
        label_dict = corpus.make_label_dictionary(label_type="ner")

        # Train the sequence tagger
        embedding_types = [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]

        embeddings = StackedEmbeddings(embeddings=embedding_types)

        tagger = SequenceTagger(
            hidden_size=HIDDEN_SIZE,
            embeddings=embeddings,
            tag_dictionary=label_dict,
            tag_type="ner",
            use_crf=True,
        )

        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=10,
            embeddings_storage_mode=sm,
        )

        plotter = Plotter()
        plotter.plot_weights(os.path.join(trained_model_path, "weights.txt"))

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def inference(self, sent: list) -> dict:
        """Run inference on a single tokenised sentence.

        Args:
            sent (list): The sentence (list of words).

        Returns:
            dict: The tagged sentence now in the form of {'tokens': [list],
                'labels': [list]}.

        Raises:
            ValueError: If the model has not yet been trained.
        """
        if self.model is None:
            raise ValueError(
                "The NER Model has not yet been trained. "
                "Please train/load this Flair model before proceeding."
            )
        
        # Use the `sent` argument rather than relying on a global variable.
        sentence_obj = Sentence(sent, use_tokenizer=False)
        self.model.predict(sentence_obj)
        labels = ["O"] * len(sent)

        for entity in sentence_obj.get_spans("ner"):
            for i, token in enumerate(entity):
                label = entity.get_label("ner").value
                prefix = "B-" if i == 0 else "I-"
                
                # Token idx starts from 1 in Flair.
                labels[token.idx - 1] = prefix + label

        return { 'tokens': sent, 'labels': labels }

    def load(self, model_path: str):
        """Load the model from the specified path.

        Args:
            model_path (str): The path to load.

        Raises:
            ValueError: If the path does not exist i.e. model not yet trained.
        """
        self.model = SequenceTagger.load(model_path)

Device: cuda:0


### 3.3.2. Dictionary-based NER model

If you are not able to use the Flair library, here is a simple model you can use to extract the entities, albeit with much weaker performance. This one scans the training data, builds a mapping between each phrase (one or more tokens in a row) and the most common entity type associated with that phrase, then uses that entity type as the prediction when it sees that phrase in the test data.

The model is super simple, so we won't show the code here, but feel free to have a look under `helpers/DictionaryNERModel.py` if you are interested.
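
For readers without the repository to hand, here is a rough sketch of the idea described above. It is not the code in `helpers/DictionaryNERModel.py`. The function names, the greedy longest-match lookup and the `max_len` parameter are all assumptions made for illustration, and the second training document is made up.

```python
from collections import Counter, defaultdict


def bio_to_spans(tokens, labels):
    """Group BIO labels into (phrase, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current and label[2:] == current_type:
            # Continuation of the entity that is currently open.
            current.append(token)
            continue
        if current:
            spans.append((" ".join(current), current_type))
        if label == "O":
            current, current_type = [], None
        else:
            current, current_type = [token], label[2:]
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def build_dictionary(documents):
    """Map each phrase to the entity type it was labelled with most often."""
    counts = defaultdict(Counter)
    for doc in documents:
        for phrase, entity_type in bio_to_spans(doc["tokens"], doc["labels"]):
            counts[phrase][entity_type] += 1
    return {phrase: c.most_common(1)[0][0] for phrase, c in counts.items()}


def predict(tokens, dictionary, max_len=3):
    """Label tokens by greedy longest-match lookup in the dictionary."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            entity_type = dictionary.get(" ".join(tokens[i:i + n]))
            if entity_type is not None:
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                break
        else:
            # Unknown token: leave it labelled "O" and move on.
            i += 1
    return {"tokens": tokens, "labels": labels}


# The first training document shown earlier, plus one made-up document.
train_docs = [
    {"tokens": ["ram", "on", "cup", "rod", "support", "broken"],
     "labels": ["B-Item", "B-Location", "B-Item", "B-Item", "I-Item", "B-Observation"]},
    {"tokens": ["replace", "rod", "support"],
     "labels": ["B-Activity", "B-Item", "I-Item"]},
]
dictionary = build_dictionary(train_docs)
result = predict(["rod", "support", "broken", "again"], dictionary)
print(result["labels"])
```

Because the lookup is purely lexical, any phrase that never appeared in the training data (such as "again" here) is left as `O`, which is one reason this approach is much weaker than a learned model.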

In [7]:
from helpers import DictionaryNERModel

## 3.4. Training the model

Depending on whether you are using Flair or the DictionaryNERModel, you can run one of the cells below.



### 3.4.1. Using Flair

We have trained the Flair-based model and uploaded it to Hugging Face. The cell below trains the model from scratch, which is what produced the log underneath. If you would rather not train it yourself, comment out the `train()` call and uncomment the `load()` call, which is intended to download the pretrained weights instead.

In [70]:
flair_ner_model = FlairNERModel()
flair_ner_model.train(NER_DATASET_PATH, 'models/ner_models/flair') # Comment out to skip training
#flair_ner_model.load("nlp-tlp/mwo-ner-test") # TODO: Replace with load_pretrained

2022-11-23 13:42:17,887 Reading data from data\ner_dataset
2022-11-23 13:42:17,888 Train: data\ner_dataset\train.txt
2022-11-23 13:42:17,889 Dev: data\ner_dataset\dev.txt
2022-11-23 13:42:17,890 Test: data\ner_dataset\test.txt
2022-11-23 13:42:18,546 Computing label dictionary. Progress:


3200it [00:00, 42659.99it/s]


2022-11-23 13:42:18,626 Dictionary created for label 'ner' with 12 values: Item (seen 4590 times), Activity (seen 1952 times), Observation (seen 1574 times), Location (seen 957 times), Consumable (seen 308 times), Agent (seen 191 times), Specifier (seen 122 times), Cardinality (seen 114 times), Attribute (seen 80 times), Time (seen 70 times), Event (seen 5 times)
2022-11-23 13:42:19,003 SequenceTagger predicts: Dictionary with 45 tags: O, S-Item, B-Item, E-Item, I-Item, S-Activity, B-Activity, E-Activity, I-Activity, S-Observation, B-Observation, E-Observation, I-Observation, S-Location, B-Location, E-Location, I-Location, S-Consumable, B-Consumable, E-Consumable, I-Consumable, S-Agent, B-Agent, E-Agent, I-Agent, S-Specifier, B-Specifier, E-Specifier, I-Specifier, S-Cardinality, B-Cardinality, E-Cardinality, I-Cardinality, S-Attribute, B-Attribute, E-Attribute, I-Attribute, S-Time, B-Time, E-Time, I-Time, S-Event, B-Event, E-Event, I-Event


  "There should be no best model saved at epoch 1 except there "


2022-11-23 13:42:19,164 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:19,166 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_feature

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 12.10it/s]


2022-11-23 13:42:27,809 Evaluating as a multi-label problem: False


  _warn_prf(average, modifier, msg_start, len(result))


2022-11-23 13:42:27,822 DEV : loss 1.2472093105316162 - f1-score (micro avg)  0.6347
2022-11-23 13:42:27,830 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:27,832 saving best model
2022-11-23 13:42:28,530 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:28,763 epoch 2 - iter 10/100 - loss 1.43692249 - samples/sec: 1397.40 - lr: 0.100000
2022-11-23 13:42:29,008 epoch 2 - iter 20/100 - loss 1.37321242 - samples/sec: 1316.87 - lr: 0.100000
2022-11-23 13:42:29,236 epoch 2 - iter 30/100 - loss 1.34296771 - samples/sec: 1422.22 - lr: 0.100000
2022-11-23 13:42:29,472 epoch 2 - iter 40/100 - loss 1.35099978 - samples/sec: 1367.53 - lr: 0.100000
2022-11-23 13:42:29,701 epoch 2 - iter 50/100 - loss 1.33625601 - samples/sec: 1403.50 - lr: 0.100000
2022-11-23 13:42:29,945 epoch 2 - iter 60/100 - loss 1.32141391 - samples/sec: 1322.32 - lr: 0.100000
2022-11-23 13:42:30,185 epoch 2 - iter 70/100 - loss 1.30116132 - samples/sec: 13

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30.45it/s]


2022-11-23 13:42:31,316 Evaluating as a multi-label problem: False
2022-11-23 13:42:31,329 DEV : loss 0.8964381814002991 - f1-score (micro avg)  0.7031
2022-11-23 13:42:31,337 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:31,338 saving best model
2022-11-23 13:42:32,056 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:32,291 epoch 3 - iter 10/100 - loss 1.01690784 - samples/sec: 1380.80 - lr: 0.100000
2022-11-23 13:42:32,529 epoch 3 - iter 20/100 - loss 1.07365253 - samples/sec: 1362.13 - lr: 0.100000
2022-11-23 13:42:32,766 epoch 3 - iter 30/100 - loss 1.08271915 - samples/sec: 1361.60 - lr: 0.100000
2022-11-23 13:42:32,993 epoch 3 - iter 40/100 - loss 1.05695919 - samples/sec: 1422.57 - lr: 0.100000
2022-11-23 13:42:33,232 epoch 3 - iter 50/100 - loss 1.05866040 - samples/sec: 1350.06 - lr: 0.100000
2022-11-23 13:42:33,461 epoch 3 - iter 60/100 - loss 1.04329547 - samples/sec: 1403.36 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30.23it/s]


2022-11-23 13:42:34,855 Evaluating as a multi-label problem: False
2022-11-23 13:42:34,867 DEV : loss 0.7291697859764099 - f1-score (micro avg)  0.7483
2022-11-23 13:42:34,875 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:34,877 saving best model
2022-11-23 13:42:35,584 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:35,817 epoch 4 - iter 10/100 - loss 0.96283753 - samples/sec: 1409.70 - lr: 0.100000
2022-11-23 13:42:36,068 epoch 4 - iter 20/100 - loss 0.97040872 - samples/sec: 1292.85 - lr: 0.100000
2022-11-23 13:42:36,327 epoch 4 - iter 30/100 - loss 0.94653485 - samples/sec: 1249.95 - lr: 0.100000
2022-11-23 13:42:36,555 epoch 4 - iter 40/100 - loss 0.92465933 - samples/sec: 1409.69 - lr: 0.100000
2022-11-23 13:42:36,786 epoch 4 - iter 50/100 - loss 0.89689591 - samples/sec: 1403.52 - lr: 0.100000
2022-11-23 13:42:37,018 epoch 4 - iter 60/100 - loss 0.88834573 - samples/sec: 1391.30 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30.59it/s]


2022-11-23 13:42:38,393 Evaluating as a multi-label problem: False


  _warn_prf(average, modifier, msg_start, len(result))


2022-11-23 13:42:38,406 DEV : loss 0.6760690212249756 - f1-score (micro avg)  0.769
2022-11-23 13:42:38,414 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:38,416 saving best model
2022-11-23 13:42:39,120 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:39,348 epoch 5 - iter 10/100 - loss 0.76627534 - samples/sec: 1428.57 - lr: 0.100000
2022-11-23 13:42:39,596 epoch 5 - iter 20/100 - loss 0.78489846 - samples/sec: 1300.82 - lr: 0.100000
2022-11-23 13:42:39,838 epoch 5 - iter 30/100 - loss 0.79016919 - samples/sec: 1333.33 - lr: 0.100000
2022-11-23 13:42:40,081 epoch 5 - iter 40/100 - loss 0.79636929 - samples/sec: 1322.32 - lr: 0.100000
2022-11-23 13:42:40,315 epoch 5 - iter 50/100 - loss 0.78293260 - samples/sec: 1373.39 - lr: 0.100000
2022-11-23 13:42:40,555 epoch 5 - iter 60/100 - loss 0.78318672 - samples/sec: 1338.93 - lr: 0.100000
2022-11-23 13:42:40,801 epoch 5 - iter 70/100 - loss 0.77112929 - samples/sec: 131

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 29.75it/s]


2022-11-23 13:42:41,947 Evaluating as a multi-label problem: False
2022-11-23 13:42:41,959 DEV : loss 0.6060569286346436 - f1-score (micro avg)  0.7787
2022-11-23 13:42:41,968 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:41,970 saving best model
2022-11-23 13:42:42,685 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:42,917 epoch 6 - iter 10/100 - loss 0.70583337 - samples/sec: 1403.53 - lr: 0.100000
2022-11-23 13:42:43,148 epoch 6 - iter 20/100 - loss 0.75519577 - samples/sec: 1403.52 - lr: 0.100000
2022-11-23 13:42:43,389 epoch 6 - iter 30/100 - loss 0.74832640 - samples/sec: 1333.33 - lr: 0.100000
2022-11-23 13:42:43,635 epoch 6 - iter 40/100 - loss 0.73161762 - samples/sec: 1311.48 - lr: 0.100000
2022-11-23 13:42:43,875 epoch 6 - iter 50/100 - loss 0.71630152 - samples/sec: 1338.92 - lr: 0.100000
2022-11-23 13:42:44,104 epoch 6 - iter 60/100 - loss 0.71072069 - samples/sec: 1415.94 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30.02it/s]


2022-11-23 13:42:45,486 Evaluating as a multi-label problem: False
2022-11-23 13:42:45,499 DEV : loss 0.6382891535758972 - f1-score (micro avg)  0.7716
2022-11-23 13:42:45,507 BAD EPOCHS (no improvement): 1
2022-11-23 13:42:45,509 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:45,745 epoch 7 - iter 10/100 - loss 0.61624884 - samples/sec: 1373.39 - lr: 0.100000
2022-11-23 13:42:45,980 epoch 7 - iter 20/100 - loss 0.64597210 - samples/sec: 1367.52 - lr: 0.100000
2022-11-23 13:42:46,213 epoch 7 - iter 30/100 - loss 0.61691749 - samples/sec: 1385.29 - lr: 0.100000
2022-11-23 13:42:46,448 epoch 7 - iter 40/100 - loss 0.63591151 - samples/sec: 1373.40 - lr: 0.100000
2022-11-23 13:42:46,673 epoch 7 - iter 50/100 - loss 0.64754442 - samples/sec: 1434.98 - lr: 0.100000
2022-11-23 13:42:46,915 epoch 7 - iter 60/100 - loss 0.65196755 - samples/sec: 1327.81 - lr: 0.100000
2022-11-23 13:42:47,148 epoch 7 - iter 70/100 - loss 0.6

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 29.95it/s]


2022-11-23 13:42:48,447 Evaluating as a multi-label problem: False
2022-11-23 13:42:48,459 DEV : loss 0.6019043326377869 - f1-score (micro avg)  0.78
2022-11-23 13:42:48,467 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:48,469 saving best model
2022-11-23 13:42:49,180 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:49,418 epoch 8 - iter 10/100 - loss 0.61074184 - samples/sec: 1361.76 - lr: 0.100000
2022-11-23 13:42:49,649 epoch 8 - iter 20/100 - loss 0.63358930 - samples/sec: 1391.32 - lr: 0.100000
2022-11-23 13:42:49,876 epoch 8 - iter 30/100 - loss 0.63715997 - samples/sec: 1415.95 - lr: 0.100000
2022-11-23 13:42:50,119 epoch 8 - iter 40/100 - loss 0.61378976 - samples/sec: 1324.63 - lr: 0.100000
2022-11-23 13:42:50,369 epoch 8 - iter 50/100 - loss 0.62808950 - samples/sec: 1297.53 - lr: 0.100000
2022-11-23 13:42:50,602 epoch 8 - iter 60/100 - loss 0.62822052 - samples/sec: 1385.29 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 29.95it/s]


2022-11-23 13:42:52,030 Evaluating as a multi-label problem: False
2022-11-23 13:42:52,043 DEV : loss 0.5799474716186523 - f1-score (micro avg)  0.7824
2022-11-23 13:42:52,050 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:52,052 saving best model
2022-11-23 13:42:52,762 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:52,998 epoch 9 - iter 10/100 - loss 0.58800616 - samples/sec: 1381.63 - lr: 0.100000
2022-11-23 13:42:53,256 epoch 9 - iter 20/100 - loss 0.59572057 - samples/sec: 1249.99 - lr: 0.100000
2022-11-23 13:42:53,486 epoch 9 - iter 30/100 - loss 0.60261734 - samples/sec: 1409.87 - lr: 0.100000
2022-11-23 13:42:53,732 epoch 9 - iter 40/100 - loss 0.59319719 - samples/sec: 1315.78 - lr: 0.100000
2022-11-23 13:42:53,968 epoch 9 - iter 50/100 - loss 0.58632868 - samples/sec: 1372.56 - lr: 0.100000
2022-11-23 13:42:54,216 epoch 9 - iter 60/100 - loss 0.58270096 - samples/sec: 1301.32 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 29.61it/s]


2022-11-23 13:42:55,618 Evaluating as a multi-label problem: False
2022-11-23 13:42:55,630 DEV : loss 0.5421388149261475 - f1-score (micro avg)  0.8009
2022-11-23 13:42:55,638 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:55,640 saving best model
2022-11-23 13:42:56,354 ----------------------------------------------------------------------------------------------------
2022-11-23 13:42:56,601 epoch 10 - iter 10/100 - loss 0.57466440 - samples/sec: 1316.80 - lr: 0.100000
2022-11-23 13:42:56,830 epoch 10 - iter 20/100 - loss 0.58684094 - samples/sec: 1421.82 - lr: 0.100000
2022-11-23 13:42:57,070 epoch 10 - iter 30/100 - loss 0.56883227 - samples/sec: 1350.36 - lr: 0.100000
2022-11-23 13:42:57,319 epoch 10 - iter 40/100 - loss 0.55401358 - samples/sec: 1306.18 - lr: 0.100000
2022-11-23 13:42:57,560 epoch 10 - iter 50/100 - loss 0.54733924 - samples/sec: 1349.59 - lr: 0.100000
2022-11-23 13:42:57,812 epoch 10 - iter 60/100 - loss 0.55085324 - samples/sec: 1285.07 - lr: 0.100000

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30.09it/s]


2022-11-23 13:42:59,224 Evaluating as a multi-label problem: False
2022-11-23 13:42:59,236 DEV : loss 0.5560851693153381 - f1-score (micro avg)  0.8116
2022-11-23 13:42:59,243 BAD EPOCHS (no improvement): 0
2022-11-23 13:42:59,245 saving best model
2022-11-23 13:43:00,707 ----------------------------------------------------------------------------------------------------
2022-11-23 13:43:00,709 loading file models\ner_models\flair\best-model.pt
2022-11-23 13:43:01,113 SequenceTagger predicts: Dictionary with 47 tags: O, S-Item, B-Item, E-Item, I-Item, S-Activity, B-Activity, E-Activity, I-Activity, S-Observation, B-Observation, E-Observation, I-Observation, S-Location, B-Location, E-Location, I-Location, S-Consumable, B-Consumable, E-Consumable, I-Consumable, S-Agent, B-Agent, E-Agent, I-Agent, S-Specifier, B-Specifier, E-Specifier, I-Specifier, S-Cardinality, B-Cardinality, E-Cardinality, I-Cardinality, S-Attribute, B-Attribute, E-Attribute, I-Attribute, S-Time, B-Time, E-Time, I-Time

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 12.35it/s]


2022-11-23 13:43:02,369 Evaluating as a multi-label problem: False
2022-11-23 13:43:02,381 0.7524	0.8009	0.7759	0.6451
2022-11-23 13:43:02,383 
Results:
- F-score (micro) 0.7759
- F-score (macro) 0.78
- Accuracy 0.6451

By class:
              precision    recall  f1-score   support

        Item     0.6919    0.7828    0.7346       548
 Observation     0.7445    0.7578    0.7511       223
    Activity     0.8744    0.8704    0.8724       216
    Location     0.8014    0.8014    0.8014       141
  Consumable     0.8444    0.8085    0.8261        47
       Agent     0.8095    1.0000    0.8947        34
 Cardinality     0.8500    0.9444    0.8947        18
        Time     1.0000    0.9474    0.9730        19
   Attribute     0.1429    0.1429    0.1429        14
   Specifier     0.9091    0.9091    0.9091        11

   micro avg     0.7524    0.8009    0.7759      1271
   macro avg     0.7668    0.7965    0.7800      1271
weighted avg     0.7558    0.8009    0.7768      1271

2022-11-23 
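
As a quick check on how the summary figures in the report above relate to one another, the sketch below recomputes the micro and macro F-scores from the printed numbers. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, and macro F1 as the unweighted mean of the per-class F1 scores); the variable names are our own.

```python
# Micro-averaged precision and recall, copied from the report above.
micro_precision = 0.7524
micro_recall = 0.8009

# F1 is the harmonic mean of precision and recall.
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Per-class F1 scores, copied from the "By class" table above.
per_class_f1 = {
    "Item": 0.7346,
    "Observation": 0.7511,
    "Activity": 0.8724,
    "Location": 0.8014,
    "Consumable": 0.8261,
    "Agent": 0.8947,
    "Cardinality": 0.8947,
    "Time": 0.9730,
    "Attribute": 0.1429,
    "Specifier": 0.9091,
}

# The macro average gives every class the same weight, regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(round(micro_f1, 4), round(macro_f1, 4))
```

Because the macro average weights every class equally, the 14 `Attribute` mentions (F1 of 0.1429) affect it just as much as the 548 `Item` mentions do.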

### 3.4.2. Using the DictionaryNERModel

In [71]:
from helpers import DictionaryNERModel

NER_DATASET_PATH = "data/ner_dataset"

dictionary_ner_model = DictionaryNERModel()
dictionary_ner_model.train(NER_DATASET_PATH, 'models/ner_models/dictionary')

Building dictionary...
Loaded 3200 documents from data/ner_dataset\train.txt.
Loaded 401 documents from data/ner_dataset\dev.txt.


## 3.5. Running inference on unseen sentences

The next step is to use our trained model to predict the entities appearing in a list of previously unseen work orders.

In [77]:
tagged_bio_sents = []

for row in work_order_data:
    sentence = row["ShortText"].split()  # We must 'tokenise' the sentence first, i.e. split into words
    tagged_sent = flair_ner_model.inference(sentence)  # replace 'flair' with 'dictionary' if not using flair
    tagged_bio_sents.append(tagged_sent)

# Print an example tagged sentence
print(tagged_bio_sents[12])

{'tokens': ['air', 'conditioner', 'breakdown'], 'labels': ['B-Item', 'I-Item', 'B-Observation']}


# 4. Extracting relations between the entities via Relation Extraction

We have extracted the entities appearing in each work order. The next step is to extract the relationships between those entities. We can do this using Relation Extraction.

![alt text](images/building-relations.png "Building relations")

## 4.1. Loading and inspecting the data

Let's take a look again at the RE dataset we are working with.

In [10]:
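import os
import csv
import json

RE_DATASET_PATH = "data/re_dataset"


def load_re_dataset(filename: str) -> list:
    """Load the Relation Extraction dataset into a list.

    Args:
        filename (str): The name of the file to load.

    Returns:
        list: A list of rows, where each row is a list of column values.
    """
    # Use the csv module rather than splitting on commas, so that any
    # quoted fields containing commas are parsed as a single column.
    with open(filename, 'r', newline='') as f:
        return [row for row in csv.reader(f)]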

train_dataset = load_re_dataset(os.path.join(RE_DATASET_PATH, 'train.csv'))

# Let's take a quick look just to make sure it loads as expected...
for row in train_dataset[:3]:
    print(row)


['broken', 'rod support', 'Observation', 'Item', 'rod support broken', '0', '1', 'O']
['rod support', 'broken', 'Item', 'Observation', 'rod support broken', '1', '0', 'HAS_OBSERVATION']
['broken', 'cup', 'Observation', 'Item', 'cup rod support broken', '0', '2', 'O']


We can interpret this as follows:
 - 'broken': entity 1
 - 'rod support': entity 2
 - 'Observation': label of entity 1
 - 'Item': label of entity 2
 - 'rod support broken': The text between 'broken' and 'rod support', inclusive
 - '0': The mention index of entity 1
 - '1': The mention index of entity 2
 - 'O': The relation type. "O" means no relation.
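
To make these eight positions easier to work with, one option is to wrap each row in a named structure. The sketch below is only illustrative: `RERow` and `parse_re_row` are hypothetical names that do not appear in the notebook or its helpers.

```python
from typing import List, NamedTuple


class RERow(NamedTuple):
    """One row of the RE dataset, with a name for each column."""
    entity_1: str
    entity_2: str
    label_1: str
    label_2: str
    text: str
    mention_idx_1: int
    mention_idx_2: int
    relation: str


def parse_re_row(row: List[str]) -> RERow:
    """Convert a raw 8-column row (all strings) into an RERow."""
    return RERow(
        entity_1=row[0],
        entity_2=row[1],
        label_1=row[2],
        label_2=row[3],
        text=row[4],
        mention_idx_1=int(row[5]),  # mention indexes are stored as strings
        mention_idx_2=int(row[6]),
        relation=row[7],
    )


example = parse_re_row(
    ['rod support', 'broken', 'Item', 'Observation',
     'rod support broken', '1', '0', 'HAS_OBSERVATION']
)
print(example.entity_1, example.label_1, example.relation)
```

The rest of this notebook keeps using plain lists, so this is purely a readability aid if you adapt the code.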

## 4.2. Define the Abstract Base Class

We are going to see two different RE models, so let's define an abstract base class again, as we did for the NER models. It has the same three functions:

- `inference`: Given a row (as above, but without the last column), predict the relation type ("O" if no relation).
- `train`: Train the model on the files in the given dataset path.
- `load`: Load the model from the specified path.

In [11]:
from abc import ABC, abstractmethod

class REModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def inference(self, row: list) -> str:
        pass        

    @abstractmethod
    def train(self, datasets_path: str, trained_model_path: str):
        pass

    @abstractmethod
    def load(self, model_path: str):
        pass

## 4.3. Define our RE model(s)

### 4.3.1. Flair-based RE model

In [12]:
""" A Flair-based relation extraction model.
This one uses Flair's TextClassifier model to classify the
relation type of a given row.
"""

import os
import json
from typing import List

import flair
from flair.trainers import ModelTrainer
from flair.datasets import CSVClassificationCorpus
from flair.embeddings import (
    PooledFlairEmbeddings,
    DocumentRNNEmbeddings,
)
from flair.data import Sentence
from flair.models import TextClassifier
from flair.visual.training_curves import Plotter

import torch

MAX_EPOCHS = 1
HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairREModel(REModel):

    """The Flair-based RE model."""

    model_name: str = "Flair"

    def __init__(self):
        super(FlairREModel, self).__init__()
        self.model = None

    def train(self, datasets_path: str, trained_model_path: str):
        """Train the Flair RE model on the given CSV datasets.

        Args:
            datasets_path (str): The path containing the train and dev
               datasets.
            trained_model_path (str): The path to save the trained model.
        """

        column_name_map = {
            0: "text",
            1: "text",
            2: "text",
            3: "text",
            4: "text",
            7: "label_relation",
        }

        # Define corpus, labels, word embeddings, doc embeddings
        corpus = CSVClassificationCorpus(
            datasets_path,
            column_name_map,
            delimiter=",",
            label_type="relation",
        )

        label_dict = corpus.make_label_dictionary(label_type="relation")

        word_embeddings = [
            PooledFlairEmbeddings("mix-forward"),
            PooledFlairEmbeddings("mix-backward"),
        ]

        document_embeddings = DocumentRNNEmbeddings(
            word_embeddings, hidden_size=HIDDEN_SIZE
        )

        # Initialise the text classifier
        tagger = TextClassifier(
            document_embeddings,
            label_dictionary=label_dict,
            label_type="relation",
        )

        # Initialize trainer
        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        
        # Start training
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=MAX_EPOCHS,
            patience=3,
            embeddings_storage_mode=sm,
        )

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def load(self, model_path: str):
        """Load a trained Flair TextClassifier from the given path.

        Args:
            model_path (str): The filename containing the model.
               Can also be the name of a repo on Huggingface.
        """
        # Store the loaded classifier so that inference() can use it.
        self.model = TextClassifier.load(model_path)

    def inference(self, row: list) -> str:
        """Run the inference over the given document.

        Args:
            row (list): The row to predict the relation of.

        Returns:
            str: The relation type.
        """
        
        # The first five columns (entities, labels and text) form the input text.
        s = Sentence(" ".join(row[:5]))
        label = "O"
        self.model.predict(s)
        if len(s.labels) > 0:
            label = str(s.labels[0].value)
        return label


Device: cuda:0


### 4.3.2. 'SimpleMWO' RE model

Because maintenance work orders are very short (typically 5-7 words), we can usually create a useful knowledge graph by simply linking each Item entity in the work order to every other entity in that work order. For example:

    replace pump
    
We can say the "pump" entity `HAS_ACTIVITY` "replace". Likewise for the following:

    fix air conditioner , not working
    
We can say that "air conditioner" `HAS_ACTIVITY` "fix", and `HAS_OBSERVATION` "not working".

This is not a foolproof method, though - it is a heuristic, i.e. a rule-based method designed to exploit a pattern in the data. For creating this specific type of knowledge graph, though, it works quite well, and thus we can define a model to use this heuristic as a weaker alternative to a deep learning model.

Just like the dictionary-based NER model, the model is super simple, so we won't show the code here, but feel free to have a look under `helpers/SimpleMWOREModel.py` if you are interested.

In [13]:
from helpers import SimpleMWOREModel

Here's an example output from the model. Note we have set the last column (i.e. the relation type) to `None`, as it is our model's job to predict that column:

In [14]:
r = SimpleMWOREModel()

r.inference([
  "rod support",
  "broken",
  "Item",
  "Observation",
  "rod support broken",
  "1",
  "0",
  None
 ])

'HAS_OBSERVATION'
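
The heuristic described above can be sketched in a few lines. This is our guess at the behaviour based on the description and the example output, not the actual contents of `helpers/SimpleMWOREModel.py`; the function name `simple_mwo_relation` is hypothetical.

```python
def simple_mwo_relation(row: list) -> str:
    """Guess the relation for a row using the 'Item links to everything' rule.

    Assumes row[2] and row[3] hold the labels of entity 1 and entity 2.
    """
    label_1, label_2 = row[2], row[3]
    # Only Item entities are allowed to be the head of a relation.
    if label_1 != "Item":
        return "O"
    # e.g. an Item paired with an Observation gives HAS_OBSERVATION.
    return f"HAS_{label_2.upper()}"


item_first = ["rod support", "broken", "Item", "Observation",
              "rod support broken", "1", "0", None]
observation_first = ["broken", "rod support", "Observation", "Item",
                     "rod support broken", "0", "1", None]

print(simple_mwo_relation(item_first))
print(simple_mwo_relation(observation_first))
```

A rule like this never looks at the words themselves, which is why it is cheap to run but cannot tell apart two sentences that happen to have the same entity labels.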

## 4.4. Train the model/load the pretrained model

In the cell below we train the Flair RE model ourselves.

Loading a pretrained model from Huggingface is still marked as a TODO in the cell. Once that is available, you can comment out the train line and uncomment the load line instead.

In [17]:
re_model = FlairREModel()
re_model.train(RE_DATASET_PATH, "models/re_models/flair") # Comment out if loading a pretrained model

# TODO: Load from huggingface
# re_model.load('nlp-tlp/mwo-re')


2022-11-21 20:37:32,560 Reading data from data\re_dataset
2022-11-21 20:37:32,562 Train: data\re_dataset\train.csv
2022-11-21 20:37:32,562 Dev: data\re_dataset\dev.csv
2022-11-21 20:37:32,563 Test: data\re_dataset\test.csv
2022-11-21 20:37:32,653 Computing label dictionary. Progress:


24804it [00:04, 6156.20it/s]


2022-11-21 20:37:36,688 Dictionary created for label 'relation' with 13 values: O (seen 15398 times), HAS_ACTIVITY (seen 2825 times), HAS_OBSERVATION (seen 2174 times), APPEARS_WITH (seen 1982 times), HAS_LOCATION (seen 1556 times), HAS_CONSUMABLE (seen 334 times), HAS_SPECIFIER (seen 173 times), HAS_AGENT (seen 143 times), HAS_CARDINALITY (seen 114 times), HAS_ATTRIBUTE (seen 76 times), HAS_TIME (seen 25 times), HAS_EVENT (seen 4 times)


  "There should be no best model saved at epoch 1 except there "


2022-11-21 20:37:39,123 ----------------------------------------------------------------------------------------------------
2022-11-21 20:37:39,125 Model: "TextClassifier(
  (decoder): Linear(in_features=256, out_features=13, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): PooledFlairEmbeddings(
        (context_embeddings): FlairEmbeddings(
          (lm): LanguageModel(
            (drop): Dropout(p=0.25, inplace=False)
            (encoder): Embedding(275, 100)
            (rnn): LSTM(100, 2048)
            (decoder): Linear(in_features=2048, out_features=275, bias=True)
          )
        )
      )
      (list_embedding_1): PooledFlairEmbeddings(
        (context_embeddings): FlairEmbeddings(
          (lm): LanguageModel(
            (drop): Dropout(

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:16<00:00,  6.26it/s]


2022-11-21 20:40:28,143 Evaluating as a multi-label problem: False




2022-11-21 20:40:28,163 DEV : loss 0.0031255579087883234 - f1-score (micro avg)  0.9164
2022-11-21 20:40:28,894 BAD EPOCHS (no improvement): 0
2022-11-21 20:40:28,896 saving best model
2022-11-21 20:40:31,928 ----------------------------------------------------------------------------------------------------
2022-11-21 20:40:31,929 loading file models\re_models\flair\best-model.pt


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 102/102 [00:15<00:00,  6.40it/s]


2022-11-21 20:40:49,408 Evaluating as a multi-label problem: False
2022-11-21 20:40:49,427 0.9587	0.8521	0.9023	0.9332
2022-11-21 20:40:49,428 
Results:
- F-score (micro) 0.9023
- F-score (macro) 0.9294
- Accuracy 0.9332

By class:
                 precision    recall  f1-score   support

HAS_OBSERVATION     1.0000    1.0000    1.0000       313
   HAS_ACTIVITY     1.0000    1.0000    1.0000       279
   HAS_LOCATION     1.0000    1.0000    1.0000       238
   APPEARS_WITH     0.5114    0.2064    0.2941       218
 HAS_CONSUMABLE     1.0000    1.0000    1.0000        59
      HAS_AGENT     1.0000    1.0000    1.0000        21
  HAS_SPECIFIER     1.0000    1.0000    1.0000        15
  HAS_ATTRIBUTE     1.0000    1.0000    1.0000        12
HAS_CARDINALITY     1.0000    1.0000    1.0000        10
       HAS_TIME     1.0000    1.0000    1.0000         5

      micro avg     0.9587    0.8521    0.9023      1170
      macro avg     0.9511    0.9206    0.9294      1170
   weighted avg     0.909

## 4.5. Inference

Now we have our RE model, the next step is to run inference on the MWO dataset to extract the relationships between the entities.

We need our data to be in the same format as required by the model, i.e. a list of rows with the same columns as the training data (entity 1, entity 2, etc.). The Flair model joins the first five columns to form its input text.

So before we can run RE, we need to 'wrangle' our data again to get it into the right format.

### 4.5.1. Converting the BIO format to the "Mention"-based format

The BIO-based format from the NER model has one key downside - it is not good for representing 'phrases' of more than one token in length. This makes it difficult to work with for future steps, such as constructing nodes from the entities and running relation extraction. In light of this, we will now convert the BIO-formatted predictions into Mention format, i.e. go from this:

    {'tokens': ['a/c', 'not', 'working'],
     'labels': ['B-Item', 'B-Observation', 'I-Observation']}
    
To this:

    {'tokens': ['a/c', 'not', 'working'],
     'mentions': [
         {'start': 0, 'labels': ['Item'], 'end': 1},
         {'start': 1, 'labels': ['Observation'], 'end': 3}]}
    
Note that this format is also able to now support multiple labels per mention (though we will only be using single labels for simplicity). Researchers use this format for **entity typing**, which is similar to NER but with >= 1 label per mention.

This step is just a bit of data wrangling - here we have defined a helper function to convert a BIO-tagged sentence into a Mention-tagged sentence.

In [53]:
import json

def bio_to_mention(bio_doc: dict):
    """Return a Mention-format representation of a BIO-formatted
    tagged sentence.

    Args:
        bio_doc (dict): The BIO doc to convert to the Mention-based doc.

    Returns:
        dict: A mention-formatted dict created from the bio_doc.
    """
    tokens = bio_doc["tokens"]
    labels = bio_doc["labels"]
    mentions_list = []

    for i, label in enumerate(labels):
        # A "B-" or "O" tag closes the mention that is still open (if any).
        # We only set "end" once, so that later "O" tags cannot stretch a
        # mention that has already been closed.
        if label.startswith("B-") or label == "O":
            if mentions_list and "end" not in mentions_list[-1]:
                mentions_list[-1]["end"] = i
        # A "B-" tag also opens a new mention.
        if label.startswith("B-"):
            mentions_list.append({"start": i, "labels": [label[2:]]})

    # Close a mention that runs to the end of the sentence.
    if mentions_list and "end" not in mentions_list[-1]:
        mentions_list[-1]["end"] = len(tokens)

    for m in mentions_list:
        m['phrase'] = " ".join(tokens[m['start']:m['end']])
    return {'tokens': tokens, 'mentions': mentions_list}


# For each BIO-tagged sentence in tagged_bio_sents, convert it to the
# mention-based representation
tagged_sents = []
for doc in tagged_bio_sents:
    mention_doc = bio_to_mention(doc)
    tagged_sents.append(mention_doc)

# Let's print our example sentence again, this time with the mention-based
# representation.
# We'll use json.dumps to make it a bit easier to read.
print(json.dumps(tagged_sents[12],indent=1))

{
 "tokens": [
  "a/c",
  "not",
  "working"
 ],
 "mentions": [
  {
   "start": 0,
   "labels": [
    "Item"
   ],
   "end": 1,
   "phrase": "a/c"
  },
  {
   "start": 1,
   "labels": [
    "Observation"
   ],
   "end": 3,
   "phrase": "not working"
  }
 ]
}


Note we have added a "phrase" to each mention. We technically could get this phrase by looking at the list of tokens from the `start` to the `end` of the mention, but storing it inside `mentions` directly makes things easier later on.

### 4.5.2. Building a list of potential relations between entities

Now we have our data in a more amenable format, but we still need tabular data as required by the RE model. To refresh your memory, this is the required format:

 - entity 1
 - entity 2
 - label of entity 1
 - label of entity 2
 - The text between entity 1 and entity 2, inclusive
 - The mention index of entity 1
 - The mention index of entity 2
 - The relation type. "O" means no relation.
 
We don't need that last column here as this is what we want our model to predict. We will set it to `None` to denote that no relation has been assigned yet.

We also need to add a new column to represent the document index - we will see why later.

Here is a helper function to transform our mention-based entity format of a single document into a list of potential relationships between each entity and each other entity in that document.

In [43]:
def build_potential_relations(tagged_sents) -> list:
    """Build a list of potential relations, i.e. all possible relationships
    between each entity in each document. The 8th column (which denotes the
    relationship type) will be set to None. The 9th column is the document index.
    
    Args:
        tagged_sents(list): The list of tagged sentences, where each sentence is a
            dict of tokens: [list of tokens] and mentions: [list of mentions].
    
    Returns:
        list: A list of rows, where each row is a potential relationship.
    """

    relations = []
    for doc_idx, doc in enumerate(tagged_sents):
        for m1_idx, mention_1 in enumerate(doc['mentions']):
            entity_1 = " ".join(doc['tokens'][mention_1['start']: mention_1['end']])
            label_1 = mention_1['labels'][0]

            for m2_idx, mention_2 in enumerate(doc['mentions']):
                if m1_idx == m2_idx:
                    continue
                entity_2 = " ".join(doc['tokens'][mention_2['start']: mention_2['end']])
                label_2 = mention_2['labels'][0]
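                # Note: this slice is empty whenever mention_2 ends before mention_1 starts,
                # so mention_text is '' for those pairs.
                mention_text = " ".join(doc['tokens'][mention_1['start']:mention_2['end']])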

                relations.append(
                    [entity_1, entity_2, label_1, label_2, mention_text, m1_idx, m2_idx, None, doc_idx]         
                )
    return relations
            
relations = build_potential_relations(tagged_sents)
print(relations[0])

['repair', 'cracked', 'Activity', 'Observation', 'repair cracked', 0, 1, None, 0]
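
In the function above, `mention_text` is sliced from the start of entity 1 to the end of entity 2, so it comes out empty whenever entity 2 appears earlier in the sentence than entity 1. The training rows shown in Section 4.1 carry the covering text for both orderings, so if you plan to use a model that reads this column, one possible adjustment is to take the span covering both mentions. The helper name `covering_text` is our own and is not part of the notebook.

```python
def covering_text(tokens: list, mention_1: dict, mention_2: dict) -> str:
    """Return the text spanning both mentions, whichever one comes first."""
    start = min(mention_1["start"], mention_2["start"])
    end = max(mention_1["end"], mention_2["end"])
    return " ".join(tokens[start:end])


tokens = ["a/c", "not", "working"]
item = {"start": 0, "end": 1, "labels": ["Item"]}
observation = {"start": 1, "end": 3, "labels": ["Observation"]}

# Both orderings now give the same covering text.
print(covering_text(tokens, item, observation))
print(covering_text(tokens, observation, item))
```

The `SimpleMWOREModel` results shown later are produced with the original slicing, so swapping this in would change the text column of the printed rows.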


### 4.5.3 Running inference over every row

Now that our data is in the same format that we used to train the RE model, we can run the inference on it.

In [44]:
def tag_all_relations(relations: list):
    """Run model inference over every potential relation in the list of
    relations.
    
    Args:
        relations(list): The list of (untagged) relations.
        
    Returns:
        tagged_relations(list): The same list, but with the rel_type in the
           8th column.
    
    """
    tagged_relations = []

    for rel in relations:
        tagged_rel = rel[:]
        rel_type = rel_model.inference(rel)
        tagged_rel[7] = rel_type
        tagged_relations.append(tagged_rel)
    return tagged_relations
        
rel_model = SimpleMWOREModel() # or re_model, the FlairREModel trained in Section 4.4
tagged_relations = tag_all_relations(relations)

# Print the first 10 rows
for row in tagged_relations[:10]:
    print(row)
        

['repair', 'cracked', 'Activity', 'Observation', 'repair cracked', 0, 1, 'O', 0]
['repair', 'hyd', 'Activity', 'Item', 'repair cracked hyd', 0, 2, 'O', 0]
['repair', 'tank', 'Activity', 'Item', 'repair cracked hyd tank', 0, 3, 'O', 0]
['cracked', 'repair', 'Observation', 'Activity', '', 1, 0, 'O', 0]
['cracked', 'hyd', 'Observation', 'Item', 'cracked hyd', 1, 2, 'O', 0]
['cracked', 'tank', 'Observation', 'Item', 'cracked hyd tank', 1, 3, 'O', 0]
['hyd', 'repair', 'Item', 'Activity', '', 2, 0, 'HAS_ACTIVITY', 0]
['hyd', 'cracked', 'Item', 'Observation', '', 2, 1, 'HAS_OBSERVATION', 0]
['hyd', 'tank', 'Item', 'Item', 'hyd tank', 2, 3, 'HAS_ITEM', 0]
['tank', 'repair', 'Item', 'Activity', '', 3, 0, 'HAS_ACTIVITY', 0]


# 5. Combining NER+RE

Now we have outputs from both the NER model and the RE model. The NER model's output looks like this:

    {'tokens': ['a/c', 'not', 'working'],
     'mentions': [
         {'start': 0, 'labels': ['Item'], 'end': 1},
         {'start': 1, 'labels': ['Observation'], 'end': 3}]}

The RE model's output is the list of tagged rows shown in the cell above.

The next step is to combine the two outputs. Fortunately we stored the document index in the relations, so we can easily join them up.

Let's add a 'relations' key to this dictionary. It will hold a list of the relationships between mentions, e.g.

    'relations': [{'start': 0, 'end': 1, 'type': 'HAS_OBSERVATION'}]

... which denotes that mention 0 ('a/c') has the observation of mention 1 ('not working'). Note that `start` and `end` here are mention indexes, not token positions.

In [56]:
for i, sent in enumerate(tagged_sents):
    
    # Note we only care about the relations that do not have the class "O".
    doc_relations = [row for row in tagged_relations if row[7] != "O" and row[8] == i]
    
    sent['relations'] = []    
    for row in doc_relations:
        rel = {'start': row[5], 'end': row[6], 'type': row[7]}     
        sent['relations'].append(rel)

# Let's print an example...
print(json.dumps(tagged_sents[10], indent=1))

{
 "tokens": [
  "pump",
  "fault"
 ],
 "mentions": [
  {
   "start": 0,
   "labels": [
    "Item"
   ],
   "end": 1,
   "phrase": "pump"
  },
  {
   "start": 1,
   "labels": [
    "Observation"
   ],
   "end": 2,
   "phrase": "fault"
  }
 ],
 "relations": [
  {
   "start": 0,
   "end": 1,
   "type": "HAS_OBSERVATION"
  }
 ]
}
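
The loop above rescans the full `tagged_relations` list once per sentence, which is fine for a small dataset but becomes slow as the number of documents grows. A sketch of an alternative is to group the relations by document index in a single pass first. The toy `tagged_sents` and `tagged_relations` below are stand-ins so that the example runs on its own.

```python
from collections import defaultdict

# Toy stand-ins for the notebook's tagged_sents and tagged_relations.
tagged_sents = [
    {"tokens": ["pump", "fault"], "mentions": [
        {"start": 0, "labels": ["Item"], "end": 1, "phrase": "pump"},
        {"start": 1, "labels": ["Observation"], "end": 2, "phrase": "fault"},
    ]},
    {"tokens": ["replace"], "mentions": [
        {"start": 0, "labels": ["Activity"], "end": 1, "phrase": "replace"},
    ]},
]
tagged_relations = [
    ["pump", "fault", "Item", "Observation", "pump fault", 0, 1, "HAS_OBSERVATION", 0],
    ["fault", "pump", "Observation", "Item", "", 1, 0, "O", 0],
]

# Group the non-"O" relations by document index (the 9th column) in one pass.
relations_by_doc = defaultdict(list)
for row in tagged_relations:
    if row[7] != "O":
        relations_by_doc[row[8]].append(
            {"start": row[5], "end": row[6], "type": row[7]}
        )

# Attach each document's relations; documents without any get an empty list.
for i, sent in enumerate(tagged_sents):
    sent["relations"] = relations_by_doc.get(i, [])

print(tagged_sents[0]["relations"])
print(tagged_sents[1]["relations"])
```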


# 6. Creating the graph

We now have a data structure that stores the tokens, entity mentions, and relationships between those mentions, for each document. The last step is to put it all into a Neo4j graph so that we can query this information.

There are two popular methods for doing this:

- Using `py2neo` to programmatically insert data into Neo4j
- Saving CSVs of your entities and relations, then reading them in via a `LOAD CSV` query in Neo4j

The first option is simple but a bit slow, and the second option is a little more complex but much faster. We will go with the first option here in this notebook for simplicity.
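
For reference, the second option might look something like the sketch below, which writes one CSV of unique entity nodes and one of relationships from a structure shaped like `tagged_sents`. The file names, column names, and the `export_graph_csvs` function are our own choices, and the Cypher in the closing comment is an untested illustration of a `LOAD CSV` query rather than something this notebook runs.

```python
import csv
import os
import tempfile


def export_graph_csvs(tagged_sents: list, out_dir: str) -> tuple:
    """Write entities.csv and relations.csv for later bulk loading."""
    entities = {}   # node id -> (name, entity class), deduplicated by id
    relations = []  # (source id, relation type, target id)

    for sent in tagged_sents:
        ids = []
        for m in sent["mentions"]:
            node_id = f"{m['phrase']}__{m['labels'][0]}"
            entities[node_id] = (m["phrase"], m["labels"][0])
            ids.append(node_id)
        # Relation 'start' and 'end' are indexes into this sentence's mentions.
        for rel in sent.get("relations", []):
            relations.append((ids[rel["start"]], rel["type"], ids[rel["end"]]))

    entities_path = os.path.join(out_dir, "entities.csv")
    relations_path = os.path.join(out_dir, "relations.csv")
    with open(entities_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "entity_class"])
        for node_id, (name, entity_class) in entities.items():
            writer.writerow([node_id, name, entity_class])
    with open(relations_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "type", "target"])
        writer.writerows(relations)
    return entities_path, relations_path


sents = [{
    "tokens": ["pump", "fault"],
    "mentions": [
        {"start": 0, "end": 1, "labels": ["Item"], "phrase": "pump"},
        {"start": 1, "end": 2, "labels": ["Observation"], "phrase": "fault"},
    ],
    "relations": [{"start": 0, "end": 1, "type": "HAS_OBSERVATION"}],
}]
entities_path, relations_path = export_graph_csvs(sents, tempfile.mkdtemp())

# Illustrative Cypher for the node file (run in Neo4j, not in Python):
#   LOAD CSV WITH HEADERS FROM 'file:///entities.csv' AS row
#   MERGE (e:Entity {_id: row.id})
#   SET e.name = row.name, e.entity_class = row.entity_class
```

Note that this sketch stores the entity class as a property rather than as a node label, so the queries in Section 7 would need adjusting if you took this route.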

> Before proceeding, make sure you have created a new graph in Neo4j and that your new Neo4j graph is running.

You can download and install Neo4j from here if you haven't already: https://neo4j.com/download/. I will be demonstrating the graph during the class so there's no need to have it installed unless you are also interested in trying out some graph queries yourself.

In [62]:
from py2neo import Graph
from py2neo.data import Node, Relationship

GRAPH_PASSWORD = "password" # Set this to the password of your Neo4J graph


def get_node_id(phrase, entity_class):
    """A simple function to generate an id.
    This ensures an entity that can be different classes (pump for example) can have
    a unique node for each class type.
    """
    return f"{phrase}__{entity_class}"
    
def create_graph(tagged_sents):
    """Build the Neo4j graph.
    We do this by iterating over each tagged_sentence, and constructing the
    graph as follows:
     - Create a node to represent the document itself.
     - Create nodes for each entity appearing in that document, if they have not
       already been created. Each unique combination of entity + class will be added, so
       pump (the Item) is different from pump (the Activity).
     - Create a relationship between each entity and each document in which it appears.
     - Create a relationship between each entity and each other entity it is related to,
       via the list of relations.
     
    Args:
        tagged_sents(list): The list of tagged sentences.
    """
    graph = Graph(password = GRAPH_PASSWORD)

    # We will start by deleting all nodes and edges in the current graph.
    # If we don't do this, we will end up with duplicate nodes and edges when running this script again.
    graph.delete_all() 

    tx = graph.begin()
    
    # Keep track of the created entity nodes.
    # We need a way to map the id of the nodes to the py2neo Node objects so that we can
    # easily create relationships between these nodes.
    created_entity_nodes = {}
    
    # Iterate over the list of tagged sentences and programmatically create the graph.
    for sent in tagged_sents:
        
        # Create a node to represent the document.
        # Note that if you had additional properties in tagged_sents (such as dates, costs, etc)
        # you could add them as properties of the Document nodes here.
        document_node = Node("Document", name=" ".join(sent['tokens']))
        tx.create(document_node)
        
        tokens = sent['tokens']
        mentions = sent['mentions']
        relations = sent['relations']
        
        for m in mentions:
            start = m['start']
            end = m['end']
            entity_class = m['labels'][0]        
            phrase = " ".join(tokens[start: end])     
                    
            # Create a node for this entity mention.
            # If the node has already been created (i.e. it exists in created_entity_nodes),
            # simply retrieve that Node from created_entity_nodes.
            # Otherwise, create it, and add it to created_entity_nodes.
            entity_node_id = get_node_id(phrase, entity_class)

            if entity_node_id in created_entity_nodes:
                entity_node = created_entity_nodes[entity_node_id]
            else:
                entity_node = Node("Entity", entity_class, _id=entity_node_id, name=phrase)
                created_entity_nodes[entity_node_id] = entity_node
                tx.create(entity_node)            
                        
                
            # Create a relationship between that node and the document
            # in which it appears.               
            r = Relationship(entity_node, "APPEARS_IN", document_node)
            tx.create(r)
            
        # Create relationships between each (entity_1, entity_2) in the
        # list of relations for this document.
        for rel in relations:
            start = rel['start']
            end = rel['end']
            
            phrase_1 = mentions[start]['phrase']
            entity_class_1 = mentions[start]['labels'][0]
            
            phrase_2 = mentions[end]['phrase']
            entity_class_2 = mentions[end]['labels'][0]
                       
            node_1 = created_entity_nodes[get_node_id(phrase_1, entity_class_1)]
            node_2 = created_entity_nodes[get_node_id(phrase_2, entity_class_2)]
            
            r = Relationship(node_1, rel['type'], node_2)
            tx.create(r)
    tx.commit()

create_graph(tagged_sents)


        

# 7. Querying the graph


Now that the graph has been created, we can query it in Neo4j. This section lists some example queries that we can run on our graph. Feel free to try your own queries!

Note we are using `gqvis` to visualise these in Jupyter Notebook. The results will look very similar if you run these queries directly in the Neo4j browser.

*Note about gqvis: gqvis works out of the box in Jupyter Notebook, but to get it working in Jupyter Lab you'll need to install the jupyter_requirejs plugin. See the Appendix section at the bottom of this notebook for more details.*

First, let's try a simple query. Here is a query that searches for __all failure modes observed on pumps__:



In [2]:
import gqvis

gqvis.visualise_cypher("MATCH (e:Entity {name: 'pump'})-[r:HAS_OBSERVATION]->(o:Observation) RETURN e, r, o")


We can also use our graph as a way to quickly search and access work orders for the entities appearing in those work orders. For example, searching for __all work orders containing a leak__:

In [7]:
gqvis.visualise_cypher("MATCH (d:Document)<-[a:APPEARS_IN]-(o:Observation {name: 'leak'}) RETURN d, a, o")



We could extend this to also show the items on which the leaks were present:

In [81]:
gqvis.visualise_cypher("""
MATCH (d:Document)<-[a:APPEARS_IN]-(o:Observation {name: "leak"})<-[r:HAS_OBSERVATION]-(e:Entity)
RETURN d, a, o, r, e
""")


    


Our queries can also incorporate structured data, such as the start dates of the work orders. We have not added structured data to our graph for simplicity, but if we stored dates as properties on the `Document` nodes, we could run this type of query.

Here is an example query for __all assets that had leaks from 25 to 28 July__:

    MATCH (d:Document)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_OBSERVATION]->(o:Observation {name: "leak"})-[:APPEARS_IN]->(d)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    RETURN e, r, o

On a larger graph this would also work well with other forms of structured data such as costs. We could query based on specific asset costs, for example.
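
The date query above compares `StartDate` against integers such as `20050725`, which implies dates stored in `yyyymmdd` integer form. A minimal sketch of preparing such a property is shown below. The input date format (ISO `YYYY-MM-DD`) and the helpers `to_yyyymmdd` and `document_properties` are assumptions, since this notebook does not load any dates.

```python
from datetime import datetime


def to_yyyymmdd(date_string: str) -> int:
    """Convert an ISO date string such as '2005-07-26' to the int 20050726."""
    return int(datetime.strptime(date_string, "%Y-%m-%d").strftime("%Y%m%d"))


def document_properties(tokens: list, start_date: str) -> dict:
    """Build the properties for a Document node, including its start date."""
    return {"name": " ".join(tokens), "StartDate": to_yyyymmdd(start_date)}


props = document_properties(["pump", "leak"], "2005-07-26")
print(props)

# Integer dates in this form sort chronologically, so range checks work.
print(20050725 <= props["StartDate"] <= 20050728)
```

These properties could then be passed as keyword arguments when creating the node in `create_graph`, e.g. `Node("Document", **props)`.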

# 8. Where to go from here

Feel free to use/adapt any of this code to build your own knowledge graphs. You might like to try running it on your own datasets, or designing your own `NERModel` or `REModel`.

## 8.1. Improving the lexical normalisation model

We only briefly touched on the lexical normalisation component of Knowledge Graph Construction from Text. There are plenty of neural models for lexical normalisation available that yield much better performance than our lexicon-based tagger.

We have also developed a tool to support the rapid creation of training data for lexical normalisation - you can learn about it [here](https://aclanthology.org/2021.emnlp-demo.25/).

## 8.2. Incorporating other structured data into the graph

Graph databases are excellent at bringing together data from a wide range of sources. In a maintenance setting, there are two particular types of structured data that can be easily added to this knowledge graph schema: Downtime events, and Functional Locations.

### Downtime events

A downtime event is a point in time in which an asset is not operational. These events typically have costs and dates associated with them, and can be associated with particular `Item` entities.

By modelling both work orders and downtime events in one graph, we can make queries about downtime events. Here is an example query for the __downtime events associated with assets appearing in work orders from 25 to 28 July (where the downtime events occurred in July)__:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    AND 20050701 <= x.StartDate <= 20050731
    RETURN e, r, x

We can of course extend this to specific assets, such as pumps:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(e:Entity {name: "pump"})-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    AND 20050701 <= x.StartDate <= 20050731
    RETURN e, r, x

In larger graphs the downtime events could even be further queried based on duration, cost, lost feed, or date ranges.

### Functional Locations (FLOCs)

You may have noticed that our original `work_order_data.csv` has a column called "FLOC". This is the functional location of the asset being maintained. In the maintenance domain, this is often of greater interest to reliability engineers than the individual `Item` entities, and thus it would be ideal to create nodes to represent these functional locations in the graph. This way, we could run queries on the failure modes associated with particular FLOCs.

If you are interested in continuing work on this small graph, the next best step would be to create nodes for the functional location data (`floc_data`) and to link the downtime events to those nodes as opposed to the Item nodes.

![Adding FLOCs](images/adding-flocs.png "Adding FLOCs")
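One possible starting point is to collect the distinct values of the "FLOC" column and create one node per value. The sketch below is an illustration under stated assumptions: the `FunctionalLocation` label and its `name` property are hypothetical, as are the `ShortText` column and every value in the sample CSV (only the "FLOC" column name comes from this notebook).

```python
import csv
import io

# Made-up sample in the rough shape of work_order_data.csv.
# Only the "FLOC" column name is taken from the notebook.
SAMPLE_CSV = """ShortText,FLOC
replace pump seal,1234.01.02
repair leaking hose,1234.01.02
change out engine,5678.03
"""


def unique_flocs(rows):
    """Return the distinct, non-empty FLOC values in first-seen order."""
    seen = {}
    for row in rows:
        floc = (row.get("FLOC") or "").strip()
        if floc:
            seen.setdefault(floc, None)  # dicts preserve insertion order
    return list(seen)


# Assumed label and property name. MERGE avoids creating duplicate nodes
# if the query is run more than once.
MERGE_FLOCS_QUERY = """
UNWIND $flocs AS floc
MERGE (:FunctionalLocation {name: floc})
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
flocs = unique_flocs(rows)
print(flocs)

# With a py2neo Graph object named `graph` (assumed to be connected already):
# graph.run(MERGE_FLOCS_QUERY, flocs=flocs)
```

Linking downtime events to these nodes would follow the same pattern, with a second query that matches each event and its FLOC node and merges a relationship between them.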

## 8.3. Consolidating the training data for NER+RE

You may have noticed that our training data is split into two parts for the NER and RE tasks, i.e. the NER data is in CoNLL format and the RE data is in tabular format. It is possible to put both the NER and RE training data into a single file, using the mention format we showed previously. For example, each row of your training data could look like this:

    { tokens: [<list of tokens>], mentions: [<list of mentions>], relations: [<list of relations>] }
    
... and you could have a script to 'wrangle' this into the CoNLL and tabular formats before feeding them into the NER and RE models respectively. If you use a tool like [QuickGraph](https://aclanthology.org/2022.acl-demo.27/) your data will be in a similar format to the above.
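Such a wrangling script might look something like the sketch below. The field names are assumptions made for illustration and may differ from the mention format used earlier in this notebook or from QuickGraph's export: each mention is taken to have `start`, `end` (exclusive) and `labels` keys, and each relation to have `start` and `end` (indices into the mention list) and a `type`. The sketch also assumes BIO tagging, non-overlapping mentions, and a made-up `HAS_ACTIVITY` relation type.

```python
def to_conll(doc):
    """Return CoNLL-style 'token tag' lines for one document, using BIO tags."""
    tags = ["O"] * len(doc["tokens"])
    for mention in doc["mentions"]:
        label = mention["labels"][0]  # this sketch keeps only the first label
        tags[mention["start"]] = "B-" + label
        for i in range(mention["start"] + 1, mention["end"]):
            tags[i] = "I-" + label
    return "\n".join(f"{tok} {tag}" for tok, tag in zip(doc["tokens"], tags))


def mention_text(doc, mention):
    """Join the tokens covered by a mention into a single string."""
    return " ".join(doc["tokens"][mention["start"]:mention["end"]])


def to_relation_rows(doc):
    """Return (head text, tail text, relation type) tuples for one document."""
    rows = []
    for rel in doc["relations"]:
        head = doc["mentions"][rel["start"]]
        tail = doc["mentions"][rel["end"]]
        rows.append((mention_text(doc, head), mention_text(doc, tail), rel["type"]))
    return rows


# A made-up example row in the combined format.
doc = {
    "tokens": ["replace", "pump", "seal"],
    "mentions": [
        {"start": 0, "end": 1, "labels": ["Activity"]},
        {"start": 1, "end": 3, "labels": ["Item"]},
    ],
    "relations": [{"start": 1, "end": 0, "type": "HAS_ACTIVITY"}],
}

conll_text = to_conll(doc)
relation_rows = to_relation_rows(doc)
print(conll_text)
print(relation_rows)
```

The columns of your RE dataset may be ordered or named differently, so the tuples returned by `to_relation_rows` would need to be adapted before writing them to CSV.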

## 8.4. From Pipeline to End-to-End Knowledge Graph Construction from Text

In this notebook we have presented a "pipeline" for knowledge graph construction (KGC). We wrote the notebook this way so that we could discuss each of the components (Lexical Normalisation, Named Entity Recognition and Relation Extraction) in isolation, making them easier to understand.

However, the current state of the art in NLP/TLP is moving away from pipeline-based KGC models and towards end-to-end neural models, i.e. a single neural model that performs all of these steps simultaneously. If you are interested in learning about this, you might like to read some of the following papers:

> Stewart, M., & Liu, W. (2020, July). Seq2KG: an end-to-end neural model for domain agnostic knowledge graph (not text graph) construction from text. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (Vol. 17, No. 1, pp. 748-757).

> Eberts, M., & Ulges, A. (2019). Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755.

> Cabot, P. L. H., & Navigli, R. (2021, November). REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2370-2381).

# Appendix

## GQVis on Jupyter Lab

GQVis works out of the box on Jupyter Notebook, but to get it working in Jupyter Lab, you'll need to run the following command before starting Jupyter Lab:

    jupyter labextension install jupyterlab_requirejs