# Knowledge Graph Construction from Technical Short Text

In this notebook we are going to construct a simple knowledge graph using Python, and run some queries on the graph in Neo4j. We have broken the notebook into several steps:

1. Problem description
2. Introduction to Natural and Technical Language Processing
3. Loading the data
4. Cleaning the data via **Lexical Normalisation**
5. Extracting entities via **Named Entity Recognition** (NER)
6. Creating relations between entities via **Relation Extraction** (RE)
7. Combining NER + RE
8. Creating the graph
9. Querying the graph in Neo4j
10. What next?

## Requirements

To run this notebook you will need to install the following via pip (see the cell below):

- `py2neo`: A library for working with Neo4j in Python.
- `gqvis`: Our simple tool for visualising graph queries in Jupyter.
- `flair`: A deep learning library for natural language processing. Note that this library is quite large (around a couple of GB). If you don't wish to install it, we have provided non-deep-learning alternatives so you can still follow along.

In [None]:
!pip install py2neo
!pip install gqvis
!pip install flair

Sections 8 (Creating the graph) and 9 (Querying the graph in Neo4j) require Neo4j to be installed. We are offering this notebook as both a local installation to run in Jupyter Notebook (via the GitHub repository), and on Google Colab.

> ⚠️ If you use the **local Jupyter notebook** (i.e. you have cloned the GitHub repository and are running the notebook via `jupyter notebook`), you will need to have Neo4j Desktop installed (and running) for the last part of the tutorial. You can download and install Neo4j Desktop [here](https://neo4j.com/).

> ⚠️ If you are using the **Google Colab notebook**, you do not need to install anything.

We will be running through the code and Neo4j graph during the tutorial so you will not miss out if you are not able to install Neo4j.

# 1. Problem description

Maintenance work orders (MWOs) capture information on the maintenance performed on assets. Much of this information is structured - such as dates, the functional location (the specific identifier of the asset), costs, and so on. However, a significant volume of the knowledge within an MWO is unstructured and therefore inaccessible - it is buried within the short text description.

Here are some examples of these short text descriptions:

    replace pump
    a/c running hot
    repair cracked hydraulic tank
    
As you can see, they often contain indicators of failure modes (e.g. overheating) and end of life events (e.g. a replacement). It would be useful to discover these automatically, because it is next to impossible to manually trawl through thousands of work orders looking for patterns. We need some way to ask questions of our unstructured data, which is where knowledge graphs shine.

Our task in this session is therefore to transform these work orders into a **knowledge graph**. We will be primarily focusing on the short text descriptions, but there will be some discussion at the end on incorporating other structured knowledge (e.g. dates) into our graph. The goal of building this knowledge graph is to be able to **ask questions of our data** such as "what are the failure modes observed on pumps?" and "which assets have had leaks in the past 6 months?"

![Insight](images/insight.jpg "Insight")

# 2. Introduction to Natural (and Technical) Language Processing

**Natural language processing** (NLP) is the study of the automatic interpretation and manipulation of natural language, like speech and written text.

**Technical language processing** (TLP) is a subset of NLP that focuses specifically on technical text, such as the text present in maintenance work orders, doctor's notes, safety records, and so on. 

Here we will provide a brief overview of some of the most important concepts and terms in NLP.

## 2.2. Core NLP terms and ideas

Let's first define some core NLP terms and ideas that we will see numerous times throughout this notebook.

### 2.2.1. Corpus

A 💡 **corpus** is a set of text documents, where a document can be a word, sentence, paragraph, report, etc. Our corpus is a dataset of maintenance work order records.

### 2.2.2. Tokenisation

The first step for almost any NLP application is 💡 **Tokenisation**. This task splits text into smaller units, such as sentences (sentence tokenisation), words (word tokenisation), or characters (character tokenisation). For most applications, we split text into words.

The simplest (but by no means best) way to tokenise is using Python's `split` function, which, when called without arguments, splits a string on whitespace:



In [1]:
sentence = "this is a sentence"
words = sentence.split()

print(words)

['this', 'is', 'a', 'sentence']


This is good enough for our purposes (tokenising maintenance work orders), but does not perform well on natural language datasets due to the prevalence of punctuation (full stops, commas etc). The `split` function fails when our sentence ends with a full stop, for example:

In [2]:
sentence = "this is a sentence."
words = sentence.split()

print(words)

['this', 'is', 'a', 'sentence.']


Fortunately this does not happen in our dataset so for simplicity we will just be using the `split` function.

If you are interested in learning more about tokenisation, there are many tokenisation libraries available in Python. The most popular perhaps is NLTK, which has a range of tokenisers available (see [here](https://www.nltk.org/api/nltk.tokenize.html)).
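
If you would rather stay in pure Python, a regular expression gets you most of the way there. The sketch below is only an illustration (the function name `simple_tokenise` and the pattern are our own assumptions, not part of this notebook's helpers): it separates trailing punctuation while keeping maintenance abbreviations such as `a/c` in one piece.

```python
import re


def simple_tokenise(text: str) -> list:
    """Split text into lower-cased word and punctuation tokens.

    A word may contain internal '/', '-' or apostrophe characters, so
    abbreviations such as 'a/c' or 'l/h' are kept as a single token.
    Any other non-space symbol becomes a token of its own.
    """
    pattern = r"\w+(?:[/'-]\w+)*|[^\w\s]"
    return re.findall(pattern, text.lower())


print(simple_tokenise("this is a sentence."))
print(simple_tokenise("a/c running hot"))
```

Unlike `split`, this separates the full stop from "sentence", at the cost of a pattern that you would need to tune for your own data.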

### 2.2.3. Vocabulary

A 💡 **vocabulary** is a set of terms that correspond to a particular subject matter. The vocabulary of maintenance texts will be quite large, even though the texts themselves are short. This is because maintainers often write the same word in many different ways ('air conditioner' is written as 'a/c', 'air cond', etc).

Note that depending on the domain, we may treat varying capitalisation of the same word ("dog" and "Dog", for example) as different words. In this notebook we will treat everything as lower-cased, though in many domains this is not a good idea (especially when many acronyms are present).

Determining our vocabulary is a matter of creating a set of unique words in the corpus, for example:

In [7]:
vocabulary = set()

dataset = ["this is the first sentence", "this is the second sentence"]
for sent in dataset:
    words = sent.split()
    for word in words:
        vocabulary.add(word.lower())

print(vocabulary)

{'sentence', 'second', 'first', 'the', 'this', 'is'}


### 2.2.4. Word Embeddings

Deep learning models are trained on numerical vectors. So how do we train NLP models to understand text?

💡 **Word embeddings** are numerical representations of language. We can think of an embedding as a `k`-dimensional vector, whereby the embeddings of **semantically similar** items (typically words) have a **high cosine similarity**.

![Embeddings](images/embeddings_2.png "Embeddings")

Note how "cat" and "kitten" are close, while "cat" and "houses" are far away.

Actual word embeddings are much larger (typically hundreds of dimensions), but the general principle is the same. The model that generates the embeddings is designed to ensure that words appearing in a similar context (cat and dog etc) have similar vectors, but words appearing in different contexts (cat and cup) have dissimilar vectors.

For fun, you might like to check out [Semantle](https://semantle.com/), where the goal is to determine the word of the day via its cosine similarity to other words.
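
To make cosine similarity concrete, here is a minimal sketch in pure Python. The three-dimensional vectors are invented purely for illustration (they do not come from any trained model); they are chosen so that "cat" and "kitten" point in a similar direction while "houses" does not.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this example only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "houses": [0.1, 0.2, 0.9],
}

cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_houses = cosine_similarity(toy_embeddings["cat"], toy_embeddings["houses"])

print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs houses: {cat_houses:.3f}")
```

The first score is close to 1 and the second is much lower, which is the behaviour we want from a good embedding model.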

#### Language models

💡 **Language models** are used to generate embedding vectors. The choice of model depends largely on the corpus.

Language models are typically **trained** on a corpus and then used to generate embeddings for words appearing in a different corpus. They can also be **fine-tuned** so that the model predicts more meaningful embeddings for words in a new corpus.

Below is a table of some of the most popular language models, including how the models are trained.

| Name     | Model                                                 | Method                                                                     | Granularity            |
|:---------|:------------------------------------------------------|:---------------------------------------------------------------------------|:-----------------------|
| word2vec | Feedforward neural network                            | Predict word given its context (CBOW), or context given a word (skip-gram) | Word-based             |
| GloVe    | Log-bilinear regression model                         | Train word vectors from global co-occurrence matrix                        | Word-based             |
| FastText | Feedforward neural network                            | Predict context given a word (skip-gram)                                   | Character n-gram based |
| ELMo     | Bi-directional Long Short Term Memory model (Bi-LSTM) | Predict word given all previous/next words                                 | Character-based        |
| BERT     | Bi-directional Transformer                            | Predict masked word(s)                                                     | Wordpiece-based        |
| Flair    | Character-level LSTM (forward and backward)           | Predict next character given the previous characters                       | Character-based        |

While `word2vec` remains one of the most popular language models, it has one key downside - it is not able to generate embedding vectors for words that do not appear in the corpus on which it was trained. This is a problem for our maintenance texts as they are rife with spelling errors, acronyms, etc, as well as many terms that do not occur in common natural language.

Character-level embeddings are able to deal with this issue because they are modelled at the character level, and thus the vectors for similar spellings of the same word ('pump', 'puump', etc) will have high cosine similarity. For our application we will therefore choose Flair embeddings, though FastText, ELMo and BERT would work well too.
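
As a rough intuition for why sub-word information helps with misspellings, the sketch below compares the character trigrams of two words, loosely in the spirit of FastText's character n-grams. This is a crude proxy that we have made up for illustration: it is not how Flair (or any of the models above) actually computes embeddings, and the function names are our own.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_overlap("pump", "puump"))  # shares several trigrams
print(ngram_overlap("pump", "hose"))   # shares none
```

A word-based model sees "pump" and "puump" as two unrelated symbols, whereas anything built from characters can exploit the pieces they have in common.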

### 2.2.5. Sequence Labelling and Text Classification

There are two umbrella terms that almost all NLP tasks fall under - **sequence labelling** and **text classification**.

💡 **Sequence labelling** involves assigning a label to every item in a sequence. Examples of sequence labelling tasks include Named Entity Recognition (assigning an entity class label to every token in a sentence) and Lexical Normalisation (assigning the correct form of the word to every word in a sentence).

💡 **Text classification** involves assigning one or more label(s) to an entire sequence. Examples of text classification tasks include sentiment analysis (determining whether a document is positive, negative, or neutral). In this notebook, our Relation Extraction model is also a text classification model (determine the relation type of a given (entity 1, entity 2, context)).

![NLP Tasks](images/nlp_tasks.jpg "NLP Tasks")

Broadly speaking, there are three widely-used types of methods for sequence labelling and text classification: **rule-based methods**, **feature extraction-based methods**, and **representation learning** methods.

#### Rule-based methods

Rule-based methods for sequence labelling/text classification are relatively straightforward: we define rules that govern how a particular token or sequence should be labelled. These rules can be simple, i.e. in the form of "token" &rarr; "label", such as "pump" &rarr; "Item", or more complex, using regular expressions ("p" followed by one or more "u" characters followed by "mp" &rarr; "Item").

Rule-based methods take significant manual effort to develop and refine. They are effectively hard-coded to one dataset - rules defined on one dataset do not necessarily work on another dataset. They are also prone to user error, particularly when one token can have multiple meanings depending on context (such as "pump" being either an "Item" or "Activity").
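
A rule-based labeller of this kind can be sketched in a few lines. The rules below are hypothetical examples written for this illustration (they are not taken from any real rule set), and each token is simply labelled with the class of the first rule it matches.

```python
import re

# Hypothetical rules: (pattern, label). The first matching rule wins.
RULES = [
    (re.compile(r"pu+mp"), "Item"),                   # pump, puump, puuump, ...
    (re.compile(r"replace|repair"), "Activity"),
    (re.compile(r"cracked|leak(ing)?"), "Observation"),
]


def label_tokens(tokens):
    """Label each token with the class of the first rule that matches it fully."""
    labels = []
    for token in tokens:
        label = "O"  # default: not an entity
        for pattern, rule_label in RULES:
            if pattern.fullmatch(token):
                label = rule_label
                break
        labels.append(label)
    return labels


print(label_tokens("replace puump".split()))
```

Note that a rule such as `pu+mp` will always label "pump" as an `Item`, even in a text where it describes an activity. This is exactly the kind of context problem described above.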


#### Feature extraction-based sequence labelling

Unlike rule-based systems, which are more prevalent in commercial applications, feature extraction (i.e. machine-learning) based methods do not require users to spend copious amounts of time developing rule sets for one specific application.

Feature-based models are trained on *handcrafted features*, which are carefully selected by the developer to incorporate their own domain knowledge to the particular task. For example, in NER, a commonly-used feature is whether a word begins with a capital letter.

Feature extraction-based models include:

- **Hidden Markov models (HMM)**: A finite-state automaton with probabilistic transitions between states.
- **Conditional random fields (CRF)**: Undirected graphical models that aim to predict multiple variables dependent on one another.
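
To give a feel for what handcrafted features look like in practice, here is a minimal sketch of a feature function of the kind typically fed to a CRF. The particular features and their names are our own assumptions, chosen for illustration; a real system would select them to suit the domain.

```python
def token_features(tokens, i):
    """Build a dictionary of handcrafted features for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalised": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "suffix3": token[-3:].lower(),
        # Neighbouring words give the model some context
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


example_tokens = ["Michael", "works", "at", "UWA"]
print(token_features(example_tokens, 0))
```

Each token in a sentence is converted into such a dictionary, and the model learns which feature values are indicative of which label.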


#### Representation learning-based sequence labelling

Recent research in NER trends towards deep learning-based models such as Bidirectional Long Short Term Memory (Bi-LSTM) and Transformers. These models require very little manual effort to employ - they simply need a set of annotated training data. The words are fed through an embedding layer to obtain word embedding vectors, which then propagate through the model. Deep learning models excel when trained on large amounts of data.

There are several representation learning-based models commonly used throughout NLP:

- **Feedforward neural networks**: A "simple" neural network comprised of multiple layers.
- **Recurrent neural networks (RNN)**: A type of neural network designed to handle sequential data.
- **Long short-term memory models (LSTMs)**: A type of RNN that fares better on long sequences (mitigates the "vanishing gradient" problem).
- **Gated recurrent units (GRUs)**: Similar to LSTM, but with fewer gates (thus, simpler).
- **Convolutional neural networks**: A neural network that begins with a convolutional layer.
- **Transformers**: An encoder/decoder model that combines feedforward neural networks with multi-headed attention mechanisms.

For our task we are going to use `Flair`, which is a Bidirectional LSTM model, for both Named Entity Recognition and Relation Extraction.


### 2.2.6. Supervised learning and the need for annotated data

The majority of feature extraction and representation learning-based models are **supervised learning** models, i.e. they learn from annotated data. Annotated data can be obtained via manual annotation using tools such as [Redcoat](https://nlp-tlp/redcoat) and [QuickGraph](https://quickgraph.nlp-tlp.org).

For this notebook, our NER model is trained using an annotated dataset of ~4k MWOs tagged by the [UWA Natural & Technical Language Group](https://nlp-tlp.org). 

# 3. Loading the data

We are going to be building a knowledge graph on a small sample set of work orders. This will not be seen by the NER or RE models prior to constructing the graph - the idea is to get our models to run *inference* over this dataset to automatically predict the entities, and relationships between the entities, to build a graph.

- `sample_work_orders.csv`: A csv file containing a set of work orders.

We are using the simple `csv` library to read in the data, though this can also be done using `pandas`.

In [11]:
from csv import DictReader
from helpers import print_table

work_order_file = "data/sample_work_orders.csv"

# A simple function to read in a csv file and return a list,
# where each element in the list is a dictionary of {heading : value}
def load_csv(filename):
    data = []
    with open(filename, 'r') as f:
        reader = DictReader(f)
        for row in reader:
            data.append(row)
    return data

        
work_order_data = load_csv(work_order_file)

# Let's have a look at 10 rows
print_table(work_order_data[30:40])

StartDate     FLOC          ShortText                              Cost   
----------------------------------------------------------------------------------------------------
              1234.02.11    broken handraill/h/s crows nest        3200   
              1234.02.12    broken hose on cylinder                3300   
              1234.02.13    broken l/h pulldown chain              3400   
              1234.02.14    broken locking pin                     3500   
17/08/2001    1234.02.15    broken tool wrench holding cylinder    3600   
              1234.02.16    build up rod support beak              3700   
              1234.02.17    bull hose air leak                     3800   
25/02/2006    1234.02.18    bull hose split                        3900   
              1234.02.19    busted hydraulic hose                  4000   
              1234.02.20    c spanner for minning                  4100   


# 4. Cleaning the data via Lexical Normalisation

Our maintenance work order data is quite noisy - there are spelling errors, typos, acronyms, abbreviations, and so on. This will present challenges to us later when it comes time to build the knowledge graph. So to deal with these issues, we should first clean the text using **lexical normalisation**.

## 4.1. Overview of Lexical Normalisation

**Lexical normalisation** is an NLP task where the goal is to "transform an utterance into its standard form, word by word, including both one-to-many (1-n) and many-to-one (n-1) replacements" (van der Goot *et al.*, 2021). Lexical normalisation is closely related to the task of grammatical error correction; however, words are not inserted or removed to improve grammaticality. An example pair of noisy (non-canonical) and clean (canonical) sequences is shown below:

    c/o braken a/c vent on the lhs of dump trk cab
    change out broken air conditioner vent on the left hand side of dump truck cabin

Lexical normalisation focuses on the normalisation of non-canonical lexical terms with one or more of the characteristics shown in the figure below. Lexical errors can be a result of: 

- unconventional and phonetic spelling
- improper casing
- acronyms
- abbreviations and initialisms
- domain-specific terms
- jargon
- neologisms
- erroneous concatenation or tokenization

For technical short text such as maintenance work orders, lexical errors are substantial due to the terse nature of the texts and technical content being conveyed.

![alt text](images/lexnorm.png "Lexical normalisation")

(Figure from van der Goot *et al.*, 2018)

There are two main reasons why applying lexical normalisation to erroneous/noisy texts is advantageous: **improved downstream performance** in NLP tasks (van der Goot *et al.*, 2017) and **improved quality in knowledge representation** (e.g. nodes in a knowledge graph).

The task of lexical normalisation was introduced by Han and Baldwin (2011) and made popular through the 2015 workshop on noisy user-generated text (WNUT) (Baldwin *et al.*, 2015), where the main focus was on social media posts, namely Twitter posts, as they contain significant lexical noise. Since then there has been significant progress on the task. State-of-the-art performance has been achieved via pipelines of ensembles of discriminators (neural, embeddings) (van der Goot *et al.*, 2017), while statistical/neural token-classification models (Stewart *et al.*, 2018; Muller *et al.*, 2019) and sequence-to-sequence models (Lourentzou *et al.*, 2019; Nguyen and Sandro, 2020) have also demonstrated promising performance in various contexts. Recently, state-of-the-art performance on multilingual lexical normalisation has been achieved by fine-tuning the ByT5 transformer (Samuel and Milan, 2021; van der Goot *et al.*, 2021).


For low-resource technical texts like maintenance work orders, training strategies employed for general domain texts are challenging to adopt as parallel corpora (noisy/clean) are scarce.


## 4.2. Running lexical normalisation over our data

In the interest of time/simplicity we are not going to use a neural model here, but instead we will use a simple lexicon-based normaliser. This model will simply replace a misspelled phrase with its correct form. This is not practical in the real world (as there's no way we could possibly build a lexicon of all possible misspellings) but it is good enough for our small example.

The following code imports our `LexiconNormaliser` model. This model simply replaces any predefined terms with their replacements, e.g. "puump" will be normalised to "pump", and so on.

If you're interested in seeing this lexicon, it's available under [data/lexicon_normalisation.csv](files/data/lexicon_normalisation.csv).

In [13]:
from helpers import LexiconNormaliser
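
If you are curious how such a normaliser might work internally, here is a minimal sketch. It is a hypothetical stand-in, not the actual `helpers.LexiconNormaliser` (whose implementation may differ): we assume a two-column lexicon of noisy form and canonical form, and for simplicity we only replace single tokens rather than multi-word phrases.

```python
import csv
import io


class SimpleLexiconNormaliser:
    """Hypothetical token-level normaliser; the real helper may differ."""

    def __init__(self, lexicon: dict):
        # Maps a noisy token to its canonical replacement
        self.lexicon = lexicon

    @classmethod
    def from_csv(cls, file_obj):
        # Assumed layout: one "noisy,canonical" pair per row
        rows = csv.reader(file_obj)
        return cls({row[0]: row[1] for row in rows if len(row) >= 2})

    def normalise(self, text: str) -> str:
        # Replace each token found in the lexicon; leave the rest untouched
        return " ".join(self.lexicon.get(tok, tok) for tok in text.split())


# A tiny in-memory lexicon, invented for this example
lexicon_csv = io.StringIO("puump,pump\nu/s,unserviceable\na/c,air conditioner\n")
sketch_normaliser = SimpleLexiconNormaliser.from_csv(lexicon_csv)

print(sketch_normaliser.normalise("break out u/s"))
```

Because this is a plain dictionary lookup, any misspelling that is missing from the lexicon passes through unchanged, which is why this approach does not scale to real-world data.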

Now that the LexiconNormaliser has been imported, let's run it over all of the ShortText fields in our dataset.

In [14]:
lexicon_file = "data/lexicon_normalisation.csv"
normaliser = LexiconNormaliser(lexicon_file)

work_order_data = load_csv(work_order_file)

for i, row in enumerate(work_order_data):
    before = row['ShortText']    
    row['ShortText'] = normaliser.normalise(row['ShortText'])
    
    # Let's print five rows (indices 24 to 28) to have a look at the difference
    if 23 < i <= 28:
        print(before)
        print(row['ShortText'])
        print()
    

book out filters
book out filters

both rear jacks leaking oil
both rear jacks leak oil

break out u/s
break out unserviceable

breakout fork cylinder - rework
breakout fork cylinder - rework

broken control connection
breakdown control connection



# 5. Named Entity Recognition

The next task is to extract the entities in the short text descriptions and construct nodes from those entities. This is how we are able to unlock the knowledge captured within the short text and combine it with the structured fields.

![alt text](images/extracting-entities-v2.png "Extracting entities")


## 5.1. Overview of Named Entity Recognition

Named Entity Recognition (NER) is a long-standing NLP task. Introduced in the 6th Message Understanding Conference (MUC-6) in 1996, the goal of NER is to label every *named entity* in a corpus with its corresponding type. NER is a **sequence labelling** task.

The choice of entity class depends on the corpus. NER on natural language corpora typically uses four classes (`Person`, `Organisation`, `Location`, `Miscellaneous`), though there are many schemas available. In our application, we will use the following three classes:

- **`Item`**: A maintainable item such as "exhaust".
- **`Activity`**: A maintenance activity performed on an item, such as "replace".
- **`Observation`**: An observation on an Item, such as "lagging".

![Example tagged sentence](images/tagged_sentence.png)

### 5.1.1. State-of-the-art

The current state-of-the-art in NER makes use of deep learning models trained on copious amounts of training data. Here are three of the state-of-the-art models for NER:

- [**A Unified Generative Framework for Various NER Subtasks**](https://arxiv.org/pdf/2106.01223): Encodes input sequences into a fixed content space and then autoregressively decodes into an output sequence.
- [**LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention**](https://aclanthology.org/2020.emnlp-main.523/): A transformer-based NER model based on the BERT language model.
- [**Contextual String Embeddings for Sequence Labelling (Flair)**](https://aclanthology.org/C18-1139.pdf): A character-level language model feeds into a Bi-LSTM sequence labelling model to predict the entity type of each word.

For a detailed list of SOTA NER models feel free to check out [NLP Progress](http://nlpprogress.com/english/named_entity_recognition.html).

#### Flair

We will be using Flair to perform Named Entity Recognition. Flair combines a character-level language model with a sequence labelling model, and exists as a simple-to-use library on GitHub. A diagram of the model is shown below, taken from the ["Contextual String Embeddings for Sequence Labelling"](https://aclanthology.org/C18-1139.pdf) paper.

![alt text](images/flair_model.png "A diagram of the Flair model")

You can learn more about Flair via the papers ([here](https://aclanthology.org/N19-4010/) and [here](https://aclanthology.org/C18-1139.pdf)), or check out the [GitHub repository](https://github.com/flairNLP/flair).

## 5.2. Loading and inspecting the data

The dataset to train and evaluate the NER model is split into three files:

- `ner_dataset/train.txt`: The dataset we will use to *train* the NER model to predict the entities appearing in each work order.
- `ner_dataset/dev.txt`: The dataset we will use to *validate* the quality of the model during training.
- `ner_dataset/test.txt`: The dataset we will use to *evaluate* the final performance of the NER model after training.

Let's start by defining some functions for loading the CONLL-formatted data. The CONLL format is a widely used format for Named Entity Recognition, and looks like this:

    Michael B-PER
    works O
    at O
    The B-ORG
    University I-ORG
    of I-ORG
    Western I-ORG
    Australia I-ORG
    

Note the **BIO** format being used here ("beginning", "inside", "outside"). **B** denotes that the word is the start of an entity, **I** denotes that the word is inside an entity, and **O** denotes that a word is not an entity.

It's a bit tricky to work with it in this format, so we are going to define some functions to parse it into something like this:

    { tokens: ['Michael', 'works', 'at', 'The', 'University', 'of', 'Western', 'Australia'],
      labels: ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG'] }
      
Note that many NLP libraries also have this functionality (NLTK for example) - but we will do it in pure Python in the interest of keeping our dependencies minimal.
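
Once a document is in this tokens/labels form, it is straightforward to group the BIO labels back into whole entities. The helper below is a sketch of our own (the name `bio_to_spans` is not part of this notebook's helpers), shown here to make the BIO scheme concrete.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into a list of (entity text, entity type)."""
    spans = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        # An "I-" label of the same type continues the current entity
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Anything else closes the entity currently being built (if any)
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A "B-" label (or a stray "I-" label) starts a new entity
        if label != "O":
            current_tokens, current_type = [token], label[2:]
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


example = {
    "tokens": ["Michael", "works", "at", "The", "University", "of", "Western", "Australia"],
    "labels": ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG"],
}
print(bio_to_spans(example["tokens"], example["labels"]))
```

The **B** prefix is what lets us tell two adjacent entities of the same type apart, rather than merging them into one long entity.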

In [15]:
import os

def to_conll_document(s: str):
    """Parse a CONLL-formatted document into a dictionary of
    tokens and labels.

    Args:
        s (str): A string, separated by newlines, where each
        line is a token, then a space, then a label.

    Returns:
        dict: A dict of tokens and labels.
    """
    tokens, labels = [], []
    for line in s.split("\n"):
        if len(line.strip()) == 0:
            continue
        token, label = line.split()

        tokens.append(token)
        labels.append(label)
    return {'tokens': tokens, 'labels': labels}


def load_conll_dataset(filename: str) -> list:
    """Load a list of documents from the given CONLL-formatted dataset.

    Args:
        filename (str): The filename to load from.

    Returns:
        list: A list of documents, where each document is a dict of tokens and labels.
    """
    documents = []
    with open(filename, "r") as f:
        docs = f.read().split("\n\n")
        for d in docs:
            if len(d) == 0:
                continue
            document = to_conll_document(d)
            documents.append(document)
    print(f"Loaded {len(documents)} documents from {filename}.")
    return documents



Let's take a quick look at the first row of our training dataset to make sure it loads OK:

In [16]:
NER_DATASET_PATH = "data/ner_dataset"
train_dataset = load_conll_dataset(os.path.join(NER_DATASET_PATH, 'train.txt'))

print(train_dataset[0])

Loaded 3201 documents from data/ner_dataset\train.txt.
{'tokens': ['ram', 'on', 'cup', 'rod', 'support', 'broken'], 'labels': ['B-Item', 'O', 'B-Item', 'B-Item', 'I-Item', 'B-Observation']}


## 5.3. Define an abstract base class for NER Models

Seeing as we would like to be able to work with a range of NER models, it's a good idea to create an 'abstract base class' to represent an NER model. This way, we can create classes for our NER models that inherit from this base class. Every model we create must have these three functions:

- `train`: Train the model on the datasets in the given path.
- `inference`: Run inference over the given sentence.
- `load`: Load the model from the given path.

If we try to create an NER model that does not have one of these functions, it will raise an error.

In [17]:
""" Abstract base class for the NER Model. """

from abc import ABC, abstractmethod


class NERModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def train(self, datasets_path: str):
        pass

    @abstractmethod
    def inference(self, sent: list):
        pass

    @abstractmethod
    def load(self, model_path):
        pass
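
To see the abstract base class doing its job, consider the small sketch below. The base class is repeated so that the snippet runs on its own, and `IncompleteNERModel` is a hypothetical class that we have written to deliberately omit `load`. Note that Python raises the error when we try to *instantiate* the class, not when we define it.

```python
from abc import ABC, abstractmethod


class NERModel(ABC):
    """Same interface as above, repeated so this snippet is self-contained."""

    @abstractmethod
    def train(self, datasets_path: str):
        pass

    @abstractmethod
    def inference(self, sent: list):
        pass

    @abstractmethod
    def load(self, model_path):
        pass


class IncompleteNERModel(NERModel):
    """Hypothetical model that forgets to implement load()."""

    def train(self, datasets_path: str):
        pass

    def inference(self, sent: list):
        return {"tokens": sent, "labels": ["O"] * len(sent)}


try:
    IncompleteNERModel()
    error_message = None
except TypeError as err:
    # Python refuses to instantiate a class with unimplemented abstract methods
    error_message = str(err)

print(error_message)
```

The message names the missing method, which makes it easy to spot what still needs to be implemented.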


## 5.4. Define our NER models

#### 5.4.1. Flair-based NER Model

In this tutorial we will use [Flair](https://github.com/flairNLP/flair), which simplifies the process of building a deep learning model for a variety of NLP tasks.

The code below is a class representing a `FlairNERModel`, which is based on the `NERModel` class above. It has the same three methods, i.e. `train()`, `inference()`, and `load()`.

In [18]:
"""A Flair-based Named Entity Recognition model. Learns to predict entity
classes via deep learning."""

import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from typing import List
from flair.visual.training_curves import Plotter
import torch


HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairNERModel(NERModel):
    """A Flair-based Named Entity Recognition model."""

    model_name: str = "Flair"

    def __init__(self):
        super(FlairNERModel, self).__init__()

        self.model = None

    def train(self, datasets_path: str, trained_model_path: str):
        """ Train the Flair model on the given conll datasets.

        Args:
            datasets_path (str): The folder containing the
              train, dev and test CONLL-formatted datasets.
            trained_model_path (str): The folder to save the trained
              model to.
        """

        columns = {0: "text", 1: "ner"}
        corpus: Corpus = ColumnCorpus(
            datasets_path,
            columns,
            train_file="train.txt",
            dev_file="dev.txt",
            test_file="test.txt",
        )
        label_dict = corpus.make_label_dictionary(label_type="ner")

        # Train the sequence tagger
        embedding_types = [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]

        embeddings = StackedEmbeddings(embeddings=embedding_types)

        tagger = SequenceTagger(
            hidden_size=HIDDEN_SIZE,
            embeddings=embeddings,
            tag_dictionary=label_dict,
            tag_type="ner",
            use_crf=True,
        )

        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=10,
            embeddings_storage_mode=sm,
        )

        plotter = Plotter()
        plotter.plot_weights(os.path.join(trained_model_path, "weights.txt"))

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def inference(self, sent: list) -> dict:
        """Run inference over a single tokenised short text.

        Args:
            sent (list): The sentence (list of words).

        Returns:
            dict: The tagged sentence now in the form of {'tokens': [list],
                'labels': [list]}.

        Raises:
            ValueError: If the model has not yet been trained or loaded.
        """
        if self.model is None:
            raise ValueError(
                "The NER Model has not yet been trained. "
                "Please train/load this Flair model before proceeding."
            )

        # The input is already tokenised, so Flair's tokeniser is disabled
        sentence_obj = Sentence(sent, use_tokenizer=False)
        self.model.predict(sentence_obj)
        labels = ["O"] * len(sent)

        for entity in sentence_obj.get_spans("ner"):
            for i, token in enumerate(entity):
                label = entity.get_label("ner").value
                prefix = "B-" if i == 0 else "I-"
                
                # Token idx starts from 1 in Flair.
                labels[token.idx - 1] = prefix + label

        return { 'tokens': sent, 'labels': labels }

    def load(self, model_path: str):
        """Load the model from the specified path.

        Args:
            model_path (os.path): The path to load.

        Raises:
            ValueError: If the path does not exist i.e. model not yet trained.
        """
        self.model = SequenceTagger.load(model_path)

Device: cuda:0


### 5.4.2. Dictionary-based NER model

If you are not able to use the Flair library, here is a simple model you can use to extract the entities, albeit with a much weaker performance. This one scans the training data, builds a mapping between each phrase (one or more tokens in a row) and the most common entity type associated with that phrase, then uses that entity type as the prediction when seeing that token in the test data.

The model is super simple, so we won't show the code here, but feel free to have a look under `helpers/DictionaryNERModel.py` if you are interested.
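
For intuition, the idea can be sketched in a few lines. This is a minimal illustration and not the code in `helpers/DictionaryNERModel.py`: the function names, the `(phrase, entity_type)` input format, and the greedy longest-match lookup are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_phrase_dictionary(tagged_docs):
    """Map each phrase to the entity type it is most often labelled with.

    tagged_docs is assumed to be a list of documents, where each document
    is a list of (phrase, entity_type) pairs taken from the training data.
    """
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for phrase, entity_type in doc:
            counts[phrase][entity_type] += 1
    return {phrase: c.most_common(1)[0][0] for phrase, c in counts.items()}


def tag_tokens(tokens, phrase_dict, max_len=3):
    """Label tokens in BIO format using a greedy longest-match lookup."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_dict:
                entity_type = phrase_dict[phrase]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                break
        else:
            # No phrase starting at this token is in the dictionary.
            i += 1
    return labels


training_docs = [
    [("rotary head", "Item"), ("leak", "Observation")],
    [("leak", "Observation")],
    [("leak", "Item")],
]
phrase_dict = build_phrase_dictionary(training_docs)
predicted = tag_tokens(["air", "leak", "around", "rotary", "head"], phrase_dict)
```

Because the lookup relies purely on phrases seen during training, any unseen phrase is labelled "O", which is one reason this kind of model performs worse than the Flair model.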

In [17]:
from helpers import DictionaryNERModel

## 5.5. Training/loading the models

### 5.5.1. Loading the pre-trained Flair model

Training a deep learning model takes time, especially without a GPU. We have trained the model on the NER dataset already, and have uploaded the model to HuggingFace. You can simply load the model via the code below.

In [19]:
flair_ner_model = FlairNERModel()
flair_ner_model.load('nlp-tlp/mwo-ner')



2022-11-29 21:07:54,871 loading file C:\Users\micha\.flair\models\mwo-ner\e6f3ee401bab0df2bf0ef5b63f59f57b17b631d5eb531f336980540ed4910d45.8d1c1ab84b25bb14ce63a1634124d5da5c29c553c0776cfa3214bf6a47519b55
2022-11-29 21:07:55,435 SequenceTagger predicts: Dictionary with 15 tags: O, S-Item, B-Item, E-Item, I-Item, S-Activity, B-Activity, E-Activity, I-Activity, S-Observation, B-Observation, E-Observation, I-Observation, <START>, <STOP>


### 5.5.2. Training the Flair model from scratch

If you wish to train the model yourself (not recommended unless you have a GPU with CUDA enabled), the code for doing that is as follows:

    flair_ner_model = FlairNERModel()
    flair_ner_model.train(NER_DATASET_PATH, 'models/ner_models/flair')

### 5.5.3. "Training" the DictionaryNERModel

The code below "trains" the dictionary-based model. We put "trains" in quotes here because it is not really training at all, but rather building a dictionary mapping terms to their corresponding entity categories.

Feel free to run this code instead of the Flair code if you do not have Flair installed.

In [18]:
from helpers import DictionaryNERModel

NER_DATASET_PATH = "data/ner_dataset"

dictionary_ner_model = DictionaryNERModel()
dictionary_ner_model.train(NER_DATASET_PATH, 'models/ner_models/dictionary')

Building dictionary...
Loaded 3201 documents from data/ner_dataset\train.txt.
Loaded 402 documents from data/ner_dataset\dev.txt.


## 5.6. Running inference on unseen sentences

The next step is to use our trained model to predict the entity type of each entity appearing in previously unseen sentences.

In [20]:
tagged_bio_sents = []

# To use the dictionary model instead, uncomment the first line below and comment out the second.
#model = dictionary_ner_model
model = flair_ner_model

for row in work_order_data:
    sentence = row["ShortText"].split() # We must 'tokenise' the sentence first, i.e. split into words
    tagged_sent = model.inference(sentence)
    tagged_bio_sents.append(tagged_sent)

# Print an example tagged sentence
print(tagged_bio_sents[15])

{'tokens': ['air', 'leak', 'around', 'rotary', 'head'], 'labels': ['O', 'B-Observation', 'O', 'B-Item', 'I-Item']}


# 6. Extracting relations between the entities via Relation Extraction

We have extracted the entities appearing in each work order. The next step is to extract the relationships between those entities. We can do this using Relation Extraction.

![alt text](images/building-relations.png "Building relations")

## 6.1. Overview of Relation Extraction

**Relation extraction** is the process of determining the relationships between entities in text. It plays a pivotal role in knowledge graph construction as it provides the relation component of each triple.
 
For our application we are going to use the following relationships:

 - **HAS_OBSERVATION**: A relationship between an `Item` and an `Observation`.
 - **HAS_ACTIVITY**: A relationship between an `Item` and an `Activity`.
 - **APPEARS_WITH**: A relationship between an `Item` and another `Item`.
 
Note that this is just one way to represent relationships in the data. It is an example of **co-occurrence relations**, i.e. the relationships are formed between entities that co-occur in the same document. We are modelling them this way in light of the types of queries we want to be able to write (e.g. "what are all failure modes observed on pumps"), but the code here will be applicable to any relation types.

### 6.1.1. Historical approaches to RE

Previous approaches to Relation Extraction treated the task as a **text classification** task, i.e. given two entities, determine the relationship between them. This can be seen in, for example, the [SemEval 2010 Task 8](https://aclanthology.org/S10-1006/). Many of these methods were **feature extraction-based**, making use of a range of different features such as the local context, the part of speech (noun, verb, etc), syntactic patterns, and so on.

### 6.1.2. State-of-the-art

The current state-of-the-art (SOTA) relation extraction models are predominantly deep learning based. Many of the most recent RE models perform both NER and RE simultaneously, and thus "relation extraction" is becoming less popular as a task on its own. These models perform NER + RE jointly by taking a sentence as input and producing a sequence that looks like a marked-up list of triples (such as `<triplet>UWA<subj>Michael<obj>works at`). Examples of these models include:

 - [**DeepStruct: Pretraining of Language Models for Structure Prediction**](https://arxiv.org/pdf/2205.10475): A sequence-labelling based method that uses structure pretraining.
 - [**REBEL: Relation extraction by end-to-end language generation.**](https://aclanthology.org/2021.findings-emnlp.204/): An autoregressive sequence to sequence model.
 - [**Span-based joint entity and relation extraction with transformer pre-training (SpERT)**](https://arxiv.org/pdf/1909.07755): A joint NER + Relation extraction model based on the BERT language model.
 - [**GenIE: generative information extraction**](https://arxiv.org/pdf/2112.08340): A generative model that makes use of an existing knowledge base to constrain the types of relations produced based on the context.

For a list of RE models feel free to check out [Awesome Relation Extraction](https://github.com/roomylee/awesome-relation-extraction) or [NLP Progress](https://nlpprogress.com/english/relationship_extraction.html).
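
To make the marked-up output format more concrete, here is a small sketch that splits such a sequence back into its parts. It assumes the tag layout of the example string above (`<triplet>`, `<subj>`, `<obj>`); the function name is hypothetical, and the exact format produced by any particular model may differ.

```python
import re

# One triple: text after <triplet>, text after <subj>, text after <obj>.
TRIPLET_PATTERN = re.compile(
    r"<triplet>(.*?)<subj>(.*?)<obj>(.*?)(?=<triplet>|$)"
)


def parse_linearised_triples(sequence):
    """Return a list of (first_entity, second_entity, relation) tuples."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TRIPLET_PATTERN.finditer(sequence)
    ]


triples = parse_linearised_triples("<triplet>UWA<subj>Michael<obj>works at")
```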

#### Flair

In the interest of simplicity and explainability, we will be performing RE separately from NER, rather than making use of these joint NER + RE models.

We are thus once again going to use Flair, this time to perform Relation Extraction. As of the time of writing this notebook, Flair does have support for RE, but not in the latest branch. We will thus be using Flair's Text Classification model to devise our own method for running relation extraction.

Our Relation Extraction task will be as follows: *given a row of (entity_1, entity_2, label_1, label_2, context), determine the type of relationship*. All components of the row are concatenated into a single sequence which is then fed through the model to produce the predicted label.

Note how this is a **text classification** task, i.e. one sequence mapped to one output label.
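
As a rough sketch of the concatenation step, the first five columns of a row can be joined into one string before being passed to the classifier. The helper name `row_to_sequence` is hypothetical; the join itself mirrors what the Flair-based model defined later in this notebook does in its `inference` method.

```python
def row_to_sequence(row):
    """Join entity_1, entity_2, label_1, label_2 and context into one string."""
    return " ".join(str(column) for column in row[:5])


example_row = [
    "rod support", "broken", "Item", "Observation",
    "rod support broken", "1", "0", None,
]
sequence = row_to_sequence(example_row)
```

The mention indices and the relation column are left out of the sequence, so the classifier only sees the two entities, their labels, and the surrounding text.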

## 6.2. Loading and inspecting the data

Let's take a look again at the RE dataset we are working with.

In [21]:
import os
import json

RE_DATASET_PATH = "data/re_dataset"


def load_re_dataset(filename: str) -> list:
    """Load the Relation Extraction dataset into a list.
        
    Args:
        filename (str): The name of the file to load.
    """
    re_data = []
    with open(filename, 'r') as f:
        for row in f:
            re_data.append(row.strip().split(','))
    return re_data

train_dataset = load_re_dataset(os.path.join(RE_DATASET_PATH, 'train.csv'))

# Let's take a quick look at a few rows
for row in train_dataset[:3]:
    print(row)


['broken', 'rod support', 'Observation', 'Item', 'rod support broken', '0', '1', 'O']
['rod support', 'broken', 'Item', 'Observation', 'rod support broken', '1', '0', 'HAS_OBSERVATION']
['broken', 'cup', 'Observation', 'Item', 'cup rod support broken', '0', '2', 'O']


We can interpret this as follows:
 - 'broken': entity 1
 - 'rod support': entity 2
 - 'Observation': label of entity 1
 - 'Item': label of entity 2
 - 'rod support broken': The text between 'broken' and 'rod support', inclusive
 - '0': The mention index of entity 1
 - '1': The mention index of entity 2
 - 'O': The relation type. "O" means no relation.
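
If you prefer to work with named fields rather than list positions, a row can be unpacked into a small structure. This is only a convenience sketch: `RERow` and `parse_re_row` are hypothetical names that are not used elsewhere in this notebook.

```python
from typing import NamedTuple


class RERow(NamedTuple):
    entity_1: str
    entity_2: str
    label_1: str
    label_2: str
    context: str
    mention_idx_1: int
    mention_idx_2: int
    relation: str


def parse_re_row(row):
    """Convert a raw 8-column row (all strings) into an RERow."""
    return RERow(
        row[0], row[1], row[2], row[3], row[4],
        int(row[5]), int(row[6]), row[7],
    )


parsed = parse_re_row(
    ["broken", "rod support", "Observation", "Item",
     "rod support broken", "0", "1", "O"]
)
```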

## 6.3. Define the Abstract Base Class

We are going to see two different RE models, so let's define an abstract base class again just like we did for the NER models. Just like the NER model, we have three functions:

- `inference`: Given a row (as above, but without the last column), predict the relation type ("O" if no relation).
- `train`: Train the model on the files in the given dataset path.
- `load`: Load the model from the specified path.

In [22]:
from abc import ABC, abstractmethod

class REModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def inference(self, row: list) -> str:
        pass        

    @abstractmethod
    def train(self, re_datasets_path: str):
        pass

    @abstractmethod
    def load(self, model_path: str):
        pass

## 6.4. Define our RE model(s)

As with the NER models, we will define one deep learning-based model (Flair) and one simple alternative.

### 6.4.1. Flair-based RE model

In [46]:
""" A Flair-based relation extraction model.
This one uses Flair's TextClassifier model to classify the
relation type of a given row.
"""

import os
import json
from typing import List

import flair
from flair.trainers import ModelTrainer
from flair.datasets import CSVClassificationCorpus
from flair.embeddings import (
    PooledFlairEmbeddings,
    DocumentRNNEmbeddings,
)
from flair.data import Sentence
from flair.models import TextClassifier, SequenceTagger
from flair.visual.training_curves import Plotter

from huggingface_hub import hf_hub_download

import torch

MAX_EPOCHS = 1
HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairREModel(REModel):

    """The Flair-based RE model."""

    model_name: str = "Flair"

    def __init__(self):
        super(FlairREModel, self).__init__()
        self.model = None

    def train(self, datasets_path: os.path, trained_model_path: os.path):
        """Train the Flair RE model on the given CSV datasets.

        Args:
            datasets_path (os.path): The path containing the train and dev
               datasets.
            trained_model_path (os.path): The path to save the trained model.
        """

        column_name_map = {
            0: "text",
            1: "text",
            2: "text",
            3: "text",
            4: "text",
            7: "label_relation",
        }

        # Define corpus, labels, word embeddings, doc embeddings
        corpus = CSVClassificationCorpus(
            datasets_path,
            column_name_map,
            delimiter=",",
            label_type="relation",
        )

        label_dict = corpus.make_label_dictionary(label_type="relation")

        word_embeddings = [
            PooledFlairEmbeddings("mix-forward"),
            PooledFlairEmbeddings("mix-backward"),
        ]

        document_embeddings = DocumentRNNEmbeddings(
            word_embeddings, hidden_size=HIDDEN_SIZE
        )

        # Initialise the text classifier
        tagger = TextClassifier(
            document_embeddings,
            label_dictionary=label_dict,
            label_type="relation",
        )

        # Initialize trainer
        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        
        # Start training
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=MAX_EPOCHS,
            patience=3,
            embeddings_storage_mode=sm,
        )

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def load(self, model_path: str):
        """Load the model from the given path.

        Args:
            model_path (str): The filename containing the model.
               Can also be the name of a repo on Huggingface.
        """
        
        # We need to hard-code this one because TextClassifier doesn't
        # yet support HuggingFace.
        if model_path == "nlp-tlp/mwo-re":
            model_path = hf_hub_download(
                repo_id="nlp-tlp/mwo-re",
                filename="pytorch_model.bin",
                cache_dir=flair.cache_root / "models" / "mwo-re"
            )
        
        self.model = TextClassifier.load(model_path)

    def inference(self, row: list) -> str:
        """Run the inference over the given document.

        Args:
            row (list): The row to predict the relation of.

        Returns:
            str: The relation type.
        """
        
        s = Sentence(" ".join(row[:5]))
        label = "O"
        self.model.predict(s)
        if len(s.labels) > 0:
            label = str(s.labels[0].value)
        return label


Device: cuda:0


### 6.4.2. 'SimpleMWO' RE model

Because maintenance work orders are very short (5-7 words typically), generally speaking we can create a useful knowledge graph by simply linking each Item entity in the work order to each other entity in that work order. For example:

    replace pump
    
We can say the "pump" entity `HAS_ACTIVITY` "replace". Likewise for the following:

    fix air conditioner , not working
    
We can say that "air conditioner" `HAS_ACTIVITY` "fix", and `HAS_OBSERVATION` "not working".

This is not a foolproof method, though - it is a heuristic, i.e. a rule-based method designed to exploit a pattern in the data. For creating this specific type of knowledge graph, though, it works quite well, and thus we can define a model to use this heuristic as a weaker alternative to a deep learning model.

Just like the dictionary-based NER model, the model is super simple, so we won't show the code here, but feel free to have a look under `helpers/SimpleMWOREModel.py` if you are interested.
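
To give a feel for the heuristic, here is a minimal sketch based on the three relation types defined in Section 6.1. It is not the code in `helpers/SimpleMWOREModel.py`: the function name and the rule that entity 1 must be an `Item` are assumptions made for this example.

```python
# Relation type chosen by the label of the second entity,
# assuming the first entity is an Item.
RELATION_BY_SECOND_LABEL = {
    "Observation": "HAS_OBSERVATION",
    "Activity": "HAS_ACTIVITY",
    "Item": "APPEARS_WITH",
}


def simple_relation(row):
    """Predict a relation type from the two entity labels alone."""
    label_1, label_2 = row[2], row[3]
    if label_1 != "Item":
        return "O"
    return RELATION_BY_SECOND_LABEL.get(label_2, "O")


prediction = simple_relation(
    ["rod support", "broken", "Item", "Observation",
     "rod support broken", "1", "0", None]
)
```

Note that a rule like this ignores the text entirely, so it cannot tell apart two Items in the same work order that are genuinely unrelated.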

In [34]:
from helpers import SimpleMWOREModel

Here's an example output from the model. Note we have set the last column (i.e. the relation type) to `None`, as it is our model's job to predict that column:

In [35]:
r = SimpleMWOREModel()

r.inference([
  "rod support",
  "broken",
  "Item",
  "Observation",
  "rod support broken",
  "1",
  "0",
  None
 ])

'HAS_OBSERVATION'

## 6.5. Train the model/load the pretrained model

Let's load the pretrained Flair RE model from HuggingFace.

Alternatively, you can train it yourself by uncommenting the train line and commenting out the load line.

In [47]:
flair_re_model = FlairREModel()
flair_re_model.load('nlp-tlp/mwo-re')

#flair_re_model.train(RE_DATASET_PATH, "models/re_models/flair") # Uncomment to train manually

2022-11-29 21:11:54,773 loading file C:\Users\micha\.flair\models\mwo-re\models--nlp-tlp--mwo-re\snapshots\5f8a5e81a96d0c4022097528a90f659755081994\pytorch_model.bin


## 6.6. Inference

Now we have our RE model, the next step is to run inference on the MWO dataset to extract the relationships between the entities.

We need our data to be in the same format as required by the model, i.e. a list of rows whose first five columns are entity 1, entity 2, the label of each entity, and the context text, just like the training data used to train the model.

So before we can run RE, we need to 'wrangle' our data again to get it into the right format.

### 6.6.1. Converting the BIO format to the "Mention"-based format

The BIO-based format from the NER model has one key downside - it is not good for representing 'phrases' of more than one token in length. This makes it difficult to work with for future steps, such as constructing nodes from the entities and running relation extraction. In light of this, we will now convert the BIO-formatted predictions into Mention format, i.e. go from this:

    {'tokens': ['a/c', 'not', 'working'],
     'labels': ['B-Item', 'B-Observation', 'I-Observation']}
    
To this:

    {'tokens': ['a/c', 'not', 'working'],
     'mentions': [
         {'start': 0, 'labels': ['Item'], 'end': 1},
         {'start': 1, 'labels': ['Observation'], 'end': 3}]}
    
Note that this format is also able to now support multiple labels per mention (though we will only be using single labels for simplicity). Researchers use this format for **entity typing**, which is similar to NER but with >= 1 label per mention.

This step is just a bit of data wrangling - here we have defined a helper function to convert a BIO-tagged sentence into a Mention-tagged sentence.

In [37]:
import json

def bio_to_mention(bio_doc: dict):
    """Return a Mention-format representation of a BIO-formatted
    tagged sentence.

    Args:
        bio_doc (dict): The BIO doc to convert to the Mention-based doc.

    Returns:
        dict: A mention-formatted dict created from the bio_doc.
    """
    tokens = bio_doc["tokens"]
    labels = bio_doc["labels"]
    mentions_list = []

    for i, label in enumerate(labels):
        # A mention is "open" once it has started and until its end is recorded.
        is_open = len(mentions_list) > 0 and "end" not in mentions_list[-1]
        if label.startswith("B-"):
            # A new mention closes any mention that is still open.
            if is_open:
                mentions_list[-1]["end"] = i
            mentions_list.append({"start": i, "labels": [label[2:]]})
        elif label == "O" and is_open:
            mentions_list[-1]["end"] = i

    # Close a mention that runs to the end of the sentence.
    if len(mentions_list) > 0 and "end" not in mentions_list[-1]:
        mentions_list[-1]["end"] = len(tokens)

    for m in mentions_list:
        m['phrase'] = " ".join(tokens[m['start']:m['end']])
    return {'tokens': tokens, 'mentions': mentions_list}


# For each BIO tagged sentence in tagged_sents, convert it to the mention-based
# representation
tagged_sents = []
for doc in tagged_bio_sents:
    mention_doc = bio_to_mention(doc)
    tagged_sents.append(mention_doc)

# Let's print our example sentence again, this time with the mention-based
# representation.
# We'll use json.dumps to make it a bit easier to read.
print(json.dumps(tagged_sents[15],indent=1))

{
 "tokens": [
  "air",
  "leak",
  "around",
  "rotary",
  "head"
 ],
 "mentions": [
  {
   "start": 1,
   "labels": [
    "Observation"
   ],
   "end": 2,
   "phrase": "leak"
  },
  {
   "start": 3,
   "labels": [
    "Item"
   ],
   "end": 5,
   "phrase": "rotary head"
  }
 ]
}


Note we have added a "phrase" to each mention. We technically could get this phrase by looking at the list of tokens from the `start` to the `end` of the mention, but storing it inside `mentions` directly makes things easier later on.

### 6.6.2. Building a list of potential relations between entities

Now we have our data in a more amenable format, but we still need tabular data as required by the RE model. To refresh your memory, this is the required format:

 - entity 1
 - entity 2
 - label of entity 1
 - label of entity 2
 - The text between entity 1 and entity 2, inclusive
 - The position of entity 1
 - The position of entity 2
 - The relation type. "O" means no relation.
 
We don't need that last column here as this is what we want our model to predict. We will set it to `None` to denote that no relation has been assigned yet.

We also need to add a new column to represent the document index - we will see why later.

Here is a helper function to transform our mention-based entity format of a single document into a list of potential relationships between each entity and each other entity in that document.

In [41]:
def build_potential_relations(tagged_sents) -> list:
    """Build a list of potential relations, i.e. all possible relationships
    between each entity in each document. The 8th column (which denotes the
    relationship type) will be set to None. The 9th column is the document index.
    
    Args:
        tagged_sents(list): The list of tagged sentences, where each sentence is a
            dict of tokens: [list of tokens] and mentions: [list of mentions].
    
    Returns:
        list: A list of rows, where each row is a potential relationship.
    """

    relations = []
    for doc_idx, doc in enumerate(tagged_sents):
        for m1_idx, mention_1 in enumerate(doc['mentions']):
            entity_1 = " ".join(doc['tokens'][mention_1['start']: mention_1['end']])
            label_1 = mention_1['labels'][0]

            for m2_idx, mention_2 in enumerate(doc['mentions']):
                if m1_idx == m2_idx:
                    continue
                entity_2 = " ".join(doc['tokens'][mention_2['start']: mention_2['end']])
                label_2 = mention_2['labels'][0]
                mention_text = " ".join(doc['tokens'][mention_1['start']:mention_2['end']])

                relations.append(
                    [entity_1, entity_2, label_1, label_2, mention_text, m1_idx, m2_idx, None, doc_idx]
                )
    return relations
            
relations = build_potential_relations(tagged_sents)
print(relations[0])

['lights', 'out on', 'Item', 'Observation', 'lights out on', 0, 1, None, 1]
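
The nested loops above visit every ordered pair of distinct mentions, so a document with n mentions produces n(n-1) candidate rows. The same enumeration can be sketched with `itertools.permutations`; the helper name below is hypothetical and is shown only to make the pairing explicit.

```python
from itertools import permutations


def ordered_mention_pairs(num_mentions):
    """Return every (m1_idx, m2_idx) pair with m1_idx != m2_idx."""
    return list(permutations(range(num_mentions), 2))


pairs = ordered_mention_pairs(3)
```

Both directions of each pair are kept because the relations are directed: an `Item` can have an observation, but not the other way around.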


### 6.6.3 Running inference over every row

Now that our data is in the same format that we used to train the RE model, we can run the inference on it.

In [48]:
from helpers import SimpleMWOREModel

def tag_all_relations(relations: list):
    """Run model inference over every potential relation in the list of
    relations.
    
    Args:
        relations(list): The list of (untagged) relations.
        
    Returns:
        tagged_relations(list): The same list, but with the rel_type in the
           8th column.
    
    """
    tagged_relations = []

    for rel in relations:
        tagged_rel = rel[:]
        rel_type = rel_model.inference(rel)
        tagged_rel[7] = rel_type
        tagged_relations.append(tagged_rel)
    return tagged_relations
     
#rel_model = SimpleMWOREModel()
rel_model = flair_re_model
tagged_relations = tag_all_relations(relations)

# Print the first 10 rows
for row in tagged_relations[:10]:
    print(row)
        

['lights', 'out on', 'Item', 'Observation', 'lights out on', 0, 1, 'HAS_OBSERVATION', 1]
['lights', 'machine', 'Item', 'Item', 'lights out on machine', 0, 2, 'APPEARS_WITH', 1]
['out on', 'lights', 'Observation', 'Item', '', 1, 0, 'O', 1]
['out on', 'machine', 'Observation', 'Item', 'out on machine', 1, 2, 'O', 1]
['machine', 'lights', 'Item', 'Item', '', 2, 0, 'O', 1]
['machine', 'out on', 'Item', 'Observation', '', 2, 1, 'HAS_OBSERVATION', 1]
['volt lights', 'out', 'Item', 'Observation', 'volt lights out', 0, 1, 'HAS_OBSERVATION', 2]
['out', 'volt lights', 'Observation', 'Item', '', 1, 0, 'O', 2]
['compressor oil', 'requires top', 'Item', 'Observation', 'compressor oil requires top', 0, 1, 'HAS_OBSERVATION', 3]
['requires top', 'compressor oil', 'Observation', 'Item', '', 1, 0, 'O', 3]


# 7. Combining NER+RE

Now we have outputs from both the NER model and the RE model. The NER model's output looks like this:

    {'tokens': ['a/c', 'not', 'working'],
     'mentions': [
         {'start': 0, 'labels': ['Item'], 'end': 1},
         {'start': 1, 'labels': ['Observation'], 'end': 3}]}

The RE model's output, meanwhile, is shown in the cell above.

The next step is to combine the two outputs. Fortunately we stored the document index in the relations, so we can easily join them up.

Let's add a 'relations' key to this dictionary. It will hold a list capturing the relationships between mentions, e.g.

    'relations': [{'start': 0, 'end': 1, 'type': 'HAS_OBSERVATION'}]
    
... which denotes that mention 0 ('a/c') has the observation of mention 1 ('not working').

In [49]:
for i, sent in enumerate(tagged_sents):
    
    # Note we only care about the relations that do not have the class "O".
    doc_relations = [row for row in tagged_relations if row[7] != "O" and row[8] == i]
    
    sent['relations'] = []    
    for row in doc_relations:
        rel = {'start': row[5], 'end': row[6], 'type': row[7]}     
        sent['relations'].append(rel)

# Let's print an example...
print(json.dumps(tagged_sents[9], indent=1))

{
 "tokens": [
  "ITEM_ID",
  "pre-service",
  "setup"
 ],
 "mentions": [
  {
   "start": 2,
   "labels": [
    "Activity"
   ],
   "end": 3,
   "phrase": "setup"
  }
 ],
 "relations": []
}
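
The loop above rescans the full `tagged_relations` list once per document, which is fine for a dataset of this size. For larger datasets, one alternative is to group the rows by document index in a single pass. The sketch below assumes the same 9-column row layout; `group_relations_by_doc` is a hypothetical helper name.

```python
from collections import defaultdict


def group_relations_by_doc(tagged_relations):
    """Group non-"O" relations by document index (column 9 of each row)."""
    grouped = defaultdict(list)
    for row in tagged_relations:
        if row[7] != "O":
            grouped[row[8]].append(
                {"start": row[5], "end": row[6], "type": row[7]}
            )
    return dict(grouped)


example_rows = [
    ["lights", "out on", "Item", "Observation", "lights out on", 0, 1, "HAS_OBSERVATION", 1],
    ["out on", "lights", "Observation", "Item", "", 1, 0, "O", 1],
    ["volt lights", "out", "Item", "Observation", "volt lights out", 0, 1, "HAS_OBSERVATION", 2],
]
relations_by_doc = group_relations_by_doc(example_rows)
```

Each sentence's relations could then be looked up with `relations_by_doc.get(i, [])`, which returns an empty list for documents that have no relations.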


# 8. Creating the graph

We now have a data structure that stores the tokens, entity mentions, and relationships between those mentions, for each document. The last step is to put it all into a Neo4j graph so that we can query this information.

There are two popular methods for doing this:

- Using `py2neo` to programmatically insert data into Neo4j
- Saving CSVs of your entities and relations, then reading them in via a `LOAD CSV` query in Neo4j

The first option is simple but a bit slow, and the second option is a little more complex but much faster. We will go with the first option here in this notebook for simplicity.
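
For reference, the second option might look roughly like the sketch below. It is an illustration under several assumptions: the file name `entities.csv`, the column names, and the helper `write_entities_csv` are all hypothetical, and the Cypher query stores the entity class as a property rather than as a node label, since plain Cypher cannot set a label from a CSV value.

```python
import csv
import io


def write_entities_csv(tagged_sents, file_handle):
    """Write one CSV row per unique (phrase, entity class) pair.

    Returns the number of entity rows written (excluding the header).
    """
    writer = csv.writer(file_handle)
    writer.writerow(["id", "name", "entity_class"])
    seen = set()
    for sent in tagged_sents:
        for mention in sent["mentions"]:
            entity_class = mention["labels"][0]
            # Phrase and class joined by a double underscore as the node id.
            node_id = f"{mention['phrase']}__{entity_class}"
            if node_id not in seen:
                seen.add(node_id)
                writer.writerow([node_id, mention["phrase"], entity_class])
    return len(seen)


# Illustrative Cypher for reading the file back in. It assumes entities.csv
# has been placed in the Neo4j import directory.
LOAD_ENTITIES_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///entities.csv' AS row
MERGE (e:Entity {_id: row.id})
SET e.name = row.name, e.entity_class = row.entity_class
"""

example_sents = [
    {
        "tokens": ["air", "leak", "around", "rotary", "head"],
        "mentions": [
            {"start": 1, "end": 2, "labels": ["Observation"], "phrase": "leak"},
            {"start": 3, "end": 5, "labels": ["Item"], "phrase": "rotary head"},
        ],
    }
]
buffer = io.StringIO()  # stands in for a real file opened with newline=""
num_entities = write_entities_csv(example_sents, buffer)
```

A similar file and query would be needed for the documents and for the relationships between nodes.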

**If you are using Jupyter Notebook** (i.e. you have cloned the GitHub repository and are running this notebook locally), then you will need to have Neo4j Desktop installed. You can download and install Neo4j from here if you haven't already: https://neo4j.com/download/.

> ⚠️ Before proceeding, make sure you have created a new graph in Neo4j and that your new Neo4j graph is running. You can do this by opening Neo4j Browser, clicking "Add" at the top-right, then creating a new graph database.


**If you are running this notebook in Google Colab**, you will not need to install Neo4j. It should already be installed on the Colab machine, though it is a little temperamental so errors may arise.

If any issues occur, please see the "Installing Neo4j on Google Colab" section in the Appendix.

## Code for building the graph

The following code creates the entire graph using `py2neo`.

In [50]:
from py2neo import Graph
from py2neo.data import Node, Relationship
from dateutil.parser import parse as parse_date

GRAPH_PASSWORD = "password" # Set this to the password of your Neo4j graph


def get_node_id(phrase, entity_class):
    """A simple function to generate an id.
    This ensures an entity that can be different classes (pump for example) can have
    a unique node for each class type.
    """
    return f"{phrase}__{entity_class}"
    
# Dates are a little awkward in Neo4j - we have to convert them to an integer representation in Python.
# The APOC library has functions to handle this better.
def date_to_int(date):
    parsed_date = parse_date(str(date))
    date = int("%s%s%s" % (parsed_date.year, str(parsed_date.month).zfill(2), str(parsed_date.day).zfill(2)))
    return date    
    
def create_graph(tagged_sents):
    """Build the Neo4j graph.
    We do this by iterating over each tagged_sentence, and constructing the
    graph as follows:
     - Create a node to represent the document itself.
     - Create nodes for each entity appearing in that document, if they have not
       already been created. Each unique combination of entity + class will be added, so
       pump (the Item) is different from pump (the Activity).
     - Create a relationship between each entity and each document in which it appears.
     - Create a relationship between each entity and each other entity it is related to,
       via the list of relations.
     
    Args:
        tagged_sents(list): The list of tagged sentences.
    """
    graph = Graph(password = GRAPH_PASSWORD)

    # We will start by deleting all nodes and edges in the current graph.
    # If we don't do this, we will end up with duplicate nodes and edges when running this script again.
    graph.delete_all() 

    tx = graph.begin()
    
    # Keep track of the created entity nodes.
    # We need a way to map the id of the nodes to the py2neo Node objects so that we can
    # easily create relationships between these nodes.
    created_entity_nodes = {}
    
    # Iterate over the list of tagged sentences and programmatically create the graph.
    for i, sent in enumerate(tagged_sents):
        
        # Let's first grab some properties from the original work order dataset.
        # We will add date and cost into our graph on the Document nodes.
        work_order_row = work_order_data[i]
        
        document_properties = {
            'name': " ".join(sent['tokens']),
            'cost': work_order_row['Cost']          
            
        }
        if work_order_row['StartDate'] != "":
            document_properties["date"] = date_to_int(work_order_row['StartDate'])

        
        
        # Create a node to represent the document.
        # Note that if you had additional properties in tagged_sents (such as dates, costs, etc)
        # you could add them as properties of the Document nodes here.
        document_node = Node("Document", **document_properties)
        tx.create(document_node)
        
        tokens = sent['tokens']
        mentions = sent['mentions']
        relations = sent['relations']
        
        for m in mentions:
            start = m['start']
            end = m['end']
            entity_class = m['labels'][0]        
            phrase = " ".join(tokens[start: end])     
                    
            # Create a node for this entity mention.
            # If the node has already been created (i.e. it exists in created_entity_nodes), 
            # simply retrieve that Node from created_entity_nodes.
            # Otherwise, create it, and add it to created_entity_nodes.
            entity_node_id = get_node_id(phrase, entity_class)

            if entity_node_id in created_entity_nodes:
                entity_node = created_entity_nodes[entity_node_id]
            else:
                entity_node = Node("Entity", entity_class, _id=entity_node_id, name=phrase)
                created_entity_nodes[entity_node_id] = entity_node
                tx.create(entity_node)            
                        
                
            # Create a relationship between that node and the document
            # in which it appears.               
            r = Relationship(entity_node, "APPEARS_IN", document_node)
            tx.create(r)
            
        # Create relationships between each (entity_1, entity_2) in the
        # list of relations for this document.
        for rel in relations:
            start = rel['start']
            end = rel['end']
            
            phrase_1 = mentions[start]['phrase']
            entity_class_1 = mentions[start]['labels'][0]
            
            phrase_2 = mentions[end]['phrase']
            entity_class_2 = mentions[end]['labels'][0]
                       
            node_1 = created_entity_nodes[get_node_id(phrase_1, entity_class_1)]
            node_2 = created_entity_nodes[get_node_id(phrase_2, entity_class_2)]
            
            r = Relationship(node_1, rel['type'], node_2)
            tx.create(r)
    tx.commit()

create_graph(tagged_sents)        

# 9. Querying the graph


Now that the graph has been created, we can query it in Neo4j. This section lists some example queries that we can run on our graph. Feel free to try your own queries!

Note we are using `gqvis` to visualise these in Jupyter Notebook. The results will look very similar if you run these queries directly in the Neo4j browser.

*Note about gqvis: gqvis works out of the box in Jupyter Notebook, but to get it working in Jupyter Lab you'll need to install the `jupyterlab_requirejs` extension. See the Appendix section at the bottom of this notebook for more details.*

First, let's try a simple query. Here is a query that searches for __all activities performed on engines__:

In [51]:
import gqvis

gqvis.visualise_cypher("""
MATCH (e:Entity {name: "engine"})-[r:HAS_ACTIVITY]->(a:Activity)
RETURN e, r, a
""")

Now let's try __all failure modes observed on engines__:

In [52]:
gqvis.visualise_cypher("""
MATCH (e:Entity {name: "engine"})-[r:HAS_OBSERVATION]->(o:Observation)
RETURN e, r, o
""")

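The two queries above differ only in the relationship type and the label of the target node, so they could be generated by a small helper. The sketch below is one way to do that. The names `build_item_query` and `TARGET_LABELS` are hypothetical and are not part of the notebook or of `gqvis`. Cypher parameters can stand in for property values such as the item name, but not for labels or relationship types, so those are checked against an allow-list before being interpolated into the query text.

```python
# Hypothetical allow-list: relationship type -> label of the node it points to.
# These are the two relationship types used in the queries above.
TARGET_LABELS = {
    "HAS_ACTIVITY": "Activity",
    "HAS_OBSERVATION": "Observation",
}


def build_item_query(item_name, rel_type):
    """Return (cypher_text, parameters) for one item and one relationship type."""
    if rel_type not in TARGET_LABELS:
        # Relationship types cannot be passed as Cypher parameters, so we
        # refuse anything that is not explicitly allowed.
        raise ValueError(f"Unsupported relationship type: {rel_type!r}")
    target_label = TARGET_LABELS[rel_type]
    query = (
        f"MATCH (e:Entity {{name: $name}})-[r:{rel_type}]->(t:{target_label})\n"
        "RETURN e, r, t"
    )
    # The item name is passed as a parameter rather than pasted into the text.
    return query, {"name": item_name}


query, params = build_item_query("engine", "HAS_OBSERVATION")
print(query)
print(params)
```

With `py2neo`, the returned pair can be passed to `Graph.run(query, params)`, assuming you have a `Graph` object connected to the database.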

We can also use the graph to quickly find the work orders in which a given entity appears. For example, here is a query for __all work orders containing a leak__:

In [45]:
gqvis.visualise_cypher("MATCH (d:Document)<-[a:APPEARS_IN]-(o:Observation {name: 'leak'}) RETURN d, a, o")



We could extend this to also show the items on which the leaks were present:

In [54]:
gqvis.visualise_cypher("""
MATCH (d:Document)<-[a:APPEARS_IN]-(o:Observation {name: "leak"})<-[r:HAS_OBSERVATION]-(e:Entity)
RETURN d, a, o, r, e
LIMIT 50
""")

Another example query could look at the activities that were performed when leaks occurred:

In [53]:
gqvis.visualise_cypher("""
MATCH (a:Activity)-[r1:APPEARS_IN]->(d:Document)<-[r2:APPEARS_IN]-(o:Observation {name: 'leak'})
RETURN a, r1, d, r2, o
""")



We also added **dates** and **costs** into our graph on the `Document` nodes. This means we can query based on these properties.

Here is an example query for __all assets that had breakdowns between 1 January 2005 and 1 January 2007__:

In [58]:
gqvis.visualise_cypher("""
    MATCH (d:Document)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_OBSERVATION]->(o:Observation {name: "breakdown"})-[:APPEARS_IN]->(d)
    WHERE d.date >= 20050101
    AND d.date <= 20070101
    RETURN e, r, o
""")

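The query above compares `d.date` with integers such as `20050101`, which suggests that dates are stored on the `Document` nodes as `YYYYMMDD` integers. Assuming that is the case, the sketch below shows one way to convert between Python `date` objects and that representation. The helper names `to_yyyymmdd` and `from_yyyymmdd` are hypothetical.

```python
from datetime import date


def to_yyyymmdd(d: date) -> int:
    """Encode a date as an integer, e.g. 1 January 2005 -> 20050101."""
    return d.year * 10000 + d.month * 100 + d.day


def from_yyyymmdd(value: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return date(value // 10000, value // 100 % 100, value % 100)


start = to_yyyymmdd(date(2005, 1, 1))
end = to_yyyymmdd(date(2007, 1, 1))
print(start, end)
```

A convenient property of this encoding is that integer order matches chronological order, which is why the `>=` and `<=` comparisons in the query behave as date comparisons.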

# 10. What next?

Feel free to use/adapt any of this code to build your own knowledge graphs. You might like to try running it on your own datasets, or designing your own `NERModel` or `REModel`.

Here are some potential areas to explore after this tutorial.

## 10.1. Improving the lexical normalisation model

We only briefly touched on the lexical normalisation component of Knowledge Graph Construction from Text. There are plenty of neural models for lexical normalisation available that yield much better performance than our lexicon-based tagger.

We have also developed a tool (Lexiclean) to support the rapid creation of training data for lexical normalisation - you can learn about it [here](https://aclanthology.org/2021.emnlp-demo.25/).

## 10.2. Incorporating other structured data into the graph

Graph databases are excellent at bringing together data from a wide range of sources. In a maintenance setting, there are two particular types of structured data that can be easily added to this knowledge graph schema: Downtime events, and Functional Locations.

### Downtime events

A downtime event is a period of time during which an asset is not operational. These events typically have costs and dates associated with them, and can be associated with particular `Item` entities.

By modelling both work orders and downtime events in one graph, we can query the two together. Here is an example query for the __downtime events in July 2005 associated with assets appearing in work orders from 25 to 28 July 2005__:

    MATCH (d:Document)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    AND 20050701 <= x.StartDate <= 20050731
    RETURN e, r, x

We can of course restrict this to specific assets, such as pumps:

    MATCH (d:Document)<-[a:APPEARS_IN]-(e:Entity {name: "pump"})-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    AND 20050701 <= x.StartDate <= 20050731
    RETURN e, r, x

In larger graphs the downtime events could even be further queried based on duration, cost, lost feed, or date ranges.

### Functional Locations (FLOCs)

You may have noticed that our original `work_order_data.csv` has a column called "FLOC". This is the functional location of the asset being maintained. In the maintenance domain, this is often of greater interest to reliability engineers than the individual `Item` entities, and thus it would be ideal to create nodes to represent these functional locations in the graph. This way, we could run queries on the failure modes associated with particular FLOCs.

If you are interested in continuing work on this small graph, the best next step would be to create nodes for the functional location data (`floc_data`) and to link the downtime events to those nodes as opposed to the `Item` nodes.

![alt text](images/adding-flocs.png "Adding FLOCs")

### Asset Hierarchies

Incorporating asset hierarchy taxonomies such as ISO 15926 allows assets to be queried hierarchically, e.g. querying over all rotating equipment. Failure modes can also be grouped into specific failure mode categories in order to improve failure mode queries.
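
To make the idea of a hierarchical query concrete, here is a minimal sketch in plain Python. The taxonomy in `PARENT_OF` is made up for illustration and is not taken from ISO 15926, and the helper names are hypothetical. Each class points to its parent, and `members_of` collects every class that sits anywhere below a given category.

```python
# Illustrative child -> parent mapping (NOT taken from any standard).
PARENT_OF = {
    "pump": "rotating equipment",
    "engine": "rotating equipment",
    "tank": "static equipment",
    "rotating equipment": "equipment",
    "static equipment": "equipment",
}


def ancestors(name, parent_of):
    """Return all categories above `name`, nearest first.

    Assumes the mapping is acyclic, as a taxonomy should be.
    """
    chain = []
    while name in parent_of:
        name = parent_of[name]
        chain.append(name)
    return chain


def members_of(category, parent_of):
    """Return every class that has `category` somewhere above it."""
    return sorted(n for n in parent_of if category in ancestors(n, parent_of))


print(members_of("rotating equipment", PARENT_OF))
```

The resulting list of names could then be supplied to a graph query as a parameter (for example in a `WHERE e.name IN $names` clause), or the hierarchy could be stored in the graph itself as additional nodes and relationships.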

## 10.3. Consolidating the training data for NER+RE

You may have noticed that our training data is split into two parts for the NER and RE tasks, i.e. the NER data is in CoNLL format and the RE data is in tabular format. It is possible to put both the NER and RE training data into a single file, using the mention format we showed previously. For example, each row of your training data could look like this:

    { tokens: [<list of tokens>], mentions: [<list of mentions>], relations: [<list of relations>] }
    
... and you could have a script to 'wrangle' this into the CoNLL and tabular formats before feeding them into the NER and RE models respectively. If you use a tool like [QuickGraph](https://aclanthology.org/2022.acl-demo.27/) your data will be in a similar format to the above.
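
A wrangling script of this kind could look like the sketch below. It assumes the mention format used in `create_graph`: each mention has `start`, `end` (exclusive) and `labels`, and each relation has `start` and `end` indexes into the list of mentions plus a `type`. The BIO tagging scheme and the `(head, tail, type)` row layout are assumptions, and may differ from the exact files used earlier in this notebook. The function names are hypothetical.

```python
def mentions_to_bio(tokens, mentions):
    """Convert mentions into one (token, tag) pair per token, using BIO tags."""
    tags = ["O"] * len(tokens)
    for m in mentions:
        label = m["labels"][0]
        tags[m["start"]] = f"B-{label}"          # first token of the mention
        for i in range(m["start"] + 1, m["end"]):
            tags[i] = f"I-{label}"               # remaining tokens of the mention
    return list(zip(tokens, tags))


def relations_to_rows(tokens, mentions, relations):
    """Convert relations into (head phrase, tail phrase, relation type) rows."""
    rows = []
    for rel in relations:
        head = mentions[rel["start"]]
        tail = mentions[rel["end"]]
        head_phrase = " ".join(tokens[head["start"]:head["end"]])
        tail_phrase = " ".join(tokens[tail["start"]:tail["end"]])
        rows.append((head_phrase, tail_phrase, rel["type"]))
    return rows


doc = {
    "tokens": ["engine", "oil", "leak"],
    "mentions": [
        {"start": 0, "end": 2, "labels": ["Item"]},
        {"start": 2, "end": 3, "labels": ["Observation"]},
    ],
    "relations": [{"start": 0, "end": 1, "type": "HAS_OBSERVATION"}],
}

for token, tag in mentions_to_bio(doc["tokens"], doc["mentions"]):
    print(token, tag)
print(relations_to_rows(doc["tokens"], doc["mentions"], doc["relations"]))
```

Keeping a single source file and deriving both training formats from it means the NER and RE annotations cannot drift out of sync with each other.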

## 10.4. Coreference resolution

What happens if we ask our model to tag the following?

    pump is broken . it needs fixing

The word "it" will most likely not be tagged as an `Item`, as it is not an item at all but a pronoun. Our knowledge graph will therefore not capture the relationship between `pump` and `fixing`.

Coreference resolution is the process of automatically determining which expressions in a text refer to the same entity. In the example above, the word "it" refers to "pump", and thus running coreference resolution on this document would yield:

    pump is broken . pump needs fixing
    
There is not much to gain from doing this on these short text records (as pronouns are few and far between), but on longer text records (such as maintenance long text, safety reports, etc.), coreference resolution is a critical step in the information extraction task.

There are neural models available for coreference resolution, such as `neuralcoref`, an extension for `spaCy` (see [here](https://spacy.io/universe/project/neuralcoref)).
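
To illustrate the effect of coreference resolution without a neural model, here is a deliberately naive sketch. It is a heuristic, not a real coreference resolver: it replaces each pronoun with the most recent preceding `Item` mention, which will be wrong whenever the pronoun refers to something else. The function name `naive_coref` and the default pronoun set are assumptions, and the mentions are expected in the same format used in `create_graph`.

```python
def naive_coref(tokens, mentions, pronouns=frozenset({"it"})):
    """Replace pronouns with the most recent preceding Item mention."""
    # Index Item mentions by the position of their first token.
    item_starts = {
        m["start"]: m for m in mentions if m["labels"][0] == "Item"
    }
    resolved = []
    last_item = None  # tokens of the most recently seen Item mention
    for i, token in enumerate(tokens):
        if i in item_starts:
            m = item_starts[i]
            last_item = tokens[m["start"]:m["end"]]
        if token.lower() in pronouns and last_item is not None:
            resolved.extend(last_item)   # substitute the item for the pronoun
        else:
            resolved.append(token)
    return resolved


tokens = "pump is broken . it needs fixing".split()
mentions = [{"start": 0, "end": 1, "labels": ["Item"]}]
print(" ".join(naive_coref(tokens, mentions)))
```

After this substitution, the second sentence contains an explicit `pump` token that the NER and RE models can tag and link to `fixing`.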


## 10.5. From Pipeline to End-to-End Knowledge Graph Construction from Text

We have presented a "pipeline" for KGC here in this notebook. We wrote the notebook this way in order to be able to discuss each of the components (Lexical Normalisation, Named Entity Recognition and Relation Extraction) in isolation, thus making them easier to understand.

However, the current state of the art in NLP/TLP is moving away from pipeline-based KGC models and towards end-to-end neural models, i.e. a single neural model that performs all of these steps simultaneously. If you are interested in learning about this, you might like to read some of the following papers:

> Stewart, M., & Liu, W. (2020, July). Seq2kg: an end-to-end neural model for domain agnostic knowledge graph (not text graph) construction from text. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (Vol. 17, No. 1, pp. 748-757).

> Eberts, M., & Ulges, A. (2019). Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755.

> Cabot, P. L. H., & Navigli, R. (2021, November). REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2370-2381).

## 10.6. Other related areas

We won't go into detail on them in this notebook, but the following are important areas that are related to knowledge graphs:

- Community detection
- Knowledge graph embeddings
- Reasoning over knowledge graphs
- Ontologies
- RDF schema
- Entity linking (highly applicable to large knowledge graphs, e.g. DBpedia, Freebase, Wikidata)

## Acknowledgment

This work is supported by the Australian Research Council through the Centre for Transforming Maintenance through Data Science (grant number IC180100030), funded by the Australian Government.

## References

- Baldwin, Timothy, Marie-Catherine De Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. "Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition." In Proceedings of the Workshop on Noisy User-generated Text, pp. 126-135. 2015.
- Cabot, Pere-Lluís Huguet, and Roberto Navigli. "REBEL: Relation extraction by end-to-end language generation." In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2370-2381. 2021.
- Eberts, Markus, and Adrian Ulges. "Span-based joint entity and relation extraction with transformer pre-training." arXiv preprint arXiv:1909.07755 (2019).
- Han, Bo, and Timothy Baldwin. "Lexical normalisation of short text messages: Makn sens a# twitter." In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 368-378. 2011.
- Josifoski, Martin, Nicola De Cao, Maxime Peyrard, and Robert West. "GenIE: generative information extraction." arXiv preprint arXiv:2112.08340 (2021).
- Lourentzou, Ismini, Kabir Manghnani, and ChengXiang Zhai. "Adapting sequence to sequence models for text normalization in social media." In Proceedings of the international AAAI conference on web and social media, vol. 13, pp. 335-345. 2019.
- Muller, Benjamin, Benoît Sagot, and Djamé Seddah. "Enhancing BERT for lexical normalization." In The 5th Workshop on Noisy User-generated Text (W-NUT). 2019.
- Nguyen, Hoang, and Sandro Cavallari. "Neural multi-task text normalization and sanitization with pointer-generator." In Proceedings of the First Workshop on Natural Language Interfaces, pp. 37-47. 2020.
- Samuel, David, and Milan Straka. "ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-Tuning ByT5." In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), pp. 483-492. Association for Computational Linguistics, 2021.
- Stewart, Michael, Wei Liu, Rachel Cardell-Oliver, and Rui Wang. "Short-text lexical normalisation on industrial log data." In 2018 IEEE International Conference on Big Knowledge (ICBK), pp. 113-122. IEEE, 2018.
- van der Goot, Rob, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, et al. "MultiLexNorm: A Shared Task on Multilingual Lexical Normalization." In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), pp. 493-509. Association for Computational Linguistics, 2021.
- van der Goot, Rob, and Gertjan van Noord. "Monoise: Modeling noise using a modular normalization system." arXiv preprint arXiv:1710.03476 (2017).
- van der Goot, Rob, Barbara Plank, and Malvina Nissim. "To normalize, or not to normalize: The impact of normalization on part-of-speech tagging." arXiv preprint arXiv:1707.05116 (2017).
- van der Goot, Rob, Rik van Noord, and Gertjan van Noord. "A Taxonomy for In-Depth Evaluation of Normalization for User Generated Content." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018. https://aclanthology.org/L18-1109.
- Yamada, Ikuya, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- Yan, Hang, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. "A Unified Generative Framework for Various NER Subtasks." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5808-5822. 2021.

# Appendix

## GQVis on Jupyter Lab

GQVis works out of the box on Jupyter Notebook (which is recommended), but to get it working in Jupyter Lab, you'll need to run the following command prior to starting Jupyter Lab:

    jupyter labextension install jupyterlab_requirejs

## Installing Neo4j in Google Colab

Neo4j should work in the Google Colab notebook, but I have noticed that it occasionally becomes uninstalled. If that happens, an obvious error message will appear when you attempt to run any of the cells in Section 8 (building the graph) or Section 9 (querying the graph).

To reinstall Neo4j, you can run the following code. After doing this the cells in Section 8 and 9 should work OK again.

In [None]:
!nj-4.4/bin/neo4j stop

# https://gist.github.com/korakot/328aaac51d78e589b4a176228e4bb06f

!curl http://dist.neo4j.org/neo4j-community-4.4.0-unix.tar.gz -o neo4j.tar.gz

# decompress and rename
!tar -xf neo4j.tar.gz  # or --strip-components=1
!mv neo4j-community-4.4.0 nj-4.4

# disable password, and start server
!sed -i '/#dbms.security.auth_enabled/s/^#//g' nj-4.4/conf/neo4j.conf
!nj-4.4/bin/neo4j start

Note that there are other options such as Neo4j Aura (which allows you to run Neo4j in a cloud environment), but `gqvis` is not yet able to interact with it.