# Constructing a Knowledge Graph from Maintenance Work Order Data

In this notebook we are going to construct a simple knowledge graph using Python, and run some queries on the graph in Neo4j. We have broken the notebook into several steps:

- Reading in the data
- Cleaning the data
- Extracting entities via Named Entity Recognition (NER)
- Creating relationships between entities via Relation Extraction (RE)
- Putting it all together and building a Neo4j graph
- Querying the graph in Neo4j


# Installing required packages

To run this notebook you will need to install the following via pip:

- `py2neo`: A library for working with Neo4j in Python.
- `gqvis`: Our simple tool for visualising graph queries in Jupyter.
- `flair`: A deep learning library for natural language processing. Note this library is quite large (a couple gb I believe). If you don't wish to install this, we have provided non deep-learning based alternatives so you can still follow along.

You will also need to have Neo4j installed for the last part of the tutorial. You can download and install Neo4j Desktop [here](https://neo4j.com/).

We will be running through the code during the tutorial so there is no need to install anything unless you would also like to try the code out yourself and run some graph queries.



In [8]:
!pip install py2neo
!pip install gqvis
!pip install flair

Collecting click==7.0 (from py2neo)
  Using cached https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl
Installing collected packages: click
  Found existing installation: click 8.0.2
    Uninstalling click-8.0.2:
      Successfully uninstalled click-8.0.2
Successfully installed click-7.0


ERROR: celery 5.2.1 has requirement click<9.0,>=8.0, but you'll have click 7.0 which is incompatible.
ERROR: black 21.9b0 has requirement click>=7.1.2, but you'll have click 7.0 which is incompatible.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.




You should consider upgrading via the 'python -m pip install --upgrade pip' command.




You should consider upgrading via the 'python -m pip install --upgrade pip' command.


# 1. Read in the data

Here is a description of the datasets we are working with in this notebook.

First of all, the datasets for the NER model:

- `ner_dataset/train.txt`: The dataset we will use to *train* the NER model to predict the entities appearing in each work order.
- `ner_dataset/dev.txt`: The dataset we will use to *validate* the quality of the model during training.
- `ner_dataset/test.txt`: The dataset we will use to *evaluate* the final performance of the NER model after training.

We also have three datasets for the Relation Extraction (RE) model:

- `re_dataset/train.csv`
- `re_dataset/dev.csv`
- `re_dataset/test.csv`

We are going to be building a knowledge graph on a small sample set of work orders. This will not be seen by the NER or RE models prior to constructing the graph - the idea is to get our models to run *inference* over this dataset to automatically predict the entities, and relationships between the entities, to build a graph.

- `sample_work_orders.csv`: A csv file containing a set of work orders.

Here is an example of what the first few rows of each dataset look like:

![alt text](images/example-data.png "Example datasets")

We are using the simple `csv` library to read in the data, though this can also be done using `pandas`.

# Inspecting the data

Let's start by inspecting the `sample_work_orders.csv` CSV dataset. This is the dataset we will be building the graph from.

In [1]:
from csv import DictReader

work_order_file = "data/sample_work_orders.csv"

# A simple function to read in a csv file and return a list,
# where each element in the list is a dictionary of {heading : value}
def load_csv(filename):
    data = []
    with open(filename, 'r') as f:
        reader = DictReader(f)
        for row in reader:
            data.append(row)
    return data

        
work_order_data = load_csv(work_order_file)

for row in work_order_data:
    print(row)

    


OrderedDict([('StartDate', '10/07/2005'), ('FLOC', '1234.1.1'), ('ShortText', 'repair cracked hyd tank')])
OrderedDict([('StartDate', '14/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine wont start')])
OrderedDict([('StartDate', '17/07/2005'), ('FLOC', '1234.1.3'), ('ShortText', 'a/c blowing hot air')])
OrderedDict([('StartDate', '20/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engin u/s')])
OrderedDict([('StartDate', '21/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'fix engine')])
OrderedDict([('StartDate', '22/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'pump service')])
OrderedDict([('StartDate', '23/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'pump leak')])
OrderedDict([('StartDate', '24/07/2005'), ('FLOC', '1234.1.4'), ('ShortText', 'fix leak on pump')])
OrderedDict([('StartDate', '25/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine not running')])
OrderedDict([('StartDate', '26/07/2005'), ('FLOC', '1234.1.2'), ('ShortText', 'engine has problems starting')])

# 2. Cleaning the data

TODO: Probably best to have a simple model for cleaning the data here. I don't think we need to go into too much detail about it, but it would be nice to have this as the first step so things like 'hyd pump' get corrected before running NER/RE. We could mention Lexiclean here

I could just chuck the lexicon cleaner thing from the masterclass here

# 3. Named Entity Recognition

Our first task is to extract the entities in the short text descriptions and construct nodes from those entities. This is how we are able to unlock the knowledge captured within the short text and combine it with the structured fields.

![alt text](images/extracting-entities-v2.png "Extracting entities")

## 3.1. Loading and inspecting the data

Let's start by defining some functions for loading the CONLL-formatted data.

In [2]:
import os

NER_DATASET_PATH = "data/ner_dataset"


def to_conll_document(s: str):
    """Create a ConllDocument from a string as it appears
    in a Conll-formatted file.

    Args:
        s (str): A string, separated by newlines, where each
        line is a token, then a comma and space, then a label.

    Returns:
        dict: A dict of tokens and labels.
    """
    tokens, labels = [], []
    for line in s.split("\n"):
        if len(line.strip()) == 0:
            continue
        token, label = line.split()

        tokens.append(token)
        labels.append(label)
    return {'tokens': tokens, 'labels': labels}


def load_conll_dataset(filename: str) -> list:
    """Load a list of documents from the given CONLL-formatted dataset.

    Args:
        filename (str): The filename to load from.

    Returns:
        list: A list of documents, where each document is a dict of tokens and labels.
    """
    documents = []
    with open(filename, "r") as f:
        docs = f.read().split("\n\n")
        for d in docs:
            if len(d) == 0:
                continue
            document = to_conll_document(d)
            documents.append(document)
    print(f"Loaded {len(documents)} documents from {filename}.")
    return documents



Let's take a quick look at the first row of our training dataset to make sure it loads OK:

In [6]:
train_dataset = load_conll_dataset(os.path.join(NER_DATASET_PATH, 'train.txt'))

print(train_dataset[0])

Loaded 3200 documents from data/ner_dataset\train.txt.
{'tokens': ['ram', 'on', 'cup', 'rod', 'support', 'broken'], 'labels': ['B-Item', 'B-Location', 'B-Item', 'B-Item', 'I-Item', 'B-Observation']}


## 3.2 Define an abstract base class for NER Models

Seeing as we would like to be able to work with a range of NER models, it's a good idea to create an 'abstract base class' to represent an NER model. This way, we can create classes for our NER models that inherit from this base class. Every model we create must have these four functions:

- `train`: Train the model on the datasets in the given path.
- `inference`: Run inference over the given list of sentences.
- `load`: Load the model from the given path.



In [3]:
""" Abstract base class for the NER Model. """

from abc import ABC, abstractmethod


class NERModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def train(self, conll_datasets_path: str):
        pass

    @abstractmethod
    def inference(self, raw_sents: list):
        pass

    @abstractmethod
    def load(self, model_path):
        pass


## 3.3 Define a Flair-based NER Model class

In this tutorial we will use [Flair](https://github.com/flairNLP/flair), which simplifies the process of building a deep learning model for a variety of NLP tasks.

The code below is a class representing a `FlairNERModel`, which is based on the `NERModel` class above. It has the same four methods, i.e `train()`, `inference()`, `load()`, and `save()`.

In [11]:
"""A Flair-based Named Entity Recognition model. Learns to predict entity
classes via deep learning."""


# TODO: Tidy up, fix this code as it does not work atm in this notebook


import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from typing import List
from flair.visual.training_curves import Plotter
import torch


# TODO: Get rid of ConllDataset/ConllDocument and just use lists
from mwo2kg_datasets import (
    ConllDataset,
    ConllDocument,
)

HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairNERModel(NERModel):

    model_name: str = "Flair"

    """A Flair-based Named Entity Recognition model.
    """

    def __init__(self):
        super(FlairNERModel, self).__init__()

        self.model = None

    def train(self, datasets_path: os.path, trained_model_path: os.path):
        """ Train the Flair model on the given conll datasets.

        Args:
            datasets_path (os.path): The folder containing the
              train, dev and text CONLL-formatted datasets.
            trained_model_path (os.path): The folder to save the trained
              model to.
        """

        columns = {0: "text", 1: "ner"}
        corpus: Corpus = ColumnCorpus(
            datasets_path,
            columns,
            train_file="train.txt",
            dev_file="dev.txt",
            test_file="test.txt",
        )
        label_dict = corpus.make_label_dictionary(label_type="ner")

        # Train the sequence tagger
        embedding_types = [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]

        embeddings = StackedEmbeddings(embeddings=embedding_types)

        tagger = SequenceTagger(
            hidden_size=HIDDEN_SIZE,
            embeddings=embeddings,
            tag_dictionary=label_dict,
            tag_type="ner",
            use_crf=True,
        )

        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=10,
            embeddings_storage_mode=sm,
        )

        plotter = Plotter()
        plotter.plot_weights(os.path.join(trained_model_path, "weights.txt"))

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def inference(self, raw_sents: list) -> ConllDataset:
        """Run the inference on a given list of short texts.

        Args:
            raw_sents (list): The list of raw sents to run the inference on.

        Returns:
            ConllDataset: The ConllDataset of preds.

        Raises:
            ValueError: If the model has not yet been trained.
        """
        if self.model is None:
            raise ValueError(
                "The KGC Model has not yet been trained. "
                "Please train this Flair model before proceeding."
            )

        preds_dataset = ConllDataset()

        for i, tokens in enumerate(raw_sents):
            labels = self._tag_sentence(tokens)
            doc = ConllDocument(tokens, labels)
            preds_dataset.add_document(doc)

        return preds_dataset

    def load(self, model_path: str):
        """Load the model from the specified path.

        Args:
            model_path (os.path): The path to load.

        Raises:
            ValueError: If the path does not exist i.e. model not yet trained.
        """
        self.model = SequenceTagger.load(model_path)

    def _tag_sentence(self, sentence: List[str]) -> List[str]:
        """Tag the given sentence (list of tokens) via the model.

        Args:
            sentence (List[str]): A list of tokens.

        Returns:
            List[str]: A list of labels.
        """
        sentence_obj = Sentence(sentence, use_tokenizer=False)
        self.model.predict(sentence_obj)
        labels = ["O"] * len(sentence)

        for entity in sentence_obj.get_spans("ner"):
            for i, token in enumerate(entity):
                label = entity.get_label("ner").value
                prefix = "B-" if i == 0 else "I-"
                
                # Token idx starts from 1 in Flair.
                labels[token.idx - 1] = prefix + label

        return labels

ModuleNotFoundError: No module named 'mwo2kg_datasets'

### (optional) Define a DictionaryNERModel class

If you are not able to use the Flair library, here is a simple model you can use to extract the entities, albeit with a much weaker performance. This one scans the training data, builds a mapping between each phrase (one or more tokens in a row) and the most common entity type associated with that phrase, then uses that entity type as the prediction when seeing that token in the test data.

The model is super simple, so we won't show the code here, but feel free to have a look under `helpers/DictionaryNERModel.py` if you are interested.

In [4]:
from helpers import DictionaryNERModel

## 3.4. Training the model

Depending on whether you are using Flair or the DictionaryNERModel, you can run one of the cells below.



### 3.4.1. Using Flair

We have trained the Flair-based model and have uploaded the model onto Huggingface. The following code will download that model and load the weights, so there is no need for you to train the model yourself.

In [9]:
m = FlairNERModel()
#m.train(NER_DATASET_PATH, 'models/ner_models/flair') # Uncomment to train manually
m.load("nlp-tlp/mwo-ner-test") # TODO: Replace with load_pretrained

NameError: name 'FlairNERModel' is not defined

### 3.4.2. Using the DictionaryNERModel

In [17]:
m = DictionaryNERModel()
m.train(NER_DATASET_PATH, 'models/ner_models/dictionary')

Building dictionary...
Loaded 3200 documents from data/ner_dataset\train.txt.
Loaded 401 documents from data/ner_dataset\dev.txt.


## 3.5. Running inference on unseen sentences

The next step is to use our trained model to infer the entity type of each entity appearing in a list of previously unseen data.

In [6]:
tagged_bio_sents = []

sentences = []
for row in work_order_data:
    sentences.append(row["ShortText"].split()) # We must 'tokenise' the sentence first, i.e. split into words

tagged_bio_sents = m.inference(sentences)

# Print an example tagged sentence
print(tagged_bio_sents[12])

{'tokens': ['a/c', 'not', 'working'], 'labels': ['B-Item', 'B-Observation', 'I-Observation']}


## 3.6. Converting the BIO format to the "Mention"-based format

The BIO-based format above has one key downside - it is not good for representing 'phrases' of more than one token in length. This makes it difficult to work with for future steps, such as constructing nodes from the entities and running relation extraction. In light of this, we will now convert the BIO-formatted predictions into Mention format, i.e. go from this:

    {'tokens': ['a/c', 'not', 'working'],
     'labels': ['B-Item', 'B-Observation', 'I-Observation']}
    
To this:

    {'tokens': ['a/c', 'not', 'working'],
     'mentions': [
         {'start': 0, 'labels': ['Item'], 'end': 1},
         {'start': 1, 'labels': ['Observation'], 'end': 3}]}
    
Note that this format is also able to now support multiple labels per mention (though we will only be using single labels for simplicity). Researchers use this format for **entity typing**, which is similar to NER but with >= 1 label per mention.

This step is just a bit of data wrangling - here we have defined a helper function to convert a BIO-tagged sentence into a Mention-tagged sentence.

In [17]:
import json

def _bio_to_mention(conll_doc: dict):
    """Return a Mention-format representation of a BIO-formatted
    tagged sentence.

    Args:
        conll_doc (ConllDocument): The doc to convert to redcoat.
        doc_idx (int): The id of the doc, necessary to create a Redcoat doc.

    Returns:
        dict: A mention-formatted dict created from the conll_doc.
    """
    tokens = conll_doc["tokens"]
    labels = conll_doc["labels"]
    mentions_list = []

    start = 0
    end = 0
    label = None
    for i, (token, label) in enumerate(
        zip(tokens, labels)
    ):
        if label.startswith("B-"):
            if len(mentions_list) > 0:
                mentions_list[-1]["end"] = i
            mentions_list.append({"start": i, "labels": [label[2:]]})
        elif label == "O" and len(mentions_list) > 0:
            mentions_list[-1]["end"] = i
        if len(mentions_list) == 0:
            continue
        if i == (len(tokens) - 1) and "end" not in mentions_list[-1]:
            mentions_list[-1]["end"] = i + 1
    return {'tokens': tokens, 'mentions': mentions_list}


# For each BIO tagged sentence in tagged_sents, convert it to the mention-based
# representation
tagged_sents_m = []
for doc in tagged_bio_sents:
    mention_doc = _bio_to_mention(doc)
    tagged_sents_m.append(mention_doc)

# Let's print our example sentence again, this time with the mention-based
# representation.
# We'll use json.dumps to make it a bit easier to read.
print(json.dumps(tagged_sents_m[12],indent=1))

{
 "tokens": [
  "a/c",
  "not",
  "working"
 ],
 "mentions": [
  {
   "start": 0,
   "labels": [
    "Item"
   ],
   "end": 1
  },
  {
   "start": 1,
   "labels": [
    "Observation"
   ],
   "end": 3
  }
 ]
}


# 4. Extracting relations between the entities via Relation Extraction

We have extracted the entities appearing in each work order. The next step is to extract the relationships between those entities. We can do this using Relation Extraction.

![alt text](images/building-relations.png "Building relations")

## 4.1. Loading and inspecting the data

Let's take a look again at the RE dataset we are working with.

In [8]:
import os
import json

RE_DATASET_PATH = "data/re_dataset"


def load_re_dataset(filename: str) -> list:
    """Load the Relation Extraction dataset into a list.
        
    Args:
        filename (str): The name of the file to load.
    """
    re_data = []
    with open(filename, 'r') as f:
        for row in f:
            re_data.append(row.strip().split(','))
    return re_data

train_dataset = load_re_dataset(os.path.join(RE_DATASET_PATH, 'train.csv'))

print(json.dumps(train_dataset[0:3], indent=1))


[
 [
  "broken",
  "rod support",
  "Observation",
  "Item",
  "rod support broken",
  "0",
  "1",
  "O"
 ],
 [
  "rod support",
  "broken",
  "Item",
  "Observation",
  "rod support broken",
  "1",
  "0",
  "HAS_OBSERVATION"
 ],
 [
  "broken",
  "cup",
  "Observation",
  "Item",
  "cup rod support broken",
  "0",
  "2",
  "O"
 ]
]


We can interpret this as follows:
 - 'broken': entity 1
 - 'rod support': entity 2
 - 'Observation': label of entity 1
 - 'Item': label of entity 2
 - 'rod support broken': The text between 'broken' and 'rod support', inclusive
 - '0': The position of entity 1
 - '1': The position of entity 2
 - 'O': The relation type. "O" means no relation.

## 4.2. Define the Abstract Base Class

We are going to see two different RE models, so let's define an abstract base class again just like we did for the NER models. Just like the NER model, we have four functions:

- `inference`: Given a row (as above, but without the last column), predict the given relation type ("O" if no relation).
- `train`: Train the model on the files in the given dataset path.
- `load`: Load the model from the specified path.

In [9]:
class REModel(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def inference(self, row: list) -> str:
        pass        

    @abstractmethod
    def train(self, re_datasets_path: str):
        pass

    @abstractmethod
    def load(self, model_path: str):
        pass



## 4.3. Define the Flair model

TODO

In [14]:
""" A dictionary-based KGC model. Can be used as an alternative
to Flair, which is cumbersome to run and install."""

import os
import json
from typing import List

import flair
from flair.trainers import ModelTrainer
from flair.datasets import CSVClassificationCorpus
from flair.embeddings import (
    PooledFlairEmbeddings,
    DocumentRNNEmbeddings,
)
from flair.data import Sentence
from typing import List
from flair.models import TextClassifier
from flair.visual.training_curves import Plotter

import torch

MAX_EPOCHS = 1
HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairREModel(REModel):

    """The Flair-based RE model."""

    model_name: str = "Flair"

    def __init__(self):
        super(FlairREModel, self).__init__()
        self.model = None

    def train(self, datasets_path: os.path, trained_model_path: os.path):
        """Train the Flair RE model on the given CSV datasets.

        Args:
            datasets_path (os.path): The path containing the train and dev
               datasets.
            trained_model_path (os.path): The path to save the trained model.
        """

        column_name_map = {
            0: "text",
            1: "text",
            2: "text",
            3: "text",
            4: "text",
            7: "label_relation",
        }

        # Define corpus, labels, word embeddings, doc embeddings
        corpus = CSVClassificationCorpus(
            datasets_path,
            column_name_map,
            delimiter=",",
            label_type="relation",
        )

        label_dict = corpus.make_label_dictionary(label_type="relation")

        word_embeddings = [
            PooledFlairEmbeddings("mix-forward"),
            PooledFlairEmbeddings("mix-backward"),
        ]

        document_embeddings = DocumentRNNEmbeddings(
            word_embeddings, hidden_size=HIDDEN_SIZE
        )

        # Initialise sequence tagger
        tagger = TextClassifier(
            document_embeddings,
            label_dictionary=label_dict,
            label_type="relation",
        )

        # Initialize trainer
        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        
        # Start training
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=MAX_EPOCHS,
            patience=3,
            embeddings_storage_mode=sm,
        )

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def load(self, model_path: str):
        """Load the chunked frequency dict from the given folder.

        Args:
            model_path (str): The filename containing the model.
               Can also be the name of a repo on Huggingface.
        """
        TextClassifier.load(model_path)

    def inference(self, row: list) -> str:
        """Run the inference over the given document.

        Args:
            row (list): The row to predict the relation of.

        Returns:
            str: The relation type.
        """
        
        s = Sentence(" ".join(rel[:5]))
        label = "O"
        self.model.predict(s)
        if len(s.labels) > 0:
            label = str(s.labels[0].value)
        return label


Device: cuda:0


## 4.4 Define a 'SimpleMWO' RE model

Because maintenance work orders are very short (5-7 words typically), generally speaking we can create a useful knowledge graph by simply linking each Item entity in the work order and each other entity in that work order. For example:

    replace pump
    
We can say the "pump" entity `HAS_ACTIVITY` "replace". Likewise for the following:

    fix air conditioner , not working
    
We can say that "air conditioner" `HAS_ACTIVITY` "fix", and `HAS_OBSERVATION` "not working".

This is not a foolproof method, though - it is a heuristic, i.e. a rule-based method designed to exploit a pattern in the data. For creating this specific type of knowledge graph, though, it works quite well, and thus we can define a model to use this heuristic as a weaker alternative to a deep learning model.

Just like the dictionary-based NER model, the model is super simple, so we won't show the code here, but feel free to have a look under `helpers/SimpleMWOREModel.py` if you are interested.

In [1]:
from helpers import SimpleMWOREModel

Here's an example output from the model. Note we have removed the last column, which our model is predicting now:

In [3]:
r = SimpleMWOREModel()

r.inference([
  "rod support",
  "broken",
  "Item",
  "Observation",
  "rod support broken",
  "1",
  "0"
 ])

'HAS_OBSERVATION'

## 4.3. Train the model/load the pretrained model

Let's load the pretrained Flair RE model from Huggingface.

(or alternatively you can train it yourself by uncommenting the train line, and commenting the load line).

In [18]:
re_model = FlairREModel()
re_model.train(RE_DATASET_PATH, "models/re_models/flair") # Uncomment to train manually

# TODO: Load from huggingface
# re_model.load('nlp-tlp/mwo-re')


2022-11-21 17:18:29,714 Reading data from data\re_dataset
2022-11-21 17:18:29,715 Train: data\re_dataset\train.csv
2022-11-21 17:18:29,715 Dev: data\re_dataset\dev.csv
2022-11-21 17:18:29,716 Test: data\re_dataset\test.csv
2022-11-21 17:18:29,808 Computing label dictionary. Progress:


24804it [00:04, 6187.61it/s]


2022-11-21 17:18:33,821 Dictionary created for label 'relation' with 13 values: O (seen 15398 times), HAS_ACTIVITY (seen 2825 times), HAS_OBSERVATION (seen 2174 times), APPEARS_WITH (seen 1982 times), HAS_LOCATION (seen 1556 times), HAS_CONSUMABLE (seen 334 times), HAS_SPECIFIER (seen 173 times), HAS_AGENT (seen 143 times), HAS_CARDINALITY (seen 114 times), HAS_ATTRIBUTE (seen 76 times), HAS_TIME (seen 25 times), HAS_EVENT (seen 4 times)


  "There should be no best model saved at epoch 1 except there "


2022-11-21 17:18:34,783 ----------------------------------------------------------------------------------------------------
2022-11-21 17:18:34,784 Model: "TextClassifier(
  (decoder): Linear(in_features=256, out_features=13, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): PooledFlairEmbeddings(
        (context_embeddings): FlairEmbeddings(
          (lm): LanguageModel(
            (drop): Dropout(p=0.25, inplace=False)
            (encoder): Embedding(275, 100)
            (rnn): LSTM(100, 2048)
            (decoder): Linear(in_features=2048, out_features=275, bias=True)
          )
        )
      )
      (list_embedding_1): PooledFlairEmbeddings(
        (context_embeddings): FlairEmbeddings(
          (lm): LanguageModel(
            (drop): Dropout(

100%|████████████████████████████████████████████████████████████████████████████████| 104/104 [00:17<00:00,  6.10it/s]


2022-11-21 17:21:20,435 Evaluating as a multi-label problem: False


  _warn_prf(average, modifier, msg_start, len(result))


2022-11-21 17:21:20,455 DEV : loss 0.0031679796520620584 - f1-score (micro avg)  0.9087
2022-11-21 17:21:21,127 BAD EPOCHS (no improvement): 0
2022-11-21 17:21:21,129 saving best model
2022-11-21 17:21:24,305 ----------------------------------------------------------------------------------------------------
2022-11-21 17:21:24,307 loading file models\re_models\flair\best-model.pt


100%|████████████████████████████████████████████████████████████████████████████████| 102/102 [00:16<00:00,  6.11it/s]


2022-11-21 17:21:42,423 Evaluating as a multi-label problem: False
2022-11-21 17:21:42,441 0.9979	0.8154	0.8975	0.9326
2022-11-21 17:21:42,443 
Results:
- F-score (micro) 0.8975
- F-score (macro) 0.8786
- Accuracy 0.9326

By class:
                 precision    recall  f1-score   support

HAS_OBSERVATION     1.0000    1.0000    1.0000       313
   HAS_ACTIVITY     1.0000    1.0000    1.0000       279
   HAS_LOCATION     1.0000    1.0000    1.0000       238
   APPEARS_WITH     0.6667    0.0183    0.0357       218
 HAS_CONSUMABLE     1.0000    1.0000    1.0000        59
      HAS_AGENT     1.0000    1.0000    1.0000        21
  HAS_SPECIFIER     1.0000    1.0000    1.0000        15
  HAS_ATTRIBUTE     1.0000    1.0000    1.0000        12
HAS_CARDINALITY     1.0000    1.0000    1.0000        10
       HAS_TIME     1.0000    0.6000    0.7500         5

      micro avg     0.9979    0.8154    0.8975      1170
      macro avg     0.9667    0.8618    0.8786      1170
   weighted avg     0.937

PermissionError: [Errno 13] Permission denied: 'models/re_models/flair'

## 4.4. Inference

Now we have our RE model, let's run it over the MWO dataset to extract the relationships between the entities.

We need our data to be in the same format as required by the model, i.e. a list of rows where each row has five columns (entity 1, entity 2, etc), just like the training data used to train the model.

So before we can run RE, we need to 'wrangle' our data again to get it into the right format. Here is a helper function to transform our mention-based entity format of a single document into a list of potential relationships between each entity and each other entity in that document.

# 5. Combining NER+RE

TODO: Do something like this but using RE instead of simple Item -> everything

In [None]:

# This is old code from the master class

triples = []

for row in normalised_work_order_entities:
    for (ngram, entity_class) in row:
        if entity_class != "item": continue
            
        # If this entity is an item, link it to all other entities in the work order       
             
        for (other_ngram, other_entity_class) in row:   
            if ngram == other_ngram: continue # Don't link items to themselves                

            relation_type = other_entity_class.upper()                
            triples.append(((ngram, entity_class), "HAS_%s" % relation_type, (other_ngram, other_entity_class)))
        
for triple in triples:
    print(triple)

# 6. Creating the graph

NOTE: This is still old code, but it probably works well enough to just reuse it here too. I have better ways to do this now (using IMPORT rather than doing it directly via py2neo functions) but this is probably good enough. At the moment it's missing the `triples`, that can be built in section 4.


Now that we have our nodes and relations we can go ahead and build the Neo4J graph.

To do this we are going to use py2neo, a Python library for interacting with Neo4J.

There are also a couple of other ways to do this - you can either use Neo4J and run Cypher queries to insert each node and relation, or use the APOC library to import a list of nodes from a CSV file. I find Python to be the simplest way, however.

> Before proceeding, make sure you have created a new graph in Neo4j and that your new Neo4j graph is running.

You can download and install Neo4j from here if you haven't already: https://neo4j.com/download/. I will be demonstrating the graph during the class so there's no need to have it installed unless you are also interested in trying out some graph queries yourself.

> If you need to build your graph again, make sure to run this cell before running subsequent cells.

In [None]:
from py2neo import Graph
from py2neo.data import Node, Relationship

GRAPH_PASSWORD = "password" # Set this to the password of your Neo4J graph

graph = Graph(password = GRAPH_PASSWORD)

# TODO: create an 'id' such as pump__Item so that you can have the same phrase, but different classes

# We will start by deleting all nodes and edges in the current graph.
# If we don't do this, we will end up with duplicate nodes and edges when running this script again.
graph.delete_all() 

tx = graph.begin()

# We will keep a dictionary of nodes that we have created so far.
# This serves two purposes:
#  - prevents duplicate nodes
#  - provides us with a way to create edges between the nodes
created_entity_nodes = {}

# Creates a node for the specified ngram and entity_class.
# If the node has already been created (i.e. it exists in created_nodes), return the node.
# Otherwise, create a new one.
def create_entity_node(ngram, entity_class):
    if ngram in created_entity_nodes:
        node = created_entity_nodes[ngram]
    else:
        node = Node("Entity", entity_class, name=ngram)
        created_entity_nodes[ngram] = node
        tx.create(node)
    return node


# Create a node for each triple in the list of triples.
# Set the class of each node to the entity_class (e.g. "activity", "item" or "observation").
# Create a relationship between the nodes in the triple.
for ((ngram_1, entity_class_1), relation, (ngram_2, entity_class_2)) in triples:
    
    node_1 = create_entity_node(ngram_1, entity_class_1)
    node_2 = create_entity_node(ngram_2, entity_class_2)   
    
    
    # Create a relationship between two nodes.
    # This does not check for duplicate relationships unlike create_node,
    # so this code will need to be adjusted on larger datasets.
    relationship = Relationship( node_1, relation, node_2 )
    tx.create(relationship)
    
    
tx.commit()
        

## 6.1. Create nodes for the documents (i.e. the Work Orders)

In order to query our graph, we need to create nodes for each work order in our dataset as well. We then need to link each Document node to every Entity node appearing in that document.

In [None]:
from dateutil.parser import parse as parse_date

# Our work_order_data and normalised_work_order entities allow us to do this quite easily,

tx = graph.begin()

# We will once again keep a mapping of created work order nodes, this time indexed by the row index.
created_work_order_nodes = {}

# Dates are a little awkward in Neo4j - we have to convert it to an integer representation in Python.
# The APOC library has functions to handle this better.
def date_to_int(date):
    parsed_date = parse_date(str(date))
    date = int("%s%s%s" % (parsed_date.year, str(parsed_date.month).zfill(2), str(parsed_date.day).zfill(2)))
    return date

# The process of creating a work order node is a bit different to creating an entity,
# as we also want to incorporate some of the structured fields onto the node.
def create_structured_node(index, row, node_type, created_nodes):
    if index in created_nodes:
        return created_nodes[index]

    if 'StartDate' in row:
        row['StartDate'] = date_to_int(row['StartDate'])
    if 'EndDate' in row:
        row['EndDate'] = date_to_int(row['EndDate'])  

    node = Node(node_type, **row)
    created_nodes[index] = node
    tx.create(node)
    return node

for i, row in enumerate(work_order_data):
    node = create_structured_node(i, row, "WorkOrder", created_work_order_nodes)
    
tx.commit()





## 6.2. Link the entities to their corresponding work order nodes

In order to properly query our graph, we need to link every entity node to the work order node in which it appears.

This allows us to run queries such as "pumps with electrical issues in the last 3 months".

In [None]:
tx = graph.begin()

# We can use the normalised_work_order_entries list to do this.
for i, row in enumerate(normalised_work_order_entities):
    for (ngram, entity_class) in row:        
        
        node_1 = created_entity_nodes[ngram]
        node_2 = created_work_order_nodes[i]
        
        relationship = Relationship( node_1, "APPEARS_IN", node_2 )
        tx.create(relationship)
       
tx.commit()

# 7. Querying the graph

## TODO: Update with GQVis

Now that the graph has been created, we can query it in Neo4j. This section lists some example queries that we can run on our graph. If you would like to try these yourself you can paste them directly into the Neo4j console.

First, let's try a simple query. Here is a query that searches for __all failure modes observed on engines__:

    MATCH (e:Entity {name: "engine"})-[r:HAS_OBSERVATION]->(o:observation)
    RETURN e, r, o

We can also use our graph as a way to quickly search and access work orders for the entities appearing in those work orders. For example, searching for __all work orders containing a leak__:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(o:observation {name: "leak"})
    RETURN d, a, o

We could extend this to also show the items on which the leaks were present:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(o:observation {name: "leak"})<-[r:HAS_OBSERVATION]-(e:Entity)
    RETURN d, a, o, r, e

Our queries can also incorporate structured data, such as the start dates of the work orders. Here is an example query for __all assets that had leaks from 25 to 28 July__:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_OBSERVATION]->(o:observation {name: "leak"})-[:APPEARS_IN]->(d)
    WHERE d.StartDate >= 20050725
    AND d.StartDate <= 20050728
    RETURN e, r, o

On a larger graph this would also work well with other forms of structured data such as costs. We could query based on specific asset costs, for example.

Now that our work orders and downtime events are in one graph, we can also make queries about downtime events. Here is an example query for the __downtime events associated with assets appearing in work orders from 25 to 28 July (where the downtime events occurred in July)__:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(e:Entity)-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate > 20050725
    AND d.StartDate < 20050728
    AND 20050700 <= x.StartDate <= 20050731
    RETURN e, r, x

We can of course extend this to specific assets, such as pumps:

    MATCH (d:WorkOrder)<-[a:APPEARS_IN]-(e:Entity {name: "pump"})-[r:HAS_EVENT]->(x:DowntimeEvent)
    WHERE d.StartDate > 20050725
    AND d.StartDate < 20050728
    AND 20050700 <= x.StartDate <= 20050731
    RETURN e, r, x

In larger graphs the downtime events could even be further queried based on duration, cost, lost feed, or date ranges.

# 6. Future improvements

TODO: Get rid of reference to 'downtime'. Perhaps introduce Seq2KG and other models here as alternate ways of text to graph

## Incorporating FLOCs

Our downtime events are currently linked to Item nodes, but it would make more sense to link them to nodes representing the functional locations.

If you are interested in continuing work on this small graph, the next best step would be to create nodes for the functional location data (`floc_data`) and to link the downtime events to those nodes as opposed to the Item nodes.

![alt text](images/adding-flocs.png "Adding FLOCs")

## Frequencies on edge properties

We could also improve the graph by incorporating frequencies onto the edge properties. For example, if a "leak" occurred on a pump in two different work orders, our link between "pump" and "leak" could have a property called `frequency` with a value of `2`. This would allow us to query, for example, assets that had a particularly high number of leaks.


## Constructing a graph from your own work order data

If you have a work order dataset of your own, feel free to download this code and try it out on your dataset. I would be happy to chat if you would like to further discuss the code or if you run into any issues.

If you need to extract entities not listed in the lexicon, you will need to update the lexicon file to include your new entities. Alternatively, the LexiconTagger can be substituted for a named entity recognition model.

In [None]:
floc_file = "data/sample_flocs.csv"
floc_data = load_csv(floc_file)

# Your code here

# ---------------------------------------------------------------------------------------

## OLD Normalise the entities

### MS: Probably going to take this out to save time, I can leave it as a future step for people. It should probably go before NER/RE anyway

The next step is to normalise the ngrams, i.e. convert each ngram into a normalised form. This is important as we would prefer to have a single node for a single concept, e.g. one node for "engine" as opposed to two nodes for "engin" and "engine".

We will once again be using a lexicon for this task, but it would typically be performed by machine learning.

![alt text](images/normalising-entities.png "Normalising entities")

In [None]:
lexicon_n_file = "data/lexicon_normalisation.csv"
lexicon_normaliser = LexiconTagger(lexicon_n_file)

normalised_work_order_entities = []

# For every row in work_order_entities, replace each ngram with its normalised counterpart
# as per the normalisation lexicon.
# For example, "engin" will become "engine", "leaking" will become "leak", etc.
for row in work_order_entities:
    normalised_work_order_entities.append([(lexicon_normaliser.normalise_ngram(ngram), entity_class) 
                                           for (ngram, entity_class) in row])
    
    
for row in normalised_work_order_entities:
    print(row)

## OLD Extending the graph to incorporate Downtime events

The next step is to incorporate the downtime events.

For this exercise we are going to link the Downtime events to the first Item node appearing in the work orders with the same FLOC as the downtime event.


![alt text](images/adding-downtime-events.png "Adding downtime events")

In [None]:
tx = graph.begin()

created_downtime_nodes = {}

# Create a DowntimeEvent node for each row
for i, downtime_row in enumerate(downtime_data):
    node = create_structured_node(i, downtime_row, "DowntimeEvent", created_downtime_nodes)
    
    # Get all work order nodes with the same FLOC and link the DowntimeEvent to the Items appearing
    # in those work orders
    for j, work_order_row in enumerate(work_order_data):
        if work_order_row["FLOC"] == downtime_row["FLOC"]:
            
            work_order_entities = normalised_work_order_entities[j]
            
            for (ngram, entity_class) in work_order_entities:
                if entity_class != "item": continue    # We don't need to link non-items to downtime events               
                    
                item_node = created_entity_nodes[ngram]
                relationship = Relationship( item_node, "HAS_EVENT", node )
                tx.create(relationship)
                break

    
tx.commit()
