# A. Background Survey

In this section, we would like to gauge your knowledge and background. Please follow these instructions:

1. Go to https://www.menti.com/
2. Enter the code [6705694] to join the session
3. Answer the questions shown on the screen

# B. Knowledge Graph Construction from Technical Short Text

In this notebook we are going to construct a simple knowledge graph using Python, and run some queries on the graph in Neo4j. We have broken the notebook into several steps:

1. Introduction
2. Introduction to Natural and Technical Language Processing
3. Loading the data
4. Cleaning the data via **Lexical Normalisation**
5. Extracting entities via **Named Entity Recognition** (NER)
6. Creating relations between entities via **Relation Extraction** (RE)
7. Combining Named Entity Recognition + Relation Extraction
8. Creating the graph
9. Querying the graph in Neo4j

# 1. Introduction

## 1.1 Requirements

We will be walking through this notebook, so there is no need for you to install any of these requirements unless you are interested in running it yourself after the class. If you do wish to run it yourself later, you need the following open-source tools:
1. Python 3: A powerful programming language known for its simplicity, readability, and extensive library support.
2. Neo4j: A powerful graph database that enables you to model, store, and query complex connected data with efficiency.
3. Necessary Python packages: py2neo, gqvis, flair. 

The Python packages are described below, and the code cell that follows installs them using pip:
- `py2neo`: A library for working with Neo4j in Python.
- `gqvis`: Our simple tool for visualising graph queries in Jupyter.
- `flair`: A deep learning library for natural language processing. Note that this library is quite large (a couple of GB, we believe). If you don't wish to install it, we have provided non-deep-learning alternatives so you can still run the code.

In [None]:
!pip install py2neo
!pip install gqvis
!pip install flair

You can download Neo4j by visiting this link: https://neo4j.com/download/.

## 1.2 Problem description

Maintenance work orders (MWOs) capture information on the maintenance performed on assets. Much of this information is structured - such as dates, the functional location (the specific identifier of the asset), costs, and so on. However, a significant volume of the knowledge buried within a MWO is unstructured and therefore inaccessible - it is buried within the short text description.

Here are some examples of these short text descriptions:

    replace pump
    a/c running hot
    repair cracked hydraulic tank
    
As you can see, they often contain indicators of failure modes (e.g. overheating) and end-of-life events (e.g. a replacement). It would be useful to discover these automatically, because it is next to impossible to manually trawl through thousands of work orders looking for patterns. We need some way to ask questions of our unstructured data, which is where knowledge graphs shine.


Our task in this session is therefore to transform these work orders into a **knowledge graph**. We will be primarily focusing on the short text descriptions, but there will be some discussion at the end on incorporating other structured knowledge (e.g. dates) into our graph. The goal of building this knowledge graph is to be able to **ask questions of our data** such as "what are the failure modes observed on pumps?" and "which assets have had leaks in the past 6 months?"

![Insight](images/dikw_pyramid.png "Insight")

Image sourced from [OntoText](https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/)

# 2. Introduction to Natural (and Technical) Language Processing

**Natural language processing** (NLP) is the study of the automatic interpretation and manipulation of natural language, like speech and written text.

**Technical language processing** (TLP) is a subset of NLP that focuses specifically on technical text, such as the text present in maintenance work orders, doctor's notes, safety records, and so on. 

<img src="images/KG_con.png" alt="KG_con" width="800px" />


## 2.1. Core NLP terms and ideas

Let's first define some core NLP terms and ideas that we will see numerous times throughout this notebook.

### 2.1.1. Corpus

A 💡 **corpus** is a set of text documents, where a document can be a word, sentence, paragraph, report, etc. Our corpus is a dataset of maintenance work order records.

### 2.1.2. Tokenisation

The first step for almost any NLP application is 💡 **Tokenisation**. This task splits text into smaller units, such as sentences (sentence tokenisation), words (word tokenisation), or characters (character tokenisation). For most applications, we split text into words.

The simplest (but by no means best) way to tokenise is using Python's `split` function, which, when called without arguments, splits a string on whitespace:



In [1]:
sentence = "this is a sentence"
words = sentence.split()

print(words)

['this', 'is', 'a', 'sentence']


This is good enough for our purposes (tokenising maintenance work orders), but does not perform well on natural language datasets due to the prevalence of punctuation (full stops, commas etc). The `split` function fails when our sentence ends with a full stop, for example:

In [2]:
sentence = "this is a sentence."
words = sentence.split()

print(words)

['this', 'is', 'a', 'sentence.']


Fortunately this does not happen in our dataset so for simplicity we will just be using the `split` function.

If you are interested in learning more about tokenisation, there are many tokenisation libraries available in Python. The most popular perhaps is NLTK, which has a range of tokenisers available (see [here](https://www.nltk.org/api/nltk.tokenize.html)).
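
As a small illustration of a tokeniser that copes with trailing punctuation, the sketch below uses a regular expression from Python's standard `re` module. The function name `simple_tokenise` is our own invention for this example; it is not part of NLTK or of this notebook's helpers.

```python
import re


def simple_tokenise(text):
    """Split text into word tokens and individual punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenise("this is a sentence."))
# ['this', 'is', 'a', 'sentence', '.']

print(simple_tokenise("a/c running hot"))
# ['a', '/', 'c', 'running', 'hot']
```

The second call shows the trade-off: a punctuation-aware tokeniser breaks 'a/c' into three tokens, which is one reason plain `split` is a reasonable choice for our work orders.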

### 2.1.3. Vocabulary

A 💡 **vocabulary** is a set of terms that correspond to a particular subject matter. The vocabulary of maintenance texts will be quite large, even though the texts themselves are short. This is because maintainers often write the same word in many different ways ('air conditioner' is written as 'a/c', 'air cond', etc). To illustrate why this is an issue, we found that a corpus of 50,000 maintenance work order records contained 18,238 unique tokens, predominantly due to all of these word variants.

Note that depending on the domain, we may treat varying capitalisation of the same word ("apple" and "Apple", for example) as different words. In this notebook we will treat everything as lower-cased, though in many domains this is not a good idea (especially when many acronyms are present).

Determining our vocabulary is a matter of creating a set of unique words in the corpus, for example:

In [2]:
vocabulary = set()

dataset = ["this is the first sentence", "this is the second sentence"]
for sent in dataset:
    words = sent.split()
    for word in words:
        vocabulary.add(word.lower())

print(vocabulary)

{'this', 'sentence', 'is', 'the', 'second', 'first'}


🤔 Machines cannot directly process or understand text data. How do we train models to understand the vocabulary in NLP?

&#x1F609; By transforming vocabularies into numerical representations.
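
The simplest numerical representation is to give every word in the vocabulary an integer index. The sketch below does this for the toy dataset used above; the names `word_to_index` and `encode` are illustrative only and do not appear elsewhere in this notebook.

```python
dataset = ["this is the first sentence", "this is the second sentence"]

# Sort the vocabulary so that every run assigns the same index to each word.
vocabulary = sorted({word.lower() for sent in dataset for word in sent.split()})
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def encode(sentence):
    """Convert a sentence into a list of vocabulary indices."""
    return [word_to_index[word] for word in sentence.lower().split()]


print(word_to_index)
# {'first': 0, 'is': 1, 'second': 2, 'sentence': 3, 'the': 4, 'this': 5}

print(encode("this is the second sentence"))
# [5, 1, 4, 2, 3]
```

Indices like these carry no notion of meaning (index 2 is not "closer" to index 3 than to index 5), and a word outside the vocabulary raises a `KeyError` here. Both limitations motivate the embeddings introduced next.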

### 2.1.4. Word Embeddings


💡 **Word embeddings** are numerical representations of language. We can think of an embedding as a `k` dimensional vector, whereby the embeddings of **semantically similar** items (typically words) have a **high cosine similarity**.

![Embeddings](images/embeddings_2.png "Embeddings")

Image sourced from [Medium](https://medium.com/@hari4om/word-embedding-d816f643140)

Note how "cat" and "kitten" are close, while "cat" and "houses" are far away.

Actual word embeddings have many more dimensions (typically hundreds, and sometimes thousands), but the general principle is the same. The model that generates the embeddings is designed to ensure that words appearing in similar contexts (such as cat and dog) have similar vectors, while words appearing in different contexts (such as cat and cup) have dissimilar vectors.

For fun, you might like to check out [Semantle](https://semantle.com/), where the goal is to determine the word of the day via its cosine similarity to other words.
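
To make cosine similarity concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are made-up values chosen for illustration; they are not taken from any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real ones have far more dimensions).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "houses": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_houses = cosine_similarity(toy_embeddings["cat"], toy_embeddings["houses"])

print(f"cat vs kitten: {cat_kitten:.2f}")
print(f"cat vs houses: {cat_houses:.2f}")
```

With these toy values the first similarity is close to 1 and the second is much lower, mirroring the picture above.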

&#x1F914; How do we generate embeddings?

#### Language models


💡 **Language models**, such as GPT, are used to generate embedding vectors. The choice of model depends largely on the corpus.

Language models are typically **trained** on a corpus and then used to generate embeddings for words appearing in a different corpus. They can also be **fine-tuned** so that the model predicts more meaningful embeddings for words in a new corpus.

Below is a table of some of the most popular language models, including how the models are trained.

| Name     | Model                                                 | Method                                                                     | Granularity            |
|:---------|:------------------------------------------------------|:---------------------------------------------------------------------------|:-----------------------|
| word2vec | Feedforward neural network                            | Predict word given its context (CBOW), or context given a word (skip-gram) | Word-based             |
| GloVe    | Log-bilinear regression model                         | Train word vectors from global co-occurrence matrix                        | Word-based             |
| FastText | Feedforward neural network                            | Predict context given a word (skip-gram)                                   | Character n-gram based |
| ELMo     | Bi-directional Long Short Term Memory model (Bi-LSTM) | Predict word given all previous/next words                                 | Character-based        |
| BERT     | Bi-directional Transformer                            | Predict masked word(s)                                                     | Wordpiece-based        |
| Flair    | Bi-LSTM + Conditional Random Field (CRF)              | Predict next character given the previous characters                       | Character-based        |

While `word2vec` remains one of the most popular language models, it has one key downside - it is not able to generate embedding vectors for words that do not appear in the corpus on which it was trained. This is a problem for our maintenance texts as they are rife with spelling errors, acronyms, etc, as well as many terms that do not occur in common natural language.

Character-level embeddings are able to deal with this issue because they are modelled at the character level, and thus the vectors for similar spellings of the same word ('pump', 'puump', etc) will have high cosine similarity. For our application we will therefore choose Flair embeddings, though FastText, ELMo and BERT would work well too.

You can learn more about Flair via the papers ([here](https://aclanthology.org/N19-4010/) and [here](https://aclanthology.org/C18-1139.pdf)), or check out the [GitHub repository](https://github.com/flairNLP/flair).
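
To give some intuition for why sub-word information helps with misspellings, the sketch below compares words by the character trigrams they share. This is only an illustration of the idea (it is closer in spirit to FastText's character n-grams than to how Flair computes its embeddings), and the helper names `char_ngrams` and `jaccard` are our own.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


pump = char_ngrams("pump")
puump = char_ngrams("puump")
valve = char_ngrams("valve")

print(jaccard(pump, puump))  # 0.5
print(jaccard(pump, valve))  # 0.0
```

A word-level model sees 'pump' and 'puump' as two unrelated symbols, whereas at the character level half of their trigrams are shared.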

### 2.1.5. Sequence Labelling and Text Classification

There are two umbrella terms that almost all NLP tasks fall under - **sequence labelling** and **text classification**.

💡 **Sequence labelling** involves assigning a label to every item in a sequence. Examples of sequence labelling tasks include Named Entity Recognition (assigning an entity class label to every token in a sentence) and Lexical Normalisation (assigning the correct form of the word to every word in a sentence).


💡 **Text classification** involves assigning one or more labels to an entire sequence. Examples of text classification tasks include sentiment analysis (determining whether a document is positive, negative, or neutral). In this notebook, our Relation Extraction model is also a text classification model: it determines the relation type of a given (entity 1, entity 2, context) triple.

![NLP Tasks](images/nlp_tasks.jpg "NLP Tasks")


For our task we are going to use `Flair`, which is a Bidirectional LSTM model, for both Named Entity Recognition and Relation Extraction.

## 2.2. Supervised learning and the need for annotated data

The majority of feature extraction and representation learning-based models are **supervised learning** models, i.e. they learn from annotated data. Annotated data can be obtained via manual annotation using tools such as [Redcoat](https://nlp-tlp/redcoat) and [QuickGraph](https://quickgraph.nlp-tlp.org).

For this notebook, our NER model is trained using an annotated dataset of ~4k MWOs tagged by the [UWA Natural & Technical Language Group](https://nlp-tlp.org). 

# 3. Loading the data

We are going to be building a knowledge graph on a small sample set of work orders. This will not be seen by the Named Entity Recognition or Relation Extraction models prior to constructing the graph - the idea is to get our models to run *inference* over this dataset to automatically predict the entities, and relationships between the entities, to build a graph.

- `sample_work_orders.csv`: A csv file containing a set of work orders.

We are using the simple `csv` library to read in the data, though this can also be done using `pandas`.

In [3]:
from csv import DictReader
from helpers import print_table

work_order_file = "data/sample_work_orders.csv"

# A simple function to read in a csv file and return a list,
# where each element in the list is a dictionary of {heading : value}
def load_csv(filename):
    data = []
    with open(filename, 'r', newline='') as f:
        reader = DictReader(f)
        for row in reader:
            data.append(row)
    return data

        
work_order_data = load_csv(work_order_file)

# Let's have a look at 10 rows
print_table(work_order_data[30:40])

StartDate     FLOC          ShortText                              Cost   
----------------------------------------------------------------------------------------------------
              1234.02.11    broken handraill/h/s crows nest        3200   
              1234.02.12    broken hose on cylinder                3300   
              1234.02.13    broken l/h pulldown chain              3400   
              1234.02.14    broken locking pin                     3500   
17/08/2001    1234.02.15    broken tool wrench holding cylinder    3600   
              1234.02.16    build up rod support beak              3700   
              1234.02.17    bull hose air leak                     3800   
25/02/2006    1234.02.18    bull hose split                        3900   
              1234.02.19    busted hydraulic hose                  4000   
              1234.02.20    c spanner for minning                  4100   


&#x1F914; Can we directly use the raw data in NLP tasks?

&#x1F609; No, preprocessing is necessary, which helps in improving text consistency, reducing noise, and ensuring better analysis and understanding of the data by NLP models.

# 4. Preprocessing the data via Lexical Normalisation

Our maintenance work order data is quite noisy - there are spelling errors, typos, acronyms, abbreviations, and so on. This will present challenges to us later when it comes time to build the knowledge graph. So to deal with these issues, we should first clean the text using **lexical normalisation**.

In the interest of time/simplicity we are not going to use a neural model here, but instead we will use a simple lexicon-based normaliser. This model will simply replace a misspelled phrase with its correct form. This is not practical in the real world (as there's no way we could possibly build a lexicon of all possible misspellings) but it is good enough for our small example.

The following code imports our `LexiconNormaliser` model. This model simply replaces any predefined terms with their replacements, e.g. "puump" will be normalised to "pump", and so on.

If you're interested in seeing this lexicon, it's available under [data/lexicon_normalisation.csv](files/data/lexicon_normalisation.csv).

In [23]:
from helpers import LexiconNormaliser

lexicon_file = "data/lexicon_normalisation.csv"
normaliser = LexiconNormaliser(lexicon_file)

work_order_data = load_csv(work_order_file)

for i, row in enumerate(work_order_data):
    before = row['ShortText']    
    row['ShortText'] = normaliser.normalise(row['ShortText'])
    # Let's print some samples to have a look at the difference
    if i in (12, 32):
        print(before)
        print(row['ShortText'])
        print()

air /con very noisy
air conditioner very noisy

broken l/h pulldown chain
breakdown l/h pulldown chain
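
For readers curious about how a lexicon-based normaliser might work, here is a minimal sketch. It is not the implementation of the `LexiconNormaliser` helper (which we have not shown); the class name `SimpleLexiconNormaliser` and the in-memory lexicon are assumptions made for this example. It replaces the longest matching phrase first, so that multi-word entries such as "air /con" are handled.

```python
class SimpleLexiconNormaliser:
    """Replace known phrases with their normalised forms (longest match first)."""

    def __init__(self, lexicon):
        # Store each phrase as a tuple of tokens for fast lookup.
        self.lexicon = {tuple(k.split()): v for k, v in lexicon.items()}
        self.max_len = max((len(k) for k in self.lexicon), default=1)

    def normalise(self, text):
        tokens = text.split()
        output = []
        i = 0
        while i < len(tokens):
            # Try the longest candidate phrase starting at position i first.
            for n in range(min(self.max_len, len(tokens) - i), 0, -1):
                key = tuple(tokens[i:i + n])
                if key in self.lexicon:
                    output.append(self.lexicon[key])
                    i += n
                    break
            else:
                # No lexicon entry starts here, so keep the token unchanged.
                output.append(tokens[i])
                i += 1
        return " ".join(output)


# A hypothetical three-entry lexicon.
toy_normaliser = SimpleLexiconNormaliser(
    {"puump": "pump", "air /con": "air conditioner", "a/c": "air conditioner"}
)
print(toy_normaliser.normalise("air /con very noisy"))
# air conditioner very noisy
```

As noted above, a lookup like this only fixes the variants that someone has already listed, which is why it does not scale to real-world data.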



# 5. Named Entity Recognition

The next task is to extract the entities in the short text descriptions and construct nodes from those entities. This is how we are able to unlock the knowledge captured within the short text and combine it with the structured fields.

NER is a **sequence labelling** task. The choice of entity class depends on the corpus. NER on natural language corpora typically uses four classes (`Person`, `Organisation`, `Location`, `Miscellaneous`), though there are many schemas available. In our application, we will use the following three classes:

- **`Item`**: A maintainable item such as "exhaust".
- **`Activity`**: A maintenance activity performed on an item, such as "replace".
- **`Observation`**: An observation on an Item, such as "lagging".

![Sequence_Labelling](images/seq_label.png)

Note the **BIO** format being used here ("beginning", "inside", "outside"). **B** denotes that the word is the start of an entity, **I** denotes that the word is inside an entity, and **O** denotes that a word is not an entity.
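
To show how BIO labels translate back into entities, the sketch below groups consecutive `B-`/`I-` labels into mentions. The function name `bio_to_mentions` and the example labels are assumptions for illustration; the output keys mirror the mention format used later in this notebook, but this is not the helper code itself.

```python
def bio_to_mentions(tokens, labels):
    """Group BIO labels into mentions with start/end token offsets."""
    mentions = []
    current = None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A new entity begins; close any entity that was open.
            if current is not None:
                mentions.append(current)
            current = {"start": i, "end": i + 1, "labels": [label[2:]]}
        elif (label.startswith("I-") and current is not None
              and current["labels"][0] == label[2:]):
            # Extend the open entity by one token.
            current["end"] = i + 1
        else:
            # An 'O' label (or a stray 'I-') closes any open entity.
            if current is not None:
                mentions.append(current)
            current = None
    if current is not None:
        mentions.append(current)
    for m in mentions:
        m["phrase"] = " ".join(tokens[m["start"]:m["end"]])
    return mentions


example_tokens = ["repair", "cracked", "hydraulic", "tank"]
example_labels = ["B-Activity", "B-Observation", "B-Item", "I-Item"]  # hypothetical
for mention in bio_to_mentions(example_tokens, example_labels):
    print(mention)
```

Here "hydraulic tank" comes out as a single `Item` mention spanning tokens 2 to 4, because its second token carries an `I-` label.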

## 5.1. Load the dataset
The dataset to train and evaluate the NER model is split into three files:

- `ner_dataset/train.txt`: The dataset we will use to *train* the NER model to predict the entities appearing in each work order.
- `ner_dataset/dev.txt`: The dataset we will use to *validate* the quality of the model during training.
- `ner_dataset/test.txt`: The dataset we will use to *evaluate* the final performance of the NER model after training.

Let's take a quick look at one row of our training dataset to make sure it loads OK:

In [3]:
import os
from helpers import load_conll_dataset

NER_DATASET_PATH = "data/ner_dataset"
train_dataset = load_conll_dataset(os.path.join(NER_DATASET_PATH, 'train.txt'))

print(train_dataset[12])

Loaded 3201 documents from data/ner_dataset/train.txt.
{'tokens': ['batteries', 'flat'], 'labels': ['B-Item', 'B-Observation']}
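
For reference, a CoNLL-style file stores one token and its label per line, with a blank line between documents. The sketch below parses that layout from a string. It is a simplified stand-in, not the actual `load_conll_dataset` helper, and both the function name `parse_conll` and the sample text are assumptions.

```python
def parse_conll(text):
    """Parse 'token label' lines into a list of {'tokens', 'labels'} dicts."""
    documents = []
    tokens, labels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line marks the end of the current document.
            if tokens:
                documents.append({"tokens": tokens, "labels": labels})
                tokens, labels = [], []
            continue
        token, label = line.rsplit(maxsplit=1)
        tokens.append(token)
        labels.append(label)
    if tokens:
        documents.append({"tokens": tokens, "labels": labels})
    return documents


sample_conll = """batteries B-Item
flat B-Observation

replace B-Activity
pump B-Item
"""

parsed_documents = parse_conll(sample_conll)
print(parsed_documents[0])
# {'tokens': ['batteries', 'flat'], 'labels': ['B-Item', 'B-Observation']}
```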


## 5.2. Define our NER model

In this tutorial we will use [Flair](https://github.com/flairNLP/flair), which simplifies the process of building a deep learning model for a variety of NLP tasks.

The code below is a class representing a `FlairNERModel`. It has three methods: `train()`, `inference()`, and `load()`.

In [4]:
"""A Flair-based Named Entity Recognition model. Learns to predict entity
classes via deep learning."""

import os
import flair
from helpers.NERModel import NERModel
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from typing import List
from flair.visual.training_curves import Plotter
import torch


HIDDEN_SIZE = 256

# Check whether CUDA is available and set the device accordingly
if torch.cuda.is_available():
    flair.device = torch.device("cuda:0")
else:
    flair.device = torch.device("cpu")
print("Device:", flair.device)


class FlairNERModel(NERModel):
    """A Flair-based Named Entity Recognition model."""

    model_name: str = "Flair"

    def __init__(self):
        super().__init__()

        self.model = None

    def train(self, datasets_path: str, trained_model_path: str):
        """Train the Flair model on the given CoNLL datasets.

        Args:
            datasets_path (str): The folder containing the
              train, dev and test CoNLL-formatted datasets.
            trained_model_path (str): The folder to save the trained
              model to.
        """

        columns = {0: "text", 1: "ner"}
        corpus: Corpus = ColumnCorpus(
            datasets_path,
            columns,
            train_file="train.txt",
            dev_file="dev.txt",
            test_file="test.txt",
        )
        label_dict = corpus.make_label_dictionary(label_type="ner")

        # Train the sequence tagger
        embedding_types = [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]

        embeddings = StackedEmbeddings(embeddings=embedding_types)

        tagger = SequenceTagger(
            hidden_size=HIDDEN_SIZE,
            embeddings=embeddings,
            tag_dictionary=label_dict,
            tag_type="ner",
            use_crf=True,
        )

        trainer = ModelTrainer(tagger, corpus)

        sm = "cpu"
        if torch.cuda.is_available():
            sm = "gpu"
        trainer.train(
            trained_model_path,
            learning_rate=0.1,
            mini_batch_size=32,
            max_epochs=10,
            embeddings_storage_mode=sm,
        )

        plotter = Plotter()
        plotter.plot_weights(os.path.join(trained_model_path, "weights.txt"))

        self.load(os.path.join(trained_model_path, 'final-model.pt'))

    def inference(self, sent: list) -> dict:
        """Run inference on a single tokenised short text.

        Args:
            sent (list): The sentence (list of words).

        Returns:
            dict: The tagged sentence now in the form of {'tokens': [list],
                'labels': [list]}.

        Raises:
            ValueError: If the model has not yet been trained.
        """
        if self.model is None:
            raise ValueError(
                "The NER Model has not yet been trained. "
                "Please train/load this Flair model before proceeding."
            )

        # The sentence is already tokenised, so skip Flair's own tokeniser.
        sentence_obj = Sentence(sent, use_tokenizer=False)
        self.model.predict(sentence_obj)
        labels = ["O"] * len(sent)

        for entity in sentence_obj.get_spans("ner"):
            label = entity.get_label("ner").value
            for i, token in enumerate(entity):
                prefix = "B-" if i == 0 else "I-"

                # Token idx starts from 1 in Flair.
                labels[token.idx - 1] = prefix + label

        return {'tokens': sent, 'labels': labels}

    def load(self, model_path: str):
        """Load the model from the specified path.

        Args:
            model_path (str): The path of the saved model file to load,
              e.g. the final-model.pt written by train().
        """
        self.model = SequenceTagger.load(model_path)

Device: cuda:0


## 5.3. Training the model

In the interest of time, we will be using a simple dictionary-based model rather than the Flair-based model above. This dictionary model simply matches terms against a predefined dictionary. In real-world applications, we would always use Flair (or other deep learning models) as they are able to learn to differentiate between terms based on context.

In [4]:
from helpers import DictionaryNERModel

NER_DATASET_PATH = "data/ner_dataset"

dictionary_ner_model = DictionaryNERModel()
dictionary_ner_model.train(NER_DATASET_PATH, 'models/ner_models/dictionary')

Building dictionary...
Loaded 3201 documents from data/ner_dataset/train.txt.
Loaded 402 documents from data/ner_dataset/dev.txt.


If you wish to train the Flair-based model yourself (not recommended unless you have a GPU with CUDA enabled), the code for doing that is as follows:

    flair_ner_model = FlairNERModel()
    flair_ner_model.train(NER_DATASET_PATH, 'models/ner_models/flair')
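
To give a feel for what a dictionary-based tagger does, here is a minimal sketch. It is not the code of the `DictionaryNERModel` helper; the function names `build_dictionary` and `tag`, and the second training document, are assumptions made for this example. Each token is simply assigned the label it most often carried in the training data.

```python
from collections import Counter, defaultdict


def build_dictionary(train_docs):
    """Map each token to the label it most frequently has in the training data."""
    counts = defaultdict(Counter)
    for doc in train_docs:
        for token, label in zip(doc["tokens"], doc["labels"]):
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def tag(tokens, dictionary):
    """Label each token by dictionary lookup; unknown tokens get 'O'."""
    labels = [dictionary.get(token, "O") for token in tokens]
    return {"tokens": tokens, "labels": labels}


toy_train_docs = [
    {"tokens": ["batteries", "flat"], "labels": ["B-Item", "B-Observation"]},
    {"tokens": ["replace", "batteries"], "labels": ["B-Activity", "B-Item"]},
]
toy_dictionary = build_dictionary(toy_train_docs)

print(tag(["replace", "flat", "batteries", "now"], toy_dictionary))
# {'tokens': ['replace', 'flat', 'batteries', 'now'],
#  'labels': ['B-Activity', 'B-Observation', 'B-Item', 'O']}
```

Because every token is looked up in isolation, a tagger like this cannot use the surrounding words to disambiguate a term, which is the main advantage of a context-aware model such as Flair.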

## 5.4. Testing the model

The next step is to use our trained model to infer the entity type of each entity appearing in a list of previously unseen data.

In [5]:
tagged_bio_sents = []

model = dictionary_ner_model
#model = flair_ner_model # uncomment to use Flair

for row in work_order_data:
    sentence = row["ShortText"].split() # We must 'tokenise' the sentence first, i.e. split into words
    tagged_sent = model.inference(sentence) # Tag the tokens with whichever model was selected above
    tagged_bio_sents.append(tagged_sent)

# Print an example tagged sentence
print(tagged_bio_sents[22])

{'tokens': ['bent', 'handrail', 'at', 'back', 'cabin', 'door'], 'labels': ['B-Observation', 'B-Item', 'O', 'O', 'B-Item', 'B-Item']}


# 6. Extracting relations between the entities via Relation Extraction

We have extracted the entities appearing in each work order. The next step is to extract the relationships between those entities. We can do this using Relation Extraction.

For our application we are going to use the following relationships:

 - **HAS_OBSERVATION**: A relationship between an `Item` and an `Observation`.
 - **HAS_ACTIVITY**: A relationship between an `Item` and an `Activity`.
 - **APPEARS_WITH**: A relationship between an `Item` and another `Item`.

![alt text](images/building-relations.png "Building relations")

In the interest of time, we won't be going into the details of how to run Relation Extraction in this workshop. You are welcome to browse the source code of this repository if you are interested in diving into the finer details. The important thing to note is that the process is very similar to Named Entity Recognition:

 - First, load the dataset
 - Define a Relation Extraction model (we could use Flair, or any other deep learning-based model)
 - Train that Relation Extraction model on the dataset
 - Run 'inference' over unseen data in order to extract the relations between entities in the data
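
One step that is specific to Relation Extraction is building the candidate (entity 1, entity 2, context) inputs that the classifier labels. The sketch below enumerates every ordered pair of mentions in a document. The function name `candidate_pairs` is hypothetical, and the row layout (including the empty context when the second entity comes first) is inferred from the example rows printed in this section rather than taken from the helper code.

```python
from itertools import permutations


def candidate_pairs(tokens, mentions, doc_idx):
    """Build one candidate row per ordered pair of mentions in a document."""
    rows = []
    for (i, head), (j, tail) in permutations(enumerate(mentions), 2):
        # Use the text spanning both entities as context when head precedes tail.
        context = " ".join(tokens[head["start"]:tail["end"]]) if i < j else ""
        rows.append([
            head["phrase"], tail["phrase"],
            head["labels"][0], tail["labels"][0],
            context, i, j, doc_idx,
        ])
    return rows


toy_tokens = ["access", "audit", "repairs"]
toy_mentions = [
    {"start": 0, "end": 1, "labels": ["Item"], "phrase": "access"},
    {"start": 1, "end": 2, "labels": ["Activity"], "phrase": "audit"},
    {"start": 2, "end": 3, "labels": ["Activity"], "phrase": "repairs"},
]

pair_rows = candidate_pairs(toy_tokens, toy_mentions, doc_idx=10)
for row in pair_rows[:3]:
    print(row)
```

A trained classifier would then assign each row a relation type, or "O" when the two entities are unrelated.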

In [6]:
from helpers import run_relation_extraction

tagged_relations, tagged_sents = run_relation_extraction(tagged_bio_sents)

# Print a few example rows.
for row in tagged_relations[22:25]:
    print(row)


['access', 'audit', 'Item', 'Activity', 'access audit', 0, 1, 'HAS_ACTIVITY', 10]
['access', 'repairs', 'Item', 'Activity', 'access audit repairs', 0, 2, 'HAS_ACTIVITY', 10]
['audit', 'access', 'Activity', 'Item', '', 1, 0, 'O', 10]


For further implementation details, please see `helpers/RE.py` (for the Flair based model), and `helpers/run_relation_extraction.py` (for the data wrangling/running the RE).

# 7. Combining Named Entity Recognition + Relation Extraction

Now we have outputs from both the NER model and the RE model. The next step is to combine the two outputs. 


In [8]:
import json

for i, sent in enumerate(tagged_sents):
    
    # Note we only care about the relations that do not have the class "O".
    doc_relations = [row for row in tagged_relations if row[7] != "O" and row[8] == i]
    
    sent['relations'] = []    
    for row in doc_relations:
        rel = {'start': row[5], 'end': row[6], 'type': row[7]}     
        sent['relations'].append(rel)

# Let's print an example...
print(json.dumps(tagged_sents[11], indent=1))

{
 "tokens": [
  "adjust",
  "rotary",
  "head",
  "guides"
 ],
 "mentions": [
  {
   "start": 0,
   "labels": [
    "Activity"
   ],
   "end": 1,
   "phrase": "adjust"
  },
  {
   "start": 1,
   "labels": [
    "Item"
   ],
   "end": 2,
   "phrase": "rotary"
  },
  {
   "start": 2,
   "labels": [
    "Item"
   ],
   "end": 3,
   "phrase": "head"
  },
  {
   "start": 3,
   "labels": [
    "Item"
   ],
   "end": 4,
   "phrase": "guides"
  }
 ],
 "relations": [
  {
   "start": 1,
   "end": 0,
   "type": "HAS_ACTIVITY"
  },
  {
   "start": 1,
   "end": 2,
   "type": "HAS_ITEM"
  },
  {
   "start": 1,
   "end": 3,
   "type": "HAS_ITEM"
  },
  {
   "start": 2,
   "end": 0,
   "type": "HAS_ACTIVITY"
  },
  {
   "start": 2,
   "end": 1,
   "type": "HAS_ITEM"
  },
  {
   "start": 2,
   "end": 3,
   "type": "HAS_ITEM"
  },
  {
   "start": 3,
   "end": 0,
   "type": "HAS_ACTIVITY"
  },
  {
   "start": 3,
   "end": 1,
   "type": "HAS_ITEM"
  },
  {
   "start": 3,
   "end": 2,
   "type": "HAS_ITEM

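The loop above scans the full list of relations once per document, which is fine for our small sample. For larger datasets, one alternative is to group the relation rows by document index in a single pass, as in the sketch below. The function name `group_relations_by_doc` is our own, and the example rows are the ones printed in Section 6.

```python
from collections import defaultdict


def group_relations_by_doc(tagged_relations):
    """Group non-'O' relation rows by their document index (row[8])."""
    grouped = defaultdict(list)
    for row in tagged_relations:
        if row[7] != "O":
            grouped[row[8]].append(
                {"start": row[5], "end": row[6], "type": row[7]}
            )
    return grouped


example_rows = [
    ['access', 'audit', 'Item', 'Activity', 'access audit', 0, 1, 'HAS_ACTIVITY', 10],
    ['access', 'repairs', 'Item', 'Activity', 'access audit repairs', 0, 2, 'HAS_ACTIVITY', 10],
    ['audit', 'access', 'Activity', 'Item', '', 1, 0, 'O', 10],
]

relations_by_doc = group_relations_by_doc(example_rows)
print(relations_by_doc[10])
# [{'start': 0, 'end': 1, 'type': 'HAS_ACTIVITY'},
#  {'start': 0, 'end': 2, 'type': 'HAS_ACTIVITY'}]
```
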
# 8. Creating the graph

We now have a data structure that stores the tokens, entity mentions, and relationships between those mentions, for each document. The last step is to put it all into a Neo4j graph so that we can query this information.

There are two popular methods for doing this:

- Using `py2neo` to programmatically insert data into Neo4j
- Saving CSVs of your entities and relations, then reading them in via a `LOAD CSV` query in Neo4j
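
We use the first method in this notebook. For completeness, here is a rough sketch of the second: writing the unique entities to a CSV file and loading them with a `LOAD CSV` query. The file name, column names and the function `write_entities_csv` are assumptions for this example. The entity class is stored as a property here, since setting a node label from a CSV column generally needs extra tooling such as APOC.

```python
import csv
import os
import tempfile


def write_entities_csv(tagged_sents, path):
    """Write one row per unique (phrase, class) entity; return the row count."""
    seen = set()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "entity_class"])
        for sent in tagged_sents:
            for m in sent["mentions"]:
                node_id = f"{m['phrase']}__{m['labels'][0]}"
                if node_id not in seen:
                    seen.add(node_id)
                    writer.writerow([node_id, m["phrase"], m["labels"][0]])
    return len(seen)


# A Cypher query to run in Neo4j once entities.csv is in its import directory.
LOAD_ENTITIES_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///entities.csv' AS row
MERGE (e:Entity {_id: row.id})
SET e.name = row.name, e.entity_class = row.entity_class
"""

# Hypothetical tagged sentences, reduced to the fields this sketch needs.
example_sents = [
    {"mentions": [{"phrase": "adjust", "labels": ["Activity"]},
                  {"phrase": "rotary", "labels": ["Item"]}]},
    {"mentions": [{"phrase": "rotary", "labels": ["Item"]}]},
]

csv_path = os.path.join(tempfile.mkdtemp(), "entities.csv")
n_written = write_entities_csv(example_sents, csv_path)
print(n_written, "entities written to", csv_path)
```

Using `MERGE` rather than `CREATE` means that re-running the query does not duplicate entity nodes.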

> ⚠️ Before proceeding, make sure you have created a new graph in Neo4j and that your new Neo4j graph is running. You can do this by opening Neo4j Browser, clicking "Add" at the top-right, then creating a new graph database.


## Code for building the graph

The following code creates the entire graph using `py2neo`.

In [50]:
from py2neo import Graph
from py2neo.data import Node, Relationship
from dateutil.parser import parse as parse_date

GRAPH_PASSWORD = "password" # Set this to the password of your Neo4J graph


def get_node_id(phrase, entity_class):
    """A simple function to generate an id.
    This ensures an entity that can be different classes (pump for example) can have
    a unique node for each class type.
    """
    return f"{phrase}__{entity_class}"
    
# Dates are a little awkward in Neo4j - we have to convert it to an integer representation in Python.
# The APOC library has functions to handle this better.
def date_to_int(date):
    # The work order dates are in dd/mm/yyyy format, so parse them day-first;
    # otherwise an ambiguous date such as 03/04/2001 would be read as 4 March.
    parsed_date = parse_date(str(date), dayfirst=True)
    # e.g. 17/08/2001 becomes the integer 20010817
    return int(parsed_date.strftime("%Y%m%d"))
    
def create_graph(tagged_sents):
    """Build the Neo4j graph.
    We do this by iterating over each tagged_sentence, and constructing the
    graph as follows:
     - Create a node to represent the document itself.
     - Create nodes for each entity appearing in that document, if they have not
       already been created. Each unique combination of entity + class will be added, so
       pump (the Item) is different from pump (the Activity).
     - Create a relationship between each entity and each document in which it appears.
     - Create a relationship between each entity and each other entity it is related to,
       via the list of relations.
     
    Args:
        tagged_sents(list): The list of tagged sentences.
    """
    graph = Graph(password = GRAPH_PASSWORD)

    # We will start by deleting all nodes and edges in the current graph.
    # If we don't do this, we will end up with duplicate nodes and edges when running this script again.
    graph.delete_all() 

    tx = graph.begin()
    
    # Keep track of the created entity nodes.
    # We need a way to map the id of the nodes to the py2neo Node objects so that we can
    # easily create relationships between these nodes.
    created_entity_nodes = {}
    
    # Iterate over the list of tagged sentences and programmatically create the graph.
    for i, sent in enumerate(tagged_sents):
        
        # Let's first grab some properties from the original work order dataset.
        # We will add date and cost into our graph on the Document nodes.
        work_order_row = work_order_data[i]
        
        document_properties = {
            'name': " ".join(sent['tokens']),
            'cost': work_order_row['Cost']          
            
        }
        if work_order_row['StartDate'] != "":
            document_properties["date"] = date_to_int(work_order_row['StartDate'])

        
        
        # Create a node to represent the document.
        # Note that if you had additional properties in tagged_sents (such as dates, costs, etc)
        # you could add them as properties of the Document nodes here.
        document_node = Node("Document", **document_properties)
        tx.create(document_node)
        
        tokens = sent['tokens']
        mentions = sent['mentions']
        relations = sent['relations']
        
        for m in mentions:
            start = m['start']
            end = m['end']
            entity_class = m['labels'][0]        
            phrase = " ".join(tokens[start: end])     
                    
            # Create a node for this entity mention.
            # If the node has already been created (i.e. it exists in created_entity_nodes),
            # simply retrieve that Node from created_entity_nodes.
            # Otherwise, create it, and add it to created_entity_nodes.
            entity_node_id = get_node_id(phrase, entity_class)

            if entity_node_id in created_entity_nodes:
                entity_node = created_entity_nodes[entity_node_id]
            else:
                entity_node = Node("Entity", entity_class, _id=entity_node_id, name=phrase)
                created_entity_nodes[entity_node_id] = entity_node
                tx.create(entity_node)            
                        
                
            # Create a relationship between that node and the document
            # in which it appears.               
            r = Relationship(entity_node, "APPEARS_IN", document_node)
            tx.create(r)
            
        # Create relationships between each (entity_1, entity_2) in the
        # list of relations for this document.
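        for rel in relations:
            # Here 'start' and 'end' are indices into the mentions list, not token offsets.
            mention_1 = mentions[rel['start']]
            mention_2 = mentions[rel['end']]

            # Rebuild each phrase from the tokens, as in the mention loop above, so that
            # the node ids match the keys stored in created_entity_nodes.
            phrase_1 = " ".join(tokens[mention_1['start']: mention_1['end']])
            entity_class_1 = mention_1['labels'][0]

            phrase_2 = " ".join(tokens[mention_2['start']: mention_2['end']])
            entity_class_2 = mention_2['labels'][0]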
                       
            node_1 = created_entity_nodes[get_node_id(phrase_1, entity_class_1)]
            node_2 = created_entity_nodes[get_node_id(phrase_2, entity_class_2)]
            
            r = Relationship(node_1, rel['type'], node_2)
            tx.create(r)
    tx.commit()

create_graph(tagged_sents)        
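
The de-duplication rule used in `create_graph` (one node per unique combination of phrase and entity class) can be tried out without a Neo4j server. The sketch below is only an illustration: `make_node_key` is a hypothetical stand-in for the notebook's `get_node_id`, and the three example sentences and their labels are made up for the demonstration.

```python
def make_node_key(phrase, entity_class):
    """Hypothetical stand-in for get_node_id: one key per (phrase, class) pair."""
    return f"{phrase}::{entity_class}"


def collect_unique_entities(tagged_sents):
    """Return one record per unique (phrase, class) pair, in first-seen order."""
    unique_entities = {}
    for sent in tagged_sents:
        tokens = sent["tokens"]
        for mention in sent["mentions"]:
            phrase = " ".join(tokens[mention["start"]:mention["end"]])
            entity_class = mention["labels"][0]
            key = make_node_key(phrase, entity_class)
            # Only the first occurrence creates a record; later mentions reuse it.
            if key not in unique_entities:
                unique_entities[key] = {"name": phrase, "label": entity_class, "documents": []}
            # Every mention is linked to its document, mirroring APPEARS_IN.
            unique_entities[key]["documents"].append(" ".join(tokens))
    return unique_entities


# Made-up example data in the same shape as tagged_sents (illustrative labels only).
example_sents = [
    {"tokens": ["replace", "pump"],
     "mentions": [{"start": 0, "end": 1, "labels": ["Activity"]},
                  {"start": 1, "end": 2, "labels": ["Item"]}]},
    {"tokens": ["pump", "out", "sump"],
     "mentions": [{"start": 0, "end": 1, "labels": ["Activity"]},
                  {"start": 2, "end": 3, "labels": ["Item"]}]},
    {"tokens": ["pump", "leaking"],
     "mentions": [{"start": 0, "end": 1, "labels": ["Item"]},
                  {"start": 1, "end": 2, "labels": ["Observation"]}]},
]

entities = collect_unique_entities(example_sents)
for key, record in entities.items():
    print(key, "appears in", len(record["documents"]), "document(s)")
```

In this toy data, pump (the Item) and pump (the Activity) end up as two separate records, while the two mentions of pump as an Item share a single record linked to two documents.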

# 9. Querying the graph


Now that the graph has been created, we can query it in Neo4j. For example, you could run the following query to find out all activities performed on engines:

    MATCH (e:Entity {name: "engine"})-[r:HAS_ACTIVITY]->(a:Activity)
    RETURN e, r, a

We will run through some more examples in Neo4j during the workshop.

Note we can also query over structured data such as dates and costs. We will not be demonstrating this in the interest of time, but feel free to attempt the query yourself! Hint: The costs and dates are stored on the `Document` nodes, which are connected to the entities appearing in those documents via the `APPEARS_IN` relation.
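
If you would like a starting point, here is one possible way to approach the cost query from Python. It is a sketch rather than a reference answer: the query text, the `cost_summary` helper and the use of `toFloat` (in case the costs were loaded as strings) are all assumptions, and `graph` is expected to be a py2neo `Graph` connected to the database built in Section 8.

```python
# One possible Cypher query (an assumption, not the workshop's reference answer).
# DISTINCT guards against counting a document twice when an entity is mentioned
# more than once in it; toFloat() is used in case the cost is stored as a string.
COST_QUERY = """
MATCH (e:Entity {name: $name})-[:APPEARS_IN]->(d:Document)
WITH DISTINCT e, d
RETURN e.name AS entity,
       count(d) AS documents,
       sum(toFloat(d.cost)) AS total_cost
"""


def cost_summary(graph, entity_name):
    """Run COST_QUERY on a py2neo Graph and return the rows as a list of dicts."""
    return graph.run(COST_QUERY, name=entity_name).data()


# Usage, assuming a running Neo4j server and the GRAPH_PASSWORD used earlier:
#     from py2neo import Graph
#     graph = Graph(password=GRAPH_PASSWORD)
#     print(cost_summary(graph, "engine"))
```

The same pattern should extend to dates: the `date` property on each `Document` node is an integer, so it can be compared with `<` and `>` in a `WHERE` clause.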

## Acknowledgment

This work is supported by the Australian Research Council through the Centre for Transforming Maintenance through Data Science (grant number IC180100030), funded by the Australian Government.

We would like to thank Tyler Bikaun, Melinda Hodkiewicz and Wei Liu for their contributions to this notebook.

## References

- Baldwin, Timothy, Marie-Catherine De Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. "Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition." In Proceedings of the Workshop on Noisy User-generated Text, pp. 126-135. 2015.
- Cabot, Pere-Lluís Huguet, and Roberto Navigli. "REBEL: Relation extraction by end-to-end language generation." In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2370-2381. 2021.
- Eberts, Markus, and Adrian Ulges. "Span-based joint entity and relation extraction with transformer pre-training." arXiv preprint arXiv:1909.07755 (2019).
- Han, Bo, and Timothy Baldwin. "Lexical normalisation of short text messages: Makn sens a# twitter." In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 368-378. 2011.
- Josifoski, Martin, Nicola De Cao, Maxime Peyrard, and Robert West. "GenIE: generative information extraction." arXiv preprint arXiv:2112.08340 (2021).
- Lourentzou, Ismini, Kabir Manghnani, and ChengXiang Zhai. "Adapting sequence to sequence models for text normalization in social media." In Proceedings of the international AAAI conference on web and social media, vol. 13, pp. 335-345. 2019.
- Muller, Benjamin, Benoît Sagot, and Djamé Seddah. "Enhancing BERT for lexical normalization." In The 5th Workshop on Noisy User-generated Text (W-NUT). 2019.
- Nguyen, Hoang, and Sandro Cavallari. "Neural multi-task text normalization and sanitization with pointer-generator." In Proceedings of the First Workshop on Natural Language Interfaces, pp. 37-47. 2020.
- Samuel, David, and Milan Straka. "ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-Tuning ByT5." In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), pp. 483-492. Association for Computational Linguistics, 2021.
- Stewart, Michael, Wei Liu, Rachel Cardell-Oliver, and Rui Wang. "Short-text lexical normalisation on industrial log data." In 2018 IEEE International Conference on Big Knowledge (ICBK), pp. 113-122. IEEE, 2018.
- van der Goot, Rob, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, et al. "MultiLexNorm: A Shared Task on Multilingual Lexical Normalization." In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), pp. 493-509. Association for Computational Linguistics, 2021.
- van der Goot, Rob, and Gertjan van Noord. "Monoise: Modeling noise using a modular normalization system." arXiv preprint arXiv:1710.03476 (2017).
- van der Goot, Rob, Barbara Plank, and Malvina Nissim. "To normalize, or not to normalize: The impact of normalization on part-of-speech tagging." arXiv preprint arXiv:1707.05116 (2017).
- van der Goot, Rob, Rik van Noord, and Gertjan van Noord. "A Taxonomy for In-Depth Evaluation of Normalization for User Generated Content." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018. https://aclanthology.org/L18-1109.
- Yamada, Ikuya, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- Yan, Hang, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. "A Unified Generative Framework for Various NER Subtasks." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5808-5822. 2021.

# Appendix

## GQVis on Jupyter Lab

GQVis works out of the box in Jupyter Notebook (which is recommended). To get it working in Jupyter Lab, you'll need to run the following command before starting Jupyter Lab:

    jupyter labextension install jupyterlab_requirejs

## Installing Neo4j in Google Colab

Neo4j should work OK on the Google Colab notebook, but I have noticed it randomly uninstalls from time to time. If it does, there will be an obvious error message that appears when you attempt to run any of the cells in Section 8 (building the graph) and Section 9 (querying the graph).
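
If you would rather check directly whether the server is still up, a small connectivity test can help. The sketch below is an assumption on our part rather than part of the workshop material: `neo4j_is_reachable` is a hypothetical helper that only checks whether anything accepts TCP connections on Neo4j's default Bolt port (7687) on `localhost`.

```python
import socket


def neo4j_is_reachable(host="localhost", port=7687, timeout=2.0):
    """Return True if something accepts TCP connections on the given host and port."""
    try:
        # 7687 is Neo4j's default Bolt port; adjust it if your setup differs.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False


if neo4j_is_reachable():
    print("Neo4j appears to be running.")
else:
    print("Neo4j is not reachable; it may need to be reinstalled.")
```

A successful connection only shows that the port is open, not that the database is healthy, so treat it as a quick first check.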

To reinstall Neo4j, you can run the following code. Afterwards, the cells in Sections 8 and 9 should work OK again.

In [None]:
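# Stop the server if it is still running (this prints a harmless error if Neo4j is no longer installed).
!nj-4.4/bin/neo4j stop

# Installation steps adapted from:
# https://gist.github.com/korakot/328aaac51d78e589b4a176228e4bb06f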

!curl http://dist.neo4j.org/neo4j-community-4.4.0-unix.tar.gz -o neo4j.tar.gz

# decompress and rename
!tar -xf neo4j.tar.gz  # or --strip-components=1
!mv neo4j-community-4.4.0 nj-4.4

# disable password, and start server
!sed -i '/#dbms.security.auth_enabled/s/^#//g' nj-4.4/conf/neo4j.conf
!nj-4.4/bin/neo4j start

Note that there are other options such as Neo4j Aura (which allows you to run Neo4j in a cloud environment), but `gqvis` is not yet able to interact with it.