# Knowledge Graphs Workshop: NLP exercise

## Step 1: Import the Python packages and download the NLTK data

In [None]:
# Imports
import pandas as pd
import helperFunctions as hf
import sys
import nltk
import spacy
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import Tree
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nlp = spacy.load('en_core_web_sm')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Assignment 1
In fields like maintenance, important information, such as when maintenance was performed or what was repaired, is often not registered properly but only jotted down as rough free text. Natural Language Processing (NLP) deals with making computers understand language. Using NLP, useful information and patterns can be extracted from text, which can even be used to predict when maintenance needs to happen. In this exercise, we will go over an aircraft maintenance dataset in which a maintenance engineer briefly described the problem the aircraft had and the action that was taken to fix it.

In this exercise, we will look at the following basic NLP concepts:
- Tokenizing
- Removing stopwords
- Lemmatization
- Part-of-speech (POS) tagging
- Inserting data into a knowledge graph


## Basic tokenization
A sentence means nothing to a computer. The first step toward making a computer understand text is to break it down into a list of words. This sounds easy enough, but splitting a string involves many decisions. For example, do you take punctuation or quotation marks into account? And how do you deal with contractions such as 'it's', 'haven't', and 'hasn't', which combine multiple words?

In [None]:
# The most basic approach is to split the sample on whitespace, but as you can see this already causes problems
example_sent = """This is a sample sentence, showing off basic tokenization. Words like 'it's', 'haven't', 'hasn't' are harder to correctly tokenize."""

### Task
Generate a list of tokens from the given example sentence. For instance, 'This is a sample sentence' would become ['this', 'is', 'a', 'sample', 'sentence']. You can either write your own tokenizer or use the word_tokenize function from the NLTK library.

In [None]:
### Solution
print(example_sent.split())
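
To see why splitting on whitespace is not enough, here is a minimal sketch of a regex-based tokenizer that uses only the standard library. The names `TOKEN_PATTERN` and `simple_tokenize` are our own and are not part of NLTK or the workshop helpers; the rule (letters with optional internal apostrophes) is just one of many possible tokenization decisions.

```python
import re

sentence = "Words like 'it's', 'haven't', 'hasn't' are harder to tokenize."

# Naive approach: split on whitespace only, so punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Regex approach: a token is a run of letters that may contain internal
# apostrophes (so "haven't" stays whole); all other characters are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def simple_tokenize(text):
    """Return lower-cased word tokens, keeping contractions as one token."""
    return [match.lower() for match in TOKEN_PATTERN.findall(text)]


regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Compare the two printed lists: the whitespace version keeps quotes and commas attached to the words, while the regex version drops them. A library tokenizer such as word_tokenize makes yet another set of choices, for example about how to split contractions.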

## Stopwords
Stopwords are frequently used words that are typically excluded in NLP because they are deemed to convey little or no meaningful information for analysis. Removing them simplifies text analysis: a list of the most frequent words is then no longer dominated by common words such as 'and', 'the', and 'than'.

In [None]:
# List of English stopwords provided by NLTK
stopword_list = stopwords.words('english')
print(stopword_list)
example_sent = """After every flight, the aircraft undergoes thorough maintenance checks to ensure optimal performance and safety for the next journey."""

### Task
Remove stopwords from the example sentence. Remember to first use your tokenize function!

In [None]:
#### Solution
word_tokens = word_tokenize(example_sent)
# Case-insensitive filter: lower-case each token before checking it
# against the stopword list, so that 'After' is removed as well
filtered_sentence = [w for w in word_tokens if w.lower() not in stopword_list]
# Case-sensitive variant for comparison: capitalized stopwords such as 'After' are kept
filtered_case_sensitive = [w for w in word_tokens if w not in stopword_list]
print(word_tokens)
print(filtered_sentence)
print(filtered_case_sensitive)
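
The effect of stopword removal on a word-frequency list can be sketched with the standard library alone. The stopword set below is a small hand-picked stand-in for NLTK's much longer English list, and the third entry is made up for illustration.

```python
from collections import Counter

# Small hand-picked stopword set, for illustration only;
# NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "of", "in", "is", "and", "to", "because"}

# Short log-style entries; the third one is invented for this sketch.
entries = [
    "the engine is leaking because of loose screws",
    "leak in engine because of a loose screw",
    "the pump and the valve in the engine",
]

tokens = [word for entry in entries for word in entry.split()]

# Frequencies with and without stopwords.
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in TOY_STOPWORDS)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

Without filtering, 'the' tops the list; after filtering, the most frequent word is 'engine', which says far more about the entries.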

## Lemmatization
Lemmatization is an NLP technique that reduces words to their base or dictionary form, known as the "lemma". The goal of lemmatization is to group together the different inflected forms of a word so that they can be analyzed as a single item. For example, when looking for the most frequently occurring words, it is useful to count lemmas rather than raw tokens, because we are interested in the underlying word rather than its inflection.

In [None]:
example_sentence_1 = "the engine is leaking because of loose screws"
example_sentence_2 = "leak in engine because of a loose screw"
example_sentence_3 = "leaks in engine because of loose screws"

### Task
Use spaCy to create lemmas of words. Play around with different sentences and discuss with your teammates how this can be used for analyzing NLP datasets. Remember that the sentences must be tokenized first; calling nlp(...) on a string does this for you.

In [None]:
##### Solution
# Process the text using spaCy
doc = nlp(example_sentence_1)
# Extract lemmatized tokens
lemmatized_tokens = [token.lemma_ for token in doc]
print(lemmatized_tokens)
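
Why lemmas help with counting can be shown with a toy example. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins written for this sketch; a real lemmatizer such as spaCy's covers the whole vocabulary and also uses context.

```python
from collections import Counter

# Hypothetical, hand-written lemma table covering only the words used below.
TOY_LEMMAS = {
    "leaking": "leak",
    "leaks": "leak",
    "screws": "screw",
    "is": "be",
}


def toy_lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are returned unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


sentences = [
    "the engine is leaking because of loose screws",
    "leak in engine because of a loose screw",
    "leaks in engine because of loose screws",
]

surface_counts = Counter()
lemma_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    surface_counts.update(tokens)
    lemma_counts.update(toy_lemmatize(tokens))

# The surface forms "leaking", "leak" and "leaks" are counted separately ...
print({w: surface_counts[w] for w in ("leaking", "leak", "leaks")})
# ... while their shared lemma is counted once per sentence.
print(lemma_counts["leak"], lemma_counts["screw"])
```

All three sentences describe the same problem, but only the lemma counts make that visible.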

## Part-of-speech tagging
Part-of-speech (POS) tagging assigns a grammatical category, or part of speech, to each word in a sentence. The objective is to analyze the syntactic structure of a sentence by determining the role each word plays in its context.

Accurately tagging a sentence is important because the tagged words can be used for analysis and machine learning.

The main parts of speech that can be tagged are nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. POS tags can be used alongside DEP labels. DEP stands for syntactic dependency, a linguistic concept that represents the grammatical and hierarchical relationships between the words in a sentence.


Additionally, some sentences are ambiguous and have multiple meanings depending on the context. A computer cannot resolve this ambiguity by itself and relies heavily on how the sentence is tagged.

While there are no specific tasks in this section, it is essential to discuss within your group what is happening and how this understanding can be utilized to analyze maintenance tasks.


In [None]:
doc = nlp(example_sentence_1)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("DEP:", [f'{token.lemma_} {token.dep_}' for token in doc])
print("POS:", [f'{token.lemma_} {token.pos_}' for token in doc])
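
One way POS tags could be put to use on maintenance text is to separate candidate components (nouns) from candidate events (verbs). The sketch below works on (lemma, POS) pairs that we tagged by hand with Universal POS labels; they are not actual model output, and a tagger may label some words differently. The function name `group_by_pos` is our own.

```python
from collections import defaultdict

# (lemma, POS) pairs tagged by hand for illustration; not actual model output.
tagged_sentence = [
    ("the", "DET"), ("engine", "NOUN"), ("be", "AUX"), ("leak", "VERB"),
    ("because", "SCONJ"), ("of", "ADP"), ("loose", "ADJ"), ("screw", "NOUN"),
]


def group_by_pos(tagged_tokens):
    """Collect lemmas per POS tag, preserving sentence order."""
    groups = defaultdict(list)
    for lemma, pos in tagged_tokens:
        groups[pos].append(lemma)
    return dict(groups)


groups = group_by_pos(tagged_sentence)
components = groups.get("NOUN", [])  # candidate parts or systems
events = groups.get("VERB", [])      # candidate problems or actions

print("Nouns:", components)
print("Verbs:", events)
```

With real data, the pairs would come from `(token.lemma_, token.pos_)` for each token in a spaCy doc, as in the cell above.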


### Let's print a parse tree to show the syntactic structure of a sentence
Let's use the same example sentences as before to see how different sentences can have similar trees.
A parse tree consists of the following:
- Root Node: Represents the entire sentence.
- Intermediate Nodes: Represent phrases or constituents, such as noun phrases (NP), verb phrases (VP), etc.
- Leaf Nodes: Represent individual words.


In [None]:
hf.print_nltk_tree(nlp(example_sentence_1))
hf.print_nltk_tree(nlp(example_sentence_2))
hf.print_nltk_tree(nlp(example_sentence_3))
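
How a tree can be assembled from per-token head pointers is sketched below with plain Python. We do not know how `hf.print_nltk_tree` is implemented, so this is only an illustration of the idea; the head indices are assigned by hand and are not actual parser output. The convention that the root points to itself mirrors spaCy, where the root token's `head` is the token itself.

```python
# Each token is (text, head_index); the root points to itself.
# Heads are assigned by hand for illustration, not taken from a parser.
tokens = [
    ("engine", 1),   # 0: attached to "leaking"
    ("leaking", 1),  # 1: root
    ("loose", 3),    # 2: modifies "screws"
    ("screws", 1),   # 3: attached to "leaking"
]


def build_tree(tokens):
    """Return (root_index, children), where children maps index -> child indices."""
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, (_, head) in enumerate(tokens):
        if head == i:
            root = i
        else:
            children[head].append(i)
    return root, children


def render(tokens, children, index, depth=0):
    """Return the subtree under `index` as indented lines, one token per line."""
    lines = ["  " * depth + tokens[index][0]]
    for child in children[index]:
        lines.extend(render(tokens, children, child, depth + 1))
    return lines


root, children = build_tree(tokens)
tree_lines = render(tokens, children, root)
print("\n".join(tree_lines))
```

Here "leaking" is the root, "screws" is an intermediate node because it has a child, and "engine" and "loose" are leaves.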

### Task
Discuss with your group why the different sentences have the same tree structure and how this can be useful for analysis.

In [None]:
# The trees are the same after removing the stopwords!
hf.print_nltk_tree(nlp(hf.remove_stopwords(example_sentence_2)))
hf.print_nltk_tree(nlp(hf.remove_stopwords(example_sentence_3)))

# Assignment 2: Aircraft dataset
Now, let's apply the knowledge gained from Assignment 1 and the information provided in the knowledge base slides to analyze an aircraft maintenance dataset. Due to time constraints, we have already coded it for you. If you have time left after completing all exercises, feel free to experiment, enhance the code, and create your own knowledge graph.

### Let's get some insights about the dataset
As you can see, each record describes a problem that occurred in the aircraft, such as "ENGINE IDLE OVERRIDE KILLED ENGINE", and the action a maintenance engineer took to fix it, such as "REMOVED & REPLACE FUEL SERVO".

In [None]:
import pandas as pd
from itables import show
df = pd.read_csv('Aircraft_Annotation_DataFile.csv')
df.columns = [c.lower() for c in df.columns]
df['problem'] = df['problem'].str.strip('.').str.strip()
df['action'] = df['action'].str.strip('.').str.strip()
show(df)

Let's analyze the aircraft dataset by looking at the most used verbs in each sentence. This may give valuable insights.

PS: also look for potential mistakes in the results and think about why these mistakes might happen.

In [None]:
hf.display_most_used(df['action'].iloc[0:200])

In [None]:
hf.display_most_used(df['problem'].iloc[0:100])

Task: Discuss the results of this graph with your group. What conclusions can you draw, and how can this be used in relation to knowledge graphs?

## Putting it all inside a knowledge base
Adapt the code from the previous exercise to transform your parse trees into graphs and load them into the Knowledge Base.

In [None]:
# Your code here

def create_problem_obj(row):
    # `row` is one DataFrame row, passed in by df.apply(..., axis=1) below
    pass

# g = Graph()
# g.namespace_manager.bind('', zorro)
# for obj in df.apply(create_problem_obj, axis=1):
#     for t in obj_to_triples(obj):
#         g.add(t)
# g.serialize('nlp_graph.ttl')
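
As a starting point, the step from a record to triples could look like the sketch below. It uses plain tuples and strings so it runs without any graph library. The base IRI is taken from the graph name used in this notebook; the class and property names (`Problem`, `hasText`, `hasComponent`), the dict layout, and the `components` field are hypothetical and not part of the workshop code.

```python
# Base IRI taken from the graph name used in this notebook; the class and
# property names (Problem, hasText, hasComponent) are hypothetical.
BASE = "https://zorro-project.nl/example/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def create_problem_obj(row):
    """Build a plain dict describing one record (a sketch, not the workshop solution)."""
    return {
        "id": f"problem_{row['index']}",
        "text": row["problem"],
        "components": row["components"],
    }


def obj_to_triples(obj):
    """Yield (subject, predicate, object) string triples for one problem object."""
    subject = BASE + obj["id"]
    yield (subject, RDF_TYPE, BASE + "Problem")
    yield (subject, BASE + "hasText", obj["text"])
    for component in obj["components"]:
        yield (subject, BASE + "hasComponent", BASE + component)


# Example record: the problem text comes from the dataset description,
# the component list is filled in by hand.
row = {"index": 0, "problem": "ENGINE IDLE OVERRIDE KILLED ENGINE", "components": ["engine"]}
triples = list(obj_to_triples(create_problem_obj(row)))
for triple in triples:
    print(triple)
```

In the commented-out code above, each triple would instead be added to an rdflib `Graph`, with IRIs wrapped in `URIRef` and plain text in `Literal`; the components could come from the nouns found by the POS tagger.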

In [None]:
%load_ext ipython_sparql_pandas
from helperFunctions import GraphDB

db = GraphDB()
repo_name = 'zorro'
# Print the response; a bare expression in the middle of a cell is not displayed
print(db.create_repo(repo_name).text)

response = db.load_data(repo_name, 'nlp_graph.ttl', 
          graph_name = "https://zorro-project.nl/example/NLPGraph")
print(response.text)