In [1]:
%load_ext autoreload
%autoreload 2

# Processing Label Studio Exports for spaCy
This notebook converts annotation exports from Label Studio into training data for spaCy, then trains and evaluates a custom NER model and a relation (REL) component.

In [2]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\20182640\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done using the `json` library and the predefined `loadFile` method of the `Preprocessor` class.

In [3]:
export_data = preprocessor.loadFile('final_assignment_1.json')

# Keep only the items that have at least one annotation not marked as ground truth
training_data_export = [item for item in export_data if any(annotation['ground_truth'] is False for annotation in item['annotations'])]
ground_truth_export = preprocessor.loadFile('label_studio_ground_truth_task1.json')
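
To make the filter above easier to follow, here is a minimal sketch on a made-up export. The three tasks and their ids are hypothetical; only the `annotations` list and the boolean `ground_truth` flag are taken from the export shown in this notebook. Note that a task with both a ground-truth and a regular annotation is kept by this condition.

```python
import json

# Hypothetical miniature export: each task carries a list of annotations,
# and each annotation has a boolean `ground_truth` flag.
raw = """
[
  {"id": 1, "annotations": [{"id": 10, "ground_truth": false}]},
  {"id": 2, "annotations": [{"id": 20, "ground_truth": true}]},
  {"id": 3, "annotations": [{"id": 30, "ground_truth": true},
                            {"id": 31, "ground_truth": false}]}
]
"""
tasks = json.loads(raw)


def has_regular_annotation(task):
    """True if at least one annotation is not marked as ground truth."""
    return any(a["ground_truth"] is False for a in task["annotations"])


# Same condition as the list comprehension in the cell above.
kept_ids = [task["id"] for task in tasks if has_regular_annotation(task)]
print(kept_ids)
```

If tasks that already have a ground truth should be dropped entirely, the condition would need `all(...)` instead of `any(...)`; which of the two is intended is not stated here.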

The exported data consists of the annotations made by all users for each text. The first annotated text in the export looks like this:

In [4]:
print("Length of training data: ", len(training_data_export))
print(training_data_export[0])

Length of training data:  122
{'id': 70476117, 'annotations': [{'id': 22591087, 'completed_by': {'id': 12716, 'email': 'f.a.ensink.op.kemma@student.tue.nl', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'a-cF8klU4-', 'type': 'labels', 'value': {'end': 8, 'text': 'Ephesus', 'start': 1, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'fUvEjMgG3v', 'type': 'labels', 'value': {'end': 73, 'text': 'Ancient Greece', 'start': 59, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'lg86acnG1f', 'type': 'labels', 'value': {'end': 127, 'text': 'Seluk', 'start': 122, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'WOx1LV2UFN', 'type': 'labels', 'value': {'end': 144, 'text': 'zmir Province', 'start': 131, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'laDHrMHSe0', 'type': 'labels', 'value': {'end': 152, '

### 1.2 Converting to spaCy training format
To train spaCy with custom labels, the data must be converted to the following format:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")]),
]
```

The `process_export_sentences` method of the preprocessor does this for us: it returns the training data in the format above, together with the relations between the labeled entities.
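
The internals of the preprocessor are not shown in this notebook, so the following is only a rough sketch of how such a conversion could work. The names `to_spacy_example` and `span_result` are hypothetical; the `type`, `value`, `start`, `end` and `labels` keys follow the Label Studio export printed earlier.

```python
def to_spacy_example(text, results):
    """Convert Label Studio span results to (text, [(start, end, label), ...])."""
    spans = []
    for result in results:
        if result.get("type") != "labels":
            continue  # ignore relations and other non-span results
        value = result["value"]
        for label in value["labels"]:
            spans.append((value["start"], value["end"], label))
    return text, sorted(spans)


def span_result(text, phrase, label):
    """Hypothetical helper: build one Label Studio-style result for `phrase`."""
    start = text.index(phrase)
    return {"type": "labels",
            "value": {"start": start, "end": start + len(phrase),
                      "text": phrase, "labels": [label]}}


text = "Ephesus was a city in Ancient Greece."
results = [
    span_result(text, "Ancient Greece", "location"),
    span_result(text, "Ephesus", "landmark_name"),
    {"type": "relation", "from_id": "a", "to_id": "b"},  # skipped by the converter
]
example = to_spacy_example(text, results)
print(example)
```

Sorting the spans by start offset is a design choice of this sketch, not something spaCy requires.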

In [5]:
# empty annotations.jsonl file
import os

save_path = os.path.join(ROOT_DIR, "rel_model/assets", "annotations.jsonl")
with open(save_path, "w") as file:
    file.write("")

training_data, training_relations = preprocessor.process_export_sentences(training_data_export)
validation_data, validation_relations = preprocessor.process_export_sentences(ground_truth_export, ground_truth = True)

### 1.3 Checking results
We can print the first item to check whether the conversion was done correctly.

Each item pairs a text with a dictionary that stores the labels under the `entities` key. Let's print only the first item of the training data and of the validation data.

In [6]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1]["entities"], sep = "\n")

print("\n Validation data info item 1 \ntext:")
print(validation_data[0][0])
print("Labels:")
print(*validation_data[0][1]["entities"], sep = "\n")

Training data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 9, 'landmark_name']

 Validation data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 8, 'landmark_name']
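
In the output above, the same entity ends at offset 9 in the training data and at offset 8 in the validation data. Whether that difference is intended is not stated, but a small check like the one below can make such cases visible. It is a sketch with a hypothetical helper, `check_spans`, that flags spans whose covered text starts or ends with whitespace.

```python
def check_spans(text, entities):
    """Return (start, end, label, covered_text, ok) for each span.

    A span counts as ok here if it lies inside the text and the covered
    substring has no leading or trailing whitespace.
    """
    report = []
    for start, end, label in entities:
        covered = text[start:end]
        ok = 0 <= start < end <= len(text) and covered == covered.strip()
        report.append((start, end, label, covered, ok))
    return report


# Beginning of the first sentence printed above.
text = " Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece"
report = check_spans(text, [[1, 8, "landmark_name"], [1, 9, "landmark_name"]])
for row in report:
    print(row)
```

Spans that include surrounding whitespace are worth checking before training, since entity boundaries that do not line up with token boundaries can be rejected or ignored when the training files are built.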


### 1.4 Preparing the data for Spacy
The data is now in the correct format, so it can be processed and saved as spaCy training files using the `preprocess_json` and `preprocess_json_rel` methods of the `Preprocessor` class.

In [7]:
preprocessor.preprocess_json(training_data = training_data, validation_data = validation_data)

In [8]:
preprocessor.preprocess_json_rel(relational_annotations = training_relations)

## 2. Training the spaCy Model on the Training File

Start by importing the spaCy modules.

### 2.1 spaCy before training with custom labels

In [10]:
import spacy
from spacy import displacy

example_text = training_data[0][0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

print(nlp.pipe_names)

displacy.render(doc, style="ent")



['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 2.2 Training a spaCy NER model
Training a spaCy model is usually done via the command line, which is why the following lines of code are not very self-explanatory. The training process has a few steps:
1. The spaCy model needs a config file; all necessary files are in the spacy folder.
2. The model needs training data, which was exported to the spacy folder earlier in this notebook.
3. After training, the model is evaluated and the results are printed for training and evaluation.
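
For comparison, the command-line form of the training step can be sketched as follows. The snippet only assembles the `python -m spacy train` call and prints it. The helper name `build_train_command` is hypothetical, and the paths are the ones used in the training cell of this notebook.

```python
import shlex
import sys


def build_train_command(config, output_dir, train_path, dev_path):
    """Assemble a `python -m spacy train` call as an argument list."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
    ]


cmd = build_train_command(
    "ner_model/configs/config.cfg",
    "ner_model/training/",
    "ner_model/corpus/train.spacy",
    "ner_model/corpus/dev.spacy",
)
print(shlex.join(cmd[1:]))  # shown without the interpreter path
# To actually run it: subprocess.run(cmd, check=True)
```

The `--paths.train` and `--paths.dev` options play the same role as the `overrides` dictionary passed to `train` in the notebook.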

In [11]:
# If a GPU is available, use it for training
spacy.prefer_gpu()

False

In [12]:
from ner_model.scripts.convert import convert as ner_convert

ner_convert("en", "ner_model/assets/train.json", "ner_model/assets/train.spacy")
ner_convert("en", "ner_model/assets/dev.json", "ner_model/assets/dev.spacy")


' Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.'


'It is listed as a UNESCO World Heritage Site since 1993.Name.Engelsberg Ironworks is named after Englika.'


'The site includes Durham Castle, Durham Cathedral, Durham University, Palace Green and University College, Durham.'


'It is named for the hundreds of paintings of hands stenciled, in multiple collages, on the rock walls.'


'The art was created in several waves between 7,300 BC and 700 AD, during the Archaic period of pre-Columbian South America.'


'The age of the paintings was calculated from the remains of bone pipes used for spraying the paint on the wall of the cave to create the artwork, radiocarbon dating of the artwork, and stratigraphic dating.'


'In total, the site consists of seven component parts  Kintrishi-Mtirala and Ispani in Adjara, Grigoleti and Imnati in Guria, and Pitshora, Nabada, and Churia in Same

ner_model/assets/train.spacy
ner_model/assets/dev.spacy



'The Monastery of Saint John of Rila, also known as Rila Monastery "Sveti Ivan Rilski" (), is the largest and most famous Eastern Orthodox monastery in Bulgaria.'


'The Villa Romana del Casale (Sicilian: "Villa Rumana d Casali") is a large and elaborate Roman villa or palace located about 3km from the town of Piazza Armerina, Sicily.'



In [13]:
from spacy.cli.train import train

train("ner_model/configs/config.cfg", output_path="ner_model/training/", overrides={"paths.train": "ner_model/corpus/train.spacy", "paths.dev": "ner_model/corpus/dev.spacy"})

ℹ Saving to output directory: ner_model\training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer',
'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  LOSS NER  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  SPEED   SCORE 
---  ------  ------------  -----------  -----------  --------  ---------  ------  ------  ------  ------  ------
  0       0          0.00         0.00         0.00     22.09       0.00    0.00    0.00    0.00  5829.28    0.00
  0     100          0.00         0.00         0.00    590.47       0.00    0.00    0.00    0.00  8189.41    0.00
  1     200          0.00         0.00         0.00    622.22       0.00    9.76   44.44    5.48  8072.70    0.02
  1     300          0.00         0.00         0.00    751.54       0.00   25.81   60.00   16.44  8266.03    0.06
  2     400          0.00         0.00         0.00    707.21       0.00   29.70   53.57   20.55  8411.67    0.07
  3     500          0.00         0.00         0.00    622.20   

In [14]:
from spacy.cli.evaluate import evaluate

evaluate("ner_model/training/model-best", "ner_model/corpus/dev.spacy", output="ner_model/training/metrics.json")

{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'tag_acc': None,
 'sents_p': None,
 'sents_r': None,
 'sents_f': None,
 'dep_uas': None,
 'dep_las': None,
 'dep_las_per_type': None,
 'pos_acc': None,
 'morph_acc': None,
 'morph_micro_p': None,
 'morph_micro_r': None,
 'morph_micro_f': None,
 'morph_per_feat': None,
 'lemma_acc': None,
 'ents_p': 0.7454545454545455,
 'ents_r': 0.5616438356164384,
 'ents_f': 0.640625,
 'ents_per_type': {'landmark_name': {'p': 0.7857142857142857,
   'r': 0.88,
   'f': 0.830188679245283},
  'date': {'p': 0.75, 'r': 0.6666666666666666, 'f': 0.7058823529411765},
  'number': {'p': 1.0, 'r': 0.6666666666666666, 'f': 0.8},
  'people': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'component': {'p': 0.5, 'r': 0.14285714285714285, 'f': 0.22222222222222224},
  'location': {'p': 1.0, 'r': 0.375, 'f': 0.5454545454545454},
  'type': {'p': 0.5454545454545454,
   'r': 0.42857142857142855,
   'f': 0.4799999999999999},
  'animal': {'p': 0.0, 'r': 0.0, 'f': 0.0
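
The `f` values in these metrics are the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick sanity check, the sketch below recomputes them from the `p` and `r` values printed above. The function name `f_score` is hypothetical, and returning 0.0 when both inputs are 0 is a convention chosen here to match the `people` and `animal` rows.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall = f_score(0.7454545454545455, 0.5616438356164384)   # ents_p, ents_r
landmark = f_score(0.7857142857142857, 0.88)                 # landmark_name
people = f_score(0.0, 0.0)                                   # people

print(overall, landmark, people)
```

The overall recall (about 0.56) is clearly lower than the precision (about 0.75), so the model misses entities more often than it predicts wrong ones.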

### 2.3 Visualizing the results
The model is now trained. It can be loaded into spaCy, and its NER predictions on an example text can be visualized.

In [15]:
options = {
    "colors": {"location": "lightyellow",
               "person_name": "lightgreen",
               "landmark_name": "lightcoral",  # "lightred" is not a valid CSS color name
               "condition": "lightblue"}
}

# Now test the newly created spaCy model on a sample text and visualize it using displaCy
nlp = spacy.load("ner_model/training/model-best/")

# Join the first 20 non-empty texts into one string (str() on a list would keep brackets and quotes)
example_text = " ".join(item[0] for item in training_data[:20] if item[0])
doc = nlp(example_text)

displacy.render(doc, style="ent", jupyter=True, options=options)

# Show each predicted entity and its label
for ent in doc.ents:
    print(ent.text, ent.label_)

built condition
Emas National Park landmark_name
park type
17 number
ancient India location
ancient India location


### 2.4 Training the spaCy custom REL component

In [27]:
from rel_model.scripts.parse_data import main as rel_convert

rel_convert(json_loc="rel_model/assets/annotations.json", train_file="rel_model/data/train.spacy", dev_file="rel_model/data/dev.spacy")

ValueError: [E090] Extension 'rel' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

In [57]:
from contextvars import ContextVar
from spacy.cli.train import train_cli

if spacy.prefer_gpu():
    train_cli(config_path="rel_model/configs/rel_trf.cfg", output_path="rel_model/training/", code_path="rel_model/scripts/custom_functions.py", ctx=ContextVar("args", default=0))
else:
    train_cli(config_path="rel_model/configs/rel_tok2vec.cfg", output_path="rel_model/training/", code_path="rel_model/scripts/custom_functions.py", ctx=ContextVar("args", default=-1))

AttributeError: '_contextvars.ContextVar' object has no attribute 'args'

In [42]:
from rel_model.scripts.evaluate import main as rel_evaluate

rel_evaluate("rel_model/training/model-best", "rel_model/data/dev.spacy", output="rel_model/training/metrics.json")

ModuleNotFoundError: No module named 'rel_pipe'
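
One possible cause of this error is that `rel_pipe.py` is not on Python's import path. The project layout is not shown here, so the sketch below rests on the assumption that the file lives in `rel_model/scripts/`. It adds that directory to `sys.path` before the import is retried.

```python
import sys
from pathlib import Path

# Assumption: rel_pipe.py sits in rel_model/scripts/ relative to the notebook.
scripts_dir = Path("rel_model/scripts").resolve()

if str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))

# False here means the assumption about the location is wrong.
print((scripts_dir / "rel_pipe.py").exists())
```

If the file is found, rerunning the import cell after this one should get past the `ModuleNotFoundError`. Other errors may still follow.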