In [91]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\20182640\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Processing Label Studio Exports for Spacy
This notebook processes exports from Label Studio for use in Spacy.

In [92]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done using the `json` library and the predefined function `LoadFile` from the Preprocessor class.

In [93]:
export_data = preprocessor.loadFile('final_assignment_1.json')

# Filter out annotations for which a ground truth exists (drop other annotations for this article as well)
training_data_export = [item for item in export_data if any(annotation['ground_truth'] is False for annotation in item['annotations'])]
ground_truth_export = preprocessor.loadFile('label_studio_ground_truth_task1.json')

The exported data constists of all the annotations from a specific text of all the users. The first annotated text in the exported data file looks something like the following:

In [94]:
print("Length of training data: ", len(training_data_export))
print(training_data_export[0])

Length of training data:  122
{'id': 70476117, 'annotations': [{'id': 22591087, 'completed_by': {'id': 12716, 'email': 'f.a.ensink.op.kemma@student.tue.nl', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'a-cF8klU4-', 'type': 'labels', 'value': {'end': 8, 'text': 'Ephesus', 'start': 1, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'fUvEjMgG3v', 'type': 'labels', 'value': {'end': 73, 'text': 'Ancient Greece', 'start': 59, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'lg86acnG1f', 'type': 'labels', 'value': {'end': 127, 'text': 'Seluk', 'start': 122, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'WOx1LV2UFN', 'type': 'labels', 'value': {'end': 144, 'text': 'zmir Province', 'start': 131, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'laDHrMHSe0', 'type': 'labels', 'value': {'end': 152, '

### 1.2 Converting to Spacy training format:
To provide custom labels to Spacy, we need to convert the data to the following format:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (20, 23, "HEIGHT")]),
]
```

The `process_export` function from the preprocessor can fix this for us, it retrieves the training data in the given format, in combination with the labels and their relationship

In [95]:
# empty annotations.jsonl file
import os

save_path = os.path.join(ROOT_DIR, "rel_model/assets", "annotations.jsonl")
with open(save_path, "w") as file:
    file.write("")

training_data, training_relations = preprocessor.process_export_sentences(training_data_export)
validation_data, validation_relations = preprocessor.process_export_sentences(ground_truth_export, ground_truth = True)

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

The same can be done for the labels, which are stored in a dictionary. Lets only print the first one.

In [96]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1]["entities"], sep = "\n")

print("\n Validation data info item 1 \ntext:")
print(validation_data[0][0])
print("Labels:")
print(*validation_data[0][1]["entities"], sep = "\n")

Training data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 9, 'landmark_name']

 Validation data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 8, 'landmark_name']


### 1.4 Preparing the data for Spacy
The data is now in the correct format, so it can be processed and saved as Spacy training file using the `preprocess_spacy` function from the `Preprocessor` class.

In [97]:
# preprocessor.preprocess_spacy(training_data, warn = False)
preprocessor.preprocess_json(training_data = training_data, validation_data = validation_data)

In [98]:
# preprocessor.preprocess_json_rel(relational_annotations = training_relations)

# 2 Training the Spacy Model on on the training file

Start with importing all the spacy modules


In [99]:
import spacy
from spacy import displacy
import spacy
from spacy.cli.train import train

# If GPU is availabel, use it for training
spacy.prefer_gpu()

False

### 2.1 Spacy before training with custom labels

In [100]:
example_text = training_data[0][0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

print(nlp.pipe_names)

displacy.render(doc, style="ent")

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 2.2 Training a spacy model
Training of a spacy model usually is done via the command line. This is the reason for the following no so understandable lines of code. There are a few steps in the training process:
1. The spacy model needs a config file and all necessary files are in the spacy folder
2. The model needs training data, which was exported in this file above to the spacy folder
3. After training the model is evaluated and the results are printed for training and evaluation

In [101]:
import subprocess

subprocess.run("spacy project run convert", cwd="ner_model")

# subprocess.run("spacy project run create-config", cwd="ner_model")

CompletedProcess(args='spacy project run convert', returncode=0)

In [102]:
result = subprocess.run("spacy project run train", cwd="ner_model", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which is now
deprecated. Weasel will not validate your version of spaCy.
[1m
ℹ Skipping 'train': nothing changed



In [103]:
result = subprocess.run("spacy project run evaluate", cwd="ner_model", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which is now
deprecated. Weasel will not validate your version of spaCy.
[1m
ℹ Skipping 'evaluate': nothing changed



### 2.2 Visualizing the results
The model is now trained. This model can be loaded into spacy and an example text can be visualized with the NER.

In [104]:
options = {
    "colors": {"location": "lightyellow",
               "person_name": "lightgreen",
               "landmark_name": "lightred",
               "condition": "lightblue"}
}

# Now test teh newly created spacy model on a sample text and visualize it using spacy
nlp = spacy.load("ner_model/training/model-best/")

example_text = str([text[0] for i, text in enumerate(training_data) if text != ""])
doc = nlp(example_text)

displacy.render(doc, style="ent", jupyter=True, options=options)

# Show the tokens, their labels and their entities
for ent in doc.ents:
    print(ent.text, ent.label_)

built condition
17 number
ancient India location
ancient India location
Elephanta Island location
Durham location
Doana National Park landmark_name
Djoudj National Bird Sanctuary landmark_name
400 number
Dja Conservation Services landmark_name
350,000 number
Fifteen number
Sevilla location
Dazu District location
Danube Delta landmark_name
Romania location
Romania location
Kaliningrad Oblast location
Lithuania location
Russia location
Lithuania location
Latin America location
Como National Park landmark_name
Germany location
26 July 2021 date
Kolkheti National Park landmark_name
31,253 number
churches landmark_name
Moldavia location
Romania location
Churches of Chilo landmark_name
Spanish colonial architecture, type
churches landmark_name
Chitwan National Park landmark_name
park type
mammal species component
seven number
Slovakia location
17 number
park type
park type
Canal du Midi landmark_name
Canal Royal en Languedoc landmark_name
Canal du Midi landmark_name
Canal du Midi landmark_na