In [29]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\20182640\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Processing Label Studio Exports for Spacy
This notebook processes exports from Label Studio for use in Spacy.

In [30]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done using the `json` library and the predefined function `LoadFile` from the Preprocessor class.

In [31]:
export_data = preprocessor.loadFile('final_assignment_1.json')

# Filter out annotations for which a ground truth exists (drop other annotations for this article as well)
training_data_export = [item for item in export_data if all(annotation['ground_truth'] is False for annotation in item['annotations'])]
ground_truth_export = preprocessor.loadFile('label_studio_ground_truth_task1.json')

The exported data constists of all the annotations from a specific text of all the users. The first annotated text in the exported data file looks something like the following:

In [32]:
print("Length of training data: ", len(training_data_export))
print(training_data_export[0])

Length of training data:  102
{'id': 70476118, 'annotations': [{'id': 23190706, 'completed_by': {'id': 12634, 'email': 'n.p.g.t.v.beuningen@student.tue.nl', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'fLVAxBL9tN', 'type': 'labels', 'value': {'end': 20, 'text': 'Engelsberg Ironworks', 'start': 0, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'FvaT_3g4g6', 'type': 'labels', 'value': {'end': 39, 'text': 'ironworks', 'start': 30, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'Kw8qk6PrU_', 'type': 'labels', 'value': {'end': 53, 'text': 'ngelsberg', 'start': 44, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'Ywe08mn0Fw', 'type': 'labels', 'value': {'end': 90, 'text': 'Fagersta Municipality', 'start': 69, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'e3U09m2Wlj', 'type': 'labels', 'value': {

### 1.2 Converting to Spacy training format:
To provide custom labels to Spacy, we need to convert the data to the following format:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (20, 23, "HEIGHT")]),
]
```

The `process_export` function from the preprocessor can fix this for us, it retrieves the training data in the given format, in combination with the labels and their relationship

In [33]:
training_data, training_relations = preprocessor.process_export_sentences(training_data_export)
validation_data, validation_relations = preprocessor.process_export_sentences(ground_truth_export)

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

In [34]:
for i, (label, relations) in enumerate(training_relations.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Engelsberg Ironworks', 'ironworks')
Relations: ['org:is_type']

Labels: ('Engelsberg Ironworks', 'ngelsberg')
Relations: ['org:located_in']

Labels: ('Engelsberg Ironworks', 'Fagersta Municipality')
Relations: ['org:located_in']

Labels: ('Engelsberg Ironworks', 'Vstmanland')
Relations: ['org:located_in']

Labels: ('Engelsberg Ironworks', 'Sweden')
Relations: ['org:located_in']



The same can be done for the labels, which are stored in a dictionary. Lets only print the first one.

In [35]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1]["entities"], sep = "\n")

print("\n Validation data info item 1 \ntext:")
print(validation_data[0][0])
print("Labels:")
print(*validation_data[0][1]["entities"], sep = "\n")

Training data info item 1 
text:
Engelsberg Ironworks () is an ironworks in "ngelsberg", a village in Fagersta Municipality in Vstmanland, Sweden.
Labels:
[0, 20, 'landmark_name']

 Validation data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 8, 'landmark_name']


### 1.4 Preparing the data for Spacy
The data is now in the correct format, so it can be processed and saved as Spacy training file using the `preprocess_spacy` function from the `Preprocessor` class.

In [37]:
# preprocessor.preprocess_spacy(training_data, warn = False)
preprocessor.preprocess_json(training_data = training_data, validation_data = validation_data)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\20182640\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2 Training the Spacy Model on on the training file

Start with importing all the spacy modules


In [38]:
import spacy
from spacy import displacy
import spacy
from spacy.cli.train import train

# If GPU is availabel, use it for training
spacy.prefer_gpu()

False

### 2.1 Spacy before training with custom labels

In [39]:
example_text = training_data[0][0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

print(nlp.pipe_names)

displacy.render(doc, style="ent")

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 2.2 Training a spacy model
Training of a spacy model usually is done via the command line. This is the reason for the following no so understandable lines of code. There are a few steps in the training process:
1. The spacy model needs a config file and all necessary files are in the spacy folder
2. The model needs training data, which was exported in this file above to the spacy folder
3. After training the model is evaluated and the results are printed for training and evaluation

In [40]:
import subprocess

subprocess.run("spacy project run convert", cwd="ner_model")

# subprocess.run("spacy project run create-config", cwd="ner_model")

CompletedProcess(args='spacy project run convert', returncode=0)

In [41]:
result = subprocess.run("spacy project run train", cwd="ner_model", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which is now
deprecated. Weasel will not validate your version of spaCy.
[1m
Running command: 'C:\Users\20182640\.virtualenvs\Text-Mining-xR8YyNgY\Scripts\python.exe' -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
ℹ Saving to output directory: training
ℹ Using CPU
[1m
✔ Initialized pipeline
[1m
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer',
'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  LOSS NER  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  SPEED   SCORE 
---  ------  ------------  -----------  -----------  --------  ---------  ------  ------  ------  ------  ------
  0       0          0.00         0.00         0.00     23.18       0.00    0.00    0.00    0.00  5761.03    0.00
  0     100          0.00         0.00         0.00    595.55       0.00    0.00    0.00    0.00  8143.99    0.00
 

In [42]:
result = subprocess.run("spacy project run evaluate", cwd="ner_model", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which is now
deprecated. Weasel will not validate your version of spaCy.
[1m
Running command: 'C:\Users\20182640\.virtualenvs\Text-Mining-xR8YyNgY\Scripts\python.exe' -m spacy evaluate training/model-best corpus/dev.spacy --output training/metrics.json
ℹ Using CPU
[1m

TOK      100.00
TAG      -     
POS      -     
MORPH    -     
LEMMA    -     
UAS      -     
LAS      -     
NER P    51.11 
NER R    31.51 
NER F    38.98 
SENT P   -     
SENT R   -     
SENT F   -     
SPEED    3967  

[1m

                    P       R       F
landmark_name   62.96   68.00   65.38
date            60.00   33.33   42.86
number          33.33   33.33   33.33
type            25.00   14.29   18.18
people           0.00    0.00    0.00
component        0.00    0.00    0.00
location         0.00    0.00    0.00
animal           0.00    0.00    0.00
condition        0.00    0.00    0.00

✔ Saved results to training\metrics.json



### 2.2 Visualizing the results
The model is now trained. This model can be loaded into spacy and an example text can be visualized with the NER.

In [43]:
options = {
    "colors": {"location": "lightyellow",
               "person_name": "lightgreen",
               "landmark_name": "lightred",
               "condition": "lightblue"}
}

# Now test teh newly created spacy model on a sample text and visualize it using spacy
nlp = spacy.load("ner_model/training/model-best/")

example_text = str([text[0] for i, text in enumerate(training_data) if text != ""])
doc = nlp(example_text)

displacy.render(doc, style="ent", jupyter=True, options=options)

# Show the tokens, their labels and their entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Emas National Park landmark_name
park type
17 number
park type
Djoudj National Bird Sanctuary landmark_name
400 number
Reserve landmark_name
park type
park type
cemeteries type
350,000 number
Danube Delta landmark_name
Kaliningrad Oblast location
Russia location
Lithuania location
city type
Como National Park landmark_name
Germany location
26 July 2021 date
churches landmark_name
Moldavia location
Churches of Chilo landmark_name
churches type
park type
seven number
Slovakia location
caves type
17 number
Casbah landmark_name
park type
park type
park type
Canal du Midi landmark_name
Canal Royal en Languedoc landmark_name
Byblos landmark_name
city type
Wawaskesy National Park landmark_name
Bryggen landmark_name
Bryggen landmark_name
89 number
cathedral type
from 1195 date
cathedral type
forest type
park type
Banc d'Arguin National Park landmark_name
city type
34,685 number
park type
park type
Sungas landmark_name
Sakarwar Rajputs landmark_name
park type
collection type
Humahuaca location
