In [44]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Processing Label Studio Exports for spaCy
This notebook processes exports from Label Studio for use in spaCy.

In [45]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done with the predefined `loadFile` method of the `Preprocessor` class, which relies on the `json` library.

In [46]:
# export_data = preprocessor.loadFile('label_studio_test_notebook_231002.json')
export_data = preprocessor.loadFile('label_studio_test_notebook_290923.json')
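
The internals of `loadFile` are not shown in this notebook. As a rough sketch, assuming it simply reads a JSON file from a data directory, an equivalent helper could look like the following. The names `load_export` and `data_dir` are hypothetical and not part of the `Preprocessor` class.

```python
import json
from pathlib import Path


def load_export(filename, data_dir="."):
    """Read a Label Studio JSON export and return the parsed list of tasks.

    Hypothetical stand-in for Preprocessor.loadFile; the real method may differ.
    """
    path = Path(data_dir) / filename
    with path.open(encoding="utf-8") as handle:
        tasks = json.load(handle)
    # The export used in this notebook is a list with one entry per annotated text.
    if not isinstance(tasks, list):
        raise ValueError(f"Expected a list of tasks in {path}, got {type(tasks).__name__}")
    return tasks
```

Opening the file with an explicit `encoding="utf-8"` avoids platform-dependent decoding of non-ASCII characters in the annotated texts.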

The exported data consists of the annotations made by all users, grouped per text. The first annotated text in the export file looks like this:

In [47]:
print(export_data[0])

{'id': 70476151, 'annotations': [{'id': 22565484, 'completed_by': {'id': 12641, 'email': 'd.p.m.v.d.hoorn@student.tue.nl', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'tBYLAMaY7Q', 'type': 'labels', 'value': {'end': 18, 'text': 'Chartres Cathedral', 'start': 0, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'rXo1WB9tDL', 'type': 'labels', 'value': {'end': 71, 'text': 'Cathedral of Our Lady of Chartres', 'start': 38, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'jt997QSCPu', 'type': 'labels', 'value': {'end': 96, 'text': 'Catholic church', 'start': 81, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'eWtbXE3SqW', 'type': 'labels', 'value': {'end': 108, 'text': 'Chartres', 'start': 100, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': '8FJnNiKvM2', 'type': 'labels', 'value': {'end': 116, 

### 1.2 Converting to spaCy training format
To provide custom labels to spaCy, we need to convert the data to the following format, where each label is a `(start, end, label)` tuple of character offsets into the text:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")]),
]
```

The `process_export` method of the preprocessor does this for us. It returns the training data in the format above, together with the relations between the labelled spans.

In [48]:
training_data, relation_data = preprocessor.process_export(export_data)
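
The implementation of `process_export` is not shown here. The sketch below illustrates how one task could be converted, under two assumptions that the truncated output above does not confirm: the raw text is stored under `task["data"]["text"]`, and relations appear as result entries of type `"relation"` with `from_id`, `to_id` and `labels` fields. The function name `convert_task` and the relation label in the example are made up for illustration.

```python
def convert_task(task, text_key="text"):
    """Convert one Label Studio task into (text, entities) plus a relation dict.

    Assumed layout (not confirmed by the notebook): text under
    task["data"][text_key]; relations as results of type "relation".
    """
    text = task["data"][text_key]
    entities = set()   # (start, end, label); a set drops duplicates across annotators
    span_text = {}     # result id -> annotated text, used to resolve relations
    relations = {}     # (from_text, to_text) -> list of relation labels

    for annotation in task["annotations"]:
        results = annotation["result"]
        # First pass: collect the labelled spans.
        for result in results:
            if result["type"] == "labels":
                value = result["value"]
                span_text[result["id"]] = value["text"]
                for label in value["labels"]:
                    entities.add((value["start"], value["end"], label))
        # Second pass: resolve relations between the collected spans.
        for result in results:
            if result["type"] == "relation":
                source = span_text.get(result["from_id"])
                target = span_text.get(result["to_id"])
                if source is None or target is None:
                    continue  # relation refers to something that is not a labelled span
                labels = relations.setdefault((source, target), [])
                for label in result.get("labels", []):
                    if label not in labels:
                        labels.append(label)

    return (text, sorted(entities)), relations


# Made-up example task following the assumed layout.
task = {
    "data": {"text": "Tokyo Tower is 333m tall."},
    "annotations": [{"result": [
        {"id": "a", "type": "labels",
         "value": {"start": 0, "end": 11, "text": "Tokyo Tower", "labels": ["BUILDING"]}},
        {"id": "b", "type": "labels",
         "value": {"start": 15, "end": 19, "text": "333m", "labels": ["HEIGHT"]}},
        {"id": "r", "type": "relation", "from_id": "a", "to_id": "b",
         "labels": ["has_height"]},
    ]}],
}
example, example_relations = convert_task(task)
print(example)
print(example_relations)
```

The two-pass structure matters because a relation only carries the ids of its endpoints, so the spans have to be collected before the relations can be resolved to text.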

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

In [49]:
for i, (label, relations) in enumerate(relation_data.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Chartres Cathedral', 'Cathedral of Our Lady of Chartres')
Relations: ['org:is_similar_to']

Labels: ('Chartres Cathedral', 'Catholic church')
Relations: ['org:is_type']

Labels: ('Chartres Cathedral', 'Chartres')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'France')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'High Gothic and Classic Gothic architecture')
Relations: ['org:is_type']



The same can be done for the training data, which is a list of `(text, labels)` tuples. Let's print only the first item.

In [50]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1], sep = "\n")

Training data info item 1 
text:
Chartres Cathedral, also known as the Cathedral of Our Lady of Chartres (), is a Catholic church in Chartres, France, about southwest of Paris, and is the seat of the Bishop of Chartres. Mostly constructed between 1194 and 1220, it stands on the site of at least five cathedrals that have occupied the site since the Diocese of Chartres was formed as an episcopal see in the 4th century. It is one of the best-known and most influential examples of High Gothic and Classic Gothic architecture, It stands on Romanesque basements, while its north spire is more recent (15071513) and is built in the more ornate Flamboyant style.
Labels:
(0, 18, 'landmark_name')


### 1.4 Preparing the data for Spacy
The data is now in the correct format, so it can be processed and saved as a spaCy training file using the `preprocess_spacy` method of the `Preprocessor` class.

In [51]:
preprocessor.preprocess_spacy(training_data, warn = False)

(British expedition.,) people
(after the end of the Second World War,) date
(Paris,) location
(snakes,) animal
(Humboldt,) landmark_name
(Roman period,) date
(Central Island,) component
(Lepcis Magna,) location
(episcopal,) people
(148,) number
(sixth or fifth century B.C.,) date
(Lake Turkana National Parks,) landmark_name
(Phoenician settlers,) people
(Phoenician trading-post,) type
(Catholic church,) type
(Queensland,) location
(Lake Turkana National Parks,) landmark_name
(2001,) date
(Chartres Cathedral,) landmark_name
(Tripoli,) location
(constructed,) condition
(the Bishop of Chartres,) occupation
(Sibiloi National Park,) component
(Cathedral of Our Lady of Chartres,) landmark_name
(Numidian Kingdom of Massinissa,) location
(Chartres,) location
(built,) condition
(The archaeological site of Sabratha,) landmark_name
(Flamboyant style,) type
(north spire,) component
(the period of the British Military Administration,) date
(seventh century A.D.,) date
(Sabratha,) location
(Emerald,

## 2. Training the spaCy Model on the Training File

Start by importing the required spaCy modules.


In [52]:
import spacy
from spacy import displacy
from spacy.cli.train import train

# If a GPU is available, use it for training
spacy.prefer_gpu()

False

### 2.1 spaCy before training with custom labels

In [53]:
example_text = training_data[0][0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

print(nlp.pipe_names)

displacy.render(doc, style="ent")

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 2.2 Training a spaCy model
spaCy models are usually trained from the command line, which is why the following cells call the spaCy CLI through `subprocess`. The training process has a few steps:
1. The spaCy model needs a config file; this and all other necessary files are in the `spacy` folder.
2. The model needs training data, which was exported to the `spacy` folder earlier in this notebook.
3. After training, the model is evaluated and the results for training and evaluation are printed.
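
Each of the steps above is a single CLI call run from the `spacy` folder. As a sketch of how such calls could be wrapped so that a failure shows the command's error output, the hypothetical helper `run_step` below takes the command as a list of arguments (which avoids `shell=True`) and raises with stderr attached when the exit code is non-zero.

```python
import subprocess
import sys


def run_step(args, cwd=None):
    """Run a command, return its stdout, and raise with stderr attached on failure."""
    result = subprocess.run(
        args,
        cwd=cwd,
        capture_output=True,   # collect stdout and stderr instead of printing them
        text=True,
        encoding="utf-8",      # decode the output as UTF-8 regardless of platform default
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(args)} failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Demonstration with a harmless command. For the project steps one would pass
# something like [sys.executable, "-m", "spacy", "project", "run", "train"]
# together with cwd="spacy".
print(run_step([sys.executable, "-c", "print('ok')"]))
```

Decoding explicitly as UTF-8 matters here because the spaCy CLI prints symbols such as ⚠ and ℹ, which can come out garbled under other default encodings.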

In [54]:
import subprocess

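# shell=True is needed when the command is passed as a single string on non-Windows systems
subprocess.run("spacy project run create-config", cwd="spacy", shell=True, check=True)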

CompletedProcess(args='spacy project run create-config', returncode=0)

In [58]:
result = subprocess.run("spacy project run train", cwd="spacy", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which
is now deprecated. Weasel will not validate your version of spaCy.

ℹ Skipping 'train': nothing changed



In [59]:
result = subprocess.run("spacy project run evaluate", cwd="spacy", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which
is now deprecated. Weasel will not validate your version of spaCy.

ℹ Skipping 'evaluate': nothing changed



### 2.3 Visualizing the results
The model is now trained. It can be loaded into spaCy, and its named entity recognition (NER) output for an example text can be visualized with displaCy.

In [57]:
# Now test the newly created spaCy model on a sample text and visualize the result
nlp = spacy.load("spacy/training/model-last/")

example_text = training_data[0][0]

doc = nlp(example_text)

# View the tokenized text
print("Tokens:", [token.text for token in doc])

displacy.render(doc, style="ent")

# Show the tokens, their labels and their entities
for ent in doc.ents:
    print(ent.text, ent.label_)


Tokens: ['The', 'archaeological', 'site', 'of', 'Sabratha', 'is', 'an', 'excavated', 'Numidian', 'and', 'later', 'Roman', 'city', 'situed', 'near', 'present', '-', 'day', 'Sabratha', ',', 'Libya', '.', 'It', 'was', 'a', 'Phoenician', 'trading', '-', 'post', 'that', 'served', 'as', 'an', 'outlet', 'for', 'the', 'products', 'of', 'the', 'African', 'hinterland', ',', 'and', 'later', 'part', 'of', 'the', 'short', '-', 'lived', 'Numidian', 'Kingdom', 'of', 'Massinissa', 'before', 'being', 'Romanized', 'and', 'rebuilt', 'in', 'the', '2nd', 'and', '3rd', 'centuries', 'A.D.History', '.', 'Sabratha', ',', 'on', 'the', 'coast', 'of', 'Libya', '40', 'km', ',', 'to', 'the', 'west', 'of', 'modern', 'Tripoli', ',', 'was', 'founded', 'by', 'Phoenician', 'settlers', 'in', 'the', 'sixth', 'or', 'fifth', 'century', 'B.C.', 'and', 'grew', 'to', 'be', 'a', 'prosperous', 'town', 'during', 'much', 'of', 'the', 'Roman', 'period', ',', 'though', 'it', 'did', 'not', 'long', 'survive', 'the', 'coming', 'of', 't