In [1]:
%load_ext autoreload
%autoreload 2

# Processing Label Studio Exports for spaCy
This notebook converts annotation exports from Label Studio into training data for spaCy.

In [2]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done with the predefined `loadFile` method of the `Preprocessor` class, which uses the `json` library.

In [3]:
export_data = preprocessor.loadFile('label_studio_test_notebook_231002.json')
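
For reference, a loader of this kind can be a thin wrapper around `json.load`. The sketch below is an assumption about what `loadFile` might do, not the actual `Preprocessor` implementation; the function name `load_export` and the `data_dir` argument are hypothetical.

```python
import json
from pathlib import Path


def load_export(filename, data_dir="."):
    """Read a Label Studio JSON export and return its list of tasks.

    Hypothetical helper: the real `Preprocessor.loadFile` may differ.
    """
    path = Path(data_dir) / filename
    # Read as UTF-8 so non-ASCII characters in the annotated text survive.
    with path.open(encoding="utf-8") as f:
        tasks = json.load(f)
    # The export used in this notebook is indexed like a list (export_data[0]).
    if not isinstance(tasks, list):
        raise ValueError(f"Expected a list of tasks, got {type(tasks).__name__}")
    return tasks
```

Failing early when the top-level value is not a list makes a wrong export type easier to spot than an error later in the pipeline.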

The exported data consists of the annotations that all users made on each text. The first annotated text in the export looks like this:

In [4]:
print(export_data[0])

{'id': 70476117, 'annotations': [{'id': 22629460, 'completed_by': {'id': 12485, 'email': 'l.r.siecker@student.tue.nl', 'first_name': 'Luc', 'last_name': 'Siecker'}, 'result': [{'id': 'NGoGCdY4Mr', 'type': 'labels', 'value': {'end': 8, 'text': 'Ephesus', 'start': 1, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'smwmA5Ofvl', 'type': 'labels', 'value': {'end': 55, 'text': 'city', 'start': 51, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'eMn5ovu-q9', 'type': 'labels', 'value': {'end': 73, 'text': 'in Ancient Greece', 'start': 56, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'XypjjOYYWL', 'type': 'labels', 'value': {'end': 95, 'text': 'on the coast of Ionia', 'start': 74, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'fzfahtS8TS', 'type': 'labels', 'value': {'end': 127, 'text': 'Seluk', 'start': 1

### 1.2 Converting to spaCy training format
To provide custom labels to spaCy, we need to convert the data to the following format:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (20, 23, "HEIGHT")]),
]
```

The `process_export` function of the preprocessor handles this conversion. It returns the training data in the format above, together with the relations between pairs of labeled spans.

In [5]:
training_data, relation_data = preprocessor.process_export(export_data)
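
To illustrate the span conversion, here is a minimal sketch for a single task. It is not the actual `process_export` code. It assumes the raw text is stored under `task["data"]["text"]` (the key depends on the labeling configuration and is not visible in the output above), and the name `task_to_spacy_example` is hypothetical. Identical spans from several annotators are merged, which is a design choice of this sketch.

```python
def task_to_spacy_example(task, text_key="text"):
    """Convert one Label Studio task into (text, [(start, end, label), ...]).

    Assumption: the raw text lives under task["data"][text_key].
    """
    text = task["data"][text_key]
    spans = set()  # a set merges identical spans from different annotators
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("type") != "labels":
                continue  # skip relations and other result types
            value = result["value"]
            for label in value["labels"]:
                spans.add((value["start"], value["end"], label))
    return text, sorted(spans)


# Small hand-made task that mimics the structure of the export shown above.
task = {
    "data": {"text": " Ephesus was a city in Ancient Greece."},
    "annotations": [{"result": [
        {"id": "a", "type": "labels",
         "value": {"start": 1, "end": 8, "text": "Ephesus", "labels": ["landmark_name"]}},
        {"id": "b", "type": "labels",
         "value": {"start": 15, "end": 19, "text": "city", "labels": ["type"]}},
    ]}],
}
text, spans = task_to_spacy_example(task)
print(spans)
```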

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

In [6]:
for i, (label, relations) in enumerate(relation_data.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Ephesus', 'city')
Relations: ['org:is_type']

Labels: ('Ephesus', 'in Ancient Greece')
Relations: ['org:located_in']

Labels: ('Ephesus', 'on the coast of Ionia')
Relations: ['org:located_in']

Labels: ('Ephesus', 'Seluk')
Relations: ['org:located_in']

Labels: ('Ephesus', 'in zmir Province')
Relations: ['org:located_in']
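
A dictionary of this shape could be built roughly as follows. This is a sketch, not the `Preprocessor` implementation: it assumes that relation entries in a `result` list have type `"relation"` and point to span entries through `from_id` and `to_id`, and the name `collect_relations` is hypothetical.

```python
from collections import defaultdict


def collect_relations(results):
    """Map (head text, tail text) pairs to a list of relation labels.

    Assumption: relation entries have type "relation" and reference span
    entries by id through "from_id" and "to_id".
    """
    # Index span texts by result id so relation endpoints can be resolved.
    span_text = {
        r["id"]: r["value"]["text"] for r in results if r.get("type") == "labels"
    }
    relations = defaultdict(list)
    for r in results:
        if r.get("type") != "relation":
            continue
        key = (span_text[r["from_id"]], span_text[r["to_id"]])
        for label in r.get("labels", []):
            if label not in relations[key]:  # keep each label once per pair
                relations[key].append(label)
    return dict(relations)


# Hand-made result list with two spans and one relation between them.
results = [
    {"id": "a", "type": "labels",
     "value": {"start": 1, "end": 8, "text": "Ephesus", "labels": ["landmark_name"]}},
    {"id": "b", "type": "labels",
     "value": {"start": 51, "end": 55, "text": "city", "labels": ["type"]}},
    {"type": "relation", "from_id": "a", "to_id": "b",
     "direction": "right", "labels": ["org:is_type"]},
]
print(collect_relations(results))
```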



The same can be done for the training data, which pairs each text with its labeled spans. Let's print only the first item.

In [7]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1], sep = "\n")

Training data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey. It was built in the 10th century BC on the site of Apasa, the former Arzawan capital, by Attic and Ionian Greek colonists. During the Classical Greek era, it was one of twelve cities that were members of the Ionian League. The city came under the control of the Roman Republic in 129 BC.The city was famous in its day for the nearby Temple of Artemis (completed around 550BC), which has been designated one of the Seven Wonders of the Ancient World. Its many monumental buildings included the Library of Celsus and a theatre capable of holding 24,000 spectators.
Labels:
(1, 8, 'landmark_name')
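
Because the spans are character offsets, they can be checked by slicing the text. In the output above the span starts at 1 rather than 0 because the text begins with a space. The sketch below flags spans that fall outside the text, start or end on whitespace, or overlap an earlier span (spaCy's `doc.ents` does not accept overlapping entities). The helper name `check_spans` is hypothetical and not part of the `Preprocessor` class.

```python
def check_spans(text, spans):
    """Return human-readable problems found in (start, end, label) spans."""
    problems = []
    previous_end = -1
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) fall outside the text")
            continue
        covered = text[start:end]
        if covered != covered.strip():
            problems.append(f"{label}: span {covered!r} starts or ends with whitespace")
        if start < previous_end:
            problems.append(f"{label}: span {covered!r} overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems


sample_text = " Ephesus was a city in Ancient Greece."
print(check_spans(sample_text, [(1, 8, "landmark_name")]))  # no problems expected
print(check_spans(sample_text, [(0, 8, "landmark_name")]))  # includes the leading space
```

With the real data, the same check could be run as `check_spans(training_data[0][0], training_data[0][1])`.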


### 1.4 Preparing the data for spaCy
The data is now in the correct format, so it can be processed and saved as a spaCy training file using the `preprocess_spacy` function from the `Preprocessor` class.

In [8]:
preprocessor.preprocess_spacy(training_data)
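
The training step further below reads both a `train.spacy` and a `dev.spacy` file, so the examples have to be divided into a training and a development set at some point. How `preprocess_spacy` does this is not shown here. The sketch below is one possible approach, with the hypothetical name `split_train_dev` and an assumed 80/20 ratio.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle examples reproducibly and split them into (train, dev) lists."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split every run
    if len(shuffled) > 1:
        n_dev = max(1, round(len(shuffled) * dev_fraction))
    else:
        n_dev = 0  # a single example cannot be split
    return shuffled[n_dev:], shuffled[:n_dev]


# Ten placeholder examples in the (text, spans) format from section 1.2.
examples = [(f"text {i}", []) for i in range(10)]
train_examples, dev_examples = split_train_dev(examples)
print(len(train_examples), len(dev_examples))
```

Each list can then be converted to `Doc` objects and written to disk with spaCy's `DocBin`.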

<!-- ## 2 Training Spacy model
### 2.1 Setup config file
The config file is used to configure the training process. It contains the pipeline, the number of iterations, the dropout rate, the batch size and the learning rate. The config file looks like:
    
```json
{
"lang": "en",
"pipeline": ["ner"],
"path.train": "./data/train.spacy",
"path.dev": "./data/dev.spacy",
"ner.early_stopping": false,
"ner.crf": false,
"ner.batch_size": 32,
"ner.learning_rate": 0.001,
"ner.epochs": 100
}
```	 -->

<!-- ### 2.2 Training the model
To train a Spacy model, the following command line code needs to be run:
    
```bash
python -m spacy train .\\spacy_config.cfg --output .\\models\\
```

This will train a model using the configuration file `spacy_config.cfg` and save it in the `models` folder.

The saved model will consist of a `meta.json` and a binary model file and can be loaded using the `spacy.load` function.  -->

In [9]:
from spacy.cli.train import train
from spacy.cli import debug_data

In [10]:
train(f"{DATA_PATH}/config.cfg", 
      output_path="models/spacy_model", 
      overrides={"paths.train": f"{DATA_PATH}/train.spacy", 
                 "paths.dev": f"{DATA_PATH}/dev.spacy", 
                 "paths.test": f"{DATA_PATH}/dev.spacy"})

ℹ Saving to output directory: models\spacy_model
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  --------  ------  ------  ------  ------
  0       0    102.63    0.00    0.00    0.00    0.00


In [None]:
debug_data(f"{DATA_PATH}/config.cfg", 
           config_overrides={
               "paths.train": f"{DATA_PATH}/train.spacy", 
               "paths.dev": f"{DATA_PATH}/dev.spacy", 
               "paths.test": f"{DATA_PATH}/dev.spacy"})

In [None]:
# Test the newly trained spaCy model on a sample text and visualize the entities with displaCy
import spacy
from spacy import displacy

nlp = spacy.load("models/spacy_model/model-best/")

text = "The Cathedral Basilica of Our Lady of Amiens (), or simply Amiens Cathedral, is a Roman Catholic church. The cathedral is the seat of the Bishop of Amiens. It is situated on a slight ridge overlooking the River Somme in Amiens, the administrative capital of the Picardy region of France, some north of Paris.The cathedral was built almost entirely between 1220 and , a remarkably short period of time for a Gothic cathedral, giving it an unusual unity of style. Amiens is a classic example of the High Gothic style of Gothic architecture. It also has some features of the later Rayonnant style in the enlarged high windows of the choir, added in the mid-1250s."

doc = nlp(text)

displacy.render(doc, style="ent", jupyter=True)
