In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Processing Label Studio Exports for spaCy
This notebook processes annotation exports from Label Studio so that they can be used to train a spaCy model.

In [10]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1 Loading the JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done with the `loadFile` method of the `Preprocessor` class, which relies on the `json` library.

In [11]:
export_data = preprocessor.loadFile('label_studio_test_notebook_290923.json')
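
For readers without access to the `Preprocessor` source, here is a minimal sketch of what a loader such as `loadFile` might do. The name `load_export` is hypothetical, and how the real method resolves the file name against `ROOT_DIR` or `DATA_PATH` is not shown in this notebook, so the sketch simply takes a full path.

```python
import json
from pathlib import Path


def load_export(path):
    """Read a Label Studio JSON export and return its list of tasks."""
    with Path(path).open(encoding="utf-8") as handle:
        tasks = json.load(handle)
    # A JSON export is expected to be a list with one dict per annotated text.
    if not isinstance(tasks, list):
        raise ValueError(f"expected a list of tasks, got {type(tasks).__name__}")
    return tasks
```

Checking the top-level type early gives a clearer error than a failed index later on if a different export format is loaded by mistake.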

The export contains one entry per text, and each entry holds the annotations that all users made on that text. The first annotated text in the export looks like this:

In [12]:
print(export_data[0])

{'id': 70476151, 'annotations': [{'id': 22565484, 'completed_by': {'id': 12641, 'email': 'd.p.m.v.d.hoorn@student.tue.nl', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'tBYLAMaY7Q', 'type': 'labels', 'value': {'end': 18, 'text': 'Chartres Cathedral', 'start': 0, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'rXo1WB9tDL', 'type': 'labels', 'value': {'end': 71, 'text': 'Cathedral of Our Lady of Chartres', 'start': 38, 'labels': ['landmark_name']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'jt997QSCPu', 'type': 'labels', 'value': {'end': 96, 'text': 'Catholic church', 'start': 81, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'eWtbXE3SqW', 'type': 'labels', 'value': {'end': 108, 'text': 'Chartres', 'start': 100, 'labels': ['location']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': '8FJnNiKvM2', 'type': 'labels', 'value': {'end': 116, 

### 1.2 Converting to spaCy training format
To train spaCy on custom labels, we need to convert the data to the following format, where each label is given as a `(start, end, label)` tuple of character offsets:

```python
training_data = [
  # Offsets are character indices into the text; the end offset is exclusive.
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")]),
]
```
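Because the offsets are plain character indices with an exclusive end, slicing the text with them should reproduce the annotated string. The sketch below uses a hypothetical helper, `check_spans`, on the opening of the first exported text and its first two labels.

```python
def check_spans(text, spans):
    """Return the substring covered by each (start, end, label) span."""
    covered = []
    for start, end, label in spans:
        # `end` is exclusive, exactly like the end of a Python slice.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}, {label!r}) is out of range")
        covered.append((text[start:end], label))
    return covered


text = "Chartres Cathedral, also known as the Cathedral of Our Lady of Chartres"
spans = [(0, 18, "landmark_name"), (38, 71, "landmark_name")]
for substring, label in check_spans(text, spans):
    print(f"{label}: {substring!r}")
# landmark_name: 'Chartres Cathedral'
# landmark_name: 'Cathedral of Our Lady of Chartres'
```

A check like this is a cheap way to catch off-by-one offsets before they reach training.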

The `process_export` method of the preprocessor does this for us: it returns the training data in the format above, together with the relations between the labelled spans.

In [13]:
training_data, relation_data = preprocessor.process_export(export_data)
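
The internals of `process_export` are not shown in this notebook. As a rough sketch of the kind of conversion involved, the code below assumes the usual Label Studio JSON layout: the task text sits under `data` (the key name `text` is an assumption that depends on the labeling configuration), and relations appear as result items of type `relation` with `from_id` and `to_id` fields. `convert_task` and `text_key` are hypothetical names.

```python
def convert_task(task, text_key="text"):
    """Turn one Label Studio task into (text, spans) plus a relation dict."""
    text = task["data"][text_key]  # assumption: text is stored under data[text_key]
    spans, span_text, relations = [], {}, {}
    for annotation in task["annotations"]:
        results = annotation["result"]
        # First pass: collect labelled spans and remember their text by id.
        for item in results:
            if item["type"] == "labels":
                value = item["value"]
                spans.append((value["start"], value["end"], value["labels"][0]))
                span_text[item["id"]] = value["text"]
        # Second pass: resolve relations between the spans collected above.
        for item in results:
            if item["type"] == "relation":
                pair = (span_text[item["from_id"]], span_text[item["to_id"]])
                relations.setdefault(pair, []).extend(item.get("labels", []))
    # Several annotators may mark the same span, so drop exact duplicates.
    return (text, sorted(set(spans))), relations
```

The sketch merges all annotators into one span list. Whether the real method does the same, or resolves disagreements between annotators differently, cannot be told from the outputs shown here.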

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

In [14]:
for i, (label, relations) in enumerate(relation_data.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Chartres Cathedral', 'Cathedral of Our Lady of Chartres')
Relations: ['org:is_similar_to']

Labels: ('Chartres Cathedral', 'Catholic church')
Relations: ['org:is_type']

Labels: ('Chartres Cathedral', 'Chartres')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'France')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'High Gothic and Classic Gothic architecture')
Relations: ['org:is_type']



The same can be done for the labels stored in `training_data`. Let's print only the first item.

In [15]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1], sep = "\n")

Training data info item 1 
text:
Chartres Cathedral, also known as the Cathedral of Our Lady of Chartres (), is a Catholic church in Chartres, France, about southwest of Paris, and is the seat of the Bishop of Chartres. Mostly constructed between 1194 and 1220, it stands on the site of at least five cathedrals that have occupied the site since the Diocese of Chartres was formed as an episcopal see in the 4th century. It is one of the best-known and most influential examples of High Gothic and Classic Gothic architecture, It stands on Romanesque basements, while its north spire is more recent (15071513) and is built in the more ornate Flamboyant style.
Labels:
(0, 18, 'landmark_name')
(38, 71, 'landmark_name')
(81, 96, 'type')
(100, 108, 'location')
(110, 116, 'location')
(137, 142, 'location')
(163, 185, 'occupation')
(206, 227, 'date')
(263, 267, 'number')
(313, 336, 'people')
(354, 363, 'people')
(368, 386, 'date')
(449, 492, 'type')
(507, 527, 'component')
(539, 550, 'component')
(5

### 1.4 Preparing the data for spaCy
The data is now in the correct format, so it can be processed and saved as a spaCy training file using the `preprocess_spacy` method of the `Preprocessor` class.

In [20]:
preprocessor.preprocess_spacy(training_data)
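
What `preprocess_spacy` does internally is not shown here. A common way to write `(text, spans)` tuples to a `.spacy` file is spaCy's `DocBin`, sketched below. The function name `build_docbin` is hypothetical, and skipping misaligned spans is a choice made for this sketch, not necessarily what the preprocessor does.

```python
def build_docbin(training_data, lang="en"):
    """Pack (text, [(start, end, label), ...]) tuples into a spaCy DocBin."""
    # Imported inside the function so the cell can be defined without spaCy loaded.
    import spacy
    from spacy.tokens import DocBin
    from spacy.util import filter_spans

    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    for text, spans in training_data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            # char_span returns None when offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
        # doc.ents rejects overlapping spans, so keep the longest of any overlap.
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    return doc_bin
```

The result can then be saved with `build_docbin(training_data).to_disk(...)`, using whatever path the training config expects.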

<!-- ## 2 Training Spacy model
### 2.1 Setup config file
The config file is used to configure the training process. It contains the pipeline, the number of iterations, the dropout rate, the batch size and the learning rate. The config file looks like:
    
```json
{
"lang": "en",
"pipeline": ["ner"],
"path.train": "./data/train.spacy",
"path.dev": "./data/dev.spacy",
"ner.early_stopping": false,
"ner.crf": false,
"ner.batch_size": 32,
"ner.learning_rate": 0.001,
"ner.epochs": 100
}
```	 -->

<!-- ### 2.2 Training the model
To train a Spacy model, the following command line code needs to be run:
    
```bash
python -m spacy train .\\spacy_config.cfg --output .\\models\\
```

This will train a model using the configuration file `spacy_config.cfg` and save it in the `models` folder.

The saved model will consist of a `meta.json` and a binary model file and can be loaded using the `spacy.load` function.  -->

In [22]:
from spacy.cli.train import train

In [21]:
!python -m spacy init fill-config data/base_config.cfg data/config.cfg

Usage: python -m spacy init fill-config [OPTIONS] BASE_PATH [OUTPUT_FILE]
Try 'python -m spacy init fill-config --help' for help.

Error: Invalid value for 'BASE_PATH': File 'data/base_config.cfg' does not exist.


In [None]:
!python -m spacy train data/config.cfg --output ./models/output

In [32]:
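# Note: paths.dev points at the training file, so the ENTS_* scores reported
# during training are measured on the training data itself.
train(
    f"{DATA_PATH}/config.cfg",
    output_path="models/spacy_model",
    overrides={
        "paths.train": f"{DATA_PATH}/train.spacy",
        "paths.dev": f"{DATA_PATH}/train.spacy",
        "paths.test": f"{DATA_PATH}/test.spacy",
    },
)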

ℹ Saving to output directory: models\spacy_model
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  --------  ------  ------  ------  ------
  0       0     69.78    0.00    0.00    0.00    0.00
 50     200   4777.03  100.00  100.00  100.00    1.00
102     400      5.81  100.00  100.00  100.00    1.00
160     600      0.00  100.00  100.00  100.00    1.00
222     800      0.00  100.00  100.00  100.00    1.00
289    1000      0.00  100.00  100.00  100.00    1.00
373    1200      0.00  100.00  100.00  100.00    1.00


KeyboardInterrupt: 
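
In the `train` call above, `paths.dev` and `paths.train` refer to the same file, so the perfect scores say little about how the model handles unseen text. One possible remedy is to hold out part of the annotated examples before the `.spacy` files are written. The sketch below shows a simple shuffled split; `split_train_dev`, `dev_fraction` and `seed` are hypothetical names, and the 80/20 ratio is only an example.

```python
import random


def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle the items reproducibly and return (train, dev) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) < 2:
        # Too few examples to hold anything out.
        return shuffled, []
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]
```

Each part could then be saved to its own file, with `paths.dev` pointing at the held-out one.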