# Processing Label Studio Exports for spaCy
This notebook converts exports from Label Studio into training data for spaCy.

In [1]:
import preprocessor
from preprocessor import Preprocessor
import os
import spacy
from spacy.tokens import DocBin
import pandas as pd
from datetime import date

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

# Note: this rebinds the name `preprocessor` from the module to a Preprocessor instance.
preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done with the `loadFile` method of the `Preprocessor` class, which uses the `json` library.

In [2]:
export_data = preprocessor.loadFile('export_test.json')
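
The implementation of `loadFile` is not shown in this notebook. A minimal sketch of what such a loader might look like, assuming it simply parses a JSON file inside a data directory; the function name `load_export` and the `data_dir` parameter are hypothetical:

```python
import json
from pathlib import Path


def load_export(filename, data_dir="."):
    """Parse a Label Studio JSON export and return the decoded object.

    Hypothetical stand-in for Preprocessor.loadFile; the real method may differ.
    """
    path = Path(data_dir) / filename
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

A Label Studio JSON export is typically a list with one entry per labeled task, so the returned object can be iterated directly.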

### 1.2 Converting to spaCy training format
To provide custom labels to spaCy, we need to convert the data to the following format, where each label is a `(start, end, label)` tuple of character offsets into the text:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")]),
]
```

The `process_export` method of the preprocessor handles this conversion. It returns the training data in the format above, together with a dictionary that maps pairs of labeled spans to the relations between them.

In [3]:
training_data, relation_data = preprocessor.process_export(export_data)
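
How `process_export` works internally is not shown here. As a rough sketch, assuming the common Label Studio JSON layout in which each task stores its text under `data["text"]` and its annotation results under `annotations[...]["result"]`, with span results of type `labels` and link results of type `relation`, the conversion could look like this. The function name `convert_export` and the `text_key` parameter are assumptions; the actual data key depends on the labeling configuration.

```python
from collections import defaultdict


def convert_export(tasks, text_key="text"):
    """Turn Label Studio tasks into spaCy-style tuples plus a relation map.

    Hypothetical sketch; not the notebook's actual process_export.
    """
    training_data = []
    relation_data = defaultdict(list)
    for task in tasks:
        text = task["data"][text_key]
        spans = []
        span_text_by_id = {}
        relations = []
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                if result.get("type") == "labels":
                    value = result["value"]
                    start, end = value["start"], value["end"]
                    spans.append((start, end, value["labels"][0]))
                    span_text_by_id[result["id"]] = text[start:end]
                elif result.get("type") == "relation":
                    relations.append(result)
        # Resolve relations only after all spans are collected,
        # because a relation may appear before the spans it links.
        for rel in relations:
            head = span_text_by_id.get(rel["from_id"])
            tail = span_text_by_id.get(rel["to_id"])
            if head is None or tail is None:
                continue
            relation_data[(head, tail)].extend(rel.get("labels", []))
        training_data.append((text, sorted(spans)))
    return training_data, dict(relation_data)
```

The relation map is keyed by the surface text of the two linked spans, which matches the shape of the `relation_data` printed in the next step.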

### 1.3 Checking results
We can print the first 5 relations to check whether it was done correctly.

In [10]:
for i, (label, relations) in enumerate(relation_data.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Chartres Cathedral', 'Cathedral of Our Lady of Chartres')
Relations: ['org:is_similar_to']

Labels: ('Chartres Cathedral', 'Catholic church')
Relations: ['org:is_type']

Labels: ('Chartres Cathedral', 'Chartres')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'France')
Relations: ['org:located_in']

Labels: ('Chartres Cathedral', 'High Gothic and Classic Gothic architecture')
Relations: ['org:is_type']



The same check can be done for the training data, which holds one `(text, labels)` pair per item. Let's print only the first one.

In [21]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1], sep = "\n")

Training data info item 1 
text:
Chartres Cathedral, also known as the Cathedral of Our Lady of Chartres (), is a Catholic church in Chartres, France, about southwest of Paris, and is the seat of the Bishop of Chartres. Mostly constructed between 1194 and 1220, it stands on the site of at least five cathedrals that have occupied the site since the Diocese of Chartres was formed as an episcopal see in the 4th century. It is one of the best-known and most influential examples of High Gothic and Classic Gothic architecture, It stands on Romanesque basements, while its north spire is more recent (15071513) and is built in the more ornate Flamboyant style.
Labels:
(0, 18, 'landmark_name')
(38, 71, 'landmark_name')
(81, 96, 'type')
(100, 108, 'location')
(110, 116, 'location')
(137, 142, 'location')
(163, 185, 'occupation')
(206, 227, 'date')
(263, 267, 'number')
(313, 336, 'people')
(354, 363, 'people')
(368, 386, 'date')
(449, 492, 'type')
(507, 527, 'component')
(539, 550, 'component')
(5
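
Beyond reading the printed offsets, a quick programmatic check can slice the text with each span and flag spans that fall outside the text or overlap one another. This matters because spaCy's `doc.ents` does not accept overlapping entities. A minimal sketch, where `check_spans` is a hypothetical helper and not part of the `Preprocessor` class:

```python
def check_spans(text, spans):
    """Return (surface_forms, problems) for a list of (start, end, label) spans."""
    surface_forms = []
    problems = []
    previous_end = 0
    for start, end, label in sorted(spans):
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"out of range: {(start, end, label)}")
            continue
        # A span starting before the furthest end seen so far overlaps an earlier span.
        if start < previous_end:
            problems.append(f"overlaps previous span: {(start, end, label)}")
        surface_forms.append((text[start:end], label))
        previous_end = max(previous_end, end)
    return surface_forms, problems


forms, problems = check_spans(
    "Tokyo Tower is 333m tall.",
    [(0, 11, "BUILDING"), (15, 19, "HEIGHT")],
)
print(forms)     # [('Tokyo Tower', 'BUILDING'), ('333m', 'HEIGHT')]
print(problems)  # []
```

Calling `check_spans(*training_data[0])` would run the same check on the first item printed above.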

### 1.4 Preparing the data for spaCy
The data is now in the correct format, so it can be processed and saved as a spaCy training file using the `preprocess_spacy` method of the `Preprocessor` class.

In [None]:
preprocessor.preprocess_spacy(training_data)
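
The body of `preprocess_spacy` is not shown. A common way to write spaCy v3 training data, sketched here under the assumption that the method does something similar, is to convert each `(text, labels)` pair into a `Doc` and collect the results in a `DocBin`. The function name `to_docbin` and the blank English pipeline are illustrative assumptions.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def to_docbin(training_data, lang="en"):
    """Build a DocBin from (text, [(start, end, label), ...]) tuples."""
    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    for text, spans in training_data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            # "contract" snaps misaligned offsets to the tokens fully inside them;
            # char_span returns None when no token fits.
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                ents.append(span)
        # doc.ents rejects overlapping spans, so keep a non-overlapping subset
        # (filter_spans prefers longer spans).
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    return doc_bin


doc_bin = to_docbin(
    [("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")])]
)
```

The result can then be saved with `doc_bin.to_disk(...)`, using whatever output path the training configuration expects.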