# spaCy dataset creation

This notebook takes the train, test, and validation datasets (each of type `List[InputSample]`)
and transforms them into the structures consumed by spaCy.

[See more on creating training data for spaCy here](https://spacy.io/usage/training#training-data).

In [None]:
from presidio_evaluator import InputSample

# Load the autoreload extension and reload all modules before each cell runs
%reload_ext autoreload
%autoreload 2

In [None]:
DATA_DATE = "Dec-27-2023"  # Change to the date when notebook 3 (split to train/test) was run

In [None]:
data_path = "../../data/{}_{}.json"

train_samples = InputSample.read_dataset_json(data_path.format("train", DATA_DATE))
print("Read {} samples".format(len(train_samples)))

For training, keep only sentences with entities:

In [None]:
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))
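
The filter above depends only on each sample exposing a `spans` list. As a minimal sketch of its effect, the snippet below uses a hypothetical stand-in class (`ToySample` is made up for illustration and is not part of `presidio_evaluator`) with two toy samples:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToySample:
    """Hypothetical stand-in for InputSample; only `spans` matters here."""

    full_text: str
    # Each span is (start_char, end_char, label); this layout is an assumption.
    spans: List[Tuple[int, int, str]] = field(default_factory=list)


toy_samples = [
    ToySample("My name is Dana", [(11, 15, "PERSON")]),
    ToySample("Nothing to tag here"),  # no spans, so it is dropped
]

# Same filtering rule as in the cell above
toy_tagged = [s for s in toy_samples if len(s.spans) > 0]
print(f"Kept {len(toy_tagged)} of {len(toy_samples)} samples")
```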

List the entity tags that appear in the training set:

In [None]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    # sample.tags already holds the tags for this sample
    entities.extend(sample.tags)
set(entities)
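
A set shows which tags occur but not how often. As a rough sketch, `collections.Counter` can report frequencies as well. The toy lists below stand in for `sample.tags`, and their contents (one label string per token, with `"O"` for non-entity tokens) are an assumption made for illustration:

```python
from collections import Counter

# Toy stand-ins for sample.tags (assumed: one label string per token)
toy_tags = [
    ["O", "O", "PERSON"],
    ["LOCATION", "O", "PERSON", "PERSON"],
]

# Flatten all tag lists and count each label
tag_counts = Counter(tag for tags in toy_tags for tag in tags)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

With the real data, the same counter could be built by iterating over `train_tagged` instead of `toy_tags`. A heavily skewed distribution is easier to spot this way than with a plain set.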

Create the spaCy dataset:

In [None]:
spacy_train = InputSample.create_spacy_dataset(
    dataset=train_tagged, output_path="train.spacy"
)

In [None]:
entities_spacy = [x[1]["entities"] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)
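
The indexing in the cell above suggests that each entry of `spacy_train` is a `(text, {"entities": [(start, end, label), ...]})` tuple. The sketch below assumes that layout and uses made-up entries (`toy_spacy_train` is hypothetical) to collect the labels and to check that the character offsets select the intended text:

```python
# Made-up entries in the assumed (text, {"entities": [...]}) layout
toy_spacy_train = [
    ("My name is Dana", {"entities": [(11, 15, "PERSON")]}),
    (
        "Dana lives in Oslo",
        {"entities": [(0, 4, "PERSON"), (14, 18, "LOCATION")]},
    ),
]

# Collect the distinct labels with a set comprehension
toy_labels = {
    label
    for _, annotations in toy_spacy_train
    for _, _, label in annotations["entities"]
}

# Slice each text with its offsets to see what every span covers
toy_surface = [
    (text[start:end], label)
    for text, annotations in toy_spacy_train
    for start, end, label in annotations["entities"]
]

print(toy_labels)
print(toy_surface)
```

Printing the sliced text next to each label is a quick way to catch spans whose offsets are shifted by a character or two.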

Inspect the text of the first 100 samples:

In [None]:
[sample[0] for sample in spacy_train[:100]]

Create the dataset files for the test and validation sets:

In [None]:
for fold in ("test", "validation"):
    dataset = InputSample.read_dataset_json(data_path.format(fold, DATA_DATE))
    print(f"Read {len(dataset)} samples for {fold}")
    InputSample.create_spacy_dataset(dataset=dataset, output_path=f"{fold}.spacy")