# spaCy dataset creation

This notebook takes train, test, and validation datasets (of type `List[InputSample]`)
and transforms them into the structures consumed by spaCy.

[See more on creating training data for spaCy here](https://spacy.io/usage/training#training-data).

In [1]:
from presidio_evaluator import InputSample
%reload_ext autoreload

In [2]:
DATA_DATE = 'Dec-19-2021'

In [3]:
# Path template: first placeholder is the fold name, second is the dataset date
data_path = "../../data/{}_{}.json"

train_samples = InputSample.read_dataset_json(data_path.format("train", DATA_DATE))
print("Read {} samples".format(len(train_samples)))

loading model en_core_web_sm
tokenizing input: 100%|███████████████████████████████████████████████████████████| 2122/2122 [00:19<00:00, 109.66it/s]
Read 2122 samples

For training, keep only sentences with entities:

In [4]:
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Kept 1940 samples after removal of non-tagged samples


Inspect the tags present in the training set (`O` marks tokens outside any entity):

In [5]:
print("Entities found in training set:")
# Collect the token-level tags of every tagged sample
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)

Entities found in training set:


{'ADDRESS',
 'CREDIT_CARD',
 'DATE_TIME',
 'DOMAIN_NAME',
 'EMAIL_ADDRESS',
 'IBAN_CODE',
 'IP_ADDRESS',
 'LOCATION',
 'O',
 'ORGANIZATION',
 'PERSON',
 'PHONE_NUMBER',
 'PREFIX',
 'TITLE',
 'US_SSN'}
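
The set above shows which tags occur, but not how often each one appears. A minimal sketch of counting tag frequencies follows. The `toy_tags` lists are hypothetical stand-ins for the `sample.tags` of each sample, and `"O"` is assumed to mark tokens outside any entity.

```python
from collections import Counter

# Hypothetical token-level tags standing in for `sample.tags` of each sample
toy_tags = [
    ["O", "PERSON", "PERSON", "O"],
    ["O", "O", "LOCATION"],
    ["PERSON", "O", "US_SSN"],
]

# Count every tag except "O", which is assumed to mark non-entity tokens
label_counts = Counter(tag for tags in toy_tags for tag in tags if tag != "O")

# Print labels from most to least frequent
for label, count in label_counts.most_common():
    print(label, count)
```

In the notebook, the same counter could be built from `train_tagged` to spot entity types with very few training examples.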

Create the spaCy dataset:

In [None]:
spacy_train = InputSample.create_spacy_dataset(dataset=train_tagged, output_path="train.spacy")


Skipping illegal span None, text=ΜΟΝΗ ΑΓΙΩΝ ΑΝΑΡΓΥΡΩΝ
Skipping illegal span None, text=U.N
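
Messages like the ones above typically mean that a character span could not be mapped onto whole tokens: spaCy's `Doc.char_span` returns `None` when the start or end offset falls inside a token. The sketch below illustrates the idea with a whitespace tokenizer, which is only a simplified stand-in for spaCy's tokenizer. The helper names `token_boundaries` and `aligned_span` and the example sentence are hypothetical, and the actual cause of the two skipped spans may differ.

```python
import re
from typing import Optional, Set, Tuple


def token_boundaries(text: str) -> Tuple[Set[int], Set[int]]:
    """Return token start and end offsets for a whitespace tokenizer (a stand-in, not spaCy's)."""
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def aligned_span(text: str, start: int, end: int) -> Optional[str]:
    """Return the span text if both offsets sit on token boundaries, else None."""
    starts, ends = token_boundaries(text)
    if start < end and start in starts and end in ends:
        return text[start:end]
    return None


text = "Contact the U.N. office today"
print(aligned_span(text, 12, 16))  # span covers the whole token "U.N."
print(aligned_span(text, 12, 15))  # span "U.N" ends inside the token, so None
```

Under this simplified tokenizer, a span annotated as `U.N` inside a token `U.N.` is rejected, which is one plausible way an annotation can end up skipped.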


In [None]:
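# Each item of spacy_train pairs a text with an annotation dict;
# the label is the third element of every entity tuple
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)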
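
Because some spans are skipped during conversion, it can be useful to check that no entity type disappeared entirely. The following is a minimal sketch on toy data. The `(text, {"entities": [(start, end, label)]})` shape is inferred from the indexing in the cell above, the `(start, end)` ordering is an assumption, and all names and values are hypothetical.

```python
# Hypothetical tags seen before conversion ("O" assumed to mean "no entity")
tags_before = {"O", "PERSON", "LOCATION", "ORGANIZATION"}

# Hypothetical converted samples in the assumed (text, annotations) shape
converted = [
    ("Alice lives in Paris", {"entities": [(0, 5, "PERSON"), (15, 20, "LOCATION")]}),
    ("Nothing tagged here", {"entities": []}),
]

# Labels that survived conversion
labels_after = {label for _, ann in converted for _, _, label in ann["entities"]}

# Entity types present before conversion but absent afterwards
missing = (tags_before - {"O"}) - labels_after
print(sorted(missing))
```

In the notebook, the analogous comparison would be between `set(entities)` and `set(entities_spacy_flat)`.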

Quick inspection of the first 100 sample texts:

In [None]:
[sample[0] for sample in spacy_train[:100]]

Create dataset files for the test and validation sets:

In [None]:
for fold in ("test", "validation"):
    dataset = InputSample.read_dataset_json(data_path.format(fold, DATA_DATE))
    print(f"Read {len(dataset)} samples for {fold}")
    InputSample.create_spacy_dataset(dataset=dataset, output_path=f"{fold}.spacy")