# Spacy dataset creation

This notebook takes train and test datasets (of type `List[InputSample]`)
and transforms them into two structures consumed by Spacy:
1. Spacy JSON (see https://spacy.io/api/annotation#json-input)
2. Spacy pickle files, holding a list of tuples of the form `[(full_text, {"entities": [(start, end, type), ...]}), ...]`
(see more details here: https://spacy.io/api/annotation#json-input)

The JSON files are used by Spacy's CLI trainer.
The pickle files are used for fine-tuning with the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py)
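
As a rough illustration of the pickle structure described above, the sketch below builds a single training example by hand and checks that each span's character offsets select the intended text. The sentence and the `check_spans` helper are hypothetical and not part of `presidio_evaluator`.

```python
# Hypothetical example of the (full_text, {"entities": [...]}) structure
text = "My name is Valida Kishiev."
example = (text, {"entities": [(11, 25, "PERSON")]})


def check_spans(sample):
    """Return the substring and label covered by each (start, end, label) span."""
    full_text, annotations = sample
    return [
        (full_text[start:end], label)
        for start, end, label in annotations["entities"]
    ]


covered = check_spans(example)
print(covered)
```

A check like this is a cheap way to catch off-by-one offsets before training, since the end offset is exclusive, as in Python slicing.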

In [1]:
from presidio_evaluator.data_generator import read_synth_dataset
%reload_ext autoreload

In [2]:
DATA_DATE = 'February_28_2020'
size = 100

In [3]:
data_path = "../../data/{}_{}.json"

train_samples = read_synth_dataset(data_path.format("train",DATA_DATE))
print("Read {} samples".format(len(train_samples)))

Read 213 samples


For training, keep only sentences with entities:

In [4]:
train_tagged = [sample for sample in train_samples if len(sample.spans)>0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Kept 194 samples after removal of non-tagged samples


Inspect the entity tags found in the training set:

In [5]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)

Entities found in training set:


{'B-LOCATION',
 'B-ORGANIZATION',
 'B-PERSON',
 'B-PHONE_NUMBER',
 'B-TITLE',
 'I-LOCATION',
 'I-ORGANIZATION',
 'I-PERSON',
 'I-PHONE_NUMBER',
 'L-LOCATION',
 'L-ORGANIZATION',
 'L-PERSON',
 'L-PHONE_NUMBER',
 'L-TITLE',
 'O',
 'U-BIRTHDAY',
 'U-CREDIT_CARD',
 'U-EMAIL',
 'U-IBAN',
 'U-LOCATION',
 'U-NATIONALITY',
 'U-ORGANIZATION',
 'U-PERSON',
 'U-TITLE'}
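
The tags above follow the BILUO scheme (Begin, Inside, Last, Unit, Outside), so each entity type appears under several prefixes. A minimal sketch of how one might count entities per type, assuming an entity is counted once at its `B-` or `U-` token; the `count_entity_types` helper and the sample tags are hypothetical:

```python
from collections import Counter


def count_entity_types(tags):
    """Count entities per type from a flat list of BILUO tags."""
    counts = Counter()
    for tag in tags:
        if tag == "O":
            continue  # token outside any entity
        prefix, _, entity_type = tag.partition("-")
        # Count each entity once, at its first token (B-) or single token (U-)
        if prefix in ("B", "U"):
            counts[entity_type] += 1
    return counts


# Hypothetical tag sequence, not taken from the dataset
sample_tags = [
    "O", "O", "B-PERSON", "L-PERSON", "O",
    "U-CREDIT_CARD", "B-LOCATION", "I-LOCATION", "L-LOCATION",
]
entity_counts = count_entity_types(sample_tags)
print(entity_counts)
```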

Create Spacy dataset (option 2: pickle)

In [6]:
from presidio_evaluator import InputSample
import pickle

spacy_train = InputSample.create_spacy_dataset(train_tagged)


In [7]:
# Collect the label of every (start, end, label) span in the Spacy dataset
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)

{'GPE', 'O', 'ORG', 'PERSON'}
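
The label set above includes `'O'` alongside the Spacy entity types. Whether such spans should stay in the training data depends on the training code. If they are unwanted, one possible approach is to drop them before dumping, as in this sketch; the `drop_outside_spans` helper and the sample data (including the `'O'` span on the card number) are hypothetical:

```python
def drop_outside_spans(dataset):
    """Return a copy of the dataset without spans labelled 'O'."""
    cleaned = []
    for text, annotations in dataset:
        kept = [span for span in annotations["entities"] if span[2] != "O"]
        cleaned.append((text, {"entities": kept}))
    return cleaned


# Hypothetical samples in the (full_text, {"entities": [...]}) structure
sample_dataset = [
    ("Please block card no 4929921611032795", {"entities": [(21, 37, "O")]}),
    ("have you heard Line Henriksen speak yet?", {"entities": [(15, 29, "PERSON")]}),
]
filtered = drop_outside_spans(sample_dataset)
print(filtered)
```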

Create Spacy dataset (option 1: JSON)

In [8]:
from presidio_evaluator import InputSample
spacy_train_json = InputSample.create_spacy_json(train_tagged)

194it [00:00, 17765.88it/s]


Quick evaluation of samples

In [9]:
[sample[0] for sample in spacy_train[:100]]

['Please block card no 4929921611032795',
 'have you heard Line Henriksen speak yet?',
 "Please tell me your date of birth. It's 12/18/1989",
 'The address of Platinum Interior Design is Rue du Chapy 336, Groot-Bijgaarden 1702',
 "A tribute to Laura Lane-Poole – sadly, she wasn't impressed.",
 'I have lost my card 4929149013148403. Could you please block my credit card ASAP ? , My name is Valida Kishiev.',
 'I want to increase limit on my card # 5509339531094917 for certain duration of time. is it possible?',
 'sometimes people call me sofie',
 "I have done an online order but didn't get any message on my registered 60-17-51-75. Could you please look into it ?",
 "I'd like to order a taxi to Smáratún 31, Vík 870",
 'I once lived in 1541 Wit Rd, Johannesburg 2051. I now live in Avenida Noruega 42, Vila Real 5000-047',
 'How do I change the address linked to my credit card to Kringlan 66, Reykjavík 107?',
 'Need to see last 10 transaction of card 5114430119534676',
 'Can I withdraw cas

In [10]:
spacy_train_json[0]['paragraphs'][0]['sentences']

[{'tokens': [{'orth': 'Please', 'tag': 'UH', 'ner': 'O'},
   {'orth': 'block', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'card', 'tag': 'NN', 'ner': 'O'},
   {'orth': 'no', 'tag': 'DT', 'ner': 'O'},
   {'orth': '4929921611032795', 'tag': 'CD', 'ner': 'O'}]}]
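
The nesting shown above (document, then `paragraphs`, then `sentences`, then `tokens`) can be walked to list the tokens that carry an entity tag. A minimal sketch, assuming documents shaped like the output above; the `tagged_tokens` helper and the sample document are hypothetical:

```python
def tagged_tokens(doc):
    """Return (orth, ner) pairs for every token whose ner tag is not 'O'."""
    found = []
    for paragraph in doc["paragraphs"]:
        for sentence in paragraph["sentences"]:
            for token in sentence["tokens"]:
                if token["ner"] != "O":
                    found.append((token["orth"], token["ner"]))
    return found


# Hypothetical document mirroring the structure printed above
doc = {"paragraphs": [{"sentences": [{"tokens": [
    {"orth": "Call", "tag": "VB", "ner": "O"},
    {"orth": "Line", "tag": "NNP", "ner": "B-PERSON"},
    {"orth": "Henriksen", "tag": "NNP", "ner": "L-PERSON"},
]}]}]}

print(tagged_tokens(doc))
```

Applied to the first training document shown above, such a helper would return an empty list, since every token there is tagged `'O'`.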

Dump the training set to pickle and JSON files, respectively

In [11]:
import pickle
import json
with open("../../data/train.pickle", 'wb') as handle:
    pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("../../data/train.json","w") as f:
    json.dump(spacy_train_json,f)
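
To confirm that a dumped file can be read back, a round trip along these lines may help. This is a sketch that writes to a temporary directory rather than `../../data`, and the sample data is hypothetical. Note that pickle preserves tuples, whereas a JSON round trip would turn them into lists.

```python
import json
import os
import pickle
import tempfile

# Hypothetical sample in the (full_text, {"entities": [...]}) structure
sample = [("Call Line Henriksen", {"entities": [(5, 19, "PERSON")]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "sample.pickle")
    with open(pickle_path, "wb") as handle:
        pickle.dump(sample, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)

# Pickle keeps the tuple structure intact
print(restored == sample)

# JSON has no tuple type, so tuples come back as lists
json_restored = json.loads(json.dumps(sample))
print(type(json_restored[0]))
```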
       

Create JSON and pickle files for test dataset

In [12]:
test_samples = read_synth_dataset(data_path.format("test",DATA_DATE))
print("Read {} samples".format(len(test_samples)))

Read 59 samples


In [13]:
spacy_test = InputSample.create_spacy_dataset(test_samples)
spacy_test_json = InputSample.create_spacy_json(test_samples)
print(spacy_test[14])

59it [00:00, 11821.71it/s]


('Szymon Walczak listed his top 20 songs for Entertainment Weekly and had the balls to list this song at #15. (What did he put at #1 you ask? Answer:"Tube Snake Boogie" by Fernanda Ricci – go figure)', {'entities': [(0, 14, 'PERSON'), (170, 184, 'PERSON')]})


Dump the test set to pickle and JSON files, respectively

In [14]:
import pickle
import json
with open("../../data/test.pickle", 'wb') as handle:
    pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open("../../data/test.json","w") as f:
    json.dump(spacy_test_json,f)
       
