This notebook is a part of the diploma thesis "PII detection in unstructured texts". It is intended to be run from development environment Google Colab.

It uses 3 main frameworks/tools:
- spaCy (https://spacy.io/)
- Presidio (https://microsoft.github.io/presidio/)
- Streamlit (https://streamlit.io/)

In order to successfully run all cells, it is required to connect a personal Google disk with sufficient amount of empty space to save trained NER models weights.

Also, in order to achieve the best results, it is recommended to run a GPU session.

Complete source codes as well as the thesis' text itself can be found on https://github.com/ondrasekd/DP

## Dependencies
Install and import spacy dependencies for either CPU or GPU utilization

### CPU
CPU variant can be used if you plan only to utilize CPU power while training the NER models and run CPU model variants for inference

In [1]:
!pip install --upgrade spacy
import spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Downloading spacy-3.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
[K     |████████████████████████████████| 6.2 MB 4.1 MB/s 
[?25hCollecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.1-py3-none-any.whl (27 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
[K     |████████████████████████████████| 181 kB 45.8 MB/s 
[?25hCollecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.7-py3-none-any.whl (17 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
[K     |████████████████████████████████| 10.1 MB 39.2 MB/s 
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (457 kB)
[K     |████████████████████████████████| 457 kB 21.3 MB/s 
[?25hCollecting spacy-lo

### GPU
GPU is required if you plan to train and run transformer based models

In [1]:
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Downloading spacy-3.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
[K     |████████████████████████████████| 6.2 MB 15.3 MB/s 
[?25hCollecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (457 kB)
[K     |████████████████████████████████| 457 kB 70.3 MB/s 
Collecting thinc<8.1.0,>=8.0.14
  Downloading thinc-8.0.17-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (660 kB)
[K     |████████████████████████████████| 660 kB 71.8 MB/s 
[?25hCollecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
[K     |████████████████████████████████| 42 kB 1.5 MB/s 
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.2-py3-none-any.whl (7.2 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.7-py3-none-any.whl (17 kB)
Collecting pydantic!=1.

In [2]:
!pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.10.1+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.10.1%2Bcu111-cp37-cp37m-linux_x86_64.whl (2137.7 MB)
[K     |████████████▌                   | 834.1 MB 1.3 MB/s eta 0:16:29tcmalloc: large alloc 1147494400 bytes == 0x3971a000 @  0x7f2c32927615 0x592b76 0x4df71e 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x549576 0x593fce 0x548ae9 0x5127f1 0x598e3b 0x511f68 0x598e3b 0x511f68 0x598e3b 0x511f68 0x4bc98a 0x532e76 0x594b72 0x515600 0x549576 0x593fce 0x548ae9 0x5127f1 0x549576 0x593fce 0x5118f8 0x593dd7
[K     |███████████████▉                | 1055.7 MB 1.3 MB/s eta 0:13:52tcmalloc: large alloc 1434370048 bytes == 0x7dd70000 @  0x7f2c32927615 0x592b76 0x4df71e 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x549576 0x593fce 0x548ae9 0x5127f1 0x598e3b 0x511f68 0x59

In [3]:
!pip install -U spacy[cuda111,transformers]
!export CUDA_PATH=/usr/local/cuda-11.1
!export PATH=/usr/local/cuda-11.1/bin${PATH:+:${PATH}}
!export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
!pip install cupy-cuda111

import spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers<1.2.0,>=1.1.2
  Downloading spacy_transformers-1.1.6-py2.py3-none-any.whl (51 kB)
[K     |████████████████████████████████| 51 kB 303 kB/s 
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 27.1 MB/s 
[?25hCollecting transformers<4.20.0,>=3.4.0
  Downloading transformers-4.19.3-py3-none-any.whl (4.2 MB)
[K     |████████████████████████████████| 4.2 MB 63.0 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 66.5 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
[K     |███████████████████████████

## Dataset

In this part, the following datasets are loaded and processed:
- **CNEC_extended** dataset, which contains only named entities supertypes (available at https://github.com/strakova/ner_tsd2016/tree/master/data/CNEC_2.0_konkol)
- **CNEC 2.0** dataset, which is then processed and and transformed to a special version, which contains information about lemmas (and experimentally POS tags) (original dataset available at https://github.com/strakova/ner_tsd2016/tree/master/data/CNEC_2.0)


Recognized named entities can be found in the following schematics. Coarse-grained dataset (derived from CNEC_extended) recognizes only the supertypes.

https://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

Mount google disk personal account to save/load transformed datasets and trained model weights

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Clone github repository, which contains needed datasets

In [None]:
%cd /content/

/content


In [None]:
!git clone https://github.com/strakova/ner_tsd2016.git

Cloning into 'ner_tsd2016'...
remote: Enumerating objects: 217, done.[K
remote: Total 217 (delta 0), reused 0 (delta 0), pack-reused 217[K
Receiving objects: 100% (217/217), 27.36 MiB | 16.31 MiB/s, done.
Resolving deltas: 100% (47/47), done.


Create directories structure in personal Google drive to save transformed datasets

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets'

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer’: File exists


The following function serves is used to load saved spacy formatted datasets from disk and return the collection of Doc objects

In [5]:
def dataset_docs_from_path(path):
  from spacy.tokens import DocBin
  nlp = spacy.blank("cs")

  # Load a collection of training docs
  dataset_docbin = DocBin()
  dataset_docbin.from_disk(path)

  return list(dataset_docbin.get_docs(nlp.vocab))

#### Coarse-grained CNEC2.0 Extended
Transform CNEC2.0 Extended dataset into the spacy binary format
Transformed dataset doesn't contain any additional morphological features

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy'

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended’: File exists


In [None]:
# convert each train, dev and test dataset to a spacy binary format
# group 10 sentences into each spacy doc
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data/CNEC_2.0_konkol/train.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy'
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data/CNEC_2.0_konkol/dtest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy'
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data/CNEC_2.0_konkol/etest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy'

[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (715 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/train.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (89 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/dtest.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (89 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/etest.spacy[0m


Show examples from dataset

In [None]:
cnec_extended_docs = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/train.spacy')

print(cnec_extended_docs[0])

for ent in cnec_extended_docs[0].ents:
  print (ent.text, ent.label_)

Jste světa znalý muž a víte stejně dobře jako já , že souvislost mezi současnými krutostmi v Jihovýchodní Asii a tou novou bankovní pobočkou hned vedle obchoďáku Zátoka je přímá a bezprostřední ; byl z toho už vzteklý jak uvázaný pes , protože zájemci o hodiny mu úplně narušili jeho denní režim a on si nemohl po obědě ani zdřímnout . I s Dubenkou , na kterou U tygra teď myslím . . . Hodil si kulovnici přes rameno a vydal se s význačným loveckým hostem do stráně , do kopce na krytou kazatelnu , aby s ním alespoň nemusel moknout . Já je normálně nosím tak " - a ukázal hřbetem dlaně na krajinu břišní . Když zakrátko Magora zavřeli , šli Němec a Jirousová za Václavem a rozhodli se , že případ budou publikovat . Když vyhraju , což je tutovka , dostanu od každého z vás litr slivovice já . Venku se již žádně zešeřilo a tma začínala houstnout . A jak TO vysvětlíte ? Zpívali jí Krásnou Meredith , ovšem ; 
Asii G
Zátoka I
Dubenkou P
U tygra I
Magora P
Němec P
Jirousová P
Václavem P
Krásnou Mered

### Fine-grained CNEC2.0
Transfer fine-grained CNEC2.0 into the spacy binary format and process dataset variations with or without additional morphological features like lemmas and POS tags

First, CNEC2.0 must be converted from its own proprietary dataformat to some standard dataformat like conll or conllu.

This operation is done by using a treex2conll2003 script available from Strakova's repo, which converts the Treex format to extended non-standard conll format

In [None]:
# please note this script will throw some exceptions, as the morphodita tool is not installed (nor needed)
%cd '/content/ner_tsd2016/utils/'
!./make_data.sh cnec2.0

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy'

Then, fine-grained CNEC2.0 without additional morphological features is converted to a binary spacy format

In [None]:
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data_tagged/CNEC_2.0/train.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy'
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data_tagged/CNEC_2.0/dtest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy'
!python -m spacy convert -n 10 -c conll '/content/ner_tsd2016/data_tagged/CNEC_2.0/etest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy'

[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (720 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/train.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/dtest.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/etest.spacy[0m


Show examples from dataset

In [None]:
cnec2_docs = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/train.spacy')

print(cnec2_docs[0])

for ent in cnec2_docs[0].ents:
  print (ent.text, ent.label_)

Jste světa znalý muž a víte stejně dobře jako já , že souvislost mezi současnými krutostmi v Jihovýchodní Asii a tou novou bankovní pobočkou hned vedle obchoďáku Zátoka je přímá a bezprostřední ; byl z toho už vzteklý jak uvázaný pes , protože zájemci o hodiny mu úplně narušili jeho denní režim a on si nemohl po obědě ani zdřímnout . I s Dubenkou , na kterou U tygra teď myslím . . . Hodil si kulovnici přes rameno a vydal se s význačným loveckým hostem do stráně , do kopce na krytou kazatelnu , aby s ním alespoň nemusel moknout . Já je normálně nosím tak " - a ukázal hřbetem dlaně na krajinu břišní . Když zakrátko Magora zavřeli , šli Němec a Jirousová za Václavem a rozhodli se , že případ budou publikovat . Když vyhraju , což je tutovka , dostanu od každého z vás litr slivovice já . Venku se již žádně zešeřilo a tma začínala houstnout . A jak TO vysvětlíte ? Zpívali jí Krásnou Meredith , ovšem ; 
Asii gt
Zátoka if
Dubenkou p_
U tygra if
Magora p_
Němec ps
Jirousová ps
Václavem pf
Krásn

The second transformed fine-grained dataset contains additional information about lemmas.

It can be created by a little hack, which is to convert CNEC2.0 conll dataset to a older spacy's proprietary format, which uses the JSON annotations.

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy'

In [None]:
!python -m spacy convert -n 10 -t json '/content/ner_tsd2016/data_tagged/CNEC_2.0/train.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json'
!python -m spacy convert -n 10 -t json '/content/ner_tsd2016/data_tagged/CNEC_2.0/etest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json'
!python -m spacy convert -n 10 -t json '/content/ner_tsd2016/data_tagged/CNEC_2.0/dtest.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json'

[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (1 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/train.json[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (1 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/etest.json[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (1 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/dtest.json[0m


There exists an error in spacy convert command, which allows to convert additional morphology info (lemmas) from non-standard conll formated dataset.

There, however, exists an annotation mismatch and "lemma" is annotated as "tag". Also, CNEC2.0's lemmas annotations are morphologicaly extended (details in https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01.html).
SpaCy doesn't support extended lemmas annotations, so lemmas must be converted to a standart non-extended format.

In [None]:
def transform_lemmas(input_filename, output_filename):
  with open(input_filename) as f:
    lines = [line.rstrip() for line in f]

  file_transformed = open(output_filename, 'w')

  for line in lines:
    line = line.replace("\"tag\":", "\"lemma\":")
    if "\"lemma\":" in line:
      line_split = line.split("_")
      line = line_split[0]
      if "-" in line:
       line_split = line.split("-")
       line = line_split[0]
      if not line.endswith(","):
       line = line + "\","
    file_transformed.write(line + "\n")

  file_transformed.close()

In [None]:
transform_lemmas('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/train.json', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/train_transformed.json')
transform_lemmas('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/etest.json', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/etest_transformed.json')
transform_lemmas('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/dtest.json', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/dtest_transformed.json')

Then, since spacy allows to transform v2's version datasets to a v3 binary format, lemmatized dataset is converted back to a spacy binary format, only now it contains the morphological info about lemmas

In [None]:
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/train_transformed.json' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy'
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/etest_transformed.json' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy'
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/json/dtest_transformed.json' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy'

[38;5;2m✔ Generated output file (720 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed.spacy[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/etest_transformed.spacy[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dtest_transformed.spacy[0m


Show examples from dataset

In [None]:
cnec2_lemmas_docs = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed.spacy')

print(cnec2_lemmas_docs[0])

for ent in cnec2_lemmas_docs[0].ents:
  print (ent.text, ent.label_, ent.lemma_)

Jste světa znalý muž a víte stejně dobře jako já , že souvislost mezi současnými krutostmi v Jihovýchodní Asii a tou novou bankovní pobočkou hned vedle obchoďáku Zátoka je přímá a bezprostřední ; byl z toho už vzteklý jak uvázaný pes , protože zájemci o hodiny mu úplně narušili jeho denní režim a on si nemohl po obědě ani zdřímnout . I s Dubenkou , na kterou U tygra teď myslím . . . Hodil si kulovnici přes rameno a vydal se s význačným loveckým hostem do stráně , do kopce na krytou kazatelnu , aby s ním alespoň nemusel moknout . Já je normálně nosím tak " - a ukázal hřbetem dlaně na krajinu břišní . Když zakrátko Magora zavřeli , šli Němec a Jirousová za Václavem a rozhodli se , že případ budou publikovat . Když vyhraju , což je tutovka , dostanu od každého z vás litr slivovice já . Venku se již žádně zešeřilo a tma začínala houstnout . A jak TO vysvětlíte ? Zpívali jí Krásnou Meredith , ovšem ; 
Asii gt Asie
Zátoka if zátoka
Dubenkou p_ Dubenka
U tygra if u tygr
Magora p_ magor
Němec 

The next step is to add morphological info "POS tag".
In order to do so, CNEC2.0 dataset in non-standard conll format must be converted to a more detailed format conllu.

Since POS tags are again (surprise surprise) annotated in a non-standard way, POS annotations must be transformed to a standard version.

Also, there are some important differences between conll and conllu formats, which must also be addressed before an attempt to convert dataset to spacy format from conllu.

Function conllplus_to_conllu is designed to resolve the issues above.

In [None]:
def conllplus_to_conllu(input_filename_conll, output_filename_conllu):
  with open(input_filename_conll) as f:
    lines = f.readlines()

  file_transformed = open(output_filename_conllu, 'w')

  doc_id_counter = 0
  docs_count = 1

  POS_mapping_single_char = {
      "A":"ADJ",
      "C":"NUM",
      "D":"ADV",
      "I":"INTJ",
      "N":"NOUN",
      "P":"PRON",
      "V":"VERB",
      "R":"ADP",
      "T":"PART",
      "X":"X",
      "Z":"PUNCT"
    }
  POS_mapping_double_char = {
      "J,":"SCONJ",
      "J^":"CCONJ"
  }

  for line in lines:
    if line in ['\n', '\r\n']:
      doc_id_counter = 0
      docs_count += 1
      file_transformed.write(line)
      continue
    word, lemma, upos, xpos, misc = line.split(" ")

    if "_" in lemma:
      lemma = lemma.split("_")[0]
    if "-" in lemma:
      lemma = lemma.split("-")[0]

    if POS_mapping_single_char.get(upos[0]) is not None:
      upos = POS_mapping_single_char.get(upos[0])
    elif POS_mapping_double_char.get(upos[0:1]) is not None:
      upos = POS_mapping_single_char.get(upos[0:1])
    else:
      upos = "X"

    doc_id_counter += 1
    line = str(doc_id_counter) + "\t" + word + "\t" + lemma + "\t" + upos + "\t" + "_" + "\t" + "_" + "\t"+ "_" + "\t"+ "_" + "\t"+ "_" + "\t" + misc
    
    file_transformed.write(line)

  print(str(docs_count) + "documents was transformed to conllu format")
  file_transformed.close()

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu'
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy'

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS’: File exists
mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy’: File exists


In [None]:
conllplus_to_conllu('/content/ner_tsd2016/data_tagged/CNEC_2.0/train.conll', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/train.conllu')
conllplus_to_conllu('/content/ner_tsd2016/data_tagged/CNEC_2.0/etest.conll', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/etest.conllu')
conllplus_to_conllu('/content/ner_tsd2016/data_tagged/CNEC_2.0/dtest.conll', '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/dtest.conllu')

7194documents was transformed to conllu format
900documents was transformed to conllu format
901documents was transformed to conllu format


Conllu dataset can then be converted to spacy format

In [None]:
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/train.conllu' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy'
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/etest.conllu' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy'
!python -m spacy convert -n 10 '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/conllu/dtest.conllu' '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy'

[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (720 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy/train.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy/etest.spacy[0m
[38;5;4mℹ Grouping every 10 sentences into a document.[0m
[38;5;2m✔ Generated output file (90 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy/dtest.spacy[0m


Show examples from dataset

In [None]:
cnec2_lemmas_pos_docs = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas_POS/spacy/train.spacy')

print(cnec2_lemmas_pos_docs[0])

for ent in cnec2_lemmas_pos_docs[0].ents:
  print (ent.text, ent.label_, ent.lemma_, ent.pos)

Jste světa znalý muž a víte stejně dobře jako já , že souvislost mezi současnými krutostmi v Jihovýchodní Asii a tou novou bankovní pobočkou hned vedle obchoďáku Zátoka je přímá a bezprostřední ; byl z toho už vzteklý jak uvázaný pes , protože zájemci o hodiny mu úplně narušili jeho denní režim a on si nemohl po obědě ani zdřímnout . I s Dubenkou , na kterou U tygra teď myslím . . . Hodil si kulovnici přes rameno a vydal se s význačným loveckým hostem do stráně , do kopce na krytou kazatelnu , aby s ním alespoň nemusel moknout . Já je normálně nosím tak " - a ukázal hřbetem dlaně na krajinu břišní . Když zakrátko Magora zavřeli , šli Němec a Jirousová za Václavem a rozhodli se , že případ budou publikovat . Když vyhraju , což je tutovka , dostanu od každého z vás litr slivovice já . Venku se již žádně zešeřilo a tma začínala houstnout . A jak TO vysvětlíte ? Zpívali jí Krásnou Meredith , ovšem ; 


Unfortunately, there is an error in spacy convert tool, which prevents to convert named entities annotations correctly from conllu format.

Specifically, there is a problem with predefined regex MISC_NER_PATTERN, which is designed incorrectly and doesn't allow to process named entities annotations.

You can see it for yourself at
https://github.com/explosion/spaCy/blob/master/spacy/training/converters/conllu_to_docs.py


### Data exploration
In this part, data from transformed dataset are explored and cleaned

The following function prints all named antity categories available in the provided dataset and counts total NEs count in the same category and also the NE's occurence in a context of the whole dataset 

In [None]:
def print_NE_occurences_docs(dataset_docs):
  from spacy.tokens import DocBin
  NEs = {}

  predefinedNEs = ["ah", "at", "az", "gc", "gh", "gl", "gq", "gr",
                   "gs", "gt", "gu", "g_", "ia", "ic", "if", "io",
                   "i_", "me", "mi", "mn", "ms", "na", "nb", "nc",
                   "ni", "no", "ns", "n_", "oa", "oe", "om", "op",
                   "or", "o_", "pc", "pd", "pf", "pm", "pp", "ps",
                   "p_", "td", "tf", "th", "tm", "ty"]

  for doc in dataset_docs:
    for ent in doc.ents:
      if NEs.get(ent.label_) is not None:
        NEs[ent.label_] += 1
      else:
        NEs[ent.label_] = 1

  total_NEs_count = sum(NEs.values()) 
  print("total detected named entity categories: " + str(len(NEs)))
  print("total named entites count: " + str(total_NEs_count))

  for predefinedNE in predefinedNEs:
    if (predefinedNE not in NEs.keys()):
      print("dataset is missing entity " + predefinedNE)

  NEs_sorted = dict(sorted(NEs.items(), key=lambda item: item[1]))

  print("NE\tcount\tpercent")
  for NE, NE_count in NEs_sorted.items():
    NE_percent = (NE_count/total_NEs_count) * 100
    NE_percent_short = "{:.2f}".format(NE_percent)
    print(str(NE) + "\t" + str(NE_count) + "\t" + str(NE_percent_short))

In [None]:
def print_NE_occurrences_path(dataset_path):
  dataset_docs = dataset_docs_from_path(dataset_path)
  print_NE_occurences_docs(dataset_docs)

In [None]:
docs_merged = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed.spacy') + dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dtest_transformed.spacy') + dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/etest_transformed.spacy')

In [None]:
print_NE_occurences_docs(docs_merged)

total detected named entity categories: 46
total named entites count: 26555
NE	count	percent
mi	22	0.08
tf	24	0.09
i_	40	0.15
gh	54	0.20
ms	73	0.27
pm	79	0.30
gt	81	0.31
pd	98	0.37
g_	102	0.38
pp	103	0.39
me	104	0.39
gl	106	0.40
na	106	0.40
az	121	0.46
ns	137	0.52
or	141	0.53
gq	153	0.58
o_	165	0.62
ni	175	0.66
ah	184	0.69
gr	204	0.77
om	209	0.79
at	229	0.86
ia	238	0.90
mn	240	0.90
pc	251	0.95
nb	273	1.03
gs	301	1.13
no	316	1.19
p_	317	1.19
op	404	1.52
oe	475	1.79
td	515	1.94
tm	577	2.17
n_	596	2.24
io	776	2.92
if	846	3.19
gc	1083	4.08
ty	1295	4.88
th	1436	5.41
ic	1474	5.55
oa	1776	6.69
gu	1878	7.07
nc	2021	7.61
pf	2901	10.92
ps	3856	14.52


#### Data cleaning

Dataset NEs' occurrences were analyzed (further info in the main diploma thesis' text).

Some NEs will be removed from the dataset by function remove_redundant_NEs

In [None]:
def remove_redundant_NEs(docs):
  NEs_to_remove = {
      "pp",
      "mi"
  }

  for doc in docs:
    doc_ents = [ent for ent in doc.ents if ent.label_ not in NEs_to_remove]
    doc.set_ents(doc_ents)

In [None]:
remove_redundant_NEs(docs_merged)

In [None]:
print_NE_occurences_docs(docs_merged)

total detected named entity categories: 44
total named entites count: 26430
dataset is missing entity mi
dataset is missing entity pp
NE	count	percent
tf	24	0.09
i_	40	0.15
gh	54	0.20
ms	73	0.28
pm	79	0.30
gt	81	0.31
pd	98	0.37
g_	102	0.39
me	104	0.39
gl	106	0.40
na	106	0.40
az	121	0.46
ns	137	0.52
or	141	0.53
gq	153	0.58
o_	165	0.62
ni	175	0.66
ah	184	0.70
gr	204	0.77
om	209	0.79
at	229	0.87
ia	238	0.90
mn	240	0.91
pc	251	0.95
nb	273	1.03
gs	301	1.14
no	316	1.20
p_	317	1.20
op	404	1.53
oe	475	1.80
td	515	1.95
tm	577	2.18
n_	596	2.26
io	776	2.94
if	846	3.20
gc	1083	4.10
ty	1295	4.90
th	1436	5.43
ic	1474	5.58
oa	1776	6.72
gu	1878	7.11
nc	2021	7.65
pf	2901	10.98
ps	3856	14.59


The same thing is then done for each dataset (train, dev, test)

In [None]:
train_transformed = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed.spacy')
dev_transformed = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dtest_transformed.spacy')
test_transformed = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/etest_transformed.spacy')

In [None]:
remove_redundant_NEs(train_transformed)
remove_redundant_NEs(dev_transformed)
remove_redundant_NEs(test_transformed)

The following function is used to serialize spacy's Doc objects and transform it to a DocBin.

DocBin can then be saved to a disk.

In [None]:
def serialize_docs(docs):
  from spacy.tokens import DocBin
  doc_bin = DocBin()
  for doc in docs:
    doc_bin.add(doc)

  return doc_bin

In [None]:
train_transformed_doc_bin = serialize_docs(train_transformed)
train_transformed_doc_bin.to_disk('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy')

dev_transformed_doc_bin = serialize_docs(dev_transformed)
dev_transformed_doc_bin.to_disk('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy')

test_transformed_doc_bin = serialize_docs(test_transformed)
test_transformed_doc_bin.to_disk('/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# NER model training
This part will be dedicated to training a NER model with use of a spaCy framework.

First, less computational power demanding CPU models shall be trained on different versions of created datasets. Models shall than be evaluated (F1 score was selected as the main evaluation criterion).


SpaCy uses (in version v3) as a means to specify model architecture config.cfg files. These files were prepared separatelly of this notebook and can be obtained in a project github repository.

In [None]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/models'

In [12]:
%cd '/content'
!git clone 'https://github.com/ondrasekd/DP.git'

/content
Cloning into 'DP'...
remote: Enumerating objects: 265, done.[K
remote: Counting objects: 100% (154/154), done.[K
remote: Compressing objects: 100% (104/104), done.[K
remote: Total 265 (delta 83), reused 120 (delta 49), pack-reused 111[K
Receiving objects: 100% (265/265), 46.50 MiB | 11.40 MiB/s, done.
Resolving deltas: 100% (129/129), done.


Function display_model_inference_example serves to show a minimal example of classified NEs detected by a provided model

In [5]:
def display_model_inference_example(path_to_spacy_model):
  from spacy import displacy
  nlp = spacy.load(path_to_spacy_model)
  doc = nlp("Pan Karel se nechal zaměstnat v Google, protože Microsoft se mu nezdál. Povídal, že v Čechách se tohle nenosí.")
  displacy.render(doc,jupyter=True, style = "ent")

This function sets the temporary global train variables to clean up the code a little

In [6]:
def set_temp_train_variables(config_path, model_dir_path, test_dataset_path):
  global cfg
  cfg = config_path

  global model_dir
  model_dir = model_dir_path

  global model_eval_path 
  model_eval_path = model_dir_path + "/model-best_eval/"

  global model_eval_path_json
  model_eval_path_json = model_eval_path + "eval.json"

  global model_best_path
  model_best_path = model_dir_path + "/model-best"

  global test_dataset
  test_dataset = test_dataset_path

### CPU models

**Coarse-grained CPU model**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/CPU_coarse.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/CPU_coarse',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/etest.spacy')

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/models/CPU_coarse’: File exists
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/CPU_coarse[0m
[38;5;4mℹ Using CPU[0m
[1m
[2022-06-04 16:16:10,025] [INFO] Set up nlp object from config
[2022-06-04 16:16:10,033] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/dtest.spacy
[2022-06-04 16:16:10,034] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0_extended/spacy/train.spacy
[2022-06-04 16:16:10,034] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-04 16:16:10,039] [INFO] Created vocabulary
[2022-06-04 16:16:10,040] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2022-06-04 16:16:14,314] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initializ

**Fine-grained CPU model without lemmas**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/CPU_fine_nomorph.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/etest.spacy')

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph[0m
[38;5;4mℹ Using CPU[0m
[1m
[2022-06-04 16:24:24,970] [INFO] Set up nlp object from config
[2022-06-04 16:24:24,979] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/dtest.spacy
[2022-06-04 16:24:24,980] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/no_morph/spacy/train.spacy
[2022-06-04 16:24:24,980] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-04 16:24:24,984] [INFO] Created vocabulary
[2022-06-04 16:24:24,985] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2022-06-04 16:24:29,646] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[2022-06-04 16:24:29,656] [DEBUG] Loading corpus from path: /content/drive

**Fine-grained CPU model with lemmas**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/CPU_fine_lemmas.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_lemmas',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/etest_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
[K     |████████████████████████████████| 98.5 MB 116 kB/s 
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_lemmas[0m
[38;5;4mℹ Using CPU[0m
[1m
[2022-06-04 16:38:52,836] [INFO] Set up nlp object from config
[2022-06-04 16:38:52,845] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dtest_transformed.spacy
[2022-06-04 16:38:52,846] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed.spacy
[2022-06-04 16:38:52,846] [INFO] Pipeline: ['tok2vec', 'trainable_lemmatizer', 'ner']
[2022-06-04 16:38:52,852] [DEBUG] Loading lookups from spa

### GPU models
NOTE: to train the following models, GPU session must be enabled (go to Runtime-> Change runtime type-> choose GPU)

Since the developed anonymization tool needs to use a fine-grained dataset, transformed and cleaned dataset CNEC2.0 shall be used in every GPU training.

**GPU bert-base-multilingual-uncased**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_bert_uncased.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_uncased',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train -g 0 --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_uncased’: File exists
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_uncased[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-06-04 22:43:42,947] [INFO] Set up nlp object from config
[2022-06-04 22:43:42,956] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy
[2022-06-04 22:43:42,958] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy
[2022-06-04 22:43:42,958] [INFO] Pipeline: ['transformer', 'trainable_lemmatizer', 'ner']
[2022-06-04 22:43:42,963] [DEBUG] Loading lookups from spacy-lookups-data: ['lexeme_norm', 'lemma_lookup']
[2022-06-04 22:43:43,179] [INFO] Added vocab lookups: lexeme_norm, lemma_lookup
[2022

In [8]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_bert_uncased.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_uncased',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate -g 0 $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_uncased/model-best_eval/’: File exists
[38;5;4mℹ Using GPU: 0[0m
[1m

TOK     100.00
LEMMA   92.39 
NER P   79.08 
NER R   79.53 
NER F   79.30 
SPEED   3688  

[1m

          P        R        F
p_    26.32    27.78    27.03
n_    37.35    41.89    39.49
pf    90.56    94.17    92.33
pm    63.64    87.50    73.68
ps    86.73    89.93    88.30
gc    81.98    77.78    79.82
oa    77.23    72.38    74.73
if    53.25    61.19    56.94
gu    75.30    77.64    76.45
ty    97.08    98.52    97.79
mn    43.48    37.04    40.00
oe    79.55    72.92    76.09
or    50.00    50.00    50.00
ic    73.39    56.88    64.08
io    63.33    58.46    60.80
op    71.43    64.10    67.57
g_    66.67    33.33    44.44
th    96.72    95.68    96.20
no    70.97    66.67    68.75
ia    38.46    32.26    35.09
nc    75.94    94.84    84.34
gr    52.63    47.62    50.00
pc    77.78    80.77    79.25
gs    68.75    68.75    6

**GPU bert-base-multilingual-cased**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_bert_cased.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train -g 0 --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
[K     |████████████████████████████████| 98.5 MB 1.1 MB/s 
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-06-05 11:18:01,175] [INFO] Set up nlp object from config
[2022-06-05 11:18:01,185] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy
[2022-06-05 11:18:01,186] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy
[2022-06-05 11:18:01,186] [INFO] Pipeline: ['transformer', 'trainable_lemmatizer', 'ner']
[2022-06-05 11:18:01,191] [DEBUG] Loading 

In [9]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_bert_cased.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate -g 0 $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

[38;5;4mℹ Using GPU: 0[0m
[1m

TOK     100.00
LEMMA   94.43 
NER P   78.76 
NER R   82.89 
NER F   80.77 
SPEED   6543  

[1m

          P        R        F
oa    80.17    81.17    80.67
ps    86.84    92.38    89.52
n_    33.94    50.00    40.44
pf    89.74    93.87    91.75
pm    66.67    75.00    70.59
gc    87.04    80.34    83.56
ic    73.19    63.12    67.79
if    55.13    64.18    59.31
gu    69.42    88.82    77.93
ty    95.04    99.26    97.10
mn    40.54    55.56    46.88
p_    29.03    25.00    26.87
io    75.38    75.38    75.38
op    66.67    66.67    66.67
g_    66.67    33.33    44.44
th    95.19    96.22    95.70
no    75.00    72.73    73.85
oe    80.00    66.67    72.73
nc    77.82    90.61    83.73
gq    61.54    44.44    51.61
gr    58.33    66.67    62.22
ia    66.67    32.26    43.48
pc    76.67    88.46    82.14
gs    81.25    81.25    81.25
ah    90.00    90.00    90.00
az    88.89    80.00    84.21
at    88.46    92.00    90.20
pd    90.00   100.00    94.74

**GPU small-e-czech**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_small_e_czech.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_small_e_czech',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train -g 0 --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/models/GPU_small_e_czech’: File exists
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/GPU_small_e_czech[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-06-05 08:42:15,380] [INFO] Set up nlp object from config
[2022-06-05 08:42:15,390] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy
[2022-06-05 08:42:15,391] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy
[2022-06-05 08:42:15,391] [INFO] Pipeline: ['transformer', 'trainable_lemmatizer', 'ner']
[2022-06-05 08:42:15,396] [DEBUG] Loading lookups from spacy-lookups-data: ['lexeme_norm', 'lemma_lookup']
[2022-06-05 08:42:15,594] [INFO] Added vocab lookups: lexeme_norm, lemma_lookup
[20

In [10]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_small_e_czech.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_small_e_czech',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate -g 0 $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/models/GPU_small_e_czech/model-best_eval/’: File exists
[38;5;4mℹ Using GPU: 0[0m
[1m

TOK     100.00
LEMMA   86.82 
NER P   66.39 
NER R   66.28 
NER F   66.34 
SPEED   18101 

[1m

         P       R       F
ps   78.45   82.31   80.34
ty   81.76   96.30   88.44
gu   59.32   65.22   62.13
pf   81.92   86.20   84.01
gr   63.64   33.33   43.75
ic   37.20   38.12   37.65
gc   69.05   74.36   71.60
oa   42.86   37.66   40.09
p_   11.11    8.33    9.52
mn   33.33   18.52   23.81
pm    0.00    0.00    0.00
op   38.00   48.72   42.70
o_   25.00    5.26    8.70
if   25.32   29.85   27.40
gq   44.44   22.22   29.63
io   37.18   44.62   40.56
g_    0.00    0.00    0.00
th   91.35   91.35   91.35
nc   72.97   88.73   80.08
oe   65.00   81.25   72.22
ia    0.00    0.00    0.00
pc   86.36   73.08   79.17
gs   52.63   31.25   39.22
ah   68.18   75.00   71.43
az   81.82   90.00   85.71
at   74.07   80.00   76.92
n_   21.21   18

**GPU RobeCzech**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_robeczech.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_robeczech',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train -g 0 --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
[K     |████████████████████████████████| 98.5 MB 101 kB/s 
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/GPU_robeczech[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-06-07 11:45:51,869] [INFO] Set up nlp object from config
[2022-06-07 11:45:51,878] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy
[2022-06-07 11:45:51,879] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy
[2022-06-07 11:45:51,879] [INFO] Pipeline: ['transformer', 'trainable_lemmatizer', 'ner']
[2022-06-07 11:45:51,884] [DEBUG] Loading l

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_robeczech.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_robeczech',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate -g 0 $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

**GPU Czert B-based cased**

In [None]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_czert_b_based.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_czert_b_based',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# in this case, since model uses lemmas, spacy-lookups-data should be installed
!pip install spacy-lookups-data

# create model directory
!mkdir $model_dir

# train model
!python -m spacy train -g 0 --verbose $cfg --output $model_dir

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/PIIAnonymizer/models/GPU_czert_b_based[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-06-07 12:15:16,754] [INFO] Set up nlp object from config
[2022-06-07 12:15:16,763] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/dev_transformed_clean.spacy
[2022-06-07 12:15:16,764] [DEBUG] Loading corpus from path: /content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/train_transformed_clean.spacy
[2022-06-07 12:15:16,764] [INFO] Pipeline: ['transformer', 'trainable_lemmatizer', 'ner']
[2022-06-07 12:15:16,769] [DEBUG] Loading lookups from spacy-lookups-data: ['lexeme_norm', 'lemma_lookup']
[2022-06-07 12:15:16,976] [INFO] Added vocab lookups: lexeme_norm, lemma_lookup
[2022-06-07 12:15:16,976] [INFO] Created vocabulary
[2022-06-07 12:15:16,977] [INFO] Finished initializing nlp 

In [12]:
set_temp_train_variables('/content/DP/src/spacy_config_files/GPU_czert_b_based.cfg',
                         '/content/drive/MyDrive/PIIAnonymizer/models/GPU_czert_b_based',
                         '/content/drive/MyDrive/PIIAnonymizer/datasets/CNEC2.0/lemmas/spacy/test_transformed_clean.spacy')

# evaluate on test dataset
!mkdir $model_eval_path
!python -m spacy evaluate -g 0 $model_best_path $test_dataset -o $model_eval_path_json -dp $model_eval_path

[38;5;4mℹ Using GPU: 0[0m
[1m

TOK     100.00
LEMMA   93.96 
NER P   77.47 
NER R   81.12 
NER F   79.25 
SPEED   8429  

[1m

          P        R        F
gs    58.62    53.12    55.74
ps    84.84    92.14    88.34
n_    48.28    56.76    52.17
pf    90.18    92.94    91.54
pm    83.33    62.50    71.43
gc    86.11    79.49    82.67
oa    68.53    66.53    67.52
g_    50.00    33.33    40.00
mn    60.00    44.44    51.06
ty    93.66    98.52    96.03
gu    71.43    83.85    77.14
p_    19.05    22.22    20.51
if    59.15    62.69    60.87
op    51.72    76.92    61.86
oe    80.00    91.67    85.44
or    30.00    37.50    33.33
io    67.14    72.31    69.63
ic    70.63    63.12    66.67
th    97.81    96.76    97.28
no    70.59    72.73    71.64
nc    77.78    92.02    84.30
gr    60.87    66.67    63.64
ia    42.86    29.03    34.62
pc    85.19    88.46    86.79
ah    72.73    80.00    76.19
az    70.00    70.00    70.00
at    82.14    92.00    86.79
gl    33.33    40.00    36.36

# Presidio evaluation

In this part, PIIAnonymizer tool is created, using Presidio SDK, and then evaluated on a custom evaluation dataset.

Please note, that custom recognizers' implementation as well as presidio setup is a part of the Streamlit GUI application, used in the last section. This code was copied to this notebook only to evaluate PIIAnonymizer tool on the evaluation dataset.

Install required presidio modules

In [6]:
!pip install presidio_analyzer
!pip install presidio_anonymizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting presidio_analyzer
  Downloading presidio_analyzer-2.2.28-py3-none-any.whl (63 kB)
[K     |████████████████████████████████| 63 kB 2.0 MB/s 
[?25hCollecting tldextract
  Downloading tldextract-3.3.0-py3-none-any.whl (93 kB)
[K     |████████████████████████████████| 93 kB 2.4 MB/s 
Collecting phonenumbers>=8.12
  Downloading phonenumbers-8.12.49-py2.py3-none-any.whl (2.6 MB)
[K     |████████████████████████████████| 2.6 MB 61.7 MB/s 
Collecting requests-file>=1.4
  Downloading requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Installing collected packages: requests-file, tldextract, phonenumbers, presidio-analyzer
Successfully installed phonenumbers-8.12.49 presidio-analyzer-2.2.28 requests-file-1.5.1 tldextract-3.3.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting presidio_anonymizer
  Downloading presidio_anonymi

**CPU version**

Create custom analyzer engine for a CPU session

In [9]:
import json
from json import JSONEncoder
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
import spacy


nlp = spacy.load('/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/model-best')

import logging
from typing import Optional, List, Tuple, Set

from presidio_analyzer import (
    RecognizerResult,
    LocalRecognizer,
    AnalysisExplanation,
)

logger = logging.getLogger("presidio-analyzer")

# this custom spacy recognizer is based on https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py
class SpacyRecognizerCustom(LocalRecognizer):
    """
    Recognize PII entities using a spaCy NLP model.
    This recognizer extract entities from NlpArtifacts and align their types with Presidio.
    :param supported_language: Language this recognizer supports
    :param supported_entities: The entities this recognizer can detect
    :param ner_strength: Default confidence for NER prediction
    :param check_label_groups: Tuple containing Presidio entity names
    and spaCy entity names, for verifying that the right entity
    is translated into a Presidio entity.
    """

    ENTITIES = [
        "PERSON",
        "EMAIL_ADDRESS",
        "LOGIN_NICK",
        "INSTITUTION",
        "PHONE_NUM",
        "MEDIA_NAME",
        "NUMBER_EXPR",
        "LOCATION",
        "PRODUCT",
        "DATE_TIME",
        "OTHER"
    ]

    DEFAULT_EXPLANATION = "Identified as {} by Spacy's Named Entity Recognition"

    CHECK_LABEL_GROUPS = [
        ({"PERSON"}, {"pd", "pf", "pm", "ps"}),
        ({"EMAIL_ADDRESS"}, {"me"}),
        ({"LOGIN_NICK"}, {"p_"}),
        ({"iNSTITUTION"}, {"ia", "ic", "if", "io", "i_"}),
        ({"PHONE_NUM"}, {"at"}),
        ({"MEDIA_NAME"}, {"mn", "ms"}),
        ({"NUMBER_EXPR"}, {"nb", "nc", "ni", "no", "ns", "n_"}),
        ({"LOCATION"}, {"ah", "az", "gc", "gh", "gl", "gq", "gr", "gs", "gt", "gu", "g_"}),
        ({"PRODUCT"}, {"op"}),
        ({"DATE_TIME"}, {"td", "tf", "th", "tm", "ty"}),
        ({"OTHER"}, {"oa", "or", "o_", "pc"})
    ]

    def __init__(
        self,
        supported_language: str = "cs",
        supported_entities: Optional[List[str]] = None,
        ner_strength: float = 0.82,
        check_label_groups: Optional[Tuple[Set, Set]] = None,
        context: Optional[List[str]] = None,
    ):
        self.ner_strength = ner_strength
        self.check_label_groups = (
            check_label_groups if check_label_groups else self.CHECK_LABEL_GROUPS
        )
        supported_entities = supported_entities if supported_entities else self.ENTITIES
        super().__init__(
            supported_entities=supported_entities,
            supported_language=supported_language,
            context=context,
        )

    def load(self) -> None:
        pass

    def build_spacy_explanation(
        self, original_score: float, explanation: str
    ) -> AnalysisExplanation:
        explanation = AnalysisExplanation(
            recognizer=self.__class__.__name__,
            original_score=original_score,
            textual_explanation=explanation,
        )
        return explanation

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        if not nlp_artifacts:
            logger.warning("Nlp artifacts not provided...")
            return results

        ner_entities = nlp_artifacts.entities

        for entity in entities:
            if entity not in self.supported_entities:
                continue
            for ent in ner_entities:
                if not self.__check_label(entity, ent.label_, self.check_label_groups):
                    continue
                textual_explanation = self.DEFAULT_EXPLANATION.format(ent.label_)
                explanation = self.build_spacy_explanation(
                    self.ner_strength, textual_explanation
                )
                spacy_result = RecognizerResult(
                    entity_type=entity,
                    start=ent.start_char,
                    end=ent.end_char,
                    score=self.ner_strength,
                    analysis_explanation=explanation,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name
                    },
                )
                results.append(spacy_result)

        return results

    @staticmethod
    def __check_label(
        entity: str, label: str, check_label_groups: Tuple[Set, Set]
    ) -> bool:
        return any(
            [entity in egrp and label in lgrp for egrp, lgrp in check_label_groups]
        )

spacy_recognizer_custom = SpacyRecognizerCustom()

from collections import defaultdict
from typing import List, Optional

from presidio_analyzer import Pattern, PatternRecognizer


class CSRCRecognizer(PatternRecognizer):
    """Recognize CS "rodne cislo" using regex.
    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern("rodne cislo (high)", r"\d{2}(0[1-9]|1[0-2]|5[1-9]|6[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])\/?\d{3,4}", 0.5)
    ]

    CONTEXT = [
        "rc",
        "rodne",
        "pojistence"
        "cislo"
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "cs",
        supported_entity: str = "CS_RC",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

rc_recognizer = CSRCRecognizer()

from presidio_analyzer.predefined_recognizers import CreditCardRecognizer, CryptoRecognizer, EmailRecognizer, IbanRecognizer, IpRecognizer, PhoneRecognizer, UrlRecognizer

credit_card_recognizer = CreditCardRecognizer(supported_language="cs", context=["kreditni", "debetni", "karta", "visa", "mastercard", "maestro", "platba"])
crypto_recognizer = CryptoRecognizer(supported_language="cs", context=["wallet", "btc", "bitcoin", "ethereum", "eth", "crypto", "kryptomena"])
email_recognizer = EmailRecognizer(supported_language="cs", context=["email", "mail", "e-mail"])
iban_recognizer = IbanRecognizer(supported_language="cs", context=["iban", "banka", "swift", "zahranicni", "transakce", "platba"])
ip_recognizer = IpRecognizer(supported_language="cs")
url_recognizer = UrlRecognizer(supported_language="cs", supported_entity="DOMAIN")

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider


# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "cs", "model_name": "cs_CPU_fine_nomorph"}],
}

# Create new recognizer registry and add the custom recognizer
recognizer_registry = RecognizerRegistry()

# append custom recognizers
recognizer_registry.add_recognizer(spacy_recognizer_custom)
recognizer_registry.add_recognizer(rc_recognizer)

# append predefined universal presidio recognizers
recognizer_registry.add_recognizer(credit_card_recognizer)
recognizer_registry.add_recognizer(crypto_recognizer)
recognizer_registry.add_recognizer(email_recognizer)
recognizer_registry.add_recognizer(iban_recognizer)
recognizer_registry.add_recognizer(ip_recognizer)

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_custom = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_custom, 
    supported_languages=["cs", "en"],
    registry=recognizer_registry
)

**GPU version**

Create custom analyzer engine for a GPU session

In [11]:
import json
from json import JSONEncoder
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
import spacy

nlp = spacy.load('/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/model-best')

import logging
from typing import Optional, List, Tuple, Set

from presidio_analyzer import (
    RecognizerResult,
    LocalRecognizer,
    AnalysisExplanation,
)

logger = logging.getLogger("presidio-analyzer")

# this custom spacy recognizer is based on https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py
class SpacyRecognizerCustom(LocalRecognizer):
    """
    Recognize PII entities using a spaCy NLP model.
    Since the spaCy pipeline is ran by the AnalyzerEngine,
    this recognizer only extracts the entities from the NlpArtifacts
    and replaces their types to align with Presidio's.
    :param supported_language: Language this recognizer supports
    :param supported_entities: The entities this recognizer can detect
    :param ner_strength: Default confidence for NER prediction
    :param check_label_groups: Tuple containing Presidio entity names
    and spaCy entity names, for verifying that the right entity
    is translated into a Presidio entity.
    """

    ENTITIES = [
        "PERSON",
        "EMAIL_ADDRESS",
        "LOGIN_NICK",
        "INSTITUTION",
        "PHONE_NUM",
        "MEDIA_NAME",
        "NUMBER_EXPR",
        "LOCATION",
        "PRODUCT",
        "DATE_TIME",
        "OTHER"
    ]

    DEFAULT_EXPLANATION = "Identified as {} by Spacy's Named Entity Recognition"

    CHECK_LABEL_GROUPS = [
        ({"PERSON"}, {"pd", "pf", "pm", "ps"}),
        ({"EMAIL_ADDRESS"}, {"me"}),
        ({"LOGIN_NICK"}, {"p_"}),
        ({"iNSTITUTION"}, {"ia", "ic", "if", "io", "i_"}),
        ({"PHONE_NUM"}, {"at"}),
        ({"MEDIA_NAME"}, {"mn", "ms"}),
        ({"NUMBER_EXPR"}, {"nb", "nc", "ni", "no", "ns", "n_"}),
        ({"LOCATION"}, {"ah", "az", "gc", "gh", "gl", "gq", "gr", "gs", "gt", "gu", "g_"}),
        ({"PRODUCT"}, {"op"}),
        ({"DATE_TIME"}, {"td", "tf", "th", "tm", "ty"}),
        ({"OTHER"}, {"oa", "or", "o_", "pc"})
    ]

    def __init__(
        self,
        supported_language: str = "cs",
        supported_entities: Optional[List[str]] = None,
        ner_strength: float = 0.82,
        check_label_groups: Optional[Tuple[Set, Set]] = None,
        context: Optional[List[str]] = None,
    ):
        self.ner_strength = ner_strength
        self.check_label_groups = (
            check_label_groups if check_label_groups else self.CHECK_LABEL_GROUPS
        )
        supported_entities = supported_entities if supported_entities else self.ENTITIES
        super().__init__(
            supported_entities=supported_entities,
            supported_language=supported_language,
            context=context,
        )

    def load(self) -> None:
        pass

    def build_spacy_explanation(
        self, original_score: float, explanation: str
    ) -> AnalysisExplanation:
        """
        Create explanation for why this result was detected.
        :param original_score: Score given by this recognizer
        :param explanation: Explanation string
        :return:
        """
        explanation = AnalysisExplanation(
            recognizer=self.__class__.__name__,
            original_score=original_score,
            textual_explanation=explanation,
        )
        return explanation

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        if not nlp_artifacts:
            return results

        ner_entities = nlp_artifacts.entities

        for entity in entities:
            if entity not in self.supported_entities:
                continue
            for ent in ner_entities:
                if not self.__check_label(entity, ent.label_, self.check_label_groups):
                    continue
                textual_explanation = self.DEFAULT_EXPLANATION.format(ent.label_)
                explanation = self.build_spacy_explanation(
                    self.ner_strength, textual_explanation
                )
                spacy_result = RecognizerResult(
                    entity_type=entity,
                    start=ent.start_char,
                    end=ent.end_char,
                    score=self.ner_strength,
                    analysis_explanation=explanation,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name
                    },
                )
                results.append(spacy_result)

        return results

    @staticmethod
    def __check_label(
        entity: str, label: str, check_label_groups: Tuple[Set, Set]
    ) -> bool:
        return any(
            [entity in egrp and label in lgrp for egrp, lgrp in check_label_groups]
        )

# Create custom recognizer based on NER model NEs
spacy_recognizer_custom = SpacyRecognizerCustom()

from collections import defaultdict
from typing import List, Optional

from presidio_analyzer import Pattern, PatternRecognizer


class CSRCRecognizer(PatternRecognizer):
    """Recognize CS "rodne cislo" using regex.
    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern("rodne cislo (high)", r"\d{2}(0[1-9]|1[0-2]|5[1-9]|6[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])\/?\d{3,4}", 0.5)
    ]

    CONTEXT = [
        "rc",
        "rodne",
        "pojistence"
        "cislo"
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "cs",
        supported_entity: str = "CS_RC",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

rc_recognizer = CSRCRecognizer()

from presidio_analyzer.predefined_recognizers import CreditCardRecognizer, CryptoRecognizer, EmailRecognizer, IbanRecognizer, IpRecognizer, PhoneRecognizer, UrlRecognizer

credit_card_recognizer = CreditCardRecognizer(supported_language="cs", context=["kreditni", "debetni", "karta", "visa", "mastercard", "maestro", "platba"])
crypto_recognizer = CryptoRecognizer(supported_language="cs", context=["wallet", "btc", "bitcoin", "ethereum", "eth", "crypto", "kryptomena"])
email_recognizer = EmailRecognizer(supported_language="cs", context=["email", "mail", "e-mail"])
iban_recognizer = IbanRecognizer(supported_language="cs", context=["iban", "banka", "swift", "zahranicni", "transakce", "platba"])
ip_recognizer = IpRecognizer(supported_language="cs")
url_recognizer = UrlRecognizer(supported_language="cs", supported_entity="DOMAIN")

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider


# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "cs", "model_name": "cs_GPU_bert_cased"}],
}

# Create new recognizer registry and add the custom recognizer
recognizer_registry = RecognizerRegistry()

# append custom recognizers
recognizer_registry.add_recognizer(spacy_recognizer_custom)
recognizer_registry.add_recognizer(rc_recognizer)

# append predefined universal presidio recognizers
recognizer_registry.add_recognizer(credit_card_recognizer)
recognizer_registry.add_recognizer(crypto_recognizer)
recognizer_registry.add_recognizer(email_recognizer)
recognizer_registry.add_recognizer(iban_recognizer)
recognizer_registry.add_recognizer(ip_recognizer)

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_custom = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_custom, 
    supported_languages=["cs"],
    registry=recognizer_registry
)

### PIIAnonymizer evaluation
Developed tool is evaluated through the "presidio-research" repo, which can be used to generate mock PII datasets or to evaluate custom recognizers

In [12]:
%cd /content
!git clone https://github.com/microsoft/presidio-research.git
%cd /content/presidio-research/

/content
Cloning into 'presidio-research'...
remote: Enumerating objects: 1022, done.[K
remote: Counting objects: 100% (159/159), done.[K
remote: Compressing objects: 100% (56/56), done.[K
remote: Total 1022 (delta 114), reused 109 (delta 103), pack-reused 863[K
Receiving objects: 100% (1022/1022), 2.01 MiB | 16.99 MiB/s, done.
Resolving deltas: 100% (653/653), done.
/content/presidio-research


Install presidio-research package + dependencies

In [13]:
!pip install -r requirements.txt
!python setup.py install

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en_core_web_sm
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz (13.9 MB)
[K     |████████████████████████████████| 13.9 MB 5.2 MB/s 
Collecting haikunator>=2.1.0
  Downloading haikunator-2.1.0-py2.py3-none-any.whl (4.6 kB)
Collecting schwifty
  Downloading schwifty-2022.6.0-py3-none-any.whl (204 kB)
[K     |████████████████████████████████| 204 kB 5.1 MB/s 
[?25hCollecting faker>=9.6.0
  Downloading Faker-13.13.0-py3-none-any.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 55.5 MB/s 
Collecting pytest>=6.2.3
  Downloading pytest-7.1.2-py3-none-any.whl (297 kB)
[K     |████████████████████████████████| 297 kB 66.0 MB/s 
Collecting requests>=2.25.1
  Downloading requests-2.28.0-py3-none-any.whl (62 kB)
[K     |████████████████████████████████| 62 kB 1.7 MB/s 
[?25hCollecting xmltodict>

running install
running bdist_egg
running egg_info
creating presidio_evaluator.egg-info
writing presidio_evaluator.egg-info/PKG-INFO
writing dependency_links to presidio_evaluator.egg-info/dependency_links.txt
writing requirements to presidio_evaluator.egg-info/requires.txt
writing top-level names to presidio_evaluator.egg-info/top_level.txt
writing manifest file 'presidio_evaluator.egg-info/SOURCES.txt'
adding license file 'LICENSE'
adding license file 'NOTICE'
writing manifest file 'presidio_evaluator.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/presidio_evaluator
copying presidio_evaluator/validation.py -> build/lib/presidio_evaluator
copying presidio_evaluator/data_objects.py -> build/lib/presidio_evaluator
copying presidio_evaluator/__init__.py -> build/lib/presidio_evaluator
copying presidio_evaluator/span_to_tag.py -> build/lib/presidio_evaluator
creating bu

Import required modules

In [14]:
from copy import deepcopy
from pprint import pprint
import pandas as pd

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import PresidioAnalyzerWrapper

stanza and spacy_stanza are not installed
Flair is not installed by default
Flair is not installed


Evaluation dataset **contract_eval** contains a real world example of a contract, which needs to be anonymized before uploading to "Veřejný registr smluv".

It was annotated by the author, using the Label Studio annotation tool. This tool has an option to export annotated dataset in a CONLL2003ish format, which can be then after applying a simple fix converted to a spaCy V3 binary format.

In [13]:
!mkdir '/content/drive/MyDrive/PIIAnonymizer/datasets/eval'
!python -m spacy convert -n 10 '/content/DP/eval/contract_eval.conll' '/content/drive/MyDrive/PIIAnonymizer/datasets/eval'

mkdir: cannot create directory ‘/content/drive/MyDrive/PIIAnonymizer/datasets/eval’: File exists
[38;5;3m⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.[0m
[38;5;2m✔ Generated output file (1 documents):
/content/drive/MyDrive/PIIAnonymizer/datasets/eval/contract_eval.spacy[0m


Load contract_eval dataset as a list of spaCy Docs

In [15]:
docbin = dataset_docs_from_path('/content/drive/MyDrive/PIIAnonymizer/datasets/eval/contract_eval.spacy')

And convert these docs to a presidio-research evaluation data container (list of InputSamples)

In [16]:
input_samples = []
for doc in docbin:
  input_samples.append(InputSample.from_spacy_doc(doc=doc))

Then, since PresidioAnalyzerWrapper, available at https://github.com/microsoft/presidio-research/blob/master/presidio_evaluator/models/presidio_analyzer_wrapper.py contains a bug, which prevents from loading custom presidio recognizers in other language than "en", I copied the PresidioAnalyzerWrapper implementation and fix the bug myself.

In [17]:
#NOTE: this code is available at https://github.com/microsoft/presidio-research/blob/master/presidio_evaluator/models/presidio_analyzer_wrapper.py

from typing import List, Optional, Dict

from presidio_analyzer import AnalyzerEngine

from presidio_evaluator import InputSample, span_to_tag
from presidio_evaluator.models import BaseModel


class PresidioAnalyzerWrapper(BaseModel):
    def __init__(
        self,
        analyzer_engine: Optional[AnalyzerEngine] = None,
        entities_to_keep: List[str] = None,
        verbose: bool = False,
        labeling_scheme: str = "BIO",
        score_threshold: float = 0.4,
        language: str = "en",
        entity_mapping: Optional[Dict[str, str]] = None,
    ):
        """
        Evaluation wrapper for the Presidio Analyzer
        :param analyzer_engine: object of type AnalyzerEngine (from presidio-analyzer)
        """
        super().__init__(
            entities_to_keep=entities_to_keep,
            verbose=verbose,
            labeling_scheme=labeling_scheme,
            entity_mapping=entity_mapping,
        )
        self.score_threshold = score_threshold
        self.language = language

        if not analyzer_engine:
            analyzer_engine = AnalyzerEngine()
            self._update_recognizers_based_on_entities_to_keep(analyzer_engine)
        self.analyzer_engine = analyzer_engine

    def predict(self, sample: InputSample) -> List[str]:

        results = self.analyzer_engine.analyze(
            text=sample.full_text,
            entities=self.entities,
            language="cs",
            score_threshold=self.score_threshold,
        )
        starts = []
        ends = []
        scores = []
        tags = []
        #
        for res in results:
            starts.append(res.start)
            ends.append(res.end)
            tags.append(res.entity_type)
            scores.append(res.score)

        response_tags = span_to_tag(
            scheme="IO",
            text=sample.full_text,
            starts=starts,
            ends=ends,
            tokens=sample.tokens,
            scores=scores,
            tags=tags,
        )
        return response_tags


    def _update_recognizers_based_on_entities_to_keep(
        self, analyzer_engine: AnalyzerEngine
    ):
        """Check if there are any entities not supported by this presidio instance.
        Add ORGANIZATION as it is removed by default
        """
        supported_entities = analyzer_engine.get_supported_entities(
            language=self.language
        )
        print("Entities supported by this Presidio Analyzer instance:")
        print(", ".join(supported_entities))

        if not self.entities:
            self.entities = supported_entities

        for entity in self.entities:
            if entity not in supported_entities:
                print(
                    f"Entity {entity} is not supported by this instance of Presidio Analyzer Engine"
                )

        if "ORGANIZATION" in self.entities and "ORGANIZATION" not in supported_entities:
            recognizers = analyzer_engine.get_recognizers()
            spacy_recognizer = [
                rec
                for rec in recognizers
                if rec.name == "SpacyRecognizer" or rec.name == "StanzaRecognizer"
            ]
            if len(spacy_recognizer):
                spacy_recognizer = spacy_recognizer[0]
                spacy_recognizer.supported_entities.append("ORGANIZATION")
                self.entities.append("ORGANIZATION")
                print("Added ORGANIZATION as a supported entity from spaCy/Stanza")

The whole PIIAnonymizer tool is then evaluated on a contract_eval dataset.

presidio-research package uses the PresidioAnalyzerWrapper to load the analyze engine and Evaluator object to run inference on a eval dataset and to calculate a final metrics - Presision, Recall and F1 score.

In [18]:
presidio_entities_map = {
        "PERSON": "PERSON",
        "LOCATION": "LOCATION",
        "EMAIL_ADDRESS": "EMAIL_ADDRESS",
        "CREDIT_CARD": "CREDIT_CARD",
        "PHONE_NUM": "PHONE_NUM",
        "DATE_TIME": "DATE_TIME",
        "DOMAIN": "DOMAIN",
        "IBAN_CODE": "IBAN_CODE",
        "IP_ADDRESS": "IP_ADDRESS",
        "INSTITUTION": "INSTITUTION",
        "LOGIN_NICK": "LOGIN_NICK",
        "MEDIA_NAME": "MEDIA_NAME",
        "NUMBER_EXPR": "NUMBER_EXPR",
        "OTHER": "OTHER",
        "CS_RC": "CS_RC",
        "CRYPTO": "CRYPTO"
    }

model = PresidioAnalyzerWrapper(analyzer_engine=analyzer, language=["cs"])

evaluator = Evaluator(model=model)
dataset = Evaluator.align_entity_types(
    deepcopy(input_samples), entities_mapping=presidio_entities_map
)

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

entities, confmatrix = results.to_confusion_matrix()

print(results)

Evaluating <class '__main__.PresidioAnalyzerWrapper'>: 100%|██████████| 1/1 [00:32<00:00, 32.07s/it]

              Entity           Precision              Recall   Number of samples
         INSTITUTION                nan%               0.00%                  22
            LOCATION              94.44%              87.18%                  39
              DOMAIN                nan%               0.00%                   1
           IBAN_CODE             100.00%             100.00%                   1
           DATE_TIME              46.67%              77.78%                   9
           PHONE_NUM             100.00%             100.00%                   5
       EMAIL_ADDRESS             100.00%             100.00%                   2
              PERSON              75.00%             100.00%                   9
               OTHER              46.43%              54.17%                  24
         NUMBER_EXPR              92.31%              89.36%                  94
                 PII              85.64%              81.07%                 206
PII F measure: 81.67%





# Streamlit application

For the implementation of the PII anonymizer tool itself was choosed a SDK Presidio together with a Streamlit framework.

Presidio offers the tools to implement application for PII anonymization. Streamlit frameworks allows to create simple python-based graphic web applications.

PII anonymizer source codes can be found here: https://github.com/ondrasekd/DP/tree/master/src/streamlit_app

This notebook is meant as a way to simply run this application within the Google Colab environment.

First, install Streamlit in this virtual machine session.

In [10]:
# user streamlit version 1.7.0 because of the installed google colab python version
!pip install streamlit==1.7.0 pandas presidio-analyzer presidio-anonymizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting streamlit==1.7.0
  Downloading streamlit-1.7.0-py2.py3-none-any.whl (9.9 MB)
[K     |████████████████████████████████| 9.9 MB 14.7 MB/s 
Collecting validators
  Downloading validators-0.20.0.tar.gz (30 kB)
Collecting pydeck>=0.1.dev5
  Downloading pydeck-0.7.1-py2.py3-none-any.whl (4.3 MB)
[K     |████████████████████████████████| 4.3 MB 42.0 MB/s 
Collecting gitpython!=3.1.19
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
[K     |████████████████████████████████| 181 kB 73.9 MB/s 
Collecting watchdog
  Downloading watchdog-2.1.9-py3-none-manylinux2014_x86_64.whl (78 kB)
[K     |████████████████████████████████| 78 kB 8.1 MB/s 
[?25hCollecting blinker
  Downloading blinker-1.4.tar.gz (111 kB)
[K     |████████████████████████████████| 111 kB 73.8 MB/s 
Collecting base58
  Downloading base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting pympler>=0.9
  Downloading Pym

You can run the PII Anonymizer tool either for CPU session of Google Colab or GPU session. GPU application gives better results.

in order to use spacy models in a Streamlit based app, python package must be generated for each trained pipeline and then installed.

Then, to expose the web application running on a Google Colab's virtual machine local host, it is run through a localtunnel tool. Running web application than can be accessed on a provided URL.

**CPU version**

In [7]:
nlp = spacy.load('/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/model-best')
!python -m spacy package /content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/model-best /content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph --name CPU_fine_nomorph

[38;5;4mℹ Building package artifacts: sdist[0m
[38;5;2m✔ Loaded meta.json from file[0m
/content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/model-best/meta.json

[38;5;1m✘ Package directory already exists[0m
Please delete the directory and try again, or use the `--force` flag to
overwrite existing directories.



In [8]:
!pip install /content/drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/cs_CPU_fine_nomorph-0.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./drive/MyDrive/PIIAnonymizer/models/CPU_fine_nomorph/cs_CPU_fine_nomorph-0.0.0
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Building wheels for collected packages: cs-CPU-fine-nomorph
  Building wheel for cs-CPU-fine-nomorph (setup.py) ... [?25l[?25hdone
  Created wheel for cs-CPU-fine-nomorph: filename=cs_CPU_fine_nomorph-0.0.0-py3-none-any.whl size=6615614 sha256=09a6208474d5356d89e475bb7760215234bf221f0ed3ad3b74676f0b350dc35d
  Stored in directory: /root/.cache/pip/wheels/9c/0b/90/789fe086c9ee3fbac87b5a3d4c83fbfbac3796

In [39]:
!streamlit run /content/DP/src/streamlit_app/presidio_streamlit_CPU_best.py & npx localtunnel --port 8501

2022-06-10 16:49:23.080 INFO    numexpr.utils: NumExpr defaulting to 2 threads.
[K[?25hnpx: installed 22 in 2.944s
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Network URL: [0m[1mhttp://172.28.0.2:8501[0m
[34m  External URL: [0m[1mhttp://104.199.214.178:8501[0m
[0m
your url is: https://thin-candles-stay-104-199-214-178.loca.lt
2022-06-10 16:49:41.598 Loaded recognizer: SpacyRecognizerCustom
2022-06-10 16:49:41.598 Loaded recognizer: CSRCRecognizer
2022-06-10 16:49:41.599 Loaded recognizer: CreditCardRecognizer
2022-06-10 16:49:41.599 Loaded recognizer: CryptoRecognizer
2022-06-10 16:49:41.599 Loaded recognizer: EmailRecognizer
2022-06-10 16:49:41.599 Loaded recognizer: IbanRecognizer
2022-06-10 16:49:41.599 Loaded recognizer: IpRecognizer
2022-06-10 16:49:41.827 Created NLP engine: spacy. Loaded models: ['cs']
2022-06-10 16:49:41.827 Fetching all recognizers for language cs
2022-06-10 16:49:41.827 Fetching all recognizers for language c

**GPU version**

In [7]:
nlp = spacy.load('/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/model-best')
!python -m spacy package /content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/model-best /content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased --name GPU_bert_cased

[38;5;4mℹ Building package artifacts: sdist[0m
[38;5;2m✔ Including 1 package requirement(s) from meta and config[0m
spacy-transformers>=1.1.6,<1.2.0
[38;5;2m✔ Loaded meta.json from file[0m
/content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/model-best/meta.json

[38;5;1m✘ Package directory already exists[0m
Please delete the directory and try again, or use the `--force` flag to
overwrite existing directories.



In [8]:
!pip install /content/drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/cs_GPU_bert_cased-0.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./drive/MyDrive/PIIAnonymizer/models/GPU_bert_cased/cs_GPU_bert_cased-0.0.0
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Building wheels for collected packages: cs-GPU-bert-cased
  Building wheel for cs-GPU-bert-cased (setup.py) ... [?25l[?25hdone
  Created wheel for cs-GPU-bert-cased: filename=cs_GPU_bert_cased-0.0.0-py3-none-any.whl size=666693901 sha256=77faef2d6fe4ac6f3d3cece26d041c43b697249150540348760a267f9dfb99f1
  Stored in directory: /root/.cache/pip/wheels/3d/a0/60/92480eda7ca2fb534bb29f15b7a930e49d1d201183709e27

In [13]:
!streamlit run /content/DP/src/streamlit_app/presidio_streamlit_GPU_best.py & npx localtunnel --port 8501

2022-06-10 17:32:15.855 INFO    numexpr.utils: NumExpr defaulting to 2 threads.
[K[?25hnpx: installed 22 in 2.077s
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Network URL: [0m[1mhttp://172.28.0.2:8501[0m
[34m  External URL: [0m[1mhttp://34.91.51.235:8501[0m
[0m
your url is: https://tangy-radios-stay-34-91-51-235.loca.lt
2022-06-10 17:32:42.413 Loaded recognizer: SpacyRecognizerCustom
2022-06-10 17:32:42.413 Loaded recognizer: CSRCRecognizer
2022-06-10 17:32:42.413 Loaded recognizer: CreditCardRecognizer
2022-06-10 17:32:42.413 Loaded recognizer: CryptoRecognizer
2022-06-10 17:32:42.413 Loaded recognizer: EmailRecognizer
2022-06-10 17:32:42.413 Loaded recognizer: IbanRecognizer
2022-06-10 17:32:42.413 Loaded recognizer: IpRecognizer
2022-06-10 17:32:46.976 Created NLP engine: spacy. Loaded models: ['cs']
2022-06-10 17:32:46.976 Fetching all recognizers for language cs
2022-06-10 17:32:46.976 Fetching all recognizers for language cs
[34