<a href="https://colab.research.google.com/github/ontocord/muliwai/blob/main/PIISA_PII_Workspace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# License

Copyright 2022 Authors of this Notebook

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# What is this colab?

This colab is a workspace for privacy-preserving research. The code here detects the kinds of PII listed below for all languages in BigScience. It also replaces the names of non-public figures with fake names, and replaces high-risk "character"-based spans with labels (e.g., \<ID>, \<PHONE>).

Languages assumed are ["ar", "as", "bn", "ca", "en", "es", "eu", "fr", "gu", "hi", "id", "ig", "mr", "ny", "pa", "pt", "sn", "st", "sw", "ur", "vi", "xh", "yo", "zh", "zu"]

We hope this code is generally useful to the open source community. Feel free to copy this colab and experiment!

## Highest Risk
### Simple spans of characters:
*   **IDs [general]:** This is anything that is a sequence of 6 or more digits, as is common in identifiers for people internationally (national IDs, tax IDs, passport numbers, etc.), credit card numbers, IBAN codes, etc.
*   **Keys [general]:** This is anything that is a sequence of digits and letters in the same string.  Common for API keys, etc.
*   **Email address**, **User name**: Strings using @
*   **IP address**: Digits with periods in them
*   **Phone number**
*   **License plate**
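
To make the simple span types above concrete, here is a minimal sketch of character-level detection for three of them. The patterns and the `find_simple_spans` helper are simplified assumptions written for illustration; they are not the rules shipped in muliwai, which are language-specific and considerably more involved.

```python
import re

# Simplified, hypothetical patterns for illustration only (not muliwai's rulebase).
SIMPLE_PATTERNS = {
    "ID": re.compile(r"\b\d{6,}\b"),                               # 6 or more digits
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),      # strings using @
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),      # digits with periods
}

def find_simple_spans(text):
    """Return (span_text, start, end, label) tuples sorted by start offset."""
    spans = []
    for label, pattern in SIMPLE_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.group(), m.start(), m.end(), label))
    return sorted(spans, key=lambda s: s[1])

result = find_simple_spans("Contact jane@example.com from 192.168.0.1, ID 1234567")
print(result)
```

The tuple layout mirrors the `(text, start, end, label)` entries that appear in the `ner` outputs later in this notebook.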

### More complex spans:
* **Full Names**: Requires additional NER transformer models and spacy (Requires GPU)
* **Address**: (WIP)


## Lower Risk (we're keeping)
*   **URL**
*   **Time**: dateparser dependency
*   **Date**: dateparser dependency
*   **Age**

In [None]:
high_risk_stuff = {'ID', 'KEY', 'EMAIL', 'USER', 'IP_ADDRESS', 'PHONE', 'LICENSE_PLATE', 'PERSON'}
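
One way such a label set could be used is to split detected entities into high-risk and lower-risk groups. The sketch below assumes entities are `(text, start, end, label)` tuples, as in the outputs shown later; `split_by_risk` is a hypothetical helper, not a muliwai function.

```python
# Same label set as in the cell above, repeated so the sketch is self-contained.
high_risk_stuff = {'ID', 'KEY', 'EMAIL', 'USER', 'IP_ADDRESS', 'PHONE', 'LICENSE_PLATE', 'PERSON'}

def split_by_risk(ner):
    """Split (text, start, end, label) tuples into (high_risk, lower_risk) lists."""
    high = [span for span in ner if span[3] in high_risk_stuff]
    low = [span for span in ner if span[3] not in high_risk_stuff]
    return high, low

high, low = split_by_risk([("a@b.co", 0, 6, "EMAIL"), ("2018", 10, 14, "DATE")])
print(high, low)
```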

# Mount drive - link shared data into MyDrive

First, open this link: https://drive.google.com/file/d/1_tmHHme7ZhDFb2YLtWOWz5DZdfP-7QSC/view

Then, from the top left-hand corner, create a shortcut to this file in your Google Drive. After you mount your drive, it will appear at /content/drive/MyDrive/labelled_data.csv.

Do the same for this file: https://drive.google.com/file/d/1dDZZkfcynynmDyyvKLjDB_C3iDNWRncQ/view

The lang_subset.jsonl file will then appear in your MyDrive as well.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
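
After mounting, it can help to confirm that both shortcuts resolved. This is a small sketch using only the standard library; `missing_files` is a hypothetical helper, and the `lang_subset.jsonl` path is an assumption based on the note that the file appears in MyDrive.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]

expected = [
    "/content/drive/MyDrive/labelled_data.csv",
    "/content/drive/MyDrive/lang_subset.jsonl",  # assumed location
]
print("Missing:", missing_files(expected))
```

An empty list means both files are visible; otherwise, recreate the shortcut for each listed path and remount.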


# Muliwai base

In [None]:
#@title Install dependencies
%%capture
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install spacy==3.1.0 dateparser python-stdnum protobuf cdifflib transformers datasets langid faker sentencepiece fsspec tqdm sentence-transformers nltk
!python -m nltk.downloader punkt wordnet
import spacy
try:
  spacy.load('en_core_web_sm')
except OSError:  # spaCy raises OSError when the model is not installed
  !python -m spacy download en_core_web_sm
  !python -m spacy download fr_core_news_sm
  !python -m spacy download ca_core_news_sm
  !python -m spacy download pt_core_news_sm
  !python -m spacy download zh_core_web_sm


In [None]:
#@title Overwrite current Muliwai Dir
%%capture
%cd /content/
!rm -rf /content/muliwai
!git clone https://github.com/piisa/muliwai

In [None]:
#@title Load Libpostal (Optional); this will take a long time.
%%capture
use_libpostal = True
if use_libpostal: #only do this if you want to use libpostal
    !sudo apt-get install curl autoconf automake libtool pkg-config
    !git clone https://github.com/openvenues/libpostal
    %cd libpostal
    !make distclean
    !./bootstrap.sh
    !./configure --datadir=/content/libpostal_data
    !make -j4
    !sudo make install
    !pip install postal
    !cp /usr/local/lib/libpostal.so /usr/lib/libpostal.so.1

# PERSON + ID anonymizer

In [None]:
%cd /content/


/content


In [None]:

langs = {
        "af": "Afrikaans",
        "als": "Tosk Albanian",
        "am": "Amharic",
        "an": "Aragonese",
        "ar": "Arabic",
        "arz": "Egyptian Arabic",
        "ast": "Asturian",
        "as": "Assamese",
        "av": "Avaric",
        "azb": "South Azerbaijani",
        "az": "Azerbaijani",
        "bar": "Bavarian",
        "ba": "Bashkir",
        "bcl": "Central Bikol",
        "be": "Belarusian",
        "bg": "Bulgarian",
        "bh": "Bihari",
        "bn": "Bengali",
        "bo": "Tibetan",
        "bpy": "Bishnupriya",
        "br": "Breton",
        "bs": "Bosnian",
        "bxr": "Russia Buriat",
        "ca": "Catalan",
        "cbk": "Chavacano",
        "ceb": "Cebuano",
        "ce": "Chechen",
        "ckb": "Central Kurdish",
        "cs": "Czech",
        "cv": "Chuvash",
        "cy": "Welsh",
        "da": "Danish",
        "de": "German",
        "diq": "Dimli",
        "dsb": "Lower Sorbian",
        "dv": "Dhivehi",
        "el": "Modern Greek",
        "eml": "Emilian-Romagnol",
        "en": "English",
        "eo": "Esperanto",
        "es": "Spanish",
        "et": "Estonian",
        "eu": "Basque",
        "fa": "Persian",
        "fi": "Finnish",
        "frr": "Northern Frisian",
        "fr": "French",
        "fy": "Western Frisian",
        "ga": "Irish",
        "gd": "Scottish Gaelic",
        "gl": "Galician",
        "gn": "Guarani",
        "gom": "Goan Konkani",
        "gu": "Gujarati",
        "he": "Hebrew",
        "hi": "Hindi",
        "hr": "Croatian",
        "hsb": "Upper Sorbian",
        "ht": "Haitian",
        "hu": "Hungarian",
        "hy": "Armenian",
        "ia": "Interlingua",
        "id": "Indonesian",
        "ie": "Interlingue",
        "ilo": "Iloko",
        "io": "Ido",
        "is": "Icelandic",
        "it": "Italian",
        "ja": "Japanese",
        "jbo": "Lojban",
        "jv": "Javanese",
        "ka": "Georgian",
        "kk": "Kazakh",
        "km": "Central Khmer",
        "kn": "Kannada",
        "ko": "Korean",
        "krc": "Karachay-Balkar",
        "ku": "Kurdish",
        "kv": "Komi",
        "kw": "Cornish",
        "ky": "Kirghiz",
        "la": "Latin",
        "lb": "Luxembourgish",
        "lez": "Lezghian",
        "li": "Limburgan",
        "lmo": "Lombard",
        "lo": "Lao",
        "lrc": "Northern Luri",
        "lt": "Lithuanian",
        "lv": "Latvian",
        "mai": "Maithili",
        "mg": "Malagasy",
        "mhr": "Eastern Mari",
        "min": "Minangkabau",
        "mk": "Macedonian",
        "ml": "Malayalam",
        "mn": "Mongolian",
        "mrj": "Western Mari",
        "mr": "Marathi",
        "ms": "Malay",
        "mt": "Maltese",
        "mwl": "Mirandese",
        "my": "Burmese",
        "myv": "Erzya",
        "mzn": "Mazanderani",
        "nah": "Nahuatl", # languages
        "nap": "Neapolitan",
        "nds": "Low German",
        "ne": "Nepali",
        "new": "Newari",
        "nl": "Dutch",
        "nn": "Norwegian Nynorsk",
        "no": "Norwegian",
        "oc": "Occitan",
        "or": "Oriya",
        "os": "Ossetian",
        "pam": "Pampanga",
        "pa": "Panjabi",
        "pl": "Polish",
        "pms": "Piemontese",
        "pnb": "Western Panjabi",
        "ps": "Pushto",
        "pt": "Portuguese",
        "qu": "Quechua",
        "rm": "Romansh",
        "ro": "Romanian",
        "ru": "Russian",
        "sah": "Yakut",
        "sa": "Sanskrit",
        "scn": "Sicilian",
        "sd": "Sindhi",
        "sh": "Serbo-Croatian",
        "si": "Sinhala",
        "sk": "Slovak",
        "sl": "Slovenian",
        "so": "Somali",
        "sq": "Albanian",
        "sr": "Serbian",
        "su": "Sundanese",
        "sv": "Swedish",
        "sw": "Swahili",
        "ta": "Tamil",
        "te": "Telugu",
        "tg": "Tajik",
        "th": "Thai",
        "tk": "Turkmen",
        "tl": "Tagalog",
        "tr": "Turkish",
        "tt": "Tatar",
        "tyv": "Tuvinian",
        "ug": "Uighur",
        "uk": "Ukrainian",
        "ur": "Urdu",
        "uz": "Uzbek",
        "vec": "Venetian",
        "vi": "Vietnamese",
        "vo": "Volapük",
        "war": "Waray",
        "wa": "Walloon",
        "wuu": "Wu Chinese",
        "xal": "Kalmyk",
        "xmf": "Mingrelian",
        "yi": "Yiddish",
        "yo": "Yoruba",
        "yue": "Yue Chinese",
        "zh": "Chinese",
    }
  

len(langs)

166

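The function below detects people's names and high-risk ID-type information. It returns a tuple of the processed text and a dict. When anonymization is enabled, the dict holds the anonymized text in the 'text' field and its entities in the 'ner' field, plus the original text in 'orig_text' and the original entities in 'orig_ner'. Otherwise, the dict holds only 'text' and 'ner'.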

In [None]:
#@title Define apply_anonymization
%cd /content/

from muliwai.regex_manager import detect_ner_with_regex_and_context
from muliwai.pii_regexes_rulebase import regex_rulebase
from muliwai.ner_manager import detect_ner_with_hf_model
from muliwai.faker_manager import augment_anonymize
def apply_anonymization(
    sentence: str,
    lang_id: str,
    context_window: int = 20,
    anonymize_condition=None,
    tag_type={'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE', 'PERSON'},
    device: str = "cpu",
) -> tuple:
    """
    Params:
    ==================
    sentence: str, the sentence to be anonymized
    lang_id: str, the language id of the sentence
    context_window: int, the context window size
    anonymize_condition: any truthy value enables anonymization; if falsy, entities are only detected
    tag_type: iterable, the tag types to detect. By default: {'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE', 'PERSON'}. Pass None to use every tag in regex_rulebase.
    device: cpu or cuda:{device_id}

    Returns:
    ==================
    (new_sentence, doc): the processed text and a dict with 'text' and 'ner',
    plus 'orig_text' and 'orig_ner' when anonymization is applied.
    """
    if tag_type is None:
        tag_type = regex_rulebase.keys()
    # Keep only the base language code, e.g. "zh_CN" -> "zh"
    lang_id = lang_id.split("_")[0]
    # Regex-based detection of character-level spans (IDs, keys, emails, ...)
    ner_ids = detect_ner_with_regex_and_context(
        sentence=sentence,
        src_lang=lang_id,
        context_window=context_window,
        tag_type=tag_type,
    )
    # Model-based detection of person names
    ner_persons = detect_ner_with_hf_model(
        sentence=sentence,
        src_lang=lang_id,
        device=device,
    )
    # Merge both result lists, drop exact duplicates, and order by start offset
    ner = list(set(ner_ids + ner_persons))
    ner.sort(key=lambda a: a[1])
    if anonymize_condition:
        new_sentence, new_ner, _ = augment_anonymize(sentence, lang_id, ner)
        doc = {'text': new_sentence, 'ner': new_ner, 'orig_text': sentence, 'orig_ner': ner}
    else:
        new_sentence = sentence
        doc = {'text': new_sentence, 'ner': ner}
    return new_sentence, doc


/content
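
As a rough illustration of the label-substitution idea described at the top of this notebook, the sketch below replaces detected spans with ` <LABEL> ` placeholders. It assumes entities are non-overlapping `(text, start, end, label)` tuples, as in the outputs in this notebook. `replace_spans_with_labels` is a hypothetical helper and does not reproduce what `augment_anonymize` does internally.

```python
def replace_spans_with_labels(text, ner):
    """Replace each (span_text, start, end, label) span with ' <LABEL> '.

    Assumes the spans do not overlap.
    """
    out = text
    # Work from right to left so that earlier offsets stay valid after each edit.
    for _span_text, start, end, label in sorted(ner, key=lambda s: s[1], reverse=True):
        out = out[:start] + f" <{label}> " + out[end:]
    return out

example = "Call 5551234567 or mail a@b.co"
spans = [("5551234567", 5, 15, "PHONE"), ("a@b.co", 24, 30, "EMAIL")]
anonymized = replace_spans_with_labels(example, spans)
print(anonymized)
```

Processing spans from the end of the string backwards avoids having to recompute offsets after each replacement.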


In [None]:
s = '现场考察并听取了相关介绍后，55651-3314市政协委员、首都体育学院校长钟秉枢从专业角度，点出了国内冰雪运动在人才方面的短板。2018-11-13，是毛主席题写报头的《湖南日报》创刊67周年报庆日，同时也是脱胎于毛主席所题名《新湖南报》的移动新闻客户端“新湖南”上线一周年纪念日。作为中国共产党、中国人民解放军和中华人民共和国的主要缔造者和领导人，毛泽东同时还是一名杰出的书法家。他博采众长，独创出淋漓奔放、纵横驰骋、笔墨潇洒的“毛体”。1979年12月，随着湖南省第五届人民代表大会第二次会议胜利召开，省人民代表大会告别了没有常设机构的历史，由此开启了我省人大建设的崭新篇章，成为民主政治建设进程中的里程碑。在历史前进的逻辑中前进，在时代发展的潮流中发展。黄兴(2018-11-13-2018-11-13)'

apply_anonymization(s, 'zh', anonymize_condition=True)

('现场考察并听取了相关介绍后， <ID> 市政协委员、首都体育学院校长钟秉枢从专业角度，点出了国内冰雪运动在人才方面的短板。 <ID> ，是毛主席题写报头的《湖南日报》创刊67周年报庆日，同时也是脱胎于毛主席所题名《新湖南报》的移动新闻客户端“新湖南”上线一周年纪念日。作为中国共产党、中国人民解放军和中华人民共和国的主要缔造者和领导人，毛泽东同时还是一名杰出的书法家。他博采众长，独创出淋漓奔放、纵横驰骋、笔墨潇洒的“毛体”。1979年12月，随着湖南省第五届人民代表大会第二次会议胜利召开，省人民代表大会告别了没有常设机构的历史，由此开启了我省人大建设的崭新篇章，成为民主政治建设进程中的里程碑。在历史前进的逻辑中前进，在时代发展的潮流中发展。黄兴( <ID> 1-13)',
 {'ner': [(' <ID> ', 14, 20, 'ID'),
   (' <ID> ', 61, 67, 'ID'),
   (' <ID> ', 326, 332, 'ID')],
  'orig_ner': [('55651-3314', 14, 24, 'ID'),
   ('钟秉枢', 38, 41, 'PERSON'),
   ('2018-11-13', 65, 75, 'ID'),
   ('毛', 78, 79, 'PERSON'),
   ('毛', 109, 110, 'PERSON'),
   ('毛泽东', 178, 181, 'PERSON'),
   ('2018-11-13-2018-1', 334, 351, 'ID'),
   ('2018-11-13', 345, 355, 'ID')],
  'orig_text': '现场考察并听取了相关介绍后，55651-3314市政协委员、首都体育学院校长钟秉枢从专业角度，点出了国内冰雪运动在人才方面的短板。2018-11-13，是毛主席题写报头的《湖南日报》创刊67周年报庆日，同时也是脱胎于毛主席所题名《新湖南报》的移动新闻客户端“新湖南”上线一周年纪念日。作为中国共产党、中国人民解放军和中华人民共和国的主要缔造者和领导人，毛泽东同时还是一名杰出的书法家。他博采众长，独创出淋漓奔放、纵横驰骋、笔墨潇洒的“毛体”。1979年12月，随着湖南省第五届人民代表大会第二次会议胜利召开，省人民代表大会告别了没有常设机
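
The `orig_ner` list above contains two ID spans that overlap, (334, 351) and (345, 355). The sketch below shows one possible way to resolve such overlaps before replacement, by keeping the longest span in each overlapping group. It is an illustration under that assumption only; `drop_overlapping_spans` is a hypothetical helper and not part of muliwai.

```python
def drop_overlapping_spans(ner):
    """Keep the longest span among overlapping (text, start, end, label) tuples."""
    kept = []
    # Visit longer spans first; ties are broken by start offset.
    for span in sorted(ner, key=lambda s: (-(s[2] - s[1]), s[1])):
        _, start, end, _ = span
        # Accept the span only if it does not intersect any span already kept.
        if all(end <= k[1] or start >= k[2] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s[1])

sample = [
    ('2018-11-13-2018-1', 334, 351, 'ID'),
    ('2018-11-13', 345, 355, 'ID'),
    ('毛泽东', 178, 181, 'PERSON'),
]
resolved = drop_overlapping_spans(sample)
print(resolved)
```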

In [None]:
s = '3455 3471 1. Dao Van Dung, Nguyen Thi Nga (2010). Nonlinear stability analysis of imperfect functionally graded plates, with the Poisson’s ratio also varying through the thickness, subjected to mechanical and thermal loads. Proceedings of the tenth National Conference on Deformable Solid Mechanics, Thai Nguyen, pp. 130 – 141. 2.	Dao Van Dung, Nguyen Thi Nga. On the nonlinear post-buckling behavior of imperfect functionally graded cylindrical panels taking into account the thickness dependent Poisson’s ratio. Tuyển tập Báo cáo hội nghị Cơ học toàn quốc lần thứ IX, 8-9/12/2012. 3.	Dao Huy Bich, Dao Van Dung, Nguyen Thi Nga. Nonlinear buckling and post-buckling of imperfect eccentrically stiffened functionally graded plates based on the first order shear deformation plate theory. Hội nghị Khoa học toàn quốc Cơ học Vật rắn biến dạng lần thứ XI, Thành phố Hồ Chí Minh, 7-9/11/2013, pp. 111-121. 4.	Dao Van Dung, Nguyen Thi Nga. Nonlinear buckling and post-buckling of eccentrically stiffened functionally graded cylindrical shells surrounded by an elastic medium based on the first order shear deformation theory. Vietnam Journal of Mechanics, VAST, Vol. 35, No. 4 (2013), pp. 285-298. 5.	Dao Van Dung, Le Kha Hoa, Nguyen Thi Nga, Le Thi Ngoc Anh. Instability of eccentrically stiffened functionally graded truncated conical shells under mechanical loads. Composite Structures 106 (2013), pp. 104–113. 6.	Dao Van Dung, Le Kha Hoa, Nguyen Thi Nga. On the stability of functionally graded truncated conical shells reinforced by functionally graded stiffeners and surrounded by an elastic medium. Composite Structures, Volume 108, February 2014, pp. 77–90. 8.	Dao Van Dung, Nguyen Thi Nga. Nonlinear analysis of stability for imperfect eccentrically stiffened FGM plates under mechanical and thermal loads based on FSDT. Part 2: Numerical results and discussions. Vietnam Journal of Mechanics, VAST, Vol. 37, No. 4 (2015), pp. 251– 262, DOI:10.15625/0866-7136/37/4/5885. 9. 
Nguyen Thi Nga, Dao Van Dung (2015), “On the stability of FGM cylindrical shell reinforced by FGM stiffeners and filled by an elastic medium based on FSDT in thermal environment”, Hội nghị Khoa học toàn quốc Cơ học Vật rắn biến dạng lần thứ XII, Đại học Duy Tân, TP Đà Nẵng, 7/8/2015, pp. 1000-1007. 10.	Dao Van Dung, Nguyen Thi Nga (2016) Buckling and postbuckling nonlinear analysis of imperfect FGM plates reinforced by FGM stiffeners with temperature-dependent properties based on TSDT. Acta Mechanica. Vol. 227(8), pp. 2377-2401, DOI 10.1007/s00707-016-1637-y. 11.	D.V. Dung, L.K. Hoa, B.T. Thuyet, N.T. Nga (2016). Buckling analysis of functionally graded material (FGM) sandwich truncated conical shells reinforced by FGM stiffeners filled inside by elastic foundations. Applied Mathematics and Mechanics (English Edition), Vol. 37(7), pp. 879-902, DOI: 10.1007/s10483-016-2097-9. 13.	Dao Van Dung, Nguyen Thi Nga, Le Kha Hoa (2017). Nonlinear stability of functionally graded material (FGM) sandwich cylindrical shells reinforced by FGM stiffeners in thermal environment. Applied Mathematics and Mechanics (English Edition), Volume 38, Issue 5, pp 647–670. 14. Dao Van Dung, Nguyen Thi Nga, Pham Minh Vuong (2017). Nonlinear stability analysis of stiffened functionally graded material sandwich cylindrical shells with general Sigmoid law and power law in thermal environment using third-order shear deformation theory. Journal of Sandwich Structures and Materials. DOI: 10.1177/1099636217704863. First Published April 18, 2017.'

apply_anonymization(s, 'en', anonymize_condition=True)

('<PHONE> . Nathan Black , Sheila Bailey  Greenwood (2010). Nonlinear stability analysis of imperfect functionally graded plates, with the Poisson’s ratio also varying through the thickness, subjected to mechanical and thermal loads. Proceedings of the tenth National Conference on Deformable Solid Mechanics, Thai Nguyen, pp. 130 – 141. 2.\t Nathan Black , Sheila Bailey  Greenwood . On the nonlinear post-buckling behavior of imperfect functionally graded cylindrical panels taking into account the thickness dependent Poisson’s ratio. Tuyển tập Báo cáo hội nghị Cơ học toàn quốc lần thứ IX, <PHONE> .\tDao Huy Bich, Nathan Black , Sheila Bailey  Greenwood . Nonlinear buckling and post-buckling of imperfect eccentrically stiffened functionally graded plates based on the first order shear deformation plate theory. Hội nghị Khoa học toàn quốc Cơ học Vật rắn biến dạng lần thứ XI, Thành phố Hồ Chí Minh, 7-9/11/2013, pp. 111-121. 4.\t Nathan Black , Sheila Bailey  Greenwood . Nonlinear buckling a

In [None]:
s = """Physical therapy involving newborns and young infants is a specialized area of practice reserved for therapists who have advanced training and the competence to help newborns, young infants and their families meet their goals. Beginning at birth, infants apply a significant amount of effort to actively participate in and shape their world. Infants make their intentions and requests for support known through their behaviors during social and physical therapy encounters. The therapeutic encounter viewed from the infant’s perspective has received limited attention in the physical therapy literature. The purpose of this article is to discuss concepts related to phenomenology and synactive theory that are relevant to physical therapy with newborns and young infants during the first few months of life after birth. Blanchard, Y., & Øberg G. K. (2015). Physical therapy with newborns: Applying concepts of phenomenology and synactive theory to guide interventions. Physiotherapy Theory and Practice, 31(6), 377-381. doi: 10.3109/09593985.2015.1010243."""

apply_anonymization(s, 'en', anonymize_condition=True)


('Physical therapy involving newborns and young infants is a specialized area of practice reserved for therapists who have advanced training and the competence to help newborns, young infants and their families meet their goals. Beginning at birth, infants apply a significant amount of effort to actively participate in and shape their world. Infants make their intentions and requests for support known through their behaviors during social and physical therapy encounters. The therapeutic encounter viewed from the infant’s perspective has received limited attention in the physical therapy literature. The purpose of this article is to discuss concepts related to phenomenology and synactive theory that are relevant to physical therapy with newborns and young infants during the first few months of life after birth. Brandon , Y., & Shelia Washington K. (2015). Physical therapy with newborns: Applying concepts of phenomenology and synactive theory to guide interventions. Physiotherapy Theory 

# Regexes

In [None]:
#@title Get the regexes

from muliwai.pii_regexes import detect_ner_with_regex_and_context
from muliwai.pii_regexes_rulebase import regex_rulebase
import regex, re, copy
import pandas


print("Here are the regexes we are using:")
regex_rulebase


Here are the regexes we are using:


{'ADDRESS_EXP': {'ar': [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}',
    re.IGNORECASE|re.UNICODE),
    ['نهج',
     'شارع',
     'طريق',
     'جادة',
     'حارة',
     'ساحة',
     'ميدان',
     'الطريق',
     'السيار',
     'الشارع',
     'الطريق الدائري'],
    None)],
  'ca': [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}',
    re.IGNORECASE|re.UNICODE),
    ['avd',
     'cra',
     'cró',
     'cro',
     'eix',
     'pça',
     'pca',
     'plç',
     'plc',
     'rda',
     'trv',
     'via',
     'auto',
     'avda',
     'camí',
     'cami',
     'carr',
     'carr',
     'ctra',
     'cint',
     'diag',
     'drec',
     'entr',
     'pdís',
     'pdis',
     'ptge',
     'ptal',
     'rbla',
     'rtda',
     'sort',
     'trav',
     'trav',
     'autop',
     'autov',
     'avgda',
     'carra',
     'carro',
     'p
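
The rulebase printed above is organized as tag, then language, then a list of `(compiled_pattern, context_words, extra)` tuples. The sketch below shows one plausible reading of how such a rule might be applied: a regex match is kept only if one of the context words occurs within `context_window` characters of it. This is an assumption made for illustration; `mini_rulebase` and `match_with_context` are hypothetical, and the real logic lives in `detect_ner_with_regex_and_context`.

```python
import re

# Hypothetical mini rulebase in the same tag -> language -> [(pattern, context, extra)] shape.
mini_rulebase = {
    "ADDRESS_EXP": {
        "en": [(re.compile(r"\d{1,5} [A-Za-z ]{3,20}"), ["street", "avenue", "road"], None)],
    },
}

def match_with_context(text, lang, rulebase, context_window=20):
    """Return (span_text, start, end, tag) for matches that have a context word nearby."""
    found = []
    for tag, by_lang in rulebase.items():
        # Fall back to a "default" entry when the language has no rules of its own.
        for pattern, context_words, _extra in by_lang.get(lang, by_lang.get("default", [])):
            for m in pattern.finditer(text):
                lo = max(0, m.start() - context_window)
                window = text[lo:m.end() + context_window].lower()
                # Rules without context words accept every match.
                if context_words is None or any(w in window for w in context_words):
                    found.append((m.group(), m.start(), m.end(), tag))
    return found

hits = match_with_context("Send it to 42 Baker Street today", "en", mini_rulebase)
misses = match_with_context("Order 42 blue widgets", "en", mini_rulebase)
print(hits, misses)
```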

In [None]:
#@title Some Old Regexes for Comparison
import re, regex
privacy_group_regex_rulebase_old = {

    "AGE": {
      #TODO - finish out "years old" in other languages.
      "en": [
          (
              re.compile(
                  r"\S+ years old|\S+\-years\-old|\S+ year old|\S+\-year\-old", re.IGNORECASE
              ),
              None, None
          )
      ],
       "zh": [(regex.compile(r"([一二三四五六七八九十百\d]{1,3}歲|[一二三四五六七八九十百\d]{1,3}岁)"), None, None)],
    },
    "DATE": {
        #TODO - separate all the languages out. Do pt, fr, es
        "id": [(re.compile(r'\d{4}|[0-3]?\d[-\./][0-3]?\d[-\./]\d{2,4}'), None, [('lahir', 'AGE'),])], 
        "default": [(re.compile(r'\d{4}|[0-3]?\d[-\./][0-3]?\d[-\./]\d{2,4}'), None, [('born', 'AGE'), ("ni a bi lori",'AGE'), ("wazalwa ngo",'AGE'), ("akazvarwa",'AGE'), ("o hlahile ka",'AGE'), ("anabadwa pa",'AGE'), ("wazalwa ngo",'AGE'), ("alizaliwa tarehe",'AGE'), ("amụrụ",'AGE'), ("ولد",'AGE'), ("生於",'AGE'), ("sinh ra",'AGE'), ("का जन्म ए",'AGE'), ("پیدا ہوا",'AGE'), ('lahir', 'AGE'),  ('জন্ম', 'AGE')])],
    },
    #https://github.com/madisonmay/CommonRegex/blob/master/commonregex.py. Low to no PII
    "TIME": {
      "default": [(re.compile(r'\d{1,2}:\d{2} ?(?:[ap]\.?m\.?)?|\d[ap]\.?m\.?', re.IGNORECASE), None, None),],
    },
    #if we want to match embedded PII within URLs
    "URL": {
      "default": [(re.compile(r'https?:\/\/[^\s\"\']{8,50}|www[^\s\"\']{8,50}', re.IGNORECASE), None, None)],
      "zh": [(regex.compile(r'(https?:\/\/.\P{Han}{1,}|www\.\P{Han}{1,50})', re.IGNORECASE), None, None)],
    },
    #experimental address stuff
    #from https://github.com/openvenues/libpostal and https://github.com/joke2k/faker/tree/master/faker/providers/address 
    #from https://github.com/Aggregate-Intellect/bigscience_aisc_pii_detection/blob/main/language/zh/rules.py which is under Apache 2
    #from https://github.com/madisonmay/CommonRegex/blob/master/commonregex.py
    "ADDRESS_EXP": {
        "ar": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['نهج', 'شارع', 'طريق', 'جادة', 'حارة', 'ساحة', 'ميدان', 'الطريق', 'السيار', 'الشارع', 'الطريق الدائري'], None)],
        "ca": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['avd', 'cra', 'cró', 'cro', 'eix', 'pça', 'pca', 'plç', 'plc', 'rda', 'trv', 'via', 'auto', 'avda', 'camí', 'cami', 'carr', 'carr', 'ctra', 'cint', 'diag', 'drec', 'entr', 'pdís', 'pdis', 'ptge', 'ptal', 'rbla', 'rtda', 'sort', 'trav', 'trav', 'autop', 'autov', 'avgda', 'carra', 'carro', 'plaça', 'placa', 'ronda', 'trval', 'carrer', 'portal', 'rambla', 'trvsal', 'autovia', 'carrera', 'carreró', 'carrero', 'cinturó', 'cinturo', 'drecera', 'entrada', 'passeig', 'rotonda', 'sortida', 'avinguda', 'diagonal', 'gran vía', 'gran via', 'passadís', 'passadis', 'passatge', 'autopista', 'carretera', 'travessia', 'travessera', 'transversal', 'eix diagonal'], None)],
        "en": [(re.compile(r"P\.? ?O\.? Box \d+", re.IGNORECASE), None, None),
            (re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), \
           ['st', 'sq', 'acc', 'aly', 'anx', 'app', 'arc', 'art', 'ave', 'avn', 'avs', 'aut', 'bnk', 'bsn', 'bay', 'byu', 'bch', 'blt', 'bnd', 'blk', 'blf', 'bwk', 'bde', 'blv', 'bvd', 'bot', 'btm', 'bdy', 'brk', 'bri', 'brg', 'bwy', 'brk', 'brw', 'bte', 'bps', 'byp', 'cpe', 'cyn', 'cvn', 'ctr', 'cen', 'cir', 'clt', 'cct', 'crc', 'clm', 'clf', 'cls', 'clr', 'cmn', 'con', 'cnc', 'cxn', 'cps', 'cnr', 'crn', 'cor', 'cso', 'c.h', 'c.r', 'c.r', 'crt', 'cts', 'cyd', 'cov', 'crk', 'crs', 'cst', 'crf', 'cft', 'csg', 'crd', 'xrd', 'xwy', 'xwy', 'cul', 'cds', 'cve', 'crv', 'ctg', 'dle', 'dip', 'div', 'dns', 'drv', 'dve', 'dwy', 'edg', 'elb', 'elm', 'end', 'ent', 'esp', 'est', 'exp', 'ext', 'fls', 'frm', 'fry', 'fld', 'fds', 'fit', 'flt', 'frd', 'fwy', 'gap', 'gdn', 'grd', 'gte', 'gwy', 'gld', 'gln', 'gbd', 'gra', 'grn', 'grv', 'gro', 'gly', 'hbr', 'hvn', 'hds', 'hth', 'hts', 'hrd', 'hwy', 'hls', 'hub', 'imp', 'isl', 'iss', 'ids', 'jct', 'jnc', 'jtn', 'key', 'kys', 'knl', 'lgn', 'ldg', 'lgt', 'lnk', 'ltl', 'lit', 'lkt', 'lps', 'lot', 'mnr', 'mdw', 'mdr', 'mew', 'mws', 'mls', 'mwy', 'nvs', 'nbr', 'num', 'pde', 'prd', 'prk', 'pky', 'pkw', 'pwy', 'prt', 'pth', 'pke', 'pne', 'pns', 'pla', 'plc', 'pln', 'pls', 'plt', 'plz', 'pkt', 'pnt', 'pte', 'prt', 'pvt', 'prm', 'pur', 'quy', 'qys', 'rmp', 'ran', 'rpd', 'rng', 'rch', 'res', 'rst', 'rtt', 'rtn', 'rdg', 'row', 'roa', 'rds', 'rdw', 'rdy', 'rks', 'rty', 'rnd', 'rte', 'row', 'run', 'r.r', 'swy', 'shl', 'shr', 'sdg', 'slp', 'snd', 'spc', 'spg', 'spr', 'sqr', 'sqs', 's.h', 's.r', 'srd', 's.r', 'srt', 'str', 'sts', 'smt', 'tce', 'ter', 'twy', 'top', 'tor', 't.h', 't.r', 'trd', 't.r', 'trt', 'twr', 'trc', 'trk', 'trl', 'trs', 'tri', 'tun', 'trn', 'tpk', 'ups', 'uns', 'vly', 'vws', 'vla', 'vst', 'vis', 'vue', 'wlk', 'wky', 'wys', 'wls', 'whf', 'wyn', 'yrd', 'abby', 'accs', 'acrs', 'ally', 'alee', 'alwy', 'ambl', 'ancg', 'apts', 'apch', 'appr', 'artl', 'arty', 'aven', 'avnu', 'aves', 'avns', 'back', 'bank', 'basn', 'belt', 'bend', 'blck', 
'bluf', 'blfs', 'bwlk', 'blvd', 'boul', 'bttm', 'btms', 'bowl', 'brce', 'brch', 'brae', 'bdge', 'brdg', 'bdwy', 'bway', 'brks', 'brow', 'burg', 'brgs', 'burw', 'btte', 'bypa', 'byps', 'bywy', 'camp', 'cape', 'cnyn', 'cvan', 'cswy', 'caus', 'cway', 'cetr', 'cntr', 'ctrs', 'cnwy', 'chas', 'cirs', 'crct', 'circ', 'cirt', 'crcs', 'clfs', 'clse', 'clde', 'cmmn', 'comm', 'cmns', 'cncd', 'conc', 'cntn', 'conr', 'cntr', 'cnrs', 'crns', 'cors', 'cseo', 'c.h.', 'c.hw', 'c hw', 'c.hi', 'c hi', 'c.r.', 'co.r', 'co r', 'c.rd', 'c rd', 'c.r.', 'co.r', 'co r', 'c.rt', 'c rt', 'crse', 'crts', 'ctyd', 'cove', 'cres', 'crst', 'crss', 'crsg', 'xing', 'x-rd', 'x rd', 'xrds', 'cowy', 'crwy', 'xway', 'x-wy', 'cuwy', 'crwy', 'csac', 'crve', 'curv', 'cttg', 'cutt', 'dale', 'dell', 'dene', 'devn', 'dstr', 'dwns', 'drwy', 'dvwy', 'dway', 'drov', 'esmt', 'edge', 'entr', 'espl', 'ests', 'exit', 'expy', 'exwy', 'extn', 'exts', 'fawy', 'fall', 'fare', 'farm', 'frms', 'fern', 'flds', 'flne', 'ftrk', 'fitr', 'flat', 'flts', 'folw', 'ftwy', 'ford', 'fshr', 'form', 'fmtn', 'frwy', 'fway', 'frnt', 'frtg', 'grdn', 'gdns', 'grds', 'gate', 'gtes', 'gway', 'gtwy', 'glde', 'glen', 'grbd', 'gdbd', 'g bd', 'gren', 'grwy', 'grnd', 'grve', 'glch', 'hngr', 'hrbr', 'hbrs', 'havn', 'head', 'heth', 'hgts', 'hlds', 'hird', 'hgwy', 'hway', 'hwye', 'hywy', 'hill', 'hils', 'hllw', 'holw', 'inlt', 'intg', 'intn', 'isld', 'isle', 'jnct', 'jctn', 'jcts', 'keys', 'knob', 'knol', 'knls', 'ladr', 'lagn', 'land', 'lndg', 'lane', 'lnwy', 'lees', 'lmts', 'line', 'link', 'lttl', 'litl', 'loaf', 'loop', 'lynn', 'mall', 'maze', 'mdws', 'mead', 'mead', 'mndr', 'mews', 'mile', 'mill', 'moor', 'mway', 'mtwy', 'nook', 'nmbr', 'oaks', 'otlt', 'otlk', 'oval', 'ovrb', 'ovlk', 'opas', 'padk', 'plms', 'prde', 'pard', 'park', 'pkld', 'pkwy', 'prkw', 'part', 'pass', 'psge', 'pass', 'pasg', 'path', 'phwy', 'pway', 'ptwy', 'psla', 'piaz', 'pzza', 'pier', 'pike', 'pine', 'pnes', 'pond', 'plac', 'plns', 'plat', 'plza', 'pokt', 'pckt', 'pnte', 
'port', 'prts', 'prrs', 'prom', 'quad', 'qdgl', 'qdrt', 'quay', 'quys', 'radl', 'rmbl', 'ramp', 'rnch', 'rpds', 'rnge', 'rang', 'reef', 'resv', 'rsrv', 'rest', 'ride', 'rdge', 'rdgs', 'rgwy', 'rowy', 'rofw', 'ring', 'rise', 'rvwy', 'rvra', 'road', 'raod', 'rdsd', 'rdwy', 'rnde', 'rsbl', 'svrd', 'svwy', 'shls', 'shor', 'shrs', 'shun', 'shnt', 'sdng', 'skwy', 'slip', 'slpe', 'sprn', 'spgs', 'spns', 'spur', 'strs', 'stwy', 's.h.', 'st.h', 'st h', 's.hw', 's hw', 'shwy', 's.hi', 's hi', 's.r.', 's.rd', 's rd', 'strd', 's.r.', 's.rt', 's rt', 'srte', 'strt', 'stps', 'stra', 'strd', 'stra', 'stre', 'strt', 'strp', 'sbwy', 'sumt', 'tarn', 'terr', 'tsse', 'thor', 'thfr', 'thwy', 'thro', 'trwy', 'thwy', 'tlwy', 't.h.', 't.hw', 't hw', 't.hi', 't hi', 't.r.', 't rd', 't.rd', 'twpr', 't.r.', 't rt', 't.rt', 'twpr', 'twrs', 'trce', 'trak', 'trfy', 'trlr', 'tram', 'tmwy', 'tkwy', 'tunl', 'turn', 'tpke', 'upas', 'vale', 'vlly', 'vlys', 'viad', 'vdct', 'view', 'vlge', 'vlas', 'vsta', 'wade', 'walk', 'wkwy', 'wtrs', 'ways', 'whrf', 'wynd', 'yard', 'abbey', 'acres', 'alley', 'allwy', 'amble', 'annex', 'avenu', 'avnue', 'avens', 'avnus', 'basin', 'bayou', 'bayoo', 'beach', 'baech', 'beech', 'block', 'bluff', 'blvde', 'blvrd', 'boulv', 'bottm', 'bttms', 'brace', 'brnch', 'break', 'brook', 'burgs', 'butte', 'byway', 'c van', 'csway', 'chase', 'circt', 'claim', 'cliff', 'close', 'clstr', 'clnde', 'cmmns', 'comms', 'cncrd', 'concs', 'cnctr', 'copse', 'corso', 'co.hw', 'co hw', 'c.hwy', 'c hwy', 'co.hi', 'co hi', 'co.rd', 'co rd', 'cty.r', 'cty r', 'co.rt', 'co rt', 'cty.r', 'cty r', 'c.rte', 'c rte', 'court', 'creek', 'crest', 'crief', 'croft', 'cross', 'x ing', 'x-ing', 'xroad', 'x-way', 'x way', 'cusac', 'curve', 'downs', 'drive', 'drvwy', 'drove', 'elbow', 'expwy', 'exten', 'falls', 'farms', 'ferry', 'field', 'fline', 'flats', 'front', 'grdns', 'gates', 'gtway', 'glade', 'grdbd', 'gr bd', 'gd bd', 'g bde', 'g bvd', 'g bld', 'green', 'grnds', 'grove', 'gulch', 'gully', 'haven', 
'heads', 'heath', 'hghts', 'hgths', 'hglds', 'hi.rd', 'hi rd', 'hills', 'inlet', 'islds', 'junct', 'knoll', 'lagon', 'light', 'littl', 'loops', 'lynne', 'manor', 'mills', 'mount', 'palms', 'pklds', 'pkway', 'prkwy', 'pkwys', 'pthwy', 'ptway', 'pines', 'place', 'plain', 'plaza', 'point', 'piont', 'ports', 'quays', 'ranae', 'ranch', 'rapid', 'range', 'reach', 'resrv', 'rserv', 'rsrve', 'ridge', 'rdgwy', 'roads', 'raods', 'rocks', 'ronde', 'round', 'route', 'sv rd', 'svcwy', 'shoal', 'shore', 'shors', 'shunt', 'slope', 'sound', 'space', 'sprng', 'strwy', 'st.wy', 'st wy', 'st.hw', 'st hw', 's.hwy', 's hwy', 's.hwy', 's hwy', 'st.hi', 'st hi', 'st.rd', 'st rd', 's.rte', 's rte', 'strte', 'st.rt', 'st rt', 'steps', 'strnd', 'strds', 'strav', 'stree', 'strip', 'thick', 'twp.h', 'twp h', 't.hwy', 't hwy', 'twp.r', 'twp r', 'tp rd', 't.rte', 't rte', 'twp.r', 'twp r', 'tower', 'tline', 'trace', 'track', 'trail', 'trees', 'upass', 'union', 'vllys', 'views', 'villa', 'vista', 'wlkwy', 'wells', 'wharf', 'access', 'arcade', 'artery', 'avenue', 'avenus', 'avnues', 'bluffs', 'bottom', 'bottms', 'branch', 'bridge', 'brdway', 'brooks', 'burrow', 'bypass', 'canyon', 'center', 'centre', 'circle', 'circel', 'circus', 'cliffs', 'common', 'concse', 'corner', 'corseo', 'cty.hw', 'cty hw', 'c.hgwy', 'c hgwy', 'c.hway', 'c hway', 'co.hwy', 'co hwy', 'cty.hi', 'cty hi', 'cty.rd', 'cty rd', 'cty.rt', 'cty rt', 'co.rte', 'co rte', 'courts', 'x-road', 'x road', 'divide', 'divers', 'estate', 'expway', 'fields', 'follow', 'garden', 'grd bd', 'g blvd', 'gr bde', 'gd bde', 'g boul', 'gr bvd', 'gd bvd', 'gr bld', 'gd bld', 'grange', 'ground', 'hanger', 'harbor', 'hghlds', 'hollow', 'intchg', 'island', 'knolls', 'ladder', 'lagoon', 'landng', 'limits', 'meadow', 'neaves', 'number', 'outlet', 'parade', 'parkwy', 'prkway', 'pthway', 'piazza', 'plains', 'prarie', 'pocket', 'pointe', 'priors', 'radial', 'ramble', 'rapids', 'rge rd', 'rserve', 'return', 'ridges', 'r of w', 'rotary', 'svc rd', 'shoals', 
'shores', 'siding', 'skyway', 'spring', 'sprngs', 'square', 'stairs', 's.hgwy', 's hgwy', 's.hway', 's hway', 'st.hwy', 'st hwy', 's.road', 's road', 'st.rte', 'st rte', 'strand', 'strnds', 'street', 'subdiv', 'subway', 'summit', 'terace', 'terrac', 'tshp.h', 'tshp h', 'twp.hw', 'twp hw', 't.hgwy', 't hgwy', 't.hway', 't hway', 'twp.hi', 'twp hi', 'twp.rd', 'twp rd', 'tshp.r', 'tshp r', 'twp.rt', 'twp rt', 'tshp.r', 'tshp r', 'towers', 'trktrl', 'tunnel', 'trnabt', 'unions', 'us hwy', 'us rte', 'valley', 'vennel', 'viadct', 'villas', 'waters', 'allyway', 'avenues', 'bottoms', 'caravan', 'causewy', 'centers', 'circles', 'circlet', 'circuit', 'cluster', 'commons', 'concord', 'corners', 'co.hgwy', 'co hgwy', 'co.hway', 'co hway', 'cty.hwy', 'cty hwy', 'cty.rte', 'cty rte', 'crecent', 'cutting', 'estates', 'fairway', 'footway', 'freeway', 'gardens', 'gateway', 'gr blvd', 'gd blvd', 'grd bde', 'g blvrd', 'gr boul', 'gd boul', 'grd bvd', 'grd bld', 'grounds', 'harbour', 'harbors', 'heights', 'hieghts', 'highway', 'impasse', 'intsctn', 'islands', 'landing', 'laneway', 'lookout', 'meadows', 'meander', 'outlook', 'paddock', 'parkway', 'passage', 'pathway', 'plateau', 'prairie', 'private', 'pursuit', 'reserve', 'retreat', 'riviera', 'roadway', 'springs', 'squares', 'str.way', 'str way', 'st.hgwy', 'st hgwy', 'st.hway', 'st hway', 'st.road', 'st road', 'staterd', 's.route', 's route', 'statert', 'strands', 'streets', 'terrace', 'thicket', 'thruway', 'tollway', 'tshp.hw', 'tshp hw', 'twp.hwy', 'twp hwy', 'tshp.hi', 'tshp hi', 'tshp.rd', 'tshp rd', 'twp.rte', 'twp rte', 'tshp.rt', 'tshp rt', 'trailer', 'tramway', 'u s hwy', 'u s rte', 'valleys', 'viaduct', 'village', 'walkway', 'alleyway', 'approach', 'arterial', 'boundary', 'broadway', 'causeway', 'cty.hgwy', 'cty hgwy', 'cty.hway', 'cty hway', 'crescent', 'crossing', 'crossway', 'culdesac', 'driveway', 'easement', 'entrance', 'fireline', 'foot way', 'frontage', 'grd blvd', 'gr blvrd', 'gd blvrd', 'grd boul', 'greenway', 
'highroad', 'junction', 'look out', 'motorway', 'overlook', 'overpass', 'parkland', 'parkways', 'quadrant', 'range rd', 'ridgeway', 'riverway', 'roadside', 'rosebowl', 'svc road', 'stairway', 'state rd', 'st.route', 'st route', 'state rt', 'terrasse', 'twp.hgwy', 'twp hgwy', 'twp.hway', 'twp hway', 'tshp.hwy', 'tshp hwy', 'tshp.rte', 'tshp rte', 'townline', 'triangle', 'trunkway', 'turnpike', 'us route', 'anchorage', 'autoroute', 'boardwalk', 'boulevard', 'boulavard', 'centreway', 'colonnade', 'concourse', 'connector', 'courtyard', 'crossroad', 'cruiseway', 'deviation', 'diversion', 'esplanade', 'extension', 'fire line', 'firetrack', 'firetrail', 'foreshore', 'formation', 'grd blvrd', 'highlands', 'high road', 'junctions', 'over look', 'over pass', 'parklands', 'park land', 'peninsula', 'promenade', 'ridge way', 'road side', 'rose bowl', 'stair way', 's.highway', 's highway', 'stateroad', 'state rte', 'stravenue', 'tshp.hgwy', 'tshp hgwy', 'tshp.hway', 'tshp hway', 'turnabout', 'underpass', 'apartments', 'board walk', 'boulevarde', 'concession', 'connection', 'cross road', 'crossroads', 'cul de sac', 'cul-de-sac', 'expressway', 'extensions', 'fire track', 'fire trail', 'fore shore', 'interstate', 'overbridge', 'park lands', 'quadrangle', 'range road', 'rightofway', 'roundabout', 'service rd', 'serviceway', 'st.highway', 'st highway', 'state road', 'stateroute', 'throughway', 'trafficway', 'under pass', 'us highway', 'appartments', 'county road', 'cross roads', 'distributor', 'interchange', 'inter state', 'over bridge', 'rural route', 'state route', 'subdivision', 'throughfare', 'thoroughway', 'township rd', 'truck trail', 'county route', 'inter change', 'intersection', 'right of way', 'service road', 'statehighway', 'thoroughfare', 'inter section', 'state highway', 'thorough fare', 'township road', 'county highway', 'township route', 'grand boulevard', 'township highway', 'county touring route'], None)],
        "es": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['aut', 'avd', 'cll', 'bda', 'bvd', 'blv',  'cno', 'cmo', 'c.h', 'c.n', 'c.v', 'cmt', 'cra', 'kra', 'cda', 'cer', 'cto', 'crv', 'ext', 'gta', 'hda', 'pqe', 'pzo', 'psj', 'pso', 'pas', 'p.o', 'p.º', 'p.°', 'pza', 'p.r', 'p.i', 'ret', 'rin', 'rda', 'rta', 'ver', 'via', 'vst', 'alam', 'auto', 'avda', 'av /', 'blvd', 'blev', 'call', 'cjón', 'cjon', 'cjla', 'calz', 'cmno', 'c.h.', 'c.n.', 'c.v.', 'cant', 'carr', 'ctra', 'cint', 'cirv', 'diag', 'gale', 'pant', 'pque', 'parq', 'ptda', 'pseo', 'paso', 'peat', 'plza', 'p.za', 'pzta', 'plta', 'pbdo', 'p.r.', 'p.i.', 'prol', 'pbla', 'pblo', 'pnte', 'rbla', 'rpla', 'rcon', 'rncn', 'rcda', 'rtda', 'ruta', 'sect', 'send', 'tras', 'trva', 'vcto', 'vsta', 'vist', 'acces', 'alque', 'andad', 'angta', 'apdro', 'autop', 'autov', 'av cl', 'av cr', 'bjada', 'banda', 'branc', 'bqllo', 'barda', 'brzal', 'bulev', 'calle', 'cllja', 'cllon', 'cllón', 'cllzo', 'czada', 'campg', 'cantr', 'carra', 'ctrin', 'crtil', 'crril', 'ccvcn', 'crrdo', 'cstan', 'custa', 'disem', 'eslda', 'estda', 'expla', 'extrm', 'ldera', 'llnra', 'malec', 'mrdor', 'meull', 'praje', 'parti', 'psaje', 'paseo', 'psmar', 'psllo', 'perif', 'plaza', 'plzta', 'plzla', 'pgres', 'pgind', 'rampa', 'rncon', 'rcnda', 'ronda', 'sedra', 'sedro', 'sbida', 'trans', 'trval', 'vreda', 'vista', 'acceso', 'av cra', 'bajada', 'brazal', 'callej', 'c priv', 'camino', 'cantón', 'canton', 'carril', 'cuesta', 'ladera', 'lderas', 'muelle', 'paraje', 'parque', 'pasaje', 'ps mar', 'pg res', 'pg ind', 'puebla', 'pueblo', 'puente', 'rambla', 'rampla', 'rincón', 'rincon', 'sector', 'subida', 'trvsal', 'trvsía', 'trvsia', 'vereda', 'alameda', 'andador', 'angosta', 'autovía', 'autovia', 'avenida', 'avda cr', 'bulevar', 'calleja', 'cl priv', 'callizo', 'calzada', 'camping', 'cantera', 'carrera', 'cerrada', 'costera', 'espalda', 'estrada', 
'galería', 'galeria', 'laderas', 'llanura', 'malecón', 'malecon', 'mirador', 'pantano', 'partida', 'pasillo', 'poblado', 'pol.res', 'pol res', 'pol.ind', 'pol ind', 'retorno', 'rotonda', 'sendera', 'sendero', 'trasera', 'alquería', 'alqueria', 'apeadero', 'av calle', 'avda cra', 'barranco', 'barriada', 'callejón', 'callejon', 'cll priv', 'c / priv', 'caminito', 'carretil', 'cinturón', 'cinturon', 'circular', 'circuito', 'corredor', 'diagonal', 'glorieta', 'gran vía', 'gran via', 'hacienda', 'pasadizo', 'peatonal', 'plazuela', 'tránsito', 'transito', 'travesía', 'travesia', 'viaducto', 'autopista', 'boulevard', 'carretera', 'explanada', 'extensión', 'extension', 'plazoleta', 'políg res', 'polig res', 'políg ind', 'polig ind', 'rinconada', 'calle priv', 'callejuela', 'carreterín', 'carreterin', 'costanilla', 'diseminado', 'extramuros', 'particular', 'periferico', 'circunvalar', 'transversal', 'barranquillo', 'camino hondo', 'camino nuevo', 'camino viejo', 'prolongación', 'prolongacion', 'avenida calle', 'calle privada', 'circunvalación', 'circunvalacion', 'paseo maritimo', 'avenida carrera', 'polígono industrial', 'poligono industrial', 'polígono residencial', 'poligono residencial'], None)],
        "eu": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['err', 'pas', 'zum', 'bide', 'kale', 'zehb', 'bidea', 'etorb', 'kalea', 'plaza', 'bidexka', 'karrika', 'zumardia', 'autobidea', 'autopista', 'errepidea', 'etorbidea', 'hiribidea', 'ibilbidea', 'enparantza', 'korridorea', 'pasealekua', 'zeharbidea'], None)],
        "fr": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['all', 'ach', 'art', 'arc', 'aut', 'ave', 'avn', 'avs', 'bre', 'bch', 'ber', 'bde', 'blv', 'bvd', 'bld', 'but', 'cau', 'car', 'cav', 'chl', 'chs', 'che', 'chv', 'cht', 'col', 'ctr', 'cor', 'crs', 'deg', 'dsg', 'dig', 'éch', 'ecl', 'écl', 'env', 'enc', 'esp', 'fos', 'fos', 'gal', 'gbd', 'gch', 'gdr', 'grs', 'gri', 'ham', 'hch', 'imp', 'jte', 'lve', 'mte', 'mét', 'met', 'nte', 'prc', 'prv', 'pas', 'psg', 'p.n', 'ple', 'pat', 'pta', 'pae', 'pim', 'prt', 'ptr', 'pln', 'plt', 'pte', 'prq', 'rac', 'rpe', 'rmp', 'rem', 'rpt', 'rtd', 'rte', 'rts', 'r.n', 'rue', 'rle', 'res', 'sen', 'trn', 'tpl', 'trt', 'tra', 'val', 'val', 'ven', 'via', 'vte', 'vrt', 'vch', 'voi', 'alls', 'a ch', 'arts', 'anse', 'aven', 'avnu', 'aves', 'avns', 'bres', 'b ch', 'bers', 'bois', 'bcle', 'blvd', 'boul', 'côte', 'cote', 'cale', 'camp', 'cgne', 'carf', 'care', 'carr', 'chee', 'chss', 'chev', 'ch v', 'chem', 'ches', 'chsv', 'cloi', 'clos', 'cors', 'cour', 'degs', 'dsgs', 'digs', 'ecls', 'écls', 'espa', 'esps', 'foss', 'foyr', 'gals', 'garn', 'grbd', 'gdbd', 'g bd', 'grch', 'gdch', 'g ch', 'gden', 'g en', 'grdr', 'gr r', 'gd r', 'g rs', 'grim', 'h ch', 'hchs', 'imps', 'jtes', 'leve', 'mtes', 'parc', 'prcs', 'p.n.', 'pass', 'ples', 'peri', 'péri', 'p ch', 'pt a', 'p im', 'p rt', 'pt r', 'ptas', 'plci', 'plag', 'plan', 'plat', 'pltx', 'pnte', 'pont', 'porq', 'pour', 'prql', 'prom', 'peri', 'quai', 'racc', 'raid', 'rmpe', 'rang', 'remp', 'rocd', 'rnde', 'rdpt', 'roqt', 'rtnd', 'rtde', 'rtes', 'r.n.', 'rles', 'rues', 'ress', 'sens', 'sent', 'terr', 'tsse', 't pl', 'trpl', 'trts', 'trvs', 'vens', 'v rt', 'vrte', 'vche', 'v ch', 'voie', 'vois', 'allée', 'allee', 'avenu', 'avnue', 'avens', 'avnus', 'berge', 'blvde', 'blvrd', 'boulv', 'butte', 'carru', 'cares', 'carré', 'carre', 'cavée', 'cavee', 'cercl', 'champ', 'chees', 'che v', 'chs v', 
'cours', 'degré', 'degre', 'digue', 'fosse', 'foyer', 'grdbd', 'gr bd', 'gd bd', 'g bde', 'g bvd', 'g bld', 'grdch', 'gr ch', 'gd ch', 'gd en', 'gr en', 'gdens', 'g ens', 'grd r', 'g rue', 'gr rs', 'gd rs', 'gdsen', 'h chs', 'hschs', 'jetée', 'jetee', 'levée', 'levee', 'métro', 'metro', 'parcs', 'passe', 'patio', 'pt ch', 'pt ae', 'p ave', 'p avn', 'pt im', 'pt rt', 'p rte', 'p rue', 'pt as', 'place', 'plage', 'plags', 'platx', 'ponts', 'portq', 'porqs', 'rampe', 'ronde', 'rd pt', 'rtnde', 'route', 'sente', 'sents', 'tsses', 'tr pl', 'terte', 'trvrs', 'valée', 'v rte', 'v che', 'voies', 'allées', 'allees', 'arcade', 'avenue', 'avenus', 'avnues', 'berges', 'boucle', 'côteau', 'coteau', 'carref', 'cercle', 'chalet', 'chemin', 'degrés', 'degres', 'digues', 'écluse', 'ecluse', 'enclos', 'espace', 'fosses', 'grd bd', 'g blvd', 'gr bde', 'gd bde', 'g boul', 'gr bvd', 'gd bvd', 'gr bld', 'gd bld', 'grd ch', 'gd ens', 'gr ens', 'gr rue', 'gd rue', 'grd rs', 'g rues', 'gs ens', 'gdsens', 'grille', 'hameau', 'hs chs', 'jetées', 'jetees', 'montée', 'montee', 'parvis', 'passes', 'pt ave', 'pt avn', 'pt rte', 'pt rue', 'placis', 'plages', 'plaine', 'pointe', 'portqs', 'rocade', 'roquet', 'routes', 'ruelle', 'sentes', 'square', 'tertes', 'vallon', 'vallee', 'vroute', 'avenues', 'carreau', 'chemins', 'château', 'chateau', 'cloître', 'cloitre', 'contour', 'écluses', 'ecluses', 'enclave', 'galerie', 'garenne', 'gr blvd', 'gd blvd', 'grd bde', 'g blvrd', 'gr boul', 'gd boul', 'grd bvd', 'grd bld', 'grd rue', 'gr rues', 'gd rues', 'gds ens', 'grs ens', 'impasse', 'montées', 'montees', 'passage', 'p allee', 'p route', 'plateau', 'rempart', 'rotonde', 'ruelles', 'sentier', 'terrain', 'venelle', 'v route', 'barriêre', 'barriere', 'campagne', 'carrière', 'carriere', 'chaussée', 'chaussee', 'corniche', 'descente', 'galeries', 'grd blvd', 'gr blvrd', 'gd blvrd', 'grd boul', 'g chemin', 'grandrue', 'grd rues', 'grds ens', 'impasses', 'p chemin', 'pt allee', 'pt route', 'p allees', 
'plateaux', 'portique', 'pourtour', 'sentiers', 'terrasse', 'traverse', 'venelles', 'v chemin', 'autoroute', 'barriêres', 'barrieres', 'boulevard', 'boulavard', 'carrefour', 'carrières', 'carrieres', 'chaussées', 'chaussees', 'corniches', 'descentes', 'échangeur', 'esplanade', 'grd blvrd', 'gr chemin', 'gd chemin', 'grand rue', "grand'rue", 'grimpette', 'pt chemin', 'p impasse', 'pt allees', 'portiques', 'presquîle', 'presquile', 'promenade', 'raccourci', 'raidillon', 'rondpoint', 'residence', 'terrasses', 'bas chemin', 'boulevarde', 'esplanades', 'grd chemin', 'grand rues', 'passerelle', 'pt impasse', 'petite rue', 'presqu’île', "presqu'ile", 'rond point', 'residences', 'cheminement', 'haut chemin', 'passerelles', 'terre plein', 'grand chemin', 'périphérique', 'peripherique', 'petit chemin', 'petite allée', 'petite allee', 'petite route', 'peripherique', 'vieux chemin', 'ancien chemin', 'hauts chemins', 'petite avenue', 'petite allées', 'vieille route', 'ancienne route', 'chemin vicinal', 'grand ensemble', 'nouvelle route', 'petite impasse', 'petites allees', 'grand boulevard', 'route nationale', 'anciennes routes', 'chemins vicinaux', 'grands ensembles', 'passage à niveau', 'passage a niveau'], None)],
        "hi": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['मार्ग', 'बाजार', 'नगर', 'बाजार', 'सड़क', 'सड़क', 'राजमार्ग', 'marg', 'bazar', 'nagar', 'bazaar'], None)],
        "id": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['jln', 'jl.', 'gang', 'jalan', 'jalur', 'lrong', 'lorong', 'alunalun', 'jembatan', 'alun-alun', 'alun alun', 'terowongan'], None)],
        "pt": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['ava', 'ave', 'brº', 'br°', 'bro', 'bco', 'cam', 'dto', 'esq', 'lug', 'pda', 'pto', 'pça', 'pca', 'pct', 'qta', 'ret', 'rod', 'rot', 'rua', 'trv', 'urb', 'via', 'beco', 'ccnh', 'c.m.', 'estr', 'frei', 'part', 'pnto', 'pctª', 'pcta', 'proj', 'rvia', 'ruas', 'trav', 'vdto', 'vila', 'zona', 'c. m.', 'camno', 'largo', 'lugar', 'ponto', 'praça', 'praca', 'rampa', 'rdvia', 'rpart', 'ruela', 'sitio', 'tunel', 'viela', 'volta', 'acesso', 'bairro', 'estr m', 'estr n', 'estr r', 'frente', 'parada', 'prolng', 'quinta', 'r part', 'transv', 'alameda', 'avenida', 'av marg', 'calçada', 'calcada', 'caminho', 'direito', 'estrada', 'praceta', 'retorno', 'rodovia', 'rotunda', 'viaduto', 'autoestr', 'ave marg', 'ava marg', 'azinhaga', 'esquerdo', 'rodoanel', 'travessa', 'auto estr', 'estr marg', 'calçadinha', 'particular', 'projectada', 'autoestrada', 'caclcadinha', 'transversal', 'urbanizacao', 'auto estrada', 'prolongamento', 'rua particular', 'avenida marginal', 'câmara municipal', 'camara municipal', 'astrada marginal', 'estrada nacional', 'estrada regional', 'estrada municipal', 'itinerário principal', 'itinerario principal', 'itinerário complementar', 'itinerario complementar'], None)],
        "ur": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['گلی', 'روڈ', 'لین', 'سڑک', 'شاہراہ', 'ہائی وے', 'ایکسپریس وے'], None)],
        "vi": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['ngõ', 'đường', 'duong', 'đại lộ', 'dai lo', 'quốc lộ', 'quoc lo', 'tỉnh lộ', 'tinh lo', 'đường hẻm', 'duong hem', 'đường nhỏ', 'duong nho', 'đường phố', 'duong pho', 'công trường', 'cong truong', 'quảng trường', 'quang truong'], None)],
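        # Note: "id" already appears above; in a dict literal the later key wins, so this entry carries the merged word list.
        "id": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['jln', 'jl.', 'gg.', 'gang', 'jalan', 'jalur', 'lrong', 'lorong', 'alunalun', 'jembatan', 'alun-alun', 'alun alun', 'terowongan'], None)],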
        "yo": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['opopona', 'ọna', ], None)],
        "sw": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['barabara kuu', 'barabara', 'avenue', 'njia',], None)],
        "xh": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['indlela', 'uhola wendlela', 'isitalato', ], None)],
        "zu": [(re.compile(r'(?:[-\/ ]*\d){5,8}[^\d]{5,20}|[^\d]{5,20}(?:[-\/ ]*\d){5,8}|(?:[-\/ ]*\d){1,3}[^@#$%]{5,20}(?:[-\/ ]*\d){5,8}', re.IGNORECASE|re.UNICODE), ['umgwaqo', 'umgwaqo omkhulu',], None)],
         # TODO: finish the remaining African and Indic languages
         "zh": [(regex.Regex('((\\p{Han}{1,3}(自治区|省))?\\p{Han}{1,4}((?<!集)市|县|州)\\p{Han}{1,10}[路|街|道|巷](\\d{1,3}[弄|街|巷])?\\d{1,4}号)'), None, None),
                (regex.Regex('(?<zipcode>(^\\d{5}|^\\d{3})?)(?<city>\\D+[縣市])(?<district>\\D+?(市區|鎮區|鎮市|[鄉鎮市區]))(?<others>.+)', flags=regex.V0), None, None),
            (re.compile(r'\b[^\d~!@#$%^&="\':;<>?\/]{5,40}\s\d{1,8}(?:[-\s]\d{0,5})|\d/{1,3}\?\d{0,4}[^\d]{5,20}\d{5,8}(?:[-\s]\d{0,5})', re.IGNORECASE|re.UNICODE), ['路','街道', '高速公路','大道', '道','街','段','弄', '大街', '巷', '村道','縣道','县道','省道','鄉道','乡道','大院','國道','国道','胡同'], None)
            ],
    },
    "PHONE": {
      "zh" : [(regex.compile(r"\d{4}-\d{8}"), None, None),
              
              #from https://github.com/Aggregate-Intellect/bigscience_aisc_pii_detection/blob/main/language/zh/rules.py which is under Apache 2
              (regex.compile('(0?\d{2,4}-[1-9]\d{6,7})|({\+86|086}-| ?1[3-9]\d{9} , ([\+0]?86)?[\-\s]?1[3-9]\d{9})'), None, None),
        ],
      # we can probably remove one of the below
      "default": [
              # https://github.com/madisonmay/CommonRegex/blob/master/commonregex.py phone with exts
              (
                  re.compile('((?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*(?:[2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|(?:[2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?(?:[2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?(?:[0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(?:\d+)?))', re.IGNORECASE),
                  None, None
              ),
              # common regex phone
              (
                  re.compile('((?:(?<![\d-])(?:\+?\d{1,3}[-.\s*]?)?(?:\(?\d{3}\)?[-.\s*]?)?\d{3}[-.\s*]?\d{4}(?![\d-]))|(?:(?<![\d-])(?:(?:\(\+?\d{2}\))|(?:\+?\d{2}))\s*\d{2}\s*\d{3}\s*\d{4}(?![\d-])))'),
                  None, None
              ), 
              ( re.compile('[\+\d]?(\d{2,3}[-\.\s]??\d{2,3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})'), None, None)     
      ]      
    },
    "IP_ADDRESS": {
        "default": [(re.compile('(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', re.IGNORECASE), None, None),]
              
        },
    "USER": {
      "default": [
              #generic user id
              (re.compile(r"\s@[a-z][0-9a-z]{4,8}", re.IGNORECASE), None, None),
              #email
              (re.compile("(\w+[a-z0-9!#$%&'*+\/=?^_`{|.}~-]*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)", re.IGNORECASE), None, None),
      ]    
    },
    #need a global license plate regex
    "LICENSE_PLATE": {
      "en": [
              #en license plate
              (regex.compile('[A-Z]{3}-\d{4}|[A-Z]{1,3}-[A-Z]{1,2}-\d{1,4}'), None, None)
      ],
      "zh": [ #from https://github.com/Aggregate-Intellect/bigscience_aisc_pii_detection/blob/main/language/zh/rules.py which is under Apache 2
              #LICENSE_PLATE
              (regex.compile(r'(\b[A-Z]{3}-\d{4}\b)'), None, None),
              (regex.compile('^(?:[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼使领 A-Z]{1}[A-HJ-NP-Z]{1}(?:(?:[0-9]{5}[DF])|(?:[DF](?:[A-HJ-NP-Z0-9])[0-9]{4})))|(?:[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼使领 A-Z]{1}[A-Z]{1}[A-HJ-NP-Z0-9]{4}[A-HJ-NP-Z0-9 挂学警港澳]{1})$'), None, None),
      ],
    },
    "ID": {
      "zh": [ #from https://github.com/Aggregate-Intellect/bigscience_aisc_pii_detection/blob/main/language/zh/rules.py which is under Apache 2
              #since we can't capture some of the zh rules under the general rules
              (regex.compile('(?:[16][1-5]|2[1-3]|3[1-7]|4[1-6]|5[0-4])\d{4}(?:19|20)\d{2}(?:(?:0[469]|11)(?:0[1-9]|[12][0-9]|30)|(?:0[13578]|1[02])(?:0[1-9]|[12][0-9]|3[01])|02(?:0[1-9]|[12][0-9]))\d{3}[\dXx]'), None, None),
              (regex.compile('(^[EeKkGgDdSsPpHh]\d{8}$)|(^(([Ee][a-fA-F])|([DdSsPp][Ee])|([Kk][Jj])|([Mm][Aa])|(1[45]))\d{7}$)'), None, None),
          ],
      "default": [
              #credit card from common regex
              (re.compile('((?:(?:\\d{4}[- ]?){3}\\d{4}|\\d{15,16}))(?![\\d])'), None, None),
              #icd code - see https://stackoverflow.com/questions/5590862/icd9-regex-pattern
              (re.compile('[A-TV-Z][0-9][A-Z0-9](\.[A-Z0-9]{1,4})'), None, None),
              # generic id with dashes - this sometimes catches a "-" or a "/" at the beginning of a number, which might not be what we want.
              # note: we do not catch a "." that could be inside an ID because that would be very close to math numbers and money. TBD
              (re.compile('[A-Z#]{0,3}(?:[-\/ ]*\d){6,13}'), None, ('pp', 'pp.', )), #adding cap chars at end, see Tw # ? ...[A-Z]{0,2}
              # IBAN
              (re.compile('[A-Z]{2}\d+\d+[A-Z]{0,4}(?:[- ]*\d){10,32}[A-Z]{0,3}'), None, None),
      ],
    },
 }
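
Each rulebase entry above is a 3-tuple: a compiled pattern, an optional list of context words (street-type words for `ADDRESS`), and an optional tuple that appears to act as a block list (for example `('pp', 'pp.')` on the generic `ID` rule). The real matching logic lives in `detect_ner_with_regex_and_context`, which is not shown in this cell, so the sketch below is only a guess at how such a tuple could be consumed. The helper `apply_rule`, its context/block semantics, and `toy_id_rule` are all hypothetical.

```python
import re

def apply_rule(text, rule):
    """Return matched spans for one (pattern, context_words, block_words) rule.

    Hypothetical semantics, assumed for illustration only:
      * context_words: if given, at least one must occur in the text;
      * block_words: if given, a match is dropped when the text right
        before it ends with one of these words (e.g. 'pp.' for page numbers).
    """
    pattern, context_words, block_words = rule
    lowered = text.lower()
    if context_words and not any(w in lowered for w in context_words):
        return []
    spans = []
    for m in pattern.finditer(text):
        prefix = lowered[:m.start()].rstrip()
        if block_words and prefix.endswith(tuple(block_words)):
            continue
        spans.append((m.group(0), m.start(), m.end()))
    return spans

# A toy rule in the same shape as the rulebase entries (not one of the real rules).
toy_id_rule = (re.compile(r"\d{6,13}"), None, ("pp", "pp."))

print(apply_rule("Customer number 12345678 on file", toy_id_rule))
print(apply_rule("See pp. 123456 for details", toy_id_rule))
```

Under these assumed semantics the first call returns the digit span with its offsets, while the second returns nothing because the digits follow `pp.`.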

In [None]:
# Additional year patterns
year_patterns = [
  (re.compile(r"\b[1-2][0-9]{3}-[1-2][0-9]{3}\b"), None, None), # yyyy-yyyy
  (re.compile(r"\b[1-2][0-9]{3}-[0-3][0-9]-[0-3][0-9]\b"), None, None),  # yyyy-mm-dd or yyyy-dd-mm
  (re.compile(r"\b[0-3][0-9]-[0-3][0-9]-[1-2][0-9]{3}\b"), None, None), # mm-dd-yyyy or dd-mm-yyyy
  (re.compile(r"\b[0-3][0-9]-[1-2][0-9]{3}\b"), None, None),  # mm-yyyy
  (re.compile(r"\b[1-2][0-9]{3}-[0-3][0-9]\b"), None, None),  # yyyy-mm
]

privacy_group_regexes =  copy.deepcopy(regex_rulebase)

privacy_group_regexes2 =  copy.deepcopy(regex_rulebase)
privacy_group_regexes2['USER']['en'] = [( re.compile(r'\b[^@]*@?[A-Za-z]{3,10}\b'), ['second ago', 'seconds ago', 'minute ago',  'minutes ago', 'hour ago', 'hours ago', 'day ago', 'days ago', 'month ago', 'months ago', 'years ago', 'year ago'], None)]
privacy_group_regexes_with_year_pattern = copy.deepcopy(regex_rulebase)
privacy_group_regexes_with_year_pattern['DATE']['default'] = year_patterns

sasha_regexes = copy.deepcopy(regex_rulebase)
sasha_regexes['ID']['default'] = [( re.compile(r'\b[A-Za-z]*(?:[-]*\d){6,}\b'), None, None),]
sasha_regexes_with_year_pat = copy.deepcopy(regex_rulebase)
sasha_regexes_with_year_pat['ID']['default'] = [( re.compile(r'\b[A-Za-z]*(?:[-]*\d){6,}\b'), None, None),]
sasha_regexes_with_year_pat['DATE']['default'] = year_patterns

meg_regexes = copy.deepcopy(regex_rulebase)
meg_regexes['ID']['default'] = [( re.compile(r'\b[A-Za-z]*(?:[-\.]*\d){6,}\b'), None, None),]
meg_regexes_with_year_pat = copy.deepcopy(regex_rulebase)
meg_regexes_with_year_pat['ID']['default'] = [( re.compile(r'\b[A-Za-z]*(?:[-\.]*\d){6,}\b'), None, None),]
meg_regexes_with_year_pat['DATE']['default'] = year_patterns


#meg's alternate patterns.
# Patterns for high-risk character strings
id_pattern = r'(?:^|[\b\s@?,!;:\'\")(.\p{Han}])([A-Za-z]*(?:[\p{Pd}]*\p{Nd}){6,})(?:$|[\b\s@?,!;:\'\")(.\p{Han}])'
# https://regex101.com/r/JQkmh8/2
# key_pattern = r'(?:^|[\b\s@?,!;:\'\")(.\p{Han}])((?:(?:[A-Za-z]+[\p{Nd}\p{Pd}\/\+\=:]+|[\p{Nd}\p{Pd}\/\+\=:]+[A-Za-z]+)){4,}|(?:(?:\p{Nd}{3,}|[A-Z]+\p{Nd}+[A-Z]*|\p{Nd}+[A-Z]+\p{Nd}*)[\s\p{Pd}]?){4,})(?:$|[\b\s\p{Han}@?,!;:\'\"])'
# https://regex101.com/r/JQkmh8/5
key_pattern = r'(?:^|[\b\s@?,!:;\'\")(.\p{Han}])((?:(?:[A-Za-z]+[\p{Nd}\p{Pd}\/\+\=:_]+|[\p{Nd}\p{Pd}\/\+\=:]+[A-Za-z]+)){4,}|(?:(?:\p{Nd}{3,}|[A-Z]+\p{Nd}+[A-Z]*|\p{Nd}+[A-Z]+\p{Nd}*)[ \p{Pd}]?){3,})(?:$|[\b\s\p{Han}@?,!;:\'\")(.])'
# TODO: Should we put the start/end of string character classes?
ipv4_pattern = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}'
ipv6_pattern = r'(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])'
ip_pattern = r"(?:^|[\b\s@?,!;:\'\")(.\p{Han}])(" + r"|".join([ipv4_pattern, ipv6_pattern]) + r")(?:$|[\s@,?!;:\'\"(.\p{Han}])"
# https://regex101.com/r/OZdSUu/5
email_pattern = r'(?:^|[\s\b\'\"@,?!;:)(.\p{Han}])([^\s@,?!;:)(]+@[^,\s!?;,]+[^\s\b\'\"@,?!;:)(.])(?:$|[\s\b@,?!;:)(.\p{Han}])'
# https://regex101.com/r/mOqi1s/3
user_pattern = r'(?:^|[\s@,?!;:\'\")(\p{Han}])(@[^\s@,?!;:\'\")(]{3,})'
# Examples from https://regexpattern.com/phone-number/
# https://regex101.com/r/lZZ0XP/4
# Also matches MLS numbers
phone_pattern = r'(?:^|[\s\'\"(\p{Han}])((?:\+\p{Nd}+[ \/.\p{Pd}]*)?(?:(?:\(\+?\p{Nd}+\))?(?:[ \/.\p{Pd}]*\p{Nd})){7,}(?:[\t\f #]*\p{Nd}+)?)(?:$|[\s@,?!;:\'\"(.\p{Han}])'

#create your own regexes to test
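
When trying out a new regex, it can help to check it against a few strings that should and should not match before running it on the labelled data. The sketch below does this with the standard `re` module; `check_pattern` is a hypothetical helper, and the two patterns are copied from this cell (the yyyy-yyyy year pattern and the `sasha_regexes` ID pattern).

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the strings on which a candidate pattern behaves unexpectedly."""
    compiled = re.compile(pattern)
    missed = [s for s in should_match if not compiled.search(s)]
    false_hits = [s for s in should_not_match if compiled.search(s)]
    return missed, false_hits

# Patterns copied from this cell.
year_range = r"\b[1-2][0-9]{3}-[1-2][0-9]{3}\b"   # yyyy-yyyy
sasha_id = r"\b[A-Za-z]*(?:[-]*\d){6,}\b"         # 6+ digits, optional dashes

print(check_pattern(year_range, ["from 1999-2007 only"], ["call 555-0100"]))
print(check_pattern(sasha_id, ["ref AB-123456"], ["from 1999-2007 only"]))
```

The second call reports `from 1999-2007 only` as a false hit for the ID pattern, which is presumably the kind of overlap the `DATE` year patterns are meant to help separate.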


# MTS Testing Code

In [None]:
#@title Experimental Address Tests
%cd /content/muliwai
labelled_lines = pandas.read_csv("/content/drive/MyDrive/labelled_data.csv")
regex_catches = []
regex_catches_only_pii = []
gold_catches = []
gold_catches_only_pii = []

for index, row in labelled_lines.iterrows():
    text = row['text']
    # Some rows have a null text entry, so only process non-empty strings.
    if text and isinstance(text, str):
        results = detect_ner_with_regex_and_context(text, 'en', tag_type={'AGE', 'ADDRESS_EXP'})
        if results:
            print(row['recognized_text'], row['pii_type'], row['start_position'], row['end_position'], text.replace("\n", " "))
            print('**', results)

/content/muliwai
863-638-1188 PHONE_NUMBER 1058 1070 ​Today was our first meeting in the new year 2017. The new board for 2017 was introduced and inducted. The turnout of fabulous women at the Club Room at the Chain O Lakes Complex was good. Cam's Catering served a lunch to include pork tenderloin, steamed vegetable, mashed potatoes with cheese, garden salad, and carrot cake for desert. It was delightful to have so many choices! We had a special local non-profit organization Adrianne Hans from Hunger Education and Resource Training (HEART) was the speaker for our January meeting and we all learned so much from Adrienne regarding the college training courses offered at HEART. HEART offers two accredited training classes per year in 15 week courses intervals and one 3-week training class. Collage credit is achieved for those who graduate the program. The courses are designed to teach the students survival when conducting missionary work in third world countries. If anyone is interested i

In [None]:
#@title Run regex on Presidio labeled data to examine high risk stuff
%cd /content/muliwai
labelled_lines = pandas.read_csv("/content/drive/MyDrive/labelled_data.csv")
regex_catches = []
regex_catches_only_pii = []
gold_catches = []
gold_catches_only_pii = []

for index, row in labelled_lines.iterrows():
  text = row['text']
  # Some rows have a null text entry; skip them instead of relying on a bare except.
  if not text or not isinstance(text, str):
    continue
  print(row['recognized_text'], row['pii_type'], row['start_position'], row['end_position'], text.replace("\n", " "))
  print('**', detect_ner_with_regex_and_context(text, 'en', tag_type=high_risk_stuff))

/content/muliwai
0827650483 PHONE_NUMBER 831 841 - GPS navigation with Free IGO maps and GPS Antenna. - Optional external microphone Jack on the back. Sorry we don’t have Auris units yet. Not review but contact on websote does.not.work. Do you have screen for this model? 1999-2007. Not 2007-2009 model with factory screen. The modelreferred to above is a Prado 2007. Good to see you have the model for the 2007 Prado. Just want to be sure – is it for the Jan 2007 Prado or the Oct 2007 Prado? The facelift in 2007 came out with a small screen. The Jan 2007 had no screen. I want the Jan 2007. Do you have this? I need to know if I live in Vanderbijlpark can you help me, I drive 2010 Toyota Fortuner. Hi Faan, We courier via Aramex express service for customers in Potch. You can proceed to checkout to order the unit. Alternatively you can call or whatsapp our sales manager on 0827650483.
** [('1999-2007', 223, 232, 'DATE'), ('2007-2009', 238, 247, 'PHONE'), ('0827650483', 831, 841, 'PHONE')]
(3
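
The output above tags `2007-2009` as `PHONE`. One plausible contributor is the permissive third `default` `PHONE` pattern in the rulebase, which accepts any three digits, an optional separator, and four digits. The sketch below copies that pattern verbatim and runs it on the three spans from the output. Whether this is the rule that actually fired depends on the ordering and overlap handling inside `detect_ner_with_regex_and_context`, which is not shown here, so treat this as an assumption.

```python
import re

# Third "default" PHONE pattern, copied from the rulebase above.
loose_phone = re.compile(
    r'[\+\d]?(\d{2,3}[-\.\s]??\d{2,3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)

for sample in ["1999-2007", "2007-2009", "0827650483"]:
    m = loose_phone.search(sample)
    print(sample, "->", m.group(0) if m else None)
```

All three samples match in full, so on its own this pattern cannot tell a year range from a phone number.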

In [None]:
#@title Precision, recall and testing code
import ast
import re
#import dateparser
import time
import random
import pandas
from sklearn.metrics import f1_score, confusion_matrix
#from sklearn.metrics import precision_score as precision
#from sklearn.metrics import recall_score as recall
import statistics
# For timing
from tqdm import tqdm


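def precision(true, pred):
  """Fraction of predicted items that also appear in the gold list `true`."""
  correct = 0
  incorrect = 0
  for x in pred:
    if x in true:
      correct += 1
    else:
      incorrect += 1
  if correct + incorrect == 0:
    # Nothing was predicted: score 1 only when there was nothing to find.
    if pred == [] and true == []:
      return 1
    else:
      return 0
  return correct / float(correct + incorrect)


def recall(true, pred):
  """Fraction of gold items that were predicted (precision with the roles swapped)."""
  return precision(pred, true)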

def run_regex_set(labelled_lines, regex_set, src_lang='en', tag_type={'IP_ADDRESS', 'ID', 'PHONE', 'USER'}):
  regex_catches = []
  regex_catches_only_pii = []
  gold_catches = []
  gold_catches_only_pii = []
  regex_catches_context = []
  regex_catches_only_pii_context = []
  gold_catches_context = []
  gold_catches_only_pii_context = []
  for index, row in tqdm(labelled_lines.iterrows()):
      text = row["text"]
      context = row["context"]
      label = row["PII or not?"]
      ground_truth = row["Ground Truth"]
      # The whole paragraph 
      if isinstance(text, str):  # For some reason, a few of the text entries are null
        found_text_entities = list(set(a[0] for a in detect_ner_with_regex_and_context(text, src_lang, tag_type=tag_type, all_regex=regex_set)))     
        if label == 1:
          gold_catches.append([ground_truth])
          gold_catches_only_pii.append([ground_truth])
          regex_catches.append(found_text_entities)
          regex_catches_only_pii.append(found_text_entities)
        else:
          gold_catches.append([])
          regex_catches.append(found_text_entities)
      # A snippet just around the PII
      if isinstance(context, str):  # For some reason, a few of the context entries are null
        found_context_entities = list(set(a[0] for a in detect_ner_with_regex_and_context(context, src_lang, tag_type=tag_type, all_regex=regex_set)))
        if label == 1:
          gold_catches_context.append([ground_truth])
          gold_catches_only_pii_context.append([ground_truth])
          regex_catches_context.append(found_context_entities)
          regex_catches_only_pii_context.append(found_context_entities)
        else:
          gold_catches_context.append([])
          regex_catches_context.append(found_context_entities)
  return regex_catches, regex_catches_only_pii, gold_catches, gold_catches_only_pii, regex_catches_context, regex_catches_only_pii_context, gold_catches_context, gold_catches_only_pii_context


def run_it(regex_ruleset,  src_lang='en',  data_file=None, tag_type={'IP_ADDRESS', 'ID', 'PHONE', 'USER', 'URL', 'TIME'}, print_results=True):
  labelled_lines = pandas.read_csv(data_file)
  fp_dict = {}
  stats = {}
  for regex_name, regex_set in regex_ruleset:
    print("\n# " + regex_name + " regexes")
    raw_results = run_regex_set(labelled_lines, regex_set, src_lang, tag_type)

    regex_catches, regex_catches_only_pii, gold_catches, gold_catches_only_pii, regex_catches_context, regex_catches_only_pii_context, gold_catches_context, gold_catches_only_pii_context = raw_results

    regex_example_flagged = [0 if regex_catches[i] == [] else 1 for i in range(len(regex_catches))]

    gold_example_flagged = [0 if gold_catches[i] == [] else 1 for i in range(len(gold_catches))]

    regex_example_flagged_context = [0 if regex_catches_context[i] == [] else 1 for i in range(len(regex_catches_context))]

    gold_example_flagged_context = [0 if gold_catches_context[i] == [] else 1 for i in range(len(gold_catches_context))]

    exact_accuracy = statistics.mean([gold_catches[i] == regex_catches[i] for i in range(len(gold_catches))])

    exact_accuracy_only_pii = statistics.mean([gold_catches_only_pii[i] == regex_catches_only_pii[i] for i in range(len(gold_catches_only_pii))])

    subset_accuracy = statistics.mean([all(x in regex_catches[i] for x in gold_catches[i]) for i in range(len(gold_catches))])

    subset_accuracy_only_pii = statistics.mean([all(x in regex_catches_only_pii[i] for x in gold_catches_only_pii[i]) for i in range(len(gold_catches_only_pii))])

    average_precision = statistics.mean([precision(gold_catches[i], regex_catches[i]) for i in range(len(gold_catches))])

    average_precision_only_pii = statistics.mean([precision(gold_catches_only_pii[i], regex_catches_only_pii[i]) for i in range(len(gold_catches_only_pii))])

    average_recall = statistics.mean([recall(gold_catches[i], regex_catches[i]) for i in range(len(gold_catches))])

    average_recall_only_pii = statistics.mean([recall(gold_catches_only_pii[i], regex_catches_only_pii[i]) for i in range(len(gold_catches_only_pii))])

    fp_dict[regex_name] = (gold_catches, regex_catches)
    if print_results:
      print_it(exact_accuracy_only_pii, subset_accuracy_only_pii, average_precision_only_pii, average_recall_only_pii, exact_accuracy, subset_accuracy, average_precision, \
            average_recall, gold_example_flagged, regex_example_flagged, gold_example_flagged_context, regex_example_flagged_context)
    stats[regex_name]= (exact_accuracy_only_pii, subset_accuracy_only_pii, average_precision_only_pii, average_recall_only_pii, exact_accuracy, subset_accuracy, average_precision, \
            average_recall, gold_example_flagged, regex_example_flagged, gold_example_flagged_context, regex_example_flagged_context)
  return {'stats':stats, 'fp_dict': fp_dict}


def print_it(exact_accuracy_only_pii, subset_accuracy_only_pii, average_precision_only_pii, average_recall_only_pii, exact_accuracy, subset_accuracy, average_precision, \
            average_recall, gold_example_flagged, regex_example_flagged, gold_example_flagged_context, regex_example_flagged_context):
    print()
    print("## Results:")
    print("Exact match percent between regex match and gold PII:\t\t%.4f" % exact_accuracy_only_pii)
    print()
    print("Percent of the time where gold PII is in the regex matches:\t%.4f" % subset_accuracy_only_pii)
    print()
    print("Average precision between gold PII and regex matches:\t\t%.4f" % average_precision_only_pii)
    print()
    print("Average recall between gold PII and regex matches:\t\t%.4f" % average_recall_only_pii)
    print()
    print()
    print("## Results including cases where there is no PII and the gold label is empty:")
    print("Exact match percent between regex match and gold PII:\t\t%.4f" % exact_accuracy)
    print()
    print("Percent of the time where gold PII is in the regex matches:\t%.4f" % subset_accuracy)
    print()
    print("Average precision between gold PII and regex matches:\t\t%.4f" % average_precision)
    print()
    print("Average recall between gold PII and regex matches:\t\t%.4f" % average_recall)
    print()
    print()
    print("F1 score for regex match over the full text vs gold match over the full text: %.4f" % f1_score(gold_example_flagged, regex_example_flagged))
    print()
    print("F1 score for regex match over the context vs gold match over the context: %.4f" % f1_score(gold_example_flagged_context, regex_example_flagged_context))
    print()
    print()
    print("## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:")
    print(confusion_matrix(gold_example_flagged_context, regex_example_flagged_context))
    print()
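
To make the per-example metrics computed in `run_it` concrete, here is a small sketch on toy data. The `precision` and `recall` helpers that `run_it` calls are defined elsewhere in the notebook, so `toy_precision` and `toy_recall` below are hypothetical stand-ins, and their choice to return 0.0 for an empty denominator is an assumption.

```python
import statistics

def toy_precision(gold, predicted):
    # Hypothetical helper: share of predicted spans that also appear in gold.
    if not predicted:
        return 0.0  # assumption: no predictions counts as zero precision
    return sum(1 for p in predicted if p in gold) / len(predicted)

def toy_recall(gold, predicted):
    # Hypothetical helper: share of gold spans that were predicted.
    if not gold:
        return 0.0  # assumption: empty gold counts as zero recall
    return sum(1 for g in gold if g in predicted) / len(gold)

# Three toy examples: a perfect match, a false positive on text without PII,
# and a partial match that misses one gold span.
gold_catches = [["555-1234"], [], ["a@b.com", "10.0.0.1"]]
regex_catches = [["555-1234"], ["2022"], ["a@b.com"]]
pairs = list(zip(gold_catches, regex_catches))

exact_accuracy = statistics.mean([g == r for g, r in pairs])
subset_accuracy = statistics.mean([all(x in r for x in g) for g, r in pairs])
average_precision = statistics.mean([toy_precision(g, r) for g, r in pairs])
average_recall = statistics.mean([toy_recall(g, r) for g, r in pairs])

print(exact_accuracy, subset_accuracy, average_precision, average_recall)
```

Note that `all(...)` over an empty gold list is `True`, so an example without gold PII always counts as a subset match, even when the regex produced a false positive.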
  

In [None]:
#@title Print false positives (optional)
def print_fp(gold_catches, regex_catches):
  false_positives = [list(filter(lambda x: x not in gold_catches[i], regex_catches[i])) for i in range(len(gold_catches))]
  print("False positives:")
  length = 0
  for false_positive_list in false_positives:
    if len(false_positive_list) > 0:
      print()
    for false_positive in false_positive_list:
      length += 1
      print(false_positive)
  print("Number of False Positives:", length)
#print_fp(gold_catches, regex_catches)

In [None]:
results = run_it(regex_ruleset=[('privacy_group', privacy_group_regexes), \
                      ('privacy_group_regex_rulebase_old', privacy_group_regex_rulebase_old), \
                      ('sasha', sasha_regexes), \
                      ('sasha_regexes_with_year_pat', sasha_regexes_with_year_pat), \
                      ('meg', meg_regexes), ('meg_regexes_with_year_pat', meg_regexes_with_year_pat)], \
       data_file= "/content/drive/MyDrive/labelled_data.csv",
       src_lang='en', tag_type={'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE', 'AGE'})
#adding TIME will slow this and lower accuracy??
#adding LICENSE_PLATE seems to lower f1 slightly but doesn't change speed
#using precedence based overlap management also makes things slower


# privacy_group regexes


350it [00:09, 38.39it/s]



## Results:
Exact match percent between regex match and gold PII:		0.3772

Percent of the time where gold PII is in the regex matches:	0.7456

Average precision between gold PII and regex matches:		0.5189

Average recall between gold PII and regex matches:		0.7456


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3207

Percent of the time where gold PII is in the regex matches:	0.8309

Average precision between gold PII and regex matches:		0.4149

Average recall between gold PII and regex matches:		0.5656


F1 score for regex match over the full text vs gold match over the full text: 0.8119

F1 score for regex match over the context vs gold match over the context: 0.8106


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 36  85]
 [ 15 214]]


# privacy_group_regex_rulebase_old regexes


350it [00:07, 46.37it/s]



## Results:
Exact match percent between regex match and gold PII:		0.3553

Percent of the time where gold PII is in the regex matches:	0.7018

Average precision between gold PII and regex matches:		0.4934

Average recall between gold PII and regex matches:		0.7018


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3120

Percent of the time where gold PII is in the regex matches:	0.8017

Average precision between gold PII and regex matches:		0.4038

Average recall between gold PII and regex matches:		0.5423


F1 score for regex match over the full text vs gold match over the full text: 0.7992

F1 score for regex match over the context vs gold match over the context: 0.8134


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 48  73]
 [ 22 207]]


# sasha regexes


350it [00:06, 51.53it/s]



## Results:
Exact match percent between regex match and gold PII:		0.3991

Percent of the time where gold PII is in the regex matches:	0.7851

Average precision between gold PII and regex matches:		0.5544

Average recall between gold PII and regex matches:		0.7851


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3411

Percent of the time where gold PII is in the regex matches:	0.8571

Average precision between gold PII and regex matches:		0.4443

Average recall between gold PII and regex matches:		0.5977


F1 score for regex match over the full text vs gold match over the full text: 0.8083

F1 score for regex match over the context vs gold match over the context: 0.8092


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 41  80]
 [ 19 210]]


# sasha_regexes_with_year_pat regexes


350it [00:03, 89.31it/s] 



## Results:
Exact match percent between regex match and gold PII:		0.3991

Percent of the time where gold PII is in the regex matches:	0.7851

Average precision between gold PII and regex matches:		0.5546

Average recall between gold PII and regex matches:		0.7851


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3411

Percent of the time where gold PII is in the regex matches:	0.8571

Average precision between gold PII and regex matches:		0.4445

Average recall between gold PII and regex matches:		0.5977


F1 score for regex match over the full text vs gold match over the full text: 0.8083

F1 score for regex match over the context vs gold match over the context: 0.8092


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 41  80]
 [ 19 210]]


# meg regexes


350it [00:06, 50.91it/s]



## Results:
Exact match percent between regex match and gold PII:		0.3991

Percent of the time where gold PII is in the regex matches:	0.7982

Average precision between gold PII and regex matches:		0.5568

Average recall between gold PII and regex matches:		0.7982


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3411

Percent of the time where gold PII is in the regex matches:	0.8659

Average precision between gold PII and regex matches:		0.4459

Average recall between gold PII and regex matches:		0.6064


F1 score for regex match over the full text vs gold match over the full text: 0.8083

F1 score for regex match over the context vs gold match over the context: 0.8077


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 40  81]
 [ 19 210]]


# meg_regexes_with_year_pat regexes


350it [00:04, 84.69it/s] 


## Results:
Exact match percent between regex match and gold PII:		0.3991

Percent of the time where gold PII is in the regex matches:	0.7982

Average precision between gold PII and regex matches:		0.5570

Average recall between gold PII and regex matches:		0.7982


## Results including cases where there is no PII and the gold label is empty:
Exact match percent between regex match and gold PII:		0.3411

Percent of the time where gold PII is in the regex matches:	0.8659

Average precision between gold PII and regex matches:		0.4460

Average recall between gold PII and regex matches:		0.6064


F1 score for regex match over the full text vs gold match over the full text: 0.8083

F1 score for regex match over the context vs gold match over the context: 0.8077


## Confusion matrix for regex match over the context vs gold match over the context [[true negative, false positive], [false negative, true positive]]:
[[ 40  81]
 [ 19 210]]
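
The context-level F1 scores can be re-derived from the confusion matrices printed above. The sketch below assumes `f1_score` is the binary F1 for the flagged class (as in scikit-learn's default) and that the matrices are laid out as `[[tn, fp], [fn, tp]]`, as the printed header states. The function name `f1_from_confusion` is hypothetical.

```python
def f1_from_confusion(matrix):
    """F1 for the positive (flagged) class from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Confusion matrices copied from the outputs above (the *_with_year_pat
# rulesets print the same matrices as their base rulesets).
context_matrices = {
    "privacy_group": [[36, 85], [15, 214]],
    "privacy_group_regex_rulebase_old": [[48, 73], [22, 207]],
    "sasha": [[41, 80], [19, 210]],
    "meg": [[40, 81], [19, 210]],
}

for name, matrix in context_matrices.items():
    print(f"{name}: {f1_from_confusion(matrix):.4f}")
```

This reproduces the reported values 0.8106, 0.8134, 0.8092 and 0.8077. In every ruleset the false positives (73 to 85) far outnumber the false negatives (15 to 22), so the regexes over-flag more often than they miss.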






In [None]:
from muliwai.pii_regexes import detect_ner_with_regex_and_context
from tqdm import tqdm
import ast
import time
import pandas
text = ""
lang = ""

def run_it_unlabeled(regex_ruleset,  src_lang=None,  data_file=None, tag_type={'IP_ADDRESS', 'ID', 'PHONE', 'USER', 'URL', 'LICENSE_PLATE'}):
  global text, lang
  all_dfs = {}
  # Each line of the data file is parsed as a Python dict literal; close the file once it is read.
  with open(data_file, "r") as f:
    lines = [ast.literal_eval(l) for l in f]
  if src_lang is not None:
    lines = [l for l in lines if l['lang']==src_lang]
  print ('finished loading data... now detecting')
  for regex_name, regex_set in regex_ruleset:
    print("\n# " + regex_name + " regexes")
    start = time.time()
    idx = 0
    lang_2_matches = {}
    for dat in tqdm(lines):
        text = dat["text"]
        lang = dat['lang']
        regex_matches = lang_2_matches[lang] = lang_2_matches.get(lang, {"index": [], "text": [], "matches": []})
        text= text.encode().decode()
        matches = detect_ner_with_regex_and_context(text, lang, all_regex=regex_set, tag_type=tag_type)
        if len(matches) > 0:
          regex_matches["index"].append(idx)
          regex_matches["text"].append(text)
          len_text = len(text)
          matches2 = []
          for a in matches:
            if a[0] not in text: continue
            i = text.index(a[0])
            matches2.append((a[0], a[3], text[max(0,i-20):i]+"["+a[0]+"]"+text[i+len(a[0]):min(len_text, i+len(a[0])+20)]))
          regex_matches["matches"].append(matches2)
        idx += 1
    end = time.time()
    print()
    print(end-start)
    for lang, regex_matches in lang_2_matches.items():
      df = pandas.DataFrame(data=regex_matches)
      df.to_csv(f"/content/{regex_name}_{lang}_subset_regex_matches.csv")
      all_dfs[f'{regex_name}_{lang}'] = df
  return all_dfs
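
The bracketed context string built inside `run_it_unlabeled` can be tried in isolation. This is a minimal sketch of the same slicing logic; the helper name `bracket_in_context` and its `window` parameter are hypothetical.

```python
def bracket_in_context(text, span, window=20):
    """Wrap the first occurrence of span in [] with up to `window` characters on each side."""
    if span not in text:
        return None
    i = text.index(span)  # first occurrence only, as in run_it_unlabeled
    left = text[max(0, i - window):i]
    right = text[i + len(span):i + len(span) + window]  # slicing past the end is safe
    return f"{left}[{span}]{right}"

print(bracket_in_context("Call me at 555-1234 tomorrow", "555-1234", window=8))
```

Because `text.index` returns the first occurrence, a span that appears several times in a document is always shown with the context of its first appearance.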


In [None]:

all_dfs = run_it_unlabeled(regex_ruleset=[('privacy_group', privacy_group_regexes)], \
       data_file= "/content/drive/MyDrive/lang_subset.jsonl",
       tag_type={'ADDRESS_EXP'}) # we could ignore DATEs as we won't be anonymizing (unless it's an AGE - TBD)

finished loading data... now detecting

# privacy_group regexes


100%|██████████| 10000/10000 [01:53<00:00, 88.25it/s]



113.32173013687134


In [None]:
# Our priority PII types are:
# IDs [general]: This is anything that is a sequence of 6 or more digits, as is common in identifiers for people internationally (national IDs, tax IDs, passport numbers, etc.), credit card numbers, IBAN codes, etc.
# Keys [general]: This is anything that is a sequence of digits and letters in the same string. Common for API keys, etc.
# Email address, User name: Strings using @
# IP address: Digits with periods in them
# Phone number
# License plate
all_dfs = run_it_unlabeled(regex_ruleset=[('huu', huu_regexes), \
                      ('sasha', sasha_regexes), \
                      ('meg', meg_regexes)], \
       data_file= "/content/drive/MyDrive/lang_subset.jsonl",
       tag_type={'IP_ADDRESS', 'ID', 'PHONE', 'USER', 'URL', 'DATE', 'AGE'}) # we could ignore DATEs as we won't be anonymizing (unless it's an AGE - TBD)

# Explore

In [None]:
#@title Eyeball results
#@markdown Select which language to view results from.
select_regex_set = "privacy_group" #@param ["privacy_group", "huu", "ian", "sasha", "meg"]
select_langa = "en" #@param ["ar", "as", "bn", "ca", "en", "es", "eu", "fr", "gu", "hi", "id", "ig", "mr", "ny", "pa", "pt", "sn", "st", "sw", "ur", "vi", "xh", "yo", "zh", "zu"] {allow-input: true}
langa_regex_key = select_regex_set + "_" + select_langa
select_pii_type = "" #@param ["", 'IP_ADDRESS', 'ID', 'PHONE', 'USER', 'URL', 'DATE', 'AGE'] {allow-input: true}


If someone who knows pandas can filter the records by tag type, the results below will be better.
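
One possible way to filter the records by tag type is sketched below. It assumes each entry of the `matches` column is a list of `(span, tag, context)` tuples, as built in `run_it_unlabeled`. The helper name `filter_by_tag` is hypothetical and not part of muliwai.

```python
import pandas as pd

def filter_by_tag(df, tag):
    """Keep only matches with the given tag and drop rows left without matches."""
    if not tag:
        return df.reset_index(drop=True)
    filtered = df.copy()
    filtered["matches"] = filtered["matches"].apply(
        lambda matches: [m for m in matches if m[1] == tag]
    )
    filtered = filtered[filtered["matches"].apply(len) > 0]
    # Reset the index so rows can still be addressed as 0..n-1 by a slider.
    return filtered.reset_index(drop=True)

# Toy frame with the same columns that run_it_unlabeled produces.
demo = pd.DataFrame({
    "index": [0, 1],
    "text": ["first document", "second document"],
    "matches": [
        [("1.2.3.4", "IP_ADDRESS", "ctx"), ("42", "AGE", "ctx")],
        [("555-1234", "PHONE", "ctx")],
    ],
})
print(filter_by_tag(demo, "AGE"))
```

In the widget cell, `all_dfs[langa_regex_key]` could be passed through such a helper with `select_pii_type` before `max_val` is computed, so the slider only covers records that contain the selected type.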

In [None]:
#@title Explore (Select Scrollbar. Use Left Right Arrows on Keyboard To Scroll)
from ipywidgets import widgets
from IPython.display import display
import textwrap

max_val = len(all_dfs[langa_regex_key]) - 1
i = widgets.IntSlider(value=0, max=max_val, description='Index')
#how do we filter all_dfs by type??
#print ([a for a in all_dfs[langa_regex_key]['matches'][0] if a[1] == 'AGE'])
def match_f(i):
    print()
    print("Matches:")
    pii_str = ""
    for dat in [a for a in all_dfs[langa_regex_key]['matches'][i] if not select_pii_type or a[1] ==select_pii_type]:
      print (dat)
      #print([a for a in  if a[1] == 'AGE'])
def f(i):
    print('Index: {}'.format(i))
    print("** Text:")
    print(textwrap.fill(all_dfs[langa_regex_key]['text'][i]))
out = widgets.interactive_output(f, {'i': i})
matches = widgets.interactive_output(match_f, {'i': i})
widgets.HBox([widgets.VBox([i, matches], layout=widgets.Layout(width='640px')), widgets.VBox([out], layout=widgets.Layout(height='250px'))])

TraitError: ignored

In [None]:
all_dfs

{'privacy_group_en': Empty DataFrame
 Columns: [index, text, matches]
 Index: []}

# NER At Scale Testing

In [None]:
!python muliwai/process.py -h

__Python VERSION: 3.7.12 (default, Jan 15 2022, 18:48:18) 
[GCC 7.5.0]
__pyTorch VERSION: 1.10.0+cu111
__CUDA VERSION
__CUDNN VERSION: 8005
__Number CUDA Devices: 1
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, Tesla P100-PCIE-16GB, 460.32.03, 16280 MiB, 2 MiB, 16278 MiB
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0
usage: process.py [-h] [-src_lang SRC_LANG] [-target_lang TARGET_LANG]
                  [-augment_lang AUGMENT_LANG] [-cutoff CUTOFF]
                  [-batch_size BATCH_SIZE] [-hfdataset HFDATASET]
                  [-infile INFILE] [-shard_range SHARD_RANGE]
                  [-max_docs MAX_DOCS] [-outfile OUTFILE]
                  [-num_workers NUM_WORKERS] [-do_spacy_only DO_SPACY_ONLY]
                  [-do_hf_ner_only DO_HF_NER_ONLY]
                  [-do_dictionary_only DO_ONTOLOGY_ONLY]
                  [-do_regex_only DO_REGEX_ONLY]
                  [-do_qg_rel_only DO_QG_REL_ONLY] 

In [None]:
%cd /content
import os
all_data = {}
cutoff = 30
from muliwai.text_augment import TextAugment
#gu,as,pa
for lang in ["en"]: #  "zh,en,ny", "yo,mr,sn,st", "xh,zu,ar,bn", "ca,es,eu", "fr,hi,id,ig", "pt,sw,ur,vi",  ]:
  if True: # os.path.exists(f"/content/muliwai/turkunlp_data/{lang}_data.jsonl.gz"):
    print (f"### testing {lang}")
    #-num_workers 2  
    !python /content/muliwai/process.py -src_lang $lang -do_trans 0 -cutoff $cutoff -do_hf_ner 1 -do_spacy 1 -do_regex 1 -do_kenlm 1 -do_anonymization 1
    for l in lang.split(","):
      all_data[l] = TextAugment.deserialize_ner_items(infile=f"{l}_out.jsonl")
      print (f'example record for {l}: ', all_data[l][0][f'{l}_ner'])
      print (f'text for {l}: ', all_data[l][0][f'{l}_text'])
  else:
    print (f"no data for {lang}. skipping")

!mkdir -p /content/drive/Shareddrives/BigScience/test_outputs_cpu
!cp *_out.jsonl /content/drive/Shareddrives/BigScience/test_outputs_cpu

#TODO - check to do kenlm tests for multi-word names at the cutoff, but lower the cutoff for single word names. 

In [None]:
!python /content/muliwai/process.py -hfdataset cc100,sw -src_lang sw -do_trans 0 -cutoff 30  -do_hf_ner 1 -do_spacy 1 -do_regex 1 -do_kenlm 1 -do_dictionary 1 
   

Traceback (most recent call last):
  File "/content/muliwai/process.py", line 66, in <module>
    from text_augment import *
  File "/content/muliwai/text_augment.py", line 66, in <module>
    from ner_manager import *
  File "/content/muliwai/ner_manager.py", line 344
    ners = [tuple(list(a) +  [max(Counter(b))]) for a, b in doc[ner_key].items()]
       ^
IndentationError: expected an indented block


In [None]:
!python /content/muliwai/process.py -shard_range 5/1000 -hfdataset TurkuNLP/register_oscar,sw -src_lang sw -do_trans 0  -do_hf_ner 1 -do_spacy 1 -do_regex 1 -do_kenlm 1 -do_anonymization 1  -do_dictionary 1 
   

2022-03-09 02:57:42,928 : MainProcess : INFO : ('__Python VERSION:', '3.7.12 (default, Jan 15 2022, 18:48:18) \n[GCC 7.5.0]')
2022-03-09 02:57:42,928 : MainProcess : INFO : ('__pyTorch VERSION:', '1.10.0+cu111')
2022-03-09 02:57:42,928 : MainProcess : INFO : __CUDA VERSION
2022-03-09 02:57:42,928 : MainProcess : INFO : ('__CUDNN VERSION:', 8005)
2022-03-09 02:57:42,928 : MainProcess : INFO : ('__Number CUDA Devices:', 0)
2022-03-09 02:57:42,929 : MainProcess : INFO : __Devices
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

2022-03-09 02:57:42,967 : MainProcess : INFO : ('running on ', 'cpu')
2022-03-09 02:57:43,543 : MainProcess : INFO : Neuralcoref not loaded. Using normal spacy
  0% 0/1 [00:00<?, ?it/s]


100% 1/1 [00:00<00:00, 425.60it/s]

1it [00:20, 20.63s/it]
100% 1/1 [00:20<00:00, 20.63s/it]


In [None]:
from muliwai.text_augment import TextAugment
TextAugment.deserialize_ner_items(infile=f"sw_out.jsonl")

2022-03-09 03:06:36,441 : MainProcess : INFO : sw_out.jsonl


[{'chunks': [{'id': '0',
    'sw_offset': 0,
    'sw_text': 'Zarif: Iran inajua mpango wa Saudia wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran'}],
  'domain': '',
  'id': '0',
  'labels': ['NA'],
  'lang': 'sw',
  'sw_ner': [],
  'sw_signal_ner': [],
  'sw_text': 'Zarif: Iran inajua mpango wa Saudia wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran',
  'text': 'Zarif: Iran inajua mpango wa Saudia wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran'},
 {'chunks': [{'id': '1',
    'sw_offset': 0,
    'sw_text': 'Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu.'}],
  'domain': '',
  'id': '1',
  'labels': ['NA'],
  'lang': 'sw',
  'sw_ner': [],
  'sw_signal_ner': [],
  'sw_text': 'Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki

# KenLM Models To Check Public Figures

In [None]:
from muliwai.text_augment import TextAugment
TextAugment.public_figure_kenlm_cutoff_map

ModuleNotFoundError: ignored

In [None]:
ar_model = TextAugment.load_kenlm_model("vi")

Downloading:   0%|          | 0.00/1.41G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/907k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/705k [00:00<?, ?B/s]

In [None]:
TextAugment.public_figure_kenlm_cutoff_map['vi']

{'cutoff': 500, 'pattern': '{} sinh ra'}

In [None]:
TextAugment.public_figure_kenlm_cutoff_map['vi'] = {'pattern': "{} fue", "cutoff": 500}

In [None]:
AdolfoSuarez = TextAugment.public_figure_kenlm_cutoff_map['vi']['pattern'].format("Nguyễn Chiến")
ar_model.get_perplexity(AdolfoSuarez)

1576.5

In [None]:
AdolfoSuarez

'Nguyễn Chiến fue'

In [None]:
Tariq = TextAugment.public_figure_kenlm_cutoff_map['ar']['pattern'].format("طارق بن زياد")
ar_model.get_perplexity(Tariq)

85.8

In [None]:
Nasser = TextAugment.public_figure_kenlm_cutoff_map['es']['pattern'].format("Gamal Abdel Nasser")
ar_model.get_perplexity(Nasser)

88.7

In [None]:
from faker import Faker
from faker.providers import person, company, geo, address, ssn, internet
faker_ar = Faker("ar_SA")
faker_ar.add_provider(person)


AttributeError: ignored

In [None]:
TextAugment.public_figure_kenlm_cutoff_map['vi'] = [{'cutoff': 500, 'pattern': "{} sinh ra"}, {'pattern': "{} sáng lập", "cutoff": 800}]

In [None]:
TextAugment.public_figure_kenlm_cutoff_map['vi']

[{'cutoff': 500, 'pattern': '{} sinh ra'},
 {'cutoff': 800, 'pattern': '{} sáng lập'}]

In [None]:
def check_fakename(kenlm_model, patterns, fake_name):
    for pattern in patterns:
        test_name = pattern['pattern'].format(fake_name)
        if kenlm_model.get_perplexity(test_name) <= pattern['cutoff']:
            return True
    return False

print(check_fakename(kenlm_model, TextAugment.public_figure_kenlm_cutoff_map['vi'], "Đỗ Lương Tuệ Minh"))

False
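
The pattern-and-cutoff check can be exercised without downloading a KenLM model by swapping in a stub. In this sketch `StubPerplexityModel` and `looks_like_public_figure` are hypothetical names, and the canned perplexities are copied from the outputs recorded later in this notebook rather than computed.

```python
class StubPerplexityModel:
    """Stand-in for a KenLM model: returns canned perplexities instead of scoring text."""

    def __init__(self, scores, default=10_000.0):
        self.scores = scores
        self.default = default  # assumption: unseen strings get a high perplexity

    def get_perplexity(self, text):
        return self.scores.get(text, self.default)

def looks_like_public_figure(model, patterns, name):
    # A name is flagged if any templated sentence scores at or below its cutoff.
    return any(
        model.get_perplexity(p["pattern"].format(name)) <= p["cutoff"]
        for p in patterns
    )

patterns = [
    {"cutoff": 500, "pattern": "{} sinh ra"},
    {"cutoff": 800, "pattern": "{} sáng lập"},
]
stub = StubPerplexityModel({
    "Nguyễn Xuân Phúc sinh ra": 315.7,
    "Nguyễn Xuân Phúc sáng lập": 254.8,
    "Lê Tường Vy Linh sinh ra": 3268.8,
    "Lê Tường Vy Linh sáng lập": 2549.4,
})

print(looks_like_public_figure(stub, patterns, "Nguyễn Xuân Phúc"))
print(looks_like_public_figure(stub, patterns, "Lê Tường Vy Linh"))
```

The idea is that a low perplexity for a sentence such as "X sinh ra" suggests the language model has seen the name in similar contexts, which is used here as a proxy for the name belonging to a public figure.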


In [None]:
from muliwai.fake_names import *
from muliwai.text_augment import TextAugment
import random 

kenlm_model = TextAugment.load_kenlm_model("vi")
fake_dict = {
    "vietnamese_surnames" : vietnamese_surnames,
    "vietnamese_firstnames_male": vietnamese_firstnames_male,
    "vietnamese_first_middlenames_male": vietnamese_first_middlenames_male,
    "vietnamese_second_middlenames_male": vietnamese_second_middlenames_male,
    "vietnamese_firstnames_female": vietnamese_firstnames_female,
    "vietnamese_first_middlenames_female": vietnamese_first_middlenames_female,
    "vietnamese_second_middlenames_female": vietnamese_second_middlenames_female,
}
def check_fake_name(kenlm_model, patterns, fake_name):
    for pattern in patterns:
        test_name = pattern['pattern'].format(fake_name)
        if kenlm_model.get_perplexity(test_name) < pattern['cutoff']:
            return True
    return False

def generate_vietnamese_fake_name(fake_dict):
    # Assemble a random name: surname, one or two middle names, then first name.

    surname = None
    first_middlename = None
    second_middlename = None
    firstname = None

    surname = random.choice(fake_dict.get("vietnamese_surnames"))
    if random.randint(0,1) !=0: # pick a male name when the draw is 1
        firstname = random.choice(fake_dict.get("vietnamese_firstnames_male"))
        while firstname == surname:
            firstname = random.choice(fake_dict.get("vietnamese_firstnames_male"))
        first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_male"))
        while first_middlename == surname or first_middlename == firstname:
            first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_male"))
        if random.random() > 0.72724: # male names get a second middle name (length > 3) when the draw exceeds 0.72724
            second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_male"))
            while second_middlename == surname or second_middlename == firstname or second_middlename == first_middlename:
                second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_male"))
    else:
        firstname = random.choice(fake_dict.get("vietnamese_firstnames_female"))
        while firstname == surname:
            firstname = random.choice(fake_dict.get("vietnamese_firstnames_female"))
        first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_female"))
        while first_middlename == surname or first_middlename == firstname:
            first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_female"))
        if random.random() > 0.34698: # female names get a second middle name (length > 3) when the draw exceeds 0.34698
            second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_female"))
            while second_middlename == surname or second_middlename == firstname or second_middlename == first_middlename:
                second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_female"))
    # concatenate fake name
    if second_middlename is None:
        fake_name = f"{surname} {first_middlename} {firstname}"
    else:
        fake_name = f"{surname} {first_middlename} {second_middlename} {firstname}"

    return fake_name

def create_vietnamese_fake_name(kenlm_model, patterns, fake_dict, retry=1000):
    # Retry until a generated name passes check_fake_name; give up after `retry` attempts.
    for _ in range(retry):
        fake_name = generate_vietnamese_fake_name(fake_dict)
        if check_fake_name(kenlm_model, patterns, fake_name):
            print(fake_name, kenlm_model.get_perplexity(fake_name))
            return fake_name
    return ""


famous_name = "Nguyễn Xuân Phúc"

print ("#real famous name:")
print(TextAugment.public_figure_kenlm_cutoff_map['vi'])
test_name = TextAugment.public_figure_kenlm_cutoff_map['vi']['pattern'].format(famous_name)
print(famous_name, kenlm_model.get_perplexity(test_name) , "is public figure" if kenlm_model.get_perplexity(test_name) <= TextAugment.public_figure_kenlm_cutoff_map['vi']['cutoff'] else "not public figure")

print(TextAugment.public_figure_kenlm_cutoff_map['vi_2'])
test_name = TextAugment.public_figure_kenlm_cutoff_map['vi_2']['pattern'].format(famous_name)
print(famous_name,kenlm_model.get_perplexity(test_name),  "is public figure" if kenlm_model.get_perplexity(test_name) <= TextAugment.public_figure_kenlm_cutoff_map['vi_2']['cutoff'] else "not public figure")



print ("#fake name")
for _ in range(10):
  # The unchecked generator is used here so every candidate is printed with its perplexity.
  fake_name = generate_vietnamese_fake_name(fake_dict)
  #print(TextAugment.public_figure_kenlm_cutoff_map['vi'])
  test_name = TextAugment.public_figure_kenlm_cutoff_map['vi']['pattern'].format(fake_name)
  print(fake_name, kenlm_model.get_perplexity(test_name) , "is public figure" if kenlm_model.get_perplexity(test_name) <= TextAugment.public_figure_kenlm_cutoff_map['vi']['cutoff'] else "not public figure")


  #print(TextAugment.public_figure_kenlm_cutoff_map['vi_2'])
  test_name = TextAugment.public_figure_kenlm_cutoff_map['vi_2']['pattern'].format(fake_name)
  print(fake_name, kenlm_model.get_perplexity(test_name),  "is public figure" if kenlm_model.get_perplexity(test_name) <= TextAugment.public_figure_kenlm_cutoff_map['vi_2']['cutoff'] else "not public figure")

2022-02-24 23:56:23,730 : MainProcess : INFO : Getting model from https://s3.amazonaws.com/models.huggingface.co/neuralcoref/neuralcoref.tar.gz or cache
2022-02-24 23:56:23,898 : MainProcess : INFO : https://s3.amazonaws.com/models.huggingface.co/neuralcoref/neuralcoref.tar.gz not found in cache, downloading to /tmp/tmpsl24oseq
100%|██████████| 40155833/40155833 [00:02<00:00, 16230047.01B/s]
2022-02-24 23:56:26,495 : MainProcess : INFO : copying /tmp/tmpsl24oseq to cache at /root/.neuralcoref_cache/f46bc05a4bfba2ae0d11ffd41c4777683fa78ed357dc04a23c67137abf675e14.7d6f9a6fecf5cf09e74b65f85c7d6896b21decadb2554d486474f63b95ec4633
2022-02-24 23:56:26,600 : MainProcess : INFO : creating metadata file for /root/.neuralcoref_cache/f46bc05a4bfba2ae0d11ffd41c4777683fa78ed357dc04a23c67137abf675e14.7d6f9a6fecf5cf09e74b65f85c7d6896b21decadb2554d486474f63b95ec4633
2022-02-24 23:56:26,602 : MainProcess : INFO : removing temp file /tmp/tmpsl24oseq
2022-02-24 23:56:26,613 : MainProcess : INFO : extract

Downloading:   0%|          | 0.00/1.41G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/907k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/705k [00:00<?, ?B/s]

#real famous name:
{'pattern': '{} sinh ra', 'cutoff': 800}
Nguyễn Xuân Phúc 315.7 is public figure
{'pattern': '{} sáng lập', 'cutoff': 800}
Nguyễn Xuân Phúc 254.8 is public figure
#fake name
Lê Tường Vy Linh 3268.8 not public figure
Lê Tường Vy Linh 2549.4 not public figure
Nguyễn Khánh Thiên Hân 2225.2 not public figure
Nguyễn Khánh Thiên Hân 1852.2 not public figure
Hoàng Bùi Tú 1706.5 not public figure
Hoàng Bùi Tú 1017.4 not public figure
Ngô Dương Tấn Triết 3050.8 not public figure
Ngô Dương Tấn Triết 2539.3 not public figure
Ngô Chấn Bá Mạnh 3765.1 not public figure
Ngô Chấn Bá Mạnh 3701.8 not public figure
Hoàng Thiên Vỹ 2996.1 not public figure
Hoàng Thiên Vỹ 2418.6 not public figure
Dương Nam Danh 2729.9 not public figure
Dương Nam Danh 2203.7 not public figure
Vũ Thúy Trinh 2838.7 not public figure
Vũ Thúy Trinh 2291.6 not public figure
Đặng Phan Linh 4323.6 not public figure
Đặng Phan Linh 3235.2 not public figure
Đỗ Đặng Vân Diễm 5704.9 not public figure
Đỗ Đặng Vân Diễm 

In [None]:
from muliwai.fake_names import *
from muliwai.text_augment import TextAugment
import random 

def check_fake_name(models, patterns, fake_name, verbose):
    for model in models:
      for pattern in patterns:
          test_name = pattern['pattern'].format(fake_name)
          if model.get_perplexity(test_name) < pattern['cutoff']:
              if verbose:
                  print(fake_name, model.get_perplexity(test_name))
              return True
    return False

def generate_vietnamese_fake_name(fake_dict):
    # Assemble a random name: surname, one or two middle names, then first name.

    surname = None
    first_middlename = None
    second_middlename = None
    firstname = None

    surname = random.choice(fake_dict.get("vietnamese_surnames"))
    if random.randint(0,1) !=0: # pick a male name when the draw is 1
        firstname = random.choice(fake_dict.get("vietnamese_firstnames_male"))
        while firstname == surname:
            firstname = random.choice(fake_dict.get("vietnamese_firstnames_male"))
        first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_male"))
        while first_middlename == surname or first_middlename == firstname:
            first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_male"))
        if random.random() > 0.72724: # male names get a second middle name (length > 3) when the draw exceeds 0.72724
            second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_male"))
            while second_middlename == surname or second_middlename == firstname or second_middlename == first_middlename:
                second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_male"))
    else:
        firstname = random.choice(fake_dict.get("vietnamese_firstnames_female"))
        while firstname == surname:
            firstname = random.choice(fake_dict.get("vietnamese_firstnames_female"))
        first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_female"))
        while first_middlename == surname or first_middlename == firstname:
            first_middlename = random.choice(fake_dict.get("vietnamese_first_middlenames_female"))
        if random.random() > 0.34698: # female names get a second middle name (length > 3) when the draw exceeds 0.34698
            second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_female"))
            while second_middlename == surname or second_middlename == firstname or second_middlename == first_middlename:
                second_middlename = random.choice(fake_dict.get("vietnamese_second_middlenames_female"))
    # concatenate fake name
    if second_middlename is None:
        fake_name = f"{surname} {first_middlename} {firstname}"
    else:
        fake_name = f"{surname} {first_middlename} {second_middlename} {firstname}"

    return fake_name

def create_vietnamese_fake_name(model, patterns, fake_info, retry=1000, verbose=False):
    for _ in range(retry):
        fake_name = generate_vietnamese_fake_name(fake_info)
        if check_fake_name(model, patterns, fake_name, verbose):
            return fake_name
    return ""

kenlm_models = TextAugment.load_kenlm_model("vi")
fake_dict = {
    "vietnamese_surnames" : vietnamese_surnames,
    "vietnamese_firstnames_male": vietnamese_firstnames_male,
    "vietnamese_first_middlenames_male": vietnamese_first_middlenames_male,
    "vietnamese_second_middlenames_male": vietnamese_second_middlenames_male,
    "vietnamese_firstnames_female": vietnamese_firstnames_female,
    "vietnamese_first_middlenames_female": vietnamese_first_middlenames_female,
    "vietnamese_second_middlenames_female": vietnamese_second_middlenames_female,
}

TextAugment.public_figure_kenlm_cutoff_map['vi'] = [{'cutoff': 500, 'pattern': '{} sinh ra'},
                                                    {'cutoff': 800, 'pattern': '{} sáng lập'}]
target_patterns = TextAugment.public_figure_kenlm_cutoff_map['vi']
fake_name = create_vietnamese_fake_name(kenlm_models, target_patterns, fake_dict, verbose=True)

Phạm Hồng Ân 479.1


In [None]:
TextAugment.load_kenlm_model("bn")

[<edugp_kenlm_model.KenlmModel at 0x7f56cbd96ad0>,
 <edugp_kenlm_model.KenlmModel at 0x7f585804e7d0>]

In [None]:
TextAugment.public_figure_kenlm_cutoff_map.get("vi", [{'cutoff': 500, 'pattern': "{} (born"}])

[{'cutoff': 500, 'pattern': '{} sinh ra'},
 {'cutoff': 800, 'pattern': '{} sáng lập'}]

In [None]:
kenlm_models = TextAugment.load_kenlm_model("vi")

In [None]:
from muliwai.fake_names import *
from muliwai.text_augment import TextAugment
import random
import time

class FakeNameGenerator:
  def __init__(
      self,
      lang: str = "vi",
      trials: int = 1000
  ):
      self.lang = lang
      self.trials = trials
      self.kenlm_models = TextAugment.load_kenlm_model(lang)
      self.patterns = TextAugment.public_figure_kenlm_cutoff_map.get(lang, [{'cutoff': 500, 'pattern': "{} (born"}])
      if self.lang == "vi":
          self.surname_list = [vietnamese_surnames]
          self.first_name_list = [vietnamese_firstnames_male, vietnamese_firstnames_female]
          self.middle_name_list = [[vietnamese_first_middlenames_male, vietnamese_second_middlenames_male], [vietnamese_first_middlenames_female,vietnamese_second_middlenames_female]]
      elif self.lang == "bn":
          self.surname_list = [bengali_surnames]
          self.first_name_list = [bengali_firstnames_male, bengali_firstnames_female]
          self.middle_name_list = []
      elif self.lang == "ur":
          self.surname_list = [urdu_surnames]
          self.first_name_list = [urdu_firstnames]
          self.middle_name_list = []

  def generate(self):
      """ generate fake name """
      surname = None
      middlename = None
      firstname = None

      surname = random.choice(self.surname_list[0])
      # check if target language has different name for gender 
      if len(self.first_name_list) > 1: 
          gender = random.randint(0,1) # 0: male, 1: female
          firstname = random.choice(self.first_name_list[gender])
          while firstname == surname:
              firstname = random.choice(self.first_name_list[gender])
      else:
          gender = 0  # single name list: this language has no gender split
          firstname = random.choice(self.first_name_list[0])
          while firstname == surname:
              firstname = random.choice(self.first_name_list[0])

      # check whether the target language has middle names
      if len(self.middle_name_list) > 0:
          if random.random() > 0.5: # probability that the fake name includes a middle name
              middlename = random.choice(self.middle_name_list[gender][0])
      # concatenate fake name
      if middlename is not None:
          if self.lang == "vi":
              fake_name = f"{surname} {middlename} {firstname}"
          else:
              fake_name = f"{firstname} {middlename} {surname}"
      else:
          if self.lang == "vi":
              fake_name = f"{surname} {firstname}"
          else:
              fake_name = f"{firstname} {surname}"
      return fake_name

  def check_fakename(self, fake_name, verbose=False):
      """ check whether the fake name reads like a real name (perplexity below a cutoff) """
      for model in self.kenlm_models:
          for pattern in self.patterns:
              test_name = pattern['pattern'].format(fake_name)
              perplexity = model.get_perplexity(test_name)  # score once, reuse for printing
              if perplexity < pattern['cutoff']:
                  if verbose:
                      print(fake_name, perplexity)
                  return True
      return False

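  def create_fakename(self, verbose=False):
      """ retry generation until a name passes check_fakename; return None on failure """
      for _ in range(self.trials):
          fake_name = self.generate()
          if self.check_fakename(fake_name, verbose):
              return fake_name
      # names are accepted when perplexity < cutoff, so a higher cutoff accepts more names
      print('Could not find any fake name. Try increasing the cutoff values in the patterns')
      return None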

if __name__ == "__main__":
    generator = FakeNameGenerator(lang="vi")
    start_time=time.time()
    for i in range(100):
        fake_name = generator.create_fakename(verbose=True)
    print(f"Running time {time.time() - start_time}")

Ngô Thắm 721.8
Hồ Tiến 765.6
Nguyễn Phát 707.4
Huỳnh Quốc Tân 662.0
Võ Huyền 772.7
Ngô Văn Trường 486.1
Lý Tú 446.8
Trần Quỳnh Anh 522.4
Nguyễn Dương 698.7
Phạm Hồng Tài 441.1
Vũ Trung 650.4
Đỗ Thanh 635.6
Nguyễn Khang 738.8
Vũ Phạm Phương 721.9
Lý Nguyễn Việt 753.4
Nguyễn Hiền 727.1
Phan Đức 496.9
Lê Phương 737.3
Dương Ngọc Tân 473.6
Bùi Hằng 660.1
Huỳnh Anh 562.2
Bùi Nhàn 777.2
Ngô Khánh 640.0
Nguyễn Hiền 727.1
Bùi Đăng Tuấn 582.6
Nguyễn Duy Hoàng 306.3
Trần Bá Hưng 707.1
Dương Phụng 767.5
Bùi Quang Nghĩa 410.4
Hoàng Trọng 341.7
Hồ Xuân 608.9
Hoàng Ánh 732.6
Huỳnh Tường 715.1
Bùi Tú 728.8
Bùi Tường 701.4
Lê Vĩnh An 579.4
Phạm Đỗ Nguyên 767.8
Hồ Quốc 640.0
Lê Lý Ánh 732.0
Lý Hùng Nhựt 680.8
Huỳnh Công 634.9
Trần Minh 319.4
Huỳnh Tường Khanh 617.7
Phạm Nhàn 754.5
Bùi Thiện 615.6
Lý Đăng 755.3
Dương Anh 608.3
Bùi Đình Thành 746.3
Phan Mỹ 568.3
Hoàng Thiên An 662.9
Phạm Cao Quý 633.4
Lý Phương 732.5
Đặng Huyền 535.7
Đỗ Kỳ 776.1
Dương Phụng 767.5
Vũ Minh Thùy 787.6
Phạm Đỗ Thanh 717.5
Võ 

In [None]:
def demo():
  for i in range(100):
      if i >80:
        print(i)
        return i

In [None]:
demo()

81


81

In [None]:
from muliwai.faker_extensions import FakeNameGenerator
import time
generator = FakeNameGenerator(lang="ur")
start_time=time.time()
for i in range(100):
    generator.create_fakename(verbose=True)
print(f"Running time {time.time() - start_time}")

Running time 43.1169638633728


In [None]:
import time
time.time()

1645843312.0517728

In [None]:
    for _ in range(retry):
      name = f"{random.choice(self.first_name_list)} {random.choice(self.middle_name_list) if self.middle_name_list else ''} {random.choice(self.surname_list)}"
      perplexity = kenlm_model.get_perplexity(pattern.format(name))
      if perplexity > perplexity_cutoff:
        return name
    raise RuntimeError('Could not find any fake name. Try reducing perplexity_cutoff')

  def check_fake_name(self, patterns, fake_name, verbose):
    for model in self.kenlm_models:
      for pattern in patterns:
          test_name = pattern['pattern'].format(fake_name)
          if model.get_perplexity(test_name) < pattern['cutoff']:
              if verbose:
                  print(fake_name, model.get_perplexity(test_name))
              return True
    return False
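
The perplexity gate used by `check_fake_name` can be tried without downloading any KenLM model. The sketch below is only an illustration: `StubModel` and its score table are hypothetical stand-ins for the real models returned by `TextAugment.load_kenlm_model`, and the scores are made up to exercise both outcomes.

```python
from typing import Dict, List


class StubModel:
    """Hypothetical stand-in for a KenLM model; scores come from a fixed table."""

    def __init__(self, scores: Dict[str, float], default: float = 1000.0):
        self.scores = scores
        self.default = default

    def get_perplexity(self, text: str) -> float:
        # Unknown sentences get a high default perplexity.
        return self.scores.get(text, self.default)


def passes_cutoff(models: List[StubModel], patterns: List[dict], name: str) -> bool:
    """Return True if any model scores any filled-in pattern below its cutoff."""
    for model in models:
        for pattern in patterns:
            sentence = pattern["pattern"].format(name)
            if model.get_perplexity(sentence) < pattern["cutoff"]:
                return True
    return False


patterns = [
    {"cutoff": 500, "pattern": "{} sinh ra"},
    {"cutoff": 800, "pattern": "{} sáng lập"},
]
# Made-up scores, chosen only to exercise both branches.
model = StubModel({"Phạm Hồng Ân sinh ra": 479.1, "Võ Kỳ sáng lập": 950.0})

accepted = passes_cutoff([model], patterns, "Phạm Hồng Ân")
rejected = passes_cutoff([model], patterns, "Võ Kỳ")
print(accepted, rejected)
```

A name is accepted as soon as one pattern scores below its own cutoff, so adding patterns or raising cutoffs can only make the gate more permissive.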

In [None]:
scores=[]
for i in range(10):
  # each language maps to a list of {'cutoff', 'pattern'} dicts; use the first pattern
  fake = TextAugment.public_figure_kenlm_cutoff_map['ar'][0]['pattern'].format(faker_ar.name())
  score =  ar_model.get_perplexity(fake)
  print (fake,score)
  scores.append(score)
print ('mean of random name', sum(scores)/len(scores))

ولد ممدوح الشايع من 2916.4
ولد الأستاذ عرفه العليان من 2103.4
ولد السيد حامد المهنا من 1665.2
ولد ثقيف آل عايض من 340.9
ولد الأستاذة جمان المهنا من 1269.2
ولد ضياء آل رفيع من 11613.8
ولد السيد عتريس آل سعود من 1124.5
ولد داهي مهنا من 5616.0
ولد أناهيد آل الشيخ من 1545.4
ولد فيصل العليان من 3577.8
3177.26


# Mine Faker for Regex Patterns

In [None]:
%cd /content/muliwai/
import itertools
from faker_manager import faker_list
import faker
import faker.providers.address
address2person = {}
def to_regex(s_arr, lang):
  # Turn Faker format strings into regex source: name placeholders become
  # 3 to 12 word characters and numerify symbols become (optional) digits.
  def convert(s):
    return s.replace("{{first_name}}", r"\w{3,12}").replace("{{last_name}}", r"\w{3,12}").replace("#", r"\d").replace("%", r"\d").replace("@", r"\d?").replace("!", r"\d?")
  if type(s_arr) is str: return convert(s_arr)
  if type(s_arr) is list and len(s_arr) == 1:
    return convert(s_arr[0])
  return "("+"|".join(convert(s) for s in s_arr)+")"

for lang in faker_list:
      lang2 = lang.split("_")[0]
      aHash = address2person.get(lang2, {})
      found = False
      if not hasattr(faker.providers.person, lang):
          try:
            # import the locale's address provider; skip locales that do not ship one
            __import__(f"faker.providers.address.{lang}")
            found = True
          except ImportError:
            pass
            
      if found:
        data = {}
        provider = getattr(faker.providers.address,  lang)
        if hasattr(provider.Provider, 'address_formats'):
          print (lang)
          for i, address in enumerate(provider.Provider.address_formats):
            data[f'address{i}'] = address
        if hasattr(provider.Provider, 'street_address_formats'):
          data['street_address'] =  to_regex(provider.Provider.street_address_formats, lang)
        if hasattr(provider.Provider, 'street_name_formats'):
          if '{{street_name}}' not in provider.Provider.street_name_formats:
            data['street_name'] =  to_regex(provider.Provider.street_name_formats, lang)
        if hasattr(provider.Provider, 'street_names'):
          data['street_names'] =  to_regex(provider.Provider.street_names, lang)
        if hasattr(provider.Provider, 'street_suffixes'):
          data['street_suffix'] =   to_regex(provider.Provider.street_suffixes, lang)
        if hasattr(provider.Provider, 'street_titles'):
          data['street_title'] =   to_regex(provider.Provider.street_titles, lang)
        if hasattr(provider.Provider, 'street_suffixes_short'):
          data['street_suffix_short'] =  to_regex(provider.Provider.street_suffixes_short, lang)
        if hasattr(provider.Provider, 'street_suffixes_long'):
          data['street_suffixes_long'] =  to_regex(provider.Provider.street_suffixes_long, lang)
        if hasattr(provider.Provider, 'street_prefixes_short'):
          data['street_prefix_short'] =   to_regex(provider.Provider.street_prefixes_short, lang)
        if hasattr(provider.Provider, 'street_prefixes'):
          data['street_prefix'] =  to_regex(provider.Provider.street_prefixes, lang)
        if hasattr(provider.Provider, 'street_prefixes_long'):
          data['street_prefix_long'] = to_regex(provider.Provider.street_prefixes_long, lang)
        if hasattr(provider.Provider, 'postcode_formats'):
          data['postcode'] =  to_regex(provider.Provider.postcode_formats, lang)
        if hasattr(provider.Provider, 'building_number_formats'):
          data['building_number'] = to_regex(provider.Provider.building_number_formats, lang)
        if hasattr(provider.Provider, 'cities'):
          data['city'] = to_regex(provider.Provider.cities, lang)
        if hasattr(provider.Provider, 'states'):
          data['state'] = "|".join(list(itertools.chain(*[[to_regex(s2, lang) for s2 in s] for s in provider.Provider.states]))) if \
            type(provider.Provider.states[0]) is tuple else to_regex(provider.Provider.states, lang)
        if hasattr(provider.Provider, 'secondary_address_formats'):
          data['secondary_address'] = to_regex(provider.Provider.secondary_address_formats, lang)
        if hasattr(provider.Provider, 'street_suffixes'):
          data['street_name_suffix'] = to_regex(provider.Provider.street_suffixes, lang)
        # Substitute {{key}} placeholders with the regexes collected above;
        # repeat three times so nested placeholders are expanded too.
        for _ in range(3):
          for key, val in list(data.items()):
            for key2, val2 in list(data.items()):
              if "{{"+key2+"}}" in val:
                val = val.replace("{{"+key2+"}}", val2)
            data[key] = val
        
        print (data)

/content/muliwai
cs_CZ
{'address0': '({{street_name}} (\\d|\\d\\d|\\d\\d\\d))\n(1\\d\\d \\d\\d|2\\d\\d \\d\\d|3\\d\\d \\d\\d|4\\d\\d \\d\\d|5\\d\\d \\d\\d|6\\d\\d \\d\\d|7\\d\\d \\d\\d) (Abertamy|Adamov|Andělská Hora|Bakov nad Jizerou|Bavorov|Bechyně|Benešov nad Ploučnicí|Benátky nad Jizerou|Bezdružice|Bečov nad Teplou|Blatná|Blovice|Blšany|Bochov|Bohušovice nad Ohří|Bojkovice|Bor|Borohrádek|Borovany|Boží Dar|Brandýs nad Orlicí|Brno|Broumov|Brtnice|Brumov-Bylnice|Brušperk|Budišov nad Budišovkou|Budyně nad Ohří|Bučovice|Buštěhrad|Bystré|Bystřice|Bystřice nad Pernštejnem|Bystřice pod Hostýnem|Bzenec|Bílovec|Bělá nad Radbuzou|Bělá pod Bezdězem|Březnice|Březová|Březová nad Svitavou|Břidličná|Chabařovice|Chlumec|Chlumec nad Cidlinou|Choceň|Chomutov|Chotěboř|Chrast|Chrastava|Chropyně|Chvaletice|Chyše|Chýnov|Chřibská|Cvikov|Dačice|Dašice|Desná|Deštná|Dobrovice|Dobruška|Dobřany|Dobřichovice|Dobříš|Doksy|Dolní Benešov|Dolní Bousov|Dolní Kounice|Dolní Poustevna|Dubá|Dubí|Dubňany|Duchcov|Děčín|Fr
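
The conversion above treats every numeric placeholder as a plain digit. A slightly stricter mapping is sketched below, assuming Faker's `numerify` conventions (`#` any digit, `%` a non-zero digit, `!` a digit or nothing, `@` a non-zero digit or nothing). The names `PLACEHOLDER_TO_REGEX`, `format_to_regex`, and `formats_to_regex` are hypothetical helpers, not part of muliwai or Faker, and the two sample formats are chosen to resemble the `cs_CZ` postcodes printed above.

```python
import re

# Faker numerify placeholders mapped to regex fragments (assumed conventions).
PLACEHOLDER_TO_REGEX = {
    "#": r"\d",
    "%": r"[1-9]",
    "!": r"\d?",
    "@": r"[1-9]?",
}


def format_to_regex(fmt: str) -> str:
    """Convert one Faker numeric format string into regex source."""
    parts = []
    for ch in fmt:
        # Escape literal characters so that '.', '(' and similar stay literal.
        parts.append(PLACEHOLDER_TO_REGEX.get(ch, re.escape(ch)))
    return "".join(parts)


def formats_to_regex(formats) -> str:
    """Join several format strings into a single non-capturing alternation."""
    return "(?:" + "|".join(format_to_regex(f) for f in formats) + ")"


postcode = re.compile(formats_to_regex(["1## ##", "2## ##"]))
print(postcode.pattern)
print(bool(postcode.fullmatch("110 00")), bool(postcode.fullmatch("910 00")))
```

Escaping the literal characters matters once formats contain punctuation such as `.` or `(`, which would otherwise be read as regex syntax.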

# OLD Stuff

Define the pattern that we want to test

In [None]:

import re

id_patterns = [
  re.compile(r'\b[A-Za-z]*(?:[-]*\d){6,}\b'), # Sasha's general purpose ID regex
  #re.compile('((?:(?:\\d{4}[- ]?){3}\\d{4}|\\d{15,16}))(?![\\d])'),  # Huu's credit card from common regex
  #re.compile('[A-TV-Z][0-9][A-Z0-9](\.[A-Z0-9]{1,4})'),  # Huu's icd code - see https://stackoverflow.com/questions/5590862/icd9-regex-pattern
  #re.compile('[A-Z]{0,3}(?:[- ]*\d){7,13}')  # Huu's generic id with dashes
]
year_patterns = [
  re.compile(r"\b[1-2][0-9]{3}-[1-2][0-9]{3}\b"),  # yyyy-yyyy
  re.compile(r"\b[1-2][0-9]{3}-[0-3][0-9]-[0-3][0-9]\b"),  # yyyy-mm-dd or yyyy-dd-mm
  re.compile(r"\b[0-3][0-9]-[0-3][0-9]-[1-2][0-9]{3}\b"),  # mm-dd-yyyy or dd-mm-yyyy
  re.compile(r"\b[0-3][0-9]-[1-2][0-9]{3}\b"),  # mm-yyyy
  re.compile(r"\b[1-2][0-9]{3}-[0-3][0-9]\b")  # yyyy-mm
]

# Other ID patterns that don't seem to work quite as well
# pattern = re.compile(r'\b\S*(?:\d[\ \-\.\:]?\d){3,}\d?\S*\b')
# pattern = re.compile(r'[A-Za-z_]*(?:\d[\ \-\.\:]?\d[\ \-\.\:]?){3,}\d?[A-Za-z_]*')
# pattern = re.compile(r'\b[A-Za-z0-9]*(?:\d[\ \-\.\:]?\d){3,}\d?[A-Za-z0-9]*\b')
# pattern = re.compile(r'.')  # sanity check lol
# pattern = re.compile(r'\w*(?:\d[\ \-\.\:]?\d){3,}\d?\w*')
# pattern = re.compile(r'\b[A-Za-z_]*(?:\d[\ \-\.\:]?\d[\ \-\.\:]?){3,}\d?[A-Za-z_]\b')

def findall_without_year(text):
  # Collect ID-like matches, then drop any match that itself looks like a year or date.
  id_regex_matches = []
  for regex in id_patterns:
    id_regex_matches += re.findall(regex, text)
  id_regex_matches = list(set(id_regex_matches))
  filtered_matches = []
  for match in id_regex_matches:
    # test the match, not the whole text, so one date does not discard every ID
    if not any(year_pattern.search(match) for year_pattern in year_patterns):
      filtered_matches.append(match)
  return filtered_matches


Run the pattern on Sasha's CSV to get some metrics. Note that the CSV comes from "real world" data but contains only English text, so its label balance and language are not representative of the real BigScience training set. The code could be more readable here, I know :P
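
As a rough illustration of the metrics step, the sketch below computes micro-averaged precision and recall from per-row lists shaped like `gold_catches` and `regex_catches`. The function `precision_recall` is a hypothetical helper and the rows are invented toy data, not values from the annotated CSV.

```python
def precision_recall(gold_catches, regex_catches):
    """Micro-averaged precision and recall over per-row lists of spans."""
    true_pos = false_pos = false_neg = 0
    for gold, caught in zip(gold_catches, regex_catches):
        gold_set, caught_set = set(gold), set(caught)
        true_pos += len(gold_set & caught_set)
        false_pos += len(caught_set - gold_set)
        false_neg += len(gold_set - caught_set)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy rows, invented for illustration; not taken from the annotated CSV.
gold_catches = [["123456789"], [], ["AB-1234567"]]
regex_catches = [["123456789"], ["2019-2020"], []]

precision, recall = precision_recall(gold_catches, regex_catches)
print(f"precision={precision:.2f} recall={recall:.2f}")
```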

Print the false positives from our regex on Sasha's CSV. Most of the current false positives are phone numbers, dates, URLs, text in languages that omit spaces, file names, and codes that are not PII. Math equations will surely be affected as well, although none appear in the small CSV that Sasha annotated.

In [None]:
false_positives = [list(filter(lambda x: x not in gold_catches[i], regex_catches[i])) for i in range(len(gold_catches))]
print("False positives:")
length = 0
for false_positive_list in false_positives:
  if len(false_positive_list) > 0:
    print()
  for false_positive in false_positive_list:
    length += 1
    print(false_positive)
print("Number of False Positives:", length)

NameError: ignored

Run our regex on a subset of OSCAR and mC4, print what it captures, and print the runtime at the end. It seems to capture any Chinese text with numbers in it, measurements (e.g., 200-400mm), large amounts of money (e.g., $4,000,000), and small sums of money (e.g., fractions of cryptocurrencies).
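
A minimal sketch of such a capture-and-timing run is shown below. It uses the general purpose ID regex from `id_patterns`, but `sample_lines` is a hand-written stand-in: loading the actual OSCAR and mC4 subsets is omitted here.

```python
import re
import time

id_pattern = re.compile(r"\b[A-Za-z]*(?:[-]*\d){6,}\b")

# Hand-written stand-ins for corpus lines; the real run reads OSCAR and mC4.
sample_lines = [
    "Order ID 123456 shipped",
    "Ref AB-123-456 pending",
    "Published 2019-2020",
    "No digits here",
]

start_time = time.time()
captures = [id_pattern.findall(line) for line in sample_lines]
elapsed = time.time() - start_time

for line, found in zip(sample_lines, captures):
    print(repr(line), "->", found)
print(f"Running time {elapsed:.4f}s")
```

The year range in the third line is captured as well, which is the kind of match the year patterns are meant to filter out.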