<h1> Token Masking and Data Cleaning Functions </h1>

This notebook defines a function that:

1. Masks certain named entities across the corpus.
2. Cleans up odd corpus-specific encoding characters and idiosyncrasies.


<h2> Token Masking </h2>

In the data, LOC and MISC tokens tend to be the most imbalanced. I create a general function that masks named-entity (NE) tokens with their CoNLL-03 tag, then apply it to every text in the dataset to make the chosen replacements.
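
The masking step can be illustrated without loading any models. The sketch below is a minimal, hypothetical helper (`mask_entities` and the hand-tagged `example` are not part of the notebook); it assumes the tagger emits BIOES-style tags such as `B-LOC`, `E-LOC`, and `S-MISC`, and it collapses each entity of a filtered type into a single `<TYPE>` token.

```python
def mask_entities(tagged_tokens, filter_out=("MISC", "LOC")):
    """Collapse each BIOES-tagged entity of a filtered type into one <TYPE> token."""
    masked = []
    for text, tag in tagged_tokens:
        # "O" carries no entity type; other tags look like "B-LOC", "E-MISC", ...
        ent_type = tag[2:] if tag != "O" else None
        if ent_type not in filter_out:
            masked.append(text)             # keep tokens outside the filter
        elif tag[0] in ("S", "E"):
            masked.append(f"<{ent_type}>")  # entity ends here: emit one mask
        # "B-" and "I-" tokens of a masked entity are dropped
    return masked


# Hand-tagged example (assumption: tags follow the BIOES scheme)
example = [
    ("I", "O"), ("flew", "O"), ("from", "O"),
    ("New", "B-LOC"), ("York", "E-LOC"),
    ("to", "O"), ("Paris", "S-LOC"),
    ("with", "O"), ("Anna", "S-PER"), (".", "O"),
]
print(mask_entities(example))
```

Because the mask is emitted only at the `E-` or `S-` token, a multi-token entity such as "New York" becomes one `<LOC>` rather than two, while `PER` tokens pass through unchanged.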

In [1]:
import os
import re
import json
import pandas as pd
import stanza
import spacy 
import importlib
import pickle

from transformers import BertTokenizer

In [2]:
# Data 

project_dir = "/Users/paulp/Library/CloudStorage/OneDrive-UniversityofEasternFinland/UEF/Thesis"
data_dir = os.path.join(project_dir,"Data")
model_dir = os.path.join(project_dir, "Models")

os.chdir(data_dir)

old_dataset = pd.read_csv('compiled_data_set.csv', index_col = 0)

# special tokens
with open('spec_tokens_ne.txt', 'rb') as file:
    spec_tokens = pickle.load(file)
    
# tokenizer and NER parser
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased',
                                               additional_special_tokens = spec_tokens,
                                               unk_token = '[UNK]')

#processors = {'tokenize':'spacy','ner':'conll03'}
#stanza_ner = stanza.Pipeline('en', processors=processors, package='ewt')



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


2022-10-23 11:55:00 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | spacy   |
| mwt       | ewt     |
| pos       | ewt     |
| lemma     | ewt     |
| depparse  | ewt     |
| ner       | conll03 |

2022-10-23 11:55:00 INFO: Use device: cpu
2022-10-23 11:55:00 INFO: Loading: tokenize
2022-10-23 11:55:00 INFO: Loading: mwt
2022-10-23 11:55:00 INFO: Loading: pos
2022-10-23 11:55:00 INFO: Loading: lemma
2022-10-23 11:55:00 INFO: Loading: depparse
2022-10-23 11:55:00 INFO: Loading: ner
2022-10-23 11:55:00 INFO: Done loading processors!


In [278]:
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp_spacy = spacy.load('en_core_web_sm')

# Modify tokenizer infix patterns so hyphenated words stay in one piece.
# With the default infixes, "mother-in-law" is split into
# ['mother', '-', 'in', '-', 'law'].
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        #  Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp_spacy.tokenizer.infix_finditer = infix_re.finditer

# Check the modified tokenizer (the original cell called an undefined `nlp`)
doc = nlp_spacy("mother-in-law")
print([t.text for t in doc])


['mother-in-law']
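
To see why dropping the hyphen rule keeps `mother-in-law` in one piece, the effect of an infix pattern can be approximated with plain `re`. This is a simplified sketch, not spaCy's tokenizer: `split_on_infixes` is a hypothetical helper, and `HYPHEN_INFIX` is an ASCII-only stand-in for the commented-out rule (the real rule uses spaCy's `ALPHA` and `HYPHENS` character classes).

```python
import re

# Assumption: ASCII-only approximation of the hyphen-between-letters infix rule.
HYPHEN_INFIX = re.compile(r"(?<=[A-Za-z])-(?=[A-Za-z])")
# Numeric operator rule, copied from the infix list in the cell above.
NUMERIC_INFIX = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")


def split_on_infixes(token, patterns):
    """Split a token at every infix match, keeping the infix as its own piece."""
    cuts = sorted(m.span() for p in patterns for m in p.finditer(token))
    pieces, pos = [], 0
    for start, end in cuts:
        if start > pos:
            pieces.append(token[pos:start])
        pieces.append(token[start:end])
        pos = end
    if pos < len(token):
        pieces.append(token[pos:])
    return pieces


# With the hyphen rule active, the word is split at each hyphen.
print(split_on_infixes("mother-in-law", [HYPHEN_INFIX, NUMERIC_INFIX]))
# Without it, no pattern matches and the token is returned whole.
print(split_on_infixes("mother-in-law", [NUMERIC_INFIX]))
# The numeric rule still splits operators between digits.
print(split_on_infixes("3-4", [NUMERIC_INFIX]))
```

The lookbehind and lookahead make each rule fire only in a specific context, which is why removing one rule leaves the others unaffected.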


In [332]:
nlp_stanza = stanza.Pipeline(lang='en', processors={'tokenize':'spacy','ner':'conll03'}, tokenize_pretokenized=True)

2022-10-23 18:22:57 INFO: Loading these models for language: en (English):
| Processor    | Package  |
---------------------------
| tokenize     | spacy    |
| pos          | combined |
| lemma        | combined |
| depparse     | combined |
| sentiment    | sstplus  |
| constituency | wsj      |
| ner          | conll03  |

2022-10-23 18:22:57 INFO: Use device: cpu
2022-10-23 18:22:57 INFO: Loading: tokenize
2022-10-23 18:22:57 INFO: Loading: pos
2022-10-23 18:22:57 INFO: Loading: lemma
2022-10-23 18:22:57 INFO: Loading: depparse
2022-10-23 18:22:57 INFO: Loading: sentiment
2022-10-23 18:22:57 INFO: Loading: constituency
2022-10-23 18:22:58 INFO: Loading: ner
2022-10-23 18:22:58 INFO: Done loading processors!


In [427]:
def ne_replace(text, filter_out=('MISC', 'LOC')):
    """Mask named entities whose type is in filter_out, then clean up the text."""

    # spaCy does sentence splitting and tokenization; Stanza gets pretokenized input
    doc = nlp_spacy(text)
    sentences = [[tok.text for tok in sent] for sent in doc.sents]

    parsed = nlp_stanza.process(sentences).to_dict()

    new_tokens = []
    for sent in parsed:
        for tok in sent:
            tag = tok['ner']  # 'O' or a prefixed tag such as 'B-LOC', 'E-LOC', 'S-MISC'
            if tag == 'O' or tag[2:] not in filter_out:
                new_tokens.append(tok['text'])
            elif tag[0] in ('S', 'E'):
                # one mask per entity, emitted at its last (or only) token
                new_tokens.append('<' + tag[2:] + '>')
            # 'B' and 'I' tokens of a masked entity are dropped

    # a cleaner and faster way to do this would be to put the replacements in a json
    # and compile and apply them in one function
    t = bert_tokenizer.convert_tokens_to_string(new_tokens)
    t = re.sub(r'< ([*R?]) >', r'<\g<1>>', t)          # '< R >' -> '<R>'
    t = re.sub(r' ([.?!,:;])', r'\g<1>', t)            # no space before punctuation
    t = re.sub(r'(\() ', r'\g<1>', t)                  # no space inside parentheses
    t = re.sub(r' (\))', r'\g<1>', t)
    t = re.sub(r' & quot;', ' "', t)                   # leftover HTML quote entities
    t = re.sub(r'&quot;([., ?!])', r'"\g<1>', t)
    t = re.sub('\xa0', '', t)                          # non-breaking spaces
    t = re.sub(r' quot;', ' "', t)
    t = re.sub(r'quot; ', '" ', t)
    t = re.sub(r'quot;[.,!?]', '" ', t)                # note: also drops the punctuation mark
    t = re.sub(r" 'm ", "'m ", t)                      # reattach clitics
    t = re.sub(r" n't ", "n't ", t)
    t = re.sub(r" 've ", "'ve ", t)
    t = re.sub(r" 's ", "'s ", t)
    t = re.sub(r" 'll ", "'ll ", t)
    t = re.sub(r" 'd ", "'d ", t)
    t = re.sub(r" {2,}", " ", t)                       # collapse repeated spaces
    t = re.sub(r" s' ", "s' ", t)
    t = re.sub(r" n't([ .,:;])", r"n't\g<1>", t)
    t = re.sub(r"%%", "", t)
    t = re.sub(r'\n+\t+', '\n\n', t)                   # normalize paragraph breaks
    t = re.sub(r'\n+', '\n\n', t)
    t = re.sub(r'([.!?])([A-Za-z])', r'\g<1> \g<2>', t)  # space after sentence-final marks
    t = re.sub('\x81@', ' ', t)                        # encoding debris
    t = re.sub(r' amp; ', ' and ', t)
    # space after , and : as well; the original replaced the mark with a literal '\.'
    t = re.sub(r'([.,:])([A-Za-z])', r'\g<1> \g<2>', t)
    t = re.sub(r'([.!?]){4,}', r'\g<1>', t)            # collapse long punctuation runs
    t = re.sub(r"''", "'", t)
    t = t.strip()

    return t
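
The comment inside `ne_replace` suggests keeping the replacements in a JSON table and compiling them once. A minimal sketch of that idea follows; `RULES_JSON`, `compile_rules`, and `apply_rules` are hypothetical names, and the table holds only a small subset of the rules above (with the space-collapsing rule moved first), so it is an illustration rather than a drop-in replacement.

```python
import json
import re

# Hypothetical rule table; in practice this could be stored in a .json file.
# Each entry is a regex pattern and its replacement, applied in order.
RULES_JSON = r"""
[
  {"pattern": " {2,}",       "repl": " "},
  {"pattern": " ([.?!,:;])", "repl": "\\1"},
  {"pattern": " n't ",       "repl": "n't "},
  {"pattern": "''",          "repl": "'"}
]
"""


def compile_rules(rules_json):
    """Parse the JSON table and compile every pattern once."""
    return [(re.compile(rule["pattern"]), rule["repl"])
            for rule in json.loads(rules_json)]


def apply_rules(text, rules):
    """Apply the compiled rules to the text in table order."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text


RULES = compile_rules(RULES_JSON)
print(apply_rules("I do n't know , really  .", RULES))
```

Since the rules run in sequence, their order matters: collapsing repeated spaces first lets the punctuation rule see at most one space before each mark.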


In [443]:
# This is slow: every text is run through both the spaCy and Stanza pipelines.
masked_dataset = old_dataset.copy()  # copy, so old_dataset is not modified in place
masked_dataset['Text'] = old_dataset['Text'].apply(ne_replace)

In [444]:
masked_dataset.to_csv('masked_data_set.csv')

<h1> References </h1>

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.

spaCy: Industrial-strength Natural Language Processing in Python. (2020). https://spacy.io/

