### BM25-based redaction

The current model looks good overall, but it fails on cases where words are redacted roughly in order of how difficult BM25 rates them, hardest first. We want a function like `redact(text: str, p: float)` that redacts a fraction `p` of the document's unique words, **in order of importance as measured by BM25**. This notebook is where I'll figure out how to do this!

In [1]:
from rank_bm25 import BM25Okapi

In [40]:
from typing import List

import datasets
import os
import re

from nltk.corpus import stopwords

num_cpus = len(os.sched_getaffinity(0))
eng_stopwords = stopwords.words('english')

words_from_text_re = re.compile(r'\b\w+\b')
def words_from_text(s: str) -> List[str]:
    assert isinstance(s, str)
    return words_from_text_re.findall(s)

def get_words_from_doc(s: str) -> List[str]:
    words = words_from_text(s)
    return [w for w in words]

split = 'train[:100%]'
prof_data = datasets.load_dataset('wiki_bio', split=split, version='1.2.0')

def make_table_str(ex):
    ex['table_str'] = (
        ' '.join(ex['input_text']['table']['column_header'] + ex['input_text']['table']['content'])
    )
    return ex

prof_data = prof_data.map(make_table_str, num_proc=num_cpus)
profile_corpus = prof_data['table_str']
document_corpus = prof_data['target_text']

print("tokenizing corpi")
tokenized_document_corpus = [
    get_words_from_doc(doc) for doc in document_corpus
]
tokenized_profile_corpus = [
    get_words_from_doc(prof) for prof in profile_corpus
]

print("creating search index")

bm25 = BM25Okapi(tokenized_profile_corpus)

Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-67766c1c59349e68.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-2fbc743e56c5d996.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-f6309c3ee659d23f.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-aabbb8a7f29ded89.arrow

tokenizing corpi
creating search index
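
For reference, here is a small sketch of what the `\b\w+\b` tokenizer does to wiki_bio-style text. The example sentence is made up for illustration. Hyphenated words split into separate tokens and escaped brackets like `-lrb-` lose their hyphens, so each piece is scored (and later masked) on its own.

```python
import re
from typing import List

# Same pattern as the notebook's tokenizer.
words_from_text_re = re.compile(r'\b\w+\b')

def words_from_text(s: str) -> List[str]:
    return words_from_text_re.findall(s)

# Made-up sentence in wiki_bio's lowercased, bracket-escaped style.
example = "walter extra is an award-winning pilot -lrb- extra 230 -rrb- ."
tokens = words_from_text(example)
print(tokens)
```

Punctuation-only tokens such as the trailing `.` are dropped entirely, and repeated words (`extra` here) appear once per occurrence.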


In [41]:
class JointBM250kapi(BM25Okapi):
    """A BM25Okapi that takes extra documents to calculate IDF but only returns scores within the initial set of documents.
    
    This allows us to search only among profiles but use both profiles and documents to calculate inverse document frequency
    of terms. That's especially useful since stopwords mostly just appear in documents (and in a small set of profiles with
    captions) but they don't provide much utility to the search.
    """
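    def __init__(self, corpus, extra_corpus):
        # Build IDF statistics over profiles and documents together.
        super().__init__(corpus + extra_corpus)
        # Keep only the profiles as searchable documents.
        self.doc_freqs = self.doc_freqs[:len(corpus)]
        self.doc_len = self.doc_len[:len(corpus)]
        # Scoring sizes its output from corpus_size, so keep it in sync with the truncated lists.
        self.corpus_size = len(corpus)
        # Average length over the profiles only.
        self.avgdl = sum(self.doc_len) / len(self.doc_len)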
        

In [42]:
bm25 = JointBM250kapi(tokenized_profile_corpus, tokenized_document_corpus)
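
To see why the joint IDF helps, here is a minimal pure-Python sketch on toy data. It uses the plain Okapi formula $\log\frac{N - n + 0.5}{n + 0.5}$, where $N$ is the number of documents and $n$ the number containing the word. As far as I understand, `rank_bm25` additionally floors negative values to a small positive number, which this sketch leaves out. The function name `okapi_idf` and the toy corpora are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List

def okapi_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Plain Okapi IDF, log((N - n + 0.5) / (n + 0.5)), without any floor for negative values."""
    n_docs = len(corpus)
    # Count each word once per document it appears in.
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log((n_docs - n + 0.5) / (n + 0.5)) for w, n in doc_freq.items()}

# Toy stand-ins for the profile and document corpora (hypothetical data).
profiles = [
    ["name", "walter", "extra", "caption", "a", "pilot"],  # one profile has a caption with a stopword
    ["name", "aaron", "hohlbein", "position", "defender"],
    ["name", "klaus", "schrodt", "occupation", "pilot"],
]
documents = [
    ["walter", "extra", "is", "a", "german", "pilot"],
    ["aaron", "hohlbein", "is", "a", "defender"],
    ["klaus", "schrodt", "is", "a", "german", "pilot"],
    ["he", "built", "a", "glider"],
]

profile_only = okapi_idf(profiles)
joint = okapi_idf(profiles + documents)

# Profiles alone: the stopword looks exactly as rare as a surname.
print(profile_only["a"], profile_only["walter"])
# Profiles + documents: the stopword drops below the surname.
print(joint["a"], joint["walter"])
```

With the profiles alone, `a` and `walter` each occur in one of three profiles and get the same IDF. Once the documents are added, `a` shows up almost everywhere and its IDF goes negative, while `walter` stays clearly positive.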

In [46]:
sample_doc = document_corpus[0]

In [47]:
sample_doc_words = list(set(words_from_text(sample_doc)))
sample_doc_words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))

for w in sample_doc_words:
    print(w, bm25.idf.get(w, 0.0))

schrodt 12.71573945527333
turboprop 12.71573945527333
flugzeugbau 12.464424168855352
gliders 10.456931240505229
aerobatics 10.267168698094153
aerobatic 10.195707159616845
transports 10.086900008928046
transitioning 9.801784852258027
constructions 9.741612898059184
dominate 9.219135777877232
revolutionized 9.19349077260302
pitts 8.934673127950944
ea 8.883861686314507
unlimited 8.61409891691667
powered 8.413387881015542
designing 7.5358360436621865
klaus 7.507269878817949
flew 7.2518075231360095
extra 7.087800467895267
manufacturer 7.050428327648014
mechanical 6.897196114293471
230 6.6800318925475395
perform 6.551602532194361
aircraft 6.499917681695289
designed 6.30869524878043
construction 6.257956944899406
flight 6.132538415394541
built 6.045147681862671
competitions 5.977600428731317
competing 5.958057390044562
pilot 5.796276403101713
firm 5.7788142006853
trained 5.746664725764301
scene 5.719714456182768
flying 5.567749015118819
designer 5.183122448101406
walter 5.092067109733902
spec

In [48]:
def fixed_redact_str(text: str, words_to_mask: List[str], mask_token: str = '<mask>') -> str:
    for w in words_to_mask:
        text = re.sub(
            (r'\b{}\b').format(re.escape(w)),
            mask_token, text, count=0
        )
    return text

def redact(document: str, p: float) -> str:
    # Rank this document's unique words by IDF, highest (rarest) first.
    words = list(set(words_from_text(document)))
    words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))
    n = round(len(words) * p)
    return fixed_redact_str(text=document, words_to_mask=words[:n])


for a in [0.2, 0.4, 0.6, 0.8, 1.0]:
    print(redact(sample_doc, a))
    print('\n')

walter extra is a german award-winning <mask> pilot , chief aircraft designer and founder of extra <mask> -lrb- extra aircraft construction -rrb- , a manufacturer of <mask> aircraft .
extra was trained as a mechanical engineer .
he began his flight training in <mask> , <mask> to <mask> aircraft to perform <mask> .
he built and flew a <mask> special aircraft and later built his own extra <mask>-230 .
extra began designing aircraft after competing in the 1982 world <mask> championships .
his aircraft <mask> <mask> the <mask> flying scene and still <mask> world competitions .
the german pilot klaus <mask> won his world championship title flying an aircraft made by the extra firm .
walter extra has designed a series of performance aircraft which include <mask> <mask> aircraft and <mask> <mask> .



walter <mask> is a german award-winning <mask> <mask> , chief <mask> designer and founder of <mask> <mask> -lrb- <mask> <mask> <mask> -rrb- , a <mask> of <mask> <mask> .
<mask> was trained as a 
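
One thing to watch: when several words share the same IDF (for example, every unknown word falls back to 0.0), the order coming out of `set` can differ between Python processes, so the cut at `n` may not be reproducible. Below is a self-contained sketch that breaks ties alphabetically. The function name `redact_by_idf` and the toy IDF values are assumptions for illustration, not the real index.

```python
import re
from typing import Dict

def redact_by_idf(text: str, p: float, idf: Dict[str, float], mask_token: str = '<mask>') -> str:
    """Mask the top fraction p of unique words, highest IDF first, ties broken alphabetically."""
    words = sorted(set(re.findall(r'\b\w+\b', text)), key=lambda w: (-idf.get(w, 0.0), w))
    n = round(len(words) * p)
    for w in words[:n]:
        # Whole-word match so that masking "extra" does not touch "extras".
        text = re.sub(r'\b{}\b'.format(re.escape(w)), mask_token, text)
    return text

# Hypothetical IDF values, loosely modeled on the ranking printed earlier.
toy_idf = {'schrodt': 12.7, 'klaus': 7.5, 'pilot': 5.8, 'german': 3.8, 'the': 0.1}
redacted = redact_by_idf('the german pilot klaus schrodt', 0.4, toy_idf)
print(redacted)
```

Note that `round` uses banker's rounding in Python 3, so a product that lands exactly on .5 rounds to the nearest even integer; `math.ceil` or `math.floor` may be a better fit if exact counts matter.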

In [49]:
import pickle

# pickle.dump needs an open binary file object, not a path string.
with open('../train_100_idf.p', 'wb') as f:
    pickle.dump(bm25.idf, f)

{'name': 0.08277135347674047,
 'nationality': 2.2472576970816913,
 'birth_date': 0.2622453114979173,
 'article_title': 0.0,
 'occupation': 1.8904228425976566,
 'walter': 5.092067109733902,
 'extra': 7.087800467895267,
 'german': 3.7877294201681124,
 '1954': 4.04331500452207,
 'aircraft': 6.499917681695289,
 'designer': 5.183122448101406,
 'and': 0.3572338068408065,
 'manufacturer': 7.050428327648014,
 'fullname': 2.2936732678337375,
 'youthclubs': 3.1525499185055352,
 'caps': 2.519556979876123,
 'position': 1.7636365934214382,
 'pcupdate': 3.4414318355134004,
 'years': 1.9012089406907435,
 'clubs': 2.373887676779816,
 'birth_place': 0.43665083849204755,
 'goals': 2.4232169253647484,
 'youthyears': 3.5420064259347583,
 'height': 2.3078774532947293,
 'aaron': 6.125812360747489,
 'hohlbein': 11.953595970673732,
 'wisconsin': 5.238597988064953,
 'badgers': 8.572418495877166,
 '12': 2.5905557882136456,
 '43': 5.173731098439578,
 '10': 2.437820364630962,
 '14': 2.680741721779249,
 'defender'
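
A minimal round trip for saving and reloading an IDF table with `pickle`, as a sketch. It writes to a temporary directory, and both the filename and the values are stand-ins rather than the real `train_100_idf.p` contents.

```python
import os
import pickle
import tempfile

# Stand-in values; the real table maps every corpus word to its IDF.
toy_idf = {'walter': 5.09, 'extra': 7.09, 'and': 0.36}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_idf.p')  # hypothetical filename
    # pickle works on binary file handles, hence 'wb' and 'rb'.
    with open(path, 'wb') as f:
        pickle.dump(toy_idf, f)
    with open(path, 'rb') as f:
        loaded_idf = pickle.load(f)

print(loaded_idf == toy_idf)
```

Loading the table this way means a later `redact` only needs the dictionary and does not have to rebuild the BM25 index.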