### BM25-based redaction

We think the current model is good, but it fails on cases where the hardest words are redacted, roughly in order of the difficulty BM25 assigns them. We want a function `redact(text: str, p: float)` that redacts a fraction `p` of the unique words in the text, **in order of importance as measured by BM25**. This notebook is where I'll figure out how to do this!

In [1]:
from rank_bm25 import BM25Okapi

In [2]:
from typing import List

import datasets
import os
import re

from nltk.corpus import stopwords

num_cpus = len(os.sched_getaffinity(0))
eng_stopwords = stopwords.words('english')

words_from_text_re = re.compile(r'\b\w+\b')
def words_from_text(s: str) -> List[str]:
    assert isinstance(s, str)
    return words_from_text_re.findall(s)

def get_words_from_doc(s: str) -> List[str]:
    words = words_from_text(s)
    return [w for w in words]

split = 'val[:100%]'
prof_data = datasets.load_dataset('wiki_bio', split=split, version='1.2.0')

def make_table_str(ex):
    ex['table_str'] = (
        ' '.join(ex['input_text']['table']['column_header'] + ex['input_text']['table']['content'])
    )
    return ex

prof_data = prof_data.map(make_table_str, num_proc=num_cpus)
profile_corpus = prof_data['table_str']
document_corpus = prof_data['target_text']

print("tokenizing corpi")
tokenized_document_corpus = [
    get_words_from_doc(doc) for doc in document_corpus
]
tokenized_profile_corpus = [
    get_words_from_doc(prof) for prof in profile_corpus
]

print("creating search index")

bm25 = BM25Okapi(tokenized_profile_corpus)

Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)


        


tokenizing corpi
creating search index
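
As a quick sanity check on the tokenizer, the sketch below re-declares the same `\b\w+\b` pattern and runs it on one phrase. The sample string is made up, modeled on the style of the wiki_bio text, and `tokenize` is just a stand-in name for `words_from_text`. It suggests that punctuation and the `--` separator are dropped, while bracket tokens such as `-lrb-` survive as plain `lrb` and so take part in the idf ranking like any other word.

```python
import re
from typing import List

# Same pattern as words_from_text_re above, re-declared so this runs on its own.
words_re = re.compile(r'\b\w+\b')

def tokenize(s: str) -> List[str]:
    """Return runs of word characters; punctuation and whitespace are dropped."""
    return words_re.findall(s)

# Hypothetical sample in the style of the wiki_bio target text.
sample = "see of st. mark -lrb- 880 -- 907 -rrb- ."
tokens = tokenize(sample)
print(tokens)
```

The pattern is case-sensitive and does no stopword filtering, so `eng_stopwords` is not applied at this stage.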


In [3]:
class JointBM250kapi(BM25Okapi):
    """A BM25Okapi that takes extra documents to calculate idf but only returns scores within the initial set of documents.
    
    This allows us to search only among profiles but use both profiles and documents to calculate inverse document frequency
    of terms. That's especially useful since stopwords mostly just appear in documents (and in a small set of profiles with
    captions) but they don't provide much utility to the search.
    """
    def __init__(self, corpus, extra_corpus):
        super().__init__(corpus + extra_corpus)
        self.doc_freqs = self.doc_freqs[:len(corpus)] # truncate extra docs
        self.doc_len = self.doc_len[:len(corpus)]
        # Scoring should only see the original corpus, so reset the corpus size and
        # recompute the average document length over the truncated lists.
        self.corpus_size = len(corpus)
        self.avgdl = sum(self.doc_len) / len(corpus)
        

In [4]:
bm25 = JointBM250kapi(tokenized_profile_corpus, tokenized_document_corpus)
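
To illustrate why pooling documents with profiles matters for the idf table, here is a minimal standalone sketch on toy data. It assumes the usual Okapi-style formula `log((N - n + 0.5) / (n + 0.5))` and deliberately skips the flooring of negative values that `rank_bm25` applies, so it is not a drop-in replacement for the library. The helper `okapi_idf` and the toy corpora are hypothetical.

```python
import math
from typing import Dict, List

def okapi_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Okapi-style idf: log((N - n + 0.5) / (n + 0.5)) for each word.

    Assumption: negative values are left as-is (no epsilon floor),
    so very common words can come out below zero here.
    """
    n_docs = len(corpus)
    doc_counts: Dict[str, int] = {}
    for doc in corpus:
        for word in set(doc):  # count each word once per document
            doc_counts[word] = doc_counts.get(word, 0) + 1
    return {
        w: math.log((n_docs - n + 0.5) / (n + 0.5))
        for w, n in doc_counts.items()
    }

# Toy data: profiles are terse tables, documents are prose.
# One profile has a caption, so "the" shows up in exactly one profile.
profiles = [
    ["name", "michael", "title", "pope", "caption", "the", "pope"],
    ["name", "ahmad", "title", "governor"],
    ["name", "mark", "title", "saint"],
]
documents = [
    ["michael", "was", "the", "pope", "of", "alexandria"],
    ["ahmad", "was", "the", "governor", "of", "egypt"],
    ["mark", "was", "the", "saint", "of", "cairo"],
]

profile_only = okapi_idf(profiles)
pooled = okapi_idf(profiles + documents)

# Profiles alone: "the" looks exactly as rare as a name.
print(profile_only["the"], profile_only["michael"])
# Pooled: "the" appears in 4 of 6 texts and drops below "michael".
print(pooled["the"], pooled["michael"])
```

With profiles alone, a stopword that happens to sit in one caption would be ranked for redaction alongside names; pooling pushes it toward the bottom of the list.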

In [5]:
sample_doc = document_corpus[0]

In [6]:
sample_doc_words = list(set(words_from_text(sample_doc)))
sample_doc_words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))

for w in sample_doc_words:
    print(w, bm25.idf.get(w, 0.0))

geniza 11.483575607564859
tulun 11.483575607564859
khail 10.972743118543239
907 9.537603669512086
880 9.148104573879777
882 8.82085046647102
forcing 8.438847191667527
coptic 8.071026273872993
sell 7.785992219890188
attached 7.753455292062404
properties 7.551234949833269
patriarch 6.964482690863175
ibn 6.942878504039736
pay 6.874054188367566
cairo 6.731681598654302
alexandria 6.62244420915696
ahmad 6.388155762286122
believed 6.3719007184695124
building 6.001865760200759
site 5.999081098466496
pope 5.749224267940596
egypt 5.747057552834824
contributions 5.674019271720173
heavy 5.665999665862906
forced 5.634545049229249
jewish 5.517714854325458
community 5.055610386969554
iii 4.893632987278909
local 4.871558293919304
see 4.8209773165335115
become 4.766211611337895
mark 4.538556593330421
some 4.446395002447301
governor 4.191793011750258
church 4.147825009136752
michael 3.8709312306720367
have 3.7855983297728404
him 3.644602442536783
this 3.6254145725852
st 3.2776025615567406
later 3.257408

In [7]:
def fixed_redact_str(text: str, words_to_mask: List[str], mask_token: str = '<mask>') -> str:
    for w in words_to_mask:
        text = re.sub(
            (r'\b{}\b').format(re.escape(w)),
            mask_token, text, count=0
        )
    return text

def redact(document: str, p: float) -> str:
    # Rank the document's unique words by idf (highest first) and mask the top fraction p.
    words = list(set(words_from_text(document)))
    words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))
    n = round(len(words) * p)
    return fixed_redact_str(text=document, words_to_mask=words[:n])


for a in [0.2, 0.4, 0.6, 0.8, 1.0]:
    print(redact(sample_doc, a))
    print('\n')

pope michael iii of alexandria -lrb- also known as <mask> iii -rrb- was the <mask> pope of alexandria and patriarch of the see of st. mark -lrb- <mask> -- <mask> -rrb- .
in <mask> , the governor of egypt , ahmad ibn <mask> , forced <mask> to pay heavy contributions , <mask> him to <mask> a church and some <mask> <mask> to the local jewish community .
this building was at one time believed to have later become the site of the cairo <mask> .



<mask> michael iii of <mask> -lrb- also known as <mask> iii -rrb- was the <mask> <mask> of <mask> and <mask> of the see of st. mark -lrb- <mask> -- <mask> -rrb- .
in <mask> , the governor of <mask> , <mask> <mask> <mask> , forced <mask> to <mask> heavy contributions , <mask> him to <mask> a church and some <mask> <mask> to the local jewish community .
this <mask> was at one time <mask> to have later become the <mask> of the <mask> <mask> .



<mask> michael <mask> of <mask> -lrb- also known as <mask> <mask> -rrb- was the <mask> <mask> of <mask> an

In [9]:
import pickle

# Save only the idf table; a context manager makes sure the file is closed.
with open('../val_100_idf.p', 'wb') as f:
    pickle.dump(bm25.idf, f)
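
Since only the idf dictionary is saved, redaction can later run without rebuilding the BM25 index. The sketch below shows one way that might look. It uses a temporary file and made-up idf values in place of `'../val_100_idf.p'`, and `redact_with_idf` is a hypothetical helper, not part of this notebook. Unlike `redact` above, it breaks idf ties alphabetically so the result does not depend on set ordering.

```python
import os
import pickle
import re
import tempfile
from typing import Dict, List

def redact_with_idf(text: str, p: float, idf: Dict[str, float],
                    mask_token: str = '<mask>') -> str:
    """Mask the top `p` fraction of unique words in `text`, ranked by idf."""
    words: List[str] = sorted(
        set(re.findall(r'\b\w+\b', text)),
        key=lambda w: (-idf.get(w, 0.0), w),  # highest idf first, ties by word
    )
    n = round(len(words) * p)
    for w in words[:n]:
        text = re.sub(r'\b{}\b'.format(re.escape(w)), mask_token, text)
    return text

# Hypothetical idf values standing in for the real saved table.
toy_idf = {'geniza': 11.5, 'cairo': 6.7, 'site': 6.0, 'the': 0.1, 'of': 0.1}

# Round-trip through a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_idf.p')
    with open(path, 'wb') as f:
        pickle.dump(toy_idf, f)
    with open(path, 'rb') as f:
        loaded_idf = pickle.load(f)

result = redact_with_idf('the site of the cairo geniza', p=0.4, idf=loaded_idf)
print(result)
```

Words missing from the table default to an idf of 0.0, the same fallback `redact` uses, so unseen words are masked last.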