### BM25-based redaction

We think the current model is good, but it fails on cases where words are redacted roughly in order of how hard BM25 rates them, hardest first. We want a function `redact(text: str, p: float)` that redacts a fraction `p` (between 0 and 1) of the unique words in the text, **in order of importance as measured by BM25**. This notebook is where I'll figure out how to do this!

In [1]:
from rank_bm25 import BM25Okapi

In [2]:
from typing import List

import datasets
import os
import re

from nltk.corpus import stopwords

num_cpus = len(os.sched_getaffinity(0))
eng_stopwords = stopwords.words('english')

words_from_text_re = re.compile(r'\b\w+\b')
def words_from_text(s: str) -> List[str]:
    assert isinstance(s, str)
    return words_from_text_re.findall(s)

def get_words_from_doc(s: str) -> List[str]:
    words = words_from_text(s)
    return [w for w in words]

split = 'test[:100%]'
prof_data = datasets.load_dataset('wiki_bio', split=split, version='1.2.0')

def make_table_str(ex):
    ex['table_str'] = (
        ' '.join(ex['input_text']['table']['column_header'] + ex['input_text']['table']['content'])
    )
    return ex

prof_data = prof_data.map(make_table_str, num_proc=num_cpus)
profile_corpus = prof_data['table_str']
document_corpus = prof_data['target_text']

print("tokenizing corpi")
tokenized_document_corpus = [
    get_words_from_doc(doc) for doc in document_corpus
]
tokenized_profile_corpus = [
    get_words_from_doc(prof) for prof in profile_corpus
]

print("creating search index")

bm25 = BM25Okapi(tokenized_profile_corpus)

Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)


               

tokenizing corpi
creating search index


In [3]:
class JointBM250kapi(BM25Okapi):
    """A BM25Okapi that takes extra documents to calculate idf but only returns scores within the initial set of documents.
    
    This allows us to search only among profiles but use both profiles and documents to calculate inverse document frequency
    of terms. That's especially useful since stopwords mostly just appear in documents (and in a small set of profiles with
    captions) but they don't provide much utility to the search.
    """
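    def __init__(self, corpus, extra_corpus):
        # idf is computed here over profiles + documents; everything below only affects scoring.
        super().__init__(corpus + extra_corpus)
        self.doc_freqs = self.doc_freqs[:len(corpus)] # truncate extra docs
        self.doc_len = self.doc_len[:len(corpus)]
        # get_scores allocates one score per document, so the size must match the truncated lists
        self.corpus_size = len(corpus)
        # average document length over the retained (profile) documents only
        self.avgdl = sum(self.doc_len) / self.corpus_size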
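
To see why the extra documents matter for idf, here is a minimal pure-Python sketch. It assumes the Okapi-style formula `log(N - n + 0.5) - log(n + 0.5)`, which I believe is the core of what `rank_bm25` computes; it leaves out any special handling of negative values the library may apply. The function name `okapi_idf` and the toy corpora are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def okapi_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Okapi-style idf per term: log(N - n + 0.5) - log(n + 0.5), with no floor applied."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain the term at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {
        term: math.log(n_docs - freq + 0.5) - math.log(freq + 0.5)
        for term, freq in doc_freq.items()
    }


# Toy data (hypothetical): stopwords are rare in profiles but common in documents.
toy_profiles = [["name", "randle"], ["name", "smith"], ["the", "name", "jones"]]
toy_documents = [
    ["the", "player", "randle"],
    ["the", "coach", "smith"],
    ["the", "pitcher", "jones"],
]

idf_profiles_only = okapi_idf(toy_profiles)
idf_joint = okapi_idf(toy_profiles + toy_documents)

# With profiles alone, "the" looks as rare as a surname; with documents added it drops below.
print("profiles only:", idf_profiles_only["the"], idf_profiles_only["randle"])
print("joint:        ", idf_joint["the"], idf_joint["randle"])
```

In the toy data, "the" and "randle" each appear in one of three profiles, so they tie on idf when only profiles are counted. Adding the documents pushes "the" well below "randle", which is the effect the joint class is after.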
        

In [4]:
bm25 = JointBM250kapi(tokenized_profile_corpus, tokenized_document_corpus)

In [5]:
sample_doc = document_corpus[0]

In [6]:
sample_doc_words = list(set(words_from_text(sample_doc)))
sample_doc_words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))

for w in sample_doc_words:
    print(w, bm25.idf.get(w, 0.0))

shenoff 11.483575607564859
randle 8.86848538161152
phase 7.927877357931569
tenth 6.643360193899222
secondary 6.638090149727103
leonard 6.035897854707147
senators 5.982533239847946
pick 5.662013688054766
overall 4.400667304180287
round 4.202138668427599
1949 4.017412641221092
baseball 3.9223006894414008
draft 3.892086532265137
washington 3.875748716013044
1970 3.6766192332095002
major 3.509989392146517
lrb 2.687126848657929
rrb 2.687126848657929
12 2.606354664662753
player 2.5143087395018817
former 2.4727241673859144
league 2.43310568243545
first 2.2890035548959276
february 2.183931261454566
june 2.080851818385721
he 0.9555266540517113
was 0.9023216788513828
born 0.7554652625737273
is 0.5936523046588995
a 0.23573687768198504
in 0.2263029407585435
of 0.17974547640818095
the 0.020815920860638215


In [7]:
def fixed_redact_str(text: str, words_to_mask: List[str], mask_token: str = '<mask>') -> str:
    for w in words_to_mask:
        text = re.sub(
            (r'\b{}\b').format(re.escape(w)),
            mask_token, text, count=0
        )
    return text

def redact(document: str, p: float) -> str:
    # Rank the document's unique words by descending idf and mask the top fraction p.
    words = list(set(words_from_text(document)))
    words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))
    n = round(len(words) * p)
    return fixed_redact_str(text=document, words_to_mask=words[:n])


for a in [0.2, 0.4, 0.6, 0.8, 1.0]:
    print(redact(sample_doc, a))
    print('\n')

<mask> <mask> <mask> -lrb- born february 12 , 1949 -rrb- is a former major league baseball player .
he was the first-round pick of the washington <mask> in the <mask> <mask> of the june 1970 major league baseball draft , <mask> overall .



<mask> <mask> <mask> -lrb- born february 12 , <mask> -rrb- is a former major league <mask> player .
he was the first-<mask> <mask> of the washington <mask> in the <mask> <mask> of the june 1970 major league <mask> <mask> , <mask> <mask> .



<mask> <mask> <mask> -<mask>- born february <mask> , <mask> -<mask>- is a former <mask> league <mask> <mask> .
he was the first-<mask> <mask> of the <mask> <mask> in the <mask> <mask> of the june <mask> <mask> league <mask> <mask> , <mask> <mask> .



<mask> <mask> <mask> -<mask>- born <mask> <mask> , <mask> -<mask>- is a <mask> <mask> <mask> <mask> <mask> .
<mask> was the <mask>-<mask> <mask> of the <mask> <mask> in the <mask> <mask> of the <mask> <mask> <mask> <mask> <mask> <mask> , <mask> <mask> .



<mask> <
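
For testing outside the notebook, it may help to have a variant that does not depend on the global `bm25` object. The sketch below takes the idf table as an argument; `redact_with_idf`, `WORD_RE`, and the toy idf values are hypothetical names and numbers for illustration, not part of the code above. It also sorts words alphabetically before ranking so that ties in idf are broken deterministically, which the set-based version does not guarantee.

```python
import re
from typing import Dict

WORD_RE = re.compile(r"\b\w+\b")


def redact_with_idf(
    text: str, p: float, idf: Dict[str, float], mask_token: str = "<mask>"
) -> str:
    """Mask the top fraction `p` of unique words in `text`, ranked by descending idf."""
    # Alphabetical pre-sort plus a stable sort keeps tied words in a fixed order.
    words = sorted(set(WORD_RE.findall(text)))
    words.sort(key=lambda w: -idf.get(w, 0.0))
    n = round(len(words) * p)
    for w in words[:n]:
        # Replace every whole-word occurrence of w.
        text = re.sub(r"\b{}\b".format(re.escape(w)), mask_token, text)
    return text


# Toy idf table (made-up values); words missing from it are treated as idf 0.
toy_idf = {"randle": 8.9, "senators": 6.0, "washington": 3.9, "the": 0.02}
toy_text = "randle played for the washington senators"

print(redact_with_idf(toy_text, 0.5, toy_idf))
```

Because the tokenizer splits on word boundaries, a hyphenated token such as `first-round` is treated as two words, which is why the output above contains strings like `first-<mask>`. One caveat of this approach: if the text itself contains the word `mask`, it can match inside an already inserted `<mask>` token.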

In [8]:
import pickle

# Save the idf table; the context manager makes sure the file handle is closed.
with open('../test_100_idf.p', 'wb') as f:
    pickle.dump(bm25.idf, f)
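
To reuse the saved table elsewhere, it can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in dictionary through a temporary file, so it runs without the real index; the file name `toy_idf.p` and the values are placeholders.

```python
import os
import pickle
import tempfile

toy_idf = {"randle": 8.87, "the": 0.02}  # stand-in for bm25.idf

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_idf.p")  # placeholder file name
    with open(path, "wb") as f:
        pickle.dump(toy_idf, f)
    with open(path, "rb") as f:
        loaded_idf = pickle.load(f)

# The loaded object is a plain dict mapping each word to its idf value.
print(loaded_idf == toy_idf)
```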