### BM25-based redaction

We think the current model is good, but it fails on cases where words are redacted roughly in order of how difficult BM25 rates them, hardest first. We want to develop a function that looks like `redact(text: str, p: float)`, where a fraction `p` of the unique words (`p` between 0 and 1) is redacted, **in order of importance as measured by BM25**. This notebook is where I'll figure out how to do this!

In [1]:
from rank_bm25 import BM25Okapi

In [3]:
from typing import List

import datasets
import os
import re

from nltk.corpus import stopwords

num_cpus = len(os.sched_getaffinity(0))
eng_stopwords = stopwords.words('english')

words_from_text_re = re.compile(r'\b\w+\b')
def words_from_text(s: str) -> List[str]:
    assert isinstance(s, str)
    return words_from_text_re.findall(s)

def get_words_from_doc(s: str) -> List[str]:
    words = words_from_text(s)
    return [w for w in words]

profile_corpus = []
document_corpus = []
for split in ['test[:100%]', 'train[:100%]', 'val[:100%]']:
    prof_data = datasets.load_dataset('wiki_bio', split=split, version='1.2.0')

    def make_table_str(ex):
        ex['table_str'] = (
            ' '.join(ex['input_text']['table']['column_header'] + ex['input_text']['table']['content'])
        )
        return ex

    prof_data = prof_data.map(make_table_str, num_proc=num_cpus)
    profile_corpus.extend(prof_data['table_str'])
    document_corpus.extend(prof_data['target_text'])

    print("tokenizing corpi")
    tokenized_document_corpus = [
        get_words_from_doc(doc) for doc in document_corpus
    ]
    tokenized_profile_corpus = [
        get_words_from_doc(prof) for prof in profile_corpus
    ]

    print("creating search index")

bm25 = BM25Okapi(tokenized_profile_corpus)

Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)


 

tokenizing corpi
creating search index


tokenizing corpi
creating search index


tokenizing corpi
creating search index


In [4]:
class JointBM250kapi(BM25Okapi):
    """A BM25Okapi that takes extra documents to calculate idf but only returns scores within the initial set of documents.

    This allows us to search only among profiles but use both profiles and documents to calculate the inverse document
    frequency of terms. That's especially useful since stopwords mostly just appear in documents (and in a small set of
    profiles with captions) but they don't provide much utility to the search.
    """
    def __init__(self, corpus, extra_corpus):
        super().__init__(corpus + extra_corpus)  # idf is computed over both corpora
        self.doc_freqs = self.doc_freqs[:len(corpus)]  # truncate extra docs
        self.doc_len = self.doc_len[:len(corpus)]
        # Average length of the searchable documents only: total tokens / number of docs.
        # Note that self.corpus_size still counts both corpora.
        self.avgdl = sum(self.doc_len) / len(corpus)
        

In [5]:
bm25 = JointBM250kapi(tokenized_profile_corpus, tokenized_document_corpus)

In [6]:
bm25.corpus_size # profiles + documents over the test, train, and val splits (2 x 728,321)

1456642

In [7]:
sample_doc = document_corpus[0]

In [8]:
sample_doc_words = list(set(words_from_text(sample_doc)))
sample_doc_words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))

for w in sample_doc_words:
    print(w, bm25.idf.get(w, 0.0))

shenoff 13.78617889325838
randle 9.611725716162116
phase 7.778587566794514
tenth 6.687385310867826
secondary 6.560984605725483
senators 6.057335346802146
leonard 6.022866943367083
pick 5.661127969026836
overall 4.391287681455781
round 4.175715349234315
1949 4.049596401480462
baseball 3.948515277335053
draft 3.891657790050534
washington 3.86770107843569
1970 3.686959126867915
major 3.5200195552502986
lrb 3.2614305151562943
rrb 3.2614305151562943
12 2.5931347491133714
player 2.5031259505633887
former 2.4725950970549793
league 2.438119267765881
first 2.290916672769617
february 2.190346388838188
june 2.080023619381432
he 0.9513560430399224
was 0.8967680388057797
born 0.7574916085316836
is 0.5922542167573592
a 0.23333775024460834
in 0.2231668755220113
of 0.1789217148342992
the 0.02632505999650725
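
The ranking above depends on computing idf over profiles and documents together. As a minimal sketch of why, the block below computes the Okapi-style idf, `log(N - n + 0.5) - log(n + 0.5)`, by hand on a tiny made-up corpus, once over profiles only and once over the joint corpus. The helper `okapi_idf` and the toy data are hypothetical and only for illustration. If I remember the `rank_bm25` source correctly, `BM25Okapi` additionally replaces negative values with a small positive floor; this sketch leaves them negative.

```python
import math
from collections import Counter
from typing import Dict, List


def okapi_idf(corpus: List[List[str]]) -> Dict[str, float]:
    """Per-term idf: log(N - n + 0.5) - log(n + 0.5), where n = number of docs containing the term."""
    n_docs = len(corpus)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {
        term: math.log(n_docs - n + 0.5) - math.log(n + 0.5)
        for term, n in doc_freq.items()
    }


# Toy data (made up): "the" is rare in profiles but appears in every document.
toy_profiles = [["name", "randle"], ["name", "smith"], ["name", "jones", "the"]]
toy_documents = [
    ["randle", "is", "the", "player"],
    ["smith", "is", "the", "coach"],
    ["jones", "was", "the", "pitcher"],
]

profile_only = okapi_idf(toy_profiles)
joint = okapi_idf(toy_profiles + toy_documents)

# Over profiles alone, the stopword looks exactly as rare as a surname.
print(profile_only["the"], profile_only["randle"])
# Over the joint corpus, the stopword drops below the surname.
print(joint["the"], joint["randle"])
```

With profiles only, `the` and `randle` each occur in one of three profiles and get the same idf. Adding the documents separates them, which is the effect the joint index is meant to have.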


In [9]:
def fixed_redact_str(text: str, words_to_mask: List[str], mask_token: str = '<mask>') -> str:
    for w in words_to_mask:
        text = re.sub(
            (r'\b{}\b').format(re.escape(w)),
            mask_token, text, count=0
        )
    return text

def redact(document: str, p: float) -> str:
    # Rank the unique words of *this* document by idf, highest first.
    words = list(set(words_from_text(document)))
    words.sort(key=lambda w: (-bm25.idf.get(w, 0.0)))
    # Mask the top fraction p of the unique words.
    n = round(len(words) * p)
    return fixed_redact_str(text=document, words_to_mask=words[:n])


for a in [0.2, 0.4, 0.6, 0.8, 1.0]:
    print(redact(sample_doc, a))
    print('\n')

<mask> <mask> <mask> -lrb- born february 12 , 1949 -rrb- is a former major league baseball player .
he was the first-round pick of the washington <mask> in the <mask> <mask> of the june 1970 major league baseball draft , <mask> overall .



<mask> <mask> <mask> -lrb- born february 12 , <mask> -rrb- is a former major league <mask> player .
he was the first-<mask> <mask> of the washington <mask> in the <mask> <mask> of the june 1970 major league <mask> <mask> , <mask> <mask> .



<mask> <mask> <mask> -<mask>- born february <mask> , <mask> -<mask>- is a former <mask> league <mask> <mask> .
he was the first-<mask> <mask> of the <mask> <mask> in the <mask> <mask> of the june <mask> <mask> league <mask> <mask> , <mask> <mask> .



<mask> <mask> <mask> -<mask>- born <mask> <mask> , <mask> -<mask>- is a <mask> <mask> <mask> <mask> <mask> .
<mask> was the <mask>-<mask> <mask> of the <mask> <mask> in the <mask> <mask> of the <mask> <mask> <mask> <mask> <mask> <mask> , <mask> <mask> .



<mask> <
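
To make the selection rule easier to check by hand, here is a small self-contained sketch of the same idea that does not need the search index. The function `toy_redact`, the sentence, and the idf table are hypothetical; the idf values are only loosely rounded from the output above.

```python
import re
from typing import Dict


def toy_redact(text: str, idf: Dict[str, float], p: float, mask_token: str = "<mask>") -> str:
    """Mask the top fraction p of unique words, ranked by idf (highest first)."""
    words = sorted(set(re.findall(r"\b\w+\b", text)), key=lambda w: -idf.get(w, 0.0))
    n = round(len(words) * p)
    for w in words[:n]:
        # Replace every whole-word occurrence of w.
        text = re.sub(r"\b{}\b".format(re.escape(w)), mask_token, text)
    return text


# Illustrative idf table; words missing from it default to 0.0.
toy_idf = {"randle": 9.6, "senators": 6.1, "washington": 3.9, "the": 0.03}
toy_text = "randle played for the washington senators"

print(toy_redact(toy_text, toy_idf, 0.5))
# <mask> played for the <mask> <mask>
```

Note that `p` is applied to the number of unique words, so a word that repeats many times is masked everywhere but counts only once toward the budget.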

In [10]:
import pickle

# Use a context manager so the file is flushed and closed after writing.
with open('../test_val_train_100_idf.p', 'wb') as f:
    pickle.dump(bm25.idf, f)

In [11]:
date_idf = bm25.idf.copy()
# Date-modifications for idf.

for year in range(1000, 2022):
    if str(year) in date_idf:
        date_idf[str(year)] = max(
            date_idf[str(year)], 10
        )

for day in range(1, 31+1):
    assert str(day) in date_idf
    date_idf[str(day)] = 10
            

for month in ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']:
    assert month in date_idf
    date_idf[str(month)] = 10
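
The cell above treats years with `max(..., 10)` but overwrites days and months with exactly 10. As one possible variation, the sketch below applies the same floor to all date terms and skips terms missing from the vocabulary instead of asserting. The helper `raise_idf_floor` and the toy table are hypothetical, not part of the notebook's pipeline.

```python
from typing import Dict, Iterable

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def raise_idf_floor(idf: Dict[str, float], terms: Iterable[str], floor: float = 10.0) -> Dict[str, float]:
    """Return a copy of idf where every listed term that is present is raised to at least `floor`."""
    out = dict(idf)
    for term in terms:
        if term in out:
            out[term] = max(out[term], floor)
    return out


# Same date vocabulary as the cell above: years, days of the month, month names.
date_terms = (
    [str(year) for year in range(1000, 2022)]
    + [str(day) for day in range(1, 31 + 1)]
    + MONTHS
)

# Toy idf table with values rounded from the earlier output.
toy_idf = {"1949": 4.05, "12": 2.59, "february": 2.19, "shenoff": 13.79}
toy_date_idf = raise_idf_floor(toy_idf, date_terms)
print(toy_date_idf)
```

Using `max` everywhere means a date term that already scores above the floor keeps its higher value; whether that is preferable to a hard overwrite is a judgment call.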

In [12]:
# Use a context manager so the file is flushed and closed after writing.
with open('../test_val_train_100_idf_dates.p', 'wb') as f:
    pickle.dump(date_idf, f)
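
A downstream script would presumably load one of these tables and look words up with a default of 0.0, as the cells above do. The round trip below is a minimal sketch under that assumption; it writes a toy table to a temporary directory, and the file name `toy_idf.p` is hypothetical.

```python
import os
import pickle
import tempfile

toy_idf = {"randle": 9.61, "the": 0.03}  # stand-in for bm25.idf

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_idf.p")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy_idf, f)
    with open(path, "rb") as f:
        loaded_idf = pickle.load(f)

# Words that were never seen fall back to an idf of 0.0.
print(loaded_idf.get("randle", 0.0))
print(loaded_idf.get("unseen", 0.0))
```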