### Collecting intermediate results (p2)

After collecting results in the previous notebook, I realized I had made a grave error: documents longer than 128 tokens were not truncated, so BM25 could obviously figure out who lots of people are by looking at words past the first 128 tokens.
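
To make the leak concrete, here is a toy sketch. It assumes whitespace-separated tokens rather than the RoBERTa subword tokens used later in this notebook, and both the `truncate_words` helper and the example document are made up for illustration.

```python
def truncate_words(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace tokens (toy stand-in for subword truncation)."""
    return ' '.join(text.split()[:max_tokens])

# Hypothetical redacted document: the name is masked at the start, but an
# identifying club name survives far past the truncation point.
doc = '<mask> <mask> is a footballer . ' + 'filler ' * 10 + 'he plays for lyngby boldklub'

truncated = truncate_words(doc, max_tokens=8)

# Words that a retrieval model can only see if the document is NOT truncated.
leaked = set(doc.split()) - set(truncated.split())
print(sorted(leaked))
```

Any word in `leaked` is visible to the retriever only when truncation is skipped, which is the failure described above.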

In [1]:
import sys
sys.path.append('/home/jxm3/research/deidentification/unsupervised-deidentification')

from dataloader import WikipediaDataModule

import os

num_cpus = os.cpu_count()
dm = WikipediaDataModule(
    document_model_name_or_path="roberta-base",
    profile_model_name_or_path="google/tapas-base",
    max_seq_length=128,
    dataset_name='wiki_bio',
    dataset_train_split='train[:1]', # not used
    dataset_val_split='val[:1000]',
    dataset_version='1.2.0',
    word_dropout_ratio=0.0,
    word_dropout_perc=0.0,
    num_workers=1,
    train_batch_size=64,
    eval_batch_size=64
)
dm.setup("fit")

Initializing WikipediaDataModule with num_workers = 1 and mask token `<mask>`
loading wiki_bio[1.2.0] split train[:1]


Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)


loading wiki_bio[1.2.0] split val[:1000]


Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-b98e3ae8bfedd5c6.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-03afde4145425441.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-9e1e8ccd5263c786.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-35dcb5acc2cbcaaa.arrow


In [2]:
dm.val_dataset[0]

{'input_text': {'table': {'column_header': ['successor',
    'name',
    'residence',
    'ended',
    'feast_day',
    'title',
    'enthroned',
    'predecessor',
    'death_date',
    'buried',
    'birth_place',
    'nationality',
    'religion',
    'article_title',
    'type'],
   'row_number': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
   'content': ['gabriel i',
    'michael iii of alexandria',
    "saint mark 's church",
    '16 march 907',
    '16 -rrb- march -lrb- 20 baramhat in the coptic calendar',
    '56th of st. mark pope of alexandria & patriarch of the see',
    '25 april 880',
    'shenouda i',
    '16 march 907',
    'monastery of saint macarius the great',
    'egypt',
    'egyptian',
    'coptic orthodox christian',
    'pope michael iii of alexandria\n',
    'pope']},
  'context': 'pope michael iii of alexandria\n'},
 'target_text': 'pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see o

In [3]:
import pandas as pd

### Load pre-generated redacted data from various models.

Models are explained in `../model_cfg.py`.

In [4]:
import glob
import re


adv_df = None
for model_name in ['model_4', 'model_5', 'model_6']:
    csv_filenames = glob.glob(f'../adv_csvs/{model_name}/*0.csv')
    print(model_name, csv_filenames)
    for filename in csv_filenames:
        df = pd.read_csv(filename)
        k = re.search(r'_(\d+)_\d+\.csv$', filename).group(1)
        df['model_name'] = model_name + '__k' + str(k)
        df['i'] = df.index
        mini_df = df[['perturbed_text', 'model_name', 'i']]
        if adv_df is None:
            adv_df = mini_df
        else:
            adv_df = pd.concat((adv_df, mini_df), axis=0)

model_4 ['../adv_csvs/model_4/results_1_100.csv', '../adv_csvs/model_4/results_10_100.csv', '../adv_csvs/model_4/results_1000_100.csv']
model_5 ['../adv_csvs/model_5/results_1_100.csv', '../adv_csvs/model_5/results_10_100.csv', '../adv_csvs/model_5/results_1000_100.csv']
model_6 ['../adv_csvs/model_6/results_1_100.csv', '../adv_csvs/model_6/results_10_100.csv', '../adv_csvs/model_6/results_1000_100.csv']
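
As a quick check on the filename parsing, the sketch below extracts `k` from the three `model_4` paths printed above. The `results_<k>_<n>.csv` pattern is inferred from those paths only; what the second number means is not stated in this notebook, and `parse_k` is a hypothetical helper.

```python
import re

# Pattern assumed from the filenames printed above: results_<k>_<n>.csv
FILENAME_RE = re.compile(r'results_(\d+)_(\d+)\.csv$')

def parse_k(filename: str) -> int:
    """Return the first number in the filename as an int, or raise if the name is unexpected."""
    match = FILENAME_RE.search(filename)
    if match is None:
        raise ValueError(f'unexpected filename: {filename}')
    return int(match.group(1))

filenames = [
    '../adv_csvs/model_4/results_1_100.csv',
    '../adv_csvs/model_4/results_10_100.csv',
    '../adv_csvs/model_4/results_1000_100.csv',
]
ks = [parse_k(f) for f in filenames]
print(ks)  # [1, 10, 1000]
```

Raising on a non-matching name is a design choice of this sketch; it avoids the `AttributeError` that `re.search(...).group(1)` gives when the pattern does not match.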


In [5]:
adv_df.head()

Unnamed: 0,perturbed_text,model_name,i
0,pope <mask> <mask> <mask> alexandria (also kno...,model_4__k1,0
1,<mask> <mask> is a male former <mask> <mask> p...,model_4__k1,1
2,<mask> <mask> (born 30 <mask> <mask>) is a tur...,model_4__k1,2
3,"<mask> <mask> , (born march 14 , <mask>) is a ...",model_4__k1,3
4,<mask> <mask>. <mask> is a former democratic m...,model_4__k1,4


### Get baseline redacted data

Baseline redactions: named-entity (NER) redaction and lexical redaction.

In [6]:
mini_val_dataset = dm.val_dataset[:100]
ner_df = pd.DataFrame(
    columns=['perturbed_text'],
    data=mini_val_dataset['document_redact_ner']
)
ner_df['model_name'] = 'named_entity'
ner_df['i'] = ner_df.index
       
lex_df = pd.DataFrame(
    columns=['perturbed_text'],
    data=mini_val_dataset['document_redact_lexical']
)
lex_df['model_name'] = 'lexical'
lex_df['i'] = lex_df.index

baseline_df = pd.concat((lex_df, ner_df), axis=0)

In [7]:
baseline_df.head()

Unnamed: 0,perturbed_text,model_name,i
0,<mask> <mask> <mask> of <mask> ( also known as...,lexical,0
1,<mask> <mask> is a male former table tennis pl...,lexical,1
2,<mask> <mask> ( born <mask><mask> <mask> <mask...,lexical,2
3,<mask> <mask> <mask> ( born <mask> <mask>4 <ma...,lexical,3
4,<mask> <mask> <mask> is a former <mask> member...,lexical,4


In [8]:
full_df = pd.concat((adv_df, baseline_df), axis=0)
full_df['model_name'].value_counts()

model_4__k1       100
model_4__k10      100
model_4__k1000    100
model_5__k1       100
model_5__k10      100
model_5__k1000    100
model_6__k1       100
model_6__k10      100
model_6__k1000    100
lexical           100
named_entity      100
Name: model_name, dtype: int64

In [9]:
full_df['i'].value_counts()

0     11
63    11
73    11
72    11
71    11
      ..
30    11
29    11
28    11
27    11
99    11
Name: i, Length: 100, dtype: int64

In [10]:
full_df['perturbed_text'] = full_df['perturbed_text'].apply(lambda s: s.replace('<SPLIT>', '\n'))

### Truncating

Hugely important step that was missing in the prior analysis!

In [11]:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('roberta-base')

In [12]:
def truncate_text(text: str, max_length: int = 128) -> str:
    # Tokenize with truncation to max_length tokens, then decode back to a string.
    input_ids = tokenizer(text, truncation=True, max_length=max_length)['input_ids']
    reconstructed_text = (
        tokenizer
            .decode(input_ids)
            .replace('<mask>', ' <mask> ')
            .replace('  <mask>', ' <mask>')
            .replace('<mask>  ', '<mask> ')
            .replace('<s>', '')
            .replace('</s>', '')
            .strip()
    )
    return reconstructed_text

# truncate_text(sample_long_text)
sample_long_text = full_df['perturbed_text'].iloc[14]
print(sample_long_text)
print()
print(truncate_text(sample_long_text))

<mask> <mask> (born 4 <mask> <mask>) is a danish professional football midfielder , who currently plays for danish 1st division side <mask> boldklub .
<mask> began playing football in kolt-hasselager if , where he was picked for agf , where he got his other footballing education .
he was part of the year ' 88 , who won in the junior league , like michael lumb , frederik krabbe , michael vester , niels kristensen , morten beck andersen and anders syberg , who all had the onset of agf 's 1 .
hold .
in the autumn of 2009 he was loaned to næstved bk , and just before winter transfer window end he switched permanently to the club .
in 2010 he changed to fc fredericia , where he played until 2012 when he got vendsyssel ff as a new club in january 2015 he was given at his own request that he want to terminated his contract with the vendsyssel ff .
on 6 february 2015 he signed a two-year contract with lyngby boldklub


<mask> <mask> (born 4 <mask> <mask> ) is a danish professional football mid

In [13]:
sample_short_text = full_df['perturbed_text'].iloc[5]
print(sample_short_text)
print()
print(truncate_text(sample_short_text))

<mask> <mask> (born may 8 , <mask>) is an american stage , film and television actress .
she is perhaps best known for portraying the female changeling on '' '' .


<mask> <mask> (born may 8, <mask> ) is an american stage, film and television actress.
she is perhaps best known for portraying the female changeling on '' ''.
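
The chain of `replace` calls in `truncate_text` is easier to follow in isolation. The sketch below applies the same chain to a hand-written string; `example` is an assumed stand-in for decoder output with masks glued to their neighbours, not real `tokenizer.decode` output, and `normalize_mask_spacing` is a hypothetical name.

```python
def normalize_mask_spacing(decoded: str) -> str:
    """Apply the same replace chain as truncate_text, without the tokenizer."""
    return (
        decoded
            .replace('<mask>', ' <mask> ')   # pad every mask with a space on each side
            .replace('  <mask>', ' <mask>')  # collapse a doubled space before a mask
            .replace('<mask>  ', '<mask> ')  # collapse a doubled space after a mask
            .replace('<s>', '')              # drop the start-of-sequence token
            .replace('</s>', '')             # drop the end-of-sequence token
            .strip()
    )

# Hand-written stand-in for decoded text (assumption, not real decoder output).
example = '<s><mask><mask> (born 4<mask><mask>) is a danish midfielder</s>'
print(normalize_mask_spacing(example))
# <mask> <mask> (born 4 <mask> <mask> ) is a danish midfielder
```

The stray space before `)` matches what the truncated outputs above show: every mask ends up surrounded by single spaces, so `str.split()` later yields each `<mask>` as its own word.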


In [14]:
full_df['perturbed_text_truncated'] = full_df['perturbed_text'].apply(truncate_text)

### Measuring utility

Unit 1: number of redacted words.

In [15]:
def count_masks(s):
    return s.count('<mask>')

In [16]:
count_masks(full_df.iloc[0]['perturbed_text_truncated'])

9

In [17]:
full_df.groupby('model_name').apply(lambda s: count_masks('\n'.join(s['perturbed_text_truncated']))) / 100.0

model_name
lexical           19.41
model_4__k1        9.45
model_4__k10      14.59
model_4__k1000    34.91
model_5__k1        8.60
model_5__k10      22.45
model_5__k1000    53.04
model_6__k1        3.54
model_6__k10       6.58
model_6__k1000    19.25
named_entity      13.94
dtype: float64
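
The cell above divides by a hard-coded 100, which is only right because every model has exactly 100 documents. A minimal sketch of the same average without that constant, using made-up rows in place of `full_df`:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (model_name, redacted_text) rows standing in for full_df.
rows = [
    ('lexical', '<mask> <mask> is a male former table tennis player'),
    ('lexical', '<mask> <mask> <mask> is a former member'),
    ('named_entity', '<mask> is a former member'),
]

# Collect the per-document mask count for each model.
masks_by_model = defaultdict(list)
for model_name, text in rows:
    masks_by_model[model_name].append(text.count('<mask>'))

# Average over however many documents each model actually has.
mean_masks = {name: mean(counts) for name, counts in masks_by_model.items()}
print(mean_masks)
```

Averaging per group keeps the number correct if some model ends up with more or fewer than 100 rows.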

Unit 2: compressed size.

In [18]:
import zlib

def count_compressed_bytes(s: str) -> int:
    return len(zlib.compress(s.encode()))

teststr = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus
pretium justo eget elit eleifend, et dignissim quam eleifend. Nam vehicula nisl
posuere velit volutpat, vitae scelerisque nisl imperdiet. Phasellus dignissim,
dolor amet."""

count_compressed_bytes(teststr)

157

In [19]:
original_text_truncated = [truncate_text(d) for d in mini_val_dataset['document']]

In [20]:
original_total_bytes = count_compressed_bytes('\n'.join(original_text_truncated))

1 - full_df.groupby('model_name').apply(lambda s: count_compressed_bytes('\n'.join(s['perturbed_text_truncated']))) / original_total_bytes

model_name
lexical           0.150189
model_4__k1       0.097302
model_4__k10      0.153961
model_4__k1000    0.381735
model_5__k1       0.092106
model_5__k10      0.218379
model_5__k1000    0.646665
model_6__k1       0.040074
model_6__k10      0.070183
model_6__k1000    0.195886
named_entity      0.125845
dtype: float64
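
The quantity in the table above is `1 - compressed_bytes(redacted) / compressed_bytes(original)`, computed over all documents of a model joined with newlines. Higher values mean more of the text was removed; `model_5__k1000`, the most heavily masked model, scores highest. A self-contained sketch of that calculation, with shortened example sentences and hypothetical helper names:

```python
import zlib

def compressed_size(text: str) -> int:
    """Number of bytes after zlib compression of the UTF-8 encoded text."""
    return len(zlib.compress(text.encode()))

def compression_reduction(original_docs, redacted_docs) -> float:
    """Fraction of compressed bytes removed by redaction (0.0 means nothing removed)."""
    original = compressed_size('\n'.join(original_docs))
    redacted = compressed_size('\n'.join(redacted_docs))
    return 1 - redacted / original

# Shortened, hand-redacted examples (illustrative only).
original_docs = [
    'thiago dos santos costa is a brazilian footballer who plays for são luiz .',
    'thaila ayala sales is a brazilian actress and model .',
]
redacted_docs = [
    '<mask> <mask> <mask> <mask> is a <mask> <mask> who plays for <mask> <mask> .',
    '<mask> <mask> <mask> is a <mask> <mask> and <mask> .',
]

identity = compression_reduction(original_docs, original_docs)
reduction = compression_reduction(original_docs, redacted_docs)
print(identity)   # 0.0
print(reduction)
```

Compressed size is used rather than raw length because repeated `<mask>` tokens compress to almost nothing, so the metric approximates how much distinct content is left.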

### Reidentification rate (privacy metric)

In [21]:
from typing import List

from nltk.corpus import stopwords
from rank_bm25 import BM25Okapi

eng_stopwords = stopwords.words('english')
from tqdm.auto import tqdm
tqdm.pandas()


def get_words_from_doc(s: str) -> List[str]:
    # Whitespace-tokenize and drop English stopwords.
    words = s.split()
    return [w for w in words if w not in eng_stopwords]

In [22]:
import datasets

split = 'val[:20%]'
prof_data = datasets.load_dataset('wiki_bio', split=split, version='1.2.0')

def make_table_str(ex):
    ex['table_str'] = (
        ' '.join(ex['input_text']['table']['column_header'] + ex['input_text']['table']['content'])
    )
    return ex

prof_data = prof_data.map(make_table_str)
profile_corpus = prof_data['table_str']

Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-ba6837fc22371371.arrow


In [23]:
tokenized_profile_corpus = [
    get_words_from_doc(prof) for prof in profile_corpus
]

In [24]:
bm25 = BM25Okapi(tokenized_profile_corpus)

In [25]:
def get_top_k(ex):
    # Query BM25 with the whitespace-split truncated text.
    query = ex["perturbed_text_truncated"].split()
    # Profile indices sorted from highest to lowest BM25 score.
    top_k = bm25.get_scores(query).argsort()[::-1]
    # Rank of the true profile (0 means it was retrieved first).
    ex["correct_idx"] = top_k.tolist().index(ex["i"])
    ex["is_correct"] = 1 if top_k[0] == ex["i"] else 0
    return ex

full_df = full_df.progress_apply(get_top_k, axis=1)
print(full_df["is_correct"].mean())


0.3


In [26]:
full_df.groupby('model_name')['is_correct'].mean() * 100

model_name
lexical            0.0
model_4__k1       33.0
model_4__k10      24.0
model_4__k1000     8.0
model_5__k1       37.0
model_5__k10      20.0
model_5__k1000     2.0
model_6__k1       65.0
model_6__k10      51.0
model_6__k1000    24.0
named_entity      66.0
Name: is_correct, dtype: float64
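
The reidentification rate above is top-1 retrieval accuracy: a document counts as reidentified when its true profile gets the highest BM25 score. The sketch below shows that logic in pure Python on three made-up profiles. It uses a common non-negative IDF variant and default-style `k1`/`b` values as assumptions, so it is not expected to match `rank_bm25` numerically.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in corpus against the query tokens."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue  # terms absent from the document (e.g. '<mask>') add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def rank_of_true_profile(query, corpus, true_idx):
    """Return 0 if the true profile is ranked first, 1 if second, and so on."""
    scores = bm25_scores(query, corpus)
    ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranking.index(true_idx)

# Hypothetical tokenized profiles (the real ones are built from wiki_bio tables).
profiles = [
    'name michael iii alexandria title pope religion coptic'.split(),
    'name thaila ayala occupation actress model nationality brazilian'.split(),
    'name thiago costa position midfielder nationality brazilian'.split(),
]
query = '<mask> <mask> is a brazilian <mask> and model .'.split()

rank = rank_of_true_profile(query, profiles, true_idx=1)
print(rank)  # 0: the unmasked words 'brazilian' and 'model' are enough to pick profile 1
```

Here `rank` plays the role of `correct_idx`, and `rank == 0` corresponds to `is_correct == 1`.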

In [27]:
full_df[full_df['model_name'] == 'model_4__k1'].head(n=10)

Unnamed: 0,perturbed_text,model_name,i,perturbed_text_truncated,correct_idx,is_correct
0,pope <mask> <mask> <mask> alexandria (also kno...,model_4__k1,0,pope <mask> <mask> <mask> alexandria (also kno...,0,1
1,<mask> <mask> is a male former <mask> <mask> p...,model_4__k1,1,<mask> <mask> is a male former <mask> <mask> p...,5620,0
2,<mask> <mask> (born 30 <mask> <mask>) is a tur...,model_4__k1,2,<mask> <mask> (born 30 <mask> <mask> ) is a tu...,1,0
3,"<mask> <mask> , (born march 14 , <mask>) is a ...",model_4__k1,3,"<mask> <mask> , (born march 14, <mask> ) is a ...",1,0
4,<mask> <mask>. <mask> is a former democratic m...,model_4__k1,4,<mask> <mask> . <mask> is a former democratic ...,43,0
5,"<mask> <mask> (born may 8 , <mask>) is an amer...",model_4__k1,5,"<mask> <mask> (born may 8, <mask> ) is an amer...",2461,0
6,"<mask> demonte <mask> (born <mask> 5 , <mask>)...",model_4__k1,6,"<mask> demonte <mask> (born <mask> 5, <mask> )...",17,0
7,<mask> <mask> (born <mask> neil morrison on 22...,model_4__k1,7,<mask> <mask> (born <mask> neil morrison on 22...,0,1
8,"<mask> <mask> (born <mask> 20 , <mask> in empo...",model_4__k1,8,"<mask> <mask> (born <mask> 20, <mask> in empor...",0,1
9,blessed <mask> <mask> <mask> t.o.s.d. -<mask>)...,model_4__k1,9,blessed <mask> <mask> <mask> t.o.s.d. - <mask>...,0,1


In [28]:
sample_pt = full_df[full_df['model_name'] == 'model_4__k1']['perturbed_text_truncated'][0]
sample_pt

'pope <mask> <mask> <mask> alexandria (also known as <mask> <mask> ) was the coptic pope of alexandria and <mask> of the see of st. mark ( <mask> -- <mask> ).\nin <mask> , the governor of egypt, ahmad ibn tulun, forced khail to pay heavy contributions, forcing him to sell a church and some attached properties to the local jewish community.\nthis building was at one time believed to have later become the site of the cairo geniza.'

In [29]:
full_df['num_words'] = full_df['perturbed_text_truncated'].apply(lambda s: len(s.split()))

In [30]:
sample_pt.split()

['pope',
 '<mask>',
 '<mask>',
 '<mask>',
 'alexandria',
 '(also',
 'known',
 'as',
 '<mask>',
 '<mask>',
 ')',
 'was',
 'the',
 'coptic',
 'pope',
 'of',
 'alexandria',
 'and',
 '<mask>',
 'of',
 'the',
 'see',
 'of',
 'st.',
 'mark',
 '(',
 '<mask>',
 '--',
 '<mask>',
 ').',
 'in',
 '<mask>',
 ',',
 'the',
 'governor',
 'of',
 'egypt,',
 'ahmad',
 'ibn',
 'tulun,',
 'forced',
 'khail',
 'to',
 'pay',
 'heavy',
 'contributions,',
 'forcing',
 'him',
 'to',
 'sell',
 'a',
 'church',
 'and',
 'some',
 'attached',
 'properties',
 'to',
 'the',
 'local',
 'jewish',
 'community.',
 'this',
 'building',
 'was',
 'at',
 'one',
 'time',
 'believed',
 'to',
 'have',
 'later',
 'become',
 'the',
 'site',
 'of',
 'the',
 'cairo',
 'geniza.']

In [31]:
bm25.get_scores(
    ['pope',
     # 'alexandria',
     # 'coptic',
     # 'pope',
     # 'alexandria',
     'st.',
     'mark',
     # 'governor',
     'egypt,',
     # 'forced',
     # 'khail',
     # 'jewish',
    ]
).argsort()[::-1][:5]

array([    0, 12300,  4159,  9464, 13849])

In [45]:
full_df['model_name'].unique()

array(['model_4__k1', 'model_4__k10', 'model_4__k1000', 'model_5__k1',
       'model_5__k10', 'model_5__k1000', 'model_6__k1', 'model_6__k10',
       'model_6__k1000', 'lexical', 'named_entity'], dtype=object)

In [52]:
for i, el in enumerate(full_df[ full_df['model_name'] == 'model_5__k10' ]['perturbed_text'].tolist()):
    print(i, el, '\n\n')

0 <mask> <mask> <mask> <mask> alexandria (also known as khail <mask>) was <mask> <mask> <mask> of <mask> <mask> patriarch of <mask> see <mask> <mask>. mark (880 -- <mask>) .
in 882 , the governor of egypt , ahmad ibn tulun , forced khail to pay heavy contributions , forcing him to sell a church and some attached properties to the local jewish community .
this building was at one time believed to have later become the site of the cairo geniza .
 


1 <mask> <mask> is a male former <mask> tennis player from <mask> .
 


2 <mask> <mask> (born 30 november <mask>) is <mask> <mask> professional footballer .
he currently plays as a <mask> for <mask> <mask> .
 


3 <mask> <mask> , (born <mask> 14 , <mask>) is a professional <mask> player who represents <mask> .
she <mask> a career-high world ranking of world no. 101 in july <mask> .
 


4 <mask> <mask>. <mask> is a <mask> <mask> member of the <mask> house of representatives .
<mask> was born in butler to michael and angela <mask> <mask> .
 




In [69]:
[d for d in mini_val_dataset['target_text'] if ('february' in d and 'ambassador' in d)]

['vernon a. walters -lrb- january 3 , 1917 -- february 10 , 2002 -rrb- was a united states army officer and a diplomat .\nmost notably , he served from 1972 to 1976 as deputy director of central intelligence , from 1985 to 1989 as the united states ambassador to the united nations and from 1989 to 1991 as ambassador to the federal republic of germany during the decisive phase of german reunification .\nwalters rose to the rank of lieutenant general in the u.s. army and is a member of the military intelligence hall of fame .\n']

In [82]:
[d for d in mini_val_dataset['target_text'] if ('prudente' in d)]

['thaila ayala sales -lrb- born april 14 , 1986 in presidente prudente -rrb- is a brazilian actress and model .\n']

In [83]:
mini_val_dataset['target_text'].index('thaila ayala sales -lrb- born april 14 , 1986 in presidente prudente -rrb- is a brazilian actress and model .\n')

10

In [92]:
for _, r in full_df[full_df.index == 10][['perturbed_text', 'model_name']].iterrows():
    print(r['model_name'])
    print(r['perturbed_text'], '\n')

model_4__k1
<mask> <mask> <mask> (born <mask> 14 , <mask> in presidente prudente) is <mask> <mask> <mask> and <mask> .
 

model_4__k10
<mask> <mask> <mask> (born <mask> 14 , <mask> in presidente prudente) is <mask> <mask> <mask> and <mask> .
 

model_4__k1000
<mask> <mask> <mask> (born <mask> 14 , <mask> <mask> <mask> <mask>) is <mask> <mask> <mask> <mask> <mask> .
 

model_5__k1
<mask> <mask> <mask> (born april 14 , <mask> in presidente prudente) is a <mask> actress and <mask> .
 

model_5__k10
<mask> <mask> <mask> (born april 14 , <mask> in presidente prudente) is a <mask> <mask> <mask> <mask> .
 

model_5__k1000
<mask> <mask> <mask> (<mask> <mask> <mask> , <mask> <mask> <mask> <mask>) <mask> <mask> <mask> <mask> <mask> <mask> .
 

model_6__k1
<mask> <mask> sales (born april 14 , 1986 in presidente prudente) is a brazilian <mask> and model .
 

model_6__k10
<mask> <mask> sales (born april 14 , 1986 in presidente prudente) is a brazilian <mask> and model .
 

model_6__k1000
<mask> <ma

In [94]:
mini_val_dataset['target_text'][82]

'thiago dos santos costa -lrb- born february 28 , 1983 -rrb- is a brazilian footballer who plays for são luiz .\n'

In [93]:
for _, r in full_df[full_df.index == 82][['perturbed_text', 'model_name']].iterrows():
    print(r['model_name'])
    print(r['perturbed_text'], '\n')

model_4__k1
<mask> dos <mask> <mask> (born <mask> 28 , <mask>) is a brazilian footballer who plays for <mask> <mask> .
 

model_4__k10
<mask> <mask> <mask> <mask> (born <mask> 28 , <mask>) is a <mask> footballer who plays for <mask> <mask> .
 

model_4__k1000
<mask> <mask> <mask> <mask> (born <mask> 28 , <mask>) is a <mask> <mask> <mask> <mask> <mask> <mask> <mask> .
 

model_5__k1
thiago dos <mask> <mask> (born february 28 , <mask>) is a brazilian footballer who plays for <mask> <mask> .
 

model_5__k10
<mask> <mask> <mask> <mask> (born february <mask> , <mask>) is a <mask> footballer who plays for <mask> <mask> .
 

model_5__k1000
<mask> <mask> <mask> <mask> (<mask> <mask> <mask> , <mask>) <mask> <mask> <mask> <mask> <mask> <mask> <mask> <mask> <mask> .
 

model_6__k1
<mask> dos santos <mask> (born february 28 , <mask>) is a brazilian footballer who plays for são luiz .
 

model_6__k10
<mask> dos <mask> <mask> (born february 28 , <mask>) is a brazilian footballer who plays for são lu