### Testing the masking span sampler

How does the IDF temperature (`idf_temp`) affect the output of our masking span sampler when IDF weighting is enabled?
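
As background for the question above, here is a minimal sketch of one common way a temperature can reshape IDF-based sampling weights: a softmax over `idf / temp`. This is an assumption for illustration only; the notebook does not show how `masking_span_sampler` actually uses `idf_temp`, and the function name `idf_sampling_probs` and the IDF scores below are hypothetical.

```python
import math

def idf_sampling_probs(idf_scores, temp):
    """Softmax over idf / temp.

    Assumed form for illustration; not confirmed to match the
    project's masking span sampler.
    """
    scaled = [score / temp for score in idf_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical IDF scores: rarer words get higher values.
words = ["thaila", "ayala", "born", "is", "a"]
idf = [9.0, 8.5, 3.0, 1.0, 0.5]

for temp in [0.25, 0.5, 1.0, 10.0]:
    probs = idf_sampling_probs(idf, temp)
    print(temp, {w: round(p, 3) for w, p in zip(words, probs)})
```

Under this assumed form, a low temperature concentrates probability on the highest-IDF words, while a high temperature pushes the distribution toward uniform.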

In [1]:
import sys
# Make the project modules (datamodule, masking_tokenizing_dataset) importable.
sys.path.append('/home/jxm3/research/deidentification/unsupervised-deidentification')

In [2]:
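import os

from datamodule import WikipediaDataModule

# os.sched_getaffinity is only available on some Unix platforms (e.g. Linux).
# Computed for reference only: the data module below uses num_workers=1.
num_cpus = len(os.sched_getaffinity(0))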

dm = WikipediaDataModule(
    document_model_name_or_path="roberta-base",
    profile_model_name_or_path="google/tapas-base",
    max_seq_length=128,
    dataset_name='wiki_bio',
    dataset_train_split='train[:1024]', # not used in this notebook
    dataset_val_split='val[:20%]',
    dataset_version='1.2.0',
    word_dropout_ratio=0.0,
    word_dropout_perc=0.0,
    num_workers=1,
    train_batch_size=64,
    eval_batch_size=64
)
dm.setup("fit")

Initializing WikipediaDataModule with num_workers = 1 and mask token `<mask>`
loading wiki_bio[1.2.0] split train[:1024]


Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)


loading wiki_bio[1.2.0] split val[:20%]


Using custom data configuration default
Reusing dataset wiki_bio (/home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da)
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-793b771e10f80bbe.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-7d07543b6205ca87.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-912d45fbf560a15e.arrow
Loading cached processed dataset at /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.2.0/c05ce066e9026831cd7535968a311fc80f074b58868cfdffccbc811dff2ab6da/cache-4731c171b2d92df3.arrow
Loading cached processed dataset at /h

In [3]:
from masking_tokenizing_dataset import MaskingTokenizingDataset

val_masking_dataset = MaskingTokenizingDataset(
    dm.val_dataset,
    document_tokenizer=dm.document_tokenizer,
    profile_tokenizer=dm.profile_tokenizer,
    max_seq_length=dm.max_seq_length,
    word_dropout_ratio=1.0,
    word_dropout_perc=-1.0,
    profile_row_dropout_perc=0.0,
    sample_spans=False,
    adversarial_masking=False,
    idf_masking=True,
    num_nearest_neighbors=0,
    document_types=["document"],
    is_train_dataset=True
)

In [4]:
test_doc = dm.val_dataset[10]["document"]
test_doc

'thaila ayala sales ( born april 14 , 1986 in presidente prudente ) is a brazilian actress and model .\n'

In [6]:
%load_ext autoreload
%autoreload 2

In [54]:
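mss = val_masking_dataset.masking_span_sampler

# Draw 10 random redactions of the same document at each IDF temperature.
for temp in [0.25, 0.5, 1.0, 10.0]:
    mss.idf_temp = temp
    print(temp)
    for _ in range(10):
        print('\t', mss.random_redact_str(text=test_doc).strip())
    print('\n'*2)  # blank lines between temperature groups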

0.25
	 thaila <mask> <mask> ( <mask> april 14 , 1986 in <mask> <mask> ) <mask> a brazilian actress and model .
	 <mask> <mask> sales ( <mask> <mask> 14 , 1986 <mask> presidente <mask> ) is a brazilian actress and model .
	 <mask> <mask> sales ( <mask> april 14 , <mask> <mask> presidente <mask> ) is a brazilian actress and model .
	 <mask> <mask> sales ( born april 14 , 1986 in <mask> <mask> ) is a <mask> actress and <mask> .
	 <mask> ayala sales ( born april 14 , 1986 in <mask> <mask> ) is a <mask> <mask> and <mask> .
	 <mask> <mask> <mask> ( born april 14 , 1986 <mask> <mask> <mask> ) is a brazilian actress and model .
	 <mask> <mask> <mask> ( born april <mask> , 1986 in presidente <mask> ) is a brazilian actress <mask> model .
	 thaila <mask> sales ( born april <mask> , <mask> in <mask> <mask> ) is a <mask> actress and model .
	 thaila <mask> sales ( born april <mask> , <mask> <mask> <mask> prudente ) is <mask> brazilian actress and model .
	 <mask> <mask> <mask> ( <mask> april 14 , 
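
Comparing the samples by eye gets hard once there are more than a few temperatures. A rough sketch of how the masking rate per word could be tallied from redacted strings like the ones above is shown here. The helper name `mask_frequency` is hypothetical, and it assumes each `<mask>` replaces exactly one whitespace-delimited word, as the printed samples suggest.

```python
from collections import Counter

def mask_frequency(original, redacted_samples, mask_token="<mask>"):
    """Return (word, fraction of samples in which it was masked) pairs.

    Assumes one mask token per whitespace-delimited word; samples that
    do not align word-for-word with the original are skipped.
    """
    words = original.split()
    counts = Counter()
    used = 0
    for sample in redacted_samples:
        tokens = sample.split()
        if len(tokens) != len(words):
            continue  # cannot align this sample with the original
        used += 1
        for idx, token in enumerate(tokens):
            if token == mask_token:
                counts[idx] += 1
    return [(w, counts[i] / used if used else 0.0) for i, w in enumerate(words)]

# Two of the samples printed above at temperature 0.25.
original = ("thaila ayala sales ( born april 14 , 1986 in presidente prudente ) "
            "is a brazilian actress and model .")
samples = [
    "thaila <mask> <mask> ( <mask> april 14 , 1986 in <mask> <mask> ) "
    "<mask> a brazilian actress and model .",
    "<mask> <mask> sales ( <mask> <mask> 14 , 1986 <mask> presidente <mask> ) "
    "is a brazilian actress and model .",
]

for word, rate in mask_frequency(original, samples):
    print(f"{word}\t{rate:.2f}")
```

In the notebook, `samples` could instead be collected from repeated `mss.random_redact_str(text=test_doc).strip()` calls at each temperature, so the per-word rates can be compared across `idf_temp` values.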

In [None]:
dm.word_dropout_ratio