
Unsupervised Text Deidentification

Official code for the 2022 paper "Unsupervised Text Deidentification". Our method, NN DeID, can anonymize text without any prior notion of what constitutes personal information (i.e. without guidelines or labeled data).

This repository contains all the code for training reidentification models and deidentifying data. The main tools & frameworks used are PyTorch, PyTorch Lightning (for model training), and TextAttack (for deidentification via greedy inference). The main dataset used is the wikibio dataset, loaded through HuggingFace datasets. The models that comprise the biencoder are TAPAS and RoBERTa (or PMLM).
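As a quick illustration of the data, the wikibio dataset can be loaded directly through HuggingFace datasets. This snippet is just a sanity check, not part of the training pipeline; the field names follow the wiki_bio schema on the Hub, and depending on your datasets version you may need to pass trust_remote_code=True.

# Quick sanity check: load a small slice of the wiki_bio dataset
# (the same dataset the datamodule loads internally).
from datasets import load_dataset

dataset = load_dataset("wiki_bio", split="train[:1%]")
example = dataset[0]
print(example["target_text"])           # the biography text (the "document")
print(example["input_text"]["table"])   # the infobox fields (the "profile")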

folder structure

Model-training: main.py is the root file for model-training. All experiments can be launched by invoking this file with certain arguments, i.e. by running python main.py --args.

Deidentification: All the text deidentification happens through scripts/deidentify.py. To launch all the deidentification experiments at once, run python scripts/launch_experiments.py. Various pieces of the deidentification logic are situated in deidentification/*.py.

models

Modeling code is in models/*.py. It contains code for training models with a contrastive loss (biencoder), a contrastive loss for a cross-encoder, and "coordinate ascent" via freezing embeddings for a biencoder.
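For reference, the biencoder's contrastive objective is essentially an in-batch cross-entropy over document-profile similarity scores, where each document's true profile sits on the diagonal of the similarity matrix. The sketch below is a simplified illustration, not the repository's exact loss code; the function name and temperature value are placeholders.

# Minimal sketch of an in-batch contrastive loss between document and
# profile embeddings. Illustrative only, not the repo's actual code.
import torch
import torch.nn.functional as F

def contrastive_loss(doc_embeddings: torch.Tensor,
                     profile_embeddings: torch.Tensor,
                     temperature: float = 0.05,
                     label_smoothing: float = 0.01) -> torch.Tensor:
    """doc_embeddings, profile_embeddings: (batch_size, hidden_dim)."""
    doc = F.normalize(doc_embeddings, dim=-1)
    prof = F.normalize(profile_embeddings, dim=-1)
    # (batch_size, batch_size) similarity matrix; the diagonal holds
    # each document's true profile.
    logits = doc @ prof.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels, label_smoothing=label_smoothing)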

Here's a sample command for training the models:

python main.py --epochs 300 --batch_size 64 --max_seq_length 128 --word_dropout_ratio 0.8 --word_dropout_perc -1.0 --document_model_name roberta --profile_model_name tapas --dataset_name "wiki_bio" --dataset_train_split="train[:100%]" --learning_rate 1e-4 --num_validations_per_epoch 1 --loss coordinate_ascent --e 3072 --label_smoothing 0.01

dataloading

The dataloading code is all routed through datamodule.py, which loads a dataset of documents & profiles and preprocesses it, including applying a number of baseline deidentification techniques that are used to compute model performance on deidentified data throughout training.
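One typical baseline of this kind is NER-based redaction (spaCy's en_core_web_sm shows up in the troubleshooting section below). The following is a rough, illustrative sketch of such a baseline, not the repository's exact implementation.

# Rough sketch of an NER-based redaction baseline: replace every named
# entity spaCy finds with a mask token. Illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")

def ner_redact(text: str, mask_token: str = "<mask>") -> str:
    doc = nlp(text)
    redacted = text
    # Replace entities right-to-left so character offsets stay valid.
    for ent in reversed(doc.ents):
        redacted = redacted[:ent.start_char] + mask_token + redacted[ent.end_char:]
    return redacted

print(ner_redact("John Smith was born in Boston in 1961."))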

Other notable files include masking_tokenizing_dataset.py, which implements a custom PyTorch dataset that tokenizes data and optionally randomly masks words, and masking_span_sampler.py, which contains options for randomly sampling sentences from text (not used in the paper) and randomly masking text using a number of different strategies.
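For intuition, uniform random word masking boils down to something like the sketch below; the IDF-weighted variant samples words in proportion to their inverse document frequency instead of uniformly. The function name and mask token here are illustrative, not the repository's API.

import random

def uniform_random_mask(text: str, mask_prob: float = 0.5,
                        mask_token: str = "<mask>") -> str:
    # Mask each whitespace-delimited word independently with probability mask_prob.
    words = text.split()
    masked = [mask_token if random.random() < mask_prob else w for w in words]
    return " ".join(masked)

print(uniform_random_mask("john smith -lrb- born 1961 -rrb- is an american politician ."))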

scripts

There are a number of useful scripts in scripts/, including code for generating nearest neighbors, uploading models to the HuggingFace Hub, probing embeddings to measure retention of profile information such as birth dates and months, and deidentifying data using TextAttack.
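Conceptually, nearest-neighbor reidentification embeds every document and every profile and ranks profiles by similarity; a document counts as reidentified when its true profile ranks first. The sketch below illustrates only that ranking step and assumes you already have the two embedding matrices; it is not the script's actual interface.

# Given precomputed embeddings, rank profiles for each document by
# cosine similarity and measure top-1 reidentification accuracy.
import torch
import torch.nn.functional as F

def top1_reidentification_accuracy(doc_embeddings: torch.Tensor,
                                   profile_embeddings: torch.Tensor) -> float:
    doc = F.normalize(doc_embeddings, dim=-1)
    prof = F.normalize(profile_embeddings, dim=-1)
    nearest = (doc @ prof.T).argmax(dim=1)   # index of nearest profile per document
    correct = torch.arange(len(doc), device=doc.device)
    return (nearest == correct).float().mean().item()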

models

The models used for reidentification and identification are available through HuggingFace. All models were trained with this codebase and the hyperparameters specified in the paper.

  • jxm/wikibio_roberta_tapas_vanilla: vanilla biencoder, trained with RoBERTa document encoder and TAPAS profile encoder on documents without any masking

  • jxm/wikibio_roberta_tapas: vanilla biencoder, trained with RoBERTa document encoder and TAPAS profile encoder on documents with uniform random masking

  • jxm/wikibio_roberta_tapas_idf: vanilla biencoder, trained with RoBERTa document encoder and TAPAS profile encoder on documents with IDF-weighted random masking

  • jxm/wikibio_pmlm_tapas: vanilla biencoder, trained with PMLM document encoder and TAPAS profile encoder on documents with uniform random masking

  • jxm/wikibio_pmlm_tapas_idf: vanilla biencoder, trained with PMLM document encoder and TAPAS profile encoder on documents with IDF-weighted random masking

  • jxm/wikibio_roberta_roberta: vanilla biencoder, trained with RoBERTa document encoder and RoBERTa profile encoder on documents with uniform random masking

  • jxm/wikibio_roberta_roberta_idf: vanilla biencoder, trained with RoBERTa document encoder and RoBERTa profile encoder on documents with IDF-weighted random masking

example model-running command

Here's an example of how you'd load a pre-trained model, along with the WikiBio dataloader. First, clone the repository to download the model:

!git clone https://huggingface.co/jxm/wikibio_roberta_roberta

Then run the following code.

import os

from datamodule import WikipediaDataModule
from model import CoordinateAscentModel


# Count the CPUs available to this process, to use as dataloader workers.
num_cpus = len(os.sched_getaffinity(0))

checkpoint_path = "/content/unsupervised-deid/wikibio_roberta_roberta/model.ckpt"
model = CoordinateAscentModel.load_from_checkpoint(checkpoint_path)
dm = WikipediaDataModule(
    document_model_name_or_path=model.document_model_name_or_path,
    profile_model_name_or_path=model.profile_model_name_or_path,
    dataset_name='wiki_bio',
    dataset_train_split='train[:10%]',
    dataset_val_split='val[:20%]',
    dataset_version='1.2.0',
    num_workers=num_cpus,
    train_batch_size=64,
    eval_batch_size=64,
    max_seq_length=128,
    sample_spans=False,
)
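From there, the datamodule follows the standard PyTorch Lightning interface, so you can prepare its dataloaders and peek at a batch. The exact batch keys depend on the datamodule, so the snippet below is only a rough way to inspect its structure.

# Prepare the dataloaders and take a quick look at one training batch.
dm.setup("fit")
batch = next(iter(dm.train_dataloader()))
# Assuming the batch is a dict of tensors; adjust the inspection if it isn't.
print({key: getattr(value, "shape", type(value)) for key, value in batch.items()})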

example model-training command

This is an example of how to train the RoBERTa-TAPAS model from above (uniform random masking). All of the other models can be trained with a very similar command.

python main.py --epochs 60 --batch_size 128 --max_seq_length 128 --word_dropout_ratio 1.0 --word_dropout_perc -1.0 --document_model_name roberta --profile_model_name tapas --dataset_train_split="train[:100%]" --learning_rate 1e-4 --num_validations_per_epoch 1 --loss coordinate_ascent --e 3072 --label_smoothing 0.01

Troubleshooting

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

Solution: Install en_core_web_sm using the following command: python -m spacy download en_core_web_sm

(A similar command works for other spaCy models as well.)

ERROR Error while calling W&B API: project not found (<Response [404]>)

Solution: Make sure you have a project on wandb that you want to use for this experiment. Add that project name to the command-line arguments using the following flag: --wandb_project_name <project_name>

Also, make sure you provide an entity that has access to that project. You can do this using the following flag: --wandb_entity <entity>

Also, make sure the wandb API key you provide corresponds to a user with access to that project.

Debugging issue: the pdb debugger is too slow

Solution: Run export CUDA_VISIBLE_DEVICES=0 in the terminal before rerunning the program.

Citation

If this package is useful for you, please cite the following!

@misc{https://doi.org/10.48550/arxiv.2210.11528,
  doi = {10.48550/ARXIV.2210.11528},
  url = {https://arxiv.org/abs/2210.11528},
  author = {Morris, John X. and Chiu, Justin T. and Zabih, Ramin and Rush, Alexander M.},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Unsupervised Text Deidentification},
  publisher = {arXiv},
  year = {2022}, 
  copyright = {arXiv.org perpetual, non-exclusive license}
}
