JAMDEC: Authorship Obfuscation Method

This repository contains the code and the scripts to reproduce the experiments from the paper JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models.

JAMDEC, is a lightweight, user-controlled, unsupervised inference-time algorithm for authorship obfuscation that can be applied to any arbitrary text. It uses a three-stage framework which includes keyword extraction, over-generation,and filtering. In the keyword extraction stage, we identify the essential keywords required to preserve the content and sentiment of the original text. These keywords, along with a left context, are then utilized in the over-generation stage, where a unique combination of Diverse and Constrained Beam Search is employed to generate a diverse array of possible generations which are constrained to have similar content as the original text. Finally, we incorporate a user-specific filtering stage to grant users the ability to control the types of generations used in the final obfuscated output.

In this paper we demonstrate the effectiveness of JAMDEC on two obfuscation datasets comprised of academic articles (AMT) and diary-style writings (BLOG). When applied in combination with a GPT2-XL model, JAMDEC outperforms state-of-the-art authorship obfuscation methods, while remaining competitive with more computationally expensive GPT3 175B.

In this repo, we provide code which implements JAMDEC on a GPT2-XL model for these two experiments.

Dependencies

The code is written in Python and the dependencies are:

Python >= 3.10.10
PyTorch >= 2.0.1
Huggingface Transformers >= 4.33.3

Conda Environment: We recommend using a conda environment for Python 3.7. To setup the environment, run

conda env create --file environment.yml
# activate the environment
conda activate jambalaya_decoding

Install Dependencies via Pip: To install dependencies, run

pip install -r requirement.txt

Datasets

We use two different datasets, the Extended-Brennan-Greenstadt (Brennan et al., 2012) (amt) and the Blog Authorship corpus (Schler et al., 2006) (blog), using a range of different authors (3 - 10). All raw datasets can be found under the datasets/ folder. We process the data by splitting up the text and creating left context using the file process_raw_data.py. This is incorporated into the scripts for each experiment.

Experimental Pipeline

Experimental code for both amt and the blog datasets can be found in the scripts folder labeled as amt_exp.sh and blog_exp.sh. Each experiment consists of the following four steps:

Pre-process the Raw Data: process_raw_data.py (run time: less than 5 minute)
Keyword Extraction: main_keyword_extraction.py (run time: less than 1 hour)
Over-Generation: main_generation.py (run time: varies widely based on size of dataset/# of gpus available, can be up to 3 days)
Filtering: main_filter.py (run time: less than 1 hour)
Evaluation: main_evaluation.py (run time: less than 5 minute)

Citation

If you find this repository useful, or you use it in your research, please cite:

@misc{fisher2024jamdec,
      title={JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models}, 
      author={Jillian Fisher and Ximing Lu and Jaehun Jung and Liwei Jiang and Zaid Harchaoui and Yejin Choi},
      year={2024},
      eprint={2402.08761},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.vscode		.vscode
classifier_models		classifier_models
dataset		dataset
keyword_extraction_methods		keyword_extraction_methods
neurologic_super_fast		neurologic_super_fast
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
authorship_obfuscation_visuals.jpeg		authorship_obfuscation_visuals.jpeg
environment.yml		environment.yml
main_evaluation.py		main_evaluation.py
main_filter.py		main_filter.py
main_generation.py		main_generation.py
main_keyword_extraction.py		main_keyword_extraction.py
process_raw_data.py		process_raw_data.py
requirements.		requirements.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JAMDEC: Authorship Obfuscation Method

Dependencies

Datasets

Experimental Pipeline

Citation

Acknowledgements

About

Releases

Packages

Contributors 2

Languages

jfisher52/JAMDecoding

Folders and files

Latest commit

History

Repository files navigation

JAMDEC: Authorship Obfuscation Method

Dependencies

Datasets

Experimental Pipeline

Citation

Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages