This is the data and code for the paper *Robust Textual Embedding against Word-level Adversarial Attacks* (UAI 2022).
There are three datasets used in our experiments. Download and uncompress them into the directories `./data/imdb/`, `./data/yelp/`, and `./data/yahoo/`, respectively.
Download `glove.840B.300d.txt`, `counter-fitted-vectors.txt`, `pytorch_model.bin`, and `bert_config.json`, and put them in the directory `./data/`.
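If you want to verify the setup programmatically, here is a minimal sketch; the file names come from the instructions above, while the helper name is hypothetical:

```python
import os

# Resource files listed above; all are expected under ./data/.
REQUIRED = [
    "glove.840B.300d.txt",
    "counter-fitted-vectors.txt",
    "pytorch_model.bin",
    "bert_config.json",
]

def missing_resources(data_dir="./data/"):
    """Return the required files that are not yet present in data_dir."""
    return [f for f in REQUIRED if not os.path.exists(os.path.join(data_dir, f))]
```

Calling `missing_resources()` before training surfaces any file that still needs to be downloaded.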
- python==3.7.11
- pytorch==1.7.1
- tensorflow-gpu==1.15.0
- tqdm==4.42
- scikit-learn==0.23
- numpy==1.21
- keras==2.2.5
- nltk==3.4.5
- Generating the dictionary, embedding matrix, and distance matrix:

  ```shell
  python build_embs.py --task_name imdb --data_dir ./data/
  ```

  Depending on the dataset you want to use, the `--task_name` field can be `imdb`, `yelp`, or `yahoo`. You could also use our pre-generated data by downloading `aux_files` and placing it into the directory `./data/`.
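To generate the auxiliary files for all three datasets in one pass, a sketch (the helper and loop are illustrative; only `build_embs.py` and its flags come from the command above):

```python
import subprocess

def build_embs_cmd(task, data_dir="./data/"):
    """Command line for build_embs.py, with the flags shown above."""
    return ["python", "build_embs.py", "--task_name", task, "--data_dir", data_dir]

for task in ["imdb", "yelp", "yahoo"]:
    cmd = build_embs_cmd(task)
    print(" ".join(cmd))  # swap print for subprocess.run(cmd, check=True) to actually run
```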
- Training the CNN/BiLSTM model with standard training:

  ```shell
  python cnn_classifier.py --data_dir ./data/ --task_name imdb --model_type CNNModel --output_dir model/cnn-imdb-nt --do_train --do_eval --max_seq_length 512 --num_train_epochs 2
  ```

  Depending on the model you want to use, the `--model_type` field can be `CNNModel` or `BiLSTMModel`. The `--max_seq_length` is `512` for `imdb` and `256` for `yelp` and `yahoo`.
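The per-dataset sequence lengths above can be captured in a small lookup; the dict and helper names here are illustrative, while the flag values come from the text (the epoch default is taken from the `imdb` example):

```python
# --max_seq_length per dataset, as stated above.
MAX_SEQ_LENGTH = {"imdb": 512, "yelp": 256, "yahoo": 256}

def classifier_cmd(task, model_type="CNNModel", epochs=2):
    """Standard-training command for cnn_classifier.py (hypothetical helper)."""
    return (
        f"python cnn_classifier.py --data_dir ./data/ --task_name {task}"
        f" --model_type {model_type} --output_dir model/cnn-{task}-nt"
        f" --do_train --do_eval --max_seq_length {MAX_SEQ_LENGTH[task]}"
        f" --num_train_epochs {epochs}"
    )
```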
- Attacking the standard-training model with the GA attack:

  ```shell
  python cnn_attack.py --data_dir ./data/ --task_name imdb --model_type CNNModel --attack ga --output_dir model/cnn-imdb-nt --save_to_file model/cnn-imdb-nt/attack-ga-2.txt --max_seq_length 512 --num_train_epochs 2
  ```

  Depending on the attack method you want to use, the `--attack` field can be `pwws`, `ga`, or `pso`.
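To evaluate the same model under all three attack methods, a sketch (the loop and helper are illustrative; the script name and flags come from the command above):

```python
def attack_cmd(attack, task="imdb", model_dir="model/cnn-imdb-nt"):
    """cnn_attack.py invocation for one attack method (hypothetical helper)."""
    return (
        f"python cnn_attack.py --data_dir ./data/ --task_name {task}"
        f" --model_type CNNModel --attack {attack} --output_dir {model_dir}"
        f" --save_to_file {model_dir}/attack-{attack}-2.txt"
        f" --max_seq_length 512 --num_train_epochs 2"
    )

attack_cmds = [attack_cmd(a) for a in ["pwws", "ga", "pso"]]
```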
- Training the model with our proposed FTML:

  ```shell
  python cnn_classifier_ftml.py --data_dir ./data/ --task_name imdb --model_type CNNModel --output_dir model/cnn-imdb-ftml --do_train --do_eval --max_seq_length 512 --num_train_epochs 20 --beta 1.0 --alpha 6.0
  ```

  The `--num_train_epochs` is `20` for `imdb` and `5` for `yelp` and `yahoo`. You could also use our trained models by downloading `models`.
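The FTML epoch counts above, as a lookup (the names are assumptions; the values and the `--beta`/`--alpha` settings come from the text):

```python
# --num_train_epochs for FTML training, per dataset as stated above.
FTML_EPOCHS = {"imdb": 20, "yelp": 5, "yahoo": 5}

# The imdb example above also sets --beta 1.0 --alpha 6.0.
FTML_BETA, FTML_ALPHA = 1.0, 6.0
```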
- Attacking the FTML model with the GA attack:

  ```shell
  python cnn_attack.py --data_dir ./data/ --task_name imdb --model_type CNNModel --attack ga --output_dir model/cnn-imdb-ftml --save_to_file model/cnn-imdb-ftml/attack-ga-20.txt --max_seq_length 512 --num_train_epochs 20
  ```
- Training the BERT model with standard training:

  ```shell
  python bert_classifier.py --data_dir ./data/ --task_name imdb --output_dir model/bert-imdb-nt --max_seq_length 256 --do_train --do_eval --do_lower_case --num_train_epochs 3
  ```
- Attacking the standard-training BERT model with the GA attack:

  ```shell
  python bert_attack.py --data_dir ./data/ --task_name imdb --attack ga --output_dir model/bert-imdb-nt --save_to_file model/bert-imdb-nt/attack-ga-3.txt --do_lower_case --num_train_epochs 3
  ```
- Training the BERT model with our proposed FTML:

  ```shell
  python bert_classifier_ftml.py --data_dir ./data/ --task_name imdb --output_dir model/bert-imdb-ftml --do_train --do_eval --do_lower_case --num_train_epochs 20
  ```

  You could also use our trained models by downloading `models`.
- Attacking the FTML BERT model with the GA attack:

  ```shell
  python bert_attack.py --data_dir ./data/ --task_name imdb --attack ga --output_dir model/bert-imdb-ftml --save_to_file model/bert-imdb-ftml/attack-ga-20.txt --do_lower_case --num_train_epochs 20
  ```
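The four BERT steps above can be chained in one script; this sketch only prints the commands (all of them are copied verbatim from this section; swap `print` for `subprocess.run(cmd.split(), check=True)` to execute them in order):

```python
bert_steps = [
    # 1. standard training
    "python bert_classifier.py --data_dir ./data/ --task_name imdb --output_dir model/bert-imdb-nt --max_seq_length 256 --do_train --do_eval --do_lower_case --num_train_epochs 3",
    # 2. attack the standard model with GA
    "python bert_attack.py --data_dir ./data/ --task_name imdb --attack ga --output_dir model/bert-imdb-nt --save_to_file model/bert-imdb-nt/attack-ga-3.txt --do_lower_case --num_train_epochs 3",
    # 3. FTML training
    "python bert_classifier_ftml.py --data_dir ./data/ --task_name imdb --output_dir model/bert-imdb-ftml --do_train --do_eval --do_lower_case --num_train_epochs 20",
    # 4. attack the FTML model with GA
    "python bert_attack.py --data_dir ./data/ --task_name imdb --attack ga --output_dir model/bert-imdb-ftml --save_to_file model/bert-imdb-ftml/attack-ga-20.txt --do_lower_case --num_train_epochs 20",
]
for cmd in bert_steps:
    print(cmd)
```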