This is the official codebase for our ACM CCS 2021 paper:
"DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation".
Boxin Wang*, Fan Wu*, Yunhui Long*, Luka Rimanic, Ce Zhang, Bo Li
If you find this work useful, please cite:

@article{wang2021datalens,
title={DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation},
author={Wang, Boxin and Wu, Fan and Long, Yunhui and Rimanic, Luka and Zhang, Ce and Li, Bo},
journal={ACM Conference on Computer and Communications Security (CCS)},
year={2021}
}
The project is tested on Python 3.6; higher Python versions should also work. Install the required packages:
pip install -r requirements.txt
Please store the training data in `$data_dir`. By default, `$data_dir` is set to `../../data`.
We provide a script to download the MNIST and Fashion-MNIST datasets:
python download.py [dataset_name]
For MNIST, you can run
python download.py mnist
For Fashion-MNIST, you can run
python download.py fashion_mnist
For the CelebA and Places365 datasets, please refer to their official websites for download instructions.
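As one illustrative option (not part of this repo, and the directory layout our loaders expect may differ from what torchvision produces), CelebA can be fetched with torchvision and placed under `$data_dir`:

```python
# Illustrative sketch only: fetches CelebA via torchvision (an extra dependency,
# not in requirements.txt) into ../../data. Adjust the resulting layout to
# whatever the repo's data loaders expect.
from torchvision.datasets import CelebA

celeba = CelebA(root="../../data", split="train", target_type="attr", download=True)
print(len(celeba))       # number of training images
img, attrs = celeba[0]   # PIL image and its 40-dim binary attribute vector
```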
To train DataLens, run:
python main.py --checkpoint_dir [checkpoint_dir] --dataset [dataset_name] --train --stochastic --signsgd --topk [topk]
For example, to train DataLens on Fashion-MNIST with eps=1 and delta=1e-5, run:
python main.py --checkpoint_dir fmnist_z_dim_50_topk_200_teacher_4000_sigma_5000_thresh_0.7_pt_30_d_step_2_stochastic_1e-5/ \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 1 --thresh 0.7 --sigma 5000 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
By default, after reaching the maximum epsilon, the run generates 10 batches of 10,000 DP samples each, saved as `eps-1.00.data-{i}.pkl` (i=0,...,9) in `checkpoint_dir`.
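To sanity-check these files, you can load one with pickle. A minimal sketch (the internal layout of the .pkl files is an assumption here; inspect it before relying on a specific structure):

```python
import pickle

# Example path; substitute your actual checkpoint_dir.
with open("checkpoint_dir/eps-1.00.data-0.pkl", "rb") as f:
    batch = pickle.load(f)

print(type(batch))        # e.g. ndarray, tuple, or dict
if hasattr(batch, "shape"):
    print(batch.shape)    # samples stored as one array
```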
To reproduce the reported results under eps=1, use the following commands.

MNIST
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset mnist --train --max_eps 1 --thresh 0.7 --sigma 5000 --nopretrain \
--z_dim 50 --save_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
Fashion-MNIST
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 200 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 1 --thresh 0.9 --sigma 5000 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 10 --stochastic --max_grad 1e-5
CelebA-Gender
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 700 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 60 \
--dataset celebA-gender-train --train --max_eps 1 --thresh 0.85 --sigma 9000 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 30 --stochastic --max_grad 1e-5
CelebA-Hair
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 700 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 80 \
--dataset celebA-hair-trn --train --max_eps 1 --thresh 0.9 --sigma 9000 --nopretrain \
--z_dim 100 --save_epoch --epoch 300 --save_vote --d_step 2 --pretrain_teacher 30 --stochastic --max_grad 1e-5
To reproduce the reported results under eps=10, use the following commands.

MNIST
python main.py --checkpoint_dir [checkpoint-dir]/ \
--topk 300 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset mnist --train --max_eps 10 --thresh 0.2 --sigma 800 --nopretrain \
--z_dim 50 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 10 --d_step 2 --stochastic --max_grad 1e-5
Fashion-MNIST
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 350 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset fashion_mnist --train --max_eps 10 --thresh 0.27 --sigma 1000 --nopretrain \
--z_dim 64 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 10 --d_step 2 --stochastic --max_grad 1e-5
CelebA-Gender
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 500 --signsgd --norandom_proj --shuffle --teachers_batch 100 --batch_teachers 60 \
--dataset celebA-gender-train --train --max_eps 10 --thresh 0.12 --sigma 700 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 30 --d_step 2 --stochastic
CelebA-Hair
python main.py --checkpoint_dir [checkpoint-dir] \
--topk 500 --signsgd --norandom_proj --shuffle --teachers_batch 80 --batch_teachers 50 \
--dataset celebA-hair-trn --train --max_eps 10 --thresh 0.25 --sigma 700 --nopretrain \
--z_dim 100 --nosave_epoch --epoch 300 --save_vote --pretrain_teacher 30 --d_step 2 --stochastic
main.py supports the following flags:
--ae: AE model name
(default: '')
--batch_size: The size of batch images [64]
(default: '30')
(an integer)
--batch_teachers: Number of teacher models in one batch
(default: '1')
(an integer)
--beta1: Momentum term of Adam [0.5]
(default: '0.5')
(a number)
--checkpoint_dir: Directory name to save the checkpoints [checkpoint]
(default: 'checkpoint')
--checkpoint_name: checkpoint model name [checkpoint]
(default: 'checkpoint')
--[no]crop: True for cropping
(default: 'false')
--d_step: Number of discriminator steps
(default: '1')
(an integer)
--data_dir: Root directory of dataset [data]
(default: '../../data')
--dataset: The name of the dataset [cinic, celebA, mnist, lsun, fire-small]
(default: 'slt')
--delta: delta for differential privacy
(default: '1e-05')
(a number)
--epoch: Number of epochs for training teacher models
(default: '1000')
(an integer)
--[no]finetune_ae: Finetune ae
(default: 'false')
--g_epoch: Number of epochs for training the student models
(default: '500')
(an integer)
--g_step: Number of generator steps
(default: '1')
(an integer)
--generator_dir: Directory name to save the generator
(default: 'generator')
--hid_dim: Dimension of the hidden layer
(default: '512')
(an integer)
--[no]increasing_dim: Increase the projection dimension for each epoch
(default: 'false')
--input_height: The size of image to use (will be center cropped).
(default: '32')
(an integer)
--input_width: The size of image to use (will be center cropped). If None, same value as input_height [None]
(default: '32')
(an integer)
--klevel: Levels of gradient quantization
(default: '4')
(an integer)
--[no]klevelsgd: Apply k-level SGD for gradient aggregation
(default: 'false')
--learning_rate: Learning rate for Adam
(default: '0.001')
(a number)
--[no]load_d: True to load the pretrained models with the discriminator, False otherwise [True]
(default: 'true')
--loss: AE reconstruction loss
(default: 'l1')
--max_eps: Maximum epsilon
(default: '1.0')
(a number)
--max_grad: Maximum gradient magnitude for sign-SGD aggregation
(default: '0.0')
(a number)
--[no]mean_kernel: Apply mean kernel for gradient aggregation
(default: 'false')
--[no]non_private: Do not apply differential privacy
(default: 'false')
--orders: RDP orders
(default: '200')
(an integer)
--output_height: The size of the output images to produce [64]
(default: '32')
(an integer)
--output_width: The size of the output images to produce. If None, same value as output_height [None]
(default: '32')
(an integer)
--[no]pca: Apply PCA for gradient aggregation
(default: 'false')
--pca_dim: Number of principal dimensions for PCA
(default: '10')
(a number)
--[no]pretrain: True to load the pretrained models, False otherwise [True]
(default: 'true')
--pretrain_teacher: Number of epochs for pretraining teachers
(default: '0')
(an integer)
--proj_mat: Number of projection matrices
(default: '1')
(an integer)
--[no]random_label: Use random labels for the training data; only used when pretraining some models
(default: 'false')
--[no]random_proj: Apply random projection for gradient aggregation
(default: 'true')
--sample_dir: Directory name to save the image samples [samples]
(default: 'samples')
--sample_step: Number of steps between saving image samples
(default: '10')
(an integer)
--[no]save_epoch: Save a checkpoint every 0.1 eps
(default: 'false')
--[no]save_vote: Save voting results
(default: 'false')
--[no]shuffle: Shuffle the dataset so it is evenly distributed across teachers
(default: 'true')
--sigma: Scale of Gaussian noise for gradient aggregation
(default: '2000.0')
(a number)
--sigma_thresh: Scale of Gaussian noise for ThreshGNMax
(default: '4500.0')
(a number)
--[no]signsgd: Apply sign-SGD for gradient aggregation
(default: 'false')
--[no]signsgd_dept: Apply sign-SGD for gradient aggregation with a data-dependent bound
(default: 'false')
--[no]signsgd_nothresh: Apply sign-SGD for gradient aggregation without thresholding
(default: 'false')
--[no]simple_gan: Use fully connected layers to build the GAN
(default: 'false')
--[no]sketchsgd: Apply sketch-SGD for gradient aggregation
(default: 'false')
--[no]small: Use a smaller discriminator
(default: 'false')
--step_size: Step size for gradient aggregation
(default: '0.0001')
(a number)
--[no]stochastic: Apply stochastic sign-SGD for gradient aggregation
(default: 'false')
--[no]tanh: Use tanh as the activation function
(default: 'false')
--teacher_dir: Directory name to save the teacher [teacher]
(default: 'teacher')
--teachers_batch: Number of teacher batches
(default: '1')
(an integer)
--thresh: Threshold for ThreshGNMax
(default: '0.5')
(a number)
--topk: Number of top-k gradient dimensions to keep
(default: '50')
(an integer)
--[no]train: True for training, False for testing [False]
(default: 'false')
--[no]train_ae: Train ae
(default: 'false')
--train_size: The size of train images [np.inf]
(default: 'inf')
(a number)
--[no]wgan: Train wgan
(default: 'false')
--y_dim: Dimension of y (labels)
(default: '10')
(an integer)
--z_dim: Dimension of z (the latent vector)
(default: '100')
(an integer)
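To make the interplay of --topk, --stochastic, --max_grad, --sigma, and --thresh concrete, below is a minimal NumPy sketch of a top-k stochastic sign-SGD aggregation step in the spirit of DataLens. It is an illustration under simplified assumptions, not the repository's exact implementation; the helper `topagg_step` is hypothetical:

```python
import numpy as np

def topagg_step(teacher_grads, topk, sigma, thresh, max_grad, rng):
    """Illustrative DP top-k stochastic-sign aggregation (simplified sketch)."""
    num_teachers, dim = teacher_grads.shape
    votes = np.zeros((num_teachers, dim))
    for t, g in enumerate(teacher_grads):
        g = np.clip(g, -max_grad, max_grad)            # bound sensitivity (cf. --max_grad)
        keep = np.argsort(np.abs(g))[-topk:]           # top-k compression (cf. --topk)
        p_pos = (g[keep] / max_grad + 1.0) / 2.0       # stochastic sign: P[+1] grows with g
        votes[t, keep] = np.where(rng.random(topk) < p_pos, 1.0, -1.0)
    noisy = votes.sum(axis=0) + rng.normal(0.0, sigma, size=dim)  # Gaussian noise (cf. --sigma)
    out = np.zeros(dim)
    out[noisy > thresh * num_teachers] = 1.0           # thresholded voting (cf. --thresh)
    out[noisy < -thresh * num_teachers] = -1.0
    return out

rng = np.random.default_rng(0)
grads = rng.normal(scale=1e-5, size=(50, 1000))        # 50 teachers, 1000-dim gradients
print(topagg_step(grads, topk=200, sigma=5.0, thresh=0.5, max_grad=1e-4, rng=rng)[:10])
```

Presumably the aggregated {-1, 0, +1} direction is then scaled (cf. --step_size above) before being applied as the student generator's update.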
To generate DP samples from a trained model, run main.py without the --train flag:
python main.py --checkpoint_dir [checkpoint_dir] --dataset [dataset_name]
For evaluation, we train a classifier on the synthetic samples and test it on real samples. The evaluation scripts are under the evaluation folder.
For MNIST,
python evaluation/train-classifier-mnist.py --data [DP_data_dir]
For Fashion-MNIST,
python evaluation/train-classifier-fmnist.py --data [DP_data_dir]
For CelebA-Gender,
python evaluation/train-classifier-celebA.py --data [DP_data_dir]
For CelebA-Hair,
python evaluation/train-classifier-hair.py --data [DP_data_dir]
`[DP_data_dir]` is where your generated DP samples are located. In the Fashion-MNIST example above, we generated 10 batches of DP samples as `$checkpoint_dir/eps-1.00.data-{i}.pkl` (i=0,...,9). During evaluation, run with the prefix of the data path; the program will concatenate all of the generated DP samples and use them as the training data:
python evaluation/train-classifier-fmnist.py --data $checkpoint_dir/eps-1.00.data
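For intuition, here is a sketch of how such a prefix could be expanded and concatenated; the evaluation scripts' actual loading code may differ, and the per-shard layout is assumed to be a plain array of samples:

```python
import glob
import pickle

import numpy as np

prefix = "checkpoint_dir/eps-1.00.data"      # example value passed via --data
shards = sorted(glob.glob(prefix + "-*.pkl"))
parts = []
for path in shards:
    with open(path, "rb") as f:
        parts.append(pickle.load(f))
train_data = np.concatenate(parts, axis=0)   # adjust if shards are (x, y) pairs
print(len(shards), "shards ->", train_data.shape)
```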