Add differential privacy distinguisher (#4)
wulu473 committed Mar 4, 2024
1 parent a986ff5 commit e7212c2
Showing 49 changed files with 2,802 additions and 4 deletions.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.2.0
0.2.0
2 changes: 2 additions & 0 deletions experiments/.gitignore
@@ -0,0 +1,2 @@
/dev_data
outputs/
55 changes: 55 additions & 0 deletions experiments/README.md
@@ -0,0 +1,55 @@
# Experiments

This folder contains code to run experiments end-to-end in Azure Machine Learning.

We distinguish between three threat models:
1. Black-box membership inference (coming soon)
2. White-box membership inference (coming soon)
3. Differential privacy distinguisher

## Differential privacy distinguisher

We follow the attack of Nasr et al. (2023) to match the differential privacy threat model.
Currently, the distinguisher does not take privacy amplification via subsampling into account and audits only a single step of DP-SGD.
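
To make the recipe concrete, here is a sketch (our illustration, not code from this repository) of how such an audit converts the attack's false positive and false negative rates at a decision threshold into a lower bound on epsilon:

```python
import numpy as np

def empirical_epsilon(scores_in, scores_out, threshold: float, delta: float) -> float:
    """Lower bound on epsilon implied by a membership attack's error rates.

    (eps, delta)-DP implies FPR + exp(eps) * FNR >= 1 - delta (and symmetrically
    with FPR and FNR swapped), so observed error rates bound eps from below.
    """
    scores_in, scores_out = np.asarray(scores_in), np.asarray(scores_out)
    fnr = float(np.mean(scores_in < threshold))    # members the attack misses
    fpr = float(np.mean(scores_out >= threshold))  # non-members it flags
    bounds = []
    if fnr > 0 and 1 - delta - fpr > 0:
        bounds.append(np.log((1 - delta - fpr) / fnr))
    if fpr > 0 and 1 - delta - fnr > 0:
        bounds.append(np.log((1 - delta - fnr) / fpr))
    return max(bounds, default=0.0)
```

In practice, point estimates of FPR and FNR are replaced by confidence regions, e.g. the Bayesian credible intervals of Zanella-Béguelin et al. (2023), so that the reported epsilon holds with high probability.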

### Collecting membership inference data

The threat model of Differentially Private Stochastic Gradient Descent (DP-SGD) assumes that the adversary has access to the training data and to every gradient in the training process.
To instantiate a matching adversary, we need to collect membership information during training.
We provide a wrapper (`privacy_estimates.experiments.attacks.dpd.CanaryTrackingOptimizer`) around a PyTorch optimizer that can be used with Opacus, as sketched below.
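
A minimal sketch of how the wrapper might slot into an Opacus training setup; the `CanaryTrackingOptimizer` constructor arguments here are assumptions, so check the package source for the exact signature:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Wrapper named in this README; the keyword argument below is an assumption.
from privacy_estimates.experiments.attacks.dpd import CanaryTrackingOptimizer

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=16,
)

# Standard Opacus setup: make the model, optimizer, and loader privacy-aware.
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# Wrap the DP optimizer so per-step canary information is recorded during training.
optimizer = CanaryTrackingOptimizer(optimizer=optimizer)
```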

`privacy_estimates.experiments.games.DifferentialPrivacyGameBase` contains the code to run the differential privacy distinguisher game.
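
Conceptually, the game repeatedly flips a secret coin, runs one DP-SGD step with or without the canary gradient, and records the adversary's score; the two score populations then feed the epsilon estimate sketched above. An illustrative loop (function names are placeholders, not the `DifferentialPrivacyGameBase` API):

```python
import random

def play_distinguisher_game(train_one_step, score_canary, num_trials: int):
    """One-step DP distinguisher game (illustrative).

    `train_one_step(with_canary)` runs a single DP-SGD step and returns the
    noisy aggregated gradient; `score_canary` maps it to an attack score.
    Both stand in for components provided by the package.
    """
    scores_in, scores_out = [], []
    for _ in range(num_trials):
        with_canary = random.random() < 0.5  # secret bit of the game
        noisy_grad = train_one_step(with_canary=with_canary)
        (scores_in if with_canary else scores_out).append(score_canary(noisy_grad))
    return scores_in, scores_out
```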

## Installation and setup

### Set up the local environment

The local environment is provided by the `privacy-estimates` package:

```bash
pip install privacy-estimates[pipelines]
```

### Set up Azure ML

Set up an Azure ML workspace, download its `config.json` file ([details](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#workspace)), and add it to `configs/workspace`.

Add `gpu_compute` and `cpu_compute` keys to the `config.json` to indicate where to run the experiments.
The values must match the names of compute clusters in your workspace.
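
For illustration, the resulting `config.json` might look as follows (all values are placeholders):

```json
{
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
    "gpu_compute": "<gpu-cluster-name>",
    "cpu_compute": "<cpu-cluster-name>"
}
```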


## Example 2: Differential privacy distinguisher for image classification

We can strengthen the threat model to match the theoretical bound by using a stronger attack.
We follow the gradient canary attack of Nasr et al. (2023) to match the differential privacy threat model.

```bash
python estimate_differential_privacy_image_classifier.py --config-name dpd_image_classifier +submit=True
```

## References

Nasr, M., Hayes, J., Steinke, T., Balle, B., Tramèr, F., Jagielski, M., Carlini, N. and Terzis, A., 2023. Tight Auditing of Differentially Private Machine Learning. arXiv preprint arXiv:2302.07956.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Salem, A., Rühle, V., Paverd, A., Naseri, M., Köpf, B. and Jones, D., 2023, July. Bayesian estimation of differential privacy. In International Conference on Machine Learning (pp. 40624-40636). PMLR.

42 changes: 42 additions & 0 deletions experiments/components/predict-with-cnn-classifier/component_spec.yaml
@@ -0,0 +1,42 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_predict_with_cnn_classifier
display_name: Compute predictions using CNN classifier
version: local1
type: command
description: Compute predictions using CNN classifier
inputs:
  model:
    type: uri_folder
    description: The checkpoint directory
    optional: false
  dataset:
    type: uri_folder
    description: Data to compute predictions on, in Huggingface dataset format
    optional: false
  model_rel_path:
    type: string
    description: Path to the model checkpoint, relative to the experiment directory
    default: "./"
    optional: false
  batch_size:
    type: integer
    description: Batch size for the model
    default: 1024
    optional: true
outputs:
  predictions:
    type: uri_folder
    description: Predictions
code: .
additional_includes:
  - "../../models"
command: >-
  python predict_with_cnn_classifier.py \
    --dataset ${{inputs.dataset}} \
    --experiment_dir ${{inputs.model}} \
    --model_rel_path ${{inputs.model_rel_path}} \
    --output ${{outputs.predictions}} \
    $[[ --batch_size ${{inputs.batch_size}} ]]
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
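
A component spec like the one above can be consumed from the Azure ML Python SDK v2. A brief sketch (assumes the `azure-ai-ml` package is installed; the path is illustrative):

```python
from azure.ai.ml import load_component

# Load the command component defined by the YAML spec above.
predict = load_component(
    source="experiments/components/predict-with-cnn-classifier/component_spec.yaml"
)
print(predict.inputs)  # model, dataset, model_rel_path, batch_size
```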
21 changes: 21 additions & 0 deletions experiments/components/predict-with-cnn-classifier/environment.yaml
@@ -0,0 +1,21 @@
name: PrivacyGames
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.1.1
  - pip=21.2.4
  - python=3.8.13
  - pytorch=1.9.0
  - pip:
      - azureml-core==1.39.0.post1
      - datasets==1.18.4
      - numpy==1.22.3
      - pandas==1.4.1
      - Pillow==9.2.0
      - pyarrow==7.0.0
      - pydantic==1.9.0
      - pydantic-cli==4.3.0
      - scipy==1.8.0
      - tqdm==4.63.1
80 changes: 80 additions & 0 deletions experiments/components/predict-with-cnn-classifier/predict_with_cnn_classifier.py
@@ -0,0 +1,80 @@
import sys
import torch  # required for datasets' "torch" format below

from datasets import load_from_disk, Dataset, Features, Sequence, Value
from pathlib import Path
from pydantic_cli import run_and_exit
from pydantic import BaseModel, Field
from torch.utils.data import DataLoader


sys.path.append(str(Path(__file__).parent.parent.parent))
from models.cnn import CNN, compute_prediction_metrics


class Arguments(BaseModel):
    dataset: Path = Field(
        description="Path to the dataset to compute predictions on."
    )
    experiment_dir: Path = Field(
        description="Path to the experiment directory."
    )
    model_rel_path: str = Field(
        default="./", description="Path to the model checkpoint, relative to the experiment directory."
    )
    output: Path = Field(
        description="Path to the output file."
    )
    batch_size: int = Field(
        description="Batch size."
    )
    use_cpu: int = Field(
        default=0, description="Whether to use the CPU instead of the GPU."
    )


def main(args: Arguments):
    data = load_from_disk(args.dataset)

    print(f"Loaded dataset: {data.features}")

    model = CNN.load(args.experiment_dir / args.model_rel_path / "model.pt")

    # Keep the original order (shuffle=False) so predictions line up with input rows.
    data_loader = DataLoader(
        data.with_format("torch"),
        batch_size=args.batch_size,
        shuffle=False
    )

    device = "cpu" if args.use_cpu else "cuda"

    metrics = compute_prediction_metrics(model=model, device=device, data_loader=data_loader)
    print(metrics)

    # Persist predictions as a Huggingface dataset with one row per input example.
    predictions = Dataset.from_dict(
        mapping={
            "logits": list(metrics.logits) if len(metrics.losses) > 0 else [],
            "label": metrics.labels,
            "loss": metrics.losses,
        },
        features=Features({
            "logits": Sequence(feature=Value(dtype="float64"), length=-1),
            "label": data.features["label"],
            "loss": Value(dtype="float64")
        })
    )

    print(f"Writing {len(predictions)} predictions to file")
    assert len(predictions) == len(data)
    predictions.save_to_disk(args.output)

    return 0


def exception_handler(ex):
    raise RuntimeError("Command ran with an error.") from ex


if __name__ == "__main__":
    run_and_exit(Arguments, main, exception_handler=exception_handler)
94 changes: 94 additions & 0 deletions experiments/components/train-cnn-classifier/component_spec.yaml
@@ -0,0 +1,94 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_train_cnn_classifier
display_name: Train CNN classifier
version: local1
type: command
description: Train CNN classifier
inputs:
  train_data:
    type: uri_folder
    description: Training data in Huggingface dataset format
  validation_data:
    type: uri_folder
    description: Validation data in Huggingface dataset format
  seed:
    type: integer
    description: Random seed
  num_train_epochs:
    type: number
    description: Number of training epochs.
  target_epsilon:
    type: number
    description: Target epsilon at the end of training.
  delta:
    type: number
    description: Target delta at the end of training.
  learning_rate:
    type: number
    description: Learning rate.
  max_physical_batch_size:
    type: integer
    description: Largest batch size per device.
  canary_gradient:
    type: string
    description: Canary gradient.
    default: dirac
  total_train_batch_size:
    type: integer
    description: Number of samples between gradient updates
  per_sample_max_grad_norm:
    type: number
    description: Per sample max grad norm for DP training.
  lr_scheduler_gamma:
    type: number
    description: Learning rate scheduler gamma.
    default: 1.0
  logging_steps:
    type: integer
    description: Number of steps before logging
    default: 100
  disable_ml_flow:
    type: integer
    description: Disable ML flow logging to AML
    default: 0
outputs:
  model:
    type: uri_folder
    description: Trained model
  dpd_data:
    type: uri_file
    description: JSON encoded membership inference scores for a differential privacy distinguisher
  dp_parameters:
    type: uri_file
    description: JSON encoded differential privacy parameters
  metrics:
    type: uri_file
    description: JSON encoded metrics
code: .
additional_includes:
  - "../../models"
  - "../../../privacy_estimates/experiments/utils.py"
  - "../../../privacy_estimates/experiments/attacks/dpd"
command: >-
  python train_cnn.py \
    --train_data_path ${{inputs.train_data}} \
    --test_data_path ${{inputs.validation_data}} \
    --target_epsilon ${{inputs.target_epsilon}} \
    --delta ${{inputs.delta}} \
    --learning_rate ${{inputs.learning_rate}} \
    --num_train_epochs ${{inputs.num_train_epochs}} \
    --max_physical_batch_size ${{inputs.max_physical_batch_size}} \
    --per_sample_max_grad_norm ${{inputs.per_sample_max_grad_norm}} \
    --total_train_batch_size ${{inputs.total_train_batch_size}} \
    --output_dir ${{outputs.model}} \
    --seed ${{inputs.seed}} \
    --canary_gradient ${{inputs.canary_gradient}} \
    --lr_scheduler_gamma ${{inputs.lr_scheduler_gamma}} \
    --disable_ml_flow ${{inputs.disable_ml_flow}} \
    --metrics ${{outputs.metrics}} \
    --dpd_data ${{outputs.dpd_data}} \
    --dp_parameters ${{outputs.dp_parameters}} \
    --logging_steps ${{inputs.logging_steps}}
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
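
The `canary_gradient` input defaults to `dirac`, i.e. a canary whose gradient puts all of its mass on a single coordinate at the clipping norm, which reduces the audit to a one-dimensional Gaussian test. A sketch of the idea (our illustration; the repository's implementation lives under `privacy_estimates/experiments/attacks/dpd`):

```python
import torch

def dirac_canary(num_params: int, max_grad_norm: float, index: int = 0) -> torch.Tensor:
    """All gradient mass on one coordinate, scaled to the clipping norm so the
    canary passes per-sample clipping unchanged."""
    gradient = torch.zeros(num_params)
    gradient[index] = max_grad_norm
    return gradient

def canary_score(noisy_grad_sum: torch.Tensor, canary: torch.Tensor) -> float:
    """Project the noisy aggregated gradient onto the canary direction.
    With the canary present the projection concentrates around ||canary||;
    without it, around 0, with the same Gaussian noise in both cases."""
    return float(noisy_grad_sum @ canary / canary.norm())
```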