Add differential privacy distinguisher (#4)
Showing 49 changed files with 2,802 additions and 4 deletions.
@@ -1 +1 @@
-0.2.0
+0.2.0
@@ -0,0 +1,2 @@
/dev_data
outputs/
@@ -0,0 +1,55 @@
# Experiments

This folder contains code to run experiments end-to-end in Azure Machine Learning.

We distinguish three types of threat models:
1. Black-box membership inference (coming soon)
2. White-box membership inference (coming soon)
3. Differential privacy distinguisher

## Differential privacy distinguisher

We follow the attack by Nasr et al. (2023) to match the differential privacy threat model.
Currently, this does not account for privacy amplification by subsampling and audits only a single step of DP-SGD.
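
To make concrete what an audit of a single step yields: any (ε, δ)-DP mechanism forces the false positive rate (FPR) and false negative rate (FNR) of every distinguisher to satisfy FPR + e^ε · FNR ≥ 1 − δ and FNR + e^ε · FPR ≥ 1 − δ, so observed error rates translate into a lower bound on ε. The following is a minimal sketch of that conversion. The function name `epsilon_lower_bound` is illustrative and not part of `privacy_estimates`, and it uses point estimates, whereas a statistically valid audit would replace them with confidence or credible bounds on the error rates.

```python
import math


def epsilon_lower_bound(fpr: float, fnr: float, delta: float) -> float:
    """Smallest epsilon consistent with the observed error rates.

    Hypothetical helper for illustration; not part of privacy_estimates.
    """

    def one_sided(a: float, b: float) -> float:
        # From a + e^eps * b >= 1 - delta  =>  eps >= log((1 - delta - a) / b)
        numerator = 1.0 - delta - a
        if numerator <= 0.0:
            return 0.0  # constraint holds for every eps >= 0
        if b == 0.0:
            return math.inf  # no finite eps satisfies the constraint
        return max(0.0, math.log(numerator / b))

    return max(one_sided(fpr, fnr), one_sided(fnr, fpr))


# A distinguisher with 5% error of each type, audited at delta = 1e-5.
print(epsilon_lower_bound(fpr=0.05, fnr=0.05, delta=1e-5))
```

A distinguisher that guesses at random (FPR = FNR = 0.5) gives a bound of 0, as expected.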

### Collecting membership inference data

The threat model of Differentially Private Stochastic Gradient Descent (DP-SGD) assumes that the adversary has access to the training data and to every gradient computed during training.
To instantiate a matching adversary, we need to collect membership information during the training process.
We provide a wrapper (`privacy_estimates.experiments.attacks.dpd.CanaryTrackingOptimizer`) for a PyTorch optimizer that can be used with Opacus.

`privacy_estimates.experiments.games.DifferentialPrivacyGameBase` contains code to run the differential privacy distinguisher game.
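
The game can be illustrated without Opacus. The sketch below simulates a single noisy gradient step many times, adds a Dirac canary gradient (assumed here to be a gradient that is non-zero in one coordinate and has norm equal to the clipping norm) with probability 1/2, and scores each trial by projecting the noisy sum onto the canary direction. It is a NumPy simulation under these assumptions, not the `CanaryTrackingOptimizer` implementation; the function `play_dpd_game` and all of its parameter names are hypothetical.

```python
import numpy as np


def play_dpd_game(num_trials, dim, batch_size, clip_norm, noise_multiplier, seed=0):
    """Simulate single-step DP-SGD trials with and without a Dirac canary.

    Returns (scores, is_member): the projection of each noisy gradient sum onto
    the canary direction, and whether the canary was included in that trial.
    """
    rng = np.random.default_rng(seed)
    canary = np.zeros(dim)
    canary[0] = clip_norm  # all mass in one coordinate, norm equal to clip_norm

    is_member = rng.integers(0, 2, size=num_trials).astype(bool)
    scores = np.empty(num_trials)
    for t in range(num_trials):
        # Stand-in per-sample gradients, clipped to clip_norm as in DP-SGD.
        grads = rng.normal(size=(batch_size, dim))
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip_norm / norms)
        clean_sum = grads.sum(axis=0)

        total = clean_sum + canary if is_member[t] else clean_sum
        # Gaussian mechanism: noise standard deviation = noise_multiplier * clip_norm.
        noisy = total + rng.normal(scale=noise_multiplier * clip_norm, size=dim)

        # The adversary knows the other gradients (DP threat model), so it
        # subtracts them and projects the remainder onto the canary direction.
        scores[t] = (noisy - clean_sum) @ canary / clip_norm
    return scores, is_member


scores, is_member = play_dpd_game(
    num_trials=2000, dim=16, batch_size=8, clip_norm=1.0, noise_multiplier=1.0
)
guesses = scores > 0.5  # midpoint between the two score means (0 and clip_norm)
fpr = np.mean(guesses[~is_member])
fnr = np.mean(~guesses[is_member])
print(f"FPR={fpr:.3f}, FNR={fnr:.3f}")
```

Because the adversary removes every known gradient, the score is the canary contribution plus Gaussian noise, which is the worst case that the DP analysis of a single step has to cover.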

## Installation and setup

### Set up the local environment

The local environment is packaged within the `privacy-estimates` package.

```bash
pip install privacy-estimates[pipelines]
```

### Set up Azure ML

Set up an Azure ML workspace, download the `config.json` file ([details](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#workspace)), and add it to `configs/workspace`.

Add `gpu_compute` and `cpu_compute` to the `config.json` to indicate where to run the experiments.
The values should match the names of compute clusters in your workspace.
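
As a quick sanity check before submitting, the workspace file can be validated locally. The sketch below assumes the standard Azure ML `config.json` keys (`subscription_id`, `resource_group`, `workspace_name`) and that the two compute entries described above sit at the top level of the file. The helper name `check_workspace_config` and the cluster names are placeholders, not part of this repository.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: standard Azure ML workspace keys plus the two compute entries.
REQUIRED_KEYS = {
    "subscription_id",
    "resource_group",
    "workspace_name",
    "gpu_compute",
    "cpu_compute",
}


def check_workspace_config(path):
    """Load a workspace config and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Missing keys in {path}: {sorted(missing)}")
    return config


# Placeholder values; replace them with your own workspace details.
example = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
    "gpu_compute": "my-gpu-cluster",
    "cpu_compute": "my-cpu-cluster",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    config = check_workspace_config(config_path)
    print(config["gpu_compute"], config["cpu_compute"])
```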

## Example 2: Differential Privacy Distinguisher for image classification

We can strengthen the adversary to match the threat model assumed by the theoretical bound by using a different attack.
We follow the gradient canary attack by Nasr et al. (2023) to match the differential privacy threat model.

```bash
python estimate_differential_privacy_image_classifier.py --config-name dpd_image_classifier +submit=True
```

## References

Nasr, M., Hayes, J., Steinke, T., Balle, B., Tramèr, F., Jagielski, M., Carlini, N. and Terzis, A., 2023. Tight Auditing of Differentially Private Machine Learning. arXiv preprint arXiv:2302.07956.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Salem, A., Rühle, V., Paverd, A., Naseri, M., Köpf, B. and Jones, D., 2023, July. Bayesian estimation of differential privacy. In International Conference on Machine Learning (pp. 40624-40636). PMLR.
42 changes: 42 additions & 0 deletions
experiments/components/predict-with-cnn-classifier/component_spec.yaml
@@ -0,0 +1,42 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_predict_with_cnn_classifier
display_name: Compute predictions using CNN classifier
version: local1
type: command
description: Compute predictions using CNN classifier
inputs:
  model:
    type: uri_folder
    description: The checkpoint directory
    optional: false
  dataset:
    type: uri_folder
    description: Challenge points data in Parquet format that will be divided to the models
    optional: false
  model_rel_path:
    type: string
    description: Relative path to experiment directory to extract model checkpoints
    default: "./"
    optional: false
  batch_size:
    type: integer
    description: Batch size for the model
    default: 1024
    optional: true
outputs:
  predictions:
    type: uri_folder
    description: Predictions
code: .
additional_includes:
  - "../../models"
command: >-
  python predict_with_cnn_classifier.py \
    --dataset ${{inputs.dataset}} \
    --experiment_dir ${{inputs.model}} \
    --model_rel_path ${{inputs.model_rel_path}} \
    --output ${{outputs.predictions}} \
    $[[ --batch_size ${{inputs.batch_size}} ]]
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
21 changes: 21 additions & 0 deletions
experiments/components/predict-with-cnn-classifier/environment.yaml
@@ -0,0 +1,21 @@
name: PrivacyGames
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.1.1
  - pip=21.2.4
  - python=3.8.13
  - pytorch=1.9.0
  - pip:
    - azureml-core==1.39.0.post1
    - datasets==1.18.4
    - numpy==1.22.3
    - pandas==1.4.1
    - Pillow==9.2.0
    - pyarrow==7.0.0
    - pydantic==1.9.0
    - pydantic-cli==4.3.0
    - scipy==1.8.0
    - tqdm==4.63.1
80 changes: 80 additions & 0 deletions
experiments/components/predict-with-cnn-classifier/predict_with_cnn_classifier.py
@@ -0,0 +1,80 @@
import sys
import torch
import numpy as np

from datasets import load_from_disk, Dataset, Features, Sequence, Value
from pathlib import Path
from pydantic_cli import run_and_exit
from pydantic import BaseModel, Field
from torch.utils.data import DataLoader, TensorDataset


# Make the shared `models` package importable when run from the repository layout.
sys.path.append(str(Path(__file__).parent.parent.parent))
from models.cnn import CNN, compute_prediction_metrics


class Arguments(BaseModel):
    dataset: Path = Field(
        description="Path to the dataset for computing predictions on."
    )
    experiment_dir: Path = Field(
        description="Path to the experiment directory."
    )
    model_rel_path: str = Field(
        default="./", description="Path of the model directory relative to the experiment directory."
    )
    output: Path = Field(
        description="Path to the output file."
    )
    batch_size: int = Field(
        description="Batch size."
    )
    use_cpu: int = Field(
        default=0, description="Whether to use the CPU instead of the GPU."
    )


def main(args: Arguments):
    data = load_from_disk(args.dataset)

    print(f"Loaded dataset: {data.features}")

    model = CNN.load(args.experiment_dir / args.model_rel_path / "model.pt")

    # Keep the original order so predictions align with the input rows.
    data_loader = DataLoader(
        data.with_format("torch"),
        batch_size=args.batch_size,
        shuffle=False
    )

    device = "cpu" if args.use_cpu else "cuda"

    metrics = compute_prediction_metrics(model=model, device=device, data_loader=data_loader)
    print(metrics)

    # Store logits, labels and per-example losses in a Hugging Face dataset.
    predictions = Dataset.from_dict(
        mapping={
            "logits": [logits for logits in metrics.logits] if len(metrics.losses) > 0 else [],
            "label": metrics.labels,
            "loss": metrics.losses,
        },
        features=Features({
            "logits": Sequence(feature=Value(dtype="float64"), length=-1),
            "label": data.features["label"],
            "loss": Value(dtype="float64")
        })
    )

    print(f"Writing {len(predictions)} predictions to file")
    assert len(predictions) == len(data)
    predictions.save_to_disk(args.output)

    return 0


def exception_handler(ex):
    # Re-raise so that the command exits with a failure.
    raise RuntimeError("Command ran with an error.") from ex


if __name__ == "__main__":
    run_and_exit(Arguments, main, exception_handler=exception_handler)
94 changes: 94 additions & 0 deletions
experiments/components/train-cnn-classifier/component_spec.yaml
@@ -0,0 +1,94 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_train_cnn_classifier
display_name: Train CNN classifier
version: local1
type: command
description: Train CNN classifier
inputs:
  train_data:
    type: uri_folder
    description: Training data in Huggingface dataset format
  validation_data:
    type: uri_folder
    description: Validation data in Huggingface dataset format
  seed:
    type: integer
    description: Random seed
  num_train_epochs:
    type: number
    description: Number of training epochs.
  target_epsilon:
    type: number
    description: Target epsilon at the end of training.
  delta:
    type: number
    description: Target delta at the end of training.
  learning_rate:
    type: number
    description: Learning rate.
  max_physical_batch_size:
    type: integer
    description: Largest batch size per device.
  canary_gradient:
    type: string
    description: Canary gradient.
    default: dirac
  total_train_batch_size:
    type: integer
    description: Number of samples between gradient updates
  per_sample_max_grad_norm:
    type: number
    description: Per sample max grad norm for DP training.
  lr_scheduler_gamma:
    type: number
    description: Learning rate scheduler gamma.
    default: 1.0
  logging_steps:
    type: integer
    description: Number of steps before logging
    default: 100
  disable_ml_flow:
    type: integer
    description: Disable ML flow logging to AML
    default: 0
outputs:
  model:
    type: uri_folder
    description: Trained model
  dpd_data:
    type: uri_file
    description: JSON encoded membership inference scores for a differential privacy distinguisher
  dp_parameters:
    type: uri_file
    description: JSON encoded differential privacy parameters
  metrics:
    type: uri_file
    description: JSON encoded metrics
code: .
additional_includes:
  - "../../models"
  - "../../../privacy_estimates/experiments/utils.py"
  - "../../../privacy_estimates/experiments/attacks/dpd"
command: >-
  python train_cnn.py \
    --train_data_path ${{inputs.train_data}} \
    --test_data_path ${{inputs.validation_data}} \
    --target_epsilon ${{inputs.target_epsilon}} \
    --delta ${{inputs.delta}} \
    --learning_rate ${{inputs.learning_rate}} \
    --num_train_epochs ${{inputs.num_train_epochs}} \
    --max_physical_batch_size ${{inputs.max_physical_batch_size}} \
    --per_sample_max_grad_norm ${{inputs.per_sample_max_grad_norm}} \
    --total_train_batch_size ${{inputs.total_train_batch_size}} \
    --output_dir ${{outputs.model}} \
    --seed ${{inputs.seed}} \
    --canary_gradient ${{inputs.canary_gradient}} \
    --lr_scheduler_gamma ${{inputs.lr_scheduler_gamma}} \
    --disable_ml_flow ${{inputs.disable_ml_flow}} \
    --metrics ${{outputs.metrics}} \
    --dpd_data ${{outputs.dpd_data}} \
    --dp_parameters ${{outputs.dp_parameters}} \
    --logging_steps ${{inputs.logging_steps}}
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04