Add differential privacy distinguisher (#4)
wulu473 committed Mar 4, 2024
1 parent a986ff5 commit e7212c2
Showing 49 changed files with 2,802 additions and 4 deletions.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.2.0
0.2.0
2 changes: 2 additions & 0 deletions experiments/.gitignore
@@ -0,0 +1,2 @@
/dev_data
outputs/
55 changes: 55 additions & 0 deletions experiments/README.md
@@ -0,0 +1,55 @@
# Experiments

This folder contains code to run experiments end-to-end in Azure Machine Learning.

We distinguish between three threat models:
1. Black-box membership inference (coming soon)
2. White-box membership inference (coming soon)
3. Differential privacy distinguisher

## Differential privacy distinguisher

We follow the attack of Nasr et al. (2023) to match the differential privacy threat model.
Currently, the distinguisher does not take privacy amplification via subsampling into account and audits only a single step of DP-SGD.
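
To make the recipe concrete, here is a sketch (our illustration, not code from this repository) of how such an audit converts the attack's false positive and false negative rates at a decision threshold into a lower bound on epsilon:

```python
import numpy as np

def empirical_epsilon(scores_in, scores_out, threshold: float, delta: float) -> float:
    """Lower bound on epsilon implied by a membership attack's error rates.

    (eps, delta)-DP implies FPR + exp(eps) * FNR >= 1 - delta (and symmetrically
    with FPR and FNR swapped), so observed error rates bound eps from below.
    """
    scores_in, scores_out = np.asarray(scores_in), np.asarray(scores_out)
    fnr = float(np.mean(scores_in < threshold))    # members the attack misses
    fpr = float(np.mean(scores_out >= threshold))  # non-members it flags
    bounds = []
    if fnr > 0 and 1 - delta - fpr > 0:
        bounds.append(np.log((1 - delta - fpr) / fnr))
    if fpr > 0 and 1 - delta - fnr > 0:
        bounds.append(np.log((1 - delta - fnr) / fpr))
    return max(bounds, default=0.0)
```

In practice, point estimates of FPR and FNR are replaced by confidence regions, e.g. the Bayesian credible intervals of Zanella-Béguelin et al. (2023), so that the reported epsilon holds with high probability.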

### Collecting membership inference data

The threat model of Differentially Private Stochastic Gradient Descent (DP-SGD) assumes that the adversary has access to the training data and to every gradient in the training process.
To instantiate a matching adversary, we need to collect membership information during training.
We provide a wrapper (`privacy_estimates.experiments.attacks.dpd.CanaryTrackingOptimizer`) around a PyTorch optimizer that can be used with Opacus, as sketched below.
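
A minimal sketch of how the wrapper might slot into an Opacus training setup; the `CanaryTrackingOptimizer` constructor arguments here are assumptions, so check the package source for the exact signature:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Wrapper named in this README; the keyword argument below is an assumption.
from privacy_estimates.experiments.attacks.dpd import CanaryTrackingOptimizer

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=16,
)

# Standard Opacus setup: make the model, optimizer, and loader privacy-aware.
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# Wrap the DP optimizer so per-step canary information is recorded during training.
optimizer = CanaryTrackingOptimizer(optimizer=optimizer)
```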

`privacy_estimates.experiments.games.DifferentialPrivacyGameBase` contains the code to run the differential privacy distinguisher game.
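
Conceptually, the game repeatedly flips a secret coin, runs one DP-SGD step with or without the canary gradient, and records the adversary's score; the two score populations then feed the epsilon estimate sketched above. An illustrative loop (function names are placeholders, not the `DifferentialPrivacyGameBase` API):

```python
import random

def play_distinguisher_game(train_one_step, score_canary, num_trials: int):
    """One-step DP distinguisher game (illustrative).

    `train_one_step(with_canary)` runs a single DP-SGD step and returns the
    noisy aggregated gradient; `score_canary` maps it to an attack score.
    Both stand in for components provided by the package.
    """
    scores_in, scores_out = [], []
    for _ in range(num_trials):
        with_canary = random.random() < 0.5  # secret bit of the game
        noisy_grad = train_one_step(with_canary=with_canary)
        (scores_in if with_canary else scores_out).append(score_canary(noisy_grad))
    return scores_in, scores_out
```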

## Installation and setup

### Set up the local environment

The local environment is provided by the `privacy-estimates` package:

```bash
pip install privacy-estimates[pipelines]
```

### Set up Azure ML

Set up an Azure ML workspace, download its `config.json` file ([details](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#workspace)), and add it to `configs/workspace`.

Add `gpu_compute` and `cpu_compute` keys to the `config.json` to indicate where to run the experiments.
The values must match the names of compute clusters in your workspace.
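
For illustration, the resulting `config.json` might look as follows (all values are placeholders):

```json
{
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
    "gpu_compute": "<gpu-cluster-name>",
    "cpu_compute": "<cpu-cluster-name>"
}
```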


## Example 2: Differential privacy distinguisher for image classification

We can strengthen the threat model to match the theoretical bound by using a stronger attack.
We follow the gradient canary attack of Nasr et al. (2023) to match the differential privacy threat model.

```bash
python estimate_differential_privacy_image_classifier.py --config-name dpd_image_classifier +submit=True
```

## References

Nasr, M., Hayes, J., Steinke, T., Balle, B., Tramèr, F., Jagielski, M., Carlini, N. and Terzis, A., 2023. Tight Auditing of Differentially Private Machine Learning. arXiv preprint arXiv:2302.07956.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Salem, A., Rühle, V., Paverd, A., Naseri, M., Köpf, B. and Jones, D., 2023, July. Bayesian estimation of differential privacy. In International Conference on Machine Learning (pp. 40624-40636). PMLR.

42 changes: 42 additions & 0 deletions experiments/components/predict-with-cnn-classifier/component_spec.yaml
@@ -0,0 +1,42 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_predict_with_cnn_classifier
display_name: Compute predictions using CNN classifier
version: local1
type: command
description: Compute predictions using CNN classifier
inputs:
  model:
    type: uri_folder
    description: The checkpoint directory
    optional: false
  dataset:
    type: uri_folder
    description: Data to compute predictions on, in Huggingface dataset format
    optional: false
  model_rel_path:
    type: string
    description: Path to the model checkpoint, relative to the experiment directory
    default: "./"
    optional: false
  batch_size:
    type: integer
    description: Batch size for the model
    default: 1024
    optional: true
outputs:
  predictions:
    type: uri_folder
    description: Predictions
code: .
additional_includes:
  - "../../models"
command: >-
  python predict_with_cnn_classifier.py \
    --dataset ${{inputs.dataset}} \
    --experiment_dir ${{inputs.model}} \
    --model_rel_path ${{inputs.model_rel_path}} \
    --output ${{outputs.predictions}} \
    $[[ --batch_size ${{inputs.batch_size}} ]]
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
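
A component spec like the one above can be consumed from the Azure ML Python SDK v2. A brief sketch (assumes the `azure-ai-ml` package is installed; the path is illustrative):

```python
from azure.ai.ml import load_component

# Load the command component defined by the YAML spec above.
predict = load_component(
    source="experiments/components/predict-with-cnn-classifier/component_spec.yaml"
)
print(predict.inputs)  # model, dataset, model_rel_path, batch_size
```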
21 changes: 21 additions & 0 deletions experiments/components/predict-with-cnn-classifier/environment.yaml
@@ -0,0 +1,21 @@
name: PrivacyGames
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.1.1
  - pip=21.2.4
  - python=3.8.13
  - pytorch=1.9.0
  - pip:
      - azureml-core==1.39.0.post1
      - datasets==1.18.4
      - numpy==1.22.3
      - pandas==1.4.1
      - Pillow==9.2.0
      - pyarrow==7.0.0
      - pydantic==1.9.0
      - pydantic-cli==4.3.0
      - scipy==1.8.0
      - tqdm==4.63.1
80 changes: 80 additions & 0 deletions experiments/components/predict-with-cnn-classifier/predict_with_cnn_classifier.py
@@ -0,0 +1,80 @@
import sys
import torch  # required for datasets' "torch" format below

from datasets import load_from_disk, Dataset, Features, Sequence, Value
from pathlib import Path
from pydantic_cli import run_and_exit
from pydantic import BaseModel, Field
from torch.utils.data import DataLoader


sys.path.append(str(Path(__file__).parent.parent.parent))
from models.cnn import CNN, compute_prediction_metrics


class Arguments(BaseModel):
    dataset: Path = Field(
        description="Path to the dataset to compute predictions on."
    )
    experiment_dir: Path = Field(
        description="Path to the experiment directory."
    )
    model_rel_path: str = Field(
        default="./", description="Path to the model checkpoint, relative to the experiment directory."
    )
    output: Path = Field(
        description="Path to the output file."
    )
    batch_size: int = Field(
        description="Batch size."
    )
    use_cpu: int = Field(
        default=0, description="Whether to use the CPU instead of the GPU."
    )


def main(args: Arguments):
    data = load_from_disk(args.dataset)

    print(f"Loaded dataset: {data.features}")

    model = CNN.load(args.experiment_dir / args.model_rel_path / "model.pt")

    # Keep the original order (shuffle=False) so predictions line up with input rows.
    data_loader = DataLoader(
        data.with_format("torch"),
        batch_size=args.batch_size,
        shuffle=False
    )

    device = "cpu" if args.use_cpu else "cuda"

    metrics = compute_prediction_metrics(model=model, device=device, data_loader=data_loader)
    print(metrics)

    # Persist predictions as a Huggingface dataset with one row per input example.
    predictions = Dataset.from_dict(
        mapping={
            "logits": list(metrics.logits) if len(metrics.losses) > 0 else [],
            "label": metrics.labels,
            "loss": metrics.losses,
        },
        features=Features({
            "logits": Sequence(feature=Value(dtype="float64"), length=-1),
            "label": data.features["label"],
            "loss": Value(dtype="float64")
        })
    )

    print(f"Writing {len(predictions)} predictions to file")
    assert len(predictions) == len(data)
    predictions.save_to_disk(args.output)

    return 0


def exception_handler(ex):
    raise RuntimeError("Command ran with an error.") from ex


if __name__ == "__main__":
    run_and_exit(Arguments, main, exception_handler=exception_handler)
94 changes: 94 additions & 0 deletions experiments/components/train-cnn-classifier/component_spec.yaml
@@ -0,0 +1,94 @@
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: privacy_estimates_experiments_train_cnn_classifier
display_name: Train CNN classifier
version: local1
type: command
description: Train CNN classifier
inputs:
  train_data:
    type: uri_folder
    description: Training data in Huggingface dataset format
  validation_data:
    type: uri_folder
    description: Validation data in Huggingface dataset format
  seed:
    type: integer
    description: Random seed
  num_train_epochs:
    type: number
    description: Number of training epochs.
  target_epsilon:
    type: number
    description: Target epsilon at the end of training.
  delta:
    type: number
    description: Target delta at the end of training.
  learning_rate:
    type: number
    description: Learning rate.
  max_physical_batch_size:
    type: integer
    description: Largest batch size per device.
  canary_gradient:
    type: string
    description: Canary gradient.
    default: dirac
  total_train_batch_size:
    type: integer
    description: Number of samples between gradient updates
  per_sample_max_grad_norm:
    type: number
    description: Per sample max grad norm for DP training.
  lr_scheduler_gamma:
    type: number
    description: Learning rate scheduler gamma.
    default: 1.0
  logging_steps:
    type: integer
    description: Number of steps before logging
    default: 100
  disable_ml_flow:
    type: integer
    description: Disable ML flow logging to AML
    default: 0
outputs:
  model:
    type: uri_folder
    description: Trained model
  dpd_data:
    type: uri_file
    description: JSON encoded membership inference scores for a differential privacy distinguisher
  dp_parameters:
    type: uri_file
    description: JSON encoded differential privacy parameters
  metrics:
    type: uri_file
    description: JSON encoded metrics
code: .
additional_includes:
  - "../../models"
  - "../../../privacy_estimates/experiments/utils.py"
  - "../../../privacy_estimates/experiments/attacks/dpd"
command: >-
  python train_cnn.py \
    --train_data_path ${{inputs.train_data}} \
    --test_data_path ${{inputs.validation_data}} \
    --target_epsilon ${{inputs.target_epsilon}} \
    --delta ${{inputs.delta}} \
    --learning_rate ${{inputs.learning_rate}} \
    --num_train_epochs ${{inputs.num_train_epochs}} \
    --max_physical_batch_size ${{inputs.max_physical_batch_size}} \
    --per_sample_max_grad_norm ${{inputs.per_sample_max_grad_norm}} \
    --total_train_batch_size ${{inputs.total_train_batch_size}} \
    --output_dir ${{outputs.model}} \
    --seed ${{inputs.seed}} \
    --canary_gradient ${{inputs.canary_gradient}} \
    --lr_scheduler_gamma ${{inputs.lr_scheduler_gamma}} \
    --disable_ml_flow ${{inputs.disable_ml_flow}} \
    --metrics ${{outputs.metrics}} \
    --dpd_data ${{outputs.dpd_data}} \
    --dp_parameters ${{outputs.dp_parameters}} \
    --logging_steps ${{inputs.logging_steps}}
environment:
  conda_file: environment.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
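
The `canary_gradient` input defaults to `dirac`, i.e. a canary whose gradient puts all of its mass on a single coordinate at the clipping norm, which reduces the audit to a one-dimensional Gaussian test. A sketch of the idea (our illustration; the repository's implementation lives under `privacy_estimates/experiments/attacks/dpd`):

```python
import torch

def dirac_canary(num_params: int, max_grad_norm: float, index: int = 0) -> torch.Tensor:
    """All gradient mass on one coordinate, scaled to the clipping norm so the
    canary passes per-sample clipping unchanged."""
    gradient = torch.zeros(num_params)
    gradient[index] = max_grad_norm
    return gradient

def canary_score(noisy_grad_sum: torch.Tensor, canary: torch.Tensor) -> float:
    """Project the noisy aggregated gradient onto the canary direction.
    With the canary present the projection concentrates around ||canary||;
    without it, around 0, with the same Gaussian noise in both cases."""
    return float(noisy_grad_sum @ canary / canary.norm())
```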