# Dataset Usage Cardinality Inference Demo
Did your model train on my data? More importantly, does that use qualify as fair use, or does it infringe my copyright? As AI continues to advance, this question becomes increasingly critical. Under Section 107 of the U.S. Copyright Act, deciding whether a use is fair requires weighing several factors, including the "_nature of the copyrighted work_" and the "_amount and substantiality of the portion used in relation to the copyrighted work_." This raises a key question: **how much of a given dataset was used to train a machine learning model?**

Dataset Usage Cardinality Inference (DUCI) provides an answer. It lets data owners assess the risk of unauthorized usage and protect their rights by estimating the proportion of their data that was used. DUCI does this through a debiasing process that aggregates individual Membership Inference Attack (MIA) guesses into a single proportion estimate.

## Problem Overview
The Dataset Usage Cardinality Inference (DUCI) algorithm acts as an agent for the dataset owner, who has full access to a target dataset. It aims to estimate the proportion of the target dataset used in training a victim model, given black-box access to the model and knowledge of the training algorithm (e.g., the population data and model architecture).

<img src="documentation/images/duci_problem.png" alt="Problem Illustration" title="Simple DUCI Pipeline" width="600">
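
The quantity being estimated can be written down explicitly. In the sketch below, $D_{\text{train}}$ (the victim model's training set) and $\hat{p}$ (the DUCI estimate) are symbols introduced here for illustration; $X$ and $p$ are used as in the rest of this notebook.

$$
p = \frac{|X \cap D_{\text{train}}|}{|X|}, \qquad \hat{p} \approx p
$$

The dataset owner knows $X$ but not $D_{\text{train}}$, so $p$ cannot be computed directly and has to be inferred from the model's behavior on the points in $X$.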

## Set up the Colab environment

In [1]:
# !pip install numpy torch  # Install NumPy and PyTorch if not already installed

In [2]:
from torch.utils.data import Subset
import logging
import numpy as np
import random
import time


# Set up the logger
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

In [3]:
from dataset import get_dataset
from models.utils import train_models, load_models
from get_signals import get_model_signals
from audit import sample_auditing_dataset
from modules.duci import DUCI

2025-02-06 10:48:40,503 - INFO - PyTorch version 2.4.0 available.
2025-02-06 10:48:41,803 - DEBUG - matplotlib data path: /home/yao/.conda/envs/pytorch/lib/python3.8/site-packages/matplotlib/mpl-data
2025-02-06 10:48:41,808 - DEBUG - CONFIGDIR=/home/yao/.config/matplotlib
2025-02-06 10:48:41,809 - DEBUG - interactive is False
2025-02-06 10:48:41,810 - DEBUG - platform is linux
2025-02-06 10:48:42,000 - DEBUG - CACHEDIR=/home/yao/.cache/matplotlib
2025-02-06 10:48:42,003 - DEBUG - Using fontManager instance from /home/yao/.cache/matplotlib/fontlist-v330.json


## Prepare dataset
As the dataset owner, we have a target dataset $X$ and access to a population pool. For simplicity, assume the population pool is the CIFAR-10 dataset, and we sample a subset $X$ of size 500 from this pool.

In [4]:
# Set Configs
_dataset = 'cifar10' # cifar10 as the population pool
dataset_dir = 'data'
log_dir = 'demo_duci'
configs = {
    'run': {
        'random_seed': 12345,
        'log_dir': 'demo_duci',
        'time_log': True,
        'num_experiments': 1
    },
    'audit': {
        'privacy_game': 'privacy_loss_model',
        'algorithm': 'RMIA',
        'num_ref_models': 1,
        'device': 'cuda:0',
        'report_log': 'report_rmia',
        'batch_size': 5000
    },
    'train': {
        'model_name': 'wrn28-2',
        'device': 'cuda:0',
        'batch_size': 256,
        'optimizer': 'SGD',
        'learning_rate': 0.1,
        'weight_decay': 0,
        'epochs': 100
    },
    'data': {
        'dataset': 'cifar10',
        'data_dir': 'data'
    }
}



In [5]:
dataset, population = get_dataset(_dataset, dataset_dir, logger)

# Select 500 points from the dataset as the target dataset
all_indices = list(range(len(dataset)))
random.shuffle(all_indices)
target_indices = all_indices[:500]
remaining_indices = all_indices[500:]
TRAIN_SIZE = len(dataset) // 2

2025-02-06 10:48:42,243 - INFO - Data loaded from data/cifar10.pkl
2025-02-06 10:48:42,257 - INFO - Population data loaded from data/cifar10_population.pkl
2025-02-06 10:48:42,258 - INFO - The whole dataset size: 50000
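
The configuration defines `random_seed`, but the sampling cell above calls `random.shuffle` without seeding; whether the library applies the seed elsewhere is not visible in this notebook. A minimal sketch of a reproducible version of the same sampling step follows. The helper name `sample_target_indices` is hypothetical and not part of the library.

```python
import random

def sample_target_indices(dataset_size, target_size, seed):
    """Shuffle all indices with a dedicated RNG and split off the target set."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    all_indices = list(range(dataset_size))
    rng.shuffle(all_indices)
    return all_indices[:target_size], all_indices[target_size:]

# Same seed, same target set: the audit can be repeated on identical data.
target_a, remaining_a = sample_target_indices(50000, 500, seed=12345)
target_b, _ = sample_target_indices(50000, 500, seed=12345)
print(target_a == target_b)
```

Using a local `random.Random` instance rather than the module-level functions keeps the sampling independent of any other code that draws random numbers.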


## Set up the victim model
Suppose a victim model is trained on a randomly selected proportion $p$ of our dataset $X$. Our goal is to infer the value of $p$.

In [6]:
proportions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
p = random.choice(proportions)

In [7]:
# Randomly select a proportion p of the target dataset
selected_indices = random.sample(target_indices, int(p * len(target_indices)))
remaining_size = TRAIN_SIZE - len(selected_indices)
selected_remaining_indices = random.sample(remaining_indices, remaining_size)
selected_victim_indices = selected_indices + selected_remaining_indices

# select all unselected indices in all_indices as the test indices
test_indices = list(set(all_indices) - set(selected_indices) - set(selected_remaining_indices))
target_data_split = {
    'train': selected_victim_indices,
    'test': test_indices
}
target_membership = np.zeros(len(dataset))
target_membership[selected_victim_indices] = 1

## Train reference models

In the **Privacy Meter** library, $2N$ reference models are trained by default, ensuring that each data point is included in one model's training set and excluded from another. We first explore dataset usage inference using two reference models before moving to the special case of single-reference models (by adapting the MIA implementations in the library).
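
The pairing property described above (each point is in one reference model's training set and out of the other's) can be illustrated on a toy example. This is only a sketch with a 10-point stand-in for the target dataset; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration

num_target = 10  # toy stand-in for the 500-point target dataset
target_ids = np.arange(num_target)

# Reference model A trains on a random half of the target points.
in_ref_a = np.zeros(num_target, dtype=bool)
in_ref_a[rng.choice(target_ids, size=num_target // 2, replace=False)] = True

# The paired reference model B trains on exactly the other half.
in_ref_b = ~in_ref_a

# Every target point is a member of exactly one of the two models.
assert np.all(in_ref_a ^ in_ref_b)
print(in_ref_a.sum(), in_ref_b.sum())  # prints: 5 5
```

With this construction, every target point contributes one "member" and one "non-member" observation, which is what allows true-positive and false-positive rates to be estimated on the same data.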

In [8]:
# Randomly select half of the target dataset
ref_selected_indices = random.sample(target_indices, int(0.5 * len(target_indices)))
ref_remaining_size = TRAIN_SIZE - len(ref_selected_indices)
ref_selected_remaining_indices = random.sample(remaining_indices, ref_remaining_size)
ref_selected_victim_indices = ref_selected_indices + ref_selected_remaining_indices

# select all unselected indices in all_indices as the test indices
ref_test_indices = list(set(all_indices) - set(ref_selected_indices) - set(ref_selected_remaining_indices))
ref_data_split = {
    'train': ref_selected_victim_indices,
    'test': ref_test_indices
}
ref_membership = np.zeros(len(dataset))
ref_membership[ref_selected_victim_indices] = 1

# Build the paired reference model, which trains on the other half of the target dataset
ref_paired_selected_indices = list(set(target_indices)-set(ref_selected_indices))
ref_paired_remaining_size = TRAIN_SIZE - len(ref_paired_selected_indices)
ref_paired_selected_remaining_indices = random.sample(remaining_indices, ref_paired_remaining_size)
ref_paired_selected_victim_indices = ref_paired_selected_indices + ref_paired_selected_remaining_indices

# select all unselected indices in all_indices as the test indices
ref_paired_test_indices = list(set(all_indices) - set(ref_paired_selected_indices) - set(ref_paired_selected_remaining_indices))
ref_paired_data_split = {
    'train': ref_paired_selected_victim_indices,
    'test': ref_paired_test_indices
}
ref_paired_membership = np.zeros(len(dataset))
ref_paired_membership[ref_paired_selected_victim_indices] = 1
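
The victim, reference, and paired-reference splits above all follow the same pattern: take a subset of the target points, pad it with non-target points up to the training size, and treat everything else as test data. One way to factor that out is sketched below. The helper `build_split` is hypothetical (it is not part of the library) and is shown on a toy dataset of 100 points.

```python
import random
import numpy as np

def build_split(dataset_size, target_subset, remaining_indices, train_size, rng):
    """Pad a subset of target points with non-target points and mark membership."""
    filler = rng.sample(remaining_indices, train_size - len(target_subset))
    train = list(target_subset) + filler
    train_set = set(train)
    test = [i for i in range(dataset_size) if i not in train_set]
    membership = np.zeros(dataset_size)
    membership[train] = 1
    return {"train": train, "test": test}, membership

# Toy usage: 100 points, 10 of them targets, 4 targets used for training.
rng = random.Random(0)
indices = list(range(100))
rng.shuffle(indices)
target, remaining = indices[:10], indices[10:]
split, membership = build_split(100, target[:4], remaining, 50, rng)
print(len(split["train"]), len(split["test"]), membership[target].sum())
```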


In [9]:
data_splits = [target_data_split, ref_data_split, ref_paired_data_split]
memberships = np.array([target_membership, ref_membership, ref_paired_membership]) # shape: (2N + 1, len(dataset)), here N = 1
models_list = train_models(
    log_dir, dataset, data_splits, memberships, configs, logger
)

2025-02-06 10:48:42,377 - INFO - Training 3 models
2025-02-06 10:48:42,378 - INFO - --------------------------------------------------
2025-02-06 10:48:42,379 - INFO - Training model 0: Train size 25000, Test size 25000


Using optimizer: SGD | Learning Rate: 0.1 | Weight Decay: 0
Epoch [1/100] | Train Loss: 2.2799 | Train Acc: 0.1453
Test Loss: 2.1548 | Test Acc: 0.2206
Epoch 1 took 5.23 seconds
Epoch [2/100] | Train Loss: 1.9888 | Train Acc: 0.2802
Test Loss: 1.8909 | Test Acc: 0.3122
Epoch 2 took 4.48 seconds
Epoch [3/100] | Train Loss: 1.8115 | Train Acc: 0.3394
Test Loss: 1.7629 | Test Acc: 0.3470
Epoch 3 took 4.48 seconds
Epoch [4/100] | Train Loss: 1.7004 | Train Acc: 0.3787
Test Loss: 1.6773 | Test Acc: 0.3786
Epoch 4 took 4.47 seconds
Epoch [5/100] | Train Loss: 1.5988 | Train Acc: 0.4187
Test Loss: 1.6252 | Test Acc: 0.4007
Epoch 5 took 4.51 seconds
Epoch [6/100] | Train Loss: 1.4959 | Train Acc: 0.4536
Test Loss: 1.6081 | Test Acc: 0.4052
Epoch 6 took 4.56 seconds
Epoch [7/100] | Train Loss: 1.4038 | Train Acc: 0.4946
Test Loss: 1.6123 | Test Acc: 0.4241
Epoch 7 took 4.49 seconds
Epoch [8/100] | Train Loss: 1.3240 | Train Acc: 0.5260
Test Loss: 1.3616 | Test Acc: 0.5117
Epoch 8 took 4.50 seco

2025-02-06 10:56:19,160 - INFO - Train accuracy 1.0, Train Loss 0.0030490692529105104
2025-02-06 10:56:19,161 - INFO - Test accuracy 0.62864, Test Loss 1.4608709082311513
2025-02-06 10:56:19,176 - INFO - Training model 0 took 456.7981164455414 seconds
2025-02-06 10:56:19,194 - INFO - --------------------------------------------------
2025-02-06 10:56:19,195 - INFO - Training model 1: Train size 25000, Test size 25000


Using optimizer: SGD | Learning Rate: 0.1 | Weight Decay: 0
Epoch [1/100] | Train Loss: 2.2567 | Train Acc: 0.1633
Test Loss: 2.1042 | Test Acc: 0.2538
Epoch 1 took 4.77 seconds
Epoch [2/100] | Train Loss: 1.9471 | Train Acc: 0.2964
Test Loss: 1.8366 | Test Acc: 0.3292
Epoch 2 took 4.53 seconds
Epoch [3/100] | Train Loss: 1.7635 | Train Acc: 0.3561
Test Loss: 1.7545 | Test Acc: 0.3572
Epoch 3 took 4.51 seconds
Epoch [4/100] | Train Loss: 1.6477 | Train Acc: 0.3928
Test Loss: 1.6086 | Test Acc: 0.4052
Epoch 4 took 4.53 seconds
Epoch [5/100] | Train Loss: 1.5520 | Train Acc: 0.4299
Test Loss: 1.6251 | Test Acc: 0.3972
Epoch 5 took 4.54 seconds
Epoch [6/100] | Train Loss: 1.4620 | Train Acc: 0.4654
Test Loss: 1.4798 | Test Acc: 0.4474
Epoch 6 took 4.52 seconds
Epoch [7/100] | Train Loss: 1.3772 | Train Acc: 0.4981
Test Loss: 1.4301 | Test Acc: 0.4773
Epoch 7 took 4.56 seconds
Epoch [8/100] | Train Loss: 1.3029 | Train Acc: 0.5299
Test Loss: 1.3670 | Test Acc: 0.5102
Epoch 8 took 4.55 seco

2025-02-06 11:03:55,267 - INFO - Train accuracy 1.0, Train Loss 0.0037437779781389602
2025-02-06 11:03:55,268 - INFO - Test accuracy 0.63904, Test Loss 1.419813417658514
2025-02-06 11:03:55,278 - INFO - Training model 1 took 456.08333253860474 seconds
2025-02-06 11:03:55,295 - INFO - --------------------------------------------------
2025-02-06 11:03:55,295 - INFO - Training model 2: Train size 25000, Test size 25000


Using optimizer: SGD | Learning Rate: 0.1 | Weight Decay: 0
Epoch [1/100] | Train Loss: 2.2485 | Train Acc: 0.1440
Test Loss: 2.0968 | Test Acc: 0.2458
Epoch 1 took 4.78 seconds
Epoch [2/100] | Train Loss: 1.9579 | Train Acc: 0.2927
Test Loss: 1.8407 | Test Acc: 0.3178
Epoch 2 took 4.55 seconds
Epoch [3/100] | Train Loss: 1.7865 | Train Acc: 0.3446
Test Loss: 1.7790 | Test Acc: 0.3464
Epoch 3 took 4.57 seconds
Epoch [4/100] | Train Loss: 1.6700 | Train Acc: 0.3875
Test Loss: 1.6581 | Test Acc: 0.3898
Epoch 4 took 4.52 seconds
Epoch [5/100] | Train Loss: 1.5611 | Train Acc: 0.4328
Test Loss: 1.6060 | Test Acc: 0.4046
Epoch 5 took 4.52 seconds
Epoch [6/100] | Train Loss: 1.4670 | Train Acc: 0.4691
Test Loss: 1.4764 | Test Acc: 0.4604
Epoch 6 took 4.54 seconds
Epoch [7/100] | Train Loss: 1.3832 | Train Acc: 0.5008
Test Loss: 1.8152 | Test Acc: 0.3603
Epoch 7 took 4.53 seconds
Epoch [8/100] | Train Loss: 1.3155 | Train Acc: 0.5279
Test Loss: 1.4083 | Test Acc: 0.4900
Epoch 8 took 4.56 seco

2025-02-06 11:11:30,969 - INFO - Train accuracy 1.0, Train Loss 0.0027838589371733218
2025-02-06 11:11:30,970 - INFO - Test accuracy 0.62692, Test Loss 1.5007508044340172
2025-02-06 11:11:30,980 - INFO - Training model 2 took 455.6854817867279 seconds


In [None]:
# models_list, memberships = load_models(log_dir, dataset, 3, configs, logger)
# target_membership, ref_membership, ref_paired_membership = memberships

## Inferring $p$
We have imported the DUCI module using: `from modules.duci import DUCI`. In DUCI, the standard MIA is executed first, so we need to set up the MIA before proceeding.
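
To make the debiasing idea from the introduction concrete, here is a simplified sketch. It assumes binary MIA guesses and a single global true-positive rate (TPR) and false-positive rate (FPR) estimated on a reference model whose membership is known. The function `debias_proportion` is hypothetical, and the library's `DUCI` class may differ, for instance by debiasing each sample separately. The idea is that the expected fraction of "member" guesses is $p \cdot \mathrm{TPR} + (1 - p) \cdot \mathrm{FPR}$, which can be solved for $p$.

```python
import numpy as np

def debias_proportion(target_preds, ref_preds, ref_membership):
    """Estimate the used proportion from binary MIA guesses (simplified sketch)."""
    target_preds = np.asarray(target_preds, dtype=float)
    ref_preds = np.asarray(ref_preds, dtype=float)
    ref_membership = np.asarray(ref_membership, dtype=bool)

    # Error rates of the attack, measured where the ground truth is known.
    tpr = ref_preds[ref_membership].mean()
    fpr = ref_preds[~ref_membership].mean()
    if np.isclose(tpr, fpr):
        raise ValueError("MIA is no better than chance on the reference model")

    # Invert E[guess] = p * TPR + (1 - p) * FPR.
    raw_rate = target_preds.mean()
    estimate = (raw_rate - fpr) / (tpr - fpr)
    # Clipping to [0, 1] is a choice made in this sketch.
    return float(np.clip(estimate, 0.0, 1.0))

# Toy data: the attack has TPR 0.8 and FPR 0.2 on the reference model.
ref_membership = np.array([True] * 10 + [False] * 10)
ref_preds = np.array([1] * 8 + [0] * 2 + [1] * 2 + [0] * 8)
# 22 of 50 target points are guessed to be members (raw rate 0.44).
target_preds = np.array([1] * 22 + [0] * 28)
estimate = debias_proportion(target_preds, ref_preds, ref_membership)
print(round(estimate, 3))
```

In this toy case the raw rate of 0.44 overstates the usage; correcting for the attack's error rates brings the estimate back to roughly 0.4.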

In [11]:
# Sample the population dataset used in MIA
population = Subset(
    population,
    np.random.choice(
        len(population),
        configs["audit"].get("population_size", len(population)),
        replace=False,
    ),
)

### Query the victim model and the reference models to generate signals (softmax outputs)

In [12]:
baseline_time = time.time()
auditing_dataset = Subset(dataset, target_indices)
auditing_membership = np.array([target_membership[target_indices], ref_membership[target_indices], ref_paired_membership[target_indices]]).astype(bool) # shape: (3, len(auditing_dataset)); transposed below
signals = get_model_signals(models_list, auditing_dataset, configs, logger) # num_samples * num_models
auditing_membership = auditing_membership.T
assert signals.shape == auditing_membership.shape, f"signals or auditing_membership has incorrect shape (num_samples * num_models): {signals.shape} vs {auditing_membership.shape}"
population_signals = get_model_signals(
    models_list, population, configs, logger, is_population=True
)
logger.info("Preparing signals took %0.5f seconds", time.time() - baseline_time)

2025-02-06 11:11:31,382 - INFO - Computing signals for all models.
Computing softmax: 100%|██████████| 1/1 [00:00<00:00, 15.19it/s]
Computing softmax: 100%|██████████| 1/1 [00:00<00:00, 50.74it/s]
Computing softmax: 100%|██████████| 1/1 [00:00<00:00, 50.73it/s]
2025-02-06 11:11:31,536 - INFO - Signals saved to disk.
2025-02-06 11:11:32,651 - INFO - Computing signals for all models.
Computing softmax: 100%|██████████| 2/2 [00:00<00:00,  5.43it/s]
Computing softmax: 100%|██████████| 2/2 [00:00<00:00,  5.58it/s]
Computing softmax: 100%|██████████| 2/2 [00:00<00:00,  5.56it/s]
2025-02-06 11:11:33,777 - INFO - Signals saved to disk.
2025-02-06 11:11:33,785 - INFO - Preparing signals took 2.68006 seconds


In [13]:
auditing_membership[:, 0].mean(), auditing_membership[:, 1].mean(), auditing_membership[:, 2].mean()

(0.4, 0.5, 0.5)

### Perform DUCI

In [14]:
baseline_time = time.time()
target_model_idx = 0
ref_model_indices = [1, 2]

logger.info(f"Initiate DUCI for target models: {target_model_idx}")
# args = {
#     "attack": "RMIA",
#     "dataset": configs["data"]["dataset"], # TODO: have DUCI config
#     "model": configs["train"]["model_name"],
#     "offline_a": None
# }
args = {
    "attack": "RMIA",
    "dataset": configs["data"]["dataset"], # TODO: have DUCI config
    "model": configs["train"]["model_name"],
    "offline_a": 0.3
}
DUCI_instance = DUCI(logger, args)

logger.info("Collecting membership prediction for each sample in the target dataset on target models and reference models.")
logger.info("Predicting the proportion of dataset usage on target models.")

duci_preds, true_proportions, errors = DUCI_instance.pred_proportions(
    [target_model_idx], 
    [ref_model_indices], 
    signals,
    population_signals,
    auditing_membership,
)

logger.info(
    "DUCI %0.1f seconds", time.time() - baseline_time
)
logger.info(f"Average prediction errors: {np.mean(errors)}")
logger.info(f"All prediction errors: {errors}")
logger.info(f"Prediction details: DUCI predictions: {duci_preds}, True proportions: {true_proportions}")



2025-02-06 11:11:33,809 - INFO - Initiate DUCI for target models: 0
2025-02-06 11:11:33,812 - INFO - Collecting membership prediction for each sample in the target dataset on target models and reference models.
2025-02-06 11:11:33,813 - INFO - Predicting the proportion of dataset usage on target models.
2025-02-06 11:11:33,814 - INFO - Args for MIA attack: {'attack': 'RMIA', 'dataset': 'cifar10', 'model': 'wrn28-2', 'offline_a': 0.3}
2025-02-06 11:11:33,814 - INFO - Running RMIA attack on target model 0 with offline_a=0.3
2025-02-06 11:11:33,829 - INFO - Collect membership prediction for target dataset on target model 0 costs 0.0 seconds
2025-02-06 11:11:33,830 - INFO - Args for MIA attack: {'attack': 'RMIA', 'dataset': 'cifar10', 'model': 'wrn28-2', 'offline_a': 0.3}
2025-02-06 11:11:33,831 - INFO - Running RMIA attack on target model 1 with offline_a=0.3
2025-02-06 11:11:33,843 - INFO - Args for MIA attack: {'attack': 'RMIA', 'dataset': 'cifar10', 'model': 'wrn28-2', 'offline_a': 0.3
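
The `offline_a` value passed in `args` above is not explained in this notebook. Assuming it plays the role of the parameter $a$ in the offline variant of RMIA as described in the RMIA paper, it approximates the probability a model would assign to a point it trained on as $a \cdot \Pr_{\text{OUT}}(x) + (1 - a)$, so the marginal $\Pr(x)$ can be estimated from models that did not train on $x$. The sketch below illustrates that approximation only; the function names are hypothetical, and it is not the library's implementation.

```python
import numpy as np

def offline_marginal(p_out, offline_a):
    """Approximate Pr(x) using only OUT-model probabilities.

    Assumes Pr_IN(x) ~ a * Pr_OUT(x) + (1 - a), so that
    Pr(x) = 0.5 * (Pr_IN(x) + Pr_OUT(x)) = 0.5 * ((1 + a) * Pr_OUT(x) + (1 - a)).
    """
    p_out = np.asarray(p_out, dtype=float)
    return 0.5 * ((1.0 + offline_a) * p_out + (1.0 - offline_a))

def likelihood_ratio(p_target, p_out, offline_a):
    """Ratio of the target model's probability to the approximated marginal."""
    return np.asarray(p_target, dtype=float) / offline_marginal(p_out, offline_a)

# With a = 1, IN and OUT probabilities are assumed equal, so Pr(x) = Pr_OUT(x).
print(offline_marginal([0.5], offline_a=1.0))
# With a = 0.3, the marginal is pulled toward 1 for confidently learned points.
print(offline_marginal([0.5], offline_a=0.3))
```

A ratio well above 1 means the target model is more confident on the point than the approximated average model, which is evidence of membership.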

### Check the prediction and $p$

In [15]:
p

0.4

In [16]:
duci_preds[0]

0.37694704049844235
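
To put the two numbers above side by side, the absolute error can be computed directly. The values below are copied from the outputs of the two preceding cells.

```python
p_true = 0.4                  # value of p printed above
p_hat = 0.37694704049844235   # duci_preds[0] printed above

abs_error = abs(p_hat - p_true)
print(f"absolute error: {abs_error:.4f}")                 # absolute error: 0.0231
print(f"in data points (of 500): {abs_error * 500:.1f}")  # in data points (of 500): 11.5
```

In this run the estimate is off by about 2.3 percentage points, or roughly 11.5 of the 500 target points.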