# Label-only Transfer Attack

## Intuition
This notebook demonstrates the so-called "transfer attack" from Li et al. (https://arxiv.org/abs/2007.15528). The attack is first built using only the output labels of one model, and is then transferred onto a model that outputs confidence scores.

### Stages of building the attack

1. **Shadow dataset relabeling**:
The attacker does not have access to the correct labels for their data, so they query the target model and use its predicted labels instead.

2. **Shadow model architecture selection**:
The attacker may have white-box knowledge of the target model's architecture, but does not need it. A different architecture can be used without much loss in attack performance.

3. **Shadow model training**:
The attacker trains the shadow model on the relabeled dataset.

4. **Membership inference**:
The attacker trains a membership inference model on the shadow model's outputs.

5. **Transfer**:
The attacker transfers the trained attack onto the model of choice.

![title](img/transfer_attack.png)
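
The five stages can be sketched end to end on toy data. This is only a minimal illustration of the data flow, not ART's implementation or the paper's exact procedure: the Gaussian data, the `CentroidModel` class, and the loss-threshold attack are all hypothetical stand-ins chosen so that the example runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Toy stand-in for MNIST: two overlapping 2-D Gaussian classes."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=1.5 * y[:, None], scale=1.0, size=(n, 2))
    return x, y

class CentroidModel:
    """Hypothetical toy classifier: softmax over negative squared centroid distances."""

    def fit(self, x, y):
        self.centroids = np.stack([x[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def predict_proba(self, x):
        logits = -((x[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, x):
        return self.predict_proba(x).argmax(axis=1)

def per_sample_loss(model, x, y):
    """Cross-entropy of each sample under the model's confidence scores."""
    p = model.predict_proba(x)
    return -np.log(p[np.arange(len(y)), y] + 1e-12)

# Target model, trained on data the attacker never sees directly.
x_target, y_target = make_blobs(200)
target = CentroidModel().fit(x_target, y_target)

# 1. Shadow dataset relabeling: only the target's predicted labels are used.
x_shadow, _ = make_blobs(200)
y_shadow = target.predict(x_shadow)

# 2./3. Pick a shadow architecture and train it on half of the relabeled data.
x_in, y_in = x_shadow[:100], y_shadow[:100]      # shadow "members"
x_out, y_out = x_shadow[100:], y_shadow[100:]    # shadow "non-members"
shadow = CentroidModel().fit(x_in, y_in)

# 4. Membership inference: pick the loss threshold that best separates
#    shadow members (lower loss expected) from shadow non-members.
loss_in = per_sample_loss(shadow, x_in, y_in)
loss_out = per_sample_loss(shadow, x_out, y_out)
candidates = np.sort(np.concatenate([loss_in, loss_out]))
accs = [((loss_in <= t).mean() + (loss_out > t).mean()) / 2 for t in candidates]
threshold = candidates[int(np.argmax(accs))]

# 5. Transfer: apply the same rule to a model that outputs confidence scores.
def infer_membership(model, x, y):
    return (per_sample_loss(model, x, y) <= threshold).astype(int)

# The attacker still has no true labels, so the target's own predictions are used.
membership_guess = infer_membership(target, x_target, target.predict(x_target))
```

The toy classifier hardly overfits, so no meaningful attack accuracy should be expected from this sketch; it only shows which model produces which quantity at each stage.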

In [1]:
import torch
from torch import nn
from models.mnist import Net

## Load targeted model and data

In [2]:
import numpy as np
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

from art.utils import load_mnist

# data
(x_train, y_train), (x_test, y_test), _min, _max = load_mnist(raw=True)

x_train = np.expand_dims(x_train, axis=1).astype(np.float32)[:1000]
y_train = y_train[:1000]
x_test = np.expand_dims(x_test, axis=1).astype(np.float32)

In [3]:
import torch.optim as optim
from art.estimators.classification.pytorch import PyTorchClassifier

model = Net()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

art_model = PyTorchClassifier(model=model, loss=criterion, optimizer=optimizer, channels_first=True, input_shape=(1,28,28,), nb_classes=10, clip_values=(_min,_max))

### Fit model

In [4]:
art_model.fit(x_train, y_train, nb_epochs=20)

#### Test accuracy

In [5]:
pred = np.array([np.argmax(arr) for arr in art_model.predict(x_test)])

print('Base model accuracy: ', np.sum(pred == y_test) / len(y_test))

Base model accuracy:  0.8681


#### Train accuracy

In [6]:
pred = np.array([np.argmax(arr) for arr in art_model.predict(x_train)])

print('Base model accuracy: ', np.sum(pred == y_train) / len(y_train))

Base model accuracy:  0.977
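
The two cells above score the target model on `x_test` (0.8681) and on `x_train` (0.977). The gap of roughly 0.11 between them is the kind of overfitting that membership inference attacks tend to exploit. As a small sketch, the per-row list comprehension can also be written as a vectorized helper; the `accuracy` function and the toy arrays below are illustrative and assume integer class labels, as in this notebook.

```python
import numpy as np

def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the integer label."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# Hypothetical score matrix (3 samples, 2 classes) and integer labels.
scores = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])
toy_accuracy = accuracy(scores, labels)  # two of the three rows are correct

# Generalization gap computed from the accuracies printed above.
train_acc, test_acc = 0.977, 0.8681
gap = train_acc - test_acc
```

In the notebook itself, the helper would be called as `accuracy(art_model.predict(x_test), y_test)`.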


## Create shadow model


### Attacker knows base architecture (white-box)

In [7]:
shadow_model_wb = Net()

optimizer = optim.Adam(shadow_model_wb.parameters())

art_shadow_model_wb = PyTorchClassifier(shadow_model_wb, loss=criterion, optimizer=optimizer, channels_first=True, input_shape=(1,28,28,), nb_classes=10, clip_values=(_min,_max))

### Attacker does not know base architecture (black-box)

In [8]:
import torch.nn.functional as F
class Linear_Net(nn.Module):
    def __init__(self):
        super(Linear_Net, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
    
shadow_model_bb = Linear_Net()

optimizer = optim.Adam(shadow_model_bb.parameters())

art_shadow_model_bb = PyTorchClassifier(shadow_model_bb, loss=criterion, optimizer=optimizer, channels_first=True, input_shape=(1,28,28,), nb_classes=10, clip_values=(_min,_max))
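
Note that `Linear_Net.forward` returns `F.log_softmax` outputs, while the `criterion` it is trained with is `nn.CrossEntropyLoss`, which applies a log-softmax to its input itself. The NumPy check below suggests why this combination is harmless for training: log-probabilities already have a log-sum-exp of zero, so a second log-softmax leaves them unchanged up to floating-point error. The `log_softmax` helper is a hand-written stand-in, not the PyTorch function.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# Hypothetical logits for two samples and three classes.
logits = np.array([[2.0, -1.0, 0.5], [0.0, 3.0, 1.0]])

once = log_softmax(logits)   # what the model's forward pass returns
twice = log_softmax(once)    # what the loss computes internally on top of it

unchanged = np.allclose(once, twice)
```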

## Attack

In [9]:
attacker_data_size = 250
print("Attacker knows %d train samples and %d test samples" % (attacker_data_size, attacker_data_size))

Attacker knows 250 train samples and 250 test samples
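
The attacker's known samples are the first 250 rows of `x_train` and `x_test`. A sketch of separating them from the remaining rows, so that the attack could later be scored only on samples it was not fitted on, might look as follows. The zero-filled arrays are placeholders with this notebook's shapes, and the `heldout_*` and `*_eval` names are hypothetical.

```python
import numpy as np

attacker_data_size = 250
n_samples = 1000  # size of the target's training subset in this notebook

# Placeholder arrays standing in for x_train and x_test[:1000].
members = np.zeros((n_samples, 1, 28, 28), dtype=np.float32)
non_members = np.zeros((n_samples, 1, 28, 28), dtype=np.float32)

# Samples the attacker uses to fit the attack.
known_members = members[:attacker_data_size]
known_non_members = non_members[:attacker_data_size]

# Samples the attack has never seen, reserved for evaluation.
heldout_members = members[attacker_data_size:]
heldout_non_members = non_members[attacker_data_size:]

x_eval = np.concatenate([heldout_members, heldout_non_members])
membership_eval = np.concatenate([
    np.ones(len(heldout_members), dtype=int),       # 1 = member
    np.zeros(len(heldout_non_members), dtype=int),  # 0 = non-member
])
```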


### Fit white-box attack

In [None]:
from sklearn.ensemble import RandomForestClassifier
from art.estimators.classification.scikitlearn import ScikitlearnRandomForestClassifier
from art.attacks.inference.membership_inference import LabelOnlyTransferAttack, MembershipInferenceBlackBox

membership_inference_wb = MembershipInferenceBlackBox(
    art_shadow_model_wb,
    input_type="loss",
    attack_model=ScikitlearnRandomForestClassifier(RandomForestClassifier())
)

attack_wb = LabelOnlyTransferAttack(
    classifier=art_model,
    membership_inference=membership_inference_wb,
)

# dataset relabeling, shadow model training, membership inference training
attack_wb.fit(
    x=x_train[:attacker_data_size],
    test_x=x_test[:attacker_data_size],
    nb_epochs=50,
)

### Fit black-box attack

In [None]:
membership_inference_bb = MembershipInferenceBlackBox(
    art_shadow_model_bb,
    input_type="loss",
    attack_model=ScikitlearnRandomForestClassifier(RandomForestClassifier())
)

attack_bb = LabelOnlyTransferAttack(
    classifier=art_model,
    membership_inference=membership_inference_bb,
)

# dataset relabeling, shadow model training, membership inference training
attack_bb.fit(
    x=x_train[:attacker_data_size],
    test_x=x_test[:attacker_data_size],
    nb_epochs=50,
)

### Transfer attack on target model

In [12]:
from sklearn.metrics import accuracy_score
num_samples = len(x_train)
membership = [1] * num_samples + [0] * num_samples

attack_wb.transfer(art_model)
attack_bb.transfer(art_model)

inferred_membership_wb = attack_wb.infer(np.concatenate([x_train, x_test[:num_samples]]))
inferred_membership_bb = attack_bb.infer(np.concatenate([x_train, x_test[:num_samples]]))

acc_wb = accuracy_score(membership, inferred_membership_wb)
acc_bb = accuracy_score(membership, inferred_membership_bb)

print("Accuracy when attacker knows target architecture: %f" % acc_wb)
print("Accuracy when attacker does NOT know target architecture: %f" % acc_bb)

Accuracy when attacker knows target architecture: 0.577000
Accuracy when attacker does NOT know target architecture: 0.537500


**In this run, the attack that knows the target architecture (white-box) scores slightly higher than the black-box attack (0.577 vs. 0.5375 accuracy).**
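
Because the evaluation set holds equally many members and non-members, the chance baseline for accuracy is 0.5, and accuracy alone does not show whether errors are false alarms or missed members. A minimal sketch of reporting precision and recall alongside it is given below; `membership_metrics` and the toy arrays are illustrative, not results from this notebook.

```python
import numpy as np

def membership_metrics(true_membership, inferred_membership):
    """Accuracy, precision and recall with 'member' (1) as the positive class."""
    truth = np.asarray(true_membership).ravel().astype(bool)
    guess = np.asarray(inferred_membership).ravel().astype(bool)
    tp = int(np.sum(truth & guess))    # members correctly flagged
    fp = int(np.sum(~truth & guess))   # non-members wrongly flagged
    fn = int(np.sum(truth & ~guess))   # members that were missed
    tn = int(np.sum(~truth & ~guess))  # non-members correctly passed
    return {
        "accuracy": (tp + tn) / truth.size,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical ground truth and predictions.
toy_truth = [1, 1, 1, 0, 0, 0]
toy_guess = [1, 1, 0, 1, 0, 0]
metrics = membership_metrics(toy_truth, toy_guess)
```

With the notebook's variables, the call would be `membership_metrics(membership, inferred_membership_wb)`.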

### Transfer black-box attack on a different model

Now that we have trained our shadow-model-based attack on the predicted labels of the target model, let's see whether it can also predict membership for a different model.

#### Load other model (MLP) and fit

In [13]:
from models.mnist import MLP

mlp_mnist = MLP(input_dims=784, n_hiddens=[256, 256], n_class=10)

optimizer = optim.Adam(mlp_mnist.parameters())

art_mlp_model = PyTorchClassifier(model=mlp_mnist, optimizer=optimizer, loss=criterion, channels_first=True, input_shape=(1,28,28,), nb_classes=10, clip_values=(_min,_max))

art_mlp_model.fit(x_train, y_train, nb_epochs=20)

pred = np.array([np.argmax(arr) for arr in art_mlp_model.predict(x_test)])

print('Base model accuracy: ', np.sum(pred == y_test) / len(y_test))

Sequential(
  (fc1): Linear(in_features=784, out_features=256, bias=True)
  (relu1): ReLU()
  (drop1): Dropout(p=0.2, inplace=False)
  (fc2): Linear(in_features=256, out_features=256, bias=True)
  (relu2): ReLU()
  (drop2): Dropout(p=0.2, inplace=False)
  (out): Linear(in_features=256, out_features=10, bias=True)
)
Base model accuracy:  0.8873


In [14]:
attack_bb.transfer(art_mlp_model)

inferred_membership = attack_bb.infer(np.concatenate([x_train, x_test[:num_samples]]))

transfer_acc = accuracy_score(membership, inferred_membership)

print("Transfer membership inference accuracy: %f" % transfer_acc)

Transfer membership inference accuracy: 0.540500


**The transferred membership inference attack reaches an accuracy similar to that of the black-box attack on the original target (0.5405 vs. 0.5375).**