# A Simple Adversarial Attack Against Machine Learning Using Feature Perturbation

In this lab, we will demonstrate how we can algorithmically alter features of an input to attack a pre-trained machine learning model. This lab assumes three inputs:

+ A pre-trained classification model, namely ```model```
+ An input, namely ```original_input```, which the model labels as ```original_label``` (i.e., ```original_label = model(original_input)```)
+ A target label, namely ```target_label```, which is different from ```original_label```.

The objective of this attack is to automatically generate a new input, namely ```adversarial_input```, by minimally perturbing features of ```original_input``` so that ```target_label = model(adversarial_input)```.

It is worth noting that this method algorithmically generates ```adversarial_input``` from ```original_input``` without rigorously proving or verifying that the perturbation is *minimized*. Nevertheless, this method can be easily extended to ensure that the distance between ```adversarial_input``` and ```original_input``` is smaller than a pre-defined value.
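
One way to read the distance-bound extension mentioned above, as a minimal sketch: measure the L1 (Manhattan) distance between the two inputs and reject any candidate that exceeds a budget. The helper names ```l1_distance``` and ```within_budget``` and the parameter ```max_l1``` are hypothetical, and L1 is only one possible choice of distance. The sketch uses plain Python so it runs without the notebook's state.

```python
def l1_distance(a, b):
    """Sum of absolute per-feature differences (Manhattan distance)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of features")
    return sum(abs(x - y) for x, y in zip(a, b))


def within_budget(original_input, adversarial_input, max_l1):
    """Hypothetical check: True if the perturbation stays within the L1 budget."""
    return l1_distance(original_input, adversarial_input) <= max_l1


# Illustrative values only: two of the four features differ.
original = [6.0, 2.2, 4.0, 1.0]
candidate = [6.0, 2.2, 4.9, 1.5]
print(l1_distance(original, candidate))                # roughly 1.4
print(within_budget(original, candidate, max_l1=2.0))  # inside the budget
print(within_budget(original, candidate, max_l1=1.0))  # outside the budget
```

A candidate that fails the check could simply be discarded and the next guidance input tried instead.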



## Step 1. Build A Classification Model

### Exploring The Data

We use the iris data to build a classification model. The iris data has 4 features and 3 classes. Therefore, the model needs to perform multi-class classification. 



In [None]:
import torch
import torch.nn as nn
import numpy as np
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

iris = datasets.load_iris()

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Features must be float32 and class labels int64 (torch.long) for CrossEntropyLoss.
X_train = torch.from_numpy(X_train.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.int64))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.int64))

_, n_input_features = X_train.shape

n_output_features = torch.unique(y_train).shape[0]
print("The number of features: ", n_input_features)
print("The number of classes: ", n_output_features)


The number of features:  4
The number of classes:  3


### Define A Model

Since the iris data are relatively simple, we use a linear model with a single layer and no bias term. We use PyTorch to implement this model.

In [None]:
# A model implemented using PyTorch
class IRISModel(nn.Module):
  def __init__(self, n_input_features, n_output_features):
    super(IRISModel, self).__init__()
    self.linear = nn.Linear(n_input_features, n_output_features, bias = False)

  def forward(self, x):
    y = self.linear(x)
    return y

### Train The Model

Since this is a multi-class classification problem, we use ```CrossEntropyLoss()```. After 5000 training iterations, the model reaches a training accuracy of about 98% (0.982143 at the last logged iteration).

In [None]:

model = IRISModel(n_input_features, n_output_features)
lr = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)
torch.manual_seed(247)

n_iter = 5000

for i in range(n_iter):

  y_predict = model(X_train)
  loss = criterion(y_predict, y_train)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

  if i % 500 == 0:
    with torch.no_grad():
      y_predict = model(X_train)
      loss = criterion(y_predict, y_train)
      _, y_predict = torch.max(y_predict, dim = 1)
      cnt_matched = (y_predict == y_train).sum().item()
      cnt_all = y_predict.shape[0]
      accuracy_train = cnt_matched / cnt_all
      print("[%d]: loss: %f, accuracy:%f" % (i, loss, accuracy_train))


[0]: loss: 1.430276, accuracy:0.303571
[500]: loss: 0.423566, accuracy:0.875000
[1000]: loss: 0.341122, accuracy:0.946429
[1500]: loss: 0.292505, accuracy:0.982143
[2000]: loss: 0.258522, accuracy:0.982143
[2500]: loss: 0.233166, accuracy:0.982143
[3000]: loss: 0.213481, accuracy:0.982143
[3500]: loss: 0.197744, accuracy:0.982143
[4000]: loss: 0.184870, accuracy:0.982143
[4500]: loss: 0.174138, accuracy:0.982143
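
To make the loss values above easier to interpret, here is a minimal sketch of what cross-entropy computes for a single sample: the negative log of the softmax probability assigned to the true class. ```CrossEntropyLoss``` takes raw scores (logits) and, by default, averages this quantity over the batch. The scores below are hypothetical and the helper names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability that the softmax assigns to the true class."""
    return -math.log(softmax(logits)[true_class])


# Hypothetical scores for one sample over three classes.
confident = [0.1, 4.0, 0.2]
unsure = [1.0, 1.0, 1.0]
print(cross_entropy(confident, true_class=1))  # small: class 1 dominates
print(cross_entropy(unsure, true_class=1))     # equals log(3): uniform guess
```

A loss near log(3), about 1.10, therefore corresponds to guessing uniformly among the three classes, and the loss shrinks as the model puts more probability on the correct class.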


### Exploring The Trained Model

We print the weights of the trained model. Since the model has only one linear layer, these weights give a rough indication of feature importance. Specifically, each row corresponds to one target class and shows how each feature contributes to the score of that class. For example, the last row shows that the third feature (petal length, index 2, with a weight of 2.6633) contributes most to the third class (class 2).

In [None]:
print(model.linear.weight)

Parameter containing:
tensor([[ 0.9818,  1.2395, -2.2441, -0.6774],
        [ 0.7079, -0.4689,  0.2842, -1.0578],
        [-1.3512, -1.4727,  2.6633,  1.6054]], requires_grad=True)
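
As a small illustration of reading the matrix this way, the sketch below ranks the features of each class by weight, largest first. The weights are copied from the printed output (rounded to four decimals); the helper ```rank_features``` and the shortened feature names are illustrative. Treating raw weights as importance also assumes the features are on comparable scales, which is only approximately true here.

```python
# Weight matrix copied from the printed output (rows: classes, columns: features).
weights = [
    [0.9818, 1.2395, -2.2441, -0.6774],
    [0.7079, -0.4689, 0.2842, -1.0578],
    [-1.3512, -1.4727, 2.6633, 1.6054],
]
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]
class_names = ["setosa", "versicolor", "virginica"]


def rank_features(row):
    """Feature indices sorted from the largest weight to the smallest."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


for name, row in zip(class_names, weights):
    order = rank_features(row)
    print(name, order, [feature_names[j] for j in order])
```

For the last row this gives the order petal length, petal width, sepal length, sepal width.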


## Step 2. Identify An Original Input

You can create an arbitrary input as the original input. Here we use [6.0, 2.2, 4, 1]. 

We then apply the trained model to classify this input and we get the label of 1, which corresponds to the class of versicolor. 

In [None]:
original_input = np.array([6.0, 2.2, 4, 1])
original_input_tensor = torch.from_numpy(original_input.astype(np.float32))

with torch.no_grad():
  #prob_predict, label_predict = torch.max(model(original_input), dim=1)
  r = model(original_input_tensor)
  m = nn.Softmax(dim = 1)
  r = m(r.reshape(-1, 3))
  prob_predict, label_predict = torch.max(r, dim=1)

original_label = label_predict
print(original_label)
print(iris.target_names[original_label])


tensor([1])
versicolor
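
The prediction can also be checked by hand. The sketch below recomputes the three class scores for this input from the weights printed in the previous section and picks the largest one. Because the printed weights are rounded to four decimals, the scores are approximate; the helper names are illustrative.

```python
import math

# Weights copied from the printed model output (rounded to four decimals).
weights = [
    [0.9818, 1.2395, -2.2441, -0.6774],
    [0.7079, -0.4689, 0.2842, -1.0578],
    [-1.3512, -1.4727, 2.6633, 1.6054],
]
original_input = [6.0, 2.2, 4.0, 1.0]


def scores(x):
    """Linear class scores W @ x (the model has no bias term)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = scores(original_input)
probs = softmax(logits)
label = max(range(len(probs)), key=lambda c: probs[c])
print([round(v, 4) for v in logits])
print(label)  # 1, i.e. versicolor
```

The score for class 1 is clearly the largest, which agrees with the model's output above.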


## Step 3. Generating An Adversarial Example

### Identify All Candidate Guidance Instances and Sort Them

In order to generate an adversarial example, we first define a label that is different from the label of the original input.  

As shown in the previous section, the label of the original input is ```versicolor``` (i.e., class 1). Therefore, here we set ```target_label``` to 2.

We next identify all training inputs that the model classifies as class 2 and store them in ```all_guidence_candidates```. We then compute the Manhattan distance between ```original_input``` and every input in ```all_guidence_candidates```, sort the candidate indices by distance in ascending order, and store them in ```ordered_neighbors_indices```. Candidates with smaller distances are preferred since they imply a smaller perturbation when we alter ```original_input```.

In [None]:
from sklearn.metrics.pairwise import manhattan_distances

target_label = 2

with torch.no_grad():
  predict = model(X_train)
  _, labels_predict = torch.max(predict, dim=1)
  #print(labels_predict)
  all_guidence_candidates = X_train[labels_predict == target_label]

distances = manhattan_distances(original_input.reshape(-1, 4), all_guidence_candidates)

ordered_neighbors_indices = np.argsort(distances[0])

#ordered_all_guidence_candidates = all_guidence_candidates[ordered_neighbors_indices]
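
The distance-and-sort step can be sketched in plain Python as follows. The three candidate rows are hypothetical stand-ins for rows of ```all_guidence_candidates```, and ```manhattan``` is an illustrative helper that computes the same L1 distance as ```manhattan_distances``` for a single pair.

```python
def manhattan(a, b):
    """L1 distance: sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


original_input = [6.0, 2.2, 4.0, 1.0]

# Hypothetical candidates, assumed to be classified as the target class.
candidates = [
    [7.2, 3.0, 5.8, 1.6],
    [6.3, 2.5, 4.9, 1.5],
    [5.8, 2.7, 5.1, 1.9],
]

distances = [manhattan(original_input, c) for c in candidates]

# Candidate indices from nearest to farthest (the role np.argsort plays above).
ordered = sorted(range(len(candidates)), key=lambda i: distances[i])
print(ordered)  # [1, 2, 0]
```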


### Identify Top-K Stealthy Candidate Guidance Instances

Among the inputs in ```all_guidence_candidates```, we look for those with a higher probability of being classified as class 1, the original label. Note that all inputs in ```all_guidence_candidates``` have been classified as class 2 by our model. Nevertheless, the higher the class-1 probability of such an input, the closer it is to the decision boundary between the two classes. Therefore, it is more likely to be a *stealthy* guidance input.


We therefore apply our trained model and sort the indices of ```all_guidence_candidates``` by the probability of being classified as ```original_label```, in descending order. We keep only the top K (here, 10) inputs, which are the closest to the decision boundary, and store their indices in ```top_k_candidates_using_prob```.



In [None]:
from sklearn.metrics.pairwise import paired_euclidean_distances

k = 10

with torch.no_grad():
  predict = model(all_guidence_candidates)
  m = nn.Softmax(dim=1)
  prob = m(predict)[:, original_label]
  prob = prob.squeeze().numpy()


ordered_index_using_prob = np.argsort(prob)
ordered_index_using_prob = ordered_index_using_prob[::-1]
top_k_candidates_using_prob = ordered_index_using_prob[:k]
print(top_k_candidates_using_prob)

[37 26  8 30  7 31 40 21  1 13]
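
The ranking above can be sketched without PyTorch as follows. The scores in ```candidate_logits``` are hypothetical, and ```top_k_by_class_prob``` is an illustrative helper rather than part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_by_class_prob(all_logits, class_index, k):
    """Indices of the k rows with the highest probability for class_index."""
    probs = [softmax(row)[class_index] for row in all_logits]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


# Hypothetical scores for four candidates that are all classified as class 2.
candidate_logits = [
    [-3.0, 1.0, 4.0],  # class 2 wins by a wide margin
    [-2.0, 2.9, 3.0],  # class 1 almost ties: close to the boundary
    [-4.0, 2.0, 3.5],
    [-1.0, 0.5, 5.0],
]
print(top_k_by_class_prob(candidate_logits, class_index=1, k=2))  # [1, 2]
```

The candidate whose class-1 score nearly ties with its class-2 score ranks first, which is the sense in which it is "close to the decision boundary".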


### Identify The Best Guidance Instance

We next iterate over the indices in ```ordered_neighbors_indices```, from the nearest candidate to the farthest, and check whether each one is in ```top_k_candidates_using_prob```. The first index that is becomes ```best_target_instance_index```.

Note that ```ordered_neighbors_indices``` covers every candidate, so the loop always finds a match as long as at least one candidate exists; in effect, it selects the top-K candidate that is nearest to ```original_input```. The initialization of ```best_target_instance_index``` to the first entry of ```ordered_neighbors_indices```, i.e., the candidate with the smallest distance to ```original_input```, only serves as a defensive default.

In [None]:
best_target_instance_index = ordered_neighbors_indices[0]

for i in ordered_neighbors_indices:
  if i in top_k_candidates_using_prob:
    best_target_instance_index = i
    break

best_target = all_guidence_candidates[best_target_instance_index]
print(original_input)
print(best_target)                                            

[6.  2.2 4.  1. ]
tensor([6.3000, 2.5000, 4.9000, 1.5000])
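
The selection rule can be isolated into a small function, shown here as a sketch with hypothetical index lists. The name ```pick_best_guidance``` is illustrative.

```python
def pick_best_guidance(ordered_by_distance, top_k_by_prob):
    """Return the nearest index that is also in the top-k set.

    Falls back to the nearest index overall if the two lists do not overlap.
    """
    top_k = set(top_k_by_prob)  # set membership is O(1) per lookup
    for idx in ordered_by_distance:
        if idx in top_k:
            return idx
    return ordered_by_distance[0]


# Hypothetical indices: 9 is the nearest candidate that is also in the top-k set.
print(pick_best_guidance([5, 2, 9, 7], [7, 9, 1]))  # 9
# No overlap: fall back to the nearest candidate.
print(pick_best_guidance([5, 2], [7, 9, 1]))        # 5
```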


### Adjust The Original Input Guided by The Best Guidance Instance

We now have the best guidance input. We next adjust ```original_input``` so that it moves towards ```best_target```.

Our adjustment leverages the feature importance. Specifically, it starts with the feature that has the largest weight for the target class and copies the value of that feature from ```best_target``` into the input. After each change, we check whether the adjusted input is classified as ```target_label```. If so, we stop and take the adjusted input as the ```adversarial_example```.

For our example, the algorithm changes two features, the third (petal length, from 4 to 4.9) and the fourth (petal width, from 1 to 1.5), after which the trained model classifies the ```adversarial_example``` as class 2.


In [None]:
print(model.linear.weight[target_label])
# Sort feature indices by their weight for the target class, largest first.
ordered_feature_idx_using_importance = np.argsort(model.linear.weight[target_label].detach().numpy())[::-1]
print("Feature Indices Ordered by Importance: ", ordered_feature_idx_using_importance)
print("Features Ordered by Importance: ", [iris.feature_names[i] for i in ordered_feature_idx_using_importance])

adversarial_example = torch.from_numpy(original_input.copy().astype(np.float32))


for feature in ordered_feature_idx_using_importance:
  adversarial_example[feature] = best_target[feature]
  with torch.no_grad():
    predict = model(adversarial_example)
    label_predict = torch.argmax(predict)
    if(label_predict == target_label):
      break

#with torch.no_grad():
#  predict = model(torch.from_numpy(original_input.astype(np.float32)))
#  label_predict = torch.argmax(predict)
#  print(label_predict)
#  predict = model(adversarial_example)
#  label_predict = torch.argmax(predict)
#  print(label_predict)

print("original_input:", original_input)
print("best target input:", best_target.numpy())
print("The adversarial example:", adversarial_example.numpy())





tensor([-1.3512, -1.4727,  2.6633,  1.6054], grad_fn=<SelectBackward0>)
Feature Indices Ordered by Importance:  [2 3 0 1]
Features Ordered by Importance:  ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
original_input: [6.  2.2 4.  1. ]
best target input: [6.3 2.5 4.9 1.5]
The adversarial example: [6.  2.2 4.9 1.5]
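
To trace why the loop stops after two features, the sketch below replays it in plain Python using the weights printed earlier (rounded to four decimals, so the scores are approximate). The helper names ```predict``` and ```greedy_perturb``` are illustrative.

```python
# Weights copied from the printed model output (rounded to four decimals).
weights = [
    [0.9818, 1.2395, -2.2441, -0.6774],
    [0.7079, -0.4689, 0.2842, -1.0578],
    [-1.3512, -1.4727, 2.6633, 1.6054],
]


def predict(x):
    """Index of the class with the largest linear score W @ x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(logits)), key=lambda c: logits[c])


def greedy_perturb(original, guidance, feature_order, target_label):
    """Copy features from the guidance input one by one until the label flips."""
    adversarial = list(original)
    for j in feature_order:
        adversarial[j] = guidance[j]
        if predict(adversarial) == target_label:
            break
    return adversarial


original_input = [6.0, 2.2, 4.0, 1.0]
best_target = [6.3, 2.5, 4.9, 1.5]
target_label = 2

# Features ordered by their weight for the target class, largest first.
order = sorted(range(4), key=lambda j: weights[target_label][j], reverse=True)

adversarial = greedy_perturb(original_input, best_target, order, target_label)
changed = [j for j in range(4) if adversarial[j] != original_input[j]]
print(adversarial)  # [6.0, 2.2, 4.9, 1.5]
print(changed)      # [2, 3]
```

Changing petal length alone leaves the class-1 score slightly above the class-2 score, so the loop continues and also copies petal width, at which point class 2 wins.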
