# Not What It Seems: <br/>Imperceptible Attacks on NLP Models

See related paper for details.

## Setup
Install dependencies and load pre-trained models.

In [None]:
# Install dependencies
!pip install fairseq
!pip install sacremoses
!pip install fastBPE
!pip install subword_nmt
!pip install textdistance[extras]
!pip install scipy
!pip install requests

In [None]:
# Load pre-trained translation model
import torch
en2fr = torch.hub.load('pytorch/fairseq',
                       'transformer.wmt14.en-fr',
                       tokenizer='moses',
                       bpe='subword_nmt').cuda()

## Unknown Characters

Unusual characters, such as zero-width spaces and control sequences, are simply encoded as the `<unk>` token by the fairseq implementation. This likely generalizes to many other NLP models.

In [None]:
# Define function for decoding from the source dictionary
def src_decode(sentence):
  res = []
  for idx in sentence:
    res.append(en2fr.src_dict.symbols[idx])
  return ' '.join(res)

In [None]:
# Define some test encodings
sentence = "This is a test."
psentence = f"This {chr(0x2063)}is a test."
inputv = en2fr.encode(sentence)
outputv = en2fr.generate(inputv)[0]['tokens']
pinputv = en2fr.encode(psentence)
poutputv = en2fr.generate(pinputv)[0]['tokens']

In [56]:
# Show that the encoder simply treats invisible characters as an unknown token.
src_decode(pinputv)

'This <unk> is a test . </s>'
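The fallback behaviour can be illustrated without loading the model. The sketch below uses a toy vocabulary and a hypothetical `toy_encode` helper; neither is part of fairseq, and the real dictionary is far larger. It only shows how any out-of-vocabulary symbol collapses to a single `<unk>` index.

```python
# Toy illustration only: VOCAB and toy_encode are hypothetical stand-ins,
# not fairseq's dictionary implementation.
VOCAB = {"<unk>": 0, "This": 1, "is": 2, "a": 3, "test": 4, ".": 5}

def toy_encode(tokens):
    """Map each token to its index, falling back to the <unk> index."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

clean = toy_encode(["This", "is", "a", "test", "."])
# U+2063 (invisible separator) is not in the toy vocabulary.
perturbed = toy_encode(["This", chr(0x2063), "is", "a", "test", "."])

print(clean)
print(perturbed)
```

The two inputs look identical when rendered, yet the second one carries an extra index that the model has to process.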

## Invisible Characters

Certain Unicode characters are almost never visually rendered, by design. Conveniently, they can be embedded within strings and copied and pasted on most systems. Most NLP models are not trained on these characters, so they are absent from source language dictionaries. As a result, they are typically mapped to the `<unk>` embedding vector.

In [None]:
# Zero width space
ZWSP = chr(0x200B)
# Zero width joiner
ZWJ = chr(0x200D)

print(f"{ZWSP}{ZWJ}")

​‍


In [None]:
print(f"This string contains {ZWSP*500}1000{ZWJ*500} invisible characters.")

This string contains ​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​1000‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
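Although such characters are not rendered, they remain ordinary code points that can be found programmatically. The following sketch uses the standard-library `unicodedata` module to list the Unicode format (`Cf`) characters in a string; the helper name `format_chars` and the example strings are assumptions made for illustration.

```python
import unicodedata

ZWSP = chr(0x200B)  # zero width space
ZWJ = chr(0x200D)   # zero width joiner

visible = "invisible"
hidden = f"in{ZWSP}visi{ZWJ}ble"

def format_chars(text):
    """Return (index, name) for every character in general category Cf."""
    return [(i, unicodedata.name(c, "UNKNOWN"))
            for i, c in enumerate(text)
            if unicodedata.category(c) == "Cf"]

print(visible == hidden)           # the strings differ despite rendering alike
print(len(visible), len(hidden))   # the hidden variant is two code points longer
print(format_chars(hidden))
```

A check of this kind is one simple way to inspect inputs for the characters used in this section, though it does not cover every invisible code point.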

## Homoglyphs

The Unicode specification defines several homoglyph documents. The following section retrieves these documents and creates mappings between homoglyph characters.

In [None]:
import requests

confusables = dict()
intentionals = dict()

# Retrieve Unicode Confusable homoglyph characters
conf_resp = requests.get("https://www.unicode.org/Public/security/latest/confusables.txt", stream=True)
for line in conf_resp.iter_lines():
  if len(line):
    line = line.decode('utf-8-sig')
    if line[0] != '#':
      line = line.replace("#*", "#")
      _, line = line.split("#", maxsplit=1)
      if line[3] not in confusables:
        confusables[line[3]] = []
      confusables[line[3]].append(line[7])

# Retrieve Unicode Intentional homoglyph characters
int_resp = requests.get("https://www.unicode.org/Public/security/latest/intentional.txt", stream=True)
for line in int_resp.iter_lines():
  if len(line):
    line = line.decode('utf-8-sig')
    if line[0] != '#':
      line = line.replace("#*", "#")
      _, line = line.split("#", maxsplit=1)
      if line[3] not in intentionals:
        intentionals[line[3]] = []
      intentionals[line[3]].append(line[7])
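The loops above read the characters from fixed offsets in the trailing comment of each line. An alternative is to parse the hexadecimal code point fields at the start of the line. The sketch below does this for a single hard-coded sample line; the sample is written in the layout assumed for `confusables.txt` (semicolon-separated hex fields followed by a `#` comment), and `parse_confusable` is a hypothetical helper, not part of the notebook.

```python
# Sample line in the layout assumed for confusables.txt (illustrative only).
SAMPLE = ("0441 ;\t0063 ;\tMA\t"
          "# ( \u0441 \u2192 c ) CYRILLIC SMALL LETTER ES \u2192 LATIN SMALL LETTER C")

def parse_confusable(line):
    """Return (source, target) strings decoded from the hex code point fields."""
    data = line.split("#", maxsplit=1)[0]          # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return source, target

source, target = parse_confusable(SAMPLE)
print(f"U+{ord(source):04X} -> U+{ord(target):04X}")
print(source == target)  # visually similar, but different code points
```

Parsing the code point fields also handles entries whose source or target spans more than one character, which fixed offsets cannot represent.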

## Reorderings

Unicode Bidirectional (Bidi) Algorithm override characters can be used to render encoded characters in any order. The following section defines a function which generates 2^(n-1) reordered encodings of a given string of length n, all of which are rendered the same way in any system built on Google's Chromium.

This [site](https://www.soscisurvey.de/tools/view-chars.php) can be used to visualize the underlying encoding of the text.

In [None]:
# Unicode Bidi override characters
PDF = chr(0x202C)
LRE = chr(0x202A)
RLE = chr(0x202B)
LRO = chr(0x202D)
RLO = chr(0x202E)

PDI = chr(0x2069)
LRI = chr(0x2066)
RLI = chr(0x2067)

class Swap():
    """Represents swapped elements in a string of text."""
    def __init__(self, one, two):
        self.one = one
        self.two = two
    
    def __repr__(self):
        return f"Swap({self.one}, {self.two})"

    def __eq__(self, other):
        return self.one == other.one and self.two == other.two

    def __hash__(self):
        return hash((self.one, self.two))

def some(*els):
    """Returns the arguments as a tuple with Nones removed."""
    return tuple(filter(None, tuple(els)))

def swaps(chars: str) -> set:
    """Generates all possible swaps for a string."""
    def pairs(chars, pre=(), suf=()):
        orders = set()
        for i in range(len(chars)-1):
            prefix = pre + tuple(chars[:i])
            suffix = suf + tuple(chars[i+2:])
            swap = Swap(chars[i+1], chars[i])
            pair = some(prefix, swap, suffix)
            orders.add(pair)
            orders.update(pairs(suffix, pre=some(prefix, swap)))
            orders.update(pairs(some(prefix, swap), suf=suffix))
        return orders
    return pairs(chars) | {tuple(chars)}

def unswap(el: tuple) -> str:
    """Reverts a tuple of swaps to the original string."""
    if isinstance(el, str):
        return el
    elif isinstance(el, Swap):
        return unswap((el.two, el.one))
    else:
        res = ""
        for e in el:
            res += unswap(e)
        return res

def uniswap(els):
    res = ""
    for el in els:
        if isinstance(el, Swap):
            res += uniswap([LRO, LRI, RLO, LRI, el.one, PDI, LRI, el.two, PDI, PDF, PDI, PDF])
        elif isinstance(el, str):
            res += el
        else:
            for subel in el:
                res += uniswap([subel])
    return res

def strings_to_file(file, string):
  with open(file, 'w') as f:
      for swap in swaps(string):
          uni = uniswap(swap)
          print(uni, file=f)

def print_strings(string):
  for swap in swaps(string):
    uni = uniswap(swap)
    print(uni)

In [None]:
# Check that the number of reorderings is 2^(n-1) for a string of length n
string = "A test string."
assert len(swaps(string)) == 2**(len(string)-1)

In [None]:
# Print strings
print_strings("Test")

Te‭⁦‮⁦t⁩⁦s⁩‬⁩‬
‭⁦‮⁦e⁩⁦T⁩‬⁩‬‭⁦‮⁦t⁩⁦s⁩‬⁩‬
‭⁦‮⁦e⁩⁦T⁩‬⁩‬st
T‭⁦‮⁦s⁩⁦e⁩‬⁩‬t
‭⁦‮⁦‭⁦‮⁦s⁩⁦e⁩‬⁩‬⁩⁦T⁩‬⁩‬t
‭⁦‮⁦‭⁦‮⁦t⁩⁦s⁩‬⁩‬⁩⁦Te⁩‬⁩‬
Test
‭⁦‮⁦‭⁦‮⁦t⁩⁦s⁩‬⁩‬⁩⁦‭⁦‮⁦e⁩⁦T⁩‬⁩‬⁩‬⁩‬


In [None]:
# Save strings to file
strings_to_file('reordering_example.txt', "This is a test.")
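To see what a single swap looks like in memory, the encoding used by `uniswap` can be reproduced in isolation. The sketch below redefines the relevant Bidi control characters so that it runs on its own; `encode_swap` and `logical_order` are hypothetical helper names introduced for illustration.

```python
# Bidi control characters (same code points as defined earlier).
PDF, LRO, RLO = chr(0x202C), chr(0x202D), chr(0x202E)
PDI, LRI = chr(0x2069), chr(0x2066)
CONTROLS = {PDF, LRO, RLO, PDI, LRI}

def encode_swap(first, second):
    """Store `second` before `first` in memory, wrapped in overrides so that a
    Bidi-aware renderer is expected to display `first` before `second`."""
    return "".join([LRO, LRI, RLO, LRI, second, PDI, LRI, first, PDI, PDF, PDI, PDF])

def logical_order(text):
    """Strip Bidi controls to reveal the character order a model would read."""
    return "".join(c for c in text if c not in CONTROLS)

encoded = "T" + encode_swap("e", "s") + "t"
print(len(encoded))            # 4 visible characters plus 10 control characters
print(logical_order(encoded))  # order of the visible characters in memory
```

The encoded string corresponds to one of the variants printed for `"Test"` above: it is intended to render as `Test` while the underlying character order is different.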

## Deletions

Unicode control characters used for deleting text can be encoded into strings. Upon rendering, these control characters are actioned and the appropriate surrounding text is not rendered. Yet, NLP models generally still "see" the surrounding text.

In [None]:
# Backspace character
BKSP = chr(0x8)
# Delete character
DEL = chr(0x7F)
# Carriage return character
CR = chr(0xD)

print(f"{CR}{BKSP}{DEL}")




In [None]:
# Deletion characters are interesting, as surrounding text isn't visually
# rendered, but is "seen" by the model
inp = f"father{BKSP*6}"
res = en2fr.encode(inp)
print(f"Input Rendering: '{inp}'")
print(f"Input Encoding: '{src_decode(res)}'")
print(f"Output: '{en2fr.translate(inp)}'")

Input Rendering: 'father'
Input Encoding: 'father </s>'
Output: 'père'


In [None]:
# The carriage return character returns to the beginning of the line and
# overwrites it
inp = f"grand{CR}father"
res = en2fr.encode(inp)
print(f"Input Rendering:\n{inp}")
print(f"Input Encoding: '{src_decode(res)}'")
print(f"Output: '{en2fr.translate(inp)}'")

Input Rendering:
grandfather
Input Encoding: 'grand father </s>'
Output: 'grand-père'
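Whether deletion characters are actioned depends on the renderer. The gap between what is stored and what may be displayed can be sketched with a deliberately simplified single-line renderer model; `simulate_render` is a hypothetical helper and does not reproduce the behaviour of any particular terminal or text engine.

```python
BKSP = chr(0x8)
CR = chr(0xD)

def simulate_render(text):
    """Simplified model: backspace removes the previous character, and carriage
    return moves the cursor to column 0 so later characters overwrite."""
    line, cursor = [], 0
    for ch in text:
        if ch == BKSP:
            if cursor > 0:
                cursor -= 1
                del line[cursor]
        elif ch == CR:
            cursor = 0
        elif cursor < len(line):
            line[cursor] = ch      # overwrite an existing column
            cursor += 1
        else:
            line.append(ch)        # extend the line
            cursor += 1
    return "".join(line)

print(repr(simulate_render(f"xa{BKSP}y")))
print(repr(simulate_render(f"father{BKSP * 6}")))
print(repr(simulate_render(f"grand{CR}father")))
```

Under this model the stored string and the displayed string diverge, while a model that receives the raw string still processes every stored character.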


## Untargeted Integrity Attacks

The performance of various NLP models can be degraded through the use of invisible character, homoglyph, reordering, and deletion attacks. The most effective attacks can be found, independent of the underlying model, using a genetic algorithm.

### Attack Setup

Each attack will be defined as an objective and a set of constraints over which a genetic algorithm (differential evolution) will optimize. For these attacks, the visual representation of the input is fixed, and the aim of the attack is to determine the imperceptible perturbation for which the supplied model's output is maximally distant from the output for the unperturbed input.

Each attack will be derived from the following Objective abstract class.

In [None]:
from abc import ABC
from typing import List, Tuple, Callable, Dict
from fairseq.hub_utils import GeneratorHubInterface
from scipy.optimize import NonlinearConstraint, differential_evolution
from textdistance import levenshtein
import numpy as np


class Objective(ABC):
  """ Abstract class representing objectives for scipy's genetic algorithms."""

  def __init__(self, model: GeneratorHubInterface, input: str, max_perturbs: int, distance: Callable[[str,str],int]):
    if not model:
      raise ValueError("Must supply model.")
    if not input:
      raise ValueError("Must supply input.")

    self.model: GeneratorHubInterface = model
    self.input: str = input
    self.max_perturbs: int = max_perturbs
    self.distance: Callable[[str,str],int] = distance
    self.output = self.model.translate(self.input)

  def objective(self) -> Callable[[List[float]], float]:
    def _objective(perturbations: List[float]) -> float:
      candidate: str = self.candidate(perturbations)
      translation: str = self.model.translate(candidate)
      return -self.distance(self.output, translation)
    return _objective

  def differential_evolution(self, print_result=True, verbose=True, maxiter=60, popsize=32, polish=False) -> str:
    result = differential_evolution(self.objective(), self.bounds(),
                                    disp=verbose, maxiter=maxiter,
                                    popsize=popsize, polish=polish)
    candidate = self.candidate(result.x)
    if (print_result):
      print(f"Result: {candidate}")
      print(f"Result Distance: {result.fun}")
      print(f"Perturbation Encoding: {result.x}")
      print(f"Input Translation: {self.output}")
      print(f"Result Translation: {self.model.translate(candidate)}")
    return candidate

  def bounds(self) -> List[Tuple[float, float]]:
    raise NotImplementedError()

  def candidate(self, perturbations: List[float]) -> str:
    raise NotImplementedError()


def natural(x: float) -> int:
    """Rounds a float to the nearest non-negative integer (negative values clamp to 0)"""
    return max(0, round(float(x)))
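The objective returns the negated distance because the optimizer minimizes its objective, while the attack aims to maximize the distance between translations. The sketch below shows this with a plain-Python Levenshtein distance in place of `textdistance`; `levenshtein_distance` and `toy_objective` are hypothetical names, and the strings are illustrative rather than real model output.

```python
def levenshtein_distance(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def toy_objective(reference, translation):
    """Negated distance: minimizing this value maximizes the edit distance."""
    return -levenshtein_distance(reference, translation)

print(levenshtein_distance("kitten", "sitting"))
print(toy_objective("père", "père"))
print(toy_objective("père", "grand-père"))
```

An unchanged translation scores 0, the worst possible value for the attacker, and every additional edit lowers the score further.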

### Machine Translation Attacks

These attacks are against the fairseq EN-FR translation model.

#### Invisible Character Attack

The following attack injects the invisible character supplied to create the maximally effective imperceptible perturbation against the supplied visual input and distance metric.

In [None]:
class InvisibleCharacterObjective(Objective):
  """Class representing an Objective which injects invisible characters."""

  def __init__(self, model: GeneratorHubInterface, input: str, max_perturbs: int = 25, invisible_chrs: List[str] = [ZWJ,ZWSP], distance: Callable[[str,str],int] = levenshtein.distance, **kwargs):
    super().__init__(model, input, max_perturbs, distance)
    self.invisible_chrs: List[str] = invisible_chrs

  def bounds(self) -> List[Tuple[float, float]]:
    return [(0,len(self.invisible_chrs)-1), (-1, len(self.input)-1)] * self.max_perturbs

  def candidate(self, perturbations: List[float]) -> str:
    candidate = [char for char in self.input]
    for i in range(0, len(perturbations), 2):
      inp_index = natural(perturbations[i+1])
      if inp_index >= 0:
        inv_char = self.invisible_chrs[natural(perturbations[i])]
        candidate = candidate[:inp_index] + [inv_char] + candidate[inp_index:]
    return ''.join(candidate)

In [None]:
# Execute Example Invisible Characters Attack
inv_obj = InvisibleCharacterObjective(en2fr, 'Spectacular Wingsuit Jump Over Bogota', max_perturbs=2)
inv_result = inv_obj.differential_evolution(maxiter=40, popsize=32)
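The optimizer only sees a flat vector of floats, so each pair of values has to be decoded into a character choice and an insertion position. The standalone sketch below mirrors that decoding step; `decode` is a hypothetical helper written for illustration, and the vector is hand-picked rather than produced by the optimizer.

```python
ZWSP, ZWJ = chr(0x200B), chr(0x200D)
INVISIBLE = [ZWJ, ZWSP]

def decode(text, perturbations):
    """Interpret a flat vector as (character choice, position) pairs."""
    chars = list(text)
    for i in range(0, len(perturbations), 2):
        choice = max(0, round(perturbations[i]))        # index into INVISIBLE
        position = max(0, round(perturbations[i + 1]))  # insertion offset
        chars.insert(position, INVISIBLE[choice])
    return "".join(chars)

# Two perturbations: ZWJ at offset 2, then ZWSP at offset 0.
candidate = decode("Jump", [0.2, 1.7, 0.9, 0.1])
print(len(candidate))
print([f"U+{ord(c):04X}" for c in candidate])
```

Because later insertions are applied to the already perturbed string, the order of the pairs affects where each character ends up.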

#### Homoglyph Attack

This attack replaces characters with homoglyphs to create the maximally effective imperceptible perturbation against the supplied visual input and distance metric.

In [None]:
class HomoglyphObjective(Objective):
  """Class representing an Objective which substitutes homoglyph characters."""

  def __init__(self, model: GeneratorHubInterface, input: str, max_perturbs=None, distance: Callable[[str,str],int] = levenshtein.distance, homoglyphs: Dict[str,List[str]] = intentionals, **kwargs):
    super().__init__(model, input, max_perturbs, distance)
    if not self.max_perturbs:
      self.max_perturbs = len(self.input)
    self.homoglyphs = homoglyphs
    self.glyph_map = []
    for i, char in enumerate(self.input):
      if char in self.homoglyphs:
        charmap = self.homoglyphs[char]
        charmap = list(zip([i] * len(charmap), charmap))
        self.glyph_map.extend(charmap)

  def bounds(self) -> List[Tuple[float, float]]:
    return [(-1, len(self.glyph_map)-1)] * self.max_perturbs

  def candidate(self, perturbations: List[float]) -> str:
    candidate = [char for char in self.input]  
    for perturb in map(natural, perturbations):
      if perturb >= 0:
        i, char = self.glyph_map[perturb]
        candidate[i] = char
    return ''.join(candidate)

In [None]:
# Execute Example Homoglyph Attack
homo_obj = HomoglyphObjective(en2fr, 'Spectacular Wingsuit Jump Over Bogota', max_perturbs=3)
homo_result = homo_obj.differential_evolution(maxiter=5)
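The search space for this attack is the list of every available `(position, replacement)` substitution for the input. The sketch below builds such a list from a small hand-written mapping and applies two substitutions; `TOY_HOMOGLYPHS`, `build_glyph_map`, and `apply_choices` are assumptions for illustration, whereas the notebook derives its mapping from the Unicode data files.

```python
# Hand-written toy mapping from Latin letters to Cyrillic look-alikes.
TOY_HOMOGLYPHS = {"a": ["\u0430"], "o": ["\u043E"]}

def build_glyph_map(text, homoglyphs):
    """List every (position, replacement) substitution available for text."""
    return [(i, glyph)
            for i, char in enumerate(text)
            for glyph in homoglyphs.get(char, [])]

def apply_choices(text, glyph_map, choices):
    """Apply the substitutions selected by index into glyph_map."""
    chars = list(text)
    for choice in choices:
        i, glyph = glyph_map[choice]
        chars[i] = glyph
    return "".join(chars)

glyph_map = build_glyph_map("Bogota", TOY_HOMOGLYPHS)
candidate = apply_choices("Bogota", glyph_map, [0, 2])
print(glyph_map)
print(candidate == "Bogota")  # renders similarly, but the code points differ
```

Each dimension of the optimizer's vector then only needs to select an index into this list.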

#### Reordering Attack

This attack reorders characters using Unicode Bidi overrides to create the maximally effective imperceptible perturbation against the supplied visual input and distance metric.

The reordering patterns used in this attack were designed to be effective against the Bidi implementation used in the Chromium text rendering engine.

In [None]:
class ReorderObjective(Objective):
  """Class representing an Objective which reorders characters with Bidi overrides."""

  def __init__(self, model: GeneratorHubInterface, input: str, max_perturbs: int = 50, distance: Callable[[str,str],int] = levenshtein.distance, **kwargs):
    super().__init__(model, input, max_perturbs, distance)

  def bounds(self) -> List[Tuple[float, float]]:
    return [(-1,len(self.input)-1)] * self.max_perturbs

  def candidate(self, perturbations: List[float]) -> str:
    def swaps(els) -> str:
      res = ""
      for el in els:
          if isinstance(el, Swap):
              res += swaps([LRO, LRI, RLO, LRI, el.one, PDI, LRI, el.two, PDI, PDF, PDI, PDF])
          elif isinstance(el, str):
              res += el
          else:
              for subel in el:
                  res += swaps([subel])
      return res

    _candidate = [char for char in self.input]
    for perturb in map(natural, perturbations):
      if perturb >= 0 and len(_candidate) >= 2:
        perturb = min(perturb, len(_candidate) - 2)
        _candidate = _candidate[:perturb] + [Swap(_candidate[perturb+1], _candidate[perturb])] + _candidate[perturb+2:]

    return swaps(_candidate)

In [None]:
# Execute Example Reordering Attack
reorder_obj = ReorderObjective(en2fr, 'Spectacular Wingsuit Jump Over Bogota', max_perturbs=2)
reorder_result = reorder_obj.differential_evolution(maxiter=5, popsize=5)

differential_evolution step 1: f(x)= -78
differential_evolution step 2: f(x)= -80
differential_evolution step 3: f(x)= -86
differential_evolution step 4: f(x)= -96
differential_evolution step 5: f(x)= -140
Result: Spectacular Wingsuit Jump Over ‭⁦‮⁦o⁩⁦B⁩‬⁩‬go‭⁦‮⁦a⁩⁦t⁩‬⁩‬
Result Distance: -140
Perturbation Encoding: [34.6174655  31.19274238]
Input Translation: Spectaculaire combinaison pour les ailes sauter au-dessus de Bogota
Result Translation: Saut en combinaison à ailes spectaculaire Saut au-dessus d'une combinaison à voilure spectaculaire Saut au-dessus d'une combinaison à voilure spectaculaire Saut au-dessus d'une combinaison à voilure


#### Deletion Attack

This attack inserts a supplied character followed by a Unicode deletion control character, which removes it when rendered, to create the maximally effective imperceptible perturbation against the supplied visual input and distance metric.

In [None]:
class DeletionObjective(Objective):
  """Class representing an Objective which injects deletion control characters."""

  def __init__(self, model: GeneratorHubInterface, input: str, max_perturbs: int = 100, distance: Callable[[str,str],int] = levenshtein.distance, del_chr: str = BKSP, ins_chr: str = 'a', **kwargs):
    super().__init__(model, input, max_perturbs, distance)
    self.del_chr: str = del_chr
    self.ins_chr: str = ins_chr

  def bounds(self) -> List[Tuple[float, float]]:
    return [(-1,len(self.input)-1)] * self.max_perturbs

  def candidate(self, perturbations: List[float]) -> str:
    candidate = [char for char in self.input]
    for i in range(len(perturbations)):
      perturb = natural(perturbations[i])
      candidate = candidate[:perturb] + [self.ins_chr, self.del_chr] + candidate[perturb:]
      for j in range(i,len(perturbations)):
        perturbations[j] += 2
    return ''.join(candidate)

In [None]:
# Execute Example Deletion Attack
deletion_obj = DeletionObjective(en2fr, "This is a test", max_perturbs=2)
deletion_result = deletion_obj.differential_evolution(maxiter=10, popsize=10)

differential_evolution step 1: f(x)= -27
differential_evolution step 2: f(x)= -28
differential_evolution step 3: f(x)= -28
differential_evolution step 4: f(x)= -28
differential_evolution step 5: f(x)= -28
differential_evolution step 6: f(x)= -28
differential_evolution step 7: f(x)= -28
differential_evolution step 8: f(x)= -28
differential_evolution step 9: f(x)= -28
Result: Thais is a taest
Result Distance: -28.0
Perturbation Encoding: [12.75208361  3.78647273]
Input Translation: Il s'agit d'un test
Result Translation: Les Thaïlandais sont les plus avides


#### Attack Performance

In this section we plot the results of each of the attacks over perturbations of fixed sizes.

##### Experiment Setup

In [None]:
# Download EN-FR Test Data
!wget -c http://statmt.org/wmt14/test-full.tgz -O - | tar -xz
!mv test-full/newstest2014-fren-src.en.sgm .
!mv test-full/newstest2014-fren-ref.fr.sgm .
!rm -rf test-full/
!pip install beautifulsoup4

--2021-01-19 12:09:45--  http://statmt.org/wmt14/test-full.tgz
Resolving statmt.org (statmt.org)... 129.215.197.184
Connecting to statmt.org (statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3255445 (3.1M) [application/x-gzip]
Saving to: ‘STDOUT’


2021-01-19 12:09:47 (1.68 MB/s) - written to stdout [3255445/3255445]



In [None]:
# Build source and target mappings for BLEU scoring
from bs4 import BeautifulSoup

with open('newstest2014-fren-src.en.sgm', 'r') as f:
  source_doc = BeautifulSoup(f, 'html.parser')

with open('newstest2014-fren-ref.fr.sgm', 'r') as f:
  target_doc = BeautifulSoup(f, 'html.parser')

source = dict()
target = dict()

for doc in source_doc.find_all('doc'):
  source[str(doc['docid'])] = dict()
  for seg in doc.find_all('seg'):
    if len(str(seg.string)) < 50:
      source[str(doc['docid'])][str(seg['id'])] = str(seg.string)

for doc in target_doc.find_all('doc'):
  target[str(doc['docid'])] = dict()
  for seg in doc.find_all('seg'):
    target[str(doc['docid'])][str(seg['id'])] = str(seg.string)

In [None]:
# Experiment for testing objectives within sliding perturbation bounds
from tqdm.auto import tqdm, trange
import pickle

def experiment(model, objective, source, perturbs, min_perturb, max_perturb, file, maxiter, popsize):
  for i in trange(min_perturb, max_perturb, desc="Perturbations"):
    perturbs[str(i)] = dict()
    for docid, doc in tqdm(source.items(), leave=False, desc="Document"):
      perturbs[str(i)][docid] = dict()
      for segid, seg in tqdm(doc.items(), leave=False, desc="Sequence"):
        perturbs[str(i)][docid][segid] = objective(model, seg, max_perturbs=i).differential_evolution(print_result=False, verbose=False, maxiter=maxiter, popsize=popsize)
        with open(file, 'wb') as f:
          pickle.dump(perturbs, f)

##### Invisible Character Experiment

In [None]:
# Invisible Character Experiment

# Only consider a subset (65 examples) of the source
source_small = { '1016-latimes': source['1016-latimes'], '2327-dailymail.co.uk': source['2327-dailymail.co.uk'] }
perturbs = { '0': source_small }

experiment(en2fr, InvisibleCharacterObjective, source_small, perturbs, 1, 6, '/content/drive/MyDrive/invisible_chars.pkl', 5, 16)

##### Homoglyph Experiment

In [None]:
# Homoglyph Experiment

experiment(en2fr, HomoglyphObjective, source_small, perturbs, 1, 6, '/content/drive/MyDrive/homoglyphs.pkl', 3, 16)

##### Reordering Experiment

In [None]:
# Reordering Experiment

experiment(en2fr, ReorderObjective, source_small, perturbs, 1, 6, '/content/drive/MyDrive/reorder.pkl', 3, 16)

##### Deletion Experiment

In [None]:
# Deletion Experiment

experiment(en2fr, DeletionObjective, source_small, perturbs, 1, 6, '/content/drive/MyDrive/deletion.pkl', 3, 16)

### MNLI Attacks

#### Attack Setup

In [None]:
!wget https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip
!unzip multinli_1.0.zip
!rm -rf __MACOSX/

In [None]:
import json
with open('multinli_1.0/multinli_1.0_dev_matched.jsonl', 'r') as f:
  mnli_test = [json.loads(jline) for jline in f.readlines()]

In [None]:
# Load pre-trained MNLI classification model
import torch
mnli = torch.hub.load('pytorch/fairseq',
                       'roberta.large.mnli').eval().cuda()
label_map = {'contradiction': 0, 'neutral': 1, 'entailment': 2}

In [60]:
class MnliObjective():

  def __init__(self, model: GeneratorHubInterface, input: str, hypothesis: str, label:int, max_perturbs: int):
    if not model:
      raise ValueError("Must supply model.")
    if not input:
      raise ValueError("Must supply input.")
    if not hypothesis:
      raise ValueError("Must supply hypothesis.")
    if label is None:
      raise ValueError("Must supply label.")
    self.model: GeneratorHubInterface = model
    self.input: str = input
    self.hypothesis: str = hypothesis
    self.label: int = label
    self.max_perturbs: int = max_perturbs

  def objective(self) -> Callable[[List[float]], float]:
    def _objective(perturbations: List[float]) -> float:
      candidate: str = self.candidate(perturbations)
      tokens = self.model.encode(candidate, self.hypothesis)
      predict = self.model.predict('mnli', tokens)
      if predict.argmax() != self.label:
        return -np.inf
      else:
        return predict.cpu().detach().numpy()[0][self.label]
    return _objective

  def differential_evolution(self, print_result=True, verbose=True, maxiter=3, popsize=32, polish=False) -> str:
    result = differential_evolution(self.objective(), self.bounds(),
                                    disp=verbose, maxiter=maxiter,
                                    popsize=popsize, polish=polish)
    candidate = self.candidate(result.x)
    if (print_result):
      print(f"Result: {candidate}")
      print(f"Correct Label Prediction: {result.fun}")
      print(f"Perturbation Encoding: {result.x}")
    return candidate
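The classification objective treats any misclassification as the best possible outcome and otherwise prefers candidates that lower the model's score for the correct label. A minimal sketch of that scoring rule on plain Python lists follows; `mnli_score` is a hypothetical helper and the score vectors are made-up values, not model output.

```python
import math

def mnli_score(scores, gold_label):
    """Return -inf once the predicted class differs from the gold label,
    otherwise the score assigned to the gold label."""
    predicted = max(range(len(scores)), key=scores.__getitem__)
    if predicted != gold_label:
        return -math.inf
    return scores[gold_label]

# Label indices follow label_map: contradiction=0, neutral=1, entailment=2.
still_correct = mnli_score([-2.3, -1.6, -0.4], gold_label=2)
misclassified = mnli_score([-0.2, -2.0, -3.0], gold_label=2)
print(still_correct)
print(misclassified)
```

Since the optimizer minimizes, a value of negative infinity is never beaten, so a successful misclassification is retained once it is found.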

#### Invisible Character Attack

In [61]:
class InvisibleCharacterMnliObjective(MnliObjective, InvisibleCharacterObjective):
  
  def __init__(self, model: GeneratorHubInterface, input: str, hypothesis: str, label:int, max_perturbs: int = 10, invisible_chrs: List[str] = [ZWJ,ZWSP], **kwargs):
    super().__init__(model, input, hypothesis, label, max_perturbs)
    self.invisible_chrs = invisible_chrs

#### Homoglyph Attack

In [62]:
class HomoglyphMnliObjective(MnliObjective, HomoglyphObjective):
  
  def __init__(self, model: GeneratorHubInterface, input: str, hypothesis: str, label:int, max_perturbs: int = 10, homoglyphs: Dict[str,List[str]] = intentionals, **kwargs):
    super().__init__(model, input, hypothesis, label, max_perturbs)
    self.homoglyphs = homoglyphs
    self.glyph_map = []
    for i, char in enumerate(self.input):
      if char in self.homoglyphs:
        charmap = self.homoglyphs[char]
        charmap = list(zip([i] * len(charmap), charmap))
        self.glyph_map.extend(charmap)

#### Reordering Attack

In [63]:
class ReorderMnliObjective(MnliObjective, ReorderObjective):
  
  def __init__(self, model: GeneratorHubInterface, input: str, hypothesis: str, label:int, max_perturbs: int = 10, **kwargs):
    super().__init__(model, input, hypothesis, label, max_perturbs)

#### Deletion Attack

In [64]:
class DeletionMnliObjective(MnliObjective, DeletionObjective):
  
  def __init__(self, model: GeneratorHubInterface, input: str, hypothesis: str, label:int, max_perturbs: int = 10, del_chr: str = BKSP, ins_chr: str = 'a', **kwargs):
    super().__init__(model, input, hypothesis, label, max_perturbs)
    self.del_chr: str = del_chr
    self.ins_chr: str = ins_chr

#### Attack Performance

##### Experiment Setup

In [65]:
def mnli_experiment(model, objective, data, file, min_budget, max_budget, maxiter, popsize):
  perturbs = { '0': data }
  for budget in trange(min_budget, max_budget):
    perturbs[str(budget)] = dict()
    for test in tqdm(data, leave=False):
      obj = objective(model, test['sentence1'], test['sentence2'], label_map[test['gold_label']], max_perturbs=budget)
      example = obj.differential_evolution(print_result=False, verbose=False, maxiter=maxiter, popsize=popsize)
      perturbs[str(budget)][test['pairID']] = example
      with open(file, 'wb') as f:
          pickle.dump(perturbs, f)

In [66]:
mnli_test_small = mnli_test[:20]

##### Invisible Character Experiment

In [67]:
mnli_experiment(mnli, InvisibleCharacterMnliObjective, mnli_test_small, "/content/drive/MyDrive/mnli_invisible_chars.pkl", 1, 6, 3, 16)

##### Homoglyph Experiment

In [68]:
mnli_experiment(mnli, HomoglyphMnliObjective, mnli_test_small, "/content/drive/MyDrive/mnli_homoglyphs.pkl", 1, 6, 3, 16)

##### Reordering Experiment

In [None]:
mnli_experiment(mnli, ReorderMnliObjective, mnli_test_small, "/content/drive/MyDrive/mnli_reorder.pkl", 1, 6, 3, 16)

##### Deletion Experiment

In [None]:
mnli_experiment(mnli, DeletionMnliObjective, mnli_test_small, "/content/drive/MyDrive/mnli_deletion.pkl", 1, 6, 3, 16)

## Availability Attacks

Sponge examples can be crafted from imperceptible perturbations. These examples are optimized to maximize inference compute time. When sent in large batches to ML systems, these examples can be used to mount an ML denial of service (DoS) availability attack.

In general, we find that homoglyphs are the best suited transformation to maximize runtime. Future work can search for combinations of transformations that outperform homoglyphs alone in the availability setting.

### Machine Translation Attacks

#### Homoglyph Attack

In [None]:
from timeit import timeit

class HomoglyphSpongeObjective(HomoglyphObjective):

  def objective(self) -> Callable[[List[float]], float]:
    def _objective(perturbations: List[float]) -> float:
      candidate: str = self.candidate(perturbations)
      return -1 * timeit(lambda: self.model.translate(candidate), number=1)
    return _objective
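The sponge objective swaps the distance metric for wall-clock time. The sketch below shows the same negated-timing pattern against a stand-in function; `timing_objective` and `toy_model` are hypothetical, and `toy_model` merely does work proportional to input length rather than running a real translation.

```python
from timeit import timeit

def timing_objective(fn, candidate):
    """Negated wall-clock time of one call, suitable for a minimizer."""
    return -timeit(lambda: fn(candidate), number=1)

def toy_model(text):
    # Stand-in for model.translate: work grows with the input length.
    return sum(ord(c) for c in text * 1000)

short_score = timing_objective(toy_model, "a")
long_score = timing_objective(toy_model, "a" * 500)
print(short_score, long_score)
```

Timing a single call is noisy, so in practice repeated measurements may be needed before small differences between candidates can be trusted.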

#### Attack Performance

##### Homoglyph Experiment

In [None]:
# Sponge Experiment

experiment(en2fr, HomoglyphSpongeObjective, source_small, perturbs, 1, 6, '/content/drive/MyDrive/sponge.pkl', 3, 16)

## Targeted Integrity Attacks

It may be possible to mount targeted attacks which use invisible perturbations to craft an example with a fixed visual input and fixed output over a given model.

This section contains initial examples, but represents an opportunity for future work.

### Attack via `<unk>`

The `<unk>` token can easily be repeated any number of times without visual indication for a given input. This attack attempts, with poor results, to use differential evolution to search for targeted adversarial examples of a fixed visual input and output using only repeated `<unk>` tokens. These `<unk>`s can trivially be mapped back to zero-width spaces, or any other character unknown to the model, to form an adversarial input string.

In [None]:
def make_objective(input: str, target: str, model: GeneratorHubInterface, max_insertions: int = 1024):

  input_enc = model.encode(input)
  input_len = len(input_enc) - 1 # Remove EOS token
  bounds = [(0, max_insertions // input_len)] * input_len

  def objective(perturbations: List[float]) -> float:
    candidate = []
    for idx, perturb_count in enumerate(perturbations):
      candidate += [model.src_dict.unk_index] * max(0,round(float((perturb_count)))) + [int(input_enc[idx])]
    candidate += [int(input_enc[-1])] # Replace EOS token
    translation = model.decode(model.generate(torch.LongTensor(candidate))[0]['tokens'])
    return levenshtein.normalized_distance(target, translation)

  return objective, bounds

In [None]:
input = "Don't Panic."
target = en2fr.translate("Panic.")

objective, bounds = make_objective(input, target, en2fr, max_insertions=256)
result = differential_evolution(objective, bounds, disp=True, maxiter=100, popsize=32)

input_enc = en2fr.encode(input)
candidate = []
for idx, perturb_count in enumerate(result.x):
  candidate += [en2fr.src_dict.unk_index] * round(float((perturb_count))) + [int(input_enc[idx])]
candidate += [int(input_enc[-1])]

best_candidate = en2fr.decode(en2fr.generate(torch.LongTensor(candidate))[0]['tokens'])
print(f"Original Input: '{input}'")
print(f"Original Output: '{target}'")
print(f"Perturbed Output: '{best_candidate}'")
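The candidate construction used in both cells above can be isolated as a small function over plain integer lists. In the sketch below, `insert_unks` is a hypothetical helper, and the token ids and `unk_index` are made-up values rather than entries from the model's dictionary.

```python
def insert_unks(token_ids, counts, unk_index):
    """Prefix each non-EOS token with the requested number of <unk> tokens."""
    body, eos = token_ids[:-1], token_ids[-1]
    result = []
    for token, count in zip(body, counts):
        result += [unk_index] * max(0, round(count)) + [token]
    return result + [eos]  # keep the EOS token in final position

# Three input tokens followed by EOS (id 2), with 0, 2 and 1 insertions.
tokens = insert_unks([11, 12, 13, 2], [0.2, 2.4, 1.1], unk_index=3)
print(tokens)
```

Clamping the rounded count at zero keeps the construction well defined even if a value slightly below zero is supplied.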