## Introduction 
The goal here is to fine-tune the parameters of a paraphrase model so that it generates adversarial examples. 

### Policy gradients 
The key parameter update equation is $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$, where $\alpha$ is a step size parameter, the parameter vector $\theta$ is for a model (here a paraphrase model), and $J$ is a loss function. The time step $t$ depends on the problem specification and we will get to it later. 

Now in my review I have defined the loss function $J(\theta) = E_\pi[r(\tau)]$. Here: 
* $\pi$ is the policy, a probability distribution for the next action in a given state; essentially $p(a_t|s_t)$
* $\tau$ is a trajectory, a specific sequence $s_0, a_0, r_1, s_1, a_1, \ldots$ of the agent in the game. This starts at time $t=0$ and finishes at time $t=T$. 
* $r(\tau)$ is the sum of rewards for a trajectory $\tau$, or in other words, the total reward for the trajectory. 

This is a strange loss function because higher values are better, so strictly speaking it is an objective to maximise. To use a standard optimiser, which minimises, we would minimise $-J(\theta)$ instead. 
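As a toy illustration of the update rule, here is plain gradient ascent on a made-up one-dimensional objective. The functions `J`, `grad_J` and `gradient_ascent` are hypothetical and only for illustration; they are not part of the paraphrase setup.

```python
def J(theta):
    """Toy objective with its maximum at theta = 3 (illustrative only)."""
    return -(theta - 3.0) ** 2

def grad_J(theta):
    """Analytic gradient of the toy objective."""
    return -2.0 * (theta - 3.0)

def gradient_ascent(theta, alpha=0.1, n_steps=100):
    """Repeat the update theta <- theta + alpha * grad J(theta)."""
    for _ in range(n_steps):
        theta = theta + alpha * grad_J(theta)
    return theta

theta_star = gradient_ascent(theta=0.0)
```

Running gradient descent on $-J$ would give exactly the same sequence of updates, which is why flipping the sign is enough.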

To update parameters we must find the gradient $\nabla_\theta J(\theta)$, which measures how $J(\theta)$ changes when we adjust the parameters of the paraphrase model. Using the log-derivative trick, the gradient simplifies to the policy gradient theorem $$ \nabla_\theta J(\theta) =  \nabla_\theta E_\pi [r(\tau)]  = E_\pi \left[r(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi (a_t|s_t)  \right] $$ 

Evaluating this exactly means computing the expectation, which in turn means summing over every possible trajectory $\tau$, weighted by its probability. Generally this is not possible and instead we turn to estimators.  

One of these is REINFORCE. It gives us  $$ \nabla_\theta J(\theta) \approx \frac{1}{S} \sum_{i=1}^S \sum_{t=0}^{T-1} G_t^{(i)} \nabla_\theta \log \pi(a_t^{(i)}|s_t^{(i)})$$ where 
* $G_t$ is the discounted return from time $t$ onwards, given by $G_t = r_{t+1} + \beta r_{t+2} + \beta^2 r_{t+3} + \dots$ with discount factor $0 \le \beta \le 1$. It takes the place of $r(\tau)$: an action cannot affect rewards that came before it, so those are dropped, and rewards further in the future are weighted less than immediate ones. In standard REINFORCE the parameters are updated once the episode has finished, not at every timestep. 
* $S$ is the number of sampled trajectories, indexed by $i$.
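To make the discounted return concrete, here is a minimal sketch. It uses the standard definition in which $G_t$ sums the rewards received from step $t$ onwards, and the helper name `discounted_returns` is made up for this example.

```python
def discounted_returns(rewards, beta):
    """Compute G_t for every step of one episode.

    rewards[t] is the reward received after the action at step t.
    Works backwards using G_t = rewards[t] + beta * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + beta * running
        returns[t] = running
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], beta=0.5)
```

With a single time step the list has one element and the return is just the reward.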

The implementation of REINFORCE and similar estimators depends on how we formulate the problem. Below we present two possible formulations.

### Interpretation 1: Document-level  
This is the first implementation we will try. 

Here we generate a list of paraphrases at each time point. The idea is that there is one paraphrase amongst them that is a good adversarial example. We try to tune the model to produce the best one. 

This interpretation sees forming the complete paraphrase as one time step. So it isn't token-level but document-level. 

* Starting state: $s_0 = x$, the original example  
* Actions: each action is "choosing" a paraphrase (or choosing $n$ paraphrases). The set of all possible paraphrases and their probabilities is the policy. So $\pi(a|s) = p(x'| x;\theta)$ where $x'$ is the paraphrase. 
    * for a given paraphrase we can get the probabilities of generating each token in turn, then multiply them together to get the probability of the paraphrase under the model (in practice, sum the log-probabilities to avoid underflow). 
    * for a list of paraphrases, we can multiply together the probability of each one to get the probability of obtaining the list, multiplying also by a combinatorial factor to account for the lack of order in the list (for $k$ distinct paraphrases sampled independently this factor is $k!$). 
* Reward: The paraphrase is passed through the reward function $R(x, x')$ to get the reward $r$. 
* Time steps: We only have one time step in the game ($T=1$ and $G_t=r$)  
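A minimal sketch of the document-level quantities, assuming we already have the per-token probabilities of one paraphrase. The numbers and the function names `sequence_log_prob` and `reinforce_loss` are made up for illustration.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of a paraphrase: the sum of per-token log-probabilities,
    which equals the log of the product of the token probabilities."""
    return sum(math.log(p) for p in token_probs)

def reinforce_loss(token_probs, reward):
    """Single-step REINFORCE surrogate, -r * log p(x'|x).
    Minimising it raises the probability of paraphrases with positive reward."""
    return -reward * sequence_log_prob(token_probs)

token_probs = [0.5, 0.25, 0.8]   # made-up per-token probabilities
log_p = sequence_log_prob(token_probs)
loss = reinforce_loss(token_probs, reward=0.3)
```

In a real implementation `log_p` would be a differentiable tensor computed from the paraphrase model's logits, so that the gradient of the loss flows back to $\theta$.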


### Interpretation 2: Token-level
This interpretation is at token-level; it sees choosing the next word as the next time step. 

* Starting state: $s_0 = (x, pp)$, a tuple of the original example $x$ and the paraphrase so far $pp$, which starts as a "blank slate" with no words. Here $x$ is used as the reference for the paraphrase generator.  
* Actions: Choose the next word of $pp$. I guess this starts with the \<START\> token (or something similar). Then you have the policy $\pi(a|s)$ which is the same as $p(w_{next}|pp, x; \theta)$ where $\theta$ is the paraphrase model parameters, $pp$ is the so-far constructed sentence, and $w_{next}$ is the next token (I say token because I don't know if this model is on the subword or word basis). 
* Time steps: every token is generated one-by-one and each of these is allocated a time step. With REINFORCE the parameters would normally still be updated only once the whole paraphrase has been generated, because $G_t$ depends on later rewards; updating after every token would need a bootstrapped method such as actor-critic. 
* Reward. The reward is allocated every token. There are many reward functions (see papers on token-level loss functions). Some also incorporate document-level rewards too. 
* Next state. $s_1$ is again the tuple $(x, pp)$ but now $pp$ has the first word in it. 
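The token-level formulation can be sketched as a small episode loop. Everything here is a toy: `toy_policy` is a hypothetical stand-in for sampling from $p(w_{next}|pp, x; \theta)$ and simply copies $x$, and rewards are left out.

```python
END = "<END>"

def toy_policy(state):
    """Hypothetical stand-in for the paraphrase model's next-token choice.
    It copies x token by token, then emits END."""
    x, pp = state
    return x[len(pp)] if len(pp) < len(x) else END

def step(state, action):
    """Transition: append the chosen token to the partial paraphrase pp."""
    x, pp = state
    return (x, pp + (action,))

def rollout(x, max_len=20):
    """Generate one episode; each generated token is one time step."""
    state = (tuple(x), ())              # s_0 = (x, empty paraphrase)
    trajectory = []
    for _ in range(max_len):
        action = toy_policy(state)
        trajectory.append((state, action))
        state = step(state, action)
        if action == END:
            break
    return trajectory, state

trajectory, final_state = rollout(["a", "good", "film"])
```

A token-level reward function would add a reward to each `(state, action)` pair in `trajectory`.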

On *teacher forcing*. This is when you have a ground-truth paraphrase and feed its tokens, rather than the model's own predictions, back in as the prefix when generating the next token. This is useful because if the model makes a mistake it doesn't continue down that track but is pulled back to the reference. This stops big divergences (but might also limit the diversity of generated paraphrases). It is used when training a paraphrase model, where you have a set of human-provided reference paraphrases. Here though we only have the original sentence and no references. We could generate adversarial examples and use those as references for teacher forcing; generating them with TextAttack recipes might work. Teacher forcing is only really relevant to the token-level formulation. 
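Here is a toy contrast between free-running decoding and teacher forcing. The "model" is a made-up lookup table (`NEXT`) that predicts the next token from the last token of the prefix, so this is only a sketch of the idea and not how the paraphrase model works.

```python
# Made-up next-token table standing in for a trained model.
NEXT = {"the": "film", "film": "is", "is": "bad", "movie": "was", "was": "great"}

def next_token(prefix):
    """Predict the next token from the last token of the prefix."""
    return "the" if not prefix else NEXT.get(prefix[-1], "<END>")

def decode(reference, teacher_forcing):
    """Predict one token per reference position.

    With teacher forcing the prefix is extended with the reference token;
    without it the prefix is extended with the model's own prediction.
    """
    prefix, predictions = [], []
    for ref_tok in reference:
        pred = next_token(prefix)
        predictions.append(pred)
        prefix.append(ref_tok if teacher_forcing else pred)
    return predictions

reference = ["the", "movie", "was", "great"]
free_running = decode(reference, teacher_forcing=False)
forced = decode(reference, teacher_forcing=True)
```

Free-running decoding gives `['the', 'film', 'is', 'bad']`: one wrong token ("film") sends the rest of the sentence off track. With teacher forcing the result is `['the', 'film', 'was', 'great']`: the same mistake is made, but the following predictions recover.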

In [1]:
%load_ext autoreload
%autoreload 2

## Setup, load models + datasets 

In [2]:
import os, gc
import torch 
from torch.utils.data import DataLoader
from datasets import load_dataset, load_metric
import datasets, transformers
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, AutoTokenizer
from pprint import pprint
import numpy as np, pandas as pd
import scipy
from utils import *   # local script 
import pyarrow
from sentence_transformers import SentenceTransformer, util
from IPython.core.debugger import set_trace
from GPUtil import showUtilization
import seaborn as sns
from itertools import repeat
from collections import defaultdict
from IPython.display import Markdown

path_cache = './cache/'
path_results = "./results/"

seed = 420
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
devicenum = torch.cuda.current_device() if device.type == 'cuda' else -1
n_wkrs = 4 * torch.cuda.device_count()
batch_size = 64
pd.set_option("display.max_colwidth", 400)

In [3]:
# Paraphrase model (para)
para_name = "tuner007/pegasus_paraphrase"
para_tokenizer = AutoTokenizer.from_pretrained(para_name)
para_model = AutoModelForSeq2SeqLM.from_pretrained(para_name).to(device)

In [4]:
## Victim Model (VM)
vm_name = "textattack/distilbert-base-uncased-rotten-tomatoes"
vm_tokenizer = AutoTokenizer.from_pretrained(vm_name)
vm_model = AutoModelForSequenceClassification.from_pretrained(vm_name).to(device)
vm_idx2lbl = vm_model.config.id2label
vm_lbl2idx = vm_model.config.label2id
vm_num_labels = vm_model.num_labels

In [5]:
dataset = load_dataset("rotten_tomatoes")
train,valid,test = dataset['train'],dataset['validation'],dataset['test']

label_cname = 'label'
## For snli
# remove_minus1_labels = lambda x: x[label_cname] != -1
# train = train.filter(remove_minus1_labels)
# valid = valid.filter(remove_minus1_labels)
# test = test.filter(remove_minus1_labels)

# make sure that all datasets have the same number of labels as what the victim model predicts
assert train.features[label_cname].num_classes == vm_num_labels
assert valid.features[label_cname].num_classes == vm_num_labels
assert test.features[ label_cname].num_classes == vm_num_labels

train_dl = DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=n_wkrs)
valid_dl = DataLoader(valid, batch_size=batch_size, shuffle=True, num_workers=n_wkrs)
test_dl = DataLoader( test,  batch_size=batch_size, shuffle=True, num_workers=n_wkrs)


Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/data/tproth/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)


## Training

**Setup**
* use training dataset
* sentiment analysis

**Training loop** 
* get batch (e.g. 16 examples) which we call batch_orig
* get paraphrases for each example in batch to make new batch (batch_pp)
    * more efficient if we have bigger batches
    * start with k=2 paraphrases for now (so the code can handle the multi-paraphrase case) but we will try with 1 and with a few as well. 
    * will have to later play around with diversity parameters (or maybe they can also be learned with rl too)
* get reward 
    * the *reward function* $R$ takes in k rows of batch_pp, which corresponds to one example of batch_orig
    * here are the formulas. $x'$ means paraphrase, $f(x)_y$ means the model confidence of x for the class of the true label $y$, $SS(a,b)$ is the result of a semantic similarity model run over $a$ and $b$, and $\lambda$ is a hyperparameter that we'll probably have to tune in reward-shaping style.  
    * for k=1: $R = f(x)_y - f(x')_y + \lambda SS(x, x')$
    * for k>1 (e.g. 3): some thought needed. ideas: 
        * only look at the best-performing paraphrase $x'_m$. find it by $m = \arg\max_i [f(x)_y - f(x'_i)_y]$, then return $R = f(x)_y - f(x'_m)_y + \lambda SS(x,x'_m)$ 
        * take average of each: $\frac{1}{k} \sum_{i=1}^k \left[ f(x)_y - f(x'_i)_y + \lambda SS(x, x'_i) \right]$
* update the parameters $\theta$ of the paraphrase model
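The three reward variants above can be sketched as plain functions. This assumes the victim model confidences $f(\cdot)_y$ and the similarity scores $SS$ have already been computed; the function names and example numbers are made up.

```python
def reward_single(conf_orig, conf_pp, sim, lam):
    """k=1 reward: drop in true-label confidence plus weighted similarity."""
    return conf_orig - conf_pp + lam * sim

def reward_best(conf_orig, confs_pp, sims, lam):
    """k>1, best paraphrase only: pick the index m with the largest
    confidence drop, then score that paraphrase."""
    m = max(range(len(confs_pp)), key=lambda i: conf_orig - confs_pp[i])
    return reward_single(conf_orig, confs_pp[m], sims[m], lam)

def reward_mean(conf_orig, confs_pp, sims, lam):
    """k>1, average of the single-paraphrase rewards."""
    rewards = [reward_single(conf_orig, c, s, lam) for c, s in zip(confs_pp, sims)]
    return sum(rewards) / len(rewards)

conf_orig = 0.9                  # f(x)_y
confs_pp = [0.8, 0.4, 0.6]       # f(x'_i)_y for k=3 paraphrases
sims = [0.9, 0.5, 0.7]           # SS(x, x'_i)
r_best = reward_best(conf_orig, confs_pp, sims, lam=0.5)
r_mean = reward_mean(conf_orig, confs_pp, sims, lam=0.5)
```

Note that the best-paraphrase variant selects on confidence drop alone, as in the formula, so a paraphrase with a large drop but low similarity can still be picked.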
   


In [6]:
#First let's map our adversarial problem to the language of reinforcement learning. We start off with an example $x$, and we then generate a paraphrase $x'$ using a paraphrase model $f_p$ with parameters $\theta$. It seems safe to say that the starting state $s_0$ is represented by $x$, so that $s_0 = x$. The action is then to generate paraphrases. Under this view the policy $\pi(a|s)$ really only has one option - the "generate paraphrase" option - and so this probability distribution is really just 1 on "generate paraphrase" and 0 everywhere else. We then obtain a list of paraphrases $l_{x'}$ , which seems to be ordered from "best to worst" according to the paraphrase model. The chance of getting a particular paraphrase $p(x'|x; \theta)$ is dependent on $\theta$. Under the standard RL model this probability is given by the "environment" and is usually hard to model. It is similar to $p(r_{t+1}|s_t,a_t)$ which is the "environment" factor in my literature review. 
#We now tick over to $t=1$ and put $l_{x'}$ into the reward function and obtain a reward $r$. (We tackle reward function design elsewhere). The next state is $s_1= l_{x'}$. It would seem natural to continue the game and generate paraphrases of paraphrases and so on, but we stop here. The trajectory is then $\tau = x, \text{gen_x'}, r, l_{x'}, \dots$.  

### Tasks 
* ~~clean up text above~~
    * ~~Second edit that makes clear the scenario that is being implemented.~~
* clean up a bunch of commented code 
* write out some scenarios 
    * best-paraphrase reward from set 
    * avg-paraphrase reward from set 
    * one paraphrase reward 
* write reward function
* checkpoint: reward for one set of paraphrases 
* add to git
* implement reinforce 
* ~~make a wiki entry for "teacher forcing"~~
* merge `get_paraphrases` and `get_paraphrases_and_probs`
* run through Trainer tutorial at: https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/pytorch/training.ipynb

In [7]:
def get_paraphrases_and_probs(input_text,num_return_sequences,num_beams, num_beam_groups=1,diversity_penalty=0):
    batch = para_tokenizer(input_text,truncation=True,padding='longest', return_tensors="pt").to(device)
    translated = para_model.generate(**batch,num_beams=num_beams, num_return_sequences=num_return_sequences, 
                               temperature=1.5, 
                                 num_beam_groups=num_beam_groups, 
                                diversity_penalty=diversity_penalty,
                                 return_dict_in_generate=True, output_scores=True)
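    # `sequences_scores` are the beam scores: length-penalised sums of token log-probabilities.
    # They are in log space and won't add to 1 across the generated paraphrases, so we
    # exponentiate and normalise them, which is a softmax.
    # This assumes `input_text` is a single string; a batch of inputs would need normalising per input.
    seq_probs = torch.softmax(translated.sequences_scores, dim=0)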
    tgt_text = para_tokenizer.batch_decode(translated.sequences, skip_special_tokens=True)
    return tgt_text, seq_probs
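The normalisation step in `get_paraphrases_and_probs` can be checked in isolation. This is a standard-library sketch with made-up scores; `normalise_log_scores` is a hypothetical helper and not part of the notebook.

```python
import math

def normalise_log_scores(log_scores):
    """Turn log-space scores into weights that sum to 1 (a softmax).
    Subtracting the maximum first avoids overflow and does not change the result."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

log_scores = [-0.5, -1.2, -3.0]   # made-up beam scores
weights = normalise_log_scores(log_scores)
```

The weights are relative to the returned paraphrases only, so they depend on how many paraphrases were requested.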

In [8]:
input_text = "Note that some asteroids (the ones behind the asteroids marked 1, 5, and 7) won't have a chance to be vaporized until the next full rotation."
num_beam_groups = 10
num_beams=20
num_return_sequences = 20
diversity_penalty=1.
para, seq_probs = get_paraphrases_and_probs(input_text,num_return_sequences,num_beams,
                                            num_beam_groups, diversity_penalty)

# batch = para_tokenizer(input_text,truncation=True,padding='longest', return_tensors="pt").to(device)
# translated = para_model.generate(**batch,num_beams=num_beams, num_return_sequences=num_return_sequences, 
#                                temperature=1.5, 
#                                  num_beam_groups=num_beam_groups, 
#                                  do_sample=True,
#                                 diversity_penalty=diversity_penalty,
#                                  return_dict_in_generate=True, output_scores=True)
# seq_probs = torch.exp(translated.sequences_scores)
# tgt_text = para_tokenizer.batch_decode(translated.sequences, skip_special_tokens=True)

In [9]:
# Precompute paraphrases for the training set and store them
def get_paraphrases(input_text,num_return_sequences,num_beams, num_beam_groups=1,diversity_penalty=0):
    batch = para_tokenizer(input_text,truncation=True,padding='longest', return_tensors="pt").to(device)
    translated = para_model.generate(**batch,num_beams=num_beams, num_return_sequences=num_return_sequences, 
                                   temperature=1.5, num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty)
    tgt_text = para_tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text

# def gen_dataset_paraphrases(x, cname_input, cname_output, n_seed_seqs=32): 
#     """ x: one row of a dataset. 
#     cname_input: column to generate paraphrases for 
#     cname_output: column name to give output of paraphrases 
#     n_seed_seqs: rough indicator of how many paraphrases to return. 
#             For now, keep at 4,8,16,32,64 etc"""
#     # TODO: figure out how to batch this. 
#     if n_seed_seqs % 4 != 0: raise ValueError("keep n_seed_seqs divisible by 4 for now")
#     n = n_seed_seqs/2
#     #low diversity (ld) paraphrases 
#     ld_l = get_paraphrases(x[cname_input],num_return_sequences=int(n),
#                             num_beams=int(n))
#     #high diversity (hd) paraphrases. We can use num_beam_groups and diversity_penalty as hyperparameters. 
#     hd_l =  get_paraphrases(x[cname_input],num_return_sequences=int(n),
#                             num_beams=int(n), num_beam_groups=int(n),diversity_penalty=50002.5)
#     l = ld_l + hd_l 
#     x[cname_output] = l #TODO: change to list(set(l))             
#     return x 


def create_paraphrase_dataset(batch, cname_input, cname_output, n_seed_seqs=32): 
    """Create `n_seed_seq` paraphrases for each example in the batch. Then repeat the other fields 
        so that the resulting dataset has the same length as the number of paraphrases. Key assumption is 
        that the same number of paraphrases is created for each example.
    batch: a dict of examples used by the `map` function from the dataset
    cname_input: What column to create paraphrases of 
    cname_output: What to call the column of paraphrases
    n_seed_seqs: Number of paraphrases to generate. """
    
    # Generate paraphrases. 
    # This can be later extended to add diversity or so on. 
    para_l = get_paraphrases(batch[cname_input], n_seed_seqs, n_seed_seqs)
    
    # To return paraphrases as a list of lists for batch input (not done here but might need later)
    #     split_into_sublists = lambda l,n: [l[i:i + n] for i in range(0, len(l), n)]
    #     para_l = split_into_sublists(para_l, n_seed_seqs)
    batch[cname_output] = para_l 
    
    # Repeat each entry in all other columns `n_seed_seqs` times so they are the same length
    # as the paraphrase column
    # Only works if the same number of paraphrases is generated for each phrase. 
    # Else try something like 
        # for o in zip(*batch.values()):
        #     d = dict(zip(batch.keys(), o))
        #     get_paraphrases(batch[cname_input],num_return_sequences=n_seed_seqs,num_beams=n_seed_seqs)
        #     for k,v in d.items(): 
        #       return_d[k] += v if k == 'text' else [v for o in range(n_paraphrases)]
        # return return_d
    return_d = defaultdict(list) 
    repeat_each_item_n_times = lambda l,n: [o for o in l for i in range(n)]
    for k in batch.keys(): 
        if   k == cname_output: return_d[k] = batch[cname_output]
        else:                   return_d[k] = repeat_each_item_n_times(batch[k], n_seed_seqs)
    return return_d 
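The row-repetition step can be tried without loading any model. This sketch uses made-up paraphrases and assumes they arrive grouped by input, in input order, with the same number per example; `fake_paraphrases` and `expanded` are illustrative names.

```python
def repeat_each_item_n_times(items, n):
    """[a, b] with n=2 becomes [a, a, b, b]."""
    return [item for item in items for _ in range(n)]

n = 2
batch = {"text": ["good film", "dull plot"], "label": [1, 0]}
# Made-up paraphrases: n per example, grouped by example
fake_paraphrases = ["nice film", "fine movie", "boring plot", "flat story"]

# Repeat every existing column so each paraphrase lines up with its source row
expanded = {k: repeat_each_item_n_times(v, n) for k, v in batch.items()}
expanded["text_pphrases"] = fake_paraphrases
```

Every column now has `n * len(batch["text"])` entries, which is what a batched `map` needs when the output has more rows than the input.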

In [10]:
# Generate paraphrase dataset
n_seed_seqs = 3
cname_input = 'text' # which text column to paraphrase
cname_output= cname_input + '_pphrases'
date = '20210802'
fname = path_cache + '_rt_train'+ date + '_' + str(n_seed_seqs)
if os.path.exists(fname):  
    ds_pphrases = datasets.load_from_disk(fname)
else:
    ds_pphrases = train.shard(1, 0, contiguous=True)
    # Have to call with batched=True
    # Need to set a batch size otherwise will run out of memory on the GPU card. 
    # 64 seems to work well 
    ds_pphrases = ds_pphrases.map(
        lambda x: create_paraphrase_dataset(x, n_seed_seqs=n_seed_seqs,
            cname_input=cname_input, cname_output=cname_output),
        batched=True, batch_size=64) 
    ds_pphrases.save_to_disk(fname)
    gc.collect(); torch.cuda.empty_cache() # free up most of the GPU memory

In [11]:
# Get model results for the pphrases
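def get_vm_scores(ds_pphrases, cname_orig, cname_pphrase, cname_label='label'): 
    """Get victim model class probabilities for the original text and for its paraphrases. 
    Returns two tensors, each with one row per dataset row and one column per class."""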
    # Get preds and accuracy on the pphrase dataset
    print("Getting victim model scores.")
    dl = DataLoader(ds_pphrases, batch_size=batch_size, shuffle=False, 
                    num_workers=n_wkrs, pin_memory=True)
  #  metric = load_metric('accuracy')
    pphrase_probs_l,orig_probs_l = [],[]
    assert vm_model.training == False  # checks that model is in eval mode 
    #monitor = Monitor(2)  # track GPU usage and memory
    
    def get_vm_preds(x): 
        """Get class probabilities and predicted labels for a list of texts x. 
        Works for a sentiment-analysis dataset (needs to be adjusted for NLI tasks)"""
        inputs = vm_tokenizer(x, padding=True,truncation=True, return_tensors="pt").to(device)
        # Labels are not passed to the model: only the logits are needed here, not the loss
        outputs = vm_model(**inputs)
        probs = outputs.logits.softmax(1).cpu()
        preds = probs.argmax(1)
        return probs, preds
    
    with torch.no_grad():
        for i, data in enumerate(dl): 
            if i % 50 == 0 : print(i, "out of", len(dl))
            labels,orig,pphrases = data[cname_label].to(device),data[cname_orig],data[cname_pphrase]
            
            # predictions for original
            orig_probs, orig_preds = get_vm_preds(orig)            
            orig_probs_l.append(orig_probs)
            
            # predictions for pphrase
            pphrase_probs, pphrase_preds = get_vm_preds(pphrases)            
            pphrase_probs_l.append(pphrase_probs)
          #  metric.add_batch(predictions=pphrase_preds, references=labels)

    # convert lists to tensor
    orig_probs_t, pphrase_probs_t = torch.cat(orig_probs_l),torch.cat(pphrase_probs_l)
    #monitor.stop()
    return orig_probs_t, pphrase_probs_t

cname_orig = cname_input
cname_pphrase = cname_output
cname_label = 'label'
fname = path_cache + 'results_df_'+ date + "_" + str(n_seed_seqs) + ".csv"
if os.path.exists(fname):    results_df = pd.read_csv(fname)
else: 
    #sim_score_t = generate_sim_scores()
    orig_probs_t,pphrase_probs_t = get_vm_scores(ds_pphrases, cname_orig, cname_pphrase, cname_label)
    vm_orig_scores    = torch.tensor([r[idx] for idx,r in zip(ds_pphrases[cname_label], orig_probs_t   )])
    vm_pphrase_scores = torch.tensor([r[idx] for idx,r in zip(ds_pphrases[cname_label], pphrase_probs_t)])
    results_df = pd.DataFrame({
                  cname_orig: ds_pphrases[cname_orig],
                  cname_pphrase: ds_pphrases[cname_pphrase],
   #               'sim_score': sim_score_t,
                  'label_true': ds_pphrases[cname_label], 
                  'label_vm_orig': orig_probs_t.argmax(1),
                  'label_vm_pphrase': pphrase_probs_t.argmax(1),
                  'vm_orig_truelabel': vm_orig_scores,             
                  'vm_pphrase_truelabel': vm_pphrase_scores,
                  'vm_truelabel_change': vm_orig_scores - vm_pphrase_scores,
                  'vm_orig_class0': orig_probs_t[:,0], 
                  'vm_orig_class1': orig_probs_t[:,1], 
   #               'vm_orig_class2': orig_probs_t[:,2],  
                  'vm_pphrase_class0': pphrase_probs_t[:,0], 
                  'vm_pphrase_class1': pphrase_probs_t[:,1], 
#                  'vm_pphrase_class2': pphrase_probs_t[:,2]     
                  })
#    results_df['vm_truelabel_change_X_sim_score'] = results_df['vm_truelabel_change'] * results_df['sim_score']
    results_df.to_csv(fname, index_label = 'idx')

In [12]:
def reward_fn_onerow(x): 
    """x is one row of a pandas df"""
    text,pp,lbl_change = x['text'],x['text_pphrases'],x['vm_truelabel_change']
    return lbl_change

def reward_fn_batch(): 
    pass 

def loss_fn(): 
    pass 

* make a function that calculates loss 
    * you will have to "shape" the reward from this function, which means you will have to try out a number of different things 
    * for now, try $|f(p_1) - y|^2 + |f(p_2) - y|^2 + |f(p_3) - y|^2$ where $f(p_i)$ is the model confidence for class $y$ on paraphrase $p_i$. 
        * this is tricky because the reward isn't calculated over one datapoint any longer but rather a few of them. in addition you have two datasets: the original and the paraphrase. 
        * maybe the best is to compute the paraphrases inside the reward function. or compute them before but just store them, and then reference it in the reward function. 
    * later you can add various terms, e.g. BERTScore, or a term for fluency, or the semantic similarity component.