## Introduction 

Here we are trying to adjust parameters of a paraphrase model to generate adversarial examples. 
### Policy gradients 
The key parameter update equation is $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is a step size parameter, the parameter vector $\theta$ is for a model (here a paraphrase model), and $J$ is a loss function. The time step $t$ depends on the problem specification and we will get to it later. 

Now in my review I have defined the loss function $J(\theta) = E_\pi[r(\tau)]$. Here: 
* $\pi$ is the policy, a probability distribution for the next action in a given state; essentially $p(a_t|s_t)$
* $\tau$ is a trajectory, a specific sequence $s_0, a_0, r_1, s_1, a_1, \ldots$ of the agent in the game. This starts at time $t=0$ and finishes at time $t=T$. 
* $r(\tau)$ is the sum of rewards for a trajectory $\tau$, or in other words, the total reward for the trajectory. 

For this loss function higher values are better (which might make it a reward function) and so we might have to invert it at some point. 

To update parameters we must find the gradient $\nabla_\theta J(\theta)$, which measures how $J(\theta)$ changes when we adjust the parameters of the paraphrase model. The gradient is simplified through some maths to get the policy gradient theorem $$ \nabla_\theta J(\theta) =  \nabla_\theta E_\pi [r(\tau)]  = E_\pi \left[r(\tau) \sum_{t=1}^T \nabla_\theta \log \pi (a_t|s_t)  \right] $$ 

To calculate this you need to calculate the expectation term, which in turn means evaluating every possible trajectory $\tau$ and its expected return. Generally this is not possible and instead we turn to estimators.  

One of these is REINFORCE. It gives us  $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S \sum_{t=1}^T G_t \nabla \log \pi(a_t|s_t)$$ where 
* $G_t$ is the discounted return and is given by $G_t = r_t + \beta r_{t-1} + \beta^2 r_{t-2} + \dots$. It's a rough estimate of $r(\tau)$. Rewards obtained later in the episode are weighted much higher than rewards obtained earlier. I guess it assumes that the parameters update every timestep. 
* $S$ is some number of samples.

The implementation of REINFORCE and similar estimators depends on how we formulate the problem. Below we present some possible formulations

### Interpretation One: Document-level  
This is the first implementation we will try. 

Here we generate a list of paraphrases at each time point. The idea is that there is one paraphrase amongst them that is a good adversarial example. We try to tune the model to produce the best one. 

This interpretation sees forming the complete paraphrase as one time step. So it isn't token-level but document-level. 

* Starting state: $s0 = x$, the original example  
* Actions: each action is "choosing" a paraphrase (or of choosing $n$ paraphrases). The set of all possible paraphrases and their probabilities is the policy. So $\pi(a|s) = p(x'| x;\theta)$ where $x'$ is the paraphrase (or list of paraphrases). 
    * To approximate this probability, what we can do is generate a large list of paraphrases, and for each, the probabilities of generating each token in turn for that paraphrase. This gives a rough "probability" of how likely that sequence was. This number is kind of like a weight for how good that paraphrase is, according to the model.  We can then turn the weights into probabilities to get a "probability" of the paraphrase. This is dependent on the number of paraphrases generated, so generating a large list is likely to be better for this task. 
* Reward: The paraphrase moves through the reward function $R(x, x')$) to get the reward $r$. 
* Time steps: We only have one time step in the game ($T=1$ and $G_t=r$)  


There are a few variations to this scenario that we can do. For each of these we will formulate the policy and the reward function $R$. Below, $x'$ means paraphrase, $f(x)_y$ means the model confidence of x for the class of the true label $y$, $SS(a,b)$ is the result of a semantic similarity model run over $a$ and $b$, and $\lambda$ is a hyperparameter.  


#### One-paraphrase 
Here we only generate one paraphrase. This scenario also has a few options. First we generate a list of paraphrases with the probabilities of selecting one. Then we either sample probabilistically from the list or pick the most probable option. 

In this case the policy $p(x'|x,\theta)$ is the chance of obtaining a specific paraphrase. For the sampling option this is equal to its sample probability. For the top option this is just the probability of selecting that option. 

The reward function might look like $R(x,x') = f(x)_y - f(x')_y + \lambda SS(x, x')$. We could also make the $SS$ factor a step-function above some threshold. 

The REINFORCE equation $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S \sum_{t=1}^T G_t \nabla \log \pi(a_t|s_t)$$ becomes $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S  R(x,x'_s) \nabla \log p(x'_s|x,\theta)$$ We repeat the process $S$ times where $S$ is ideally as large as possible. We can start with something simple (e.g. $S=10$ or $S=100$) and go from there.  

The gradient term $\nabla \log p(x'_s|x,\theta)$ can hopefully be found with autodiff. 

#### Set of paraphrases
In this scenario the paraphrase model is evaluated on performance over a set of paraphrases, which we call $X'$ here. The policy becomes $p(X'|x, \theta)$, the probability of obtaining that list. We can get this probability by multipling together the "probability" of each individual paraphrase, multiplying also by nCr (for r paraphrases out of n total) to account for the lack of order in the list. 

We can make a number of sub-scenarios here. 

For the **top-paraphrase in set** condition the paraphrase generator is only measured on the best reward for a paraphrase in its set. The idea is the generator will learn to produce a diverse set of examples, any of which could plausibly be a good adversarial example. Here we only look at best performing paraphrase $x'_m$, which we can find by $x'_m = \max_i [f(x)_y - f(x'_i)_y]$, then return $R(x,x'_m) = [f(x)_y - f(x'_m)_y] + \lambda SS(x,x'_m)$ 

For the **average-paraphrase in set** condition the paraphrase generator is measured on the average reward of the paraphrases in its set. This encourages the generator to consider performance of all examples more-or-less equally. The reward function could be something like $\frac{1}{k} \sum_{i=1}^k \left[ f(x)_y - f(x'_i)_y + \lambda SS(x, x'_i) \right]$ 

A combination of these scenarios is the **top-k/top-p\% paraphrases in set**. Here we only use the top-$k$ paraphrases, or more generally, the top $p$ percentage of paraphrases. 


### Interpretation 2: Token-level
This interpretation is at token-level; it sees choosing the next word as the next time step. 

* Starting state: $s0 = x$, the initial state. But you also have a "blank slate" for the paraphrase. So maybe it's a tuple (x, pp) where pp is a paraphrase with no words. Here x is used as the reference for the paraphrase generator.  
* Actions: Choose the next word of p. I guess this starts with the \<START\> token (or something similar). Then you have the policy $\pi(a|s)$ which is the same as $p(w_{next}|pp, x; \theta)$ where $\theta$ is the paraphrase model parameters, $pp$ is the so-far constructed sentence, and $w_{next}$ is the next token (I say token because I don't know if this model is on the subword or word basis). 
* Time steps: every token is generated one-by-one and each of these is allocated a time step. This means probably that you also update the parameters after each token generated too. 
* Reward. The reward is allocated every token. There are many reward functions (see papers on token-level loss functions). Some also incorporate document-level rewards too. 
* Next state. $s_1$ is again the tuple $(x, pp)$ but now $pp$ has the first word in it. 

On *teacher forcing*. This is when you have a ground-truth paraphrase and you can use it when generating tokens. This is useful because if the model makes a mistake it doesn't continue down that track but is adjusted back. This stops big divergences (but also might limit the diversity of generated paraphrases). This is used when training a paraphrase model. You have a set of reference paraphrases that are human provided. Here though we only have the original sentence and no references. We could generate adversarial examples and use that to do teacher forcing. Generating them using textattack recipes might work. This is only really used on the token-level rewards. 

### Updating the paraphrase model parameters. 

There is a choice here. We can either directly update the parameters of the paraphrase model. Or we can fix the parameters and add a new dense layer to the end of the model. We could then train this dense layer to convert paraphrases to adversarial paraphrases. 

Before trying this out, I am worried that we will destroy the capabilities of the paraphrase generator a bit. We might get semantically invalid or ungrammatical or gibberish text. If so we could try and mitigate it a bit by shaping our reward function to maintain grammatical components. 

### Experiment order

Plan is to try the following order: 

1. One-paraphrase (most probable option). I'll start with this one because it is probably the most simple case. Within this category: 
    1a. tune existing parameters only (see if the text is recognisable) 
    1b. add dense layer onto end and try again 
2. One-paraphrase (sampled). This seems like a logical extension on the first one. 
3. Paraphrase-set options. (Decide after finishing 1, 2) 
4. Token-level tuning. (Decide after 1,2,3)


### Layer Freezing

I am uncertain on if to do this or not. 

* This [paper](https://arxiv.org/abs/1911.03090) indicates that you can get pretty good results by freezing all layers except the last few 
* Conversely I saw in the transformers documentation that transformers train better if you don't do layer freezing 


## Setup, load models + datasets 

In [1]:
%load_ext line_profiler

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Core imports 
import torch, numpy as np, pandas as pd, os, gc, logging
from torch.utils.data import DataLoader
from datasets import load_dataset, load_metric, load_from_disk
from transformers import (AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, 
                          AutoTokenizer, AdamW, SchedulerType, get_scheduler)
from collections import defaultdict
from types import MethodType
import utils; from utils import *   # local script 
from tqdm.auto import tqdm

# Dev imports (not needed for final script)
import seaborn as sns
from IPython.display import Markdown
from pprint import pprint
from IPython.core.debugger import set_trace
from GPUtil import showUtilization
import torchsnooper

# Paths
path_cache = './cache/'
path_results = "./results/"

# Seeds
seed = 420
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)

# Devices and GPU settings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
devicenum = torch.cuda.current_device() if device.type == 'cuda' else -1
n_wkrs = 4 * torch.cuda.device_count()
batch_size_dl = 64

# Config
pd.set_option("display.max_colwidth", 400)

# Logging 
logging.basicConfig(format='%(message)s')
logger = logging.getLogger("main_logger")
logger.setLevel(logging.INFO)


### Parameters and training settings
# Paraphrase parameters  
num_beams = 1
num_return_sequences=1
num_beam_groups = 1
diversity_penalty = 0.  # must be a float
temperature = 1.5
length_penalty = 1
min_length = 5

# REINFORCE parameters 
S = 1

# Model training parameters
batch_size = 2
lr = 1e-5 # Initial learning rate (after the potential warmup period) to use
weight_decay = 0
n_train_epochs = 20
#lr_scheduler_type = 'none'
n_warmup_steps = 30 
plot_grads = False

### Load models

In [4]:
## Paraphrase (pp) model 
pp_name = "tuner007/pegasus_paraphrase"
pp_tokenizer = AutoTokenizer.from_pretrained(pp_name)
# takes about 3GB memory space up on the GPU
pp_model = AutoModelForSeq2SeqLM.from_pretrained(pp_name).to(device)
# If need a no_grad version of generate:
pp_model.generate_with_grad = MethodType(utils.generate_with_grad, pp_model)

## Victim Model (VM)
vm_name = "textattack/distilbert-base-uncased-rotten-tomatoes"
vm_tokenizer = AutoTokenizer.from_pretrained(vm_name)
vm_model = AutoModelForSequenceClassification.from_pretrained(vm_name).to(device)
vm_idx2lbl = vm_model.config.id2label
vm_lbl2idx = vm_model.config.label2id
vm_num_labels = vm_model.num_labels

### Load raw datasets and create dataloaders

In [5]:
dataset = load_dataset("rotten_tomatoes")
train,valid,test = dataset['train'],dataset['validation'],dataset['test']
label_cname = 'label'
## For snli
# remove_minus1_labels = lambda x: x[label_cname] != -1
# train = train.filter(remove_minus1_labels)
# valid = valid.filter(remove_minus1_labels)
# test = test.filter(remove_minus1_labels)

# make sure that all datasets have the same number of labels as what the victim model predicts
assert train.features[label_cname].num_classes == vm_num_labels
assert valid.features[label_cname].num_classes == vm_num_labels
assert test.features[ label_cname].num_classes == vm_num_labels

train_dl = DataLoader(train, batch_size=batch_size_dl, shuffle=True, num_workers=n_wkrs)
valid_dl = DataLoader(valid, batch_size=batch_size_dl, shuffle=True, num_workers=n_wkrs)
test_dl = DataLoader( test,  batch_size=batch_size_dl, shuffle=True, num_workers=n_wkrs)

Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/data/tproth/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)


In [6]:
# For testing, we'll just use a simple dataset
simple_dataset = load_dataset('csv',data_files="simple_dataset.csv")['train']
simple_dataset_test = load_dataset('csv',data_files="simple_dataset_test.csv")['train']
simple_dl = DataLoader(simple_dataset, batch_size=batch_size, 
                            shuffle=False, num_workers=n_wkrs)
simple_dl_test = DataLoader(simple_dataset_test, batch_size=batch_size, 
                            shuffle=False, num_workers=n_wkrs)
dl = simple_dl

Using custom data configuration default-a2b91d51da8a7742
Reusing dataset csv (/data/tproth/.cache/huggingface/datasets/csv/default-a2b91d51da8a7742/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-73f997ff87f1fac5
Reusing dataset csv (/data/tproth/.cache/huggingface/datasets/csv/default-73f997ff87f1fac5/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)


## Training

### Description 

Training loop pseudocode

The REINFORCE estimator is $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S  R(x,x'_s) \nabla \log p(x'_s|x,\theta)$$

**Non-batched version (one example), stochastic gradient descent**  
Inputs: train, n_pp=1, vm, ppm, $\alpha = 5e^{-5}$ (saw this rate for $\alpha$ somewhere  
Set eval_mode=true for vm, eval_mode = false for ppm  
Freeze all layers of ppm except last 6  
Shuffle traning dataset  

Loop: take one row $x$ from train
* tokenize
* do greedy search to get paraphrase pp
* get reward using `reward_fn(x, pp)`. $r=R(x,x'_s) = f(x)_y - f(x'_s)_y + \lambda SS(x, x'_s)$ 
* update model parameters 


* generate large UNIVERSE list of paraphrases `pp_l` (e.g. 128) from 'text' column using ppm
* extract sequence scores from this list to get a vector of probabilities `pp_probs`
* take `log` of `pp_probs` and store in `pp_logprobs`
* pick S paraphrases from `pp_l` to get `pp_s`. 
* Take the corresponding entries from `pp_logprobs`. Get gradient of each entry by looking at .grad attribute. Sum them up and store in a variable `gradsum` 
* for each `pp` (i.e. $x'_s$) in `pp_s`:
    * 
* Sum up these rewards to get `rewardsum` and add to `gradsum` to get `nablaJ`
* Update parameters of paraphrase model with $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta)$

$$ J(\theta) \approx \sum_{s=1}^S  R(x,x'_s) \log p(x'_s|x,\theta)$$

### Preprocessing and setup 

#### Define functions 

In [7]:
def get_paraphrases(text, num_return_sequences, num_beams, 
                     num_beam_groups=1, diversity_penalty=0, 
                    temperature=1.5, min_length=5, length_penalty=1):
    """Wrapper for generating paraphrases (pp's). Most keywords are passed on to pp_model.generate function, 
    so see docs for that function. """
    batch = pp_tokenizer(text, truncation=True, padding='longest', return_tensors="pt").to(device)
    set
    # Only greedy search supported at the moment
    generated = pp_model.generate_with_grad(**batch, 
                                         num_beams=num_beams,
                                         num_return_sequences=num_return_sequences, 
                                         do_sample=False, 
                                         temperature=temperature, 
                                         num_beam_groups=num_beam_groups,
                                         diversity_penalty=diversity_penalty,
                                         length_penalty=length_penalty,
                                         min_length=min_length,
                                         return_dict_in_generate=True,
                                         output_scores=True,
                                         pad_token_id = pp_tokenizer.pad_token_id,
                                         eos_token_id = pp_tokenizer.eos_token_id)
    tgt_text = pp_tokenizer.batch_decode(generated.sequences, skip_special_tokens=True)
    return generated, tgt_text

In [8]:
def print_info_on_generated_text():
    """
        Prints a bunch of statistics around the generated text. Useful for debugging purposes.
        So far only works for greedy search.
    """
    logger.info("\n######################################################################\n")
    logger.info(f"Original text: {text}")
    tgt_text = pp_tokenizer.batch_decode(translated.sequences, skip_special_tokens=True)
    tgt_text_with_tokens = pp_tokenizer.batch_decode(translated.sequences, skip_special_tokens=False)
    logger.info(f"Generated text: {tgt_text}")
    logger.info(f"Generated text with special tokens: {tgt_text_with_tokens}")
    logger.info(f"Shape of translated.sequences:{translated.sequences.shape}")
    logger.info(f"translated.sequences:{translated.sequences}")
    logger.info(f"Scores is a tuple of length {len(translated.scores)} \
    and each score is a tensor of shape {translated.scores[0].shape}")
    scores_stacked = torch.stack(translated.scores, 1)
    logger.info(f"Stacking the scores into a tensor of shape {scores_stacked.shape}")
    scores_softmax = torch.softmax(scores_stacked, 2)
    logger.info(f"Now taking softmax. This shouldn't change the shape, but just to check,\
    its shape is {scores_softmax.shape}")
    probsums = scores_softmax.sum(axis=2)
    logger.info(f"These are probabilities now and so they should all sum to 1 (or close to it) in the axis \
    corresponding to each time step. We can check the sums here: {probsums}, but it's a long tensor \
    of shape {probsums.shape} and hard to see, so summing over all these values and removing 1 \
    from each gives {torch.sum(probsums - 1)} \
    which should be close to 0.")
    seq_without_first_tkn = translated.sequences[:, 1:]
    logger.info("Now calculating sequence probabilities")
    seq_token_probs = torch.gather(scores_softmax,2,seq_without_first_tkn[:,:,None]).squeeze(-1)
    seq_prob = seq_token_probs.prod(-1).item()
    logger.info(f"Sequence probability: {seq_prob}")

    # Get the 2nd and 3rd most likely tokens at each st
    topk_ids = torch.topk(scores_softmax,3,dim=2).indices[:,:,1:]
    topk_tokens_probs = torch.gather(scores_softmax,2,topk_ids).squeeze(-1)
    toks2 = pp_tokenizer.convert_ids_to_tokens(topk_ids[:,:,0].squeeze())
    toks3 = pp_tokenizer.convert_ids_to_tokens(topk_ids[:,:,1].squeeze())
    tok_probs2 = topk_tokens_probs[:,:,0].squeeze()
    tok_probs3 = topk_tokens_probs[:,:,1].squeeze()

    logger.info(f"Probabilities of getting the top 3 tokens at each step:")
    tokens = pp_tokenizer.convert_ids_to_tokens(seq_without_first_tkn.squeeze())
    for (p, t, p2,t2,p3,t3)  in zip(seq_token_probs.squeeze(), tokens, tok_probs2, toks2, tok_probs3, toks3): 
        logger.info(f"{t}: {round(p.item(),3)}  {t2}: {round(p2.item(),3)}  {t3}: {round(p3.item(),3)}") 

In [9]:
def get_pp_logp(translated): 
    """log(p(pp|orig)) basically.
    works for greedy search, will need tweaking for other types probably"""
    scores_stacked = torch.stack(translated.scores, 1)
    scores_log_softmax = torch.log_softmax(scores_stacked, 2)
    seq_without_first_tkn = translated.sequences[:, 1:]
    attention_mask = pp_model._prepare_attention_mask_for_generation(
        seq_without_first_tkn, pp_tokenizer.pad_token_id, pp_tokenizer.eos_token_id
    )
    seq_token_log_probs = torch.gather(scores_log_softmax,2,seq_without_first_tkn[:,:,None]).squeeze(-1)
    # account for the padding tokens at the end 
    seq_token_log_probs = seq_token_log_probs * attention_mask
    seq_log_prob = seq_token_log_probs.sum(-1)
    return seq_log_prob

In [10]:
def get_vm_probs(text): 
    if vm_model.training: vm_model.eval()
    tkns = vm_tokenizer(text, truncation=True, padding='longest', return_tensors="pt").to(device)
    logits = vm_model(**tkns).logits
    probs = torch.softmax(logits,1)
    return probs

In [11]:
def reward_fn(orig_l, pp_l, truelabel, return_probs=False): 
    """orig_l, pp_l are lists of original and paraphrase respectively"""
    # Victim model probability differences between orig and pp
    orig_probs,pp_probs = get_vm_probs(orig_l),get_vm_probs(pp_l)
    orig_truelabel_probs = torch.gather(orig_probs,1,truelabel[:,None]).squeeze()
    pp_truelabel_probs   = torch.gather(pp_probs,1,truelabel[:,None]).squeeze()
    vm_scores = (orig_truelabel_probs - pp_truelabel_probs).detach().cpu().tolist()
    
    # ROUGE scores
    def get_rouge_score(ref, pred):
        return rouge_metric.compute(rouge_types=["rougeL"],
            predictions=[pred], references=[ref])['rougeL'].mid.fmeasure 
    rouge_scores = [get_rouge_score(ref=orig,pred=pp) for orig,pp in zip(orig_l, pp_l)]

    # Reward calculation 
    rewards = torch.tensor([-9999 if r < 0.15 else v*r for v,r in zip(vm_scores, rouge_scores)],device=device)
    
    print("orig_l", orig_l)
    print("pp_l", pp_l)
    print("VM score: ", vm_scores)
    print("ROUGE score:", rouge_scores)
    print("Reward:", rewards)
    
    if return_probs: return orig_probs,pp_probs,rewards
    else:            return rewards

In [12]:
def training_step(data): 
    optimizer.zero_grad()
    label,text = data['label'].to(device),data["text"]
    generated, pp_text = get_paraphrases(text,
            num_return_sequences=num_return_sequences, num_beams=num_beams, 
            num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty,
            temperature=temperature, length_penalty=length_penalty, min_length=min_length)
    pp_logp = get_pp_logp(generated)
    reward = reward_fn(orig_l=text, pp_l=pp_text, truelabel=label)
    loss = torch.sum(-reward * pp_logp)  # don't know if summing is the right approach here
    loss.backward()
    optimizer.step()
    #  lr_scheduler.step()
    return loss, reward, pp_logp

In [13]:
def get_vm_preds_for_dl(dl): 
    l = list()
    if pp_model.training: pp_model.eval()
    if vm_model.training: vm_model.eval()
    for i, data in enumerate(dl):
        label,text = data['label'].to(device),data["text"]
        generated, pp_text = get_paraphrases(text,
                num_return_sequences=num_return_sequences, num_beams=num_beams, 
                num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty,
                temperature=temperature, length_penalty=length_penalty, min_length=min_length)
        pp_logp = get_pp_logp(generated).item()

        orig_probs,pp_probs,truelabel_prob_diff = reward_fn(orig=text, 
            pp=pp_text, truelabel=label, return_probs = True)
        orig_probs_truelabel = orig_probs[0].detach().cpu().numpy()[label]
        pp_probs_truelabel   = pp_probs[0].detach().cpu().numpy()[label]
        orig_preds,pp_preds = orig_probs.argmax(1).item(),pp_probs.argmax(1).item()
    
        d = {
            "orig": text[0],
            "pp": pp_text[0],   
            "pp_logp": pp_logp, 
            "pp_p": np.exp(pp_logp),
            "truelabel": label.item(),
            "orig_pred": orig_preds, 
            "pp_pred": pp_preds,
            "orig_probs_truelabel":orig_probs_truelabel,
            "pp_probs_truelabel": pp_probs_truelabel,
            "truelabel_prob_diff": truelabel_prob_diff
        }
        l.append(d)
        
        # writer.add_text()
        # writer.add_text()

        # writer.add_scalars("test_predictions",d)
    return l

In [14]:
def get_avg_prob_diff(preds):
    prob_diffs = [o['truelabel_prob_diff'] for o in preds]
    return np.mean(prob_diffs)

In [15]:
def plot_grad_flow(named_parameters):
    '''Plots the gradients flowing through different layers in the net during training.
    Can be used for checking for possible gradient vanishing / exploding problems.
    
    Usage: Plug this function in Trainer class after loss.backwards() as 
    "plot_grad_flow(self.model.named_parameters())" to visualize the gradient flow'''
    from matplotlib.lines import Line2D
    import matplotlib.pyplot as plt 
    ave_grads = []
    max_grads= []
    layers = []
    for n, p in named_parameters:
        if(p.requires_grad) and ("bias" not in n):
            layers.append(n)
            ave_grads.append(p.grad.abs().mean())
            max_grads.append(p.grad.abs().max())
    plt.bar(np.arange(len(max_grads)), max_grads, alpha=0.1, lw=1, color="c")
    plt.bar(np.arange(len(max_grads)), ave_grads, alpha=0.1, lw=1, color="b")
    plt.hlines(0, 0, len(ave_grads)+1, lw=2, color="k" )
    plt.xticks(range(0,len(ave_grads), 1), layers, rotation="vertical")
    plt.xlim(left=0, right=len(ave_grads))
    plt.ylim(bottom = -0.001, top=0.02) # zoom in on the lower gradient regions
    plt.xlabel("Layers")
    plt.ylabel("Average Gradient")
    plt.title("Gradient Flow")
    plt.grid(True)
    plt.legend([Line2D([0], [0], color="c", lw=4),
                Line2D([0], [0], color="b", lw=4),
                Line2D([0], [0], color="k", lw=4)], ['max-gradient', 'mean-gradient', 'zero-gradient'])

#### Set up models and do layer freezing

In [16]:
### Setup
vm_model.eval()
pp_model.train()

## Layer freezing 
# Unfreeze last 2 layers of the base model decoder
# Not sure if decoder layer norm should be unfrozen or not, but it appears after the
#   other parameters in the module ordering, so let's include it for now
# Also unfreeze the linear head.  This isn't stored in the base model but rather tacked on top
#   and will be fine-tuned for summarisation. 
layer_list = ['decoder.layers.14', 'decoder.layers.15', 'decoder.layer_norm'] 
for i, (name,param) in enumerate(pp_model.base_model.named_parameters()): 
    if np.any([o in name for o in layer_list]):   param.requires_grad = True
    else:                                         param.requires_grad = False
for param in pp_model.lm_head.parameters():       param.requires_grad = True
# For some reason this seems to be excluded
for param in pp_model.base_model.shared.parameters(): param.requires_grad = False 
### For checking the grad status of the layers
# for i, (name, param) in enumerate(pp_model.base_model.named_parameters()): print(i, name, param.requires_grad)
# for i, (name, param) in enumerate(pp_model.lm_head.named_parameters()):    print(i, name, param.requires_grad)

#### Create small dataset (dev step for quicker development, delete later)

In [17]:
# train_small = train.shard(10000, 4, contiguous=False)  # small training set for testing purposes
# train_small_dl = DataLoader(train_small, batch_size=batch_size, 
#                             shuffle=True, num_workers=n_wkrs)
# dl = train_small_dl

#### Set up optimiser and learning rate scheduler

In [18]:
# Code below taken from https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue_no_trainer.py#L363
# Split weights in two groups, one with weight decay and the other not.
# no_decay = ["bias", "LayerNorm.weight"]
# optimizer_grouped_parameters = [
#     {
#         "params": [p for n, p in pp_model.named_parameters() if not any(nd in n for nd in no_decay)],
#         "weight_decay": weight_decay,
#     },
#     {
#         "params": [p for n, p in pp_model.named_parameters() if any(nd in n for nd in no_decay)],
#         "weight_decay": 0.0,
#     },
# ]
# optimizer = AdamW(optimizer_grouped_parameters, lr=lr)

# For now we just keep this simple
optimizer = AdamW(pp_model.parameters(), lr=lr)
# lr_scheduler = get_scheduler(
#     name=lr_scheduler_type,
#     optimizer=optimizer,
#     num_warmup_steps=n_warmup_steps,
#     num_training_steps=n_train_steps,
# )

#### Set up other miscellaneous things

In [19]:
rouge_metric = load_metric("rouge")

### Training loop 

In [20]:
n_train_epochs = 3
n_train_steps = n_train_epochs * len(dl)
plot_grads = False

In [21]:
progress_bar = tqdm(range(n_train_steps))
for epoch in range(n_train_epochs): 
    logger.info(f"Now on epoch {epoch} of {n_train_epochs}")
    if not pp_model.training: pp_model.train()
    for i, data in enumerate(dl): 
        if i % 10 == 0 :   logging.info(f"Now processing batch {i} out of {len(dl)}")
        loss, reward, pp_logp = training_step(data) 
        if plot_grads: plot_grad_flow(pp_model.named_parameters())
    
#       if i == 0: 
#             print(label)
#             print(text)
#             print(pp_text)
#             print("reward: ", reward)
#             print("logp: ", pp_logp)
#             print("p", p)
#             print("loss: ", loss)

        # For debugging
        # print_info_on_generated_text()
        
        # Useful link: 
        # https://discuss.huggingface.co/t/generation-probabilities-how-to-compute-probabilities-of-output-scores-for-gpt2/3175
        # might be helpful?
        # https://discuss.huggingface.co/t/showing-individual-token-and-corresponding-score-during-beam-search/3735/5 
        
        progress_bar.update(1)            
    
    
    # Evaluation loop every 5 epochs
    if epoch % 5 == 0: 
        train_set_preds = get_vm_preds_for_dl(dl = simple_dl)
        test_set_preds  = get_vm_preds_for_dl(dl = simple_dl_test)
        avg_prob_diff_train = get_avg_prob_diff(train_set_preds)
        avg_prob_diff_test  = get_avg_prob_diff(test_set_preds)
        print("Train paraphrases:", [o['pp'] for o in train_set_preds])
        print("Train avg prob diff:", avg_prob_diff_train)
        print("Test paraphrases:",  [o['pp'] for o in test_set_preds])
        print("Test avg prob diff:",  avg_prob_diff_test)

HBox(children=(FloatProgress(value=0.0, max=6.0), HTML(value='')))

Now on epoch 0 of 3


orig_l ['I like this movie', 'I do not like this movie']
pp_l ['This movie is good.', "I don't like this movie."]
VM score:  [-0.2225930094718933, -0.03158789873123169]
ROUGE score: [0.5, 0.6666666666666666]
Reward: tensor([-0.1113, -0.0211], device='cuda:0', dtype=torch.float64)
orig_l ['I love this apple', 'I hate this apple']
pp_l ['I love apples.', 'This apple is not something I like.']
VM score:  [-0.024676978588104248, -0.07387912273406982]
ROUGE score: [0.5714285714285715, 0.36363636363636365]
Reward: tensor([-0.0141, -0.0269], device='cuda:0', dtype=torch.float64)


ValueError: only one element tensors can be converted to Python scalars

In [None]:
%debug

## Testing and debugging 

### Verifying that the weights update each training step 

In [None]:
def check_parameters_update(dl): 
    """
    This checks which parameters are being updated. 
    We run one forward pass+backward pass (updating the parameters once) 
    and look at which ones change. 
    """
    # Check which parameters should be updated
    params_with_grad = [o for o in pp_model.named_parameters() if o[1].requires_grad]
    print("---- Parameters with 'requires_grad' and their sizes ------")
    for (name, p) in params_with_grad:  print(name, p.size())
        
    ## Take a step and see which weights update
    params_all = [o for o in pp_model.named_parameters()]  # this is updated by a training step    
    params_all_initial = [(name, p.clone()) for (name, p) in params_all]  # Initial values
        
    # take a step    
    loss, reward, pp_logp = training_step(data)
    
    print("\n---- Matrix norm of parameter update for one step ------\n")
    for (_,old_p), (name, new_p) in zip(params_all_initial, params_all): 
        print (name, torch.norm(new_p - old_p).item()) 
check_parameters_update(dl)

## Code scraps 

### Experiments around plotting average parameter updates 

In [None]:
def get_parameter_group_dict(): 
    """Function to create "groups" of parameters. This is useful to check how much a group of 
    parameters updates at an epoch. 
    Parameter groups are hardcoded into this code for now. 
    """
    # Identify which parameters should be grouped together
    isolates = ['model.shared.weight',"model.encoder.embed_positions.weight", "model.encoder.layer_norm",
                "model.decoder.embed_positions.weight", "model.decoder.layer_norm"]
    layers_base = ["model.encoder.layers", "model.decoder.layers"]
    def flatten_list(l): return list(np.concatenate(l).flat)
    layers = flatten_list([[lyr + "." + str(o) +"." for o in list(range(16))] for lyr in layers_base])
    parameter_groups = layers + isolates
    # Sort the parameter groups by the order they appear in the model 
    all_params = [name for name,_ in pp_model.named_parameters()]
    ordering = [np.min(np.where([pg in o for o in all_params])) for pg in parameter_groups]
    parameter_groups = [o for _,o in sorted(zip(ordering, parameter_groups))]
    # Assign each model parameter a parameter group 
    group_d = dict()
    for pg in parameter_groups: 
        name = pg[:-1] if pg in layers else pg  # remove the "." from the end of the name for the numeric layers
        group_d[name] = [o for o in all_params if pg in o]
    return group_d

In [None]:
def get_parameter_update_amount(): 
    group_d = get_parameter_group_dict()
    params_all_initial_d = dict(params_all_initial)
    params_all_d = dict(params_all)
    group_d = get_parameter_group_dict()
    df_d = dict()
    for k,param_l in group_d.items(): 
        l = list()
        for p in param_l: 
            l.append((params_all_initial_d[p] - params_all_d[p]).abs().flatten())
        l = torch.cat(l).cpu().detach().numpy()  # list of 1-d tensors to tensor and then to numpy
        df_d[k] = pd.DataFrame(l).describe().values.flatten()
    df = pd.DataFrame(df_d)
    df.index = pd.DataFrame([1,2,3]).describe().index
    return df 

In [None]:
## Random code snippets

# initial_params = [(name, p.detach().clone()) for (name, p) in pp_model.named_parameters()]
# loss, reward, pp_logp = training_step(data) 
# update_d =  dict()
# for (_,old_p), (name, new_p) in zip(initial_params, pp_model.named_parameters()): 
#     update_d[name] = torch.abs(old_p - new_p).detach().flatten()     
    
#             update_d =  dict()
#             for (_,old_p), (name, new_p) in zip(initial_params, pp_model.named_parameters()): 
#                 update_d[name] = torch.abs(old_p - new_p).flatten() 
#                 print (name, torch.norm(new_p - old_p).item())  
            
#             group_d = get_parameter_group_dict()
#             initial_params_d,current_params_d = dict(initial_params),dict()
#             params_all_d = dict(params_all)
#             group_d = get_parameter_group_dict()
#             df_d = dict()
#             for k,param_l in group_d.items(): 
#                 l = list()
#                 for p in param_l: 
#                     l.append((params_all_initial_d[p] - params_all_d[p]).abs().flatten())
#                 l = torch.cat(l).cpu().detach().numpy()  # list of 1-d tensors to tensor and then to numpy
#                 df_d[k] = pd.DataFrame(l).describe().values.flatten()
#             df = pd.DataFrame(df_d)
#             df.index = pd.DataFrame([1,2,3]).describe().index

### Generating a paraphrase dataset and getting VM predictions for it

In [None]:
def create_paraphrase_dataset(batch, cname_input, cname_output, num_beams=32,
                              num_return_sequences=32): 
    """Create paraphrases for each example in the batch. Then repeat the other fields 
        so that the resulting datase has the same length as the number of paraphrases. 
        Key assumption is 
        that the same number of paraphrases is created for each example.
        batch: a dict of examples used by the `map` function from the dataset
        cname_input: What column to create paraphrases of 
        cname_output: What to call the column of paraphrases
        other parameters - passed to get_paraphrases. """
    
    # Generate paraphrases. 
    # This can be later extended to add diversity or so on. 
    #set_trace()
    pp_l,probs = get_paraphrases(batch[cname_input], num_beams=num_beams,
        num_return_sequences=num_return_sequences)
    
    # To return paraphrases as a list of lists for batch input (not done here but might need later)
    #     split_into_sublists = lambda l,n: [l[i:i + n] for i in range(0, len(l), n)]
    #     pp_l = split_into_sublists(pp_l, n_seed_seqs)
    batch[cname_output] = pp_l 
    batch["probs"] = probs.to('cpu').numpy()
    
    # Repeat each entry in all other columns `num_return_sequences` times so they are the same length
    # as the paraphrase column
    # Only works if the same number of paraphrases is generated for each phrase. 
    # Else try something like 
        # for o in zip(*batch.values()):
        #     d = dict(zip(batch.keys(), o))
        #     get_paraphrases(batch[cname_input],num_return_sequences=n_seed_seqs,num_beams=n_seed_seqs)
        #     for k,v in d.items(): 
        #       return_d[k] += v if k == 'text' else [v for o in range(n_paraphrases)]
        # return return_d
    return_d = defaultdict(list) 
    repeat_each_item_n_times = lambda l,n: [o for o in l for i in range(n)]
    for k in batch.keys(): 
        if   k == cname_output: return_d[k] = batch[cname_output]
        elif k == "probs"     : return_d[k] = batch["probs"]
        else:                   return_d[k] = repeat_each_item_n_times(batch[k], num_return_sequences)
    return return_d 

In [None]:
def get_vm_scores(ds_pp, cname_orig, cname_pp, cname_label='label', 
                  use_metric=False, monitor=False): 
    """Get victim model preds+probs for the paraphrase dataset.
    """
    assert vm_model.training == False  # checks that model is in eval mode 
    if use_metric: 
        metric_d = {}
        metric_d['orig'],metric_d['pp'] = load_metric('accuracy'),load_metric('accuracy')
    orig_probs_l,pp_probs_l = [],[]
    if monitor: monitor = Monitor(2)  # track GPU usage and memory
    
    def get_vm_preds(x): 
        """Get predictions for a vector x (here a vector of documents/text). 
        Works for a sentiment-analysis dataset (needs to be adjusted for NLI tasks)"""
        inputs = vm_tokenizer(x, padding=True, truncation=True, return_tensors="pt")
        inputs.to(device)
        outputs = vm_model(**inputs, labels=labels)
        probs = outputs.logits.softmax(1).cpu()
        preds = probs.argmax(1)
        return probs, preds
       
    print("Getting victim model predictions for both original and paraphrased text.")
    dl = DataLoader(ds_pp, batch_size=batch_size, shuffle=False, 
                    num_workers=n_wkrs, pin_memory=True)
    with torch.no_grad():
        for i, data in enumerate(dl): 
            if i % 50 == 0 : print("Now processing batch", i, "out of", len(dl))
            labels,orig,pp = data['label'].to(device),data[cname_orig],data[cname_pp]
            orig_probs, orig_preds = get_vm_preds(orig)            
            pp_probs,   pp_preds   = get_vm_preds(pp)    
            orig_probs_l.append(orig_probs); pp_probs_l.append(pp_probs)
            if use_metric: 
                metric_d['orig'].add_batch(predictions=orig_preds, references=labels)
                metric_d['pp'].add_batch(  predictions=pp_preds,   references=labels)
    if monitor: monitor.stop()
    def list2tensor(l): return torch.cat(l)
    orig_probs_t,pp_probs_t = list2tensor(orig_probs_l),list2tensor(pp_probs_l)
    if use_metric: return orig_probs_t, pp_probs_t, metric_d
    else:          return orig_probs_t, pp_probs_t, None

In [None]:
### Generate paraphrase dataset
num_beams = 10
num_return_sequences = 3
cname_input = 'text' # which text column to paraphrase
cname_output= cname_input + '_pp'
date = '20210825'
fname = path_cache + '_rt_train'+ date + '_' + str(num_return_sequences)
if os.path.exists(fname):  
    ds_pp = datasets.load_from_disk(fname)
else:
    ds_pp = train.shard(200, 0, contiguous=True)
    # Have to call with batched=True
    # Need to set a batch size otherwise will run out of memory on the GPU card. 
    # 64 seems to work well 
    ds_pp = ds_pp.map(
        lambda x: create_paraphrase_dataset(x, 
            num_beams=num_beams, num_return_sequences=num_return_sequences,
            cname_input=cname_input, cname_output=cname_output),
        batched=True, batch_size=4) 
    ds_pp.save_to_disk(fname)
    gc.collect(); torch.cuda.empty_cache() # free up most of the GPU memory

In [None]:
### Get predictions
cname_orig = cname_input
cname_pp = cname_output
cname_label = 'label'
print_metric = True
fname = path_cache + 'results_df_'+ date + "_" + str(num_return_sequences) + ".csv"
if os.path.exists(fname):    results_df = pd.read_csv(fname)
else: 
    #sim_score_t = generate_sim_scores()
    orig_probs_t,pp_probs_t,metric_d = get_vm_scores(ds_pp, cname_orig, 
                                                     cname_pp, cname_label,
                                                     monitor=True, use_metric=print_metric)
    if print_metric: 
        print("orig vm accuracy:",       metric_d['orig'].compute())
        print("paraphrase vm accuracy:", metric_d['pp'].compute())
    vm_orig_scores  = torch.tensor([r[idx] for idx,r in zip(ds_pp[cname_label], orig_probs_t)])
    vm_pp_scores    = torch.tensor([r[idx] for idx,r in zip(ds_pp[cname_label], pp_probs_t)])
    results_df = pd.DataFrame({
                  cname_orig: ds_pp[cname_orig],
                  cname_pp: ds_pp[cname_pp],
   #               'sim_score': sim_score_t,
                  'label_true': ds_pp[cname_label], 
                  'label_vm_orig': orig_probs_t.argmax(1),
                  'label_vm_pp': pp_probs_t.argmax(1),
                  'vm_orig_truelabel': vm_orig_scores,             
                  'vm_pp_truelabel': vm_pp_scores,
                  'vm_truelabel_change': vm_orig_scores - vm_pp_scores,
                  'vm_orig_class0': orig_probs_t[:,0], 
                  'vm_orig_class1': orig_probs_t[:,1], 
                  'vm_pp_class0': pp_probs_t[:,0], 
                  'vm_pp_class1': pp_probs_t[:,1], 
                  })
#    results_df['vm_truelabel_change_X_sim_score'] = results_df['vm_truelabel_change'] * results_df['sim_score']
    results_df.to_csv(fname, index_label = 'idx')

### Testing how to keep gradients with `generate` functions

In [None]:
### Testing the `generate_with_grad` function

input_text="hello my name is Tom"
num_return_sequences=1
num_beams=2
return_probs=True
batch = pp_tokenizer(input_text, truncation=True, padding='longest', return_tensors="pt").to(device)
generated = pp_model.generate_with_grad(**batch, return_dict_in_generate=True, output_scores=True,
                              num_return_sequences=num_return_sequences,
                                num_beams=num_beams,
                                num_beam_groups=1,
                                diversity_penalty=0,
                                temperature=1.5, 
                              length_penalty=1)
print(generated)

tgt_text = pp_tokenizer.batch_decode(generated.sequences, skip_special_tokens=True)
print(pp_tokenizer.tokenize(tgt_text[0]))
print(pp_tokenizer.encode(tgt_text[0]))

# Score: score = sum_logprobs / (hyp.shape[-1] ** self.length_penalty)
# gradient gets removed (i think) by the line 
# beam_hyp.add(
#   input_ids[batch_beam_idx].clone(),
#   next_score.item())


x=generated['scores'][5]
print(x.max(1))
x.max(1).values / (len(generated['scores']) ** 0.8)

In [None]:
## An example of how to use greedy_search

# from transformers import (
# AutoTokenizer,
# AutoModelForCausalLM,
# LogitsProcessorList,
# MinLengthLogitsProcessor,
# )

# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")

# # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
# model.config.pad_token_id = model.config.eos_token_id

# input_prompt = "Today is a beautiful day, and"
# input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

# # instantiate logits processors
# logits_processor = LogitsProcessorList([
#     MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
# ])

# outputs = model.greedy_search(input_ids, logits_processor=logits_processor)

# print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))

### Tensorboard setup 

In [None]:

# from torch.utils.tensorboard import SummaryWriter
# import datetime 
# # Create writer and track to run directory 
# path_runs = './runs/'
# log_dir = path_runs + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
# writer = SummaryWriter(log_dir = log_dir)
# # stuff here logging to tensorboard
# #writer.close() # important otherwise Tensorboard eventually shuts down
