## Introduction 
The aim here is to tune the parameters of a paraphrase model so that it generates adversarial examples. 

### Policy gradients 
The key parameter update equation is $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is a step size parameter, $\theta$ is the parameter vector of a model (here a paraphrase model), and $J$ is the objective, referred to below as the loss function. Because the update adds the gradient, it is gradient ascent: each step tries to increase $J$. The time step $t$ depends on the problem specification and we will get to it later. 

Now in my review I have defined the loss function $J(\theta) = E_\pi[r(\tau)]$. Here: 
* $\pi$ is the policy, a probability distribution for the next action in a given state; essentially $p(a_t|s_t)$
* $\tau$ is a trajectory, a specific sequence $s_0, a_0, r_1, s_1, a_1, \ldots$ of the agent in the game. This starts at time $t=0$ and finishes at time $t=T$. 
* $r(\tau)$ is the sum of rewards for a trajectory $\tau$, or in other words, the total reward for the trajectory. 

For this loss function higher values are better (so it is really a reward, or objective, function), and we may have to negate it when using an optimiser that minimises. 

To update parameters we must find the gradient $\nabla_\theta J(\theta)$, which measures how $J(\theta)$ changes when we adjust the parameters of the paraphrase model. The gradient is simplified through some maths to get the policy gradient theorem $$ \nabla_\theta J(\theta) =  \nabla_\theta E_\pi [r(\tau)]  = E_\pi \left[r(\tau) \sum_{t=1}^T \nabla_\theta \log \pi (a_t|s_t)  \right] $$ 

Calculating this exactly requires evaluating the expectation, which in turn means summing over every possible trajectory $\tau$, each weighted by its probability. This is generally intractable, so instead we turn to estimators.  

One of these is REINFORCE. It gives us  $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S \sum_{t=1}^T G_t \nabla \log \pi(a_t|s_t)$$ where 
* $G_t$ is the discounted return and is given by $G_t = r_{t+1} + \beta r_{t+2} + \beta^2 r_{t+3} + \dots$, where $\beta \in [0, 1]$ is a discount factor. It's a rough estimate of $r(\tau)$ from time $t$ onwards. Rewards obtained soon after time $t$ are weighted more heavily than rewards obtained later in the episode. I guess it assumes that the parameters update every timestep. 
* $S$ is some number of samples.
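
As a minimal sketch of the discounted return, assuming the standard convention in which the reward that follows action $a_t$ is $r_{t+1}$ and $\beta \in [0, 1]$ is the discount factor, the returns for a whole episode can be computed with one backward pass over the rewards. The function name `discounted_returns` and the example rewards are hypothetical.

```python
def discounted_returns(rewards, beta=0.99):
    """Return [G_0, ..., G_{T-1}] using the recursion G_t = r_{t+1} + beta * G_{t+1}.

    `rewards` holds [r_1, ..., r_T], so rewards[t] is the reward received
    after taking action a_t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + beta * running
        returns[t] = running
    return returns


example_returns = discounted_returns([1.0, 0.0, 2.0], beta=0.5)
print(example_returns)  # [1.5, 1.0, 2.0]
```

With a single time step ($T=1$) the list has one entry and $G_0$ is just the reward itself, whatever the value of $\beta$.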

The implementation of REINFORCE and similar estimators depends on how we formulate the problem. Below we present some possible formulations.

### Interpretation One: Document-level  
This is the first implementation we will try. 

Here we generate a list of paraphrases at each time point. The idea is that there is one paraphrase amongst them that is a good adversarial example. We try to tune the model to produce the best one. 

This interpretation sees forming the complete paraphrase as one time step. So it isn't token-level but document-level. 

* Starting state: $s_0 = x$, the original example  
* Actions: each action is "choosing" a paraphrase (or choosing $n$ paraphrases). The set of all possible paraphrases and their probabilities is the policy. So $\pi(a|s) = p(x'| x;\theta)$ where $x'$ is the paraphrase (or list of paraphrases). 
    * To approximate this probability, we can generate a large list of paraphrases and, for each one, multiply together the probabilities of generating each of its tokens in turn. This gives a rough "probability" of how likely that sequence was, which acts like a weight for how good the paraphrase is according to the model. We can then normalise the weights across the list so that they sum to one, giving a "probability" for each paraphrase. The result depends on the number of paraphrases generated, so generating a large list is likely to be better for this task. 
* Reward: The paraphrase moves through the reward function $R(x, x')$ to get the reward $r$. 
* Time steps: We only have one time step in the game ($T=1$ and $G_t=r$)  
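
The approximation of the paraphrase "probability" described in the list above could look roughly like the following. This is only a sketch: the token probabilities are made-up numbers rather than model output, and the helper names are hypothetical. It works in log space to avoid underflow on long sequences.

```python
import math


def sequence_log_prob(token_log_probs):
    """Log-probability of a sequence: the sum of its token log-probabilities."""
    return sum(token_log_probs)


def normalise_over_candidates(seq_log_probs):
    """Turn sequence log-probabilities into weights that sum to one.

    This is a softmax over the candidate list, computed stably by
    subtracting the maximum before exponentiating.
    """
    m = max(seq_log_probs)
    exps = [math.exp(lp - m) for lp in seq_log_probs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-token probabilities for three candidate paraphrases.
candidates = [
    [0.9, 0.8, 0.7],
    [0.5, 0.5, 0.5],
    [0.9, 0.1, 0.9],
]
seq_lps = [sequence_log_prob([math.log(p) for p in c]) for c in candidates]
weights = normalise_over_candidates(seq_lps)
print(weights)
```

The weights depend on which candidates are in the list, which is why a larger list should give a better approximation.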


There are a few variations to this scenario that we can do. For each of these we will formulate the policy and the reward function $R$. Below, $x'$ means paraphrase, $f(x)_y$ means the model confidence of x for the class of the true label $y$, $SS(a,b)$ is the result of a semantic similarity model run over $a$ and $b$, and $\lambda$ is a hyperparameter.  


#### One-paraphrase 
Here we only generate one paraphrase. This scenario also has a few options. First we generate a list of paraphrases with the probabilities of selecting one. Then we either sample probabilistically from the list or pick the most probable option. 

In this case the policy $p(x'|x,\theta)$ is the chance of obtaining a specific paraphrase. For the sampling option this is equal to its sample probability. For the top option this is just the probability of selecting that option. 

The reward function might look like $R(x,x') = f(x)_y - f(x')_y + \lambda SS(x, x')$. We could also make the $SS$ factor a step-function above some threshold. 
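
A minimal sketch of this reward, including the optional step-function variant. It takes the victim-model confidences and the similarity score as plain numbers, so it makes no assumption about how they are computed. The function name, the default $\lambda$ and the threshold parameter are all hypothetical.

```python
def reward_onepp_sketch(conf_orig, conf_pp, sim, lam=1.0, sim_threshold=None):
    """R(x, x') = f(x)_y - f(x')_y + lam * SS(x, x').

    conf_orig: f(x)_y, victim confidence in the true label for the original
    conf_pp:   f(x')_y, victim confidence in the true label for the paraphrase
    sim:       SS(x, x'), semantic similarity score
    If `sim_threshold` is given, SS is replaced by a step function that is
    1 at or above the threshold and 0 below it.
    """
    if sim_threshold is not None:
        sim = 1.0 if sim >= sim_threshold else 0.0
    return conf_orig - conf_pp + lam * sim


r_plain = reward_onepp_sketch(0.9, 0.4, 0.8, lam=0.5)
r_step = reward_onepp_sketch(0.9, 0.4, 0.8, lam=0.5, sim_threshold=0.85)
print(r_plain, r_step)
```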

The REINFORCE equation $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S \sum_{t=1}^T G_t \nabla \log \pi(a_t|s_t)$$ becomes $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S  R(x,x'_s) \nabla \log p(x'_s|x,\theta)$$ We repeat the process $S$ times where $S$ is ideally as large as possible. We can start with something simple (e.g. $S=10$ or $S=100$) and go from there.  

The gradient term $\nabla \log p(x'_s|x,\theta)$ can hopefully be found with autodiff. 
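
To illustrate how autodiff can supply this gradient, here is a toy sketch in which a categorical distribution over four candidate paraphrases stands in for the paraphrase model. The logits `theta`, the reward values and $S=8$ are all made up for illustration. The trick is to build a surrogate loss whose gradient is the negative of the REINFORCE estimate, then call `backward()`.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the paraphrase model: a categorical policy over 4 candidate
# paraphrases, parameterised directly by its logits.
theta = torch.zeros(4, requires_grad=True)
rewards_per_candidate = torch.tensor([0.1, 0.9, 0.3, 0.0])  # hypothetical R(x, x')

S = 8
policy = torch.distributions.Categorical(logits=theta)
samples = policy.sample((S,))             # indices of the sampled paraphrases
log_probs = policy.log_prob(samples)      # log p(x'_s | x, theta), shape [S]
rewards = rewards_per_candidate[samples]  # R(x, x'_s), treated as constants

# The gradient of this surrogate is -sum_s R(x, x'_s) * grad log p(x'_s | x, theta),
# so minimising it corresponds to gradient ascent on J.
surrogate_loss = -(rewards.detach() * log_probs).sum()
surrogate_loss.backward()

grad_estimate = -theta.grad               # REINFORCE estimate of grad J
print(grad_estimate)
```

For a real seq2seq model, `log_probs` would instead be the summed token log-probabilities of each generated paraphrase, provided they are computed with gradients enabled.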

#### Set of paraphrases
In this scenario the paraphrase model is evaluated on performance over a set of paraphrases, which we call $X'$ here. The policy becomes $p(X'|x, \theta)$, the probability of obtaining that list. We can get this probability by multiplying together the "probability" of each individual paraphrase, and also multiplying by nCr (for r paraphrases out of n total) to account for the lack of order in the list. 

We can make a number of sub-scenarios here. 

For the **top-paraphrase in set** condition the paraphrase generator is only measured on the best reward for a paraphrase in its set. The idea is the generator will learn to produce a diverse set of examples, any of which could plausibly be a good adversarial example. Here we only look at the best-performing paraphrase $x'_m$, where $m = \arg\max_i [f(x)_y - f(x'_i)_y]$, then return $R(x,x'_m) = [f(x)_y - f(x'_m)_y] + \lambda SS(x,x'_m)$ 

For the **average-paraphrase in set** condition the paraphrase generator is measured on the average reward of the paraphrases in its set. This encourages the generator to consider performance of all examples more-or-less equally. The reward function could be something like $\frac{1}{k} \sum_{i=1}^k \left[ f(x)_y - f(x'_i)_y + \lambda SS(x, x'_i) \right]$ 

A combination of these scenarios is the **top-k/top-p\% paraphrases in set**. Here we only use the top-$k$ paraphrases, or more generally, the top $p$ percentage of paraphrases. 
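
The three set-level rewards can be sketched as below. Confidences and similarities are passed in as plain lists. Two details are assumptions rather than anything fixed above: the top-$k$ variant ranks paraphrases by their full reward, and it averages the rewards of the $k$ it keeps. All function names are hypothetical.

```python
def per_paraphrase_rewards(conf_orig, confs_pp, sims, lam=1.0):
    """Reward f(x)_y - f(x'_i)_y + lam * SS(x, x'_i) for every paraphrase i."""
    return [conf_orig - c + lam * s for c, s in zip(confs_pp, sims)]


def top_paraphrase_reward(conf_orig, confs_pp, sims, lam=1.0):
    """Pick m by the largest confidence drop, then add the similarity term."""
    m = max(range(len(confs_pp)), key=lambda i: conf_orig - confs_pp[i])
    return conf_orig - confs_pp[m] + lam * sims[m]


def average_paraphrase_reward(conf_orig, confs_pp, sims, lam=1.0):
    """Mean reward over all k paraphrases in the set."""
    rewards = per_paraphrase_rewards(conf_orig, confs_pp, sims, lam)
    return sum(rewards) / len(rewards)


def top_k_paraphrase_reward(conf_orig, confs_pp, sims, k, lam=1.0):
    """Mean reward over the k highest-reward paraphrases (an assumed aggregation)."""
    rewards = sorted(per_paraphrase_rewards(conf_orig, confs_pp, sims, lam), reverse=True)
    kept = rewards[:k]
    return sum(kept) / len(kept)


# Hypothetical scores for a set of three paraphrases.
conf_orig, confs_pp, sims = 0.9, [0.2, 0.6, 0.8], [0.4, 0.9, 1.0]
r_top = top_paraphrase_reward(conf_orig, confs_pp, sims)
r_avg = average_paraphrase_reward(conf_orig, confs_pp, sims)
r_top2 = top_k_paraphrase_reward(conf_orig, confs_pp, sims, k=2)
print(r_top, r_avg, r_top2)
```

With $k$ equal to the set size the top-$k$ reward reduces to the average reward.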


### Interpretation Two: Token-level
This interpretation is at token-level; it sees choosing the next word as the next time step. 

* Starting state: $s_0 = x$, the initial state. But you also have a "blank slate" for the paraphrase. So maybe it's a tuple (x, pp) where pp is a paraphrase with no words. Here x is used as the reference for the paraphrase generator.  
* Actions: Choose the next word of pp. I guess this starts with the \<START\> token (or something similar). Then you have the policy $\pi(a|s)$ which is the same as $p(w_{next}|pp, x; \theta)$ where $\theta$ is the paraphrase model parameters, $pp$ is the so-far constructed sentence, and $w_{next}$ is the next token (I say token because I don't know if this model is on the subword or word basis). 
* Time steps: every token is generated one-by-one and each of these is allocated a time step. This means probably that you also update the parameters after each token generated too. 
* Reward. The reward is allocated every token. There are many reward functions (see papers on token-level loss functions). Some also incorporate document-level rewards too. 
* Next state. $s_1$ is again the tuple $(x, pp)$ but now $pp$ has the first word in it. 
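
In this formulation the per-step policy term $\log \pi(a_t|s_t)$ is the log-probability of the token that was actually chosen at step $t$. The sketch below shows how those terms could be read off a matrix of decoder logits. The logits and tokens here are random placeholders and the sizes `T` and `V` are hypothetical, so this only demonstrates the indexing, not real model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, V = 5, 11                                    # hypothetical length and vocab size
logits = torch.randn(T, V, requires_grad=True)  # stand-in for decoder outputs
tokens = torch.randint(0, V, (T,))              # the tokens that were generated

# log pi(. | s_t) over the whole vocabulary, for every time step.
log_probs = F.log_softmax(logits, dim=-1)

# Pick out log pi(a_t | s_t) for the chosen token at each step, shape [T].
token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

# Summing over time gives the log-probability of the whole paraphrase,
# which is the quantity used in the document-level interpretation.
sequence_log_prob = token_log_probs.sum()
print(token_log_probs, sequence_log_prob)
```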

On *teacher forcing*. This is when you have a ground-truth paraphrase and you can use it when generating tokens. This is useful because if the model makes a mistake it doesn't continue down that track but is adjusted back. This stops big divergences (but also might limit the diversity of generated paraphrases). This is used when training a paraphrase model. You have a set of reference paraphrases that are human provided. Here though we only have the original sentence and no references. We could generate adversarial examples and use that to do teacher forcing. Generating them using textattack recipes might work. This is only really used on the token-level rewards. 

### Updating the paraphrase model parameters. 

There is a choice here. We can either update the parameters of the paraphrase model directly, or freeze them and add a new dense layer to the end of the model. We could then train this dense layer to convert paraphrases to adversarial paraphrases. 

Before trying this out, I am worried that we will destroy the capabilities of the paraphrase generator a bit. We might get semantically invalid or ungrammatical or gibberish text. If so we could try and mitigate it a bit by shaping our reward function to maintain grammatical components. 
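
The second option (frozen parameters plus a new trainable dense layer) could be structured roughly as follows. This is a sketch on a toy network, not on the paraphrase model itself: the class name, the layer sizes and the choice to place the new layer on top of the base model's output features are all assumptions.

```python
import torch
from torch import nn


class FrozenBaseWithHead(nn.Module):
    """Wrap a frozen base network and add one trainable dense layer on top."""

    def __init__(self, base, hidden_dim, out_dim):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the original parameters
        self.head = nn.Linear(hidden_dim, out_dim)  # the only trainable part

    def forward(self, x):
        with torch.no_grad():                # no graph is needed for the frozen base
            h = self.base(x)
        return self.head(h)


# Toy stand-in for the pretrained model.
base = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
model = FrozenBaseWithHead(base, hidden_dim=32, out_dim=10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
print(trainable)  # ['head.weight', 'head.bias']
```

Because only the head receives gradients, the base model's behaviour is preserved exactly, which may limit how far the output can drift towards gibberish.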

### Experiment order

Plan is to try the following order: 

1. One-paraphrase (most probable option). I'll start with this one because it is probably the most simple case. Within this category: 
    1a. tune existing parameters only (see if the text is recognisable) 
    1b. add dense layer onto end and try again 
2. One-paraphrase (sampled). This seems like a logical extension on the first one. 
3. Paraphrase-set options. (Decide after finishing 1, 2) 
4. Token-level tuning. (Decide after 1,2,3)


### Layer Freezing

I am uncertain whether to do this. 

* This [paper](https://arxiv.org/abs/1911.03090) indicates that you can get pretty good results by freezing all layers except the last few 
* Conversely I saw in the transformers documentation that transformers train better if you don't do layer freezing 
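
If we do try freezing, the "all layers except the last few" variant can be sketched on a generic stack of layers as below. The helper name and the toy stack are hypothetical. For a real transformer, which attribute holds the list of layers depends on the model class and would need to be checked.

```python
import torch
from torch import nn


def freeze_all_but_last(layers, n_trainable):
    """Freeze every module in `layers` except the final `n_trainable` ones."""
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad_(i >= cutoff)


# Toy stand-in for a stack of transformer layers.
stack = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
freeze_all_but_last(stack, n_trainable=2)

n_trainable_params = sum(p.numel() for p in stack.parameters() if p.requires_grad)
n_total_params = sum(p.numel() for p in stack.parameters())
print(n_trainable_params, n_total_params)  # 144 432
```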


## Setup, load models + datasets 

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os, gc
import torch 
from torch.utils.data import DataLoader
from datasets import load_dataset, load_metric
import datasets, transformers
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, AutoTokenizer
from pprint import pprint
import numpy as np, pandas as pd
import scipy
from utils import *   # local script 
import pyarrow
from sentence_transformers import SentenceTransformer, util
from IPython.core.debugger import set_trace
from GPUtil import showUtilization
import seaborn as sns
from itertools import repeat
from collections import defaultdict
from IPython.display import Markdown

path_cache = './cache/'
path_results = "./results/"

seed = 420
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
devicenum = torch.cuda.current_device() if device.type == 'cuda' else -1
n_wkrs = 4 * torch.cuda.device_count()
batch_size = 64
pd.set_option("display.max_colwidth", 400)

In [4]:
# Paraphrase (pp) model 
pp_name = "tuner007/pegasus_paraphrase"
pp_tokenizer = AutoTokenizer.from_pretrained(pp_name)
# takes about 3GB memory space up on the GPU
pp_model = AutoModelForSeq2SeqLM.from_pretrained(pp_name).to(device)


In [5]:
from types import MethodType
from undecorated import undecorated
generate1 = undecorated(pp_model.generate)
pp_model.generate1 = MethodType(generate1, pp_model)
pp_model.generate3 = MethodType(pp_model.generate.__closure__[0].cell_contents,pp_model)

In [80]:
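# from transformers.models.pegasus import PegasusForConditionalGeneration
import logging
from typing import Any, Callable, Dict, Iterable, List, Mapping, Optional, Union
from transformers.generation_beam_search import BeamScorer, BeamSearchScorer
import torchsnooper

# `generate2` below calls `logger.warning`, so a module-level logger must exist
logger = logging.getLogger(__name__)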

def generate2(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    max_length: Optional[int] = None,
    min_length: Optional[int] = None,
    do_sample: Optional[bool] = None,
    early_stopping: Optional[bool] = None,
    num_beams: Optional[int] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    bad_words_ids: Optional[Iterable[int]] = None,
    bos_token_id: Optional[int] = None,
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
    length_penalty: Optional[float] = None,
    no_repeat_ngram_size: Optional[int] = None,
    encoder_no_repeat_ngram_size: Optional[int] = None,
    num_return_sequences: Optional[int] = None,
    max_time: Optional[float] = None,
    decoder_start_token_id: Optional[int] = None,
    use_cache: Optional[bool] = None,
    num_beam_groups: Optional[int] = None,
    diversity_penalty: Optional[float] = None,
    prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_scores: Optional[bool] = None,
    return_dict_in_generate: Optional[bool] = None,
    forced_bos_token_id: Optional[int] = None,
    forced_eos_token_id: Optional[int] = None,
    remove_invalid_values: Optional[bool] = None,
    **model_kwargs):
    # set init values
    num_beams = num_beams if num_beams is not None else self.config.num_beams
    num_beam_groups = num_beam_groups if num_beam_groups is not None else self.config.num_beam_groups
    max_length = max_length if max_length is not None else self.config.max_length
    do_sample = do_sample if do_sample is not None else self.config.do_sample
    num_return_sequences = (
        num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
    )

    pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
    bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
    eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id

    output_scores = output_scores if output_scores is not None else self.config.output_scores
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict_in_generate = (
        return_dict_in_generate if return_dict_in_generate is not None else self.config.return_dict_in_generate
    )

    model_kwargs["output_attentions"] = output_attentions
    model_kwargs["output_hidden_states"] = output_hidden_states

    if input_ids is None:
        # init `input_ids` with bos_token_id
        input_ids = self._prepare_input_ids_for_generation(bos_token_id, model_kwargs.get("encoder_outputs"))

    if model_kwargs.get("attention_mask", None) is None:
        # init `attention_mask` depending on `pad_token_id`
        model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
            input_ids, pad_token_id, eos_token_id
        )

    # special case if pad_token_id is not defined
    if pad_token_id is None and eos_token_id is not None:
        logger.warning(f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation.")
        pad_token_id = eos_token_id

    # Storing encoder_input_ids for logits_processor that could use them
    encoder_input_ids = input_ids if self.config.is_encoder_decoder else None

    if self.config.is_encoder_decoder:
        # add encoder_outputs to model_kwargs
        model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)

        # set input_ids as decoder_input_ids
        if "decoder_input_ids" in model_kwargs:
            input_ids = model_kwargs.pop("decoder_input_ids")
        else:
            input_ids = self._prepare_decoder_input_ids_for_generation(
                input_ids, decoder_start_token_id=decoder_start_token_id, bos_token_id=bos_token_id
            )

#         if "encoder_outputs" not in model_kwargs or not isinstance(model_kwargs["encoder_outputs"], ModelOutput):
#             raise ValueError("Make sure that `model_kwargs` include `encoder_outputs` of type `ModelOutput`.")
    if input_ids.shape[-1] >= max_length:
        input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids"
        logger.warning(
            f"Input length of {input_ids_string} is {input_ids.shape[-1]}, but ``max_length`` is set to {max_length}."
            "This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``."
        )

    # determine generation mode
    is_greedy_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is False
    is_sample_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is True
    is_beam_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is False
    is_beam_sample_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is True
    is_group_beam_gen_mode = (num_beams > 1) and (num_beam_groups > 1)
    if num_beam_groups > num_beams:
        raise ValueError("`num_beam_groups` has to be smaller or equal to `num_beams`")
    if is_group_beam_gen_mode and do_sample is True:
        raise ValueError(
            "Diverse beam search cannot be used in sampling mode. Make sure that `do_sample` is set to `False`."
        )

    # set model_kwargs
    model_kwargs["use_cache"] = use_cache

    # get distribution pre_processing samplers
    logits_processor = self._get_logits_processor(
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size,
        encoder_no_repeat_ngram_size=encoder_no_repeat_ngram_size,
        encoder_input_ids=encoder_input_ids,
        bad_words_ids=bad_words_ids,
        min_length=min_length,
        max_length=max_length,
        eos_token_id=eos_token_id,
        forced_bos_token_id=forced_bos_token_id,
        forced_eos_token_id=forced_eos_token_id,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
        num_beams=num_beams,
        num_beam_groups=num_beam_groups,
        diversity_penalty=diversity_penalty,
        remove_invalid_values=remove_invalid_values,
    )

    stopping_criteria = self._get_stopping_criteria(
        max_length=max_length,
        max_time=max_time,
    )

    if is_greedy_gen_mode:
        if num_return_sequences > 1:
            raise ValueError(
                f"num_return_sequences has to be 1, but is {num_return_sequences} when doing greedy search."
            )

        # greedy search
        return self.greedy_search(
            input_ids,
            logits_processor=logits_processor,
            stopping_criteria=stopping_criteria,
            max_length=max_length,
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            **model_kwargs,
        )

    elif is_sample_gen_mode:
        # get probability distribution warper
        logits_warper = self._get_logits_warper(
            top_k=top_k, top_p=top_p, temperature=temperature, num_beams=num_beams
        )

        # expand input_ids with `num_return_sequences` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids,
            expand_size=num_return_sequences,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # sample
        return self.sample(
            input_ids,
            logits_processor=logits_processor,
            logits_warper=logits_warper,
            stopping_criteria=stopping_criteria,
            max_length=max_length,
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            **model_kwargs,
        )

    elif is_beam_gen_mode:
        batch_size = input_ids.shape[0]

        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping

        if num_return_sequences > num_beams:
            raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")

        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            max_length=max_length,
            num_beams=num_beams,
            device=self.device,
            length_penalty=length_penalty,
            do_early_stopping=early_stopping,
            num_beam_hyps_to_keep=num_return_sequences,
        )
        # interleave with `num_beams`
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs
        )
        with torchsnooper.snoop(depth=4, max_variable_length=200, normalize=True):
            return self.beam_search(
                input_ids,
                beam_scorer,
                logits_processor=logits_processor,
                stopping_criteria=stopping_criteria,
                max_length=max_length,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                **model_kwargs,
            )

    elif is_beam_sample_gen_mode:
        logits_warper = self._get_logits_warper(
            top_k=top_k, top_p=top_p, temperature=temperature, num_beams=num_beams
        )

        batch_size = input_ids.shape[0] * num_return_sequences

        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            max_length=max_length,
            num_beams=num_beams,
            device=self.device,
            length_penalty=length_penalty,
            do_early_stopping=early_stopping,
        )

        # interleave with `num_beams * num_return_sequences`
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids,
            expand_size=num_beams * num_return_sequences,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        return self.beam_sample(
            input_ids,
            beam_scorer,
            logits_processor=logits_processor,
            logits_warper=logits_warper,
            stopping_criteria=stopping_criteria,
            max_length=max_length,
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            **model_kwargs,
        )

    elif is_group_beam_gen_mode:
        batch_size = input_ids.shape[0]

        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping

        if num_return_sequences > num_beams:
            raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")

        if num_beams % num_beam_groups != 0:
            raise ValueError("`num_beams` should be divisible by `num_beam_groups` for group beam search.")

        diverse_beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            max_length=max_length,
            num_beams=num_beams,
            device=self.device,
            length_penalty=length_penalty,
            do_early_stopping=early_stopping,
            num_beam_hyps_to_keep=num_return_sequences,
            num_beam_groups=num_beam_groups,
        )
        # interleave with `num_beams`
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs
        )
        return self.group_beam_search(
            input_ids,
            diverse_beam_scorer,
            logits_processor=logits_processor,
            stopping_criteria=stopping_criteria,
            max_length=max_length,
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            **model_kwargs,
        )
pp_model.generate2 = MethodType(generate2, pp_model)

In [7]:
## Victim Model (VM)
vm_name = "textattack/distilbert-base-uncased-rotten-tomatoes"
vm_tokenizer = AutoTokenizer.from_pretrained(vm_name)
vm_model = AutoModelForSequenceClassification.from_pretrained(vm_name).to(device)
vm_idx2lbl = vm_model.config.id2label
vm_lbl2idx = vm_model.config.label2id
vm_num_labels = vm_model.num_labels

In [8]:
dataset = load_dataset("rotten_tomatoes")
train,valid,test = dataset['train'],dataset['validation'],dataset['test']
label_cname = 'label'
## For snli
# remove_minus1_labels = lambda x: x[label_cname] != -1
# train = train.filter(remove_minus1_labels)
# valid = valid.filter(remove_minus1_labels)
# test = test.filter(remove_minus1_labels)

# make sure that all datasets have the same number of labels as what the victim model predicts
assert train.features[label_cname].num_classes == vm_num_labels
assert valid.features[label_cname].num_classes == vm_num_labels
assert test.features[ label_cname].num_classes == vm_num_labels

train_dl = DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=n_wkrs)
valid_dl = DataLoader(valid, batch_size=batch_size, shuffle=True, num_workers=n_wkrs)
test_dl = DataLoader( test,  batch_size=batch_size, shuffle=True, num_workers=n_wkrs)


Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/data/tproth/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)


## Training

**Setup**
* use training dataset
* sentiment analysis


**Training loop** 
* get batch (e.g. 16 examples) which we call batch_orig
* get paraphrases for each example in batch to make new batch (batch_pp)
    * more efficient if we have bigger batches
    * start with k=2 paraphrases for now (so the code can handle the multi-paraphrase case) but we will try with 1 and with a few as well. 
    * will have to later play around with diversity parameters (or maybe they can also be learned with rl too)
* get reward 
    * the *reward function* $R$ takes in k rows of batch_pp, which corresponds to one example of batch_orig
    * here are the formulas. 
    * for k=1: $R = f(x)_y - f(x')_y + \lambda SS(x, x')$
    * for k>1 (e.g. 3): some thought needed. ideas: 
        * only look at best performing paraphrase $x'_m$. find it by $m = \arg\max_i [f(x)_y - f(x'_i)_y]$, then return $R = [f(x)_y - f(x'_m)_y] + \lambda SS(x,x'_m)$ 
        * take average of each: $\frac{1}{k} \sum_{i=1}^k \left[ f(x)_y - f(x'_i)_y + \lambda SS(x, x'_i) \right]$
* update the parameters $\theta$ of the paraphrase model
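
The loop above can be sketched end to end on a toy problem. Here a categorical distribution over four fixed candidates stands in for the paraphrase model, a lookup table stands in for the reward function, and Adam is used for the update. Every name and number is hypothetical. It is meant to show the shape of the loop (sample, score, build the REINFORCE surrogate, step), not the real pipeline.

```python
import torch

torch.manual_seed(0)

n_candidates = 4
theta = torch.zeros(n_candidates, requires_grad=True)   # toy policy parameters
candidate_rewards = torch.tensor([0.1, 0.9, 0.3, 0.0])  # hypothetical R(x, x')
optimizer = torch.optim.Adam([theta], lr=0.1)
n_per_batch = 16

for step in range(200):
    policy = torch.distributions.Categorical(logits=theta)
    pp_idx = policy.sample((n_per_batch,))   # "generate" one paraphrase per example
    rewards = candidate_rewards[pp_idx]      # score each paraphrase (constants)
    # REINFORCE surrogate: minimising it ascends the expected reward.
    loss = -(rewards * policy.log_prob(pp_idx)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_probs = torch.softmax(theta.detach(), dim=0)
print(final_probs)
```

After training, most of the probability mass should sit on the candidate with the highest reward. In the real setting, sampling would be replaced by paraphrase generation and the lookup table by the victim model and similarity scores.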


### Tasks 
* ~~clean up text above~~
    * ~~Second edit that makes clear the scenario that is being implemented.~~
* ~~clean up a bunch of commented code~~
* ~~write out some scenarios~~ 
    * ~~best-paraphrase reward from set~~
    * ~~avg-paraphrase reward from set~~
    * ~~one paraphrase reward~~ 
* ~~write out order of scenarios~~
* ~~add to git~~
* ~~merge `get_paraphrases` and `get_paraphrases_and_probs`~~
* ~~Adjust `create_paraphrase_dataset` to be able to deal with probabilities.~~
* ~~Make terminology `para` and `pp` consistent~~
* ~~Adjust `get_vm_scores` to deal with probabilities~~
* write out pseudocode for how the training loop works
    * ~~stochastic gradient descent version~~
    * batched version
    * version with Adam / other gradient descent technique

* work out the simplest way you can start
* ~~run through Trainer tutorial at: https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/pytorch/training.ipynb~~
* ~~look into accumulated gradients~~
* ~~look into mixed precision~~
* ~~look into callbacks~~
* ~~look into ZeRO~~
* ~~look into curriculum learning~~

* ~~look at the generate return function to check size (can you find token-level scores somewhere?)~~
* implement non-batched version
* write batched version pseudocode
* write version where it is extended with Adam 
* write `reward_fn_onepp`


First scenario: one paraphrase (top-one from probabilities) tuning parameters 
* implement reinforce 
* checkpoint: reward for one set of paraphrases 

* write out idea for tweaking hyperparameters of paraphrase model with this approach too

**Possible future tasks**
* try using pytorch AMP
* how GPUs work
* gradient checkpointing 
* distributed training
* hooks
* pinned memory
* freezing layers

In [30]:
# def get_paraphrases(input_text, num_return_sequences, num_beams, return_probs=True, 
#                      num_beam_groups=1, diversity_penalty=0, temperature=1.5):
#     """Wrapper for generating paraphrases (pp's). Most keywords are passed on to pp_model.generate function, 
#     so see docs for that function. Set return_probs=True to return a tuple of (pp's, probs), 
#     else we return just a list of pp's. """
#     set_trace()
#     batch = pp_tokenizer(input_text, truncation=True, padding='longest', return_tensors="pt").to(device)
    
#     translated = pp_model.generate2(**batch, num_beams=num_beams, 
#         num_return_sequences=num_return_sequences, temperature=temperature, 
#         num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty,
#         return_dict_in_generate=True, output_scores=True
#     )
#     # Sequence scores won't add to 1 across the generated paraphrases, so here we normalise them. 
#     # We also need to take exp for them to work. 
#     seq_probs = torch.exp(translated.sequences_scores) / sum(torch.exp(translated.sequences_scores))
#     tgt_text = pp_tokenizer.batch_decode(translated.sequences, skip_special_tokens=True)
#     if return_probs:   return tgt_text, seq_probs
#     else:              return tgt_text

In [9]:
# from transformers import (
# AutoTokenizer,
# AutoModelForCausalLM,
# LogitsProcessorList,
# MinLengthLogitsProcessor,
# )

# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")

# # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
# model.config.pad_token_id = model.config.eos_token_id

# input_prompt = "Today is a beautiful day, and"
# input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

# # instantiate logits processors
# logits_processor = LogitsProcessorList([
#     MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
# ])

# outputs = model.greedy_search(input_ids, logits_processor=logits_processor)

# print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))

Generated: ["Today is a beautiful day, and I'm so happy to be here. I'm so happy to"]


In [11]:
# translated = pp_model.generate2(**batch, num_beams=num_beams, 
#     num_return_sequences=num_return_sequences, temperature=temperature, 
#     num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty,
#     return_dict_in_generate=True, output_scores=True
# )

In [12]:
# translated

In [81]:
input_text = "hello my name is Tom"
num_return_sequences = 1
num_beams = 2
return_probs = True
batch = pp_tokenizer(input_text, truncation=True, padding='longest', return_tensors="pt").to(device)
generated = pp_model.generate2(
    **batch,
    return_dict_in_generate=True,
    output_scores=True,
    num_return_sequences=num_return_sequences,
    num_beams=num_beams,
    num_beam_groups=1,
    diversity_penalty=0,
    temperature=1.5,
    length_penalty=1,
)

# Beam score: score = sum_logprobs / (hyp.shape[-1] ** self.length_penalty)
# The gradient gets dropped (I think) by the call below, because .item()
# converts next_score from a tensor to a plain Python float:
# beam_hyp.add(
#   input_ids[batch_beam_idx].clone(),
#   next_score.item())
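
To check the suspicion in the comment above, here is a small standalone sketch. It assumes made-up per-token log-probabilities rather than the Pegasus outputs, and it mirrors the length-penalised score formula quoted in the comment. It shows that the score tensor still carries a gradient, while the value returned by `.item()` is an ordinary float with no graph attached.

```python
import torch

# Made-up per-token log-probabilities for one hypothesis; requires_grad=True
# stands in for values that would come from the paraphrase model.
token_logprobs = torch.tensor([-0.7, -0.4, -0.9], requires_grad=True)
length_penalty = 1.0

# Same shape of formula as in the comment: sum of log-probs over length ** penalty.
sum_logprobs = token_logprobs.sum()
score = sum_logprobs / (token_logprobs.shape[-1] ** length_penalty)

print(score.requires_grad)  # the tensor is still attached to the autograd graph
stored = score.item()       # plain Python float, detached from the graph
print(type(stored))

# Backpropagation works through the tensor; the float offers nothing to differentiate.
score.backward()
print(token_logprobs.grad)
```

If this matches what happens inside the beam scorer, then keeping gradients would mean storing the score tensor itself (or recomputing the sequence log-probability afterwards) rather than the `.item()` value.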


Source path:... <ipython-input-80-178233e04878>
New var:....... self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
New var:....... input_ids = tensor<(2, 1), int64, cuda:0>
New var:....... max_length = 60
New var:....... min_length = None
New var:....... do_sample = False
New var:....... early_stopping = False
New var:....... num_beams = 2
New var:....... temperature = 1.5
New var:....... top_k = None
New var:....... top_p = None
New var:....... repetition_penalty = None
New var:....... bad_words_ids = None
New var:....... bos_token_id = 0
New var:....... pad_token_id = 0
New var:....... eos_token_id = 1
New var:....... length_penalty = 1
New var:....... no_repeat_ngram_size = None
New var:....... encoder_no_repeat_ngram_size = None
New var:....... num_return_sequences = 1
New var:....... max_time = None
New var:....... decoder_sta

        Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
        Starting var:.. input_ids = None
        Starting var:.. attention_mask = tensor<(2, 6), int64, cuda:0>
        Starting var:.. decoder_input_ids = tensor<(2, 1), int64, cuda:0>
        Starting var:.. decoder_attention_mask = None
        Starting var:.. head_mask = None
        Starting var:.. decoder_head_mask = None
        Starting var:.. encoder_outputs = {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
        Starting var:.. past_key_values = None
        Starting var:.. inputs_embeds = None
        Starting var:.. decoder_inputs_embeds = None
        Starting var:.. labels = None
        Starting var:.. use_cache = None
        Starting var:.. output_attentions = False
        Starting var:.. output_hidden_states = False
  

                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0, grad>
New var:....... next_tokens = tensor<(1, 4), int64, cuda:0>
                line      1777             next_indices = next_tokens // vocab_size
    Source path:... tensor.py
    Starting var:.. args = (tensor<(1, 4), int64, cuda:0>, 96103)
    Starting var:.. kwargs = {}
    Starting var:.. f = <function Tensor.__floordiv__>
    Starting var:.. wrapped = <function Tensor.__floordiv__>
                    call        22     def wrapped(*args, **kwargs):
                    line        23         from torch.overrides import has_torch_function, handle_torch_function
    New var:....... has_torch_function = <function has_torch_function>
    New var:....... handle_torch_function = <function handle_torch_function>
      

                    line       217         for batch_idx, beam_hyp in enumerate(self._beam_hyps):
                    line       268         return UserDict(
                    line       270                 "next_beam_scores": next_beam_scores.view(-1),
                    line       271                 "next_beam_tokens": next_beam_tokens.view(-1),
                    line       272                 "next_beam_indices": next_beam_indices.view(-1),
                    line       269             {
                    line       268         return UserDict(
        Source path:... __init__.py
        Starting var:.. args = REPR FAILED
        Starting var:.. kwargs = {}
                        call       981     def __init__(*args, **kwargs):
                        line       982         if not args:
                        line       985         self, *args = args
        Modified var:.. args = [{'next_beam_scores': tensor<(2,), float32, cuda:0, grad>, 'next_beam_tokens': tensor<(2,),

    Modified var:.. reordered_past = ((tensor<(2, 16, 1, 64), float32, cuda:0, grad>, tensor<(2, 16, 1, 64), float32, cuda:0, grad>, te...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 1, 64), float3

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 1, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 1, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 1, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
    

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1340         return reordered_past
                    return    1340         return reordered_past
    Return value:.. ((tensor<(2, 16, 1, 64), float32, cuda:0, grad>, tensor<(2, 16, 1, 64), float32, cuda:0, grad>, te...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
Source path:... generation_utils.py
                line      1803             if beam_scorer.is_done:
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
                    call       196     def is_done(self) -> bool:
                    line       197         return self._done.all()
                    return     197         return self._done.all()
    Return value:.. tensor<(), bool, cuda:0>
Source path:... generation_utils

                        line      1268             attention_mask=attention_mask,
                        line      1269             decoder_input_ids=decoder_input_ids,
                        line      1270             encoder_outputs=encoder_outputs,
                        line      1271             decoder_attention_mask=decoder_attention_mask,
                        line      1272             head_mask=head_mask,
                        line      1273             decoder_head_mask=decoder_head_mask,
                        line      1274             past_key_values=past_key_values,
                        line      1275             inputs_embeds=inputs_embeds,
                        line      1276             decoder_inputs_embeds=decoder_inputs_embeds,
                        line      1277             use_cache=use_cache,
                        line      1278             output_attentions=output_attentions,
                        line      1279             output_hidden_sta

Modified var:.. scores = (tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0, grad>
                line      1777             next_indices = next_tokens // vocab_size
    Source pat

                line      1791             beam_idx = beam_outputs["next_beam_indices"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-1.4723, -2.7266], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([117, 131], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_indices'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key in self.data:
                    line      1007             return self.data[key]
                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), int64, cuda:0>
Source path:... generation_utils.py
                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
Modified var:.. input_ids = tensor<(2, 3), int64, cuda:0>
                line      1795             cur_len = cur_len + 1

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 2, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 2, 64), float32, cuda:0, grad>
 

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 2, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
    

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 2, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 2, 64), float32, cuda:0, grad>
 

                    line       718                 self._forward_pre_hooks.values()):
                    line       716         for hook in itertools.chain(
                    line       724         if torch._C._get_tracing_state():
                    line       727             result = self.forward(*input, **kwargs)
        Source path:... modeling_pegasus.py
        Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
        Starting var:.. input_ids = None
        Starting var:.. attention_mask = tensor<(2, 6), int64, cuda:0>
        Starting var:.. decoder_input_ids = tensor<(2, 1), int64, cuda:0>
        Starting var:.. decoder_attention_mask = None
        Starting var:.. head_mask = None
        Starting var:.. decoder_head_mask = None
        Starting var:.. encoder_outputs = {'last_hidden_state': tensor<(

                line      1752             if return_dict_in_generate:
                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
Modified var:.. scores = (tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True

    Return value:.. {'next_beam_scores': tensor([-1.5838, -2.8241], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([161, 208], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
Source path:... generation_utils.py
Modified var:.. beam_outputs = {'next_beam_scores': tensor([-1.5838, -2.8241], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([161, 208], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
                line      1789             beam_scores = beam_outputs["next_beam_scores"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-1.5838, -2.8241], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([161, 208], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_scores'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 3, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 3, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 3, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 3, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 3, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
    

                    return    1340         return reordered_past
    Return value:.. ((tensor<(2, 16, 3, 64), float32, cuda:0, grad>, tensor<(2, 16, 3, 64), float32, cuda:0, grad>, te...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
Source path:... generation_utils.py
                line      1803             if beam_scorer.is_done:
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
                    call       196     def is_done(self) -> bool:
                    line       197         return self._done.all()
                    return     197         return self._done.all()
    Return value:.. tensor<(), bool, cuda:0>
Source path:... generation_utils.py
                line      1806             if stopping_criteria(input_ids, scores):
    Source path:... generation_stopping_criteria.py
    Starting var:.. self = [<transformers.generation_stoppi

                        line      1268             attention_mask=attention_mask,
                        line      1269             decoder_input_ids=decoder_input_ids,
                        line      1270             encoder_outputs=encoder_outputs,
                        line      1271             decoder_attention_mask=decoder_attention_mask,
                        line      1272             head_mask=head_mask,
                        line      1273             decoder_head_mask=decoder_head_mask,
                        line      1274             past_key_values=past_key_values,
                        line      1275             inputs_embeds=inputs_embeds,
                        line      1276             decoder_inputs_embeds=decoder_inputs_embeds,
                        line      1277             use_cache=use_cache,
                        line      1278             output_attentions=output_attentions,
                        line      1279             output_hidden_sta

Modified var:.. scores = (tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0, grad>
            

                line      1791             beam_idx = beam_outputs["next_beam_indices"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-1.7536, -3.4577], device='cuda:0', grad_fn=<ViewBackward>), 'next_be...kens': tensor([ 442, 3227], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_indices'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key in self.data:
                    line      1007             return self.data[key]
                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), int64, cuda:0>
Source path:... generation_utils.py
                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
Modified var:.. input_ids = tensor<(2, 5), int64, cuda:0>
                line      1795             cur_len = cur_len + 1

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 4, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 4, 64), float32, cuda:0, grad>
 

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 4, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
    

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 4, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 4, 64), float32, cuda:0, grad>
 

                line      1732             outputs = self(
                line      1733                 **model_inputs,
                line      1734                 return_dict=True,
                line      1735                 output_attentions=output_attentions,
                line      1736                 output_hidden_states=output_hidden_states,
                line      1732             outputs = self(
    Source path:... module.py
    Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
    Starting var:.. input = ()
    Starting var:.. kwargs = {'input_ids': None, 'encoder_outputs': {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0,... 'use_cache': None, 'return_dict': True, 'output_attentions': False, 'output_hidden_states': False}
                    call       715     def _call_impl(self, *

                line      1752             if return_dict_in_generate:
                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
Modified var:.. scores = (tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, ...uda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>, tensor<(2, 96103), float32, cuda:0, grad>)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774            

    Return value:.. {'next_beam_scores': tensor([-2.1390, -3.9068], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([107, 111], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
Source path:... generation_utils.py
Modified var:.. beam_outputs = {'next_beam_scores': tensor([-2.1390, -3.9068], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([107, 111], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
                line      1789             beam_scores = beam_outputs["next_beam_scores"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-2.1390, -3.9068], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([107, 111], device='cuda:0'), 'next_beam_indices': tensor([0, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_scores'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key

                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. None
                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tup

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 5, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 5, 64), float32, cuda:0, grad>

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1340         return reordered_past
                    return    1340         return reordered_past
    Return value:.. ((tensor<(2, 16, 5, 64), float32, cuda:0, grad>, tensor<(2, 16, 5, 64), float32, cuda:0, grad>, te...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
Source path:... generation_utils.py
                line      1803             if beam_scorer.is_done:
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
                    call       196     def is_done(self) -> bool:
                    line       197         return self._done.all()
                    return     197         return self._done.all()
    Return value:.. tensor<(), bool, cuda:0>
Source path:... generation_utils

                        line      1280             return_dict=return_dict,
                        line      1266         outputs = self.model(
        New var:....... outputs = {'last_hidden_state': tensor<(2, 1, 1024), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2,...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
                        line      1282         lm_logits = self.lm_head(outputs[0]) + self.final_logits_bias
        New var:....... lm_logits = tensor<(2, 1, 96103), float32, cuda:0, grad>
                        line      1284         masked_lm_loss = None
        New var:....... masked_lm_loss = None
                        line      1285         if labels is not None:
                        line      1289         if not return_dict:
                        line      1293         return Seq2SeqLMOutput(
                        line      1294             loss=masked_lm_loss,
                        line      1

                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0, grad>
                line      1777             next_indices = next_

Modified var:.. beam_outputs = {'next_beam_scores': tensor([-4.3012, -7.3791], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([125, 161], device='cuda:0'), 'next_beam_indices': tensor([1, 1], device='cuda:0')}
                line      1789             beam_scores = beam_outputs["next_beam_scores"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-4.3012, -7.3791], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([125, 161], device='cuda:0'), 'next_beam_indices': tensor([1, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_scores'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key in self.data:
                    line      1007             return self.data[key]
                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), float32, cuda:0, grad>
Source path:... generation_utils.py

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 6, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 6, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 6, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 6, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 6, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],

                    return    1340         return reordered_past
    Return value:.. ((tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, te...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
Source path:... generation_utils.py
                line      1803             if beam_scorer.is_done:
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
                    call       196     def is_done(self) -> bool:
                    line       197         return self._done.all()
                    return     197         return self._done.all()
    Return value:.. tensor<(), bool, cuda:0>
Source path:... generation_utils.py
                line      1806             if stopping_criteria(input_ids, scores):
    Source path:... generation_stopping_criteria.py
    Starting var:.. self = [<transformers.generation_stoppi

                        line      1267             input_ids,
                        line      1268             attention_mask=attention_mask,
                        line      1269             decoder_input_ids=decoder_input_ids,
                        line      1270             encoder_outputs=encoder_outputs,
                        line      1271             decoder_attention_mask=decoder_attention_mask,
                        line      1272             head_mask=head_mask,
                        line      1273             decoder_head_mask=decoder_head_mask,
                        line      1274             past_key_values=past_key_values,
                        line      1275             inputs_embeds=inputs_embeds,
                        line      1276             decoder_inputs_embeds=decoder_inputs_embeds,
                        line      1277             use_cache=use_cache,
                        line      1278             output_attentions=output_attentions,

                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0, grad>
                line      1777             next_indices = next_tokens // vocab_size
    Source path:... tensor.py
    Starting var:.. args = (tensor<(1, 4), int64, cuda:0>, 96103)
    Starting var:.. kwargs = {

                line      1791             beam_idx = beam_outputs["next_beam_indices"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-5.2613, -7.1020], device='cuda:0', grad_fn=<ViewBackward>), 'next_beam_tokens': tensor([131, 346], device='cuda:0'), 'next_beam_indices': tensor([0, 0], device='cuda:0')}
    Starting var:.. key = 'next_beam_indices'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key in self.data:
                    line      1007             return self.data[key]
                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), int64, cuda:0>
Source path:... generation_utils.py
                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
Modified var:.. input_ids = tensor<(2, 8), int64, cuda:0>
                line      1795             cur_len = cur_len + 1

        New var:....... past_state = tensor<(2, 16, 7, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 7, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 7, 64), float32, cuda:0, grad>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 7, 64), f

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 7, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 7, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 7, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 7, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 7, 64), float32, cuda:0, grad>

                    line      1319         return {
                    return    1319         return {
    Return value:.. {'input_ids': None, 'encoder_outputs': {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0,...64, cuda:0>, 'attention_mask': tensor<(2, 6), int64, cuda:0>, 'head_mask': None, 'use_cache': None}
Source path:... generation_utils.py
                line      1732             outputs = self(
                line      1733                 **model_inputs,
                line      1734                 return_dict=True,
                line      1735                 output_attentions=output_attentions,
                line      1736                 output_hidden_states=output_hidden_states,
                line      1732             outputs = self(
    Source path:... module.py
    Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=

Modified var:.. next_token_scores = tensor<(2, 96103), float32, cuda:0, grad>
                line      1748             next_token_scores = logits_processor(input_ids, next_token_scores)
    Source path:... generation_logits_process.py
    Starting var:.. self = [<transformers.generation_logits_process.MinLengthLogitsProcessor object>, <transformers.generation_logits_process.ForcedEOSTokenLogitsProcessor object>]
    Starting var:.. input_ids = tensor<(2, 8), int64, cuda:0>
    Starting var:.. scores = tensor<(2, 96103), float32, cuda:0, grad>
    Starting var:.. kwargs = {}
                    call        84     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.FloatTensor:
                    line        85         for processor in self:
    New var:....... processor = <transformers.generation_logits_process.MinLengthLogitsProcessor object>
                    line        86             function_args = inspect.signature(processor.__call__)

                line      1778             next_tokens = next_tokens % vocab_size
                line      1781             beam_outputs = beam_scorer.process(
                line      1782                 input_ids,
                line      1783                 next_token_scores,
                line      1784                 next_tokens,
                line      1785                 next_indices,
                line      1786                 pad_token_id=pad_token_id,
                line      1787                 eos_token_id=eos_token_id,
                line      1781             beam_outputs = beam_scorer.process(
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
    Starting var:.. input_ids = tensor<(2, 8), int64, cuda:0>
    Starting var:.. next_scores = tensor<(1, 4), float32, cuda:0, grad>
    Starting var:.. next_tokens = tensor<(1, 4), int64, cuda:0>
    Starting var:.. next_indices =

                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
Modified var:.. input_ids = tensor<(2, 9), int64, cuda:0>
                line      1795             cur_len = cur_len + 1
Modified var:.. cur_len = 9
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
                line      1798                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
    Starting var:.. outputs = {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 8, 64...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
    Starting var:.. model_kwargs = {'use_cache': None, 'attention_mask': tensor<(2, 6), int64, cuda:0>, 'encoder_outputs': {'last_hid...d>, tensor<(2, 16, 6, 64), floa

                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 8, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 8, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 8, 64), float32, cuda:0, grad>

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 8, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 8, 64), float32, cuda:0, grad>

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 8, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 8, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                line      1732             outputs = self(
                line      1733                 **model_inputs,
                line      1734                 return_dict=True,
                line      1735                 output_attentions=output_attentions,
                line      1736                 output_hidden_states=output_hidden_states,
                line      1732             outputs = self(
    Source path:... module.py
    Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
    Starting var:.. input = ()
    Starting var:.. kwargs = {'input_ids': None, 'encoder_outputs': {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0,... 'use_cache': None, 'return_dict': True, 'output_attentions': False, 'output_hidden_states': False}
                    call       715     def _call_impl(self, *

                line      1752             if return_dict_in_generate:
                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line      1773             next_token_scores, next_tokens = torch.topk(
Modified var:.. next_token_scores = tensor<(1, 4), float32, cuda:0

Modified var:.. input_ids = tensor<(2, 10), int64, cuda:0>
                line      1795             cur_len = cur_len + 1
Modified var:.. cur_len = 10
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
                line      1798                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
    Starting var:.. outputs = {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 9, 64...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
    Starting var:.. model_kwargs = {'use_cache': None, 'attention_mask': tensor<(2, 6), int64, cuda:0>, 'encoder_outputs': {'last_hid...d>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))}
    Starting var:.. is_encoder_decoder = True

                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 9, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 9, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Startin

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 9, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],

        New var:....... past_state = tensor<(2, 16, 9, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 9, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 9, 64), float32, cuda:0, grad>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 9, 64), f

                    line       718                 self._forward_pre_hooks.values()):
                    line       716         for hook in itertools.chain(
                    line       724         if torch._C._get_tracing_state():
                    line       727             result = self.forward(*input, **kwargs)
        Source path:... modeling_pegasus.py
        Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
        Starting var:.. input_ids = None
        Starting var:.. attention_mask = tensor<(2, 6), int64, cuda:0>
        Starting var:.. decoder_input_ids = tensor<(2, 1), int64, cuda:0>
        Starting var:.. decoder_attention_mask = None
        Starting var:.. head_mask = None
        Starting var:.. decoder_head_mask = None
        Starting var:.. encoder_outputs = {'last_hidden_state': tensor<(

                line      1749             next_token_scores = next_token_scores + beam_scores[:, None].expand_as(next_token_scores)
                line      1752             if return_dict_in_generate:
                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line 

Modified var:.. input_ids = tensor<(2, 11), int64, cuda:0>
                line      1795             cur_len = cur_len + 1
Modified var:.. cur_len = 11
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
                line      1798                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
    Starting var:.. outputs = {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 10, 6...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
    Starting var:.. model_kwargs = {'use_cache': None, 'attention_mask': tensor<(2, 6), int64, cuda:0>, 'encoder_outputs': {'last_hid...d>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))}
    Starting var:.. is_encoder_decoder = True

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 10, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 10, 64), float32, cuda:0, grad>

        Return value:.. tensor<(2, 16, 10, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 10, 64), float32, cuda:0, grad>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 10, 64), float32, cuda:0, grad>
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. past_state = tensor<(2, 16, 10, 64), float32, cuda:0, grad>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
        

                line      1729         while cur_len < max_length:
                line      1730             model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
    Source path:... modeling_pegasus.py
    Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
    Starting var:.. decoder_input_ids = tensor<(2, 11), int64, cuda:0>
    Starting var:.. past = ((tensor<(2, 16, 10, 64), float32, cuda:0, grad>, tensor<(2, 16, 10, 64), float32, cuda:0, grad>, ...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
    Starting var:.. attention_mask = tensor<(2, 6), int64, cuda:0>
    Starting var:.. head_mask = None
    Starting var:.. use_cache = None
    Starting var:.. encoder_outputs = {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}


                        return    1293         return Seq2SeqLMOutput(
        Return value:.. {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 11, 6...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
    Source path:... module.py
    New var:....... result = {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 11, 6...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
                    line       728         for hook in itertools.chain(
                    line       729                 _global_forward_hooks.values(),
                    line       730                 self._forward_hooks.values()):
                    line       728         for hook in itertools.chain(
                    line       734         if (len(self._backward_hooks) > 0) or (len(_global_backward_hooks) > 0):
                   

                line      1749             next_token_scores = next_token_scores + beam_scores[:, None].expand_as(next_token_scores)
                line      1752             if return_dict_in_generate:
                line      1753                 if output_scores:
                line      1754                     scores += (next_token_scores,)
                line      1755                 if output_attentions:
                line      1762                 if output_hidden_states:
                line      1770             vocab_size = next_token_scores.shape[-1]
                line      1771             next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
Modified var:.. next_token_scores = tensor<(1, 192206), float32, cuda:0, grad>
                line      1773             next_token_scores, next_tokens = torch.topk(
                line      1774                 next_token_scores, 2 * num_beams, dim=1, largest=True, sorted=True
                line 

                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
Modified var:.. input_ids = tensor<(2, 12), int64, cuda:0>
                line      1795             cur_len = cur_len + 1
Modified var:.. cur_len = 12
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
                line      1798                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
                line      1797             model_kwargs = self._update_model_kwargs_for_generation(
    Starting var:.. outputs = {'logits': tensor<(2, 1, 96103), float32, cuda:0, grad>, 'past_key_values': ((tensor<(2, 16, 11, 6...float32, cuda:0, grad>)), 'encoder_last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0, grad>}
    Starting var:.. model_kwargs = {'use_cache': None, 'attention_mask': tensor<(2, 6), int64, cuda:0>, 'encoder_outputs': {'last_hid...d>, tensor<(2, 16, 6, 64), fl

                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 11, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. tensor<(2, 16, 11, 64), float32, cuda:0, grad>

                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Return value:.. None
                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tup

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 11, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
   

                line      1732             outputs = self(
    Source path:... module.py
    Starting var:.. self = PegasusForConditionalGeneration(  (model): PegasusModel(    (shared): Embedding(96103, 1024, paddi...mentwise_affine=True)    )  )  (lm_head): Linear(in_features=1024, out_features=96103, bias=False))
    Starting var:.. input = ()
    Starting var:.. kwargs = {'input_ids': None, 'encoder_outputs': {'last_hidden_state': tensor<(2, 6, 1024), float32, cuda:0,... 'use_cache': None, 'return_dict': True, 'output_attentions': False, 'output_hidden_states': False}
                    call       715     def _call_impl(self, *input, **kwargs):
                    line       716         for hook in itertools.chain(
                    line       717                 _global_forward_pre_hooks.values(),
                    line       718                 self._forward_pre_hooks.values()):
                    line       716         for hook in itertools.chain(
                    line  

Modified var:.. next_token_scores = tensor<(2, 96103), float32, cuda:0, grad>
                line      1748             next_token_scores = logits_processor(input_ids, next_token_scores)
    Source path:... generation_logits_process.py
    Starting var:.. self = [<transformers.generation_logits_process.MinLengthLogitsProcessor object>, <transformers.generation_logits_process.ForcedEOSTokenLogitsProcessor object>]
    Starting var:.. input_ids = tensor<(2, 12), int64, cuda:0>
    Starting var:.. scores = tensor<(2, 96103), float32, cuda:0, grad>
    Starting var:.. kwargs = {}
                    call        84     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.FloatTensor:
                    line        85         for processor in self:
    New var:....... processor = <transformers.generation_logits_process.MinLengthLogitsProcessor object>
                    line        86             function_args = inspect.signature(processor.__call__

                        return     551         return torch.floor_divide(self, other)
        Return value:.. tensor<(1, 4), int64, cuda:0>
                    return      27             return f(*args, **kwargs)
    Return value:.. tensor<(1, 4), int64, cuda:0>
Source path:... generation_utils.py
                line      1778             next_tokens = next_tokens % vocab_size
                line      1781             beam_outputs = beam_scorer.process(
                line      1782                 input_ids,
                line      1783                 next_token_scores,
                line      1784                 next_tokens,
                line      1785                 next_indices,
                line      1786                 pad_token_id=pad_token_id,
                line      1787                 eos_token_id=eos_token_id,
                line      1781             beam_outputs = beam_scorer.process(
    Source path:... generation_beam_search.py
    Starting var:.. se

                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), int64, cuda:0>
Source path:... generation_utils.py
                line      1791             beam_idx = beam_outputs["next_beam_indices"]
    Source path:... __init__.py
    Starting var:.. self = {'next_beam_scores': tensor([-11.9152, -13.0620], device='cuda:0', grad_fn=<ViewBackward>), 'next_...kens': tensor([3227,  107], device='cuda:0'), 'next_beam_indices': tensor([1, 1], device='cuda:0')}
    Starting var:.. key = 'next_beam_indices'
                    call      1005     def __getitem__(self, key):
                    line      1006         if key in self.data:
                    line      1007             return self.data[key]
                    return    1007             return self.data[key]
    Return value:.. tensor<(2,), int64, cuda:0>
Source path:... generation_utils.py
                line      1793             input_ids = torch.cat([input_ids[beam_idx, :], beam_next_

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 12, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
   

                    line      1337             reordered_past += (
                    line      1335         for layer_past in past:
                    line      1337             reordered_past += (
                    line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        Starting var:.. .0 = <tuple_iterator object>
        Starting var:.. beam_idx = tensor<(2,), int64, cuda:0>
                        call      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
                        line      1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        New var:....... past_state = tensor<(2, 16, 12, 64), float32, cuda:0, grad>
                        return    1338                 tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
   

                    return    1340         return reordered_past
    Return value:.. ((tensor<(2, 16, 12, 64), float32, cuda:0, grad>, tensor<(2, 16, 12, 64), float32, cuda:0, grad>, ...ad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>, tensor<(2, 16, 6, 64), float32, cuda:0, grad>))
Source path:... generation_utils.py
                line      1803             if beam_scorer.is_done:
    Source path:... generation_beam_search.py
    Starting var:.. self = <transformers.generation_beam_search.BeamSearchScorer object>
                    call       196     def is_done(self) -> bool:
                    line       197         return self._done.all()
                    return     197         return self._done.all()
    Return value:.. tensor<(), bool, cuda:0>
Source path:... generation_utils.py
                line      1804                 break
                line      1809         sequence_outputs = beam_scorer.finalize(
                line      1810             input_ids, beam_sco

                    line       334                 "sequence_scores": best_scores,
                    line       332             {
                    line       331         return UserDict(
        Source path:... __init__.py
        Starting var:.. args = REPR FAILED
        Starting var:.. kwargs = {}
                        call       981     def __init__(*args, **kwargs):
                        line       982         if not args:
                        line       985         self, *args = args
        Modified var:.. args = [{'sequences': tensor<(1, 7), int64, cuda:0>, 'sequence_scores': tensor<(1,), float32, cuda:0>}]
        New var:....... self = REPR FAILED
                        line       986         if len(args) > 1:
                        line       988         if args:
                        line       989             dict = args[0]
        New var:....... dict = {'sequences': tensor<(1, 7), int64, cuda:0>, 'sequence_scores': tensor<(1,), float32, cuda:0>}
         

In [79]:
x=generated['scores'][5]
print(x.max(1))
x.max(1).values / (len(generated['scores']) ** 0.8)

torch.return_types.max(
values=tensor([-2.2412, -4.3012], device='cuda:0', grad_fn=<MaxBackward0>),
indices=tensor([  1, 125], device='cuda:0'))


tensor([-0.3070, -0.5892], device='cuda:0', grad_fn=<DivBackward0>)

In [82]:
transformers.__version__

'4.5.0'

tensor([-10.1877, -11.9152], device='cuda:0', grad_fn=<MaxBackward0>)

In [63]:
tgt_text = pp_tokenizer.batch_decode(generated.sequences, skip_special_tokens=True)
print(pp_tokenizer.tokenize(tgt_text[0]))
print(pp_tokenizer.encode(tgt_text[0]))

['▁Tom', '▁is', '▁my', '▁name', '.']
[3227, 117, 161, 442, 107, 1]


In [42]:
generated

BeamSearchEncoderDecoderOutput(sequences=tensor([[   0, 3227,  117,  161,  442,  107,    1]], device='cuda:0'), sequences_scores=tensor([-0.5345], device='cuda:0'), scores=(tensor([[-1.3439e+01, -7.4424e+00, -1.3252e+01,  ..., -1.3211e+01,
         -1.4638e+01, -1.3537e+01],
        [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
         -1.0000e+09, -1.0000e+09]], device='cuda:0', grad_fn=<AddBackward0>), tensor([[-15.5012,  -9.3269, -15.3885,  ..., -15.2806, -16.5083, -15.2270],
        [-15.4444, -12.1908, -14.3813,  ..., -15.8672, -15.1434, -17.1512]],
       device='cuda:0', grad_fn=<AddBackward0>), tensor([[-16.5940,  -9.6609, -16.3711,  ..., -16.8337, -17.5857, -18.3755],
        [-17.1933, -14.4977, -17.3700,  ..., -16.0694, -18.6977, -20.8976]],
       device='cuda:0', grad_fn=<AddBackward0>), tensor([[-16.4108, -10.8331, -16.1417,  ..., -17.0277, -19.3644, -19.6051],
        [-16.7516, -13.8706, -16.3067,  ..., -18.6184, -19.7848, -18.4986]],
       device='cuda:0

torch.return_types.max(
values=tensor([-10.1877, -11.9152], device='cuda:0', grad_fn=<MaxBackward0>),
indices=tensor([   1, 3227], device='cuda:0'))

In [55]:
np.exp(-7e-01)

0.4965853037914095

BeamSearchEncoderDecoderOutput(sequences=tensor([[  0, 600, 442, 117, 112, 208, 107,   1]], device='cuda:0'), sequences_scores=tensor([-0.7592], device='cuda:0'), scores=(tensor([[-1.3111e+01, -7.3406e+00, -1.2861e+01,  ..., -1.1740e+01,
         -1.4253e+01, -1.4704e+01],
        [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
         -1.0000e+09, -1.0000e+09]], device='cuda:0', grad_fn=<AddBackward0>), tensor([[-16.4262, -13.5976, -16.2702,  ..., -16.8354, -18.7062, -18.8218],
        [-15.1518, -11.4697, -14.6758,  ..., -17.2132, -16.4186, -17.0657]],
       device='cuda:0', grad_fn=<AddBackward0>), tensor([[-14.4046,  -8.1447, -14.3312,  ..., -13.1408, -14.7156, -16.0423],
        [-16.8748, -10.4540, -16.4042,  ..., -18.1801, -20.0519, -18.0784]],
       device='cuda:0', grad_fn=<AddBackward0>), tensor([[-13.6286,  -7.3730, -13.5444,  ..., -17.7188, -14.8766, -14.5535],
        [-19.3907, -12.9705, -19.0755,  ..., -18.0790, -20.3148, -20.5254]],
       device='cuda:0',

In [None]:
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
LogitsProcessorList,
MinLengthLogitsProcessor,
)
logits_processor = LogitsProcessorList([
    MinLengthLogitsProcessor(15, eos_token_id=pp_model.config.eos_token_id),
])
decoder_start_token_id = pp_model.config.bos_token_id

In [None]:
def create_paraphrase_dataset(batch, cname_input, cname_output, num_beams=32,
                              num_return_sequences=32): 
    """Create paraphrases for each example in the batch. Then repeat the other fields 
        so that the resulting dataset has the same length as the number of paraphrases. 
        Key assumption is that the same number of paraphrases is created for each example.
        batch: a dict of examples used by the `map` function from the dataset
        cname_input: What column to create paraphrases of 
        cname_output: What to call the column of paraphrases
        other parameters - passed to get_paraphrases. """
    
    # Generate paraphrases. 
    # This can be later extended to add diversity or so on. 
    #set_trace()
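    # return_probs=True so that get_paraphrases returns (paraphrases, probabilities)
    pp_l,probs = get_paraphrases(batch[cname_input], num_beams=num_beams,
        num_return_sequences=num_return_sequences, return_probs=True)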
    
    # To return paraphrases as a list of lists for batch input (not done here but might need later)
    #     split_into_sublists = lambda l,n: [l[i:i + n] for i in range(0, len(l), n)]
    #     pp_l = split_into_sublists(pp_l, n_seed_seqs)
    batch[cname_output] = pp_l 
    batch["probs"] = probs.to('cpu').numpy()
    
    # Repeat each entry in all other columns `num_return_sequences` times so they are the same length
    # as the paraphrase column
    # Only works if the same number of paraphrases is generated for each phrase. 
    # Else try something like 
        # for o in zip(*batch.values()):
        #     d = dict(zip(batch.keys(), o))
        #     get_paraphrases(batch[cname_input],num_return_sequences=n_seed_seqs,num_beams=n_seed_seqs)
        #     for k,v in d.items(): 
        #       return_d[k] += v if k == 'text' else [v for o in range(n_paraphrases)]
        # return return_d
    return_d = defaultdict(list) 
    repeat_each_item_n_times = lambda l,n: [o for o in l for i in range(n)]
    for k in batch.keys(): 
        if   k == cname_output: return_d[k] = batch[cname_output]
        elif k == "probs"     : return_d[k] = batch["probs"]
        else:                   return_d[k] = repeat_each_item_n_times(batch[k], num_return_sequences)
    return return_d 
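
The repeat step at the end of `create_paraphrase_dataset` is easier to follow on a toy batch. This is a minimal sketch in plain Python, not the notebook's actual pipeline: `expand_batch` and the toy column values are hypothetical, and it assumes exactly `n` paraphrases per example.

```python
from collections import defaultdict

def repeat_each_item_n_times(items, n):
    """['a', 'b'], n=2 -> ['a', 'a', 'b', 'b']"""
    return [o for o in items for _ in range(n)]

def expand_batch(batch, cname_output, n):
    """Repeat every column except the paraphrase column n times,
    so that all columns end up with the same length."""
    out = defaultdict(list)
    for k, v in batch.items():
        out[k] = v if k == cname_output else repeat_each_item_n_times(v, n)
    return dict(out)

# Toy batch: 2 original texts, 2 paraphrases each (already flattened).
batch = {
    "text": ["a", "b"],
    "label": [1, 0],
    "text_pp": ["a1", "a2", "b1", "b2"],
}
expanded = expand_batch(batch, cname_output="text_pp", n=2)
print(expanded)
```

Each original row is duplicated next to its own paraphrases, which is why the approach only works when every example yields the same number of paraphrases.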

In [None]:
# Generate paraphrase dataset
num_beams = 10
num_return_sequences = 3
cname_input = 'text' # which text column to paraphrase
cname_output= cname_input + '_pp'
date = '20210825'
fname = path_cache + '_rt_train'+ date + '_' + str(num_return_sequences)
if os.path.exists(fname):  
    ds_pp = datasets.load_from_disk(fname)
else:
    ds_pp = train.shard(200, 0, contiguous=True)
    # Have to call with batched=True
    # Need to set a batch size otherwise will run out of memory on the GPU card. 
    # 64 seems to work well 
    ds_pp = ds_pp.map(
        lambda x: create_paraphrase_dataset(x, 
            num_beams=num_beams, num_return_sequences=num_return_sequences,
            cname_input=cname_input, cname_output=cname_output),
        batched=True, batch_size=4) 
    ds_pp.save_to_disk(fname)
    gc.collect(); torch.cuda.empty_cache() # free up most of the GPU memory

In [None]:
def get_vm_scores(ds_pp, cname_orig, cname_pp, cname_label='label', 
                  use_metric=False, monitor=False): 
    """Get victim model preds+probs for the paraphrase dataset.
    """
    assert not vm_model.training  # checks that model is in eval mode 
    if use_metric: 
        metric_d = {}
        metric_d['orig'],metric_d['pp'] = load_metric('accuracy'),load_metric('accuracy')
    orig_probs_l,pp_probs_l = [],[]
    if monitor: monitor = Monitor(2)  # track GPU usage and memory
    
    def get_vm_preds(x): 
        """Get predictions for a vector x (here a vector of documents/text). 
        Works for a sentiment-analysis dataset (needs to be adjusted for NLI tasks)"""
        inputs = vm_tokenizer(x, padding=True, truncation=True, return_tensors="pt")
        inputs.to(device)
        outputs = vm_model(**inputs, labels=labels)
        probs = outputs.logits.softmax(1).cpu()
        preds = probs.argmax(1)
        return probs, preds
       
    print("Getting victim model predictions for both original and paraphrased text.")
    dl = DataLoader(ds_pp, batch_size=batch_size, shuffle=False, 
                    num_workers=n_wkrs, pin_memory=True)
    with torch.no_grad():
        for i, data in enumerate(dl): 
            if i % 50 == 0 : print("Now processing batch", i, "out of", len(dl))
            # use the cname_label argument rather than a hard-coded column name
            labels,orig,pp = data[cname_label].to(device),data[cname_orig],data[cname_pp]
            orig_probs, orig_preds = get_vm_preds(orig)            
            pp_probs,   pp_preds   = get_vm_preds(pp)    
            orig_probs_l.append(orig_probs); pp_probs_l.append(pp_probs)
            if use_metric: 
                metric_d['orig'].add_batch(predictions=orig_preds, references=labels)
                metric_d['pp'].add_batch(  predictions=pp_preds,   references=labels)
    if monitor: monitor.stop()
    def list2tensor(l): return torch.cat(l)
    orig_probs_t,pp_probs_t = list2tensor(orig_probs_l),list2tensor(pp_probs_l)
    if use_metric: return orig_probs_t, pp_probs_t, metric_d
    else:          return orig_probs_t, pp_probs_t, None

In [None]:
cname_orig = cname_input
cname_pp = cname_output
cname_label = 'label'
print_metric = True
fname = path_cache + 'results_df_'+ date + "_" + str(num_return_sequences) + ".csv"
if os.path.exists(fname):    results_df = pd.read_csv(fname)
else: 
    #sim_score_t = generate_sim_scores()
    orig_probs_t,pp_probs_t,metric_d = get_vm_scores(ds_pp, cname_orig, 
                                                     cname_pp, cname_label,
                                                     monitor=True, use_metric=print_metric)
    if print_metric: 
        print("orig vm accuracy:",       metric_d['orig'].compute())
        print("paraphrase vm accuracy:", metric_d['pp'].compute())
    vm_orig_scores  = torch.tensor([r[idx] for idx,r in zip(ds_pp[cname_label], orig_probs_t)])
    vm_pp_scores    = torch.tensor([r[idx] for idx,r in zip(ds_pp[cname_label], pp_probs_t)])
    results_df = pd.DataFrame({
                  cname_orig: ds_pp[cname_orig],
                  cname_pp: ds_pp[cname_pp],
   #               'sim_score': sim_score_t,
                  'label_true': ds_pp[cname_label], 
                  'label_vm_orig': orig_probs_t.argmax(1),
                  'label_vm_pp': pp_probs_t.argmax(1),
                  'vm_orig_truelabel': vm_orig_scores,             
                  'vm_pp_truelabel': vm_pp_scores,
                  'vm_truelabel_change': vm_orig_scores - vm_pp_scores,
                  'vm_orig_class0': orig_probs_t[:,0], 
                  'vm_orig_class1': orig_probs_t[:,1], 
                  'vm_pp_class0': pp_probs_t[:,0], 
                  'vm_pp_class1': pp_probs_t[:,1], 
                  })
#    results_df['vm_truelabel_change_X_sim_score'] = results_df['vm_truelabel_change'] * results_df['sim_score']
    results_df.to_csv(fname, index_label = 'idx')
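
The `vm_orig_truelabel`, `vm_pp_truelabel` and `vm_truelabel_change` columns all come from picking, for each row, the victim model's probability of the true label. A minimal sketch with plain lists instead of tensors (the helper name and the numbers are made up for illustration):

```python
def truelabel_scores(labels, probs):
    """For each row, pick the probability assigned to the true label."""
    return [row[y] for y, row in zip(labels, probs)]

labels     = [1, 0]
orig_probs = [[0.1, 0.9], [0.8, 0.2]]   # victim model on original text
pp_probs   = [[0.6, 0.4], [0.7, 0.3]]   # victim model on paraphrases

orig_scores = truelabel_scores(labels, orig_probs)
pp_scores   = truelabel_scores(labels, pp_probs)
# Positive change means the paraphrase reduced confidence in the true label.
change = [o - p for o, p in zip(orig_scores, pp_scores)]
print(orig_scores, pp_scores, change)
```

In this toy data the first paraphrase also flips the predicted class (argmax moves from 1 to 0), while the second only lowers confidence slightly.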

### Training loop 

Training loop pseudocode

The REINFORCE estimator is $$ \nabla_\theta J(\theta) \approx \sum_{s=1}^S  R(x,x'_s) \nabla \log p(x'_s|x,\theta)$$

**Non-batched version (one example), stochastic gradient descent**  
Inputs: train, n_pp=1, vm, ppm, $\alpha = 5 \times 10^{-5}$ (saw this rate for $\alpha$ somewhere)  
Set eval_mode=true for vm, eval_mode = false for ppm  
Freeze all layers of ppm except last 6  
Shuffle training dataset  

Loop: take one row from train
* generate large UNIVERSE list of paraphrases `pp_l` (e.g. 128) from 'text' column using ppm
* extract sequence scores from this list to get a vector of probabilities `pp_probs`
* take `log` of `pp_probs` and store in `pp_logprobs`
* pick S paraphrases from `pp_l` to get `pp_s`. 
* Take the corresponding entries from `pp_logprobs`. 
* for each `pp` (i.e. $x'_s$) in `pp_s`:
    * get reward using `reward_fn_onepp(x, pp)`. $r=R(x,x'_s) = f(x)_y - f(x'_s)_y + \lambda SS(x, x'_s)$
    * get the gradient of its entry in `pp_logprobs` with respect to $\theta$ (call `backward()` on the entry, then read the `.grad` attribute of the ppm parameters) and multiply it by the reward $r$
* Sum up these reward-weighted gradients over the S paraphrases to get `nablaJ`, as in the REINFORCE estimator above
* Update parameters of paraphrase model with $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta)$
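
To sanity-check the estimator and the update rule, here is a toy version in plain Python. It is only a sketch under strong assumptions: a softmax over three fixed candidates stands in for the paraphrase model, so $\nabla_\theta \log p(k|\theta)$ has the closed form $\text{onehot}(k) - \text{softmax}(\theta)$; the sampled indices and rewards are hard-coded, and all function names are hypothetical.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(theta, k):
    """Gradient of log softmax(theta)[k] w.r.t. theta: onehot(k) - softmax(theta)."""
    p = softmax(theta)
    return [(1.0 if j == k else 0.0) - p[j] for j in range(len(theta))]

def reinforce_grad(theta, sampled, rewards):
    """Sum over samples of reward * grad log p(sample | theta)."""
    grad = [0.0] * len(theta)
    for k, r in zip(sampled, rewards):
        g = grad_log_prob(theta, k)
        grad = [a + r * b for a, b in zip(grad, g)]
    return grad

def update(theta, grad, alpha):
    """Gradient ascent step: theta + alpha * grad."""
    return [t + alpha * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0]      # uniform policy over 3 candidates
sampled = [0, 2]             # S = 2 sampled candidates
rewards = [1.0, 0.25]        # their rewards R(x, x'_s)
grad = reinforce_grad(theta, sampled, rewards)
theta_new = update(theta, grad, alpha=0.1)
print(grad, softmax(theta_new))
```

The candidate with the highest reward gains probability and the unsampled candidate loses it. Note that each reward multiplies its own gradient term; the rewards and gradients are not summed separately.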

In [29]:
# Parameters
batch_size = 1
alpha = 5e-5
n_pp = 1
pp_l_sz = 64
S = 20
# Paraphrase settings
num_beams = pp_l_sz
num_beam_groups = 1
diversity_penalty = 0.
temperature = 1.5
### Setup
vm_model.eval()
pp_model.train()
train_small = train.shard(100, 0, contiguous=True)  # small training set for testing purposes
train_small_dl = DataLoader(train_small, batch_size=batch_size, 
                            shuffle=True, num_workers=n_wkrs)
dl = train_small_dl
## Layer freezing 
# Unfreeze last 2 layers of the base model decoder
# Not sure if decoder layer norm should be unfrozen or not, but it appears after the
#   other parameters in the module ordering, so let's include it for now
# Also unfreeze the linear head.  This isn't stored in the base model but rather tacked on top
#   and will be fine-tuned for summarisation. 
layer_list = ['decoder.layers.14', 'decoder.layers.15', 'decoder.layer_norm'] 
for i, (name,param) in enumerate(pp_model.base_model.named_parameters()): 
    if np.any([o in name for o in layer_list]):   param.requires_grad = True
    else:                                         param.requires_grad = False
for param in pp_model.lm_head.parameters():       param.requires_grad = True
# For some reason this seems to be excluded
for param in pp_model.base_model.shared.parameters(): param.requires_grad=False 
### For checking the grad status of the layers
# for i, (name, param) in enumerate(pp_model.base_model.named_parameters()): print(i, name, param.requires_grad)
# for i, (name, param) in enumerate(pp_model.lm_head.named_parameters()):    print(i, name, param.requires_grad)
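
The freezing logic above is plain substring matching on parameter names. A minimal sketch of that selection rule, using made-up parameter names rather than the real output of `named_parameters()`:

```python
def is_trainable(param_name, unfrozen_patterns):
    """A parameter stays trainable if any pattern occurs in its name."""
    return any(p in param_name for p in unfrozen_patterns)

layer_list = ["decoder.layers.14", "decoder.layers.15", "decoder.layer_norm"]

# Hypothetical names, chosen only to exercise the matching rule.
names = [
    "encoder.layers.0.fc1.weight",
    "encoder.layer_norm.weight",
    "decoder.layers.13.fc1.weight",
    "decoder.layers.14.fc1.weight",
    "decoder.layers.15.self_attn.k_proj.bias",
    "decoder.layer_norm.weight",
]
flags = [is_trainable(n, layer_list) for n in names]
for n, f in zip(names, flags):
    print(f"{n:45s} requires_grad={f}")
```

Because this is substring matching, a shorter pattern such as `decoder.layers.1` would also match layers 10 to 15, so the patterns need to be specific.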

In [31]:
for i, data in enumerate(dl): 
    if i % 10 == 0 : print(i, "out of", len(dl))
    label,text = data['label'].to(device),data["text"]
    # generate paraphrases
    pp_l,pp_probs = get_paraphrases(text, num_return_sequences=pp_l_sz, num_beams=num_beams,
                num_beam_groups=num_beam_groups, diversity_penalty=diversity_penalty,
                temperature=temperature, return_probs=True)
    pp_logprobs = torch.log(pp_probs)
    if i ==0: break
    

0 out of 86
> <ipython-input-30-b95c66eb06e5>(7)get_paraphrases()
      5     else we return just a list of pp's. """
      6     set_trace()
----> 7     batch = pp_tokenizer(input_text, truncation=True, padding='longest', return_tensors="pt").to(device)
      8 
      9     translated = pp_model.generate2(**batch, num_beams=num_beams, 

ipdb> n
> <ipython-input-30-b95c66eb06e5>(9)get_paraphrases()
      7     batch = pp_tokenizer(input_text, trun

ipdb> n
> <ipython-input-30-b95c66eb06e5>(17)get_paraphrases()
     15     # We also need to take exp for them to work.
     16     seq_probs = torch.exp(translated.sequences_scores) / sum(torch.exp(translated.sequences_scores))
---> 17     tgt_text = pp_tokenizer.batch_decode(translated.sequences, skip_special_tokens=True)
     18     if return_probs:   return tgt_text, seq_probs

ipdb> q


BdbQuit: 
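
The normalisation seen at line 16 of the debugger listing (exponentiate the sequence scores, then divide by their sum) can be tried without a GPU. This is a rough sketch in plain Python: `normalise_scores` is a hypothetical helper, it assumes the scores are on a log scale, and the two inputs are `sequences_scores` values printed earlier in this notebook (from separate runs, used here purely for illustration). Subtracting the maximum before `exp` does not change the result and only guards against underflow.

```python
import math

def normalise_scores(log_scores):
    """exp(score) / sum(exp(scores)), i.e. a softmax over the returned sequences."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

seq_probs = normalise_scores([-0.5345, -0.7592])
print(seq_probs)
```

The result is a distribution over the returned sequences only, not over every possible paraphrase, so it depends on how many sequences were requested.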

In [None]:
x = undecorated(undecorated(generate1))

In [None]:
{'input_text': ["this is the best american movie about troubled teens since 1998's whatever ."], 'num_return_sequences': 64, 'num_beams': 64, 'return_probs': True, 
 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.5}

In [None]:
param.shape

In [None]:
pp_probs.grad

In [None]:
??pp_model.base_model.__repr__

In [None]:
for i,mod in enumerate(pp_model.base_model.named_modules()): 
    print(i,mod)

In [None]:
x =mod[1]

In [None]:
list(x.parameters())

In [None]:
# backward() can only be called without arguments on a scalar, so reduce first
pp_logprobs.sum().backward()

In [None]:
x = pp_model.generate.__closure__[0].cell_contents

In [None]:
x.cell_contents

In [None]:
??pp_model.greedy_search

In [None]:
x= pp_model.generate

In [None]:
??x.__self__

In [None]:
??undecorated

In [None]:
?train_small_dl

In [None]:
train_dl

In [None]:
def reward_fn_onepp(x, pp): 
    """Reward for one (original x, paraphrase pp) pair. Not implemented yet."""
    pass
    
    
def reward_fn_onerow(x): 
    """x is one row of a pandas df"""
    text,pp,lbl_change = x['text'],x['text_pp'],x['vm_truelabel_change']
    return lbl_change

def reward_fn_batch(): 
    pass 

def loss_fn(): 
    pass 
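
One way the per-paraphrase reward $R(x,x'_s) = f(x)_y - f(x'_s)_y + \lambda SS(x, x'_s)$ from the pseudocode could be filled in is sketched below. It assumes the victim-model probabilities and the similarity score have already been computed as plain floats; the function name, argument names and the example numbers are all hypothetical.

```python
def reward_one_pp(vm_orig_truelabel, vm_pp_truelabel, sim_score, lam=1.0):
    """Drop in true-label probability plus lam times semantic similarity."""
    return (vm_orig_truelabel - vm_pp_truelabel) + lam * sim_score

# Victim model confidence drops from 0.9 to 0.4, similarity is 0.8.
r = reward_one_pp(0.9, 0.4, sim_score=0.8, lam=0.5)
print(r)
```

With `lam=0` this reduces to the `vm_truelabel_change` column that `reward_fn_onerow` returns.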