In this notebook I explore some properties of the generation process for paraphrase generation and answer some questions I had along the way. 
Note: some of the asserts aren't up to date.

## Setup 

In [None]:
%load_ext autoreload
%autoreload 2
%load_ext line_profiler

In [None]:
## Imports and environment variables 
import os
os.environ["TOKENIZERS_PARALLELISM"]  = "true"  # set to false if not working

# Core imports 
import torch, numpy as np, pandas as pd, gc, sys, logging, warnings
from torch.utils.data import DataLoader, RandomSampler
from torch.distributions import Categorical
from datasets import load_dataset, load_metric, load_from_disk, DatasetDict
from transformers import (AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, 
                          AutoTokenizer, AdamW, SchedulerType, get_scheduler)
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pytorch_cos_sim
from collections import defaultdict
from accelerate import Accelerator, notebook_launcher
from cachetools import cached, LRUCache
from types import MethodType
from timeit import default_timer as timer
import utils; from utils import *   # local script 
from tqdm.auto import tqdm
import itertools
import copy 
import wandb
from pprint import pprint
from undecorated import undecorated


# Dev imports (not needed for final script)
import seaborn as sns
from IPython.display import Markdown
from pprint import pprint
from IPython.core.debugger import set_trace
from GPUtil import showUtilization
import torchsnooper

ModuleNotFoundError: No module named 'utils'

In [None]:
logging.basicConfig(format='%(message)s') 
logger = logging.getLogger("main_logger")
logger.setLevel(logging.INFO)

In [None]:
# options for the pp_model 
# 1. tuner007/pegasus_paraphrase
# 2. tdopierre/ProtAugment-ParaphraseGenerator
# 3. eugenesiow/bart-paraphrase

## PEGASUS model
pp_name = "tuner007/pegasus_paraphrase"
pp_tokenizer_pegasus = AutoTokenizer.from_pretrained(pp_name)
pp_model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(pp_name, local_files_only=True)
generate_with_grad = undecorated(pp_model_pegasus.generate)
pp_model_pegasus.generate_with_grad = MethodType(generate_with_grad, pp_model_pegasus)

## BART model
pp_name = "eugenesiow/bart-paraphrase"
pp_tokenizer_bart = AutoTokenizer.from_pretrained(pp_name)
pp_model_bart = AutoModelForSeq2SeqLM.from_pretrained(pp_name, local_files_only=True)
generate_with_grad = undecorated(pp_model_bart.generate)
pp_model_bart.generate_with_grad = MethodType(generate_with_grad, pp_model_bart)


## Functions 

In [None]:
def get_pp_logp(translated): 
    """log(p(pp|orig)) basically.
    works for greedy search, will need tweaking for other types probably"""
    seq_without_first_tkn = translated.sequences[:, 1:]
    attention_mask = pp_model._prepare_attention_mask_for_generation(
        seq_without_first_tkn, pp_tokenizer.pad_token_id, pp_tokenizer.eos_token_id
    )
    scores_log_softmax = torch.stack(translated.scores, 1).log_softmax(2)
    seq_token_log_probs = torch.gather(scores_log_softmax,2,seq_without_first_tkn[:,:,None]).squeeze(-1)
    del scores_log_softmax
    # account for nan values by setting them to 0 (maybe a bit of a hack)
    # will also handle inf and -inf values too by default
    seq_token_log_probs = torch.nan_to_num(seq_token_log_probs)
    # account for the padding tokens at the end 
    seq_token_log_probs = seq_token_log_probs * attention_mask
    seq_log_prob = seq_token_log_probs.sum(-1)
#     if np.any(np.isnan(seq_log_prob.detach().cpu()).tolist()): 
#         warnings.warn(f"Warning: NaNs detected in pp_logp calculations.\n seq_token_log_probs: {seq_token_log_probs}")
    return seq_log_prob

def get_tokens_from_token_ids_batch(tokenizer, ids_batch):
    l = []
    for i in range(ids_batch.shape[0]): 
        l.append(tokenizer.convert_ids_to_tokens(ids_batch[i,:]))
    return l

def get_start_end_special_token_ids(tokenizer): 
    """The token id's that input/output sequences should start and end with"""
    d = {}
    # Use the tokenizer passed in, not the global pp_tokenizer
    if tokenizer.name_or_path in ['eugenesiow/bart-paraphrase', 'tdopierre/ProtAugment-ParaphraseGenerator']: 
        d["input_start_id"] =  tokenizer.bos_token_id
        d["input_end_id"] =  [tokenizer.pad_token_id, tokenizer.eos_token_id]
        d["output_start_id"] =  tokenizer.eos_token_id 
        d["output_end_id"] =  [tokenizer.pad_token_id, tokenizer.eos_token_id]
    elif tokenizer.name_or_path == "tuner007/pegasus_paraphrase":
        d["input_start_id"] =  None
        d["input_end_id"] =  [tokenizer.pad_token_id, tokenizer.eos_token_id] 
        d["output_start_id"] =  tokenizer.pad_token_id
        d["output_end_id"] =  [tokenizer.pad_token_id, tokenizer.eos_token_id]
    else: 
        raise Exception("unrecognised tokenizer")
    return d

def check_no_nans_or_infs(x):
    assert torch.all(~torch.isnan(x))
    assert torch.all(~torch.isneginf(x))
    assert torch.all(~torch.isposinf(x))

def assert_start_and_end_tokens_are_correct(tokenizer, orig_token_ids, pp_token_ids):
    """Make sure input sequences (orig) and output sequences (pp) start and end with the 
    right special tokens (depends on tokenizer)"""
    start_end_token_d = get_start_end_special_token_ids(tokenizer)
    
    # Input
    if start_end_token_d['input_start_id'] is not None: 
        assert torch.all(orig_token_ids[:,0] == start_end_token_d['input_start_id'])
    # can probs rewrite this to make it nicer but it's fine for now
    assert torch.all(torch.logical_or(orig_token_ids[:,-1] == start_end_token_d['input_end_id'][0], 
                                      orig_token_ids[:,-1] == start_end_token_d['input_end_id'][1]))
    
    # Output
    assert torch.all(pp_token_ids[:,0] == start_end_token_d['output_start_id'])
    assert torch.all(torch.logical_or(pp_token_ids[:,-1] == start_end_token_d['output_end_id'][0], 
                                      pp_token_ids[:,-1] == start_end_token_d['output_end_id'][1]))

def check_scores_for_posinf_nan_and_unexpected_neginf(scores_stacked): 
    """Check we don't have any positive inf or nan, and that all negative inf values are expected"""
    assert torch.all(~torch.isposinf(scores_stacked))
    assert torch.all(~torch.isnan(scores_stacked))

    # We expect to see negative inf for the eos_token when we have not reached min_length. 
    # But we shouldn't expect it for any other tokens
    idx_neginf = torch.nonzero(torch.isneginf(scores_stacked))
    assert torch.all(idx_neginf[:,2] == pp_tokenizer.eos_token_id)
    # Rough check that all idx before min_length are -inf for all elements in batch
    # We do min_length - 1 because sequences are allowed to have length min_length so that idx 
    # shouldn't be set to -inf
    # Not a 100% test but very likely to identify
    assert idx_neginf.shape[0] == (pp_model_params["min_length"] -1) * batch_size  
    # Check that no elements after min_length are -inf
    assert torch.all(idx_neginf[:,1] < (pp_model_params["min_length"] -1 ))

def check_scores_log_softmax_sums_and_shape(scores_log_softmax):
    sums = scores_log_softmax.exp().sum(2)
    # check that the axis is right
    # we want to sum over token probabilities at each generation step, so we 
    # should end up with a shape [batch_size, generated_length - 1]
    assert sums.shape[0] == batch_size  
    assert sums.shape[1] == generated_length - 1
    # check that the token probabilities sum to 1 at every generation step
    assert torch.allclose(sums, torch.ones_like(sums), atol = 1e-4)
    
def check_seq_token_log_prob_values_are_correct(): 
    """Just enumerates and checks values
    Quite slow for large batches so run as a test rather than an assert in every batch. 
    """
    l = []
    for i_ex in range(batch_size):
        for i_step in range(generated_length - 1):
            i_tkn = seq_without_first_tkn[i_ex][i_step].item()
            l.append(scores_log_softmax[i_ex,i_step, i_tkn] == seq_token_log_probs[i_ex,i_step])
    assert all(l)
    
def pretty_print_pp_batch_and_next_token_probabilities(): 
    """Goes through each paraphrase and shows at each timestep the next likely tokens. 
    Only will work for greedy search. 
    e.g. [
    "<pad> ['▁My, 0.289', '▁I, 0.261', '▁Hello, 0.07'] | Entropy: 4.23 ",
     "<pad> My ['▁name, 0.935', '▁Name, 0.005', 'name, 0.002'] | Entropy: 0.80 "
    ]
    """
    str_d = defaultdict(list)
    for i_tkn in range(0, generated_length-1): 
        ids = pp_output.sequences[:, :(i_tkn+1)]
        partial_pp = pp_tokenizer.batch_decode(ids)
        kth_ids,kth_probs = tkn_kmaxidx[:, i_tkn, :], tkn_kmaxprob[:, i_tkn, :]
        kth_tkns = get_tokens_from_token_ids_batch(pp_tokenizer, kth_ids)

        # enumerates examples in batch
        z = zip(partial_pp, kth_tkns, kth_probs, ent.detach())
        for i_ex, (ex_sen, ex_next_tkns, ex_next_probs, ex_e) in enumerate(z): 
            # Form nice formatted string mixing together tokens and probabilities
            tkn_tuples_l = [(tkn, round_t(prob,3)) for tkn, prob in zip(ex_next_tkns, ex_next_probs)]
            tkn_str = ['%s, %s' % t for t in tkn_tuples_l]
            # Add to dict of lists and add on entropy term. 
            str_d[i_ex].append(f"{ex_sen} {tkn_str} | Entropy: {ex_e[i_tkn]:.2f} ")

    for v in str_d.values():  pprint(v)
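
The idea behind `get_pp_logp` can be sketched without any model: take a log-softmax of the logits at each step, pick out the log probability of the token that was actually generated, skip padding positions and sum. The block below is a dependency-free toy using plain Python lists, with made-up logits and a hypothetical `pad_id`; it is not the code path the notebook runs. It also shows one plausible reason for calling `nan_to_num` before multiplying by the mask: if a masked position happens to have log probability `-inf`, then `-inf * 0` is `nan` rather than 0.

```python
import math

NEG_INF = float("-inf")

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(step_logits, token_ids, pad_id):
    """Sum the log probabilities of the generated tokens, skipping padding."""
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        if tok == pad_id:
            continue  # same effect as an attention mask of 0 at this position
        total += log_softmax(logits)[tok]
    return total

# Toy example: vocab of 3 tokens, id 0 plays the role of the pad token (assumption).
step_logits = [
    [0.0, 1.0, 2.0],      # step 1: token 2 is generated
    [0.0, 2.0, 1.0],      # step 2: token 1 is generated
    [NEG_INF, 0.0, 0.0],  # step 3: padding, whose logit is -inf here
]
token_ids = [2, 1, 0]
seq_logp = sequence_log_prob(step_logits, token_ids, pad_id=0)
print(f"sequence log prob: {seq_logp:.4f}")

# Masking by multiplication breaks if the masked log probability is -inf
masked_naively = log_softmax(step_logits[2])[0] * 0
print("naive mask of -inf gives nan:", math.isnan(masked_naively))
```

Skipping padded positions outright avoids the problem in this toy version; with tensors, cleaning the values first and then multiplying by the mask achieves the same thing.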

## Greedy search for paraphrase generation 

In [None]:
# INPUT AND PARAMETERS
orig_l = [
    "Look at the bird over there with the red and yellow stripes.", 
    "That girl has her hair dyed yellow - how interesting."
]
pp_model_params = {
    "num_beams": 1, 
    "num_return_sequences": 1, 
    "num_beam_groups": 1, 
    "diversity_penalty": 0.,   # must be a float
    "temperature": 1.5,
    "length_penalty" : 1,
    "min_length" : 5
}

## Select which model/tokenizer to use
pp_tokenizer = pp_tokenizer_pegasus
pp_model = pp_model_pegasus

In [None]:
#### TOKENIZER INFORMATION #####
logger.info("\n############################## TOKENIZER ########################################\n")
logger.info(f"We are using the {pp_tokenizer.name_or_path} tokenizer")
logger.info(f"Tokenizer has these special tokens:{pp_tokenizer.all_special_tokens}")
logger.info(f"The bos token is {pp_tokenizer.bos_token} and has id {pp_tokenizer.bos_token_id}")
logger.info(f"The eos token is {pp_tokenizer.eos_token} and has id {pp_tokenizer.eos_token_id}")
logger.info(f"The pad token is {pp_tokenizer.pad_token} and has id {pp_tokenizer.pad_token_id}")
logger.info(f"The unk token is {pp_tokenizer.unk_token} and has id {pp_tokenizer.unk_token_id}")


#### INPUT #####
batch_size = len(orig_l)
orig_tokens = pp_tokenizer(orig_l, return_tensors='pt', padding=True, pad_to_multiple_of=4)
input_length = orig_tokens['input_ids'].size()[1]
orig_l_tokens_list = get_tokens_from_token_ids_batch(pp_tokenizer, orig_tokens['input_ids'])

logger.info("\n############################### INPUT #######################################\n")
logger.info(f"Original text: {orig_l}")
logger.info(f"Batch size is: {batch_size}")
logger.info(f"This is tokenised to get a dict with keys {orig_tokens.keys()} which should be input_ids and attention_mask ")
logger.info(f"The input_ids look like this: {orig_tokens['input_ids']}")
logger.info(f"The tokens are: {orig_l_tokens_list}")
logger.info(f"This has shape {orig_tokens['input_ids'].shape} or [batch_size, input_length], which also\
 might be padded to hit a padding multiple (so input_length is not just the longest example length in the batch).")
logger.info(f"Input length is: {input_length}")
logger.info(f"The attention_mask looks like this: {orig_tokens['attention_mask']}")
logger.info(f"This has shape {orig_tokens['attention_mask'].shape} or [batch_size, input_length]")

##### PARAPHRASE #####
pp_output = pp_model.generate_with_grad(**orig_tokens, **pp_model_params, do_sample=False, 
                                      return_dict_in_generate=True,
                                      output_scores=True,
                                    remove_invalid_values=False)
generated_length = pp_output.sequences.shape[1]
pp_l             = pp_tokenizer.batch_decode(pp_output.sequences, skip_special_tokens=True)
pp_l_with_tokens = pp_tokenizer.batch_decode(pp_output.sequences, skip_special_tokens=False)
pp_l_tokens_list = get_tokens_from_token_ids_batch(pp_tokenizer, pp_output.sequences)

assert_start_and_end_tokens_are_correct(pp_tokenizer, orig_token_ids=orig_tokens['input_ids'],
                                        pp_token_ids=pp_output.sequences)

logger.info("\n#################################### PARAPHRASES ##################################\n")
logger.info(f"Paraphrases: {pp_l}")
logger.info(f"Output has keys {pp_output.keys()}")
logger.info(f"Paraphrases with special tokens: {pp_l_with_tokens}")
logger.info(f"List of pp tokens:{pp_l_tokens_list}")
logger.info(f"Paraphrase token sequences: {pp_output.sequences}")
logger.info(f"Shape of pp token sequences:{pp_output.sequences.shape} or [batch_size, generated_length]")
logger.info(f"Generated length: {generated_length}")

###### SCORES AND PROBABILITIES ########
scores_stacked = torch.stack(pp_output.scores, 1)
# The second argument to stack (i.e. dim) determines which axis the tensors are stacked along. 
# It determines the axis that becomes generated_length - 1
# dim=0 gives shape [generated_length-1, batch_size, vocab_size]
# dim=1 gives shape [batch_size, generated_length-1, vocab_size]
# dim=2 gives shape [batch_size, vocab_size, generated_length-1]
# Our scores_stacked is stacked on dim 1 so it should be second 
assert scores_stacked.shape == torch.Size([batch_size, (generated_length - 1), pp_tokenizer.vocab_size])
#check_scores_for_posinf_nan_and_unexpected_neginf(scores_stacked)


# These scores are logits 
# see some of the docs on this page https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/output#transformers.modeling_outputs.Seq2SeqModelOutput
# so we have to take a softmax over them to get probabilities,
# but a regular softmax can run into numerical errors (e.g. underflow for very unlikely tokens),
# so instead we take log_softmax and work with log probabilities
scores_log_softmax = torch.log_softmax(scores_stacked, 2)
check_scores_log_softmax_sums_and_shape(scores_log_softmax)


####### SEQUENCE PROBABILITIES #######
# We select the token probability corresponding to each token 
# However because the scores represent transitions we need to remove the first token from each 
# sequence to match them up. 
seq_without_first_tkn = pp_output.sequences[:,1:]
assert seq_without_first_tkn.shape == torch.Size([batch_size, generated_length - 1])
# Now select prob corresponding to each token
seq_token_log_probs = torch.gather(scores_log_softmax,2,seq_without_first_tkn[:,:,None]).squeeze(-1)
assert seq_token_log_probs.shape == seq_without_first_tkn.shape  # probs should be 1-1 with the filtered tkns
#check_no_nans_or_infs(seq_token_log_probs)
# Check that the last token probability corresponds to a possible end token 
output_end_ids = get_start_end_special_token_ids(pp_tokenizer)['output_end_id']
assert all([o in scores_log_softmax[:, -1, output_end_ids] for o in seq_token_log_probs[:,-1]])
check_seq_token_log_prob_values_are_correct()

# The attention mask has 1 everywhere except for where padding tokens occur, where it has 0. 
# It is used to filter out padding tokens from the sequence probability. Otherwise the sequence 
# probability would depend on how many padding tokens there are and the probability of generating them, 
# which (a) we don't want and (b) isn't a correct probability anyway 
attention_mask = pp_model._prepare_attention_mask_for_generation(
    seq_without_first_tkn, pp_tokenizer.pad_token_id, pp_tokenizer.eos_token_id
)
seq_token_log_probs = seq_token_log_probs * attention_mask
# check attention mask only has 0 for padding tokens and not eos tokens or anything else
assert torch.all(seq_without_first_tkn[attention_mask == 0] == pp_tokenizer.pad_token_id)
assert seq_token_log_probs.shape == attention_mask.shape == seq_without_first_tkn.shape
#assert torch.all(seq_token_log_probs  > -10)  # we shouldn't be picking extremely rare tokens
seq_log_prob = seq_token_log_probs.sum(-1)
assert seq_log_prob.shape == torch.Size([batch_size])
#check_no_nans_or_infs(seq_log_prob)

logger.info("\n##########################  SCORES AND PROBABILITIES ####################################\n")
logger.info(f"Scores is a tuple of length {len(pp_output.scores)} which is one less than the generated_length, or \
the number of tokens in the pp token sequences (this has shape {pp_output.sequences.shape})")
logger.info(f"Each score is a tensor of shape {pp_output.scores[0].shape} or [batch_size, vocab_size]")
#logger.info(f"Full shape:{[o.shape for o in pp_output.scores]}")
logger.info(f"We stack them to get a tensor of shape {scores_stacked.shape} or [batch_size, generated_length - 1, vocab_size]")
logger.info(f"Scores are really logits so we have to take softmax to get probabilities. ")
logger.info("But if we take regular softmax then we run into numerical errors so we take log softmax")
logger.info("We then select the token probability corresponding to each token and sum them to get the log \
probability of the sequence.")

############# ENTROPY AND TOKEN PROBABILITIES ####################
ent = Categorical(logits = scores_stacked).entropy()
assert ent.shape == torch.Size([batch_size, generated_length - 1])
scores_softmax = scores_log_softmax.exp()
k=3
tkn_kmaxprob, tkn_kmaxidx = torch.topk(scores_softmax, k=k, dim=2)
tkn_kmaxprob = tkn_kmaxprob.detach()  # log these 
# The third dimension indexes top1, top2, top3 etc 
assert tkn_kmaxprob[:,:,0].shape == torch.Size([batch_size, generated_length - 1])
# I'd naively expect True everywhere for tkn_kmaxidx[:,:,0] == pp_output.sequences[:, 1:] but it turns 
# out this is not the case because padding tokens seem to have prob 0 and eos tokens are outputted 
# instead by the token generation process and then later replaced by pad

# Show how each paraphrase is formed, token by token.


logger.info("\n########################## ENTROPY AND TOKEN PROBABILITIES ####################################\n")
logger.info(f"Originals: {orig_l}")
pretty_print_pp_batch_and_next_token_probabilities()


############################## TOKENIZER ########################################

We are using the tuner007/pegasus_paraphrase tokenizer
Tokenizer has these special tokens:['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<unk_65>', '<unk_66>', '<u

["<pad> ['▁The, 0.401', '▁There, 0.271', '▁Look, 0.05'] | Entropy: 3.27 ",
 "<pad> The ['▁bird, 0.868', '▁birds, 0.015', '▁red, 0.014'] | Entropy: 1.38 ",
 "<pad> The bird ['▁has, 0.596', '▁is, 0.16', '▁with, 0.086'] | Entropy: 2.28 ",
 "<pad> The bird has ['▁red, 0.479', '▁stripes, 0.084', '▁both, 0.065'] | "
 'Entropy: 3.13 ',
 "<pad> The bird has red ['▁and, 0.928', '▁stripes, 0.001', '▁AND, 0.001'] | "
 'Entropy: 1.02 ',
 "<pad> The bird has red and ['▁yellow, 0.887', 'yellow, 0.005', '▁Yellow, "
 "0.004'] | Entropy: 1.49 ",
 "<pad> The bird has red and yellow ['▁stripes, 0.88', '▁striped, 0.011', "
 "'▁striping, 0.005'] | Entropy: 1.26 ",
 "<pad> The bird has red and yellow stripes ['., 0.679', '▁on, 0.119', '</s>, "
 "0.058'] | Entropy: 2.13 ",
 "<pad> The bird has red and yellow stripes. ['</s>, 0.911', '., 0.0', '▁, "
 "0.0'] | Entropy: 1.32 "]
["<pad> ['▁That, 0.325', '▁The, 0.155', '▁How, 0.085'] | Entropy: 4.14 ",
 "<pad> That ['▁girl, 0.885', '▁person, 0.004', '▁is, 0.003']

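The per-step output above pairs the top-k next-token probabilities with the entropy of the full next-token distribution. As a rough, dependency-free sketch of what those two numbers mean, the block below uses toy logits and a hypothetical four-token vocabulary rather than the model's real output. Entropy is measured in nats, in line with `Categorical.entropy()`: a flat distribution has high entropy, and a distribution dominated by one token has low entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_k(probs, tokens, k=3):
    """Return the k most likely (token, probability) pairs."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

tokens = ["▁The", "▁There", "▁Look", "▁A"]  # hypothetical vocabulary
flat = softmax([0.0, 0.0, 0.0, 0.0])         # model has no preference
peaked = softmax([5.0, 0.0, 0.0, 0.0])       # model strongly prefers one token

for name, probs in [("flat", flat), ("peaked", peaked)]:
    best = [(t, round(p, 3)) for t, p in top_k(probs, tokens)]
    print(f"{name}: {best} | Entropy: {entropy(probs):.2f}")
```

This is why a step can show a fairly high entropy even when its top token has a large probability: the remaining mass may be spread over a long tail of tokens.
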
## Beam search for paraphrase generation 

In [None]:
orig_l = [
    "Look! A small dog. Isn't it cute?", 
    "Far out, if I have to write another sentence...it'll be bad."
]
n_seq = 5
pp_model_params = {
    "num_beams": 5, 
    "num_return_sequences": n_seq, 
    "num_beam_groups": 1, 
    "diversity_penalty": 0.,  
    "temperature": 1.5,
    "length_penalty" : 0,
    "min_length" : 5
}
batch_size = len(orig_l)
logger.info(f"Input: {orig_l}")
orig_tokens = pp_tokenizer(orig_l, return_tensors='pt', padding=True, pad_to_multiple_of=4)
input_length = orig_tokens['input_ids'].shape[1]
pp_output = pp_model.generate_with_grad(**orig_tokens, **pp_model_params, do_sample=False, 
                                      return_dict_in_generate=True,
                                      output_scores=True,
                                    remove_invalid_values=False)
logger.info(f"Input: {orig_l}")


Input: ["Look! A small dog. Isn't it cute?", "Far out, if I have to write another sentence...it'll be bad."]
Input: ["Look! A small dog. Isn't it cute?", "Far out, if I have to write another sentence...it'll be bad."]


In [None]:
generated_length = pp_output.sequences.shape[1]
assert pp_output.sequences.shape == torch.Size([batch_size * n_seq, generated_length ])
pp_l_with_tokens = pp_tokenizer.batch_decode(pp_output.sequences, skip_special_tokens=False)
print("Paraphrases:")
pprint(pp_l_with_tokens)
logger.info(f"Output has keys {pp_output.keys()}")
assert pp_output.sequences_scores.shape == torch.Size([batch_size * n_seq])
assert len(pp_output.scores) == generated_length  # different to greedy search: not generated_length - 1 
assert pp_output.scores[0].shape == torch.Size([batch_size * n_seq, pp_tokenizer.vocab_size])

# scores_stacked = torch.stack(pp_output.scores, 1)
# assert scores_stacked.shape == torch.Size([batch_size * n_seq, generated_length, pp_tokenizer.vocab_size])
transition_logprobs = pp_model.compute_transition_beam_scores(
    pp_output.sequences, pp_output.scores, pp_output.beam_indices, eos_token_id = pp_tokenizer.eos_token_id)
pp_logprobs = transition_logprobs.sum(-1)
assert pp_logprobs.shape == torch.Size([batch_size * n_seq])
print("logprobs (has grad):", pp_logprobs)
baseline_seq_prob = np.log(1/pp_tokenizer.vocab_size)* generated_length
baseline_short_seq_prob = np.log(1/pp_tokenizer.vocab_size)* np.floor(generated_length /2 )
baseline_high_prob_seq =  np.log(1000/pp_tokenizer.vocab_size)* generated_length
baseline_high_prob_short_seq =  np.log(1000/pp_tokenizer.vocab_size)* np.floor(generated_length /2 ) 
print("sequences scores (no grad):", pp_output.sequences_scores)

print("baseline prob (selecting token with prob 1/vocab_size every input):",baseline_seq_prob )
print("baseline short seq prob:",baseline_short_seq_prob )
print("baseline high prob seq", baseline_high_prob_seq)
print("baseline high prob short seq", baseline_high_prob_short_seq)

Output has keys odict_keys(['sequences', 'sequences_scores', 'scores', 'beam_indices'])


Paraphrases:
['<pad> There is a small dog.</s><pad><pad><pad><pad><pad><pad><pad>',
 '<pad> A small dog.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 "<pad> Isn't the dog cute?</s><pad><pad><pad><pad><pad><pad>",
 '<pad> A small dog is cute.</s><pad><pad><pad><pad><pad><pad><pad>',
 "<pad> Isn't it cute?</s><pad><pad><pad><pad><pad><pad><pad>",
 '<pad> It will be bad if I have to write another sentence.</s><pad>',
 '<pad> I will be bad if I have to write another sentence.</s><pad>',
 "<pad> It'll be bad if I have to write another sentence.</s>",
 '<pad> If I have to write another sentence, it will be bad.</s>',
 '<pad> It will be bad if I have to write a new sentence.</s>']
logprobs (has grad): tensor([ -73.2909,  -62.3275,  -88.2333,  -83.2530,  -79.5468, -134.9267,
        -124.6652, -144.2008, -140.2372, -131.8344], grad_fn=<SumBackward1>)
sequences scores (no grad): tensor([-3.4677, -3.5051, -3.6984, -4.4127, -4.5878, -5.1774, -5.8165, -6.0948,
        -6.4582, -6.7450])
bas
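
The baselines computed in the cell above are simple reference points: the log probability of a sequence in which every token is chosen with the same fixed probability. Here is a minimal sketch of that arithmetic, using a made-up vocabulary size and sequence length rather than the notebook's actual values; the function name and the `boost` parameter are hypothetical.

```python
import math

def constant_prob_baseline(vocab_size, length, boost=1):
    """Log probability of `length` tokens, each chosen with probability boost / vocab_size.

    boost=1 is the uniform baseline; boost=1000 imitates the "high prob" baseline.
    """
    return length * math.log(boost / vocab_size)

vocab_size = 50_000  # assumption, for illustration only
length = 16          # assumption, for illustration only

uniform = constant_prob_baseline(vocab_size, length)
uniform_short = constant_prob_baseline(vocab_size, length // 2)
high_prob = constant_prob_baseline(vocab_size, length, boost=1000)

print("uniform baseline:          ", round(uniform, 2))
print("uniform baseline, half len:", round(uniform_short, 2))
print("high-prob baseline:        ", round(high_prob, 2))
```

Because every extra token adds another negative term, shorter sequences always get a higher (less negative) log probability under this kind of baseline, which is worth keeping in mind when comparing paraphrases of different lengths.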

In [None]:
def compare_scores_and_transition_probs(): 
    # This code indicates that the transition probabilities are not the same as the scores. 
    # It seems they are the same sometimes but other times they are not. 
    for ex in range(batch_size * n_seq): 
        for step in range(generated_length):
            tkn_id = pp_output.sequences[ex][step].item()
            score = pp_output.scores[step][ex][tkn_id]
            prob = transition_logprobs[ex][step]
            print("example", ex, "step", step, "tkn_id", tkn_id, 
                  "score", round_t(score), "transition_logprob", round_t(prob))
#compare_scores_and_transition_probs()

## Tokenizer differences 

### Types 

Both the "eugenesiow/bart-paraphrase" and "tdopierre/ProtAugment-ParaphraseGenerator" models use a BART tokenizer of type BartTokenizerFast. According to the docs, its implementation is identical to RobertaTokenizerFast, which in turn was derived from the GPT-2 tokenizer. It uses byte-level Byte Pair Encoding.  

The "tuner007/pegasus_paraphrase" model uses a Pegasus tokenizer of type PegasusTokenizerFast. This uses Unigram. 

### Tokenization differences

#### Spaces 

The BART tokenizer treats a preceding space as part of the token (a bit like SentencePiece), so a word is encoded differently depending on whether it is at the beginning of the sentence (without a space) or not:

In [None]:
tokens = pp_tokenizer_bart(['hello there',' hello there'], return_tensors='pt')
get_tokens_from_token_ids_batch(pp_tokenizer_bart, tokens['input_ids'])

[['<s>', 'hello', 'Ġthere', '</s>'], ['<s>', 'Ġhello', 'Ġthere', '</s>']]

The Pegasus tokenizer doesn't do this: both inputs give the same tokens.

In [None]:
tokens = pp_tokenizer_pegasus(['hello there',' hello there'], return_tensors='pt')
get_tokens_from_token_ids_batch(pp_tokenizer_pegasus, tokens['input_ids'])

[['▁hello', '▁there', '</s>'], ['▁hello', '▁there', '</s>']]

### Representing tokens 

The tokenizers represent tokens differently.  
The BART models use Ġ to mark a token that starts a new word (i.e. one preceded by a space). Their generated tokens look like `['<s>', 'Hello', 'Ġmy', 'Ġname', 'Ġis', 'Ġz', 'f', 'ld', 'lf', 'o', 'q', 'd', '</s>', '<pad>', '<pad>', '<pad>']`  
The Pegasus model uses ▁ (U+2581, which looks like an underscore) to mark a token that starts a new word. Its generated tokens look like `['▁Hello', '▁my', '▁name', '▁is', '▁z', 'fl', 'dl', 'fo', 'q', 'd', '</s>', '<pad>']`
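
To make the two marker conventions concrete, here is a toy word-level imitation in plain Python. It only mimics how word boundaries are marked: it does no subword splitting and is not how either tokenizer is implemented. The function names are made up for this illustration.

```python
import re

def bart_style_marks(text):
    """Byte-level BPE style: a space before a word becomes a leading 'Ġ'."""
    return [piece.replace(" ", "Ġ") for piece in re.findall(r" ?\S+", text)]

def pegasus_style_marks(text):
    """SentencePiece style: whitespace is normalised and every word gets a leading '▁'."""
    return ["▁" + word for word in text.split()]

def detokenize(tokens):
    """Turn either kind of marker back into spaces."""
    return "".join(tokens).replace("Ġ", " ").replace("▁", " ").strip()

for text in ["hello there", " hello there"]:
    print(repr(text), bart_style_marks(text), pegasus_style_marks(text))

print(detokenize(["▁hello", "▁there"]))
```

Apart from the special tokens, the toy functions reproduce the pattern in the tokenizer outputs shown earlier: the BART-style marks change when the input has a leading space, while the Pegasus-style marks do not.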

### Special tokens

#### BART

In [None]:
logger.info(f"Tokenizer has these special tokens:{pp_tokenizer_bart.all_special_tokens}")
logger.info(f"The bos token is {pp_tokenizer_bart.bos_token} and has id {pp_tokenizer_bart.bos_token_id}")
logger.info(f"The eos token is {pp_tokenizer_bart.eos_token} and has id {pp_tokenizer_bart.eos_token_id}")
logger.info(f"The pad token is {pp_tokenizer_bart.pad_token} and has id {pp_tokenizer_bart.pad_token_id}")
logger.info(f"The unk token is {pp_tokenizer_bart.unk_token} and has id {pp_tokenizer_bart.unk_token_id}")

Tokenizer has these special tokens:['<s>', '</s>', '<unk>', '<pad>', '<mask>']
The bos token is <s> and has id 0
The eos token is </s> and has id 2
The pad token is <pad> and has id 1
The unk token is <unk> and has id 3


#### PEGASUS

In [None]:
logger.info(f"Tokenizer has these special tokens:{pp_tokenizer_pegasus.all_special_tokens}")
logger.info(f"The bos token is {pp_tokenizer_pegasus.bos_token} and has id {pp_tokenizer_pegasus.bos_token_id}")
logger.info(f"The eos token is {pp_tokenizer_pegasus.eos_token} and has id {pp_tokenizer_pegasus.eos_token_id}")
logger.info(f"The pad token is {pp_tokenizer_pegasus.pad_token} and has id {pp_tokenizer_pegasus.pad_token_id}")
logger.info(f"The unk token is {pp_tokenizer_pegasus.unk_token} and has id {pp_tokenizer_pegasus.unk_token_id}")

Tokenizer has these special tokens:['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<unk_65>', '<unk_66>', '<unk_67>', '<unk_68>', '<unk_69>', '<unk_70>', '<unk_71>', '<unk_72>', '<unk_73>', '<unk_74>', '<unk_75>', '<unk_76>', '<unk_77>', '<unk_78>'

### Special token usage with input and output sequences 

#### BART 

BART uses this format:
```
single sequence: <s> X </s>
pair of sequences: <s> A </s></s> B </s>
```

#### PEGASUS

PEGASUS uses this format:
```
single sequence: X </s>
pair of sequences: A B </s> (not intended use)
```

The BOS token is never used.
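
As a small illustration of the BART and PEGASUS layouts, the sketch below builds them from lists of token strings. The helper names are hypothetical; in practice the tokenizers add these special tokens themselves.

```python
def bart_layout(a, b=None):
    """<s> A </s> for one sequence, <s> A </s></s> B </s> for a pair."""
    seq = ["<s>"] + list(a) + ["</s>"]
    if b is not None:
        seq += ["</s>"] + list(b) + ["</s>"]
    return seq

def pegasus_layout(a, b=None):
    """X </s> for one sequence, A B </s> for a pair; no BOS token is added."""
    return list(a) + (list(b) if b is not None else []) + ["</s>"]

print(bart_layout(["hello", "Ġthere"]))
print(bart_layout(["hello"], ["Ġthere"]))
print(pegasus_layout(["▁hello", "▁there"]))
print(pegasus_layout(["▁hello"], ["▁there"]))
```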

#### Differences 

The tokenizers also use different special tokens to start and end input sequences and generated output sequences. Here is a quick summary:

In [None]:
tokens_d = {
    "bart": {
        "special_tokens": pp_tokenizer_bart.all_special_tokens, 
        "input_start": pp_tokenizer_bart.bos_token,
        "input_end": [pp_tokenizer_bart.pad_token, pp_tokenizer_bart.eos_token], 
        "output_start": pp_tokenizer_bart.eos_token, 
        "output_end": [pp_tokenizer_bart.pad_token, pp_tokenizer_bart.eos_token]
    }, 
    "pegasus": {
        "special_tokens": pp_tokenizer_pegasus.all_special_tokens,
        "input_start": None,
        "input_end": [pp_tokenizer_pegasus.pad_token, pp_tokenizer_pegasus.eos_token], 
        "output_start": pp_tokenizer_pegasus.pad_token,  
        "output_end": [pp_tokenizer_pegasus.pad_token, pp_tokenizer_pegasus.eos_token], 
    }
}
tokens_d

{'bart': {'special_tokens': ['<s>', '</s>', '<unk>', '<pad>', '<mask>'],
  'input_start': '<s>',
  'input_end': ['<pad>', '</s>'],
  'output_start': '</s>',
  'output_end': ['<pad>', '</s>']},
 'pegasus': {'special_tokens': ['</s>',
   '<unk>',
   '<pad>',
   '<mask_2>',
   '<mask_1>',
   '<unk_2>',
   '<unk_3>',
   '<unk_4>',
   '<unk_5>',
   '<unk_6>',
   '<unk_7>',
   '<unk_8>',
   '<unk_9>',
   '<unk_10>',
   '<unk_11>',
   '<unk_12>',
   '<unk_13>',
   '<unk_14>',
   '<unk_15>',
   '<unk_16>',
   '<unk_17>',
   '<unk_18>',
   '<unk_19>',
   '<unk_20>',
   '<unk_21>',
   '<unk_22>',
   '<unk_23>',
   '<unk_24>',
   '<unk_25>',
   '<unk_26>',
   '<unk_27>',
   '<unk_28>',
   '<unk_29>',
   '<unk_30>',
   '<unk_31>',
   '<unk_32>',
   '<unk_33>',
   '<unk_34>',
   '<unk_35>',
   '<unk_36>',
   '<unk_37>',
   '<unk_38>',
   '<unk_39>',
   '<unk_40>',
   '<unk_41>',
   '<unk_42>',
   '<unk_43>',
   '<unk_44>',
   '<unk_45>',
   '<unk_46>',
   '<unk_47>',
   '<unk_48>',
   '<unk_49>',
 

### Token indexing

In [None]:
def print_tokens_from_ids(tokenizer, start_id=100, end_id=200):
    ids = list(range(start_id,end_id))
    print(*list(zip(ids, tokenizer.convert_ids_to_tokens(ids))))

#### BART 

Having a look at the tokens makes me suspect that they are indexed in whatever order they are encountered in the source text the tokenizer was trained on. The order looks roughly like a frequency ranking of English tokens, but some tokens are definitely out of order. 

The first few ids are reserved for special tokens, and the other low ids (e.g. up to 100) are pretty common suffixes and words.

In [None]:
print_tokens_from_ids(pp_tokenizer_bart, 0,50)

(0, '<s>') (1, '<pad>') (2, '</s>') (3, '<unk>') (4, '.') (5, 'Ġthe') (6, ',') (7, 'Ġto') (8, 'Ġand') (9, 'Ġof') (10, 'Ġa') (11, 'Ġin') (12, '-') (13, 'Ġfor') (14, 'Ġthat') (15, 'Ġon') (16, 'Ġis') (17, 'âĢ') (18, "'s") (19, 'Ġwith') (20, 'ĠThe') (21, 'Ġwas') (22, 'Ġ"') (23, 'Ġat') (24, 'Ġit') (25, 'Ġas') (26, 'Ġsaid') (27, 'Ļ') (28, 'Ġbe') (29, 's') (30, 'Ġby') (31, 'Ġfrom') (32, 'Ġare') (33, 'Ġhave') (34, 'Ġhas') (35, ':') (36, 'Ġ(') (37, 'Ġhe') (38, 'ĠI') (39, 'Ġhis') (40, 'Ġwill') (41, 'Ġan') (42, 'Ġthis') (43, ')') (44, 'ĠâĢ') (45, 'Ġnot') (46, 'Ŀ') (47, 'Ġyou') (48, 'ľ') (49, 'Ġtheir')


Looking at 100 to 200 you can see some words (e.g. Trump at 140, or 2017 at 193) that aren't common enough to be that high. This makes me suspect that words are in encounter order in the text. 

In [None]:
print_tokens_from_ids(pp_tokenizer_bart, 100,200)

(100, 'I') (101, 'Ġlike') (102, 'a') (103, 'Ġsome') (104, 'S') (105, 'Ã«') (106, 'Ġthem') (107, 'Ġyears') (108, "'") (109, 'Ġdo') (110, 'Ġyour') (111, 'Ġ-') (112, 'Ġ1') (113, '"') (114, 'Ġif') (115, 'Ġcould') (116, '?') (117, 'Ġno') (118, 'i') (119, 'm') (120, 'Ġget') (121, 'ĠU') (122, 'Ġnow') (123, 'Ġhim') (124, 'Ġback') (125, 'ĠBut') (126, 'ĠâĢĵ') (127, 'Ġmy') (128, "Ġ'") (129, 'Ġonly') (130, 'Ġthree') (131, ';') (132, 'Ġ2') (133, 'The') (134, '1') (135, 'Ġpercent') (136, 'Ġagainst') (137, 'Ġbefore') (138, 'Ġcompany') (139, 'o') (140, 'ĠTrump') (141, 'Ġhow') (142, 'Ġbecause') (143, 'Ġany') (144, 'Ġmost') (145, 'Ġbeing') (146, 'Ġmake') (147, 'Ġwhere') (148, 'Ġduring') (149, 'Ġthrough') (150, 'Ġwhile') (151, '000') (152, 'ĠThis') (153, 'Ġmillion') (154, 'ing') (155, 'Ġ3') (156, 'Ġmade') (157, 'Ġwell') (158, 'Ġ10') (159, 'Ġdown') (160, 'Ġoff') (161, 'Ġsays') (162, 'Ġme') (163, 'ĠB') (164, 'Ġgoing') (165, 'Ġteam') (166, 'ĠWe') (167, 'Ġthose') (168, 'Ġgovernment') (169, 'Ġway') (170, 'We'

Tokens towards the end are gibberish or misspellings encountered in the input. The fifth-last token is labelled `<|endoftext|>` and I don't know what that is. Then there are a few tokens like "madeupword0001". The last token is the mask token, and token indices after that return None. 

In [None]:
print_tokens_from_ids(pp_tokenizer_bart, pp_tokenizer_bart.vocab_size-20, pp_tokenizer_bart.vocab_size+10)

(50245, 'ĠSetTextColor') (50246, 'Ġfixme') (50247, 'ĠãĤµãĥ¼ãĥĨãĤ£') (50248, 'ĠãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³') (50249, 'ĠÂłĠÂłĠÂłĠÂłĠÂłĠÂłĠÂłĠÂł') (50250, 'ĠAdinida') (50251, 'ItemTracker') (50252, 'ĠDevOnline') (50253, 'ĠÂłÂł') (50254, '<?') (50255, '*=-') (50256, 'ÃĽÃĽ') (50257, 'ĠEntityItem') (50258, 'EngineDebug') (50259, 'ĠstrutConnector') (50260, '<|endoftext|>') (50261, 'madeupword0000') (50262, 'madeupword0001') (50263, 'madeupword0002') (50264, '<mask>') (50265, None) (50266, None) (50267, None) (50268, None) (50269, None) (50270, None) (50271, None) (50272, None) (50273, None) (50274, None)


#### PEGASUS

Special tokens make up the first hundred or so. After that there's a token \<n> that seems like some new line thing. 

In [None]:
print_tokens_from_ids(pp_tokenizer_pegasus, 0,120)

(0, '<pad>') (1, '</s>') (2, '<mask_1>') (3, '<mask_2>') (4, '<unk_2>') (5, '<unk_3>') (6, '<unk_4>') (7, '<unk_5>') (8, '<unk_6>') (9, '<unk_7>') (10, '<unk_8>') (11, '<unk_9>') (12, '<unk_10>') (13, '<unk_11>') (14, '<unk_12>') (15, '<unk_13>') (16, '<unk_14>') (17, '<unk_15>') (18, '<unk_16>') (19, '<unk_17>') (20, '<unk_18>') (21, '<unk_19>') (22, '<unk_20>') (23, '<unk_21>') (24, '<unk_22>') (25, '<unk_23>') (26, '<unk_24>') (27, '<unk_25>') (28, '<unk_26>') (29, '<unk_27>') (30, '<unk_28>') (31, '<unk_29>') (32, '<unk_30>') (33, '<unk_31>') (34, '<unk_32>') (35, '<unk_33>') (36, '<unk_34>') (37, '<unk_35>') (38, '<unk_36>') (39, '<unk_37>') (40, '<unk_38>') (41, '<unk_39>') (42, '<unk_40>') (43, '<unk_41>') (44, '<unk_42>') (45, '<unk_43>') (46, '<unk_44>') (47, '<unk_45>') (48, '<unk_46>') (49, '<unk_47>') (50, '<unk_48>') (51, '<unk_49>') (52, '<unk_50>') (53, '<unk_51>') (54, '<unk_52>') (55, '<unk_53>') (56, '<unk_54>') (57, '<unk_55>') (58, '<unk_56>') (59, '<unk_57>') (60, 

Unlike the BART models I can believe that these tokens are in order of frequency. I can't see anything that is obviously out of place. 

In [None]:
print_tokens_from_ids(pp_tokenizer_pegasus, 120,250)

(120, '▁that') (121, '-') (122, '▁with') (123, '’') (124, '▁on') (125, '▁I') (126, '▁it') (127, '▁are') (128, '▁your') (129, '▁be') (130, '▁as') (131, "'") (132, '▁or') (133, '▁have') (134, '▁at') (135, '▁from') (136, '▁this') (137, '▁can') (138, '▁will') (139, '▁The') (140, '▁was') (141, '▁by') (142, '▁an') (143, '▁(') (144, 't') (145, '▁we') (146, '▁not') (147, '!') (148, '▁has') (149, '▁all') (150, '▁our') (151, ':') (152, '?') (153, '▁their') (154, '▁more') (155, '▁but') (156, '▁one') (157, '▁they') (158, ')') (159, 'The') (160, '▁about') (161, '▁my') (162, '▁which') (163, '▁also') (164, '▁up') (165, '▁out') (166, '▁time') (167, '▁so') (168, '▁It') (169, '▁his') (170, '▁who') (171, '▁do') (172, '▁like') (173, '▁when') (174, '▁been') (175, '▁if') (176, '▁other') (177, '▁new') (178, '▁he') (179, '▁get') (180, '▁what') (181, '▁some') (182, '▁This') (183, '▁them') (184, '▁We') (185, '▁“') (186, '▁there') (187, 'I') (188, '▁just') (189, '▁any') (190, '▁into') (191, '/') (192, '▁would') 

There's nothing special at the end, just looks like isolated tokens and None values after the tokens finish. It's also worth noting that the Pegasus model has ~96100 tokens which is way more than the ~50270 of the BART models (almost double). 

In [None]:
print_tokens_from_ids(pp_tokenizer_pegasus, pp_tokenizer_pegasus.vocab_size-20, pp_tokenizer_pegasus.vocab_size+10)

(96083, '0:09') (96084, '▁Pietra') (96085, 'webhost') (96086, '8:48') (96087, '▁psychoanalytic') (96088, '1335') (96089, '▁Happies') (96090, '▁Tamale') (96091, '▁Seidel') (96092, '▁Muppet') (96093, '▁Quota') (96094, '▁polyphenol') (96095, 'utyrate') (96096, 'saari') (96097, '▁WASTE') (96098, '▁$6,500') (96099, '.06%') (96100, 'constitutional') (96101, '▁$6.4') (96102, 'ospermum') (96103, None) (96104, None) (96105, None) (96106, None) (96107, None) (96108, None) (96109, None) (96110, None) (96111, None) (96112, None)


## Questions 

### When does the model generate padding tokens?

For both models padding tokens are generated after the EOS token. Additionally, for Pegasus, generated text starts with the padding token. 

### Why do generated paraphrases start with the EOS token? 

This is only the case for BART models. Pegasus models start generated paraphrases with a padding token instead. 

I don't know why the BOS token isn't used for this. Pegasus has an open issue [here](https://github.com/huggingface/transformers/issues/12474). 

Whatever the reason, you should just stick with the default, because that is what the preprocessing does and you will get the best results that way. 

### Does p(PAD) =1 after an eos token?

For both BART and Pegasus models it appears that the probability of outputting a pad token is actually zero at all timesteps. Instead the model outputs the eos token over and over, so there must be some post-processing that replaces the repeated eos tokens with padding tokens. 
Example code: 

In [None]:
print(round_t(scores_softmax[:,:,pp_tokenizer.eos_token_id]))
print(round_t(scores_softmax[:,:,pp_tokenizer.pad_token_id]))

[[0.   0.   0.   0.   0.   0.   0.   0.06 0.91]
 [0.   0.   0.   0.   0.   0.   0.   0.04 0.9 ]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]]


What is interesting is that probability is still assigned to tokens other than eos and pad after an eos token has been output. Again there must be some kind of post-processing that takes care of this situation, because I haven't really seen it in the wild. 

Some models (e.g. GPT2) don't even have a PAD token. Instead they use the eos token on repeat. See this [issue](https://github.com/huggingface/transformers/issues/8452#issuecomment-739008168). What is confusing is seeing this behaviour with models that have a padding token. 
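
A rough sketch of what that post-processing might look like: once a sequence has produced EOS, every later position is overwritten with the pad id, regardless of what the scores say. The function name and the toy ids are made up for illustration, and this is not the library's actual code.

```python
def pad_after_eos(step_tokens, eos_id, pad_id):
    """Overwrite every token that follows the first EOS with the pad id.

    `step_tokens` is the list of argmax tokens for one sequence, one per
    generation step (a hypothetical input, not a transformers object).
    """
    out, finished = [], False
    for tok in step_tokens:
        # Once the sequence is finished, the model's choice is ignored.
        out.append(pad_id if finished else tok)
        if tok == eos_id:
            finished = True
    return out

# Toy ids: 1 = EOS, 0 = PAD. The raw argmax keeps repeating EOS after the end.
raw = [57, 90, 12, 1, 1, 1]
print(pad_after_eos(raw, eos_id=1, pad_id=0))
```

This would explain seeing p(EOS) close to 1 and p(PAD) equal to 0 in the scores while the decoded sequences still end in padding.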

### Do rows always sum to 1 when looking at token generation scores? 


Yes, they should. I put in an assert to check this. 
If you have a nan or an inf then they won't sum to 1. To confirm this:

In [None]:
print(torch.isnan(torch.sum(torch.tensor([1,2,3, torch.nan]))))
print(torch.isinf(torch.sum(torch.tensor([1,2,3, torch.inf]))))

tensor(True)
tensor(True)
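
The kind of check I have in mind, written here as a dependency-free sketch on nested lists rather than tensors. The function name and the tolerance are arbitrary choices of mine.

```python
import math

def rows_are_distributions(rows, tol=1e-3):
    """Return True if every row is finite, non-negative and sums to 1 within `tol`."""
    for row in rows:
        if any(math.isnan(p) or math.isinf(p) for p in row):
            return False          # nan/inf poisons the sum, so fail early
        if any(p < 0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True

good = [[0.1271, 0.0115, 0.6297, 0.2316], [0.25, 0.25, 0.25, 0.25]]
bad = [[0.5, 0.5, float("nan")]]
print(rows_are_distributions(good), rows_are_distributions(bad))
```

Checking for nan and inf explicitly gives a clearer failure than waiting for the sum comparison to fail.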


### Does first row sum to 0? (the one corresponding to the startoff token)


There are no token scores that correspond to the first token (usually a bos or pad token). The scores are a tuple of length (`generated_length - 1`), so there shouldn't really be a "zero" row. 
I remember seeing something like this at some point so I'll keep an eye out for it. 

### How are logits containing nan or inf transformed with softmax and log_softmax?

We can explore this through some code examples.

#### Vanilla case  
First we look at the case without any nan or inf. 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000, -1.0000,  3.0000,  2.0000])
tensor([0.1271, 0.0115, 0.6297, 0.2316])
tensor([-2.0625, -4.4625, -0.4625, -1.4625])


The softmax values are interpreted as probabilities, and the log softmax is just the log of the probabilities, done for numerical stability. We can just take exponents to return to probabilities if needed. 

In [None]:
print(torch.log_softmax(logits,0).exp())

tensor([0.1271, 0.0115, 0.6297, 0.2316])


#### Positive inf

Now let's see what happens if we introduce a positive inf. 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2, torch.inf])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000, -1.0000,  3.0000,  2.0000,     inf])
tensor([nan, nan, nan, nan, nan])
tensor([nan, nan, nan, nan, nan])


We get nan values in the softmax and log_softmax.  So if you see nans in the softmax, remember that an inf in the scores is one reason why it may happen.  

This is interesting because if we just assume inf is a large positive number, we'd expect a softmax with basically a 1 and all zeros, and a log softmax of a 0 and a lot of negatives. We can try it here: 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2, 10000000000])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000e+00, -1.0000e+00,  3.0000e+00,  2.0000e+00,  1.0000e+10])
tensor([0., 0., 0., 0., 1.])
tensor([-1.0000e+10, -1.0000e+10, -1.0000e+10, -1.0000e+10,  0.0000e+00])


Basically what we get. So this indicates that if we get a positive inf we might be able to mitigate this problem by clipping it to some kind of maximum value. 
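
A pure-Python sketch of why the inf case turns into nan. The usual numerically stable softmax subtracts the maximum before exponentiating, and `inf - inf` is nan. I am assuming torch runs into the same thing internally. The `clip` helper and its limit of 1e4 are arbitrary choices for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # inf - inf = nan happens here
    total = sum(exps)
    return [e / total for e in exps]

def clip(logits, limit=1e4):
    """Clamp every logit into [-limit, limit]; `limit` is an arbitrary choice."""
    return [max(-limit, min(limit, x)) for x in logits]

logits = [1.4, -1.0, 3.0, 2.0, math.inf]
print(softmax(logits))        # every entry is nan
print(softmax(clip(logits)))  # all mass on the clipped entry
```

A single nan in the exponentials makes the normalising sum nan, which is why every entry is affected and not just the one that was inf.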

#### Negative inf 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2, -torch.inf])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000, -1.0000,  3.0000,  2.0000,    -inf])
tensor([0.1271, 0.0115, 0.6297, 0.2316, 0.0000])
tensor([-2.0625, -4.4625, -0.4625, -1.4625,    -inf])


Negative inf behaves a bit differently. The softmax is unaffected and basically just assigns a prob of 0 to the corresponding entry. The log softmax carries the `-inf` through. 

Again clipping the -inf to a large negative value can mitigate this problem somewhat: 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2, -10000000])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000e+00, -1.0000e+00,  3.0000e+00,  2.0000e+00, -1.0000e+07])
tensor([0.1271, 0.0115, 0.6297, 0.2316, 0.0000])
tensor([-2.0625e+00, -4.4625e+00, -4.6253e-01, -1.4625e+00, -1.0000e+07])


#### nan values 

In [None]:
logits = torch.tensor([1.4, -1, 3, 2, torch.nan])
print(logits)
print(torch.softmax(logits,0))
print(torch.log_softmax(logits,0))

tensor([ 1.4000, -1.0000,  3.0000,  2.0000,     nan])
tensor([nan, nan, nan, nan, nan])
tensor([nan, nan, nan, nan, nan])


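A nan in the logits propagates and affects the entire softmax and log_softmax tensors, because both normalise over the whole vector. Torch basically gives up and says "no idea how to deal with this". 

Other torch functions behave in a similar way: a reduction such as `sum` returns nan, while elementwise operations keep the nan in its own position, e.g.  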

In [None]:
print(torch.sum(logits,0))
print(torch.divide(logits,0.2))
print(logits + logits)

tensor(nan)
tensor([ 7., -5., 15., 10., nan])
tensor([ 2.8000, -2.0000,  6.0000,  4.0000,     nan])


### How do you interpret token entropy?

The token scores (when stacked) are a tensor of dimensions (batch_size, generated_length - 1, vocab_size). We take softmax to get a tensor of probability distributions across all possible tokens. We can also calculate the entropy of each of these probability distributions. 

Entropy is a measure of how "peaky" or "flat" a probability distribution is. It is the expected value of the self-information of an event, which is basically a measure of how "surprised" you would be if that event occurred. 
If we have a discrete random variable $X$ with probability distribution $P(x)$, entropy is given by $$H(X) = \mathbb{E}_{X \sim P} [I(X)] = -\mathbb{E}_{X \sim P} [\log P(X)]$$ which in practice is calculated as $$H(X) = -\sum_{x} P(x) \log P(x)$$ where the sum runs over all possible values of $X$.

The lowest value of entropy is 0, which is when you have $P(x)=1$ for some event. The value printed below is about 1e-7 rather than exactly 0, which I assume comes from `Categorical` clamping probabilities slightly away from 0 and 1 before taking logs. 

In [None]:
Categorical(probs = torch.tensor([1,0,0,0])).entropy()

tensor(1.1921e-07)

High values of entropy occur when the probability distribution is very flat. 

In [None]:
print(Categorical(probs = torch.tensor([0.20,0.50,0.10,0.20])).entropy())  # spikier, lower entropy
print(Categorical(probs = torch.tensor([0.25,0.25,0.25,0.25])).entropy())  # flatter, higher entropy

tensor(1.2206)
tensor(1.3863)
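
For reference, the formula is simple enough to compute by hand. This is a small pure-Python sketch (the function name is mine) that should agree with the `Categorical` results above to within rounding.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute 0 by convention."""
    return sum(-p * math.log(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # a certain outcome
print(entropy([0.20, 0.50, 0.10, 0.20]))  # spikier
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform, equals log(4)
```

The values are in nats because the natural log is used; dividing by `math.log(2)` would give bits.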


In terms of tokens, we can show two realistic distributions below. The first has two likely tokens, one somewhat likely, and the rest unlikely. The second has many more likely tokens. 

In [None]:
l1 = [0.001,0.001,0.001,0.001,0.5,0.001,0.4,0.001,0.001,0.001,0.001,0.001,0.09]
l2 = [0.1,0.1,0.1,0.1,0.025,0.1,0.1,0.1,0.1,0.1,0.025,0.025,0.025]
print(Categorical(probs = torch.tensor(l1)).entropy())  # spikier, lower entropy
print(Categorical(probs = torch.tensor(l2)).entropy())  # flatter, higher entropy

tensor(0.9989)
tensor(2.4412)


Entropy has no fixed maximum in general, but for a distribution over $V$ outcomes the maximum is $\log V$, reached by the uniform distribution. For tokens the ceiling is therefore set by the vocab size. Here are the maximums for a few vocab sizes:

In [None]:
v_size = 1000
print(Categorical(probs = torch.tensor([1/v_size for i in range(v_size)])).entropy())  # v small vocab
v_size = 10000
print(Categorical(probs = torch.tensor([1/v_size for i in range(v_size)])).entropy())  # small vocab
v_size = 50000
print(Categorical(probs = torch.tensor([1/v_size for i in range(v_size)])).entropy())  # ~BART vocab
v_size = 100000
print(Categorical(probs = torch.tensor([1/v_size for i in range(v_size)])).entropy())  # ~PEGASUS vocab

tensor(6.9078)
tensor(9.2103)
tensor(10.8198)
tensor(11.5129)


These can be a bit hard to interpret so maybe you should also look at some other token-level stats, like max_prob, second_max_prob, third_max_prob, mean, variance or other things like that. 
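
One way to make these numbers easier to read, as a sketch: divide the entropy by its maximum `log(vocab_size)` so it lands between 0 and 1, and report the top few probabilities alongside it. The function name, the dictionary keys and the choice of `k=3` are all mine.

```python
import math

def token_stats(probs, k=3):
    """Summary statistics for one next-token distribution (a list of probabilities)."""
    h = sum(-p * math.log(p) for p in probs if p > 0)
    return {
        "entropy": h,
        # 0 = all mass on one token, 1 = uniform over the vocabulary
        "normalised_entropy": h / math.log(len(probs)),
        "top_k_probs": sorted(probs, reverse=True)[:k],
    }

# Same 13-token example as l1 above
l1 = [0.001] * 4 + [0.5, 0.001, 0.4] + [0.001] * 5 + [0.09]
print(token_stats(l1))
```

The normalised value is comparable between BART and PEGASUS even though their vocab sizes differ.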

Some other things to note. First, you still get an entropy value if your probability dist sums to more than 1: `Categorical` rescales `probs` to sum to 1 rather than raising an error, so check this yourself before computing entropy. Second, if you have nan or inf in the probability values then you will get an error. This is true whether you pass `probs` or `logits` to `Categorical`. 

In [None]:
print(Categorical(probs = torch.tensor([0.5,0.25,0.25,0.25])).entropy())       # sums to more than 1, gives result
#print(Categorical(probs = torch.tensor([0.5,0.25,0.25,torch.nan])).entropy())  # throws error
#print(Categorical(probs = torch.tensor([0.5,0.25,0.25,torch.inf])).entropy())  # throws error
#print(Categorical(logits = torch.tensor([0.5,0.25,0.25,torch.nan])).entropy())  # throws error

tensor(1.3322)


### What does it mean to take average of token entropy? 

Average entropy over tokens will depend on how many padding tokens are added. This in turn depends on how many examples there are in the batch.   
The entropy of padding tokens appears to be quite high based on some quick experiments.  
It might or might not depend on sentence length.   
It could be useful as a broad measure to see if the model is getting more "peaky" in selecting tokens at each time step.   
It might be useful when tracking a paraphrase over time and seeing its variations?   
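
To take the padding out of the average, one option is to average only over positions where the attention mask is 1. A sketch on plain lists, with made-up entropy values and a function name of my own:

```python
def mean_entropy(entropies, attention_mask):
    """Average per-token entropy over real tokens only (mask value 1).

    Both arguments are lists of rows, one row per sequence, mirroring a
    (batch_size, generated_length - 1) tensor. Assumes every row has at
    least one real token.
    """
    means = []
    for ent_row, mask_row in zip(entropies, attention_mask):
        kept = [e for e, m in zip(ent_row, mask_row) if m == 1]
        means.append(sum(kept) / len(kept))
    return means

# Made-up entropies: the second sequence ends early and is padded.
entropies = [[0.9, 1.2, 0.3, 0.1], [0.8, 0.2, 5.0, 5.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
print(mean_entropy(entropies, mask))
```

With the mask applied, the average no longer depends on how long the other sequences in the batch happen to be.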

### What do you calculate KL divergence of? 

When the parameters of the policy network get updated, you look at how the output distribution changed and work out the KL divergence between the old and new prob dists. 

This might be hard for your case: 
* If you consider full sequences as the action space, these have such low probabilities that it's hard to see how they really change. You can't get a dist.
* You can get a dist over individual tokens, but then you need the same paraphrase before and after for that to make any sense whatsoever. 
  * You could extract input_ids for a sequence and feed it into pp_model.generate_beam_transition_probabilities() to get the prob of that specific sentence. Then you could maybe use this to keep token probs the same? 
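
For the token-level option, the calculation at a single position would look something like this. It is only a sketch: the two distributions are invented, and it assumes both were computed for the same prefix so that they are comparable.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    Returns inf if q assigns zero probability to something p can produce.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Hypothetical next-token distributions at one position, before and after an update.
old = [0.70, 0.20, 0.10]
new = [0.60, 0.25, 0.15]
print(kl_divergence(old, new))
```

Summing or averaging this over the positions of a fixed paraphrase would give one number per example.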
  
  

### How do you get nan and inf introduced into token scores?


My understanding is that when you set a `min_length` parameter for the generated sequences, the `eos_token_id` slot gets -inf for the first (min_length - 1) steps. This stops the EOS token from appearing and ending the sequence early. 
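
A sketch of that mechanism as I understand it, not the library's actual implementation. The function name and the toy vocabulary are made up.

```python
import math

def block_eos_until_min_length(logits, cur_len, min_length, eos_id):
    """Return a copy of `logits` with the EOS slot set to -inf while cur_len < min_length."""
    out = list(logits)
    if cur_len < min_length:
        out[eos_id] = -math.inf   # softmax will assign this token probability 0
    return out

step_logits = [0.3, 2.1, -0.5, 1.0]   # toy vocabulary of 4 tokens, EOS at index 1
print(block_eos_until_min_length(step_logits, cur_len=2, min_length=5, eos_id=1))
print(block_eos_until_min_length(step_logits, cur_len=5, min_length=5, eos_id=1))
```

In the first call EOS is blocked even though it has the highest logit; in the second the logits pass through unchanged.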

I also seem to get -inf for padding, eos tokens and sometimes just random other tokens as well. There doesn't seem to be much rhyme or reason to it. 

There is an option to `generate()`, `remove_invalid_values`, that handles these values automatically. I haven't been using it though. 

You might also get -inf when setting other parameters to the `generate()` function (e.g. bad_words_ids or something similar).

Nan can come in if you multiply -inf by 0: the result is nan under the IEEE 754 standard. 
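
This matters when masking log-probabilities by multiplying with a 0/1 mask. A tiny illustration with made-up numbers:

```python
import math

log_probs = [-0.5, -1.2, -math.inf]   # last position is padding with a -inf score
mask = [1, 1, 0]

# Multiplying by the mask turns -inf * 0 into nan, which then poisons the sum.
by_multiplication = sum(lp * m for lp, m in zip(log_probs, mask))

# Selecting with a condition never touches the masked value.
by_selection = sum(lp for lp, m in zip(log_probs, mask) if m == 1)

print(by_multiplication, by_selection)
```

In torch the analogous move would be to select with something like `torch.where` or `masked_fill` before summing, rather than multiplying by the mask.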

### How big is the action space?

Each different paraphrase can be considered a different action. This makes the action space a discrete action space rather than continuous. An initial estimate of its size is on the order of `vocab_size ^ generated_length`, but the vast majority of these sequences aren't valid English sentences and have a very low probability of being obtained. In addition the actions available are heavily dependent on the state (i.e. the original paraphrase). It is still a very large space. 
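
To put a number on "very large", here is the upper bound on a log scale. The vocab sizes are read off the tokenizer dumps earlier in the notebook, and the generated length of 20 is an arbitrary choice.

```python
import math

def log10_num_sequences(vocab_size, generated_length):
    """log10 of vocab_size ** generated_length, computed without building the huge integer."""
    return generated_length * math.log10(vocab_size)

for name, vocab_size in [("BART", 50265), ("PEGASUS", 96103)]:
    print(name, round(log10_num_sequences(vocab_size, 20), 1))
```

Both come out at roughly 10^94 or more possible sequences, which is why the probability of any single full sequence is so small.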

### Can you log the top X most probable sentences and the probability of obtaining them? 


Currently you get your action using greedy search which picks the most likely token at each point. But this might not give the most likely sequence: you could have a high probability token in two timesteps that requires choosing the second-most-probable token to get there. So you can’t say that the greedy-search paraphrase is the most probable sequence. 

Beam search usually finds outputs with probability greater than or equal to that of greedy search, but this is not guaranteed, and it is not guaranteed to find the most likely output either. It might still serve as an estimator of it. 
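
A toy example of greedy search missing the most probable sequence. The two-step "language model" and all its probabilities are invented.

```python
from itertools import product

# Toy two-step model over the vocabulary {A, B}; all numbers are invented.
first = {"A": 0.6, "B": 0.4}
second = {"A": {"A": 0.55, "B": 0.45},   # P(second token | first token = A)
          "B": {"A": 0.90, "B": 0.10}}   # P(second token | first token = B)

def seq_prob(seq):
    """Probability of a two-token sequence under the toy model."""
    return first[seq[0]] * second[seq[0]][seq[1]]

def greedy():
    """Pick the most likely token at each step."""
    t1 = max(first, key=first.get)
    t2 = max(second[t1], key=second[t1].get)
    return (t1, t2)

def exhaustive():
    """Score every possible sequence and return the best one."""
    return max(product("AB", repeat=2), key=seq_prob)

print(greedy(), seq_prob(greedy()))
print(exhaustive(), seq_prob(exhaustive()))
```

Greedy commits to A at the first step, but the best sequence starts with the less likely B. Exhaustive search is only feasible here because the space has four sequences.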


The most probable returned sentences will depend heavily on the hyperparameters given to the generate function, such as temperature, diversity_penalty, num_beam_groups, min_length, length_penalty and so on. You can only ever get the most likely tokens for a set of hyperparameters. 

So what you can do is generate a bunch of output with beam search and then calculate the sequence probabilities of it (or set length_penalty to 0 and use the sequence_scores). Then this is roughly what you want. But there is no guarantee that the greedy search output will be part of the generated sentences. And if it is then it probably won't be the most likely sentence output either. 

This might make more sense if we start using beam search instead of greedy search.

If you can log this, a good plot would track how many paraphrases have probs over 1e-5, 1e-4, 1e-3, 1e-2 and 1e-1. Put epoch on the x axis, then either do it (a) for individual examples, or (b) as averages across examples.

### Why can't we use sampling? 

From what I've read the sampling operation is not differentiable. This means that autograd doesn't carry a gradient through the sampling operation.   
I've read that "RL gets around this" but don't really understand the details at this point. Maybe with a non-differentiable policy gradient method? 
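
My current understanding of how RL gets around this is the score-function (REINFORCE) estimator: instead of differentiating through the sample, you weight the gradient of the log-probability of the sampled action by its reward. Below is a sketch for a single softmax policy over three actions, with invented logits and rewards, comparing the sampled estimate to the exact gradient. None of this is taken from the paraphrase model.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_grad(logits, rewards):
    """d E[R] / d logit_j = p_j * (r_j - E[R]) for a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def score_function_grad(logits, rewards, n_samples, rng):
    """Monte Carlo estimate: mean of R(a) * grad log p(a), where grad log p(a) = onehot(a) - p."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for _ in range(n_samples):
        a = rng.choices(range(n), weights=probs)[0]   # the non-differentiable step
        for j in range(n):
            grad[j] += rewards[a] * ((1.0 if j == a else 0.0) - probs[j])
    return [g / n_samples for g in grad]

logits = [0.2, -0.4, 1.0]      # toy policy over three actions
rewards = [1.0, 0.0, 0.5]      # invented reward for each action
rng = random.Random(0)
print(exact_grad(logits, rewards))
print(score_function_grad(logits, rewards, 20000, rng))
```

The two gradients should agree up to sampling noise, even though nothing was differentiated through the sampling step itself.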

Links: [explanations](https://www.google.com/search?q=sampling+non+differentiable&oq=sampling+non+differentiable&aqs=chrome..69i57j69i60.7731j0j4&sourceid=chrome&ie=UTF-8) [huggingface forum post](https://discuss.huggingface.co/t/finetuning-gpt2-with-user-defined-loss/163?page=3)  [alt method](https://leolaugier.wp.imt.fr/2019/09/09/workarounds-non-differentiability/)

### How are beam search scores and sequence_scores calculated? 

UPDATE: raised a github issue, let's see.   

There was a [PR](https://github.com/huggingface/transformers/pull/14654) that was merged in transformers v4.16.0 that seems to fix up the issue of the scores not being correct. Now there is a function  [`compute_transition_beam_scores`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/model#transformers.generation_utils.GenerationMixin.compute_transition_beam_scores) that seems to give you what you want. 

**Previously:**

According to [this post](https://discuss.huggingface.co/t/generation-probabilities-how-to-compute-probabilities-of-output-scores-for-gpt2/3175/15?u=tomroth1001) they are calculated like this: 

**`sequence_scores`**: cumulative log probabilities of the `num_beams` most probable beams. It can be formulated as a recursive formula: `sequence_scores[k]_i = sequence_score[k]_{i-1} + log_probs[i-1, :]_topk(2)[k]` with `sequence_score[k]_{i=start_token} = 0` (`i` being the time step, `k` being the kth beam). 
 * Then I think this is divided by `sequence_length ** length_penalty`, which would explain why a lot of people just set length_penalty to zero (the divisor becomes 1)   
 
**`scores`**: this is where it becomes confusing. At the moment `scores[i, j]` is defined as `log_probs[i-1, j] + sequence_score[j % vocab_size]_{i-1}`, where `j % vocab_size` essentially defines the beam index.
  * I don't really know how to interpret this. 
  
  
NOTE: scores for greedy search are not logprobs but rather logits. 
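
A sketch of the length normalisation on the sequence score. The divisor `length ** length_penalty` is my reading of how the beam scorer normalises, so treat it as an assumption. The log-probabilities are invented.

```python
def length_normalised_score(token_log_probs, length_penalty):
    """Sum of token log-probabilities divided by length ** length_penalty.

    The form of the divisor is an assumption, not confirmed against the library.
    """
    return sum(token_log_probs) / (len(token_log_probs) ** length_penalty)

log_probs = [-0.47, -0.53, -2.34, -0.44]   # invented per-token log-probabilities
print(length_normalised_score(log_probs, length_penalty=0.0))   # plain sequence log-probability
print(length_normalised_score(log_probs, length_penalty=1.0))   # average log-probability per token
```

With `length_penalty=0` the divisor is 1, so the score is just the summed log-probability of the sequence.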


### How does layer-norm affect the probs? 

Not sure. Leaving this for now and lumping it in with the dropout question. 

### Given the size of the action space is this a good candidate for differential entropy? 

I don't think so - based on a cursory look, differential entropy applies to continuous spaces only, and this action space is discrete.

You might need something more specialised - [this paper](https://arxiv.org/pdf/1806.00589.pdf) talks about entropy for high dimensional spaces 

### When do you hit floating point threshold for token probabilities? When do nans and inf get introduced? 

You work with log-probabilities exactly so you don't hit this issue. 
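
A quick illustration of the problem that log-probabilities avoid, using made-up numbers. Python floats are 64-bit; 32-bit floats would underflow even sooner.

```python
import math

token_probs = [0.01] * 400   # 400 tokens, each with probability 0.01 (made-up numbers)

product = 1.0
for p in token_probs:
    product *= p              # underflows to exactly 0.0 before the end

log_sum = sum(math.log(p) for p in token_probs)

print(product)   # 0.0
print(log_sum)   # about -1842.07
```

The product is indistinguishable from zero, while the sum of logs is an ordinary number that can still be compared and differentiated.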

### Does using fp32 affect token calculations? 


It shouldn't because you are working with log probabilities. 

### How does dropout affect generated probabilities? How does train/eval mode affect generated probs for a sentence? 

From what I've observed, putting the network into eval mode reduces the randomness in the generated probabilities. This makes sense because eval mode switches off the stochastic behaviour of the dropout layers. 

We can have a look at the differences for some examples. 

In [None]:
def get_token_probs_for_mode(orig_l, mode): 
    if mode == "train": pp_model.train()
    elif mode == "eval":  pp_model.eval()
    else:   raise ValueError(f"mode must be 'train' or 'eval', got {mode!r}")
    orig_tokens = pp_tokenizer(orig_l, return_tensors='pt', padding=True, pad_to_multiple_of=4)
    pp_output = pp_model.generate_with_grad(**orig_tokens, **pp_model_params, do_sample=False, 
                                          return_dict_in_generate=True,
                                          output_scores=True,
                                        remove_invalid_values=False)
    pp_l_with_tokens = pp_tokenizer.batch_decode(pp_output.sequences, skip_special_tokens=False)

    seq_without_first_tkn = pp_output.sequences[:, 1:]
    attention_mask = pp_model._prepare_attention_mask_for_generation(
        seq_without_first_tkn, pp_tokenizer.pad_token_id, pp_tokenizer.eos_token_id
    )
    scores_log_softmax = torch.stack(pp_output.scores, 1).log_softmax(2)
    seq_token_log_probs = torch.gather(scores_log_softmax,2,seq_without_first_tkn[:,:,None]).squeeze(-1)
    del scores_log_softmax
    # replace nan values with 0 (maybe a bit of a hack); nan_to_num also maps
    # +inf / -inf to the largest / smallest finite values of the dtype by default
    seq_token_log_probs = torch.nan_to_num(seq_token_log_probs)
    # account for the padding tokens at the end 
    seq_token_log_probs = seq_token_log_probs * attention_mask
    seq_log_prob = seq_token_log_probs.sum(-1)

    return (pp_l_with_tokens, seq_token_log_probs, seq_log_prob)

orig_l = [
    "Hello I am tom", 
    "yes hello."
]
get_token_probs_for_mode(orig_l, mode ='train')

In [None]:
get_token_probs_for_mode(orig_l, mode ='eval')

(['<pad> I am tom.</s>', '<pad> Yes, hello.</s><pad>'],
 tensor([[-0.4700, -0.5260, -2.3377, -0.4375, -0.5740, -0.0981],
         [-1.4130, -0.2326, -0.4826, -0.2805, -0.0873, -0.0000]],
        grad_fn=<MulBackward0>),
 tensor([-4.4433, -2.4960], grad_fn=<SumBackward1>))

It's hard to make too many inferences from this really. 
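
To see the mechanism in isolation, here is a standalone sketch of (inverted) dropout on a plain list. It is a stand-in for what a dropout layer does in train and eval mode, not the model's own code. The activation values and the dropout rate are invented.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest by 1 / (1 - p).

    In eval mode (training=False) the input passes through unchanged.
    """
    if not training:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

rng = random.Random(0)
hidden = [0.5, -1.0, 2.0, 0.25]   # stand-in for a hidden activation vector
print(dropout(hidden, p=0.1, training=True, rng=rng))   # can differ from call to call
print(dropout(hidden, p=0.1, training=True, rng=rng))
print(dropout(hidden, p=0.1, training=False, rng=rng))  # always identical to the input
```

Any randomness in the hidden activations feeds through to the logits, so token probabilities in train mode vary between runs while eval mode is repeatable.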

## Glossary

### `attention_mask`


The `attention_mask` is used when you have a batch of texts that are different lengths. The shorter ones are padded with padding tokens up to the length of the longest (or to some fixed length). The `attention_mask` is a binary vector that tells the model where the padding is and, in effect, what to ignore: the 1’s are real tokens and the 0’s are padding. 

I also use it to filter padding tokens out of the sequence probability. Otherwise the sequence 
probability would depend on how many padding tokens there are and the probability of generating them, 
which (a) we don't want and (b) isn't correct anyway. 

It is returned when you tokenise the input. 

I also create it at one point with the code for the paraphrase
```
attention_mask = pp_model._prepare_attention_mask_for_generation(
    seq_without_first_tkn, pp_tokenizer.pad_token_id, pp_tokenizer.eos_token_id
)
```
so that I can get the right probabilities for the sequence. 

### `token_type_ids`


**`token_type_ids`:** Some models/tasks use pairs of sentences concatenated together with a [SEP] token (or something) separating them. Tasks that might need this include textual entailment or duplicate detection. The `token_type_ids` is just a vector of 1’s and 0’s that indicate if a given word is part of the first sentence (a 0) or the second sentence (a 1). 

Looks something like this: 
```
>>> encoded_dict["token_type_ids"]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

So you won't need this for sentiment classification but you might for entailment. 
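
A sketch of how such a vector could be built by hand for a sentence pair. The ids and the function name are hypothetical; in practice the tokenizer returns this for you when you pass it two texts.

```python
def make_token_type_ids(first_ids, second_ids):
    """0 for every position of the first segment, 1 for every position of the second.

    Each argument is assumed to already include whatever special tokens belong to that segment.
    """
    return [0] * len(first_ids) + [1] * len(second_ids)

premise = [101, 2023, 2003, 102]   # hypothetical ids, ending in a separator
hypothesis = [2008, 2001, 102]
print(make_token_type_ids(premise, hypothesis))
```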

### `encoder_input_ids`

Will leave this until I see it 

### `decoder_input_ids`

**`decoder_input_ids`:** Only applies to encoder-decoder models. [One source](https://huggingface.co/transformers/glossary.html#decoder-input-ids) describes this as the input ids that will be fed into the decoder. [Another source](https://huggingface.co/transformers/model_doc/t5.html#training) describes it as the target sequence (shifted by one place) when doing seq2seq training. Often when training you pass in the `labels` attribute and the model figures out what this should be. See [here](https://huggingface.co/transformers/model_doc/t5.html#training)
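
The "shifted by one place" part can be sketched like this. The ids are hypothetical, and the function is a simplified single-sequence version of what I understand the model does internally when it is given `labels`.

```python
def shift_right(labels, decoder_start_token_id):
    """Build decoder inputs by prepending the start token and dropping the last label."""
    return [decoder_start_token_id] + labels[:-1]

labels = [45, 67, 89, 1]          # hypothetical target ids ending in EOS (id 1)
decoder_input_ids = shift_right(labels, decoder_start_token_id=0)
print(decoder_input_ids)          # the decoder sees token t-1 when predicting token t
```

The first decoder input is the start token, which would also be why generated sequences begin with it.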

### input_ids

This is just the tokenized input sequence. 

### decoder_start_token_id

Will leave this until I see it.