# Approximating the surprisal of adverbs


### Definition
<strong>Surprisal</strong> (a.k.a. information content) is a measure of the amount of information gained when an event occurs which had some probability value associated with it. Mathematically, it can be represented as such: for some instance or outcome $ x_i $ of random variable $ X $, which can take on values $ x_1, x_2, ... $, and the probability of outcome $ x_{i} $, $ p(x_{i}) $, the surprisal of $ x_{i} $ is given by $$ h(x_{i}) = -\log_{2}{p(x_{i})} \text{ bits} $$
* $ p(x_{i}) = 1 \Rightarrow h(x_{i}) = 0 \text{ bits} $
* $ p(x_{i}) = 0 \Rightarrow h(x_{i}) = \infty \text{ bits} $

In our experiment, one of the values of interest is the surprisal of an adverb.

### Methods

To approximate token probabilities, we use <a href="https://huggingface.co/gpt2-large">GPT-2 large</a>, an English language model (LM). It is accessible via the Hugging Face framework.<br>

For tensor manipulation and operations, we use PyTorch.

### Calculation: Adverb surprisal

#### Import statements

In [1]:
from typing import Dict, List, Tuple, Union

import nltk
nltk.download('punkt')
import torch
from transformers import PreTrainedTokenizerFast, GPT2TokenizerFast, GPT2LMHeadModel

torch.manual_seed(42) # Set seed for reproducibility

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alisonykim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<torch._C.Generator at 0x147f44d70>

In [2]:
lm = GPT2LMHeadModel.from_pretrained('gpt2-large')
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2-large')

#### Define sentences

Parallel corpora (```common``` = sentences with commonly appearing adverbs, ```rare``` = sentences with rarely appearing adverbs)

In [3]:
common = [
	'They looked at each other happily.',
	'The experienced doctor performed the operation successfully.',
	'She did not move, continuing to stare at me passionately.',
	'The dogs barked unexpectedly.',
	'The man on the boat waved angrily.',
	'One important person has been continuously absent.',
	'By the swimming pool, the neighbour waited nervously.',
	'It is so easy to be occasionally charitable.',
	'She clings to her marriage desperately.',
	'In this country, racism is spreading constantly.',
	'The little girl screamed and stamped her foot emotionally.',
	'The guests avoided all political discussion carefully.',
	'The day started normally.',
	'She grabbed the microphone stepped onto the stage confidently.',
	'They have been misleading you most unacceptably.',
	'To meet the deadline, the team worked efficiently.',
	'The student studied for the exam carefully.'
]

In [4]:
rare = [
	'They looked at each other amiably.',
	'The experienced doctor performed the operation dexterously.',
	'She did not move, continuing to stare at me belligerently.',
	'The dogs barked ferociously.',
	'The man on the boat waved affably.',
	'One important person has been conspicuously absent.',
	'By the swimming pool, the neighbour waited languidly.',
	'It is so easy to be vicariously charitable.',
	'She clings to her marriage tenaciously.',
	'In this country, racism has been spreading insidiously.',
	'The little girl screamed and stamped her foot petulantly.',
	'The guests avoided all political discussion sedulously.',
	'The day started mundanely.',
	'She grabbed the microphone and stepped onto the stage audaciously.',
	'They have been misleading you most egregiously.',
	'To meet the deadline, the team worked assiduously.',
	'The student studied for the exam sedulously.'
]

### Steps
* <strong>Step 1:</strong> Tokenize sentences, i.e. map each token to an index in the LM's vocabulary.
	* <strong>Step 1.5:</strong> Extract the adverb from the sentence and tokenize it separately from the entire sentence. (Some words like "sedulously" are mapped to more than 1 item in GPT-2's vocabulary. This occurs due to GPT-2's tokenization process, which uses bytes as the base vocabulary. Therefore, we must tokenize the adverb separately from the entire sentence. See: <strong>Note</strong>.)
* <strong>Step 2:</strong> Add \<BOS\> and \<EOS\> tokens to the sequence so that the probabilities we get are properly conditioned on the preceding tokens.
* <strong>Step 3:</strong> Calculate tokens' conditional logits (pre-softmax probabilities) under the LM.
* <strong>Step 4:</strong> Shift the label (token IDs) and logits tensors.
	* <strong>Logic:</strong> Here, we are interested in the <em>conditional</em> probabilities of words. Therefore, the conditional probability of \<BOS\> given previous tokens does not make sense to consider (there are no tokens preceding \<BOS\> in the sentence). Similarly, the probabilities over time steps after \<BOS\> are not of interest. The labels and logits tensors are shifted accordingly.
* <strong>Step 5:</strong> Calculate the negative log likelihood of the vocabulary item(s) corresponding to the adverb.
* <strong>Step 6:</strong> Extract the logits corresponding to the token position(s) of the adverb.
* <strong>Step 7:</strong> Ensure that each adverb is mapped to a single surprisal value.

<strong>Note:</strong> Because some adverbs are mapped to more than 1 vocabulary item, the following helper functions are defined:
* ```identify_adverb_ids```: Identify the vocabulary ID(s) corresponding to an adverb
* ```consolidate_surprisal```: Map one surprisal value to each adverb

In [5]:
def identify_adverb_ids(
    sent: str,
    tokenizer: PreTrainedTokenizerFast
) -> Union[Tuple[str, List[int], List[int]], Tuple[str, None]]:
    """
    Identify the vocabulary ID(s) corresponding to the adverb.
    
    Returns:
        Tuple of adverb vocabulary ID(s) and adverb token position(s)
    """
    sent_tokenized = nltk.word_tokenize(sent, language='english')
    # Adverbs in this experiment end with 'ly'
    adverb = next((token for token in sent_tokenized if token.endswith('ly')))
    sent_ids = [tokenizer.bos_token_id] + tokenizer(sent)['input_ids'] + [tokenizer.eos_token_id]
    # Reverse-engineer the encoding (via, surprise^, decoding) to find corresponding ID(s)
    # ^No pun intended
    for i in range(len(sent_ids)-1):
        curr_id_decoded = tokenizer.decode(sent_ids[i]).strip()
        next_id_decoded = tokenizer.decode(sent_ids[i+1]).strip()
        if curr_id_decoded == adverb:
            return adverb, [sent_ids[i]], [i]
        elif curr_id_decoded + next_id_decoded == adverb:
            return adverb, sent_ids[i:i+2], [i, i+1]
        elif len(curr_id_decoded + next_id_decoded) < len(adverb) and curr_id_decoded + next_id_decoded in adverb: # Substring in ```adverb``` but not equal
            try:
                next_next_id_decoded = tokenizer.decode(sent_ids[i+2]).strip()
                if curr_id_decoded + next_id_decoded + next_next_id_decoded == adverb:
                    return adverb, sent_ids[i:i+3], [i, i+1, i+2]
            except IndexError:
                return adverb, sent_ids[i:i+2], [i, i+1]
    return adverb, None


def consolidate_surprisal(adv_surprisal_dict: Dict[str, List[float]]) -> Dict[str, float]:
	"""Ensure that each adverb has one surprisal value."""
	return {adverb: sum(surprisal) for adverb, surprisal in adv_surprisal_dict.items()}

##### Common adverb sentences

In [6]:
adverb_surprisal_common = dict()
for sent in common:
	# Step 1
	sent_tokenized = tokenizer(sent)['input_ids']
	# Step 1.5
	adverb, adverb_tokenized, adverb_time_step = identify_adverb_ids(sent, tokenizer)
	# Step 2
	input_ids = torch.tensor([tokenizer.bos_token_id] + sent_tokenized + [tokenizer.eos_token_id])
	# Step 3
	with torch.no_grad():
		outputs = lm(input_ids, labels=input_ids)
	# Step 4
	labels_shifted = input_ids[..., 1:].contiguous()
	adverb_time_step_shifted = [time_step + 1 for time_step in adverb_time_step] # Shift time steps
	logits = outputs['logits']
	logits_shifted = logits[..., :-1, :].contiguous()
	assert logits_shifted.size(0) == labels_shifted.size(0) # As many labels as logits
	# Step 5
	nlls = -1 * torch.log_softmax(logits, dim=-1) # Token surprisal
	# Step 6
	adverb_surprisal_common[adverb] = [
		nlls[time_step][adverb_id].item()
		for time_step, adverb_id in zip(adverb_time_step_shifted, adverb_tokenized)
	]

# Step 7
adverb_surprisal_common = consolidate_surprisal(adverb_surprisal_common)

In [7]:
adverb_surprisal_common

{'happily': 16.239721298217773,
 'successfully': 15.366941452026367,
 'passionately': 17.545166015625,
 'unexpectedly': 15.709169387817383,
 'angrily': 15.555511474609375,
 'continuously': 13.219365119934082,
 'nervously': 15.868431091308594,
 'occasionally': 11.954917907714844,
 'desperately': 13.908486366271973,
 'constantly': 15.275320053100586,
 'emotionally': 14.575103759765625,
 'carefully': 14.272598266601562,
 'normally': 15.229681015014648,
 'confidently': 16.01131820678711,
 'unacceptably': 56.898990631103516,
 'efficiently': 16.871641159057617}

##### Rare adverb sentences

In [8]:
adverb_surprisal_rare = dict()
for sent in rare:
	# Step 1
	sent_tokenized = tokenizer(sent)['input_ids']
	# Step 1.5
	adverb, adverb_tokenized, adverb_time_step = identify_adverb_ids(sent, tokenizer)
	# Step 2
	input_ids = torch.tensor([tokenizer.bos_token_id] + sent_tokenized + [tokenizer.eos_token_id])
	# Step 3
	with torch.no_grad():
		outputs = lm(input_ids, labels=input_ids)
	# Step 4
	labels_shifted = input_ids[..., 1:].contiguous()
	adverb_time_step_shifted = [time_step + 1 for time_step in adverb_time_step] # Shift time steps
	logits = outputs['logits']
	logits_shifted = logits[..., :-1, :].contiguous()
	assert logits_shifted.size(0) == labels_shifted.size(0) # As many labels as logits
	# Step 5
	nlls = -1 * torch.log_softmax(logits, dim=-1) # Token surprisal
	# Step 6
	adverb_surprisal_rare[adverb] = [
		nlls[time_step][adverb_id].item()
		for time_step, adverb_id in zip(adverb_time_step_shifted, adverb_tokenized)
	]

# Step 7
adverb_surprisal_rare = consolidate_surprisal(adverb_surprisal_rare)

In [9]:
adverb_surprisal_rare

{'amiably': 34.04397487640381,
 'dexterously': 33.66762161254883,
 'belligerently': 55.433589935302734,
 'ferociously': 59.79363441467285,
 'affably': 33.415852546691895,
 'conspicuously': 27.25703525543213,
 'languidly': 46.11696910858154,
 'vicariously': 46.579978942871094,
 'tenaciously': 31.364498138427734,
 'insidiously': 58.4285945892334,
 'petulantly': 61.445457458496094,
 'sedulously': 31.638813018798828,
 'mundanely': 31.853524208068848,
 'audaciously': 34.2632417678833,
 'egregiously': 34.74201774597168,
 'assiduously': 63.89957046508789}