## Eliciting scores from LMs

Language models are trained to assign probabilities to tokens given some input context. This lets us derive several scoring methods:

* token-scoring: the natural ability of LMs -- assigning probabilities to tokens given context
* word-scoring: going from tokens (which could be sub-words) to word scores
* sequence-scoring: going from tokens/words to full, multi-word sequences 
* conditional-scoring: computing conditional probabilities of sequences given some input

For all these methods, we will consider a range of scores: probabilities, log-probabilities, and surprisals. For sequence probabilities, we will compare summing log-probabilities (equivalent to multiplying probabilities) with taking the log-probability per token, which accounts for the effect of length.
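How these scores relate to one another can be sketched in plain Python. The per-token probabilities below are made-up values for illustration, not model outputs.

```python
import math

# Hypothetical per-token probabilities for a 4-token sequence (not model outputs).
token_probs = [0.5, 0.25, 0.1, 0.2]

# Log-probabilities (natural log) and surprisals (negative log-probabilities).
log_probs = [math.log(p) for p in token_probs]
surprisals = [-lp for lp in log_probs]

# Summing log-probabilities is equivalent to multiplying probabilities.
summed_log_prob = sum(log_probs)
joint_prob = math.prod(token_probs)
assert math.isclose(math.exp(summed_log_prob), joint_prob)

# Dividing by the number of tokens gives the log-probability per token,
# which is less sensitive to sequence length than the sum.
log_prob_per_token = summed_log_prob / len(token_probs)

print(f"summed log-prob:    {summed_log_prob:.4f}")
print(f"log-prob per token: {log_prob_per_token:.4f}")
```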

In [1]:
import torch

from minicons import cwe, scorer
from nltk.tokenize import TweetTokenizer
from torch.utils.data import DataLoader


## Contextualized word embeddings

In [8]:
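def cosine(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Row-wise L2 norms, kept as column vectors for broadcasting.
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    # Normalize each row, clamping norms at eps to avoid division by zero.
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    # Pairwise cosine similarities between the rows of a and the rows of b.
    sims = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sims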

In [2]:
lm = cwe.CWE("bert-base-uncased")

In [20]:
queries = [
    ("I saw a bat flying around looking for food.", "bat"),
    ("I think bats are fierce when they are hungry.", "bats"),
    ("He swung his bat at me but I dodged it.", "bat")
]

layer_embs = lm.extract_representation(queries, layer='all')

In [25]:
cosine(layer_embs[11], layer_embs[11])

tensor([[1.0000, 0.7299, 0.6701],
        [0.7299, 1.0000, 0.5523],
        [0.6701, 0.5523, 1.0000]])

### Different types of LMs

Autoregressive LMs: `scorer.IncrementalLMScorer`

Masked LMs: `scorer.MaskedLMScorer`

In [2]:
model_name = "HuggingFaceTB/SmolLM2-135M"
# model_name = "gpt2"
# model_name = "facebook/opt-125m"

# Many models do not automatically insert a beginning-of-sentence
# token when tokenizing a sequence, even though they were trained
# with one...

if "gpt2" in model_name or "pythia" in model_name or "SmolLM" in model_name:
    BOS = True
else:
    BOS = False

lm = scorer.IncrementalLMScorer(model_name)

### Token Scoring

**Input:** I know what the lion devoured at sunrise.

**Outputs:** 
* Probabilities: $p(w_i | w_1, w_2, \dots, w_{i-1})$
* log-probabilities: $\log p(w_i | w_1, w_2, \dots, w_{i-1})$
* Surprisals: $-\log p(w_i | w_1, w_2, \dots, w_{i-1})$


In [3]:
sequences = [
    "I know what the lion devoured at sunrise.", 
    "I know that the lion devoured at sunrise."
]

Probabilities:

In [5]:
lm.token_score(
    sequences, 
    bos_token=BOS,
    prob=True,
    bow_correction=True
)

[[('<|endoftext|>', 0.0),
  ('I', 0.002043517306447029),
  ('Ġknow', 0.017165811732411385),
  ('Ġwhat', 0.0534108504652977),
  ('Ġthe', 0.0762874037027359),
  ('Ġlion', 5.0759124860633165e-05),
  ('Ġdev', 9.478997526457533e-06),
  ('oured', 0.10184398293495178),
  ('Ġat', 0.021646004170179367),
  ('Ġsunrise', 0.0006119301542639732),
  ('.', 0.1724427342414856)],
 [('<|endoftext|>', 0.0),
  ('I', 0.002043517306447029),
  ('Ġknow', 0.017165811732411385),
  ('Ġthat', 0.2915889024734497),
  ('Ġthe', 0.13011622428894043),
  ('Ġlion', 8.081334817688912e-05),
  ('Ġdev', 7.503035885747522e-05),
  ('oured', 0.32274961471557617),
  ('Ġat', 0.0006744182901456952),
  ('Ġsunrise', 8.538211841369048e-05),
  ('.', 0.07048003375530243)]]

Surprisals (negative log-probabilities, in nats):

In [6]:
lm.token_score(
    sequences, 
    bos_token=BOS,
    surprisal=True,
    bow_correction=True
)

[[('<|endoftext|>', 0.0),
  ('I', 6.193082809448242),
  ('Ġknow', 4.064835548400879),
  ('Ġwhat', 2.929741382598877),
  ('Ġthe', 2.5732474327087402),
  ('Ġlion', 9.888419151306152),
  ('Ġdev', 11.566431999206543),
  ('oured', 2.284313201904297),
  ('Ġat', 3.8329343795776367),
  ('Ġsunrise', 7.398892402648926),
  ('.', 1.7576900720596313)],
 [('<|endoftext|>', 0.0),
  ('I', 6.193082809448242),
  ('Ġknow', 4.064835548400879),
  ('Ġthat', 1.2324103116989136),
  ('Ġthe', 2.0393271446228027),
  ('Ġlion', 9.423368453979492),
  ('Ġdev', 9.497617721557617),
  ('oured', 1.1308784484863281),
  ('Ġat', 7.301660060882568),
  ('Ġsunrise', 9.36837387084961),
  ('.', 2.652425765991211)]]
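As a quick consistency check, the surprisal of a token should equal the negative natural log of its probability. The sketch below uses only values copied from the two outputs above for the first token, `I`.

```python
import math

# Values copied from the two outputs above for the first token, 'I'.
prob_I = 0.002043517306447029
surprisal_I = 6.193082809448242

# Surprisal is the negative natural log of the probability (so it is in nats).
recomputed = -math.log(prob_I)

# The two agree up to floating-point precision.
print(abs(recomputed - surprisal_I) < 1e-4)
```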

### Word scoring

The same metrics apply, but the log-probabilities of words that are split into multiple tokens are summed (e.g., `devoured` is split into `dev + oured`). Here you have to provide the word tokenizer yourself; we will use `nltk`'s `TweetTokenizer()` as an example. With `base_two=True`, the surprisals below are in bits rather than nats.

In [7]:
word_tokenizer = TweetTokenizer().tokenize

In [10]:
lm.word_score_tokenized(
    sequences, 
    bos_token=BOS, 
    tokenize_function=word_tokenizer,
    surprisal=True,
    base_two=True,
    bow_correction=True
)

[[('I', 8.93472957611084),
  ('know', 5.864317893981934),
  ('what', 4.2267231941223145),
  ('the', 3.712411403656006),
  ('lion', 14.265973091125488),
  ('devoured', 19.98240089416504),
  ('at', 5.529755592346191),
  ('sunrise', 10.674345016479492),
  ('.', 2.535810708999634)],
 [('I', 8.93472957611084),
  ('know', 5.864317893981934),
  ('that', 1.7779922485351562),
  ('the', 2.942127227783203),
  ('lion', 13.595046997070312),
  ('devoured', 15.333678245544434),
  ('at', 10.534069061279297),
  ('sunrise', 13.515706062316895),
  ('.', 3.82664155960083)]]
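The word-level score for `devoured` can be reproduced by hand from the token-level surprisals shown earlier: sum the sub-word surprisals (in nats) and divide by ln 2 to convert to bits. This is a sketch using only numbers copied from the outputs in this notebook.

```python
import math

# Token-level surprisals (in nats) for 'Ġdev' and 'oured', copied from the
# token-scoring output for the first sentence.
dev, oured = 11.566431999206543, 2.284313201904297

# Sum the sub-word surprisals, then convert nats to bits by dividing by ln(2).
devoured_bits = (dev + oured) / math.log(2)

# Word-level surprisal reported above for 'devoured' in the first sentence.
reported_bits = 19.98240089416504
print(abs(devoured_bits - reported_bits) < 1e-3)
```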

### Sequence scoring

**Input:** batch of sentences

**Outputs:** scores indicating how likely each sequence is. There are multiple ways to compute them:

* summed log-probs (equivalent to joint probability, computed using the product rule)
* log-prob per token

In [12]:
sequences = [
    "The keys to the cabinet are on the table.",
    "The keys to the cabinet is on the table."
]

log-prob per token (default behavior):

In [13]:
lm.sequence_score(sequences, bos_token=BOS, bow_correction=True)

[-3.7346298694610596, -4.106328010559082]

summed log-probs:

Summing is done via the `reduction` argument, which takes a function:

In [14]:
lm.sequence_score(
    sequences, 
    bos_token=BOS, 
    bow_correction=True,
    reduction=lambda x: x.sum().item()
)

[-37.34629821777344, -41.06328201293945]

Here, the lambda expression is a concise way of defining a function. It takes the torch tensor of model-elicited log-probabilities for a sequence, sums its elements, and extracts the result as a plain Python number with `.item()` (as opposed to keeping it as a `tensor`). For example:

In [49]:
x = torch.tensor([0.223234, 0.443257, 0.364343], dtype=torch.double)
sum_func = lambda x: x.sum().item()

sum_func(x)

1.030834

### log-prob of full sequence (summing) vs. log-prob per token (avg)

Usually, the two metrics show similar qualitative trends, especially for minimal pair comparisons. However, there are cases where log-prob per token is the better metric. The summed log-prob of a sentence might be lower simply because the sentence is longer (contains more tokens): it multiplies together more token probabilities, each of which is less than 1.

The following pair illustrates this issue:

1. These casseroles disgust Mrs. O'leary
2. *These casseroles disgusts Kayla

In [19]:
stimuli = [
    # longer but grammatical
    "These casseroles disgust Mrs. O'leary",
    # shorter but ungrammatical
    "These casseroles disgusts Kayla" 
]

In [20]:
# sum
lm.sequence_score(
    stimuli, 
    bos_token=BOS, 
    bow_correction=True,
    reduction=lambda x: x.sum().item()
)

[-62.72190856933594, -56.759830474853516]

In [21]:
lm.sequence_score(stimuli, bos_token=BOS, bow_correction=True)

[-6.272191047668457, -8.10854721069336]
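The disagreement between the two metrics can be read directly off the scores above. A small sketch, using only the numbers printed in the two outputs:

```python
# Scores copied from the two outputs above, in the order
# (grammatical sentence, ungrammatical sentence).
summed = [-62.72190856933594, -56.759830474853516]
per_token = [-6.272191047668457, -8.10854721069336]

# The ratio of the two scores recovers the number of scored tokens.
n_tokens = [round(s / m) for s, m in zip(summed, per_token)]
print(n_tokens)

# Index of the preferred (higher-scoring) sentence under each metric.
prefers_by_sum = max(range(2), key=lambda i: summed[i])
prefers_by_mean = max(range(2), key=lambda i: per_token[i])
print(prefers_by_sum, prefers_by_mean)
```

The summed score favors the shorter, ungrammatical sentence (index 1), while the per-token score favors the longer, grammatical one (index 0).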

### Conditional LM scoring

This follows the same principle as sequence scoring, but allows you to separate the prefix from the continuation. Like sequence scoring, this method also supports different reduction methods.
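The idea can be sketched with made-up numbers: the prefix only serves as context, and the score is reduced over the continuation tokens alone. The log-probabilities below are hypothetical and are not model outputs; that the score covers only continuation tokens is an assumption about the scorer, stated here for illustration.

```python
# Hypothetical per-token log-probabilities for "prefix + continuation"
# (illustrative values, not model outputs).
prefix_log_probs = [-4.0, -6.5]                           # prefix tokens: context only
continuation_log_probs = [-2.0, -3.5, -1.0, -0.5, -3.0]   # continuation tokens: scored

# Reduce over the continuation tokens only, either by summing or by averaging.
summed = sum(continuation_log_probs)
per_token = summed / len(continuation_log_probs)
print(summed, per_token)
```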

In [27]:
# Lake and Murphy (2023) / Murphy (1988)
# "are cooked in a pie" is more salient to sliced apples
# than to "apples" or "sliced things"

prefixes = ["Sliced apples", "Apples", "Sliced things"]
continuations = ["are cooked in a pie"] * 3

# log P(continuation | prefix)

In [28]:
lm.conditional_score(prefixes, continuations, bos_token=BOS, bow_correction=True)

[-3.195496082305908, -3.9255783557891846, -3.6544346809387207]
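To compare the prefixes, the scores can be paired with their prefixes and sorted. The sketch below uses only the prefixes and scores shown above.

```python
# Prefixes and conditional scores copied from the cells above.
prefixes = ["Sliced apples", "Apples", "Sliced things"]
scores = [-3.195496082305908, -3.9255783557891846, -3.6544346809387207]

# Rank prefixes from most to least likely to be followed by the continuation.
ranked = sorted(zip(prefixes, scores), key=lambda pair: pair[1], reverse=True)
for prefix, score in ranked:
    print(f"{prefix:<14} {score:.3f}")
```

"Sliced apples" receives the highest conditional score, followed by "Sliced things" and then "Apples".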