## Eliciting scores from LMs

Language models are trained to predict token probabilities, given some input context. This allows us to explore a number of different scoring methods.

* token-scoring: the natural ability of LMs -- assigning probabilities to tokens given context
* word-scoring: going from tokens (which could be sub-words) to word scores
* sequence-scoring: going from tokens/words to full, multi-word sequences 
* conditional-scoring: computing conditional probabilities of sequences given some input

For all these methods, we will consider a range of different scores: probabilities, log-probabilities, and surprisals. For sequence probabilities, we will compare summing log-probabilities (equivalent to multiplying probabilities) with taking the log-probability per token, which accounts for the effect of length.
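As a quick orientation before any model is loaded, the relationships between these scores can be sketched with plain Python. The probabilities below are made up for illustration and do not come from any LM.

```python
import math

# Hypothetical per-token probabilities p(w_i | w_1..w_{i-1}) for a 4-token sequence.
token_probs = [0.5, 0.25, 0.1, 0.8]

log_probs = [math.log(p) for p in token_probs]            # natural-log probabilities (nats)
surprisals = [-lp for lp in log_probs]                    # surprisal = -log p
surprisals_bits = [s / math.log(2) for s in surprisals]   # nats -> bits

# Summing log-probabilities equals taking the log of the product of probabilities.
summed_log_prob = sum(log_probs)
joint_prob = math.prod(token_probs)
assert math.isclose(math.exp(summed_log_prob), joint_prob)

# Log-probability per token divides out the sequence length.
log_prob_per_token = summed_log_prob / len(token_probs)

print(summed_log_prob, log_prob_per_token)
```

Dividing by `math.log(2)` converts from nats to bits, the same change of base that a base-two option performs.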

In [1]:
import torch

from minicons import scorer
from nltk.tokenize import TweetTokenizer
from torch.utils.data import DataLoader

### Different types of LMs

Autoregressive LMs: `scorer.IncrementalLMScorer` - These predict the next token given some left context, e.g., "I saw the man reading a __" --> "book". They can only look at the words *before* the one being predicted, i.e., they make predictions unidirectionally.

Masked LMs: `scorer.MaskedLMScorer` - These predict words in context using bidirectional evidence, i.e., they can even see the "future" in order to form predictions about a given word. E.g., "I saw a __ on the mat" --> "cat", "dog", etc.
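The difference between the two model types comes down to which positions are visible when a word is predicted. The sketch below is a toy illustration only: `visible_positions` is a hypothetical helper, not part of `minicons`, and it works on whole words rather than model tokens.

```python
def visible_positions(n_tokens, target, bidirectional):
    """Return the indices a model may condition on when predicting `target`."""
    if bidirectional:
        # Masked LM: every position except the masked target itself.
        return [i for i in range(n_tokens) if i != target]
    # Autoregressive LM: strictly the positions to the left of the target.
    return list(range(target))

tokens = ["I", "saw", "a", "__", "on", "the", "mat"]
target = tokens.index("__")

causal_context = [tokens[i] for i in visible_positions(len(tokens), target, False)]
masked_context = [tokens[i] for i in visible_positions(len(tokens), target, True)]

print(causal_context)
print(masked_context)
```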

In [2]:
model_name = "HuggingFaceTB/SmolLM2-135M"
# model_name = "gpt2"
# model_name = "facebook/opt-125m"

# many models do not automatically insert a beginning-of-sentence
# token when tokenizing a sequence, even though they were trained
# with one...

if "gpt2" in model_name or "pythia" in model_name or "SmolLM" in model_name:
    BOS = True
else:
    BOS = False

lm = scorer.IncrementalLMScorer(model_name)

In [3]:
prefixes = [
    "He caught the pass and scored another touchdown. There was nothing he enjoyed more than a good game of",
    "The firefighters wanted to have a mascot to live with them at the firehouse. Naturally, they decided it would have to be a"
]

dist = lm.next_word_distribution(prefixes)

# batch-wise log probabilities over the next word
dist

tensor([[-12.3408, -26.3428, -26.3726,  ..., -23.4791, -17.3325, -22.6918],
        [-11.9606, -23.8860, -23.8423,  ..., -20.0405, -12.6999, -20.7964]])

In [4]:
# We can then use this to compute the entropy of the distributions! Larger entropies suggest "more randomness",
# i.e., if many things can be expected as the "next word", then we should see higher entropies.

entropy = (-dist * dist.exp()).sum(1)
entropy

tensor([4.2262, 6.0277])

**Analysis:** We see that the second context has a much higher entropy than the first one. This is to be expected! The first context strongly constrains the next word, i.e., football (in the US). In the second context, almost anything can be a mascot!
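To see why a peaked distribution has lower entropy than a flat one, here is a minimal sketch of the same formula on two made-up four-word distributions. It assumes natural-log probabilities, so the result is in nats; `entropy_from_log_probs` is a hypothetical helper written for this illustration.

```python
import math

def entropy_from_log_probs(log_probs):
    """Shannon entropy in nats: H = -sum_i p_i * log p_i."""
    return -sum(math.exp(lp) * lp for lp in log_probs)

# One word dominates vs. all four words equally likely.
peaked = [math.log(p) for p in [0.97, 0.01, 0.01, 0.01]]
uniform = [math.log(0.25)] * 4

h_peaked = entropy_from_log_probs(peaked)
h_uniform = entropy_from_log_probs(uniform)

print(h_peaked, h_uniform)
```

The uniform case reaches the maximum possible entropy for four outcomes, `log(4)`, while the peaked case stays close to zero.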

In [5]:
# Let's now query our distributions -- here we can specify a list of items we want to get the 
# probabilities of, for each input. The output will be a tuple of two elements: first, a list 
# of lists containing the probabilities for each query. Second, a similarly structured object
# but with ranks.

lm.query(
    dist, 
    [["football", "baseball", "monopoly"], 
     ["dog", "bear", "zebra", "cat"]]
)

([[0.3232646584510803, 0.021983524784445763, 5.0356284191366285e-05],
  [0.0610167421400547,
   0.04948180541396141,
   0.0008991205831989646,
   0.014311952516436577]],
 [[1, 6, 844], [1, 2, 160, 9]])
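The two parts of this output (probabilities and ranks) can be read as follows. The sketch uses a made-up five-word vocabulary and a hypothetical `query_toy` helper; it illustrates the relationship between the returned values and is not how `minicons` implements `query`.

```python
import math

# Hypothetical next-word log-probabilities over a five-word vocabulary.
toy_dist = {
    "football": math.log(0.50),
    "tag": math.log(0.20),
    "baseball": math.log(0.15),
    "chess": math.log(0.10),
    "monopoly": math.log(0.05),
}

def query_toy(dist, words):
    """Return (probabilities, ranks) for `words`; rank 1 is the most likely word."""
    ordered = sorted(dist, key=dist.get, reverse=True)
    probs = [math.exp(dist[w]) for w in words]
    ranks = [ordered.index(w) + 1 for w in words]
    return probs, ranks

probs, ranks = query_toy(toy_dist, ["football", "baseball", "monopoly"])
print(probs, ranks)
```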

In [6]:
# But how do we get the actual next word in each context, or a set of next words?
# To do this, we can simply use the <model>.topk() function, which returns the
# top "k" next words along with their probabilities.

lm.topk(dist, k = 10)

([['football',
   'pass',
   'catch',
   'basketball',
   'soccer',
   'baseball',
   'the',
   'tag',
   'touch',
   ''],
  ['dog',
   'bear',
   'lion',
   'horse',
   'dragon',
   'giant',
   'big',
   'wolf',
   'cat',
   'tiger']],
 [[0.3232646584510803,
   0.07216373831033707,
   0.05828924849629402,
   0.04123228043317795,
   0.025963453575968742,
   0.021983524784445763,
   0.019504325464367867,
   0.01449351292103529,
   0.013784543611109257,
   0.010535284876823425],
  [0.0610167421400547,
   0.04948180541396141,
   0.023994380608201027,
   0.022978678345680237,
   0.017409881576895714,
   0.01674766279757023,
   0.01547678466886282,
   0.014478103257715702,
   0.014311952516436577,
   0.013663243502378464]])

### Token Scoring

Now moving on to more exciting functions, let's start with token scoring! Here we can get a token-by-token probability/log-probability, given a batch of sentences.

**Input:** I know what the lion devoured at sunrise.

**Outputs:** 
* Probabilities: $p(w_i | w_1, w_2, \dots, w_{i-1})$
* log-probabilities: $\log p(w_i | w_1, w_2, \dots, w_{i-1})$
* Surprisals: $-\log p(w_i | w_1, w_2, \dots, w_{i-1})$


In [7]:
sequences = [
    "I know what the lion devoured at sunrise.", 
    "I know that the lion devoured at sunrise."
]

Surprisals, with `bow_correction=True`:

In [8]:
lm.token_score(
    sequences, 
    bos_token=BOS,
    prob=False,
    surprisal=True,
    bow_correction=True
)

[[('<|endoftext|>', 0.0),
  ('I', 6.193082809448242),
  ('Ġknow', 4.064835548400879),
  ('Ġwhat', 2.929741382598877),
  ('Ġthe', 2.5732474327087402),
  ('Ġlion', 9.888419151306152),
  ('Ġdev', 11.566431999206543),
  ('oured', 2.284313201904297),
  ('Ġat', 3.8329339027404785),
  ('Ġsunrise', 7.398892879486084),
  ('.', 1.7576900720596313)],
 [('<|endoftext|>', 0.0),
  ('I', 6.193082809448242),
  ('Ġknow', 4.064835548400879),
  ('Ġthat', 1.2324100732803345),
  ('Ġthe', 2.039327383041382),
  ('Ġlion', 9.423368453979492),
  ('Ġdev', 9.497617721557617),
  ('oured', 1.1308780908584595),
  ('Ġat', 7.301660537719727),
  ('Ġsunrise', 9.36837387084961),
  ('.', 2.652425765991211)]]

Surprisals (negative log-probabilities):

In [9]:
lm.token_score(
    sequences, 
    bos_token=BOS,
    surprisal=True
)

[[('<|endoftext|>', 0.0),
  ('I', 5.686653137207031),
  ('Ġknow', 4.515300750732422),
  ('Ġwhat', 2.9601783752441406),
  ('Ġthe', 2.591022491455078),
  ('Ġlion', 9.441679954528809),
  ('Ġdev', 12.020923614501953),
  ('oured', 1.3213729858398438),
  ('Ġat', 4.788769721984863),
  ('Ġsunrise', 7.405997276306152),
  ('.', 1.3616142272949219)],
 [('<|endoftext|>', 0.0),
  ('I', 5.686653137207031),
  ('Ġknow', 4.515300750732422),
  ('Ġthat', 1.2616539001464844),
  ('Ġthe', 2.0563812255859375),
  ('Ġlion', 9.17728042602539),
  ('Ġdev', 9.753372192382812),
  ('oured', 1.1199378967285156),
  ('Ġat', 7.276801109313965),
  ('Ġsunrise', 9.404172897338867),
  ('.', 2.371004104614258)]]

### Word scoring

Same metrics, but log probabilities for words that are split into tokens are summed, e.g., `devoured` is split into `dev + oured`. However, here you have to provide the word tokenizer yourself. We will use `nltk`'s `TweetTokenizer()` as an example. Setting `base_two=True` reports the scores in bits rather than nats.

In [10]:
word_tokenizer = TweetTokenizer().tokenize

In [11]:
lm.word_score_tokenized(
    sequences, 
    bos_token=BOS, 
    tokenize_function=word_tokenizer,
    surprisal=True,
    base_two=True
)

[[('I', 8.204106330871582),
  ('know', 6.514202117919922),
  ('what', 4.270634651184082),
  ('the', 3.7380552291870117),
  ('lion', 13.621464729309082),
  ('devoured', 19.248865127563477),
  ('at', 6.908734321594238),
  ('sunrise', 10.684595108032227),
  ('.', 1.9643940925598145)],
 [('I', 8.204106330871582),
  ('know', 6.514202117919922),
  ('that', 1.8201818466186523),
  ('the', 2.966731071472168),
  ('lion', 13.24001693725586),
  ('devoured', 15.686870574951172),
  ('at', 10.498205184936523),
  ('sunrise', 13.567353248596191),
  ('.', 3.420635938644409)]]
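The word-level value for `devoured` in the first sentence can be reproduced by hand from the token-level surprisals printed in the token-scoring section. This is a small arithmetic check, assuming those token scores are in nats and that `base_two=True` only changes the base of the logarithm.

```python
import math

# Token-level surprisals (in nats) copied from the token-scoring output above.
dev_nats = 12.020923614501953    # 'Ġdev'
oured_nats = 1.3213729858398438  # 'oured'

# Summing surprisals over a word's tokens gives the word-level surprisal.
word_nats = dev_nats + oured_nats

# Convert from nats to bits by dividing by ln(2).
word_bits = word_nats / math.log(2)

print(word_bits)
```

The result agrees with the `('devoured', 19.248865127563477)` entry up to floating-point precision.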

### Sequence scoring

We can also compute the joint probability/log-probability of entire sequences by summing the token log-probabilities. Note, however, that the default behavior is to score per token, i.e., to take the mean of the token log-probabilities.

**Input:** batch of sentences

**Outputs:** scores indicating how likely each sequence is. There are multiple methods for doing this though:

* summed log-probs (equivalent to joint probability, computed using the product rule)
* log-prob per token

In [12]:
sequences = [
    "The keys to the cabinet are on the table.",
    "The keys to the cabinet is on the table."
]

#### Log-prob per token (default behavior):

In [13]:
lm.sequence_score(sequences, bos_token=BOS)

[-3.674643039703369, -4.0424699783325195]

#### Summed log-probs:

Summing is done with the `reduction` argument, which takes a function as its value. It is called a "reduction" because it reduces a list of values to a single value.

In [14]:
lm.sequence_score(
    sequences, 
    bos_token=BOS, 
    reduction=lambda x: x.sum().item()
)

[-36.746429443359375, -40.42470169067383]

Here, the lambda is a concise way of defining an anonymous function. It takes a torch tensor of model-elicited log-probabilities, reduces it by summing, and extracts the result as a plain Python number (as opposed to keeping it as a `tensor`). For example:

In [15]:
x = torch.tensor([0.223234, 0.443257, 0.364343], dtype=torch.double)
sum_func = lambda x: x.sum().item()

sum_func(x)

1.030834
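Any function that maps a sequence of values to a single number can serve as a reduction. The sketch below mimics the idea with plain Python lists instead of tensors; `apply_reduction` and the log-probabilities are hypothetical and only illustrate the pattern, not the library's internals.

```python
# Hypothetical token log-probabilities for two sequences of different lengths.
batch_log_probs = [
    [-1.2, -0.4, -2.1],
    [-0.9, -3.0, -0.5, -1.6, -0.2],
]

def apply_reduction(batch, reduction):
    """Apply `reduction` to each sequence's list of log-probabilities."""
    return [reduction(seq) for seq in batch]

summed = apply_reduction(batch_log_probs, sum)                                # joint log-prob
averaged = apply_reduction(batch_log_probs, lambda seq: sum(seq) / len(seq))  # per token
least_likely = apply_reduction(batch_log_probs, min)                          # worst token

print(summed, averaged, least_likely)
```

With real tensors, the analogous reductions would be written with tensor methods, as in `lambda x: x.sum().item()` above.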

### log-prob of full sequence (summing) vs. log-prob per token (avg)

Usually, the two metrics show similar qualitative trends, especially for minimal-pair comparisons. However, there are cases where log-prob per token is the better metric. The summed log-prob of a sentence might be lower simply because the sentence is longer (contains more tokens): it involves more multiplications between token probabilities, each of which is a number lower than 1.

The following pair illustrates this issue:

1. These casseroles disgust Mrs. O'leary
2. *These casseroles disgusts Kayla

In [16]:
stimuli = [
    "These casseroles disgust Mrs. O'leary", # longer but grammatical
    "These casseroles disgusts Kayla" # shorter but ungrammatical
]

In [17]:
# sum
lm.sequence_score(
    stimuli, 
    bos_token=BOS, 
    reduction=lambda x: x.sum().item()
)

[-61.54801559448242, -56.26276779174805]

In [18]:
# mean (default)
lm.sequence_score(stimuli, bos_token=BOS)

[-6.154801368713379, -8.037538528442383]
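Reading the two outputs together shows the length effect: the summed score favors the shorter, ungrammatical sentence, while the per-token score favors the longer, grammatical one. The sketch below only post-processes the numbers printed above; the token counts are inferred from the ratio of the two scores rather than read from the tokenizer.

```python
# Scores copied from the two outputs above: (summed log-prob, log-prob per token).
scores = {
    "These casseroles disgust Mrs. O'leary": (-61.54801559448242, -6.154801368713379),
    "These casseroles disgusts Kayla": (-56.26276779174805, -8.037538528442383),
}

token_counts = {}
for sentence, (summed, per_token) in scores.items():
    # sum = mean * n, so the ratio recovers the number of scored tokens.
    token_counts[sentence] = round(summed / per_token)
    print(f"{token_counts[sentence]:2d} tokens | sum={summed:.2f} | mean={per_token:.2f} | {sentence}")

# Higher (less negative) is better under both metrics.
best_by_sum = max(scores, key=lambda s: scores[s][0])
best_by_mean = max(scores, key=lambda s: scores[s][1])

print("preferred by sum: ", best_by_sum)
print("preferred by mean:", best_by_mean)
```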

### Conditional LM scoring

This follows the same principle as sequence scoring, but allows you to separate the prefix and the continuation. Like sequence scoring, this method also allows for different reduction methods.

For example, the code below computes:

1. log p(are cooked in a pie. | Sliced apples)
2. log p(are cooked in a pie. | Apples)
3. log p(are cooked in a pie. | Sliced things)

In [19]:
# Lake and Murphy (2023) / Murphy (1988)
# "are cooked in a pie" is an emergent property of sliced apples

prefix = ["Sliced apples", "Apples", "Sliced things"]
continuation = ["are cooked in a pie."] * 3

In [20]:
lm.conditional_score(prefix, continuation, bos_token=BOS)

[-3.2367591857910156, -3.9154112339019775, -3.4075069427490234]
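Conceptually, a conditional score only counts the tokens of the continuation, each conditioned on the prefix and on the continuation tokens before it. The sketch below checks this against the chain rule using made-up token log-probabilities and a made-up token split. Averaging over continuation tokens is an assumption that mirrors the per-token default described for sequence scoring.

```python
# Hypothetical token log-probabilities for "Sliced apples" + "are cooked in a pie."
prefix_log_probs = [-7.1, -4.3]                                # prefix tokens
continuation_log_probs = [-2.0, -5.5, -1.1, -0.9, -6.2, -0.7]  # continuation tokens

full_sequence = sum(prefix_log_probs) + sum(continuation_log_probs)
prefix_only = sum(prefix_log_probs)

# Chain rule: log p(continuation | prefix) = log p(prefix, continuation) - log p(prefix)
conditional_sum = full_sequence - prefix_only

# Per-token variant: average over the continuation tokens only.
conditional_mean = conditional_sum / len(continuation_log_probs)

print(conditional_sum, conditional_mean)
```

Because all three prefixes above share the same continuation, their scores are computed over the same tokens and can be compared directly.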