<div style="text-align: right" align="right"><i>Peter Norvig<br> 2019, revised 2024<br>Based on <a href="https://nbviewer.org/gist/yoavg/d76121dfde2618422139">Yoav Goldberg's 2015 notebook</a></i></div> 

# The Effectiveness of Generative Language Models

This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to  [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. 

The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). 

In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but actually most of the magic is not in the RNNs, it is  in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not. My point is to update the decade-old Python code, and make a few enhancements.


## Definitions

- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.
- A **token** is a unit of text. In a character model, "walking" would be 7 tokens, one for each letter, while in a word model it would be one token, and in other models it might be two tokens ("walk", "ing").
- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.
- An **n-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a [Markov model](https://en.wikipedia.org/wiki/Markov_model) of **order** 4, because it maps from the 4 previous tokens to the next token.
- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain some information from more than *n* tokens in the past. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
- Current **large language models** such as ChatGPT, Claude, and Gemini use a more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29).  Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).
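
To make these definitions concrete, here is a minimal sketch of estimating *P*(*t* | *h*) by counting, for a 5-gram (order 4) character model. The toy corpus and the names `toy_corpus`, `counts`, and `prob` are assumptions made up for this illustration; they are not part of the notebook's code.

```python
from collections import Counter, defaultdict

toy_corpus = "the chain, the chair, the chain"   # hypothetical toy corpus
order = 4                                          # a 5-gram model is a Markov model of order 4

# Map each 4-character history to a Counter of the characters that follow it
counts = defaultdict(Counter)
for i in range(order, len(toy_corpus)):
    history = toy_corpus[i - order:i]
    counts[history][toy_corpus[i]] += 1

def prob(token: str, history: str) -> float:
    """Estimate P(token | history) as a relative frequency of counts."""
    followers = counts[history]
    total = sum(followers.values())
    return followers[token] / total if total else 0.0

print(counts["chai"])       # Counter({'n': 2, 'r': 1})
print(prob("n", "chai"))    # 2 of the 3 occurrences of "chai" are followed by "n"
```

In this toy corpus `'chai'` is followed by `'n'` twice and `'r'` once, so the estimated probability of `'n'` is 2/3.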

## Training Data

A language model learns probabilities by counting token subsequences in a corpus of text that we call the **training data**. 

Both Karpathy and Goldberg use the works of Shakespeare as their initial training data:

In [1]:
# Fetch the file if it does not already exist here
! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt 

In [2]:
shakespeare: str = open("shakespeare_input.txt").read()

print(f'{len(shakespeare):,d} characters and {len(shakespeare.split()):,d} words:\n\n{shakespeare[:200]}...')

4,573,338 characters and 832,301 words:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you...


## Python Code for *n*-Gram Language Model

I'll start with some imports and simple definitions:

In [3]:
import random
from typing import *
from collections import defaultdict, Counter, deque

type Token = str # Datatype to represent a token (a character or word)

cat = ''.join    # Function to concatenate strings

Now I define the class `LanguageModel`:
- A `LanguageModel` is a subclass of `defaultdict` that maps a history of length *order* tokens to a `Counter` of next tokens.
  - The tokens in the history are concatenated together into one string to form the keys of the LanguageModel.
- The `__init__` method sets the order of the model and optionally accepts tokens of training data. 
- The  `train` method builds up the `{history: Counter(next_token)}` mapping from the training data.
- The  `generate` method randomly samples `length` tokens from the mapping. 
- The  `gen` method is a convenience function to call `generate` and print the results.

In [4]:
class LanguageModel(defaultdict): 
    """A mapping of {'history': Counter(next_token)}."""
    
    def __init__(self, order: int, tokens=()):
        """Set the order of the model, and optionally initialize it with some tokens."""
        self.order = order
        self.default_factory = Counter # Every history entry has a Counter of tokens
        self.train(tokens)

    def train(self, tokens):
        """Go through the tokens, building the {'history': Counter(next_tokens)} mapping."""
        history = deque(maxlen=self.order) # History keeps at most `order` tokens
        for token in tokens:
            self[cat(history)][token] += 1
            history.append(token)
        return self

    def generate(self, length=1000, start=()) -> List[Token]:
        """Generate a random text of `length` tokens, from a sequence of `start` tokens.
        At each step, consider the previous `self.order` tokens and randomly sample the next token."""
        tokens = list(start)
        while len(tokens) < length:
            history = cat(tokens[-self.order:])
            tokens.append(random_token(self[history]))
        return tokens

    def gen(self, length=1000, start=()) -> None:
        """Call generate and print the resulting tokens."""
        print(cat(self.generate(length, start)))

We'll need a function to randomly select a next token from one of the model's Counters:

In [5]:
def random_token(counter: Counter) -> Token:
    """Randomly sample a token from a Counter, with probability proportional to each token's count."""
    return random.choices(list(counter), weights=list(counter.values()), k=1)[0]
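
As a quick sanity check, the sketch below draws many samples and compares the observed frequency with the ratio of the counts. It repeats the definition of `random_token` so that it runs on its own; the seed, the sample size, and the names `followers`, `samples`, and `frac_n` are arbitrary choices for this illustration.

```python
import random
from collections import Counter

def random_token(counter: Counter) -> str:
    """Same sampling rule as above: probability proportional to each token's count."""
    return random.choices(list(counter), weights=list(counter.values()), k=1)[0]

random.seed(0)                               # arbitrary seed, only for repeatability
followers = Counter({'n': 78, 'r': 35})      # the counts that LM["chai"] reports later on
samples = Counter(random_token(followers) for _ in range(10_000))
frac_n = samples['n'] / 10_000

print(f"observed fraction of 'n': {frac_n:.3f}; ratio of counts: {78 / 113:.3f}")
```

With this many samples the observed fraction should land close to 78/113, and no token outside the Counter can ever be drawn.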

Let's train a character-level language model of order 4 on the Shakespeare data. We'll call the language model `LM`:

In [6]:
LM = LanguageModel(4, shakespeare)

Here are some examples of what's in the model:

In [7]:
LM["chai"]

Counter({'n': 78, 'r': 35})

So `"chai"` is followed by either `'n'` or `'r'`. In contrast, almost any letter  can follow `"the "`:

In [8]:
LM["the "]

Counter({'s': 2058,
         'w': 1759,
         'c': 1561,
         'm': 1392,
         'p': 1360,
         'f': 1258,
         'b': 1217,
         'd': 1170,
         't': 1109,
         'g': 1037,
         'h': 1029,
         'l': 1006,
         'r': 804,
         'k': 713,
         'e': 704,
         'n': 616,
         'a': 554,
         'o': 530,
         'v': 388,
         'i': 298,
         'q': 171,
         'D': 146,
         'y': 120,
         'u': 105,
         'L': 105,
         'F': 103,
         'T': 102,
         'j': 99,
         'C': 81,
         'E': 77,
         'G': 75,
         'M': 54,
         'P': 54,
         'R': 45,
         'S': 40,
         'B': 31,
         'J': 30,
         'A': 29,
         'K': 22,
         'H': 20,
         'V': 18,
         'N': 18,
         'I': 14,
         'W': 14,
         "'": 10,
         'Q': 7,
         'z': 6,
         'O': 3})

## Generating Shakespeare

We can generate a random text from the order 4 model:

In [9]:
LM.gen()

First Fishes come, trouble annot have the here reign's every madness, it is repart on hath than of that were is time little be faint of that came of a monstands his on and the wonderstandiscoverfluous nest ask again! thou should writ than we, I'll his good Mercules; your
sonneur, my good is no make me, yet were here is very us;
And, nobler the more at me not his preport,
Such moved on:
But not by my duke
To business: pleasure no moral bed.
Harry noble an end
Do more that were I have do behind,
I go to judgment, and he as he,'that I have come.
Julius, and Penthorough at lame you helps as this for, 'tis not
right i' the earth Boling it fing, sir.

HAMLET:
All:
Their me,
And yet you speak; I strong;
But were subject his his pride up a throw
One way;
Still ye well me an enemy,
I will sick, cause,
And cut of they saithful necess, if God!
For than you how approves compound in Gloucester tribute
A grave?

BURGUNDY:
Thy good with lessenger with done is
own deep; ha!
Will meat great lady troubl

Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. 

## Generating Order 7 Shakespeare

What if we increase the model to order 7? Or 10? The output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model.

In [10]:
%time LanguageModel(7, shakespeare).gen()

First Citizen:
Neighbours shall we in our enemies.

First Lord:
Behind this crown'd, Laertes,
Will your sake, speak of African;
Where is three.

BIRON:
Why, I prithee, take me mad:
Hark, how our company.

MISTRESS FORD:

All:
Our dukedoms.
The break with
you.

LEONTES:
Say you this true at first inducement.

IAGO:
There is set for't! And now my deeds be grieves her question.

LEONATO:
Brother with thy state, and here together with usurping her still
And looks, bid her contemn'd revolt: this night;
For, in his tears not found the players cannot, take this.

CADE:

DICK:
My heart that in this, till thou thus, sir. Fare you were so;
To disprove to hear from the man.

ARIEL:
I pray you, the gates;
And makes a sun and disjoin'd penitent head of thine;
With fear; my master, sir, no; the presupposed
Upon the fiend's wrong
And fertile land of the quern
And ruminate tender'd herself: he shall stay at home;
And chides wrong.
I will we bite our castle:
She died,
That were out by the painful, and 

## Generating Order 10 Shakespeare

In [11]:
LanguageModel(10, shakespeare).gen()

First Citizen:
Woe to the hearts of men
The thing I am forbid;
Or study where I had and have it; and much more ease; for so I have.

Second Lord:
He had no other death concludes but what thou wilt.
How now, Simple! where have you taste of thy abhorr'd ingredients of our loves again,
Alike betwitched by the Frenchman his companion of the gesture
One might interpreter, you might pardon him, sweet father, do you here? things that I bought mine own.

KING RICHARD II:
A lunatic lean-witted fools
The way twice o'er, I'll weep. O fool, I shall be publish'd, and
Her coronation-day,
When Bolingbroke ascends my throne of France upon his sudden seem,
I would be the first house, our story
What we have o'erheard
Your royal grace!

DUKE VINCENTIO:
Have after. To what end he gave me fresh garments must not then respective lenity,
And all-to topple: pure surprised:
Guard her till this osier cage of ours
Were nice and continents! what mutiny!
What raging of their emperor
And to conclude to hate me.

KI

## Probabilities and Smoothing

Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*token* | *history*) could be computed as follows:

In [12]:
def P(token, history, LM: LanguageModel): 
    """The probability that token follows history."""
    return LM[history][token] / sum(LM[history].values())

What's the probability that the letter "n" follows the four letters "chai"?

In [13]:
P('n', 'chai', LM)

0.6902654867256637

What about the letter "s"?

In [14]:
P('s', 'chai', LM)

0.0

Shakespeare never wrote about "chaise longues" or "chai tea," so the probability of an "s" or a space following "chai" is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters "chais" or "chai " to appear anywhere in a text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. A simple type of smoothing is "add-one smoothing"; it assumes that if we have counted *N* tokens following a given history, then the probability of an unseen token is 1 / (*N* + 1), and the probabilities for the previously-seen tokens are reduced accordingly (dividing by *N* + 1 instead of *N*):

In [15]:
def P(t, h, LM: LanguageModel): 
    """The probability that token t follows history h, using add-one smoothing."""
    N = sum(LM[h].values())
    return max(1, LM[h][t]) / (N + 1)

That gives us:

In [16]:
P('s', 'chai', LM)

0.008771929824561403

In [17]:
P('n', 'chai', LM)

0.6842105263157895
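
The function above is a quick approximation: because every unseen token gets 1 / (*N* + 1), the values for one history add up to slightly more than 1. For comparison, here is a minimal sketch of the textbook form of add-one (Laplace) smoothing, which adds 1 to every count and divides by *N* + *V*, where *V* is the vocabulary size. The name `P_laplace`, the `vocab` argument, and the 27-symbol vocabulary are assumptions introduced for this sketch, and it takes a plain dict of Counters so that it can run on its own.

```python
from collections import Counter

def P_laplace(t: str, h: str, counts: dict, vocab: set) -> float:
    """Add-one (Laplace) estimate of P(t | h): (count + 1) / (N + V)."""
    followers = counts.get(h, Counter())
    N = sum(followers.values())
    return (followers[t] + 1) / (N + len(vocab))

# Hypothetical example: the "chai" counts from above, with a 27-symbol vocabulary
counts = {'chai': Counter({'n': 78, 'r': 35})}
vocab = set('abcdefghijklmnopqrstuvwxyz ')

p_n = P_laplace('n', 'chai', counts, vocab)    # (78 + 1) / (113 + 27)
p_s = P_laplace('s', 'chai', counts, vocab)    # (0 + 1) / (113 + 27)
total = sum(P_laplace(t, 'chai', counts, vocab) for t in vocab)

print(p_n, p_s, total)
```

With this variant the probabilities over the whole vocabulary sum to 1, at the cost of needing to know the vocabulary in advance.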

## Starting Text

One thing you may have noticed: all the generated passages start with the word "First". Why is that? Because the training data happens to start with the line "First Citizen:", and so when we call `generate`, we start with an empty history, and the only thing that follows the empty history in the training data is the letter "F", the only thing that follows "F" is "i", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.

We can give a starting text to `generate` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:

In [18]:
LM.gen(start='ROMEO')

ROMEO:
Is the rash they news, and
the night too her never, tie here queen,
And, in lady store known
in practises;
Whose thy master thought;
Then for my rose, for nothese him that honour body and I now nights have to make with noted.

ANTIPHOLUS OF EPHEN SCROOP:
Gloucested, and dote: marry, were cours, I deputes our this, and hurt was. Yet, could probations, I hear me, Rosal to chard,
Which thy error offer'd you parce Will bears,
The nature and how prever my mont: after come to bear the that worthly to Cyprus.--Help, ho! how
did:
If guard this daughter,
Were up ruled for sister, be hour, than with mouth, my patient you my hence Helent denuncle, la, had for thank the farth
The gate:
My prither business. He behold, in would not ruins. What, assage of fles, that were this,
How it is not thence; for the matter. Thaisanio;
Why, where Christial of a bay,
But of steeds the not me save you owest be dog-day,
His slain If, a black, you die against the compey, of thine own the riots do lustry, may
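
To get the variety of starting points discussed above, one option is to pick the start from a randomly chosen section of the training text. The sketch below assumes that sections are separated by blank lines, which is an assumption about the structure of the training data; the function name `random_start` and the sample text are made up for this illustration.

```python
import random

def random_start(text: str, order: int, rng=random) -> str:
    """Return the first `order` characters of a randomly chosen section.
    Assumes sections are separated by blank lines (an assumption about the data)."""
    sections = [s for s in text.split('\n\n') if len(s) >= order]
    return rng.choice(sections)[:order]

# Hypothetical miniature corpus in the same layout as the Shakespeare file
sample = "First Citizen:\nSpeak.\n\nAll:\nSpeak, speak.\n\nROMEO:\nAy me!"
start = random_start(sample, 4)
print(repr(start))
```

With the notebook's model this might be used as `LM.gen(start=random_start(shakespeare, LM.order))`, so that each run begins with a different speaker's line.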

## Linux Kernel C++ Code

Goldberg's point is that the simple character-level n-gram model performs about as well as the  more complex RNN model on Shakespearean text. 

But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data.

In [19]:
! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt
! wc   linux_input.txt

  241465  759639 6206997 linux_input.txt


In [20]:
linux = open("linux_input.txt").read()

## Generating Order 10 C++

We'll start with an order-10 character model, and compare that to an order-20 model. We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment.

In [21]:
LanguageModel(10, linux).gen(length=3000)

/*
 * linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/slab.h>

struct snapshot_device_available space if
 * such as list traversal.
 */
struct sigpending *list, siginfo_t info;
	if (!ret)
		rb->aux_priv = event->pmu = pmu;

	return BUF_PAGE_SIZE(slots)			\
	(offsetof(struct printk_log) + msg->text_len;
	if (path)
		audit_log_n_untrustedstring(ab, "remove_rule");
			list_add(&nt->entry, list);
	cpu_buffer->irq_work.work, rb_wake_up_waiters - wake up the first child if it
		 * is tracked on a waitqueue_head_t *bit_waitqueue_head_init(&rt_rq->rt_nr_boosted = 0;
	ret = test_jprobe();
	if (err)
			goto process
 * @len: max length to calculate_period(event);
	print_ip_sym(s, *p, flags);
	void (*write_delay,
	.writeunlock	= torture_random(trsp) % (cxt.nrealwriters_stress = cxt.nrealwriters_stress >= 0)
			mark_reg_unknown_value(regs);

	current->lockdep_depth = curr->lockdep_depth; i++) {
		if (KDB_FLAG(CMD_INTERRUPT)) {
		/* We need to keep it from
		 * the current thread will pa

## Order 20 C++

In [22]:
LanguageModel(20, linux).gen(length=3000)

/*
 * linux/kernel/irq/autoprobe.c
 *
 * Copyright (C) 2009 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *
 * Contributors at various stages not listed above:
 *  Jason Wessel ( jason.wessel@windriver.com>
 *
 * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar
 *
 * This file contains macros used solely by rtmutex.c. Debug version.
 */

extern void
rt_mutex_deadlock_account_lock(lock, proxy_owner);
	rt_mutex_set_owner(lock);
		mutex_acquire(&lock->dep_map, subclass, 0, nest_lock, ip);

	if (mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ctx)
{
	struct mutex_waiter *waiter);
extern void debug_rt_mutex_init_waiter(&waiter);
	RB_CLEAR_NODE(&rt_waiter.tree_entry);

	raw_spin_lock(&this_rq->lock);

	update_blocked_averages(cpu);

	rcu_read_lock();
	p = find_task_by_vpid(pid);
		if (p)
			err = posix_cpu_clock_get,
	.timer_create	= pc_timer_create,
	.timer_set	= pc_timer_settime,
	.timer_del	= pc_timer_delete,
	.timer_get	= pc_timer_gettime,
};
#ifndef _CONSOLE_C

## Analysis of Generated Linux Text

As Goldberg says, "Order 10 is pretty much junk." But order 20 is much better. Most of the comments have a start and an end; most of the open parentheses are balanced with a close parenthesis; but the braces are not as well balanced. That shouldn't be surprising. If the span of an open/close parenthesis pair is less than 20 characters then it can be represented within the model, but if the span of an open/close brace is more than 20 characters, then it cannot be represented by the model. Goldberg notes that Karpathy's RNN seems to have learned to devote some of its long short-term memory (LSTM) to representing nesting level, as well as things like whether we are currently within a string or a comment. It is indeed impressive, as Karpathy says, that the model learned to do this on its own, without any input from the human engineer.
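
One rough way to put numbers on "balanced" is to count matched and unmatched delimiters with a depth counter. The sketch below is an assumption of how such a check might look, not something from Goldberg's or Karpathy's work; the name `bracket_report` is made up, the snippet is adapted from the generated text above, and the count ignores strings and comments.

```python
def bracket_report(text: str, open_ch: str = '(', close_ch: str = ')') -> dict:
    """Count matched pairs and unmatched delimiters using a simple depth counter.
    Ignores strings and comments, so it is only a rough measure."""
    depth = matched = unmatched_close = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            if depth > 0:          # closes an open delimiter seen earlier
                depth -= 1
                matched += 1
            else:                  # a close with nothing open
                unmatched_close += 1
    return {'matched': matched, 'unmatched_open': depth, 'unmatched_close': unmatched_close}

# Snippet adapted from the order-20 output above
snippet = "if (mutex_optimistic_spin(lock, ctx)\n{\n\tstruct mutex_waiter *waiter);\n"
parens = bracket_report(snippet, '(', ')')
braces = bracket_report(snippet, '{', '}')
print(parens)
print(braces)
```

Note that the parentheses in this snippet come out as matched even though the code is not valid: a counter like this measures balance, not correctness.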

## Character Models versus Word and Token Models

Karpathy and Goldberg both used character models, because the exact formatting of characters (especially indentation and line breaks) is important in the format of plays and C++ programs. But if you are interested in generating paragraphs of text that don't have any specific format, it is  common to use a **word** model, which represents the probability of the next word given the previous words, or a **token** model in which tokens can be words, punctuation, or parts of words. For example, the text `"Spiderman!"` might be broken up into the three tokens `"Spider"`, `"man"`, and `"!"`. 

One simple way of tokenizing a text is to break it up into alternating strings of word and non-word characters; the function `tokenize` does that:

In [23]:
import re

tokenize = re.compile(r'\w+|\W+').findall # Find all alternating word- or non-word strings

In [24]:
assert tokenize('Soft! who comes here?') == [
    'Soft', '! ', 'who', ' ', 'comes', ' ', 'here', '?']

assert tokenize('wherefore art thou ') == [
    'wherefore', ' ', 'art', ' ', 'thou', ' ']
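
A useful property of this tokenizer is that it loses nothing: every character belongs to exactly one token, so concatenating the tokens gives back the original text, line breaks included. The sketch below checks that on a made-up example; it repeats the definition of `tokenize` so that it runs on its own, and the names `text` and `tokens` are just for this illustration.

```python
import re

tokenize = re.compile(r'\w+|\W+').findall   # same definition as above

text = "Soft! who comes here?\n\nROMEO:\n"  # hypothetical example text
tokens = tokenize(text)

print(tokens)
assert ''.join(tokens) == text               # tokenizing and rejoining is lossless
```

Here the punctuation and the blank line after `here` end up together in the single non-word token `'?\n\n'`, which is how a token model can still reproduce the layout of a play.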

We can train a token language model on the Shakespeare data. A model of order 6 keeps a history of up to three word and three non-word tokens. 

In [25]:
TLM = LanguageModel(6, tokenize(shakespeare))

In [26]:
TLM['wherefore art thou ']

Counter({'Romeo': 1})

In [27]:
TLM['not in our ']

Counter({'stars': 1, 'Grecian': 1})

In [28]:
TLM['end of my ']

Counter({'life': 1, 'business': 1, 'dinner': 1, 'time': 1})

In [29]:
TLM[' end of my']

Counter({' ': 2})
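
The last two lookups differ only in where the space falls, because a key is the concatenation of the previous six tokens, and word and non-word tokens alternate. The sketch below traces the keys produced while reading a short made-up phrase, using the same `deque`-based history as the `train` method; the names `tokens`, `history`, and `keys` are just for this illustration.

```python
import re
from collections import deque

tokenize = re.compile(r'\w+|\W+').findall   # same definition as above

tokens = tokenize("the end of my life")     # hypothetical example phrase
history = deque(maxlen=6)                   # order 6: keep at most six tokens
keys = []
for token in tokens:
    keys.append((''.join(history), token))  # (history key, token that follows it)
    history.append(token)

for key, token in keys[-2:]:
    print(repr(key), '->', repr(token))
```

The last two entries are `(' end of my', ' ')` and `('end of my ', 'life')`: a history ending in a word is followed by a non-word token, and a history ending in a space is followed by a word.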

We see below that the quality of the token models is similar to character models, and improves from 6 tokens to 8:

In [30]:
TLM.gen(400)

First Citizen:
Before we proceed any further, hear me speak.

TIMON:
Freely, good father.

Old Athenian:
Thou hast a sister by the mother's, from the top to toe?

MARCELLUS:
My lord, upon the platform where we watch'd.

HAMLET:
Did you not tell me, Griffith, as thou led'st me,
That the great body of our kingdom
How foul it is; what rank diseases grow
And with what zeal! for, now he has crack'd the league, and hath attach'd
Our merchants' goods at Bourdeaux.

ABERGAVENNY:
Is it therefore
The ambassador is silenced?

NORFOLK:
Marry, is't.

ABERGAVENNY:
A proper title of a peace; and purchased
At a superfluous rate!

BUCKINGHAM:
Why, all this business
Our reverend cardinal carried.

NORFOLK:
Like it your grace,
The Breton navy is dispersed by tempest:
Richmond, in Yorkshire, sent out a boat
Unto the shore, to ask those on the banks
If they were known, as the suspect is great,
Would make thee quickly hop without thy head.
Give me my horse, you
rogues; give me my gown; or else keep it in yo

In [31]:
LanguageModel(8, tokenize(shakespeare)).gen(400)

First Citizen:
Before we proceed any further, hear me speak.

All:
Peace, ho! Hear Antony. Most noble Antony!

ANTONY:
Why, friends, you go to do you know not what:
Wherein hath Caesar thus deserved your loves?
Alas, you know not: I must tell you that,
Before my daughter told me--what might you,
Or my dear majesty your queen here, think,
If I had play'd the desk or table-book,
Or given my heart a winking, mute and dumb,
Or look'd upon this love with idle sight;
What might you think? No, I went round to work,
And my young mistress thus I did bespeak:
'Lord Hamlet is a prince, out of thy star;
This must not be:' and then I precepts gave her,
That she should lock herself from his resort,
Admit no messengers, receive no tokens.
Which done, she took the fruits of my advice;
And he, repulsed--a short tale to make--
Fell into a sadness, then into a fast,
Thence to a watch, thence into a weakness,
Thence to a lightness, and, by this declension,
Into the madness wherein now he raves,
And all we

## C++ Token Model

Similar remarks hold for token models trained on C++ data:

In [32]:
LanguageModel(8, tokenize(linux)).gen(1000)

/*
 * linux/kernel/irq/autoprobe.c
 *
 * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar
 *
 * This file contains the private data structure and API definitions.
 */

#ifndef __KERNEL_RTMUTEX_COMMON_H
#define __KERNEL_RTMUTEX_COMMON_H

#include <linux/rtmutex.h>

/*
 * The rtmutex in kernel tester is independent of rtmutex debugging. We
 * call schedule_rt_mutex_test() instead of schedule() for the tasks which
 * belong to the tester. That way we can delay the wakeup path of those
 * threads to provoke lock stealing and testing of  complex boosting scenarios.
 */
#ifdef CONFIG_RT_MUTEX_TESTER

extern void schedule_rt_mutex_test(struct rt_mutex *mutex)
{
	int tid, op, dat;
	struct test_thread_data *td;

	/* We have to lookup the task */
	for (tid = 0; tid < MAX_RT_TEST_THREADS; tid++) {
		if (threads[tid] == current)
			break;
	}

	BUG_ON(tid == MAX_RT_TEST_THREADS);

	td = &thread_data[tid];

	op = td->opcode;
	dat = td->opdata;

	switch (op) {
	case RTTEST_LOCK:
	case RTTEST