<div style="text-align: right" align="right"><i>Peter Norvig<br> 2019, revised 2024<br>Based on <a href="https://nbviewer.org/gist/yoavg/d76121dfde2618422139">Yoav Goldberg's 2015 notebook</a></i></div> 

# The Effectiveness of Generative Language Models

This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. 

The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). 

In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but most of the magic is not in the RNNs; it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not. My point is to update the decade-old Python code and make a few enhancements.


## Definitions

- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.

- A **token** is a unit of text. It can be a single character (as covered by Karpathy and Goldberg) or more generally it can be a word or a part of a word (as allowed in my implementation).

- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.


- An ***n*-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a model of **order** 4, because it maps from the 4 previous tokens to the next token.

- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain information from more than *n* tokens. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

- Current **large language models** such as ChatGPT, Claude, Llama, and Gemini use an even more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29).  Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).
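
As a minimal illustration of the definition of *P*(*t* | *h*) above, the following sketch estimates the distribution for one history by counting on a toy string. The names `toy_text` and `next_char_counts` are hypothetical and are not part of the notebook's code.

```python
from collections import Counter

toy_text = "the chain, the chair, the chain"   # Hypothetical toy corpus
history = "chai"

# Count each character that immediately follows an occurrence of the history
next_char_counts = Counter(
    toy_text[i + len(history)]
    for i in range(len(toy_text) - len(history))
    if toy_text.startswith(history, i))

# Normalize the counts to get an estimate of P(t | h)
total = sum(next_char_counts.values())
probabilities = {t: count / total for t, count in next_char_counts.items()}
print(probabilities)
```

In this toy corpus `'chai'` is followed by `'n'` twice and by `'r'` once, so the estimates are 2/3 and 1/3.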

## Training Data

A language model learns probabilities by observing a corpus of text that we call the **training data**. 

Both Karpathy and Goldberg use the works of Shakespeare (about 800,000 words) as their initial training data:

In [1]:
! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt
! wc   shakespeare_input.txt 
# Print the number of lines, words, and characters

  167204  832301 4573338 shakespeare_input.txt


In [2]:
! head -8 shakespeare_input.txt 
# First 8 lines

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?


## Python Code for n-Gram Model

I do some imports and then define two data types:
- A `Token` is an individual unit of text, a string of one or more characters.
- A `LanguageModel` is a subclass of `defaultdict` that maps a history of up to *n* tokens (concatenated into a single string) to a `Counter` of the number of times each token appears immediately following that history in the training data.

In [3]:
import random
from typing import List
from collections import defaultdict, Counter

Token = str # Datatype to represent a token

class LanguageModel(defaultdict): 
    """A mapping of {'history': Counter(next_token)}."""
    def __init__(self, n: int):
        self.order = n
        super().__init__(Counter)

I define two main functions that do essentially all the work:

- `train_LM` takes a sequence of tokens (the training data) and an integer *n*, and builds a language model of order *n*, formed by counting the times each token *t* occurs and storing that under the entry for the history *h* of *n* tokens that precede *t*. 
- `generate_tokens` generates a random sequence of tokens, given an (optional) start sequence of tokens. At each step it looks at the history of previously generated tokens and chooses a new token at random from the language model's counter for that history.

In [4]:
def train_LM(tokens, order: int) -> LanguageModel:
    """Create and train a language model of given order on the given tokens."""
    LM = LanguageModel(order)
    history = []
    for token in tokens:
        LM[cat(history)][token] += 1
        history = (history + [token])[-order:] 
    return LM

def generate_tokens(LM: LanguageModel, length=1000, start=()) -> List[Token]:
    """Generate a random text of `length` tokens, with an optional start, from `LM`."""
    tokens = list(start)
    while len(tokens) < length:
        history = cat(tokens[-LM.order:])
        tokens.append(random_sample(LM[history]))
    return tokens

Here are three auxiliary functions:
- `gen` is a convenience function to call `generate_tokens`, concatenate the resulting tokens, and print them.
- `random_sample` randomly chooses a single token from a Counter, with probability in proportion to its count.
- `cat` is a utility function to concatenate strings (tokens) into one big string.

In [5]:
def gen(LM: LanguageModel, length=1000, start=()) -> None:
    """Call generate_tokens and print the resulting tokens."""
    print(cat(generate_tokens(LM, length, start)))
    
def random_sample(counter: Counter) -> Token:
    """Randomly sample a token from the counter, proportional to each token's count."""
    return random.choices(list(counter), weights=list(counter.values()), k=1)[0]

cat = ''.join # Function to join strings together
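
To see what these functions build, here is a self-contained sketch that trains an order-2 character model on a toy string. It restates the counting loop of `train_LM` under the hypothetical name `train_toy`, so that the block runs on its own without the definitions above.

```python
from collections import defaultdict, Counter

def train_toy(tokens, order: int) -> dict:
    """Same counting scheme as train_LM, restated so this block runs alone."""
    model = defaultdict(Counter)
    history = []
    for token in tokens:
        model[''.join(history)][token] += 1      # Count token under its history
        history = (history + [token])[-order:]   # Keep only the last `order` tokens
    return model

toy = train_toy("abracadabra", order=2)
print(dict(toy))
```

The entries for the histories `''` and `'a'` exist only because the history is shorter than the order at the very start of the text; every other key is exactly 2 characters long.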

Let's train a character-level language model of order 4 on the Shakespeare data. We'll call the model `LM`. (Note that saying `tokens=data` means that the sequence of tokens is equal to the sequence of characters in `data`; in other words each character is a token.)

In [6]:
data = open("shakespeare_input.txt").read()

LM = train_LM(tokens=data, order=4)

Here are some examples of what's in the model, for various 4-character histories:

In [7]:
LM["chai"]

Counter({'n': 78, 'r': 35})

In [8]:
random_sample(LM["chai"])

'n'
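
A single sample does not say much, so here is a small sketch that checks that sampling is proportional to the counts. It uses the counts shown above for `"chai"` and the same `random.choices` call as `random_sample`; the seed is arbitrary and is there only to make the run repeatable.

```python
import random
from collections import Counter

counts = Counter({'n': 78, 'r': 35})   # The counts shown above for LM["chai"]

random.seed(0)  # Arbitrary seed, only to make the run repeatable
samples = random.choices(list(counts), weights=list(counts.values()), k=100_000)
observed = Counter(samples)

# Compare the observed frequency of 'n' with the expected 78 / (78 + 35)
print(observed['n'] / len(samples), 78 / (78 + 35))
```

The two printed numbers should agree to about two decimal places.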

In [9]:
LM["the "]

Counter({'p': 1360,
         's': 2058,
         'l': 1006,
         'o': 530,
         'g': 1037,
         'c': 1561,
         'a': 554,
         'C': 81,
         'r': 804,
         'h': 1029,
         'R': 45,
         'd': 1170,
         'w': 1759,
         'b': 1217,
         'm': 1392,
         'v': 388,
         't': 1109,
         'f': 1258,
         'i': 298,
         'n': 616,
         'V': 18,
         'e': 704,
         'u': 105,
         'L': 105,
         'y': 120,
         'A': 29,
         'H': 20,
         'k': 713,
         'M': 54,
         'T': 102,
         'j': 99,
         'q': 171,
         'K': 22,
         'D': 146,
         'P': 54,
         'S': 40,
         'G': 75,
         'I': 14,
         'B': 31,
         'W': 14,
         'E': 77,
         'F': 103,
         'O': 3,
         "'": 10,
         'z': 6,
         'J': 30,
         'N': 18,
         'Q': 7})

So `"chai"` is followed by either `'n'` or `'r'`, and almost any letter can follow `"the "`.

## Generating Shakespeare

We can generate a random text from the order 4 character model:

In [10]:
gen(LM)

First, hypocrity.

Messenge a bear she is, malice for the people's lion!

TALBOT:
Thou dwell and,
And liest felt malice, by Cleon, then, sir?

QUEEN:
Yes, we press might
It may purch:
A coward and Valenting kiss,
And perils house,
Till deeds as dost ther;
And I, to keep hour could nevership not yet, I this cause.

Solicy.

NYM:
To his not my lord
'her beggars never hear us
to do
Not sell with the now I furthen here that thou all.

KING RICHARD II:
Go, but boy lords this kins
Answer unkindness.'
Plant reason
Are mayst pleasure
What cousiness come inst the devils, and I say, his spurn by serves;
For and upon't her than lord.

QUEEN:
Then, nothing of he will leaving.

SIR TOBY BELCH:
Come of an even to portculling end in are a dog; and thither with grief, heart is much vicion! By their sick with
corrupt,
But my lords: but home
Are than and shot as the once inter way I
shall,
This fruit.

LONGAVILLES:
'Tis yet at hatility the eachelor our plead me, the cram with mattended grace and my yet 

Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. 

## Generating Order 7 Shakespeare

What if we increase the model to order 7? Or 10? The output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model.

In [11]:
gen(train_LM(data, order=7))

First Clown:
What are so envenom'd: therefore, good Ursula
Walk in thy opinion and hit
The cheater: but lechery eats in my person's sacred king.
Ay me! sad hours must fight no more.

OLIVIA:
Are you think you; you well:
A gallant ship, sir; you have not denies
The discourse our supposed
He that makes King Edward from our counsel in my close they will this young daughter is my birth;
But no man have I here resignation of them: while apart.
Stand in this I challenge much as to mince nought;
Watch'd you,
Your displeasures, and all the heart bleed at Plashy too;
My operant power unto Octavius' tent
How the hand
Of him that I were you venture of muttons, be your leave,
For which the anvil of him; you shall acquitted
Without book are adventurously bound together.

FORD:
Marry, we have a parish-top. What is the matters. This curse thought of revenge,
For governed from the proclaim it civil dissemblies to trust not colour will.

TITANIA:
My good king, from the very know, sir king of her lips,


## Generating Order 10 Shakespeare

In [12]:
gen(train_LM(data, order=10))

First Citizen:
We are blest in peace and honour than
Your gates against an alien
That by direct or by collateral hand
They find us touch'd, or carved to thee.

CARDINAL WOLSEY:
Your grace hath blessed and engaged to many Greeks,
Even in these cases, where in gore he lay insteep'd,
And take it.

PROTEUS:
My gracious king:
And I do wish
That your pains the hire;
If you do wrong you? alas, our places.

SATURNINUS:
Why, worthy Margaret, that is the mad mothers than snow,
And all the posterns
Clear them out.

All:
A heavy reckoning to make
Mine eyes too, examined my parts with death, goodman Dull.

DULL:
Which is not confess she does: there's another place
And find me well deliver:
Mark Antony, she pursed up
his heart think it cites us, brother, I go; I'll win them, fear it not;
Let not thy nature; let not every man's Hero:

CLAUDIO:
I know him; 'tis a mere scutcheon: and so
ends my catechism.

EARL OF DOUGLAS:
'Faith, that hath gull'd thee there.
But, room, fairy! here comes your father's 

## Aside: Probabilities and Smoothing

Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*t* | *h*) can be computed as follows:

In [13]:
def P(t, h, LM: LanguageModel) -> float: 
    """The probability that token t follows history h."""
    return LM[h][t] / sum(LM[h].values())
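
One caveat, sketched here under the assumption that the model behaves like a plain `defaultdict(Counter)`: looking up a history that never occurred in training creates an empty entry as a side effect, and the division then raises `ZeroDivisionError`. The function `safe_P` and the stand-in model `toy_LM` are hypothetical, and returning 0.0 for an unseen history is just one possible choice.

```python
from collections import defaultdict, Counter

def safe_P(t, h, LM) -> float:
    """Like P, but returns 0.0 for an unseen history and does not modify LM."""
    counter = LM.get(h)          # .get does not trigger defaultdict's default factory
    if not counter:
        return 0.0
    return counter[t] / sum(counter.values())

toy_LM = defaultdict(Counter)    # Hypothetical stand-in for a trained model
toy_LM['chai'].update({'n': 78, 'r': 35})

print(safe_P('n', 'chai', toy_LM))   # 78 / 113
print(safe_P('x', 'zzzz', toy_LM))   # Unseen history: 0.0, and toy_LM is unchanged
```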

In [14]:
P('s', 'the ', LM)

0.09286165508528112

In [15]:
P('n', 'chai', LM)

0.6902654867256637

In [16]:
P('r', 'chai', LM)

0.30973451327433627

In [17]:
P('s', 'chai', LM)

0.0

In [18]:
P(' ', 'chai', LM)

0.0

Shakespeare never wrote about "chaise longues" or "chai tea," so the probability of an `'s'` or `' '` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'` or `'chai '` to appear in a generated text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. In this notebook we stick to the basic unsmoothed model.
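
As a rough illustration of the idea, and not of the Kneser-Ney method linked above, here is a sketch of add-*k* (Laplace) smoothing, which is a much simpler technique. The function `P_smoothed`, the parameter `k`, and the explicit `vocabulary` argument are assumptions introduced for this example only.

```python
from collections import Counter

def P_smoothed(t, counter: Counter, vocabulary, k=1.0) -> float:
    """Add-k estimate: pretend every vocabulary token was seen k extra times."""
    return (counter[t] + k) / (sum(counter.values()) + k * len(vocabulary))

chai = Counter({'n': 78, 'r': 35})                # Counts shown earlier for LM["chai"]
vocabulary = set('abcdefghijklmnopqrstuvwxyz ')   # Hypothetical 27-token vocabulary

print(P_smoothed('n', chai, vocabulary))   # Lower than the unsmoothed 78 / 113
print(P_smoothed('s', chai, vocabulary))   # Small but non-zero
```

The smoothed probabilities still sum to 1 over the vocabulary; the mass given to unseen tokens is taken from the seen ones, and a smaller `k` takes less.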

## Aside: Starting Text

One thing you may have noticed: all the generated passages start the same way. Why is that? The training data happens to start with the line "First Citizen:", and `generate_tokens` starts with an empty history. The only thing that follows the empty history in the training data is the letter "F", the only thing that follows "F" is "i", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.
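
Here is a minimal sketch of that idea for character models, assuming that sections are separated by blank lines (as the speeches appear to be in the first lines of the Shakespeare file). The function `random_start` and its `separator` parameter are hypothetical.

```python
import random

def random_start(text: str, order: int, separator: str = "\n\n") -> str:
    """Pick the first `order` characters of a randomly chosen section of text."""
    sections = [s for s in text.split(separator) if len(s) >= order]
    return random.choice(sections)[:order]

# Hypothetical toy text with three sections separated by blank lines
toy_text = "First Citizen:\nSpeak.\n\nAll:\nSpeak, speak.\n\nSecond Citizen:\nOne word."
print(repr(random_start(toy_text, order=4)))
```

With the notebook's definitions in scope, a call such as `gen(LM, start=random_start(data, LM.order))` would then begin at a randomly chosen section rather than always at "First Citizen:".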

We can give a starting text to `generate_tokens` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:

In [19]:
gen(LM, start='ROMEO')

ROMEO:
What swifter of so much a doubled,
And I may, the dare, by the lives done own rich, that you naughter-out appearancess of your turn us,
Whose heave born blow be prology.
Keeper: proud.

BALTHAZAR:
Let so shall a picture we mercy,
And the king this bondage an e'er that I storator:

FERDINAL CAMPEIUS:
First Servant:
I this on your end
Louder with methink I do bid it: if heavens:
shall this vest; and run, my troops then, good and so forty wife.
Invited.

MACBETH:
Avaunt,
And Here is days.

ORLEANS:
And friend;
Dismissing, she detain the time thou scourselves itself:
but with the gent the half abused!

QUEEN:
Their daughteous most in my lost fight
The prey to say you give himselves Lord's the comes bold your look the lady, look you, Fabian:
A pass!

EMILIA:
Is heater is they my him; and chard for my lord, ripe to Pompey, I and let us have your woo?

Think upon: indeed? Do noison.
'Tis most in a suddenly out, sir; you are
A cist
For noble Antent our
daughter disclose him.

CELIA:
Dou

# Linux Kernel C++ Code

Goldberg's point is that the simple character-level *n*-gram model performs about as well as the more complex RNN model on Shakespearean text. 

But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data.

In [20]:
! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt
! wc   linux_input.txt

  241465  759639 6206997 linux_input.txt


In [21]:
linux = open("linux_input.txt").read()

## Generating Order 10 C++

We'll start with an order-10 character model, and compare that to an order-20 model. We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment.

In [22]:
gen(train_LM(linux, order=10), length=3000)

/*
 * linux/kernel.h>
#include <linux/kgdb.h>
#include <linux/ftrace_event.h.
 *
 *       recursive.  An event enabled cpu */
		detach_timer(struct sched_domain_span(sd);
	struct ftrace_func(tk, ri, regs);

	if (!chip->irq_unmask	= noop,
	.irq_enable)
		clear_frozen = false;

	if (time_before(end_time, timer_t, timer_id, &flag);
	if (err)
		return;

	tracing_resize_ring_buffer_record_disabled)
		return;
	if (desc) {
		raw_spin_lock_irqsave(&tracepoint_str + (*pos - 1);
	i++;

find_first_elem;
	}

	/* sort them */
	array_desc
#define CAP_PI		(void *)rec->ip);
			/* Ftrace is shutting down system");
	shutdown_task != NULL)
		event->rcu_pending()) {
		err = -EINVAL;
}

/* Special cases that /proc allows
	 * for built-in exception is the concept of
a "sequence count */
		iter->trace is a kernel
 * and userspace address.
		 * Either we have now waited for the first call.
 * Return: %false if it was a timeout or signal will be freed in case of UMH_NO_WAIT)	/* task has the given prefix.
 *	0 

## Order 20 C++

In [23]:
gen(train_LM(linux, order=20), length=3000)

/*
 * linux/kernel/irq/proc.c
 *
 * Copyright (C) 2009 Jason Baron <jbaron@redhat.com>
 * Copyright (C) 2009 Jason Baron <jbaron@redhat.com>
 * Copyright (C) 2006 Rafael J. Wysocki <rjw@sisk.pl>
 *
 * This file is released under the GPLv2.
 */

#include <linux/init.h>
#include <linux/export.h>
#include <linux/ktime.h>

#include <asm/uaccess.h>
#include <linux/fs.h>

#include "trace_probe.h"

#define KPROBE_EVENT_SYSTEM);
	if (WARN_ON_ONCE(ret)) {
		pr_warn("error enabling all events\n");
		return;
	}

	cancel_delayed_work_sync(&req->work);

	trace_pm_qos_update_request(req->pm_qos_class,
					    new_value, timeout_us);
	if (new_value != req->node.prio)
		pm_qos_update_target(struct pm_qos_constraints network_tput_constraints,
	.name = "memory_bandwidth",
};


static struct pm_qos_object memory_bandwidth_pm_qos,
};

static ssize_t
tracing_write_stub(struct file *filp, const char __user *ubuf, size_t cnt,
		    loff_t *ppos)
{
	int ret = -ENODEV;

	mutex_lock(&trace_types_lock);

	tr->c

## Analysis of Generated Linux Text

As Goldberg says, "Order 10 is pretty much junk." But order 20 is much better. Most of the comments have a start and an end; most of the open parentheses are balanced with a close parenthesis; but the braces are not as well balanced. That shouldn't be surprising. If the span of an open/close parenthesis pair is less than 20 characters then it can be represented within the model, but if the span of an open/close brace pair is more than 20 characters, then it cannot be represented by the model. Goldberg notes that Karpathy's RNN seems to have learned to devote some of its long short-term memory (LSTM) to representing nesting level, as well as things like whether we are currently within a string or a comment. It is indeed impressive, as Karpathy says, that the model learned to do this on its own, without any input from the human engineer.
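
The span argument can be made concrete with a small sketch that measures how many characters each matched pair covers. The function `pair_spans` and the toy `snippet` are hypothetical, and the matching is deliberately naive: it uses a simple stack and ignores the fact that delimiters can appear inside strings and comments.

```python
def pair_spans(code: str, open_ch: str, close_ch: str) -> list:
    """Lengths (in characters) of each matched open/close pair, found with a stack."""
    stack, spans = [], []
    for i, ch in enumerate(code):
        if ch == open_ch:
            stack.append(i)                    # Remember where this pair opened
        elif ch == close_ch and stack:
            spans.append(i - stack.pop() + 1)  # Span includes both delimiters
    return spans

# Hypothetical toy snippet, not taken from the training data
snippet = "if (err) {\n\treturn;\n}\nfoo(a, b);\nwhile (x) {\n\tx = next(x);\n}\n"
parens = pair_spans(snippet, '(', ')')
braces = pair_spans(snippet, '{', '}')
print(max(parens), max(braces))
```

Even in this tiny snippet the brace pairs span more characters than the parenthesis pairs. Applying the same function to `linux` and comparing the spans against the model's order would be one way to check the explanation above.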

## Token Models versus Character Models

Karpathy and Goldberg both used character models, because the exact formatting of characters (especially indentation and line breaks) is important in the format of plays and C++ programs. But if you are interested in generating paragraphs of text that don't have any specific format, it is common to use a **word** model, which represents the probability of the next word given the previous words, or a **token** model in which tokens can be words, punctuation, or parts of words. For example, the text `"Spiderman!"` might be broken up into the three tokens `"Spider"`, `"man"`, and `"!"`. 

One simple way of tokenizing a text is to break it up into alternating strings of word and non-word characters; the function `tokenize` does that by default:

In [24]:
import re

word_or_nonword = r'\w+|\W+' # Regular expression to parse a string of either word or non-word characters.

def tokenize(text: str, regex=word_or_nonword) -> List[Token]: 
    """Break text up into tokens using regex."""
    return re.findall(regex, text)

In [25]:
assert tokenize('Soft! who comes here?') == [
    'Soft', '! ', 'who', ' ', 'comes', ' ', 'here', '?']

assert tokenize('wherefore art thou ') == [
    'wherefore', ' ', 'art', ' ', 'thou', ' ']
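
Because `tokenize` takes the regular expression as a parameter, other tokenization schemes can be tried. The sketch below uses a hypothetical pattern, `word_punct_space`, that makes each punctuation character its own token. It is one possible alternative, not the scheme used in the rest of this notebook, and `tokenize_with` is a restatement of `tokenize` so that the block runs on its own.

```python
import re
from typing import List

# Hypothetical pattern: a word, a single punctuation mark, or a run of whitespace
word_punct_space = r'\w+|[^\w\s]|\s+'

def tokenize_with(text: str, regex: str) -> List[str]:
    """Break text up into tokens using regex (restated so this block runs alone)."""
    return re.findall(regex, text)

text = 'Soft! who comes here?'
tokens = tokenize_with(text, word_punct_space)
print(tokens)
```

Every character belongs to exactly one token, so concatenating the tokens reproduces the original text. That property matters here because generated tokens are joined back together with no separator.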

We can train a token model on the Shakespeare data. A model of order 6 keeps a history of up to three word and three non-word tokens. 

In [26]:
TLM = train_LM(tokenize(data), order=6)

In [27]:
TLM['wherefore art thou ']

Counter({'Romeo': 1})

In [28]:
TLM['not in our ']

Counter({'stars': 1, 'Grecian': 1})

In [29]:
TLM['end of my ']

Counter({'life': 1, 'business': 1, 'dinner': 1, 'time': 1})

In [30]:
TLM[' end of my']

Counter({' ': 2})

We see below that the quality of the token models is similar to that of the character models, and improves as the order increases from 6 to 8:

In [31]:
gen(TLM)

First Citizen:
Before we proceed any further, hear me speak,
Before you answer Warwick. His demand
Springs not from Edward's well-meant honest love,
But from deceit bred by necessity;
For how can I grace my talk,
Wanting a hand to hold a sceptre up
And with the clamour of thy drum,
And even at hand a drum is ready braced
That shall reverberate all as loud as Mars. By Jupiter,
Were I the Moor, I would not be awaked.

LORENZO:
That is the very note of it: and it is known she is, these moral laws
Of nature and of nations, 'long
To him and his virtue;
By her election may be truly read
What kind of man,
So keen and greedy to confound a man:
He plies the duke at dinner: by two o'clock I'll get me such a question: stand again:
Think'st thou I am an old man's life is done:
Then, dear my liege, mine honour let me try;
In that I live and for that will I cause these of Cyprus to
mutiny; whose qualification shall come into no true
taste again but by the recorder.
Then he was urged to tell my tale 

In [32]:
gen(train_LM(tokenize(data), 8))

First Citizen:
Before we proceed any further, hear me speak.

All:
Peace, ho! Hear Antony. Most noble Antony!

ANTONY:
Why, friends, you go to do you know not what
you speak. But, if ever the duke return, as our
prayers are he may, let me desire you to make your
answer before him. If it be honest you have spoke,
you have courage to maintain it: I am bound to wonder, I am bound
To Persia, and want guilders for my voyage:
Therefore make present satisfaction,
Or I'll attach you by this officer.

ANTIPHOLUS OF EPHESUS:
I will debate this matter at more leisure
And teach your ears to list me with more heed.
To Adriana, villain, hie thee straight:
Give her this key, and tell her, in the desk
That's cover'd o'er with Turkish tapestry,
There is a purse of ducats; let her send it:
Tell her I am arrested in the street
And that shall bail me; hie thee, slave, be gone!
On, officer, to prison till it come.

DROMIO OF SYRACUSE:
To Adriana! that is where we dined,
Where Dowsabel did claim me for her 

## C++ Token Model

Similar remarks hold for token models trained on C++ data:

In [33]:
gen(train_LM(tokenize(linux), 8), length=3000)

/*
 * linux/kernel/irq/autoprobe.c
 *
 * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar
 *
 * This file contains the IRQ-resend code
 *
 * If the interrupt is waiting to be processed, we try to re-run it.
 * We can't directly run it from here since the caller might be in an
 * interrupt-protected region. Not all irq controller chips can
 * retrigger interrupts at the hardware level, so in those cases
 * we allow the resending of IRQs via a tasklet.
 */

#include <linux/irq.h>
#include <linux/random.h>
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/tick.h>
#include <linux/seq_file.h>
#include <asm/uaccess.h>
#include <asm/bitsperlong.h>

#include "trace.h"

static DEFINE_PER_CPU(int, bpf_prog_active);

/**
 * trace_call_bpf - invoke BPF program
 * @prog: BPF program
 * @ctx: opaque context pointer
 *
 * kprobe handlers execute BPF programs via this helper.
 * Can be used from static tracepoints in the future.
 *
 * Return: BPF pr