# The Unreasonable Effectiveness of Character-level Language Models
# (and why RNNs are still cool)

## By [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo) (2015)

#### (with minor changes by Peter Norvig (2022) for modern Python 3)

<hr>

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go [**read it now**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character-level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous *n* letters. For example, having seen `ello`, the next character is likely to be either a comma or space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call *n*, the number of letters of history that the guess is based on, the _order_ of the language model.

RNNs and LSTMs can potentially learn an infinite-order language model (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing *n* letters, and need to guess the *n+1*th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function *P(c* | *h)*. Here, *c* is a character, *h* is an *n*-letter history, and *P(c* | *h)* stands for how likely it is to see *c* after we've seen *h*.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter *c* appeared after *h*, and divide by the total number of letters appearing after *h*. The **unsmoothed** part means that if we did not see a given letter following *h*, we will just give it a probability of zero.
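
As a tiny worked illustration of count-and-divide, here is a sketch on a made-up string. The toy text and the helper name `mle_estimates` are assumptions for illustration only; they are not part of the training code that follows.

```python
from collections import Counter, defaultdict

def mle_estimates(text, order):
    """Count each character that follows each `order`-letter history,
    then divide by the total count for that history."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return {h: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for h, ctr in counts.items()}

toy = "hello mellow hello"          # hypothetical toy corpus

order4 = mle_estimates(toy, order=4)
print(order4["ello"])               # 'ello' was followed once by ' ' and once by 'w'
print(order4["ello"].get("x", 0.0)) # unsmoothed: an unseen continuation gets probability zero

order5 = mle_estimates(toy, order=5)
print(order5["hello"], order5["mello"])  # one more letter of history removes the ambiguity
```

With 4 letters of history the toy model splits its guess evenly between space and `w`; with 5 letters each history has a single continuation, which mirrors the "hello" versus "mellow" example above.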

And that's all there is to it.


## Training Code

Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with `order` leading characters so that we also learn how to start.


In [1]:
import random
from collections import Counter, defaultdict
from typing import List, Tuple

class LanguageModel(defaultdict):
    """A mapping from `order` history characters to a list of ('c', probability) pairs,
    e.g., for order=4, {'spea': [('k', 0.99), ('r', 0.01)]}."""
    def __init__(self, order): self.order = order

def train_char_lm(fname, order=4) -> LanguageModel:
    """Train an `order`-gram character-level language model on all the text in `fname`."""
    lm = LanguageModel(order)
    data = (PAD * order) + open(fname).read()
    # First read data into Counters of characters; then normalize
    lm.default_factory = Counter 
    for i in range(order, len(data)):
        history, char = data[i - order:i], data[i]
        lm[history][char] += 1
    for history in lm:
        lm[history] = normalize(lm[history])
    return lm

def normalize(counter) -> List[Tuple[str, float]]:
    """Return (key, val) pairs, normalized so values sum to 1.0, largest first."""
    total = float(sum(counter.values()))
    return [(k, v / total) for k, v in counter.most_common()]

PAD = '`' # Character to pad the beginning of a text

Let's train it on Andrej's Shakespeare text:

In [2]:
! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt
! wc shakespeare_input.txt

  167204  832301 4573338 shakespeare_input.txt


In [3]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [4]:
lm['ello']

[('w', 0.817717206132879),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 (',', 0.027257240204429302),
 ("'", 0.017035775127768313),
 (' ', 0.013628620102214651),
 ('.', 0.0068143100511073255),
 ('?', 0.0068143100511073255),
 ('!', 0.0068143100511073255),
 (':', 0.005110732538330494),
 ('n', 0.0017035775127768314)]

In [5]:
lm['Firs']

[('t', 1.0)]

In [6]:
lm['rst ']

[('S', 0.16292134831460675),
 ('L', 0.10674157303370786),
 ('C', 0.09550561797752809),
 ('G', 0.0898876404494382),
 ('M', 0.0593900481540931),
 ('t', 0.05377207062600321),
 ('W', 0.033707865168539325),
 ('s', 0.03290529695024077),
 ('o', 0.030497592295345103),
 ('b', 0.024879614767255216),
 ('w', 0.024077046548956663),
 ('a', 0.02247191011235955),
 ('m', 0.02247191011235955),
 ('n', 0.020064205457463884),
 ('h', 0.019261637239165328),
 ('O', 0.018459069020866775),
 ('i', 0.016853932584269662),
 ('d', 0.015248796147672551),
 ('P', 0.014446227929373997),
 ('c', 0.012841091492776886),
 ('F', 0.012038523274478331),
 ('f', 0.011235955056179775),
 ('g', 0.011235955056179775),
 ('l', 0.01043338683788122),
 ('I', 0.009630818619582664),
 ('B', 0.009630818619582664),
 ('p', 0.00882825040128411),
 ('K', 0.008025682182985553),
 ('r', 0.0072231139646869984),
 ('A', 0.0056179775280898875),
 ('H', 0.0040128410914927765),
 ('k', 0.0040128410914927765),
 ('e', 0.0032102728731942215),
 ('T', 0.003210272

So `"ello"` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `"Firs"` is pretty much deterministic, and the word following `"rst "` can start with pretty much every letter.

## Generating from the model

Generating is also very simple. To generate a letter, we will take the history, look at the last *order* characters, and then sample a random letter based on the corresponding distribution:

In [7]:
def generate_character(lm, history) -> str:
    """Given a history of characters, sample a random next character from `lm`."""
    p = random.random()
    for c, v in lm[history]:
        if p <= v: 
            return c
        p -= v
    return c  # Guard against floating-point round-off: if no character was chosen, return the last one
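
The same sampling step can also be delegated to the standard library. The sketch below assumes the model maps each history to a list of (character, probability) pairs, as above; the name `sample_character` and the toy dictionary `toy_lm` (with made-up probabilities) are hypothetical and used only for illustration.

```python
import random

def sample_character(lm, history):
    """Sample the next character for `history`, using the stored
    probabilities as weights for random.choices."""
    chars, probs = zip(*lm[history])
    return random.choices(chars, weights=probs, k=1)[0]

# Hypothetical toy model with made-up probabilities.
toy_lm = {'ello': [('w', 0.8), (' ', 0.15), (',', 0.05)]}

random.seed(0)  # fixed seed so repeated runs draw the same samples
samples = [sample_character(toy_lm, 'ello') for _ in range(1000)]
print(samples.count('w') / len(samples))  # should land near 0.8
```

`random.choices` normalizes the weights itself, so the probabilities need not sum to exactly 1.0.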

To generate a passage of *k* characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [8]:
def generate_text(lm, length=1000) -> str:
    """Sample a random `length`-long passage from `lm`."""
    history = PAD * lm.order
    out = []
    for i in range(length):
        c = generate_character(lm, history)
        history = history[1:] + c
        out.append(c)
    return ''.join(out)

## Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

## order 2:

In [9]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm))

Fiestis the so, ing'd? was hathy now by ollood re hichaved.

MANCE:
So oul and carried bear wilese com now ifeck.

Vere ame
A Cad prover, toods
Whiseto wousave.

But to prath troureak do thque.

SUS:
Up to light sonste cat ing ing.

EXASTHASTARVENRAY:
In Clee rah, th sher gollooke herponefted? hin;
Yout the cat this nothaved; tord belf.
Gody hat threit es faw'd you wer:
If ailt how at be it:
A to miden lanign day himemalo; withat thy shat ans, th al at's th for tent! EDGAN:
As livill your frows;
Ay grour Romer, ant cou, dign,
Behould a fieved theek
I wood by ne untle, blove;
O' womit hat I lass an.

Tword nothee, plasou? To fer'd thisichou some me,
Mill rol.

Ha! I deast ind the many ouldou not: seep amend at's of to deat hent: whis dick.

KINERMISATEUS Let a reempat ung.
Mus he maltie. WAR Con ound our wilivem I;
As rom de she a weat pey prews say arn Cithe ming paseas duch and such chat your you go my Duke ens;
EDWARUTUS:
I lontassues is se day Ladand dre lin wass re becy, ton
Theas 

## order 4 

Order 2 was not so great... but what if we increase the order to 4?



In [10]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm))

First Outland? Look upon it is his better offence, which I at out:
Sir John.

KING PHILIP:
Why yawn companiel?

CAMILLO:

COSTARD:
No, safe to my utter?

APEMANTUS:
Nor I worship, give thee: I'll draws eart sociable, noble greaten this lease are could hide use meet, it is trued Brutus me against would burn some to him the grace no more thy play.

VIOLA:
You affection,
Therein men
such on from that I my grind homel, call, were beseech you evenger.

PISTOL:
Let us should blow
Will be to remembrancis!

MACBETH:

MARGARET:
The to unworthwith treath,
Drop of tune's my bloody. What will me thy Fame is attaining her bed, and sir; he us all give some young Rome,
Hostess Shortly come any rascal, o'er them off this Englands this be an assius, aris. Pray you saw them she day opinion for at is my he dange the fight, sir, her, and so long bound thee.

DESDEMONA:
Can I have foot the violets! where's amore you these
him in hold fairs speak no pays and father?
'Tis burn after the air dought amissive,




## order 7

Order 4 is already quite reasonable, and reads like English. Just 4 letters of history! What if we increase it to 7?

In [11]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm))

First Citizen:
Ay, but I therefore let him from shore, to Mercury, set: then, some comfort.

COUNTESS:
Tell her, indeed, too, good show
Can heavens, how I came hither art the next.

BARDOLPH:
By these men.
But direction: and, for secrecy:
We shall I be,
That so with your uncle to couples with her beauty is bound as thought! an 'twere the occasion this to-morrow night;
Unless your mouths, if the realm.
I never enter'd Pucelle jointure.

SIMPCOX:
God keep from whom I, indeed too, 'mong other both at our fair pillow for a
man against the prince, there!

MARIA:
You shall deliver us from their own grace, pardon you: yet Count Comfect; an the year groan at it.

SIR ANDREW:
'Slight, than you can be!
Through thou scurvy railing, may surrender; so must be, love, kill myself, thou wert born i' the people,
You are
going to my father's show thyself?

MONTAGUE:
O, when nobles should you in.

TRINCULO:
Excellent this sleep,
And say 'God save you to-morrow morning, but what you fear?  myself in such 

## How about order 10?

In [12]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm))

First Citizen:
O royal Caesar!

First Tribune:
We will make amends now: get you gone.

MARTIUS:
What else, fellow? Pray you, let me go.
But what, is he arrested? Tell me at whose burthen
The anger'd any heart alive
To hear meekly, sir, they shall know it, would much better used
On Navarre and his brethren come?

BUCKINGHAM:
Give me any gage of this seeming.

HORATIO:
Ay, my good lord: our time too brief: I will thither: gracious offers from the humble-bees,
And let another general, thou shouldst not bear my standard of the wheat must needs
Appear unkinglike.

CAIUS LUCIUS:
I have, my lord,
I should not; for he this
very day receive it friendly; but from this time.

VALENTINE:
How use doth breed a habit in a man!
This shadow
Doth limp behind that doth warrant.
Hark, how our steeds for present business, nor my power
To o'erthrown Antony,
And very weak and melancholy upon your stubborn ancient skill to fear and cold hand of death hath snatch'd that it us befitted
To bear themselves made, 

## This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (about a word and a half of history) or 10 (about two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

## So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous *n* letters, for a rather small *n*, and that's it. 

However, Karpathy's C code generation example (trained on the Linux kernel source) is very impressive. Why? Because of the context awareness. Note that in all of Karpathy's posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous *n* characters. 

If Karpathy's examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

## Linux Kernel C Code

Just for the fun of it, let's see what our simple language model does with the Linux-kernel code:

In [13]:
! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt
! wc linux_input.txt

  241465  759639 6206997 linux_input.txt


In [14]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm))

/*
 * linux/kernel/printk.c
 *
 *  Copyright (C) 1992, 1998-2004 Linus Torvalds, Ingo Molnar <mingo@redhat.com>
 * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
 *
 * This overall must be zero */
	txc->ppsfreq, &utp->freq) ||
			__get_user(handler, &act->sa_handler;
		next_event.tv64 != KTIME_MAX;
	next_event = parent_freezer(freezer))) {
		if (!access_ok(VERIFY_READ, u_event, sizeof(*src);

	/* Convert (if necessary to check that the target CPU.
 */
void gcov_info *info)
{
	return (copied == sizeof(debug_alloc_header *)
				(debug_alloc_header {
	char reserved fields\n");
		return;

	perf_output_begin(&handle, &snapshot_data *data)
{
	struct mcs_spinlock */
void gcov_info *get_accumulated_info(node, info);
	if (i > len)
		cnt = TRACE_SIGNAL_DELIVERED;
out:
	tracing_stop_tr(tr);

	__trace_function_single(int cpu);

extern void kdb_dumpregs(regs);
		dbg_activate_work(work);

	/*
	 * Can't set/change the"
					   "1", enable);
extern struct ftrace_graph_entry_leaf(struc

In [15]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm))

/*
 * linux/kernel/time/tick-broadcast-hrtimer.c
 * This file emulates a local clock event device cannot go away as
	 * long as we hold
	 * lock->wait_lock held.
 */
void __ptrace_unlink - unlink/remove profiling data set with an existing node. Needs to be called with lock->wait_lock);
	 *					acquire(lock);
	 * or:
	 *
	 * unlock(wait_lock);
	 *					acquire(lock);
	 */
	return rc;
}

static int get_clock_desc(id, &cd);
	if (err)
		return err;

	if (cd.clk->ops.clock_adjtime(clockid_t id, struct timex __user *) &txc);
	set_fs(oldfs);
	if (!err && compat_put_timespec(&out, rmtp))
		return -EFAULT;
	}
	force_successful_syscall_return();
	return compat_jiffies_to_clock_t);

u64 nsec_to_clock_t(tsk->delays->blkio_start = ktime_get();
	if (!ret) {
		printk(KERN_CONT ".. corrupted trace buffer .. ");
	return -1;
}

/* Will lock the rq it finds */
static struct cgroup_subsys *ss;
	char *tok;
	int ssid, ret;

	/* Do not accept '\n' to prevent making /proc/<pid>/cgroup.
 */
int zap_other_thread

In [16]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm))

/*
 * linux/kernel/irq/handle.c
 *
 * Copyright (C) 2003-2004 Amit S. Kale <amitkale@linsyssoft.com>
 * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
 *
 */
#include <linux/irq.h>
#include <linux/notifier.h>
#include <linux/init.h>
#include <linux/vmalloc.h>
#include <asm/sections.h>

#include <crypto/hash.h>
#include <keys/asymmetric-type.h>
#include <keys/system_keyring.h>
#include "module-internal.h"

struct key *system_trusted_keyring, 1),
					   "asymmetric",
					   NULL,
					   p,
					   plen,
					   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
			      KEY_USR_VIEW | KEY_USR_READ),
					   KEY_ALLOC_NOT_IN_QUOTA |
					   KEY_ALLOC_TRUSTED);
		if (IS_ERR(key)) {
		switch (PTR_ERR(key)) {
			/* Hide some search errors */
		case -EACCES:
		case -ENOTDIR:
		case -EAGAIN:
			return ERR_PTR(-EACCES);

		/*
		 * We could be clever and allow to attach a event to an
		 * offline CPU and activate it when the CPU comes up, but
		 * that's for later.
		 */
		if (!cpu_online(cpu))
		c

In [17]:
print(generate_text(lm))

/*
 * linux/kernel/itimer.c
 *
 * Copyright (C) 2010		SUSE Linux Products GmbH
 * Copyright (C) 2002 2003 by MontaVista Software.
 *
 * 2004-06-01  Fix CLOCK_REALTIME clock/timer TIMER_ABSTIME bug.
 *			     Copyright (C) 2004-2006 Ingo Molnar
 *  Copyright (C) 2004 Nadia Yvette Chambers, IBM
 * (C) 2004 Nadia Yvette Chambers
 */
#include <linux/cred.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/mutex.h>
#include <linux/securebits.h>
#include <linux/clocksource.h>
#include <linux/timecounter.h>

void timecounter_init(struct timecounter *tc)
{
	cycle_t cycle_now, cycle_delta;

	sleeptime_injected = true;
	} else if (timespec64_compare(&ts_new, &timekeeping_suspend_time) > 0) {
		ts_delta = timespec64_sub(tk_xtime(tk), timekeeping_suspend_time, delta_delta);
		}
	}

	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

	if (action & TK_MIRROR)
		memcpy(&shadow_timekeeper, &tk_core.timekeeper;
	struct clocksource *clock = tk->tkr_mono.clock;
	tk->tkr_mono.read = 

In [18]:
print(generate_text(lm, length=5000))

/*
 * linux/kernel/irq/handle.c
 *
 * Copyright (C) 2000-2001 VERITAS Software Corporation.
 * Copyright (C) 2011 Peter Zijlstra <pzijlstr@redhat.com>
 *  Copyright (C) 2004-2006 Tom Rini <trini@kernel.crashing.org>
 * Copyright (C) 2005-2006, Thomas Gleixner, Russell King
 *
 * This file contains the /proc/irq/ handling code.
 */

#include <linux/percpu.h>
#include <linux/cpuset.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/file.h>
#include <linux/export.h>
#include <linux/signal.h>
#include <linux/debug_locks.h>
#include <linux/seq_file.h>
#include <linux/uaccess.h>

#include <trace/events/timer.h>

/*
 * Per cpu nohz control structure
 */
static DEFINE_PER_CPU(struct task_struct *p;
	int retval;

	rcu_read_lock();
	for_each_domain(cpu, sd)
		domain_num++;
	entry = table = sd_alloc_ctl_entry(domain_num + 1);
	if (table == NULL)
		return NULL;

	/*
	 * We repeat when a time extend is encountered or we hit
	 * the end of the page to save the
		 * missed events, 

## Analysis

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the `[sic]`
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex. 
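
To make the "have I seen ( but not )" idea concrete, here is a rough sketch of what such an enriched conditioning history might look like. Everything in it is an assumption for illustration: the names `paren_depths` and `train_with_paren_flag`, the choice to track only round parentheses, and the toy input. It is not the model used above, and it naively counts parentheses inside strings and comments too.

```python
from collections import Counter, defaultdict

def paren_depths(text):
    """depths[i] is the number of '(' not yet closed in text[:i]."""
    depths, depth = [], 0
    for ch in text:
        depths.append(depth)
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)
    return depths

def train_with_paren_flag(text, order):
    """Count next characters, conditioning on the pair
    (last `order` characters, is there an unclosed '(' ?)."""
    depths = paren_depths(text)
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        key = (text[i - order:i], depths[i] > 0)
        counts[key][text[i]] += 1
    return counts

toy = "f(x); y = x;"                  # hypothetical toy input
counts = train_with_paren_flag(toy, order=1)
print(counts[('x', True)])            # inside an open paren, 'x' was followed by ')'
print(counts[('x', False)])           # outside, 'x' was followed by ';'
```

A plain order-1 model would see `x` followed by `)` once and `;` once and split its guess evenly; the extra flag separates the two cases. Tracking full nesting depth, braces, and indentation in the same way would multiply the number of distinct histories, which is the added complexity mentioned above.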

Karpathy's LSTM, on the other hand, seems to have just learned it on its own. And that's impressive.

