# The Unreasonable Effectiveness of Character-level Language Models
## (and why RNNs are still cool)

## By [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo) (2015)
#### with minor changes for Python 3 by Peter Norvig (2022)

<hr>

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go read it now, then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character-level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous *n* letters. For example, having seen `ello`, the next character is likely to be either a comma or a space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would be much easier).

We will call *n*, the number of letters of history on which we base our guess, the _order_ of the language model.

RNNs and LSTMs can potentially learn an infinite-order language model (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing *n* letters, and need to guess the *n+1*th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function *P(c* | *h)*. Here, *c* is a character, *h* is an *n*-letter history, and *P(c* | *h)* stands for how likely it is to see *c* after we've seen *h*.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter *c* appeared after *h*, and divide by the total number of letters appearing after *h*. The **unsmoothed** part means that if we did not see a given letter following *h*, we will just give it a probability of zero.

And that's all there is to it.
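
As a minimal sketch of counting and dividing, here is the estimate computed on a toy string with a two-letter history. The helper name `mle_estimates` and the toy text are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def mle_estimates(text, n):
    """Count-and-divide estimates of P(c | h) for every n-letter history h in `text`."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        h, c = text[i:i + n], text[i + n]
        counts[h][c] += 1
    # Divide each count by the total number of letters seen after the same history.
    return {h: {c: k / sum(ctr.values()) for c, k in ctr.items()}
            for h, ctr in counts.items()}

toy = "hello help hello"   # hypothetical toy corpus
probs = mle_estimates(toy, 2)

# "el" is followed by 'l' twice and by 'p' once, so the estimates are 2/3 and 1/3.
print(probs["el"])
# "he" is always followed by 'l', so that history is deterministic.
print(probs["he"])
# Unsmoothed: a letter never seen after "el" simply has probability zero.
print(probs["el"].get("x", 0.0))
```

Note that nothing here depends on the text being English; the estimates are only a table of relative frequencies.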


## Training Code

Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with `order` leading characters so that we also learn how to start.


In [1]:
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

LM = Dict[str, List[Tuple[str, float]]] # Language Model: lm['Firs'] == [('t', 1.0)]

def train_char_lm(fname, order=4) -> LM:
    """Train an `order`-gram character-level language model on all the text in `fname`."""
    lm = defaultdict(Counter)
    with open(fname) as f:  # Close the file once the text has been read
        data = (PAD * order) + f.read()
    for i in range(len(data) - order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    return {history: normalize(counter) for history, counter in lm.items()}

def normalize(counter) -> List[Tuple[str, float]]:
    """Return (key, val) pairs, normalized so values sum to 1.0, largest first."""
    total = float(sum(counter.values()))
    return [(k, v / total) for k, v in counter.most_common()]

PAD = '`' # Character to pad the beginning of a text
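
To see what the padding buys us, the following sketch lists the (history, next character) pairs that a padded pass over a short string would count. The names `PAD_CHAR` and `training_pairs` are hypothetical and exist only for this illustration; the slicing mirrors the loop in the training code.

```python
PAD_CHAR = '`'  # assumed to match the PAD character of the training code

def training_pairs(text, order):
    """List the (history, next_char) pairs that a padded training pass would count."""
    data = PAD_CHAR * order + text
    return [(data[i:i + order], data[i + order]) for i in range(len(data) - order)]

# With order 2, the first pair has an all-padding history, so the model
# also learns which character tends to start a text.
for history, char in training_pairs("First", 2):
    print(repr(history), '->', repr(char))
```

Without the padding, the first `order` characters of the text would never appear as predictions, and generation would have no history to start from.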

Let's train it on Andrej's Shakespeare text:

In [2]:
! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt
! wc shakespeare_input.txt

  167204  832301 4573338 shakespeare_input.txt


In [3]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [4]:
lm['ello']

[('w', 0.817717206132879),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 (',', 0.027257240204429302),
 ("'", 0.017035775127768313),
 (' ', 0.013628620102214651),
 ('.', 0.0068143100511073255),
 ('?', 0.0068143100511073255),
 ('!', 0.0068143100511073255),
 (':', 0.005110732538330494),
 ('n', 0.0017035775127768314)]

In [5]:
lm['Firs']

[('t', 1.0)]

In [6]:
lm['rst ']

[('S', 0.16292134831460675),
 ('L', 0.10674157303370786),
 ('C', 0.09550561797752809),
 ('G', 0.0898876404494382),
 ('M', 0.0593900481540931),
 ('t', 0.05377207062600321),
 ('W', 0.033707865168539325),
 ('s', 0.03290529695024077),
 ('o', 0.030497592295345103),
 ('b', 0.024879614767255216),
 ('w', 0.024077046548956663),
 ('a', 0.02247191011235955),
 ('m', 0.02247191011235955),
 ('n', 0.020064205457463884),
 ('h', 0.019261637239165328),
 ('O', 0.018459069020866775),
 ('i', 0.016853932584269662),
 ('d', 0.015248796147672551),
 ('P', 0.014446227929373997),
 ('c', 0.012841091492776886),
 ('F', 0.012038523274478331),
 ('f', 0.011235955056179775),
 ('g', 0.011235955056179775),
 ('l', 0.01043338683788122),
 ('I', 0.009630818619582664),
 ('B', 0.009630818619582664),
 ('p', 0.00882825040128411),
 ('K', 0.008025682182985553),
 ('r', 0.0072231139646869984),
 ('A', 0.0056179775280898875),
 ('H', 0.0040128410914927765),
 ('k', 0.0040128410914927765),
 ('e', 0.0032102728731942215),
 ('T', 0.003210272

So `"ello"` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `"Firs"` is pretty much deterministic, and the word following `"rst "` can start with pretty much every letter.

## Generating from the model

Generating is also very simple. To generate a letter, we will take the history, look at the last *order* characters, and then sample a random letter based on the corresponding distribution:

In [7]:
def generate_letter(lm, history, order) -> str:
    """Given the history of characters, sample a random character from the `lm`."""
    history = history[-order:]
    dist = lm[history]
    p = random.random()
    for c,v in dist:
        p = p - v
        if p <= 0: return c
    return c  # Fallback in case floating-point rounding leaves p slightly above 0
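
The same sampling step can also be sketched with the standard library's `random.choices`, which accepts explicit weights. The function name `sample_char` and the made-up distribution below are assumptions for illustration only; this is an alternative formulation, not the code used for the outputs in this post.

```python
import random

def sample_char(dist, rng=random):
    """Sample one character from a list of (char, probability) pairs."""
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    return rng.choices(chars, weights=weights, k=1)[0]

dist = [('w', 0.8), ('r', 0.15), (',', 0.05)]  # made-up distribution, not real model output
rng = random.Random(0)                         # seeded generator, so runs are repeatable
sample = [sample_char(dist, rng) for _ in range(1000)]

# The empirical frequency of 'w' should be close to its probability of 0.8.
print(sample.count('w') / len(sample))
```

Passing a seeded `random.Random` instance is a convenient way to make a generated passage reproducible.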

To generate a passage of *k* characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [8]:
def generate_text(lm, order, nletters=1000) -> str:
    """Sample a random `nletters`-long passage from `lm`."""
    history = PAD * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return ''.join(out)

## Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

## order 2:

In [9]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

Fir ithe threwd
BENVOLAFF:
DO:
A must art
As doe
she of traest ing; he bere is amnine will man a prand if not shaves, and did, wasky hiscit. I.

GLOSE:
Wituouly.

QUEEDGAR:
Muster ence his loneve day
whathy, wing unwake give ther
you muck,
Wit:
Thatichme,
Thim?

Diereply
res Caes mus thereat's
Witse in hen,
NO Somen; tim but you devinvinvy, yount will yould
fath
whave do merill'd, a mad. I cucid lems; levesord, vaught; pringer iftern and wome AESSIDEMO:
Is in a grant und re
MARD I knoth theady tor Bir, them; halle of is nothe Niceron rest wit's prawar inceds knoter noblet feir hemas clen.

Yeare for facread; and give hemplaut not amn
And,
The such deacquic
at th a lings of thou gonchater.

MOGELIA:
Wer extraideadviusbaince,
MALIETRAY:
And by wee: ang is peadar de;
Yornall th mord! gues Neithere.

KINE:
The food.

KINGH Enot
Hublood
I ke hou not shonou dress-dard;
HUS:
SULIFF:
Hece my is frainis;
Fords.

PISTRUCE:
Yous Duke livencioulls my lon hem nes whe ne.

Clonsievery crese goor she

## order 4 

Order 2 was not so great... but what if we increase the order to 4?



In [10]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First man to
not--Charlot's grounds? I have when I shall heavy is alone:
Thou do your Almight at hundered at hence.

HELENA:
No was it ranklins in powere in you. The shall this park-contract and left up that come;
Still, my lips,
Nor to life, or with all bear dires her worm you it?

BENEDICK:
Neanmoins! any me this enemy?

ROSALIND:
I thing that we devise a mannerly into chides.

MELUN:
Fie, capitol; but he not as banish of thout-vied and more; I will tell the we may oppress then an of the general please when the moreoversal tell commander'd men who lie by this pair life.

QUEEN GERTRUDE:
And laughter wife. Come to go
who is a dog in your for the courers. Who's she might!
Black of place music love anger; I have thou shalt hour.

CLEONTES:
Paris brotherwhelm'd vanish a sister.

KING HENRY:
Thou known, that, in that hear,
That dowry state.

VENTINE:
Of this fix'd woman and better?

First of this bird, profess togethere a few subject!
Who comes Brutus, audacious simpless the desper one be



## order 7

Order 4 is already quite reasonable, and reads like English. Just 4 letters of history! What if we increase it to 7?

In [11]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

First Clown:

HAMLET:
Sir, my lord and so
rushling, I say.

First Commoners grace, the moon: but the window, bids you, for I stay there was as much, would I were about with some help.

CARDINAL WOLSEY:
You should shed forced me.

KENT:
Royally!
Where the love of such dulcet,
His wife's firm to be a villains' throats too dear a letter.

Second Gentleman:
Their weapons.

ALONSO:
Irreparable is the next day's deed: now thou sweatest! comes here.

ADRIANA:
God bless he purpose of yourself this world.

ANTONIO:
And so art thou injurer of my soldiers?
Though my child, let us to exclaim of every thing, which is in practise,
Is all too strict account
Of all these instructs you; for you.

MENENIUS:
Nay: in that our death eaten in this youthful deeds
Do breed a kind of death
is too long.

GONZALO:
Heaven for friend, which within fourteen,' an hour my though you were bed-time?
Warwick and you my staff of France that night:
I stay the time of no mercy.

EARL OF DOUGLAS:
There is Lord Scales
Unto t

## How about order 10?

In [12]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm, 10))

First Citizen:
Nay, you must
Forget that by hanging of the maids a-row and bound them.

ADRIANA:
To fetch my poor distressed lord, even such delightful pleasing to a man. O, be some of them have scope to beat,
Since foes have scope;
Do what you heard not?

Gentlewoman?
What is't but to breathe in fruitful meal would set down--
As best thou diest!

SOMERSET:
Here in the tender honour of his merit.

CORIOLANUS:
Not of a woman; if you bear a many superfluity. See, our best leisure, I would
repent out the rivet: and at her late beloved,
And the moon
may shine in pearl and gold,
To wait upon your epileptic visage!
Smile you my speech. If that thy question,
But that you know
The worst of all,
How we may praise the power of Englishmen unto these, and with bold spirit instructions yet commence
Rough deeds of malice;
You have heard
The fundamental reasons of your law;
Therefore be suspicious
I more inclined to blood,
You, brother Jaques he keeps at school, and
report speaks loud; and I say besi

## This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (about a word and a half of history) or 10 (about two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

## So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous *n* letters, for a rather small *n*, and that's it. 

However, Karpathy's C code generation example (the Linux kernel is written in C) is very impressive. Why? Because of the context awareness. Note that in all of Karpathy's posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous *n* characters. 

If Karpathy's examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

# Linux Kernel C Code

Just for the fun of it, let's see what our simple language model does with the Linux-kernel code:

In [13]:
! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt
! wc linux_input.txt

  241465  759639 6206997 linux_input.txt


In [14]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm, 10))

/*
 * linux/kernel/signal.c
 *
 *  Copyright 2005, Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *
 *  Group scheduling (such as
 *	printk(). Otherwise length is in len field, then array[0] and array[1] has the dual advantage that
	 * cgroup_task_get(p);

	if (quota < min_cfs_quota_us < 0)
			sig->group_exit_task(struct task_struct *tsk, int group_balance_cpu(sg) == cpu)
		tick_do_update_jiffies64(now);
	}
}

#ifdef CONFIG_RCU_BOOST */

		/*
		 * We must manually
 *	free IRQs allocated image pages, but is unlikely.
	 * The load balanced, sd->sbf_pushed,
			     TAINT_WARN, NULL);
	cd.wrap_kt = ns_to_ktime(wrap);

	rd = cd.read_data {
	int				cpu;
	atomic_set(&mm->mm_count);
			mdelay(1);
		}
	}

	mutex_lock(&ftrace_lock);
	ret = 0;
	}

	for_each_buffer_cpu(buffer, cpu);
		if (retval)
		kfree(uprobe);
		kfree(ops);
	}

	return ret;
}

struct cfs_rq *cfs_rq = &rq->cfs;
	struct rcu_head rcu;
	struct sigqueue structure. */
	unsigned long val, void *v)
{
	struct module *mod, char *buf)
{
	

In [15]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm, 15))

/*
 * linux/kernel/seccomp.c
 *
 * Copyright (C) 2012 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *
 */
#include <linux/uaccess.h>
#include <linux/hardirq.h>
#include <linux/async.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/syscalls.h>
#include <linux/cgroup.h>
#include <linux/kernel_stat.h>
#include <linux/ctype.h>
#include <linux/slab.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kallsyms.h>

#include "internals.h"

/*
 * lockdep_lock: protects the local module list.
 */
static LIST_HEAD(pmus);
static DEFINE_RAW_SPINLOCK(clockevents_lock);
/* Protection for unbind operations */
static void zap_class(struct lockdep_map *lock,
			   unsigned long long) region->start_pfn << PAGE_SHIFT;
}

#define DEFINE_FETCH_memory(type)					\
static int filter_opstack_empty(ps))
		return OP_NONE;

	opstack_op = list_first_entry(tasks, struct task_struct *tsk, unsigned int type;
	unsigned int i, j, n_pages;

	*size = PAGE_ALIGN(*size);
	n_pages = *size >>

In [16]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/msi.c
 *
 * Copyright (C) 1999-2004 Silicon Graphics, Inc.  All Rights Reserved.
 *
 * This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License
 * as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307  USA
 *
 * Written by Rickard E. (Rik) Faith <faith@redhat.com>
 *
 * Goals: 1) Integrate fully with Security Modules.
 *	  2) Minimal run-time overhead:
 *	     a) Minimal when syscall auditing is disabled (audit_enable=0).
 *	     b) Small when s

In [17]:
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/msi.c
 *
 * Copyright (C) 2008 Ingo Molnar <mingo@redhat.com>
 *
 * Originally ported from the -rt patch by:
 *   Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
 *
 */
#include <linux/bpf.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/syscalls.h> here,
    but tell gcc to not warn with -Wmissing-prototypes  */
asmlinkage long sys_ni_syscall(void)
{
	return -ENOSYS;
}

int proc_doulongvec_ms_jiffies_minmax, 09/08/99, Carlos H. Bauer.
 * Added proc_doulongvec_minmax(void *data, struct pt_regs *regs)
{
	struct kretprobe_instance *ri,
		    struct pt_regs *regs, u64 mask)
{
	int bit;

	for_each_set_bit(bit, (const unsigned long shortdelay_us = 10;
	const unsigned long *ipb = b;

	if (*ipa > *ipb)
		return 1;
	if (*ipa < *ipb)
		return -1;
	return 0;
}

static int trace_search_list(struct list_head *head)
{
	struct perf_event *leader)
{
	struct perf_cgroup *cgrp;
	struct cgroup_root *root = cgroup_root

In [18]:
print(generate_text(lm, 20, nletters=5000))

/*
 * linux/kernel/irq/spurious.c
 *
 * Copyright (C) 1998-2005 Pavel Machek <pavel@ucw.cz>
 * Copyright (C) 2000-2001 VERITAS Software Corporation.
 * Copyright (C) 2008 Thomas Gleixner <tglx@linutronix.de>
 *  Copyright (C) 2006 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
 *
 * This file contains functions which manage clocksource drivers to be unregistered
 *
 *	Unregisters a previously registered entries. */
	while ((info = gcov_info_next(info))) {
		gcov_event(GCOV_ADD, info);
		cond_resched();
		dst += n;
		usrc += n;
		len -= n;
	} while (len);
	return 0;
}

/* Sets info->hdr and info->len. */
static int copy_module_from_user(umod, len, &info);
	if (err)
		return err;

	return load_module(&info, uargs, 0);
}

SYSCALL_DEFINE1(getsid, pid_t, pid)
{
	struct task_struct *p,
	       struct cpumask *later_mask);
void cpudl_set(struct cpudl *cp, int idx)
{
	int l, r, largest;

	/* adapted from lib/prio_heap.c */
	while(1) {
		l = left_child(idx);
		r = right_child(idx);
		largest = id

### Analysis

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the `[sic]`
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work and non-trivial human reasoning, and will make the model significantly more complex. 
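
To make the "have I seen ( but not )" idea concrete, here is a rough sketch in which the conditioning history is the pair (last `order` characters, open-parenthesis flag). Everything in it is an assumption made for illustration: the names `update_depth`, `train_paren_lm` and `generate_paren_text`, the toy corpus, and the choice to track only round brackets. It ignores indentation, braces, strings and comments, and it was not used for any output shown above.

```python
import random
from collections import Counter, defaultdict

PAD = '`'  # assumed to match the padding character of the training code

def update_depth(depth, char):
    """Track how many '(' are currently unclosed (never going below zero)."""
    if char == '(':
        return depth + 1
    if char == ')' and depth > 0:
        return depth - 1
    return depth

def train_paren_lm(text, order):
    """Count next characters conditioned on (last `order` chars, open-paren flag)."""
    counts = defaultdict(Counter)
    data = PAD * order + text
    depth = 0
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[(history, depth > 0)][char] += 1
        depth = update_depth(depth, char)  # depth now covers everything seen so far
    return counts

def generate_paren_text(counts, order, nletters, rng):
    """Sample up to `nletters` characters, carrying the flag along with the history."""
    history, depth, out = PAD * order, 0, []
    for _ in range(nletters):
        ctr = counts.get((history, depth > 0))
        if not ctr:  # a (history, flag) context that never occurred in training
            break
        chars, weights = zip(*ctr.items())
        c = rng.choices(chars, weights=weights, k=1)[0]
        out.append(c)
        history = (history + c)[-order:]
        depth = update_depth(depth, c)
    return ''.join(out)

toy = "a = f(x); b = g(y); c = h(a) + f(b); " * 20  # hypothetical toy corpus
model = train_paren_lm(toy, order=1)
print(generate_paren_text(model, 1, 200, random.Random(1)))
```

Even with a single character of history, the flag keeps this toy model from closing a bracket that was never opened, because `)` was only ever counted in contexts where the flag was set. The price is exactly the one described above: someone had to decide which feature to track and write the bookkeeping by hand.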

Karpathy's LSTM, on the other hand, seems to have just learned it on its own. And that's impressive.