### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.


In [1]:
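from collections import Counter, defaultdict

def train_char_lm(fname, order=4):
    # Read the whole training text, closing the file when done.
    with open(fname) as f:
        data = f.read()
    lm = defaultdict(Counter)
    # Pad with `order` leading "~" so the model also learns how a text starts.
    pad = "~" * order
    data = pad + data
    # Count how often each character follows each `order`-character history.
    for i in xrange(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        # Convert raw counts into a list of (char, probability) pairs.
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.iteritems()]
    outlm = {hist:normalize(chars) for hist, chars in lm.iteritems()}
    return outlm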
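
To see what the training step produces, here is a minimal sketch on a toy string. It is written for Python 3 (the cells in this notebook are Python 2), and `train_char_lm_from_text` is a hypothetical variant that takes the text directly instead of a file name; the counting and normalization are meant to mirror the function above.

```python
from collections import Counter, defaultdict

def train_char_lm_from_text(data, order=4):
    """Hypothetical Python 3 variant of train_char_lm that takes the text itself."""
    lm = defaultdict(Counter)
    data = "~" * order + data  # same leading padding as in the notebook
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        lm[history][char] += 1
    # Normalize the counts of each history into (char, probability) pairs.
    return {
        hist: [(c, cnt / sum(chars.values())) for c, cnt in chars.items()]
        for hist, chars in lm.items()
    }

toy_lm = train_char_lm_from_text("abracadabra", order=1)
print(toy_lm["~"])  # [('a', 1.0)]: the padding history predicts the first letter
print(toy_lm["a"])  # [('b', 0.5), ('c', 0.25), ('d', 0.25)]
```

In `abracadabra`, the letter `a` is followed by `b` twice and by `c` and `d` once each, which is exactly the distribution stored under the history `a`.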

Let's train it on Andrej's Shakespeare text:

In [2]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

--2017-02-01 11:33:34--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘shakespeare_input.txt’


2017-02-01 11:33:34 (12.8 MB/s) - ‘shakespeare_input.txt’ saved [4573338/4573338]



In [3]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [4]:
lm['ello']

[('!', 0.0068143100511073255),
 (' ', 0.013628620102214651),
 ("'", 0.017035775127768313),
 (',', 0.027257240204429302),
 ('.', 0.0068143100511073255),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 ('w', 0.817717206132879),
 ('n', 0.0017035775127768314),
 (':', 0.005110732538330494),
 ('?', 0.0068143100511073255)]

In [5]:
lm['Firs']

[('t', 1.0)]

In [6]:
lm['rst ']

[("'", 0.0008025682182985554),
 ('A', 0.0056179775280898875),
 ('C', 0.09550561797752809),
 ('B', 0.009630818619582664),
 ('E', 0.0016051364365971107),
 ('D', 0.0032102728731942215),
 ('G', 0.0898876404494382),
 ('F', 0.012038523274478331),
 ('I', 0.009630818619582664),
 ('H', 0.0040128410914927765),
 ('K', 0.008025682182985553),
 ('M', 0.0593900481540931),
 ('L', 0.10674157303370786),
 ('O', 0.018459069020866775),
 ('N', 0.0008025682182985554),
 ('P', 0.014446227929373997),
 ('S', 0.16292134831460675),
 ('R', 0.0008025682182985554),
 ('T', 0.0032102728731942215),
 ('W', 0.033707865168539325),
 ('a', 0.02247191011235955),
 ('c', 0.012841091492776886),
 ('b', 0.024879614767255216),
 ('e', 0.0032102728731942215),
 ('d', 0.015248796147672551),
 ('g', 0.011235955056179775),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('h', 0.019261637239165328),
 ('k', 0.0040128410914927765),
 ('m', 0.02247191011235955),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('n', 0.0

So `ello` is followed mostly by `w`, and otherwise by a space, punctuation, or occasionally `r`, `u` or `n`; `Firs` is pretty much deterministic; and the word following `rst ` can start with pretty much any letter.
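
The raw lists are not sorted, so the dominant continuations are easy to miss. A small sketch of a helper that ranks a distribution by probability follows; `top_k` is a hypothetical name, and the sample entries are copied from the `lm['ello']` output above (only a subset, for brevity).

```python
def top_k(dist, k=3):
    """Return the k most probable (char, prob) pairs of one history's distribution."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# A subset of the lm['ello'] entries shown above.
ello = [
    (' ', 0.013628620102214651),
    ("'", 0.017035775127768313),
    ('r', 0.059625212947189095),
    ('u', 0.03747870528109029),
    ('w', 0.817717206132879),
]
print([c for c, _ in top_k(ello)])  # ['w', 'r', 'u']
```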

### Generating from the model
Generating is also very simple. To generate a letter, we take the history, look at the last $order$ characters, and then sample a random letter from the corresponding distribution.

In [7]:
from random import random

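def generate_letter(lm, history, order):
    # Only the last `order` characters are used as the conditioning history.
    history = history[-order:]
    dist = lm[history]
    # Walk the distribution until the cumulative probability passes a uniform draw.
    x = random()
    for c,v in dist:
        x = x - v
        if x <= 0: return c
    # Floating-point rounding can leave x slightly above 0; fall back to the last character.
    return c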

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [8]:
def generate_text(lm, order, nletters=1000):
    history = "~" * order
    out = []
    for i in xrange(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
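
As a quick end-to-end check of the generation loop, here is a sketch in Python 3 on a hand-written toy model. It swaps the manual sampling loop for the standard library's `random.choices`, which is a different mechanism than the one used above but draws from the same distribution. The names `generate_letter_choices`, `generate_text_py3` and `toy_lm` are made up for this illustration, and the toy model is not trained on any real data.

```python
import random

def generate_letter_choices(lm, history, order, rng=random):
    """Sample the next character with random.choices instead of a manual loop."""
    chars, probs = zip(*lm[history[-order:]])
    return rng.choices(chars, weights=probs, k=1)[0]

def generate_text_py3(lm, order, nletters=20, rng=random):
    history = "~" * order  # start from the padding history
    out = []
    for _ in range(nletters):
        c = generate_letter_choices(lm, history, order, rng)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

# Hand-written order-1 model: "a" is followed by "b" or "c", which lead back to "a".
toy_lm = {
    "~": [("a", 1.0)],
    "a": [("b", 0.5), ("c", 0.5)],
    "b": [("a", 1.0)],
    "c": [("a", 1.0)],
}
text = generate_text_py3(toy_lm, order=1, nletters=20, rng=random.Random(0))
print(text)
```

Whatever the random draws are, every other character of the output is an `a`, because the toy model only allows `a` after `~`, `b` and `c`.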

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2:

In [9]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print generate_text(lm, 2)

Fien;
All the know, whou,
Till an deed. I the lows hen he humman my forebtereered train musave thow'stiet its mans ist ove thim thet-lientsmintue suffed, welp.

I hearle.
Posem even pat his I'll mased scou me who, their youstur. She in
MARMISTA:
We an,
Teas or gen me cand I kin unto scind me?

HARISTRA:
PERICK:

FRA:
Sir lewas murliumble
Deat Trons wer!'

Of th ass
home, mand shoutim.

Diegif al, athim; all; I'll her ne,
And red Caeld man we's me,
Whe unt heare poseche be. Yourayer!

Fram min as th ang but;
And dive, at brothen th mande thimaund hirstell broad
th,
SHARLAUDINE:
I a pre en? Clou ne not, your not, shontlethy wit?

Mare now th his dam, to for not.

Fir.

Viany.

GO:
Brou hence,
Whaliparry
HES:
Gartize ito lestat but, willy,
SIDARWICK:
The cauld.

OPARK:
Your fathy at ster, yet or. Hal, to to cry beeld thappothince witere's ther.

GUICK:
Is allip yoned nothim.

CLOCTAR:
I boo sup if we not mure speads.

DON:
ISTERNENIO:
Whou a not fought the spendon'tion.

LEPHERICKLY:
MENR

Not so great... but what if we increase the order to 4?

### order 4

In [10]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print generate_text(lm, 4)

First Lord And ther:
Third, he crown to restraitor, upon,
I have you may patience;
But I for I will feet his with me; I bless not promonth occasion of his
knave!

CATESBY:

MAMILLO:
Excellusion
Hark!--Whose the name.

SIR WALTER BLUNT:
Now these this yours to me.

HORATIO:
O Regan.

Clown:
Conce,
And sweariest a rowed in in the sith shall.

SHALLOW:
Nay, I wisdom
Thy thoughts
The whet of thought the greath;
The criest, above
Unto my people placed for bitter is no marry, answer master. Where confess me other full give wish.

PRINCENTINE:
Good consieur Le Benedicate, by three opposed by them is brother that vices from encourse!
Is my knave pint
Thy lord.

FALSTAFF:
My look about of the Romeo, he here in
they done she neighbour
active pat; and give is and truth a peace, I should nobody.

PISANIO:
Report
We here her fough, and revoltingly. Reason that you, true,
You hast the comest Rosalind up and know would a solicite,
Betray, what comforth
Live,
Because the news?
You urge
That this, Satu

In [11]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print generate_text(lm, 4)

First Citizens, make voice to my love meants, and turn the kingdom, sure,
O'er wish.
I'll twelves one that I, majesty come come and a dearer:
'Tis to best orname, master
thou will requestion of it, what not living is nothink.

First Claudio, I wast delibert?
rece of Marcius, sir, head of Venice. Well, my chambering that though token? prodigal, as Cato, and get not Nature;
With and you; and my knew robbing come speak:
I bear not, my hung Georget have my loved, to hides, mistre at not thou shake hath finds of that sheep-sheep stay;
The king; and is ther's spoke them at.
On, Macbething to their dry. Witch,
Cannot murder father's back. Our dried, present she generald crowns, rounderers
Are you son the good, as well thee, Hero, have he deligiously away
A sooth done, to hand,
And that some of danged slew,
And all because of didst need with himselve board,
And so? them. Nay, the hard-head
To me ask of fear:
Polixeness prince perceiving herbs: having and so,
Ingration of we knighter:
But with 

This is already quite reasonable, and reads like English. With just 4 letters of history! What if we increase it to 7?

### order 7

In [12]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print generate_text(lm, 7)

First Citizens are odious,
For all incite us to the thing.
But, I pray thee, fellow of a tavern a most no revenge, thy behalf;
And palmy state, she's warm!
If they did was made after weep the drum toward do you know when the comfort. But indeed, his
majestical, apish, shall I do? I would have forswore my cheese, and thoroughly penn'd
And on our awe,
Or been all that humour of his chief gone with you will come of truth,
Pluck this All-Souls' day is yourself,
No doubt it not be a sovereign, let them on Lud's-town march
towards London,
To sit undertake you may pare him! God sa' me, 'twere to thee,
But dead, hereafter, drybeat them, gentlemen
To slay your feast him,
This head.
Hark! he stirs:
Start not trouble a
lord. If it hath a power
Meets with the middle.

First Gentleman,
accordant, her voice goes that make a man may still?

TROILUS:
O brave brought on kiss
She vied so long. I have struck this particulars.

SLENDER:

DOCTOR CAIUS:
Verily,
Your dangerous unto yourselves.

NORFOLK:
Be a

### How about 10?

In [13]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print generate_text(lm, 10)

First Citizen:
Ay, sir; so should I, in these greens before you choose the rigour of severest law.

Second Murderer:
He needs as many, sir, as I say, to vex
her I will or no?
O, torture me in this plain, so many hollow falsehood!
Why did he so? I charge ye, bear her fan!
To see him in a dark hour. Resolve yourself:
Nay, task me to my chance,
Is queen of all this?

BRUTUS:
Could you on the hip,
Abuse him to a better gone, so must thy grave,
And give us notice of your duty throughly.
Signior Benedick, the marriage: Lastly,
If I do vow a friendship
That young Prince John a full commission: I will
on with my good cousin; farewell.
And now, good Cassio!

IAGO:
Awake! what, have you served the place,
And we'll consign to.

KING LEAR:
O, ho, are you hurt, lieutenant; and 'tis powerful grace than boy.

DUKE:
Where you shall hear music and see the bottom of the selfsame sun that makes that men should possess.

KING:
I am wrapp'd in fire,
To burn the lodging out. Give him the revolt of thine was

### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~a word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

### So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. 

However, the code-generation example is very impressive. Why? Because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.

If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the Linux kernel code:

In [None]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt

--2017-02-01 11:34:40--  http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6206996 (5.9M) [text/plain]
Saving to: ‘linux_input.txt’


2017-02-01 11:34:40 (12.7 MB/s) - ‘linux_input.txt’ saved [6206996/6206996]



In [None]:
lm = train_char_lm("linux_input.txt", order=10)
print generate_text(lm, 10)

In [None]:
lm = train_char_lm("linux_input.txt", order=15)
print generate_text(lm, 15)

In [None]:
lm = train_char_lm("linux_input.txt", order=20)
print generate_text(lm, 20)

In [None]:
print generate_text(lm, 20)

In [None]:
print generate_text(lm, 20, nletters=5000)

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the 
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work and non-trivial human reasoning, and will make the model significantly more complex.
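
To make the "have I seen ( but not )" idea concrete, here is a rough Python 3 sketch of one way to do it: the conditioning key becomes a pair of the usual character history and a flag saying whether a parenthesis is currently open. The function name `train_bracket_aware_lm` and the toy text are invented for this illustration, the model returns raw counts rather than probabilities, and a real version would also need matching changes on the generation side.

```python
from collections import Counter, defaultdict

def train_bracket_aware_lm(data, order=2):
    """Count next characters given (history, is_a_paren_open) instead of history alone."""
    lm = defaultdict(Counter)
    data = "~" * order + data
    depth = 0  # number of "(" seen so far that have not been closed
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        lm[(history, depth > 0)][char] += 1
        # Update the bracket state only after the character has been counted.
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
    return lm

lm = train_bracket_aware_lm("(ab) ab " * 3, order=1)
print(lm[("b", True)])   # Counter({')': 3}): inside parentheses, "b" is followed by ")"
print(lm[("b", False)])  # Counter({' ': 3}): outside, "b" is followed by a space
```

A plain order-1 model would see `b` followed by `)` and by a space equally often, so the extra flag is what separates the two cases. The cost is visible even here: someone had to decide which state to track and how to update it.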

The LSTM, on the other hand, seems to have just learned it on its own. And that's impressive.