# **N-gram MLE Playground**

## Unsmoothed MLE on a character-level language model

### Training

In [62]:
from collections import defaultdict, Counter


def train_char_level_lm(data: str, block_size: int = 4) -> dict:
    """Estimate P(next char | previous block_size chars) by unsmoothed MLE."""
    counts = defaultdict(Counter)
    # Pad the start so the first characters also have a full-length context.
    data = "~" * block_size + data
    # Counting: one (context, next character) pair per position.
    for i in range(len(data) - block_size):
        context, next_char = data[i : i + block_size], data[i + block_size]
        counts[context][next_char] += 1

    # Normalization: turn raw counts into relative frequencies.
    def normalize(counter: Counter) -> list:
        total = float(sum(counter.values()))
        return [(c, cnt / total) for c, cnt in counter.items()]

    return {context: normalize(counter) for context, counter in counts.items()}
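
To make the counting and normalization steps concrete, here is a minimal self-contained sketch on a toy string. The helper name `toy_mle` and the string `"abac"` are illustrative assumptions, not part of the notebook. Each estimate is the count of (context, next character) divided by the count of the context.

```python
from collections import Counter, defaultdict


def toy_mle(text: str, block_size: int) -> dict:
    """Hypothetical helper: unsmoothed MLE of P(next char | context)."""
    counts = defaultdict(Counter)
    padded = "~" * block_size + text
    for i in range(len(padded) - block_size):
        context = padded[i : i + block_size]
        counts[context][padded[i + block_size]] += 1
    # Relative frequency: count(context, char) / count(context).
    return {
        ctx: {ch: n / sum(counter.values()) for ch, n in counter.items()}
        for ctx, counter in counts.items()
    }


toy_model = toy_mle("abac", block_size=1)
print(toy_model)
# {'~': {'a': 1.0}, 'a': {'b': 0.5, 'c': 0.5}, 'b': {'a': 1.0}}
```

The context `"a"` is followed once by `"b"` and once by `"c"`, so each gets probability 0.5; every other context has a single continuation with probability 1.0.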

Get Andrej Karpathy's **Tiny Shakespeare** dataset:

In [16]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/refs/heads/master/data/tinyshakespeare/input.txt

--2025-01-23 13:20:59--  https://raw.githubusercontent.com/karpathy/char-rnn/refs/heads/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt'


Load **data**

In [21]:
with open("input.txt", "r") as f:
    data = f.read()
len(data)

1115394

Train the **model**

In [24]:
model = train_char_level_lm(data=data, block_size=4)

Query the next-character distribution for a few contexts

In [27]:
model["Hell"]

[(' ', 1.0)]

In [26]:
model["tua "]

[('l', 0.5), ('t', 0.5)]

In [28]:
model["sing"]

[(' ', 0.5578231292517006),
 ('u', 0.034013605442176874),
 (';', 0.013605442176870748),
 ('l', 0.10204081632653061),
 ('.', 0.06802721088435375),
 (',', 0.061224489795918366),
 (':', 0.013605442176870748),
 ('s', 0.08843537414965986),
 ("'", 0.006802721088435374),
 ('i', 0.013605442176870748),
 ('e', 0.006802721088435374),
 ('\n', 0.027210884353741496),
 ('!', 0.006802721088435374)]
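
The distributions above are listed in the order the characters were first seen, and indexing the model with a context that never occurred in the training data raises a `KeyError`, since unsmoothed MLE has no estimate for it. A minimal sketch of a lookup that handles both points follows; `top_k` and `demo_model` are hypothetical names, and the toy model only mimics the notebook's `context -> [(char, prob), ...]` format.

```python
def top_k(model: dict, context: str, k: int = 3) -> list:
    """Hypothetical helper: the k most probable next characters for a context."""
    # Unsmoothed MLE has no entry for contexts absent from the training data.
    distribution = model.get(context, [])
    return sorted(distribution, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-in with the same shape as the trained model.
demo_model = {"ab": [("c", 0.25), ("a", 0.5), ("d", 0.25)]}

print(top_k(demo_model, "ab", k=2))  # [('a', 0.5), ('c', 0.25)]
print(top_k(demo_model, "zz"))       # [] because the context was never seen
```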

### Sampling

In [71]:
from random import random

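def character_sampling(model: dict, context: str, block_size: int) -> str:
    # Only the last block_size characters condition the next one.
    context = context[-block_size:]
    distribution = model[context]

    # Inverse-CDF sampling: subtract probabilities until x is used up.
    x = random()
    for char, pct in distribution:
        x -= pct
        if x <= 0:
            return char
    # Floating-point rounding can leave x slightly above 0;
    # fall back to the last character instead of returning None.
    return char


def text_sampling(model: dict, block_size: int, n_char: int = 1_000) -> str:
    # Start from the same padding that was used during training.
    context = "~" * block_size
    output = []

    for _ in range(n_char):
        # sample a character
        char = character_sampling(model, context=context, block_size=block_size)
        # slide the context window forward by one character
        context = (context + char)[-block_size:]
        # append to output
        output.append(char)

    return "".join(output)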
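
As an alternative to the manual loop, the standard library's `random.choices` draws from a weighted distribution directly. This is a sketch under the assumption that a distribution is a list of `(char, prob)` pairs as in the trained model; `sample_next_char` is a hypothetical name, and the seeded `random.Random` instance is only there to make runs repeatable.

```python
import random


def sample_next_char(distribution: list, rng: random.Random) -> str:
    """Hypothetical alternative to the manual loop, using random.choices."""
    chars = [char for char, _ in distribution]
    weights = [prob for _, prob in distribution]
    # choices returns a list of k samples; k=1 gives a single character.
    return rng.choices(chars, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
distribution = [("l", 0.5), ("t", 0.5)]  # same shape as model["tua "]
samples = [sample_next_char(distribution, rng) for _ in range(1_000)]
print(samples.count("l") / len(samples))  # should be close to 0.5
```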

### Play with different `block_size`s

#### `block_size` = 2

In [73]:
model_2gram = train_char_level_lm(data=data, block_size=2)
print(text_sampling(model=model_2gram, block_size=2))

Fir: I hisbuthe que's olikenswady ou colors nin,
And they we shat? bou nathe the loon.
If yours.

As de thy deady ploo hingell coll ithip ear witheande upte.

Go, alt thish
Thimplard's I he of yourds the a de girome emen I we safeaks!

Wout therst, tis th! forming yousuniour IV:
For thim!

Nay, youlneave by grans your burld thrive Lors yourpent shily
in anglaing swerrield seen dese afecand cle romench ar mand Ser
And be upowead bunse the come then ty of fand thee day
Whalif hok'd womps foebeaver er.
Wrall whour ciat thy on
Hard cand him:
Whased wice, yould sword now's morseed. Wher
ANUS:
Shours st:
Youghs ou mither.
Agazen I hand theith hey,
And faunest le;
Anded wer bow no musirse;
As que.

CORIANDA:
MARD I flaus th ye.

My sorm ante hatedneelf cancer them shin my ine be so, and my goleappy heardo das not win crain he'st thrit wert
Geons beitherved hicius nown. I hill ling the sheneven of wat wit
I an to of Go vid, my hally corst plat he thead play a us nothincent,
To haven an, th:
So

#### `block_size` = 4

In [74]:
model_4gram = train_char_level_lm(data=data, block_size=4)
print(text_sampling(model=model_4gram, block_size=4))

First: beat,
And here is name?
Was evenge on the city
And somen we know have give as I am fair!

PAULINA:
My lord,
For the this i' the devils,
And joys my head:
Good and say, good not some possess caps and all not now I am I shall me subject not,
That what not stock my heaven, letterly, a son, Aufidius!

LEONTES:
Farewell me, Corioli having tongue
Fault order grave him our best to-day:
In all mother, naked thou are hope:
Thinks in thence: letters on his ruin's lady,
But, so despair!

PRINCENTIO:
This count from the advantagems
Are some tend send man arms
Upon us.

GRUMIO:
Will the highness call take the me in abused truments then I belier made this say
the fill freely
Yielder; thou be limit. You some but does, the conventure's day your arrant
An old entable villain, hands
Till I know upon this most:
Here's have
you from peppery?

Proclaim'd three ill-vex'd.

MENENIUS:
No; it no dignified
Of ever hearth,
And was back of a qual scene your say in then you slave, uncles,
As if he see ill n

#### `block_size` = 7

In [75]:
model_7gram = train_char_level_lm(data=data, block_size=7)
print(text_sampling(model=model_7gram, block_size=7))

First Citizen:
You have no such mercy which, being so
capital? Tell me here: murderer:
No;
For my patience taken: 'shrew me, but a man for stirring up the Montague,
Resolved mates!
Are you go:
I think so, which you at the gaoler, the this
cuff was but a very pink of it?

DUKE OF YORK:
Come, my sovereign,
You said so. Farewell, go, poor Rome,
And neither shalt do not taken your ladyship?

ANGELO:
And more honour two notorious Prince Edward dares, and my desert!

CORIOLANUS:
Which out of him as well as reverse an say 'silver sound
The cruel with smiles;
And heard theirs, their hats.
But leave
to think no less.

FLORIZEL:
What dread
These and reigns. I never reign and these heads butts me about! Believe your grace.

First Huntsman:
It would succeed that means
To make my ring.

First Servingman:
How now! who comes, offer we refuse thy foot.

CORIOLANUS:
What shall die.

KING RICHARD III:
Thanks, gentleman, and they so?

BUCKINGHAM:
Upon my gracious lord.

PRINCE EDWARD:
Now breaks this dam

#### `block_size` = 10

In [76]:
model_10gram = train_char_level_lm(data=data, block_size=10)
print(text_sampling(model=model_10gram, block_size=10))

First Citizen:
Clubs, bills, and partisans, in hands as old,
Canker'd with that odds he weighs King Richard let me speak:
As I do know;
And all my followers.

MONTAGUE:
Thou villain, and soon persuade
Both him and her, sir: what have we here! Mercy on 's, a barne a very
pretty barne! A boy or a child, will go by thy direction:
If your more ponderous and settled project
May suffer alteration, finding
Myself thus alter'd from the whom, I see,
There's more work.
What is the way to lay the city but the people,
As if you were not I a little for my counsel,
Which must be even in our government.
You thus have marked me.

HASTINGS:
Sound trumpets sound
While we devise him.

COMINIUS:
I have not yet made doubt but Rome was ready
To answer us.

AUFIDIUS:
Whence come you? what's your will, sir,
No remedy.

FRIAR LAURENCE:
Hold; get you gone:
You have taken treasure of the foe?

NORFOLK:
My lord,
Fear none of them
with his charity, obedience fails
To the greater poll, and in his full tilth and hus
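
One way to read the samples above: as `block_size` grows, more contexts tend to have been followed by only one character in the training data, so sampling increasingly replays passages of the corpus. The sketch below measures that share on a toy string. `train` is a compact stand-in for `train_char_level_lm`, `deterministic_fraction` is a hypothetical metric, and `toy_text` stands in for `input.txt`.

```python
from collections import Counter, defaultdict


def train(text: str, block_size: int) -> dict:
    """Compact stand-in for train_char_level_lm, same output format."""
    counts = defaultdict(Counter)
    padded = "~" * block_size + text
    for i in range(len(padded) - block_size):
        counts[padded[i : i + block_size]][padded[i + block_size]] += 1
    return {
        ctx: [(ch, n / sum(c.values())) for ch, n in c.items()]
        for ctx, c in counts.items()
    }


def deterministic_fraction(model: dict) -> float:
    """Hypothetical metric: share of contexts with exactly one continuation."""
    single = sum(1 for dist in model.values() if len(dist) == 1)
    return single / len(model)


toy_text = "the cat sat on the mat. the rat sat on the hat."  # stand-in for input.txt
for block_size in (1, 2, 4):
    frac = deterministic_fraction(train(toy_text, block_size))
    print(block_size, round(frac, 2))
```

Running the same loop with the notebook's `data` and the block sizes used above (2, 4, 7, 10) would show how quickly the model shifts from generalizing to memorizing.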

## Further reading

1. Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks": <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>
2. Yoav Goldberg's response to Karpathy's post: <https://nbviewer.org/gist/yoavg/d76121dfde2618422139>