# **N-gram MLE Playground**

## Unsmoothed MLE on a character-level language model

### Training

In [23]:
from collections import defaultdict, Counter

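def train_char_level_lm(data: str, block_size: int = 4) -> dict:
    # Map each block_size-character context to a Counter of the characters that follow it.
    # (Named `counts` rather than `dict` to avoid shadowing the builtin.)
    counts = defaultdict(Counter)
    # Left-pad the text so its first characters also have a full-length context.
    padding = "." * block_size
    data = padding + data
    # counting
    for i in range(len(data) - block_size):
        context, next_char = data[i:i+block_size], data[i+block_size]
        counts[context][next_char] += 1

    # normalization: turn raw counts into relative frequencies (the MLE)
    def normalize(counter: Counter) -> list:
        size = float(sum(counter.values()))
        return [(c, cnt/size) for c, cnt in counter.items()]

    return {context: normalize(counter) for context, counter in counts.items()}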
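
In symbols, writing $N(h, c)$ for the number of times character $c$ follows the `block_size`-character context $h$ in the training text (notation introduced here for illustration, not taken from the notebook), the normalization step computes the relative-frequency estimate:

$$
P(c \mid h) = \frac{N(h, c)}{\sum_{c'} N(h, c')}
$$

Because there is no smoothing, a character never observed after $h$ gets probability zero, and a context never observed in training has no entry in the returned dictionary.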

Get Andrej Karpathy's **Tiny Shakespeare** dataset:

In [16]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/refs/heads/master/data/tinyshakespeare/input.txt

--2025-01-23 13:20:59--  https://raw.githubusercontent.com/karpathy/char-rnn/refs/heads/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt'


Load **data**

In [21]:
with open('input.txt', 'r') as f:
    data = f.read()
len(data)

1115394

Train the **model**

In [24]:
model = train_char_level_lm(data=data, block_size=4)

Some queries

In [27]:
model['Hell']

[(' ', 1.0)]

In [26]:
model['tua ']

[('l', 0.5), ('t', 0.5)]

In [28]:
model['sing']

[(' ', 0.5578231292517006),
 ('u', 0.034013605442176874),
 (';', 0.013605442176870748),
 ('l', 0.10204081632653061),
 ('.', 0.06802721088435375),
 (',', 0.061224489795918366),
 (':', 0.013605442176870748),
 ('s', 0.08843537414965986),
 ("'", 0.006802721088435374),
 ('i', 0.013605442176870748),
 ('e', 0.006802721088435374),
 ('\n', 0.027210884353741496),
 ('!', 0.006802721088435374)]
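
The returned lists are in insertion order, not sorted by probability, which makes longer distributions hard to read. A minimal sketch of a helper that ranks them follows; `top_k` and the `toy` distribution are hypothetical names introduced here, not part of the notebook.

```python
import math

def top_k(dist: list, k: int = 3) -> list:
    """Return the k most probable (char, prob) pairs from one model entry."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# Toy distribution in the same format as model[context] (made-up values).
toy = [("a", 0.2), ("b", 0.5), ("c", 0.3)]

# Sanity check: a normalized distribution should sum to 1 (up to float error).
assert math.isclose(sum(p for _, p in toy), 1.0)

print(top_k(toy, k=2))  # [('b', 0.5), ('c', 0.3)]
```

Applied to the output above, `top_k(model['sing'])` should rank `' '` first, followed by `'l'` and `'s'`.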

### Sampling
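
One possible way to sample from the trained model is sketched below. It assumes the `(char, prob)` list format returned by `train_char_level_lm` and the same `"."` padding used in training; `sample_next_char` and `generate_text` are hypothetical helpers, not part of the original notebook.

```python
import random

def sample_next_char(model: dict, context: str, rng: random.Random) -> str:
    """Draw one character from the distribution stored for `context`."""
    chars, probs = zip(*model[context])
    return rng.choices(chars, weights=probs, k=1)[0]

def generate_text(model: dict, block_size: int = 4, n_chars: int = 200, seed: int = 0) -> str:
    """Generate up to n_chars characters, starting from the padding context."""
    rng = random.Random(seed)      # seeded for reproducible samples
    context = "." * block_size     # same padding as in training
    out = []
    for _ in range(n_chars):
        # Unsmoothed MLE has no distribution for an unseen context, so stop.
        if context not in model:
            break
        c = sample_next_char(model, context, rng)
        out.append(c)
        # Slide the window: keep only the last block_size characters.
        context = (context + c)[-block_size:]
    return "".join(out)
```

With the model trained above, `print(generate_text(model, block_size=4))` would print a sample; `block_size` must match the value used for training.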

## Further reading

1. Andrej Karpathy's post: <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>