<div align="right"><i>Peter Norvig<br>July 2025</i></div>

# Letter Frequency Revisited

A reader, Bernard, writes:

>I am mailing because a while ago I was looking for a source of the letter frequency of all unique words. I then stumbled upon your great English letter frequency count research: [mayzner.html](https://norvig.com/mayzner.html). My question: Are you able to provide these frequencies (**letter, start, end and digram frequency**) based on a Scrabble-style word list (including all conjugated forms, but excluding proper nouns) for English and/or Dutch words?

Yes, Bernard, I am able! I have the [`sowpods.txt`](sowpods.txt) Scrabble word list for English. For Dutch, after some research I found that the official word list is copyright protected, but there is an [unofficial list](https://github.com/OpenTaal/opentaal-wordlist/blob/master/elements/basiswoorden-gekeurd.txt) at the [OpenTaal](https://github.com/OpenTaal) project. It contains proper names (capitalized), some phrases (with space characters), and numbers, but I can eliminate those to create the file `dutch.txt`.

Here is the code to print a report on the various frequencies. I use `_a` to mean an `a` at the start of a word, and `z_` to mean a `z` at the end of a word.

In [1]:
from collections import Counter
from typing import Iterable

def ngrams(n: int, wordlist: str) -> Iterable[str]:
    """An iterable of length-n subsequences of wordlist, with '_' between words."""
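    # The first word has no newline before it, so emit its leading '_' explicitly
    yield '_' + wordlist[:n-1]
    for i in range(len(wordlist) - n + 1):
        # Newlines separate words; show them as '_' so word boundaries are visible
        yield wordlist[i:i+n].replace('\n', '_')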

def report(wordlist: str) -> None:
    """Report frequencies for this wordlist."""
    letters = Counter(ngrams(1, wordlist))
    digrams = Counter(ngrams(2, wordlist))
    print(f'Word list: {len(wordlist.split()):,d} words; {len(wordlist):,d} characters')
    table('Letter Frequencies',       letters)
    table('First Letter Frequencies', digrams, lambda ab: ab.startswith('_'))
    table('Last Letter Frequencies',  digrams, lambda ab: ab.endswith('_'))
    table('Digram Frequencies',       digrams)

def table(title, counter, predicate=lambda ab: True) -> None:
    """Print the entries in the counter, with a title."""
    print('\n' + '=' * len(title) + '\n' + title + '\n')
    for x in sorted(counter):
        if predicate(x):
            print(f'{x}: {counter[x]:6d}')
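
To make the `_` convention concrete, here is a small sketch that runs `ngrams` on a toy two-word list. The string `'cat\ndog\n'` is my own made-up input for illustration, not part of either word list, and the function is repeated from above only so that the block runs on its own.

```python
from collections import Counter
from typing import Iterable

def ngrams(n: int, wordlist: str) -> Iterable[str]:
    """Same definition as above, repeated so this block is self-contained."""
    yield '_' + wordlist[:n-1]
    for i in range(len(wordlist) - n + 1):
        yield wordlist[i:i+n].replace('\n', '_')

toy = 'cat\ndog\n'  # hypothetical two-word list, one word per line

print(list(ngrams(1, toy)))  # ['_', 'c', 'a', 't', '_', 'd', 'o', 'g', '_']
print(list(ngrams(2, toy)))  # ['_c', 'ca', 'at', 't_', '_d', 'do', 'og', 'g_']

# First and last letters fall out of the digrams that touch a '_'
digrams = Counter(ngrams(2, toy))
first = {ab: k for ab, k in digrams.items() if ab.startswith('_')}  # {'_c': 1, '_d': 1}
last = {ab: k for ab, k in digrams.items() if ab.endswith('_')}     # {'t_': 1, 'g_': 1}
```

With two words, `ngrams(1, toy)` produces three `_` entries: one per newline plus the explicit leading one. That is why the `_` row in a letter table comes out one higher than the word count.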

Here is the code to create an unofficial word list (from a larger word list that may include capitalized names, numbers, etc.):

In [2]:
def create_wordlist(in_name: str, out_name: str) -> None:
    """Create a wordlist of (approximately) legal Scrabble words."""
    with open(in_name) as lines, open(out_name, mode='w') as out:
        for line in lines:
            word = line.strip()
            # Keep only all-lowercase, all-letter entries (no names, phrases, or numbers)
            if word.islower() and word.isalpha():
                print(word, file=out)

# create_wordlist('basiswoorden-gekeurd.txt', 'dutch.txt') # Only need to do this once
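
The filter in `create_wordlist` rests on two string methods, so it may help to see what they let through. The sample entries below are made up for illustration (I have not checked whether they appear in the OpenTaal file), and `keep` is a hypothetical helper name that just wraps the same test.

```python
samples = ['fiets', 'Amsterdam', 'ad hoc', '123', "auto's", 'e-mail', 'één']

def keep(word: str) -> bool:
    """The same test used in create_wordlist: all lowercase and all letters."""
    return word.islower() and word.isalpha()

kept = [w for w in samples if keep(w)]
print(kept)  # ['fiets', 'één']
```

`str.isalpha` accepts any Unicode letter, which is why accented letters such as `é` and `ë` survive the filter and show up in the Dutch tables. Capitalized names, phrases with spaces, digits, and words containing apostrophes or hyphens are all dropped.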

# Frequencies in Dutch

In [3]:
report(open('dutch.txt').read())

Word list: 191,963 words; 2,434,098 characters

Letter Frequencies

_: 191964
a: 164106
b:  49490
c:  52211
d:  88692
e: 332610
f:  28494
g:  84049
h:  48075
i: 172154
j:  24584
k:  67270
l: 106130
m:  57752
n: 162702
o: 137333
p:  56369
q:    573
r: 176061
s: 129774
t: 147881
u:  61889
v:  41975
w:  26481
x:   2118
y:   4049
z:  17192
à:      1
ä:      7
å:      2
ç:     12
è:    219
é:    397
ê:     46
ë:    833
í:      1
î:      2
ï:    388
ñ:      5
ô:      1
ö:    164
û:      4
ü:     39

First Letter Frequencies

_a:  11308
_b:  18655
_c:   6706
_d:   9576
_e:   4395
_f:   4351
_g:  10075
_h:   8452
_i:   5116
_j:   1647
_k:  13213
_l:   7578
_m:  10183
_n:   3079
_o:   9459
_p:   9868
_q:    113
_r:   7141
_s:  16727
_t:   8484
_u:   1861
_v:  12636
_w:   6707
_x:     36
_y:     55
_z:   4533
_à:      1
_å:      1
_é:      3
_ö:      1
_ü:      3

Last Letter Frequencies

a_:   2400
b_:    302
c_:     77
d_:  16840
e_:  24385
f_:   3118
g_:  22428
h_:   1670
i_:   1028
j_:   159

# Frequencies in English

In [4]:
report(open('sowpods.txt').read())

Word list: 267,751 words; 2,707,014 characters

Letter Frequencies

_: 267752
a: 188703
b:  44953
c:  98230
d:  81731
e: 275582
f:  28930
g:  67910
h:  60702
i: 220483
j:   4010
k:  22075
l: 127865
m:  70700
n: 163637
o: 161752
p:  73286
q:   4104
r: 170521
s: 234672
t: 159471
u:  80636
v:  22521
w:  18393
x:   6852
y:  39772
z:  11772

First Letter Frequencies

_a:  15723
_b:  14359
_c:  24120
_d:  15917
_e:  11030
_f:  10141
_g:   8914
_h:  10119
_i:   9390
_j:   2179
_k:   3131
_l:   7636
_m:  15077
_n:   6205
_o:   8604
_p:  23352
_q:   1352
_r:  14507
_s:  30575
_t:  13919
_u:   9070
_v:   4414
_w:   5611
_x:    303
_y:    985
_z:   1118

Last Letter Frequencies

a_:   5312
b_:    341
c_:   6004
d_:  24097
e_:  27350
f_:    513
g_:  19422
h_:   3146
i_:   1474
j_:     10
k_:   2202
l_:   8401
m_:   4471
n_:  11671
o_:   1783
p_:   1480
q_:      6
r_:  13775
s_: 102551
t_:  12940
u_:    354
v_:     39
w_:    589
x_:    547
y_:  19118
z_:    155

Digram Frequencies

_a:  15723
_b:  