<div align="right"><i>Peter Norvig<br>July 2025</i></div>

# Letter Frequency Revisited: Dutch

A reader, Bernard, writes:

>I am mailing because a while ago I was looking for a source of the letter frequency of all unique words. I then stumbled upon your great English letter frequency count research: [mayzner.html](https://norvig.com/mayzner.html). My question: Are you able to provide these frequencies (**letter, start, end and digram frequency**) based on a Scrabble-style word list (including all conjugated forms, but excluding proper nouns) for English and/or Dutch words?

Yes, Bernard, I am able! I have the [`sowpods.txt`](sowpods.txt) Scrabble word list for English, and after some research, I find that the official word list for Dutch is copyright-protected, but there is an [unofficial list](https://github.com/OpenTaal/opentaal-words/blob/master/elements/basiswoorden-gekeurd.txt) at the [OpenTaal](https://github.com/OpenTaal) project. It contains some non-words, but I can eliminate those to create the file `dutch.txt`.

Here is the code to print a report on the various frequencies. I use `_a` to mean an `a` at the start of a word, and `z_` to mean a `z` at the end of a word. To be clear, these are frequencies for the word list (dictionary), not for running text. (In running text, some words, like "the" in English, would be very common; in the dictionary each word counts just once.) Letters with diacriticals (accents) are changed to the base letter.

In [1]:
from collections import Counter
from typing import Iterable, List
import itertools
import math

def report(words: str, columns=7) -> None:
    """Report letter and digram frequencies for these words."""
    Nwords = len(words.split())
    Nletters = sum(map(str.isalpha, words))
    print(f'Word list: {Nwords:,d} words; {Nletters:,d} letters; {Nletters/Nwords:5.2f} letters/word')
    letters = ngrams(1, words)
    digrams = ngrams(2, words)
    print_table('One-Letter Frequencies:\nAny Letter      First Letter   Last Letter',
                column(letters), 
                column(digrams, lambda ab: ab.startswith('_')), 
                column(digrams, lambda ab: ab.endswith('_')))
    print_table('Two-Character Frequencies:',
                *batched(column(digrams), math.ceil(len(digrams) / columns)))
    
# Table to translate away diacriticals, and change word breaks to "_".
translation_table = str.maketrans('àäåçèéêëíîïñôöûü \n', 
                                  'aaaceeeeiiinoouu__')

def ngrams(n: int, text: str, translation_table=translation_table) -> Counter[str]:
    """A Counter of length-n subsequences of `text`, with '_' between words."""
    text = ('_' + text).translate(translation_table)
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def column(counter, predicate=None, width=15) -> List[str]:
    """A column of strings representing 'key:frequency' pairs."""
    total = sum(counter.values())
    return [f'{x}: {counter[x]/total:9.6%}'.ljust(width) for x in sorted(counter)
            if predicate is None or predicate(x)]

def print_table(title: str, *columns) -> None:
    """Print the columns, with a title."""
    print('\n' + title + '\n')
    for i in range(max(map(len, columns))):
        print(*[(col[i] if i < len(col) else '') for col in columns])

def batched(iterable, n) -> Iterable[tuple]:
    """Batch data from the iterable into tuples of length n. The last batch may be shorter than n."""
    # From itertools recipes: batched('ABCDEFG', 2) → AB CD EF G
    iterator = iter(iterable)
    while batch := tuple(itertools.islice(iterator, n)):
        yield batch
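
To make the `_` convention concrete, here is a small sketch on a two-word toy input. It is self-contained, so `toy_table` and `toy_ngrams` are stand-in names (not part of the code above), and the table is cut down to the one accented letter the example needs. The point to notice is that a digram such as `_c` is counted as a share of *all* digrams, not as a share of all words.

```python
from collections import Counter

# Stand-in for the full translation table: fold 'é' to 'e', turn word breaks into '_'.
toy_table = str.maketrans('é \n', 'e__')

def toy_ngrams(n: int, text: str) -> Counter:
    """Count length-n substrings of `text` after marking word breaks with '_'."""
    text = ('_' + text).translate(toy_table)   # 'café zee\n' becomes '_cafe_zee_'
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

digrams = toy_ngrams(2, 'café zee\n')
starts = {ab: c for ab, c in digrams.items() if ab.startswith('_')}
ends   = {ab: c for ab, c in digrams.items() if ab.endswith('_')}

total_digrams = sum(digrams.values())   # 9 digrams in '_cafe_zee_'
total_words   = sum(starts.values())    # 2 words, one '_x' digram each

print(starts)                           # {'_c': 1, '_z': 1}
print(ends)                             # {'e_': 2}
print(digrams['_c'] / total_digrams)    # share of all digrams (what the report prints)
print(digrams['_c'] / total_words)      # share of words that start with 'c'
```

In this toy case `_c` is 1 of 9 digrams but 1 of 2 words. The same distinction applies to the report's First Letter and Last Letter columns, which are normalized by the total digram count.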

Here is the code to create an unofficial word list from a larger word list that may include capitalized names, numbers, phrases, etc.:

In [2]:
def create_word_file(in_name: str, out_name: str) -> None:
    """Create a file of (approximately) legal Scrabble words."""
    with open(in_name) as infile, open(out_name, mode='w') as out:
        for line in infile:
            word = line.strip()
            if word.islower() and word.isalpha():
                print(word, file=out)

# create_word_file('basiswoorden-gekeurd.txt', 'dutch.txt') # Only need to do this once
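
As a quick check of what the `islower()`/`isalpha()` test keeps and drops, here is a sketch run on a handful of made-up candidate strings. They are illustrative only, not taken from the OpenTaal file, and `looks_like_scrabble_word` is a hypothetical helper name that just wraps the same condition.

```python
def looks_like_scrabble_word(word: str) -> bool:
    """The same test that create_word_file applies to each stripped line."""
    return word.islower() and word.isalpha()

# Made-up candidates: a plain word, a capitalized name, an accented word,
# and entries containing an apostrophe, a hyphen, a digit, or a space.
candidates = ['fiets', 'Amsterdam', 'café', "auto's", 'e-mail', 'a4', 'ad hoc']

kept = [w for w in candidates if looks_like_scrabble_word(w)]
print(kept)   # ['fiets', 'café']
```

Accented letters count as alphabetic, so a word like `café` survives the filter and is only folded to `cafe` later, by the translation table. Anything with punctuation, digits, spaces, or an uppercase letter is dropped.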

# Frequencies in Dutch

In [3]:
report(open('dutch.txt').read())

Word list: 191,963 words; 2,242,135 letters; 11.68 letters/word

One-Letter Frequencies:
Any Letter      First Letter   Last Letter

_: 7.886450%    _a: 0.464649%   a_: 0.098640%  
a: 6.742372%    _b: 0.766403%   b_: 0.012407%  
b: 2.033196%    _c: 0.275502%   c_: 0.003163%  
c: 2.145476%    _d: 0.393411%   d_: 0.691837%  
d: 3.643730%    _e: 0.180683%   e_: 1.010929%  
e: 13.726023%   _f: 0.178752%   f_: 0.128097%  
f: 1.170618%    _g: 0.413911%   g_: 0.921409%  
g: 3.452982%    _h: 0.347233%   h_: 0.068609%  
h: 1.975063%    _i: 0.210181%   i_: 0.042233%  
i: 7.088660%    _j: 0.067664%   j_: 0.065527%  
j: 1.009984%    _k: 0.542829%   k_: 0.443326%  
k: 2.763651%    _l: 0.311327%   l_: 0.507786%  
l: 4.360135%    _m: 0.418348%   m_: 0.244526%  
m: 2.372623%    _n: 0.126494%   n_: 1.028964%  
n: 6.684486%    _o: 0.388645%   o_: 0.052504%  
o: 5.648825%    _p: 0.405407%   p_: 0.174520%  
p: 2.315806%    _q: 0.004642%   q_: 0.000041%  
q: 0.023541%    _r: 0.293374%   r_: 0.882627%  
r: 

# Frequencies in English

In [4]:
report(open('sowpods.txt').read())

Word list: 267,751 words; 2,439,263 letters;  9.11 letters/word

One-Letter Frequencies:
Any Letter      First Letter   Last Letter

_: 9.891042%    _a: 0.580824%   a_: 0.196231%  
a: 6.970889%    _b: 0.530437%   b_: 0.012597%  
b: 1.660611%    _c: 0.891019%   c_: 0.221794%  
c: 3.628720%    _d: 0.587991%   d_: 0.890169%  
d: 3.019230%    _e: 0.407460%   e_: 1.010338%  
e: 10.180291%   _f: 0.374619%   f_: 0.018951%  
f: 1.068705%    _g: 0.329293%   g_: 0.717470%  
g: 2.508667%    _h: 0.373807%   h_: 0.116217%  
h: 2.242396%    _i: 0.346877%   i_: 0.054451%  
i: 8.144875%    _j: 0.080495%   j_: 0.000369%  
j: 0.148134%    _k: 0.115662%   k_: 0.081344%  
k: 0.815474%    _l: 0.282082%   l_: 0.310342%  
l: 4.723468%    _m: 0.556961%   m_: 0.165164%  
m: 2.611733%    _n: 0.229219%   n_: 0.431139%  
n: 6.044924%    _o: 0.317841%   o_: 0.065866%  
o: 5.975290%    _p: 0.862648%   p_: 0.054673%  
p: 2.707262%    _q: 0.049944%   q_: 0.000222%  
q: 0.151606%    _r: 0.535904%   r_: 0.508863%  
r: 