<div align="right"><i>Peter Norvig<br>July 2025</i></div>

# Letter Frequency Revisited: Dutch

A reader, Bernard, writes:

>I am mailing because a while ago I was looking for a source of the letter frequency of all unique words. I then stumbled upon your great English letter frequency count research: [mayzner.html](https://norvig.com/mayzner.html). My question: Are you able to provide these frequencies (**letter, start, end and digram frequency**) based on a Scrabble-style word list (including all conjugated forms, but excluding proper nouns) for English and/or Dutch words?

Yes, Bernard, I am able! I have the [`sowpods.txt`](sowpods.txt) Scrabble word list for English, and after some research, I find that the official word list for Dutch is copyright protected, but there is an [unofficial list](https://github.com/OpenTaal/opentaal-words/blob/master/elements/basiswoorden-gekeurd.txt) at the [OpenTaal](https://github.com/OpenTaal) project. It contains some non-words, but I can eliminate those to create the file `dutch.txt`.

Here is the code to print a report on the various frequencies. I use `_a` to mean an `a` at the start of a word, and `z_` to mean a `z` at the end of a word. To be clear, these are frequencies for the word list (dictionary), not for running text. (In running text, some words, like "the" in English, would be very common; in the dictionary each word counts just once.) Letters with diacritics (accents) are changed to the base letter (e.g. "é" becomes "e"). We show each result twice: once sorted alphabetically, and once with the most frequent entries first.
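
As a brief aside before that code, the difference between counting over a dictionary and counting over running text can be seen on a toy example. The sentence below is made up, and `letter_freq` is a hypothetical helper that is not part of the report:

```python
from collections import Counter

def letter_freq(words) -> Counter:
    """Hypothetical helper: count letters over an iterable of words."""
    return Counter(''.join(words))

# A made-up scrap of running text; "the" and "and" repeat.
running_text = 'the cat and the dog and the bird'.split()

# The "dictionary" view: every distinct word counts exactly once.
dictionary = sorted(set(running_text))

text_counts = letter_freq(running_text)
dict_counts = letter_freq(dictionary)

for letter in 'th':
    print(letter,
          f'running text: {text_counts[letter] / sum(text_counts.values()):.2%}',
          f'dictionary: {dict_counts[letter] / sum(dict_counts.values()):.2%}')
```

In this toy sample the letter "h" has roughly twice the share in running text that it has in the word list, purely because "the" is repeated.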

In [1]:
from collections import Counter  # typing.Counter is a deprecated alias
from typing import List

# Table to translate away diacriticals, and change word breaks to "_".
translation_table = str.maketrans('àäåçèéêëêëíîïñôöûü \n\t', 
                                  'aaaceeeeeeiiinoouu___')

def ngrams(n: int, text: str) -> Counter[str]:
    """A Counter of length-n overlapping subsequences of `text`."""
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def report(text: str, columns=7, translation_table=translation_table) -> None:
    """Print a report of letter and digram frequencies for the text"""
    Nwords = len(text.split())
    Nletters = sum(map(str.isalpha, text))
    print(f'Word list: {Nwords:,d} words; {Nletters:,d} letters; {Nletters/Nwords:5.2f} letters/word')
    unaccented = ('_' + text).translate(translation_table)
    letters = ngrams(1, unaccented)
    digrams = ngrams(2, unaccented)
    one_letters = [{c:  letters[c]  for c  in letters if c != '_'},
                   {_c: digrams[_c] for _c in digrams if _c.startswith('_')},
                   {c_: digrams[c_] for c_ in digrams if c_.endswith('_')}]
    print_table('One-Letter Frequencies:\n' + 2 * 'Any Letter      First Letter     Last Letter      ',
                [*map(by_keys, one_letters), 
                 *map(by_values, one_letters)])
    print_table('Two-Character Frequencies:', 
                [by_keys(digrams), 
                 by_values(digrams)])

def print_table(title: str, dicts: List[dict]) -> None:
    """Print the contents of the dicts in a table, with a title."""
    totals = [sum(dic.values()) for dic in dicts]
    print('\n' + title + '\n')
    rows = zip(*[dic.items() for dic in dicts])
    for row in rows:
        print(*[f'{key}: {val/totals[i]:9.5%}   ' 
                for i, (key, val) in enumerate(row)])
        
def by_keys(dic: dict) -> dict: 
    """This dict, rearranged with keys in sorted order."""
    return {k: dic[k] for k in sorted(dic)}
            
def by_values(dic: dict) -> dict: 
    """This dict, rearranged with values in decreasing order."""
    return {k: dic[k] for k in sorted(dic, key=lambda x: -dic[x])}
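
As a quick check of the `_` boundary-marker convention, here is a minimal sketch on a made-up three-word list. The words and the tiny two-character translation table are assumptions for illustration, and `ngrams` is repeated so that the snippet runs on its own:

```python
from collections import Counter

def ngrams(n: int, text: str) -> Counter:
    """A Counter of length-n overlapping substrings of `text`."""
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

# Made-up word list, one word per line: "één", "zee", "nee".
# The accented letter is written as an escape so it is one precomposed character.
sample = '\u00e9\u00e9n\nzee\nnee\n'

# Fold the one accented letter used here, and turn newlines into "_".
table = str.maketrans('\u00e9\n', 'e_')
marked = ('_' + sample).translate(table)
print(marked)                      # _een_zee_nee_

letters = ngrams(1, marked)
digrams = ngrams(2, marked)

# A digram starting with "_" marks a first letter; one ending with "_" a last letter.
first = {d: c for d, c in digrams.items() if d.startswith('_')}
last  = {d: c for d, c in digrams.items() if d.endswith('_')}

print(letters['e'], digrams['ee'])  # 6 3
print(first)
print(last)
```

Prepending a single `_` is what lets the first word contribute a first-letter digram; every later word gets its leading `_` from the newline that ends the previous word.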

Here is the code to create an unofficial word list from a file that may include capitalized names, numbers, phrases, etc.:

In [2]:
def create_word_file(in_file_name: str, out_file_name: str) -> None:
    """Create a file of (approximately) legal Scrabble words."""
    with open(in_file_name) as infile, open(out_file_name, mode='w') as out:
        for line in infile:
            word = line.strip()
            if word.islower() and word.isalpha():
                print(word, file=out)

# create_word_file('basiswoorden-gekeurd.txt', 'dutch.txt') # Only need to do this once
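
To see what that filter keeps and drops, here is a small sketch that applies the same `islower`/`isalpha` test to a few made-up lines. The sample entries are illustrative only and are not claimed to come from the OpenTaal file, and `looks_like_scrabble_word` is a hypothetical helper:

```python
def looks_like_scrabble_word(line: str) -> bool:
    """Hypothetical helper: the same test create_word_file applies to each line."""
    word = line.strip()
    return word.islower() and word.isalpha()

# Made-up input lines; "caf\u00e9" is "café" with a precomposed accented letter.
samples = ['fiets\n', 'Amsterdam\n', 'caf\u00e9\n', 'ad hoc\n',
           'a.u.b.\n', 'tv-programma\n', '\n']

kept = [s.strip() for s in samples if looks_like_scrabble_word(s)]
dropped = [s.strip() for s in samples if not looks_like_scrabble_word(s)]

print('kept:   ', kept)
print('dropped:', dropped)
```

Accented lowercase letters pass both tests, so a word like "café" is kept, while capitalized names, phrases with spaces, abbreviations with periods, hyphenated compounds, and blank lines are dropped.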

# Frequencies in Dutch

In [3]:
report(open('dutch.txt').read())

Word list: 191,963 words; 2,242,135 letters; 11.68 letters/word

One-Letter Frequencies:
Any Letter      First Letter     Last Letter      Any Letter      First Letter     Last Letter      

a:  7.31963%    _a:  5.89176%    a_:  1.25076%    e: 14.90120%    _b:  9.71802%    n_: 13.04731%   
b:  2.20727%    _b:  9.71802%    b_:  0.15732%    r:  7.85238%    _s:  8.71366%    e_: 12.81862%   
c:  2.32916%    _c:  3.49338%    c_:  0.04011%    i:  7.69557%    _k:  6.88310%    t_: 11.68819%   
d:  3.95569%    _d:  4.98846%    d_:  8.77252%    a:  7.31963%    _v:  6.58252%    g_: 11.68350%   
e: 14.90120%    _e:  2.29107%    e_: 12.81862%    n:  7.25679%    _a:  5.89176%    r_: 11.19174%   
f:  1.27084%    _f:  2.26658%    f_:  1.62427%    t:  6.59554%    _m:  5.30467%    d_:  8.77252%   
g:  3.74861%    _g:  5.24841%    g_: 11.68350%    o:  6.13246%    _g:  5.24841%    l_:  6.43874%   
h:  2.14416%    _h:  4.40293%    h_:  0.86996%    s:  5.78797%    _p:  5.14057%    s_:  6.07773%   
i:  7.695

# Frequencies in English

In [4]:
report(open('sowpods.txt').read())

Word list: 267,751 words; 2,439,263 letters;  9.11 letters/word

One-Letter Frequencies:
Any Letter      First Letter     Last Letter      Any Letter      First Letter     Last Letter      

a:  7.73607%    _a:  5.87225%    a_:  1.98393%    e: 11.29776%    _s: 11.41919%    s_: 38.30088%   
b:  1.84289%    _b:  5.36282%    b_:  0.12736%    s:  9.62061%    _c:  9.00837%    e_: 10.21471%   
c:  4.02704%    _c:  9.00837%    c_:  2.24238%    i:  9.03892%    _p:  8.72154%    d_:  8.99978%   
d:  3.35064%    _d:  5.94470%    d_:  8.99978%    a:  7.73607%    _d:  5.94470%    g_:  7.25375%   
e: 11.29776%    _e:  4.11950%    e_: 10.21471%    r:  6.99068%    _a:  5.87225%    y_:  7.14022%   
f:  1.18601%    _f:  3.78747%    f_:  0.19160%    n:  6.70846%    _m:  5.63098%    r_:  5.14471%   
g:  2.78404%    _g:  3.32921%    g_:  7.25375%    o:  6.63118%    _r:  5.41809%    t_:  4.83285%   
h:  2.48854%    _h:  3.77926%    h_:  1.17497%    t:  6.53767%    _b:  5.36282%    n_:  4.35890%   
i:  9.038