In [1]:
import pandas as pd

# Load four-letter words

In [2]:
# Kucera & Francis Word Pool downloaded from http://memory.psych.upenn.edu/files/wordpools/kfpool.txt
kf_corpus = pd.read_csv('kfpool.txt', header=None, sep=' ', names=['word', 'frequency'])

kf_corpus

Unnamed: 0,word,frequency
0,a,23248
1,a-coming,2
2,a-crowing,1
3,a-drinking,1
4,a-gracious,1
...,...,...
46367,zur,2
46368,zurcher,2
46369,zurich,2
46370,zwei,1


Let's see how many four-letter words there are in the corpus.

In [3]:
(kf_corpus.word.str.len() == 4).sum()

2347

That is more than the 1179 words reported by McClelland & Rumelhart (1981). They probably filtered the corpus by frequency.

In [4]:
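# Mask for four-letter words, computed once and reused for every threshold
is_four_letter = kf_corpus.word.str.len() == 4
for i in range(7):
    # Count four-letter words whose frequency is strictly greater than i
    word_count = (is_four_letter & (kf_corpus.frequency > i)).sum()
    print(f'There are {word_count} four letter words with frequency larger than {i}')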

There are 2347 four letter words with frequency larger than 0
There are 1721 four letter words with frequency larger than 1
There are 1444 four letter words with frequency larger than 2
There are 1294 four letter words with frequency larger than 3
There are 1188 four letter words with frequency larger than 4
There are 1092 four letter words with frequency larger than 5
There are 1029 four letter words with frequency larger than 6


Keeping only words with a frequency greater than 4 yields 1188 words, the count closest to 1179.
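
The same choice can be made programmatically. The following is a minimal sketch that reuses the counts printed above; `counts`, `target`, and `best_threshold` are illustrative names and are not part of the original notebook.

```python
# Number of four-letter words with frequency > threshold, copied from the printed output.
counts = {0: 2347, 1: 1721, 2: 1444, 3: 1294, 4: 1188, 5: 1092, 6: 1029}
target = 1179  # word count reported in McClelland & Rumelhart, 1981

# Pick the threshold whose word count has the smallest absolute gap to the target.
best_threshold = min(counts, key=lambda t: abs(counts[t] - target))
gap = abs(counts[best_threshold] - target)
print(f'Threshold {best_threshold} gives {counts[best_threshold]} words ({gap} off the target)')
```

No tested threshold reproduces 1179 exactly, so the original word list may have been built with an additional criterion that is not recovered here.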

In [5]:
four_letter_words = kf_corpus[(kf_corpus.word.str.len() == 4) & (kf_corpus.frequency > 4)]
len(four_letter_words)

1188

# Encode letters as feature bundles 

In the original model, the input letters came from a simplified font in which each letter is composed of a number of line strokes (*features*), as in the following image:

![Font from Rumelhart & Siple, 1974](rumelhart-siple-font.jpg)

Here are all 14 features, numbered from 0 to 13:

![Numbered line features](line_features.png)

The following dictionary maps each letter to the list of features it is composed of.

In [6]:
feature_numbers = {
    'A': [0, 1, 2, 3, 4, 6, 8],
    'B': [2, 3, 4, 5, 7, 8, 9],
    'C': [0, 1, 2, 5],
    'D': [2, 3, 4, 5, 7, 9],
    'E': [0, 1, 2, 5, 6],
    'F': [0, 1, 2, 6],
    'G': [0, 1, 2, 4, 5, 8],
    'H': [0, 1, 3, 4, 6, 8],
    'I': [2, 5, 7, 9],
    'J': [0, 3, 4, 5],
    'K': [0, 1, 6, 11, 12],
    'L': [0, 1, 5],
    'M': [0, 1, 3, 4, 10, 11],
    'N': [0, 1, 3, 4, 10, 12],
    'O': [0, 1, 2, 3, 4, 5],
    'P': [0, 1, 2, 3, 6, 8],
    'Q': [0, 1, 2, 3, 4, 5, 12],
    'R': [0, 1, 2, 3, 6, 8, 12],
    'S': [1, 2, 4, 5, 6, 8],
    'T': [2, 7, 9],
    'U': [0, 1, 3, 4, 5],
    'V': [0, 1, 11, 13],
    'W': [0, 1, 3, 4, 12, 13],
    'X': [10, 11, 12, 13],
    'Y': [9, 10, 11],
    'Z': [2, 5, 11, 13]
}
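
One way to turn this mapping into model input is to encode each letter as a fixed-length binary vector with one entry per feature. The following is a minimal sketch under that assumption, not necessarily the encoding used in the original model. `NUM_FEATURES`, `feature_subset`, `encode_letter`, and `encode_word` are hypothetical names, and only three letters are repeated so that the block runs on its own.

```python
NUM_FEATURES = 14  # features are numbered 0 to 13

# Subset of the mapping above, repeated so the example is self-contained.
feature_subset = {
    'C': [0, 1, 2, 5],
    'L': [0, 1, 5],
    'T': [2, 7, 9],
}

def encode_letter(letter, features=feature_subset):
    """Return a 0/1 list of length NUM_FEATURES marking the letter's features."""
    # The corpus words are lowercase while the mapping uses uppercase keys.
    present = set(features[letter.upper()])
    return [1 if i in present else 0 for i in range(NUM_FEATURES)]

def encode_word(word, features=feature_subset):
    """Encode each letter position separately, returning one vector per letter."""
    return [encode_letter(ch, features) for ch in word]

print(encode_letter('T'))
```

Passing the full `feature_numbers` dictionary as the `features` argument would extend the same encoding to all 26 letters.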