# Character confusion versus focus word-based correction

The anagram hash of a word type $w$ is computed as

$$
\operatorname{Key}(w) = \sum_{i=1}^{|w|} f(c_i)^n,
$$

where $f$ assigns a numerical value to each character of the alphabet $A$, and $c_1, \ldots, c_{|w|}$ are the characters of the input string $w$.

In Reynaert (2010), each character's code value (e.g. its ISO Latin-1 code point) is raised to the power $n$, and the results are summed over the string; $n$ is set empirically to 5. Because addition is commutative, all anagrams of a word share the same key.
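
As a worked instance of the formula, assume $f$ returns the character's code value (k = 107, a = 97, s = 115) and $n = 5$. The key for the word *kaas* then works out as:

$$
\begin{aligned}
\operatorname{Key}(\text{kaas}) &= 107^5 + 97^5 + 97^5 + 115^5 \\
&= 14025517307 + 8587340257 + 8587340257 + 20113571875 \\
&= 51313769696.
\end{aligned}
$$

This agrees with the value stored for 'kaas' in the lexicon hash printed at the end of this section.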

The maximum Levenshtein distance, i.e. the number $k$ of character differences within which the system will search for variants:

In [1]:
distance_k = 2
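
The notebook does not show how the Levenshtein distance itself is computed. The following is a minimal sketch of the textbook dynamic-programming version, included only to illustrate what the threshold `distance_k` means. The function name `levenshtein` and the example words are illustrative assumptions, not part of the original code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


distance_k = 2  # same threshold as in the cell above

for candidate in ['kans', 'krans', 'kind']:
    d = levenshtein('kaas', candidate)
    # A candidate is within reach only if d <= distance_k
    print(candidate, d, d <= distance_k)
```

With `distance_k = 2`, 'kans' and 'krans' would count as reachable variants of 'kaas', while 'kind' would not.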

## Sequential focus word-based approach

In [18]:
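from collections import defaultdict
# Maps each anagram key to the set of lexicon words that share it
LexiconHash = defaultdict(set)
# Character values f(c): the code points of the lowercase letters 'a'..'z'
Alphabet = {chr(x): x for x in range(97, 123)}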

In [8]:
focus_word = 'molensteen'

In [11]:
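power_n = 5  # exponent n, set empirically to 5 in Reynaert (2010)

def key(w):
    """Return the anagram key of w: the sum of f(c)**power_n over its characters."""
    # Raises KeyError for characters outside Alphabet (only 'a'..'z' are defined)
    return sum(Alphabet[ci] ** power_n for ci in w)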

In [19]:
# Fill the lexicon hash; anagrams such as 'eend' and 'deen' land under the same key
for w in ['steenmolen', 'fietslamp', 'kaas', 'kans', 'krans', 'kind', 'kits', 'eend', 'deen']:
    LexiconHash[key(w)].add(w)

In [12]:
key('steenmolen')

151787591822

In [20]:
print(LexiconHash)

defaultdict(<class 'set'>, {151787591822: {'steenmolen'}, 131720990015: {'fietslamp'}, 51313769696: {'kaas'}, 58831529439: {'kans'}, 78085675263: {'krans'}, 52893432932: {'kind'}, 67905321383: {'kits'}, 47125301002: {'deen', 'eend'}})
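
The `focus_word` defined earlier is not yet looked up in this section. The sketch below shows one way that lookup could be done. It is self-contained, so it rebuilds the hash and uses `ord()` in place of `Alphabet` (the two agree for 'a' to 'z'). The variable names `anagrams` and `shift` are illustrative assumptions.

```python
from collections import defaultdict

power_n = 5

def key(w):
    # ord() gives the same values as the notebook's Alphabet for 'a'..'z'
    return sum(ord(c) ** power_n for c in w)

LexiconHash = defaultdict(set)
for w in ['steenmolen', 'fietslamp', 'kaas', 'kans', 'krans',
          'kind', 'kits', 'eend', 'deen']:
    LexiconHash[key(w)].add(w)

focus_word = 'molensteen'
# .get() avoids creating an empty entry when the key is absent
anagrams = LexiconHash.get(key(focus_word), set())
print(anagrams)  # {'steenmolen'}

# A single substitution shifts the key by a fixed amount that depends
# only on the two characters involved, not on the rest of the word.
shift = key('kans') - key('kaas')
print(shift == ord('n') ** power_n - ord('a') ** power_n)  # True
```

Because 'molensteen' is an anagram of 'steenmolen', the lookup retrieves the lexicon entry even though the focus word itself was never added. The second check illustrates the additive property of the key: replacing 'a' by 'n' changes it by the same amount in any word.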
