Skip to content

Different Implementations of Norvig's spellchecker

Notifications You must be signed in to change notification settings

gus-massa/norvig-spell

 
 

Repository files navigation

Implementations of Norvig's spelling corrector in various languages.

Languages

hat-trie is the fast trie implementation by dcjones.

Building

The hat-trie library is included as a submodule. Doing

git submodule init
git submodule update
bash -c 'cd hat-trie && autoreconf -i && ./configure && make'

will build it.

The Haskell programs depend on the bytestring, bytestring-trie and unordered-containers packages.

The Java program requires Java 8 and Maven.

Once you have all the dependencies, running make at the top level will build all the programs and place them in bin/.

Each program can be run as

norvig_xx [training data set]

where [training data set] is a plain text file from which the program will learn word frequencies. The programs expect to be given one word per line on standard input and print

word, correction

pairs on standard output.

The folder data/ has a training file train.txt and a test file test.txt. The first is exactly Norvig's big.txt. The second is Norvig's test set but with multiword strings removed.

Performance

On Linux,running make benchmark creates a file benchmarks/all.md containing performance results.

The different implementations use different algorithms and code organization, so the benchmarks are not in any sense a comparison of the different languages.

This is what I get on my setup.

Version Time (s) Memory use
(Max resident,M)
Lines of code
Python 12.6 75.5 31
C++ (unordered_map) 6.0 5.6 165
C++ (hat-trie) 3.5 3.6 166
Haskell (HashMap) 11.5 81.3 60
Haskell (Trie) 22.3 157.0 61
C (dynamic programming) 0.3 162.7 225
Racket (v.6.1.1) 16.2 277.7 73
(with tests)
Java 10.8 363.2 100

About

Different Implementations of Norvig's spellchecker

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 20.4%
  • Haskell 17.8%
  • C++ 17.7%
  • Racket 15.8%
  • Java 14.0%
  • Makefile 7.9%
  • Other 6.4%